Bounding the softwired parsimony score of a phylogenetic network
Abstract
In comparison to phylogenetic trees, phylogenetic networks are more suitable to represent complex evolutionary histories of species whose past includes reticulation such as hybridisation or lateral gene transfer. However, the reconstruction of phylogenetic networks remains challenging and computationally expensive due to their intricate structural properties. For example, the small parsimony problem that is solvable in polynomial time for phylogenetic trees, becomes NP-hard on phylogenetic networks under softwired and parental parsimony, even for a single binary character and structurally constrained networks. To calculate the parsimony score of a phylogenetic network , these two parsimony notions consider different exponential-size sets of phylogenetic trees that can be extracted from and infer the minimum parsimony score over all trees in the set. In this paper, we ask: What is the maximum difference between the parsimony score of any phylogenetic tree that is contained in the set of considered trees and a phylogenetic tree whose parsimony score equates to the parsimony score of ? Given a gap-free sequence alignment of multi-state characters and a rooted binary level- phylogenetic network, we use the novel concept of an informative blob to show that this difference is bounded by times the softwired parsimony score of . In particular, the difference is independent of the alignment length and the number of character states. We show that an analogous bound can be obtained for the softwired parsimony score of semi-directed networks, while under parental parsimony on the other hand, such a bound does not hold.
Keywords: Phylogenetic networks; level-; softwired parsimony; parental parsimony
1 Introduction
The generalisation of phylogenetic trees to phylogenetic networks goes along with the development of new methods to reconstruct phylogenetic networks from genomic data. Since phylogenetic networks are structurally much more complicated than phylogenetic trees, the algorithms to infer networks are typically computationally expensive. Indeed, several optimisation problems that can be solved efficiently for phylogenetic trees, become computationally difficult for phylogenetic networks. For example, the small parsimony problem, which seeks to find the parsimony score of a given phylogenetic tree with character states assigned to its leaves is solvable in polynomial time for phylogenetic trees, e.g., using the well-known Fitch-Hartigan algorithm [7, 10], but becomes NP-hard under different notions of parsimony for phylogenetic networks, even for a single binary character and structurally constrained networks [6, 24].
Despite the popularity of model-based methods to infer phylogenetic trees, maximum parsimony (see, e.g. [4] and references therein) continues to be widely used in certain areas of evolutionary biology, such as the analysis of morphological data (see, e.g. [17, 18, 20]). Moreover, since calculating the parsimony score of a phylogenetic tree is computationally less expensive than calculating its likelihood, parsimony trees are often used as starting trees from which a search through tree space is started [23] and are also used in Bayesian phylogenetic inference [25].
Recently, different notions for parsimony on rooted phylogenetic networks have been proposed, referred to as hardwired, softwired, and parental parsimony. Softwired [16] and parental [24] parsimony both consider collections of trees that can be extracted from a rooted phylogenetic network (so-called displayed trees, respectively parental trees) and define the parsimony score as the minimum parsimony score of any tree in the collection. Softwired parsimony is implemented in the popular software package PhyloNetworks [22] and is the main focus of this paper. Hardwired parsimony [13], on the other hand, calculates the parsimony score of a phylogenetic network by considering character-state transitions along all edges of the network. As the sets of rooted phylogenetic trees that are evaluated when computing the softwired and parental parsimony scores of a phylogenetic network have exponential size, it is of interest to investigate the differences in parsimony scores of elements in these sets. Given a gap-free alignment of multi-state characters and a rooted binary level- network (formally defined below), we analyse how different the parsimony score of any phylogenetic tree displayed by and the softwired parsimony score of can be. We show that independent of the alignment length and number of character states, this difference is bounded by times the parsimony score of . Thus, while computing the softwired parsimony score is in general an NP-hard problem (even for a single binary character and a structurally constrained network [6, Theorem 4.3]), our result implies that an upper bound for the softwired parsimony score of can be obtained in polynomial time by simply evaluating the parsimony score of an arbitrary phylogenetic tree that is displayed by . In particular if the level of is small, this upper bound gives a good indication of the magnitude of the softwired parsimony score of .
Related to our result, it was shown by Fischer et al. [6, Theorem 5.7] that the NP-hard optimisation problem of computing the softwired parsimony of a rooted level- network for a single multi-state character is, on the positive side, also fixed-parameter tractable, when the parameter is . If one considers more than a single binary character, the softwired parsimony problem is NP-hard even for a rooted level- network [15, Theorem 1]. As a consequence of the latter negative result, Kelk et al. [15] posed the following question. Are there good (i.e. constant-factor) approximation algorithms for computing the softwired parsimony score of a rooted phylogenetic network and a sequence alignment without gaps under the following three restrictions: (i) is level-, (ii) each biconnected component of has exactly three outgoing edges, and (iii) consists of binary characters?
As hinted at above, from an algorithmic perspective, our upper bound result implies a -approximation algorithm for computing the softwired parsimony score of a rooted binary level- network . Specifically, take an arbitrary phylogenetic tree that is displayed by , compute its parsimony score, and use this to approximate the softwired parsimony score of . If the level of is fixed, this algorithm provides a polynomial-time constant factor approximation. Hence, we answer the aforementioned question by Kelk et al. affirmatively for a much larger class of rooted phylogenetic networks in the sense that our result holds for level- networks (for a fixed non-negative integer ), it does not require restriction (ii), and it holds for gap-free alignments independent of the number of character states. Our result also complements a recent paper by Frohn and Kelk [8], in which the authors establish a 2-approximation algorithm for the softwired parsimony problem on binary tree-child networks for a single character.
While softwired parsimony for rooted phylogenetic networks is the main focus of our paper, we additionally show that an analogous upper bound for the softwired parsimony score holds for semi-directed networks that are obtained from rooted phylogenetic networks by deleting the root and omitting the direction of all but reticulation edges. Semi-directed networks have recently been central to studying identifiability questions related to phylogenetic networks and to developing phylogenetic network estimation algorithms (e.g. [1, 9, 11, 21]). We also briefly turn to the notion of parental parsimony (on rooted phylogenetic networks) and show by way of counterexample that an analogous bound for the parental parsimony score does not hold.
The remainder of this paper is organised as follows. We define all relevant concepts related to phylogenetic trees and networks, introduce the notion of softwired parsimony, and state the main result in Section 2. In Section 3, we revisit the rSPR distance and establish an upper bound on this distance for two phylogenetic trees that are both displayed by a given phylogenetic network. In Sections 4 and 5, we introduce the notion of an informative blob and a blob reduction, respectively. Informative blobs are a novel concept that is crucial for obtaining our main result, the upper bound on the softwired parsimony score, which we establish in Section 6. Sections 7 and 8 are then devoted to parental parsimony on rooted networks and softwired parsimony on semi-directed networks, respectively. We end the paper with some concluding remarks and directions for future research in Section 9.
2 Preliminaries and statement of main result
This section introduces notation and terminology, and states the main result. Throughout this paper, denotes a non-empty finite set. Let be a directed graph. We use and to denote the vertex set and edge set, respectively, of . Furthermore, for each edge of , is called a parent of and is called a child of . We sometimes also refer to and as neighbours in . In a similar vein, for two (not necessarily distinct) vertices and of , we say that (resp. ) is an ancestor (resp. descendant) of (resp. ) if there is a directed path of length zero or more from to . Now let and be two directed graphs, and let be an edge of . Then subdividing is the operation of replacing with two new edges and . Furthermore, we call a subdivision of if can be obtained from by repeatedly subdividing an edge. We also consider to be a subdivision of itself.
Phylogenetic trees and networks. A rooted binary phylogenetic network on is a rooted acyclic digraph with no loops and no parallel edges that satisfies the following three properties:
-
(i)
the set of leaves is ,
-
(ii)
the out-degree of the (unique) root is exactly one, and
-
(iii)
every other vertex has either in-degree one and out-degree two, or in-degree two and out-degree one.
The set is also sometimes called the label set of . Furthermore, a vertex of is referred to as a reticulation if it has in-degree two and as a tree vertex if it has in-degree one and out-degree two. Similarly, an edge of that is directed into a reticulation is referred to as a reticulation edge. We denote the number of reticulations in by .
Let be a rooted binary phylogenetic network on . If has no reticulation, then it is called a rooted binary phylogenetic -tree. Since all phylogenetic trees and networks are rooted and binary throughout this paper except for Section 8, we refer to a rooted binary phylogenetic network as a phylogenetic network on and to a rooted binary phylogenetic tree as a phylogenetic -tree.
Let be a subdivision of a phylogenetic -tree. We call the directed path from the root of to its closest degree-three vertex its root path. If is a phylogenetic tree, then the root path consists of a single edge, in which case we sometimes refer to the root path as the root edge.
Now let be a phylogenetic network. A biconnected component of is a maximal subgraph of that is connected and cannot be disconnected by deleting exactly one of its vertices. Furthermore, a vertex of a biconnected component of is called a reticulation if it is a reticulation in . With this definition in hand, we say that is level- if the maximum number of reticulations of a biconnected component of is at most . Lastly, we call a biconnected component of a blob if it has at least one reticulation. For a blob of , we refer to the unique vertex with in-degree zero and out-degree two in as the source of . A phylogenetic network on with two blobs and is shown on the left-hand side of Figure 1.
Clusters.
Let be a subdivision of a phylogenetic network on , and let be a subset of . We call a cluster of if there exists a vertex in that has precisely as its set of descendant leaves. Note that there may be more than one vertex in whose cluster is and that this may also be the case if is a subdivision of a phylogenetic -tree. Furthermore, we use or if the subscript is clear from the context to denote the cluster of a given vertex of .
Displaying. Let be a phylogenetic network on with root , and let be a phylogenetic -tree with . We say that is displayed by if there exists a subgraph of that is a subdivision of that includes , in which case this subgraph is called an embedding of in . The set of all phylogenetic -trees that are displayed by is referred to as the display set of and denoted by . Ignoring the assignment of 0 and 1 to vertices for the moment, Figure 1 shows a phylogenetic network , a phylogenetic tree that is displayed by , and an embedding of in . Now, consider a subset of the reticulation edges of . We refer to as a switching if, for each reticulation in , it contains exactly one of the two edges that are directed into . By deleting each reticulation edge of that is not in , we obtain a connected subgraph of with no underlying cycle and, for each leaf , there is a directed path from the root of , which coincides with , to . If we repeatedly suppress each vertex in with in-degree one and out-degree one, and delete each vertex in with out-degree zero that is not in until no such operation is possible, we obtain a phylogenetic -tree . We say that yields . By construction, is displayed by . Conversely, observe that, for each phylogenetic -tree in , there exists at least one switching that yields . In summary, we have the following observation.

Observation 2.1.
Let be a phylogenetic network on , and let be a phylogenetic -tree. Then is displayed by if and only if there exists a switching of that yields .
rSPR operation. Let be a phylogenetic -tree, and let be an edge of that is not incident with the root. Let be a phylogenetic -tree obtained from by deleting and reattaching the resulting rooted subtree that contains via a new edge in the following way: Subdivide an edge of the component that contains the root of with a new vertex , join and with , and suppress . We say that has been obtained from by a rooted subtree prune and regraft (rSPR) operation. The rSPR distance between any two phylogenetic -trees and , denoted by , is the minimum number of rSPR operations that transform into . It is well known that can always be obtained from by a sequence of single rSPR operations and, so, is well defined.
Characters. An -state character on is a surjective function from into a set of character states with . If , then is called a binary character. Throughout this paper, all results are established for -state characters with being fixed and arbitrarily large. For simplicity, we refer to an -state character on as a character on .
Let be an acyclic digraph with leaf set , and let be a character on . An extension of to is a function such that for each element . For an extension of , we set
and refer to as the changing number of . Intuitively, each edge of that contributes to the changing number of requires a character-state transition to explain on . Lastly, we say that an extension of to is minimum if there exists no extension of to whose changing number is strictly smaller than that of .
In what follows, we often consider a sequence of characters on instead of a single character. We call such a sequence an
alignment. Unless stated otherwise, all alignments in this paper are sequences of -state characters for that do not contain the gap symbol “–”. Such an alignment is referred to as gap-free. In applied phylogenetics, multiple sequence alignments frequently contain gaps which, intuitively, are placeholders that can take on any of the other character states. We will see in the last section why the restriction to gap-free alignments is necessary. Lastly, we denote a sequence that consists of a single element by and omit parentheses for simplicity.
Parsimony on phylogenetic trees and their subdivisions. Given an alignment of characters on and an arbitrary rooted tree with leaf set , we refer to
as the parsimony score of on , where the minimum is taken over all extensions of to .
Instead of calculating the parsimony score of a phylogenetic -tree , we are often interested in calculating the parsimony score of a subdivision of in the upcoming sections. The next lemma states that both scores are equal. Its correctness can be established analogously to the proof of Lemma 4.5 in [6]. In the proof of this lemma, Fischer et al. have shown that the softwired parsimony score of a character on a phylogenetic tree is equal to the parsimony score of on a particular rooted tree that is a generalisation of a subdivision of in the sense that it may contain unlabeled leaves in addition to the leaves in .
Lemma 2.2.
Let be a character on , and let be a subdivision of a phylogenetic -tree . Then .
We also have the following observation.
Observation 2.3.
Let be a character on , and let be a subdivision of a phylogenetic -tree. If is an extension of to such that for some edge of the root path of , then there exists an extension of to such that and .
By Observation 2.3, we freely assume throughout the remainder of the paper that every extension of a character to the vertices of a subdivision of a phylogenetic tree has the additional property that there is no character state transition on any edge of its root path.
Parsimony on phylogenetic networks. As outlined in the introduction, several notions of parsimony have been introduced that generalise parsimony from phylogenetic trees to phylogenetic networks. In this paper, we are focusing on the notion of softwired parsimony and briefly touch on parental parsimony in Section 7. Roughly speaking, the softwired parsimony score of an alignment of characters on a phylogenetic network is defined to be the smallest number of character-state transitions that is necessary to explain on any phylogenetic tree that is displayed by . Following Nakhleh et al. [16], we now make this precise. Let be an alignment of characters on , and let be a phylogenetic network on . The softwired parsimony score of on is defined as
(1) |
where, for each character , the first minimum is taken over all phylogenetic trees in the display set of and the second minimum is taken over all extensions of to . As per Equation (2), each character in can follow a different tree in . A slightly more restricted definition of softwired parsimony, which has also appeared in the literature (e.g. see [14, 15]), is the following
(2) |
where all characters in follow the same tree in . Clearly, if , then . On the other hand, for , it follows from the definition that . For the purpose of the upcoming sections, we adopt the softwired parsimony definition as formalised in Equation (2) and will see later, that our main result holds also under the definition given in Equation (2). In this context, it is worth mentioning that the hardness result for computing the softwired parsimony score of a level- network for an alignment of at least two binary characters [15, Theorem 1] as mentioned in the introduction has been established for the definition given in Equation (2).
Statement of main result. The main result of this paper is the following theorem which we establish in Section 6.
Theorem 2.4.
Let be a phylogenetic network on , and let be a phylogenetic -tree in . Furthermore, let be an alignment of characters on . Then
where is the level of .
For example, if is a level-1 network, Theorem 2.4 implies that the parsimony score of an arbitrary tree displayed by is at most twice the parsimony score of a tree displayed by whose parsimony score is equal to . Moreover, we show in Section 6 that the bound as stated in Theorem 2.4 is sharp.
The next corollary positively answers the open problem that is detailed in the introduction and that was first posed in [15].
Corollary 2.5.
For a fixed non-negative integer , let be a level- network on , and let be an alignment of characters on . There exists a polynomial -approximation algorithm to calculate and .
3 Bounding the rSPR distance
In this section, we establish an upper bound on the rSPR distance between two phylogenetic trees for when both trees are displayed by a given network. Let be a phylogenetic network, and let and be two switchings of . We define
to be the switching distance between and . Intuitively, is the number of reticulations in for which and contain different reticulation edges.
The following lemma is a generalisation of [3, Lemma 3.1].
Lemma 3.1.
Let be a phylogenetic network on , and let and be two phylogenetic -trees that are yielded by two switchings and , respectively, of . Then
Proof.
Let (resp. ) be an embedding of (resp. ) in whose edge set contains each edge in (resp. ). Obtain a directed acyclic graph from by deleting each edge that is not contained in and, subsequently, applying any of the following two operations until no further operation is possible.
-
(i)
Suppress a vertex of in-degree one and out-degree one.
-
(ii)
If and are two edges in parallel, delete .
By construction, is a phylogenetic network on and each reticulation edge that is not contained in is deleted in obtaining . Hence, . Furthermore, as and are embeddings of and , respectively, and are displayed by . Now, let be the minimum number of reticulations of any phylogenetic network that displays and . Clearly, and, thus
where the last inequality follows from [19, Equation 10.1]. ∎
4 Informative blobs
To establish the main result of this paper, we introduce the novel concept of informative and non-informative blobs in this section. After giving a formal definition of these blobs, we establish results related to the changing number of character extensions to embeddings of phylogenetic trees that are displayed by phylogenetic networks that consist of a single informative or non-informative blob. Let be a phylogenetic network on , and let be a blob of . Furthermore, let be the subset of that contains precisely each vertex that is not in and that is a child of a vertex of . We refer to , as the children of . As an immediate consequence of the definition of , we have the following lemma that we will freely use throughout the remainder of the paper.
Lemma 4.1.
Let be a blob of a phylogenetic network on , and let be a vertex of . Furthermore, let be an embedding of a phylogenetic -tree that is displayed by . Then, is a vertex of . Moreover, if is a vertex of a blob in , then is the source of .
Proof.
Suppose that is a vertex of a blob in . Since , we have . Towards a contradiction, assume that is not the source of . It follows that is a vertex of in-degree two. Let be the parent of in , and let be the other parent of . Then there is a directed path from the root of to that traverses and a directed path from the root of to that traverses , thereby contradicting that and are two distinct blobs in . We complete the proof by noting that contains each edge of whose deletion disconnects into more than one connected component and, so, is a vertex of . ∎
Now, let be the source of a blob in a phylogenetic network on . Furthermore, let be an embedding of a phylogenetic -tree that is displayed by . Let be a character on , and let be an extension of to . We set if each element in is assigned to the same character state under and, otherwise, we set . By Lemma 4.1, recall that each vertex in is also a vertex of and, thus, is well defined. Moreover, we say that is a non-informative blob relative to and if there exists an extension of to such that and . Otherwise, we say that is an informative blob. We next extend the concept of a single informative blob to all blobs of and set
where the minimum is taken over all extensions of to whose changing number is equal to . Then denotes the number of informative blobs relative to and in . If is an extension of to such that , then we say that realises . See Figure 1 for an example of a phylogenetic network on with two blobs and , an embedding of a phylogenetic -tree displayed by , and a binary character on such that . Here, is non-informative because there exists a minimum extension of that assigns character state to all elements of , where contains the source of and leaves and . Blob , on the other hand, is informative, as the elements in are assigned two different states by and thus by any extension of it. To see that is indeed minimum, notice that . This is minimum since employs two states, and it is well-known and easy to see that, in this case, any extension of requires at least one change.
Lemma 4.2.
Let be a character on , and let be a phylogenetic network on with a single blob whose source is the child of the root. Let and be embeddings of two phylogenetic -trees that are displayed by . Suppose that is non-informative relative to and . Let be an extension of to with that assigns the same character state to each vertex in . Then there exists an extension of to such that
-
(i)
and
-
(ii)
.
Proof.
Since is non-informative, recall that exists. Furthermore, by the definition of an embedding, is the child of the roots of and . Let be the subset of that precisely contains each vertex that is not a vertex of . Since is the only blob of , each vertex in is also a vertex of and . Furthermore, each vertex of or that is not in , is a vertex of . Since is the child of the root of and assigns the same character state, say , to each vertex in , it follows that also assigns to each vertex in that is an ancestor of some vertex in . In particular, .
Now, consider . Set for each vertex and set for each vertex . By definition of , we again have that each vertex in that is an ancestor of some vertex in is assigned to under . Hence, as is an extension of to , is an extension of to with ; thereby satisfying (ii). Moreover, since and are embeddings of two phylogenetic -trees that are displayed by , the edges of satisfy the following property: If is an edge of (resp. ) but not an edge of (resp. ), then is an edge of and, consequently, (resp. ). It follows that which satisfies (i) and, therefore, completes the proof of the lemma. ∎
Lemma 4.3.
Let be a character on . Let and be two phylogenetic -trees. Furthermore, let be an extension of to . Then there exists an extension of to such that
-
(i)
and
-
(ii)
, where and is the root of and , respectively.
Proof.
We show by induction on that there exists an extension of to that satisfies (i)–(ii). Suppose that . Then there exists a single rSPR operation that transforms into . Given such an rSPR operation, let be the edge of that is deleted in the pruning part of the operation. Let and be the parent and other child of in . Further, let be the vertex that subdivides an edge, say , when reattaching the resulting subtree with root such that is an edge in . Noting that each vertex in except for is also a vertex of , we next obtain an extension of to with no character state transition on the root edge of as follows: For each vertex , we set . In particular, we have and, so, (ii) follows. Moreover, if , we set . Otherwise, we set , where is a character state that has been assigned to at least one neighbour of in under and there is no other character state that has been assigned to strictly more neighbours of in under . We next show that (i) holds. Consider the edges of . Except for the edges used to reattach the subtree with root , obtained from suppressing , and and obtained from subdividing with , each edge of is also an edge of . If , then either or . Hence, suppressing does not increase the changing number. On the other hand, when assigning a character state to the changing number may increase. More specifically, we consider three cases. First, if , then by definition of . Note that may be . Thus, there is no character state transition on the two edges and , and at most one such transition on the edge under in . Second, if , then there is a character state transition on the edge under and we have two character state transitions on the three edges , , and under . Third, if and , then there is again a character state transition on the edge under and we have one character state transition on the three edges , , and under . Hence, regardless of which case applies
thereby satisfying (i) for when .
Now suppose that and that (i)–(ii) are satisfied for all pairs of phylogenetic trees whose rSPR distance is strictly smaller than . Let be a phylogenetic -tree such that and . Recalling that is an extension of to , it follows from the induction hypothesis, that there is an extension of to that satisfies (ii) and . Again, by the induction hypothesis, there exists an extension of to that satisfies (ii) and . Hence, by combining the two inequalities we obtain
and . Hence, satisfies (i)–(ii). This completes the proof of the lemma. ∎
Corollary 4.4.
Let be a character on . Let be a phylogenetic network on with a single blob, and let and be two phylogenetic -trees displayed by . Furthermore, let be an extension of to . Then there exists an extension of to such that
-
(i)
, where is the level of , and
-
(ii)
, where and is the root of and , respectively.
Proof.
While Lemma 4.2 is restricted to phylogenetic networks that consist of a single non-informative blob, the next lemma establishes an analogous result for all phylogenetic networks that consist of a single blob.
Lemma 4.5.
Let be a character on , and let be a phylogenetic network on with a single blob whose source is the child of the root. Let and be embeddings of two phylogenetic trees that are displayed by . Furthermore, let be an extension of to . Then there exists an extension of to such that
-
(i)
, where is the level of , and
-
(ii)
.
Proof.
Let (resp. ) be the two phylogenetic -trees such that (resp. ) is an embedding of (resp. ) in . Furthermore, let . Observe that each vertex in is a unique degree-three vertex in . First, let be the extension of to obtained from by setting for each . Then
(3) |
and, because there is no character state transition on any edge of the root path of that contains , the root of is assigned to . Second, by Corollary 4.4, there exists an extension of to such that
(4) |
and the root of is also assigned to . Third, we obtain an extension of to from as follows. Noting that each edge in corresponds to a unique directed path in whose non-terminal vertices all have degree two, we set and . Then
(5) |
and, since the root of is assigned to , it follows that each vertex on the root path of , in particular , is also assigned to . Hence (ii) is satisfied. Moreover, by combining Equations (3)–(5), we have
This concludes the proof of the lemma. ∎
5 Blob reduction
In this section, we introduce the notion of a blob reduction. Intuitively, this allows us to decompose a phylogenetic network into two smaller phylogenetic networks and calculate the parsimony score of an embedding of a phylogenetic -tree displayed by based on these two smaller networks.
Let be a phylogenetic network on , let be a blob of whose source, say , has maximum distance from the root of , and let . For some , the blob reduction of reduces to two smaller phylogenetic networks as follows. Let be the phylogenetic network on that is obtained from by replacing the subnetwork of that is rooted at with a single new leaf . Furthermore, let be the phylogenetic network on that is obtained from the subnetwork of that is rooted at by adding a new vertex and edge . By construction, each of and contains at least one leaf.
Now, let be an embedding of a phylogenetic -tree that is displayed by . Recall that is a vertex of and . Let be a character on , and let be an extension of to . Using the aforementioned blob reduction of as a guide, we next also reduce to two smaller trees such that one of the resulting trees is an embedding of a subtree of in and the other one is an embedding of another subtree of in . More specifically, let be the tree with leaf set that is obtained from by replacing the subtree of that is rooted at with a single new leaf . Furthermore, let be the tree with leaf set that is obtained from the subtree of that is rooted at by adding a new vertex and edge . We call the cluster tree pair of relative to . Let and be a character on and on , respectively, such that for each and for each . We refer to extensions of to and of to as a pair of cluster extensions with respect to if . Except for , observe that uniquely determines the character state of each leaf in and . Moreover, since the root path of contains at least one edge, the definition of a pair of cluster extensions implies that by our assumption following Observation 2.3. The next lemma shows how the changing number of extensions of characters , , and to , , and , respectively, are related to each other.
Lemma 5.1.
Let be a blob of a phylogenetic network on to which the blob reduction can be applied. Let be a character on , and let be an embedding of a phylogenetic -tree that is displayed by . Furthermore, let be the cluster tree pair of relative to . Then, the following two statements hold.
-
(i)
If is an extension of to , then there exists a pair of cluster extensions such that
-
(ii)
If is a pair of cluster extensions with respect to , then there exists an extension of to such that
Proof.
Let be the source of , and let . By the definition of a cluster tree pair, has leaf set and root , and has leaf set and root , where is also the root of . Lastly, as is a vertex of , it follows from the construction of and that corresponds to the child of in and to in , whereas each other vertex of corresponds to a unique vertex in either or . To ease reading, we refer to the child of in as . Reversely, the only vertex of and that does not correspond to a unique vertex in is . We next show that (i) and (ii) hold.
First, let be an extension of to . Obtain a pair of cluster extensions and of characters and to and , respectively, in the following way. For each vertex of , set , where is the vertex of that corresponds to, and set . Similarly, for each vertex of , set , where is the vertex of that corresponds to. Since and both correspond to , it follows that . It is is now easily checked that and is a pair of cluster extensions with respect to and that
Hence, (i) holds.
Now, let and be a pair of cluster extensions with respect to . In particular, for each , for each , and . Now obtain an extension of to from and in the following way. For each vertex of that corresponds to a vertex of , set and, for each vertex of that corresponds to a vertex of , set . Since , it follows that
Thus, (ii) holds as well. ∎
6 Proof of Theorem 2.4
In this section, we establish the proof of Theorem 2.4 and show that the bound that is given in the theorem is sharp. Most work in proving Theorem 2.4 goes into establishing the following lemma.
Lemma 6.1.
Let be a character on . Let be a phylogenetic network on , and let and be embeddings of two phylogenetic -trees that are displayed by . Furthermore, let be an extension of to that realises . Then there exists an extension of to such that
where is the level of .
Proof.
Let be the blobs of . The proof is by induction on . If , then is a phylogenetic tree with and so the result clearly follows since and, therefore,
when setting . Now assume that and that the statement is true for all phylogenetic networks with at most blobs. Let be a blob of whose source has maximum distance from the root of over all its blobs, and let . Without loss of generality, we assume that .
For some , let be the phylogenetic network on , and let be the phylogenetic network on and with root resulting from by applying a blob reduction to . Notice that by construction, consists of the single blob with being the child of , whereas contains precisely blobs. Moreover, let be the cluster tree pair of relative to , and let be the cluster tree pair of relative to . Since is a vertex of and , is also a vertex of and . Now, by Lemma 5.1, Part (i), there exists a pair of cluster extensions with respect to such that
(6) |
and . Let and be the characters on and , respectively, such that and are extensions of and , respectively.
We next consider and start by making two observations. First, since , it follows from Lemma 5.1, Parts (i) and (ii), that . Second, by the construction of the pair of cluster extensions in Lemma 5.1 Part (i), we may assume that, for each vertex of , we have , where is the vertex of that corresponds to. Then, as realises , realises . Noting that has blobs and level at most , we now apply the induction hypothesis to obtain an extension of to that satisfies
(7) |
such that .
To complete the proof, we consider . Here, we distinguish two cases depending on whether its single blob whose level is at most is informative or non-informative.
First, assume that is informative. By Lemma 4.5, there exists an extension of to such that
(8) |
Since , the pair is a pair of cluster extensions with respect to . Thus, by Lemma 5.1, Part (ii), there exists an extension of to such that
Now, using Inequalities (7) and (8), we obtain
where the last equality follows from Equation (6) and the fact that is informative.
Second, assume that is non-informative. Then by Lemma 4.2, there exists an extension of to such that
(9) |
The remainder of the proof is now similar to the first case. In particular, noting that the pair is a pair of cluster extensions with respect to , by Lemma 5.1, Part(ii), there exists an extension of to such that
Using Inequalities in (7) and (9), we obtain
where the last equality follows from Equation (6) and the fact that is non-informative.
In both cases, we obtain an extension of to such that
This concludes the proof of the lemma. ∎
Corollary 6.2.
Let be a character on . Let be a phylogenetic network on , and let and be embeddings of two phylogenetic -trees displayed by such that . Then
where is the level of .
Proof.
Let be an extension of to that realises . By Lemma 6.1, there exists an extension of to such that . ∎
We are finally in a position to establish the main result of this paper, which we restate for convenience.
Theorem 2.4. Let be a phylogenetic network on , and let be a phylogenetic -tree in . Let be an alignment of characters on . Then
where is the level of .
Proof.
We establish that
Since , it immediately follows that also holds.
First, assume that consists of a single character . Let be an embedding of in , and let be an embedding of a phylogenetic -tree in such that . By Corollary 6.2, we have
where the first equality follows from Lemma 2.2. Now, as every blob in that is informative relative to contributes at least one to , we have
By Lemma 2.2 and combining the last two inequalities, we obtain
(10) |

We close this section by presenting, for each , a level- network and a binary character such that the upper bound stated in Theorem 2.4 is sharp. As level-0 networks on are phylogenetic -trees, the bound is sharp for any level-0 network and any binary character on . For , let be the level- network on that is depicted in Figure 2. Further, let be the binary character with if and otherwise. Let us consider the two phylogenetic -trees that are illustrated on the right-hand side of Figure 2 together with extensions and of to and , respectively. By using Fitch’s algorithm it is easy to verify that and are minimum. Hence, we have and, thus, (since employs two character states, ). Moreover, . In summary, we have . As the construction shown in Figure 2 involves a single character, the two notions of softwired parsimony on coincide, and we have for the same example.
7 Parental parsimony for phylogenetic networks
We now briefly consider the notion of parental parsimony introduced by van Iersel et al. [24] as an alternative to softwired and hardwired parsimony. Intuitively, instead of defining the parsimony score of a phylogenetic network by considering its display set , parental parsimony considers the set of parental trees (sometimes also called weakly displayed trees [12]), which is a superset of .
A multilabelled tree on is a leaf-labelled rooted tree whose root has out-degree one, all other interior vertices have in-degree one and out-degree one, or in-degree one and out-degree two, and, for each element in , there exists at least one leaf in that is labelled . Now using the same notation as [24], let be the multilabelled obtained from a phylogenetic network on as follows: The vertices of are the directed paths in starting at the root of , and for each pair of directed paths , there is an edge in if and only if is an extension of by one additional edge of . Furthermore, each vertex in corresponding to a path in starting at the root of and ending at is labelled by . For an example of the multilabelled tree obtained from a phylogenetic network , see Figure 3. Now, a phylogenetic -tree is called a parental tree of if it can be obtained from a subgraph of by suppressing vertices of in-degree and out-degree one. To denote the set of all parental trees of , we use . Informally speaking, a tree is a parental tree of a phylogenetic network if it can be drawn inside the network in such a way that the tree vertices of the tree correspond to tree vertices of the network. Importantly, though, a parental tree is not necessarily a displayed tree (see Figure 3), whereas every displayed tree is also parental.

Given a character on and a phylogenetic network on , the parental parsimony score of on is now defined as
where the minimum is taken over all parental trees for .
It was shown by van Iersel et al. [24, Theorem 2] that computing the parental parsimony score is NP-hard even if is a binary character and is a restricted type of a so-called tree-child network [2]. It is thus a natural question if our main result for softwired parsimony (Theorem 2.4) generalises to parental parsimony. Unfortunately, this is not the case. Suppose that is an even integer and that is the level- network on leaves depicted in Figure 4. Then, both phylogenetic trees and as depicted in the same figure are parental trees of and . Additionally, suppose that is the binary character that assigns state to leaves , and state to leaves . Crucially, we have and . In particular, , and thus,
for each (a similar argument applies to being odd), which shows that Theorem 2.4 does not generalise to parental parsimony even if the phylogenetic network is level- with a single blob and the alignment consists of a single binary character.

8 Softwired parsimony for semi-directed and unrooted networks
In this section, we show that binary semi-directed networks have an analogous bound on the softwired parsimony score as the one we established for rooted binary phylogenetic networks. We briefly turn our attention to unrooted binary phylogenetic networks at the end of the section and show that the approach we take to establish the bound for semi-directed networks does not work in this setting. However, this does not exclude the possibility that similar bounds can be obtained for unrooted phylogenetic networks by other means.
A binary semi-directed network on is a leaf-labelled mixed multigraph without any loops that can be obtained from a rooted binary phylogenetic network by deleting its root, suppressing the child of the root, and omitting the direction of each edge that is not a reticulation edge. We call a rooted partner of . By construction, has at most one pair of parallel edges. This is precisely the case when has an underlying 3-cycle that contains the child of the root. Note that may have multiple rooted partners. A vertex in is called a reticulation if contains two edges, referred to as reticulation edges, that are directed into . A semi-directed level- network is a semi-directed network such that a rooted partner of is a rooted binary level- network. Lastly, an unrooted binary phylogenetic -tree is a connected undirected acyclic graph whose leaf set is and whose inner vertices all have degree three. Note that an unrooted binary phylogenetic tree is a binary semi-directed network without reticulations. In the following, all rooted (resp. semi-directed) phylogenetic networks and rooted (resp. unrooted) phylogenetic trees are assumed to be binary.
Now, let be an unrooted binary phylogenetic -tree, and let be a semi-directed network on . We say that is displayed by if there exists a subgraph of such that is a subdivision of (omitting the directions of the reticulation edges) and contains, for each reticulation in , at most one reticulation edge incident with . Similar to rooted phylogenetic networks, if is displayed by , we call an embedding of in . Furthermore, we refer to the set of all unrooted phylogenetic -trees that are displayed by as the unrooted display set of and denote it by .
Let be an alignment of characters on , and let be a semi-directed network on . We define the softwired parsimony score of on as
(11) |
where the parsimony score of an unrooted phylogenetic -tree is defined as in the rooted case since the corresponding concepts of the changing number and (minimum) extensions naturally translate to undirected graphs.
The next lemma shows how the set of unrooted phylogenetic trees that are displayed by a semi-directed network is related to the set of rooted phylogenetic trees that are displayed by a rooted partner of .
Lemma 8.1.
Let be a semi-directed network on , and let be a rooted partner of . Then, the following two statements hold.
-
(i)
If is an unrooted phylogenetic -tree that is displayed by , then there exists a rooted phylogenetic -tree such that can be obtained from by deleting the root and suppressing its child.
-
(ii)
If is a rooted phylogenetic -tree that is displayed by , then there exists an unrooted phylogenetic -tree such that can be obtained from by deleting the root and suppressing its child.
Proof.
Throughout the proof, we assume that does not contain a pair of parallel edges. Indeed, if contains such a pair of edges, and say, we can obtain a semi-directed network on from , by deleting , suppressing the two resulting degree-two vertices, and omitting the direction of . Clearly . Now, let be the child of the root in , and let and be the children of . If either or is a reticulation, we assume without loss of generality that is a reticulation. Observe that at most one of and is a reticulation. Each edge in that is not incident with corresponds to exactly one edge in , and each edge in except (resp. if is a reticulation in ) corresponds to exactly one edge in . In particular, each reticulation edge in corresponds to exactly one such edge in and vice versa. First, let be an unrooted phylogenetic -tree that is displayed by . By definition, there exists an embedding of in . If is an edge in , we obtain an embedding of a rooted phylogenetic -tree in from as follows. We replace with the three directed edges , and and each edge with its directed counterpart in . By definition, is displayed by , that is . Further, by construction, can be obtained from by deleting the root and suppressing its child. Now, let us consider the case that is not an edge in . Let be the connected acyclic subgraph of that is obtained from by replacing each edge in with its directed counterpart in . We next show that there is a unique vertex in with in-degree zero and out-degree two. Since is acyclic, it follows that contains a vertex with in-degree zero. Clearly, the out-degree of this vertex cannot be one or three and, thus, exists. Assume towards a contradiction that is another vertex with this property. As is connected and does not contain a vertex with in-degree two, one of and is a descendant of the other in . But then one of these vertices has in-degree one, a contradiction. It now follows that each vertex in is a descendant of . Let be a directed path in . Such a path exists as there is a directed path from the root to any vertex in . Since each vertex in is a descendant of , it follows that with is not in . Then, we obtain an embedding of a rooted phylogenetic -tree in from by adding the directed edges , , to . Again, we have and, by construction, can be obtained from by deleting the root and suppressing its child. Hence, (i) holds.
Second, let be a rooted phylogenetic -tree that is displayed by . Let be an embedding of in , and let be the root path of . Then, we obtain an embedding of an unrooted phylogenetic -tree in from by deleting each vertex that lies on except for , turning directed edges into undirected ones and, if , suppressing . If , then consists of the single edge , and contains and . Hence, contains the edge as a result of suppressing . Otherwise, if , then contains and exactly one of and , and is a vertex in . Furthermore, if , each edge with that is traversed by corresponds to a unique edge consisting of the same vertices in . It now follows that is indeed an embedding of in , that is, . By construction, can be obtained from by deleting the root and suppressing its child. Hence, (ii) holds as well. ∎
We next obtain the following corollary from Lemma 8.1.
Corollary 8.2.
Let be a semi-directed network on , let be a rooted partner of , and let be an alignment of characters on . Then .
Proof.
The corollary follows from Lemma 8.1 and the fact that, if is a rooted phylogenetic -tree and is an unrooted phylogenetic -tree such that can be obtained from by deleting the root and suppressing its child, then . ∎
We are now in a position to state the main result of the section.
Theorem 8.3.
Let be a semi-directed network on , and let be an unrooted phylogenetic -tree that is displayed by . Furthermore, let be an alignment of characters on . Then
where is the level of .
Proof.
Let be a rooted partner of , and let be a rooted phylogenetic -tree such that deleting the root in and suppressing its child yields . The rooted phylogenetic -tree exists by Lemma 8.1 (i). Since an unrooted phylogenetic tree is a semi-directed network without reticulations, we have (i) by Corollary 8.2. Furthermore, by Theorem 2.4, we have (ii) . Finally, by Corollary 8.2, we have (iii) . Combining (i)–(iii) yields . ∎
We now shift our focus from semi-directed networks to unrooted phylogenetic networks. An unrooted binary phylogenetic network on is a connected undirected graph without any loops or edges in parallel, whose leaf set is , and whose inner vertices all have degree three. As before, we omit the term binary in the following as all unrooted phylogenetic networks considered in this section are binary.
Let be an unrooted phylogenetic network on . We say that an unrooted phylogenetic -tree is displayed by if there exists a subgraph of that is a subdivision of . Furthermore, we refer to the set of all unrooted phylogenetic -trees that are displayed by as the display set of and denote it by . We call an unrooted level- network if at most edges have to be deleted in each biconnected component of such that the resulting graph is acyclic. Lastly, if can be obtained from a rooted phylogenetic network by deleting its root, suppressing the child of the root, and omitting all edge directions, we say that is an orientation of .
Let be an alignment of characters on , and let be an unrooted phylogenetic network on . We define the softwired parsimony score of on as
(12) |
Next, we present an unrooted level-1 network and a binary character such that for an orientation of , that is, we show that Corollary 8.2 does not translate from semi-directed networks to unrooted phylogenetic networks. Moreover, we give two different orientations and of with .

To this end, let be the unrooted level-1 network on , and let and be the two orientations of as shown in Figure 5. The display set of size five is shown in the middle of the same figure. By deleting the root and suppressing its child in each element of (resp. ), we obtain a subset of as indicated by the two dashed rectangles that each enclose (resp. ) and two elements of . Now for the single binary character with
we have , and . Since the softwired parsimony score of an unrooted phylogenetic network is not necessarily the same as the score of an orientation of , we cannot represent an unrooted phylogenetic network by an arbitrary orientation in the way we used a rooted partner of a semi-directed network to obtain Theorem 8.3. While our example shows that using the same approach as in the semi-directed setting is not viable, it does not exclude the existence of similar bounds for unrooted phylogenetic networks or classes thereof (such as unrooted level-1 networks, for example).
9 Concluding remarks
In this paper, we have obtained a bound on the softwired parsimony score of a gap-free alignment of multistate characters on rooted as well as semi-directed phylogenetic level- networks. To be precise, we have shown that the maximum difference between the softwired parsimony score of a phylogenetic network and the parsimony score of any tree displayed by is bounded by times the parsimony score of . Unfortunately, our approximation result as stated in Theorem 2.4 cannot be generalised to alignments with gaps since it was already shown in [15, Corollary 2] that computing the softwired parsimony score of a level- network for an alignment of binary characters that additionally allows gaps is APX-hard.
Extending the notion of softwired parsimony to semi-directed networks and exploiting a connection between the display sets of semi-directed networks and their rooted partners, we have shown that an analogous bound holds for semi-directed networks. For unrooted networks, on the other hand, the approach via rooted partners (more formally, via orientations) does not seem to be viable. Nevertheless, it would be an interesting question for future research to investigate if an analogous or similar bound for the softwired parsimony score can be obtained in some other way for unrooted phylogenetic networks.
Another interesting direction for future research would be to analyse whether our results also apply in the case of non-binary phylogenetic networks, i.e., phylogenetic networks that may have vertices of degree strictly greater than three. While there exists a polynomial time algorithm to compute the parsimony score of a given non-binary phylogenetic tree with character states assigned to its leaves, namely the Fitch-Hartigan algorithm [7, 10], other concepts that our results rely on, such as the rSPR distance and its relation to the switching distance, have been studied less for non-binary phylogenetic trees (see [5, Section 2] for a related discussion).
Acknowledgements. The first and second authors thank the New Zealand Marsden Fund for financial support. All authors thank Steven Kelk for helpful discussions.
References
- Allman et al. [2019] E. S. Allman, H. Baños, and J. A. Rhodes. NANUQ: A method for inferring species networks from gene trees under the coalescent model. Algorithms for Molecular Biology, 14:1–25, 2019.
- Cardona et al. [2009] G. Cardona, F. Rossello, and G. Valiente. Comparison of tree-child phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):552–569, 2009.
- Döcker et al. [2024] J. Döcker, S. Linz, and C. Semple. Hypercubes and Hamilton cycles of display sets of rooted phylogenetic networks. Advances in Applied Mathematics, 152:102595, 2024.
- Felsenstein [2004] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, US, 2004.
- Fischer and Kelk [2016] M. Fischer and S. Kelk. On the maximum parsimony distance between phylogenetic trees. Annals of Combinatorics, 20:87–113, 2016.
- Fischer et al. [2015] M. Fischer, L. van Iersel, S. Kelk, and C. Scornavacca. On computing the maximum parsimony score of a phylogenetic network. SIAM Journal on Discrete Mathematics, 29(1):559–585, 2015.
- Fitch [1971] W. M. Fitch. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4):406–416, 1971.
- Frohn and Kelk [2023] M. Frohn and S. Kelk. A -approximation algorithm for the softwired parsimony problem on binary, tree-child phylogenetic networks. submitted, 2023.
- Gross and Long [2018] E. Gross and C. Long. Distinguishing phylogenetic networks. SIAM Journal on Applied Algebra and Geometry, 2(1):72–93, 2018.
- Hartigan [1973] J. A. Hartigan. Minimum mutation fits to a given tree. Biometrics, 29(1):53, 1973.
- Hollering and Sullivant [2021] B. Hollering and S. Sullivant. Identifiability in phylogenetics using algebraic matroids. Journal of Symbolic Computation, 104:142–158, 2021.
- Huber et al. [2016] K. T. Huber, V. Moulton, M. Steel, and T. Wu. Folding and unfolding phylogenetic trees and networks. Journal of Mathematical Biology, 73(6–7):1761–1780, 2016.
- Kannan and Wheeler [2012] L. Kannan and W. C. Wheeler. Maximum parsimony on phylogenetic networks. Algorithms for Molecular Biology, 7:1–10, 2012.
- Kelk and Fischer [2017] S. Kelk and M. Fischer. On the complexity of computing MP distance between binary phylogenetic trees. Annals of Combinatorics, 21:573–604, 2017.
- Kelk et al. [2019] S. Kelk, F. Pardi, C. Scornavacca, and L. van Iersel. Finding a most parsimonious or likely tree in a network with respect to an alignment. Journal of Mathematical Biology, 78:527–547, 2019.
- Nakhleh et al. [2005] L. Nakhleh, G. Jin, F. Zhao, and J. Mellor-Crummey. Reconstructing phylogenetic networks using maximum parsimony. In 2005 IEEE Computational Systems Bioinformatics Conference (CSB’05), pages 93–102. IEEE, 2005.
- Sansom et al. [2018] R. S. Sansom, P. G. Choate, J. N. Keating, and E. Randle. Parsimony, not Bayesian analysis, recovers more stratigraphically congruent phylogenetic trees. Biology Letters, 14(6):20180263, 2018.
- Schrago et al. [2018] C. G. Schrago, B. O. Aguiar, and B. Mello. Comparative evaluation of maximum parsimony and Bayesian phylogenetic reconstruction using empirical morphological data. Journal of Evolutionary Biology, 31(10):1477–1484, 2018.
- Semple [2007] C. Semple. Hybridization networks. In O. Gascuel and M. Steel, editors, Reconstructing Evolution: New Mathematical and Computational Advances, pages 277–314. Oxford University Press, UK, 2007.
- Smith [2019] M. R. Smith. Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets. Biology Letters, 15(2):20180632, 2019.
- Solís-Lemus and Ané [2016] C. Solís-Lemus and C. Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12(3):e1005896, 2016.
- Solís-Lemus et al. [2017] C. Solís-Lemus, P. Bastide, and C. Ané. PhyloNetworks: a package for phylogenetic networks. Molecular Biology and Evolution, 34(12):3292–3298, 2017.
- Stamatakis et al. [2005] A. Stamatakis, T. Ludwig, and H. Meier. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics, 21(4):456–463, 2005.
- van Iersel et al. [2018] L. van Iersel, M. Jones, and C. Scornavacca. Improved maximum parsimony models for phylogenetic networks. Systematic Biology, 67(3):518–542, 2018.
- Zhang et al. [2020] C. Zhang, J. P. Huelsenbeck, and F. Ronquist. Using parsimony-guided tree proposals to accelerate convergence in Bayesian phylogenetic inference. Systematic Biology, 69(5):1016–1032, 2020.