Bounding the softwired parsimony score of a phylogenetic network

Janosch Döcker, School of Computer Science, University of Auckland, New Zealand
Simone Linz, School of Computer Science, University of Auckland, New Zealand
Kristina Wicke, Department of Mathematical Sciences, New Jersey Institute of Technology, USA

Abstract

In comparison to phylogenetic trees, phylogenetic networks are more suitable to represent complex evolutionary histories of species whose past includes reticulation such as hybridisation or lateral gene transfer. However, the reconstruction of phylogenetic networks remains challenging and computationally expensive due to their intricate structural properties. For example, the small parsimony problem that is solvable in polynomial time for phylogenetic trees, becomes NP-hard on phylogenetic networks under softwired and parental parsimony, even for a single binary character and structurally constrained networks. To calculate the parsimony score of a phylogenetic network $N$ , these two parsimony notions consider different exponential-size sets of phylogenetic trees that can be extracted from $N$ and infer the minimum parsimony score over all trees in the set. In this paper, we ask: What is the maximum difference between the parsimony score of any phylogenetic tree that is contained in the set of considered trees and a phylogenetic tree whose parsimony score equates to the parsimony score of $N$ ? Given a gap-free sequence alignment of multi-state characters and a rooted binary level- $k$ phylogenetic network, we use the novel concept of an informative blob to show that this difference is bounded by $k+1$ times the softwired parsimony score of $N$ . In particular, the difference is independent of the alignment length and the number of character states. We show that an analogous bound can be obtained for the softwired parsimony score of semi-directed networks, while under parental parsimony on the other hand, such a bound does not hold.

Keywords: Phylogenetic networks; level- $k$ ; softwired parsimony; parental parsimony

1 Introduction

The generalisation of phylogenetic trees to phylogenetic networks goes along with the development of new methods to reconstruct phylogenetic networks from genomic data. Since phylogenetic networks are structurally much more complicated than phylogenetic trees, the algorithms to infer networks are typically computationally expensive. Indeed, several optimisation problems that can be solved efficiently for phylogenetic trees become computationally difficult for phylogenetic networks. For example, the small parsimony problem, which seeks to find the parsimony score of a given phylogenetic tree with character states assigned to its leaves, is solvable in polynomial time for phylogenetic trees, e.g., using the well-known Fitch-Hartigan algorithm [7, 10], but becomes NP-hard under different notions of parsimony for phylogenetic networks, even for a single binary character and structurally constrained networks [6, 24].

Despite the popularity of model-based methods to infer phylogenetic trees, maximum parsimony (see, e.g., [4] and references therein) continues to be widely used in certain areas of evolutionary biology, such as the analysis of morphological data (see, e.g., [17, 18, 20]). Moreover, since calculating the parsimony score of a phylogenetic tree is computationally less expensive than calculating its likelihood, parsimony trees are often used as starting trees for searches through tree space [23] and are also used in Bayesian phylogenetic inference [25].

Recently, different notions for parsimony on rooted phylogenetic networks have been proposed, referred to as hardwired, softwired, and parental parsimony. Softwired [16] and parental [24] parsimony both consider collections of trees that can be extracted from a rooted phylogenetic network (so-called displayed trees, respectively parental trees) and define the parsimony score as the minimum parsimony score of any tree in the collection. Softwired parsimony is implemented in the popular software package PhyloNetworks [22] and is the main focus of this paper. Hardwired parsimony [13], on the other hand, calculates the parsimony score of a phylogenetic network by considering character-state transitions along all edges of the network. As the sets of rooted phylogenetic trees that are evaluated when computing the softwired and parental parsimony scores of a phylogenetic network have exponential size, it is of interest to investigate the differences in parsimony scores of elements in these sets. Given a gap-free alignment of multi-state characters and a rooted binary level- $k$ network $N$ (formally defined below), we analyse how different the parsimony score of any phylogenetic tree displayed by $N$ and the softwired parsimony score of $N$ can be. We show that independent of the alignment length and number of character states, this difference is bounded by $k+1$ times the parsimony score of $N$ . Thus, while computing the softwired parsimony score is in general an NP-hard problem (even for a single binary character and a structurally constrained network [6, Theorem 4.3]), our result implies that an upper bound for the softwired parsimony score of $N$ can be obtained in polynomial time by simply evaluating the parsimony score of an arbitrary phylogenetic tree that is displayed by $N$ . In particular if the level of $N$ is small, this upper bound gives a good indication of the magnitude of the softwired parsimony score of $N$ .

Related to our result, it was shown by Fischer et al. [6, Theorem 5.7] that the NP-hard optimisation problem of computing the softwired parsimony of a rooted level- $k$ network for a single multi-state character is, on the positive side, also fixed-parameter tractable, when the parameter is $k$ . If one considers more than a single binary character, the softwired parsimony problem is NP-hard even for a rooted level- $1$ network [15, Theorem 1]. As a consequence of the latter negative result, Kelk et al. [15] posed the following question. Are there good (i.e. constant-factor) approximation algorithms for computing the softwired parsimony score of a rooted phylogenetic network $N$ and a sequence alignment $A$ without gaps under the following three restrictions: (i) $N$ is level- $1$ , (ii) each biconnected component of $N$ has exactly three outgoing edges, and (iii) $A$ consists of binary characters?

As hinted at above, from an algorithmic perspective, our upper bound result implies a $(k+1)$ -approximation algorithm for computing the softwired parsimony score of a rooted binary level- $k$ network $N$ . Specifically, take an arbitrary phylogenetic tree that is displayed by $N$ , compute its parsimony score, and use this to approximate the softwired parsimony score of $N$ . If the level of $N$ is fixed, this algorithm provides a polynomial-time constant factor approximation. Hence, we answer the aforementioned question by Kelk et al. affirmatively for a much larger class of rooted phylogenetic networks in the sense that our result holds for level- $k$ networks (for a fixed non-negative integer $k$ ), it does not require restriction (ii), and it holds for gap-free alignments independent of the number of character states. Our result also complements a recent paper by Frohn and Kelk [8], in which the authors establish a 2-approximation algorithm for the softwired parsimony problem on binary tree-child networks for a single character.

While softwired parsimony for rooted phylogenetic networks is the main focus of our paper, we additionally show that an analogous upper bound for the softwired parsimony score holds for semi-directed networks that are obtained from rooted phylogenetic networks by deleting the root and omitting the direction of all but reticulation edges. Semi-directed networks have recently been central to studying identifiability questions related to phylogenetic networks and to developing phylogenetic network estimation algorithms (e.g. [1, 9, 11, 21]). We also briefly turn to the notion of parental parsimony (on rooted phylogenetic networks) and show by way of counterexample that an analogous bound for the parental parsimony score does not hold.

The remainder of this paper is organised as follows. We define all relevant concepts related to phylogenetic trees and networks, introduce the notion of softwired parsimony, and state the main result in Section 2. In Section 3, we revisit the rSPR distance and establish an upper bound on this distance for two phylogenetic trees that are both displayed by a given phylogenetic network. In Sections 4 and 5, we introduce the notion of an informative blob and a blob reduction, respectively. Informative blobs are a novel concept that is crucial for obtaining our main result, the upper bound on the softwired parsimony score, which we establish in Section 6. Sections 7 and 8 are then devoted to parental parsimony on rooted networks and softwired parsimony on semi-directed networks, respectively. We end the paper with some concluding remarks and directions for future research in Section 9.

2 Preliminaries and statement of main result

This section introduces notation and terminology, and states the main result. Throughout this paper, $X$ denotes a non-empty finite set. Let $G$ be a directed graph. We use $V(G)$ and $E(G)$ to denote the vertex set and edge set, respectively, of $G$. Furthermore, for each edge $(u,v)$ of $G$, $u$ is called a parent of $v$ and $v$ is called a child of $u$. We sometimes also refer to $u$ and $v$ as neighbours in $G$. In a similar vein, for two (not necessarily distinct) vertices $s$ and $t$ of $G$, we say that $s$ is an ancestor of $t$ and that $t$ is a descendant of $s$ if there is a directed path of length zero or more from $s$ to $t$. Now let $G$ and $G^{\prime}$ be two directed graphs, and let $e=(u,w)$ be an edge of $G$. Then subdividing $e$ is the operation of replacing $e$ with two new edges $(u,v)$ and $(v,w)$, where $v$ is a new vertex. Furthermore, we call $G^{\prime}$ a subdivision of $G$ if $G^{\prime}$ can be obtained from $G$ by repeatedly subdividing an edge. We also consider $G$ to be a subdivision of itself.

Phylogenetic trees and networks. A rooted binary phylogenetic network $N$ on $X$ is a rooted acyclic digraph with no loops and no parallel edges that satisfies the following three properties:

(i) the set of leaves is $X$,
(ii) the out-degree of the (unique) root $\rho$ is exactly one, and
(iii) every other vertex has either in-degree one and out-degree two, or in-degree two and out-degree one.

The set $X$ is also sometimes called the label set of $N$ . Furthermore, a vertex of $N$ is referred to as a reticulation if it has in-degree two and as a tree vertex if it has in-degree one and out-degree two. Similarly, an edge of $N$ that is directed into a reticulation is referred to as a reticulation edge. We denote the number of reticulations in $N$ by $h(N)$ .

Let $N$ be a rooted binary phylogenetic network on $X$ . If $N$ has no reticulation, then it is called a rooted binary phylogenetic $X$ -tree. Since all phylogenetic trees and networks are rooted and binary throughout this paper except for Section 8, we refer to a rooted binary phylogenetic network as a phylogenetic network on $X$ and to a rooted binary phylogenetic tree as a phylogenetic $X$ -tree.

Let $S$ be a subdivision of a phylogenetic $X$ -tree. We call the directed path from the root of $S$ to its closest degree-three vertex its root path. If $S$ is a phylogenetic tree, then the root path consists of a single edge, in which case we sometimes refer to the root path as the root edge.

Now let $N$ be a phylogenetic network. A biconnected component of $N$ is a maximal subgraph of $N$ that is connected and cannot be disconnected by deleting exactly one of its vertices. Furthermore, a vertex of a biconnected component of $N$ is called a reticulation if it is a reticulation in $N$. With this definition in hand, we say that $N$ is level-$k$ if each biconnected component of $N$ contains at most $k$ reticulations. Lastly, we call a biconnected component of $N$ a blob if it has at least one reticulation. For a blob $B$ of $N$, we refer to the unique vertex with in-degree zero and out-degree two in $B$ as the source of $B$. A phylogenetic network $N$ on $\{x_{1},x_{2},\ldots,x_{8}\}$ with two blobs $B$ and $B^{\prime}$ is shown on the left-hand side of Figure 1.

Clusters. Let $M$ be a subdivision of a phylogenetic network on $X$ , and let $Y$ be a subset of $X$ . We call $Y$ a cluster of $M$ if there exists a vertex $v$ in $M$ that has precisely $Y$ as its set of descendant leaves. Note that there may be more than one vertex in $M$ whose cluster is $Y$ and that this may also be the case if $M$ is a subdivision of a phylogenetic $X$ -tree. Furthermore, we use $\mathrm{cl}_{M}(v)$ or $\mathrm{cl}(v)$ if the subscript is clear from the context to denote the cluster of a given vertex $v$ of $M$ .

Displaying. Let $N$ be a phylogenetic network on $X$ with root $\rho$ , and let $T$ be a phylogenetic $X^{\prime}$ -tree with $X^{\prime}\subseteq X$ . We say that $T$ is displayed by $N$ if there exists a subgraph of $N$ that is a subdivision of $T$ that includes $\rho$ , in which case this subgraph is called an embedding of $T$ in $N$ . The set of all phylogenetic $X$ -trees that are displayed by $N$ is referred to as the display set of $N$ and denoted by $D(N)$ . Ignoring the assignment of 0 and 1 to vertices for the moment, Figure 1 shows a phylogenetic network $N$ , a phylogenetic tree $T$ that is displayed by $N$ , and an embedding $S$ of $T$ in $N$ . Now, consider a subset $R$ of the reticulation edges of $N$ . We refer to $R$ as a switching if, for each reticulation $v$ in $N$ , it contains exactly one of the two edges that are directed into $v$ . By deleting each reticulation edge of $N$ that is not in $R$ , we obtain a connected subgraph $G$ of $N$ with no underlying cycle and, for each leaf $\ell\in X$ , there is a directed path from the root of $G$ , which coincides with $\rho$ , to $\ell$ . If we repeatedly suppress each vertex in $G$ with in-degree one and out-degree one, and delete each vertex in $G$ with out-degree zero that is not in $X$ until no such operation is possible, we obtain a phylogenetic $X$ -tree $T_{R}$ . We say that $R$ yields $T_{R}$ . By construction, $T_{R}$ is displayed by $N$ . Conversely, observe that, for each phylogenetic $X$ -tree $T$ in $D(N)$ , there exists at least one switching that yields $T$ . In summary, we have the following observation.

Figure 1: A phylogenetic network $N$ (left), an embedding $S$ of a phylogenetic $X$-tree displayed by $N$ together with a binary character $f$ and an extension $F$ (middle; also indicated by the dashed lines in the left panel), and the phylogenetic $X$-tree $T$ obtained from $S$ by suppressing vertices of in-degree one and out-degree one (right).

Observation 2.1.

Let $N$ be a phylogenetic network on $X$ , and let $T$ be a phylogenetic $X$ -tree. Then $T$ is displayed by $N$ if and only if there exists a switching of $N$ that yields $T$ .
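The construction of $T_{R}$ from a switching $R$ can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the network is given as a list of directed edges, the function name `displayed_tree` and the toy level-1 network on $\{x_{1},x_{2},x_{3}\}$ are hypothetical and not taken from the paper, and trees are returned as edge sets that keep the vertex names of $N$ (so two returned edge sets may describe the same phylogenetic tree under different internal names).

```python
def displayed_tree(edges, leaves, switching):
    """Edge set of T_R, the tree yielded by the switching R (a set of reticulation edges)."""
    indeg = {}
    for _, v in edges:
        indeg[v] = indeg.get(v, 0) + 1
    # Delete each reticulation edge that is not in R.
    kept = {(u, v) for u, v in edges if indeg[v] < 2 or (u, v) in switching}
    changed = True
    while changed:
        changed = False
        for v in {x for e in kept for x in e}:
            ins = [e for e in kept if e[1] == v]
            outs = [e for e in kept if e[0] == v]
            if not outs and v not in leaves:
                kept -= set(ins)  # delete an out-degree-zero vertex that is not in X
            elif len(ins) == 1 and len(outs) == 1:
                kept -= {ins[0], outs[0]}  # suppress an in-degree-1, out-degree-1 vertex
                kept.add((ins[0][0], outs[0][1]))
            else:
                continue
            changed = True
            break  # recompute the vertex set after every modification
    return kept


# Hypothetical toy network: reticulation r with parents a and b.
network = [("rho", "s"), ("s", "a"), ("s", "b"), ("a", "x1"),
           ("a", "r"), ("b", "r"), ("b", "x3"), ("r", "x2")]
leaves = {"x1", "x2", "x3"}
t1 = displayed_tree(network, leaves, {("a", "r")})  # x2 grouped with x1
t2 = displayed_tree(network, leaves, {("b", "r")})  # x2 grouped with x3
print(sorted(t1))
print(sorted(t2))
```

In this toy network the two switchings yield two different trees, so the display set has size two; in general, a network with $h(N)$ reticulations has $2^{h(N)}$ switchings.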

rSPR operation. Let $T$ be a phylogenetic $X$ -tree, and let $e=(u,v)$ be an edge of $T$ that is not incident with the root. Let $T^{\prime}$ be a phylogenetic $X$ -tree obtained from $T$ by deleting $e$ and reattaching the resulting rooted subtree that contains $v$ via a new edge $f$ in the following way: Subdivide an edge of the component that contains the root of $T$ with a new vertex $u^{\prime}$ , join $u^{\prime}$ and $v$ with $f$ , and suppress $u$ . We say that $T^{\prime}$ has been obtained from $T$ by a rooted subtree prune and regraft (rSPR) operation. The rSPR distance between any two phylogenetic $X$ -trees $T$ and $T^{\prime}$ , denoted by $d_{\mathrm{rSPR}}(T,T^{\prime})$ , is the minimum number of rSPR operations that transform $T$ into $T^{\prime}$ . It is well known that $T^{\prime}$ can always be obtained from $T$ by a sequence of single rSPR operations and, so, $d_{\mathrm{rSPR}}(T,T^{\prime})$ is well defined.

Characters. An $r$ -state character on $X$ is a surjective function $f\colon X\rightarrow C$ from $X$ into a set $C$ of character states with $r=|C|\geq 1$ . If $r=2$ , then $f$ is called a binary character. Throughout this paper, all results are established for $r$ -state characters with $r$ being fixed and arbitrarily large. For simplicity, we refer to an $r$ -state character on $X$ as a character on $X$ .

Let $G$ be an acyclic digraph with leaf set $X$ , and let $f\colon X\rightarrow C$ be a character on $X$ . An extension of $f$ to $V(G)$ is a function $F\colon V(G)\rightarrow C$ such that $F(\ell)=f(\ell)$ for each element $\ell\in X$ . For an extension $F$ of $f$ , we set

$${\rm ch}(F,G)=|\{(u,v)\in E(G):F(u)\neq F(v)\}|,$$

and refer to ${\rm ch}(F,G)$ as the changing number of $F$ . Intuitively, each edge of $G$ that contributes to the changing number of $F$ requires a character-state transition to explain $f$ on $G$ . Lastly, we say that an extension $F$ of $f$ to $V(G)$ is minimum if there exists no extension of $f$ to $V(G)$ whose changing number is strictly smaller than that of $F$ .
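As a minimal illustration of the changing number, the following Python sketch counts the edges whose endpoints receive different states. The function name `changing_number` and the toy digraph are hypothetical and serve only to make the definition concrete.

```python
def changing_number(edges, F):
    """Number of edges (u, v) with F(u) != F(v), i.e. ch(F, G)."""
    return sum(1 for u, v in edges if F[u] != F[v])


# Hypothetical toy digraph: root r, internal vertex a, leaves x1 and x2.
edges = [("r", "a"), ("a", "x1"), ("a", "x2")]
F = {"r": 0, "a": 0, "x1": 0, "x2": 1}        # one change, on edge (a, x2)
F_worse = {"r": 1, "a": 0, "x1": 0, "x2": 1}  # an extra change on edge (r, a)
print(changing_number(edges, F), changing_number(edges, F_worse))
```

Both `F` and `F_worse` extend the same character on the leaves, but only `F` is minimum.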

In what follows, we often consider a sequence $(f_{1},f_{2},\ldots,f_{n})$ of characters on $X$ instead of a single character. We call such a sequence an alignment. Unless stated otherwise, all alignments in this paper are sequences of $r$ -state characters for $r\geq 2$ that do not contain the gap symbol “–”. Such an alignment is referred to as gap-free. In applied phylogenetics, multiple sequence alignments frequently contain gaps which, intuitively, are placeholders that can take on any of the other $r$ character states. We will see in the last section why the restriction to gap-free alignments is necessary. Lastly, we denote a sequence $(f_{1})$ that consists of a single element by $f_{1}$ and omit parentheses for simplicity.

Parsimony on phylogenetic trees and their subdivisions. Given an alignment $A=(f_{1},f_{2},\ldots,f_{n})$ of characters on $X$ and an arbitrary rooted tree $T$ with leaf set $X$ , we refer to

$$PS(A,T)=\sum_{i=1}^{n}\min_{F_{i}}({\rm ch}(F_{i},T))$$

as the parsimony score of $A$ on $T$ , where the minimum is taken over all extensions of $f_{i}$ to $V(T)$ .

Instead of calculating the parsimony score of a phylogenetic $X$ -tree $T$ , we are often interested in calculating the parsimony score of a subdivision of $T$ in the upcoming sections. The next lemma states that both scores are equal. Its correctness can be established analogously to the proof of Lemma 4.5 in [6]. In the proof of this lemma, Fischer et al. have shown that the softwired parsimony score of a character $f$ on a phylogenetic tree $T$ is equal to the parsimony score of $f$ on a particular rooted tree that is a generalisation of a subdivision of $T$ in the sense that it may contain unlabeled leaves in addition to the leaves in $X$ .

Lemma 2.2.

Let $f$ be a character on $X$ , and let $S$ be a subdivision of a phylogenetic $X$ -tree $T$ . Then $PS(f,S)=PS(f,T)$ .
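The parsimony score of a tree and the equality stated in Lemma 2.2 can be checked on small examples. The sketch below is an assumption-laden illustration rather than the algorithm cited in the paper: instead of Fitch's algorithm it uses a Sankoff-style dynamic program with unit costs, chosen because it applies unchanged to vertices with a single child and hence to subdivisions. The names `parsimony_score` and `alignment_score` and the toy tree are hypothetical.

```python
from collections import defaultdict


def parsimony_score(edges, root, f, states):
    """PS(f, T): minimum changing number over all extensions of f (unit-cost DP)."""
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)

    def cost(v):
        # cost(v)[c] = fewest changes in the subtree below v if v gets state c
        if not children[v]:
            return {c: (0 if f[v] == c else float("inf")) for c in states}
        total = dict.fromkeys(states, 0)
        for w in children[v]:
            cw = cost(w)
            best = min(cw.values())
            for c in states:
                total[c] += min(cw[c], best + 1)  # keep state c, or pay one change
        return total

    return min(cost(root).values())


def alignment_score(edges, root, alignment, states):
    """PS(A, T): sum of the per-character scores."""
    return sum(parsimony_score(edges, root, f, states) for f in alignment)


# Hypothetical tree T = ((x1, x2), x3) and a subdivision S of T (edge (a, b) subdivided by v).
T = [("rho", "a"), ("a", "b"), ("a", "x3"), ("b", "x1"), ("b", "x2")]
S = [("rho", "a"), ("a", "v"), ("v", "b"), ("a", "x3"), ("b", "x1"), ("b", "x2")]
f1 = {"x1": 0, "x2": 1, "x3": 1}
f2 = {"x1": 0, "x2": 0, "x3": 1}
print(alignment_score(T, "rho", [f1, f2], (0, 1)))
print(parsimony_score(T, "rho", f1, (0, 1)) == parsimony_score(S, "rho", f1, (0, 1)))
```

On this example each character needs exactly one change, and subdividing an edge does not alter the score, in line with Lemma 2.2.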

We also have the following observation.

Observation 2.3.

Let $f$ be a character on $X$, and let $S$ be a subdivision of a phylogenetic $X$-tree. If $F$ is an extension of $f$ to $V(S)$ such that $F(u)\neq F(v)$ for some edge $(u,v)$ of the root path of $S$, then there exists an extension $F^{\prime}$ of $f$ to $V(S)$ such that $F^{\prime}(u)=F^{\prime}(v)$ and ${\rm ch}(F^{\prime},S)<{\rm ch}(F,S)$.

By Observation 2.3, we freely assume throughout the remainder of the paper that every extension $F$ of a character to the vertices of a subdivision of a phylogenetic tree has the additional property that there is no character state transition on any edge of its root path.

Parsimony on phylogenetic networks. As outlined in the introduction, several notions of parsimony have been introduced that generalise parsimony from phylogenetic trees to phylogenetic networks. In this paper, we are focusing on the notion of softwired parsimony and briefly touch on parental parsimony in Section 7. Roughly speaking, the softwired parsimony score of an alignment $A$ of characters on a phylogenetic network $N$ is defined to be the smallest number of character-state transitions that is necessary to explain $A$ on any phylogenetic tree that is displayed by $N$ . Following Nakhleh et al. [16], we now make this precise. Let $A=(f_{1},f_{2},\ldots,f_{n})$ be an alignment of characters on $X$ , and let $N$ be a phylogenetic network on $X$ . The softwired parsimony score of $A$ on $N$ is defined as

$$PS_{\mathrm{sw}}(A,N)=\sum_{i=1}^{n}PS_{\mathrm{sw}}(f_{i},N)=\sum_{i=1}^{n}\min_{T\in D(N)}\min_{F_{i}}({\rm ch}(F_{i},T))=\sum_{i=1}^{n}\min_{T\in D(N)}PS(f_{i},T),\tag{1}$$

where, for each character $f_{i}$, the first minimum is taken over all phylogenetic trees in the display set of $N$ and the second minimum is taken over all extensions of $f_{i}$ to $V(T)$. As per Equation (1), each character in $A$ can follow a different tree in $D(N)$. A slightly more restricted definition of softwired parsimony, which has also appeared in the literature (see, e.g., [14, 15]), is the following:

$$PS_{\mathrm{sw}^{\prime}}(A,N)=\min_{T\in D(N)}\sum_{i=1}^{n}\min_{F_{i}}({\rm ch}(F_{i},T))=\min_{T\in D(N)}\sum_{i=1}^{n}PS(f_{i},T),\tag{2}$$

where all characters in $A$ follow the same tree in $D(N)$. Clearly, if $n=1$, then $PS_{\mathrm{sw}}(A,N)=PS_{\mathrm{sw}^{\prime}}(A,N)$. On the other hand, for $n\geq 1$, it follows from the definition that $PS_{\mathrm{sw}}(A,N)\leq PS_{\mathrm{sw}^{\prime}}(A,N)$. For the purpose of the upcoming sections, we adopt the softwired parsimony definition as formalised in Equation (1) and will see later that our main result also holds under the definition given in Equation (2). In this context, it is worth mentioning that the hardness result for computing the softwired parsimony score of a level-$1$ network for an alignment of at least two binary characters [15, Theorem 1], as mentioned in the introduction, has been established for the definition given in Equation (2).
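The difference between the two definitions can be made concrete by brute force on a small example. The following Python sketch enumerates all switchings, scores every character on every switching, and then takes the minima in the two orders prescribed by Equations (1) and (2). It rests on assumptions that should be read as such: the running time is exponential in $h(N)$, the helper names are hypothetical, and each character is scored directly on the graph left after deleting the reticulation edges outside the switching, with unlabelled leaves treated as free, which relies on the fact (used in the discussion of [6, Lemma 4.5] above) that this does not change the score of the yielded tree.

```python
from collections import defaultdict
from itertools import product


def tree_score(edges, root, f, states):
    """Unit-cost parsimony DP on a rooted tree; leaves without a label cost nothing."""
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)

    def cost(v):
        if not children[v]:
            return {c: (0 if (v not in f or f[v] == c) else float("inf")) for c in states}
        total = dict.fromkeys(states, 0)
        for w in children[v]:
            cw = cost(w)
            best = min(cw.values())
            for c in states:
                total[c] += min(cw[c], best + 1)
        return total

    return min(cost(root).values())


def switchings(edges):
    """Yield, for every switching, the edge list that remains after applying it."""
    parents = defaultdict(list)
    for u, v in edges:
        parents[v].append((u, v))
    fixed = [p[0] for p in parents.values() if len(p) == 1]
    choices = [p for p in parents.values() if len(p) == 2]  # one pair per reticulation
    for choice in product(*choices):
        yield fixed + list(choice)


def softwired_scores(edges, root, alignment, states):
    """Return (PS_sw, PS_sw') following Equations (1) and (2)."""
    table = [[tree_score(sw, root, f, states) for f in alignment]
             for sw in switchings(edges)]
    ps_sw = sum(min(column) for column in zip(*table))  # each character picks its own tree
    ps_sw_prime = min(sum(row) for row in table)        # all characters share one tree
    return ps_sw, ps_sw_prime


# Hypothetical level-1 network on {x1, x2, x3, x4}: reticulation r with parents a and b.
network = [("rho", "s"), ("s", "t"), ("s", "x4"), ("t", "a"), ("t", "b"),
           ("a", "x1"), ("a", "r"), ("b", "r"), ("b", "x3"), ("r", "x2")]
f1 = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
f2 = {"x1": 0, "x2": 1, "x3": 1, "x4": 0}
print(softwired_scores(network, "rho", [f1, f2], (0, 1)))
```

In this example each character has a displayed tree on which it needs a single change, but no displayed tree is optimal for both characters, so the score under Equation (1) is strictly smaller than the score under Equation (2).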

Statement of main result. The main result of this paper is the following theorem which we establish in Section 6.

Theorem 2.4.

Let $N$ be a phylogenetic network on $X$ , and let $T$ be a phylogenetic $X$ -tree in $D(N)$ . Furthermore, let $A$ be an alignment of characters on $X$ . Then

$$PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N)\text{ and }PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N),$$

where $k$ is the level of $N$ .

For example, if $N$ is a level-1 network, Theorem 2.4 implies that the parsimony score of an arbitrary tree displayed by $N$ is at most twice the parsimony score of a tree displayed by $N$ whose parsimony score is equal to $PS_{\mathrm{sw}}(A,N)$ . Moreover, we show in Section 6 that the bound as stated in Theorem 2.4 is sharp.

The next corollary positively answers the open problem that is detailed in the introduction and that was first posed in [15].

Corollary 2.5.

For a fixed non-negative integer $k$, let $N$ be a level-$k$ network on $X$, and let $A$ be an alignment of characters on $X$. There exists a polynomial-time $(k+1)$-approximation algorithm to calculate $PS_{\mathrm{sw}}(A,N)$ and $PS_{\mathrm{sw}^{\prime}}(A,N)$.

Proof.

Clearly, we can construct a phylogenetic $X$ -tree $T$ such that $T\in D(N)$ in time that is polynomial in $|V(N)|$ . Furthermore, it takes time that is polynomial in $|X|$ to calculate $PS(A,T)$ by applying Fitch’s algorithm [7]. The result now follows immediately from Theorem 2.4. ∎
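Spelled out, the approximation guarantee rests on the following chain of inequalities, which holds for every $T\in D(N)$. The first inequality is the comparison of the two definitions noted after Equation (2), the second holds because $T$ is one of the trees over which the minimum in Equation (2) is taken, and the third and fourth follow from Theorem 2.4 together with the first:

$$PS_{\mathrm{sw}}(A,N)\leq PS_{\mathrm{sw}^{\prime}}(A,N)\leq PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N).$$

As a purely illustrative instance with made-up numbers, if $N$ is level-$1$ and some displayed tree $T$ has $PS(A,T)=10$, then both softwired scores lie between $5$ and $10$.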

3 Bounding the rSPR distance

In this section, we establish an upper bound on the rSPR distance between two phylogenetic trees for when both trees are displayed by a given network. Let $N$ be a phylogenetic network, and let $R$ and $R^{\prime}$ be two switchings of $N$ . We define

$$d_{\mathrm{switch}}(R,R^{\prime})=h(N)-|R\cap R^{\prime}|$$

to be the switching distance between $R$ and $R^{\prime}$ . Intuitively, $d_{\mathrm{switch}}(R,R^{\prime})$ is the number of reticulations in $N$ for which $R$ and $R^{\prime}$ contain different reticulation edges.
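The switching distance is straightforward to compute once switchings are stored as sets of reticulation edges. The following Python sketch is a minimal illustration; the function name and the two-reticulation example are hypothetical, and the code assumes that both arguments are valid switchings of the same network, so that each contains exactly $h(N)$ edges.

```python
def switching_distance(R, R_prime):
    """d_switch(R, R') = h(N) - |R ∩ R'| for two switchings of the same network."""
    R, R_prime = set(R), set(R_prime)
    h = len(R)  # a switching contains exactly one edge per reticulation, so |R| = h(N)
    if len(R_prime) != h:
        raise ValueError("both switchings must belong to the same network")
    return h - len(R & R_prime)


# Hypothetical network with reticulations r1 (parents a, b) and r2 (parents c, d).
R1 = {("a", "r1"), ("c", "r2")}
R2 = {("a", "r1"), ("d", "r2")}
R3 = {("b", "r1"), ("d", "r2")}
print(switching_distance(R1, R2), switching_distance(R1, R3))
```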

The following lemma is a generalisation of [3, Lemma 3.1].

Lemma 3.1.

Let $N$ be a phylogenetic network on $X$ , and let $T_{R}$ and $T_{R^{\prime}}$ be two phylogenetic $X$ -trees that are yielded by two switchings $R$ and $R^{\prime}$ , respectively, of $N$ . Then

$$d_{\mathrm{rSPR}}(T_{R},T_{R^{\prime}})\leq d_{\mathrm{switch}}(R,R^{\prime}).$$

Proof.

Let $S$ (resp. $S^{\prime}$ ) be an embedding of $T_{R}$ (resp. $T_{R^{\prime}}$ ) in $N$ whose edge set contains each edge in $R$ (resp. $R^{\prime}$ ). Obtain a directed acyclic graph $N^{\prime}$ from $N$ by deleting each edge that is not contained in $E(S)\cup E(S^{\prime})$ and, subsequently, applying any of the following two operations until no further operation is possible.

(i) Suppress a vertex of in-degree one and out-degree one.
(ii) If $e$ and $e^{\prime}$ are two edges in parallel, delete $e^{\prime}$.

By construction, $N^{\prime}$ is a phylogenetic network on $X$ and each reticulation edge that is not contained in $R\cup R^{\prime}$ is deleted in obtaining $N^{\prime}$ . Hence, $h(N^{\prime})\leq d_{\mathrm{switch}}(R,R^{\prime})$ . Furthermore, as $S$ and $S^{\prime}$ are embeddings of $T_{R}$ and $T_{R^{\prime}}$ , respectively, $T_{R}$ and $T_{R^{\prime}}$ are displayed by $N^{\prime}$ . Now, let $h(T_{R},T_{R^{\prime}})$ be the minimum number of reticulations of any phylogenetic network that displays $T_{R}$ and $T_{R^{\prime}}$ . Clearly, $h(T_{R},T_{R^{\prime}})\leq h(N^{\prime})$ and, thus

$$d_{\mathrm{switch}}(R,R^{\prime})\geq h(N^{\prime})\geq h(T_{R},T_{R^{\prime}})\geq d_{\mathrm{rSPR}}(T_{R},T_{R^{\prime}}),$$

where the last inequality follows from [19, Equation 10.1]. ∎

4 Informative blobs

To establish the main result of this paper, we introduce the novel concept of informative and non-informative blobs in this section. After giving a formal definition of these blobs, we establish results related to the changing number of character extensions to embeddings of phylogenetic trees that are displayed by phylogenetic networks that consist of a single informative or non-informative blob. Let $N$ be a phylogenetic network on $X$, and let $B$ be a blob of $N$. Furthermore, let $C_{N}(B)$ be the subset of $V(N)$ that contains precisely each vertex that is not in $B$ and that is a child of a vertex of $B$. We refer to $C_{N}(B)$ as the children of $B$. As an immediate consequence of the definition of $C_{N}(B)$, we have the following lemma that we will freely use throughout the remainder of the paper.

Lemma 4.1.

Let $B$ be a blob of a phylogenetic network $N$ on $X$ , and let $v$ be a vertex of $C_{N}(B)$ . Furthermore, let $S$ be an embedding of a phylogenetic $X$ -tree that is displayed by $N$ . Then, $v$ is a vertex of $S$ . Moreover, if $v$ is a vertex of a blob $B^{\prime}$ in $N$ , then $v$ is the source of $B^{\prime}$ .

Proof.

Suppose that $v$ is a vertex of a blob $B^{\prime}$ in $N$ . Since $v\in C_{N}(B)$ , we have $B^{\prime}\neq B$ . Towards a contradiction, assume that $v$ is not the source of $B^{\prime}$ . It follows that $v$ is a vertex of in-degree two. Let $u$ be the parent of $v$ in $B$ , and let $u^{\prime}$ be the other parent of $v$ . Then there is a directed path from the root of $N$ to $v$ that traverses $u$ and a directed path from the root of $N$ to $v$ that traverses $u^{\prime}$ , thereby contradicting that $B$ and $B^{\prime}$ are two distinct blobs in $N$ . We complete the proof by noting that $S$ contains each edge of $N$ whose deletion disconnects $N$ into more than one connected component and, so, $v$ is a vertex of $S$ . ∎

Now, let $s$ be the source of a blob $B$ in a phylogenetic network $N$ on $X$ . Furthermore, let $S$ be an embedding of a phylogenetic $X$ -tree that is displayed by $N$ . Let $f$ be a character on $X$ , and let $F$ be an extension of $f$ to $V(S)$ . We set ${\rm ind}(F,B,S)=0$ if each element in $C_{N}(B)$ is assigned to the same character state under $F$ and, otherwise, we set ${\rm ind}(F,B,S)=1$ . By Lemma 4.1, recall that each vertex in $C_{N}(B)$ is also a vertex of $S$ and, thus, ${\rm ind}(F,B,S)$ is well defined. Moreover, we say that $B$ is a non-informative blob relative to $S$ and $f$ if there exists an extension $F$ of $f$ to $V(S)$ such that $PS(f,S)={\rm ch}(F,S)$ and ${\rm ind}(F,B,S)=0$ . Otherwise, we say that $B$ is an informative blob. We next extend the concept of a single informative blob to all blobs $B_{1},B_{2},\ldots,B_{m}$ of $N$ and set

$$b(f,N,S)=\min_{\substack{F_{j}\text{ such that}\\ PS(f,S)={\rm ch}(F_{j},S)}}\left(\sum_{i=1}^{m}{\rm ind}(F_{j},B_{i},S)\right),$$

where the minimum is taken over all extensions $F_{j}$ of $f$ to $V(S)$ whose changing number is equal to $PS(f,S)$ . Then $b(f,N,S)$ denotes the number of informative blobs relative to $S$ and $f$ in $N$ . If $F_{j}$ is an extension of $f$ to $V(S)$ such that $b(f,N,S)=\sum_{i=1}^{m}{\rm ind}(F_{j},B_{i},S)$ , then we say that $F_{j}$ realises $b(f,N,S)$ . See Figure 1 for an example of a phylogenetic network $N$ on $X=\{x_{1},x_{2},\ldots,x_{8}\}$ with two blobs $B$ and $B^{\prime}$ , an embedding $S$ of a phylogenetic $X$ -tree displayed by $N$ , and a binary character $f$ on $X$ such that $b(f,N,S)=1$ . Here, $B$ is non-informative because there exists a minimum extension $F$ of $f$ that assigns character state $0$ to all elements of $C_{N}(B)$ , where $C_{N}(B)$ contains the source of $B^{\prime}$ and leaves $x_{7}$ and $x_{8}$ . Blob $B^{\prime}$ , on the other hand, is informative, as the elements in $C_{N}(B^{\prime})=\{x_{1},x_{2},\ldots,x_{6}\}$ are assigned two different states by $f$ and thus by any extension of it. To see that $F$ is indeed minimum, notice that ${\rm ch}(F,S)={\rm ch}(F,T)=1$ . This is minimum since $f$ employs two states, and it is well-known and easy to see that, in this case, any extension of $f$ requires at least one change.
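For very small instances, $b(f,N,S)$ can be evaluated directly from its definition by enumerating all extensions. The Python sketch below does exactly that and is meant only as an illustration: it is exponential in the number of internal vertices of $S$, the function name is hypothetical, the sets $C_{N}(B_{i})$ are supplied by hand rather than computed from $N$, and the example is a made-up single-blob network, not the one in Figure 1.

```python
from itertools import product


def score_and_informative_blobs(edges, f, states, blob_children):
    """Return (PS(f, S), b(f, N, S)) by enumerating every extension F of f to V(S).

    `edges` is the edge list of the embedding S and `blob_children` lists the
    sets C_N(B_1), ..., C_N(B_m) of children of the blobs of N.
    """
    vertices = sorted({x for e in edges for x in e})
    internal = [v for v in vertices if v not in f]
    best = None
    for assignment in product(states, repeat=len(internal)):
        F = {**f, **dict(zip(internal, assignment))}
        ch = sum(1 for u, v in edges if F[u] != F[v])
        # ind(F, B, S) = 1 exactly if the children of B see more than one state
        ind_sum = sum(1 for C in blob_children if len({F[v] for v in C}) > 1)
        # Lexicographic minimum: first minimise ch, then the number of informative blobs.
        if best is None or (ch, ind_sum) < best:
            best = (ch, ind_sum)
    return best


# Hypothetical embedding S in a network whose single blob has vertices t, a, b, r.
S = [("rho", "s"), ("s", "t"), ("s", "x4"), ("t", "a"), ("t", "b"),
     ("a", "x1"), ("a", "r"), ("r", "x2"), ("b", "x3")]
children_of_B = [{"x1", "x2", "x3"}]
f_informative = {"x1": 0, "x2": 0, "x3": 1, "x4": 1}
f_non_informative = {"x1": 0, "x2": 0, "x3": 0, "x4": 1}
print(score_and_informative_blobs(S, f_informative, (0, 1), children_of_B))
print(score_and_informative_blobs(S, f_non_informative, (0, 1), children_of_B))
```

Both characters have parsimony score one on $S$, but only the first forces two different states among the children of the blob, so the blob is informative for the first character and non-informative for the second.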

Lemma 4.2.

Let $f$ be a character on $X$ , and let $N$ be a phylogenetic network on $X$ with a single blob $B$ whose source $s$ is the child of the root. Let $S$ and $S^{\prime}$ be embeddings of two phylogenetic $X$ -trees that are displayed by $N$ . Suppose that $B$ is non-informative relative to $S^{\prime}$ and $f$ . Let $F^{\prime}$ be an extension of $f$ to $V(S^{\prime})$ with ${\rm ch}(F^{\prime},S^{\prime})=PS(f,S^{\prime})$ that assigns the same character state to each vertex in $C_{N}(B)$ . Then there exists an extension $F$ of $f$ to $V(S)$ such that

(i) ${\rm ch}(F,S)={\rm ch}(F^{\prime},S^{\prime})$ and
(ii) $F(s)=F^{\prime}(s)$.

Proof.

Since $B$ is non-informative, recall that $F^{\prime}$ exists. Furthermore, by the definition of an embedding, $s$ is the child of the roots of $S$ and $S^{\prime}$ . Let $V$ be the subset of $V(N)$ that precisely contains each vertex that is not a vertex of $B$ . Since $B$ is the only blob of $N$ , each vertex in $V$ is also a vertex of $S$ and $S^{\prime}$ . Furthermore, each vertex of $S$ or $S^{\prime}$ that is not in $V$ , is a vertex of $B$ . Since $s$ is the child of the root of $S^{\prime}$ and $F^{\prime}$ assigns the same character state, say $\alpha$ , to each vertex in $C_{N}(B)$ , it follows that $F^{\prime}$ also assigns $\alpha$ to each vertex in $V(S^{\prime})$ that is an ancestor of some vertex in $C_{N}(B)$ . In particular, $F^{\prime}(s)=\alpha$ .

Now, consider $S$ . Set $F(u)=F^{\prime}(u)$ for each vertex $u\in V$ and set $F(u^{\prime})=\alpha$ for each vertex $u^{\prime}\in V(S)\setminus V$ . By definition of $F$ , we again have that each vertex in $V(S)$ that is an ancestor of some vertex in $C_{N}(B)$ is assigned to $\alpha$ under $F$ . Hence, as $F^{\prime}$ is an extension of $f$ to $V(S^{\prime})$ , $F$ is an extension of $f$ to $V(S)$ with $F(s)=F^{\prime}(s)=\alpha$ ; thereby satisfying (ii). Moreover, since $S$ and $S^{\prime}$ are embeddings of two phylogenetic $X$ -trees that are displayed by $N$ , the edges of $N$ satisfy the following property: If $e=(u,v)$ is an edge of $S^{\prime}$ (resp. $S$ ) but not an edge of $S$ (resp. $S^{\prime}$ ), then $e$ is an edge of $B$ and, consequently, $F^{\prime}(u)=F^{\prime}(v)=\alpha$ (resp. $F(u)=F(v)=\alpha$ ). It follows that ${\rm ch}(F,S)={\rm ch}(F^{\prime},S^{\prime})$ which satisfies (i) and, therefore, completes the proof of the lemma. ∎

Lemma 4.3.

Let $f$ be a character on $X$ . Let $T$ and $T^{\prime}$ be two phylogenetic $X$ -trees. Furthermore, let $F^{\prime}$ be an extension of $f$ to $V(T^{\prime})$ . Then there exists an extension $F$ of $f$ to $V(T)$ such that

(i) ${\rm ch}(F,T)\leq{\rm ch}(F^{\prime},T^{\prime})+d_{\mathrm{rSPR}}(T^{\prime},T)$ and

(ii) $F(\rho)=F^{\prime}(\rho^{\prime})$ , where $\rho$ and $\rho^{\prime}$ are the roots of $T$ and $T^{\prime}$ , respectively.

Proof.

We show by induction on $d_{\mathrm{rSPR}}(T^{\prime},T)$ that there exists an extension $F$ of $f$ to $V(T)$ that satisfies (i)–(ii). Suppose that $d_{\mathrm{rSPR}}(T^{\prime},T)=1$ . Then there exists a single rSPR operation that transforms $T^{\prime}$ into $T$ . Given such an rSPR operation, let $(u^{\prime},v^{\prime})$ be the edge of $T^{\prime}$ that is deleted in the pruning part of the operation. Let $u_{p}^{\prime}$ and $u_{c}^{\prime}\neq v^{\prime}$ be the parent and other child of $u^{\prime}$ in $T^{\prime}$ . Further, let $u$ be the vertex that subdivides an edge, say $(u_{p},u_{c})$ , when reattaching the resulting subtree with root $v^{\prime}$ such that $(u,v^{\prime})$ is an edge in $T$ . Noting that each vertex in $T$ except for $u$ is also a vertex of $T^{\prime}$ , we next obtain an extension $F$ of $f$ to $V(T)$ with no character state transition on the root edge of $T$ as follows: For each vertex $w\neq u$ , we set $F(w)=F^{\prime}(w)$ . In particular, we have $F(\rho)=F^{\prime}(\rho^{\prime})$ and, so, (ii) follows. Moreover, if $u_{p}=\rho$ , we set $F(u)=F(u_{p})$ . Otherwise, we set $F(u)=\alpha$ , where $\alpha$ is a character state that has been assigned to at least one neighbour of $u$ in $T$ under $F$ and there is no other character state that has been assigned to strictly more neighbours of $u$ in $T$ under $F$ . We next show that (i) holds. Consider the edges of $T$ . Except for the edges $(u,v^{\prime})$ used to reattach the subtree with root $v^{\prime}$ , $(u_{p}^{\prime},u_{c}^{\prime})$ obtained from suppressing $u^{\prime}$ , and $(u_{p},u)$ and $(u,u_{c})$ obtained from subdividing $(u_{p},u_{c})$ with $u$ , each edge of $T$ is also an edge of $T^{\prime}$ . If $F(u_{p}^{\prime})\neq F(u_{c}^{\prime})$ , then either $F^{\prime}(u_{p}^{\prime})\neq F^{\prime}(u^{\prime})$ or $F^{\prime}(u^{\prime})\neq F^{\prime}(u_{c}^{\prime})$ . Hence, suppressing $u^{\prime}$ does not increase the changing number. 
On the other hand, when assigning a character state to $u$ the changing number may increase. More specifically, we consider three cases. First, if $F(u_{p})=F(u_{c})$ , then $F(u)=F(u_{p})$ by definition of $F$ . Note that $u_{p}$ may be $\rho$ . Thus, there is no character state transition on the two edges $(u_{p},u)$ and $(u,u_{c})$ , and at most one such transition on the edge $(u,v^{\prime})$ under $F$ in $T$ . Second, if $|\{F(u_{p}),F(u_{c}),F(v^{\prime})\}|=3$ , then there is a character state transition on the edge $(u_{p},u_{c})$ under $F^{\prime}$ and we have two character state transitions on the three edges $(u_{p},u)$ , $(u,u_{c})$ , and $(u,v^{\prime})$ under $F$ . Third, if $F(u_{p})\neq F(u_{c})$ and $|\{F(u_{p}),F(u_{c}),F(v^{\prime})\}|=2$ , then there is again a character state transition on the edge $(u_{p},u_{c})$ under $F^{\prime}$ and we have one character state transition on the three edges $(u_{p},u)$ , $(u,u_{c})$ , and $(u,v^{\prime})$ under $F$ . Hence, regardless of which case applies

{\rm ch}(F,T)\leq{\rm ch}(F^{\prime},T^{\prime})+1={\rm ch}(F^{\prime},T^{\prime})+d_{\mathrm{rSPR}}(T^{\prime},T);

thereby satisfying (i) for when $d_{\mathrm{rSPR}}(T^{\prime},T)=1$ .

Now suppose that $d_{\mathrm{rSPR}}(T^{\prime},T)\geq 2$ and that (i)–(ii) are satisfied for all pairs of phylogenetic trees whose rSPR distance is strictly smaller than $d_{\mathrm{rSPR}}(T^{\prime},T)$ . Let $T^{\prime\prime}$ be a phylogenetic $X$ -tree such that $d_{\mathrm{rSPR}}(T^{\prime},T^{\prime\prime})=1$ and $d_{\mathrm{rSPR}}(T^{\prime\prime},T)=d_{\mathrm{rSPR}}(T^{\prime},T)-1$ . Recalling that $F^{\prime}$ is an extension of $f$ to $V(T^{\prime})$ , it follows from the induction hypothesis, that there is an extension $F^{\prime\prime}$ of $f$ to $V(T^{\prime\prime})$ that satisfies (ii) and ${\rm ch}(F^{\prime\prime},T^{\prime\prime})\leq{\rm ch}(F^{\prime},T^{\prime})+1$ . Again, by the induction hypothesis, there exists an extension $F$ of $f$ to $V(T)$ that satisfies (ii) and ${\rm ch}(F,T)\leq{\rm ch}(F^{\prime\prime},T^{\prime\prime})+d_{\mathrm{rSPR}}(T^{\prime\prime},T)$ . Hence, by combining the two inequalities we obtain

\begin{aligned}
{\rm ch}(F,T)&\leq{\rm ch}(F^{\prime\prime},T^{\prime\prime})+d_{\mathrm{rSPR}}(T^{\prime\prime},T)\\
&\leq{\rm ch}(F^{\prime},T^{\prime})+1+d_{\mathrm{rSPR}}(T^{\prime\prime},T)={\rm ch}(F^{\prime},T^{\prime})+d_{\mathrm{rSPR}}(T^{\prime},T)
\end{aligned}

and $F(\rho)=F^{\prime}(\rho^{\prime})$ . Hence, $F$ satisfies (i)–(ii). This completes the proof of the lemma. ∎
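The case analysis in the base step of this proof can be checked mechanically. The following Python sketch is an illustration only: the names `plurality_state` and `regraft_increase` are hypothetical, and character states are encoded as small integers. It assigns a state to the new vertex $u$ as in the proof and compares the transitions on the three edges incident with $u$ against the transition on the subdivided edge $(u_{p},u_{c})$ .

```python
from collections import Counter
from itertools import product


def plurality_state(neighbour_states):
    """Return a state carried by the largest number of neighbours (ties broken arbitrarily)."""
    return Counter(neighbour_states).most_common(1)[0][0]


def regraft_increase(up, uc, v, parent_is_root=False):
    """Change in the number of transitions caused by subdividing (u_p, u_c) with u
    and attaching the pruned subtree rooted at v' (state v) below u."""
    # Rule from the proof: copy the root state if u_p is the root, else take a plurality state.
    u = up if parent_is_root else plurality_state([up, uc, v])
    before = int(up != uc)                               # transition on (u_p, u_c)
    after = int(up != u) + int(u != uc) + int(u != v)    # transitions on the three new edges
    return after - before


# Three states suffice, since at most three distinct states occur among u_p, u_c, v'.
max_increase = max(
    regraft_increase(a, b, c, is_root)
    for a, b, c in product(range(3), repeat=3)
    for is_root in (False, True)
)
print(max_increase)
```

Under these assumptions, the enumeration covers all three cases of the proof as well as the special case $u_{p}=\rho$ , and the largest increase it finds is one.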

Corollary 4.4.

Let $f$ be a character on $X$ . Let $N$ be a phylogenetic network on $X$ with a single blob, and let $T$ and $T^{\prime}$ be two phylogenetic $X$ -trees displayed by $N$ . Furthermore, let $F^{\prime}$ be an extension of $f$ to $V(T^{\prime})$ . Then there exists an extension $F$ of $f$ to $V(T)$ such that

(i) ${\rm ch}(F,T)\leq{\rm ch}(F^{\prime},T^{\prime})+k$ , where $k$ is the level of $N$ , and

(ii) $F(\rho)=F^{\prime}(\rho^{\prime})$ , where $\rho$ and $\rho^{\prime}$ are the roots of $T$ and $T^{\prime}$ , respectively.

Proof.

Let $R$ and $R^{\prime}$ be two switchings of $N$ that yield $T$ and $T^{\prime}$ , respectively. By Lemma 3.1, we have $d_{\mathrm{rSPR}}(T,T^{\prime})\leq d_{\mathrm{switch}}(R,R^{\prime})$ . Noting that $N$ has a single blob, we have $d_{\mathrm{switch}}(R,R^{\prime})\leq k$ and the corollary now follows from Lemma 4.3. ∎

While Lemma 4.2 is restricted to phylogenetic networks that consist of a single non-informative blob, the next lemma establishes an analogous result for all phylogenetic networks that consist of a single blob.

Lemma 4.5.

Let $f$ be a character on $X$ , and let $N$ be a phylogenetic network on $X$ with a single blob $B$ whose source $s$ is the child of the root. Let $S$ and $S^{\prime}$ be embeddings of two phylogenetic trees that are displayed by $N$ . Furthermore, let $F^{\prime}$ be an extension of $f$ to $V(S^{\prime})$ . Then there exists an extension $F$ of $f$ to $V(S)$ such that

(i) ${\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+k$ , where $k$ is the level of $N$ , and

(ii) $F(s)=F^{\prime}(s)$ .

Proof.

Let $T$ (resp. $T^{\prime}$ ) be the phylogenetic $X$ -tree such that $S$ (resp. $S^{\prime}$ ) is an embedding of $T$ (resp. $T^{\prime}$ ) in $N$ . Furthermore, let $F^{\prime}(s)=\alpha$ . Observe that each vertex in $T^{\prime}$ is a unique degree-three vertex in $S^{\prime}$ . First, let $F^{\prime}_{T^{\prime}}$ be the extension of $f$ to $V(T^{\prime})$ obtained from $F^{\prime}$ by setting $F^{\prime}_{T^{\prime}}(w)=F^{\prime}(w)$ for each $w\in V(T^{\prime})$ . Then

{\rm ch}(F^{\prime}_{T^{\prime}},T^{\prime})\leq{\rm ch}(F^{\prime},S^{\prime}),

(3)

and, because there is no character state transition on any edge of the root path of $S^{\prime}$ that contains $s$ , the root of $T^{\prime}$ is assigned to $\alpha$ . Second, by Corollary 4.4, there exists an extension $F_{T}$ of $f$ to $V(T)$ such that

{\rm ch}(F_{T},T)\leq{\rm ch}(F^{\prime}_{T^{\prime}},T^{\prime})+k

(4)

and the root of $T$ is also assigned to $\alpha$ . Third, we obtain an extension $F$ of $f$ to $V(S)$ from $F_{T}$ as follows. Noting that each edge $(v,v^{\prime})$ in $T$ corresponds to a unique directed path $v=v_{1},v_{2},\ldots,v_{t}=v^{\prime}$ in $S$ whose non-terminal vertices all have degree two, we set $F(v_{1})=F(v_{2})=\cdots=F(v_{t-1})=F_{T}(v)$ and $F(v_{t})=F_{T}(v^{\prime})$ . Then

{\rm ch}(F,S)={\rm ch}(F_{T},T),

(5)

and, since the root of $T$ is assigned to $\alpha$ , it follows that each vertex on the root path of $S$ , in particular $s$ , is also assigned to $\alpha$ . Hence (ii) is satisfied. Moreover, by combining Equations (3)–(5), we have

{\rm ch}(F,S)={\rm ch}(F_{T},T)\leq{\rm ch}(F^{\prime}_{T^{\prime}},T^{\prime})+k\leq{\rm ch}(F^{\prime},S^{\prime})+k.

This concludes the proof of the lemma. ∎
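The third step of this proof, lifting an extension from $T$ to its embedding $S$ , can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names `changing_number` and `lift_extension`, the toy tree, and the vertex labels are all hypothetical, and each edge of $T$ is supplied directly as the tuple of vertices on its corresponding path in $S$ .

```python
def changing_number(edges, ext):
    """Number of edges whose endpoints receive different character states."""
    return sum(1 for u, v in edges if ext[u] != ext[v])


def lift_extension(paths, ext_T):
    """Extend ext_T from V(T) to V(S): every vertex on a path except the last
    inherits the state of the first vertex; the last keeps its own state."""
    ext_S = {}
    for path in paths:
        for w in path[:-1]:
            ext_S[w] = ext_T[path[0]]
        ext_S[path[-1]] = ext_T[path[-1]]
    return ext_S


# Hypothetical example: T has edges (rho, v), (v, x1), (v, x2); in S, the first
# two edges are subdivided by the degree-two vertices s and r, respectively.
paths = [("rho", "s", "v"), ("v", "r", "x1"), ("v", "x2")]
ext_T = {"rho": 0, "v": 0, "x1": 1, "x2": 0}

edges_T = [(p[0], p[-1]) for p in paths]
edges_S = [(p[i], p[i + 1]) for p in paths for i in range(len(p) - 1)]
ext_S = lift_extension(paths, ext_T)

print(changing_number(edges_T, ext_T), changing_number(edges_S, ext_S))
```

Each path contributes a transition only on its last edge, and only if the corresponding edge of $T$ carries one, which is why the two changing numbers agree in this example.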

5 Blob reduction

In this section, we introduce the notion of a blob reduction. Intuitively, this allows us to decompose a phylogenetic network $N$ into two smaller phylogenetic networks and calculate the parsimony score of an embedding $S$ of a phylogenetic $X$ -tree displayed by $N$ based on these two smaller networks.

Let $N$ be a phylogenetic network on $X$ , let $B$ be a blob of $N$ whose source, say $s$ , has maximum distance from the root of $N$ , and let $Y=\mathrm{cl}(s)$ . For some $y\notin X$ , the blob reduction of $B$ reduces $N$ to two smaller phylogenetic networks as follows. Let $N(\bar{Y})$ be the phylogenetic network on $(X\setminus Y)\cup\{y\}$ that is obtained from $N$ by replacing the subnetwork of $N$ that is rooted at $s$ with a single new leaf $y$ . Furthermore, let $N(Y)$ be the phylogenetic network on $Y$ that is obtained from the subnetwork of $N$ that is rooted at $s$ by adding a new vertex $\rho_{Y}$ and edge $(\rho_{Y},s)$ . By construction, each of $N(\bar{Y})$ and $N(Y)$ contains at least one leaf.

Now, let $S$ be an embedding of a phylogenetic $X$ -tree $T$ that is displayed by $N$ . Recall that $s$ is a vertex of $S$ and $\mathrm{cl}(s)=Y$ . Let $f$ be a character on $X$ , and let $F$ be an extension of $f$ to $V(S)$ . Using the aforementioned blob reduction of $B$ as a guide, we next also reduce $S$ to two smaller trees such that one of the resulting trees is an embedding of a subtree of $T$ in $N(\bar{Y})$ and the other one is an embedding of another subtree of $T$ in $N(Y)$ . More specifically, let $S(\bar{Y})$ be the tree with leaf set $(X\setminus Y)\cup\{y\}$ that is obtained from $S$ by replacing the subtree of $S$ that is rooted at $s$ with a single new leaf $y$ . Furthermore, let $S(Y)$ be the tree with leaf set $Y$ that is obtained from the subtree of $S$ that is rooted at $s$ by adding a new vertex $\rho_{Y}$ and edge $(\rho_{Y},s)$ . We call $(S(Y),S(\bar{Y}))$ the cluster tree pair of $S$ relative to $B$ . Let $f_{Y}$ and $f_{\bar{Y}}$ be characters on $Y$ and on $(X\setminus Y)\cup\{y\}$ , respectively, such that $f_{Y}(\ell)=f(\ell)$ for each $\ell\in Y$ and $f_{\bar{Y}}(\ell^{\prime})=f(\ell^{\prime})$ for each $\ell^{\prime}\in X\setminus Y$ . We refer to extensions $F_{Y}$ of $f_{Y}$ to $V(S(Y))$ and $F_{\bar{Y}}$ of $f_{\bar{Y}}$ to $V(S(\bar{Y}))$ as a pair of cluster extensions with respect to $f$ if $F_{\bar{Y}}(y)=F_{Y}(\rho_{Y})$ . Except for $f_{\bar{Y}}(y)$ , observe that $f$ uniquely determines the character state of each leaf in $S(Y)$ and $S(\bar{Y})$ . Moreover, since the root path of $S(Y)$ contains at least one edge, the definition of a pair of cluster extensions implies that $F_{Y}(\rho_{Y})=F_{Y}(s)$ by our assumption following Observation 2.3. The next lemma shows how the changing numbers of extensions of the characters $f$ , $f_{Y}$ , and $f_{\bar{Y}}$ to $V(S)$ , $V(S(Y))$ , and $V(S(\bar{Y}))$ , respectively, are related to each other.

Lemma 5.1.

Let $B$ be a blob of a phylogenetic network $N$ on $X$ to which the blob reduction can be applied. Let $f$ be a character on $X$ , and let $S$ be an embedding of a phylogenetic $X$ -tree that is displayed by $N$ . Furthermore, let $(S(Y),S(\bar{Y}))$ be the cluster tree pair of $S$ relative to $B$ . Then, the following two statements hold.

(i) If $F$ is an extension of $f$ to $V(S)$ , then there exists a pair of cluster extensions $(F_{Y},F_{\bar{Y}})$ such that

${\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).$

(ii) If $(F_{Y},F_{\bar{Y}})$ is a pair of cluster extensions with respect to $f$ , then there exists an extension $F$ of $f$ to $V(S)$ such that

${\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).$

Proof.

Let $s$ be the source of $B$ , and let $Y=\mathrm{cl}_{N}(s)$ . By the definition of a cluster tree pair, $S(Y)$ has leaf set $Y$ and root $\rho_{Y}$ , and $S(\bar{Y})$ has leaf set $(X\setminus Y)\cup\{y\}$ and root $\rho$ , where $\rho$ is also the root of $S$ . Lastly, as $s$ is a vertex of $S$ , it follows from the construction of $S(Y)$ and $S(\bar{Y})$ that $s$ corresponds to the child of $\rho_{Y}$ in $S(Y)$ and to $y$ in $S(\bar{Y})$ , whereas each other vertex of $S$ corresponds to a unique vertex in either $S(Y)$ or $S(\bar{Y})$ . To ease reading, we refer to the child of $\rho_{Y}$ in $S(Y)$ as $s_{Y}$ . Conversely, the only vertex of $S(Y)$ and $S(\bar{Y})$ that does not correspond to a unique vertex in $S$ is $\rho_{Y}$ . We next show that (i) and (ii) hold.

First, let $F$ be an extension of $f$ to $V(S)$ . Obtain a pair of cluster extensions $F_{Y}$ and $F_{\bar{Y}}$ of characters $f_{Y}$ and $f_{\bar{Y}}$ to $V(S(Y))$ and $V(S(\bar{Y}))$ , respectively, in the following way. For each vertex $w$ of $V(S(Y))\setminus\{\rho_{Y}\}$ , set $F_{Y}(w)=F(w^{\prime})$ , where $w^{\prime}$ is the vertex of $S$ that $w$ corresponds to, and set $F_{Y}(\rho_{Y})=F_{Y}(s_{Y})$ . Similarly, for each vertex $w$ of $V(S(\bar{Y}))$ , set $F_{\bar{Y}}(w)=F(w^{\prime})$ , where $w^{\prime}$ is the vertex of $S$ that $w$ corresponds to. Since $y$ and $s_{Y}$ both correspond to $s$ , it follows that $F_{\bar{Y}}(y)=F_{Y}(\rho_{Y})$ . It is now easily checked that $F_{Y}$ and $F_{\bar{Y}}$ form a pair of cluster extensions with respect to $f$ and that

{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).

Hence, (i) holds.

Now, let $F_{Y}$ and $F_{\bar{Y}}$ be a pair of cluster extensions with respect to $f$ . In particular, $F_{Y}(\ell)=f(\ell)$ for each $\ell\in Y$ , $F_{\bar{Y}}(\ell)=f(\ell)$ for each $\ell\in X\setminus Y$ , and $F_{\bar{Y}}(y)=F_{Y}(\rho_{Y})$ . Now obtain an extension $F$ of $f$ to $V(S)$ from $F_{Y}$ and $F_{\bar{Y}}$ in the following way. For each vertex $w^{\prime}$ of $V(S)$ that corresponds to a vertex $w$ of $S(Y)$ , set $F(w^{\prime})=F_{Y}(w)$ and, for each vertex $w^{\prime}$ of $V(S)\setminus\{s\}$ that corresponds to a vertex $w$ of $S(\bar{Y})$ , set $F(w^{\prime})=F_{\bar{Y}}(w)$ . Since $F_{Y}(s_{Y})=F_{Y}(\rho_{Y})$ , it follows that

{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(F_{\bar{Y}},S(\bar{Y})).

Thus, (ii) holds as well. ∎
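The additivity stated in Lemma 5.1 can be illustrated on a small example. The following Python sketch makes several assumptions that are not part of the paper: trees are given as lists of (parent, child) edges, the function names `changing_number` and `cluster_split` are hypothetical, and the labels `"rho_Y"` and `"y"` stand for the new root $\rho_{Y}$ and the new leaf $y$ .

```python
def changing_number(edges, ext):
    """Number of edges whose endpoints receive different character states."""
    return sum(1 for u, v in edges if ext[u] != ext[v])


def cluster_split(edges, ext, s):
    """Split a rooted tree at vertex s into the pair (S(Y), S(Y-bar)),
    together with the restricted extensions, as in the proof of part (i)."""
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    # Collect s and all of its descendants.
    below, stack = set(), [s]
    while stack:
        w = stack.pop()
        below.add(w)
        stack.extend(children.get(w, []))
    # S(Y): the subtree rooted at s plus a new root edge (rho_Y, s).
    edges_y = [("rho_Y", s)] + [(u, v) for u, v in edges if u in below]
    ext_y = {w: ext[w] for w in below}
    ext_y["rho_Y"] = ext[s]            # no transition on the new root edge
    # S(Y-bar): the subtree rooted at s is replaced by the single leaf y.
    edges_bar = [(u, "y" if v == s else v) for u, v in edges if u not in below]
    ext_bar = {w: ext[w] for w in ext if w not in below}
    ext_bar["y"] = ext[s]              # y inherits the state of s
    return (edges_y, ext_y), (edges_bar, ext_bar)


# Hypothetical tree with leaves x1, x2 below s and leaf x3 elsewhere.
edges = [("rho", "a"), ("a", "s"), ("a", "x3"), ("s", "x1"), ("s", "x2")]
ext = {"rho": 0, "a": 0, "s": 1, "x1": 1, "x2": 0, "x3": 0}

(edges_y, ext_y), (edges_bar, ext_bar) = cluster_split(edges, ext, "s")
total = changing_number(edges, ext)
parts = changing_number(edges_y, ext_y) + changing_number(edges_bar, ext_bar)
print(total, parts)
```

Every edge of the original tree appears in exactly one of the two pieces, and the only additional edge, the new root edge of the first piece, carries no transition, so the two totals coincide.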

6 Proof of Theorem 2.4

In this section, we establish the proof of Theorem 2.4 and show that the bound that is given in the theorem is sharp. Most work in proving Theorem 2.4 goes into establishing the following lemma.

Lemma 6.1.

Let $f$ be a character on $X$ . Let $N$ be a phylogenetic network on $X$ , and let $S$ and $S^{\prime}$ be embeddings of two phylogenetic $X$ -trees that are displayed by $N$ . Furthermore, let $F^{\prime}$ be an extension of $f$ to $V(S^{\prime})$ that realises $b(f,N,S^{\prime})$ . Then there exists an extension $F$ of $f$ to $V(S)$ such that

{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}),

where $k$ is the level of $N$ .

Proof.

Let $B_{1},B_{2},\ldots,B_{m}$ be the blobs of $N$ . The proof is by induction on $m$ . If $m=0$ , then $N$ is a phylogenetic tree with $k=0$ and so the result clearly follows since $S=S^{\prime}$ and, therefore,

{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+0

when setting $F=F^{\prime}$ . Now assume that $m\geq 1$ and that the statement is true for all phylogenetic networks with at most $m-1$ blobs. Let $B$ be a blob of $N$ whose source $s$ has maximum distance from the root of $N$ over all its blobs, and let $Y=\mathrm{cl}_{N}(s)$ . Without loss of generality, we assume that $B=B_{m}$ .

For some $y\notin X$ , let $N(\bar{Y})$ be the phylogenetic network on $(X\setminus Y)\cup\{y\}$ , and let $N(Y)$ be the phylogenetic network on $Y$ and with root $\rho_{Y}$ resulting from $N$ by applying a blob reduction to $B_{m}$ . Notice that by construction, $N(Y)$ consists of the single blob $B_{m}$ with $s$ being the child of $\rho_{Y}$ , whereas $N(\bar{Y})$ contains precisely $m-1$ blobs. Moreover, let $(S(Y),S(\bar{Y}))$ be the cluster tree pair of $S$ relative to $B_{m}$ , and let $(S^{\prime}(Y),S^{\prime}(\bar{Y}))$ be the cluster tree pair of $S^{\prime}$ relative to $B_{m}$ . Since $s$ is a vertex of $S$ and $S^{\prime}$ , $s$ is also a vertex of $S(Y)$ and $S^{\prime}(Y)$ . Now, by Lemma 5.1, Part (i), there exists a pair of cluster extensions $(F^{\prime}_{Y},G^{\prime}_{\bar{Y}})$ with respect to $f$ such that

{\rm ch}(F^{\prime},S^{\prime})={\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))

(6)

and $G^{\prime}_{\bar{Y}}(y)=F^{\prime}_{Y}(\rho_{Y})=F^{\prime}_{Y}(s)$ . Let $f_{Y}$ and $g_{\bar{Y}}$ be the characters on $Y$ and $(X\setminus Y)\cup\{y\}$ , respectively, such that $F^{\prime}_{Y}$ and $G^{\prime}_{\bar{Y}}$ are extensions of $f_{Y}$ and $g_{\bar{Y}}$ , respectively.

We next consider $N(\bar{Y})$ and start by making two observations. First, since ${\rm ch}(F^{\prime},S^{\prime})=PS(f,S^{\prime})$ , it follows from Lemma 5.1, Parts (i) and (ii), that ${\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))=PS(g_{\bar{Y}},S^{\prime}(\bar{Y}))$ . Second, by the construction of the pair of cluster extensions in Lemma 5.1 Part (i), we may assume that, for each vertex $w$ of $V(S^{\prime}(\bar{Y}))$ , we have $G^{\prime}_{\bar{Y}}(w)=F^{\prime}(w^{\prime})$ , where $w^{\prime}$ is the vertex of $S^{\prime}$ that $w$ corresponds to. Then, as $F^{\prime}$ realises $b(f,N,S^{\prime})$ , $G^{\prime}_{\bar{Y}}$ realises $b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))$ . Noting that $N(\bar{Y})$ has $m-1$ blobs and level at most $k$ , we now apply the induction hypothesis to obtain an extension $G_{\bar{Y}}$ of $g_{\bar{Y}}$ to $V(S(\bar{Y}))$ that satisfies

{\rm ch}(G_{\bar{Y}},S(\bar{Y}))\leq{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))

(7)

such that $G_{\bar{Y}}(y)=g_{\bar{Y}}(y)=G^{\prime}_{\bar{Y}}(y)=F^{\prime}_{Y}(s)$ .

To complete the proof, we consider $N(Y)$ . Here, we distinguish two cases depending on whether its single blob $B_{m}$ whose level is at most $k$ is informative or non-informative.

First, assume that $B_{m}$ is informative. By Lemma 4.5, there exists an extension $F_{Y}$ of $f_{Y}$ to $V(S(Y))$ such that

{\rm ch}(F_{Y},S(Y))\leq{\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+k\quad\text{and}\quad F_{Y}(s)=F^{\prime}_{Y}(s).

(8)

Since $F_{Y}(s)=F^{\prime}_{Y}(s)=G_{\bar{Y}}(y)$ , the pair $(F_{Y},G_{\bar{Y}})$ is a pair of cluster extensions with respect to $f$ . Thus, by Lemma 5.1, Part (ii), there exists an extension $F$ of $f$ to $V(S)$ such that

{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y})).

Now, using Inequalities (7) and (8), we obtain

\begin{aligned}
{\rm ch}(F,S)&={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y}))\\
&\leq{\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+k+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))\\
&={\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot(1+b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y})))\\
&={\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}),
\end{aligned}

where the last equality follows from Equation (6) and the fact that, since $B_{m}$ is informative, $b(f,N,S^{\prime})=1+b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))$ .

Second, assume that $B_{m}$ is non-informative. Then by Lemma 4.2, there exists an extension $F_{Y}$ of $f_{Y}$ to $V(S(Y))$ such that

{\rm ch}(F_{Y},S(Y))={\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))\quad\text{and}\quad F_{Y}(s)=F^{\prime}_{Y}(s).

(9)

The remainder of the proof is now similar to the first case. In particular, noting that the pair $(F_{Y},G_{\bar{Y}})$ is a pair of cluster extensions with respect to $f$ , by Lemma 5.1, Part (ii), there exists an extension $F$ of $f$ to $V(S)$ such that

{\rm ch}(F,S)={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y})).

Using Inequality (7) and Equation (9), we obtain

\begin{aligned}
{\rm ch}(F,S)&={\rm ch}(F_{Y},S(Y))+{\rm ch}(G_{\bar{Y}},S(\bar{Y}))\\
&\leq{\rm ch}(F^{\prime}_{Y},S^{\prime}(Y))+{\rm ch}(G^{\prime}_{\bar{Y}},S^{\prime}(\bar{Y}))+k\cdot b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))\\
&={\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}),
\end{aligned}

where the last equality follows from Equation (6) and the fact that, since $B_{m}$ is non-informative, $b(f,N,S^{\prime})=b(g_{\bar{Y}},N(\bar{Y}),S^{\prime}(\bar{Y}))$ .

In both cases, we obtain an extension $F$ of $f$ to $V(S)$ such that

{\rm ch}(F,S)\leq{\rm ch}(F^{\prime},S^{\prime})+k\cdot b(f,N,S^{\prime}).

This concludes the proof of the lemma. ∎

Corollary 6.2.

Let $f$ be a character on $X$ . Let $N$ be a phylogenetic network on $X$ , and let $S$ and $S^{\prime}$ be embeddings of two phylogenetic $X$ -trees displayed by $N$ such that $PS_{\mathrm{sw}}(f,N)=PS(f,S^{\prime})$ . Then

PS(f,S)\leq PS(f,S^{\prime})+k\cdot b(f,N,S^{\prime}),

where $k$ is the level of $N$ .

Proof.

Let $F^{\prime}$ be an extension of $f$ to $V(S^{\prime})$ that realises $b(f,N,S^{\prime})$ . By Lemma 6.1, there exists an extension $F$ of $f$ to $V(S)$ such that $PS(f,S)\leq{\rm ch}(F,S)\leq PS(f,S^{\prime})+k\cdot b(f,N,S^{\prime})$ . ∎

We are finally in a position to establish the main result of this paper, which we restate for convenience.

Theorem 2.4. Let $N$ be a phylogenetic network on $X$ , and let $T$ be a phylogenetic $X$ -tree in $D(N)$ . Let $A$ be an alignment of characters on $X$ . Then

PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N)\text{ and }PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N),

where $k$ is the level of $N$ .

Proof.

We establish that

PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N).

Since $PS_{\mathrm{sw}}(A,N)\leq PS_{\mathrm{sw}^{\prime}}(A,N)$ , it immediately follows that $PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(A,N)$ also holds.

First, assume that $A$ consists of a single character $f$ . Let $S$ be an embedding of $T$ in $N$ , and let $S^{\prime}$ be an embedding of a phylogenetic $X$ -tree in $N$ such that $PS_{\mathrm{sw}}(f,N)=PS(f,S^{\prime})$ . By Corollary 6.2, we have

PS(f,T)=PS(f,S)\leq PS(f,S^{\prime})+k\cdot b(f,N,S^{\prime}),

where the first equality follows from Lemma 2.2. Now, as every blob $B$ in $N$ that is informative relative to $S^{\prime}$ contributes at least one to $PS(f,S^{\prime})$ , we have

b(f,N,S^{\prime})\leq PS(f,S^{\prime}).

By Lemma 2.2 and combining the last two inequalities, we obtain

\begin{aligned}
PS(f,T)=PS(f,S)&\leq PS(f,S^{\prime})+k\cdot PS(f,S^{\prime})\\
&=(k+1)\cdot PS(f,S^{\prime})\\
&=(k+1)\cdot PS_{\mathrm{sw}}(f,N).
\end{aligned}

(10)

Now, assume that $A=(f_{1},\ldots,f_{n})$ with $n\geq 1$ . Then, we apply Inequality (10) to each character, and obtain

\begin{aligned}
PS(A,T)&=\sum\limits_{i=1}^{n}PS(f_{i},T)\\
&\leq\sum\limits_{i=1}^{n}(k+1)\cdot PS_{\mathrm{sw}}(f_{i},N)\\
&=(k+1)\cdot\sum\limits_{i=1}^{n}PS_{\mathrm{sw}}(f_{i},N)\\
&=(k+1)\cdot PS_{\mathrm{sw}}(A,N).
\end{aligned}

∎

We close this section by presenting, for each $k\geq 0$ , a level- $k$ network and a binary character such that the upper bound stated in Theorem 2.4 is sharp. As level-0 networks on $X$ are phylogenetic $X$ -trees, the bound is sharp for any level-0 network and any binary character on $X$ . For $k\geq 1$ , let $N$ be the level- $k$ network on $X=\{x_{0}\}\cup\{x_{i},x^{\prime}_{i},x^{\prime\prime}_{i}:1\leq i\leq k\}$ that is depicted in Figure 2. Further, let $f\colon X\rightarrow\{0,1\}$ be the binary character with $f(x)=1$ if $x\in\{x_{0},x_{1},\ldots,x_{k}\}$ and $f(x)=0$ otherwise. Let us consider the two phylogenetic $X$ -trees $T,T^{\prime}\in D(N)$ that are illustrated on the right-hand side of Figure 2 together with extensions $F$ and $F^{\prime}$ of $f$ to $V(T)$ and $V(T^{\prime})$ , respectively. By using Fitch’s algorithm it is easy to verify that $F$ and $F^{\prime}$ are minimum. Hence, we have $PS(f,T^{\prime})=1$ and, thus, $PS_{\mathrm{sw}}(f,N)=1$ (since $f$ employs two character states, $PS_{\mathrm{sw}}(f,N)\geq 1$ ). Moreover, $PS(f,T)=k+1$ . In summary, we have $PS(f,T)=(k+1)\cdot PS_{\mathrm{sw}}(f,N)$ . As the construction shown in Figure 2 involves a single character, the two notions of softwired parsimony on $N$ coincide, and we have $PS(f,T)=(k+1)\cdot PS_{\mathrm{sw}^{\prime}}(f,N)$ for the same example.
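Since the verification above relies on Fitch's algorithm, a minimal Python sketch of its bottom-up pass may be useful. The representation is an assumption made for illustration: a rooted binary tree is written as nested pairs of leaf names, the root edge is omitted because it does not affect the score, and the two four-leaf trees and the character below are hypothetical and are not the trees of Figure 2.

```python
def fitch_score(tree, leaf_state):
    """Parsimony score of a character on a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    leaf_state: dict mapping each leaf name to its character state.
    """
    def visit(node):
        # Returns (candidate state set of the node, number of changes below it).
        if isinstance(node, str):
            return {leaf_state[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can agree on a state: no additional change is needed.
            return common, left_cost + right_cost
        # Otherwise one change is unavoidable on an edge below this vertex.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical binary character on four leaves.
f = {"a": 1, "b": 1, "c": 0, "d": 0}
tree_one = (("a", "b"), ("c", "d"))   # groups equal states together
tree_two = (("a", "c"), ("b", "d"))   # separates them

print(fitch_score(tree_one, f), fitch_score(tree_two, f))
```

In this toy example the two trees have scores 1 and 2, so the same character can be cheap on one tree and more expensive on another, which is the kind of gap that Theorem 2.4 bounds for trees displayed by a common network.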

7 Parental parsimony for phylogenetic networks

We now briefly consider the notion of parental parsimony introduced by van Iersel et al. [24] as an alternative to softwired and hardwired parsimony. Intuitively, instead of defining the parsimony score of a phylogenetic network $N$ by considering its display set $D(N)$ , parental parsimony considers the set of parental trees (sometimes also called weakly displayed trees [12]), which is a superset of $D(N)$ .

A multilabelled tree on $X$ is a leaf-labelled rooted tree in which the root has out-degree one, every other interior vertex has in-degree one and out-degree one or two, and, for each element $x$ in $X$ , at least one leaf is labelled $x$ . Now using the same notation as [24], let $U^{\ast}(N)$ be the multilabelled tree obtained from a phylogenetic network $N$ on $X$ as follows: The vertices of $U^{\ast}(N)$ are the directed paths in $N$ starting at the root of $N$ , and for each pair of directed paths $p,p^{\prime}$ , there is an edge $(p,p^{\prime})$ in $U^{\ast}(N)$ if and only if $p^{\prime}$ is an extension of $p$ by one additional edge of $N$ . Furthermore, each vertex in $U^{\ast}(N)$ corresponding to a path in $N$ starting at the root of $N$ and ending at $x\in X$ is labelled by $x$ . For an example of the multilabelled tree $U^{\ast}(N)$ obtained from a phylogenetic network $N$ , see Figure 3. Now, a phylogenetic $X$ -tree is called a parental tree of $N$ if it can be obtained from a subgraph of $U^{\ast}(N)$ by suppressing vertices of in-degree one and out-degree one. To denote the set of all parental trees of $N$ , we use $P(N)$ . Informally speaking, a tree is a parental tree of a phylogenetic network if it can be drawn inside the network in such a way that the tree vertices of the tree correspond to tree vertices of the network. Importantly, though, a parental tree is not necessarily a displayed tree (see Figure 3), whereas every displayed tree is also parental.
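The construction of $U^{\ast}(N)$ can be sketched in a few lines of Python. The sketch rests on assumptions that are not from the paper: the function name `unfold` is hypothetical, the network is given as a list of directed edges without parallel edges (so that a directed path is determined by its vertex sequence), and the small network below is a made-up example rather than the one in Figure 3.

```python
def unfold(edges, root):
    """Return the vertices and edges of the unfolded tree of a rooted network.

    Each vertex of the result is a directed path of the network that starts
    at the root, stored as a tuple of network vertices.
    """
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    paths, tree_edges, stack = [(root,)], [], [(root,)]
    while stack:
        path = stack.pop()
        for child in children.get(path[-1], []):
            longer = path + (child,)          # extend the path by one network edge
            paths.append(longer)
            tree_edges.append((path, longer))
            stack.append(longer)
    return paths, tree_edges


# Hypothetical network with a single reticulation h whose child is the leaf x2.
network = [
    ("rho", "a"), ("a", "b"), ("a", "c"),
    ("b", "h"), ("c", "h"),
    ("b", "x1"), ("c", "x3"), ("h", "x2"),
]
paths, tree_edges = unfold(network, "rho")
copies_of_x2 = [p for p in paths if p[-1] == "x2"]
print(len(paths), len(tree_edges), len(copies_of_x2))
```

In this example the leaf below the reticulation is reached by two distinct root paths, so the unfolded tree contains two leaves labelled $x_{2}$ , which illustrates why the result is multilabelled.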

Given a character $f$ on $X$ and a phylogenetic network $N$ on $X$ , the parental parsimony score of $f$ on $N$ is now defined as

PS_{\mathrm{pa}}(f,N)=\min\limits_{T\in P(N)}PS(f,T),

where the minimum is taken over all parental trees for $N$ .

It was shown by van Iersel et al. [24, Theorem 2] that computing the parental parsimony score is NP-hard even if $f$ is a binary character and $N$ is a restricted type of a so-called tree-child network [2]. It is thus a natural question if our main result for softwired parsimony (Theorem 2.4) generalises to parental parsimony. Unfortunately, this is not the case. Suppose that $n$ is an even integer and that $N$ is the level- $1$ network on $n+1$ leaves depicted in Figure 4. Then, both phylogenetic trees $T$ and $T^{\prime}$ as depicted in the same figure are parental trees of $N$ and $T^{\prime}\notin D(N)$ . Additionally, suppose that $f$ is the binary character that assigns state $0$ to leaves $x_{2},x_{4},x_{6},\ldots,x_{n}$ , and state $1$ to leaves $x_{1},x_{3},\ldots,x_{n+1}$ . Crucially, we have $PS(f,T)=n/2$ and $PS(f,T^{\prime})=1$ . In particular, $PS_{\mathrm{pa}}(f,N)=1$ , and thus,

PS(f,T)=n/2\not\leq 2\cdot PS_{\mathrm{pa}}(f,N)=2

for each $n>4$ (a similar argument applies to $n$ being odd), which shows that Theorem 2.4 does not generalise to parental parsimony even if the phylogenetic network is level- $1$ with a single blob and the alignment consists of a single binary character.

8 Softwired parsimony for semi-directed and unrooted networks

In this section, we show that binary semi-directed networks have an analogous bound on the softwired parsimony score as the one we established for rooted binary phylogenetic networks. We briefly turn our attention to unrooted binary phylogenetic networks at the end of the section and show that the approach we take to establish the bound for semi-directed networks does not work in this setting. However, this does not exclude the possibility that similar bounds can be obtained for unrooted phylogenetic networks by other means.

A binary semi-directed network $N_{s}$ on $X$ is a leaf-labelled mixed multigraph without any loops that can be obtained from a rooted binary phylogenetic network $N_{r}$ by deleting its root, suppressing the child of the root, and omitting the direction of each edge that is not a reticulation edge. We call $N_{r}$ a rooted partner of $N_{s}$ . By construction, $N_{s}$ has at most one pair of parallel edges, and it has such a pair precisely when $N_{r}$ has an underlying 3-cycle that contains the child of the root. Note that $N_{s}$ may have multiple rooted partners. A vertex $v$ in $N_{s}$ is called a reticulation if $N_{s}$ contains two edges, referred to as reticulation edges, that are directed into $v$ . A semi-directed level- $k$ network is a semi-directed network $N_{s}$ such that a rooted partner of $N_{s}$ is a rooted binary level- $k$ network. Lastly, an unrooted binary phylogenetic $X$ -tree $T_{u}$ is a connected undirected acyclic graph whose leaf set is $X$ and whose inner vertices all have degree three. Note that an unrooted binary phylogenetic tree is a binary semi-directed network without reticulations. In the following, all rooted (resp. semi-directed) phylogenetic networks and rooted (resp. unrooted) phylogenetic trees are assumed to be binary.
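The passage from a rooted network to its semi-directed counterpart can be sketched as follows. This Python fragment is an illustration under assumptions: the function name `semi_direct` is hypothetical, the rooted binary network is given as a list of directed edges, undirected edges are represented as `frozenset` pairs, and the example network is made up.

```python
from collections import Counter


def semi_direct(edges, root):
    """Delete the root, suppress its child, and keep directions only on
    reticulation edges. Returns (undirected_edges, directed_edges)."""
    (v,) = [c for p, c in edges if p == root]      # the unique child of the root
    w1, w2 = [c for p, c in edges if p == v]       # the two children of that child
    indeg = Counter(c for _, c in edges)
    directed, undirected = [], []
    for p, c in edges:
        if p in (root, v):
            continue                               # edges removed by the suppression
        if indeg[c] == 2:
            directed.append((p, c))                # reticulation edge: keep direction
        else:
            undirected.append(frozenset((p, c)))
    # The edge created by suppressing v is directed only if it enters a reticulation.
    if indeg[w2] == 2:
        directed.append((w1, w2))
    elif indeg[w1] == 2:
        directed.append((w2, w1))
    else:
        undirected.append(frozenset((w1, w2)))
    return undirected, directed


# Hypothetical rooted network with one reticulation h.
rooted = [
    ("rho", "v"), ("v", "a"), ("v", "b"),
    ("a", "h"), ("b", "h"),
    ("a", "x1"), ("b", "x3"), ("h", "x2"),
]
undirected, directed = semi_direct(rooted, "rho")
print(sorted(directed), len(undirected))
```

Because the result is kept as two lists, a pair of parallel edges, which arises when the child of the root lies on an underlying 3-cycle, would simply appear as a repeated entry.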

Now, let $T_{u}$ be an unrooted binary phylogenetic $X$ -tree, and let $N_{s}$ be a semi-directed network on $X$ . We say that $T_{u}$ is displayed by $N_{s}$ if there exists a subgraph $S$ of $N_{s}$ such that $S$ is a subdivision of $T_{u}$ (omitting the directions of the reticulation edges) and $S$ contains, for each reticulation $v$ in $N_{s}$ , at most one reticulation edge incident with $v$ . Similar to rooted phylogenetic networks, if $T_{u}$ is displayed by $N_{s}$ , we call $S$ an embedding of $T_{u}$ in $N_{s}$ . Furthermore, we refer to the set of all unrooted phylogenetic $X$ -trees that are displayed by $N_{s}$ as the unrooted display set of $N_{s}$ and denote it by $D(N_{s})$ .

Let $A=(f_{1},f_{2},\ldots,f_{n})$ be an alignment of characters on $X$ , and let $N_{s}$ be a semi-directed network on $X$ . We define the softwired parsimony score of $A$ on $N_{s}$ as

\displaystyle PS_{\mathrm{sw}}(A,N_{s})=\sum_{i=1}^{n}\min_{T_{u}\in D(N_{s})}PS(f_{i},T_{u}),

(11)

where the parsimony score of an unrooted phylogenetic $X$ -tree $T_{u}$ is defined as in the rooted case since the corresponding concepts of the changing number and (minimum) extensions naturally translate to undirected graphs.
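Equation (11) can be evaluated by brute force on very small inputs. The following Python sketch is meant only as an illustration: the function names, the edge-list representation of unrooted trees, and the two quartet trees standing in for an unrooted display set are all hypothetical, and the exhaustive search over extensions is exponential in the number of inner vertices.

```python
from itertools import product


def parsimony_score(edges, leaf_state, states):
    """Minimum changing number over all extensions of a character to the
    inner vertices of an undirected tree given as a list of edges."""
    vertices = {w for edge in edges for w in edge}
    inner = sorted(vertices - set(leaf_state))
    best = None
    for assignment in product(states, repeat=len(inner)):
        ext = dict(leaf_state)
        ext.update(zip(inner, assignment))
        changes = sum(1 for u, v in edges if ext[u] != ext[v])
        best = changes if best is None else min(best, changes)
    return best


def softwired_score(display_set, alignment, states):
    """Sum, over the characters, of the minimum score over the displayed trees."""
    return sum(
        min(parsimony_score(tree, f, states) for tree in display_set)
        for f in alignment
    )


# Two hypothetical unrooted quartet trees with inner vertices u and w.
tree_ab_cd = [("a", "u"), ("b", "u"), ("u", "w"), ("c", "w"), ("d", "w")]
tree_ac_bd = [("a", "u"), ("c", "u"), ("u", "w"), ("b", "w"), ("d", "w")]
display_set = [tree_ab_cd, tree_ac_bd]

f1 = {"a": 1, "b": 1, "c": 0, "d": 0}
f2 = {"a": 1, "b": 0, "c": 1, "d": 0}
alignment = [f1, f2]

sw = softwired_score(display_set, alignment, states=(0, 1))
on_first_tree = sum(parsimony_score(tree_ab_cd, f, (0, 1)) for f in alignment)
print(sw, on_first_tree)
```

Note that the minimum is taken separately for each character, so different characters may be optimised on different displayed trees; in this toy example each character attains score 1 on a different tree.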

The next lemma shows how the set of unrooted phylogenetic trees that are displayed by a semi-directed network $N_{s}$ is related to the set of rooted phylogenetic trees that are displayed by a rooted partner of $N_{s}$ .

Lemma 8.1.

Let $N_{s}$ be a semi-directed network on $X$ , and let $N_{r}$ be a rooted partner of $N_{s}$ . Then, the following two statements hold.

(i) If $T_{u}$ is an unrooted phylogenetic $X$ -tree that is displayed by $N_{s}$ , then there exists a rooted phylogenetic $X$ -tree $T\in D(N_{r})$ such that $T_{u}$ can be obtained from $T$ by deleting the root and suppressing its child.

(ii) If $T$ is a rooted phylogenetic $X$ -tree that is displayed by $N_{r}$ , then there exists an unrooted phylogenetic $X$ -tree $T_{u}\in D(N_{s})$ such that $T_{u}$ can be obtained from $T$ by deleting the root and suppressing its child.

Proof.

Throughout the proof, we assume that $N_{s}$ does not contain a pair of parallel edges. Indeed, if $N_{s}$ contains such a pair of edges, $e_{1}$ and $e_{2}$ say, we can obtain a semi-directed network $N_{s}^{\prime}$ on $X$ from $N_{s}$ by deleting $e_{1}$, suppressing the two resulting degree-two vertices, and omitting the direction of $e_{2}$. Clearly, $D(N_{s})=D(N_{s}^{\prime})$.

Now, let $v$ be the child of the root $\rho$ in $N_{r}$, and let $w$ and $w^{\prime}$ be the children of $v$. If either $w$ or $w^{\prime}$ is a reticulation, we assume without loss of generality that $w^{\prime}$ is a reticulation. Observe that at most one of $w$ and $w^{\prime}$ is a reticulation. Each edge in $N_{r}$ that is not incident with $v$ corresponds to exactly one edge in $N_{s}$, and each edge in $N_{s}$ except $e_{\rho}=\{w,w^{\prime}\}$ (resp. $e_{\rho}=(w,w^{\prime})$ if $w^{\prime}$ is a reticulation in $N_{r}$) corresponds to exactly one edge in $N_{r}$. In particular, each reticulation edge in $N_{r}$ corresponds to exactly one such edge in $N_{s}$ and vice versa.

First, let $T_{u}$ be an unrooted phylogenetic $X$-tree that is displayed by $N_{s}$. By definition, there exists an embedding $S_{u}$ of $T_{u}$ in $N_{s}$. If $e_{\rho}$ is an edge in $S_{u}$, we obtain an embedding $S$ of a rooted phylogenetic $X$-tree $T$ in $N_{r}$ from $S_{u}$ as follows. We replace $e_{\rho}$ with the three directed edges $(\rho,v)$, $(v,w)$ and $(v,w^{\prime})$ and each edge $e\neq e_{\rho}$ with its directed counterpart in $N_{r}$. By definition, $T$ is displayed by $N_{r}$, that is, $T\in D(N_{r})$. Further, by construction, $T_{u}$ can be obtained from $T$ by deleting the root and suppressing its child.

Now, let us consider the case that $e_{\rho}$ is not an edge in $S_{u}$. Let $S^{\prime}$ be the connected acyclic subgraph of $N_{r}$ that is obtained from $S_{u}$ by replacing each edge in $S_{u}$ with its directed counterpart in $N_{r}$.
We next show that there is a unique vertex $v_{s}$ in $S^{\prime}$ with in-degree zero and out-degree two. Since $S^{\prime}$ is acyclic, it follows that $S^{\prime}$ contains a vertex with in-degree zero. Clearly, the out-degree of this vertex cannot be one or three and, thus, $v_{s}$ exists. Assume towards a contradiction that $v_{s}^{\prime}\neq v_{s}$ is another vertex with this property. As $S^{\prime}$ is connected and does not contain a vertex with in-degree two, one of $v_{s}^{\prime}$ and $v_{s}$ is a descendant of the other in $S^{\prime}$ . But then one of these vertices has in-degree one, a contradiction. It now follows that each vertex in $S^{\prime}$ is a descendant of $v_{s}$ . Let $\pi=(\rho=v_{1},v_{2},\ldots,v_{t}=v_{s})$ be a directed path in $N_{r}$ . Such a path $\pi$ exists as there is a directed path from the root to any vertex in $N_{r}$ . Since each vertex in $S^{\prime}$ is a descendant of $v_{s}$ , it follows that $v_{i}$ with $i<t$ is not in $S^{\prime}$ . Then, we obtain an embedding $S$ of a rooted phylogenetic $X$ -tree $T$ in $N_{r}$ from $S^{\prime}$ by adding the $t-1$ directed edges $(v_{i},v_{i+1})$ , $1\leq i<t$ , to $S^{\prime}$ . Again, we have $T\in D(N_{r})$ and, by construction, $T_{u}$ can be obtained from $T$ by deleting the root and suppressing its child. Hence, (i) holds.

Second, let $T$ be a rooted phylogenetic $X$ -tree that is displayed by $N_{r}$ . Let $S$ be an embedding of $T$ in $N_{r}$ , and let $\pi=(\rho=v_{1},v_{2},\ldots,v_{t})$ be the root path of $S$ . Then, we obtain an embedding $S_{u}$ of an unrooted phylogenetic $X$ -tree $T_{u}$ in $N_{s}$ from $S$ by deleting each vertex that lies on $\pi$ except for $v_{t}$ , turning directed edges into undirected ones and, if $t=2$ , suppressing $v_{t}$ . If $t=2$ , then $\pi$ consists of the single edge $(\rho,v)$ , and $S$ contains $(v,w)$ and $(v,w^{\prime})$ . Hence, $S_{u}$ contains the edge $\{w,w^{\prime}\}$ as a result of suppressing $v$ . Otherwise, if $t\geq 3$ , then $S$ contains $(\rho,v)$ and exactly one of $(v,w)$ and $(v,w^{\prime})$ , and $v_{t}$ is a vertex in $N_{s}$ . Furthermore, if $t\geq 4$ , each edge $(v_{i},v_{i+1})$ with $3\leq i<t$ that is traversed by $\pi$ corresponds to a unique edge consisting of the same vertices in $N_{s}$ . It now follows that $S_{u}$ is indeed an embedding of $T_{u}$ in $N_{s}$ , that is, $T_{u}\in D(N_{s})$ . By construction, $T_{u}$ can be obtained from $T$ by deleting the root and suppressing its child. Hence, (ii) holds as well. ∎

We next obtain the following corollary from Lemma 8.1.

Corollary 8.2.

Let $N_{s}$ be a semi-directed network on $X$ , let $N_{r}$ be a rooted partner of $N_{s}$ , and let $A$ be an alignment of characters on $X$ . Then $PS_{\mathrm{sw}}(A,N_{s})=PS_{\mathrm{sw}}(A,N_{r})$ .

Proof.

The corollary follows from Lemma 8.1 and the fact that, if $T$ is a rooted phylogenetic $X$ -tree and $T_{u}$ is an unrooted phylogenetic $X$ -tree such that $T_{u}$ can be obtained from $T$ by deleting the root and suppressing its child, then $PS(A,T)=PS(A,T_{u})$ . ∎
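The fact used in this proof can be checked directly for a single character $f$; summing over the characters of $A$ then gives the statement for alignments. The following sketch assumes the notation $\mathrm{ch}(F,T)$ for the changing number of an extension $F$ on $T$ (this symbol is an assumption and may differ from the one used earlier in the paper). It writes $\rho$ for the root of $T$, $v$ for its child, and $w$ and $w^{\prime}$ for the two children of $v$, so that $\{w,w^{\prime}\}$ is the edge of $T_{u}$ created by suppressing $v$.

```latex
% From T_u to T: take a minimum extension F_u of f on T_u and set
% F(\rho) = F(v) = F_u(w), F = F_u elsewhere. The edges (\rho,v) and (v,w)
% carry no change, and (v,w') carries a change iff \{w,w'\} does.
\mathrm{ch}(F,T) = \mathrm{ch}(F_u,T_u)
  \quad\Longrightarrow\quad PS(f,T) \le PS(f,T_u).

% From T to T_u: take a minimum extension F of f on T and restrict it to
% the vertices of T_u. Only the edge \{w,w'\} is new, and
[F(w)\neq F(w')] \;\le\; [F(v)\neq F(w)] + [F(v)\neq F(w')]
  \quad\Longrightarrow\quad PS(f,T_u) \le PS(f,T).
```

Together, the two inequalities give $PS(f,T)=PS(f,T_{u})$.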

We are now in a position to state the main result of the section.

Theorem 8.3.

Let $N_{s}$ be a semi-directed network on $X$ , and let $T_{u}$ be an unrooted phylogenetic $X$ -tree that is displayed by $N_{s}$ . Furthermore, let $A$ be an alignment of characters on $X$ . Then

PS(A,T_{u})\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N_{s}),

where $k$ is the level of $N_{s}$ .

Proof.

Let $N_{r}$ be a rooted partner of $N_{s}$ , and let $T\in D(N_{r})$ be a rooted phylogenetic $X$ -tree such that deleting the root in $T$ and suppressing its child yields $T_{u}$ . The rooted phylogenetic $X$ -tree $T$ exists by Lemma 8.1 (i). Since an unrooted phylogenetic tree is a semi-directed network without reticulations, we have (i) $PS(A,T_{u})=PS(A,T)$ by Corollary 8.2. Furthermore, by Theorem 2.4, we have (ii) $PS(A,T)\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N_{r})$ . Finally, by Corollary 8.2, we have (iii) $PS_{\mathrm{sw}}(A,N_{r})=PS_{\mathrm{sw}}(A,N_{s})$ . Combining (i)–(iii) yields $PS(A,T_{u})\leq(k+1)\cdot PS_{\mathrm{sw}}(A,N_{s})$ . ∎

We now shift our focus from semi-directed networks to unrooted phylogenetic networks. An unrooted binary phylogenetic network $U$ on $X$ is a connected undirected graph without any loops or edges in parallel, whose leaf set is $X$ , and whose inner vertices all have degree three. As before, we omit the term binary in the following as all unrooted phylogenetic networks considered in this section are binary.

Let $U$ be an unrooted phylogenetic network on $X$ . We say that an unrooted phylogenetic $X$ -tree $T_{u}$ is displayed by $U$ if there exists a subgraph of $U$ that is a subdivision of $T_{u}$ . Furthermore, we refer to the set of all unrooted phylogenetic $X$ -trees that are displayed by $U$ as the display set of $U$ and denote it by $D(U)$ . We call $U$ an unrooted level- $k$ network if at most $k$ edges have to be deleted in each biconnected component of $U$ such that the resulting graph is acyclic. Lastly, if $U$ can be obtained from a rooted phylogenetic network $N$ by deleting its root, suppressing the child of the root, and omitting all edge directions, we say that $N$ is an orientation of $U$ .
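The level of an unrooted network can be computed from its biconnected components: the minimum number of edges whose deletion makes a connected graph with edge set $E$ and vertex set $V$ acyclic is $|E|-|V|+1$. The following is a minimal sketch under that observation, using a standard depth-first search with an edge stack; the function names `biconnected_components` and `level` are hypothetical, and the input is assumed to be given as adjacency lists without loops or parallel edges.

```python
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, List[Hashable]]  # undirected adjacency lists
Edge = Tuple[Hashable, Hashable]


def biconnected_components(g: Graph) -> List[Set[Edge]]:
    """Edge sets of the biconnected components (DFS with an edge stack)."""
    depth: Dict[Hashable, int] = {}
    low: Dict[Hashable, int] = {}
    stack: List[Edge] = []
    blocks: List[Set[Edge]] = []

    def dfs(v, parent, d):
        depth[v] = low[v] = d
        for w in g[v]:
            if w == parent:
                continue
            if w not in depth:                 # tree edge
                stack.append((v, w))
                dfs(w, v, d + 1)
                low[v] = min(low[v], low[w])
                if low[w] >= depth[v]:         # v cuts off w's subtree: pop one block
                    block: Set[Edge] = set()
                    while True:
                        e = stack.pop()
                        block.add(e)
                        if e == (v, w):
                            break
                    blocks.append(block)
            elif depth[w] < depth[v]:          # back edge to an ancestor
                stack.append((v, w))
                low[v] = min(low[v], depth[w])

    for v in g:
        if v not in depth:
            dfs(v, None, 0)
    return blocks


def level(g: Graph) -> int:
    """Largest value of |E| - |V| + 1 over all biconnected components."""
    best = 0
    for block in biconnected_components(g):
        vertices = {x for e in block for x in e}
        best = max(best, len(block) - len(vertices) + 1)
    return best


# Hypothetical example: a single 5-cycle with one leaf attached to each cycle vertex.
cycle = ["va", "vb", "vc", "vd", "ve"]
U: Graph = {v: [] for v in cycle}
for i, v in enumerate(cycle):
    w = cycle[(i + 1) % 5]
    U[v].append(w)
    U[w].append(v)
    leaf = v[1]
    U[v].append(leaf)
    U[leaf] = [v]

print(level(U))  # 1
```

The recursive search is adequate for small examples; for large networks an iterative search would avoid Python's recursion limit.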

Let $A=(f_{1},f_{2},\ldots,f_{n})$ be an alignment of characters on $X$ , and let $U$ be an unrooted phylogenetic network on $X$ . We define the softwired parsimony score of $A$ on $U$ as

$$PS_{\mathrm{sw}}(A,U)=\sum_{i=1}^{n}\min_{T\in D(U)}PS(f_{i},T).\tag{12}$$

Next, we present an unrooted level-1 network $U$ and a binary character $f$ such that $PS_{\mathrm{sw}}(f,U)\neq PS_{\mathrm{sw}}(f,N)$ for an orientation $N$ of $U$ , that is, we show that Corollary 8.2 does not translate from semi-directed networks to unrooted phylogenetic networks. Moreover, we give two different orientations $N$ and $N^{\prime}$ of $U$ with $PS_{\mathrm{sw}}(f,N)\neq PS_{\mathrm{sw}}(f,N^{\prime})$ .

To this end, let $U$ be the unrooted level-1 network on $X=\{a,b,c,d,e\}$ , and let $N$ and $N^{\prime}$ be the two orientations of $U$ as shown in Figure 5. The display set $D(U)$ of size five is shown in the middle of the same figure. By deleting the root and suppressing its child in each element of $D(N)$ (resp. $D(N^{\prime})$ ), we obtain a subset of $D(U)$ as indicated by the two dashed rectangles that each enclose $N$ (resp. $N^{\prime}$ ) and two elements of $D(U)$ . Now for the single binary character $f\colon X\rightarrow\{0,1\}$ with

f(a)=f(b)=f(c)=0\text{ and }f(d)=f(e)=1,

we have $PS_{\mathrm{sw}}(f,U)=1$ , $PS_{\mathrm{sw}}(f,N)=2$ and $PS_{\mathrm{sw}}(f,N^{\prime})=1$ . Since the softwired parsimony score of an unrooted phylogenetic network $U$ is not necessarily the same as the score of an orientation of $U$ , we cannot represent an unrooted phylogenetic network by an arbitrary orientation in the way we used a rooted partner of a semi-directed network to obtain Theorem 8.3. While our example shows that using the same approach as in the semi-directed setting is not viable, it does not exclude the existence of similar bounds for unrooted phylogenetic networks or classes thereof (such as unrooted level-1 networks, for example).
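The three scores can be reproduced with a short computation. Figure 5 is not reproduced here, so the sketch below works with an assumed topology that is consistent with the description in the text but may differ from the figure: $U$ is taken to be a single cycle of length five whose vertices carry the leaves $a,b,c,d,e$ in this cyclic order. Under this assumption, $D(U)$ consists of the five trees obtained by deleting one cycle edge, and an orientation whose reticulation is the cycle vertex adjacent to a leaf $x$ displays exactly the two trees obtained by deleting a cycle edge at that vertex. The helper names and the choice of the two orientations are hypothetical.

```python
from typing import Dict, List

LEAVES = ["a", "b", "c", "d", "e"]             # ASSUMED cyclic order of the leaves
f = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}   # the character f from the text


def delete_cycle_edge(i: int) -> Dict[str, List[str]]:
    """Tree obtained from the assumed network by deleting the cycle edge between
    the neighbours of LEAVES[i] and LEAVES[i + 1]. Degree-two vertices are kept,
    which does not affect the parsimony score."""
    adj: Dict[str, List[str]] = {}
    for x in LEAVES:
        adj[x] = ["v" + x]
        adj["v" + x] = [x]
    for j in range(5):
        if j != i:
            p, q = "v" + LEAVES[j], "v" + LEAVES[(j + 1) % 5]
            adj[p].append(q)
            adj[q].append(p)
    return adj


def parsimony_score(tree: Dict[str, List[str]], f: Dict[str, int]) -> int:
    """Sankoff-style dynamic program with unit costs, rooted at vertex 'va'."""
    states = sorted(set(f.values()))

    def cost(v, parent):
        table = {s: (0 if v not in f or f[v] == s else float("inf")) for s in states}
        for c in tree[v]:
            if c != parent:
                sub = cost(c, v)
                for s in states:
                    table[s] += min(sub[t] + (s != t) for t in states)
        return table

    return int(min(cost("va", None).values()))


scores = [parsimony_score(delete_cycle_edge(i), f) for i in range(5)]
ps_U = min(scores)                         # all five trees of D(U)
ps_N = min(scores[0], scores[1])           # assumed orientation, reticulation next to b
ps_N_prime = min(scores[4], scores[0])     # assumed orientation, reticulation next to a
print(scores, ps_U, ps_N, ps_N_prime)      # [2, 2, 1, 2, 1] 1 2 1
```

Under these assumptions, the only displayed trees with score 1 are those that contain the split $\{a,b,c\}\,|\,\{d,e\}$, and an orientation attains the score of $U$ only if it displays one of them.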

9 Concluding remarks

In this paper, we have obtained a bound on the softwired parsimony score of a gap-free alignment of multi-state characters on rooted as well as semi-directed phylogenetic level-$k$ networks. To be precise, we have shown that the maximum difference between the softwired parsimony score of a phylogenetic network $N$ and the parsimony score of any tree displayed by $N$ is bounded by $k+1$ times the softwired parsimony score of $N$. Unfortunately, our approximation result as stated in Theorem 2.4 cannot be generalised to alignments with gaps, since it was already shown in [15, Corollary 2] that computing the softwired parsimony score of a level-$1$ network for an alignment of binary characters that additionally allows gaps is APX-hard.

Extending the notion of softwired parsimony to semi-directed networks and exploiting a connection between the display sets of semi-directed networks and their rooted partners, we have shown that an analogous bound holds for semi-directed networks. For unrooted networks, on the other hand, the approach via rooted partners (more formally, via orientations) does not seem to be viable. Nevertheless, an interesting question for future research is whether an analogous or similar bound for the softwired parsimony score can be obtained in some other way for unrooted phylogenetic networks.

Another interesting direction for future research would be to analyse whether our results also apply in the case of non-binary phylogenetic networks, i.e., phylogenetic networks that may have vertices of degree strictly greater than three. While there exists a polynomial-time algorithm to compute the parsimony score of a given non-binary phylogenetic tree with character states assigned to its leaves, namely the Fitch-Hartigan algorithm [7, 10], other concepts that our results rely on, such as the rSPR distance and its relation to the switching distance, have been studied less for non-binary phylogenetic trees (see [5, Section 2] for a related discussion).

Acknowledgements. The first and second authors thank the New Zealand Marsden Fund for financial support. All authors thank Steven Kelk for helpful discussions.

References

[1] E. S. Allman, H. Baños, and J. A. Rhodes. NANUQ: A method for inferring species networks from gene trees under the coalescent model. Algorithms for Molecular Biology, 14:1–25, 2019.
[2] G. Cardona, F. Rossello, and G. Valiente. Comparison of tree-child phylogenetic networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 6(4):552–569, 2009.
[3] J. Döcker, S. Linz, and C. Semple. Hypercubes and Hamilton cycles of display sets of rooted phylogenetic networks. Advances in Applied Mathematics, 152:102595, 2024.
[4] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, US, 2004.
[5] M. Fischer and S. Kelk. On the maximum parsimony distance between phylogenetic trees. Annals of Combinatorics, 20:87–113, 2016.
[6] M. Fischer, L. van Iersel, S. Kelk, and C. Scornavacca. On computing the maximum parsimony score of a phylogenetic network. SIAM Journal on Discrete Mathematics, 29(1):559–585, 2015.
[7] W. M. Fitch. Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4):406–416, 1971.
[8] M. Frohn and S. Kelk. A $2$-approximation algorithm for the softwired parsimony problem on binary, tree-child phylogenetic networks. Submitted, 2023.
[9] E. Gross and C. Long. Distinguishing phylogenetic networks. SIAM Journal on Applied Algebra and Geometry, 2(1):72–93, 2018.
[10] J. A. Hartigan. Minimum mutation fits to a given tree. Biometrics, 29(1):53, 1973.
[11] B. Hollering and S. Sullivant. Identifiability in phylogenetics using algebraic matroids. Journal of Symbolic Computation, 104:142–158, 2021.
[12] K. T. Huber, V. Moulton, M. Steel, and T. Wu. Folding and unfolding phylogenetic trees and networks. Journal of Mathematical Biology, 73(6–7):1761–1780, 2016.
[13] L. Kannan and W. C. Wheeler. Maximum parsimony on phylogenetic networks. Algorithms for Molecular Biology, 7:1–10, 2012.
[14] S. Kelk and M. Fischer. On the complexity of computing MP distance between binary phylogenetic trees. Annals of Combinatorics, 21:573–604, 2017.
[15] S. Kelk, F. Pardi, C. Scornavacca, and L. van Iersel. Finding a most parsimonious or likely tree in a network with respect to an alignment. Journal of Mathematical Biology, 78:527–547, 2019.
[16] L. Nakhleh, G. Jin, F. Zhao, and J. Mellor-Crummey. Reconstructing phylogenetic networks using maximum parsimony. In 2005 IEEE Computational Systems Bioinformatics Conference (CSB’05), pages 93–102. IEEE, 2005.
[17] R. S. Sansom, P. G. Choate, J. N. Keating, and E. Randle. Parsimony, not Bayesian analysis, recovers more stratigraphically congruent phylogenetic trees. Biology Letters, 14(6):20180263, 2018.
[18] C. G. Schrago, B. O. Aguiar, and B. Mello. Comparative evaluation of maximum parsimony and Bayesian phylogenetic reconstruction using empirical morphological data. Journal of Evolutionary Biology, 31(10):1477–1484, 2018.
[19] C. Semple. Hybridization networks. In O. Gascuel and M. Steel, editors, Reconstructing Evolution: New Mathematical and Computational Advances, pages 277–314. Oxford University Press, UK, 2007.
[20] M. R. Smith. Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets. Biology Letters, 15(2):20180632, 2019.
[21] C. Solís-Lemus and C. Ané. Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLoS Genetics, 12(3):e1005896, 2016.
[22] C. Solís-Lemus, P. Bastide, and C. Ané. PhyloNetworks: a package for phylogenetic networks. Molecular Biology and Evolution, 34(12):3292–3298, 2017.
[23] A. Stamatakis, T. Ludwig, and H. Meier. RAxML-III: A fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics, 21(4):456–463, 2005.
[24] L. van Iersel, M. Jones, and C. Scornavacca. Improved maximum parsimony models for phylogenetic networks. Systematic Biology, 67(3):518–542, 2018.
[25] C. Zhang, J. P. Huelsenbeck, and F. Ronquist. Using parsimony-guided tree proposals to accelerate convergence in Bayesian phylogenetic inference. Systematic Biology, 69(5):1016–1032, 2020.