This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Ordinally Consensus Subset over Multiple Metrics

Dingkang Wang, Yusu Wang
Abstract

In this paper, we propose to study the following maximum ordinal consensus problem: Suppose we are given a metric system (,X)(\mathcal{M},X), which contains kk metrics ={ρ1,,ρk}\mathcal{M}=\{\rho_{1},\ldots,\rho_{k}\} defined on the same point set XX. We aim to find a maximum subset XXX^{\prime}\subset X such that all metrics in \mathcal{M} are “consistent” when restricted on the subset XX^{\prime}. In particular, our definition of consistency will rely only on the ordering between pairwise distances, and thus we call a “consistent” subset an ordinal consensus of XX w.r.t. \mathcal{M}. We will introduce two concepts of “consistency” in the ordinal sense: a strong one and a weak one. Specifically, a subset XX^{\prime} is strongly consistent means that the ordering of their pairwise distances is the same under each of the input metric ρi\rho_{i}\in\mathcal{M}. The weak consistency, on the other hand, relaxes this exact ordering condition, and intuitively allows us to take the plurality of ordering relation between two pairwise distances.

We show in this paper that the maximum consensus problems over both the strong and the weak consistency notions are NP-complete, even when there are only 2 or 3 simple metrics, such as line metrics and ultrametrics. We also develop constant-factor approximation algorithms for the dual version, the minimum inconsistent subset problem of a metric system (,P)(\mathcal{M},P), – note that optimizing these two dual problems are equivalent.

1 Introduction

In recent years, there have been many studies on data sets with multiple views, which can contain different sets of features from multiple sources carrying different types of information. For example, consider neuron cells in the field of neuroscience [15]. A single neuron cell could have both morphology features and RNA-sequencing information available. Simply concatenating these two types of feature sets and applying a classical single-view method may not produce meaningful results – The types of features may be different, and it is not clear how to properly weigh them when combined. Instead, there have been many approaches developed to handle multi-view data. For example, Lashkari and Golland [14], Bickel and Scheffer [4] considered using EM algorithm and (convex) mixture model on multi-view clustering; Kumar, Rai and Hal [13] and Cai et al. [5] extended spectral clustering algorithm for multi-view data. See also surveys on multi-view clustering [6], and more broadly, on multi-view learning [16].

Very often in applications, multiple views give rise to multiple metrics {ρ1,,ρk}\{{\rho}_{1},\ldots\\ ,{\rho}_{k}\} over the same data set XX. Our goal is to study whether these metrics are “consistent” over XX, and identify a largest subset XX^{\prime} of XX, called consensus, over which these input metrics are “consistent”. However, when comparing these metrics, note that the precise distance values between points in XX induced by different ρi{\rho}_{i}s may not have the same meaning, two metrics may not have simple, say, linear relation between them, and thus the distance values are not readily comparable (even after normalization). For example, the distance between two neuron cells based on their tree morphology can be very different from that based on their gene expression profiles. Hence in this paper, we will compute consensus under multiple metrics based on ordinal information, namely the order of pairwise distances under each metric.

More specifically, given a metric system (,X)(\mathcal{M},X), consisting of a set of kk input metrics ={ρ1,,ρk}\mathcal{M}=\{{\rho}_{1},...,{\rho}_{k}\} on a discrete data set XX with cardinality nn, we propose to study the problem of finding maximum ordinal consensus of XX w.r.t. \mathcal{M}. Specifically, we aim to find a maximum subset XXX^{\prime}\subset X, such that all metrics will have consistent pairwise distances if restricted on node set XX^{\prime}. We also call S=XXS=X\setminus X^{\prime} as outliers, while XX^{\prime} is our targeted ordinal consensus. The dual problem is to find the minimum inconsistent (outlier) set SS such that all metrics are consistent when restricted to the subset XSX\setminus S.

Our contributions. We propose two notions to measure the “ordinal consistency”, which we call strong consistency and weak consistency, respectively. Intuitively, a strong consensus XXX^{\prime}\subseteq X means that the order of all pairwise distances among XX^{\prime} must be the same w.r.t. all input metrics. Under the weak consistency notion, roughly speaking for each pair of pairwise distances, only a plurality of input metrics (instead of all of them) need to agree on that. The formal definitions of these consistency notions are in Section 2.

Note that the maximum (ordinal) consensus subset and the minimum inconsistent subset are equivalent. In Section 3 and 4, we will show for both the strong and weak consistency definitions, finding the subset SS over an input of a constant number of (two for the strong case, and three for the weak case) ultrametrics or Euclidean metrics on the real line are NP-hard. These special cases imply that the problems are NP-hard if the inputs are arbitrary metrics. We also study the approximation algorithms for both the strongly and weakly minimum inconsistent subset problems. In particular, for the strongly inconsistent subset, we propose a 4-approximation algorithm with time complexity O(kn2logn)O(kn^{2}\log n). For the weak case, we have a O(n6)O(n^{6})-time 6-approximation algorithm. See Appendix A.10 for a table summarizing our hardness results and approximation algorithms.

All missing technical details can be found in the appendix.

Some related work. We note that this maximum consensus problem has been considered before when input metrics are tree-metrics. In particular, in the scenario where the inputs are multiple leaf-labeled phylogenetic (rooted) trees, one aims to find a maximum subset of labels that are “consistent” among all inputs. In [2], Amir and Keselman proposed Maximum Agreement Subtree problem (MAST): given a set of rooted binary trees with the same set of taxa (leaf labels), find the maximum subset, such that all the given trees restricted on the subset are isomorphic. This can be considered as a special case of tree consistency [1]. The Maximum Agreement Supertree problem (SMAST) problem is studied in [3]: Here for the given trees 𝒯={T1,,Tk}\mathcal{T}=\{T_{1},...,T_{k}\}, the leaf label set Λ(Ti)\Lambda(T_{i}) for input trees may not be same. The goal is to find a tree QQ with Λ(Q)Ti𝒯Λ(Ti)\Lambda(Q)\subset\cup_{T_{i}\in\mathcal{T}}\Lambda(T_{i}) such that |Λ(Q)||\Lambda(Q)| is maximized and for each tree Ti𝒯T_{i}\in\mathcal{T}, the subtree Ti|Λ(Q)T_{i}|\Lambda(Q) is isomorphic to Q|Λ(Ti)Q|\Lambda(T_{i}) (where T|ST|S is the subtree of TT restricted on leaf set SS).

These definitions of consistency over trees however are not identical from the ordinal consistency we propose. These problems are related to, but still different from, our maximum ordinal consensus problem if the input metrics are ultrametrics. An ultrametric can be represented by a corresponding representing tree where each tree node has a height value. Finding a strong consensus SS is equivalent to finding a subset of leaf nodes such that the restricted subtrees are not only isomorphic, but also the heights of all internal nodes must have the same order – This height condition appears to make the problem much harder: While the MAST problem on two trees can be solved in polynomial time via dynamic programming, the ordinal consensus problem is NP-hard even for only two ultrametrics, as we will show in Section 3.

2 Preliminaries and problem setup

The input is a metric system (;X)(\mathcal{M};X), consisting of a set of kk metrics ={ρ1,ρ2,,ρk}\mathcal{M}=\{{\rho}_{1},{\rho}_{2},...,{\rho}_{k}\} over the point set X={x1,,xn}X=\{x_{1},...,x_{n}\}. For any i[1,k]i\in[1,k], ρi(x,x){\rho}_{i}(x,x^{\prime}) is the distance between point x,xXx,x^{\prime}\in X w.r.t. ρi{\rho}_{i}. Our goal is to find a minimum subset SXS\subset X, such that the order of all pairwise distance restricted on X\SX\backslash S are consistent under a certain definition. Below we first introduce two notions of consistency. The two optimization problems we will study are given in Definition 2.6.

Definition 2.1 (Strong Consistency).

Given a metric system (={ρ1,ρ2,,ρk};Y={y1,y2,,ym})(\mathcal{M}=\{{\rho}_{1},{\rho}_{2},...,\\ {\rho}_{k}\};Y=\{y_{1},y_{2},...,y_{m}\}), we say that the set of metrics \mathcal{M} is strongly consistent w.r.t. YY if for any quartet {yp,yq,yr,ys}Y\{y_{p},y_{q},y_{r},y_{s}\}\subset Y, we have that (i) ρi(yp,yq)<ρi(yr,ys)ρj(yp,yq)<ρj(yr,ys){\rho}_{i}(y_{p},y_{q})<{\rho}_{i}(y_{r},y_{s})\Leftrightarrow{\rho}_{j}(y_{p},y_{q})<{\rho}_{j}(y_{r},y_{s}) and (ii) ρi(yp,yq)=ρi(yr,ys)ρj(yp,yq)=ρj(yr,ys){\rho}_{i}(y_{p},y_{q})={\rho}_{i}(y_{r},y_{s})\Leftrightarrow{\rho}_{j}(y_{p},y_{q})={\rho}_{j}(y_{r},y_{s}) for any 1i,jk1\leq i,j\leq k. In this case, we say that YY is a strongly consistent set, or a strong consensus, over \mathcal{M}.

We also say that two pairs (yp,yq)(y_{p},y_{q}) and (yr,ys)(y_{r},y_{s}) are strongly consistent w.r.t. \mathcal{M}, if the order between these two pairwise distances is the same w.r.t. any metric in \mathcal{M}.

Other than strong consistency, we also consider a weaker notion of consistency: In particular, we now only require that the order constructed by taking the plurality voting111In plurality voting, a candidate wins if it has the most votes than the other candidates. It does not have to get a majority (more than 50%) of the votes. over all input metrics is valid. To define the weak consistency formally, we will first define the so-called relation set and the auxiliary graph.

Definition 2.2 (Relation Set of Pairwise Distances).

Given an input set of kk metrics \mathcal{M} over point set XX, the relation set \mathcal{R} of pairwise distances w.r.t. (;X)(\mathcal{M};X) is the set of relations over all distinct pairs {(xp,xq)|xp,xqX,pq}\{(x_{p},x_{q})\,|\,x_{p},x_{q}\in X,p\not=q\} defined as follows: For any two pairs (xp,xq)(x_{p},x_{q}) and (xr,xs)(x_{r},x_{s}), among all three possible relations between (xp,xq)(x_{p},x_{q}) and (xr,xs)(x_{r},x_{s}), namely, ``<"``<", ``="``=", and ``>"``>", the one induced by most number of metrics in MM will be included in the relation set RR. If there is any tie, we will choose the relation with appearance in the metric of smaller index.

For example, the relation (xp,xq)<(xr,xs)(x_{p},x_{q})<(x_{r},x_{s})\in\mathcal{R} if and only if this relation appears in more (or the same number of) metrics than the other two relations (``="``=", ``<"``<").

The relation set constructed above may be not “valid” in the sense that no single metric can generate those relations. To check the validity of this relation set, we now define a specific auxiliary graph 𝒢\mathcal{G}, whose nodes correspond to pairs of points from XX. There are three different connections between two graph nodes, which correspond to the three possible relations between their corresponding pairs in the relation set \mathcal{R}. We will use this graph later to decide whether \mathcal{R} is valid or not.

Definition 2.3 (Auxiliary Graph for Relation Set).

Given the relation set \mathcal{R} of pairwise distances over point set XX and metrics \mathcal{M}, the auxiliary graph 𝒢=(V,E)\mathcal{G}=(V,E), where V={(xi,xj)|xi,xjX,ij}V=\{(x_{i},x_{j})|x_{i},x_{j}\in X,i\not=j\}, is a mixed graph (meaning it contains both directed and undirected edges): There is a directed edge from v1=(xp,xq)v_{1}=(x_{p},x_{q}) to v2=(xr,xs)v_{2}=(x_{r},x_{s}) if (xp,xq)>(xr,xs)(x_{p},x_{q})>(x_{r},x_{s})\in\mathcal{R}; there is a undirected edge between v1v_{1} and v2v_{2} if (xp,xq)=(xr,xs)(x_{p},x_{q})=(x_{r},x_{s})\in\mathcal{R}.

We will use v1,,vNv_{1},...,v_{N} (corresponding to all pairs in XX) to represent nodes in graph 𝒢\mathcal{G}, where N=(n2)N={n\choose 2} if n=|X|n=|X|. The auxiliary graph GG is a fully connected mixed graph (i.e., every pair of distinct vertices is connected by a unique edge, and the edge can be directed or undirected) with one edge between every two nodes. We say vi>vjv_{i}>v_{j} (or ``="``=" or ``<"``<") if there is a directed edge from viv_{i} to vjv_{j} (or an undirected edge between them, or a directed edge from vjv_{j} to viv_{i}), which means the pair, say (x,y)X×X(x,y)\in X\times X, represented by viv_{i} has a larger pairwise distance compared with the pair represented by vjv_{j}. A cycle in a mixed graph can be formed by a mixture of directed edges and undirected edges. Intuitively, suppose we have a completely directed cycle like the one shown in Figure 1 (a), then there is a conflict in the relations for pairwise distances, as v1<v2<v3<v4<v1v_{1}<v_{2}<v_{3}<v_{4}<v_{1}. Hence the relation set \mathcal{R} will not be valid in this case. There are more types of “directed” cycles and they can cause such conflict. In particular:

Definition 2.4 (Directed cycle of mixed graphs).

A cycle C={e1,,er}C=\{e_{1},\ldots,e_{r}\} is a fully-directed cycle if all edges in it are directed, and the directions of all edges inside are consistent (namely, Figure 1 (a)). A directed cycle of a mixed graph is a cycle C={e1,e2,,en}C=\{e_{1},e_{2},...,e_{n}\} such that (i) it consists of at least one directed edge, and (ii) one can assign a direction for each undirected edges in CC to make it into a fully directed cycle. See Figure 1 for examples.

Refer to caption Refer to caption Refer to caption Refer to caption
(a) A full-directed (b) A directed cycle (c) This is not (d) This is not
(and directed) cycle (of mixed graphs) with a directed cycle since a directed cycle since
with length 5. an undirected edge. none of the conflicts between
edges is directed. directed edges.
Figure 1: Examples of fully-directed cycle, directed cycle and other cycles of a mixed graph.

We say that a metric ρ^\widehat{{\rho}} generates a relation set \mathcal{R}, if the order of any two pairwise distances induced by ρ^\widehat{{\rho}} are the same as in \mathcal{R}. In this case, we say the relation set \mathcal{R} is valid. The following result provides a simple characterization for a valid relation set by using its corresponding auxiliary graph.

Lemma 2.1.

There exists a metric ρ^\widehat{{\rho}} generating a relation set \mathcal{R}, or equivalently, \mathcal{R} is valid, if and only if there is no directed cycle in the auxiliary graph 𝒢\mathcal{G} for the relation set \mathcal{R}.

Proof.

The “\Rightarrow” direction. This direction is relatively easier and we prove it by contradiction. Suppose the relation set is defined over a point set X^\widehat{X} and there is a metric ρ^\widehat{\rho} over set X^\widehat{X} that generates all relations in \mathcal{R}. Then assume there is a directed cycle of graph nodes C=v1,v2vr,v1C=\langle v_{1},v_{2}\cdots v_{r},v_{1}\rangle in the auxiliary graph 𝒢\mathcal{G}. Now, one can derive that the distance between the pair of nodes in v1v_{1} will be larger than itself by following the edges vivi+1v_{i}\to v_{i+1} for i=1,,ri=1,\ldots,r, which is a contradiction.

The “\Leftarrow” direction. For any two nodes vav_{a} and vbv_{b} connected by an undirected edge (i.e., vav_{a}, vbv_{b} represents equal distance), we can always contract them together without causing any problems if there is no directed cycle in graph 𝒢\mathcal{G}. It is because that for another arbitrary vcv_{c}, the directions of edges (va,vc)(v_{a},v_{c}) and (vb,vc)(v_{b},v_{c}) are consistent. These two edges will be both going to vcv_{c}, coming out from vcv_{c} or undirected.

Therefore, we can iteratively contract all node pairs connected by undirected edges until there is no such pair left. The produced graph 𝒢^\widehat{\mathcal{G}} will be a directed graph, and each node corresponds to a set of nodes in the original graph 𝒢\mathcal{G}. It is clear that 𝒢^\widehat{\mathcal{G}} is acyclic, because any directed cycle in 𝒢^\widehat{\mathcal{G}} will map back to a directed cycle (of a mixed graph) in 𝒢\mathcal{G} simply by selecting an arbitrary representative for all these supernodes in 𝒢^\widehat{\mathcal{G}}.

A linear order over those supernodes can be constructed by, e.g., topological sort, such that the orders among pairwise distances are consistent with the directed acyclic graph (DAG) 𝒢^\widehat{\mathcal{G}}. Nodes in 𝒢\mathcal{G} represented by the same supernode in 𝒢^\widehat{\mathcal{G}} share the same value. This provides a way to assign values over all pairwise distances such that the ordinal relations are consistent with 𝒢\mathcal{G}. To make it a metric, one can force that the minimum distance is larger than one half of the maximum distance. This way, no matter which three nodes we consider, the sum of two pairwise distances will always be larger than the third pairwise distance. ∎

We are now ready to define the weak consistency.

Definition 2.5 (Weak Consistency).

Given a set of kk metrics ={ρ1,ρ2,,ρk}\mathcal{M}=\{{\rho}_{1},{\rho}_{2},...,{\rho}_{k}\} over the same point set Y={y1,y2,,ym}Y=\{y_{1},y_{2},...,y_{m}\}, we say that the set of metrics \mathcal{M} is weakly consistent with YY if there is no directed cycle in the auxiliary graph 𝒢(Y)\mathcal{G}(Y) for the relation set (Y)\mathcal{R}(Y) as specified in Definitions 2.2 and 2.3. In this case, we may also say that YY is a weakly consistent set, or a weak consensus, over \mathcal{M}.

The optimization problems we aim to study in this paper are defined as follows.

Definition 2.6 (Strong-MIS and Weak-MIS Problems).

Given a set of metrics ={ρ1,,ρk}\mathcal{M}=\{{\rho}_{1},...,{\rho}_{k}\} on point set XX, the Strong-MIS problem (resp. Weak-MIS problem) aims to find a minimum subset of SXS^{*}\subset X, such that all metrics restricted on X\SX\backslash S^{*} are strongly consistent (resp. weakly consistent).

The set SS^{*} is called the minimum (strong/weak) inconsistent set, while XSX\setminus S^{*} is called the maximum (strong/weak) consensus set w.r.t. input metrics \mathcal{M}.

Note that minimizing the inconsistent set is equivalent to maximizing the consensus (although their approximation may not be equivalent).

The decision version of Strong-MIS (and Weak-MIS) problem is as follows: Given ,X\mathcal{M},X and also an integer aa. Is there a subset SXS\subset X with |S|=a|S|=a such that \mathcal{M} restricted on X\SX\backslash S are strongly consistent (resp. weakly consistent)?

We will show in Section 3 and 4 that the decision versions of Strong-MIS and Weak-MIS are in NP (see lemma 3.1 and lemma 4.1). Thus in most proofs, we will only show NP-hardness via reductions from NP-hard problems. The NP-completeness naturally follows by the fact that both decision problems are in NP.

Some specific metrics.

Later in Sections 3 and 4, we will show that the decision version of the minimum inconsistent set (equivalently, maximum consensus set) problem is NP-complete even when input metrics are restricted to two common choices: the Euclidean metric on the line (and thus in any d{\mathbb{R}}^{d}), and the ultrametrics.

Definition 2.7 (Line metric).

A line metric is a metric (,ρ)({\mathbb{R}},{\rho}), where the distance function ρ(x,y)=|xy|{\rho}(x,y)=|x-y|, and x,yx,y\in{\mathbb{R}} (Note, this is simply the Euclidean metric on {\mathbb{R}}).

Definition 2.8 (Ultrametric).

An ultrametric is a metric (Z,ρ)(Z,{\rho}) defined on a set ZZ, which satisfies the following strong triangle inequality: for any x,y,zZx,y,z\in Z, d(x,z)max(d(x,y),d(y,z)).d(x,z)\leq\max(d(x,y),d(y,z)).

Any finite ultrametric (Z,ρ)(Z,{\rho}) has a corresponding representing tree [8] TZT_{Z} such that:

  1. 1.

    TZT_{Z} is a rooted tree with the set of leaf nodes being ZZ. TZT_{Z} is equipped with a height function h:NZ+h:N\cup Z\rightarrow\mathbb{R}_{+}, where NN is the set of internal nodes of TZT_{Z} such that (i) all leaves have the same height; and (ii) hh is non-increasing along any root to leaf path.

  2. 2.

    For any two leaf nodes zz and zz^{\prime}, their distance ρ(z,z){\rho}(z,z^{\prime}) equals to h(LCA(z,z))h({\texttt{LCA}}(z,z^{\prime})), namely, the height of their lowest common ancestor (LCA (z,zz,z^{\prime})).

An example of an ultrametric and its representing tree is given in Appendix A.1.

3 Strong-MIS problem

In this section, we will study the Strong-MIS problem. Specifically, we show in Theorem 3.1 and 3.2 that the decision version of the Strong-MIS problem is NP-complete even when the input metric spaces are restricted to two very simple cases: the line metrics and the ultrametrics. Corollary 3.1 gives an inapproximability result. To complement that, in Theorem 3.4, we provide a 44-approximation algorithm for the general case.

Intuitively, finding the minimum strongly inconsistent set has a similar flavor as Minimum Vertex Cover or Minimum Hitting Set [10]. Intuitively, in Strong-MIS problem, given data points XX, a quartet (xp,xq,xr,xs)(x_{p},x_{q},x_{r},x_{s}) will be a target set if they induce a conflict between any two input metrics, and the goal is to find a Minimum Hitting Set \mathcal{H} such that for all target sets, at least one element is in \mathcal{H}. However, to show that Strong-MIS remains hard even for special simple metrics, we need to construct reductions carefully, and sometimes need to use different NP-complete problems to reduce from. We include the list of NP-complete problems used for reductions in Appendix B.

To study Strong-MIS, we first define the so-called Conflict Set, which will be used frequently in NP-hardness proofs.

Definition 3.1 (Conflict Set).

Given kk metrics ={ρ1,,ρk}\mathcal{M}=\{{\rho}_{1},...,{\rho}_{k}\} on node set X={x1,,xn}X=\{x_{1},...,x_{n}\}, the conflict set 𝒞\mathcal{C} induced by (;X)(\mathcal{M};X) is defined as 𝒞={(xp,xq,xr,xs)|(xp\mathcal{C}=\{(x_{p},x_{q},x_{r},x_{s})|(x_{p}, xq),(xr,xs) are not strongly consistent over }x_{q}),(x_{r},x_{s})\text{ are not strongly consistent over }\mathcal{M}\}. Each element in this conflict set is called a conflict quartet.

It is clear that the decision version of Strong-MIS is in NP as stated in the following lemma.

Lemma 3.1.

The decision version of Strong-MIS is in NP.

Proof.

Given a set SS with size aa, one can check whether metrics in \mathcal{M} are strongly consistent on X\SX\backslash S by simply iterating over all possible quartets and comparing their pairwise distances in different metrics. This process takes polynomial time. ∎

It turns out that the decision version of Strong-MIS is weakly NP-complete (which allows the magnitude of data involved to be exponential) even if we restrict the input metrics to be only two line metrics. The proof is in Appendix A.2.

Theorem 3.1.

The decision version of Strong-MIS is weakly NP-complete even when one only considers metric systems (;X)(\mathcal{M};X) where ={ρ1,ρ2}\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\} contains only two line metrics.

The following theorem shows that even with only 2 ultrametrics, finding a Strong-MIS remains NP-hard. The proof uses a reduction from Max 2-SAT problem; it is non-trivial and can be found in Appendix A.3.

Theorem 3.2.

Given a metric system (;X)(\mathcal{M};X), where ={ρ1,ρ2}\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\} contains two ultrametrics, The decision version of Strong-MIS is NP-complete.

The hardness result on 2 ultrametrics implies that finding Strong-MIS is also NP-hard when the input is two arbitrary metrics. The result is stated below. This theorem can also be proven directly via a reduction from Minimum Vertex Cover, which for completeness we include the simple details in Appendix A.4.

Theorem 3.3.

Given a metric system (;X)(\mathcal{M};X), where ={ρ1,ρ2}\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\} contains two arbitrary metrics, the decision version of Strong-MIS is NP-complete.

In fact, the proof in Appendix A.4 gives a size-preserving reduction. Hence the Corollary below follows directly from the inapproximability result [12] for Minimum Vertex Cover.

Corollary 3.1.

Strong-MIS with 2 metrics is Unique Games-hard to approximate within a factor 2ϵ2-\epsilon, where ϵ\epsilon is an arbitrarily small positive number.

Approximation algorithm for Strong-MIS.

As we mention above, one can consider a collection 𝒞\mathcal{C} of all conflict quartet (xp,xq,xr,xs)(x_{p},x_{q},x_{r},x_{s}) as the target set. The goal is to find a minimum set from XX such that it intersects (hits) every quartet in 𝒞\mathcal{C}. This is actually a special case of 44-hitting set problem, and it is easy to obtain a 4-approximation algorithm in time O(kn4)O(kn^{4}) time by checking all quartets. However, below we show we can improve this to O(kn2logn)O(kn^{2}\log n) time complexity.

Theorem 3.4.

Given a metric system (;X)(\mathcal{M};X) where ={ρ1,,ρk},X={x1,,xn}\mathcal{M}=\{{\rho}_{1},\ldots,{\rho}_{k}\},X=\{x_{1},\ldots,x_{n}\}, there is an O(kn2logn)O(kn^{2}\log n) 4-approximation algorithm for the Strong-MIS problem.

Proof.

Let SS^{*} denote the minimum inconsistent set so that XSX\setminus S^{*} is the maximum consensus for metric system (;X)(\mathcal{M};X). We propose Algorithm 1, which will compute a set SS to be removed as inconsistent set.

Algorithm 1 SS = Strong-MIS (=(ρ1,,ρk),X\mathcal{M}=({\rho}_{1},\cdots,{\rho}_{k}),X)
1:for each metric ρi{\rho}_{i}\in\mathcal{M} do
2:     Sort all pairwise distances based on the distances in ascending order. Let L1,,LkL_{1},\cdots,L_{k} denote those kk sorted lists.
3:end for
4:Initialize S=.S=\emptyset. kk pointers 𝗉1,,𝗉k{\mathsf{p}}_{1},\cdots,{\mathsf{p}}_{k} pointing to the head of L1,,LkL_{1},\cdots,L_{k}.
5:while None of 𝗉i{\mathsf{p}}_{i}s points out of bound. do
6:     while E(𝗉1)SE({\mathsf{p}}_{1})\cap S\not=\emptyset do
7:         𝗉1=𝗉1+1{\mathsf{p}}_{1}={\mathsf{p}}_{1}+1
8:     end while\triangleright Move the pointer of the first list.
9:     flag = False
10:     for i[2k]i\in[2\cdots k] do
11:         while E(𝗉i)SE({\mathsf{p}}_{i})\cap S\not=\emptyset do
12:              𝗉i=𝗉i+1{\mathsf{p}}_{i}={\mathsf{p}}_{i}+1
13:         end while
14:         if 𝗉1,𝗉i{\mathsf{p}}_{1},{\mathsf{p}}_{i} in bound  and E(𝗉i)E(𝗉1)\textbf{ and }E({\mathsf{p}}_{i})\not=E({\mathsf{p}}_{1}) then
15:              S=S{E(𝗉i),E(𝗉1)}S=S\cup\{E({\mathsf{p}}_{i}),E({\mathsf{p}}_{1})\}
16:              flag = True \triangleright Found a conflict
17:         end if
18:         𝗉i=𝗉i+1{\mathsf{p}}_{i}={\mathsf{p}}_{i}+1 \triangleright Move the pointer to the next since it is already considered.
19:         if flag = True then \triangleright If there is a conflict, move on to the next iteration.
20:              break
21:         end if
22:     end for
23:end while
24:return SS

On the high level, Algorithm 1 starts with sorting all pairs of points based on their distances (If two pairs have the same distance, then they are sorted in lexicographical order). With kk sorted lists of pairs L1,,LkL_{1},\ldots,L_{k}, we set kk pointers to the heads of these lists: Let 𝗉i{\mathsf{p}}_{i} be the pointer to the list of pairwise distances of metric ρi\rho_{i}, and E(𝗉i)E({\mathsf{p}}_{i}) denote the pair of data points in XX that 𝗉i{\mathsf{p}}_{i} is pointing to. As pointers move down these lists, it checks whether all pointers are pointing to the same pair of points from XX. If that is the case, it moves on. Otherwise, if E(𝗉i)E(𝗉1)E({\mathsf{p}}_{i})\neq E({\mathsf{p}}_{1}) (line-14), it means a conflict quartet is discovered, formed by the two pairs E(𝗉i)E({\mathsf{p}}_{i}) and E(𝗉1)E({\mathsf{p}}_{1}). In this case, we will remove all these points (by adding them to SS), and move on. The code from lines 6-8 and 11-13 is to skip all pairs which contain at least one point from the outlier set SS. The procedure ends when any pointer moves out of bound (i.e., beyond the last element of the list). An illustration of how the algorithm works is given in Figure 9 in Appendix A.5.

Time complexity. There are kk metrics and (n2)n\choose 2 pairs. The sorting process takes O(kn2logn)O(kn^{2}\log n) time. Since the procedure ends when each pointer reaches the tail, the pointers will be moved for at most kn2kn^{2} total times. During each pointer move, the most expensive step is line-11, which can be done in O(1)O(1) with an array of length nn indicating if a node is in SS or not.

Putting everything together, the total time complexity for our algorithm is O(kn2logn)O(kn^{2}\log n).

Correctness of algorithm. We consider the points ever added to set SS. Note that SS is only updated in line-12, where there is a conflict (i.e, the pair E(𝗉i)=(a,b)E({\mathsf{p}}_{i})=(a,b) pointed by pip_{i} in list LiL_{i} is different from the pair E(𝗉1)=(c,d)E({\mathsf{p}}_{1})=(c,d) by the pointer for list L1L_{1}). Assume that (a,b)(a,b) is smaller than (c,d)(c,d) for lexicographical order. We claim that {(a,b),(c,d)}\{(a,b),(c,d)\} form a conflict quartet. To prove this, first, note that as we skip all pairs that contain any element from SS (lines 6-8, 11-13), this means that S{a,b,c,d}=S\cap\{a,b,c,d\}=\emptyset. Hence in list L1L_{1}, we have not yet seen (scanned) the pair (a,b)(a,b) – as otherwise, at the time when the pointer in L1L_{1} reaches (a,b)(a,b), if at that moment the pointer in LiL_{i} is not pointing to (a,b)(a,b), we would have already seen a conflict and added a,ba,b to SS. It then follows that w.r.t. metric ρ1{\rho}_{1}, we have that ρ1(a,b)>ρ1(c,d){\rho}_{1}(a,b)>{\rho}_{1}(c,d). On the other hand, in list LiL_{i}, it must be that we have not yet seen (c,d)(c,d) by the same reasoning, meaning that ρi(a,b)ρi(c,d){\rho}_{i}(a,b)\leq{\rho}_{i}(c,d) w.r.t. metric ρi{\rho}_{i}. Hence these two pairs form a conflict quartet. Obviously, for any conflict quartet, the minimum inconsistent set SS^{*} has to contain at least one element from it. Furthermore, since all the conflict quartet the algorithm ever identifies are disjoint. This means that the |S||S|/4|S^{*}|\geq|S|/4, that is, |S|4|S||S|\leq 4|S^{*}|.

Finally, consider the ordered sublist L^i\widehat{L}_{i} of LiL_{i}, obtained by removing from LiL_{i} all pairs that intersect SS. Then it is easy to see that by construction of the algorithm, all L^i\widehat{L}_{i}s are the same. In other words, after removing all elements in SS, the remaining points XSX\setminus S form a consensus subset for the kk metrics ={ρ1,,ρk}\mathcal{M}=\{{\rho}_{1},\ldots,{\rho}_{k}\}. Hence SS is a valid inconsistent set and |S||S||S|\geq|S^{*}|. It follows that SS is a 4-approximation of the minimum inconsistent subset for the metric system (;X)(\mathcal{M};X).

4 Weak-MIS problem

We now focus on the Weak-MIS problem of finding a minimum weakly inconsistent subset. In Theorems 4.1 and 4.2, we show that it is NP-complete for the special case of only three input line metrics or ultrametrics. We provide a straightforward 6-approximation algorithm at the end.

Refer to caption
Figure 2: Three possible directed triangles.

By definition 2.5 and 2.6, if X\SX\backslash S is a consensus set, then the auxiliary graph 𝒢\mathcal{G} restricted on X\SX\backslash S must contain no directed cycle. It is well known that a tournament (fully connected directed graph) has a directed cycle if and only if it has a directed triangle [7]. It turns out that a similar result holds for auxiliary graphs, which are mixed graphs: see the claim below. See Figure 2 for the 3 possible cases of directed triangles. The simple proof of this claim can be found in Appendix A.6.

Claim 4.1.

An auxiliary graph 𝒢\mathcal{G} has no directed cycle if and only if it has no directed triangle.

Hence to see whether there is any directed cycle in the auxiliary graph, one only needs to check if there is any directed triangle.

Followed by claim 4.1, the decision version of Weak-MIS is in NP.

Lemma 4.1.

The decision version of Weak-MIS is in NP.

Proof.

By definition 2.5 and claim 4.1, one can check whether metrics from \mathcal{M} are weakly consistent on a set SS by iterating over all triangles in the auxiliary graph. It is clearly polynomial. ∎

When there are two metrics, with the tie-breaking rule defined in Definition 2.2, it is clear that we would always prefer the first metric (once the order of input metrics is fixed). Thus in this case, the minimum inconsistent set is simply \emptyset. The problem of Weak-MIS becomes non-trivial when there are three metrics.

Our first main result is as follows, with proof in Appendix A.7.

Theorem 4.1.

Given a metric system (;X)(\mathcal{M};X), where ={ρ1,ρ2,ρ3}\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\} contains three line metrics. The decision version of Weak-MIS is weakly NP-complete.

Our second main result is the hardness for ultrametrics.

Theorem 4.2.

Given a metric system (;X)(\mathcal{M};X), where ={ρ1,ρ2,ρ3}\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\} contains three ultrametrics. The decision version of Weak-MIS is NP-complete.

Proof of Theorem 4.2.

We prove this theorem via a reduction from the so-called 3-dimensional Matching problem. In particular, instead of the problem of finding the minimum inconsistent set, we will consider the equivalent dual version of finding a maximum consensus for a set of 3 ultrametrics.

Description of the reduction. Suppose we are given an instance of 3-dimensional Matching problem (X,Y,Z;𝒮X×Y×ZX,Y,Z;{\mathcal{S}}\subseteq X\times Y\times Z), where |X|=|Y|=|Z||X|=|Y|=|Z|. Assume that X={x1,,xn},Y={y1,,yn},Z={z1,,zn}X=\{x_{1},...,x_{n}\},Y=\{y_{1},...,y_{n}\},Z=\{z_{1},...,z_{n}\}, while 𝒮={𝗌1,,𝗌m}{\mathcal{S}}=\{{\mathsf{s}}_{1},...,{\mathsf{s}}_{m}\} where each relation 𝗌i{\mathsf{s}}_{i} is of the form 𝗌i=(xa,yb,zc){\mathsf{s}}_{i}=(x_{a},y_{b},z_{c}) with xaX,ybYx_{a}\in X,y_{b}\in Y and zcZz_{c}\in Z. A matching Π𝒮\Pi\subset{\mathcal{S}} is such that each element in XYZX\cup Y\cup Z can appear at most once in all relations in Π\Pi. The decision version of the 3-dimensional Matching problem is that, given (X,Y,Z,𝒮)(X,Y,Z,{\mathcal{S}}) and an integer KK, does there exist a matching Π𝒮\Pi\subset{\mathcal{S}} such that |Π|=K|\Pi|=K?

From this instance (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}) of the 3-dimensional Matching problem, we will now construct an instance of the Weak-MIS problem (={UX,UY,UZ},P)(\mathcal{M}=\{U_{X},U_{Y},U_{Z}\},P), where the node set is P={a1,,am,b1,,bm,c1,,cm,d1,,dm,𝗌^1,,𝗌^m}P=\{a_{1},\ldots,a_{m},b_{1},\ldots,b_{m},c_{1},\ldots,c_{m},\\ d_{1},\ldots,d_{m},{\widehat{{\mathsf{s}}}}_{1},\ldots,{\widehat{{\mathsf{s}}}}_{m}\}, and \mathcal{M} consists of 3 ultrametrics, UX,UYU_{X},U_{Y}, and UZU_{Z} over node set PP. (Note that we do not use XX as the node set as XX is already used in the instance of 3-dimensional Matching problem.) Recall that any ultrametric over node set PP corresponds to a representing tree, which is a rooted tree where all nodes have a height value, and all leaves (corresponding to node set PP) have the same height. In what follows, we will describe the three representing trees TX,TYT_{X},T_{Y} and TZT_{Z}, generating UX,UYU_{X},U_{Y} and UZU_{Z}, respectively. In particular, these three representing trees TX,TYT_{X},T_{Y} and TZT_{Z} all have the same tree shape. However, the height of those internal nodes will be different.

We will first describe the representing tree for TXT_{X}. The root rootroot has 5 children, represented by A,B,C,DA,B,C,D and X^\widehat{X}. Each of A,B,C,DA,B,C,D has exactly mm children, which are Ch(A)={a1,,am}{\texttt{Ch}}(A)=\{a_{1},\ldots,a_{m}\}, Ch(B)={b1,,bm}{\texttt{Ch}}(B)=\{b_{1},\ldots,b_{m}\}, Ch(C)={c1,,cm}{\texttt{Ch}}(C)=\{c_{1},\ldots,c_{m}\} and Ch(D)={d1,,dm}{\texttt{Ch}}(D)=\{d_{1},\ldots,d_{m}\}, respectively. Note that these children are leaves, corresponding to the first 4m4m nodes in the node set PP (See Figure 3). The node X^\widehat{X} has nn children, x^1,,x^n{\widehat{x}}_{1},\ldots,{\widehat{x}}_{n}, corresponding to the nn points in input set XX of the 3-dimensional Matching instance (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}). The child(ren) of each x^i{\widehat{x}}_{i} is defined as: Ch(x^i)={𝗌^jxi𝗌j}{\texttt{Ch}}({\widehat{x}}_{i})=\{{\widehat{{\mathsf{s}}}}_{j}\mid x_{i}\in{\mathsf{s}}_{j}\}; all children of x^i{\widehat{x}}_{i}s are all leaves. See Figure 3.

Next, we assign height values for nodes in TXT_{X}. All leaves (corresponding to elements in the node set PP where ultrametrics are defined on) have height 0. The height values for the internal nodes are listed in the row corresponding to TXT_{X} in Table 1.

The representing tree TYT_{Y} (resp. TZT_{Z}) has the same tree shape as TXT_{X}, and the only difference is that the node X^\widehat{X} and x^i{\widehat{x}}_{i}s are replaced by Y^\widehat{Y} and y^i{\widehat{y}}_{i}s (resp. by Z^\widehat{Z} and z^i{\widehat{z}}_{i}s). See Figure 3. The height values of all leaves nodes are still 0, and the height values of internal nodes are listed in the last two rows of Table 1. Also see Figure 3 where the height of each node is listed in the parenthesis next to each node.

rootroot AA BB CC DD X^(Y^,Z^)\widehat{X}(\widehat{Y},\widehat{Z}) x^i(y^i,z^i)\widehat{x}_{i}(\widehat{y}_{i},\widehat{z}_{i})
TXT_{X} 10 5 4 3 1 2 0
TYT_{Y} 10 3 0 2 5 4 1
TZT_{Z} 10 2 4 0 1 5 3
Table 1: Height function values assigned to internal nodes.
Refer to caption
Figure 3: An instance of 3D Matching problem, and the corresponding instance of Weak-MIS. The values in parentheses are the height values of tree nodes.

This finishes setting up all three representing trees (thus also the ultrametrics UX,UYU_{X},U_{Y} and UZU_{Z}). Recall that for each ultrametric say UXU_{X}, the distance UX(p,q)U_{X}(p,q), with p,qPp,q\in P, corresponds to the height value of the lowest common ancestor (LCA) of leaves pp and qq.

In what follows, we will first prove some properties of the constructed ultrametrics. Specifically, consider the auxiliary graph 𝒢\mathcal{G} constructed for the metric system (={UX,UY,UZ};P)(\mathcal{M}=\{U_{X},U_{Y},U_{Z}\};P). Recall that each graph node in 𝒢\mathcal{G} corresponds to a pair of points from PP, (p,q)P×P(p,q)\in P\times P. For simplicity, we use 𝖠{\mathsf{A}} to represent the set {a1,,am}\{a_{1},\ldots,a_{m}\}, and similarly for 𝖡,𝖢{\mathsf{B}},{\mathsf{C}}, 𝖣{\mathsf{D}}, and 𝖲{\mathsf{S}}. Given a graph node (p,q)(p,q) of 𝒢\mathcal{G}, we say that this pair splits if pp and qq are from two different sets in {𝖠,𝖡,𝖢,𝖣,𝖲}\{{\mathsf{A}},{\mathsf{B}},{\mathsf{C}},{\mathsf{D}},{\mathsf{S}}\} (e.g, p𝖠p\in{\mathsf{A}} and q𝖣q\in{\mathsf{D}}). Now consider a triple 𝗌^i=(x,y,z)X×Y×Z{\widehat{{\mathsf{s}}}}_{i}=(x^{\prime},y^{\prime},z^{\prime})\in X\times Y\times Z; we refer to xx^{\prime} (resp, yy^{\prime}, zz^{\prime}) as the xx-coordinate (resp. yy- or zz-coordinate) of 𝗌^i{\widehat{{\mathsf{s}}}}_{i}. Given a graph node of the auxiliary graph 𝒢\mathcal{G} of the form (𝗌^i,𝗌^j)({\widehat{{\mathsf{s}}}}_{i},{\widehat{{\mathsf{s}}}}_{j}), we say that this pair has shared coordinate, if 𝗌^i𝗌^j{\widehat{{\mathsf{s}}}}_{i}\cap{\widehat{{\mathsf{s}}}}_{j}\neq\emptyset. This means that 𝗌^i{\widehat{{\mathsf{s}}}}_{i} shares either xx-, yy- or zz-coordinate. This is the key lemma to guarantee the correctness of our reduction. The proof of this lemma can be found in Appendix A.8. We remark that the height values of all nodes in the three representing trees are chosen carefully so that the lemma below holds. To compute these height values, we in fact write a computer program testing all possible permutations of heights over internal nodes to make sure all conditions in Lemma 4.2 are satisfied.

Lemma 4.2.

Consider a graph node (p,q)P×P(p,q)\in P\times P of the auxiliary graph 𝒢\mathcal{G}.

  • (i)

    If (p,q)(p,q) splits, then this graph node cannot appear in any directed triangle in the auxiliary graph 𝒢\mathcal{G}.

  • (ii)

    Any directed triangle in 𝒢\mathcal{G} must contain at least one graph node of the form (𝗌^i,𝗌^j)({\widehat{{\mathsf{s}}}}_{i},{\widehat{{\mathsf{s}}}}_{j}) where this pair have shared coordinate.

  • (iii)

    If (p,q)(p,q) is of the form (𝗌^i,𝗌^j)({\widehat{{\mathsf{s}}}}_{i},{\widehat{{\mathsf{s}}}}_{j}) and this pair has shared coordinate, then this graph node (𝗌^i,𝗌^j)({\widehat{{\mathsf{s}}}}_{i},{\widehat{{\mathsf{s}}}}_{j}) must participate in at least one directed triangle.

Note that the correctness of the reduction then follows easily from the above key lemma. In particular, we now show that (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}) has a matching Π𝒮\Pi\subseteq{\mathcal{S}} of size KK if and only if the metric system ({UX,UY,UZ},P)(\{U_{X},U_{Y},U_{Z}\},P) has a consensus subset of size 4m+K4m+K.

\Rightarrow” direction: Suppose (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}) has a matching Π={𝗌I1,,𝗌IK}\Pi=\{{\mathsf{s}}_{I_{1}},\ldots,{\mathsf{s}}_{I_{K}}\} of size KK. Then we claim that the set

P={a1,,am,b1,,bm,c1,,cm,d1,,dm,𝗌^I1,,𝗌^IK}=𝖠𝖡𝖢𝖣{𝗌^I1,,𝗌^IK}\begin{split}P^{\prime}=&\{a_{1},\cdots,a_{m},b_{1},\cdots,b_{m},c_{1},\cdots,c_{m},d_{1},\cdots,d_{m},{\widehat{{\mathsf{s}}}}_{I_{1}},\cdots,{\widehat{{\mathsf{s}}}}_{I_{K}}\}\\ =&{\mathsf{A}}\cup{\mathsf{B}}\cup{\mathsf{C}}\cup{\mathsf{D}}\cup\{{\widehat{{\mathsf{s}}}}_{I_{1}},\cdots,{\widehat{{\mathsf{s}}}}_{I_{K}}\}\end{split}

forms a consensus subset of PP w.r.t. the metric system ({UX,UY,UZ};P)(\{U_{X},U_{Y},U_{Z}\};P). Specifically, by Claim 4.1, we just need to show that the subgraph 𝒢\mathcal{G}^{\prime} of the auxiliary graph 𝒢\mathcal{G} spanned by nodes coming from P×PP^{\prime}\times P^{\prime} contains no directed triangle. As Π\Pi is a valid matching, no two 𝗌^Ii{\widehat{{\mathsf{s}}}}_{I_{i}} and 𝗌^Ij{\widehat{{\mathsf{s}}}}_{I_{j}} can have shared coordinates. It then follows from Lemma 4.2 (ii) that there cannot be any directed triangle in the subgraph 𝒢\mathcal{G}^{\prime}.

\Leftarrow” direction: Suppose we have a consensus subset PPP^{\prime}\subset P for the metric system ({UX,UY,UZ};P)(\{U_{X},U_{Y},U_{Z}\};P) such that |P|=4m+K|P^{\prime}|=4m+K. First, consider P𝖲={𝗌^J1,,𝗌^Js}P^{\prime}\cap{\mathsf{S}}=\{{\widehat{{\mathsf{s}}}}_{J_{1}},\ldots,{\widehat{{\mathsf{s}}}}_{J_{s}}\}. We know that the subgraph 𝒢\mathcal{G}^{\prime} spanned by nodes from P×PP^{\prime}\times P^{\prime} contains no directed triangle. By Lemma 4.2 (iii), it then follows that no two 𝗌^Ji,𝗌^Jj{\widehat{{\mathsf{s}}}}_{J_{i}},{\widehat{{\mathsf{s}}}}_{J_{j}}, i,j[1,s]i,j\in[1,s], could have shared coordinate. In other words, the set {𝗌^J1,,𝗌^Js}\{{\widehat{{\mathsf{s}}}}_{J_{1}},\ldots,{\widehat{{\mathsf{s}}}}_{J_{s}}\} forms a valid 3D matching for (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}) of size ss. On the other hand, we know that |P𝖲|4m|P^{\prime}\setminus{\mathsf{S}}|\leq 4m (as the largest possible choise for P𝖲P^{\prime}\setminus{\mathsf{S}} is 𝖠𝖡𝖢𝖣{\mathsf{A}}\cup{\mathsf{B}}\cup{\mathsf{C}}\cup{\mathsf{D}}). Since |P|=4m+K|P^{\prime}|=4m+K, it then follows that s|K|s\geq|K|, and thus there exists a 3D matching of (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}) of size at least |K||K|.

As 3-dimensional Matching is NP-complete, it then follows that the decision problem of Weak-MIS is NP-complete. This finishes the proof of Theorem 4.2. ∎

The following theorem is an implication from the previous proof. Similarly as Strong-MIS, we also provide a direct proof for arbitrary metrics in Appendix A.9. In the proof, we construct a size-preserving reduction from Minimum Vertex Cover which again leads to an (2ϵ)(2-\epsilon)-inapproximability result.

Theorem 4.3.

Given a metric system (;X)(\mathcal{M};X), where ={ρ1,ρ2,ρ3}\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\} contains three arbitrary metrics, the decision version of Weak-MIS is NP-complete.

Furthermore, Weak-MIS with 3 metrics is Unique Games-hard to approximate within a factor 2ϵ2-\epsilon for an arbitrarily small positive constant ϵ>0\epsilon>0.

Finally, there is a simple 6-approximation algorithm with running time O(n6)O(n^{6}): Specifically, given a metric system (;X)(\mathcal{M};X) with n=|X|n=|X|, we first build auxiliary graph 𝒢\mathcal{G} as described earlier in O(kn4)O(kn^{4}) time. We want to construct an outlier set SXS\subset X so that it “hits” all directed triangles in 𝒢\mathcal{G}: note that this will then guarantee that XSX\setminus S is a consensus set w.r.t. \mathcal{M}. To this end, we simply enumerate all directed triangles in 𝒢\mathcal{G} in O(n6)O(n^{6}) time. We then initialize S=S=\emptyset and go through the list of directed triangles one by one. For each directed triangle Δw1w2w3\Delta w_{1}w_{2}w_{3} (where each wiw_{i} is a pair of points in XX), if SS does not intersect with any of the points included in w1w2w3Xw_{1}\cup w_{2}\cup w_{3}\subset X, then we simply add all these points (at most 6 distinct points) to SS. Otherwise, this triangle is already “hit” by SS and we do nothing. Let SS^{*} be the minimum weakly inconsistent set (i,e, the optimal solution for Weak-MIS). It is easy to see that |S||S|/6|S^{*}|\geq|S|/6, as all the 6-tuples (at most 6) we ever added to SS are all disjoint, and for each such 6-tuple, SS^{*} must contain at least one point from it. Furthermore, it is also easy to see that after removing all points SS, the resulting auxiliary graph restricted to only pairs not containing points in SS is free of directed triangles, and thus free of directed cycles by Lemma 2.1 and Claim 4.1. Hence XSX\setminus S is a consensus set w.r.t. \mathcal{M}, and |S|6|S||S|\leq 6|S^{*}|. Hence:

Theorem 4.4.

There is a 6-approximation algorithm for the Weak-MIS problem that runs in O(kn4+n6)O(kn^{4}+n^{6}) time for kk metrics defined on a point set of size nn.

5 Conclusion

In this paper, we proposed to study the maximum ordinal consensus problem over a set of input metrics. We developed two concepts of “consistency” that only rely on ordinal information of pairwise distances. We proved several hardness results for both definitions with different input metrics. We also developed constant-factor approximation algorithms for the minimum inconsistent set problem under both definitions.

There are still some open directions for future work. For example, can we close the gap between the inapproximability and the approximation algorithm we developed? Can we improve the time complexity, especially for the Weak-MIS problem? Can we find better approximation algorithms for special cases such as Euclidean metrics or ultrametrics? We also note that the current approximation algorithms target the minimum (ordinally) inconsistent subset problems – how about the dual maximum ordinal consensus problem?

6 Acknowledgment

This work is partially supported by National Science Foundation (NSF) under grant IIS-1815697, as well as National Institute of Health (NIH) under grant R01EB022899.

References

Appendix A Missing Details

A.1 An Ultrametric and its Representing Tree

Figure 4 shows an example of an ultrametric and its representing tree. Values on internal nodes are their heights, and all leaf nodes are with height 0. The distance between any two leaf nodes is the height function value of their LCA. For example, the distance from AA to EE is 3.

Refer to caption
Figure 4: An ultrametric and its corresponding representing tree.

A.2 Proof of Theorem 3.1

Proof.

We prove this theorem via a reduction from the Minimum Vertex Cover problem.

Description of the reduction. Suppose we are given an instance G=(V,E)G=(V,E) of Minimum Vertex Cover, where V={v1,,vn}V=\{v_{1},...,v_{n}\} and E={e1,,em}E=\{e_{1},...,e_{m}\}. A vertex cover VVV^{\prime}\subset V is such that every edge in EE has at least one endpoint in VV^{\prime}. The decision version of the Minimum Vertex Cover problem is that, given G=(V,E)G=(V,E) and an integer KK, is there a vertex cover VVV^{\prime}\subset V such that |V|=K|V^{\prime}|=K? From the instance of Minimum Vertex Cover, we construct an instance of the Strong-MIS problem (={ρ1,ρ2};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\};X), where ρ1,ρ2{\rho}_{1},{\rho}_{2} are two line metrics. Here X={v^1,,v^n,re1,l,re1,r,,rem,l,rem,r}X=\{{\widehat{v}}_{1},...,{\widehat{v}}_{n},r_{e_{1},l},r_{e_{1},r},...,r_{e_{m},l},r_{e_{m},r}\}, and |X|=n+2m|X|=n+2m. The rei,lr_{e_{i},l} and rei,rr_{e_{i},r} are two “pivots” for iith edge eie_{i}.

Since all these points are on real line {\mathbb{R}}, we will use ``[]1"``[\cdot]_{1}" and ``[]2"``[\cdot]_{2}" to represent coordinates. When the coordinate is the same for both metrics or there is no ambiguity, we will omit the subscript and use ``[]"``[\cdot]" for the coordinate. For example, [v^1]1[{\widehat{v}}_{1}]_{1} is the coordinate of point v^1{\widehat{v}}_{1} in ρ1{\rho}_{1}. Instead of constructing ρ1,ρ2{\rho}_{1},{\rho}_{2} directly, we construct the coordinates of each point as follows (see Figure 5 for an example):

  1. 1.

    For both ρ1,ρ2{\rho}_{1},{\rho}_{2}, [v^i]=2i1[{\widehat{v}}_{i}]=2^{i-1}.

  2. 2.

    For both ρ1,ρ2{\rho}_{1},{\rho}_{2}, the coordinate of [rej,l]=2n+2(j1)[r_{e_{j},l}]=2^{n+2(j-1)}.

  3. 3.

    For rej,r,1jmr_{e_{j},r},1\leq j\leq m, and ej=(va,vb),b>ae_{j}=(v_{a},v_{b}),b>a, its coordinate [rej,r]1[r_{e_{j},r}]_{1} in ρ1{\rho}_{1} is: [rej,r]1=[rej,l]1+([v^b]1[v^a]1)+ϵ=2n+2(j1)+(2b12a1)+ϵ[r_{e_{j},r}]_{1}=[r_{e_{j},l}]_{1}+([{\widehat{v}}_{b}]_{1}-[{\widehat{v}}_{a}]_{1})+\epsilon=2^{n+2(j-1)}+(2^{b-1}-2^{a-1})+\epsilon. Similarly, its coordinate [rej,r]2[r_{e_{j},r}]_{2} in ρ2{\rho}_{2} is: [rej,r]2=[rej,l]2+([v^b]2[v^a]2)ϵ=2n+2(j1)+(2b12a1)ϵ[r_{e_{j},r}]_{2}=[r_{e_{j},l}]_{2}+([{\widehat{v}}_{b}]_{2}-[{\widehat{v}}_{a}]_{2})-\epsilon=2^{n+2(j-1)}+(2^{b-1}-2^{a-1})-\epsilon. The intuition is the distance from rej,lr_{e_{j},l} to rej,rr_{e_{j},r} is close to the distance from vav_{a} to vbv_{b}, but orders are different in two metrics.

Refer to caption
Figure 5: Constructed line metrics given an simple GG.

Check comparable pairwise distances. The distance between any pair of nodes under metrics ρ1{\rho}_{1} and ρ2{\rho}_{2} can be easily calculated based on coordinates []1[\cdot]_{1} and []2[\cdot]_{2}. We now show that any conflict quartet of the above constructed instance must have the form of {v^i,v^j,re,l,re,r}\{{\widehat{v}}_{i},{\widehat{v}}_{j},r_{e,l},r_{e,r}\}, where e=(vi,vj)Ee=(v_{i},v_{j})\in E. We first consider the necessary conditions for two pairs (a,b)(a,b) and (c,d)(c,d) to cause a conflict. W.l.o.g., we can assume that [b]>[a][b]>[a] and [d]>[c][d]>[c] in both metrics.

  • (1)

    To cause a conflict, the distances [b][a][b]-[a] and [d][c][d]-[c] should be close so that the relation between {[b]1[a]1,[d]1[c]1}\{[b]_{1}-[a]_{1},[d]_{1}-[c]_{1}\} and {[b]2[a]2,[d]2[c]2}\{[b]_{2}-[a]_{2},[d]_{2}-[c]_{2}\} can be different. One can notice that the distance can be distorted by at most 2ϵ2\cdot\epsilon from one metric to the other, thus in a more formal way, close means “differ by at most 4ϵ"4\cdot\epsilon".

  • (2)

    Another necessary condition for (a,b)(a,b) and (c,d)(c,d) to cause a conflict is that [b][a][b]-[a] is not identical to [d][c][d]-[c] in any one of the two metrics. For example, assume that [b]1[a]1=[d]1[c]1[b]_{1}-[a]_{1}=[d]_{1}-[c]_{1} (equivalently [b]1+[c]1=[a]1+[d]1)[b]_{1}+[c]_{1}=[a]_{1}+[d]_{1}) and [b]1+[c]1=I+tϵ=[a]1+[d]1[b]_{1}+[c]_{1}=I+t\cdot\epsilon=[a]_{1}+[d]_{1}. Here II and tt are some integers. When comes to coordinates in the second metric, one can show that [b]2+[c]2=Itϵ=[a]2+[d]2[b]_{2}+[c]_{2}=I-t\cdot\epsilon=[a]_{2}+[d]_{2} as only the polarity of ϵ\epsilon term is changed across two metrics.

Combining those two conditions, we define two pairs are comparable if they are different and differ by at most 4ϵ4\cdot\epsilon. A conflict is possible only when two pairs in the quartet are comparable. We then iterate over all pairs and check whether there is any other pair with a comparable pairwise distance. We will also use the leading term (the largest power of two) of coordinates to find comparable pairs. The leading term of [v^i]s[{\widehat{v}}_{i}]s and [re,l][r_{e,l}]s are of different scales, and [re,r][r_{e,r}] has the same leading term of [re,l][r_{e,l}]. Moreover, if [b][b] has a larger leading term than [a][a], then [b]2[a][b]\geq 2[a].

The necessary condition of two pairs (a,b)(a,b) and (c,d)(c,d) (where we assume [b]>[a][b]>[a], [d]>[c][d]>[c] and [d][b][d]\geq[b]) having comparable distances is either 1). nodes dd and bb have the leading terms of the same scale. or 2). nodes cc and dd have the leading term of the same scale. The reason is (assuming [d][d] has a larger leading term than [b][b] and [c][c]) that [d]>=2[c][d]>=2[c] and 2[b]<=[d]2[b]<=[d] implies [d][c][d]/2[b][b][a]+1[d]-[c]\geq[d]/2\geq[b]\geq[b]-[a]+1 if none of the conditions are satisfied. In either case where the necessary condition is satisfied, there must be two nodes of the form (re,l,re,r)(r_{e,l},r_{e,r}) for some edge ee. This observation largely reduced the possibilities of comparable pairs. All possible cases are considered below (we use “x,yx,y” to denote unknown elements and “\approx” to denote comparable distances):

  1. 1.

    (x,re,l)(x,r_{e,l}) and (y,re,r)(y,r_{e,r}) for some edge ee. We have [re,r][y][re,l][x][y][x][re,r][re,l][r_{e,r}]-[y]\approx[r_{e,l}]-[x]\Rightarrow[y]-[x]\approx[r_{e,r}]-[r_{e,l}]. Assume e=(vi,vj)e=(v_{i},v_{j}), and we have [y][x]2j12i1[y]-[x]\approx 2^{j-1}-2^{i-1}, this is true only when y=v^jy={\widehat{v}}_{j} and x=v^ix={\widehat{v}}_{i} by checking the leading term of yy.

  2. 2.

    (x,y)(x,y) and (re,l,re,r)(r_{e,l},r_{e,r}). By similar calculation, we should easily get y=v^jy={\widehat{v}}_{j} and x=v^ix={\widehat{v}}_{i} assuming e=(vi,vj)e=(v_{i},v_{j}).

And indeed, {v^i,v^j,re,l,re,r}\{{\widehat{v}}_{i},{\widehat{v}}_{j},r_{e,l},r_{e,r}\} is a conflict quartet (where e=(vi,vj)e=(v_{i},v_{j})) since [v^j]1[v^i]1<[re,r]1[re,l]1[{\widehat{v}}_{j}]_{1}-[{\widehat{v}}_{i}]_{1}<[r_{e,r}]_{1}-[r_{e,l}]_{1} but [v^j]2[v^i]2>[re,r]2[re,l]2[{\widehat{v}}_{j}]_{2}-[{\widehat{v}}_{i}]_{2}>[r_{e,r}]_{2}-[r_{e,l}]_{2}. Now, we can conclude that the quartets in conflict set 𝒞\mathcal{C} have one-to-one correspondence to edges in EE, i.e., 𝒞={(v^i,v^j,re,l,re,r|e=(vi,vj)E}\mathcal{C}=\{({\widehat{v}}_{i},{\widehat{v}}_{j},r_{e,l},r_{e,r}|e=(v_{i},v_{j})\in E\}. Now it is not hard to prove there is a size-preserving reduction from Minimum Vertex Cover to Strong-MIS (with two line metrics).

There is a vertex cover of size KK for GG \Rightarrow there is an inconsistent set of size KK for (={ρ1,ρ2};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\};X): Assume there is a vertex cover of GG with size KK denoted by V={v1,,vK}V^{\prime}=\{v_{1},\cdots,v_{K}\}. Then the corresponding set S^={v^1,,v^K}{\widehat{S}}=\{{\widehat{v}}_{1},\cdots,{\widehat{v}}_{K}\} is a solution for (={ρ1,ρ2};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\};X), since this set covers all edges and thus all possible conflict quartets. The remaining nodes will have no conflict quartet.

``"``\Leftarrow" direction: Conversely, assume there is an optimal inconsistent set with size KK. It is clear that there is always an optimal solution to (={ρ1,ρ2};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\};X) which only includes v^i{\widehat{v}}_{i}s. The reason is that including v^i{\widehat{v}}_{i} can potentially cover multiple conflict quartets, but re,lr_{e,l} (or re,rr_{e,r}) will only cover at most one conflict. For any solution containing re,lr_{e,l} (or re,rr_{e,r}), including one of the endpoints of ee will also cover the conflict (covered by re,lr_{e,l}). And it can potentially cover other conflicts and thus reduce the size of the solution. Assume one optimal solution is S={v^1,,v^K}S^{*}=\{{\widehat{v}}_{1},\cdots,{\widehat{v}}_{K}\} which covers all conflict quartets, then the corresponding node set {v1,,vK}\{v_{1},\cdots,v_{K}\} is a vertex cover covering all edges due to the one-to-one correspondence between edges in EE and quartets in 𝒞\mathcal{C}. ∎

A.3 Proof of Theorem 3.2

Proof.

We prove this theorem via a reduction from the Max 2-SAT problem.

Description of reduction. Suppose we are given an instance (C,X)(C,X) of the Max 2-SAT problem, where C={c1,,cm1,cm1+1,cm1+m2}C=\{c_{1},...,c_{m_{1}},c_{m_{1}+1},c_{m_{1}+m_{2}}\} is a set of clauses. There are m1m_{1} clauses with one literal and m2m_{2} clauses with two literals. And X={x1,,xn}X=\{x_{1},...,x_{n}\} is a set of variables. The decision version of the Max 2-SAT problem is that, given (C,X)(C,X) and an integer KK, is there an assignment to variables such that KK clauses are satisfied?

From the instance of Max 2-SAT problem, we construct the following instance of Strong-MIS (={ρ1,ρ2};X^)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\};\widehat{X}) where ρ1{\rho}_{1} and ρ2{\rho}_{2} are ultrametrics. In particular, we will set up each metric via its representing tree as introduced in Definition 2.8. As we will see below, both representing trees of ρ1,ρ2{\rho}_{1},{\rho}_{2} will have the same structure; while the only difference is the height functions over internal nodes.

First, we will describe the tree structure TT (which will be common for both ρ1{\rho}_{1} and ρ2{\rho}_{2}). The root of TT has m1+m2+1m_{1}+m_{2}+1 children, {B,c^1,,c^m1+m2}\{B,{\widehat{c}}_{1},...,{\widehat{c}}_{m_{1}+m_{2}}\}, where c^i{\widehat{c}}_{i}s correspond to iith clauses in CC. The node BB has |Ch(B)|=2m1+4m2+1|{\texttt{Ch}}(B)|=2m_{1}+4m_{2}+1 children denoted by b1,,b2m1+4m2+1b_{1},...,b_{2m_{1}+4m_{2}+1} (here for simplicity, we use Ch(B){\texttt{Ch}}(B) to denote the children set of BB). The size of Ch(B){\texttt{Ch}}(B) is designed to be larger than the total number of literals (allowing duplicates) appeared in the clause set CC. We show later that one can assume a maximum consensus always contain all nodes from Ch(B){\texttt{Ch}}(B). Each c^i{\widehat{c}}_{i} has at most two children li1,li2{x1,,xn,x1¯,,xn¯}l_{i1},l_{i2}\in\{x_{1},...,x_{n},\bar{x_{1}},...,\bar{x_{n}}\} corresponding to the literals in clause cic_{i}, and each ll has two leaf nodes dds as its children. For example, if clause cic_{i} has two literals, then c^i{\widehat{c}}_{i} will have two children li1,li2l_{i1},l_{i2}; And li1l_{i1} has two children (leaves) di1,di2d_{i1},d_{i2}, li2l_{i2} has two children (leaves) di3,di4d_{i3},d_{i4}. If clause cic_{i} has only one literal, then c^i{\widehat{c}}_{i} will have one child li1l_{i1} which has two children (leaves) di1d_{i1}, di2d_{i2}. See an illustration of the tree structure TT in Figure 6.

Recall that the leaf set of this representing tree TT corresponds to the note set of the metric. Hence, in the constructed Strong-MIS instance, we have that the note set is X^={b1,,b2m1+4m2+1,d11,d12,,dm11,dm12,,d(m1+m2)1,d(m1+m2)2,d(m1+m2)3,d(m1+m2)4}\widehat{X}=\{b_{1},...,b_{2m_{1}+4m_{2}+1},d_{11},d_{12},\cdots,d_{m_{1}1},d_{m_{1}2},\cdots,d_{(m_{1}+m_{2})1},\\ d_{(m_{1}+m_{2})2},d_{(m_{1}+m_{2})3},d_{(m_{1}+m_{2})4}\} with size 2m1+4m2+1+2m1+4m2=8m2+4m1+12m_{1}+4m_{2}+1+2m_{1}+4m_{2}=8m_{2}+4m_{1}+1.

Refer to caption
Figure 6: Given an instance of Max 2-SAT with C=x1x3¯(x1¯x2)(x2¯x3)C=x_{1}\wedge\bar{x_{3}}\wedge(\bar{x_{1}}\vee x_{2})\wedge(\bar{x_{2}}\vee x_{3}), the representing tree structure of the constructed ultrametrics is shown on the right. cic_{i} corresponds to the iith clause in CC. li1l_{i1} (and li2l_{i2}) are its literals. The corresponding literals are shown in the bracket.

Now we equip this tree structure TT with two height functions h1h_{1} and h2h_{2} mapping internal nodes of TT to real values. It generates metrics ρ1{\rho}_{1} and ρ2{\rho}_{2}, respectively. In particular, recall that given any two leaves zz and zz^{\prime}, their distance ρi(z,z)=hi(LCA(z,z)){\rho}_{i}(z,z^{\prime})=h_{i}({\texttt{LCA}}(z,z^{\prime})). We set up h1,h2h_{1},h_{2} so that h1(x1)<h1(x1¯)<h1(x2)<h1(x2¯)<<h1(xn)<h1(xn¯)<h1(B)<h1(c^1)<<h1(c^m1+m2)h_{1}(x_{1})<h_{1}(\bar{x_{1}})<h_{1}(x_{2})<h_{1}(\bar{x_{2}})<...<h_{1}(x_{n})<h_{1}(\bar{x_{n}})<h_{1}(B)<h_{1}({\widehat{c}}_{1})<...<h_{1}({\widehat{c}}_{m_{1}+m_{2}}) and h2(x1¯)<h2(x1)<h2(x2¯)<h2(x2)<<h2(xn¯)<h2(xn)<h2(c^1)<<h2(c^m1+m2)<h2(B)h_{2}(\bar{x_{1}})<h_{2}(x_{1})<h_{2}(\bar{x_{2}})<h_{2}(x_{2})<...<h_{2}(\bar{x_{n}})<h_{2}(x_{n})<h_{2}({\widehat{c}}_{1})<...<h_{2}({\widehat{c}}_{m_{1}+m_{2}})<h_{2}(B). Note that the precise value of each height does not matter as only the ordering of pairwise distances matters. We use (T1,h1)(T_{1},h_{1}) and (T2,h2)(T_{2},h_{2}) (T1T_{1} and T2T_{2} have the same tree structure as TT) to denote the two representing trees for ρ1{\rho}_{1} and ρ2{\rho}_{2}, respectively.

To see how the maximum consensus problem for ({ρ1,ρ2},X^)(\{{\rho}_{1},{\rho}_{2}\},\widehat{X}) relates to the maximum satisfiability of (C,X)(C,X), note the following: (1) Intuitively, the height functions h1h_{1} and h2h_{2} guarantee that, for a consensus X^X^\widehat{X}^{\prime}\subset\widehat{X}, we cannot include both children of xix_{i} and x¯i\bar{x}_{i}, the heights of xix_{i} and x¯i\bar{x}_{i} have opposite orders and thus inconsistent w.r.t. ρ1{\rho}_{1} and ρ2{\rho}_{2}. For instance, in the example of Figure 6, one cannot include d11,d12d_{11},d_{12} (children of l11(x1)l_{11}(x_{1})) and d31,d32d_{31},d_{32} (children of l31(x1¯)l_{31}(\bar{x_{1}})) in X^\widehat{X}^{\prime}. (2) The heights of the internal node BB guarantee that for each c^i{\widehat{c}}_{i}, one can only choose either one of its leaf nodes or two of its leaf nodes which are the children of the same literal (See Figure 8). For the instance shown in Figure 6, we can include both d41d_{41} and d42d_{42} in X^\widehat{X}^{\prime} since they are the children of the same literal l41l_{41}. However, one cannot include both d41d_{41} and d43d_{43} in X^\widehat{X}^{\prime} because h1(B)<h1(c^4)h_{1}(B)<h_{1}({\widehat{c}}_{4}) and h2(c^4)<h2(B)h_{2}({\widehat{c}}_{4})<h_{2}(B).

Basically, for the construction, the consensus can include two leaf nodes when the corresponding clause is satisfied; Otherwise it can only include one leaf node. Below we show that there is an assignment of XX satisfying KK clauses in CC if and only if there is a consensus of X^X^\widehat{X}^{\prime}\subset\widehat{X} of cardinality |X^|=|Ch(B)|+m1+m2+K|\widehat{X}^{\prime}|=|{\texttt{Ch}}(B)|+m_{1}+m_{2}+K.

````\Rightarrow^{\prime\prime} direction:

Assume there is an assignment to variables of {C,X}\{C,X\} such that KK clauses are satisfied. Then we have a consensus of size |Ch(B)|+m1+m2+K|{\texttt{Ch}}(B)|+m_{1}+m_{2}+K as follows. We first include all bib_{i}s. For each satisfied clause, we randomly selected one of its literal with true value and keep both of its children. For the clauses not satisfied, we keep any one of its four leaves. The selected node set is strongly consistent, because (1): there are no leaf nodes from different literals of a clause; 2). we do not keep both leaf nodes from two literals (say, xix_{i} and xi¯\bar{x_{i}}) of the same variable as we only keep both leaves of true literals. The size of the set is |Ch(B)|+m1+m2+K|{\texttt{Ch}}(B)|+m_{1}+m_{2}+K.

More specifically, w.l.o.g., assume nodes in X^\widehat{X}^{\prime} are {b1,,b2m1+4m2+1,,d11,d12,,dK1,dK2,,d(m1+m2)1}\{b_{1},\cdots,b_{2m_{1}+4m_{2}+1},\cdots,\\ d_{11},d_{12},\cdots,\\ d_{K1},d_{K2},\cdots,d_{(m_{1}+m_{2})1}\} (See Figure 7). We can also assume that the corresponding literals for l11,,lK1l_{11},\cdots,l_{K1} are x1,,xKx_{1},\cdots,x_{K}. Then all pairwise distances in ρ1{\rho}_{1} form the set {h1(x1),h1(x2),h1(xK),h1(B),h1(root)}\{h_{1}(x_{1}),h_{1}(x_{2}),h_{1}(x_{K}),h_{1}(B),h_{1}(root)\} (and {h2(x1),,h2(xK),h2(B),h2(root)}\{h_{2}(x_{1}),\cdots,\\ h_{2}(x_{K}),h_{2}(B),h_{2}(root)\} for ρ2{\rho}_{2}). One can check the order of those heights are the same in two metrics, and thus X^\widehat{X}^{\prime} is a consensus.

Refer to caption
Figure 7: An example of leaf nodes included in X^\widehat{X}^{\prime}. For each satisfied clause {c1,,cK}\{c_{1},\cdots,c_{K}\}, we can include two out of four leaves (these two leaves have to be the children of the same literal). For unsatisfied clauses, we can keep one leaf. This leads to a consensus with |Ch(B)|+m1+m2+K|{\texttt{Ch}}(B)|+m_{1}+m_{2}+K nodes. The internal nodes with only one child can be ignored (e.g., l(m1+m2)1l_{(m_{1}+m_{2})1}).

``"``\Leftarrow" direction: Assume there is maximum consensus of (={ρ1,ρ2};X^)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\};\widehat{X}) with |Ch(B)|+m1+m2+K|{\texttt{Ch}}(B)|+m_{1}+m_{2}+K nodes. Then we want to show that there is an assignment satisfying at least KK clauses based on the following observations.

  1. 1.

    There is always an optimal maximum consensus that keeps all bib_{i}s. Simply including all bib_{i}s will produce a consensus with |Ch(B)|=2m1+4m2+1|{\texttt{Ch}}(B)|=2m_{1}+4m_{2}+1 nodes. If one optimal solution does not include all bib_{i}s, it must have removed at least 2m1+4m22m_{1}+4m_{2} bib_{i}s to avoid any conflict. In this case, there are only at most 2m1+4m2+12m_{1}+4m_{2}+1 nodes left.

  2. 2.

    For the restricted tree on the nodes from the maximum consensus (including all bib_{i}s), there are only two possible options for a subtree rooted at c^i{\widehat{c}}_{i}s (See Figure 8). The reason is that we cannot keep leaf nodes of both literals (e.g., we cannot include both d31d_{31} and d33d_{33} for the case in Figure 6), which will cause a conflict since h1(B)<h1(c^i)h_{1}(B)<h_{1}({\widehat{c}}_{i}) but h2(c^i)<h2(B)h_{2}({\widehat{c}}_{i})<h_{2}(B).

  3. 3.

    For the literal with both children included, there is no conflict. Therefore, e.g., if xix_{i} has both children included, then xi¯\bar{x_{i}} cannot. This is because the heights of xix_{i} and xi¯\bar{x_{i}} are not consistent, their four leaf nodes form a conflict quartet.

Refer to caption
Figure 8: Two cases of subtrees rooted at c^i{\widehat{c}}_{i}. The indexes of literals or leaves are not important, so they are omitted.

Based on these observations, there is an optimal solution having exactly KK subtrees rooted at c^i{\widehat{c}}_{i}s with two leaves selected. The two leaves must be the children of one literal. Also, there will not be any contradiction from those KK literals with two children selected (i.e., xix_{i}, xi¯\bar{x_{i}} cannot both have two children selected) We can assign true value to these literals, and the KK corresponding clauses are satisfied. ∎

A.4 Proof of Theorem 3.3

Proof.

We prove this theorem via a reduction from Minimum Vertex Cover.

Description of reduction. Given an instance of Minimum Vertex Cover, G=(V,E)G=(V,E), where V={v1,v2,,vn}V=\{v_{1},v_{2},...,v_{n}\}, E={e1,e2,,em}E=\{e_{1},e_{2},...,e_{m}\}. We construct an instance (;X)(\mathcal{M};X) of Strong-MIS, where ={ρ1,ρ2}\mathcal{M}=\{{\rho}_{1},{\rho}_{2}\} (ρ1,ρ2{\rho}_{1},{\rho}_{2} are two arbitrary metrics) and X={r1,,rn,v^1,,v^n}X=\{r_{1},...,r_{n},{\widehat{v}}_{1},...,{\widehat{v}}_{n}\}. Here v^1,,v^n{\widehat{v}}_{1},...,{\widehat{v}}_{n} correspond to nodes {v1,v2,,vn}\{v_{1},v_{2},...,v_{n}\} in VV.

For these two metrics, ρ1(ri,v^j)=ρ2(ri,v^j)=1{\rho}_{1}(r_{i},{\widehat{v}}_{j})={\rho}_{2}(r_{i},{\widehat{v}}_{j})=1, and ρ1(ri,rj)=ρ2(ri,rj)=1{\rho}_{1}(r_{i},r_{j})={\rho}_{2}(r_{i},r_{j})=1, for i,j\forall i,j. Those pairwise distances are used as standards which will be compared with pairwise distances of ρ1(v^i,v^j){\rho}_{1}({\widehat{v}}_{i},{\widehat{v}}_{j}) and ρ2(v^i,v^j){\rho}_{2}({\widehat{v}}_{i},{\widehat{v}}_{j}), for those pairwise distances,

ρ1(v^i,v^j)={1+ϵ if (vi,vj)E,0, if i=j,1 otherwise.ρ2(v^i,v^j)={1ϵ if (vi,vj)E,0, if i=j,1 otherwise.{\rho}_{1}({\widehat{v}}_{i},{\widehat{v}}_{j})=\begin{cases}1+\epsilon\text{ if }(v_{i},v_{j})\in E,\\ 0,\text{ if }i=j,\\ 1\text{ otherwise.}\end{cases}{\rho}_{2}({\widehat{v}}_{i},{\widehat{v}}_{j})=\begin{cases}1-\epsilon\text{ if }(v_{i},v_{j})\in E,\\ 0,\text{ if }i=j,\\ 1\text{ otherwise.}\end{cases}

Clearly, for a fixed edge (vi,vj)E(v_{i},v_{j})\in E, (v^i,v^j,rk,rl),1k,ln({\widehat{v}}_{i},{\widehat{v}}_{j},r_{k},r_{l}),\forall 1\leq k,l\leq n is a conflict quartet, since ρ1(v^i,v^j)>ρ1(rk,rl)=1{\rho}_{1}({\widehat{v}}_{i},{\widehat{v}}_{j})>{\rho}_{1}(r_{k},r_{l})=1 but ρ2(v^i,v^j)<ρ2(rk,rl)=1{\rho}_{2}({\widehat{v}}_{i},{\widehat{v}}_{j})<{\rho}_{2}(r_{k},r_{l})=1. We remark that there is an optimal inconsistent set SS^{*} of (;X)(\mathcal{M};X) which does not have any point from {r1,,rn}\{r_{1},...,r_{n}\}. It is because |S|<n|S^{*}|<n by removing all v^i{\widehat{v}}_{i}’s except one; And to cover any conflict quartet, one has to remove at least n1n-1 rir_{i}s.

Now, we want to show that there is an optimal vertex cover of size KK if and only if there is an inconsistent subset of size KK.

``"``\Rightarrow" direction: If there is a vertex cover of size KK for GG, denoted by V={v1,,vK}V^{\prime}=\{v_{1},...,v_{K}\}. Then the corresponding set S^={v^1,,v^K}{\widehat{S}}=\{{\widehat{v}}_{1},...,{\widehat{v}}_{K}\} is a inconsistent set, because for the remaining nodes {v^i|v^iX\S^}\{{\widehat{v}}_{i}\,|\,{\widehat{v}}_{i}\in X\backslash{\widehat{S}}\}, their pairwise distances are all 1 in both metrics, therefore they form a consensus together with rir_{i}s.

``"``\Leftarrow" direction: Conversely, if there is a minimum inconsistent set with size KK (notice that KK is always smaller than nn and only contains nodes from {v^1,,v^n}\{{\widehat{v}}_{1},...,{\widehat{v}}_{n}\}), denoted as S={v^1,,v^K}S^{*}=\{{\widehat{v}}_{1},...,{\widehat{v}}_{K}\}, then there is a vertex cover {v1,,vk}\{v_{1},...,v_{k}\}. That is because, by removing v^1,,v^K{\widehat{v}}_{1},...,{\widehat{v}}_{K}, the remaining elements form a consensus set. Thus for any edge (vi,vj)E(v_{i},v_{j})\in E, there is at least one of v^i,v^j{\widehat{v}}_{i},{\widehat{v}}_{j} in SS^{*}.

A.5 Missing Figure in Theorem 3.4

The following figure shows an illustration for the 4-approximation algorithm for Strong-MIS problem.

Refer to caption
Figure 9: An illustration for our algorithm. In iteration ii, the kk pointers are pointing to the head of those kk lists. When testing the second list, there is already a conflict. {a,b,e}\{a,b,e\} are included in the inconsistent set SS. In the beginning of the next iteration i+1i+1, it first skips all the edges with nodes in SS. Edges in green are considered in each iteration.

A.6 Proof of Claim 4.1

Proof.

Given a mixed graph GG, assume U={u0,u1,,uj}U=\{u_{0},u_{1},...,u_{j}\} is the directed cycle with smallest length and j>2j>2. For any edge (ui,ui+1),0ik(u_{i},u_{i+1}),0\leq i\leq k (addition modulo k+1k+1), it is either undirected or from uiu_{i} to ui+1u_{i+1}. W.l.o.g., let (u0,u1)(u_{0},u_{1}) be a directed edge. Consider the edge connecting u1u_{1} and uju_{j}, there are three cases. If the edge connecting u1u_{1} and uju_{j} is undirected or from u1u_{1} to uju_{j}, {u0,u1,uj}\{u_{0},u_{1},u_{j}\} form a directed triangle. Otherwise (i.e., the edge is from uju_{j} to u1u_{1}), then {u1,,uj}\{u_{1},...,u_{j}\} forms a smaller directed cycle. Both cases contradict with the assumption of smallest length. ∎

A.7 Missing proof of Theorem 4.1

Proof.

We prove this theorem via a reduction from the Minimum Vertex Cover problem. Here we propose a similar construction of Theorem 3.1.

Description of the reduction. Suppose we are given a Minimum Vertex Cover instance G=(V,E)G=(V,E), V={v1,,vn}V=\{v_{1},...,v_{n}\}, E={e1,,em}E=\{e_{1},...,e_{m}\}. We construct an instance (={ρ1,ρ2,ρ3};X}\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\};X\} of the Weak-MIS problem, where ρ1,ρ2,ρ3{\rho}_{1},{\rho}_{2},{\rho}_{3} are line metrics and X={v^1,,v^n,re1,l1,re1,l2,re1,r1,re1,r2,,rem,l1,rem,l2,rem,r1,rem,r2}X=\{{\widehat{v}}_{1},...,{\widehat{v}}_{n},r_{e_{1},l_{1}},r_{e_{1},l_{2}},r_{e_{1},r_{1}},r_{e_{1},r_{2}},...,r_{e_{m},l_{1}},r_{e_{m},l_{2}},r_{e_{m},r_{1}},\\ r_{e_{m},r_{2}}\}. Here, we also use ``[]1"``[\cdot]_{1}", ``[]2"``[\cdot]_{2}", and ``[]3"``[\cdot]_{3}" to represent the coordinates in ρ1{\rho}_{1}, ρ2{\rho}_{2} and ρ3{\rho}_{3}, respectively. When the coordinate is the same across all three metrics or there is no ambiguity, we will omit the subscript and use ``[]"``[\cdot]" instead.

The coordinates are constructed as follow, and the intuition is that only triples like {(v^i,v^j),(re,l1,re,r1),(re,l2,re,r2)}\{({\widehat{v}}_{i},{\widehat{v}}_{j}),(r_{e,l_{1}},r_{e,r_{1}}),(r_{e,l_{2}},r_{e,r_{2}})\} (where e=(vi,vj)e=(v_{i},v_{j})) can potentially cause a directed triangle (See Figure 10 for an example):

  1. 1.

    For all of these line metrics, the coordinates [v^i]=2i1[{\widehat{v}}_{i}]=2^{i-1}.

  2. 2.

    For all of these metrics, [rej,l1]=2n+4j4[r_{e_{j},l_{1}}]=2^{n+4j-4}, and [rej,l2]=2n+4j2[r_{e_{j},l_{2}}]=2^{n+4j-2}.

  3. 3.

    For each edge e=(vi,vj)e=(v_{i},v_{j}), in ρ1{\rho}_{1}, [re,r1]1=[re,l1]1+[v^i]1[v^j]12ϵ[r_{e,r_{1}}]_{1}=[r_{e,l_{1}}]_{1}+[{\widehat{v}}_{i}]_{1}-[{\widehat{v}}_{j}]_{1}-2\epsilon, [re,r2]1=[re,l2]1+[v^i]1[v^j]1ϵ[r_{e,r_{2}}]_{1}=[r_{e,l_{2}}]_{1}+[{\widehat{v}}_{i}]_{1}-[{\widehat{v}}_{j}]_{1}-\epsilon; in ρ2{\rho}_{2}, [re,r1]2=[re,l1]2+[v^i]2[v^j]2+ϵ[r_{e,r_{1}}]_{2}=[r_{e,l_{1}}]_{2}+[{\widehat{v}}_{i}]_{2}-[{\widehat{v}}_{j}]_{2}+\epsilon, [re,r2]2=[re,l2]2+[v^i]2[v^j]2+2ϵ[r_{e,r_{2}}]_{2}=[r_{e,l_{2}}]_{2}+[{\widehat{v}}_{i}]_{2}-[{\widehat{v}}_{j}]_{2}+2\epsilon; in ρ3{\rho}_{3}, [re,r1]3=[re,l1]3+[v^i]3[v^j]3+ϵ[r_{e,r_{1}}]_{3}=[r_{e,l_{1}}]_{3}+[{\widehat{v}}_{i}]_{3}-[{\widehat{v}}_{j}]_{3}+\epsilon, [re,r2]3=[re,l2]3+[v^i]3[v^j]3ϵ[r_{e,r_{2}}]_{3}=[r_{e,l_{2}}]_{3}+[{\widehat{v}}_{i}]_{3}-[{\widehat{v}}_{j}]_{3}-\epsilon.

Refer to caption
Figure 10: An example of constructed three line metrics given graph G=(V,E)G=(V,E).

Comparable pairwise distances. Similarly as Theorem 3.1, we will use the concept of comparable distances. Here two distances are comparable if they are different and differ by at most 6ϵ6\epsilon. The constant factor for ϵ\epsilon is changed accordingly since the difference between the coordinates of the same point in two metrics is at most 3ϵ3\epsilon (instead of 2ϵ2\epsilon in Theorem 3.1). The leading terms of v^i{\widehat{v}}_{i}s, re,l1r_{e,l_{1}} (or re,r1r_{e,r_{1}}) and re,l2r_{e,l_{2}} (or re,r2r_{e,r_{2}}) are in different scales. For any edge ek=(vi,vj)e_{k}=(v_{i},v_{j}) (here we overuse kk for simplicity), the comparable pairs are (recall that (a,b),(c,d)(a,b),(c,d) are comparable only when either [b],[d][b],[d] or [c],[d][c],[d] have the same leading term):

  1. 1.

    (v^i,v^j)({\widehat{v}}_{i},{\widehat{v}}_{j}), (rek,l1,rek,r1)(r_{e_{k},l_{1}},r_{e_{k},r_{1}}) and (rek,l2,rek,r2)(r_{e_{k},l_{2}},r_{e_{k},r_{2}}). The distances are around 2j12i12^{j-1}-2^{i-1}.

  2. 2.

    (v^i,rek,l1)({\widehat{v}}_{i},r_{e_{k},l_{1}}) and (v^j,rek,r1)({\widehat{v}}_{j},r_{e_{k},r_{1}}). The distances are around 2n+4k42i12^{n+4k-4}-2^{i-1}.

  3. 3.

    (v^i,rek,l2)({\widehat{v}}_{i},r_{e_{k},l_{2}}) and (v^j,rek,r2)({\widehat{v}}_{j},r_{e_{k},r_{2}}). The distances are around 2n+4k22i12^{n+4k-2}-2^{i-1}.

  4. 4.

    (rek,l1,rek,l2)(r_{e_{k},l_{1}},r_{e_{k},l_{2}}) and (rek,r1,rek,r2)(r_{e_{k},r_{1}},r_{e_{k},r_{2}}). The distances are around 2n+4k22n+4k42^{n+4k-2}-2^{n+4k-4}.

A triangle can potentially be directed only when all three underlying edges are mutually comparable.

For any triangle, there are three cases in total depending on the scale of distances of three underlying edges (x1,x2),(y1,y2),(z1,z2)X×X(x_{1},x_{2}),(y_{1},y_{2}),(z_{1},z_{2})\in X\times X.

  1. 1.

    If the distances of these three pairs all are not comparable, the ϵ\epsilon-distortion will not affect the orders of the distances at all. Therefore, the three metrics should agree on the ordering of these pairwise distances. And clearly, the triangle is not directed.

  2. 2.

    If two of them have comparable distance, e.g., (x1,x2)(x_{1},x_{2}) and (y1,y2)(y_{1},y_{2}). Then w.l.o.g., assume (z1,z2)(z_{1},z_{2}) has much larger distance compared with those two. Then the edge in the auxiliary graph from (z1,z2)(z_{1},z_{2}) to the other two are all outgoing, thus the triangle is not directed. Otherwise, if (z1,z2)(z_{1},z_{2}) has a smaller distance, the edges are all incoming and the triangle is also not directed.

  3. 3.

    Only when all three edges have comparable distances, there will be possibility that the triangle is directed. If we look at the triangle formed by {(v^i,v^j),(re,l1,re,r1),(re,l2,re,r2)|eE,e=(vi,vj)}\{({\widehat{v}}_{i},{\widehat{v}}_{j}),(r_{e,l_{1}},r_{e,r_{1}}),(r_{e,l_{2}},r_{e,r_{2}})|e\in E,e=(v_{i},v_{j})\}. We have relations (v^i,v^j)<(re,l1,re,r1)({\widehat{v}}_{i},{\widehat{v}}_{j})<(r_{e,l_{1}},r_{e,r_{1}}), (re,l1,re,r1)<(re,l2,re,r2)(r_{e,l_{1}},r_{e,r_{1}})<(r_{e,l_{2}},r_{e,r_{2}}) and (re,l2,re,r2)<(v^i,v^j)(r_{e,l_{2}},r_{e,r_{2}})<({\widehat{v}}_{i},{\widehat{v}}_{j}) by plurality vote. And they indeed form a directed triangle from (v^i,v^j)({\widehat{v}}_{i},{\widehat{v}}_{j}) to (re,l2,re,r2)(r_{e,l_{2}},r_{e,r_{2}}) to (re,l1,re,r1)(r_{e,l_{1}},r_{e,r_{1}}), and back to (v^i,v^j)({\widehat{v}}_{i},{\widehat{v}}_{j}).

Therefore, the directed triangles are {(v^i,v^j),(re,l1,re,r1),(re,l2,re,r2)|e=(vi,vj)E}\{({\widehat{v}}_{i},{\widehat{v}}_{j}),(r_{e,l_{1}},r_{e,r_{1}}),(r_{e,l_{2}},r_{e,r_{2}})\,|\,e=(v_{i},v_{j})\in E\}. Including v^i{\widehat{v}}_{i}s in the inconsistent set is better than including re,lr_{e,l} or re,rr_{e,r} since it can possibly cover multiple directed triangles. Thus, there is always an optimal solution consisting of only v^i{\widehat{v}}_{i}s. One can always replace re,lr_{e,l} and re,rr_{e,r}s by v^i{\widehat{v}}_{i}s (where v^ie{\widehat{v}}_{i}\in e) and still cover the directed triangles. Now we are ready to show that there is a vertex cover of size KK for GG if and only if there is a weak inconsistent set of size KK for (={ρ1,ρ2,ρ3};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\};X).

``"``\Rightarrow" direction: Assume V={v1,,vK}V^{\prime}=\{v_{1},\cdots,v_{K}\} is a vertex cover for GG. Then clearly S^={v^1,,v^K}{\widehat{S}}=\{{\widehat{v}}_{1},\cdots,{\widehat{v}}_{K}\} is a weak inconsistent set. This is because S^{\widehat{S}} covers all potential directed triangles.

\Leftarrow” direction: We showed that there is always an optimal solution only consisting of v^i{\widehat{v}}_{i}s. Assume the optimal solution is S^={v^1,,v^K}{\widehat{S}}^{*}=\{{\widehat{v}}_{1},\cdots,{\widehat{v}}_{K}\}. To cover all directed triangles, for any edge e=(vi,vj)Ee=(v_{i},v_{j})\in E, there must be at least one corresponding node v^i{\widehat{v}}_{i} or v^j{\widehat{v}}_{j} contained in the inconsistent set. Therefore, V={v1,,vK}V^{\prime}=\{v_{1},\cdots,v_{K}\} is a vertex cover of GG.

A.8 Proof of Lemma 4.2

Proving statement (i) of Lemma 4.2.

To prove (i), note that if w=(p,q)w=(p,q) splits, then LCA(p,q)LCA(p,q) is the root rr in each of the three representing trees, and thus U?(p,q)=10U_{?}(p,q)=10, for ?{X,Y,Z}?\in\{X,Y,Z\}. Now take any two other graph nodes w1=(p1,q1),w2=(p2,q2)P×Pw_{1}=(p_{1},q_{1}),w_{2}=(p_{2},q_{2})\in P\times P. A simple case analysis shows that no matter these two other pairs split or not, the three graph nodes w,w1,w2w,w_{1},w_{2} cannot form a directed triangle. For example, if neither w1,w2w_{1},w_{2}, then there will be two directed edges (w,w1)(w,w_{1}) and (w,w2)(w,w_{2}) in the auxiliary graph 𝒢\mathcal{G}, and thus no matter what direction the edge between w1w_{1} and w2w_{2} is, the triangle w,w1,w2w,w_{1},w_{2} cannot be directed. This proves (i).

Proving statement (ii) of Lemma 4.2.

Note that by (i), to check whether auxiliary graph 𝒢\mathcal{G} has directed triangles or not, we only need to consider the subgraph 𝒢^\widehat{\mathcal{G}} of 𝒢\mathcal{G} spanned by nodes which do not split. Hence each node (p,q)(p,q) will be such that p,qp,q\in\boxtimes where \boxtimes could be 𝖠,𝖡,𝖢,𝖣{\mathsf{A}},{\mathsf{B}},{\mathsf{C}},{\mathsf{D}} or 𝖲{\mathsf{S}}; in this case, we say that (p,q)(p,q) is of type-\boxtimes. Two nodes are of the same type if they are both of type-\boxtimes (i.e, both of type-𝖠{\mathsf{A}}, 𝖡{\mathsf{B}}, etc).

For such a generic non-splitting node (p,q)(p,q), we list its distance under ultrametric UX,UYU_{X},U_{Y}, and UZU_{Z}, respectively, which is also the height of the LCA(p,q){\texttt{LCA}}(p,q) in representing trees TX,TYT_{X},T_{Y} and TZT_{Z}, respectively. See Figure 11. In particular, for a non-splitting node (𝗌^,𝗌^)𝖲({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime})\in{\mathsf{S}}, we first only consider the case that 𝗌^,𝗌^{\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime} do not share coordinate. (The only non-splitting nodes not covered by the table in Figure 11 is of the form (𝗌^,𝗌^)({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}) where 𝗌^{\widehat{{\mathsf{s}}}} and 𝗌^{\widehat{{\mathsf{s}}}}^{\prime} have shared coordinate.)

non-splitting graph node TXT_{X} TYT_{Y} TZT_{Z}
(a,a)𝖠(a,a^{\prime})\in{\mathsf{A}} 5 3 2
(b,b)𝖡(b,b^{\prime})\in{\mathsf{B}} 4 0 4
(c,c)𝖢(c,c^{\prime})\in{\mathsf{C}} 3 2 0
(d,d)𝖣(d,d^{\prime})\in{\mathsf{D}} 1 5 1
(𝗌^,𝗌^)𝖲({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime})\in{\mathsf{S}} with no shared coordinate 2 4 5
Refer to caption
Figure 11: Left: each entry in the table gives the distance (height of LCA of this pair of points) of a non-splitting graph node under ultrametrics UXU_{X}, UYU_{Y} and UZU_{Z}, respectively (i.e., w.r.t. trees TXT_{X}, TYT_{Y} and TZT_{Z}, respectively). Right: the subgraph induced such nodes in the left cannot contain any directed triangle.

Based on these distances, the subgraphs induced by such nodes are shown in the right picture of Figure 11. Note that this subgraph does not contain any directed triangle. Combining with (i), this means that any directed triangles Δw1w2w3\Delta w_{1}w_{2}w_{3}, where wiw_{i}s are all of different types, has to contain a node (𝗌^,𝗌^)({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}) such that 𝗌^,𝗌^{\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime} have shared coordinate(s).

To prove (ii), what remains is to consider a triangle Δw1w2w3\Delta w_{1}w_{2}w_{3} where at least two of them, say w1w_{1} and w2w_{2} are of the same type. Suppose all wiw_{i}s are of types listed in the table of Figure 11, then it is easy to see that in this case, the resulting triangle can only be of the shapes in Figure 12, and thus cannot be a directed triangle. Hence at least one of wiw_{i} has to be non-splitting (by statement (i)), yet not included in the types covered in the table of Figure 11 – in other words, at least one wiw_{i} is of the form (𝗌^,𝗌^)({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}) where 𝗌^,𝗌^{\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime} have shared coordinate(s).

Refer to caption
Figure 12: Three possible directed triangles.

Putting the above two paragraphs together, statement (ii) then follows.

Proving statement (iii) of Lemma 4.2.

Finally, consider a node w=(𝗌^,𝗌^)w=({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}) of type-𝖲{\mathsf{S}}, but such that 𝗌^{\widehat{{\mathsf{s}}}} and 𝗌^{\widehat{{\mathsf{s}}}}^{\prime} have shared coordinates. In particular, 𝗌^{\widehat{{\mathsf{s}}}} and 𝗌^{\widehat{{\mathsf{s}}}}^{\prime} could share xx-coordinate, share yy-coordinate, share zz-coordinate, share both xx- and yy-coordinates, share both xx- and zz-coordinates, or share both yy- and zz-coordinates.

non-splitting graph node TXT_{X} TYT_{Y} TZT_{Z}
(a,a)𝖠(a,a^{\prime})\in{\mathsf{A}} 5 3 2
(b,b)𝖡(b,b^{\prime})\in{\mathsf{B}} 4 0 4
(c,c)𝖢(c,c^{\prime})\in{\mathsf{C}} 3 2 0
(d,d)𝖣(d,d^{\prime})\in{\mathsf{D}} 1 5 1
(𝗌^,𝗌^)𝖲({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime})\in{\mathsf{S}} with shared yy and zz coordinates 2 1 3
Refer to caption
Figure 13: Left: each entry in the table gives the distance (height of LCA of this pair of points) of a non-splitting graph node under ultrametrics UXU_{X}, UYU_{Y} and UZU_{Z}, respectively (i.e., w.r.t. trees TXT_{X}, TYT_{Y} and TZT_{Z}, respectively). Right: the subgraph induced such nodes in the left contains a directed triangle (highlighted in red) formed by w=(𝗌^,𝗌^),(c,c),(d,d)w=({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}),(c,c^{\prime}),(d,d^{\prime}).

By a simple but tedious case analysis, one can verify that in each of these 6 cases, w=(𝗌^,𝗌^)w=({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}) will form a directed with some pair of nodes from w1,w2{(a,a),(b,b),(c,c),(d,d)}w_{1},w_{2}\in\{(a,a^{\prime}),(b,b^{\prime}),(c,c^{\prime}),(d,d^{\prime})\}. For example, suppose 𝗌^{\widehat{{\mathsf{s}}}} and 𝗌^{\widehat{{\mathsf{s}}}}^{\prime} share both yy- and zz- coordinates. Then the subgraph induced by w=(𝗌^,𝗌^)w=({\widehat{{\mathsf{s}}}},{\widehat{{\mathsf{s}}}}^{\prime}) and {(a,a),(b,b),(c,c),(d,d)}\{(a,a^{\prime}),(b,b^{\prime}),(c,c^{\prime}),\\ (d,d^{\prime})\} is shown in Figure 13, and the triangle formed by ww and (c,c),(d,d)(c,c^{\prime}),(d,d^{\prime}) is a directed triangle. This thus proves statement (iii).

As commented in the main text, the heights of these nodes are computed by a computer program to guarantee the three statements.

A.9 Proof of Theorem 4.3

Proof.

We prove the theorem via a reduction from Minimum Vertex Cover to Weak-MIS.

Description of the reduction. Given an instance of Minimum Vertex Cover, G=(V,E)G=(V,E), V={v1,,vn}V=\{v_{1},...,v_{n}\} and E={e1,,em}E=\{e_{1},...,e_{m}\}. We construct an instance (={ρ1,ρ2,ρ3};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\};X) of Weak-MIS with 3 metrics on node set X={re1,,rem,v^1,,v^n}X=\{r_{e_{1}},...,r_{e_{m}},{\widehat{v}}_{1},...,{\widehat{v}}_{n}\}. We will assign an index number for each edge, and will use that index to assign distances between nodes incident to that edge.

For any edge ek=(vi,vj)e_{k}=(v_{i},v_{j}), we construct the following gadget (Figure 14), where M=3mM=3m and ϵ1\epsilon\ll 1.

Refer to caption
Figure 14: A directed triangle.

All the other pairwise distances are 2M2M. One can check that, in the auxiliary graph 𝒢\mathcal{G}, a directed triangle (if exists) always corresponds to an edge e=(vi,vj)e=(v_{i},v_{j}), and the triangle is consist of (v^i,v^j),(v^i,re)({\widehat{v}}_{i},{\widehat{v}}_{j}),({\widehat{v}}_{i},r_{e}) and (v^j,re)({\widehat{v}}_{j},r_{e}). It is because that the only comparable distances are triples of form {(v^i,v^j),(v^i,re),(v^j,re)}\{({\widehat{v}}_{i},{\widehat{v}}_{j}),({\widehat{v}}_{i},r_{e}),({\widehat{v}}_{j},r_{e})\}. Now we prove that there is an vertex cover of size KK for GG if and only if there is a (weakly) inconsistent set of size KK for (={ρ1,ρ2,ρ3};X)(\mathcal{M}=\{{\rho}_{1},{\rho}_{2},{\rho}_{3}\};X).

``"``\Rightarrow" direction: Assume for graph GG, there is a vertex cover S={v1,,vK}S=\{v_{1},...,v_{K}\}. Then S^={v^1,,v^K}\widehat{S}=\{{\widehat{v}}_{1},...,{\widehat{v}}_{K}\} is a Weak-MIS, since each directed triangle corresponds to an edge in GG, and has at least one relevant node in SS (S^\widehat{S}).

``"``\Leftarrow" direction: If we have a Weak-MIS of size KK, denoted as S^={v^1,v^K}\widehat{S}^{*}=\{{\widehat{v}}_{1},...{\widehat{v}}_{K}\} (there is always an optimal solution only consisting of v^i{\widehat{v}}_{i}s). There is no directed triangle means that for each edge, at least one endpoint is in S={v1,,vK}S=\{v_{1},...,v_{K}\}, which is a vertex cover of size KK.

A.10 Summary of Hardness Results

We end this section with a summary table showing hardness results in different cases.

Problem Input Hardness Inapprox. factor Approx. algorithm
Strong-MIS 2 Arbitrary metrics NP-complete (Theorem 3.3) 2 UGC (Corollary 3.1) 4 (Theorem 3.4)
Strong-MIS 2 Line metrics Weakly NP-complete (Theorem 3.1) ? 4
Strong-MIS 2 Ultrametrics NP-complete (Theorem 3.2) ? 4
Weak-MIS 3 Arbitrary metrics NP-complete 2 UGC (Theorem 4.3) 6 (Theorem 4.4)
Weak-MIS 3 Line metrics Weakly NP-complete (Theorem 4.1) ? 6
Weak-MIS 3 Ultrametrics NP-complete (Theorem 4.2) ? 6
Table 2: Hardness results for different cases.

Appendix B NP-complete Problem Repository

Here we list the NP-complete problems that are used in the hardness proofs.

Definition B.1 (Minimum Vertex Cover).

[11]
Instance: Graph G=(V,E)G=(V,E) and a positive integer KK.
Question: Does GG have a vertex cover of size at most KK?

A vertex cover is a set of nodes VVV^{\prime}\subset V that every edge has at least one endpoint in VV^{\prime}. This is a classical problem mentioned in Karp’s 21 np-complete problems.

Definition B.2 (Max 2-SAT).

[9]
Instance: Given a boolean expression EE of nn variables in conjunctive normal form (CNF) that is the conjunction of mm clauses over nn variables, each of which is the disjunction of at most two distinct literals. An integer KK.
Question: Is there an assignment to variables such that KK clauses are satisfied?

Definition B.3 (Tournament Feedback Vertex Set).

[7]
Instance: A tournament (fully connected directed graph) TT and an integer KK.
Question: Is there a vertex set SS with at most KK nodes whose deletion will result in an acyclic directed graph.

In [7], Dom showed that Feedback Vertex Set problem is NP-complete.

Definition B.4 (3-dimensional Matching).

[11]
Instance: Let X,Y,ZX,Y,Z be three disjoint sets with the same size. And 𝒮X×Y×Z{\mathcal{S}}\subset X\times Y\times Z consists of triples (x,y,z)(x,y,z) where xX,yY,zZx\in X,y\in Y,z\in Z. Given (X,Y,Z;𝒮)(X,Y,Z;{\mathcal{S}}) and an integer KK.
Question: Is there a 3-dimensional matching Π𝒮\Pi\subset{\mathcal{S}} with size at least KK?

This is also a problem in the list of Karp’s 21 np-complete problems.