
The computational complexity of some explainable clustering problems

Eduardo Laber
PUC-Rio
Brazil
[email protected]
Abstract

We study the computational complexity of some explainable clustering problems in the framework proposed by [Dasgupta et al., ICML 2020], where explainability is achieved via axis-aligned decision trees. We consider the $k$-means, $k$-medians, $k$-centers and the spacing cost functions. We prove that the first three are hard to optimize while the latter can be optimized in polynomial time.

1 Introduction

Machine learning models and algorithms have been used in a number of systems that take decisions affecting our lives. Explainable methods are therefore desirable, so that people can better understand the behavior of these systems, which allows for their confident use or, eventually, the questioning of their applicability [1].

Recently, there has been some effort to devise explainable methods for unsupervised learning tasks, in particular, for clustering [2, 3]. We investigate the framework discussed in [2], where an explainable clustering is given by a partition, induced by the leaves of an axis-aligned decision tree, that optimizes some predefined objective function.

Figure 1 shows a decision tree that defines a clustering for the Iris dataset. The clustering has three groups, each of them corresponding to a leaf. The explanation of the group associated with the rightmost leaf is Sepal Length > 0.4 AND Petal Width < 0.5.

Figure 1: An explainable clustering with 3 groups for the Iris dataset.

Following [2], a series of papers [4, 5, 6, 7, 8] provided algorithms, with provable guarantees, to build decision trees that induce explainable clusterings. Several cost functions to guide the construction of the clustering were investigated, such as the $k$-means, $k$-centers, $k$-medians and maximum-spacing cost functions. Despite this active research, the only work in the field that tackles the computational complexity of building explainable clusterings is [9], where it was proved that optimizing the $k$-means and the $k$-medians cost functions is NP-complete. Here, we improve these results and also investigate the computational complexity of both the $k$-centers and the spacing cost functions.

1.1 Problem definition

Let ${\cal X}$ be a finite set of points in $\mathbb{R}^{d}$. We say that a decision tree is standard if each internal node $v$ is associated with a test (cut), specified by a coordinate $i_{v}\in[d]$ and a real value $\theta_{v}$, that partitions the points in ${\cal X}$ that reach $v$ into two sets: those having coordinate $i_{v}$ smaller than or equal to $\theta_{v}$ and those having it larger than $\theta_{v}$. The leaves of a standard decision tree induce a partition of $\mathbb{R}^{d}$ into axis-aligned boxes and, naturally, a partition of ${\cal X}$ into clusters.
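To make the definition concrete, here is a minimal sketch (ours; the names and structure are illustrative, not from the paper) of a standard decision tree and of the partition of ${\cal X}$ it induces:

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

@dataclass
class Node:
    """A node of a standard decision tree: internal nodes carry a cut (i, theta)."""
    cut: Optional[Tuple[int, float]] = None  # None for a leaf
    left: Optional["Node"] = None            # points with x[i] <= theta
    right: Optional["Node"] = None           # points with x[i] >  theta

def leaf_of(tree: Node, x: List[float]) -> Node:
    """Route point x to the leaf of the tree that it reaches."""
    while tree.cut is not None:
        i, theta = tree.cut
        tree = tree.left if x[i] <= theta else tree.right
    return tree

def induced_partition(tree: Node, X: List[List[float]]) -> dict:
    """Group the points of X by the leaf they reach (the induced clustering)."""
    clusters: dict = {}
    for x in X:
        clusters.setdefault(id(leaf_of(tree, x)), []).append(x)
    return clusters
```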

Let $k\geq 2$ be an integer. The clustering problems considered here consist of finding a partition of ${\cal X}$ into $k$ groups, among those that can be induced by a standard decision tree with $k$ leaves, that optimizes a given objective function. For the $k$-means, $k$-medians and $k$-centers cost functions, in addition to the partition, a representative $\mu(C)\in\mathbb{R}^{d}$ for each group $C$ must also be output.

For the $k$-means problem the objective (cost function) to be minimized is the Sum of Squared Euclidean Distances (SSED) between each point $\mathbf{x}\in{\cal X}$ and the representative of the cluster where $\mathbf{x}$ lies. Mathematically, the cost (SSED) of a partition ${\cal C}=(C_{1},\ldots,C_{k})$ of ${\cal X}$ is given by

$$\sum_{i=1}^{k}\sum_{\mathbf{x}\in C_{i}}||\mathbf{x}-\mu(C_{i})||_{2}^{2}.$$

For the $k$-medians problem the cost of a partition ${\cal C}=(C_{1},\ldots,C_{k})$ is given by

$$\sum_{i=1}^{k}\sum_{\mathbf{x}\in C_{i}}||\mathbf{x}-\mu(C_{i})||_{1}.$$

The $k$-centers problem is also a minimization problem; its cost function for a partition ${\cal C}=(C_{1},\ldots,C_{k})$ is given by

$$\max_{i=1,\ldots,k}\max_{\mathbf{x}\in C_{i}}\{||\mathbf{x}-\mu(C_{i})||_{2}\}.$$

Let $dist:{\cal X}\times{\cal X}\mapsto\mathbb{R}^{+}$ be a distance function. The maximum-spacing problem consists of finding a partition with at least $k$ groups that has maximum spacing, where the spacing $sp({\cal C})$ of a partition ${\cal C}$ is defined as

$$sp({\cal C})=\min\{dist(\mathbf{x},\mathbf{y}):\mathbf{x}\mbox{ and }\mathbf{y}\mbox{ lie in distinct groups of }{\cal C}\}.$$

In contrast to the other criteria, spacing is an inter-cluster criterion.
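To fix ideas, the sketch below (ours, not from the paper) evaluates the four objectives on a given partition, represented as a list of numpy arrays. We use the mean for $k$-means and the coordinate-wise median for $k$-medians (the latter is optimal, cf. Fact 3 below), take the representatives as an explicit argument for $k$-centers, and assume Euclidean $dist$ for the spacing.

```python
import numpy as np
from itertools import combinations

def kmeans_cost(clusters):
    """SSED: squared L2 distances to each cluster's mean (optimal representative)."""
    return sum(((C - C.mean(axis=0)) ** 2).sum() for C in clusters)

def kmedians_cost(clusters):
    """Sum of L1 distances to the coordinate-wise median of each cluster."""
    return sum(np.abs(C - np.median(C, axis=0)).sum() for C in clusters)

def kcenters_cost(clusters, reps):
    """Largest L2 distance from a point to the representative of its cluster."""
    return max(np.linalg.norm(C - r, axis=1).max() for C, r in zip(clusters, reps))

def spacing(clusters):
    """Minimum distance between two points lying in distinct clusters."""
    return min(np.linalg.norm(x - y)
               for A, B in combinations(clusters, 2) for x in A for y in B)
```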

We note that an optimal solution of the unrestricted version of any of these problems, in which the decision tree constraint is not enforced, might be a partition that is hard to explain in terms of the input features; hence the motivation for using standard decision trees.

For the sake of simplicity, throughout this text, by explainable clustering we mean a clustering that is obtained via decision trees.

1.2 Our contributions

In Section 2, we first show that the problem of building a partition via decision trees that minimizes the $k$-means cost function does not admit a $(1+\epsilon)$-approximation in polynomial time, for some $\epsilon>0$, unless $P=NP$. Then, we show that analogous results hold for both the $k$-medians and $k$-centers cost functions. Our results for the $k$-means and $k$-medians are stronger than the NP-hardness results recently established in [9], and they help to formally justify the quest for approximation algorithms and/or heuristics for these cost functions.

In Section 3, we propose a polynomial time algorithm that produces an explainable clustering of maximum spacing. As far as we know, this is the first efficient method that produces an optimal explainable clustering with respect to some well-studied metric.

1.3 Related work

Our research is inspired by the recent work of [2], where the problem of building explainable clusterings, via standard decision trees, for both the $k$-means and the $k$-medians cost functions is studied. That paper proposes algorithms with provable approximation bounds for building explainable clusterings. In addition, it investigates the price of explainability for these cost functions, which is the unavoidable gap between the cost of the optimal explainable clustering and that of the optimal unconstrained one. Among their results, they showed that the price of explainability for the $k$-means and the $k$-medians is $O(k^{2})$ and $O(k)$, respectively.

Their results were refined/improved by a series of recent papers [4, 5, 7, 8, 6]. Currently, the best upper bound for the $k$-medians is $O(\log k\log\log k)$ [5, 7], while for the $k$-means it is $O(k\log k)$ [7]. The study of bounds that depend on the dimension $d$ was initiated in [4], where the authors present an $O(d\log k)$ upper bound for the $k$-medians and an $O(dk\log k)$ upper bound for the $k$-means. These bounds were improved to $O(d\log^{2}d)$ for the $k$-medians [7] and $O(k^{1-2/d}\,poly(d,\log k))$ for the $k$-means [6].

The price of explainability was also investigated for other cost functions. In [4], Laber and Murtinho considered the $k$-centers and maximum-spacing cost functions. In [5], Makarychev and Shan considered the $k$-medoids problem ($k$-medians with $\ell_{2}$ objective). Finally, in [8], Gamlath et al. addressed $\ell_{p}^{p}$ objectives.

The aforementioned papers, except [4], which also presents experiments, are mainly theoretical. However, there are also a number of papers that propose algorithms (without theoretical guarantees) for building explainable clusterings; among them we cite [10, 11, 3].

The computational complexity of building explainable clusterings via decision trees for both the $k$-means and the $k$-medians problems is studied in [9]. It is shown that both problems admit polynomial time algorithms when either $k$ or $d$ is constant and that they are NP-complete for arbitrary $k$ and $d$. In addition, they show that an optimal explainable clustering cannot be found in $f(k)\cdot|{\cal X}|^{o(k)}$ time for any computable function $f(\cdot)$, unless the Exponential Time Hypothesis (ETH) fails.

When we turn to standard (non-explainable) clustering, the problems of optimizing the $k$-means, $k$-medians and $k$-centers cost functions are APX-hard [12, 13, 14], and all of them admit polynomial time algorithms with constant approximation factors [15, 16, 17]. With regard to the spacing cost function, the single-link algorithm, a very popular method for building hierarchical clusterings, produces a partition with maximum spacing [18, Chapter 4].

2 Hardness of the $k$-means, $k$-medians and $k$-centers cost functions

2.1 Background

We start by recalling some basic definitions and facts that are useful for studying the hardness of optimization problems (see, e.g., [19, Chapter 29]).

Given a minimization problem $\mathbb{A}$ and a parameter $\epsilon>0$, we define the $\epsilon$-Gap-$\mathbb{A}$ problem as the problem of deciding, for an instance $I$ of $\mathbb{A}$ and a parameter $k$, whether: (i) $I$ admits a solution of value at most $k$; or (ii) every solution of $I$ has value at least $(1+\epsilon)k$. In such a gap decision problem it is tacitly assumed that the instances are either of type (i) or of type (ii).

Fact 1.

If for a minimization problem $\mathbb{A}$ there exists $\epsilon>0$ such that the $\epsilon$-Gap-$\mathbb{A}$ problem is NP-hard, then no polynomial time $(1+\epsilon)$-approximation algorithm exists for $\mathbb{A}$ unless $P=NP$.

We will use the following definition of a gap-preserving reduction.

Definition 1.

Let $\mathbb{A},\mathbb{B}$ be minimization problems. A gap-preserving reduction from $\mathbb{A}$ to $\mathbb{B}$ is a polynomial time algorithm that, given an instance $x$ of $\mathbb{A}$ and a value $k$, produces an instance $y$ of $\mathbb{B}$ and a value $\kappa$ such that there exist constants $\epsilon,\eta>0$ for which:

  1. if $OPT(x)\leq k$ then $OPT(y)\leq\kappa$;

  2. if $OPT(x)>(1+\epsilon)k$ then $OPT(y)>(1+\eta)\kappa$.

Fact 2.

Fix minimization problems $\mathbb{A},\mathbb{B}$. If there exists $\epsilon$ such that the $\epsilon$-Gap-$\mathbb{A}$ problem is NP-hard and there exists a gap-preserving reduction from $\mathbb{A}$ to $\mathbb{B}$, then there exists $\eta$ such that the $\eta$-Gap-$\mathbb{B}$ problem is NP-hard.

We will now specialize the above definitions for restricted variants of the problem of finding a minimum vertex cover in a graph and for our clustering problems.

Definition 2.

For every $\epsilon>0$, the $\epsilon$-Gap-MinVC-B-TF (gap) decision problem is defined as follows: given a triangle-free graph $G=(V,E)$ with bounded degree and an integer $k$, decide whether $G$ has a vertex cover of size $k$ or all vertex covers of $G$ have size at least $k(1+\epsilon)$.

The $\epsilon$-Gap-MinVC-3B-TF (gap) decision problem has a similar definition; the only difference is that, in addition to being triangle-free, the graphs are required to be 3-bounded, that is, all of their vertices have degree at most 3.

The NP-hardness of $\epsilon$-Gap-MinVC-B-TF and of $\epsilon$-Gap-MinVC-3B-TF was established in [13] and [20], respectively.

Definition 3.

For every $\eta>0$, the $\eta$-Gap-Explainable-$k$-means (gap) decision problem is defined as follows: given a set of points ${\cal X}$, an integer $k$, and a value $\kappa$, decide whether there exists an explainable $k$-clustering ${\cal C}=(C_{1},\ldots,C_{k})$ of the points in ${\cal X}$ such that the $k$-means cost of ${\cal C}$ is at most $\kappa$, or whether for each explainable $k$-clustering ${\cal C}$ of ${\cal X}$ the $k$-means cost of ${\cal C}$ is at least $(1+\eta)\kappa$.

The $\eta$-Gap-Explainable-$k$-medians and $\eta$-Gap-Explainable-$k$-centers decision problems are defined analogously.

To prove the hardness of the $k$-means cost function we use a gap-preserving reduction from the $\epsilon$-Gap-MinVC-B-TF decision problem. To handle both the $k$-centers and the $k$-medians, we use the $\epsilon$-Gap-MinVC-3B-TF decision problem.

Our reductions have some common ingredients that we explain here. For all of them, given a graph $G=(V,E)$, where $V=\{1,\ldots,n\}$, we build an instance of the clustering problem under consideration by mapping every edge $e$ in $E$ onto a point $\mathbf{v}^{e}=(v^{e}_{1},\ldots,v^{e}_{n})$ in $\{0,1\}^{n}$, where $v^{e}_{i}=1$ if vertex $i$ is incident on $e$ and $v^{e}_{i}=0$ otherwise. This is exactly the mapping proposed in [13] to establish that the (standard) $k$-means problem is APX-hard. We use ${\cal X}_{G}:=\{\mathbf{v}^{e}\,|\,e\in E\}$ to denote the input of the resulting clustering instance.
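A small sketch (ours) of this mapping; vertices are 1-indexed as in the text, while array coordinates are 0-indexed:

```python
import numpy as np

def edge_points(n, edges):
    """Map each edge {i, j} of a graph on vertices {1,..,n} to its
    incidence vector v^e in {0,1}^n, as in the reduction of [13]."""
    X = np.zeros((len(edges), n), dtype=int)
    for row, (i, j) in enumerate(edges):
        X[row, i - 1] = 1  # coordinate i records that vertex i touches e
        X[row, j - 1] = 1
    return X

# Example: the 4-cycle 1-2-3-4-1 yields four points in {0,1}^4.
X = edge_points(4, [(1, 2), (2, 3), (3, 4), (4, 1)])
```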

Let $S=\{i_{1},i_{2},\ldots,i_{k}\}$ be a cover of size $k$ for $G$, where each $i_{j}$ is an integer in $[n]$ and $i_{j}<i_{j+1}$. We define ${\cal C}_{S}=(E_{1},\ldots,E_{k})$ as the $k$-clustering induced by $S$ on the points in ${\cal X}_{G}$, where the group $E_{j}$ includes all points $\mathbf{v}$ that simultaneously satisfy: component $i_{j}$ of $\mathbf{v}$ is $1$ and component $i_{j^{\prime}}$ of $\mathbf{v}$, for every $j^{\prime}<j$, is $0$.

Proposition 1.

The clustering ${\cal C}_{S}$ is explainable.

Proof.

${\cal C}_{S}$ is the clustering induced by a decision tree with $k-1$ internal nodes, with exactly one internal node per level. The internal node at level $j$ is associated with the cut $(i_{j},1/2)$. ∎
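Concretely, the tree of the proof is a path (a "caterpillar"): the node at level $j$ sends the points with coordinate $i_{j}$ above $1/2$ to a leaf and the remaining points to the next level. A short sketch (ours, reusing edge_points from the previous snippet) of the partition it induces:

```python
import numpy as np

def clustering_from_cover(X, cover):
    """Partition the points X (rows = incidence vectors v^e) according to the
    caterpillar tree with cuts (i_j, 1/2), i_j in sorted(cover): each point
    goes to the group E_j of the first cover vertex it touches."""
    cover = sorted(cover)
    groups = {i: [] for i in cover}
    for v in X:
        j = next(i for i in cover if v[i - 1] == 1)  # exists: S covers every edge
        groups[j].append(v)
    # for a minimal cover every group is non-empty; drop empty ones otherwise
    return [np.array(groups[i]) for i in cover if groups[i]]
```

For the 4-cycle above and the cover $\{2,4\}$, clustering_from_cover(X, [2, 4]) returns the two stars centred at vertices 2 and 4.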

2.2 Hardness of the $k$-means cost function

We prove that the problem of finding an explainable clustering with minimum $k$-means cost is hard to approximate. The reduction employed here is the one used by [13] to show that it is hard to find a $(1+\epsilon)$-approximation for the $k$-means cost function. The extra ingredient in our proof is the construction, described in the previous section, of an explainable clustering ${\cal C}_{S}$ from a vertex cover $S$.

Theorem 1.

The problem of building an explainable clustering, via decision trees, that minimizes the $k$-means cost function does not admit a $(1+\epsilon)$-approximation, for some $\epsilon>0$, in polynomial time unless $P=NP$.

Proof.

Let $G$ be a triangle-free graph that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size larger than $k(1+\epsilon)$.

First, we consider the case where $G$ has a vertex cover $S=\{i_{1},i_{2},\ldots,i_{k}\}$ of size $k$. We show that, in this case, the cost of ${\cal C}_{S}=(E_{1},\ldots,E_{k})$ is at most $|E|-k$. Let us take as the representative of $E_{j}$ the mean of its points, that is, the point that has $1$ at coordinate $i_{j}$ and $1/|E_{j}|$ at the remaining $|E_{j}|$ coordinates with non-zero values. The squared distance of each point in $E_{j}$ to this representative is given by

$$\left(1-\frac{1}{|E_{j}|}\right)^{2}+(|E_{j}|-1)\times\left(\frac{1}{|E_{j}|}\right)^{2}=1-\frac{1}{|E_{j}|} \qquad (1)$$

Thus, $E_{j}$ contributes $|E_{j}|-1$ to the total cost. The cost of the clustering ${\cal C}_{S}$ is, then, given by

$$\sum_{j=1}^{k}(|E_{j}|-1)=|E|-k.$$

Now, it remains to argue that if the minimum vertex cover of $G$ has size at least $(1+\epsilon)k$ then every explainable clustering for the corresponding instance has cost at least $|E|-(1-\Omega(\epsilon))k$. This follows from [13], where it is shown that in this case every clustering (and, in particular, every explainable one) has cost at least $|E|-(1-\Omega(\epsilon))k$.

This concludes a gap-preserving reduction from $\epsilon$-Gap-MinVC-B-TF to $\eta$-Gap-Explainable-$k$-means. ∎

2.3 Hardness of the $k$-medians cost function

We prove that the problem of finding an explainable clustering with minimum $k$-medians cost is hard to approximate. We show a gap-preserving reduction from the $\epsilon$-Gap-MinVC-3B-TF problem to the $\eta$-Gap-Explainable-$k$-medians problem.

The following well-known fact will be useful.

Fact 3.

Let $C$ be a set of points in $\mathbb{R}^{d}$ and let $\mu(C)\in\mathbb{R}^{d}$ be the point for which

$$\sum_{\mathbf{x}\in C}||\mathbf{x}-\mu(C)||_{1}$$

is minimum. Then, for each $i\in[d]$, the value of coordinate $i$ of $\mu(C)$ is the median of the values of the points in $C$ on coordinate $i$.

The following lemma will also be useful.

Lemma 1.

Let $G$ be a 3-bounded triangle-free graph and let $C\subseteq{\cal X}_{G}$ be a group of points corresponding to $p$ edges of $G$. We have that: (i) if the edges of $C$ form a star then the $k$-medians cost of $C$ is $p$; and (ii) if they do not form a star then the $k$-medians cost of $C$ is at least $(4/3)p$.

Proof.

By the previous fact, the representative of $C$ that yields the minimum $k$-medians cost is a point in $\{0,1\}^{n}$ whose coordinate $i$ has value 1 if and only if the number of edges of $C$ that touch vertex $i$ is larger than $p/2$. Thus, the cost of the cluster $C$ is given by

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\},$$

where $d_{C}(i)$ is the number of edges of $C$ that touch vertex $i$.
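As a quick sanity check, this cost is easy to evaluate directly from the cluster's degree sequence (a sketch of ours; edges are given as vertex pairs):

```python
from collections import Counter

def kmedians_cluster_cost(edges):
    """Optimal k-medians cost of the cluster {v^e : e in edges},
    i.e. sum_i min{p - d(i), d(i)} with d(i) the degree of i in the cluster."""
    p = len(edges)
    degree = Counter(v for e in edges for v in e)  # vertices with d(i)=0 contribute 0
    return sum(min(p - d, d) for d in degree.values())

# A star with 3 edges costs p = 3; a path with 3 edges costs 4 = (4/3)p.
assert kmedians_cluster_cost([(1, 2), (1, 3), (1, 4)]) == 3
assert kmedians_cluster_cost([(1, 2), (2, 3), (3, 4)]) == 4
```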

If $C$ is a star centred at vertex $j$ then $\min\{p-d_{C}(j),d_{C}(j)\}=0$ and $\min\{p-d_{C}(i),d_{C}(i)\}=1$ for every other vertex $i$ of the star. Thus,

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=\sum_{i\neq j}1=p.$$

If $C$ is not a star then we have the following cases:

Case 1) $d_{C}(i)\leq p/2$ for all $i$. We have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=\sum_{i=1}^{n}d_{C}(i)=2p.$$

Note that the above case covers all $p\geq 6$, since the maximum degree in $G$ is at most 3.

Case 2) $p=5$ and $d_{C}(j)=3$ for exactly one vertex $j$. We have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=d_{C}(j)-1+\sum_{i\neq j}d_{C}(i)=2p-1=9=1.8p.$$

Case 3) $p=5$ and $d_{C}(j)=d_{C}(j^{\prime})=3$ for exactly two vertices $j$ and $j^{\prime}$. We have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=4+\sum_{i\notin\{j,j^{\prime}\}}d_{C}(i)=2p-2=8=1.6p.$$

Note that we cannot have 3 vertices with degree 3 when $p=5$.

Case 4) $p=4$ and $d_{C}(j)=3$ for some $j$. We must have exactly one $j$ with $d_{C}(j)=3$, for otherwise we would have more than $4$ edges. Thus,

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=d_{C}(j)-2+\sum_{i\neq j}d_{C}(i)=2p-2=6=1.5p.$$

Case 5) $p=3$ and $d_{C}(j)=2$ for some $j$. There are two possible non-isomorphic graphs: one consists of a path with 2 edges plus a disjoint edge, while the other is a path with 3 edges. In both cases we have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}\geq 4=(4/3)p.$$

∎

Theorem 2.

The problem of building an explainable clustering, via decision trees, that minimizes the $k$-medians cost function does not admit a $(1+\epsilon)$-approximation, for some $\epsilon>0$, in polynomial time unless $P=NP$.

Proof.

Let $G$ be a triangle-free graph with maximum degree at most 3 that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $k(1+\epsilon)$.

First, consider the case where $G$ has a vertex cover $S$ of size $k$. Since the clustering ${\cal C}_{S}$ consists of stars, it follows from the previous lemma that its cost is $|E|$.

Now, assume that all vertex covers of $G$ have size at least $k(1+\epsilon)$. Let ${\cal C}$ be a clustering with $k$ groups for the corresponding $k$-medians instance.

Let $t$ be the number of groups in ${\cal C}$ that are stars and let $p$ be the total number of edges in the remaining clusters. Since there is no vertex cover of $G$ with size smaller than $k(1+\epsilon)$, we must have

$$t+p\geq k(1+\epsilon),$$

for otherwise we could obtain a cover of $G$ with size smaller than $k(1+\epsilon)$ by using one vertex per star and one additional vertex for each of the $p$ edges. Since $t\leq k$, it follows that $p\geq k\epsilon$. Moreover, we may assume that $k\geq|E|/3$: every vertex of $G$ has degree at most $3$, so any vertex cover has at least $|E|/3$ vertices. Thus, from the previous lemma, the cost of the clustering ${\cal C}$ is at least

$$\frac{4p}{3}+(|E|-p)=|E|+\frac{p}{3}\geq|E|+\frac{k\epsilon}{3}\geq|E|+\frac{|E|\epsilon}{9}=|E|\left(1+\frac{\epsilon}{9}\right).$$

This concludes a gap-preserving reduction from $\epsilon$-Gap-MinVC-3B-TF to $\eta$-Gap-Explainable-$k$-medians. ∎

2.4 Hardness of the $k$-centers cost function

In this section we discuss the computational complexity of minimizing the $k$-centers cost function. We show a gap-preserving reduction from the $\epsilon$-Gap-MinVC-3B-TF problem to the $\eta$-Gap-Explainable-$k$-centers problem.

Theorem 3.

The problem of building an explainable clustering, via decision trees, that minimizes the $k$-centers cost function does not admit a $(1+\epsilon)$-approximation, for some $\epsilon>0$, in polynomial time unless $P=NP$.

Proof.

Let $G=(V,E)$ be a triangle-free graph with maximum degree at most 3 that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $k(1+\epsilon)$.

First, consider the case where $G$ has a vertex cover $S$ of size $k$. In this case, the clustering ${\cal C}_{S}=(E_{1},\ldots,E_{k})$ consists of stars with at most 3 edges each. As the representative of $E_{j}$ we use, as in the proof of Theorem 1, the mean of the points that lie in $E_{j}$.

Thus, the distance of each point in $E_{j}$ to its representative is the square root of the right-hand side of (1), which is at most $\sqrt{2/3}<1$ since $G$ has maximum degree 3.

Now, we assume that $G$ does not have a vertex cover with $k$ vertices. Let ${\cal C}$ be a clustering with $k$ groups for the edges of $E$. One of the groups, say $A$, does not have a vertex that touches all the edges in $A$. Pick the vertex, say $v$, that touches the largest number of edges in $A$, and consider an edge $e=yz$ in $A$ that does not touch $v$. We show that there is another edge $e^{\prime}$ in $A$ that does not intersect $e$. In fact, pick an edge $f=vw$ in $A$. If $f$ does not intersect $e$ ($w$ is not an endpoint of $e$), we set $e^{\prime}=f$. Otherwise, we assume w.l.o.g. that $f$ intersects $e$ at $y$, that is, $w=y$. We know that $vz$ is not an edge, for otherwise we would have a triangle $vwz$ in $G$. Since $v$ touches the largest number of edges in $A$ and $y$ touches at least two edges of $A$ (namely $e$ and $f$), $v$ must touch an edge $f^{\prime}=vz^{\prime}$ other than $f$, with $z^{\prime}\neq y$ and $z^{\prime}\neq z$. We set $e^{\prime}=f^{\prime}$.

We now argue that the distance of the representative $\mu(A)$ of $A$ to either $\mathbf{v}^{e}$ or $\mathbf{v}^{e^{\prime}}$ is at least $1$. For that, we consider the values of $\mu(A)$ at the four coordinates corresponding to the endpoints of $e$ and $e^{\prime}$. Let $\mu_{1},\mu_{2},\mu_{3},\mu_{4}$ be these values. We have that

$$||\mathbf{v}^{e}-\mu(A)||^{2}+||\mathbf{v}^{e^{\prime}}-\mu(A)||^{2}\geq\sum_{i=1}^{4}\left((1-\mu_{i})^{2}+\mu_{i}^{2}\right)=4-2(\mu_{1}+\mu_{2}+\mu_{3}+\mu_{4})+2(\mu_{1}^{2}+\mu_{2}^{2}+\mu_{3}^{2}+\mu_{4}^{2})\geq 2.$$

Thus, either $\mathbf{v}^{e}$ or $\mathbf{v}^{e^{\prime}}$ is at distance at least $1$ from the representative of $A$, so the $k$-centers cost of ${\cal C}$ is at least $1$, while in case (i) it is at most $\sqrt{2/3}$. ∎

3 A polynomial time algorithm for the maximum-spacing cost function

We describe MaxSpacing, a simple greedy algorithm that finds an explainable partition of maximum spacing in polynomial time.

To simplify its description we introduce some notation. For a set of leaves $L$ in a decision tree, we use $sp(L)$ to refer to the spacing of the partition of the points of ${\cal X}$ induced by the leaves in $L$. Given a set of leaves $L$, a leaf $\ell\in L$ and an axis-aligned cut $\gamma=(i,\theta)$, we use $L_{\gamma,\ell}$ to denote the set of leaves obtained when $\gamma$ is applied to split the points that reach $\ell$. More precisely, $L_{\gamma,\ell}$ is obtained from $L$ by removing $\ell$ and adding the two leaves that are created by using $\gamma$ to split the points that reach $\ell$.

A pseudo-code for MaxSpacing is presented in Algorithm 1; a runnable sketch follows the pseudo-code. The algorithm adopts a natural greedy strategy that at each step chooses the cut that yields the partition of maximum spacing. We note that it runs in polynomial time because in Step 1 we just need to test at most $(|{\cal X}|-1)d$ axis-aligned cuts: for each $\ell\in L$ and each dimension $i\in[d]$, we sort the $|\ell|$ points that reach $\ell$ according to their coordinate $i$ and consider the cuts $(i,\theta_{j})$, for $j=1,\ldots,|\ell|-1$, where $\theta_{j}$ is the midpoint between the values of the $i$-th coordinate of the $j$th and $(j+1)$th points in the sorted list.

Algorithm 1 MaxSpacing(${\cal X}$: set of points; $k$: integer)

  Initialize a decision tree with a single leaf $\ell$ and associate it with ${\cal X}$

  $L\leftarrow\{\ell\}$

  Repeat $k-1$ times:

  1. Find a cut $\gamma$ and a leaf $\ell\in L$ that simultaneously satisfy:

     (i) $\gamma$ splits the points that reach $\ell$ into two non-empty groups;

     (ii) $sp(L_{\gamma,\ell})\geq sp(L_{\gamma^{\prime},\ell^{\prime}})$ for every leaf $\ell^{\prime}\in L$ and every axis-aligned cut $\gamma^{\prime}$ that splits the points that reach $\ell^{\prime}$ into two non-empty sets.

  2. Split leaf $\ell$ using cut $\gamma$.

  3. $L\leftarrow L_{\gamma,\ell}$.
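The pseudo-code translates directly into the following Python sketch (ours; it assumes numpy, Euclidean distances for $dist$, and that ${\cal X}$ contains at least $k$ distinct points). The spacing is recomputed from scratch for every candidate cut, so the sketch favors clarity over the efficiency analysis discussed above.

```python
import numpy as np
from itertools import combinations

def spacing(leaves):
    """Minimum Euclidean distance between points in distinct leaves."""
    return min(np.linalg.norm(x - y)
               for A, B in combinations(leaves, 2) for x in A for y in B)

def candidate_cuts(C):
    """Axis-aligned cuts (i, theta) splitting cluster C into two non-empty parts:
    midpoints between consecutive distinct values on each coordinate."""
    for i in range(C.shape[1]):
        vals = np.unique(C[:, i])
        for a, b in zip(vals, vals[1:]):
            yield i, (a + b) / 2

def max_spacing(X, k):
    """Greedy MaxSpacing: repeatedly apply the cut that maximizes the spacing."""
    leaves = [np.asarray(X, dtype=float)]
    cuts = []  # record of the chosen cuts: (index of split leaf, i, theta)
    for _ in range(k - 1):
        best = None
        for idx, C in enumerate(leaves):
            for i, theta in candidate_cuts(C):
                left, right = C[C[:, i] <= theta], C[C[:, i] > theta]
                trial = leaves[:idx] + leaves[idx + 1:] + [left, right]
                s = spacing(trial)
                if best is None or s > best[0]:
                    best = (s, trial, (idx, i, theta))
        leaves = best[1]
        cuts.append(best[2])
    return leaves, cuts
```

Returning the list of chosen cuts makes the induced decision tree explicit: each recorded triple says which leaf was split and with which axis-aligned cut.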

In what follows, we show that MaxSpacing produces an explainable partition with the maximum possible (optimal) spacing. The following simple fact will be useful.

Fact 4.

Let $d^{*}_{i}$ be the spacing of an optimal explainable partition with $i+1$ groups. Then $d^{*}_{i}\geq d^{*}_{i+1}$, for $i=1,\ldots,k-1$.

Proof.

Let $D^{*}$ be a decision tree that induces a partition with $i+2$ groups and spacing $d^{*}_{i+1}$. Let $D$ be the decision tree obtained from $D^{*}$ by removing two leaves that are siblings and turning their parent into a leaf. Let $\mathbf{x}$ and $\mathbf{y}$ be two closest points among those that reach different leaves in $D$. Since these points also reach distinct leaves in $D^{*}$, the spacing of the leaves of $D$ is not smaller than that of the leaves of $D^{*}$. Thus, $d^{*}_{i}\geq sp(\mbox{leaves of }D)\geq d^{*}_{i+1}$. ∎

Theorem 4.

For every $1\leq i\leq k-1$, the partition induced by the leaves of the MaxSpacing algorithm at the end of iteration $i$ has maximum spacing among the explainable partitions of ${\cal X}$ with $i+1$ groups.

Proof.

Let ${\cal C}^{*}_{i}$ be an optimal explainable partition with $i+1$ groups and let $d^{*}_{i}$ be its spacing. Moreover, let $L_{i}$, with $i<k-1$, be the set of leaves at the end of iteration $i$ of the MaxSpacing algorithm. By the greedy choice, $sp(L_{1})=d^{*}_{1}$. We assume by induction that the spacing of $L_{i}$ is $d^{*}_{i}$ and show that the spacing of $L_{i+1}$ is $d^{*}_{i+1}$.

For a node $\nu$ of a decision tree, let $P(\nu)$ be the set of points that reach $\nu$. Let $\nu^{*}$ be a node of the decision tree $D^{*}$ for ${\cal C}_{i+1}^{*}$ that satisfies the following: (i) for each $\ell\in L_{i}$, either $P(\ell)\subseteq P(\nu^{*})$ or $P(\ell)\cap P(\nu^{*})=\emptyset$; and (ii) some child of $\nu^{*}$ does not satisfy (i). We will use the cut associated with $\nu^{*}$ in $D^{*}$ to argue that we can properly split $L_{i}$.

To prove the existence of a node $\nu^{*}$ with such properties, it suffices to show that the root $r^{*}$ of $D^{*}$ satisfies (i) and that some leaf $\ell^{*}$ of $D^{*}$ does not, since in this case we can set $\nu^{*}$ as the last node in the path from $r^{*}$ to $\ell^{*}$ that satisfies (i). Clearly, $r^{*}$ satisfies (i). It remains to argue that some leaf $\ell^{*}\in D^{*}$ does not satisfy (i). Since the number of leaves in $L_{i}$ is smaller than the number of leaves in $D^{*}$, by the pigeonhole principle there are two leaves, say $\ell^{*}_{1}$ and $\ell^{*}_{2}$, of $D^{*}$ that contain points from the same leaf $\ell$ of $L_{i}$. Thus, $P(\ell)\cap P(\ell^{*}_{1})\neq\emptyset$ and $P(\ell)\not\subseteq P(\ell^{*}_{1})$, so $\ell^{*}_{1}$ does not satisfy (i). We set $\ell^{*}=\ell^{*}_{1}$ and let $\nu^{*}$ be the last node in the path from $r^{*}$ to $\ell^{*}_{1}$ that satisfies (i).

Let $\nu^{*}_{ch}$ be a child of $\nu^{*}$ in $D^{*}$ that does not satisfy (i). Moreover, let $\gamma$ be the cut associated with $\nu^{*}$ and let $\ell$ be a leaf in $L_{i}$ such that $P(\ell)\subseteq P(\nu^{*})$, $P(\ell)\cap P(\nu^{*}_{ch})\neq\emptyset$ and $P(\ell)\not\subseteq P(\nu^{*}_{ch})$. We show that the spacing of the set of leaves $L^{\prime}_{i}$ obtained from $L_{i}$ by applying cut $\gamma$ to $\ell$ is at least $d^{*}_{i+1}$. Let $\ell_{1}$ and $\ell_{2}$ be the two new leaves created by applying $\gamma$ to $\ell$, and let $\mathbf{x}$ and $\mathbf{y}$ be the two closest points (according to $dist$) among those that reach different leaves in $L^{\prime}_{i}$. If $\mathbf{x}$ reaches $\ell_{1}$ (resp. $\ell_{2}$) and $\mathbf{y}$ reaches $\ell_{2}$ (resp. $\ell_{1}$) then

$$sp(L^{\prime}_{i})=dist(\mathbf{x},\mathbf{y})\geq sp({\cal C}_{i+1}^{*})=d^{*}_{i+1}$$

because, due to the cut $\gamma$ applied at $\nu^{*}$, $\mathbf{x}$ and $\mathbf{y}$ lie in different groups of ${\cal C}_{i+1}^{*}$. If one of them, say $\mathbf{x}$, does not reach $\ell$, then $\mathbf{x}$ and $\mathbf{y}$ reach different leaves in $L_{i}$ and, thus,

$$sp(L^{\prime}_{i})=dist(\mathbf{x},\mathbf{y})\geq sp(L_{i})=d^{*}_{i}\geq d^{*}_{i+1},$$

where the last inequality follows from Fact 4.

We have shown that there exist a leaf $\ell$ in $L_{i}$ and a cut $\gamma$ such that applying $\gamma$ to $\ell$ yields a partition of spacing at least $d^{*}_{i+1}$. Thus, due to the greedy choice, MaxSpacing obtains a partition of spacing $d^{*}_{i+1}$ at the end of iteration $i+1$. ∎

4 Conclusions

We have shown that the problems of finding explainable clusterings (via decision trees) that optimize the classical $k$-means, $k$-medians and $k$-centers cost functions do not admit polynomial time $(1+\epsilon)$-approximations unless $P=NP$. These results help to formally justify the quest for heuristics and/or approximation algorithms.

The algorithms recently proposed in the literature for building explainable clusterings compare their costs with the costs of optimal unrestricted clusterings [2, 4, 5, 6, 7, 8]. A major open question in this line of research is whether better bounds can be obtained when the comparison is made against the optimal explainable clustering.

For the spacing cost function, we provided a simple polynomial time algorithm that computes an explainable partition of maximum spacing. Interestingly, the proof of Theorem 4 does not use the fact that the cuts are axis-aligned, and thus our result holds for any family of cuts.

References