
The computational complexity of some explainable clustering problems

Eduardo Laber
PUC-Rio
Brazil
[email protected]
Abstract

We study the computational complexity of some explainable clustering problems in the framework proposed by [Dasgupta et al., ICML 2020], where explainability is achieved via axis-aligned decision trees. We consider the $k$-means, $k$-medians, $k$-centers and the spacing cost functions. We prove that the first three are hard to optimize while the latter can be optimized in polynomial time.

1 Introduction

Machine learning models and algorithms have been used in a number of systems that take decisions affecting our lives. Explainable methods are therefore desirable, so that people can better understand the behavior of these systems, which allows for their confident use or, eventually, the questioning of their applicability [1].

Recently, there has been some effort to devise explainable methods for unsupervised learning tasks, in particular, for clustering [2, 3]. We investigate the framework discussed in [2], where an explainable clustering is given by a partition, induced by the leaves of an axis-aligned decision tree, that optimizes some predefined objective function.

Figure 1 shows a decision tree that defines a clustering for the Iris dataset. The clustering has three groups, each of them corresponding to a leaf. The explanation of the group associated with the rightmost leaf is Sepal Length > 0.4 AND Petal Width < 0.5.

Figure 1: An explainable clustering with 3 groups for the Iris dataset.

Following [2], a series of papers [4, 5, 6, 7, 8] provided algorithms, with provable guarantees, to build decision trees that induce explainable clusterings. Several cost functions to guide the construction of the clustering were investigated, such as the $k$-means, $k$-centers, $k$-medians and maximum-spacing cost functions. Despite this active research, the only work in the field that tackles the computational complexity of building explainable clusterings is [9], where it was proved that optimizing the $k$-means and the $k$-medians cost functions is NP-complete. Here, we improve these results and also investigate the computational complexity of both the $k$-centers and the spacing cost functions.

1.1 Problem definition

Let ${\cal X}$ be a finite set of points in $\mathbb{R}^{d}$. We say that a decision tree is standard if each internal node $v$ is associated with a test (cut), specified by a coordinate $i_{v}\in[d]$ and a real value $\theta_{v}$, that partitions the points in ${\cal X}$ that reach $v$ into two sets: those having coordinate $i_{v}$ smaller than or equal to $\theta_{v}$ and those having it larger than $\theta_{v}$. The leaves of a standard decision tree induce a partition of $\mathbb{R}^{d}$ into axis-aligned boxes and, naturally, a partition of ${\cal X}$ into clusters.
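To make the definition concrete, here is a minimal sketch (ours; the names and structure are illustrative, not from the paper) of a standard decision tree and of the partition of ${\cal X}$ it induces:

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

@dataclass
class Node:
    """A node of a standard decision tree: internal nodes carry a cut (i, theta)."""
    cut: Optional[Tuple[int, float]] = None  # None for a leaf
    left: Optional["Node"] = None            # points with x[i] <= theta
    right: Optional["Node"] = None           # points with x[i] >  theta

def leaf_of(tree: Node, x: List[float]) -> Node:
    """Route point x to the leaf of the tree that it reaches."""
    while tree.cut is not None:
        i, theta = tree.cut
        tree = tree.left if x[i] <= theta else tree.right
    return tree

def induced_partition(tree: Node, X: List[List[float]]) -> dict:
    """Group the points of X by the leaf they reach (the induced clustering)."""
    clusters: dict = {}
    for x in X:
        clusters.setdefault(id(leaf_of(tree, x)), []).append(x)
    return clusters
```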

Let $k\geq 2$ be an integer. The clustering problems considered here consist of finding a partition of ${\cal X}$ into $k$ groups, among those that can be induced by a standard decision tree with $k$ leaves, that optimizes a given objective function. For the $k$-means, $k$-medians and $k$-centers cost functions, in addition to the partition, a representative $\mu(C)\in\mathbb{R}^{d}$ for each group $C$ must also be output.

For the $k$-means problem the objective (cost function) to be minimized is the Sum of Squared Euclidean Distances (SSED) between each point $\mathbf{x}\in{\cal X}$ and the representative of the cluster where $\mathbf{x}$ lies. Mathematically, the cost (SSED) of a partition ${\cal C}=(C_{1},\ldots,C_{k})$ of ${\cal X}$ is given by

$$\sum_{i=1}^{k}\sum_{\mathbf{x}\in C_{i}}||\mathbf{x}-\mu(C_{i})||_{2}^{2}.$$

For the $k$-medians problem the cost of a partition ${\cal C}=(C_{1},\ldots,C_{k})$ is given by

$$\sum_{i=1}^{k}\sum_{\mathbf{x}\in C_{i}}||\mathbf{x}-\mu(C_{i})||_{1}.$$

The $k$-centers problem is also a minimization problem; its cost function for a partition ${\cal C}=(C_{1},\ldots,C_{k})$ is given by

$$\max_{i=1,\ldots,k}\max_{\mathbf{x}\in C_{i}}\{||\mathbf{x}-\mu(C_{i})||_{2}\}.$$

Let $dist:{\cal X}\times{\cal X}\mapsto\mathbb{R}^{+}$ be a distance function. The maximum-spacing problem consists of finding a partition with at least $k$ groups that has maximum spacing, where the spacing $sp({\cal C})$ of a partition ${\cal C}$ is defined as

$$sp({\cal C})=\min\{dist(\mathbf{x},\mathbf{y}):\mathbf{x}\mbox{ and }\mathbf{y}\mbox{ lie in distinct groups of }{\cal C}\}.$$

In contrast to the other criteria, spacing is an inter-cluster criterion.
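To fix ideas, the sketch below (ours, not from the paper) evaluates the four objectives on a given partition, represented as a list of numpy arrays. We use the mean for $k$-means and the coordinate-wise median for $k$-medians (the latter is optimal, cf. Fact 3 below), take the representatives as an explicit argument for $k$-centers, and assume Euclidean $dist$ for the spacing.

```python
import numpy as np
from itertools import combinations

def kmeans_cost(clusters):
    """SSED: squared L2 distances to each cluster's mean (optimal representative)."""
    return sum(((C - C.mean(axis=0)) ** 2).sum() for C in clusters)

def kmedians_cost(clusters):
    """Sum of L1 distances to the coordinate-wise median of each cluster."""
    return sum(np.abs(C - np.median(C, axis=0)).sum() for C in clusters)

def kcenters_cost(clusters, reps):
    """Largest L2 distance from a point to the representative of its cluster."""
    return max(np.linalg.norm(C - r, axis=1).max() for C, r in zip(clusters, reps))

def spacing(clusters):
    """Minimum distance between two points lying in distinct clusters."""
    return min(np.linalg.norm(x - y)
               for A, B in combinations(clusters, 2) for x in A for y in B)
```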

We note that an optimal solution of the unrestricted version of any of these problems, in which the decision tree constraint is not enforced, might be a partition that is hard to explain in terms of the input features; hence the motivation for using standard decision trees.

For the sake of simplicity, throughout this text, by explainable clustering we mean a clustering that is obtained via decision trees.

1.2 Our contributions

In Section 2, we first show that the problem of building a partition via decision trees that minimizes the $k$-means cost function does not admit a $(1+\epsilon)$-approximation in polynomial time, for some $\epsilon>0$, unless $P=NP$. Then, we show that analogous results hold for both the $k$-medians and $k$-centers cost functions. Our results for the $k$-means and $k$-medians are stronger than the NP-hardness results recently established in [9], and they help to formally justify the quest for approximation algorithms and/or heuristics for these cost functions.

In Section 3, we propose a polynomial time algorithm that produces an explainable clustering of maximum spacing. As far as we know, this is the first efficient method that produces an optimal explainable clustering with respect to some well-studied metric.

1.3 Related work

Our research is inspired by the recent work of [2], where the problem of building explainable clusterings, via standard decision trees, for both the $k$-means and the $k$-medians cost functions is studied. That paper proposes algorithms with provable approximation bounds for building explainable clusterings. In addition, it investigates the price of explainability for these cost functions, which is the unavoidable gap between the cost of the optimal explainable clustering and that of the optimal unconstrained one. Among their results, they showed that the price of explainability for the $k$-means and the $k$-medians is $O(k^{2})$ and $O(k)$, respectively.

Their results were refined/improved by a series of recent papers [4, 5, 7, 8, 6]. Currently, the best upper bound for the $k$-medians is $O(\log k\log\log k)$ [5, 7], while for the $k$-means it is $O(k\log k)$ [7]. The study of bounds that depend on the dimension $d$ was initiated in [4], where the authors present an $O(d\log k)$ upper bound for the $k$-medians and an $O(dk\log k)$ upper bound for the $k$-means. These bounds were improved to $O(d\log^{2}d)$ for the $k$-medians [7] and $O(k^{1-2/d}\,poly(d,\log k))$ for the $k$-means [6].

The price of explainability was also investigated for other cost functions. In [4], Laber and Murtinho considered the $k$-centers and maximum-spacing cost functions. In [5], Makarychev and Shan considered the $k$-medoids problem ($k$-medians with $\ell_{2}$ objective). Finally, in [8], Gamlath et al. addressed $\ell_{p}^{p}$ objectives.

The aforementioned papers, except [4], which also presents experiments, are mainly theoretical. However, there are also a number of papers that propose algorithms (without theoretical guarantees) for building explainable clusterings; among them we cite [10, 11, 3].

The computational complexity of building explainable clusterings via decision trees for both the $k$-means and the $k$-medians problems is studied in [9]. It is shown that both problems admit polynomial time algorithms when either $k$ or $d$ is constant and that they are NP-complete for arbitrary $k$ and $d$. In addition, they show that an optimal explainable clustering cannot be found in $f(k)\cdot|{\cal X}|^{o(k)}$ time for any computable function $f(\cdot)$, unless the Exponential Time Hypothesis (ETH) fails.

When we turn to standard (non-explainable) clustering, the problems of optimizing the $k$-means, $k$-medians and $k$-centers cost functions are APX-hard [12, 13, 14], and all of them admit polynomial time algorithms with constant approximation factors [15, 16, 17]. With regard to the spacing cost function, the single-link algorithm, a very popular method for building hierarchical clusterings, produces a partition with maximum spacing [18, Chapter 4].

2 Hardness of the $k$-means, $k$-medians and $k$-centers cost functions

2.1 Background

We start by recalling some basic definitions and facts that are useful for studying the hardness of optimization problems (see, e.g., [19, Chapter 29]).

Given a minimization problem $\mathbb{A}$ and a parameter $\epsilon>0$, we define the $\epsilon$-Gap-$\mathbb{A}$ problem as the problem of deciding, for an instance $I$ of $\mathbb{A}$ and a parameter $k$, whether: (i) $I$ admits a solution of value at most $k$; or (ii) every solution of $I$ has value at least $(1+\epsilon)k$. In such a gap decision problem it is tacitly assumed that the instances are either of type (i) or of type (ii).

Fact 1.

If for a minimization problem $\mathbb{A}$ there exists $\epsilon>0$ such that the $\epsilon$-Gap-$\mathbb{A}$ problem is NP-hard, then no polynomial time $(1+\epsilon)$-approximation algorithm exists for $\mathbb{A}$ unless $P=NP$.

We will use the following definition of a gap-preserving reduction.

Definition 1.

Let $\mathbb{A},\mathbb{B}$ be minimization problems. A gap-preserving reduction from $\mathbb{A}$ to $\mathbb{B}$ is a polynomial time algorithm that, given an instance $x$ of $\mathbb{A}$ and a value $k$, produces an instance $y$ of $\mathbb{B}$ and a value $\kappa$ such that there exist constants $\epsilon,\eta>0$ for which:

  1. if $OPT(x)\leq k$ then $OPT(y)\leq\kappa$;

  2. if $OPT(x)>(1+\epsilon)k$ then $OPT(y)>(1+\eta)\kappa$.

Fact 2.

Fix minimization problems $\mathbb{A},\mathbb{B}$. If there exists $\epsilon$ such that the $\epsilon$-Gap-$\mathbb{A}$ problem is NP-hard and there exists a gap-preserving reduction from $\mathbb{A}$ to $\mathbb{B}$, then there exists $\eta$ such that the $\eta$-Gap-$\mathbb{B}$ problem is NP-hard.

We will now specialize the above definitions for restricted variants of the problem of finding a minimum vertex cover in a graph and for our clustering problems.

Definition 2.

For every $\epsilon>0$, the $\epsilon$-Gap-MinVC-B-TF (gap) decision problem is defined as follows: given a triangle-free graph $G=(V,E)$ with bounded degree and an integer $k$, decide whether $G$ has a vertex cover of size $k$ or all vertex covers of $G$ have size at least $k(1+\epsilon)$.

The $\epsilon$-Gap-MinVC-3B-TF (gap) decision problem has a similar definition; the only difference is that, in addition to being triangle-free, the graphs are required to be 3-bounded, that is, all of their vertices have degree at most 3.

The NP-hardness of $\epsilon$-Gap-MinVC-B-TF and of $\epsilon$-Gap-MinVC-3B-TF was established in [13] and [20], respectively.

Definition 3.

For every $\eta>0$, the $\eta$-Gap-Explainable-$k$-means (gap) decision problem is defined as follows: given a set of points ${\cal X}$, an integer $k$, and a value $\kappa$, decide whether there exists an explainable $k$-clustering ${\cal C}=(C_{1},\ldots,C_{k})$ of the points in ${\cal X}$ such that the $k$-means cost of ${\cal C}$ is at most $\kappa$, or whether for each explainable $k$-clustering ${\cal C}$ of ${\cal X}$ the $k$-means cost of ${\cal C}$ is at least $(1+\eta)\kappa$.

The $\eta$-Gap-Explainable-$k$-medians and $\eta$-Gap-Explainable-$k$-centers decision problems are defined analogously.

To prove the hardness of the $k$-means cost function we use a gap-preserving reduction from the $\epsilon$-Gap-MinVC-B-TF decision problem. To handle both the $k$-centers and the $k$-medians, we use the $\epsilon$-Gap-MinVC-3B-TF decision problem.

Our reductions have some common ingredients that we explain here. For all of them, given a graph $G=(V,E)$, where $V=\{1,\ldots,n\}$, we build an instance of the clustering problem under consideration by mapping every edge $e$ in $E$ onto a point $\mathbf{v}^{e}=(v^{e}_{1},\ldots,v^{e}_{n})$ in $\{0,1\}^{n}$, where $v^{e}_{i}=1$ if vertex $i$ is incident on $e$ and $v^{e}_{i}=0$ otherwise. This is exactly the mapping proposed in [13] to establish that the (standard) $k$-means problem is APX-hard. We use ${\cal X}_{G}:=\{\mathbf{v}^{e}\,|\,e\in E\}$ to denote the input of the resulting clustering instance.
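A small sketch (ours) of this mapping; vertices are 1-indexed as in the text, while array coordinates are 0-indexed:

```python
import numpy as np

def edge_points(n, edges):
    """Map each edge {i, j} of a graph on vertices {1,..,n} to its
    incidence vector v^e in {0,1}^n, as in the reduction of [13]."""
    X = np.zeros((len(edges), n), dtype=int)
    for row, (i, j) in enumerate(edges):
        X[row, i - 1] = 1  # coordinate i records that vertex i touches e
        X[row, j - 1] = 1
    return X

# Example: the 4-cycle 1-2-3-4-1 yields four points in {0,1}^4.
X = edge_points(4, [(1, 2), (2, 3), (3, 4), (4, 1)])
```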

Let $S=\{i_{1},i_{2},\ldots,i_{k}\}$ be a cover of size $k$ for $G$, where each $i_{j}$ is an integer in $[n]$ and $i_{j}<i_{j+1}$. We define ${\cal C}_{S}=(E_{1},\ldots,E_{k})$ as the $k$-clustering induced by $S$ on the points in ${\cal X}_{G}$, where the group $E_{j}$ includes all points $\mathbf{v}$ that simultaneously satisfy: component $i_{j}$ of $\mathbf{v}$ is $1$ and component $i_{j^{\prime}}$ of $\mathbf{v}$, for every $j^{\prime}<j$, is $0$.

Proposition 1.

The clustering ${\cal C}_{S}$ is explainable.

Proof.

${\cal C}_{S}$ is the clustering induced by a decision tree with $k-1$ internal nodes, with exactly one internal node per level. The internal node at level $j$ is associated with the cut $(i_{j},1/2)$. ∎
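Concretely, the tree of the proof is a path (a "caterpillar"): the node at level $j$ sends the points with coordinate $i_{j}$ above $1/2$ to a leaf and the remaining points to the next level. A short sketch (ours, reusing edge_points from the previous snippet) of the partition it induces:

```python
import numpy as np

def clustering_from_cover(X, cover):
    """Partition the points X (rows = incidence vectors v^e) according to the
    caterpillar tree with cuts (i_j, 1/2), i_j in sorted(cover): each point
    goes to the group E_j of the first cover vertex it touches."""
    cover = sorted(cover)
    groups = {i: [] for i in cover}
    for v in X:
        j = next(i for i in cover if v[i - 1] == 1)  # exists: S covers every edge
        groups[j].append(v)
    # for a minimal cover every group is non-empty; drop empty ones otherwise
    return [np.array(groups[i]) for i in cover if groups[i]]
```

For the 4-cycle above and the cover $\{2,4\}$, clustering_from_cover(X, [2, 4]) returns the two stars centred at vertices 2 and 4.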

2.2 Hardness of the $k$-means cost function

We prove that the problem of finding an explainable clustering with minimum $k$-means cost is hard to approximate. The reduction employed here is the one used by [13] to show that it is hard to find a $(1+\epsilon)$-approximation for the $k$-means cost function. The extra ingredient in our proof is the construction, described in the previous section, of an explainable clustering ${\cal C}_{S}$ from a vertex cover $S$.

Theorem 1.

The problem of building an explainable clustering, via decision trees, that minimizes the $k$-means cost function does not admit a $(1+\epsilon)$-approximation, for some $\epsilon>0$, in polynomial time unless $P=NP$.

Proof.

Let $G$ be a triangle-free graph that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size larger than $k(1+\epsilon)$.

First, we consider the case where $G$ has a vertex cover $S=\{i_{1},i_{2},\ldots,i_{k}\}$ of size $k$. We show that, in this case, the cost of ${\cal C}_{S}=(E_{1},\ldots,E_{k})$ is at most $|E|-k$. Let us take as the representative of $E_{j}$ the mean of its points, that is, the point that has $1$ at coordinate $i_{j}$ and $1/|E_{j}|$ at the remaining $|E_{j}|$ coordinates with non-zero values. The squared distance of each point in $E_{j}$ to this representative is given by

$$\left(1-\frac{1}{|E_{j}|}\right)^{2}+(|E_{j}|-1)\times\left(\frac{1}{|E_{j}|}\right)^{2}=1-\frac{1}{|E_{j}|} \qquad (1)$$

Thus, $E_{j}$ contributes $|E_{j}|-1$ to the total cost. The cost of the clustering ${\cal C}_{S}$ is, then, given by

$$\sum_{j=1}^{k}(|E_{j}|-1)=|E|-k.$$

Now, it remains to argue that if the minimum vertex cover of $G$ has size at least $(1+\epsilon)k$ then every explainable clustering for the corresponding instance has cost at least $|E|-(1-\Omega(\epsilon))k$. This follows from [13], where it is shown that in this case every clustering (and, in particular, every explainable one) has cost at least $|E|-(1-\Omega(\epsilon))k$.

This concludes a gap-preserving reduction from $\epsilon$-Gap-MinVC-B-TF to $\eta$-Gap-Explainable-$k$-means. ∎

2.3 Hardness of the $k$-medians cost function

We prove that the problem of finding an explainable clustering with minimum $k$-medians cost is hard to approximate. We show a gap-preserving reduction from the $\epsilon$-Gap-MinVC-3B-TF problem to the $\eta$-Gap-Explainable-$k$-medians problem.

The following well-known fact will be useful.

Fact 3.

Let $C$ be a set of points in $\mathbb{R}^{d}$ and let $\mu(C)\in\mathbb{R}^{d}$ be the point for which

$$\sum_{\mathbf{x}\in C}||\mathbf{x}-\mu(C)||_{1}$$

is minimum. Then, for each $i\in[d]$, the value of coordinate $i$ of $\mu(C)$ is the median of the values of the points in $C$ on coordinate $i$.

The following lemma will also be useful.

Lemma 1.

Let $G$ be a 3-bounded triangle-free graph and let $C\subseteq{\cal X}_{G}$ be a group of points corresponding to $p$ edges of $G$. We have that: (i) if the edges of $C$ form a star then the $k$-medians cost of $C$ is $p$; and (ii) if they do not form a star then the $k$-medians cost of $C$ is at least $(4/3)p$.

Proof.

By the previous fact, the representative of $C$ that yields the minimum $k$-medians cost is a point in $\{0,1\}^{n}$ whose coordinate $i$ has value 1 if and only if the number of edges of $C$ that touch vertex $i$ is larger than $p/2$. Thus, the cost of the cluster $C$ is given by

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\},$$

where $d_{C}(i)$ is the number of edges of $C$ that touch vertex $i$.
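As a quick sanity check, this cost is easy to evaluate directly from the cluster's degree sequence (a sketch of ours; edges are given as vertex pairs):

```python
from collections import Counter

def kmedians_cluster_cost(edges):
    """Optimal k-medians cost of the cluster {v^e : e in edges},
    i.e. sum_i min{p - d(i), d(i)} with d(i) the degree of i in the cluster."""
    p = len(edges)
    degree = Counter(v for e in edges for v in e)  # vertices with d(i)=0 contribute 0
    return sum(min(p - d, d) for d in degree.values())

# A star with 3 edges costs p = 3; a path with 3 edges costs 4 = (4/3)p.
assert kmedians_cluster_cost([(1, 2), (1, 3), (1, 4)]) == 3
assert kmedians_cluster_cost([(1, 2), (2, 3), (3, 4)]) == 4
```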

If $C$ is a star centred at vertex $j$ then $\min\{p-d_{C}(j),d_{C}(j)\}=0$ and $\min\{p-d_{C}(i),d_{C}(i)\}=1$ for every other vertex $i$ of the star. Thus,

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=\sum_{i\neq j}1=p.$$

If $C$ is not a star then we have the following cases:

Case 1) $d_{C}(i)\leq p/2$ for all $i$. We have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=\sum_{i=1}^{n}d_{C}(i)=2p.$$

Note that the above case covers all $p\geq 6$, since the maximum degree in $G$ is at most 3.

Case 2) $p=5$ and $d_{C}(j)=3$ for exactly one vertex $j$. We have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=d_{C}(j)-1+\sum_{i\neq j}d_{C}(i)=2p-1=9=1.8p.$$

Case 3) $p=5$ and $d_{C}(j)=d_{C}(j^{\prime})=3$ for exactly two vertices $j$ and $j^{\prime}$. We have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=4+\sum_{i\notin\{j,j^{\prime}\}}d_{C}(i)=2p-2=8=1.6p.$$

Note that we cannot have 3 vertices with degree 3 when $p=5$.

Case 4) $p=4$ and $d_{C}(j)=3$ for some $j$. We must have exactly one $j$ with $d_{C}(j)=3$, for otherwise we would have more than $4$ edges. Thus,

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}=d_{C}(j)-2+\sum_{i\neq j}d_{C}(i)=2p-2=6=1.5p.$$

Case 5) $p=3$ and $d_{C}(j)=2$ for some $j$. There are two possible non-isomorphic graphs: one consists of a path with 2 edges plus a disjoint edge, while the other is a path with 3 edges. In both cases we have

$$\sum_{i=1}^{n}\min\{p-d_{C}(i),d_{C}(i)\}\geq 4=(4/3)p.$$

∎

Theorem 2.

The problem of building an explainable clustering, via decision trees, that minimizes the $k$-medians cost function does not admit a $(1+\epsilon)$-approximation, for some $\epsilon>0$, in polynomial time unless $P=NP$.

Proof.

Let $G$ be a triangle-free graph with maximum degree at most 3 that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $k(1+\epsilon)$.

First, consider the case where $G$ has a vertex cover $S$ of size $k$. Since the clustering ${\cal C}_{S}$ consists of stars, it follows from the previous lemma that its cost is $|E|$.

Now, assume that all vertex covers of $G$ have size at least $k(1+\epsilon)$. Let ${\cal C}$ be a clustering with $k$ groups for the corresponding $k$-medians instance.

Let $t$ be the number of groups in ${\cal C}$ that are stars and let $p$ be the total number of edges in the remaining clusters. Since there is no vertex cover of $G$ with size smaller than $k(1+\epsilon)$, we must have

$$t+p\geq k(1+\epsilon),$$

for otherwise we could obtain a cover of $G$ with size smaller than $k(1+\epsilon)$ by using one vertex per star and one additional vertex for each of the $p$ edges. Since $t\leq k$, it follows that $p\geq k\epsilon$. Moreover, we may assume that $k\geq|E|/3$: every vertex of $G$ has degree at most $3$, so any vertex cover has at least $|E|/3$ vertices. Thus, from the previous lemma, the cost of the clustering ${\cal C}$ is at least

$$\frac{4p}{3}+(|E|-p)=|E|+\frac{p}{3}\geq|E|+\frac{k\epsilon}{3}\geq|E|+\frac{|E|\epsilon}{9}=|E|\left(1+\frac{\epsilon}{9}\right).$$

This concludes a gap-preserving reduction from $\epsilon$-Gap-MinVC-3B-TF to $\eta$-Gap-Explainable-$k$-medians. ∎

2.4 Hardness of the $k$-centers cost function

In this section we discuss the computational complexity of minimizing the $k$-centers cost function. We show a gap-preserving reduction from the $\epsilon$-Gap-MinVC-3B-TF problem to the $\eta$-Gap-Explainable-$k$-centers problem.

Theorem 3.

The problem of building an explainable clustering, via decision trees, that minimizes the $k$-centers cost function does not admit a $(1+\epsilon)$-approximation, for some $\epsilon>0$, in polynomial time unless $P=NP$.

Proof.

Let $G=(V,E)$ be a triangle-free graph with maximum degree at most 3 that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $k(1+\epsilon)$.

First, consider the case where $G$ has a vertex cover $S$ of size $k$. In this case, the clustering ${\cal C}_{S}=(E_{1},\ldots,E_{k})$ consists of stars with at most 3 edges each. As the representative of $E_{j}$ we use, as in the proof of Theorem 1, the mean of the points that lie in $E_{j}$.

Thus, the distance of each point in $E_{j}$ to its representative is the square root of the right-hand side of (1), which is at most $\sqrt{2/3}<1$ since $G$ has maximum degree 3.

Now, we assume that $G$ does not have a vertex cover with $k$ vertices. Let ${\cal C}$ be a clustering with $k$ groups for the edges of $E$. One of the groups, say $A$, does not have a vertex that touches all the edges in $A$. Pick the vertex, say $v$, that touches the largest number of edges in $A$, and consider an edge $e=yz$ in $A$ that does not touch $v$. We show that there is another edge $e^{\prime}$ in $A$ that does not intersect $e$. In fact, pick an edge $f=vw$ in $A$. If $f$ does not intersect $e$ ($w$ is not an endpoint of $e$), we set $e^{\prime}=f$. Otherwise, we assume w.l.o.g. that $f$ intersects $e$ at $y$, that is, $w=y$. We know that $vz$ is not an edge, for otherwise we would have a triangle $vwz$ in $G$. Since $v$ touches the largest number of edges in $A$ and $y$ touches at least two edges of $A$ (namely $e$ and $f$), $v$ must touch an edge $f^{\prime}=vz^{\prime}$ other than $f$, with $z^{\prime}\neq y$ and $z^{\prime}\neq z$. We set $e^{\prime}=f^{\prime}$.

We now argue that the distance of the representative $\mu(A)$ of $A$ to either $\mathbf{v}^{e}$ or $\mathbf{v}^{e^{\prime}}$ is at least $1$. For that, we consider the values of $\mu(A)$ at the four coordinates corresponding to the endpoints of $e$ and $e^{\prime}$. Let $\mu_{1},\mu_{2},\mu_{3},\mu_{4}$ be these values. We have that

$$||\mathbf{v}^{e}-\mu(A)||^{2}+||\mathbf{v}^{e^{\prime}}-\mu(A)||^{2}\geq\sum_{i=1}^{4}\left((1-\mu_{i})^{2}+\mu_{i}^{2}\right)=4-2(\mu_{1}+\mu_{2}+\mu_{3}+\mu_{4})+2(\mu_{1}^{2}+\mu_{2}^{2}+\mu_{3}^{2}+\mu_{4}^{2})\geq 2.$$

Thus, either $\mathbf{v}^{e}$ or $\mathbf{v}^{e^{\prime}}$ is at distance at least $1$ from the representative of $A$, so the $k$-centers cost of ${\cal C}$ is at least $1$, while in case (i) it is at most $\sqrt{2/3}$. ∎

3 A polynomial time algorithm for the maximum-spacing cost function

We describe MaxSpacing, a simple greedy algorithm that finds an explainable partition of maximum spacing in polynomial time.

To simplify its description we introduce some notation. For a set of leaves $L$ in a decision tree, we use $sp(L)$ to refer to the spacing of the partition of the points of ${\cal X}$ induced by the leaves in $L$. Given a set of leaves $L$, a leaf $\ell\in L$ and an axis-aligned cut $\gamma=(i,\theta)$, we use $L_{\gamma,\ell}$ to denote the set of leaves obtained when $\gamma$ is applied to split the points that reach $\ell$. More precisely, $L_{\gamma,\ell}$ is obtained from $L$ by removing $\ell$ and adding the two leaves that are created by using $\gamma$ to split the points that reach $\ell$.

A pseudo-code for MaxSpacing is presented in Algorithm 1; a runnable sketch follows the pseudo-code. The algorithm adopts a natural greedy strategy that at each step chooses the cut that yields the partition of maximum spacing. We note that it runs in polynomial time because in Step 1 we just need to test at most $(|{\cal X}|-1)d$ axis-aligned cuts: for each $\ell\in L$ and each dimension $i\in[d]$, we sort the $|\ell|$ points that reach $\ell$ according to their coordinate $i$ and consider the cuts $(i,\theta_{j})$, for $j=1,\ldots,|\ell|-1$, where $\theta_{j}$ is the midpoint between the values of the $i$-th coordinate of the $j$th and $(j+1)$th points in the sorted list.

Algorithm 1 MaxSpacing(${\cal X}$: set of points; $k$: integer)

  Initialize a decision tree with a single leaf $\ell$ and associate it with ${\cal X}$

  $L\leftarrow\{\ell\}$

  Repeat $k-1$ times:

  1. Find a cut $\gamma$ and a leaf $\ell\in L$ that simultaneously satisfy:

     (i) $\gamma$ splits the points that reach $\ell$ into two non-empty groups;

     (ii) $sp(L_{\gamma,\ell})\geq sp(L_{\gamma^{\prime},\ell^{\prime}})$ for every leaf $\ell^{\prime}\in L$ and every axis-aligned cut $\gamma^{\prime}$ that splits the points that reach $\ell^{\prime}$ into two non-empty sets.

  2. Split leaf $\ell$ using cut $\gamma$.

  3. $L\leftarrow L_{\gamma,\ell}$.
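The pseudo-code translates directly into the following Python sketch (ours; it assumes numpy, Euclidean distances for $dist$, and that ${\cal X}$ contains at least $k$ distinct points). The spacing is recomputed from scratch for every candidate cut, so the sketch favors clarity over the efficiency analysis discussed above.

```python
import numpy as np
from itertools import combinations

def spacing(leaves):
    """Minimum Euclidean distance between points in distinct leaves."""
    return min(np.linalg.norm(x - y)
               for A, B in combinations(leaves, 2) for x in A for y in B)

def candidate_cuts(C):
    """Axis-aligned cuts (i, theta) splitting cluster C into two non-empty parts:
    midpoints between consecutive distinct values on each coordinate."""
    for i in range(C.shape[1]):
        vals = np.unique(C[:, i])
        for a, b in zip(vals, vals[1:]):
            yield i, (a + b) / 2

def max_spacing(X, k):
    """Greedy MaxSpacing: repeatedly apply the cut that maximizes the spacing."""
    leaves = [np.asarray(X, dtype=float)]
    cuts = []  # record of the chosen cuts: (index of split leaf, i, theta)
    for _ in range(k - 1):
        best = None
        for idx, C in enumerate(leaves):
            for i, theta in candidate_cuts(C):
                left, right = C[C[:, i] <= theta], C[C[:, i] > theta]
                trial = leaves[:idx] + leaves[idx + 1:] + [left, right]
                s = spacing(trial)
                if best is None or s > best[0]:
                    best = (s, trial, (idx, i, theta))
        leaves = best[1]
        cuts.append(best[2])
    return leaves, cuts
```

Returning the list of chosen cuts makes the induced decision tree explicit: each recorded triple says which leaf was split and with which axis-aligned cut.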

In what follows, we show that MaxSpacing produces an explainable partition with the maximum possible (optimal) spacing. The following simple fact will be useful.

Fact 4.

Let $d^{*}_{i}$ be the spacing of an optimal explainable partition with $i+1$ groups. Then $d^{*}_{i}\geq d^{*}_{i+1}$, for $i=1,\ldots,k-1$.

Proof.

Let $D^{*}$ be a decision tree that induces a partition with $i+2$ groups and spacing $d^{*}_{i+1}$. Let $D$ be the decision tree obtained from $D^{*}$ by removing two leaves that are siblings and turning their parent into a leaf. Let $\mathbf{x}$ and $\mathbf{y}$ be two closest points among those that reach different leaves in $D$. Since these points also reach distinct leaves in $D^{*}$, the spacing of the leaves of $D$ is not smaller than that of the leaves of $D^{*}$. Thus, $d^{*}_{i}\geq sp(\mbox{leaves of }D)\geq d^{*}_{i+1}$. ∎

Theorem 4.

For every $1\leq i\leq k-1$, the partition induced by the leaves of the MaxSpacing algorithm at the end of iteration $i$ has maximum spacing among the explainable partitions of ${\cal X}$ with $i+1$ groups.

Proof.

Let ${\cal C}^{*}_{i}$ be an optimal explainable partition with $i+1$ groups and let $d^{*}_{i}$ be its spacing. Moreover, let $L_{i}$, with $i<k-1$, be the set of leaves at the end of iteration $i$ of the MaxSpacing algorithm. By the greedy choice, $sp(L_{1})=d^{*}_{1}$. We assume by induction that the spacing of $L_{i}$ is $d^{*}_{i}$ and show that the spacing of $L_{i+1}$ is $d^{*}_{i+1}$.

For a node $\nu$ of a decision tree, let $P(\nu)$ be the set of points that reach $\nu$. Let $\nu^{*}$ be a node of the decision tree $D^{*}$ for ${\cal C}_{i+1}^{*}$ that satisfies the following: (i) for each $\ell\in L_{i}$, either $P(\ell)\subseteq P(\nu^{*})$ or $P(\ell)\cap P(\nu^{*})=\emptyset$; and (ii) some child of $\nu^{*}$ does not satisfy (i). We will use the cut associated with $\nu^{*}$ in $D^{*}$ to argue that we can properly split $L_{i}$.

To prove the existence of a node $\nu^{*}$ with such properties, it suffices to show that the root $r^{*}$ of $D^{*}$ satisfies (i) and that some leaf $\ell^{*}$ of $D^{*}$ does not, since in this case we can set $\nu^{*}$ as the last node in the path from $r^{*}$ to $\ell^{*}$ that satisfies (i). Clearly, $r^{*}$ satisfies (i). It remains to argue that some leaf $\ell^{*}\in D^{*}$ does not satisfy (i). Since the number of leaves in $L_{i}$ is smaller than the number of leaves in $D^{*}$, by the pigeonhole principle there are two leaves, say $\ell^{*}_{1}$ and $\ell^{*}_{2}$, of $D^{*}$ that contain points from the same leaf $\ell$ of $L_{i}$. Thus, $P(\ell)\cap P(\ell^{*}_{1})\neq\emptyset$ and $P(\ell)\not\subseteq P(\ell^{*}_{1})$, so $\ell^{*}_{1}$ does not satisfy (i). We set $\ell^{*}=\ell^{*}_{1}$ and let $\nu^{*}$ be the last node in the path from $r^{*}$ to $\ell^{*}_{1}$ that satisfies (i).

Let $\nu^{*}_{ch}$ be a child of $\nu^{*}$ in $D^{*}$ that does not satisfy (i). Moreover, let $\gamma$ be the cut associated with $\nu^{*}$ and let $\ell$ be a leaf in $L_{i}$ such that $P(\ell)\subseteq P(\nu^{*})$, $P(\ell)\cap P(\nu^{*}_{ch})\neq\emptyset$ and $P(\ell)\not\subseteq P(\nu^{*}_{ch})$. We show that the spacing of the set of leaves $L^{\prime}_{i}$ obtained from $L_{i}$ by applying cut $\gamma$ to $\ell$ is at least $d^{*}_{i+1}$. Let $\ell_{1}$ and $\ell_{2}$ be the two new leaves created by applying $\gamma$ to $\ell$, and let $\mathbf{x}$ and $\mathbf{y}$ be the two closest points (according to $dist$) among those that reach different leaves in $L^{\prime}_{i}$. If $\mathbf{x}$ reaches $\ell_{1}$ (resp. $\ell_{2}$) and $\mathbf{y}$ reaches $\ell_{2}$ (resp. $\ell_{1}$) then

$$sp(L^{\prime}_{i})=dist(\mathbf{x},\mathbf{y})\geq sp({\cal C}_{i+1}^{*})=d^{*}_{i+1}$$

because, due to the cut $\gamma$ applied at $\nu^{*}$, $\mathbf{x}$ and $\mathbf{y}$ lie in different groups of ${\cal C}_{i+1}^{*}$. If one of them, say $\mathbf{x}$, does not reach $\ell$, then $\mathbf{x}$ and $\mathbf{y}$ reach different leaves in $L_{i}$ and, thus,

$$sp(L^{\prime}_{i})=dist(\mathbf{x},\mathbf{y})\geq sp(L_{i})=d^{*}_{i}\geq d^{*}_{i+1},$$

where the last inequality follows from Fact 4.

We have shown that there exist a leaf $\ell$ in $L_{i}$ and a cut $\gamma$ such that applying $\gamma$ to $\ell$ yields a partition of spacing at least $d^{*}_{i+1}$. Thus, due to the greedy choice, MaxSpacing obtains a partition of spacing $d^{*}_{i+1}$ at the end of iteration $i+1$. ∎

4 Conclusions

We have shown that the problems of finding explainable clusterings (via decision trees) that optimize the classical $k$-means, $k$-medians and $k$-centers cost functions do not admit polynomial time $(1+\epsilon)$-approximations unless $P=NP$. These results help to formally justify the quest for heuristics and/or approximation algorithms.

The algorithms recently proposed in the literature for building explainable clusterings compare their costs with the costs of optimal unrestricted clusterings [2, 4, 5, 6, 7, 8]. A major open question in this line of research is whether better bounds can be obtained when the comparison is made against the optimal explainable clustering.

For the spacing cost function, we provided a simple polynomial time algorithm that computes an explainable partition of maximum spacing. Interestingly, the proof of Theorem 4 does not use the fact that the cuts are axis-aligned, and thus our result holds for any family of cuts.

References