The computational complexity of some explainable clustering problems
Abstract
We study the computational complexity of some explainable clustering problems in the framework proposed by [Dasgupta et al., ICML 2020], where explainability is achieved via axis-aligned decision trees. We consider the k-means, k-medians, k-centers and the spacing cost functions. We prove that the first three are hard to optimize while the latter can be optimized in polynomial time.
1 Introduction
Machine learning models and algorithms have been used in a number of systems that make decisions that affect our lives. Thus, explainable methods are desirable so that people can better understand the behavior of these systems, which allows for their confident use or, eventually, the questioning of their applicability [1].
Recently, there has been some effort to devise explainable methods for unsupervised learning tasks, in particular, for clustering [2, 3]. We investigate the framework discussed by [2], where an explainable clustering is given by a partition, induced by the leaves of an axis-aligned decision tree, that optimizes some predefined objective function.
Figure 1 shows a decision tree that defines a clustering for the Iris dataset. The clustering has three groups, each of them corresponding to a leaf. The explanation of the group associated with the rightmost leaf is Sepal Length > 0.4 AND Petal Width < 0.5.
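The leaf-to-explanation correspondence is easy to operationalize. Below is a minimal sketch of the tree from Figure 1 as nested tests, using the two cuts and the (apparently normalized) threshold values quoted above; the function name and the orientation of the branches are illustrative assumptions, not taken from the paper.

```python
def cluster_of(sepal_length: float, petal_width: float) -> int:
    """Assign an Iris point to a leaf of the Figure 1 tree (illustrative sketch)."""
    if sepal_length <= 0.4:
        return 0                # leaf explained by: Sepal Length <= 0.4
    if petal_width < 0.5:
        return 1                # leaf: Sepal Length > 0.4 AND Petal Width < 0.5
    return 2                    # leaf: Sepal Length > 0.4 AND Petal Width >= 0.5
```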

Following [2], a series of papers [4, 5, 6, 7, 8] provided algorithms, with provable guarantees, to build decision trees that induce explainable clusterings. Several cost functions to guide the clustering construction were investigated, such as the k-means, k-centers, k-medians and maximum-spacing. Despite this active research, the only work in the field that tackles the computational complexity of building explainable clusterings is [9], where it was proved that optimizing the k-means and the k-medians cost functions is NP-Complete. Here, we improve these results and also investigate the computational complexity for both the k-centers and the spacing cost functions.
1.1 Problem definition
Let $\mathcal{X}$ be a finite set of points in $\mathbb{R}^d$. We say that a decision tree is standard if each internal node $v$ is associated with a test (cut), specified by a coordinate $i \in [d]$ and a real value $\theta$, that partitions the points in $\mathcal{X}$ that reach $v$ into two sets: those having coordinate $i$ smaller than or equal to $\theta$ and those having it larger than $\theta$. The leaves of a standard decision tree induce a partition of $\mathbb{R}^d$ into axis-aligned boxes and, naturally, a partition of $\mathcal{X}$ into clusters.
Let $k$ be an integer. The clustering problems considered here consist of finding a partition of $\mathcal{X}$ into $k$ groups, among those that can be induced by a standard decision tree with $k$ leaves, that optimizes a given objective function. For the k-means, k-medians and k-centers cost functions, in addition to the partition, a representative for each group must also be output.
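To make the notion of a standard decision tree concrete, here is a minimal sketch (the names are ours, not from the paper) of a node holding a cut $(i, \theta)$ and of the routine that assigns a point to the cluster of the leaf it reaches.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    i: int = 0                      # coordinate tested at an internal node
    theta: float = 0.0              # threshold of the cut (i, theta)
    left: Optional["Node"] = None   # points with x[i] <= theta
    right: Optional["Node"] = None  # points with x[i] >  theta
    cluster: Optional[int] = None   # set only at leaves

def assign(root: Node, x: Sequence[float]) -> int:
    """Cluster of point x in the partition induced by a standard decision tree."""
    node = root
    while node.cluster is None:
        node = node.left if x[node.i] <= node.theta else node.right
    return node.cluster
```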
For the k-means problem the objective (cost function) to be minimized is the Sum of the Squared Euclidean Distances (SSED) between each point $x$ and the representative of the cluster where $x$ lies. Mathematically, the cost (SSED) of a partition $\mathcal{C} = (C_1, \ldots, C_k)$, with representatives $c_1, \ldots, c_k$, for $\mathcal{X}$ is given by

\[ \mathrm{cost}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - c_i \|_2^2 . \]
For the k-medians problem the cost of a partition $\mathcal{C}$ is given by

\[ \mathrm{cost}(\mathcal{C}) = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - c_i \|_1 . \]
The k-centers problem is also a minimization problem; its cost function for a partition $\mathcal{C}$ is given by

\[ \mathrm{cost}(\mathcal{C}) = \max_{i \in [k]} \max_{x \in C_i} \| x - c_i \|_2 . \]
Let $d(\cdot, \cdot)$ be a distance function. The maximum-spacing problem consists of finding a partition with at least $k$ groups that has maximum spacing, where the spacing of a partition $\mathcal{C}$ is defined as

\[ \mathrm{spacing}(\mathcal{C}) = \min \{ d(x, y) : x \text{ and } y \text{ lie in different groups of } \mathcal{C} \} . \]
In contrast to the other criteria, the spacing is an inter-clustering criterion.
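The four objectives are straightforward to compute once a partition is fixed. The sketch below (numpy-based; representing a partition as a list of arrays is our choice) uses the optimal representatives for k-means and k-medians (the mean and the coordinatewise median, respectively; see Fact 3 below), takes the k-centers representatives as an argument, and assumes the Euclidean distance for the spacing.

```python
import numpy as np
from itertools import combinations

def kmeans_cost(clusters):
    # SSED: for each group, the optimal representative is its mean
    return sum(((C - C.mean(axis=0)) ** 2).sum() for C in clusters)

def kmedians_cost(clusters):
    # sum of l1 distances; the coordinatewise median is an optimal representative
    return sum(np.abs(C - np.median(C, axis=0)).sum() for C in clusters)

def kcenters_cost(clusters, centers):
    # maximum distance between a point and the representative of its group
    return max(np.linalg.norm(C - c, axis=1).max()
               for C, c in zip(clusters, centers))

def spacing(clusters):
    # minimum distance between two points lying in different groups
    return min(np.linalg.norm(x - y)
               for A, B in combinations(clusters, 2) for x in A for y in B)
```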
We note that an optimal solution of the unrestricted version of any of these problems, in which the decision tree constraint is not enforced, might be a partition that is hard to explain in terms of the input features. Thus, the motivation for using standard decision trees.
For the sake of simplicity, throughout this text, by explainable clustering we mean a clustering that is obtained via decision trees.
1.2 Our contributions
In Section 2, we first show that the problem of building a partition via decision trees that minimizes the k-means cost function does not admit a $(1+\varepsilon)$-approximation in polynomial time, for some $\varepsilon > 0$, unless $P = NP$. Then, we show that analogous results hold for both the k-medians and k-centers cost functions. Our results for both the k-means and k-medians are stronger than the NP-Hardness result established recently by [9] and they formally help to justify the quest for approximation algorithms and/or heuristics for these cost functions.
In Section 3 we propose a polynomial time algorithm that produces an explainable clustering of maximum spacing. As far as we know, this is the first efficient method that produces an optimal explainable clustering with respect to some well-studied metric.
1.3 Related work
Our research is inspired by the recent work of [2], where the problem of building explainable clusterings, via standard decision trees, for both the k-means and the k-medians cost functions is studied. This paper proposes algorithms with provable approximation bounds for building explainable clusterings. In addition, it investigates the price of explainability for these cost functions, which is the unavoidable gap between the cost of the optimal explainable clustering and that of the optimal unconstrained clustering. Among their results, they showed that the price of explainability for the k-means and the k-medians is, respectively, $O(k^2)$ and $O(k)$.
Their results were refined/improved by a series of recent papers [4, 5, 7, 8, 6]. Currently, the best upper bound for the k-medians is $O(\log k \log \log k)$ [5, 7] while for the k-means it is $O(k \log k)$ [7]. The study of bounds that depend on the dimension $d$ was initiated in [4], where the authors present an $O(d \log k)$ upper bound for the k-medians and an $O(dk \log k)$ upper bound for the k-means. These bounds were improved to $O(d \log^2 d)$ for the k-medians [7] and to $k^{1-2/d} \operatorname{poly}(d, \log k)$ for the k-means [6].
The price of explainability was also investigated for other cost functions. In [4], Laber and Murtinho considered the k-centers and maximum-spacing cost functions. In [5], Makarychev and Shan considered the k-medoids problem (k-medians with the $\ell_2$ objective). Finally, in [8], Gamlath et al. addressed $\ell_p$ objectives.
The aforementioned papers, except [4] which also presents experiments, are mainly theoretical. However, there are also a number of papers that propose algorithms (without theoretical guarantees) for building explainable clustering, among them we cite [10, 11, 3].
The computational complexity of building explainable clusterings via decision trees for both the k-means and the k-medians problems is studied in [9]. It is shown that both problems admit polynomial time algorithms when either $k$ or $d$ is constant and that they are NP-Complete for arbitrary $k$ and $d$. In addition, they show that an optimal explainable clustering cannot be found in $f(k) \cdot |\mathcal{X}|^{o(k)}$ time for any computable function $f$, unless the Exponential Time Hypothesis (ETH) fails.
When we turn to standard (non-explainable) clustering, the problems of optimizing the k-means, k-medians and k-centers cost functions are APX-Hard [12, 13, 14] and all of them admit polynomial time algorithms with constant approximation [15, 16, 17]. With regards to the spacing cost function, the single-link algorithm, a very popular method to build hierarchical clusterings, produces a partition with maximum spacing [18, Chapter 4].
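For the unrestricted maximum-spacing problem, the optimality of single linkage is easy to exercise with standard tooling; a small sketch (scipy-based, on synthetic data) is below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Single linkage repeatedly merges the two closest groups; stopping when k
# groups remain yields a partition of maximum spacing [18, Chapter 4].
Z = linkage(X, method="single")
labels = fcluster(Z, t=5, criterion="maxclust")   # k = 5 groups
```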
2 Hardness of the k-means, k-medians and k-centers cost functions
2.1 Background
We start by recalling some basic definitions and facts that are useful for studying the hardness of optimization problems (see, e.g., [19, chapter 29]).
Given a minimization problem $P$ and a parameter $\alpha > 1$, we define the $\alpha$-Gap-$P$ problem as the problem of deciding, for an instance $I$ of $P$ and a parameter $c$, whether: (i) $I$ admits a solution of value at most $c$; or (ii) every solution of $I$ has value at least $\alpha \cdot c$. In such a gap decision problem it is tacitly assumed that the instances are either of type (i) or of type (ii).
Fact 1.
If for a minimization problem $P$ there exists $\alpha > 1$ such that the $\alpha$-Gap-$P$ problem is NP-hard, then no polynomial time $\alpha$-approximation algorithm exists for $P$ unless $P = NP$.
We will use the following definition of a gap-preserving reduction.
Definition 1.
Let $P$ and $Q$ be minimization problems. A gap-preserving reduction from $P$ to $Q$ is a polynomial time algorithm that, given an instance $x$ of $P$ and a value $c_P$, produces an instance $y$ of $Q$ and a value $c_Q$ such that there exist constants $\alpha, \beta > 1$ for which:

1. if $OPT(x) \le c_P$ then $OPT(y) \le c_Q$;
2. if $OPT(x) \ge \alpha \cdot c_P$ then $OPT(y) \ge \beta \cdot c_Q$.
Fact 2.
Fix minimization problems $P$ and $Q$. If there exists $\alpha > 1$ such that the $\alpha$-Gap-$P$ problem is NP-hard and there exists a gap-preserving reduction, with constants $\alpha$ and $\beta$, from $P$ to $Q$, then the $\beta$-Gap-$Q$ problem is NP-hard.
We will now specialize the above definitions for restricted variants of the problem of finding a minimum vertex cover in a graph and for our clustering problems.
Definition 2.
For every $\varepsilon > 0$, the $(1+\varepsilon)$-Gap-MinVC-B-TF (gap) decision problem is defined as follows: given a triangle-free graph $G = (V, E)$ with bounded degree and an integer $k$, decide whether $G$ has a vertex cover of size at most $k$ or all vertex covers of $G$ have size at least $(1+\varepsilon)k$.
The $(1+\varepsilon)$-Gap-MinVC-3B-TF (gap) decision problem has a similar definition; the only difference is that, in addition to being triangle-free, the graphs are required to be 3-bounded, that is, all of their vertices have degree at most 3.
The NP-Hardness of $(1+\varepsilon)$-Gap-MinVC-B-TF and $(1+\varepsilon)$-Gap-MinVC-3B-TF, for some fixed $\varepsilon > 0$, was established in [13] and [20], respectively.
Definition 3.
For every $\varepsilon > 0$, the $(1+\varepsilon)$-Gap-Explainable-kmeans (gap) decision problem is defined as follows: given a set of points $\mathcal{X} \subseteq \mathbb{R}^d$, an integer $k$, and a value $c$, decide whether there exists an explainable $k$-clustering $\mathcal{C}$ of the points in $\mathcal{X}$ such that the k-means cost of $\mathcal{C}$ is at most $c$, or for each explainable $k$-clustering $\mathcal{C}$ of $\mathcal{X}$ it holds that the k-means cost of $\mathcal{C}$ is at least $(1+\varepsilon)c$.
The $(1+\varepsilon)$-Gap-Explainable-kmedians and $(1+\varepsilon)$-Gap-Explainable-kcenters decision problems are analogously defined.
To prove the hardness for the k-means we use a gap-preserving reduction from the $(1+\varepsilon)$-Gap-MinVC-B-TF decision problem. To handle both the k-centers and the k-medians, we use the $(1+\varepsilon)$-Gap-MinVC-3B-TF decision problem.
Our reductions have some common ingredients that we explain here. For all of them, given a graph $G = (V, E)$, where $V = \{v_1, \ldots, v_n\}$, we build an instance of the clustering problem under consideration by mapping every edge $e$ in $E$ onto a point $x_e \in \{0,1\}^n$, where $x_e$ has value 1 at coordinate $i$ if vertex $v_i$ is incident on $e$ and 0 otherwise. This is exactly the mapping proposed in [13] to establish that the (standard) k-means problem is APX-Hard. We use $\mathcal{X}(G)$ to denote the input of the resulting clustering instance.
Let $S = \{u_1, \ldots, u_k\}$ be a cover of size $k$ for $G$, where each $u_i$ is an integer in $[n]$ and $u_1 < \cdots < u_k$. We define $\mathcal{C}(S)$ as the $k$-clustering induced by $S$ on the points in $\mathcal{X}(G)$, where the group $C_i$ includes all points that simultaneously satisfy: its $u_i$-th component is 1 and its $u_j$-th component, for $j < i$, is 0.
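A short sketch of the two constructions may help (the function names are ours): the first builds the point set $\mathcal{X}(G)$ from the edges of $G$, the second builds the clustering $\mathcal{C}(S)$ by sending each edge to the group of the first cover vertex it touches.

```python
import numpy as np

def instance_from_graph(edges, n):
    """Map each edge (u, v) of G onto its indicator vector in {0,1}^n [13]."""
    X = np.zeros((len(edges), n))
    for p, (u, v) in enumerate(edges):
        X[p, u] = X[p, v] = 1
    return X

def cover_clustering(edges, cover):
    """C(S): edge e joins group i, where u_i is the first cover vertex in e."""
    groups = {}
    for p, e in enumerate(edges):
        i = next(i for i, u in enumerate(cover) if u in e)  # a cover hits every edge
        groups.setdefault(i, []).append(p)
    return groups
```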
Proposition 1.
The clustering $\mathcal{C}(S)$ is explainable.
Proof.
$\mathcal{C}(S)$ is the clustering induced by a decision tree with $k-1$ internal nodes, with exactly one internal node per level. The internal node of level $i$ is associated with the cut $(u_i, 1/2)$. ∎
2.2 Hardness of the k-means cost function
We prove that the problem of finding an explainable clustering with minimum k-means cost is hard to approximate. The reduction employed here is the one used by [13] to show that it is hard to find a $(1+\varepsilon)$-approximation for the (standard) k-means cost function. The extra ingredient in our proof is the construction of an explainable clustering from a vertex cover that was described in the previous section.
Theorem 1.
The problem of building an explainable clustering, via decision trees, that minimizes the k-means cost function does not admit a $(1+\varepsilon)$-approximation, for some $\varepsilon > 0$, in polynomial time unless $P = NP$.
Proof.
Let $G$ be a triangle-free graph with bounded degree that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $(1+\varepsilon)k$.
First, we consider the case where $G$ has a vertex cover $S$ of size $k$. We show that, in this case, the cost of $\mathcal{C}(S)$ is at most $|E| - k$. Let us consider the mean of the points in $C_i$ as the representative of $C_i$, that is, the point that has $1$ at coordinate $u_i$ and $1/|C_i|$ in the remaining coordinates with non-zero values (note that all edges in $C_i$ touch $u_i$, hence their other endpoints are distinct). The squared distance of each point in $C_i$ to its representative is given by

\[ \left(1 - \frac{1}{|C_i|}\right)^{2} + (|C_i| - 1) \cdot \frac{1}{|C_i|^{2}} = 1 - \frac{1}{|C_i|}. \tag{1} \]

Thus, $C_i$ contributes to the total cost with $|C_i| \cdot (1 - 1/|C_i|) = |C_i| - 1$. The cost of the clustering $\mathcal{C}(S)$ is, then, given by

\[ \sum_{i=1}^{k} (|C_i| - 1) = |E| - k. \]
Now, it remains to argue that if the minimum vertex cover for $G$ has size at least $(1+\varepsilon)k$ then every explainable clustering for the corresponding instance has cost at least $(1+\varepsilon')(|E| - k)$, for some constant $\varepsilon' > 0$. This follows from [13], as in this case every $k$-clustering (and, in particular, every explainable one) has cost at least $(1+\varepsilon')(|E| - k)$.
We have concluded a gap preserving reduction from $(1+\varepsilon)$-Gap-MinVC-B-TF to $(1+\varepsilon')$-Gap-Explainable-kmeans. ∎
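As a sanity check of the identity cost$(\mathcal{C}(S)) = |E| - k$, one can run the sketch above (instance_from_graph and cover_clustering) on a toy triangle-free graph:

```python
edges = [(0, 1), (0, 2), (0, 3), (4, 1)]   # a star at vertex 0 plus one extra edge
cover = [0, 4]                             # vertex cover of size k = 2
X = instance_from_graph(edges, n=5)
groups = cover_clustering(edges, cover)
cost = sum(((X[idx] - X[idx].mean(axis=0)) ** 2).sum()
           for idx in groups.values())
assert abs(cost - (len(edges) - len(cover))) < 1e-9   # cost = |E| - k = 2
```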
2.3 Hardness of the k-medians cost function
We prove that the problem of finding an explainable clustering with minimum k-medians cost is hard to approximate. We show a gap preserving reduction from the $(1+\varepsilon)$-Gap-MinVC-3B-TF problem to the $(1+\varepsilon')$-Gap-Explainable-kmedians problem.
The following well-known fact will be useful.
Fact 3.
Let $S$ be a set of points in $\mathbb{R}^d$ and let $p^*$ be the point for which

\[ \sum_{x \in S} \| x - p^* \|_1 \]

is minimum. Then, for each $i \in [d]$, the value of coordinate $i$ of point $p^*$ is the median of the values of the points in $S$ on coordinate $i$.
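Fact 3 is easy to probe numerically; the brute-force check below (our sketch) compares the coordinatewise median against randomly perturbed candidate representatives.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.integers(0, 10, size=(7, 3)).astype(float)   # a small random point set
m = np.median(S, axis=0)                             # coordinatewise median
l1 = lambda p: np.abs(S - p).sum()                   # sum of l1 distances to p
assert all(l1(m) <= l1(m + delta) + 1e-9
           for delta in rng.normal(size=(100, 3)))
```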
The following lemma will also be useful.
Lemma 1.
Let $G$ be a 3-bounded triangle-free graph and let $C$ be a group of points corresponding to edges of $G$. We have that: (i) if $C$ is a star then its k-medians cost is $|C|$; and (ii) if $C$ is not a star then its k-medians cost is at least $4|C|/3$.
Proof.
From the previous fact, the representative of $C$ that yields the minimum k-medians cost is a point in $\{0,1\}^n$ where the coordinate $v$ has value 1 if and only if the number of edges in $C$ that touch vertex $v$ is larger than $|C|/2$. Thus, the cost of a cluster $C$ is given by

\[ \mathrm{cost}(C) = \sum_{v \in V} \min\{ \deg_C(v),\; |C| - \deg_C(v) \}, \]

where $\deg_C(v)$ is the number of edges that touch vertex $v$ in $C$.

If $C$ is a star centred on vertex $v$ then $\deg_C(v) = |C|$ and $\deg_C(w) = 1$ for the other vertices $w$ in the star. Thus,

\[ \mathrm{cost}(C) = \sum_{w \neq v} \min\{1,\, |C| - 1\} = |C|. \]

If $C$ is not a star then we have some cases:

Case 1) $\deg_C(v) \le |C|/2$ for all $v$. We have

\[ \mathrm{cost}(C) = \sum_{v \in V} \deg_C(v) = 2|C| \ge 4|C|/3. \]

Note that the above case covers every $C$ with $|C| \ge 6$ since the maximum degree in $G$ is at most 3.

Case 2) $|C| = 5$ and $\deg_C(v) = 3$ for exactly one $v$. We have

\[ \mathrm{cost}(C) = (|C| - 3) + (2|C| - 3) = 9 \ge 4|C|/3. \]

Case 3) $|C| = 5$ and $\deg_C(v) = 3$ for exactly two vertices $u$ and $v$. We have

\[ \mathrm{cost}(C) = 2(|C| - 3) + (2|C| - 6) = 8 \ge 4|C|/3. \]

Note that we cannot have 3 vertices with degree 3 when $|C| = 5$.

Case 4) $|C| = 4$ and $\deg_C(v) = 3$ for some $v$. We must have exactly one $v$ with $\deg_C(v) = 3$, otherwise we would have more than 4 edges. Thus,

\[ \mathrm{cost}(C) = (|C| - 3) + (2|C| - 3) = 6 \ge 4|C|/3. \]

Case 5) $|C| = 3$ and $\deg_C(v) = 2$ for some $v$. We have two possible non-isomorphic graphs. One of them consists of a path with 2 edges and an additional disjoint edge while the other is a path with 3 edges. For both cases we have

\[ \mathrm{cost}(C) \ge 4 = 4|C|/3. \]

∎
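The degree-based cost formula derived in the proof is convenient computationally; the sketch below (our code; the specific constants in the lemma are our reconstruction) implements it and checks the two extreme configurations discussed above, the star and the 3-edge path, the latter attaining the $4|C|/3$ bound.

```python
from collections import Counter

def medians_cost_of_cluster(cluster_edges):
    """k-medians cost of a group of edge-points: sum over vertices v of
    min(deg_C(v), |C| - deg_C(v)), using the median point as representative."""
    deg = Counter(v for e in cluster_edges for v in e)
    m = len(cluster_edges)
    return sum(min(d, m - d) for d in deg.values())

assert medians_cost_of_cluster([(0, 1), (0, 2), (0, 3)]) == 3  # star: cost = |C|
assert medians_cost_of_cluster([(0, 1), (1, 2), (2, 3)]) == 4  # path: 4 = 4|C|/3
```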
Theorem 2.
The problem of building an explainable clustering, via decision trees, that minimizes the k-medians cost function does not admit a $(1+\varepsilon)$-approximation, for some $\varepsilon > 0$, in polynomial time unless $P = NP$.
Proof.
Let $G$ be a triangle-free graph with maximum degree not larger than 3 that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $(1+\varepsilon)k$.
First, consider the case where $G$ has a vertex cover $S$ of size $k$. Since the clustering $\mathcal{C}(S)$ consists of stars, it follows from the previous lemma that its cost is at most $|E|$.
Now, assume that all vertex covers for $G$ have size at least $(1+\varepsilon)k$. Let $\mathcal{C}$ be a clustering with $k$ groups for the corresponding k-medians instance.
Let $t$ be the number of groups in $\mathcal{C}$ that are stars and let $m'$ be the total number of edges in the remaining clusters. Since there is no vertex cover for $G$ of size smaller than $(1+\varepsilon)k$ we must have

\[ t + m' \ge (1+\varepsilon)k, \]

otherwise we could obtain a cover for $G$ with size smaller than $(1+\varepsilon)k$ by using one vertex per star and one additional vertex for each of the $m'$ remaining edges. Since $t \le k$ it follows that $m' \ge \varepsilon k$. Moreover, we may assume that $|E| \le 3k$ because the degree of every vertex in $G$ is at most 3: a graph with $|E| > 3k$ cannot have a vertex cover of size $k$, so such instances can be rejected upfront. Thus, from the previous lemma, the cost of clustering $\mathcal{C}$ is at least

\[ (|E| - m') + \frac{4m'}{3} = |E| + \frac{m'}{3} \ge |E| + \frac{\varepsilon k}{3} \ge \left(1 + \frac{\varepsilon}{9}\right) |E|. \]
We have concluded a gap preserving reduction from $(1+\varepsilon)$-Gap-MinVC-3B-TF to $(1+\varepsilon/9)$-Gap-Explainable-kmedians.
∎
2.4 Hardness of the k-centers cost function
In this section we discuss the computational complexity of minimizing the k-centers cost function. We show a gap preserving reduction from the $(1+\varepsilon)$-Gap-MinVC-3B-TF problem to the $(1+\varepsilon')$-Gap-Explainable-kcenters problem.
Theorem 3.
The problem of building an explainable clustering, via decision trees, that minimizes the k-centers cost function does not admit a $(1+\varepsilon)$-approximation, for some $\varepsilon > 0$, in polynomial time unless $P = NP$.
Proof.
Let $G$ be a triangle-free graph with maximum degree not larger than 3 that satisfies one of the following cases: (i) $G$ has a vertex cover of size $k$; or (ii) all vertex covers of $G$ have size at least $(1+\varepsilon)k$.
First, consider the case where $G$ has a vertex cover $S$ of size $k$. In this case, the clustering $\mathcal{C}(S)$ consists of stars with at most 3 edges. For the representative of $C_i$, as in the proof of Theorem 1, we use the mean of the points that lie in $C_i$. Thus, the distance of each point in $C_i$ to its representative is the square root of the rightmost term of (1), which is at most $\sqrt{2/3}$ since $G$ has maximum degree 3.
Now, we assume that $G$ does not have a vertex cover with $k$ vertices. Let $\mathcal{C}$ be a clustering with $k$ groups for the edges of $G$. One of the groups, say $C$, does not have a vertex that touches all the edges in $C$. Pick the vertex, say $u$, that touches the largest number of edges in $C$. Consider an edge $e$ in $C$ that does not touch $u$. We show that there is another edge in $C$, say $e'$, that has no intersection with $e$. In fact, pick an edge $e''$ in $C$ that touches $u$. If $e''$ does not intersect $e$ we set $e' = e''$. Otherwise, we assume w.l.o.g. that $e''$ intersects $e$ at vertex $w$, that is, $e'' = (u, w)$ and $e = (w, z)$ for some vertex $z$. We know that $(u, z)$ is not an edge for otherwise we would have a triangle in $G$. Since $w$ touches at least two edges in $C$ (namely $e$ and $e''$) and $u$ is the vertex that touches the largest number of edges in $C$, then $u$ must touch an edge $f = (u, y)$, with $y \neq w$ and $y \neq z$. We set $e' = f$.
We can argue that the distance of the representative of $C$ to either $x_e$ or $x_{e'}$ is at least 1. For that, we consider the values of the representative at the four components corresponding to the vertices that define the edges $e$ and $e'$. Let $(a, b)$ and $(a', b')$ be these values, respectively. Since $e$ and $e'$ are disjoint, we have that

\[ \|x_e - r\|^2 + \|x_{e'} - r\|^2 \ge \left[(1-a)^2 + a^2\right] + \left[(1-b)^2 + b^2\right] + \left[(1-a')^2 + (a')^2\right] + \left[(1-b')^2 + (b')^2\right] \ge 2, \]

because $(1-t)^2 + t^2 \ge 1/2$ for every real $t$. Thus, either $x_e$ or $x_{e'}$ is at distance at least 1 from the representative of $C$. ∎
3 A polynomial time algorithm for the maximum-spacing cost function
We describe MaxSpacing, a simple greedy algorithm that finds an explainable partition of maximum spacing in polynomial time.
To simplify its description we introduce some notation. For a set of leaves $L$ in a decision tree, we use $\mathrm{spacing}(L)$ to refer to the spacing of the partition of the points in $\mathcal{X}$ induced by the leaves in $L$. Given a set of leaves $L$, a leaf $\ell \in L$ and an axis-aligned cut $c$, we use $L(\ell, c)$ to denote the set of leaves obtained when $c$ is applied to split the points that reach $\ell$. More precisely, $L(\ell, c)$ is obtained from $L$ by removing $\ell$ and adding the two leaves that are created by using $c$ to split the points that reach $\ell$.
A pseudo-code for MaxSpacing is presented in Algorithm 1. The algorithm adopts a natural greedy strategy that at each step chooses the cut that yields the partition of maximum spacing. We note that it runs in polynomial time because in Step 1 we just need to test at most $|\mathcal{X}| \cdot d$ axis-aligned cuts: for each leaf $\ell$ and each dimension $i$ we sort the points that reach $\ell$ according to their coordinate $i$ and consider the cuts $(i, \theta_j)$, where $\theta_j$ is the midpoint between the values of the $i$-th coordinate of the $j$-th and $(j+1)$-th points in the sorted list.
Algorithm 1 (MaxSpacing). Start with the tree that consists of a single leaf, reached by all points of $\mathcal{X}$, and let $L$ be its set of leaves. While $|L| < k$:

1. Find a cut $c$ and a leaf $\ell \in L$ that simultaneously satisfy:
(i) $c$ splits the points that reach $\ell$ into 2 non-empty groups;
(ii) $\mathrm{spacing}(L(\ell, c)) \ge \mathrm{spacing}(L(\ell', c'))$ for every $\ell' \in L$ and every axis-aligned cut $c'$ that splits the points that reach $\ell'$ into two non-empty sets.

2. Split leaf $\ell$ using cut $c$.

3. Update $L \leftarrow L(\ell, c)$.
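A direct, unoptimized Python rendering of Algorithm 1 is sketched below (names and data layout are ours); leaf labels play the role of leaves, and only the midpoint cuts described above are examined.

```python
import numpy as np

def partition_spacing(X, labels):
    """Spacing: minimum distance between points lying in different groups."""
    best = np.inf
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if labels[i] != labels[j]:
                best = min(best, np.linalg.norm(X[i] - X[j]))
    return best

def max_spacing(X, k):
    """Greedy MaxSpacing sketch: repeatedly apply the axis-aligned cut whose
    induced partition has maximum spacing; returns one leaf id per point."""
    n, d = X.shape
    labels = np.zeros(n, dtype=int)            # a single root leaf at the start
    for step in range(1, k):                   # each iteration adds one leaf
        best_s, best_labels = -np.inf, None
        for leaf in np.unique(labels):
            idx = np.where(labels == leaf)[0]
            for i in range(d):
                vals = np.unique(X[idx, i])    # sorted distinct coordinate values
                for theta in (vals[:-1] + vals[1:]) / 2:   # midpoint cuts only
                    trial = labels.copy()
                    trial[idx[X[idx, i] > theta]] = step   # new leaf id
                    s = partition_spacing(X, trial)
                    if s > best_s:
                        best_s, best_labels = s, trial
        if best_labels is None:                # no leaf can be split further
            break
        labels = best_labels
    return labels
```

Note that this sketch re-evaluates the global spacing for every candidate cut; caching nearest-neighbor distances between leaves would speed it up considerably.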
In what follows, we show that MaxSpacing produces an explainable partition with maximum possible (optimal) spacing. The following simple fact will be useful.
Fact 4.
Let $opt_i$ be the spacing of an optimal explainable partition with $i$ groups. Then, $opt_{i+1} \le opt_i$, for every $i \ge 2$.
Proof.
Let $D$ be a decision tree that induces a partition with $i+1$ groups that has spacing $opt_{i+1}$. Let $D'$ be a decision tree obtained by removing two leaves that are siblings in $D$ and turning their parent into a leaf. Let $x$ and $y$ be two closest points among those that reach different leaves in $D'$. Since these points also reach distinct leaves in $D$ we have that the spacing of the leaves in $D'$ is not smaller than that of the leaves in $D$. Thus, $opt_i \ge \mathrm{spacing}(D') \ge \mathrm{spacing}(D) = opt_{i+1}$. ∎
Theorem 4.
For every $i$ with $2 \le i \le k$, the partition induced by the leaves of the MaxSpacing algorithm by the end of iteration $i-1$ has the maximum spacing among the explainable partitions with $i$ groups for $\mathcal{X}$.
Proof.
Let $\mathcal{O}$ be an optimal explainable partition with $i$ groups and let $opt_i$ be its spacing. Moreover, let $L_j$, with $|L_j| = j+1$, be the set of leaves by the end of iteration $j$ of the MaxSpacing algorithm. By the greedy choice, the spacing of $L_1$ is $opt_2$, since every explainable partition with two groups is induced by a single cut. We assume by induction that the spacing of $L_{i-2}$ is $opt_{i-1}$ and show that the spacing of $L_{i-1}$ is $opt_i$.
For a node $u$ in a decision tree, let $\mathcal{X}(u)$ be the set of points that reach $u$. Let $D^*$ be a decision tree that induces $\mathcal{O}$ and let $u$ be a node in $D^*$ that satisfies the following: (i) for each $\ell \in L_{i-2}$, either $\mathcal{X}(u) \cap \mathcal{X}(\ell) = \emptyset$ or $\mathcal{X}(\ell) \subseteq \mathcal{X}(u)$; and (ii) some child of $u$ does not satisfy (i). We will use the cut associated with $u$ in $D^*$ to argue that we can properly split some leaf of $L_{i-2}$.
To prove the existence of a node with such properties, it suffices to show that the root $r$ of $D^*$ satisfies (i) and some leaf of $D^*$ does not satisfy (i) since, in this case, we can set $u$ as the last node, in the path from $r$ to that leaf, that satisfies (i). Clearly, $r$ satisfies (i). It remains to argue that some leaf does not satisfy (i). Since the number of leaves in $L_{i-2}$ is smaller than the number of leaves in $D^*$, by the pigeonhole principle, there are two leaves, say $v_1$ and $v_2$, in $D^*$ that contain points from the same leaf $\ell^*$ in $L_{i-2}$. Thus, neither $v_1$ nor $v_2$ satisfies (i). We set $u$ as the last node in the path from $r$ to $v_1$ that satisfies (i).
Let $u'$ be a child of $u$ in $D^*$ that does not satisfy (i). Moreover, let $c$ be the cut associated with $u$ and let $\ell$ be a leaf in $L_{i-2}$ such that $\mathcal{X}(\ell) \cap \mathcal{X}(u') \neq \emptyset$ and $\mathcal{X}(\ell) \not\subseteq \mathcal{X}(u')$; since $u$ satisfies (i), we also have $\mathcal{X}(\ell) \subseteq \mathcal{X}(u)$, so $c$ splits the points that reach $\ell$ into two non-empty sets. We show that the spacing of the set of leaves $L_{i-2}(\ell, c)$ obtained from $L_{i-2}$ by applying cut $c$ to $\ell$ is at least $opt_i$. Let $\ell_1$ and $\ell_2$ be the two new leaves that are created by applying $c$ to $\ell$ and let $x$ and $y$ be the two closest points (according to $d$) among those that reach different leaves in $L_{i-2}(\ell, c)$. If $x$ reaches $\ell_1$ (resp. $\ell_2$) and $y$ reaches $\ell_2$ (resp. $\ell_1$) then

\[ d(x, y) \ge opt_i \]

because, due to the application of cut $c$ on $\mathcal{X}(\ell) \subseteq \mathcal{X}(u)$, $x$ and $y$ lie in different groups of $\mathcal{O}$. If one of them, say $x$, does not reach $\ell$ then $x$ and $y$ reach different leaves in $L_{i-2}$ and, thus,

\[ d(x, y) \ge \mathrm{spacing}(L_{i-2}) = opt_{i-1} \ge opt_i, \]

where the last inequality follows from Fact 4.
We have shown that there exist a leaf $\ell$ in $L_{i-2}$ and a cut $c$ such that the application of $c$ to $\ell$ yields a partition with spacing at least $opt_i$. Thus, due to the greedy choice, MaxSpacing obtains a partition with spacing at least $opt_i$ by the end of iteration $i-1$; since $opt_i$ is the maximum spacing of an explainable partition with $i$ groups, the spacing of $L_{i-1}$ is exactly $opt_i$. ∎
4 Conclusions
We have shown that the problems of finding explainable clusterings (via decision trees) that optimize the classical k-means, k-medians and k-centers cost functions do not admit polynomial time $(1+\varepsilon)$-approximations, for some $\varepsilon > 0$, unless $P = NP$. These results help to formally justify the quest for heuristics and/or approximation algorithms.
The algorithms recently proposed in the literature for building explainable clustering compare their costs with the costs of optimal unrestricted clustering [2, 4, 5, 6, 7, 8]. A major open question in this line of research is whether better bounds can be obtained when the comparison is made against the optimal explainable clustering.
For the spacing cost function we provided a simple polynomial time algorithm that computes an explainable partition with maximum spacing. An interesting note is that we have not used the fact that the cuts are axis-aligned in the proof of Theorem 4; thus, our result holds for any family of cuts.
References
- [1] M. T. Ribeiro, S. Singh, C. Guestrin, "Why should I trust you?": Explaining the predictions of any classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
- [2] S. Dasgupta, M. Moshkovitz, C. Rashtchian, N. Frost, Explainable k-means and k-medians clustering, in: ICML, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 7055–7065.
- [3] D. Bertsimas, A. Orfanoudaki, H. Wiberg, Interpretable clustering: an optimization approach, Machine Learning (2020) 1–50.
- [4] E. S. Laber, L. Murtinho, On the price of explainability for some clustering problems, in: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Vol. 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 5915–5925. URL http://proceedings.mlr.press/v139/laber21a.html
- [5] K. Makarychev, L. Shan, Near-optimal algorithms for explainable k-medians and k-means, in: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, Vol. 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 7358–7367. URL http://proceedings.mlr.press/v139/makarychev21a.html
- [6] M. Charikar, L. Hu, Near-optimal explainable k-means for all dimensions, in: Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, SIAM, 2022, pp. 2580–2606. doi:10.1137/1.9781611977073.101.
- [7] H. Esfandiari, V. S. Mirrokni, S. Narayanan, Almost tight approximation algorithms for explainable clustering, in: Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, SIAM, 2022, pp. 2641–2663. doi:10.1137/1.9781611977073.103.
- [8] B. Gamlath, X. Jia, A. Polak, O. Svensson, Nearly-tight and oblivious algorithms for explainable clustering, in: Thirty-Fifth Conference on Neural Information Processing Systems, NeurIPS 2021, 2021.
- [9] S. Bandyapadhyay, F. V. Fomin, P. A. Golovach, W. Lochet, N. Purohit, K. Simonov, How to find a good explanation for clustering?, in: Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, AAAI Press, 2022, pp. 3904–3912. URL https://ojs.aaai.org/index.php/AAAI/article/view/20306
- [10] B. Liu, Y. Xia, P. S. Yu, Clustering through decision tree construction, in: Proceedings of the Ninth International Conference on Information and Knowledge Management, 2000, pp. 20–29.
- [11] R. Fraiman, B. Ghattas, M. Svarc, Interpretable clustering using unsupervised binary trees, Advances in Data Analysis and Classification 7 (2) (2013) 125–145.
- [12] N. Megiddo, K. J. Supowit, On the complexity of some common geometric location problems, SIAM J. Comput. 13 (1) (1984) 182–196. doi:10.1137/0213014.
- [13] P. Awasthi, M. Charikar, R. Krishnaswamy, A. K. Sinop, The hardness of approximation of Euclidean k-means, in: 31st International Symposium on Computational Geometry, SoCG 2015, Vol. 34 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015, pp. 754–767. doi:10.4230/LIPIcs.SOCG.2015.754.
- [14] V. Cohen-Addad, Karthik C. S., E. Lee, Johnson coverage hypothesis: Inapproximability of k-means and k-median in ℓp-metrics, in: Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, SIAM, 2022, pp. 1493–1530. doi:10.1137/1.9781611977073.63.
- [15] T. F. Gonzalez, Clustering to minimize the maximum intercluster distance, Theor. Comput. Sci. 38 (1985) 293–306. doi:10.1016/0304-3975(85)90224-5.
- [16] S. Ahmadian, A. Norouzi-Fard, O. Svensson, J. Ward, Better guarantees for k-means and Euclidean k-median by primal-dual algorithms, SIAM J. Comput. 49 (4) (2020). doi:10.1137/18M1171321.
- [17] M. Charikar, S. Guha, É. Tardos, D. B. Shmoys, A constant-factor approximation algorithm for the k-median problem, J. Comput. Syst. Sci. 65 (1) (2002) 129–149. doi:10.1006/jcss.2002.1882.
- [18] J. M. Kleinberg, É. Tardos, Algorithm Design, Addison-Wesley, 2006.
- [19] V. V. Vazirani, Approximation Algorithms, Springer, 2001.
- [20] F. Cicalese, E. S. Laber, Information theoretical clustering is hard to approximate, IEEE Trans. Inf. Theory 67 (1) (2021) 586–597. doi:10.1109/TIT.2020.3031629.