Optimal Decision Trees For Interpretable Clustering with Constraints (Extended Version)
Abstract
Constrained clustering is a semi-supervised task that employs a limited amount of labelled data, formulated as constraints, to incorporate domain-specific knowledge and to significantly improve clustering accuracy. Previous work has considered exact optimization formulations that can guarantee optimal clustering while satisfying all constraints; however, these approaches lack interpretability. Recently, decision trees have been used to produce inherently interpretable clustering solutions; however, existing approaches do not support clustering constraints and do not provide strong theoretical guarantees on solution quality. In this work, we present a novel SAT-based framework for interpretable clustering that supports clustering constraints and provides strong theoretical guarantees on solution quality. We also present new insight into the trade-off between interpretability and satisfaction of such user-provided constraints. Our framework is the first approach for interpretable and constrained clustering. Experiments with a range of real-world and synthetic datasets demonstrate that our approach can produce high-quality and interpretable constrained clustering solutions.
1 Introduction
Clustering is a core unsupervised machine learning problem that aims to partition a dataset into subgroups of similar data points. In practice, it is often used to discover meaningful sub-populations such as customer segments Wu and Chou (2011) or groups of correlated genes Thalamuthu et al. (2006). Constrained clustering is a semi-supervised learning task that exploits small amounts of supervision, provided in the form of constraints, to incorporate domain-specific knowledge and to significantly improve clustering performance Wagstaff and Cardie (2000); Wagstaff et al. (2001). The most popular types of clustering constraints are instance-level pairwise must-link and cannot-link constraints, along with additional types of constraints that can be translated to such pairwise constraints Davidson and Ravi (2005); Liu and Fu (2015). Over the past decades, constrained clustering has received significant attention and a variety of constrained clustering algorithms have been proposed Bilenko et al. (2004); Pelleg and Baras (2007); Liu et al. (2017); Cohen et al. (2020). In particular, approaches based on exact optimization formulations, such as integer programming, constraint programming, and satisfiability, have obtained state-of-the-art performance Davidson et al. (2010); Berg and Järvisalo (2017); Dao et al. (2017); Babaki et al. (2014).
Our interest is in developing an approach to constrained clustering that yields interpretable solutions with strong solution quality guarantees. Recent approaches to clustering via decision trees have yielded a degree of interpretability by exposing the clustering rationale in the branching structure of the tree Frost et al. (2020); Moshkovitz et al. (2020); Bertsimas et al. (2021); Gamlath et al. (2021). However, these approaches do not support the incorporation of clustering constraints. Moreover, little is known about the compatibility of such clustering constraints with decision-tree clustering. While optimal decision trees have received significant attention in recent years, most of the literature is focused on classification tasks Ignatiev et al. (2021).
In this work, we present the first approach for interpretable and constrained optimal clustering based on decision trees. We make the following contributions:
1. We present a novel SAT-based encoding of optimal constrained clustering that is interpretable by a decision tree. Our formulation supports two well-known clustering objectives as well as pairwise instance-level clustering constraints. To our knowledge, this is the first approach for inherently-interpretable constrained clustering.
2. We introduce novel techniques for efficient encoding of clustering problems and for improving search performance. Specifically, we introduce the notion of distance classes to support bounded suboptimal clustering, and a set of pruning rules to reduce the number of clauses.
3. We empirically evaluate our approach over real-world and synthetic datasets and show that it leads to high-quality, inherently-interpretable clustering solutions that satisfy a given set of pairwise constraints.
4. We present theoretical and empirical results on the trade-off between interpretability via decision trees and satisfaction of user-provided clustering constraints.
2 Preliminaries
2.1 Decision Trees for Constrained Clustering
Definition 1 (Decision Tree).
Given a set of features and number of clusters , a decision tree is a set of branching nodes , a set of leaf nodes , a root node , a parent function , left and right child functions , a node feature selection function together with a threshold selection function , and finally a leaf labelling function .
Definition 2 (Tree Cluster Assignment).
Given a dataset , and the number of clusters , a decision tree directs each point to one of its leaves by starting from the root and recursively moving to its left or right child, depending on whether the value of the chosen feature is less than or equal to the threshold, or not. The point is then assigned the cluster label of the leaf, represented by .
The tree clustering partitions into clusters such that each cluster is non-empty.
Consistent with recent work Frost et al. (2020), each cluster is allowed to spread across multiple leaves to improve expressiveness and the ability to satisfy user-provided constraints. If two points are in the same cluster (resp. different clusters), they are said to be clustered together (resp. separately).
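To make the assignment rule concrete, the following sketch (a minimal illustration, not the paper's implementation; the tree attribute names are our own) routes a point through a tree as in Definition 2:

```python
def assign_cluster(x, tree):
    """Return the cluster label of the leaf that point x reaches."""
    node = tree.root
    while node in tree.branching_nodes:
        f, t = tree.feature[node], tree.threshold[node]
        # go left iff the chosen feature's value is <= the threshold
        node = tree.left[node] if x[f] <= t else tree.right[node]
    return tree.leaf_label[node]  # several leaves may share the same label
```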
Definition 3 (Pairwise Clustering Constraints).
Given a dataset , a set of pairs of data points named must-link constraints , and a set of pairs of data points named cannot-link constraints , a clustering is said to respect the constraints if it clusters all must-link pairs together and all cannot-link pairs separately:

$\forall (x, y) \in \mathit{ML}:\ c(x) = c(y) \qquad (1)$
$\forall (x, y) \in \mathit{CL}:\ c(x) \neq c(y) \qquad (2)$

where $c(\cdot)$ denotes the cluster assignment.
Definition 4 (Minimum Split and Maximum Diameter).
Given a dataset , the number of clusters , and a clustering , the minimum split is the shortest distance between two points that are clustered separately,
$\mathit{MS} = \min \{\, d(x, y) : c(x) \neq c(y) \,\} \qquad (3)$
and the maximum diameter is the longest distance between two points that are clustered together,
$\mathit{MD} = \max \{\, d(x, y) : c(x) = c(y) \,\} \qquad (4)$
We assume the distance d to be the Euclidean distance; however, any totally ordered measure of distance is applicable.
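For illustration, both quantities can be computed directly from a cluster assignment as follows (a sketch of Definitions 3 and 4, assuming Euclidean distance and 0-indexed points):

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def split_and_diameter(points, labels, dist=euclidean):
    """Minimum split (Eq. (3)) and maximum diameter (Eq. (4)) of a clustering."""
    ms, md = float("inf"), 0.0
    for (i, xi), (j, xj) in combinations(enumerate(points), 2):
        d = dist(xi, xj)
        if labels[i] != labels[j]:
            ms = min(ms, d)   # clustered separately -> candidate for the split
        else:
            md = max(md, d)   # clustered together  -> candidate for the diameter
    return ms, md
```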
2.2 Problem Definitions
We consider two well-known clustering objectives Dao et al. (2017): (1) minimizing the maximum diameter (MD); (2) a bi-criteria objective that consists of minimizing the maximum diameter and maximizing the minimum split ([MD,MS]). Specifically, we consider bounded approximations parameterized by ε, where ε = 0 indicates an optimal solution (or a Pareto optimal one for [MD,MS]).
Problem 1 (MD ε-Optimal Tree Clustering).
Given a dataset , number of clusters , a set of must-link constraints , a set of cannot-link constraints , decision tree depth , and approximation parameter ε, find a complete decision tree of depth such that:
1. respects the constraints and ;
2. its maximum diameter is within the approximation bound of the optimal maximum diameter,
where is an optimal solution with respect to the objective of minimizing the maximum diameter.
Problem 2 (MS-MD ε-Pareto Optimal Tree Clustering).
Given a dataset , number of clusters , a set of must-link constraints , a set of cannot-link constraints , decision tree depth , and approximation parameter ε, find a complete decision tree of depth such that:
1. respects the constraints and ;
2. its maximum diameter is within the approximation bound of that of the Pareto optimal solution;
3. its minimum split is within the approximation bound of that of the Pareto optimal solution,
where is a Pareto optimal solution with respect to the bi-criteria objective of maximizing the minimum split and minimizing the maximum diameter.
Note that Problem 1 and Problem 2 are not always feasible, since a decision tree of a given depth cannot represent all possible clusterings and may not be able to represent any clustering that satisfies the and constraints. However, in Proposition 1 we show that for every clustering there exists a tree of a certain depth that can represent it. (We only consider cases where the ML and CL constraints are consistent, i.e., where there exists a clustering that satisfies these constraints. For example, if the same pair of points appears in both the and constraints, there exists no clustering that would satisfy these constraints.)
Proposition 1 (Decision Tree Completeness).
Given a dataset and an arbitrary clustering such that
there exists a complete decision tree of sufficiently high depth which partitions into the same clusters. (All proofs appear in the appendix.)
3 SAT-based Optimal Tree Clustering
In this section, we present our approach for solving the problems in Section 2.2. First, we present a way to simplify our handling of pairs while still maintaining the approximation guarantee. Then, we present a novel SAT-based encoding for interpretable and constrained tree clustering.
3.1 Distance Classes
To encode the -approximation in Problem 1 and Problem 2, we divide the set of all pairs of data points based on their distance into non-overlapping intervals called distance classes such that the smallest and largest distances in each class are less than apart. Specifically, we employ a greedy procedure that sorts all pairs from the smallest distance to the largest and greedily adds them one by one, creating new classes as needed to guarantee that the distances in each class are at most apart. All pairs in the same class are treated similarly w.r.t. being clustered together or not.
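A minimal sketch of this greedy construction is shown below (our illustration; it assumes the class width is bounded additively by the approximation parameter, which is one possible reading of the criterion above):

```python
import math
from itertools import combinations

def build_distance_classes(points, epsilon):
    """Sketch of the greedy distance-class construction described above."""
    dist = lambda i, j: math.dist(points[i], points[j])  # Euclidean, per Def. 4
    pairs = sorted(combinations(range(len(points)), 2), key=lambda p: dist(*p))
    classes, smallest = [], None
    for i, j in pairs:
        d = dist(i, j)
        if smallest is None or d - smallest > epsilon:  # class would get too wide
            classes.append([])          # open a new distance class
            smallest = d                # smallest distance in the new class
        classes[-1].append((i, j))
    return classes
```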
In order to maximize the minimum split, we consider the index such that any pair of data points in distance classes must be clustered together (Eq. (5)). Similarly, to minimize the maximum diameter, we consider the index such that any pair in distance classes must be clustered separately (Eq. (6)).
(5)–(6)
Note that in any feasible clustering, for all and values that satisfy (Eqs. (5) and (6)), we have that and . We therefore focus on optimizing the indices and to obtain -optimal solutions to Problem 1 and Problem 2 (Proposition 2).
Proposition 2 (MD and MS-MD ε-approximation).
Let be a tree cluster assignment. We have that:
1. If is an optimal solution w.r.t. minimizing , then is an ε-optimal solution w.r.t. the MD objective.
2. If is a Pareto-optimal solution w.r.t. minimizing and maximizing , then is an ε-Pareto optimal solution w.r.t. the bi-criteria MD-MS objective.
3.2 MaxSAT Encoding
A SAT formula is a conjunction of clauses, a clause is a disjunction of literals, and a literal is either a Boolean variable or its negation. A clause is satisfied if at least one of its literals is true. The SAT problem consists of finding an assignment of the variables that satisfies all clauses in a formula Biere et al. (2009). We model the construction of -optimal constrained clustering trees as Partial MaxSAT, an optimization variant of SAT that divides clauses into hard and soft clauses, requiring variable assignments to satisfy all hard clauses and to maximize the number of satisfied soft clauses.
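For readers unfamiliar with Partial MaxSAT, the following toy example (using the PySAT library, purely for illustration; it is not part of our encoding, and our experiments use the Loandra solver described in Section 4.1) shows the interplay of hard and soft clauses:

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

wcnf = WCNF()
wcnf.append([1, 2])          # hard clause: x1 or x2
wcnf.append([-1, -2])        # hard clause: not x1 or not x2
wcnf.append([1], weight=1)   # soft clause: prefer x1 to be true
wcnf.append([2], weight=1)   # soft clause: prefer x2 to be true

with RC2(wcnf) as solver:
    model = solver.compute()   # e.g. [1, -2]: one soft clause must be violated
    print(model, solver.cost)  # cost = total weight of unsatisfied soft clauses
```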
Variables.
Table 1 describes the set of Boolean variables.
• Feature is chosen for the split at branching node
• Point is directed towards the left child, if it passes through branching node
• Point ends up at leaf node
• The cluster assigned to leaf is or comes after
• The cluster assigned to point is or comes after
• (The negation of) whether the pairs in distance class should be clustered separately
• The pairs in class should be clustered together
Clauses.
We encode the construction of the decision tree (Eqs. (7)-(15)) following the encoding of Shati et al. (2021), a state-of-the-art SAT-based decision-tree classifier that naturally supports the numeric features that are prevalent in clustering problems. These clauses guarantee that exactly one feature is chosen at each branching node (Eqs. (7)-(8)), that points are directed to the left or right child of each branching node based on their value of the chosen feature (Eqs. (9)-(10)), that the appearance of points at leaves correctly corresponds to the path they are directed through within the tree (Eqs. (11)-(13)), and that thresholds are non-trivial (Eqs. (14)-(15)). (The set contains all consecutive pairs of points when ordered according to feature , and the set ( ) contains all nodes that have as a descendant of their left (right) child. For a detailed description of the clauses in Eqs. (7)-(15), we refer the reader to Shati et al. (2021).)
(7)–(15)
The following clauses extend the basic decision tree encoding to support ε-optimal clustering trees that satisfy the and constraints. The clauses in Eq. (16) guarantee a well-formed unary encoding of cluster labels in each leaf. (A unary encoding is well-formed if it does not include the sequence 01 at any point.)
(16)
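For intuition, the well-formedness condition can be enforced with one binary clause per consecutive pair of cluster indices. The sketch below is our illustration of such Eq. (16)-style clauses, where q(t, c) is a hypothetical function mapping a leaf and a cluster index to the corresponding literal ("the cluster assigned to leaf t is c or comes after c", cf. Table 1):

```python
def well_formed_unary_clauses(leaves, num_clusters, q):
    """Clauses q(t, c+1) -> q(t, c), so the unary encoding never contains 01.
    `q(t, c)` maps a leaf and cluster index to a positive integer literal
    (an assumption of this sketch)."""
    clauses = []
    for t in leaves:
        for c in range(num_clusters - 1):
            clauses.append([-q(t, c + 1), q(t, c)])  # not q(t,c+1) or q(t,c)
    return clauses
```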
Eqs. (17)-(18) guarantee that the label assigned to each data point matches the label of the leaf the data point reaches.
(17)–(18)
Eqs. (19)-(20) break ties between equivalent cluster assignments. We consider two cluster assignments and equivalent if there exists a relabelling, , such that . To break ties, we force each , in ascending order, to be assigned to the first empty cluster if it needs a new one. This guarantees that no two feasible solutions are relabellings of each other. Note that we do not eliminate viable solutions, since any clustering can be renamed into an equivalent one that respects this property.
(19)–(20)
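The invariant enforced by the tie-breaking clauses corresponds to the following canonical relabelling (a sketch of the intended effect, not of the clauses themselves):

```python
def canonical_relabelling(labels):
    """Rename cluster labels so that, scanning points in ascending order, each
    newly encountered cluster receives the smallest unused label."""
    mapping = {}
    canonical = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)  # the first empty cluster
        canonical.append(mapping[lab])
    return canonical

# Example: [2, 2, 0, 1, 0] and [0, 0, 1, 2, 1] are the same clustering;
# both are mapped to [0, 0, 1, 2, 1].
```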
The clause in Eq. (21), alongside the tie-breaking clauses, guarantees that all clusters are non-empty, assuming . If we substitute with in Eq. (21), clusters are guaranteed to be non-empty. Thus, we can enforce a minimum and a maximum number of clusters. In our experiments we focus on the setting where all clusters are non-empty.
(21)
Eqs. (22)-(24), namely the unconditional separating clauses, guarantee that pairs in are clustered separately.
(22)–(24)
Eqs. (25)-(26), namely the unconditional co-clustering clauses, guarantee that pairs in are clustered together.
(25)–(26)
Eqs. (27)-(29), namely the conditional separating clauses, guarantee that a variable being set to false forces the pairs in distance class to be clustered separately.
(27)–(29)
Smart Pairs.
Since each pair of points has corresponding clauses that handle being clustered together or separately, a naive encoding is quadratic in the number of points . This is significant since all other parts of the encoding are linear in . To reduce the number of clauses, we view the set of points as nodes in a graph and exploit connections between pairs. Pairs that are forced to be clustered together are represented by positive edges, and pairs that are forced to be clustered separately by negative edges. The positive edges imply a set of connected components of points that will be clustered together, while a negative edge between nodes in different components indicates that the components are mutually exclusive, i.e., they must be placed in different clusters. The order , which sorts the pairs based on distance and breaks ties arbitrarily, is used to greedily build the set of positive edges ( ) and the set of negative edges ( ). As we incrementally build the sets, each new edge can be classified as inner or crossing, based on the previous members of and , to help us detect infeasibility or redundancy of clauses.
Definition 5 (Inner and Crossing edges).
Given the sets and , a new edge is said to be an:
• Inner edge, if it connects two nodes within an existing connected component based on .
• Crossing edge, if it connects two nodes in two connected components based on that are mutually exclusive based on .
We will use edges and the pairs of points that they represent interchangeably onward.
Since () represents the set of pairs that are forced to be clustered together (separately), an inner pair is forced to have the same label and a crossing pair to have different labels. We make use of this fact to avoid adding co-clustering clauses Eqs. (25, 26, 30, 31) or separating clauses Eqs. (22, 23, 24, 27, 28, 29) when it is redundant to do so. Furthermore, we can conclude infeasibility when an inner (crossing) pair is forced to be clustered separately (together).
Our treatment of the unconditional clauses (Eqs. (22)-(26)) and of the conditional ones (Eqs. (27)-(31)) differs in two ways.
1. For the unconditional clauses, the and sets are constructed from the and links, respectively. For the conditional ones, however, we also include pairs that are implied to be co-clustered (separated) due to the distance ordering imposed by the and values. Note that implied pairs used when processing the conditional co-clustering clauses are not valid when processing the conditional separating clauses.
2. If the unconditional separation or co-clustering of a pair is infeasible, the problem is infeasible. However, for conditional clauses, we only fix the corresponding or variable to satisfy the conditional clause.
The detailed procedure is described in the appendix.
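The following is a simplified sketch (ours, not Algorithm 1 itself) of the bookkeeping that supports this classification: a union-find structure maintains the components induced by the positive edges, and negative edges are stored between component representatives.

```python
class PairGraph:
    """Track positive (must-be-together) and negative (must-be-apart) edges."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.neg = set()  # pairs of component representatives

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def classify(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return "inner"     # same component under the positive edges
        if (min(ri, rj), max(ri, rj)) in self.neg:
            return "crossing"  # components known to be mutually exclusive
        return "neither"

    def add_positive(self, i, j):   # merge the two components
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[ri] = rj
            # re-anchor negative edges that referenced the absorbed representative
            self.neg = {tuple(sorted((self.find(a), self.find(b))))
                        for a, b in self.neg}

    def add_negative(self, i, j):   # mark the two components as mutually exclusive
        ri, rj = self.find(i), self.find(j)
        self.neg.add((min(ri, rj), max(ri, rj)))
```

In this view, an unconditional ML pair classified as crossing (or a CL pair classified as inner) implies infeasibility, while a pair classified as inner (crossing) makes the corresponding co-clustering (separating) clauses redundant.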
Objective.
In Section 3.1 we established that in order to find an ε-approximation of an MS-MD Pareto optimal solution, we need to find a solution that is Pareto optimal with regards to maximizing and minimizing . Similarly, for an ε-approximation of an MD optimal solution we need to find a solution that minimizes . Note that and are the number of and variables set to true, respectively.
(35)–(36)
To minimize , we simply introduce a soft clause with unit weight for each (Eq. (37)). To obtain a Pareto-optimal solution, we can optimize as the objective any function that is increasing w.r.t. and decreasing w.r.t. . We opt for minimizing the simple combination , and model it using the soft clauses in Eq. (37) and Eq. (38).
(37)–(38)
For the MD objective, the variables and the clauses in Eqs. (30)-(31) become irrelevant and can be removed.
4 Experiments
4.1 Experiment Setup
We use the Loandra solver Berg et al. (2019) to solve our tree clustering encoding. Loandra is an anytime solver that guarantees optimality if run to completion, but can also produce intermediate solutions. We run experiments on a server with two 12-core Intel E5-2697v2 CPUs and 128 GB of RAM.
Datasets.
We run experiments on seven real datasets from the UCI repository Dua and Graff (2017) and four synthetic datasets from FCPS Ultsch and Lötsch (2020). The datasets vary in size, number of features, and number of clusters, as presented in Table 2. For all datasets, we normalize the values of each feature in the range so that features with larger values do not dominate the pairwise distances.
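For completeness, a per-feature min-max normalization can be sketched as follows (assuming the target range is [0, 1], which is not stated explicitly above):

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature (column) of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    return (X - X.min(axis=0)) / span
```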
Constraint Generation.
We evaluate the performance of our approach for different numbers of constraints, relative to the dataset size. Specifically, for a given value with , we generate a set of random clustering constraints following the methodology of Wagstaff and Cardie (2000): (1) we generate pairs of data points without repetition; (2) for each pair, we generate an ML constraint if both data points share the same ground-truth label and a CL constraint otherwise.
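A sketch of this generation procedure (our illustration, with a hypothetical num_pairs parameter for the number of sampled pairs):

```python
import random

def generate_constraints(labels, num_pairs, seed=0):
    """Sample pairwise constraints: random pairs without repetition;
    ML if the ground-truth labels agree, CL otherwise."""
    rng = random.Random(seed)
    n = len(labels)
    all_pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ml, cl = [], []
    for i, j in rng.sample(all_pairs, num_pairs):
        (ml if labels[i] == labels[j] else cl).append((i, j))
    return ml, cl
```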
Evaluation.
We evaluate the quality of the obtained clusterings based on ground-truth labels using two well-known external clustering evaluation metrics: the Adjusted Rand Index (ARI) Hubert and Arabie (1985) and the Normalized Mutual Information (NMI) Strehl and Ghosh (2002).
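Both metrics are available in scikit-learn; the snippet below is an illustrative way to compute them (not necessarily the exact implementation used in our experiments):

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate(true_labels, predicted_labels):
    """External evaluation against ground-truth labels."""
    return {"ARI": adjusted_rand_score(true_labels, predicted_labels),
            "NMI": normalized_mutual_info_score(true_labels, predicted_labels)}
```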
| Dataset | Constraint fraction | [MD,MS] ARI | ARI (CC) | Feas. | Time (s) | MD ARI | ARI (CC) | Feas. | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
Iris | 0.0 | 0.6 | 0.6 | 20 | 0.6 | 0.62 | 0.7 | 20 | 0.4 |
0.1 | 0.83 | 0.78 | 20 | 0.7 | 0.71 | 0.55 | 20 | 0.4 | |
0.25 | 0.86 | 0.8 | 20 | 0.7 | 0.81 | 0.6 | 20 | 0.4 | |
0.5 | 0.91 | 0.8 | 20 | 0.8 | 0.88 | 0.63 | 20 | 0.5 | |
1.0 | 0.95 | 0.91 | 16 | 0.7 | 0.94 | 0.71 | 16 | 0.4 | |
Wine | 0.0 | 0 | 0 | 20 | 0.7 | 0.38 | 0.21 | 20 | 0.6 |
0.1 | 0.69 | 0.53 | 20 | 1.2 | 0.41 | 0.19 | 20 | 0.6 | |
0.25 | 0.79 | 0.57 | 20 | 1.8 | 0.6 | 0.2 | 20 | 0.9 | |
0.5 | 0.82 | 0.5 | 20 | 6.9 | 0.72 | 0.2 | 20 | 5.4 | |
1.0 | 0.93 | 0.72 | 20 | 8.8 | 0.89 | 0.23 | 20 | 8 | |
Glass | 0.0 | 0.22 | 0.22 | 20 | 2.1 | 0.18 | 0.26 | 20 | 1.2 |
0.1 | 0.19 | 0.1 | 20 | 9.3 | 0.16 | 0.11 | 20 | 3.3 | |
0.25 | 0.24 | 0.06 | 20 | 61.1 | 0.16 | 0.05 | 20 | 35.1 | |
0.5 | 0.26 | 0.07 | 19 | 954.1 | 0.24 | 0.01 | 20 | 624.9 | |
1.0 | - | 0.1 | 0 | 193.8 | - | 0.01 | 0 | 200.1 | |
Ionosphere | 0.0 | 0.01 | 0.01 | 20 | 2.4 | 0.16 | 0.08 | 20 | 2 |
0.1 | 0.28 | 0.11 | 20 | 33.7 | 0.15 | 0.09 | 20 | 10.2 | |
0.25 | 0.5 | 0.22 | 11 | 598.9 | 0.48 | 0.12 | 11 | 880.9 | |
0.5 | - | 0.38 | 0 | 29.5 | - | 0.18 | 0 | 31.4 | |
1.0 | - | 0.78 | 0 | 5.8 | - | 0.76 | 0 | 5.7 | |
Seeds | 0.0 | 0.68 | 0.66 | 20 | 1.8 | 0.57 | 0.68 | 20 | 0.5 |
0.1 | 0.68 | 0.64 | 20 | 1 | 0.67 | 0.54 | 20 | 0.5 | |
0.25 | 0.73 | 0.63 | 20 | 1.3 | 0.72 | 0.52 | 20 | 0.7 | |
0.5 | 0.78 | 0.64 | 14 | 1.5 | 0.78 | 0.52 | 14 | 1.2 | |
1.0 | 0.9 | 0.79 | 1 | 0.7 | 0.93 | 0.72 | 1 | 0.7 | |
Libras | 0.0 | 0.21 | 0.22 | 20 | 1802.8 | 0.2 | 0.16 | 20 | 1802.9 |
0.1 | 0.17 | 0.17 | 20 | 1159.1 | 0.15 | 0.12 | 20 | 941.6 | |
0.25 | 0.17 | 0.14 | 20 | 1080 | 0.14 | 0.1 | 20 | 681.9 | |
0.5 | 0.18 | 0.11 | 20 | 866.2 | 0.14 | 0.07 | 20 | 305.1 | |
1.0 | 0.18 | 0.11 | 20 | 1518.3 | 0.14 | 0.06 | 20 | 859.4 | |
Spam | 0.0 | 0 | 0 | 20 | 38 | -0.02 | -0.01 | 20 | 41.5 |
0.1 | - | 0.04 | 0 | 358.9 | - | 0.05 | 0 | 379.4 | |
0.25 | - | 0.07 | 0 | 286.1 | - | 0.13 | 0 | 312.6 | |
0.5 | - | 0.08 | 0 | 151.1 | - | 0.25 | 0 | 152.9 | |
1.0 | - | 0.51 | 0 | 77.1 | - | 0.51 | 0 | 77.9 | |
Lsun | 0.0 | 0.44 | 0.39 | 20 | 1.2 | 0.39 | 0.39 | 20 | 0.8 |
0.1 | 0.95 | 0.65 | 20 | 1 | 0.74 | 0.25 | 20 | 0.7 | |
0.25 | 1 | 0.94 | 20 | 1 | 0.89 | 0.25 | 20 | 0.6 | |
0.5 | 1 | 1 | 20 | 0.9 | 0.96 | 0.26 | 20 | 0.6 | |
1.0 | 1 | 1 | 20 | 0.9 | 0.98 | 0.48 | 20 | 0.6 | |
Chainlink | 0 | 0.12 | 1 | 20 | 3 | 0.11 | 0.06 | 20 | 2 |
0.1 | 0.89 | 1 | 17 | 5 | 0.84 | 0.01 | 17 | 5.5 | |
0.25 | 0.89 | 1 | 1 | 1.8 | 0.91 | 0.03 | 1 | 1.7 | |
0.5 | - | 1 | 0 | 1.3 | - | 0.05 | 0 | 1.2 | |
1.0 | - | 1 | 0 | 1 | - | 0.56 | 0 | 0.9 | |
Target | 0.0 | 0.36 | 0.36 | 20 | 5.7 | 0.33 | 0.33 | 20 | 3.7 |
0.1 | 1 | 1 | 20 | 2.8 | 0.64 | 0.57 | 20 | 18.8 | |
0.25 | 1 | 1 | 20 | 2.7 | 0.87 | 0.45 | 20 | 9.9 | |
0.5 | 1 | 1 | 20 | 2.4 | 0.95 | 0.25 | 20 | 4.4 | |
1.0 | 1 | 1 | 20 | 2.2 | 0.99 | 0.45 | 20 | 1.9 | |
WingNut | 0.0 | 1 | 1 | 20 | 2.5 | 1 | 0.15 | 20 | 2 |
0.1 | 1 | 1 | 20 | 1.7 | 0.99 | 0.2 | 20 | 1.3 | |
0.25 | 1 | 1 | 20 | 1.6 | 0.99 | 0.28 | 20 | 1.2 | |
0.5 | 1 | 1 | 20 | 1.6 | 1 | 0.49 | 20 | 1.2 | |
1.0 | 1 | 1 | 20 | 1.6 | 1 | 0.82 | 20 | 1.2 |
4.2 Baselines
Constrained Clustering (CC).
We compare our approach with an optimal constrained clustering formulation that is not restricted to conform to a decision tree. To do so, we remove the tree-related components of the encoding, namely the variables , , , and the clauses in Eqs. (7)-(18). Instead, we introduce the clauses in Eq. (39), which guarantee a well-formed unary encoding of labels. (Note that these clauses are redundant in the encoding of tree clustering, as the condition is already enforced for the leaves.)
(39)
When used with the MD objective and , this baseline is equivalent to the maximum diameter constrained clustering formulations in previous works Dao et al. (2016, 2017), i.e., it has the same set of (feasible and) optimal solutions. However, this baseline also supports the ε-approximation and the [MD,MS] objective for a fair comparison with our approach.
Mixed Integer Optimization (MIO).
The work most closely related to ours is Bertsimas et al. (2021), which presents an MIO model for (unconstrained) clustering trees. While their model does not support clustering constraints, we can extend it to incorporate such constraints. However, Bertsimas et al. note that their model does not scale well; they therefore do not present any experimental results and instead focus on a heuristic procedure that has no solution quality guarantees and cannot naturally support clustering constraints. Due to the relevance of the MIO model to our work, we have implemented it using the Gurobi v10 solver and extended it to support clustering constraints (a detailed description of the extended MIO model can be found in the appendix). To our knowledge, this is the only approach in the literature for constructing tree-based clusterings using discrete optimization.
4.3 Results
For our first set of experimental results, presented in Table 2, we solve the tree clustering problem for our datasets with different values of . We fix the value of the approximation at . Consistent with previous work Dao et al. (2016); Babaki et al. (2014), we set the solver time limit to 30 minutes. To avoid bias due to a specific set of constraints, we generate 20 random sets of constraints and report average values for the evaluation metrics and the runtime. Note that it might not be possible for a tree clustering of a fixed depth to satisfy all of the constraints, in particular in high-dimensional datasets with complex patterns where a large number of univariate splits may be needed to fit the ground-truth constraints. Thus, we also report the number of feasible runs (out of the 20 random constraint sets), while excluding infeasible and unknown cases (instances where the solver timed out without finding a feasible solution are considered unknown) from the computed average ARI.
The results show that our approach can produce high-quality interpretable solutions with non-zero values within the time limit. We observe that the [MD,MS] Pareto optimality objective leads to higher quality solutions compared to MD in all but a few cases, without significant overhead in runtime.
Interestingly, we observe that tree clustering almost always leads to higher quality solutions compared to CC. This seems unintuitive since CC is strictly more expressive than tree clustering. We conjecture that both clustering objectives tend to perform better with tree clustering because of its inherently restricted solution space, resulting in potentially worse objective values but higher ARI scores.
Figure 1 highlights the benefit of both tree clustering () and the Pareto objective (dashed). While the difference in ARI in unconstrained clustering () is negligible, as we increase the number of constraints we observe that tree-based clustering significantly outperforms CC and the Pareto objective tends to perform better than the MD objective when constraints are limited. Figure 1 also shows the percentage of feasible instances vs. (bars) and demonstrates that as problems become more constrained, it may be infeasible to find a tree clustering of a given depth that satisfies all the constraints, exposing the trade-off between interpretability via decision trees and satisfaction of user-provided clustering constraints.
Next, we investigate how depth impacts the feasibility of finding a decision tree that satisfies the clustering constraints. The results in Table 3 show that deeper trees can resolve the infeasibility problem in some datasets, e.g., Glass and Ionosphere, however in Spam we find that even with a depth of four we are not able to find decision trees for any of the random constraint sets. Another interesting observation is that unnecessarily deep trees could lead to a lower score, demonstrating the potential benefit of restricted solution spaces that are induced by shallow trees. We observed similar trends for the NMI metric and provide the detailed results in the appendix.
| Dataset | Depth | ARI | Feas. | Time (s) |
|---|---|---|---|---|
Iris | 2 | 0.83 | 2 | 0.3 |
3 | 0.91 | 20 | 0.8 | |
4 | 0.88 | 20 | 0.8 | |
Wine | 2 | 0.87 | 2 | 0.4 |
3 | 0.82 | 20 | 6.9 | |
4 | 0.71 | 20 | 6.2 | |
Glass | 3 | - | 0 | 2.6 |
4 | 0.26 | 19 | 953.6 | |
5 | 0.22 | 20 | 35.8 | |
Ionosphere | 2 | - | 0 | 0.8 |
3 | - | 0 | 29.5 | |
4 | 0.69 | 18 | 731.7 | |
Seeds | 2 | - | 0 | 0.3 |
3 | 0.78 | 14 | 1.5 | |
4 | 0.74 | 20 | 2.1 | |
Libras | 4 | 0.16 | 2 | 1801.4 |
5 | 0.18 | 20 | 866.2 | |
6 | 0.17 | 20 | 315.4 | |
Spam | 2 | - | 0 | 35 |
3 | - | 0 | 151.1 | |
4 | - | 0 | 975.6 | |
Lsun | 2 | 1 | 20 | 0.9 |
3 | 1 | 20 | 0.9 | |
4 | 1 | 20 | 1.1 | |
Chainlink | 2 | - | 0 | 0.8 |
3 | - | 0 | 1.3 | |
4 | 1 | 20 | 3.4 | |
Target | 3 | - | 0 | 1.8 |
4 | 1 | 20 | 2.5 | |
5 | 1 | 20 | 4 | |
WingNut | 2 | 1 | 20 | 1.5 |
3 | 1 | 20 | 1.7 | |
4 | 1 | 20 | 2.1 |
Dataset | Setting | ARI | Time (s) | # Clauses |
---|---|---|---|---|
Libras | SP & | 0.18 | 866.4 | 2,082,261.2 |
0.16 | 822.0 | 3,888,452.0 | ||
0.16 | 1197.1 | 4,140,872.0 | ||
Spam | SP & | Inf. | 151.6 | 3,823,479.2 |
Inf. | 332.7 | 24,980,546.4 | ||
Inf./Unk. | 864.0 | 69,166,751.4 | ||
WingN | SP & | 1.00 | 1.7 | 95,879.25 |
1.00 | 4.2 | 1,128,700.4 | ||
OOM | 98.3 | 3,449,740.4 | ||
OOM indicates an out-of-memory error. |
Comparison with MIO.
Our experiments with the MIO baseline found that it is unable to find a feasible solution for any of the datasets at the depths specified in Table 2 across multiple runs, both in constrained and unconstrained settings. This result is consistent with Bertsimas et al.'s (2021) observation on the limited scalability of their MIO formulation and emphasizes the strong performance of our approach.
Ablation Study.
Finally, we study the impact of the smart pairs procedure (Section 3.2) and the ε-approximation (Section 3.1) on the performance of our approach. Unlike the ε-approximation, smart pairs is guaranteed not to change the set of feasible solutions or the set of optimal solutions. However, since there can be multiple optimal solutions, the different encoding may still lead to an optimal solution with a slightly different ARI score. The results presented in Table 4 show that these methods can reduce the number of clauses and runtimes without a meaningful decrease in score.
5 Conclusion
In this work, we present the first approach for interpretable and constrained clustering using decision trees. Specifically, we present a novel SAT-based encoding for constructing clustering trees that approximate two well-known clustering objectives. Our experiments on a range of real-world and synthetic datasets demonstrate the ability of our approach to produce high-quality and interpretable clustering solutions that incorporate user-provided clustering constraints.
Our work raises several interesting questions to investigate in future work. As there are potentially many Pareto-optimal solutions, investigating and empirically evaluating strategies or tools, e.g. multi-objective solvers Jabs et al. (2022), for exploring the Pareto front and selecting promising solutions is an interesting direction for future work. One of the challenges identified in this work is that highly-constrained problems in complex, high-dimensional datasets can become infeasible for a given tree depth. In future work, we would like to investigate strategies to overcome this challenge, e.g., by converting the hard clustering constraints into soft constraints that are encouraged rather than required to be satisfied.
Acknowledgments
We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canada CIFAR AI Chairs Program. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute for Artificial Intelligence (www.vectorinstitute.ai/partners). The final author would also like to thank the Schwartz Reisman Institute for Technology and Society for providing a rich multi-disciplinary research environment.
References
- Babaki et al. [2014] Behrouz Babaki, Tias Guns, and Siegfried Nijssen. Constrained clustering using column generation. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 438–454. Springer, 2014.
- Berg and Järvisalo [2017] Jeremias Berg and Matti Järvisalo. Cost-optimal constrained correlation clustering via weighted partial maximum satisfiability. Artificial Intelligence, 244:110–142, 2017.
- Berg et al. [2019] Jeremias Berg, Emir Demirović, and Peter J Stuckey. Core-boosted linear search for incomplete MaxSAT. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research (CPAIOR), pages 39–56. Springer, 2019.
- Bertsimas et al. [2021] Dimitris Bertsimas, Agni Orfanoudaki, and Holly Wiberg. Interpretable clustering: an optimization approach. Machine Learning, 110(1):89–138, 2021.
- Biere et al. [2009] Armin Biere, Marijn Heule, and Hans van Maaren. Handbook of satisfiability, volume 185. IOS press, 2009.
- Bilenko et al. [2004] Mikhail Bilenko, Sugato Basu, and Raymond J Mooney. Integrating constraints and metric learning in semi-supervised clustering. In Proceedings of the twenty-first international conference on Machine learning, page 11, 2004.
- Cohen et al. [2020] Eldan Cohen, Arik Senderovich, and J Christopher Beck. An ising framework for constrained clustering on special purpose hardware. In International Conference on Integration of Constraint Programming, Artificial Intelligence, and Operations Research, pages 130–147. Springer, 2020.
- Dao et al. [2016] Thi-Bich-Hanh Dao, Christel Vrain, Khanh-Chuong Duong, and Ian Davidson. A framework for actionable clustering using constraint programming. In Proceedings of the Twenty-second European Conference on Artificial Intelligence, pages 453–461, 2016.
- Dao et al. [2017] Thi-Bich-Hanh Dao, Khanh-Chuong Duong, and Christel Vrain. Constrained clustering by constraint programming. Artificial Intelligence, 244:70–94, 2017.
- Davidson and Ravi [2005] Ian Davidson and SS Ravi. Clustering with constraints: Feasibility issues and the k-means algorithm. In Proceedings of the 2005 SIAM international conference on data mining, pages 138–149. SIAM, 2005.
- Davidson et al. [2010] Ian Davidson, SS Ravi, and Leonid Shamis. A sat-based framework for efficient constrained clustering. In Proceedings of the 2010 SIAM international conference on data mining, pages 94–105. SIAM, 2010.
- Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017.
- Frost et al. [2020] Nave Frost, Michal Moshkovitz, and Cyrus Rashtchian. ExKMC: Expanding explainable k-means clustering. arXiv preprint arXiv:2006.02399, 2020.
- Gamlath et al. [2021] Buddhima Gamlath, Xinrui Jia, Adam Polak, and Ola Svensson. Nearly-tight and oblivious algorithms for explainable clustering. Advances in Neural Information Processing Systems, 34:28929–28939, 2021.
- Hubert and Arabie [1985] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.
- Ignatiev et al. [2021] Alexey Ignatiev, Joao Marques-Silva, Nina Narodytska, and Peter J Stuckey. Reasoning-based learning of interpretable ml models. In IJCAI, pages 4458–4465, 2021.
- Jabs et al. [2022] Christoph Jabs, Jeremias Berg, Andreas Niskanen, and Matti Järvisalo. Maxsat-based bi-objective boolean optimization. In 25th International Conference on Theory and Applications of Satisfiability Testing (SAT 2022). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2022.
- Liu and Fu [2015] Hongfu Liu and Yun Fu. Clustering with partition level side information. In 2015 IEEE international conference on data mining, pages 877–882. IEEE, 2015.
- Liu et al. [2017] Hongfu Liu, Zhiqiang Tao, and Yun Fu. Partition level constrained clustering. IEEE transactions on pattern analysis and machine intelligence, 40(10):2469–2483, 2017.
- Moshkovitz et al. [2020] Michal Moshkovitz, Sanjoy Dasgupta, Cyrus Rashtchian, and Nave Frost. Explainable k-means and k-medians clustering. In International Conference on Machine Learning, pages 7055–7065. PMLR, 2020.
- Pelleg and Baras [2007] Dan Pelleg and Dorit Baras. K-means with large and noisy constraint sets. In European Conference on Machine Learning, pages 674–682. Springer, 2007.
- Shati et al. [2021] Pouya Shati, Eldan Cohen, and Sheila McIlraith. Sat-based approach for learning optimal decision trees with non-binary features. In 27th International Conference on Principles and Practice of Constraint Programming (CP 2021). Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2021.
- Strehl and Ghosh [2002] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.
- Thalamuthu et al. [2006] Anbupalam Thalamuthu, Indranil Mukhopadhyay, Xiaojing Zheng, and George C Tseng. Evaluation and comparison of gene clustering methods in microarray analysis. Bioinformatics, 22(19):2405–2412, 2006.
- Ultsch and Lötsch [2020] Alfred Ultsch and Jörn Lötsch. The fundamental clustering and projection suite (fcps): A dataset collection to test the performance of clustering and data projection algorithms. Data, 5(1):13, 2020.
- Wagstaff and Cardie [2000] Kiri Wagstaff and Claire Cardie. Clustering with instance-level constraints. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 1103–1110. Citeseer, 2000.
- Wagstaff et al. [2001] Kiri Wagstaff, Claire Cardie, Seth Rogers, and Stefan Schroedl. Constrained k-means clustering with background knowledge. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 577–584, 2001.
- Wu and Chou [2011] Roung-Shiunn Wu and Po-Hsuan Chou. Customer segmentation of multiple category data in e-commerce using a soft-clustering approach. Electronic Commerce Research and Applications, 10(3):331–341, 2011.
Appendix A Proofs
Proof of Proposition 1.
We utilize the notion of consistent labelling and the completeness theorem (Theorem 1) proved in Shati et al. [2021] in order to provide a constructive proof for Proposition 1.
Note that the precondition of the proposition is logically equivalent to the consistent labelling condition in Shati et al. [2021], when we rename to . Given this precondition, Theorem 1 states that there exists a complete tree of depth which gives the same labels as to each point, concluding our proof of the statement. ∎
Proof of Proposition 2.
We prove the first and second parts of Proposition 2 separately.
1. Let be the lowest value possible and let be the solution yielded by this minimization. To prove by contradiction, we assume that there exists another solution that improves by more than .
First, we want to show that the pair corresponding to the distance belongs to .
• If with , we have , which contradicts the fact that is a diameter.
• If with , we can conclude that all pairs belonging to are clustered separately, since is the longest diameter. Thus, is also a valid value, contradicting our assumption of optimality.
Since , , and considering that the shortest and longest distances of each distance class are at most apart, we can conclude that the pair corresponding to the distance is in a distance class with . Since is the longest diameter, we can conclude that is a valid value for the solution , contradicting being the lowest value across all solutions. Therefore, our assumption of the existence of is false, and is at most apart from the optimal solution with regards to .
2. Let and be Pareto optimal values for and , and let be the solution yielded by this optimization. We want to show that there exists a solution that is Pareto optimal with regards to minimizing and maximizing such that Eqs. (40)-(41) hold.
We use reasoning analogous to that presented in Part 1 to show that there does not exist a solution such that Eqs. (42)-(43) hold. The same reasoning is used to show that there does not exist a solution such that Eqs. (44)-(45) hold. Note that instead of contradicting being the minimum as in Part 1, we can show that the existence of these two solutions contradicts and being a pair of Pareto optimal solutions.
Let be the optimal solution to the optimization problem in Eqs. (46)-(48), alongside all of the constraints of the original problem. The fact that optimizes shows that it is Pareto optimal, since ( ) cannot be further decreased (increased). The constraints on the solution do not hinder Pareto optimality, as one of them is a lower bound for and the other an upper bound for .
Note that leads to the existence of and leads to the existence of , both of which result in contradictions. Therefore, is our Pareto optimal solution that is no more than apart from in either direction.
∎
Appendix B Mixed Integer Programming Encoding
In the following, we provide the extended mixed-integer optimization (MIO) formulation for clustering trees with constraints, based on the MIO formulation for clustering trees of Bertsimas et al. [2021].
Variables.
• The variables in the Silhouette objective.
• The variables in the Silhouette objective, namely the average distance between points and their second closest cluster.
• The variables in the Silhouette objective, namely the average distance between points and their cluster.
• in the Silhouette objective.
• (Binary) Cluster (leaf) is the 2nd closest cluster to .
• The distance between and cluster (leaf) .
• (Binary) belongs to cluster (leaf) .
• The size of cluster (leaf) .
• (Binary) Feature is chosen at branching node .
• (Binary) Branching node is activated.
• Threshold at branching node .
• (Binary) Leaf node is activated.
Constraints.
The following set of linear and quadratic constraints models the objective (Eq. (49)) and guarantees soundness with regards to the relation between Silhouette values (Eqs. (50)-(52)), the selection of and distance to the second closest cluster (Eqs. (53)-(55)), the distance to each cluster (Eqs. (56)-(57)), the size of each cluster (Eqs. (58)-(59)), the activation of branching nodes (Eqs. (60)-(62)), the activation of leaf nodes (Eqs. (63)-(64)), and the splits leading up to appearance at leaves (Eqs. (65)-(67)).
(49)–(73)
where is the minimum support for the number of points appearing at each active leaf. See Bertsimas et al. [2021] for a detailed discussion of the role of each constraint.
Changes from Bertsimas et al. [2021].
We made a series of changes to the original MIO model as presented in Bertsimas et al. [2021]. We have categorized the modifications into three groups based on their nature and significance.
• Changes in the notation:
  – Iterating over points:
  – Iterating over features:
  – Distance between two points:
  – The epsilon value used in modelling splits (to avoid confusion with the epsilon used for approximating the objective in our method):
• Changes in the model:
  – The big M constant in Eq. (55) serves no purpose: removed .
  – Eq. (53) allowing to be arbitrarily large would trivialize the objective:
  – If were multiplied by on the left-hand side of Eq. (67), it would be possible for points to be directed to both left and right children if a branching node is disabled:
  – Not having a constant upper bound for the variables caused runtime issues: added Eq. (73).
Appendix C Smart Pairs Algorithm
In Algorithm 1, we provide a detailed description of how inner and crossing edges (Definition 5) can be utilized in the smart pairs procedure for detecting redundant clauses and potential infeasibility.
Appendix D NMI Results
| Dataset | Constraint fraction | [MD,MS] NMI | NMI (F) | Feas. | Time (s) | MD NMI | NMI (F) | Feas. | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
Iris | 0.0 | 0.67 | 0.67 | 20 | 0.6 | 0.68 | 0.72 | 20 | 0.4 |
0.1 | 0.83 | 0.79 | 20 | 0.7 | 0.72 | 0.6 | 20 | 0.4 | |
0.25 | 0.85 | 0.79 | 20 | 0.7 | 0.8 | 0.62 | 20 | 0.4 | |
0.5 | 0.89 | 0.79 | 20 | 0.8 | 0.86 | 0.65 | 20 | 0.5 | |
1.0 | 0.93 | 0.89 | 16 | 0.7 | 0.93 | 0.71 | 16 | 0.4 | |
Wine | 0.0 | 0.02 | 0.02 | 20 | 0.7 | 0.41 | 0.2 | 20 | 0.6 |
0.1 | 0.68 | 0.53 | 20 | 1.2 | 0.44 | 0.22 | 20 | 0.6 | |
0.25 | 0.76 | 0.57 | 20 | 1.8 | 0.59 | 0.23 | 20 | 0.9 | |
0.5 | 0.79 | 0.49 | 20 | 6.9 | 0.69 | 0.22 | 20 | 5.4 | |
1.0 | 0.9 | 0.69 | 20 | 8.8 | 0.86 | 0.25 | 20 | 8 | |
Glass | 0.0 | 0.37 | 0.38 | 20 | 2.1 | 0.32 | 0.38 | 20 | 1.2 |
0.1 | 0.3 | 0.2 | 20 | 9.3 | 0.27 | 0.21 | 20 | 3.3 | |
0.25 | 0.32 | 0.18 | 20 | 61.1 | 0.25 | 0.16 | 20 | 35.1 | |
0.5 | 0.32 | 0.18 | 19 | 954.1 | 0.32 | 0.13 | 20 | 624.9 | |
1.0 | - | 0.2 | 0 | 193.8 | - | 0.14 | 0 | 200.1 | |
Ionosphere | 0.0 | 0.02 | 0.02 | 20 | 2.4 | 0.09 | 0.07 | 20 | 2 |
0.1 | 0.21 | 0.11 | 20 | 33.7 | 0.09 | 0.07 | 20 | 10.2 | |
0.25 | 0.39 | 0.2 | 11 | 598.9 | 0.37 | 0.08 | 11 | 880.9 | |
0.5 | - | 0.32 | 0 | 29.5 | - | 0.13 | 0 | 31.4 | |
1.0 | - | 0.7 | 0 | 5.8 | - | 0.67 | 0 | 5.7 | |
Seeds | 0.0 | 0.65 | 0.63 | 20 | 1.8 | 0.57 | 0.68 | 20 | 0.5 |
0.1 | 0.65 | 0.63 | 20 | 1 | 0.65 | 0.56 | 20 | 0.5 | |
0.25 | 0.7 | 0.63 | 20 | 1.3 | 0.69 | 0.55 | 20 | 0.7 | |
0.5 | 0.74 | 0.63 | 14 | 1.5 | 0.74 | 0.54 | 14 | 1.2 | |
1.0 | 0.87 | 0.77 | 1 | 0.7 | 0.9 | 0.7 | 1 | 0.7 | |
Libras | 0.0 | 0.46 | 0.49 | 20 | 1802.8 | 0.47 | 0.39 | 20 | 1802.9 |
0.1 | 0.42 | 0.42 | 20 | 1159.1 | 0.4 | 0.34 | 20 | 941.6 | |
0.25 | 0.43 | 0.38 | 20 | 1080 | 0.38 | 0.32 | 20 | 681.9 | |
0.5 | 0.43 | 0.33 | 20 | 866.2 | 0.37 | 0.26 | 20 | 305.1 | |
1.0 | 0.42 | 0.32 | 20 | 1518.3 | 0.37 | 0.24 | 20 | 859.4 | |
Spam | 0.0 | 0 | 0 | 20 | 38 | 0.04 | 0 | 20 | 41.5 |
0.1 | - | 0.02 | 0 | 358.9 | - | 0.03 | 0 | 379.4 | |
0.25 | - | 0.05 | 0 | 286.1 | - | 0.1 | 0 | 312.6 | |
0.5 | - | 0.06 | 0 | 151.1 | - | 0.2 | 0 | 152.9 | |
1.0 | - | 0.45 | 0 | 77.1 | - | 0.45 | 0 | 77.9 | |
Lsun | 0.0 | 0.54 | 0.49 | 20 | 1.2 | 0.48 | 0.5 | 20 | 0.8 |
0.1 | 0.95 | 0.65 | 20 | 1 | 0.73 | 0.28 | 20 | 0.7 | |
0.25 | 1 | 0.94 | 20 | 1 | 0.86 | 0.25 | 20 | 0.6 | |
0.5 | 1 | 1 | 20 | 0.9 | 0.93 | 0.25 | 20 | 0.6 | |
1.0 | 1 | 1 | 20 | 0.9 | 0.97 | 0.39 | 20 | 0.6 | |
Chainlink | 0 | 0.09 | 1 | 20 | 3 | 0.1 | 0.07 | 20 | 2 |
0.1 | 0.81 | 1 | 17 | 5 | 0.76 | 0.01 | 17 | 5.5 | |
0.25 | 0.82 | 1 | 1 | 1.8 | 0.85 | 0.05 | 1 | 1.7 | |
0.5 | - | 1 | 0 | 1.3 | - | 0.05 | 0 | 1.2 | |
1.0 | - | 1 | 0 | 1 | - | 0.49 | 0 | 0.9 | |
Target | 0.0 | 0.44 | 0.44 | 20 | 5.7 | 0.4 | 0.39 | 20 | 3.7 |
0.1 | 1 | 1 | 20 | 2.8 | 0.6 | 0.55 | 20 | 18.8 | |
0.25 | 1 | 1 | 20 | 2.7 | 0.8 | 0.41 | 20 | 9.9 | |
0.5 | 1 | 1 | 20 | 2.4 | 0.91 | 0.26 | 20 | 4.4 | |
1.0 | 1 | 1 | 20 | 2.2 | 0.97 | 0.4 | 20 | 1.9 | |
WingNut | 0.0 | 1 | 1 | 20 | 2.5 | 0.99 | 0.27 | 20 | 2 |
0.1 | 1 | 1 | 20 | 1.7 | 0.98 | 0.26 | 20 | 1.3 | |
0.25 | 1 | 1 | 20 | 1.6 | 0.99 | 0.31 | 20 | 1.2 | |
0.5 | 1 | 1 | 20 | 1.6 | 0.99 | 0.46 | 20 | 1.2 | |
1.0 | 1 | 1 | 20 | 1.6 | 1 | 0.75 | 20 | 1.2 |
The results presented in Table 5 and Table 6 mirror those in Table 2 and Table 3, with NMI scores instead of ARI. The improvement trends we see in NMI scores resemble those of ARI; namely, the Pareto optimal objective achieves a higher average NMI than the diameter-only objective, and tree clustering outperforms CC on average as well.
| Dataset | Depth | NMI | Feas. | Time (s) |
|---|---|---|---|---|
Iris | 2 | 0.81 | 2 | 0.3 |
3 | 0.89 | 20 | 0.8 | |
4 | 0.86 | 20 | 0.8 | |
Wine | 2 | 0.84 | 2 | 0.4 |
3 | 0.79 | 20 | 6.9 | |
4 | 0.69 | 20 | 6.2 | |
Glass | 3 | - | 0 | 2.6 |
4 | 0.32 | 19 | 953.6 | |
5 | 0.28 | 20 | 35.8 | |
Ionosphere | 2 | - | 0 | 0.8 |
3 | - | 0 | 29.5 | |
4 | 0.57 | 18 | 731.7 | |
Seeds | 2 | - | 0 | 0.3 |
3 | 0.74 | 14 | 1.5 | |
4 | 0.71 | 20 | 2.1 | |
Libras | 4 | 0.4 | 2 | 1801.4 |
5 | 0.43 | 20 | 866.2 | |
6 | 0.41 | 20 | 315.4 | |
Spam | 2 | - | 0 | 35 |
3 | - | 0 | 151.1 | |
4 | - | 0 | 975.6 | |
Lsun | 2 | 1 | 20 | 0.9 |
3 | 1 | 20 | 0.9 | |
4 | 1 | 20 | 1.1 | |
Chainlink | 2 | - | 0 | 0.8 |
3 | - | 0 | 1.3 | |
4 | 1 | 20 | 3.4 | |
Target | 3 | - | 0 | 1.8 |
4 | 1 | 20 | 2.5 | |
5 | 1 | 20 | 4 | |
WingNut | 2 | 1 | 20 | 1.5 |
3 | 1 | 20 | 1.7 | |
4 | 1 | 20 | 2.1 |