Flattening Multiparameter Hierarchical Clustering Functors
Abstract
We bring together topological data analysis, applied category theory, and machine learning to study multiparameter hierarchical clustering. We begin by introducing a procedure for flattening multiparameter hierarchical clusterings. We demonstrate that this procedure is a functor from a category of multiparameter hierarchical partitions to a category of binary integer programs. We also include empirical results demonstrating its effectiveness. Next, we introduce a Bayesian update algorithm for learning clustering parameters from data. We demonstrate that the composition of this algorithm with our flattening procedure satisfies a consistency property.
1 Introduction
One of the most useful ways to process a dataset represented as a finite metric space is to cluster the dataset, or break it into groups. An important step is choosing the desired granularity of the clustering, and multiparameter hierarchical clustering algorithms accept a hyperparameter vector to control this.
Since applying a multiparameter hierarchical clustering algorithm with different hyperparameter values may produce different partitions, we can view such algorithms as mapping a finite metric space $(X, d_X)$ to a function from the hyperparameter space to the space of partitions of $X$. A flattening procedure then maps this function to a single partition.
Many popular clustering algorithms, such as HDBSCAN [7] and ToMATo [3], include a flattening step that operates on the same intuition that drives the analysis of persistence diagrams in TDA. That is, the most important clusters (homological structures) of a dataset are those which exist at multiple scales (have large differences between their birth and death times).
In this paper we will characterize and study clustering algorithms as functors, similarly to [1, 4, 12]. We will particularly focus on multiparameter hierarchical clustering algorithms with partially ordered hyperparameter spaces. This perspective allows us to guarantee that the clustering algorithms we study preserve both non-expansive maps between metric spaces and the ordering of the hyperparameter space. Our contributions are:
- We describe an algorithm for flattening multiparameter hierarchical clusterings, which we demonstrate is a functorial map from a category of multiparameter hierarchical partitions to a category of binary integer programs.
- We introduce a Bayesian update algorithm for learning a distribution over clustering hyperparameters from data.
- We prove that the composition of the Bayesian update algorithm and the flattening procedure is consistent.
2 Multiparameter Hierarchical Clustering
In this work we will define flat clustering algorithms to map a finite metric space $(X, d_X)$ to a partition of $X$. We will primarily work with the following categories:
Definition 1
In the category $\mathbf{Met}$, objects are finite metric spaces $(X, d_X)$ and morphisms are non-expansive maps, or functions $f: (X, d_X) \to (Y, d_Y)$ such that $d_Y(f(x_1), f(x_2)) \le d_X(x_1, x_2)$ for all $x_1, x_2 \in X$.
Definition 2
In the category $\mathbf{Part}$, objects are tuples $(X, P_X)$ where $P_X$ is a partition of the set $X$. Morphisms in $\mathbf{Part}$ are functions $f: X \to Y$ such that if $x_1, x_2$ lie in the same block of $P_X$ then $f(x_1), f(x_2)$ lie in the same block of $P_Y$.
We will also work in the subcategories $\mathbf{Met}_{bij}$ and $\mathbf{Part}_{bij}$ of $\mathbf{Met}$ and $\mathbf{Part}$ respectively, in which morphisms are further restricted to be bijective.
Definition 3
Given a subcategory $\mathbf{C}$ of $\mathbf{Met}$, a flat clustering functor on $\mathbf{C}$ is a functor $H: \mathbf{C} \to \mathbf{Part}$ that is the identity on the underlying set $X$. In the case that $\mathbf{C}$ is unspecified we simply call $H$ a flat clustering functor.
Now recall that the set of connected components of the $\delta$-Vietoris-Rips complex of $(X, d_X)$ is the partitioning of $X$ generated by the relation that places $x$ and $y$ in the same subset whenever $d_X(x, y) \le \delta$. Given $\delta \in \mathbb{R}_{\ge 0}$, an example of a flat clustering functor on $\mathbf{Met}$ is the $\delta$-single linkage functor $\mathcal{SL}_\delta$, which maps a metric space to the connected components of its $\delta$-Vietoris-Rips complex [12, 4, 1]. Given $\delta \in \mathbb{R}_{\ge 0}$ and $k \in \mathbb{N}$, an example of a flat clustering functor on $\mathbf{Met}_{bij}$ is the robust single linkage functor $\mathcal{RSL}_{k,\delta}$, which maps a metric space $(X, d_X)$ to the connected components of the $\delta$-Vietoris-Rips complex of $(X, d'_X)$ where:
$$d'_X(x, y) = \max\left(d_X(x, y),\; r_k(x),\; r_k(y)\right)$$
and $r_k(x)$ is the distance from $x$ to its $k$th nearest neighbor [2]. Intuitively, robust single linkage reduces the impact of dataset noise by increasing distances in sparse regions of the space. Note that robust single linkage is not a flat clustering functor on $\mathbf{Met}$ because it includes a $k$-nearest neighbor computation that is not preserved by arbitrary non-expansive maps.
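As a concrete illustration, the following Python sketch computes both constructions from a dense distance matrix. The helper names and the use of SciPy are illustrative choices rather than the implementations of [2] or [7].

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def single_linkage(dist, delta):
    """Connected components of the delta-Vietoris-Rips complex of a finite metric space."""
    adjacency = csr_matrix(dist <= delta)
    _, labels = connected_components(adjacency, directed=False)
    return labels


def robust_single_linkage(dist, delta, k):
    """Single linkage after inflating distances in sparse regions (cf. [2], [7])."""
    # r_k(x): distance from x to its k-th nearest neighbor (column 0 is x itself).
    r_k = np.sort(dist, axis=1)[:, k]
    # Raise each pairwise distance to at least the k-NN radius of both endpoints.
    inflated = np.maximum(dist, np.maximum.outer(r_k, r_k))
    return single_linkage(inflated, delta)
```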
Like single linkage and robust single linkage, many flat clustering algorithms are configured by a hyperparameter vector that governs their behavior. In the case that this hyperparameter vector is an element of a partial order $D$ we can represent the output of such an algorithm on a fixed metric space with a functor $D \to \mathbf{Part}$.
Recall that $\mathbf{Part}^D$ is the category of functors from $D$ to $\mathbf{Part}$ and natural transformations between them. We will write $\mathbf{Part}^D_{\mathbf{Set}}$ to represent the subcategory of such functors that commute with the forgetful functor $\mathbf{Part} \to \mathbf{Set}$, so that every $F(\theta)$ has the same underlying set. Given $F$ in $\mathbf{Part}^D_{\mathbf{Set}}$ we will call this common image $X$ the underlying set of $F$. Note that there also exists a forgetful functor $\mathbf{Part}^D_{\mathbf{Set}} \to \mathbf{Set}$ that maps $F$ to its underlying set and that any natural transformation in $\mathbf{Part}^D_{\mathbf{Set}}$ between the functors $F_1$ and $F_2$ with underlying sets $X_1$ and $X_2$ is fully specified by a function $f: X_1 \to X_2$.
Definition 4
Given a partial order $D$ and a subcategory $\mathbf{C}$ of $\mathbf{Met}$, a $D$-clustering functor on $\mathbf{C}$ is a functor $H: \mathbf{C} \to \mathbf{Part}^D_{\mathbf{Set}}$ that commutes with the forgetful functors from $\mathbf{Part}^D_{\mathbf{Set}}$ and $\mathbf{C}$ into $\mathbf{Set}$. In the case that $\mathbf{C}$ is unspecified we simply call $H$ a $D$-clustering functor.
For example, single linkage is an $(\mathbb{R}_{\ge 0}, \le)$-clustering functor on $\mathbf{Met}$, and robust single linkage is a $D$-clustering functor on $\mathbf{Met}_{bij}$ for a partial order $D$ over its hyperparameters $(k, \delta)$.
Definition 5
Given the functor $F$ in $\mathbf{Part}^D_{\mathbf{Set}}$ with underlying set $X$, its partition collection $\mathcal{S}_F$ is the set of all subsets $S \subseteq X$ such that there exists some $\theta \in D$ where $S$ is a block of $F(\theta)$.
We will write the elements of $\mathcal{S}_F$ (subsets of $X$) with the notation $S_1, S_2, \ldots, S_m$.
In practice it is often convenient to “flatten” a functor $F$ to a single partition of $X$ by selecting a non-overlapping collection of sets from its partition collection $\mathcal{S}_F$. Since we will express this selection problem as a binary integer program we will work in the following category:
Definition 6
The objects in $\mathbf{BIP}$ are tuples consisting of two real-valued vectors, a real-valued matrix and a $\{0,1\}$-valued matrix of compatible dimensions. Intuitively, the tuple represents the following binary integer program: find an $n$-element $\{0,1\}$-valued vector that maximizes a linear objective subject to linear constraints, both determined by the tuple.
The morphisms between and are tuples where are real-valued matrices, are real-valued matrices, is an -valued matrix and is an -valued matrix such that:
where the operation is performed with logical matrix multiplication. Intuitively, a morphism maps the binary integer program “find an -element -valued vector that maximizes subject to ” to the binary integer program “find an -element -valued vector that maximizes subject to ”.
When we construct a binary integer program from an object $F$ in $\mathbf{Part}^D_{\mathbf{Set}}$ with underlying set $X$, we weight the elements of its partition collection $\mathcal{S}_F$ according to a model of the importance of different regions of $D$. In this work we will only consider hyperparameter spaces $D$ that are Borel measurable, so we can represent this model with a probability measure $\mu$ over $D$. This probabilistic interpretation will be useful in Section 2.2 when we update this model with labeled data. We can then view the flattening algorithm as choosing a non-overlapping subset $S' \subseteq \mathcal{S}_F$ (the elements of $S'$ are subsets of $X$ where no element of $X$ belongs to more than a single set in $S'$) that maximizes the expectation of the function that maps $\theta$ to the number of blocks of $F(\theta)$ that are also in $S'$. Formally, the algorithm maximizes $\mathbb{E}_{\theta \sim \mu}\big[\,|\{S \in S' : S \text{ is a block of } F(\theta)\}|\,\big] = \sum_{S \in S'} \mu\left(\{\theta \in D : S \text{ is a block of } F(\theta)\}\right)$. If $\mu$ is uniform this is similar to the Topologically Motivated HDBSCAN described in [7].
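To make the optimization concrete, the following sketch builds and solves this selection problem with SciPy's mixed-integer solver (SciPy ≥ 1.9). The function name `flatten_by_bip` and the estimated-weight interface are illustrative assumptions rather than the construction of Proposition 1 below.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp


def flatten_by_bip(candidate_sets, weights, points):
    """Select a non-overlapping subset of candidate clusters with maximum total weight.

    candidate_sets: list of frozensets (the partition collection S_F).
    weights: weights[i] estimates mu({theta : candidate_sets[i] is a block at theta}).
    points: the underlying set X.
    """
    m = len(candidate_sets)
    # One constraint row per point x: at most one selected set may contain x.
    overlap = np.array([[1.0 if x in S else 0.0 for S in candidate_sets] for x in points])
    result = milp(
        c=-np.asarray(weights, dtype=float),  # milp minimizes, so negate to maximize
        constraints=LinearConstraint(overlap, ub=np.ones(len(points))),
        integrality=np.ones(m),               # every variable is integer-valued...
        bounds=Bounds(0, 1),                  # ...and constrained to {0, 1}
    )
    return [S for S, z in zip(candidate_sets, result.x) if z > 0.5]
```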
Proposition 1
Given a probability measure $\mu$ over $D$, there exists a functor $\mathcal{F}_\mu$ from the subcategory of $\mathbf{Part}^D_{\mathbf{Set}}$ with bijective underlying functions to $\mathbf{BIP}$ that maps $F$ with partition collection $\mathcal{S}_F$ to a tuple whose associated binary integer program has the following property: any solution specifies a non-overlapping subset $S' \subseteq \mathcal{S}_F$ that maximizes $\mathbb{E}_{\theta \sim \mu}\big[\,|\{S \in S' : S \text{ is a block of } F(\theta)\}|\,\big]$.
Proof
Given a probability measure over , maps the functor with partition collection to the binary integer program where are -element real-valued vectors and are respectively real-valued and -valued matrices where:
A natural transformation between with underlying sets and partition collections that is specified by the surjective function is sent to the tuple where is an real-valued matrix, are real-valued matrices, and is an -valued matrix such that:
First we will show that any feasible solution to the integer program corresponds to a selection of elements from with no overlaps. If , then the th row of will have in position and in position . This implies that if then if and only if . Note also that by the definition of binary integer programming this is the selection of elements that maximizes:
Next, we will show that is a functor. Consider with underlying sets and partition collections and suppose:
Consider also a natural transformation specified by the function and define the image of on this natural transformation to be . We first need to show that:
Where is performed with logical matrix multiplication. In order to see that , note that:
Next, to see that , note that . Next, to see that , recall that is an matrix and note that since is surjective it must be that . Therefore both the left submatrix of and the top submatrix of are diagonal, so the product is a diagonal matrix with:
Next, to see that , note first that is a -valued matrix where:
And therefore that:
Finally, note that preserves the identity since when and it preserves composition by the laws of matrix multiplication. ∎
For example, if $\mu$ is uniform then the connected components of the Vietoris-Rips filtration of $(X, d_X)$ that have the largest differences between their birth and death times will form a solution to $\mathcal{F}_\mu(F)$. Note also that $\mathcal{F}_\mu$ is only functorial over the subcategory of $\mathbf{Part}^D_{\mathbf{Set}}$ with bijective underlying functions, and not all of $\mathbf{Part}^D_{\mathbf{Set}}$. Intuitively, this is because $\mathcal{F}_\mu$ maps natural transformations to linear maps that only exist when the functions underlying these natural transformations are bijective.
2.1 The Multiparameter Flattening Algorithm
Given a $D$-clustering functor $H$ and a distribution $\mu$ over $D$, we can use Monte Carlo integration and $\mathcal{F}_\mu$ to implement the following flattening algorithm: sample hyperparameter values from $\mu$, form the empirical partition collection of $H(X)$ at these samples, estimate the weight of each candidate cluster by the fraction of samples at which it appears, and solve the resulting binary integer program.
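A minimal sketch of this loop, assuming a `clusterer(dist, theta)` callable that returns the blocks of $H(X)(\theta)$ and reusing the hypothetical `flatten_by_bip` helper from the earlier sketch; it is illustrative rather than the implementation in the repository referenced below.

```python
from collections import defaultdict


def monte_carlo_flatten(clusterer, dist, sample_theta, n_samples=200):
    """Monte Carlo flattening: estimate cluster weights under mu, then solve the BIP.

    clusterer(dist, theta) -> iterable of frozensets, the blocks of H(X)(theta)
    sample_theta()         -> one hyperparameter value drawn from mu
    """
    counts = defaultdict(int)
    for _ in range(n_samples):
        theta = sample_theta()
        for block in clusterer(dist, theta):
            counts[frozenset(block)] += 1
    candidate_sets = list(counts)
    # Estimated weight of S: the fraction of sampled theta at which S is a cluster,
    # i.e. a Monte Carlo estimate of mu({theta : S is a block of H(X)(theta)}).
    weights = [counts[S] / n_samples for S in candidate_sets]
    return flatten_by_bip(candidate_sets, weights, points=range(dist.shape[0]))
```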
We include an example of this procedure on GitHub (https://github.com/dshieble/FunctorialHyperparameters) that builds on the HDBSCAN implementation of McInnes et al. [7]. In Table 2.1 we demonstrate that applying this procedure and solving the resulting binary integer program can perform better than choosing an optimal parameter value.
Algorithm | Adjusted Rand Score | Adjusted Rand Score on 20 Newsgroups Dataset
---|---|---
HDBSCAN With Optimal Distance Scaling Parameter | |
HDBSCAN With Flattening Over Distance Scaling Parameter | |

2.2 Multiparameter Flattening with Supervision
One of the most important components in the Multiparameter Flattening Algorithm is the choice of distribution $\mu$ over $D$. For example, if $\mu$ is the Dirac distribution at $\theta$ then the partition $H(X)(\theta)$ will be a solution to $\mathcal{F}_\mu(H(X))$.
We can leverage the importance of $\mu$ to enable our flattening procedure to learn from labeled data. Formally, given a $D$-clustering functor $H$, a metric space $(X, d_X)$, and an observed partition $O_X$ of $X$ (the labels), we can use Bayes' rule to define the probability measure $\mu(\cdot \mid O_X)$ over $D$ where
$$\mu(B \mid O_X) = \frac{\int_B P(O_X \mid \theta)\, d\mu(\theta)}{\int_D P(O_X \mid \theta)\, d\mu(\theta)}$$
if $B$ is an element of the Borel algebra of $D$ and for each $\theta$ the map $O_X \mapsto P(O_X \mid \theta)$ is a probability mass function over the finite set of all partitions of $X$. There are several possible choices of $P(O_X \mid \theta)$. Intuitively, we want $P(O_X \mid \theta)$ to be large when $O_X$ and $H(X)(\theta)$ are similar. The simplest choice would be to set $P(O_X \mid \theta)$ to be nonzero only when $O_X = H(X)(\theta)$, but a more robust strategy would be to use the Rand index, which measures how well two partitions agree on each pair of distinct points [10]. That is:
$$P(O_X \mid \theta) = \frac{RI\big(O_X, H(X)(\theta)\big)}{\sum_{P \in \mathcal{P}_X} RI\big(P, H(X)(\theta)\big)} \qquad (1)$$
where $\mathcal{P}_X$ is the set of all partitions of $X$ and:
$$RI(P_1, P_2) = \frac{\big|\{\,\{x, y\} \subseteq X : x \neq y \text{ and } P_1, P_2 \text{ agree on whether } x \text{ and } y \text{ share a block}\,\}\big|}{\binom{|X|}{2}}$$
Note that by definition $0 \le RI(P_1, P_2) \le 1$. Now suppose we have a set of tuples $(X_1, O_{X_1}), \ldots, (X_N, O_{X_N})$ where each $X_i$ is a metric space and each $O_{X_i}$ is a partition of $X_i$. Given an initial probability measure $\mu_0$ over $D$ we can use Bayesian updating to learn a posterior distribution over $D$ by iteratively setting:
$$\mu_i(B) = \frac{\int_B P(O_{X_i} \mid \theta)\, d\mu_{i-1}(\theta)}{\int_D P(O_{X_i} \mid \theta)\, d\mu_{i-1}(\theta)} \qquad (2)$$
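A rough sketch of this update on a finite grid of hyperparameter values; the discretization and the helper names are assumptions made for illustration, and the Rand index computation follows [10].

```python
from itertools import combinations

import numpy as np


def rand_index(p1, p2, points):
    """Fraction of distinct point pairs on which two partitions agree [10]."""
    block1 = {x: i for i, block in enumerate(p1) for x in block}
    block2 = {x: i for i, block in enumerate(p2) for x in block}
    agreements = sum(
        (block1[x] == block1[y]) == (block2[x] == block2[y])
        for x, y in combinations(points, 2)
    )
    return agreements / (len(points) * (len(points) - 1) / 2)


def bayesian_update(prior, thetas, clusterer, observations):
    """Iterate Equation (2) over labeled datasets on a discretized hyperparameter grid.

    prior: array of prior probabilities over the grid `thetas`.
    observations: list of (dist, observed_partition, points) tuples.
    """
    weights = np.asarray(prior, dtype=float)
    for dist, observed, points in observations:
        likelihood = np.array(
            [rand_index(clusterer(dist, theta), observed, points) for theta in thetas]
        )
        weights = weights * likelihood
        weights = weights / weights.sum()  # renormalize to a probability distribution
    return weights
```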
In Figure 1 we show the results of this procedure. Under mild conditions the functor $\mathcal{F}_{\mu_i}$ maps $H(X)$ to an optimization problem with an optimal solution that is consistent with these observations. Formally:
Proposition 2
Given suppose we have a compact region and an -clustering functor where is a subset of . Define to be Euclidean distance and assume that:
- $H$ is -identifiable: for each there exists some pair of -element subsets with
- $H$ is -smooth: for any -element and there exists a neighborhood of where for we have that
Now suppose we sample according to the uniform distribution over and for each uniformly select a -element subset from , set and set as in Equation 2.
Then for any -element there -a.s. (almost surely) exists some such that for , the partition is an optimal solution to .
Proof
We will use Doob’s theorem [5] to prove this. Suppose is the set of pairs of finite -element subsets and partitions of . Define to be the uniform distribution on the set of finite -element subsets of , and for each define where is an element of the Borel algebra of . Now since is -identifiable, and Theorem 2.4 in [8] holds. Therefore for any -element , ball centered at , and , there -a.s. (almost surely) exists some such that for we have that . Therefore, no solution to can be improved by including sets that only exist in partitions of formed from where . Since is -smooth, this implies that the partition is an optimal solution to ∎
3 Discussion and Next Steps
One issue with this procedure is that it reduces flattening to a binary integer program, which can be very slow to solve. In the case that the hyperparameters of a multiparameter hierarchical clustering algorithm form a total order, we can organize its outputs in a merge tree or dendrogram, and then solve the optimization problem with a tree traversal [3, 7], as in the sketch below.
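For intuition, a bottom-up traversal in the totally ordered case could look like the following sketch, where each merge-tree node carries the weight of its cluster (the node interface is hypothetical):

```python
def select_on_tree(node):
    """Choose non-overlapping clusters of maximum total weight on a merge tree.

    node.weight   : weight of the cluster at this node
    node.children : disjoint subclusters merged at this node
    Returns (best_total_weight, chosen_nodes) for the subtree rooted at `node`.
    """
    if not node.children:
        return node.weight, [node]
    child_total, chosen = 0.0, []
    for child in node.children:
        weight, nodes = select_on_tree(child)
        child_total += weight
        chosen.extend(nodes)
    # Keep this cluster only if it outweighs the best selection among its descendants.
    if node.weight >= child_total:
        return node.weight, [node]
    return child_total, chosen
```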
However, it is not always possible to use this strategy on hierarchical clustering algorithms that accept multiple real-valued hyperparameters (e.g. when the hyperparameter space is a general subset of $\mathbb{R}^n$ with $n > 1$). In this case it is possible that there exist candidate clusters $S_1$ and $S_2$ that are neither disjoint nor in a containment relationship. There are ways to get around this limitation, however. For example, Rolle and Scoccola [11] index hyperparameters by a curve in the hyperparameter space, rather than the entire space. In this way the indexing hyperparameter space remains a total order. In the future we plan to explore other strategies to solve the flattening optimization problem efficiently.
References
- [1] Carlsson, G., Mémoli, F.: Persistent clustering and a theorem of J. Kleinberg. arXiv preprint arXiv:0808.2241 (2008)
- [2] Chaudhuri, K., Dasgupta, S.: Rates of convergence for the cluster tree. In: Advances in Neural Information Processing Systems. pp. 343–351 (2010)
- [3] Chazal, F., Guibas, L.J., Oudot, S.Y., Skraba, P.: Persistence-based clustering in Riemannian manifolds. Journal of the ACM (JACM) 60(6), 1–38 (2013)
- [4] Culbertson, J., Guralnik, D.P., Hansen, J., Stiller, P.F.: Consistency constraints for overlapping data clustering. arXiv preprint arXiv:1608.04331 (2016)
- [5] Doob, J.L.: Application of the theory of martingales. In: Actes du Colloque International Le Calcul des Probabilités et ses applications. pp. 23–27 (1949)
- [6] Lang, K.: Newsweeder: Learning to filter netnews. Machine Learning Proceedings (1995)
- [7] McInnes, L., Healy, J.: Accelerated hierarchical density clustering. arXiv preprint arXiv:1705.07321 (2017)
- [8] Miller, J.W.: A detailed treatment of Doob’s theorem. arXiv preprint arXiv:1801.03122 (2018)
- [9] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
- [10] Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971)
- [11] Rolle, A., Scoccola, L.: Stable and consistent density-based clustering. arXiv preprint arXiv:2005.09048 (2020)
- [12] Shiebler, D.: Functorial clustering via simplicial complexes. NeurIPS Workshop on Topological Data Analysis in ML (2020)
- [13] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)