
Cluster Analysis via Random Partition Distributions

David B. Dahl
Jacob Andros
J. Brandon Carter
Abstract

Hierarchical and k-medoids clustering are deterministic clustering algorithms based on pairwise distances. Using these same pairwise distances, we propose a novel stochastic clustering method based on random partition distributions. We call our method CaviarPD, for cluster analysis via random partition distributions. CaviarPD first samples clusterings from a random partition distribution and then finds the best cluster estimate based on these samples using algorithms to minimize an expected loss. We compare CaviarPD with hierarchical and k-medoids clustering through eight case studies. Cluster estimates based on our method are competitive with those of hierarchical and k-medoids clustering. They also do not require the subjective choice of the linkage method necessary for hierarchical clustering. Furthermore, our distribution-based procedure provides an intuitive graphical representation to assess clustering uncertainty.

keywords:
random partition distributions, dendrogram, Ewens-Pitman Attraction distribution, hierarchical clustering, k-medoids clustering

1 Introduction

Cluster analysis seeks to partition data into distinct subsets, or clusters, such that observations in the same cluster are more similar than observations from different clusters. There are numerous methods for cluster analysis with applications in many fields (Dubes and Jain, 1976; Jain et al., 1999). Because clustering is an unsupervised learning technique, there is little consensus on how to validate the clustering obtained. With many available clustering techniques, subjective choices must be made which influence the outcome of a clustering procedure (Gareth et al., 2013). It is also difficult to quantify the uncertainty associated with clustering estimates.

Among the various algorithms for cluster analysis, agglomerative hierarchical clustering remains one of the most used (Hennig et al., 2016). Agglomerative hierarchical clustering is a heuristic method that takes as its input a matrix of pairwise distances among items. Beginning with every item in its own cluster, the algorithm sequentially merges the most similar clusters (based on the pairwise distances) until all observations are in a single cluster. These nested clusters can be visualized as a dendrogram. There are subjective decisions involved in hierarchical clustering; namely, the linkage type used to build the tree and the tree-cutting method used to obtain a clustering estimate.

Another commonly used class of clustering methods includes k-means and k-medoids, which both seek to minimize the within-cluster distances from each point to a point specified as its cluster center. k-means minimizes squared Euclidean distances to each cluster centroid. In contrast, k-medoids uses an actual data point for each cluster center and thus can take any dissimilarity measure as input (Kaufman and Rousseeuw, 1990). The same input of a pairwise distance matrix makes k-medoids a comparable procedure to hierarchical clustering. The most influential choice a user must make in k-medoids clustering is the value of $k$, that is, how many clusters the estimate should contain.

We propose CaviarPD: Cluster Analysis via Random Partition Distributions. Like hierarchical clustering and k-medoids, CaviarPD is based on pairwise distances, yet it provides a unique way to assess clustering uncertainty. The CaviarPD method relies on sampling from a random partition distribution that is based on pairwise distances, e.g., the Ewens-Pitman Attraction (EPA) distribution (Dahl et al., 2017). Thus, like the other two methods, the EPA distribution uses pairwise distances as input. Unlike hierarchical and k-medoids clustering, this form of cluster analysis allows us to make probability statements about clustering relationships, thereby quantifying the uncertainty of the estimate. In doing so, CaviarPD provides an alternate way to visualize clusterings by showing the pairwise probabilities that items are clustered together. An implementation of the method is provided in the caviarpd package (Dahl et al., 2021a) in R, which is available on CRAN.

To compare hierarchical and k-medoids clustering with our proposed CaviarPD method, we evaluate how well each method performs in eight different case studies where the true partition of the data is known. We compare methods using two partition loss functions, namely, Binder loss (Binder, 1978) and VI loss (Meilă, 2007; Wade and Ghahramani, 2018). Through the case studies, we show the advantages of CaviarPD over hierarchical and k-medoids clustering. All methods tend to estimate the true partition of the data well; however, the hierarchical clustering results are highly varied between linkages and choices of cutting the tree. There is little statistical reasoning to guide the choice of linkage and tree cutting. Furthermore, because CaviarPD gives the probabilities that items are clustered together, CaviarPD provides additional information about the clustering relationships beyond what hierarchical or k-medoids clustering provide.

2 Existing Distance-Based Clustering Methods

2.1 Clustering Concepts and Terminology

We introduce common concepts used in both traditional clustering methods and CaviarPD. For a more thorough description of current clustering practices, we suggest Hennig et al. (2016). A clustering $\mathbf{c} = (c_1, \ldots, c_n)$ gives labels for $n$ items in which items $i$ and $j$ are in the same cluster if and only if $c_i = c_j$. Equivalently, a partition $\pi = \{S_1, \ldots, S_q\}$ of the integers $1, \ldots, n$ is composed of mutually exclusive, non-empty, and exhaustive subsets such that $i, j \in S$ implies $c_i = c_j$. We use the terms 'clustering' and 'partition' interchangeably and note that the term 'cluster' is synonymous with 'subset'.

In cluster analysis, we seek to cluster the $n$ observations of a dataset $\mathcal{D} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ into distinct groups so that observations within a group are more similar than observations from different groups. CaviarPD, like the other two methods described, relies on distance information between observations in order to partition items into subsets. The pairwise distance between observations $\mathbf{x}_i$ and $\mathbf{x}_j$ is calculated from a specified distance function $d(\mathbf{x}_i, \mathbf{x}_j)$. One common distance function is Euclidean distance: $d(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)}$. The pairwise distances between all items can be stored in an $n \times n$ distance matrix $\mathbf{D}$, where $d_{ij} = d(\mathbf{x}_i, \mathbf{x}_j)$. The choice of distance metric is also an important consideration in the analysis, but our task here is to compare distance-based clustering methods given the user's chosen distance matrix.

2.2 Hierarchical Clustering

In this section we highlight the decisions a user must make with hierarchical clustering. For a thorough introduction to hierarchical clustering, see Rencher and Christensen (2012) or Gareth et al. (2013).

Recall that in agglomerative hierarchical clustering, each observation begins in its own cluster and the most similar clusters are sequentially merged until all data points are in a single cluster. The criterion used to define similarity between clusters is called the linkage and is computed from the pairwise distances between items in each cluster. Hierarchical clustering requires that a user decide which linkage to use and how to cut the dendrogram to obtain a partition estimate. For a more detailed explanation of the subset distance computations for each linkage type, see Nielsen (2016).

Figure 1: Clustering dendrograms for the wine dataset using complete, average, Ward and single linkage.

We demonstrate how the choice of linkage leads to highly varied clustering estimates. Figure 1 shows the resulting dendrograms from four common linkages applied to the wine recognition dataset using the function hclust in R (R Core Team, 2021). The wine dataset contains many different chemical attributes for samples of wine from three different cultivars. The distance matrix was computed with Euclidean distance using all 13 chemical attributes. The clustering structure represented in the dendrograms is very different for all four linkages. Single linkage produces long chains of connected clusters. Ward linkage seeks to create compact spherical clusters such that the dendrogram shows more distinct clusters (Ward Jr, 1963). For the wine dataset, the dendrograms for average and complete linkage show clustering structures between the long chains of single linkage and compact clusters of Ward linkage.
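As a brief illustration of this workflow, the sketch below scales the data, computes Euclidean distances, and builds one dendrogram per linkage with hclust; here wine_features is a placeholder for a data frame holding the 13 numeric attributes, and Ward linkage corresponds to hclust's "ward.D2" method when Euclidean distances are used.

```r
# Sketch of Figure 1: one dendrogram per linkage for the scaled wine attributes.
# 'wine_features' is a placeholder for the 13 numeric attributes.
D <- dist(scale(wine_features))                      # Euclidean distances
for (linkage in c("complete", "average", "ward.D2", "single")) {
  plot(hclust(D, method = linkage), main = linkage, labels = FALSE)
}
```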

After choosing the type of linkage, the structure of the dendrogram is used to cut the tree to obtain a clustering estimate. From the dendrogram using Ward linkage, there appear to be three main clusters. Inspection of the other dendrograms leads to less definite conclusions about the number of clusters in the dataset as each tree varies drastically. Using complete linkage, one could reasonably argue for a cut of the tree that gives anywhere from 2 to 7 clusters. Langfelder et al. (2008) introduce a solution to the tree cutting problem with the dynamic tree cut (DTC) algorithm. This procedure takes into account the structure of the tree to detect clusters and also allows for the tree to be cut at variable heights, providing greater flexibility in clustering estimation.

The lack of agreement between the linkages and tree-cutting estimates is prominent and concerning. There appeared to be three main clusters for Ward linkage, yet the DTC produced a default estimate of 5 clusters. The DTC applied to the complete linkage dendrogram resulted in an estimate with 3 clusters. Likewise, DTC applied to the average linkage dendrogram also resulted in 3 main clusters with a single observation in a fourth cluster by itself. Lastly, the single linkage tree resulted in a clustering estimate with all but 3 observations in a single cluster. A sketch of the default dynamic cut follows below.
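For reference, a dynamic tree cut with default settings can be obtained roughly as follows; cutreeDynamic's default hybrid method also requires the full distance matrix, and the object names continue from the sketch above.

```r
# Hedged sketch of the dynamic tree cut with default settings.
library(dynamicTreeCut)
hc <- hclust(D, method = "ward.D2")                  # or any other linkage
clusters <- cutreeDynamic(hc, distM = as.matrix(D))  # default hybrid cut
table(clusters)                                      # resulting cluster sizes
```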

2.3 K-Medoids Clustering

k-means clustering remains one of the most common and popular clustering approaches used today. The closely related k-medoids clustering, like k-means, seeks to minimize the sum of distances between points in a cluster and its cluster center. Unlike k-means, k-medoids uses actual data points (exemplars or medoids) for the cluster centers. Thus, k-medoids only uses the pairwise distance matrix as input, making it a comparable clustering approach to hierarchical clustering and CaviarPD.

Since there are $\binom{n}{k}$ possible sets of medoids, finding the exact minimizer is computationally difficult. Kaufman and Rousseeuw (1990) proposed the PAM (partitioning around medoids) algorithm to conduct a nonexhaustive but greedy search through the combination space. The PAM algorithm was later modified by Schubert and Rousseeuw (2019) to reduce computation time even further at the expense of some accuracy. Since the datasets in our case studies are all relatively small, we can easily use the original PAM algorithm for our k-medoids comparison.

In our case studies we use the silhouette method, a straightforward procedure for selecting $k$ in k-means and k-medoids clustering (Rousseeuw, 1987). The silhouette method is most reliable when the choices for $k$ are narrowed down to a reasonable range using some prior intuition about the data. As with hierarchical clustering, the k-medoids approach offers no way to assess uncertainty in the results. We use the pam function in the cluster package (Maechler et al., 2021) since it has a built-in calculation for silhouette width.
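A minimal sketch of this procedure, assuming the candidate range for $k$ has already been narrowed to 2 through 7, is:

```r
# Choose k for k-medoids by maximizing the average silhouette width.
library(cluster)
ks <- 2:7
widths <- sapply(ks, function(k) pam(D, k)$silinfo$avg.width)
fit <- pam(D, ks[which.max(widths)])                 # final k-medoids fit
```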

3 Cluster Analysis Via Random Partition Distributions

We propose the CaviarPD method, a novel approach to clustering based on random partition distributions. A reference implementation is available on CRAN in the caviarpd package (Dahl et al., 2021a). CaviarPD is based on the Ewens-Pitman Attraction (EPA) distribution, originally proposed as a prior distribution for partitions in a Bayesian framework (Dahl et al., 2017). In the following subsections, we explain how the EPA distribution can be used directly for cluster analysis along with the additional components of CaviarPD. Since the EPA distribution is a probability distribution over partitions, our simulation-based method provides as output probabilities of pairs of items being clustered together.

3.1 Ewens-Pitman Attraction Distribution

In the EPA distribution, observations are sequentially allocated to subsets of a partition with probability proportional to the attraction of the item to that subset. We use $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_n)$ to denote the order in which the $n$ observations are sequentially allocated into subsets to form a partition, so that $\sigma_t$ is the $t$th observation allocated. This sequential allocation of items yields a sequence of partitions, one for each step of the allocation. Let $\pi(\sigma_1, \ldots, \sigma_{t-1})$ denote the partition of the first $t-1$ observations at time $t-1$, and let $K_t$ be the number of subsets in the partition $\pi(\sigma_1, \ldots, \sigma_t)$. When $t = 1$, the first item is allocated to a new subset by itself with probability 1, as shown in equation (2). For $t > 1$, $\sigma_t$ can either be allocated to one of the existing $K_{t-1}$ subsets or to a new subset in the partition. The probability mass function is conveniently expressed as the product of increasingly conditional probabilities:

$$p(\pi_n \mid \alpha, \lambda, \boldsymbol{\sigma}) = \prod_{t=1}^{n} p_t\bigl(\alpha, \lambda, \pi(\sigma_1, \ldots, \sigma_{t-1})\bigr) \qquad (1)$$

$$p_t\bigl(\alpha, \lambda, \pi(\sigma_1, \ldots, \sigma_{t-1})\bigr) = \Pr\bigl(\sigma_t \in S \mid \alpha, \lambda, \pi(\sigma_1, \ldots, \sigma_{t-1})\bigr) = \begin{cases} \dfrac{t-1}{\alpha+t-1} \cdot \dfrac{\sum_{\sigma_s \in S} \lambda(\mathbf{x}_{\sigma_t}, \mathbf{x}_{\sigma_s})}{\sum_{s=1}^{t-1} \lambda(\mathbf{x}_{\sigma_t}, \mathbf{x}_{\sigma_s})} & \text{for } S \in \pi(\sigma_1, \ldots, \sigma_{t-1}) \\ \dfrac{\alpha}{\alpha+t-1} & \text{for } S \text{ a new subset} \end{cases} \qquad (2)$$

The probability that $\sigma_t$ is allocated to subset $S$ is a function of the similarity function $\lambda$ and the mass parameter $\alpha$. (Dahl et al. also include a discount parameter $\delta$ that further influences sampling from the EPA distribution; for simplicity, we set $\delta = 0$, yielding (2).) The similarity function $\lambda(\mathbf{x}_i, \mathbf{x}_j)$ gives the pairwise similarity between observations $\mathbf{x}_i$ and $\mathbf{x}_j$ for any $\mathbf{x}_i, \mathbf{x}_j \in \mathcal{D}$. Dahl et al. (2017) propose a general class of similarity functions $\lambda(\mathbf{x}_i, \mathbf{x}_j) = f(d_{ij})$, where $f$ is a non-increasing function of the pairwise distance $d_{ij}$ between observations $\mathbf{x}_i$ and $\mathbf{x}_j$; the $d_{ij}$ are the elements of the same distance matrix $\mathbf{D}$ used as input for hierarchical and k-medoids clustering. Two common similarity functions are the reciprocal similarity $f(d) = d^{-\tau}$ and the exponential similarity $f(d) = \exp(-\tau d)$. The parameter $\tau \geq 0$ is the temperature, which has the effect of accentuating or dampening the distance between items.

3.2 Sampling from the EPA Distribution

Sampling from the EPA distribution is a straightforward process (Dahl et al., 2017). Because items are allocated with probability proportional to their similarity to items in an existing subset, similar items are more likely to be clustered together in simulation. To sample a partition $\pi_n$ from the EPA distribution, begin with a permutation (some ordering) of the data and fixed $\alpha$ and $\lambda$. The first item $\sigma_1$ is allocated to a subset by itself with probability 1. The next item, $\sigma_2$, can either be assigned to the subset containing $\sigma_1$ or to a new subset by itself, with the probabilities given in equation (2). Each subsequent item $\sigma_{t+1}$ is randomly assigned to one of $S_1, S_2, \ldots, S_K, S_{K+1}$ with the corresponding probabilities, where $S_1, \ldots, S_K$ are the existing subsets of the partition and $S_{K+1}$ is a new subset. Continue to sequentially assign items until a partition of the data is obtained. The resulting partition is a single draw from the EPA distribution. Sampling can be parallelized over many cores to obtain many draws simultaneously.
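To make the allocation scheme concrete, the following is a minimal sketch of a single draw under exponential similarity with $\delta = 0$; sample_epa is our own illustrative function (not the caviarpd package's implementation), and D is a full pairwise distance matrix.

```r
# One draw from the EPA distribution (delta = 0, exponential similarity).
sample_epa <- function(D, alpha, temperature = 10) {
  n <- nrow(D)
  lambda <- exp(-temperature * D)      # pairwise similarities lambda(x_i, x_j)
  sigma <- sample.int(n)               # random allocation order (Section 3.2)
  labels <- integer(n)
  labels[sigma[1]] <- 1                # first item starts its own subset
  K <- 1
  for (t in 2:n) {
    item <- sigma[t]
    seen <- sigma[1:(t - 1)]
    total <- sum(lambda[item, seen])
    # attraction to each existing subset, then the new-subset probability
    probs <- sapply(1:K, function(k) {
      (t - 1) / (alpha + t - 1) *
        sum(lambda[item, seen[labels[seen] == k]]) / total
    })
    probs <- c(probs, alpha / (alpha + t - 1))
    k <- sample.int(K + 1, 1, prob = probs)
    if (k > K) K <- k
    labels[item] <- k
  }
  labels
}

# Many draws, e.g., for the wine distance matrix:
# draws <- t(replicate(1000, sample_epa(as.matrix(D), alpha = 0.9)))
```

Because sigma is re-randomized inside the function, every call uses a fresh allocation order, which implements the permutation averaging described next.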

The order by which the data are sampled affects the resulting probabilities of obtaining particular partitions. To remove this dependence on sampling order, randomize the order by which the items are allocated into subsets for each draw. This has the effect of making the probability of each partition independent of any particular permutation of the data.

3.3 Visualizing Pairwise Probabilities

A key advantage of CaviarPD over traditional clustering is its ability to quantify and visualize uncertainty in clustering estimates. This is done using a heat map from a summary of the samples from the EPA distribution.

Each sampled partition $\pi$ can be represented as an $n \times n$ association matrix $\gamma(\pi)$, whose $(i, j)$ element indicates whether observations $\mathbf{x}_i$ and $\mathbf{x}_j$ are in the same cluster; in short, $\gamma_{ij}(\pi) = \mathrm{I}(c_i = c_j)$. For $B$ samples from the EPA distribution, there are $B$ such matrices $\gamma(\pi_1), \ldots, \gamma(\pi_B)$. These matrices can be averaged element-wise to create a pairwise similarity matrix, which contains the estimated pairwise probabilities that items appear in the same subset for a given $\alpha$ and $\lambda$. The pairwise similarity matrix is an $n \times n$ matrix $\Psi$ whose $(i, j)$ element is the relative frequency with which items $i$ and $j$ are clustered together among the samples:

$$\Pr(c_i = c_j) \approx \Psi_{ij} = \frac{1}{B} \sum_{k=1}^{B} \gamma_{ij}(\pi_k)$$

Each element of $\Psi$ is the estimated probability that observations $\mathbf{x}_i$ and $\mathbf{x}_j$ are in the same cluster for the given $\lambda$ and $\alpha$. $\Psi$ can then be conveniently visualized in the form of a heat map.
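The following sketch computes $\Psi$ from a $B \times n$ matrix of sampled labels (for example, the draws object from the sketch in Section 3.2) and displays it as a heat map after ordering the items by an estimated partition; the ordering step anticipates Section 3.4.

```r
# Estimate the pairwise similarity matrix Psi from sampled partitions.
psm_hat <- function(draws) {
  B <- nrow(draws)
  n <- ncol(draws)
  Psi <- matrix(0, n, n)
  for (b in 1:B) {
    Psi <- Psi + outer(draws[b, ], draws[b, ], "==")   # gamma(pi_b)
  }
  Psi / B
}

# Psi <- psm_hat(draws)
# ord <- order(estimate)            # 'estimate' from Section 3.4
# image(Psi[ord, ord], col = hcl.colors(20, "YlOrRd", rev = TRUE))
```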

Figure 2: Heat map of the pairwise probabilities for the wine dataset with $\alpha = 0.90$, $\delta = 0$, and $\tau = 10$.

In the heat map, the color of a cell represents the probability that two items are clustered together while also highlighting the actual clustering estimate. In Figure 2 we can see clearly that, for $\alpha = 0.90$, there appear to be 3 distinct clusters. In this visualization, the observations are ordered to group similar items together, creating the block diagonal. The heat map becomes even more useful when the observations are ordered by an estimated partition, thereby showing the probability relationships within and between clusters. For example, clusters 1 and 2 appear much more similar than clusters 1 and 3 because there is a darker shade of yellow in the off-diagonal blocks for clusters 1 and 2, and a near-white shade of yellow in the off-diagonals for clusters 1 and 3.

3.4 Partition Estimation From Samples

The Bayesian literature provides multiple methods for obtaining a point estimate of a random partition based on samples from a posterior partition distribution (Dahl et al., 2021b). Again, our paper here is not Bayesian, but we can draw upon the Bayesian literature on random partitions to obtain a representative point estimate of a partition based on samples from the EPA distribution.

In decision theory, a loss function is specified in order to pick an optimal estimate that incurs minimal expected loss. For partitions, loss functions evaluate how distant the estimated partition is from the true partition of the data. Binder loss is a function of the number of disagreements between the estimated and true partition over all possible pairs of observations (Binder, 1978). The function is a weighted sum of the two types of disagreements: observations in different clusters when they should be in the same cluster, and observations in the same cluster when they should be in different clusters. Wade and Ghahramani (2018) demonstrate that when the weights of the two errors are equal, $w_1 = w_2$, the partition that minimizes the expected Binder loss is given by:

$$\hat{\pi}_{\text{Binder}} = \operatorname*{arg\,min}_{\pi \in \Pi} \sum_{i<j} \bigl(\gamma_{ij}(\pi) - \Psi_{ij}\bigr)^2.$$
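As a small illustration, the criterion above can be evaluated for any candidate partition directly from $\Psi$; this sketch assumes Psi is the estimated pairwise similarity matrix and partition is a vector of cluster labels.

```r
# Expected Binder loss (up to constants) of a candidate partition given Psi.
binder_criterion <- function(partition, Psi) {
  gamma <- outer(partition, partition, "==")     # association matrix gamma(pi)
  sum((gamma - Psi)[upper.tri(Psi)]^2)           # sum over pairs i < j
}
```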

Wade and Ghahramani (2018) also propose using the variation of information, introduced by Meilă (2007), as a loss function. The variation of information is rooted in information theory: it is the sum of the entropies of the two partitions minus twice the information shared between them. Minimization by enumeration over the whole partition space is infeasible, so we use the SALSO method (Dahl et al., 2021b) as implemented in the R package salso (Dahl et al., 2021c).
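In practice we do not enumerate candidates ourselves; a hedged sketch of calling salso on a $B \times n$ matrix of draws (argument names as documented in the salso package at the time of writing) is:

```r
# Point estimates that (approximately) minimize expected Binder and VI loss.
library(salso)
est_binder <- salso(draws, loss = binder())
est_vi     <- salso(draws, loss = VI())
```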

3.5 Selecting the Mass Parameter

Most parameters in the EPA distribution have a default value. However, the mass parameter $\alpha$ does not, and it is highly influential in determining the number of subsets in a partition estimate. An objective algorithm is therefore needed to select a default mass parameter for any given dataset. Ideally, the heat maps generated from estimates with this algorithm should show that items in the same cluster are clustered together with high probability, as well as clear distinction between subsets of the partition (i.e., low-probability regions between clusters). Figure 3 shows two heat maps for the pairwise probabilities of the wine dataset with the mass parameter set at 0.9 and 0.7, respectively. When the mass is 0.9, the resulting heat map shows overall higher pairwise probabilities within clusters compared to the mass set at 0.7. Likewise, there is less variance in the pairwise probabilities within clusters when the mass is 0.9. The estimated three clusters appear more distinct because of the greater within-cluster pairwise probabilities. The goal of the mass selection algorithm, in the case of the wine dataset, is to pinpoint $\alpha$ around 0.9 instead of 0.7.

Figure 3: Heat maps of the wine dataset for two different mass parameters: (a) $\alpha = 0.90$; (b) $\alpha = 0.70$.

We propose an objective mass selection procedure based on the silhouette method, which is also used to select $k$ in k-means and k-medoids clustering (Rousseeuw, 1987). The 'silhouette width' of a given observation in any clustering estimate (not just k-medoids) is a measure of compactness: it is proportional to the difference between the observation's average within-cluster distance and its average distance to points in the nearest other cluster. The average silhouette width is the mean silhouette width over all points in a dataset; maximizing the average width is intended to result in more compact and distinct clusterings. Thus, the average silhouette width is a numerical summary that corresponds well to the properties of an 'ideal' heat map outlined in the preceding paragraph. In our mass selection algorithm, the user proposes a range of cluster counts to consider in estimation. Boundary mass values are obtained for the minimum and maximum of this range, and a grid search is performed between these boundary masses. The mass whose clustering estimate yields the maximum average silhouette width is then used for $\alpha$; a simplified sketch of this search is given below.
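The sketch below conveys the idea under stated assumptions: it reuses the illustrative sample_epa() function and the salso() call from earlier sections (rather than the caviarpd package's internal implementation) and scores each mass in a user-supplied grid by average silhouette width.

```r
# Grid search over mass values, scoring each by average silhouette width.
library(cluster)   # silhouette()
library(salso)

select_mass <- function(D, mass_grid, B = 500) {
  widths <- sapply(mass_grid, function(alpha) {
    draws <- t(replicate(B, sample_epa(D, alpha)))
    est <- as.integer(salso(draws, loss = binder()))
    if (length(unique(est)) < 2) return(-Inf)    # silhouette needs >= 2 clusters
    mean(silhouette(est, dmatrix = D)[, "sil_width"])
  })
  mass_grid[which.max(widths)]
}

# alpha_hat <- select_mass(as.matrix(D), mass_grid = seq(0.5, 2, by = 0.25))
```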

4 Case Studies

To compare CaviarPD with the other distance-based clustering methods, we evaluate how well each method clusters data where the true partition of the data is known. Datasets were obtained from the UC Irvine machine learning repository (Dheeru and Karra Taniskidou, 2017). In order to evaluate the quality of the estimate we use Binder and VI loss, which both measure the similarity between two partitions. Both of these metrics are bounded to be nonnegative, where lower values are indicative of a more accurate partition. We compare average, complete, and Ward linkages to CaviarPD. Because the possibilities for tree cutting are numerous, we cut the dendrograms using the default settings for the cutreeDynamic function in the dynamicTreeCut package in R to obtain a partition estimate for each of the linkages (Langfelder et al., 2016).

For each of the datasets with numeric attributes, we centered and scaled the data and computed the Euclidean pairwise distances. For datasets with categorical attributes, we computed the Jaccard pairwise distances. Centering and scaling the data coerces the mean and standard deviation of each predictor to be 0 and 1, respectively. Though not always necessary, this helps safeguard against attributes with large variance carrying more weight in the distance computation. We use all three linkage types from hierarchical clustering to give it the best chance at competing with CaviarPD. We also choose kk in k-medoids using the silhouette method. Despite these advantages given to these other methods, CaviarPD still remains highly competitive.
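For concreteness, both distance constructions can be written in base R as follows; numeric_attributes and binary_indicators are placeholders for the numeric columns and the 0/1-coded categorical columns of a given dataset.

```r
# Scaled Euclidean distances for numeric data; Jaccard distances for 0/1 data
# (dist's "binary" method computes the Jaccard distance).
D_numeric     <- dist(scale(numeric_attributes))
D_categorical <- dist(binary_indicators, method = "binary")
```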

In CaviarPD, we fixed the temperature $\tau$ at 10 for the exponential similarity function $\lambda$. These defaults seem to be sufficient for the majority of applications of CaviarPD. We used our mass selection algorithm to obtain a default value for the mass $\alpha$ and a corresponding partition estimate. When using the mass selection algorithm, Binder loss and VI gave very similar clustering estimates; for this reason, and for the sake of simplicity, we use only the Binder-loss CaviarPD estimates in our comparisons. Again, we use the caviarpd function from the caviarpd package in R to carry out the analysis.

4.1 Wine Dataset

For the wine dataset, CaviarPD estimates the true partition on par with the main combinations of tree cutting and linkages from hierarchical clustering. However, the partition estimates from hierarchical clustering are inconsistent in their results. Without knowing the true partition of the data, we would not know that average linkage produces the best estimate.

                         Wine                House Votes
                       K  Binder  VI        K  Binder  VI
CaviarPD               3   0.09   0.68      2   0.22   0.99
Average: Default DTC   4   0.09   0.69      3   0.22   1.03
Complete: Default DTC  3   0.19   1.21     10   0.43   2.82
Ward: Default DTC      5   0.17   1.23      6   0.37   2.20
K-Medoids              3   0.12   0.68      2   0.23   1.10

Table 1: Clustering results for the wine and house votes datasets

Table 1 gives the clustering results for the wine dataset, which was used for demonstration in previous sections and contains 13 chemical attributes on wines from 3 different cultivars. The number of clusters for a particular estimate is given by $K$. The DTC default gives roughly the correct number of clusters for average and complete linkage (the fourth cluster for average linkage is a singleton), but not for Ward linkage. Overall, average linkage and CaviarPD perform the best when compared to the true partition under Binder loss; however, under VI, CaviarPD and k-medoids slightly outperform all of the DTC cuts. These differences are minimal, but the estimates for Ward and complete linkage fall severely short of those produced by CaviarPD and average linkage. In any case, it is worth noting the lack of consistency in the estimates from hierarchical clustering, and the ability of CaviarPD to cluster as accurately as the best DTC cut and k-medoids.

4.2 House Dataset

The house votes dataset contains voting records for the 1984 House of Representatives. The class attribute is party affiliation, Republican or Democrat. Having only two clusters yet nearly 500 total observations, this data provides a rigorous test for all clustering methods. Results are displayed in Table 1. CaviarPD and average linkage produce estimates with nearly identical loss metrics. However, note that CaviarPD also identifies the correct number of clusters at 2 (along with k-medoids).

Figure 4: Plot of Binder and VI loss across a grid of mass values for the house votes dataset.

The house dataset is also an excellent example for visualizing each loss function as a function of the mass $\alpha$. Recall that the true partition has only 2 clusters. In Figure 4, note how the Binder loss starts out high and immediately declines. This dip corresponds to the clustering estimate moving from 1 cluster to 2 clusters as the mass increases. Since the true partition has 2 clusters, this 2-cluster range of mass values (between about $\alpha = 0.65$ and $\alpha = 1.0$) results in the lowest Binder loss. Turning to the VI loss, we see that it penalizes less for a small underestimation of the cluster count, so the 1-cluster estimate at $\alpha = 0.5$ is still deemed somewhat 'accurate'. However, once the estimate moves past the 2-cluster range, the loss quickly increases (particularly for VI), making it clear that mass values higher than 1.0 will not yield adequate estimates. The importance of selecting a viable mass parameter is thus clear.

4.3 Flea-Beetle Dataset

For smaller datasets, it is not uncommon to estimate the true partition of the data. As an example, we take the flea-beetle dataset, which contains measurements on three different species of beetles. It contains only 74 observations total. Both CaviarPD and Ward linkage (with the DTC cut) are able to estimate the exact partition of the data. This results in the comparative loss functions, both Binder and VI, being equal to 0.

                         Flea-Beetle         Olive
                       K  Binder  VI        K  Binder  VI
CaviarPD               3   0.00   0.00      6   0.06   0.89
Average: Default DTC   2   0.19   0.83     10   0.07   1.05
Complete: Default DTC  2   0.27   1.08     14   0.12   1.69
Ward: Default DTC      3   0.00   0.00      9   0.11   1.25
K-Medoids              3   0.02   0.17      7   0.05   0.94

Table 2: Clustering results for the flea-beetle and olive datasets

4.4 Olive Dataset

With the olive dataset we demonstrate the use of the pairwise similarity matrix to detect clusters within subsets of the estimated partition. The olive dataset contains measurements on the levels of different oils in olives from nine different regions of Italy.

The CaviarPD estimate, despite having the lowest VI loss, has difficulty separating some pairs of regions, resulting in an estimate with only 6 clusters. k-medoids encounters a similar problem and only detects 7 clusters. On the other hand, the average-linkage DTC estimate concentrates far too many observations in the first 3 clusters, while also creating a 10th cluster with only a few observations. The only hierarchical clustering linkage that correctly estimates 9 clusters is Ward linkage; however, Ward linkage misclassifies too many observations to be considered as accurate as the other estimates.

Figure 5: Heat maps of the olive dataset ((a), $\alpha = 2$) and the lymphography dataset ((b), $\alpha = 1.1$).

Heat maps from CaviarPD can be used to examine subcluster structure, as demonstrated by clusters 4 and 5 in Figure 5(a). In cluster 4 of the heat map, there appear to be two higher-probability regions of items being grouped together within the cluster; the same is observed in cluster 5. In short, those clusters merged olives from two different regions. Of course, post-partition processing is also possible in hierarchical clustering; however, there these decisions have no basis in probabilities as they do in CaviarPD.

4.5 Additional Datasets

The datasets in Table 3 further demonstrate the effectiveness of CaviarPD for partition estimation. The yeast dataset, which contains 10 subsets in the true partition, is an example of a larger dataset (approximately 1,500 observations) that makes running the entire mass selection algorithm computationally intensive. Though the algorithm is still feasible for a dataset this size, it may not be practical for significantly larger sets. In such cases, we recommend building heat maps for 3 to 5 different values of $\alpha$ and selecting the mass for the plot with the most concentrated pairwise probabilities. The lymphography dataset is an example of data for which the 4 cluster distinctions are not well-represented by the attributes. This leads to poor estimates by all three methods, and a relatively uninterpretable heat map as given in Figure 5(b). The E. coli dataset contains attributes for 8 different localization sites of proteins in E. coli bacteria. Chemical composition measurements are taken on 6 different glass products in the glass dataset.

                         Yeast               Lymphography
                       K  Binder  VI        K  Binder  VI
CaviarPD              12   0.32   3.06      6   0.49   1.44
Average: Default DTC   5   0.51   3.00      4   0.41   2.25
Complete: Default DTC 19   0.24   4.83      4   0.44   2.53
Ward: Default DTC     16   0.24   4.70      4   0.41   2.56
K-Medoids              8   0.26   3.48      2   0.43   1.96

                         E. Coli             Glass Products
                       K  Binder  VI        K  Binder  VI
CaviarPD               6   0.12   1.22      6   0.35   2.92
Average: Default DTC   2   0.72   2.17      3   0.50   1.90
Complete: Default DTC  6   0.11   1.25      4   0.42   2.42
Ward: Default DTC      6   0.18   1.65      4   0.35   3.05
K-Medoids              6   0.17   1.80      6   0.34   2.93

Table 3: Clustering results for additional datasets

4.6 Discussion

We do not claim that CaviarPD vastly outperforms all hierarchical clustering and k-medoids methods, as that was not usually the case in these eight case studies. Rather, we have shown that CaviarPD performs comparably to (and in some cases, slightly better than) the other methods. The benefit of CaviarPD, however, is the ability to assess clustering uncertainty in the form of the heat map plots from the pairwise similarity matrix.

In hierarchical clustering, there is no one linkage type or cutting technique that consistently produces the best cut of the tree. We compared the best possible DTC cut of the dendrogram for each linkage, but there is no cutting rule that will consistently guide the user to that result. In k-medoids clustering, there is less variability, but still no way to express uncertainty. In contrast to these techniques, CaviarPD has a validating method to select the mass parameter $\alpha$. It also produces a pairwise similarity matrix which provides a clear and consistent representation of how the data are grouped.

4.7 Other Distributions

The EPA distribution is suited to the CaviarPD method since it is a random partition distribution taking pairwise distances as input. Most other random partition distributions in the Bayesian literature do not take pairwise distances and would therefore not be suitable for CaviarPD. There is, however, one other partition distribution that does: the distance-dependent Chinese Restaurant Process (ddCRP) (Blei and Frazier, 2011). The caviarpd function allows users to specify the ddCRP distribution rather than the default EPA distribution, so from a coding standpoint clustering with the ddCRP requires merely one additional argument. However, default parameters for clustering with the ddCRP distribution have not been investigated, and we have not found a good method for selecting an optimal mass for the ddCRP.

In addition, while the ddCRP distribution does sometimes generate estimates that are comparable with those from the EPA distribution, it does not appear to lead to effective visualizations. As an example, we take the flea-beetle dataset in Figure 6, for which CaviarPD (and Ward linkage) were able to estimate precisely the true partition. Plotting the estimate from the EPA distribution shows clear distinction between the 3 clusters: all items in the same cluster have high probabilities of being clustered together, and elements in clusters 2 and 3 are extremely unlikely to be grouped together. The plot from the ddCRP estimate shows none of this; it is virtually monochromatic (apart from the diagonal of probability 1.0 for each item being clustered with itself).

Estimates with the ddCRP distribution do not seem to be any more accurate than those from the EPA distribution. They are also unable to provide the key insights that the EPA estimates do, and they involve far more subjectivity in tuning the parameters. For these reasons, we do not currently recommend using the ddCRP for the CaviarPD method.

Figure 6: Heat maps of the flea-beetle dataset under (a) the EPA distribution and (b) the ddCRP distribution. The EPA clustering estimates the exact partition, while the ddCRP clustering estimates a near-exact partition.

5 Conclusion

Cluster analysis via random partition distributions simplifies the user's dilemma in clustering. In hierarchical clustering, the different linkages and the lack of consistent tree-cutting rules present the user with many subjective choices, and these choices lead to highly varied partition estimates. For both hierarchical and k-medoids clustering, there are no probability statements one can make regarding the clustering. In contrast, data-driven statistics help guide the user in clustering estimation with CaviarPD. The pairwise similarity matrix $\Psi$ gives a probabilistic understanding of how items are clustered together in a dataset. Algorithmic clustering methods are not based on a probability distribution, and quantifying clustering uncertainty is therefore difficult.

The central weakness of CaviarPD is the computational cost of selecting a mass for large datasets. Hierarchical clustering and k-medoids (as well as the corresponding k-means method) all have the potential to run more quickly in these cases. It is also worth noting that while the heat maps can be very insightful, they do not always visualize overall clustering structure as comprehensively as a dendrogram does in hierarchical clustering (Sander et al., 2003). In summary, CaviarPD is not the undisputed best clustering method in every instance. Rather, it provides unique advantages over traditional approaches in the majority of clustering problems.

References

  • Binder (1978) Binder, D.A., 1978. Bayesian cluster analysis. Biometrika 65, 31–38.
  • Blei and Frazier (2011) Blei, D.M., Frazier, P.I., 2011. Distance dependent Chinese restaurant processes. Journal of Machine Learning Research 12.
  • Dahl et al. (2021a) Dahl, D.B., Carter, B., Andros, J., 2021a. caviarpd: Cluster Analysis via Random Partition Distributions. R package version 0.2.9.
  • Dahl et al. (2017) Dahl, D.B., Day, R., Tsai, J.W., 2017. Random partition distribution indexed by pairwise information. Journal of the American Statistical Association 112, 721–732.
  • Dahl et al. (2021b) Dahl, D.B., Johnson, D.J., Mueller, P., 2021b. Search algorithms and loss functions for Bayesian clustering. arXiv:2105.04451.
  • Dahl et al. (2021c) Dahl, D.B., Johnson, D.J., Müller, P., 2021c. salso: Search Algorithms and Loss Functions for Bayesian Clustering. URL: https://CRAN.R-project.org/package=salso. R package version 0.2.22.
  • Dheeru and Karra Taniskidou (2017) Dheeru, D., Karra Taniskidou, E., 2017. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml.
  • Dubes and Jain (1976) Dubes, R., Jain, A.K., 1976. Clustering techniques: The user’s dilemma. Pattern Recognition 8, 247 – 260.
  • Gareth et al. (2013) Gareth, J., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning. Springer Texts in Statistics, Springer.
  • Hennig et al. (2016) Hennig, C., Meila, M., Murtagh, F., Rocci, R., 2016. Handbook of Cluster Analysis. Chapman & Hall/CRC handbooks for modern statistical methods, Taylor & Francis.
  • Jain et al. (1999) Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Computing Surveys 31, 264–323.
  • Kaufman and Rousseeuw (1990) Kaufman, L., Rousseeuw, P.J., 1990. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons Inc.
  • Langfelder et al. (2008) Langfelder, P., Zhang, B., Horvath, S., 2008. Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics 24, 719–720.
  • Langfelder et al. (2016) Langfelder, P., Zhang, B., Horvath, S., 2016. dynamicTreeCut: Methods for Detection of Clusters in Hierarchical Clustering Dendrograms. URL: https://CRAN.R-project.org/package=dynamicTreeCut. R package version 1.63-1.
  • Maechler et al. (2021) Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K., 2021. cluster: Cluster Analysis Basics and Extensions. URL: https://CRAN.R-project.org/package=cluster. R package version 2.1.2.
  • Meilă (2007) Meilă, M., 2007. Comparing clusterings: an information based distance. Journal of Multivariate Analysis 98, 873 – 895.
  • Nielsen (2016) Nielsen, F., 2016. Introduction to HPC with MPI for Data Science. Undergraduate Topics in Computer Science, Springer.
  • R Core Team (2021) R Core Team, 2021. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. URL: https://www.R-project.org/.
  • Rencher and Christensen (2012) Rencher, A.C., Christensen, W.F., 2012. Methods of Multivariate Analysis. Wiley Series in Probability and Statistics. third ed., John Wiley & Sons.
  • Rousseeuw (1987) Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20, 53–65.
  • Sander et al. (2003) Sander, J., Qin, X., Lu, Z., Niu, N., Kovarsky, A., 2003. Automatic extraction of clusters from hierarchical clustering representations. Lecture Notes in Computer Science 2637.
  • Schubert and Rousseeuw (2019) Schubert, E., Rousseeuw, P.J., 2019. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. arXiv:1810.05691.
  • Wade and Ghahramani (2018) Wade, S., Ghahramani, Z., 2018. Bayesian cluster analysis: Point estimation and credible balls (with discussion). Bayesian Anal. 13, 559–626. URL: https://doi.org/10.1214/17-BA1073.
  • Ward Jr (1963) Ward Jr, J.H., 1963. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58, 236–244.