
Unsupervised Embedding of Hierarchical Structure in Euclidean Space

Abstract

Deep embedding methods have influenced many areas of unsupervised learning. However, the best methods for learning hierarchical structure use non-Euclidean representations, whereas Euclidean geometry underlies the theory behind many hierarchical clustering algorithms. To bridge the gap between these two areas, we consider learning a non-linear embedding of data into Euclidean space as a way to improve the hierarchical clustering produced by agglomerative algorithms. To learn the embedding, we revisit using a variational autoencoder with a Gaussian mixture prior, and we show that rescaling the latent space embedding and then applying Ward’s linkage-based algorithm leads to improved results for both dendrogram purity and the Moseley-Wang cost function. Finally, we complement our empirical results with a theoretical explanation of the success of this approach. We study a synthetic model of the embedded vectors and prove that Ward’s method exactly recovers the planted hierarchical clustering with high probability.

1 Introduction

Hierarchical clustering aims to find an iterative grouping of increasingly large subsets of data, terminating when one cluster contains all of the data. This results in a tree structure, where each level determines a partition of the data. Organizing the data in such a way has many applications, including automated taxonomy construction and phylogeny reconstruction balaban2019treecluster; friedman2001elements; manning2008introduction; shang2020nettaxo; zhang2018taxogen. Motivated by these applications, we address the simultaneous goals of recovering an underlying clustering and building a hierarchical tree on top of the clusters. We imagine that we have access to many samples from each individual cluster (e.g., images of plants and animals), while we lack labels or any direct information about the hierarchical relationships. In this scenario, our objective is to correctly cluster the samples in each group and also build a dendrogram that matches the natural supergroups.

Despite significant effort, hierarchical clustering remains a challenging task, theoretically and empirically. Many objectives are NP-Hard (charikar2017approximate; charikar2018hierarchical; dasgupta2002performance; dasgupta2016cost; heller2005bayesian; MoseleyWang), and many theoretical algorithms may not be practical because they require solving a hard subproblem (e.g., sparsest cut) at each step (ackermann2014analysis; ahmadian2019bisect; charikar2017approximate; charikar2018hierarchical; dasgupta2016cost; lin2010general). The most popular hierarchical clustering algorithms are linkage-based methods king1967step; lance1967general; sneath1973numerical; ward1963hierarchical. These algorithms first partition the dataset into singleton clusters. Then, at each step, the pair of most similar clusters is merged. There are many variants depending on how the cluster similarity is computed. While these heuristics are widely used, they do not always work well in practice. At present, there is only a superficial understanding of theoretical guarantees, and the central guiding principle is that linkage-based methods recover the underlying hierarchy as long as the dataset satisfies certain strong separability assumptions (abboud2019subquadratic; balcan2019learning; Cohen-Addad; emamjomeh2018adaptive; ghoshdastidar2019foundations; grosswendt2019analysis; kpotufepruning; reynolds2006clustering; roux2018comparative; sharma2019comparative; szekely2005hierarchical; tokuda2020revisiting).

We consider embedding the data in Euclidean space and then clustering with a linkage-based method. Prior work has shown that a variational autoencoder (VAE) with Gaussian mixture model (GMM) prior leads to separated clusters via the latent space embedding dilokthanakul2016deep; nalisnick2016approximate; jiang2017variational; uugur2020variational; xie2016unsupervised. We focus on one of the best models, Variational Deep Embedding (VaDE) jiang2017variational, in the context of hierarchical clustering. Surprisingly, the VaDE embedding seems to capture hierarchical structure. Clusters that contain semantically similar data end up closer in Euclidean distance, and this pairwise distance structure occurs at multiple scales. This phenomenon motivates our empirical and theoretical investigation of enhancing hierarchical clustering quality with unsupervised embeddings into Euclidean space.

Our experiments demonstrate that applying Ward’s method after embedding with VaDE produces state-of-the-art results for two hierarchical clustering objectives: (i) dendrogram purity for classification datasets heller2005bayesian and (ii) Moseley-Wang’s cost function MoseleyWang, which is a maximization variant of Dasgupta’s cost dasgupta2016cost. At a high level, both measures reward clusterings for merging more similar data points before merging different ones. For brevity, we restrict our attention to these two measures, while our qualitative analysis suggests that the VaDE embedding would also improve other objectives.

As another contribution, we propose an extension of the VaDE embedding that rescales the means of the learned clusters (attempting to increase cluster separation). On many real datasets, this rescaling improves both clustering objectives, which is consistent with the general principle that linkage-based methods perform better with more separated clusters. The baselines for our embedding evaluation involve both a standard VAE with isotropic Gaussian prior kingma2014auto; rezende2014stochastic and principal component analysis (PCA). In general, PCA provides a worse embedding than VAE/VaDE, which is expected because it cannot learn a non-linear transformation. The VaDE embedding leads to better hierarchical clustering results than VAE, indicating that the GMM prior offers more flexibility in the latent space.

Our focus on Euclidean geometry lends itself to a synthetic distribution for hierarchical data that serves as a model of the VaDE embedding. We emphasize that we implicitly represent the hierarchy with pairwise Euclidean distances, bypassing the assumption that the algorithm has access to actual similarities. Extending our model further, we demonstrate a shifted version, where using a non-linear embedding of the data improves the clustering quality. Figure 1 depicts the original and shifted synthetic data with 8 clusters forming a 3-level hierarchy. The 3D plot of the original data in Figure 1(a) has ground truth pairwise distances in Figure 1(b). In Figure 1(c), we see the effect of shifting four of the means by a cyclic rotation, e.g., (4,2,1,0,0,0)\mapsto(0,0,0,4,2,1). This non-linear transformation increases the distances between pairs of clusters while preserving hierarchical structure. PCA in Figure 1(d) can only distinguish between two levels of the hierarchy, whereas VaDE in Figure 1(e) leads to more concentrated clusters and better identifies multiple levels.

To improve the theoretical understanding of Ward’s method, we prove that it exactly recovers both the underlying clustering and the planted hierarchy when the cluster means are hierarchically separated. The proof strategy involves bounding the distance between points sampled from spherical Gaussians and then extending recent results on Ward’s method for separated data grosswendt2019analysis; grosswendt_2020. Intuitively, we posit that the rescaled VaDE embedding produces vectors that resemble our planted hierarchical model, and this enables Ward’s method to recover both the clusters and the hierarchy after the embedding.

Related Work.   For flat clustering objectives (e.g., k-means or clustering accuracy), there is a lot of research on using embeddings to obtain better results by increasing cluster separation aljalbout2018clustering; dilokthanakul2016deep; fard2018deep; figueroa2017simple; guo2017deep; jiang2017variational; li2018discriminatively; li2018learning; min2018survey; tzoreff2018deep; xie2016unsupervised; yang2017towards. However, we are unaware of prior work on learning unsupervised embeddings to improve hierarchical clustering. Studies on hierarchical representation learning instead focus on metric learning or maximizing data likelihood, without evaluating downstream unsupervised hierarchical clustering algorithms ganea2018hyperbolic; goldberger2005hierarchical; goyal2017nonparametric; heller2005bayesian; klushyn2019learning; nalisnick2017stick; shin2019hierarchically; sonthalia2020tree; teh2008bayesian; tomczak2018vae; vasconcelos1999learning. Our aim is to embed the data into Euclidean space so that linkage-based algorithms produce high-quality clusterings.

Many researchers have recently identified the benefits of hyperbolic space for representing hierarchical structure (de2018representation; gu2018learning; linial1995geometry; mathieu2019continuous; monath2019gradient; nickel2017poincare; tifrea2018poincar). Their stated motivation is that hyperbolic space can better represent tree-like metric spaces due to its negative curvature (sarkar2011low). We offer an alternative perspective. If we commit to using a linkage-based method to recover the hierarchy, then it suffices to study Euclidean geometry. Indeed, we do not need to approximate all pairwise distances — we only need to ensure that the clustering method produces a good approximation to the true hierarchical structure. Our simple approach facilitates the use of off-the-shelf algorithms designed for Euclidean geometry, whereas other methods, such as hyperbolic embeddings, require new tools for downstream tasks.

Regarding our synthetic model, prior work has studied analogous models that sample pairwise similarities from separated Gaussians balakrishnan2011noise; Cohen-Addad; ghoshdastidar2019foundations, whereas our model generates points directly in Euclidean space. Due to this key difference, previous theoretical techniques for similarity-based planted hierarchies do not apply to our data model. We extend the prior results to analyze Ward’s method directly in Euclidean space (consistent with the embeddings).

(a) 3D plot of three-level hierarchy (eight clusters)
(b) Ground truth dendrogram and distance matrix
(c) Dataset with shifted means
(d) PCA on shifted dataset
(e) VaDE on shifted dataset
Figure 1: Top row: Synthetic data demonstrating a three-level hierarchy encoded via the pairwise Euclidean distances between eight clusters. Bottom row: Effect of shifting half of the means by cyclically rotating the vectors, e.g., (4,2,1,0,0,0)\mapsto(0,0,0,4,2,1). Comparing PCA and VaDE shows that a non-linear embedding can increase cluster separation and roughly preserve the hierarchy.

2 Hierarchical Clustering: Models and Objectives

Planted Hierarchies with Noise.   We introduce a simple way of constructing a mixture of Gaussians that also encodes hierarchical structure via Euclidean distance. The purpose of this distribution is twofold: (i) it approximately models the latent space of VaDE, and (ii) it allows us to test the effectiveness of VaDE at increasing the separation between pairs of clusters.

We call it a Binary Tree Gaussian Mixture (BTGM): a uniform mixture of spherical Gaussians with shared variance and different means (that encode the hierarchy). There are three parameters: the height h, the margin m, and the dimension d. For a BTGM with height h and margin m, we construct 2^{h} spherical Gaussian distributions with mean \mu_{i} and unit variance. The j^{th} coordinate of \mu_{i} is given by

\mu_{i}[j]=\begin{cases}(-1)^{p_{i,j}}\cdot m\cdot 2^{h-j},&\textrm{ for }p_{i,j}=(i\mbox{ mod }2^{h-i})\textrm{ and }j\leq h\\ 0,&\textrm{ if }h<j\leq d\end{cases} (1)

Intuitively, the 2^{h} means are leaves of a complete binary tree with height h. The crucial property is that the distance between any two means \mu_{i},\mu_{j} is related to their distance in the binary tree. Precisely, \left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}\geq m\cdot 2^{\mathsf{ht}(\mathsf{LCA}(\mu_{i},\mu_{j}))}, where \mathsf{ht}(v) denotes the height of node v in the tree, and \mathsf{LCA} is the least common ancestor. Figure 1 shows an example of the BTGM with m=4, h=3, and d=100. Figure 1(a) depicts the 3D tSNE visualization of the data, and Figure 1(b) shows the pairwise distance matrix (normalized to [0,1]) of each cluster as well as the ground-truth hierarchy. We also define a non-linear shifting operation on the BTGM. In experiments, we shift the nonzero entries of the first k/2 components of the BTGM to an arbitrary location, e.g., for a BTGM with k=8, h=3, and m=2, the first mean \mu_{1} shifts as (4,2,1,0,0,0)\mapsto(0,0,0,4,2,1). After the shifting, the hierarchical structure of the BTGM is less clear from the pairwise distances in the h-dimensional Euclidean subspace.
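To make the construction concrete, the following minimal sketch (in Python, using only NumPy) samples from a BTGM. The helper names are ours, and the sign pattern below (driven by the j-th most significant bit of the leaf index) is one concrete choice consistent with the tree-distance property stated above; the exact index convention in Eq. (1) may differ.

```python
import numpy as np

def btgm_means(h, m, d):
    """Means of a Binary Tree Gaussian Mixture: coordinate j (1-indexed) has
    magnitude m * 2^(h-j), with its sign set by the j-th most significant bit
    of the leaf index, so deeper LCAs correspond to smaller mean distances."""
    k = 2 ** h
    means = np.zeros((k, d))
    for i in range(k):
        for j in range(1, h + 1):
            bit = (i >> (h - j)) & 1                 # j-th most significant bit of i
            means[i, j - 1] = (-1) ** bit * m * 2 ** (h - j)
    return means

def sample_btgm(h, m, d, n_per_cluster, seed=0):
    """Draw n_per_cluster unit-variance spherical Gaussian samples per leaf."""
    rng = np.random.default_rng(seed)
    means = btgm_means(h, m, d)
    X = np.concatenate([mu + rng.standard_normal((n_per_cluster, d)) for mu in means])
    y = np.repeat(np.arange(2 ** h), n_per_cluster)
    return X, y

# e.g., the Figure 1 setting: m = 4, h = 3, d = 100
X, y = sample_btgm(h=3, m=4, d=100, n_per_cluster=2000)
```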

Objectives.   Index the dataset as \mathcal{X}=\{x_{1},x_{2},\ldots,x_{n}\}. Let \mathsf{leaves}(\mathsf{T}[i\vee j])\subseteq\mathcal{X} denote the set of leaves of the subtree in \mathsf{T} rooted at \mathsf{LCA}(x_{i},x_{j}), the least common ancestor of x_{i} and x_{j}.

Given pairwise similarities w_{ij} between x_{i} and x_{j}, Dasgupta’s cost dasgupta2016cost aims to minimize

\mathsf{cost}(\mathsf{T})=\sum_{i,j\in[n]}w_{ij}\cdot|\mathsf{leaves}(\mathsf{T}[i\vee j])|.

Moseley-Wang’s objective MoseleyWang is a dual formulation, based on maximizing

\mathsf{MW}(\mathsf{T})=n\sum_{i,j\in[n]}w_{ij}-\mathsf{cost}(\mathsf{T}). (2)

Dendrogram purity heller2005bayesian is a measure using ground truth clusters \mathcal{C}=\{C^{*}_{1},\ldots,C^{*}_{k}\}. Let \mathcal{P}^{*} be the set of pairs in \mathcal{X} residing in the same cluster: \mathcal{P}^{*}=\{(x^{i},x^{j})\mid\mathcal{C}(x^{i})=\mathcal{C}(x^{j})\}. Then,

\mathsf{DP}(\mathsf{T})=\frac{1}{|\mathcal{P}^{*}|}\sum_{\ell=1}^{k}\ \sum_{x^{i},x^{j}\in C^{*}_{\ell}}\mathsf{pur}(\mathsf{leaves}(\mathsf{T}[i\vee j]),C^{*}_{\ell})\quad\mbox{ where }\quad\mathsf{pur}(A,B)=\frac{|A\cap B|}{|A|}. (3)
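For reference, a short sketch of how dendrogram purity in Eq. (3) can be computed exactly from a SciPy linkage matrix; this is our own illustrative implementation, not the evaluation code used in the experiments. It relies on the fact that the pairs whose LCA is a given internal node are exactly the cross pairs between that node's two subtrees.

```python
import sys
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import to_tree

def dendrogram_purity(Z, labels):
    """Exact dendrogram purity (Eq. (3)) for a SciPy linkage matrix Z and
    ground-truth labels.  Aggregates per-node label counts bottom-up."""
    sys.setrecursionlimit(10000)          # the tree can be deep for large samples
    labels = np.asarray(labels)
    total_pairs = sum(c * (c - 1) / 2 for c in Counter(labels).values())
    purity_sum = 0.0

    def counts(node):
        nonlocal purity_sum
        if node.is_leaf():
            return Counter([labels[node.id]])
        left, right = counts(node.get_left()), counts(node.get_right())
        merged = left + right
        size = sum(merged.values())
        for c in left.keys() & right.keys():
            # left[c] * right[c] same-class pairs have their LCA at this node,
            # each with purity |leaves(node) intersect class c| / |leaves(node)|
            purity_sum += left[c] * right[c] * merged[c] / size
        return merged

    counts(to_tree(Z))
    return purity_sum / total_pairs
```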

Ward’s Method.   Let C_{1},\ldots,C_{k} be a k-clustering, and let \mu(C)=\frac{1}{|C|}\sum_{x\in C}x denote the cluster mean. Ward’s method ward1963hierarchical; grosswendt2019analysis merges two clusters to minimize the increase in sum of squared distances: \min_{i,j}\sum_{x\in C_{i}\cup C_{j}}\|x-\mu(C_{i}\cup C_{j})\|_{2}^{2}-\sum_{x\in C_{i}}\|x-\mu(C_{i})\|_{2}^{2}-\sum_{x\in C_{j}}\|x-\mu(C_{j})\|_{2}^{2}.
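As a small sanity check on this merge rule, the sketch below evaluates the increase in the sum of squared distances directly and compares it with the well-known closed form \frac{|C_{i}||C_{j}|}{|C_{i}|+|C_{j}|}\|\mu(C_{i})-\mu(C_{j})\|_{2}^{2}; the example data are synthetic and purely illustrative.

```python
import numpy as np

def ward_increase(Ci, Cj):
    """Increase in within-cluster sum of squared distances caused by merging
    clusters Ci and Cj (the quantity Ward's method minimizes at each step)."""
    def sse(C):
        return ((C - C.mean(axis=0)) ** 2).sum()
    return sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

# Closed form: |Ci| |Cj| / (|Ci| + |Cj|) * ||mean(Ci) - mean(Cj)||^2
rng = np.random.default_rng(0)
Ci, Cj = rng.normal(size=(40, 5)), rng.normal(loc=3.0, size=(60, 5))
closed = len(Ci) * len(Cj) / (len(Ci) + len(Cj)) * np.sum((Ci.mean(0) - Cj.mean(0)) ** 2)
assert np.isclose(ward_increase(Ci, Cj), closed)
```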

3 Embedding with VaDE

VaDE jiang2017variational considers the following generative model: p(x,z,c)=p(x\mid z)p(z\mid c)p(c). We use \mathsf{Cat}(\cdot) to denote a categorical distribution, \mathcal{N}(\cdot) is a normal distribution, and k is the predefined number of clusters. In the following, \mu_{c} and \sigma^{2}_{c} are parameters for the latent distribution, and \hat{\mu}_{z} and \hat{\sigma}^{2}_{z} are outputs of f(z;\theta), parametrized by a neural network. The generative process for the observed data is

c\sim\mathsf{Cat}(1/k),\quad z\sim\mathcal{N}(\mu_{c},\sigma^{2}_{c}\cdot I),\quad x\sim\mathcal{N}(\hat{\mu}_{z},\hat{\sigma}^{2}_{z}\cdot I). (4)

VaDE maximizes the likelihood of the data, which is lower bounded by the evidence lower bound (ELBO),

L_{\text{ELBO}}=E_{q(z,c\mid x)}[\log p(x\mid z)]-D_{\text{KL}}(q(z,c\mid x)\ \|\ p(z,c)), (5)

where q(z,c\mid x) is a variational posterior to approximate the true posterior p(z,c\mid x). We use a neural network g(x;\phi) to parametrize the model q(z\mid x). We refer to the original paper for the derivation of Eq. (5) and for the connection of L_{\text{ELBO}} to the gradient formulation of the objective jiang2017variational.
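To illustrate Eq. (4), here is a minimal NumPy sketch of ancestral sampling from the generative process. The `decoder` argument stands in for the network f(z;\theta); for simplicity we assume it returns only the mean \hat{\mu}_{z}, and a fixed `x_std` stands in for \hat{\sigma}_{z}. All names here are our own placeholders.

```python
import numpy as np

def sample_vade_generative(n, gmm_means, gmm_stds, decoder, x_std=0.1, seed=None):
    """Ancestral sampling from Eq. (4):
    c ~ Cat(1/k),  z ~ N(mu_c, sigma_c^2 I),  x ~ N(decoder(z), x_std^2 I)."""
    rng = np.random.default_rng(seed)
    k, h = gmm_means.shape
    c = rng.integers(k, size=n)                                      # c ~ Cat(1/k)
    z = gmm_means[c] + gmm_stds[c, None] * rng.standard_normal((n, h))
    x_mean = decoder(z)                                              # stands in for f(z; theta)
    return x_mean + x_std * rng.standard_normal(x_mean.shape)
```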

Improving separation by rescaling means.   The VaDE latent space is a mixture of multiple Gaussian distributions. As we will see in Section 4, the separation of the Gaussian components is crucial for Ward’s method to recover the underlying clusters. Since Ward’s method is a local heuristic, data points that lie in the middle area between two Gaussians are likely to form a larger cluster before merging into one of the true clusters. This "problematic" cluster further degrades the purity of the dendrogram. However, the objective that VaDE optimizes does not encourage a larger separation. As a result, we apply a transformation to the VaDE latent space to enlarge the separation between the Gaussians. Let x^{\prime}_{i} be the embedded value of the data point x_{i}, and define

\mathcal{C}(x^{\prime}_{i})=\mathrm{argmax}_{j}[p(x^{\prime}_{i}\mid\mu_{c=j},\sigma^{2}_{c=j}\cdot I)]

to be the cluster label assigned by the learned GMM model. Then, we compute the transformation

x^{\prime\prime}_{i}=x^{\prime}_{i}+s\cdot\mu_{i}\text{ with }\mu_{i}=\mu_{c=\mathcal{C}(x^{\prime}_{i})},

where s is a positive rescaling factor. The transformation has the following properties: (i) for points assigned to the same Gaussian by the VaDE method, the transformation preserves their pairwise distances; (ii) if the x_{i}’s assigned to the same mean \mu by \mathcal{C} are i.i.d. random variables with expectation \mu (which coincides with the intuition behind VaDE), then the transformation preserves the ratio between the distances of the expected point locations associated with different mean values.
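A minimal sketch of this rescaling transformation, assuming the VaDE latent embedding and the learned spherical GMM parameters are available as arrays; the function name and array layout are our own conventions.

```python
import numpy as np

def rescale_embedding(X_emb, gmm_means, gmm_vars, s=3.0):
    """Assign each embedded point to its most likely Gaussian component and
    shift it by s times that component's mean, i.e. x'' = x' + s * mu_{C(x')}."""
    h = X_emb.shape[1]
    # log N(x' | mu_j, var_j I) up to an additive constant, for every component j
    sq = ((X_emb[:, None, :] - gmm_means[None, :, :]) ** 2).sum(-1)       # (n, k)
    log_lik = -0.5 * sq / gmm_vars[None, :] - 0.5 * h * np.log(gmm_vars)[None, :]
    assign = log_lik.argmax(axis=1)                                       # C(x'_i)
    return X_emb + s * gmm_means[assign]
```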

Appendix C illustrates how to choose s in detail. To improve both DP and MW, the rule of thumb is to choose a sufficiently large scaling factor s. We set s=3 for all datasets. In summary, we propose the following hierarchical clustering procedure (a short code sketch of steps 2-4 follows the list):

  1. Train VaDE on the unlabeled dataset \{x_{1},\ldots,x_{n}\}.

  2. Embed each x_{i} as x_{i}^{\prime}=\hat{\mu}_{\hat{z}}, where (\hat{\mu}_{\hat{z}},\log\hat{\sigma}_{\hat{z}}^{2})=f(\hat{z};\theta) and \hat{z}=g(x_{i};\phi).

  3. Choose a rescaling factor s and apply the rescaling transformation to each x^{\prime}_{i} to get x^{\prime\prime}_{i}.

  4. Run Ward’s method on \{x^{\prime\prime}_{i}\}_{i=1}^{n} to produce a hierarchical clustering.
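The sketch below strings steps 2-4 together, assuming the trained VaDE is wrapped in an `embed` function that maps raw data to the latent values x^{\prime}_{i} and that the learned GMM prior parameters are available; it reuses rescale_embedding from the sketch above. These wrappers are assumptions for illustration, not an actual API.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def vade_ward_clustering(X, embed, gmm_means, gmm_vars, s=3.0, k=None):
    """Steps 2-4: embed, rescale, then run Ward's method with SciPy."""
    X_emb = embed(X)                                              # step 2: embed
    X_resc = rescale_embedding(X_emb, gmm_means, gmm_vars, s)     # step 3: rescale
    Z = linkage(X_resc, method="ward")                            # step 4: full dendrogram
    flat = fcluster(Z, t=k, criterion="maxclust") if k else None  # optional k-clustering
    return Z, flat
```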

(a) Dendrogram Purity
(b) Original + Ward
(c) PCA + Ward
(d) VaDE + Ward
Figure 2: (a) DP of Ward’s method on the original, PCA, and VaDE spaces, varying the mean separation. (b), (c), (d) show the 8 clusters recovered by running Ward’s method on the original data, the PCA embedding, and the VaDE latent space (Ward’s method terminates when there are 8 clusters). Data are drawn from 8 spherical Gaussians in \mathbb{R}^{100} with pairwise mean separation 8. Point clouds are generated by tSNE; red indicates points assigned to the correct cluster by Ward’s method, and blue indicates incorrect assignments.

4 Theoretical Analysis of GMM and Ward’s Method

We provide theoretical guarantees for Ward’s method on separated data, which justifies using VaDE. Here, we focus on the mixture model of spherical Gaussians. We are given a sample of size n, say X_{1},\ldots,X_{n}, where each sample point X_{i}\in\mathbb{R}^{d} is independently drawn from one of k Gaussian components according to mixing weights w_{1},\ldots,w_{k}, with the j-th component having mean \mu_{j}\in\mathbb{R}^{d} and covariance matrix \Sigma_{j}\in\mathbb{R}^{d\times d}. Specifically, for any component index j, we assume that the covariance \Sigma_{j} satisfies \Sigma_{j}=\sigma_{j}^{2}\cdot I_{d}, where \sigma_{j}>0 and I_{d}\in\mathbb{R}^{d\times d} is the identity matrix. Under this assumption, the model is fully determined by the set of parameters \{(w_{i},\mu_{i},\sigma_{i}):i\in[k]\}. The goal is to cluster the sample points by their associated components.

The theorem below presents sufficient conditions for Ward’s method to recover the underlying k-component clustering when points are drawn from a k-mixture of spherical Gaussians. To recover a k-clustering, we terminate when there are k clusters. Proofs appear in Appendix A.

Theorem 1.

There exist absolute constants c,c_{0}>0 such that the following holds. Suppose we are given a sample of size n from a k-mixture of spherical Gaussians having parameters \{(w_{i},\mu_{i},\sigma_{i}):w_{i}>0,\forall i\in[k]\} that satisfy

\forall i\not=j\in[k],\quad\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}\geq c\sqrt{\nu}(\sigma_{i}+\sigma_{j})(\sqrt{d}+\sqrt{\log n}),

where \nu:=\max_{\ell\not=t}w_{\ell}/w_{t}, and suppose we have n\geq c_{0}(\log k)/w_{\text{min}}, where w_{\text{min}}:=\min_{t}w_{t}. Then, Ward’s method recovers the underlying k-component clustering with probability at least 1-2/k.

As a recap, Ward’s method starts with singleton clusters, and repeatedly selects a pair of clusters and merges them into a new cluster, where the selection is based on the optimal value of the error sum of squares. In particular, recovering the underlying k-component clustering means that every cluster contains only points from the same Gaussian component when the algorithm produces k clusters.

Intuition and Optimality.   At a high level, the mean separation conditions in Theorem 1 guarantee that, with high probability, the inter-cluster distances will be larger than the radii of the clusters, and the sample-size lower bound ensures that we observe at least a constant number of sample points from every Gaussian, with high probability. In Appendix B, we illustrate the hardness of improving the theorem without pre-processing the data, e.g., applying an embedding method to the sample points, such as VaDE or PCA. In particular, we show that (i) without further complicating the algorithm, the separation conditions in Theorem 1 have the right dependence on d and n; (ii) for exact recovery, the sample-size lower bound in Theorem 1 is also tight up to logarithmic factors of k.

These results and arguments justify our motivation for combining VaDE with Ward’s method. To further demonstrate the power of our approach, we experimentally evaluate the clustering performance of three methods on high-dimensional, equally separated Gaussian mixtures: the vanilla Ward’s method, the PCA-Ward combination, and the VaDE-Ward combination. As shown in Figure 2(a), in terms of dendrogram purity, Ward’s method with VaDE pre-processing tops the other two for nearly every mean-separation level. In terms of the final clustering result in the (projected) space, Figures 2(b) to 2(d) clearly show the advantage of the VaDE-Ward method, as it significantly enlarges the separation among different clusters while making far fewer clustering mistakes.

Next, we show that given the exact recovery in Theorem 1 and some mild cluster separation conditions, Ward’s method recovers the target hierarchy.

For any index set I\subseteq[k] and the corresponding Gaussian components \mathcal{G}_{i},i\in I, denote by w_{I} the total weight \sum_{i\in I}w_{i}, and by C_{I} the union of sample points from these components. We say that a collection \mathcal{H} of nonempty subsets of [k] forms a hierarchy of order \tau over [k] if there exists a sequence of index sets T_{\ell} for \ell=1,\ldots,\tau such that (i) for every \ell, the set \mathcal{H}_{\ell} partitions [k]; (ii) \mathcal{H} is the union of the sets \mathcal{H}_{\ell}:=\{I_{\ell t}:t\in T_{\ell}\}; (iii) for every \ell<\tau and t\in T_{\ell}, we have I_{\ell t}\subseteq I_{(\ell+1)t^{\prime}} for some t^{\prime}\in T_{\ell+1}.

For any sample size n and Gaussian component \mathcal{G}_{i}, denote by S_{i}:=\sigma_{i}(\sqrt{d}+2\sqrt{\log n}), which upper bounds the radius of the corresponding sample cluster, with high probability. The triangle inequality then implies that for any i\not=j\in[k], the distance between a point in the sample cluster of \mathcal{G}_{i} and that of \mathcal{G}_{j} is at most D_{ij}^{+}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}+S_{i}+S_{j}, and at least D_{ij}^{-}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}-S_{i}-S_{j}. Naturally, we also define D_{I,J}^{+}:=\max_{i\in I,j\in J}D_{ij}^{+} and D_{I,J}^{-}:=\min_{i\in I,j\in J}D_{ij}^{-} for any I,J\subseteq[k].
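For concreteness, the following sketch computes S_{i}, D_{ij}^{+}, D_{ij}^{-} and their set-level versions from the mixture parameters, so the separation condition in Theorem 2 below can be checked numerically (for example on the BTGM means); the function names are our own.

```python
import numpy as np

def separation_quantities(means, sigmas, n):
    """S_i, D_ij^+ and D_ij^- as defined above, for spherical Gaussians with
    means of shape (k, d) and per-component standard deviations of shape (k,)."""
    d = means.shape[1]
    S = sigmas * (np.sqrt(d) + 2 * np.sqrt(np.log(n)))
    dist = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    return S, dist + S[:, None] + S[None, :], dist - S[:, None] - S[None, :]

def group_bounds(D_plus, D_minus, I, J):
    """D_{I,J}^+ = max over i in I, j in J of D_ij^+, and D_{I,J}^- = min of D_ij^-."""
    idx = np.ix_(list(I), list(J))
    return D_plus[idx].max(), D_minus[idx].min()
```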

Theorem 2.

There exists an absolute constant c_{1}>0 such that the following holds. Suppose we are given a sample of size n from a k-mixture of spherical Gaussians that satisfy the separation conditions and the sample size lower bound in Theorem 1, and suppose that there is an underlying hierarchy \mathcal{H} of the Gaussian components satisfying

\forall\ell\in[\tau],\ \ I\not=J\in\mathcal{H}_{\ell},\ \ D_{I,J}^{-}\geq c_{1}\sqrt{\nu_{\ell}}\max\{D_{I,I}^{+},D_{J,J}^{+}\},

where \nu_{\ell}:=\max_{I\not=J\in\mathcal{H}_{\ell}}w_{I}/w_{J}. Then, Ward’s method recovers the underlying hierarchy \mathcal{H} with probability at least 1-2/k.

The above theorem resembles Theorem 1, with separation conditions comparing the cluster radii with the inter-cluster distances at different hierarchy levels. Informally, the theorem states that if the clusters in each level of the hierarchy are well-separated, Ward’s method will be able to recover the correct clustering of that level, and hence also the entire hierarchy.

5 Experiments

We empirically determine which embedding method produces the best results for dendrogram purity and Moseley-Wang’s objective. We use Ward’s method for clustering after embedding (Appendix C has other algorithms). We provide results for our synthetic model (BTGM) and three real datasets.

Set-up and methods.   We compare VaDE against PCA (linear baseline) and a standard VAE (non-linear baseline). All the embedding methods are trained on the whole dataset. Then we run Ward’s method on the sub-sampled embedded data and the original data to get the dendrograms. In the experiments, we only use 25 classes from CIFAR-100 that fall into one of the five superclasses “aquatic animals”, “flowers”, “fruits and vegetables”, “large natural outdoor scenes” and “vehicle1”. We call it CIFAR-25, and we also use CIFAR-25-s to refer to the images labeled with the five superclasses. We average the results over 100 runs, each with 10k random samples from each dataset (except 20newsgroup where we use all 18k samples). Details in Appendix C.

Synthetic Data (BTGM).   We vary the depth and margin of the BTGM to evaluate different regimes of hierarchy depth and cluster separation. After generating the k different means of the BTGM, we shift all the non-zero entries of the first k/2 means to an arbitrary location. The goal is to show that this simple non-linear transformation is challenging for linear embedding methods (i.e., PCA), while preserving the hierarchical structure. To generate the dataset of k clusters with shifted means, we sample 2000 points from each Gaussian distribution of the BTGM of height h such that 2^{h}=k. Then, we set the embedding dimension of PCA, VAE, and VaDE to be h.

Dendrogram Purity (DP).   We compute the DP of various methods using (3). We construct ground truth clusters either from class labels of real datasets or from the leaf clusters of the BTGM model.

Moseley-Wang’s Objective (MW).   To evaluate the MW objective in (2), we need pairwise similarities w_{ij}. For classification datasets, we let w_{ij}=\mathbbm{1}_{\mathcal{C}(x^{i})=\mathcal{C}(x^{j})} be the indicator that points i and j are in the same class. When there are hierarchical labels, we instead define w_{ij}=2^{l-1}\cdot\mathbbm{1}_{\mathcal{C}^{l}(x^{i})=\mathcal{C}^{l}(x^{j})}, where \mathcal{C}^{l}(x^{i}) denotes the label of point x^{i} at the l^{th} level of the hierarchical tree, from top to bottom. We measure the ratio achieved by various methods to the optimal MW value (opt), which we compute as follows. For k flat clusters, the optimal MW value is obtained by first merging points in the same clusters, then building a binary tree on top of the k clusters. When there are hierarchical labels, we compare against the optimal tree with ground truth levels.
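As an illustration of this evaluation, the sketch below computes MW(T) for the flat-label case w_{ij}=\mathbbm{1}_{\mathcal{C}(x^{i})=\mathcal{C}(x^{j})} from a SciPy linkage matrix (unordered pairs; the convention cancels in the ratio). The normalization uses the fact that, with unit weights, every binary tree over a class of size n_{c} has the same total LCA-subtree size (n_{c}^{3}-n_{c})/3, so the "merge classes first" tree described above attains the optimum. This is our own illustrative code and covers only the flat-label case.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import to_tree

def moseley_wang_flat(Z, labels):
    """MW objective (Eq. (2)) with indicator similarities, from a linkage matrix.
    Pairs whose LCA is a node contribute (n - |leaves(node)|) each."""
    labels = np.asarray(labels)
    n = len(labels)
    total = 0.0

    def counts(node):   # per-node label counts, accumulating MW contributions
        # (as in the DP sketch, raise sys.setrecursionlimit for very deep trees)
        nonlocal total
        if node.is_leaf():
            return Counter([labels[node.id]])
        left, right = counts(node.get_left()), counts(node.get_right())
        merged = left + right
        size = sum(merged.values())
        cross = sum(left[c] * right[c] for c in left.keys() & right.keys())
        total += cross * (n - size)
        return merged

    counts(to_tree(Z))
    return total

def mw_ratio_flat(Z, labels):
    """Ratio of MW(T) to the flat-label optimum sum_c [n*C(n_c,2) - (n_c^3 - n_c)/3]."""
    labels = np.asarray(labels)
    n = len(labels)
    opt = sum(c * (c - 1) / 2 * n - (c ** 3 - c) / 3 for c in Counter(labels).values())
    return moseley_wang_flat(Z, labels) / opt
```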

Figure 3: In the original space, Ward’s method makes several mistakes: three orchids merge into the aquatic animals cluster before the flowers cluster. Using the VaDE embedding, Ward’s method produces a near optimal dendrogram on this subset of data for both lower and higher level structures.

5.1 Results

Results on Synthetic Data.   Table 1 presents the results of the synthetic experiments. We set the margin m of the BTGM to be 8 while varying the depth, so the closest distance between the means of any two Gaussian distributions is 8. We also set the embedding dimension to be h, which is the intrinsic dimensionality of the pre-shifted means. As Table 1 shows, the combination of embedding with VaDE and then clustering with Ward’s method improves the dendrogram purity and Moseley-Wang’s objective in all settings. VaDE fully utilizes the h-dimensional latent space. On the other hand, embedding with PCA or VAE is actually detrimental to the two objectives compared to the original data. We conjecture that PCA fails to resolve the non-linear shifting of the underlying means. In general, using a linear embedding method will likely lead to a drop in both DP and MW for this distribution. Similarly, VAE does not provide sufficient separation between clusters.

Results on Real Data.   Tables 2 and 3 report the values of the DP and MW objectives on real datasets. We observe that for MNIST, using the VaDE embedding improves DP and MW significantly compared to other embeddings, and applying the transformation further leads to an increased value of both objectives. Compared to PCA and VAE, we see that VaDE also has lower variance in the clustering results over randomly sampled subsets from the MNIST dataset. The results on the Reuters dataset show that VaDE plus the rescaling transformation is also suitable for text data, where it leads to the best performance. In the CIFAR-25 experiments, we see only a modest improvement with VaDE when evaluating with all 25 classes. It seems that a large number of real clusters may be a limitation of the VaDE embedding, and this deserves additional attention. On the other hand, when restricting to the five superclass labels, the improvement on CIFAR-25-s increases. We hypothesize that VaDE generates a latent space that encodes more of the high level structure for the CIFAR-25 images. This hypothesis can be partially verified by the qualitative evaluation in Figure 3. A trend consistent with the synthetic experiments is that PCA and VAE are not always helpful embeddings for hierarchical clustering with Ward’s method. Overall, VaDE discovers an appropriate embedding into Euclidean space based on its GMM prior that improves the clustering results.

Table 1: Dendrogram Purity (a) and Moseley-Wang’s objective (b) on the BTGM synthetic dataset. We evaluate Ward’s method using different embeddings, while varying parameters: k denotes the number of pure clusters, h denotes both the dimension of the PCA and VaDE latent spaces and the tree height.

(a) Dendrogram Purity
            k = 8 (h = 3)    k = 16 (h = 4)   k = 32 (h = 5)
Original    0.798 ± .020     0.789 ± .017     0.775 ± .016
PCA         0.518 ± .002     0.404 ± .003     0.299 ± .003
VAE         0.619 ± .030     0.755 ± .019     0.667 ± .012
VaDE        0.943 ± .011     0.945 ± .001     0.869 ± .013

(b) Moseley-Wang’s objective
            k = 8 (h = 3)    k = 16 (h = 4)   k = 32 (h = 5)
Original    0.960 ± 0.004    0.980 ± 0.002    0.989 ± 0.001
PCA         0.909 ± 0.001    0.913 ± 0.004    0.936 ± 0.001
VAE         0.791 ± 0.028    0.898 ± 0.017    0.969 ± 0.011
VaDE        0.990 ± 0.020    0.994 ± 0.001    0.991 ± 0.001
Table 2: DP on three real datasets for different embedding methods.
                Reuters        MNIST          CIFAR-25-s     CIFAR-25
Original        0.535 ± .023   0.478 ± .025   0.280 ± .009   0.110 ± .005
PCA             0.587 ± .021   0.472 ± .025   0.275 ± .007   0.108 ± .004
VAE             0.468 ± .027   0.495 ± .030   0.238 ± .004   0.082 ± .003
VaDE            0.650 ± .015   0.803 ± .021   0.287 ± .009   0.120 ± .003
VaDE + Trans.   0.672 ± .009   0.886 ± .015   0.300 ± .008   0.128 ± .004
Table 3: MW (ratio vs. opt) on three real datasets for different embedding methods.
                Reuters        MNIST          CIFAR-25
Original        0.654 ± .025   0.663 ± .014   0.438 ± .005
PCA             0.679 ± .021   0.650 ± .037   0.435 ± .007
VAE             0.613 ± .027   0.630 ± .034   0.393 ± .004
VaDE            0.745 ± .025   0.915 ± .017   0.455 ± .003
VaDE + Trans.   0.768 ± .012   0.955 ± .005   0.472 ± .002

5.2 Discussion

Theoretical insights.   In Theorems 1 and 2, we showed that Ward’s method recovers the hierarchical clustering when the Gaussian means are separated in Euclidean distance. Since VaDE uses a GMM prior, this is a natural metric for the embedded space. As the rescaling increases the separation, we suppose that the embedded vectors behave as if they are approximately sampled from a distribution satisfying the conditions of the theorems. Thus, our theoretical results justify the success of the embedding in practice, where we see that VaDE+Transformation performs the best on many datasets.

Recovering planted and real hierarchies.   The synthetic results in Table 1 exhibit the success of the VaDE+Transformation method at recovering the underlying clusters and the planted hierarchy. This trend holds while varying the number of clusters and the depth of the hierarchy. Qualitatively, we find that VaDE can “denoise” the sampled points when the margin does not suffice to guarantee non-overlapping clusters. This enables Ward’s method to achieve a higher value of DP and MW than running the algorithm on the original data. If the hierarchy is not encoded in Euclidean space, e.g., MNIST or CIFAR-25, then VaDE provides good embeddings from which Ward’s method can recover the hierarchy. In many cases, PCA and VAE are unable to find a good embedding.

Rescaling helps.   The rescaling transformation, although simple, leads to consistently better results for both DP and MW on multiple real datasets. In practice, the data distributions are correlated and overlapping, without clear cluster structure in high-dimensional space. The VaDE embedding leads to more concentrated clusters, and the rescaling transformation enlarges the separation enough for Ward’s method to recover what is learned by VaDE. In the synthetic experiments, the latent distributions learned by VaDE are already well-separated, so the rescaling is not necessary in this case. In the future, it may be beneficial to explore transformations that depend on the whole hierarchy.

6 Conclusion

We investigated the effectiveness of embedding data in Euclidean space to improve unsupervised hierarchical clustering methods. We saw that rescaling the VaDE latent space embedding, and then applying Ward’s linkage-based method, leads to the best results for dendrogram purity and the Moseley-Wang objective on many datasets. To better understand the VaDE embedding, we proposed a planted hierarchical model using Gaussians with separated means. Theoretically, we proved that Ward’s method recovers both the clustering and the hierarchy with high probability in this model.

While we focus on embedding using variational autoencoders, an open direction for further research could involve embedding hierarchical structure using other representation learning methods agarwal2007generalized; borg2005modern; caron2018deep; chierchia2019ultrametric; mishne2019diffusion; nina2019decoder; salakhutdinov2012one; tsai2017learning; yadav2019supervised. Another direction is to better understand the differences between learned embeddings, comparison-based methods, and ordinal relations emamjomeh2018adaptive; ghoshdastidar2019foundations; kazemi2018comparison; jamieson2011low. Finally, it would be worth exploring the use of approximate clustering methods to improve the running time abboud2019subquadratic; kobren2017hierarchical; rashtchian2017clustering.

Appendix A Proof of the Theorems

The next two lemmas provide tight tail bounds for binomial distributions and spherical Gaussians.

Lemma 1 (Concentration of binomial random variables).

[chung2006complex] Let X be a binomial random variable with mean \mu. Then for any \delta>0,

\Pr(X\geq(1+\delta)\mu)\leq e^{-(\max\{\delta^{2},\delta\})\mu/3},

and for any \delta\in(0,1),

\Pr(X\leq(1-\delta)\mu)\leq e^{-\delta^{2}\mu/2}.
Lemma 2 (Concentration of spherical Gaussians).

[Gaussian2016notes] For a random variable X distributed according to a spherical Gaussian in d dimensions (mean \mu and variance \sigma^{2} in each direction), and any \delta\in(0,1),

\Pr\left(\left\lVert X-\mu\right\rVert_{2}^{2}\leq\sigma^{2}d\left(1+2\sqrt{\frac{\log(1/\delta)}{d}}+\frac{2\log(1/\delta)}{d}\right)\right)\geq 1-\delta.
Lemma 3 (Optimal clustering through Ward’s method).

[grosswendt2019analysis, grosswendt_2020] Let P\subseteq\mathbb{R}^{d} be a collection of points with underlying clustering C_{1}^{*},\ldots,C_{k}^{*}, such that the corresponding cluster mean values \mu_{1}^{*},\ldots,\mu_{k}^{*} satisfy \left\lVert\mu_{i}^{*}-\mu_{j}^{*}\right\rVert_{2}>(2+2\sqrt{2\nu})\max_{x\in C_{i}^{*}}\left\lVert x-\mu_{i}^{*}\right\rVert_{2} for all i\not=j, where \nu:=\max_{\ell,t\in[k]}\frac{|C_{\ell}^{*}|}{|C_{t}^{*}|} is the maximum ratio between cluster sizes. Then Ward’s method recovers the underlying clustering.

Now we are ready to prove Theorem 1.

Proof of Theorem 1.

First, for each component in the mixture, we determine the associated number of sample points. This is equivalent to drawing a size-n sample from the weighting distribution that assigns w_{i}>0 to each i\in[k]. The number of times each symbol i\in[k] appears in the sample, which we refer to as the (sample) multiplicity of the symbol, follows a binomial distribution with mean nw_{i}. By Lemma 1, the probability that an arbitrary multiplicity is within a factor of 2 from its mean value is at least 1-2e^{-nw_{i}/8}. Hence, for n\geq 16(\log k)/w_{i}, the union bound yields that this deviation bound holds for all symbols with probability at least 1-k\cdot k^{-2}=1-k^{-1}.

Second, given the symbol multiplicities, we draw sample points from each component correspondingly. Consider the i-th component with mean \mu_{i}. Then, by Lemma 2, every sample point from this component deviates (under Euclidean distance) from \mu_{i} by at most S_{i}:=\sigma_{i}(\sqrt{d}+2\sqrt{\log n}), with probability at least 1-n^{-2}. Again, the union bound yields that this deviation bound holds for all components and all sample points with probability at least 1-n\cdot n^{-2}=1-n^{-1}.

Third, by the above reasoning and n>k, with probability at least 1-2/k and for all i\in[k],

  1. the sample size associated with the i-th component is between nw_{i}/2 and 2nw_{i};

  2. every sample point from the i-th component deviates from \mu_{i} by at most S_{i}.

Therefore, if further \left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}\geq(5+12\max_{\ell,t\in[k]}\sqrt{w_{\ell}/w_{t}})(S_{i}+S_{j}) for all i\not=j, the sample points will form well-separated clusters according to the underlying components. In addition, the first assumption ensures that the maximum ratio between cluster sizes is at most 4\max_{\ell,t\in[k]}w_{\ell}/w_{t}, and the second assumption implies that for any i\not=j\in[k],

  1. for the i-th cluster, the distance between its mean and any point in it is at most 2S_{i};

  2. the empirical means of the i-th and j-th clusters are separated by at least

     16(S_{i}+S_{j})>\left(2+2\sqrt{2\cdot 4}\cdot\max_{\ell,t\in[k]}\sqrt{\frac{w_{\ell}}{w_{t}}}\right)\max\{2S_{i},2S_{j}\}.

Finally, applying Lemma 3 completes the proof of the theorem. ∎

Next, we move to the proof of Theorem 2. Let us first recap what the theorem states.

Theorem 2.

There exists an absolute constant c_{1}>0 such that the following holds. Suppose we are given a sample of size n from a k-mixture of spherical Gaussians that satisfy the separation conditions and the sample size lower bound in Theorem 1, and suppose that there is an underlying hierarchy \mathcal{H} of the Gaussian components satisfying

\forall\ell\in[\tau],\quad I\not=J\in\mathcal{H}_{\ell},\quad D_{I,J}^{-}\geq c_{1}\sqrt{\nu_{\ell}}\ (D_{I,I}^{+}+D_{J,J}^{+}),

where \nu_{\ell}:=\max_{I\not=J\in\mathcal{H}_{\ell}}w_{I}/w_{J}. Then, Ward’s method recovers the underlying hierarchy \mathcal{H} with probability at least 1-2/k.

Proof of Theorem 2.

By the proof of Theorem 1, for any sample size n and Gaussian component \mathcal{G}_{i}, the quantity S_{i}:=\sigma_{i}(\sqrt{d}+2\sqrt{\log n}) upper bounds the radius of the corresponding sample cluster, with high probability. Here, the term “radius” is defined with respect to the actual center \mu_{i}.

The triangle inequality then implies that for any i\neq j\in[k], the distance between a point in the sample cluster of \mathcal{G}_{i} and one in that of \mathcal{G}_{j} is at most D_{ij}^{+}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}+S_{i}+S_{j}, i.e., the sum of the distance between the two centers and the radii of the two clusters, and at least D_{ij}^{-}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}-S_{i}-S_{j}.

More generally, for any non-singleton clusters represented by I,J\subset[k], the distance between two points x\in C_{I} and y\in C_{J} is at most D_{I,J}^{+}:=\max_{i\in I,j\in J}D_{ij}^{+}, and at least D_{I,J}^{-}:=\min_{i\in I,j\in J}D_{ij}^{-}. In particular, for I=J, the quantity D_{I,I}^{+} upper bounds the diameter of the cluster C_{I}.

Furthermore, by the proof of Theorem 1, given the exact recovery of the k Gaussian components, the ratio between the sizes of any two clusters (at any level), say C_{I} and C_{J}, differs from the ratio between their weights, w_{I}/w_{J}, by a factor of at most 4. Hence, the quantity \nu_{\ell} essentially characterizes the maximum ratio between the sample cluster sizes at level \ell of the hierarchy.

With the above reasoning, Lemma 3 naturally completes the proof of the theorem. Intuitively, this shows that if the clusters in each level of the hierarchy are well-separated, Ward’s method will be able to recover the correct clustering of that level, and hence also the entire hierarchy. ∎

Appendix B Optimality of Theorem 1

Below, we illustrate the hardness of improving the theorem without preprocessing the data, e.g., applying an embedding method to the sample points, such as VaDE or PCA.

Separation conditions.

As with many other commonly used linkage methods, the behavior and performance of Ward’s method depend only on the inter-point distances between the observations. For optimal-clustering recovery through Ward’s method, a natural assumption is that the distances between points in the same cluster are at most those between disjoint clusters. Consider a simple example where we draw 3 sample points from a uniform mixture of two standard d-dimensional spherical Gaussians \mathcal{G}_{0} and \mathcal{G}_{1} with means separated by a distance \Delta. Then, with probability 3/8, we will draw two sample points from \mathcal{G}_{0}, say X,X_{0}, and one sample point from \mathcal{G}_{1}, say Y. Conditioned on this, the reasoning in Section 2.8 of [blum2020foundations] implies that, with high probability,

D_{\mathcal{G}_{0}}:=\left\lVert X-X_{0}\right\rVert_{2}^{2}\approx 2d\pm\mathcal{O}(\sqrt{d})\quad\text{ and }\quad D_{\mathcal{G}_{1}}:=\left\lVert X-Y\right\rVert_{2}^{2}\approx\Delta^{2}+2d\pm\mathcal{O}(\sqrt{d}).

Hence, the assumption that points are closer inside the clusters translates to

D_{\mathcal{G}_{0}}<D_{\mathcal{G}_{1}}\implies 2d\pm\mathcal{O}(\sqrt{d})<\Delta^{2}+2d\pm\mathcal{O}(\sqrt{d})\implies\Delta=\Omega(d^{1/4}).

Furthermore, if one requires the inner-cluster distance to be slightly smaller than the inter-cluster distance with high probability, say (1+\epsilon)D_{\mathcal{G}_{0}}<D_{\mathcal{G}_{1}}, where \epsilon>0 is a small constant,

(1+\epsilon)(2d\pm\mathcal{O}(\sqrt{d}))<\Delta^{2}+2d\pm\mathcal{O}(\sqrt{d})\implies\Delta=\Omega(\sqrt{\epsilon d}).

Therefore, the mean difference between the two Gaussians should exhibit a linear dependence on \sqrt{d}.

In addition, for a standard d-dimensional spherical Gaussian, Theorem 2.9 in [blum2020foundations] states that for any \beta\leq\sqrt{d}, all but at most 3\exp(-\Omega(\beta^{2})) of its probability mass lies within \sqrt{d}-\beta\leq|x|\leq\sqrt{d}+\beta. Hence, if we increase the sample size from 3 to n, correctly separating the points by their associated components requires an extra \mathcal{O}(\sqrt{\log n}) term in the mean difference.

Sample-size lower bound.

In terms of the sample size n, Theorem 1 requires n\geq c_{0}(\log k)/w_{\text{min}}. For Ward’s method to recover the underlying k-component clustering with high probability, we must observe at least one sample point from every component of the Gaussian mixture. Clearly, this means that the expected size of the smallest cluster should be at least 1, implying n\geq 1/w_{\text{min}}. If we have \nu=\mathcal{O}(1), this reduces to a weighted version of the coupon collector problem, which calls for a sample size on the order of k\log k=\Theta((\log k)/w_{\text{min}}) by standard results [feller1968introduction].

Appendix C More Experimental Details

Network Architecture.   The architectures of the encoder and decoder for both VaDE and VAE are fully-connected layers with sizes d-2000-500-500-h and h-500-500-2000-d, respectively, where h is the dimension of the latent space \mathbb{R}^{h} and d is the input dimension of the data in \mathbb{R}^{d}. These settings follow the original VaDE paper [jiang2017variational]. We use the Adam optimizer on mini-batches of size 800 with learning rate 5e-4 on all three real datasets. The choices of hyper-parameters such as h and k are discussed in the Sensitivity Analysis.

Computing Dendrogram Purity.   For every set of N points sampled from the dataset (e.g., we use N=2048 for MNIST and CIFAR-25), we compute the exact dendrogram purity using formula (3). We then report the mean and standard deviation of the results over 100 repetitions of the experiment.

Sensitivity Analysis.   We check the sensitivity of VAE and VaDE to the hyper-parameters h and k in terms of DP and MW. We also check the sensitivity to the rescaling factor s applied after the VaDE embedding. The details are presented in Figures 4, 5, and 6.

(a) Sensitivity of VaDE to the hidden dimension h while we fix k=10
(b) Sensitivity of VaDE to the number of latent clusters k while we fix h=10
Figure 4: Evaluating the sensitivity of VaDE on MNIST to its hyper-parameters h, which denotes the dimensionality of the latent space embedding, and k, which denotes the number of GMM components used for the latent space. We plot the Moseley-Wang objective (MW) and dendrogram purity (DP) as we vary either h or k. Overall, we see that as long as h is between 5 and 12, the results are stable. Similarly, k just needs to be at least 8. Therefore, while these parameters are important, the VaDE model is not very sensitive. Note that in our experiments for real data we fixed h=10, and for synthetic data we set h to be the dimensionality of the BTGM (either 3, 4, or 5). For real/synthetic data we set k to be the number of ground truth clusters (i.e., k=10 for MNIST).
(a) Sensitivity of VAE to the hidden dimension h with the same architecture as VaDE
(b) Sensitivity of VaDE to the rescaling factor s while we fix h=10 and k=10
Figure 5: On the left, we evaluate the sensitivity of VAE on MNIST to its hyper-parameter h, which denotes the dimensionality of the latent space embedding. On the right, we evaluate the sensitivity of VaDE + Trans. to its hyper-parameter s. We plot the Moseley-Wang objective (MW) and dendrogram purity (DP) as we vary either h or s. We see that the performance of VAE is not sensitive to its hidden dimension size h. But overall, the VAE does not perform well with respect to DP and MW. As for the rescaling experiment, we see that the effect of s becomes stable when s\geq 3. Note that in the real dataset experiments we set s=3.

Comparing Linkage-Based Methods.   Table 4 shows the DP and MW results for several linkage-based variations, depending on the function used to determine cluster similarity. Overall, we see that Ward’s method performs the best on average. However, in some cases, the other methods do comparably.

Table 4: Dendrogram Purity and Moseley-Wang’s objective using different HAC linkage methods on the VaDE + Transformed latent space.
                      MNIST                               CIFAR-100
           Dendro. Purity     M-W objective     Dendro. Purity     M-W objective
Single     0.829 ± 0.038      0.932 ± 0.023     0.260 ± 0.005      0.419 ± 0.006
Complete   0.854 ± 0.040      0.937 ± 0.024     0.293 ± 0.008      0.451 ± 0.016
Centroid   0.859 ± 0.035      0.949 ± 0.019     0.284 ± 0.008      0.453 ± 0.009
Average    0.866 ± 0.035      0.948 ± 0.023     0.295 ± 0.005      0.462 ± 0.006
Ward       0.870 ± 0.031      0.948 ± 0.025     0.300 ± 0.008      0.465 ± 0.011
(a) k=10
(b) k=15
(c) k=20
Figure 6: tSNE visualization of the VaDE latent space with different numbers of latent clusters k. We can see that the latent space of VaDE becomes more separated as the number of clusters k increases. When k=20, many of the original 10 digit clusters further divide into 2 smaller clusters, which in total form the 20 clusters learned by VaDE.

Dataset Details.   We conduct the experiments on our synthetic data from the BTGM as well as real data benchmarks: Reuters, MNIST [MNIST], and CIFAR-100 [CIFAR100]. In the experiments, we only use the 25 classes from CIFAR-100 that fall into one of the five superclasses "aquatic animals", "flowers", "fruits and vegetables", "large natural outdoor scenes", and "vehicle1" (the information on these superclasses can be found at https://www.cs.toronto.edu/~kriz/cifar.html); we call it CIFAR-25. Below we also provide further details on the Digits dataset of handwritten digits (https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). More details about the number of clusters and other statistics can be found in Table 5.

Rescaling explained.   The motivation for the rescaling transformation can be explained using Figure 7. Panel (a) shows the tSNE visualization of the VaDE embedding before and after the rescaling transformation. As we can see, each cluster becomes more separated. In (b), we visualize the inter-class distance matrices before and after the transformation. We use blue and black bounding boxes to highlight a few superclusters revealed in the VaDE embedding. These high-level structures remain unchanged after the rescaling, which is a consequence of property (ii) in Section 3.

Table 5: Dataset Details.
                            Digits    Reuters    MNIST    CIFAR-25-s    CIFAR-25
# ground truth clusters     10        4          10       5             25
# data for training         6560      680K       60K      15K           15K
# data for evaluation       1000      1000       2048     2048          2048
Dimensionality              64        2000       768      3072          3072
Size of hidden dimension    5         10         10       20            20

C.1 Digits Dataset

Few previous papers report dendrogram purity results for the Digits dataset. First, we experiment using the same settings and the same test set construction for Digits as in [monath2019gradient]. In this case, VaDE+Ward achieves a DP of 0.941, substantially improving the previous state-of-the-art results for dendrogram purity from existing methods: the DP of [monath2019gradient] is 0.675, the DP of PERCH [kobren2017hierarchical] is 0.614, and the DP of BIRCH [BIRCH1996] is 0.544. In other words, our approach of VaDE+Ward leads to a 26.5 point increase in DP for this simple dataset.

Surprisingly, we find that the test data they used is quite easy, as the DP and MW numbers are substantially better than those obtained using a random subset as the test set. Therefore, we follow the same evaluation settings as in Section 5 and randomly sample 100 subsets of 1000 data points for evaluation. The results are shown in Table 6. The contrast between the previous test data (easier) and random sampling (harder) further supports our experimental set-up in the main paper.

Table 6: Dendrogram Purity and Moseley-Wang’s objective using different embedding methods on the Digits dataset. Overall, VaDE + Trans. only provides a minor improvement in DP. Using Ward on the original space already achieves decent results.
                Dendro. Purity    M-W objective
Original        0.820 ± 0.031     0.846 ± 0.029
PCA             0.823 ± 0.028     0.844 ± 0.027
VaDE            0.824 ± 0.026     0.840 ± 0.024
VaDE + Trans.   0.832 ± 0.029     0.845 ± 0.024
Figure 7: (a) tSNE visualization of the embedded data for MNIST before and after the rescaling transformation; (b) the inter-class distance matrix before and after the rescaling transformation.