
Unsupervised Embedding of Hierarchical Structure in Euclidean Space

Abstract

Deep embedding methods have influenced many areas of unsupervised learning. However, the best methods for learning hierarchical structure use non-Euclidean representations, whereas Euclidean geometry underlies the theory behind many hierarchical clustering algorithms. To bridge the gap between these two areas, we consider learning a non-linear embedding of data into Euclidean space as a way to improve the hierarchical clustering produced by agglomerative algorithms. To learn the embedding, we revisit using a variational autoencoder with a Gaussian mixture prior, and we show that rescaling the latent space embedding and then applying Ward’s linkage-based algorithm leads to improved results for both dendrogram purity and the Moseley-Wang cost function. Finally, we complement our empirical results with a theoretical explanation of the success of this approach. We study a synthetic model of the embedded vectors and prove that Ward’s method exactly recovers the planted hierarchical clustering with high probability.

1 Introduction

Hierarchical clustering aims to find an iterative grouping of increasingly large subsets of data, terminating when one cluster contains all of the data. This results in a tree structure, where each level determines a partition of the data. Organizing the data in such a way has many applications, including automated taxonomy construction and phylogeny reconstruction balaban2019treecluster; friedman2001elements; manning2008introduction; shang2020nettaxo; zhang2018taxogen. Motivated by these applications, we address the simultaneous goals of recovering an underlying clustering and building a hierarchical tree on top of the clusters. We imagine that we have access to many samples from each individual cluster (e.g., images of plants and animals), while we lack labels or any direct information about the hierarchical relationships. In this scenario, our objective is to correctly cluster the samples in each group and also build a dendrogram that matches the natural supergroups.

Despite significant effort, hierarchical clustering remains a challenging task, theoretically and empirically. Many objectives are NP-Hard (charikar2017approximate; charikar2018hierarchical; dasgupta2002performance; dasgupta2016cost; heller2005bayesian; MoseleyWang), and many theoretical algorithms may not be practical because they require solving a hard subproblem (e.g., sparsest cut) at each step (ackermann2014analysis; ahmadian2019bisect; charikar2017approximate; charikar2018hierarchical; dasgupta2016cost; lin2010general). The most popular hierarchical clustering algorithms are linkage-based methods king1967step; lance1967general; sneath1973numerical; ward1963hierarchical. These algorithms first partition the dataset into singleton clusters. Then, at each step, the pair of most similar clusters is merged. There are many variants depending on how the cluster similarity is computed. While these heuristics are widely used, they do not always work well in practice. At present, there is only a superficial understanding of theoretical guarantees, and the central guiding principle is that linkage-based methods recover the underlying hierarchy as long as the dataset satisfies certain strong separability assumptions (abboud2019subquadratic; balcan2019learning; Cohen-Addad; emamjomeh2018adaptive; ghoshdastidar2019foundations; grosswendt2019analysis; kpotufepruning; reynolds2006clustering; roux2018comparative; sharma2019comparative; szekely2005hierarchical; tokuda2020revisiting).

We consider embedding the data in Euclidean space and then clustering with a linkage-based method. Prior work has shown that a variational autoencoder (VAE) with Gaussian mixture model (GMM) prior leads to separated clusters via the latent space embedding dilokthanakul2016deep; nalisnick2016approximate; jiang2017variational; uugur2020variational; xie2016unsupervised. We focus on one of the best models, Variational Deep Embedding (VaDE) jiang2017variational, in the context of hierarchical clustering. Surprisingly, the VaDE embedding seems to capture hierarchical structure. Clusters that contain semantically similar data end up closer in Euclidean distance, and this pairwise distance structure occurs at multiple scales. This phenomenon motivates our empirical and theoretical investigation of enhancing hierarchical clustering quality with unsupervised embeddings into Euclidean space.

Our experiments demonstrate that applying Ward’s method after embedding with VaDE produces state-of-the-art results for two hierarchical clustering objectives: (i) dendrogram purity for classification datasets heller2005bayesian and (ii) Moseley-Wang’s cost function MoseleyWang, which is a maximization variant of Dasgupta’s cost dasgupta2016cost. At a high level, both measures reward clusterings for merging more similar data points before merging different ones. For brevity, we restrict our attention to these two measures, while our qualitative analysis suggests that the VaDE embedding would also improve other objectives.

As another contribution, we propose an extension of the VaDE embedding that rescales the means of the learned clusters (attempting to increase cluster separation). On many real datasets, this rescaling improves both clustering objectives, which is consistent with the general principle that linkage-based methods perform better with more separated clusters. The baselines for our embedding evaluation involve both a standard VAE with isotropic Gaussian prior kingma2014auto; rezende2014stochastic and principal component analysis (PCA). In general, PCA provides a worse embedding than VAE/VaDE, which is expected because it cannot learn a non-linear transformation. The VaDE embedding leads to better hierarchical clustering results than VAE, indicating that the GMM prior offers more flexibility in the latent space.

Our focus on Euclidean geometry lends itself to a synthetic distribution for hierarchical data that serves as a model of the VaDE embedding. We emphasize that we implicitly represent the hierarchy with pairwise Euclidean distances, bypassing the assumption that the algorithm has access to actual similarities. Extending our model further, we demonstrate a shifted version, where using a non-linear embedding of the data improves the clustering quality. Figure 1 depicts the original and shifted synthetic data with 8 clusters forming a 3-level hierarchy. The 3D plot of the original data in Figure 1(a) has ground truth pairwise distances in Figure 1(b). In Figure 1(c), we see the effect of shifting four of the means by a cyclic rotation, e.g., (4,2,1,0,0,0)\mapsto(0,0,0,4,2,1). This non-linear transformation increases the distances between pairs of clusters while preserving hierarchical structure. PCA in Figure 1(d) can only distinguish between two levels of the hierarchy, whereas VaDE in Figure 1(e) leads to more concentrated clusters and better identifies multiple levels.

To improve the theoretical understanding of Ward’s method, we prove that it exactly recovers both the underlying clustering and the planted hierarchy when the cluster means are hierarchically separated. The proof strategy involves bounding the distance between points sampled from spherical Gaussians and then extending recent results on Ward’s method for separated data grosswendt2019analysis; grosswendt_2020. Intuitively, we posit that the rescaled VaDE embedding produces vectors that resemble our planted hierarchical model, and this enables Ward’s method to recover both the clusters and the hierarchy after the embedding.

Related Work.   For flat clustering objectives (e.g., k-means or clustering accuracy), there is a lot of research on using embeddings to obtain better results by increasing cluster separation aljalbout2018clustering; dilokthanakul2016deep; fard2018deep; figueroa2017simple; guo2017deep; jiang2017variational; li2018discriminatively; li2018learning; min2018survey; tzoreff2018deep; xie2016unsupervised; yang2017towards. However, we are unaware of prior work on learning unsupervised embeddings to improve hierarchical clustering. Studies on hierarchical representation learning instead focus on metric learning or maximizing data likelihood, without evaluating downstream unsupervised hierarchical clustering algorithms ganea2018hyperbolic; goldberger2005hierarchical; goyal2017nonparametric; heller2005bayesian; klushyn2019learning; nalisnick2017stick; shin2019hierarchically; sonthalia2020tree; teh2008bayesian; tomczak2018vae; vasconcelos1999learning. Our aim is to embed the data into Euclidean space so that linkage-based algorithms produce high-quality clusterings.

Many researchers have recently identified the benefits of hyperbolic space for representing hierarchical structure (de2018representation; gu2018learning; linial1995geometry; mathieu2019continuous; monath2019gradient; nickel2017poincare; tifrea2018poincar). Their stated motivation is that hyperbolic space can better represent tree-like metric spaces due to its negative curvature (sarkar2011low). We offer an alternative perspective. If we commit to using a linkage-based method to recover the hierarchy, then it suffices to study Euclidean geometry. Indeed, we do not need to approximate all pairwise distances — we only need to ensure that the clustering method produces a good approximation to the true hierarchical structure. Our simple approach facilitates the use of off-the-shelf algorithms designed for Euclidean geometry, whereas other methods, such as hyperbolic embeddings, require new tools for downstream tasks.

Regarding our synthetic model, prior work has studied analogous models that sample pairwise similarities from separated Gaussians balakrishnan2011noise; Cohen-Addad; ghoshdastidar2019foundations, whereas our model generates points directly in Euclidean space. Due to this key difference, previous theoretical techniques for similarity-based planted hierarchies do not apply to our data model. We extend the prior results to analyze Ward’s method directly in Euclidean space (consistent with the embeddings).

(a) 3D plot of three-level hierarchy (eight clusters)
(b) Ground truth dendrogram and distance matrix
(c) Dataset with shifted means
(d) PCA on shifted dataset
(e) VaDE on shifted dataset
Figure 1: Top row: Synthetic data demonstrating a three-level hierarchy encoded via the pairwise Euclidean distances between eight clusters. Bottom row: Effect of shifting half of the means by cyclically rotating the vectors, e.g., (4,2,1,0,0,0)\mapsto(0,0,0,4,2,1). Comparing PCA and VaDE shows that a non-linear embedding can increase cluster separation and roughly preserve the hierarchy.

2 Hierarchical Clustering: Models and Objectives

Planted Hierarchies with Noise.   We introduce a simple way of constructing a mixture of Gaussians that also encodes hierarchical structure via Euclidean distance. The purpose of this distribution is twofold: (i) it approximately models the latent space of VaDE, and (ii) it allows us to test the effectiveness of VaDE at increasing the separation between pairs of clusters.

We call it a Binary Tree Gaussian Mixture (BTGM): a uniform mixture of spherical Gaussians with shared variance and different means (that encode the hierarchy). There are three parameters: the height h, the margin m, and the dimension d. For a BTGM with height h and margin m, we construct 2^{h} spherical Gaussian distributions with mean \mu_{i} and unit variance. The j^{th} coordinate of \mu_{i} is given by

\mu_{i}[j]=\begin{cases}(-1)^{p_{i,j}}\cdot m\cdot 2^{h-j},&\textrm{ for }p_{i,j}=(i\mbox{ mod }2^{h-i})\textrm{ and }j\leq h\\ 0,&\textrm{ if }h<j\leq d\end{cases} (1)

Intuitively, the 2^{h} means are leaves of a complete binary tree with height h. The crucial property is that the distance between any two means \mu_{i},\mu_{j} is related to their distance in the binary tree. Precisely, \left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}\geq m\cdot 2^{\mathsf{ht}(\mathsf{LCA}(\mu_{i},\mu_{j}))}, where \mathsf{ht}(v) denotes the height of node v in the tree, and \mathsf{LCA} is the least common ancestor. Figure 1 shows an example of the BTGM with m=4, h=3, and d=100. Figure 1(a) depicts the 3D tSNE visualization of the data, and Figure 1(b) shows the pairwise distance matrix (normalized to [0,1]) of each cluster as well as the ground-truth hierarchy. We also define a non-linear shifting operation on the BTGM. In experiments, we shift the nonzero entries of the first k/2 components of the BTGM to an arbitrary location, e.g., for a BTGM with k=8, h=3, and m=2, the first mean \mu_{1} shifts as (4,2,1,0,0,0)\mapsto(0,0,0,4,2,1). After the shifting, the hierarchical structure of the BTGM is less clear from the pairwise distances in the h-dimensional Euclidean subspace.
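To make the construction concrete, the following minimal sketch (in Python, using only NumPy) samples from a BTGM. The helper names are ours, and the sign pattern below (driven by the j-th most significant bit of the leaf index) is one concrete choice consistent with the tree-distance property stated above; the exact index convention in Eq. (1) may differ.

```python
import numpy as np

def btgm_means(h, m, d):
    """Means of a Binary Tree Gaussian Mixture: coordinate j (1-indexed) has
    magnitude m * 2^(h-j), with its sign set by the j-th most significant bit
    of the leaf index, so deeper LCAs correspond to smaller mean distances."""
    k = 2 ** h
    means = np.zeros((k, d))
    for i in range(k):
        for j in range(1, h + 1):
            bit = (i >> (h - j)) & 1                 # j-th most significant bit of i
            means[i, j - 1] = (-1) ** bit * m * 2 ** (h - j)
    return means

def sample_btgm(h, m, d, n_per_cluster, seed=0):
    """Draw n_per_cluster unit-variance spherical Gaussian samples per leaf."""
    rng = np.random.default_rng(seed)
    means = btgm_means(h, m, d)
    X = np.concatenate([mu + rng.standard_normal((n_per_cluster, d)) for mu in means])
    y = np.repeat(np.arange(2 ** h), n_per_cluster)
    return X, y

# e.g., the Figure 1 setting: m = 4, h = 3, d = 100
X, y = sample_btgm(h=3, m=4, d=100, n_per_cluster=2000)
```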

Objectives.   Index the dataset as \mathcal{X}=\{x_{1},x_{2},\ldots,x_{n}\}. Let \mathsf{leaves}(\mathsf{T}[i\vee j])\subseteq\mathcal{X} denote the set of leaves of the subtree in \mathsf{T} rooted at \mathsf{LCA}(x_{i},x_{j}), the least common ancestor of x_{i} and x_{j}.

Given pairwise similarities w_{ij} between x_{i} and x_{j}, Dasgupta’s cost dasgupta2016cost aims to minimize

\mathsf{cost}(\mathsf{T})=\sum_{i,j\in[n]}w_{ij}\cdot|\mathsf{leaves}(\mathsf{T}[i\vee j])|.

Moseley-Wang’s objective MoseleyWang is a dual formulation, based on maximizing

\mathsf{MW}(\mathsf{T})=n\sum_{i,j\in[n]}w_{ij}-\mathsf{cost}(\mathsf{T}). (2)

Dendrogram purity heller2005bayesian is a measure using ground truth clusters \mathcal{C}=\{C^{*}_{1},\ldots,C^{*}_{k}\}. Let \mathcal{P}^{*} be the set of pairs in \mathcal{X} residing in the same cluster: \mathcal{P}^{*}=\{(x^{i},x^{j})\mid\mathcal{C}(x^{i})=\mathcal{C}(x^{j})\}. Then,

\mathsf{DP}(\mathsf{T})=\frac{1}{|\mathcal{P}^{*}|}\sum_{\ell=1}^{k}\ \sum_{x^{i},x^{j}\in C^{*}_{\ell}}\mathsf{pur}(\mathsf{leaves}(\mathsf{T}[i\vee j]),C^{*}_{\ell})\quad\mbox{ where }\quad\mathsf{pur}(A,B)=\frac{|A\cap B|}{|A|}. (3)
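For reference, a short sketch of how dendrogram purity in Eq. (3) can be computed exactly from a SciPy linkage matrix; this is our own illustrative implementation, not the evaluation code used in the experiments. It relies on the fact that the pairs whose LCA is a given internal node are exactly the cross pairs between that node's two subtrees.

```python
import sys
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import to_tree

def dendrogram_purity(Z, labels):
    """Exact dendrogram purity (Eq. (3)) for a SciPy linkage matrix Z and
    ground-truth labels.  Aggregates per-node label counts bottom-up."""
    sys.setrecursionlimit(10000)          # the tree can be deep for large samples
    labels = np.asarray(labels)
    total_pairs = sum(c * (c - 1) / 2 for c in Counter(labels).values())
    purity_sum = 0.0

    def counts(node):
        nonlocal purity_sum
        if node.is_leaf():
            return Counter([labels[node.id]])
        left, right = counts(node.get_left()), counts(node.get_right())
        merged = left + right
        size = sum(merged.values())
        for c in left.keys() & right.keys():
            # left[c] * right[c] same-class pairs have their LCA at this node,
            # each with purity |leaves(node) intersect class c| / |leaves(node)|
            purity_sum += left[c] * right[c] * merged[c] / size
        return merged

    counts(to_tree(Z))
    return purity_sum / total_pairs
```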

Ward’s Method.   Let C_{1},\ldots,C_{k} be a k-clustering, and let \mu(C)=\frac{1}{|C|}\sum_{x\in C}x denote the cluster mean. Ward’s method ward1963hierarchical; grosswendt2019analysis merges two clusters to minimize the increase in sum of squared distances: \min_{i,j}\sum_{x\in C_{i}\cup C_{j}}\|x-\mu(C_{i}\cup C_{j})\|_{2}^{2}-\sum_{x\in C_{i}}\|x-\mu(C_{i})\|_{2}^{2}-\sum_{x\in C_{j}}\|x-\mu(C_{j})\|_{2}^{2}.
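As a small sanity check on this merge rule, the sketch below evaluates the increase in the sum of squared distances directly and compares it with the well-known closed form \frac{|C_{i}||C_{j}|}{|C_{i}|+|C_{j}|}\|\mu(C_{i})-\mu(C_{j})\|_{2}^{2}; the example data are synthetic and purely illustrative.

```python
import numpy as np

def ward_increase(Ci, Cj):
    """Increase in within-cluster sum of squared distances caused by merging
    clusters Ci and Cj (the quantity Ward's method minimizes at each step)."""
    def sse(C):
        return ((C - C.mean(axis=0)) ** 2).sum()
    return sse(np.vstack([Ci, Cj])) - sse(Ci) - sse(Cj)

# Closed form: |Ci| |Cj| / (|Ci| + |Cj|) * ||mean(Ci) - mean(Cj)||^2
rng = np.random.default_rng(0)
Ci, Cj = rng.normal(size=(40, 5)), rng.normal(loc=3.0, size=(60, 5))
closed = len(Ci) * len(Cj) / (len(Ci) + len(Cj)) * np.sum((Ci.mean(0) - Cj.mean(0)) ** 2)
assert np.isclose(ward_increase(Ci, Cj), closed)
```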

3 Embedding with VaDE

VaDE jiang2017variational considers the following generative model: p(x,z,c)=p(x\mid z)p(z\mid c)p(c). We use \mathsf{Cat}(\cdot) to denote a categorical distribution, \mathcal{N}(\cdot) is a normal distribution, and k is the predefined number of clusters. In the following, \mu_{c} and \sigma^{2}_{c} are parameters for the latent distribution, and \hat{\mu}_{z} and \hat{\sigma}^{2}_{z} are outputs of f(z;\theta), parametrized by a neural network. The generative process for the observed data is

c\sim\mathsf{Cat}(1/k),\quad z\sim\mathcal{N}(\mu_{c},\sigma^{2}_{c}\cdot I),\quad x\sim\mathcal{N}(\hat{\mu}_{z},\hat{\sigma}^{2}_{z}\cdot I). (4)

VaDE maximizes the likelihood of the data, which is lower bounded by the evidence lower bound (ELBO),

L_{\text{ELBO}}=E_{q(z,c\mid x)}[\log p(x\mid z)]-D_{\text{KL}}(q(z,c\mid x)\ \|\ p(z,c)), (5)

where q(z,c\mid x) is a variational posterior to approximate the true posterior p(z,c\mid x). We use a neural network g(x;\phi) to parametrize the model q(z\mid x). We refer to the original paper for the derivation of Eq. (5) and for the connection of L_{\text{ELBO}} to the gradient formulation of the objective jiang2017variational.
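To illustrate Eq. (4), here is a minimal NumPy sketch of ancestral sampling from the generative process. The `decoder` argument stands in for the network f(z;\theta); for simplicity we assume it returns only the mean \hat{\mu}_{z}, and a fixed `x_std` stands in for \hat{\sigma}_{z}. All names here are our own placeholders.

```python
import numpy as np

def sample_vade_generative(n, gmm_means, gmm_stds, decoder, x_std=0.1, seed=None):
    """Ancestral sampling from Eq. (4):
    c ~ Cat(1/k),  z ~ N(mu_c, sigma_c^2 I),  x ~ N(decoder(z), x_std^2 I)."""
    rng = np.random.default_rng(seed)
    k, h = gmm_means.shape
    c = rng.integers(k, size=n)                                      # c ~ Cat(1/k)
    z = gmm_means[c] + gmm_stds[c, None] * rng.standard_normal((n, h))
    x_mean = decoder(z)                                              # stands in for f(z; theta)
    return x_mean + x_std * rng.standard_normal(x_mean.shape)
```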

Improving separation by rescaling means.   The VaDE latent space is a mixture of multiple Gaussian distributions. As we will see in Section 4, the separation of the Gaussian components is crucial for Ward’s method to recover the underlying clusters. Since Ward’s method is a local heuristic, data points that lie in the middle area between two Gaussians are likely to form a larger cluster before merging into one of the true clusters. This "problematic" cluster further degrades the purity of the dendrogram. However, the objective that VaDE optimizes does not encourage a larger separation. As a result, we apply a transformation to the VaDE latent space to enlarge the separation between the Gaussians. Let x^{\prime}_{i} be the embedded value of the data point x_{i}, and define

\mathcal{C}(x^{\prime}_{i})=\mathrm{argmax}_{j}[p(x^{\prime}_{i}\mid\mu_{c=j},\sigma^{2}_{c=j}\cdot I)]

to be the cluster label assigned by the learned GMM model. Then, we compute the transformation

x^{\prime\prime}_{i}=x^{\prime}_{i}+s\cdot\mu_{i}\text{ with }\mu_{i}=\mu_{c=\mathcal{C}(x^{\prime}_{i})},

where s is a positive rescaling factor. The transformation has the following properties: (i) for points assigned to the same Gaussian by the VaDE method, the transformation preserves their pairwise distances; (ii) if the x_{i}’s assigned to the same mean \mu by \mathcal{C} are i.i.d. random variables with expectation \mu (which coincides with the intuition behind VaDE), then the transformation preserves the ratio between the distances of the expected point locations associated with different mean values.
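A minimal sketch of this rescaling transformation, assuming the VaDE latent embedding and the learned spherical GMM parameters are available as arrays; the function name and array layout are our own conventions.

```python
import numpy as np

def rescale_embedding(X_emb, gmm_means, gmm_vars, s=3.0):
    """Assign each embedded point to its most likely Gaussian component and
    shift it by s times that component's mean, i.e. x'' = x' + s * mu_{C(x')}."""
    h = X_emb.shape[1]
    # log N(x' | mu_j, var_j I) up to an additive constant, for every component j
    sq = ((X_emb[:, None, :] - gmm_means[None, :, :]) ** 2).sum(-1)       # (n, k)
    log_lik = -0.5 * sq / gmm_vars[None, :] - 0.5 * h * np.log(gmm_vars)[None, :]
    assign = log_lik.argmax(axis=1)                                       # C(x'_i)
    return X_emb + s * gmm_means[assign]
```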

Appendix C illustrates how to choose s in detail. To improve both DP and MW, the rule of thumb is to choose a sufficiently large scaling factor s. We set s=3 for all datasets. In summary, we propose the following hierarchical clustering procedure (a short code sketch of steps 2-4 follows the list):

  1. Train VaDE on the unlabeled dataset \{x_{1},\ldots,x_{n}\}.

  2. Embed each x_{i} as x_{i}^{\prime}=\hat{\mu}_{\hat{z}}, where (\hat{\mu}_{\hat{z}},\log\hat{\sigma}_{\hat{z}}^{2})=f(\hat{z};\theta) and \hat{z}=g(x_{i};\phi).

  3. Choose a rescaling factor s and apply the rescaling transformation to each x^{\prime}_{i} to get x^{\prime\prime}_{i}.

  4. Run Ward’s method on \{x^{\prime\prime}_{i}\}_{i=1}^{n} to produce a hierarchical clustering.
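The sketch below strings steps 2-4 together, assuming the trained VaDE is wrapped in an `embed` function that maps raw data to the latent values x^{\prime}_{i} and that the learned GMM prior parameters are available; it reuses rescale_embedding from the sketch above. These wrappers are assumptions for illustration, not an actual API.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def vade_ward_clustering(X, embed, gmm_means, gmm_vars, s=3.0, k=None):
    """Steps 2-4: embed, rescale, then run Ward's method with SciPy."""
    X_emb = embed(X)                                              # step 2: embed
    X_resc = rescale_embedding(X_emb, gmm_means, gmm_vars, s)     # step 3: rescale
    Z = linkage(X_resc, method="ward")                            # step 4: full dendrogram
    flat = fcluster(Z, t=k, criterion="maxclust") if k else None  # optional k-clustering
    return Z, flat
```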

(a) Dendrogram Purity
(b) Original + Ward
(c) PCA + Ward
(d) VaDE + Ward
Figure 2: (a) DP of Ward’s method on the original, PCA, and VaDE spaces, varying the mean separation. (b), (c), (d) show the 8 clusters recovered by running Ward’s method on the original data, the PCA embedding, and the VaDE latent space (Ward’s method terminates when there are 8 clusters). Data are drawn from 8 spherical Gaussians in \mathbb{R}^{100} with pairwise mean separation 8. Point clouds are generated by tSNE; red indicates points assigned to the correct cluster by Ward’s method, and blue indicates incorrect assignments.

4 Theoretical Analysis of GMM and Ward’s Method

We provide theoretical guarantees for Ward’s method on separated data, which justifies using VaDE. Here, we focus on the mixture model of spherical Gaussians. We are given a sample of size n, say X_{1},\ldots,X_{n}, where each sample point X_{i}\in\mathbb{R}^{d} is independently drawn from one of k Gaussian components according to mixing weights w_{1},\ldots,w_{k}, with the j-th component having mean \mu_{j}\in\mathbb{R}^{d} and covariance matrix \Sigma_{j}\in\mathbb{R}^{d\times d}. Specifically, for any component index j, we assume that the covariance \Sigma_{j} satisfies \Sigma_{j}=\sigma_{j}^{2}\cdot I_{d}, where \sigma_{j}>0 and I_{d}\in\mathbb{R}^{d\times d} is the identity matrix. Under this assumption, the model is fully determined by the set of parameters \{(w_{i},\mu_{i},\sigma_{i}):i\in[k]\}. The goal is to cluster the sample points by their associated components.

The theorem below presents sufficient conditions for Ward’s method to recover the underlying k-component clustering when points are drawn from a k-mixture of spherical Gaussians. To recover a k-clustering, we terminate when there are k clusters. Proofs appear in Appendix A.

Theorem 1.

There exist absolute constants c,c_{0}>0 such that the following holds. Suppose we are given a sample of size n from a k-mixture of spherical Gaussians having parameters \{(w_{i},\mu_{i},\sigma_{i}):w_{i}>0,\forall i\in[k]\} that satisfy

\forall i\not=j\in[k],\quad\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}\geq c\sqrt{\nu}(\sigma_{i}+\sigma_{j})(\sqrt{d}+\sqrt{\log n}),

where \nu:=\max_{\ell\not=t}w_{\ell}/w_{t}, and suppose we have n\geq c_{0}(\log k)/w_{\text{min}}, where w_{\text{min}}:=\min_{t}w_{t}. Then, Ward’s method recovers the underlying k-component clustering with probability at least 1-2/k.

As a recap, Ward’s method starts with singleton clusters, and repeatedly selects a pair of clusters and merges them into a new cluster, where the selection is based on the optimal value of the error sum of squares. In particular, recovering the underlying k-component clustering means that every cluster contains only points from the same Gaussian component when the algorithm produces k clusters.

Intuition and Optimality.   At a high level, the mean separation conditions in Theorem 1 guarantee that, with high probability, the inter-cluster distances will be larger than the radii of the clusters, and the sample-size lower bound ensures that we observe at least a constant number of sample points from every Gaussian, with high probability. In Appendix B, we illustrate the hardness of improving the theorem without pre-processing the data, e.g., applying an embedding method to the sample points, such as VaDE or PCA. In particular, we show that (i) without further complicating the algorithm, the separation conditions in Theorem 1 have the right dependence on d and n; (ii) for exact recovery, the sample-size lower bound in Theorem 1 is also tight up to logarithmic factors of k.

These results and arguments justify our motivation for combining VaDE with Ward’s method. To further demonstrate the power of our approach, we experimentally evaluate the clustering performance of three methods on high-dimensional, equally separated Gaussian mixtures: the vanilla Ward’s method, the PCA-Ward combination, and the VaDE-Ward combination. As shown in Figure 2(a), in terms of dendrogram purity, Ward’s method with VaDE pre-processing tops the other two for nearly every mean-separation level. In terms of the final clustering result in the (projected) space, Figures 2(b) to 2(d) clearly show the advantage of the VaDE-Ward method, as it significantly enlarges the separation among different clusters while making far fewer clustering mistakes.

Next, we show that given the exact recovery in Theorem 1 and some mild cluster separation conditions, Ward’s method recovers the target hierarchy.

For any index set I\subseteq[k] and the corresponding Gaussian components \mathcal{G}_{i},i\in I, denote by w_{I} the total weight \sum_{i\in I}w_{i}, and by C_{I} the union of sample points from these components. We say that a collection \mathcal{H} of nonempty subsets of [k] forms a hierarchy of order \tau over [k] if there exists a sequence of index sets T_{\ell} for \ell=1,\ldots,\tau such that (i) for every \ell, the set \mathcal{H}_{\ell} partitions [k]; (ii) \mathcal{H} is the union of the sets \mathcal{H}_{\ell}:=\{I_{\ell t}:t\in T_{\ell}\}; (iii) for every \ell<\tau and t\in T_{\ell}, we have I_{\ell t}\subseteq I_{(\ell+1)t^{\prime}} for some t^{\prime}\in T_{\ell+1}.

For any sample size n and Gaussian component \mathcal{G}_{i}, denote by S_{i}:=\sigma_{i}(\sqrt{d}+2\sqrt{\log n}), which upper bounds the radius of the corresponding sample cluster, with high probability. The triangle inequality then implies that for any i\not=j\in[k], the distance between a point in the sample cluster of \mathcal{G}_{i} and that of \mathcal{G}_{j} is at most D_{ij}^{+}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}+S_{i}+S_{j}, and at least D_{ij}^{-}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}-S_{i}-S_{j}. Naturally, we also define D_{I,J}^{+}:=\max_{i\in I,j\in J}D_{ij}^{+} and D_{I,J}^{-}:=\min_{i\in I,j\in J}D_{ij}^{-} for any I,J\subseteq[k].
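For concreteness, the following sketch computes S_{i}, D_{ij}^{+}, D_{ij}^{-} and their set-level versions from the mixture parameters, so the separation condition in Theorem 2 below can be checked numerically (for example on the BTGM means); the function names are our own.

```python
import numpy as np

def separation_quantities(means, sigmas, n):
    """S_i, D_ij^+ and D_ij^- as defined above, for spherical Gaussians with
    means of shape (k, d) and per-component standard deviations of shape (k,)."""
    d = means.shape[1]
    S = sigmas * (np.sqrt(d) + 2 * np.sqrt(np.log(n)))
    dist = np.linalg.norm(means[:, None, :] - means[None, :, :], axis=-1)
    return S, dist + S[:, None] + S[None, :], dist - S[:, None] - S[None, :]

def group_bounds(D_plus, D_minus, I, J):
    """D_{I,J}^+ = max over i in I, j in J of D_ij^+, and D_{I,J}^- = min of D_ij^-."""
    idx = np.ix_(list(I), list(J))
    return D_plus[idx].max(), D_minus[idx].min()
```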

Theorem 2.

There exists an absolute constant c_{1}>0 such that the following holds. Suppose we are given a sample of size n from a k-mixture of spherical Gaussians that satisfy the separation conditions and the sample size lower bound in Theorem 1, and suppose that there is an underlying hierarchy \mathcal{H} of the Gaussian components satisfying

\forall\ell\in[\tau],\ \ I\not=J\in\mathcal{H}_{\ell},\ \ D_{I,J}^{-}\geq c_{1}\sqrt{\nu_{\ell}}\max\{D_{I,I}^{+},D_{J,J}^{+}\},

where \nu_{\ell}:=\max_{I\not=J\in\mathcal{H}_{\ell}}w_{I}/w_{J}. Then, Ward’s method recovers the underlying hierarchy \mathcal{H} with probability at least 1-2/k.

The above theorem resembles Theorem 1, with separation conditions comparing the cluster radii with the inter-cluster distances at different hierarchy levels. Informally, the theorem states that if the clusters in each level of the hierarchy are well-separated, Ward’s method will be able to recover the correct clustering of that level, and hence also the entire hierarchy.

5 Experiments

We empirically determine which embedding method produces the best results for dendrogram purity and Moseley-Wang’s objective. We use Ward’s method for clustering after embedding (Appendix C has other algorithms). We provide results for our synthetic model (BTGM) and three real datasets.

Set-up and methods.   We compare VaDE against PCA (linear baseline) and a standard VAE (non-linear baseline). All the embedding methods are trained on the whole dataset. Then we run Ward’s method on the sub-sampled embedded data and the original data to get the dendrograms. In the experiments, we only use 25 classes from CIFAR-100 that fall into one of the five superclasses “aquatic animals”, “flowers”, “fruits and vegetables”, “large natural outdoor scenes” and “vehicle1”. We call it CIFAR-25, and we also use CIFAR-25-s to refer to the images labeled with the five superclasses. We average the results over 100 runs, each with 10k random samples from each dataset (except 20newsgroup where we use all 18k samples). Details in Appendix C.

Synthetic Data (BTGM).   We vary the depth and margin of the BTGM to evaluate different regimes of hierarchy depth and cluster separation. After generating the k different means of the BTGM, we shift all the non-zero entries of the first k/2 means to an arbitrary location. The goal is to show that this simple non-linear transformation is challenging for linear embedding methods (i.e., PCA), while preserving the hierarchical structure. To generate the dataset of k clusters with shifted means, we sample 2000 points from each Gaussian distribution of the BTGM of height h such that 2^{h}=k. Then, we set the embedding dimension of PCA, VAE, and VaDE to be h.

Dendrogram Purity (DP).   We compute the DP of various methods using (3). We construct ground truth clusters either from class labels of real datasets or from the leaf clusters of the BTGM model.

Moseley-Wang’s Objective (MW).   To evaluate the MW objective in (2), we need pairwise similarities w_{ij}. For classification datasets, we let w_{ij}=\mathbbm{1}_{\mathcal{C}(x^{i})=\mathcal{C}(x^{j})} be the indicator that points i and j are in the same class. When there are hierarchical labels, we instead define w_{ij}=2^{l-1}\cdot\mathbbm{1}_{\mathcal{C}^{l}(x^{i})=\mathcal{C}^{l}(x^{j})}, where \mathcal{C}^{l}(x^{i}) denotes the label of point x^{i} at the l^{th} level of the hierarchical tree, from top to bottom. We measure the ratio achieved by various methods to the optimal MW value (opt), which we compute as follows. For k flat clusters, the optimal MW value is obtained by first merging points in the same clusters, then building a binary tree on top of the k clusters. When there are hierarchical labels, we compare against the optimal tree with ground truth levels.
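As an illustration of this evaluation, the sketch below computes MW(T) for the flat-label case w_{ij}=\mathbbm{1}_{\mathcal{C}(x^{i})=\mathcal{C}(x^{j})} from a SciPy linkage matrix (unordered pairs; the convention cancels in the ratio). The normalization uses the fact that, with unit weights, every binary tree over a class of size n_{c} has the same total LCA-subtree size (n_{c}^{3}-n_{c})/3, so the "merge classes first" tree described above attains the optimum. This is our own illustrative code and covers only the flat-label case.

```python
import numpy as np
from collections import Counter
from scipy.cluster.hierarchy import to_tree

def moseley_wang_flat(Z, labels):
    """MW objective (Eq. (2)) with indicator similarities, from a linkage matrix.
    Pairs whose LCA is a node contribute (n - |leaves(node)|) each."""
    labels = np.asarray(labels)
    n = len(labels)
    total = 0.0

    def counts(node):   # per-node label counts, accumulating MW contributions
        # (as in the DP sketch, raise sys.setrecursionlimit for very deep trees)
        nonlocal total
        if node.is_leaf():
            return Counter([labels[node.id]])
        left, right = counts(node.get_left()), counts(node.get_right())
        merged = left + right
        size = sum(merged.values())
        cross = sum(left[c] * right[c] for c in left.keys() & right.keys())
        total += cross * (n - size)
        return merged

    counts(to_tree(Z))
    return total

def mw_ratio_flat(Z, labels):
    """Ratio of MW(T) to the flat-label optimum sum_c [n*C(n_c,2) - (n_c^3 - n_c)/3]."""
    labels = np.asarray(labels)
    n = len(labels)
    opt = sum(c * (c - 1) / 2 * n - (c ** 3 - c) / 3 for c in Counter(labels).values())
    return moseley_wang_flat(Z, labels) / opt
```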

Figure 3: In the original space, Ward’s method makes several mistakes: three orchids merge into the aquatic animals cluster before the flowers cluster. Using the VaDE embedding, Ward’s method produces a near optimal dendrogram on this subset of data for both lower and higher level structures.

5.1 Results

Results on Synthetic Data.   Table 1 presents the results of the synthetic experiments. We set the margin m of the BTGM to be 8 while varying the depth, so the closest distance between the means of any two Gaussian distributions is 8. We also set the embedding dimension to be h, which is the intrinsic dimensionality of the pre-shifted means. As Table 1 shows, the combination of embedding with VaDE and then clustering with Ward’s method improves the dendrogram purity and Moseley-Wang’s objective in all settings. VaDE fully utilizes the h-dimensional latent space. On the other hand, embedding with PCA or VAE is actually detrimental to the two objectives compared to the original data. We conjecture that PCA fails to resolve the non-linear shifting of the underlying means. In general, using a linear embedding method will likely lead to a drop in both DP and MW for this distribution. Similarly, VAE does not provide sufficient separation between clusters.

Results on Real Data.   Tables 2 and 3 report the values of the DP and MW objectives on real datasets. We observe that for MNIST, using the VaDE embedding improves DP and MW significantly compared to other embeddings, and applying the transformation further leads to an increased value of both objectives. Compared to PCA and VAE, we see that VaDE also has lower variance in the clustering results over randomly sampled subsets from the MNIST dataset. The results on the Reuters dataset show that VaDE plus the rescaling transformation is also suitable for text data, where it leads to the best performance. In the CIFAR-25 experiments, we see only a modest improvement with VaDE when evaluating with all 25 classes. It seems that a large number of real clusters may be a limitation of the VaDE embedding, and this deserves additional attention. On the other hand, when restricting to the five superclass labels, the improvement on CIFAR-25-s increases. We hypothesize that VaDE generates a latent space that encodes more of the high level structure for the CIFAR-25 images. This hypothesis can be partially verified by the qualitative evaluation in Figure 3. A trend consistent with the synthetic experiments is that PCA and VAE are not always helpful embeddings for hierarchical clustering with Ward’s method. Overall, VaDE discovers an appropriate embedding into Euclidean space based on its GMM prior that improves the clustering results.

Table 1: Dendrogram Purity (a) and Moseley-Wang’s objective (b) on the BTGM synthetic dataset. We evaluate Ward’s method using different embeddings, while varying parameters: k denotes the number of pure clusters, h denotes both the dimension of the PCA and VaDE latent spaces and the tree height.

(a) Dendrogram Purity
            k = 8 (h = 3)    k = 16 (h = 4)   k = 32 (h = 5)
Original    0.798 ± .020     0.789 ± .017     0.775 ± .016
PCA         0.518 ± .002     0.404 ± .003     0.299 ± .003
VAE         0.619 ± .030     0.755 ± .019     0.667 ± .012
VaDE        0.943 ± .011     0.945 ± .001     0.869 ± .013

(b) Moseley-Wang’s objective
            k = 8 (h = 3)    k = 16 (h = 4)   k = 32 (h = 5)
Original    0.960 ± 0.004    0.980 ± 0.002    0.989 ± 0.001
PCA         0.909 ± 0.001    0.913 ± 0.004    0.936 ± 0.001
VAE         0.791 ± 0.028    0.898 ± 0.017    0.969 ± 0.011
VaDE        0.990 ± 0.020    0.994 ± 0.001    0.991 ± 0.001
Table 2: DP on three real datasets for different embedding methods.
                Reuters        MNIST          CIFAR-25-s     CIFAR-25
Original        0.535 ± .023   0.478 ± .025   0.280 ± .009   0.110 ± .005
PCA             0.587 ± .021   0.472 ± .025   0.275 ± .007   0.108 ± .004
VAE             0.468 ± .027   0.495 ± .030   0.238 ± .004   0.082 ± .003
VaDE            0.650 ± .015   0.803 ± .021   0.287 ± .009   0.120 ± .003
VaDE + Trans.   0.672 ± .009   0.886 ± .015   0.300 ± .008   0.128 ± .004
Table 3: MW (ratio vs. opt) on three real datasets for different embedding methods.
                Reuters        MNIST          CIFAR-25
Original        0.654 ± .025   0.663 ± .014   0.438 ± .005
PCA             0.679 ± .021   0.650 ± .037   0.435 ± .007
VAE             0.613 ± .027   0.630 ± .034   0.393 ± .004
VaDE            0.745 ± .025   0.915 ± .017   0.455 ± .003
VaDE + Trans.   0.768 ± .012   0.955 ± .005   0.472 ± .002

5.2 Discussion

Theoretical insights.   In Theorems 1 and 2, we showed that Ward’s method recovers the hierarchical clustering when the Gaussian means are separated in Euclidean distance. Since VaDE uses a GMM prior, this is a natural metric for the embedded space. As the rescaling increases the separation, we suppose that the embedded vectors behave as if they are approximately sampled from a distribution satisfying the conditions of the theorems. Thus, our theoretical results justify the success of the embedding in practice, where we see that VaDE+Transformation performs the best on many datasets.

Recovering planted and real hierarchies.   The synthetic results in Table 1 exhibit the success of the VaDE+Transformation method at recovering the underlying clusters and the planted hierarchy. This trend holds while varying the number of clusters and the depth of the hierarchy. Qualitatively, we find that VaDE can “denoise” the sampled points when the margin does not suffice to guarantee non-overlapping clusters. This enables Ward’s method to achieve a higher value of DP and MW than running the algorithm on the original data. If the hierarchy is not encoded in Euclidean space, e.g., MNIST or CIFAR-25, then VaDE provides good embeddings from which Ward’s method can recover the hierarchy. In many cases, PCA and VAE are unable to find a good embedding.

Rescaling helps.   The rescaling transformation, although simple, leads to consistently better results for both DP and MW on multiple real datasets. In practice, the data distributions are correlated and overlapping, without clear cluster structure in high-dimensional space. The VaDE embedding leads to more concentrated clusters, and the rescaling transformation enlarges the separation enough for Ward’s method to recover what is learned by VaDE. In the synthetic experiments, the latent distributions learned by VaDE are already well-separated, so the rescaling is not necessary in this case. In the future, it may be beneficial to explore transformations that depend on the whole hierarchy.

6 Conclusion

We investigated the effectiveness of embedding data in Euclidean space to improve unsupervised hierarchical clustering methods. We saw that rescaling the VaDE latent space embedding, and then applying Ward’s linkage-based method, leads to the best results for dendrogram purity and the Moseley-Wang objective on many datasets. To better understand the VaDE embedding, we proposed a planted hierarchical model using Gaussians with separated means. Theoretically, we proved that Ward’s method recovers both the clustering and the hierarchy with high probability in this model.

While we focus on embedding using variational autoencoders, an open direction for further research could involve embedding hierarchical structure using other representation learning methods agarwal2007generalized; borg2005modern; caron2018deep; chierchia2019ultrametric; mishne2019diffusion; nina2019decoder; salakhutdinov2012one; tsai2017learning; yadav2019supervised. Another direction is to better understand the differences between learned embeddings, comparison-based methods, and ordinal relations emamjomeh2018adaptive; ghoshdastidar2019foundations; kazemi2018comparison; jamieson2011low. Finally, it would be worth exploring the use of approximate clustering methods to improve the running time abboud2019subquadratic; kobren2017hierarchical; rashtchian2017clustering.

Appendix A Proof of the Theorems

The next two lemmas provide tight tail bounds for binomial distributions and spherical Gaussians.

Lemma 1 (Concentration of binomial random variables).

[chung2006complex] Let X be a binomial random variable with mean \mu. Then for any \delta>0,

\Pr(X\geq(1+\delta)\mu)\leq e^{-(\max\{\delta^{2},\delta\})\mu/3},

and for any \delta\in(0,1),

\Pr(X\leq(1-\delta)\mu)\leq e^{-\delta^{2}\mu/2}.
Lemma 2 (Concentration of spherical Gaussians).

[Gaussian2016notes] For a random variable X distributed according to a spherical Gaussian in d dimensions (mean \mu and variance \sigma^{2} in each direction), and any \delta\in(0,1),

\Pr\left(\left\lVert X-\mu\right\rVert_{2}^{2}\leq\sigma^{2}d\left(1+2\sqrt{\frac{\log(1/\delta)}{d}}+\frac{2\log(1/\delta)}{d}\right)\right)\geq 1-\delta.
Lemma 3 (Optimal clustering through Ward’s method).

[grosswendt2019analysis, grosswendt_2020] Let P\subseteq\mathbb{R}^{d} be a collection of points with underlying clustering C_{1}^{*},\ldots,C_{k}^{*}, such that the corresponding cluster mean values \mu_{1}^{*},\ldots,\mu_{k}^{*} satisfy \left\lVert\mu_{i}^{*}-\mu_{j}^{*}\right\rVert_{2}>(2+2\sqrt{2\nu})\max_{x\in C_{i}^{*}}\left\lVert x-\mu_{i}^{*}\right\rVert_{2} for all i\not=j, where \nu:=\max_{\ell,t\in[k]}\frac{|C_{\ell}^{*}|}{|C_{t}^{*}|} is the maximum ratio between cluster sizes. Then Ward’s method recovers the underlying clustering.

Now we are ready to prove Theorem 1.

Proof of Theorem 1.

First, for each component in the mixture, we determine the associated number of sample points. This is equivalent to drawing a size-n sample from the weighting distribution that assigns w_{i}>0 to each i\in[k]. The number of times each symbol i\in[k] appears in the sample, which we refer to as the (sample) multiplicity of the symbol, follows a binomial distribution with mean nw_{i}. By Lemma 1, the probability that an arbitrary multiplicity is within a factor of 2 from its mean value is at least 1-2e^{-nw_{i}/8}. Hence, for n\geq 16(\log k)/w_{i}, the union bound yields that this deviation bound holds for all symbols with probability at least 1-k\cdot k^{-2}=1-k^{-1}.

Second, given the symbol multiplicities, we draw sample points from each component correspondingly. Consider the i-th component with mean \mu_{i}. Then, by Lemma 2, every sample point from this component deviates (under Euclidean distance) from \mu_{i} by at most S_{i}:=\sigma_{i}(\sqrt{d}+2\sqrt{\log n}), with probability at least 1-n^{-2}. Again, the union bound yields that this deviation bound holds for all components and all sample points with probability at least 1-n\cdot n^{-2}=1-n^{-1}.

Third, by the above reasoning and n>k, with probability at least 1-2/k and for all i\in[k],

  1. the sample size associated with the i-th component is between nw_{i}/2 and 2nw_{i};

  2. every sample point from the i-th component deviates from \mu_{i} by at most S_{i}.

Therefore, if further \left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}\geq(5+12\max_{\ell,t\in[k]}\sqrt{w_{\ell}/w_{t}})(S_{i}+S_{j}) for all i\not=j, the sample points will form well-separated clusters according to the underlying components. In addition, the first assumption ensures that the maximum ratio between cluster sizes is at most 4\max_{\ell,t\in[k]}w_{\ell}/w_{t}, and the second assumption implies that for any i\not=j\in[k],

  1. for the i-th cluster, the distance between its mean and any point in it is at most 2S_{i};

  2. the empirical means of the i-th and j-th clusters are separated by at least

     16(S_{i}+S_{j})>\left(2+2\sqrt{2\cdot 4}\cdot\max_{\ell,t\in[k]}\sqrt{\frac{w_{\ell}}{w_{t}}}\right)\max\{2S_{i},2S_{j}\}.

Finally, applying Lemma 3 completes the proof of the theorem. ∎

Next, we move to the proof of Theorem 2. Let us first recap what the theorem states.

Theorem 2.

There exists an absolute constant c_{1}>0 such that the following holds. Suppose we are given a sample of size n from a k-mixture of spherical Gaussians that satisfy the separation conditions and the sample size lower bound in Theorem 1, and suppose that there is an underlying hierarchy \mathcal{H} of the Gaussian components satisfying

\forall\ell\in[\tau],\quad I\not=J\in\mathcal{H}_{\ell},\quad D_{I,J}^{-}\geq c_{1}\sqrt{\nu_{\ell}}\ (D_{I,I}^{+}+D_{J,J}^{+}),

where \nu_{\ell}:=\max_{I\not=J\in\mathcal{H}_{\ell}}w_{I}/w_{J}. Then, Ward’s method recovers the underlying hierarchy \mathcal{H} with probability at least 1-2/k.

Proof of Theorem 2.

By the proof of Theorem 1, for any sample size n and Gaussian component \mathcal{G}_{i}, the quantity S_{i}:=\sigma_{i}(\sqrt{d}+2\sqrt{\log n}) upper bounds the radius of the corresponding sample cluster, with high probability. Here, the term “radius” is defined with respect to the actual center \mu_{i}.

The triangle inequality then implies that for any i\neq j\in[k], the distance between a point in the sample cluster of \mathcal{G}_{i} and one in that of \mathcal{G}_{j} is at most D_{ij}^{+}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}+S_{i}+S_{j}, i.e., the sum of the distance between the two centers and the radii of the two clusters, and at least D_{ij}^{-}:=\left\lVert\mu_{i}-\mu_{j}\right\rVert_{2}-S_{i}-S_{j}.

More generally, for any non-singleton clusters represented by I,J\subset[k], the distance between two points x\in C_{I} and y\in C_{J} is at most D_{I,J}^{+}:=\max_{i\in I,j\in J}D_{ij}^{+}, and at least D_{I,J}^{-}:=\min_{i\in I,j\in J}D_{ij}^{-}. In particular, for I=J, the quantity D_{I,I}^{+} upper bounds the diameter of the cluster C_{I}.

Furthermore, by the proof of Theorem 1, given the exact recovery of the k Gaussian components, the ratio between the sizes of any two clusters (at any level), say C_{I} and C_{J}, differs from the ratio between their weights, w_{I}/w_{J}, by a factor of at most 4. Hence, the quantity \nu_{\ell} essentially characterizes the maximum ratio between the sample cluster sizes at level \ell of the hierarchy.

With the above reasoning, Lemma 3 naturally completes the proof of the theorem. Intuitively, this shows that if the clusters in each level of the hierarchy are well-separated, Ward’s method will be able to recover the correct clustering of that level, and hence also the entire hierarchy. ∎

Appendix B Optimality of Theorem 1

Below, we illustrate the hardness of improving the theorem without preprocessing the data, e.g., applying an embedding method to the sample points, such as VaDE or PCA.

Separation conditions.

As with many other commonly used linkage methods, the behavior and performance of Ward’s method depend only on the inter-point distances between the observations. For optimal-clustering recovery through Ward’s method, a natural assumption is that the distances between points in the same cluster are at most those between disjoint clusters. Consider a simple example where we draw 3 sample points from a uniform mixture of two standard d-dimensional spherical Gaussians \mathcal{G}_{0} and \mathcal{G}_{1} with means separated by a distance \Delta. Then, with probability 3/8, we will draw two sample points from \mathcal{G}_{0}, say X,X_{0}, and one sample point from \mathcal{G}_{1}, say Y. Conditioned on this, the reasoning in Section 2.8 of [blum2020foundations] implies that, with high probability,

D_{\mathcal{G}_{0}}:=\left\lVert X-X_{0}\right\rVert_{2}^{2}\approx 2d\pm\mathcal{O}(\sqrt{d})\quad\text{ and }\quad D_{\mathcal{G}_{1}}:=\left\lVert X-Y\right\rVert_{2}^{2}\approx\Delta^{2}+2d\pm\mathcal{O}(\sqrt{d}).

Hence, the assumption that points are closer inside the clusters translates to

D_{\mathcal{G}_{0}}<D_{\mathcal{G}_{1}}\implies 2d\pm\mathcal{O}(\sqrt{d})<\Delta^{2}+2d\pm\mathcal{O}(\sqrt{d})\implies\Delta=\Omega(d^{1/4}).

Furthermore, if one requires the inner-cluster distance to be slightly smaller than the inter-cluster distance with high probability, say (1+\epsilon)D_{\mathcal{G}_{0}}<D_{\mathcal{G}_{1}}, where \epsilon>0 is a small constant,

(1+\epsilon)(2d\pm\mathcal{O}(\sqrt{d}))<\Delta^{2}+2d\pm\mathcal{O}(\sqrt{d})\implies\Delta=\Omega(\sqrt{\epsilon d}).

Therefore, the mean difference between the two Gaussians should exhibit a linear dependence on \sqrt{d}.

In addition, for a standard d-dimensional spherical Gaussian, Theorem 2.9 in [blum2020foundations] states that for any \beta\leq\sqrt{d}, all but at most 3\exp(-\Omega(\beta^{2})) of its probability mass lies within \sqrt{d}-\beta\leq|x|\leq\sqrt{d}+\beta. Hence, if we increase the sample size from 3 to n, correctly separating the points by their associated components requires an extra \mathcal{O}(\sqrt{\log n}) term in the mean difference.

Sample-size lower bound.

In terms of the sample size n, Theorem 1 requires n\geq c_{0}(\log k)/w_{\text{min}}. For Ward’s method to recover the underlying k-component clustering with high probability, we must observe at least one sample point from every component of the Gaussian mixture. Clearly, this means that the expected size of the smallest cluster should be at least 1, implying n\geq 1/w_{\text{min}}. If we have \nu=\mathcal{O}(1), this reduces to a weighted version of the coupon collector problem, which calls for a sample size on the order of k\log k=\Theta((\log k)/w_{\text{min}}) by standard results [feller1968introduction].

Appendix C More Experimental Details

Network Architecture.   The architectures of the encoder and decoder for both VaDE and VAE are fully-connected layers with sizes d-2000-500-500-h and h-500-500-2000-d, respectively, where h is the dimension of the latent space \mathbb{R}^{h} and d is the input dimension of the data in \mathbb{R}^{d}. These settings follow the original VaDE paper [jiang2017variational]. We use the Adam optimizer on mini-batches of size 800 with learning rate 5e-4 on all three real datasets. The choices of hyper-parameters such as h and k are discussed in the Sensitivity Analysis.

Computing Dendrogram Purity.   For every set of N points sampled from the dataset (e.g., we use N=2048 for MNIST and CIFAR-25), we compute the exact dendrogram purity using formula (3). We then report the mean and standard deviation of the results over 100 repetitions of the experiment.

Sensitivity Analysis.   We check the sensitivity of VAE and VaDE to the hyper-parameters h and k in terms of DP and MW. We also check the sensitivity to the rescaling factor s applied after the VaDE embedding. The details are presented in Figures 4, 5, and 6.

(a) Sensitivity of VaDE to the hidden dimension h while we fix k=10
(b) Sensitivity of VaDE to the number of latent clusters k while we fix h=10
Figure 4: Evaluating the sensitivity of VaDE on MNIST to its hyper-parameters h, which denotes the dimensionality of the latent space embedding, and k, which denotes the number of GMM components used for the latent space. We plot the Moseley-Wang objective (MW) and dendrogram purity (DP) as we vary either h or k. Overall, we see that as long as h is between 5 and 12, the results are stable. Similarly, k just needs to be at least 8. Therefore, while these parameters are important, the VaDE model is not very sensitive. Note that in our experiments for real data we fixed h=10, and for synthetic data we set h to be the dimensionality of the BTGM (either 3, 4, or 5). For real/synthetic data we set k to be the number of ground truth clusters (i.e., k=10 for MNIST).
(a) Sensitivity of VAE to the hidden dimension h with the same architecture as VaDE
(b) Sensitivity of VaDE to the rescaling factor s while we fix h=10 and k=10
Figure 5: On the left, we evaluate the sensitivity of VAE on MNIST to its hyper-parameter h, which denotes the dimensionality of the latent space embedding. On the right, we evaluate the sensitivity of VaDE + Trans. to its hyper-parameter s. We plot the Moseley-Wang objective (MW) and dendrogram purity (DP) as we vary either h or s. We see that the performance of VAE is not sensitive to its hidden dimension size h. But overall, the VAE does not perform well with respect to DP and MW. As for the rescaling experiment, we see that the effect of s becomes stable when s\geq 3. Note that in the real dataset experiments we set s=3.

Comparing Linkage-Based Methods.   Table 4 shows the DP and MW results for several linkage-based variations, depending on the function used to determine cluster similarity. Overall, we see that Ward’s method performs the best on average. However, in some cases, the other methods do comparably.

Table 4: Dendrogram Purity and Moseley-Wang’s objective using different HAC linkage methods on the VaDE + Transformed latent space.
                      MNIST                               CIFAR-100
           Dendro. Purity     M-W objective     Dendro. Purity     M-W objective
Single     0.829 ± 0.038      0.932 ± 0.023     0.260 ± 0.005      0.419 ± 0.006
Complete   0.854 ± 0.040      0.937 ± 0.024     0.293 ± 0.008      0.451 ± 0.016
Centroid   0.859 ± 0.035      0.949 ± 0.019     0.284 ± 0.008      0.453 ± 0.009
Average    0.866 ± 0.035      0.948 ± 0.023     0.295 ± 0.005      0.462 ± 0.006
Ward       0.870 ± 0.031      0.948 ± 0.025     0.300 ± 0.008      0.465 ± 0.011
(a) k=10
(b) k=15
(c) k=20
Figure 6: tSNE visualization of the VaDE latent space with different numbers of latent clusters k. We can see that the latent space of VaDE becomes more separated as the number of clusters k increases. When k=20, many of the original 10 digit clusters further divide into 2 smaller clusters, which in total form the 20 clusters learned by VaDE.

Dataset Details.   We conduct the experiments on our synthetic data from the BTGM as well as real data benchmarks: Reuters, MNIST [MNIST], and CIFAR-100 [CIFAR100]. In the experiments, we only use the 25 classes from CIFAR-100 that fall into one of the five superclasses "aquatic animals", "flowers", "fruits and vegetables", "large natural outdoor scenes", and "vehicle1" (the information on these superclasses can be found at https://www.cs.toronto.edu/~kriz/cifar.html); we call it CIFAR-25. Below we also provide further details on the Digits dataset of handwritten digits (https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits). More details about the number of clusters and other statistics can be found in Table 5.

Rescaling explained.   The motivation for the rescaling transformation can be explained using Figure 7. Panel (a) shows the tSNE visualization of the VaDE embedding before and after the rescaling transformation. As we can see, each cluster becomes more separated. In (b), we visualize the inter-class distance matrices before and after the transformation. We use blue and black bounding boxes to highlight a few superclusters revealed in the VaDE embedding. These high-level structures remain unchanged after the rescaling, which is a consequence of property (ii) in Section 3.

Table 5: Dataset Details.
                            Digits    Reuters    MNIST    CIFAR-25-s    CIFAR-25
# ground truth clusters     10        4          10       5             25
# data for training         6560      680K       60K      15K           15K
# data for evaluation       1000      1000       2048     2048          2048
Dimensionality              64        2000       768      3072          3072
Size of hidden dimension    5         10         10       20            20

C.1 Digits Dataset

Few previous papers report dendrogram purity results for the Digits dataset. First, we experiment using the same settings and the same test set construction for Digits as in [monath2019gradient]. In this case, VaDE+Ward achieves a DP of 0.941, substantially improving the previous state-of-the-art results for dendrogram purity from existing methods: the DP of [monath2019gradient] is 0.675, the DP of PERCH [kobren2017hierarchical] is 0.614, and the DP of BIRCH [BIRCH1996] is 0.544. In other words, our approach of VaDE+Ward leads to a 26.5 point increase in DP for this simple dataset.

Surprisingly, we find that the test data they used is quite easy, as the DP and MW numbers are substantially better than those obtained using a random subset as the test set. Therefore, we follow the same evaluation settings as in Section 5 and randomly sample 100 subsets of 1000 data points for evaluation. The results are shown in Table 6. The contrast between the previous test data (easier) and random sampling (harder) further supports our experimental set-up in the main paper.

Table 6: Dendrogram Purity and Moseley-Wang’s objective using different embedding methods on the Digits dataset. Overall, VaDE + Trans. only provides a minor improvement in DP. Using Ward on the original space already achieves decent results.
                Dendro. Purity    M-W objective
Original        0.820 ± 0.031     0.846 ± 0.029
PCA             0.823 ± 0.028     0.844 ± 0.027
VaDE            0.824 ± 0.026     0.840 ± 0.024
VaDE + Trans.   0.832 ± 0.029     0.845 ± 0.024
Figure 7: (a) tSNE visualization of the embedded data for MNIST before and after the rescaling transformation; (b) the inter-class distance matrix before and after the rescaling transformation.