
An Axiomatic Definition of Hierarchical Clustering

Ery Arias-Castro (Department of Mathematics and Halıcıoğlu Data Science Institute, University of California, San Diego)
Elizabeth Coda (Department of Mathematics, University of California, San Diego)
Abstract

In this paper, we take an axiomatic approach to defining a population hierarchical clustering for piecewise constant densities and, in a manner similar to Lebesgue integration, extend this definition to more general densities. When the density satisfies some mild conditions, e.g., when it has connected support, is continuous, and vanishes only at infinity, or when the connected components of the density satisfy these conditions, our axiomatic definition results in Hartigan’s definition of the cluster tree.

1 Introduction

Clustering, informally understood as the grouping of data, is a central task in statistics and computer science with broad applications. Modern clustering algorithms originated in the work of numerical taxonomists, who developed methods to identify hierarchical structures in the classification of plant and animal species. Since then, clustering has been used in disciplines such as medicine, astronomy, anthropology, and economics, with aims such as exploratory analysis, data summarization, the identification of salient structures in data, and information organization.

The notion of a “good” or “accurate” clustering varies between fields and applications. For example, to some computer scientists, the correct clustering of a dataset is often defined as the solution to an optimization problem (think K-means), and a good algorithm either solves or approximates a solution to this problem, ideally with some guarantees [31, 14]. From this perspective, the dataset is viewed as fixed, and the cluster definition is based on the data alone [25]. Moreover, depending on the particular application, the requirements for a good clustering may be further loosened, as in the task of image segmentation, where a good clustering need only “extract the global impression of an image” according to [32].

Even though this view of clustering is widespread well outside computer science, it is not satisfactory from a statistical inference perspective. Indeed, in statistics, it is typically assumed that the sample is representative of an underlying population, and a clustering method, to be useful, should inform the analyst about that population. This viewpoint calls for a definition of clustering at the population level. When the sample is assumed iid from an underlying distribution representing the population, by clustering we mean a partition of the support of that distribution, and in that case, a clustering of the sample is deemed “good” or “accurate” by reference to the population clustering — and a clustering algorithm is a good one if it is consistent, meaning exact in the large-sample limit. This reference to the population is what gives meaning to statistical inference, and to questions such as whether an observed cluster is “real” or not.

However, there is not a generally accepted definition of clustering at the population level. One popular approach assumes that the data is drawn from a mixture model $f=\sum_{i=1}^{k}\alpha_{i}f_{i}$ and the population level clustering consists of $k$ clusters corresponding to the mixture components [18, 6]. If the underlying density is not a mixture, it can be approximated by a mixture model (typically chosen to be a multivariate Gaussian), though this requires a modeling choice. The approximation may require a very large number of components to be accurate, resulting in an artificially large number of clusters. Moreover, even if the density is a mixture, under this definition, a unimodal mixture could have multiple clusters. Alternatively, in the gradient flow approach to defining the population level clustering, often attributed to Fukunaga and Hostetler [19], each point is assigned to the nearest mode (i.e., local maximum) in the direction of the gradient. Thus, at least when the density has Morse regularity, the clusters correspond to the basins of attraction of the modes. Though this definition relies on assumptions about the smoothness of the density and does not account for arbitrarily flat densities [30], it overcomes some of the described difficulties of the mixture model clustering. If the components in the mixture model are well-separated, this definition results in a similar clustering to the mixture-based definition [10].

Taking a hierarchical perspective on clustering, Hartigan [21] proposed a population-level cluster tree, where clusters correspond to the maximal connected components of density upper level sets. Though Hartigan provides minimal motivation for this definition beyond observing that each cluster $C$ in his tree “conforms to the informal requirement that $C$ is a high-density region surrounded by a low-density region” [21], it is generally accepted as the definition of hierarchical clustering at the population level and has been used in subsequent works [15, 37, 11, 28, 4, 34]. It has also been shown that Hartigan’s definition of hierarchical clustering is fully compatible with Fukunaga and Hostetler’s definition of clustering [1].

Several works have explored axiomatic approaches to defining clustering algorithms that take as input a finite number of data points [29, 38, 5, 31, 27, 7]. In each of these, the authors state desirable characteristics of a clustering function (or clustering criterion) and identify algorithms that satisfy these requirements, or in the case of [29], prove the non-existence of such an algorithm. Inspired by these axiomatic approaches, we devote a significant portion of the paper to developing an axiom-based definition of a population hierarchical clustering. We focus on hierarchical clustering, rather than flat clustering, as we find it to be a simpler starting point. We recover Hartigan’s definition of hierarchical clustering for densities with connected support that satisfy continuity and some additional mild assumptions, as well as for densities with finitely many connected components, each of which satisfies these conditions.

1.1 Related Work

While we take an axiomatic approach to defining the population-level hierarchical clustering, several previous works have explored axiomatic approaches to defining clustering algorithms. The most famous of these might be that of [29], where three axioms are proposed (scale-invariance, richness, and consistency) and an impossibility theorem is established, proving that no clustering algorithm can simultaneously satisfy all three axioms. (The ‘consistency’ axiom is not in the statistical sense, but refers to the property that if within-cluster distances are decreased and between-cluster distances are enlarged, then the output clustering does not change.)

However, as has been pointed out by others [5, 35, 12], including Kleinberg himself in Section 5 of the same article [29], the consistency property may not be so desirable. Rather, a relaxation of this property, in which a refinement or coarsening of the clusters is allowed, may be more appropriate. Kleinberg states that clustering algorithms that satisfy scale-invariance, richness, and this relaxed notion of refinement-coarsening consistency do exist, and that clustering algorithms that satisfy scale-invariance, near-richness, and refinement consistency also exist. This was, in a sense, confirmed by Cohen-Addad et al. [12], who allow the number of clusters to vary with the refinement. Zadeh and Ben-David [38] show that, if the number of clusters that a clustering algorithm can return is fixed at $k$, there exist clustering algorithms that satisfy scale-invariance, $k$-richness, and consistency (in the original sense). They also show that single linkage is the unique clustering algorithm returning a fixed number of clusters that simultaneously satisfies these axioms and two additional ones.

Puzicha et al. [31] consider clustering data via the optimization of a suitable objective function, and characterize such a function through a set of axioms. Though their axioms are somewhat strong, requiring the objective function to have an additive structure, they show that only one of the objective functions considered satisfies all of their axioms. Ben-David and Ackerman [5] also propose a set of axioms for a clustering quality measure, strongly paralleling Kleinberg’s axioms, and show the existence of functions satisfying these axioms.

In the 1960s and 1970s, a number of articles examined the axiomatic foundations of clustering; Cormack [13] provides a comprehensive review. For example, Jardine and Sibson [27], Jardine et al. [26], and Sibson [33] list axioms that, according to them, a hierarchical clustering algorithm should satisfy, and then state that single linkage is the only algorithm they are aware of that satisfies all of their axioms. More recently, Carlsson and Mémoli [7] propose their own set of axioms for hierarchical clustering, and then prove that single linkage is the only algorithm that satisfies them. Though this result has been presented as a demonstration that Kleinberg’s impossibility theorem does not hold when hierarchical clustering algorithms are considered, this connection is somewhat unclear to us, as the proposed axioms do not mirror Kleinberg’s axioms very precisely.

1.2 Content

The organization of the paper is as follows. Section 2 provides some basic notation and definitions. In Section 3, we take an axiomatic approach to defining a hierarchical clustering for a piecewise constant density with connected support. In Section 4, we extend this definition to continuous densities, first to densities with connected support, and then to more general densities. In Section 5, we discuss some extensions, some practical considerations, and an outlook on flat clustering. In the appendices, we take a close look at the merge distortion metric (Section A) and provide further technical details for the special case of a Euclidean space (Section B).

2 Preliminaries

Throughout this paper, we work with a metric space $(\Omega,d)$. For technical reasons, we assume it is locally connected, which is for example the case if the balls are connected. This ensures that the connected components of an open set are open. (This is, in fact, an equivalence, the proof of which is left as an exercise in Armstrong’s textbook [3, Ch 3].)

In principle, we would equip this metric space with a suitable Borel measure and consider densities with respect to that measure. As it turns out, this is not needed, as we can work directly with non-negative functions. We will do so for the most part, although we will sometimes talk about densities.

For a set $A\subseteq\Omega$, we let $\text{int}(A)$ or $A^{\circ}$ denote its interior and $\text{clo}(A)$ or $\overline{A}$ denote its closure; we also let $\text{cc}(A)$ denote the collection of its connected components. For a function $f:\Omega\to\mathbb{R}$, its support is $\text{supp}(f)=\text{clo}\{f\neq 0\}$, and for $\lambda\in\mathbb{R}$, its upper $\lambda$-level set is defined as $\{f\geq\lambda\}$, denoted $U_{\lambda}$ when there is no ambiguity.

Definition 2.1 (Hierarchical clustering or cluster tree).

A hierarchical clustering, or cluster tree, of $\mathcal{X}\subseteq\Omega$ is a collection of connected subsets of $\mathcal{X}$, referred to as clusters, that has a nested structure in that two clusters are either disjoint or nested.

A hierarchical clustering of a function $f$ is understood as a hierarchical clustering of its support $\text{supp}(f)$. Hartigan’s definition of hierarchical clustering for a density is arguably the most well-known one.

Definition 2.2 (Hartigan cluster tree).

The Hartigan cluster tree of a function $f$, which will be denoted $\mathcal{H}_{f}$, is the collection consisting of the maximal connected components of the upper $\lambda$-level sets of $f$ for all $\lambda>0$. $\mathcal{H}_{f}$ is a hierarchical clustering of $\text{supp}(f)$.
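To make the definition concrete, here is a minimal sketch (ours, not code from the paper) computing the Hartigan cluster tree of a one-dimensional piecewise constant function discretized into cells: in one dimension, the maximal connected components of an upper level set are simply the maximal runs of consecutive cells where the level is reached.

```python
import numpy as np

def hartigan_tree_1d(f):
    """Hartigan cluster tree of a 1d array of nonnegative cell values:
    for each positive level taken by f, collect the maximal runs of
    consecutive cells with f >= level. Clusters are tuples of indices."""
    clusters = set()
    for lam in np.unique(f[f > 0]):
        idx = np.flatnonzero(f >= lam)
        # split the surviving indices into maximal consecutive runs
        for run in np.split(idx, np.flatnonzero(np.diff(idx) > 1) + 1):
            clusters.add(tuple(run))
    return clusters

f = np.array([1.0, 1.0, 0.5, 2.0, 2.0, 0.5, 0.5])
for c in sorted(hartigan_tree_1d(f), key=len):
    print(c)  # the two bumps (0, 1) and (3, 4), and the root (0, ..., 6)
```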

A dendrogram is commonly understood as the output of a hierarchical clustering algorithm such as single-linkage clustering. It turns out to be simpler to work with dendrograms than directly with cluster trees [7, 15].

Definition 2.3 (Dendrogram).

A dendrogram is a cluster tree equipped with a real-valued non-increasing function defined on the cluster tree, called the height function. A dendrogram is thus of the form $(\mathcal{C},h)$, where $\mathcal{C}$ is a cluster tree and $h:\mathcal{C}\to\mathbb{R}$ is such that $h(C^{\prime})\geq h(C)$ whenever $C^{\prime}\subseteq C$.

The Hartigan tree of a function $f$ is naturally equipped with the following height function:

$h_{f}(C)=\inf_{x\in C}f(x).$ (2.1)

Note that this function has the required monotonicity.

Eldridge et al. [15] introduced the merge distortion metric to compare dendrograms. It is based on the notion of merge height, which gives the height at which two points stop belonging to the same cluster, or equivalently, the height of the smallest cluster that contains both points.

Definition 2.4 (Merge height).

Let $(\mathcal{C},h)$ be a dendrogram. The merge height of two points $x,y\in\Omega$ is defined as

$m_{(\mathcal{C},h)}(x,y)=\sup_{C\in\mathcal{C}:\,x,y\in C}h(C).$ (2.2)

For the special case of a Hartigan cluster tree,

$m_{f}(x,y)=m_{(\mathcal{H}_{f},h_{f})}(x,y)=\sup_{C\text{ connected}:\,x,y\in C}\ \inf_{z\in C}f(z).$ (2.3)
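In dimension one, any connected set containing $x$ and $y$ contains the whole interval between them, so the sup-inf in (2.3) collapses to a minimum over that interval. The following snippet (our illustration, on a grid of cells) uses this reduction.

```python
import numpy as np

def merge_height_1d(f, i, j):
    """Merge height m_f of cells i and j for a 1d piecewise constant f:
    the minimum of f over the interval of cells between i and j."""
    lo, hi = min(i, j), max(i, j)
    return f[lo:hi + 1].min()

f = np.array([1.0, 1.0, 0.5, 2.0, 2.0, 0.5, 0.5])
print(merge_height_1d(f, 0, 4))  # 0.5: cells 0 and 4 only merge at level 0.5
print(merge_height_1d(f, 3, 4))  # 2.0: cells 3 and 4 share the same top cluster
```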
Definition 2.5 (Merge distortion metric).

Let $(\mathcal{C},h)$ and $(\mathcal{C}^{\prime},h^{\prime})$ be two dendrograms. Their merge distortion distance is defined as

$d_{M}((\mathcal{C},h),(\mathcal{C}^{\prime},h^{\prime}))=\sup_{x,y\in\Omega}\big|m_{(\mathcal{C},h)}(x,y)-m_{(\mathcal{C}^{\prime},h^{\prime})}(x,y)\big|.$
Figure 2.1: Left: A collection of sets with neighboring regions (green) and non-neighboring regions (red). This collection of sets does not have the internally connected property. Right: A collection of sets with the internally connected property.

The merge distortion metric has the following useful property [15, Th 17].

Lemma 2.6.

For two functions $f$ and $g$,

$d_{M}((\mathcal{H}_{f},h_{f}),(\mathcal{H}_{g},h_{g}))\leq\|f-g\|_{\infty}.$ (2.4)
Proof.

The arguments in [15] are a little unclear (likely due to typos), but correct arguments are given in [28, Lem 1]. We nonetheless provide a concise proof, as it is instructive. Take $x,y\in\Omega$, and let $s=m_{f}(x,y)$ and $t=m_{g}(x,y)$. We need to show that $|s-t|\leq\eta:=\|f-g\|_{\infty}$. For any $\epsilon>0$, by (2.3), there is a connected set $C$ containing $x$ and $y$ such that $f(z)\geq s-\epsilon$ for all $z\in C$. Since this implies that $g(z)\geq s-\epsilon-\eta$ for all $z\in C$, by (2.3) again, we get $t\geq s-\epsilon-\eta$. We have thus shown that $s\leq t+\eta+\epsilon$, and $t\leq s+\eta+\epsilon$ follows in exactly the same way; combined, these give $|s-t|\leq\eta+\epsilon$. With $\epsilon>0$ arbitrary, we conclude. ∎
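As a quick numerical sanity check of (2.4) under the same one-dimensional reduction as above (a sketch under our grid-of-cells assumption, not code from the paper), the merge distortion distance between the Hartigan dendrograms of two random step functions never exceeds their sup-norm distance:

```python
import numpy as np

def merge_height_1d(f, i, j):
    lo, hi = min(i, j), max(i, j)
    return f[lo:hi + 1].min()

rng = np.random.default_rng(0)
f = rng.uniform(0.1, 1.0, size=50)          # a random positive step function
g = f + rng.uniform(-0.05, 0.05, size=50)   # a small uniform perturbation

d_M = max(abs(merge_height_1d(f, i, j) - merge_height_1d(g, i, j))
          for i in range(50) for j in range(50))
assert d_M <= np.abs(f - g).max() + 1e-12   # the bound (2.4) holds
print(d_M, np.abs(f - g).max())
```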

The merge distortion metric has gained some popularity in subsequent works that discuss the consistency of hierarchical methods [28, 37]. In Section A we discuss some limitations and issues with the merge distortion metric, which is in fact a pseudometric on general cluster trees. However, in the context in which we use the metric, these issues are not significant.

We also introduce the notion of neighboring sets. Throughout, we adopt the convention that the empty set is disconnected.

Definition 2.7 (Neighboring regions).

Given a collection of sets $\mathcal{A}=\{A_{i}\}$, we define the neighborhood of $A_{i}$ as

$\mathcal{N}(A_{i})=\bigcup\big\{A_{j}:\text{int}\big(\overline{A_{i}}\cup\overline{A_{j}}\big)\text{ is connected}\big\}.$ (2.5)

Note that $A_{j}\subseteq\mathcal{N}(A_{i})\ \Leftrightarrow\ A_{i}\subseteq\mathcal{N}(A_{j})$, so that we may speak of $A_{i}$ and $A_{j}$ as being neighbors, which we will denote by $A_{i}\sim A_{j}$.

Under this definition, in a Euclidean space of dimension at least two, balls that only meet at one point are not neighbors, and neither are rectangles in dimension three that intersect only along an edge. Our discussion will be simplified when we consider collections in which all sets that intersect are neighbors.
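On the real line, the neighbor relation of Definition 2.7 is easy to characterize: for two open intervals, the interior of the union of the closures is connected exactly when the closed intervals meet, so intervals that merely touch at an endpoint are neighbors (the example of balls above concerns dimension at least two). A hedged sketch:

```python
def neighbors_1d(a, b):
    """Whether two open intervals a = (a1, b1) and b = (a2, b2) are
    neighbors in the sense of Definition 2.7: their closures meet,
    so that int(cl(a) U cl(b)) is a single open interval."""
    (a1, b1), (a2, b2) = a, b
    return max(a1, a2) <= min(b1, b2)

print(neighbors_1d((0, 1), (1, 2)))    # True: int([0,1] U [1,2]) = (0, 2)
print(neighbors_1d((0, 1), (1.5, 2)))  # False: the union is disconnected
```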

Definition 2.8 (Internally connected property).

Let $\mathcal{A}=\{A_{i}\}$ be a collection of sets. We say $\mathcal{A}$ has the internally connected property if

$\overline{A_{i}}\cup\overline{A_{j}}\ \text{connected}\ \Rightarrow\ \text{int}\big(\overline{A_{i}}\cup\overline{A_{j}}\big)\ \text{connected}.$ (2.6)

Figure 2.1 illustrates these two definitions.

3 Axioms

In this section, we develop a definition of the population cluster tree for a density $f$. Inspired by previous axiomatic approaches to clustering algorithms, and in the spirit of Lebesgue integration, we propose a set of axioms for a population cluster tree when the density is piecewise constant with connected support. We then extend this definition to more general densities, and arrive at a definition that is equivalent to Hartigan’s tree (Definition 2.2) for continuous densities with multiple connected components, under some mild assumptions.

3.1 Axioms for Piecewise Constant Functions

Previous work has discussed the difficulties in defining what the “true” clusters are [13, 21, 25, 36], observing that there may not be a single definition suitable for all intents and purposes. To simplify the situation as much as possible, so that a definition may arise naturally, we first consider piecewise constant functions with connected, bounded support. A function in that class is of the form

$f=\sum_{i=1}^{m}\lambda_{i}\,\mathbb{I}_{A_{i}},$ (3.1)

where, for all $i$, $\lambda_{i}>0$ and $A_{i}$ is a connected, bounded region with connected interior, and we also require that $\text{supp}(f)=\bigcup_{i=1}^{m}\overline{A_{i}}$ have connected interior. Additionally, without loss of generality, we assume the $A_{i}$ are disjoint. Let $\mathcal{F}$ denote the class of all such functions.

Remark 3.1.

We require each region $A_{i}$ and the entire support to not only be connected, but to have connected interior, and the same is true of the clusters (Axiom 1). It is well-known that the closure of a connected set is always connected, so this is a stronger requirement; it is meant to avoid ambiguities.

For $f\in\mathcal{F}$, we propose that a hierarchical clustering $\mathcal{C}$ should satisfy the following three axioms. For what it’s worth, Axiom 1 and Axiom 3 were put forth early on by Carmichael et al. [8] and, most famously although not as directly, by Hartigan [21]; they also correspond to the 7th item on the list of “desirable characteristics of clusters” suggested by Hennig [25], while Axiom 2 can be motivated by the 13th item on Hennig’s list.

3.1.1 Axiom 1: Clusters have connected interior

Figure 3.1: A piecewise constant density in $\mathcal{F}$. On the left, the highlighted region may be a cluster under Axiom 1; on the right, the highlighted region is not a cluster under Axiom 1, as its interior is not connected.
Figure 3.2: Left: A simple example of a piecewise constant density built on two regions. Right: The clustering output of K-means with number of clusters $K=2$. One of the clusters is disconnected.

We propose that any cluster in $\mathcal{C}$ should not only be connected, but have a connected interior. With Axiom 2 below in place, see (A2), we may express Axiom 1 as follows:

If $C\in\mathcal{C}$ and $A_{i},A_{j}\subseteq C$, then there are $A_{k_{1}},\dots,A_{k_{n}}\subseteq C$ such that $A_{i}\sim A_{k_{1}}\sim\cdots\sim A_{k_{n}}\sim A_{j}$. (A1)

For example, for the density in Figure 3.1, the highlighted region in the right-hand figure should not be a cluster in $\mathcal{C}$, but the highlighted region in the left-hand figure could be a cluster in $\mathcal{C}$. This reflects the idea that elements of a cluster should in some sense be similar to each other, without imposing additional assumptions on the within-cluster distances, the between-cluster distances, the relative sizes of clusters, or the shape of clusters.

The condition that a cluster be a connected region was considered early on in the literature as it was part of the postulates put forth by Carmichael et al. [8]. However, it is important to note that this condition is not enforced in other definitions of what a cluster is. Most prominently, K-means can return disconnected clusters — see Figure 3.2 for an example.

3.1.2 Axiom 2: Clusters do not partition connected regions of constant density

We propose that a connected region with constant density should not be broken up into smaller clusters as this would impose an additional structure that is not present in the density. We may write this axiom as:

Any $C\in\mathcal{C}$ is of the form $C=\bigcup_{i\in I}A_{i}$ for some $I\subseteq\{1,2,\dots,m\}$. (A2)

Figure 3.3 depicts an example of a valid and an invalid cluster under this axiom. Note that, as a consequence of this axiom, the within-cluster distances may be larger than the between-cluster distances, depending on the relative widths of, and separations between, regions.

We find this condition to be particularly natural in the present situation where the density is piecewise constant. It is in essence already present in the concept of relatedness introduced by Carmichael et al. [8]. But it is important to note that other definitions do not enforce this property. This is the case of K-means, which can split connected regions of constant density — see, again, Figure 3.2 for an example.

Figure 3.3: On the left, the highlighted region could be a cluster under Axiom 2, but the highlighted region on the right oversegments a region of constant density, and should not be a cluster.
Figure 3.4: On the left, the lowest density in the highlighted cluster exceeds the largest density in a neighboring set. On the right, the highlighted cluster contains a region with lower density than a neighbor, and thus should not be a cluster.

3.1.3 Axiom 3: Clusters are surrounded by regions of lower density

We propose that a cluster should be surrounded by regions of lower density, meaning that:

For any $C\in\mathcal{C}$, it holds that $\inf_{x\in C}f(x)>\sup_{x\in\mathcal{N}(C)\setminus C}f(x)$, (A3)

where, if $C=\bigcup_{i\in I}A_{i}$, then $\mathcal{N}(C)=\bigcup_{i\in I}\mathcal{N}(A_{i})$ denotes the neighborhood of $C$, extending the definition given in (2.5). Figure 3.4 includes an example.

This is one of the postulates of Carmichael et al. [8], although it was perhaps most popularized by Hartigan in his well-known book [21, Ch 11]. Although it is not part of most other approaches to clustering — K-means being among those as Figure 3.2 shows — we find that this condition is rather compatible with the colloquial understanding of ‘points clustering together’.

3.2 Finest hierarchical clustering

Definition 3.2 (Finer cluster tree).

We say that a cluster tree $\mathcal{C}$ is finer than (or a refinement of) another cluster tree $\mathcal{C}^{\prime}$ if $\mathcal{C}$ includes all the clusters of $\mathcal{C}^{\prime}$, namely, $C\in\mathcal{C}^{\prime}\ \Rightarrow\ C\in\mathcal{C}$.

As it turns out, given a function in $\mathcal{F}$, there is one, and only one, finest cluster tree among those satisfying the axioms above.

Proposition 3.3.

For any $f\in\mathcal{F}$, there exists a unique finest hierarchical clustering of $f$ among those satisfying the axioms.

Proof.

Let $f$ be as in (3.1). The proof is by construction. Let $\mathcal{C}_{*}$ denote the collection of every cluster that satisfies (A1), (A2), and (A3). Clearly, it suffices to show that $\mathcal{C}_{*}$ is a hierarchical clustering (Definition 2.1). Take two clusters in $\mathcal{C}_{*}$, say $C_{1}=\bigcup_{i\in I_{1}}A_{i}$ and $C_{2}=\bigcup_{i\in I_{2}}A_{i}$. We need to show that $C_{1}$ and $C_{2}$ are either disjoint or nested. Suppose for contradiction that this is not the case, so that $C_{1}$ and $C_{2}$ are neither disjoint nor nested. Since they are not disjoint, there is $i\in I_{1}\cap I_{2}$, so that $A_{i}\subseteq C_{1}\cap C_{2}$. And since they are not nested, there is $j\in I_{1}\setminus I_{2}$, so that $A_{j}\subseteq C_{1}\setminus C_{2}$. By (A1), there are $i_{1},\dots,i_{s}\in I_{1}$ such that $A_{i}\sim A_{i_{1}}\sim\cdots\sim A_{i_{s}}\sim A_{j}$. Let $t=\max\{q:A_{i_{q}}\subseteq C_{2}\}$, so that $A_{i_{t}}\subseteq C_{2}$ while $A_{i_{t+1}}\nsubseteq C_{2}$, and in particular $A_{i_{t+1}}\subseteq\mathcal{N}(C_{2})\setminus C_{2}$. Applying (A3), we get

$\min_{C_{2}}f>\max_{\mathcal{N}(C_{2})\setminus C_{2}}f\geq\lambda_{i_{t+1}}\geq\min_{C_{1}}f,$

using at the end the fact that $A_{i_{t+1}}\subseteq C_{1}$. However, we could also obtain the reverse inequality, $\min_{C_{1}}f>\min_{C_{2}}f$, in the same exact way, resulting in a contradiction. ∎

Proposition 3.3 justifies the following definition.

Definition 3.4 (Finest axiom cluster tree).

For $f\in\mathcal{F}$, we denote by $\mathcal{C}_{f}^{*}$ the finest cluster tree of $f$ among those satisfying the axioms.
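The construction in the proof of Proposition 3.3 is easy to carry out in one dimension, where the regions $A_1,\dots,A_m$ are consecutive intervals: a union of regions satisfies (A1)-(A2) exactly when it is a run of consecutive regions, and (A3) asks that its minimum level exceed the levels of the two regions just outside it. A brute-force sketch (our illustration, not code from the paper):

```python
def finest_axiom_tree_1d(lam):
    """Clusters of C*_f for a 1d piecewise constant density whose
    consecutive regions A_1, ..., A_m have levels lam[0..m-1].
    Every run of consecutive regions satisfying (A3) is kept."""
    m = len(lam)
    clusters = []
    for i in range(m):
        for j in range(i, m):
            inside = min(lam[i:j + 1])
            outside = max(
                lam[i - 1] if i > 0 else float("-inf"),
                lam[j + 1] if j < m - 1 else float("-inf"),
            )
            if inside > outside:  # Axiom (A3); (A1)-(A2) hold by construction
                clusters.append(tuple(range(i, j + 1)))
    return clusters

# two bumps at levels 3 and 2, separated by a valley at level 0.5:
print(finest_axiom_tree_1d([1.0, 3.0, 0.5, 2.0, 1.0]))
# [(0, 1), (0, 1, 2, 3, 4), (1,), (3,), (3, 4)] -- here this coincides with
# the Hartigan tree, anticipating the comparison in the next subsection
```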

3.3 Comparison with Hartigan’s Cluster Tree

It is natural to compare the finest axiom cluster tree of Definition 3.4 with the Hartigan cluster tree of Definition 2.2. First, observe that for $f\in\mathcal{F}$, $\mathcal{H}_{f}$ satisfies (A2) and (A3). However, $\mathcal{H}_{f}$ need not satisfy (A1), as clusters in $\mathcal{H}_{f}$ are only required to be connected. As a result, in general, the Hartigan tree $\mathcal{H}_{f}$ is not the same as the finest axiom cluster tree $\mathcal{C}^{*}_{f}$. A counterexample is given in Figure 3.5.

We define $\mathcal{F}_{\rm int}$ as the class of functions in $\mathcal{F}$ with $\{A_{i}\}$ in (3.1) having the internally connected property (Definition 2.8).

Theorem 3.5.

For any $f\in\mathcal{F}_{\rm int}$, it holds that $\mathcal{C}^{*}_{f}=\mathcal{H}_{f}$.

Proof.

First, observe that under our assumption, $\mathcal{H}_{f}$ satisfies all of (A1), (A2), and (A3). Thus, because $\mathcal{C}^{*}_{f}$ is the finest cluster tree among those satisfying the axioms (Proposition 3.3), it must be the case that $\mathcal{H}_{f}\subseteq\mathcal{C}^{*}_{f}$.

For the reverse inclusion, take any $C\in\mathcal{C}^{*}_{f}$. We want to show that $C\in\mathcal{H}_{f}$. Recalling the definition of $h_{f}$ in (2.1), define $\lambda=h_{f}(C)$ and let $M$ be the maximal connected subset of $\{f\geq\lambda\}$ that contains $C$. We need to show that $C=M$. Noting that $C$ is of the form $\bigcup_{i\in I}A_{i}$ because of (A2), that $M$ must be of the form $\bigcup_{j\in J}A_{j}$ because $f$ is of the form (3.1), and that $M$ contains $C$ by definition, it is the case that $I\subseteq J$.

Suppose for contradiction that $C\neq M$, so that $I\neq J$. Since $M$ is connected, there must be $A_{i}$ in $C$ and $A_{j}$ in $M\setminus C$ such that $A_{i}\cup A_{j}$ is connected. As is well-known, this implies that $\overline{A_{i}\cup A_{j}}=\overline{A_{i}}\cup\overline{A_{j}}$ is connected, and since $f\in\mathcal{F}_{\rm int}$, $\text{int}(\overline{A_{i}}\cup\overline{A_{j}})$ is also connected, in turn implying that $A_{i}\sim A_{j}$. Applying (A3), we get that $\lambda>f(A_{j})$, which is a contradiction since $A_{j}\subseteq M$ and $M$ is part of the upper $\lambda$-level set. ∎

Figure 3.5: An example where Hartigan’s cluster tree does not satisfy the axioms, so that $\mathcal{C}^{*}_{f}\neq\mathcal{H}_{f}$. Indeed, $\mathcal{H}_{f}=\{A_{1},A_{1}\cup A_{2},A_{1}\cup A_{2}\cup A_{3}\}$ but $A_{1}\cup A_{2}\notin\mathcal{C}^{*}_{f}$ because $\text{int}(A_{1}\cup A_{2})$ is not connected. Instead, we have $\mathcal{C}^{*}_{f}=\{A_{1},A_{2},A_{1}\cup A_{2}\cup A_{3}\}$.
Remark 3.6.

As a relaxation of Axiom 1, we could simply require a cluster to be connected, allowing it to have disconnected interior. If the definition of a neighboring region were also relaxed, so that two sets are considered neighbors whenever the closure of their union is connected, then the relaxed Axiom 1, the original Axiom 2, and the original Axiom 3 would yield an axiomatic definition of a cluster tree that is identical to the Hartigan tree for $f\in\mathcal{F}$. All that said, we find the requirement that the interior be connected in our original Axiom 1 (and in Definition 2.7) to be more natural and robust.

4 Extension to Continuous Functions

Having defined the finest axiom cluster tree for a piecewise constant function (Definition 3.4), we now examine its implications when piecewise constant functions are used to approximate continuous functions. More specifically, we consider sequences of piecewise constant functions in $\mathcal{F}_{\rm int}$ converging to a continuous function, and show that, under some conditions, the corresponding finest axiom cluster trees converge to the Hartigan cluster tree of the limit function in the merge distortion metric (Definition 2.5).

4.1 Functions with connected support

We start with continuous functions whose support has connected interior.

Definition 4.1.

Given a continuous function $f$ with connected support, we say that $\mathcal{C}$ is an axiom cluster tree for $f$ if there is a sequence $(f_{n})\subseteq\mathcal{F}_{\rm int}$ that uniformly approximates $f$ such that

$\lim_{n\rightarrow\infty}d_{M}((\mathcal{C}^{*}_{f_{n}},h_{f_{n}}),(\mathcal{C},h_{f}))=0.$ (4.1)

At this point, it is not clear whether a continuous function admits an axiom cluster tree. However, if it does, then its Hartigan cluster tree is one of them and, moreover, all other axiom cluster trees are at merge distortion distance zero from it.

Theorem 4.2.

Suppose $f$ is a continuous function that admits an axiom cluster tree. Then its Hartigan tree $\mathcal{H}_{f}$ is an axiom cluster tree for $f$. Moreover, if $\mathcal{C}$ is an axiom cluster tree for $f$, then $d_{M}((\mathcal{C},h_{f}),(\mathcal{H}_{f},h_{f}))=0$.

Proof.

Let $\mathcal{C}$ be an axiom cluster tree for $f$. By Definition 4.1, there is a sequence $(f_{n})$ in $\mathcal{F}_{\rm int}$ that converges uniformly to $f$ such that (4.1) holds. By the triangle inequality,

$d_{M}((\mathcal{C},h_{f}),(\mathcal{H}_{f},h_{f}))\leq d_{M}((\mathcal{C},h_{f}),(\mathcal{C}^{*}_{f_{n}},h_{f_{n}}))+d_{M}((\mathcal{C}^{*}_{f_{n}},h_{f_{n}}),(\mathcal{H}_{f},h_{f})).$ (4.2)

We already know that the first term on the RHS tends to zero. For the second term, using Theorem 3.5 and Lemma 2.6,

$d_{M}((\mathcal{C}^{*}_{f_{n}},h_{f_{n}}),(\mathcal{H}_{f},h_{f}))=d_{M}((\mathcal{H}_{f_{n}},h_{f_{n}}),(\mathcal{H}_{f},h_{f}))\leq\|f_{n}-f\|_{\infty}\rightarrow 0,\quad n\to\infty.$ (4.3)

We thus have that $d_{M}((\mathcal{C},h_{f}),(\mathcal{H}_{f},h_{f}))=0$ — this being true for any axiom cluster tree $\mathcal{C}$. In the process, we have also shown in (4.3) that $\mathcal{H}_{f}$ is itself an axiom cluster tree. ∎

The remainder of this subsection is dedicated to providing sufficient conditions on a function $f$ for the existence of a sequence $(f_{n})\subseteq\mathcal{F}_{\rm int}$ that converges uniformly to $f$. In formalizing this, we will use the following terminology and results.

Definition 4.3 (Internally connected partition property).

We say that $\Omega$ has the internally connected partition property if it is connected and, for any $r>0$, there exists a locally finite partition $\{A_{i}\}$ of $\Omega$ that has the internally connected property and is such that, for all $i$, $A_{i}$ is connected with connected interior and diameter at most $r$.

We establish in Proposition B.1 that any Euclidean space (and, consequently, any finite-dimensional normed space) has the internally connected partition property. We conjecture that this extends to some Riemannian manifolds.

Proposition 4.4.

Suppose $(\Omega,d)$ is a metric space in which all closed and bounded sets are compact (a property sometimes called the Heine–Borel property), and which has the internally connected partition property. Let $f:\Omega\rightarrow[0,\infty)$ be continuous with all upper level sets bounded, and such that the upper $\lambda$-level set is connected when $\lambda>0$ is small enough. Then there is a sequence $(f_{n})\subseteq\mathcal{F}_{\rm int}$ that converges uniformly to $f$.

Proof.

It is enough to show that, for any $\eta>0$, there is a function in $\mathcal{F}_{\rm int}$ within $\eta$ of $f$ in sup-norm. Therefore, fix $\eta>0$, and take it small enough that the upper $\eta$-level set, $K=\{x:f(x)\geq\eta\}$, is connected. Consider

$K_{1}=\big\{y:d(y,x)\leq 1,\text{ for some }x\in K\big\}.$ (4.4)

In particular, $K_{1}$ is compact, and since $f$ is continuous on $K_{1}$, it is uniformly so, and therefore there exists $0<\epsilon<1$ such that, if $x,y\in K_{1}$ are such that $d(x,y)\leq\epsilon$, then $|f(x)-f(y)|\leq\eta$.

Because $\Omega$ has the internally connected partition property, it admits a locally finite partition $\{A_{i}\}$ with the internally connected property and such that, for all $i$, $A_{i}$ has connected interior and diameter at most $\epsilon$. Let

$I=\{i:A_{i}\cap K\neq\emptyset\},$

and note that $I$ is finite and that $K\subseteq\bigcup_{i\in I}A_{i}\subseteq K_{1}$. For $i\in I$, let $\lambda_{i}=\sup_{x\in A_{i}}f(x)$. Because $A_{i}\cap K\neq\emptyset$, we have $\lambda_{i}\geq\eta$. Finally, we define the piecewise constant function $g=\sum_{i\in I}\lambda_{i}\mathbb{I}_{A_{i}}$. We claim that $g\in\mathcal{F}_{\rm int}$. Since $\{A_{i}:i\in I\}$ inherits the internally connected property from $\{A_{i}\}$, all we need to check is that $\bigcup_{i\in I}\overline{A_{i}}$ is connected. To see this, first note that it is enough that $\bigcup_{i\in I}A_{i}$ be connected (since the closure of a connected set is connected). Suppose for contradiction that $\bigcup_{i\in I}A_{i}$ is disconnected, so that we can write it as the disjoint union of $\bigcup_{i\in I_{1}}A_{i}$ and $\bigcup_{i\in I_{2}}A_{i}$, where $I_{1}$ and $I_{2}$ are non-empty disjoint subsets of $I$. Because $K\subseteq\bigcup_{i\in I}A_{i}$, it follows that $K$ is the disjoint union of $K\cap\bigcup_{i\in I_{1}}A_{i}$ and $K\cap\bigcup_{i\in I_{2}}A_{i}$, both non-empty by construction (every $A_{i}$ with $i\in I$ intersects $K$), so that $K$ is not connected — a contradiction.

We now show that $\|f-g\|_{\infty}\leq\eta$. For $x\notin\bigcup_{i\in I}A_{i}$, $g(x)=0$, and since $x\notin K$, $f(x)<\eta$, so that $|f(x)-g(x)|\leq\eta$. For $x\in A_{i}$, for some $i\in I$, $g(x)=f(y)$ for some $y\in\overline{A_{i}}$, and because $x,y\in K_{1}$ and $d(x,y)\leq\epsilon$, we have $|f(y)-f(x)|\leq\eta$. ∎
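As an illustration of this construction, here is a hedged one-dimensional sketch (ours, with the sup over each cell approximated on a fine grid): partition an interval containing the truncated support into cells of diameter $\epsilon$, keep the cells meeting $K=\{f\geq\eta\}$, and assign each the level $\lambda_{i}=\sup_{A_{i}}f$.

```python
import numpy as np

def piecewise_constant_approx(f, lo, hi, eps, eta):
    """Cells (a, b) of width eps meeting the upper eta-level set,
    with levels lam_i approximating sup_{A_i} f on a fine grid."""
    edges = np.arange(lo, hi + eps, eps)
    cells, levels = [], []
    for a, b in zip(edges[:-1], edges[1:]):
        lam = f(np.linspace(a, b, 50)).max()  # stand-in for sup over the cell
        if lam >= eta:                        # the cell meets K = {f >= eta}
            cells.append((a, b))
            levels.append(lam)
    return cells, levels

f = lambda x: np.exp(-x ** 2)                 # a smooth unimodal bump
cells, levels = piecewise_constant_approx(f, -3.0, 3.0, eps=0.1, eta=0.05)
print(len(cells), round(max(levels), 3))      # kept cells and the top level
```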

4.2 Functions with Disconnected Support

So far, we have focused our attention on densities whose support has connected interior. However, there is no real difficulty in extending our approach to more general densities. Indeed, given a function whose support has disconnected interior, our approach can be used to define a hierarchical clustering of each connected component of $\{f>0\}$.

In more detail, let $f$ be a function of the form

$f=\sum_{j=1}^{N}f_{j},$ (4.5)

where $\text{int}(\text{supp}(f_{j}))\cap\text{int}(\text{supp}(f_{k}))=\emptyset$ when $j\neq k$. First, suppose that each $f_{j}\in\mathcal{F}$. If we apply the axioms of Section 3, we obtain that $C$ is a cluster for $f$ if and only if it is a cluster for one of the $f_{j}$, and consequently that the finest axiom cluster tree for $f$ is simply the union of the finest axiom cluster trees for the $f_{j}$, i.e.,

$\mathcal{C}^{*}_{f}=\bigcup_{j=1}^{N}\mathcal{C}^{*}_{f_{j}}.$

If $f$ is continuous, that is, if each $f_{j}$ in (4.5) is continuous, we may proceed exactly as in Section 4.1 and, based on the facts that $\mathcal{H}_{f}=\bigcup_{j}\mathcal{H}_{f_{j}}$, that $h_{f}(C)=h_{f_{j}}(C)$ when $C\in\mathcal{C}_{f_{j}}$, and that

$d_{M}((\mathcal{C},h_{f}),(\mathcal{C}^{\prime},h_{f}))=\max_{j=1,\dots,N}d_{M}((\mathcal{C}_{j},h_{f_{j}}),(\mathcal{C}^{\prime}_{j},h_{f_{j}})),$

for any two axiom cluster trees for $f$, $\mathcal{C}=\bigcup_{j}\mathcal{C}_{j}$ and $\mathcal{C}^{\prime}=\bigcup_{j}\mathcal{C}^{\prime}_{j}$ (all cluster trees for $f$ are of that form), we find that Theorem 4.2 applies verbatim.

This is as far as our approach goes. The end result is Hartigan’s cluster tree, with the same caveats that come from using the merge distortion metric, as detailed in Section A. In particular, instead of a tree, we have in general a forest with $N$ trees, one for each $f_{j}$. We find this end result natural, but if it is desired to further group regions (see Figure 4.1 for an illustration), one possibility is to apply a form of agglomerative hierarchical clustering to the ‘clusters’ $\text{supp}(f_{1}),\dots,\text{supp}(f_{N})$. (In our definition, these are not clusters of $\mathcal{H}_{f}$, but this is immaterial.) Doing this presents the usual question of which clustering procedure to use, but given what we discuss in Section 5.2.1, single-linkage clustering would be a very natural choice.

Figure 4.1: An example of a density whose support has disconnected interior and which appears to exhibit some clustering structure beyond that happening inside each of its eight connected components.

5 Discussion

5.1 Extensions

We speculate that our axiomatic definition of hierarchical clustering can be extended beyond continuous functions (Section 4) to piecewise continuous functions with connected support, by the same process of taking a limit of sequences in $\mathcal{F}_{\rm int}$ that uniformly approximate the function of interest $f$.

The natural approach is to work within each region where $f$ is continuous, say $R$, and to consider there a partition of $R$ that would allow the definition of a piecewise constant function approximating $f$ uniformly on $R$. The main technical hurdle is the construction of such a partition with the internally connected property, as a region $R$ may not be regular enough to allow for that. Additionally, even if there is a partition with the internally connected property on each region, taken together, these partitions may not have the internally connected property. We see some possible workarounds, but their implementation may be complicated.

5.2 Practical Implications

We first examine some implications of adopting the axioms defining clusters in Section 3.

5.2.1 Algorithms

A large majority of existing approaches to clustering return clusters that, when taken to the large-sample limit, do not necessarily satisfy the proposed axioms. This is true of K-means and of all the agglomerative hierarchical clustering methods that we know of, with the partial exception of single-linkage clustering, as repeatedly pointed out by Hartigan [22, 24, 23]. Interestingly, single-linkage clustering arises out of various axiomatic discussions of flat clustering, such as [29, 5, 38, 12], and also of hierarchical clustering [27, 7].

This is despite the heavy criticism that single linkage has received in the literature for its “chaining” tendencies. Because chaining can indeed be a concern in practice, regularized variants of single-linkage clustering are often preferred. Most prominently, this includes DBSCAN [16], which has been shown to consistently estimate the Hartigan cluster tree in the merge distortion metric when the underlying density is Hölder smooth, for example [37]; see also the “robust” variant of single-linkage clustering proposed in [15], also shown to be consistent under some conditions.
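For context, here is how DBSCAN is typically run in practice; the snippet assumes scikit-learn, and eps and min_samples are the tuning parameters playing the regularization role alluded to above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(100, 2)),   # one high-density blob
               rng.normal(2.0, 0.2, size=(100, 2))])  # a well-separated one
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print(sorted(set(labels)))  # cluster labels; -1 marks points left as noise
```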

5.2.2 Clustering in High Dimensions

Wang et al. [37] derive minimax rates for the estimation of the Hartigan cluster tree, which turn out to match the corresponding minimax rates for density estimation in the $L_{\infty}$ norm under assumptions of Hölder smoothness on the density. In particular, these rates exhibit the usual behavior in that they require the sample size to grow exponentially with the dimension. This is a real limitation of adopting the definition of cluster and cluster tree that we proposed in Section 3, although the usual caveats apply: the curse of dimensionality is with respect to the intrinsic dimension if the density is in fact with respect to a measure supported on a lower-dimensional manifold [4], and assuming more structure can help circumvent the curse of dimensionality, as done for example in [9], where a mixture is fitted to the data before applying modal clustering.

5.3 Axiomatic Definition of Flat Clustering

We have already mentioned some axiomatic approaches to defining flat [5, 29, 38, 31, 35, 12] and hierarchical [27, 26, 33, 7] clustering algorithms. But beyond these efforts, defining what clusters are has proven to be challenging for decades, in particular due to the fact that the problem is at the very core of Taxonomy. In his comprehensive review of the field at the time, Cormack [13] says that “There are many intuitive ideas, often conflicting, of what constitutes a cluster, but few formal definitions.” More recent discussions include those of Estivill-Castro [17], von Luxburg et al. [36] and that of Hennig [25]. As Gan et al. [20] say in their recent book on clustering, “The clustering problem has been addressed extensively, although there is no uniform definition for data clustering and there may never be one”.

By focusing on hierarchical clustering, as others have done (e.g., [7]), we circumvented the thorny issue of defining the correct number of clusters, and proposed simple axioms defining a cluster that are ‘natural’ in our view — in fact, as we pointed out earlier, the axioms we propose are hardly novel. However, the question of an axiomatic definition of a flat clustering of a population or density support remains intriguing, and we hope to address it in future work. For now, we remark that the definition most congruent with our definition of hierarchical clustering is that of Fukunaga and Hostetler [19], which, when the density $f$ has some regularity, amounts to partitioning $\text{supp}(f)$ according to the basins of attraction of the gradient ascent flow defined by $f$. This has been argued in recent work [1, 2]. It would be interesting to see whether one can arrive at this definition of clustering by proposing a ‘natural’ set of axioms.

References

  • Arias-Castro and Qiao [2023a] E. Arias-Castro and W. Qiao. A unifying view of modal clustering. Information and Inference: A Journal of the IMA, 12(2):897–920, 2023a.
  • Arias-Castro and Qiao [2023b] E. Arias-Castro and W. Qiao. Moving up the cluster tree with the gradient flow. SIAM Journal on Mathematics of Data Science, 5(2):400–421, 2023b.
  • Armstrong [1983] M. A. Armstrong. Basic Topology. Undergraduate Texts in Mathematics. Springer New York, 1983.
  • Balakrishnan et al. [2013] S. Balakrishnan, S. Narayanan, A. Rinaldo, A. Singh, and L. Wasserman. Cluster trees on manifolds. Advances in Neural Information Processing Systems, 26, 2013.
  • Ben-David and Ackerman [2008] S. Ben-David and M. Ackerman. Measures of clustering quality: A working set of axioms for clustering. Advances in Neural Information Processing Systems, 21, 2008.
  • Bouveyron et al. [2019] C. Bouveyron, G. Celeux, T. B. Murphy, and A. E. Raftery. Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
  • Carlsson and Mémoli [2010] G. Carlsson and F. Mémoli. Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11(47):1425–1470, 2010.
  • Carmichael et al. [1968] J. W. Carmichael, J. A. George, and R. S. Julius. Finding natural clusters. Systematic Zoology, 17(2):144–150, 1968.
  • Chacón [2019] J. E. Chacón. Mixture model modal clustering. Advances in Data Analysis and Classification, 13(2):379–404, 2019.
  • Chacón [2020] J. E. Chacón. The modal age of statistics. International Statistical Review, 88(1):122–141, 2020.
  • Chaudhuri et al. [2014] K. Chaudhuri, S. Dasgupta, S. Kpotufe, and U. von Luxburg. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory, 60(12):7900–7912, 2014.
  • Cohen-Addad et al. [2018] V. Cohen-Addad, V. Kanade, and F. Mallmann-Trenn. Clustering redemption–beyond the impossibility of Kleinberg’s axioms. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Cormack [1971] R. M. Cormack. A review of classification. Journal of the Royal Statistical Society: Series A (General), 134(3):321–353, 1971.
  • Dasgupta [2016] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In ACM Symposium on Theory of Computing, pages 118–127, 2016.
  • Eldridge et al. [2015] J. Eldridge, M. Belkin, and Y. Wang. Beyond Hartigan consistency: Merge distortion metric for hierarchical clustering. In Conference on Learning Theory, volume 40, pages 588–606. PMLR, 2015.
  • Ester et al. [1996] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In International Conference on Knowledge Discovery and Data Mining, pages 226–231. AAAI Press, 1996.
  • Estivill-Castro [2002] V. Estivill-Castro. Why so many clustering algorithms: a position paper. ACM SIGKDD Explorations Newsletter, 4(1):65–75, 2002.
  • Fraley and Raftery [2002] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611–631, 2002.
  • Fukunaga and Hostetler [1975] K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.
  • Gan et al. [2021] G. Gan, C. Ma, and J. Wu. Data Clustering : Theory, Algorithms, and Applications. Society for Industrial and Applied Mathematics, 2nd edition, 2021.
  • Hartigan [1975] J. Hartigan. Clustering Algorithms. Wiley, 1975.
  • Hartigan [1977] J. Hartigan. Distribution problems in clustering. In Classification and Clustering, pages 45–71. Academic Press, 1977.
  • Hartigan [1985] J. Hartigan. Statistical theory in clustering. Journal of Classification, 2:63–76, 1985.
  • Hartigan [1981] J. A. Hartigan. Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76(374):388–394, 1981.
  • Hennig [2015] C. Hennig. What are the true clusters? Pattern Recognition Letters, 64:53–62, 2015. Philosophical Aspects of Pattern Recognition.
  • Jardine et al. [1967] C. Jardine, N. Jardine, and R. Sibson. The structure and construction of taxonomic hierarchies. Mathematical Biosciences, 1(2):173–179, 1967.
  • Jardine and Sibson [1968] N. Jardine and R. Sibson. The construction of hierarchic and non-hierarchic classifications. The Computer Journal, 11(2):177–184, 1968.
  • Kim et al. [2016] J. Kim, Y.-C. Chen, S. Balakrishnan, A. Rinaldo, and L. Wasserman. Statistical inference for cluster trees. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • Kleinberg [2002] J. Kleinberg. An impossibility theorem for clustering. Advances in Neural Information Processing Systems, 15, 2002.
  • Menardi [2016] G. Menardi. A review on modal clustering. International Statistical Review, 84(3):413–433, 2016.
  • Puzicha et al. [2000] J. Puzicha, T. Hofmann, and J. M. Buhmann. A theory of proximity based clustering: Structure detection by optimization. Pattern Recognition, 33(4):617–634, 2000.
  • Shi and Malik [2000] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
  • Sibson [1970] R. Sibson. A model for taxonomy. II. Mathematical Biosciences, 6:405–430, 1970.
  • Steinwart [2011] I. Steinwart. Adaptive density level set clustering. In Conference on Learning Theory, volume 19, pages 703–738. PMLR, 2011.
  • Strazzeri and Sánchez-García [2022] F. Strazzeri and R. J. Sánchez-García. Possibility results for graph clustering: A novel consistency axiom. Pattern Recognition, 128:108687, 2022.
  • von Luxburg et al. [2012] U. von Luxburg, R. C. Williamson, and I. Guyon. Clustering: Science or art? In ICML Workshop on Unsupervised and Transfer Learning, volume 27, pages 65–79. PMLR, 2012.
  • Wang et al. [2019] D. Wang, X. Lu, and A. Rinaldo. DBSCAN: Optimal rates for density-based cluster estimation. Journal of Machine Learning Research, 20(170):1–50, 2019.
  • Zadeh and Ben-David [2009] R. B. Zadeh and S. Ben-David. A uniqueness theorem for clustering. In Conference on Uncertainty in Artificial Intelligence, pages 639–646, 2009.

Appendix A Merge Distortion Metric

In this section, we discuss some limitations and issues of the merge distortion metric. We restrict our attention to the situation considered in [15], where the height of a tree is defined by the density itself as in (2.1). We denote the density by $f$ and the corresponding height function by $h$, and we identify a cluster tree $\mathcal{C}$ with the dendrogram $(\mathcal{C},h)$ whenever needed. We only consider cluster trees $\mathcal{C}$ made of clusters $C\in\mathcal{C}$ satisfying $h(C)>0$. Our discussion applies to non-negative functions, and throughout this section, $f$ will be non-negative.

The main issue that we want to highlight is that the merge distortion metric is only a pseudometric, and not a metric, on general cluster trees, as it is possible to have $d_{M}(\mathcal{C},\mathcal{C}^{\prime})=0$ even when $\mathcal{C}$ and $\mathcal{C}^{\prime}$ are not isomorphic. (To be clear, we take the partially ordered sets $\mathcal{C}$ and $\mathcal{C}^{\prime}$ to be isomorphic if they are order isomorphic.) Two examples of this follow.

Example A.1.

Consider $f=\frac{1}{2}\mathbb{I}_{A_{1}}+\frac{1}{3}\mathbb{I}_{A_{2}}+\frac{1}{6}\mathbb{I}_{A_{3}}$, where the $A_{i}$ are disjoint sets with unit measure. Let $\mathcal{C}=\{A_{1},A_{1}\cup A_{2},A_{1}\cup A_{2}\cup A_{3}\}$ and $\mathcal{C}^{\prime}=\{A_{1},A_{2},A_{1}\cup A_{2},A_{1}\cup A_{2}\cup A_{3}\}$. Both $\mathcal{C}$ and $\mathcal{C}^{\prime}$ are cluster trees, and it can be checked that $m_{\mathcal{C}}(x,y)=m_{\mathcal{C}^{\prime}}(x,y)$ for all $x,y$, so that $d_{M}(\mathcal{C},\mathcal{C}^{\prime})=0$. However, the two trees are clearly not isomorphic.
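This can be checked mechanically. In the following hedged sketch (ours), each region $A_{i}$ is represented by a single label and merge heights are computed directly from (2.2), with $h$ as in (2.1):

```python
from fractions import Fraction as F

f = {1: F(1, 2), 2: F(1, 3), 3: F(1, 6)}      # density value on each A_i
C1 = [{1}, {1, 2}, {1, 2, 3}]                 # the tree C of Example A.1
C2 = [{1}, {2}, {1, 2}, {1, 2, 3}]            # the tree C' of Example A.1

def merge_height(tree, x, y):
    # (2.2) with h(C) = min of f over the regions making up C, as in (2.1)
    return max(min(f[a] for a in C) for C in tree if x in C and y in C)

assert all(merge_height(C1, x, y) == merge_height(C2, x, y)
           for x in f for y in f)
print("all merge heights agree, so d_M = 0, yet the trees are not isomorphic")
```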

Example A.2.

Consider $f=\mathbb{I}_{A}$, where $A$ has unit measure. Then any collection of subsets of $A$ with a nested structure is a cluster tree for $f$, and the merge distortion distance between any pair of such cluster trees is zero.

The issue in the preceding examples arises because a cluster tree can contain nested clusters with the same cluster height. For example, in Example A.1, the addition of the cluster $A_{2}$ to $\mathcal{C}$ does not change the merge height of any two points, and hence the merge distortion distance between $\mathcal{C}$ and $\mathcal{C}^{\prime}$ is zero.

Note that neither of these examples compares Hartigan trees, and we suspect that, in the original merge distortion metric paper [15], the claim (made without proof) that a zero merge distortion distance implies that the trees are isomorphic was intended in the context of comparing Hartigan trees. This is true when comparing Hartigan trees of continuous densities on $\mathbb{R}^{d}$, as for Hartigan trees of continuous functions $f,g$,

$d_{M}(\mathcal{H}_{f},\mathcal{H}_{g})=\|f-g\|_{\infty}.$ (A.1)

This is established in [28, Lem 1]. The proof of that result can be adapted to extend it to the case where $f$ is continuous and $g$ is piecewise continuous and satisfies the following additional regularity condition: for every $x$ in its support, there exists $\delta$ small enough that $g$ is continuous on a half-ball centered at $x$ of radius $\delta$.

In view of Theorem 4.2, we are particularly interested in understanding how different a cluster tree $\mathcal{C}$ with $d_{M}(\mathcal{C},\mathcal{H}_{f})=0$ can be from $\mathcal{H}_{f}$. The following results clarify the situation. The $\lambda$-level set of $f$ is defined as

$L_{\lambda}=\{f=\lambda\}.$ (A.2)
Proposition A.3.

Let $f$ be a continuous density. Consider a collection of clusters of the form

$\mathcal{C}=\left(\mathcal{H}_{f}\setminus\{C_{i}:i\in I\}\right)\cup\{\mathcal{S}_{j}:j\in J\},$ (A.3)

where $C_{i}\in\text{cc}(U_{\lambda_{i}})$ for some $\lambda_{i}>0$ such that $\{\lambda_{i}:i\in I\}$ has empty interior; and $\mathcal{S}_{j}$ is a cluster tree of $L_{\lambda_{j}}$ for some $\lambda_{j}>0$ such that the $\{\lambda_{j}:j\in J\}$ are all distinct. Then $\mathcal{C}$ is a cluster tree for $f$ satisfying $d_{M}(\mathcal{C},\mathcal{H}_{f})=0$.

Proof.

We will use the fact that, by continuity of $f$, the supremum in (2.2) is attained, or more specifically, that if $\lambda=m_{f}(x,y)$, there is a connected component $C$ of $U_{\lambda}$ that contains $x$ and $y$. The continuity of $f$ also implies that, for any subset $C$, $h(C)=h(\overline{C})$.

We first show that any $\mathcal{C}$ defined as in (A.3) is a cluster tree. Indeed, the removal of any number of clusters preserves the nested structure. Now, consider adding $\mathcal{S}_{j}$, a cluster tree for $L_{\lambda_{j}}$ for some $\lambda_{j}>0$. We may clearly assume that $\mathcal{S}_{j}$ is a cluster tree for a connected component of $L_{\lambda_{j}}$, say $B_{j}$, which is itself contained in some $C_{j}\in\text{cc}(U_{\lambda_{j}})$, so that $S\subseteq B_{j}\subseteq C_{j}$ for any $S\in\mathcal{S}_{j}$. Take $C\in\mathcal{H}_{f}$ distinct from $C_{j}$. We show that either $S\cap C=\emptyset$ or $S\subseteq C$ for any $S\in\mathcal{S}_{j}$. Let $\lambda=h(C)$, so that $C$ is a connected component of $U_{\lambda}$. If $\lambda=\lambda_{j}$, $C_{j}$ and $C$ are disjoint. If $\lambda<\lambda_{j}$, $B_{j}$ is disjoint from $C$ unless $C$ contains $C_{j}$; if this is the case, $C$ also contains $B_{j}$, and therefore $S$. If $\lambda>\lambda_{j}$, $B_{j}\subseteq L_{\lambda_{j}}$, $C\subseteq U_{\lambda}$, and $L_{\lambda_{j}}\cap U_{\lambda}=\emptyset$. Now take $S^{\prime}\in\mathcal{S}_{k}$. We show that $S$ and $S^{\prime}$ are either disjoint or nested. This is the case if $j=k$, by the assumption that $\mathcal{S}_{j}$ is a cluster tree. For $j\neq k$, $B_{j}$ and $B_{k}$ are disjoint since, by assumption, $\lambda_{j}\neq\lambda_{k}$ in that case. (We have used the fact that two distinct clusters in $\mathcal{H}_{f}$ have disjoint boundaries.)

To go further, we use the assumption that $\Lambda:=\{\lambda_{i}:i\in I\}$ has empty interior. We want to show that $m_{\mathcal{C}}(x,y)=m_{f}(x,y)$ for any pair of points $x$ and $y$. First, consider $\mathcal{C}_{1}=\mathcal{H}_{f}\setminus\{\text{cc}(U_{\lambda_{i}}):i\in I\}$. Clearly, because the merge height is defined as a supremum, $m_{\mathcal{C}_{1}}(x,y)\leq m_{\mathcal{C}}(x,y)$. Let $\lambda=m_{f}(x,y)$, so that there is $C\in\text{cc}(U_{\lambda})$ such that $x,y\in C$. If $\lambda\neq\lambda_{i}$ for all $i\in I$, then $m_{\mathcal{C}_{1}}(x,y)\geq\lambda$. If $\lambda=\lambda_{i}$ for some $i\in I$, we reason as follows. For $t<\lambda$, let $C_{t}$ be the connected component of $U_{t}$ that contains $C$. Then $x,y\in C_{t}$ for all $t<\lambda$, and therefore $m_{\mathcal{C}_{1}}(x,y)\geq t$ for any $t<\lambda$ not in $\Lambda$. Since $\Lambda$ has empty interior, its complement is dense, and by continuity of $f$ this implies that $m_{\mathcal{C}_{1}}(x,y)\geq\lambda$. We have thus established that $m_{\mathcal{C}_{1}}(x,y)\geq\lambda=m_{f}(x,y)$, which then implies $m_{\mathcal{C}}(x,y)\geq m_{f}(x,y)$. Next, consider $\mathcal{C}_{2}=\mathcal{H}_{f}\cup\{\mathcal{S}_{j}:j\in J\}$, so that $m_{\mathcal{C}_{2}}(x,y)\geq m_{\mathcal{C}}(x,y)$. Consider $S\in\mathcal{S}_{j}$, so that $S\subseteq C_{j}$ for some $C_{j}\in\text{cc}(U_{\lambda_{j}})$. Because $h(S)\leq h(C_{j})$ and $C_{j}\in\mathcal{H}_{f}$, the merge height of $x$ and $y$ is not increased by adding $S$ to $\mathcal{H}_{f}$. Therefore, $m_{\mathcal{C}_{2}}(x,y)\leq m_{f}(x,y)$, which then implies that $m_{\mathcal{C}}(x,y)\leq m_{f}(x,y)$. ∎

It turns out that the condition (A.3) is not necessary for a cluster tree $\mathcal{C}$ to satisfy $d_M(\mathcal{C},\mathcal{H}_f)=0$, although we believe it is not far from necessary. To deal with the possible removal of clusters from $\mathcal{H}_f$, we only consider cluster trees satisfying the following regularity condition. We say that a cluster tree $\mathcal{C}$ is closed (for $h=h_f$) if it is closed under intersection and union in the sense that, for any sub-collection of nested clusters $\mathcal{S}\subseteq\mathcal{C}$, $\bigcap_{C\in\mathcal{S}}C\in\mathcal{C}$ and, if $\inf_{C\in\mathcal{S}}h(C)>0$, $\bigcup_{C\in\mathcal{S}}C\in\mathcal{C}$. (Note that this is automatic when $\mathcal{S}$ is finite, but below we will consider infinite sub-collections.)

Lemma A.4.

Suppose $\mathcal{C}$ is a closed cluster tree. Then the supremum defining the merge height in (2.2) is attained, meaning that for any $x,y$ there is $C\in\mathcal{C}$ containing $x,y$ such that $m_{\mathcal{C}}(x,y)=h(C)$.

Proof.

Fix $x,y$ and let $\lambda=m_{\mathcal{C}}(x,y)$, assumed to be strictly positive. It suffices to show that there is a cluster that contains these points with height at least $\lambda$.

By the definition in (2.2), for any integer $m\geq 1$, there is $C_m\in\mathcal{C}$ that contains $x,y$ such that $h(C_m)>\lambda(1-1/m)$. Note that $C_m$ and $C_n$ have at least $x,y$ in common, so that they must be nested. Therefore, the sub-collection $\{C_m: m\geq 1\}$ is nested and, by the fact that $\mathcal{C}$ is closed, $C=\bigcap_{m\geq 1}C_m$ is a cluster in $\mathcal{C}$. By monotonicity of $h$, $h(C)\geq h(C_m)$ for all $m$, so that $h(C)\geq\lambda$. ∎
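As an illustration of why closedness matters (our example, not from the text): take the triangular density $f$ on $[0,2]$ with peak $f(1)=1$, and consider the collection $\mathcal{C}=\{U_\lambda: 0<\lambda<1\}$, where $U_\lambda=\{f\geq\lambda\}=[\lambda,2-\lambda]$ and $h(U_\lambda)=\lambda$. This is a cluster tree, but it is not closed: the nested sub-collection $\{U_\lambda: 0<\lambda<1\}$ has intersection $\{1\}\notin\mathcal{C}$. Accordingly, $m_{\mathcal{C}}(1,1)=\sup_{0<\lambda<1}\lambda=1$ is not attained by any cluster in $\mathcal{C}$; adding $\{1\}$, which has height $h(\{1\})=1$, restores attainment, in line with Lemma A.4.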

To simplify things further, we just avoid talking about what happens within level sets. We will use the following results.

Lemma A.5.

For any $s,t>0$, the connected components of $\{f>s\}$ and those of $\{f\geq t\}$ are either disjoint or nested.

Proof.

Let $R$ be a connected component of $\{f>s\}$ and let $C$ be a connected component of $\{f\geq t\}$, and assume that they intersect, i.e., $C\cap R\neq\emptyset$. First, suppose that $s<t$. In that case, $C\subseteq\{f>s\}$ and, being connected, $C$ is contained in a unique connected component of $\{f>s\}$, which is necessarily $R$. The reasoning is similar if $s\geq t$: in that case, $R\subseteq\{f\geq t\}$ and, being connected, $R$ is contained in a unique connected component of $\{f\geq t\}$, which is necessarily $C$. ∎

Recall that a mode is simply a local maximum with strictly positive value, i.e., a point $x$ such that $f(x)>0$ and $f(x)\geq f(y)$ whenever $d(x,y)\leq r$, for some $r>0$.

Lemma A.6.

Consider a continuous function $f$ with bounded upper level sets. Then each connected component of any of its upper level sets contains at least one mode.

Proof.

Take $C\in\text{cc}(U_\lambda)$ for some $\lambda>0$. Because $f$ is continuous and $U_\lambda$ is compact, $C$ is compact, so that there is $x_0\in C$ such that $f(x_0)=\max_C f$. Because the distance function is continuous (as is well known, $|d(x,y)-d(x',y')|\leq d(x,x')+d(y,y')$ by a simple use of the triangle inequality, so that $d:\Omega\times\Omega\to\mathbb{R}$ is Lipschitz, hence continuous, when $\Omega\times\Omega$ is equipped with the product topology), and because all connected components of $U_\lambda$ are compact, we have $d(C,C'):=\min_{x\in C,\,x'\in C'}d(x,x')>0$ for every $C'\in\text{cc}(U_\lambda)$ distinct from $C$, so that there is $\eta>0$ such that $d(C,C')>\eta$ for all such $C'$. Now, consider $y$ within distance $\eta$ of $x_0$. If $y\in C$, then $f(y)\leq\max_C f=f(x_0)$; and if $y\notin C$, then $y\notin U_\lambda$, and therefore $f(y)<\lambda\leq f(x_0)$. We conclude that $x_0$ is a mode. ∎

Lemma A.7.

Consider a continuous function $f$ with bounded upper level sets and locally finitely many modes. Then $f$ satisfies the following property:

For every $\lambda>0$, if $\epsilon>0$ is small enough, each connected component of $\{f>\lambda\}$ contains exactly one connected component of $\{f>\lambda+\epsilon\}$. (A.4)
Proof.

Take any upper level set $U$. Since $U$ is bounded and $f$ has locally finitely many modes, $U$ contains only finitely many modes. And since each of its connected components contains at least one mode, by Lemma A.6, $U$ has at most as many connected components as it contains modes.

We now assume that the upper level sets all have finitely many components, and show that (A.4) holds. We do so by contradiction. Therefore, assume that (A.4) does not hold, so that there are $\lambda>0$, a connected component $R$ of $\{f>\lambda\}$, and a sequence $(\epsilon_n)$, which we can take to be decreasing and converging to zero, such that for each $n$, $R$ contains at least two connected components of $\{f>\lambda+\epsilon_n\}$. (Since $\{f>\lambda\}$ has finitely many connected components, we may indeed assume, passing to a subsequence, that the same component $R$ works for every $n$; and since $R$ contains at least one connected component of $\{f>\lambda+\epsilon\}$ once $\epsilon$ is small enough, a component witnessing the failure of (A.4) must contain at least two.) Because $R$ is a bounded region, applying the first part of the proof, we find that $R$ can only contain finitely many components of $\{f>\lambda+\epsilon_n\}$, denoted $A^n_1,\dots,A^n_{m_n}$, with $2\leq m_n\leq M$ for all $n$, where $M$ is the number of modes within $U_\lambda$. By taking a further subsequence if needed, we may assume that $m_n=m\geq 2$ for all $n$. By the usual nesting property, at every $n$, for each $i$, there is exactly one $j$ such that $A^n_i\subseteq A^{n+1}_j$, so that we may choose the indexing in such a way that $A^n_i\subseteq A^{n+1}_i$ for all $n$ and all $i$. This allows us to define $A_i=\bigcup_n A^n_i$ for $i=1,\dots,m$. Since $f$ is continuous, $\{f>\lambda+\epsilon_n\}$ is open, and therefore so are its connected components (since we assume throughout that $\Omega$ is locally connected); hence each $A^n_i$ is open, and so is each $A_i$. The $A_i$ are disjoint because $A^n_i\cap A^{n'}_{i'}=\emptyset$ unless $i=i'$. Therefore, because $R=\bigcup_i A_i$ and $m\geq 2$, $R$ must be disconnected, a contradiction. ∎
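As a quick numerical sanity check of (A.4) (our sketch, not part of the proof; the density and helper functions below are our own choices), one can approximate the connected components of upper level sets of a one-dimensional bimodal Gaussian mixture on a fine grid, where a component is simply a maximal run of grid points:

import numpy as np

def density(x):
    # Equal-weight mixture of N(-2, 1) and N(2, 1).
    phi = lambda u: np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return 0.5 * phi(x + 2) + 0.5 * phi(x - 2)

def components(mask):
    # Maximal runs of True: the grid analogue of connected components
    # of an upper level set in dimension one.
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(idx) > 1)
    return np.split(idx, breaks + 1)

x = np.linspace(-6, 6, 100001)
f = density(x)
saddle = density(0.0)  # split level, about 0.054 by symmetry
eps = 1e-3

for lam in [0.9 * saddle, 1.1 * saddle]:
    for comp in components(f > lam):
        inner = components(f[comp] > lam + eps)
        print(f"lambda={lam:.4f}: a component of the level set "
              f"contains {len(inner)} component(s) at lambda+eps")

With these choices, the script reports exactly one component at $\lambda+\epsilon$ per component at $\lambda$, both below and above the split level, consistent with (A.4).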

Proposition A.8.

Let $f$ be a continuous density. Assume that $\mathcal{C}$ is a closed cluster tree such that $d_M(\mathcal{C},\mathcal{H}_f)=0$. Then $\mathcal{C}$ contains $\mathcal{H}_f$. If, in addition, (A.4) holds, then, for every $C\in\mathcal{C}$, $\{f>h(C)\}\cap C$ is a union of connected components of $\{f>h(C)\}$.

Figure A.1: Left: The level sets of a bimodal Gaussian. The upper level set “splits” at the level containing the saddle point. (The level set and saddle point are highlighted.) Right: The highlighted cluster is $\overline{R_1}$. The addition of this cluster to $\mathcal{H}_f$ forms a valid and distinct cluster tree $\mathcal{C}_1$.
Proof.

Let $V\in\text{cc}(U_\lambda)$ for some $\lambda>0$. We may assume that $\lambda=h(V)$, since $V$ is also a connected component of $U_{h(V)}$; then, by compactness and the continuity of $f$, there is $x\in V$ such that $f(x)=\lambda$, and we fix such an $x$. Take $y\in V$. First, $m_f(x,y)=\lambda$, and since $m_{\mathcal{C}}(x,y)=m_f(x,y)$ and $\mathcal{C}$ is assumed closed, by Lemma A.4 there is $C_y\in\mathcal{C}$ containing $x,y$ such that $h(C_y)=\lambda$. Note that this implies that $C_y\subseteq V$, since $V$ is the largest connected set containing $x$ and $y$ on which $f\geq\lambda$. If $y\neq z$ are both in $V$, we have $x\in C_y\cap C_z$, so that $C_y$ and $C_z$ are nested. Therefore, the collection $\{C_y: y\in V\}$ is nested and, because $\mathcal{C}$ is closed, $C=\bigcup_{y\in V}C_y$ belongs to $\mathcal{C}$. Since $C_y\subseteq V$ for all $y$, we have $C\subseteq V$; and since $C_y$ contains $y$ for all $y$, we also have $C\supseteq V$; therefore, $C=V$, and we conclude that $V\in\mathcal{C}$.

For the second part, assume that (A.4) holds. Take $C\in\mathcal{C}$ with $\lambda=h(C)>0$. We want to show that, if $R$ is a connected component of $\{f>\lambda\}$ such that $R\cap C\neq\emptyset$, then $R\subseteq C$. For $\epsilon>0$ small enough, $R$ contains exactly one connected component of $\{f>\lambda+\epsilon\}$, which by way of Lemma A.5 implies that $R$ contains exactly one connected component of $\{f\geq\lambda+\epsilon\}$, which we denote by $V_\epsilon$. By the first part of the proposition, which we have already established, $V_\epsilon$ belongs to $\mathcal{C}$, and $\mathcal{C}$ being a cluster tree, we have either $V_\epsilon\cap C=\emptyset$ or $V_\epsilon\subseteq C$. Only the latter is possible when $\epsilon$ is small enough. Indeed, take $x\in R\cap C$, so that $f(x)>\lambda$, and let $\epsilon>0$ be small enough that $f(x)\geq\lambda+\epsilon$. The connected component of $\{f\geq\lambda+\epsilon\}$ containing $x$ is included in $R$ by Lemma A.5, so it must be $V_\epsilon$; hence $x\in V_\epsilon\cap C$, and $V_\epsilon\subseteq C$ when $\epsilon>0$ is small enough. We then use the fact that $R=\bigcup_{\epsilon>0}V_\epsilon$ to conclude that $R\subseteq C$. ∎

We remark that, when $f$ is ‘flat nowhere’ in the sense that

$\overline{\{f>\lambda\}}=\{f\geq\lambda\}$ for any $\lambda>0$, (A.5)

then, under the same conditions as in Proposition A.8, any $C\in\mathcal{C}$ is the closure of a union of connected components of $\{f>h(C)\}$. This still leaves the possibility that $\mathcal{C}\neq\mathcal{H}_f$, and this can indeed happen, unless $f$ is unimodal. To see this, suppose for simplicity that $f$ is ‘flat nowhere’ and has exactly two modes. Assuming that its support has connected interior, there is exactly one level $\lambda>0$ where the upper level set ‘splits’, in the sense that $\{f>\lambda\}$ has two connected components, say $R_1$ and $R_2$, while $\{f\geq\lambda\}$ is connected. Then, for $j\in\{1,2\}$, $\overline{R_j}$ does not belong to $\mathcal{H}_f$, and $\mathcal{C}_j=\mathcal{H}_f\cup\{\overline{R_j}\}$ is a cluster tree satisfying $d_M(\mathcal{C}_j,\mathcal{H}_f)=0$. (Note that $\mathcal{H}_f\cup\{\overline{R_1},\overline{R_2}\}$ is not a cluster tree since $\overline{R_1}$ and $\overline{R_2}$ intersect but are not nested.) The situation is illustrated in Figure A.1.
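For a concrete instance of this two-mode situation (our illustration; the numbers are easy to check): take the equal-weight mixture of two standard normal densities centered at $\pm 2$,

$f(x)=\tfrac{1}{2}\varphi(x+2)+\tfrac{1}{2}\varphi(x-2),\qquad \varphi(u)=\tfrac{1}{\sqrt{2\pi}}e^{-u^2/2}.$

By symmetry, the saddle point is at $x=0$, so the split level is $\lambda=f(0)=\varphi(2)\approx 0.054$: the set $\{f>\lambda\}$ has two connected components, $R_1\subseteq(-\infty,0)$ and $R_2\subseteq(0,\infty)$, while $\{f\geq\lambda\}$ is connected, as it also contains the point $0$ joining $\overline{R_1}$ and $\overline{R_2}$.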

Appendix B Euclidean Spaces

In this section we show that Euclidean spaces have the internally connected partition property by constructing a ‘shifted’ grid that has the required property.

Figure B.1: The existence of this shifted grid clearly shows that $\mathbb{R}^2$ has the internally connected partition property. This definition of a shifted grid can be extended to higher dimensions to show that $\mathbb{R}^d$ has the internally connected partition property for any $d\geq 2$. (This is trivially true in dimension $d=1$, where a regular grid can be used to show that $\mathbb{R}$ has the internally connected partition property.)
Proposition B.1.

Any Euclidean space has the internally connected partition property.

Proof.

Consider the Euclidean space $\mathbb{R}^d$ (equipped with its Euclidean norm). It is enough to show that there is a locally finite partition $\{A_i\}$ that has the internally connected property and is such that, for all $i$, $\text{int}(A_i)$ is connected and $\operatorname{diam}(A_i)\leq\sqrt{d}$.

Let $\mathcal{L}_1=\mathbb{Z}$, and for $d\geq 2$, define

$\mathcal{L}_d=\big\{(x_1,\dots,x_d):\ x_d\in 2\mathbb{Z} \text{ and } (x_1,\dots,x_{d-1})\in\mathcal{L}_{d-1};\ \text{or } x_d\in 2\mathbb{Z}+1 \text{ and } (x_1+\tfrac{1}{2},\dots,x_{d-1}+\tfrac{1}{2})\in\mathcal{L}_{d-1}\big\}.$ (B.1)

For $(x_1,\dots,x_d)\in\mathcal{L}_d$, define the corresponding cell

$A_{(x_1,\dots,x_d)}=[x_1,x_1+1)\times\dots\times[x_d,x_d+1).$

And consider the collection of these cells

$\mathcal{A}_d=\big\{A_{(x_1,\dots,x_d)}:(x_1,\dots,x_d)\in\mathcal{L}_d\big\}.$

Each of these cells has connected interior and diameter $\sqrt{d}$. Moreover, $\mathcal{A}_d$ is a partition of the entire space $\mathbb{R}^d$, and, as a partition, it is clearly locally finite. The partition is depicted for $d=2$ in Figure B.1.
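As an illustrative sketch of the recursion (ours; the helper cell_anchor is hypothetical and not from the text), the following Python function computes, for any point of $\mathbb{R}^d$, the anchor $(x_1,\dots,x_d)\in\mathcal{L}_d$ of the unique cell containing it: layers along the last axis have unit height, and on odd layers the $(d-1)$-dimensional pattern is shifted by $-1/2$ in every coordinate, per (B.1).

import math

def cell_anchor(p):
    # Anchor in L_d of the cell of the shifted grid containing the point p.
    d = len(p)
    if d == 1:
        return (math.floor(p[0]),)  # L_1 = Z: cells are [k, k+1)
    xd = math.floor(p[-1])  # layers along the last axis have unit height
    if xd % 2 == 0:
        # Even layer: the first d-1 coordinates follow L_{d-1}.
        head = cell_anchor(p[:-1])
    else:
        # Odd layer: the pattern is L_{d-1} shifted by -1/2 per coordinate,
        # so shift the point by +1/2, locate it, and shift the anchor back.
        shifted = cell_anchor(tuple(q + 0.5 for q in p[:-1]))
        head = tuple(y - 0.5 for y in shifted)
    return head + (xd,)

print(cell_anchor((0.3, 0.4)))  # (0, 0): the cell [0, 1) x [0, 1)
print(cell_anchor((0.3, 1.4)))  # (-0.5, 1): the cell [-0.5, 0.5) x [1, 2)

The two printed anchors reflect the brick-like offset between consecutive layers visible in Figure B.1.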

We now prove that $\mathcal{A}_d$ has the internally connected property, proceeding by induction on $d$. For $d=1$, this is clear. For $d\geq 2$, assume that $\mathcal{A}_{d-1}$ has the internally connected property. Consider $(x_1,\dots,x_d)$ and $(y_1,\dots,y_d)$, both in $\mathcal{L}_d$, such that $\overline{A_{(x_1,\dots,x_d)}}\cup\overline{A_{(y_1,\dots,y_d)}}$ is connected. We want to show that $\text{int}\big(\overline{A_{(x_1,\dots,x_d)}}\cup\overline{A_{(y_1,\dots,y_d)}}\big)$ is connected too. By induction, $\{A_{(z_1,\dots,z_d)}: z_d=x_d\}$ has the internally connected property, so that it is enough to consider a situation where $y_d\neq x_d$. Suppose, for example, that $y_d>x_d$. In that case, the fact that $\overline{A_{(x_1,\dots,x_d)}}\cup\overline{A_{(y_1,\dots,y_d)}}$ is connected implies that $y_d=x_d+1$ and $y_i=x_i\pm\tfrac{1}{2}$ for $1\leq i\leq d-1$. Further,

$\text{int}\big(\overline{A_{(x_1,\dots,x_d)}}\cup\overline{A_{(y_1,\dots,y_d)}}\big)=\text{int}\big(A_{(x_1,\dots,x_d)}\big)\cup\text{int}\big(A_{(y_1,\dots,y_d)}\big)\cup C,$

where

$C=\big\{(z_1,\dots,z_{d-1},x_d+1):\ x_i+\tfrac{1}{4}+\tfrac{1}{4}\operatorname{sign}(y_i-x_i)<z_i<x_i+\tfrac{3}{4}+\tfrac{1}{4}\operatorname{sign}(y_i-x_i)\ \text{for}\ 1\leq i\leq d-1\big\}$

is the relatively open common face of the two cells at height $x_d+1$. (The inequalities are strict because points on the relative boundary of this face are not interior to the union.)

Since every point of $C$ lies in $\partial A_{(x_1,\dots,x_d)}\cap\partial A_{(y_1,\dots,y_d)}$, and hence in the closure of each of the two interiors, both $\text{int}(A_{(x_1,\dots,x_d)})\cup C$ and $\text{int}(A_{(y_1,\dots,y_d)})\cup C$ are connected; as they share $C$, the union above is connected. ∎