
University of Oxford

Flattening Multiparameter Hierarchical Clustering Functors

Dan Shiebler
Abstract

We bring together topological data analysis, applied category theory, and machine learning to study multiparameter hierarchical clustering. We begin by introducing a procedure for flattening multiparameter hierarchical clusterings. We demonstrate that this procedure is a functor from a category of multiparameter hierarchical partitions to a category of binary integer programs. We also include empirical results demonstrating its effectiveness. Next, we introduce a Bayesian update algorithm for learning clustering parameters from data. We demonstrate that the composition of this algorithm with our flattening procedure satisfies a consistency property.

1 Introduction

One of the most useful ways to process a dataset represented as a finite metric space is to cluster the dataset, or break it into groups. An important step is choosing the desired granularity of the clustering, and multiparameter hierarchical clustering algorithms accept a hyperparameter vector to control this.

Since applying a multiparameter hierarchical clustering algorithm with different hyperparameter values may produce different partitions, we can view such algorithms as mapping a finite metric space $(X, d_X)$ to a function from the hyperparameter space to the space of partitions of $X$. A flattening procedure then maps this function to a single partition.

Many popular clustering algorithms, such as HDBSCAN [7] and ToMATo [3], include a flattening step that operates on the same intuition that drives the analysis of persistence diagrams in TDA. That is, the most important clusters (homological structures) of a dataset are those which exist at multiple scales (have large differences between their birth and death times).

In this paper we will characterize and study clustering algorithms as functors, similarly to [1, 4, 12]. We will particularly focus on multiparameter hierarchical clustering algorithms with partially ordered hyperparameter spaces. This perspective allows us to guarantee that the clustering algorithms we study preserve both non-expansive maps between metric spaces and the ordering of the hyperparameter space. Our contributions are:

  • We describe an algorithm for flattening multiparameter hierarchical clusterings, which we demonstrate is a functorial map from a category of multiparameter hierarchical partitions to a category of binary integer programs.

  • We introduce a Bayesian update algorithm for learning a distribution over clustering hyperparameters from data.

  • We prove that the composition of the Bayesian update algorithm and the flattening procedure is consistent.

2 Multiparameter Hierarchical Clustering

In this work we will define flat clustering algorithms to map a finite metric space $(X, d_X)$ to a partition of $X$. We will primarily work with the following categories:

Definition 1

In the category $\mathbf{Met}$, objects are finite metric spaces $(X, d_X)$ and morphisms are non-expansive maps, or functions $f : X \rightarrow Y$ such that $d_Y(f(x_1), f(x_2)) \leq d_X(x_1, x_2)$.

Definition 2

In the category $\mathbf{Part}$, objects are tuples $(X, \mathcal{P}_X)$ where $\mathcal{P}_X$ is a partition of the set $X$. Morphisms in $\mathbf{Part}$ are functions $f : X \rightarrow Y$ such that if $S \in \mathcal{P}_X$ then $\exists S' \in \mathcal{P}_Y,\ f(S) \subseteq S'$.

We will also work in the subcategories $\mathbf{Met}_{bij}, \mathbf{Part}_{bij}$ of $\mathbf{Met}, \mathbf{Part}$ respectively, in which morphisms are further restricted to be bijective.
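For intuition, the morphism condition in Definition 2 says that $f$ must map every block of $\mathcal{P}_X$ into a single block of $\mathcal{P}_Y$. The following is a minimal Python sketch of this check (ours, not the paper's; the name is_part_morphism is our own):

    def is_part_morphism(f, partition_X, partition_Y):
        """Check the Part morphism condition: for every block S of the
        source partition, f(S) must land inside some block of the target."""
        for S in partition_X:
            image = {f(x) for x in S}
            if not any(image <= S_prime for S_prime in partition_Y):
                return False
        return True

    # Example: f sends blocks of P_X into blocks of P_Y, so it is a morphism.
    f = {0: 'a', 1: 'a', 2: 'b', 3: 'b'}.get
    P_X = [{0}, {1}, {2, 3}]   # partition of X = {0, 1, 2, 3}
    P_Y = [{'a'}, {'b'}]       # partition of Y = {'a', 'b'}
    assert is_part_morphism(f, P_X, P_Y)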

Definition 3

Given a subcategory $\mathbf{D}$ of $\mathbf{Met}$, a flat clustering functor on $\mathbf{D}$ is a functor $F : \mathbf{D} \rightarrow \mathbf{Part}$ that is the identity on the underlying set $X$. In the case that $\mathbf{D}$ is unspecified we simply call $F$ a flat clustering functor.

Now recall that the set of connected components of the $\delta$-Vietoris-Rips complex of $(X, d_X)$ partitions $X$ into subsets in which every pair of points is joined by a chain of hops, each of length no greater than $\delta$. Given $a \in (0,1]$, an example of a flat clustering functor on $\mathbf{Met}$ is the $a$-single linkage functor $\mathcal{SL}(a)$, which maps a metric space to the connected components of its $-\log(a)$-Vietoris-Rips complex [12, 4, 1]. Given $a_1, a_2 \in (0,1]$, an example of a flat clustering functor on $\mathbf{Met}_{bij}$ is the robust single linkage functor $\mathcal{SL}_{\mathcal{R}}(a_1, a_2)$, which maps a metric space $(X, d_X)$ to the connected components of the $-\log(a_2)$-Vietoris-Rips complex of $(X, d^{a_1}_X)$ where:

\[ d^{a_1}_X(x_1, x_2) = \max\left(d_X(x_1, x_2),\ \mu_{X_{a_1}}(x_1),\ \mu_{X_{a_1}}(x_2)\right) \]

and $\mu_{X_{a_1}}(x_1)$ is the distance from $x_1$ to its $\lfloor a_1 |X| \rfloor$th nearest neighbor [2]. Intuitively, robust single linkage reduces the impact of dataset noise by increasing distances in sparse regions of the space. Note that robust single linkage is not a flat clustering functor on $\mathbf{Met}$ because it includes a $k$-nearest neighbor computation that is sensitive to $|X|$.
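For illustration, here is a minimal Python sketch (ours, not the paper's implementation) of both functors acting on a precomputed distance matrix D: single_linkage computes connected components at threshold $-\log(a)$, and robust_single_linkage runs the same computation on the adjusted metric $d^{a_1}_X$:

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def single_linkage(D, a):
        """Label points by the connected components of the
        (-log a)-Vietoris-Rips complex: x, x' are linked when
        D[x, x'] <= -log(a)."""
        adjacency = D <= -np.log(a)
        _, labels = connected_components(adjacency, directed=False)
        return labels

    def robust_single_linkage(D, a1, a2):
        """Single linkage on the adjusted metric d_X^{a1}, which inflates
        distances in sparse regions via each point's floor(a1 * |X|)-th
        nearest-neighbor radius."""
        n = D.shape[0]
        k = min(int(np.floor(a1 * n)), n - 1)
        mu = np.sort(D, axis=1)[:, k]  # column 0 is the self-distance
        D_adjusted = np.maximum(D, np.maximum(mu[:, None], mu[None, :]))
        return single_linkage(D_adjusted, a2)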

Like single linkage and robust single linkage, many flat clustering algorithms are configured by a hyperparameter vector that governs their behavior. In the case that this hyperparameter vector is an element of a partial order $O$, we can represent the output of such an algorithm with a functor $O \rightarrow \mathbf{Part}$.

Recall that $\mathbf{Part}^O$ is the category of functors from $O$ to $\mathbf{Part}$ and natural transformations between them. We will write $\mathbf{Part}^{\overline{O}}$ to represent the subcategory of such functors that commute with the forgetful functor $U : \mathbf{Part} \rightarrow \mathbf{Set}$. Given $F : O \rightarrow \mathbf{Part}$ in $\mathbf{Part}^{\overline{O}}$, we will call the image of $U \circ F$ the underlying set of $F$. Note that there also exists a forgetful functor $\mathbf{Part}^{\overline{O}} \rightarrow \mathbf{Set}$ that maps $F : O \rightarrow \mathbf{Part}$ to its underlying set, and that any natural transformation in $\mathbf{Part}^{\overline{O}}$ between the functors $F_X : O \rightarrow \mathbf{Part}$ and $F_Y : O \rightarrow \mathbf{Part}$ with underlying sets $X$ and $Y$ is fully specified by a function $f : X \rightarrow Y$.

Definition 4

Given a partial order $O$ and a subcategory $\mathbf{D}$ of $\mathbf{Met}$, an $O$-clustering functor on $\mathbf{D}$ is a functor $H : \mathbf{D} \rightarrow \mathbf{Part}^{\overline{O}}$ that commutes with the forgetful functors from $\mathbf{D}$ and $\mathbf{Part}^{\overline{O}}$ into $\mathbf{Set}$. In the case that $\mathbf{D}$ is unspecified we simply call $H$ an $O$-clustering functor.

For example, single linkage $\mathcal{SL} : \mathbf{Met} \rightarrow \mathbf{Part}^{\overline{(0,1]^{op}}}$ is a $(0,1]^{op}$-clustering functor on $\mathbf{Met}$, and robust single linkage $\mathcal{SL}_{\mathcal{R}} : \mathbf{Met}_{bij} \rightarrow \mathbf{Part}_{bij}^{\overline{(0,1]^{op} \times (0,1]^{op}}}$ is a $(0,1]^{op} \times (0,1]^{op}$-clustering functor on $\mathbf{Met}_{bij}$.

Definition 5

Given the functor $F_X \in \mathbf{Part}^{\overline{O}}$ with underlying set $X$, its partition collection is the set $S_X$ of all subsets $S \subseteq X$ such that there exists some $a \in O$ where $S \in F_X(a)$.

We will write the elements of $S_X$ (subsets of $X$) with the notation $S_X = \{s_{1_X}, s_{2_X}, \cdots, s_{n_X}\}$.

In practice it is often convenient to “flatten” a functor $F_X \in \mathbf{Part}^{\overline{O}}$ to a single partition of $X$ by selecting a non-overlapping collection of sets from its partition collection $S_X$. Since we will express this selection problem as a binary integer program, we will work in the following category:

Definition 6

The objects in $\mathbf{BIP}$ are tuples $(n, m, c, A, B, u)$ where $n, m \in \mathbb{N}$, $c$ is an $m$-element real-valued vector, $u$ is an $n$-element real-valued vector, $A$ is an $n \times m$ real-valued matrix, and $B$ is an $n \times m$ $\{0,1\}$-valued matrix. Intuitively, the tuple $(n, m, c, A, B, u)$ represents the following binary integer program: find an $m$-element $\{0,1\}$-valued vector $v$ that maximizes $c^T v$ subject to $Av + Bv \leq u$.

The morphisms between $(n, m, c, A, B, u)$ and $(n', m', c', A', B', u')$ are tuples $(P_c, P_u, P_A, P_{A^*}, P_B, P_{B^*})$ where $P_u, P_A$ are $n' \times n$ real-valued matrices, $P_{A^*}, P_c$ are $m \times m'$ real-valued matrices, $P_B$ is an $n' \times n$ $\{0,1\}$-valued matrix, and $P_{B^*}$ is an $m \times m'$ $\{0,1\}$-valued matrix such that:

\[ P_c c' = c \qquad P_u u = u' \qquad P_A A P_{A^*} = A' \qquad P_B B P_{B^*} = B' \]

where the operation $P_B B P_{B^*}$ is performed with logical matrix multiplication. Intuitively, a morphism $(P_c, P_u, P_A, P_{A^*}, P_B, P_{B^*})$ maps the binary integer program “find an $m$-element $\{0,1\}$-valued vector $v$ that maximizes $(P_c c')^T v$ subject to $Av + Bv \leq u$” to the binary integer program “find an $m'$-element $\{0,1\}$-valued vector $v$ that maximizes $c'^T v$ subject to $P_A A P_{A^*} v + P_B B P_{B^*} v \leq P_u u$”.
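For intuition, an object of $\mathbf{BIP}$ can be handed directly to an off-the-shelf solver. The following is a minimal sketch (ours) using scipy.optimize.milp from SciPy 1.9+; since milp minimizes, we negate $c$ to maximize:

    import numpy as np
    from scipy.optimize import milp, LinearConstraint, Bounds

    def solve_bip(c, A, B, u):
        """Solve the program encoded by (n, m, c, A, B, u): maximize c^T v
        over {0,1}-valued vectors v subject to (A + B) v <= u."""
        m = len(c)
        result = milp(
            c=-np.asarray(c, dtype=float),  # milp minimizes, so negate c
            constraints=LinearConstraint(A + B, ub=u),
            integrality=np.ones(m),         # every variable is integral...
            bounds=Bounds(0, 1),            # ...and constrained to {0, 1}
        )
        return np.round(result.x).astype(int)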

When we construct a binary integer program from an object $F_X$ in $\mathbf{Part}^{\overline{O}}$ with underlying set $X$, we weight the elements of its partition collection $S_X$ according to a model of the importance of different regions of $O$. In this work we will only consider $O$ that are Borel measurable, so we can represent this model with a probability measure $\mu$ over $O$. This probabilistic interpretation will be useful in Section 2.2 when we update this model with labeled data. We can then view the flattening algorithm as choosing a non-overlapping subset $\mathcal{P}_X \subseteq S_X$ (that is, no element of $X$ belongs to more than a single set in $\mathcal{P}_X$) that maximizes the expectation of the function that maps $a$ to the number of $s_{i_X} \in \mathcal{P}_X$ that are also in $F_X(a)$. Formally, the algorithm maximizes $\int_{a \in O} \left|\{s_{i_X} \mid s_{i_X} \in \mathcal{P}_X,\ s_{i_X} \in F_X(a)\}\right| d\mu$. If $\mu$ is uniform this is similar to the Topologically Motivated HDBSCAN described in [7].

Proposition 1

Given a probability measure $\mu$ over $O$, there exists a functor $Flatten_\mu : \mathbf{Part}_{bij}^{\overline{O}} \rightarrow \mathbf{BIP}$ that maps $F_X : O \rightarrow \mathbf{Part}_{bij}$ with partition collection $S_X$ to a tuple $(|S_X|, |S_X|, c, A, B, u)$ such that any solution to the problem “find an $m$-element $\{0,1\}$-valued vector $v$ that maximizes $c^T v$ subject to $Av + Bv \leq u$” specifies a non-overlapping subset $\mathcal{P}_X \subseteq S_X$ that maximizes $\int_{a \in O} \left|\{s_{i_X} \mid s_{i_X} \in \mathcal{P}_X,\ s_{i_X} \in F_X(a)\}\right| d\mu$.

Proof

Given a probability measure $\mu$ over $O$, $Flatten_\mu : \mathbf{Part}_{bij}^{\overline{O}} \rightarrow \mathbf{BIP}$ maps the functor $F_X : O \rightarrow \mathbf{Part}_{bij}$ with partition collection $S_X$ to the binary integer program $(|S_X|, |S_X|, c, A, B, u)$, where $c, u$ are $|S_X|$-element real-valued vectors and $A, B$ are respectively real-valued and $\{0,1\}$-valued $|S_X| \times |S_X|$ matrices where:

\[ u_i = |S_X| \qquad c_i = \int_{\{a \mid s_{i_X} \in F_X(a)\}} d\mu \]
\[ A_{i,j} = \begin{cases} |S_X| - 1 & i = j \\ 0 & \text{else} \end{cases} \qquad B_{i,j} = \begin{cases} 1 & s_{i_X} \cap s_{j_X} \neq \varnothing \\ 0 & \text{else} \end{cases} \]

A natural transformation between $F_X, F_Y : O \rightarrow \mathbf{Part}_{bij}$ with underlying sets $X, Y$ and partition collections $S_X, S_Y$ that is specified by the surjective function $f : X \rightarrow Y$ is sent to the tuple $(P_c, P_u, P_A, P^T_A, P_B, P^T_B)$, where $P_c$ is an $|S_X| \times |S_Y|$ real-valued matrix, $P_A, P_u$ are $|S_Y| \times |S_X|$ real-valued matrices, and $P_B$ is an $|S_Y| \times |S_X|$ $\{0,1\}$-valued matrix such that:

\[ P_{c_{i,j}} = \begin{cases} \frac{\int_{\{a \mid s_{i_X} \in F_X(a),\ s_{j_Y} \in F_Y(a)\}} d\mu}{\int_{\{a \mid s_{j_Y} \in F_Y(a)\}} d\mu} & f(s_{i_X}) \subseteq s_{j_Y} \\ 0 & \text{else} \end{cases} \qquad P_{u_{j,i}} = \begin{cases} \frac{|S_Y|}{|S_X|} & i = j \\ 0 & \text{else} \end{cases} \]
\[ P_{A_{j,i}} = \begin{cases} \sqrt{\frac{|S_Y| - 1}{|S_X| - 1}} & i = j \\ 0 & \text{else} \end{cases} \qquad P_{B_{j,i}} = \begin{cases} 1 & f(s_{i_X}) \subseteq s_{j_Y} \\ 0 & \text{else} \end{cases} \]

First we will show that any feasible solution to the integer program $Flatten_\mu F_X$ corresponds to a selection of elements from $S_X$ with no overlaps. If $s_{i_X} \cap s_{j_X} \neq \varnothing$ for $i \neq j$, then the $i$th row of $A + B$ will have $|S_X|$ in position $i$ and $1$ in position $j$. This implies that if $v_i = 1$ then $(A + B)_i^T v \leq |S_X|$ if and only if $v_j = 0$. Note also that by the definition of binary integer programming this is the selection of elements that maximizes:

\[ \sum_{s_{i_X} \in \mathcal{P}_X} c_i = \sum_{s_{i_X} \in \mathcal{P}_X} \int_{\{a \mid s_{i_X} \in F_X(a)\}} d\mu = \int_{a \in O} \left|\{s_{i_X} \mid s_{i_X} \in \mathcal{P}_X,\ s_{i_X} \in F_X(a)\}\right| d\mu \]

Next, we will show that $Flatten_\mu$ is a functor. Consider $F_X, F_Y : O \rightarrow \mathbf{Part}_{bij}$ with underlying sets $X, Y$ and partition collections $S_X, S_Y$, and suppose:

\[ Flatten_\mu F_X = (|S_X|, |S_X|, c_X, A_X, B_X, u_X) \qquad Flatten_\mu F_Y = (|S_Y|, |S_Y|, c_Y, A_Y, B_Y, u_Y) \]

Consider also a natural transformation specified by the function $f : X \rightarrow Y$ and define the image of $Flatten_\mu$ on this natural transformation to be $(P_c, P_u, P_A, P^T_A, P_B, P^T_B)$. We first need to show that:

\[ P_c c_Y = c_X \qquad P_u u_X = u_Y \qquad P_A A_X P^T_A = A_Y \qquad P_B B_X P^T_B = B_Y \]

where $P_B B_X P^T_B = B_Y$ is performed with logical matrix multiplication. In order to see that $P_c c_Y = c_X$, note that:

\[ (P_c c_Y)_i = \sum_{\{j \mid f(s_{i_X}) \subseteq s_{j_Y}\}} \left(\int_{\{a \mid s_{j_Y} \in F_Y(a)\}} d\mu\right) \left(\frac{\int_{\{a \mid s_{i_X} \in F_X(a),\ s_{j_Y} \in F_Y(a)\}} d\mu}{\int_{\{a \mid s_{j_Y} \in F_Y(a)\}} d\mu}\right) = \sum_{\{j \mid f(s_{i_X}) \subseteq s_{j_Y}\}} \int_{\{a \mid s_{i_X} \in F_X(a),\ s_{j_Y} \in F_Y(a)\}} d\mu = \int_{\{a \mid s_{i_X} \in F_X(a)\}} d\mu = c_{X_i} \]

Next, to see that $P_u u_X = u_Y$, note that $(P_u u_X)_j = \frac{|S_Y|}{|S_X|} |S_X| = |S_Y| = u_{Y_j}$. Next, to see that $P_A A_X P^T_A = A_Y$, recall that $P_A$ is an $|S_Y| \times |S_X|$ matrix and note that since $f$ is surjective it must be that $|S_Y| \leq |S_X|$. Therefore both the left $|S_Y| \times |S_Y|$ submatrix of $P_A$ and the top $|S_Y| \times |S_Y|$ submatrix of $P_A^T$ are diagonal, so the product $P_A A_X P^T_A$ is a diagonal $|S_Y| \times |S_Y|$ matrix with:

\[ (P_A A_X P^T_A)_{jj} = P_{A_{jj}} A_{X_{jj}} P^T_{A_{jj}} = \frac{|S_Y| - 1}{|S_X| - 1}\,(|S_X| - 1) = |S_Y| - 1 = A_{Y_{jj}} \]

Next, to see that $P_B B_X P^T_B = B_Y$, note first that $P_B B_X$ is a $\{0,1\}$-valued $|S_Y| \times |S_X|$ matrix where:

\[ (P_B B_X)_{ji} = \exists_{k = 1 \ldots |S_X|}\ P_{B_{j,k}} \wedge B_{X_{k,i}} = \exists\, s_{k_X} \in S_X,\ f(s_{k_X}) \subseteq s_{j_Y} \wedge s_{k_X} \cap s_{i_X} \neq \varnothing \]

and therefore that:

\begin{align*}
(P_B B_X P^T_B)_{jj'} &= \exists\, s_{l_X} \in S_X,\ \left(\exists\, s_{k_X} \in S_X,\ f(s_{k_X}) \subseteq s_{j_Y} \wedge s_{k_X} \cap s_{l_X} \neq \varnothing\right) \wedge \left(f(s_{l_X}) \subseteq s_{j'_Y}\right) \\
&= \exists\, s_{l_X}, s_{k_X} \in S_X,\ f(s_{k_X}) \subseteq s_{j_Y} \wedge f(s_{l_X}) \subseteq s_{j'_Y} \wedge s_{k_X} \cap s_{l_X} \neq \varnothing \\
&= \left(s_{j_Y} \cap s_{j'_Y} \neq \varnothing\right) = B_{Y_{jj'}}
\end{align*}

Finally, note that $Flatten_\mu$ preserves the identity since $P_B = P_A = I$ when $S_X = S_Y$, and it preserves composition by the laws of matrix multiplication. ∎

For example, if $\mu$ is uniform then the connected components of the Vietoris-Rips filtration of $(X, d_X)$ that have the largest differences between their birth and death times will be a solution to $(Flatten_\mu \circ \mathcal{SL})(X, d_X)$. Note also that $Flatten_\mu$ is only functorial over $\mathbf{Part}_{bij}^{\overline{O}}$, and not all of $\mathbf{Part}^{\overline{O}}$. Intuitively, this is because $Flatten_\mu$ maps natural transformations between elements of $\mathbf{Part}_{bij}^{\overline{O}}$ to linear maps that only exist when the functions underlying these natural transformations are bijective.
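To make the construction in Proposition 1 concrete, the following is a minimal sketch (ours) that builds the tuple $(|S_X|, |S_X|, c, A, B, u)$ from a partition collection, with precomputed weights standing in for the integrals that define $c$:

    import numpy as np

    def build_flatten_bip(partition_collection, weights):
        """Build the binary integer program from Proposition 1.
        partition_collection is a list of frozensets s_i, and weights[i]
        approximates the mu-measure of {a | s_i in F_X(a)}, i.e. c_i."""
        n = len(partition_collection)
        c = np.asarray(weights, dtype=float)
        u = np.full(n, float(n))   # u_i = |S_X|
        A = (n - 1) * np.eye(n)    # A_ii = |S_X| - 1, zero elsewhere
        B = np.array([[float(bool(s_i & s_j)) for s_j in partition_collection]
                      for s_i in partition_collection])  # B_ij = 1 iff overlap
        return n, n, c, A, B, u

Combined with the solve_bip sketch above, a solution vector $v$ selects the non-overlapping collection $\{s_{i_X} \mid v_i = 1\}$.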

2.1 The Multiparameter Flattening Algorithm

Given an $O$-clustering functor $H$ and a distribution $\mu$ over $O$, we can use Monte Carlo integration and $Flatten_\mu$ to implement the following algorithm:

1: procedure MultiparameterFlattening($H, \mu, (X, d_X), n$)
2:     Initialize an empty list $L$
3:     Repeat $n$ times:
4:         Sample the hyperparameter vector $a$ according to $\mu$
5:         Add each set in $H(X, d_X)(a)$ to $L$
6:     Define $S_X$ to be the list of unique elements of $L$
7:     Define $c$ such that $c_i$ is the number of times that $s_{i_X}$ appears in $L$
8:     Set $A, B, u$ as in $Flatten_\mu$
9:     Return the solution to the binary integer program $(|S_X|, |S_X|, c, A, B, u)$

We include an example of this procedure on GitHub (https://github.com/dshieble/FunctorialHyperparameters) that builds on McInnes et al.'s HDBSCAN implementation [7]. In Table 1 we demonstrate that applying this procedure and solving the resulting binary integer program can perform better than choosing an optimal parameter value.
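The following is a minimal sketch (ours) of this procedure, reusing the build_flatten_bip and solve_bip sketches above. Here H is assumed to be any callable mapping a dataset and a sampled hyperparameter to a partition given as an iterable of sets, and sample_mu is assumed to be a callable that draws a hyperparameter vector from $\mu$:

    from collections import Counter

    def multiparameter_flattening(H, sample_mu, X, n):
        """Monte Carlo flattening: sample hyperparameters from mu, tally
        how often each cluster appears, then solve the Proposition 1 BIP."""
        tally = Counter()
        for _ in range(n):
            a = sample_mu()                # draw a ~ mu
            for s in H(X, a):              # H(X, a) is a partition of X
                tally[frozenset(s)] += 1
        S_X = list(tally)                  # unique clusters observed
        counts = [tally[s] for s in S_X]   # Monte Carlo estimate of c
        _, _, c, A, B, u = build_flatten_bip(S_X, counts)
        v = solve_bip(c, A, B, u)
        return [s for s, v_i in zip(S_X, v) if v_i == 1]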

Algorithm | Adjusted Rand Score on Fashion MNIST Dataset | Adjusted Rand Score on 20 Newsgroups Dataset
HDBSCAN with optimal distance scaling parameter $\alpha$ | 0.217 | 0.181
HDBSCAN with flattening over distance scaling parameter $\alpha$ | 0.262 | 0.231
Table 1: We compare the performance of applying $Flatten_\mu$ to HDBSCAN with simply running HDBSCAN with the value of the distance scaling parameter $\alpha$ that achieves the best Adjusted Rand Score. We evaluate over the Fashion MNIST [13] and 20 Newsgroups [6] datasets using the scikit-learn implementation of the Adjusted Rand Score [9]. We see that the $Flatten_\mu$ procedure performs consistently better, which suggests that it may be a good option for unsupervised learning applications such as data visualization or pre-processing.
Figure 1: If we define $H = \mathcal{SL}$ and apply Equation 2 to learn $\mu_n$, then samples from $\mu_n$ are distance thresholds for single linkage. We see that the samples tend to be drawn from the region where the Adjusted Rand Score is highest (code on GitHub).

2.2 Multiparameter Flattening with Supervision

One of the most important components of the Multiparameter Flattening Algorithm is the choice of distribution $\mu$ over $O$. For example, if $O = (0,1]^{op}$ and $\mu$ is the Dirac distribution at $a$, then $\mathcal{SL}(X, d_X)(a)$ will be a solution to $(Flatten_\mu \circ \mathcal{SL})(X, d_X)$.

We can leverage the importance of $\mu$ to enable our flattening procedure to learn from labeled data. Formally, given an $O$-clustering functor $H$, a metric space $(X, d_X)$, and an observed partition $\mathcal{P}_X$ of $X$ (the labels), we can use Bayes' rule to define the probability measure $\mu_{\mathcal{P}_X}$ over $O$ where $\mu_{\mathcal{P}_X}(\sigma) = \frac{\int_{a \in \sigma} \gamma_X(\mathcal{P}_X \mid a)\, d\mu}{\int_{a \in O} \gamma_X(\mathcal{P}_X \mid a)\, d\mu}$ whenever $\sigma$ is an element of the Borel algebra of $O$, and for each $a \in O$ the map $\gamma_X(\_ \mid a)$ is a probability mass function over the finite set of all partitions of $X$. There are several possible choices of $\gamma_X$. Intuitively, we want $\gamma_X(\mathcal{P}_X \mid a)$ to be large when $H(X, d_X)(a)$ and $\mathcal{P}_X$ are similar. The simplest choice would be $\gamma_X(\mathcal{P}_X \mid a) = 1$ when $H(X, d_X)(a) = \mathcal{P}_X$ and $0$ otherwise, but a more robust strategy would be to use the Rand index, which measures how well two partitions agree on each pair of distinct points [10]. That is:

\[ \gamma_X(\mathcal{P}_X \mid a) = \frac{|both(\mathcal{P}_X)| + |neither(\mathcal{P}_X)|}{\sum_{\mathcal{P}_X' \in \mathbf{P}_X} |both(\mathcal{P}_X')| + |neither(\mathcal{P}_X')|} \]  (1)

where $\mathbf{P}_X$ is the set of all partitions of $X$ and:

\[ both(\mathcal{P}_X) = \{x_i, x_j \mid \exists s_X \in \mathcal{P}_X,\ x_i, x_j \in s_X \wedge \exists s'_X \in H(X, d_X)(a),\ x_i, x_j \in s'_X\} \]
\[ neither(\mathcal{P}_X) = \{x_i, x_j \mid \not\exists s_X \in \mathcal{P}_X,\ x_i, x_j \in s_X \wedge \not\exists s'_X \in H(X, d_X)(a),\ x_i, x_j \in s'_X\} \]

Note that by definition $\sum_{\mathcal{P}_X \in \mathbf{P}_X} \gamma_X(\mathcal{P}_X \mid a) = 1$. Now suppose we have a set of tuples $\{(X_1, d_{X_1}, \mathcal{P}_{X_1}), (X_2, d_{X_2}, \mathcal{P}_{X_2}), \cdots, (X_n, d_{X_n}, \mathcal{P}_{X_n})\}$ where each $(X_i, d_{X_i})$ is a metric space and each $\mathcal{P}_{X_i}$ is a partition of $X_i$. Given an initial probability measure $\mu_0$ over $O$, we can use Bayesian updating to learn a posterior distribution over $O$ by iteratively setting:

\[ \mu_i(\sigma) = \frac{\int_{a \in \sigma} \gamma_{X_i}(\mathcal{P}_{X_i} \mid a)\, d\mu_{i-1}}{\int_{a \in O} \gamma_{X_i}(\mathcal{P}_{X_i} \mid a)\, d\mu_{i-1}} \]  (2)
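A particle-based sketch of this update (ours): we approximate $\mu_{i-1}$ with weighted hyperparameter samples and reweight them, using scikit-learn's Rand index as an unnormalized surrogate for $\gamma_{X_i}$. Here H is assumed to be a callable returning cluster labels for the points of $X_i$:

    import numpy as np
    from sklearn.metrics import rand_score

    def bayesian_update(samples, weights, H, X, d_X, labels):
        """One step of Equation 2 on a particle approximation of mu:
        multiply each sample's weight by a Rand index surrogate for
        gamma_X(P_X | a), then renormalize so the weights sum to one."""
        likelihoods = np.array([rand_score(labels, H(X, d_X, a))
                                for a in samples])
        weights = np.asarray(weights) * likelihoods
        return weights / weights.sum()

Iterating this over the labeled tuples $(X_i, d_{X_i}, \mathcal{P}_{X_i})$ yields weighted samples from the posterior, which can stand in for draws from $\mu$ in the Multiparameter Flattening Algorithm.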

In Figure 1 we show the results of this procedure. Under mild conditions the functor $Flatten_{\mu_n} \circ H$ maps $(X, d_X)$ to an optimization problem with an optimal solution that is consistent with these $n$ observations. Formally:

Proposition 2

Given $d_1, d_2 \in \mathbb{N}$, suppose we have a compact region $R \subseteq \mathbb{R}^{d_1}$ and an $O$-clustering functor $H$ where $O$ is a subset of $\mathbb{R}^{d_2}$. Define $d_R$ to be Euclidean distance and assume that:

  • $H$ is $R$-identifiable: for each $a \in O$ there exists some pair of $k$-element subsets $X_1, X_2 \subset R$ with $H(X_1, d_R)(a) \neq H(X_2, d_R)(a)$

  • $H$ is $R$-smooth: for any $k$-element $X \subset R$ and $a \in O$ there exists a neighborhood $B_a$ of $a$ where for $a' \in B_a$ we have that $H(X, d_R)(a) = H(X, d_R)(a')$

Now suppose we sample $a_* \in O$ according to the uniform distribution $\mu_0$ over $O$, and for each $i > 0$ uniformly select a $k$-element subset $X_i$ from $R$, set $(X_i, \mathcal{P}_{X_i}) = H(X_i, d_R)(a_*)$, and set $\mu_i$ as in Equation 2.

Then for any $k$-element $X \subset R$ there $\mu_0$-a.s. (almost surely) exists some $m$ such that for $n \geq m$, the partition $H(X, d_R)(a_*)$ is an optimal solution to $(Flatten_{\mu_n} \circ H)(X, d_R)$.

Proof

We will use Doob's theorem [5] to prove this. Suppose $\mathcal{P}_R$ is the set of pairs $(X, \mathcal{P}_X)$ of finite $k$-element subsets $X \subset R$ and partitions $\mathcal{P}_X$ of $X$. Define $\lambda_R$ to be the uniform distribution on the set of finite $k$-element subsets of $R$, and for each $a \in O$ define $\eta_a(\sigma) = \int_{(X, \mathcal{P}_X) \in \sigma} \gamma_X(\mathcal{P}_X \mid a)\, d\lambda_R$, where $\sigma$ is an element of the Borel algebra of $\mathcal{P}_R$. Now since $H$ is $R$-identifiable, $a \neq a' \implies \eta_a \neq \eta_{a'}$ and Theorem 2.4 in [8] holds. Therefore for any $k$-element $X \subset R$, ball $B_{a_*}$ centered at $a_*$, and $\epsilon > 0$, there $\mu_0$-a.s. (almost surely) exists some $m$ such that for $n \geq m$ we have that $\mu_n(B_{a_*}) \geq 1 - \epsilon$. Therefore, no solution to $(Flatten_{\mu_n} \circ H)(X, d_R)$ can be improved by including sets that only exist in partitions of $X$ formed from $H(X, d_R)(a')$ where $a' \not\in B_{a_*}$. Since $H$ is $R$-smooth, this implies that the partition $H(X, d_R)(a_*)$ is an optimal solution to $(Flatten_{\mu_n} \circ H)(X, d_R)$. ∎

3 Discussion and Next Steps

One issue with the $Flatten_\mu$ procedure is that it reduces to a binary integer program, which can be very slow to solve. In the case that the hyperparameters of a multiparameter hierarchical clustering algorithm form a total order, we can organize its outputs in a merge tree or dendrogram and then solve the optimization problem with a tree traversal [3, 7], as in the sketch below.
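For illustration, here is a minimal sketch (ours) of that tree traversal in the style of HDBSCAN's cluster extraction rather than any code from this paper: each merge tree node carries a stability score (the analogue of $c_i$), and a parent cluster is kept exactly when its score beats the best selection among its descendants. The MergeTreeNode class is our own scaffolding:

    from dataclasses import dataclass, field

    @dataclass
    class MergeTreeNode:
        stability: float                       # persistence-style score
        children: list = field(default_factory=list)

    def extract_by_stability(node):
        """Flatten a merge tree bottom-up: return the best achievable score
        in this subtree and the corresponding non-overlapping selection."""
        if not node.children:
            return node.stability, [node]
        child_score, child_selection = 0.0, []
        for child in node.children:
            score, selection = extract_by_stability(child)
            child_score += score
            child_selection.extend(selection)
        if node.stability >= child_score:
            return node.stability, [node]   # parent beats its descendants
        return child_score, child_selection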

However, it is not always possible to use this strategy on hierarchical clustering algorithms that accept multiple real-valued hyperparameters (e.g. when the hyperparameter space is a general subset of $\mathbb{R}^d$ with $d > 1$). In this case it is possible that there exist clusters $c \in H(a)$ and $c' \in H(a')$ that are neither disjoint nor in a containment relationship. There are some ways to get around this limitation, however. For example, Rolle and Scoccola [11] index hyperparameters by a curve $\gamma$ in $\mathbb{R}^d$, rather than all of $\mathbb{R}^d$. In this way the hyperparameter space remains a total order. In the future we plan to explore other strategies to solve the flattening optimization problem efficiently.

References

  • [1] Carlsson, G., Mémoli, F.: Persistent clustering and a theorem of J. Kleinberg. arXiv preprint arXiv:0808.2241 (2008)
  • [2] Chaudhuri, K., Dasgupta, S.: Rates of convergence for the cluster tree. In: Advances in Neural Information Processing Systems. pp. 343–351 (2010)
  • [3] Chazal, F., Guibas, L.J., Oudot, S.Y., Skraba, P.: Persistence-based clustering in Riemannian manifolds. Journal of the ACM (JACM) 60(6), 1–38 (2013)
  • [4] Culbertson, J., Guralnik, D.P., Hansen, J., Stiller, P.F.: Consistency constraints for overlapping data clustering. arXiv preprint arXiv:1608.04331 (2016)
  • [5] Doob, J.L.: Application of the theory of martingales. In: Actes du Colloque International Le Calcul des Probabilités et ses applications. pp. 23–27 (1949)
  • [6] Lang, K.: Newsweeder: Learning to filter netnews. Machine Learning Proceedings (1995)
  • [7] McInnes, L., Healy, J.: Accelerated hierarchical density clustering. arXiv preprint arXiv:1705.07321 (2017)
  • [8] Miller, J.W.: A detailed treatment of Doob’s theorem. arXiv preprint arXiv:1801.03122 (2018)
  • [9] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
  • [10] Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66(336), 846–850 (1971)
  • [11] Rolle, A., Scoccola, L.: Stable and consistent density-based clustering. arXiv preprint arXiv:2005.09048 (2020)
  • [12] Shiebler, D.: Functorial clustering via simplicial complexes. NeurIPS Workshop on Topological Data Analysis in ML (2020)
  • [13] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)