
Selective Inference for Hierarchical Clustering

Lucy L. Gao† (corresponding author: [email protected]), Jacob Bien∘, and Daniela Witten‡

† Department of Statistics, University of British Columbia
∘ Department of Data Sciences and Operations, University of Southern California
‡ Departments of Statistics and Biostatistics, University of Washington
Abstract

Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.

Keywords: post-selection inference, hypothesis testing, difference in means, type I error

1 Introduction

Testing for a difference in means between groups is fundamental to answering research questions across virtually every scientific area. Classical tests control the type I error rate when the groups are defined a priori. However, it is increasingly common for researchers to instead define the groups via a clustering algorithm. In the context of single-cell RNA-sequencing data, researchers often cluster the cells to identify putative cell types, then test for a difference in means between the putative cell types in that same data set; see e.g. Hwang et al. (2018). Unfortunately, available tests do not properly account for this double use of the data, which invalidates the resulting inference. One example of this problem is the FindMarkers function in the popular and highly-cited R package Seurat (Satija et al. 2015, Butler et al. 2018, Stuart et al. 2019, Hao et al. 2021). Many recent papers have described the issues associated with using the data for both clustering and downstream testing, without proposing suitable solutions (Luecken & Theis 2019, Lähnemann et al. 2020, Deconinck et al. 2021). In fact, testing for a difference in means between a pair of estimated clusters while controlling the type I error rate has even been described as one of eleven “grand challenges” in the entire field of single-cell data science (Lähnemann et al. 2020). Similar issues arise in the field of neuroscience (Kriegeskorte et al. 2009).

In this paper, we develop a valid test for a difference in means between two clusters estimated from the data. We consider the following model for $n$ observations of $q$ features:

${\bf X}\sim\mathcal{MN}_{n\times q}(\bm{\mu},{\bf I}_{n},\sigma^{2}{\bf I}_{q}),$   (1)

where $\bm{\mu}\in\mathbb{R}^{n\times q}$, with rows $\mu_{i}$, is unknown, and $\sigma^{2}>0$ is known. (We discuss the case where $\sigma^{2}$ is unknown in Section 4.3.) For $\mathcal{G}\subseteq\{1,2,\ldots,n\}$, let

$\bar{\mu}_{\mathcal{G}}\equiv\frac{1}{|\mathcal{G}|}\sum\limits_{i\in\mathcal{G}}\mu_{i}\quad\text{and}\quad\bar{X}_{\mathcal{G}}\equiv\frac{1}{|\mathcal{G}|}\sum\limits_{i\in\mathcal{G}}X_{i},$   (2)

which we refer to as the mean of $\mathcal{G}$ and the empirical mean of $\mathcal{G}$ in ${\bf X}$, respectively. Given a realization ${\bf x}\in\mathbb{R}^{n\times q}$ of ${\bf X}$, we first apply a clustering algorithm $\mathcal{C}$ to obtain $\mathcal{C}({\bf x})$, a partition of $\{1,2,\ldots,n\}$. We then use ${\bf x}$ to test, for a pair of clusters $\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x})$,

$H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}:\bar{\mu}_{\hat{\mathcal{C}}_{1}}=\bar{\mu}_{\hat{\mathcal{C}}_{2}}\quad\text{versus}\quad H_{1}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}:\bar{\mu}_{\hat{\mathcal{C}}_{1}}\neq\bar{\mu}_{\hat{\mathcal{C}}_{2}}.$   (3)

It is tempting to simply apply a Wald test of (3), with p-value given by

$\mathbb{P}_{H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}}\left(\|\bar{X}_{\hat{\mathcal{C}}_{1}}-\bar{X}_{\hat{\mathcal{C}}_{2}}\|_{2}\geq\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}\right),$   (4)

where $\|\bar{X}_{\hat{\mathcal{C}}_{1}}-\bar{X}_{\hat{\mathcal{C}}_{2}}\|_{2}\overset{H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}}{\sim}\left(\sigma\sqrt{\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}}\right)\cdot\chi_{q}$. However, since we clustered ${\bf x}$ to get $\mathcal{C}({\bf x})=\{\hat{\mathcal{C}}_{k}\}_{k=1}^{K}$, we will observe substantial differences between $\{\bar{x}_{\hat{\mathcal{C}}_{k}}\}_{k=1}^{K}$ even when there is no signal in the data, as is shown in Figure 1(a). That is, in (4), the random variable on the left-hand side of the inequality follows a scaled $\chi_{q}$ distribution, but the right-hand side of the inequality is not drawn from a scaled $\chi_{q}$ distribution, because $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$ are functions of ${\bf x}$. In short, the problem is that we used the data to select a null hypothesis to test. Since the Wald test does not account for this hypothesis selection procedure, it is extremely anti-conservative, as is shown in Figure 1(b).
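To make the pitfall concrete, the following minimal R sketch (ours, not the paper's software; all variable names are hypothetical) simulates from (1) under the global null, clusters, and computes the naive Wald p-value in (4) for two of the estimated clusters:

set.seed(1)
n <- 100; q <- 2; sigma <- 1
x  <- matrix(rnorm(n * q, sd = sigma), n, q)      # a draw from (1) with mu = 0
hc <- hclust(dist(x)^2, method = "average")       # average linkage, squared Euclidean
cl <- cutree(hc, k = 3)
C1 <- which(cl == 1); C2 <- which(cl == 2)
stat  <- sqrt(sum((colMeans(x[C1, , drop = FALSE]) -
                   colMeans(x[C2, , drop = FALSE]))^2))
scale <- sigma * sqrt(1 / length(C1) + 1 / length(C2))
p_wald <- 1 - pchisq((stat / scale)^2, df = q)    # ignores selection; anti-conservative

Repeating this over many draws yields the strongly non-uniform p-values of Figure 1(b).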

Figure 1: (a) A simulated data set from (1) with $\bm{\mu}=\bm{0}_{100\times 2}$ and $\sigma^{2}=1$. We apply average linkage hierarchical clustering to get three clusters. The empirical means (defined in (2)) of the three clusters are displayed as triangles. QQ-plots of the Uniform$(0,1)$ distribution against the p-values from (b) the Wald test in (4) and (c) our proposed test, over 2000 simulated data sets from (1) with $\bm{\mu}=\bm{0}_{100\times 2}$ and $\sigma^{2}=1$. For each simulated data set, a p-value was computed for a randomly chosen pair of estimated clusters.

At first glance, it seems that we might be able to overcome this problem via sample splitting. That is, we divide the observations into a training and a test set, cluster the observations in the training set, and then assign the test set observations to those clusters, as in Figures 2(a)–(c). Then, we apply the Wald test in (4) to the test set. Unfortunately, by assigning test observations to clusters, we have once again used the data to select a null hypothesis to test, in the sense that $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}$ in (3) is a function of the test observations. Thus, the Wald test is extremely anti-conservative (Figure 2(d)). In other words, sample splitting does not provide a valid way to test the hypothesis in (3).

Figure 2: (a) A simulated data set from (1) with $\bm{\mu}=\bm{0}_{100\times 2}$ and $\sigma^{2}=1$. (b) We cluster the training set using average linkage hierarchical clustering. (c) We assign clusters in the test set by training a 3-nearest neighbors classifier on the training set. (d) QQ-plot of the Uniform$(0,1)$ distribution against the Wald p-values in the test set, over 2000 simulated data sets for which each cluster in the test set was assigned at least one observation.

In this paper, we develop a selective inference framework to test for a difference in means after clustering. This framework exploits ideas from the recent literature on selective inference for regression and changepoint detection (Fithian et al. 2014, Loftus & Taylor 2015, Lee et al. 2016, Yang et al. 2016, Hyun et al. 2018, Jewell et al. 2019, Mehrizi & Chenouri 2021). The key idea is as follows: since we chose to test $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}:\bar{\mu}_{\hat{\mathcal{C}}_{1}}=\bar{\mu}_{\hat{\mathcal{C}}_{2}}$ because $\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x})$, we can account for this hypothesis selection procedure by defining a p-value that conditions on the event $\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf X})\}$. This yields a correctly-sized test, as seen in Figure 1(c).

A large body of work evaluates the statistical significance of a clustering by testing the goodness-of-fit of models under misspecification of the number of clusters (Chen et al. 2001, Liu et al. 2008, Chen et al. 2012, Maitra et al. 2012, Kimes et al. 2017) or by assessing the stability of estimated clusters (Suzuki & Shimodaira 2006). Most of these papers rely on bootstrap sampling or on asymptotic approximations to the null distribution. Our proposed framework avoids the need for resampling and provides exact finite-sample inference for the difference in means between a pair of estimated clusters, under the assumption that $\sigma$ in (1) is known. Chapter 3 of Campbell (2018) considers testing for a difference in means after convex clustering (Hocking et al. 2011), a relatively esoteric form of clustering. Our framework is particularly efficient when applied to hierarchical clustering, which is one of the most popular types of clustering across a number of fields. Zhang et al. (2019) propose splitting the data, clustering the training set, and applying these clusters to the test set, as illustrated in Figures 2(a)–(c). They develop a selective inference framework that yields valid p-values for a difference in the mean of a single feature between two clusters in the test set. Our framework avoids the need for sample splitting, and thereby allows inference on the set of clusters obtained from all (rather than a subset of) the data.

The rest of the paper is organized as follows. In Section 2, we develop a framework to test for a difference in means after clustering. We apply this framework to compute p-values for hierarchical clustering in Section 3. We describe extensions, simulation results, and applications to real data in Sections 4, 5, and 6. The discussion is in Section 7.

2 Selective inference for clustering

2.1 A test of no difference in means between two clusters

Let ${\bf x}\in\mathbb{R}^{n\times q}$ be an arbitrary realization from (1), and let $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$ be an arbitrary pair of clusters in $\mathcal{C}({\bf x})$. Since we chose to test $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}$ because $\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x})$, it is natural to define the p-value as a conditional version of (4),

$\mathbb{P}_{H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}}\left(\|\bar{X}_{\hat{\mathcal{C}}_{1}}-\bar{X}_{\hat{\mathcal{C}}_{2}}\|_{2}\geq\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}\,\Big|\,\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf X})\right).$   (5)

This amounts to asking, “Among all realizations of ${\bf X}$ that result in clusters $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$, what proportion have a difference in cluster means at least as large as the difference in cluster means in our observed data set, when in truth $\bar{\mu}_{\hat{\mathcal{C}}_{1}}=\bar{\mu}_{\hat{\mathcal{C}}_{2}}$?” One can show that rejecting $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}$ when (5) is below $\alpha$ controls the selective type I error rate (Fithian et al. 2014) at level $\alpha$.

Definition 1 (Selective type I error rate for clustering).

For any non-overlapping groups of observations $\mathcal{G}_{1},\mathcal{G}_{2}\subseteq\{1,2,\ldots,n\}$, let $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ denote the null hypothesis that $\bar{\mu}_{\mathcal{G}_{1}}=\bar{\mu}_{\mathcal{G}_{2}}$. We say that a test of $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ based on ${\bf X}$ controls the selective type I error rate for clustering at level $\alpha$ if

$\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\left(\text{reject }H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\text{ based on }{\bf X}\text{ at level }\alpha\,\Big|\,\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\leq\alpha,\quad\forall~0\leq\alpha\leq 1.$   (6)

That is, if $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ is true, then the conditional probability of rejecting $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ based on ${\bf X}$ at level $\alpha$, given that $\mathcal{G}_{1}$ and $\mathcal{G}_{2}$ are clusters in $\mathcal{C}({\bf X})$, is bounded by $\alpha$.

However, (5) cannot be calculated, since the conditional distribution of $\|\bar{X}_{\hat{\mathcal{C}}_{1}}-\bar{X}_{\hat{\mathcal{C}}_{2}}\|_{2}$ given $\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf X})$ involves the unknown nuisance parameters $\bm{\pi}_{\nu(\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2})}^{\perp}\bm{\mu}$, where $\bm{\pi}_{\nu}^{\perp}={\bf I}_{n}-\frac{\nu\nu^{T}}{\|\nu\|_{2}^{2}}$ projects onto the orthogonal complement of the vector $\nu$, and where

$[\nu(\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2})]_{i}=\mathds{1}\{i\in\hat{\mathcal{C}}_{1}\}/|\hat{\mathcal{C}}_{1}|-\mathds{1}\{i\in\hat{\mathcal{C}}_{2}\}/|\hat{\mathcal{C}}_{2}|.$   (7)

In other words, it requires knowing aspects of $\bm{\mu}$ that are not known under the null. Instead, we will define the p-value for $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}$ in (3) to be

$p({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})=\mathbb{P}_{H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}}\Big(\|\bar{X}_{\hat{\mathcal{C}}_{1}}-\bar{X}_{\hat{\mathcal{C}}_{2}}\|_{2}\geq\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}\,\Big|\,\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf X}),\ \bm{\pi}_{\nu(\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2})}^{\perp}{\bf X}=\bm{\pi}_{\nu(\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2})}^{\perp}{\bf x},\ \text{dir}\left(\bar{X}_{\hat{\mathcal{C}}_{1}}-\bar{X}_{\hat{\mathcal{C}}_{2}}\right)=\text{dir}\left(\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\right)\Big),$   (8)

where $\text{dir}(w)=\frac{w}{\|w\|_{2}}\mathds{1}\{w\neq 0\}$. The following result shows that conditioning on these additional events makes (8) computationally tractable by constraining the randomness in ${\bf X}$ to a scalar random variable, while maintaining control of the selective type I error rate.

Theorem 1.

For any realization ${\bf x}$ from (1) and for any non-overlapping groups of observations $\mathcal{G}_{1},\mathcal{G}_{2}\subseteq\{1,2,\ldots,n\}$,

$p({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})=1-\mathbb{F}\left(\|\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}}\|_{2};\,\sigma\sqrt{\frac{1}{|\mathcal{G}_{1}|}+\frac{1}{|\mathcal{G}_{2}|}},\,\mathcal{S}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\right),$   (9)

where $p(\cdot;\cdot)$ is defined in (8), $\mathbb{F}(t;c,\mathcal{S})$ denotes the cumulative distribution function of a $c\cdot\chi_{q}$ random variable truncated to the set $\mathcal{S}$, and

$\mathcal{S}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})=\left\{\phi\geq 0:\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}\left(\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf x}+\left(\frac{\phi}{\frac{1}{|\mathcal{G}_{1}|}+\frac{1}{|\mathcal{G}_{2}|}}\right)\nu(\mathcal{G}_{1},\mathcal{G}_{2})\,\text{dir}(\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}})^{T}\right)\right\}.$   (10)

Furthermore, if $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ is true, then

$\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\,\Big|\,\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)=\alpha,\quad\forall~0\leq\alpha\leq 1.$   (11)

That is, rejecting $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ whenever $p({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})$ is below $\alpha$ controls the selective type I error rate (Definition 1) at level $\alpha$.

We prove Theorem 1 in Appendix A.1. It follows from (9) that to compute the p-value $p({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})$ in (8), it suffices to characterize the one-dimensional set

$\hat{\mathcal{S}}\equiv\mathcal{S}({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})=\{\phi\geq 0:\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\phi))\},$   (12)

where $\mathcal{S}({\bf x};\cdot)$ is defined in (10), and where

${\bf x}'(\phi)=\bm{\pi}_{\hat{\nu}}^{\perp}{\bf x}+\left(\frac{\phi}{1/|\hat{\mathcal{C}}_{1}|+1/|\hat{\mathcal{C}}_{2}|}\right)\hat{\nu}\,\text{dir}(\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}})^{T},\quad\hat{\nu}=\nu(\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}),$   (13)

for $\nu(\cdot,\cdot)$ defined in (7).

While the test based on (8) controls the selective type I error rate, the extra conditioning may lead to lower power than a test based on (5) (Lee et al. 2016, Jewell et al. 2019, Mehrizi & Chenouri 2021). However, (8) has a major advantage over (5): Theorem 1 reveals that computing (8) simply requires characterizing $\hat{\mathcal{S}}$ in (12). This is the focus of Section 3.
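As a hedged sketch of how Theorem 1 is used in practice (the helper name and the interval representation are our own assumptions, not the authors' software), suppose $\hat{\mathcal{S}}$ has been computed as a finite union of intervals, stored as the rows of a two-column matrix. Then (9) is the survival function of a $c\cdot\chi_{q}$ random variable truncated to those intervals:

p_selective <- function(stat, c, df, intervals) {
  Fchi  <- function(t) pchisq((pmax(t, 0) / c)^2, df = df)  # CDF of c * chi_df
  denom <- sum(Fchi(intervals[, 2]) - Fchi(intervals[, 1]))
  num   <- sum(pmax(Fchi(intervals[, 2]) - Fchi(pmax(intervals[, 1], stat)), 0))
  num / denom   # P(c * chi_df >= stat | c * chi_df in S)
}
# With illustrative numbers, e.g. S = [2.8, Inf) as in the Figure 3 example:
# p_selective(stat = 4, c = 0.2, df = 2, intervals = matrix(c(2.8, Inf), 1))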

2.2 Interpreting ${\bf x}'(\phi)$ and $\hat{\mathcal{S}}$

Since ${\bf x}^{T}\hat{\nu}=\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}$, where $\bar{x}_{\hat{\mathcal{C}}_{1}}$ is defined in (2) and $\hat{\nu}$ is defined in (13), it follows that the $i$th row of ${\bf x}'(\phi)$ in (13) is

$[{\bf x}'(\phi)]_{i}=\begin{cases}x_{i}+\left(\frac{|\hat{\mathcal{C}}_{2}|}{|\hat{\mathcal{C}}_{1}|+|\hat{\mathcal{C}}_{2}|}\right)\left(\phi-\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}\right)\text{dir}\left(\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\right),&\text{if }i\in\hat{\mathcal{C}}_{1},\\ x_{i}-\left(\frac{|\hat{\mathcal{C}}_{1}|}{|\hat{\mathcal{C}}_{1}|+|\hat{\mathcal{C}}_{2}|}\right)\left(\phi-\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}\right)\text{dir}\left(\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\right),&\text{if }i\in\hat{\mathcal{C}}_{2},\\ x_{i},&\text{if }i\not\in\hat{\mathcal{C}}_{1}\cup\hat{\mathcal{C}}_{2}.\end{cases}$   (14)

We can interpret ${\bf x}'(\phi)$ as a perturbed version of ${\bf x}$, where observations in clusters $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$ have been “pulled apart” (if $\phi>\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}$) or “pushed together” (if $0\leq\phi<\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}$) in the direction of $\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}$. Furthermore, $\hat{\mathcal{S}}$ in (12) describes the set of non-negative $\phi$ for which applying the clustering algorithm $\mathcal{C}$ to the perturbed data set ${\bf x}'(\phi)$ yields $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$. To illustrate this interpretation, we apply average linkage hierarchical clustering to a realization from (1) to obtain three clusters. Figures 3(a)–(c) display ${\bf x}={\bf x}'(\phi)$ for $\phi=\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}=4$, along with ${\bf x}'(\phi)$ for $\phi=0$ and $\phi=8$, respectively. The clusters $\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x})$ are shown in blue and orange. In Figure 3(b), since $\phi=0$, the blue and orange clusters have been “pushed together” so that there is no difference between their empirical means, and average linkage hierarchical clustering no longer estimates these clusters. By contrast, in Figure 3(c), the blue and orange clusters have been “pulled apart”, and average linkage hierarchical clustering still estimates these clusters. In this example, $\hat{\mathcal{S}}=[2.8,\infty)$.
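A minimal R sketch of the perturbation (14) (the function name perturb and its arguments are ours):

perturb <- function(x, C1, C2, phi) {
  diff  <- colMeans(x[C1, , drop = FALSE]) - colMeans(x[C2, , drop = FALSE])
  nrm   <- sqrt(sum(diff^2))
  dirn  <- if (nrm > 0) diff / nrm else 0 * diff
  shift <- (phi - nrm) * dirn                     # pull apart or push together
  xp <- x
  xp[C1, ] <- sweep(x[C1, , drop = FALSE], 2, (length(C2) / (length(C1) + length(C2))) * shift, "+")
  xp[C2, ] <- sweep(x[C2, , drop = FALSE], 2, (length(C1) / (length(C1) + length(C2))) * shift, "-")
  xp
}

One could reproduce the panels of Figure 3 by plotting perturb(x, C1, C2, 0) and perturb(x, C1, C2, 8).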

Figure 3: The observations belonging to $\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x})$ are displayed in blue and orange for: (a) the original data set ${\bf x}={\bf x}'(\phi)$ with $\phi=\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}=4$, (b) a perturbed data set ${\bf x}'(\phi)$ with $\phi=0$, and (c) a perturbed data set ${\bf x}'(\phi)$ with $\phi=8$.

3 Computing $\hat{\mathcal{S}}$ for hierarchical clustering

We now consider computing $\hat{\mathcal{S}}$ defined in (12) for clusters defined via hierarchical clustering. After reviewing hierarchical clustering (Section 3.1), we explicitly characterize $\hat{\mathcal{S}}$ (Section 3.2), and then show how specific properties, such as the dissimilarity and linkage used, lead to substantial computational savings in computing $\hat{\mathcal{S}}$ (Sections 3.3–3.4).

3.1 A brief review of agglomerative hierarchical clustering

Agglomerative hierarchical clustering produces a sequence of clusterings. The first clustering, $\mathcal{C}^{(1)}({\bf x})$, contains $n$ clusters, each with a single observation. The $(t+1)$th clustering, $\mathcal{C}^{(t+1)}({\bf x})$, is created by merging the two most similar (or least dissimilar) clusters in the $t$th clustering, for $t=1,\ldots,n-1$. Details are provided in Algorithm 1.

Algorithm 1 Agglomerative hierarchical clustering of a data set ${\bf x}$

Let $\mathcal{C}^{(1)}({\bf x})=\{\{1\},\{2\},\ldots,\{n\}\}$. For $t=1,\ldots,n-1$:

1. Define $\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}=\underset{\mathcal{G},\mathcal{G}'\in\mathcal{C}^{(t)}({\bf x}),\,\mathcal{G}\neq\mathcal{G}'}{\arg\min}\,d(\mathcal{G},\mathcal{G}';{\bf x})$. (We assume throughout this paper that the minimizer is unique.)

2. Merge $\mathcal{W}_{1}^{(t)}({\bf x})$ and $\mathcal{W}_{2}^{(t)}({\bf x})$ at the height $d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)$ in the dendrogram, and let $\mathcal{C}^{(t+1)}({\bf x})=\mathcal{C}^{(t)}({\bf x})\cup\left\{\mathcal{W}_{1}^{(t)}({\bf x})\cup\mathcal{W}_{2}^{(t)}({\bf x})\right\}\backslash\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}$.

Algorithm 1 involves a function $d(\mathcal{G},\mathcal{G}';{\bf x})$, which quantifies the dissimilarity between two groups of observations. We assume throughout this paper that the dissimilarity between the $i$th and $i'$th observations, $d(\{i\},\{i'\};{\bf x})$, depends on the data through $x_{i}-x_{i'}$ only. For example, we could define $d(\{i\},\{i'\};{\bf x})=\|x_{i}-x_{i'}\|_{2}^{2}$. When $\max\{|\mathcal{G}|,|\mathcal{G}'|\}>1$, we extend this pairwise dissimilarity to a dissimilarity between groups of observations using the notion of linkage, to be discussed further in Section 3.3.

3.2 An explicit characterization of $\hat{\mathcal{S}}$ for hierarchical clustering

We saw in Sections 2.1–2.2 that to compute the p-value $p({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})$ defined in (8), we must characterize the set $\hat{\mathcal{S}}=\{\phi\geq 0:\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\phi))\}$ in (12), where ${\bf x}'(\phi)$ in (13) is a perturbed version of ${\bf x}$ in which observations in the clusters $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$ have been “pushed together” or “pulled apart”. We do so now for hierarchical clustering.

Lemma 1.

Suppose that $\mathcal{C}=\mathcal{C}^{(n-K+1)}$, i.e. we perform hierarchical clustering to obtain $K$ clusters. Then,

$d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}'(\phi)\right)=d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right),\quad\forall~\phi\geq 0,\ \forall~t=1,\ldots,n-K,$   (15)

where $\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right)$ is the “winning pair” of clusters that merged at the $t$th step of the hierarchical clustering algorithm applied to ${\bf x}$. Furthermore, for any $\phi\geq 0$,

$\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\phi))\quad\text{ if and only if }\quad\mathcal{C}^{(t)}({\bf x}'(\phi))=\mathcal{C}^{(t)}({\bf x})~\forall~t=1,\ldots,n-K+1.$   (16)

We prove Lemma 1 in Appendix A.2. The right-hand side of (16) says that the same merges occur in the first $n-K$ steps of the hierarchical clustering algorithm applied to ${\bf x}'(\phi)$ and ${\bf x}$. To characterize the set of merges that occur in ${\bf x}$, consider the set of all “losing pairs”, i.e. all cluster pairs that co-exist but are not the “winning pair” in the first $n-K$ steps:

$\mathcal{L}({\bf x})=\bigcup\limits_{t=1}^{n-K}\left\{\{\mathcal{G},\mathcal{G}'\}:\mathcal{G},\mathcal{G}'\in\mathcal{C}^{(t)}({\bf x}),\ \mathcal{G}\neq\mathcal{G}',\ \{\mathcal{G},\mathcal{G}'\}\neq\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}\right\}.$   (17)

Each pair $\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})$ has a “lifetime” that starts at the step where both clusters have been created, $l_{\mathcal{G},\mathcal{G}'}({\bf x})\equiv\min\left\{1\leq t\leq n-K:\mathcal{G},\mathcal{G}'\in\mathcal{C}^{(t)}({\bf x}),\ \{\mathcal{G},\mathcal{G}'\}\neq\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\}\right\}$, and ends at the step $u_{\mathcal{G},\mathcal{G}'}({\bf x})\equiv\max\left\{1\leq t\leq n-K:\mathcal{G},\mathcal{G}'\in\mathcal{C}^{(t)}({\bf x}),\ \{\mathcal{G},\mathcal{G}'\}\neq\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\}\right\}$. By construction, each pair $\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})$ is never the winning pair at any point in its lifetime, i.e. $d(\mathcal{G},\mathcal{G}';{\bf x})>d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)$ for all $l_{\mathcal{G},\mathcal{G}'}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}'}({\bf x})$. Therefore, ${\bf x}'(\phi)$ preserves the merges that occur in the first $n-K$ steps in ${\bf x}$ if and only if $d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))>d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}'(\phi)\right)$ for all $l_{\mathcal{G},\mathcal{G}'}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}'}({\bf x})$ and for all $\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})$. Furthermore, (15) says that $d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}'(\phi)\right)=d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)$ for all $\phi\geq 0$ and $1\leq t\leq n-K$. This leads to the following result.

Theorem 2.

Suppose that $\mathcal{C}=\mathcal{C}^{(n-K+1)}$, i.e. we perform hierarchical clustering to obtain $K$ clusters. Then, for $\hat{\mathcal{S}}$ defined in (12),

$\hat{\mathcal{S}}=\bigcap\limits_{\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})}\left\{\phi\geq 0:d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))>\underset{l_{\mathcal{G},\mathcal{G}'}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}'}({\bf x})}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)\right\},$   (18)

where $\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}$ is the pair of clusters that merged at the $t$th step of the hierarchical clustering algorithm applied to ${\bf x}$, $\mathcal{L}({\bf x})$ is defined in (17) to be the set of “losing pairs” of clusters in ${\bf x}$, and $[l_{\mathcal{G},\mathcal{G}'}({\bf x}),u_{\mathcal{G},\mathcal{G}'}({\bf x})]$ is the lifetime of such a pair of clusters in ${\bf x}$. Furthermore, (18) is the intersection of $\mathcal{O}(n^{2})$ sets.

We prove Theorem 2 in Appendix A.3. Theorem 2 expresses $\hat{\mathcal{S}}$ in (12) as the intersection of $\mathcal{O}(n^{2})$ sets of the form $\{\phi\geq 0:d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))>h_{\mathcal{G},\mathcal{G}'}({\bf x})\}$, where

$h_{\mathcal{G},\mathcal{G}'}({\bf x})\equiv\underset{l_{\mathcal{G},\mathcal{G}'}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}'}({\bf x})}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)$   (19)

is the maximum merge height in the dendrogram of ${\bf x}$ over the lifetime of $\{\mathcal{G},\mathcal{G}'\}$. The next subsection is devoted to understanding when and how these sets can be efficiently computed. In particular, by specializing to squared Euclidean distance and a certain class of linkages, we will show that each of these sets is defined by a single quadratic inequality, and that the coefficients of all of these quadratic inequalities can be efficiently computed.

3.3 Squared Euclidean distance and “linear update” linkages

Consider hierarchical clustering with squared Euclidean distance and a linkage that satisfies a linear Lance-Williams update (Lance & Williams 1967) of the form

$d(\mathcal{G}_{1}\cup\mathcal{G}_{2},\mathcal{G}_{3};{\bf x})=\alpha_{1}d(\mathcal{G}_{1},\mathcal{G}_{3};{\bf x})+\alpha_{2}d(\mathcal{G}_{2},\mathcal{G}_{3};{\bf x})+\beta d(\mathcal{G}_{1},\mathcal{G}_{2};{\bf x}).$   (20)

This includes average, weighted, Ward, centroid, and median linkage (Table 1).

Linkage:                      Average  Weighted  Ward  Centroid  Median  Single  Complete
Satisfies (20):                  ✓        ✓       ✓       ✓        ✓       ✗        ✗
Does not produce inversions:     ✓        ✓       ✓       ✗        ✗       ✓        ✓

Table 1: Properties of seven linkages in the case of squared Euclidean distance (Murtagh & Contreras 2012). Table 1 of Murtagh & Contreras (2012) specifies $\alpha_{1},\alpha_{2}$, and $\beta$ in (20).

We have seen in Section 3.2 that to evaluate (18), we must evaluate $\mathcal{O}(n^{2})$ sets of the form $\{\phi\geq 0:d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))>h_{\mathcal{G},\mathcal{G}'}({\bf x})\}$ with $\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})$, where $\mathcal{L}({\bf x})$ in (17) is the set of losing cluster pairs in ${\bf x}$. We now present results needed to characterize these sets.

Lemma 2.

Suppose that we define $d(\{i\},\{i'\};{\bf x})=\|x_{i}-x_{i'}\|_{2}^{2}$. Then, for all $i\neq i'$, $d(\{i\},\{i'\};{\bf x}'(\phi))=a_{ii'}\phi^{2}+b_{ii'}\phi+c_{ii'}$, where for $\hat{\nu}$ defined in (13), $a_{ii'}=\left(\frac{\hat{\nu}_{i}-\hat{\nu}_{i'}}{\|\hat{\nu}\|_{2}^{2}}\right)^{2}$, $b_{ii'}=2\left(\left(\frac{\hat{\nu}_{i}-\hat{\nu}_{i'}}{\|\hat{\nu}\|_{2}^{2}}\right)\langle\text{dir}({\bf x}^{T}\hat{\nu}),x_{i}-x_{i'}\rangle-a_{ii'}\|{\bf x}^{T}\hat{\nu}\|_{2}\right)$, and $c_{ii'}=\left\|x_{i}-x_{i'}-\left(\frac{\hat{\nu}_{i}-\hat{\nu}_{i'}}{\|\hat{\nu}\|_{2}^{2}}\right)({\bf x}^{T}\hat{\nu})\right\|_{2}^{2}$.

Lemma 2 follows directly from the definition of ${\bf x}'(\phi)$ in (13), and does not require (20) to hold. Next, we specialize to squared Euclidean distance and linkages satisfying (20), and characterize $d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))$, the dissimilarity between pairs of clusters in ${\bf x}'(\phi)$. The following result follows immediately from Lemma 2 and the fact that linear combinations of quadratic functions of $\phi$ are also quadratic functions of $\phi$.

Proposition 1.

Suppose we define $d(\{i\},\{i'\};{\bf x})=\|x_{i}-x_{i'}\|_{2}^{2}$, and we define $d(\mathcal{G},\mathcal{G}';{\bf x})$ using a linkage that satisfies (20). Then, $d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))$ is a quadratic function of $\phi$ for all $\mathcal{G}\neq\mathcal{G}'$. Furthermore, given the coefficients corresponding to the quadratic functions $d(\mathcal{G}_{1},\mathcal{G}_{3};{\bf x}'(\phi))$, $d(\mathcal{G}_{2},\mathcal{G}_{3};{\bf x}'(\phi))$, and $d(\mathcal{G}_{1},\mathcal{G}_{2};{\bf x}'(\phi))$, we can compute the coefficients corresponding to the quadratic function $d(\mathcal{G}_{1}\cup\mathcal{G}_{2},\mathcal{G}_{3};{\bf x}'(\phi))$ in $\mathcal{O}(1)$ time, using (20).
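As a hedged R sketch of Lemma 2 and Proposition 1 (our own helper names, not the authors' implementation), the quadratic coefficients for a pair of singletons, and the $\mathcal{O}(1)$ Lance-Williams update in (20) applied coefficientwise:

quad_coef <- function(x, nu, i, ip) {
  v    <- (nu[i] - nu[ip]) / sum(nu^2)
  xtnu <- drop(t(x) %*% nu)                 # x^T nu-hat = xbar_C1 - xbar_C2
  nrm  <- sqrt(sum(xtnu^2))
  dirn <- if (nrm > 0) xtnu / nrm else 0 * xtnu
  a  <- v^2
  b  <- 2 * (v * sum(dirn * (x[i, ] - x[ip, ])) - a * nrm)
  cc <- sum((x[i, ] - x[ip, ] - v * xtnu)^2)
  c(a = a, b = b, c = cc)
}
lw_update <- function(q13, q23, q12, alpha1, alpha2, beta) {
  alpha1 * q13 + alpha2 * q23 + beta * q12  # coefficients of d(G1 U G2, G3; x'(phi))
}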

Lastly, we characterize the cost of computing $h_{\mathcal{G},\mathcal{G}'}({\bf x})$ in (19). Naively, computing $h_{\mathcal{G},\mathcal{G}'}({\bf x})$ could require $\mathcal{O}(n)$ operations. However, if the dendrogram of ${\bf x}$ has no inversions below the $(n-K)$th merge, i.e. if $d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)<d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right)$ for all $t<n-K$, then $h_{\mathcal{G},\mathcal{G}'}({\bf x})=d\left(\mathcal{W}_{1}^{(u_{\mathcal{G},\mathcal{G}'}({\bf x}))}({\bf x}),\mathcal{W}_{2}^{(u_{\mathcal{G},\mathcal{G}'}({\bf x}))}({\bf x});{\bf x}\right)$. More generally, $h_{\mathcal{G},\mathcal{G}'}({\bf x})=\underset{t\in\mathcal{M}_{\mathcal{G},\mathcal{G}'}({\bf x})\cup\{u_{\mathcal{G},\mathcal{G}'}({\bf x})\}}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)$, where $\mathcal{M}_{\mathcal{G},\mathcal{G}'}({\bf x})=\big\{t:l_{\mathcal{G},\mathcal{G}'}({\bf x})\leq t<u_{\mathcal{G},\mathcal{G}'}({\bf x}),\ d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)>d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right)\big\}$ is the set of steps where inversions occur in the dendrogram of ${\bf x}$ during the lifetime of the cluster pair $\{\mathcal{G},\mathcal{G}'\}$. This leads to the following result.

Proposition 2.

For any $\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})$, given its lifetime $l_{\mathcal{G},\mathcal{G}'}({\bf x})$ and $u_{\mathcal{G},\mathcal{G}'}({\bf x})$, and given $\mathcal{M}({\bf x})=\left\{1\leq t\leq n-K:d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)>d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right)\right\}$, i.e. the set of steps where inversions occur in the dendrogram of ${\bf x}$ below the $(n-K)$th merge, we can compute $h_{\mathcal{G},\mathcal{G}'}({\bf x})$ in $\mathcal{O}(|\mathcal{M}({\bf x})|+1)$ time.
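A sketch of the computation in Proposition 2 (names ours): given the vector of merge heights and the precomputed inversion steps $\mathcal{M}({\bf x})$,

max_merge_height <- function(heights, l, u, inv_steps) {
  steps <- inv_steps[inv_steps >= l & inv_steps < u]
  max(heights[c(steps, u)])   # O(|M(x)| + 1) rather than O(n)
}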

We prove Proposition 2 in Appendix A.4. Proposition 2 does not require defining $d(\{i\},\{i'\};{\bf x})=\|x_{i}-x_{i'}\|_{2}^{2}$ and does not require (20) to hold. We now characterize the cost of computing $\hat{\mathcal{S}}$ defined in (12), in the case of squared Euclidean distance and linkages that satisfy (20).

Proposition 3.

Suppose we define $d(\{i\},\{i'\};{\bf x})=\|x_{i}-x_{i'}\|_{2}^{2}$, and we define $d(\mathcal{G},\mathcal{G}';{\bf x})$ using a linkage that satisfies (20). Then, we can compute $\hat{\mathcal{S}}$ defined in (12) in $\mathcal{O}\big((|\mathcal{M}({\bf x})|+\log(n))n^{2}\big)$ time.

A detailed algorithm for computing $\hat{\mathcal{S}}$ is provided in Appendix B. If the linkage we use does not produce inversions, then $|\mathcal{M}({\bf x})|=0$ for all ${\bf x}$. Average, weighted, and Ward linkage satisfy (20) and are guaranteed not to produce inversions (Table 1); thus $\hat{\mathcal{S}}$ can be computed in $\mathcal{O}(n^{2}\log(n))$ time.

3.4 Squared Euclidean distance and single linkage

Single linkage does not satisfy (20) (Table 1), and the inequality that defines the set $\{\phi\geq 0:d(\mathcal{G},\mathcal{G}';{\bf x}'(\phi))>h_{\mathcal{G},\mathcal{G}'}({\bf x})\}$ is not quadratic in $\phi$ for $|\mathcal{G}|>1$ or $|\mathcal{G}'|>1$. Consequently, in the case of single linkage with squared Euclidean distance, we cannot efficiently evaluate the expression for $\hat{\mathcal{S}}$ in (18) using the approach outlined in Section 3.3.

Fortunately, the definition of single linkage leads to an even simpler expression for $\hat{\mathcal{S}}$ than (18). Single linkage defines $d(\mathcal{G},\mathcal{G}';{\bf x})=\underset{i\in\mathcal{G},i'\in\mathcal{G}'}{\min}d(\{i\},\{i'\};{\bf x})$. Applying this definition to (18) yields $\hat{\mathcal{S}}=\bigcap\limits_{\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})}\bigcap\limits_{i\in\mathcal{G}}\bigcap\limits_{i'\in\mathcal{G}'}\left\{\phi\geq 0:d(\{i\},\{i'\};{\bf x}'(\phi))>h_{\mathcal{G},\mathcal{G}'}({\bf x})\right\}$, where $\mathcal{L}({\bf x})$ in (17) is the set of losing cluster pairs in ${\bf x}$. Therefore, in the case of single linkage, $\hat{\mathcal{S}}=\bigcap\limits_{\{i,i'\}\in\mathcal{L}'({\bf x})}\hat{\mathcal{S}}_{i,i'}$, where $\mathcal{L}'({\bf x})=\{\{i,i'\}:i\in\mathcal{G},i'\in\mathcal{G}',\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x})\}$ and $\hat{\mathcal{S}}_{i,i'}=\Big\{\phi\geq 0:d(\{i\},\{i'\};{\bf x}'(\phi))>\underset{\{\mathcal{G},\mathcal{G}'\}\in\mathcal{L}({\bf x}):i\in\mathcal{G},i'\in\mathcal{G}'}{\max}h_{\mathcal{G},\mathcal{G}'}({\bf x})\Big\}$. The following result characterizes the sets of the form $\hat{\mathcal{S}}_{i,i'}$.

Proposition 4.

Suppose that $\mathcal{C}=\mathcal{C}^{(n-K+1)}$, i.e. we perform hierarchical clustering to obtain $K$ clusters. Let $i\neq i'$. If $i,i'\in\hat{\mathcal{C}}_{1}$ or $i,i'\in\hat{\mathcal{C}}_{2}$ or $i,i'\not\in\hat{\mathcal{C}}_{1}\cup\hat{\mathcal{C}}_{2}$, then $\hat{\mathcal{S}}_{i,i'}=[0,\infty)$. Otherwise, $\hat{\mathcal{S}}_{i,i'}=\left\{\phi\geq 0:d(\{i\},\{i'\};{\bf x}'(\phi))>d\left(\mathcal{W}^{(n-K)}_{1}({\bf x}),\mathcal{W}^{(n-K)}_{2}({\bf x});{\bf x}\right)\right\}$.

We prove Proposition 4 in Appendix A.5. Therefore,

$\hat{\mathcal{S}}=\bigcap\limits_{\{i,i'\}\in\mathcal{L}'({\bf x})}\hat{\mathcal{S}}_{i,i'}=\bigcap\limits_{\{i,i'\}\in\mathcal{I}({\bf x})}\left\{\phi\geq 0:d(\{i\},\{i'\};{\bf x}'(\phi))>d\left(\mathcal{W}^{(n-K)}_{1}({\bf x}),\mathcal{W}^{(n-K)}_{2}({\bf x});{\bf x}\right)\right\},$   (21)

where $\mathcal{I}({\bf x})=\mathcal{L}'({\bf x})\backslash\left[\left\{\{i,i'\}:i,i'\in\hat{\mathcal{C}}_{1}\right\}\cup\left\{\{i,i'\}:i,i'\in\hat{\mathcal{C}}_{2}\right\}\cup\left\{\{i,i'\}:i,i'\not\in\hat{\mathcal{C}}_{1}\cup\hat{\mathcal{C}}_{2}\right\}\right]$. Recall from Lemma 2 that in the case of squared Euclidean distance, $d(\{i\},\{i'\};{\bf x}'(\phi))$ is a quadratic function of $\phi$ whose coefficients can be computed in $\mathcal{O}(1)$ time. Furthermore, $\mathcal{O}(n^{2})$ sets are intersected in (21). Therefore, we can evaluate (21) in $\mathcal{O}(n^{2}\log(n))$ time by solving $\mathcal{O}(n^{2})$ quadratic inequalities and intersecting their solution sets.
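Each of the sets intersected in (21) (and, for the linkages of Section 3.3, in (18)) is the solution set of one quadratic inequality. A hedged R sketch of this building block (our own helper; strictness of interval endpoints is glossed over), returning $\{\phi\geq 0:a\phi^{2}+b\phi+c>h\}$ as a union of intervals:

quad_gt <- function(a, b, cc, h) {
  # assumes a > 0; pairs with nu_i = nu_i' give a = 0 and are handled separately
  disc <- b^2 - 4 * a * (cc - h)
  if (disc <= 0) return(matrix(c(0, Inf), 1))   # inequality holds for all phi >= 0
  r <- sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a))
  out <- rbind(c(0, r[1]), c(r[2], Inf))        # parabola exceeds h outside its roots
  out[, 1] <- pmax(out[, 1], 0)
  out[out[, 2] > out[, 1], , drop = FALSE]      # keep the non-empty pieces
}

Intersecting the $\mathcal{O}(n^{2})$ resulting interval unions, e.g. after sorting their endpoints, gives the $\mathcal{O}(n^{2}\log(n))$ total cost.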

4 Extensions

4.1 Monte Carlo approximation to the p-value

We may be interested in computing the p-value $p({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})$ defined in (8) for clustering methods where $\hat{\mathcal{S}}$ in (12) cannot be efficiently computed (e.g. complete linkage hierarchical clustering or non-hierarchical clustering methods). Thus, we develop a Monte Carlo approximation to the p-value that does not require us to compute $\hat{\mathcal{S}}$. Recalling from (12) that $\hat{\mathcal{S}}=\mathcal{S}({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})$, it follows from Theorem 1 that

$p({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})=\mathbb{E}\left[\mathds{1}\left\{\phi\geq\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2},\ \hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\phi))\right\}\right]\Big/\,\mathbb{E}\left[\mathds{1}\left\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\phi))\right\}\right],$   (22)

for $\phi\sim\sigma\left(\sqrt{\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}}\right)\cdot\chi_{q}$, and for ${\bf x}'(\phi)$ defined in (13). Thus, we could naively sample $\phi_{1},\ldots,\phi_{N}\overset{i.i.d.}{\sim}\sigma\left(\sqrt{\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}}\right)\cdot\chi_{q}$, and approximate the expectations in (22) with averages over the samples. However, when $\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2}$ is in the tail of the $\sigma\left(\sqrt{\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}}\right)\cdot\chi_{q}$ distribution, the naive approximation of the expectations in (22) is poor for finite values of $N$. Instead, we use an importance sampling approach, as in Yang et al. (2016). We sample $\omega_{1},\ldots,\omega_{N}\overset{i.i.d.}{\sim}N\left(\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2},\sigma^{2}\left(\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}\right)\right)$, and approximate (22) with

$p({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})\approx\left(\sum\limits_{i=1}^{N}\pi_{i}\mathds{1}\left\{\omega_{i}\geq\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2},\ \hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\omega_{i}))\right\}\right)\Big/\left(\sum\limits_{i=1}^{N}\pi_{i}\mathds{1}\left\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf x}'(\omega_{i}))\right\}\right),$

for $\pi_{i}=\frac{f_{1}(\omega_{i})}{f_{2}(\omega_{i})}$, where $f_{1}$ is the density of a $\sigma\left(\sqrt{\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}}\right)\cdot\chi_{q}$ random variable, and $f_{2}$ is the density of a $N\left(\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2},\sigma^{2}\left(\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}\right)\right)$ random variable.
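A hedged R sketch of this importance sampler (function and argument names are ours; cluster_fn re-runs the clustering algorithm, same_clusters checks that $\hat{\mathcal{C}}_{1}$ and $\hat{\mathcal{C}}_{2}$ are still among the estimated clusters, and perturb is the helper sketched in Section 2.2):

is_pvalue <- function(x, C1, C2, sigma, cluster_fn, same_clusters, N = 2000) {
  q <- ncol(x)
  m <- sqrt(sum((colMeans(x[C1, , drop = FALSE]) -
                 colMeans(x[C2, , drop = FALSE]))^2))
  s <- sigma * sqrt(1 / length(C1) + 1 / length(C2))
  omega <- rnorm(N, mean = m, sd = s)                      # draws from f2
  f1 <- ifelse(omega > 0, dchisq((omega / s)^2, df = q) * 2 * omega / s^2, 0)
  f2 <- dnorm(omega, mean = m, sd = s)
  w  <- f1 / f2                                            # weights pi_i = f1 / f2
  keep <- vapply(omega, function(ph)
    same_clusters(cluster_fn(perturb(x, C1, C2, ph)), C1, C2), logical(1))
  sum(w * (omega >= m) * keep) / sum(w * keep)
}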

4.2 Non-spherical covariance matrix

The selective inference framework in Section 2 assumes that ${\bf x}$ is a realization from (1), so that $\text{Cov}(X_{i})=\sigma^{2}{\bf I}_{q}$. In this subsection, we instead assume that ${\bf x}$ is a realization from

${\bf X}\sim\mathcal{MN}_{n\times q}(\bm{\mu},{\bf I}_{n},\bm{\Sigma}),$   (23)

where $\bm{\Sigma}$ is a known $q\times q$ positive definite matrix. Analogously to (8), we define the p-value of interest as

$p_{\Sigma}({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})=\mathbb{P}_{H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}}\Big(\|\bm{\Sigma}^{-\frac{1}{2}}{\bf X}^{T}\hat{\nu}\|_{2}\geq\|\bm{\Sigma}^{-\frac{1}{2}}{\bf x}^{T}\hat{\nu}\|_{2}\,\Big|\,\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\in\mathcal{C}({\bf X}),\ \bm{\pi}_{\hat{\nu}}^{\perp}{\bf X}=\bm{\pi}_{\hat{\nu}}^{\perp}{\bf x},\ \text{dir}(\bm{\Sigma}^{-\frac{1}{2}}{\bf X}^{T}\hat{\nu})=\text{dir}(\bm{\Sigma}^{-\frac{1}{2}}{\bf x}^{T}\hat{\nu})\Big).$

Theorem 3.

For any realization ${\bf x}$ from (23), and for any non-overlapping groups of observations $\mathcal{G}_{1},\mathcal{G}_{2}\subseteq\{1,2,\ldots,n\}$,

$p_{\Sigma}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})=1-\mathbb{F}\left(\|\bm{\Sigma}^{-\frac{1}{2}}{\bf x}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2})\|_{2};\,\|\nu(\mathcal{G}_{1},\mathcal{G}_{2})\|_{2},\,\mathcal{S}_{\Sigma}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\right),$   (24)

where $\mathcal{S}_{\Sigma}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})=\left\{\phi\geq 0:\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}\left(\bm{\pi}^{\perp}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}{\bf x}+\phi\left(\frac{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}{\|\nu(\mathcal{G}_{1},\mathcal{G}_{2})\|_{2}^{2}}\right)\text{dir}(\bm{\Sigma}^{-\frac{1}{2}}{\bf x}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2}))^{T}\bm{\Sigma}^{\frac{1}{2}}\right)\right\}$. Furthermore, if $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ is true, then

$\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\left(p_{\Sigma}({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)=\alpha,\quad\text{for all }0\leq\alpha\leq 1.$

That is, rejecting $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ whenever $p_{\Sigma}({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})$ is below $\alpha$ controls the selective type I error rate (Definition 1).

We omit the proof of Theorem 3, since it closely follows that of Theorem 1. In the case of hierarchical clustering with squared Euclidean distance, we can adapt Sections 3.3–3.4 to compute $\mathcal{S}_{\Sigma}({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\})$ by replacing $a_{ii'}$, $b_{ii'}$, and $c_{ii'}$ in Lemma 2 with $\tilde{a}_{ii'}=\left(\frac{\hat{\nu}_{i}-\hat{\nu}_{i'}}{\|\hat{\nu}\|_{2}^{2}}\right)^{2}\left(\frac{\|{\bf x}^{T}\hat{\nu}\|_{2}}{\|\bm{\Sigma}^{-1/2}{\bf x}^{T}\hat{\nu}\|_{2}}\right)^{2}$, $\tilde{b}_{ii'}=2\Big(\Big(\frac{\hat{\nu}_{i}-\hat{\nu}_{i'}}{\|\hat{\nu}\|_{2}^{2}}\Big)\Big(\frac{\|{\bf x}^{T}\hat{\nu}\|_{2}}{\|\bm{\Sigma}^{-1/2}{\bf x}^{T}\hat{\nu}\|_{2}}\Big)\langle\text{dir}({\bf x}^{T}\hat{\nu}),x_{i}-x_{i'}\rangle-\tilde{a}_{ii'}\|\bm{\Sigma}^{-1/2}{\bf x}^{T}\hat{\nu}\|_{2}\Big)$, and $\tilde{c}_{ii'}=\left\|x_{i}-x_{i'}-\left(\frac{\hat{\nu}_{i}-\hat{\nu}_{i'}}{\|\hat{\nu}\|_{2}^{2}}\right)({\bf x}^{T}\hat{\nu})\right\|_{2}^{2}$.
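In the same hedged spirit as the sketch after Proposition 1 (names ours), the modified coefficients can be computed as:

quad_coef_sigma <- function(x, nu, i, ip, Sigma) {
  v     <- (nu[i] - nu[ip]) / sum(nu^2)
  xtnu  <- drop(t(x) %*% nu)
  white <- backsolve(chol(Sigma), xtnu, transpose = TRUE)  # same norm as Sigma^{-1/2} x^T nu
  wnorm <- sqrt(sum(white^2))
  nrm   <- sqrt(sum(xtnu^2))
  dirn  <- if (nrm > 0) xtnu / nrm else 0 * xtnu
  ratio <- nrm / wnorm
  a  <- v^2 * ratio^2
  b  <- 2 * (v * ratio * sum(dirn * (x[i, ] - x[ip, ])) - a * wnorm)
  cc <- sum((x[i, ] - x[ip, ] - v * xtnu)^2)
  c(a = a, b = b, c = cc)
}

(backsolve with the Cholesky factor yields a vector with the same norm as $\bm{\Sigma}^{-1/2}{\bf x}^{T}\hat{\nu}$, which is all that is needed here.)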

4.3 Unknown variance

The selective inference framework in Section 2 assumes that $\sigma$ in model (1) is known. If $\sigma$ is unknown, then we can plug an estimate of $\sigma$ into (9), as follows:

$\hat{p}\left({\bf x};\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}\right)=1-\mathbb{F}\left(\|\bar{x}_{\hat{\mathcal{C}}_{1}}-\bar{x}_{\hat{\mathcal{C}}_{2}}\|_{2};\,\hat{\sigma}({\bf x})\sqrt{\frac{1}{|\hat{\mathcal{C}}_{1}|}+\frac{1}{|\hat{\mathcal{C}}_{2}|}},\,\hat{\mathcal{S}}\right).$   (25)

Intuitively, if we plug in a consistent estimate of $\sigma$, we might expect to asymptotically control the selective type I error rate. For example, this is true in the context of selective inference for low-dimensional linear regression (Tian & Taylor 2017), where the error variance can be consistently estimated under mild assumptions.

Unfortunately, consistently estimating $\sigma$ in (1) presents a major challenge. Similar issues arise when estimating the error variance in high-dimensional linear regression; see e.g. Reid et al. (2016). Thus, we adopt a similar approach to Tibshirani et al. (2018), which studied the theoretical implications of plugging in a simple over-estimate of the error variance in the context of selective inference for high-dimensional linear regression. In the following result, we show that plugging in an asymptotic over-estimate of $\sigma$ in (25) leads to asymptotic control of the selective type I error rate (Definition 1).

Theorem 4.

For $n=1,2,\ldots$, suppose that ${\bf X}^{(n)}\sim\mathcal{MN}_{n\times q}(\bm{\mu}^{(n)},{\bf I}_{n},\sigma^{2}{\bf I}_{q})$. Let ${\bf x}^{(n)}$ be a realization of ${\bf X}^{(n)}$, and let $\hat{\mathcal{C}}_{1}^{(n)}$ and $\hat{\mathcal{C}}_{2}^{(n)}$ be a pair of clusters estimated from ${\bf x}^{(n)}$. Suppose that $\underset{n\rightarrow\infty}{\lim}\,\mathbb{P}_{H_{0}^{\{\hat{\mathcal{C}}_{1}^{(n)},\hat{\mathcal{C}}_{2}^{(n)}\}}}\left(\hat{\sigma}({\bf X}^{(n)})\geq\sigma\,\Big|\,\hat{\mathcal{C}}_{1}^{(n)},\hat{\mathcal{C}}_{2}^{(n)}\in\mathcal{C}({\bf X}^{(n)})\right)=1$. Then, for any $\alpha\in[0,1]$, we have that $\underset{n\rightarrow\infty}{\lim}\,\mathbb{P}_{H_{0}^{\{\hat{\mathcal{C}}_{1}^{(n)},\hat{\mathcal{C}}_{2}^{(n)}\}}}\left(\hat{p}({\bf X}^{(n)};\{\hat{\mathcal{C}}_{1}^{(n)},\hat{\mathcal{C}}_{2}^{(n)}\})\leq\alpha\,\Big|\,\hat{\mathcal{C}}_{1}^{(n)},\hat{\mathcal{C}}_{2}^{(n)}\in\mathcal{C}({\bf X}^{(n)})\right)\leq\alpha$.

We prove Theorem 4 in Appendix A.6. In Appendix C, we provide an estimator of $\sigma$ that satisfies the conditions in Theorem 4.

4.4 Consequences of selective type I error rate control

This paper focuses on developing tests of $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}:\bar{\mu}_{\mathcal{G}_{1}}=\bar{\mu}_{\mathcal{G}_{2}}$ that control the selective type I error rate (Definition 1). However, it is cumbersome to demonstrate selective type I error rate control via simulation, as $\mathbb{P}(\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X}))$ can be small when $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ holds.

Nevertheless, two related properties can be demonstrated via simulation. Let $\mathcal{H}_{0}$ denote the set of null hypotheses of the form $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ that are true. The following property holds.

Proposition 5.

When $K=2$, i.e. the clustering algorithm $\mathcal{C}(\cdot)$ estimates two clusters,

$\mathbb{P}\left(p({\bf X};\mathcal{C}({\bf X}))\leq\alpha\,\Big|\,H_{0}^{\mathcal{C}({\bf X})}\in\mathcal{H}_{0}\right)=\alpha,\ \text{ for all }0\leq\alpha\leq 1,$   (26)

where $p(\cdot;\cdot)$ is defined in (8). That is, if the two estimated clusters have the same mean, then the probability of falsely declaring otherwise is equal to $\alpha$.

We prove Proposition 5 in Appendix A.7. What if $K>2$? Then, given a data set ${\bf x}$, we can randomly select (independently of ${\bf x}$) a single pair of estimated clusters $\mathcal{G}_{1}({\bf x}),\mathcal{G}_{2}({\bf x})\in\mathcal{C}({\bf x})$. This leads to the following property.

Proposition 6.

Suppose that $K>2$, i.e. the clustering algorithm $\mathcal{C}(\cdot)$ estimates three or more clusters, and let $\mathcal{G}_{1}({\bf x}),\mathcal{G}_{2}({\bf x})\in\mathcal{C}({\bf x})$ denote a randomly selected pair. If $\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}$ and ${\bf X}$ are conditionally independent given $\mathcal{C}({\bf X})$, then for $p(\cdot;\cdot)$ defined in (8),

$\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\})\leq\alpha\,\Big|\,H_{0}^{\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}}\in\mathcal{H}_{0}\right)=\alpha,\ \text{ for all }0\leq\alpha\leq 1.$   (27)

We prove Proposition 6 in Appendix A.7. Recall that in Figure 1(c), we simulated data with $\bm{\mu}=\bm{0}_{n\times q}$, so the conditioning event in (27) holds with probability 1. Thus, (27) specializes to $\mathbb{P}(p({\bf X};\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\})\leq\alpha)=\alpha$, i.e. $p({\bf X};\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\})\sim\text{Uniform}(0,1)$. This property is illustrated in Figure 1(c).

5 Simulation results

Throughout this section, we use the efficient implementation of hierarchical clustering in the fastcluster package (Müllner 2013) in R.
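For concreteness, the clustering step used throughout might look like the following sketch (fastcluster::hclust is a drop-in replacement for stats::hclust; the sizes shown are just for illustration):

library(fastcluster)
x  <- matrix(rnorm(150 * 10), 150, 10)       # e.g. n = 150, q = 10
hc <- hclust(dist(x)^2, method = "average")  # or "centroid", "single", "complete"
cl <- cutree(hc, k = 3)                      # three estimated clusters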

5.1 Uniform p-values under a global null

We generate data from (1) with $\bm{\mu}=\bm{0}_{n\times q}$, so that $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}$ holds for all pairs of estimated clusters. We simulate 2000 data sets for $n=150$, $\sigma\in\{1,2,10\}$, and $q\in\{2,10,100\}$. For each data set, we use average, centroid, single, and complete linkage hierarchical clustering to estimate three clusters, and then test $H_{0}^{\{\hat{\mathcal{C}}_{1},\hat{\mathcal{C}}_{2}\}}$ for a randomly-chosen pair of clusters. We compute the p-value defined in (8) as described in Section 3 for average, centroid, and single linkage. For complete linkage, we approximate the p-value as described in Section 4.1 with $N=2000$. Figure 4 displays QQ-plots of the empirical distribution of the p-values against the Uniform$(0,1)$ distribution. The p-values have a Uniform$(0,1)$ distribution because our proposed test satisfies (27) and because $\bm{\mu}=\bm{0}_{n\times q}$; see the end of Section 4.4 for a detailed discussion. In Appendix D.1, we show that plugging in an over-estimate of $\sigma$ as in (25) yields p-values that are stochastically larger than the Uniform$(0,1)$ distribution.

Figure 4: For 2000 draws from (1) with $\bm{\mu}=\bm{0}_{n\times q}$, $n=150$, $q\in\{2,10,100\}$, and $\sigma\in\{1,2,10\}$, QQ-plots of the p-values obtained from the test proposed in Section 2.1, using (a) average linkage, (b) centroid linkage, (c) single linkage, and (d) complete linkage.

5.2 Conditional power and recovery probability

We generate data from (1) with n=30n=30, and three equidistant clusters,

μ1==μn3=[δ/20q1],μn3+1==μ2n3=[03δ/20q2],μ2n3+1==μn=[δ/20q1],\displaystyle\mu_{1}=\cdots=\mu_{\frac{n}{3}}=\left[\begin{smallmatrix}-\delta/2\\ 0_{q-1}\end{smallmatrix}\right],\quad\mu_{\frac{n}{3}+1}=\cdots=\mu_{\frac{2n}{3}}=\left[\begin{smallmatrix}0\\ \sqrt{3}\delta/2\\ 0_{q-2}\end{smallmatrix}\right],\quad\mu_{\frac{2n}{3}+1}=\cdots=\mu_{n}=\left[\begin{smallmatrix}\delta/2\\ 0_{q-1}\end{smallmatrix}\right], (28)

for δ>0\delta>0. For each simulated data set, we use average, centroid, single, and complete linkage hierarchical clustering to estimate three clusters, and then test H0{𝒞^1,𝒞^2}H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}} for a randomly-chosen pair of clusters, with significance level α=0.05\alpha=0.05. We simulate 300,000300,000 data sets for σ=1\sigma=1, q=10q=10, and seven evenly-spaced values of δ[4,7]\delta\in[4,7]. We define the conditional power to be the conditional probability of rejecting H0{𝒞^1,𝒞^2}H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}} for a randomly-chosen pair of estimated clusters, given that the randomly-chosen pair of estimated clusters correspond to two true clusters. Since this conditions on the event that the randomly-chosen pair of estimated clusters correspond to two true clusters, we are also interested in how often this event occurs. We therefore consider the recovery probability, the probability that the randomly-chosen pair of estimated clusters correspond to two true clusters. Figure 5 displays the conditional power and recovery probability as a function of the distance between the true clusters (δ)(\delta).
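The mean matrix in (28) can be constructed as in the sketch below; the second cluster's nonzero coordinate is orthogonal to the first, so that all three cluster means are pairwise a distance δ\delta apart.

```r
# Sketch: the mean structure in (28), three equidistant clusters.
make_means <- function(n, q, delta) {
  stopifnot(n %% 3 == 0, q >= 2)
  m1 <- c(-delta / 2, rep(0, q - 1))
  m2 <- c(0, sqrt(3) * delta / 2, rep(0, q - 2))
  m3 <- c(delta / 2, rep(0, q - 1))
  rbind(matrix(m1, n / 3, q, byrow = TRUE),
        matrix(m2, n / 3, q, byrow = TRUE),
        matrix(m3, n / 3, q, byrow = TRUE))
}

mu <- make_means(n = 30, q = 10, delta = 5)
X  <- mu + matrix(rnorm(30 * 10), 30, 10)  # one draw from (1) with sigma = 1
dist(unique(mu))                           # all pairwise distances equal delta
```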


Figure 5: For the simulation study described in Section 5.2, (a) conditional power of the test proposed in Section 2 versus the difference in means between the true clusters (δ)(\delta), and (b) recovery probability versus δ\delta. Conditional power and recovery probability are defined in Section 5.2.

For all four linkages, the conditional power and recovery probability increase as the distance between the true clusters (δ)(\delta) increases. Average and complete linkage have the highest conditional power, and single linkage has the lowest conditional power. Average, centroid, and complete linkage have substantially higher recovery probabilities than single linkage.

We consider an alternative definition of power that does not condition on having correctly estimated the true clusters in Appendix D.2.

6 Data applications

6.1 Palmer penguins (Horst et al. 2020)

In this section, we analyze the penguins data set from the palmerpenguins package in R (Horst et al. 2020). We estimate σ\sigma with σ^(𝐱)=i=1nj=1q(xijx¯j)2/(nqq)\hat{\sigma}({\bf x})=\sqrt{\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{q}(x_{ij}-\bar{x}_{j})^{2}/(nq-q)} for x¯j=i=1nxij/n\bar{x}_{j}=\sum\limits_{i=1}^{n}x_{ij}/n, calculated on the bill length and flipper length of 58 female penguins observed in the year 2009. We then consider the 107 female penguins observed in the years 2007–2008 with complete data on species, bill length, and flipper length. Figure 6(a) displays the dendrogram obtained from applying average linkage hierarchical clustering with squared Euclidean distance to the penguins’ bill length and flipper length, cut to yield five clusters, and Figure 6(b) displays the data.
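The variance estimate above is the usual pooled estimate after column-centering; a minimal R sketch:

```r
# Sketch of the plug-in estimate of sigma: sqrt of the column-centered
# sum of squares divided by (nq - q), computed on held-out observations.
sigma_hat <- function(x) {
  centered <- scale(x, center = TRUE, scale = FALSE)  # x_ij - xbar_j
  sqrt(sum(centered^2) / (nrow(x) * ncol(x) - ncol(x)))
}

set.seed(3)
sigma_hat(matrix(rnorm(58 * 2, sd = 2), 58, 2))  # should be close to 2
```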

Figure 6: (a) The average-linkage hierarchical clustering dendrogram and (b) the bill lengths and flipper lengths, as well as the true species labels and estimated clusters, for the Palmer penguins data described in Section 6.1.

We test H0{𝒞^1,𝒞^2}H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}} for all pairs of clusters that contain more than one observation, using the test proposed in Section 2.1, and using the Wald test described in (4). (The latter does not account for the fact that the clusters were estimated from the data, and does not control the selective type I error rate.) Results are in Table 2. The Wald p-values are small, even when the two clusters tested belong to a single species (Clusters 1 and 2). Our proposed test yields a large p-value when the two clusters belong to a single species (Clusters 1 and 2), and small p-values when the clusters correspond to different species (Clusters 1 and 3, and Clusters 3 and 4). The p-values from our proposed test are large for the remaining three pairs of clusters containing different species, even though visual inspection suggests large differences between these clusters. This is because the test statistic is close to the left boundary of 𝒮^\mathcal{\hat{S}} defined in (12), which leads to low power: see the discussion of Figure D.3 in Appendix D.2.

Cluster pairs (1,2)(1,2) (1,3)(1,3) (1,4)(1,4) (2,3)(2,3) (2,4)(2,4) (3,4)(3,4)
Test statistic 10.110.1 25.025.0 10.110.1 33.833.8 17.117.1 18.918.9
Our p-value 0.5910.591 1.70×10141.70\times 10^{-14} 0.7140.714 0.0700.070 0.2910.291 2.10×1062.10\times 10^{-6}
Wald p-value 0.003830.00383 <10307<10^{-307} 0.001010.00101 <10307<10^{-307} 4.29×1054.29\times 10^{-5} 1.58×10111.58\times 10^{-11}
Table 2: Results from applying the test of H0{𝒞^k,𝒞^k}:μ¯𝒞^k=μ¯𝒞^kH_{0}^{\{\mathcal{\hat{C}}_{k},\mathcal{\hat{C}}_{k^{\prime}}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{k}}=\bar{\mu}_{\mathcal{\hat{C}}_{k^{\prime}}} proposed in Section 2 and the Wald test defined in (4) to the Palmer penguins data set, displayed in Figure 6(b).

6.2 Single-cell RNA sequencing data (Zheng et al. 2017)

Single-cell RNA sequencing data quantifies the gene expression levels of individual cells. Biologists often cluster the cells to identify putative cell types, and then test for differential gene expression between the clusters, without properly accounting for the fact that the clusters were estimated from the data (Luecken & Theis 2019, Lähnemann et al. 2020, Deconinck et al. 2021). Zheng et al. (2017) classified peripheral blood mononuclear cells prior to sequencing. We will use this data set to demonstrate that testing for differential gene expression after clustering with our proposed selective inference framework yields reasonable results.

6.2.1 Data and pre-processing

We subset the data to the memory T cells, B cells, and monocytes. In line with standard pre-processing protocols (Duò et al. 2018), we remove cells with a high mitochondrial gene percentage, cells with a low or high number of expressed genes, and cells with a low number of total counts. Then, we scale the data so that the total count for each cell equals the average total count across all cells. Finally, we apply a log2\log_{2} transformation with a pseudo-count of 1, and subset to the 500 genes with the largest pre-normalization average expression levels. We separately apply this pre-processing routine to the memory T cells only, and to all of the cells. After pre-processing, we construct a “no clusters” data set by randomly sampling 600 memory T cells, and a “clusters” data set by randomly sampling 200 each of memory T cells, B cells, and monocytes.
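The normalization and transformation steps can be sketched as follows for a cells-by-genes count matrix; this is a simplification of the full pipeline of Duò et al. (2018).

```r
# Sketch of the count normalization and log transform described above,
# for a cells-by-genes count matrix.
normalize_counts <- function(counts) {
  total  <- rowSums(counts)                   # per-cell total counts
  scaled <- counts * (mean(total) / total)    # equalize per-cell totals
  log2(scaled + 1)                            # log2 with pseudo-count 1
}

set.seed(4)
toy <- matrix(rpois(6 * 4, lambda = 5), nrow = 6)  # 6 cells, 4 genes
normalize_counts(toy)
```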

6.2.2 Data analysis

We apply Ward-linkage hierarchical clustering with squared Euclidean distance to the “no clusters” data to obtain three clusters, containing 6464, 428428, and 108108 cells, respectively. For each pair of clusters, we test H0{𝒞^1,𝒞^2}:μ¯𝒞^1=μ¯𝒞^2H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}} using (i) the test proposed in Section 4.2 under model (23), and (ii) the Wald test under model (23), which has p-value

H0{𝒞^1,𝒞^2}((X¯𝒞^1X¯𝒞^2)T𝚺1(X¯𝒞^1X¯𝒞^2)(x¯𝒞^1x¯𝒞^2)T𝚺1(x¯𝒞^1x¯𝒞^2)).\displaystyle\mathbb{P}_{H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}}\left((\bar{X}_{\mathcal{\hat{C}}_{1}}-\bar{X}_{\mathcal{\hat{C}}_{2}})^{T}\bm{\Sigma}^{-1}(\bar{X}_{\mathcal{\hat{C}}_{1}}-\bar{X}_{\mathcal{\hat{C}}_{2}})\geq(\bar{x}_{\mathcal{\hat{C}}_{1}}-\bar{x}_{\mathcal{\hat{C}}_{2}})^{T}\bm{\Sigma}^{-1}(\bar{x}_{\mathcal{\hat{C}}_{1}}-\bar{x}_{\mathcal{\hat{C}}_{2}})\right). (29)

For both tests, we estimate 𝚺\bm{\Sigma} by applying the principal orthogonal complement thresholding (“POET”; Fan et al. 2013) method to the 9,303 memory T cells left out of the “no clusters” data set. Results are in Table 3. The p-values from our test are large, and the Wald p-values are small. Recall that all of the cells are memory T cells, and so (as far as we know) there are no true clusters in the data.
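Since under H0{𝒞^1,𝒞^2}H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}} we have X¯𝒞^1X¯𝒞^2Nq(0,(1/|𝒞^1|+1/|𝒞^2|)𝚺)\bar{X}_{\mathcal{\hat{C}}_{1}}-\bar{X}_{\mathcal{\hat{C}}_{2}}\sim N_{q}(0,(1/|\mathcal{\hat{C}}_{1}|+1/|\mathcal{\hat{C}}_{2}|)\bm{\Sigma}), the Wald p-value in (29) reduces to a chi-squared tail probability; a minimal sketch:

```r
# Sketch of the Wald p-value in (29): under H0, the quadratic form divided
# by (1/|C1| + 1/|C2|) is chi-squared with q degrees of freedom.
wald_pval <- function(xbar1, xbar2, Sigma, n1, n2) {
  d    <- xbar1 - xbar2
  stat <- drop(t(d) %*% solve(Sigma, d)) / (1 / n1 + 1 / n2)
  pchisq(stat, df = length(d), lower.tail = FALSE)
}

set.seed(5)
q <- 5
wald_pval(rnorm(q, sd = 0.1), rnorm(q, sd = 0.1), diag(q), n1 = 64, n2 = 428)
```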

“No clusters” “Clusters”
Cluster pairs (1, 2) (1, 3) (2, 3) (1, 2) (1, 3) (2, 3)
Test statistic 4.05 4.76 2.96 3.04 4.27 4.38
Our p-value 0.20 0.27 0.70 4.60×10284.60\times 10^{-28} 3.20×10823.20\times 10^{-82} 1.13×10731.13\times 10^{-73}
Wald p-value <10307<10^{-307} <10307<10^{-307} <10307<10^{-307} <10307<10^{-307} <10307<10^{-307} <10307<10^{-307}
Table 3: Results from applying the test of H0{𝒞^1,𝒞^2}:μ¯𝒞^1=μ¯𝒞^2H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}} proposed in Section 4.2 and the Wald test in (29) to the “no clusters” and “clusters” data described in Section 6.2.1.

We now apply the same analysis to the “clusters” data. Ward-linkage hierarchical clustering with squared Euclidean distance results in three clusters that almost exactly correspond to memory T cells, B cells, and monocytes. For both tests, we estimate 𝚺\bm{\Sigma} by applying the POET method to the 21,757 memory T cells, B cells, and monocytes left out of the “clusters” data set. Results are in Table 3. The p-values from both tests are extremely small. This suggests that our proposed approach is able to correctly reject the null hypothesis when it does not hold.

7 Discussion

In this paper, we proposed a selective inference framework for testing the null hypothesis that there is no difference in means between two estimated clusters, under (1). This directly solves a problem that routinely occurs when biologists analyze data, e.g. when testing for differential expression between clusters estimated from single-cell RNA-sequencing data (Luecken & Theis 2019, Lähnemann et al. 2020, Deconinck et al. 2021).

The framework proposed in Section 2 assumed that σ\sigma in (1) was known. Similarly, the extended framework in Section 4.2 assumed that 𝚺\bm{\Sigma} in (23) was known. These assumptions are unlikely to hold in practice. Thus, in the data applications of Sections 6.1 and 6.2, we replaced σ\sigma in (1) and 𝚺\bm{\Sigma} in (23) with estimates. In Section 4.3, we explored the theoretical implications of this replacement, under model (1). Future work could include investigating how best to estimate σ\sigma in (1) and 𝚺\bm{\Sigma} in (23). Another potential avenue for future work would be to develop an analogue of Theorem 4 in Section 4.3 under model (23).

The tests developed in this paper are implemented in the R package clusterpval. Instructions on how to download and use this package can be found at
http://lucylgao.com/clusterpval. Links to download the data sets in Section 6 can be found at https://github.com/lucylgao/clusterpval-experiments, along with code to reproduce the simulation and real data analysis results from this paper.

Acknowledgments

Lucy L. Gao was supported by the NSERC Discovery Grants program. Daniela Witten and Jacob Bien were supported by NIH Grant R01-GM123993. Jacob Bien was supported by NSF CAREER Award DMS-1653017. Daniela Witten was supported by NSF CAREER Award DMS-1252624, NIH Grants R01-EB026908 and R01-DA047869, and a Simons Investigator Award in Mathematical Modeling of Living Systems (No. 560585).

Conflict of interest statement

The authors have no relevant financial or non-financial competing interests to report.

References

  • Bourgon (2009) Bourgon, R. (2009), Overview of the intervals package. R Vignette, URL https://cran.r-project.org/web/packages/intervals/vignettes/intervals_overview.pdf.
  • Butler et al. (2018) Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. (2018), ‘Integrating single-cell transcriptomic data across different conditions, technologies, and species’, Nature Biotechnology 36(5), 411–420.
  • Campbell (2018) Campbell, F. (2018), Statistical Machine Learning Methodology and Inference for Structured Variable Selection, PhD thesis, Rice University.
  • Chen et al. (2001) Chen, H., Chen, J. & Kalbfleisch, J. D. (2001), ‘A modified likelihood ratio test for homogeneity in finite mixture models’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63(1), 19–29.
  • Chen et al. (2012) Chen, J., Li, P. & Fu, Y. (2012), ‘Inference on the order of a normal mixture’, Journal of the American Statistical Association 107(499), 1096–1105.
  • Chen & Bien (2020) Chen, S. & Bien, J. (2020), ‘Valid inference corrected for outlier removal’, Journal of Computational and Graphical Statistics 29(2), 323–334.
  • Deconinck et al. (2021) Deconinck, L., Cannoodt, R., Saelens, W., Deplancke, B. & Saeys, Y. (2021), ‘Recent advances in trajectory inference from single-cell ’omics data’, Current Opinion in Systems Biology 27, 100344.
  • Duò et al. (2018) Duò, A., Robinson, M. D. & Soneson, C. (2018), ‘A systematic performance evaluation of clustering methods for single-cell RNA-seq data’, F1000Research 7.
  • Fan et al. (2013) Fan, J., Liao, Y. & Mincheva, M. (2013), ‘Large covariance estimation by thresholding principal orthogonal complements’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(4), 603–680.
  • Fithian et al. (2014) Fithian, W., Sun, D. & Taylor, J. (2014), ‘Optimal inference after model selection’, arXiv preprint arXiv:1410.2597 .
  • Hao et al. (2021) Hao, Y., Hao, S., Andersen-Nissen, E., Mauck III, W. M., Zheng, S., Butler, A., Lee, M. J., Wilk, A. J., Darby, C., Zager, M. et al. (2021), ‘Integrated analysis of multimodal single-cell data’, Cell 184(13), 3573–3587.
  • Hocking et al. (2011) Hocking, T. D., Joulin, A., Bach, F. & Vert, J.-P. (2011), Clusterpath: An algorithm for clustering using convex fusion penalties, in ‘Proceedings of the 28th International Conference on Machine Learning’, pp. 1–8.
  • Horst et al. (2020) Horst, A. M., Hill, A. P. & Gorman, K. B. (2020), palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0, available on CRAN.
  • Hwang et al. (2018) Hwang, B., Lee, J. H. & Bang, D. (2018), ‘Single-cell RNA sequencing technologies and bioinformatics pipelines’, Experimental & Molecular Medicine 50(8), 1–14.
  • Hyun et al. (2018) Hyun, S., G’Sell, M., Tibshirani, R. J. et al. (2018), ‘Exact post-selection inference for the generalized lasso path’, Electronic Journal of Statistics 12(1), 1053–1097.
  • Jewell et al. (2019) Jewell, S., Fearnhead, P. & Witten, D. (2019), ‘Testing for a change in mean after changepoint detection’, arXiv preprint arXiv:1910.04291 .
  • Kimes et al. (2017) Kimes, P. K., Liu, Y., Neil Hayes, D. & Marron, J. S. (2017), ‘Statistical significance for hierarchical clustering’, Biometrics 73(3), 811–821.
  • Kivaranovic & Leeb (2020) Kivaranovic, D. & Leeb, H. (2020), On the length of post-model-selection confidence intervals conditional on polyhedral constraints. To appear in Journal of the American Statistical Association.
  • Kriegeskorte et al. (2009) Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. & Baker, C. I. (2009), ‘Circular analysis in systems neuroscience: the dangers of double dipping’, Nature Neuroscience 12(5), 535.
  • Lähnemann et al. (2020) Lähnemann, D., Köster, J., Szczurek, E., McCarthy, D. J., Hicks, S. C., Robinson, M. D., Vallejos, C. A., Campbell, K. R., Beerenwinkel, N., Mahfouz, A. et al. (2020), ‘Eleven grand challenges in single-cell data science’, Genome Biology 21(1), 1–35.
  • Lance & Williams (1967) Lance, G. N. & Williams, W. T. (1967), ‘A general theory of classificatory sorting strategies: 1. hierarchical systems’, The Computer Journal 9(4), 373–380.
  • Lee et al. (2016) Lee, J. D., Sun, D. L., Sun, Y., Taylor, J. E. et al. (2016), ‘Exact post-selection inference, with application to the lasso’, The Annals of Statistics 44(3), 907–927.
  • Liu et al. (2008) Liu, Y., Hayes, D. N., Nobel, A. & Marron, J. S. (2008), ‘Statistical significance of clustering for high-dimension, low–sample size data’, Journal of the American Statistical Association 103(483), 1281–1293.
  • Loftus & Taylor (2015) Loftus, J. R. & Taylor, J. E. (2015), ‘Selective inference in regression models with groups of variables’, arXiv preprint arXiv:1511.01478 .
  • Luecken & Theis (2019) Luecken, M. D. & Theis, F. J. (2019), ‘Current best practices in single-cell RNA-seq analysis: a tutorial’, Molecular Systems Biology 15(6), e8746.
  • Maitra et al. (2012) Maitra, R., Melnykov, V. & Lahiri, S. N. (2012), ‘Bootstrapping for significance of compact clusters in multi-dimensional datasets’, Journal of the American Statistical Association 107(497), 378–392.
  • Mehrizi & Chenouri (2021) Mehrizi, R. V. & Chenouri, S. (2021), ‘Valid post-detection inference for change points identified using trend filtering’, arXiv preprint arXiv:2104.12022 .
  • Müllner et al. (2013) Müllner, D. et al. (2013), ‘fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python’, Journal of Statistical Software 53(9), 1–18.
  • Murtagh & Contreras (2012) Murtagh, F. & Contreras, P. (2012), ‘Algorithms for hierarchical clustering: an overview’, Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1), 86–97.
  • Reid et al. (2016) Reid, S., Tibshirani, R. & Friedman, J. (2016), ‘A study of error variance estimation in lasso regression’, Statistica Sinica pp. 35–67.
  • Satija et al. (2015) Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. (2015), ‘Spatial reconstruction of single-cell gene expression data’, Nature Biotechnology 33(5), 495–502.
  • Stuart et al. (2019) Stuart, T., Butler, A., Hoffman, P., Hafemeister, C., Papalexi, E., Mauck III, W. M., Hao, Y., Stoeckius, M., Smibert, P. & Satija, R. (2019), ‘Comprehensive integration of single-cell data’, Cell 177(7), 1888–1902.
  • Suzuki & Shimodaira (2006) Suzuki, R. & Shimodaira, H. (2006), ‘Pvclust: an R package for assessing the uncertainty in hierarchical clustering’, Bioinformatics 22(12), 1540–1542.
  • Tian & Taylor (2017) Tian, X. & Taylor, J. (2017), ‘Asymptotics of selective inference’, Scandinavian Journal of Statistics 44(2), 480–499.
  • Tibshirani et al. (2018) Tibshirani, R. J., Rinaldo, A., Tibshirani, R., Wasserman, L. et al. (2018), ‘Uniform asymptotic inference and the bootstrap after model selection’, Annals of Statistics 46(3), 1255–1287.
  • Wood (2015) Wood, S. (2015), mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. R package version 1.8-31, available on CRAN.
  • Yang et al. (2016) Yang, F., Barber, R. F., Jain, P. & Lafferty, J. (2016), Selective inference for group-sparse linear models, in ‘Advances in Neural Information Processing Systems’, pp. 2469–2477.
  • Zhang et al. (2019) Zhang, J. M., Kamath, G. M. & David, N. T. (2019), ‘Valid post-clustering differential analysis for single-cell RNA-seq’, Cell Systems 9(4), 383–392.
  • Zheng et al. (2017) Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., Ziraldo, S. B., Wheeler, T. D., McDermott, G. P., Zhu, J. et al. (2017), ‘Massively parallel digital transcriptional profiling of single cells’, Nature Communications 8(1), 1–12.

Appendix A Proofs

A.1 Proof of Theorem 1

The proof of Theorem 1 is similar to the proof of Theorem 3.1 in Loftus & Taylor (2015), the proof of Lemma 1 in Yang et al. (2016), and the proof of Theorem 3.1 in Chen & Bien (2020).

For any νn\nu\in\mathbb{R}^{n}, we have

𝐗=𝝅ν𝐗+(𝐈n𝝅ν)𝐗=𝝅ν𝐗+(𝐗Tν2ν22)νdir(𝐗Tν)T.\displaystyle{\bf X}=\bm{\pi}_{\nu}^{\perp}{\bf X}+({\bf I}_{n}-\bm{\pi}_{\nu}^{\perp}){\bf X}=\bm{\pi}_{\nu}^{\perp}{\bf X}+\left(\frac{\|{\bf X}^{T}\nu\|_{2}}{\|\nu\|_{2}^{2}}\right)\nu\text{dir}({\bf X}^{T}\nu)^{T}. (A.1)

Therefore,

𝐗\displaystyle{\bf X} =𝝅ν(𝒢1,𝒢2)𝐗+(𝐗Tν(𝒢1,𝒢2)2ν(𝒢1,𝒢2)22)ν(𝒢1,𝒢2)dir(𝐗Tν(𝒢1,𝒢2))T\displaystyle=\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X}+\left(\frac{\|{\bf X}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2})\|_{2}}{\|\nu(\mathcal{G}_{1},\mathcal{G}_{2})\|_{2}^{2}}\right)\nu(\mathcal{G}_{1},\mathcal{G}_{2})~{}\text{dir}({\bf X}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2}))^{T}
=𝝅ν(𝒢1,𝒢2)𝐗+(X¯𝒢1X¯𝒢221|𝒢1|+1|𝒢2|)ν(𝒢1,𝒢2)dir(X¯𝒢1X¯𝒢2)T,\displaystyle=\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X}+\left(\frac{\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}}{\frac{1}{|\mathcal{G}_{1}|}+\frac{1}{|\mathcal{G}_{2}|}}\right)\nu(\mathcal{G}_{1},\mathcal{G}_{2})~{}\text{dir}(\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}})^{T}, (A.2)

where the first equality follows from (A.1), and the second equality follows from the definition of ν(𝒢1,𝒢2)\nu(\mathcal{G}_{1},\mathcal{G}_{2}) in (7). Substituting (A.2) into the definition of p(𝐱;{𝒢1,𝒢2})p({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\}) given by (8) yields

p(𝐱;{𝒢1,𝒢2})\displaystyle p({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\}) (A.3)
=H0{𝒢1,𝒢2}(X¯𝒢1X¯𝒢22x¯𝒢1x¯𝒢22|\displaystyle=\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\Bigg{(}\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}\geq\|\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}}\|_{2}~{}\bigg{|}~{}
𝒢1,𝒢2𝒞(𝝅ν(𝒢1,𝒢2)𝐱+(X¯𝒢1X¯𝒢221|𝒢1|+1|𝒢2|)ν(𝒢1,𝒢2)dir(x¯𝒢1x¯𝒢2)T),\displaystyle\hskip 71.13188pt\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}\left(\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf x}+\left(\frac{\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}}{\frac{1}{|\mathcal{G}_{1}|}+\frac{1}{|\mathcal{G}_{2}|}}\right)\nu(\mathcal{G}_{1},\mathcal{G}_{2})\text{dir}(\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}})^{T}\right),
𝝅ν(𝒢1,𝒢2)𝐗=𝝅ν(𝒢1,𝒢2)𝐱,dir(X¯𝒢1X¯𝒢2)=dir(x¯𝒢1x¯𝒢2)).\displaystyle\hskip 71.13188pt\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X}=\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf x},~{}\text{dir}(\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}})=\text{dir}(\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}})\Bigg{)}.

To simplify (A.3), we now show that

X¯𝒢1X¯𝒢22𝝅ν(𝒢1,𝒢2)𝐗,\displaystyle\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}~{}~{}\rotatebox[origin={c}]{90.0}{$\models$}~{}~{}\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X}, (A.4)

and that under H0{𝒢1,𝒢2}:μ¯𝒢1=μ¯𝒢2H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}:\bar{\mu}_{\mathcal{G}_{1}}=\bar{\mu}_{\mathcal{G}_{2}},

X¯𝒢1X¯𝒢22dir(X¯𝒢1X¯𝒢2).\displaystyle\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}~{}~{}\rotatebox[origin={c}]{90.0}{$\models$}~{}~{}\text{dir}(\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}). (A.5)

First, we will show (A.4). Recall that 𝝅ν\bm{\pi}_{\nu}^{\perp} is the orthogonal projection matrix onto the subspace orthogonal to ν\nu. Thus, 𝝅ν(𝒢1,𝒢2)ν(𝒢1,𝒢2)=0n\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}\nu(\mathcal{G}_{1},\mathcal{G}_{2})=0_{n}. It follows from properties of the matrix normal and multivariate normal distributions that 𝝅ν(𝒢1,𝒢2)𝐗𝐗Tν(𝒢1,𝒢2)\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X}~{}~{}\rotatebox[origin={c}]{90.0}{$\models$}~{}~{}{\bf X}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2}). This implies (A.4), since X¯𝒢1X¯𝒢2=𝐗Tν(𝒢1,𝒢2)\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}={\bf X}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2}). Second, we will show (A.5). It follows from (1) that 𝐗iindNq(μi,σ2𝐈q){\bf X}_{i}\overset{ind}{\sim}N_{q}(\mu_{i},\sigma^{2}{\bf I}_{q}). Thus, under H0{𝒢1,𝒢2}:μ¯𝒢1=μ¯𝒢2H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}:\bar{\mu}_{\mathcal{G}_{1}}=\bar{\mu}_{\mathcal{G}_{2}},

X¯𝒢1X¯𝒢2σ1/|𝒢1|+1/|𝒢2|=𝐗Tν(𝒢1,𝒢2)σν(𝒢1,𝒢2)2Nq(0,𝐈q),\displaystyle\frac{\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}}{\sigma\sqrt{1/|\mathcal{G}_{1}|+1/|\mathcal{G}_{2}|}}=\frac{{\bf X}^{T}\nu(\mathcal{G}_{1},\mathcal{G}_{2})}{\sigma\|\nu(\mathcal{G}_{1},\mathcal{G}_{2})\|_{2}}\sim N_{q}\left(0,{\bf I}_{q}\right), (A.6)

and (A.5) follows from the independence of the length and direction of a standard multivariate normal random vector.

We now apply (A.4) and (A.5) to (A.3). This yields

=H0{𝒢1,𝒢2}(X¯𝒢1X¯𝒢22x¯𝒢1x¯𝒢22|𝒢1,𝒢2𝒞(𝝅ν(𝒢1,𝒢2)𝐱+(X¯𝒢1X¯𝒢221|𝒢1|+1|𝒢2|)ν(𝒢1,𝒢2)dir(x¯𝒢1x¯𝒢2)T))\displaystyle=\scalebox{0.9}{$\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\Bigg{(}\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}\geq\|\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}}\|_{2}~{}\bigg{|}~{}\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}\Bigg{(}\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf x}+\Bigg{(}\frac{\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}}{\frac{1}{|\mathcal{G}_{1}|}+\frac{1}{|\mathcal{G}_{2}|}}\Bigg{)}\nu(\mathcal{G}_{1},\mathcal{G}_{2})\text{dir}(\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}})^{T}\Bigg{)}\Bigg{)}$}
=H0{𝒢1,𝒢2}(X¯𝒢1X¯𝒢22x¯𝒢1x¯𝒢22|X¯𝒢1X¯𝒢22𝒮(𝐱;{𝒢1,𝒢2})),\displaystyle=\scalebox{0.9}{$\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\left(\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}\geq\|\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}}\|_{2}~{}\bigg{|}~{}\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}\in\mathcal{S}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\right)$}, (A.7)

where (A.7) follows from the definition of 𝒮(𝐱;{𝒢1,𝒢2})\mathcal{S}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\}) in (10). Under H0{𝒢1,𝒢2}:μ¯𝒢1=μ¯𝒢2H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}:\bar{\mu}_{\mathcal{G}_{1}}=\bar{\mu}_{\mathcal{G}_{2}},

X¯𝒢1X¯𝒢222σ2(1/|𝒢1|+1/|𝒢2|)χq2,\displaystyle\frac{\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}^{2}}{\sigma^{2}\left(1/|\mathcal{G}_{1}|+1/|\mathcal{G}_{2}|\right)}\sim\chi^{2}_{q},

by (A.6). That is, under H0{𝒢1,𝒢2}H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}, we have X¯𝒢1X¯𝒢22(σ1/|𝒢1|+1/|𝒢2|)χq\|\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}}\|_{2}\sim\left(\sigma\sqrt{1/|\mathcal{G}_{1}|+1/|\mathcal{G}_{2}|}\right)\cdot\chi_{q}. Therefore, for ϕ(σ1/|𝒢1|+1/|𝒢2|)χq\phi\sim\left(\sigma\sqrt{1/|\mathcal{G}_{1}|+1/|\mathcal{G}_{2}|}\right)\cdot\chi_{q}, it follows from (A.7) that

p(𝐱;{𝒢1,𝒢2})\displaystyle p({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\}) =(ϕx¯𝒢1x¯𝒢22ϕ𝒮(𝐱;{𝒢1,𝒢2}))\displaystyle=\mathbb{P}\left(\phi\geq\|\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}}\|_{2}\mid\phi\in\mathcal{S}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\right)
=1𝔽(x¯𝒢1x¯𝒢22;σ1/|𝒢1|+1/|𝒢2|,𝒮(𝐱;{𝒢1,𝒢2})),\displaystyle=1-\mathbb{F}\left(\|\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}}\|_{2};\sigma\sqrt{1/|\mathcal{G}_{1}|+1/|\mathcal{G}_{2}|},\mathcal{S}({\bf x};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\right), (A.8)

since 𝔽(t;c,𝒮)\mathbb{F}(t;c,\mathcal{S}) denotes the cumulative distribution function of a cχqc\cdot\chi_{q} random variable truncated to the set 𝒮\mathcal{S}. This is (9).
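In practice, when the conditioning set is a finite union of intervals (as computed in Section 3), the truncated survival function in (A.8) reduces to ordinary chi-squared tail probabilities. The following R sketch evaluates it, assuming the set is stored as a two-column matrix of interval endpoints.

```r
# Sketch: 1 - F(t; c, S) for a c * chi_q variable truncated to S, where S
# is a finite union of intervals (rows of a two-column matrix [lo, hi]).
# Uses P(c * chi_q in [lo, hi]) = pchisq((hi/c)^2, q) - pchisq((lo/c)^2, q).
trunc_chi_surv <- function(t, c, q, S) {
  pr <- function(lo, hi)
    pchisq((hi / c)^2, df = q) - pchisq((pmax(lo, 0) / c)^2, df = q)
  denom <- sum(pr(S[, 1], S[, 2]))
  num   <- sum(pr(pmax(S[, 1], t), pmax(S[, 2], t)))  # clip each interval at t
  num / denom
}

S <- rbind(c(2, 4), c(6, Inf))       # example: S = [2, 4] U [6, Inf)
trunc_chi_surv(5, c = 1, q = 10, S)  # mass of S above t = 5, renormalized
```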

Finally, we will show (11). It follows from the definition of p(;{𝒢1,𝒢2})p(\cdot;\{\mathcal{G}_{1},\mathcal{G}_{2}\}) in (8) that for all 𝐱n×q{\bf x}\in\mathbb{R}^{n\times q}, we have

H0{𝒢1,𝒢2}(p(𝐗;{𝒢1,𝒢2})α𝒢1,𝒢2𝒞(𝐗),𝝅ν(𝒢1,𝒢2)𝐗=𝝅ν(𝒢1,𝒢2)𝐱,\displaystyle\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\Big{(}p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X}),\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X}=\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf x},
dir(X¯𝒢1X¯𝒢2)=dir(x¯𝒢1x¯𝒢2))=α.\displaystyle\hskip 159.33542pt\text{dir}(\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}})=\text{dir}(\bar{x}_{\mathcal{G}_{1}}-\bar{x}_{\mathcal{G}_{2}})\Big{)}=\alpha. (A.9)

Therefore, letting 𝔼H0\mathbb{E}_{H_{0}} denote 𝔼H0{𝒢1,𝒢2}\mathbb{E}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}, we have

H0{𝒢1,𝒢2}(p(𝐗;{𝒢1,𝒢2})α|𝒢1,𝒢2𝒞(𝐗))\displaystyle\mathbb{P}_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}}\Big{(}p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha~{}\Big{|}~{}\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\Big{)}
=𝔼H0[𝟙{p(𝐗;{𝒢1,𝒢2})α}|𝒢1,𝒢2𝒞(𝐗)]\displaystyle=\mathbb{E}_{H_{0}}\Big{[}\mathds{1}\{p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\}~{}\Big{|}~{}\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\Big{]}
=𝔼H0[𝔼H0[𝟙{p(𝐗;{𝒢1,𝒢2})α}|𝒢1,𝒢2𝒞(𝐗),𝝅ν(𝒢1,𝒢2)𝐗,dir(X¯𝒢1X¯𝒢2)]|𝒢1,𝒢2𝒞(𝐗)]\displaystyle=\mathbb{E}_{H_{0}}\Bigg{[}\mathbb{E}_{H_{0}}\Big{[}\mathds{1}\{p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\}~{}\Big{|}~{}\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X}),\bm{\pi}_{\nu(\mathcal{G}_{1},\mathcal{G}_{2})}^{\perp}{\bf X},\text{dir}(\bar{X}_{\mathcal{G}_{1}}-\bar{X}_{\mathcal{G}_{2}})\Big{]}\bigg{|}~{}\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\Bigg{]}
=𝔼H0[α|𝒢1,𝒢2𝒞(𝐗)]\displaystyle=\mathbb{E}_{H_{0}}\Big{[}\alpha~{}\Big{|}~{}\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\Big{]}
=α,\displaystyle=\alpha,

where the second equality follows from the law of total expectation, and the third equality follows from (A.9). This is (11).

A.2 Proof of Lemma 1

First, we state and prove a preliminary result involving the dissimilarities between groups of observations in any perturbed data set 𝐱(ϕ){\bf x}^{\prime}(\phi), which holds for any clustering method 𝒞\mathcal{C}.

Lemma A.1.

For any ϕ0\phi\geq 0, and for any two sets 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime} that are both contained in 𝒞^1\mathcal{\hat{C}}_{1}, both contained in 𝒞^2\mathcal{\hat{C}}_{2}, or both contained in {1,2,,n}\[𝒞^1𝒞^2]\{1,2,\ldots,n\}\backslash[\mathcal{\hat{C}}_{1}\cup\mathcal{\hat{C}}_{2}], we have that

d(𝒢,𝒢;𝐱(ϕ))=d(𝒢,𝒢;𝐱),\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))=d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}),

where 𝐱(ϕ){\bf x}^{\prime}(\phi) is defined in (13).

Proof.

By (13), [𝐱(ϕ)]i[𝐱(ϕ)]i=xixi[{\bf x}^{\prime}(\phi)]_{i}-[{\bf x}^{\prime}(\phi)]_{i^{\prime}}=x_{i}-x_{i^{\prime}} for all i,i𝒢𝒢i,i^{\prime}\in\mathcal{G}\cup\mathcal{G}^{\prime}, and therefore all pairwise distances are preserved for all i,i𝒢𝒢i,i^{\prime}\in\mathcal{G}\cup\mathcal{G}^{\prime}. The result follows. ∎

Suppose that 𝒞=𝒞(nK+1)\mathcal{C}=\mathcal{C}^{(n-K+1)}. Then, (15) follows immediately from observing that 𝒲1(t)(𝐱)\mathcal{W}_{1}^{(t)}({\bf x}) and 𝒲2(t)(𝐱)\mathcal{W}_{2}^{(t)}({\bf x}) are clusters merged at an earlier stage of the hierarchical clustering algorithm and thus must satisfy the condition in Lemma A.1.

We will now show (16). Applying the right-hand side of (16) with t=nK+1t=n-K+1 yields the left-hand side, so the ()(\Leftarrow) direction holds. We will now prove the ()(\Rightarrow) direction by contradiction. Suppose that there exists some t{1,,nK}t\in\{1,\ldots,n-K\} such that

𝒞(t+1)(𝐱(ϕ))𝒞(t+1)(𝐱),\displaystyle\mathcal{C}^{(t+1)}({\bf x}^{\prime}(\phi))\neq\mathcal{C}^{(t+1)}({\bf x}), (A.10)

and let tt specifically denote the smallest such value. Then, {𝒲1(t)(𝐱(ϕ)),𝒲2(t)(𝐱(ϕ))}{𝒲1(t)(𝐱),𝒲2(t)(𝐱)}\{\mathcal{W}_{1}^{(t)}({\bf x}^{\prime}(\phi)),\mathcal{W}_{2}^{(t)}({\bf x}^{\prime}(\phi))\}\neq\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\}, which implies that

d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱(ϕ))>d(𝒲1(t)(𝐱(ϕ)),𝒲2(t)(𝐱(ϕ));𝐱(ϕ)).\displaystyle d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}^{\prime}(\phi)\right)>d\left(\mathcal{W}_{1}^{(t)}({\bf x}^{\prime}(\phi)),\mathcal{W}_{2}^{(t)}({\bf x}^{\prime}(\phi));{\bf x}^{\prime}(\phi)\right). (A.11)

Since clusters cannot unmerge once they have merged, and 𝒞^1,𝒞^2𝒞(nK+1)(𝐱)\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\in\mathcal{C}^{(n-K+1)}({\bf x}) by definition, it must be the case that 𝒲1(t)(𝐱)\mathcal{W}_{1}^{(t)}({\bf x}) and 𝒲2(t)(𝐱)\mathcal{W}_{2}^{(t)}({\bf x}) are both in 𝒞^1\mathcal{\hat{C}}_{1}, are both in 𝒞^2\mathcal{\hat{C}}_{2}, or are both in [𝒞^1𝒞^2]C[\mathcal{\hat{C}}_{1}\cup\mathcal{\hat{C}}_{2}]^{C}. Similarly, since we assumed that 𝒞^1,𝒞^2𝒞(nK+1)(𝐱(ϕ))\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\in\mathcal{C}^{(n-K+1)}({\bf x}^{\prime}(\phi)), it must be the case that 𝒲1(t)(𝐱(ϕ))\mathcal{W}_{1}^{(t)}({\bf x}^{\prime}(\phi)) and 𝒲2(t)(𝐱(ϕ))\mathcal{W}_{2}^{(t)}({\bf x}^{\prime}(\phi)) are both in 𝒞^1\mathcal{\hat{C}}_{1}, are both in 𝒞^2\mathcal{\hat{C}}_{2}, or are both in [𝒞^1𝒞^2]C[\mathcal{\hat{C}}_{1}\cup\mathcal{\hat{C}}_{2}]^{C}. Thus, we can apply Lemma A.1 to both sides of (A.11) to yield

d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)>d(𝒲1(t)(𝐱(ϕ)),𝒲2(t)(𝐱(ϕ));𝐱),\displaystyle d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)>d\left(\mathcal{W}_{1}^{(t)}({\bf x}^{\prime}(\phi)),\mathcal{W}_{2}^{(t)}({\bf x}^{\prime}(\phi));{\bf x}\right), (A.12)

which implies that {𝒲1(t)(𝐱),𝒲2(t)(𝐱)}\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\} is not the pair of clusters that merged in the (t)th(t)^{th} step of the hierarchical clustering procedure applied to 𝐱{\bf x}. This is a contradiction. Therefore, for all t{1,,nK}t\in\{1,\ldots,n-K\} such that 𝒞(t)(𝐱(ϕ))=𝒞(t)(𝐱)\mathcal{C}^{(t)}({\bf x}^{\prime}(\phi))=\mathcal{C}^{(t)}({\bf x}), it must be the case that 𝒞(t+1)(𝐱(ϕ))=𝒞(t+1)(𝐱)\mathcal{C}^{(t+1)}({\bf x}^{\prime}(\phi))=\mathcal{C}^{(t+1)}({\bf x}). Observing that 𝒞(1)(𝐱(ϕ))=𝒞(1)(𝐱)={{1},{2},,{n}}\mathcal{C}^{(1)}({\bf x}^{\prime}(\phi))=\mathcal{C}^{(1)}({\bf x})=\{\{1\},\{2\},\ldots,\{n\}\} for all ϕ0\phi\geq 0 completes the proof of the ()(\Rightarrow) direction.

A.3 Proof of Theorem 2

Observe from Algorithm 1 that

𝒞(t)(𝐱(ϕ))=𝒞(t)(𝐱) for all t=1,2,,nK+1,\displaystyle\mathcal{C}^{(t)}({\bf x}^{\prime}(\phi))=\mathcal{C}^{(t)}({\bf x})\text{ for all }t=1,2,\ldots,n-K+1, (A.13)

if and only if for all t=1,2,,nKt=1,2,\ldots,n-K,

d(𝒢,𝒢;𝐱(ϕ))>d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱(ϕ)),\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))>d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}^{\prime}(\phi)\right), (A.14)
𝒢,𝒢𝒞(t)(𝐱),𝒢𝒢,{𝒢,𝒢}{𝒲1(t)(𝐱),𝒲2(t)(𝐱)}.\displaystyle\forall~{}\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(t)}({\bf x}),~{}\mathcal{G}\neq\mathcal{G^{\prime}},~{}\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}.

Equation (15) in Lemma 1 says that

d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱(ϕ))=d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱),t=1,2,,nK,ϕ0.\displaystyle d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}^{\prime}(\phi)\right)=d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right),~{}\forall~{}t=1,2,\ldots,n-K,~{}\forall~{}\phi\geq 0.

Therefore, (A.13) is true if and only if for all t=1,2,,nKt=1,2,\ldots,n-K,

d(𝒢,𝒢;𝐱(ϕ))>d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱),\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))>d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right), (A.15)
𝒢,𝒢𝒞(t)(𝐱),𝒢𝒢,{𝒢,𝒢}{𝒲1(t)(𝐱),𝒲2(t)(𝐱)}.\displaystyle\forall~{}\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(t)}({\bf x}),~{}\mathcal{G}\neq\mathcal{G^{\prime}},~{}\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}.

Recall that (𝐱)=t=1nK{{𝒢,𝒢}:𝒢,𝒢𝒞(t)(𝐱),𝒢𝒢,{𝒢,𝒢}{𝒲1(t)(𝐱),𝒲2(t)(𝐱)}}\mathcal{L}({\bf x})=\bigcup\limits_{t=1}^{n-K}\left\{\{\mathcal{G},\mathcal{G}^{\prime}\}:\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(t)}({\bf x}),\mathcal{G}\neq\mathcal{G}^{\prime},\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}\right\}. Observe that (A.15) is true for all t=1,2,,nKt=1,2,\ldots,n-K if and only if for all {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}),

d(𝒢,𝒢;𝐱(ϕ))>maxl𝒢,𝒢(𝐱)tu𝒢,𝒢(𝐱)d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱),\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))>\underset{l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right), (A.16)

where l𝒢,𝒢(𝐱)=min{1tnK:𝒢,𝒢𝒞(t)(𝐱),{𝒢,𝒢}{𝒲1(t)(𝐱),𝒲2(t)(𝐱)}}l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})=\min\left\{1\leq t\leq n-K:\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(t)}({\bf x}),\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}\right\} and u𝒢,𝒢(𝐱)=max{1tnK:𝒢,𝒢𝒞(t)(𝐱),{𝒢,𝒢}{𝒲1(t)(𝐱),𝒲2(t)(𝐱)}}u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})=\max\left\{1\leq t\leq n-K:\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(t)}({\bf x}),\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\}\right\} are the start and end of the lifetime of the cluster pair {𝒢,𝒢}\{\mathcal{G},\mathcal{G}^{\prime}\}, respectively. Therefore, it follows from (16) in Lemma 1 that 𝒞^1,𝒞^2𝒞(𝐱(ϕ))\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\in\mathcal{C}({\bf x}^{\prime}(\phi)) if and only if (A.16) is true for all {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}). Recalling from (12) that 𝒮^={ϕ0:𝒞^1,𝒞^2𝒞(𝐱(ϕ))}\mathcal{\hat{S}}=\{\phi\geq 0:\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\in\mathcal{C}({\bf x}^{\prime}(\phi))\}, it follows that

𝒮^={𝒢,𝒢}(𝐱){ϕ0:d(𝒢,𝒢;𝐱(ϕ))>maxl𝒢,𝒢(𝐱)tu𝒢,𝒢(𝐱)d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)}.\displaystyle\mathcal{\hat{S}}=\bigcap\limits_{\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x})}\left\{\phi\geq 0:d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))>\underset{l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)\right\}.

This is (18). Furthermore, this intersection involves 𝒪(n2)\mathcal{O}(n^{2}) sets, because

|(𝐱)|=((n2)1)+t=1nK1(nt1).\displaystyle|\mathcal{L}({\bf x})|=\left(\binom{n}{2}-1\right)+\sum\limits_{t=1}^{n-K-1}(n-t-1).
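This count can be sanity-checked numerically; a minimal R sketch under the same bookkeeping (each of the nK1n-K-1 later merges creates nt1n-t-1 new pairs involving the newly formed cluster):

```r
# Sanity check of |L(x)| = (choose(n, 2) - 1) + sum_{t=1}^{n-K-1} (n - t - 1).
L_size <- function(n, K) (choose(n, 2) - 1) + sum(n - seq_len(n - K - 1) - 1)

L_size(150, 3)                   # number of losing pairs for n = 150, K = 3
L_size(300, 3) / L_size(150, 3)  # roughly 4: consistent with O(n^2) growth
```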

A.4 Proof of Proposition 2

Let {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}), where (𝐱)\mathcal{L}({\bf x}) in (17) is the set of losing cluster pairs in 𝐱{\bf x}. Suppose that we are given the start and end of the lifetime of this cluster pair in 𝐱{\bf x}, l𝒢,𝒢(𝐱)l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) and u𝒢,𝒢(𝐱)u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}), and we are given the set of steps where inversions occur in the dendrogram of 𝐱{\bf x} below the (nK)(n-K)th merge,

(𝐱)={1t<nK:d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)>d(𝒲1(t+1)(𝐱),𝒲2(t+1)(𝐱);𝐱)}.\mathcal{M}({\bf x})=\left\{1\leq t<n-K:d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)>d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right)\right\}.

Observe that for h𝒢,𝒢(𝐱)h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) in (19),

h𝒢,𝒢(𝐱)\displaystyle h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) =maxl𝒢,𝒢(𝐱)tu𝒢,𝒢(𝐱)d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)=maxt𝒢,𝒢(𝐱){u𝒢,𝒢(𝐱)}d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱),\displaystyle=\underset{l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)=\underset{t\in\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\cup\{u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\}}{\max}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right),

where 𝒢,𝒢(𝐱)={t:l𝒢,𝒢(𝐱)t<u𝒢,𝒢(𝐱),d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)>d(𝒲1(t+1)(𝐱),𝒲2(t+1)(𝐱);𝐱)}\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})=\left\{t:l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\leq t<u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}),d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)>d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right)\right\}. Therefore, we can compute h𝒢,𝒢(𝐱)h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) for each cluster pair {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}) as follows:

  1. Compute 𝒢,𝒢(𝐱)\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) by checking every step in (𝐱)\mathcal{M}({\bf x}) to see if it is in [l𝒢,𝒢(𝐱),u𝒢,𝒢(𝐱))\Big{[}l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}),u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\Big{)}.

  2. Compute h𝒢,𝒢(𝐱)h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) by taking the max over |𝒢,𝒢(𝐱)|+1|\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})|+1 merge heights.

Step 1 requires 2|(𝐱)|2|\mathcal{M}({\bf x})| operations, and Step 2 requires |𝒢,𝒢(𝐱)|+1|\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})|+1 operations. Since |𝒢,𝒢(𝐱)|<|(𝐱)||\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})|<|\mathcal{M}({\bf x})|, it follows that we can compute h𝒢,𝒢(𝐱)h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) in 𝒪(|(𝐱)|+1)\mathcal{O}(|\mathcal{M}({\bf x})|+1) time.
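In code, these two steps amount to filtering the inversion steps to the pair's lifetime and taking a maximum; a small R sketch:

```r
# Sketch of the two-step computation of h_{G,G'}(x): given the vector of
# merge heights, the inversion steps M(x), and the lifetime [l, u] of a
# losing pair, take the max height over (M(x) restricted to [l, u)) U {u}.
h_pair <- function(heights, inversion_steps, l, u) {
  steps <- c(inversion_steps[inversion_steps >= l & inversion_steps < u], u)
  max(heights[steps])
}

heights    <- c(1, 4, 5, 2, 6)            # toy merge heights with an inversion
inversions <- which(diff(heights) < 0)    # steps t with height(t) > height(t+1)
h_pair(heights, inversions, l = 2, u = 4) # max over steps {3, 4}: returns 5
```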

A.5 Proof of Proposition 4

Recall that

𝒮^i,i={ϕ0:d({i},{i};𝐱(ϕ))>max{𝒢,𝒢}(𝐱):i𝒢,i𝒢h𝒢,𝒢(𝐱)}.\displaystyle\mathcal{\hat{S}}_{i,i^{\prime}}=\left\{\phi\geq 0:d(\{i\},\{i^{\prime}\};{\bf x}^{\prime}(\phi))>\underset{\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}):i\in\mathcal{G},i^{\prime}\in\mathcal{G}^{\prime}}{\max}~{}h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\right\}. (A.17)

We will consider two cases: either

i,i𝒞^1 or i,i𝒞^2 or i,i𝒞^1𝒞^2,\displaystyle i,i^{\prime}\in\mathcal{\hat{C}}_{1}\text{ or }i,i^{\prime}\in\mathcal{\hat{C}}_{2}\text{ or }i,i^{\prime}\not\in\mathcal{\hat{C}}_{1}\cup\mathcal{\hat{C}}_{2}, (A.18)

or (A.18) does not hold.

Case 1: (A.18) holds

We first state and prove a preliminary result.

Lemma A.2.

Suppose that 𝒞=𝒞(nK+1)\mathcal{C}=\mathcal{C}^{(n-K+1)}, i.e. we perform hierarchical clustering to obtain KK clusters. Let {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}), i𝒢i\in\mathcal{G}, i𝒢i^{\prime}\in\mathcal{G}^{\prime}. Suppose that (A.18) holds. Then, for all ϕ0\phi\geq 0,

d(𝒢,𝒢;𝐱(ϕ))=d(𝒢,𝒢;𝐱).\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))=d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}). (A.19)
Proof.

Since {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}), it follows that 𝒢,𝒢t=1nK𝒞(t)(𝐱)\mathcal{G},\mathcal{G}^{\prime}\in\bigcup\limits_{t=1}^{n-K}\mathcal{C}^{(t)}({\bf x}). Furthermore, clusters that merge cannot unmerge, and 𝒞^1,𝒞^2𝒞(nK+1)(𝐱)\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\in\mathcal{C}^{(n-K+1)}({\bf x}). This means that all observations in the set 𝒢\mathcal{G} must belong to the same cluster in 𝒞(nK+1)(𝐱)\mathcal{C}^{(n-K+1)}({\bf x}), and all observations in the set 𝒢\mathcal{G^{\prime}} must belong to the same cluster in 𝒞(nK+1)(𝐱)\mathcal{C}^{(n-K+1)}({\bf x}). Therefore, (A.18) implies that either

𝒢,𝒢𝒞^1 or 𝒢,𝒢𝒞^2 or 𝒢,𝒢{1,2,,n}\[𝒞^1𝒞^2].\displaystyle\mathcal{G},\mathcal{G}^{\prime}\subseteq\mathcal{\hat{C}}_{1}\text{ or }\mathcal{G},\mathcal{G}^{\prime}\subseteq\mathcal{\hat{C}}_{2}\text{ or }\mathcal{G},\mathcal{G}^{\prime}\subseteq\{1,2,\ldots,n\}\backslash[\mathcal{\hat{C}}_{1}\cup\mathcal{\hat{C}}_{2}].

It follows from Lemma A.1 that (A.19) holds. ∎

Suppose that i,ii,i^{\prime} satisfy (A.18). Recall from Section 3.2 that

d(𝒢,𝒢;𝐱)>h𝒢,𝒢(𝐱){𝒢,𝒢}(𝐱).\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x})>h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\quad\forall~{}\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}). (A.20)

It follows from (A.20) and Lemma A.2 that for all ϕ0\phi\geq 0,

d(𝒢,𝒢;𝐱(ϕ))=d(𝒢,𝒢;𝐱)>h𝒢,𝒢(𝐱){𝒢,𝒢}(𝐱) s.t. i𝒢,i𝒢.\displaystyle d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))=d(\mathcal{G},\mathcal{G}^{\prime};{\bf x})>h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\quad\forall~{}\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x})\text{ s.t. }i\in\mathcal{G},i^{\prime}\in\mathcal{G}^{\prime}. (A.21)

Recall that single linkage defines d(𝒢,𝒢;𝐱(ϕ))=mini𝒢,i𝒢d({i},{i};𝐱(ϕ))d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))=\underset{i\in\mathcal{G},i^{\prime}\in\mathcal{G}^{\prime}}{\min}d(\{i\},\{i^{\prime}\};{\bf x}^{\prime}(\phi)). Thus, it follows from (A.21) that for all ϕ0\phi\geq 0,

d({i},{i};𝐱(ϕ))>h𝒢,𝒢(𝐱){𝒢,𝒢}(𝐱) s.t. i𝒢,i𝒢.\displaystyle d(\{i\},\{i^{\prime}\};{\bf x}^{\prime}(\phi))>h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\quad\forall~{}\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x})\text{ s.t. }i\in\mathcal{G},i^{\prime}\in\mathcal{G}^{\prime}. (A.22)

Applying (A.22) to the definition of 𝒮^i,i\mathcal{\hat{S}}_{i,i^{\prime}} in (A.17) yields 𝒮^i,i=[0,)\mathcal{\hat{S}}_{i,i^{\prime}}=[0,\infty).

Case 2: (A.18) does not hold

Suppose that i,ii,i^{\prime} do not satisfy (A.18). Then, ii and ii^{\prime} are assigned to different clusters in 𝒞(𝐱)=𝒞(nK+1)(𝐱)\mathcal{C}({\bf x})=\mathcal{C}^{(n-K+1)}({\bf x}). Since 𝒞(nK+1)(𝐱)\mathcal{C}^{(n-K+1)}({\bf x}) is created by merging a pair of clusters in 𝒞(nK)(𝐱)\mathcal{C}^{(n-K)}({\bf x}), it follows that ii and ii^{\prime} are assigned to different clusters 𝒢\mathcal{G} and 𝒢\mathcal{G}^{\prime} in 𝒞(nK)(𝐱)\mathcal{C}^{(n-K)}({\bf x}) such that {𝒢,𝒢}\{\mathcal{G},\mathcal{G}^{\prime}\} is not the winning pair. That is, there exists 𝒢,𝒢𝒞(nK)(𝐱)\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(n-K)}({\bf x}) such that i𝒢i\in\mathcal{G}, and i𝒢i^{\prime}\in\mathcal{G}^{\prime}, 𝒢𝒢\mathcal{G}\neq\mathcal{G}^{\prime}, and {𝒢,𝒢}{𝒲1(nK)(𝐱),𝒲2(nK)(𝐱)}\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\left\{\mathcal{W}_{1}^{(n-K)}({\bf x}),\mathcal{W}_{2}^{(n-K)}({\bf x})\right\}. This means that {𝒢,𝒢}(𝐱)\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}), with u𝒢,𝒢(𝐱)=nKu_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})=n-K. Furthermore, single linkage cannot produce inversions (Murtagh & Contreras 2012), i.e. d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)<d(𝒲1(t+1)(𝐱),𝒲2(t+1)(𝐱);𝐱)d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)<d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right) for all t<n1t<n-1. It follows that

max{𝒢,𝒢}(𝐱):i𝒢,i𝒢h𝒢,𝒢(𝐱)\displaystyle\underset{\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}):i\in\mathcal{G},i^{\prime}\in\mathcal{G}^{\prime}}{\max}~{}h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x}) =max{𝒢,𝒢}(𝐱):i𝒢,i𝒢maxl𝒢,𝒢(𝐱)tu𝒢,𝒢(𝐱)d(𝒲1(t)(𝐱),𝒲2(t)(𝐱);𝐱)=\underset{\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x}):i\in\mathcal{G},i^{\prime}\in\mathcal{G}^{\prime}}{\max}~{}\underset{l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\leq t\leq u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})}{\max}~{}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)
=d(𝒲1(nK)(𝐱),𝒲2(nK)(𝐱);𝐱).\displaystyle=d\left(\mathcal{W}^{(n-K)}_{1}({\bf x}),\mathcal{W}^{(n-K)}_{2}({\bf x});{\bf x}\right). (A.23)

Applying (A.23) to the definition of 𝒮^i,i\mathcal{\hat{S}}_{i,i^{\prime}} in (A.17) yields

𝒮^i,i={ϕ0:d({i},{i};𝐱(ϕ))>d(𝒲1(nK)(𝐱),𝒲2(nK)(𝐱);𝐱)}.\mathcal{\hat{S}}_{i,i^{\prime}}=\left\{\phi\geq 0:d(\{i\},\{i^{\prime}\};{\bf x}^{\prime}(\phi))>d\left(\mathcal{W}^{(n-K)}_{1}({\bf x}),\mathcal{W}^{(n-K)}_{2}({\bf x});{\bf x}\right)\right\}.

A.6 Proof of Theorem 4

The proof of Theorem 4 is adapted from Appendices A.10 and A.11 in Tibshirani et al. (2018). We first state and prove an intermediate result.

Lemma A.3.

Recall that F(t;c,𝒮)F(t;c,\mathcal{S}) denotes the CDF of a cχqc\cdot\chi_{q} random variable truncated to the set 𝒮\mathcal{S}. If 0<c1<c20<c_{1}<c_{2}, then

1F(t;c1,𝒮)<1F(t;c2,𝒮),for all t𝒮.1-F(t;c_{1},\mathcal{S})<1-F(t;c_{2},\mathcal{S}),\quad\text{for all }t\in\mathcal{S}.
Proof.

Let f(t;c,𝒮)f(t;c,\mathcal{S}) denote the probability density function (pdf) of a cχqc\cdot\chi_{q} random variable truncated to the set 𝒮\mathcal{S}. Observe that

f(t;c,𝒮)\displaystyle f(t;c,\mathcal{S}) =12q21Γ(q/2)cqtq1et22c2𝟙{t𝒮}12q21Γ(q/2)cquq1eu22c2𝟙{u𝒮}𝑑u\displaystyle=\frac{\frac{1}{2^{\frac{q}{2}-1}\Gamma(q/2)}c^{-q}t^{q-1}e^{-\frac{t^{2}}{2c^{2}}}\mathds{1}\{t\in\mathcal{S}\}}{\int\frac{1}{2^{\frac{q}{2}-1}\Gamma(q/2)}c^{-q}u^{q-1}e^{-\frac{u^{2}}{2c^{2}}}\mathds{1}\{u\in\mathcal{S}\}~{}du}
=tq1et22c2𝟙{t𝒮}uq1eu22c2𝟙{u𝒮}𝑑u\displaystyle=\frac{t^{q-1}e^{-\frac{t^{2}}{2c^{2}}}\mathds{1}\{t\in\mathcal{S}\}}{\int u^{q-1}e^{-\frac{u^{2}}{2c^{2}}}\mathds{1}\{u\in\mathcal{S}\}~{}du}
=(tq1𝟙{t𝒮}uq1eu22c2𝟙{u𝒮}𝑑u)(exp{t22c2}).\displaystyle=\left(\frac{t^{q-1}\mathds{1}\{t\in\mathcal{S}\}}{\int u^{q-1}e^{-\frac{u^{2}}{2c^{2}}}\mathds{1}\{u\in\mathcal{S}\}~{}du}\right)\left(\exp\left\{-\frac{t^{2}}{2c^{2}}\right\}\right).

Thus, {f(t;c,𝒮)}c>0\{f(t;c,\mathcal{S})\}_{c>0} is an exponential family with natural parameter 1c2\frac{1}{c^{2}}. Therefore, it has a monotone non-increasing likelihood ratio in its sufficient statistic, t2/2-t^{2}/2. This means that for any fixed 0<c1<c20<c_{1}<c_{2},

f(u2;c1,𝒮)f(u2;c2,𝒮)<f(u1;c1,𝒮)f(u1;c2,𝒮),for all u1,u2𝒮,u1<u2.\displaystyle\frac{f(u_{2};c_{1},\mathcal{S})}{f(u_{2};c_{2},\mathcal{S})}<\frac{f(u_{1};c_{1},\mathcal{S})}{f(u_{1};c_{2},\mathcal{S})},\quad\text{for all }u_{1},u_{2}\in\mathcal{S},~{}u_{1}<u_{2}. (A.24)

Rearranging terms in (A.24) yields:

f(u2;c1,𝒮)f(u1;c2,𝒮)<f(u1;c1,𝒮)f(u2;c2,𝒮),for all u1,u2𝒮,u1<u2.\displaystyle f(u_{2};c_{1},\mathcal{S})f(u_{1};c_{2},\mathcal{S})<f(u_{1};c_{1},\mathcal{S})f(u_{2};c_{2},\mathcal{S}),\quad\text{for all }u_{1},u_{2}\in\mathcal{S},~{}u_{1}<u_{2}. (A.25)

Let t,u2𝒮t,u_{2}\in\mathcal{S} with t<u2t<u_{2}. Then,

f(u2;c1,𝒮)F(t;c2,𝒮)\displaystyle f(u_{2};c_{1},\mathcal{S})F(t;c_{2},\mathcal{S}) =tf(u2;c1,𝒮)f(u1;c2,𝒮)𝑑u1\displaystyle=\int_{-\infty}^{t}f(u_{2};c_{1},\mathcal{S})f(u_{1};c_{2},\mathcal{S})du_{1}
<tf(u1;c1,𝒮)f(u2;c2,𝒮)𝑑u1=F(t;c1,𝒮)f(u2;c2,𝒮),\displaystyle<\int_{-\infty}^{t}f(u_{1};c_{1},\mathcal{S})f(u_{2};c_{2},\mathcal{S})du_{1}=F(t;c_{1},\mathcal{S})f(u_{2};c_{2},\mathcal{S}),

where the inequality follows from (A.25). That is, we have shown that

f(u2;c1,𝒮)F(u1;c2,𝒮)<F(u1;c1,𝒮)f(u2;c2,𝒮),for all u1,u2𝒮,u1<u2.\displaystyle f(u_{2};c_{1},\mathcal{S})F(u_{1};c_{2},\mathcal{S})<F(u_{1};c_{1},\mathcal{S})f(u_{2};c_{2},\mathcal{S}),\quad\text{for all }u_{1},u_{2}\in\mathcal{S},~{}u_{1}<u_{2}. (A.26)

Let t𝒮t\in\mathcal{S}. Then,

(1F(t;c1,𝒮))F(t;c2,𝒮)\displaystyle(1-F(t;c_{1},\mathcal{S}))F(t;c_{2},\mathcal{S}) =tf(u2;c1,𝒮)F(t;c2,𝒮)𝑑u2\displaystyle=\int_{t}^{\infty}f(u_{2};c_{1},\mathcal{S})F(t;c_{2},\mathcal{S})du_{2} (A.27)
<tF(t;c1,𝒮)f(u2;c2,𝒮)𝑑u2=F(t;c1,𝒮)(1F(t;c2,𝒮)),\displaystyle<\int_{t}^{\infty}F(t;c_{1},\mathcal{S})f(u_{2};c_{2},\mathcal{S})du_{2}=F(t;c_{1},\mathcal{S})(1-F(t;c_{2},\mathcal{S})),

where the inequality follows from (A.26) and the fact that f(t;c,𝒮)=0f(t;c,\mathcal{S})=0 for all t𝒮t\not\in\mathcal{S}. Rearranging terms in (A.27) yields

1F(t;c1,𝒮)<1F(t;c2,𝒮).1-F(t;c_{1},\mathcal{S})<1-F(t;c_{2},\mathcal{S}).
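This monotonicity can also be checked by simulation; the sketch below compares the truncated survival function at two scale values on a single interval.

```r
# Monte Carlo check of Lemma A.3: for c1 < c2, the survival function of a
# c * chi_q variable truncated to S = [lo, hi] is larger under c2.
surv_mc <- function(t, c, q, lo, hi, B = 1e6) {
  phi <- c * sqrt(rchisq(B, df = q))  # draws from c * chi_q
  phi <- phi[phi >= lo & phi <= hi]   # keep draws landing in S
  mean(phi >= t)
}

set.seed(6)
surv_mc(5, c = 1, q = 10, lo = 3, hi = 8)  # smaller
surv_mc(5, c = 2, q = 10, lo = 3, hi = 8)  # larger
```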

We will now prove Theorem 4. In the following, we will use AnA_{n} to denote the event {𝒞^1(n),𝒞^2(n)𝒞(𝐗(n))}\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\in\mathcal{C}({\bf X}^{(n)})\right\}, p^n\hat{p}_{n} to denote p^(𝐗(n);{𝒞^1(n),𝒞^2(n)})\hat{p}\left({\bf X}^{(n)};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right), and pnp_{n} to denote p(𝐗(n);{𝒞^1(n),𝒞^2(n)})p\left({\bf X}^{(n)};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right). Observe that

H0{𝒞^1(n),𝒞^2(n)}(p^nα|An)\displaystyle\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha~{}\Big{|}~{}A_{n}\right)
=H0{𝒞^1(n),𝒞^2(n)}(p^nα,σ^(𝐗(n))σ|An)+H0{𝒞^1(n),𝒞^2(n)}(p^nα,σ^(𝐗(n))<σ|An)\displaystyle=\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha,~{}\hat{\sigma}({\bf X}^{(n)})\geq\sigma~{}\Big{|}~{}A_{n}\right)+\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha,~{}\hat{\sigma}({\bf X}^{(n)})<\sigma~{}\Big{|}~{}A_{n}\right)
H0{𝒞^1(n),𝒞^2(n)}(p^nα,σ^(𝐗(n))σ|An)+H0{𝒞^1(n),𝒞^2(n)}(σ^(𝐗(n))<σ|An).\displaystyle\leq\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha,~{}\hat{\sigma}({\bf X}^{(n)})\geq\sigma~{}\Big{|}~{}A_{n}\right)+\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{\sigma}({\bf X}^{(n)})<\sigma~{}\Big{|}~{}A_{n}\right). (A.28)

Recall from (9) that

p(𝐱(n);{𝒞^1(n),𝒞^2(n)})=1𝔽(x¯𝒞^1(n)x¯𝒞^2(n)2;σ1|𝒞^1(n)|+1|𝒞^2(n)|,𝒮(𝐱;{𝒞^1(n),𝒞^2(n)})),p\left({\bf x}^{(n)};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right)=1-\mathbb{F}\left(\left\|\bar{x}_{\mathcal{\hat{C}}_{1}^{(n)}}-\bar{x}_{\mathcal{\hat{C}}_{2}^{(n)}}\right\|_{2};\sigma\sqrt{\frac{1}{|\mathcal{\hat{C}}_{1}^{(n)}|}+\frac{1}{|\mathcal{\hat{C}}_{2}^{(n)}|}},\mathcal{S}\left({\bf x};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right)\right),

and recall from (25) that

p^(𝐱(n);{𝒞^1(n),𝒞^2(n)})=1𝔽(x¯𝒞^1(n)x¯𝒞^2(n)2;σ^(𝐱)1|𝒞^1(n)|+1|𝒞^2(n)|,𝒮(𝐱;{𝒞^1(n),𝒞^2(n)})).\hat{p}\left({\bf x}^{(n)};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right)=1-\mathbb{F}\left(\left\|\bar{x}_{\mathcal{\hat{C}}_{1}^{(n)}}-\bar{x}_{\mathcal{\hat{C}}_{2}^{(n)}}\right\|_{2};\hat{\sigma}({\bf x})\sqrt{\frac{1}{|\mathcal{\hat{C}}_{1}^{(n)}|}+\frac{1}{|\mathcal{\hat{C}}_{2}^{(n)}|}},\mathcal{S}\left({\bf x};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right)\right).

It follows from Lemma A.3 that for any 𝐱{\bf x} such that σ^(𝐱)σ\hat{\sigma}({\bf x})\geq\sigma, we have that

p^(𝐱;{𝒞^1(n),𝒞^2(n)})p(𝐱;{𝒞^1(n),𝒞^2(n)}).\hat{p}\left({\bf x};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right)\geq p\left({\bf x};\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}\right).

Thus, for all n=1,2,n=1,2,\ldots and for all 0α10\leq\alpha\leq 1,

H0{𝒞^1(n),𝒞^2(n)}(p^nα,σ^(𝐗(n))σ|An)\displaystyle\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha,\hat{\sigma}({\bf X}^{(n)})\geq\sigma~{}\Big{|}~{}A_{n}\right) H0{𝒞^1(n),𝒞^2(n)}(pnα,σ^(𝐗(n))σ|An)\displaystyle\leq\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(p_{n}\leq\alpha,\hat{\sigma}({\bf X}^{(n)})\geq\sigma~{}\Big{|}~{}A_{n}\right)
H0{𝒞^1(n),𝒞^2(n)}(pnα|An)\displaystyle\leq\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\Big{(}p_{n}\leq\alpha~{}\Big{|}~{}A_{n}\Big{)}
=α,\displaystyle=\alpha, (A.29)

where the equality follows from (11). Applying (A.29) to (A.28) yields

H0{𝒞^1(n),𝒞^2(n)}(p^nα|An)α+H0{𝒞^1(n),𝒞^2(n)}(σ^(𝐗(n))<σ|An).\displaystyle\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha~{}\Big{|}~{}A_{n}\right)\leq\alpha+\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{\sigma}({\bf X}^{(n)})<\sigma~{}\Big{|}~{}A_{n}\right). (A.30)

By our assumption in Theorem 4, limnH0{𝒞^1(n),𝒞^2(n)}(σ^(𝐗(n))<σ|An)=0\underset{n\rightarrow\infty}{\lim}~{}\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{\sigma}({\bf X}^{(n)})<\sigma~{}\Big{|}~{}A_{n}\right)=0. Applying this fact to (A.30) yields

limnH0{𝒞^1(n),𝒞^2(n)}(p^nα|An)α.\underset{n\rightarrow\infty}{\lim}\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}_{n}\leq\alpha~{}\Big{|}~{}A_{n}\right)\leq\alpha.

This completes the proof of Theorem 4.

A.7 Proof of Propositions 5 and 6

Recall that $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ denotes the null hypothesis that $\bar{\mu}_{\mathcal{G}_{1}}=\bar{\mu}_{\mathcal{G}_{2}}$. Further recall that $\mathcal{H}_{0}$ denotes the set of hypotheses of the form $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}$ that are true. We start by rewriting (11) as follows:
\[
\text{if }H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0},\text{ then }\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\,\Big|\,\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)=\alpha.\tag{A.31}
\]
We will first prove Proposition 5. Let $K=2$. Then,
\begin{align}
&\mathbb{P}\left(p({\bf X};\mathcal{C}({\bf X}))\leq\alpha,\,H_{0}^{\mathcal{C}({\bf X})}\in\mathcal{H}_{0}\right)\notag\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha,\,\mathcal{C}({\bf X})=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\right)\notag\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\mid\mathcal{C}({\bf X})=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\right)\mathbb{P}\left(\mathcal{C}({\bf X})=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\right)\notag\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\alpha\,\mathbb{P}\left(\mathcal{C}({\bf X})=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\right)\tag{A.32}\\
&=\alpha\,\mathbb{P}\left(H_{0}^{\mathcal{C}({\bf X})}\in\mathcal{H}_{0}\right),\notag
\end{align}
where (A.32) follows from (A.31) and the fact that $K=2$. Therefore,
\[
\mathbb{P}\left(p({\bf X};\mathcal{C}({\bf X}))\leq\alpha\mid H_{0}^{\mathcal{C}({\bf X})}\in\mathcal{H}_{0}\right)=\alpha.
\]

This is (26).

We will now prove Proposition 6. Let $K>2$, and let $\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}$ and ${\bf X}$ be conditionally independent given $\mathcal{C}({\bf X})$. Then,
\begin{align}
&\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\})\leq\alpha,\,H_{0}^{\left\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\right\}}\in\mathcal{H}_{0}\right)\tag{A.33}\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha,\,\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\right)\notag\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\mathbb{P}\left(\{p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\}\cap\{\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\}\cap\{\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\}\right),\notag
\end{align}
where the last equality follows from the fact that $\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}$ implies that $\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})$. Since we assumed that $\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}$ and ${\bf X}$ are conditionally independent given $\mathcal{C}({\bf X})$, we have for any $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}$ that
\begin{align*}
&\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha,\,\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\\
&=\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\mathbb{P}\left(\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\\
&=\alpha\,\mathbb{P}\left(\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right),
\end{align*}
where the last equality follows from (A.31). It follows that for any $H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}$, we have that
\begin{align}
&\mathbb{P}\left(\{p({\bf X};\{\mathcal{G}_{1},\mathcal{G}_{2}\})\leq\alpha\}\cap\{\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\}\cap\{\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\}\right)\notag\\
&=\alpha\,\mathbb{P}\left(\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\mid\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\mathbb{P}\left(\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\notag\\
&=\alpha\,\mathbb{P}\left(\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\},\,\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right).\tag{A.34}
\end{align}
Applying (A.34) to (A.33) yields
\begin{align*}
&\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\})\leq\alpha,\,H_{0}^{\left\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\right\}}\in\mathcal{H}_{0}\right)\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\alpha\,\mathbb{P}\left(\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\},\,\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})\right)\\
&=\sum_{H_{0}^{\{\mathcal{G}_{1},\mathcal{G}_{2}\}}\in\mathcal{H}_{0}}\alpha\,\mathbb{P}\left(\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}\right)\\
&=\alpha\,\mathbb{P}\left(H_{0}^{\left\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\right\}}\in\mathcal{H}_{0}\right),
\end{align*}
where the second-to-last equality follows from the fact that $\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\}=\{\mathcal{G}_{1},\mathcal{G}_{2}\}$ implies that $\mathcal{G}_{1},\mathcal{G}_{2}\in\mathcal{C}({\bf X})$. Therefore,
\[
\mathbb{P}\left(p({\bf X};\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\})\leq\alpha\mid H_{0}^{\left\{\mathcal{G}_{1}({\bf X}),\mathcal{G}_{2}({\bf X})\right\}}\in\mathcal{H}_{0}\right)=\alpha.
\]

This is (27).

Appendix B Algorithm for computing $\mathcal{\hat{S}}$ in the case of squared Euclidean distance and linkages that satisfy (20)

We begin by defining quantities used in the algorithm. For all $\mathcal{G}\in\bigcup_{t=1}^{n-K}\mathcal{C}^{(t)}({\bf x})$, let
\[
l_{\mathcal{G}}({\bf x})\equiv\min\{1\leq t\leq n-K:\mathcal{G}\in\mathcal{C}^{(t)}({\bf x})\},\qquad u_{\mathcal{G}}({\bf x})\equiv\max\{1\leq t\leq n-K:\mathcal{G}\in\mathcal{C}^{(t)}({\bf x})\},\tag{B.1}
\]
so that for any $\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x})$,
\begin{align}
l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})&=\max\{l_{\mathcal{G}}({\bf x}),l_{\mathcal{G}^{\prime}}({\bf x})\},\tag{B.2}\\
u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})&=\begin{cases}\min\{u_{\mathcal{G}}({\bf x}),u_{\mathcal{G}^{\prime}}({\bf x})\}-1,&\text{if }\{\mathcal{G},\mathcal{G}^{\prime}\}\in\left\{\left\{\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x})\right\}:1\leq t\leq n-K\right\},\\\min\{u_{\mathcal{G}}({\bf x}),u_{\mathcal{G}^{\prime}}({\bf x})\},&\text{otherwise.}\end{cases}\notag
\end{align}
Recall from Section 3.3 that
\begin{align}
h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})&=\max_{t\in\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\cup\{u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\}}d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right),\text{ where}\tag{B.3}\\
\mathcal{M}_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})&=\left\{t:t\in\mathcal{M}({\bf x}),\,l_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\leq t<u_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\right\}.\tag{B.4}
\end{align}
Let $\mathcal{G}_{new}^{(t)}({\bf x})=\mathcal{W}_{1}^{(t-1)}({\bf x})\cup\mathcal{W}_{2}^{(t-1)}({\bf x})$ for all $t=2,\ldots,n-K$. Define
\[
\mathcal{L}_{1}({\bf x})=\left\{\{\mathcal{G},\mathcal{G}^{\prime}\}:\mathcal{G},\mathcal{G}^{\prime}\in\mathcal{C}^{(1)}({\bf x}),\,\mathcal{G}\neq\mathcal{G}^{\prime},\,\{\mathcal{G},\mathcal{G}^{\prime}\}\neq\{\mathcal{W}^{(1)}_{1}({\bf x}),\mathcal{W}^{(1)}_{2}({\bf x})\}\right\},\tag{B.5}
\]
and for $t=2,\ldots,n-K$, define
\[
\mathcal{L}_{t}({\bf x})=\left\{\left\{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}\right\}:\mathcal{G}^{\prime}\in\mathcal{C}^{(t)}({\bf x})\backslash\mathcal{G}_{new}^{(t)}({\bf x}),\,\left\{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}\right\}\neq\left\{\mathcal{W}^{(t)}_{1}({\bf x}),\mathcal{W}^{(t)}_{2}({\bf x})\right\}\right\},\tag{B.6}
\]
so that $\mathcal{L}({\bf x})$ defined in (17) satisfies $\mathcal{L}({\bf x})=\bigcup_{t=1}^{n-K}\mathcal{L}_{t}({\bf x})$. Thus, it follows from Theorem 2 that
\[
\mathcal{\hat{S}}=\bigcap_{t=1}^{n-K}\;\bigcap_{\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}_{t}({\bf x})}\mathcal{\hat{S}}_{\mathcal{G},\mathcal{G}^{\prime}},\tag{B.7}
\]
where for all $\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}({\bf x})$,
\[
\mathcal{\hat{S}}_{\mathcal{G},\mathcal{G}^{\prime}}=\{\phi\geq 0:d(\mathcal{G},\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi))>h_{\mathcal{G},\mathcal{G}^{\prime}}({\bf x})\}.\tag{B.8}
\]

The following algorithm computes the set $\mathcal{\hat{S}}$ (defined in (12)) in the case of squared Euclidean distance and linkages that satisfy (20) in $\mathcal{O}\big((|\mathcal{M}({\bf x})|+\log(n))n^{2}\big)$ time, where $|\mathcal{M}({\bf x})|$ is the number of inversions in the dendrogram of ${\bf x}$ below the $(n-K)$th merge:

1. Compute $\mathcal{M}({\bf x})=\left\{1\leq t<n-K:d\left(\mathcal{W}_{1}^{(t)}({\bf x}),\mathcal{W}_{2}^{(t)}({\bf x});{\bf x}\right)>d\left(\mathcal{W}_{1}^{(t+1)}({\bf x}),\mathcal{W}_{2}^{(t+1)}({\bf x});{\bf x}\right)\right\}$.

2. Compute $l_{\mathcal{G}}({\bf x})$ and $u_{\mathcal{G}}({\bf x})$ defined in (B.1) for all $\mathcal{G}\in\bigcup_{t=1}^{n-K}\mathcal{C}^{(t)}({\bf x})$ as follows:

    (a) Let $l_{\{i\}}({\bf x})=1$ for all $i$. Let $l_{\mathcal{G}_{new}^{(t)}({\bf x})}({\bf x})=t$ for all $2\leq t<n-K$.

    (b) For all $\mathcal{G}\in\bigcup_{t=1}^{n-K}\mathcal{C}^{(t)}({\bf x})$, initialize $u_{\mathcal{G}}({\bf x})=n-K$.

    (c) For $t^{\prime}=1,2,\ldots,n-K$, update $u_{\mathcal{W}_{1}^{(t^{\prime})}({\bf x})}({\bf x})=t^{\prime}$ and $u_{\mathcal{W}_{2}^{(t^{\prime})}({\bf x})}({\bf x})=t^{\prime}$.

3. For all $\{\{i\},\{i^{\prime}\}\}\in\mathcal{L}_{1}({\bf x})$, compute $\mathcal{\hat{S}}_{\{i\},\{i^{\prime}\}}$ defined in (B.8) as follows:

    (a) Use $l_{\{i\}}({\bf x})$ and $l_{\{i^{\prime}\}}({\bf x})$ to compute $l_{\{i\},\{i^{\prime}\}}({\bf x})$, and likewise for $u_{\{i\},\{i^{\prime}\}}({\bf x})$, according to (B.2).

    (b) Compute $\mathcal{M}_{\{i\},\{i^{\prime}\}}({\bf x})$ in (B.4) by checking which elements of $\mathcal{M}({\bf x})$ are in $[l_{\{i\},\{i^{\prime}\}}({\bf x}),u_{\{i\},\{i^{\prime}\}}({\bf x})]$.

    (c) Use $\mathcal{M}_{\{i\},\{i^{\prime}\}}({\bf x})$ and $u_{\{i\},\{i^{\prime}\}}({\bf x})$ to compute $h_{\{i\},\{i^{\prime}\}}({\bf x})$ according to (B.3).

    (d) Compute the coefficients of the quadratic function $\phi\mapsto d\left(\{i\},\{i^{\prime}\};{\bf x}^{\prime}(\phi)\right)$ in constant time, using Lemma 2.

    (e) Use these coefficients and $h_{\{i\},\{i^{\prime}\}}({\bf x})$ to evaluate $\mathcal{\hat{S}}_{\{i\},\{i^{\prime}\}}=\left\{\phi\geq 0:d(\{i\},\{i^{\prime}\};{\bf x}^{\prime}(\phi))>h_{\{i\},\{i^{\prime}\}}({\bf x})\right\}$. This amounts to solving a quadratic inequality in $\phi$.

4. For all $t=2,\ldots,n-K$ and for all $\{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}\}\in\mathcal{L}_{t}({\bf x})$, compute $\mathcal{\hat{S}}_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}$ defined in (B.8) as follows:

    (a) Use $l_{\mathcal{G}_{new}^{(t)}({\bf x})}({\bf x})$ and $l_{\mathcal{G}^{\prime}}({\bf x})$ to compute $l_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$, and likewise for $u_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$, according to (B.2).

    (b) Compute $\mathcal{M}_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$ in (B.4) by checking which elements of $\mathcal{M}({\bf x})$ are between $l_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$ and $u_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$.

    (c) Use $\mathcal{M}_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$ and $u_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$ to compute $h_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$ in (B.3).

    (d) Compute the coefficients of the quadratic function $\phi\mapsto d\left(\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi)\right)$ in constant time, using (20) (Proposition 1).

    (e) Use these coefficients and $h_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})$ to evaluate $\mathcal{\hat{S}}_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}=\left\{\phi\geq 0:d\left(\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime};{\bf x}^{\prime}(\phi)\right)>h_{\mathcal{G}_{new}^{(t)}({\bf x}),\mathcal{G}^{\prime}}({\bf x})\right\}$. This amounts to solving a quadratic inequality in $\phi$.

5. Compute $\mathcal{\hat{S}}=\bigcap_{t=1}^{n-K}\;\bigcap_{\{\mathcal{G},\mathcal{G}^{\prime}\}\in\mathcal{L}_{t}({\bf x})}\mathcal{\hat{S}}_{\mathcal{G},\mathcal{G}^{\prime}}$.

To see that this algorithm computes $\mathcal{\hat{S}}$ in $\mathcal{O}\big((|\mathcal{M}({\bf x})|+\log(n))n^{2}\big)$ time, observe that:

  • Steps 1 and 2 can each be performed in $\mathcal{O}(n)$ time. Note that $\left|\bigcup_{t=1}^{n-K}\mathcal{C}^{(t)}({\bf x})\right|=\mathcal{O}(n)$.

  • Each of the $\mathcal{O}(n^{2})$ iterations in Step 3 can be performed in $\mathcal{O}(|\mathcal{M}({\bf x})|+1)$ time.

  • Each of the $\mathcal{O}(n^{2})$ iterations in Step 4 can be performed in $\mathcal{O}(|\mathcal{M}({\bf x})|+1)$ time.

  • Step 5 can be performed in $\mathcal{O}(n^{2}\log(n))$ time. This is because each $\mathcal{\hat{S}}_{\mathcal{G},\mathcal{G}^{\prime}}$ is the solution set of a quadratic inequality, the intersection of the solution sets of $N$ quadratic inequalities can be computed in $\mathcal{O}(N\log N)$ time (Bourgon 2009), and $|\mathcal{L}({\bf x})|=\mathcal{O}(n^{2})$.
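
To make Steps 3(e), 4(e), and 5 concrete, the following base-R sketch solves a single quadratic inequality $\{\phi\geq 0: a\phi^{2}+b\phi+c>h\}$ and intersects the resulting solution sets. It is purely illustrative and not the authors' implementation: the function names are ours, the coefficients $(a,b,c)$ would come from Lemma 2 or Proposition 1, and the pairwise intersection below is a naive stand-in for the $\mathcal{O}(N\log N)$ approach of Bourgon (2009).

# Solve {phi >= 0 : a*phi^2 + b*phi + c > h}; returns a two-column matrix
# whose rows are the (left, right) endpoints of at most two intervals.
solve_quadratic <- function(a, b, c, h) {
  cc <- c - h  # equivalent to solving a*phi^2 + b*phi + cc > 0
  if (abs(a) < 1e-12) {  # degenerate (linear) case
    if (abs(b) < 1e-12)
      return(if (cc > 0) matrix(c(0, Inf), 1) else matrix(0, 0, 2))
    root <- -cc / b
    return(if (b > 0) matrix(c(max(root, 0), Inf), 1)
           else matrix(c(0, max(root, 0)), 1))
  }
  disc <- b^2 - 4 * a * cc
  if (disc <= 0)  # no sign change: the sign of `a` decides everything
    return(if (a > 0) matrix(c(0, Inf), 1) else matrix(0, 0, 2))
  r <- sort((-b + c(-1, 1) * sqrt(disc)) / (2 * a))
  if (a > 0) {  # quadratic is positive outside the roots
    out <- rbind(c(0, max(r[1], 0)), c(max(r[2], 0), Inf))
    out[out[, 1] < out[, 2], , drop = FALSE]
  } else {      # quadratic is positive between the roots
    lo <- max(r[1], 0); hi <- max(r[2], 0)
    if (lo < hi) matrix(c(lo, hi), 1) else matrix(0, 0, 2)
  }
}

# Naive pairwise intersection of two unions of disjoint intervals.
intersect_intervals <- function(A, B) {
  out <- matrix(0, 0, 2)
  for (i in seq_len(nrow(A))) for (j in seq_len(nrow(B))) {
    lo <- max(A[i, 1], B[j, 1]); hi <- min(A[i, 2], B[j, 2])
    if (lo < hi) out <- rbind(out, c(lo, hi))
  }
  out
}

# Example: phi^2 - 4*phi + 3 > 0 (i.e. phi < 1 or phi > 3) and phi - 2 > 0.
S_hat <- intersect_intervals(solve_quadratic(1, -4, 3, 0),
                             solve_quadratic(0, 1, -2, 0))
S_hat  # a single interval: (3, Inf)

In the full algorithm, one such constraint is generated for each pair in $\mathcal{L}({\bf x})$, and a sorted-endpoint sweep replaces the double loop above to achieve the stated $\mathcal{O}(N\log N)$ cost.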

Appendix C Supplementary material for Section 4.3

The goal of this section is to establish an estimator of $\sigma$ that satisfies the condition from Theorem 4. For any data set ${\bf x}$, define
\[
\hat{\sigma}({\bf x})=\sqrt{\frac{1}{nq-q}\sum_{i=1}^{n}\sum_{j=1}^{q}(x_{ij}-\bar{x}_{j})^{2}},\tag{C.1}
\]
where $\bar{x}_{j}=\sum_{i=1}^{n}x_{ij}/n$ for all $j=1,2,\ldots,q$. We will first show that, under some assumptions, $\hat{\sigma}({\bf X}^{(n)})$ asymptotically over-estimates $\sigma$. Then, we will show that if we estimate $\sigma$ by applying $\hat{\sigma}(\cdot)$ to an independent and identically distributed copy of ${\bf X}^{(n)}$, then the condition from Theorem 4 is satisfied.
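
In R, (C.1) is a few lines; the following sketch (the function name is our own) is reused in the snippets later in this appendix and in Appendix D.

sigma_hat <- function(x) {
  # Center each of the q columns at its mean, then pool all n*q residuals,
  # dividing by nq - q as in (C.1).
  xc <- scale(x, center = TRUE, scale = FALSE)
  sqrt(sum(xc^2) / (nrow(x) * ncol(x) - ncol(x)))
}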

Let $K^{*}>1$ be unknown. We assume that the sequence of mean matrices for $\{{\bf X}^{(n)}\}_{n=1}^{\infty}$ in Theorem 4, which we denote as $\left\{\bm{\mu}^{(n)}\right\}_{n=1}^{\infty}$, satisfies the following assumptions.

Assumption 1.

For all $n=1,2,\ldots$, $\left\{\mu^{(n)}_{i}\right\}_{i=1}^{n}=\{\theta_{1},\ldots,\theta_{K^{*}}\}$, i.e. there are exactly $K^{*}$ distinct mean vectors among the first $n$ observations.

Assumption 2.

For all $k=1,2,\ldots,K^{*}$, $\lim_{n\rightarrow\infty}\left(\sum_{i=1}^{n}\mathds{1}\{\mu^{(n)}_{i}=\theta_{k}\}/n\right)=\lambda_{k}$, where $\sum_{k=1}^{K^{*}}\lambda_{k}=1$ and $\lambda_{k}>0$. That is, the proportion of the first $n$ observations that have mean vector $\theta_{k}$ converges to $\lambda_{k}$ for all $k$.

This leads to the following result, which says that under Assumptions 1 and 2, $\hat{\sigma}({\bf X}^{(n)})$ asymptotically over-estimates $\sigma$.

Lemma C.1.

For $n=1,2,\ldots$, suppose that ${\bf X}^{(n)}\sim\mathcal{MN}_{n\times q}(\bm{\mu}^{(n)},{\bf I}_{n},\sigma^{2}{\bf I}_{q})$, and that Assumptions 1 and 2 hold for some $K^{*}>1$. Then,
\[
\lim_{n\rightarrow\infty}\mathbb{P}\left(\hat{\sigma}({\bf X}^{(n)})\geq\sigma\right)=1,\tag{C.2}
\]
where $\hat{\sigma}(\cdot)$ is defined in (C.1).

Proof.

Observe that
\[
\left(\frac{nq-q}{nq}\right)\hat{\sigma}^{2}({\bf X}^{(n)})=\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(X_{ij}^{(n)}\right)^{2}-\frac{1}{q}\sum_{j=1}^{q}\left(\bar{X}^{(n)}_{j}\right)^{2}.\tag{C.3}
\]
We will first show that
\[
\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(X^{(n)}_{ij}\right)^{2}\overset{p}{\longrightarrow}\sigma^{2}+\frac{1}{q}\sum_{j=1}^{q}\sum_{k=1}^{K^{*}}\lambda_{k}\theta_{kj}^{2}.\tag{C.4}
\]
Since for any positive integer $r$ and for any $j=1,2,\ldots,q$, we have
\[
\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\left[\mu_{ij}^{(n)}\right]^{r}=\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K^{*}}\mathds{1}\left\{\mu_{i}^{(n)}=\theta_{k}\right\}\left(\theta_{kj}\right)^{r}=\sum_{k=1}^{K^{*}}\lambda_{k}\left(\theta_{kj}\right)^{r},\tag{C.5}
\]
it follows that
\begin{align}
\lim_{n\rightarrow\infty}\mathbb{E}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(X^{(n)}_{ij}\right)^{2}\right]&=\lim_{n\rightarrow\infty}\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left[\text{Var}\left[X^{(n)}_{ij}\right]+\left(\mathbb{E}\left[X^{(n)}_{ij}\right]\right)^{2}\right]\notag\\
&=\lim_{n\rightarrow\infty}\left(\sigma^{2}+\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left[\mu_{ij}^{(n)}\right]^{2}\right)\notag\\
&=\sigma^{2}+\frac{1}{q}\sum_{j=1}^{q}\sum_{k=1}^{K^{*}}\lambda_{k}\left(\theta_{kj}\right)^{2}.\tag{C.6}
\end{align}
Furthermore,
\begin{align}
\text{Var}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(X^{(n)}_{ij}\right)^{2}\right]&\leq\frac{1}{(nq)^{2}}\sum_{i=1}^{n}\sum_{j=1}^{q}\mathbb{E}\left[\left(X^{(n)}_{ij}\right)^{4}\right]\notag\\
&=\frac{1}{(nq)^{2}}\sum_{i=1}^{n}\sum_{j=1}^{q}\left[\left(\mu_{ij}^{(n)}\right)^{4}+6\left(\mu_{ij}^{(n)}\right)^{2}\sigma^{2}+3\sigma^{4}\right]\tag{C.7}\\
&=\frac{1}{nq}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(\mu_{ij}^{(n)}\right)^{4}\right]+\frac{6\sigma^{2}}{nq}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left[\mu_{ij}^{(n)}\right]^{2}\right]+\frac{3\sigma^{4}}{nq},\notag
\end{align}
where (C.7) follows from the fact that $X_{ij}^{(n)}\sim N(\mu_{ij}^{(n)},\sigma^{2})$. It follows from (C.5) that
\[
\lim_{n\rightarrow\infty}\text{Var}\left[\frac{1}{nq}\sum_{i=1}^{n}\sum_{j=1}^{q}\left(X^{(n)}_{ij}\right)^{2}\right]=0.\tag{C.8}
\]
Equation (C.4) follows from (C.6) and (C.8). Next, since (C.5) holds for $r=1$ and for all $j=1,2,\ldots,q$, we have that
\[
\lim_{n\rightarrow\infty}\mathbb{E}\left[\bar{X}^{(n)}_{j}\right]=\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}\mu_{ij}^{(n)}=\sum_{k=1}^{K^{*}}\lambda_{k}\theta_{kj},\qquad\lim_{n\rightarrow\infty}\text{Var}\left[\bar{X}^{(n)}_{j}\right]=\lim_{n\rightarrow\infty}\frac{\sigma^{2}}{n}=0.
\]
Thus, for all $j=1,2,\ldots,q$, we have that $\bar{X}^{(n)}_{j}\overset{p}{\rightarrow}\sum_{k=1}^{K^{*}}\lambda_{k}\theta_{kj}$. Combining this fact with (C.4) and (C.3) yields
\[
\left(\frac{nq-q}{nq}\right)\hat{\sigma}^{2}({\bf X}^{(n)})\overset{p}{\rightarrow}\sigma^{2}+\frac{1}{q}\sum_{j=1}^{q}\sum_{k=1}^{K^{*}}\lambda_{k}(\theta_{kj}-\tilde{\theta}_{j})^{2},\tag{C.9}
\]
where $\tilde{\theta}_{j}=\sum_{k=1}^{K^{*}}\lambda_{k}\theta_{kj}$ for all $j=1,2,\ldots,q$. It follows from (C.9) that
\[
\left(\frac{nq-q}{nq}\right)\hat{\sigma}^{2}({\bf X}^{(n)})\overset{p}{\rightarrow}\sigma^{2}+c,\tag{C.10}
\]
for $c=\frac{1}{q}\sum_{j=1}^{q}\sum_{k=1}^{K^{*}}\lambda_{k}(\theta_{kj}-\tilde{\theta}_{j})^{2}>0$. Applying the continuous mapping theorem to (C.10) yields $\hat{\sigma}({\bf X}^{(n)})\overset{p}{\rightarrow}\sqrt{\sigma^{2}+c}$, where $\sqrt{\sigma^{2}+c}>\sigma$. Therefore, $\lim_{n\rightarrow\infty}\mathbb{P}\left(\hat{\sigma}({\bf X}^{(n)})\geq\sigma\right)=1$. ∎
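
As a quick numerical illustration of Lemma C.1 (ours, not from the paper), consider $K^{*}=2$ equally sized clusters with mean vectors as in (D.1) below. A short calculation gives $c=\delta^{2}/(8q)$ for that design, so $\hat{\sigma}({\bf X}^{(n)})$ should concentrate near $\sqrt{\sigma^{2}+\delta^{2}/(8q)}>\sigma$; the sketch below (using sigma_hat defined above, with assumed parameter values) checks this.

set.seed(1)
n <- 2000; q <- 10; sigma <- 1; delta <- 4
theta1 <- c(delta / 2, rep(0, q - 1))   # mean vector of cluster 1
theta2 <- c(rep(0, q - 1), -delta / 2)  # mean vector of cluster 2
mu <- rbind(matrix(theta1, n / 2, q, byrow = TRUE),
            matrix(theta2, n / 2, q, byrow = TRUE))
x <- mu + matrix(rnorm(n * q, sd = sigma), n, q)
sigma_hat(x)                       # noticeably larger than sigma = 1
sqrt(sigma^2 + delta^2 / (8 * q))  # the limit sqrt(sigma^2 + c), approx. 1.095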

The following result establishes an estimator of $\sigma$ that satisfies the condition in Theorem 4, by making use of an independent and identically distributed copy of ${\bf X}^{(n)}$, if such a copy is available. If not, then sample-splitting can be used.

Corollary 1.

For $n=1,2,\ldots$, suppose that ${\bf X}^{(n)}\sim\mathcal{MN}_{n\times q}(\bm{\mu}^{(n)},{\bf I}_{n},\sigma^{2}{\bf I}_{q})$, that Assumptions 1 and 2 hold for some $K^{*}>1$, and let ${\bf x}^{(n)}$ be a realization of ${\bf X}^{(n)}$. For all $n=1,2,\ldots$, let $\mathcal{\hat{C}}_{1}^{(n)}$ and $\mathcal{\hat{C}}_{2}^{(n)}$ be a pair of estimated clusters in $\mathcal{C}({\bf x}^{(n)})$, and let ${\bf Y}^{(n)}$ be an independent and identically distributed copy of ${\bf X}^{(n)}$. Then, for all $0\leq\alpha\leq 1$,
\[
\lim_{m,n\rightarrow\infty}\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{p}\left({\bf X}^{(n)},\hat{\sigma}({\bf Y}^{(m)})\right)\leq\alpha\,\Big|\,\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\in\mathcal{C}({\bf X}^{(n)})\right)\leq\alpha.\tag{C.11}
\]

Proof.

By the independence of ${\bf X}^{(n)}$ and ${\bf Y}^{(n)}$ for all $n=1,2,\ldots$, and by Lemma C.1,
\[
\lim_{m,n\rightarrow\infty}\mathbb{P}_{H_{0}^{\left\{\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\right\}}}\left(\hat{\sigma}({\bf Y}^{(m)})>\sigma\,\Big|\,\mathcal{\hat{C}}_{1}^{(n)},\mathcal{\hat{C}}_{2}^{(n)}\in\mathcal{C}({\bf X}^{(n)})\right)=1.
\]
Thus, (C.11) follows from Theorem 4. ∎

Appendix D Additional simulation studies

D.1 Null p-values when $\sigma$ is unknown

In Section 4.3, we proposed plugging an estimator of $\sigma$ into the p-value defined in (8) to obtain (25), and established conditions under which rejecting $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}$ when (25) is below $\alpha$ asymptotically controls the selective type I error rate (Theorem 4). Furthermore, in Appendix C, we established an estimator of $\sigma$ that satisfies the conditions in Theorem 4 (Corollary 1). In what follows, we investigate the finite-sample implications of applying the approach outlined in Appendix C.

We generate data from (1) with $n=200$, $q=10$, $\sigma=1$, and two clusters,
\[
\mu_{1}=\cdots=\mu_{100}=\begin{bmatrix}\delta/2\\0_{q-1}\end{bmatrix},\qquad\mu_{101}=\cdots=\mu_{200}=\begin{bmatrix}0_{q-1}\\-\delta/2\end{bmatrix},\tag{D.1}
\]
for $\delta\in\{2,4,6\}$. We randomly split each data set into equally sized “training” and “test” sets. For each training set, we use average, centroid, single, and complete linkage hierarchical clustering to estimate three clusters, and then test $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}}$ for a randomly chosen pair of clusters. We estimate $\sigma$ by applying $\hat{\sigma}(\cdot)$ in (C.1) to the corresponding test set, then use it to compute p-values for average, centroid, and single linkage. In the case of complete linkage, we approximate (25) by replacing $\sigma$ in the importance sampling procedure described in Section 4.1 with $\hat{\sigma}(\cdot)$ in (C.1) applied to the corresponding test set. Figure D.1 displays QQ-plots of the empirical distribution of the p-values against the Uniform$(0,1)$ distribution, over 500 simulated data sets where $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}}$ holds, for $\delta\in\{2,4,6\}$.
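
The splitting and clustering steps of one replicate can be sketched in R as follows (illustrative only; `x` is a data matrix generated as in (D.1), e.g., via the Appendix C snippet with n = 200 and the appropriate delta, and sigma_hat is defined in Appendix C). Squared Euclidean dissimilarities are the appropriate input for centroid linkage in hclust, and squaring is monotone, so it leaves the single and complete linkage merges unchanged.

set.seed(2)
train <- sample(nrow(x), nrow(x) / 2)
sig_est <- sigma_hat(x[-train, ])  # sigma estimated on the held-out test set
d2 <- dist(x[train, ])^2           # squared Euclidean dissimilarities
linkages <- c("average", "centroid", "single", "complete")
clusterings <- lapply(linkages,
                      function(m) cutree(hclust(d2, method = m), k = 3))
names(clusterings) <- linkages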

Figure D.1: For the simulation study described in Section D.1, QQ-plots of the p-values, using (a) average linkage, (b) centroid linkage, (c) single linkage, and (d) complete linkage, over 500 simulated data sets where $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}}$ holds, for $\delta\in\{2,4,6\}$.

The p-values appear to be stochastically larger than the Uniform$(0,1)$ distribution, across all linkages and all values of $\delta$ in (D.1). As $\delta$ decreases and the clusters become less separated, $\hat{\sigma}(\cdot)$ in (C.1) overestimates $\sigma$ less severely. Thus, as $\delta$ decreases, the distribution of the p-values becomes closer to the Uniform$(0,1)$ distribution.

D.2 Power as a function of effect size

Recall that in Section 5.2, we considered the conditional power of the test proposed in Section 2.1 to detect a difference in means between two true clusters. We now evaluate the power of the test proposed in Section 2.1 to detect a difference in means between estimated clusters, without conditioning on having correctly estimated the true clusters.

We generate data from (1) with $n=150$, $\sigma=1$, $q=10$, and $\bm{\mu}$ given by (28). We simulate 10,000 data sets for nine evenly-spaced values of $\delta\in[3,7]$. For each simulated data set, we test $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}}$ for a randomly chosen pair of clusters obtained from single, average, centroid, and complete linkage hierarchical clustering, with significance level $\alpha=0.05$. We define the effect size to be $\Delta=\|\bar{\mu}_{\mathcal{\hat{C}}_{1}}-\bar{\mu}_{\mathcal{\hat{C}}_{2}}\|_{2}/\sigma$ and consider the probability of rejecting $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}}$ as a function of $\Delta$. We smooth our power estimates by fitting a regression spline using the gam function in the R package mgcv (Wood 2015). Figure D.2(a)–(d) displays the smoothed estimates when $\min\{|\mathcal{\hat{C}}_{1}|,|\mathcal{\hat{C}}_{2}|\}\geq 10$, and Figure D.2(e)–(h) displays the smoothed estimates when $\min\{|\mathcal{\hat{C}}_{1}|,|\mathcal{\hat{C}}_{2}|\}<10$. (We stratify our results because the power is much lower when $\min\{|\mathcal{\hat{C}}_{1}|,|\mathcal{\hat{C}}_{2}|\}$ is small.)
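
One plausible way to carry out the smoothing step with mgcv is a binomial spline regression of the rejection indicator on the effect size; the exact model is not specified in the text, so the formula and variable names below (`reject`, `Delta`) are assumptions on our part.

library(mgcv)
# `reject` is a 0/1 rejection indicator and `Delta` the realized effect size,
# one entry per simulated data set (hypothetical variable names).
sims <- data.frame(reject = reject, Delta = Delta)
fit <- gam(reject ~ s(Delta), family = binomial, data = sims, method = "REML")
grid <- data.frame(Delta = seq(min(sims$Delta), max(sims$Delta),
                               length.out = 200))
power_hat <- predict(fit, newdata = grid, type = "response")  # smoothed power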

Figure D.2: For the simulation study described in Section D.2, smoothed power estimates as a function of the effect size $\Delta=\|\bar{\mu}_{\mathcal{\hat{C}}_{1}}-\bar{\mu}_{\mathcal{\hat{C}}_{2}}\|_{2}/\sigma$. In (a)–(d), $\min\{|\mathcal{\hat{C}}_{1}|,|\mathcal{\hat{C}}_{2}|\}\geq 10$, and in (e)–(h), $\min\{|\mathcal{\hat{C}}_{1}|,|\mathcal{\hat{C}}_{2}|\}<10$.

It may come as a surprise that the $x$-axis in Figure D.2(c) starts at 4.5, while the $x$-axis in Figure D.2(d) starts at 0. This is because single linkage hierarchical clustering produces unbalanced clusters unless it successfully detects the true clusters. Thus, the condition $\min\{|\mathcal{\hat{C}}_{1}|,|\mathcal{\hat{C}}_{2}|\}\geq 10$ is satisfied for single linkage hierarchical clustering only when the true clusters are detected, which happens when $\Delta$ is greater than 4.5. The $x$-axis in Figure D.2(b) starts at 3 rather than 0 for a similar reason.

In Figure D.2, for all four linkages, the power to reject $H_{0}^{\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\}}:\bar{\mu}_{\mathcal{\hat{C}}_{1}}=\bar{\mu}_{\mathcal{\hat{C}}_{2}}$ increases as the effect size $\Delta$ increases. All four linkages have similar power where the $x$-axes overlap.

Figure D.3: For the simulation study described in Section D.2 with average linkage and $\Delta=\|\bar{\mu}_{\mathcal{\hat{C}}_{1}}-\bar{\mu}_{\mathcal{\hat{C}}_{2}}\|_{2}/\sigma=5$, smoothed p-values on the $\log_{10}$ scale as a function of the distance between the test statistic and the left boundary of $\mathcal{\hat{S}}$, defined in (12).

Our test has low power when the test statistic ($\|\bar{x}_{\mathcal{\hat{C}}_{1}}-\bar{x}_{\mathcal{\hat{C}}_{2}}\|_{2}$) is very close to the left boundary of $\mathcal{\hat{S}}$ defined in (12). This is a direct consequence of the fact that $p({\bf x};\{\mathcal{\hat{C}}_{1},\mathcal{\hat{C}}_{2}\})=\mathbb{P}\left(\phi\geq\|\bar{x}_{\mathcal{\hat{C}}_{1}}-\bar{x}_{\mathcal{\hat{C}}_{2}}\|_{2}\mid\phi\in\mathcal{\hat{S}}\right)$, where $\phi\sim\left(\sigma\sqrt{\tfrac{1}{|\mathcal{\hat{C}}_{1}|}+\tfrac{1}{|\mathcal{\hat{C}}_{2}|}}\right)\cdot\chi_{q}$ (Theorem 1). Figure D.3 displays the smoothed p-values from average linkage hierarchical clustering with effect size $\Delta=\|\bar{\mu}_{\mathcal{\hat{C}}_{1}}-\bar{\mu}_{\mathcal{\hat{C}}_{2}}\|_{2}/\sigma=5$, as a function of the distance between the test statistic and the left boundary of $\mathcal{\hat{S}}$. In Figure D.3, the p-values decrease as this distance increases. Similar results have been observed in other selective inference frameworks; see Kivaranovic & Leeb (2020) for a detailed discussion.
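
For completeness, the conditional p-value above can be computed from the $\chi_{q}$ survival function once $\mathcal{\hat{S}}$ is available as a union of disjoint intervals (e.g., from the Appendix B sketch). The base-R function below is our own illustrative implementation, not the authors' software; here `scale` would be $\sigma\sqrt{1/|\mathcal{\hat{C}}_{1}|+1/|\mathcal{\hat{C}}_{2}|}$.

# p = P(phi >= stat | phi in S), where phi ~ scale * chi_q and S is a
# two-column matrix of disjoint intervals.
trunc_chi_pvalue <- function(stat, S, scale, q) {
  surv <- function(t) pchisq((t / scale)^2, df = q, lower.tail = FALSE)
  mass <- function(lo, hi) surv(lo) - surv(hi)  # P(phi in [lo, hi])
  num <- sum(apply(S, 1,
                   function(iv) mass(max(iv[1], stat), max(iv[2], stat))))
  den <- sum(apply(S, 1, function(iv) mass(iv[1], iv[2])))
  num / den
}

# Example: if S = matrix(c(3, Inf), 1), then trunc_chi_pvalue(4, S, 0.5, 10)
# is the conditional tail mass beyond the test statistic 4 within S; as the
# statistic moves away from the left boundary 3, the p-value shrinks.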