
A penalized criterion for selecting the number of clusters for K-medians

Antoine Godichon-Baggioni and Sobihan Surendran,
Laboratoire de Probabilités, Statistique et Modélisation
Sorbonne-Université, 75005 Paris, France
antoine.godichon_[email protected], [email protected]
Abstract

Clustering is a standard unsupervised machine learning technique for grouping data points based on similar features. We focus here on unsupervised clustering for contaminated data, i.e. in the case where the K-medians algorithm should be preferred to K-means because of its robustness. More precisely, we concentrate on a common question in clustering: how to choose the number of clusters? The answer proposed here is to consider the choice of the optimal number of clusters as the minimization of a penalized criterion. In this paper, we obtain a suitable penalty shape for our criterion and derive an associated oracle-type inequality. Finally, the performance of this approach with different types of K-medians algorithms is compared with other popular techniques in a simulation study. All studied algorithms are available in the R package Kmedians on CRAN.

Keywords: Clustering, K-medians, Robust statistics

1 Introduction

Clustering is an unsupervised machine learning technique that involves grouping data points into a collection of groups based on similar features. Clustering is commonly used for data compression in image processing, which is also known as vector quantization (Gersho and Gray,, 2012). There is a vast literature on clustering techniques, and general references can be found in Spath, (1980); Jain and Dubes, (1988); Mirkin, (1996); Jain et al., (1999); Berkhin, (2006); Kaufman and Rousseeuw, (2009). Classification methods can be categorized into hard clustering, also referred to as crisp clustering (including K-means, K-medians, and hierarchical clustering), and soft clustering (such as Fuzzy K-means (Dunn,, 1973; Bezdek,, 2013) and Mixture Models). In hard clustering methods, each data point belongs to only one group, whereas in soft clustering, a probability or likelihood of a data point belonging to a cluster is assigned, allowing each data point to be a member of more than one group.

We focus here on hard clustering methods. The most popular partitioning clustering methods are the non sequential (Forgy,, 1965) and the sequential (MacQueen,, 1967) versions of the K-means algorithm. The aim of the K-means algorithm is to minimize the sum of squared distances between the data points and their respective cluster centroids. More precisely, considering random vectors $X_{1},\ldots,X_{n}$ taking values in $\mathbb{R}^{d}$, the aim is to find $k$ centroids $\left\{c_{1},\ldots,c_{k}\right\}$ minimizing the empirical distortion

1ni=1nminj=1,..,kXicj2.\frac{1}{n}\sum_{i=1}^{n}\min_{j=1,..,k}\left\|X_{i}-c_{j}\right\|^{2}. (1)

Nevertheless, in many real-world applications, the collected data may be contaminated by outliers of large magnitude, which makes traditional clustering methods such as K-means sensitive to their presence. As a result, it is necessary to use more robust clustering algorithms that produce reliable outcomes. One such algorithm is K-medians clustering, which was introduced by MacQueen, (1967) and further developed by Kaufman and Rousseeuw, (2009). Instead of using the mean to determine the centroid of each cluster, K-medians clustering uses the geometric median: it consists in considering criteria based on least norms instead of least squared norms. More precisely, considering the same sequence of i.i.d. copies $X_{1},\ldots,X_{n}$, the objective of K-medians clustering is to minimize the empirical $L^{1}$-distortion:

1ni=1nminj=1,..,kXicj.\frac{1}{n}\sum_{i=1}^{n}\min_{j=1,..,k}\left\|X_{i}-c_{j}\right\|.

In practical applications, the number of clusters $k$ is often unknown. In this paper, we focus on the choice of the optimal number of clusters for robust clustering. Several methods for determining the optimal number of clusters have been studied for K-means algorithms and can be easily adapted to K-medians. One commonly used method is the elbow method. Other popular methods are the Silhouette (Kaufman and Rousseeuw,, 2009) and the Gap Statistic (Tibshirani et al.,, 2001). The silhouette coefficient of a sample compares its average distance to the other data points of its own cluster with its average distance to the points of the nearest cluster; the Silhouette method suggests selecting the value of $k$ that maximizes the average silhouette coefficient of all data points. The silhouette score is typically calculated using the Euclidean or Manhattan distance. Regarding the Gap Statistic, the idea is to compare the within-cluster dispersion to its expected value under an appropriate null reference distribution, the reference data set being generated via Monte Carlo simulations of the sampling process.

In Fischer, (2011), the objective is to minimize the empirical distortion defined in (1), as a function of $k$, in order to determine the optimal number of clusters. However, the empirical distortion decreases as $k$ grows and is minimized when each data point forms its own cluster. To prevent choosing too large a value for $k$, a penalty function is introduced. It was shown that the penalty shape is $\sqrt{\frac{k}{n}}$ in the case of K-means clustering, and that by calibrating the constant of the penalty with a data-based method, one can obtain better results than with other usual methods. The data-driven calibration algorithm is a method proposed by Birgé and Massart, (2007) and developed by Arlot and Massart, (2009) to find the constant of the penalty function. Theoretical properties of these data-based penalization procedures have been studied by Birgé and Massart, (2007); Arlot and Massart, (2009); Baudry et al., (2012). The aim of this paper is to adapt these methods to K-medians algorithms. We first provide the shape of the penalty function and then use the slope heuristic method to calibrate the constant and construct a penalized criterion for selecting the number of clusters for K-medians algorithms.

The paper is organized as follows. We first provide a recap of two different methods for estimating the geometric median, followed by the introduction of three K-medians algorithms (“Online”, “Semi-Online”, and “Offline”). In Section 3, we propose a penalty shape for the penalized criterion and give an upper bound for the expected distortion of the empirically optimal codebook built with the selected number of clusters, which justifies our penalty function. We illustrate the proposed approach with some simulations and compare it with several methods in Section 4. Finally, the proofs are gathered in Section 5. All the proposed algorithms are available in the R package Kmedians on CRAN https://cran.r-project.org/package=Kmedians.

2 Framework

2.1 Geometric Median

In what follows, we consider a random variable X that takes values in d\mathbb{R}^{d} for some d1d\geq 1. It is well-known that the standard mean of X is not robust to corruptions. Hence, the median is preferred to the mean in robust statistics. The geometric median mm, also called L1L^{1}-median or spatial median, of a random variable XdX\in\mathbb{R}^{d} is defined by Haldane, (1948) as follows:

m=argminud𝔼[Xu].m=\arg\min_{u\in\mathbb{R}^{d}}\mathbb{E}\left[\left\|X-u\right\|\right].

For the one-dimensional case, the geometric median coincides with the usual median in $\mathbb{R}$. As the Euclidean space $\mathbb{R}^{d}$ is strictly convex, the geometric median $m$ exists and is unique as soon as the distribution is not concentrated on a straight line (Kemperman,, 1987). The geometric median is known to be robust and has a breakdown point of $0.5$.

Let us now consider a sequence $X_{1},\ldots,X_{n}$ of i.i.d. copies of $X$. In this paper, we focus on two methods to estimate the geometric median. The first one is iterative and consists in considering the fixed-point estimates (Weiszfeld,, 1937; Vardi and Zhang,, 2000)

m^t+1=i𝒳tXiXim^ti𝒳t1Xim^t\hat{m}_{t+1}=\frac{\sum_{i\in\mathcal{X}_{t}}\frac{X_{i}}{\left\|X_{i}-\hat{m}_{t}\right\|}}{\sum_{i\in\mathcal{X}_{t}}\frac{1}{\left\|X_{i}-\hat{m}_{t}\right\|}}

with an initial point $\hat{m}_{0}\in\mathbb{R}^{d}$ chosen arbitrarily such that it does not coincide with any of the $X_{i}$, and $\mathcal{X}_{t}=\{i,X_{i}\neq\hat{m}_{t}\}$. The Weiszfeld algorithm is a flexible technique, but it raises implementation difficulties for massive data in high-dimensional spaces.
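For illustration, a minimal R sketch of the Weiszfeld iteration is given below (the function name, starting point and stopping rule are our own choices, and this is not the implementation of the Kmedians package):

# Weiszfeld fixed-point iteration for the geometric median (illustrative sketch).
# X: n x d matrix of observations; returns an estimate of the geometric median.
weiszfeld_median <- function(X, max_iter = 100, tol = 1e-8) {
  m <- colMeans(X)                              # arbitrary starting point
  for (t in seq_len(max_iter)) {
    dists <- sqrt(rowSums(sweep(X, 2, m)^2))    # distances ||X_i - m_t||
    keep  <- dists > .Machine$double.eps        # points that do not coincide with m_t
    w     <- 1 / dists[keep]
    m_new <- colSums(X[keep, , drop = FALSE] * w) / sum(w)
    if (sqrt(sum((m_new - m)^2)) < tol) return(m_new)
    m <- m_new
  }
  m
}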

An alternative and simple estimation algorithm, which can be seen as a stochastic gradient algorithm (Robbins and Monro,, 1951; Ruppert,, 1985; Duflo,, 1997; Cardot et al.,, 2013), is defined as follows

mj+1=mj+γjXj+1mjXj+1mjm_{j+1}=m_{j}+\gamma_{j}\frac{X_{j+1}-m_{j}}{\left\|X_{j+1}-m_{j}\right\|}

where $m_{0}$ is an arbitrarily chosen starting point and $\gamma_{j}$ is a step size such that, for all $j\geq 1$, $\gamma_{j}>0$, $\sum_{j\geq 1}\gamma_{j}=\infty$ and $\sum_{j\geq 1}\gamma_{j}^{2}<\infty$. Its averaged version (ASG), introduced by Polyak and Juditsky, (1992) and adapted by Cardot et al., (2013), which is effective for large samples of high-dimensional data, is defined by

m¯j+1=m¯j+1j+1(mj+1m¯j).\overline{m}_{j+1}=\overline{m}_{j}+\frac{1}{j+1}(m_{j+1}-\overline{m}_{j}).

One can speak about averaging since m¯j=1ji=1jmi\overline{m}_{j}=\frac{1}{j}\sum_{i=1}^{j}m_{i}. We note that, under suitable assumptions, both m^t\hat{m}_{t} and m¯t\overline{m}_{t} are asymptotically efficient (Vardi and Zhang,, 2000; Cardot et al.,, 2013).
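As an illustration, a minimal R sketch of the averaged stochastic gradient estimate (again with our own naming and default step-size choices, not the Kmedians implementation):

# Averaged stochastic gradient (ASG) estimate of the geometric median (sketch).
# One pass over the data, with step size gamma_j = c_gamma / j^alpha, alpha in (1/2, 1).
asg_median <- function(X, c_gamma = 1, alpha = 0.66) {
  m    <- X[1, ]                                # arbitrary starting point m_0
  mbar <- m
  for (j in 2:nrow(X)) {
    diff <- X[j, ] - m
    nrm  <- sqrt(sum(diff^2))
    if (nrm > 0) m <- m + (c_gamma / (j - 1)^alpha) * diff / nrm
    mbar <- mbar + (m - mbar) / j               # Polyak-Ruppert averaging
  }
  mbar
}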

2.2 K-medians

For a positive integer $k$, a vector quantizer $Q$ of dimension $d$ and codebook size $k$ is a (measurable) mapping of the $d$-dimensional Euclidean space $\mathbb{R}^{d}$ into a finite set of points $\left\{c_{1},\ldots,c_{k}\right\}$ (Linder,, 2000). More precisely, the points $c_{i}\in\mathbb{R}^{d}$, $i=1,\ldots,k$, are called the codepoints, and the vector composed of the codepoints $\left\{c_{1},\ldots,c_{k}\right\}$ is called the codebook, denoted by $c$. Given a $d$-dimensional random vector $X$ admitting a finite first order moment, the $L^{1}$-distortion of a vector quantizer $Q$ with codebook $c=\left\{c_{1},\ldots,c_{k}\right\}$ is defined by

W(c):=𝔼[minj=1,..,kXcj].W(c):=\mathbb{E}\left[\min_{j=1,..,k}\left\|X-c_{j}\right\|\right]. (2)

Let us now consider $X_{1},\ldots,X_{n}$ i.i.d. random vectors in $\mathbb{R}^{d}$ with the same law as $X$. Then, one can define the empirical $L^{1}$-distortion as:

Wn(c):=1ni=1nminj=1,..,kXicj.W_{n}(c):=\frac{1}{n}\sum_{i=1}^{n}\min_{j=1,..,k}\left\|X_{i}-c_{j}\right\|. (3)

In this paper, we consider two types of K-medians algorithms: a sequential and a non sequential one. The non sequential algorithm uses a Lloyd-style iteration, which alternates between an expectation (E) step and a maximization (M) step, and is precisely described in Algorithm 1:

Inputs : D={x1,,xn}D=\left\{x_{1},...,x_{n}\right\} datapoints, k number of clusters
Output : A set of k clusters : C1,,CkC_{1},...,C_{k}
Randomly choose k centroids : m1,,mkm_{1},...,m_{k}.
while the clusters change do
       for 1in1\leq i\leq n do
             r=argmin1jkximjr=\arg\min_{1\leq j\leq k}\left\|x_{i}-m_{j}\right\|
             CrxiC_{r}\leftarrow x_{i}
            
       end for
      for 1jk1\leq j\leq k do
             mj=argminmi,xiCjximm_{j}=\arg\min_{m}\sum_{i,x_{i}\in C_{j}}\left\|x_{i}-m\right\|
            
       end for
      
end while
Algorithm 1: Non Sequential K-medians Algorithm.
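As a complement to the pseudocode, a minimal R sketch of this non sequential algorithm could read as follows (the geometric median of each cluster is approximated by a few Weiszfeld iterations; names and stopping rules are our own choices, not the Kmedians implementation):

# Sketch of Algorithm 1: Lloyd-style alternation between assignment to the
# nearest centre and a geometric-median update of each centre.
kmedians_offline <- function(X, k, max_iter = 50) {
  n <- nrow(X)
  centers <- X[sample(n, k), , drop = FALSE]         # random initialisation
  cluster <- rep(0L, n)
  for (it in seq_len(max_iter)) {
    # assignment step: nearest centre in Euclidean norm
    D <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_cluster <- max.col(-D)
    if (all(new_cluster == cluster)) break           # clusters did not change
    cluster <- new_cluster
    # update step: geometric median of each cluster (a few Weiszfeld iterations)
    for (j in seq_len(k)) {
      pts <- X[cluster == j, , drop = FALSE]
      if (nrow(pts) == 0) next
      m <- colMeans(pts)
      for (t in 1:25) {
        dd <- sqrt(rowSums(sweep(pts, 2, m)^2))
        w  <- ifelse(dd > 1e-10, 1 / dd, 0)          # drop points equal to the iterate
        if (sum(w) > 0) m <- colSums(pts * w) / sum(w)
      }
      centers[j, ] <- m
    }
  }
  list(centers = centers, cluster = cluster)
}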

For $1\leq j\leq k$, $m_{j}$ is nothing but the geometric median of the points in the cluster $C_{j}$. As $m_{j}$ is not explicit, we will use the Weiszfeld algorithm (indicated by “Offline”) or ASG (indicated by “Semi-Online”) to estimate it. The Online K-medians algorithm proposed by Cardot et al., (2012), based on an averaged Robbins-Monro procedure (Robbins and Monro,, 1951; Polyak and Juditsky,, 1992), is described in Algorithm 2:

Inputs : D={x1,,xn}D=\left\{x_{1},...,x_{n}\right\} datapoints, k number of clusters, cγ>0c_{\gamma}>0 and α(1/2,1)\alpha\in(1/2,1)
Output : A set of k clusters : C1,,CkC_{1},...,C_{k}
Randomly choose k centroids : m1,,mkm_{1},...,m_{k}.
m¯j=mj1jk\overline{m}_{j}=m_{j}\hskip 2.84526pt\forall\hskip 2.84526pt1\leq j\leq k
nj=11jkn_{j}=1\hskip 2.84526pt\forall\hskip 2.84526pt1\leq j\leq k
for 1in1\leq i\leq n do
       r=argmin1jkxim¯jr=\arg\min_{1\leq j\leq k}\left\|x_{i}-\overline{m}_{j}\right\|
       CrxiC_{r}\leftarrow x_{i}
       mrmr+cγ(nr+1)αximrximrm_{r}\leftarrow m_{r}+\frac{c_{\gamma}}{\left(n_{r}+1\right)^{\alpha}}\frac{x_{i}-m_{r}}{\left\|x_{i}-m_{r}\right\|}
       m¯rnrm¯r+mrnr+1\overline{m}_{r}\leftarrow\frac{n_{r}\overline{m}_{r}+m_{r}}{n_{r}+1}
       nrnr+1n_{r}\leftarrow n_{r}+1
      
end for
Algorithm 2: Online K-medians Algorithm.

The non-sequential algorithms are effective, but their computational time is huge compared to the sequential (“Online”) algorithm, which is very fast and only requires $O(knd)$ operations, where $n$ is the sample size, $k$ the number of clusters and $d$ the dimension. Furthermore, for large samples, the Online algorithm is expected to estimate the centers of the clusters as well as the non-sequential ones (Cardot et al.,, 2012). Hence, the Online algorithm should be preferred for large sample sizes, and the non-sequential ones for smaller samples.
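For illustration, a minimal R sketch of Algorithm 2 (with our own naming, not the implementation of the Kmedians package) could look as follows; note that each data point is processed only once:

# Sketch of Algorithm 2 (Online K-medians): each point is assigned to the nearest
# averaged centre, which is then updated by an averaged stochastic gradient step.
kmedians_online <- function(X, k, c_gamma = 1, alpha = 0.75) {
  n <- nrow(X)
  m    <- X[sample(n, k), , drop = FALSE]      # random initialisation of the centres
  mbar <- m                                    # averaged centres
  nj   <- rep(1, k)
  cluster <- integer(n)
  for (i in seq_len(n)) {
    d2 <- rowSums(sweep(mbar, 2, X[i, ])^2)    # squared distances to the averaged centres
    r  <- which.min(d2)
    cluster[i] <- r
    diff <- X[i, ] - m[r, ]
    nrm  <- sqrt(sum(diff^2))
    if (nrm > 0) m[r, ] <- m[r, ] + c_gamma / (nj[r] + 1)^alpha * diff / nrm
    mbar[r, ] <- (nj[r] * mbar[r, ] + m[r, ]) / (nj[r] + 1)
    nj[r] <- nj[r] + 1
  }
  list(centers = mbar, cluster = cluster)
}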

3 The choice of k

In this section, we adapt the results shown for K-means in Fischer, (2011) to K-medians clustering. To this aim, let $X_{1},\ldots,X_{n}$ be random vectors with the same law as $X$, and assume that $\|X\|\leq R$ almost surely for some $R>0$. Let $S_{k}$ denote the countable set of all $\left\{c_{1},\ldots,c_{k}\right\}\in\mathbb{Q}^{k}$, where $\mathbb{Q}$ is some grid over $\mathbb{R}^{d}$. It is important to note that $\mathbb{Q}$ represents the search space for the centers. Since $\|X\|$ is assumed to be bounded by $R$, we consider a grid $\mathbb{Q}\subset\overline{\mathcal{B}}(0,R)$, where $\overline{\mathcal{B}}(0,R)$ denotes the closed ball centered at 0 with radius $R$. A codebook $\hat{c}_{k}$ is said to be empirically optimal if $W_{n}(\hat{c}_{k})=\min_{c\in S_{k}}W_{n}(c)$, i.e. $\hat{c}_{k}$ is a minimizer of the criterion $W_{n}(c)$ over $S_{k}$. Our aim is to determine $\hat{k}$ minimizing a criterion of the type

crit(k)=Wn(c^k)+pen(k)\text{crit}(k)=W_{n}(\hat{c}_{k})+\text{pen}(k)

where pen : {1,,n}+\left\{1,...,n\right\}\rightarrow\mathbb{R}_{+} is a penalty function described later. The purpose of this penalty method is to prevent choosing too large a value for k by introducing a penalty into the objective function.

In this section, we give an upper bound for the expected distortion of the empirically optimal codebook built with the selected number of clusters, which is based on a general non-asymptotic upper bound for

𝔼[supcSk{W(c)Wn(c)}].\mathbb{E}\left[\sup_{c\in S_{k}}\left\{W(c)-W_{n}(c)\right\}\right].
Theorem 3.1.

Let X1,,XnX_{1},\ldots,X_{n} be random vectors taking values in d\mathbb{R}^{d} with the same law as XX, and we assume that XR\|X\|\leq R almost surely for some R>0R>0. Define WW and WnW_{n} as in (2) and (3), respectively. Then for all 1kn1\leq k\leq n,

𝔼[supcSk{W(c)Wn(c)}]48Rkdn.\mathbb{E}\left[\sup_{c\in S_{k}}\left\{W(c)-W_{n}(c)\right\}\right]\leq 48R\sqrt{\frac{kd}{n}}.

This theorem shows that the expected maximal difference between the distortion and the empirical distortion over the codebooks of $S_{k}$ is of order $n^{-1/2}$. Selecting the search space for the centers is crucial, because a larger search space results in a higher upper bound.

Theorem 3.2.

Let XX be a random vector taking values in d\mathbb{R}^{d} and we assume that XR\|X\|\leq R almost surely for some R>0R>0. Consider nonnegative weights {xk}1kn\left\{x_{k}\right\}_{1\leq k\leq n} such that k=1nexk=Σ\sum_{k=1}^{n}e^{-x_{k}}=\Sigma. Define WW as in (2) and suppose that for all 1kn1\leq k\leq n

pen(k)R(48kdn+2xk2n).\text{pen}(k)\geq R\left(48\sqrt{\frac{kd}{n}}+2\sqrt{\frac{x_{k}}{2n}}\right).

Then:

𝔼[W(c~)]inf1kn{infcSkW(c)+pen(k)}+ΣRπ2n,\mathbb{E}\left[W(\tilde{c})\right]\leq\inf_{1\leq k\leq n}\left\{\inf_{c\in S_{k}}W(c)+\text{pen}(k)\right\}+\Sigma R\sqrt{\frac{\pi}{2n}},

where $\tilde{c}=\hat{c}_{\hat{k}}$ is the minimizer of the penalized criterion.

We remark the presence of the weights $\left\{x_{k}\right\}_{1\leq k\leq n}$ in the penalty function and of $\Sigma$, which depends on these weights, in the upper bound for the expected distortion at $\tilde{c}$. The larger the weights $\left\{x_{k}\right\}_{1\leq k\leq n}$, the smaller the value of $\Sigma$, so that a compromise has to be made between these two terms. Let us consider the simple situation where one can take $x_{k}=Lk$ for some positive constant $L$ such that $\Sigma=\sum_{k=1}^{n}e^{-x_{k}}\leq 1$. If we take

pen(k)=R(48kdn+2Lk2n)=Rkn(48d+2L2),\text{pen}(k)=R\left(48\sqrt{\frac{kd}{n}}+2\sqrt{\frac{Lk}{2n}}\right)=R\sqrt{\frac{k}{n}}\left(48\sqrt{d}+2\sqrt{\frac{L}{2}}\right),

we deduce that the penalty shape is akna\sqrt{\frac{k}{n}} where aa is a constant.

Proposition 3.1.

Let XX be a d-dimensional random vector such that XR\|X\|\leq R almost surely. Then for all 1kn1\leq k\leq n,

infcSkW(c)4Rk1/d,\inf_{c\in S_{k}}W(c)\leq 4Rk^{-1/d},

where WW is defined in (2).

Assume that for every 1kn1\leq k\leq n

pen(k)=aRkn,\text{pen}(k)=aR\sqrt{\frac{k}{n}},

where $a$ is a positive constant satisfying $a\geq 48\sqrt{d}+2\sqrt{\frac{L}{2}}$, so that the hypothesis of Theorem 3.2 is fulfilled. Using Theorem 3.2 and Proposition 3.1, we obtain:

𝔼[W(c~)]R(inf1kn{4k1/d+akn}+Σπ2n).\mathbb{E}\left[W(\tilde{c})\right]\leq R\left(\inf_{1\leq k\leq n}\left\{4k^{-1/d}+a\sqrt{\frac{k}{n}}\right\}+\Sigma\sqrt{\frac{\pi}{2n}}\right).

Minimizing the term on the right-hand side of the previous inequality leads to $k$ of the order of $n^{\frac{d}{d+2}}$ and

𝔼[W(c~)]=𝒪(n1d+2).\mathbb{E}\left[W(\tilde{c})\right]=\mathcal{O}(n^{-\frac{1}{d+2}}).
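This order can be checked by treating $k$ as a continuous variable and balancing the two terms:

\[
\frac{d}{dk}\left\{4k^{-1/d}+a\sqrt{\frac{k}{n}}\right\}
=-\frac{4}{d}\,k^{-\frac{1}{d}-1}+\frac{a}{2\sqrt{kn}}=0
\quad\Longleftrightarrow\quad
k^{\frac{d+2}{2d}}=\frac{8\sqrt{n}}{ad},
\]

so that the minimizer is of order $n^{\frac{d}{d+2}}$, and plugging this value into $4k^{-1/d}$ or into $a\sqrt{k/n}$ gives a bound of order $n^{-\frac{1}{d+2}}$.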

We conclude that our penalty shape is $a\sqrt{\frac{k}{n}}$, where $a$ is a constant. In Birgé and Massart, (2007), a data-driven method has been introduced to calibrate such criteria whose penalties are known up to a multiplicative factor: the “slope heuristics”. This method consists of estimating the constant of the penalty function by the slope of the expected linear relation of $-W_{n}(\hat{c}_{k})$ with respect to the penalty shape values $\text{pen}_{\text{shape}}(k)=\sqrt{\frac{k}{n}}$.

Estimation of the constant $a$: let us denote $c^{*}=\arg\min_{c\in S}W(c)$ and $c_{k}=\arg\min_{c\in S_{k}}W(c)$, where $S$ is any linear subspace of $\mathbb{R}^{d}$ and $S_{k}$ is a set of predictors (called a model). It was shown in Birgé and Massart, (2007); Arlot and Massart, (2009); Baudry et al., (2012) that, under suitable conditions, the optimal penalty satisfies for large $n$:

penopt(k):=aoptpenshape(k)2(Wn(c)Wn(c^k)).\text{pen}_{\text{opt}}(k):=a_{\text{opt}}\text{pen}_{\text{shape}}(k)\approx 2(W_{n}(c^{*})-W_{n}(\hat{c}_{k})).

This gives

aopt2penshape(k)Wn(c)Wn(c^k).\frac{a_{opt}}{2}\text{pen}_{\text{shape}}(k)-W_{n}(c^{*})\approx-W_{n}(\hat{c}_{k}).

The term $-W_{n}(\hat{c}_{k})$ behaves like a linear function of the penalty shape for large $k$. The slope $\hat{S}$ of the linear regression of $-W_{n}(\hat{c}_{k})$ with respect to $\text{pen}_{\text{shape}}(k)$ is computed to estimate $\frac{a_{\text{opt}}}{2}$. Finally, we obtain

pen(k):=aoptpenshape(k)=2S^penshape(k).\text{pen}(k):=a_{\text{opt}}\text{pen}_{\text{shape}}(k)=2\hat{S}\text{pen}_{\text{shape}}(k).

Of course, since this method is based on asymptotic results, it can encounter some practical problems when the dimension dd is larger than the sample size nn.
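A minimal R sketch of this calibration is given below, assuming the empirical distortions $W_{n}(\hat{c}_{k})$ have already been computed for $k=1,\ldots,K_{\max}$ (the function name and the number of points kept for the regression are our own choices; the capushe package used in the next section implements a more robust version):

# Slope heuristic calibration of the penalty (illustrative sketch).
# Wn_hat: vector of empirical distortions W_n(c_hat_k) for k = 1, ..., K_max.
select_k_slope <- function(Wn_hat, n, n_points = 20) {
  K_max     <- length(Wn_hat)
  pen_shape <- sqrt(seq_len(K_max) / n)
  # linear regression of -W_n(c_hat_k) on the penalty shape, for large values of k
  idx   <- max(1, K_max - n_points + 1):K_max
  slope <- coef(lm(-Wn_hat[idx] ~ pen_shape[idx]))[2]
  pen   <- 2 * slope * pen_shape             # pen(k) = 2 * slope * sqrt(k / n)
  which.min(Wn_hat + pen)                    # k_hat minimising the penalized criterion
}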

4 Simulations

This whole method is implemented in R and all the studied algorithms are available in the R package Kmedians https://cran.r-project.org/package=Kmedians. In what follows, the initialization of the centers is obtained with the robust hierarchical clustering algorithm of the genieclust package (Gagolewski et al.,, 2016).

4.1 Visualization of results with the package Kmedians

In Section 3, we proved that the penalty shape is $a\sqrt{\frac{k}{n}}$, where $a$ is a constant to calibrate. To find the constant $a$, we use the data-based calibration algorithm for penalization procedures explained at the end of Section 3. This data-driven slope estimation method is implemented in CAPUSHE (CAlibrating Penalty Using Slope HEuristics) (Brault et al.,, 2011), which is available in the R package capushe https://cran.r-project.org/package=capushe. The proposed slope estimation method is designed to be robust to possible undesirable variations of the criterion. More precisely, for a given number of clusters $k$, the algorithm may be trapped in a local minimum, which could create a “bad point” for the slope heuristic; the slope heuristic has therefore been designed to be robust to the presence of such points.

In what follows, we consider a random variable XX following a Gaussian Mixture Model with k=6k=6 classes where the mixture density function is defined as

p(x)=j=1kπj𝒩(x|μj,Id)p(x)=\sum_{j=1}^{k}\pi_{j}\mathcal{N}(x|\mu_{j},\operatorname{I}_{d})

with $\pi_{j}=\frac{1}{k}$ for all $1\leq j\leq k$, $\mu_{j}\sim\mathcal{U}_{10}$, where $\mathcal{U}_{10}$ is the uniform law on the sphere of radius 10, and

𝒩(x|μ,Id)=1(2π)dexp(12xμ2).\mathcal{N}(x|\mu,\operatorname{I}_{d})=\frac{1}{\sqrt{(2\pi)^{d}}}\exp\left(-\frac{1}{2}\|x-\mu\|^{2}\right).

In what follows, we consider $n=3000$ i.i.d. realizations of $X$ with $d=5$. We first focus on some visualizations of our slope method.
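This simulation setting can be reproduced with a few lines of R (a sketch; the seed is arbitrary and the centers are drawn uniformly on the sphere by normalising Gaussian vectors):

# Gaussian mixture with k = 6 balanced classes in dimension d = 5,
# centres drawn uniformly on the sphere of radius 10, identity covariance.
set.seed(1)
n <- 3000; d <- 5; k <- 6
mu <- t(apply(matrix(rnorm(k * d), k, d), 1,
              function(u) 10 * u / sqrt(sum(u^2))))   # centres on the sphere
labels <- sample(1:k, n, replace = TRUE)              # pi_j = 1/k
X <- mu[labels, ] + matrix(rnorm(n * d), n, d)        # n i.i.d. realizations of X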

Figure 1: Evolution of Wn(c^k)-W_{n}(\hat{c}_{k}) with respect to penalty shape: k/n\sqrt{k/n}.
Figure 2: Evolution of Wn(c^k)W_{n}(\hat{c}_{k}) (on the left) and crit(k)\text{crit}(k) (on the right) with respect to kk.

To estimate $a\approx 2\hat{S}$ in the penalty function, it is sufficient to estimate $\hat{S}$, which is the slope of the red curve in Figure 1. As shown in Figure 1, the regression slope is estimated using the last 21 points, since the curve behaves like an affine function when $k$ is large. In Figure 2 (left), two possible elbows are observed in the curve, so that the elbow method suggests considering either 5 or 6 clusters. One would rather choose 5, since the elbow at 5 is more pronounced than the one at 6; therefore, the elbow method is not ideal in this case.

Figures 3 to 5 represent the data as curves, which we call “profiles” (the x-axis corresponds to the coordinates, and the y-axis to the values of the coordinates), gathered by cluster and with the centers of the groups represented in red. We also show the first two principal components of the data, obtained with robust principal component analysis (RPCA) (Cardot and Godichon-Baggioni,, 2017). In Figure 3, we focus on the clustering obtained with the K-medians algorithm (“Offline” version) for non contaminated data. In each cluster, the curves are close to each other and to the median, and the profiles differ from one cluster to another, meaning that our method separated the 6 groups well. In order to assess the robustness of the proposed method, we contaminated the data with the law $Z=(Z_{1},\ldots,Z_{5})$, where the $Z_{i}$ are i.i.d. with $Z_{i}\sim\mathcal{T}_{1}$, $\mathcal{T}_{1}$ denoting the Student distribution with one degree of freedom. Applying our method for selecting the number of clusters for K-medians algorithms, we selected the correct number of clusters. Furthermore, the obtained groups, despite the presence of some outliers in each cluster, are coherent. In the case of K-means clustering, however, the method produces non homogeneous clusters, i.e. it assimilates some far outliers to single clusters (see Figure 5). It is important to note that, in the case of contaminated data (Figures 4 and 5), we only represented $95\%$ of the data for better visualization. Hence, in Figure 5, clusters 5, 7, 8, 11 and 12 are not visible since they only contain “far” outliers.

Figure 3: Profiles (on the left) and clustering via K-medians represented on the first two principal components (on the right) without contaminated data.
Figure 4: Profiles (on the left) and clustering via K-medians algorithm represented on the first two principal components (on the right) with 5%5\% of contaminated data.
Figure 5: Profiles (on the left) and clustering via K-means algorithm represented on the first two principal components (on the right) with 5%5\% of contaminated data.

4.2 Comparison with Gap Statistic and Silhouette

In what follows, we focus on the choice of the number of clusters and compare our results with different methods. For this, we generated some basic data sets in three different scenarios (see Fischer, (2011)):
(S1) $4$ clusters in dimension $3$: the data are generated by a Gaussian mixture centered at $(0,0,0)$, $(0,2,3)$, $(3,0,-1)$, and $(-3,-1,0)$ with variance equal to the identity matrix. Each cluster contains 500 data points.
(S2) $5$ clusters in dimension $4$: the data are generated by a Gaussian mixture centered at $(0,0,0,0)$, $(3,5,-1,0)$, $(-5,0,0,0)$, $(1,1,6,-2)$ and $(1,-3,-2,5)$ with variance equal to the identity matrix. Each cluster contains 500 data points.
(S3) $3$ clusters in dimension $2$: the data are generated by a Student mixture centered at $(0,0)$, $(0,6)$ and $(5,3)$ with 2 degrees of freedom. Each cluster contains 500 data points.

We applied three different methods for determining the number of clusters: the proposed slope method, the Gap Statistic and the Silhouette method. For each method, we used four clustering algorithms: K-medians (“Online”, “Semi-Online”, “Offline”) and K-means. For each scenario, we contaminated the data with the law $Z=(Z_{1},\ldots,Z_{d})$, where the $Z_{i}$ are i.i.d. with $Z_{i}\sim\mathcal{T}_{1}$, $\mathcal{T}_{1}$ being the Student distribution with 1 degree of freedom. Then, the different methods and scenarios are evaluated by considering:

  • $N$: the number of times the correct number of clusters is selected over the 50 repeated trials without contaminated data.

  • $\bar{k}$: the average number of clusters selected over the 50 trials without contaminated data.

  • $N_{0.1}$: the number of times the correct number of clusters is selected over the 50 repeated trials with $10\%$ of contaminated data.

  • $\bar{k}_{0.1}$: the average number of clusters selected over the 50 trials with $10\%$ of contaminated data.

Simulations S1 S2 S3
Algorithms $N$ $\bar{k}$ $N_{0.1}$ $\bar{k}_{0.1}$ $N$ $\bar{k}$ $N_{0.1}$ $\bar{k}_{0.1}$ $N$ $\bar{k}$ $N_{0.1}$ $\bar{k}_{0.1}$
Slope Offline 50 4 50 4 50 5 50 5 50 3 49 3.04
Semi-Online 50 4 49 4.02 50 5 46 5.1 50 3 49 3.04
Online 48 4 42 4.1 50 5 40 5.2 50 3 49 3.04
K-means 50 4 1 7.9 50 5 2 6.7 3 5.3 0 7.2
Gap Offline 6 1.7 0 1 47 4.8 2 1.2 0 1 0 1
Semi-Online 7 1.7 0 1 47 4.8 2 1.2 0 1 0 1
Online 8 2.4 0 1 47 4.8 2 1.2 0 1 0 1
K-means 0 1.2 0 1.2 12 2 0 1.3 0 1 0 1
Silhouette Offline 0 3 0 2.9 27 4.4 1 3.5 50 3 44 3.1
Semi-Online 0 3 0 2.9 24 4.4 1 3.5 50 3 43 3.1
Online 0 3 2 3.2 22 4.5 2 4.5 49 3.02 29 3.5
K-means 0 3 7 3.2 20 4.5 0 6.7 3 5.9 2 7.2
Table 1: Comparison of the number of times the right number of clusters is selected and of the average selected number of clusters, obtained with the different methods, without contamination and with $10\%$ of contaminated data.

In the case of well-separated clusters, as in scenario (S2), the Gap Statistic and Silhouette methods give competitive results. Nevertheless, for closer clusters, as in scenario (S1), the slope method works much better than the Gap Statistic and Silhouette methods. The Gap Statistic method only works in scenario (S2) and is ineffective in the presence of contamination; in the scenarios with closer clusters, it often selects 1 as the number of clusters. The Silhouette method performs moderately well in scenario (S2) and very well in scenario (S3), but it is globally not as competitive as the slope method, especially for contaminated data. In scenarios (S1) and (S2) with the slope method, Offline, Semi-Online, Online and K-means give good results, but in the presence of contamination, K-means fails completely while the three K-medians algorithms remain not too sensitive. Furthermore, on non-Gaussian data (scenario (S3)), the K-means method does not work at all. In such cases, K-medians clustering is often preferred over K-means clustering.

Overall, in every scenario, the Offline, Semi-Online and Online K-medians algorithms combined with the slope method give very competitive results, and when the data are contaminated, they clearly outperform the other methods, the Offline version being the best.

4.3 Contaminated Data in Higher Dimensions

We now focus on the impact of contaminated data on the selection of the number of clusters in K-medians clustering, particularly in higher dimensions. We compare our method with the Silhouette method and SigClust (Liu et al.,, 2008; Huang et al.,, 2015; Bello et al.,, 2023) in the Offline setting, since it yields competitive results, as noted in the previous section. SigClust is a method which makes it possible to test whether a sample comes from a single Gaussian distribution or from several, in high dimension. Then, starting from $k=k_{0}$, we test for all possible pairs of clusters whether the fusion of the two clusters comes from a single Gaussian or not. If the test rejects the hypothesis that the combined cluster is a single Gaussian for all pairs, the same procedure is repeated for $k+1$. If there is a pair for which the test is not rejected, it is considered that these two clusters should be merged, and the procedure is stopped. The optimal number of clusters is then determined as $k_{\text{opt}}=k-1$. It is important to note that we did not compare with the Gap Statistic, as it is computationally expensive, especially in high dimensions.

In this aim, we generate data using a Gaussian mixture model with 10 classes in dimensions 100 and 200, where the centers of the classes are randomly generated on a sphere with radius $10$, and each class contains 100 data points. The data are contaminated with the law $Z=(Z_{1},\ldots,Z_{d})$, where $d$ is the dimension (100 or 200), the $Z_{i}$ being i.i.d., with two possible scenarios:

  1. $Z_{i}\sim\mathcal{T}_{1}$,

  2. $Z_{i}\sim\mathcal{T}_{2}$.

Here, $\mathcal{T}_{m}$ denotes the Student distribution with $m$ degrees of freedom. In what follows, let us denote by $\rho$ the proportion of contaminated data. In order to compare the different clustering results, we focus on the Adjusted Rand Index (ARI) (Rand,, 1971; Hubert and Arabie,, 1985), a measure of similarity between two clusterings based on the number of pairs of points that are classified consistently in both partitions, corrected for chance. For each scenario, we evaluate the average number of clusters obtained over 50 trials and the average ARI, evaluated only on the uncontaminated data.
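The ARI can be computed directly from its contingency-table definition, as in the following R sketch (an illustration of the index we report, not the code used for the experiments):

# Adjusted Rand Index between two partitions x and y (sketch).
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)                         # contingency table of the two partitions
  sum_ij <- sum(choose(tab, 2))
  sum_a  <- sum(choose(rowSums(tab), 2))
  sum_b  <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(sum(tab), 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}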

$\rho$ 0 0.01 0.02 0.03 0.05 0.1
$d=100$ $Z_{i}\sim\mathcal{T}_{1}$ Our Method $\bar{k}$ 10 10 10 10 10.2 7.6
Silhouette 10 8.7 5.9 4.2 3.6 2.6
SigClust 10 2.7 2.9 3.1 4.3 5.2
Our Method ARI 1 1 1 0.94 0.91 0.53
Silhouette 1 0.75 0.51 0.29 0.18 0.15
SigClust 1 0.22 0.28 0.29 0.31 0.46
$Z_{i}\sim\mathcal{T}_{2}$ Our Method $\bar{k}$ 10 10 10 10 10 10.9
Silhouette 10 10 10 10.1 9.8 10.4
SigClust 10 5.3 4.5 4.2 4.4 4.2
Our Method ARI 1 1 1 1 0.99 0.99
Silhouette 1 1 0.99 0.99 0.99 0.98
SigClust 1 0.52 0.36 0.35 0.34 0.32
$d=200$ $Z_{i}\sim\mathcal{T}_{1}$ Our Method $\bar{k}$ 10 10 10 9.6 9.6 7.3
Silhouette 10 6 2.9 2.9 2.5 3.1
SigClust 9.2 2.7 3.4 4.3 5 6.5
Our Method ARI 1 1 1 0.89 0.76 0.53
Silhouette 1 0.39 0.18 0.17 0.16 0.21
SigClust 0.97 0.22 0.28 0.34 0.42 0.48
$Z_{i}\sim\mathcal{T}_{2}$ Our Method $\bar{k}$ 10 10 10 10 10 10
Silhouette 10 10 10 10 10.1 10.2
SigClust 9.2 4.6 4.3 3.9 3.8 2.7
Our Method ARI 1 1 1 1 1 1
Silhouette 1 1 1 1 0.99 0.99
SigClust 0.97 0.37 0.35 0.32 0.32 0.22
Table 2: Comparison of the selected number of clusters and the averaged ARI obtained with the different methods, with respect to the proportion of contaminated data, for $Z_{i}\sim\mathcal{T}_{1}$ and $Z_{i}\sim\mathcal{T}_{2}$.

We observe that in the case of non-contamination, we obtain similar results across all methods. However, in the presence of contamination, our method consistently performs well, while others struggle to identify an appropriate number of clusters. With a Student distribution contamination of 1 degree of freedom, our method excels in terms of both the number of clusters and the ARI. The results with a Student distribution contamination of 2 degrees of freedom are comparable to those obtained using the Silhouette method.

In summary, our method demonstrates remarkable robustness in the face of contaminated data, making it a strong choice for clustering in higher dimensions. The comparisons with the Silhouette method, the Gap Statistic, and SigClust in the Offline setting reaffirm the effectiveness of our approach, especially when computational efficiency is a critical factor for high-dimensional data.

4.4 An illustration on real data

We first briefly discuss the data used for clustering, which was provided by Califrais, a company specializing in developing environmentally responsible technology to optimize logistics flows on a large scale. Our goal is to build a Recommender System designed to suggest items individually for each user based on their historical data or preferences. In this scenario, clustering algorithms can be employed to identify groups of similar customers, where each group consists of customers sharing similar features or properties. It is crucial to perform robust clustering in order to develop an effective Recommender System.

The dataset includes information on 508 customers, with nine features representing the total number of products purchased in each of the following categories: Fruits, Vegetables, Dairy products, Seafood, Butcher, Deli, Catering, Grocery, and Accessories and equipment. Therefore, we have a sample size of $n=508$ and a dimension $d=9$. To apply clustering, we determine the appropriate number of clusters using the proposed method. Before applying our method, we normalize the data using RobustScaler, which removes the median and scales the data according to the interquartile range, i.e. the range between the 1st and the 3rd quartiles.
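This normalization can be reproduced column-wise in a couple of lines of R (an illustrative sketch, not the exact preprocessing code used here):

# Robust scaling: centre each feature by its median and scale by its interquartile range.
robust_scale <- function(X) {
  apply(X, 2, function(x) (x - median(x)) / IQR(x))
}
X_scaled <- robust_scale(X)   # X: the 508 x 9 matrix of customer features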

Figure 6: Califrais data: Profiles (on the left) and clustering with Slope method represented on the first two principal components (on the right).
Figure 7: Califrais data: Profiles (on the left) and clustering with Silhouette method represented on the first two principal components (on the right).
Figure 8: Califrais data: Profiles (on the left) and clustering with Gap Statistics method represented on the first two principal components (on the right).

We plotted the profiles of the clusters obtained using our slope method, the Silhouette method and the Gap Statistic in Figures 6, 7, and 8. We observe that our method indicates 5 clusters, while the Gap Statistic suggests 3 clusters and the Silhouette method 2 clusters. Regarding the Silhouette method, the second cluster obtained is not homogeneous, as seen in Figure 7. The Gap Statistic method separates this second cluster into two clusters (Cluster 2 and Cluster 3); however, in the third cluster of the Gap Statistic (Figure 8), homogeneity is still not achieved. In Figure 6, it can be seen that the clusters generated by our slope method are more or less homogeneous. To establish a connection with the simulations conducted in Section 4.2: in scenario (S1), for example, we observed that the Silhouette and Gap Statistic methods failed to find the correct number of clusters when the clusters are close. This is reflected here, as the behavior of clients does not change drastically from one group to another, resulting in close clusters. To provide an overview of our clusters, the first cluster represents customers who regularly consume products from all categories. The third cluster consists of customers who frequently purchase catering products. Clusters 2, 4, and 5 correspond to customers who consume significant amounts of Butcher, Deli, and Catering products at different levels, as depicted in Figure 6.

4.5 Conclusion

The proposed penalized criterion, calibrated with the help of the slope heuristic, consistently gives competitive results for selecting the number of clusters in K-medians clustering, even in the presence of outliers, outperforming other methods such as the Gap Statistic, the Silhouette method, and SigClust. Notably, our method demonstrates excellent performance even in high dimension. The performances of the three K-medians algorithms, Offline, Semi-Online, and Online, are generally analogous, with Offline being slightly better. However, for large sample sizes, one may prefer the Online K-medians algorithm in terms of computation time. As discussed in Section 2, it is recommended to use the Offline algorithm for small sample sizes, the Semi-Online algorithm for medium sample sizes, and the Online algorithm for large sample sizes. In our real-data illustration, the proposed method produces more robust clusters and a more suitable number of clusters than the other methods.

In conclusion, our paper presents a robust and efficient approach for selecting clusters in K-medians, demonstrating superior performance even in challenging scenarios. The findings provide practical recommendations for algorithm selection based on sample size, reinforcing the applicability of our proposed method in real-world clustering scenarios.

Acknowledgement

The authors wish to thank Califrais for providing the real-life data and Raphaël Carpintero Perez for the data preprocessing work.

5 Proofs

5.1 Some definitions and lemma

First, we provide some definitions and lemmas that are useful to prove Theorems 3.1 and 3.2.

Definitions :

  • Let (S,p)(S,p) be a totally bounded metric space. For any FSF\subset S and ϵ>0\epsilon>0 the ϵ\epsilon-covering number Np(F,ϵ)N_{p}(F,\epsilon) of F is defined as the minimum number of closed balls with radius ϵ\epsilon whose union covers FF.

  • Let (S,p)(S,p) be a totally bounded metric space. For any FSF\subset S,
    diam(F)=sup{p(x,y):x,yF}\text{diam}(F)=\sup\left\{p(x,y):x,y\in F\right\}.

  • A Family {Ts:sS}\left\{T_{s}:s\in S\right\} of zero-mean random variables indexed by the metric space (S, p) is called subgaussian in the metric p if for any λ>0\lambda>0 and s,sSs,s^{\prime}\in S we have

    𝔼[eλ(TsTs)]eλ2p(s,s)22.\mathbb{E}\left[e^{\lambda(T_{s}-T_{s^{\prime}})}\right]\leq e^{\frac{\lambda^{2}p(s,s^{\prime})^{2}}{2}}.
  • The family $\left\{T_{s}:s\in S\right\}$ is called sample continuous if for any sequence $s_{1},s_{2},\ldots\in S$ such that $s_{j}\rightarrow s\in S$, we have $T_{s_{j}}\rightarrow T_{s}$ with probability one.

Lemma 5.1 (Hoeffding, (1994)).

Let $Y_{1},\ldots,Y_{n}$ be independent zero-mean random variables such that $a\leq Y_{i}\leq b$, $i=1,\ldots,n$. Then for all $\lambda>0$,

𝔼[eλ(i=1nYi)]eλ2n(ba)28.\mathbb{E}\left[e^{\lambda(\sum_{i=1}^{n}Y_{i})}\right]\leq e^{\frac{\lambda^{2}n(b-a)^{2}}{8}}.
Lemma 5.2 (Cesa-Bianchi and Lugosi, (1999), Proposition 3).

If {Ts:sS}\left\{T_{s}:s\in S\right\} is subgaussian and sample continuous in the metric p, then

𝔼[supsSTs]120diam(S)/2lnNp(S,ϵ)𝑑ϵ.\mathbb{E}\left[\sup_{s\in S}T_{s}\right]\leq 12\int_{0}^{\text{diam}(S)/2}\sqrt{\ln{N_{p}(S,\epsilon)}}d\epsilon.
Lemma 5.3 (Bartlett et al., (1998), Lemma 1).

Let S(0,R)S(0,R) denote the closed d-dimensional sphere of radius RR centered at 0. Let ϵ>0\epsilon>0 and N(ϵ)N(\epsilon) denote the cardinality of the minimum ϵ\epsilon covering of S(0,R)S(0,R), that is, N(ϵ)N(\epsilon) is the smallest integer NN such that there exist points {y1,,yN}S(0,R)\left\{y_{1},...,y_{N}\right\}\subset S(0,R) with the property

supxS(0,R)min1iNxyiϵ.\sup_{x\in S(0,R)}\min_{1\leq i\leq N}\left\|x-y_{i}\right\|\leq\epsilon.

Then, for all ϵ2R\epsilon\leq 2R we have

N(ϵ)(4Rϵ)d.N(\epsilon)\leq\left(\frac{4R}{\epsilon}\right)^{d}.
Lemma 5.4.

For any 0<ϵ2R0<\epsilon\leq 2R and k1k\geq 1, the covering number of SkS_{k} in the metric

p(c,c)=supxR{|minj=1,..,kxcjminj=1,..,kxcj|}p(c,c^{\prime})=\sup_{\|x\|\leq R}\left\{\big{|}\min_{j=1,..,k}\left\|x-c_{j}\right\|-\min_{j=1,..,k}\|x-c^{\prime}_{j}\|\hskip 2.84526pt\big{|}\right\}

is bounded as

Np(Sk,ϵ)(4Rϵ)kd.N_{p}(S_{k},\epsilon)\leq\left(\frac{4R}{\epsilon}\right)^{kd}.
Proof of Lemma 5.4.

Let $0<\epsilon\leq 2R$. By Lemma 5.3, there exists an $\epsilon$-covering set of points $\left\{y_{1},...,y_{N}\right\}\subset S(0,R)$ with $N\leq\left(\frac{4R}{\epsilon}\right)^{d}$.
Since there are at most $N^{k}$ ways to choose $k$ codepoints from the set $\left\{y_{1},...,y_{N}\right\}$, it is enough to check that the corresponding codebooks form an $\epsilon$-covering of $S_{k}$ in the metric $p$ to obtain

Np(Sk,ϵ)(4Rϵ)kd.N_{p}(S_{k},\epsilon)\leq\left(\frac{4R}{\epsilon}\right)^{kd}.

Indeed, for any codebook $\left\{c_{1},...,c_{k}\right\}$ contained in $S(0,R)$, there exists a codebook $\left\{c^{\prime}_{1},...,c^{\prime}_{k}\right\}$ with codepoints among $\left\{y_{1},...,y_{N}\right\}$ such that $\left\|c_{j}-c^{\prime}_{j}\right\|\leq\epsilon$ for all $j$.
Let us first show that

$\min_{j=1,..,k}\left\|x-c^{\prime}_{j}\right\|-\min_{j=1,..,k}\left\|x-c_{j}\right\|\leq\epsilon.$

In this aim, let us consider qargminj=1,..,kxcjq\in\arg\min_{j=1,..,k}\left\|x-c_{j}\right\|, then

minj=1,..,kxcjminj=1,..,kxcjxcqxcqcqcqϵ.\min_{j=1,..,k}\left\|x-c^{\prime}_{j}\right\|-\min_{j=1,..,k}\left\|x-c_{j}\right\|\leq\left\|x-c^{\prime}_{q}\right\|-\left\|x-c_{q}\right\|\leq\left\|c_{q}-c^{\prime}_{q}\right\|\leq\epsilon.

In the same way, considering qargminj=1,..,kxcjq^{\prime}\in\arg\min_{j=1,..,k}\left\|x-c^{\prime}_{j}\right\| , we show

minj=1,..,kxcjminj=1,..,kxcjxcqxcqcqcqϵ.\min_{j=1,..,k}\left\|x-c_{j}\right\|-\min_{j=1,..,k}\left\|x-c^{\prime}_{j}\right\|\leq\left\|x-c_{q^{\prime}}\right\|-\left\|x-c^{\prime}_{q^{\prime}}\right\|\leq\left\|c_{q^{\prime}}-c^{\prime}_{q^{\prime}}\right\|\leq\epsilon.

So,

|minj=1,..,kxcjminj=1,..,kxcj|ϵ\big{|}\min_{j=1,..,k}\left\|x-c_{j}\right\|-\min_{j=1,..,k}\left\|x-c^{\prime}_{j}\right\|\hskip 2.84526pt\big{|}\leq\epsilon

for all $x$ such that $\|x\|\leq R$, that is, $p(c,c^{\prime})\leq\epsilon$, which concludes the proof.

Lemma 5.5 (McDiarmid et al., (1989), Massart, (2007) : Theorem 5.3).

If $X_{1},\ldots,X_{n}$ are independent random variables and $\mathcal{F}$ is a finite or countable class of real-valued functions such that $a\leq f\leq b$ for all $f\in\mathcal{F}$, then, if $Z=\sup_{f\in\mathcal{F}}\sum_{i=1}^{n}(f(X_{i})-\mathbb{E}\left[f(X_{i})\right])$, we have, for every $\epsilon>0$,

[Z𝔼[Z]ϵ]exp(2ϵ2n(ba)2).\mathds{P}\left[Z-\mathbb{E}\left[Z\right]\geq\epsilon\right]\leq\exp\left(-\frac{2\epsilon^{2}}{n(b-a)^{2}}\right).

5.2 Proof of Theorem 3.1

The proof of Theorem 3.1 is inspired by the proof of Theorem 3 in Linder, (2000).

Proof.

For any cSkc\in S_{k}, let Tn(c)=n2(W(c)Wn(c))T_{n}^{(c)}=\frac{n}{2}(W(c)-W_{n}(c)) =12i=1n(𝔼[minj=1,..,kXicj]minj=1,..,kXicj).=\frac{1}{2}\sum_{i=1}^{n}(\mathbb{E}\left[\min_{j=1,..,k}\left\|X_{i}-c_{j}\right\|\right]-\min_{j=1,..,k}\left\|X_{i}-c_{j}\right\|).
So

𝔼[supcSk(W(c)Wn(c))]=2n𝔼[supcSkTn(c)].\mathbb{E}\left[\sup_{c\in S_{k}}(W(c)-W_{n}(c))\right]=\frac{2}{n}\mathbb{E}\left[\sup_{c\in S_{k}}T_{n}^{(c)}\right].

Let us first demonstrate that the family of random variables {Tn(c):cSk}\left\{T_{n}^{(c)}:c\in S_{k}\right\} is subgaussian and sample continuous in a suitable metric. For any c,cSkc,c^{\prime}\in S_{k} define

p(c,c)=supxR{|minj=1,..,kxcjminj=1,..,kxcj|},p(c,c^{\prime})=\sup_{\|x\|\leq R}\left\{\big{|}\min_{j=1,..,k}\left\|x-c_{j}\right\|-\min_{j=1,..,k}\left\|x-c^{\prime}_{j}\right\|\hskip 2.84526pt\big{|}\right\},

and pn(c,c)=np(c,c)p_{n}(c,c^{\prime})=\sqrt{n}p(c,c^{\prime}), pnp_{n} is a metric on SkS_{k}. Since we have:

|Tn(c)Tn(c)|\displaystyle\big{|}T_{n}^{(c)}-T_{n}^{(c^{\prime})}\big{|} =n2|W(c)W(c)+Wn(c)Wn(c)|\displaystyle=\frac{n}{2}\big{|}W(c)-W(c^{\prime})+W_{n}(c^{\prime})-W_{n}(c)\big{|}
n2(|W(c)W(c)|+|Wn(c)Wn(c)|)\displaystyle\leq\frac{n}{2}\left(\big{|}W(c)-W(c^{\prime})\big{|}+\big{|}W_{n}(c^{\prime})-W_{n}(c)\big{|}\right)
np(c,c)=npn(c,c),\displaystyle\leq np(c,c^{\prime})=\sqrt{n}p_{n}(c,c^{\prime}),

the family {Tn(c):cSk}\left\{T_{n}^{(c)}:c\in S_{k}\right\} is then sample continuous in the metric pnp_{n}. To show that {Tn(c):cSk}\left\{T_{n}^{(c)}:c\in S_{k}\right\} is subgaussian in pnp_{n}, let

$Y_{i}=\frac{1}{2}\left(W(c)-\min_{j=1,..,k}\left\|X_{i}-c_{j}\right\|\right)-\frac{1}{2}\left(W(c^{\prime})-\min_{j=1,..,k}\left\|X_{i}-c^{\prime}_{j}\right\|\right).$

Then

Tn(c)Tn(c)=i=1nYi,T_{n}^{(c)}-T_{n}^{(c^{\prime})}=\sum_{i=1}^{n}Y_{i},

where YiY_{i} are independent, have zero mean, and

|Yi|1npn(c,c).\big{|}Y_{i}\big{|}\leq\frac{1}{\sqrt{n}}p_{n}(c,c^{\prime}).

By Lemma 5.1, we obtain

𝔼[eλ(Tn(c)Tn(c))]eλ2pn(c,c)22.\mathbb{E}\left[e^{\lambda(T_{n}^{(c)}-T_{n}^{(c^{\prime})})}\right]\leq e^{\frac{\lambda^{2}p_{n}(c,c^{\prime})^{2}}{2}}.

So, {Tn(c):cSk}\left\{T_{n}^{(c)}:c\in S_{k}\right\} is subgaussian in pnp_{n}. As the family {Tn(c):cSk}\left\{T_{n}^{(c)}:c\in S_{k}\right\} is subgaussian and sample continuous in pnp_{n}, Lemma 5.2 gives

𝔼[supcSkTn(c)]120diam(Sk)/2lnNpn(Sk,ϵ)𝑑ϵ.\mathbb{E}\left[\sup_{c\in S_{k}}T_{n}^{(c)}\right]\leq 12\int_{0}^{\text{diam}(S_{k})/2}\sqrt{\ln{N_{p_{n}}(S_{k},\epsilon)}}d\epsilon.

Since pn(c,c)=np(c,c)p_{n}(c,c^{\prime})=\sqrt{n}p(c,c^{\prime}), by Lemma 5.4 with the metric pnp_{n}, for all ϵ2Rn\epsilon\leq 2R\sqrt{n} we obtain

Npn(Sk,ϵ)(4Rnϵ)kd,N_{p_{n}}(S_{k},\epsilon)\leq\left(\frac{4R\sqrt{n}}{\epsilon}\right)^{kd},

and as diam(Sk):=sup{pn(c,c):c,cSk}=nsup{p(c,c):c,cSk}n2R\text{diam}(S_{k}):=\sup\left\{p_{n}(c,c^{\prime}):c,c^{\prime}\in S_{k}\right\}=\sqrt{n}\sup\left\{p(c,c^{\prime}):c,c^{\prime}\in S_{k}\right\}\leq\sqrt{n}2R,

𝔼[supcSkTn(c)]\displaystyle\mathbb{E}\left[\sup_{c\in S_{k}}T_{n}^{(c)}\right] 24n0nRln((4Rnϵ)kd)𝑑ϵ\displaystyle\leq\frac{24}{n}\int_{0}^{\sqrt{n}R}\sqrt{\ln\left(\left({\frac{4R\sqrt{n}}{\epsilon}}\right)^{kd}\right)}d\epsilon
=24kdn0nRln(4Rnϵ)𝑑ϵ.\displaystyle=\frac{24\sqrt{kd}}{n}\int_{0}^{\sqrt{n}R}\sqrt{\ln\left({\frac{4R\sqrt{n}}{\epsilon}}\right)}d\epsilon.

Considering x=ϵ4Rnx=\frac{\epsilon}{4R\sqrt{n}}, we obtain,

𝔼[supcSkTn(c)]24kdn0144Rnln(1x)𝑑x.\mathbb{E}\left[\sup_{c\in S_{k}}T_{n}^{(c)}\right]\leq\frac{24\sqrt{kd}}{n}\int_{0}^{\frac{1}{4}}4R\sqrt{n}\sqrt{\ln\left({\frac{1}{x}}\right)}dx.

Applying Jensen’s inequality to the concave function f(x)=xf(x)=\sqrt{x} :

𝔼[supcSkTn(c)]\displaystyle\mathbb{E}\left[\sup_{c\in S_{k}}T_{n}^{(c)}\right] 24Rkdn0144ln(1x)𝑑x\displaystyle\leq 24R\sqrt{\frac{kd}{n}}\sqrt{\int_{0}^{\frac{1}{4}}4\ln\left({\frac{1}{x}}\right)dx}
=24Rkdn1+ln4\displaystyle=24R\sqrt{\frac{kd}{n}}\sqrt{1+\ln{4}}
48Rkdn,\displaystyle\leq 48R\sqrt{\frac{kd}{n}},

where we used that lnx=xlnxx\int\ln{x}=x\ln{x}-x and ln43\ln{4}\leq 3.
Thus,

𝔼[supcSk{W(c)Wn(c)}]48Rkdn.\mathbb{E}\left[\sup_{c\in S_{k}}\left\{W(c)-W_{n}(c)\right\}\right]\leq 48R\sqrt{\frac{kd}{n}}.


5.3 Proof of Theorem 3.2

Theorem 3.2 is an adaptation of Theorem 8.1 in Massart, (2007) and Theorem 2.1 in Fischer, (2011).

Proof.

By definition of c~,\tilde{c}, for all k,1knk,1\leq k\leq n and ckSkc_{k}\in S_{k}, we have:

Wn(c~)+pen(k^)Wn(ck)+pen(k)W_{n}(\tilde{c})+\text{pen}(\hat{k})\leq W_{n}(c_{k})+\text{pen}(k)
W(c~)Wn(ck)+W(c~)Wn(c~)+pen(k)pen(k^).W(\tilde{c})\leq W_{n}(c_{k})+W(\tilde{c})-W_{n}(\tilde{c})+\text{pen}(k)-\text{pen}(\hat{k}). (4)

Consider nonnegative weights $\left\{x_{l}\right\}_{1\leq l\leq n}$ such that $\sum_{l=1}^{n}e^{-x_{l}}=\Sigma$ and let $z>0$.
Applying Lemma 5.5 with f(x)=1nminj=1,..,lxcjf(x)=\frac{1}{n}\min_{j=1,..,l}\left\|x-c_{j}\right\|, a=0a=0 and b=2Rnb=\frac{2R}{n} for all l,1lnl,1\leq l\leq n and all ϵl>0\epsilon_{l}>0

[supcSl(W(c)Wn(c))𝔼[supcSl(W(c)Wn(c))]ϵl]exp(nϵl22R2).\mathds{P}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))-\mathbb{E}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\right]\geq\epsilon_{l}\right]\leq\exp\left(-\frac{n\epsilon_{l}^{2}}{2R^{2}}\right).

It follows that for all l, taking ϵl=2Rxl+z2n\epsilon_{l}=2R\sqrt{\frac{x_{l}+z}{2n}}

[supcSl(W(c)Wn(c))𝔼[supcSl(W(c)Wn(c))]+2Rxl+z2n]exlz.\mathds{P}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\geq\mathbb{E}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\right]+2R\sqrt{\frac{x_{l}+z}{2n}}\right]\leq e^{-x_{l}-z}.

Thus, we have

[l=1nsupcSl(W(c)Wn(c))𝔼[supcSl(W(c)Wn(c))]+2Rxl+z2n]\mathds{P}\left[\bigcap_{l=1}^{n}\sup_{c\in S_{l}}(W(c)-W_{n}(c))\leq\mathbb{E}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\right]+2R\sqrt{\frac{x_{l}+z}{2n}}\right]
=1[l=1nsupcSl(W(c)Wn(c))𝔼[supcSl(W(c)Wn(c))]+2Rxl+z2n]1Σez.=1-\mathds{P}\left[\bigcup_{l=1}^{n}\sup_{c\in S_{l}}(W(c)-W_{n}(c))\geq\mathbb{E}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\right]+2R\sqrt{\frac{x_{l}+z}{2n}}\right]\geq 1-\Sigma e^{-z}.

Considering $Z_{l}=\mathbb{E}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\right]$, let us show that if, for all $1\leq l\leq n$,

supcSl(W(c)Wn(c))Zl+2Rxl+z2n\sup_{c\in S_{l}}(W(c)-W_{n}(c))\leq Z_{l}+2R\sqrt{\frac{x_{l}+z}{2n}}

then,

W(c~)Wn(ck)+2Rz2n+pen(k).W(\tilde{c})\leq W_{n}(c_{k})+2R\sqrt{\frac{z}{2n}}+\text{pen}(k).

We suppose that we have

supcSl(W(c)Wn(c))Zl+2Rxl+z2n1ln.\sup_{c\in S_{l}}(W(c)-W_{n}(c))\leq Z_{l}+2R\sqrt{\frac{x_{l}+z}{2n}}\hskip 28.45274pt\forall 1\leq l\leq n. (5)

Particularly it’s true for l=k^l=\hat{k}, we have also W(c~)Wn(c~)supcSk^(W(c)Wn(c))W(\tilde{c})-W_{n}(\tilde{c})\leq\sup_{c\in S_{\hat{k}}}(W(c)-W_{n}(c)) and a+ba+b\sqrt{a+b}\leq\sqrt{a}+\sqrt{b} a,b0\forall a,b\geq 0. By combining this result with (2) and (3), we get

W(c~)\displaystyle W(\tilde{c}) Wn(ck)+supcSk^(W(c)Wn(c))+pen(k)pen(k^)\displaystyle\leq W_{n}(c_{k})+\sup_{c\in S_{\hat{k}}}(W(c)-W_{n}(c))+\text{pen}(k)-\text{pen}(\hat{k})
Wn(ck)+Zk^+2Rxk^2n+2Rz2n+pen(k)pen(k^).\displaystyle\leq W_{n}(c_{k})+Z_{\hat{k}}+2R\sqrt{\frac{x_{\hat{k}}}{2n}}+2R\sqrt{\frac{z}{2n}}+\text{pen}(k)-\text{pen}(\hat{k}).

With the help of Theorem 3.1, we have $Z_{k}\leq 48R\sqrt{\frac{kd}{n}}$ for all $1\leq k\leq n$, and if $\text{pen}(k)\geq R\left(48\sqrt{\frac{kd}{n}}+2\sqrt{\frac{x_{k}}{2n}}\right)$, then

W(c~)\displaystyle W(\tilde{c}) Wn(ck)+48Rk^dn+2Rxk^2n+2Rz2n+pen(k)R(48k^dn+2xk^2n)\displaystyle\leq W_{n}(c_{k})+48R\sqrt{\frac{\hat{k}d}{n}}+2R\sqrt{\frac{x_{\hat{k}}}{2n}}+2R\sqrt{\frac{z}{2n}}+\text{pen}(k)-R\left(48\sqrt{\frac{\hat{k}d}{n}}+2\sqrt{\frac{x_{\hat{k}}}{2n}}\right)
=Wn(ck)+2Rz2n+pen(k),\displaystyle=W_{n}(c_{k})+2R\sqrt{\frac{z}{2n}}+\text{pen}(k),

which shows that

W(c~)Wn(ck)+2Rz2n+pen(k).W(\tilde{c})\leq W_{n}(c_{k})+2R\sqrt{\frac{z}{2n}}+\text{pen}(k).

Thus

[W(c~)Wn(ck)+2Rz2n+pen(k)]\mathds{P}\left[W(\tilde{c})\leq W_{n}(c_{k})+2R\sqrt{\frac{z}{2n}}+\text{pen}(k)\right]
$\geq\mathds{P}\left[\bigcap_{l=1}^{n}\left\{\sup_{c\in S_{l}}(W(c)-W_{n}(c))\leq\mathbb{E}\left[\sup_{c\in S_{l}}(W(c)-W_{n}(c))\right]+2R\sqrt{\frac{x_{l}+z}{2n}}\right\}\right]\geq 1-\Sigma e^{-z}.$

We get

[W(c~)Wn(ck)pen(k)2Rz2n]Σez\mathds{P}\left[W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k)\geq 2R\sqrt{\frac{z}{2n}}\right]\leq\Sigma e^{-z}
[2n2R(W(c~)Wn(ck)pen(k))z]Σez\mathds{P}\left[\frac{\sqrt{2n}}{2R}(W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k))\geq\sqrt{z}\right]\leq\Sigma e^{-z}

or, setting z=u2z=u^{2},

[2n2R(W(c~)Wn(ck)pen(k))u]Σeu2\mathds{P}\left[\frac{\sqrt{2n}}{2R}(W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k))\geq u\right]\leq\Sigma e^{-u^{2}}
𝔼[2n2R(W(c~)Wn(ck)pen(k))+]\displaystyle\mathbb{E}\left[\frac{\sqrt{2n}}{2R}(W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k))_{+}\right] =0[2n2R(W(c~)Wn(ck)pen(k))+u]𝑑u\displaystyle=\int_{0}^{\infty}\mathds{P}\left[\frac{\sqrt{2n}}{2R}(W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k))_{+}\geq u\right]du
0[2n2R(W(c~)Wn(ck)pen(k))u]𝑑u\displaystyle\leq\int_{0}^{\infty}\mathds{P}\left[\frac{\sqrt{2n}}{2R}(W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k))\geq u\right]du
Σ0eu2𝑑u=Σπ2.\displaystyle\leq\Sigma\int_{0}^{\infty}e^{-u^{2}}du=\Sigma\frac{\sqrt{\pi}}{2}.

We get

𝔼[(W(c~)Wn(ck)pen(k))+]ΣRπ2n.\mathbb{E}\left[(W(\tilde{c})-W_{n}(c_{k})-\text{pen}(k))_{+}\right]\leq\Sigma R\sqrt{\frac{\pi}{2n}}.

Since 𝔼[Wn(ck)]=W(ck)\mathbb{E}\left[W_{n}(c_{k})\right]=W(c_{k}), we have :

𝔼[W(c~)]W(ck)+pen(k)+ΣRπ2n.\mathbb{E}\left[W(\tilde{c})\right]\leq W(c_{k})+\text{pen}(k)+\Sigma R\sqrt{\frac{\pi}{2n}}.
Taking the infimum over $1\leq k\leq n$ and $c_{k}\in S_{k}$ finally gives

$\mathbb{E}\left[W(\tilde{c})\right]\leq\inf_{1\leq k\leq n,\,c_{k}\in S_{k}}\left\{W(c_{k})+\text{pen}(k)\right\}+\Sigma R\sqrt{\frac{\pi}{2n}}.$

5.4 Proof of Proposition 3.1

Proof.

If $k\leq 2^{d}$, we have $4Rk^{-1/d}\geq 4R\cdot 2^{-1}=2R$. Thus, $W(c)\leq 2R\leq 4Rk^{-1/d}$ for any vector quantizer with codebook $c\subset\overline{\mathcal{B}}(0,R)$.
Otherwise, let $\epsilon=4Rk^{-1/d}$. Then $\epsilon\leq 2R$ and, by Lemma 5.3, there exists a set of points $\left\{y_{1},...,y_{k}\right\}\subset S(0,R)$ that $\epsilon$-covers $S(0,R)$. A quantizer with the codebook $c=\left\{y_{1},...,y_{k}\right\}$ verifies:

W(c)ϵ4Rk1/d.W(c)\leq\epsilon\leq 4Rk^{-1/d}.

That concludes

infcSkW(c)4Rk1/d.\inf_{c\in S_{k}}W(c)\leq 4Rk^{-1/d}.

References

  • Arlot and Massart, (2009) Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least-squares regression. Journal of Machine learning research, 10(2).
  • Bartlett et al., (1998) Bartlett, P. L., Linder, T., and Lugosi, G. (1998). The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information theory, 44(5):1802–1813.
  • Baudry et al., (2012) Baudry, J.-P., Maugis, C., and Michel, B. (2012). Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470.
  • Bello et al., (2023) Bello, D. Z., Valk, M., and Cybis, G. B. (2023). Towards u-statistics clustering inference for multiple groups. Journal of Statistical Computation and Simulation, pages 1–19.
  • Berkhin, (2006) Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer.
  • Bezdek, (2013) Bezdek, J. C. (2013). Pattern recognition with fuzzy objective function algorithms. Springer Science & Business Media.
  • Birgé and Massart, (2007) Birgé, L. and Massart, P. (2007). Minimal penalties for gaussian model selection. Probability theory and related fields, 138(1):33–73.
  • Brault et al., (2011) Brault, V., Baudry, J.-P., Maugis, C., Michel, B., and Brault, M. V. (2011). Package ‘capushe’.
  • Cardot et al., (2012) Cardot, H., Cénac, P., and Monnez, J.-M. (2012). A fast and recursive algorithm for clustering large datasets with k-medians. Computational Statistics & Data Analysis, 56(6):1434–1449.
  • Cardot et al., (2013) Cardot, H., Cénac, P., and Zitt, P.-A. (2013). Efficient and fast estimation of the geometric median in hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19(1):18–43.
  • Cardot and Godichon-Baggioni, (2017) Cardot, H. and Godichon-Baggioni, A. (2017). Fast estimation of the median covariation matrix with application to online robust principal components analysis. Test, 26(3):461–480.
  • Cesa-Bianchi and Lugosi, (1999) Cesa-Bianchi, N. and Lugosi, G. (1999). Minimax regret under log loss for general classes of experts. In Proceedings of the Twelfth annual conference on computational learning theory, pages 12–18.
  • Duflo, (1997) Duflo, M. (1997). Random iterative models, stochastic modelling and applied probability, vol. 34.
  • Dunn, (1973) Dunn, J. C. (1973). A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters.
  • Fischer, (2011) Fischer, A. (2011). On the number of groups in clustering. Statistics & Probability Letters, 81(12):1771–1781.
  • Forgy, (1965) Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. biometrics, 21:768–769.
  • Gagolewski et al., (2016) Gagolewski, M., Bartoszuk, M., and Cena, A. (2016). Genie: A new, fast, and outlier-resistant hierarchical clustering algorithm. Information Sciences, 363:8–23.
  • Gersho and Gray, (2012) Gersho, A. and Gray, R. M. (2012). Vector quantization and signal compression, volume 159. Springer Science & Business Media.
  • Haldane, (1948) Haldane, J. (1948). Note on the median of a multivariate distribution. Biometrika, 35(3-4):414–417.
  • Hoeffding, (1994) Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pages 409–426. Springer.
  • Huang et al., (2015) Huang, H., Liu, Y., Yuan, M., and Marron, J. (2015). Statistical significance of clustering using soft thresholding. Journal of Computational and Graphical Statistics, 24(4):975–993.
  • Hubert and Arabie, (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of classification, 2(1):193–218.
  • Jain and Dubes, (1988) Jain, A. K. and Dubes, R. C. (1988). Algorithms for clustering data. Prentice-Hall, Inc.
  • Jain et al., (1999) Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323.
  • Kaufman and Rousseeuw, (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis. John Wiley & Sons.
  • Kemperman, (1987) Kemperman, J. (1987). The median of a finite measure on a banach space. Statistical data analysis based on the L1-norm and related methods (Neuchâtel, 1987), pages 217–230.
  • Linder, (2000) Linder, T. (2000). On the training distortion of vector quantizers. IEEE Transactions on Information Theory, 46(4):1617–1623.
  • Liu et al., (2008) Liu, Y., Hayes, D. N., Nobel, A., and Marron, J. S. (2008). Statistical significance of clustering for high-dimension, low–sample size data. Journal of the American Statistical Association, 103(483):1281–1293.
  • MacQueen, (1967) MacQueen, J. (1967). Classification and analysis of multivariate observations. In 5th Berkeley Symp. Math. Statist. Probability, pages 281–297.
  • Massart, (2007) Massart, P. (2007). Concentration inequalities and model selection: Ecole d’Eté de Probabilités de Saint-Flour XXXIII-2003. Springer.
  • McDiarmid et al., (1989) McDiarmid, C. et al. (1989). On the method of bounded differences. Surveys in combinatorics, 141(1):148–188.
  • Mirkin, (1996) Mirkin, B. (1996). Mathematical classification and clustering, volume 11. Springer Science & Business Media.
  • Polyak and Juditsky, (1992) Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855.
  • Rand, (1971) Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association, 66(336):846–850.
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, pages 400–407.
  • Ruppert, (1985) Ruppert, D. (1985). A newton-raphson version of the multivariate robbins-monro procedure. The Annals of Statistics, 13(1):236–245.
  • Spath, (1980) Spath, H. (1980). Cluster analysis algorithms for data reduction and classification of objects. Ellis Horwood Chichester.
  • Tibshirani et al., (2001) Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):411–423.
  • Vardi and Zhang, (2000) Vardi, Y. and Zhang, C.-H. (2000). The multivariate l 1-median and associated data depth. Proceedings of the National Academy of Sciences, 97(4):1423–1426.
  • Weiszfeld, (1937) Weiszfeld, E. (1937). Sur le point pour lequel la somme des distances de n points donnés est minimum. Tohoku Mathematical Journal, First Series, 43:355–386.