Uniform Concentration Bounds toward a Unified Framework for Robust Clustering
Abstract
Recent advances in center-based clustering continue to improve upon the drawbacks of Lloyd's celebrated $k$-means algorithm more than half a century after its introduction. Various methods seek to address poor local minima, sensitivity to outliers, and data that are not well-suited to Euclidean measures of fit, but many are supported largely empirically. Moreover, combining such approaches in a piecemeal manner can result in ad hoc methods, and the limited theoretical results supporting each individual contribution may no longer hold. Toward addressing these issues in a principled way, this paper proposes a cohesive robust framework for center-based clustering under a general class of dissimilarity measures. In particular, we present a rigorous theoretical treatment within a Median-of-Means (MoM) estimation framework, showing that it subsumes several popular $k$-means variants. In addition to unifying existing methods, we derive uniform concentration bounds that complete their analyses, and bridge these results to the MoM framework via Dudley's chaining arguments. Importantly, we require no assumptions on the distribution of the outlying observations, nor on the relative number of observations $n$ to features $p$. We establish strong consistency and non-asymptotic error rates under mild conditions, surpassing the best-known results in the literature. The methods are validated empirically and thoroughly on real and synthetic datasets.
1 Introduction
Clustering is a fundamental task in unsupervised learning, which seeks to discover naturally occurring groups within a dataset. Among a plethora of algorithms in the literature, center-based methods remain widely popular, with $k$-means (MacQueen, 1967; Lloyd, 1982) as the most prominent example. Given data points $X_1,\dots,X_n \in \mathbb{R}^p$, $k$-means seeks to partition the data into $k$ mutually exclusive and exhaustive groups by minimizing the within-cluster variance. Representing the cluster centroids by $\Theta = \{\theta_1,\dots,\theta_k\}$, the $k$-means problem is formulated as the minimization of the objective function
$$f_n(\Theta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\ \min_{1\le j\le k}\, d(X_i,\theta_j),$$
where $d(\cdot,\cdot)$ is a dissimilarity measure. Taking $d(x,y)=\|x-y\|_2^2$, the squared Euclidean distance, yields the classical $k$-means formulation.
Clustering via $k$-means is well-documented to be sensitive to initialization (Vassilvitskii and Arthur, 2006), prone to getting stuck in poor local optima (Zhang et al., 1999; Xu and Lange, 2019; Chakraborty et al., 2020), and fragile against linearly non-separable clusters (Ng et al., 2001) or in the presence of even a single outlying observation (Klochkov et al., 2020). Researchers continue to tackle these drawbacks of $k$-means while preserving its simplicity and interpretability (Banerjee et al., 2005; Aloise et al., 2009; Teboulle, 2007; Ostrovsky et al., 2013; Zhang et al., 2020; Telgarsky and Dasgupta, 2012). Although often successful in practice, few of these approaches are grounded in rigorous theory or provide finite-sample statistical guarantees. The available consistency results are mostly asymptotic in nature (Chakraborty and Das, 2020; Chakraborty et al., 2020; Terada, 2014, 2015), lacking convergence rates or operating under restrictive assumptions on the relation between $n$ and $p$ (Paul et al., 2021a). For example, the recent large sample analysis (Paul et al., 2021a) of the hard Bregman $k$-means algorithm (Banerjee et al., 2005) obtains only an asymptotic error rate, under restrictive assumptions on the relation between $n$ and $p$. The approach by Teboulle (2007) provides a unified framework from an optimization perspective, while Balcan et al. (2008) and Telgarsky and Dasgupta (2012) focus on different divergence-based methods to better understand the underlying feature space.
The presence of outliers only further complicates matters. Outlying data are common in real applications, but many of the aforementioned approaches are fragile to deviations from the assumed data generating mechanism. On the other hand, recent work on robust clustering methods (Deshpande et al., 2020; Fischer et al., 2020; Klochkov et al., 2020) does not integrate the practical advances surveyed above, and similarly lacks finite-sample guarantees. To bridge this gap, the Median of Means (MoM) literature provides a promising and attractive framework to robustify center-based clustering methods against outliers. MoM estimators are not only insensitive to outliers, but are also equipped with exponential concentration results under the mild condition of finite variance (Lugosi and Mendelson, 2019; Lecué and Lerasle, 2020; Bartlett et al., 2002; Lerasle, 2019; Laforgue et al., 2019). Recently, near-optimal results for mean estimation (Minsker, 2018), classification (Lecué et al., 2020), regression (Mathieu and Minsker, 2021; Lugosi and Mendelson, 2019), clustering (Klochkov et al., 2020; Brunet-Saumard et al., 2022), bandits (Bubeck et al., 2013) and optimal transport (Staerman et al., 2021) have been established from this perspective.
Under the MoM lens, we propose a unified framework for robust center-based clustering. Our treatment considers a family of Bregman loss functions not restricted to the classical squared Euclidean loss. We derive exact sample error bounds for a general class of algorithms under mild regularity assumptions that apply to popular existing approaches, which we show to be special cases. The proposed framework allows the data to be divided into two categories: the set of inliers and the set of outliers. The inliers are assumed to be independent and identically distributed (i.i.d.), while absolutely no assumption is required on the outliers, allowing them to be unboundedly large, dependent on each other, sampled from a heavy-tailed distribution, and so on. Our contributions within the MoM framework make use of Rademacher complexities (Bartlett and Mendelson, 2002; Bartlett et al., 2005) and symmetrization arguments (Vapnik, 2013), powerful tools that often find use in the empirical process literature but, in our view, are underexplored in the context of robust clustering.
The paper is organized as follows. In Section 2, we identify a general centroid-based clustering framework which encompasses $k$-means (MacQueen, 1967), $k$-harmonic means (Zhang et al., 1999), and power $k$-means (Xu and Lange, 2019) as special cases, to name a few. We show how this framework is made robust via Median of Means estimation, yielding an array of center-based robust clustering methods. Within this framework, we derive uniform deviation bounds and concentration inequalities under standard regularity conditions by bounding the Rademacher complexity by metric entropy via Dudley's chaining argument in Section 3. The analysis newly reveals convergence rates for popularly used clustering methods such as $k$-harmonic means and power $k$-means, matching the known rate results for $k$-means, and elegantly carries over to their MoM counterparts. We then implement and empirically assess the resulting algorithms through simulated and real data experiments. In particular, we find that power $k$-means (Xu and Lange, 2019) under the MoM paradigm outperforms the state-of-the-art in the presence of outliers.
1.1 Related theoretical analyses of clustering
In seminal work, Pollard (1981) proved the strong consistency of $k$-means under a finite second moment assumption, spurring the large sample analysis of unsupervised clustering methods. This result has been extended to separable Hilbert spaces (Biau et al., 2008; Levrard, 2015) and to related algorithms (Chakraborty and Das, 2020; Terada, 2014, 2015), but these works do not provide guarantees on the number of samples required for the excess risk to fall below a given threshold. Towards finding probabilistic error bounds, subsequent research derived uniform concentration results for $k$-means and its variants (Telgarsky and Dasgupta, 2012), sub-Gaussian distortion bounds for the $k$-medians problem (Brownlees et al., 2015), and a bound on $k$-means with Bregman divergences (Paul et al., 2021b). More recently, concentration inequalities for $k$-means under the MoM paradigm have been established (Klochkov et al., 2020; Brunet-Saumard et al., 2022) under the restriction that the sample cluster centroids (see Section 3) are assumed to be bounded. This paper shows how a number of center-based clustering methods can be brought under the same umbrella and can be robustified using a general-purpose scheme. The theoretical analyses of this broad spectrum of methods are conducted via Dudley's chaining arguments and through the aid of Rademacher complexity-based uniform concentration bounds. This approach enables us to replace assumptions on the sample cluster centroids by a bounded support assumption on the (inlying) data points, which yields an intuitive way of ensuring the boundedness of the cluster centroids through the obtuse angle property of Bregman divergences (Lemma 3.1 below). In contrast to the prior results, our analyses are not asymptotic in nature, so that the derived bounds hold for all values of the model parameters.
2 Problem Setting and Proposed Method
We consider the problem of partitioning a set of $n$ data points $X_1,\dots,X_n \in \mathbb{R}^p$ into $k$ mutually exclusive clusters. In a center-based clustering framework, we represent the $j$-th cluster by its centroid $\theta_j \in \mathbb{R}^p$ for each $j \in \{1,\dots,k\}$. To quantify the notion of "closeness", we allow the dissimilarity measure to be any Bregman divergence. Recall that any differentiable, convex function $\phi : \mathbb{R}^p \to \mathbb{R}$ generates the Bregman divergence $d_\phi : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}_{\ge 0}$ ($\mathbb{R}_{\ge 0}$ denoting the set of non-negative reals) defined as
$$d_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y \rangle.$$
For instance, $\phi(x) = \|x\|_2^2$ generates the squared Euclidean distance. Without loss of generality, one may normalize $\phi$ (e.g. so that $\phi(\mathbf{0}) = 0$). In this paradigm, clustering is achieved by minimizing the objective
$$f_n(\Theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \Phi\big(d_\phi(X_i,\theta_1),\, \dots,\, d_\phi(X_i,\theta_k)\big). \qquad (1)$$
Here $\Phi : \mathbb{R}_{\ge 0}^{k} \to \mathbb{R}_{\ge 0}$ is a component-wise non-decreasing function (such as a generalized mean) of the dissimilarities which satisfies $\Phi(\mathbf{0}) = 0$. The number of clusters $k$ is a hyperparameter specified by the user, and we will additionally assume that $\Phi$ is Lipschitz. For intuition, we begin by showing how this setup includes several popular clustering methods.
Examples:
Suppose $d_\phi(x,y) = \|x-y\|_2^2$ and $\Phi(x_1,\dots,x_k) = k\big(\sum_{j=1}^{k} x_j^{-1}\big)^{-1}$, the harmonic mean. Then the objective (1) becomes $\frac{1}{n}\sum_{i=1}^{n} k\big(\sum_{j=1}^{k} \|X_i - \theta_j\|_2^{-2}\big)^{-1}$, which is the objective function of $k$-harmonic means clustering (Zhang et al., 1999). Now consider other generalized means: take $\Phi = M_s$, where we denote the power mean $M_s(x_1,\dots,x_k) = \big(\frac{1}{k}\sum_{j=1}^{k} x_j^{s}\big)^{1/s}$ for $s < 1$, $s \neq 0$. Then objective (1) coincides with the recent power $k$-means method (Xu and Lange, 2019), which anneals $s \to -\infty$. When $\Phi(x_1,\dots,x_k) = \min_{1\le j\le k} x_j$, (1) recovers the Bregman hard clustering proposed in Banerjee et al. (2005) for any valid $d_\phi$, while the special case of $d_\phi(x,y) = \|x-y\|_2^2$ yields the familiar Euclidean $k$-means problem (MacQueen, 1967).
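To make the role of $\Phi$ concrete, the following minimal Python sketch evaluates the empirical objective (1) for the squared Euclidean divergence under the three choices of $\Phi$ above. The function names and the small numerical safeguard are ours, not the paper's.

```python
import numpy as np

def sq_euclidean_divergences(X, Theta):
    """d_phi(X_i, theta_j) for phi = ||.||_2^2: an (n, k) matrix of squared distances."""
    return ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=-1)

def phi_min(D):                 # k-means / Bregman hard clustering
    return D.min(axis=1)

def phi_harmonic(D):            # k-harmonic means: harmonic mean of the k divergences
    k = D.shape[1]
    return k / (1.0 / np.maximum(D, 1e-12)).sum(axis=1)

def phi_power(D, s=-3.0):       # power k-means: power mean M_s with s < 1, s != 0
    D = np.maximum(D, 1e-12)    # guard against zero divergences when s < 0
    return ((D ** s).mean(axis=1)) ** (1.0 / s)

def clustering_objective(X, Theta, phi=phi_min):
    """Empirical objective (1): average of Phi applied row-wise to the divergences."""
    return phi(sq_euclidean_divergences(X, Theta)).mean()
```

For example, `clustering_objective(X, Theta, phi=lambda D: phi_power(D, s=-1.0))` coincides with the $k$-harmonic means objective, since $M_{-1}$ is exactly the harmonic mean.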
In what follows, we derive concentration bounds that establish new theoretical guarantees such as consistency and convergence rates for clustering algorithms in this framework. These analyses lead us to a unified, robust framework by embedding this class of methods within the Median of Means (MoM) estimation paradigm. Via elegant connections between the properties of MoM estimators and Vapnik-Chervonenkis (VC) theory, our MoM estimators too will inherit uniform concentration inequalities from the preceding analysis, extending convergence guarantees to the robust setting.
Median of Means
Instead of directly minimizing the empirical risk (1), MoM begins by partitioning the data into $L$ disjoint sets $B_1,\dots,B_L$, each containing exactly $\lfloor n/L \rfloor$ elements (discarding a few observations when $n$ is not divisible by $L$). The partitions can be assigned uniformly at random, or can be shuffled throughout the algorithm (Lecué et al., 2020). MoM then entails a robust version of the estimator defined under (1) by instead minimizing the following objective with respect to $\Theta = \{\theta_1,\dots,\theta_k\}$:
$$\mathrm{Median}_{1 \le \ell \le L}\ \left(\frac{1}{|B_\ell|}\sum_{i \in B_\ell} \Phi\big(d_\phi(X_i,\theta_1),\,\dots,\,d_\phi(X_i,\theta_k)\big)\right). \qquad (2)$$
Intuitively, MoM estimators are more robust than their ERM counterparts since under mild conditions, only a subset of the partitions is contaminated by outliers while others are outlier-free. Taking the median over partitions negates the influence of these spurious partitions, thus reducing the effect of outliers. Formal analysis of robustness via breakdown points is also available for MoM estimators (Lecué and Lerasle,, 2019; Rodriguez and Valdora,, 2019); we will make use of the nice concentration properties of MoM estimators in Section 3.4.
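As a minimal illustration of how (2) can be evaluated, the sketch below uses our own conventions, with $\Phi = \min$ and the squared Euclidean divergence; blocks are assigned uniformly at random as described above.

```python
import numpy as np

def mom_objective(X, Theta, L, rng=None):
    """Median-of-Means objective (2): median over L blocks of block-averaged losses."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    b = n // L                                               # common block size
    idx = rng.permutation(n)[: b * L].reshape(L, b)          # discard leftovers, form L blocks
    D = ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(-1)   # (n, k) squared distances
    losses = D.min(axis=1)                                   # per-point loss with Phi = min
    return np.median(losses[idx].mean(axis=1))               # median of the L block means
```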
Furthermore, optimizing (2) is made tractable via gradient-based methods. We advocate the Adagrad algorithm (Duchi et al., 2011; Goodfellow et al., 2016), whose updates take the standard adaptive form
$$\Theta^{(t+1)} \;=\; \Theta^{(t)} - \frac{\eta}{\sqrt{\epsilon + \sum_{s=1}^{t} g^{(s)} \odot g^{(s)}}} \odot g^{(t)},$$
for hyperparameter $\epsilon > 0$, learning rate $\eta$, and $g^{(t)}$ denoting a subgradient of the objective (2) at $\Theta^{(t)}$. That is, if $B_{\ell_t}$ denotes the median partition at step $t$, then
$$g^{(t)} \in \partial_{\Theta}\left(\frac{1}{|B_{\ell_t}|}\sum_{i \in B_{\ell_t}} \Phi\big(d_\phi(X_i,\theta_1),\,\dots,\,d_\phi(X_i,\theta_k)\big)\right)\Bigg|_{\Theta = \Theta^{(t)}}.$$
For any differentiable $\Phi$, the per-observation gradient is available via the chain rule:
$$\frac{\partial}{\partial \theta_j}\, \Phi\big(d_\phi(X_i,\theta_1),\dots,d_\phi(X_i,\theta_k)\big) \;=\; \frac{\partial \Phi}{\partial x_j}\big(d_\phi(X_i,\theta_1),\dots,d_\phi(X_i,\theta_k)\big)\; \nabla_{\theta_j}\, d_\phi(X_i,\theta_j).$$
As a concrete illustration of the general formula, consider the power $k$-means objective with $d(x,y) = \|x-y\|_2^2$: upon partial differentiation of the per-observation loss, we obtain
$$\frac{\partial}{\partial \theta_j}\, M_s\big(d(X_i,\theta_1),\dots,d(X_i,\theta_k)\big) \;=\; \frac{2}{k}\left(\frac{1}{k}\sum_{l=1}^{k} d(X_i,\theta_l)^{s}\right)^{\frac{1}{s}-1} d(X_i,\theta_j)^{\,s-1}\,\big(\theta_j - X_i\big).$$
As another example, the classic $k$-means objective ($\Phi = \min$) requires a sub-gradient, one choice of which is
$$2\,\mathbb{1}\{c_i = j\}\,\big(\theta_j - X_i\big) \;\in\; \partial_{\theta_j} \min_{1 \le l \le k} d(X_i,\theta_l),$$
where $c_i \in \arg\min_{1 \le l \le k} d(X_i,\theta_l)$ and $\mathbb{1}\{\cdot\}$ denotes the indicator function.
A pseudocode summary appears in Algorithm 1. This method tends to find clusters efficiently: for instance, we incur $O(nkp)$ per-iteration complexity when applied to both the $k$-means and power $k$-means instances, which matches that of their original algorithms without robustifying via MoM (MacQueen, 1967; Lloyd, 1982; Xu and Lange, 2019). Of course, here we update $\Theta$ by an adaptive gradient step rather than by a closed-form expression.
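To complement the pseudocode, the following Python sketch performs one Adagrad step on the median block for MoM power $k$-means with squared Euclidean divergence. It is our own illustration under the update described above, not a transcription of Algorithm 1.

```python
import numpy as np

def mom_power_kmeans_step(X, Theta, G, blocks, s=-1.0, eta=0.1, eps=1e-8):
    """One Adagrad step on the median block of the MoM power k-means objective.

    X: (n, p) data; Theta: (k, p) centroids; G: (k, p) running sums of squared gradients;
    blocks: list of index arrays partitioning the data; s < 1, s != 0: power-mean exponent.
    """
    k = Theta.shape[0]
    diff = X[:, None, :] - Theta[None, :, :]                  # (n, k, p)
    D = np.maximum((diff ** 2).sum(-1), 1e-12)                # (n, k) squared distances
    inner = (D ** s).mean(axis=1)                             # (n,) inner power-mean sums
    losses = inner ** (1.0 / s)                               # per-point power-mean loss
    block_means = np.array([losses[b].mean() for b in blocks])
    med = blocks[np.argsort(block_means)[len(blocks) // 2]]   # indices of a median block
    # chain rule: d/d theta_j of M_s = (2/k) * inner^(1/s - 1) * d_j^(s - 1) * (theta_j - X_i)
    w = (inner[med] ** (1.0 / s - 1.0))[:, None] * (D[med] ** (s - 1.0)) / k
    grad = (2.0 / len(med)) * np.einsum('ij,ijp->jp', w, -diff[med])
    G = G + grad ** 2                                         # Adagrad accumulator
    return Theta - eta * grad / np.sqrt(G + eps), G
```

In practice the blocks may be reshuffled between iterations and the exponent $s$ annealed toward $-\infty$, as in power $k$-means.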
As emphasized by an anonymous reviewer, one should note that the algorithms under the proposed framework may converge to sub-optimal local solutions under the proposed optimization scheme. As is the case for $k$-means and its variants, this possibility arises due to the non-convexity of the objective functions (1) and (2). A complete theoretical understanding of such methods from an optimization perspective is not yet fully developed, and global results are in general notoriously difficult to obtain. Having said that, it is worth noting that recent empirical analyses show that techniques such as annealing (Xu and Lange, 2019; Chakraborty et al., 2020) may be effective in circumventing this difficulty. Indeed, our experimental analysis (see Section 4) suggests that a robust version of power $k$-means using this annealing technique best overcomes this difficulty even in the presence of outliers.
3 Theoretical Analysis
Here we analyze properties of the proposed framework (1), with complete details of all proofs in the Appendix. Denoting by $\mathcal{P}\big([-M,M]^p\big)$ the set of all probability measures with support contained in $[-M,M]^p$, we first make the standard assumption that the data are i.i.d. with bounded components (Ben-David, 2007; Chakraborty et al., 2020; Paul et al., 2021a).
A 1.
The data $X_1,\dots,X_n$ are i.i.d. from a distribution $P$, and there exists $M > 0$ such that $P\big(X_1 \in [-M,M]^p\big) = 1$.
Let $P_n$ be the empirical distribution based on the data $X_1,\dots,X_n$. That is, for any Borel set $A$, $P_n(A) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\{X_i \in A\}$. For notational simplicity, we write $f_n(\Theta)$ for the empirical objective in (1) and $f(\Theta)$ for its population counterpart, obtained by replacing $P_n$ with $P$. Appealing to the Strong Law of Large Numbers (SLLN) (Athreya and Lahiri, 2006), $f_n(\Theta) \to f(\Theta)$ almost surely for each fixed $\Theta$. Suppose $\hat{\Theta}_n$ is a (global) minimizer of (1) and $\Theta^*$ is the global minimizer of $f$. Since the functions $f_n$ and $f$ are close to each other as $n$ becomes large, we can expect that their respective minimizers, $\hat{\Theta}_n$ and $\Theta^*$, are also close to one another. To show that $\hat{\Theta}_n$ converges to $\Theta^*$ as $n \to \infty$, we consider bounding the uniform deviation $\sup_{\Theta}|f_n(\Theta) - f(\Theta)|$. Towards establishing such bounds, we will posit two regularity assumptions on $\Phi$ and $\phi$, beginning with a Lipschitz condition on $\Phi$.
A 2.
There exists $L_\Phi > 0$ such that for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}_{\ge 0}^{k}$, we have $|\Phi(\mathbf{x}) - \Phi(\mathbf{y})| \le L_\Phi\, \|\mathbf{x} - \mathbf{y}\|$.
We also assume a weaker form of the standard condition that the gradient $\nabla\phi$ is Lipschitz (Telgarsky and Dasgupta, 2013); unlike their work, note that we do not additionally require strong convexity of $\phi$.
A 3.
There exists $\alpha > 0$ such that $\|\nabla\phi(x) - \nabla\phi(y)\|_2 \le \alpha \|x - y\|_2$ for all $x, y \in [-M,M]^p$.
These conditions are mild, and can be seen to hold for all of the aforementioned popular special cases. For instance, taking $\phi(x) = \|x\|_2^2$ yields the squared Euclidean distance, and we see that $\nabla\phi(x) = 2x$, so that A3 is satisfied with constant $\alpha = 2$. Now, let $\Phi(x_1,\dots,x_k) = \min_{1\le j\le k} x_j$ as in classical $k$-means. For any vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}_{\ge 0}^{k}$,
$$\Big|\min_{1\le j\le k} x_j - \min_{1\le j\le k} y_j\Big| \;\le\; \max_{1\le j\le k} |x_j - y_j| \;\le\; \|\mathbf{x} - \mathbf{y}\|.$$
Thus, $\min$ is clearly non-negative and componentwise non-decreasing, and satisfies A2. Similarly, if we take $\Phi = M_s$, the conditions are met for power $k$-means: it is again non-negative and non-decreasing in its components, and satisfies A2 due to Beliakov et al. (2010).
3.1 Bounds on $\hat{\Theta}_n$ and $\Theta^*$
Toward proving that $\hat{\Theta}_n$ converges to $\Theta^*$, we will need to show that the centroids of both $\hat{\Theta}_n$ and $\Theta^*$ lie in $[-M,M]^p$. To this end, we first establish the obtuse angle property for Bregman divergences.
Lemma 3.1.
Let $C \subseteq \mathbb{R}^p$ be a convex set, let $x \in \mathbb{R}^p$, and suppose $x^* = \arg\min_{c \in C} d_\phi(c, x)$ is the projection of $x$ onto $C$ with respect to the Bregman divergence $d_\phi$ (assuming it exists). Then, for all $c \in C$,
$$d_\phi(c, x) \;\ge\; d_\phi(c, x^*) + d_\phi(x^*, x).$$
We next show that to minimize $f$ or $f_n$, it is enough to restrict the search for each centroid to $[-M,M]^p$.
Lemma 3.2.
Let A2 hold and let $d_\phi$ be a Bregman divergence. Then for any set of centroids $\Theta$, there exists a set of centroids $\Theta'$ all lying in $[-M,M]^p$ such that the objective value at $\Theta'$ is no larger than that at $\Theta$.
Since we can restrict our attention to centroids in $[-M,M]^p$ to minimize the objective, we have the following:
Corollary 3.1.
Let $d_\phi$ be a Bregman divergence and let the underlying probability measure have support contained in $[-M,M]^p$. Then any global minimizer of the corresponding objective has all of its centroids in $[-M,M]^p$.
Now note that under A1 both $P$ and $P_n$ have support contained in $[-M,M]^p$. The desired conclusion is thus implied by applying Corollary 3.1 with the underlying measure taken to be $P_n$ and $P$, respectively.
Now that we have confined $\hat{\Theta}_n$ and $\Theta^*$ to a compact set, the following section supplies probabilistic bounds on the uniform deviation $\sup_{\Theta}|f_n(\Theta) - f(\Theta)|$ via metric entropy arguments.
3.2 Concentration Inequality and Metric Entropy Bounds via Rademacher Complexity
We have proven that the centroids of $\hat{\Theta}_n$ and $\Theta^*$ lie in $[-M,M]^p$. To bound the difference $f(\hat{\Theta}_n) - f(\Theta^*)$, we observe
$$f(\hat{\Theta}_n) - f(\Theta^*) \;=\; \big[f(\hat{\Theta}_n) - f_n(\hat{\Theta}_n)\big] + \big[f_n(\hat{\Theta}_n) - f_n(\Theta^*)\big] + \big[f_n(\Theta^*) - f(\Theta^*)\big] \;\le\; 2 \sup_{\Theta}\, |f_n(\Theta) - f(\Theta)|, \qquad (3)$$
since $f_n(\hat{\Theta}_n) \le f_n(\Theta^*)$, where the supremum is over all $\Theta$ with centroids in $[-M,M]^p$.
Thus, to bound the excess risk $f(\hat{\Theta}_n) - f(\Theta^*)$, it is enough to prove bounds on the uniform deviation $\sup_{\Theta}|f_n(\Theta) - f(\Theta)|$. The main idea here is to bound this uniform deviation via Rademacher complexity, and in turn bound the Rademacher complexity itself (Dudley, 1967; Mohri et al., 2018). Let $\mathcal{F}$ denote the class of loss functions $x \mapsto \Phi\big(d_\phi(x,\theta_1),\dots,d_\phi(x,\theta_k)\big)$ indexed by $\Theta$ with centroids in $[-M,M]^p$, equipped with the supremum norm (Athreya and Lahiri, 2006). We recall the definitions of Rademacher complexity and covering number as follows.
Definition 1.
Let $\sigma_1,\dots,\sigma_n$ be i.i.d. Rademacher random variables independent of $X_1,\dots,X_n$, i.e. $P(\sigma_i = 1) = P(\sigma_i = -1) = 1/2$. The population Rademacher complexity of a function class $\mathcal{F}$ is defined as $\mathcal{R}_n(\mathcal{F}) = \mathbb{E}\big[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i f(X_i)\big]$, where the expectation is over both $\{\sigma_i\}$ and $\{X_i\}$.
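For intuition, the empirical Rademacher complexity of a finite collection of functions can be estimated by simple Monte Carlo, as in the sketch below. This is a didactic illustration of Definition 1 rather than part of the paper's analysis; for the clustering loss class, the rows would hold losses evaluated at a grid of candidate centroid configurations.

```python
import numpy as np

def rademacher_complexity_mc(fvals, n_mc=2000, rng=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(X_i) ].

    fvals: (m, n) array whose j-th row is (f_j(X_1), ..., f_j(X_n)) for the j-th function.
    """
    rng = np.random.default_rng(rng)
    n = fvals.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))   # i.i.d. Rademacher signs
    correlations = sigma @ fvals.T / n                # (n_mc, m) sign-weighted averages
    return correlations.max(axis=1).mean()            # sup over functions, average over draws
```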
Definition 2.
($\epsilon$-cover and Covering Number) Let $(S, \rho)$ be a metric space. The set $C \subseteq S$ is said to be an $\epsilon$-cover of $S$ if for all $x \in S$, there exists $c \in C$ such that $\rho(x, c) \le \epsilon$. The $\epsilon$-covering number of $S$ w.r.t. $\rho$, denoted by $N(\epsilon; S, \rho)$, is the size of the smallest $\epsilon$-cover of $S$ with respect to $\rho$.
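As a toy illustration of Definition 2 (and of the bin construction used in the proof of Lemma 3.3), the following sketch counts a simple grid $\epsilon$-cover of the centroid search space under the coordinate-wise sup-norm; the corresponding cover of the loss class then follows from the Lipschitz properties, up to constants not tracked here.

```python
import numpy as np

def grid_cover_size(M, p, k, eps):
    """Size of a grid epsilon-cover of the centroid space ([-M, M]^p)^k under the sup-norm.

    Bins of width 2*eps along each axis keep every point within eps of a bin center,
    so the cover size is (number of bins per axis) ** (p * k).
    """
    bins_per_axis = int(np.ceil(M / eps))   # ceil(2M / (2 eps))
    return bins_per_axis ** (p * k)

# e.g. grid_cover_size(M=1.0, p=2, k=3, eps=0.1) == 10 ** 6
```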
The following lemma gives a bound on the $\epsilon$-covering number of the loss class $\mathcal{F}$ under the supremum norm. The main idea here is to use the Lipschitz property of $\Phi$ and then to find a cover of the search space for the centroids, i.e. $[-M,M]^p$ for each of the $k$ centroids. This then automatically extends to a cover of $\mathcal{F}$ under the sup-norm.
To bound the Rademacher complexity, we will also need to bound the diameter of $\mathcal{F}$ under the sup-norm:
We are now ready to state the bound on the Rademacher complexity in Theorem 3.1. We provide a sketch of the argument below, with the complete proof details available in the Appendix.
Proof Sketch
The main idea of the proof is to use Dudley's chaining argument (Dudley, 1967; Wainwright, 2019) to bound the Rademacher complexity in terms of an integral involving the metric entropy $\log N(\epsilon; \mathcal{F}, \|\cdot\|_\infty)$. Let $D$ denote the sup-norm diameter of $\mathcal{F}$. Using the chaining approach, one can show that
$$\mathcal{R}_n(\mathcal{F}) \;\lesssim\; \frac{1}{\sqrt{n}} \int_{0}^{D} \sqrt{\log N(\epsilon; \mathcal{F}, \|\cdot\|_\infty)}\; d\epsilon. \qquad (4)$$
Before we proceed, we state the following lemma, which puts a uniform bound on the functions in $\mathcal{F}$ over the support of the data.
Lemma 3.5.
For all $x \in [-M,M]^p$ and all $\Theta$ with centroids in $[-M,M]^p$, the loss $\Phi\big(d_\phi(x,\theta_1),\dots,d_\phi(x,\theta_k)\big)$ is bounded above by a finite constant depending only on the model parameters.
Now that we have established a bound on the Rademacher complexity $\mathcal{R}_n(\mathcal{F})$, we are ready to establish a uniform concentration inequality on $\sup_{\Theta}|f_n(\Theta) - f(\Theta)|$ in the following theorem, proven in the Appendix.
The result in Theorem 3.2 reveals a non-asymptotic bound on the uniform deviation, and hence, via (3), on the excess risk $f(\hat{\Theta}_n) - f(\Theta^*)$.
3.3 Inference for Fixed $k$ and $p$: Strong and $\sqrt{n}$-Consistency
We now discuss the classical domain where $k$ and $p$ are kept fixed (Pollard, 1981; Chakraborty et al., 2020; Terada, 2014) and show that the results above imply strong and $\sqrt{n}$-consistency, mirroring some known results for existing methods such as $k$-means. We first solidify the notion of convergence of the centroids $\hat{\Theta}_n$ to $\Theta^*$, following Pollard (1981). Since centroids are unique only up to label permutations, our notion of dissimilarity
$$\mathrm{dist}(\Theta_1, \Theta_2) \;=\; \min_{\Pi}\ \|\Theta_1 - \Theta_2 \Pi\|_F$$
is considered over the set of all $k \times k$ permutation matrices $\Pi$, where $\|\cdot\|_F$ denotes the Frobenius norm and the centroids are arranged as columns. Now, we say that the sequence $\{\hat{\Theta}_n\}$ converges to $\Theta^*$ if $\mathrm{dist}(\hat{\Theta}_n, \Theta^*) \to 0$. We begin by imposing the following standard identifiability condition (Pollard, 1981; Terada, 2014; Chakraborty et al., 2020) on $f$ for our analysis.
A 4.
For all $\eta > 0$, there exists $\epsilon > 0$, such that $f(\Theta) > f(\Theta^*) + \epsilon$ whenever $\mathrm{dist}(\Theta, \Theta^*) > \eta$.
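For concreteness, the permutation-invariant dissimilarity above can be computed by brute force as in the sketch below (our own illustration, with centroids stored as rows of an array; for large $k$, a linear assignment solver would scale better than enumerating all $k!$ permutations).

```python
import numpy as np
from itertools import permutations

def centroid_distance(Theta1, Theta2):
    """Label-permutation-invariant dissimilarity: min over relabelings of the Frobenius norm."""
    k = Theta1.shape[0]
    return min(np.linalg.norm(Theta1 - Theta2[list(perm)], ord='fro')
               for perm in permutations(range(k)))
```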
We now investigate the strong consistency properties of $\hat{\Theta}_n$, and also the rate at which $f(\hat{\Theta}_n)$ converges to $f(\Theta^*)$. Theorem 3.3 states that strong consistency indeed holds, with convergence rate $O_P(n^{-1/2})$. Note that this rate is faster than that found previously by Paul et al. (2021a). Before we proceed, recall that we say a sequence of random variables is $O_P(a_n)$ if, after rescaling by $a_n$, the sequence is tight (Athreya and Lahiri, 2006).
Theorem 3.3.
Suppose A1 through A4 hold. Then $\hat{\Theta}_n \to \Theta^*$ almost surely, and $f(\hat{\Theta}_n) - f(\Theta^*) = O_P(n^{-1/2})$. Note: here $O_P(1)$ means that the sequence in question is stochastically bounded, i.e. tight.
3.4 Inference Under the MoM Framework
Theorem 3.3 and the bounds we present above already establish novel statistical results that pertain to methods under (1), such as power $k$-means, in the "uncontaminated" setting. In this section, we extend these findings to the MoM setup under (2) in order to understand their behavior in the presence of outliers. Recall that in this setting, the data are partitioned into $L$ equally sized blocks $B_1,\dots,B_L$; without loss of generality, we take $L$ to divide $n$, so that each block contains $b = n/L$ observations. We denote the set of all inliers by $\mathcal{I}$ and the set of outliers by $\mathcal{O}$. Now let $\hat{\Theta}_n^{\mathrm{MoM}}$ denote the minimizer of (2): towards establishing the error rate at which $f(\hat{\Theta}_n^{\mathrm{MoM}})$ goes to $f(\Theta^*)$, we assume the following:
A 5.
The inliers $\{X_i : i \in \mathcal{I}\}$ are i.i.d. with $P\big(X_i \in [-M,M]^p\big) = 1$.
A 6.
There exists $\epsilon > 0$ such that $|\mathcal{O}| \le \big(\tfrac{1}{2} - \epsilon\big) L$.
We remark that A5 is identical to A1, but imposed only on the inlying observations. A6 ensures that at least half of the partitions are free of outliers; note that this is much weaker than the corresponding requirements in recent work (Lecué et al., 2020). Importantly, we emphasize that no distributional assumptions regarding the outlying observations are made, allowing them to be unbounded, generated from heavy-tailed distributions, or dependent among each other. Proofs of the following results appear in the Appendix.
As a point of departure, we first establish that the centroids of $\hat{\Theta}_n^{\mathrm{MoM}}$ also lie in $[-M,M]^p$.
Again, we may restrict the search space for finding $\hat{\Theta}_n^{\mathrm{MoM}}$ to centroids in $[-M,M]^p$ due to Lemma 3.6.
Similarly to Section 3.2, we derive a uniform deviation bound and then bound the excess risk $f(\hat{\Theta}_n^{\mathrm{MoM}}) - f(\Theta^*)$ in turn. For brevity, we write $b = n/L$ for the common block size, and use the notation "$\lesssim$" to suppress the absolute constants. The uniform deviation bound is as follows, with the complete proof appearing in the Appendix.
We give a brief outline of our proof for Theorem 3.4 below, with details appearing in the Appendix.
Proof sketch
We begin by noting that if, for more than half of the blocks, the block-wise empirical average of the loss is within $\epsilon$ of its population counterpart, then the median over blocks is also within $\epsilon$ of the population risk, where the block-wise averages are taken with respect to the empirical distribution of each block $B_\ell$. This implies that it suffices to bound, uniformly over $\Theta$, the probability that at least half of the block-wise averages deviate from the population risk by more than $\epsilon$.
Introducing an auxiliary function that counts the number of deviating blocks, we first bound its behavior on the outlier-free partitions, making use of standard truncation inequalities. We then proceed by bounding the resulting supremum by the sum of two terms: the first is bounded by appealing to Hoeffding's inequality, while the second is bounded by applying the bounded difference inequality together with the result of Theorem 3.1 to control the resulting Rademacher complexity.
The corollary below follows from this uniform bound, giving non-asymptotic control over the excess risk $f(\hat{\Theta}_n^{\mathrm{MoM}}) - f(\Theta^*)$ in terms of the model parameters.
3.5 Inference for Fixed $k$ and $p$ Under the MoM Framework
We now focus our attention back on the classical setting where the number of clusters $k$ and the number of features $p$ remain fixed. To show that $\hat{\Theta}_n^{\mathrm{MoM}}$ is consistent for $\Theta^*$, we need to impose conditions such that the RHS of the bounds presented in Corollary 3.5 decreases to $0$ as $n \to \infty$. We state the required conditions as follows.
A 7.
The number of partitions $L \to \infty$ and $L/n \to 0$ as $n \to \infty$.
These conditions are natural: as $n$ grows, so too must $L$ in order to maintain a proportion of outlier-free partitions. On the other hand, $L$ must grow slowly relative to $n$ to ensure that each partition can be assigned a sufficient number of datapoints. We note that A7, combined with A6, implies $|\mathcal{O}| = o(n)$, an intuitive and standard condition (Lecué et al., 2020; Staerman et al., 2021; Paul et al., 2021b) as outliers should be few by definition.
In the following corollary, we focus on the squared Euclidean distance for center-based clustering under the MoM framework. We show that the (global) estimates obtained from MoM $k$-means, MoM power $k$-means, etc. are consistent. We stress that the obtained convergence rate, as a function of $n$ and $L$, does not depend on the choice of $\Phi$, as long as it satisfies A2, i.e. is Lipschitz. In particular, replacing $\Phi$ with the minimum, the power mean, and the harmonic mean, respectively, the rate in Corollary 3.6 applies to robustified MoM versions of $k$-means, power $k$-means and $k$-harmonic means alike.
Corollary 3.6.
Remark
These results imply that any MoM center-based algorithm under our paradigm admits a convergence rate of $O_P\big(\sqrt{L/n}\big)$ when equipped with the squared Euclidean distance. Note that as $L \to \infty$, $\sqrt{L/n}$ decays more slowly than $n^{-1/2}$. Thus, the convergence rates for MoM variants in our framework are generally slower than their ERM counterparts, for which the rate is $O_P(n^{-1/2})$. This is unsurprising, as MoM operates on outlier-contaminated data; there is "no free lunch" in trading off robustness for rate of convergence. However, if the number of partitions grows slowly relative to $n$ (say, $L = O(\log n)$, so that $\sqrt{L/n} = O(\sqrt{\log n / n})$), then the convergence rates for MoM estimation become comparable to the ERM counterpart rate of $O_P(n^{-1/2})$.
4 Empirical Studies and Performance
To validate and assess our proposed framework, we now turn to an empirical comparison of the proposed and peer clustering methods. We evaluate clustering quality under the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985), whose values are at most $1$, with $1$ denoting a perfect match with the ground truth. Though it is not feasible to exhaustively survey center-based clustering methods, we consider a broad range of competitors, comparing to $k$-means (MacQueen, 1967), Partition Around Medoids (PAM) (Kaufman and Rousseeuw, 2009), $k$-medians (Jain, 2010), Robust $k$-means++ (RKMpp) (Deshpande et al., 2020), Robust Continuous Clustering (RCC) (Shah and Koltun, 2017), Bregman Trimmed $k$-means (BTKM) (Fischer et al., 2020), MoM $k$-means (MOMKM) (Klochkov et al., 2020) and a novel MoM variant of power $k$-means (MOMPKM) implied under the proposed framework. An open-source implementation of the proposed method and code for reproducing the simulation experiments are available at https://github.com/SaptarshiC98/MOMPKM.
We consider two thorough simulated experiments below, with additional simulations and large-scale real data results in the Appendix. While the extended comparisons are omitted for space considerations, they convey the same trends as the studies below. In all settings, we generate data in $p$ dimensions, varying the number of clusters and the outlier percentages. True centers are spaced uniformly on a grid, and observations are drawn from Gaussians around their ground-truth centers with a common variance. Because we generate Gaussian data, we focus here on the Euclidean case, and do not consider other Bregman divergences in the present empirical study. In all experiments, we take the number of partitions $L$ to be roughly double the number of outliers, and set the learning rate and remaining hyperparameters to fixed default values. Results are averaged over 20 random restarts, and all competitors are initialized and tuned according to the standard implementations described in their original papers.
Experiment 1: increasing the number of clusters
The first experiment assesses performance as the number of true clusters grows, while keeping the proportion of outliers fixed. Datasets are generated with the number of clusters $k$ varying over a range of values. For each setting, we create balanced clusters, drawing an equal number of points from each true center. The outliers are generated from a uniform distribution with support on the range of the inlying observations. We repeat this data generation process multiple times under each parameter setting.
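A minimal sketch of this data-generating scheme is given below (our own illustration; the exact grid spacing, variance, sample sizes, and replication counts used in the experiments may differ), with evaluation on the inliers via scikit-learn's adjusted_rand_score. The clustering routine is a placeholder.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def make_grid_clusters(k, p=2, n_per_cluster=100, outlier_frac=0.1,
                       spacing=5.0, sigma=1.0, rng=0):
    """Gaussian clusters centered on a grid, plus uniform outliers over the inlier range."""
    rng = np.random.default_rng(rng)
    side = int(np.ceil(np.sqrt(k)))
    grid = spacing * np.array([[i, j] for i in range(side) for j in range(side)])[:k]
    centers = np.hstack([grid, np.zeros((k, p - 2))])      # embed the 2-D grid in p dimensions
    X = np.vstack([rng.normal(c, sigma, size=(n_per_cluster, p)) for c in centers])
    y = np.repeat(np.arange(k), n_per_cluster)
    n_out = int(outlier_frac * len(X))
    outliers = rng.uniform(X.min(axis=0), X.max(axis=0), size=(n_out, p))
    return np.vstack([X, outliers]), y

# X, y = make_grid_clusters(k=10, p=2)
# labels = some_clustering_method(X)                 # hypothetical clustering routine
# print(adjusted_rand_score(y, labels[: len(y)]))    # ARI on the inlying observations
```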
The average ARI values at convergence, plotted against the number of clusters along with error bars (standard deviations), are shown in the left panel of Figure 1. We see that the robustified version of power $k$-means implied by our framework (labeled MOMPKM) achieves the best performance here. This may be unsurprising, as the recent power $k$-means method was shown to significantly reduce sensitivity to local minima (which tend to increase with $k$), while MoM further protects the algorithm from outlier influence.
[Figure 1: Average ARI with error bars, versus the number of clusters (left panel) and versus the outlier percentage (right panel).]
Experiment 2: increasing outlier percentage
Following the same data generation process as in Experiment 1, we now fix the number of clusters while varying the outlier percentage. For each parameter setting, we again replicate the experiment multiple times. Average ARI values comparing the inlying observations to their ground truths are plotted in the right panel of Figure 1. Not only does this study convey similar trends as Experiment 1, but we see that competing methods continue to deteriorate with increasing outliers as one might expect, while MOMPKM remarkably remains relatively stable.
We see that the ERM-based methods such as BTKM, MOMKM, and PAM struggle when there is a large number of clusters. Similarly, methods such as $k$-medians and RKMpp often stop short at poor local optima, which is quickly exacerbated by outliers despite initialization via clever seeding techniques. These phenomena are consistent with what has been reported in the literature for their non-robust counterparts (Xu and Lange, 2019; Chakraborty et al., 2020). Overall, the empirical study suggests that a robust version of power $k$-means clustering under the MoM framework shows promise in handling several data challenges at once, in line with our theoretical analysis.
5 Discussion
In this paper, we proposed a paradigm that unifies a suite of center-based clustering algorithms. Under this view, we developed a simple yet efficient framework for robust versions of such algorithms by appealing to the Median of Means (MoM) philosophy. Using gradient-based methods, the MoM objectives can be solved with the same per-iteration complexity as Lloyd's $k$-means algorithm, largely retaining its simplicity. Importantly, we derived a thorough analysis of the statistical properties and convergence rates by establishing uniform concentration bounds under i.i.d. sampling of the data. These novel theoretical contributions demonstrate how arguments utilizing Rademacher complexities and Dudley's chaining can be leveraged in the robust clustering context. As a result, we are able to obtain error rates that require neither asymptotic assumptions nor restrictions on the relation between $n$ and $p$. These findings recover asymptotic results such as strong consistency and $\sqrt{n}$-consistency under classical assumptions.
As shown in the paper, the robustness of MoM estimators comes at the cost of slower convergence rates compared to their ERM counterparts. We emphasize that there is no "median-of-means magic", and that the efficacy of MoM depends on the interplay between the partitions and the outliers. If the number of partitions circumvents the impact of the outliers, the performance of MoM clustering estimates under our framework scales with the block size $b = n/L$ as $O_P(1/\sqrt{b})$. Since $L$ can be chosen to be approximately twice the number of outliers, the obtained error rate is roughly $O_P\big(\sqrt{|\mathcal{O}|/n}\big)$. If $|\mathcal{O}|$ scales proportionately with $n$, however, this error bound becomes meaningless. For our consistency results to hold, it is crucial that $|\mathcal{O}| = o(n)$, which in turn allows us to choose $L$ satisfying condition A7. If $|\mathcal{O}| = O(n^{\delta})$ for some $\delta < 1$, the error rate is $O_P\big(n^{(\delta - 1)/2}\big)$.
This suggests possible future research directions in improving these ERM rates via so-called "fast rates" under additional assumptions (Boucheron et al., 2005; Wainwright, 2019). Moreover, it will be fruitful to extend the results to noise distributions that satisfy only moment conditions. Recent works in convex clustering (Tan and Witten, 2015; Chakraborty and Xu, 2020) have considered sub-Gaussian models to obtain error rates. The work by Biau et al. (2008) and recent work by Klochkov et al. (2020) obtain similar error rates under finite second-moment conditions using the assumption that the cluster centroids are bounded, and it may be possible to extend our approach using local Rademacher complexities (Bartlett et al., 2005) to relax the bounded support assumption. One can also seek lower bounds on the approximation error, or explore high-dimensional robust center-based clustering under the proposed paradigm. Finally, we have not explored the implementation and empirical performance of Bregman versions of our MoM estimator, for instance with application to data arising from mixtures of exponential families other than the Gaussian case. These interesting directions remain open for future work.
References
- Alcalá et al., (2010) Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2010). Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287.
- Aloise et al., (2009) Aloise, D., Deshpande, A., Hansen, P., and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248.
- Athreya and Lahiri, (2006) Athreya, K. B. and Lahiri, S. N. (2006). Measure theory and probability theory. Springer Science & Business Media.
- Balcan et al., (2008) Balcan, M.-F., Blum, A., and Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 671–680.
- Banerjee et al., (2005) Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).
- Bartlett et al., (2002) Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48(1):85–113.
- Bartlett et al., (2005) Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
- Bartlett and Mendelson, (2002) Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
- Beliakov et al., (2010) Beliakov, G., Calvo, T., and James, S. (2010). On Lipschitz properties of generated aggregation functions. Fuzzy Sets and Systems, 161(10):1437–1447.
- Ben-David, (2007) Ben-David, S. (2007). A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2):243–257.
- Biau et al., (2008) Biau, G., Devroye, L., and Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790.
- Boucheron et al., (2005) Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375.
- Brownlees et al., (2015) Brownlees, C., Joly, E., and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536.
- Brunet-Saumard et al., (2022) Brunet-Saumard, C., Genetay, E., and Saumard, A. (2022). K-bMOM: A robust lloyd-type clustering algorithm based on bootstrap median-of-means. Computational Statistics & Data Analysis, 167:107370.
- Bubeck et al., (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
- Chakraborty and Das, (2020) Chakraborty, S. and Das, S. (2020). Detecting meaningful clusters from high-dimensional data: A strongly consistent sparse center-based clustering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Chakraborty et al., (2020) Chakraborty, S., Paul, D., Das, S., and Xu, J. (2020). Entropy weighted power k-means clustering. In International Conference on Artificial Intelligence and Statistics, pages 691–701. PMLR.
- Chakraborty and Xu, (2020) Chakraborty, S. and Xu, J. (2020). Biconvex clustering. arXiv preprint arXiv:2008.01760.
- Deshpande et al., (2020) Deshpande, A., Kacham, P., and Pratap, R. (2020). Robust k-means++. In Conference on Uncertainty in Artificial Intelligence, pages 799–808. PMLR.
- Duchi et al., (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159.
- Dudley, (1967) Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330.
- Fischer et al., (2020) Fischer, A., Levrard, C., and Brécheteau, C. (2020). Robust Bregman clustering. Annals of Statistics.
- Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT press Cambridge.
- Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
- Hubert and Arabie, (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
- Jain, (2010) Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.
- Kaufman and Rousseeuw, (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
- Klochkov et al., (2020) Klochkov, Y., Kroshnin, A., and Zhivotovskiy, N. (2020). Robust k-means clustering for distributions with two moments. arXiv preprint arXiv:2002.02339.
- Laforgue et al., (2019) Laforgue, P., Clémençon, S., and Bertail, P. (2019). On medians of (randomized) pairwise means. In International Conference on Machine Learning, pages 1272–1281. PMLR.
- Lecué and Lerasle, (2019) Lecué, G. and Lerasle, M. (2019). Learning from MOM’s principles: Le Cam’s approach. Stochastic Processes and Their Applications, 129(11):4385–4410.
- Lecué and Lerasle, (2020) Lecué, G. and Lerasle, M. (2020). Robust machine learning by median-of-means: theory and practice. Annals of Statistics, 48(2):906–931.
- Lecué et al., (2020) Lecué, G., Lerasle, M., and Mathieu, T. (2020). Robust classification via mom minimization. Machine Learning, 109(8):1635–1665.
- Lerasle, (2019) Lerasle, M. (2019). Lecture notes: Selected topics on robust statistical learning theory. arXiv preprint arXiv:1908.10761.
- Levrard, (2015) Levrard, C. (2015). Nonasymptotic bounds for vector quantization in Hilbert spaces. The Annals of Statistics, 43(2):592–619.
- Lloyd, (1982) Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.
- Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Regularization, sparse recovery, and median-of-means tournaments. Bernoulli, 25(3):2075–2106.
- MacQueen, (1967) MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.
- Mathieu and Minsker, (2021) Mathieu, T. and Minsker, S. (2021). Excess risk bounds in robust empirical risk minimization. Information and Inference: A Journal of the IMA.
- Minsker, (2018) Minsker, S. (2018). Uniform bounds for robust mean estimators. arXiv preprint arXiv:1812.03523.
- Mohri et al., (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
- Ng et al., (2001) Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849–856.
- Ostrovsky et al., (2013) Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. (2013). The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22.
- (43) Paul, D., Chakraborty, S., and Das, S. (2021a). On the uniform concentration bounds and large sample properties of clustering with Bregman divergences. Stat, page e360.
- (44) Paul, D., Chakraborty, S., and Das, S. (2021b). Robust principal component analysis: A median of means approach. arXiv preprint arXiv:2102.03403.
- Pollard, (1981) Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics, 9(1):135–140.
- Rodriguez and Valdora, (2019) Rodriguez, D. and Valdora, M. (2019). The breakdown point of the median of means tournament. Statistics & Probability Letters, 153:108–112.
- Shah and Koltun, (2017) Shah, S. A. and Koltun, V. (2017). Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819.
- Shalev-Shwartz and Ben-David, (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
- Staerman et al., (2021) Staerman, G., Laforgue, P., Mozharovskyi, P., and d’Alché Buc, F. (2021). When OT meets MoM: Robust estimation of Wasserstein distance. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 136–144. PMLR.
- Tan and Witten, (2015) Tan, K. M. and Witten, D. (2015). Statistical properties of convex clustering. Electronic journal of statistics, 9(2):2324.
- Teboulle, (2007) Teboulle, M. (2007). A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research, 8(1).
- Telgarsky and Dasgupta, (2012) Telgarsky, M. J. and Dasgupta, S. (2012). Agglomerative Bregman clustering. In 29th International Conference on Machine Learning, ICML 2012, pages 1527–1534.
- Telgarsky and Dasgupta, (2013) Telgarsky, M. J. and Dasgupta, S. (2013). Moment-based uniform deviation bounds for k-means and friends. Advances in Neural Information Processing Systems.
- Terada, (2014) Terada, Y. (2014). Strong consistency of reduced k-means clustering. Scandinavian Journal of Statistics, 41(4):913–931.
- Terada, (2015) Terada, Y. (2015). Strong consistency of factorial k-means clustering. Annals of the Institute of Statistical Mathematics, 67(2):335–357.
- Vapnik, (2013) Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media.
- Vassilvitskii and Arthur, (2006) Vassilvitskii, S. and Arthur, D. (2006). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035.
- Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
- Xu and Lange, (2019) Xu, J. and Lange, K. (2019). Power k-means clustering. In International Conference on Machine Learning, pages 6921–6931. PMLR.
- Zhang et al., (1999) Zhang, B., Hsu, M., and Dayal, U. (1999). K-harmonic means—a data clustering algorithm. Hewlett-Packard Labs Technical Report HPL-1999-124, 55.
- Zhang et al., (2020) Zhang, Z., Lange, K., and Xu, J. (2020). Simple and scalable sparse k-means clustering via feature ranking. Advances in Neural Information Processing Systems, 33.
Appendix A Proofs of Lemmas
For the theoretical exposition, we first establish the following lemmas. Lemma A.1 proves that the gradient of $\phi$ is bounded in norm when its domain is restricted to the support of $P$.
Lemma A.1.
Under A3, $\sup_{x \in [-M,M]^p} \|\nabla\phi(x)\|_2 < \infty$.
Proof.
Lemma A.2 essentially proves that the Bregman divergence $d_\phi$ is Lipschitz on $[-M,M]^p$, with a Lipschitz constant depending only on $\alpha$, $M$, and $p$.
Lemma A.2.
Under A3, for all $x \in [-M,M]^p$, the map $\theta \mapsto d_\phi(x, \theta)$ is Lipschitz on $[-M,M]^p$, i.e. there exists a constant $C < \infty$ such that $|d_\phi(x,\theta) - d_\phi(x,\theta')| \le C \|\theta - \theta'\|_2$ for all $\theta, \theta' \in [-M,M]^p$.
Proof.
From the mean value theorem, the difference of the divergences can be written as an inner product involving $\nabla\phi$ evaluated at some point $\xi$ in the convex combination of the two arguments. Clearly, $\xi \in [-M,M]^p$, due to the convexity of $[-M,M]^p$. Now the claim follows by the Cauchy–Schwarz inequality and Lemma A.1, which together bound this inner product by a constant multiple of the distance between the two arguments.
∎
Lemma A.3 proves that the loss $\Phi\big(d_\phi(x,\theta_1),\dots,d_\phi(x,\theta_k)\big)$, viewed as a function of $\Theta$, is Lipschitz with respect to the norm on the centroid space.
Proof.
∎
Appendix B Proofs from Section 3
B.1 Proof of Lemma 3.1
Proof.
Let $c \in C$ be arbitrary. Since $x^*$ minimizes $d_\phi(\cdot, x)$ over $C$, the first-order optimality condition yields
$$\langle \nabla\phi(x^*) - \nabla\phi(x),\ c - x^* \rangle \;\ge\; 0 \quad \text{for all } c \in C. \qquad (5)$$
Combining (5) with the three-point identity $d_\phi(c, x) = d_\phi(c, x^*) + d_\phi(x^*, x) + \langle \nabla\phi(x^*) - \nabla\phi(x),\ c - x^* \rangle$ gives the claim.
B.2 Proof of Lemma 3.2
Proof.
B.3 Proof of Lemma 3.3
Proof.
We first divide the interval $[-M, M]$ into finitely many bins of equal width, and let $G$ denote the grid of points in $[-M,M]^p$ obtained by taking bin representatives coordinate-wise. Clearly, every point of $[-M,M]^p$ lies within the prescribed distance of some grid point. For any configuration $\Theta$ with centroids in $[-M,M]^p$, we can therefore construct a configuration $\Theta'$ with all centroids in $G$ such that, by Lemma A.3, the corresponding loss functions are uniformly within $\epsilon$ of one another. Thus, the set of loss functions indexed by grid configurations constitutes an $\epsilon$-cover of $\mathcal{F}$ under the sup-norm. Hence, the covering number is bounded by the number of such grid configurations.
∎
B.4 Proof of Lemma 3.4
Proof.
B.5 Proof of Lemma 3.5
B.6 Proof of Theorem 3.1
Proof.
We construct a decreasing sequence of scales as follows. Take $\alpha_0$ to be the sup-norm diameter of $\mathcal{F}$ (which is finite by Lemma 3.4) and set $\alpha_{j+1} = \alpha_j / 2$. Let $C_j$ be a minimal $\alpha_j$-cover of $\mathcal{F}$, and for each $f \in \mathcal{F}$ denote by $\pi_j(f)$ the closest element of $C_j$ (with ties broken arbitrarily). We can thus write the chained decomposition,
where the decomposition consists of three terms, (9), (10), and (11).
Since we can pick the reference element of the coarsest cover arbitrarily, the first of these terms vanishes. To bound the second term, we observe that,
To bound the remaining term, we observe that,
Now appealing to Massart’s lemma (Mohri et al.,, 2018), we get,
Thus,
Combining the bounds on the three terms above, we get,
(12) |
From the construction of the sequence $\{\alpha_j\}$, we know that $\alpha_j = \alpha_0\, 2^{-j}$. Hence, from equation (12), we get,
Taking the limit as the number of chaining levels tends to infinity in the above equation, we get,
Plugging in the covering number bound from Lemma 3.3, we get,
∎
B.7 Proof of Theorem 3.2
B.8 Proof of Theorem 3.3
Proof.
(Proof of strong consistency.) We will first show that $f(\hat{\Theta}_n) \to f(\Theta^*)$ almost surely. To show this, let $\epsilon > 0$ be fixed. Then from Theorem 3.2, we observe that, with the probability stated there,
(14)
If $n$ is taken sufficiently large, the RHS of (14) becomes no bigger than $\epsilon$. Thus, the probabilities of the events $\{f(\hat{\Theta}_n) - f(\Theta^*) > \epsilon\}$ are summable in $n$. Since this series is convergent, the Borel–Cantelli lemma implies that $f(\hat{\Theta}_n) - f(\Theta^*) \le \epsilon$ almost surely for all sufficiently large $n$. Hence, $f(\hat{\Theta}_n) \to f(\Theta^*)$ almost surely. From assumption A4, $\mathrm{dist}(\hat{\Theta}_n, \Theta^*) \le \eta$ almost surely, for any prefixed $\eta > 0$ and $n$ large. Thus, $\hat{\Theta}_n \to \Theta^*$ almost surely, which proves the result.
(Proof of $\sqrt{n}$-consistency.) Fix $\delta > 0$. From Theorem 3.2, with probability at least $1 - \delta$, the uniform deviation bound holds with the constants prescribed there. Hence, via (3), the corresponding bound on $f(\hat{\Theta}_n) - f(\Theta^*)$ holds with probability at least $1 - \delta$. Thus, for every $\delta > 0$ there exists a constant $C_\delta < \infty$ such that
$$P\Big(\sqrt{n}\,\big\{f(\hat{\Theta}_n) - f(\Theta^*)\big\} > C_\delta\Big) \;\le\; \delta$$
for all $n$ large enough. Hence, $f(\hat{\Theta}_n) - f(\Theta^*) = O_P(n^{-1/2})$. ∎
Appendix C Proofs from Section 3.4
C.1 Proof of Lemma 3.6
C.2 Proof of Theorem 3.4
Proof.
For notational simplicity, let $P_{B_\ell}$ denote the empirical distribution of the observations in block $B_\ell$. We will first bound the probability that the MoM objective exceeds the population risk by more than the prescribed amount. To do so, we will individually bound the probabilities of the events
and
We note that if
then
Here again $\mathbb{1}\{\cdot\}$ denotes the indicator function. Clearly,
(15)
We observe that,
(16)
To bound , we will first bound the quantity . We observe that,
(17)
We now turn to bounding the term
Appealing to Theorem 26.5 of Shalev-Shwartz and Ben-David (2014), we observe that, with probability at least $1 - \delta$, for all functions in the class,
(18)
Here $\sigma_1,\dots,\sigma_n$ are i.i.d. Rademacher random variables. Let $\sigma_1',\dots,\sigma_n'$ be i.i.d. Rademacher random variables, independent from the data and from $\{\sigma_i\}$. From equation (18), we get,
(19)
Equation (19) follows from the fact that the relevant map is 1-Lipschitz, appealing to the contraction lemma (Lemma 26.9) of Shalev-Shwartz and Ben-David (2014). We now consider a "ghost" sample, which is i.i.d. and follows the same probability law as the inliers. Thus, equation (19) can be further shown to give
(20)
(21)
(22)
(23)
Equation (20) follows from observing that the ghost sample has the same distribution as the original inliers. In equation (21), the signs are independent Rademacher random variables by construction. Equation (22) follows from appealing to Theorem 3.1. Thus, combining equations (18), (19), and (23), we conclude that, with probability at least $1 - \delta$,
(24)
We choose and
This makes the right-hand side of (24) strictly smaller than the desired threshold. Thus, we have shown that
Similarly, we can show that,
Combining the above two inequalities, we get,
In other words, with probability at least the level stated in the theorem,
∎
C.3 Proof of Corollary 3.5
Proof.
C.4 Proof of Corollary 3.6
Appendix D Additional Experiments
D.1 Additional Simulations
Experiment 3
We use the same simulation setting as Experiment 1. However, the outliers are now generated from a Gaussian as well, with mean vector proportional to the $p$-dimensional vector of all ones and covariance matrix proportional to the identity matrix.
Experiment 4
We use the same simulation setting as Experiment 2. However, the outliers are now generated from the same scheme as in Experiment 3.
[Figures: average ARI results for Experiments 3 and 4.]
D.2 Case Study on Real Data: KDDCUP
In this section, we assess performance on real data through an analysis of the KDDCUP dataset (Alcalá et al., 2010), which consists of approximately 5M observations depicting connections of sequences of TCP packets. The features are normalized to have zero mean and unit standard deviation. The data contain 23 classes, of which the three largest contain the vast majority of the observations. Following the footsteps of Deshpande et al. (2020), the points in the remaining classes are considered as outliers. We run all the algorithms as described in the beginning of Section 4, with the hyperparameters of our algorithm fixed in advance. We measure performance in terms of the ARI, as well as average precision and recall (Hastie et al., 2009). The last two indices are added following Deshpande et al. (2020). We report the average of these indices over the replications in Table 1. For all these indices, a higher value implies superior performance. Table 1 shows similar trends as discussed in Section 4 of the main text. In particular, MOMPKM best resembles the ground truth compared to the state-of-the-art. Surprisingly, RKMpp performs better than the other competitors (except for MOMPKM), which was not the case for simulated data under ideal model assumptions. This is possibly because the data contain only 47 features, compared to almost 5M samples, so that good seedings capitalize on the higher signal-to-noise ratio compared to that of the data used in the simulation studies.
| Index | k-means | BTKM | RCC | PAM | RKMpp | PKM | MOMKM | MOMPKM |
|---|---|---|---|---|---|---|---|---|
| ARI | | | | | | | | |
| Precision | | | | | | | | |
| Recall | 0.00 | 0.14 | 0.07 | 0.11 | 0.63 | 0.49 | 0.59 | 0.76 |
Appendix E Machine Specifications
The simulation studies were undertaken on an HP laptop with Intel(R) Core(TM)i3-5010U 2.10 GHz processor, 4GB RAM, 64-bit Windows 8.1 operating system in R and python 3.7 programming languages. The real data experiments were undertaken on a computing cluster with 656 cores (essentially CPUs) spread across a number of nodes of varying hardware specifications. 256 of the cores are in the ‘low’ partition. There are 32 cores and 256 GB RAM per node.