
Uniform Concentration Bounds toward a Unified Framework for Robust Clustering

Debolina Paul (Department of Statistics, Stanford University), Saptarshi Chakraborty (Department of Statistics, University of California, Berkeley), Swagatam Das (Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India), and Jason Xu (Department of Statistical Science, Duke University).
Debolina Paul and Saptarshi Chakraborty are joint first authors and contributed equally to this work. Correspondence to: [email protected].
To appear in the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.
Abstract

Recent advances in center-based clustering continue to improve upon the drawbacks of Lloyd's celebrated k-means algorithm over 60 years after its introduction. Various methods seek to address poor local minima, sensitivity to outliers, and data that are not well-suited to Euclidean measures of fit, but many are supported largely empirically. Moreover, combining such approaches in a piecemeal manner can result in ad hoc methods, and the limited theoretical results supporting each individual contribution may no longer hold. Toward addressing these issues in a principled way, this paper proposes a cohesive robust framework for center-based clustering under a general class of dissimilarity measures. In particular, we present a rigorous theoretical treatment within a Median-of-Means (MoM) estimation framework, showing that it subsumes several popular k-means variants. In addition to unifying existing methods, we derive uniform concentration bounds that complete their analyses, and bridge these results to the MoM framework via Dudley's chaining arguments. Importantly, we neither require any assumptions on the distribution of the outlying observations nor on the relative number of observations n to features p. We establish strong consistency and an error rate of O(n^{-1/2}) under mild conditions, surpassing the best-known results in the literature. The methods are empirically validated thoroughly on real and synthetic datasets.

1 Introduction

Clustering is a fundamental task in unsupervised learning, which seeks to discover naturally occurring groups within a dataset. Among a plethora of algorithms in the literature, center-based methods remain widely popular, with k-means (MacQueen, 1967; Lloyd, 1982) as the most prominent example. Given n data points \mathcal{X}=\{\boldsymbol{X}_{i}:i=1,\dots,n\}\subset\mathbb{R}^{p}, k-means seeks to partition the data into k mutually exclusive and exhaustive groups by minimizing the within-cluster variance. Representing the cluster centroids \boldsymbol{\Theta}=\{\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{k}\}\subset\mathbb{R}^{p}, the k-means problem is formulated as the minimization of the objective function

f_{k\text{-means}}(\boldsymbol{\Theta})=\sum_{i=1}^{n}\min_{1\leq j\leq k}d(\boldsymbol{X}_{i},\boldsymbol{\theta}_{j}),

where d(\cdot,\cdot) is a dissimilarity measure. Taking the squared Euclidean distance d(\boldsymbol{x},\boldsymbol{y})=\|\boldsymbol{x}-\boldsymbol{y}\|_{2}^{2} yields the classical k-means formulation.

Clustering via k-means is well-documented to be sensitive to initialization (Vassilvitskii and Arthur, 2006), prone to getting stuck in poor local optima (Zhang et al., 1999; Xu and Lange, 2019; Chakraborty et al., 2020), and fragile against linearly non-separable clusters (Ng et al., 2001) or in the presence of even a single outlying observation (Klochkov et al., 2020). Researchers continue to tackle these drawbacks of k-means while preserving its simplicity and interpretability (Banerjee et al., 2005; Aloise et al., 2009; Teboulle, 2007; Ostrovsky et al., 2013; Zhang et al., 2020; Telgarsky and Dasgupta, 2012). Although often successful in practice, few of these approaches are grounded in rigorous theory or provide finite-sample statistical guarantees. The available consistency results are mostly asymptotic in nature (Chakraborty and Das, 2020; Chakraborty et al., 2020; Terada, 2014, 2015), lacking convergence rates or operating under restrictive assumptions on the relation between p and n (Paul et al., 2021a). For example, the recent large sample analysis (Paul et al., 2021a) of the hard Bregman k-means algorithm (Banerjee et al., 2005) obtains an asymptotic error rate of O(\sqrt{\log n/n}) under restrictive assumptions on the relation between n and p. The approach by Teboulle (2007) focuses on a unified framework from an optimization perspective, while Balcan et al. (2008); Telgarsky and Dasgupta (2012) focus on different divergence-based methods to better understand the underlying feature space.

The presence of outliers only further complicates matters. Outlying data are common in real applications, but many of the aforementioned approaches are fragile to deviations from the assumed data generating mechanism. On the other hand, recent work on robust clustering methods (Deshpande et al., 2020; Fischer et al., 2020; Klochkov et al., 2020) does not integrate the practical advances surveyed above, and similarly lacks finite-sample guarantees. To bridge this gap, the Median of Means (MoM) literature provides a promising and attractive framework to robustify center-based clustering methods against outliers. MoM estimators are not only insensitive to outliers, but are also equipped with exponential concentration results under the mild condition of finite variance (Lugosi and Mendelson, 2019; Lecué and Lerasle, 2020; Bartlett et al., 2002; Lerasle, 2019; Laforgue et al., 2019). Recently, near-optimal results for mean estimation (Minsker, 2018), classification (Lecué et al., 2020), regression (Mathieu and Minsker, 2021; Lugosi and Mendelson, 2019), clustering (Klochkov et al., 2020; Brunet-Saumard et al., 2022), bandits (Bubeck et al., 2013) and optimal transport (Staerman et al., 2021) have been established from this perspective.

Under the MoM lens, we propose a unified framework for robust center-based clustering. Our treatment considers a family of Bregman loss functions not restricted to the classical squared Euclidean loss. We explore exact sample error bounds for a general class of algorithms under mild regularity assumptions that apply to popular existing approaches, which we show to be special cases. The proposed framework allows for the data to be divided into two categories: the set of inliers (\mathcal{I}) and the set of outliers (\mathcal{O}). The inliers are assumed to be independent and identically distributed (i.i.d.), while absolutely no assumption is required on \mathcal{O}, allowing outliers to be unboundedly large, dependent on each other, sampled from a heavy-tailed distribution, and so on. Our contributions within the MoM framework make use of Rademacher complexities (Bartlett and Mendelson, 2002; Bartlett et al., 2005) and symmetrization arguments (Vapnik, 2013), powerful tools that often find use in the empirical process literature but, in our view, are underexplored in the context of robust clustering.

The paper is organized as follows. In Section 2, we identify a general centroid-based clustering framework which encompasses k-means (MacQueen, 1967), k-harmonic means (Zhang et al., 1999), and power k-means (Xu and Lange, 2019) as special cases, to name a few. We show how this framework is made robust via Median of Means estimation, yielding an array of center-based robust clustering methods. Within this framework, we derive uniform deviation bounds and concentration inequalities under standard regularity conditions in Section 3, by bounding the Rademacher complexity by metric entropy via Dudley's chaining argument. The analysis newly reveals the convergence rate for popularly used clustering methods such as k-harmonic means and power k-means, matching the known rate results for k-means, and elegantly carries over to their MoM counterparts. We then implement and empirically assess the resulting algorithms through simulated and real data experiments. In particular, we find that power k-means (Xu and Lange, 2019) under the MoM paradigm outperforms the state-of-the-art in the presence of outliers.

1.1 Related theoretical analyses of clustering

In seminal work, Pollard (1981) proved the strong consistency of k-means under a finite second moment assumption, spurring the large sample analysis of unsupervised clustering methods. This result has been extended to separable Hilbert spaces (Biau et al., 2008; Levrard, 2015) and to related algorithms (Chakraborty and Das, 2020; Terada, 2014, 2015), but these do not provide guarantees on the number of samples required so that the excess risk falls below a given threshold. Towards finding probabilistic error bounds, subsequent research derived uniform concentration results for k-means and its variants (Telgarsky and Dasgupta, 2012), sub-Gaussian distortion bounds for the k-medians problem (Brownlees et al., 2015), and an O(\sqrt{\log n/n}) bound on k-means with Bregman divergences (Paul et al., 2021b). More recently, concentration inequalities for k-means under the MoM paradigm have been established (Klochkov et al., 2020; Brunet-Saumard et al., 2022) under the restriction that the sample cluster centroids (\widehat{\boldsymbol{\Theta}}_{n} in Section 3) are assumed to be bounded. This paper shows how a number of center-based clustering methods can be brought under the same umbrella and can be robustified using a general-purpose scheme. The theoretical analyses of this broad spectrum of methods are conducted via Dudley's chaining arguments and through the aid of Rademacher complexity-based uniform concentration bounds. This approach enables us to replace assumptions on the sample cluster centroids by a bounded support assumption on the (inlying) data points, which yields an intuitive way of ensuring the boundedness of the cluster centroids through the obtuse angle property of Bregman divergences (Lemma 3.1 below). In contrast to the prior results, our analyses are not asymptotic in nature, so the derived bounds hold for all values of the model parameters.

2 Problem Setting and Proposed Method

We consider the problem of partitioning a set of n data points \mathcal{X}=\{\boldsymbol{X}_{i}:i=1,\dots,n\}\subset\mathbb{R}^{p} into k mutually exclusive clusters. In a center-based clustering framework, we represent the j^{\text{th}} cluster by its centroid \boldsymbol{\theta}_{j}\in\mathbb{R}^{p} for each j\in\{1,\dots,k\}. To quantify the notion of "closeness", we allow the dissimilarity measure to be any Bregman divergence. Recall any differentiable, convex function \phi:\mathbb{R}^{p}\to\mathbb{R} generates the Bregman divergence d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0} (\mathbb{R}_{\geq 0} denoting the set of non-negative reals) defined as

d_{\phi}(\boldsymbol{x},\boldsymbol{y})=\phi(\boldsymbol{x})-\phi(\boldsymbol{y})-\langle\nabla\phi(\boldsymbol{y}),\boldsymbol{x}-\boldsymbol{y}\rangle.

For instance, \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} generates the squared Euclidean distance. Without loss of generality, one may assume \phi(\mathbf{0})=\nabla\phi(\mathbf{0})=0. In this paradigm, clustering is achieved by minimizing the objective

\frac{1}{n}\sum_{i=1}^{n}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{k})\right):=f_{\boldsymbol{\Theta}}(\boldsymbol{X}). (1)

Here \Psi_{\boldsymbol{\alpha}}:\mathbb{R}^{k}_{\geq 0}\to\mathbb{R}_{\geq 0} is a component-wise non-decreasing function (such as a generalized mean) of the dissimilarities \{d_{\phi}(\boldsymbol{X},\boldsymbol{\theta}_{j})\}_{j=1}^{k} which satisfies \Psi_{\boldsymbol{\alpha}}(\mathbf{0})=0. The hyperparameter \boldsymbol{\alpha}\in\mathcal{A}\subseteq\mathbb{R}^{q} is specified by the user, and we will additionally assume that \Psi_{\boldsymbol{\alpha}} is Lipschitz. For intuition, we begin by showing how this setup includes several popular clustering methods.

Examples:

Suppose \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} and \Psi(\boldsymbol{x})=(\sum_{j=1}^{k}x_{j}^{-1})^{-1}. Then the objective (1) becomes \frac{1}{n}\sum_{i=1}^{n}(\sum_{j=1}^{k}\|\boldsymbol{X}_{i}-\boldsymbol{\theta}_{j}\|_{2}^{-2})^{-1}, which is the objective function of k-harmonic means clustering (Zhang et al., 1999). Now consider other generalized means: take \Psi_{s}(\boldsymbol{x})=M_{s}(\boldsymbol{x}) where we denote the power mean M_{s}(\boldsymbol{x})=(k^{-1}\sum_{i=1}^{k}x_{i}^{s})^{1/s} for s\in(-\infty,-1]. Then objective (1) coincides with the recent power k-means method (Xu and Lange, 2019), \frac{1}{n}\sum_{i=1}^{n}M_{s}\left(\|\boldsymbol{X}_{i}-\boldsymbol{\theta}_{1}\|_{2}^{2},\dots,\|\boldsymbol{X}_{i}-\boldsymbol{\theta}_{k}\|_{2}^{2}\right). When \Psi(\boldsymbol{x})=\min_{1\leq j\leq k}x_{j}, (1) recovers Bregman hard clustering \frac{1}{n}\sum_{i=1}^{n}\min_{1\leq j\leq k}d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{j}) proposed in (Banerjee et al., 2005) for any valid \phi, while the special case of \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} yields the familiar Euclidean k-means problem (MacQueen, 1967).
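For concreteness, the following minimal NumPy sketch evaluates objective (1) under the squared Euclidean divergence for the three choices of \Psi above (function names and vectorization choices are ours, not the released implementation):

import numpy as np

def sq_euclidean(X, Theta):
    # Pairwise dissimilarities d_phi(X_i, theta_j) with phi = ||.||_2^2; shape (n, k).
    return ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=2)

def psi_min(D):              # classical k-means: min_j d_j
    return D.min(axis=1)

def psi_harmonic(D):         # k-harmonic means: (sum_j d_j^{-1})^{-1}; assumes D > 0
    return 1.0 / (1.0 / D).sum(axis=1)

def psi_power(D, s=-3.0):    # power k-means: M_s(d_1, ..., d_k) with s <= -1; assumes D > 0
    return ((D ** s).mean(axis=1)) ** (1.0 / s)

def objective(X, Theta, psi):
    # Empirical risk (1): average of Psi over the n data points.
    return psi(sq_euclidean(X, Theta)).mean()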

In what follows, we derive concentration bounds that establish new theoretical guarantees such as consistency and convergence rates for clustering algorithms in this framework. These analyses lead us to a unified, robust framework by embedding this class of methods within the Median of Means (MoM) estimation paradigm. Via elegant connections between the properties of MoM estimators and Vapnik-Chervonenkis (VC) theory, our MoM estimators too will inherit uniform concentration inequalities from the preceding analysis, extending convergence guarantees to the robust setting.

Median of Means

Instead of directly minimizing the empirical risk (1), MoM begins by partitioning the data into L sets B_{1},\dots,B_{L} which each contain exactly b elements (discarding a few observations when n is not divisible by L). The partitions can be assigned uniformly at random, or can be shuffled throughout the algorithm (Lecué et al., 2020). MoM then entails a robust version of the estimator defined under (1) by instead minimizing the following objective with respect to \boldsymbol{\Theta}:

\text{MoM}_{L}^{n}(\boldsymbol{\Theta})=\text{Median}\left(\frac{1}{b}\sum_{i\in B_{1}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}),\dots,\frac{1}{b}\sum_{i\in B_{L}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\right). (2)

Intuitively, MoM estimators are more robust than their ERM counterparts since under mild conditions, only a subset of the partitions is contaminated by outliers while others are outlier-free. Taking the median over partitions negates the influence of these spurious partitions, thus reducing the effect of outliers. Formal analysis of robustness via breakdown points is also available for MoM estimators (Lecué and Lerasle, 2019; Rodriguez and Valdora, 2019); we will make use of the nice concentration properties of MoM estimators in Section 3.4.
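A minimal sketch of the MoM objective (2) along the same lines, assuming the partition into L blocks is drawn once up front (shuffling the blocks across iterations, as in Lecué et al. (2020), is a straightforward modification):

import numpy as np

def mom_objective(X, Theta, psi, L, rng=None):
    # Median of the per-block empirical risks in (2); returns the value and the
    # index of the median block. Uses the squared Euclidean divergence.
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    b = n // L                                  # block size; drops n mod L points
    blocks = rng.permutation(n)[: L * b].reshape(L, b)
    per_point = psi(((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=2))
    block_means = per_point[blocks].mean(axis=1)
    ell = int(np.argsort(block_means)[L // 2])  # (upper) median block
    return block_means[ell], ell

For example, mom_objective(X, Theta, psi_power, L=10) with psi_power from the previous sketch returns the quantity minimized in (2).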

Algorithm 1 MoM Clustering via Adagrad
Input: \boldsymbol{X}\in\mathbb{R}^{n\times p}, k, L, f_{\boldsymbol{\Theta}}(\cdot), \eta, \epsilon.
Output: The cluster centroids \boldsymbol{\Theta}.
Initialization: Randomly partition \{1,\dots,n\} into L partitions of equal length. Randomly choose k points without replacement from \mathcal{X} to initialize \boldsymbol{\Theta}_{0}.
repeat
     Step 1: Find \ell_{t}\in\{1,\dots,L\} such that \text{MoM}_{L}^{n}(\boldsymbol{\Theta}_{t})=\frac{1}{b}\sum_{i\in B_{\ell_{t}}}f_{\boldsymbol{\Theta}_{t}}(\boldsymbol{X}_{i}).
     Step 2: \boldsymbol{g}^{(t)}_{j}\leftarrow\frac{1}{b}\sum_{i\in B_{\ell_{t}}}\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\big|_{\boldsymbol{\Theta}=\boldsymbol{\Theta}_{t}} for j=1,\dots,k.
     Step 3: Update \boldsymbol{\Theta} by \boldsymbol{\theta}_{j}^{(t+1)}\leftarrow\boldsymbol{\theta}_{j}^{(t)}-\frac{\eta}{\sqrt{\epsilon+\sum_{t^{\prime}=1}^{t}\|\boldsymbol{g}_{j}^{(t^{\prime})}\|_{2}^{2}}}\boldsymbol{g}_{j}^{(t)}.
until objective (2) converges

Furthermore, optimizing (2) is made tractable via gradient-based methods. We advocate the Adagrad algorithm (Duchi et al., 2011; Goodfellow et al., 2016), whose updates are given by

\boldsymbol{\theta}_{j}^{(t+1)}\leftarrow\boldsymbol{\theta}_{j}^{(t)}-\frac{\eta}{\sqrt{\epsilon+\sum_{t^{\prime}=1}^{t}\|\boldsymbol{g}_{j}^{(t^{\prime})}\|_{2}^{2}}}\boldsymbol{g}_{j}^{(t)}

for hyperparameter \epsilon>0, learning rate \eta>0, and \boldsymbol{g}_{j}^{(t)} denoting a subgradient of \text{MoM}_{L}^{n}(\boldsymbol{\Theta}) at \boldsymbol{\Theta}_{t}. That is, if \ell_{t} denotes the median partition at step t, then

\nabla_{\boldsymbol{\theta}_{j}}\text{MoM}_{L}^{n}(\boldsymbol{\Theta}_{t})=\frac{1}{b}\sum_{i\in B_{\ell_{t}}}\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\big|_{\boldsymbol{\Theta}=\boldsymbol{\Theta}_{t}}.

For any \Psi_{\boldsymbol{\alpha}} differentiable,

\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{x})=\partial_{j}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}^{(t)}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}^{(t)})\right)\nabla_{\boldsymbol{\theta}_{j}}d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})\big|_{\boldsymbol{\theta}_{j}=\boldsymbol{\theta}_{j}^{(t)}}

by the chain rule. As a concrete illustration of the general formula, consider the power k-means objective f_{\boldsymbol{\Theta}}(\boldsymbol{x})=M_{s}(\|\boldsymbol{x}-\boldsymbol{\theta}_{1}\|_{2}^{2},\dots,\|\boldsymbol{x}-\boldsymbol{\theta}_{k}\|_{2}^{2}): upon partial differentiation, we obtain

\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{x})=\frac{2}{k}\left(\frac{1}{k}\sum_{j^{\prime}=1}^{k}\|\boldsymbol{x}-\boldsymbol{\theta}_{j^{\prime}}\|_{2}^{2s}\right)^{1/s-1}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2}^{2(s-1)}(\boldsymbol{\theta}_{j}-\boldsymbol{x}).
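A direct transcription of this gradient (a sketch under our naming; the point is assumed distinct from every centroid so that all distances are strictly positive):

import numpy as np

def power_kmeans_grad(x, Theta, s=-3.0):
    # Gradient of f_Theta(x) = M_s(||x - theta_1||^2, ..., ||x - theta_k||^2)
    # with respect to each theta_j; returns an array of shape (k, p).
    diffs = Theta - x                             # rows theta_j - x
    sq = (diffs ** 2).sum(axis=1)                 # ||x - theta_j||_2^2
    k = Theta.shape[0]
    outer = np.mean(sq ** s) ** (1.0 / s - 1.0)   # ((1/k) sum_j' ||x - theta_j'||^{2s})^{1/s - 1}
    weights = (2.0 / k) * outer * sq ** (s - 1.0)
    return weights[:, None] * diffs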

As another example, the classic k-means objective f_{\boldsymbol{\Theta}}(\boldsymbol{x})=\min_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\|\boldsymbol{x}-\boldsymbol{\theta}\|^{2}_{2} requires the sub-gradient

\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{x})=2(\boldsymbol{\theta}_{j}-\boldsymbol{x})\mathbbm{1}\{j\in\mathcal{J}(\boldsymbol{\Theta},\boldsymbol{x})\},

where \mathcal{J}(\boldsymbol{\Theta},\boldsymbol{x})=\mathop{\rm argmin}_{1\leq j\leq k}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|^{2}_{2} and \mathbbm{1}\{\cdot\} denotes the indicator function.

A pseudocode summary appears in Algorithm 1. This method tends to find clusters efficiently: for instance, we incur O(npk) per-iteration complexity in both the k-means and power k-means instances, which matches that of their original algorithms without robustifying via MoM (MacQueen, 1967; Lloyd, 1982; Xu and Lange, 2019). Of course, here we update \boldsymbol{\theta}_{j} by an adaptive gradient step rather than by a closed-form expression.
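Putting the pieces together, a compact sketch of Algorithm 1 for the Euclidean k-means instance follows (the k-means subgradient is written inline so the example is self-contained; the fixed partition, fixed iteration budget, and default hyperparameters are simplifications of ours):

import numpy as np

def mom_kmeans_adagrad(X, k, L, n_iter=200, eta=0.5, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = n // L
    blocks = rng.permutation(n)[: L * b].reshape(L, b)              # partition B_1, ..., B_L
    Theta = X[rng.choice(n, size=k, replace=False)].astype(float)   # initialize from the data
    acc = np.zeros(k)                                               # running sum of ||g_j^{(t)}||_2^2

    for _ in range(n_iter):
        # Step 1: per-point loss, per-block means, and the median block ell_t
        d = ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=2)
        block_means = d.min(axis=1)[blocks].mean(axis=1)
        ell = int(np.argsort(block_means)[L // 2])

        # Step 2: k-means subgradient on block ell_t: (1/b) sum_i 2(theta_j - X_i) 1{i assigned to j}
        idx = blocks[ell]
        assign = d[idx].argmin(axis=1)
        g = np.zeros((k, p))
        for j in range(k):
            pts = X[idx[assign == j]]
            if len(pts):
                g[j] = 2.0 * (Theta[j] - pts).sum(axis=0) / b

        # Step 3: Adagrad step with the per-centroid scalar denominator of Algorithm 1
        acc += (g ** 2).sum(axis=1)
        Theta -= (eta / np.sqrt(eps + acc))[:, None] * g
    return Theta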

As emphasized by an anonymous reviewer, one should note that the algorithms under the proposed framework may converge to sub-optimal local solutions under the proposed optimization scheme. As is the case for k-means and its variants, this possibility arises due to the non-convexity of the objective functions (1) and (2). A complete theoretical understanding from an optimization perspective for such methods is not yet fully developed, and global results are in general notoriously difficult to obtain. Having said that, it is worth noting that recent empirical analyses show that techniques such as annealing (Xu and Lange, 2019; Chakraborty et al., 2020) may be effective in circumventing this difficulty. Indeed, our experimental analysis (see Section 4) suggests that a robust version of power k-means using this annealing technique best overcomes this difficulty even in the presence of outliers.

3 Theoretical Analysis

Here we analyze properties of the proposed framework (1), with complete details of all proofs in the Appendix. Denoting by \mathcal{M} the set of all probability measures P with support on [-M,M]^{p}, i.e. P([-M,M]^{p})=1 for all P\in\mathcal{M}, we first make the standard assumption that the data are i.i.d. with bounded components (Ben-David, 2007; Chakraborty et al., 2020; Paul et al., 2021a).

A 1.

\boldsymbol{X}_{1},\dots,\boldsymbol{X}_{n}\overset{i.i.d.}{\sim}P such that P\in\mathcal{M}.

Let P_{n} be the empirical distribution based on the data \boldsymbol{X}_{1},\dots,\boldsymbol{X}_{n}. That is, for any Borel set A, P_{n}(A)=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}\{\boldsymbol{X}_{i}\in A\}. For notational simplicity, we write \mu f:=\int fd\mu. Appealing to the Strong Law of Large Numbers (SLLN) (Athreya and Lahiri, 2006), P_{n}f_{\boldsymbol{\Theta}}\xrightarrow{a.s.}Pf_{\boldsymbol{\Theta}}. Let \widehat{\boldsymbol{\Theta}}_{n} be a (global) minimizer of (1) and \boldsymbol{\Theta}^{\ast} be the global minimizer of Pf_{\boldsymbol{\Theta}}. Since the functions P_{n}f_{\boldsymbol{\Theta}} and Pf_{\boldsymbol{\Theta}} are close to each other as n becomes large, we can expect that their respective minimizers, \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast}, are also close to one another. To show that \widehat{\boldsymbol{\Theta}}_{n} converges to \boldsymbol{\Theta}^{\ast} as n\to\infty, we consider bounding the uniform deviation \sup_{\boldsymbol{\Theta}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}|. Towards establishing such bounds, we will posit two regularity assumptions on \Psi_{\boldsymbol{\alpha}}(\cdot) and \phi(\cdot), beginning with a \tau_{\boldsymbol{\alpha},k}-Lipschitz condition on \Psi_{\boldsymbol{\alpha}}.

A 2.

For all \boldsymbol{\alpha}\in\mathcal{A} and any \boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{k}_{\geq 0}, we have |\Psi_{\boldsymbol{\alpha}}(\boldsymbol{x})-\Psi_{\boldsymbol{\alpha}}(\boldsymbol{y})|\leq\tau_{\boldsymbol{\alpha},k}\|\boldsymbol{x}-\boldsymbol{y}\|_{1}.

We also assume a weaker form of a standard condition that the gradient \nabla\phi(\cdot) is H_{p}-Lipschitz (Telgarsky and Dasgupta, 2013); unlike their work, note we do not additionally require strong convexity of \phi.

A 3.

There exists H_{p}\geq 0 such that \|\nabla\phi(\boldsymbol{x})-\nabla\phi(\boldsymbol{y})\|_{2}\leq H_{p}\|\boldsymbol{x}-\boldsymbol{y}\|_{2} for all \boldsymbol{x},\boldsymbol{y}\in[-M,M]^{p}.

These conditions are mild, and can be seen to hold for all of the aforementioned popular special cases. For instance, taking \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} yields the squared Euclidean distance, and we see that \nabla\phi(\boldsymbol{u})=2\boldsymbol{u} so that A3 is satisfied with constant H_{p}=2. Now, let \Psi(\boldsymbol{x})=\min_{1\leq j\leq k}x_{j} as in classical k-means, and denote j^{\ast}\in\mathop{\rm argmin}_{1\leq j\leq k}y_{j}. For any vectors \boldsymbol{x},\boldsymbol{y}\in\mathbb{R}_{\geq 0}^{k},

\Psi(\boldsymbol{x})-\Psi(\boldsymbol{y})=\min_{1\leq j\leq k}x_{j}-\min_{1\leq j\leq k}y_{j}=\min_{1\leq j\leq k}x_{j}-y_{j^{\ast}}\leq x_{j^{\ast}}-y_{j^{\ast}}\leq\|\boldsymbol{x}-\boldsymbol{y}\|_{1}.

Thus, \Psi is clearly non-negative and componentwise non-decreasing, and satisfies A2. Similarly, if we take \Psi_{s}(\boldsymbol{x})=M_{s}(\boldsymbol{x}), the conditions are met for power k-means: it is again non-negative and non-decreasing in its components, and satisfies A2 with \tau_{\boldsymbol{\alpha},k}=k^{-1/s} due to Beliakov et al. (2010).

3.1 Bounds on \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast}

Toward proving that \widehat{\boldsymbol{\Theta}}_{n} converges to \boldsymbol{\Theta}^{\ast}, we will need to show that both \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast} lie in [-M,M]^{k\times p}. To this end, we first establish the obtuse angle property for Bregman divergences.

Lemma 3.1.

Let \mathcal{C} be a convex set and let P_{\mathcal{C}}(\boldsymbol{\theta}) be the projection of \boldsymbol{\theta} onto \mathcal{C} with respect to the Bregman divergence d_{\phi}(\cdot,\cdot), i.e. P_{\mathcal{C}}(\boldsymbol{\theta})=\arg\min_{\boldsymbol{x}\in\mathcal{C}}d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}) (assuming it exists). Then,

d_{\phi}(\boldsymbol{x},\boldsymbol{\theta})\geq d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}))+d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}),\boldsymbol{\theta}),\quad\text{for all }\boldsymbol{x}\in\mathcal{C}.
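For intuition, the inequality can be seen by combining a three-point identity for Bregman divergences with the first-order optimality condition of the projection (a sketch; the full proof is in the Appendix):

d_{\phi}(\boldsymbol{x},\boldsymbol{\theta})-d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}))-d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}),\boldsymbol{\theta})=\langle\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta}),\,\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0,

where the final inequality holds because P_{\mathcal{C}}(\boldsymbol{\theta}) minimizes d_{\phi}(\cdot,\boldsymbol{\theta}) over the convex set \mathcal{C} and \boldsymbol{x}\in\mathcal{C} is feasible.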

We next show that to minimize P_{n}f_{\boldsymbol{\Theta}} or Pf_{\boldsymbol{\Theta}}, it is enough to restrict the search to [-M,M]^{k\times p}.

Lemma 3.2.

Let A2 hold, and Q\in\mathcal{M}. Let d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0} be a Bregman divergence. Then for any \boldsymbol{\Theta}\in\mathbb{R}^{k\times p}, there exists \boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p} such that Qf_{\boldsymbol{\Theta}^{\prime}}\leq Qf_{\boldsymbol{\Theta}}.

Since we can restrict our attention to [-M,M]^{k\times p} to minimize Qf_{\boldsymbol{\Theta}}, we have the following:

Corollary 3.1.

Let Q\in\mathcal{M} and d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0} be a Bregman divergence. If \boldsymbol{\Theta}_{0}=\arg\min_{\boldsymbol{\Theta}\in\mathbb{R}^{k\times p}}\int f_{\boldsymbol{\Theta}}dQ, then \boldsymbol{\Theta}_{0}\in[-M,M]^{k\times p}.

Now note under A1 both P and P_{n} have support contained in [-M,M]^{p}. The following corollary is thus implied by replacing Q by P and P_{n}.

Corollary 3.2.

Under A1–A3, both \widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast}\in[-M,M]^{k\times p}.

Now that we have bounded \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast} in a compact set, the following section supplies probabilistic bounds on the uniform deviation 2\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}| via metric entropy arguments.

3.2 Concentration Inequality and Metric Entropy Bounds via Rademacher Complexity

We have proven that \widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast}\in[-M,M]^{k\times p}. To bound the difference |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|, we observe

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}| = Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}} = Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-P_{n}f_{\widehat{\boldsymbol{\Theta}}_{n}}+P_{n}f_{\widehat{\boldsymbol{\Theta}}_{n}}-P_{n}f_{\boldsymbol{\Theta}^{\ast}}+P_{n}f_{\boldsymbol{\Theta}^{\ast}}-Pf_{\boldsymbol{\Theta}^{\ast}}
\leq Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-P_{n}f_{\widehat{\boldsymbol{\Theta}}_{n}}+P_{n}f_{\boldsymbol{\Theta}^{\ast}}-Pf_{\boldsymbol{\Theta}^{\ast}} \leq 2\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}|. (3)

Thus, to bound |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|, it is enough to prove bounds on \sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}|. The main idea here is to bound this uniform deviation via Rademacher complexity, and in turn bound the Rademacher complexity itself (Dudley, 1967; Mohri et al., 2018). Let \mathcal{F}=\{f_{\boldsymbol{\Theta}}:\boldsymbol{\Theta}\in[-M,M]^{k\times p}\}, and denote the \mathcal{F}-norm (Athreya and Lahiri, 2006) between two probability measures \mu and \nu as \|\mu-\nu\|_{\mathcal{F}}=\sup_{f\in\mathcal{F}}|\int fd\mu-\int fd\nu|. We recall the definitions of Rademacher complexity and covering number as follows.

Definition 1.

Let \epsilon_{i}'s be i.i.d. Rademacher random variables independent of \mathcal{X}=\{\boldsymbol{X}_{1},\dots,\boldsymbol{X}_{n}\}, i.e. \mathbbm{P}(\epsilon_{i}=1)=\mathbbm{P}(\epsilon_{i}=-1)=0.5. The population Rademacher complexity of \mathcal{F} is defined as \mathcal{R}_{n}(\mathcal{F})=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i}), where the expectation is over both \boldsymbol{\epsilon} and \mathcal{X}.

Definition 2.

(\delta-cover and Covering Number) Let (X,d) be a metric space. The set X_{\delta}\subseteq X is said to be a \delta-cover of X if for all x\in X, there exists x^{\prime}\in X_{\delta} such that d(x,x^{\prime})\leq\delta. The \delta-covering number of X w.r.t. d, denoted by N(\delta;X,d), is the size of the smallest \delta-cover of X with respect to d.

The following Lemma gives a bound for the \delta-covering number of \mathcal{F} under the supremum norm. The main idea is to use the Lipschitz property of f_{\boldsymbol{\Theta}} and then to find a cover of the search space for \boldsymbol{\Theta}, i.e. [-M,M]^{k\times p}. This then automatically translates to a cover of \mathcal{F} under the sup-norm.

Lemma 3.3.

Let N(\delta;\mathcal{F},\|\cdot\|_{\infty}) be the \delta-covering number of \mathcal{F} under \|\cdot\|_{\infty}. Then, under A1–A3,

N(\delta;\mathcal{F},\|\cdot\|_{\infty})\leq\left(\max\left\{\left\lfloor\frac{8M^{2}\tau_{\boldsymbol{\alpha},k}H_{p}kp}{\delta}\right\rfloor,1\right\}\right)^{kp}.

To bound the Rademacher complexity, we will also need to bound the diameter of \mathcal{F} under \|\cdot\|_{\infty}:

Lemma 3.4.

Let \text{diam}(\mathcal{F})=\sup_{f,g\in\mathcal{F}}\|f-g\|_{\infty}. Then, under A1–A3, \text{diam}(\mathcal{F})\leq 8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp.

We are now ready to state the bound on the Rademacher complexity \mathcal{R}_{n}(\mathcal{F}) in Theorem 3.1. We provide a sketch of the argument below, with the complete proof details available in the Appendix.

Theorem 3.1.

Under A1–A3, \mathcal{R}_{n}(\mathcal{F})\leq 48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}.

Proof Sketch

The main idea of the proof is to use Dudley's chaining argument (Dudley, 1967; Wainwright, 2019) to bound the Rademacher complexity in terms of an integral involving the metric entropy \log N(\delta;\mathcal{F},\|\cdot\|_{\infty}). Let \Delta=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp, the diameter bound of Lemma 3.4. Using the chaining approach, one can show that

\mathcal{R}_{n}(\mathcal{F})\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}\,d\epsilon\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\max\left\{\frac{\Delta}{\epsilon},1\right\}\right)}\,d\epsilon (4)
=\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\frac{\Delta}{\epsilon}\right)}\,d\epsilon
=48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}.

Here the second inequality (4) follows from Lemma 3.3.
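To make the final equality explicit, the entropy integral admits a closed form under the substitution \epsilon=\Delta e^{-u^{2}} (a quick check in the notation above):

\int_{0}^{\Delta}\sqrt{\log(\Delta/\epsilon)}\,d\epsilon=2\Delta\int_{0}^{\infty}u^{2}e^{-u^{2}}\,du=\frac{\sqrt{\pi}}{2}\Delta,\qquad\text{so that}\qquad\frac{12\sqrt{kp}}{\sqrt{n}}\cdot\frac{\sqrt{\pi}}{2}\Delta=48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}

upon substituting \Delta=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp.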

Before we proceed, we state the following Lemma, which gives a uniform bound on \|f\|_{\infty} for f\in\mathcal{F}.

Lemma 3.5.

For all \boldsymbol{x}\in[-M,M]^{p} and \boldsymbol{\Theta}\in[-M,M]^{k\times p},

0\leq\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk.

Having established a bound on the Rademacher complexity \mathcal{R}_{n}(\mathcal{F}), we are now ready to establish a uniform concentration inequality on \|P_{n}-P\|_{\mathcal{F}}\triangleq\sup_{f\in\mathcal{F}}|P_{n}f-Pf| in the following Theorem, proven in the Appendix.

Theorem 3.2.

Under A1–A3, with probability at least 1-\delta, the following holds.

\|P_{n}-P\|_{\mathcal{F}}\leq 96\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}.

The result in Theorem 3.2 reveals a non-asymptotic bound on |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|:

Corollary 3.3.

Under A1–A3, with probability at least 1-\delta, the following holds.

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 192\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}.
Proof.

From equation (3), we know that |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 2\|P_{n}-P\|_{\mathcal{F}}. The corollary follows immediately by application of Theorem 3.2. ∎

3.3 Inference for Fixed p: Strong and \sqrt{n}-Consistency

We now discuss the classical domain where p is kept fixed (Pollard, 1981; Chakraborty et al., 2020; Terada, 2014) and show that the results above imply strong and \sqrt{n}-consistency, mirroring some known results for existing methods such as k-means. We first solidify the notion of convergence of the centroids \widehat{\boldsymbol{\Theta}}_{n} to \boldsymbol{\Theta}^{\ast}, following Pollard (1981). Since centroids are unique only up to label permutations, our notion of dissimilarity

\text{diss}(\boldsymbol{\Theta}_{1},\boldsymbol{\Theta}_{2})=\min_{M\in\mathcal{P}_{k}}\|\boldsymbol{\Theta}_{1}-M\boldsymbol{\Theta}_{2}\|_{F}

is considered over \mathcal{P}_{k}, the set of all k\times k permutation matrices, where \|\cdot\|_{F} denotes the Frobenius norm. Now, we say that the sequence \boldsymbol{\Theta}_{n} converges to \boldsymbol{\Theta} if \lim_{n\to\infty}\text{diss}(\boldsymbol{\Theta}_{n},\boldsymbol{\Theta})=0. We begin by imposing the following standard identifiability condition (Pollard, 1981; Terada, 2014; Chakraborty et al., 2020) on P for our analysis.
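Since k is typically small, diss can be computed by brute force over the k! label permutations (a sketch; the Hungarian algorithm would be preferable for larger k):

import numpy as np
from itertools import permutations

def diss(Theta1, Theta2):
    # Minimum Frobenius distance between centroid matrices over row permutations of Theta2.
    k = Theta1.shape[0]
    return min(
        np.linalg.norm(Theta1 - Theta2[list(perm)])
        for perm in permutations(range(k))
    )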

A 4.

For all \eta>0, there exists \epsilon>0 such that Pf_{\boldsymbol{\Theta}}>Pf_{\boldsymbol{\Theta}^{\ast}}+\epsilon whenever \text{diss}(\boldsymbol{\Theta},\boldsymbol{\Theta}^{\ast})>\eta.

We now investigate the strong consistency properties of \widehat{\boldsymbol{\Theta}}_{n}, and also the rate at which |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}| converges to 0. Theorem 3.3 states that indeed strong consistency holds, with convergence rate O(n^{-1/2}). Note that this rate is faster than that found previously by Paul et al. (2021a). Before we proceed, recall we say that X_{n}=O_{P}(a_{n}) if the sequence X_{n}/a_{n} is tight (Athreya and Lahiri, 2006).

Theorem 3.3.

(Strong consistency and \sqrt{n}-consistency) If p is kept fixed, then under A1–A4, \widehat{\boldsymbol{\Theta}}_{n}\xrightarrow{a.s.}\boldsymbol{\Theta}^{\ast} under P. Moreover, |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}(n^{-1/2}).

Note: Here A_{n}=O_{P}(a_{n}) means that \{A_{n}/a_{n}\}_{n\in\mathbb{N}} is stochastically bounded, i.e. tight.

3.4 Inference Under the MoM Framework

Theorem 3.3 and the bounds we present above already establish novel statistical results that pertain to methods under (1), such as power k-means, in the "uncontaminated" setting. In this section, we now extend these findings to the MoM setup under (2) in order to understand their behavior in the presence of outliers. Recall that in this setting, the data are partitioned into L equally sized blocks; without loss of generality, we take n=L\cdot b. We denote the set of all inliers by \{\boldsymbol{X}_{i}\}_{i\in\mathcal{I}} and the outliers by \{\boldsymbol{X}_{i}\}_{i\in\mathcal{O}}. Now let \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})} denote the minimizer of (2): towards establishing the error rate at which |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}| goes to 0, we assume the following:

A 5.

\{\boldsymbol{X}_{i}\}_{i\in\mathcal{I}}\overset{\text{i.i.d.}}{\sim}P with P\in\mathcal{M}.

A 6.

There exists \eta>0 such that L>(2+\eta)|\mathcal{O}|.

We remark A5 is identical to A1, but imposed only on the inlying observations. A6 ensures at least half of the L partitions are free of outliers (since each outlier can contaminate at most one partition); note this is much weaker than requiring L>4|\mathcal{O}| as is done in recent work (Lecué et al., 2020). Importantly, we emphasize that no distributional assumptions regarding the outlying observations are made, allowing them to be unbounded, generated from heavy-tailed distributions, or dependent among each other. Proofs of the following results appear in the Appendix.

As a point of departure, we first establish that \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}\in[-M,M]^{k\times p}.

Lemma 3.6.

Let A2–A3 and A5–A6 hold. Then for any \boldsymbol{\Theta}\in\mathbb{R}^{k\times p}, there exists \boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p} such that \text{MoM}_{L}^{n}(\boldsymbol{\Theta}^{\prime})\leq\text{MoM}_{L}^{n}(\boldsymbol{\Theta}).

Again, we may restrict the search space for finding \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})} to [-M,M]^{k\times p} due to Lemma 3.6.

Corollary 3.4.

Under A2–A3 and A5–A6, \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}\in[-M,M]^{k\times p}.

Similarly to Section 3.2, we derive a uniform bound on \sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}| and then bound |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}| in turn. For brevity we define \delta:=2/(4+\eta)-|\mathcal{O}|/L, and use the notation "\lesssim" to suppress absolute constants. The uniform deviation bound is as follows, with the complete proof appearing in the Appendix.

Theorem 3.4.

Under A2–A3 and A5–A6, with probability at least 1-2e^{-2L\delta^{2}}, the following holds.

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|\lesssim\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

We give a brief outline of our proof for Theorem 3.4 below, with details appearing in the Appendix.

Proof sketch

We begin by noting that if

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}>\frac{L}{2},

then

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}}))>\epsilon,

where P_{B_{\ell}} denotes the empirical distribution of \{\boldsymbol{X}_{i}\}_{i\in B_{\ell}}. It therefore suffices to bound the quantity

\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}>\frac{L}{2}\right).

Introducing the function \varphi(t)=(t-1)\mathbbm{1}\{1\leq t\leq 2\}+\mathbbm{1}\{t>2\}, we begin by bounding outlier-free partitions, making use of the inequalities \mathbbm{1}\{t\geq 2\}\leq\varphi(t)\leq\mathbbm{1}\{t\geq 1\}. We then proceed by bounding

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}

by the sum \xi_{1}+\xi_{2}+|\mathcal{O}|, where

\xi_{1}=\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right),\quad\text{and}
\xi_{2}=\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\left[\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right].

We bound \xi_{1} by appealing to Hoeffding's inequality, while we show that \xi_{2} can be bounded by applying the bounded difference inequality together with the result of Theorem 3.1 to bound the resulting Rademacher complexity.

The corollary below follows from this uniform bound, giving non-asymptotic control over the difference |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}| in terms of the model parameters.

Corollary 3.5.

Under A2–A3 and A5–A6, with probability at least 1-2e^{-2L\delta^{2}}, the following holds.

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\lesssim\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

3.5 Inference for Fixed k and p Under the MoM Framework

We now focus our attention back to the classical setting where the numbers of clusters and features remain fixed. To show that \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})} is consistent for \boldsymbol{\Theta}^{\ast}, we need to impose conditions such that the RHS of the bound presented in Corollary 3.5 decreases to 0 as n\to\infty. We state the required conditions as follows.

A 7.

The number of partitions L=o(n), and L\to\infty as n\to\infty.

These conditions are natural: as n grows, so too must L in order to maintain a proportion of outlier-free partitions. On the other hand, L must grow slowly relative to n to ensure each partition can be assigned sufficiently many datapoints. We note that A7 implies |\mathcal{O}|=o(n), an intuitive and standard condition (Lecué et al., 2020; Staerman et al., 2021; Paul et al., 2021b) as outliers should be few by definition.

In the following corollary, we focus on the squared Euclidean distance for center-based clustering under the MoM framework. We show that the (global) estimates obtained from MoM k-means, MoM power k-means, etc. are consistent. We stress that the obtained convergence rate, as a function of n, L and |\mathcal{I}|, does not depend on the choice of \Psi_{\boldsymbol{\alpha}}(\cdot), as long as it satisfies A2, i.e. is Lipschitz. In particular, replacing \Psi_{\boldsymbol{\alpha}}(\cdot) with \min_{1\leq j\leq k}x_{j}, M_{s}(\boldsymbol{x}) and (\sum_{j=1}^{k}x_{j}^{-1})^{-1}, the rate in Corollary 3.6 applies to robustified MoM versions of k-means, power k-means and k-harmonic means alike.

Corollary 3.6.

Suppose \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} and \Psi_{\boldsymbol{\alpha}}(\cdot) satisfy A2. Then under A2–A3 and A5–A7,

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}\left(\max\left\{L^{1/2}n^{-1/2},n^{-1}\sqrt{|\mathcal{I}|}\right\}\right)

and Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}\xrightarrow{P}Pf_{\boldsymbol{\Theta}^{\ast}}. Moreover, whenever A4 additionally holds, we have \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}\xrightarrow{P}\boldsymbol{\Theta}^{\ast}.

Remark

These results imply that any MoM center-based algorithm under our paradigm admits a convergence rate of O\left(\max\left\{L^{1/2}n^{-1/2},n^{-1}\sqrt{|\mathcal{I}|}\right\}\right) when equipped with the squared Euclidean distance. Note that as L\geq 1, \max\left\{L^{1/2}n^{-1/2},n^{-1}\sqrt{|\mathcal{I}|}\right\}=\Omega\left(n^{-1/2}\right). Thus, the convergence rates for MoM variants in our framework are generally slower than their ERM counterparts, for which the rate is O(n^{-1/2}). This is unsurprising as MoM operates on outlier-contaminated data; there is "no free lunch" in trading off robustness for rate of convergence. However, if the number of partitions L grows slowly relative to n (say, L=O(\log n) so that |\mathcal{O}|=O(\log n)), then the convergence rates for MoM estimation become comparable to the ERM counterparts at \widetilde{O}(n^{-1/2}).
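For instance, plugging L=O(\log n) into the rate of Corollary 3.6 and using |\mathcal{I}|\leq n gives

\max\left\{\sqrt{L/n},\,\sqrt{|\mathcal{I}|}/n\right\}=\max\left\{O\big(\sqrt{\log n/n}\big),\,O\big(n^{-1/2}\big)\right\}=\widetilde{O}(n^{-1/2}).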

4 Empirical Studies and Performance

To validate and assess our proposed framework, we now turn to an empirical comparison of the proposed and peer clustering methods. We evaluate clustering quality under the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985), with values ranging between 0 and 1, where 1 denotes a perfect match with the ground truth. Though it is not feasible to exhaustively survey center-based clustering methods, we consider a broad range of competitors, comparing to k-means (MacQueen, 1967), Partition Around Medoids (PAM) (Kaufman and Rousseeuw, 2009), k-medians (Jain, 2010), Robust k-means++ (RKMpp) (Deshpande et al., 2020), Robust Continuous Clustering (RCC) (Shah and Koltun, 2017), Bregman Trimmed k-means (BTKM) (Fischer et al., 2020), MoM k-means (MOMKM) (Klochkov et al., 2020) and a novel MoM variant of power k-means (MOMPKM) implied under the proposed framework. An open-source implementation of the proposed method and code for reproducing the simulation experiments are available at https://github.com/SaptarshiC98/MOMPKM.

We consider two thorough simulated experiments below, with additional simulations and large-scale real data results in the Appendix. While the extended comparisons are omitted for space considerations, they convey the same trends as the studies below. In all settings, we generate data in p=5 dimensions, varying the number of clusters and the outlier percentages. True centers are spaced uniformly on a grid with \theta_{k}=\frac{k-1}{10}, and observations are drawn from Gaussians around their ground truth centers with variance 0.1. Because we generate Gaussian data, we focus here on the Euclidean case, and do not consider other Bregman divergences in the present empirical study. In all experiments, we take the number of partitions L to be roughly double the number of outliers, and set the hyperparameters \eta and \alpha to be 1.02 and 1 respectively by default. Results are averaged over 20 random restarts, and all competitors are initialized and tuned according to the standard implementations described in their original papers.
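A sketch of this generating process (our reading of the description above: each true center is placed at the constant vector ((j-1)/10,\dots,(j-1)/10), the noise variance 0.1 is interpreted per coordinate, and the outlier count is taken as a fraction of the inlying points):

import numpy as np

def make_data(k=20, n_per=30, outlier_frac=0.25, p=5, seed=0):
    rng = np.random.default_rng(seed)
    centers = np.array([[(j - 1) / 10.0] * p for j in range(1, k + 1)])
    X_in = np.vstack([c + rng.normal(scale=np.sqrt(0.1), size=(n_per, p)) for c in centers])
    labels = np.repeat(np.arange(k), n_per)
    n_out = int(outlier_frac * len(X_in))                 # e.g. 25% outliers
    lo, hi = X_in.min(axis=0), X_in.max(axis=0)
    X_out = rng.uniform(lo, hi, size=(n_out, p))          # uniform on the range of inliers
    return np.vstack([X_in, X_out]), labels, n_out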

Experiment 1: increasing the number of clusters

The first experiment assesses performance as the number of true clusters grows, while keeping the proportion of outliers fixed at 25\%. Datasets are generated with the number of clusters k varying between 3 and 100. For each setting, we create balanced clusters drawing 30 points from each true center. The 25\% outliers are generated from a uniform distribution with support on the range of the inlying observations. We repeat this data generation process 30 times under each parameter setting.

The average ARI values at convergence, plotted against the number of clusters along with error bars (\pm standard deviations), are shown in the left panel of Figure 1. We see that the robustified version of power k-means implied by our framework (labeled MOMPKM) achieves the best performance here. This may be unsurprising as the recent power k-means method was shown to significantly reduce sensitivity to local minima (which tend to increase with k), while MoM further protects the algorithm from outlier influence.

Figure 1: Average ARI values, along with error bars (\pm sd), comparing the peer algorithms in Experiments 1 (left) and 2 (right). MOMPKM remains relatively stable even when k or outlier percentages increase, maintaining the best performance among peer methods.

Experiment 2: increasing outlier percentage

Following the same data generation process in Experiment 1, we now fix k=20 while varying the outlier percentage from 0\% to 50\%. For each parameter setting, we again replicate the experiment 30 times. Average ARI values comparing the inlying observations to their ground truths are plotted in the right panel of Figure 1. Not only does this study convey similar trends as Experiment 1, but we see that competing methods continue to deteriorate with increasing outliers as one might expect, while MOMPKM remarkably remains relatively stable.

We see that the ERM-based methods such as BTKM, MOMKM, and PAM struggle when there is a large number of clusters. Similarly, methods such as k-medians and RKMpp often stop short at poor local optima, which is quickly exacerbated by outliers despite initialization via clever seeding techniques. These phenomena are consistent with what has been reported in the literature for their non-robust counterparts (Xu and Lange, 2019; Chakraborty et al., 2020). Overall, the empirical study suggests that a robust version of power k-means clustering under the MoM framework shows promise to handle several data challenges at once, in line with our theoretical analysis.

5 Discussion

In this paper, we proposed a paradigm that unifies a suite of center-based clustering algorithms. Under this view, we developed a simple yet efficient framework for robust versions of such algorithms by appealing to the Median of Means (MoM) philosophy. Using gradient-based methods, the MoM objectives can be solved with the same per-iteration complexity as Lloyd's k-means algorithm, largely retaining its simplicity. Importantly, we derive a thorough analysis of the statistical properties and convergence rates by establishing uniform concentration bounds under i.i.d. sampling of the data. These novel theoretical contributions demonstrate how arguments utilizing Rademacher complexities and Dudley's chaining arguments can be leveraged in the robust clustering context. As a result, we are able to obtain error rates that do not require asymptotic assumptions, nor restrictions on the relation between n and p. These findings recover asymptotic results such as strong consistency and \sqrt{n}-consistency under classical assumptions.

As shown in the paper, the robustness of MoM estimators comes at the cost of slower convergence rates compared to their ERM counterparts. We emphasize that there is no "median-of-means magic", and that the efficacy of MoM depends on the interplay between the partitions and the outliers. If the number of partitions circumvents the impact of the outliers, the performance of MoM clustering estimates under our framework scales with the block size b as 1/\sqrt{b}=\sqrt{L/n}. Since L can be chosen to be approximately 2|\mathcal{O}|, the obtained error rate is roughly O(\sqrt{|\mathcal{O}|/n}). If |\mathcal{O}| scales proportionately with n, however, the error bound of O(\sqrt{|\mathcal{O}|/n}) becomes meaningless. For our consistency results to hold, it is crucial that |\mathcal{O}|=o(n), which in turn allows us to choose L satisfying condition A7. If |\mathcal{O}|=O(n^{\beta}) for some 0<\beta<1, the error rate is O(n^{(\beta-1)/2}).

This suggests possible future research directions in improving these ERM rates via finding so-called "fast rates" under additional assumptions (Boucheron et al., 2005; Wainwright, 2019). Moreover, it will be fruitful to extend the results to noise distributions that satisfy only moment conditions. Recent works in convex clustering (Tan and Witten, 2015; Chakraborty and Xu, 2020) have considered sub-Gaussian models to obtain error rates. The work of Biau et al. (2008) and recent work of Klochkov et al. (2020) obtain similar error rates under finite second-moment conditions using the assumption that the cluster centroids \widehat{\boldsymbol{\Theta}}_{n} are bounded, and it may be possible to extend our approach using local Rademacher complexities (Bartlett et al., 2005) to relax the bounded support assumption. One can also seek lower bounds on the approximation error or explore high-dimensional robust center-based clustering under the proposed paradigm. Finally, we have not explored the implementation and empirical performance of Bregman versions of our MoM estimator, for instance with application to data arising from mixtures of exponential families other than the Gaussian case. These interesting directions remain open for future work.

References

  • Alcalá et al., (2010) Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2010). Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287.
  • Aloise et al., (2009) Aloise, D., Deshpande, A., Hansen, P., and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248.
  • Athreya and Lahiri, (2006) Athreya, K. B. and Lahiri, S. N. (2006). Measure theory and probability theory. Springer Science & Business Media.
  • Balcan et al., (2008) Balcan, M.-F., Blum, A., and Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 671–680.
  • Banerjee et al., (2005) Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).
  • Bartlett et al., (2002) Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48(1):85–113.
  • Bartlett et al., (2005) Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
  • Bartlett and Mendelson, (2002) Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
  • Beliakov et al., (2010) Beliakov, G., Calvo, T., and James, S. (2010). On Lipschitz properties of generated aggregation functions. Fuzzy Sets and Systems, 161(10):1437–1447.
  • Ben-David, (2007) Ben-David, S. (2007). A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2):243–257.
  • Biau et al., (2008) Biau, G., Devroye, L., and Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790.
  • Boucheron et al., (2005) Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375.
  • Brownlees et al., (2015) Brownlees, C., Joly, E., and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536.
  • Brunet-Saumard et al., (2022) Brunet-Saumard, C., Genetay, E., and Saumard, A. (2022). K-bMOM: A robust Lloyd-type clustering algorithm based on bootstrap median-of-means. Computational Statistics & Data Analysis, 167:107370.
  • Bubeck et al., (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
  • Chakraborty and Das, (2020) Chakraborty, S. and Das, S. (2020). Detecting meaningful clusters from high-dimensional data: A strongly consistent sparse center-based clustering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Chakraborty et al., (2020) Chakraborty, S., Paul, D., Das, S., and Xu, J. (2020). Entropy weighted power k-means clustering. In International Conference on Artificial Intelligence and Statistics, pages 691–701. PMLR.
  • Chakraborty and Xu, (2020) Chakraborty, S. and Xu, J. (2020). Biconvex clustering. arXiv preprint arXiv:2008.01760.
  • Deshpande et al., (2020) Deshpande, A., Kacham, P., and Pratap, R. (2020). Robust kk-means++. In Conference on Uncertainty in Artificial Intelligence, pages 799–808. PMLR.
  • Duchi et al., (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159.
  • Dudley, (1967) Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330.
  • Fischer et al., (2020) Fischer, A., Levrard, C., and Brécheteau, C. (2020). Robust Bregman clustering. Annals of Statistics.
  • Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT press Cambridge.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
  • Hubert and Arabie, (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
  • Jain, (2010) Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.
  • Kaufman and Rousseeuw, (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
  • Klochkov et al., (2020) Klochkov, Y., Kroshnin, A., and Zhivotovskiy, N. (2020). Robust kk-means clustering for distributions with two moments. arXiv preprint arXiv:2002.02339.
  • Laforgue et al., (2019) Laforgue, P., Clémençon, S., and Bertail, P. (2019). On medians of (randomized) pairwise means. In International Conference on Machine Learning, pages 1272–1281. PMLR.
  • Lecué and Lerasle, (2019) Lecué, G. and Lerasle, M. (2019). Learning from MOM’s principles: Le Cam’s approach. Stochastic Processes and Their Applications, 129(11):4385–4410.
  • Lecué and Lerasle, (2020) Lecué, G. and Lerasle, M. (2020). Robust machine learning by median-of-means: theory and practice. Annals of Statistics, 48(2):906–931.
  • Lecué et al., (2020) Lecué, G., Lerasle, M., and Mathieu, T. (2020). Robust classification via mom minimization. Machine Learning, 109(8):1635–1665.
  • Lerasle, (2019) Lerasle, M. (2019). Lecture notes: Selected topics on robust statistical learning theory. arXiv preprint arXiv:1908.10761.
  • Levrard, (2015) Levrard, C. (2015). Nonasymptotic bounds for vector quantization in Hilbert spaces. The Annals of Statistics, 43(2):592–619.
  • Lloyd, (1982) Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.
  • Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Regularization, sparse recovery, and median-of-means tournaments. Bernoulli, 25(3):2075–2106.
  • MacQueen, (1967) MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.
  • Mathieu and Minsker, (2021) Mathieu, T. and Minsker, S. (2021). Excess risk bounds in robust empirical risk minimization. Information and Inference: A Journal of the IMA.
  • Minsker, (2018) Minsker, S. (2018). Uniform bounds for robust mean estimators. arXiv preprint arXiv:1812.03523.
  • Mohri et al., (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
  • Ng et al., (2001) Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849–856.
  • Ostrovsky et al., (2013) Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. (2013). The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22.
  • (43) Paul, D., Chakraborty, S., and Das, S. (2021a). On the uniform concentration bounds and large sample properties of clustering with Bregman divergences. Stat, page e360.
  • (44) Paul, D., Chakraborty, S., and Das, S. (2021b). Robust principal component analysis: A median of means approach. arXiv preprint arXiv:2102.03403.
  • Pollard, (1981) Pollard, D. (1981). Strong consistency of kk-means clustering. The Annals of Statistics, 9(1):135–140.
  • Rodriguez and Valdora, (2019) Rodriguez, D. and Valdora, M. (2019). The breakdown point of the median of means tournament. Statistics & Probability Letters, 153:108–112.
  • Shah and Koltun, (2017) Shah, S. A. and Koltun, V. (2017). Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819.
  • Shalev-Shwartz and Ben-David, (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
  • Staerman et al., (2021) Staerman, G., Laforgue, P., Mozharovskyi, P., and d’Alché Buc, F. (2021). When OT meets MoM: Robust estimation of Wasserstein distance. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 136–144. PMLR.
  • Tan and Witten, (2015) Tan, K. M. and Witten, D. (2015). Statistical properties of convex clustering. Electronic journal of statistics, 9(2):2324.
  • Teboulle, (2007) Teboulle, M. (2007). A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research, 8(1).
  • Telgarsky and Dasgupta, (2012) Telgarsky, M. J. and Dasgupta, S. (2012). Agglomerative Bregman clustering. In 29th International Conference on Machine Learning, ICML 2012, pages 1527–1534.
  • Telgarsky and Dasgupta, (2013) Telgarsky, M. J. and Dasgupta, S. (2013). Moment-based uniform deviation bounds for kk-means and friends. Advances in Neural Information Processing Systems.
  • Terada, (2014) Terada, Y. (2014). Strong consistency of reduced k-means clustering. Scandinavian Journal of Statistics, 41(4):913–931.
  • Terada, (2015) Terada, Y. (2015). Strong consistency of factorial kk-means clustering. Annals of the Institute of Statistical Mathematics, 67(2):335–357.
  • Vapnik, (2013) Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media.
  • Vassilvitskii and Arthur, (2006) Vassilvitskii, S. and Arthur, D. (2006). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
  • Xu and Lange, (2019) Xu, J. and Lange, K. (2019). Power k-means clustering. In International Conference on Machine Learning, pages 6921–6931. PMLR.
  • Zhang et al., (1999) Zhang, B., Hsu, M., and Dayal, U. (1999). K-harmonic means—a data clustering algorithm. Hewlett-Packard Labs Technical Report HPL-1999-124, 55.
  • Zhang et al., (2020) Zhang, Z., Lange, K., and Xu, J. (2020). Simple and scalable sparse k-means clustering via feature ranking. Advances in Neural Information Processing Systems, 33.

Appendix A Proofs of Lemmas

For the theoretical exposition, we first establish the following lemmas. Lemma A.1 shows that the gradient of \phi is bounded in \ell_{2}-norm when the domain is restricted to the support of P.

Lemma A.1.

Under A3, ϕ(𝐱)2HpMp\|\nabla\phi(\boldsymbol{x})\|_{2}\leq H_{p}M\sqrt{p}, for all 𝐱[M,M]p\boldsymbol{x}\in[-M,M]^{p}.

Proof.

From A3, we observe that

ϕ(𝒙)ϕ(𝟎)2Hp𝒙2\displaystyle\|\nabla\phi(\boldsymbol{x})-\nabla\phi(\boldsymbol{0})\|_{2}\leq H_{p}\|\boldsymbol{x}\|_{2}\implies ϕ(𝒙)2Hp𝒙2HpMp.\displaystyle\|\nabla\phi(\boldsymbol{x})\|_{2}\leq H_{p}\|\boldsymbol{x}\|_{2}\leq H_{p}M\sqrt{p}.

Lemma A.2 essentially proves that the function ϕ\phi is Lipschitz with Lipschitz constant HpMpH_{p}M\sqrt{p} on [M,M]p[-M,M]^{p}.

Lemma A.2.

Under A3, for all \boldsymbol{x},\boldsymbol{y}\in[-M,M]^{p}, \phi(\cdot) is H_{p}M\sqrt{p}-Lipschitz, i.e.

|ϕ(𝒙)ϕ(𝒚)|HpMp𝒙𝒚2.|\phi(\boldsymbol{x})-\phi(\boldsymbol{y})|\leq H_{p}M\sqrt{p}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}.
Proof.

From the mean value theorem,

ϕ(𝒙)ϕ(𝒚)=ϕ(𝝃),𝒙𝒚,\phi(\boldsymbol{x})-\phi(\boldsymbol{y})=\langle\nabla\phi(\boldsymbol{\xi}),\boldsymbol{x}-\boldsymbol{y}\rangle,

for some \boldsymbol{\xi} that is a convex combination of \boldsymbol{x} and \boldsymbol{y}. Clearly, \boldsymbol{\xi}\in[-M,M]^{p}, by the convexity of [-M,M]^{p}. Now by the Cauchy–Schwarz inequality and Lemma A.1,

|ϕ(𝒙)ϕ(𝒚)|ϕ(𝝃)2𝒙𝒚2HpMp𝒙𝒚2.|\phi(\boldsymbol{x})-\phi(\boldsymbol{y})|\leq\|\nabla\phi(\boldsymbol{\xi})\|_{2}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}\leq H_{p}M\sqrt{p}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}.

Lemma A.3 shows that the map \boldsymbol{\Theta}\mapsto f_{\boldsymbol{\Theta}} is Lipschitz when the resulting functions are compared in the \|\cdot\|_{\infty} norm.

Lemma A.3.

Under assumptions A1–A3, for any \boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p},

f𝚯f𝚯4τ𝜶,kHpMpj=1k𝜽j𝜽j2.\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty}\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}.

Here, \boldsymbol{\Theta}=[\boldsymbol{\theta}_{1}^{\top},\dots,\boldsymbol{\theta}_{k}^{\top}]^{\top} and \boldsymbol{\Theta}^{\prime}=[\boldsymbol{\theta}_{1}^{\prime\top},\dots,\boldsymbol{\theta}_{k}^{\prime\top}]^{\top}.

Proof.
f𝚯f𝚯\displaystyle\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty}
=sup𝒙[M,M]p|Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))|\displaystyle=\sup_{\boldsymbol{x}\in[-M,M]^{p}}\left|\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))-\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}^{\prime}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}^{\prime}))\right|
τ𝜶,kj=1k|dϕ(𝒙,𝜽j)dϕ(𝒙,𝜽j)|\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}|d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})-d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j}^{\prime})|
=τ𝜶,kj=1k|ϕ(𝜽j)ϕ(𝜽j)+ϕ(𝜽j),𝒙𝜽jϕ(𝜽j),𝒙𝜽j|\displaystyle=\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}|\phi(\boldsymbol{\theta}_{j}^{\prime})-\phi(\boldsymbol{\theta}_{j})+\langle\nabla\phi(\boldsymbol{\theta}_{j}^{\prime}),\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\rangle-\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}\rangle|
=τ𝜶,kj=1k|ϕ(𝜽j)ϕ(𝜽j)+ϕ(𝜽j)ϕ(𝜽j),𝒙𝜽j+ϕ(𝜽j),𝜽j𝜽j|\displaystyle=\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}|\phi(\boldsymbol{\theta}_{j}^{\prime})-\phi(\boldsymbol{\theta}_{j})+\langle\nabla\phi(\boldsymbol{\theta}_{j}^{\prime})-\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\rangle+\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\rangle|
τ𝜶,kj=1k(|ϕ(𝜽j)ϕ(𝜽j)|+|ϕ(𝜽j)ϕ(𝜽j),𝒙𝜽j|+|ϕ(𝜽j),𝜽j𝜽j|)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(|\phi(\boldsymbol{\theta}_{j}^{\prime})-\phi(\boldsymbol{\theta}_{j})|+|\langle\nabla\phi(\boldsymbol{\theta}_{j}^{\prime})-\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\rangle|+|\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\rangle|\right)
τ𝜶,kj=1k(HpMp𝜽j𝜽j2+ϕ(𝜽j)ϕ(𝜽j)2𝒙𝜽j2+ϕ(𝜽j)2𝜽j𝜽j2)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(H_{p}M\sqrt{p}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}+\|\nabla\phi(\boldsymbol{\theta}_{j}^{\prime})-\nabla\phi(\boldsymbol{\theta}_{j})\|_{2}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\|_{2}+\|\nabla\phi(\boldsymbol{\theta}_{j})\|_{2}\|\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\|_{2}\right)
τ𝜶,kj=1k(HpMp𝜽j𝜽j2+Hp𝜽j𝜽j2×2pM+HpMp𝜽j𝜽j2)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(H_{p}M\sqrt{p}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}+H_{p}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}\times 2\sqrt{p}M+H_{p}M\sqrt{p}\|\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\|_{2}\right)
4τ𝜶,kHpMpj=1k𝜽j𝜽j2\displaystyle\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}
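As a quick numerical sanity check of Lemma A.3 (an illustration only, not part of the proof), the following sketch evaluates the ratio of the left- and right-hand sides for the squared-Euclidean instance, i.e. \phi(\boldsymbol{x})=\|\boldsymbol{x}\|_{2}^{2} (so H_{p}=2, as in Corollary 3.6) with \Psi_{\boldsymbol{\alpha}} taken as the coordinate-wise minimum, which is 1-Lipschitz with respect to the \ell_{1} norm and thus plays the role of \Psi_{\boldsymbol{\alpha}} with \tau_{\boldsymbol{\alpha},k}=1. The grid of random draws is ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
M, k, p = 1.0, 3, 4
H_p, tau = 2.0, 1.0   # phi(x) = ||x||^2 has a 2-Lipschitz gradient; the minimum is
                      # 1-Lipschitz w.r.t. the l1 norm, so it plays the role of Psi_alpha

def f(x, Theta):
    # f_Theta(x) with Psi_alpha = min and d_phi the squared Euclidean distance
    return np.min(((x - Theta) ** 2).sum(axis=1))

worst_ratio = 0.0
for _ in range(2000):
    Theta = rng.uniform(-M, M, (k, p))
    Theta_prime = rng.uniform(-M, M, (k, p))
    x = rng.uniform(-M, M, p)
    lhs = abs(f(x, Theta) - f(x, Theta_prime))
    rhs = 4 * tau * H_p * M * np.sqrt(p) * np.linalg.norm(Theta - Theta_prime, axis=1).sum()
    worst_ratio = max(worst_ratio, lhs / rhs)

print("largest observed lhs/rhs:", worst_ratio)   # stays below 1, consistent with the lemma
```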

Appendix B Proofs from Section 3

B.1 Proof of Lemma 3.1

Proof.

Let J(\boldsymbol{x})=d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}). Since P_{\mathcal{C}}(\boldsymbol{\theta}) minimizes J(\cdot) over \mathcal{C}, there exists a subgradient \boldsymbol{d}\in\partial J(P_{\mathcal{C}}(\boldsymbol{\theta})) such that, for all \boldsymbol{x}\in\mathcal{C},

𝒅,𝒙P𝒞(𝜽)0.\langle\boldsymbol{d},\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0. (5)

We note that \partial J(P_{\mathcal{C}}(\boldsymbol{\theta}))=\{\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta})\}, since \phi is differentiable. Thus, from equation (5),

ϕ(P𝒞(𝜽))ϕ(𝜽),𝒙P𝒞(𝜽)0.\langle\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta}),\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0. (6)

We now observe that,

dϕ(𝒙,𝜽)dϕ(𝒙,P𝒞(𝜽))dϕ(P𝒞(𝜽),𝜽)=ϕ(P𝒞(𝜽))ϕ(𝜽),𝒙P𝒞(𝜽)0.d_{\phi}(\boldsymbol{x},\boldsymbol{\theta})-d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}))-d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}),\boldsymbol{\theta})=\langle\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta}),\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0.

Hence the result. ∎
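For intuition (again an illustration only, not part of the proof), the three-point inequality of Lemma 3.1 can be checked numerically in the squared-Euclidean special case, where d_{\phi} is the squared Euclidean distance and the Bregman projection onto the box [-M,M]^{p} reduces to coordinate-wise clipping. All numeric values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, p = 1.0, 5

def d_phi(x, y):
    # Bregman divergence generated by phi(x) = ||x||^2: the squared Euclidean distance
    return float(np.sum((x - y) ** 2))

def project_box(theta, M):
    # For this phi, the Bregman projection onto C = [-M, M]^p is coordinate-wise clipping
    return np.clip(theta, -M, M)

for _ in range(1000):
    x = rng.uniform(-M, M, size=p)              # a point of the support, i.e. x in C
    theta = rng.uniform(-5 * M, 5 * M, size=p)  # a possibly out-of-range centroid
    proj = project_box(theta, M)
    lhs = d_phi(x, theta)
    rhs = d_phi(x, proj) + d_phi(proj, theta)
    assert lhs >= rhs - 1e-12                   # the inequality of Lemma 3.1

print("Lemma 3.1 inequality verified on 1000 random draws (squared Euclidean case).")
```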

B.2 Proof of Lemma 3.2

Proof.

Suppose \boldsymbol{\Theta}=\{\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{k}\}. We take \mathcal{C}=[-M,M]^{p} and \boldsymbol{\Theta}^{\prime}=\{P_{\mathcal{C}}(\boldsymbol{\theta}_{1}),\dots,P_{\mathcal{C}}(\boldsymbol{\theta}_{k})\}. Clearly \mathcal{C} is a convex set. Thus, from Lemma 3.1, we observe that

dϕ(𝒙,𝜽j)dϕ(𝒙,P𝒞(𝜽j))+dϕ(P𝒞(𝜽j),𝜽j)dϕ(𝒙,P𝒞(𝜽j))j=1,,k.\displaystyle d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))+d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))\,\quad\forall\,j=1,\dots,k.
\displaystyle\implies Ψ𝜶(dϕ(𝒙,P𝒞(𝜽1)),,dϕ(𝒙,P𝒞(𝜽k)))Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))\displaystyle\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)\leq\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k})\right)
\displaystyle\implies Ψ𝜶(dϕ(𝒙,P𝒞(𝜽1)),,dϕ(𝒙,P𝒞(𝜽k)))𝑑QΨ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))𝑑Q\displaystyle\int\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)dQ\leq\int\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k})\right)dQ
\displaystyle\implies Qf𝚯Qf𝚯\displaystyle Qf_{\boldsymbol{\Theta}^{\prime}}\,\,\leq\,\,Qf_{\boldsymbol{\Theta}}

B.3 Proof of Lemma 3.3

Proof.

We first divide the interval [-M,M] into small bins, each of width \epsilon. Denote \gamma_{i}=-M+i\epsilon, for i=1,\dots,\lfloor\frac{2M}{\epsilon}\rfloor, and let \Gamma_{\epsilon}=\{\gamma_{i}:i\in\{1,\dots,\lfloor\frac{2M}{\epsilon}\rfloor\}\}. If \epsilon>2M, we take \Gamma_{\epsilon}=\{0\}. Clearly, |\Gamma_{\epsilon}|=\max\{\lfloor\frac{2M}{\epsilon}\rfloor,1\}. From the construction of \Gamma_{\epsilon}, for all x\in[-M,M], there exists i\in[|\Gamma_{\epsilon}|] such that |x-\gamma_{i}|\leq\epsilon. We take \epsilon=(4\tau_{\boldsymbol{\alpha},k}H_{p}Mkp)^{-1}\delta. We define

𝚯δ={𝚯=((θi)):θiΓϵ}.\boldsymbol{\Theta}_{\delta}=\left\{\boldsymbol{\Theta}=\left((\theta_{i\ell})\right):\theta_{i\ell}\in\Gamma_{\epsilon}\right\}.

Then immediately we see

|𝚯δ|=(max{2Mϵ,1})kp.|\boldsymbol{\Theta}_{\delta}|=\left(\max\left\{\left\lfloor\frac{2M}{\epsilon}\right\rfloor,1\right\}\right)^{kp}.

For any \boldsymbol{\Theta}\in[-M,M]^{k\times p}, we can construct \boldsymbol{\Theta}^{\prime}\in\boldsymbol{\Theta}_{\delta} such that |\theta_{i\ell}-\theta^{\prime}_{i\ell}|\leq\epsilon for all i\in[k] and \ell\in[p]. From Lemma A.3, we observe that,

\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty} \leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2} \leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\cdot k\sqrt{p}\,\epsilon = 4\tau_{\boldsymbol{\alpha},k}H_{p}Mkp\,\epsilon = \delta.

Thus, δ={f𝚯:𝚯𝚯δ}\mathcal{F}_{\delta}=\{f_{\boldsymbol{\Theta}}:\boldsymbol{\Theta}\in\boldsymbol{\Theta}_{\delta}\} constitutes a δ\delta-cover of \mathcal{F} under the \|\cdot\|_{\infty} norm. Hence,

N(δ;,)|δ||𝚯δ|\displaystyle N(\delta;\mathcal{F},\|\cdot\|_{\infty})\leq|\mathcal{F}_{\delta}|\leq|\boldsymbol{\Theta}_{\delta}| =(max{2Mϵ,1})kp=(max{8M2τ𝜶,kHpkpδ,1})kp.\displaystyle=\left(\max\left\{\left\lfloor\frac{2M}{\epsilon}\right\rfloor,1\right\}\right)^{kp}=\left(\max\left\{\left\lfloor\frac{8M^{2}\tau_{\boldsymbol{\alpha},k}H_{p}kp}{\delta}\right\rfloor,1\right\}\right)^{kp}.
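The cover constructed above is simply a Cartesian grid over the kp centroid coordinates. The following sketch (illustrative only; the function name and toy constants are ours) computes the grid resolution \epsilon and the resulting cover size.

```python
import numpy as np

def delta_cover_size(M, k, p, delta, tau, H_p):
    """Size of the delta-cover Theta_delta built in the proof of Lemma 3.3."""
    eps = delta / (4 * tau * H_p * M * k * p)   # grid resolution epsilon
    n_grid = max(int(2 * M / eps), 1)           # |Gamma_eps| points per coordinate
    return n_grid ** (k * p)                    # |Theta_delta| = |Gamma_eps|^{kp}

# toy values; the count matches (max{floor(8 M^2 tau H_p k p / delta), 1})^{kp}
M, k, p, tau, H_p = 1.0, 2, 2, 1.0, 2.0
for delta in (2.0, 1.0, 0.5):
    print(delta, delta_cover_size(M, k, p, delta, tau, H_p))
```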

B.4 Proof of Lemma 3.4

Proof.

From Lemma A.3, we observe that,

diam()\displaystyle\text{diam}(\mathcal{F}) =sup𝚯,𝚯[M,M]k×pf𝚯f𝚯\displaystyle=\sup_{\boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p}}\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty}
4HpMpτ𝜶,ksup𝚯,𝚯[M,M]k×pj=1k𝜽j𝜽j2\displaystyle\leq 4H_{p}M\sqrt{p}\tau_{\boldsymbol{\alpha},k}\sup_{\boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p}}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}
4HpMpτ𝜶,k×2kMp\displaystyle\leq 4H_{p}M\sqrt{p}\tau_{\boldsymbol{\alpha},k}\times 2kM\sqrt{p}
=8τ𝜶,kHpM2kp.\displaystyle=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp.

B.5 Proof of Lemma 3.5

Proof.

From the non-negativity of Ψ𝜶()\Psi_{\boldsymbol{\alpha}}(\cdot), we get, Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))0\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))\geq 0, for any 𝒙[M,M]p\boldsymbol{x}\in[-M,M]^{p} and 𝚯[M,M]k×p\boldsymbol{\Theta}\in[-M,M]^{k\times p}. For any 𝜷0k\boldsymbol{\beta}\in\mathop{\mathbb{R}}\nolimits^{k}_{\geq 0}, from A2, we get,

\Psi_{\boldsymbol{\alpha}}(\boldsymbol{\beta})=|\Psi_{\boldsymbol{\alpha}}(\boldsymbol{\beta})-\Psi_{\boldsymbol{\alpha}}(\mathbf{0})|\leq\tau_{\boldsymbol{\alpha},k}\|\boldsymbol{\beta}-\mathbf{0}\|_{1}=\tau_{\boldsymbol{\alpha},k}\|\boldsymbol{\beta}\|_{1}.

Taking 𝜷=(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))\boldsymbol{\beta}=(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))^{\top}, we get,

Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))\displaystyle\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))
τ𝜶,kj=1kdϕ(𝒙,𝜽j)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})
=τ𝜶,kj=1k(ϕ(𝒙)ϕ(𝜽j)ϕ(𝜽j),𝒙𝜽j)\displaystyle=\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(\phi(\boldsymbol{x})-\phi(\boldsymbol{\theta}_{j})-\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}\rangle\right)
τ𝜶,kj=1k(|ϕ(𝒙)ϕ(𝜽j)|+|ϕ(𝜽j),𝒙𝜽j|)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(|\phi(\boldsymbol{x})-\phi(\boldsymbol{\theta}_{j})|+|\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}\rangle|\right)
τ𝜶,kj=1k(HpMp𝒙𝜽j2+ϕ(𝜽j)2𝒙𝜽j2)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(H_{p}M\sqrt{p}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2}+\|\nabla\phi(\boldsymbol{\theta}_{j})\|_{2}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2}\right) (7)
2τ𝜶,kHpMpj=1k𝒙𝜽j2\displaystyle\leq 2\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2} (8)
4τ𝜶,kHpM2pk.\displaystyle\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk.

Here inequality (7) follows from the Cauchy–Schwarz inequality and Lemma A.2. Inequality (8) follows from Lemma A.1. ∎

B.6 Proof of Theorem 3.1

Proof.

Let \Delta=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp. We construct a decreasing sequence \{\delta_{i}\}_{i\in\mathbb{N}} as follows. Take \delta_{1}:=\Delta, so that \delta_{1}\geq\text{diam}(\mathcal{F}) by Lemma 3.4, and set \delta_{i+1}=\frac{1}{2}\delta_{i}. Let \mathcal{F}_{i} be a minimal \delta_{i}-cover of \mathcal{F}, i.e. |\mathcal{F}_{i}|=N(\delta_{i};\mathcal{F},\|\cdot\|_{\infty}). For each f\in\mathcal{F}, denote by f_{i} the element of \mathcal{F}_{i} closest to f (with ties broken arbitrarily). We can thus write,

𝔼supf1ni=1nϵif(𝑿i)ξ1+ξ2+ξ3,\displaystyle\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})\leq\xi_{1}+\xi_{2}+\xi_{3},

where

ξ1=𝔼supf1ni=1nϵi(f(𝑿i)fm(𝑿i)),\displaystyle\xi_{1}=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-f_{m}(\boldsymbol{X}_{i})), (9)
ξ2=j=1m1𝔼supf1ni=1nϵi(fj+1(𝑿i)fj(𝑿i)),\displaystyle\xi_{2}=\sum_{j=1}^{m-1}\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f_{j+1}(\boldsymbol{X}_{i})-f_{j}(\boldsymbol{X}_{i})), (10)
ξ3=𝔼supf1ni=1nϵif1(𝑿i).\displaystyle\xi_{3}=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f_{1}(\boldsymbol{X}_{i}). (11)

Since \delta_{1}\geq\text{diam}(\mathcal{F}), the cover \mathcal{F}_{1} may be taken to consist of a single function, so f_{1} does not depend on f; as the Rademacher variables \epsilon_{i} have mean zero, \xi_{3}=0. To bound \xi_{1}, we observe that,

ξ1=𝔼supf1ni=1nϵi(f(𝑿i)fm(𝑿i))𝔼supf1n(i=1nϵi2)(i=1n(f(𝑿i)fm(𝑿i))2)δm\xi_{1}=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-f_{m}(\boldsymbol{X}_{i}))\leq\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sqrt{\left(\sum_{i=1}^{n}\epsilon_{i}^{2}\right)\left(\sum_{i=1}^{n}(f(\boldsymbol{X}_{i})-f_{m}(\boldsymbol{X}_{i}))^{2}\right)}\leq\delta_{m}

The first inequality above is the Cauchy–Schwarz inequality; the second uses \sum_{i=1}^{n}\epsilon_{i}^{2}=n together with \|f-f_{m}\|_{\infty}\leq\delta_{m}. To bound \xi_{2}, we observe that,

fj+1fjfj+1f+ffjδj+1+δj.\|f_{j+1}-f_{j}\|_{\infty}\leq\|f_{j+1}-f\|_{\infty}+\|f-f_{j}\|_{\infty}\leq\delta_{j+1}+\delta_{j}.

Now appealing to Massart’s lemma (Mohri et al.,, 2018), we get,

𝔼supf1ni=1nϵi(fj+1(𝑿i)fj(𝑿i))\displaystyle\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f_{j+1}(\boldsymbol{X}_{i})-f_{j}(\boldsymbol{X}_{i})) (δj+1+δj)2log(N(δj;,)N(δj+1;,))n\displaystyle\leq\frac{(\delta_{j+1}+\delta_{j})\sqrt{2\log\left(N(\delta_{j};\mathcal{F},\|\cdot\|_{\infty})N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})\right)}}{\sqrt{n}}
2(δj+1+δj)logN(δj+1;,)n\displaystyle\leq\frac{2(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}}{\sqrt{n}}

Thus,

ξ2=j=1m1𝔼supf1ni=1nϵi(fj+1(𝑿i)fj(𝑿i))j=1m12(δj+1+δj)logN(δj+1;,)n\xi_{2}=\sum_{j=1}^{m-1}\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f_{j+1}(\boldsymbol{X}_{i})-f_{j}(\boldsymbol{X}_{i}))\leq\sum_{j=1}^{m-1}\frac{2(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}}{\sqrt{n}}

Combining the bounds on ξ1\xi_{1}, ξ2\xi_{2} and ξ3\xi_{3}, we get,

𝔼supf1ni=1nϵif(𝑿i)δm+2nj=1m1(δj+1+δj)logN(δj+1;,).\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})\leq\delta_{m}+\frac{2}{\sqrt{n}}\sum_{j=1}^{m-1}(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}. (12)

From the construction of {δi}i1\{\delta_{i}\}_{i\geq 1}, we know, δj+1+δj=6(δj+1δj+2)\delta_{j+1}+\delta_{j}=6(\delta_{j+1}-\delta_{j+2}). Hence from equation (12), we get,

𝔼supf1ni=1nϵif(𝑿i)\displaystyle\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i}) δm+2nj=1m1(δj+1+δj)logN(δj+1;,)\displaystyle\leq\delta_{m}+\frac{2}{\sqrt{n}}\sum_{j=1}^{m-1}(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}
=δm+12nj=1m1(δj+1δj+2)logN(δj+1;,)\displaystyle=\delta_{m}+\frac{12}{\sqrt{n}}\sum_{j=1}^{m-1}(\delta_{j+1}-\delta_{j+2})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}
δm+12nδm+1δ2logN(ϵ;,)𝑑ϵ\displaystyle\leq\delta_{m}+\frac{12}{\sqrt{n}}\int_{\delta_{m+1}}^{\delta_{2}}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}d\epsilon

Taking limits as mm\to\infty in the above equation, we get,

𝔼supf1ni=1nϵif(𝑿i)12n0ΔlogN(ϵ;,)𝑑ϵ.\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}d\epsilon.

From Lemma 3.3, plugging in the value of N(ϵ;,)N(\epsilon;\mathcal{F},\|\cdot\|_{\infty}), we get,

n()\displaystyle\mathcal{R}_{n}(\mathcal{F}) 12n0ΔlogN(ϵ;,)𝑑ϵ\displaystyle\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}d\epsilon
12n0Δkplog(max{Δϵ,1})𝑑ϵ\displaystyle\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\max\left\{\frac{\Delta}{\epsilon},1\right\}\right)}d\epsilon
=12n0Δkplog(Δϵ)𝑑ϵ\displaystyle=\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\frac{\Delta}{\epsilon}\right)}d\epsilon
=12kpnΔ02t2et2𝑑t\displaystyle=12\sqrt{\frac{kp}{n}}\Delta\int_{0}^{\infty}2t^{2}e^{-t^{2}}dt
=12kpnΔ0u321eu𝑑u\displaystyle=12\sqrt{\frac{kp}{n}}\Delta\int_{0}^{\infty}u^{\frac{3}{2}-1}e^{-u}du
=12kpnΔΓ(3/2)\displaystyle=12\sqrt{\frac{kp}{n}}\Delta\Gamma(3/2)
=6kpπn×8τ𝜶,kHpM2kp\displaystyle=6\sqrt{\frac{kp\pi}{n}}\times 8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp
=48πτ𝜶,kHpM2(kp)3/2n1/2.\displaystyle=48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}.
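The last few lines of the proof evaluate the Dudley entropy integral in closed form. The following quick numerical check (assuming SciPy is available; the constants are arbitrary) confirms the identity \int_{0}^{\Delta}\sqrt{kp\log(\Delta/\epsilon)}\,d\epsilon=\Delta\sqrt{kp}\,\Gamma(3/2) used above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

# Check of the entropy-integral evaluation used at the end of the proof:
#   int_0^Delta sqrt(kp * log(Delta / eps)) d eps = Delta * sqrt(kp) * Gamma(3/2).
kp, Delta = 6.0, 2.5                                   # arbitrary positive constants
integral, _ = quad(lambda e: np.sqrt(kp * np.log(Delta / e)), 0.0, Delta)
closed_form = Delta * np.sqrt(kp) * gamma(1.5)         # Gamma(3/2) = sqrt(pi) / 2
print(integral, closed_form)                           # the two values agree numerically
```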

B.7 Proof of Theorem 3.2

Proof.

From Lemma 3.5, we observe that \sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk. Under assumption A1, by the standard symmetrization argument and McDiarmid's bounded-differences inequality, we observe that, with probability at least 1-\delta,

supf|PnfPf|\displaystyle\sup_{f\in\mathcal{F}}|P_{n}f-Pf| 2n()+supfflog(2/δ)2n\displaystyle\leq 2\mathcal{R}_{n}(\mathcal{F})+\sup_{f\in\mathcal{F}}\|f\|_{\infty}\sqrt{\frac{\log(2/\delta)}{2n}}
96πτ𝜶,kHpM2(kp)3/2n1/2+4τ𝜶,kHpM2pklog(2/δ)2n.\displaystyle\leq 96\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}. (13)

Inequality (13) follows from appealing to Theorem 3.1 and observing that supff4τ𝜶,kHpM2pk\sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk. ∎

B.8 Proof of Theorem 3.3

Proof.

(Proof of strong consistency) We will first show |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\xrightarrow{a.s.}0. Note that, by the usual ERM decomposition and the fact that \widehat{\boldsymbol{\Theta}}_{n} minimizes P_{n}f_{\boldsymbol{\Theta}}, we have |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 2\sup_{f\in\mathcal{F}}|P_{n}f-Pf|. Let C=\max\{192\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2},8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\}. Then from Theorem 3.2, we observe that with probability at least 1-\delta,

|Pf𝚯^nPf𝚯|Cn+Clog(2/δ)2n.|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq\frac{C}{\sqrt{n}}+C\sqrt{\frac{\log(2/\delta)}{2n}}. (14)

Fix ϵ>0\epsilon>0. If n4C2/ϵ2n\geq 4C^{2}/\epsilon^{2} and δ=2exp(nϵ22C2)\delta=2\exp\left(-\frac{n\epsilon^{2}}{2C^{2}}\right), the RHS of (14) becomes no bigger than ϵ\epsilon. Thus,

(|Pf𝚯^nPf𝚯|>ϵ)2exp(nϵ22C2),n4C2/ϵ2.\mathbbm{P}\left(|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|>\epsilon\right)\leq 2\exp\left(-\frac{n\epsilon^{2}}{2C^{2}}\right),\quad\forall\,n\geq 4C^{2}/\epsilon^{2}.

Since the series \sum_{n=1}^{\infty}\exp\left(-\frac{n\epsilon^{2}}{2C^{2}}\right) converges, so does \sum_{n=1}^{\infty}\mathbbm{P}\left(|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|>\epsilon\right), and the Borel–Cantelli lemma yields Pf_{\widehat{\boldsymbol{\Theta}}_{n}}\xrightarrow{a.s.}Pf_{\boldsymbol{\Theta}^{\ast}}. Thus, for any \epsilon>0, Pf_{\widehat{\boldsymbol{\Theta}}_{n}}\leq Pf_{\boldsymbol{\Theta}^{\ast}}+\epsilon almost surely [P] for all n sufficiently large. From assumption A4, \text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\leq\eta almost surely [P], for any prefixed \eta>0 and n large. Thus, \text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\xrightarrow{a.s.}0, which proves the result.

(Proof of n\sqrt{n}-consistency) Fix δ(0,1]\delta\in(0,1]. From Theorem 3.2, with probability at least 1δ1-\delta,

|Pf𝚯^nPf𝚯|192πτ𝜶,kHpM2(kp)3/2n1/2+8τ𝜶,kHpM2pklog(2/δ)2n=O(n1/2).|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 192\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}=O(n^{-1/2}).

Hence, n|Pf𝚯^nPf𝚯|=O(1)\sqrt{n}|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O(1) with probability at least 1δ1-\delta. Thus, Cδ\exists\,C_{\delta}, such that

(n|Pf𝚯^nPf𝚯|Cδ)1δ,\mathbbm{P}\left(\sqrt{n}|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq C_{\delta}\right)\geq 1-\delta,

for all nn large enough. Hence, |Pf𝚯^nPf𝚯|=OP(n1/2)|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}(n^{-1/2}). ∎

Appendix C Proofs from Section 3.4

C.1 Proof of Lemma 3.6

Proof.

Suppose \boldsymbol{\Theta}=\{\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{k}\}. We take \mathcal{C}=[-M,M]^{p} and \boldsymbol{\Theta}^{\prime}=\{P_{\mathcal{C}}(\boldsymbol{\theta}_{1}),\dots,P_{\mathcal{C}}(\boldsymbol{\theta}_{k})\}. Clearly \mathcal{C} is convex. Let \mathcal{L}\subseteq\{1,\dots,L\} be the set of all partitions which do not contain an outlier. Thus, from Lemma 3.1, we observe that

dϕ(𝑿i,𝜽j)dϕ(𝑿i,P𝒞(𝜽j))+dϕ(P𝒞(𝜽j),𝜽j)dϕ(𝑿i,P𝒞(𝜽j))j=1,,k and i\displaystyle d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))+d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))\,\forall\,j=1,\dots,k\text{ and }i\in\mathcal{I}
\displaystyle\implies Ψ𝜶(dϕ(𝑿i,P𝒞(𝜽1)),,dϕ(𝑿,P𝒞(𝜽k)))Ψ𝜶(dϕ(𝑿i,𝜽1),,dϕ(𝑿,𝜽k))i\displaystyle\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{X},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)\leq\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{X},\boldsymbol{\theta}_{k})\right)\,\forall\,i\in\mathcal{I}
\displaystyle\implies iBΨ𝜶(dϕ(𝑿i,P𝒞(𝜽1)),,dϕ(𝑿,P𝒞(𝜽k)))iBΨ𝜶(dϕ(𝑿i,𝜽1),,dϕ(𝑿,𝜽k))\displaystyle\sum_{i\in B_{\ell}}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{X},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)\leq\sum_{i\in B_{\ell}}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{X},\boldsymbol{\theta}_{k})\right)\,\forall\,\ell\in\mathcal{L}
\displaystyle\implies 1biBf𝚯(𝑿i)1biBf𝚯(𝑿i)\displaystyle\frac{1}{b}\sum_{i\in B_{\ell}}f_{\boldsymbol{\Theta}^{\prime}}(\boldsymbol{X}_{i})\leq\frac{1}{b}\sum_{i\in B_{\ell}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\,\forall\,\ell\in\mathcal{L}

Now since |\mathcal{L}|>|\mathcal{L}^{C}| (from assumption A6),

Median(1biB1f𝚯(𝑿i),,1biBLf𝚯(𝑿i))Median(1biB1f𝚯(𝑿i),,1biBLf𝚯(𝑿i))\displaystyle\text{Median}\left(\frac{1}{b}\sum_{i\in B_{1}}f_{\boldsymbol{\Theta}^{\prime}}(\boldsymbol{X}_{i}),\dots,\frac{1}{b}\sum_{i\in B_{L}}f_{\boldsymbol{\Theta}^{\prime}}(\boldsymbol{X}_{i})\right)\leq\text{Median}\left(\frac{1}{b}\sum_{i\in B_{1}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}),\dots,\frac{1}{b}\sum_{i\in B_{L}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\right)
MoMLn(𝚯)MoMLn(𝚯)\displaystyle\implies\text{MoM}_{L}^{n}(\boldsymbol{\Theta}^{\prime})\leq\text{MoM}_{L}^{n}(\boldsymbol{\Theta})
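For concreteness, the following sketch computes the \text{MoM}_{L}^{n} objective manipulated in the lemma, taking \Psi_{\boldsymbol{\alpha}} to be the coordinate-wise minimum and d_{\phi} the squared Euclidean distance (the classical k-means loss) as one instance of the framework; the synthetic data and constants are purely illustrative. It also shows the qualitative behaviour the lemma exploits: centroid configurations that chase outliers do not lower the MoM objective.

```python
import numpy as np

def mom_objective(X, Theta, L, rng):
    """MoM_L^n(Theta): median over L blocks of the block-averaged clustering loss.

    Psi_alpha is taken to be the minimum over the k centroids and d_phi the
    squared Euclidean distance, i.e. the classical k-means loss.
    """
    n = X.shape[0]
    idx = rng.permutation(n)
    b = n // L                                                   # block size
    loss = ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(-1).min(axis=1)
    block_means = loss[idx[: b * L]].reshape(L, b).mean(axis=1)
    return np.median(block_means)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)),
               rng.normal(8, 1, (500, 2)),
               rng.normal(40, 0.1, (20, 2))])                    # last 20 rows are outliers
Theta_good = np.array([[0.0, 0.0], [8.0, 8.0]])
Theta_bad = np.array([[0.0, 0.0], [40.0, 40.0]])                 # one centroid chases the outliers
print(mom_objective(X, Theta_good, L=41, rng=rng))               # small
print(mom_objective(X, Theta_bad, L=41, rng=rng))                # much larger
```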

C.2 Proof of Theorem 3.4

Proof.

For notational simplicity let PBP_{B_{\ell}} denote the empirical distribution of {𝑿i}iB\{\boldsymbol{X}_{i}\}_{i\in B_{\ell}}. Suppose ϵ>0\epsilon>0. We will first bound the probability of sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|>ϵ\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|>\epsilon. To do so, we will individually bound the probabilities of the events

sup𝚯[M,M]k×p(MoMLn(f𝚯)Pf𝚯)>ϵ\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}})>\epsilon

and

sup𝚯[M,M]k×p(Pf𝚯MoMLn(f𝚯))>ϵ.\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}}))>\epsilon.

We note that if

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}}))>\epsilon,

then there exists \boldsymbol{\Theta} for which at least half of the block means deviate from the population value by more than \epsilon, i.e.

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}\geq\frac{L}{2}.

Here again \mathbbm{1}\{\cdot\} denotes the indicator function. Now let \varphi(t)=(t-1)\mathbbm{1}\{1\leq t\leq 2\}+\mathbbm{1}\{t>2\}. Clearly,

𝟙{t2}φ(t)𝟙{t1}.\mathbbm{1}\{t\geq 2\}\leq\varphi(t)\leq\mathbbm{1}\{t\geq 1\}. (15)

We observe that,

sup𝚯[M,M]k×p=1L𝟙{(PPB)f𝚯>ϵ}\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}
\displaystyle\leq sup𝚯[M,M]k×p𝟙{(PPB)f𝚯>ϵ}+|𝒪|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}+|\mathcal{O}|
\displaystyle\leq sup𝚯[M,M]k×pφ(2(PPB)f𝚯ϵ)+|𝒪|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)+|\mathcal{O}|
\displaystyle\leq sup𝚯[M,M]k×p𝔼φ(2(PPB)f𝚯ϵ)+|𝒪|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)+|\mathcal{O}|
+sup𝚯[M,M]k×p[φ(2(PPB)f𝚯ϵ)𝔼φ(2(PPB)f𝚯ϵ)].\displaystyle+\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\bigg{[}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\bigg{]}. (16)

To bound sup𝚯[M,M]k×p=1L𝟙{(PPB)f𝚯>ϵ}\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}, we will first bound the quantity 𝔼φ(2(PPB)f𝚯ϵ)\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right). We observe that,

𝔼φ(2(PPB)f𝚯ϵ)𝔼[𝟙{2(PPB)f𝚯ϵ>1}]=\displaystyle\small\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\leq\mathbb{E}\left[\mathbbm{1}\left\{\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}>1\right\}\right]= [(PPB)f𝚯>ϵ2]\displaystyle\mathbbm{P}\left[(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\frac{\epsilon}{2}\right]
\displaystyle\leq exp{bϵ232τ𝜶,k2Hp2M4k2p2}\displaystyle\exp\left\{-\frac{b\epsilon^{2}}{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}k^{2}p^{2}}\right\} (17)

Here, inequality (17) follows from Hoeffding's inequality: for \ell\in\mathcal{L}, P_{B_{\ell}}f_{\boldsymbol{\Theta}} is an average of b i.i.d. terms, each lying in [0,4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk] by Lemma 3.5. We now turn to bounding the term

sup𝚯[M,M]k×p[φ(2(PPB)f𝚯ϵ)𝔼φ(2(PPB)f𝚯ϵ)].\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\bigg{[}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\bigg{]}.

Appealing to Theorem 26.5 of (Shalev-Shwartz and Ben-David,, 2014) we observe that, with probability at least 1e2Lδ21-e^{-2L\delta^{2}}, for all 𝚯[M,M]k×p\boldsymbol{\Theta}\in[-M,M]^{k\times p},

1Lφ(2(PPB)f𝚯ϵ)\displaystyle\frac{1}{L}\sum_{\ell\in\mathcal{L}}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)
\displaystyle\leq 𝔼[1Lφ(2(PPB)f𝚯ϵ)]+2𝔼[sup𝚯[M,M]k×p1Lσφ(2(PPB)f𝚯ϵ)]+δ.\displaystyle\mathbb{E}\left[\frac{1}{L}\sum_{\ell\in\mathcal{L}}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right]+2\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\frac{1}{L}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right]+\delta. (18)

Here \{\sigma_{\ell}\}_{\ell\in\mathcal{L}} are i.i.d. Rademacher random variables. Let \{\xi_{i}\}_{i=1}^{n} be i.i.d. Rademacher random variables, independent from \{\sigma_{\ell}\}_{\ell\in\mathcal{L}}. From equation (18), we get,

1Lsup𝚯[M,M]k×p[φ(2(PPB)f𝚯ϵ)𝔼φ(2(PPB)f𝚯ϵ)]\displaystyle\frac{1}{L}\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\bigg{[}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\bigg{]}
\displaystyle\leq 2𝔼[sup𝚯[M,M]k×p1Lσφ(2(PPB)f𝚯ϵ)]+δ\displaystyle 2\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\frac{1}{L}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right]+\delta
\displaystyle\leq 4Lϵ𝔼[sup𝚯[M,M]k×pσ(PPB)f𝚯]+δ.\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}\right]+\delta. (19)

Equation (19) follows from the fact that \varphi(\cdot) is 1-Lipschitz, appealing to Lemma 26.9 of Shalev-Shwartz and Ben-David (2014). We now consider a “ghost” sample \mathcal{X}^{\prime}=\{\boldsymbol{X}_{1}^{\prime},\dots,\boldsymbol{X}_{n}^{\prime}\}, which is i.i.d. and follows the probability law P. Thus, the right-hand side of equation (19) can be further developed as

=\displaystyle= 4Lϵ𝔼[sup𝚯[M,M]k×pσ𝔼𝒳((PBPB)f𝚯)]+δ\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\mathbb{E}_{\mathcal{X}^{\prime}}\left((P^{\prime}_{B_{\ell}}-P_{B_{\ell}})f_{\boldsymbol{\Theta}}\right)\right]+\delta
\displaystyle\leq 4Lϵ𝔼[sup𝚯[M,M]k×pσ(PBPB)f𝚯]+δ\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}(P^{\prime}_{B_{\ell}}-P_{B_{\ell}})f_{\boldsymbol{\Theta}}\right]+\delta
=\displaystyle= 4Lϵ𝔼[sup𝚯[M,M]k×pσ1biB(f𝚯(𝑿i)f𝚯(𝑿i))]+δ\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\frac{1}{b}\sum_{i\in B_{\ell}}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta
=\displaystyle= 4bLϵ𝔼[sup𝚯[M,M]k×pσiBξi(f𝚯(𝑿i)f𝚯(𝑿i))]+δ\displaystyle\frac{4}{bL\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\sum_{i\in B_{\ell}}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta (20)
=\displaystyle= 4nϵ𝔼[sup𝚯[M,M]k×piBσξi(f𝚯(𝑿i)f𝚯(𝑿i))]+δ\displaystyle\frac{4}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sum_{i\in B_{\ell}}\sigma_{\ell}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta
\displaystyle\leq 4nϵ𝔼[sup𝚯[M,M]k×piBσξi(f𝚯(𝑿i)+f𝚯(𝑿i))]+δ\displaystyle\frac{4}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sum_{i\in B_{\ell}}\sigma_{\ell}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})+f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta
=\frac{4}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{i\in\mathcal{J}}\gamma_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})+f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta \quad (21)
\leq\frac{8}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{i\in\mathcal{J}}\gamma_{i}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\right]+\delta
\displaystyle\leq 8nϵ48πτ𝜶,kHpM2(kp)3/2|𝒥|+δ\displaystyle\frac{8}{n\epsilon}48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}\sqrt{|\mathcal{J}|}+\delta (22)
\displaystyle\leq 384nϵπτ𝜶,kHpM2(kp)3/2||+δ.\displaystyle\frac{384}{n\epsilon}\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}\sqrt{|\mathcal{I}|}+\delta. (23)

Equation (20) follows from observing that (f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\overset{d}{=}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})). In equation (21), we write \gamma_{i}:=\sigma_{\ell}\xi_{i} for i\in B_{\ell} and \mathcal{J}:=\cup_{\ell\in\mathcal{L}}B_{\ell}; the \{\gamma_{i}\}_{i\in\mathcal{J}} are independent Rademacher random variables by construction. Equation (22) follows from appealing to Theorem 3.1. Thus, combining equations (18), (19), and (23), we conclude that, with probability at least 1-e^{-2L\delta^{2}},

sup𝚯[M,M]k×p=1L𝟙{(PPB)f𝚯>ϵ}\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}
L(exp{bϵ232τ𝜶,k2Hp2M4k2p2}+|𝒪|L+384nϵπτ𝜶,kHpM2(kp)3/2||+δ).\displaystyle\leq L\left(\exp\left\{-\frac{b\epsilon^{2}}{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}k^{2}p^{2}}\right\}+\frac{|\mathcal{O}|}{L}+\frac{384}{n\epsilon}\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}\sqrt{|\mathcal{I}|}+\delta\right). (24)

We choose δ=24+η|𝒪|L\delta=\frac{2}{4+\eta}-\frac{|\mathcal{O}|}{L} and

ϵ=2max{32τ𝜶,k2Hp2M4log(4(η+4)η)kpLn,1536(η+4)τ𝜶,kHpM2πη(kp)3/2||n}.\epsilon=2\max\left\{\sqrt{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}\log\left(\frac{4(\eta+4)}{\eta}\right)}kp\sqrt{\frac{L}{n}},\frac{1536(\eta+4)\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}\sqrt{\pi}}{\eta}(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

This makes the right hand side of (24) strictly smaller than L2\frac{L}{2}. Thus, we have shown that

(sup𝚯[M,M]k×p(Pf𝚯MoMLn(f𝚯))>ϵ)e2Lδ2.\displaystyle\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}}))>\epsilon\right)\leq e^{-2L\delta^{2}}.

Similarly, we can show that,

(sup𝚯[M,M]k×p(MoMLn(f𝚯)Pf𝚯)>ϵ)e2Lδ2.\displaystyle\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}})>\epsilon\right)\leq e^{-2L\delta^{2}}.

Combining the above two inequalities, we get,

(sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|>ϵ)2e2Lδ2.\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|>\epsilon\right)\leq 2e^{-2L\delta^{2}}.

In other words, with probability at least 1-2e^{-2L\delta^{2}},

sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|
\displaystyle\leq 2max{32τ𝜶,k2Hp2M4log(4(η+4)η)kpLn,1536(η+4)τ𝜶,kHpM2πη(kp)3/2||n}\displaystyle 2\max\left\{\sqrt{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}\log\left(\frac{4(\eta+4)}{\eta}\right)}kp\sqrt{\frac{L}{n}},\frac{1536(\eta+4)\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}\sqrt{\pi}}{\eta}(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}
\displaystyle\lesssim τ𝜶,kHpmax{kpLn,(kp)3/2||n}.\displaystyle\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

C.3 Proof of Corollary 3.5

Proof.

We observe the following.

|Pf𝚯^n(MoM)Pf𝚯|\displaystyle|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|
=Pf𝚯^n(MoM)Pf𝚯\displaystyle=Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}
=Pf𝚯^n(MoM)MoMLn(f𝚯^n(MoM))+MoMLn(f𝚯^n(MoM))MoMLn(f𝚯)+MoMLn(f𝚯)Pf𝚯\displaystyle=Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})+\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})-\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}})+\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}})-Pf_{\boldsymbol{\Theta}^{\ast}}
Pf𝚯^n(MoM)MoMLn(f𝚯^n(MoM))+MoMLn(f𝚯)Pf𝚯\displaystyle\leq Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})+\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}})-Pf_{\boldsymbol{\Theta}^{\ast}} (25)
2sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|\displaystyle\leq 2\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|
τ𝜶,kHpmax{kpLn,(kp)3/2||n}.\displaystyle\lesssim\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

Inequality (25) follows from the fact that MoMLn(f𝚯^n(MoM))MoMLn(f𝚯)\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})\leq\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}}), by definition of 𝚯^n(MoM)\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}. ∎

C.4 Proof of Corollary 3.6

Proof.

In this case, Hp=2H_{p}=2. Thus, the bound in Corollary 3.5 becomes |Pf𝚯^n(MoM)Pf𝚯|max{Ln,||n}|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\lesssim\max\left\{\sqrt{\frac{L}{n}},\frac{\sqrt{|\mathcal{I}|}}{n}\right\}. By A7, 2e2Lδ2=o(1)2e^{-2L\delta^{2}}=o(1). Thus,

(|Pf𝚯^n(MoM)Pf𝚯|=O(max{Ln,||n}))1o(1).\mathbbm{P}\left(|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O\left(\max\left\{\sqrt{\frac{L}{n}},\frac{\sqrt{|\mathcal{I}|}}{n}\right\}\right)\right)\geq 1-o(1).

Hence, |Pf𝚯^n(MoM)Pf𝚯|=OP(max{Ln,1n})|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}\left(\max\left\{\sqrt{\frac{L}{n}},\frac{1}{\sqrt{n}}\right\}\right).

Under A7, \max\left\{\sqrt{\frac{L}{n}},\frac{\sqrt{|\mathcal{I}|}}{n}\right\}\leq\max\left\{\sqrt{\frac{L}{n}},\frac{1}{\sqrt{n}}\right\}=o(1), which implies |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=o_{P}(1) (here X_{n}=o_{P}(a_{n}) means X_{n}/a_{n}\xrightarrow{P}0). Thus, Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}\xrightarrow{P}Pf_{\boldsymbol{\Theta}^{\ast}}. Now, for any \epsilon,\delta>0, \mathbbm{P}(Pf_{\widehat{\boldsymbol{\Theta}}_{n}}\leq Pf_{\boldsymbol{\Theta}^{\ast}}+\epsilon)\geq 1-\delta if n is large. From assumption A4, \mathbbm{P}(\text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\leq\eta)\geq 1-\delta for any prefixed \eta>0 and n large. Thus, \text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\xrightarrow{P}0, which proves the result. ∎

Appendix D Additional Experiments

D.1 Additional Simulations

Experiment 3

We use the same simulation setting as Experiment 1. However, the outliers are now also generated from a Gaussian, with mean vector 20\cdot\mathbf{1}_{5} and covariance matrix 0.1I_{5}, where \mathbf{1}_{5} is the 5-dimensional vector of all 1's and I_{5} is the 5\times 5 identity matrix.

Experiment 4

We use the same simulation setting as Experiment 2. However, the outliers are now generated from the same scheme as in Experiment 3.
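A minimal sketch of the outlier-generation step shared by Experiments 3 and 4 is given below. The inlier clusters follow the Experiment 1 and 2 settings of the main text and are only stubbed here; the number of outliers shown is illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_out = 5, 100                                   # n_out is illustrative, not taken from the paper
outliers = rng.multivariate_normal(mean=20.0 * np.ones(p),
                                   cov=0.1 * np.eye(p),
                                   size=n_out)      # N(20 * 1_5, 0.1 * I_5) outliers
# The inliers would be appended from the Experiment 1 / 2 generating process
# described in Section 4, e.g. X = np.vstack([inliers, outliers]).
print(outliers.mean(axis=0))                        # close to (20, ..., 20)
```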

Figure 2: Results on simulation studies based on average ARI values. (a) Simulation 3; (b) Simulation 4.

D.2 Case Study on Real Data: KDDCUP

In this section, we assess performance on real data through analysis of the KDDCUP dataset (Alcalá et al., 2010), which consists of approximately 4.9M observations depicting connections of sequences of TCP packets. The features are normalized to have zero mean and unit standard deviation. The data contain 23 classes, of which the three largest account for 98% of the observations. Following Deshpande et al. (2020), the remaining 20 classes, consisting of 8752 points, are treated as outliers. We run all the algorithms as described at the beginning of Section 4. The parameters used for our algorithm are L=10000, \eta=1.02 and \alpha=1. We measure performance in terms of the ARI, as well as average precision and recall (Hastie et al., 2009); the last two indices are included following Deshpande et al. (2020). We report the average of these indices over 20 replications in Table 1. For all these indices, a higher value indicates superior performance. Table 1 shows trends similar to those discussed in Section 4 of the main text. In particular, MOMPKM recovers the ground-truth partition more closely than the state of the art. Surprisingly, RKMpp performs better than the other competitors (except MOMPKM), which was not the case for simulated data under ideal model assumptions. This is possibly because the data contain only 47 features compared to almost 5M samples, so that good seedings capitalize on the higher signal-to-noise ratio relative to the data used in the simulation studies.

Table 1: Results on KDDCUP Dataset
Index k.means BTKM RCC PAM RKMpp PKM MOMKM MOMPKM
ARI 10^{-5} 0.01 10^{-5} 10^{-16} 0.81 0.24 0.76 \mathbf{0.87}
Precision 0.25 0.23 0.19 0.23 0.64 0.43 0.56 \mathbf{0.71}
Recall 0.00 0.14 0.07 0.11 0.63 0.49 0.59 0.76
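For completeness, the evaluation step above can be reproduced along the following lines (a sketch only; the arrays and toy labelings below are ours, not the paper's code), standardizing the features and computing the ARI with scikit-learn.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-ins: X would hold the 47 KDDCUP features, labels_true the reference
# partition, and labels_pred the output of a clustering algorithm.
X = rng.normal(size=(1000, 47))
labels_true = rng.integers(0, 3, size=1000)
labels_pred = labels_true.copy()
flip = rng.random(1000) < 0.1                       # corrupt 10% of labels for illustration
labels_pred[flip] = rng.integers(0, 3, size=int(flip.sum()))

X_std = StandardScaler().fit_transform(X)           # zero mean, unit standard deviation
print("ARI:", adjusted_rand_score(labels_true, labels_pred))
```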

Appendix E Machine Specifications

The simulation studies were undertaken on an HP laptop with an Intel(R) Core(TM) i3-5010U 2.10 GHz processor, 4 GB RAM, and a 64-bit Windows 8.1 operating system, using the R and Python 3.7 programming languages. The real-data experiments were undertaken on a computing cluster with 656 cores (CPUs) spread across a number of nodes of varying hardware specifications; 256 of the cores are in the ‘low’ partition, and each node has 32 cores and 256 GB of RAM.