
Uniform Concentration Bounds toward a Unified Framework for Robust Clustering

Debolina Paul (Department of Statistics, Stanford University), Saptarshi Chakraborty (Department of Statistics, University of California, Berkeley), Swagatam Das (Electronics and Communication Sciences Unit, Indian Statistical Institute, Kolkata, India), and Jason Xu (Department of Statistical Science, Duke University).
Debolina Paul and Saptarshi Chakraborty are joint first authors and contributed equally to this work. Correspondence to: [email protected].
To appear in the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.
Abstract

Recent advances in center-based clustering continue to improve upon the drawbacks of Lloyd's celebrated k-means algorithm over 60 years after its introduction. Various methods seek to address poor local minima, sensitivity to outliers, and data that are not well-suited to Euclidean measures of fit, but many are supported largely empirically. Moreover, combining such approaches in a piecemeal manner can result in ad hoc methods, and the limited theoretical results supporting each individual contribution may no longer hold. Toward addressing these issues in a principled way, this paper proposes a cohesive robust framework for center-based clustering under a general class of dissimilarity measures. In particular, we present a rigorous theoretical treatment within a Median-of-Means (MoM) estimation framework, showing that it subsumes several popular k-means variants. In addition to unifying existing methods, we derive uniform concentration bounds that complete their analyses, and bridge these results to the MoM framework via Dudley's chaining arguments. Importantly, we neither require any assumptions on the distribution of the outlying observations nor on the relative number of observations n to features p. We establish strong consistency and an error rate of O(n^{-1/2}) under mild conditions, surpassing the best-known results in the literature. The methods are empirically validated thoroughly on real and synthetic datasets.

1 Introduction

Clustering is a fundamental task in unsupervised learning, which seeks to discover naturally occurring groups within a dataset. Among a plethora of algorithms in the literature, center-based methods remain widely popular, with k-means (MacQueen, 1967; Lloyd, 1982) as the most prominent example. Given n data points \mathcal{X}=\{\boldsymbol{X}_{i}:i=1,\dots,n\}\subset\mathbb{R}^{p}, k-means seeks to partition the data into k mutually exclusive and exhaustive groups by minimizing the within-cluster variance. Representing the cluster centroids \boldsymbol{\Theta}=\{\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{k}\}\subset\mathbb{R}^{p}, the k-means problem is formulated as the minimization of the objective function

f_{k\text{-means}}(\boldsymbol{\Theta})=\sum_{i=1}^{n}\min_{1\leq j\leq k}d(\boldsymbol{X}_{i},\boldsymbol{\theta}_{j}),

where d(\cdot,\cdot) is a dissimilarity measure. Taking the squared Euclidean distance d(\boldsymbol{x},\boldsymbol{y})=\|\boldsymbol{x}-\boldsymbol{y}\|_{2}^{2} yields the classical k-means formulation.

Clustering via k-means is well-documented to be sensitive to initialization (Vassilvitskii and Arthur, 2006), prone to getting stuck in poor local optima (Zhang et al., 1999; Xu and Lange, 2019; Chakraborty et al., 2020), and fragile against linearly non-separable clusters (Ng et al., 2001) or in the presence of even a single outlying observation (Klochkov et al., 2020). Researchers continue to tackle these drawbacks of k-means while preserving its simplicity and interpretability (Banerjee et al., 2005; Aloise et al., 2009; Teboulle, 2007; Ostrovsky et al., 2013; Zhang et al., 2020; Telgarsky and Dasgupta, 2012). Although often successful in practice, few of these approaches are grounded in rigorous theory or provide finite-sample statistical guarantees. The available consistency results are mostly asymptotic in nature (Chakraborty and Das, 2020; Chakraborty et al., 2020; Terada, 2014, 2015), lacking convergence rates or operating under restrictive assumptions on the relation between p and n (Paul et al., 2021a). For example, the recent large sample analysis (Paul et al., 2021a) of the hard Bregman k-means algorithm (Banerjee et al., 2005) obtains an asymptotic error rate of O(\sqrt{\log n/n}) under restrictive assumptions on the relation between n and p. The approach by Teboulle (2007) focuses on a unified framework from an optimization perspective, while Balcan et al. (2008); Telgarsky and Dasgupta (2012) focus on different divergence-based methods to better understand the underlying feature space.

The presence of outliers only further complicates matters. Outlying data are common in real applications, but many of the aforementioned approaches are fragile to deviations from the assumed data generating mechanism. On the other hand, recent work on robust clustering methods (Deshpande et al., 2020; Fischer et al., 2020; Klochkov et al., 2020) does not integrate the practical advances surveyed above, and similarly lacks finite-sample guarantees. To bridge this gap, the Median of Means (MoM) literature provides a promising and attractive framework to robustify center-based clustering methods against outliers. MoM estimators are not only insensitive to outliers, but are also equipped with exponential concentration results under the mild condition of finite variance (Lugosi and Mendelson, 2019; Lecué and Lerasle, 2020; Bartlett et al., 2002; Lerasle, 2019; Laforgue et al., 2019). Recently, near-optimal results for mean estimation (Minsker, 2018), classification (Lecué et al., 2020), regression (Mathieu and Minsker, 2021; Lugosi and Mendelson, 2019), clustering (Klochkov et al., 2020; Brunet-Saumard et al., 2022), bandits (Bubeck et al., 2013) and optimal transport (Staerman et al., 2021) have been established from this perspective.

Under the MoM lens, we propose a unified framework for robust center-based clustering. Our treatment considers a family of Bregman loss functions not restricted to the classical squared Euclidean loss. We explore exact sample error bounds for a general class of algorithms under mild regularity assumptions that apply to popular existing approaches, which we show to be special cases. The proposed framework allows for the data to be divided into two categories: the set of inliers (\mathcal{I}) and the set of outliers (\mathcal{O}). The inliers are assumed to be independent and identically distributed (i.i.d.), while absolutely no assumption is required on \mathcal{O}, allowing outliers to be unboundedly large, dependent on each other, sampled from a heavy-tailed distribution, and so on. Our contributions within the MoM framework make use of Rademacher complexities (Bartlett and Mendelson, 2002; Bartlett et al., 2005) and symmetrization arguments (Vapnik, 2013), powerful tools that often find use in the empirical process literature but, in our view, are underexplored in the context of robust clustering.

The paper is organized as follows. In Section 2, we identify a general centroid-based clustering framework which encompasses k-means (MacQueen, 1967), k-harmonic means (Zhang et al., 1999), and power k-means (Xu and Lange, 2019) as special cases, to name a few. We show how this framework is made robust via Median of Means estimation, yielding an array of center-based robust clustering methods. Within this framework, we derive uniform deviation bounds and concentration inequalities under standard regularity conditions in Section 3, by bounding the Rademacher complexity by metric entropy via Dudley's chaining argument. The analysis newly reveals the convergence rate for popularly used clustering methods such as k-harmonic means and power k-means, matching the known rate results for k-means, and elegantly carries over to their MoM counterparts. We then implement and empirically assess the resulting algorithms through simulated and real data experiments. In particular, we find that power k-means (Xu and Lange, 2019) under the MoM paradigm outperforms the state-of-the-art in the presence of outliers.

1.1 Related theoretical analyses of clustering

In seminal work, Pollard (1981) proved the strong consistency of k-means under a finite second moment assumption, spurring the large sample analysis of unsupervised clustering methods. This result has been extended to separable Hilbert spaces (Biau et al., 2008; Levrard, 2015) and to related algorithms (Chakraborty and Das, 2020; Terada, 2014, 2015), but these do not provide guarantees on the number of samples required so that the excess risk falls below a given threshold. Towards finding probabilistic error bounds, subsequent research derived uniform concentration results for k-means and its variants (Telgarsky and Dasgupta, 2012), sub-Gaussian distortion bounds for the k-medians problem (Brownlees et al., 2015), and an O(\sqrt{\log n/n}) bound on k-means with Bregman divergences (Paul et al., 2021b). More recently, concentration inequalities for k-means under the MoM paradigm have been established (Klochkov et al., 2020; Brunet-Saumard et al., 2022) under the restriction that the sample cluster centroids (\widehat{\boldsymbol{\Theta}}_{n} in Section 3) are assumed to be bounded. This paper shows how a number of center-based clustering methods can be brought under the same umbrella and can be robustified using a general-purpose scheme. The theoretical analyses of this broad spectrum of methods are conducted via Dudley's chaining arguments and through the aid of Rademacher complexity-based uniform concentration bounds. This approach enables us to replace assumptions on the sample cluster centroids by a bounded support assumption on the (inlying) data points, which yields an intuitive way of ensuring the boundedness of the cluster centroids through the obtuse angle property of Bregman divergences (Lemma 3.1 below). In contrast to the prior results, our analyses are not asymptotic in nature, so the derived bounds hold for all values of the model parameters.

2 Problem Setting and Proposed Method

We consider the problem of partitioning a set of n data points \mathcal{X}=\{\boldsymbol{X}_{i}:i=1,\dots,n\}\subset\mathbb{R}^{p} into k mutually exclusive clusters. In a center-based clustering framework, we represent the j^{\text{th}} cluster by its centroid \boldsymbol{\theta}_{j}\in\mathbb{R}^{p} for each j\in\{1,\dots,k\}. To quantify the notion of "closeness", we allow the dissimilarity measure to be any Bregman divergence. Recall any differentiable, convex function \phi:\mathbb{R}^{p}\to\mathbb{R} generates the Bregman divergence d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0} (\mathbb{R}_{\geq 0} denoting the set of non-negative reals) defined as

d_{\phi}(\boldsymbol{x},\boldsymbol{y})=\phi(\boldsymbol{x})-\phi(\boldsymbol{y})-\langle\nabla\phi(\boldsymbol{y}),\boldsymbol{x}-\boldsymbol{y}\rangle.

For instance, \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} generates the squared Euclidean distance. Without loss of generality, one may assume \phi(\mathbf{0})=\nabla\phi(\mathbf{0})=0. In this paradigm, clustering is achieved by minimizing the objective

\frac{1}{n}\sum_{i=1}^{n}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{k})\right):=f_{\boldsymbol{\Theta}}(\boldsymbol{X}). (1)

Here \Psi_{\boldsymbol{\alpha}}:\mathbb{R}^{k}_{\geq 0}\to\mathbb{R}_{\geq 0} is a component-wise non-decreasing function (such as a generalized mean) of the dissimilarities \{d_{\phi}(\boldsymbol{X},\boldsymbol{\theta}_{j})\}_{j=1}^{k} which satisfies \Psi_{\boldsymbol{\alpha}}(\mathbf{0})=0. The hyperparameter \boldsymbol{\alpha}\in\mathcal{A}\subseteq\mathbb{R}^{q} is specified by the user, and we will additionally assume that \Psi_{\boldsymbol{\alpha}} is Lipschitz. For intuition, we begin by showing how this setup includes several popular clustering methods.

Examples:

Suppose \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} and \Psi(\boldsymbol{x})=(\sum_{j=1}^{k}x_{j}^{-1})^{-1}. Then the objective (1) becomes \frac{1}{n}\sum_{i=1}^{n}(\sum_{j=1}^{k}\|\boldsymbol{X}_{i}-\boldsymbol{\theta}_{j}\|_{2}^{-2})^{-1}, which is the objective function of k-harmonic means clustering (Zhang et al., 1999). Now consider other generalized means: take \Psi_{s}(\boldsymbol{x})=M_{s}(\boldsymbol{x}) where we denote the power mean M_{s}(\boldsymbol{x})=(k^{-1}\sum_{i=1}^{k}x_{i}^{s})^{1/s} for s\in(-\infty,-1]. Then objective (1) coincides with the recent power k-means method (Xu and Lange, 2019), \frac{1}{n}\sum_{i=1}^{n}M_{s}\left(\|\boldsymbol{X}_{i}-\boldsymbol{\theta}_{1}\|_{2}^{2},\dots,\|\boldsymbol{X}_{i}-\boldsymbol{\theta}_{k}\|_{2}^{2}\right). When \Psi(\boldsymbol{x})=\min_{1\leq j\leq k}x_{j}, (1) recovers Bregman hard clustering \frac{1}{n}\sum_{i=1}^{n}\min_{1\leq j\leq k}d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{j}) proposed in (Banerjee et al., 2005) for any valid \phi, while the special case of \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} yields the familiar Euclidean k-means problem (MacQueen, 1967).
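For concreteness, the following minimal NumPy sketch evaluates objective (1) under the squared Euclidean divergence for the three choices of \Psi above (function names and vectorization choices are ours, not the released implementation):

import numpy as np

def sq_euclidean(X, Theta):
    # Pairwise dissimilarities d_phi(X_i, theta_j) with phi = ||.||_2^2; shape (n, k).
    return ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=2)

def psi_min(D):              # classical k-means: min_j d_j
    return D.min(axis=1)

def psi_harmonic(D):         # k-harmonic means: (sum_j d_j^{-1})^{-1}; assumes D > 0
    return 1.0 / (1.0 / D).sum(axis=1)

def psi_power(D, s=-3.0):    # power k-means: M_s(d_1, ..., d_k) with s <= -1; assumes D > 0
    return ((D ** s).mean(axis=1)) ** (1.0 / s)

def objective(X, Theta, psi):
    # Empirical risk (1): average of Psi over the n data points.
    return psi(sq_euclidean(X, Theta)).mean()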

In what follows, we derive concentration bounds that establish new theoretical guarantees such as consistency and convergence rates for clustering algorithms in this framework. These analyses lead us to a unified, robust framework by embedding this class of methods within the Median of Means (MoM) estimation paradigm. Via elegant connections between the properties of MoM estimators and Vapnik-Chervonenkis (VC) theory, our MoM estimators too will inherit uniform concentration inequalities from the preceding analysis, extending convergence guarantees to the robust setting.

Median of Means

Instead of directly minimizing the empirical risk (1), MoM begins by partitioning the data into L sets B_{1},\dots,B_{L} which each contain exactly b elements (discarding a few observations when n is not divisible by L). The partitions can be assigned uniformly at random, or can be shuffled throughout the algorithm (Lecué et al., 2020). MoM then entails a robust version of the estimator defined under (1) by instead minimizing the following objective with respect to \boldsymbol{\Theta}:

\text{MoM}_{L}^{n}(\boldsymbol{\Theta})=\text{Median}\left(\frac{1}{b}\sum_{i\in B_{1}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}),\dots,\frac{1}{b}\sum_{i\in B_{L}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\right). (2)

Intuitively, MoM estimators are more robust than their ERM counterparts since under mild conditions, only a subset of the partitions is contaminated by outliers while others are outlier-free. Taking the median over partitions negates the influence of these spurious partitions, thus reducing the effect of outliers. Formal analysis of robustness via breakdown points is also available for MoM estimators (Lecué and Lerasle, 2019; Rodriguez and Valdora, 2019); we will make use of the nice concentration properties of MoM estimators in Section 3.4.
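A minimal sketch of the MoM objective (2) along the same lines, assuming the partition into L blocks is drawn once up front (shuffling the blocks across iterations, as in Lecué et al. (2020), is a straightforward modification):

import numpy as np

def mom_objective(X, Theta, psi, L, rng=None):
    # Median of the per-block empirical risks in (2); returns the value and the
    # index of the median block. Uses the squared Euclidean divergence.
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    b = n // L                                  # block size; drops n mod L points
    blocks = rng.permutation(n)[: L * b].reshape(L, b)
    per_point = psi(((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=2))
    block_means = per_point[blocks].mean(axis=1)
    ell = int(np.argsort(block_means)[L // 2])  # (upper) median block
    return block_means[ell], ell

For example, mom_objective(X, Theta, psi_power, L=10) with psi_power from the previous sketch returns the quantity minimized in (2).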

Algorithm 1 MoM Clustering via Adagrad
Input: \boldsymbol{X}\in\mathbb{R}^{n\times p}, k, L, f_{\boldsymbol{\Theta}}(\cdot), \eta, \epsilon.
Output: The cluster centroids \boldsymbol{\Theta}.
Initialization: Randomly partition \{1,\dots,n\} into L partitions of equal length. Randomly choose k points without replacement from \mathcal{X} to initialize \boldsymbol{\Theta}_{0}.
repeat
     Step 1: Find \ell_{t}\in\{1,\dots,L\} such that \text{MoM}_{L}^{n}(\boldsymbol{\Theta}_{t})=\frac{1}{b}\sum_{i\in B_{\ell_{t}}}f_{\boldsymbol{\Theta}_{t}}(\boldsymbol{X}_{i}).
     Step 2: \boldsymbol{g}^{(t)}_{j}\leftarrow\frac{1}{b}\sum_{i\in B_{\ell_{t}}}\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\big|_{\boldsymbol{\Theta}=\boldsymbol{\Theta}_{t}} for j=1,\dots,k.
     Step 3: Update \boldsymbol{\Theta} by \boldsymbol{\theta}_{j}^{(t+1)}\leftarrow\boldsymbol{\theta}_{j}^{(t)}-\frac{\eta}{\sqrt{\epsilon+\sum_{t^{\prime}=1}^{t}\|\boldsymbol{g}_{j}^{(t^{\prime})}\|_{2}^{2}}}\boldsymbol{g}_{j}^{(t)}.
until objective (2) converges

Furthermore, optimizing (2) is made tractable via gradient-based methods. We advocate the Adagrad algorithm (Duchi et al., 2011; Goodfellow et al., 2016), whose updates are given by

\boldsymbol{\theta}_{j}^{(t+1)}\leftarrow\boldsymbol{\theta}_{j}^{(t)}-\frac{\eta}{\sqrt{\epsilon+\sum_{t^{\prime}=1}^{t}\|\boldsymbol{g}_{j}^{(t^{\prime})}\|_{2}^{2}}}\boldsymbol{g}_{j}^{(t)}

for hyperparameter \epsilon>0, learning rate \eta>0, and \boldsymbol{g}_{j}^{(t)} denoting a subgradient of \text{MoM}_{L}^{n}(\boldsymbol{\Theta}) at \boldsymbol{\Theta}_{t}. That is, if \ell_{t} denotes the median partition at step t, then

\nabla_{\boldsymbol{\theta}_{j}}\text{MoM}_{L}^{n}(\boldsymbol{\Theta}_{t})=\frac{1}{b}\sum_{i\in B_{\ell_{t}}}\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\big|_{\boldsymbol{\Theta}=\boldsymbol{\Theta}_{t}}.

For any \Psi_{\boldsymbol{\alpha}} differentiable,

\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{x})=\partial_{j}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}^{(t)}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}^{(t)})\right)\nabla_{\boldsymbol{\theta}_{j}}d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})\big|_{\boldsymbol{\theta}_{j}=\boldsymbol{\theta}_{j}^{(t)}}

by the chain rule. As a concrete illustration of the general formula, consider the power k-means objective f_{\boldsymbol{\Theta}}(\boldsymbol{x})=M_{s}(\|\boldsymbol{x}-\boldsymbol{\theta}_{1}\|_{2}^{2},\dots,\|\boldsymbol{x}-\boldsymbol{\theta}_{k}\|_{2}^{2}): upon partial differentiation, we obtain

\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{x})=\frac{2}{k}\left(\frac{1}{k}\sum_{j^{\prime}=1}^{k}\|\boldsymbol{x}-\boldsymbol{\theta}_{j^{\prime}}\|_{2}^{2s}\right)^{1/s-1}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2}^{2(s-1)}(\boldsymbol{\theta}_{j}-\boldsymbol{x}).
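A direct transcription of this gradient (a sketch under our naming; the point is assumed distinct from every centroid so that all distances are strictly positive):

import numpy as np

def power_kmeans_grad(x, Theta, s=-3.0):
    # Gradient of f_Theta(x) = M_s(||x - theta_1||^2, ..., ||x - theta_k||^2)
    # with respect to each theta_j; returns an array of shape (k, p).
    diffs = Theta - x                             # rows theta_j - x
    sq = (diffs ** 2).sum(axis=1)                 # ||x - theta_j||_2^2
    k = Theta.shape[0]
    outer = np.mean(sq ** s) ** (1.0 / s - 1.0)   # ((1/k) sum_j' ||x - theta_j'||^{2s})^{1/s - 1}
    weights = (2.0 / k) * outer * sq ** (s - 1.0)
    return weights[:, None] * diffs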

As another example, the classic k-means objective f_{\boldsymbol{\Theta}}(\boldsymbol{x})=\min_{\boldsymbol{\theta}\in\boldsymbol{\Theta}}\|\boldsymbol{x}-\boldsymbol{\theta}\|^{2}_{2} requires the sub-gradient

\nabla_{\boldsymbol{\theta}_{j}}f_{\boldsymbol{\Theta}}(\boldsymbol{x})=2(\boldsymbol{\theta}_{j}-\boldsymbol{x})\mathbbm{1}\{j\in\mathcal{J}(\boldsymbol{\Theta},\boldsymbol{x})\},

where \mathcal{J}(\boldsymbol{\Theta},\boldsymbol{x})=\mathop{\rm argmin}_{1\leq j\leq k}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|^{2}_{2} and \mathbbm{1}\{\cdot\} denotes the indicator function.

A pseudocode summary appears in Algorithm 1. This method tends to find clusters efficiently: for instance, we incur O(npk) per-iteration complexity in both the k-means and power k-means instances, which matches that of their original algorithms without robustifying via MoM (MacQueen, 1967; Lloyd, 1982; Xu and Lange, 2019). Of course, here we update \boldsymbol{\theta}_{j} by an adaptive gradient step rather than by a closed-form expression.
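Putting the pieces together, a compact sketch of Algorithm 1 for the Euclidean k-means instance follows (the k-means subgradient is written inline so the example is self-contained; the fixed partition, fixed iteration budget, and default hyperparameters are simplifications of ours):

import numpy as np

def mom_kmeans_adagrad(X, k, L, n_iter=200, eta=0.5, eps=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = n // L
    blocks = rng.permutation(n)[: L * b].reshape(L, b)              # partition B_1, ..., B_L
    Theta = X[rng.choice(n, size=k, replace=False)].astype(float)   # initialize from the data
    acc = np.zeros(k)                                               # running sum of ||g_j^{(t)}||_2^2

    for _ in range(n_iter):
        # Step 1: per-point loss, per-block means, and the median block ell_t
        d = ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(axis=2)
        block_means = d.min(axis=1)[blocks].mean(axis=1)
        ell = int(np.argsort(block_means)[L // 2])

        # Step 2: k-means subgradient on block ell_t: (1/b) sum_i 2(theta_j - X_i) 1{i assigned to j}
        idx = blocks[ell]
        assign = d[idx].argmin(axis=1)
        g = np.zeros((k, p))
        for j in range(k):
            pts = X[idx[assign == j]]
            if len(pts):
                g[j] = 2.0 * (Theta[j] - pts).sum(axis=0) / b

        # Step 3: Adagrad step with the per-centroid scalar denominator of Algorithm 1
        acc += (g ** 2).sum(axis=1)
        Theta -= (eta / np.sqrt(eps + acc))[:, None] * g
    return Theta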

As emphasized by an anonymous reviewer, one should note that the algorithms under the proposed framework may converge to sub-optimal local solutions under the proposed optimization scheme. As is the case for k-means and its variants, this possibility arises due to the non-convexity of the objective functions (1) and (2). A complete theoretical understanding from an optimization perspective for such methods is not yet fully developed, and global results are in general notoriously difficult to obtain. Having said that, it is worth noting that recent empirical analyses show that techniques such as annealing (Xu and Lange, 2019; Chakraborty et al., 2020) may be effective in circumventing this difficulty. Indeed, our experimental analysis (see Section 4) suggests that a robust version of power k-means using this annealing technique best overcomes this difficulty even in the presence of outliers.

3 Theoretical Analysis

Here we analyze properties of the proposed framework (1), with complete details of all proofs in the Appendix. Denoting by \mathcal{M} the set of all probability measures P with support on [-M,M]^{p}, i.e. P([-M,M]^{p})=1 for all P\in\mathcal{M}, we first make the standard assumption that the data are i.i.d. with bounded components (Ben-David, 2007; Chakraborty et al., 2020; Paul et al., 2021a).

A 1.

\boldsymbol{X}_{1},\dots,\boldsymbol{X}_{n}\overset{i.i.d.}{\sim}P such that P\in\mathcal{M}.

Let P_{n} be the empirical distribution based on the data \boldsymbol{X}_{1},\dots,\boldsymbol{X}_{n}. That is, for any Borel set A, P_{n}(A)=\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}\{\boldsymbol{X}_{i}\in A\}. For notational simplicity, we write \mu f:=\int fd\mu. Appealing to the Strong Law of Large Numbers (SLLN) (Athreya and Lahiri, 2006), P_{n}f_{\boldsymbol{\Theta}}\xrightarrow{a.s.}Pf_{\boldsymbol{\Theta}}. Let \widehat{\boldsymbol{\Theta}}_{n} be a (global) minimizer of (1) and \boldsymbol{\Theta}^{\ast} be the global minimizer of Pf_{\boldsymbol{\Theta}}. Since the functions P_{n}f_{\boldsymbol{\Theta}} and Pf_{\boldsymbol{\Theta}} are close to each other as n becomes large, we can expect that their respective minimizers, \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast}, are also close to one another. To show that \widehat{\boldsymbol{\Theta}}_{n} converges to \boldsymbol{\Theta}^{\ast} as n\to\infty, we consider bounding the uniform deviation \sup_{\boldsymbol{\Theta}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}|. Towards establishing such bounds, we will posit two regularity assumptions on \Psi_{\boldsymbol{\alpha}}(\cdot) and \phi(\cdot), beginning with a \tau_{\boldsymbol{\alpha},k}-Lipschitz condition on \Psi_{\boldsymbol{\alpha}}.

A 2.

For all \boldsymbol{\alpha}\in\mathcal{A} and any \boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{k}_{\geq 0}, we have |\Psi_{\boldsymbol{\alpha}}(\boldsymbol{x})-\Psi_{\boldsymbol{\alpha}}(\boldsymbol{y})|\leq\tau_{\boldsymbol{\alpha},k}\|\boldsymbol{x}-\boldsymbol{y}\|_{1}.

We also assume a weaker form of a standard condition that the gradient \nabla\phi(\cdot) is H_{p}-Lipschitz (Telgarsky and Dasgupta, 2013); unlike their work, note we do not additionally require strong convexity of \phi.

A 3.

There exists H_{p}\geq 0 such that \|\nabla\phi(\boldsymbol{x})-\nabla\phi(\boldsymbol{y})\|_{2}\leq H_{p}\|\boldsymbol{x}-\boldsymbol{y}\|_{2} for all \boldsymbol{x},\boldsymbol{y}\in[-M,M]^{p}.

These conditions are mild, and can be seen to hold for all of the aforementioned popular special cases. For instance, taking \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} yields the squared Euclidean distance, and we see that \nabla\phi(\boldsymbol{u})=2\boldsymbol{u} so that A3 is satisfied with constant H_{p}=2. Now, let \Psi(\boldsymbol{x})=\min_{1\leq j\leq k}x_{j} as in classical k-means, and denote j^{\ast}\in\mathop{\rm argmin}_{1\leq j\leq k}y_{j}. For any vectors \boldsymbol{x},\boldsymbol{y}\in\mathbb{R}_{\geq 0}^{k},

\Psi(\boldsymbol{x})-\Psi(\boldsymbol{y})=\min_{1\leq j\leq k}x_{j}-\min_{1\leq j\leq k}y_{j}=\min_{1\leq j\leq k}x_{j}-y_{j^{\ast}}\leq x_{j^{\ast}}-y_{j^{\ast}}\leq\|\boldsymbol{x}-\boldsymbol{y}\|_{1}.

Thus, \Psi is clearly non-negative and componentwise non-decreasing, and satisfies A2. Similarly, if we take \Psi_{s}(\boldsymbol{x})=M_{s}(\boldsymbol{x}), the conditions are met for power k-means: it is again non-negative and non-decreasing in its components, and satisfies A2 with \tau_{\boldsymbol{\alpha},k}=k^{-1/s} due to Beliakov et al. (2010).

3.1 Bounds on \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast}

Toward proving that \widehat{\boldsymbol{\Theta}}_{n} converges to \boldsymbol{\Theta}^{\ast}, we will need to show that both \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast} lie in [-M,M]^{k\times p}. To this end, we first establish the obtuse angle property for Bregman divergences.

Lemma 3.1.

Let \mathcal{C} be a convex set and let P_{\mathcal{C}}(\boldsymbol{\theta}) be the projection of \boldsymbol{\theta} onto \mathcal{C} with respect to the Bregman divergence d_{\phi}(\cdot,\cdot), i.e. P_{\mathcal{C}}(\boldsymbol{\theta})=\arg\min_{\boldsymbol{x}\in\mathcal{C}}d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}) (assuming it exists). Then,

d_{\phi}(\boldsymbol{x},\boldsymbol{\theta})\geq d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}))+d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}),\boldsymbol{\theta}),\quad\text{for all }\boldsymbol{x}\in\mathcal{C}.
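For intuition, the inequality can be seen by combining a three-point identity for Bregman divergences with the first-order optimality condition of the projection (a sketch; the full proof is in the Appendix):

d_{\phi}(\boldsymbol{x},\boldsymbol{\theta})-d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}))-d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}),\boldsymbol{\theta})=\langle\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta}),\,\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0,

where the final inequality holds because P_{\mathcal{C}}(\boldsymbol{\theta}) minimizes d_{\phi}(\cdot,\boldsymbol{\theta}) over the convex set \mathcal{C} and \boldsymbol{x}\in\mathcal{C} is feasible.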

We next show that to minimize P_{n}f_{\boldsymbol{\Theta}} or Pf_{\boldsymbol{\Theta}}, it is enough to restrict the search to [-M,M]^{k\times p}.

Lemma 3.2.

Let A2 hold, and Q\in\mathcal{M}. Let d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0} be a Bregman divergence. Then for any \boldsymbol{\Theta}\in\mathbb{R}^{k\times p}, there exists \boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p} such that Qf_{\boldsymbol{\Theta}^{\prime}}\leq Qf_{\boldsymbol{\Theta}}.

Since we can restrict our attention to [-M,M]^{k\times p} to minimize Qf_{\boldsymbol{\Theta}}, we have the following:

Corollary 3.1.

Let Q\in\mathcal{M} and d_{\phi}:\mathbb{R}^{p}\times\mathbb{R}^{p}\to\mathbb{R}_{\geq 0} be a Bregman divergence. If \boldsymbol{\Theta}_{0}=\arg\min_{\boldsymbol{\Theta}\in\mathbb{R}^{k\times p}}\int f_{\boldsymbol{\Theta}}dQ, then \boldsymbol{\Theta}_{0}\in[-M,M]^{k\times p}.

Now note under A1 both P and P_{n} have support contained in [-M,M]^{p}. The following corollary is thus implied by replacing Q by P and P_{n}.

Corollary 3.2.

Under A1–A3, both \widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast}\in[-M,M]^{k\times p}.

Now that we have bounded \widehat{\boldsymbol{\Theta}}_{n} and \boldsymbol{\Theta}^{\ast} in a compact set, the following section supplies probabilistic bounds on the uniform deviation 2\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}| via metric entropy arguments.

3.2 Concentration Inequality and Metric Entropy Bounds via Rademacher Complexity

We have proven that \widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast}\in[-M,M]^{k\times p}. To bound the difference |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|, we observe

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}| = Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}} = Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-P_{n}f_{\widehat{\boldsymbol{\Theta}}_{n}}+P_{n}f_{\widehat{\boldsymbol{\Theta}}_{n}}-P_{n}f_{\boldsymbol{\Theta}^{\ast}}+P_{n}f_{\boldsymbol{\Theta}^{\ast}}-Pf_{\boldsymbol{\Theta}^{\ast}}
\leq Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-P_{n}f_{\widehat{\boldsymbol{\Theta}}_{n}}+P_{n}f_{\boldsymbol{\Theta}^{\ast}}-Pf_{\boldsymbol{\Theta}^{\ast}} \leq 2\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}|. (3)

Thus, to bound |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|, it is enough to prove bounds on \sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|P_{n}f_{\boldsymbol{\Theta}}-Pf_{\boldsymbol{\Theta}}|. The main idea here is to bound this uniform deviation via Rademacher complexity, and in turn bound the Rademacher complexity itself (Dudley, 1967; Mohri et al., 2018). Let \mathcal{F}=\{f_{\boldsymbol{\Theta}}:\boldsymbol{\Theta}\in[-M,M]^{k\times p}\}, and denote the \mathcal{F}-norm (Athreya and Lahiri, 2006) between two probability measures \mu and \nu as \|\mu-\nu\|_{\mathcal{F}}=\sup_{f\in\mathcal{F}}|\int fd\mu-\int fd\nu|. We recall the definitions of Rademacher complexity and covering number as follows.

Definition 1.

Let \epsilon_{i}'s be i.i.d. Rademacher random variables independent of \mathcal{X}=\{\boldsymbol{X}_{1},\dots,\boldsymbol{X}_{n}\}, i.e. \mathbbm{P}(\epsilon_{i}=1)=\mathbbm{P}(\epsilon_{i}=-1)=0.5. The population Rademacher complexity of \mathcal{F} is defined as \mathcal{R}_{n}(\mathcal{F})=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i}), where the expectation is over both \boldsymbol{\epsilon} and \mathcal{X}.

Definition 2.

(\delta-cover and Covering Number) Let (X,d) be a metric space. The set X_{\delta}\subseteq X is said to be a \delta-cover of X if for all x\in X, there exists x^{\prime}\in X_{\delta} such that d(x,x^{\prime})\leq\delta. The \delta-covering number of X w.r.t. d, denoted by N(\delta;X,d), is the size of the smallest \delta-cover of X with respect to d.

The following Lemma gives a bound for the \delta-covering number of \mathcal{F} under the supremum norm. The main idea is to use the Lipschitz property of f_{\boldsymbol{\Theta}} and then to find a cover of the search space for \boldsymbol{\Theta}, i.e. [-M,M]^{k\times p}. This then automatically translates to a cover of \mathcal{F} under the sup-norm.

Lemma 3.3.

Let N(\delta;\mathcal{F},\|\cdot\|_{\infty}) be the \delta-covering number of \mathcal{F} under \|\cdot\|_{\infty}. Then, under A1–A3,

N(\delta;\mathcal{F},\|\cdot\|_{\infty})\leq\left(\max\left\{\left\lfloor\frac{8M^{2}\tau_{\boldsymbol{\alpha},k}H_{p}kp}{\delta}\right\rfloor,1\right\}\right)^{kp}.

To bound the Rademacher complexity, we will also need to bound the diameter of \mathcal{F} under \|\cdot\|_{\infty}:

Lemma 3.4.

Let \text{diam}(\mathcal{F})=\sup_{f,g\in\mathcal{F}}\|f-g\|_{\infty}. Then, under A1–A3, \text{diam}(\mathcal{F})\leq 8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp.

We are now ready to state the bound on the Rademacher complexity \mathcal{R}_{n}(\mathcal{F}) in Theorem 3.1. We provide a sketch of the argument below, with the complete proof details available in the Appendix.

Theorem 3.1.

Under A1–A3, \mathcal{R}_{n}(\mathcal{F})\leq 48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}.

Proof Sketch

The main idea of the proof is to use Dudley's chaining argument (Dudley, 1967; Wainwright, 2019) to bound the Rademacher complexity in terms of an integral involving the metric entropy \log N(\delta;\mathcal{F},\|\cdot\|_{\infty}). Let \Delta=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp, the diameter bound of Lemma 3.4. Using the chaining approach, one can show that

\mathcal{R}_{n}(\mathcal{F})\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}\,d\epsilon\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\max\left\{\frac{\Delta}{\epsilon},1\right\}\right)}\,d\epsilon (4)
=\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\frac{\Delta}{\epsilon}\right)}\,d\epsilon
=48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}.

Here the second inequality (4) follows from Lemma 3.3.
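To make the final equality explicit, the entropy integral admits a closed form under the substitution \epsilon=\Delta e^{-u^{2}} (a quick check in the notation above):

\int_{0}^{\Delta}\sqrt{\log(\Delta/\epsilon)}\,d\epsilon=2\Delta\int_{0}^{\infty}u^{2}e^{-u^{2}}\,du=\frac{\sqrt{\pi}}{2}\Delta,\qquad\text{so that}\qquad\frac{12\sqrt{kp}}{\sqrt{n}}\cdot\frac{\sqrt{\pi}}{2}\Delta=48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}

upon substituting \Delta=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp.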

Before we proceed, we state the following Lemma, which gives a uniform bound on \|f\|_{\infty} for f\in\mathcal{F}.

Lemma 3.5.

For all \boldsymbol{x}\in[-M,M]^{p} and \boldsymbol{\Theta}\in[-M,M]^{k\times p},

0\leq\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk.

Having established a bound on the Rademacher complexity \mathcal{R}_{n}(\mathcal{F}), we are now ready to establish a uniform concentration inequality on \|P_{n}-P\|_{\mathcal{F}}\triangleq\sup_{f\in\mathcal{F}}|P_{n}f-Pf| in the following Theorem, proven in the Appendix.

Theorem 3.2.

Under A1–A3, with probability at least 1-\delta, the following holds.

\|P_{n}-P\|_{\mathcal{F}}\leq 96\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}.

The result in Theorem 3.2 reveals a non-asymptotic bound on |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|:

Corollary 3.3.

Under A1–A3, with probability at least 1-\delta, the following holds.

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 192\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}.
Proof.

From equation (3), we know that |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 2\|P_{n}-P\|_{\mathcal{F}}. The corollary follows immediately by application of Theorem 3.2. ∎

3.3 Inference for Fixed p: Strong and \sqrt{n}-Consistency

We now discuss the classical domain where p is kept fixed (Pollard, 1981; Chakraborty et al., 2020; Terada, 2014) and show that the results above imply strong and \sqrt{n}-consistency, mirroring some known results for existing methods such as k-means. We first solidify the notion of convergence of the centroids \widehat{\boldsymbol{\Theta}}_{n} to \boldsymbol{\Theta}^{\ast}, following Pollard (1981). Since centroids are unique only up to label permutations, our notion of dissimilarity

\text{diss}(\boldsymbol{\Theta}_{1},\boldsymbol{\Theta}_{2})=\min_{M\in\mathcal{P}_{k}}\|\boldsymbol{\Theta}_{1}-M\boldsymbol{\Theta}_{2}\|_{F}

is considered over \mathcal{P}_{k}, the set of all k\times k permutation matrices, where \|\cdot\|_{F} denotes the Frobenius norm. Now, we say that the sequence \boldsymbol{\Theta}_{n} converges to \boldsymbol{\Theta} if \lim_{n\to\infty}\text{diss}(\boldsymbol{\Theta}_{n},\boldsymbol{\Theta})=0. We begin by imposing the following standard identifiability condition (Pollard, 1981; Terada, 2014; Chakraborty et al., 2020) on P for our analysis.
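Since k is typically small, diss can be computed by brute force over the k! label permutations (a sketch; the Hungarian algorithm would be preferable for larger k):

import numpy as np
from itertools import permutations

def diss(Theta1, Theta2):
    # Minimum Frobenius distance between centroid matrices over row permutations of Theta2.
    k = Theta1.shape[0]
    return min(
        np.linalg.norm(Theta1 - Theta2[list(perm)])
        for perm in permutations(range(k))
    )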

A 4.

For all \eta>0, there exists \epsilon>0 such that Pf_{\boldsymbol{\Theta}}>Pf_{\boldsymbol{\Theta}^{\ast}}+\epsilon whenever \text{diss}(\boldsymbol{\Theta},\boldsymbol{\Theta}^{\ast})>\eta.

We now investigate the strong consistency properties of \widehat{\boldsymbol{\Theta}}_{n}, and also the rate at which |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}| converges to 0. Theorem 3.3 states that indeed strong consistency holds, with convergence rate O(n^{-1/2}). Note that this rate is faster than that found previously by Paul et al. (2021a). Before we proceed, recall we say that X_{n}=O_{P}(a_{n}) if the sequence X_{n}/a_{n} is tight (Athreya and Lahiri, 2006).

Theorem 3.3.

(Strong consistency and \sqrt{n}-consistency) If p is kept fixed, then under A1–A4, \widehat{\boldsymbol{\Theta}}_{n}\xrightarrow{a.s.}\boldsymbol{\Theta}^{\ast} under P. Moreover, |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}(n^{-1/2}).

Note: Here A_{n}=O_{P}(a_{n}) means that \{A_{n}/a_{n}\}_{n\in\mathbb{N}} is stochastically bounded, i.e. tight.

3.4 Inference Under the MoM Framework

Theorem 3.3 and the bounds we present above already establish novel statistical results that pertain to methods under (1), such as power k-means, in the "uncontaminated" setting. In this section, we now extend these findings to the MoM setup under (2) in order to understand their behavior in the presence of outliers. Recall that in this setting, the data are partitioned into L equally sized blocks; without loss of generality, we take n=L\cdot b. We denote the set of all inliers by \{\boldsymbol{X}_{i}\}_{i\in\mathcal{I}} and the outliers by \{\boldsymbol{X}_{i}\}_{i\in\mathcal{O}}. Now let \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})} denote the minimizer of (2): towards establishing the error rate at which |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}| goes to 0, we assume the following:

A 5.

\{\boldsymbol{X}_{i}\}_{i\in\mathcal{I}}\overset{\text{i.i.d.}}{\sim}P with P\in\mathcal{M}.

A 6.

There exists \eta>0 such that L>(2+\eta)|\mathcal{O}|.

We remark A5 is identical to A1, but imposed only on the inlying observations. A6 ensures at least half of the L partitions are free of outliers (since each outlier can contaminate at most one partition); note this is much weaker than requiring L>4|\mathcal{O}| as is done in recent work (Lecué et al., 2020). Importantly, we emphasize that no distributional assumptions regarding the outlying observations are made, allowing them to be unbounded, generated from heavy-tailed distributions, or dependent among each other. Proofs of the following results appear in the Appendix.

As a point of departure, we first establish that \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}\in[-M,M]^{k\times p}.

Lemma 3.6.

Let A2–A3 and A5–A6 hold. Then for any \boldsymbol{\Theta}\in\mathbb{R}^{k\times p}, there exists \boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p} such that \text{MoM}_{L}^{n}(\boldsymbol{\Theta}^{\prime})\leq\text{MoM}_{L}^{n}(\boldsymbol{\Theta}).

Again, we may restrict the search space for finding \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})} to [-M,M]^{k\times p} due to Lemma 3.6.

Corollary 3.4.

Under A2–A3 and A5–A6, \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}\in[-M,M]^{k\times p}.

Similarly to Section 3.2, we derive a uniform bound on \sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}| and then bound |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}| in turn. For brevity we define \delta:=2/(4+\eta)-|\mathcal{O}|/L, and use the notation "\lesssim" to suppress absolute constants. The uniform deviation bound is as follows, with the complete proof appearing in the Appendix.

Theorem 3.4.

Under A2–A3 and A5–A6, with probability at least 1-2e^{-2L\delta^{2}}, the following holds.

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|\lesssim\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

We give a brief outline of our proof for Theorem 3.4 below, with details appearing in the Appendix.

Proof sketch

We begin by noting that if

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}>\frac{L}{2},

then

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}}))>\epsilon,

where P_{B_{\ell}} denotes the empirical distribution of \{\boldsymbol{X}_{i}\}_{i\in B_{\ell}}. It therefore suffices to bound the quantity

\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}>\frac{L}{2}\right).

Introducing the function \varphi(t)=(t-1)\mathbbm{1}\{1\leq t\leq 2\}+\mathbbm{1}\{t>2\}, we begin by bounding outlier-free partitions, making use of the inequalities \mathbbm{1}\{t\geq 2\}\leq\varphi(t)\leq\mathbbm{1}\{t\geq 1\}. We then proceed by bounding

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}

by the sum \xi_{1}+\xi_{2}+|\mathcal{O}|, where

\xi_{1}=\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right),\quad\text{and}
\xi_{2}=\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\left[\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right].

We bound \xi_{1} by appealing to Hoeffding's inequality, while we show that \xi_{2} can be bounded by applying the bounded difference inequality together with the result of Theorem 3.1 to bound the resulting Rademacher complexity.

The corollary below follows from this uniform bound, giving non-asymptotic control over the difference |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}| in terms of the model parameters.

Corollary 3.5.

Under A2–A3 and A5–A6, with probability at least 1-2e^{-2L\delta^{2}}, the following holds.

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\lesssim\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

3.5 Inference for Fixed k and p Under the MoM Framework

We now focus our attention back to the classical setting where the numbers of clusters and features remain fixed. To show that \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})} is consistent for \boldsymbol{\Theta}^{\ast}, we need to impose conditions such that the RHS of the bound presented in Corollary 3.5 decreases to 0 as n\to\infty. We state the required conditions as follows.

A 7.

The number of partitions L=o(n), and L\to\infty as n\to\infty.

These conditions are natural: as n grows, so too must L in order to maintain a proportion of outlier-free partitions. On the other hand, L must grow slowly relative to n to ensure each partition can be assigned sufficiently many datapoints. We note that A7 implies |\mathcal{O}|=o(n), an intuitive and standard condition (Lecué et al., 2020; Staerman et al., 2021; Paul et al., 2021b) as outliers should be few by definition.

In the following corollary, we focus on the squared Euclidean distance for center-based clustering under the MoM framework. We show that the (global) estimates obtained from MoM k-means, MoM power k-means, etc. are consistent. We stress that the obtained convergence rate, as a function of n, L and |\mathcal{I}|, does not depend on the choice of \Psi_{\boldsymbol{\alpha}}(\cdot), as long as it satisfies A2, i.e. is Lipschitz. In particular, replacing \Psi_{\boldsymbol{\alpha}}(\cdot) with \min_{1\leq j\leq k}x_{j}, M_{s}(\boldsymbol{x}) and (\sum_{j=1}^{k}x_{j}^{-1})^{-1}, the rate in Corollary 3.6 applies to robustified MoM versions of k-means, power k-means and k-harmonic means alike.

Corollary 3.6.

Suppose \phi(\boldsymbol{u})=\|\boldsymbol{u}\|_{2}^{2} and \Psi_{\boldsymbol{\alpha}}(\cdot) satisfy A2. Then under A2–A3 and A5–A7,

|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}\left(\max\left\{L^{1/2}n^{-1/2},n^{-1}\sqrt{|\mathcal{I}|}\right\}\right)

and Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}\xrightarrow{P}Pf_{\boldsymbol{\Theta}^{\ast}}. Moreover, whenever A4 additionally holds, we have \widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}\xrightarrow{P}\boldsymbol{\Theta}^{\ast}.

Remark

These results imply that any MoM center-based algorithm under our paradigm admits a convergence rate of O\left(\max\left\{L^{1/2}n^{-1/2},n^{-1}\sqrt{|\mathcal{I}|}\right\}\right) when equipped with the squared Euclidean distance. Note that as L\geq 1, \max\left\{L^{1/2}n^{-1/2},n^{-1}\sqrt{|\mathcal{I}|}\right\}=\Omega\left(n^{-1/2}\right). Thus, the convergence rates for MoM variants in our framework are generally slower than their ERM counterparts, for which the rate is O(n^{-1/2}). This is unsurprising as MoM operates on outlier-contaminated data; there is "no free lunch" in trading off robustness for rate of convergence. However, if the number of partitions L grows slowly relative to n (say, L=O(\log n) so that |\mathcal{O}|=O(\log n)), then the convergence rates for MoM estimation become comparable to the ERM counterparts at \widetilde{O}(n^{-1/2}).
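For instance, plugging L=O(\log n) into the rate of Corollary 3.6 and using |\mathcal{I}|\leq n gives

\max\left\{\sqrt{L/n},\,\sqrt{|\mathcal{I}|}/n\right\}=\max\left\{O\big(\sqrt{\log n/n}\big),\,O\big(n^{-1/2}\big)\right\}=\widetilde{O}(n^{-1/2}).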

4 Empirical Studies and Performance

To validate and assess our proposed framework, we now turn to an empirical comparison of the proposed and peer clustering methods. We evaluate clustering quality under the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985), with values ranging between 0 and 1, where 1 denotes a perfect match with the ground truth. Though it is not feasible to exhaustively survey center-based clustering methods, we consider a broad range of competitors, comparing to k-means (MacQueen, 1967), Partition Around Medoids (PAM) (Kaufman and Rousseeuw, 2009), k-medians (Jain, 2010), Robust k-means++ (RKMpp) (Deshpande et al., 2020), Robust Continuous Clustering (RCC) (Shah and Koltun, 2017), Bregman Trimmed k-means (BTKM) (Fischer et al., 2020), MoM k-means (MOMKM) (Klochkov et al., 2020) and a novel MoM variant of power k-means (MOMPKM) implied under the proposed framework. An open-source implementation of the proposed method and code for reproducing the simulation experiments are available at https://github.com/SaptarshiC98/MOMPKM.

We consider two thorough simulated experiments below, with additional simulations and large-scale real data results in the Appendix. While the extended comparisons are omitted for space considerations, they convey the same trends as the studies below. In all settings, we generate data in p=5 dimensions, varying the number of clusters and the outlier percentages. True centers are spaced uniformly on a grid with \theta_{k}=\frac{k-1}{10}, and observations are drawn from Gaussians around their ground truth centers with variance 0.1. Because we generate Gaussian data, we focus here on the Euclidean case, and do not consider other Bregman divergences in the present empirical study. In all experiments, we take the number of partitions L to be roughly double the number of outliers, and set the hyperparameters \eta and \alpha to be 1.02 and 1 respectively by default. Results are averaged over 20 random restarts, and all competitors are initialized and tuned according to the standard implementations described in their original papers.
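A sketch of this generating process (our reading of the description above: each true center is placed at the constant vector ((j-1)/10,\dots,(j-1)/10), the noise variance 0.1 is interpreted per coordinate, and the outlier count is taken as a fraction of the inlying points):

import numpy as np

def make_data(k=20, n_per=30, outlier_frac=0.25, p=5, seed=0):
    rng = np.random.default_rng(seed)
    centers = np.array([[(j - 1) / 10.0] * p for j in range(1, k + 1)])
    X_in = np.vstack([c + rng.normal(scale=np.sqrt(0.1), size=(n_per, p)) for c in centers])
    labels = np.repeat(np.arange(k), n_per)
    n_out = int(outlier_frac * len(X_in))                 # e.g. 25% outliers
    lo, hi = X_in.min(axis=0), X_in.max(axis=0)
    X_out = rng.uniform(lo, hi, size=(n_out, p))          # uniform on the range of inliers
    return np.vstack([X_in, X_out]), labels, n_out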

Experiment 1: increasing the number of clusters

The first experiment assesses performance as the number of true clusters grows, while keeping the proportion of outliers fixed at 25\%. Datasets are generated with the number of clusters k varying between 3 and 100. For each setting, we create balanced clusters drawing 30 points from each true center. The 25\% outliers are generated from a uniform distribution with support on the range of the inlying observations. We repeat this data generation process 30 times under each parameter setting.

The average ARI values at convergence, plotted against the number of clusters along with error bars (\pm standard deviations), are shown in the left panel of Figure 1. We see that the robustified version of power k-means implied by our framework (labeled MOMPKM) achieves the best performance here. This may be unsurprising as the recent power k-means method was shown to significantly reduce sensitivity to local minima (which tend to increase with k), while MoM further protects the algorithm from outlier influence.

Figure 1: Average ARI values, along with error bars (\pm sd), comparing the peer algorithms in Experiments 1 (left) and 2 (right). MOMPKM remains relatively stable even when k or outlier percentages increase, maintaining the best performance among peer methods.

Experiment 2: increasing outlier percentage

Following the same data generation process in Experiment 1, we now fix k=20 while varying the outlier percentage from 0\% to 50\%. For each parameter setting, we again replicate the experiment 30 times. Average ARI values comparing the inlying observations to their ground truths are plotted in the right panel of Figure 1. Not only does this study convey similar trends as Experiment 1, but we see that competing methods continue to deteriorate with increasing outliers as one might expect, while MOMPKM remarkably remains relatively stable.

We see that the ERM-based methods such as BTKM, MOMKM, and PAM struggle when there is a large number of clusters. Similarly, methods such as k-medians and RKMpp often stop short at poor local optima, which is quickly exacerbated by outliers despite initialization via clever seeding techniques. These phenomena are consistent with what has been reported in the literature for their non-robust counterparts (Xu and Lange, 2019; Chakraborty et al., 2020). Overall, the empirical study suggests that a robust version of power k-means clustering under the MoM framework shows promise to handle several data challenges at once, in line with our theoretical analysis.

5 Discussion

In this paper, we proposed a paradigm that unifies a suite of center-based clustering algorithms. Under this view, we developed a simple yet efficient framework for robust versions of such algorithms by appealing to the Median of Means (MoM) philosophy. Using gradient-based methods, the MoM objectives can be solved with the same per-iteration complexity as Lloyd's k-means algorithm, largely retaining its simplicity. Importantly, we derive a thorough analysis of the statistical properties and convergence rates by establishing uniform concentration bounds under i.i.d. sampling of the data. These novel theoretical contributions demonstrate how arguments utilizing Rademacher complexities and Dudley's chaining arguments can be leveraged in the robust clustering context. As a result, we are able to obtain error rates that do not require asymptotic assumptions, nor restrictions on the relation between n and p. These findings recover asymptotic results such as strong consistency and \sqrt{n}-consistency under classical assumptions.

As shown in the paper, the robustness of MoM estimators comes at the cost of slower convergence rates compared to their ERM counterparts. We emphasize that there is no "median-of-means magic", and that the efficacy of MoM depends on the interplay between the partitions and the outliers. If the number of partitions circumvents the impact of the outliers, the performance of MoM clustering estimates under our framework scales with the block size b as 1/\sqrt{b}=\sqrt{L/n}. Since L can be chosen to be approximately 2|\mathcal{O}|, the obtained error rate is roughly O(\sqrt{|\mathcal{O}|/n}). If |\mathcal{O}| scales proportionately with n, however, the error bound of O(\sqrt{|\mathcal{O}|/n}) becomes meaningless. For our consistency results to hold, it is crucial that |\mathcal{O}|=o(n), which in turn allows us to choose L satisfying condition A7. If |\mathcal{O}|=O(n^{\beta}) for some 0<\beta<1, the error rate is O(n^{(\beta-1)/2}).

This suggests possible future research directions in improving these ERM rates via finding so-called "fast rates" under additional assumptions (Boucheron et al., 2005; Wainwright, 2019). Moreover, it will be fruitful to extend the results to noise distributions that satisfy only moment conditions. Recent works in convex clustering (Tan and Witten, 2015; Chakraborty and Xu, 2020) have considered sub-Gaussian models to obtain error rates. The work of Biau et al. (2008) and recent work of Klochkov et al. (2020) obtain similar error rates under finite second-moment conditions using the assumption that the cluster centroids \widehat{\boldsymbol{\Theta}}_{n} are bounded, and it may be possible to extend our approach using local Rademacher complexities (Bartlett et al., 2005) to relax the bounded support assumption. One can also seek lower bounds on the approximation error or explore high-dimensional robust center-based clustering under the proposed paradigm. Finally, we have not explored the implementation and empirical performance of Bregman versions of our MoM estimator, for instance with application to data arising from mixtures of exponential families other than the Gaussian case. These interesting directions remain open for future work.

References

  • Alcalá et al., (2010) Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., and Herrera, F. (2010). Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3):255–287.
  • Aloise et al., (2009) Aloise, D., Deshpande, A., Hansen, P., and Popat, P. (2009). NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248.
  • Athreya and Lahiri, (2006) Athreya, K. B. and Lahiri, S. N. (2006). Measure theory and probability theory. Springer Science & Business Media.
  • Balcan et al., (2008) Balcan, M.-F., Blum, A., and Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 671–680.
  • Banerjee et al., (2005) Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., and Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).
  • Bartlett et al., (2002) Bartlett, P. L., Boucheron, S., and Lugosi, G. (2002). Model selection and error estimation. Machine Learning, 48(1):85–113.
  • Bartlett et al., (2005) Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities. The Annals of Statistics, 33(4):1497–1537.
  • Bartlett and Mendelson, (2002) Bartlett, P. L. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482.
  • Beliakov et al., (2010) Beliakov, G., Calvo, T., and James, S. (2010). On Lipschitz properties of generated aggregation functions. Fuzzy Sets and Systems, 161(10):1437–1447.
  • Ben-David, (2007) Ben-David, S. (2007). A framework for statistical clustering with constant time approximation algorithms for k-median and k-means clustering. Machine Learning, 66(2):243–257.
  • Biau et al., (2008) Biau, G., Devroye, L., and Lugosi, G. (2008). On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790.
  • Boucheron et al., (2005) Boucheron, S., Bousquet, O., and Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375.
  • Brownlees et al., (2015) Brownlees, C., Joly, E., and Lugosi, G. (2015). Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507–2536.
  • Brunet-Saumard et al., (2022) Brunet-Saumard, C., Genetay, E., and Saumard, A. (2022). K-bMOM: A robust Lloyd-type clustering algorithm based on bootstrap median-of-means. Computational Statistics & Data Analysis, 167:107370.
  • Bubeck et al., (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
  • Chakraborty and Das, (2020) Chakraborty, S. and Das, S. (2020). Detecting meaningful clusters from high-dimensional data: A strongly consistent sparse center-based clustering approach. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Chakraborty et al., (2020) Chakraborty, S., Paul, D., Das, S., and Xu, J. (2020). Entropy weighted power k-means clustering. In International Conference on Artificial Intelligence and Statistics, pages 691–701. PMLR.
  • Chakraborty and Xu, (2020) Chakraborty, S. and Xu, J. (2020). Biconvex clustering. arXiv preprint arXiv:2008.01760.
  • Deshpande et al., (2020) Deshpande, A., Kacham, P., and Pratap, R. (2020). Robust kk-means++. In Conference on Uncertainty in Artificial Intelligence, pages 799–808. PMLR.
  • Duchi et al., (2011) Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159.
  • Dudley, (1967) Dudley, R. M. (1967). The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330.
  • Fischer et al., (2020) Fischer, A., Levrard, C., and Brécheteau, C. (2020). Robust Bregman clustering. Annals of Statistics.
  • Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. (2016). Deep learning, volume 1. MIT press Cambridge.
  • Hastie et al., (2009) Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media.
  • Hubert and Arabie, (1985) Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193–218.
  • Jain, (2010) Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666.
  • Kaufman and Rousseeuw, (2009) Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons.
  • Klochkov et al., (2020) Klochkov, Y., Kroshnin, A., and Zhivotovskiy, N. (2020). Robust kk-means clustering for distributions with two moments. arXiv preprint arXiv:2002.02339.
  • Laforgue et al., (2019) Laforgue, P., Clémençon, S., and Bertail, P. (2019). On medians of (randomized) pairwise means. In International Conference on Machine Learning, pages 1272–1281. PMLR.
  • Lecué and Lerasle, (2019) Lecué, G. and Lerasle, M. (2019). Learning from MOM’s principles: Le Cam’s approach. Stochastic Processes and Their Applications, 129(11):4385–4410.
  • Lecué and Lerasle, (2020) Lecué, G. and Lerasle, M. (2020). Robust machine learning by median-of-means: theory and practice. Annals of Statistics, 48(2):906–931.
  • Lecué et al., (2020) Lecué, G., Lerasle, M., and Mathieu, T. (2020). Robust classification via mom minimization. Machine Learning, 109(8):1635–1665.
  • Lerasle, (2019) Lerasle, M. (2019). Lecture notes: Selected topics on robust statistical learning theory. arXiv preprint arXiv:1908.10761.
  • Levrard, (2015) Levrard, C. (2015). Nonasymptotic bounds for vector quantization in Hilbert spaces. The Annals of Statistics, 43(2):592–619.
  • Lloyd, (1982) Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137.
  • Lugosi and Mendelson, (2019) Lugosi, G. and Mendelson, S. (2019). Regularization, sparse recovery, and median-of-means tournaments. Bernoulli, 25(3):2075–2106.
  • MacQueen, (1967) MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA.
  • Mathieu and Minsker, (2021) Mathieu, T. and Minsker, S. (2021). Excess risk bounds in robust empirical risk minimization. Information and Inference: A Journal of the IMA.
  • Minsker, (2018) Minsker, S. (2018). Uniform bounds for robust mean estimators. arXiv preprint arXiv:1812.03523.
  • Mohri et al., (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. (2018). Foundations of machine learning. MIT press.
  • Ng et al., (2001) Ng, A., Jordan, M., and Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14:849–856.
  • Ostrovsky et al., (2013) Ostrovsky, R., Rabani, Y., Schulman, L. J., and Swamy, C. (2013). The effectiveness of Lloyd-type methods for the k-means problem. Journal of the ACM (JACM), 59(6):1–22.
  • (43) Paul, D., Chakraborty, S., and Das, S. (2021a). On the uniform concentration bounds and large sample properties of clustering with Bregman divergences. Stat, page e360.
  • (44) Paul, D., Chakraborty, S., and Das, S. (2021b). Robust principal component analysis: A median of means approach. arXiv preprint arXiv:2102.03403.
  • Pollard, (1981) Pollard, D. (1981). Strong consistency of kk-means clustering. The Annals of Statistics, 9(1):135–140.
  • Rodriguez and Valdora, (2019) Rodriguez, D. and Valdora, M. (2019). The breakdown point of the median of means tournament. Statistics & Probability Letters, 153:108–112.
  • Shah and Koltun, (2017) Shah, S. A. and Koltun, V. (2017). Robust continuous clustering. Proceedings of the National Academy of Sciences, 114(37):9814–9819.
  • Shalev-Shwartz and Ben-David, (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge university press.
  • Staerman et al., (2021) Staerman, G., Laforgue, P., Mozharovskyi, P., and d’Alché Buc, F. (2021). When OT meets MoM: Robust estimation of Wasserstein distance. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130, pages 136–144. PMLR.
  • Tan and Witten, (2015) Tan, K. M. and Witten, D. (2015). Statistical properties of convex clustering. Electronic journal of statistics, 9(2):2324.
  • Teboulle, (2007) Teboulle, M. (2007). A unified continuous optimization framework for center-based clustering methods. Journal of Machine Learning Research, 8(1).
  • Telgarsky and Dasgupta, (2012) Telgarsky, M. J. and Dasgupta, S. (2012). Agglomerative Bregman clustering. In 29th International Conference on Machine Learning, ICML 2012, pages 1527–1534.
  • Telgarsky and Dasgupta, (2013) Telgarsky, M. J. and Dasgupta, S. (2013). Moment-based uniform deviation bounds for kk-means and friends. Advances in Neural Information Processing Systems.
  • Terada, (2014) Terada, Y. (2014). Strong consistency of reduced k-means clustering. Scandinavian Journal of Statistics, 41(4):913–931.
  • Terada, (2015) Terada, Y. (2015). Strong consistency of factorial kk-means clustering. Annals of the Institute of Statistical Mathematics, 67(2):335–357.
  • Vapnik, (2013) Vapnik, V. (2013). The nature of statistical learning theory. Springer science & business media.
  • Vassilvitskii and Arthur, (2006) Vassilvitskii, S. and Arthur, D. (2006). k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
  • Xu and Lange, (2019) Xu, J. and Lange, K. (2019). Power k-means clustering. In International Conference on Machine Learning, pages 6921–6931. PMLR.
  • Zhang et al., (1999) Zhang, B., Hsu, M., and Dayal, U. (1999). K-harmonic means—a data clustering algorithm. Hewlett-Packard Labs Technical Report HPL-1999-124, 55.
  • Zhang et al., (2020) Zhang, Z., Lange, K., and Xu, J. (2020). Simple and scalable sparse k-means clustering via feature ranking. Advances in Neural Information Processing Systems, 33.

Appendix A Proofs of Lemmas

For the theoretical exposition, we first establish the following lemmas. Lemma A.1 shows that the gradient of \phi is bounded in \ell_{2}-norm when the domain is restricted to the support of P.

Lemma A.1.

Under A3, ϕ(𝐱)2HpMp\|\nabla\phi(\boldsymbol{x})\|_{2}\leq H_{p}M\sqrt{p}, for all 𝐱[M,M]p\boldsymbol{x}\in[-M,M]^{p}.

Proof.

From A3, we observe that

ϕ(𝒙)ϕ(𝟎)2Hp𝒙2\displaystyle\|\nabla\phi(\boldsymbol{x})-\nabla\phi(\boldsymbol{0})\|_{2}\leq H_{p}\|\boldsymbol{x}\|_{2}\implies ϕ(𝒙)2Hp𝒙2HpMp.\displaystyle\|\nabla\phi(\boldsymbol{x})\|_{2}\leq H_{p}\|\boldsymbol{x}\|_{2}\leq H_{p}M\sqrt{p}.

Lemma A.2 essentially proves that the function ϕ\phi is Lipschitz with Lipschitz constant HpMpH_{p}M\sqrt{p} on [M,M]p[-M,M]^{p}.

Lemma A.2.

Under A3, for all \boldsymbol{x},\boldsymbol{y}\in[-M,M]^{p}, \phi(\cdot) is H_{p}M\sqrt{p}-Lipschitz, i.e.

|ϕ(𝒙)ϕ(𝒚)|HpMp𝒙𝒚2.|\phi(\boldsymbol{x})-\phi(\boldsymbol{y})|\leq H_{p}M\sqrt{p}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}.
Proof.

From the mean value theorem,

ϕ(𝒙)ϕ(𝒚)=ϕ(𝝃),𝒙𝒚,\phi(\boldsymbol{x})-\phi(\boldsymbol{y})=\langle\nabla\phi(\boldsymbol{\xi}),\boldsymbol{x}-\boldsymbol{y}\rangle,

for some \boldsymbol{\xi} that is a convex combination of \boldsymbol{x} and \boldsymbol{y}. Clearly, \boldsymbol{\xi}\in[-M,M]^{p}, by the convexity of [-M,M]^{p}. Now by the Cauchy–Schwarz inequality and Lemma A.1,

|ϕ(𝒙)ϕ(𝒚)|ϕ(𝝃)2𝒙𝒚2HpMp𝒙𝒚2.|\phi(\boldsymbol{x})-\phi(\boldsymbol{y})|\leq\|\nabla\phi(\boldsymbol{\xi})\|_{2}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}\leq H_{p}M\sqrt{p}\|\boldsymbol{x}-\boldsymbol{y}\|_{2}.

Lemma A.3 shows that the map \boldsymbol{\Theta}\mapsto f_{\boldsymbol{\Theta}} is Lipschitz when the resulting functions are compared in the \|\cdot\|_{\infty} norm.

Lemma A.3.

Under assumptions A1–A3, for any \boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p},

f𝚯f𝚯4τ𝜶,kHpMpj=1k𝜽j𝜽j2.\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty}\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}.

Here, \boldsymbol{\Theta}=[\boldsymbol{\theta}_{1}^{\top},\dots,\boldsymbol{\theta}_{k}^{\top}]^{\top} and \boldsymbol{\Theta}^{\prime}=[\boldsymbol{\theta}_{1}^{\prime\top},\dots,\boldsymbol{\theta}_{k}^{\prime\top}]^{\top}.

Proof.
f𝚯f𝚯\displaystyle\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty}
=sup𝒙[M,M]p|Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))|\displaystyle=\sup_{\boldsymbol{x}\in[-M,M]^{p}}\left|\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))-\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}^{\prime}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}^{\prime}))\right|
τ𝜶,kj=1k|dϕ(𝒙,𝜽j)dϕ(𝒙,𝜽j)|\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}|d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})-d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j}^{\prime})|
=τ𝜶,kj=1k|ϕ(𝜽j)ϕ(𝜽j)+ϕ(𝜽j),𝒙𝜽jϕ(𝜽j),𝒙𝜽j|\displaystyle=\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}|\phi(\boldsymbol{\theta}_{j}^{\prime})-\phi(\boldsymbol{\theta}_{j})+\langle\nabla\phi(\boldsymbol{\theta}_{j}^{\prime}),\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\rangle-\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}\rangle|
=τ𝜶,kj=1k|ϕ(𝜽j)ϕ(𝜽j)+ϕ(𝜽j)ϕ(𝜽j),𝒙𝜽j+ϕ(𝜽j),𝜽j𝜽j|\displaystyle=\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}|\phi(\boldsymbol{\theta}_{j}^{\prime})-\phi(\boldsymbol{\theta}_{j})+\langle\nabla\phi(\boldsymbol{\theta}_{j}^{\prime})-\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\rangle+\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\rangle|
τ𝜶,kj=1k(|ϕ(𝜽j)ϕ(𝜽j)|+|ϕ(𝜽j)ϕ(𝜽j),𝒙𝜽j|+|ϕ(𝜽j),𝜽j𝜽j|)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(|\phi(\boldsymbol{\theta}_{j}^{\prime})-\phi(\boldsymbol{\theta}_{j})|+|\langle\nabla\phi(\boldsymbol{\theta}_{j}^{\prime})-\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\rangle|+|\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\rangle|\right)
τ𝜶,kj=1k(HpMp𝜽j𝜽j2+ϕ(𝜽j)ϕ(𝜽j)2𝒙𝜽j2+ϕ(𝜽j)2𝜽j𝜽j2)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(H_{p}M\sqrt{p}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}+\|\nabla\phi(\boldsymbol{\theta}_{j}^{\prime})-\nabla\phi(\boldsymbol{\theta}_{j})\|_{2}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}^{\prime}\|_{2}+\|\nabla\phi(\boldsymbol{\theta}_{j})\|_{2}\|\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\|_{2}\right)
τ𝜶,kj=1k(HpMp𝜽j𝜽j2+Hp𝜽j𝜽j2×2pM+HpMp𝜽j𝜽j2)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(H_{p}M\sqrt{p}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}+H_{p}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}\times 2\sqrt{p}M+H_{p}M\sqrt{p}\|\boldsymbol{\theta}_{j}-\boldsymbol{\theta}_{j}^{\prime}\|_{2}\right)
4τ𝜶,kHpMpj=1k𝜽j𝜽j2\displaystyle\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}
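As a quick numerical sanity check of Lemma A.3 (an illustration only, not part of the proof), the following sketch evaluates the ratio of the left- and right-hand sides for the squared-Euclidean instance, i.e. \phi(\boldsymbol{x})=\|\boldsymbol{x}\|_{2}^{2} (so H_{p}=2, as in Corollary 3.6) with \Psi_{\boldsymbol{\alpha}} taken as the coordinate-wise minimum, which is 1-Lipschitz with respect to the \ell_{1} norm and thus plays the role of \Psi_{\boldsymbol{\alpha}} with \tau_{\boldsymbol{\alpha},k}=1. The grid of random draws is ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
M, k, p = 1.0, 3, 4
H_p, tau = 2.0, 1.0   # phi(x) = ||x||^2 has a 2-Lipschitz gradient; the minimum is
                      # 1-Lipschitz w.r.t. the l1 norm, so it plays the role of Psi_alpha

def f(x, Theta):
    # f_Theta(x) with Psi_alpha = min and d_phi the squared Euclidean distance
    return np.min(((x - Theta) ** 2).sum(axis=1))

worst_ratio = 0.0
for _ in range(2000):
    Theta = rng.uniform(-M, M, (k, p))
    Theta_prime = rng.uniform(-M, M, (k, p))
    x = rng.uniform(-M, M, p)
    lhs = abs(f(x, Theta) - f(x, Theta_prime))
    rhs = 4 * tau * H_p * M * np.sqrt(p) * np.linalg.norm(Theta - Theta_prime, axis=1).sum()
    worst_ratio = max(worst_ratio, lhs / rhs)

print("largest observed lhs/rhs:", worst_ratio)   # stays below 1, consistent with the lemma
```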

Appendix B Proofs from Section 3

B.1 Proof of Lemma 3.1

Proof.

Let J(\boldsymbol{x})=d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}). Since P_{\mathcal{C}}(\boldsymbol{\theta}) minimizes J(\cdot) over \mathcal{C}, there exists a subgradient \boldsymbol{d}\in\partial J(P_{\mathcal{C}}(\boldsymbol{\theta})) such that, for all \boldsymbol{x}\in\mathcal{C},

𝒅,𝒙P𝒞(𝜽)0.\langle\boldsymbol{d},\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0. (5)

We note that \partial J(P_{\mathcal{C}}(\boldsymbol{\theta}))=\{\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta})\}, since \phi is differentiable. Thus, from equation (5),

ϕ(P𝒞(𝜽))ϕ(𝜽),𝒙P𝒞(𝜽)0.\langle\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta}),\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0. (6)

We now observe that,

dϕ(𝒙,𝜽)dϕ(𝒙,P𝒞(𝜽))dϕ(P𝒞(𝜽),𝜽)=ϕ(P𝒞(𝜽))ϕ(𝜽),𝒙P𝒞(𝜽)0.d_{\phi}(\boldsymbol{x},\boldsymbol{\theta})-d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}))-d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}),\boldsymbol{\theta})=\langle\nabla\phi(P_{\mathcal{C}}(\boldsymbol{\theta}))-\nabla\phi(\boldsymbol{\theta}),\boldsymbol{x}-P_{\mathcal{C}}(\boldsymbol{\theta})\rangle\geq 0.

Hence the result. ∎
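For intuition (again an illustration only, not part of the proof), the three-point inequality of Lemma 3.1 can be checked numerically in the squared-Euclidean special case, where d_{\phi} is the squared Euclidean distance and the Bregman projection onto the box [-M,M]^{p} reduces to coordinate-wise clipping. All numeric values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, p = 1.0, 5

def d_phi(x, y):
    # Bregman divergence generated by phi(x) = ||x||^2: the squared Euclidean distance
    return float(np.sum((x - y) ** 2))

def project_box(theta, M):
    # For this phi, the Bregman projection onto C = [-M, M]^p is coordinate-wise clipping
    return np.clip(theta, -M, M)

for _ in range(1000):
    x = rng.uniform(-M, M, size=p)              # a point of the support, i.e. x in C
    theta = rng.uniform(-5 * M, 5 * M, size=p)  # a possibly out-of-range centroid
    proj = project_box(theta, M)
    lhs = d_phi(x, theta)
    rhs = d_phi(x, proj) + d_phi(proj, theta)
    assert lhs >= rhs - 1e-12                   # the inequality of Lemma 3.1

print("Lemma 3.1 inequality verified on 1000 random draws (squared Euclidean case).")
```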

B.2 Proof of Lemma 3.2

Proof.

Suppose \boldsymbol{\Theta}=\{\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{k}\}. We take \mathcal{C}=[-M,M]^{p} and \boldsymbol{\Theta}^{\prime}=\{P_{\mathcal{C}}(\boldsymbol{\theta}_{1}),\dots,P_{\mathcal{C}}(\boldsymbol{\theta}_{k})\}. Clearly \mathcal{C} is a convex set. Thus, from Lemma 3.1, we observe that

dϕ(𝒙,𝜽j)dϕ(𝒙,P𝒞(𝜽j))+dϕ(P𝒞(𝜽j),𝜽j)dϕ(𝒙,P𝒞(𝜽j))j=1,,k.\displaystyle d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))+d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))\,\quad\forall\,j=1,\dots,k.
\displaystyle\implies Ψ𝜶(dϕ(𝒙,P𝒞(𝜽1)),,dϕ(𝒙,P𝒞(𝜽k)))Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))\displaystyle\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)\leq\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k})\right)
\displaystyle\implies Ψ𝜶(dϕ(𝒙,P𝒞(𝜽1)),,dϕ(𝒙,P𝒞(𝜽k)))𝑑QΨ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))𝑑Q\displaystyle\int\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{x},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)dQ\leq\int\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k})\right)dQ
\displaystyle\implies Qf𝚯Qf𝚯\displaystyle Qf_{\boldsymbol{\Theta}^{\prime}}\,\,\leq\,\,Qf_{\boldsymbol{\Theta}}

B.3 Proof of Lemma 3.3

Proof.

We first divide the interval [-M,M] into small bins, each of width \epsilon. Denote \gamma_{i}=-M+i\epsilon, for i=1,\dots,\lfloor\frac{2M}{\epsilon}\rfloor, and let \Gamma_{\epsilon}=\{\gamma_{i}:i\in\{1,\dots,\lfloor\frac{2M}{\epsilon}\rfloor\}\}. If \epsilon>2M, we take \Gamma_{\epsilon}=\{0\}. Clearly, |\Gamma_{\epsilon}|=\max\{\lfloor\frac{2M}{\epsilon}\rfloor,1\}. From the construction of \Gamma_{\epsilon}, for all x\in[-M,M], there exists i\in[|\Gamma_{\epsilon}|] such that |x-\gamma_{i}|\leq\epsilon. We take \epsilon=(4\tau_{\boldsymbol{\alpha},k}H_{p}Mkp)^{-1}\delta. We define

𝚯δ={𝚯=((θi)):θiΓϵ}.\boldsymbol{\Theta}_{\delta}=\left\{\boldsymbol{\Theta}=\left((\theta_{i\ell})\right):\theta_{i\ell}\in\Gamma_{\epsilon}\right\}.

Then immediately we see

|𝚯δ|=(max{2Mϵ,1})kp.|\boldsymbol{\Theta}_{\delta}|=\left(\max\left\{\left\lfloor\frac{2M}{\epsilon}\right\rfloor,1\right\}\right)^{kp}.

For any \boldsymbol{\Theta}\in[-M,M]^{k\times p}, we can construct \boldsymbol{\Theta}^{\prime}\in\boldsymbol{\Theta}_{\delta} such that |\theta_{i\ell}-\theta^{\prime}_{i\ell}|\leq\epsilon for all i\in[k] and \ell\in[p]. From Lemma A.3, we observe that,

\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty} \leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2} \leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\cdot k\sqrt{p}\,\epsilon = 4\tau_{\boldsymbol{\alpha},k}H_{p}Mkp\,\epsilon = \delta.

Thus, δ={f𝚯:𝚯𝚯δ}\mathcal{F}_{\delta}=\{f_{\boldsymbol{\Theta}}:\boldsymbol{\Theta}\in\boldsymbol{\Theta}_{\delta}\} constitutes a δ\delta-cover of \mathcal{F} under the \|\cdot\|_{\infty} norm. Hence,

N(δ;,)|δ||𝚯δ|\displaystyle N(\delta;\mathcal{F},\|\cdot\|_{\infty})\leq|\mathcal{F}_{\delta}|\leq|\boldsymbol{\Theta}_{\delta}| =(max{2Mϵ,1})kp=(max{8M2τ𝜶,kHpkpδ,1})kp.\displaystyle=\left(\max\left\{\left\lfloor\frac{2M}{\epsilon}\right\rfloor,1\right\}\right)^{kp}=\left(\max\left\{\left\lfloor\frac{8M^{2}\tau_{\boldsymbol{\alpha},k}H_{p}kp}{\delta}\right\rfloor,1\right\}\right)^{kp}.
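The cover constructed above is simply a Cartesian grid over the kp centroid coordinates. The following sketch (illustrative only; the function name and toy constants are ours) computes the grid resolution \epsilon and the resulting cover size.

```python
import numpy as np

def delta_cover_size(M, k, p, delta, tau, H_p):
    """Size of the delta-cover Theta_delta built in the proof of Lemma 3.3."""
    eps = delta / (4 * tau * H_p * M * k * p)   # grid resolution epsilon
    n_grid = max(int(2 * M / eps), 1)           # |Gamma_eps| points per coordinate
    return n_grid ** (k * p)                    # |Theta_delta| = |Gamma_eps|^{kp}

# toy values; the count matches (max{floor(8 M^2 tau H_p k p / delta), 1})^{kp}
M, k, p, tau, H_p = 1.0, 2, 2, 1.0, 2.0
for delta in (2.0, 1.0, 0.5):
    print(delta, delta_cover_size(M, k, p, delta, tau, H_p))
```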

B.4 Proof of Lemma 3.4

Proof.

From Lemma A.3, we observe that,

diam()\displaystyle\text{diam}(\mathcal{F}) =sup𝚯,𝚯[M,M]k×pf𝚯f𝚯\displaystyle=\sup_{\boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p}}\|f_{\boldsymbol{\Theta}}-f_{\boldsymbol{\Theta}^{\prime}}\|_{\infty}
4HpMpτ𝜶,ksup𝚯,𝚯[M,M]k×pj=1k𝜽j𝜽j2\displaystyle\leq 4H_{p}M\sqrt{p}\tau_{\boldsymbol{\alpha},k}\sup_{\boldsymbol{\Theta},\boldsymbol{\Theta}^{\prime}\in[-M,M]^{k\times p}}\sum_{j=1}^{k}\|\boldsymbol{\theta}_{j}^{\prime}-\boldsymbol{\theta}_{j}\|_{2}
4HpMpτ𝜶,k×2kMp\displaystyle\leq 4H_{p}M\sqrt{p}\tau_{\boldsymbol{\alpha},k}\times 2kM\sqrt{p}
=8τ𝜶,kHpM2kp.\displaystyle=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp.

B.5 Proof of Lemma 3.5

Proof.

From the non-negativity of Ψ𝜶()\Psi_{\boldsymbol{\alpha}}(\cdot), we get, Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))0\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))\geq 0, for any 𝒙[M,M]p\boldsymbol{x}\in[-M,M]^{p} and 𝚯[M,M]k×p\boldsymbol{\Theta}\in[-M,M]^{k\times p}. For any 𝜷0k\boldsymbol{\beta}\in\mathop{\mathbb{R}}\nolimits^{k}_{\geq 0}, from A2, we get,

\Psi_{\boldsymbol{\alpha}}(\boldsymbol{\beta})=|\Psi_{\boldsymbol{\alpha}}(\boldsymbol{\beta})-\Psi_{\boldsymbol{\alpha}}(\mathbf{0})|\leq\tau_{\boldsymbol{\alpha},k}\|\boldsymbol{\beta}-\mathbf{0}\|_{1}=\tau_{\boldsymbol{\alpha},k}\|\boldsymbol{\beta}\|_{1}.

Taking 𝜷=(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))\boldsymbol{\beta}=(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))^{\top}, we get,

Ψ𝜶(dϕ(𝒙,𝜽1),,dϕ(𝒙,𝜽k))\displaystyle\Psi_{\boldsymbol{\alpha}}(d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{k}))
τ𝜶,kj=1kdϕ(𝒙,𝜽j)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}d_{\phi}(\boldsymbol{x},\boldsymbol{\theta}_{j})
=τ𝜶,kj=1k(ϕ(𝒙)ϕ(𝜽j)ϕ(𝜽j),𝒙𝜽j)\displaystyle=\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(\phi(\boldsymbol{x})-\phi(\boldsymbol{\theta}_{j})-\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}\rangle\right)
τ𝜶,kj=1k(|ϕ(𝒙)ϕ(𝜽j)|+|ϕ(𝜽j),𝒙𝜽j|)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(|\phi(\boldsymbol{x})-\phi(\boldsymbol{\theta}_{j})|+|\langle\nabla\phi(\boldsymbol{\theta}_{j}),\boldsymbol{x}-\boldsymbol{\theta}_{j}\rangle|\right)
τ𝜶,kj=1k(HpMp𝒙𝜽j2+ϕ(𝜽j)2𝒙𝜽j2)\displaystyle\leq\tau_{\boldsymbol{\alpha},k}\sum_{j=1}^{k}\left(H_{p}M\sqrt{p}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2}+\|\nabla\phi(\boldsymbol{\theta}_{j})\|_{2}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2}\right) (7)
2τ𝜶,kHpMpj=1k𝒙𝜽j2\displaystyle\leq 2\tau_{\boldsymbol{\alpha},k}H_{p}M\sqrt{p}\sum_{j=1}^{k}\|\boldsymbol{x}-\boldsymbol{\theta}_{j}\|_{2} (8)
4τ𝜶,kHpM2pk.\displaystyle\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk.

Here inequality (7) follows from the Cauchy–Schwarz inequality and Lemma A.2. Inequality (8) follows from Lemma A.1. ∎

B.6 Proof of Theorem 3.1

Proof.

Let \Delta=8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp. We construct a decreasing sequence \{\delta_{i}\}_{i\in\mathbb{N}} as follows. Take \delta_{1}:=\Delta, so that \delta_{1}\geq\text{diam}(\mathcal{F}) by Lemma 3.4, and set \delta_{i+1}=\frac{1}{2}\delta_{i}. Let \mathcal{F}_{i} be a minimal \delta_{i}-cover of \mathcal{F}, i.e. |\mathcal{F}_{i}|=N(\delta_{i};\mathcal{F},\|\cdot\|_{\infty}). For each f\in\mathcal{F}, denote by f_{i} the element of \mathcal{F}_{i} closest to f (with ties broken arbitrarily). We can thus write,

𝔼supf1ni=1nϵif(𝑿i)ξ1+ξ2+ξ3,\displaystyle\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})\leq\xi_{1}+\xi_{2}+\xi_{3},

where

ξ1=𝔼supf1ni=1nϵi(f(𝑿i)fm(𝑿i)),\displaystyle\xi_{1}=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-f_{m}(\boldsymbol{X}_{i})), (9)
ξ2=j=1m1𝔼supf1ni=1nϵi(fj+1(𝑿i)fj(𝑿i)),\displaystyle\xi_{2}=\sum_{j=1}^{m-1}\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f_{j+1}(\boldsymbol{X}_{i})-f_{j}(\boldsymbol{X}_{i})), (10)
ξ3=𝔼supf1ni=1nϵif1(𝑿i).\displaystyle\xi_{3}=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f_{1}(\boldsymbol{X}_{i}). (11)

Since \delta_{1}\geq\text{diam}(\mathcal{F}), the cover \mathcal{F}_{1} may be taken to consist of a single function, so f_{1} does not depend on f; as the Rademacher variables \epsilon_{i} have mean zero, \xi_{3}=0. To bound \xi_{1}, we observe that,

ξ1=𝔼supf1ni=1nϵi(f(𝑿i)fm(𝑿i))𝔼supf1n(i=1nϵi2)(i=1n(f(𝑿i)fm(𝑿i))2)δm\xi_{1}=\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f(\boldsymbol{X}_{i})-f_{m}(\boldsymbol{X}_{i}))\leq\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sqrt{\left(\sum_{i=1}^{n}\epsilon_{i}^{2}\right)\left(\sum_{i=1}^{n}(f(\boldsymbol{X}_{i})-f_{m}(\boldsymbol{X}_{i}))^{2}\right)}\leq\delta_{m}

The first inequality above is the Cauchy–Schwarz inequality; the second uses \sum_{i=1}^{n}\epsilon_{i}^{2}=n together with \|f-f_{m}\|_{\infty}\leq\delta_{m}. To bound \xi_{2}, we observe that,

fj+1fjfj+1f+ffjδj+1+δj.\|f_{j+1}-f_{j}\|_{\infty}\leq\|f_{j+1}-f\|_{\infty}+\|f-f_{j}\|_{\infty}\leq\delta_{j+1}+\delta_{j}.

Now appealing to Massart’s lemma (Mohri et al.,, 2018), we get,

𝔼supf1ni=1nϵi(fj+1(𝑿i)fj(𝑿i))\displaystyle\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f_{j+1}(\boldsymbol{X}_{i})-f_{j}(\boldsymbol{X}_{i})) (δj+1+δj)2log(N(δj;,)N(δj+1;,))n\displaystyle\leq\frac{(\delta_{j+1}+\delta_{j})\sqrt{2\log\left(N(\delta_{j};\mathcal{F},\|\cdot\|_{\infty})N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})\right)}}{\sqrt{n}}
2(δj+1+δj)logN(δj+1;,)n\displaystyle\leq\frac{2(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}}{\sqrt{n}}

Thus,

ξ2=j=1m1𝔼supf1ni=1nϵi(fj+1(𝑿i)fj(𝑿i))j=1m12(δj+1+δj)logN(δj+1;,)n\xi_{2}=\sum_{j=1}^{m-1}\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}(f_{j+1}(\boldsymbol{X}_{i})-f_{j}(\boldsymbol{X}_{i}))\leq\sum_{j=1}^{m-1}\frac{2(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}}{\sqrt{n}}

Combining the bounds on ξ1\xi_{1}, ξ2\xi_{2} and ξ3\xi_{3}, we get,

𝔼supf1ni=1nϵif(𝑿i)δm+2nj=1m1(δj+1+δj)logN(δj+1;,).\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})\leq\delta_{m}+\frac{2}{\sqrt{n}}\sum_{j=1}^{m-1}(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}. (12)

From the construction of {δi}i1\{\delta_{i}\}_{i\geq 1}, we know, δj+1+δj=6(δj+1δj+2)\delta_{j+1}+\delta_{j}=6(\delta_{j+1}-\delta_{j+2}). Hence from equation (12), we get,

𝔼supf1ni=1nϵif(𝑿i)\displaystyle\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i}) δm+2nj=1m1(δj+1+δj)logN(δj+1;,)\displaystyle\leq\delta_{m}+\frac{2}{\sqrt{n}}\sum_{j=1}^{m-1}(\delta_{j+1}+\delta_{j})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}
=δm+12nj=1m1(δj+1δj+2)logN(δj+1;,)\displaystyle=\delta_{m}+\frac{12}{\sqrt{n}}\sum_{j=1}^{m-1}(\delta_{j+1}-\delta_{j+2})\sqrt{\log N(\delta_{j+1};\mathcal{F},\|\cdot\|_{\infty})}
δm+12nδm+1δ2logN(ϵ;,)𝑑ϵ\displaystyle\leq\delta_{m}+\frac{12}{\sqrt{n}}\int_{\delta_{m+1}}^{\delta_{2}}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}d\epsilon

Taking limits as mm\to\infty in the above equation, we get,

𝔼supf1ni=1nϵif(𝑿i)12n0ΔlogN(ϵ;,)𝑑ϵ.\mathbb{E}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\epsilon_{i}f(\boldsymbol{X}_{i})\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}d\epsilon.

From Lemma 3.3, plugging in the value of N(ϵ;,)N(\epsilon;\mathcal{F},\|\cdot\|_{\infty}), we get,

n()\displaystyle\mathcal{R}_{n}(\mathcal{F}) 12n0ΔlogN(ϵ;,)𝑑ϵ\displaystyle\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{\log N(\epsilon;\mathcal{F},\|\cdot\|_{\infty})}d\epsilon
12n0Δkplog(max{Δϵ,1})𝑑ϵ\displaystyle\leq\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\max\left\{\frac{\Delta}{\epsilon},1\right\}\right)}d\epsilon
=12n0Δkplog(Δϵ)𝑑ϵ\displaystyle=\frac{12}{\sqrt{n}}\int_{0}^{\Delta}\sqrt{kp\log\left(\frac{\Delta}{\epsilon}\right)}d\epsilon
=12kpnΔ02t2et2𝑑t\displaystyle=12\sqrt{\frac{kp}{n}}\Delta\int_{0}^{\infty}2t^{2}e^{-t^{2}}dt
=12kpnΔ0u321eu𝑑u\displaystyle=12\sqrt{\frac{kp}{n}}\Delta\int_{0}^{\infty}u^{\frac{3}{2}-1}e^{-u}du
=12kpnΔΓ(3/2)\displaystyle=12\sqrt{\frac{kp}{n}}\Delta\Gamma(3/2)
=6kpπn×8τ𝜶,kHpM2kp\displaystyle=6\sqrt{\frac{kp\pi}{n}}\times 8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}kp
=48πτ𝜶,kHpM2(kp)3/2n1/2.\displaystyle=48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}.
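The last few lines of the proof evaluate the Dudley entropy integral in closed form. The following quick numerical check (assuming SciPy is available; the constants are arbitrary) confirms the identity \int_{0}^{\Delta}\sqrt{kp\log(\Delta/\epsilon)}\,d\epsilon=\Delta\sqrt{kp}\,\Gamma(3/2) used above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

# Check of the entropy-integral evaluation used at the end of the proof:
#   int_0^Delta sqrt(kp * log(Delta / eps)) d eps = Delta * sqrt(kp) * Gamma(3/2).
kp, Delta = 6.0, 2.5                                   # arbitrary positive constants
integral, _ = quad(lambda e: np.sqrt(kp * np.log(Delta / e)), 0.0, Delta)
closed_form = Delta * np.sqrt(kp) * gamma(1.5)         # Gamma(3/2) = sqrt(pi) / 2
print(integral, closed_form)                           # the two values agree numerically
```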

B.7 Proof of Theorem 3.2

Proof.

From Lemma 3.5, we observe that \sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk. Under assumption A1, by the standard symmetrization argument and McDiarmid's bounded-differences inequality, we observe that, with probability at least 1-\delta,

supf|PnfPf|\displaystyle\sup_{f\in\mathcal{F}}|P_{n}f-Pf| 2n()+supfflog(2/δ)2n\displaystyle\leq 2\mathcal{R}_{n}(\mathcal{F})+\sup_{f\in\mathcal{F}}\|f\|_{\infty}\sqrt{\frac{\log(2/\delta)}{2n}}
96πτ𝜶,kHpM2(kp)3/2n1/2+4τ𝜶,kHpM2pklog(2/δ)2n.\displaystyle\leq 96\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}. (13)

Inequality (13) follows from appealing to Theorem 3.1 and observing that supff4τ𝜶,kHpM2pk\sup_{f\in\mathcal{F}}\|f\|_{\infty}\leq 4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk. ∎

B.8 Proof of Theorem 3.3

Proof.

(Proof of strong consistency) We will first show |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\xrightarrow{a.s.}0. Note that, by the usual ERM decomposition and the fact that \widehat{\boldsymbol{\Theta}}_{n} minimizes P_{n}f_{\boldsymbol{\Theta}}, we have |Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 2\sup_{f\in\mathcal{F}}|P_{n}f-Pf|. Let C=\max\{192\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2},8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\}. Then from Theorem 3.2, we observe that with probability at least 1-\delta,

|Pf𝚯^nPf𝚯|Cn+Clog(2/δ)2n.|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq\frac{C}{\sqrt{n}}+C\sqrt{\frac{\log(2/\delta)}{2n}}. (14)

Fix ϵ>0\epsilon>0. If n4C2/ϵ2n\geq 4C^{2}/\epsilon^{2} and δ=2exp(nϵ22C2)\delta=2\exp\left(-\frac{n\epsilon^{2}}{2C^{2}}\right), the RHS of (14) becomes no bigger than ϵ\epsilon. Thus,

(|Pf𝚯^nPf𝚯|>ϵ)2exp(nϵ22C2),n4C2/ϵ2.\mathbbm{P}\left(|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|>\epsilon\right)\leq 2\exp\left(-\frac{n\epsilon^{2}}{2C^{2}}\right),\quad\forall\,n\geq 4C^{2}/\epsilon^{2}.

Since the series \sum_{n=1}^{\infty}\exp\left(-\frac{n\epsilon^{2}}{2C^{2}}\right) converges, so does \sum_{n=1}^{\infty}\mathbbm{P}\left(|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|>\epsilon\right), and the Borel–Cantelli lemma yields Pf_{\widehat{\boldsymbol{\Theta}}_{n}}\xrightarrow{a.s.}Pf_{\boldsymbol{\Theta}^{\ast}}. Thus, for any \epsilon>0, Pf_{\widehat{\boldsymbol{\Theta}}_{n}}\leq Pf_{\boldsymbol{\Theta}^{\ast}}+\epsilon almost surely [P] for all n sufficiently large. From assumption A4, \text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\leq\eta almost surely [P], for any prefixed \eta>0 and n large. Thus, \text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\xrightarrow{a.s.}0, which proves the result.

(Proof of n\sqrt{n}-consistency) Fix δ(0,1]\delta\in(0,1]. From Theorem 3.2, with probability at least 1δ1-\delta,

|Pf𝚯^nPf𝚯|192πτ𝜶,kHpM2(kp)3/2n1/2+8τ𝜶,kHpM2pklog(2/δ)2n=O(n1/2).|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq 192\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}n^{-1/2}+8\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk\sqrt{\frac{\log(2/\delta)}{2n}}=O(n^{-1/2}).

Hence, n|Pf𝚯^nPf𝚯|=O(1)\sqrt{n}|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O(1) with probability at least 1δ1-\delta. Thus, Cδ\exists\,C_{\delta}, such that

(n|Pf𝚯^nPf𝚯|Cδ)1δ,\mathbbm{P}\left(\sqrt{n}|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\leq C_{\delta}\right)\geq 1-\delta,

for all nn large enough. Hence, |Pf𝚯^nPf𝚯|=OP(n1/2)|Pf_{\widehat{\boldsymbol{\Theta}}_{n}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}(n^{-1/2}). ∎

Appendix C Proofs from Section 3.4

C.1 Proof of Lemma 3.6

Proof.

Suppose \boldsymbol{\Theta}=\{\boldsymbol{\theta}_{1},\dots,\boldsymbol{\theta}_{k}\}. We take \mathcal{C}=[-M,M]^{p} and \boldsymbol{\Theta}^{\prime}=\{P_{\mathcal{C}}(\boldsymbol{\theta}_{1}),\dots,P_{\mathcal{C}}(\boldsymbol{\theta}_{k})\}. Clearly \mathcal{C} is convex. Let \mathcal{L}\subseteq\{1,\dots,L\} be the set of all partitions which do not contain an outlier. Thus, from Lemma 3.1, we observe that

dϕ(𝑿i,𝜽j)dϕ(𝑿i,P𝒞(𝜽j))+dϕ(P𝒞(𝜽j),𝜽j)dϕ(𝑿i,P𝒞(𝜽j))j=1,,k and i\displaystyle d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))+d_{\phi}(P_{\mathcal{C}}(\boldsymbol{\theta}_{j}),\boldsymbol{\theta}_{j})\geq d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{j}))\,\forall\,j=1,\dots,k\text{ and }i\in\mathcal{I}
\displaystyle\implies Ψ𝜶(dϕ(𝑿i,P𝒞(𝜽1)),,dϕ(𝑿,P𝒞(𝜽k)))Ψ𝜶(dϕ(𝑿i,𝜽1),,dϕ(𝑿,𝜽k))i\displaystyle\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{X},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)\leq\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{X},\boldsymbol{\theta}_{k})\right)\,\forall\,i\in\mathcal{I}
\displaystyle\implies iBΨ𝜶(dϕ(𝑿i,P𝒞(𝜽1)),,dϕ(𝑿,P𝒞(𝜽k)))iBΨ𝜶(dϕ(𝑿i,𝜽1),,dϕ(𝑿,𝜽k))\displaystyle\sum_{i\in B_{\ell}}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},P_{\mathcal{C}}(\boldsymbol{\theta}_{1})),\dots,d_{\phi}(\boldsymbol{X},P_{\mathcal{C}}(\boldsymbol{\theta}_{k}))\right)\leq\sum_{i\in B_{\ell}}\Psi_{\boldsymbol{\alpha}}\left(d_{\phi}(\boldsymbol{X}_{i},\boldsymbol{\theta}_{1}),\dots,d_{\phi}(\boldsymbol{X},\boldsymbol{\theta}_{k})\right)\,\forall\,\ell\in\mathcal{L}
\displaystyle\implies 1biBf𝚯(𝑿i)1biBf𝚯(𝑿i)\displaystyle\frac{1}{b}\sum_{i\in B_{\ell}}f_{\boldsymbol{\Theta}^{\prime}}(\boldsymbol{X}_{i})\leq\frac{1}{b}\sum_{i\in B_{\ell}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\,\forall\,\ell\in\mathcal{L}

Now since |\mathcal{L}|>|\mathcal{L}^{C}| (from assumption A6),

Median(1biB1f𝚯(𝑿i),,1biBLf𝚯(𝑿i))Median(1biB1f𝚯(𝑿i),,1biBLf𝚯(𝑿i))\displaystyle\text{Median}\left(\frac{1}{b}\sum_{i\in B_{1}}f_{\boldsymbol{\Theta}^{\prime}}(\boldsymbol{X}_{i}),\dots,\frac{1}{b}\sum_{i\in B_{L}}f_{\boldsymbol{\Theta}^{\prime}}(\boldsymbol{X}_{i})\right)\leq\text{Median}\left(\frac{1}{b}\sum_{i\in B_{1}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}),\dots,\frac{1}{b}\sum_{i\in B_{L}}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\right)
MoMLn(𝚯)MoMLn(𝚯)\displaystyle\implies\text{MoM}_{L}^{n}(\boldsymbol{\Theta}^{\prime})\leq\text{MoM}_{L}^{n}(\boldsymbol{\Theta})
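For concreteness, the following sketch computes the \text{MoM}_{L}^{n} objective manipulated in the lemma, taking \Psi_{\boldsymbol{\alpha}} to be the coordinate-wise minimum and d_{\phi} the squared Euclidean distance (the classical k-means loss) as one instance of the framework; the synthetic data and constants are purely illustrative. It also shows the qualitative behaviour the lemma exploits: centroid configurations that chase outliers do not lower the MoM objective.

```python
import numpy as np

def mom_objective(X, Theta, L, rng):
    """MoM_L^n(Theta): median over L blocks of the block-averaged clustering loss.

    Psi_alpha is taken to be the minimum over the k centroids and d_phi the
    squared Euclidean distance, i.e. the classical k-means loss.
    """
    n = X.shape[0]
    idx = rng.permutation(n)
    b = n // L                                                   # block size
    loss = ((X[:, None, :] - Theta[None, :, :]) ** 2).sum(-1).min(axis=1)
    block_means = loss[idx[: b * L]].reshape(L, b).mean(axis=1)
    return np.median(block_means)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)),
               rng.normal(8, 1, (500, 2)),
               rng.normal(40, 0.1, (20, 2))])                    # last 20 rows are outliers
Theta_good = np.array([[0.0, 0.0], [8.0, 8.0]])
Theta_bad = np.array([[0.0, 0.0], [40.0, 40.0]])                 # one centroid chases the outliers
print(mom_objective(X, Theta_good, L=41, rng=rng))               # small
print(mom_objective(X, Theta_bad, L=41, rng=rng))                # much larger
```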

C.2 Proof of Theorem 3.4

Proof.

For notational simplicity let PBP_{B_{\ell}} denote the empirical distribution of {𝑿i}iB\{\boldsymbol{X}_{i}\}_{i\in B_{\ell}}. Suppose ϵ>0\epsilon>0. We will first bound the probability of sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|>ϵ\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|>\epsilon. To do so, we will individually bound the probabilities of the events

sup𝚯[M,M]k×p(MoMLn(f𝚯)Pf𝚯)>ϵ\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}})>\epsilon

and

sup𝚯[M,M]k×p(Pf𝚯MoMLn(f𝚯))>ϵ.\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}}))>\epsilon.

We note that if

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}_{L}^{n}(f_{\boldsymbol{\Theta}}))>\epsilon,

then there exists \boldsymbol{\Theta} for which at least half of the block means deviate from the population value by more than \epsilon, i.e.

\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}\geq\frac{L}{2}.

Here again \mathbbm{1}\{\cdot\} denotes the indicator function. Now let \varphi(t)=(t-1)\mathbbm{1}\{1\leq t\leq 2\}+\mathbbm{1}\{t>2\}. Clearly,

𝟙{t2}φ(t)𝟙{t1}.\mathbbm{1}\{t\geq 2\}\leq\varphi(t)\leq\mathbbm{1}\{t\geq 1\}. (15)

We observe that,

sup𝚯[M,M]k×p=1L𝟙{(PPB)f𝚯>ϵ}\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}
\displaystyle\leq sup𝚯[M,M]k×p𝟙{(PPB)f𝚯>ϵ}+|𝒪|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}+|\mathcal{O}|
\displaystyle\leq sup𝚯[M,M]k×pφ(2(PPB)f𝚯ϵ)+|𝒪|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)+|\mathcal{O}|
\displaystyle\leq sup𝚯[M,M]k×p𝔼φ(2(PPB)f𝚯ϵ)+|𝒪|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)+|\mathcal{O}|
+sup𝚯[M,M]k×p[φ(2(PPB)f𝚯ϵ)𝔼φ(2(PPB)f𝚯ϵ)].\displaystyle+\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\bigg{[}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\bigg{]}. (16)

To bound sup𝚯[M,M]k×p=1L𝟙{(PPB)f𝚯>ϵ}\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}, we will first bound the quantity 𝔼φ(2(PPB)f𝚯ϵ)\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right). We observe that,

𝔼φ(2(PPB)f𝚯ϵ)𝔼[𝟙{2(PPB)f𝚯ϵ>1}]=\displaystyle\small\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\leq\mathbb{E}\left[\mathbbm{1}\left\{\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}>1\right\}\right]= [(PPB)f𝚯>ϵ2]\displaystyle\mathbbm{P}\left[(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\frac{\epsilon}{2}\right]
\displaystyle\leq exp{bϵ232τ𝜶,k2Hp2M4k2p2}\displaystyle\exp\left\{-\frac{b\epsilon^{2}}{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}k^{2}p^{2}}\right\} (17)

Here, inequality (17) follows from Hoeffding's inequality: for \ell\in\mathcal{L}, P_{B_{\ell}}f_{\boldsymbol{\Theta}} is an average of b i.i.d. terms, each lying in [0,4\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}pk] by Lemma 3.5. We now turn to bounding the term

sup𝚯[M,M]k×p[φ(2(PPB)f𝚯ϵ)𝔼φ(2(PPB)f𝚯ϵ)].\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\bigg{[}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\bigg{]}.

Appealing to Theorem 26.5 of (Shalev-Shwartz and Ben-David,, 2014) we observe that, with probability at least 1e2Lδ21-e^{-2L\delta^{2}}, for all 𝚯[M,M]k×p\boldsymbol{\Theta}\in[-M,M]^{k\times p},

1Lφ(2(PPB)f𝚯ϵ)\displaystyle\frac{1}{L}\sum_{\ell\in\mathcal{L}}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)
\displaystyle\leq 𝔼[1Lφ(2(PPB)f𝚯ϵ)]+2𝔼[sup𝚯[M,M]k×p1Lσφ(2(PPB)f𝚯ϵ)]+δ.\displaystyle\mathbb{E}\left[\frac{1}{L}\sum_{\ell\in\mathcal{L}}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right]+2\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\frac{1}{L}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right]+\delta. (18)

Here \{\sigma_{\ell}\}_{\ell\in\mathcal{L}} are i.i.d. Rademacher random variables. Let \{\xi_{i}\}_{i=1}^{n} be i.i.d. Rademacher random variables, independent from \{\sigma_{\ell}\}_{\ell\in\mathcal{L}}. From equation (18), we get,

1Lsup𝚯[M,M]k×p[φ(2(PPB)f𝚯ϵ)𝔼φ(2(PPB)f𝚯ϵ)]\displaystyle\frac{1}{L}\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\bigg{[}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)-\mathbb{E}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\bigg{]}
\displaystyle\leq 2𝔼[sup𝚯[M,M]k×p1Lσφ(2(PPB)f𝚯ϵ)]+δ\displaystyle 2\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\frac{1}{L}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\varphi\left(\frac{2(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}}{\epsilon}\right)\right]+\delta
\displaystyle\leq 4Lϵ𝔼[sup𝚯[M,M]k×pσ(PPB)f𝚯]+δ.\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}\right]+\delta. (19)

Equation (19) follows from the fact that \varphi(\cdot) is 1-Lipschitz, appealing to Lemma 26.9 of Shalev-Shwartz and Ben-David (2014). We now consider a “ghost” sample \mathcal{X}^{\prime}=\{\boldsymbol{X}_{1}^{\prime},\dots,\boldsymbol{X}_{n}^{\prime}\}, which is i.i.d. and follows the probability law P. Thus, the right-hand side of equation (19) can be further developed as

=\displaystyle= 4Lϵ𝔼[sup𝚯[M,M]k×pσ𝔼𝒳((PBPB)f𝚯)]+δ\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\mathbb{E}_{\mathcal{X}^{\prime}}\left((P^{\prime}_{B_{\ell}}-P_{B_{\ell}})f_{\boldsymbol{\Theta}}\right)\right]+\delta
\displaystyle\leq 4Lϵ𝔼[sup𝚯[M,M]k×pσ(PBPB)f𝚯]+δ\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}(P^{\prime}_{B_{\ell}}-P_{B_{\ell}})f_{\boldsymbol{\Theta}}\right]+\delta
=\displaystyle= 4Lϵ𝔼[sup𝚯[M,M]k×pσ1biB(f𝚯(𝑿i)f𝚯(𝑿i))]+δ\displaystyle\frac{4}{L\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\frac{1}{b}\sum_{i\in B_{\ell}}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta
=\displaystyle= 4bLϵ𝔼[sup𝚯[M,M]k×pσiBξi(f𝚯(𝑿i)f𝚯(𝑿i))]+δ\displaystyle\frac{4}{bL\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sigma_{\ell}\sum_{i\in B_{\ell}}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta (20)
=\displaystyle= 4nϵ𝔼[sup𝚯[M,M]k×piBσξi(f𝚯(𝑿i)f𝚯(𝑿i))]+δ\displaystyle\frac{4}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sum_{i\in B_{\ell}}\sigma_{\ell}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta
\displaystyle\leq 4nϵ𝔼[sup𝚯[M,M]k×piBσξi(f𝚯(𝑿i)+f𝚯(𝑿i))]+δ\displaystyle\frac{4}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell\in\mathcal{L}}\sum_{i\in B_{\ell}}\sigma_{\ell}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})+f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta
=\frac{4}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{i\in\mathcal{J}}\gamma_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})+f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\right]+\delta \quad (21)
\leq\frac{8}{n\epsilon}\mathbb{E}\left[\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{i\in\mathcal{J}}\gamma_{i}f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})\right]+\delta
\displaystyle\leq 8nϵ48πτ𝜶,kHpM2(kp)3/2|𝒥|+δ\displaystyle\frac{8}{n\epsilon}48\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}\sqrt{|\mathcal{J}|}+\delta (22)
\displaystyle\leq 384nϵπτ𝜶,kHpM2(kp)3/2||+δ.\displaystyle\frac{384}{n\epsilon}\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}\sqrt{|\mathcal{I}|}+\delta. (23)

Equation (20) follows from observing that (f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}))\overset{d}{=}\xi_{i}(f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i}^{\prime})-f_{\boldsymbol{\Theta}}(\boldsymbol{X}_{i})). In equation (21), we write \gamma_{i}:=\sigma_{\ell}\xi_{i} for i\in B_{\ell} and \mathcal{J}:=\cup_{\ell\in\mathcal{L}}B_{\ell}; the \{\gamma_{i}\}_{i\in\mathcal{J}} are independent Rademacher random variables by construction. Equation (22) follows from appealing to Theorem 3.1. Thus, combining equations (18), (19), and (23), we conclude that, with probability at least 1-e^{-2L\delta^{2}},

sup𝚯[M,M]k×p=1L𝟙{(PPB)f𝚯>ϵ}\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}\sum_{\ell=1}^{L}\mathbbm{1}\left\{(P-P_{B_{\ell}})f_{\boldsymbol{\Theta}}>\epsilon\right\}
L(exp{bϵ232τ𝜶,k2Hp2M4k2p2}+|𝒪|L+384nϵπτ𝜶,kHpM2(kp)3/2||+δ).\displaystyle\leq L\left(\exp\left\{-\frac{b\epsilon^{2}}{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}k^{2}p^{2}}\right\}+\frac{|\mathcal{O}|}{L}+\frac{384}{n\epsilon}\sqrt{\pi}\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}(kp)^{3/2}\sqrt{|\mathcal{I}|}+\delta\right). (24)

We choose δ=24+η|𝒪|L\delta=\frac{2}{4+\eta}-\frac{|\mathcal{O}|}{L} and

ϵ=2max{32τ𝜶,k2Hp2M4log(4(η+4)η)kpLn,1536(η+4)τ𝜶,kHpM2πη(kp)3/2||n}.\epsilon=2\max\left\{\sqrt{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}\log\left(\frac{4(\eta+4)}{\eta}\right)}kp\sqrt{\frac{L}{n}},\frac{1536(\eta+4)\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}\sqrt{\pi}}{\eta}(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

This makes the right hand side of (24) strictly smaller than L2\frac{L}{2}. Thus, we have shown that

(sup𝚯[M,M]k×p(Pf𝚯MoMLn(f𝚯))>ϵ)e2Lδ2.\displaystyle\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(Pf_{\boldsymbol{\Theta}}-\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}}))>\epsilon\right)\leq e^{-2L\delta^{2}}.

Similarly, we can show that,

(sup𝚯[M,M]k×p(MoMLn(f𝚯)Pf𝚯)>ϵ)e2Lδ2.\displaystyle\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}(\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}})>\epsilon\right)\leq e^{-2L\delta^{2}}.

Combining the above two inequalities, we get,

(sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|>ϵ)2e2Lδ2.\mathbbm{P}\left(\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|>\epsilon\right)\leq 2e^{-2L\delta^{2}}.

In other words, with probability at least 1-2e^{-2L\delta^{2}},

sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|\displaystyle\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|
\displaystyle\leq 2max{32τ𝜶,k2Hp2M4log(4(η+4)η)kpLn,1536(η+4)τ𝜶,kHpM2πη(kp)3/2||n}\displaystyle 2\max\left\{\sqrt{32\tau_{\boldsymbol{\alpha},k}^{2}H_{p}^{2}M^{4}\log\left(\frac{4(\eta+4)}{\eta}\right)}kp\sqrt{\frac{L}{n}},\frac{1536(\eta+4)\tau_{\boldsymbol{\alpha},k}H_{p}M^{2}\sqrt{\pi}}{\eta}(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}
\displaystyle\lesssim τ𝜶,kHpmax{kpLn,(kp)3/2||n}.\displaystyle\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

C.3 Proof of Corollary 3.5

Proof.

We observe the following.

|Pf𝚯^n(MoM)Pf𝚯|\displaystyle|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|
=Pf𝚯^n(MoM)Pf𝚯\displaystyle=Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}
=Pf𝚯^n(MoM)MoMLn(f𝚯^n(MoM))+MoMLn(f𝚯^n(MoM))MoMLn(f𝚯)+MoMLn(f𝚯)Pf𝚯\displaystyle=Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})+\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})-\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}})+\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}})-Pf_{\boldsymbol{\Theta}^{\ast}}
Pf𝚯^n(MoM)MoMLn(f𝚯^n(MoM))+MoMLn(f𝚯)Pf𝚯\displaystyle\leq Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})+\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}})-Pf_{\boldsymbol{\Theta}^{\ast}} (25)
2sup𝚯[M,M]k×p|MoMLn(f𝚯)Pf𝚯|\displaystyle\leq 2\sup_{\boldsymbol{\Theta}\in[-M,M]^{k\times p}}|\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}})-Pf_{\boldsymbol{\Theta}}|
τ𝜶,kHpmax{kpLn,(kp)3/2||n}.\displaystyle\lesssim\tau_{\boldsymbol{\alpha},k}H_{p}\max\left\{kp\sqrt{\frac{L}{n}},(kp)^{3/2}\frac{\sqrt{|\mathcal{I}|}}{n}\right\}.

Inequality (25) follows from the fact that MoMLn(f𝚯^n(MoM))MoMLn(f𝚯)\text{MoM}^{n}_{L}(f_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}})\leq\text{MoM}^{n}_{L}(f_{\boldsymbol{\Theta}^{\ast}}), by definition of 𝚯^n(MoM)\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}. ∎

C.4 Proof of Corollary 3.6

Proof.

In this case, Hp=2H_{p}=2. Thus, the bound in Corollary 3.5 becomes |Pf𝚯^n(MoM)Pf𝚯|max{Ln,||n}|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|\lesssim\max\left\{\sqrt{\frac{L}{n}},\frac{\sqrt{|\mathcal{I}|}}{n}\right\}. By A7, 2e2Lδ2=o(1)2e^{-2L\delta^{2}}=o(1). Thus,

(|Pf𝚯^n(MoM)Pf𝚯|=O(max{Ln,||n}))1o(1).\mathbbm{P}\left(|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O\left(\max\left\{\sqrt{\frac{L}{n}},\frac{\sqrt{|\mathcal{I}|}}{n}\right\}\right)\right)\geq 1-o(1).

Hence, |Pf𝚯^n(MoM)Pf𝚯|=OP(max{Ln,1n})|Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=O_{P}\left(\max\left\{\sqrt{\frac{L}{n}},\frac{1}{\sqrt{n}}\right\}\right).

Under A7, \max\left\{\sqrt{\frac{L}{n}},\frac{\sqrt{|\mathcal{I}|}}{n}\right\}\leq\max\left\{\sqrt{\frac{L}{n}},\frac{1}{\sqrt{n}}\right\}=o(1), which implies |Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}-Pf_{\boldsymbol{\Theta}^{\ast}}|=o_{P}(1) (here X_{n}=o_{P}(a_{n}) means X_{n}/a_{n}\xrightarrow{P}0). Thus, Pf_{\widehat{\boldsymbol{\Theta}}_{n}^{(\text{MoM})}}\xrightarrow{P}Pf_{\boldsymbol{\Theta}^{\ast}}. Now, for any \epsilon,\delta>0, \mathbbm{P}(Pf_{\widehat{\boldsymbol{\Theta}}_{n}}\leq Pf_{\boldsymbol{\Theta}^{\ast}}+\epsilon)\geq 1-\delta if n is large. From assumption A4, \mathbbm{P}(\text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\leq\eta)\geq 1-\delta for any prefixed \eta>0 and n large. Thus, \text{diss}(\widehat{\boldsymbol{\Theta}}_{n},\boldsymbol{\Theta}^{\ast})\xrightarrow{P}0, which proves the result. ∎

Appendix D Additional Experiments

D.1 Additional Simulations

Experiment 3

We use the same simulation setting as Experiment 1. However, the outliers are now also generated from a Gaussian, with mean vector 20\cdot\mathbf{1}_{5} and covariance matrix 0.1I_{5}, where \mathbf{1}_{5} is the 5-dimensional vector of all 1's and I_{5} is the 5\times 5 identity matrix.

Experiment 4

We use the same simulation setting as Experiment 2. However, the outliers are now generated from the same scheme as in Experiment 3.
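A minimal sketch of the outlier-generation step shared by Experiments 3 and 4 is given below. The inlier clusters follow the Experiment 1 and 2 settings of the main text and are only stubbed here; the number of outliers shown is illustrative rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_out = 5, 100                                   # n_out is illustrative, not taken from the paper
outliers = rng.multivariate_normal(mean=20.0 * np.ones(p),
                                   cov=0.1 * np.eye(p),
                                   size=n_out)      # N(20 * 1_5, 0.1 * I_5) outliers
# The inliers would be appended from the Experiment 1 / 2 generating process
# described in Section 4, e.g. X = np.vstack([inliers, outliers]).
print(outliers.mean(axis=0))                        # close to (20, ..., 20)
```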

Figure 2: Results on simulation studies based on average ARI values. (a) Simulation 3; (b) Simulation 4.

D.2 Case Study on Real Data: KDDCUP

In this section, we assess performance on real data through analysis of the KDDCUP dataset (Alcalá et al., 2010), which consists of approximately 4.9M observations depicting connections of sequences of TCP packets. The features are normalized to have zero mean and unit standard deviation. The data contain 23 classes, of which the three largest account for 98% of the observations. Following Deshpande et al. (2020), the remaining 20 classes, consisting of 8752 points, are treated as outliers. We run all the algorithms as described at the beginning of Section 4. The parameters used for our algorithm are L=10000, \eta=1.02 and \alpha=1. We measure performance in terms of the ARI, as well as average precision and recall (Hastie et al., 2009); the last two indices are included following Deshpande et al. (2020). We report the average of these indices over 20 replications in Table 1. For all these indices, a higher value indicates superior performance. Table 1 shows trends similar to those discussed in Section 4 of the main text. In particular, MOMPKM recovers the ground-truth partition more closely than the state of the art. Surprisingly, RKMpp performs better than the other competitors (except MOMPKM), which was not the case for simulated data under ideal model assumptions. This is possibly because the data contain only 47 features compared to almost 5M samples, so that good seedings capitalize on the higher signal-to-noise ratio relative to the data used in the simulation studies.

Table 1: Results on KDDCUP Dataset
Index k.means BTKM RCC PAM RKMpp PKM MOMKM MOMPKM
ARI 10^{-5} 0.01 10^{-5} 10^{-16} 0.81 0.24 0.76 \mathbf{0.87}
Precision 0.25 0.23 0.19 0.23 0.64 0.43 0.56 \mathbf{0.71}
Recall 0.00 0.14 0.07 0.11 0.63 0.49 0.59 0.76
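For completeness, the evaluation step above can be reproduced along the following lines (a sketch only; the arrays and toy labelings below are ours, not the paper's code), standardizing the features and computing the ARI with scikit-learn.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-ins: X would hold the 47 KDDCUP features, labels_true the reference
# partition, and labels_pred the output of a clustering algorithm.
X = rng.normal(size=(1000, 47))
labels_true = rng.integers(0, 3, size=1000)
labels_pred = labels_true.copy()
flip = rng.random(1000) < 0.1                       # corrupt 10% of labels for illustration
labels_pred[flip] = rng.integers(0, 3, size=int(flip.sum()))

X_std = StandardScaler().fit_transform(X)           # zero mean, unit standard deviation
print("ARI:", adjusted_rand_score(labels_true, labels_pred))
```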

Appendix E Machine Specifications

The simulation studies were undertaken on an HP laptop with an Intel(R) Core(TM) i3-5010U 2.10 GHz processor, 4 GB RAM, and a 64-bit Windows 8.1 operating system, using the R and Python 3.7 programming languages. The real-data experiments were undertaken on a computing cluster with 656 cores (CPUs) spread across a number of nodes of varying hardware specifications; 256 of the cores are in the ‘low’ partition, and each node has 32 cores and 256 GB of RAM.