
Efficient Algorithms for Sum-of-Minimum Optimization

Lisang Ding (LD) Department of Mathematics, University of California, Los Angeles (UCLA), Los Angeles, CA 90095. [email protected] Ziang Chen (ZC) Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, MA 02139. [email protected] Xinshang Wang (XW) Decision Intelligence Lab, Damo Academy, Alibaba US, Bellevue, WA 98004. [email protected]  and  Wotao Yin (WY) Decision Intelligence Lab, Damo Academy, Alibaba US, Bellevue, WA 98004. [email protected]
Abstract.

In this work, we propose a novel optimization model termed “sum-of-minimum” optimization. This model seeks to minimize the sum or average of NN objective functions over kk parameters, where each objective is the minimum value of a predefined sub-function evaluated over the kk parameters. This universal framework encompasses numerous clustering applications in machine learning and related fields. We develop efficient algorithms for solving sum-of-minimum optimization problems, inspired by a randomized initialization algorithm for the classic kk-means [arthur2007k] and Lloyd’s algorithm [lloyd1982least]. We establish a new tight bound for the generalized initialization algorithm and prove a gradient-descent-like convergence rate for generalized Lloyd’s algorithm. The efficiency of our algorithms is numerically examined on multiple tasks, including generalized principal component analysis, mixed linear regression, and small-scale neural network training. Our approach compares favorably to previous ones based on simpler-but-less-precise optimization reformulations.

A major part of the work of ZC was completed during his internship at Alibaba US DAMO Academy. Corresponding author: Ziang Chen, [email protected]

1. Introduction

In this paper, we propose the following “sum-of-minimum” optimization model:

(1.1) minimize𝐱1,𝐱2,,𝐱kF(𝐱1,𝐱2,,𝐱k):=1Ni=1Nmin{fi(𝐱1),fi(𝐱2),,fi(𝐱k)},\operatorname*{minimize}_{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k}}~{}F(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k}):=\frac{1}{N}\sum_{i=1}^{N}\min\{f_{i}(\mathbf{x}_{1}),f_{i}(\mathbf{x}_{2}),\ldots,f_{i}(\mathbf{x}_{k})\},

where 𝐱1,𝐱2,,𝐱k\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k} are unknown parameters to determine. The cost function FF is the average of NN objectives, where the ii-th objective is fif_{i} evaluated at the best of the kk parameter choices. This paper aims to develop efficient algorithms for solving (1.1) and analyze their performance.

Write [k]={1,2,,k}[k]=\{1,2,\dots,k\} and [N]={1,2,,N}[N]=\{1,2,\dots,N\}. Let (C1,C2,,Ck)(C_{1},C_{2},\ldots,C_{k}) be a partition of [N][N], i.e., CiC_{i}’s are disjoint subsets of [N][N] and their union equals [N][N]. Let 𝒫Nk\mathcal{P}^{k}_{N} denote the set of all such partitions. Then, (1.1) is equivalent to

(1.2) minimize(C1,C2,,Ck)𝒫Nkmin𝐱1,𝐱2,,𝐱k1Nj=1kiCjfi(𝐱j).\operatorname*{minimize}_{(C_{1},C_{2},\ldots,C_{k})\in\mathcal{P}^{k}_{N}}\min_{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k}}\frac{1}{N}\sum_{j=1}^{k}\sum_{i\in C_{j}}f_{i}(\mathbf{x}_{j}).

It is easy to see (C1,C2,,Ck)(C_{1}^{*},C_{2}^{*},\ldots,C_{k}^{*}) and (𝐱1,𝐱2,,𝐱k)(\mathbf{x}_{1}^{*},\mathbf{x}_{2}^{*},\ldots,\mathbf{x}_{k}^{*}) are optimal to (1.2) if and only if (𝐱1,𝐱2,,𝐱k)(\mathbf{x}_{1}^{*},\mathbf{x}_{2}^{*},\ldots,\mathbf{x}_{k}^{*}) is optimal to (1.1) and

iCjfi(𝐱j)=min{fi(𝐱1),fi(𝐱2),,fi(𝐱k)}.i\in C_{j}^{*}\implies f_{i}(\mathbf{x}_{j}^{*})=\min\{f_{i}(\mathbf{x}_{1}^{*}),f_{i}(\mathbf{x}_{2}^{*}),\ldots,f_{i}(\mathbf{x}_{k}^{*})\}.

Reformulation (1.2) reveals its clustering purpose. It finds the optimal partition (C1,C2,,Ck)(C_{1}^{*},C_{2}^{*},\ldots,C_{k}^{*}) such that using the parameter 𝐱j\mathbf{x}_{j}^{*} to minimize the average of fif_{i}’s in the cluster CjC_{j} leads to the minimal total cost.
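To make the objective in (1.1) and its induced partition in (1.2) concrete, the following minimal Python sketch evaluates F at a given tuple of parameters and returns the clusters. It assumes NumPy and a list of callable sub-functions supplied by the user; the function and variable names are illustrative and not part of the paper's released code.

import numpy as np

def sum_of_minimum(fs, xs):
    """Evaluate F(x_1, ..., x_k) = (1/N) * sum_i min_j f_i(x_j)
    and return the induced partition (C_1, ..., C_k).

    fs : list of N callables, each mapping R^d -> R
    xs : list of k parameter vectors (numpy arrays)
    """
    k = len(xs)
    clusters = [[] for _ in range(k)]
    total = 0.0
    for i, f in enumerate(fs):
        vals = [f(x) for x in xs]           # f_i evaluated at all k parameters
        j = int(np.argmin(vals))            # index of the best-fitting parameter
        clusters[j].append(i)               # assign i to cluster C_j
        total += vals[j]                    # accumulate min_j f_i(x_j)
    return total / len(fs), clusters

# Example: k-means-type sub-functions f_i(x) = 0.5 * ||x - y_i||^2
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ys = rng.normal(size=(6, 2))
    fs = [lambda x, y=y: 0.5 * np.sum((x - y) ** 2) for y in ys]
    xs = [ys[0], ys[3]]                     # two candidate parameters
    F_val, clusters = sum_of_minimum(fs, xs)
    print(F_val, clusters)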

Problem (1.1) generalizes kk-means clustering. Consider NN data points 𝐲1,𝐲2,,𝐲N\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{N} and a distance function d(,)d(\cdot,\cdot). The goal of kk-means clustering is to find clustering centroids 𝐱1,𝐱2,,𝐱k\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k} that minimize

F(𝐱1,𝐱2,,𝐱k)=1Ni=1Nminj[k]{d(𝐱j,𝐲i)},F(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k})=\frac{1}{N}\sum_{i=1}^{N}\min_{j\in[k]}\{d(\mathbf{x}_{j},\mathbf{y}_{i})\},

which is the average distance from each data point to its nearest cluster center. The literature presents various choices for the distance function d(,)d(\cdot,\cdot). When d(𝐱,𝐲)=12𝐱𝐲2d(\mathbf{x},\mathbf{y})=\frac{1}{2}\|\mathbf{x}-\mathbf{y}\|^{2}, this optimization problem reduces to the classic kk-means clustering problem, for which numerous algorithms have been proposed [krishna1999genetic, arthur2007k, na2010research, sinaga2020unsupervised, ahmed2020k]. Bregman divergence is also widely adopted as a distance measure[banerjee2005clustering, manthey2013worst, liu2016clustering], defined as

d(𝐱,𝐲)=h(𝐱)h(𝐲)h(𝐲),𝐱𝐲,d(\mathbf{x},\mathbf{y})=h(\mathbf{x})-h(\mathbf{y})-\langle\nabla h(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle,

with hh being a differentiable convex function.
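As a small illustration of the Bregman-divergence distance (a sketch under the definition above, not code from the cited works), taking h(x) = ½||x||² recovers the squared Euclidean distance used in classic kk-means:

import numpy as np

def bregman(x, y, h, grad_h):
    """d(x, y) = h(x) - h(y) - <grad h(y), x - y>."""
    return h(x) - h(y) - np.dot(grad_h(y), x - y)

# h(x) = 0.5 * ||x||^2 gives d(x, y) = 0.5 * ||x - y||^2 (the k-means case).
h = lambda x: 0.5 * np.dot(x, x)
grad_h = lambda x: x

x = np.array([1.0, 2.0])
y = np.array([0.0, 0.0])
print(bregman(x, y, h, grad_h))   # 2.5 == 0.5 * ||x - y||^2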

A special case of (1.1) is mixed linear regression, which generalizes linear regression and models the dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} by multiple linear models. A linear model is a function g(𝐚;𝐱)=𝐚𝐱g(\mathbf{a};\mathbf{x})=\mathbf{a}^{\top}\mathbf{x} with coefficient vector 𝐱\mathbf{x}. We take kk copies of the linear model and let 𝐱j\mathbf{x}_{j} be the coefficient vector of the jj-th model. The loss for each data pair (𝐚i,bi)(\mathbf{a}_{i},b_{i}) is the squared error of the best-fitting linear model, namely minj[k]{12(g(𝐚i;𝐱j)bi)2}\min_{j\in[k]}\{\frac{1}{2}(g(\mathbf{a}_{i};\mathbf{x}_{j})-b_{i})^{2}\}. We aim to find parameters {𝐱j}j=1k\{\mathbf{x}_{j}\}_{j=1}^{k} that minimize the average loss

(1.3) 1Ni=1Nminj[k]{12(g(𝐚i;𝐱j)bi)2}.\frac{1}{N}\sum_{i=1}^{N}\min_{j\in[k]}\left\{\frac{1}{2}\left(g(\mathbf{a}_{i};\mathbf{x}_{j})-b_{i}\right)^{2}\right\}.

Paper [zhong2016mixed] simplifies this non-smooth problem to the sum-of-product problem:

(1.4) minimize𝐱1,𝐱2,,𝐱k1Ni=1Nj[k](g(𝐚i;𝐱j)bi)2,\operatorname*{minimize}_{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{k}}~{}\frac{1}{N}\sum_{i=1}^{N}\prod_{j\in[k]}\left(g(\mathbf{a}_{i};\mathbf{x}_{j})-b_{i}\right)^{2},

which is smooth. Although (1.4) is easier to approach due to its smooth objective function, problem (1.3) is more accurate. Various algorithms are proposed to recover kk linear models from mixed-class data [yi2014alternating, shen2019iterative, kong2020meta, zilber2023imbalanced].
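The sketch below contrasts the two formulations numerically: it evaluates the sum-of-minimum loss (1.3) and the sum-of-product surrogate (1.4) for the same coefficient vectors. It is an illustrative NumPy snippet with hypothetical variable names, not code from the cited works.

import numpy as np

def som_loss(A, b, X):
    """Sum-of-minimum loss (1.3): A is N x d, b has length N, X is k x d."""
    residuals = A @ X.T - b[:, None]              # residual of every model on every point
    return np.mean(0.5 * np.min(residuals ** 2, axis=1))

def sop_loss(A, b, X):
    """Sum-of-product surrogate (1.4): smooth but less precise."""
    residuals = A @ X.T - b[:, None]
    return np.mean(np.prod(residuals ** 2, axis=1))

rng = np.random.default_rng(1)
A = rng.normal(size=(100, 3))
X_true = rng.normal(size=(2, 3))
labels = rng.integers(0, 2, size=100)
b = np.sum(A * X_true[labels], axis=1)            # noiseless mixed-linear data
print(som_loss(A, b, X_true), sop_loss(A, b, X_true))   # both vanish at the ground truth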

In (1.3), the function g(;𝐱)g(\cdot;\mathbf{x}) parameterized by 𝐱\mathbf{x} can be any nonlinear function such as neural networks, and we call this extension mixed nonlinear regression.

An application of (1.1) is generalized principal component analysis (GPCA) [vidal2005generalized, tsakiris2017filtrated], which aims to recover kk low-dimensional subspaces, V1,V2,,VkV_{1},V_{2},\dots,V_{k}, from the given data points 𝐲1,𝐲2,,𝐲N\mathbf{y}_{1},\mathbf{y}_{2},\dots,\mathbf{y}_{N}, which are assumed to be located on or close to the collective union of these subspaces V1V2VkV_{1}\cup V_{2}\cup\cdots\cup V_{k}. This process, also referred to as subspace clustering, seeks to accurately segment data points into their respective subspaces [ma2008estimation, vidal2011subspace, elhamifar2013sparse]. Each subspace VjV_{j} is represented as Vj={𝐲d:𝐲𝐀j=0}V_{j}=\{\mathbf{y}\in\mathbb{R}^{d}:\mathbf{y}^{\top}\mathbf{A}_{j}=0\} where 𝐀jd×r\mathbf{A}_{j}\in\mathbb{R}^{d\times r} and 𝐀j𝐀j=Ir\mathbf{A}_{j}^{\top}\mathbf{A}_{j}=I_{r}, with rr being the co-dimension of VjV_{j}. From an optimization perspective, the GPCA task can be formulated as

(1.5) minimize𝐀j𝐀j=Ir1Ni=1Nminj[k]{12𝐲i𝐀j2}.\operatorname*{minimize}_{\mathbf{A}_{j}^{\top}\mathbf{A}_{j}=I_{r}}~{}\frac{1}{N}\sum_{i=1}^{N}\min_{j\in[k]}\left\{\frac{1}{2}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\|^{2}\right\}.

Similar to (1.4), [peng2023block] works with the less precise reformulation using the product of 𝐲i𝐀j2\|\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\|^{2} for smoothness and introduces a block coordinate descent algorithm.

When k=1k=1, problem (1.1) reduces to the finite-sum optimization problem

(1.6) min𝐱F(𝐱)=1Ni=1Nfi(𝐱),\min_{\mathbf{x}}~{}F(\mathbf{x})=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\mathbf{x}),

widely used to train machine learning models, where fi(𝐱)f_{i}(\mathbf{x}) denotes the loss of the model at parameter 𝐱\mathbf{x} on the ii-th data point. When the underlying model lacks sufficient expressiveness, problem (1.6) alone may not yield satisfactory results. To enhance a model’s performance, one can train the model with multiple parameters 𝐱1,𝐱2,,𝐱k\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{k}, k2k\geq 2, and utilize only the most effective parameter for every data point. This strategy has been successfully applied in various classic tasks, including the aforementioned kk-means clustering, mixed linear regression, and generalized principal component analysis. These applications share a common objective: to segment the dataset into kk groups and identify the best parameter for each group. Although no single parameter might perform well across the entire dataset, every data point is adequately served by at least one of the kk parameters. By aggregating the strengths of multiple smaller models, this approach not only enhances model expressiveness but also offers a cost-efficient alternative to deploying a single larger model.

Although one might expect algorithms and analyses for the sum-of-minimum problem (1.1) to be weaker, since (1.1) subsumes the previously discussed models, we find that our algorithms and analyses for (1.1) enhance those known for the existing models. Our algorithms extend the kk-means++ algorithm [arthur2007k] and Lloyd’s algorithm [lloyd1982least], which were proposed for the classic kk-means problem. We obtain new bounds for these algorithms applied to (1.1). Our contributions are summarized as follows:

  • We propose the sum-of-minimum optimization problem, adapt kk-means++ to the problem for initialization, and generalize Lloyd’s algorithm to approximately solve the problem.

  • We establish theoretical guarantees for the proposed algorithms. Specifically, under the assumption that each fif_{i} is LL-smooth and μ\mu-strongly convex, we prove the output of the initialization is 𝒪(L2μ2lnk)\mathcal{O}(\frac{L^{2}}{\mu^{2}}\ln k)-optimal and that this bound is tight with respect to both kk and the condition number Lμ\frac{L}{\mu}. When reducing to kk-means optimization, our result recovers that of [arthur2007k]. Furthermore, we prove an 𝒪(1T)\mathcal{O}(\frac{1}{T}) convergence rate for generalized Lloyd’s algorithms.

  • We numerically verify the efficiency of the proposed framework and algorithms on several tasks, including generalized principal component analysis, 2\ell_{2}-regularized mixed linear regression, and small-scale neural network training. The results reveal that our optimization model and algorithm lead to a higher success rate in finding the ground-truth clustering, compared to existing approaches that resort to less accurate reformulations for the sake of smoother optimization landscapes. Moreover, our initialization shows significant improvements in both convergence speed and the chance of obtaining better minima.

Our work significantly generalizes classic kk-means to handle more complex nonlinear models and provides new perspectives on improving model performance. The rest of this paper is organized as follows. We introduce the preliminaries and related work in Section 2. We present the algorithms in Section 3. The algorithms are analyzed theoretically in Section 4 and numerically in Section 5. The paper is concluded in Section 6.

Throughout this paper, the 2\ell_{2}-norm and 2\ell_{2}-inner product are denoted by \|\cdot\| and ,\langle\cdot,\cdot\rangle, respectively. We use |||\cdot| to denote the cardinality of a set.

2. Related Work and Preliminary

2.1. Related work

Lloyd’s algorithm [lloyd1982least], a well-established iterative method for the classic kk-means problem, alternates between two key steps [mackay2003example]: 1) assigning 𝐲i\mathbf{y}_{i} to 𝐱j(t)\mathbf{x}_{j}^{(t)} if 𝐱j(t)\mathbf{x}_{j}^{(t)} is the closest to 𝐲i\mathbf{y}_{i} among {𝐱1(t),𝐱2(t),,𝐱k(t)}\{\mathbf{x}_{1}^{(t)},\mathbf{x}_{2}^{(t)},\dots,\mathbf{x}_{k}^{(t)}\}; 2) updating 𝐱j(t+1)\mathbf{x}_{j}^{(t+1)} as the centroid of all 𝐲i\mathbf{y}_{i}’s assigned to 𝐱j(t)\mathbf{x}_{j}^{(t)}. Although Lloyd’s algorithm can be proved to converge to stationary points, the results can be highly suboptimal due to the inherent non-convex nature of the problem. Therefore, the performance of Lloyd’s algorithm highly depends on the initialization. To address this, a randomized initialization algorithm, kk-means++ [arthur2007k], generates an initial solution in a sequential fashion. Each centroid 𝐱j(0)\mathbf{x}_{j}^{(0)} is sampled recurrently according to the distribution

(2.1) (𝐱j(0)=𝐲i)min1jj1𝐱j𝐲i2,i[N].\mathbb{P}(\mathbf{x}_{j}^{(0)}=\mathbf{y}_{i})\propto\min_{1\leq j^{\prime}\leq j-1}\|\mathbf{x}_{j^{\prime}}-\mathbf{y}_{i}\|^{2},\quad i\in[N].

The idea is to sample a data point farther from the current centroids with higher probability, ensuring the samples to be more evenly distributed across the dataset. It is proved in [arthur2007k] that

(2.2) 𝔼F(𝐱1(0),𝐱2(0),,𝐱k(0))8(lnk+2)F,\mathbb{E}F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})\leq 8(\ln k+2)F^{*},

where FF^{*} is the optimal objective value of FF. This seminal work has inspired numerous enhancements to the kk-means++ algorithm, as evidenced by contributions from [bahmani2012scalable, zimichev2014spectral, bachem2016fast, bachem2016approximate, wu2021user, ren2022novel]. Our result generalizes the bound in (2.2), broadening its applicability in sum-of-minimum optimization.

2.2. Definitions and assumptions

In this subsection, we outline the foundational settings for our algorithm and theory. For each sub-function fif_{i}, we establish the following assumptions.

Assumption 2.1.

Each fif_{i} is LL-smooth, satisfying

fi(x)fi(y)Lxy,x,yd,i[N].\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|,\quad\forall~{}x,y\in\mathbb{R}^{d},\ i\in[N].
Assumption 2.2.

Each fif_{i} is μ\mu-strongly convex, i.e., for all x,ydx,y\in\mathbb{R}^{d} and i[N]i\in[N],

fi(y)fi(x)+fi(x)(yx)+μ2xy2.f_{i}(y)\geq f_{i}(x)+\nabla f_{i}(x)^{\top}(y-x)+\frac{\mu}{2}\|x-y\|^{2}.

Let 𝐱i\mathbf{x}^{*}_{i} denote the optimizer of fi(𝐱)f_{i}(\mathbf{x}) such that fi=fi(𝐱i)f_{i}^{*}=f_{i}(\mathbf{x}^{*}_{i}), and let

S={𝐱i:1iN}S^{*}=\{\mathbf{x}_{i}^{*}:1\leq i\leq N\}

represent the solution set. If SS^{*} comprises l<kl<k different elements, the problem (1.1) possesses infinitely many global minima. Specifically, we can set the variables 𝐱1,𝐱2,,𝐱l\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{l} to be the ll distinct elements in SS^{*}, while leaving 𝐱l+1,𝐱l+2,,𝐱k\mathbf{x}_{l+1},\mathbf{x}_{l+2},\ldots,\mathbf{x}_{k} as free variables. For any such choice of the kk variables, F(𝐱1,𝐱2,,𝐱k)=1Ni=1NfiF(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k})=\frac{1}{N}\sum_{i=1}^{N}f_{i}^{*}. If SS^{*} contains at least kk distinct elements, we have the following proposition.

Proposition 2.3.

Under Assumption 2.2, if |S|k|S^{*}|\geq k, the optimization problem (1.1) admits finitely many minimizers.

Expanding on the correlation between the number of global minimizers and the size of SS^{*}, we introduce well-posedness conditions for SS^{*}.

Definition 2.4 (kk-separate and (k,r)(k,r)-separate).

We call SS^{*} kk-separate if it contains at least kk different elements, i.e., |S|k|S^{*}|\geq k. Furthermore, we call SS^{*} (k,r)(k,r)-separate if there exists 1i1<i2<<ikN1\leq i_{1}<i_{2}<\cdots<i_{k}\leq N such that 𝐱ij𝐱ij>2r\|\mathbf{x}_{i_{j}}^{*}-\mathbf{x}_{i_{j^{\prime}}}^{*}\|>2r for all jjj\not=j^{\prime}.

Finally, we address the optimality measurement in (1.1). The norm of the (sub)-gradient is an inappropriate measure for global optimality due to the problem’s non-convex nature. Instead, we utilize the following optimality gap.

Definition 2.5 (Optimality gap).

Given a point 𝐱\mathbf{x}, the optimality gap of fif_{i} at 𝐱\mathbf{x} is fi(𝐱)fif_{i}(\mathbf{x})-f^{*}_{i}. Given a finite point set \mathcal{M}, the optimality gap of fif_{i} at \mathcal{M} is min𝐱fi(𝐱)fi\min_{\mathbf{x}\in\mathcal{M}}f_{i}(\mathbf{x})-f^{*}_{i}. When ={𝐱1,𝐱2,,𝐱k}\mathcal{M}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{k}\}, the averaged optimality gap of f1,f2,,fNf_{1},f_{2},\dots,f_{N} at \mathcal{M} is the shifted objective function

(2.3) F(𝐱1,𝐱2,,𝐱k)1Ni=1Nfi.F(\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{k})-\frac{1}{N}\sum_{i=1}^{N}f_{i}^{*}.

The averaged optimality gap in (2.3) will be used as the optimality measurement throughout this paper. Specifically, in the classic kk-means problem, one has fi=0f_{i}^{*}=0, so the function F(𝐱1,𝐱2,,𝐱k)F(\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{k}) directly indicates global optimality.

3. Algorithms

In this section, we introduce the algorithm for solving the sum-of-minimum optimization problem (1.1). Our approach is twofold, comprising an initialization phase based on kk-means++ and a generalized version of Lloyd’s algorithm.

3.1. Initialization

As the sum-of-minimum optimization (1.1) can be considered a generalization of the classic kk-means clustering, we adopt kk-means++. In kk-means++, clustering centers are selected sequentially from the dataset, with each data point chosen based on a probability proportional to its squared distance from the nearest existing clustering centers, as detailed in (2.1). We generalize this idea and propose the following initialization algorithm that outputs initial parameters 𝐱1(0),𝐱2(0),,𝐱k(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\dots,\mathbf{x}_{k}^{(0)} for the problem (1.1).

First, we select an index i1i_{1} at random from [N][N], following a uniform distribution, and then utilize a specific method to determine the minimizer 𝐱i1\mathbf{x}^{*}_{i_{1}}, setting

(3.1) 𝐱1(0)=𝐱i1=argmin𝐱fi1(𝐱).\mathbf{x}_{1}^{(0)}=\mathbf{x}^{*}_{i_{1}}=\mathop{\rm argmin}_{\mathbf{x}}f_{i_{1}}(\mathbf{x}).

For j=2,3,,kj=2,3,\dots,k, we sample iji_{j} based on the existing variables j={𝐱1(0),𝐱2(0),,𝐱j1(0)}\mathcal{M}_{j}=\{\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{j-1}^{(0)}\}, with each index ii sampled based on a probability proportional to the optimality gap of fif_{i} at j\mathcal{M}_{j}. Specifically, we compute the minimal optimality gaps

(3.2) vi(j)=min1jj1(fi(𝐱j(0))fi),i[N],v_{i}^{(j)}=\min_{1\leq j^{\prime}\leq j-1}\left(f_{i}(\mathbf{x}^{(0)}_{j^{\prime}})-f^{*}_{i}\right),\quad i\in[N],

as probability scores. Each score vi(j)v_{i}^{(j)} can be regarded as an indicator of how unresolved an instance fif_{i} is with the current variables {𝐱j(0)}j=1j1\{\mathbf{x}_{j^{\prime}}^{(0)}\}_{j^{\prime}=1}^{j-1}. We then normalize these scores

(3.3) wi(j)=vi(j)i=1Nvi(j),i[N],w_{i}^{(j)}=\frac{v_{i}^{(j)}}{\sum_{i^{\prime}=1}^{N}v_{i^{\prime}}^{(j)}},\quad i\in[N],

and sample ij[N]i_{j}\in[N] following the probability distribution 𝐰(j)=(w1(j),,wN(j))\mathbf{w}^{(j)}=\left(w^{(j)}_{1},\dots,w^{(j)}_{N}\right). The jj-th initialization is determined by optimizing fijf_{i_{j}},

(3.4) 𝐱j(0)=𝐱ij=argmin𝐱fij(𝐱).\mathbf{x}_{j}^{(0)}=\mathbf{x}_{i_{j}}^{*}=\mathop{\rm argmin}_{\mathbf{x}}f_{i_{j}}(\mathbf{x}).

We terminate the selection process once kk variables 𝐱1(0),𝐱2(0),,𝐱k(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)} are determined. The pseudo-code of this algorithm is shown in Algorithm 1.

Algorithm 1 Initialization
1:Sample i1i_{1} uniformly at random from [N][N] and compute 𝐱1(0)\mathbf{x}_{1}^{(0)} via (3.1).
2:for j=2,3,,kj=2,3,\dots,k do
3:     Compute 𝐯(j)=(v1(j),v2(j),,vN(j))\mathbf{v}^{(j)}=\left(v_{1}^{(j)},v_{2}^{(j)},\ldots,v_{N}^{(j)}\right) via (3.2).
4:     Compute 𝐰(j)=(w1(j),,wN(j))\mathbf{w}^{(j)}=\left(w^{(j)}_{1},\dots,w^{(j)}_{N}\right) via (3.3).
5:     Sample ij[N]i_{j}\in[N] according to the weights 𝐰(j)\mathbf{w}^{(j)} and compute 𝐱j(0)\mathbf{x}_{j}^{(0)} via (3.4).
6:end for

We note that the scores vi(j)v_{i}^{(j)} defined in (3.2) rely on the optimal objectives fif_{i}^{*}, which may be computationally intensive to calculate in certain scenarios. Therefore, we propose a variant of Algorithm 1 by adjusting the scores vi(j)v_{i}^{(j)}. Specifically, when j1j-1 parameters 𝐱1(0),𝐱2(0),,𝐱j1(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{j-1}^{(0)} are selected, the score is set as the minimum squared norm of the gradient:

(3.5) vi(j)=min1jj1fi(𝐱j(0))2.v_{i}^{(j)}=\min_{1\leq j^{\prime}\leq j-1}\left\|\nabla f_{i}(\mathbf{x}_{j^{\prime}}^{(0)})\right\|^{2}.

This variant involves replacing the scores in Step 3 of Algorithm 1 with (3.5), which is further elaborated in Appendix B.

In the context of classic kk-means clustering where fi(𝐱)=12𝐱𝐲i2f_{i}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{y}_{i}\|^{2} for the ii-th data point 𝐲i\mathbf{y}_{i}, the score vi(j)v_{i}^{(j)} in both (3.2) and (3.5) reduces to

min1jj1𝐱j(0)𝐲i2,\min_{1\leq j^{\prime}\leq j-1}\|\mathbf{x}_{j^{\prime}}^{(0)}-\mathbf{y}_{i}\|^{2},

up to a constant scalar. This initialization algorithm, whether utilizing scores from (3.2) or (3.5), aligns with the approach of the classic kk-means++ algorithm.
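A minimal NumPy sketch of Algorithm 1 is given below. It assumes that, besides each sub-function f_i, the user supplies its minimizer and minimal value (the oracles argmins and fstars are illustrative names, not part of the released code); where f_i^* is expensive, the gradient-norm scores in (3.5) can be substituted for the optimality gaps.

import numpy as np

def initialize(fs, argmins, fstars, k, rng):
    """Algorithm 1 sketch: sequential seeding with optimality-gap scores (3.2).

    fs      : list of N callables f_i
    argmins : list of N minimizers x_i^* (numpy arrays)
    fstars  : array of N optimal values f_i^*
    """
    N = len(fs)
    i1 = rng.integers(N)                          # uniform first index
    centers = [argmins[i1]]                       # x_1^(0) = argmin f_{i_1}
    # v[i] tracks min_{j'} f_i(x_{j'}^(0)) - f_i^*, updated incrementally
    v = np.array([fs[i](centers[0]) - fstars[i] for i in range(N)])
    for _ in range(1, k):
        w = v / v.sum()                           # sampling weights (3.3)
        ij = rng.choice(N, p=w)                   # sample the next index
        centers.append(argmins[ij])               # x_j^(0) = argmin f_{i_j}
        new_gap = np.array([fs[i](centers[-1]) - fstars[i] for i in range(N)])
        v = np.minimum(v, new_gap)                # refresh the scores (3.2)
    return centers

For the classic k-means sub-functions f_i(x) = ½||x − y_i||², passing argmins[i] = y_i and fstars[i] = 0 recovers k-means++ seeding.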

3.2. Generalized Lloyd’s algorithm

Lloyd’s algorithm is employed to minimize the loss in kk-means clustering by alternately updating the clusters and their centroids [lloyd1982least, mackay2003example]. This centroid update process can be regarded as a form of gradient descent applied to group functions, defined by the average distance between data points within a cluster and its centroid [bottou1994convergence]. For our problem (1.1), we introduce a novel gradient descent algorithm that utilizes dynamic group functions. Our algorithm is structured into two main phases: reclassification and group gradient descent.

Reclassification.

The goal is for Cj(t)C_{j}^{(t)} to encompass all i[N]i\in[N] where fif_{i} is active at 𝐱j(t)\mathbf{x}_{j}^{(t)}, allowing us to use the sub-functions fif_{i} within Cj(t)C_{j}^{(t)} to update 𝐱j(t)\mathbf{x}_{j}^{(t)}. This process leads to the reclassification step as follows:

(3.6) Cj(t)={i[N]:fi(𝐱j(t))fi(𝐱j(t)),j[k]}\(l<jCl(t)),j=1,2,,k.C_{j}^{(t)}=\Big{\{}i\in[N]:f_{i}(\mathbf{x}_{j}^{(t)})\leq f_{i}(\mathbf{x}_{j^{\prime}}^{(t)}),\forall~{}j^{\prime}\in[k]\Big{\}}\Big{\backslash}\Big{(}\bigcup_{l<j}C_{l}^{(t)}\Big{)},\quad j=1,2,\dots,k.

Given that reclassification may incur non-negligible costs in practice, a reclassification frequency rr can be established, performing the update in (3.6) every rr iterations while keeping Cj(t)=Cj(t1)C_{j}^{(t)}=C_{j}^{(t-1)} constant during other iterations.

Group gradient descent.

With Cj(t)C_{j}^{(t)} indicating the active fif_{i} at 𝐱j(t)\mathbf{x}_{j}^{(t)}, we can define the group objective function:

(3.7) Fj(t)(𝐳)={1|Cj(t)|iCj(t)fi(𝐳),Cj(t),0,Cj(t)=,F_{j}^{(t)}(\mathbf{z})=\begin{cases}\frac{1}{|C_{j}^{(t)}|}\sum_{i\in C_{j}^{(t)}}f_{i}(\mathbf{z}),&C_{j}^{(t)}\not=\emptyset,\\ 0,&C_{j}^{(t)}=\emptyset,\end{cases}

In each iteration, gradient descent is performed on 𝐱j(t)\mathbf{x}_{j}^{(t)} individually as:

(3.8) 𝐱j(t+1)=𝐱j(t)γFj(t)(𝐱j(t)).\mathbf{x}_{j}^{(t+1)}=\mathbf{x}_{j}^{(t)}-\gamma\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)}).

Here, γ>0\gamma>0 is the chosen step size. Alternatively, one might opt for different iterative updates or directly compute:

𝐱j(t+1)=argmin𝐱iCj(t)fi(𝐱),\mathbf{x}_{j}^{(t+1)}=\mathop{\rm argmin}_{\mathbf{x}}\sum_{i\in C_{j}^{(t)}}f_{i}(\mathbf{x}),

especially if the minimizer of iCj(t)fi(𝐱)\sum_{i\in C_{j}^{(t)}}f_{i}(\mathbf{x}) admits a closed form or can be computed efficiently. The pseudo-code consisting of the above two steps is presented in Algorithm 2.

Algorithm 2 Generalized Lloyd’s Algorithm
1:Generate the initialization 𝐱1(0),𝐱2(0),,𝐱k(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\dots,\mathbf{x}_{k}^{(0)} and set r,γr,\gamma.
2:for t=0,1,2,,Tt=0,1,2,\ldots,T do
3:     if t0(mod r)t\equiv 0\,(\textup{mod }r) then
4:         Compute the partition {Cj(t)}j=1k\{C_{j}^{(t)}\}_{j=1}^{k} via (3.6).
5:     else
6:         Cj(t)=Cj(t1),1jk.C_{j}^{(t)}=C_{j}^{(t-1)},\quad 1\leq j\leq k.
7:     end if
8:     Compute 𝐱j(t+1)\mathbf{x}_{j}^{(t+1)} via (3.8).
9:end for
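The following NumPy sketch mirrors Algorithm 2 with the plain gradient update (3.8). It assumes per-sub-function gradient oracles (grad_fs) are available; the names and interface are illustrative assumptions rather than the released implementation.

import numpy as np

def lloyd(fs, grad_fs, xs0, gamma, T, r=1):
    """Algorithm 2 sketch: alternate reclassification (3.6) and
    group gradient descent (3.8). xs0 is a list of k initial parameters."""
    xs = [x.copy() for x in xs0]
    N, k = len(fs), len(xs)
    clusters = None
    for t in range(T + 1):
        if t % r == 0 or clusters is None:
            # reclassification: assign i to the parameter with the smallest f_i value
            vals = np.array([[f(x) for x in xs] for f in fs])   # N x k table of f_i(x_j)
            assign = np.argmin(vals, axis=1)
            clusters = [np.where(assign == j)[0] for j in range(k)]
        for j in range(k):
            if len(clusters[j]) == 0:
                continue                                        # empty group: F_j^(t) = 0
            g = np.mean([grad_fs[i](xs[j]) for i in clusters[j]], axis=0)
            xs[j] = xs[j] - gamma * g                           # group gradient step (3.8)
    return xs, clusters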

Momentum Lloyd’s Algorithm. We enhance Algorithm 2 by incorporating a momentum term. The momentum for 𝐱j(t)\mathbf{x}_{j}^{(t)} is represented as 𝐦j(t)\mathbf{m}_{j}^{(t)}, with 0<β<10<\beta<1 the momentum factor and γ>0\gamma>0 the step size for the momentum-based updates. We use the gradient of the group function Fj(t)F_{j}^{(t)} to update the momentum 𝐦j(t)\mathbf{m}_{j}^{(t)}. The momentum algorithm admits the following form:

(3.9) 𝐱j(t+1)=𝐱j(t)γ𝐦j(t),\displaystyle\mathbf{x}_{j}^{(t+1)}=\mathbf{x}_{j}^{(t)}-\gamma\mathbf{m}_{j}^{(t)},
(3.10) 𝐦j(t+1)=β𝐦j(t)+Fj(t+1)(𝐱j(t+1)).\displaystyle\mathbf{m}_{j}^{(t+1)}=\beta\mathbf{m}_{j}^{(t)}+\nabla F_{j}^{(t+1)}(\mathbf{x}_{j}^{(t+1)}).

A critical aspect of the momentum algorithm involves updating the classes Cj(t)C_{j}^{(t)} between (3.9) and (3.10). Rather than reclassifying based on fif_{i} evaluated at 𝐱j(t+1)\mathbf{x}_{j}^{(t+1)}, reclassification leverages an acceleration variable:

(3.11) 𝐮j(t+1)=11β(𝐱j(t+1)β𝐱j(t)).\mathbf{u}_{j}^{(t+1)}=\frac{1}{1-\beta}(\mathbf{x}_{j}^{(t+1)}-\beta\mathbf{x}_{j}^{(t)}).

Each index ii is assigned to the class Cj(t+1)C_{j}^{(t+1)} for which fi(𝐮j(t+1))f_{i}(\mathbf{u}_{j}^{(t+1)}) attains the minimal value. Furthermore, to mitigate abrupt shifts in each class CjC_{j}, we implement a controlled reclassification scheme that limits the extent of change in each class:

(3.12) 1α|Cj(t)||Cj(t+1)|α|Cj(t)|,\frac{1}{\alpha}|C_{j}^{(t)}|\leq|C_{j}^{(t+1)}|\leq\alpha|C_{j}^{(t)}|,

where α>1\alpha>1 serves as a constraint factor. Details of the momentum algorithm are provided in Appendix B. We display the pseudo-code in Algorithm 3.

Algorithm 3 Momentum Lloyd’s Algorithm
1:Generate the initialization 𝐱1(0),𝐱2(0),,𝐱k(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\dots,\mathbf{x}_{k}^{(0)}. Set 𝐦1(0),𝐦2(0),,𝐦k(0)\mathbf{m}_{1}^{(0)},\mathbf{m}_{2}^{(0)},\ldots,\mathbf{m}_{k}^{(0)} to be 𝟎\mathbf{0}. Set r,α,β,γr,\alpha,\beta,\gamma.
2:for t=0,1,2,,Tt=0,1,2,\ldots,T do
3:     Update 𝐱j(t)\mathbf{x}_{j}^{(t)} using (3.9).
4:     if t0(mod r)t\equiv 0\,(\textup{mod }r) then
5:         Compute 𝐮j(t+1)\mathbf{u}_{j}^{(t+1)} via (3.11).
6:         Update Cj(t+1)C_{j}^{(t+1)} with 𝐮j(t+1)\mathbf{u}_{j}^{(t+1)} in control, such that (3.12) holds.
7:     else
8:         Cj(t+1)=Cj(t),1jk.C_{j}^{(t+1)}=C_{j}^{(t)},\quad 1\leq j\leq k.
9:     end if
10:     Update the momentum 𝐦j(t)\mathbf{m}_{j}^{(t)} via (3.10).
11:end for

4. Theoretical Analysis

In this section, we prove the efficiency of the initialization algorithm and establish the convergence rate of Lloyd’s algorithm. For the initialization Algorithm 1, we show that the ratio between the optimality gap of {𝐱1(0),𝐱2(0),,𝐱k(0)}\{\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\dots,\mathbf{x}_{k}^{(0)}\} and the smallest possible optimality gap is 𝒪(L2μ2lnk)\mathcal{O}(\frac{L^{2}}{\mu^{2}}\ln k). Additionally, by presenting an example where this ratio is Ω(L2μ2lnk)\Omega(\frac{L^{2}}{\mu^{2}}\ln k), we illustrate the bound’s tightness. For Lloyd’s Algorithms 2 and 3, we establish a gradient decay rate of 𝒪(1T)\mathcal{O}(\frac{1}{T}), underscoring the efficiency and convergence properties of these algorithms.

4.1. Error bound of the initialization algorithm

We define the set of initial points selected by the randomized initialization Algorithm 1,

init={𝐱1(0),𝐱2(0),,𝐱k(0)}={𝐱i1,𝐱i2,,𝐱ik},\mathcal{M}_{\textup{init}}=\{\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\dots,\mathbf{x}_{k}^{(0)}\}=\{\mathbf{x}_{i_{1}}^{*},\mathbf{x}_{i_{2}}^{*},\ldots,\mathbf{x}_{i_{k}}^{*}\},

as the starting configuration for our optimization process. For simplicity, we use F(init)=F(𝐱i1,𝐱i2,,𝐱ik)F(\mathcal{M}_{\textup{init}})=F(\mathbf{x}_{i_{1}}^{*},\mathbf{x}_{i_{2}}^{*},\ldots,\mathbf{x}_{i_{k}}^{*}) to represent the function value at these initial points. Let FF^{*} be the global minimal value of FF, and let f=1Ni=1Nfif^{*}=\frac{1}{N}\sum_{i=1}^{N}f_{i}^{*} denote the average of the optimal values of sub-functions. The effectiveness of Algorithm 1 is evaluated by the ratio between 𝔼F(init)f\mathbb{E}F(\mathcal{M}_{\textup{init}})-f^{*} and FfF^{*}-f^{*}, which is the expected ratio between the averaged optimality gap at init\mathcal{M}_{\textup{init}} and the minimal possible averaged optimality gap. The following theorem provides a specific bound.

Theorem 4.1.

Suppose that Assumptions 2.1 and 2.2 hold. Assume that the solution set SS^{*} is kk-separate. Let init\mathcal{M}_{\textup{init}} be a random initialization set generated by Algorithm 1. We have

𝔼F(init)f4(2+lnk)(L2μ2+Lμ)(Ff).\mathbb{E}F(\mathcal{M}_{\textup{init}})-f^{*}\leq 4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\left(F^{*}-f^{*}\right).

Theorem 4.1 indicates that the relative optimality gap at the initialization set is constrained by a factor of 𝒪(L2μ2lnk)\mathcal{O}(\frac{L^{2}}{\mu^{2}}\ln k) times the minimal optimality gap. The proof of Theorem 4.1 is detailed in Appendix C. In the classic kk-means problem, where L=μL=\mu, this result reduces to Theorem 1.1 in [arthur2007k]. Moreover, the upper bound 𝒪(L2μ2lnk)\mathcal{O}(\frac{L^{2}}{\mu^{2}}\ln k) is proven to be tight via a lower bound established in the following theorem.

Theorem 4.2.

Given a fixed cluster number k>0k>0, there exist an integer N>0N>0 and NN sub-functions {fi}i=1N\{f_{i}\}_{i=1}^{N} satisfying Assumptions 2.1 and 2.2 whose solution set SS^{*} is kk-separate, such that when applying Algorithm 1 to the instances {fi}i=1N\{f_{i}\}_{i=1}^{N}, we have

(4.1) 𝔼F(init)f12L2μ2lnk(Ff).\mathbb{E}F(\mathcal{M}_{\textup{init}})-f^{*}\geq\frac{1}{2}\frac{L^{2}}{\mu^{2}}\ln k\left(F^{*}-f^{*}\right).

The proof of Theorem 4.2 is presented in detail in Appendix C. In both Theorem 4.1 and Theorem 4.2, the performance of Algorithm 1 is analyzed with the assumption that 𝐯(j)\mathbf{v}^{(j)} and fif_{i}^{*} in (3.2) can be computed exactly. However, the accurate computation of fif_{i}^{*} may be impractical due to computational costs. Therefore, we explore the error bounds when the score 𝐯(j)\mathbf{v}^{(j)} approximates (3.2) with some degree of error. We investigate two types of scoring errors.

  • Additive error. There exists ϵ>0\epsilon>0 such that we have access to estimates f~i\tilde{f}^{*}_{i} satisfying

    (4.2) fiϵf~ifi+ϵ.f^{*}_{i}-\epsilon\leq\tilde{f}^{*}_{i}\leq f^{*}_{i}+\epsilon.

    Accordingly, we define:

    (4.3) v~i(j)=min1jj1(max(fi(𝐱j(0))f~i,0))=max(min1jj1(fi(𝐱j(0))f~i),0).\tilde{v}_{i}^{(j)}=\min_{1\leq j^{\prime}\leq j-1}\left(\max\left(f_{i}(\mathbf{x}^{(0)}_{j^{\prime}})-\tilde{f}^{*}_{i},0\right)\right)=\max\left(\min_{1\leq j^{\prime}\leq j-1}\left(f_{i}(\mathbf{x}^{(0)}_{j^{\prime}})-\tilde{f}^{*}_{i}\right),0\right).
  • Scaling error. There exist constants 0<c1c20<c_{1}\leq c_{2} and a deterministic oracle Ov:[N]×dO_{v}:[N]\times\mathbb{R}^{d}\rightarrow\mathbb{R}, such that for any 𝐱d\mathbf{x}\in\mathbb{R}^{d} and i[N]i\in[N],

    (4.4) c1(fi(𝐱)fi)Ov(i,𝐱)c2(fi(𝐱)fi).c_{1}(f_{i}(\mathbf{x})-f_{i}^{*})\leq O_{v}(i,\mathbf{x})\leq c_{2}(f_{i}(\mathbf{x})-f_{i}^{*}).

    Set

    (4.5) v~i(j)=min1jj1Ov(i,𝐱j(0)).\tilde{v}_{i}^{(j)}=\min_{1\leq j^{\prime}\leq j-1}O_{v}(i,\mathbf{x}_{j^{\prime}}^{(0)}).

We first analyze the performance of Algorithm 1 using the score v~i(j)\tilde{v}_{i}^{(j)} with additive error as in (4.3). We require the assumption that the solution set SS^{*} is (k,2ϵμ)(k,\sqrt{\frac{2\epsilon}{\mu}})-separate, which guarantees that

i=1Nminj[l]max((fi(𝐳j)f~i),0)>0,\sum_{i=1}^{N}\min_{j\in[l]}\max\left((f_{i}(\mathbf{z}_{j})-\tilde{f}_{i}^{*}),0\right)>0,

for any l<kl<k and 𝐳1,𝐳2,,𝐳ld\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{l}\in\mathbb{R}^{d}. Hence in the initialization Algorithm 1 with score (4.3), there is at least one v~i(j)>0\tilde{v}_{i}^{(j)}>0 in each round. We have the following generalized version of Theorem 4.1 with additive error.

Theorem 4.3.

Under Assumptions 2.1 and 2.2, suppose that we have {f~i}i=1N\{\tilde{f}_{i}^{*}\}_{i=1}^{N} satisfying (4.2) for some noise factor ϵ>0\epsilon>0, and that the solution set SS^{*} is (k,2ϵμ)\big{(}k,\sqrt{\frac{2\epsilon}{\mu}}\big{)}-separate. Then for the initialization Algorithm 1 with the scores in (3.2) replaced by the noisy scores in (4.3), we have

(4.6) 𝔼F(init)f4(2+lnk)(L2μ2+Lμ)(Ff)+ϵ(1+(2+lnk)(1+4Lμ)).\mathbb{E}F(\mathcal{M}_{\textup{init}})-f^{*}\leq 4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)(F^{*}-f^{*})+\epsilon\cdot\left(1+(2+\ln k)\left(1+\frac{4L}{\mu}\right)\right).

The proof of Theorem 4.3 is deferred to Appendix C. Next, we state a similar result for the scaling-error oracle as in (4.5), whose proof is deferred to Appendix C.

Theorem 4.4.

Suppose that Assumptions 2.12.2 hold and that the solution set SS^{*} is kk-separate. Then for the initialization Algorithm 1 with the scores in (3.2) replaced by the scores in (4.5), we have the following bound:

𝔼F(init)f4(c2c1Lμ+c22c12L2μ2)(2+lnk)(Ff).\mathbb{E}F(\mathcal{M}_{\textup{init}})-f^{*}\leq 4\left(\frac{c_{2}}{c_{1}}\frac{L}{\mu}+\frac{c_{2}^{2}}{c_{1}^{2}}\frac{L^{2}}{\mu^{2}}\right)(2+\ln k)(F^{*}-f^{*}).

Recall that we introduce an alternative score in (3.5). This score can actually be viewed as a noisy version of (3.2) with a scaling error. Under Assumptions 2.1 and 2.2, it holds that

2μ(fi(𝐱)fi)fi(𝐱)22L(fi(𝐱)fi),2\mu(f_{i}(\mathbf{x})-f_{i}^{*})\leq\|\nabla f_{i}(\mathbf{x})\|^{2}\leq 2L(f_{i}(\mathbf{x})-f_{i}^{*}),

for any i[N]i\in[N] and 𝐱d\mathbf{x}\in\mathbb{R}^{d}, which satisfies (4.4) with c1=2μc_{1}=2\mu and c2=2Lc_{2}=2L. Therefore, we have a direct corollary of Theorem 4.4.

Corollary 4.5.

Suppose that Assumptions 2.1 and 2.2 hold and that the solution set SS^{*} is kk-separate. For the initialization Algorithm 1 with the scores in (3.2) replaced by the scores in (3.5), we have

𝔼F(init)f4(L2μ2+L4μ4)(2+lnk)(Ff).\mathbb{E}F(\mathcal{M}_{\textup{init}})-f^{*}\leq 4\left(\frac{L^{2}}{\mu^{2}}+\frac{L^{4}}{\mu^{4}}\right)(2+\ln k)(F^{*}-f^{*}).

4.2. Convergence rate of Lloyd’s algorithm

In this subsection, we state convergence results of Lloyd’s Algorithm 2 and momentum Lloyd’s Algorithm 3, with all proofs being deferred to Appendix D. For Algorithm 2, the optimization process of 𝐱j(t)\mathbf{x}_{j}^{(t)} follows a gradient descent scheme on a varying objective function Fj(t)F_{j}^{(t)}, which is the average of all active fif_{i}’s determined by Cj(t)C_{j}^{(t)} in (3.6). We have the following gradient-descent-like convergence rate on the gradient norm Fj(t)(𝐱j(t))\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|.

Theorem 4.6.

Suppose that Assumption 2.1 is satisfied and we take the step size γ=1L\gamma=\frac{1}{L} in Algorithm 2. Then

1T+1t=0Tj=1k|Cj(t)|NFj(t)(𝐱j(t))22LT+1(F(𝐱1(0),𝐱2(0),,𝐱k(0))F).\frac{1}{T+1}\sum_{t=0}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\left\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\right\|^{2}\leq\frac{2L}{T+1}\left(F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{\star}\right).

For momentum Lloyd’s Algorithm 3, we have a similar convergence rate stated as follows.

Theorem 4.7.

Suppose that Assumption 2.1 holds and that α>1\alpha>1. For Algorithm 3, there exists a constant γ¯(α,β,L)\bar{\gamma}(\alpha,\beta,L), such that

1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))22(1β)γF(𝐱1(0),𝐱2(0),,𝐱k(0))FT,\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\leq\frac{2(1-\beta)}{\gamma}\cdot\frac{F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{*}}{T},

as long as γγ¯(α,β,L)\gamma\leq\bar{\gamma}(\alpha,\beta,L).

5. Numerical Experiments

In this section, we conduct numerical experiments to demonstrate the efficiency of the proposed model and algorithms. Our code with documentation can be found at https://github.com/LisangDing/Sum-of-Minimum_Optimization.

5.1. Comparison between the sum-of-minimum model and the product formulation

We consider two optimization models for generalized principal component analysis: the sum-of-minimum formulation (1.5) and another widely acknowledged formulation given by [peng2023block, vidal2005generalized]:

(5.1) minimize𝐀j𝐀j=Ir1Ni=1Nj=1k𝐲i𝐀j2.\operatorname*{minimize}_{\mathbf{A}_{j}^{\top}\mathbf{A}_{j}=I_{r}}\frac{1}{N}\sum_{i=1}^{N}\prod_{j=1}^{k}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\|^{2}.

The initialization for both formulations is generated by Algorithm 1. We use a slightly modified version of Algorithm 2 to minimize (1.5) since the minimization of the group functions for GPCA admits closed-form solutions. In particular, we alternately compute the minimizer of each group objective function as the update of 𝐀j\mathbf{A}_{j} and then reclassify the sub-functions. We use the block coordinate descent (BCD) method [peng2023block] to minimize (5.1). The BCD algorithm alternately minimizes over 𝐀j\mathbf{A}_{j} with all other 𝐀l(lj)\mathbf{A}_{l}\ (l\neq j) being fixed. The pseudo-codes of both algorithms are included in Appendix E.1.
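For reference, the closed-form group update can be written down explicitly: minimizing the sum of ½||y_i^T A_j||² over i in C_j, subject to A_j having orthonormal columns, is solved by the r eigenvectors of the group scatter matrix with the smallest eigenvalues. The NumPy sketch below illustrates this step; it is an assumption-level sketch with our own variable names, not the released code.

import numpy as np

def gpca_group_update(Y_group, r):
    """Closed-form minimizer of sum_i 0.5 * ||y_i^T A||^2 over A with A^T A = I_r:
    the r eigenvectors of the group scatter matrix with the smallest eigenvalues.

    Y_group : |C_j| x d array stacking the data points of one group as rows.
    """
    S = Y_group.T @ Y_group                 # d x d scatter matrix of the group
    eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues returned in ascending order
    return eigvecs[:, :r]                   # columns span the best co-dimension-r fit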

We set the cluster number k{2,3,4}k\in\{2,3,4\}, dimension d{4,5,6}d\in\{4,5,6\}, subspace co-dimension r=d2r=d-2, and the number of data points N=1000N=1000. The generation of the synthetic dataset {𝒚i}i=1N\{\boldsymbol{y}_{i}\}_{i=1}^{N} is described in Appendix E.1. We set the maximum iteration number to 50 for Algorithm 2 with (1.5) and terminate the algorithm once the objective function stops decreasing, i.e., the partition/clustering remains unchanged. Meanwhile, we set the iteration number to 50 for the BCD algorithm [peng2023block] with (5.1). The classification accuracy of both methods is reported in Table 1, where the classification accuracy is defined as the maximal matching accuracy with respect to the ground truth over all permutations. We observe that our model and algorithm lead to significantly higher accuracy. This is because, compared to (5.1), the formulation in (1.5) models the requirements more precisely, though it is more difficult to optimize due to the non-smoothness.

Table 1. Cluster accuracy percentages of the sum-of-minimum (SoM) vs. the sum-of-product (SoP) GPCA models after 50 iterations.
d=4d=4 d=5d=5 d=6d=6
k=2k=2 SoM 98.24 98.07 98.19
SoP 81.88 75.90 73.33
k=3k=3 SoM 95.04 94.98 95.94
SoP 67.69 62.89 60.85
k=4k=4 SoM 91.30 92.92 93.73
SoP 62.36 59.65 57.89

Next, we compare the computational cost of our model and algorithm with that of the product model and the BCD algorithm. We observe that the BCD algorithm exhibits limited improvements in accuracy after the initial 10 iterations. Thus, for a fair comparison, we set both the maximum number of iterations for our algorithm and the iteration number for the BCD algorithm to 10. The accuracy rate and the CPU time are shown in Table 2, from which one can see that the computational costs of our algorithm and the BCD algorithm are competitive, while our algorithm achieves much better classification accuracy.

Table 2. Averaged cluster accuracy percentage / CPU time in seconds for GPCA after 10 iterations on sum-of-minimum (SoM) and the sum-of-product (SoP) models.
d=4d=4 d=5d=5 d=6d=6
k=2k=2 SoM 97.84 / 0.08 97.93 / 0.08 98.01 / 0.08
SoP 81.78 / 0.14 75.76 / 0.14 73.24 / 0.15
k=3k=3 SoM 93.34 / 0.19 94.14 / 0.19 95.25 / 0.16
SoP 67.18 / 0.20 62.76 / 0.22 60.80 / 0.20
k=4k=4 SoM 88.62 / 0.32 91.78 / 0.29 92.62 / 0.27
SoP 61.52 / 0.26 59.37 / 0.27 57.82 / 0.27

5.2. Comparison between different initializations

We present the performance of Lloyd’s Algorithm 2 combined with different initialization methods. The initialization methods adopted in this subsection are:

  • Random initialization. We initialize variables 𝐱1(0),𝐱2(0),,𝐱k(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)} with i.i.d. samples from the dd-dimensional standard Gaussian distribution.

  • Uniform seeding initialization. We uniformly sample kk different indices i1,i2,,iki_{1},i_{2},\ldots,i_{k} from [N][N], then we set 𝐱ij\mathbf{x}_{i_{j}}^{*} as the initial value of 𝐱j(0)\mathbf{x}_{j}^{(0)}.

  • Proposed initialization. We sample the kk indices using Algorithm 1 and initialize 𝐱j(0)\mathbf{x}_{j}^{(0)} with the minimizer of the corresponding sub-function.

Mixed linear regression. Our first example is the 2\ell_{2}-regularized mixed linear regression. We add an 2\ell_{2} regularization on each sub-function fif_{i} in (1.3) to guarantee strong convexity, and the sum-of-minimum optimization objective function can be written as

1Ni=1Nminj[k]{12(g(𝐚i;𝐱j)bi)2+λ2𝐱j2},\frac{1}{N}\sum_{i=1}^{N}\min_{j\in[k]}\left\{\frac{1}{2}\left(g(\mathbf{a}_{i};\mathbf{x}_{j})-b_{i}\right)^{2}+\frac{\lambda}{2}\|\mathbf{x}_{j}\|^{2}\right\},

where {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} collects all data points and λ>0\lambda>0 is a fixed parameter. The dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} is generated as described in Appendix E.2.

Similar to the GPCA problem, we slightly modify Lloyd’s algorithm since the 2\ell_{2}-regularized least-squares problem can be solved analytically. Specifically, we use the minimizer of the group objective function as the update of 𝐱j\mathbf{x}_{j} instead of performing the gradient descent as in (3.8) of Algorithm 2. We run the algorithm until the maximum iteration number is reached or the objective function value stops decreasing. The detailed algorithm is given in Appendix E.2.
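For completeness, the closed-form group update here is a ridge-regression solve. The sketch below (NumPy, with illustrative names; not the released code) shows the update for one group, where A_grp stacks the vectors a_i with i in C_j as rows and b_grp collects the corresponding b_i.

import numpy as np

def ridge_group_update(A_grp, b_grp, lam):
    """Minimizer of (1/n) * sum_i [0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2]
    over one group: solve (A^T A + n*lam*I) x = A^T b."""
    n, d = A_grp.shape
    return np.linalg.solve(A_grp.T @ A_grp + n * lam * np.eye(d), A_grp.T @ b_grp)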

In the experiment, the number of samples is set to N=1000N=1000 and we vary kk from 4 to 6 and dd (the dimension of 𝐚i\mathbf{a}_{i} and 𝐱j\mathbf{x}_{j}) from 4 to 8. For each problem with a fixed cluster number and dimension, we repeat the experiment 1000 times with different random seeds. In each repeated experiment, we record two metrics. If the objective value at the last iteration is less than or equal to F(𝐱1+,𝐱2+,,𝐱k+)F(\mathbf{x}_{1}^{+},\mathbf{x}_{2}^{+},\ldots,\mathbf{x}_{k}^{+}), where (𝐱1+,𝐱2+,,𝐱k+)(\mathbf{x}_{1}^{+},\mathbf{x}_{2}^{+},\ldots,\mathbf{x}_{k}^{+}) is the ground truth that generates the dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N}, we consider the objective function to be nearly optimized and label the run as successful; otherwise, we label the run as failed. Additionally, we record the number of iterations the algorithm takes to output a result. The results are displayed in Table 3.

Table 3. The failure rate vs. the average iteration number for three initialization methods when solving mixed linear regression problems with different cluster numbers and dimensions. A smaller failure rate and a lower average iteration number indicate better performance. The lowest failure rate among the three methods is bolded, and the lowest average iteration number under the same cluster number and dimension setting is underlined.
Init. Method d=4d=4 d=5d=5 d=6d=6 d=7d=7 d=8d=8
k=4k=4 random 0.056 / 17.577 0.031 / 18.378 0.038 / 19.923 0.058 / 21.631 0.071 / 22.344
unif. seeding 0.057 / 16.139 0.034 / 16.885 0.050 / 18.022 0.055 / 18.708 0.075 / 19.959
proposed 0.050 / 14.551 0.036 / 15.276 0.034 / 16.020 0.044 / 16.936 0.051 / 17.409
k=5k=5 random 0.161 / 26.355 0.156 / 28.844 0.172 / 32.247 0.238 / 35.042 0.321 / 38.324
unif. seeding 0.145 / 23.728 0.136 / 25.914 0.143 / 27.671 0.198 / 29.935 0.256 / 32.662
proposed 0.162 / 21.552 0.130 / 23.476 0.143 / 25.933 0.161 / 27.268 0.217 / 29.086
k=6k=6 random 0.363 / 35.831 0.382 / 41.043 0.504 / 43.999 0.594 / 47.918 0.739 / 48.730
unif. seeding 0.347 / 31.536 0.350 / 35.230 0.408 / 39.688 0.524 / 42.453 0.596 / 43.117
proposed 0.339 / 29.610 0.312 / 33.460 0.389 / 36.068 0.463 / 39.010 0.563 / 40.320

Mixed nonlinear regression. Our second experiment is on mixed nonlinear regression using 2-layer neural networks. We construct kk neural networks with the same structure and let the jj-th neural network be:

ψ(𝐚;𝐖j,𝐩j,𝐪j,oj)=𝐩jReLU(𝐖j𝐚+𝐪j)+oj.\psi(\mathbf{a};\mathbf{W}_{j},\mathbf{p}_{j},\mathbf{q}_{j},o_{j})=\mathbf{p}_{j}^{\top}\textup{ReLU}(\mathbf{W}_{j}\mathbf{a}+\mathbf{q}_{j})+o_{j}.

Here, 𝐚\mathbf{a} is the input data. We let dId_{I} be the input dimension and dHd_{H} be the hidden dimension. The dimensions of the variables are 𝐚dI,𝐖jdH×dI,𝐩j,𝐪jdH,oj.\mathbf{a}\in\mathbb{R}^{d_{I}},\mathbf{W}_{j}\in\mathbb{R}^{d_{H}\times d_{I}},\mathbf{p}_{j},\mathbf{q}_{j}\in\mathbb{R}^{d_{H}},o_{j}\in\mathbb{R}. We denote θj=(𝐖j,𝐩j,𝐪j,oj)\theta_{j}=(\mathbf{W}_{j},\mathbf{p}_{j},\mathbf{q}_{j},o_{j}) as the trainable parameters in the neural network. For each trial, we prepare the ground truth θj+\theta_{j}^{+} and the dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} as described in Appendix E.2. We use the squared 2\ell_{2} loss for each neural network and construct the ii-th sub-function as:

fi(θ)=12(ψ(𝐚i;θ)bi)2+λ2θ2,f_{i}(\theta)=\frac{1}{2}(\psi(\mathbf{a}_{i};\theta)-b_{i})^{2}+\frac{\lambda}{2}\|\theta\|^{2},

where we still use λ2θ2,λ>0\frac{\lambda}{2}\|\theta\|^{2},\lambda>0 as a regularization term. We perform parallel experiments on training the neural networks via Algorithm 2 using three different initialization methods. During the training process of neural networks, stochastic gradient descent is commonly used to manage limited memory, reduce training loss, and improve generalization. Moreover, the ADAM algorithm proposed in [kingma2014adam] is widely applied. This optimizer is empirically observed to be less sensitive to hyperparameters, more robust, and to converge faster. To align with this practice, we replace the group gradient descent in Algorithm 2 and the group momentum method in Algorithm 3 with ADAM optimizer-based backward propagation for the corresponding group objective function.
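For concreteness, a NumPy sketch of the two-layer network ψ and the regularized sub-function f_i is given below. The experiments themselves use an ADAM-based implementation; this sketch only fixes the forward model and the loss, and its names are illustrative.

import numpy as np

def psi(a, W, p, q, o):
    """psi(a; W, p, q, o) = p^T ReLU(W a + q) + o."""
    return p @ np.maximum(W @ a + q, 0.0) + o

def sub_loss(a, b, W, p, q, o, lam):
    """f_i(theta) = 0.5*(psi(a_i; theta) - b_i)^2 + (lam/2)*||theta||^2,
    where theta collects (W, p, q, o)."""
    reg = np.sum(W ** 2) + np.sum(p ** 2) + np.sum(q ** 2) + o ** 2
    return 0.5 * (psi(a, W, p, q, o) - b) ** 2 + 0.5 * lam * reg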

We use two metrics to measure the performance of the algorithms. In one set of experiments, we train kk neural networks until the value of the loss function FF under parameters θ1,θ2,,θk\theta_{1},\theta_{2},\ldots,\theta_{k} is less than that under θ1+,θ2+,,θk+\theta_{1}^{+},\theta_{2}^{+},\ldots,\theta_{k}^{+}. We record the average number of iterations required to reach this loss. In the other set of experiments, we train kk neural networks for a fixed number of iterations. Then, we compute the training and testing loss of the trained neural networks, where the training loss on the dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} is defined as 1Ni=1Nminj(12(ψ(𝐚i;θj)bi)2)\frac{1}{N}\sum_{i=1}^{N}\min_{j}\left(\frac{1}{2}(\psi(\mathbf{a}_{i};\theta_{j})-b_{i})^{2}\right) and the testing loss is defined in a similar way.

In our experiments, the training dataset size is N=1000N=1000 and the testing dataset size is 200200. The testing dataset is generated from the same distribution as the training data. Benefiting from ADAM’s robustness with respect to hyperparameters, we use the default ADAM learning rate γ=1e-3\gamma=1\textup{e-}3. We set r=10r=10 in Lloyd’s Algorithm 2 and fix the cluster number k=5k=5. We test on three different (dI,dH)(d_{I},d_{H}) tuples: (5,3)(5,3), (7,5)(7,5), and (10,5)(10,5). The results can be found in Tables 4 and 5.

Table 4. Average number of epochs for different seeding methods to reach the training loss of the ground-truth model.
(dI,dH)(d_{I},d_{H}) (5,3) (7,5) (10,5)
random 329.4 132.1 130.8
unif. seeding 233.1 71.2 67.6
proposed 181.4 49.3 47.2
Table 5. The training errors vs. the testing errors (unit: 1e-3) of Lloyd’s algorithm with a fixed number of training epochs.
(dI,dH)(d_{I},d_{H}) / Iter. (5,3) / 300 (7,5) / 150 (10,5) / 150
random 4.26 / 4.63 4.57 / 5.54 4.62 / 5.82
unif. seeding 3.86 / 4.25 3.96 / 4.77 3.56 / 4.52
proposed 3.44 / 3.93 3.51 / 4.37 3.39 / 4.34

We can conclude from Tables 3, 4, and 5 that the careful seeding Algorithm 1 generates the best initialization in most cases. This initialization algorithm results in the fewest iterations required by Lloyd’s algorithm to converge, the smallest final loss, and the highest probability of finding the ground-truth clustering.

6. Conclusion

This paper proposes a general framework for sum-of-minimum optimization, as well as efficient initialization and optimization algorithms. Theoretically, tight bounds are established for smooth and strongly convex sub-functions fif_{i}. Though this work is motivated by classic algorithms for the kk-means problem, we extend the ideas and theory significantly for a broad family of tasks. Furthermore, the numerical efficiency is validated for generalized principal component analysis and mixed linear and nonlinear regression problems. Future directions include developing algorithms with provable guarantees for non-convex fif_{i} and exploring empirical potentials on large-scale tasks.

Acknowledgements

Lisang Ding receives support from Air Force Office of Scientific Research Grants MURI-FA9550-18-1-0502. We thank Liangzu Peng for fruitful discussions on GPCA.

References

Appendix A Proof of Proposition 2.3

In this section, we provide a proof of the proposition in Section 2.

Proposition A.1 (Restatement of Proposition 2.3).

Under Assumption 2.2, if |S|k|S^{*}|\geq k, the optimization problem (1.1) admits finitely many minimizers.

Proof.

If |S|=k|S^{*}|=k, then the only minimizer up to a permutation of indices is 𝐱1,𝐱2,,𝐱k\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k}, such that

{𝐱1,𝐱2,,𝐱k}=S.\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k}\}=S^{*}.

Next we consider the case where |S|>k|S^{*}|>k. Let \mathcal{R} be the set of all minimizers of (1.1). Due to the μ\mu-strong convexity of fif_{i}, the set \mathcal{R} is nonempty. Let 𝒯\mathcal{T} be the set of all partitions C1,C2,,CkC_{1},C_{2},\ldots,C_{k} of [N][N], such that CjC_{j}\not=\emptyset for all j[k]j\in[k]. The set 𝒯\mathcal{T} is finite. Next, we show there is an injection from \mathcal{R} to 𝒯\mathcal{T}. For 𝐗=(𝐱1,𝐱2,,𝐱k)\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k})\in\mathcal{R}, we recurrently define

Cj𝐗={i[N]|fi(𝐱j)=minl(fi(𝐱l))}\(1jj1Cj𝐗).C_{j}^{\mathbf{X}}=\{i\in[N]\,|\,f_{i}(\mathbf{x}_{j})=\min_{l}(f_{i}(\mathbf{x}_{l}))\}\backslash\left(\cup_{1\leq j^{\prime}\leq j-1}C_{j^{\prime}}^{\mathbf{X}}\right).

We claim that all Cj𝐗C_{j}^{\mathbf{X}}’s are nonempty. Otherwise, if there is an index jj such that Cj𝐗=C_{j}^{\mathbf{X}}=\emptyset, then since |S|>k|S^{*}|>k there exists 𝐳S\{𝐱1,𝐱2,,𝐱k}\mathbf{z}\in S^{*}\backslash\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k}\}. Writing 𝐳=𝐱i\mathbf{z}=\mathbf{x}_{i}^{*} for some i[N]i\in[N], strong convexity gives fi(𝐳)=fi<minl[k]fi(𝐱l)f_{i}(\mathbf{z})=f_{i}^{*}<\min_{l\in[k]}f_{i}(\mathbf{x}_{l}). Replacing the jj-th parameter 𝐱j\mathbf{x}_{j} with 𝐳\mathbf{z}, we have

F(𝐱1,𝐱2,,𝐱k)>F(𝐱1,𝐱2,,𝐱j1,𝐳,𝐱j+1,,𝐱k).F(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{k})>F(\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{j-1},\mathbf{z},\mathbf{x}_{j+1},\ldots,\mathbf{x}_{k}).

This contradicts the assumption that 𝐗\mathbf{X} is a minimizer of (1.1). Thus, 𝐗(C1𝐗,C2𝐗,,Ck𝐗)\mathbf{X}\mapsto(C_{1}^{\mathbf{X}},C_{2}^{\mathbf{X}},\ldots,C_{k}^{\mathbf{X}}) is a well-defined map from \mathcal{R} to 𝒯\mathcal{T}. Consider another 𝐘=(𝐲1,𝐲2,,𝐲k)\mathbf{Y}=(\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{k})\in\mathcal{R}. If Cj𝐗=Cj𝐘C_{j}^{\mathbf{X}}=C_{j}^{\mathbf{Y}} for all j[k]j\in[k], due to the μ\mu-strong convexity of fif_{i}’s, we have

𝐲j=argmin𝐳iCj𝐘fi(𝐳)=argmin𝐳iCj𝐗fi(𝐳)=𝐱j,j[k].\mathbf{y}_{j}=\mathop{\rm argmin}_{\mathbf{z}}\sum_{i\in C_{j}^{\mathbf{Y}}}f_{i}(\mathbf{z})=\mathop{\rm argmin}_{\mathbf{z}}\sum_{i\in C_{j}^{\mathbf{X}}}f_{i}(\mathbf{z})=\mathbf{x}_{j},\quad\forall j\in[k].

Thus, the map defined above is injective. Overall, \mathcal{R} is a finite set. ∎

Appendix B Algorithm details

In this section, we provide the details of the algorithms presented in Section 3.

B.1. Initialization with alternative scores

When the score function vi(j)v_{i}^{(j)} is taken as the squared gradient norm as in (3.5), the pseudo-code of the initialization can be found in Algorithm 4.

Algorithm 4 Initialization
1:Sample i1i_{1} uniformly at random from [N][N] and compute
𝐱1(0)=𝐱i1=argmin𝐱fi1(𝐱).\mathbf{x}_{1}^{(0)}=\mathbf{x}_{i_{1}}^{*}=\mathop{\rm argmin}_{\mathbf{x}}f_{i_{1}}(\mathbf{x}).
2:for j=2,3,,kj=2,3,\dots,k do
3:     Compute scores 𝐯(j)=(v1(j),v2(j),,vN(j))\mathbf{v}^{(j)}=\left(v_{1}^{(j)},v_{2}^{(j)},\ldots,v_{N}^{(j)}\right) via
vi(j)=min1jj1fi(𝐱j(0))2.v^{(j)}_{i}=\min_{1\leq j^{\prime}\leq j-1}\left\|\nabla f_{i}(\mathbf{x}_{j^{\prime}}^{(0)})\right\|^{2}.
4:     Compute the sampling weights 𝐰(j)=(w1(j),,wN(j))\mathbf{w}^{(j)}=\left(w^{(j)}_{1},\dots,w^{(j)}_{N}\right) by normalizing {vi(j)}i=1N\{v_{i}^{(j)}\}_{i=1}^{N},
wi(j)=vi(j)i=1Nvi(j).w^{(j)}_{i}=\frac{v_{i}^{(j)}}{\sum_{i^{\prime}=1}^{N}v_{i^{\prime}}^{(j)}}.
5:     Sample ij[N]i_{j}\in[N] according to the weights 𝐰(j)\mathbf{w}^{(j)} and compute
𝐱j(0)=𝐱ij=argmin𝐱fij(𝐱).\mathbf{x}_{j}^{(0)}=\mathbf{x}_{i_{j}}^{*}=\mathop{\rm argmin}_{\mathbf{x}}f_{i_{j}}(\mathbf{x}).
6:end for

B.2. Details on momentum Lloyd’s Algorithm

In this section, we elaborate on the details of momentum Lloyd’s Algorithm 3. We use 𝐱1(t),𝐱2(t),,𝐱k(t)\mathbf{x}_{1}^{(t)},\mathbf{x}_{2}^{(t)},\ldots,\mathbf{x}_{k}^{(t)} as the kk variables to be optimized. Correspondingly, we introduce 𝐦1(t),𝐦2(t),,𝐦k(t)\mathbf{m}_{1}^{(t)},\mathbf{m}_{2}^{(t)},\ldots,\mathbf{m}_{k}^{(t)} as their momentum. We use the same notation Fj(t)F_{j}^{(t)} in (3.7) as the group objective function. In each iteration, we update 𝐱\mathbf{x} using momentum gradient descent and update 𝐦\mathbf{m} using the gradient of the group function.

𝐱j(t+1)=𝐱j(t)γ𝐦j(t),\displaystyle\mathbf{x}_{j}^{(t+1)}=\mathbf{x}_{j}^{(t)}-\gamma\mathbf{m}_{j}^{(t)},
𝐦j(t+1)=β𝐦j(t)+Fj(t+1)(𝐱j(t+1)).\displaystyle\mathbf{m}_{j}^{(t+1)}=\beta\mathbf{m}_{j}^{(t)}+\nabla F_{j}^{(t+1)}(\mathbf{x}_{j}^{(t+1)}).

The update of Cj(t)C_{j}^{(t)} in the momentum algorithm is different from the Lloyd’s Algorithm 2. We introduce an acceleration quantity

𝐮j(t+1)=11β(𝐱j(t+1)β𝐱j(t)).\mathbf{u}_{j}^{(t+1)}=\frac{1}{1-\beta}(\mathbf{x}_{j}^{(t+1)}-\beta\mathbf{x}_{j}^{(t)}).

Each class is then renewed around the center 𝐮j(t+1)\mathbf{u}_{j}^{(t+1)}. We update index i[N]i\in[N] to the class Cj(t+1)C_{j}^{(t+1)} where fi(𝐮j(t+1))f_{i}(\mathbf{u}_{j}^{(t+1)}) attains the minimum value among all j[k]j\in[k]. To ensure the stability of the momentum accumulation, we further introduce a controlled reclassification method. We set a reclassification factor α>1\alpha>1. We update Cj(t)C_{j}^{(t)} to Cj(t+1)C_{j}^{(t+1)} in the following way to ensure

1α|Cj(t)||Cj(t+1)|α|Cj(t)|.\frac{1}{\alpha}|C_{j}^{(t)}|\leq|C_{j}^{(t+1)}|\leq\alpha|C_{j}^{(t)}|.

The key idea is to carefully reclassify each index one by one until the size of one class breaks the above restriction. We construct Cj,0=Cj(t),j[k]C_{j,0}=C_{j}^{(t)},j\in[k] as the initialization of the reclassification. We randomly, non-repeatedly pick indices ii from [N][N] one by one. For ll looping from 1 to NN, we let Cj,l1,j[k]C_{j,l-1},j\in[k] be the classification before the ll-th random index is picked. Let ili_{l} be the ll-th index sampled. We reassign ili_{l} to the jj-th class, such that

fil(𝐮j(t+1))=minj[k]fil(𝐮j(t+1)).f_{i_{l}}(\mathbf{u}_{j}^{(t+1)})=\min_{j^{\prime}\in[k]}f_{i_{l}}(\mathbf{u}_{j^{\prime}}^{(t+1)}).

At most two classes change due to the one-index reassignment. We update the class notation from Cj,l1C_{j^{\prime},l-1} to Cj,lC_{j^{\prime},l} for all j[k]j^{\prime}\in[k]. If there is any change between Cj,l1C_{j^{\prime},l-1} and Cj,lC_{j^{\prime},l}, we check whether

1α|Cj(t)||Cj,l|α|Cj(t)|\frac{1}{\alpha}|C_{j^{\prime}}^{(t)}|\leq|C_{j^{\prime},l}|\leq\alpha|C_{j^{\prime}}^{(t)}|

holds. If the above restriction holds for all j[k]j^{\prime}\in[k], we accept the reclassification and move on to the next index sample. Otherwise, we stop the process and return Cj(t+1)=Cj,l1,j[k]C_{j}^{(t+1)}=C_{j,l-1},j\in[k]. If the reclassification trial successfully loops through all NN indices, we assign Cj(t+1)=Cj,N,j[k]C_{j}^{(t+1)}=C_{j,N},j\in[k].
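A NumPy sketch of this controlled reclassification is given below. It is an illustrative rendering of the procedure just described, with hypothetical names; it is not the released code.

import numpy as np

def controlled_reclassify(fs, us, clusters_prev, alpha, rng):
    """Reassign indices one by one to the class whose acceleration point u_j
    gives the smallest f_i, stopping as soon as some class size would leave
    the interval [|C_j^(t)|/alpha, alpha*|C_j^(t)|]."""
    N, k = len(fs), len(us)
    assign = np.empty(N, dtype=int)
    for j, C in enumerate(clusters_prev):            # start from the previous partition
        assign[np.asarray(C, dtype=int)] = j
    old_sizes = np.array([len(C) for C in clusters_prev], dtype=float)
    sizes = old_sizes.copy()
    for i in rng.permutation(N):                     # pick indices without repetition
        j_new = int(np.argmin([fs[i](u) for u in us]))
        j_old = int(assign[i])
        if j_new == j_old:
            continue                                 # no change, no constraint to check
        trial = sizes.copy()
        trial[j_old] -= 1
        trial[j_new] += 1
        ok = np.all(trial >= old_sizes / alpha) and np.all(trial <= alpha * old_sizes)
        if not ok:
            break                                    # reject this move and stop the sweep
        assign[i] = j_new
        sizes = trial
    return [np.where(assign == j)[0] for j in range(k)]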

Appendix C Initialization error bounds

In this section, we prove the error bounds of the initialization Algorithms 1 and 4. Before our proof, we prepare the following concepts and definitions.

Definition C.1.

For any nonempty C[N]C\subset[N], we define

ΔC:=1|C|iCiC𝐱i𝐱i2.\Delta_{C}:=\frac{1}{|C|}\sum_{i\in C}\sum_{i^{\prime}\in C}\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}.
Definition C.2.

Let \mathcal{I}\subset[N] be an index set and \mathcal{M}\subset\mathbb{R}^{d} be a finite set. We define

𝒜(,)=imin𝐳(fi(𝐳)fi(𝐱i)),\displaystyle\mathcal{A}(\mathcal{I},\mathcal{M})=\sum_{i\in\mathcal{I}}\min_{\mathbf{z}\in\mathcal{M}}(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})),
𝒟(,)=imin𝐳fi(𝐳)2.\displaystyle\mathcal{D}(\mathcal{I},\mathcal{M})=\sum_{i\in\mathcal{I}}\min_{\mathbf{z}\in\mathcal{M}}\|\nabla f_{i}(\mathbf{z})\|^{2}.

Under the \mu-strong convexity and L-smoothness Assumptions 2.1 and 2.2, we immediately have

12L𝒟(,)𝒜(,)12μ𝒟(,).\frac{1}{2L}\mathcal{D}(\mathcal{I},\mathcal{M})\leq\mathcal{A}(\mathcal{I},\mathcal{M})\leq\frac{1}{2\mu}\mathcal{D}(\mathcal{I},\mathcal{M}).

Besides, for disjoint index sets 1,2\mathcal{I}_{1},\mathcal{I}_{2}, we have

𝒜(12,)=𝒜(1,)+𝒜(2,),\displaystyle\mathcal{A}(\mathcal{I}_{1}\cup\mathcal{I}_{2},\mathcal{M})=\mathcal{A}(\mathcal{I}_{1},\mathcal{M})+\mathcal{A}(\mathcal{I}_{2},\mathcal{M}),
𝒟(12,)=𝒟(1,)+𝒟(2,).\displaystyle\mathcal{D}(\mathcal{I}_{1}\cup\mathcal{I}_{2},\mathcal{M})=\mathcal{D}(\mathcal{I}_{1},\mathcal{M})+\mathcal{D}(\mathcal{I}_{2},\mathcal{M}).
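To make these quantities concrete, the following sketch (Python/NumPy) evaluates \mathcal{A}(\mathcal{I},\mathcal{M}) and \mathcal{D}(\mathcal{I},\mathcal{M}) for a toy family of quadratics f_i(\mathbf{x})=\frac{c_i}{2}\|\mathbf{x}-\mathbf{x}_i^*\|^2 with \mu\leq c_i\leq L (so that f_i(\mathbf{x}_i^*)=0), and numerically checks the sandwich inequality above. The quadratic family and all names are our own illustration, not objects defined in the paper.

import numpy as np

rng = np.random.default_rng(0)
mu, L, d, N = 0.5, 4.0, 3, 20
c = rng.uniform(mu, L, size=N)               # curvatures: each f_i is mu-strongly convex and L-smooth
x_star = rng.normal(size=(N, d))             # minimizers x_i^*

f      = lambda i, z: 0.5 * c[i] * np.sum((z - x_star[i]) ** 2)   # f_i(z) - f_i(x_i^*), since f_i^* = 0
grad_f = lambda i, z: c[i] * (z - x_star[i])

def A(I, M):                                 # A(I, M) = sum_i min_{z in M} (f_i(z) - f_i(x_i^*))
    return sum(min(f(i, z) for z in M) for i in I)

def D(I, M):                                 # D(I, M) = sum_i min_{z in M} ||grad f_i(z)||^2
    return sum(min(np.sum(grad_f(i, z) ** 2) for z in M) for i in I)

I, M = list(range(N)), [rng.normal(size=d) for _ in range(4)]
print(D(I, M) / (2 * L) <= A(I, M) <= D(I, M) / (2 * mu))   # True: the sandwich inequality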

For problem (1.1), an optimal solution exists due to the strong convexity assumption on the f_{i}'s. We pick one set of optimal parameters \mathbf{z}_{1}^{*},\mathbf{z}_{2}^{*},\ldots,\mathbf{z}_{k}^{*} and let

OPT={𝐳1,𝐳2,,𝐳k}.\mathcal{M}_{\textup{OPT}}=\{\mathbf{z}_{1}^{*},\mathbf{z}_{2}^{*},\ldots,\mathbf{z}_{k}^{*}\}.

Based on these optimal parameters, we introduce (A_{1},A_{2},\ldots,A_{k}) as a partition of [N]: the A_{j}'s are disjoint and

j[k]Aj=[N].\bigcup_{j\in[k]}A_{j}=[N].

Besides, for all i\in A_{j}, f_{i} attains its minimum over \mathcal{M}_{\textup{OPT}} at \mathbf{z}_{j}^{*}, i.e.,

fi(𝐳j)fi(𝐱i)=minj[k](fi(𝐳j)fi(𝐱i)).f_{i}(\mathbf{z}_{j}^{*})-f_{i}(\mathbf{x}_{i}^{*})=\min_{j^{\prime}\in[k]}\left(f_{i}(\mathbf{z}_{j^{\prime}}^{*})-f_{i}(\mathbf{x}_{i}^{*})\right).

The choice of \mathcal{M}_{\textup{OPT}} and (A_{1},A_{2},\ldots,A_{k}) is not unique. We choose them carefully so that A_{j} is non-empty for each j\in[k].

Lemma C.3.

Suppose that Assumption 2.1 holds. Let \mathcal{I} be a nonempty index subset of [N][N] and let ii be sampled uniformly at random from \mathcal{I}. We have

𝔼i𝒜(,{𝐱i})L2Δ.\mathbb{E}_{i}\,\mathcal{A}(\mathcal{I},\{\mathbf{x}_{i}^{*}\})\leq\frac{L}{2}\Delta_{\mathcal{I}}.
Proof.

We have the following direct inequality.

𝔼i𝒜(,{𝐱i})\displaystyle\mathbb{E}_{i}\,\mathcal{A}(\mathcal{I},\{\mathbf{x}_{i}^{*}\}) =1||i𝒜(,{𝐱i})\displaystyle=\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\mathcal{A}(\mathcal{I},\{\mathbf{x}_{i}^{*}\})
=1||ii(fi(𝐱i)fi(𝐱i))\displaystyle=\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\left(f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*})\right)
1||iiL2𝐱i𝐱i2\displaystyle\leq\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\frac{L}{2}\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}
=L2Δ.\displaystyle=\frac{L}{2}\Delta_{\mathcal{I}}.

Lemma C.4.

Let \mathcal{M} be a fixed finite set in d\mathbb{R}^{d}. For two indices iii\not=i^{\prime}, we have

𝒜({i},)2Lμ𝒜({i},)+L𝐱i𝐱i2\mathcal{A}(\{i\},\mathcal{M})\leq\frac{2L}{\mu}\mathcal{A}(\{i^{\prime}\},\mathcal{M})+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}
Proof.

We have the following inequality.

𝒜({i},)\displaystyle\mathcal{A}(\{i\},\mathcal{M}) =min𝐳(fi(𝐳)fi(𝐱i))\displaystyle=\min_{\mathbf{z}\in\mathcal{M}}\left(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})\right)
min𝐳L2𝐳𝐱i2\displaystyle\leq\min_{\mathbf{z}\in\mathcal{M}}\frac{L}{2}\|\mathbf{z}-\mathbf{x}_{i}^{*}\|^{2}
min𝐳L(𝐳𝐱i2+𝐱i𝐱i2)\displaystyle\leq\min_{\mathbf{z}\in\mathcal{M}}L(\|\mathbf{z}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\|\mathbf{x}_{i^{\prime}}^{*}-\mathbf{x}_{i}^{*}\|^{2})
2Lμminz(fi(𝐳)fi(𝐱i))+L𝐱i𝐱i2\displaystyle\leq\frac{2L}{\mu}\min_{z\in\mathcal{M}}\left(f_{i^{\prime}}(\mathbf{z})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*})\right)+L\|\mathbf{x}_{i^{\prime}}^{*}-\mathbf{x}_{i}^{*}\|^{2}
=2Lμ𝒜({i},)+L𝐱i𝐱i2.\displaystyle=\frac{2L}{\mu}\mathcal{A}(\{i^{\prime}\},\mathcal{M})+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}.

Lemma C.5.

Given an index set \mathcal{I} and a finite point set \mathcal{M}, suppose that 𝒜(,)>0.\mathcal{A}(\mathcal{I},\mathcal{M})>0. If we randomly sample an index ii\in\mathcal{I} with probability 𝒜({i},)𝒜(,)\frac{\mathcal{A}(\{i\},\mathcal{M})}{\mathcal{A}(\mathcal{I},\mathcal{M})}, then we have the following inequality,

𝔼𝒜(,{𝐱i})(L2μ+L)Δ.\mathbb{E}\mathcal{A}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})\leq\left(\frac{L^{2}}{\mu}+L\right)\Delta_{\mathcal{I}}.
Proof.

We consider the expectation of 𝒜(,{𝐱i})\mathcal{A}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\}) over ii\in\mathcal{I}. We have the following inequality bound.

𝔼𝒜(,{𝐱i})\displaystyle\mathbb{E}\mathcal{A}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})
=i𝒜({i},)𝒜(,)𝒜(,{𝐱i})\displaystyle\quad=\sum_{i\in\mathcal{I}}\frac{\mathcal{A}(\{i\},\mathcal{M})}{\mathcal{A}(\mathcal{I},\mathcal{M})}\mathcal{A}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})
=i𝒜({i},)𝒜(,)imin(𝒜({i},),fi(𝐱i)fi(𝐱i))\displaystyle\quad=\sum_{i\in\mathcal{I}}\frac{\mathcal{A}(\{i\},\mathcal{M})}{\mathcal{A}(\mathcal{I},\mathcal{M})}\sum_{i^{\prime}\in\mathcal{I}}\min(\mathcal{A}(\{i^{\prime}\},\mathcal{M}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))
(a)i1||i′′(2Lμ𝒜({i′′},)+L𝐱i′′𝐱i2)𝒜(,)imin(𝒜({i},),fi(𝐱i)fi(𝐱i))\displaystyle\quad\stackrel{{\scriptstyle(a)}}{{\leq}}\sum_{i\in\mathcal{I}}\frac{\frac{1}{|\mathcal{I}|}\sum_{i^{\prime\prime}\in\mathcal{I}}\left(\frac{2L}{\mu}\mathcal{A}(\{i^{\prime\prime}\},\mathcal{M})+L\|\mathbf{x}_{i^{\prime\prime}}^{*}-\mathbf{x}_{i}^{*}\|^{2}\right)}{\mathcal{A}(\mathcal{I},\mathcal{M})}\sum_{i^{\prime}\in\mathcal{I}}\min(\mathcal{A}(\{i^{\prime}\},\mathcal{M}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))
=2Lμ1||iimin(𝒜({i},),fi(𝐱i)fi(𝐱i))\displaystyle\quad=\frac{2L}{\mu}\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\min(\mathcal{A}(\{i^{\prime}\},\mathcal{M}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))
+L𝒜(,)||ii′′𝐱i′′𝐱i2imin(𝒜({i},),fi(𝐱i)fi(𝐱i))\displaystyle\quad\quad\quad+\frac{L}{\mathcal{A}(\mathcal{I},\mathcal{M})|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime\prime}\in\mathcal{I}}\|\mathbf{x}_{i^{\prime\prime}}^{*}-\mathbf{x}_{i}^{*}\|^{2}\sum_{i^{\prime}\in\mathcal{I}}\min(\mathcal{A}(\{i^{\prime}\},\mathcal{M}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))
2Lμ1||iiL2𝐱i𝐱i2+L𝒜(,)||ii′′𝐱i′′𝐱i2i𝒜({i},)\displaystyle\quad\leq\frac{2L}{\mu}\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\frac{L}{2}\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{L}{\mathcal{A}(\mathcal{I},\mathcal{M})|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime\prime}\in\mathcal{I}}\|\mathbf{x}_{i^{\prime\prime}}^{*}-\mathbf{x}_{i}^{*}\|^{2}\sum_{i^{\prime}\in\mathcal{I}}\mathcal{A}(\{i^{\prime}\},\mathcal{M})
=(L2μ+L)1||ii𝐱i𝐱i2.\displaystyle\quad=\left(\frac{L^{2}}{\mu}+L\right)\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\|\mathbf{x}_{i^{\prime}}^{*}-\mathbf{x}_{i}^{*}\|^{2}.

Here, (a) holds when applying Lemma C.4. ∎

Lemma C.6.

For any AlA_{l} in the optimal partition (A1,A2,,Ak)(A_{1},A_{2},\ldots,A_{k}), we have

ΔAl4μ𝒜(Al,OPT).\Delta_{A_{l}}\leq\frac{4}{\mu}\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}}).
Proof.

We let \bar{\mathbf{y}}_{l}=\frac{1}{|A_{l}|}\sum_{i\in A_{l}}\mathbf{x}_{i}^{*} be the mean of the minimizers \mathbf{x}_{i}^{*} over the index set A_{l}. Since \bar{\mathbf{y}}_{l} is the mean, the cross terms cancel in the double sum below, and we have

ΔAl\displaystyle\Delta_{A_{l}} =1|Al|iAliAl𝐱i𝐱i2\displaystyle=\frac{1}{|A_{l}|}\sum_{i\in A_{l}}\sum_{i^{\prime}\in A_{l}}\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}
=1|Al|iAliAl𝐱i𝐲¯l+𝐲¯l𝐱i2\displaystyle=\frac{1}{|A_{l}|}\sum_{i\in A_{l}}\sum_{i^{\prime}\in A_{l}}\|\mathbf{x}_{i}^{*}-\bar{\mathbf{y}}_{l}+\bar{\mathbf{y}}_{l}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}
=1|Al|iAliAl(𝐱i𝐲¯l2+𝐲¯l𝐱i2)\displaystyle=\frac{1}{|A_{l}|}\sum_{i\in A_{l}}\sum_{i^{\prime}\in A_{l}}\left(\|\mathbf{x}_{i}^{*}-\bar{\mathbf{y}}_{l}\|^{2}+\|\bar{\mathbf{y}}_{l}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}\right)
=2iAl𝐱i𝐲¯l2\displaystyle=2\sum_{i\in A_{l}}\|\mathbf{x}_{i}^{*}-\bar{\mathbf{y}}_{l}\|^{2}
=2min𝐳iAl𝐱i𝐳2\displaystyle=2\min_{\mathbf{z}}\sum_{i\in A_{l}}\|\mathbf{x}_{i}^{*}-\mathbf{z}\|^{2}
4μmin𝐳iAl(fi(𝐳)fi(𝐱i))\displaystyle\leq\frac{4}{\mu}\min_{\mathbf{z}}\sum_{i\in A_{l}}\left(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})\right)
=4μmin𝐳𝒜(Al,{𝐳})\displaystyle=\frac{4}{\mu}\min_{\mathbf{z}}\mathcal{A}(A_{l},\{\mathbf{z}\})
\displaystyle\leq\frac{4}{\mu}\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}}).

Proposition C.7.

Let \mathcal{I} be an index set, and \mathcal{M} be a finite point set. Let 𝐳\mathbf{z}^{*} be a minimizer of the objective function i(fi(𝐳)fi(𝐱i))\sum_{i\in\mathcal{I}}\left(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})\right). Suppose that 𝒜(,)>0.\mathcal{A}(\mathcal{I},\mathcal{M})>0. If we sample an index ii\in\mathcal{I} with probability 𝒜({i},)𝒜(,)\frac{\mathcal{A}(\{i\},\mathcal{M})}{\mathcal{A}(\mathcal{I},\mathcal{M})}, then we have the following inequality:

(C.1) \mathbb{E}\mathcal{A}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})\leq 4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\sum_{i^{\prime}\in\mathcal{I}}\left(f_{i^{\prime}}(\mathbf{z}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*})\right).
Proof.

The bound (C.1) follows by combining Lemma C.5 with the argument in the proof of Lemma C.6, which shows \Delta_{\mathcal{I}}\leq\frac{4}{\mu}\min_{\mathbf{z}}\sum_{i^{\prime}\in\mathcal{I}}\left(f_{i^{\prime}}(\mathbf{z})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*})\right) for any index set \mathcal{I}. ∎

Next, we prove that the \frac{L^{2}}{\mu^{2}} factor in (C.1) is tight.

Proposition C.8.

Fix the dimension d\geq 1. There exist an integer N, N \mu-strongly convex and L-smooth sub-functions f_{1},f_{2},\ldots,f_{N} forming the sum-of-minimum optimization problem (1.1), and a finite set \mathcal{M}\subseteq\mathbb{R}^{d} such that, when we sample an index i\in[N] with probability \frac{\mathcal{A}(\{i\},\mathcal{M})}{\mathcal{A}([N],\mathcal{M})}, we have

\mathbb{E}\mathcal{A}([N],\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})\geq\frac{L^{2}}{\mu^{2}}\min_{\mathbf{z}}\sum_{i^{\prime}\in[N]}\left(f_{i^{\prime}}(\mathbf{z})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*})\right).
Proof.

For the case where the dimension d\geq 2, the construction is more concise. We consider the following n+1 points: \mathbf{x}_{i}^{*}=(1,0,0,\ldots,0)\in\mathbb{R}^{d} for i=1,2,\ldots,n, and \mathbf{x}_{n+1}^{*}=(-1,0,0,\ldots,0)\in\mathbb{R}^{d}; all coordinates except the first one are zero. We construct the following functions f_{i} with minimizers \mathbf{x}_{i}^{*}:

(C.2) \begin{split}f_{i}(y_{1},y_{2},\ldots,y_{d})=\frac{L}{2}\left(y_{1}-1\right)^{2}+\frac{\mu}{2}\sum_{j=2}^{d}y_{j}^{2},\quad i=1,2,\ldots,n,\\ f_{i}(y_{1},y_{2},\ldots,y_{d})=\frac{\mu}{2}\left(y_{1}+1\right)^{2}+\frac{L}{2}\sum_{j=2}^{d}y_{j}^{2},\quad i=n+1.\end{split}

We have f_{i}^{*}:=f_{i}(\mathbf{x}_{i}^{*})=0 for all i\in[n+1]. We construct the finite set \mathcal{M} in an orthogonal manner: we let \mathcal{M}=\{(0,\xi)\} be a singleton, where \xi\in\mathbb{R}^{d-1} and \|\xi\|=m\gg 1. The point \boldsymbol{\xi}=(0,\xi)\in\mathcal{M} is orthogonal to all the \mathbf{x}_{i}^{*}'s. Considering the expectation over the newly sampled index i, we have

𝔼i=1n+1min(fi(𝝃)fi(𝐱i),fi(𝐱i)fi(𝐱i))=n(L+μm2)n(L+μm2)+(μ+Lm2)2μ+μ+Lm2n(L+μm2)+(μ+Lm2)2nL.\mathbb{E}\sum_{i^{\prime}=1}^{n+1}\min(f_{i^{\prime}}(\boldsymbol{\xi})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))\\ =\frac{n(L+\mu m^{2})}{n(L+\mu m^{2})+(\mu+Lm^{2})}2\mu+\frac{\mu+Lm^{2}}{n(L+\mu m^{2})+(\mu+Lm^{2})}2nL.

We set m=exp(n)m=\exp(n). As nn\rightarrow\infty, we have

limn𝔼i=1n+1min(fi(𝝃)fi(𝐱i),fi(𝐱i)fi(𝐱i))=2μ+2L2μ.\displaystyle\lim_{n\rightarrow\infty}\mathbb{E}\sum_{i^{\prime}=1}^{n+1}\min(f_{i^{\prime}}(\boldsymbol{\xi})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))=2\mu+2\frac{L^{2}}{\mu}.

Meanwhile, we have

\displaystyle\mathbf{z}^{*}:=\mathop{\rm argmin}_{\mathbf{z}}\sum_{i^{\prime}=1}^{n+1}(f_{i^{\prime}}(\mathbf{z})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))=\left(\frac{nL-\mu}{nL+\mu},0,\ldots,0\right),
i=1n+1(fi(𝐳)fi(𝐱i))=2μnLμ+nLn2μ.\displaystyle\sum_{i^{\prime}=1}^{n+1}(f_{i^{\prime}}(\mathbf{z}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))=\frac{2\mu nL}{\mu+nL}\stackrel{{\scriptstyle n\rightarrow\infty}}{{\rightarrow}}2\mu.

We obtain the following asymptotic ratio:

limn𝔼i=1n+1min(fi(𝝃)fi(𝐱i),fi(𝐱i)fi(𝐱i))i=1n+1(fi(𝐳)fi(𝐱i))=1+L2μ2.\lim_{n\rightarrow\infty}\frac{\mathbb{E}\sum_{i^{\prime}=1}^{n+1}\min(f_{i^{\prime}}(\boldsymbol{\xi})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}),f_{i^{\prime}}(\mathbf{x}_{i}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))}{\sum_{i^{\prime}=1}^{n+1}(f_{i^{\prime}}(\mathbf{z}^{*})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))}=1+\frac{L^{2}}{\mu^{2}}.

As for the 1D case, we consider the following n+1n+1 points. We let xi=1,i=1,2,,nx_{i}^{*}=1,i=1,2,\ldots,n, and xn+1=0x_{n+1}^{*}=0. We construct:

fi(x)={L2(x1)2,x1,μ2(x1)2,x1,i=1,2,,n.\displaystyle f_{i}(x)=\left\{\begin{split}\frac{L}{2}(x-1)^{2},\quad x\leq 1,\\ \frac{\mu}{2}(x-1)^{2},\quad x\geq 1,\end{split}\right.\quad i=1,2,\ldots,n.
fi(x)={μ2x2,x1,L2(x1)2+μ(x12),x1,i=n+1.\displaystyle f_{i}(x)=\left\{\begin{split}&\frac{\mu}{2}x^{2},\quad x\leq 1,\\ &\frac{L}{2}(x-1)^{2}+\mu\left(x-\frac{1}{2}\right),\quad x\geq 1,\end{split}\quad i=n+1.\right.

Each f_{i} has the minimizer x_{i}^{*}. Besides, f_{i}^{*}:=f_{i}(x_{i}^{*})=0. We let \mathcal{M}=\{1+\frac{L}{\mu}\} be a singleton and write \xi=1+\frac{L}{\mu}. We have

fi(xn+1)fi=L2,i=1,2,,n,\displaystyle f_{i}(x_{n+1}^{*})-f_{i}^{*}=\frac{L}{2},\quad i=1,2,\ldots,n,
fn+1(xi)fn+1=μ2,i=1,2,,n,\displaystyle f_{n+1}(x_{i}^{*})-f_{n+1}^{*}=\frac{\mu}{2},\quad i=1,2,\ldots,n,
fi(1+Lμ)fi=L22μ,i=1,2,,n,\displaystyle f_{i}\left(1+\frac{L}{\mu}\right)-f_{i}^{*}=\frac{L^{2}}{2\mu},\quad i=1,2,\ldots,n,
fn+1(1+Lμ)fn+1=L3+2μ2L+μ32μ2.\displaystyle f_{n+1}\left(1+\frac{L}{\mu}\right)-f_{n+1}^{*}=\frac{L^{3}+2\mu^{2}L+\mu^{3}}{2\mu^{2}}.

We have the following expectation:

𝔼i=1n+1min(fi(ξ)fi(xi),fi(xi)fi(xi))\displaystyle\mathbb{E}\sum_{i^{\prime}=1}^{n+1}\min(f_{i^{\prime}}(\xi)-f_{i^{\prime}}(x_{i^{\prime}}^{*}),f_{i^{\prime}}(x_{i}^{*})-f_{i^{\prime}}(x_{i^{\prime}}^{*})) =nL22μμ2+L3+2μ2L+μ32μ2nL2nL22μ+L3+2μ2L+μ32μ2\displaystyle=\frac{n\frac{L^{2}}{2\mu}\cdot\frac{\mu}{2}+\frac{L^{3}+2\mu^{2}L+\mu^{3}}{2\mu^{2}}\cdot n\frac{L}{2}}{n\frac{L^{2}}{2\mu}+\frac{L^{3}+2\mu^{2}L+\mu^{3}}{2\mu^{2}}}
n32μ+L22μ+μ22L.\displaystyle\stackrel{{\scriptstyle n\rightarrow\infty}}{{\rightarrow}}\frac{3}{2}\mu+\frac{L^{2}}{2\mu}+\frac{\mu^{2}}{2L}.

Besides, we have the minimizer z=nLnL+μz^{*}=\frac{nL}{nL+\mu} of the objective function i=1n+1(fi(z)fi(xi))\sum_{i=1}^{n+1}(f_{i}(z)-f_{i}(x_{i}^{*})). We have

i=1n+1(fi(z)fi(xi))=nLμ2(nL+μ)nμ2.\sum_{i=1}^{n+1}(f_{i}(z^{*})-f_{i}(x_{i}^{*}))=\frac{nL\mu}{2(nL+\mu)}\stackrel{{\scriptstyle n\rightarrow\infty}}{{\rightarrow}}\frac{\mu}{2}.

We have the following asymptotic error bound:

\lim_{n\rightarrow\infty}\frac{\mathbb{E}\sum_{i^{\prime}=1}^{n+1}\min(f_{i^{\prime}}(\xi)-f_{i^{\prime}}(x_{i^{\prime}}^{*}),f_{i^{\prime}}(x_{i}^{*})-f_{i^{\prime}}(x_{i^{\prime}}^{*}))}{\sum_{i^{\prime}=1}^{n+1}(f_{i^{\prime}}(z^{*})-f_{i^{\prime}}(x_{i^{\prime}}^{*}))}=3+\frac{L^{2}}{\mu^{2}}+\frac{\mu}{L}.

We remark that the orthogonal technique used in the construction of (C.2) can also be applied to other lower-bound constructions for the initialization Algorithms 1 and 4.
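As a numerical sanity check on the construction (C.2), the following sketch (Python/NumPy) evaluates, in closed form, the expectation and the optimal value displayed in the d\geq 2 case for finite n with m=\exp(n); the ratio approaches 1+\frac{L^{2}}{\mu^{2}} as n grows. The script and the particular values of \mu and L are our own illustration.

import numpy as np

mu, L = 0.5, 4.0

def ratio(n):
    m2 = np.exp(2.0 * n)                       # m = exp(n), so m^2 = exp(2n)
    w1 = n * (L + mu * m2)                     # total sampling weight of x_1^*, ..., x_n^*
    w2 = mu + L * m2                           # sampling weight of x_{n+1}^*
    expected = (w1 * 2 * mu + w2 * 2 * n * L) / (w1 + w2)   # expected residual after adding x_i^*
    optimal = 2 * n * L * mu / (n * L + mu)    # min_z sum_i (f_i(z) - f_i(x_i^*))
    return expected / optimal

for n in (10, 50, 300):
    print(n, ratio(n), 1 + L ** 2 / mu ** 2)   # the ratio tends to 1 + L^2 / mu^2 = 65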

Lemma C.9.

We consider the sum-of-minimum optimization (1.1). Suppose that SS^{*} is kk-separate. Suppose that we have fixed indices i1,i2,,iji_{1},i_{2},\ldots,i_{j}. We define the finite set j={𝐱i1,𝐱i2,,𝐱ij}\mathcal{M}_{j}=\{\mathbf{x}_{i_{1}}^{*},\mathbf{x}_{i_{2}}^{*},\ldots,\mathbf{x}_{i_{j}}^{*}\}. We define the index sets Lj={l:Al{i1,i2,,ij}},Ljc={l:Al{i1,i2,,ij}=},j=lLjAl,jc=lLjcAlL_{j}=\{l:A_{l}\cap\{i_{1},i_{2},\ldots,i_{j}\}\not=\emptyset\},L_{j}^{c}=\{l:A_{l}\cap\{i_{1},i_{2},\ldots,i_{j}\}=\emptyset\},\mathcal{I}_{j}=\cup_{l\in L_{j}}A_{l},\mathcal{I}_{j}^{c}=\cup_{l\in L_{j}^{c}}A_{l}. Let u=|Ljc|u=|L_{j}^{c}|. We sample tut\leq u new indices. We let j,s+={𝐱i1,𝐱i2,,𝐱ij,𝐱ij+1,,𝐱ij+s}\mathcal{M}_{j,s}^{+}=\{\mathbf{x}_{i_{1}}^{*},\mathbf{x}_{i_{2}}^{*},\ldots,\mathbf{x}_{i_{j}}^{*},\mathbf{x}_{i_{j+1}}^{*},\ldots,\mathbf{x}_{i_{j+s}}^{*}\} for 0st0\leq s\leq t. In each round of sampling, the probability of ij+s,s>0i_{j+s},s>0, being sampled as ii is 𝒜({i},j,s1+)𝒜([N],j,s1+)\frac{\mathcal{A}(\{i\},\mathcal{M}_{j,s-1}^{+})}{\mathcal{A}([N],\mathcal{M}_{j,s-1}^{+})}. Then we have the following bound,

(C.3) 𝔼𝒜([N],j,t+)(𝒜(j,j)+4(L2μ2+Lμ)𝒜(jc,OPT))(1+Ht)+utu𝒜(jc,j).\mathbb{E}~{}\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\leq\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t})+\frac{u-t}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

Here, Ht=1+12++1tH_{t}=1+\frac{1}{2}+\cdots+\frac{1}{t} is the harmonic sum.

Proof.

We prove by induction on u=|Ljc|u=|L_{j}^{c}| and tt. We introduce the notation

Φj(i)=𝒜({i},j)=min𝐳j(fi(𝐳)fi(𝐱i)).\Phi_{j}(i)=\mathcal{A}(\{i\},\mathcal{M}_{j})=\min_{\mathbf{z}\in\mathcal{M}_{j}}(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})).

We show that if (C.3) holds for the case (u1,t1)(u-1,t-1) and (u,t1)(u,t-1), then it also holds for the case (u,t)(u,t). We first prove two base cases.

Case 1: t=0,u>0t=0,u>0.

\mathbb{E}~{}\mathcal{A}([N],\mathcal{M}_{j,t}^{+})=\mathcal{A}([N],\mathcal{M}_{j})=\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

Since H_{0}=0, the right-hand side of (C.3) is at least \mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}), so (C.3) holds in this case.

Case 2: t=1,u=1t=1,u=1. With probability 𝒜(j,j)𝒜([N],j)\frac{\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}, the newly sampled index ij+1i_{j+1} will lie in j\mathcal{I}_{j}, and with probability 𝒜(jc,j)𝒜([N],j)\frac{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}, it will lie in jc\mathcal{I}_{j}^{c}. We have bounds on the conditional expectation

𝔼(𝒜([N],j,t+)|ij+1j)\displaystyle\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}\right) 𝒜([N],j),\displaystyle\leq\mathcal{A}([N],\mathcal{M}_{j}),
𝔼(𝒜([N],j,t+)|ij+1jc)\displaystyle\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}^{c}\right) =𝔼(𝒜(j,j,t+)|ij+1jc)+𝔼(𝒜(jc,j,t+)|ij+1jc)\displaystyle=\mathbb{E}\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}^{c}\right)+\mathbb{E}\left(\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}^{c}\right)
𝒜(j,j)+ijcΦj(i)ijcΦj(i)𝒜(jc,j{𝐱i})\displaystyle\leq\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\sum_{i^{\prime}\in\mathcal{I}_{j}^{c}}\frac{\Phi_{j}(i^{\prime})}{\sum_{i\in\mathcal{I}_{j}^{c}}\Phi_{j}(i)}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}\cup\{\mathbf{x}_{i^{\prime}}^{*}\})
(a)𝒜(j,j)+(L2μ+L)Δjc\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\left(\frac{L^{2}}{\mu}+L\right)\Delta_{\mathcal{I}_{j}^{c}}
(b)𝒜(j,j)+4(L2μ2+Lμ)𝒜(jc,OPT)\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})

Here, (a) holds when applying Lemma C.5. (b) holds since jc\mathcal{I}_{j}^{c} is identical to a certain AlA_{l} as u=1u=1 and we apply Lemma C.6. Overall, we have the bound:

𝔼𝒜([N],j,t+)\displaystyle\mathbb{E}\mathcal{A}([N],\mathcal{M}_{j,t}^{+}) =𝒜(j,j)𝒜([N],j)𝔼(𝒜([N],j,t+)|ij+1j)\displaystyle=\frac{\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}\right)
+𝒜(jc,j)𝒜([N],j)𝔼(𝒜([N],j,t+)|ij+1jc)\displaystyle\qquad+\frac{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}^{c}\right)
𝒜(j,j)+4(L2μ2+Lμ)𝒜(jc,OPT)+𝒜(j,j)\displaystyle\leq\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})+\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})
\displaystyle=2\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}}),

which is no larger than the right-hand side of (C.3) for u=t=1, since there 1+H_{1}=2 and the last term vanishes.

Next, we prove that the case (u,t) holds when the inequality holds for the cases (u,t-1) and (u-1,t-1). With probability \frac{\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}, the first newly sampled index i_{j+1} will lie in \mathcal{I}_{j}, and with probability \frac{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}, it will lie in \mathcal{I}_{j}^{c}. Let

α=4(L2μ2+Lμ).\alpha=4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right).

We divide into two cases and compute the corresponding conditional expectations. For the case where ij+1i_{j+1} lies in j\mathcal{I}_{j}, we have the following bound on the conditional expectation.

𝔼(𝒜([N],j,t+)|ij+1j)\displaystyle\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\,\big{|}\,i_{j+1}\in\mathcal{I}_{j}\right)
𝔼((𝒜(j,j{𝐱ij+1})+α𝒜(jc,OPT))(1+Ht1)\displaystyle\quad\leq\mathbb{E}\bigg{(}\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j}\cup\{\mathbf{x}_{i_{j+1}}^{*}\})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})
+ut+1u𝒜(jc,j{𝐱ij+1})|ij+1j)\displaystyle\qquad\quad+\frac{u-t+1}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}\cup\{\mathbf{x}_{i_{j+1}}^{*}\})\big{|}i_{j+1}\in\mathcal{I}_{j}\bigg{)}
(𝒜(j,j)+α𝒜(jc,OPT))(1+Ht1)+ut+1u𝒜(jc,j).\displaystyle\quad\leq\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})+\frac{u-t+1}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

For the case where ij+1i_{j+1} lies in jc\mathcal{I}_{j}^{c}, we have the following inequality:

𝔼(𝒜([N],j,t+)|ij+1jc)\displaystyle\mathbb{E}\,\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\,\big{|}\,i_{j+1}\in\mathcal{I}_{j}^{c}\right)
\displaystyle\quad\leq\sum_{l\in L_{j}^{c}}\sum_{i\in A_{l}}\frac{\Phi_{j}(i)}{\sum_{i^{\prime}\in\mathcal{I}_{j}^{c}}\Phi_{j}(i^{\prime})}\bigg{[}\left(\mathcal{A}(\mathcal{I}_{j}\cup A_{l},\mathcal{M}_{j}\cup\{\mathbf{x}_{i}^{*}\})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c}\backslash A_{l},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})
+utu1𝒜(jc\Al,j{𝐱i})]\displaystyle\quad\quad\quad+\frac{u-t}{u-1}\mathcal{A}(\mathcal{I}_{j}^{c}\backslash A_{l},\mathcal{M}_{j}\cup\{\mathbf{x}_{i}^{*}\})\bigg{]}
\displaystyle\quad\leq\sum_{l\in L_{j}^{c}}\sum_{i\in A_{l}}\frac{\Phi_{j}(i)}{\sum_{i^{\prime}\in\mathcal{I}_{j}^{c}}\Phi_{j}(i^{\prime})}\bigg{[}\big{(}\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\mathcal{A}(A_{l},\mathcal{M}_{j}\cup\{\mathbf{x}_{i}^{*}\})
+α(𝒜(jc,OPT)𝒜(Al,OPT)))(1+Ht1)+utu1(𝒜(jc,j)𝒜(Al,j))]\displaystyle\quad\quad\quad+\alpha(\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})-\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}}))\big{)}(1+H_{t-1})+\frac{u-t}{u-1}\left(\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})-\mathcal{A}(A_{l},\mathcal{M}_{j})\right)\bigg{]}
(a)(𝒜(j,j)+α𝒜(jc,OPT))(1+Ht1)+utu1(𝒜(jc,j)lLjc𝒜(Al,j)2𝒜(jc,j))\displaystyle\quad\stackrel{{\scriptstyle(a)}}{{\leq}}\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})+\frac{u-t}{u-1}\left(\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})-\sum_{l\in L_{j}^{c}}\frac{\mathcal{A}(A_{l},\mathcal{M}_{j})^{2}}{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}\right)
(b)(𝒜(j,j)+α𝒜(jc,OPT))(1+Ht1)+utu1(𝒜(jc,j)1u𝒜(jc,j))\displaystyle\quad\stackrel{{\scriptstyle(b)}}{{\leq}}\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})+\frac{u-t}{u-1}\left(\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})-\frac{1}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})\right)
=(𝒜(j,j)+α𝒜(jc,OPT))(1+Ht1)+utu𝒜(jc,j).\displaystyle\quad=\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})+\frac{u-t}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

Here, (a) holds when applying Lemma C.5 and Lemma C.6. (b) holds as

lLjc𝒜(Al,j)21u(lLjc𝒜(Al,j))2=1u𝒜(jc,j)2.\displaystyle\sum_{l\in L_{j}^{c}}\mathcal{A}(A_{l},\mathcal{M}_{j})^{2}\geq\frac{1}{u}\left(\sum_{l\in L_{j}^{c}}\mathcal{A}(A_{l},\mathcal{M}_{j})\right)^{2}=\frac{1}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})^{2}.

Overall, we have the bound:

𝔼𝒜([N],j,t+)\displaystyle\mathbb{E}\mathcal{A}([N],\mathcal{M}_{j,t}^{+})
=\displaystyle= 𝒜(j,j)𝒜([N],j)𝔼(𝒜([N],j,t+)|ij+1j)+𝒜(jc,j)𝒜([N],j)𝔼(𝒜([N],j,t+)|ij+1jc)\displaystyle\frac{\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}\right)+\frac{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{j,t}^{+})\big{|}i_{j+1}\in\mathcal{I}_{j}^{c}\right)
\displaystyle\leq (𝒜(j,j)+α𝒜(jc,OPT))(1+Ht1)+utu𝒜(jc,j)+1u𝒜(jc,j)𝒜(j,j)𝒜([N],j)\displaystyle\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t-1})+\frac{u-t}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})+\frac{1}{u}\frac{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}
(a)\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}} (𝒜(j,j)+α𝒜(jc,OPT))(1+Ht)+utu𝒜(jc,j).\displaystyle\left(\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})+\alpha\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)(1+H_{t})+\frac{u-t}{u}\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

Here, (a) holds since utu\geq t and

𝒜(jc,j)𝒜(j,j)𝒜([N],j)𝒜(j,j).\frac{\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j})}{\mathcal{A}([N],\mathcal{M}_{j})}\leq\mathcal{A}(\mathcal{I}_{j},\mathcal{M}_{j}).

The proof concludes. ∎

Theorem C.10 (Restatement of Theorem 4.1).

Suppose that the solution set SS^{*} is kk-separate. Let

init={𝐱i1,𝐱i2,,𝐱ik}\mathcal{M}_{\textup{init}}=\{\mathbf{x}_{i_{1}}^{*},\mathbf{x}_{i_{2}}^{*},\ldots,\mathbf{x}_{i_{k}}^{*}\}

be the initial points sampled by the random initialization Algorithm 1. We have the following bound:

(C.4) 𝔼𝒜([N],init)4(2+lnk)(L2μ2+Lμ)𝒜([N],OPT).\mathbb{E}~{}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\leq 4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).
Proof.

We start with a fixed index i_{1} and let \mathcal{M}_{1}=\{\mathbf{x}_{i_{1}}^{*}\}. Suppose that i_{1}\in A_{l}. Then we use Lemma C.9 with u=k-1,t=k-1. Let

α=4(L2μ2+Lμ).\alpha=4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right).

We have

\displaystyle\mathbb{E}~{}\mathcal{A}([N],\mathcal{M}_{1,k-1}^{+})\leq\left(\mathcal{A}(A_{l},\mathcal{M}_{1})+\alpha\mathcal{A}([N]\backslash A_{l},\mathcal{M}_{\textup{OPT}})\right)(1+H_{k-1}).

The term 𝔼𝒜([N],1,k1+)\mathbb{E}\mathcal{A}([N],\mathcal{M}_{1,k-1}^{+}) can be regarded as the conditional expectation of 𝒜([N],init)\mathcal{A}([N],\mathcal{M}_{\textup{init}}) given i1i_{1}

𝔼𝒜([N],1,k1+)=𝔼(𝒜([N],init)|i1).\mathbb{E}~{}\mathcal{A}([N],\mathcal{M}_{1,k-1}^{+})=\mathbb{E}\left(\mathcal{A}([N],\mathcal{M}_{\textup{init}})~{}|~{}i_{1}\right).

According to Algorithm 1, the first index i_{1} is sampled uniformly at random from [N]. Taking the expectation over i_{1}, we get

𝔼𝒜([N],init)\displaystyle\mathbb{E}~{}\mathcal{A}([N],\mathcal{M}_{\textup{init}})
\displaystyle\leq 1Nl[k]iAl(𝒜(Al,{𝐱i})+α𝒜([N]\Al,OPT))(1+Hk1)\displaystyle\frac{1}{N}\sum_{l\in[k]}\sum_{i\in A_{l}}\left(\mathcal{A}(A_{l},\{\mathbf{x}_{i}^{*}\})+\alpha\mathcal{A}([N]\backslash A_{l},\mathcal{M}_{\textup{OPT}})\right)(1+H_{k-1})
=\displaystyle= (1Nl[k]iAl𝒜(Al,{𝐱i})+α(𝒜([N],OPT)1Nl[k]|Al|𝒜(Al,OPT)))(1+Hk1)\displaystyle\left(\frac{1}{N}\sum_{l\in[k]}\sum_{i\in A_{l}}\mathcal{A}(A_{l},\{\mathbf{x}_{i}^{*}\})+\alpha\left(\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})-\frac{1}{N}\sum_{l\in[k]}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)\right)(1+H_{k-1})
(a)\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}} (1Nl[k]|Al|L2ΔAl+α(𝒜([N],OPT)1Nl[k]|Al|𝒜(Al,OPT)))(1+Hk1)\displaystyle\left(\frac{1}{N}\sum_{l\in[k]}|A_{l}|\frac{L}{2}\Delta_{A_{l}}+\alpha\left(\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})-\frac{1}{N}\sum_{l\in[k]}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)\right)(1+H_{k-1})
(b)\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}} (1Nl[k]|Al|2Lμ𝒜(Al,OPT)+α(𝒜([N],OPT)1Nl[k]|Al|𝒜(Al,OPT)))\displaystyle\left(\frac{1}{N}\sum_{l\in[k]}|A_{l}|\frac{2L}{\mu}\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})+\alpha\left(\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})-\frac{1}{N}\sum_{l\in[k]}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)\right)
(1+Hk1)\displaystyle\qquad\cdot(1+H_{k-1})
\displaystyle\leq α𝒜([N],OPT)(1+Hk1)\displaystyle\alpha\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})(1+H_{k-1})
\displaystyle\leq 4(2+lnk)(L2μ2+Lμ)𝒜([N],OPT).\displaystyle 4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).

Here, (a) holds when applying Lemma C.3. (b) holds as a result of Lemma C.6. ∎

When we take

fi(𝐱)=12𝐱𝐱i2,f_{i}(\mathbf{x})=\frac{1}{2}\|\mathbf{x}-\mathbf{x}_{i}^{*}\|^{2},

the optimization problem (1.1) reduces to the k-means problem, and Algorithm 1 reduces to the k-means++ algorithm. Therefore, according to [arthur2007k], the bound given in Theorem C.10 is tight in \ln k up to a constant. Next, we give a more refined lower bound that also involves the condition number \frac{L}{\mu}.
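In this special case, \mathcal{A}(\{i\},\mathcal{M})=\frac{1}{2}\min_{\mathbf{z}\in\mathcal{M}}\|\mathbf{x}_{i}^{*}-\mathbf{z}\|^{2}, so the sampling rule of Algorithm 1 coincides with the D^2 sampling of k-means++. A minimal sketch of this reduction (Python/NumPy) is given below; the function name and the synthetic data are our own illustration.

import numpy as np

def kmeanspp_style_init(points, k, rng):
    """Algorithm-1-style initialization for f_i(x) = 0.5 * ||x - x_i^*||^2,
    where A({i}, M) = 0.5 * min_{z in M} ||x_i^* - z||^2 and x_i^* = points[i]."""
    N = points.shape[0]
    centers = [points[rng.integers(N)]]                # first center: uniform over [N]
    for _ in range(k - 1):
        scores = np.array([min(0.5 * np.sum((p - c) ** 2) for c in centers)
                           for p in points])           # A({i}, M) for the current M
        centers.append(points[rng.choice(N, p=scores / scores.sum())])
    return np.array(centers)

rng = np.random.default_rng(0)
init_centers = kmeanspp_style_init(rng.normal(size=(100, 2)), k=3, rng=rng)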

Theorem C.11 (Restatement of Theorem 4.2).

Given a fixed cluster number k>0, there exist N>0 and N \mu-strongly convex and L-smooth sub-functions \{f_{i}\}_{i=1}^{N} whose minimizer set S^{*} is k-separate and whose sum-of-minimum objective F satisfies F^{*}>f^{*}, so that \mathcal{A}([N],\mathcal{M}_{\textup{OPT}})>0. When we apply Algorithm 1 to sample the initial centers \mathcal{M}_{\textup{init}}, we have the following error bound:

(C.5) 𝔼𝒜([N],init)12L2μ2lnk𝒜([N],OPT).\mathbb{E}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\geq\frac{1}{2}\frac{L^{2}}{\mu^{2}}\ln k\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).
Proof.

We construct the following problem. We fix the cluster number to be k and let the dimension be 2k. We pick the k vertices of a regular simplex as the "centers" of the k clusters; the simplex is embedded in a (k-1)-dimensional subspace. We let the first k coordinates of the vertices be non-zero, while the remaining coordinates are zero, and we denote the first k coordinates of the l-th vertex by \xi^{(l)}\in\mathbb{R}^{k}. We let the simplex be centered at the origin, so that the magnitudes \|\xi^{(l)}\| are all the same, and we let m be the edge length of the simplex. The functions in each cluster follow the orthogonal construction technique in (C.2). Specifically, in cluster l, we construct n+1 functions mapping from \mathbb{R}^{2k} to \mathbb{R} as

(C.6) \begin{split}f_{i,l}(y)&=\frac{\mu}{2}\|y_{1:k}-\xi^{(l)}\|^{2}+\frac{\mu}{2}\sum_{j\geq k+1,j\not=k+l}y_{j}^{2}+\frac{L}{2}(y_{k+l}+1)^{2},\quad i=1,2,\ldots,n,\\ f_{i,l}(y)&=\frac{L}{2}\|y_{1:k}-\xi^{(l)}\|^{2}+\frac{L}{2}\sum_{j\geq k+1,j\not=k+l}y_{j}^{2}+\frac{\mu}{2}(y_{k+l}-1)^{2},\quad i=n+1.\end{split}

We have a total of N=k(n+1) sub-functions. We let m=\exp(n) with n\gg 1, so that \{f_{i,l}\}_{i=1}^{n+1} are assigned to the same cluster when computing the minimizer of the objective function F. We let \mathbf{e}_{l}\in\mathbb{R}^{k} be the l-th unit vector; then the minimizers of the above sub-functions are \mathbf{x}_{i,l}^{*}=[\xi^{(l)};-\mathbf{e}_{l}] (i=1,2,\ldots,n) and \mathbf{x}_{n+1,l}^{*}=[\xi^{(l)};\mathbf{e}_{l}]. We let S^{*} be the set of all the minimizers \{\mathbf{x}_{1,l}^{*}\}_{l=1}^{k}\cup\{\mathbf{x}_{n+1,l}^{*}\}_{l=1}^{k}. For each cluster l, we can compute

minyi=1n+1(fi,l(y)fi,l)=2nLμnL+μ.\min_{y}\sum_{i=1}^{n+1}(f_{i,l}(y)-f_{i,l}^{*})=\frac{2nL\mu}{nL+\mu}.

Thus, we have

𝒜([N],OPT)=k2nLμnL+μ.\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})=k\frac{2nL\mu}{nL+\mu}.

Let \mathcal{M} be a nonempty subset of SS^{*}. We study the optimality gap of FF when sampling the new centers based on \mathcal{M}. We divide the kk clusters into 4 classes as follows:

Ca={l|𝐱1,l,𝐱n+1,l},\displaystyle C_{a}=\{l\,|\,\mathbf{x}_{1,l}^{*}\in\mathcal{M},\mathbf{x}_{n+1,l}^{*}\not\in\mathcal{M}\},
Cb={l|𝐱n+1,l,𝐱1,l},\displaystyle C_{b}=\{l\,|\,\mathbf{x}_{n+1,l}^{*}\in\mathcal{M},\mathbf{x}_{1,l}^{*}\not\in\mathcal{M}\},
Cf={l|𝐱1,l,𝐱n+1,l},\displaystyle C_{f}=\{l\,|\,\mathbf{x}_{1,l}^{*}\in\mathcal{M},\mathbf{x}_{n+1,l}^{*}\in\mathcal{M}\},
Cu={l|𝐱1,l,𝐱n+1,l}.\displaystyle C_{u}=\{l\,|\,\mathbf{x}_{1,l}^{*}\not\in\mathcal{M},\mathbf{x}_{n+1,l}^{*}\not\in\mathcal{M}\}.

We define a=|C_{a}|, b=|C_{b}|, u=|C_{u}|. Taking \mathcal{M} as the existing centers, we continue sampling t\leq u new centers using Algorithm 1. Let \mathbf{w}_{1}^{*},\mathbf{w}_{2}^{*},\ldots,\mathbf{w}_{t}^{*} be the newly sampled centers. We define the quantity

ϕa,b,u,t=𝔼𝒜([N],{𝐰1,𝐰2,,𝐰t}),\phi_{a,b,u,t}=\mathbb{E}\mathcal{A}([N],\mathcal{M}\cup\{\mathbf{w}_{1}^{*},\mathbf{w}_{2}^{*},\ldots,\mathbf{w}_{t}^{*}\}),

which is the expected optimality gap after sampling. We will prove by induction that

(C.7) ϕa,b,u,tαt+1[12(n(μm2+μ+L)+(Lm2+L+μ))(ut)+(2nLb+2μa)(1+Hu)+(2L2μ+2μ)Gu].\begin{split}\phi_{a,b,u,t}\geq\alpha^{t+1}\bigg{[}\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)(u-t)\\ +(2nLb+2\mu a)(1+H_{u})+\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{u}\bigg{]}.\end{split}

Here, H_{u} is the harmonic sum, and G_{u} is defined recursively by:

G0=0,GuGu1=β(1+Hu1).G_{0}=0,\quad G_{u}-G_{u-1}=\beta(1+H_{u-1}).

The parameters 0<\alpha,\beta<1 are chosen as

α=11m,β=11n.\alpha=1-\frac{1}{m},\quad\beta=1-\frac{1}{\sqrt{n}}.

We denote the right-hand side of (C.7) by \alpha^{t+1}\varphi_{a,b,u,t}.

We first consider the case t=0, where we have

ϕa,b,u,0=12(n(μm2+μ+L)+(Lm2+L+μ))u+2nLb+2μa.\displaystyle\phi_{a,b,u,0}=\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u+2nLb+2\mu a.

Meanwhile,

φa,b,u,0=12(n(μm2+μ+L)+(Lm2+L+μ))u+(2nLb+2μa)(1+Hu)+(2L2μ+2μ)Gu.\varphi_{a,b,u,0}=\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u+(2nLb+2\mu a)(1+H_{u})+\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{u}.

If u=0u=0, we have

ϕa,b,0,0=φa,b,0,0αφa,b,0,0.\phi_{a,b,0,0}=\varphi_{a,b,0,0}\geq\alpha\varphi_{a,b,0,0}.

If u1u\geq 1, then 12(n(μm2+μ+L)+(Lm2+L+μ))u\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u becomes the leading term,

(1α)12(n(μm2+μ+L)+(Lm2+L+μ))u\displaystyle(1-\alpha)\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u
12nmμu\displaystyle\quad\geq\frac{1}{2}nm\mu u
(2nLb+2μa)(1+Hu)+(2L2μ+2μ)Gu\displaystyle\quad\geq(2nLb+2\mu a)(1+H_{u})+\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{u}
α((2nLb+2μa)(1+Hu)+(2L2μ+2μ)Gu).\displaystyle\quad\geq\alpha\left((2nLb+2\mu a)(1+H_{u})+\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{u}\right).

Rearranging the inequality, we have:

12(n(μm2+μ+L)+(Lm2+L+μ))u\displaystyle\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u
\displaystyle\geq α(12(n(μm2+μ+L)+(Lm2+L+μ))u+(2nLb+2μa)(1+Hu)+(2L2μ+2μ)Gu)\displaystyle\alpha\left(\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u+(2nLb+2\mu a)(1+H_{u})+\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{u}\right)
=\displaystyle= αφa,b,u,0.\displaystyle\alpha\varphi_{a,b,u,0}.

Therefore, we have

ϕa,b,u,0αφa,b,u,0.\phi_{a,b,u,0}\geq\alpha\varphi_{a,b,u,0}.

Next, we induct on t. When t\geq 1, we have u\geq t\geq 1. We use the one-step transfer technique, conditioning on the first newly sampled center. We let

K=12(n(μm2+μ+L)+(Lm2+L+μ))u+2μa+2nbL,\displaystyle K=\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right)u+2\mu a+2nbL,
A=12(n(μm2+μ+L)+(Lm2+L+μ)),\displaystyle A=\frac{1}{2}\left(n\left(\mu m^{2}+\mu+L\right)+\left(Lm^{2}+L+\mu\right)\right),
B=2L2μ+2μ.\displaystyle B=\frac{2L^{2}}{\mu}+2\mu.

We have

ϕa,b,u,t\displaystyle\phi_{a,b,u,t}
=\displaystyle= n(μm2+μ+L)u2Kϕa+1,b,u1,t1+(Lm2+L+μ)u2Kϕa,b+1,u1,t1\displaystyle\frac{n(\mu m^{2}+\mu+L)u}{2K}\phi_{a+1,b,u-1,t-1}+\frac{(Lm^{2}+L+\mu)u}{2K}\phi_{a,b+1,u-1,t-1}
+2nLbKϕa,b1,u,t1+2μaKϕa1,b,u,t1\displaystyle+\frac{2nLb}{K}\phi_{a,b-1,u,t-1}+\frac{2\mu a}{K}\phi_{a-1,b,u,t-1}
\displaystyle\geq n(μm2+μ+L)u2Kαt[A(ut)+(2nLb+2μa+2μ)(1+Hu1)+BGu1]\displaystyle\frac{n(\mu m^{2}+\mu+L)u}{2K}\alpha^{t}\left[A(u-t)+(2nLb+2\mu a+2\mu)(1+H_{u-1})+BG_{u-1}\right]
+(Lm2+L+μ)u2Kαt[A(ut)+(2nLb+2μa+2nL)(1+Hu1)+BGu1]\displaystyle+\frac{(Lm^{2}+L+\mu)u}{2K}\alpha^{t}\left[A(u-t)+(2nLb+2\mu a+2nL)(1+H_{u-1})+BG_{u-1}\right]
+2nLbKαt[A(ut+1)+(2nLb+2μa2nL)(1+Hu)+BGu]\displaystyle+\frac{2nLb}{K}\alpha^{t}\left[A(u-t+1)+(2nLb+2\mu a-2nL)(1+H_{u})+BG_{u}\right]
+2μaKαt[A(ut+1)+(2nLb+2μa2μ)(1+Hu)+BGu]\displaystyle+\frac{2\mu a}{K}\alpha^{t}\left[A(u-t+1)+(2nLb+2\mu a-2\mu)(1+H_{u})+BG_{u}\right]
=\displaystyle= n(μm2+μ+L)u2Kαtφa,b,u,t\displaystyle\frac{n(\mu m^{2}+\mu+L)u}{2K}\alpha^{t}\varphi_{a,b,u,t}
+n(μm2+μ+L)u2Kαt[2μ(1+Hu1)+(2nLb+2μa)(Hu1Hu)+B(Gu1Gu)]\displaystyle+\frac{n(\mu m^{2}+\mu+L)u}{2K}\alpha^{t}\left[2\mu(1+H_{u-1})+(2nLb+2\mu a)(H_{u-1}-H_{u})+B(G_{u-1}-G_{u})\right]
+(Lm2+L+μ)u2Kαtφa,b,u,t\displaystyle+\frac{(Lm^{2}+L+\mu)u}{2K}\alpha^{t}\varphi_{a,b,u,t}
+(Lm2+L+μ)u2Kαt[2nL(1+Hu1)+(2nLb+2μa)(Hu1Hu)+B(Gu1Gu)]\displaystyle+\frac{(Lm^{2}+L+\mu)u}{2K}\alpha^{t}\left[2nL(1+H_{u-1})+(2nLb+2\mu a)(H_{u-1}-H_{u})+B(G_{u-1}-G_{u})\right]
+2nLbKαtφa,b,u,t+2nLbKαt(A2nL(1+Hu))+2μaKαtφa,b,u,t+2μaKαt(A2μ(1+Hu))\displaystyle+\frac{2nLb}{K}\alpha^{t}\varphi_{a,b,u,t}+\frac{2nLb}{K}\alpha^{t}(A-2nL(1+H_{u}))+\frac{2\mu a}{K}\alpha^{t}\varphi_{a,b,u,t}+\frac{2\mu a}{K}\alpha^{t}(A-2\mu(1+H_{u}))
=\displaystyle= αtφa,b,u,t+1Kαt[12n(μm2+μ+L)u(2μβ(2μ+2L2μ))(1+Hu1)]\displaystyle\alpha^{t}\varphi_{a,b,u,t}+\frac{1}{K}\alpha^{t}\left[\frac{1}{2}n(\mu m^{2}+\mu+L)u\left(2\mu-\beta\left(2\mu+\frac{2L^{2}}{\mu}\right)\right)(1+H_{u-1})\right]
+1Kαt[(Lm2+L+μ)unL(1+Hu1)12(Lm2+L+μ)uβB(1+Hu1)\displaystyle+\frac{1}{K}\alpha^{t}\bigg{[}(Lm^{2}+L+\mu)unL(1+H_{u-1})-\frac{1}{2}(Lm^{2}+L+\mu)u\beta B(1+H_{u-1})
4μ2a(1+Hu)4n2L2b(1+Hu)]\displaystyle\quad\quad\quad\quad-4\mu^{2}a(1+H_{u})-4n^{2}L^{2}b(1+H_{u})\bigg{]}
=\displaystyle= αtφa,b,u,t+1Kαt[12(n(μm2+μ+L)+(Lm2+L+μ))(2nLb+2μa)+A(2nLb+2μa)]\displaystyle\alpha^{t}\varphi_{a,b,u,t}+\frac{1}{K}\alpha^{t}\left[-\frac{1}{2}\left(n(\mu m^{2}+\mu+L)+(Lm^{2}+L+\mu)\right)(2nLb+2\mu a)+A(2nLb+2\mu a)\right]
+1Kαt[12n(μm2+μ+L)u(2L2μ+1n(2μ+2L2μ))(1+Hu1)\displaystyle+\frac{1}{K}\alpha^{t}\bigg{[}\frac{1}{2}n(\mu m^{2}+\mu+L)u\left(-\frac{2L^{2}}{\mu}+\frac{1}{\sqrt{n}}(2\mu+\frac{2L^{2}}{\mu})\right)(1+H_{u-1})
+(Lm2+L+μ)unL(1+Hu1)]\displaystyle\quad\quad\quad\quad+(Lm^{2}+L+\mu)unL(1+H_{u-1})\bigg{]}
+1Kαt[12(Lm2+L+μ)uβB(1+Hu1)4μ2a(1+Hu)4n2L2b(1+Hu)]\displaystyle+\frac{1}{K}\alpha^{t}\left[-\frac{1}{2}(Lm^{2}+L+\mu)u\beta B(1+H_{u-1})-4\mu^{2}a(1+H_{u})-4n^{2}L^{2}b(1+H_{u})\right]
=\displaystyle= αtφa,b,u,t+1Kαtnm2u(1+Hu1)(μ2+L2)\displaystyle\alpha^{t}\varphi_{a,b,u,t}+\frac{1}{K}\alpha^{t}\sqrt{n}m^{2}u(1+H_{u-1})(\mu^{2}+L^{2})
+1Kαt[n(μ+L)u(LL2μ+1n(μ+L2μ))(1+Hu1)\displaystyle+\frac{1}{K}\alpha^{t}\bigg{[}n(\mu+L)u\left(L-\frac{L^{2}}{\mu}+\frac{1}{\sqrt{n}}\left(\mu+\frac{L^{2}}{\mu}\right)\right)(1+H_{u-1})
12(Lm2+L+μ)uβB(1+Hu1)4μ2a(1+Hu)4n2L2b(1+Hu)]\displaystyle\quad\quad\quad\quad-\frac{1}{2}(Lm^{2}+L+\mu)u\beta B(1+H_{u-1})-4\mu^{2}a(1+H_{u})-4n^{2}L^{2}b(1+H_{u})\bigg{]}
(a)\displaystyle\stackrel{{\scriptstyle(a)}}{{\geq}} αtφa,b,u,t\displaystyle\alpha^{t}\varphi_{a,b,u,t}
\displaystyle\geq αt+1φa,b,u,t.\displaystyle\alpha^{t+1}\varphi_{a,b,u,t}.

For (a), we have

nm2u(1+Hu1)(μ2+L2)n(μ+L)u(LL2μ+1n(μ+L2μ))(1+Hu1)12(Lm2+L+μ)uβB(1+Hu1)4μ2a(1+Hu)4n2L2b(1+Hu).\sqrt{n}m^{2}u(1+H_{u-1})(\mu^{2}+L^{2})\geq n(\mu+L)u\left(L-\frac{L^{2}}{\mu}+\frac{1}{\sqrt{n}}\left(\mu+\frac{L^{2}}{\mu}\right)\right)(1+H_{u-1})\\ -\frac{1}{2}(Lm^{2}+L+\mu)u\beta B(1+H_{u-1})-4\mu^{2}a(1+H_{u})-4n^{2}L^{2}b(1+H_{u}).

This inequality holds since m=\exp(n) and n\gg 1.

Thus the inequality (C.7) holds. Let u=t=k1u=t=k-1. We have

ϕa,b,k1,k1αk[(2nLb+2μa)(1+Hk1)+(2L2μ+2μ)Gk1].\phi_{a,b,k-1,k-1}\geq\alpha^{k}\left[(2nLb+2\mu a)(1+H_{k-1})+\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{k-1}\right].

Let n\geq 100k^{2}. Since m=\exp(n)\geq 100k^{2} as well, we have

\alpha^{k}\geq\frac{3}{4},\quad\beta=1-\frac{1}{\sqrt{n}}\geq 1-\frac{1}{10k}\geq\frac{9}{10}.
\phi_{a,b,k-1,k-1}\geq\frac{3}{4}\left(\frac{2L^{2}}{\mu}+2\mu\right)G_{k-1}.

We have the following inequalities:

Hk1\displaystyle H_{k-1} =1+12++1k11k1t𝑑t=lnk,k1,\displaystyle=1+\frac{1}{2}+\cdots+\frac{1}{k-1}\geq\int_{1}^{k}\frac{1}{t}\,dt=\ln k,\quad k\geq 1,
Gk\displaystyle G_{k} =βj=0k1(1+Hj)β(k+j=1klnj)β(k+t=1klntdt)=β(klnk+1).\displaystyle=\beta\sum_{j=0}^{k-1}(1+H_{j})\geq\beta\left(k+\sum_{j=1}^{k}\ln j\right)\geq\beta\left(k+\int_{t=1}^{k}\ln t\,dt\right)=\beta(k\ln k+1).

Therefore, we have

𝔼𝒜([N],init)12klnk(2L2μ+2μ)=klnk(L2μ+μ).\mathbb{E}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\geq\frac{1}{2}k\ln k\left(\frac{2L^{2}}{\mu}+2\mu\right)=k\ln k\left(\frac{L^{2}}{\mu}+\mu\right).

Meanwhile, we have an upper bound for \mathcal{A}([N],\mathcal{M}_{\textup{OPT}}). We pick \mathcal{M}_{\xi}=\{[\xi^{(l)};-\mathbf{e}_{l}]\}_{l=1}^{k} as the centers. We have

𝒜([N],OPT)𝒜([N],ξ)=2kμ.\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})\leq\mathcal{A}([N],\mathcal{M}_{\xi})=2k\mu.

Thus,

𝔼𝒜([N],init)12lnkL2μ2𝒜([N],OPT).\mathbb{E}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\geq\frac{1}{2}\ln k\frac{L^{2}}{\mu^{2}}\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).

Next, we prove two error bounds for the case where the oracle f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*}) is inexact, considering additive and multiplicative errors, respectively.

In Algorithm 1, when computing the score v_{i}^{(j)}, we suppose that we do not have the exact f_{i}^{*}; instead, we have an estimate \tilde{f}_{i}^{*} such that

|f~ifi|ϵ|\tilde{f}_{i}^{*}-f_{i}^{*}|\leq\epsilon

for a certain error factor ϵ>0\epsilon>0. We define

𝒜~(,)=imax(min𝐳(fi(𝐳)f~i),0)=imin𝐳(max(fi(𝐳)f~i,0)).\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})=\sum_{i\in\mathcal{I}}\max\left(\min_{\mathbf{z}\in\mathcal{M}}\left(f_{i}(\mathbf{z})-\tilde{f}_{i}^{*}\right),0\right)=\sum_{i\in\mathcal{I}}\min_{\mathbf{z}\in\mathcal{M}}\left(\max\left(f_{i}(\mathbf{z})-\tilde{f}_{i}^{*},0\right)\right).
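For illustration, the noisy score \tilde{\mathcal{A}}(\{i\},\mathcal{M}) and the corresponding sampling step of Algorithm 1 can be computed as in the following sketch (Python/NumPy); the function names and the representation of the inexact oracle \tilde{f}_{i}^{*} as an array are our own placeholders.

import numpy as np

def noisy_score(i, M, f, f_tilde_star):
    """tilde A({i}, M) = max(min_{z in M} (f_i(z) - tilde f_i^*), 0)."""
    return max(min(f(i, z) for z in M) - f_tilde_star[i], 0.0)

def sample_next_index(M, f, f_tilde_star, N, rng):
    """Sample the next index with probability tilde A({i}, M) / tilde A([N], M),
    assuming the denominator is positive."""
    scores = np.array([noisy_score(i, M, f, f_tilde_star) for i in range(N)])
    return rng.choice(N, p=scores / scores.sum())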
Lemma C.12.

Let \mathcal{I} be an index set, and \mathcal{M} be a finite point set. Suppose that 𝒜~(,)>0\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})>0. We sample an index ii\in\mathcal{I} with probability 𝒜~({i},)𝒜~(,)\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M})}{\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})}, then we have the following inequality:

(C.8) 𝔼𝒜~(,{𝐱i})||(1+4Lμ)ϵ+4(L2μ2+Lμ)min𝐳i(fi(𝐳)fi(𝐱i)).\mathbb{E}\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})\leq|\mathcal{I}|\left(1+\frac{4L}{\mu}\right)\epsilon+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\min_{\mathbf{z}}\sum_{i\in\mathcal{I}}(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})).
Proof.

We have

𝒜~({i},)\displaystyle\tilde{\mathcal{A}}(\{i\},\mathcal{M}) =max(min𝐳(fi(𝐳)f~i),0)\displaystyle=\max\left(\min_{\mathbf{z}\in\mathcal{M}}(f_{i}(\mathbf{z})-\tilde{f}_{i}^{*}),0\right)
ϵ+min𝐳(fi(𝐳)fi)\displaystyle\leq\epsilon+\min_{\mathbf{z}\in\mathcal{M}}(f_{i}(\mathbf{z})-f_{i}^{*})
ϵ+L2min𝐳𝐳𝐱i2\displaystyle\leq\epsilon+\frac{L}{2}\min_{\mathbf{z}\in\mathcal{M}}\|\mathbf{z}-\mathbf{x}_{i}^{*}\|^{2}
ϵ+L𝐱i𝐱i2+Lmin𝐳𝐳𝐱i2\displaystyle\leq\epsilon+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+L\min_{\mathbf{z}\in\mathcal{M}}\|\mathbf{z}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}
ϵ+L𝐱i𝐱i2+2Lμmin𝐳(fi(𝐳)fi(𝐱i))\displaystyle\leq\epsilon+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{2L}{\mu}\min_{\mathbf{z}\in\mathcal{M}}(f_{i^{\prime}}(\mathbf{z})-f_{i^{\prime}}(\mathbf{x}_{i^{\prime}}^{*}))
(1+2Lμ)ϵ+L𝐱i𝐱i2+2Lμmin𝐳(fi(𝐳)f~i)\displaystyle\leq\left(1+\frac{2L}{\mu}\right)\epsilon+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{2L}{\mu}\min_{\mathbf{z}\in\mathcal{M}}(f_{i^{\prime}}(\mathbf{z})-\tilde{f}_{i^{\prime}}^{*})
(1+2Lμ)ϵ+L𝐱i𝐱i2+2Lμmax(min𝐳(fi(𝐳)f~i),0)\displaystyle\leq\left(1+\frac{2L}{\mu}\right)\epsilon+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{2L}{\mu}\max\left(\min_{\mathbf{z}\in\mathcal{M}}(f_{i^{\prime}}(\mathbf{z})-\tilde{f}_{i^{\prime}}^{*}),0\right)
=(1+2Lμ)ϵ+L𝐱i𝐱i2+2Lμ𝒜~({i},).\displaystyle=\left(1+\frac{2L}{\mu}\right)\epsilon+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{2L}{\mu}\tilde{\mathcal{A}}(\{i^{\prime}\},\mathcal{M}).

We have

𝔼𝒜~(,{𝐱i})\displaystyle\mathbb{E}\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})
=\displaystyle= i𝒜~({i},)𝒜~(,)𝒜~(,{𝐱i})\displaystyle\sum_{i\in\mathcal{I}}\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M})}{\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})}\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})
=\displaystyle= i𝒜~({i},)𝒜~(,)i′′𝒜~({i′′},{𝐱i})\displaystyle\sum_{i\in\mathcal{I}}\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M})}{\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})}\sum_{i^{\prime\prime}\in\mathcal{I}}\tilde{\mathcal{A}}(\{i^{\prime\prime}\},\mathcal{M}\cup\{\mathbf{x}_{i}^{*}\})
=\displaystyle= i𝒜~({i},)𝒜~(,)i′′min(𝒜~({i′′},),max(fi′′(𝐱i)f~i′′,0))\displaystyle\sum_{i\in\mathcal{I}}\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M})}{\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})}\sum_{i^{\prime\prime}\in\mathcal{I}}\min(\tilde{\mathcal{A}}(\{i^{\prime\prime}\},\mathcal{M}),\max(f_{i^{\prime\prime}}(\mathbf{x}_{i}^{*})-\tilde{f}_{i^{\prime\prime}}^{*},0))
\displaystyle\leq i1||i{(1+2Lμ)ϵ+L𝐱i𝐱i2+2Lμ𝒜~({i},)}𝒜~(,)\displaystyle\sum_{i\in\mathcal{I}}\frac{\frac{1}{|\mathcal{I}|}\sum_{i^{\prime}\in\mathcal{I}}\left\{\left(1+\frac{2L}{\mu}\right)\epsilon+L\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{2L}{\mu}\tilde{\mathcal{A}}(\{i^{\prime}\},\mathcal{M})\right\}}{\tilde{\mathcal{A}}(\mathcal{I},\mathcal{M})}
i′′min(𝒜~({i′′},),max(fi′′(𝐱i)f~i′′,0))\displaystyle\qquad\cdot\sum_{i^{\prime\prime}\in\mathcal{I}}\min(\tilde{\mathcal{A}}(\{i^{\prime\prime}\},\mathcal{M}),\max(f_{i^{\prime\prime}}(\mathbf{x}_{i}^{*})-\tilde{f}_{i^{\prime\prime}}^{*},0))
\displaystyle\leq ||(1+2Lμ)ϵ+L1||ii𝐱i𝐱i2+2Lμ1||ii′′max(fi′′(𝐱i)f~i′′,0)\displaystyle|\mathcal{I}|\left(1+\frac{2L}{\mu}\right)\epsilon+L\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}+\frac{2L}{\mu}\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime\prime}\in\mathcal{I}}\max(f_{i^{\prime\prime}}(\mathbf{x}_{i}^{*})-\tilde{f}_{i^{\prime\prime}}^{*},0)
\displaystyle\leq ||(1+4Lμ)ϵ+(L+L2μ)1||ii𝐱i𝐱i2\displaystyle|\mathcal{I}|\left(1+\frac{4L}{\mu}\right)\epsilon+\left(L+\frac{L^{2}}{\mu}\right)\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}}\sum_{i^{\prime}\in\mathcal{I}}\|\mathbf{x}_{i}^{*}-\mathbf{x}_{i^{\prime}}^{*}\|^{2}
=\displaystyle= ||(1+4Lμ)ϵ+2(L+L2μ)min𝐳i𝐱i𝐳2\displaystyle|\mathcal{I}|\left(1+\frac{4L}{\mu}\right)\epsilon+2\left(L+\frac{L^{2}}{\mu}\right)\min_{\mathbf{z}}\sum_{i\in\mathcal{I}}\|\mathbf{x}_{i}^{*}-\mathbf{z}\|^{2}
\displaystyle\leq ||(1+4Lμ)ϵ+4(L2μ2+Lμ)min𝐳i(fi(𝐳)fi(𝐱i)).\displaystyle|\mathcal{I}|\left(1+\frac{4L}{\mu}\right)\epsilon+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\min_{\mathbf{z}}\sum_{i\in\mathcal{I}}(f_{i}(\mathbf{z})-f_{i}(\mathbf{x}_{i}^{*})).

Lemma C.13.

Suppose that we have fixed indices i1,i2,,iji_{1},i_{2},\ldots,i_{j}. We define the finite set j={xi1,xi2,,xij}\mathcal{M}_{j}=\{x_{i_{1}}^{*},x_{i_{2}}^{*},\ldots,x_{i_{j}}^{*}\}. We define the index sets Lj={l:Al{i1,i2,,ij}},Ljc={l:Al{i1,i2,,ij}=},j=lLjAl,jc=lLjcAlL_{j}=\{l:A_{l}\cap\{i_{1},i_{2},\ldots,i_{j}\}\not=\emptyset\},L_{j}^{c}=\{l:A_{l}\cap\{i_{1},i_{2},\ldots,i_{j}\}=\emptyset\},\mathcal{I}_{j}=\cup_{l\in L_{j}}A_{l},\mathcal{I}_{j}^{c}=\cup_{l\in L_{j}^{c}}A_{l}. Let u=|Ljc|u=|L_{j}^{c}|. Suppose that u>0u>0. We sample tut\leq u new indices. We let j,s+={xi1,xi2,,xij,xij+1,,xij+s}\mathcal{M}_{j,s}^{+}=\{x_{i_{1}}^{*},x_{i_{2}}^{*},\ldots,x_{i_{j}}^{*},x_{i_{j+1}}^{*},\ldots,x_{i_{j+s}}^{*}\} for 0st0\leq s\leq t. In each round of sampling, the probability of ij+s,s>0i_{j+s},s>0, being sampled as ii is 𝒜~({i},j,s1+)𝒜~([N],j,s1+)\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M}_{j,s-1}^{+})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j,s-1}^{+})}. Then we have the following bound:

(C.9) 𝔼𝒜~([N],j,t+)(1+Ht)[𝒜~(j,j)+|jc|(1+4Lμ)ϵ+4(L2μ2+Lμ)𝒜(jc,OPT)]+utu𝒜~(jc,j).\begin{split}\mathbb{E}\tilde{\mathcal{A}}([N],\mathcal{M}_{j,t}^{+})&\leq(1+H_{t})\left[\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\left(1+\frac{4L}{\mu}\right)\epsilon+4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right]\\ &+\frac{u-t}{u}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).\end{split}
Proof.

The key idea of the proof is similar to that of Lemma C.9. We let

α=1+4Lμ,β=4(L2μ2+Lμ).\alpha=1+\frac{4L}{\mu},\quad\beta=4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right).

We prove by induction. When t=0t=0, the inequality obviously holds. When t>0,u=1t>0,u=1, we have the inequality:

𝔼𝒜~([N],j,t+)\displaystyle\mathbb{E}\tilde{\mathcal{A}}([N],\mathcal{M}_{j,t}^{+})
\displaystyle\leq 𝒜~(j,j)𝒜~([N],j)𝒜~([N],j)+𝒜~(jc,j)𝒜~([N],j)(𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT))\displaystyle\frac{\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j})}\tilde{\mathcal{A}}([N],\mathcal{M}_{j})+\frac{\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j})}\left(\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)
\displaystyle\leq 𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT)+𝒜~(jc,j).\displaystyle\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})+\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

For the general (t,u)(t,u) case, 𝔼𝒜~([N],j,t+)\mathbb{E}\tilde{\mathcal{A}}([N],\mathcal{M}_{j,t}^{+}) can be bounded by two parts. With probability 𝒜~(j,j)𝒜~([N],j)\frac{\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j})}, the first sampled index lies in j\mathcal{I}_{j}, and the conditional expectation is bounded by:

(1+Ht1)[𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT)]+ut+1u𝒜~(jc,j).(1+H_{t-1})\left[\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right]+\frac{u-t+1}{u}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

With probability 𝒜~(jc,j)𝒜~([N],j)\frac{\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j})}, the first sampled index lies in jc\mathcal{I}_{j}^{c}. The conditional expectation is bounded by:

lLc𝒜~(Al,j)𝒜~(jc,j)iAl𝒜~({i},j)𝒜~(Al,j){(1+Ht1)(𝒜~(jAl,j{𝐱i})+|jc\Al|αϵ\displaystyle\sum_{l\in L^{c}}\frac{\tilde{\mathcal{A}}(A_{l},\mathcal{M}_{j})}{\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}\sum_{i\in A_{l}}\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M}_{j})}{\tilde{\mathcal{A}}(A_{l},\mathcal{M}_{j})}\bigg{\{}(1+H_{t-1})\Big{(}\tilde{\mathcal{A}}(\mathcal{I}_{j}\cup A_{l},\mathcal{M}_{j}\cup\{\mathbf{x}_{i}^{*}\})+|\mathcal{I}_{j}^{c}\backslash A_{l}|\alpha\epsilon
+β𝒜(jc\Al,OPT))+utu1𝒜~(jc\Al,j{𝐱i})}\displaystyle\quad\quad+\beta\mathcal{A}(\mathcal{I}_{j}^{c}\backslash A_{l},\mathcal{M}_{\textup{OPT}})\Big{)}+\frac{u-t}{u-1}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c}\backslash A_{l},\mathcal{M}_{j}\cup\{\mathbf{x}_{i}^{*}\})\bigg{\}}
lLc𝒜~(Al,j)𝒜~(jc,j)iAl𝒜~({i},j)𝒜~(Al,j){(1+Ht1)(𝒜~(j,j)+𝒜~(Al,j{𝐱i})+|jc\Al|αϵ\displaystyle\leq\sum_{l\in L^{c}}\frac{\tilde{\mathcal{A}}(A_{l},\mathcal{M}_{j})}{\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}\sum_{i\in A_{l}}\frac{\tilde{\mathcal{A}}(\{i\},\mathcal{M}_{j})}{\tilde{\mathcal{A}}(A_{l},\mathcal{M}_{j})}\bigg{\{}(1+H_{t-1})\Big{(}\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+\tilde{\mathcal{A}}(A_{l},\mathcal{M}_{j}\cup\{\mathbf{x}_{i}^{*}\})+|\mathcal{I}_{j}^{c}\backslash A_{l}|\alpha\epsilon
+β𝒜(jc,OPT)β𝒜(Al,OPT))+utu1(𝒜~(jc,j)𝒜~(Al,j))}\displaystyle\quad\quad+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})-\beta\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\Big{)}+\frac{u-t}{u-1}(\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})-\tilde{\mathcal{A}}(A_{l},\mathcal{M}_{j}))\bigg{\}}
(1+Ht1)(𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT))+utu𝒜~(jc,j).\displaystyle\leq(1+H_{t-1})\left(\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)+\frac{u-t}{u}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

Overall, we have the following inequality:

𝔼𝒜~([N],j,t+)\displaystyle\mathbb{E}\tilde{\mathcal{A}}([N],\mathcal{M}_{j,t}^{+})
\displaystyle\leq 𝒜~(j,j)𝒜~([N],j){(1+Ht1)[𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT)]+ut+1u𝒜~(jc,j)}\displaystyle\frac{\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j})}\left\{(1+H_{t-1})\left[\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right]+\frac{u-t+1}{u}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})\right\}
+𝒜~(jc,j)𝒜~([N],j){(1+Ht1)(𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT))+utu𝒜~(jc,j)}\displaystyle\quad+\frac{\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})}{\tilde{\mathcal{A}}([N],\mathcal{M}_{j})}\left\{(1+H_{t-1})\left(\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)+\frac{u-t}{u}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j})\right\}
\displaystyle\leq (1+Ht)(𝒜~(j,j)+|jc|αϵ+β𝒜(jc,OPT))+utu𝒜~(jc,j).\displaystyle(1+H_{t})\left(\tilde{\mathcal{A}}(\mathcal{I}_{j},\mathcal{M}_{j})+|\mathcal{I}_{j}^{c}|\alpha\epsilon+\beta\mathcal{A}(\mathcal{I}_{j}^{c},\mathcal{M}_{\textup{OPT}})\right)+\frac{u-t}{u}\tilde{\mathcal{A}}(\mathcal{I}_{j}^{c},\mathcal{M}_{j}).

Theorem C.14 (Restatement of Theorem 4.3).

Suppose that the solution set SS^{*} is (k,2ϵμ)(k,\sqrt{\frac{2\epsilon}{\mu}})-separate. Let

init={𝐱i1,𝐱i2,,𝐱ik}\mathcal{M}_{\textup{init}}=\{\mathbf{x}_{i_{1}}^{*},\mathbf{x}_{i_{2}}^{*},\ldots,\mathbf{x}_{i_{k}}^{*}\}

be the initial points sampled by the random initialization Algorithm 1 with noisy oracles f~i\tilde{f}_{i}^{*}. We have the following bound:

1N𝔼𝒜([N],init)ϵ+(2+lnk)(1+4Lμ)ϵ+4(2+lnk)(L2μ2+Lμ)1N𝒜([N],OPT).\frac{1}{N}\mathbb{E}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\leq\epsilon+(2+\ln k)\left(1+\frac{4L}{\mu}\right)\epsilon+4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\frac{1}{N}\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).
Proof.

The proof is similar to that of Theorem C.10. We let

α=1+4Lμ,β=4(L2μ2+Lμ).\alpha=1+\frac{4L}{\mu},\quad\beta=4\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right).

We fix the first index i1i_{1}. Suppose that i1i_{1} lies in AlA_{l}; then we have

𝔼𝒜~([N],1,k1+)(𝒜~(Al,{𝐱i1})+|[N]\Al|αϵ+β𝒜([N]\Al,OPT))(1+Hk1).\displaystyle\mathbb{E}~{}\tilde{\mathcal{A}}([N],\mathcal{M}_{1,k-1}^{+})\leq\left(\tilde{\mathcal{A}}(A_{l},\{\mathbf{x}_{i_{1}}^{*}\})+|[N]\backslash A_{l}|\alpha\epsilon+\beta\mathcal{A}([N]\backslash A_{l},\mathcal{M}_{\textup{OPT}})\right)(1+H_{k-1}).

We have

𝔼𝒜~([N],init)\displaystyle\mathbb{E}~{}\tilde{\mathcal{A}}([N],\mathcal{M}_{\textup{init}}) (1Nl[k]iAl𝒜~(Al,{𝐱i})+Nαϵ1Nl[k]|Al|2αϵ\displaystyle\leq\left(\frac{1}{N}\sum_{l\in[k]}\sum_{i\in A_{l}}\tilde{\mathcal{A}}(A_{l},\{\mathbf{x}_{i}^{*}\})+N\alpha\epsilon-\frac{1}{N}\sum_{l\in[k]}|A_{l}|^{2}\alpha\epsilon\right.
+β(𝒜([N],OPT)1Nl[k]|Al|𝒜(Al,OPT)))(1+Hk1)\displaystyle\quad\quad\left.+\beta\left(\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})-\frac{1}{N}\sum_{l\in[k]}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)\right)(1+H_{k-1})
(a)(1Nl[k](|Al|2ϵ+L2|Al|ΔAl)+Nαϵ1Nl[k]|Al|2αϵ\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\left(\frac{1}{N}\sum_{l\in[k]}\left(|A_{l}|^{2}\epsilon+\frac{L}{2}|A_{l}|\Delta_{A_{l}}\right)+N\alpha\epsilon-\frac{1}{N}\sum_{l\in[k]}|A_{l}|^{2}\alpha\epsilon\right.
+β(𝒜([N],OPT)1Nl[k]|Al|𝒜(Al,OPT)))(1+Hk1)\displaystyle\quad\quad\left.+\beta\left(\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})-\frac{1}{N}\sum_{l\in[k]}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)\right)(1+H_{k-1})
(b)(1Nl[k](|Al|2ϵ+2Lμ|Al|𝒜(Al,OPT))+Nαϵ1Nl[k]|Al|2αϵ\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\left(\frac{1}{N}\sum_{l\in[k]}\left(|A_{l}|^{2}\epsilon+\frac{2L}{\mu}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)+N\alpha\epsilon-\frac{1}{N}\sum_{l\in[k]}|A_{l}|^{2}\alpha\epsilon\right.
+β(𝒜([N],OPT)1Nl[k]|Al|𝒜(Al,OPT)))(1+Hk1)\displaystyle\quad\quad\left.+\beta\left(\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})-\frac{1}{N}\sum_{l\in[k]}|A_{l}|\mathcal{A}(A_{l},\mathcal{M}_{\textup{OPT}})\right)\right)(1+H_{k-1})
(Nαϵ+β𝒜([N],OPT))(1+Hk1)\displaystyle\leq\left(N\alpha\epsilon+\beta\mathcal{A}([N],\mathcal{M}_{\textup{OPT}})\right)(1+H_{k-1})
(2+lnk)(1+4Lμ)Nϵ+4(2+lnk)(L2μ2+Lμ)𝒜([N],OPT).\displaystyle\leq(2+\ln k)\left(1+\frac{4L}{\mu}\right)N\epsilon+4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).

Here, (a) follows from Lemma C.3, and (b) follows from

L2ΔAl=Lmin𝐳iAl𝐱i𝐳22Lμmin𝐳iAl(fi(𝐳)fi)=2Lμ𝒜(Al,OPT(D)).\frac{L}{2}\Delta_{A_{l}}=L\min_{\mathbf{z}}\sum_{i\in A_{l}}\|\mathbf{x}_{i}^{*}-\mathbf{z}\|^{2}\leq\frac{2L}{\mu}\min_{\mathbf{z}}\sum_{i\in A_{l}}(f_{i}(\mathbf{z})-f_{i}^{*})=\frac{2L}{\mu}\mathcal{A}(A_{l},\mathcal{M}^{(D)}_{\textup{OPT}}).

Therefore, we have

𝔼𝒜([N],init)Nϵ+(2+lnk)(1+4Lμ)Nϵ+4(2+lnk)(L2μ2+Lμ)𝒜([N],OPT),\displaystyle\mathbb{E}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\leq N\epsilon+(2+\ln k)\left(1+\frac{4L}{\mu}\right)N\epsilon+4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}),
1N𝔼𝒜([N],init)ϵ+(2+lnk)(1+4Lμ)ϵ+4(2+lnk)(L2μ2+Lμ)1N𝒜([N],OPT).\displaystyle\frac{1}{N}\mathbb{E}\mathcal{A}([N],\mathcal{M}_{\textup{init}})\leq\epsilon+(2+\ln k)\left(1+\frac{4L}{\mu}\right)\epsilon+4(2+\ln k)\left(\frac{L^{2}}{\mu^{2}}+\frac{L}{\mu}\right)\frac{1}{N}\mathcal{A}([N],\mathcal{M}_{\textup{OPT}}).

The proof of Theorem 4.4 is similar to that of Theorem 4.3; we omit the details here.

Appendix D Convergence of Lloyd’s algorithm

In this section, we provide a convergence analysis for Algorithms 2 and 3.

Theorem D.1 (Restatement of Theorem 4.6).

In Algorithm 2, we take the step size γ=1L\gamma=\frac{1}{L}. If every fif_{i} is LL-smooth, then the following convergence result holds:

1T+1t=0Tj=1k|Cj(t)|NFj(t)(𝐱j(t))22LT+1(F(𝐱1(0),𝐱2(0),,𝐱k(0))F).\frac{1}{T+1}\sum_{t=0}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\leq\frac{2L}{T+1}\left(F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{\star}\right).

Here, FF^{\star} is the minimum of FF.

Proof.

According to the LL-smoothness assumption on fif_{i}, Fj(t)F_{j}^{(t)} is also LL-smooth, which implies that

Fj(t)(𝐱j(t+1))\displaystyle F_{j}^{(t)}(\mathbf{x}_{j}^{(t+1)}) Fj(t)(𝐱j(t))+Fj(t)(𝐱j(t)),𝐱j(t+1)𝐱j(t)+L2𝐱j(t+1)𝐱j(t)2\displaystyle\leq F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})+\langle\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)}),\mathbf{x}_{j}^{(t+1)}-\mathbf{x}_{j}^{(t)}\rangle+\frac{L}{2}\|\mathbf{x}_{j}^{(t+1)}-\mathbf{x}_{j}^{(t)}\|^{2}
=Fj(t)(𝐱j(t))12LFj(t)(𝐱j(t))2,\displaystyle=F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})-\frac{1}{2L}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2},
12LFj(t)(𝐱j(t))2\displaystyle\frac{1}{2L}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2} Fj(t)(𝐱j(t))Fj(t)(𝐱j(t+1)),\displaystyle\leq F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})-F_{j}^{(t)}(\mathbf{x}_{j}^{(t+1)}),
Fj(t)(𝐱j(t))2\displaystyle\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2} 2L(Fj(t)(𝐱j(t))Fj(t)(𝐱j(t+1))).\displaystyle\leq 2L\left(F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})-F_{j}^{(t)}(\mathbf{x}_{j}^{(t+1)})\right).

Averaging the bounds on Fj(t)(𝐱j(t))2\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2} with weights |Cj(t)|/N|C_{j}^{(t)}|/N, and noting that \sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})=F(\mathbf{x}_{1}^{(t)},\ldots,\mathbf{x}_{k}^{(t)}) while \sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}F_{j}^{(t)}(\mathbf{x}_{j}^{(t+1)})\geq F(\mathbf{x}_{1}^{(t+1)},\ldots,\mathbf{x}_{k}^{(t+1)}) (reassigning each index to its best parameter can only decrease the objective), we have

j=1k|Cj(t)|NFj(t)(𝐱j(t))2\displaystyle\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2} 2L(F(𝐱1(t),𝐱2(t),,𝐱k(t))F(𝐱1(t+1),𝐱2(t+1),,𝐱k(t+1))).\displaystyle\leq 2L\left(F(\mathbf{x}_{1}^{(t)},\mathbf{x}_{2}^{(t)},\ldots,\mathbf{x}_{k}^{(t)})-F(\mathbf{x}_{1}^{(t+1)},\mathbf{x}_{2}^{(t+1)},\ldots,\mathbf{x}_{k}^{(t+1)})\right).

Averaging over tt from 0 to TT, we have

1T+1t=0Tj=1k|Cj(t)|NFj(t)(𝐱j(t))22LT+1(F(𝐱1(0),𝐱2(0),,𝐱k(0))F).\frac{1}{T+1}\sum_{t=0}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\leq\frac{2L}{T+1}\left(F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{\star}\right).
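
For concreteness, the update analyzed above can be sketched in code. This is a minimal sketch, assuming Algorithm 2 reassigns each index via (3.6) and then performs one gradient step with step size γ=1/L\gamma=1/L on each nonempty cluster objective Fj(t)F_{j}^{(t)}; the function handles fs and grads (per-function values and gradients) are illustrative and not part of the original pseudo-code.

```python
import numpy as np

def lloyd_gradient_iteration(xs, fs, grads, L):
    """One iteration: reassign each index to its best parameter, then take a
    single gradient step of size 1/L on every nonempty cluster objective."""
    k, N = len(xs), len(fs)
    # Reclassification via (3.6): index i joins the cluster minimizing f_i(x_j).
    labels = [min(range(k), key=lambda j: fs[i](xs[j])) for i in range(N)]
    new_xs = []
    for j in range(k):
        C_j = [i for i in range(N) if labels[i] == j]
        if not C_j:
            new_xs.append(xs[j])  # empty clusters keep their parameter
            continue
        grad_Fj = np.mean([grads[i](xs[j]) for i in C_j], axis=0)
        new_xs.append(xs[j] - grad_Fj / L)  # gradient step with gamma = 1/L
    return new_xs, labels
```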

Next, we present a convergence theorem for the momentum algorithm. For simplicity, we use the notation

𝐔(t)=(𝐮1(t),𝐮2(t),,𝐮k(t)).\mathbf{U}^{(t)}=(\mathbf{u}_{1}^{(t)},\mathbf{u}_{2}^{(t)},\ldots,\mathbf{u}_{k}^{(t)}).

We have the following convergence theorem:

Theorem D.2 (Restatement of Theorem 4.7).

Consider Algorithm 3. Suppose that Assumption 2.1 holds, α>1\alpha>1, and

γmin(1β2L,(1β)32(1αβ)122α12Lβ).\gamma\leq\min\left(\frac{1-\beta}{2L},\frac{(1-\beta)^{\frac{3}{2}}(1-\alpha\beta)^{\frac{1}{2}}}{2\alpha^{\frac{1}{2}}L\beta}\right).

Then it holds that

1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))22(1β)γF(𝐱1(0),𝐱2(0),,𝐱k(0))FT.\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\leq\frac{2(1-\beta)}{\gamma}\cdot\frac{F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{*}}{T}.
Proof.

The variable 𝐮j(t)\mathbf{u}_{j}^{(t)} satisfies the following property,

𝐮j(t+1)𝐮j(t)\displaystyle\mathbf{u}_{j}^{(t+1)}-\mathbf{u}_{j}^{(t)} =11β((𝐱j(t+1)𝐱j(t))β(𝐱j(t)𝐱j(t1)))\displaystyle=\frac{1}{1-\beta}\left((\mathbf{x}_{j}^{(t+1)}-\mathbf{x}_{j}^{(t)})-\beta(\mathbf{x}_{j}^{(t)}-\mathbf{x}_{j}^{(t-1)})\right)
=γ1β(𝐦j(t)β𝐦j(t1))\displaystyle=\frac{-\gamma}{1-\beta}\left(\mathbf{m}_{j}^{(t)}-\beta\mathbf{m}_{j}^{(t-1)}\right)
=γ1βFj(t)(𝐱j(t)).\displaystyle=\frac{-\gamma}{1-\beta}\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)}).

We have the following inequality:

Fj(t)(𝐮j(t+1))\displaystyle F_{j}^{(t)}(\mathbf{u}_{j}^{(t+1)})
\displaystyle\leq Fj(t)(𝐮j(t))+Fj(t)(𝐮j(t)),𝐮j(t+1)𝐮j(t)+L2𝐮j(t+1)𝐮j(t)2\displaystyle F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})+\langle\nabla F_{j}^{(t)}(\mathbf{u}_{j}^{(t)}),\mathbf{u}_{j}^{(t+1)}-\mathbf{u}_{j}^{(t)}\rangle+\frac{L}{2}\|\mathbf{u}_{j}^{(t+1)}-\mathbf{u}_{j}^{(t)}\|^{2}
=\displaystyle= Fj(t)(𝐮j(t))γ1βFj(t)(𝐮j(t)),Fj(t)(𝐱j(t))+L2γ2(1β)2Fj(t)(𝐱j(t))2\displaystyle F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})-\frac{\gamma}{1-\beta}\langle\nabla F_{j}^{(t)}(\mathbf{u}_{j}^{(t)}),\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\rangle+\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}
=\displaystyle= Fj(t)(𝐮j(t))γ1βFj(t)(𝐮j(t))Fj(t)(𝐱j(t)),Fj(t)(𝐱j(t))\displaystyle F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})-\frac{\gamma}{1-\beta}\langle\nabla F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})-\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)}),\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\rangle
+(L2γ2(1β)2γ1β)Fj(t)(𝐱j(t))2\displaystyle\quad\quad+\left(\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{\gamma}{1-\beta}\right)\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}
\displaystyle\leq Fj(t)(𝐮j(t))+(L2γ2(1β)2γ1β)Fj(t)(𝐱j(t))2\displaystyle F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})+\left(\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{\gamma}{1-\beta}\right)\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}
+γ1βϵ2Fj(t)(𝐮j(t))Fj(t)(𝐱j(t))2+γ1β12ϵFj(t)(𝐱j(t))2\displaystyle\quad\quad+\frac{\gamma}{1-\beta}\frac{\epsilon}{2}\|\nabla F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})-\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}+\frac{\gamma}{1-\beta}\frac{1}{2\epsilon}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}
\displaystyle\leq Fj(t)(𝐮j(t))+(L2γ2(1β)2γ1β+12ϵγ1β)Fj(t)(𝐱j(t))2+ϵ2L2β2γ3(1β)3𝐦j(t1)2.\displaystyle F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})+\left(\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{\gamma}{1-\beta}+\frac{1}{2\epsilon}\frac{\gamma}{1-\beta}\right)\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}+\frac{\epsilon}{2}\frac{L^{2}\beta^{2}\gamma^{3}}{(1-\beta)^{3}}\|\mathbf{m}_{j}^{(t-1)}\|^{2}.

Rearranging the inequality, we have

(γ1βL2γ2(1β)212ϵγ1β)Fj(t)(𝐱j(t))2Fj(t)(𝐮j(t))Fj(t)(𝐮j(t+1))+ϵ2L2β2γ3(1β)3𝐦j(t1)2.\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{1}{2\epsilon}\frac{\gamma}{1-\beta}\right)\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq F_{j}^{(t)}(\mathbf{u}_{j}^{(t)})-F_{j}^{(t)}(\mathbf{u}_{j}^{(t+1)})+\frac{\epsilon}{2}\frac{L^{2}\beta^{2}\gamma^{3}}{(1-\beta)^{3}}\|\mathbf{m}_{j}^{(t-1)}\|^{2}.

We sum over j=1,2,,kj=1,2,\ldots,k with weights \frac{|C_{j}^{(t)}|}{N} and obtain

(γ1βL2γ2(1β)212ϵγ1β)j=1k|Cj(t)|NFj(t)(𝐱j(t))2F(𝐔(t))j=1k|Cj(t)|NFj(t)(𝐮j(t+1))+ϵ2L2β2γ3(1β)3j=1k|Cj(t)|N𝐦j(t1)2.\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{1}{2\epsilon}\frac{\gamma}{1-\beta}\right)\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq F(\mathbf{U}^{(t)})-\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}F_{j}^{(t)}(\mathbf{u}_{j}^{(t+1)})+\frac{\epsilon}{2}\frac{L^{2}\beta^{2}\gamma^{3}}{(1-\beta)^{3}}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\mathbf{m}_{j}^{(t-1)}\|^{2}.

Since

j=1k|Cj(t)|NFj(t)(𝐮j(t+1))=1Nj=1kiCj(t)fi(𝐮j(t+1))F(𝐔(t+1)),\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}F_{j}^{(t)}(\mathbf{u}_{j}^{(t+1)})=\frac{1}{N}\sum_{j=1}^{k}\sum_{i\in C_{j}^{(t)}}f_{i}(\mathbf{u}_{j}^{(t+1)})\geq F(\mathbf{U}^{(t+1)}),

we have

(γ1βL2γ2(1β)212ϵγ1β)j=1k|Cj(t)|NFj(t)(𝐱j(t))2F(𝐔(t))F(𝐔(t+1))+ϵ2αL2β2γ3(1β)3j=1k|Cj(t1)|N𝐦j(t1)2.\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{1}{2\epsilon}\frac{\gamma}{1-\beta}\right)\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq F(\mathbf{U}^{(t)})-F(\mathbf{U}^{(t+1)})+\frac{\epsilon}{2}\frac{\alpha L^{2}\beta^{2}\gamma^{3}}{(1-\beta)^{3}}\sum_{j=1}^{k}\frac{|C_{j}^{(t-1)}|}{N}\|\mathbf{m}_{j}^{(t-1)}\|^{2}.

Summing both sides from t=1t=1 to TT, then dividing both sides by TT, we have

(D.1) (γ1βL2γ2(1β)212ϵγ1β)1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))2F(𝐔(1))F(𝐔(T+1))T+ϵ2αL2β2γ3(1β)31Tt=1Tj=1k|Cj(t1)|N𝐦j(t1)2.\begin{split}\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{1}{2\epsilon}\frac{\gamma}{1-\beta}\right)\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq\frac{F(\mathbf{U}^{(1)})-F(\mathbf{U}^{(T+1)})}{T}+\frac{\epsilon}{2}\frac{\alpha L^{2}\beta^{2}\gamma^{3}}{(1-\beta)^{3}}\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t-1)}|}{N}\|\mathbf{m}_{j}^{(t-1)}\|^{2}.\end{split}

Now, we consider the average term 1Tt=1T|Cj(t)|N𝐦j(t)2\frac{1}{T}\sum_{t=1}^{T}\frac{|C_{j}^{(t)}|}{N}\|\mathbf{m}_{j}^{(t)}\|^{2}. For 𝐦j(t)\mathbf{m}_{j}^{(t)}, we have

𝐦j(t)\displaystyle\mathbf{m}_{j}^{(t)} =β𝐦j(t1)+Fj(t)(𝐱j(t))\displaystyle=\beta\mathbf{m}_{j}^{(t-1)}+\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})
=βt𝐦j(0)+l=0t1βlFj(tl)(𝐱j(tl))\displaystyle=\beta^{t}\mathbf{m}_{j}^{(0)}+\sum_{l=0}^{t-1}\beta^{l}\nabla F_{j}^{(t-l)}(\mathbf{x}_{j}^{(t-l)})
=l=1tβtlFj(l)(𝐱j(l)).\displaystyle=\sum_{l=1}^{t}\beta^{t-l}\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)}).

We have the following bound on the squared norm of 𝐦j(t)\mathbf{m}_{j}^{(t)}:

𝐦j(t)2\displaystyle\|\mathbf{m}_{j}^{(t)}\|^{2} =l=1tβtlFj(l)(𝐱j(l))2\displaystyle=\left\|\sum_{l=1}^{t}\beta^{t-l}\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}
=(s=1tβts)2l=1tβtls=1tβtsFj(l)(𝐱j(l))2\displaystyle=\left(\sum_{s=1}^{t}\beta^{t-s}\right)^{2}\left\|\sum_{l=1}^{t}\frac{\beta^{t-l}}{\sum_{s=1}^{t}\beta^{t-s}}\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}
(a)(s=1tβts)2l=1tβtls=1tβtsFj(l)(𝐱j(l))2\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\left(\sum_{s=1}^{t}\beta^{t-s}\right)^{2}\sum_{l=1}^{t}\frac{\beta^{t-l}}{\sum_{s=1}^{t}\beta^{t-s}}\left\|\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}
11βl=1tβtlFj(l)(𝐱j(l))2.\displaystyle\leq\frac{1}{1-\beta}\sum_{l=1}^{t}\beta^{t-l}\left\|\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}.

Here, (a) follows from Jensen’s inequality. Multiplying the above inequality by \frac{|C_{j}^{(t)}|}{N} and averaging over t=1,2,,Tt=1,2,\ldots,T, we obtain

1Tt=1T|Cj(t)|N𝐦j(t)2\displaystyle\frac{1}{T}\sum_{t=1}^{T}\frac{|C_{j}^{(t)}|}{N}\|\mathbf{m}_{j}^{(t)}\|^{2} 11β1N1Tt=1Tl=1t|Cj(t)|βtlFj(l)(𝐱j(l))2\displaystyle\leq\frac{1}{1-\beta}\frac{1}{N}\frac{1}{T}\sum_{t=1}^{T}\sum_{l=1}^{t}|C_{j}^{(t)}|\beta^{t-l}\left\|\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}
1T1N11βl=1Tt=lT|Cj(t)|βtlFj(l)(𝐱j(l))2\displaystyle\leq\frac{1}{T}\frac{1}{N}\frac{1}{1-\beta}\sum_{l=1}^{T}\sum_{t=l}^{T}|C_{j}^{(t)}|\beta^{t-l}\left\|\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}
1T1N11βl=1Tt=lT|Cj(l)|αtlβtlFj(l)(𝐱j(l))2\displaystyle\leq\frac{1}{T}\frac{1}{N}\frac{1}{1-\beta}\sum_{l=1}^{T}\sum_{t=l}^{T}|C_{j}^{(l)}|\alpha^{t-l}\beta^{t-l}\left\|\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}
1T11β11αβl=1T|Cj(l)|NFj(l)(𝐱j(l))2\displaystyle\leq\frac{1}{T}\frac{1}{1-\beta}\frac{1}{1-\alpha\beta}\sum_{l=1}^{T}\frac{|C_{j}^{(l)}|}{N}\left\|\nabla F_{j}^{(l)}(\mathbf{x}_{j}^{(l)})\right\|^{2}

Substituting the above inequality back into (D.1), we obtain

(γ1βL2γ2(1β)212ϵγ1βϵ2αL2β2γ3(1β)4(1αβ))1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))2F(𝐔(1))F(𝐔(T+1))T.\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{1}{2\epsilon}\frac{\gamma}{1-\beta}-\frac{\epsilon}{2}\frac{\alpha L^{2}\beta^{2}\gamma^{3}}{(1-\beta)^{4}(1-\alpha\beta)}\right)\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq\frac{F(\mathbf{U}^{(1)})-F(\mathbf{U}^{(T+1)})}{T}.

We choose

ϵ=(1β)32(1αβ)12γβLα12\epsilon=\frac{(1-\beta)^{\frac{3}{2}}(1-\alpha\beta)^{\frac{1}{2}}}{\gamma\beta L\alpha^{\frac{1}{2}}}

and rearrange the above inequality. Thus, we have

(γ1βL2γ2(1β)2α12Lβγ22(1β)52(1αβ)12)1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))2F(𝐔(1))F(𝐔(T+1))T.\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{\alpha^{\frac{1}{2}}L\beta\gamma^{2}}{2(1-\beta)^{\frac{5}{2}}(1-\alpha\beta)^{\frac{1}{2}}}\right)\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq\frac{F(\mathbf{U}^{(1)})-F(\mathbf{U}^{(T+1)})}{T}.

Since we initialize 𝐦j(0)=𝟎\mathbf{m}_{j}^{(0)}=\mathbf{0}, we have

𝐱j(1)\displaystyle\mathbf{x}_{j}^{(1)} =𝐱j(0)γ𝐦j(0)=𝐱j(0),\displaystyle=\mathbf{x}_{j}^{(0)}-\gamma\mathbf{m}_{j}^{(0)}=\mathbf{x}_{j}^{(0)},
𝐮j(1)\displaystyle\mathbf{u}_{j}^{(1)} =𝐱j(0).\displaystyle=\mathbf{x}_{j}^{(0)}.

Therefore, F(\mathbf{U}^{(1)})=F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)}). Since, in addition, F(𝐔(T+1))FF(\mathbf{U}^{(T+1)})\geq F^{*}, we have

(γ1βL2γ2(1β)2α12Lβγ22(1β)52(1αβ)12)1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))2F(𝐱1(0),𝐱2(0),,𝐱k(0))FT.\left(\frac{\gamma}{1-\beta}-\frac{L}{2}\frac{\gamma^{2}}{(1-\beta)^{2}}-\frac{\alpha^{\frac{1}{2}}L\beta\gamma^{2}}{2(1-\beta)^{\frac{5}{2}}(1-\alpha\beta)^{\frac{1}{2}}}\right)\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\\ \leq\frac{F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{*}}{T}.

When

γmin(1β2L,(1β)32(1αβ)122α12Lβ),\gamma\leq\min\left(\frac{1-\beta}{2L},\frac{(1-\beta)^{\frac{3}{2}}(1-\alpha\beta)^{\frac{1}{2}}}{2\alpha^{\frac{1}{2}}L\beta}\right),

we have

1Tt=1Tj=1k|Cj(t)|NFj(t)(𝐱j(t))22(1β)γF(𝐱1(0),𝐱2(0),,𝐱k(0))FT.\frac{1}{T}\sum_{t=1}^{T}\sum_{j=1}^{k}\frac{|C_{j}^{(t)}|}{N}\|\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)})\|^{2}\leq\frac{2(1-\beta)}{\gamma}\cdot\frac{F(\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\ldots,\mathbf{x}_{k}^{(0)})-F^{*}}{T}.
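
For concreteness, one iteration of the momentum variant analyzed above can be sketched as follows. This is a minimal sketch assuming the updates \mathbf{m}_{j}^{(t)}=\beta\mathbf{m}_{j}^{(t-1)}+\nabla F_{j}^{(t)}(\mathbf{x}_{j}^{(t)}) and \mathbf{x}_{j}^{(t+1)}=\mathbf{x}_{j}^{(t)}-\gamma\mathbf{m}_{j}^{(t)}, which are consistent with the identities used in the proof; the handling of empty clusters and the function handles are illustrative.

```python
import numpy as np

def lloyd_momentum_iteration(xs, ms, fs, grads, gamma, beta):
    """One heavy-ball iteration: reassign clusters, update the momentum buffer
    m_j <- beta * m_j + grad F_j(x_j), then set x_j <- x_j - gamma * m_j."""
    k, N = len(xs), len(fs)
    labels = [min(range(k), key=lambda j: fs[i](xs[j])) for i in range(N)]
    new_xs, new_ms = [], []
    for j in range(k):
        C_j = [i for i in range(N) if labels[i] == j]
        if not C_j:
            new_xs.append(xs[j])
            new_ms.append(ms[j])
            continue
        grad_Fj = np.mean([grads[i](xs[j]) for i in C_j], axis=0)
        m_j = beta * ms[j] + grad_Fj
        new_ms.append(m_j)
        new_xs.append(xs[j] - gamma * m_j)
    return new_xs, new_ms, labels
```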

Appendix E Supplementary experiment details

In this section, we provide details on the experiments described in Section 5.

E.1. Supplementary details for Section 5.1

We elaborate on the generation of the synthetic data for the GPCA experiment in Section 5.1.

  • First, we generate kk pairs of orthonormal vectors {ϵ1,j,ϵ2,j}\{\boldsymbol{\epsilon}_{1,j},\boldsymbol{\epsilon}_{2,j}\}, j=1,2,,kj=1,2,\ldots,k, each pair drawn uniformly at random, with ϵ1,j\boldsymbol{\epsilon}_{1,j} and ϵ2,j\boldsymbol{\epsilon}_{2,j} forming an orthonormal basis of the jj-th subspace (a code sketch of the full generation procedure follows this list).

  • For each data point i[N]i\in[N], we independently generate two Gaussian samples ξ1,i,ξ2,i\xi_{1,i},\xi_{2,i}. Next, we sample an index ji[k]j_{i}\in[k] uniformly at random. We then let 𝐱i=ξ1,iϵ1,ji+ξ2,iϵ2,ji\mathbf{x}_{i}=\xi_{1,i}\boldsymbol{\epsilon}_{1,j_{i}}+\xi_{2,i}\boldsymbol{\epsilon}_{2,j_{i}}.
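
The two steps above can be sketched in code as follows. This is a minimal sketch: the ambient dimension d, the random number generator, and the QR-based orthonormalization of two Gaussian vectors (which yields a uniformly random orthonormal pair) are our own illustrative choices.

```python
import numpy as np

def generate_gpca_data(N, k, d, rng=np.random.default_rng(0)):
    """Synthetic GPCA data: k random two-dimensional subspaces of R^d,
    each point drawn from one subspace chosen uniformly at random."""
    bases = []
    for _ in range(k):
        # Orthonormalize two Gaussian vectors to get eps_{1,j}, eps_{2,j}.
        Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
        bases.append(Q)  # columns span the j-th subspace
    labels = rng.integers(k, size=N)
    xi = rng.standard_normal((N, 2))  # xi_{1,i}, xi_{2,i}
    data = np.stack([bases[labels[i]] @ xi[i] for i in range(N)])
    return data, labels, bases
```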

We provide in Algorithm 5 detailed pseudo-code of Lloyd’s algorithm for solving the GPCA problem in the sum-of-minimum formulation (1.5). Each iteration consists of two steps: updating the clusters via (3.6) and exactly computing the minimizer of each group objective function

\displaystyle\min_{\mathbf{A}_{j}^{\top}\mathbf{A}_{j}=I_{r}}\frac{1}{|C_{j}^{(t)}|}\sum_{i\in C_{j}^{(t)}}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\|^{2} =\frac{1}{|C_{j}^{(t)}|}\sum_{i\in C_{j}^{(t)}}\textup{tr}\left(\mathbf{A}_{j}^{\top}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\right)
\displaystyle=\textup{tr}\left(\mathbf{A}_{j}^{\top}\left(\frac{1}{|C_{j}^{(t)}|}\sum_{i\in C_{j}^{(t)}}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}\right)\mathbf{A}_{j}\right).
Algorithm 5 Lloyd’s Algorithm for generalized principal component analysis
1:Initialize 𝐀1(0),𝐀2(0),,𝐀k(0)\mathbf{A}_{1}^{(0)},\mathbf{A}_{2}^{(0)},\dots,\mathbf{A}_{k}^{(0)}. Set F(1)=+F^{(-1)}=+\infty.
2:for t=0,1,2,,t=0,1,2,\ldots, max iterations do
3:     Compute F(t)=F(𝐀1(t),𝐀2(t),,𝐀k(t))F^{(t)}=F(\mathbf{A}_{1}^{(t)},\mathbf{A}_{2}^{(t)},\ldots,\mathbf{A}_{k}^{(t)}).
4:     if F(t)=F(t1)F^{(t)}=F^{(t-1)} then
5:         Break.
6:     end if
7:     Compute the partition {Cj(t)}j=1k\{C_{j}^{(t)}\}_{j=1}^{k} via (3.6).
8:     for j=1,2,,kj=1,2,\ldots,k do
9:         if Cj(t)C_{j}^{(t)}\not=\emptyset then
10:              Compute the matrix
1|Cj(t)|iCj(t)𝒚i𝒚i\frac{1}{|C_{j}^{(t)}|}\sum_{i\in C_{j}^{(t)}}\boldsymbol{y}_{i}\boldsymbol{y}_{i}^{\top}
and its rr orthonormal eigenvectors 𝐯1,𝐯2,,𝐯r\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{r} corresponding to the smallest rr eigenvalues.
11:              Set
𝐀j(t+1)=[𝐯1𝐯2𝐯r].\mathbf{A}_{j}^{(t+1)}=\begin{bmatrix}\mathbf{v}_{1}&\mathbf{v}_{2}&\ldots&\mathbf{v}_{r}\end{bmatrix}.
12:         else
13:              𝐀j(t+1)=𝐀j(t)\mathbf{A}_{j}^{(t+1)}=\mathbf{A}_{j}^{(t)}.
14:         end if
15:     end for
16:end for
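
For reference, lines 10–11 of Algorithm 5 can be implemented with a symmetric eigen-decomposition. The sketch below assumes the data points of the cluster are stacked row-wise in the array Y_cluster; the function name is ours.

```python
import numpy as np

def update_subspace(Y_cluster, r):
    """Exact minimizer of the group objective: the r orthonormal eigenvectors
    of (1/|C_j|) * sum_i y_i y_i^T associated with the smallest r eigenvalues."""
    M = Y_cluster.T @ Y_cluster / len(Y_cluster)  # (d, d) second-moment matrix
    eigvals, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return eigvecs[:, :r]                         # A_j^{(t+1)} = [v_1, ..., v_r]
```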

We implement the BCD algorithm [peng2023block] for the following optimization problem:

(E.1) min𝐀j𝐀j=Ir1Ni=1Nj[k]𝐲i𝐀j2.\min_{\mathbf{A}_{j}^{\top}\mathbf{A}_{j}=I_{r}}\frac{1}{N}\sum_{i=1}^{N}\prod_{j\in[k]}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\|^{2}.

For any j[k]j\in[k], when 𝐀l\mathbf{A}_{l} is fixed for all l[k]\{j}l\in[k]\backslash\{j\}, the problem in (E.1) is equivalent to:

min𝐀j𝐀j=Ir1Ni=1Nwij𝐲i𝐀j2=1Ni=1Nwijtr(𝐀j𝐲i𝐲i𝐀j)=tr(𝐀j(1Ni=1Nwij𝐲i𝐲i)𝐀j),\min_{\mathbf{A}_{j}^{\top}\mathbf{A}_{j}=I_{r}}\frac{1}{N}\sum_{i=1}^{N}w_{ij}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\|^{2}=\frac{1}{N}\sum_{i=1}^{N}w_{ij}\textup{tr}\left(\mathbf{A}_{j}^{\top}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}\mathbf{A}_{j}\right)=\textup{tr}\left(\mathbf{A}_{j}^{\top}\left(\frac{1}{N}\sum_{i=1}^{N}w_{ij}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}\right)\mathbf{A}_{j}\right),

where the weights wijw_{ij} are given by:

wij=lj𝐲i𝐀l2.w_{ij}=\prod_{l\neq j}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{l}\|^{2}.

The detailed pseudo-code can be found in Algorithm 6.

Algorithm 6 Block coordinate descent for generalized principal component analysis [peng2023block]
1:Initialize 𝐀1(0),𝐀2(0),,𝐀k(0)\mathbf{A}_{1}^{(0)},\mathbf{A}_{2}^{(0)},\dots,\mathbf{A}_{k}^{(0)}.
2:for t=0,1,2,,t=0,1,2,\ldots, max iterations do
3:     for j=1,2,,kj=1,2,\ldots,k do
4:         Compute the weights:
wij(t)=l<j𝐲i𝐀l(t+1)2l>j𝐲i𝐀l(t)2.w_{ij}^{(t)}=\prod_{l<j}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{l}^{(t+1)}\|^{2}\cdot\prod_{l>j}\|\mathbf{y}_{i}^{\top}\mathbf{A}_{l}^{(t)}\|^{2}.
5:         Compute the matrix:
1Ni=1Nwij(t)𝐲i𝐲i\frac{1}{N}\sum_{i=1}^{N}w_{ij}^{(t)}\mathbf{y}_{i}\mathbf{y}_{i}^{\top}
and its rr orthonormal eigenvectors 𝐯1,𝐯2,,𝐯r\mathbf{v}_{1},\mathbf{v}_{2},\ldots,\mathbf{v}_{r} corresponding to the smallest rr eigenvalues.
6:         Set
𝐀j(t+1)=[𝐯1𝐯2𝐯r].\mathbf{A}_{j}^{(t+1)}=\begin{bmatrix}\mathbf{v}_{1}&\mathbf{v}_{2}&\ldots&\mathbf{v}_{r}\end{bmatrix}.
7:     end for
8:end for
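
A minimal sketch of one block update of Algorithm 6 is given below, assuming the data are stacked row-wise in Y and the current blocks are stored in the list As; the variable names are ours.

```python
import numpy as np

def bcd_block_update(Y, As, j):
    """Recompute the weights w_ij with the most recent A_l (l != j) and replace
    A_j by the bottom-r eigenvectors of the weighted second-moment matrix."""
    N = Y.shape[0]
    r = As[j].shape[1]
    w = np.ones(N)
    for l, A_l in enumerate(As):
        if l != j:
            w *= np.sum((Y @ A_l) ** 2, axis=1)   # ||y_i^T A_l||^2
    M = (Y * w[:, None]).T @ Y / N                # (1/N) sum_i w_ij y_i y_i^T
    eigvals, eigvecs = np.linalg.eigh(M)
    As[j] = eigvecs[:, :r]
    return As
```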

E.2. Supplementary details for Section 5.2

Mixed linear regression

Here, we provide detailed pseudo-code for Lloyd’s algorithm used to solve the 2\ell_{2}-regularized mixed linear regression problem in Section 5. Each iteration of the algorithm consists of two steps: reclassification and cluster-parameter update. We alternately reclassify the indices ii into Cj(t)C_{j}^{(t)} using (3.6) and update the cluster parameter 𝐱j(t)\mathbf{x}_{j}^{(t)} for each nonempty cluster Cj(t)C_{j}^{(t)} using:

(E.2) 𝐱j(t+1)=(iCj(t)𝐚i𝐚i+λ|Cj(t)|𝐈)1iCj(t)bi𝐚i,\mathbf{x}_{j}^{(t+1)}=\left(\sum_{i\in C_{j}^{(t)}}\mathbf{a}_{i}\mathbf{a}_{i}^{\top}+\lambda|C_{j}^{(t)}|\mathbf{I}\right)^{-1}\sum_{i\in C_{j}^{(t)}}b_{i}\mathbf{a}_{i},

so that 𝐱j(t+1)\mathbf{x}_{j}^{(t+1)} is exactly the minimizer of the group objective function. The algorithm terminates once F(t)F^{(t)} stops decreasing after the parameter update or once the maximum number of iterations is reached. The pseudo-code is shown in Algorithm 7.

Algorithm 7 Lloyd’s Algorithm for mixed linear regression
1:Initialize 𝐱1(0),𝐱2(0),,𝐱k(0)\mathbf{x}_{1}^{(0)},\mathbf{x}_{2}^{(0)},\dots,\mathbf{x}_{k}^{(0)}. Set F(1)=+F^{(-1)}=+\infty.
2:for t=0,1,2,,t=0,1,2,\ldots, max iterations do
3:     Compute F(t)=F(𝐱1(t),𝐱2(t),,𝐱k(t))F^{(t)}=F(\mathbf{x}_{1}^{(t)},\mathbf{x}_{2}^{(t)},\ldots,\mathbf{x}_{k}^{(t)}).
4:     if F(t)=F(t1)F^{(t)}=F^{(t-1)} then
5:         Break.
6:     end if
7:     Compute the partition {Cj(t)}j=1k\{C_{j}^{(t)}\}_{j=1}^{k} via (3.6).
8:     for j=1,2,,kj=1,2,\ldots,k do
9:         if Cj(t)C_{j}^{(t)}\not=\emptyset then
10:              Compute 𝐱j(t+1)\mathbf{x}_{j}^{(t+1)} using (E.2).
11:         else
12:              𝐱j(t+1)=𝐱j(t)\mathbf{x}_{j}^{(t+1)}=\mathbf{x}_{j}^{(t)}.
13:         end if
14:     end for
15:end for
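
The per-cluster update (E.2) is a regularized least-squares solve; a minimal sketch is given below, assuming the rows of A_cluster are the feature vectors \mathbf{a}_{i} with i\in C_{j}^{(t)} and b_cluster collects the corresponding responses.

```python
import numpy as np

def cluster_ridge_update(A_cluster, b_cluster, lam):
    """Closed-form minimizer (E.2):
    x_j = (sum_i a_i a_i^T + lam * |C_j| * I)^{-1} sum_i b_i a_i."""
    n, d = A_cluster.shape
    G = A_cluster.T @ A_cluster + lam * n * np.eye(d)
    rhs = A_cluster.T @ b_cluster
    return np.linalg.solve(G, rhs)
```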

The dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} for the 2\ell_{2}-regularized mixed linear regression is synthetically generated in the following way:

  • Fix the dimension dd and the number of function clusters kk, and sample 𝐱1+,𝐱2+,,𝐱k+i.i.d.𝒩(0,𝐈d)\mathbf{x}_{1}^{+},\mathbf{x}_{2}^{+},\dots,\mathbf{x}_{k}^{+}\stackrel{{\scriptstyle\text{i.i.d.}}}{{\sim}}\mathcal{N}(0,\mathbf{I}_{d}) as the linear coefficients of kk ground truth regression models.

  • For i=1,2,,Ni=1,2,\dots,N, we independently generate data 𝐚i𝒩(0,𝐈d)\mathbf{a}_{i}\sim\mathcal{N}(0,\mathbf{I}_{d}), class index ciUniform([k])c_{i}\sim\textup{Uniform}([k]), noise ϵi𝒩(0,σ2)\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}), and compute bi=𝐚i𝐱ci++ϵib_{i}=\mathbf{a}_{i}^{\top}\mathbf{x}_{c_{i}}^{+}+\epsilon_{i}.

In the experiment, the noise level is set to σ=0.01\sigma=0.01 and the regularization factor is set to λ=0.01\lambda=0.01.
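
The generation procedure above can be sketched in code as follows; this is a minimal sketch with our own function name and random number generator.

```python
import numpy as np

def generate_mlr_data(N, d, k, sigma=0.01, rng=np.random.default_rng(0)):
    """Synthetic l2-regularized mixed linear regression data."""
    X_true = rng.standard_normal((k, d))   # ground-truth coefficients x_j^+
    A = rng.standard_normal((N, d))        # features a_i ~ N(0, I_d)
    c = rng.integers(k, size=N)            # class indices c_i ~ Uniform([k])
    noise = sigma * rng.standard_normal(N)
    b = np.einsum('nd,nd->n', A, X_true[c]) + noise
    return A, b, c, X_true
```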

Mixed non-linear regression

The ground truth θj+\theta_{j}^{+}’s are sampled from a standard Gaussian. The dataset {(𝐚i,bi)}i=1N\{(\mathbf{a}_{i},b_{i})\}_{i=1}^{N} is generated in the same way as in the mixed linear regression experiment. We set the variance of the Gaussian noise on the dataset to σ2=0.012\sigma^{2}=0.01^{2} and use a regularization factor λ=0.01\lambda=0.01.