Efficient Algorithms for Sum-of-Minimum Optimization
Abstract.
In this work, we propose a novel optimization model termed “sum-of-minimum” optimization. This model seeks to minimize the sum or average of objective functions over parameters, where each objective takes the minimum value of a predefined sub-function with respect to the parameters. This universal framework encompasses numerous clustering applications in machine learning and related fields. We develop efficient algorithms for solving sum-of-minimum optimization problems, inspired by a randomized initialization algorithm for classic k-means [arthur2007k] and Lloyd’s algorithm [lloyd1982least]. We establish a new tight bound for the generalized initialization algorithm and prove a gradient-descent-like convergence rate for the generalized Lloyd’s algorithm. The efficiency of our algorithms is numerically examined on multiple tasks, including generalized principal component analysis, mixed linear regression, and small-scale neural network training. Our approach compares favorably to previous ones based on simpler-but-less-precise optimization reformulations.
1. Introduction
In this paper, we propose the following “sum-of-minimum” optimization model:
(1.1)	\min_{x_1,\dots,x_k\in\mathbb{R}^d}\; F(x_1,\dots,x_k) := \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k} f_i(x_j),
where x_1,\dots,x_k\in\mathbb{R}^d are the unknown parameters to determine. The cost function F is the average of n objectives, where the i-th objective is f_i evaluated at its “optimal” parameter out of the k choices x_1,\dots,x_k. This paper aims to develop efficient algorithms for solving (1.1) and analyze their performance.
Write x = (x_1,\dots,x_k) and [n] = \{1,2,\dots,n\}. Let C = (C_1,\dots,C_k) be a partition of [n], i.e., the C_j’s are disjoint subsets of [n] and their union equals [n]. Let \mathcal{C} denote the set of all such partitions. Then, (1.1) is equivalent to
(1.2)	\min_{x_1,\dots,x_k,\;(C_1,\dots,C_k)\in\mathcal{C}}\; \frac{1}{n}\sum_{j=1}^{k}\sum_{i\in C_j} f_i(x_j).
It is easy to see that (x_1,\dots,x_k) and (C_1,\dots,C_k) are optimal to (1.2) if and only if (x_1,\dots,x_k) is optimal to (1.1) and every C_j is contained in \{i\in[n] : f_i(x_j) = \min_{1\le j'\le k} f_i(x_{j'})\}.
Reformulation (1.2) reveals its clustering purpose. It finds the optimal partition such that using the parameter x_j to minimize the average of the f_i’s in the j-th cluster leads to the minimal total cost.
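As a concrete reading of (1.1) and (1.2), the following minimal Python sketch (ours, for illustration; the helper name and the quadratic choice of sub-functions are assumptions, not part of the paper) evaluates the objective and reads off the induced partition.

```python
import numpy as np

def sum_of_min_objective(xs, fs):
    """Evaluate F(x_1,...,x_k) = (1/n) sum_i min_j f_i(x_j) as in (1.1)
    and the induced partition of (1.2): cluster j collects the indices i
    for which x_j attains the minimum."""
    values = np.array([[f(x) for x in xs] for f in fs])   # n-by-k matrix of f_i(x_j)
    assignment = values.argmin(axis=1)                    # best parameter index per i
    objective = values.min(axis=1).mean()
    clusters = [np.flatnonzero(assignment == j) for j in range(len(xs))]
    return objective, clusters

# Toy check with quadratic sub-functions f_i(x) = 0.5 * ||x - y_i||^2,
# for which (1.1) reduces to (a scaled version of) k-means clustering.
rng = np.random.default_rng(0)
ys = rng.normal(size=(12, 2))
fs = [lambda x, y=y: 0.5 * np.sum((x - y) ** 2) for y in ys]
objective, clusters = sum_of_min_objective([ys[0], ys[5], ys[9]], fs)
print(objective, [len(c) for c in clusters])
```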
Problem (1.1) generalizes k-means clustering. Consider data points y_1,\dots,y_n\in\mathbb{R}^d and a distance function d(\cdot,\cdot). The goal of k-means clustering is to find k clustering centroids x_1,\dots,x_k that minimize
\frac{1}{n}\sum_{i=1}^{n}\min_{1\le j\le k} d(y_i, x_j),
which is the average distance from each data point to its nearest cluster center. The literature presents various choices for the distance function d. When d(y,x)=\|y-x\|^2, this optimization problem reduces to the classic k-means clustering problem, for which numerous algorithms have been proposed [krishna1999genetic, arthur2007k, na2010research, sinaga2020unsupervised, ahmed2020k]. Bregman divergence is also widely adopted as a distance measure [banerjee2005clustering, manthey2013worst, liu2016clustering], defined as
D_\phi(y,x) = \phi(y) - \phi(x) - \langle\nabla\phi(x),\, y - x\rangle,
with \phi being a differentiable convex function.
A special case of (1.1) is mixed linear regression, which generalizes linear regression and models the dataset by multiple linear models. A linear model is a function a\mapsto a^\top x, which utilizes x\in\mathbb{R}^d as the coefficient vector. Make k copies of the linear model and set the j-th linear coefficient as x_j. The loss on each data pair (a_i, b_i)\in\mathbb{R}^d\times\mathbb{R} is computed as the squared error from the best-fitting linear model, specifically \min_{1\le j\le k}(a_i^\top x_j - b_i)^2. We aim to search for optimal parameters x_1,\dots,x_k that minimize the average loss
(1.3)	\min_{x_1,\dots,x_k\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,(a_i^\top x_j - b_i)^2.
Paper [zhong2016mixed] simplifies this non-smooth problem to the sum-of-product problem:
(1.4)	\min_{x_1,\dots,x_k\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n}\,\prod_{j=1}^{k}\,(a_i^\top x_j - b_i)^2,
which is smooth. Although (1.4) is easier to approach due to its smooth objective function, problem (1.3) is more accurate. Various algorithms have been proposed to recover linear models from mixed-class data [yi2014alternating, shen2019iterative, kong2020meta, zilber2023imbalanced].
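To illustrate the two formulations side by side, the following sketch (ours; the synthetic data and the function name are assumptions) evaluates the sum-of-minimum loss (1.3) and the sum-of-product surrogate (1.4) on noiseless mixed data; both vanish at the ground truth, but (1.3) charges each point only to its best-fitting model.

```python
import numpy as np

def mixed_lr_losses(xs, A, b):
    """Sum-of-minimum loss (1.3) and sum-of-product surrogate (1.4)
    for mixed linear regression with coefficient vectors xs = [x_1,...,x_k]."""
    R = np.stack([(A @ x - b) ** 2 for x in xs], axis=1)  # squared residual of model j on point i
    return R.min(axis=1).mean(), R.prod(axis=1).mean()

# Noiseless mixed data generated from two ground-truth linear models.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
x_true = rng.normal(size=(2, 5))
labels = rng.integers(2, size=200)
b = np.einsum("ij,ij->i", A, x_true[labels])
print(mixed_lr_losses(list(x_true), A, b))   # both losses are 0 at the ground truth
```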
In (1.3), the linear model a_i^\top x_j can be replaced by any nonlinear function of a_i parameterized by x_j, such as a neural network; we call this extension mixed nonlinear regression.
An application of (1.1) is generalized principal component analysis (GPCA) [vidal2005generalized, tsakiris2017filtrated], which aims to recover k low-dimensional subspaces S_1,\dots,S_k\subseteq\mathbb{R}^d from the given data points y_1,\dots,y_n, which are assumed to be located on or close to the union of these subspaces S_1\cup\cdots\cup S_k. This process, also referred to as subspace clustering, seeks to accurately segment data points into their respective subspaces [ma2008estimation, vidal2011subspace, elhamifar2013sparse]. Each subspace is represented as S_j = \{y\in\mathbb{R}^d : X_j^\top y = 0\}, where X_j\in\mathbb{R}^{d\times r} and X_j^\top X_j = I_r, with r being the co-dimension of S_j. From an optimization perspective, the GPCA task can be formulated as
(1.5)	\min_{X_1,\dots,X_k:\;X_j^\top X_j = I_r}\; \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,\|X_j^\top y_i\|^2.
Similar to (1.4), [peng2023block] works with a less precise reformulation that replaces the minimum over j by the product \prod_{j=1}^k \|X_j^\top y_i\|^2 for smoothness and introduces a block coordinate descent algorithm.
When k = 1, problem (1.1) reduces to the finite-sum optimization problem
(1.6)	\min_{x\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n} f_i(x),
widely used to train machine learning models, where f_i depicts the loss of the model with parameter x on the i-th data point. When the underlying model lacks sufficient expressiveness, problem (1.6) alone may not yield satisfactory results. To enhance a model’s performance, one can train the model with multiple parameters, x_1,\dots,x_k, and utilize only the most effective parameter for every data point. This strategy has been successfully applied in various classic tasks, including the aforementioned k-means clustering, mixed linear regression, and generalized principal component analysis. These applications share a common objective: to segment the dataset into groups and identify the best parameter for each group. Although no single parameter might perform well across the entire dataset, every data point is adequately served by at least one of the k parameters. By aggregating the strengths of multiple smaller models, this approach not only enhances model expressiveness but also offers a cost-efficient alternative to deploying a single larger model.
Although one might expect the algorithms and analyses for the sum-of-minimum problem (1.1) to be weaker since (1.1) subsumes the previously discussed models, we find that our algorithms and analyses for (1.1) enhance those known for the existing models. Our algorithms extend the k-means++ algorithm [arthur2007k] and Lloyd’s algorithm [lloyd1982least], which were proposed for the classic k-means problem. We obtain new bounds for these algorithms on (1.1). Our contributions are summarized as follows:
•	We propose the sum-of-minimum optimization problem, adapt k-means++ to the problem for initialization, and generalize Lloyd’s algorithm to approximately solve the problem.
•	We establish theoretical guarantees for the proposed algorithms. Specifically, under the assumption that each f_i is L-smooth and \mu-strongly convex, we prove that the expected optimality gap of the initialization is within a factor of order \log k (with constants depending on the condition number L/\mu) of the minimal possible gap, and that this bound is tight with respect to both k and the condition number. When reduced to k-means optimization, our result recovers that of [arthur2007k]. Furthermore, we prove an O(1/T) convergence rate for the generalized Lloyd’s algorithms.
•	We numerically verify the efficiency of the proposed framework and algorithms on several tasks, including generalized principal component analysis, \ell_2-regularized mixed linear regression, and small-scale neural network training. The results reveal that our optimization model and algorithms lead to a higher success rate in finding the ground-truth clustering, compared to existing approaches that resort to less accurate reformulations for the sake of smoother optimization landscapes. Moreover, our initialization shows significant improvements in both convergence speed and the chance of obtaining better minima.
Our work significantly generalizes classic k-means to handle more complex nonlinear models and provides new perspectives for improving model performance. The rest of this paper is organized as follows. We introduce the preliminaries and related works in Section 2. We present the algorithms in Section 3. The algorithms are analyzed theoretically in Section 4 and numerically in Section 5. The paper is concluded in Section 6.
Throughout this paper, the \ell_2-norm and the \ell_2-inner product are denoted by \|\cdot\| and \langle\cdot,\cdot\rangle, respectively. We use |\cdot| to denote the cardinality of a set.
2. Related Work and Preliminary
2.1. Related work
Lloyd’s algorithm [lloyd1982least], a well-established iterative method for the classic k-means problem, alternates between two key steps [mackay2003example]: 1) assigning each data point y_i to the centroid x_j that is closest to y_i among x_1,\dots,x_k; 2) updating each x_j as the centroid (mean) of all y_i’s assigned to it. Although Lloyd’s algorithm can be proved to converge to stationary points, the results can be highly suboptimal due to the inherent non-convex nature of the problem. Therefore, the performance of Lloyd’s algorithm highly depends on the initialization. To address this, a randomized initialization algorithm, k-means++ [arthur2007k], generates an initial solution in a sequential fashion. Each centroid is sampled recurrently from the data points according to the distribution
(2.1)	\mathbb{P}(y_i \text{ is chosen as the next centroid}) = \frac{\min_j \|y_i - x_j\|^2}{\sum_{i'=1}^{n} \min_j \|y_{i'} - x_j\|^2},
where the minimum is taken over the centroids already chosen. The idea is to sample a data point farther from the current centroids with higher probability, ensuring that the samples are more evenly distributed across the dataset. It is proved in [arthur2007k] that
(2.2)	\mathbb{E}\big[F(x_1,\dots,x_k)\big] \le 8(\ln k + 2)\,F^*,
where F^* is the optimal objective value of the k-means problem. This seminal work has inspired numerous enhancements to the k-means++ algorithm, as evidenced by contributions from [bahmani2012scalable, zimichev2014spectral, bachem2016fast, bachem2016approximate, wu2021user, ren2022novel]. Our result generalizes the bound in (2.2), broadening its applicability to sum-of-minimum optimization.
2.2. Definitions and assumptions
In this subsection, we outline the foundational settings for our algorithms and theory. For each sub-function f_i, we establish the following assumptions.
Assumption 2.1.
Each f_i is L-smooth, satisfying
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \text{for all } x, y\in\mathbb{R}^d.
Assumption 2.2.
Each f_i is \mu-strongly convex, i.e., for all x, y\in\mathbb{R}^d,
f_i(y) \ge f_i(x) + \langle\nabla f_i(x),\, y - x\rangle + \frac{\mu}{2}\|y - x\|^2.
Let x_i^* denote the minimizer of f_i, i.e., f_i(x_i^*) = \min_x f_i(x), and let
X^* = \{x_i^* : i = 1,\dots,n\}
represent the solution set. If X^* comprises fewer than k different elements, the problem (1.1) possesses infinitely many global minima. Specifically, we can set |X^*| of the variables x_1,\dots,x_k to be the distinct elements in X^*, while leaving the others as free variables. Given these variables, F attains its minimal possible value \frac{1}{n}\sum_{i=1}^{n} f_i(x_i^*). If X^* contains at least k distinct elements, we have the following proposition.
Proposition 2.3.
Suppose that each f_i is \mu-strongly convex and that X^* contains at least k distinct elements. Then problem (1.1) has finitely many global minimizers.
Expanding on the relation between the number of global minimizers and the size of X^*, we introduce well-posedness conditions for X^*.
Definition 2.4 (k-separate and (k, \delta)-separate).
We call X^* k-separate if it contains at least k different elements, i.e., |X^*| \ge k. Furthermore, we call X^* (k, \delta)-separate if there exist k distinct elements of X^* whose pairwise distances are all at least \delta.
Finally, we address the optimality measurement in (1.1). The norm of the (sub)-gradient is an inappropriate measure for global optimality due to the problem’s non-convex nature. Instead, we utilize the following optimality gap.
Definition 2.5 (Optimality gap).
Given a point x\in\mathbb{R}^d, the optimality gap of f_i at x is f_i(x) - \min_{x'} f_i(x'). Given a finite point set X\subseteq\mathbb{R}^d, the optimality gap of f_i at X is \min_{x\in X} f_i(x) - \min_{x'} f_i(x'). When X = \{x_1,\dots,x_k\}, the averaged optimality gap of F at X is the shifted objective function
(2.3)	F(x_1,\dots,x_k) - \frac{1}{n}\sum_{i=1}^{n}\min_{x} f_i(x).
The averaged optimality gap in (2.3) will be used as the optimality measurement throughout this paper. Specifically, in the classic k-means problem one has \min_x f_i(x) = 0 for every i, so the objective function F itself directly indicates global optimality.
3. Algorithms
In this section, we introduce our algorithms for solving the sum-of-minimum optimization problem (1.1). Our approach is twofold, comprising an initialization phase based on k-means++ and a generalized version of Lloyd’s algorithm.
3.1. Initialization
As the sum-of-minimum optimization (1.1) can be considered a generalization of classic k-means clustering, we adapt k-means++. In k-means++, clustering centers are selected sequentially from the dataset, with each data point chosen with probability proportional to its squared distance from the nearest existing clustering center, as detailed in (2.1). We generalize this idea and propose the following initialization algorithm that outputs initial parameters x_1,\dots,x_k for problem (1.1).
First, we select an index i_1 at random from \{1,2,\dots,n\}, following a uniform distribution, and then utilize a specific method to determine the minimizer of f_{i_1}, setting
(3.1)	x_1 = \arg\min_x f_{i_1}(x).
For j = 2,\dots,k, we sample the index i_j based on the existing variables x_1,\dots,x_{j-1}, with each index sampled with probability proportional to the optimality gap of f_i at \{x_1,\dots,x_{j-1}\}. Specifically, we compute the minimal optimality gaps
(3.2)	s_i = \min_{1\le l\le j-1} f_i(x_l) - \min_x f_i(x), \quad i = 1,\dots,n,
as probability scores. Each score can be regarded as an indicator of how unresolved the i-th instance is under the current variables x_1,\dots,x_{j-1}. We then normalize these scores,
(3.3)	p_i = \frac{s_i}{\sum_{i'=1}^{n} s_{i'}},
and sample i_j following the probability distribution (p_1,\dots,p_n). The j-th initialization is determined by optimizing f_{i_j},
(3.4)	x_j = \arg\min_x f_{i_j}(x).
We terminate the selection process once all k variables x_1,\dots,x_k are determined. The pseudo-code of this algorithm is shown in Algorithm 1.
We note that the scores defined in (3.2) rely on the optimal objective values \min_x f_i(x), which may be computationally intensive to calculate in certain scenarios. Therefore, we propose a variant of Algorithm 1 by adjusting the scores s_i. Specifically, when the parameters x_1,\dots,x_{j-1} have been selected, the score is set as the minimum squared norm of the gradient:
(3.5)	s_i = \min_{1\le l\le j-1} \|\nabla f_i(x_l)\|^2.
This variant involves replacing the scores in Step 3 of Algorithm 1 with (3.5), which is further elaborated in Appendix B.
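The seeding loop of Algorithm 1 can be sketched as follows (our notation and interfaces, not the authors' released code; f_val, f_min, and x_min are assumed helpers returning the values, optimal values, and minimizers of the sub-functions).

```python
import numpy as np

def som_seeding(f_val, f_min, x_min, n, k, rng):
    """Sketch of the seeding in Algorithm 1.

    f_val(i, x): evaluates f_i(x)
    f_min[i]:    min_x f_i(x)
    x_min[i]:    a minimizer x_i^* of f_i
    """
    xs = [x_min[rng.integers(n)]]                     # (3.1): x_1 from a uniformly random index
    for _ in range(1, k):
        # (3.2): score s_i = smallest optimality gap of f_i over the current parameters
        s = np.array([min(f_val(i, x) for x in xs) - f_min[i] for i in range(n)])
        p = s / s.sum()                               # (3.3): normalize to a distribution
        i_j = rng.choice(n, p=p)                      # sample i_j with probability p_{i_j}
        xs.append(x_min[i_j])                         # (3.4): x_j = argmin f_{i_j}
    return xs

# Toy usage with f_i(x) = 0.5 * ||x - y_i||^2, so x_i^* = y_i and min f_i = 0;
# the seeding then coincides with k-means++ sampling on the points y_i.
rng = np.random.default_rng(0)
ys = rng.normal(size=(50, 3))
xs0 = som_seeding(lambda i, x: 0.5 * np.sum((x - ys[i]) ** 2),
                  np.zeros(len(ys)), ys, n=len(ys), k=4, rng=rng)
```

Replacing the score inside the loop with the minimum squared gradient norm gives the variant (3.5) used in Algorithm 4.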
3.2. Generalized Lloyd’s algorithm
Lloyd’s algorithm is employed to minimize the loss in -means clustering by alternately updating the clusters and their centroids [lloyd1982least, mackay2003example]. This centroid update process can be regarded as a form of gradient descent applied to group functions, defined by the average distance between data points within a cluster and its centroid [bottou1994convergence]. For our problem (1.1), we introduce a novel gradient descent algorithm that utilizes dynamic group functions. Our algorithm is structured into two main phases: reclassification and group gradient descent.
Reclassification.
The goal is for the class C_j to encompass all indices i for which f_i is active at x_j, i.e., f_i(x_j) attains \min_{1\le l\le k} f_i(x_l), allowing us to use the sub-functions within C_j to update x_j. This leads to the reclassification step at iteration t:
(3.6)	C_j^t = \big\{ i : f_i(x_j^t) \le f_i(x_l^t) \text{ for all } l \ne j \big\},
with ties broken arbitrarily so that C_1^t,\dots,C_k^t form a partition.
Given that reclassification may incur non-negligible costs in practice, a reclassification frequency can be established, performing the update in (3.6) only once every fixed number of iterations while keeping the classes constant during the other iterations.
Group gradient descent.
With C_j^t indicating the sub-functions active at x_j^t, we can define the group objective function
(3.7)	F_j^t(x) = \frac{1}{|C_j^t|}\sum_{i\in C_j^t} f_i(x).
In each iteration, gradient descent is performed on each x_j individually as
(3.8)	x_j^{t+1} = x_j^t - \gamma\,\nabla F_j^t(x_j^t).
Here, \gamma is the chosen step size. Alternatively, one might opt for different iterative updates or directly compute
x_j^{t+1} = \arg\min_x F_j^t(x),
especially if the minimizer of F_j^t admits a closed form or can be computed efficiently. The pseudo-code consisting of the above two steps is presented in Algorithm 2.
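A compact sketch of Algorithm 2 under the notation above is given below; the interfaces f_val and f_grad, the handling of empty clusters, and the fixed step size are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def generalized_lloyd(f_val, f_grad, xs0, n, num_iters, step, reclass_every=1):
    """Sketch of Algorithm 2: alternate the reclassification (3.6)
    with one gradient step (3.8) on each group objective (3.7)."""
    xs = [np.array(x, dtype=float) for x in xs0]
    assign = np.zeros(n, dtype=int)
    for t in range(num_iters):
        if t % reclass_every == 0:                    # reclassify with the chosen frequency
            assign = np.array([int(np.argmin([f_val(i, x) for x in xs]))
                               for i in range(n)])
        for j in range(len(xs)):
            C_j = np.flatnonzero(assign == j)
            if C_j.size == 0:                         # empty clusters keep their parameter
                continue
            grad = np.mean([f_grad(i, xs[j]) for i in C_j], axis=0)   # gradient of (3.7)
            # (3.8); a natural choice for L-smooth sub-functions is step = 1/L
            xs[j] = xs[j] - step * grad
    return xs, assign
```

Combined with the seeding sketch above, this reproduces the overall pipeline; the step-size choice used in the analysis is specified in Theorem 4.6.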
Momentum Lloyd’s Algorithm. We enhance Algorithm 2 by incorporating a momentum term. The momentum for x_j is represented as v_j, with two step-size parameters serving the momentum-based updates. We use the gradient of the group function F_j^t to update the momentum v_j. The momentum algorithm admits the following form:
(3.9)
(3.10)
A critical aspect of the momentum algorithm involves updating the classes between (3.9) and (3.10). Rather than reclassifying based on f_i evaluated at the current iterate x_j, the reclassification leverages an acceleration variable z_j:
(3.11)
The index i will be classified into the class j at which f_i(z_j) attains the minimal value. Furthermore, to mitigate abrupt shifts in each class C_j, we implement a controlled reclassification scheme that limits the extent of change in each class:
(3.12)
where a constraint factor bounds how much each class may change in one reclassification. Details of the momentum algorithm are provided in Appendix B. We display the pseudo-code in Algorithm 3.
4. Theoretical Analysis
In this section, we prove the efficiency of the initialization algorithm and establish the convergence rate of Lloyd’s algorithm. For the initialization Algorithm 1, we show that the expected ratio between the optimality gap at the initialization and the smallest possible optimality gap is of order \log k, with constants depending on the condition number L/\mu. Additionally, by presenting an example where this ratio matches the upper bound, we illustrate the bound’s tightness. For Lloyd’s Algorithms 2 and 3, we establish a gradient decay rate of O(1/T), underscoring the efficiency and convergence properties of these algorithms.
4.1. Error bound of the initialization algorithm
We define the set of initial points selected by the randomized initialization Algorithm 1,
X^0 = \{x_1, x_2, \dots, x_k\},
as the starting configuration for our optimization process. For simplicity, we use F(X^0) to represent the function value at these initial points. Let F^* be the global minimal value of F, and let \bar{f}^* = \frac{1}{n}\sum_{i=1}^{n}\min_x f_i(x) denote the average of the optimal values of the sub-functions. The effectiveness of Algorithm 1 is evaluated by the ratio between \mathbb{E}[F(X^0)] - \bar{f}^* and F^* - \bar{f}^*, which is the expected ratio between the averaged optimality gap at X^0 and the minimal possible averaged optimality gap. The following theorem provides a specific bound.
Theorem 4.1.
Theorem 4.1 indicates that the expected optimality gap at the initialization set is bounded by a multiplicative factor times the minimal optimality gap, where the factor grows logarithmically in k and depends on the condition number L/\mu. The proof of Theorem 4.1 is detailed in Appendix C. In the classic k-means problem, where the sub-functions are squared distances so that L = \mu, this result reduces to Theorem 1.1 in [arthur2007k]. Moreover, the upper bound is proven to be tight via a lower bound established in the following theorem.
Theorem 4.2.
The proof of Theorem 4.2 is presented in detail in Appendix C. In both Theorem 4.1 and Theorem 4.2, the performance of Algorithm 1 is analyzed under the assumption that the optimal values \min_x f_i(x) and the scores s_i in (3.2) can be computed exactly. However, the accurate computation of \min_x f_i(x) may be impractical due to computational costs. Therefore, we explore the error bounds when the score approximates (3.2) with some degree of error. We investigate two types of scoring errors.
•	Additive error. There exists \varepsilon \ge 0 such that we have access to an estimate satisfying the additive-error bound (4.2). Accordingly, we define the approximate scores as in (4.3).
•	Scaling error. There exists a deterministic oracle such that, for any index and point, the multiplicative-error bound (4.4) holds. We then set the approximate scores as in (4.5).
We first analyze the performance of Algorithm 1 using the score with additive error as in (4.3). We typically require the assumption that the solution set X^* is (k, \delta)-separate, which guarantees that the exact scores stay bounded away from zero for any choice of the sampled indices and parameters. Hence, in the initialization Algorithm 1 with score (4.3), there is at least one index with a positive score in each round. We have the following generalized version of Theorem 4.1 with additive error.
Theorem 4.3.
The proof of Theorem 4.3 is deferred to Appendix C. Next, we state a similar result for the scaling-error oracle as in (4.5), whose proof is deferred to Appendix C.
Theorem 4.4.
4.2. Convergence rate of Lloyd’s algorithm
In this subsection, we state the convergence results for Lloyd’s Algorithm 2 and the momentum Lloyd’s Algorithm 3, with all proofs deferred to Appendix D. For Algorithm 2, the optimization of each x_j follows a gradient descent scheme on a varying objective function F_j^t, which is the average of all f_i’s active at x_j^t as determined by (3.6). We have the following gradient-descent-like convergence rate on the gradient norms \|\nabla F_j^t(x_j^t)\|.
Theorem 4.6.
For momentum Lloyd’s Algorithm 3, we have a similar convergence rate stated as follows.
5. Numerical Experiments
In this section, we conduct numerical experiments to demonstrate the efficiency of the proposed model and algorithms. Our code with documentation can be found at https://github.com/LisangDing/Sum-of-Minimum_Optimization.
5.1. Comparison between the sum-of-minimum model and the product formulation
We consider two optimization models for generalized principal component analysis: the sum-of-minimum formulation (1.5) and another widely acknowledged formulation given by [peng2023block, vidal2005generalized]:
(5.1)	\min_{X_1,\dots,X_k:\;X_j^\top X_j = I_r}\; \frac{1}{n}\sum_{i=1}^{n}\,\prod_{j=1}^{k}\,\|X_j^\top y_i\|^2.
The initialization for both formulations is generated by Algorithm 1. We use a slightly modified version of Algorithm 2 to minimize (1.5), since the minimization of the group functions for GPCA admits closed-form solutions. In particular, we alternately compute the minimizer of each group objective function as the update of X_j and then reclassify the sub-functions. We use the block coordinate descent (BCD) method [peng2023block] to minimize (5.1). The BCD algorithm alternately minimizes over one block X_j with all other blocks being fixed. The pseudo-codes of both algorithms are included in Appendix E.1.
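The closed-form group update mentioned above can be computed as follows (a sketch under the formulation (1.5), not the authors' released code): each X_j is refit as the r eigenvectors of its cluster's scatter matrix with the smallest eigenvalues.

```python
import numpy as np

def gpca_group_minimizer(Y_cluster, r):
    """Closed-form minimizer of a GPCA group objective
    sum_{i in C_j} ||X^T y_i||^2 subject to X^T X = I_r:
    the r bottom eigenvectors of the cluster scatter matrix."""
    eigvals, eigvecs = np.linalg.eigh(Y_cluster.T @ Y_cluster)   # ascending eigenvalues
    return eigvecs[:, :r]

def lloyd_gpca_step(Y, Xs, r):
    """One alternation (sketch): reclassify each point by ||X_j^T y_i||^2,
    then refit every X_j on its cluster in closed form."""
    resid = np.stack([np.sum((Y @ X) ** 2, axis=1) for X in Xs], axis=1)
    assign = resid.argmin(axis=1)
    Xs_new = [gpca_group_minimizer(Y[assign == j], r) if np.any(assign == j) else Xs[j]
              for j in range(len(Xs))]
    return Xs_new, assign
```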
We set the cluster number , dimension , subspace co-dimension , and the number of data points . The generation of the synthetic dataset is described in Appendix E.1. We set the maximum iteration number to 50 for Algorithm 2 with (1.5) and terminate the algorithm once the objective function stops decreasing, i.e., the partition/clustering remains unchanged. Meanwhile, we set the iteration number to 50 for the BCD algorithm [peng2023block] with (5.1). The classification accuracy of both methods is reported in Table 1, where the classification accuracy is defined as the maximal matching accuracy with respect to the ground truth over all permutations of the clusters. We observe that our model and algorithm lead to significantly higher accuracy. This is because, compared to (5.1), the formulation in (1.5) models the requirements more precisely, though it is more difficult to optimize due to the non-smoothness.
| Formulation | | | |
|---|---|---|---|
| SoM | 98.24 | 98.07 | 98.19 |
| SoP | 81.88 | 75.90 | 73.33 |
| SoM | 95.04 | 94.98 | 95.94 |
| SoP | 67.69 | 62.89 | 60.85 |
| SoM | 91.30 | 92.92 | 93.73 |
| SoP | 62.36 | 59.65 | 57.89 |
Next, we compare the computational cost of our model and algorithm with that of the product model and the BCD algorithm. We observe that the BCD algorithm exhibits limited improvements in accuracy after the initial 10 iterations. Thus, for a fair comparison, we set both the maximum number of iterations for our algorithm and the iteration number for the BCD algorithm to 10. The accuracy rate and the CPU time are shown in Table 2, from which one can see that the computational costs of our algorithm and the BCD algorithm are competitive, while our algorithm achieves much better classification accuracy.
| Formulation (accuracy / CPU time) | | | |
|---|---|---|---|
| SoM | 97.84 / 0.08 | 97.93 / 0.08 | 98.01 / 0.08 |
| SoP | 81.78 / 0.14 | 75.76 / 0.14 | 73.24 / 0.15 |
| SoM | 93.34 / 0.19 | 94.14 / 0.19 | 95.25 / 0.16 |
| SoP | 67.18 / 0.20 | 62.76 / 0.22 | 60.80 / 0.20 |
| SoM | 88.62 / 0.32 | 91.78 / 0.29 | 92.62 / 0.27 |
| SoP | 61.52 / 0.26 | 59.37 / 0.27 | 57.82 / 0.27 |
5.2. Comparison between different initializations
We present the performance of Lloyd’s Algorithm 2 combined with different initialization methods. The initialization methods adopted in this subsection are:
•	Random initialization. We initialize the variables x_1,\dots,x_k with i.i.d. samples from the d-dimensional standard Gaussian distribution.
•	Uniform seeding initialization. We uniformly sample k different indices i_1,\dots,i_k from \{1,\dots,n\}, then set the minimizer of f_{i_j} as the initial value of x_j.
•	Proposed initialization. We sample the indices using Algorithm 1 and initialize each x_j with the minimizer of the corresponding sub-function.
Mixed linear regression. Our first example is \ell_2-regularized mixed linear regression. We add an \ell_2 regularization term to each sub-function in (1.3) to guarantee strong convexity, and the sum-of-minimum objective function can be written as
F(x_1,\dots,x_k) = \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,\big((a_i^\top x_j - b_i)^2 + \lambda\|x_j\|^2\big),
where \{(a_i, b_i)\}_{i=1}^{n} collects all data points and \lambda > 0 is a fixed parameter. The dataset is generated as described in Appendix E.2.
Similar to the GPCA problem, we slightly modify Lloyd’s algorithm since the \ell_2-regularized least-squares problem can be solved analytically. Specifically, we use the minimizer of the group objective function as the update of x_j instead of performing the gradient descent step (3.8) in Algorithm 2. We run the algorithm until a maximum iteration number is met or the objective function value stops decreasing. The detailed algorithm is given in Appendix E.2.
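The closed-form update mentioned here is a ridge-regression solve; a minimal sketch, assuming sub-functions of the form f_i(x) = (a_i^\top x - b_i)^2 + \lambda\|x\|^2 as described above:

```python
import numpy as np

def ridge_group_update(A_c, b_c, lam):
    """Closed-form minimizer of the group objective for one cluster:
    minimize (1/m) * ||A_c x - b_c||^2 + lam * ||x||^2 over x,
    where the rows of A_c (and entries of b_c) are the cluster's data."""
    m, d = A_c.shape
    return np.linalg.solve(A_c.T @ A_c + m * lam * np.eye(d), A_c.T @ b_c)

# Example (hypothetical variable names): refit cluster j after the reclassification (3.6).
# x_j = ridge_group_update(A[assign == j], b[assign == j], lam)
```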
In the experiment, the number of samples is set to , and we vary the cluster number k from 4 to 6 and the dimension d (the dimension of a_i and x_j) from 4 to 8. For each problem with fixed cluster number and dimension, we repeat the experiment 1000 times with different random seeds. In each repeated experiment, we record two metrics. If the objective value at the last iteration is less than or equal to the objective value at the ground-truth parameters that generate the dataset, we consider the objective function to be nearly optimized and label the algorithm as successful on the task; otherwise, we label the algorithm as failed on the task. Additionally, we record the number of iterations the algorithm takes to output a result. The results are displayed in Table 3.
| Init. Method | | | | | |
|---|---|---|---|---|---|
| random | 0.056 / 17.577 | 0.031 / 18.378 | 0.038 / 19.923 | 0.058 / 21.631 | 0.071 / 22.344 |
| unif. seeding | 0.057 / 16.139 | 0.034 / 16.885 | 0.050 / 18.022 | 0.055 / 18.708 | 0.075 / 19.959 |
| proposed | 0.050 / 14.551 | 0.036 / 15.276 | 0.034 / 16.020 | 0.044 / 16.936 | 0.051 / 17.409 |
| random | 0.161 / 26.355 | 0.156 / 28.844 | 0.172 / 32.247 | 0.238 / 35.042 | 0.321 / 38.324 |
| unif. seeding | 0.145 / 23.728 | 0.136 / 25.914 | 0.143 / 27.671 | 0.198 / 29.935 | 0.256 / 32.662 |
| proposed | 0.162 / 21.552 | 0.130 / 23.476 | 0.143 / 25.933 | 0.161 / 27.268 | 0.217 / 29.086 |
| random | 0.363 / 35.831 | 0.382 / 41.043 | 0.504 / 43.999 | 0.594 / 47.918 | 0.739 / 48.730 |
| unif. seeding | 0.347 / 31.536 | 0.350 / 35.230 | 0.408 / 39.688 | 0.524 / 42.453 | 0.596 / 43.117 |
| proposed | 0.339 / 29.610 | 0.312 / 33.460 | 0.389 / 36.068 | 0.463 / 39.010 | 0.563 / 40.320 |
Mixed nonlinear regression. Our second experiment is on mixed nonlinear regression using 2-layer neural networks. We construct k neural networks with the same structure and let the j-th network, with trainable parameters \theta_j, map an input a to a prediction N(a; \theta_j). Here, a is the input data; we let d_a be the input dimension and d_h be the hidden dimension. For each trial, we prepare the ground truth and the dataset as described in Appendix E.2. We use the squared loss for each neural network and construct the i-th sub-function as
f_i(\theta) = (N(a_i; \theta) - b_i)^2 + \lambda\|\theta\|^2,
where we still use an \ell_2 term as regularization. We perform parallel experiments on training the neural networks via Algorithm 2 using the three different initialization methods. During the training of neural networks, stochastic gradient descent is commonly used to manage limited memory, reduce training loss, and improve generalization. Moreover, the ADAM algorithm proposed in [kingma2014adam] is widely applied; this optimizer is empirically observed to be less sensitive to hyperparameters, more robust, and faster to converge. To align with this practice, we replace the group gradient descent in Algorithm 2 and the group momentum method in Algorithm 3 with ADAM optimizer-based backward propagation on the corresponding group objective function.
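The following PyTorch sketch illustrates this setup (our illustration, not the released code); the sigmoid activation, the scalar output, and the placement of the \ell_2 penalty are assumptions where the text does not pin down the exact architecture.

```python
import torch

def make_net(d_in, d_hidden):
    """A two-layer network of the kind described above; the sigmoid activation
    and scalar output are assumptions for illustration."""
    return torch.nn.Sequential(torch.nn.Linear(d_in, d_hidden),
                               torch.nn.Sigmoid(),
                               torch.nn.Linear(d_hidden, 1))

def adam_group_step(net, optimizer, A_c, b_c, lam):
    """One ADAM step on a group objective: mean squared error over the cluster
    (A_c, b_c are float tensors) plus an l2 penalty on the parameters."""
    optimizer.zero_grad()
    pred = net(A_c).squeeze(-1)
    loss = torch.mean((pred - b_c) ** 2) + lam * sum((p ** 2).sum() for p in net.parameters())
    loss.backward()
    optimizer.step()
    return float(loss)

# Usage sketch (hypothetical variable names): one network and one optimizer per cluster j.
# nets = [make_net(5, 3) for _ in range(k)]
# opts = [torch.optim.Adam(net.parameters()) for net in nets]
# adam_group_step(nets[j], opts[j], A_cluster, b_cluster, lam)
```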
We use two metrics to measure the performance of the algorithms. In one set of experiments, we train the neural networks until the value of the loss function under the parameters \theta_1,\dots,\theta_k is less than that under the ground-truth parameters, and we record the average number of iterations required to achieve this optimization loss. In the other set of experiments, we train the neural networks for a fixed number of iterations and then compute the training and testing losses of the trained networks, where the training loss is the sum-of-minimum loss evaluated on the training dataset and the testing loss is defined in a similar way on the testing dataset.
In our experiments, the training dataset size is and the testing dataset size is . The testing dataset is generated from the same distribution as the training data. Benefiting from ADAM’s robustness with respect to hyperparameters, we use the default ADAM learning rate. We set in Lloyd’s Algorithm 2 and fix the cluster number . We test on three different (input dimension, hidden dimension) tuples: (5,3), (7,5), and (10,5). The results can be found in Tables 4 and 5.
| | (5,3) | (7,5) | (10,5) |
|---|---|---|---|
| random | 329.4 | 132.1 | 130.8 |
| unif. seeding | 233.1 | 71.2 | 67.6 |
| proposed | 181.4 | 49.3 | 47.2 |
| (input dim, hidden dim) / Iter. | (5,3) / 300 | (7,5) / 150 | (10,5) / 150 |
|---|---|---|---|
| random | 4.26 / 4.63 | 4.57 / 5.54 | 4.62 / 5.82 |
| unif. seeding | 3.86 / 4.25 | 3.96 / 4.77 | 3.56 / 4.52 |
| proposed | 3.44 / 3.93 | 3.51 / 4.37 | 3.39 / 4.34 |
We can conclude from Tables 3, 4, and 5 that the careful seeding Algorithm 1 generates the best initialization in most cases. This initialization results in the fewest iterations required by Lloyd’s algorithm to converge, the smallest final loss, and the highest probability of finding the ground-truth clustering.
6. Conclusion
This paper proposes a general framework for sum-of-minimum optimization, as well as efficient initialization and optimization algorithms. Theoretically, tight bounds are established for smooth and strongly convex sub-functions. Though this work is motivated by classic algorithms for the k-means problem, we extend the ideas and theory significantly to a broad family of tasks. Furthermore, the numerical efficiency is validated on generalized principal component analysis and mixed linear and nonlinear regression problems. Future directions include developing algorithms with provable guarantees for non-convex sub-functions and exploring the empirical potential on large-scale tasks.
Acknowledgements
Lisang Ding receives support from Air Force Office of Scientific Research Grants MURI-FA9550-18-1-0502. We thank Liangzu Peng for fruitful discussions on GPCA.
References
Appendix A Proof of Proposition 2.3
In this section, we provide a proof of the proposition in Section 2.
Proposition A.1 (Restatement of Proposition 2.3).
Proof.
If , then the only minimizer up to a permutation of indices is , such that
Next we consider the case where . Let be the set of all minimizers of (1.1). Due to the -strong convexity of , the set is nonempty. Let be the set of all partitions of , such that for all . The set is finite. Next, we show there is an injection from to . For , we recurrently define
We claim that all ’s are nonempty. Otherwise, if there is an index such that , we have a . Replacing the -th parameter with , we have
This contradicts the assumption that is a minimizer of (1.1). Thus, is a well-defined map from to . Consider another . If for all , due to the -strong convexity of ’s, we have
Thus, the map defined above is injective. Overall, is a finite set. ∎
Appendix B Algorithm details
In this section, we provide the details of the algorithms presented in Section 3.
B.1. Initialization with alternative scores
When the score function is taken as the squared gradient norm as in (3.5), the pseudo-code of the initialization can be found in Algorithm 4.
B.2. Details on momentum Lloyd’s Algorithm
In this section, we elaborate on the details of momentum Lloyd’s Algorithm 3. We use as the variables to be optimized. Correspondingly, we introduce as their momentum. We use the same notation in (3.7) as the group objective function. In each iteration, we update using momentum gradient descent and update using the gradient of the group function.
The update of in the momentum algorithm is different from the Lloyd’s Algorithm 2. We introduce an acceleration quantity
Each class is then renewed around the center . We update index to the class where attains the minimum value among all . To ensure the stability of the momentum accumulation, we further introduce a controlled reclassification method. We set a reclassification factor . We update to in the following way to ensure
The key idea is to carefully reclassify each index one by one until the size of one class breaks the above restriction. We construct as the initialization of the reclassification. We randomly, non-repeatedly pick indices from one by one. For looping from 1 to , we let be the classification before the -th random index is picked. Let be the -th index sampled. We reassign to the -th class, such that
There will be at most two classes changed due to the one-index reassignment. We update the class notations from to for all . If there is any change between and , we check whether
holds. If the above restriction holds for all the classes, we accept the reclassification and move on to the next index sample. Otherwise, we stop the process and return the classification obtained before this reassignment. If the reclassification trial successfully loops to the last index, we assign the resulting classification as the new classes.
Appendix C Initialization error bounds
In this section, we prove the error bounds of the initialization Algorithms 1 and 4. Before our proof, we prepare the following concepts and definitions.
Definition C.1.
For any nonempty , we define
Definition C.2.
Let be an index set, be a finite set, we define
Under the -strong convexity and -smooth Assumptions 2.1 and 2.2, we immediately have
Besides, for disjoint index sets , we have
For the problem (1.1), the optimal solution exists due to the strong convexity assumption on ’s. We pick one set of optimal solutions . We let
Based on these optimal solutions, we introduce as a partition of . ’s are disjoint with each other and
Besides, for all , attains minimum at over ,
The choice of and is not unique. We carefully choose them so that are non-empty for each .
Lemma C.3.
Suppose that Assumption 2.1 holds. Let be a nonempty index subset of and let be sampled uniformly at random from . We have
Proof.
We have the following direct inequality.
∎
Lemma C.4.
Let be a fixed finite set in . For two indices , we have
Proof.
We have the following inequality.
∎
Lemma C.5.
Given an index set and a finite point set , suppose that If we randomly sample an index with probability , then we have the following inequality,
Proof.
We consider the expectation of over . We have the following inequality bound.
Here, (a) holds when applying Lemma C.4. ∎
Lemma C.6.
For any in the optimal partition , we have
Proof.
We let be the geometric center of optimal solutions of index set .
∎
Proposition C.7.
Let be an index set, and be a finite point set. Let be a minimizer of the objective function . Suppose that If we sample an index with probability , then we have the following inequality:
(C.1) |
Next we prove that the bound in (C.1) is tight.
Proposition C.8.
Fix the dimension , there exists an integer . We can construct -strongly convex and -smooth sub-functions , and a finite set . We let be the sub-functions of the sum-of-minimum optimization problem (1.1). When we sample an index with probability , we have
Proof.
For the cases where the dimension , we construct the instance in a more concise way. We consider the following points, , , . All the elements except the first one of are zero. We construct the following functions with minimizers .
(C.2) |
We have for all . We construct the finite set in an orthogonal manner. We let , be a single point set. Besides, . The point in is orthogonal to all ’s. Consider the expectation over the newly sampled index , we have
We set . As , we have
In the meanwhile, we have
We have the following error rate:
As for the 1D case, we consider the following points. We let , and . We construct:
Each has the minimizer . Besides, . We let be a single point set. Let . We have
We have the following expectation:
Besides, we have the minimizer of the objective function . We have
We have the following asymptotic error bound:
∎
We remark that the orthogonal technique used in the construction of (C.2) can be applied in other lower bound constructions in the proofs of the initialization Algorithms 1 and 4 as well.
Lemma C.9.
We consider the sum-of-minimum optimization (1.1). Suppose that is -separate. Suppose that we have fixed indices . We define the finite set . We define the index sets . Let . We sample new indices. We let for . In each round of sampling, the probability of , being sampled as is . Then we have the following bound,
(C.3) |
Here, is the harmonic sum.
Proof.
We prove by induction on and . We introduce the notation
We show that if (C.3) holds for the case and , then it also holds for the case . We first prove two base cases.
Case 1: .
Case 2: . With probability , the newly sampled index will lie in , and with probability , it will lie in . We have bounds on the conditional expectation
Here, (a) holds when applying Lemma C.5. (b) holds since is identical to a certain as and we apply Lemma C.6. Overall, we have the bound:
Next, we prove that the case holds when the inequality holds for cases and . With probability , the first sampled index will lie in , and with probability , it will lie in . Let
We divide into two cases and compute the corresponding conditional expectations. For the case where lies in , we have the following bound on the conditional expectation.
For the case where lies in , we have the following inequality:
Here, (a) holds when applying Lemma C.5 and Lemma C.6. (b) holds as
Overall, we have the bound:
Here, (a) holds since and
The proof concludes. ∎
Theorem C.10 (Restatement of Theorem 4.1).
Suppose that the solution set is -separate. Let
be the initial points sampled by the random initialization Algorithm 1. We have the following bound:
(C.4) |
Proof.
We start with a fixed index , let . Suppose . Then we use Lemma C.9 with . Let
We have
The term can be regarded as the conditional expectation of given
According to Algorithm 1, the first index is uniformly random in . We take the expectation over and get
Here, (a) holds when applying Lemma C.3. (b) holds as a result of Lemma C.6.
∎
When we take
the optimization problem (1.1) reduces to the k-means problem, and Algorithm 1 reduces to the k-means++ algorithm. Therefore, according to [arthur2007k], the bound given in Theorem C.10 is tight in k up to a constant. Next, we give a more detailed lower bound that takes the condition number L/\mu into account.
Theorem C.11 (Restatement of Theorem 4.2).
Given a fixed cluster number , there exists . We can construct -strongly convex and -smooth sub-functions , whose minimizer set is -separate. Besides, the sum-of-min objective function satisfies that , so that . When we apply Algorithm 1 to sample the initial centers , we have the following error bound:
(C.5) |
Proof.
We construct the following problem. We fix the cluster number to be . We let the dimension to be . We pick the vertices of a -simplex as the “centers” of clusters. The -simplex is embedded in a dimensional subspace. We let the first elements of the vertices’ coordinates to be non-zero, while the other elements are zero. We denote the first elements of the -th vertex by . We let the -simplex be centered at the origin, so that the magnitudes ’s are the same. We let be the edge length of the simplex. The functions in each cluster follows the orthogonal construction technique in (C.2). Specifically, in cluster , we construct functions mapping from to as
(C.6) |
We have a total of sub-functions. We let , , so that will be assigned in the same cluster when computing the minimizer of the objective function . We let be the -th unit vector, then the minimizers of the above sub-functions are and . We let be the set of all the minimizers . For each cluster , we can compute
Thus, we have
Let be a nonempty subset of . We study the optimality gap of when sampling the new centers based on . We divide the clusters into 4 classes as follows:
We define . Consider as the existing centers, we continue sampling new centers using Algorithm 1. Let be the newly sampled centers. We define the quantity
which is the expected optimality gap after sampling. We will prove by induction that
(C.7) |
Here is the harmonic series. is recursively defined as:
The parameter are chosen as
We denote the right-hand side of (C.7) as .
We consider the case where , we have
Meanwhile,
If , we have
If , then becomes the leading term,
Rearranging the left-hand side and the right-hand side of the inequality, we have:
Therefore, we have
Next, we induct on . When , we have . We use the one-step transfer technique. We let
We have
For (a), we have
when and .
Thus the inequality (C.7) holds. Let . We have
Let . Since , then
We have the following inequalities:
Therefore, we have
In the meanwhile, we have an upper bound estimate for . We pick as the centers. We have
Thus,
∎
We prove two different error bounds when the estimate of is not accurate. We consider the additive and multiplicative errors on the oracle .
In Algorithm 1, when computing the score , we suppose we do not have the exact , instead, we have an estimate , such that
for a certain error factor . We define
Lemma C.12.
Let be an index set, and be a finite point set. Suppose that . We sample an index with probability , then we have the following inequality:
(C.8) |
Proof.
We have
We have
∎
Lemma C.13.
Suppose that we have fixed indices . We define the finite set . We define the index sets . Let . Suppose that . We sample new indices. We let for . In each round of sampling, the probability of , being sampled as is . Then we have the following bound:
(C.9) |
Proof.
The key idea of the proof is similar to Lemma C.9. We let
We prove by induction. When , the inequality obviously holds. When , we have the inequality:
For the general case, can be bounded by two parts. With probability , the first sampled index lies in , and the conditional expectation is bounded by:
With probability , the first sampled index lies in . The conditional expectation is bounded by:
Overall, we have the following inequality:
∎
Theorem C.14 (Restatement of Theorem 4.3).
Suppose that the solution set is -separate. Let
be the initial points sampled by the random initialization Algorithm 1 with noisy oracles . We have the following bound:
Appendix D Convergence of Lloyd’s algorithm
Theorem D.1 (Restatement of Theorem 4.6).
In Algorithm 2, we take the step size . If are -smooth, we have the following convergence result:
Here, is the minimum of .
Proof.
According to the -smoothness assumption on , is also -smooth, which implies that
Averaging over with weights , we have
Averaging over from to , we have
∎
Next, we present a convergence theorem for the momentum algorithm. For simplification, we use the notation
We have the following convergence theorem:
Theorem D.2 (Restatement of Theorem 4.7).
Proof.
The variable satisfies the following property,
We have the following inequality:
Rearranging the inequality, we have
We sum over with weights and get
Since
we have
Summing both sides from to , then dividing both sides by , we have
(D.1) |
Now, we consider the average term . For , we have
We have the following bound on the squared norm of :
Here, (a) applies as an instance of Jensen’s inequality. Averaging the above inequality over , we obtain
Substituting the above inequality back into (D.1), we obtain
We choose
and rearrange the above inequality. Thus, we have
Since we initialize , we have
Besides, since , we have
When
we have
∎
Appendix E Supplementary experiment details
In this section, we provide details on the experiments described in Section 5.
E.1. Supplementary details for Section 5.1
We elaborate on the generation of the synthetic data for the GPCA experiment in Section 5.1.
•	First, we uniformly generate k pairs of orthonormal vectors (u_j, v_j) for j = 1,\dots,k. Each pair is generated uniformly at random, with u_j and v_j forming an orthonormal basis of the j-th subspace.
•	For each data point y_i, we independently generate two Gaussian samples \alpha_i, \beta_i. Next, we sample an index j_i uniformly at random from \{1,\dots,k\}. We then let y_i = \alpha_i u_{j_i} + \beta_i v_{j_i}, as sketched below.
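A minimal sketch of this generation procedure (ours; drawing each orthonormal pair from the QR factorization of a Gaussian matrix is one way to realize the uniform sampling):

```python
import numpy as np

def generate_gpca_data(n, d, k, rng):
    """Synthetic GPCA data: each y_i lies on one of k two-dimensional subspaces
    spanned by a random orthonormal pair (u_j, v_j)."""
    bases = []
    for _ in range(k):
        Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))   # random orthonormal pair u_j, v_j
        bases.append(Q)                                 # d-by-2 basis of the j-th subspace
    labels = rng.integers(k, size=n)                    # subspace index of each point
    coeffs = rng.normal(size=(n, 2))                    # the two Gaussian coefficients
    Y = np.stack([bases[j] @ c for j, c in zip(labels, coeffs)])
    return Y, labels, bases
```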
We provide in Algorithm 5 a detailed pseudo-code of Lloyd’s algorithm for solving the GPCA problem in the sum-of-minimum formulation (1.5), which consists of two steps in each iteration: updating the clusters via (3.6) and precisely computing the minimizer of each group objective function.
We implement the BCD algorithm [peng2023block] for the following optimization problem:
(E.1)	\min_{X_1,\dots,X_k:\;X_j^\top X_j = I_r}\; \frac{1}{n}\sum_{i=1}^{n}\,\prod_{j=1}^{k}\,\|X_j^\top y_i\|^2.
For any j, when X_{j'} is fixed for all j' \ne j, the problem in (E.1) is equivalent to
\min_{X_j:\;X_j^\top X_j = I_r}\; \sum_{i=1}^{n} w_i\,\|X_j^\top y_i\|^2,
where the weights are given by
w_i = \prod_{j'\ne j} \|X_{j'}^\top y_i\|^2.
The detailed pseudo-code can be found in Algorithm 6.
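Under our reading of (E.1) above, one block update of the BCD scheme reduces to a weighted eigenvalue problem; a sketch (the function name and interfaces are ours):

```python
import numpy as np

def bcd_block_update(Y, Xs, j, r):
    """One BCD block update for the product formulation: with the other blocks
    fixed, X_j minimizes a weighted sum of ||X^T y_i||^2 over orthonormal X,
    given by the r bottom eigenvectors of a weighted scatter matrix."""
    w = np.ones(len(Y))
    for jp, X in enumerate(Xs):
        if jp != j:
            w *= np.sum((Y @ X) ** 2, axis=1)          # weights contributed by the fixed blocks
    S = (Y * w[:, None]).T @ Y                          # weighted scatter matrix sum_i w_i y_i y_i^T
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs[:, :r]
```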
E.2. Supplementary details for Section 5.2
Mixed linear regression
Here, we provide the detailed pseudo-code for Lloyd’s algorithm used to solve the \ell_2-regularized mixed linear regression problem in Section 5. Each iteration of the algorithm consists of two steps: reclassification and cluster parameter update. We alternately reclassify the indices via (3.6) and update the cluster parameter x_j for every nonempty cluster C_j using
(E.2)	x_j = \Big(\sum_{i\in C_j} a_i a_i^\top + \lambda|C_j| I\Big)^{-1} \sum_{i\in C_j} b_i a_i,
so that x_j is exactly the minimizer of the group objective function. The algorithm continues until the objective stops decreasing after the update of x_1,\dots,x_k or a maximum iteration number is reached. The pseudo-code is shown in Algorithm 7.
The dataset for the -regularized mixed linear regression is synthetically generated in the following way:
•	Fix the dimension d and the number of function clusters k, and sample x_1^\dagger,\dots,x_k^\dagger as the linear coefficients of the ground-truth regression models.
•	For i = 1,\dots,n, we independently generate the data a_i, a class index j_i, and noise \epsilon_i, and compute b_i = a_i^\top x_{j_i}^\dagger + \epsilon_i.
In the experiment, the noise level is set to and the regularization factor is set to .
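A minimal sketch of this generation procedure (ours; standard Gaussians are assumed for the ground-truth coefficients and the covariates where the text leaves the distributions unspecified):

```python
import numpy as np

def generate_mixed_lr_data(n, d, k, sigma, rng):
    """Synthetic mixed-linear-regression data: each pair (a_i, b_i) comes from
    one of k ground-truth linear models, with Gaussian noise of level sigma."""
    x_true = rng.normal(size=(k, d))                    # ground-truth coefficient vectors
    A = rng.normal(size=(n, d))                         # covariates a_i
    labels = rng.integers(k, size=n)                    # class index of each pair
    b = np.einsum("ij,ij->i", A, x_true[labels]) + sigma * rng.normal(size=n)
    return A, b, labels, x_true
```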
Mixed non-linear regression
The ground truth ’s are sampled from a standard Gaussian. The dataset is generated in the same way as in the mixed linear regression experiment. We set the variance of the Gaussian noise on the dataset to and use a regularization factor .