Efficient Algorithms for Sum-of-Minimum Optimization
Abstract.
In this work, we propose a novel optimization model termed “sum-of-minimum” optimization. This model seeks to minimize the sum or average of objective functions over parameters, where each objective takes the minimum value of a predefined sub-function with respect to the parameters. This universal framework encompasses numerous clustering applications in machine learning and related fields. We develop efficient algorithms for solving sum-of-minimum optimization problems, inspired by a randomized initialization algorithm for classic k-means [arthur2007k] and Lloyd’s algorithm [lloyd1982least]. We establish a new tight bound for the generalized initialization algorithm and prove a gradient-descent-like convergence rate for the generalized Lloyd’s algorithm. The efficiency of our algorithms is numerically examined on multiple tasks, including generalized principal component analysis, mixed linear regression, and small-scale neural network training. Our approach compares favorably to previous ones based on simpler-but-less-precise optimization reformulations.
1. Introduction
In this paper, we propose the following “sum-of-minimum” optimization model:
(1.1)	\min_{x_1,\dots,x_k\in\mathbb{R}^d}\; F(x_1,\dots,x_k) := \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k} f_i(x_j),
where x_1,\dots,x_k\in\mathbb{R}^d are the unknown parameters to determine. The cost function F is the average of n objectives, where the i-th objective is f_i evaluated at its “optimal” parameter out of the k choices x_1,\dots,x_k. This paper aims to develop efficient algorithms for solving (1.1) and analyze their performance.
Write x = (x_1,\dots,x_k) and [n] = \{1,2,\dots,n\}. Let C = (C_1,\dots,C_k) be a partition of [n], i.e., the C_j’s are disjoint subsets of [n] and their union equals [n]. Let \mathcal{C} denote the set of all such partitions. Then, (1.1) is equivalent to
(1.2)	\min_{x_1,\dots,x_k,\;(C_1,\dots,C_k)\in\mathcal{C}}\; \frac{1}{n}\sum_{j=1}^{k}\sum_{i\in C_j} f_i(x_j).
It is easy to see that (x_1,\dots,x_k) and (C_1,\dots,C_k) are optimal to (1.2) if and only if (x_1,\dots,x_k) is optimal to (1.1) and every C_j is contained in \{i\in[n] : f_i(x_j) = \min_{1\le j'\le k} f_i(x_{j'})\}.
Reformulation (1.2) reveals its clustering purpose. It finds the optimal partition such that using the parameter x_j to minimize the average of the f_i’s in the j-th cluster leads to the minimal total cost.
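As a concrete reading of (1.1) and (1.2), the following minimal Python sketch (ours, for illustration; the helper name and the quadratic choice of sub-functions are assumptions, not part of the paper) evaluates the objective and reads off the induced partition.

```python
import numpy as np

def sum_of_min_objective(xs, fs):
    """Evaluate F(x_1,...,x_k) = (1/n) sum_i min_j f_i(x_j) as in (1.1)
    and the induced partition of (1.2): cluster j collects the indices i
    for which x_j attains the minimum."""
    values = np.array([[f(x) for x in xs] for f in fs])   # n-by-k matrix of f_i(x_j)
    assignment = values.argmin(axis=1)                    # best parameter index per i
    objective = values.min(axis=1).mean()
    clusters = [np.flatnonzero(assignment == j) for j in range(len(xs))]
    return objective, clusters

# Toy check with quadratic sub-functions f_i(x) = 0.5 * ||x - y_i||^2,
# for which (1.1) reduces to (a scaled version of) k-means clustering.
rng = np.random.default_rng(0)
ys = rng.normal(size=(12, 2))
fs = [lambda x, y=y: 0.5 * np.sum((x - y) ** 2) for y in ys]
objective, clusters = sum_of_min_objective([ys[0], ys[5], ys[9]], fs)
print(objective, [len(c) for c in clusters])
```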
Problem (1.1) generalizes k-means clustering. Consider data points y_1,\dots,y_n\in\mathbb{R}^d and a distance function d(\cdot,\cdot). The goal of k-means clustering is to find k clustering centroids x_1,\dots,x_k that minimize
\frac{1}{n}\sum_{i=1}^{n}\min_{1\le j\le k} d(y_i, x_j),
which is the average distance from each data point to its nearest cluster center. The literature presents various choices for the distance function d. When d(y,x)=\|y-x\|^2, this optimization problem reduces to the classic k-means clustering problem, for which numerous algorithms have been proposed [krishna1999genetic, arthur2007k, na2010research, sinaga2020unsupervised, ahmed2020k]. Bregman divergence is also widely adopted as a distance measure [banerjee2005clustering, manthey2013worst, liu2016clustering], defined as
D_\phi(y,x) = \phi(y) - \phi(x) - \langle\nabla\phi(x),\, y - x\rangle,
with \phi being a differentiable convex function.
A special case of (1.1) is mixed linear regression, which generalizes linear regression and models the dataset by multiple linear models. A linear model is a function a\mapsto a^\top x, which utilizes x\in\mathbb{R}^d as the coefficient vector. Make k copies of the linear model and set the j-th linear coefficient as x_j. The loss on each data pair (a_i, b_i)\in\mathbb{R}^d\times\mathbb{R} is computed as the squared error from the best-fitting linear model, specifically \min_{1\le j\le k}(a_i^\top x_j - b_i)^2. We aim to search for optimal parameters x_1,\dots,x_k that minimize the average loss
(1.3)	\min_{x_1,\dots,x_k\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,(a_i^\top x_j - b_i)^2.
Paper [zhong2016mixed] simplifies this non-smooth problem to the sum-of-product problem:
(1.4)	\min_{x_1,\dots,x_k\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n}\,\prod_{j=1}^{k}\,(a_i^\top x_j - b_i)^2,
which is smooth. Although (1.4) is easier to approach due to its smooth objective function, problem (1.3) is more accurate. Various algorithms have been proposed to recover linear models from mixed-class data [yi2014alternating, shen2019iterative, kong2020meta, zilber2023imbalanced].
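To illustrate the two formulations side by side, the following sketch (ours; the synthetic data and the function name are assumptions) evaluates the sum-of-minimum loss (1.3) and the sum-of-product surrogate (1.4) on noiseless mixed data; both vanish at the ground truth, but (1.3) charges each point only to its best-fitting model.

```python
import numpy as np

def mixed_lr_losses(xs, A, b):
    """Sum-of-minimum loss (1.3) and sum-of-product surrogate (1.4)
    for mixed linear regression with coefficient vectors xs = [x_1,...,x_k]."""
    R = np.stack([(A @ x - b) ** 2 for x in xs], axis=1)  # squared residual of model j on point i
    return R.min(axis=1).mean(), R.prod(axis=1).mean()

# Noiseless mixed data generated from two ground-truth linear models.
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
x_true = rng.normal(size=(2, 5))
labels = rng.integers(2, size=200)
b = np.einsum("ij,ij->i", A, x_true[labels])
print(mixed_lr_losses(list(x_true), A, b))   # both losses are 0 at the ground truth
```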
In (1.3), the linear model a_i^\top x_j can be replaced by any nonlinear function of a_i parameterized by x_j, such as a neural network; we call this extension mixed nonlinear regression.
An application of (1.1) is generalized principal component analysis (GPCA) [vidal2005generalized, tsakiris2017filtrated], which aims to recover k low-dimensional subspaces S_1,\dots,S_k\subseteq\mathbb{R}^d from the given data points y_1,\dots,y_n, which are assumed to be located on or close to the union of these subspaces S_1\cup\cdots\cup S_k. This process, also referred to as subspace clustering, seeks to accurately segment data points into their respective subspaces [ma2008estimation, vidal2011subspace, elhamifar2013sparse]. Each subspace is represented as S_j = \{y\in\mathbb{R}^d : X_j^\top y = 0\}, where X_j\in\mathbb{R}^{d\times r} and X_j^\top X_j = I_r, with r being the co-dimension of S_j. From an optimization perspective, the GPCA task can be formulated as
(1.5)	\min_{X_1,\dots,X_k:\;X_j^\top X_j = I_r}\; \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,\|X_j^\top y_i\|^2.
Similar to (1.4), [peng2023block] works with a less precise reformulation that replaces the minimum over j by the product \prod_{j=1}^k \|X_j^\top y_i\|^2 for smoothness and introduces a block coordinate descent algorithm.
When k = 1, problem (1.1) reduces to the finite-sum optimization problem
(1.6)	\min_{x\in\mathbb{R}^d}\; \frac{1}{n}\sum_{i=1}^{n} f_i(x),
widely used to train machine learning models, where f_i depicts the loss of the model with parameter x on the i-th data point. When the underlying model lacks sufficient expressiveness, problem (1.6) alone may not yield satisfactory results. To enhance a model’s performance, one can train the model with multiple parameters, x_1,\dots,x_k, and utilize only the most effective parameter for every data point. This strategy has been successfully applied in various classic tasks, including the aforementioned k-means clustering, mixed linear regression, and generalized principal component analysis. These applications share a common objective: to segment the dataset into groups and identify the best parameter for each group. Although no single parameter might perform well across the entire dataset, every data point is adequately served by at least one of the k parameters. By aggregating the strengths of multiple smaller models, this approach not only enhances model expressiveness but also offers a cost-efficient alternative to deploying a single larger model.
Although one might expect the algorithms and analyses for the sum-of-minimum problem (1.1) to be weaker since (1.1) subsumes the previously discussed models, we find that our algorithms and analyses for (1.1) enhance those known for the existing models. Our algorithms extend the k-means++ algorithm [arthur2007k] and Lloyd’s algorithm [lloyd1982least], which were proposed for the classic k-means problem. We obtain new bounds for these algorithms on (1.1). Our contributions are summarized as follows:
•	We propose the sum-of-minimum optimization problem, adapt k-means++ to the problem for initialization, and generalize Lloyd’s algorithm to approximately solve the problem.
•	We establish theoretical guarantees for the proposed algorithms. Specifically, under the assumption that each f_i is L-smooth and \mu-strongly convex, we prove that the expected optimality gap of the initialization is within a factor of order \log k (with constants depending on the condition number L/\mu) of the minimal possible gap, and that this bound is tight with respect to both k and the condition number. When reduced to k-means optimization, our result recovers that of [arthur2007k]. Furthermore, we prove an O(1/T) convergence rate for the generalized Lloyd’s algorithms.
•	We numerically verify the efficiency of the proposed framework and algorithms on several tasks, including generalized principal component analysis, \ell_2-regularized mixed linear regression, and small-scale neural network training. The results reveal that our optimization model and algorithms lead to a higher success rate in finding the ground-truth clustering, compared to existing approaches that resort to less accurate reformulations for the sake of smoother optimization landscapes. Moreover, our initialization shows significant improvements in both convergence speed and the chance of obtaining better minima.
Our work significantly generalizes classic k-means to handle more complex nonlinear models and provides new perspectives for improving model performance. The rest of this paper is organized as follows. We introduce the preliminaries and related works in Section 2. We present the algorithms in Section 3. The algorithms are analyzed theoretically in Section 4 and numerically in Section 5. The paper is concluded in Section 6.
Throughout this paper, the \ell_2-norm and the \ell_2-inner product are denoted by \|\cdot\| and \langle\cdot,\cdot\rangle, respectively. We use |\cdot| to denote the cardinality of a set.
2. Related Work and Preliminary
2.1. Related work
Lloyd’s algorithm [lloyd1982least], a well-established iterative method for the classic k-means problem, alternates between two key steps [mackay2003example]: 1) assigning each data point y_i to the centroid x_j that is closest to y_i among x_1,\dots,x_k; 2) updating each x_j as the centroid (mean) of all y_i’s assigned to it. Although Lloyd’s algorithm can be proved to converge to stationary points, the results can be highly suboptimal due to the inherent non-convex nature of the problem. Therefore, the performance of Lloyd’s algorithm highly depends on the initialization. To address this, a randomized initialization algorithm, k-means++ [arthur2007k], generates an initial solution in a sequential fashion. Each centroid is sampled recurrently from the data points according to the distribution
(2.1)	\mathbb{P}(y_i \text{ is chosen as the next centroid}) = \frac{\min_j \|y_i - x_j\|^2}{\sum_{i'=1}^{n} \min_j \|y_{i'} - x_j\|^2},
where the minimum is taken over the centroids already chosen. The idea is to sample a data point farther from the current centroids with higher probability, ensuring that the samples are more evenly distributed across the dataset. It is proved in [arthur2007k] that
(2.2)	\mathbb{E}\big[F(x_1,\dots,x_k)\big] \le 8(\ln k + 2)\,F^*,
where F^* is the optimal objective value of the k-means problem. This seminal work has inspired numerous enhancements to the k-means++ algorithm, as evidenced by contributions from [bahmani2012scalable, zimichev2014spectral, bachem2016fast, bachem2016approximate, wu2021user, ren2022novel]. Our result generalizes the bound in (2.2), broadening its applicability to sum-of-minimum optimization.
2.2. Definitions and assumptions
In this subsection, we outline the foundational settings for our algorithms and theory. For each sub-function f_i, we establish the following assumptions.
Assumption 2.1.
Each f_i is L-smooth, satisfying
\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \text{for all } x, y\in\mathbb{R}^d.
Assumption 2.2.
Each f_i is \mu-strongly convex, i.e., for all x, y\in\mathbb{R}^d,
f_i(y) \ge f_i(x) + \langle\nabla f_i(x),\, y - x\rangle + \frac{\mu}{2}\|y - x\|^2.
Let x_i^* denote the minimizer of f_i, i.e., f_i(x_i^*) = \min_x f_i(x), and let
X^* = \{x_i^* : i = 1,\dots,n\}
represent the solution set. If X^* comprises fewer than k different elements, the problem (1.1) possesses infinitely many global minima. Specifically, we can set |X^*| of the variables x_1,\dots,x_k to be the distinct elements in X^*, while leaving the others as free variables. Given these variables, F attains its minimal possible value \frac{1}{n}\sum_{i=1}^{n} f_i(x_i^*). If X^* contains at least k distinct elements, we have the following proposition.
Proposition 2.3.
Suppose that each f_i is \mu-strongly convex and that X^* contains at least k distinct elements. Then problem (1.1) has finitely many global minimizers.
Expanding on the relation between the number of global minimizers and the size of X^*, we introduce well-posedness conditions for X^*.
Definition 2.4 (k-separate and (k, \delta)-separate).
We call X^* k-separate if it contains at least k different elements, i.e., |X^*| \ge k. Furthermore, we call X^* (k, \delta)-separate if there exist k distinct elements of X^* whose pairwise distances are all at least \delta.
Finally, we address the optimality measurement in (1.1). The norm of the (sub)-gradient is an inappropriate measure for global optimality due to the problem’s non-convex nature. Instead, we utilize the following optimality gap.
Definition 2.5 (Optimality gap).
Given a point x\in\mathbb{R}^d, the optimality gap of f_i at x is f_i(x) - \min_{x'} f_i(x'). Given a finite point set X\subseteq\mathbb{R}^d, the optimality gap of f_i at X is \min_{x\in X} f_i(x) - \min_{x'} f_i(x'). When X = \{x_1,\dots,x_k\}, the averaged optimality gap of F at X is the shifted objective function
(2.3)	F(x_1,\dots,x_k) - \frac{1}{n}\sum_{i=1}^{n}\min_{x} f_i(x).
The averaged optimality gap in (2.3) will be used as the optimality measurement throughout this paper. Specifically, in the classic k-means problem one has \min_x f_i(x) = 0 for every i, so the objective function F itself directly indicates global optimality.
3. Algorithms
In this section, we introduce our algorithms for solving the sum-of-minimum optimization problem (1.1). Our approach is twofold, comprising an initialization phase based on k-means++ and a generalized version of Lloyd’s algorithm.
3.1. Initialization
As the sum-of-minimum optimization (1.1) can be considered a generalization of classic k-means clustering, we adapt k-means++. In k-means++, clustering centers are selected sequentially from the dataset, with each data point chosen with probability proportional to its squared distance from the nearest existing clustering center, as detailed in (2.1). We generalize this idea and propose the following initialization algorithm that outputs initial parameters x_1,\dots,x_k for problem (1.1).
First, we select an index i_1 at random from \{1,2,\dots,n\}, following a uniform distribution, and then utilize a specific method to determine the minimizer of f_{i_1}, setting
(3.1)	x_1 = \arg\min_x f_{i_1}(x).
For j = 2,\dots,k, we sample the index i_j based on the existing variables x_1,\dots,x_{j-1}, with each index sampled with probability proportional to the optimality gap of f_i at \{x_1,\dots,x_{j-1}\}. Specifically, we compute the minimal optimality gaps
(3.2)	s_i = \min_{1\le l\le j-1} f_i(x_l) - \min_x f_i(x), \quad i = 1,\dots,n,
as probability scores. Each score can be regarded as an indicator of how unresolved the i-th instance is under the current variables x_1,\dots,x_{j-1}. We then normalize these scores,
(3.3)	p_i = \frac{s_i}{\sum_{i'=1}^{n} s_{i'}},
and sample i_j following the probability distribution (p_1,\dots,p_n). The j-th initialization is determined by optimizing f_{i_j},
(3.4)	x_j = \arg\min_x f_{i_j}(x).
We terminate the selection process once all k variables x_1,\dots,x_k are determined. The pseudo-code of this algorithm is shown in Algorithm 1.
We note that the scores defined in (3.2) rely on the optimal objective values \min_x f_i(x), which may be computationally intensive to calculate in certain scenarios. Therefore, we propose a variant of Algorithm 1 by adjusting the scores s_i. Specifically, when the parameters x_1,\dots,x_{j-1} have been selected, the score is set as the minimum squared norm of the gradient:
(3.5)	s_i = \min_{1\le l\le j-1} \|\nabla f_i(x_l)\|^2.
This variant involves replacing the scores in Step 3 of Algorithm 1 with (3.5), which is further elaborated in Appendix B.
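The seeding loop of Algorithm 1 can be sketched as follows (our notation and interfaces, not the authors' released code; f_val, f_min, and x_min are assumed helpers returning the values, optimal values, and minimizers of the sub-functions).

```python
import numpy as np

def som_seeding(f_val, f_min, x_min, n, k, rng):
    """Sketch of the seeding in Algorithm 1.

    f_val(i, x): evaluates f_i(x)
    f_min[i]:    min_x f_i(x)
    x_min[i]:    a minimizer x_i^* of f_i
    """
    xs = [x_min[rng.integers(n)]]                     # (3.1): x_1 from a uniformly random index
    for _ in range(1, k):
        # (3.2): score s_i = smallest optimality gap of f_i over the current parameters
        s = np.array([min(f_val(i, x) for x in xs) - f_min[i] for i in range(n)])
        p = s / s.sum()                               # (3.3): normalize to a distribution
        i_j = rng.choice(n, p=p)                      # sample i_j with probability p_{i_j}
        xs.append(x_min[i_j])                         # (3.4): x_j = argmin f_{i_j}
    return xs

# Toy usage with f_i(x) = 0.5 * ||x - y_i||^2, so x_i^* = y_i and min f_i = 0;
# the seeding then coincides with k-means++ sampling on the points y_i.
rng = np.random.default_rng(0)
ys = rng.normal(size=(50, 3))
xs0 = som_seeding(lambda i, x: 0.5 * np.sum((x - ys[i]) ** 2),
                  np.zeros(len(ys)), ys, n=len(ys), k=4, rng=rng)
```

Replacing the score inside the loop with the minimum squared gradient norm gives the variant (3.5) used in Algorithm 4.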
3.2. Generalized Lloyd’s algorithm
Lloyd’s algorithm is employed to minimize the loss in -means clustering by alternately updating the clusters and their centroids [lloyd1982least, mackay2003example]. This centroid update process can be regarded as a form of gradient descent applied to group functions, defined by the average distance between data points within a cluster and its centroid [bottou1994convergence]. For our problem (1.1), we introduce a novel gradient descent algorithm that utilizes dynamic group functions. Our algorithm is structured into two main phases: reclassification and group gradient descent.
Reclassification.
The goal is for the class C_j to encompass all indices i for which f_i is active at x_j, i.e., f_i(x_j) attains \min_{1\le l\le k} f_i(x_l), allowing us to use the sub-functions within C_j to update x_j. This leads to the reclassification step at iteration t:
(3.6)	C_j^t = \big\{ i : f_i(x_j^t) \le f_i(x_l^t) \text{ for all } l \ne j \big\},
with ties broken arbitrarily so that C_1^t,\dots,C_k^t form a partition.
Given that reclassification may incur non-negligible costs in practice, a reclassification frequency can be established, performing the update in (3.6) only once every fixed number of iterations while keeping the classes constant during the other iterations.
Group gradient descent.
With C_j^t indicating the sub-functions active at x_j^t, we can define the group objective function
(3.7)	F_j^t(x) = \frac{1}{|C_j^t|}\sum_{i\in C_j^t} f_i(x).
In each iteration, gradient descent is performed on each x_j individually as
(3.8)	x_j^{t+1} = x_j^t - \gamma\,\nabla F_j^t(x_j^t).
Here, \gamma is the chosen step size. Alternatively, one might opt for different iterative updates or directly compute
x_j^{t+1} = \arg\min_x F_j^t(x),
especially if the minimizer of F_j^t admits a closed form or can be computed efficiently. The pseudo-code consisting of the above two steps is presented in Algorithm 2.
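A compact sketch of Algorithm 2 under the notation above is given below; the interfaces f_val and f_grad, the handling of empty clusters, and the fixed step size are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def generalized_lloyd(f_val, f_grad, xs0, n, num_iters, step, reclass_every=1):
    """Sketch of Algorithm 2: alternate the reclassification (3.6)
    with one gradient step (3.8) on each group objective (3.7)."""
    xs = [np.array(x, dtype=float) for x in xs0]
    assign = np.zeros(n, dtype=int)
    for t in range(num_iters):
        if t % reclass_every == 0:                    # reclassify with the chosen frequency
            assign = np.array([int(np.argmin([f_val(i, x) for x in xs]))
                               for i in range(n)])
        for j in range(len(xs)):
            C_j = np.flatnonzero(assign == j)
            if C_j.size == 0:                         # empty clusters keep their parameter
                continue
            grad = np.mean([f_grad(i, xs[j]) for i in C_j], axis=0)   # gradient of (3.7)
            # (3.8); a natural choice for L-smooth sub-functions is step = 1/L
            xs[j] = xs[j] - step * grad
    return xs, assign
```

Combined with the seeding sketch above, this reproduces the overall pipeline; the step-size choice used in the analysis is specified in Theorem 4.6.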
Momentum Lloyd’s Algorithm. We enhance Algorithm 2 by incorporating a momentum term. The momentum for x_j is represented as v_j, with two step-size parameters serving the momentum-based updates. We use the gradient of the group function F_j^t to update the momentum v_j. The momentum algorithm admits the following form:
(3.9)
(3.10)
A critical aspect of the momentum algorithm involves updating the classes between (3.9) and (3.10). Rather than reclassifying based on f_i evaluated at the current iterate x_j, the reclassification leverages an acceleration variable z_j:
(3.11)
The index i will be classified into the class j at which f_i(z_j) attains the minimal value. Furthermore, to mitigate abrupt shifts in each class C_j, we implement a controlled reclassification scheme that limits the extent of change in each class:
(3.12)
where a constraint factor bounds how much each class may change in one reclassification. Details of the momentum algorithm are provided in Appendix B. We display the pseudo-code in Algorithm 3.
4. Theoretical Analysis
In this section, we prove the efficiency of the initialization algorithm and establish the convergence rate of Lloyd’s algorithm. For the initialization Algorithm 1, we show that the expected ratio between the optimality gap at the initialization and the smallest possible optimality gap is of order \log k, with constants depending on the condition number L/\mu. Additionally, by presenting an example where this ratio matches the upper bound, we illustrate the bound’s tightness. For Lloyd’s Algorithms 2 and 3, we establish a gradient decay rate of O(1/T), underscoring the efficiency and convergence properties of these algorithms.
4.1. Error bound of the initialization algorithm
We define the set of initial points selected by the randomized initialization Algorithm 1,
X^0 = \{x_1, x_2, \dots, x_k\},
as the starting configuration for our optimization process. For simplicity, we use F(X^0) to represent the function value at these initial points. Let F^* be the global minimal value of F, and let \bar{f}^* = \frac{1}{n}\sum_{i=1}^{n}\min_x f_i(x) denote the average of the optimal values of the sub-functions. The effectiveness of Algorithm 1 is evaluated by the ratio between \mathbb{E}[F(X^0)] - \bar{f}^* and F^* - \bar{f}^*, which is the expected ratio between the averaged optimality gap at X^0 and the minimal possible averaged optimality gap. The following theorem provides a specific bound.
Theorem 4.1.
Theorem 4.1 indicates that the expected optimality gap at the initialization set is bounded by a multiplicative factor times the minimal optimality gap, where the factor grows logarithmically in k and depends on the condition number L/\mu. The proof of Theorem 4.1 is detailed in Appendix C. In the classic k-means problem, where the sub-functions are squared distances so that L = \mu, this result reduces to Theorem 1.1 in [arthur2007k]. Moreover, the upper bound is proven to be tight via a lower bound established in the following theorem.
Theorem 4.2.
The proof of Theorem 4.2 is presented in detail in Appendix C. In both Theorem 4.1 and Theorem 4.2, the performance of Algorithm 1 is analyzed under the assumption that the optimal values \min_x f_i(x) and the scores s_i in (3.2) can be computed exactly. However, the accurate computation of \min_x f_i(x) may be impractical due to computational costs. Therefore, we explore the error bounds when the score approximates (3.2) with some degree of error. We investigate two types of scoring errors.
•	Additive error. There exists \varepsilon \ge 0 such that we have access to an estimate satisfying the additive-error bound (4.2). Accordingly, we define the approximate scores as in (4.3).
•	Scaling error. There exists a deterministic oracle such that, for any index and point, the multiplicative-error bound (4.4) holds. We then set the approximate scores as in (4.5).
We first analyze the performance of Algorithm 1 using the score with additive error as in (4.3). We typically require the assumption that the solution set X^* is (k, \delta)-separate, which guarantees that the exact scores stay bounded away from zero for any choice of the sampled indices and parameters. Hence, in the initialization Algorithm 1 with score (4.3), there is at least one index with a positive score in each round. We have the following generalized version of Theorem 4.1 with additive error.
Theorem 4.3.
The proof of Theorem 4.3 is deferred to Appendix C. Next, we state a similar result for the scaling-error oracle as in (4.5), whose proof is deferred to Appendix C.
Theorem 4.4.
4.2. Convergence rate of Lloyd’s algorithm
In this subsection, we state the convergence results for Lloyd’s Algorithm 2 and the momentum Lloyd’s Algorithm 3, with all proofs deferred to Appendix D. For Algorithm 2, the optimization of each x_j follows a gradient descent scheme on a varying objective function F_j^t, which is the average of all f_i’s active at x_j^t as determined by (3.6). We have the following gradient-descent-like convergence rate on the gradient norms \|\nabla F_j^t(x_j^t)\|.
Theorem 4.6.
For momentum Lloyd’s Algorithm 3, we have a similar convergence rate stated as follows.
5. Numerical Experiments
In this section, we conduct numerical experiments to demonstrate the efficiency of the proposed model and algorithms. Our code with documentation can be found at https://github.com/LisangDing/Sum-of-Minimum_Optimization.
5.1. Comparison between the sum-of-minimum model and the product formulation
We consider two optimization models for generalized principal component analysis: the sum-of-minimum formulation (1.5) and another widely acknowledged formulation given by [peng2023block, vidal2005generalized]:
(5.1)	\min_{X_1,\dots,X_k:\;X_j^\top X_j = I_r}\; \frac{1}{n}\sum_{i=1}^{n}\,\prod_{j=1}^{k}\,\|X_j^\top y_i\|^2.
The initialization for both formulations is generated by Algorithm 1. We use a slightly modified version of Algorithm 2 to minimize (1.5), since the minimization of the group functions for GPCA admits closed-form solutions. In particular, we alternately compute the minimizer of each group objective function as the update of X_j and then reclassify the sub-functions. We use the block coordinate descent (BCD) method [peng2023block] to minimize (5.1). The BCD algorithm alternately minimizes over one block X_j with all other blocks being fixed. The pseudo-codes of both algorithms are included in Appendix E.1.
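The closed-form group update mentioned above can be computed as follows (a sketch under the formulation (1.5), not the authors' released code): each X_j is refit as the r eigenvectors of its cluster's scatter matrix with the smallest eigenvalues.

```python
import numpy as np

def gpca_group_minimizer(Y_cluster, r):
    """Closed-form minimizer of a GPCA group objective
    sum_{i in C_j} ||X^T y_i||^2 subject to X^T X = I_r:
    the r bottom eigenvectors of the cluster scatter matrix."""
    eigvals, eigvecs = np.linalg.eigh(Y_cluster.T @ Y_cluster)   # ascending eigenvalues
    return eigvecs[:, :r]

def lloyd_gpca_step(Y, Xs, r):
    """One alternation (sketch): reclassify each point by ||X_j^T y_i||^2,
    then refit every X_j on its cluster in closed form."""
    resid = np.stack([np.sum((Y @ X) ** 2, axis=1) for X in Xs], axis=1)
    assign = resid.argmin(axis=1)
    Xs_new = [gpca_group_minimizer(Y[assign == j], r) if np.any(assign == j) else Xs[j]
              for j in range(len(Xs))]
    return Xs_new, assign
```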
We set the cluster number , dimension , subspace co-dimension , and the number of data points . The generation of the synthetic dataset is described in Appendix E.1. We set the maximum iteration number to 50 for Algorithm 2 with (1.5) and terminate the algorithm once the objective function stops decreasing, i.e., the partition/clustering remains unchanged. Meanwhile, we set the iteration number to 50 for the BCD algorithm [peng2023block] with (5.1). The classification accuracy of both methods is reported in Table 1, where the classification accuracy is defined as the maximal matching accuracy with respect to the ground truth over all permutations of the clusters. We observe that our model and algorithm lead to significantly higher accuracy. This is because, compared to (5.1), the formulation in (1.5) models the requirements more precisely, though it is more difficult to optimize due to the non-smoothness.
| Formulation | | | |
|---|---|---|---|
| SoM | 98.24 | 98.07 | 98.19 |
| SoP | 81.88 | 75.90 | 73.33 |
| SoM | 95.04 | 94.98 | 95.94 |
| SoP | 67.69 | 62.89 | 60.85 |
| SoM | 91.30 | 92.92 | 93.73 |
| SoP | 62.36 | 59.65 | 57.89 |
Next, we compare the computational cost of our model and algorithm with that of the product model and the BCD algorithm. We observe that the BCD algorithm exhibits limited improvements in accuracy after the initial 10 iterations. Thus, for a fair comparison, we set both the maximum number of iterations for our algorithm and the iteration number for the BCD algorithm to 10. The accuracy rate and the CPU time are shown in Table 2, from which one can see that the computational costs of our algorithm and the BCD algorithm are competitive, while our algorithm achieves much better classification accuracy.
| Formulation (accuracy / CPU time) | | | |
|---|---|---|---|
| SoM | 97.84 / 0.08 | 97.93 / 0.08 | 98.01 / 0.08 |
| SoP | 81.78 / 0.14 | 75.76 / 0.14 | 73.24 / 0.15 |
| SoM | 93.34 / 0.19 | 94.14 / 0.19 | 95.25 / 0.16 |
| SoP | 67.18 / 0.20 | 62.76 / 0.22 | 60.80 / 0.20 |
| SoM | 88.62 / 0.32 | 91.78 / 0.29 | 92.62 / 0.27 |
| SoP | 61.52 / 0.26 | 59.37 / 0.27 | 57.82 / 0.27 |
5.2. Comparison between different initializations
We present the performance of Lloyd’s Algorithm 2 combined with different initialization methods. The initialization methods adopted in this subsection are:
•	Random initialization. We initialize the variables x_1,\dots,x_k with i.i.d. samples from the d-dimensional standard Gaussian distribution.
•	Uniform seeding initialization. We uniformly sample k different indices i_1,\dots,i_k from \{1,\dots,n\}, then set the minimizer of f_{i_j} as the initial value of x_j.
•	Proposed initialization. We sample the indices using Algorithm 1 and initialize each x_j with the minimizer of the corresponding sub-function.
Mixed linear regression. Our first example is \ell_2-regularized mixed linear regression. We add an \ell_2 regularization term to each sub-function in (1.3) to guarantee strong convexity, and the sum-of-minimum objective function can be written as
F(x_1,\dots,x_k) = \frac{1}{n}\sum_{i=1}^{n}\,\min_{1\le j\le k}\,\big((a_i^\top x_j - b_i)^2 + \lambda\|x_j\|^2\big),
where \{(a_i, b_i)\}_{i=1}^{n} collects all data points and \lambda > 0 is a fixed parameter. The dataset is generated as described in Appendix E.2.
Similar to the GPCA problem, we slightly modify Lloyd’s algorithm since the \ell_2-regularized least-squares problem can be solved analytically. Specifically, we use the minimizer of the group objective function as the update of x_j instead of performing the gradient descent step (3.8) in Algorithm 2. We run the algorithm until a maximum iteration number is met or the objective function value stops decreasing. The detailed algorithm is given in Appendix E.2.
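The closed-form update mentioned here is a ridge-regression solve; a minimal sketch, assuming sub-functions of the form f_i(x) = (a_i^\top x - b_i)^2 + \lambda\|x\|^2 as described above:

```python
import numpy as np

def ridge_group_update(A_c, b_c, lam):
    """Closed-form minimizer of the group objective for one cluster:
    minimize (1/m) * ||A_c x - b_c||^2 + lam * ||x||^2 over x,
    where the rows of A_c (and entries of b_c) are the cluster's data."""
    m, d = A_c.shape
    return np.linalg.solve(A_c.T @ A_c + m * lam * np.eye(d), A_c.T @ b_c)

# Example (hypothetical variable names): refit cluster j after the reclassification (3.6).
# x_j = ridge_group_update(A[assign == j], b[assign == j], lam)
```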
In the experiment, the number of samples is set to , and we vary the cluster number k from 4 to 6 and the dimension d (the dimension of a_i and x_j) from 4 to 8. For each problem with fixed cluster number and dimension, we repeat the experiment 1000 times with different random seeds. In each repeated experiment, we record two metrics. If the objective value at the last iteration is less than or equal to the objective value at the ground-truth parameters that generate the dataset, we consider the objective function to be nearly optimized and label the algorithm as successful on the task; otherwise, we label the algorithm as failed on the task. Additionally, we record the number of iterations the algorithm takes to output a result. The results are displayed in Table 3.
| Init. Method | | | | | |
|---|---|---|---|---|---|
| random | 0.056 / 17.577 | 0.031 / 18.378 | 0.038 / 19.923 | 0.058 / 21.631 | 0.071 / 22.344 |
| unif. seeding | 0.057 / 16.139 | 0.034 / 16.885 | 0.050 / 18.022 | 0.055 / 18.708 | 0.075 / 19.959 |
| proposed | 0.050 / 14.551 | 0.036 / 15.276 | 0.034 / 16.020 | 0.044 / 16.936 | 0.051 / 17.409 |
| random | 0.161 / 26.355 | 0.156 / 28.844 | 0.172 / 32.247 | 0.238 / 35.042 | 0.321 / 38.324 |
| unif. seeding | 0.145 / 23.728 | 0.136 / 25.914 | 0.143 / 27.671 | 0.198 / 29.935 | 0.256 / 32.662 |
| proposed | 0.162 / 21.552 | 0.130 / 23.476 | 0.143 / 25.933 | 0.161 / 27.268 | 0.217 / 29.086 |
| random | 0.363 / 35.831 | 0.382 / 41.043 | 0.504 / 43.999 | 0.594 / 47.918 | 0.739 / 48.730 |
| unif. seeding | 0.347 / 31.536 | 0.350 / 35.230 | 0.408 / 39.688 | 0.524 / 42.453 | 0.596 / 43.117 |
| proposed | 0.339 / 29.610 | 0.312 / 33.460 | 0.389 / 36.068 | 0.463 / 39.010 | 0.563 / 40.320 |
Mixed nonlinear regression. Our second experiment is on mixed nonlinear regression using 2-layer neural networks. We construct k neural networks with the same structure and let the j-th network, with trainable parameters \theta_j, map an input a to a prediction N(a; \theta_j). Here, a is the input data; we let d_a be the input dimension and d_h be the hidden dimension. For each trial, we prepare the ground truth and the dataset as described in Appendix E.2. We use the squared loss for each neural network and construct the i-th sub-function as
f_i(\theta) = (N(a_i; \theta) - b_i)^2 + \lambda\|\theta\|^2,
where we still use an \ell_2 term as regularization. We perform parallel experiments on training the neural networks via Algorithm 2 using the three different initialization methods. During the training of neural networks, stochastic gradient descent is commonly used to manage limited memory, reduce training loss, and improve generalization. Moreover, the ADAM algorithm proposed in [kingma2014adam] is widely applied; this optimizer is empirically observed to be less sensitive to hyperparameters, more robust, and faster to converge. To align with this practice, we replace the group gradient descent in Algorithm 2 and the group momentum method in Algorithm 3 with ADAM optimizer-based backward propagation on the corresponding group objective function.
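The following PyTorch sketch illustrates this setup (our illustration, not the released code); the sigmoid activation, the scalar output, and the placement of the \ell_2 penalty are assumptions where the text does not pin down the exact architecture.

```python
import torch

def make_net(d_in, d_hidden):
    """A two-layer network of the kind described above; the sigmoid activation
    and scalar output are assumptions for illustration."""
    return torch.nn.Sequential(torch.nn.Linear(d_in, d_hidden),
                               torch.nn.Sigmoid(),
                               torch.nn.Linear(d_hidden, 1))

def adam_group_step(net, optimizer, A_c, b_c, lam):
    """One ADAM step on a group objective: mean squared error over the cluster
    (A_c, b_c are float tensors) plus an l2 penalty on the parameters."""
    optimizer.zero_grad()
    pred = net(A_c).squeeze(-1)
    loss = torch.mean((pred - b_c) ** 2) + lam * sum((p ** 2).sum() for p in net.parameters())
    loss.backward()
    optimizer.step()
    return float(loss)

# Usage sketch (hypothetical variable names): one network and one optimizer per cluster j.
# nets = [make_net(5, 3) for _ in range(k)]
# opts = [torch.optim.Adam(net.parameters()) for net in nets]
# adam_group_step(nets[j], opts[j], A_cluster, b_cluster, lam)
```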
We use two metrics to measure the performance of the algorithms. In one set of experiments, we train the neural networks until the value of the loss function under the parameters \theta_1,\dots,\theta_k is less than that under the ground-truth parameters, and we record the average number of iterations required to achieve this optimization loss. In the other set of experiments, we train the neural networks for a fixed number of iterations and then compute the training and testing losses of the trained networks, where the training loss is the sum-of-minimum loss evaluated on the training dataset and the testing loss is defined in a similar way on the testing dataset.
In our experiments, the training dataset size is and the testing dataset size is . The testing dataset is generated from the same distribution as the training data. Benefiting from ADAM’s robustness with respect to hyperparameters, we use the default ADAM learning rate. We set in Lloyd’s Algorithm 2 and fix the cluster number . We test on three different (input dimension, hidden dimension) tuples: (5,3), (7,5), and (10,5). The results can be found in Tables 4 and 5.
| | (5,3) | (7,5) | (10,5) |
|---|---|---|---|
| random | 329.4 | 132.1 | 130.8 |
| unif. seeding | 233.1 | 71.2 | 67.6 |
| proposed | 181.4 | 49.3 | 47.2 |
| (input dim, hidden dim) / Iter. | (5,3) / 300 | (7,5) / 150 | (10,5) / 150 |
|---|---|---|---|
| random | 4.26 / 4.63 | 4.57 / 5.54 | 4.62 / 5.82 |
| unif. seeding | 3.86 / 4.25 | 3.96 / 4.77 | 3.56 / 4.52 |
| proposed | 3.44 / 3.93 | 3.51 / 4.37 | 3.39 / 4.34 |
We can conclude from Tables 3, 4, and 5 that the careful seeding Algorithm 1 generates the best initialization in most cases. This initialization results in the fewest iterations required by Lloyd’s algorithm to converge, the smallest final loss, and the highest probability of finding the ground-truth clustering.
6. Conclusion
This paper proposes a general framework for sum-of-minimum optimization, as well as efficient initialization and optimization algorithms. Theoretically, tight bounds are established for smooth and strongly convex sub-functions. Though this work is motivated by classic algorithms for the k-means problem, we extend the ideas and theory significantly to a broad family of tasks. Furthermore, the numerical efficiency is validated on generalized principal component analysis and mixed linear and nonlinear regression problems. Future directions include developing algorithms with provable guarantees for non-convex sub-functions and exploring the empirical potential on large-scale tasks.
Acknowledgements
Lisang Ding receives support from Air Force Office of Scientific Research Grants MURI-FA9550-18-1-0502. We thank Liangzu Peng for fruitful discussions on GPCA.
References
Appendix A Proof of Proposition 2.3
In this section, we provide a proof of the proposition in Section 2.
Proposition A.1 (Restatement of Proposition 2.3).
Proof.
If , then the only minimizer up to a permutation of indices is , such that
Next we consider the case where . Let be the set of all minimizers of (1.1). Due to the -strong convexity of , the set is nonempty. Let be the set of all partitions of , such that for all . The set is finite. Next, we show there is an injection from to . For , we recurrently define
We claim that all ’s are nonempty. Otherwise, if there is an index such that , we have a . Replacing the -th parameter with , we have
This contradicts the assumption that is a minimizer of (1.1). Thus, is a well-defined map from to . Consider another . If for all , due to the -strong convexity of ’s, we have
Thus, the map defined above is injective. Overall, is a finite set. ∎
Appendix B Algorithm details
In this section, we provide the details of the algorithms presented in Section 3.
B.1. Initialization with alternative scores
When the score function is taken as the squared gradient norm as in (3.5), the pseudo-code of the initialization can be found in Algorithm 4.
B.2. Details on momentum Lloyd’s Algorithm
In this section, we elaborate on the details of momentum Lloyd’s Algorithm 3. We use as the variables to be optimized. Correspondingly, we introduce as their momentum. We use the same notation in (3.7) as the group objective function. In each iteration, we update using momentum gradient descent and update using the gradient of the group function.
The update of in the momentum algorithm is different from the Lloyd’s Algorithm 2. We introduce an acceleration quantity
Each class is then renewed around the center . We update index to the class where attains the minimum value among all . To ensure the stability of the momentum accumulation, we further introduce a controlled reclassification method. We set a reclassification factor . We update to in the following way to ensure
The key idea is to carefully reclassify each index one by one until the size of one class breaks the above restriction. We construct as the initialization of the reclassification. We randomly, non-repeatedly pick indices from one by one. For looping from 1 to , we let be the classification before the -th random index is picked. Let be the -th index sampled. We reassign to the -th class, such that
There will be at most two classes changed due to the one-index reassignment. We update the class notations from to for all . If there is any change between and , we check whether
holds. If the above restriction holds for all the classes, we accept the reclassification and move on to the next index sample. Otherwise, we stop the process and return the classification obtained before this reassignment. If the reclassification trial successfully loops to the last index, we assign the resulting classification as the new classes.
Appendix C Initialization error bounds
In this section, we prove the error bounds of the initialization Algorithms 1 and 4. Before our proof, we prepare the following concepts and definitions.
Definition C.1.
For any nonempty , we define
Definition C.2.
Let be an index set, be a finite set, we define
Under the -strong convexity and -smooth Assumptions 2.1 and 2.2, we immediately have
Besides, for disjoint index sets , we have
For the problem (1.1), the optimal solution exists due to the strong convexity assumption on ’s. We pick one set of optimal solutions . We let
Based on these optimal solutions, we introduce as a partition of . ’s are disjoint with each other and
Besides, for all , attains minimum at over ,
The choice of and is not unique. We carefully choose them so that are non-empty for each .
Lemma C.3.
Suppose that Assumption 2.1 holds. Let be a nonempty index subset of and let be sampled uniformly at random from . We have
Proof.
We have the following direct inequality.
∎
Lemma C.4.
Let be a fixed finite set in . For two indices , we have
Proof.
We have the following inequality.
∎
Lemma C.5.
Given an index set and a finite point set , suppose that If we randomly sample an index with probability , then we have the following inequality,
Proof.
We consider the expectation of over . We have the following inequality bound.
Here, (a) holds when applying Lemma C.4. ∎
Lemma C.6.
For any in the optimal partition , we have
Proof.
We let be the geometric center of optimal solutions of index set .
∎
Proposition C.7.
Let be an index set, and be a finite point set. Let be a minimizer of the objective function . Suppose that If we sample an index with probability , then we have the following inequality:
(C.1) |
Next we prove that the bound in (C.1) is tight.
Proposition C.8.
Fix the dimension , there exists an integer . We can construct -strongly convex and -smooth sub-functions , and a finite set . We let be the sub-functions of the sum-of-minimum optimization problem (1.1). When we sample an index with probability , we have
Proof.
For the cases where the dimension , we construct the instance in a more concise way. We consider the following points, , , . All the elements except the first one of are zero. We construct the following functions with minimizers .
(C.2) |
We have for all . We construct the finite set in an orthogonal manner. We let , be a single point set. Besides, . The point in is orthogonal to all ’s. Consider the expectation over the newly sampled index , we have
We set . As , we have
In the meanwhile, we have
We have the following error rate:
As for the 1D case, we consider the following points. We let , and . We construct:
Each has the minimizer . Besides, . We let be a single point set. Let . We have
We have the following expectation:
Besides, we have the minimizer of the objective function . We have
We have the following asymptotic error bound:
∎
We remark that the orthogonal technique used in the construction of (C.2) can be applied in other lower bound constructions in the proofs of the initialization Algorithms 1 and 4 as well.
Lemma C.9.
We consider the sum-of-minimum optimization (1.1). Suppose that is -separate. Suppose that we have fixed indices . We define the finite set . We define the index sets . Let . We sample new indices. We let for . In each round of sampling, the probability of , being sampled as is . Then we have the following bound,
(C.3) |
Here, is the harmonic sum.
Proof.
We prove by induction on and . We introduce the notation
We show that if (C.3) holds for the case and , then it also holds for the case . We first prove two base cases.
Case 1: .
Case 2: . With probability , the newly sampled index will lie in , and with probability , it will lie in . We have bounds on the conditional expectation
Here, (a) holds when applying Lemma C.5. (b) holds since is identical to a certain as and we apply Lemma C.6. Overall, we have the bound:
Next, we prove that the case holds when the inequality holds for cases and . With probability , the first sampled index will lie in , and with probability , it will lie in . Let
We divide into two cases and compute the corresponding conditional expectations. For the case where lies in , we have the following bound on the conditional expectation.
For the case where lies in , we have the following inequality:
Here, (a) holds when applying Lemma C.5 and Lemma C.6. (b) holds as
Overall, we have the bound:
Here, (a) holds since and
The proof concludes. ∎
Theorem C.10 (Restatement of Theorem 4.1).
Suppose that the solution set is -separate. Let
be the initial points sampled by the random initialization Algorithm 1. We have the following bound:
(C.4) |
Proof.
We start with a fixed index , let . Suppose . Then we use Lemma C.9 with . Let
We have
The term can be regarded as the conditional expectation of given
According to Algorithm 1, the first index is uniformly random in . We take the expectation over and get
Here, (a) holds when applying Lemma C.3. (b) holds as a result of Lemma C.6.
∎
When we take
the optimization problem (1.1) reduces to the k-means problem, and Algorithm 1 reduces to the k-means++ algorithm. Therefore, according to [arthur2007k], the bound given in Theorem C.10 is tight in k up to a constant. Next, we give a more detailed lower bound that takes the condition number L/\mu into account.
Theorem C.11 (Restatement of Theorem 4.2).
Given a fixed cluster number , there exists . We can construct -strongly convex and -smooth sub-functions , whose minimizer set is -separate. Besides, the sum-of-min objective function satisfies that , so that . When we apply Algorithm 1 to sample the initial centers , we have the following error bound:
(C.5) |
Proof.
We construct the following problem. We fix the cluster number to be . We let the dimension to be . We pick the vertices of a -simplex as the “centers” of clusters. The -simplex is embedded in a dimensional subspace. We let the first elements of the vertices’ coordinates to be non-zero, while the other elements are zero. We denote the first elements of the -th vertex by . We let the -simplex be centered at the origin, so that the magnitudes ’s are the same. We let be the edge length of the simplex. The functions in each cluster follows the orthogonal construction technique in (C.2). Specifically, in cluster , we construct functions mapping from to as
(C.6) |
We have a total of sub-functions. We let , , so that will be assigned in the same cluster when computing the minimizer of the objective function . We let be the -th unit vector, then the minimizers of the above sub-functions are and . We let be the set of all the minimizers . For each cluster , we can compute
Thus, we have
Let be a nonempty subset of . We study the optimality gap of when sampling the new centers based on . We divide the clusters into 4 classes as follows:
We define . Consider as the existing centers, we continue sampling new centers using Algorithm 1. Let be the newly sampled centers. We define the quantity
which is the expected optimality gap after sampling. We will prove by induction that
(C.7) |
Here is the harmonic series. is recursively defined as:
The parameter are chosen as
We denote the right-hand side of (C.7) as .
We consider the case where , we have
Meanwhile,
If , we have
If , then becomes the leading term,
Rearranging the left-hand side and the right-hand side of the inequality, we have:
Therefore, we have
Next, we induct on . When , we have . We use the one-step transfer technique. We let
We have
For (a), we have
when and .
Thus the inequality (C.7) holds. Let . We have
Let . Since , then
We have the following inequalities:
Therefore, we have
In the meanwhile, we have an upper bound estimate for . We pick as the centers. We have
Thus,
∎
We prove two different error bounds when the estimate of is not accurate. We consider the additive and multiplicative errors on the oracle .
In Algorithm 1, when computing the score , we suppose we do not have the exact , instead, we have an estimate , such that
for a certain error factor . We define
Lemma C.12.
Let be an index set, and be a finite point set. Suppose that . We sample an index with probability , then we have the following inequality:
(C.8) |
Proof.
We have
We have
∎
Lemma C.13.
Suppose that we have fixed indices . We define the finite set . We define the index sets . Let . Suppose that . We sample new indices. We let for . In each round of sampling, the probability of , being sampled as is . Then we have the following bound:
(C.9) |
Proof.
The key idea of the proof is similar to Lemma C.9. We let
We prove by induction. When , the inequality obviously holds. When , we have the inequality:
For the general case, can be bounded by two parts. With probability , the first sampled index lies in , and the conditional expectation is bounded by:
With probability , the first sampled index lies in . The conditional expectation is bounded by:
Overall, we have the following inequality:
∎
Theorem C.14 (Restatement of Theorem 4.3).
Suppose that the solution set is -separate. Let
be the initial points sampled by the random initialization Algorithm 1 with noisy oracles . We have the following bound:
Appendix D Convergence of Lloyd’s algorithm
Theorem D.1 (Restatement of Theorem 4.6).
In Algorithm 2, we take the step size . If are -smooth, we have the following convergence result:
Here, is the minimum of .
Proof.
According to the -smoothness assumption on , is also -smooth, which implies that
Averaging over with weights , we have
Averaging over from to , we have
∎
Next, we present a convergence theorem for the momentum algorithm. For simplification, we use the notation
We have the following convergence theorem:
Theorem D.2 (Restatement of Theorem 4.7).
Proof.
The variable satisfies the following property,
We have the following inequality:
Rearranging the inequality, we have
We sum over with weights and get
Since
we have
Summing both sides from to , then dividing both sides by , we have
(D.1) |
Now, we consider the average term . For , we have
We have the following bound on the squared norm of :
Here, (a) applies as an instance of Jensen’s inequality. Averaging the above inequality over , we obtain
Substituting the above inequality back into (D.1), we obtain
We choose
and rearrange the above inequality. Thus, we have
Since we initialize , we have
Besides, since , we have
When
we have
∎
Appendix E Supplementary experiment details
In this section, we provide details on the experiments described in Section 5.
E.1. Supplementary details for Section 5.1
We elaborate on the generation of the synthetic data for the GPCA experiment in Section 5.1.
•	First, we uniformly generate k pairs of orthonormal vectors (u_j, v_j) for j = 1,\dots,k. Each pair is generated uniformly at random, with u_j and v_j forming an orthonormal basis of the j-th subspace.
•	For each data point y_i, we independently generate two Gaussian samples \alpha_i, \beta_i. Next, we sample an index j_i uniformly at random from \{1,\dots,k\}. We then let y_i = \alpha_i u_{j_i} + \beta_i v_{j_i}, as sketched below.
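A minimal sketch of this generation procedure (ours; drawing each orthonormal pair from the QR factorization of a Gaussian matrix is one way to realize the uniform sampling):

```python
import numpy as np

def generate_gpca_data(n, d, k, rng):
    """Synthetic GPCA data: each y_i lies on one of k two-dimensional subspaces
    spanned by a random orthonormal pair (u_j, v_j)."""
    bases = []
    for _ in range(k):
        Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))   # random orthonormal pair u_j, v_j
        bases.append(Q)                                 # d-by-2 basis of the j-th subspace
    labels = rng.integers(k, size=n)                    # subspace index of each point
    coeffs = rng.normal(size=(n, 2))                    # the two Gaussian coefficients
    Y = np.stack([bases[j] @ c for j, c in zip(labels, coeffs)])
    return Y, labels, bases
```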
We provide in Algorithm 5 a detailed pseudo-code of Lloyd’s algorithm for solving the GPCA problem in the sum-of-minimum formulation (1.5), which consists of two steps in each iteration: updating the clusters via (3.6) and precisely computing the minimizer of each group objective function.
We implement the BCD algorithm [peng2023block] for the following optimization problem:
(E.1)	\min_{X_1,\dots,X_k:\;X_j^\top X_j = I_r}\; \frac{1}{n}\sum_{i=1}^{n}\,\prod_{j=1}^{k}\,\|X_j^\top y_i\|^2.
For any j, when X_{j'} is fixed for all j' \ne j, the problem in (E.1) is equivalent to
\min_{X_j:\;X_j^\top X_j = I_r}\; \sum_{i=1}^{n} w_i\,\|X_j^\top y_i\|^2,
where the weights are given by
w_i = \prod_{j'\ne j} \|X_{j'}^\top y_i\|^2.
The detailed pseudo-code can be found in Algorithm 6.
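Under our reading of (E.1) above, one block update of the BCD scheme reduces to a weighted eigenvalue problem; a sketch (the function name and interfaces are ours):

```python
import numpy as np

def bcd_block_update(Y, Xs, j, r):
    """One BCD block update for the product formulation: with the other blocks
    fixed, X_j minimizes a weighted sum of ||X^T y_i||^2 over orthonormal X,
    given by the r bottom eigenvectors of a weighted scatter matrix."""
    w = np.ones(len(Y))
    for jp, X in enumerate(Xs):
        if jp != j:
            w *= np.sum((Y @ X) ** 2, axis=1)          # weights contributed by the fixed blocks
    S = (Y * w[:, None]).T @ Y                          # weighted scatter matrix sum_i w_i y_i y_i^T
    eigvals, eigvecs = np.linalg.eigh(S)
    return eigvecs[:, :r]
```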
E.2. Supplementary details for Section 5.2
Mixed linear regression
Here, we provide the detailed pseudo-code for Lloyd’s algorithm used to solve the \ell_2-regularized mixed linear regression problem in Section 5. Each iteration of the algorithm consists of two steps: reclassification and cluster parameter update. We alternately reclassify the indices via (3.6) and update the cluster parameter x_j for every nonempty cluster C_j using
(E.2)	x_j = \Big(\sum_{i\in C_j} a_i a_i^\top + \lambda|C_j| I\Big)^{-1} \sum_{i\in C_j} b_i a_i,
so that x_j is exactly the minimizer of the group objective function. The algorithm continues until the objective stops decreasing after the update of x_1,\dots,x_k or a maximum iteration number is reached. The pseudo-code is shown in Algorithm 7.
The dataset for the -regularized mixed linear regression is synthetically generated in the following way:
•	Fix the dimension d and the number of function clusters k, and sample x_1^\dagger,\dots,x_k^\dagger as the linear coefficients of the ground-truth regression models.
•	For i = 1,\dots,n, we independently generate the data a_i, a class index j_i, and noise \epsilon_i, and compute b_i = a_i^\top x_{j_i}^\dagger + \epsilon_i.
In the experiment, the noise level is set to and the regularization factor is set to .
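A minimal sketch of this generation procedure (ours; standard Gaussians are assumed for the ground-truth coefficients and the covariates where the text leaves the distributions unspecified):

```python
import numpy as np

def generate_mixed_lr_data(n, d, k, sigma, rng):
    """Synthetic mixed-linear-regression data: each pair (a_i, b_i) comes from
    one of k ground-truth linear models, with Gaussian noise of level sigma."""
    x_true = rng.normal(size=(k, d))                    # ground-truth coefficient vectors
    A = rng.normal(size=(n, d))                         # covariates a_i
    labels = rng.integers(k, size=n)                    # class index of each pair
    b = np.einsum("ij,ij->i", A, x_true[labels]) + sigma * rng.normal(size=n)
    return A, b, labels, x_true
```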
Mixed non-linear regression
The ground truth ’s are sampled from a standard Gaussian. The dataset is generated in the same way as in the mixed linear regression experiment. We set the variance of the Gaussian noise on the dataset to and use a regularization factor .