Improved Approximations for Euclidean $k$-means and $k$-median, via Nested Quasi-Independent Sets
Abstract
Motivated by data analysis and machine learning applications, we consider the popular high-dimensional Euclidean $k$-median and $k$-means problems. We propose a new primal-dual algorithm, inspired by the classic algorithm of Jain and Vazirani [30] and the recent algorithm of Ahmadian, Norouzi-Fard, Svensson, and Ward [1]. Our algorithm achieves an approximation ratio of $2.406$ and $5.912$ for Euclidean $k$-median and $k$-means, respectively, improving upon the 2.633 approximation ratio of Ahmadian et al. [1] and the 6.1291 approximation ratio of Grandoni, Ostrovsky, Rabani, Schulman, and Venkat [25].
Our techniques involve a much stronger exploitation of the Euclidean metric than previous work on Euclidean clustering. In addition, we introduce a new method of removing excess centers using a variant of independent sets over graphs that we dub a “nested quasi-independent set”. In turn, this technique may be of interest for other optimization problems in Euclidean and metric spaces.
1 Introduction
The $k$-means and $k$-median problems are among the oldest and most fundamental clustering problems. Originally motivated by operations research and statistics problems when they first appeared in the late 50s [43, 37, 28], they are now at the heart of several unsupervised and semi-supervised machine learning models and data mining techniques, and are thus a main part of the toolbox of modern data analysis techniques in a variety of fields. In addition to their practical relevance, these two problems exhibit strong ties with some classic optimization problems, such as set cover, and understanding their complexity has thus been a long-standing problem which has inspired several breakthrough techniques.
Given two sets $\mathcal{D}$, $\mathcal{F}$ of points in a metric space and an integer $k$, the goal of the $k$-median problem is to find a set of $k$ points in $\mathcal{F}$, called centers, minimizing the sum of distances from each point in $\mathcal{D}$ to the closest point in the set. The goal of the $k$-means problem is to minimize the sum of distances squared. The complexity of the $k$-median and $k$-means problems heavily depends on the underlying metric space. In general metric spaces, namely when the distances only need to obey the triangle inequality, the $k$-median and $k$-means problems are known to admit a 2.675-approximation [9] and a 9-approximation [1], respectively, and cannot be approximated better than $1+2/e$ for $k$-median and $1+8/e$ for $k$-means assuming P $\neq$ NP. We know that the upper and lower bounds are tight when we allow a running time of $f(k) \cdot n^{O(1)}$ (i.e., in the fixed-parameter tractability setting) for arbitrary computable functions $f$ [17], suggesting that the lower bounds cannot be improved for the general case either. When we turn to slightly more structured metric spaces such as Euclidean metrics, the picture changes drastically. While the problem remains NP-hard when the dimension $d = 2$ (and $k$ is large) [41] or $k = 2$ (and $d$ is large) [23], both problems admit $(1+\varepsilon)$-approximation algorithms, with running times that have an exponential dependency on the dimension [33], or a doubly exponential dependency on $1/\varepsilon$ [15, 14] (the latter extends to doubling metrics), a prohibitive running time in practice.
Arguably, the most practically important setting is when the input points lie in Euclidean space of large dimension and the number of clusters is non-constant, namely when both $k$ and $d$ are part of the input (note here that $d$ can always be assumed to be of order $O(\log(k/\varepsilon)/\varepsilon^2)$ using dimensionality reduction techniques [38]; see also [8] for a slightly worse bound). Unfortunately, the complexity of the problem in this regime is far from being understood. Recent results have proven new inapproximability results: respectively 1.17 and 1.07 for the Euclidean $k$-means and $k$-median problems assuming P $\neq$ NP, and 1.73 and 1.27 assuming the Johnson Coverage hypothesis of [18, 19]. For the continuous case, the same series of papers shows a hardness of 1.06 and 1.015 for Euclidean $k$-means and $k$-median respectively assuming P $\neq$ NP, and 1.36 and 1.08 assuming the Johnson Coverage hypothesis (see also [21] for further related work on continuous $k$-median and $k$-means in other metric spaces). Interestingly, the above hardness results imply that there is no algorithmic benefit that could be gained from the $\ell_\infty$-metric: assuming the Johnson Coverage hypothesis, the hardness bounds for $k$-median and $k$-means are the same in the $\ell_\infty$-metric as in general metrics [19]. However, it seems plausible to leverage the structure of the $\ell_2$-metric to obtain approximation algorithms bypassing the lower bounds for the general metric or $\ell_\infty$ case (e.g., obtaining an approximation ratio better than $1+2/e$ for $k$-median).
In a breakthrough result, Ahmadian et al. [1] were the first to exploit the structure of high-dimensional Euclidean metrics to obtain better bounds than the current best-known bounds for the general metric case. Concretely, they showed how to obtain a 6.3574-approximation for $k$-means (improving upon the 9-approximation of Kanungo et al. [31]) and a 2.633-approximation for $k$-median (improving upon the 2.675-approximation for general metrics [9]). The bound for $k$-means was recently improved to a 6.1291-approximation (more precisely, the unique real root of an explicit polynomial equation) by [25], obtained by tweaking the analysis of Ahmadian et al. [1], and no progress has been made for Euclidean $k$-median since [1].
This very active line of research mixes both new hardness of approximation results and approximation algorithms and aims at answering a very fundamental question: How much can we leverage the Euclidean geometry to obtain better approximation algorithms? And conversely, what do we learn about Euclidean geometry when studying basic computational problems? Our results aim at making progress toward answering the above questions.
1.1 Our Results
Our main result consists of better approximation algorithms for both $k$-median and $k$-means, with a ratio of $2.406$ for $k$-median and $5.912$ for $k$-means, improving upon the 2.633-approximation of Ahmadian et al. [1] for $k$-median and the 6.1291-approximation of [25] for $k$-means.
Theorem 1.1.
For any $\varepsilon > 0$, there exists a polynomial-time algorithm that returns a solution to the Euclidean $k$-median problem whose cost is at most $2.406 + \varepsilon$ times the optimum.
For any $\varepsilon > 0$, there exists a polynomial-time algorithm that returns a solution to the Euclidean $k$-means problem whose cost is at most $5.912 + \varepsilon$ times the optimum.
Our approximation ratio for $k$-median breaks the natural barrier of $1+\sqrt{2} \approx 2.414$, and our approximation ratio for $k$-means is the first below 6. The approximation bound of $1+\sqrt{2}$ for Euclidean $k$-median is indeed a natural barrier for the state-of-the-art approach of Ahmadian et al. [1], which relies on the primal-dual approach of Jain and Vazirani [30]. At a high level, the approximation bound of 3 for general metrics for the algorithm of Jain and Vazirani can be interpreted as an approximation bound of 1+2, where 1 accounts for the optimum service cost and an additional cost of 2 times the optimum is added for input clients poorly served by the solution. Since general metrics are only required to satisfy the triangle inequality, the 2 naturally arises from bounding the distance from a client to its center in the solution output and an application of the triangle inequality to bound this distance. Therefore, one can hope to obtain a substantial gain when working in Euclidean spaces: the triangle inequality is rarely tight (unless points are aligned), and this leads to the hope of replacing the 2 by $\sqrt{2}$ in the above bound, making an approximation ratio of $1+\sqrt{2}$ a natural target for $k$-median. In fact, this high-level discussion can be made slightly more formal: we show that the analysis of the result of Ahmadian et al. [1] cannot be improved below $1+\sqrt{2}$ for $k$-median, exhibiting a limit of the state-of-the-art approaches.
In this paper, we take one step further: similar to the result of Li and Svensson [36] for the general metric case, who were the first to improve below the approximation ratio of 3, we show how to go below $1+\sqrt{2}$ for Euclidean $k$-median (and below the bound of 6.129 for $k$-means).
Furthermore, one of our main contributions is to obtain better Lagrangian Multiplier Preserving (LMP) approximations for the Lagrangian relaxations of both problems. To understand this, we need to give a little more background on previous work and how previous approximation algorithms were derived. A natural approach to the $k$-median and $k$-means problems is to (1) relax the constraint on the number of centers in the solution, (2) find an approximate solution for the relaxed problem, and (3) derive an approximate solution that satisfies the constraint on the number of centers. Roughly, an LMP approximate solution is a solution $S$ where we bound the ratio of the cost of $S$ to the optimum cost, but pay a penalty proportional to some parameter $\lambda$ for each center in $S$. Importantly, if $|S| = k$, an LMP $\rho$-approximation that outputs $S$ is a $\rho$-approximation for $k$-means (or $k$-median). We formally define an LMP approximation in Section 2. LMP solutions have played a central role in obtaining better approximation algorithms for $k$-median in general metrics and more recently in high-dimensional Euclidean metrics. Thus, obtaining better LMP solutions for the Lagrangian relaxations of $k$-median and $k$-means has been an important problem. A byproduct of our approach is a new 2.395-LMP approximation for Euclidean $k$-median and a $(3+2\sqrt{2}) \approx 5.83$-LMP approximation for Euclidean $k$-means.
Our techniques may be of use in other clustering and combinatorial optimization problems over Euclidean space as well, such as Facility Location. In addition, by exploiting the geometric structure similarly, these techniques likely extend to $\ell_p$-metric spaces (for other values of $p$).
1.2 Related Work
The first constant-factor approximation for the $k$-median problem in general metrics is due to Charikar et al. [12]. The $k$-median problem has then been a testbed for a variety of powerful approaches such as the primal-dual schema [30, 10], greedy algorithms (and dual fitting) [29], improved LP rounding [13], local search [4, 16], and LMP-based approximation [36]. The current best approximation guarantee is 2.675 [9] and the best hardness result is $1+2/e$ [26]. For $k$-means in general metrics, the current best approximation guarantee is 9 [1] and the current best hardness result is $1+8/e$ (which implicitly follows from [26], as noted in [1]).
We have already covered in the introduction the history of algorithms for high-dimensional Euclidean $k$-median and $k$-means with running time polynomial in both $k$ and the dimension and that leverage the properties of Euclidean metrics ([31, 1, 25]). In terms of lower bounds, Guruswami and Indyk [27] were the first to show that the high-dimensional $k$-median and $k$-means problems are APX-hard, and later Awasthi et al. [6] showed that the APX-hardness holds even if the centers can be placed arbitrarily in $\mathbb{R}^d$. The inapproximability bound was later slightly improved by Lee et al. [34] until the recent best known bounds of [18, 19]. From a more practical point of view, Arthur and Vassilvitskii showed that the widely-used heuristic of Lloyd [37] can lead to solutions with arbitrarily bad approximation guarantees [3], but can be improved by a simple seeding strategy, called $k$-means++, so as to guarantee that the output is within an $O(\log k)$ factor of the optimum [2].
For fixed $k$, there are several known approximation schemes, typically using small coresets [8, 24, 33]. There also exists a large body of bicriteria approximations (namely, outputting a solution with $(1+c) \cdot k$ centers for some constant $c > 0$): see, e.g., [7, 11, 20, 32, 39]. There has also been a long line of work on the metric facility location problem, culminating with the result of Li [35], who gave a 1.488-approximation algorithm, almost matching the lower bound of 1.463 of Guha and Khuller [26]. Note that no better bound is known for high-dimensional Euclidean facility location.
1.3 Roadmap
In Section 2, we describe some preliminary definitions. We also formally define the LMP approximation and introduce the LMP framework of Jain and Vazirani [30] and Ahmadian et al. [1]. In Section 3, we provide an overview of the new technical results we developed to obtain the improved bounds. In Section 4, we obtain an LMP approximation for the Euclidean $k$-means problem. In Section 5, we extend our LMP approximation to a $(5.912+\varepsilon)$-approximation for standard Euclidean $k$-means. Finally, in Section 6, we obtain an LMP approximation for Euclidean $k$-median that can be extended to a $(2.406+\varepsilon)$-approximation for standard Euclidean $k$-median.
2 Preliminaries
Our goal is to provide approximation algorithms for either the $k$-means or $k$-median problem in Euclidean space on a set $\mathcal{D}$ of clients of size $n$. For the entirety of this paper, we consider the discrete $k$-means and $k$-median problems, where rather than allowing the centers to be anywhere, we are given a fixed set $\mathcal{F}$ of facilities, of size polynomial in $n$, from which the centers must be chosen. It is well-known (e.g., [40]) that a polynomial-time algorithm providing a $\rho$-approximation for discrete $k$-means (resp., $k$-median) implies a polynomial-time $\rho(1+\varepsilon)$-approximation for standard $k$-means (resp., $k$-median) for an arbitrarily small constant $\varepsilon > 0$.
For two points $x, y$ in Euclidean space, we define $d(x, y)$ as the Euclidean distance (a.k.a. $\ell_2$-distance) between $x$ and $y$. In addition, to avoid redefining everything or restating identical results for both $k$-means and $k$-median, we define $c(x, y) := d(x, y)^2$ in the context of $k$-means and $c(x, y) := d(x, y)$ in the context of $k$-median. For a subset $S$ of Euclidean space, we define $d(x, S) := \min_{y \in S} d(x, y)$ and $c(x, S) := \min_{y \in S} c(x, y)$.
For the $k$-means (or $k$-median) problem, for a subset $S \subseteq \mathcal{F}$, we define $\mathrm{cost}(S) := \sum_{j \in \mathcal{D}} c(j, S)$. In addition, we define $\mathrm{OPT}_k$ to be the optimum $k$-means (or $k$-median) cost for the set $\mathcal{D}$ of clients and the set $\mathcal{F}$ of facilities, i.e., $\mathrm{OPT}_k := \min_{S \subseteq \mathcal{F}, |S| = k} \mathrm{cost}(S)$. Recall that a $\rho$-approximation algorithm is an algorithm that produces a subset $S \subseteq \mathcal{F}$ of size $k$ with $\mathrm{cost}(S) \le \rho \cdot \mathrm{OPT}_k$ in the worst case.
2.1 The Lagrangian LP Relaxation and LMP Solutions
We first look at the standard LP formulation for $k$-means/median. The variables of the LP include a variable $y_i$ for each facility $i \in \mathcal{F}$ and a variable $x_{ij}$ for each pair $(i, j)$ for $i \in \mathcal{F}$ and $j \in \mathcal{D}$. The standard LP relaxation is the following:

minimize $\displaystyle\sum_{i \in \mathcal{F},\, j \in \mathcal{D}} c(i, j) \cdot x_{ij}$   (1)

such that $\displaystyle\sum_{i \in \mathcal{F}} x_{ij} \ge 1$ for all $j \in \mathcal{D}$   (2)

$\displaystyle\sum_{i \in \mathcal{F}} y_i \le k$   (3)

$0 \le x_{ij} \le y_i$ for all $i \in \mathcal{F},\, j \in \mathcal{D}$   (4)
The intuition behind this linear program is that we can think of $x_{ij}$ as the indicator variable of client $j$ being assigned to facility $i$, and $y_i$ as the indicator variable of facility $i$ being opened. We need every client to be assigned to at least one facility, at most $k$ facilities to be opened, and $x_{ij}$ to be $1$ only if $y_i = 1$ (since clients can only be assigned to open facilities). We also ensure a nonnegativity constraint on $x$ and $y$ by ensuring that $0 \le x_{ij} \le y_i$. Finally, our goal is to minimize the sum of distances (for $k$-median) or the sum of squared distances (for $k$-means) from each client to its closest facility – or simply the facility it is assigned to; if exactly one of the values $x_{ij}$ is $1$ for a fixed client $j$ and the rest are $0$, then $\sum_{i} c(i, j) \cdot x_{ij}$ is precisely the distance (or squared distance) from $j$ to its corresponding facility. By relaxing the linear program to have real variables, we can only decrease the optimum, so if we let $\mathrm{LP}_{\mathrm{opt}}$ be the optimum value of the LP relaxation, then $\mathrm{LP}_{\mathrm{opt}} \le \mathrm{OPT}_k$.
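For concreteness, the relaxation (1)–(4) can be written down and solved directly with an off-the-shelf LP solver. The following minimal sketch (our illustration, not part of the paper; the helper name solve_kmedian_lp and the toy instance are ours) builds the $k$-median version with scipy.optimize.linprog:

```python
# A minimal sketch (ours, not the paper's) that builds and solves the
# relaxation (1)-(4) for the k-median objective with scipy's LP solver.
import numpy as np
from scipy.optimize import linprog

def solve_kmedian_lp(clients, facilities, k):
    m, n = len(facilities), len(clients)
    cost = np.array([[np.linalg.norm(f - cl) for cl in clients] for f in facilities])

    # Variable layout: x_ij at index i * n + j, then y_i at index m * n + i.
    num_vars = m * n + m
    obj = np.concatenate([cost.ravel(), np.zeros(m)])
    A_ub, b_ub = [], []

    for j in range(n):                        # (2): -sum_i x_ij <= -1
        row = np.zeros(num_vars)
        for i in range(m):
            row[i * n + j] = -1.0
        A_ub.append(row); b_ub.append(-1.0)

    row = np.zeros(num_vars); row[m * n:] = 1.0
    A_ub.append(row); b_ub.append(float(k))   # (3): sum_i y_i <= k

    for i in range(m):                        # (4): x_ij - y_i <= 0
        for j in range(n):
            row = np.zeros(num_vars)
            row[i * n + j], row[m * n + i] = 1.0, -1.0
            A_ub.append(row); b_ub.append(0.0)

    res = linprog(obj, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, None))
    return res.fun                            # LP_opt, a lower bound on OPT_k

pts = [np.array([x]) for x in (0.0, 1.0, 10.0, 11.0)]
print(solve_kmedian_lp(pts, pts, k=2))        # expected 2.0: one center per pair
```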
Jain and Vazirani [30] considered the Lagrangian relaxation of this linear program, obtained by relaxing the constraint (3) and adding a dependence on a Lagrangian parameter $\lambda \ge 0$. By doing this, the number of facilities no longer has to be at most $k$ in the relaxed linear program, but the objective function penalizes opening more than $k$ centers. Namely, the goal becomes to minimize

$$\sum_{i \in \mathcal{F},\, j \in \mathcal{D}} c(i, j) \cdot x_{ij} \;+\; \lambda \cdot \Big(\sum_{i \in \mathcal{F}} y_i - k\Big) \qquad (5)$$
subject to Constraints (2) and (4). Indeed, for $\lambda \ge 0$, the objective can only decrease from (1) to (5) for any feasible solution to the original LP. Therefore, this new linear program, which we will call $\mathrm{LP}(\lambda)$, has optimum $\mathrm{LP}(\lambda)_{\mathrm{opt}} \le \mathrm{LP}_{\mathrm{opt}} \le \mathrm{OPT}_k$. Now, it is known that the dual linear program to this Lagrangian relaxation of the original linear program can be written as the following, which has variables $\alpha = \{\alpha_j\}_{j \in \mathcal{D}}$:
maximize $\displaystyle\sum_{j \in \mathcal{D}} \alpha_j - \lambda \cdot k$   (6)

such that $\displaystyle\sum_{j \in \mathcal{D}} \max\big(\alpha_j - c(i, j),\, 0\big) \le \lambda$ for all $i \in \mathcal{F}$   (7)

$\alpha_j \ge 0$ for all $j \in \mathcal{D}$   (8)
We call this linear program $\mathrm{DUAL}(\lambda)$. Because the optimum of $\mathrm{DUAL}(\lambda)$ equals the optimum of the primal $\mathrm{LP}(\lambda)$ by strong duality, this means that for any $\alpha$ satisfying Conditions (7) and (8), we have that $\sum_{j \in \mathcal{D}} \alpha_j - \lambda \cdot k \le \mathrm{LP}(\lambda)_{\mathrm{opt}} \le \mathrm{OPT}_k$.
For a fixed $\lambda$, we say that $\alpha$ is feasible if it satisfies both (7) and (8). Thus, to provide a $\rho$-approximation to $k$-means (or $k$-median), it suffices to provide both a feasible $\alpha$ and a subset $S \subseteq \mathcal{F}$ of size $k$ such that $\mathrm{cost}(S)$ is at most $\rho \cdot \big(\sum_{j \in \mathcal{D}} \alpha_j - \lambda \cdot k\big)$.
In both the work of Jain and Vazirani [30] and the work of Ahmadian et al. [1], the starting point is a weaker type of algorithm, called a Lagrangian Multiplier Preserving (LMP) approximation algorithm. To explain this notion, let $\mathrm{LP}'(\lambda)_{\mathrm{opt}}$ represent the optimum (minimum) for the modified linear program $\mathrm{LP}'(\lambda)$, which is the same as $\mathrm{LP}(\lambda)$ except without the subtraction of $\lambda \cdot k$ in the objective function (5). (So, $\mathrm{LP}'(\lambda)_{\mathrm{opt}}$ has no dependence on $k$.) Note that this is also the optimum (maximum) for $\mathrm{DUAL}'(\lambda)$, which is the same as $\mathrm{DUAL}(\lambda)$ except without the subtraction of $\lambda \cdot k$ in the objective function (6). Then, for some fixed $\lambda$, we say that a $\rho$-approximation algorithm is LMP if it returns a solution $S \subseteq \mathcal{F}$ satisfying $\mathrm{cost}(S) \le \rho \cdot \big(\mathrm{LP}'(\lambda)_{\mathrm{opt}} - \lambda \cdot |S|\big)$.
Indeed, if we could find a choice of $\lambda$ and an LMP $\rho$-approximate solution $S$ with size $|S| = k$, we would have found a $\rho$-approximation for $k$-means (or $k$-median) clustering.
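The chain of inequalities justifying this is short, and uses only the definitions above:

```latex
% Why an LMP rho-approximation with |S| = k is a rho-approximation:
% the LP'(lambda) and LP(lambda) optima differ by exactly lambda * k.
\mathrm{cost}(S)
  \;\le\; \rho \cdot \bigl( \mathrm{LP}'(\lambda)_{\mathrm{opt}} - \lambda\,|S| \bigr)
  \;=\;   \rho \cdot \bigl( \mathrm{LP}'(\lambda)_{\mathrm{opt}} - \lambda\,k \bigr)
  \;=\;   \rho \cdot \mathrm{LP}(\lambda)_{\mathrm{opt}}
  \;\le\; \rho \cdot \mathrm{OPT}_k .
```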
2.2 Witnesses and Conflict Graphs
Jain and Vazirani [30] proposed a simple primal-dual approach to create a feasible solution $\alpha$ of $\mathrm{DUAL}(\lambda)$ with certain additional properties that are useful for deriving an efficient solution to the original $k$-median (or $k$-means) problem. We describe it briefly as follows, based on the exposition of Ahmadian et al. [1, Subsection 3.1].
Start with $\alpha = 0$, i.e., $\alpha_j = 0$ for all $j \in \mathcal{D}$. We increase all $\alpha_j$'s continuously at a uniform rate, but stop growing each $\alpha_j$ once one of the following two events occurs:
1. For some facility $i$, the dual constraint (7) becomes tight (i.e., reaches equality). Once this happens, we declare that facility $i$ is tight, and we stop growing $\alpha_j$ for all $j$ such that $\alpha_j \ge c(j, i)$. In addition, we will say that $i$ is the witness of each such $j$.
2. For some already tight facility $i$, we grow $\alpha_j$ until $\alpha_j = c(j, i)$. In this case, we also say that $i$ is the witness of $j$.
We note that this process must eventually terminate for all $j$ (e.g., once $\alpha_j$ reaches $\lambda + \min_{i \in \mathcal{F}} c(j, i)$, the facility closest to $j$ must be tight). This completes our creation of the dual solution $\alpha$ (it is simple to see that $\alpha$ is feasible).
For any client $j$, we define $N(j) := \{i \in \mathcal{F} : \alpha_j > c(j, i)\}$, and likewise, for any facility $i$, we define $N(i) := \{j \in \mathcal{D} : \alpha_j > c(j, i)\}$. For any tight facility $i$, we will define $t_i := \max_{j \in N(i)} \alpha_j$, where by default $t_i = 0$ if $N(i) = \emptyset$. For each client $j$, its witness $i$ will have the useful properties that $t_i \le \alpha_j$ and $c(j, i) \le \alpha_j$.
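Between events, the left-hand side of each constraint (7) grows linearly in time, so the growth process can be simulated by a discrete event loop. The sketch below is our own illustration (the name grow_duals and the event-loop details are ours); c is an m × n cost matrix and lam plays the role of $\lambda$:

```python
# A self-contained simulation of the dual-growth phase described above.
import numpy as np

def grow_duals(c, lam):
    m, n = c.shape
    alpha = np.zeros(n)
    active = np.ones(n, dtype=bool)    # clients whose alpha_j is still growing
    tight = np.zeros(m, dtype=bool)
    t = np.zeros(m)                    # time at which each facility became tight
    witness = [-1] * n
    time = 0.0

    while active.any():
        deltas = []
        # Event 1 candidates: a non-tight facility's constraint (7) fills up.
        for i in range(m):
            if not tight[i]:
                paid = np.maximum(alpha - c[i], 0.0).sum()
                rate = int((active & (alpha >= c[i])).sum())
                if rate > 0:
                    deltas.append((lam - paid) / rate)
        # Rate changes: a growing alpha_j crosses some cost threshold c[i, j].
        gaps = c[:, active] - alpha[active]
        deltas.extend(gaps[gaps > 0].tolist())

        step = min(deltas)
        time += step
        alpha[active] += step

        for i in range(m):             # event 1: new tight facilities
            if not tight[i] and np.maximum(alpha - c[i], 0.0).sum() >= lam - 1e-9:
                tight[i], t[i] = True, time
                for j in np.nonzero(active & (alpha >= c[i]))[0]:
                    active[j], witness[j] = False, i
        for j in range(n):             # event 2: client meets a tight facility
            if active[j]:
                for i in range(m):
                    if tight[i] and alpha[j] >= c[i, j] - 1e-9:
                        active[j], witness[j] = False, i
                        break
    return alpha, tight, t, witness
```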
We have already created our dual solution $\alpha$; to create our set of facilities, we will choose a subset of the tight facilities. First, we define the conflict graph on the set of tight facilities. Indeed, Jain and Vazirani [30], Ahmadian et al. [1], and we all have slightly different definitions, so we contrast the three.
• [30] Here, we say that $(i, i')$ forms an edge in the conflict graph if there exists a client $j$ such that $j \in N(i) \cap N(i')$ (or equivalently, $\alpha_j > c(j, i)$ and $\alpha_j > c(j, i')$).
• [1] Here, we say that $(i, i')$ forms an edge in the conflict graph $H(\delta)$ (where $\delta$ is some parameter) if $c(i, i') \le \delta \cdot \min(t_i, t_{i'})$ and there exists a client $j$ such that $j \in N(i) \cap N(i')$.
• In our definition, we completely drop the client condition from [30], and just say that $(i, i')$ forms an edge in the conflict graph $H(\delta)$ if $c(i, i') \le \delta \cdot \min(t_i, t_{i'})$.
It turns out that in the algorithm of Ahmadian et al. [1], the approximation factor is not affected by whether one uses their definition or our definition. But in our case, it turns out that dropping the condition from [30] in fact allows us to obtain a better approximation.
To provide an LMP approximation, both Jain and Vazirani [30] and Ahmadian et al. [1] constructed a maximal independent set $I$ of the conflict graph (or of $H(\delta)$ for an appropriate choice of $\delta$) and used $I$ as the set of centers. For Jain and Vazirani's definition, the independent set obtains an LMP 9-approximation for metric $k$-means. For Ahmadian et al.'s definition, the independent set obtains an LMP 6.1291-approximation for Euclidean $k$-means if $\delta$ is chosen properly. (We note that Ahmadian et al. only proved a factor of 6.3574, though their argument can be improved to show a 6.1291-approximation factor, as proven by Grandoni et al. [25].) While we will not explicitly prove it, our definition of the conflict graph also obtains the same bound with the same choice of $\delta$. For the Euclidean $k$-median problem, using either Ahmadian et al.'s or our definition, one can obtain an LMP $(1+\sqrt{2})$-approximation: Ahmadian et al. [1] only proved the weaker approximation factor of 2.633, but we prove the improved approximation in Subsection 6.1. We then show how to obtain a better LMP solution and then a better approximation bound.
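For concreteness, here is a small sketch (ours, not the paper's code) of our edge rule together with the greedy construction of a maximal independent set; passing the previously computed set as seed is exactly what lets one extend $I_1$ to a maximal independent set $I_2$ of $H(\delta_2)$, as used in Section 3:

```python
# Conflict-graph rule and greedy maximal independent set (our illustration).
# For k-means, cost(i, j) should return the squared facility-facility distance.
from itertools import combinations

def conflict_edges(tight_facilities, cost, t, delta):
    """Our edge rule: (i, j) conflict iff cost(i, j) <= delta * min(t_i, t_j)."""
    return {frozenset((i, j))
            for i, j in combinations(tight_facilities, 2)
            if cost(i, j) <= delta * min(t[i], t[j])}

def maximal_independent_set(vertices, edges, seed=()):
    """Greedy: keep seed (assumed independent), then add whatever still fits."""
    chosen = list(seed)
    for v in vertices:
        if v not in chosen and all(frozenset((v, u)) not in edges for u in chosen):
            chosen.append(v)
    return chosen

# Nested use, as in Section 3: since delta_2 <= delta_1 only removes edges,
# I1 stays independent in H(delta_2) and can be extended to I2:
#   I1 = maximal_independent_set(V, conflict_edges(V, cost, t, delta_1))
#   I2 = maximal_independent_set(V, conflict_edges(V, cost, t, delta_2), seed=I1)
```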
3 Technical Overview
The algorithms of both Jain and Vazirani [30] and Ahmadian et al. [1] begin by constructing an LMP approximation. Their approximation follows two phases: a growing phase and a pruning phase. In the growing phase, as described in Subsection 2.2, they grow a dual solution starting from $\alpha = 0$, until they obtain a suitable feasible solution $\alpha$ for $\mathrm{DUAL}(\lambda)$. In addition, they obtain a list of tight facilities, which we think of as our candidate centers. The pruning phase removes unnecessary facilities: as described in Subsection 2.2, we create a conflict graph over the tight facilities, and only choose a maximal independent set of it. Hence, we are pruning out tight facilities to make sure we do not have too many nearby facilities. This way we ensure that the total number of centers is not unnecessarily large. Our main contributions are to improve the pruning phase with a new algorithm and several new geometric insights, and to show how our LMP approximation can be extended to improved algorithms for standard Euclidean $k$-means (and $k$-median) clustering.
Improved LMP Approximation:
To simplify the exposition, we focus on Euclidean $k$-means. To analyze the approximation, Ahmadian et al. [1] compare the cost of each client in the final solution to its contribution to the dual objective function. The cost of a client $j$ is simply $c(j, S)$, where $S$ is our set of centers, and $j$'s contribution to the dual is $\alpha_j - \sum_{i \in S \cap N(j)} (\alpha_j - c(j, i))$, where we recall the definition of $N(j)$ from Subsection 2.2. One can show that the sum of the individual dual contributions equals the dual objective (6) with $k$ replaced by $|S|$. By casework on the size $|S \cap N(j)|$, [25] (by modifying the work of [1]) shows that each client's cost is at most $\rho$ times its dual contribution, where $\rho \approx 6.1291$ if $\delta$ is chosen appropriately (in general, we think of $\delta$ as slightly greater than 2). The only bottlenecks (i.e., where the cost-dual ratio could equal $\rho$) are when $|S \cap N(j)| \in \{0, 1\}$. Our LMP approximation improves the pruning phase by reducing the cost-dual ratio in these cases.
Due to the disjointed nature of the bottleneck cases, our first observation was that averaging between two or more independent sets may be beneficial. This way, if for some client $j$, the first independent set had a large cost-dual ratio and the second set had a small one, perhaps by taking a weighted average of the two sets we can obtain a set $S$ where the ratio for $j$ is small with reasonable probability. Hence, the expected cost-dual ratio of $j$ will be below $\rho$. The second useful observation comes from the fact that $t_{w(j)} \le \alpha_j$ and $c(j, w(j)) \le \alpha_j$, if $w(j)$ is the witness of $j$. In the $|S \cap N(j)| = 0$ case, [1] applies this to show that $d(j, w(j)) \le \sqrt{\alpha_j}$ and $d(w(j), i^*) \le \sqrt{\delta \cdot \alpha_j}$ for some $i^* \in S$, which follows by the definition of the conflict graph. Hence, the bottleneck occurs when $w(j)$ has distance exactly $\sqrt{\alpha_j}$ from $j$, and the nearest point $i^* \in S$ has distance exactly $\sqrt{\delta \cdot \alpha_j}$ from $w(j)$, in the direction opposite $j$. This causes $c(j, S)$ to be $(1+\sqrt{\delta})^2 \cdot \alpha_j$ in the worst case, whereas $j$'s contribution to the dual is merely $\alpha_j$. To reduce this ratio, we either need to make sure that the distance from $w(j)$ to either $j$ or $i^*$ is smaller, or that $j, w(j), i^*$ do not lie on a line in that order.
A reasonable first attempt is to select two choices $\delta_1 > \delta_2$, and consider the nested conflict graphs $H(\delta_2) \subseteq H(\delta_1)$. We can then create nested independent sets $I_1 \subseteq I_2$ by first creating $I_1$ as a maximal independent set of $H(\delta_1)$, then extending it to a maximal independent set $I_2$ for $H(\delta_2)$. Our final set $S$ will be an average of $I_1$ and $I_2$: we include all of $I_1$ and each point in $I_2 \setminus I_1$ with some probability. The motivation behind this attempt is that if both $N(j) \cap I_1$ and $N(j) \cap I_2$ are empty, the witness $w(j)$ of $j$ should be adjacent in $H(\delta_1)$ to some point $i_1 \in I_1$, and adjacent in $H(\delta_2)$ to some point $i_2 \in I_2$. Hence, either $i_1 = i_2$, in which case the distance from $w(j)$ to $i_2$ is now only $\sqrt{\delta_2 \cdot \alpha_j}$ instead of $\sqrt{\delta_1 \cdot \alpha_j}$ (see Figure 1), or $i_1 \ne i_2$, in which case $i_1, i_2$ must be far apart because $i_1, i_2$ are both in the independent set $I_2$ (see Figure 1). In the latter case, we cannot have the bottleneck case for both $i_1$ and $i_2$, as that would imply $j, w(j), i_1$ and $j, w(j), i_2$ are collinear in that order with $d(w(j), i_1) = \sqrt{\delta_1 \cdot \alpha_j}$ and $d(w(j), i_2) = \sqrt{\delta_2 \cdot \alpha_j}$, so $i_1, i_2$ are too close. Hence, it appears that we have improved one of the main bottlenecks.
Unfortunately, we run into a new issue, if $i_2 \in N(j)$ and no other points in $N(j)$ were in $I_2$. While this case may look good because $c(j, i_2) < \alpha_j$, which is not a bottleneck, it is only not a bottleneck because the contribution of $j$ to the clustering cost and the dual both equal $c(j, i_2)$ in this case. So, if we condition on $i_2 \in S$, the cost and dual are not affected by $j$, but if we condition on $i_2 \notin S$, the cost-dual ratio of $j$ could be $(1+\sqrt{\delta_1})^2$ – hence, we have again made no improvement. While one could attempt to fix this by creating a longer series of nested independent sets, this approach also fails to work for the same reasons. Hence, we have a new main bottleneck case, where $c(j, i_2)$ is close to $\alpha_j$, and $j, w(j), i_1$ are collinear in that order with $d(j, w(j)) = \sqrt{\alpha_j}$ and $d(w(j), i_1) = \sqrt{\delta_1 \cdot \alpha_j}$ (see Figure 1).
We now explain the intuition for fixing this. In the main bottleneck case, if we could add the witness $w(j)$ to $S$, this would reduce $c(j, S)$ significantly if $i_2 \notin S$, yet it does not affect $j$'s contribution to the dual. Unfortunately, adding $w(j)$ to $S$ reduces the dual nonetheless, due to other clients. Instead, we will consider a subset of tight facilities that are close, but not too close, to exactly one tight facility in $I_2 \setminus I_1$ but not close to any other facilities in $I_2$. In the main bottleneck cases, the witnesses precisely satisfy the condition, as may some additional points. We again prune these points by creating a conflict graph just on these vertices, and pick another maximal independent set $I_3$. Finally, with some probability we will replace each point $i_2 \in I_2 \setminus I_1$ with the points in $I_3$ that are close to $i_2$. In our main bottleneck case, we will either pick $w(j)$ itself, or pick another point $i_3 \in I_3$ that is within a bounded distance of $w(j)$; but now $i_3$ must be far from all points in $I_2$, and therefore cannot realize the worst-case collinear configuration with respect to $j$ (see Figure 1). In addition, $i_3$ might be close to $j$, but it is not allowed to be too close to $i_2$, so if we replace $i_2$ with $i_3$, we do not run into the issue of $j$'s contribution being identical for both the clustering cost and the dual solution.
Given $I_1, I_2, I_3$, our overall procedure for generating $S$ is to include all of $I_1$, and include points in $I_2 \setminus I_1$ and $I_3$ each with some probability. In addition, each point $i_3 \in I_3$ is close to a unique point $i_2 \in I_2 \setminus I_1$, and we anti-correlate them being in $S$. We call the triple $(I_1, I_2, I_3)$ a nested quasi-independent set, since $I_1$ and $I_2$ are independent sets and $I_3$ has similar properties, and these sets are nested.
We remark that the anti-correlation between selecting $i_2 \in I_2 \setminus I_1$ and the points in $I_3$ that are close to $i_2$ is reminiscent of a step of the rounding of bipoint solutions of Li and Svensson [36] (there, the authors combine two solutions by building a collection of stars where the center of each star belongs to a solution, while the leaves belong to another, and choose to open either the center, or the leaves, at random). In this context, we will also allow our algorithm to open slightly more than $k$ centers, namely $k + c$ centers (instead of $k$) for some absolute constant $c$. Similar to Li and Svensson [36] and Byrka et al. [9], we show (Lemma 5.2) that this is without loss of generality (such an algorithm can be used to obtain an approximation algorithm opening at most $k$ centers).
The analysis of the approximation bound is very casework heavy, depending on the values of $|N(j) \cap I_1|$, $|N(j) \cap I_2|$, and $|N(j) \cap I_3|$ for each client $j$, and requiring many geometric insights that heavily exploit the structure of the Euclidean metric.
Finally, we remark that this method of generating a set of centers can also be used to provide an LMP approximation below $1+\sqrt{2}$ for Euclidean $k$-median. However, due to the differing nature of what the bottleneck cases are, the choices of $\delta_1, \delta_2, \delta_3$, and the construction of the final set will differ. Both to obtain and to break the $(1+\sqrt{2})$-approximation for Euclidean $k$-median, one must understand how the sum of distances from a facility to its close facilities relates to the pairwise distances between those facilities. In the $k$-means case, the squared distances make this straightforward, but the $k$-median case requires more geometric insights (see Lemma 6.1) to improve the 2.633-approximation of Ahmadian et al. [1] to a $(1+\sqrt{2})$-approximation, even with the same algorithm.
Polynomial-Time Algorithm:
While we can successfully obtain an LMP $(3+2\sqrt{2})$-approximation for Euclidean $k$-means, this does not immediately imply a $(3+2\sqrt{2})$-approximation for the overall Euclidean $k$-means problem, since our LMP approximation may not produce the right number of cluster centers. To address this issue, Ahmadian et al. [1] developed a method of slowly raising the parameter $\lambda$ and generating a series of dual solutions, where each pair of consecutive solutions is sufficiently similar. The procedure of generating dual solutions in [1] is very complicated, but will not change significantly from their result to ours. Given these dual solutions, [1] shows how to interpolate between consecutive dual solutions, creating a series of conflict graphs where each subsequent graph removes at most one facility at a time (but occasionally adds many facilities). This property of removing at most one facility ensures that the maximal independent set decreases by at most 1 at each step (but could increase by a large number if the removed vertex allows for more points to be added). Using an intermediate-value theorem, they could ensure that at some point, there is an independent set of the conflict graph with size exactly $k$. Hence, they apply the LMP technique to obtain the same approximation ratio.
In our setting, we are not so lucky, because we are dealing with nested independent sets (and even more complicated versions of them). Even if the size of the first part $I_1$ never decreases by more than $1$ at a time, $I_1$ could increase, which could potentially make the sizes of $I_2$ or $I_3$ decrease rapidly, even if we only remove one facility from the conflict graph. To deal with this, we instead consider the first time that the expected size of $S$ drops below $k$ (where we recall that all points in $I_1$ are in $S$, but points in $I_2 \setminus I_1$ and $I_3$ are inserted into our final set with some probability). Let $S^{(1)}$ represent the set of chosen facilities right before the expected size of $S$ drops below $k$ for the first time, and $S^{(2)}$ represent the set of chosen facilities right after.
To obtain size exactly $k$, we show that one can always do one of the following: either (i) modify the probabilities of selecting points in $I_2 \setminus I_1$ and $I_3$ so that the expected size is exactly $k$, or (ii) use submodularity properties of the $k$-means objective function to interpolate between the set generated by $S^{(1)}$ and the set generated by $S^{(2)}$. While $S^{(1)}$ has expected size at least $k$ and $S^{(2)}$ has expected size below $k$, we do not necessarily have that $S^{(2)} \subseteq S^{(1)}$, and in fact we will interpolate between $S^{(2)}$ and $S^{(1)} \cup S^{(2)}$ instead, by adding random points from $S^{(1)}$ to $S^{(2)}$. These procedures are relatively computational and require a good understanding of how the dual objective behaves as we modify the probabilities of selecting elements. In addition, because we modify the probabilities to make the size exactly $k$, and because we interpolate between $S^{(2)}$ and $S^{(1)} \cup S^{(2)}$ instead of $S^{(1)}$ and $S^{(2)}$ to obtain our final set of centers, we lose a slight approximation factor from the LMP approximation, but we still significantly improve over the old bound. In addition, this procedure works well for both the Euclidean $k$-means and $k$-median problems.
One minor problem that we will run into is that a point $i \in I_2 \setminus I_1$ may have many points in $I_3$ that are all “close” to it. As a result, even if the expected size of $S$ is exactly $k$, the variance may be large, because we are either selecting $i$ or all of the points in $I_3$ that are close to $i$. In this case, we can use the submodularity properties of the $k$-means objective and coupling methods from probability theory to show that if $i$ has too many close points in $I_3$, we could instead include $i$ and include an average number of these close points from $I_3$, without increasing the expected cost.
4 LMP Approximation for Euclidean $k$-means
In this section, we provide an LMP approximation algorithm (in expectation) for the Euclidean $k$-means problem. Our algorithm will be parameterized by four parameters $\delta_1, \delta_2, \delta_3$, and $p$. We fix $\delta_2, \delta_3$, and $p$ and allow $\delta_1$ to vary, and obtain an approximation constant $\rho(\delta_1)$ that is a function of $\delta_1$: for appropriate choices of the parameters, $\rho(\delta_1)$ will equal $3 + 2\sqrt{2} \approx 5.83$.
In Subsection 4.1, we describe the LMP algorithm, which is based on the LMP approximation algorithms of Jain and Vazirani [30] and Ahmadian et al. [1], but uses our technique of generating what we call a nested quasi-independent set. In Subsection 4.2, we analyze the approximation ratio, which spans a large amount of casework.
4.1 The algorithm and setup
Recall the conflict graph $H(\delta)$, where we define two tight facilities $i, i'$ to be connected if $c(i, i') \le \delta \cdot \min(t_i, t_{i'})$. We set parameters $\delta_1, \delta_2, \delta_3$ and $p$, and define $\mathcal{V}$ to be the set of all tight facilities. Given the set of tight facilities and the conflict graphs $H(\delta)$ over $\mathcal{V}$ for all $\delta$, our algorithm works by applying the procedure described in Algorithm 1 (on the next page) to $\mathcal{V}$, with parameters $\delta_1, \delta_2, \delta_3$, and $p$.
LMP($\mathcal{V}$, $\{H(\delta)\}$, $\delta_1, \delta_2, \delta_3, p$):

• Construct a maximal independent set $I_1$ of $H(\delta_1)$, extend it to a maximal independent set $I_2 \supseteq I_1$ of $H(\delta_2)$, and let $I_3$ be a maximal independent set of the conflict graph restricted to the facilities that are close (but not too close) to a unique facility of $I_2 \setminus I_1$, as described in Section 3; each point of $I_3$ is assigned to this unique close facility.

• Include every point $i \in I_1$ in $S$.

• For each point $i \in I_2 \setminus I_1$, flip a fair coin. If the coin lands heads, include $i$ in $S$ with probability $2p$. Otherwise, include each point of $I_3$ assigned to $i$ independently with probability $2p$.
We will call the triple $(I_1, I_2, I_3)$ generated by Algorithm 1 a nested quasi-independent set. Although $I_2$ and $I_3$ are disjoint, we call it a nested quasi-independent set since $I_1 \subseteq I_2$ are nested, $I_1$ is a maximal independent set for $H(\delta_1)$, and $I_2$ is a maximal independent set for $H(\delta_2)$. While the triple as a whole is not an independent set of a single conflict graph, it shares similar properties. As described in the technical overview (and in Algorithm 1), the LMP approximation algorithm uses $(I_1, I_2, I_3)$ to create our output set of centers $S$: $S$ contains all of $I_1$ and each point in $(I_2 \setminus I_1) \cup I_3$ with probability $p$, but the choices of which points in $(I_2 \setminus I_1) \cup I_3$ are in $S$ are not fully independent.
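The sampling step can be summarized in a few lines. In this sketch (our illustration; the map q assigning each point of $I_3$ to its unique close point of $I_2 \setminus I_1$ is as described above), every sampled point lands in $S$ with marginal probability $(1/2) \cdot 2p = p$, and the fair coin anti-correlates each $i \in I_2 \setminus I_1$ with its assigned $I_3$ points:

```python
# Sketch of the randomized pruning step of Algorithm 1 (our illustration).
import random

def sample_centers(I1, I2, I3, q, p):
    assert 0 <= p <= 0.5
    S = set(I1)                          # all of I1 is kept deterministically
    assigned = {}
    for v in I3:
        assigned.setdefault(q[v], []).append(v)
    for i in set(I2) - set(I1):
        if random.random() < 0.5:        # heads: consider i itself
            if random.random() < 2 * p:
                S.add(i)
        else:                            # tails: consider i's assigned I3 points
            S.update(v for v in assigned.get(i, []) if random.random() < 2 * p)
    return S
```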
For the dual solution $\alpha$ and the values $t_i$ for tight facilities $i$, generated as in Subsection 2.2, we note the following simple yet crucial facts.
Proposition 4.1.
[1] The following hold.

1. For any client $j$ and its witness $i$, $i$ is tight and $\alpha_j \ge t_i$.

2. For any client $j$ and its witness $i$, $\alpha_j \ge c(j, i)$.

3. For any tight facility $i$ and any client $j \in N(i)$, $\alpha_j \le t_i$.
These will essentially be the only facts relating witnesses and clients that we will need to use.
4.2 Main lemma
We consider a more general setup, as it will be required when converting the LMP approximation to a full polynomial-time algorithm. Let $\mathcal{V}$ be a subset of facilities (for instance, this may represent the set of tight facilities) and let $\mathcal{D}$ be the full set of clients. For each client $j \in \mathcal{D}$, let $\alpha_j$ be some real number, and for each $i \in \mathcal{V}$, let $t_i$ be some real number. For each client $j$, we associate with it a set $S_j \subseteq \mathcal{V}$. (For instance, this could be the set $N(j) \cap \mathcal{V}$.) In addition, suppose that for each client $j$ there exists a “witness” facility $w(j) \in \mathcal{V}$. Finally, suppose that we have the following assumptions. (These assumptions will hold by Proposition 4.1 when $\alpha$ is generated by the procedure in Subsection 2.2, $\mathcal{V}$ is the set of tight facilities, and $S_j = N(j)$.)
1. For any client $j$, the witness $w(j)$ satisfies $\alpha_j \ge t_{w(j)}$ and $\alpha_j \ge c(j, w(j))$.

2. For any client $j$ and any facility $i \in S_j$, $\alpha_j \le t_i$.
For the Euclidean k-means problem with the above assumptions, we will show the following:
Lemma 4.2.
Consider the set of conflict graphs $H(\delta)$ created on the vertices $\mathcal{V}$, where $(i, i')$ is an edge in $H(\delta)$ if $c(i, i') \le \delta \cdot \min(t_i, t_{i'})$. Fix $\delta_2, \delta_3, p$ and let $\delta_1$ be variable. Now, let $S$ be the randomized set created by applying Algorithm 1 on $\mathcal{V}$. Then, for any client $j$,

$$\mathbb{E}\big[c(j, S)\big] \;\le\; \rho(\delta_1) \cdot \mathbb{E}\Big[\alpha_j - \sum_{i \in S \cap S_j} \big(\alpha_j - c(j, i)\big)\Big], \qquad (9)$$

where $\rho(\delta_1)$ is some constant that only depends on $\delta_1$ (since $\delta_2, \delta_3, p$ are fixed).
Remark.
While we have not defined $\rho(\delta_1)$ explicitly, we will implicitly define it through our cases. We provide detailed visualizations of the bounds we obtain for each of our cases in Desmos (see Appendix A for the links). Importantly, we show that we can set the parameters such that $\rho(\delta_1) \le 3 + 2\sqrt{2}$.
To see why Lemma 4.2 implies an LMP approximation (in expectation), fix some $\lambda \ge 0$. Then, we perform the procedure in Subsection 2.2 and let $\mathcal{V}$ be the set of tight facilities, $S_j = N(j)$, and $w(j)$ be the witness of each client $j$. Then, by adding (9) over all clients $j$, we have that

$$\mathbb{E}\big[\mathrm{cost}(S)\big] = \sum_{j \in \mathcal{D}} \mathbb{E}\big[c(j, S)\big] \;\le\; \rho \cdot \mathbb{E}\Big[\sum_{j \in \mathcal{D}} \alpha_j - \sum_{i \in S} \sum_{j \in N(i)} \big(\alpha_j - c(j, i)\big)\Big] \;=\; \rho \cdot \mathbb{E}\Big[\sum_{j \in \mathcal{D}} \alpha_j - \lambda \cdot |S|\Big].$$

Above, the final equality follows because $\sum_{j \in N(i)} (\alpha_j - c(j, i)) = \lambda$ for every $i \in S$, since $i$ is assumed to be tight. Thus, we obtain an LMP approximation with approximation factor $\rho$ for any choice of $\lambda \ge 0$. Given this, we now prove Lemma 4.2.
Proof of Lemma 4.2.
We fix a client $j$ and do casework based on the sizes of $S_j \cap I_1$, $S_j \cap I_2$, and $S_j \cap I_3$. We show that for each case, the ratio of $\mathbb{E}[c(j, S)]$ to $\mathbb{E}\big[\alpha_j - \sum_{i \in S \cap S_j}(\alpha_j - c(j, i))\big]$ is at most $\rho$. We will call the former quantity the numerator and the latter the denominator, and attempt to show this fraction is at most $\rho$. By scaling all distances (and all $\alpha_j, t_i$ values accordingly), we will assume WLOG that $\alpha_j = 1$.
First, we have the following basic proposition.
Proposition 4.3.
$d(j, S) \le 1 + \sqrt{\delta_1}$.
Proof.
Note that $I_1 \subseteq S$. But $t_{w(j)} \le \alpha_j = 1$, and by properties of the independent set $I_1$, there exists $i \in I_1$ such that $c(w(j), i) \le \delta_1 \cdot t_{w(j)} \le \delta_1$, since $I_1$ is a maximal independent set of $H(\delta_1)$ (if $w(j) \in I_1$, we may take $i = w(j)$). So, $d(j, S) \le d(j, w(j)) + d(w(j), i) \le 1 + \sqrt{\delta_1}$, as desired. ∎
In addition, we have the following simple proposition about Euclidean space, the proof of which is deferred to Appendix B.
Proposition 4.4.
Suppose that we have points in Euclidean space and parameters such that , , , and . Moreover, assume that and that . Then,
Finally, we will also make frequent use of the well-known fact that for any set of points $X$ and any point $p$ in Euclidean space, $\sum_{x \in X} d(x, p)^2 = \sum_{x \in X} d(x, \mu)^2 + |X| \cdot d(\mu, p)^2$, where $\mu$ is the centroid of $X$. As a direct result of this, if $d(x, x')^2 \ge s$ for all distinct $x, x' \in X$ but $d(x, p)^2 \le u$ for all $x \in X$, then $|X| \le s/(s - 2u)$ whenever $s > 2u$.
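Both facts follow by expanding squared norms around the centroid $\mu$ (a short derivation, using nothing beyond the identity above):

```latex
% Expanding squared norms around the centroid mu of X (cross term vanishes):
\sum_{x \in X} \|x - p\|^2
  = \sum_{x \in X} \|x - \mu\|^2
    + 2\Big\langle \sum_{x \in X} (x - \mu),\, \mu - p \Big\rangle
    + |X| \, \|\mu - p\|^2
  = \sum_{x \in X} \|x - \mu\|^2 + |X| \, \|\mu - p\|^2 .
% Summing this identity over p ranging over X gives
%   sum_{x, x'} ||x - x'||^2 = 2 |X| sum_x ||x - mu||^2 >= |X| (|X| - 1) s,
% so |X| u >= sum_x ||x - p||^2 >= (|X| - 1) s / 2, i.e., |X| <= s / (s - 2u)
% whenever s > 2u.
```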
We are now ready to investigate the several cases needed to prove Lemma 4.2.
Case 1: .
Let be the unique point in and let be the witness of . Recall that is tight, so . Note that and .
There are numerous sub-cases to consider, which we enumerate.
-
a)
. In this case, either so , or there exists such that . So, . In addition, we have that with probability . So, if we let , we can bound the fraction by
Note that since and the above fraction is maximized for , in which case we get that the fraction is at most
(1.a) -
b)
In this case, there exists (possibly ) such that In addition, there exists such that . Finally, since , we must have that . If we condition on , then the numerator and denominator both equal , so the fraction is (or ). Else, if we condition on , then the denominator is , and the numerator is at most , since always, and either , in which case , or , in which case .
Note that , that , and that . So, we may apply Proposition 4.4 with and to bound numerator (and thus the overall fraction since the denominator equals ) by
(1.b)
In the remaining cases, we may assume that . Then, one of the following must occur:
-
c)
. In this case, define , and note that . So, with probability , we have that , and otherwise, we have that . So, we can bound the ratio by
(1.c) We prove this final inequality in Appendix B.
-
d)
but . First, we recall that . Now, let . In this case, with probability , (if we select to be in ), with probability , (if we select but not to be in ), and in the remaining event of probability, we still have that by Proposition 4.3. So, we can bound the ratio by
Note that this is maximized when (since the numerator and denominator increase at the same rate when increases), so we can bound the ratio by
(1.d) -
e)
There is more than one neighbor of in that is in . In this case, there is some other point not in such that So, we have four points such that and
-
f)
There are no neighbors of in that are in . In this case, we would actually have that because we defined to be a maximal independent set in the induced subgraph So, if there were no such neighbors and , then we could add to , contradicting the maximality of . Having was already covered by subcases c) and d).
-
g)
There is a neighbor of in that is also in , which means that either so , or there is some other point not in such that If , then define and . In this case, and Since and , we can bound the overall fraction as at most
(1.g.i) We derive the final inequality in Appendix B.
Alternatively, if then if we condition on the fraction is (or ), and if we condition on , the denominator is and the numerator is at most (Note that and are independent.) Therefore, we can also bound the overall fraction by
(1.g.ii) -
h)
There is a neighbor of in that is also in . In this case, would not be in , so we are back to sub-case 1.a.
Case 2: .
Let be the unique point in and let represent the points in . Let be the number of points in that are in , and let be the number of points in not in We will have four subcases. For simplicity, in this case we keep
Before delving into the subcases, we first prove the following propositions regarding the probability of some point in being selected.
Proposition 4.5.
Let . Then, the probability that no point in is in is at most .
Proof.
First, note that . Therefore, if , then In general, through induction we have that for any that .
Now, group the points into groups of sizes , where each group is points that map to the same point in under . Then, for each group , the probability that no point in the group is in is precisely , because with probability we will only consider picking the point (for in group ), and otherwise each point in the group will still not be in with probability . So, the overall probability is
We also note the following related proposition.
Proposition 4.6.
Let , and for some arbitrary . Then, the probability that no point in nor is in is at most .
Proof.
Similar to the previous proposition, we group the points into groups of sizes . Assume WLOG that is in the first group. Then, the probability that no point in the first group nor is in is . So, the overall probability is
-
a)
. In this case, we have that no pair of points in are connected in , which means that they have pairwise distances at least from each other (since if ). So, since and . Consequently, . Therefore, the denominator is at least . To bound the numerator, we note that the probability of none of the points in being in is at most . This is because with probability , no point in is in with probability at most by Proposition 4.5, and these two events are independent since . If some point in is in , then , and otherwise, . Therefore, we can bound the numerator as at most Overall, we have that the fraction is at most
(2.a) -
b)
and . In this case, we have that for all (note that ). Since the points for all have pairwise distance at least from each other, we have that . Letting be such that we have that . In addition, let In this case, the denominator of our fraction is To bound the numerator, with probability we have that , in which case . In addition, there is a disjoint -probability event where some , and conditioned on this event, . Otherwise, we still have that . So, overall we have that the fraction is at most
This function clearly decreases as increases (since the numerator and denominator increase at the same rate). In addition, since (whenever ), we have that the numerator increases at a slower rate than the denominator when increases, so this function also decreases as increases. So, we may assume that to get that the fraction is at most
(2.b) -
c)
. In this case, we have that there exists a point not in , so it has distance at least from . Therefore, by the triangle inequality, Let . Next, since all of the points in have pairwise distance at least from each other, . Therefore, the denominator of the fraction is at most .
We now bound the numerator. First, by Proposition 4.6 and since , the probability that no point in is in is some , where . In this event, we have that . In addition, there is a probability that , in which case . Finally, with probability, we have that but there is some , so . Overall, the fraction is at most
This function clearly decreases as increases (since the numerator and denominator increase at the same rate), so we may assume that as this is our lower bound on . So, the fraction is at most
(2.c) -
d)
and . In this case, , so we simply write as the unique point in . Let be the witness of . Since , this means that are neighbors in and . Finally, we have that are not connected in . So, Now, note that since are not in , we either have that the witness is in , in which case , or all of are adjacent to in since was a maximal independent set in . Therefore, if we define and , this means that and by triangle inequality.
Note that the denominator equals . To bound the numerator, note that with probability , in which case and with probability , in which case . Also, these two events are disjoint since . Finally, in the remaining probability event, .
Letting , we have that . We also know that . Therefore, we can bound the ratio by
(2.d) We prove the final equality in Appendix B.
Case 3: .
We split this case into three cases. First, we recall that each point corresponds to some point . Let represent the number of points such that , and let . Note that if , then .
-
a)
. In this case, all of the points in must not be connected in . Therefore, and since , this means that the denominator is at least in expectation. In addition, since the probability of no point in being in is at most , in which case . If there is some point in in , then Therefore, the numerator is at most So, we can bound the fraction by
(3.a) -
b)
. In this case, the probability of there being a point in that is part of is . Conditioned on this event, , and otherwise, . So, the numerator is at most Finally, we have that all of the points in are separated by at least , except for the unique point and , which are separated by at least . So, which means that Therefore, the denominator is at least Overall, the fraction is at most
(3.b) -
c)
or . In this case, we first note that since all of the points in have distance at least from each other and all of the points in have distance at least from each other, both and are at most . Let be such that Then, the denominator is at least . In addition, with probability , at least one of the points in will be in , conditioned on which the expected value of is at most Next, note that the probability of no point in being in is maximized when all points with map to a single point , and all other points in map to a single point . In this case, the probability that no point in is in is at most
Overall, we have that with probability , some point in is in , conditioned on which , with probability at most no point in is in , conditioned on which and otherwise, some point in is in , which means So, we can bound this fraction overall by
Noting that , we have that the numerator increases at a slower rate than the denominator as increases. Therefore, this fraction is maximized when . So, we can bound the fraction by
(3.c)
Case 4: .
We split this case into three subcases.
-
a)
In this case, is always empty, so the denominator is . To bound the numerator, we consider the witness of . If , the numerator is at most so we can bound the fraction by . Else, if then there exists such that , and since we have that . Thus, the fraction in the case where or is at most
(4.a.i) Otherwise, there is some of distance at most away from . Next, if the numerator is at most . Otherwise, there is some of distance at most away from . Finally, . Therefore, as in 1.b, we can apply Proposition 4.4 to obtain that the numerator, and therefore, the fraction is at most
(4.a.ii) since the above (4.a.ii) is greater than both and for any
-
b)
In this case, let be the unique element in . Conditioned on , the denominator equals and the numerator is at most . Otherwise, the denominator is , and the numerator can again be bounded in an identical way to the previous case, since the probability of is either (if ) or (if ). Therefore, the fraction is again at most
(4.b.i) if the witness of satisfies or , and is at most
(4.b.ii) otherwise.
-
c)
. In this case, note that all points in are separated by at least , which means that . Letting be such that we have that , so there exists such that . In addition, we know that . Finally, we also note the denominator equals .
Next, note that
Since , this means there exists such that and since for any , this means that
Let , and let . Let be the event that no point in nor is in , let be the event that no point in is in , and let be the event that is not in . Note that implies , which implies . Now, by Proposition 4.6, equals some . Likewise, By Proposition 4.5, equals some . In addition, . Under the event , we have that . Next, under the event we have that some point in is in , so . Under the event , we know that , so . Finally, we always have that .
Therefore, we can bound the overall fraction by
Since the above fraction is an increasing function in the variables . So, we can upper bound this fraction by replacing with their respective upper bounds , and , as well as replacing with simply . Next, since and , the derivative of the numerator with respect to is , and the derivative of the denominator with respect to is . Hence, this fraction decreases as increases, unless the fraction is less than . Therefore, we can bound the fraction by
(4.c) where we used the facts that and .
Case 5: .
In this case, note that deterministically, since . However, we can improve upon this, since we may have some points in much closer to , or we may have some points in which are closer and appear with some probability.
Recall that in our algorithm, we flip a coin for each to decide whether we include with probability or include each in independently with probability . Let us condition on all of these fair coin flips, and say that a point survives the coin flips if they could be in the set with probability afterwards. For simplicity, we replace with . We also let represent the points in that survive the fair coin flips.
Let the squared distances from to each of the points in be , and the squared distances from to each of the points in be , where . It is trivial to see that and conditioned on at least one of the points in being selected, we have that in expectation is at most The probability of at least one of the points in being selected in is
since conditioned on the initial coin flips, each surviving point in is included in independently with probability . Therefore, we can say that
Next, we have that
Now, we provide a lower bound for To do so, we use the fact that all the points in are separated by at least in squared distance, and all the surviving points in are separated by at least in squared distance, to get
So, if and , then if we let and , then the ratio is at most
(5.a) |
In the case where this fraction is undefined. However, we note that in this case, deterministically contains a unique center and nothing else, so and Therefore, the fraction is . ∎
Therefore, we have that the LMP approximation ratio is at most $\rho$, where $\rho$ is determined via the numerous cases in the proof of Lemma 4.2. The final step is to actually bound $\rho$ based on the cases. Indeed, by casework one can show the following proposition.
Proposition 4.7.
For suitable fixed choices of $\delta_2$, $\delta_3$, and $p$, we have an explicit upper bound on $\rho(\delta_1)$.

As a consequence, for an appropriate choice of $\delta_1$, $\rho(\delta_1) \le 3 + 2\sqrt{2}$.
We defer the casework to Lemma 5.19; the above proposition follows immediately from it. Therefore, we get a $(3 + 2\sqrt{2})$-LMP approximation (in expectation) for the Euclidean $k$-means problem.
5 Polynomial-time Approximation Algorithm for Euclidean $k$-means
In this section, we describe how we turn the LMP approximation for Euclidean $k$-means into a polynomial-time approximation algorithm. Unfortunately, we will lose a slight factor in our approximation, but we still obtain a significant improvement over the previous state-of-the-art approximation factor. While we focus on the $k$-means problem, we note that this improvement can also be applied to the $k$-median problem as well, with some small modifications that we will make note of. In Section 6, we will provide an LMP approximation for $k$-median, and explain how we can use the same techniques as in this section to also obtain an improved polynomial-time algorithm for $k$-median.
In Subsection 5.1, we describe the polynomial-time algorithm to generate two nested quasi-independent sets $S^{(1)}$ and $S^{(2)}$, which will be crucial in developing our final set of centers of size $k$ with low clustering cost. This procedure is based on a similar algorithm of Ahmadian et al. [1], but with some important changes in how we update our graphs and independent sets. In Subsection 5.2, we describe and state a few additional preliminary results. In Subsection 5.3, we analyze the algorithm and show how we can use $S^{(1)}$ and $S^{(2)}$ to generate our final set of centers, obtaining a polynomial-time approximation algorithm. Finally, in Subsection 5.4, we show that our analysis in Subsection 5.3 can be further improved, to obtain a $(5.912+\varepsilon)$-approximation guarantee.
We remark that our approximation guarantee of $5.912+\varepsilon$ is only in expectation. However, it can be made to hold except with exponentially small failure probability, since with constant probability the approximation ratio will be $5.912+\varepsilon$. So, by running the algorithm polynomially many times in parallel, and outputting the best solution among these runs, we obtain a $(5.912+\varepsilon)$-approximation ratio with probability exponentially close to $1$.
5.1 The algorithm and setup
First, we make some assumptions on the clients and facilities. We first assume that the number of facilities, $m$, is at most polynomial in the number of clients $n$. In addition, we also assume that the distances between clients and facilities are all in the range $[1, \mathrm{poly}(n)]$. Indeed, both of these assumptions can be made via standard discretization techniques, and we only lose a $1+o(1)$ factor in the approximation by removing these assumptions [1]. Note that this means the optimal clustering cost, which we call $\mathrm{OPT}$, is at least $1$ for both $k$-means and $k$-median. Finally, we assume that $k < m$ (else this problem is trivial in polynomial time).
Next, we describe the setup relating to dual solutions. Consider the tuple $(\alpha, z, \mathcal{S}, \tau)$, where $\alpha = \{\alpha_j\}_{j \in \mathcal{D}}$, $z = \{z_i\}_{i \in \mathcal{F}}$, $\mathcal{S} \subseteq \mathcal{F}$, and $\tau : \mathcal{S} \to 2^{\mathcal{D}}$. Here, $\alpha$ represents the set of dual variables, which will be a solution to the dual linear program; $z$ represents the modified values $z_i$, where each $z_i$ is a threshold for tightness of facility $i$; $\mathcal{S}$ represents a subset of facilities that we deem “special”; and $\tau$ is a function that maps each special facility to a subset of the clients that we deem special clients for that facility.
When talking about a single solution $(\alpha, z, \mathcal{S}, \tau)$, we define $\beta_{ij} := \max(\alpha_j - c(j, i), 0)$ for any $i \in \mathcal{F}, j \in \mathcal{D}$, and define $\beta_i := \sum_{j \in \mathcal{D}} \beta_{ij}$. We say that a facility $i$ is tight if $\beta_i = z_i$. Now, we define $t_i$ for each $i$ that is either tight or special (i.e., in $\mathcal{S}$). For each tight facility, we define $t_i := \max_{j \in N(i)} \alpha_j$, and for each special facility, we define $t_i := \max_{j \in \tau(i)} \alpha_j$. We default the maximum of an empty set to be $0$. We also consider a modified conflict graph $H(\delta)$ on the set of tight or special facilities, with an edge between $i$ and $i'$ if $c(i, i') \le \delta \cdot \min(t_i, t_{i'})$.
We can now define the notion of roundable solutions: our definition is slightly modified from [1, Definition 5.1].
Definition 5.1.
Let be the set of special facilities, and be the function assigning each special facility to a subset of special clients . Then, the tuple is -roundable if
-
1.
is a feasible solution of and for all .
-
2.
For all
-
3.
There exists a subset of “bad” clients so that for all , there is a facility that is either tight or in , such that:
-
(a)
For all ,
-
(b)
For all ,
-
(c)
-
4.
, and
Here, are arbitrarily small constants, which are implicit parameters in the definition. Finally, we say that is -roundable if it is -roundable for some choice of .
Our main algorithm is described in Figure 2. This algorithm outputs two nested quasi-independent sets $S^{(1)}$ and $S^{(2)}$. The final set will be obtained either from one of these two sets, or from some hybridization of them. We defer the actual construction of the final set to Theorem 5.17.
The method of RaisePrice comes from [1], and will not be of importance, apart from the results that they give us for the overall algorithm (see Theorem 5.7). We will make some important definitions and set up the overall algorithm, and then describe the method of GraphUpdate, which is slightly modified from [1].
First, we describe the initialization phase of Algorithm 2, which generates the initial solution; it is almost identical to that of Ahmadian et al. [1, pp. 22–23 of the journal version]. The main difference is that we parameterize the procedure by $\delta$ (instead of the constants 2 and 6 used by Ahmadian et al.). Start by setting $\alpha_j = 0$ for all $j \in \mathcal{D}$, and $\mathcal{S} = \emptyset$ (so $\tau$ has empty domain). We then set $z_i = 0$ for all $i \in \mathcal{F}$.
Now, we increase all of the $\alpha_j$ values simultaneously at a uniform rate, and for each $j$, we stop increasing $\alpha_j$ once one of the following two events occurs:
-
1.
for some .
-
2.
for some (or in the -median case). Here, will be a fixed small constant (see Appendix D for details on how to set ).
In the initial solution, $\alpha_j = 0$ for all $j$, which means that $N(i)$ is empty for all $i$, so $t_i = 0$. In addition, since every $z_i = 0$, every facility is tight. This means that the conflict graph on the set of tight facilities for the initial solution we construct is just an empty graph on the full set of facilities, since every $t_i = 0$. So, this means that if we apply Algorithm 1 to it, we will obtain that $I_1 = I_2 = \mathcal{F}$ and $I_3 = \emptyset$.
We now set up some definitions that will be important for the remainder of the algorithm and analysis. Define two dual solutions and to be close if . Consider two solutions and that are each -roundable for some choice of , such that and are close. This means that for all and that for all , even if is not tight. Let represent the set of tight or special facilities in and define likewise.
Let denote the disjoint union, i.e., is a set consisting of a copy of each element in and a distinct copy of each element in . For each point , we define (if were a special facility), , and based on whether came from or from . This means that for , , , and if is tight and if is special (and likewise for ). In addition, for each client , we define ; for each and , we define ; and for each , we let . Since this means if and if
We create a hybrid conflict graph on a subset of the disjoint union . First, we let represent the conflict graph on . The conflict graph on a set of facilities means there is an edge between two vertices if . Next, we choose some ordering of the vertices in , and for each , we let represent the vertices of the so-called merged vertex set defined as after we removed the first vertices in , and let represent the conflict graph on , where again the conflict graph means that share an edge if . Note that . For simplicity of notation, we may abbreviate as and as if the context is clear.
In the context of a hybrid conflict graph , for any client , let be the subset of consisting of all tight facilities such that and all special facilities such that . We also define as the witness for in the solution , and as the set of bad clients from the solution .
Finally, the actual GraphUpdate procedure works as follows. First, we note that , which has already been decided, either by the first initialized solution or by the previous solution before we reset . Otherwise, since , the set of vertices . Finally, we note that since for and otherwise, is created by simply removing one element from . So, the maximal independent set of can easily be extended to a maximal independent set of by deleting at most one element and then possibly extending the independent set. We then extend arbitrarily based on Steps 2 to 5 to create and , where and may have no relation to and . So, we inductively create from , where importantly .
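The incremental repair described above (delete at most one element, then re-extend to maximality) is simple to implement. A minimal sketch, with our own names:

```python
def update_maximal_independent_set(graph, mis, removed):
    """Repair a maximal independent set after one vertex is deleted.

    graph:   adjacency dictionary of the new conflict graph (the old graph
             with `removed` deleted), mapping each vertex to its neighbor set.
    mis:     a set that was maximal independent in the old graph.
    removed: the vertex deleted when passing to the new graph.
    """
    mis = set(mis)
    mis.discard(removed)  # delete at most one element, as in the text
    # Greedily re-extend: add any vertex with no neighbor already in the set.
    for v in graph:
        if v not in mis and not (graph[v] & mis):
            mis.add(v)
    return mis
```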
5.2 Additional preliminaries
For our approximation guarantees, we require two additional preliminaries. The first is to show a rough equivalence between solving -means (resp., -median) clustering and solving -means (resp., -median) clustering if allowed additional clusters. The second is to define the notion of negative-submodularity and its application for -means and -median.
First, we show that for any constant and parameter , if there exists a polynomial-time -approximation algorithm for -means or -median in any metric space that uses centers, then for any constant , there exists a polynomial-time -approximation algorithm that opens exactly centers.
More formally, the statement we prove is the following. Note that a similar statement was proven in [36].
Lemma 5.2.
Let be some absolute constants. Let be an -approximation algorithm with running time for -median (resp. -means) that opens centers. Then, for any , there exists an -approximation for -median (resp. -means) with running time .
Proof.
We give the proof for the -median problem; the proof for the -means problem is identical, up to adjustment of the constants.
To proceed, we need the following notion (due to [42]): A -median instance is said to be -ORSS-separable if the ratio of the cost of the optimum solution with centers to the cost of the optimum solution with centers is at least .
We can now present our algorithm. For any , the algorithm is as follows. Compute an -approximate solution (with centers) to the -median instance using . Then, for any , compute a solution with centers using the algorithms for -ORSS-separable instances of [5, 22] to obtain a -approximate solution in time . Output the solution of with minimum -median cost.
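Before the analysis, it may help to see the outer loop in code. The following is a minimal sketch under our own naming (none of it is the paper's implementation): `approx_A` stands for the given black-box algorithm that may open extra centers, `orss_solver` for the algorithm of [5, 22] on ORSS-separable instances, and `cost` for the -median cost; how the extra centers of `approx_A` are budgeted is our assumption.

```python
def reduce_to_exactly_k(points, k, extra, approx_A, orss_solver, cost):
    """Sketch of the Lemma 5.2 reduction (names and budgeting are ours).

    approx_A(points, kk):    black-box approximation that may open up to
                             kk + extra centers.
    orss_solver(points, kk): black-box solver opening exactly kk centers,
                             accurate on ORSS-separable instances.
    """
    # Run A with a reduced budget so that it opens at most k centers
    # (one plausible reading of the construction; an assumption here).
    candidates = [approx_A(points, max(1, k - extra))]
    # Try the ORSS-separable solver for every nearby number of centers;
    # the analysis shows one of these levels is separable otherwise.
    for kk in range(max(1, k - extra), k + 1):
        candidates.append(orss_solver(points, kk))
    # Output the candidate of minimum k-median cost (all open <= k centers).
    return min(candidates, key=lambda c: cost(points, c))
```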
We now turn to the analysis of the above algorithm. The running time follows immediately from its definition and the results of [5, 22]. We then consider the approximation guarantee of the solution produced. For , let be the solution to -median, i.e., the -median problem with centers. Our goal is thus to show that the solution output by the above algorithm is an -approximate solution to the .
If the cost of is within a -factor of the cost of , then the cost of the solution output by the algorithm is no larger than the cost of solution whose total cost is thus at most , as desired.
Otherwise, since we have that there exists a such that . Let be the smallest such that the above holds. In that case, we have both that and that the -median instance is -ORSS-separable. Therefore, by the results of [5, 22], the cost of the solution output by our algorithm is no larger than , and so at most by our choice of , hence the lemma. ∎
Next, we describe the definition of submodular and negative-submodular set functions.
Definition 5.3.
Let be a finite set, and let be a function from the set of subsets of , , to the real numbers . Then, is submodular over if for any and any , we have that . Likewise, is negative-submodular over if is a submodular function: equivalently, if for any and any , we have that .
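As a quick numerical sanity check of this definition for the -means cost (purely illustrative, not part of the formal development; all names below are ours):

```python
import random

def kmeans_cost(points, centers):
    """Sum over points of the squared distance to the nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

random.seed(0)
points = [random.random() for _ in range(50)]    # a 1-D instance suffices
facilities = [random.random() for _ in range(8)]

A = facilities[:3]
B = facilities[:6]          # A is a subset of B
x = facilities[7]           # a facility outside B

# Negative-submodularity: the marginal change from adding x to the smaller
# set is at most the marginal change from adding it to the larger set.
marg_A = kmeans_cost(points, A + [x]) - kmeans_cost(points, A)
marg_B = kmeans_cost(points, B + [x]) - kmeans_cost(points, B)
assert marg_A <= marg_B + 1e-12   # tolerance for floating-point noise
print(marg_A, marg_B)
```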
The following claim, proven in [17], shows that the -means and -median objective functions are both negative-submodular.
Proposition 5.4.
[17, Claim 10] Fix and , and let be the function sending each subset to . Then, is a negative-submodular function over , either if the cost is the -median cost or if the cost is the -means cost.
Given Proposition 5.4, we can use standard properties of submodular functions to infer the following claims.
Proposition 5.5.
Let be sets of facilities, where has size and has size . Then, for any , if is a set created by including all of and then independently including each element of with probability , then .
Proposition 5.6.
Let be sets of facilities, where has size and has size . Then, if is a set created by randomly adding exactly items from to , for some fixed , we have .
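Algorithmically, Proposition 5.6 is what lets us interpolate between a small solution and a larger one. A minimal sketch of the corresponding randomized merge, with hypothetical names of our own:

```python
import random

def hybridize(S_small, S_large, m, trials, cost, points):
    """Randomly add exactly m facilities of S_large not in S_small.

    By negative-submodularity (Proposition 5.6), the expected cost of the
    hybrid interpolates between cost(S_small) and cost(S_large); repeating
    the draw and keeping the cheapest hybrid realizes the expectation
    guarantee. Assumes m <= number of extra facilities.
    """
    extras = [f for f in S_large if f not in S_small]
    best = None
    for _ in range(trials):
        hybrid = list(S_small) + random.sample(extras, m)
        if best is None or cost(points, hybrid) < cost(points, best):
            best = hybrid
    return best
```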
5.3 Analysis
The following theorem relating to the above algorithm was (essentially) proven by Ahmadian et al. [1], and will be very important in our analysis.
Theorem 5.7.
Algorithm 2 runs in time (where the may depend on and ), and the following conditions hold.
1. Let be the minimum of and the sizes of all sets that become (i.e., the first part of each nested quasi-independent set that becomes , as done in line 13 of the pseudocode). Then, every solution that is generated when is a certain value is -roundable.
2. For any solution that becomes , (and so is an empty function). In addition, has no corresponding bad clients, i.e., .
3. For any two consecutive solutions and , we have that and are close.
4. Every is a nested quasi-independent set for the set of facilities . In addition, for every ,
This theorem is technically stronger than what was proven in [1], but follows from a nearly identical analysis to their paper, with a very minor tweak to the algorithm. We explain why Theorem 5.7 follows from their analysis in Appendix D.
For the following lemmas (Lemma 5.8 through Proposition 5.15), we consider a fixed family of conflict graphs on a hybrid set , for some , where both and are -roundable. For some fixed , we let be a nested quasi-independent set of , i.e., the output of running all but the final step of Algorithm 1 with , and treat it as fixed. However, we will let (required in the final step 6) be variable, though we may consider as initialized to some fixed .
Many of these results apply to both -means and -median. While we focus on -means, we later explain how to make simple modifications to apply our results to the -median problem. In addition, we will treat as fixed but as potentially variable. We let represent the approximation constant from the LMP algorithm (i.e., in Lemma 4.2) with probability (either for -means or -median, depending on the context).
We first show some crucial preliminary claims relating to the hybrid graph , where .
Lemma 5.8.
For any client and any facility , .
Proof.
Note that if , then is the maximum over such that and if is special. Since , this is at least the maximum over such that and if is special. But if , then indeed and if is special (recall that was defined based on whether is in or ). So, . By an identical argument, the same holds if .
Finally, note that we defined to precisely be the set of tight facilities in with , or special facilities in with and . So, we always have . ∎
Lemma 5.9.
Suppose that contains every point in , and each point in and each point in with probability (not necessarily independently). Then, for any point ,
Remark.
We note that this lemma holds even for bad clients . In addition, we remark that we will apply this lemma with as a nested quasi-independent set or a closely related set.
Finally, we note that this lemma (and the following lemma, Lemma 5.10) are the only results where we directly use the fact that we are studying the -means as opposed to the -median problem.
Proof of Lemma 5.9.
Note that every point satisfies and , by Lemma 5.8. So, by linearity of expectation, it suffices to show that
(10)
We next have the following lemma, which bounds the cost for clients that are not bad.
Lemma 5.10.
Proof.
We note that the expression may be somewhat unwieldy. Therefore, we provide an upper bound on its sum over .
Lemma 5.11.
Let be any subset of . Then,
Remark.
We note that this lemma holds even when , i.e., .
Proof.
First, by splitting the sum based on tight and special facilities, we have that
(13)
Now, we note that for any tight facility , either or . Since and are close, this means , and since there are clients in , this means that
(14)
for some choice of in .
In addition, we know that both and are feasible solutions of , and that . Therefore, for any special facility , . But, we have that
for both and (the last inequality follows by Condition 4 of Definition 5.1). So, if we let represent for each special facility , we have that but , since and . Similarly, . So, this means that
which means that
(15)
Next, we show that the bad clients do not contribute much to the total cost.
Lemma 5.12.
Let be a subset of containing . Then, we have that
Proof.
Note that for every point , there exists a facility such that
because is -roundable. Now, note that , so . But since , , and therefore,
where the final inequality follows by Condition 3c) of Definition 5.1. ∎
We now combine our previous lemmas to obtain the following bound on the expected cost of , giving a result that bounds the overall expected cost in terms of the dual solution.
Lemma 5.13.
Proof.
We will abbreviate . We can split up the cost based on good (i.e., not in ) points and bad points . Indeed, doing this, we get
In the above equation, the second line follows from Lemmas 5.10 and 5.12. The third line is true since for all , even in , by Lemma 5.9. Finally, the fourth line is true because of Lemma 5.11. ∎
Next, we show that under certain circumstances, we can find a solution of size at most satisfying a similar condition to Lemma 5.13, with high probability.
Lemma 5.14.
Suppose that for some integer (which may be larger than ) and some . Then, for any sufficiently large integer , if , then there exists a polynomial-time randomized algorithm that outputs a set such that , and with probability at least , and
Remark.
By repeating this randomized algorithm polynomially many times and outputting the lowest-cost solution with , we can make the failure probability exponentially small.
Proof.
Let , and partition into , where each consists of a point and , i.e., ’s preimage under the map . We assume WLOG that the ’s are sorted in non-increasing order of size, and write as the unique point in . Define such that (note that may be or ). Note that , so . Now, set , and consider creating the following set :
• For each include .
• For each include , and for each in , include independently with probability .
• For each flip a fair coin. If it lands heads, include with probability , and if it lands tails, include each point in independently with probability . (A code sketch of this construction follows the list.)
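The following minimal sketch (ours, with hypothetical cutoffs and names) mirrors the three bullets above; the stripped parameters in the proof fix the actual index ranges and probabilities.

```python
import random

def sample_solution(groups, r1, r2, p):
    """One draw of the candidate set from the construction above (a sketch).

    groups: list of (rep, members) pairs, sorted in non-increasing order of
            size, mirroring the partition in the proof; `members` excludes
            the representative. r1 <= r2 are hypothetical cutoff indices and
            p the sampling probability (the stripped math fixes them).
    """
    S = set()
    for i, (rep, members) in enumerate(groups):
        if i < r1:                    # first bullet: include rep outright
            S.add(rep)
        elif i < r2:                  # second bullet: rep, plus each member
            S.add(rep)                # independently with probability p
            S.update(m for m in members if random.random() < p)
        else:                         # third bullet: fair coin
            if random.random() < 0.5:
                if random.random() < p:
                    S.add(rep)        # heads: rep with probability ~p
            else:                     # tails: each member with probability p
                S.update(m for m in members if random.random() < p)
    return S
```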
The expected size of is . Therefore, since , the expected size of is at most . To bound the variance of , we note that each point in and each for is deterministically in , each point in for is independently selected with probability , and the number of points from each for is some independent random variable bounded by . So, the variance can be bounded by . So, by Chebyshev’s inequality, with probability at least ,
where the final inequality is true since .
Next, we bound the expected cost of . First, consider running the final step 6 of the LMP algorithm on using probability . This would produce a set such that
(16)
Above, the first line follows by Lemma 5.13, the second line follows by definition of and , and the third line follows from the fact that .
Now, note that if we had performed the final step of the LMP algorithm on using probability instead of , the set (call it ) would have satisfied
by Lemmas 5.11 and 5.9. This means that . Therefore, again using the fact that , we have that
We can rewrite this to obtain
(17)
since .
In this and the next paragraph, we prove that . To see why, we consider a coupling of the randomness to generate a sequence of sets . To do so, for each point , we automatically include and for all . Now, for each , we first create a temporary set by including each point to be in independently with probability . Then, we create two sets and as follows. For , we include each point in independently, with probability , and always include . For , we flip a fair coin: if the coin lands heads, we only include , but if the coin lands tails, we do not include but include all of . We remark that overall, includes each point in independently with probability .
Now, for each , we define . One can verify that and have the desired distribution, since is precisely the distribution obtained after applying step 6 of the LMP algorithm on , but takes instead of for each , which is precisely the desired distribution for (as we defined at the beginning of this lemma’s proof). To show that , it suffices to show that for all . However, note that because of our coupling, the only difference between and relates to points in . If we let be the set that always includes but includes the entirety of with probability , then clearly . In addition, if we condition on the set , the only difference between and is that includes each point in with probability , whereas either includes the entirety of with probability or includes none of . Therefore, by Proposition 5.5, using the negative-submodularity of -means [17], we have that . So, we have that , which means that
(18)
In summary,
Above, the first line follows from Equation (18), the second line follows from Equation (16), the third line follows from Equation (17), and the fourth line follows since . So, with probability at least , , and by Markov’s inequality,
with probability at least . So, both of these hold simultaneously with probability at least , and by repeating the procedure times, we will find our desired set with probability . ∎
Our upper bound on the cost has so far been based on terms of the form . We note that this value is at most roughly . Specifically, we note the following:
Proposition 5.15.
If is a feasible solution to , then
Proof.
Recall that and that is a feasible solution to . Therefore, by duality, we have that
One potential issue is that if our goal is to obtain a good approximation to optimal -means, the error, which should be negligible, may appear too large if is smaller than . To fix this, we show that in certain cases, which we later show we will satisfy. For the following lemma, we return to considering a single roundable solution , and let represent the set of tight or special facilities corresponding to .
Lemma 5.16.
Let be -roundable for some , where and the set of corresponding bad clients is . Define as the corresponding family of conflict graphs, with some fixed nested quasi-independent set . Suppose that for some , and that . Then, .
Proof.
If , the result is trivial. So, we assume that .
Since is empty, we have that every client has a tight witness (since there are no special facilities) such that and . In addition, we have that for any , is tight, which means . Therefore,
Then, we can use the LMP approximation to get that
where , i.e., assuming that no point in or is included as part of the set. In addition, we know that if is created by including all of and each point in with probability , then
Above, the first inequality follows since for any tight , and the final inequality follows because of Lemma 5.9.
To summarize, we have that there exists a constant such that
(19)
and
(20)
Therefore, by taking a weighted average of Equations (19) and (20), we get
where the first inequality is true since and the last inequality is true since . Thus, since and since , we have that .
Finally, since is a feasible solution to , this means that . However, if , then . So, . ∎
Recall that represents the conflict graph . We will also let represent the conflict graph . In that case, is the same as except with one vertex removed. Recall was a nested quasi-independent set of , and let be a nested quasi-independent set of , such that and .
Theorem 5.17.
Let be an arbitrarily large constant, and be an arbitrarily small constant. Given the sets obtained in Algorithm 2, in polynomial time we can obtain a solution for Euclidean -means with approximation factor at most
(21)
Proof.
First, we remark that it suffices to obtain a set of facilities of size at most with cost at most for any fixed constant (for all ), where is the value in Equation (21). Indeed, we can apply Lemma 5.2 to obtain a solution of size and cost in polynomial time, hence obtaining a -approximate solution for -means clustering for all . Thus, we get the desired approximation ratio for all , but for , we can enumerate all the different clusterings of the input that have at most non-singleton parts and solve -clustering exactly in time.
Algorithm 2 stops once we have found the first hybrid conflict graph for some where the corresponding nested quasi-independent set satisfies . Let and let . In addition, let and . If , then , which may be problematic since our previous lemmas can only be used for . However, we note that if , and that was previously labeled as . The only exception to this is the case when , and is the initialized solution created in the first line of the algorithm. In this case, however, recall from our initialization that is the full set of facilities, and will just be an extension of this set, so . Therefore, and are both expressible as nested quasi-independent sets of merged conflict graphs. However, if and , then we may need to express based on a previous labeling, so it is possible that comes from a -roundable solution and comes from a -roundable solution, rather than both nested quasi-independent sets coming from -roundable solutions.
First, we show that for the value of at the end of the algorithm (which means all solutions found are -roundable for some ), we have . To see why this is the case, note that either , so the claim is trivial, or for some that corresponds to a solution that was at some point labeled as . Note that the corresponding nested quasi-independent set is not the final set , because if it were, we would have stopped the algorithm before we decided to label as such. Therefore, , and we are assuming that . Finally, since arises from a -roundable solution with no special facilities or bad clients (by Condition 2), we may apply Lemma 5.16 to obtain that .
Note that , and that , which means that , so . First, suppose that , where we recall that is an arbitrarily large but fixed constant. In this case, this means that , and so . In this case, we can apply Lemma 5.13 to find a randomized set such that
where corresponds to the merged solution that produces . Since , we can try every possible to get a deterministic set of size at most and size at least such that
The final line follows by Proposition 5.15, since , and since , so . Therefore, there exists an absolute constant such that we have a set of size at most with cost at most . As argued in the first paragraph of this proof, this is sufficient since we can apply Lemma 5.2.
Otherwise, namely when , we have since . Then, recall that , so let . Set and such that and . Then, for some , so . In this case, , so if we set , then . Also, since , we have that , so for some . In this case, assuming that , we can use Lemma 5.14 to obtain a solution of size at most with cost at most
Now, since , it is straightforward to verify that has bounded derivative. (Indeed, each case produces a function that is continuously differentiable on , so has bounded derivative on .) Therefore, since , the solution in fact has cost at most
But since , and since
we obtain a solution of cost at most
(22)
In addition, note that we can use Lemma 5.14 to obtain a solution for -means and for -means. Also, . So, if we define , then
(23)
and
(24)
Therefore, by Proposition 5.6, if we randomly add of the items in , we will get a set of size with expected cost
(25)
Note that if , then this means that , and Proposition 5.15 tells us that , so the expected cost is at most . Alternatively, we may assume that .
In this case, note that by Lemmas 5.9 and 5.11, we have that
We can rewrite as , where the last inequality is true since . Thus, we have that
Therefore, combining the above equation with (25), we have that
(26)
If we set such that , then and
So, by combining Equations (22) and (26), we can always guarantee an approximation factor of at most
(27)
This approximation factor also holds in the case when , by setting . So, by letting be an arbitrarily large constant and be arbitrarily small constants, the result follows. ∎
Since we have set and , by Proposition 4.7 we have . If , one can verify that . Alternatively, if , then , and it is straightforward to verify that for all using Proposition 4.7 (we remark that while Proposition 4.7 follows from Lemma 5.19, Lemma 5.19 only depends on Lemma 4.2, so there is no circular reasoning). Hence, we have a -approximation to -means.
5.4 Improving the approximation further
First, we define some important quantities. For any client , we define , and we define . Note that always.
We split the clients into groups. Let be the set of all clients corresponding to subcases 1.a, 1.c, 1.d, 1.g.ii, 1.h, 2.a, 3.a, 4.a.i, 4.b.i, and 4.c, as well as the clients in 5.a where there do not exist and such that . (In the casework, our choice of is rather than , and similarly for and .) Let be the sum of for these clients, and be the sum of for these clients. Next, let be the set of all clients corresponding to subcases 1.b, 1.e, 4.a.ii, and 4.b.ii. Let be the set of all clients corresponding to subcase 1.g.i. Let be the set of all clients corresponding to subcase 2.d, further restricted to (or equivalently, in the language of Case 2.d in Lemma 4.2, ). Finally, let be the set of all bad clients , as well as all remaining subcases (2.b, 2.c, 2.d when , 3.b, 3.c, and the clients in 5.a not covered by ). Note that these cover all cases (recall that 1.f is a non-existent case). Finally, we define similarly to how we defined and .
Now, we have the following result, which improves over Lemma 5.9.
Lemma 5.18.
For any client , . In addition, if the client corresponds to any of the subcases in case or case , or to subcases or , then . Also, if the client corresponds to subcase where , then .
Remark.
As in Lemma 5.9, this lemma holds even for bad clients .
Proof.
The proof that for any client is identical to that of Lemma 5.9. So, we focus on the next two claims. For the subcases in Case 1, note that , so we just need to show that , which is implied by Equation (11). Likewise, for the subcases in Case 4, note that , so we just need to show that , which is implied by Equation (12).
We recall that in subcases 2.a and 3.a, we noted in both cases that the points in and were all separated in , and that since . So, we have that in both cases.
Finally, we consider subcase 2.d when . In this case, we have (when ) that , and , which is at most if . So, for general , we have that . ∎
Therefore, we have that
(28)
Next, we define to be the maximum fraction obtained from the casework corresponding to the (not bad) clients in . (Note that .) We have the following result:
Lemma 5.19.
Let , and . Then, for all we have that , , and .
Proof.
We start by considering , covered by subcases 1.a, 1.c, 1.d, 1.g.ii, 1.h, 2.a, 3.a, 4.a.i, 4.b.i, and 4.c, and certain subcases of 5.a. All subcases except 2.a, 4.c, and 5.a can easily be verified (see our Desmos file for -means Case 1; the link is in Appendix A). For subcase 2.a, we have to verify it for all choices of . However, it is simple to see that the numerator of the fraction decreases as increases whenever , so in fact we just have to verify it for , which is straightforward. For subcase 4.c, we have to verify it for all choices of . For , it is straightforward to verify. For , since , it suffices to show
where we have taken the fraction from 4.c and added back a term to the numerator. Now, this fraction is decreasing as increases, so it suffices to verify it for , which is straightforward.
The last case for is Case 5.a. We show that in all cases the fraction is bounded by for , and if , then the fraction can further be bounded by . This is clearly sufficient for bounding . It will also be important in bounding : indeed, if there exist and such that , then regardless of the outcomes of the initial fair coins, since exactly one of or will contribute to the value of .
First, we note that can be rewritten as
In the case where and , we can therefore simplify the fraction in (5.a) to . This is at most for any . When , we can write the fraction as
(29)
When and , (29) can be simplified as
When and , we can rewrite (29) as
which is easily verifiable to be at most for . When and , (29) is at most
which is easily verifiable to be at most for . The final case is when , but here we saw in our analysis of 5.a that the fraction is at most , or that the numerator and denominator are both .
Next, consider , which is covered by subcases 1.b, 1.e, 4.a.ii, and 4.b.ii. Indeed, since , these all have the exact same bound of .
Finally, we deal with , which covers subcases 2.b, 2.c, 3.b, 3.c, and 5.a, along with 2.d when .
Subcase 2.b can be easily verified to be at most in the range when and is between and . Beyond this, we assume that , so we can apply the crude bound that the fraction is at most
which is at most for . It is easy to verify that the fraction in Subcase 2.c is at most for .
Subcase 3.b is easy to verify for . For , we can apply the crude bound that the fraction is at most
which trivially satisfies the desired bounds. Finally, we note that in Subcase 3.c, the fraction decreases as and increase, so we may assume that either or and . These are easy to verify for , and for , we may apply a crude bound to say the fraction is at most
as long as and . This is at most in the range .
Subcase 5.a was dealt with previously (as we only have to consider when ), so the final case is 2.d when . In this case, we recall the fraction is
where , , and also . By the symmetry of and , we may replace with . So, by defining , we can upper bound the above expression by
since and since . By Cauchy-Schwarz, . So, we can bound the above expression by
For , this fraction clearly increases with , so we maximize this when . When setting , this can easily be verified to be at most for all .
This concludes all cases, thus proving the lemma. ∎
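Each such verification amounts to maximizing an explicit low-dimensional fraction over a box, which we carried out with the Desmos files in Appendix A; programmatically, the same check is a dense grid search combined with a monotonicity argument to bridge grid gaps. A generic sketch follows; the fraction `f` below is a made-up placeholder of the same flavor, not one of the actual case bounds.

```python
import numpy as np

def max_over_grid(f, p_lo, p_hi, t_lo, t_hi, n=2000):
    """Upper-envelope a case fraction f(p, t) over a box by grid search.

    f is a vectorized function of the rounding probability p and one free
    case parameter t (e.g., a distance ratio). A complete verification
    also needs a monotonicity or Lipschitz argument between grid points.
    """
    ps = np.linspace(p_lo, p_hi, n)
    ts = np.linspace(t_lo, t_hi, n)
    P, T = np.meshgrid(ps, ts)
    vals = f(P, T)
    i = np.unravel_index(np.argmax(vals), vals.shape)
    return vals[i], P[i], T[i]

# Example with a hypothetical fraction (placeholder, not a case bound):
bound, p_star, t_star = max_over_grid(
    lambda p, t: (1 + p * t) / (1 + t ** 2), 0.1, 0.5, 0.0, 3.0)
print(bound, p_star, t_star)
```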
Next, we recall Lemma 5.11. First, by setting to be in Lemma 5.11, we obtain that
(30)
Next, by setting to be in Lemma 5.11, we obtain that
(31)
Next, we recall Lemma 5.13. By splitting based on whether is in , , , , , or , we obtain that
Therefore, the argument of Lemma 5.14 implies that if , if , and if , then we can choose a set such that and
(32)
To explain the second line, note that has bounded derivative on and that . Therefore, since , , and , which means . In addition, we still have that , as in our proof of Theorem 5.17.
We now return to the setup of Theorem 5.17, where we have and . Suppose that , , , and . In addition, suppose that , which means that . In this case, we may follow the same approach as in our proof of Theorem 5.17 to obtain a -approximation to -means.
Alternatively, we may suppose that , which implies that . Then, defining such that , we can use Equation (32) to find a solution of size at most with cost at most
(33)
in the same manner as (32), by setting . Alternatively, we can obtain two separate solutions of size , and a solution of size , such that . We have that
Finally, using the bound (23) for the cost of , we have
Note that we are not able to use a more sophisticated bound for , because our values of and only apply to and not to . By combining the solutions and , by adding random points from to , and using Proposition 5.6, we obtain a solution with expected cost
(34)
This is because we combine the solution , which has size , with the solution , which has size , so we assign the first solution relative weight and the second solution relative weight .
Now, let equal . Then, since , we can combine Equations (30) and (31) to get that
(35)
Next, recall (by Equation (33)) that we have a solution of size at most with cost at most
(36)
Finally, we note that since , we have that
Therefore, we can bound the expected cost of by
(37)
Now, we have that , and if we let , we have that . Hence, to show that we obtain an approximation , it suffices to show that for all choices of and that if we let , one cannot simultaneously satisfy
(38)
(39)
(40)
and
(41)
Indeed, we already know that (38) is true (same as (35)) and that (41) is true (same as (28)). So if we can show that we cannot simultaneously satisfy all of (38), (39), (40), and (41), then either (39) or (40) is false. But we have a clustering with at most centers and cost at most the right-hand sides of each of (39) and (40) up to a multiplicative factor, due to (37) and (36), respectively. Therefore, we successfully obtain a solution of cost at most . Moreover, , since by Proposition 5.15 (as both are solutions to ), and since . Therefore, , which means that we have found a -approximation to -means clustering.
Indeed, by numerical analysis of these linear constraints and based on the functions , we obtain a -approximation algorithm for Euclidean -means clustering. We defer the details to Appendix C.
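The infeasibility check behind this numerical analysis can be phrased as a small linear program. The sketch below is ours: the coefficient matrix shown is a placeholder, while the real coefficients are generated from the case bounds and the functions above (see Appendix C); it only illustrates the mechanics with scipy.

```python
import numpy as np
from scipy.optimize import linprog

# Placeholder system: each row encodes one linear constraint a . x <= b in
# the aggregated client-group variables; ">=" constraints among (38)-(41)
# are encoded by negating both sides. Coefficients here are illustrative
# only; the real ones are generated from the case bounds.
A_ub = np.array([
    [ 1.0,  2.0,  0.5,  0.0],
    [ 0.3,  1.0,  1.0,  0.2],
    [-1.0, -0.5, -2.0, -1.0],   # a ">=" constraint, negated
    [ 1.0,  0.0,  0.3,  1.5],
])
b_ub = np.array([1.0, 1.0, -1.2, 1.0])

# Feasibility only: minimize the zero objective subject to the constraints.
res = linprog(c=np.zeros(4), A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * 4, method="highs")
# res.status == 2 would certify that the constraints cannot all hold.
print("feasible" if res.status == 0 else "infeasible")
```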
6 Improved Approximation Algorithm for -median
6.1 Improvement to -approximation
In this subsection, we show that a -approximation can be obtained by a simple modification of the Ahmadian et al. [1] analysis. Because we use the same algorithm as [1], the reduction from an LMP algorithm to a full polynomial-time algorithm is identical, so it suffices to improve the analysis of their LMP algorithm to a -approximation. The main difficulty in this subsection will be obtaining a tight bound on the norms (as opposed to squared norms) of points that are pairwise separated, which we prove in Lemma 6.1. In the next subsection, we show how to break the barrier that this algorithm runs into, which will follow a similar approach to our improved -means algorithm.
We first recall the setup of the LMP approximation of [1]. Let be the distance between a client and a facility . Suppose we have a solution to , such that every client has a tight witness with and Recall that , where , and likewise, . Now, we let the conflict graph on tight facilities (i.e., facilities with ) have an edge if .
We let and return a maximal independent set of as our LMP approximation. It suffices to show, for each client , that . To see why, by adding over all clients , we obtain that
Finally, since is a feasible solution to this implies that
Before we verify the LMP approximation, we need the following lemma about points in Euclidean space.
Lemma 6.1.
Let and suppose that are points in Euclidean space (for some ) such that for all . Then, .
Proof.
Note that for any positive real numbers that add to , we have that
Then, by setting for each and scaling by accordingly to remove the assumption that , we have that
for all . Now, if some , then , which means that for all . Alternatively, for any , so we can set , to obtain that
(42)
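For concreteness, the first display of this proof is presumably the following standard weighted-average identity (our reconstruction of the stripped equation; the identity itself is elementary): for points $x_1,\dots,x_n$ and weights $a_1,\dots,a_n>0$ with $\sum_{i=1}^{n} a_i = 1$,
\[
0 \;\le\; \Bigl\|\sum_{i=1}^{n} a_i x_i\Bigr\|^2
  \;=\; \sum_{i=1}^{n} a_i \,\|x_i\|^2
  \;-\; \sum_{1 \le i < j \le n} a_i a_j \,\|x_i - x_j\|^2,
\]
so that $\sum_{i} a_i \|x_i\|^2 \ge \sum_{i<j} a_i a_j \|x_i - x_j\|^2$.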
From now on, for any polynomial , we define to be the sum of all distinct terms of the form over all permutations of . For instance, and .
To verify the LMP approximation, it suffices to show that for every , . We split this up into cases.
Case 1: .
In this case, by the triangle inequality. But we know that , and that , using the fact that is a maximal independent set so has some neighbor of in the conflict graph. Thus, . However, since , this means that . So, the desired inequality holds.
Case 2: .
In this case, let be the unique point in . Then, . In addition, . Since , the desired inequality holds (even with a ratio of ).
Case 3: .
In this case, let be the set of points in . Then, we know that for any . But by definition of (since ), this means that for every .
Now, by applying Lemma 6.1, we have that . Let , so . Then, . In addition, we have that . So, the ratio is
Above, the first inequality follows because as increases, the numerator increases at a slower rate than the denominator, so assuming that the fraction is at least , we wish for to be as small as possible to maximize the fraction. The final inequality holds because for all . Therefore, the desired inequality holds (even with a ratio of ).
So in fact, there is a simple improvement from the approximation algorithm to a algorithm. A natural question is whether this can be improved further without any significant changes to the algorithm or analysis. Indeed, there only seems to be one bottleneck, when , so naturally one may assume that by slightly reducing , the approximation from Case 1 should improve below and the approximation from Case 3 should become worse than , but can still be below .
Unfortunately, such a hope cannot be realized. Indeed, if we replace with some , we may have that and the pairwise distances are all exactly between each . However, in this case, which for is in fact negative for sufficiently large . Hence, even for for a very small choice of , we cannot even guarantee a constant factor approximation with this analysis approach. So, this approach gets stuck at a approximation.
In the following subsection, we show how to obtain an improved LMP approximation algorithm for Euclidean -median, breaking the approximation barrier. We will then show that we can also break this barrier for a polynomial-time -median algorithm as well.
6.2 An improved LMP algorithm for Euclidean -median
Recall the conflict graph , where we define two tight facilities to be connected if . We set parameters and , and define to be the set of all tight facilities. Given the set of tight facilities and conflict graphs for all , our algorithm works by applying the procedure described in Algorithm 3 to .
LMPMedian():
• Include every point .
• For each point , flip a fair coin. If the coin lands heads, include with probability . Otherwise, include each point in independently with probability . (A code sketch of this rounding follows the list.)
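In code, this rounding step looks as follows. The sketch is ours: `V1` stands for the set included deterministically in the first bullet, and `V2_reps` pairs each second-bullet point with its attached group of alternatives; the stripped math fixes the exact sets and the inclusion probability `p`.

```python
import random

def lmp_median_round(V1, V2_reps, p):
    """A sketch of Algorithm 3's randomized rounding (our naming).

    V1:      points included deterministically (first bullet).
    V2_reps: list of (point, group) pairs for the second bullet, where
             `group` is the set of alternatives attached to the point.
    p:       inclusion probability from the algorithm's parameters.
    """
    S = set(V1)                        # include every point of the first set
    for point, group in V2_reps:
        if random.random() < 0.5:      # fair coin: heads
            if random.random() < p:
                S.add(point)           # include the point with probability p
        else:                          # tails: each alternative w.p. p
            S.update(g for g in group if random.random() < p)
    return S
```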
As in the -means case, we consider a more general setup, so that we can convert the LMP approximation to a full polynomial-time algorithm. Instead of , let be a subset of facilities and let be the full set of clients. For each , let be some real number, and for each , let be some real number. In addition, for each client , we associate with it a set and a “witness” facility . Finally, suppose that we have the following assumptions:
1. For any client , the witness satisfies and .
2. For any client and any facility , .
Then, for the graph on where are connected if and only if (recall that now, instead of ), we have the following main lemma.
Lemma 6.2.
Fix , , and and let be variable. Now, let be the randomized set created by applying Algorithm 3 on . Then, for any ,
where is some constant that only depends on (since are fixed).
Proof.
As in the -means case, we fix , and we do casework based on the sizes of , , and .
Case 1: .
Let be the unique point in and let be the witness of . We have the following subcases:
a) . In this case, either so , or there exists such that . So, . In addition, we have that with probability . So, if we let , we can bound the ratio by
(1.a’) since .
b) In this case, there exists (possibly ) such that . In addition, there exists such that . In addition, we have that . Finally, since , we must have that . If we condition on , then the numerator and denominator both equal , so the fraction is (or ). Else, if we condition on , then the denominator is , and with probability either or . Therefore, . We can bound this (we defer the details to Appendix B) by
(1.b’) where and
In the remaining cases, we may assume that . Then, one of the following must occur:
c) . In this case, define , and note that . So, with probability , we have that , and otherwise, we have that . So, we can bound the ratio by
For such that it is clear that this function increases as increases, so it is maximized when , which means we can bound the ratio by
(1.c’)
d) but . First, we recall that . Now, let . In this case, with probability , (if we select to be in ); with probability , (if we select but not to be in ); and in the remaining event of probability , we still have that . So, we can bound the ratio by
Note that this is maximized when (since the numerator and denominator increase at the same rate when increases), so we can bound the ratio by
(1.d’)
e) There is more than one neighbor of in that is in . In this case, there is some other point not in such that . So, we have four points such that and
If we condition on , then the denominator equals and the numerator is at most , so the fraction is (or ). Else, if we condition on , then the denominator is , and the numerator is at most . Note that , that , and that . So, as in case b), the overall fraction is at most
(1.e’) where and
f) There are no neighbors of in that are in . In this case, define . Since , by the triangle inequality we have that . In addition, we still have that and , so together we have that . Since with probability , the ratio is at most
It is clear that this function is decreasing as is increasing (and nonnegative). So, we may assume WLOG that to bound this ratio by
If , then , so we can bound the above equation by . Otherwise, the above fraction can be rewritten as . For , this is maximized when over the range , so we can bound the ratio by
(1.f’)
g) There is a neighbor of in that is also in . In this case, either so , or there is some other point not in such that . If , then define and . In this case, and . So, the fraction is at most
Since and , we can bound the overall fraction as at most
(1.g.i’) We derive the final equality in Appendix B.
Alternatively, if , then if we condition on , the fraction is (or ), and if we condition on , the denominator is and the numerator is at most . (Note that and are independent.) Therefore, we can also bound the overall fraction by
(1.g.ii’)
h) There is a neighbor of in that is also in . In this case, would not be in , so we are back to subcase 1.a’.
Case 2: .
We again let be the witness of . In this case, if , then there exists such that , in which case . Otherwise, there exists such that , and there exists such that . Finally, in this case we also have that . Now, we consider two subcases: either or .
a) In this case, we have that the denominator is , and the numerator is either at most , or is at most , where , , , and . Hence, we can bound the overall fraction, by the same computation as in the -median subcase 1.b), as
(2.a’) where and
b) In this case, let be the unique point in . Then, conditioned on being in , the numerator and denominator both equal . Otherwise, the denominator is and we can bound the numerator the same way as in subcase 2.a), since the probability of is either (if ) or (if ). So, we can bound the overall fraction again as
(2.b’) where and
Case 3: , all other cases.
Note that in this case, we may assume , since we already took care of all cases when and . We split into two main subcases.
a) Every point in satisfies . In this case, let represent the set of points selected to be in . Note that is a random set.
Note that with probability at least , . (Since has size at least , the probability of being nonempty is minimized when , and the two points in map to the same point under .) In this event, let , and let represent the distances from to each of the points in . Then, by Lemma 6.1, . So, if we set , then for any and , which means that .
In addition, if , then , and otherwise, , because every point in is separated by distance at least , by Lemma 6.1. Overall, this means that whenever and .
In addition, if , then the denominator is and the numerator is at most . Therefore, if we let be the probability that and be the expectation of conditioned on , the overall fraction is at most
(3.a’)
b) There exists a point such that . In this case, note that for all points . Assuming , this is only possible if either:
i) and the unique points and satisfy , or
ii) , the unique point is the only point with , and every point in maps to under .
First, assume Case b)i. Let and . Then, , and the expected distance is at most . Since , this means that , so the overall fraction is at most
(3.b.i’) Next, assume Case b)ii. Let , and let be the distances from to each of the points in . Let . Then, . In addition, is at most . Since the numerator and denominator grow at the same rate with respect to , and the numerator grows slower with respect to than the denominator, we wish to minimize and to maximize the fraction. So, we set , and by Lemma 6.1. Therefore, the fraction is at most
(3.b.ii’)
Case 4: .
First, we will condition on the fair coin flips, and let be the set of “surviving” points, i.e., the points that will be included in with probability . Note all points in have pairwise distance at least from each other, and all points in have pairwise distance at least from each other also. However, the points in and are only guaranteed to have pairwise distance at least from each other. Let represent the size .
We consider several subcases.
a) In this case, we can use the same bounds as Cases 2 and 3 of the simpler -approximation, since we only have to worry about points in . Indeed, the same bounds on the numerator and denominator still hold, so the ratio is at most
(4.a’)
b) In this case, let be the unique point in , and let be the unique point in . Then, , so if and , then the denominator in expectation is , since . But, the numerator is at most , so the overall fraction is at most
(4.b’)
c) Let be the unique point in . Then, we must have that . Letting , we have that , but . However, we know that by Lemma 6.1, so the denominator is at least . So, the ratio is at most , which is maximized when is as small as possible, namely . So, the ratio is at most
(4.c’)
d) In this case, let be the unique point in , and let . Note that , so . In addition, if the distances from to the points in are , then . If we let , then . So, the overall fraction is at most . It is clear that this function decreases as increases, so we want to set as small as possible. However, we know that by Lemma 6.1, so the overall fraction is at most
The denominator clearly decreases as , so the overall fraction is at most the limit of the above as , which is
(4.d’)
e) In this case, let the distances from to the points in be , and let the distances from the points from to the points in be . Also, let and let . Then, we have that the numerator is at most , and the denominator is at least . Next, note that by Lemma 6.1, , so . So, the fraction is at most . This is exactly the same as in subcase 4.d), except there the denominator was , i.e., we just replaced with . So, the same calculations give us that we can bound the overall fraction by at most
(4.e’)
∎
Finally, we bound the actual LMP approximation constant, similar to Proposition 4.7 for the -means case. We have the following proposition, which will immediately follow from analyzing all subcases carefully (see Lemma 6.5).
Proposition 6.3.
For . Hence, we can obtain a -LMP approximation.
6.3 Improved -median approximation
In this section, we explain how our LMP approximation for -median implies an improved polynomial-time -median approximation for any fixed . We set , , and . In this case, we have that by Proposition 6.3.
Next, we have that all of the results in Subsections 5.1 and 5.3 hold in the -median context, with two changes. The first, more obvious, change is that Lemma 5.10 (and all subsequent results in Section 5.3) needs to use the function associated with -median as opposed to the function associated with -means.
The second change is that Lemma 5.9 no longer holds for , but still holds for for some fixed choice . Indeed, for Cases 1, 2, and 3 (i.e., when ), we have that for any , since , and if , and likewise if . So, for . For case of -median, we verify that after conditioning on . Indeed, if (i.e., subcase 4.a), then this just equals . In the remaining subcases, the value is always nonnegative as long as the denominator of our final fractions are also nonnegative. So, we just need that , , , and . These are all true as long as . Thus, we replace with .
Overall, the rest of Subsection 5.3 goes through, except that our final bound will be
(44)
where and . The main replacement here is that we replaced with . We can use this to obtain a -approximation, improving over . We will not elaborate on this, however, as we will see that using the method in Subsection 5.4, we can further improve this to .
We split the clients this time into groups. We let be the set of clients corresponding to all subcases in Cases 1 and 2, be the set of clients corresponding to all subcases in Case 3, and be the set of clients corresponding to all subcases in Case 4, as well as all bad clients . For any client , as in Subsection 5.4, we define and . We also define similarly to how we did for the -means case.
Similar to Lemma 5.18 in the -means case, we have the following result for the -median case.
Lemma 6.4.
Let , , and . For any client , . For any client , . Finally, for any client
Proof.
Recall that , , and . In case or , we have that and , so , and the sum of is merely over a single point, so is at most . Thus, if . (Note that this even holds for bad clients .)
In case , we have that , so . In addition, all points in are separated by at least . Hence, by Lemma 6.1, . Likewise, . So, . Thus, if .
In case , we have that . We claim that in this case, . This will follow from the fact that all points in are separated by at least , all points in are also separated by at least , and all points in are separated by at least from all points in . In fact, this immediately follows from the bounding of the denominators in subcases 4.a’, 4.b’, 4.c’, 4.d’, and 4.e’, where we replace with . Likewise, . Overall, we have that for all clients in case , . Since in all cases, it also holds for the bad clients as well. ∎
As a direct corollary, we have that
(45)
Next, similar to Lemma 5.19 in the -means case, we have the following result.
Lemma 6.5.
Let , and . Then, for all we have that , where and . In addition, for all , we have that and that .
Proof.
To bound , we simply analyze all subcases in Case 1 and Case 2 and set . This is straightforward to verify (see, for instance, our Desmos files on -median in Appendix A).
To bound , we analyze all subcases in Case 3. Subcases 3.a’ and 3.b.i’ are straightforward to verify. For subcase 3.b.ii’, we have to verify it for all choices of . For , we can verify manually. For , it is easy to see that the numerator of 3.b.ii’ is at most
and the denominator is at least . So, the fraction is at most
which is at most for all .
Finally, it is straightforward to check that all subcases in Case are at most for all . ∎
One can modify the remainder of the proof analogously to the -means case in Section 5.4. Hence, to show that we obtain an approximation , it suffices to show that for all choices of and that if we let , one cannot simultaneously satisfy
(46)
(47)
(48)
and
(49)
By numerical analysis of these linear constraints and based on the functions , we obtain a -approximation algorithm for Euclidean -median clustering. We defer the details to Appendix C.
Acknowledgments
The authors thank Ashkan Norouzi-Fard for helpful discussions relating to modifying the previous results on roundable solutions. The authors also thank Fabrizio Grandoni, Piotr Indyk, Euiwoong Lee, and Chris Schwiegelshohn for helpful conversations. Finally, we would like to thank an anonymous reviewer for providing a useful suggestion in removing one of the cases for -means.
References
- [1] Sara Ahmadian, Ashkan Norouzi-Fard, Ola Svensson, and Justin Ward. Better guarantees for k-means and Euclidean k-median by primal-dual algorithms. SIAM Journal on Computing, 49(4):FOCS17–97–FOCS17–156, 2019.
- [2] David Arthur and Sergei Vassilvitskii. k-means++: the advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, pages 1027–1035, 2007.
- [3] David Arthur and Sergei Vassilvitskii. Worst-case and smoothed analysis of the ICP algorithm, with an application to the k-means method. SIAM J. Comput., 39(2):766–782, 2009.
- [4] Vijay Arya, Naveen Garg, Rohit Khandekar, Adam Meyerson, Kamesh Munagala, and Vinayaka Pandit. Local search heuristics for k-median and facility location problems. SIAM J. Comput., 33(3):544–562, 2004.
- [5] Pranjal Awasthi, Avrim Blum, and Or Sheffet. Stability yields a PTAS for k-median and k-means clustering. In 51th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, October 23-26, 2010, Las Vegas, Nevada, USA, pages 309–318, 2010.
- [6] Pranjal Awasthi, Moses Charikar, Ravishankar Krishnaswamy, and Ali Kemal Sinop. The hardness of approximation of euclidean k-means. In Lars Arge and János Pach, editors, 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, volume 34 of LIPIcs, pages 754–767. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
- [7] Sayan Bandyapadhyay and Kasturi Varadarajan. On variants of k-means clustering. In 32nd International Symposium on Computational Geometry (SoCG 2016). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2016.
- [8] Luca Becchetti, Marc Bury, Vincent Cohen-Addad, Fabrizio Grandoni, and Chris Schwiegelshohn. Oblivious dimension reduction for k-means: beyond subspaces and the johnson-lindenstrauss lemma. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1039–1050, 2019.
- [9] Jaroslaw Byrka, Thomas W. Pensyl, Bartosz Rybicki, Aravind Srinivasan, and Khoa Trinh. An improved approximation for k-median and positive correlation in budgeted optimization. ACM Trans. Algorithms, 13(2):23:1–23:31, 2017.
- [10] Moses Charikar and Sudipto Guha. Improved combinatorial algorithms for the facility location and k-median problems. In 40th Annual Symposium on Foundations of Computer Science, FOCS ’99, 17-18 October, 1999, New York, NY, USA, pages 378–388, 1999.
- [11] Moses Charikar and Sudipto Guha. Improved combinatorial algorithms for facility location problems. SIAM J. Comput., 34(4):803–824, 2005.
- [12] Moses Charikar, Sudipto Guha, Éva Tardos, and David B. Shmoys. A constant-factor approximation algorithm for the -median problem. J. Comput. Syst. Sci., 65(1):129–149, 2002.
- [13] Moses Charikar and Shi Li. A dependent LP-rounding approach for the k-median problem. In Automata, Languages, and Programming - 39th International Colloquium, ICALP 2012, Warwick, UK, July 9-13, 2012, Proceedings, Part I, pages 194–205, 2012.
- [14] Vincent Cohen-Addad. A fast approximation scheme for low-dimensional k-means. In Artur Czumaj, editor, Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pages 430–440. SIAM, 2018.
- [15] Vincent Cohen-Addad, Andreas Emil Feldmann, and David Saulpic. Near-linear time approximations schemes for clustering in doubling metrics. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 540–559. IEEE, 2019.
- [16] Vincent Cohen-Addad, Anupam Gupta, Lunjia Hu, Hoon Oh, and David Saulpic. An improved local search algorithm for -median. In Proceedings of the Thirty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2022, pages 1556–1612. SIAM, 2022.
- [17] Vincent Cohen-Addad, Anupam Gupta, Amit Kumar, Euiwoong Lee, and Jason Li. Tight FPT approximations for k-median and k-means. In International Colloquium on Automata, Languages, and Programming (ICALP), pages 42:1–42:14, 2019.
- [18] Vincent Cohen-Addad and Karthik C. S. Inapproximability of clustering in lp metrics. In David Zuckerman, editor, 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, USA, November 9-12, 2019, pages 519–539. IEEE Computer Society, 2019.
- [19] Vincent Cohen-Addad, Euiwoong Lee, and Karthik C. S. Johnson coverage hypothesis: Inapproximability of -means and -median in metrics. In Proceedings of the 2022 ACM-SIAM Symposium on Discrete Algorithms, SODA 2022. SIAM, 2022.
- [20] Vincent Cohen-Addad and Claire Mathieu. Effectiveness of local search for geometric optimization. In 31st International Symposium on Computational Geometry, SoCG 2015, June 22-25, 2015, Eindhoven, The Netherlands, pages 329–343, 2015.
- [21] Vincent Cohen-Addad, Karthik C. S., and Euiwoong Lee. On approximability of clustering problems without candidate centers. In Dániel Marx, editor, Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, pages 2635–2648. SIAM, 2021.
- [22] Vincent Cohen-Addad and Chris Schwiegelshohn. On the local structure of stable clustering instances. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, Berkeley, CA, USA, October 15-17, 2017, pages 49–60. IEEE Computer Society, 2017.
- [23] Sanjoy Dasgupta. The hardness of k-means clustering. Department of Computer Science and Engineering, University of California …, 2008.
- [24] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In STOC, pages 569–578, 2011.
- [25] Fabrizio Grandoni, Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Rakesh Venkat. A refined approximation for euclidean k-means. Inf. Process. Lett., 176:106251, 2022.
- [26] Sudipto Guha and Samir Khuller. Greedy strikes back: Improved facility location algorithms. J. Algorithms, 31(1):228–248, 1999.
- [27] Venkatesan Guruswami and Piotr Indyk. Embeddings and non-approximability of geometric problems. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, January 12-14, 2003, Baltimore, Maryland, USA., pages 537–538, 2003.
- [28] S Louis Hakimi. Optimum locations of switching centers and the absolute centers and medians of a graph. Operations research, 12(3):450–459, 1964.
- [29] Kamal Jain, Mohammad Mahdian, Evangelos Markakis, Amin Saberi, and Vijay V. Vazirani. Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP. J. ACM, 50(6):795–824, 2003.
- [30] Kamal Jain and Vijay V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and lagrangian relaxation. Journal of the ACM, 48(2):274–296, 2001.
- [31] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 28(2-3):89–112, 2004.
- [32] Madhukar R. Korupolu, C. Greg Plaxton, and Rajmohan Rajaraman. Analysis of a local search heuristic for facility location problems. J. Algorithms, 37(1):146–188, 2000.
- [33] Amit Kumar, Yogish Sabharwal, and Sandeep Sen. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM, 57(2):5:1–5:32, 2010.
- [34] Euiwoong Lee, Melanie Schmidt, and John Wright. Improved and simplified inapproximability for k-means. Inf. Process. Lett., 120:40–43, 2017.
- [35] Shi Li. A 1.488 approximation algorithm for the uncapacitated facility location problem. Inf. Comput., 222:45–58, 2013.
- [36] Shi Li and Ola Svensson. Approximating k-median via pseudo-approximation. SIAM J. Comput., 45(2):530–547, 2016.
- [37] S. P. Lloyd. Least squares quantization in PCM. Bell Telephone Laboratories paper, 1957. Published later in IEEE Trans. Inform. Theory, 28(2):129–137, 1982.
- [38] Konstantin Makarychev, Yury Makarychev, and Ilya P. Razenshteyn. Performance of johnson-lindenstrauss transform for k-means and k-medians clustering. In Moses Charikar and Edith Cohen, editors, Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA, June 23-26, 2019, pages 1027–1038. ACM, 2019.
- [39] Konstantin Makarychev, Yury Makarychev, Maxim Sviridenko, and Justin Ward. A bi-criteria approximation algorithm for k-means. In Klaus Jansen, Claire Mathieu, José D. P. Rolim, and Chris Umans, editors, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, APPROX/RANDOM 2016, September 7-9, 2016, Paris, France, volume 60 of LIPIcs, pages 14:1–14:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2016.
- [40] Jirí Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 24(1):61–84, 2000.
- [41] Nimrod Megiddo and Kenneth J Supowit. On the complexity of some common geometric location problems. SIAM journal on computing, 13(1):182–196, 1984.
- [42] Rafail Ostrovsky, Yuval Rabani, Leonard J. Schulman, and Chaitanya Swamy. The effectiveness of Lloyd-type methods for the k-means problem. J. ACM, 59(6):28, 2012.
- [43] Hugo Steinhaus. Sur la division des corps matériels en parties. Bull. Acad. Pol. Sci., Cl. III, 4:801–804, 1957.
Appendix A Desmos Graphs and Code
Here, we provide links for the Desmos files used to visualize the LMP approximations for both k-means and k-median, and for the Python code used to improve the approximation factor for k-means.
We provide graphs on Desmos of the LMP approximation bounds for k-means and k-median, as functions of the probability parameter. We remark that in some of these cases, there may be additional parameters that need to be set properly (which can be done via toggles on the respective Desmos link) to see the actual approximation ratio as a function of the probability.
Case of k-means is available here: https://www.desmos.com/calculator/jd8ud6h2e9
Case of k-means is available here: https://www.desmos.com/calculator/pgtylk9eui
Case of k-means is available here: https://www.desmos.com/calculator/zjshynypsh
Case of k-means is available here: https://www.desmos.com/calculator/ibwult8qzs
Case of k-median is available here: https://www.desmos.com/calculator/9qmscsfvrr
Case of k-median is available here: https://www.desmos.com/calculator/rdidyxhs2o
Case of k-median is available here: https://www.desmos.com/calculator/zoeswetvyz
Case of k-median is available here: https://www.desmos.com/calculator/mpwrmz7mhe
The Python source code for k-means is available at
https://drive.google.com/file/d/1mzKPr4ZbXe7FPDtx8JBkz2ReCK4OriQz/view?usp=sharing, and the Python source code for k-median is available at
https://drive.google.com/file/d/1SEfUvHaOd78QgxCFaFA8OjuDsKb3Qg0I/view?usp=sharing. One can also view the code in a more readable PDF format for k-means at
https://drive.google.com/file/d/1Ujcd6znbwxOkG-72zBZGX3otXPPAScud/view?usp=sharing,
and for -median at
https://drive.google.com/file/d/15HP3wBN20tCanwAc1dAA3rvjI23drdcO/view?usp=sharing.
Appendix B Omitted Details for the LMP Approximations
First, we prove Proposition 4.4.
Proof of Proposition 4.4.
Let , , and Then,
since . Now, we can write
So, we have that is at most
(50)
It is simple to see that (50) is nondecreasing in for a fixed , so (50) is maximized when . Next, when , it is clear that (50) is non-increasing in if and likewise for , so (50) is maximized for some . In this case, (50) simplifies to
Now, using the fact that and that , we have that this expression is nondecreasing in both as long as . So, we may upper bound (50), and thus , by
by setting . ∎
Next, we fill in the details of Lemmas 4.2 and 6.2 that were omitted from the main body of the paper.
K-means: Case 1.c:
We wish to maximize
over . First, note that if , then we can bound this fraction by . Alternatively, if , then this fraction is at most
where the fraction on the left-hand side has its numerator and denominator increasing at the same rate in terms of , so it is maximized when is minimized, i.e., . Thus, the overall fraction is at most
K-means: Case 1.g.i:
We wish to maximize
over and . First, note that if we treat as a variable, the numerator and denominator increase at the same rate as increases, so this fraction is maximized when . If , then this fraction equals , but since and this means that . Alternatively, we are maximizing
over . Next, note that if and then increasing will decrease and increase . So, the denominator decreases and the numerator either increases or decreases at a slower rate. Thus, we may assume that either or that . In the case where , we wish to maximize
Writing , and , we can verify that and are both increasing functions in over , which means that so is . Therefore, the overall maximum is at most
K-means: Case 2.d:
Our goal is to maximize
over and (and where ). By symmetry, we may assume WLOG that , and replace with . Next, note that increasing only increases the overall fraction, so we may increase until we have that . So, we now wish to maximize
over subject to (since and ). But, note that if , then any decrease in either or until we have that will decrease both the numerator and the denominator by the same amount, and so will increase the fraction. Thus, we may assume that .
In this case, we may rewrite our goal as maximizing
(51)
over subject to and . Now, for any fixed , we note that for any multiplies the numerator of the fraction in (51) by a factor, but multiplies the denominator of the fraction by less than a factor, since . Therefore, the fraction increases overall, which means that to maximize , we may always assume that . It is easy to see that this automatically implies that when .
Thus, our goal is to maximize
subject to and . Maximizing this, however, amounts to minimizing , which is easy to solve: and , which means that
K-median: Case 1.b’:
It suffices to prove the following proposition.
Proposition B.1.
Let , and suppose that we have points in Euclidean space such that and . Then, for any ,
where and
K-median: Case 1.g.i’:
Our goal is to maximize
over . First, note that if , then since , this means that . In this case, the fraction equals , since and .
Alternatively, we have that . Let , so . In this case, we wish to maximize
over and . Since , we have that increasing increases the fraction overall. So, we may assume that . In this case, we are trying to maximize the fraction
Since , the numerator increases and the denominator decreases as increases, so the fraction increases overall. Thus, this fraction is maximized when , and equals
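Since the closed-form expressions in the case analyses above are elided here, the following is only a generic sketch of how such box-constrained maximizations can be sanity-checked numerically; the fraction `f` and the bounds are hypothetical placeholders, not the paper's actual expressions.

```python
# Generic numeric sanity check for the case analyses above: approximate
# the maximum of a ratio over a box by dense sampling. `f` stands in for
# any of the fractions analyzed in this appendix (hypothetical here).
import itertools
import numpy as np

def box_maximum(f, bounds, steps=200):
    """Approximate max of f over the box given by [(lo, hi), ...]."""
    grids = [np.linspace(lo, hi, steps) for lo, hi in bounds]
    return max(f(*pt) for pt in itertools.product(*grids))

# Placeholder example on [0, 1] x [1, 2]:
print(box_maximum(lambda x, y: (x + y) / (1 + x * y), [(0, 1), (1, 2)]))
```

Such sampling only corroborates a bound; the rigorous maxima come from the monotonicity arguments given above.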
Appendix C Numerical Analysis for Euclidean k-means and k-median
C.1 The k-means case
We recall that our goal is to show, for an appropriate choice of , that for any and any , we cannot simultaneously satisfy
(52)
(53)
(54)
and
(55)
where we will let and be arbitrary nonnegative reals. For , we recall that . Now, note that if we increase , Equations (53) and (54) become harder to satisfy, since in both equations the left-hand side has a greater slope as a function of than the right-hand side. As a result, we may assume that , which we know is nonnegative since and , so for all .
Now, we note that we may assume . This is because if then and it is easy to verify that for any (for instance, by using Lemma 5.19 to bound , and , and using Cases 1.g.i and 2.d for and ). Therefore, for any if then Equations (52) and (54) cannot hold simultaneously. In addition, we may also assume that since if , then we can use the simpler bound of .
We recall that for , , , , and .
Now, let , and suppose there exist and such that Equations (52), (53), (54), and (55) can be simultaneously satisfied for nonnegative . Then, in fact, we must be able to satisfy the weaker conditions
and (55), while having all be nonnegative. Indeed, the conditions are weaker since we have decreased the value of and increased all terms on the right-hand side (noting that each is a non-increasing function in the range ).
For every and such that are integral multiples of , we look at the intervals and . If this region has a nonnegative solution to these inequalities, then we further partition the region into a grid of dimensions . If one of these regions has a nonnegative solution to these inequalities, we partition one step further into a grid of dimensions (we will not need to partition beyond this). Using this procedure, we are able to verify that there is no solution in and when , which allows us to establish that our algorithm provides a polynomial-time -approximation for Euclidean k-means clustering.
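A minimal sketch of this coarse-to-fine verification follows, under stated assumptions: the hypothetical `cell_lp(cell)` is assumed to return NumPy arrays `(A_ub, b_ub)` encoding the relaxed inequalities (52)–(55) over a grid cell, with all unknowns constrained to be nonnegative.

```python
# Sketch of the grid search described above: a cell whose relaxed LP is
# infeasible is certified to contain no solution; otherwise the cell is
# subdivided and each piece is checked, up to `depth` refinement levels.
import itertools
import numpy as np
from scipy.optimize import linprog

def cell_is_feasible(A_ub, b_ub):
    """True iff some nonnegative x satisfies A_ub @ x <= b_ub."""
    n = A_ub.shape[1]
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * n, method="highs")
    return res.status == 0  # 0 = feasible optimum found, 2 = infeasible

def subdivide(cell, factor):
    """Split a box [(lo, hi), ...] into factor^d congruent sub-boxes."""
    axes = []
    for lo, hi in cell:
        pts = np.linspace(lo, hi, factor + 1)
        axes.append(list(zip(pts[:-1], pts[1:])))
    return [list(box) for box in itertools.product(*axes)]

def verify_no_solution(cell, cell_lp, depth, factor):
    """Certify that no point of the cell admits a nonnegative solution."""
    if not cell_is_feasible(*cell_lp(cell)):
        return True                # relaxation already rules the cell out
    if depth == 0:
        return False               # could not certify at the finest level
    return all(verify_no_solution(sub, cell_lp, depth - 1, factor)
               for sub in subdivide(cell, factor))
```

The design point, as in the text, is that the relaxed constraints only become looser over a cell, so an infeasible relaxation certifies the whole cell, while a feasible relaxation triggers refinement rather than a conclusion.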
See Appendix A for links to the Python code.
C.2 The k-median case
The k-median case is almost identical, except for the modified equations and modified choices of . This time, we wish to show that for any and , there exist and such that one cannot satisfy
where are nonnegative. In the above equations, we set , , and . Also, recall that for we have that where and and that .
We remark that in the final equation, we use instead of ; this is because is a decreasing function on the region , as opposed to and , which are both decreasing functions.
First, we may assume that . Indeed, if , one can use the more naive bound of , which is less than . If , one can instead use the bound of .
To finish, we apply a similar method as in the k-means case. We split the region into grid blocks of size , with as the endpoints in each direction. For each grid block, we verify that the linear program has no solution when : if a block does admit a solution, we further refine it into smaller -sized pieces and verify each of the smaller pieces.
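Hypothetically, this verification could reuse the same driver as in the k-means sketch of Section C.1, swapping in a k-median constraint generator; the name `kmedian_cell_lp` below is illustrative, not the paper's code.

```python
# Hypothetical driver for the k-median grid, reusing verify_no_solution
# from the k-means sketch above. `kmedian_cell_lp` (assumed) would emit
# the relaxed k-median inequalities for a given grid cell.
coarse_cell = [(0.0, 1.0), (0.0, 1.0)]    # placeholder parameter ranges
certified = verify_no_solution(coarse_cell, kmedian_cell_lp,
                               depth=2, factor=10)
print("no nonnegative solution in any cell:", certified)
```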
See Appendix A for links to the Python code.
Appendix D Changes to Construction of Roundable Solutions
In this section, we explain how Ahmadian et al. [1] implicitly prove Theorem 5.7, up to some minor modifications of their algorithm and analysis. Because the algorithm and analysis are almost entirely the same, we only describe the differences between the algorithm and analysis in [1] and what we need for our Theorem 5.7.
The only changes in the overall algorithm will be as follows. We will set some small constant such that . We will then set as opposed to in [1, Algorithm 2, Line 4], and set the definition of stopped [1, Section 7] to be that is stopped if such that , as opposed to such that . Here, we are letting for .
We now describe how this changes the claims throughout [1, Sections 7-8]. We only describe the changes to the statements, because the proofs do not change at all. For most of the remainder of this appendix, we will consider a new definition of roundable, neither the one in [1] nor our Definition 5.1. Our modified definition will instead require that: for all and all , , which replaces Conditions 3a and 3b in Definition 5.1 (or Condition 2a in [1, Definition 5.1]). In addition, our modified definition will require that for all clients , for all , which replaces our Condition 3c in Definition 5.1 (or Condition 2b in [1, Definition 5.1]).
We are now ready to describe how each claim in [1] changes (or stays the same).
[1, Lemma 7.1] still holds with our new definition of stopped: the same proof still works.
The (unnumbered) claim in the first paragraph of [1, Section 8] still goes through, with our modified definition of roundable (i.e., the one presented in this section).
In [1, Section 8.1], both [1, Lemma 8.1] and [1, Lemma 8.2] still hold with our new definition of stopped, with essentially no changes to the proof. Likewise, in [1, Section 8.2], Lemmas 8.3, 8.4, 8.5 and Corollary 8.6 in [1] still hold with our new definition of stopped.
In [1, Section 8.3], we change the definition of , the “potentially bad” clients (see [1, Equation (8.1)]), to be
where refers to the value of at the start of a call to RaisePrice (i.e., when the solution is labeled as in our Algorithm 2). This contrasts with the original definition of , which was the set of undecided clients with , in a similar way to how our definition of stopped contrasts with the original definition.
[1, Lemma 8.7] is now as follows. For any produced during RaisePrice, for every client the following holds:
- If then there exists a tight facility such that for all .
- There exists a tight facility such that for all .
Again, the same proof holds.
We now move to [1, Section 8.4]. We update [1, Lemma 8.8] to be that if has a tight edge to some facility , then for any with a tight edge to . In the proof, we would replace the stronger statement [1, Equation (8.2)] with: . In addition, we would update the Claim inside the proof of [1, Lemma 8.8] to be that: there is some tight facility in and also:
Up to these changes, the rest of the proof of Lemma 8.8 in [1] is essentially unchanged.
Next, we update [1, Lemma 8.9] to replace “ or ” with “ or ”; again, the same proof still holds.
[1, Proposition 8.10] and its proof still hold, except that we have to replace with . So, if we set for our new choice of , Proposition 8.10 in [1] holds.
We update [1, Proposition 8.11] to say that if for some produced by RaisePrice, then . We replace the last equation in the Claim in the Proposition’s proof with: . The same proof still holds. We also update the definition of to be the set where RaisePrice defines the parameters based on the shift parameter . With these definitions, and our modified choice of , we will have that [1, Corollary 8.12] still holds.
We now move to [1, Section 8.5]. We keep their definitions of -close neighborhoods and of dense facilities and clients. We also let the sets , and be defined the same way (modulo our change in definition of ). We also define the same way, and we will let and simply represent the conflict graphs and as generated by our Algorithm 2, respectively, at each iteration corresponding to making a new solution . Note that we are choosing , so .
With these, it is quite simple to see that [1, Lemma 8.13] still holds, where the choice in the proof is the approximation constant of the LMP algorithm in [1] with only a single parameter based on . In addition, [1, Lemma 8.14] still holds, except that we replace with , since the final inequality in the proof now relates to , because is the minimum of and all sizes of the sets that become at some point. [1, Corollary 8.15] (and the following Remark 8.16) also hold, due to our updated definitions of and .
Now, we update [1, Lemma 8.17] to say: for any , either:
- There exists a tight facility such that for all ,
- There exists a special facility such that for all ,
Again, the proof holds with minimal change.
We now move to [1, Section 8.6], the final section of Ahmadian et al.’s analysis. We first look at how [1, Proposition 8.18] changes. The facts that is feasible for , that for all , and that are all still true. We now have that for all clients not in , by using our modified versions of Lemma 8.7 and Lemma 8.17. In addition, our modified Lemma 8.7 tells us that even for bad clients , there exists a tight facility such that for all . Hence, we precisely have that for all , by adding over all clients and using Corollary 8.15, which still holds unchanged, apart from replacing with . The final part of proving [1, Proposition 8.18], i.e., verifying Condition 4 in our Definition 5.1, holds where the only change is replacing with . Thus, we have that each solution that is generated is -roundable.
Finally, we have that [1, Theorem 8.19] still holds with essentially no change, meaning that each call to RaisePrice takes polynomial time and generates a polynomial number of -roundable solutions for our modified definition of roundable. This also implies that Algorithm 2 runs in polynomial time, since GraphUpdate clearly takes polynomial time, and since the total number of times we call RaisePrice is at most , which is polynomial since and are both polynomial in .
Now, we note that each time we update our quasi-independent set in GraphUpdate, the new set only depends on and has no dependence on our choice of or . Therefore, if we ignore the sets and only focus on , the procedure of generating the sequence is in fact identical to the procedure in Ahmadian et al. [1]. The only difference is that we choose our stopping point based on the first time that , as opposed to the first time that as done in [1]. Because of this, our Algorithm 2 in fact works exactly as the main algorithm in [1] if we only focus on and set . The only differences are the way we choose when to stop the procedure, the way we update , and the definitions of stopped clients and of the set .
As a result, we have that each solution generated is -roundable, up to our modified definition of roundable (where is chosen accordingly). By this, we mean that we replace Condition 3 in Definition 5.1 with:
- a) For all and all .
- b) For all and all , .
Recall that . Now, by setting , we have that
and by setting , we have
Finally, by setting , we have
Therefore, by setting , we have that and . In addition, if we set , we have that . Therefore, since , and if we assume that , then we have that and . So, replacing with and with , we obtain an algorithm that still runs in polynomial time (since the old values of are still polynomial factors in the new values of , which are all constants, even if they are arbitrarily small). But now, we have that each solution satisfies the actual Condition 3 in Definition 5.1, for our new values of and .
Overall, we have that the algorithm runs in polynomial time, and each solution is -roundable, where is the minimum of and over the course of the algorithm. Each pair of consecutive solutions is close, as in Theorem 8.19 in [1] (which follows from their Proposition 8.10). Next, we have that each time we create a solution , Lemma 8.1 in [1], which holds in our setting, tells us that every client is decided. Since is a subset of undecided clients for a solution , this means that , based on the definition of . In addition, our modified version of [1, Lemma 8.7] holds for all , since , which means that we can set the bad clients to be . So, for each , the special facilities and bad clients are both empty. Next, we have that is a nested quasi-independent set because of how we defined and in our GraphUpdate procedure. Finally, we had that , as described at the end of Subsection 5.1, and that we created from by removing a single point from if (which may or may not be in ), and then extending to a maximal independent set of . So, we have that . This means that all of the statements of Theorem 5.7 hold.
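As an aside, the final step above, extending to a maximal independent set of the conflict graph, is a standard greedy procedure; the following minimal sketch (with illustrative names, not the paper's implementation) shows that step in isolation.

```python
# Greedily extend an independent set I to a maximal one in a graph H,
# where H maps each vertex to its set of neighbors. This illustrates the
# generic extension step referenced above, not the paper's exact code.
def extend_to_maximal_independent_set(H, I):
    I = set(I)
    for v in H:
        if v not in I and not any(u in H[v] for u in I):
            I.add(v)
    return I

# Tiny example: a path 1-2-3 plus an isolated vertex 4.
H = {1: {2}, 2: {1, 3}, 3: {2}, 4: set()}
print(extend_to_maximal_independent_set(H, {2}))  # {2, 4}
```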
Appendix E Limit of [1] in Obtaining Improved Approximations
In this section, we show that the algorithm of Ahmadian et al. [1] cannot guarantee an LMP approximation better than in the case of k-median. In more detail, we show that there exists a set of clients , facilities , and a parameter such that for any choice in the pruning phase, the LMP algorithm described in the preliminary Subsection 5.1 does not obtain better than a -approximation for k-median. As a result, their technique cannot guarantee an LMP approximation for all choices , which means any improvement to their analysis would have to move significantly outside the LMP framework.
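To make the primal/dual comparisons below concrete, the following minimal sketch evaluates the LMP ratio of a candidate instance, assuming the standard Jain–Vazirani form of the dual (the sum of the client duals minus λ times the number of opened facilities); the names and the explicit dual form are our illustrative assumptions, not code from the paper.

```python
# Minimal sketch: primal is the k-median connection cost of the opened
# set S; the LMP dual is assumed to be sum(alpha) - lam * |S|, in the
# Jain-Vazirani style. Points are coordinate tuples.
import math

def kmedian_cost(clients, S):
    return sum(min(math.dist(c, f) for f in S) for c in clients)

def lmp_ratio(clients, S, alpha, lam):
    dual = sum(alpha) - lam * len(S)
    return kmedian_cost(clients, S) / dual

# With clients and facilities placed as in the gadgets below, one would
# tabulate lmp_ratio for each possible outcome of the pruning phase.
```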
We start with the k-median case. First, consider the points , , and such that are collinear in that order, , and for some choice of . Consider applying the LMP algorithm described in Section 2.2 to just these points and , where we set and include a large number of copies of . In this case, the growing phase will set , where and both become tight. Also, (with each copy of barely not being in it) and . One also obtains that . Then, if , are connected in the conflict graph, which means that the pruning phase will only allow either or to be in our set . This choice is arbitrary, and the algorithm may set . In this case, the total clustering cost is whereas the dual is . If , then both are included, so the primal is and the dual is .
Next, we consider a point as well as points , such that form a regular simplex with centroid and pairwise distances between each and , for some and arbitrarily small . Consider applying the LMP algorithm described in Section 2.2 to just these points and , where we set . In this case, we will have that since for all , all facilities will become tight with for all . If , since the pairwise distances are more than , the conflict graph will be empty, so all facilities will be in the independent set. Therefore, the clustering cost will be , and the dual will be
Else, if , the conflict graph is complete on , so only one facility will be in the independent set. The clustering cost is still , and the dual will be
Now, we fix as a very small constant, and . We then set and . Finally, we set , and consider the concatenation of the two cases described above, where the corresponding instances are placed sufficiently far apart in Euclidean space that there is no interaction.
If , then the overall clustering cost is
whereas the total dual is
So, we do not obtain better than a -approximation in this case. If , then the total dual is in fact negative, as it is at most
Overall, there is no choice of that improves upon a -approximation.