MdB is supported by the Dutch Research Council (NWO) through Gravitation-grant NETWORKS-024.002.003. Department of Mathematics and Computer Science, TU Eindhoven, the [email protected] Department of Mathematics and Computer Science, TU Eindhoven, the [email protected] Department of Mathematics and Computer Science, TU Eindhoven, the [email protected] \CopyrightMark de Berg and Morteza Monemizadeh and Yu Zhong\ccsdesc[100]Theory of computation Design and analysis of algorithms \EventEditorsJohn Q. Open and Joan R. Access \EventNoEds2 \EventLongTitle42nd Conference on Very Important Topics (CVIT 2016) \EventShortTitleCVIT 2016 \EventAcronymCVIT \EventYear2016 \EventDateDecember 24–27, 2016 \EventLocationLittle Whinging, United Kingdom \EventLogo \SeriesVolume42 \ArticleNo23
-Center Clustering with Outliers in the Sliding-Window Model
Abstract
The -center problem for a point set asks for a collection of congruent balls (that is, balls of equal radius) that together cover all the points in and whose radius is minimized. The -center problem with outliers is defined similarly, except that of the points in need not be covered, for a fixed parameter . We study the -center problem with outliers in data streams in the sliding-window model. In this model we are given a possibly infinite stream of points and a time window of length , and we want to maintain a small sketch of the set of points currently in the window such that using the sketch we can approximately solve the problem on .
We present the first sketch for the -center problem with outliers in this setting. Our sketch works for the case where the points come from a finite space of bounded doubling dimension. It provides a -approximation using storage, where is the doubling dimension of the underlying space and is its spread.
keywords:
Streaming algorithms, -center problem, sliding window, bounded doubling dimension1 Introduction
Clustering is one of the most important tools to analyze large data sets. A well-known class of clustering algorithms is formed by centroid-based algorithms, which include -means clustering, -median clustering and -center clustering. The latter type of clustering is the topic of our paper. In the -center problem one is given a set of points from a metric space and a parameter , and the goal is to find congruent balls balls (that is, balls of equal radius) that together cover the points from and whose radius is minimized. Note that the special case corresponds to the minimum-enclosing ball problem. Data sets in practice often contain outliers, leading to the -center problem with outliers. Here we are given, besides and , a parameter that indicates the allowed number of outliers. Thus the radius of the balls in an optimal solution is given by
:= the smallest radius such that we can cover all points from , except for at most outliers, by balls of radius .
In this paper we study the -center problem with outliers in data streams, where the input is a possibly infinite stream of points. The goal is to maintain a solution to the -center problem as the points arrive over time, without any knowledge of future arrivals and using limited (sub-linear) storage. Since we cannot store all the points in the stream, we cannot expect to maintain an optimal solution. Hence, the two main quality criteria of a streaming algorithm are its approximation ratio and the amount of storage it uses. We will study this problem in the sliding-window model. In this model we are given a window length and we are, at any time , only interested in the points that arrived in the time window . Working in the sliding-window model is often significantly more difficult than working in the standard (insertion-only) streaming model.
Previous work. Charikar et al.[DBLP:journals/siamcomp/CharikarCFM04] were the first to study the metric -center problem in data streams. They developed a streaming algorithm that computes an -approximation for the -center problem using space. Later McCutchen and Khuller [DBLP:conf/approx/McCutchenK08] improved the approximation ratio to at the cost of increasing the storage to . McCutchen and Khuller also studied the -center problem with outliers, for which they gave a -approximation algorithm that requires space.
The above results are for general metric spaces. In spaces of bounded doubling dimension111The doubling dimension of a space is the smallest number such that any ball in the space can be covered by balls of radius . better bounds are possible. Indeed, Ceccarello, Pietracaprina and Pucci [DBLP:journals/pvldb/CeccarelloPP19] gave a -approximation algorithm for the -center problem with outliers, thus improving the approximation ratio for general metrics. Their algorithm requires storage, where is the doubling dimension of the underlying space (which is assumed to be a fixed constant).
The algorithms mentioned so far are deterministic. Charikar, O’Callaghan, and Panigrahy [DBLP:conf/stoc/CharikarOP03] and Ding, Yu and Wang [DBLP:conf/esa/DingYW19] studied sampling-based streaming algorithms for the Euclidean -center problem with outliers, showing that if one allows slightly more than outliers then randomization can reduce the storage requirements. Our focus, however, is on deterministic algorithms.
For the -center problem in the sliding-window model, the only result we are aware of is due to Cohen-Addad, Schwiegelshohn and Sohler [DBLP:conf/icalp/Cohen-AddadSS16]. They deal with the -center problem in general metric spaces, but without outliers, and they propose a -approximation algorithm using storage, and a -approximation for the special case . Here denotes the spread of the underling space , and it is assumed that the values and are known to the algorithm. They also prove that any algorithm for the -center problem with outliers in general metric spaces that achieves an approximation ratio of less than requires space, where is the size222Here the window size is defined in terms of the number of points in the window, that is, the window consists of the most recent points. We define the window in a slightly more general manner, by defining to be the length (that is, duration) of the window. Note that if we assume that the -th point arrives at time , then the two models are the same. of the window. Table 1 gives an overview of the known results on the -center problem in the insertion-only and the sliding-window model.
model | metric space | approx. | storage | outliers | ref. |
---|---|---|---|---|---|
insertion-only | general | no | [DBLP:journals/siamcomp/CharikarCFM04] | ||
general | no | [DBLP:conf/approx/McCutchenK08] | |||
general | yes | [DBLP:conf/approx/McCutchenK08] | |||
bounded doubling | yes | [DBLP:journals/pvldb/CeccarelloPP19] | |||
sliding window | general | no | [DBLP:conf/icalp/Cohen-AddadSS16] | ||
bounded doubling | yes | here |
While our main interest is in the -center problem for , we will automatically obtain a result for the 1-center problem. Hence, we also briefly discuss previous results for the 1-center problem.
For the 1-center problem in -dimensional Euclidean space, streaming algorithms that maintain an -kernel give a -approximation. An example is the algorithm of Zarabi-Zadeh [z-cpa-08] which maintains an -kernel of size . Moreover, using only storage one can obtain a 1.22-approximation for the 1-center problem without outliers [DBLP:journals/algorithmica/AgarwalS15, cp-sdameb-14]. For the 1-center problem with outliers, one can obtain a -approximation algorithm that uses storage using the technique of Agarwal, Har-Peled and Yu [DBLP:journals/dcg/AgarwalHY08]. Zarrabi-Zadeh and Mukhopadhyay [DBLP:conf/cccg/Zarrabi-ZadehM09] studied the -center problem with outliers in high-dimensional Euclidean spaces, where is not considered constant, giving a -approximation algorithm that requires space. Recently, Hatami anad Zarrabi-Zadeh [DBLP:journals/comgeo/HatamiZ17] extended this result to -center problem with outliers, obtaining a -approximation using storage. None of the 1-center algorithms discussed above works in the sliding-window model.
As problem that is closely related to the 1-center problem is the diameter problem, where the goal is to maintain an approximation of the diameter of the points in the stream. This problem has been studied in the sliding-window model by Feigenbaum, Kannan, Zhang [DBLP:journals/algorithmica/FeigenbaumKZ04] and later by Chan and Sadjad [cs-gosw-06], who gave a -approximation for the diameter problem (without outliers) in the sliding window model, using storage.
Our contribution. We present the first algorithm for the -center problem with outliers in the sliding-window model. It works in spaces of bounded doubling dimension and yields a -approximation. So far a -approximation was not even known for the -center problem without outliers in the insertion-only model. Our algorithm uses storage, where is the doubling dimension and is the spread of the underlying space, as defined above. Thus for the 1-center problem we obtain a solution that uses storage. This solution also works for the diameter problem. Note that also for the 1-center problem with outliers (and the diameter problem with outliers) an algorithm for the sliding-window model was not yet known. A useful property of the sketch333We use the word sketch even though we do not study how to compose the sketches for two separate streams into a sketch for the concatenation of the stream, as the term sketch seems more appropriate than data structure, for example. maintained by our algorithm for the -center problem, is that is can also be used for the -center problem for any , as well as for the diameter problem.
As in the previous papers on the -center problem (or the diameter problem) in the sliding-window model [cs-gosw-06, DBLP:conf/icalp/Cohen-AddadSS16, DBLP:journals/algorithmica/FeigenbaumKZ04], we assume that the values and are known to the algorithm. A typical example is when the input consists of points in -dimensional Euclidean space with integer coordinates in a given range . In this example, the spread is .
2 The algorithm
Let be a possibly infinite stream of points from a metric space of doubling dimension and spread , where is considered to be a fixed constant. We denote the arrival time of a point by . We say that expires at time , where is the given length of the time window. To simplify the exposition, we assume that all arrival times and departure times (that is, times at which a point expires) are distinct. For a time we define to be the set444We allow the same point from to occur multiple times in the stream, so is actually a multi-set. Whenever we refer to “sets” in the remainder of the paper we mean “multi-sets”. of points currently in the window. In other words, . For a point and a parameter , use to denote the ball with center and radius .
In the following we show how to maintain a sketch of for the -center problem with outliers. We first present a sketch for a decision version of the problem. (Actually our sketch is a bit more powerful than a decision algorithm, as explained below.) Given this sketch, it will be easy to develop a sketch for the optimization version of the problem.
2.1 A sketch for the decision problem
For a set of points, a number of outliers and a number of centers , we define
:= the smallest radius such that we can cover all points from , except for at most outliers, by balls of radius .
Let be a given parameter with . The goal of this section is design a sketch and a corresponding decision algorithm TryToCover with the following properties:
-
•
The sketch uses storage.
-
•
TryToCover either reports a collection of balls of radius that together cover all points in except for at most outliers, or it report no. In latter case we have .
Our sketch is a tuple ,
where is a timestamp, is a collection of balls, and
is a collection of so-called representative sets.
The sketch has the following properties:
- (Prop-1)
-
For any time with we are guaranteed that . The idea is that when we find points with pairwise distances greater , then we can set to the expiration time of the oldest of these points, and we can delete this point (as well as any older points) from the sketch.
- (Prop-2)
-
The number of balls in is . Each ball has radius and the center of , denoted by , is a point from the stream. Note that the center does not need to be a point from . The balls in are well spread in the sense that no ball contains the center of any other ball. In other words, we have for any two balls .
- (Prop-3)
-
For each ball the set contains a representative set , and these are the only sets in . The representative sets are pairwise disjoint, and each set contains at most points.
Define a ball to be full when , and let . Then for any point we have (i) , and/or (ii) for a ball that is full and such that all points in arrived after .
At time , before the arrival of the first point, we have and . Since this trivially satisfies the various properties our sketch should have. Before we prove that we can maintain the sketch upon the arrival of new points and upon departure of old points, we present our decision algorithm TryToCover and prove its correctness. The algorithm is quite simple, and given in Algorithm 1.
Remark 2.1.
In line 2 we compute an optimal solution on the point set . How this is done, and how much time this takes, depends on the specific space that we consider. In -dimensional Euclidean space, for instance, we can solve the -center problem with outliers in time polynomial in (for constant and ), as follows: first generate all potential centers, then generate all possible collections of such centers, and then for each of the such collections find the minimum radius such that we can cover all except for points. For other spaces computing an optimal solution may not be easy, though it is always possible of course since our space is finite.
In fact, we do not need an exact solution to the problem on . It is sufficient if we have an algorithm that computes, for any given , a -approximation of the optimal solution. By tuning the parameter in the sketch and the parameter in the approximation algorithm appropriately, we can then still obtain the desired accuracy in our final answer. Since this is rather straightforward (but somewhat tedious) we omit the details, and we assume in the remainder that we compute an exact solution for in line 2.
The following lemma establishes the correctness of the algorithm.
Lemma 2.2.
Algorithm TryToCover either reports a collection of balls of radius that together cover all points in except for at most outliers, or it reports no. In latter case we have .
Proof 2.3.
First suppose the algorithm reports no. If this happens because then by (Prop-1). Otherwise, this happens because . But then , because (Prop-3) implies that .
Now suppose the algorithm reports a collection of balls. Let be the corresponding set of balls before they were expanded. Since is an optimal solution for , the balls have radius and together they cover all points in except for at most outliers. Now consider a point . To finish the proof, we must show that is covered by one of the balls in . To this end, first observe that because . Since TryToCover did not report no, this implies that . Hence, we can conclude from (Prop-3) that for a ball that is full. Thus contains points, and since we allow only outliers this implies that at least one point from is covered by a ball . Because , this implies that must be covered by , thus finishing the proof.
Next we show how to update the sketch .
Handling departures
Handling departures is easy. When a point in one of our representative sets expires, we simply delete it from , and if then becomes empty we remove from and from .
It is trivial to verify that (Prop-1) and (Prop-2) still hold for the updated sketch. To see that (Prop-3) holds as well, consider a point . The only reason for (Prop-3) to be violated, would be when for a ball that was full before the deletion of but is no longer full after the deletion. However, (Prop-3) states that all points in arrived after . Since did not yet expire, this means that the point that currently expires cannot be a point from .
Handling arrivals
Algorithm 2 shows how to handle the arrival of a new point at time . We denote the sketch just before the arrival by , and the updated sketch by .
Lemma 2.4.
The sketch computed by HandleArrival has properties (Prop-1)–(Prop-3).
Proof 2.5.
First consider (Prop-1). If then obviously (Prop-1) still holds, so assume that is updated by the algorithm. Then in the loop of lines 8–11 we added anchor points to the set , and after doing so is still non-empty. Note that whenever we add a point as anchor point to , then all balls with are added to . Moreover, the points in the corresponding representative sets are removed from . Hence, all points that remain in must be in balls whose center lies at distance more than from . Since the balls have radius , this means that all remaining points in are at distance more than from .
This implies two things: the distance between any two anchor points in is at least , and the distance from the point that defines in line 13 to any anchor point is at least . Hence, the set consists of points whose pairwise distances are at least . Thus any ball of radius covers at most one point from , and since we allow at most outliers this implies that . It remains to observe that the point is the oldest point in , since in line 9 we pick the youngest remaining point to be the next anchor point. Hence, until expires at time we have , and so for all . This proves that (Prop-1) still holds.
Now consider (Prop-2). The set must be well spread, because was well spread and we only add in line 5 when its center is outside the existing balls. To prove that , first observe that by construction. In line 11 we add for each anchor point one or more balls to . The centers of these balls all lie at distance at most from , so they are contained in . Because the underlying space has doubling dimension , we can cover by a set consisting of balls of radius . Because is well spread, any ball in contains at most one center of a ball from . Hence we add balls for each anchor point, thus finishing the proof of (Prop-2).
It remains to argue that (Prop-3) is maintained. The only property that is non-trivial to check is the property the points in need to satisfy. To this end consider a point , and assume that . Note that cannot be the just inserted point , because is the first point added as an anchor point to . Hence, . There are two cases.
The first case is that . Then we have , which implies that for a ball that was full and such that all points in arrived later than . We may have replaced a point from by in line 3, but this does not violate the property that all points in arrived after . Thus the only problem that may arise is that was not added to in the loop of lines 8–11. But then when line 13 is reached. Since all points in are younger than , this contradicts that .
The second case is that . Since this means that , and because we know that is one of the points considered in line 13. But then , again contradicting our assumptions.
2.2 A sketch for the optimization problem
Above we presented a sketch for a decision version of the problem. For given parameters and , the sketch uses storage. We also gave an algorithmTryToCover that either reports a collection of balls of radius that together cover all points in except for at most outliers, or that reports no. In latter case we know that . To make the parameter and explicit, we will from now on denote the sketch by .
In the optimization version of the problem we wish to find congruent balls of minimum radius that together cover all points in except for at most outliers. Our sketch for this problem, for a given , is defined as follows.
-
•
We maintain a sketch for every , where .
Our algorithm to approximate the optimal solution to the -center problem with outliers on is given in Algorithm 3.
We can now prove our main theorem.
Theorem 2.6.
Let be a finite space of doubling dimension and spread . Let be a given parameter. There is a sketch for the -center problem with outliers on streams from in the sliding-window model. The sketch uses storage, and it allows us to report at any time a collection of balls of radius at most that together cover all points from , except at most outliers.
Proof 2.7.
First consider the case . When we run with , then by Lemma 2.2 we will report a collection of balls of radius
Since , the radii are strictly smaller than . Since the balls are centered at points from , we may as well reduce the radii to zero, thus achieving an optimal solution.
Next, consider the case . Then . Consider the loop in lines 2–4 and let be the value of the counter when we obtain no. Note that this must happen at some point, because for we have
and so Lemma 2.2 guarantees that no. If then the balls in the reported solution have radius
Otherwise we know that the answer for is no, which implies that . Hence, the balls in the solution that is reported for have radius
Our sketch for the -center problem can also be used for the -center problem for . Moreover, a sketch for the -center problem (or a sketch for the -center problem for ) can also be used for the diameter problem. Recall that the diameter problem with outliers for the set asks for the value
that is, is the smallest diameter one can obtain by deleting outliers from . We say that an algorithm reports a -approximation to if it reports a value with .
Theorem 2.8.
The sketch for the -center problem with outliers as presented above can also be used to provide a -approximation for the -center problem with outliers, for any . Moreover, it can be used to provide a -approximation for the diameter problem with outliers.
Proof 2.9.
The only place where the value plays a role in properties (Prop-1)–(Prop-3), except for in the bound on the size of , is in (Prop-1). Since for any , this means that a sketch for the -center problem will have the properties required of a sketch for the -center problem. Thus running TryToCover on a sketch for the -center, where in line 2 we compute , will give a correct result for .
Now consider the diameter problem. Suppose we run TryToCover on a sketch for the -center problem, where in line 2 we compute , and instead of lines 6–7 we report . We claim that if the algorithm reports no then we have , and otherwise .
Note that for any . Hence, when then we have . This implies that the claim holds when TryToCover reports no.
If TryToCover does not report no, it reports . Clearly we then have . Now suppose for a contradiction that . Let be such that . We will argue that there are points such that and . But then we would have , which contradicts the assumption. We will argue the existence of ; the argument for is similar. If then we can take and we are done. Otherwise, as argued in the proof of Lemma 2.2, we know that for a ball that is full. In particular contains at least one point , and this point must be at distance at most from , as claimed.
We have proved the claim that if the algorithm reports no then we have , and otherwise we have . Plugging this claim for the decision problem into the mechanism to obtain a sketch for the optimization problem—recall that there we used as parameter—now gives the desired result.
Theorem 2.8 implies that there exists a sketch for the diameter problem with outliers in the sliding-window model that gives a -approximation to using storage, namely the sketch for the 1-center problem.
3 Concluding remarks
We presented the first sketch for the -center problem with outliers in the sliding-window model. We assumed that the points in the stream come from a finite space of bounded doubling dimension and spread , such as points in with integer coordinates from the set with . Alternatively, we can assume that we are given a range of possible values for . Assuming we also have a subroutine available to compute (a -approximation of) for any given static set , we can then decide if , and, if so, give a -approximation of . The storage will be as stated in Theorem 2.6, with .
Our sketch has low storage when , the number of outliers, is not too large. It would be interesting to see if the dependency on in the storage requirements reduced. A first step would be to see if it is possible to refine our approach to obtain a sketch with storage, saving a factor . Reducing the dependency on to sublinear may be possible using a random-sampling approach, if one is willing to allow slightly more than outliers. Another challenging open problem is to develop a sketch whose storage is only polynomially dependent on the doubling dimension .