This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\funding

MdB is supported by the Dutch Research Council (NWO) through Gravitation-grant NETWORKS-024.002.003. Department of Mathematics and Computer Science, TU Eindhoven, the [email protected] Department of Mathematics and Computer Science, TU Eindhoven, the [email protected] Department of Mathematics and Computer Science, TU Eindhoven, the [email protected] \CopyrightMark de Berg and Morteza Monemizadeh and Yu Zhong\ccsdesc[100]Theory of computation Design and analysis of algorithms \EventEditorsJohn Q. Open and Joan R. Access \EventNoEds2 \EventLongTitle42nd Conference on Very Important Topics (CVIT 2016) \EventShortTitleCVIT 2016 \EventAcronymCVIT \EventYear2016 \EventDateDecember 24–27, 2016 \EventLocationLittle Whinging, United Kingdom \EventLogo \SeriesVolume42 \ArticleNo23

kk-Center Clustering with Outliers in the Sliding-Window Model

Mark de Berg    Morteza Monemizadeh    Yu Zhong
Abstract

The kk-center problem for a point set PP asks for a collection of kk congruent balls (that is, balls of equal radius) that together cover all the points in PP and whose radius is minimized. The kk-center problem with outliers is defined similarly, except that zz of the points in PP need not be covered, for a fixed parameter zz. We study the kk-center problem with outliers in data streams in the sliding-window model. In this model we are given a possibly infinite stream P=p1,p2,p3,P=\langle p_{1},p_{2},p_{3},\ldots\rangle of points and a time window of length WW, and we want to maintain a small sketch of the set P(t)P(t) of points currently in the window such that using the sketch we can approximately solve the problem on P(t)P(t).

We present the first sketch for the kk-center problem with outliers in this setting. Our sketch works for the case where the points come from a finite space of bounded doubling dimension. It provides a (1+ε)(1+\varepsilon)-approximation using O((k+z)(z+1)(1/εd)logσ)O((k+z)(z+1)(1/\varepsilon^{d})\log\sigma) storage, where dd is the doubling dimension of the underlying space and σ:=maxq,qXdist(q,q)/minq,qXdist(q,q)\sigma:=\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})/\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) is its spread.

keywords:
Streaming algorithms, kk-center problem, sliding window, bounded doubling dimension

1 Introduction

Clustering is one of the most important tools to analyze large data sets. A well-known class of clustering algorithms is formed by centroid-based algorithms, which include kk-means clustering, kk-median clustering and kk-center clustering. The latter type of clustering is the topic of our paper. In the kk-center problem one is given a set PP of points from a metric space and a parameter kk, and the goal is to find kk congruent balls balls (that is, balls of equal radius) that together cover the points from PP and whose radius is minimized. Note that the special case k=1k=1 corresponds to the minimum-enclosing ball problem. Data sets in practice often contain outliers, leading to the kk-center problem with outliers. Here we are given, besides PP and kk, a parameter zz that indicates the allowed number of outliers. Thus the radius of the balls in an optimal solution is given by

optk,z(P)\mbox{{\sc opt}}_{k,z}(P) := the smallest radius ρ\rho such that we can cover all points from PP, except for at most zz outliers, by kk balls of radius ρ\rho.

In this paper we study the kk-center problem with outliers in data streams, where the input is a possibly infinite stream P=p1,p2,P=\langle p_{1},p_{2},\ldots\rangle of points. The goal is to maintain a solution to the kk-center problem as the points arrive over time, without any knowledge of future arrivals and using limited (sub-linear) storage. Since we cannot store all the points in the stream, we cannot expect to maintain an optimal solution. Hence, the two main quality criteria of a streaming algorithm are its approximation ratio and the amount of storage it uses. We will study this problem in the sliding-window model. In this model we are given a window length WW and we are, at any time tt, only interested in the points that arrived in the time window (tW,t](t-W,t]. Working in the sliding-window model is often significantly more difficult than working in the standard (insertion-only) streaming model.

Previous work. Charikar et al.[DBLP:journals/siamcomp/CharikarCFM04] were the first to study the metric kk-center problem in data streams. They developed a streaming algorithm that computes an 88-approximation for the kk-center problem using Θ(k)\Theta(k) space. Later McCutchen and Khuller [DBLP:conf/approx/McCutchenK08] improved the approximation ratio to 2+ε2+\varepsilon at the cost of increasing the storage to O((k/ε)log(1/ε))O((k/\varepsilon)\log(1/\varepsilon)). McCutchen and Khuller also studied the kk-center problem with z1z\geqslant 1 outliers, for which they gave a (4+ε)(4+\varepsilon)-approximation algorithm that requires O(kz/ε)O(kz/\varepsilon) space.

The above results are for general metric spaces. In spaces of bounded doubling dimension111The doubling dimension of a space XX is the smallest number dd such that any ball BB in the space can be covered by 2d2^{d} balls of radius radius(B)/2\operatorname{radius}(B)/2. better bounds are possible. Indeed, Ceccarello, Pietracaprina and Pucci [DBLP:journals/pvldb/CeccarelloPP19] gave a (3+ε)(3+\varepsilon)-approximation algorithm for the kk-center problem with zz outliers, thus improving the approximation ratio (4+ε)(4+\varepsilon) for general metrics. Their algorithm requires O((k+z)(1/ε)d)O((k+z)(1/\varepsilon)^{d}) storage, where dd is the doubling dimension of the underlying space (which is assumed to be a fixed constant).

The algorithms mentioned so far are deterministic. Charikar, O’Callaghan, and Panigrahy [DBLP:conf/stoc/CharikarOP03] and Ding, Yu and Wang [DBLP:conf/esa/DingYW19] studied sampling-based streaming algorithms for the Euclidean kk-center problem with outliers, showing that if one allows slightly more than zz outliers then randomization can reduce the storage requirements. Our focus, however, is on deterministic algorithms.

For the kk-center problem in the sliding-window model, the only result we are aware of is due to Cohen-Addad, Schwiegelshohn and Sohler [DBLP:conf/icalp/Cohen-AddadSS16]. They deal with the kk-center problem in general metric spaces, but without outliers, and they propose a (6+ε)(6+\varepsilon)-approximation algorithm using O((k/ε)logσ)O((k/\varepsilon)\log\sigma) storage, and a (4+ε)(4+\varepsilon)-approximation for the special case k=2k=2. Here σ:=maxq,qXdist(q,q)/minq,qXdist(q,q)\sigma:=\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})/\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) denotes the spread of the underling space XX, and it is assumed that the values maxq,qXdist(q,q)\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) and minq,qXdist(q,q)\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) are known to the algorithm. They also prove that any algorithm for the 22-center problem with outliers in general metric spaces that achieves an approximation ratio of less than 44 requires Ω(W1/3)\Omega(W^{1/3}) space, where WW is the size222Here the window size WW is defined in terms of the number of points in the window, that is, the window consists of the WW most recent points. We define the window in a slightly more general manner, by defining WW to be the length (that is, duration) of the window. Note that if we assume that the ii-th point arrives at time t=it=i, then the two models are the same. of the window. Table 1 gives an overview of the known results on the kk-center problem in the insertion-only and the sliding-window model.

model metric space approx. storage outliers ref.
insertion-only general 88 kk no [DBLP:journals/siamcomp/CharikarCFM04]
general 2+ε2+\varepsilon (k/ε)log(1/ε)(k/\varepsilon)\log(1/\varepsilon) no [DBLP:conf/approx/McCutchenK08]
general 4+ε4+\varepsilon kz/εkz/\varepsilon yes [DBLP:conf/approx/McCutchenK08]
bounded doubling 3+ε3+\varepsilon (k+z)/εd(k+z)/\varepsilon^{d} yes [DBLP:journals/pvldb/CeccarelloPP19]
sliding window general 6+ε6+\varepsilon (k/ε)logσ(k/\varepsilon)\log\sigma no [DBLP:conf/icalp/Cohen-AddadSS16]
bounded doubling 1+ε1+\varepsilon ((k+z)z/εd)logσ((k+z)z/\varepsilon^{d})\log\sigma yes here
Table 1: Results for the kk-center problem with and without outliers in the insertion-only and the sliding-window model. Bounds on the storage are asymptotic. In the papers where the metric space has bounded doubling dimension or is Euclidean, the dimension dd is consider a constant.

While our main interest is in the kk-center problem for k>1k>1, we will automatically obtain a result for the 1-center problem. Hence, we also briefly discuss previous results for the 1-center problem.

For the 1-center problem in dd-dimensional Euclidean space, streaming algorithms that maintain an ε\varepsilon-kernel give a (1+ε)(1+\varepsilon)-approximation. An example is the algorithm of Zarabi-Zadeh [z-cpa-08] which maintains an ε\varepsilon-kernel of size O(1/ε(d1)/2log(1/ε))O(1/\varepsilon^{(d-1)/2}\log(1/\varepsilon)). Moreover, using only O(d)O(d) storage one can obtain a 1.22-approximation for the 1-center problem without outliers [DBLP:journals/algorithmica/AgarwalS15, cp-sdameb-14]. For the 1-center problem with outliers, one can obtain a (1+ε)(1+\varepsilon)-approximation algorithm that uses z/εO(d)z/\varepsilon^{O(d)} storage using the technique of Agarwal, Har-Peled and Yu [DBLP:journals/dcg/AgarwalHY08]. Zarrabi-Zadeh and Mukhopadhyay  [DBLP:conf/cccg/Zarrabi-ZadehM09] studied the 11-center problem with zz outliers in high-dimensional Euclidean spaces, where dd is not considered constant, giving a 1.731.73-approximation algorithm that requires O(d3z)O(d^{3}z) space. Recently, Hatami anad Zarrabi-Zadeh [DBLP:journals/comgeo/HatamiZ17] extended this result to 22-center problem with zz outliers, obtaining a (1.8+ε)(1.8+\varepsilon)-approximation using O(d3z2+dz4/ε)O(d^{3}z^{2}+dz^{4}/\varepsilon) storage. None of the 1-center algorithms discussed above works in the sliding-window model.

As problem that is closely related to the 1-center problem is the diameter problem, where the goal is to maintain an approximation of the diameter of the points in the stream. This problem has been studied in the sliding-window model by Feigenbaum, Kannan, Zhang [DBLP:journals/algorithmica/FeigenbaumKZ04] and later by Chan and Sadjad [cs-gosw-06], who gave a (1+ε)(1+\varepsilon)-approximation for the diameter problem (without outliers) in the sliding window model, using O((1/ε)(d+1)/2log(σ/ε))O((1/\varepsilon)^{(d+1)/2}\log(\sigma/\varepsilon)) storage.

Our contribution. We present the first algorithm for the kk-center problem with zz outliers in the sliding-window model. It works in spaces of bounded doubling dimension and yields a (1+ε)(1+\varepsilon)-approximation. So far a (1+ε)(1+\varepsilon)-approximation was not even known for the kk-center problem without outliers in the insertion-only model. Our algorithm uses O(((k+z)(z+1)/εd)logσ)O(((k+z)(z+1)/\varepsilon^{d})\log\sigma) storage, where dd is the doubling dimension and σ\sigma is the spread of the underlying space, as defined above. Thus for the 1-center problem we obtain a solution that uses O((k/εd)logσ)O((k/\varepsilon^{d})\log\sigma) storage. This solution also works for the diameter problem. Note that also for the 1-center problem with outliers (and the diameter problem with outliers) an algorithm for the sliding-window model was not yet known. A useful property of the sketch333We use the word sketch even though we do not study how to compose the sketches for two separate streams into a sketch for the concatenation of the stream, as the term sketch seems more appropriate than data structure, for example. maintained by our algorithm for the kk-center problem, is that is can also be used for the kk^{\prime}-center problem for any k<kk^{\prime}<k, as well as for the diameter problem.

As in the previous papers on the kk-center problem (or the diameter problem) in the sliding-window model [cs-gosw-06, DBLP:conf/icalp/Cohen-AddadSS16, DBLP:journals/algorithmica/FeigenbaumKZ04], we assume that the values maxq,qXdist(q,q)\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) and minq,qXdist(q,q)\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) are known to the algorithm. A typical example is when the input consists of points in dd-dimensional Euclidean space with integer coordinates in a given range {0,1,,U}\{0,1,\ldots,U\}. In this example, the spread is Θ(U)\Theta(U).

2 The algorithm

Let P:=p1,p2,P:=\langle p_{1},p_{2},\ldots\rangle be a possibly infinite stream of points from a metric space XX of doubling dimension dd and spread σ\sigma, where dd is considered to be a fixed constant. We denote the arrival time of a point pip_{i} by and (pi)\and(p_{i}). We say that pip_{i} expires at time texp(pi):= and (pi)+Wt_{\mathrm{exp}}(p_{i}):=\and(p_{i})+W, where WW is the given length of the time window. To simplify the exposition, we assume that all arrival times and departure times (that is, times at which a point expires) are distinct. For a time tt we define P(t)P(t) to be the set444We allow the same point from XX to occur multiple times in the stream, so P(t)P(t) is actually a multi-set. Whenever we refer to “sets” in the remainder of the paper we mean “multi-sets”. of points currently in the window. In other words, P(t):={pi: and (pi)t<texp(pi)}P(t):=\{p_{i}:\and(p_{i})\leqslant t<t_{\mathrm{exp}}(p_{i})\}. For a point qXq\in X and a parameter r0r\geqslant 0, use ball(q,r)\operatorname{ball}(q,r) to denote the ball with center qq and radius rr.

In the following we show how to maintain a sketch Γ(t)\Gamma(t) of P(t)P(t) for the kk-center problem with outliers. We first present a sketch for a decision version of the problem. (Actually our sketch is a bit more powerful than a decision algorithm, as explained below.) Given this sketch, it will be easy to develop a sketch for the optimization version of the problem.

2.1 A sketch for the decision problem

For a set QQ of points, a number of outliers zz and a number of centers kk, we define

optk,z(Q)\mbox{{\sc opt}}_{k,z}(Q) := the smallest radius ρ\rho such that we can cover all points from QQ, except for at most zz outliers, by kk balls of radius ρ\rho.

Let ρ\rho be a given parameter with minq,qXdist(q,q)ρmaxq,qXdist(q,q)\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})\leqslant\rho\leqslant\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}). The goal of this section is design a sketch Γ(t)\Gamma(t) and a corresponding decision algorithm TryToCover with the following properties:

  • The sketch Γ(t)\Gamma(t) uses O((k+z)(z+1)(1/ε)d)O((k+z)(z+1)(1/\varepsilon)^{d}) storage.

  • TryToCover either reports a collection 𝒞\mathcal{C}^{*} of kk balls of radius optk,z(P(t))+2ερ\mbox{{\sc opt}}_{k,z}(P(t))+2\varepsilon\rho that together cover all points in P(t)P(t) except for at most zz outliers, or it report no. In latter case we have optk,z(P(t))>2ρ\mbox{{\sc opt}}_{k,z}(P(t))>2\rho.

Our sketch Γ(t)\Gamma(t) is a tuple (τ(t),(t),(t))(\tau(t),\mathcal{B}(t),\mathcal{R}(t)), where τ\tau is a timestamp, (t)\mathcal{B}(t) is a collection of balls, and (t)\mathcal{R}(t) is a collection of so-called representative sets. The sketch has the following properties:

(Prop-1)

For any time tt^{\prime} with tt<τ(t)t\leqslant t^{\prime}<\tau(t) we are guaranteed that optk,z(P(t))>2ρ\mbox{{\sc opt}}_{k,z}(P(t^{\prime}))>2\rho. The idea is that when we find k+z+1k+z+1 points with pairwise distances greater 2ρ2\rho, then we can set τ\tau to the expiration time of the oldest of these points, and we can delete this point (as well as any older points) from the sketch.

(Prop-2)

The number of balls in (t)\mathcal{B}(t) is O((k+z)/εd)O((k+z)/\varepsilon^{d}). Each ball B(t)B\in\mathcal{B}(t) has radius ερ\varepsilon\rho and the center of BB, denoted by center(B)\operatorname{center}(B), is a point from the stream. Note that the center does not need to be a point from P(t)P(t). The balls in (t)\mathcal{B}(t) are well spread in the sense that no ball contains the center of any other ball. In other words, we have dist(center(B),center(B))>ερ\operatorname{dist}(\operatorname{center}(B),\operatorname{center}(B^{\prime}))>\varepsilon\rho for any two balls B,B(t)B,B^{\prime}\in\mathcal{B}(t).

(Prop-3)

For each ball B(t)B\in\mathcal{B}(t) the set (t)\mathcal{R}(t) contains a representative set R(B)BP(t)R(B)\subseteq B\cap P(t), and these are the only sets in (t)\mathcal{R}(t). The representative sets R(B)R(B) are pairwise disjoint, and each set R(B)R(B) contains at most z+1z+1 points.

Define a ball B(t)B\in\mathcal{B}(t) to be full when |R(B)|=z+1|R(B)|=z+1, and let S(t):=B(t)R(B)S(t):=\bigcup_{B\in\mathcal{B}(t)}R(B). Then for any point piP(t)S(t)p_{i}\in P(t)\setminus S(t) we have (i) texp(pi)τ(t)t_{\mathrm{exp}}(p_{i})\leqslant\tau(t), and/or (ii) piBp_{i}\in B for a ball B(t)B\in\mathcal{B}(t) that is full and such that all points in R(B)R(B) arrived after pip_{i}.

At time t=0t=0, before the arrival of the first point, we have 𝒜(t)=(t)=0\mathcal{A}(t)=\mathcal{B}(t)=0 and τ(t)=0\tau(t)=0. Since P(0)=P(0)=\emptyset this trivially satisfies the various properties our sketch should have. Before we prove that we can maintain the sketch upon the arrival of new points and upon departure of old points, we present our decision algorithm TryToCover and prove its correctness. The algorithm is quite simple, and given in Algorithm 1.

Algorithm 1 TryToCover(Γ(t))(\Gamma(t))
1:S(t)B(t)R(B)S(t)\leftarrow\bigcup_{B\in\mathcal{B}(t)}R(B)
2:Compute optk,z(S(t))\mbox{{\sc opt}}_{k,z}(S(t)) and the corresponding collection 𝒞:={C1,,Ck}\mathcal{C}:=\{C_{1},\ldots,C_{k}\} of balls.
3:if t<τ(t)t<\tau(t) or optk,z(S(t))>2ρ\mbox{{\sc opt}}_{k,z}(S(t))>2\rho then
4:     Report no
5:else
6:     Increase the radius of each ball Ci𝒞C_{i}\in\mathcal{C} by 2ερ2\varepsilon\rho.
7:     Report the collection 𝒞:={C1,,Ck}\mathcal{C}^{*}:=\{C^{*}_{1},\ldots,C^{*}_{k}\} of expanded balls.
Remark 2.1.

In line 2 we compute an optimal solution on the point set S(t)S(t). How this is done, and how much time this takes, depends on the specific space XX that we consider. In dd-dimensional Euclidean space, for instance, we can solve the kk-center problem with outliers in time polynomial in n:=|S(t)|n:=|S(t)| (for constant kk and dd), as follows: first generate all O(nd+1)O(n^{d+1}) potential centers, then generate all possible collections of kk such centers, and then for each of the O(n(d+1)k)O(n^{(d+1)k}) such collections find the minimum radius ρ\rho such that we can cover all except for zz points. For other spaces computing an optimal solution may not be easy, though it is always possible of course since our space XX is finite.

In fact, we do not need an exact solution to the problem on S(t)S(t). It is sufficient if we have an algorithm that computes, for any given δ>0\delta>0, a (1+δ)(1+\delta)-approximation of the optimal solution. By tuning the parameter ε\varepsilon in the sketch and the parameter δ\delta in the approximation algorithm appropriately, we can then still obtain the desired accuracy in our final answer. Since this is rather straightforward (but somewhat tedious) we omit the details, and we assume in the remainder that we compute an exact solution for S(t)S(t) in line 2.

The following lemma establishes the correctness of the algorithm.

Lemma 2.2.

Algorithm TryToCover either reports a collection 𝒞\mathcal{C}^{*} of kk balls of radius optk,z(P(t))+2ερ\mbox{{\sc opt}}_{k,z}(P(t))+2\varepsilon\rho that together cover all points in P(t)P(t) except for at most zz outliers, or it reports no. In latter case we have optk,z(P(t))>2ρ\mbox{{\sc opt}}_{k,z}(P(t))>2\rho.

Proof 2.3.

First suppose the algorithm reports no. If this happens because t<τ(t)t<\tau(t) then optk,z(P(t))>2ρ\mbox{{\sc opt}}_{k,z}(P(t))>2\rho by (Prop-1). Otherwise, this happens because optk,z(S(t))>2ρ\mbox{{\sc opt}}_{k,z}(S(t))>2\rho. But then optk,z(P(t))>2ρ\mbox{{\sc opt}}_{k,z}(P(t))>2\rho, because (Prop-3) implies that S(t)P(t)S(t)\subseteq P(t).

Now suppose the algorithm reports a collection 𝒞:={C1,,Ck}\mathcal{C}^{*}:=\{C^{*}_{1},\ldots,C^{*}_{k}\} of balls. Let 𝒞\mathcal{C} be the corresponding set of balls before they were expanded. Since 𝒞\mathcal{C} is an optimal solution for S(t)S(t), the balls CiC_{i} have radius optk,z(S(t))optk,z(P(t))\mbox{{\sc opt}}_{k,z}(S(t))\leqslant\mbox{{\sc opt}}_{k,z}(P(t)) and together they cover all points in S(t)S(t) except for at most zz outliers. Now consider a point piP(t)S(t)p_{i}\in P(t)\setminus S(t). To finish the proof, we must show that pip_{i} is covered by one of the balls in 𝒞\mathcal{C}^{*}. To this end, first observe that texp(pi)>tt_{\mathrm{exp}}(p_{i})>t because piP(t)p_{i}\in P(t). Since TryToCover did not report no, this implies that texp(pi)>τ(t)t_{\mathrm{exp}}(p_{i})>\tau(t). Hence, we can conclude from (Prop-3) that piBp_{i}\in B for a ball B(t)B\in\mathcal{B}(t) that is full. Thus R(B)R(B) contains z+1z+1 points, and since we allow only zz outliers this implies that at least one point from R(B)R(B) is covered by a ball Ci𝒞C_{i}\in\mathcal{C}. Because diam(B)=2ερ\operatorname{diam}(B)=2\varepsilon\rho, this implies that pip_{i} must be covered by CiC^{*}_{i}, thus finishing the proof.

Next we show how to update the sketch Γ(t)\Gamma(t).

Handling departures

Handling departures is easy. When a point pjp_{j} in one of our representative sets R(B)R(B) expires, we simply delete it from R(B)R(B), and if R(B)R(B) then becomes empty we remove R(B)R(B) from (t)\mathcal{R}(t) and BB from (t)\mathcal{B}(t).

It is trivial to verify that (Prop-1) and (Prop-2) still hold for the updated sketch. To see that (Prop-3) holds as well, consider a point piP(t)S(t)p_{i}\in P(t)\setminus S(t). The only reason for (Prop-3) to be violated, would be when piBp_{i}\in B for a ball BB that was full before the deletion of pjp_{j} but is no longer full after the deletion. However, (Prop-3) states that all points in R(B)R(B) arrived after pip_{i}. Since pip_{i} did not yet expire, this means that the point pjp_{j} that currently expires cannot be a point from R(B)R(B).

Handling arrivals

Algorithm 2 shows how to handle the arrival of a new point pjp_{j} at time t:= and (pj)t:=\and(p_{j}). We denote the sketch just before the arrival by Γ(t)\Gamma(t^{-}), and the updated sketch by Γ(t+)\Gamma(t^{+}).

Algorithm 2 HandleArrival(Γ(t),pj)(\Gamma(t^{-}),p_{j})
1:(t)\mathcal{B}\leftarrow\mathcal{B}(t^{-}) \triangleright \mathcal{B} will be the set of balls from which we will pick the balls in (t+)\mathcal{B}(t^{+})
2:if pjBp_{j}\in B for some BB\in\mathcal{B} then
3:      Add pjp_{j} to R(B)R(B) for an arbitrary such ball BB. If we now have |R(B)|=z+2|R(B)|=z+2 because R(B)R(B) was full before the addition of pjp_{j}, then remove the oldest point from R(B)R(B).
4:else
5:     Add the ball B:=ball(pj,ερ)B:=\operatorname{ball}(p_{j},\varepsilon\rho) to \mathcal{B} and set R(B){pj}R(B)\leftarrow\{p_{j}\}.
6:SBR(B)S\leftarrow\bigcup_{B\in\mathcal{B}}R(B); 𝒜\mathcal{A}\leftarrow\emptyset \triangleright 𝒜\mathcal{A} is a set of so-called anchor points
7:(t+)\mathcal{B}(t^{+})\leftarrow\emptyset; (t+)\mathcal{R}(t^{+})\leftarrow\emptyset
8:while SS\neq\emptyset and |𝒜|<k+z|\mathcal{A}|<k+z do
9:     Let pip_{i} be the youngest point in SS. Add pip_{i} to 𝒜\mathcal{A}.
10:     for all balls BB\in\mathcal{B} such that dist(pi,center(B))(2+ε)ρ\operatorname{dist}(p_{i},\operatorname{center}(B))\leqslant(2+\varepsilon)\rho do
11:         Add BB to (t+)\mathcal{B}(t^{+}), add R(B)R(B) to (t+)\mathcal{R}(t^{+}), and set SSR(B)S\leftarrow S\setminus R(B).      
12:if SS\neq\emptyset then
13:     Let pip_{i^{*}} be the youngest point in SS and set τ(t+)max(τ(t),texp(pi))\tau(t^{+})\leftarrow\max\left(\tau(t),t_{\mathrm{exp}}(p_{i^{*}})\right)
Lemma 2.4.

The sketch computed by HandleArrival has properties (Prop-1)–(Prop-3).

Proof 2.5.

First consider (Prop-1). If τ(t+)=τ(t)\tau(t^{+})=\tau(t^{-}) then obviously (Prop-1) still holds, so assume that τ\tau is updated by the algorithm. Then in the loop of lines 811 we added k+zk+z anchor points to the set 𝒜\mathcal{A}, and after doing so SS is still non-empty. Note that whenever we add a point pip_{i} as anchor point to 𝒜\mathcal{A}, then all balls BB with dist(pi,center(B))(2+ε)ρ\operatorname{dist}(p_{i},\operatorname{center}(B))\leqslant(2+\varepsilon)\rho are added to (t+)\mathcal{B}(t^{+}). Moreover, the points in the corresponding representative sets R(B)R(B) are removed from SS. Hence, all points that remain in SS must be in balls whose center lies at distance more than (2+ε)ρ(2+\varepsilon)\rho from pip_{i}. Since the balls have radius ερ\varepsilon\rho, this means that all remaining points in SS are at distance more than 2ρ2\rho from pip_{i}.

This implies two things: the distance between any two anchor points in 𝒜\mathcal{A} is at least 2ρ2\rho, and the distance from the point pip_{i^{*}} that defines τ(t+)\tau(t^{+}) in line 13 to any anchor point is at least 2ρ2\rho. Hence, the set 𝒜:=𝒜{pi}\mathcal{A}^{*}:=\mathcal{A}\cup\{p_{i^{*}}\} consists of k+z+1k+z+1 points whose pairwise distances are at least 2ρ2\rho. Thus any ball of radius ρ\rho covers at most one point from 𝒜\mathcal{A}^{*}, and since we allow at most zz outliers this implies that optk,z(𝒜)>ρ\mbox{{\sc opt}}_{k,z}(\mathcal{A}^{*})>\rho. It remains to observe that the point pip_{i^{*}} is the oldest point in 𝒜\mathcal{A}^{*}, since in line 9 we pick the youngest remaining point to be the next anchor point. Hence, until pip_{i^{*}} expires at time τ(t+)=texp(pi)\tau(t^{+})=t_{\mathrm{exp}}(p_{i^{*}}) we have 𝒜P(t)\mathcal{A}^{*}\subseteq P(t), and so optk,z(P(t))>ρ\mbox{{\sc opt}}_{k,z}(P(t^{\prime}))>\rho for all tt<τ(t+)t\leqslant t^{\prime}<\tau(t^{+}). This proves that (Prop-1) still holds.

Now consider (Prop-2). The set (t+)\mathcal{B}(t^{+}) must be well spread, because Γ(t)\Gamma(t^{-}) was well spread and we only add ball(pi,ερ)\operatorname{ball}(p_{i},\varepsilon\rho) in line 5 when its center is outside the existing balls. To prove that |(t+)|=O((k+z)/εd)|\mathcal{B}(t^{+})|=O((k+z)/\varepsilon^{d}), first observe that |𝒜|k+z|\mathcal{A}|\leqslant k+z by construction. In line 11 we add for each anchor point pi𝒜p_{i}\in\mathcal{A} one or more balls to (t+)\mathcal{B}(t^{+}). The centers of these balls all lie at distance at most (2+ε)ρ(2+\varepsilon)\rho from pip_{i}, so they are contained in ball(pi,(2+ε)ρ)\operatorname{ball}(p_{i},(2+\varepsilon)\rho). Because the underlying space XX has doubling dimension dd, we can cover ball(pi,(2+ε)ρ)\operatorname{ball}(p_{i},(2+\varepsilon)\rho) by a set 𝒟\mathcal{D} consisting of O(1/εd)O(1/\varepsilon^{d}) balls of radius ερ/2\varepsilon\rho/2. Because (t+)\mathcal{B}(t^{+}) is well spread, any ball in 𝒟\mathcal{D} contains at most one center of a ball from (t+)\mathcal{B}(t^{+}). Hence we add O(1/εd)O(1/\varepsilon^{d}) balls for each anchor point, thus finishing the proof of (Prop-2).

It remains to argue that (Prop-3) is maintained. The only property that is non-trivial to check is the property the points in P(t)S(t)P(t)\setminus S(t) need to satisfy. To this end consider a point piP(t+)S(t+)p_{i}\in P(t^{+})\setminus S(t^{+}), and assume that texp(pi)>τ(t+)t_{\mathrm{exp}}(p_{i})>\tau(t^{+}). Note that pip_{i} cannot be the just inserted point pjp_{j}, because pjp_{j} is the first point added as an anchor point to 𝒜\mathcal{A}. Hence, piP(t)p_{i}\in P(t^{-}). There are two cases.

The first case is that piP(t)S(t)p_{i}\in P(t^{-})\setminus S(t^{-}). Then we have texp(pi)>τ(t)t_{\mathrm{exp}}(p_{i})>\tau(t^{-}), which implies that piBp_{i}\in B for a ball B(t)B\in\mathcal{B}(t^{-}) that was full and such that all points in R(B)R(B) arrived later than pip_{i}. We may have replaced a point from R(B)R(B) by pjp_{j} in line 3, but this does not violate the property that all points in R(B)R(B) arrived after pip_{i}. Thus the only problem that may arise is that BB was not added to (t+)\mathcal{B}(t^{+}) in the loop of lines 811. But then R(B)SR(B)\subseteq S when line 13 is reached. Since all points in R(B)R(B) are younger than pip_{i}, this contradicts that texp(pi)>τ(t+)t_{\mathrm{exp}}(p_{i})>\tau(t^{+}).

The second case is that piP(t)S(t)p_{i}\not\in P(t^{-})\setminus S(t^{-}). Since piP(t)p_{i}\in P(t^{-}) this means that piS(t)p_{i}\in S(t^{-}), and because piS(t+)p_{i}\not\in S(t^{+}) we know that pip_{i} is one of the points considered in line 13. But then texp(pi)τ(t+)t_{\mathrm{exp}}(p_{i})\leqslant\tau(t^{+}), again contradicting our assumptions.

2.2 A sketch for the optimization problem

Above we presented a sketch for a decision version of the problem. For given parameters ρ\rho and ε\varepsilon, the sketch uses O((k+z)(z+1)(1/ε)d)O((k+z)(z+1)(1/\varepsilon)^{d}) storage. We also gave an algorithmTryToCover that either reports a collection 𝒞\mathcal{C}^{*} of kk balls of radius optk,z(P(t))+2ερ\mbox{{\sc opt}}_{k,z}(P(t))+2\varepsilon\rho that together cover all points in P(t)P(t) except for at most zz outliers, or that reports no. In latter case we know that optk,z(P(t))>2ρ\mbox{{\sc opt}}_{k,z}(P(t))>2\rho. To make the parameter ρ\rho and ε\varepsilon explicit, we will from now on denote the sketch by Γρ,ε\Gamma_{\rho,\varepsilon}.

In the optimization version of the problem we wish to find kk congruent balls of minimum radius that together cover all points in P(t)P(t) except for at most zz outliers. Our sketch for this problem, for a given ε>0\varepsilon>0, is defined as follows.

  • We maintain a sketch Γρi,ε/2\Gamma_{\rho_{i},\varepsilon/2} for every 0ilogσ0\leqslant i\leqslant\lfloor\log\sigma\rfloor, where ρi:=2iminq,qXdist(q,q)\rho_{i}:=2^{i}\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}).

Our algorithm to approximate the optimal solution to the kk-center problem with outliers on P(t)P(t) is given in Algorithm 3.

Algorithm 3 FindApproximateCenters(Γ(t))(\Gamma(t))
1:i0i\leftarrow 0
2:repeat
3:     𝑎𝑛𝑠𝑤𝑒𝑟TryToCover(Γρi,ε/2)\mathit{answer}\leftarrow\textsc{TryToCover}(\Gamma_{\rho_{i},\varepsilon/2})
4:until 𝑎𝑛𝑠𝑤𝑒𝑟\mathit{answer}\neq no
5:if the radii of the balls in 𝑎𝑛𝑠𝑤𝑒𝑟\mathit{answer} is less than minq,qXdist(q,q)\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}) then
6:     Reduce the radii of the balls to zero
7:Report 𝑎𝑛𝑠𝑤𝑒𝑟\mathit{answer}

We can now prove our main theorem.

Theorem 2.6.

Let XX be a finite space of doubling dimension dd and spread σ\sigma. Let 0<ε<10<\varepsilon<1 be a given parameter. There is a sketch for the kk-center problem with zz outliers on streams from XX in the sliding-window model. The sketch uses O((k+z)(z+1)(1/ε)dlogσ)O((k+z)(z+1)(1/\varepsilon)^{d}\log\sigma) storage, and it allows us to report at any time tt a collection 𝒞\mathcal{C}^{*} of balls of radius at most (1+ε)optk,z(P(t))(1+\varepsilon)\cdot\mbox{{\sc opt}}_{k,z}(P(t)) that together cover all points from P(t)P(t), except at most zz outliers.

Proof 2.7.

First consider the case optk,z(P(t))=0\mbox{{\sc opt}}_{k,z}(P(t))=0. When we run TryToCover(Γρi,ε/2)\textsc{TryToCover}(\Gamma_{\rho_{i},\varepsilon/2}) with i=0i=0, then by Lemma 2.2 we will report a collection of balls of radius

optk,z(P(t))+2(ε/2)ρ0=εminq,qXdist(q,q).\mbox{{\sc opt}}_{k,z}(P(t))+2(\varepsilon/2)\rho_{0}=\varepsilon\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}).

Since ε<1\varepsilon<1, the radii are strictly smaller than minq,qXdist(q,q)\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}). Since the balls are centered at points from XX, we may as well reduce the radii to zero, thus achieving an optimal solution.

Next, consider the case optk,z(P(t))>0\mbox{{\sc opt}}_{k,z}(P(t))>0. Then optk,z(P(t))minq,qXdist(q,q)=ρ0\mbox{{\sc opt}}_{k,z}(P(t))\geqslant\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})=\rho_{0}. Consider the loop in lines 24 and let ii^{*} be the value of the counter ii when we obtain TryToCover(Γρi,ε/2)\textsc{TryToCover}(\Gamma_{\rho_{i},\varepsilon/2})\neq no. Note that this must happen at some point, because for i=logσi=\lfloor\log\sigma\rfloor we have

ρi=2logσminq,qXdist(q,q)(σ/2)minq,qXdist(q,q)=(1/2)maxq,qXdist(q,q)\rho_{i}=2^{\lfloor\log\sigma\rfloor}\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})\geqslant(\sigma/2)\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})=(1/2)\cdot\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})

and so Lemma 2.2 guarantees that 𝑎𝑛𝑠𝑤𝑒𝑟\mathit{answer}\neq no. If i=0i^{*}=0 then the balls in the reported solution have radius

optk,z(P(t))+2(ε/2)ρ0(1+ε)optk,z(P(t)).\mbox{{\sc opt}}_{k,z}(P(t))+2(\varepsilon/2)\rho_{0}\leqslant(1+\varepsilon)\cdot\mbox{{\sc opt}}_{k,z}(P(t)).

Otherwise we know that the answer for i=i1i=i^{*}-1 is no, which implies that optk,z(P(t))>2ρi1=ρi\mbox{{\sc opt}}_{k,z}(P(t))>2\rho_{i^{*}-1}=\rho_{i^{*}}. Hence, the balls in the solution that is reported for ii^{*} have radius

optk,z(P(t))+2(ε/2)ρi(1+ε)optk,z(P(t)).\mbox{{\sc opt}}_{k,z}(P(t))+2(\varepsilon/2)\rho_{i^{*}}\leqslant(1+\varepsilon)\cdot\mbox{{\sc opt}}_{k,z}(P(t)).

Our sketch for the kk-center problem can also be used for the kk^{\prime}-center problem for k<kk^{\prime}<k. Moreover, a sketch for the 11-center problem (or a sketch for the kk-center problem for k>1k>1) can also be used for the diameter problem. Recall that the diameter problem with outliers for the set P(t)P(t) asks for the value

diamz(P(t)):=min{diam(P(t)Q):|Q|=z},\operatorname{diam}_{z}(P(t)):=\min\{\operatorname{diam}(P(t)\setminus Q):|Q|=z\},

that is, diamz(P(t))\operatorname{diam}_{z}(P(t)) is the smallest diameter one can obtain by deleting zz outliers from P(t)P(t). We say that an algorithm reports a (1ε)(1-\varepsilon)-approximation to diamz(P(t))\operatorname{diam}_{z}(P(t)) if it reports a value DD with (1ε)diamz(P(t))Ddiamz(P(t))(1-\varepsilon)\cdot\operatorname{diam}_{z}(P(t))\leqslant D\leqslant\operatorname{diam}_{z}(P(t)).

Theorem 2.8.

The sketch for the kk-center problem with outliers as presented above can also be used to provide a (1+ε)(1+\varepsilon)-approximation for the kk^{\prime}-center problem with outliers, for any 1kk1\leqslant k^{\prime}\leqslant k. Moreover, it can be used to provide a (12ε)(1-2\varepsilon)-approximation for the diameter problem with outliers.

Proof 2.9.

The only place where the value kk plays a role in properties (Prop-1)–(Prop-3), except for in the bound on the size of (t)\mathcal{B}(t), is in (Prop-1). Since optk,z(P(t))optk,z(P(t))\mbox{{\sc opt}}_{k^{\prime},z}(P(t))\geqslant\mbox{{\sc opt}}_{k,z}(P(t)) for any kkk^{\prime}\leqslant k, this means that a sketch for the kk-center problem will have the properties required of a sketch for the kk^{\prime}-center problem. Thus running TryToCover on a sketch for the kk-center, where in line 2 we compute optk,z(P(t))\mbox{{\sc opt}}_{k^{\prime},z}(P(t)), will give a correct result for kk^{\prime}.

Now consider the diameter problem. Suppose we run TryToCover on a sketch for the kk-center problem, where in line 2 we compute diamz(P(t))\operatorname{diam}_{z}(P(t)), and instead of lines 67 we report D:=diamz(S(t))D:=\operatorname{diam}_{z}(S(t)). We claim that if the algorithm reports no then we have diamz(P(t))>2ρ\operatorname{diam}_{z}(P(t))>2\rho, and otherwise diamz(P(t))4ερDdiamz(P(t))\operatorname{diam}_{z}(P(t))-4\varepsilon\rho\leqslant D\leqslant\operatorname{diam}_{z}(P(t)).

Note that diamz(P(t))optk,z(P(t))\operatorname{diam}_{z}(P(t))\geqslant\mbox{{\sc opt}}_{k,z}(P(t)) for any k1k\geqslant 1. Hence, when t<τ(t)t<\tau(t) then we have diamz(P(t))optk,z(P(t))>2ρ\operatorname{diam}_{z}(P(t))\geqslant\mbox{{\sc opt}}_{k,z}(P(t))>2\rho. This implies that the claim holds when TryToCover reports no.

If TryToCover does not report no, it reports D:=diamz(S(t))D:=\operatorname{diam}_{z}(S(t)). Clearly we then have Ddiamz(P(t))D\leqslant\operatorname{diam}_{z}(P(t)). Now suppose for a contradiction that D<diamz(P(t))4ερD<\operatorname{diam}_{z}(P(t))-4\varepsilon\rho. Let pi,pjP(t)p_{i},p_{j}\in P(t) be such that dist(pi,pj)=diamz(P(t))\operatorname{dist}(p_{i},p_{j})=\operatorname{diam}_{z}(P(t)). We will argue that there are points pi,pjS(t)p_{i^{\prime}},p_{j^{\prime}}\in S(t) such that dist(pi,pi)2ερ\operatorname{dist}(p_{i},p_{i^{\prime}})\leqslant 2\varepsilon\rho and dist(pj,pj)2ερ\operatorname{dist}(p_{j},p_{j^{\prime}})\leqslant 2\varepsilon\rho. But then we would have D=diamz(S(t))diamz(P(t))4ερD=\operatorname{diam}_{z}(S(t))\geqslant\operatorname{diam}_{z}(P(t))-4\varepsilon\rho, which contradicts the assumption. We will argue the existence of pip_{i^{\prime}}; the argument for pjp_{j^{\prime}} is similar. If piS(t)p_{i}\in S(t) then we can take pi:=pip_{i^{\prime}}:=p_{i} and we are done. Otherwise, as argued in the proof of Lemma 2.2, we know that piBp_{i}\in B for a ball BB(t)B\in B(t) that is full. In particular R(B)R(B) contains at least one point pip_{i^{\prime}}, and this point must be at distance at most 2ερ2\varepsilon\rho from pip_{i}, as claimed.

We have proved the claim that if the algorithm reports no then we have diamz(P(t))>2ρ\operatorname{diam}_{z}(P(t))>2\rho, and otherwise we have diamz(P(t))4ερDdiamz(P(t))\operatorname{diam}_{z}(P(t))-4\varepsilon\rho\leqslant D\leqslant\operatorname{diam}_{z}(P(t)). Plugging this claim for the decision problem into the mechanism to obtain a sketch for the optimization problem—recall that there we used ε/2\varepsilon/2 as parameter—now gives the desired result.

Theorem 2.8 implies that there exists a sketch for the diameter problem with outliers in the sliding-window model that gives a (1+ε)(1+\varepsilon)-approximation to diamz(P(t))\operatorname{diam}_{z}(P(t)) using O((z2/εd)logσ)O((z^{2}/\varepsilon^{d})\log\sigma) storage, namely the sketch for the 1-center problem.

3 Concluding remarks

We presented the first sketch for the kk-center problem with outliers in the sliding-window model. We assumed that the points in the stream come from a finite space XX of bounded doubling dimension and spread σ\sigma, such as points in d\mathbb{R}^{d} with integer coordinates from the set {0,,U}\{0,\ldots,U\} with U:=σ/dU:=\lfloor\sigma/\sqrt{d}\rfloor. Alternatively, we can assume that we are given a range [optmin,optmax][\mbox{{\sc opt}}_{\min},\mbox{{\sc opt}}_{\max}] of possible values for optk,z(P(t))\mbox{{\sc opt}}_{k,z}(P(t)). Assuming we also have a subroutine available to compute (a (1+ε)(1+\varepsilon)-approximation of) optk,z(S)\mbox{{\sc opt}}_{k,z}(S) for any given static set SS, we can then decide if optk,z(P(t))[optmin,optmax]\mbox{{\sc opt}}_{k,z}(P(t))\in[\mbox{{\sc opt}}_{\min},\mbox{{\sc opt}}_{\max}], and, if so, give a (1+ε)(1+\varepsilon)-approximation of optk,z(P(t))\mbox{{\sc opt}}_{k,z}(P(t)). The storage will be as stated in Theorem 2.6, with σ:=optmax/optmin\sigma:=\mbox{{\sc opt}}_{\max}/\mbox{{\sc opt}}_{\min}.

Our sketch has low storage when zz, the number of outliers, is not too large. It would be interesting to see if the dependency on zz in the storage requirements reduced. A first step would be to see if it is possible to refine our approach to obtain a sketch with O((k+z)/εd)O((k+z)/\varepsilon^{d}) storage, saving a factor zz. Reducing the dependency on zz to sublinear may be possible using a random-sampling approach, if one is willing to allow slightly more than zz outliers. Another challenging open problem is to develop a sketch whose storage is only polynomially dependent on the doubling dimension dd.