Funding: MdB is supported by the Dutch Research Council (NWO) through Gravitation-grant NETWORKS-024.002.003.

Affiliation (all authors): Department of Mathematics and Computer Science, TU Eindhoven, the Netherlands

Copyright: Mark de Berg, Morteza Monemizadeh and Yu Zhong

Subject classification: Theory of computation $\rightarrow$ Design and analysis of algorithms

$k$ -Center Clustering with Outliers in the Sliding-Window Model

Mark de Berg Morteza Monemizadeh Yu Zhong

Abstract

The $k$ -center problem for a point set $P$ asks for a collection of $k$ congruent balls (that is, balls of equal radius) that together cover all the points in $P$ and whose radius is minimized. The $k$ -center problem with outliers is defined similarly, except that $z$ of the points in $P$ need not be covered, for a fixed parameter $z$ . We study the $k$ -center problem with outliers in data streams in the sliding-window model. In this model we are given a possibly infinite stream $P=\langle p_{1},p_{2},p_{3},\ldots\rangle$ of points and a time window of length $W$ , and we want to maintain a small sketch of the set $P(t)$ of points currently in the window such that using the sketch we can approximately solve the problem on $P(t)$ .

We present the first sketch for the $k$ -center problem with outliers in this setting. Our sketch works for the case where the points come from a finite space of bounded doubling dimension. It provides a $(1+\varepsilon)$ -approximation using $O((k+z)(z+1)(1/\varepsilon^{d})\log\sigma)$ storage, where $d$ is the doubling dimension of the underlying space and $\sigma:=\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})/\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ is its spread.

Keywords: Streaming algorithms, $k$-center problem, sliding window, bounded doubling dimension

1 Introduction

Clustering is one of the most important tools to analyze large data sets. A well-known class of clustering algorithms is formed by centroid-based algorithms, which include $k$ -means clustering, $k$ -median clustering and $k$ -center clustering. The latter type of clustering is the topic of our paper. In the $k$ -center problem one is given a set $P$ of points from a metric space and a parameter $k$ , and the goal is to find $k$ congruent balls (that is, balls of equal radius) that together cover the points from $P$ and whose radius is minimized. Note that the special case $k=1$ corresponds to the minimum-enclosing ball problem. Data sets in practice often contain outliers, leading to the $k$ -center problem with outliers. Here we are given, besides $P$ and $k$ , a parameter $z$ that indicates the allowed number of outliers. Thus the radius of the balls in an optimal solution is given by

$\mbox{{\sc opt}}_{k,z}(P)$ := the smallest radius $\rho$ such that we can cover all points from $P$ , except for at most $z$ outliers, by $k$ balls of radius $\rho$ .

In this paper we study the $k$ -center problem with outliers in data streams, where the input is a possibly infinite stream $P=\langle p_{1},p_{2},\ldots\rangle$ of points. The goal is to maintain a solution to the $k$ -center problem as the points arrive over time, without any knowledge of future arrivals and using limited (sub-linear) storage. Since we cannot store all the points in the stream, we cannot expect to maintain an optimal solution. Hence, the two main quality criteria of a streaming algorithm are its approximation ratio and the amount of storage it uses. We will study this problem in the sliding-window model. In this model we are given a window length $W$ and we are, at any time $t$ , only interested in the points that arrived in the time window $(t-W,t]$ . Working in the sliding-window model is often significantly more difficult than working in the standard (insertion-only) streaming model.

Previous work. Charikar et al. [DBLP:journals/siamcomp/CharikarCFM04] were the first to study the metric $k$ -center problem in data streams. They developed a streaming algorithm that computes an $8$ -approximation for the $k$ -center problem using $\Theta(k)$ space. Later McCutchen and Khuller [DBLP:conf/approx/McCutchenK08] improved the approximation ratio to $2+\varepsilon$ at the cost of increasing the storage to $O((k/\varepsilon)\log(1/\varepsilon))$ . McCutchen and Khuller also studied the $k$ -center problem with $z\geqslant 1$ outliers, for which they gave a $(4+\varepsilon)$ -approximation algorithm that requires $O(kz/\varepsilon)$ space.

The above results are for general metric spaces. In spaces of bounded doubling dimension better bounds are possible. (The doubling dimension of a space $X$ is the smallest number $d$ such that any ball $B$ in the space can be covered by $2^{d}$ balls of radius $\operatorname{radius}(B)/2$ .) Indeed, Ceccarello, Pietracaprina and Pucci [DBLP:journals/pvldb/CeccarelloPP19] gave a $(3+\varepsilon)$ -approximation algorithm for the $k$ -center problem with $z$ outliers, thus improving the approximation ratio $(4+\varepsilon)$ for general metrics. Their algorithm requires $O((k+z)(1/\varepsilon)^{d})$ storage, where $d$ is the doubling dimension of the underlying space (which is assumed to be a fixed constant).

The algorithms mentioned so far are deterministic. Charikar, O’Callaghan, and Panigrahy [DBLP:conf/stoc/CharikarOP03] and Ding, Yu and Wang [DBLP:conf/esa/DingYW19] studied sampling-based streaming algorithms for the Euclidean $k$ -center problem with outliers, showing that if one allows slightly more than $z$ outliers then randomization can reduce the storage requirements. Our focus, however, is on deterministic algorithms.

For the $k$ -center problem in the sliding-window model, the only result we are aware of is due to Cohen-Addad, Schwiegelshohn and Sohler [DBLP:conf/icalp/Cohen-AddadSS16]. They deal with the $k$ -center problem in general metric spaces, but without outliers, and they propose a $(6+\varepsilon)$ -approximation algorithm using $O((k/\varepsilon)\log\sigma)$ storage, and a $(4+\varepsilon)$ -approximation for the special case $k=2$ . Here $\sigma:=\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})/\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ denotes the spread of the underlying space $X$ , and it is assumed that the values $\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ and $\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ are known to the algorithm. They also prove that any algorithm for the $2$ -center problem with outliers in general metric spaces that achieves an approximation ratio of less than $4$ requires $\Omega(W^{1/3})$ space, where $W$ is the size of the window. (Here the window size $W$ is defined in terms of the number of points in the window, that is, the window consists of the $W$ most recent points. We define the window in a slightly more general manner, by defining $W$ to be the length (that is, duration) of the window. Note that if we assume that the $i$ -th point arrives at time $t=i$ , then the two models are the same.) Table 1 gives an overview of the known results on the $k$ -center problem in the insertion-only and the sliding-window model.

model	metric space	approx.	storage	outliers	ref.
insertion-only	general	$8$	$k$	no	[DBLP:journals/siamcomp/CharikarCFM04]
	general	$2+\varepsilon$	$(k/\varepsilon)\log(1/\varepsilon)$	no	[DBLP:conf/approx/McCutchenK08]
	general	$4+\varepsilon$	$kz/\varepsilon$	yes	[DBLP:conf/approx/McCutchenK08]
	bounded doubling	$3+\varepsilon$	$(k+z)/\varepsilon^{d}$	yes	[DBLP:journals/pvldb/CeccarelloPP19]
sliding window	general	$6+\varepsilon$	$(k/\varepsilon)\log\sigma$	no	[DBLP:conf/icalp/Cohen-AddadSS16]
	bounded doubling	$1+\varepsilon$	$((k+z)z/\varepsilon^{d})\log\sigma$	yes	here

Table 1: Results for the $k$-center problem with and without outliers in the insertion-only and the sliding-window model. Bounds on the storage are asymptotic. In the papers where the metric space has bounded doubling dimension or is Euclidean, the dimension $d$ is considered a constant.

While our main interest is in the $k$ -center problem for $k>1$ , we will automatically obtain a result for the 1-center problem. Hence, we also briefly discuss previous results for the 1-center problem.

For the 1-center problem in $d$ -dimensional Euclidean space, streaming algorithms that maintain an $\varepsilon$ -kernel give a $(1+\varepsilon)$ -approximation. An example is the algorithm of Zarrabi-Zadeh [z-cpa-08] which maintains an $\varepsilon$ -kernel of size $O(1/\varepsilon^{(d-1)/2}\log(1/\varepsilon))$ . Moreover, using only $O(d)$ storage one can obtain a 1.22-approximation for the 1-center problem without outliers [DBLP:journals/algorithmica/AgarwalS15, cp-sdameb-14]. For the 1-center problem with outliers, one can obtain a $(1+\varepsilon)$ -approximation algorithm that uses $z/\varepsilon^{O(d)}$ storage using the technique of Agarwal, Har-Peled and Yu [DBLP:journals/dcg/AgarwalHY08]. Zarrabi-Zadeh and Mukhopadhyay [DBLP:conf/cccg/Zarrabi-ZadehM09] studied the $1$ -center problem with $z$ outliers in high-dimensional Euclidean spaces, where $d$ is not considered constant, giving a $1.73$ -approximation algorithm that requires $O(d^{3}z)$ space. Recently, Hatami and Zarrabi-Zadeh [DBLP:journals/comgeo/HatamiZ17] extended this result to the $2$ -center problem with $z$ outliers, obtaining a $(1.8+\varepsilon)$ -approximation using $O(d^{3}z^{2}+dz^{4}/\varepsilon)$ storage. None of the 1-center algorithms discussed above works in the sliding-window model.

A problem that is closely related to the 1-center problem is the diameter problem, where the goal is to maintain an approximation of the diameter of the points in the stream. This problem has been studied in the sliding-window model by Feigenbaum, Kannan and Zhang [DBLP:journals/algorithmica/FeigenbaumKZ04] and later by Chan and Sadjad [cs-gosw-06], who gave a $(1+\varepsilon)$ -approximation for the diameter problem (without outliers) in the sliding-window model, using $O((1/\varepsilon)^{(d+1)/2}\log(\sigma/\varepsilon))$ storage.

Our contribution. We present the first algorithm for the $k$ -center problem with $z$ outliers in the sliding-window model. It works in spaces of bounded doubling dimension and yields a $(1+\varepsilon)$ -approximation. So far a $(1+\varepsilon)$ -approximation was not even known for the $k$ -center problem without outliers in the insertion-only model. Our algorithm uses $O(((k+z)(z+1)/\varepsilon^{d})\log\sigma)$ storage, where $d$ is the doubling dimension and $\sigma$ is the spread of the underlying space, as defined above. Thus for the 1-center problem we obtain a solution that uses $O((k/\varepsilon^{d})\log\sigma)$ storage. This solution also works for the diameter problem. Note that also for the 1-center problem with outliers (and the diameter problem with outliers) an algorithm for the sliding-window model was not yet known. A useful property of the sketch maintained by our algorithm for the $k$ -center problem is that it can also be used for the $k^{\prime}$ -center problem for any $k^{\prime}<k$ , as well as for the diameter problem. (We use the word sketch even though we do not study how to compose the sketches for two separate streams into a sketch for the concatenation of the streams, as the term sketch seems more appropriate than data structure, for example.)

As in the previous papers on the $k$ -center problem (or the diameter problem) in the sliding-window model [cs-gosw-06, DBLP:conf/icalp/Cohen-AddadSS16, DBLP:journals/algorithmica/FeigenbaumKZ04], we assume that the values $\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ and $\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ are known to the algorithm. A typical example is when the input consists of points in $d$ -dimensional Euclidean space with integer coordinates in a given range $\{0,1,\ldots,U\}$ . In this example, the spread is $\Theta(U)$ .
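As a small sanity check of the $\Theta(U)$ claim, the following Python sketch computes the spread of a finite point set by brute force. The function name `spread`, the choice of the Euclidean metric and the concrete values of `U` and `d` are illustrative assumptions, not part of the paper.

```python
import itertools
import math


def spread(points):
    """Ratio of the largest to the smallest pairwise distance (brute force).

    Assumes `points` holds at least two distinct points, given as tuples.
    """
    dists = [math.dist(p, q) for p, q in itertools.combinations(points, 2)]
    return max(dists) / min(dists)


# Integer grid {0, ..., U}^d with U = 4 and d = 2.
U, d = 4, 2
grid = list(itertools.product(range(U + 1), repeat=d))
sigma = spread(grid)  # largest distance is U * sqrt(d), smallest is 1
```

For constant $d$ the value `U * sqrt(d)` is $\Theta(U)$, in line with the example above.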

2 The algorithm

Let $P:=\langle p_{1},p_{2},\ldots\rangle$ be a possibly infinite stream of points from a metric space $X$ of doubling dimension $d$ and spread $\sigma$ , where $d$ is considered to be a fixed constant. We denote the arrival time of a point $p_{i}$ by $t_{\mathrm{arr}}(p_{i})$ . We say that $p_{i}$ expires at time $t_{\mathrm{exp}}(p_{i}):=t_{\mathrm{arr}}(p_{i})+W$ , where $W$ is the given length of the time window. To simplify the exposition, we assume that all arrival times and departure times (that is, times at which a point expires) are distinct. For a time $t$ we define $P(t)$ to be the set of points currently in the window. In other words, $P(t):=\{p_{i}:t_{\mathrm{arr}}(p_{i})\leqslant t<t_{\mathrm{exp}}(p_{i})\}$ . (We allow the same point from $X$ to occur multiple times in the stream, so $P(t)$ is actually a multi-set. Whenever we refer to “sets” in the remainder of the paper we mean “multi-sets”.) For a point $q\in X$ and a parameter $r\geqslant 0$ , we use $\operatorname{ball}(q,r)$ to denote the ball with center $q$ and radius $r$ .
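The definition of $P(t)$ can be illustrated with a few lines of Python. This is only a sketch of the definition, not of the streaming algorithm: it stores the whole stream, which the sketch of this paper avoids. The function name `window` and the representation of the stream as `(arrival_time, point)` pairs are assumptions made for the example.

```python
def window(stream, t, W):
    """Return P(t): all points whose arrival time a satisfies a <= t < a + W.

    `stream` is a list of (arrival_time, point) pairs.
    """
    return [p for (arrival, p) in stream if arrival <= t < arrival + W]


# The i-th point arrives at time i; the window has length W = 3.
stream = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
current = window(stream, t=4, W=3)  # point "a" expired at time 1 + 3 = 4
```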

In the following we show how to maintain a sketch $\Gamma(t)$ of $P(t)$ for the $k$ -center problem with outliers. We first present a sketch for a decision version of the problem. (Actually our sketch is a bit more powerful than a decision algorithm, as explained below.) Given this sketch, it will be easy to develop a sketch for the optimization version of the problem.

2.1 A sketch for the decision problem

For a set $Q$ of points, a number of outliers $z$ and a number of centers $k$ , we define

$\mbox{{\sc opt}}_{k,z}(Q)$ := the smallest radius $\rho$ such that we can cover all points from $Q$ , except for at most $z$ outliers, by $k$ balls of radius $\rho$ .

Let $\rho$ be a given parameter with $\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})\leqslant\rho\leqslant\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ . The goal of this section is to design a sketch $\Gamma(t)$ and a corresponding decision algorithm TryToCover with the following properties:

• The sketch $\Gamma(t)$ uses $O((k+z)(z+1)(1/\varepsilon)^{d})$ storage.
• TryToCover either reports a collection $\mathcal{C}^{*}$ of $k$ balls of radius $\mbox{{\sc opt}}_{k,z}(P(t))+2\varepsilon\rho$ that together cover all points in $P(t)$ except for at most $z$ outliers, or it reports no. In the latter case we have $\mbox{{\sc opt}}_{k,z}(P(t))>2\rho$ .

Our sketch $\Gamma(t)$ is a tuple $(\tau(t),\mathcal{B}(t),\mathcal{R}(t))$ , where $\tau$ is a timestamp, $\mathcal{B}(t)$ is a collection of balls, and $\mathcal{R}(t)$ is a collection of so-called representative sets. The sketch has the following properties:

(Prop-1): For any time $t^{\prime}$ with $t\leqslant t^{\prime}<\tau(t)$ we are guaranteed that $\mbox{{\sc opt}}_{k,z}(P(t^{\prime}))>2\rho$ . The idea is that when we find $k+z+1$ points with pairwise distances greater than $2\rho$ , then we can set $\tau$ to the expiration time of the oldest of these points, and we can delete this point (as well as any older points) from the sketch.
(Prop-2): The number of balls in $\mathcal{B}(t)$ is $O((k+z)/\varepsilon^{d})$ . Each ball $B\in\mathcal{B}(t)$ has radius $\varepsilon\rho$ and the center of $B$ , denoted by $\operatorname{center}(B)$ , is a point from the stream. Note that the center does not need to be a point from $P(t)$ . The balls in $\mathcal{B}(t)$ are well spread in the sense that no ball contains the center of any other ball. In other words, we have $\operatorname{dist}(\operatorname{center}(B),\operatorname{center}(B^{\prime}))>\varepsilon\rho$ for any two balls $B,B^{\prime}\in\mathcal{B}(t)$ .
(Prop-3): For each ball $B\in\mathcal{B}(t)$ the set $\mathcal{R}(t)$ contains a representative set $R(B)\subseteq B\cap P(t)$ , and these are the only sets in $\mathcal{R}(t)$ . The representative sets $R(B)$ are pairwise disjoint, and each set $R(B)$ contains at most $z+1$ points.

Define a ball $B\in\mathcal{B}(t)$ to be full when $|R(B)|=z+1$ , and let $S(t):=\bigcup_{B\in\mathcal{B}(t)}R(B)$ . Then for any point $p_{i}\in P(t)\setminus S(t)$ we have (i) $t_{\mathrm{exp}}(p_{i})\leqslant\tau(t)$ , and/or (ii) $p_{i}\in B$ for a ball $B\in\mathcal{B}(t)$ that is full and such that all points in $R(B)$ arrived after $p_{i}$ .

At time $t=0$ , before the arrival of the first point, we have $\mathcal{B}(t)=\mathcal{R}(t)=\emptyset$ and $\tau(t)=0$ . Since $P(0)=\emptyset$ this trivially satisfies the various properties our sketch should have. Before we prove that we can maintain the sketch upon the arrival of new points and upon departure of old points, we present our decision algorithm TryToCover and prove its correctness. The algorithm is quite simple, and given in Algorithm 1.

Algorithm 1 TryToCover $(\Gamma(t))$

1: $S(t)\leftarrow\bigcup_{B\in\mathcal{B}(t)}R(B)$
2: Compute $\mbox{{\sc opt}}_{k,z}(S(t))$ and the corresponding collection $\mathcal{C}:=\{C_{1},\ldots,C_{k}\}$ of balls.
3: if $t<\tau(t)$ or $\mbox{{\sc opt}}_{k,z}(S(t))>2\rho$ then
4:     Report no
5: else
6:     Increase the radius of each ball $C_{i}\in\mathcal{C}$ by $2\varepsilon\rho$
7:     Report the collection $\mathcal{C}^{*}:=\{C^{*}_{1},\ldots,C^{*}_{k}\}$ of expanded balls.

Remark 2.1.

In line 2 we compute an optimal solution on the point set $S(t)$ . How this is done, and how much time this takes, depends on the specific space $X$ that we consider. In $d$ -dimensional Euclidean space, for instance, we can solve the $k$ -center problem with outliers in time polynomial in $n:=|S(t)|$ (for constant $k$ and $d$ ), as follows: first generate all $O(n^{d+1})$ potential centers, then generate all possible collections of $k$ such centers, and then for each of the $O(n^{(d+1)k})$ such collections find the minimum radius $\rho$ such that we can cover all except for $z$ points. For other spaces computing an optimal solution may not be easy, though it is always possible of course since our space $X$ is finite.

In fact, we do not need an exact solution to the problem on $S(t)$ . It is sufficient if we have an algorithm that computes, for any given $\delta>0$ , a $(1+\delta)$ -approximation of the optimal solution. By tuning the parameter $\varepsilon$ in the sketch and the parameter $\delta$ in the approximation algorithm appropriately, we can then still obtain the desired accuracy in our final answer. Since this is rather straightforward (but somewhat tedious) we omit the details, and we assume in the remainder that we compute an exact solution for $S(t)$ in line 2.
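For a finite space, the exhaustive search mentioned in the remark can be sketched as follows. This is a minimal illustration under stated assumptions: the centers are restricted to an explicitly given finite candidate set (standing in for $X$), the metric defaults to the Euclidean distance, and the name `opt_kz` is hypothetical. It is not the $O(n^{(d+1)k})$ Euclidean procedure described above, and its running time is exponential in $k$.

```python
import itertools
import math


def opt_kz(points, centers, k, z, dist=math.dist):
    """Brute-force opt_{k,z}: try every set of k centers from `centers`.

    Returns (radius, chosen_centers), or (inf, None) if len(centers) < k.
    """
    best_radius, best_centers = math.inf, None
    keep = max(len(points) - z, 0)  # number of points that must be covered
    for chosen in itertools.combinations(centers, k):
        # Distance from each point to its nearest chosen center, ascending.
        nearest = sorted(min(dist(p, c) for c in chosen) for p in points)
        # The z farthest points are the outliers; the next one sets the radius.
        radius = nearest[keep - 1] if keep > 0 else 0.0
        if radius < best_radius:
            best_radius, best_centers = radius, chosen
    return best_radius, best_centers


pts = [(0,), (1,), (10,), (11,), (50,)]
radius_with_outlier, _ = opt_kz(pts, pts, k=2, z=1)  # the point 50 is dropped
radius_without_outlier, _ = opt_kz(pts, pts, k=2, z=0)
```

In the usage example, allowing one outlier reduces the optimal radius from 10 to 1, because the isolated point no longer needs its own ball.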

The following lemma establishes the correctness of the algorithm.

Lemma 2.2.

Algorithm TryToCover either reports a collection $\mathcal{C}^{*}$ of $k$ balls of radius $\mbox{{\sc opt}}_{k,z}(P(t))+2\varepsilon\rho$ that together cover all points in $P(t)$ except for at most $z$ outliers, or it reports no. In the latter case we have $\mbox{{\sc opt}}_{k,z}(P(t))>2\rho$ .

Proof 2.3.

First suppose the algorithm reports no. If this happens because $t<\tau(t)$ then $\mbox{{\sc opt}}_{k,z}(P(t))>2\rho$ by (Prop-1). Otherwise, this happens because $\mbox{{\sc opt}}_{k,z}(S(t))>2\rho$ . But then $\mbox{{\sc opt}}_{k,z}(P(t))>2\rho$ , because (Prop-3) implies that $S(t)\subseteq P(t)$ .

Now suppose the algorithm reports a collection $\mathcal{C}^{*}:=\{C^{*}_{1},\ldots,C^{*}_{k}\}$ of balls. Let $\mathcal{C}$ be the corresponding set of balls before they were expanded. Since $\mathcal{C}$ is an optimal solution for $S(t)$ , the balls $C_{i}$ have radius $\mbox{{\sc opt}}_{k,z}(S(t))\leqslant\mbox{{\sc opt}}_{k,z}(P(t))$ and together they cover all points in $S(t)$ except for at most $z$ outliers. Now consider a point $p_{i}\in P(t)\setminus S(t)$ . To finish the proof, we must show that $p_{i}$ is covered by one of the balls in $\mathcal{C}^{*}$ . To this end, first observe that $t_{\mathrm{exp}}(p_{i})>t$ because $p_{i}\in P(t)$ . Since TryToCover did not report no, we have $t\geqslant\tau(t)$ , and so $t_{\mathrm{exp}}(p_{i})>\tau(t)$ . Hence, we can conclude from (Prop-3) that $p_{i}\in B$ for a ball $B\in\mathcal{B}(t)$ that is full. Thus $R(B)$ contains $z+1$ points, and since we allow only $z$ outliers this implies that at least one point $q\in R(B)$ is covered by a ball $C_{j}\in\mathcal{C}$ . Because $p_{i}$ and $q$ both lie in $B$ and $\operatorname{diam}(B)\leqslant 2\varepsilon\rho$ , we have $\operatorname{dist}(p_{i},q)\leqslant 2\varepsilon\rho$ , so $p_{i}$ must be covered by the expanded ball $C^{*}_{j}$ , thus finishing the proof.
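The decision procedure analysed in this proof can be sketched in Python as follows. This is an illustration under several assumptions, not a reference implementation: the sketch is passed as a timestamp `tau` and a dictionary `reps` from ball centers to representative lists, the optimal solution on $S(t)$ is found by brute force over a finite candidate set `candidates`, and distances are Euclidean. All names are hypothetical.

```python
import itertools
import math


def try_to_cover(t, tau, reps, candidates, k, z, rho, eps):
    """Decision procedure in the spirit of TryToCover.

    reps: dict mapping each ball center to its representative list R(B).
    candidates: finite set of allowed centers (stands in for the space X).
    Returns None for "no", else a list of (center, radius) pairs.
    """
    S = [p for R in reps.values() for p in R]
    keep = max(len(S) - z, 0)  # points of S that must be covered
    best_radius, best_centers = math.inf, None
    for chosen in itertools.combinations(candidates, k):
        nearest = sorted(min(math.dist(p, c) for c in chosen) for p in S)
        radius = nearest[keep - 1] if keep > 0 else 0.0
        if radius < best_radius:
            best_radius, best_centers = radius, chosen
    if t < tau or best_radius > 2 * rho:
        return None  # report "no"
    # Expand every ball by 2*eps*rho so that unstored points are covered too.
    return [(c, best_radius + 2 * eps * rho) for c in best_centers]


reps = {(0.0, 0.0): [(0.0, 0.0), (0.1, 0.0)], (5.0, 0.0): [(5.0, 0.0)]}
cands = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)]
yes = try_to_cover(t=10, tau=0, reps=reps, candidates=cands,
                   k=1, z=1, rho=1.0, eps=0.5)
no_by_tau = try_to_cover(t=10, tau=20, reps=reps, candidates=cands,
                         k=1, z=1, rho=1.0, eps=0.5)
no_by_radius = try_to_cover(t=10, tau=0, reps=reps, candidates=cands,
                            k=1, z=1, rho=0.01, eps=0.5)
```

The two negative cases correspond to the two reasons for reporting no that the proof distinguishes: the timestamp has not yet passed, or the optimum on the stored points already exceeds $2\rho$.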

Next we show how to update the sketch $\Gamma(t)$ .

Handling departures

Handling departures is easy. When a point $p_{j}$ in one of our representative sets $R(B)$ expires, we simply delete it from $R(B)$ , and if $R(B)$ then becomes empty we remove $R(B)$ from $\mathcal{R}(t)$ and $B$ from $\mathcal{B}(t)$ .

It is trivial to verify that (Prop-1) and (Prop-2) still hold for the updated sketch. To see that (Prop-3) holds as well, consider a point $p_{i}\in P(t)\setminus S(t)$ . The only reason for (Prop-3) to be violated, would be when $p_{i}\in B$ for a ball $B$ that was full before the deletion of $p_{j}$ but is no longer full after the deletion. However, (Prop-3) states that all points in $R(B)$ arrived after $p_{i}$ . Since $p_{i}$ did not yet expire, this means that the point $p_{j}$ that currently expires cannot be a point from $R(B)$ .
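The departure step can be sketched as follows, assuming the representative sets are stored in a dictionary keyed by ball center and that each point is identified by a unique `(arrival_time, label)` pair. Both the representation and the name `handle_departure` are assumptions made for the illustration.

```python
def handle_departure(reps, expired):
    """Remove an expired point from its representative set, if it is stored.

    reps: dict mapping ball centers to lists R(B). A ball whose list becomes
    empty is dropped, mirroring the removal of B and R(B) from the sketch.
    """
    for center, R in list(reps.items()):
        if expired in R:
            R.remove(expired)
            if not R:
                del reps[center]
            break  # representative sets are pairwise disjoint


reps = {"c1": [(1, "a"), (2, "b")], "c2": [(3, "c")]}
handle_departure(reps, (3, "c"))   # ball c2 becomes empty and is removed
handle_departure(reps, (1, "a"))   # ball c1 keeps its remaining point
handle_departure(reps, (9, "x"))   # points not in the sketch are ignored
```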

Handling arrivals

Algorithm 2 shows how to handle the arrival of a new point $p_{j}$ at time $t:=t_{\mathrm{arr}}(p_{j})$ . We denote the sketch just before the arrival by $\Gamma(t^{-})$ , and the updated sketch by $\Gamma(t^{+})$ .

Algorithm 2 HandleArrival $(\Gamma(t^{-}),p_{j})$

1: $\mathcal{B}\leftarrow\mathcal{B}(t^{-})$   $\triangleright$ $\mathcal{B}$ will be the set of balls from which we will pick the balls in $\mathcal{B}(t^{+})$
2: if $p_{j}\in B$ for some $B\in\mathcal{B}$ then
3:     Add $p_{j}$ to $R(B)$ for an arbitrary such ball $B$ . If we now have $|R(B)|=z+2$ because $R(B)$ was full before the addition of $p_{j}$ , then remove the oldest point from $R(B)$
4: else
5:     Add the ball $B:=\operatorname{ball}(p_{j},\varepsilon\rho)$ to $\mathcal{B}$ and set $R(B)\leftarrow\{p_{j}\}$
6: $S\leftarrow\bigcup_{B\in\mathcal{B}}R(B)$ ; $\mathcal{A}\leftarrow\emptyset$   $\triangleright$ $\mathcal{A}$ is a set of so-called anchor points
7: $\mathcal{B}(t^{+})\leftarrow\emptyset$ ; $\mathcal{R}(t^{+})\leftarrow\emptyset$
8: while $S\neq\emptyset$ and $|\mathcal{A}|<k+z$ do
9:     Let $p_{i}$ be the youngest point in $S$ . Add $p_{i}$ to $\mathcal{A}$
10:     for all balls $B\in\mathcal{B}$ such that $\operatorname{dist}(p_{i},\operatorname{center}(B))\leqslant(2+\varepsilon)\rho$ do
11:         Add $B$ to $\mathcal{B}(t^{+})$ , add $R(B)$ to $\mathcal{R}(t^{+})$ , and set $S\leftarrow S\setminus R(B)$
12: if $S\neq\emptyset$ then
13:     Let $p_{i^{*}}$ be the youngest point in $S$ and set $\tau(t^{+})\leftarrow\max\left(\tau(t),t_{\mathrm{exp}}(p_{i^{*}})\right)$
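The arrival step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: points are `(arrival_time, coords)` pairs with distinct arrival times, distances are Euclidean, the sketch is a dictionary from ball centers to representative lists (oldest first) together with the timestamp `tau`, and the function name and signature are hypothetical.

```python
import math


def handle_arrival(balls, tau, p, k, z, rho, eps, W):
    """One arrival step in the spirit of HandleArrival.

    balls: dict mapping a ball's center to its representative list R(B),
           oldest point first. Points are (arrival_time, coords) pairs.
    Returns the updated (balls, tau).
    """
    def d(a, b):
        return math.dist(a[1], b[1])

    # Put p into an existing ball that contains it, or open a new ball at p.
    home = next((c for c in balls if d(p, c) <= eps * rho), None)
    if home is None:
        balls[p] = [p]
    else:
        balls[home].append(p)
        if len(balls[home]) > z + 1:
            balls[home].pop(0)  # drop the oldest representative

    # Pick up to k+z anchors, youngest first, and keep every ball whose
    # center lies within (2+eps)*rho of some anchor.
    remaining = dict(balls)
    kept, anchors = {}, 0
    while remaining and anchors < k + z:
        anchor = max(q for R in remaining.values() for q in R)
        anchors += 1
        for c in [c for c in remaining if d(anchor, c) <= (2 + eps) * rho]:
            kept[c] = remaining.pop(c)

    # Leftover points are discarded; the youngest one determines until when
    # the decision procedure may answer "no" without looking at the points.
    if remaining:
        youngest = max(q for R in remaining.values() for q in R)
        tau = max(tau, youngest[0] + W)
    return kept, tau


p1, p2, p3 = (1, (0.0, 0.0)), (2, (0.2, 0.0)), (3, (10.0, 0.0))
balls, tau = {}, 0
for p in (p1, p2, p3):
    balls, tau = handle_arrival(balls, tau, p, k=1, z=0, rho=1.0, eps=0.5, W=100)
```

In the usage example ($k=1$, $z=0$), the far-away point `p3` becomes the only anchor, the ball around `p1` (whose representative is `p2`) is discarded, and the timestamp is set to the expiration time of `p2`.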

Lemma 2.4.

The sketch computed by HandleArrival has properties (Prop-1)–(Prop-3).

Proof 2.5.

First consider (Prop-1). If $\tau(t^{+})=\tau(t^{-})$ then obviously (Prop-1) still holds, so assume that $\tau$ is updated by the algorithm. Then in the loop of lines 8–11 we added $k+z$ anchor points to the set $\mathcal{A}$ , and after doing so $S$ is still non-empty. Note that whenever we add a point $p_{i}$ as anchor point to $\mathcal{A}$ , then all balls $B$ with $\operatorname{dist}(p_{i},\operatorname{center}(B))\leqslant(2+\varepsilon)\rho$ are added to $\mathcal{B}(t^{+})$ . Moreover, the points in the corresponding representative sets $R(B)$ are removed from $S$ . Hence, all points that remain in $S$ must be in balls whose center lies at distance more than $(2+\varepsilon)\rho$ from $p_{i}$ . Since the balls have radius $\varepsilon\rho$ , this means that all remaining points in $S$ are at distance more than $2\rho$ from $p_{i}$ .

This implies two things: the distance between any two anchor points in $\mathcal{A}$ is more than $2\rho$ , and the distance from the point $p_{i^{*}}$ that defines $\tau(t^{+})$ in line 13 to any anchor point is more than $2\rho$ . Hence, the set $\mathcal{A}^{*}:=\mathcal{A}\cup\{p_{i^{*}}\}$ consists of $k+z+1$ points whose pairwise distances are more than $2\rho$ . Thus any ball of radius $\rho$ covers at most one point from $\mathcal{A}^{*}$ , and since we allow at most $z$ outliers this implies that $\mbox{{\sc opt}}_{k,z}(\mathcal{A}^{*})>\rho$ . It remains to observe that the point $p_{i^{*}}$ is the oldest point in $\mathcal{A}^{*}$ , since in line 9 we pick the youngest remaining point to be the next anchor point. Hence, until $p_{i^{*}}$ expires at time $\tau(t^{+})=t_{\mathrm{exp}}(p_{i^{*}})$ we have $\mathcal{A}^{*}\subseteq P(t^{\prime})$ , and so $\mbox{{\sc opt}}_{k,z}(P(t^{\prime}))>\rho$ for all $t\leqslant t^{\prime}<\tau(t^{+})$ . This proves that (Prop-1) still holds.

Now consider (Prop-2). The set $\mathcal{B}(t^{+})$ must be well spread, because $\mathcal{B}(t^{-})$ was well spread and we only add $\operatorname{ball}(p_{j},\varepsilon\rho)$ in line 5 when its center is outside the existing balls. To prove that $|\mathcal{B}(t^{+})|=O((k+z)/\varepsilon^{d})$ , first observe that $|\mathcal{A}|\leqslant k+z$ by construction. In line 11 we add for each anchor point $p_{i}\in\mathcal{A}$ one or more balls to $\mathcal{B}(t^{+})$ . The centers of these balls all lie at distance at most $(2+\varepsilon)\rho$ from $p_{i}$ , so they are contained in $\operatorname{ball}(p_{i},(2+\varepsilon)\rho)$ . Because the underlying space $X$ has doubling dimension $d$ , we can cover $\operatorname{ball}(p_{i},(2+\varepsilon)\rho)$ by a set $\mathcal{D}$ consisting of $O(1/\varepsilon^{d})$ balls of radius $\varepsilon\rho/2$ . Because $\mathcal{B}(t^{+})$ is well spread, any ball in $\mathcal{D}$ contains at most one center of a ball from $\mathcal{B}(t^{+})$ . Hence we add $O(1/\varepsilon^{d})$ balls for each anchor point, thus finishing the proof of (Prop-2).
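The bound $|\mathcal{D}|=O(1/\varepsilon^{d})$ follows by applying the doubling property repeatedly. A short derivation, assuming $0<\varepsilon<1$ and using the auxiliary symbol $j$ (not in the original text) for the number of halving rounds:

```latex
% j rounds of halving turn one ball of radius r into 2^{dj} balls of radius r/2^j.
% With r = (2+\varepsilon)\rho we need r/2^j \le \varepsilon\rho/2,
% that is, 2^j \ge 2(2+\varepsilon)/\varepsilon.
\[
  j := \left\lceil \log_2 \frac{2(2+\varepsilon)}{\varepsilon} \right\rceil
  \quad\Longrightarrow\quad
  |\mathcal{D}| \leqslant 2^{dj}
  \leqslant \left( 2\cdot\frac{2(2+\varepsilon)}{\varepsilon} \right)^{d}
  = \left( \frac{8+4\varepsilon}{\varepsilon} \right)^{d}
  < \left( \frac{12}{\varepsilon} \right)^{d}
  = O(1/\varepsilon^{d}).
\]
```

The last step uses that $d$ is a fixed constant.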

It remains to argue that (Prop-3) is maintained. The only property that is non-trivial to check is the property the points in $P(t)\setminus S(t)$ need to satisfy. To this end consider a point $p_{i}\in P(t^{+})\setminus S(t^{+})$ , and assume that $t_{\mathrm{exp}}(p_{i})>\tau(t^{+})$ . Note that $p_{i}$ cannot be the just inserted point $p_{j}$ , because $p_{j}$ is the first point added as an anchor point to $\mathcal{A}$ . Hence, $p_{i}\in P(t^{-})$ . There are two cases.

The first case is that $p_{i}\in P(t^{-})\setminus S(t^{-})$ . Then we have $t_{\mathrm{exp}}(p_{i})>\tau(t^{-})$ , which implies that $p_{i}\in B$ for a ball $B\in\mathcal{B}(t^{-})$ that was full and such that all points in $R(B)$ arrived later than $p_{i}$ . We may have replaced a point from $R(B)$ by $p_{j}$ in line 3, but this does not violate the property that all points in $R(B)$ arrived after $p_{i}$ . Thus the only problem that may arise is that $B$ was not added to $\mathcal{B}(t^{+})$ in the loop of lines 8–11. But then $R(B)\subseteq S$ when line 13 is reached. Since all points in $R(B)$ are younger than $p_{i}$ , this contradicts that $t_{\mathrm{exp}}(p_{i})>\tau(t^{+})$ .

The second case is that $p_{i}\not\in P(t^{-})\setminus S(t^{-})$ . Since $p_{i}\in P(t^{-})$ this means that $p_{i}\in S(t^{-})$ , and because $p_{i}\not\in S(t^{+})$ we know that $p_{i}$ is one of the points considered in line 13. But then $t_{\mathrm{exp}}(p_{i})\leqslant\tau(t^{+})$ , again contradicting our assumptions.

2.2 A sketch for the optimization problem

Above we presented a sketch for a decision version of the problem. For given parameters $\rho$ and $\varepsilon$ , the sketch uses $O((k+z)(z+1)(1/\varepsilon)^{d})$ storage. We also gave an algorithm TryToCover that either reports a collection $\mathcal{C}^{*}$ of $k$ balls of radius $\mbox{{\sc opt}}_{k,z}(P(t))+2\varepsilon\rho$ that together cover all points in $P(t)$ except for at most $z$ outliers, or that reports no. In the latter case we know that $\mbox{{\sc opt}}_{k,z}(P(t))>2\rho$ . To make the parameters $\rho$ and $\varepsilon$ explicit, we will from now on denote the sketch by $\Gamma_{\rho,\varepsilon}$ .

In the optimization version of the problem we wish to find $k$ congruent balls of minimum radius that together cover all points in $P(t)$ except for at most $z$ outliers. Our sketch for this problem, for a given $\varepsilon>0$ , is defined as follows.

• We maintain a sketch $\Gamma_{\rho_{i},\varepsilon/2}$ for every $0\leqslant i\leqslant\lfloor\log\sigma\rfloor$ , where $\rho_{i}:=2^{i}\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ .

Our algorithm to approximate the optimal solution to the $k$ -center problem with outliers on $P(t)$ is given in Algorithm 3.

Algorithm 3 FindApproximateCenters $(\Gamma(t))$

1: $i\leftarrow 0$
2: repeat
3:     $\mathit{answer}\leftarrow\textsc{TryToCover}(\Gamma_{\rho_{i},\varepsilon/2})$ ; if $\mathit{answer}=$ no then $i\leftarrow i+1$
4: until $\mathit{answer}\neq$ no
5: if the radii of the balls in $\mathit{answer}$ are less than $\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ then
6:     Reduce the radii of the balls to zero
7: Report $\mathit{answer}$
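The search over the radii $\rho_{i}$ can be sketched as follows. The callback `decide` is a hypothetical stand-in for running TryToCover on the sketch maintained for a given $\rho_{i}$ (returning `None` for no), and the final step that reduces very small radii to zero is omitted for brevity. Function and parameter names are assumptions made for the example.

```python
import math


def radii(d_min, d_max):
    """Guesses rho_i = 2^i * d_min for 0 <= i <= floor(log2(sigma))."""
    sigma = d_max / d_min
    return [d_min * 2 ** i for i in range(math.floor(math.log2(sigma)) + 1)]


def find_approximate_centers(decide, d_min, d_max):
    """Return (rho, answer) for the smallest guess rho that is not rejected.

    decide(rho) plays the role of the decision procedure for that guess and
    returns None for "no".
    """
    for rho in radii(d_min, d_max):
        answer = decide(rho)
        if answer is not None:
            return rho, answer
    return None


guesses = radii(1.0, 100.0)  # floor(log2(100)) = 6, so seven guesses
# Toy decision procedure: accept as soon as 2*rho reaches the value 9.
result = find_approximate_centers(
    lambda rho: "cover" if 2 * rho >= 9 else None, 1.0, 100.0)
```

Only $\lfloor\log\sigma\rfloor+1$ guesses are needed, which is where the $\log\sigma$ factor in the storage bound comes from.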

We can now prove our main theorem.

Theorem 2.6.

Let $X$ be a finite space of doubling dimension $d$ and spread $\sigma$ . Let $0<\varepsilon<1$ be a given parameter. There is a sketch for the $k$ -center problem with $z$ outliers on streams from $X$ in the sliding-window model. The sketch uses $O((k+z)(z+1)(1/\varepsilon)^{d}\log\sigma)$ storage, and it allows us to report at any time $t$ a collection $\mathcal{C}^{*}$ of balls of radius at most $(1+\varepsilon)\cdot\mbox{{\sc opt}}_{k,z}(P(t))$ that together cover all points from $P(t)$ , except at most $z$ outliers.

Proof 2.7.

First consider the case $\mbox{{\sc opt}}_{k,z}(P(t))=0$ . When we run $\textsc{TryToCover}(\Gamma_{\rho_{i},\varepsilon/2})$ with $i=0$ , then by Lemma 2.2 we will report a collection of balls of radius

$$\mbox{{\sc opt}}_{k,z}(P(t))+2(\varepsilon/2)\rho_{0}=\varepsilon\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime}).$$

Since $\varepsilon<1$ , the radii are strictly smaller than $\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$ . Since the balls are centered at points from $X$ , we may as well reduce the radii to zero, thus achieving an optimal solution.

Next, consider the case $\mbox{{\sc opt}}_{k,z}(P(t))>0$ . Then $\mbox{{\sc opt}}_{k,z}(P(t))\geqslant\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})=\rho_{0}$ . Consider the loop in lines 2–4 and let $i^{*}$ be the value of the counter $i$ when we obtain $\textsc{TryToCover}(\Gamma_{\rho_{i},\varepsilon/2})\neq$ no. Note that this must happen at some point, because for $i=\lfloor\log\sigma\rfloor$ we have

$$\rho_{i}=2^{\lfloor\log\sigma\rfloor}\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})\geqslant(\sigma/2)\cdot\min_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})=(1/2)\cdot\max_{q,q^{\prime}\in X}\operatorname{dist}(q,q^{\prime})$$

and so Lemma 2.2 guarantees that $\mathit{answer}\neq$ no. If $i^{*}=0$ then the balls in the reported solution have radius

$$\mbox{{\sc opt}}_{k,z}(P(t))+2(\varepsilon/2)\rho_{0}\leqslant(1+\varepsilon)\cdot\mbox{{\sc opt}}_{k,z}(P(t)).$$

Otherwise we know that the answer for $i=i^{*}-1$ is no, which implies that $\mbox{{\sc opt}}_{k,z}(P(t))>2\rho_{i^{*}-1}=\rho_{i^{*}}$ . Hence, the balls in the solution that is reported for $i^{*}$ have radius

$$\mbox{{\sc opt}}_{k,z}(P(t))+2(\varepsilon/2)\rho_{i^{*}}\leqslant(1+\varepsilon)\cdot\mbox{{\sc opt}}_{k,z}(P(t)).$$

Our sketch for the $k$ -center problem can also be used for the $k^{\prime}$ -center problem for $k^{\prime}<k$ . Moreover, a sketch for the $1$ -center problem (or a sketch for the $k$ -center problem for $k>1$ ) can also be used for the diameter problem. Recall that the diameter problem with outliers for the set $P(t)$ asks for the value

\operatorname{diam}_{z}(P(t)):=\min\{\operatorname{diam}(P(t)\setminus Q):|Q|=z\},

that is, $\operatorname{diam}_{z}(P(t))$ is the smallest diameter one can obtain by deleting $z$ outliers from $P(t)$ . We say that an algorithm reports a $(1-\varepsilon)$ -approximation to $\operatorname{diam}_{z}(P(t))$ if it reports a value $D$ with $(1-\varepsilon)\cdot\operatorname{diam}_{z}(P(t))\leqslant D\leqslant\operatorname{diam}_{z}(P(t))$ .
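To make the definition concrete, the following brute-force sketch evaluates $\operatorname{diam}_{z}$ on a small static point set by trying every choice of $z$ outliers. It is exponential in $z$ and is not part of the sketch maintained in the stream; the function names, the use of Euclidean distance, and the convention of returning $0$ when $z$ is at least the number of points are assumptions made for illustration.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def diam(points: Sequence[Point]) -> float:
    """Largest pairwise Euclidean distance; 0.0 for fewer than two points."""
    return max((math.dist(p, q) for p, q in combinations(points, 2)), default=0.0)


def diam_with_outliers(points: Sequence[Point], z: int) -> float:
    """Smallest diameter obtainable by deleting exactly z points (brute force)."""
    n = len(points)
    if z >= n:
        # Convention assumed here: nothing (or too little) is left to measure.
        return 0.0
    best = math.inf
    # Choosing the n - z points to keep is equivalent to choosing z outliers.
    # Indices are used so that duplicate points in the stream are handled.
    for kept in combinations(range(n), n - z):
        best = min(best, diam([points[i] for i in kept]))
    return best


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
    print(diam_with_outliers(pts, 0), diam_with_outliers(pts, 1))
```

In the example, removing the single far point $(10,10)$ drops the diameter from $\sqrt{200}$ to $\sqrt{2}$.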

Theorem 2.8.

The sketch for the $k$ -center problem with outliers as presented above can also be used to provide a $(1+\varepsilon)$ -approximation for the $k^{\prime}$ -center problem with outliers, for any $1\leqslant k^{\prime}\leqslant k$ . Moreover, it can be used to provide a $(1-2\varepsilon)$ -approximation for the diameter problem with outliers.

Proof 2.9.

The only place where the value $k$ plays a role in properties (Prop-1)–(Prop-3), except for in the bound on the size of $\mathcal{B}(t)$ , is in (Prop-1). Since $\mbox{{\sc opt}}_{k^{\prime},z}(P(t))\geqslant\mbox{{\sc opt}}_{k,z}(P(t))$ for any $k^{\prime}\leqslant k$ , this means that a sketch for the $k$ -center problem will have the properties required of a sketch for the $k^{\prime}$ -center problem. Thus running TryToCover on a sketch for the $k$ -center, where in line 2 we compute $\mbox{{\sc opt}}_{k^{\prime},z}(P(t))$ , will give a correct result for $k^{\prime}$ .

Now consider the diameter problem. Suppose we run TryToCover on a sketch for the $k$ -center problem, where in line 2 we compute $\operatorname{diam}_{z}(P(t))$ , and instead of lines 6–7 we report $D:=\operatorname{diam}_{z}(S(t))$ . We claim that if the algorithm reports no then we have $\operatorname{diam}_{z}(P(t))>2\rho$ , and otherwise $\operatorname{diam}_{z}(P(t))-4\varepsilon\rho\leqslant D\leqslant\operatorname{diam}_{z}(P(t))$ .

Note that $\operatorname{diam}_{z}(P(t))\geqslant\mbox{{\sc opt}}_{k,z}(P(t))$ for any $k\geqslant 1$ . Hence, when $t<\tau(t)$ then we have $\operatorname{diam}_{z}(P(t))\geqslant\mbox{{\sc opt}}_{k,z}(P(t))>2\rho$ . This implies that the claim holds when TryToCover reports no.

If TryToCover does not report no, it reports $D:=\operatorname{diam}_{z}(S(t))$ . Clearly we then have $D\leqslant\operatorname{diam}_{z}(P(t))$ . Now suppose for a contradiction that $D<\operatorname{diam}_{z}(P(t))-4\varepsilon\rho$ . Let $p_{i},p_{j}\in P(t)$ be such that $\operatorname{dist}(p_{i},p_{j})=\operatorname{diam}_{z}(P(t))$ . We will argue that there are points $p_{i^{\prime}},p_{j^{\prime}}\in S(t)$ such that $\operatorname{dist}(p_{i},p_{i^{\prime}})\leqslant 2\varepsilon\rho$ and $\operatorname{dist}(p_{j},p_{j^{\prime}})\leqslant 2\varepsilon\rho$ . But then we would have $D=\operatorname{diam}_{z}(S(t))\geqslant\operatorname{diam}_{z}(P(t))-4\varepsilon\rho$ , which contradicts the assumption. We will argue the existence of $p_{i^{\prime}}$ ; the argument for $p_{j^{\prime}}$ is similar. If $p_{i}\in S(t)$ then we can take $p_{i^{\prime}}:=p_{i}$ and we are done. Otherwise, as argued in the proof of Lemma 2.2, we know that $p_{i}\in B$ for a ball $B\in\mathcal{B}(t)$ that is full. In particular $R(B)$ contains at least one point $p_{i^{\prime}}$ , and this point must be at distance at most $2\varepsilon\rho$ from $p_{i}$ , as claimed.

We have proved the claim that if the algorithm reports no then we have $\operatorname{diam}_{z}(P(t))>2\rho$ , and otherwise we have $\operatorname{diam}_{z}(P(t))-4\varepsilon\rho\leqslant D\leqslant\operatorname{diam}_{z}(P(t))$ . Plugging this claim for the decision problem into the mechanism to obtain a sketch for the optimization problem—recall that there we used $\varepsilon/2$ as parameter—now gives the desired result.
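One way to spell out this last step, assuming it mirrors the argument of Proof 2.7, is the following. Let $i^{*}$ be the first index for which the answer is not no, and recall that the decision sketch is run with parameter $\varepsilon/2$. For $i^{*}>0$ the answer no at $i^{*}-1$ bounds $\rho_{i^{*}}$ from above. For $i^{*}=0$ the same bound holds with $\leqslant$ whenever $\operatorname{diam}_{z}(P(t))>0$, since distinct points of $X$ are at distance at least $\rho_{0}$, and the case $\operatorname{diam}_{z}(P(t))=0$ is trivial because $0\leqslant D\leqslant\operatorname{diam}_{z}(P(t))$.

```latex
% Assumption: i^* > 0, so the decision sketch answered "no" at scale i^* - 1.
\begin{align*}
D &\geqslant \operatorname{diam}_{z}(P(t)) - 4(\varepsilon/2)\rho_{i^{*}}
   = \operatorname{diam}_{z}(P(t)) - 2\varepsilon\rho_{i^{*}}, \\
\rho_{i^{*}} &= 2\rho_{i^{*}-1} < \operatorname{diam}_{z}(P(t)), \\
D &\geqslant (1-2\varepsilon)\cdot\operatorname{diam}_{z}(P(t)).
\end{align*}
```

Together with $D\leqslant\operatorname{diam}_{z}(P(t))$ this is the $(1-2\varepsilon)$-approximation stated in Theorem 2.8.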

Theorem 2.8 implies that there exists a sketch for the diameter problem with outliers in the sliding-window model that gives a $(1-\varepsilon)$ -approximation to $\operatorname{diam}_{z}(P(t))$ using $O((z^{2}/\varepsilon^{d})\log\sigma)$ storage, namely the sketch for the 1-center problem.

3 Concluding remarks

We presented the first sketch for the $k$ -center problem with outliers in the sliding-window model. We assumed that the points in the stream come from a finite space $X$ of bounded doubling dimension and spread $\sigma$ , such as points in $\mathbb{R}^{d}$ with integer coordinates from the set $\{0,\ldots,U\}$ with $U:=\lfloor\sigma/\sqrt{d}\rfloor$ . Alternatively, we can assume that we are given a range $[\mbox{{\sc opt}}_{\min},\mbox{{\sc opt}}_{\max}]$ of possible values for $\mbox{{\sc opt}}_{k,z}(P(t))$ . Assuming we also have a subroutine available to compute (a $(1+\varepsilon)$ -approximation of) $\mbox{{\sc opt}}_{k,z}(S)$ for any given static set $S$ , we can then decide if $\mbox{{\sc opt}}_{k,z}(P(t))\in[\mbox{{\sc opt}}_{\min},\mbox{{\sc opt}}_{\max}]$ , and, if so, give a $(1+\varepsilon)$ -approximation of $\mbox{{\sc opt}}_{k,z}(P(t))$ . The storage will be as stated in Theorem 2.6, with $\sigma:=\mbox{{\sc opt}}_{\max}/\mbox{{\sc opt}}_{\min}$ .

Our sketch has low storage when $z$ , the number of outliers, is not too large. It would be interesting to see if the dependency on $z$ in the storage requirements can be reduced. A first step would be to see if it is possible to refine our approach to obtain a sketch with $O((k+z)/\varepsilon^{d})$ storage, saving a factor $z$ . Reducing the dependency on $z$ to sublinear may be possible using a random-sampling approach, if one is willing to allow slightly more than $z$ outliers. Another challenging open problem is to develop a sketch whose storage is only polynomially dependent on the doubling dimension $d$ .
