
Learning to Cluster via Same-Cluster Queries

Yi Li (Nanyang Technological University, 21 Nanyang Link, Singapore 637371) [email protected]; Yan Song (Indiana University Bloomington, 700 North Woodlawn Avenue, Bloomington, IN, USA 47408) [email protected]; and Qin Zhang (Indiana University Bloomington, 700 North Woodlawn Avenue, Bloomington, IN, USA 47408) [email protected]
Abstract.

We study the problem of learning to cluster data points using an oracle which can answer same-cluster queries. Different from previous approaches, we do not assume that the total number of clusters is known at the beginning and do not require that the true clusters are consistent with a predefined objective function such as the $K$-means. These relaxations are critical from the practical perspective and, meanwhile, make the problem more challenging. We propose two algorithms with provable theoretical guarantees and verify their effectiveness via an extensive set of experiments on both synthetic and real-world data.

clustering; weak supervision; same-cluster oracle
CCS Concepts: Theory of computation → Semi-supervised learning; Theory of computation → Facility location and clustering

1. Introduction

Clustering is a fundamental problem in data analytics and sees numerous applications across many areas in computer science. However, many clustering problems are computationally hard, even for approximate solutions. To alleviate the computational burden, Ashtiani et al. (Ashtiani et al., 2016) introduced weak supervision into the clustering process. More precisely, they allow the algorithm to query an oracle which can answer same-cluster queries, that is, whether two elements (points in the Euclidean space) belong to the same cluster. The oracle can, for example, be a domain expert in the setting of crowdsourcing. It was shown in (Ashtiani et al., 2016) that the $K$-means clustering can be solved computationally efficiently with the help of a small number of same-cluster queries to the oracle.

The initial work by Ashtiani et al. (Ashtiani et al., 2016) relies on a strong “data niceness property” called the $\gamma$-margin, which requires that for each cluster $C$ with center $\mu$, the distance between any point $p\not\in C$ and $\mu$ be larger than the distance between any point $p\in C$ and $\mu$ by at least a sufficiently large multiplicative factor $\gamma>1$. This assumption was later removed by Ailon et al. (Ailon et al., 2018b), who gave an algorithm which, given a parameter $K$, outputs a set $C$ of $K$ centers attaining a $(1+\epsilon)$-approximation of the $K$-means objective function. There has been follow-up work by Chien et al. (Chien et al., 2018) in the same setting, replacing the $\gamma$-margin assumption with a new assumption on the size of the clusters. However, there are still two critical issues in this line of approach:

  1. In (Ailon et al., 2018b; Chien et al., 2018), it is assumed that the total number of clusters $K$ in the true clustering is known to the algorithm, which is unrealistic in many cases.

  2. In (Ailon et al., 2018b; Chien et al., 2018), it is assumed that the optimal solution to the $K$-means clustering is consistent with the true clustering (i.e., the clustering given by the ground-truth labels of all points). This assumption is highly problematic, since the ground-truth clustering can be arbitrary and very different from an optimal solution with respect to a fixed objective function such as the $K$-means.

In this paper, we aim to remove both assumptions. We shall design algorithms that find the approximate centers of the true clusters using same-cluster queries, in the setting where we do not know in advance the number of clusters in the true clustering and the true clustering need not bear any relationship to the optimal solution of a particular objective function. We obtain our result at a small cost: our sample-based algorithm may not be able to find the centers of small clusters which are very close to some big identified clusters; we shall elaborate on this shortly.

Problem Setup.   We denote the clusters by $X_{1},X_{2},\dots$, which are point sets in the Euclidean space. For each $X_{i}$, let $\mu_{i}=\frac{1}{|X_{i}|}\sum_{x\in X_{i}}x$ be its centroid. For simplicity we refer to $X_{i}$ as “cluster $i$” or the “$i$-th cluster”. The number of clusters is unknown at the beginning. Our goal is to find all “big” clusters and their approximate centroids $\tilde{\mu}_{i}$.

To specify what we mean by finding all big clusters, we need to introduce a concept called reducibility. Several previous studies (Kumar et al., 2010; Jaiswal et al., 2014; Ailon et al., 2018b) on the $K$-means/median clustering problem assumed that after finding $K$ main clusters, the residual ones satisfy some reducibility condition, which states that using the $K$ discovered cluster centers to cover the residual clusters increases the total cost by only a small $(1+\epsilon)$ factor. Similarly, we consider the following reducibility condition, based on the usual cost function.

Definition 1 (cost function).

Suppose that $X$ and $C$ are the set of data points and the set of centers, respectively. The cost of covering $X$ using $C$ is $\Phi(X,C)=\sum_{x\in X}\min_{c\in C}\|x-c\|^{2}$, where $\|\cdot\|$ denotes the Euclidean norm. When $C=\{c\}$ is a singleton, we also write $\Phi(X,\{c\})$ as $\Phi(X,c)$. When $C=\emptyset$, we define $\Phi(X,C)=|X|\cdot\sup_{x,y\in X}\|x-y\|^{2}$.
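
To make the cost function concrete, here is a minimal NumPy sketch (the function name `cost` and the array conventions are our own illustrative choices, not part of the paper):

```python
import numpy as np

def cost(X, C):
    """Phi(X, C): sum over x in X of the squared distance to the nearest center in C.

    X: (n, d) array of data points; C: (m, d) array of centers (possibly empty).
    Following Definition 1, an empty C yields |X| * sup_{x,y in X} ||x - y||^2.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    C = np.atleast_2d(np.asarray(C, dtype=float))
    if C.size == 0:
        diffs = X[:, None, :] - X[None, :, :]          # all pairwise differences
        return X.shape[0] * float(np.max(np.sum(diffs ** 2, axis=-1)))
    d2 = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=-1)  # (n, m) squared distances
    return float(np.sum(np.min(d2, axis=1)))
```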

Definition 2 ($\epsilon$-reducibility).

Let $X_{1},X_{2},\ldots$ be the true clusters in $X$. Let $I$ be a subset of indices of clusters in $X$. We say the clusters in $X$ are $\epsilon$-reducible w.r.t. $I$ if it holds for each $\ell\not\in I$ that

$\Phi\left(X_{\ell},\bigcup_{i\in I}\{\mu_{i}\}\right)\leq\epsilon\sum_{i\in I}\Phi(X_{i},\mu_{i}).$

Intuitively, this reducibility condition states that the clusters outside $I$ will be covered by the centers of the clusters in $I$ with only a small increase in the total cost.

Our goal is to find a subset of indices $I$ such that the set of true clusters in $X$ is $\epsilon$-reducible w.r.t. $I$.
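
For intuition, the following sketch (reusing the `cost` function above; the name `is_eps_reducible` is ours) checks Definition 2 directly when the true clusters are available, which of course they are not to our query-based algorithms:

```python
def is_eps_reducible(clusters, I, eps):
    """Check Definition 2: every cluster outside I must be covered by the centroids
    {mu_i : i in I} at cost at most eps * sum_{i in I} Phi(X_i, mu_i).

    clusters: list of (n_i, d) arrays (the true clusters X_1, X_2, ...);
    I: iterable of indices into `clusters`; eps: the reducibility parameter."""
    I = set(I)
    centroids = np.vstack([clusters[i].mean(axis=0) for i in I])
    budget = eps * sum(cost(clusters[i], clusters[i].mean(axis=0)) for i in I)
    return all(cost(clusters[l], centroids) <= budget
               for l in range(len(clusters)) if l not in I)
```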

Note that in this formulation, we do not attempt to optimize any objective function; instead, we just try to recover the approximate centroids of a set of clusters to which all other clusters are reducible. Inevitably we have to adopt a distance function in our definition of reducibility, and we choose the widely used $D^{2}$ (Euclidean-squared) distance function so that we can still use the $D^{2}$-sampling method developed and used in earlier works (Arthur and Vassilvitskii, 2007; Jaiswal et al., 2014; Ailon et al., 2018b). $D^{2}$-sampling will be introduced in Definition 2.

Table 1.1. Notations
$X$: set of input points
$r$: index of the round
$I$: set of indices of recovered clusters so far
$k$: number of recovered clusters so far
$Q$: set of indices of newly discovered clusters in the current round
$q$: size of $Q$; number of newly discovered clusters in the current round
$S$: multiset of samples; each sample has the form $(x,i)$, where $x$ is the point and $i$ the cluster index
$x_{j}^{\ast}$: reference point in cluster $j$ for rejection sampling
$S_{j}$: multiset of uniform samples in cluster $j$ returned by rejection sampling; note that $S\neq\bigcup_{j}S_{j}$
$W$: set of indices of clusters to recover
$K$: total number of recovered clusters at the end
$L$: total number of discovered clusters at the end

To facilitate discussion, we list in Table 1.1 the commonly used notations in our algorithms and analyses. A cluster $i$ is said to be “discovered” if any point in $X_{i}$ has been sampled, and “recovered” when the approximate centroid $\tilde{\mu}_{i}$ has been computed. As discussed in the preceding paragraph, it is possible that the total number of recovered clusters, $K$, is less than the total number of discovered clusters, $L$, and that $L$ is less than the true number of clusters.

Our Contributions.  We provide two clustering algorithms with theoretically proven bounds on the total number of oracle queries in the setting of no prior knowledge of the number of clusters and no predefined objective function. To the best of our knowledge, these are the first algorithms for the clustering problem of this kind. Both our algorithms output $(1+\epsilon)$-approximate centers for all recovered clusters. Our first algorithm makes $\tilde{O}(\epsilon^{-4}K^{2}L^{2})$ queries (Section 3) and the second algorithm makes $\tilde{O}(\epsilon^{-4}KL^{2})$ queries (Section 4). (In $\tilde{O}(\cdot),\tilde{\Omega}(\cdot),\tilde{\Theta}(\cdot)$, the tilde hides logarithmic factors. The exact query complexities can be found in Theorems 6 and 5.)

We also conduct an extensive set of experiments demonstrating the effectiveness of our algorithms on both synthetic and real-world data; see Section 6.

We further extend our algorithms to the case of a noisy same-cluster oracle which errs with a constant probability $p<1/2$. This extension is deferred to Appendix D of the full version of this paper.

We remark that since our algorithms target a sublinear (i.e., $o(|X|)$) number of oracle queries, in the general case where the shape of clusters can be arbitrary, it is impossible to classify all points in the dataset correctly. But the approximate centers output by the algorithms can be used to efficiently and correctly classify any newly inserted points (i.e., database queries) as well as existing database points (when needed), using a natural heuristic (Heuristic 1) that we shall introduce in Section 6. Our experiments show that most points can be classified using only one additional oracle query.

Related Work.  As mentioned, the semi-supervised active clustering framework was first introduced in (Ashtiani et al., 2016), where the authors considered the $K$-means clustering under the $\gamma$-margin assumption. Ailon et al. (Ailon et al., 2018a, b) proposed approximation algorithms for $K$-means and correlation clustering that compute a $(1+\epsilon)$-approximation of the optimal solution with the help of same-cluster queries. Chien et al. (Chien et al., 2018) studied the $K$-means clustering under the same setting as that in (Ailon et al., 2018b), but used uniform sampling instead of $D^{2}$-sampling and worked under the assumption that no cluster (in the true clustering) has a small size. Saha and Subramanian (Saha and Subramanian, 2019) gave algorithms for correlation clustering with the same-cluster query complexity bounded by the optimal cost of the clustering. Gamlath et al. (Gamlath et al., 2018) extended the $K$-means problem from Euclidean spaces to general finite metric spaces with a strengthened guarantee that the recovered clusters largely overlap with the true clusters. All these works, however, assume that the ground-truth clustering (known by the oracle) is consistent with the target objective function, which is unrealistic in most real-world applications.

Bressan et al. (Bressan et al., 2020) considered the case where the clusters are separated by ellipsoids, in contrast to the balls suggested by the usual $K$-means clustering objective. However, their algorithm still requires the knowledge of $K$ and has a query complexity that depends on $n$ (although only logarithmically), which we avoid in this work.

Mazumdar and Saha studied clustering $n$ points into $K$ clusters with a noisy oracle (Mazumdar and Saha, 2017a) or side information (Mazumdar and Saha, 2017b). The noisy oracle gives incorrect answers to same-cluster queries with probability $p<1/2$ (so that majority voting still works). The side information is a similarity score between each pair of data points, generated from a distribution $f_{+}$ if the pair belongs to the same cluster and from another distribution $f_{-}$ otherwise. The algorithms proposed in these papers are guaranteed to recover the true clusters of size at least $\Omega(\log n)$; however, their query complexities are at least $\Omega(Kn)$, much larger than what we aim to achieve in this paper.

Huleihel et al. (Huleihel et al., 2019) studied overlapping clustering with the aid of a same-cluster oracle. Suppose that $A$ is an $n\times K$ clustering matrix whose $i$-th row is the indicator vector of the cluster membership of the $i$-th element. The task is to recover $A$ from the similarity matrix $AA^{T}$ using a small number of oracle queries.

Finally, we note that same-cluster queries have been used extensively for entity resolution (or de-duplication) (Wang et al., 2012, 2013; Vesdapunt et al., 2014; Verroios and Garcia-Molina, 2015; Firmani et al., 2016).

2. Preliminaries

In this paper we consider point sets in the canonical Euclidean space $(\mathbb{R}^{d},\|\cdot\|)$. The geometric centroid, or simply centroid, of a finite point set $X$ is defined as $\mu(X)=\frac{1}{|X|}\sum_{x\in X}x$. It is known that $\mu(X)$ is the minimizer of the $1$-center problem $\min_{c}\sum_{x\in X}\|x-c\|^{2}$. The next lemma provides a guarantee on approximating the centroid of a cluster using uniform samples.

Lemma 1 ((Inaba et al., 1994)).

Let $S$ be a set of points obtained by independently sampling $M$ points uniformly at random with replacement from a point set $X\subseteq\mathbb{R}^{d}$. Then for any $\delta>0$, it holds that

$\Pr\left\{\Phi(S,\mu(S))\leq\left(1+\frac{1}{\delta M}\right)\Phi(X,\mu(X))\right\}\geq 1-\delta.$

We define the $D^{2}$-sampling of a point set $X$ with respect to a point set $C$ as follows.

Definition 2 ($D^{2}$-sampling).

The $D^{2}$-sampling of a point set $X$ with respect to a point set $C$ returns a random point $p\in X$ distributed according to $\Pr\{p=x\}=\Phi(\{x\},C)/\Phi(X,C)$ for all $x\in X$.
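
A minimal sketch of $D^{2}$-sampling, reusing the squared-distance computation from the `cost` sketch above (the uniform fallback for $C=\emptyset$, as in k-means++ seeding, is our own convention):

```python
def d2_sample(rng, X, C):
    """Return one point of X drawn with probability Phi({x}, C) / Phi(X, C) (Definition 2).
    rng: numpy random Generator; X: (n, d) array; C: (m, d) array of current centers."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    C = np.atleast_2d(np.asarray(C, dtype=float))
    if C.size == 0:
        return X[rng.integers(X.shape[0])]             # no centers yet: sample uniformly
    w = np.min(np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=-1), axis=1)
    if w.sum() == 0:                                   # degenerate case: every point is a center
        return X[rng.integers(X.shape[0])]
    return X[rng.choice(X.shape[0], p=w / w.sum())]
```

For example, `d2_sample(np.random.default_rng(0), X, C)` draws one such point.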

In this paper we use a same-cluster oracle: given two data points $x,y\in X$, $\textsc{Oracle}(x,y)$ returns true if $x$ and $y$ belong to the same cluster and false otherwise. For simplicity of the algorithm description, we shall instead invoke the function $\textsc{Classify}(x)$ to obtain the cluster index of $x$, which can be easily implemented using the same-cluster oracle, as shown in Algorithm 2.1. This implies that the number of oracle queries is at most $L$ times the number of samples.

Algorithm 2.1 $\textsc{Classify}(x)$. The overall algorithm maintains the number $L$ of discovered clusters and a representative point $z_{i}$ for each $i=1,\dots,L$.
for $i=1$ to $L$ do
     if $\textsc{Oracle}(x,z_{i})$ then
         return $i$
$L\leftarrow L+1$
$z_{L}\leftarrow x$
return $L$
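
A Python rendering of Algorithm 2.1 might look as follows (the class name and the oracle interface are our own illustrative choices; the paper's experiments use a C++ implementation):

```python
class ClusterLabeler:
    """Counterpart of Algorithm 2.1 (Classify).  `oracle(x, y)` is any callable answering
    same-cluster queries; the class keeps one representative point z_i per discovered
    cluster and returns 1-based cluster indices."""

    def __init__(self, oracle):
        self.oracle = oracle
        self.reps = []          # z_1, ..., z_L
        self.num_queries = 0    # total same-cluster queries issued

    def classify(self, x):
        for i, z in enumerate(self.reps, start=1):
            self.num_queries += 1
            if self.oracle(x, z):
                return i
        self.reps.append(x)     # new cluster discovered: L <- L + 1, z_L <- x
        return len(self.reps)
```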

3. Basic Algorithm

Algorithm.

Although the algorithm of (Ailon et al., 2018b) cannot be used directly because $K$ is unknown to us, we briefly review it below. The algorithm gradually “discovers” and then “recovers” more and more clusters by taking $D^{2}$-samples. It recovers $K$ clusters in $K$ rounds, one in each round. To recover one cluster, it takes sufficiently many $D^{2}$-samples w.r.t. the approximate centers $\{\tilde{\mu}_{i}\}_{i\in I}$, where $I$ is the set of recovered clusters, and finds the largest unrecovered cluster $j$. For this cluster $j$, it obtains sufficiently many uniform samples via rejection sampling and invokes Lemma 1 to compute an approximate centroid $\tilde{\mu}_{j}$. Then it includes $j$ in the set $I$ of recovered clusters and proceeds to the next round. It is shown that $\operatorname{poly}(K/\epsilon)$ samples per round achieve a failure probability of $O(1/K)$, and taking a union bound over all $K$ rounds yields a constant overall failure probability.

There are two major difficulties in adapting this algorithm to our setting where $K$ is unknown.

  1. The number of rounds is unknown. It is not clear when our algorithm should terminate, i.e., when we are confident that there are no more irreducible clusters. The failure probability of each round also needs to be redesigned and cannot be $O(1/K)$.

  2. Since $K$ is unknown, we cannot predetermine the number of samples to use and have to maintain various stopping criteria dynamically. For instance, it is subtle to determine which cluster is the largest. If we stop too early, we may identify a cluster that is actually small and will need a large number of samples in order to obtain enough uniform samples from this cluster when doing rejection sampling; if we stop too late, we can make a more accurate decision but may have already taken too many samples.

To address the first issue, we observe that $\Omega(\epsilon^{-1}\log K)$ samples are sufficient to test, with constant probability, whether any $\epsilon$-irreducible clusters are left. We also assign the failure probability of the $r$-th round to be $a_{r}$ such that $\sum_{r=1}^{\infty}a_{r}$ is a small constant.
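
For instance, one illustrative choice (our own, not the constants used in Algorithm 3.1) is

```latex
a_r \;=\; \frac{1}{40\,r^{2}},
\qquad
\sum_{r=1}^{\infty} a_r \;=\; \frac{1}{40}\cdot\frac{\pi^{2}}{6} \;\approx\; 0.041,
```

so that a union bound over all rounds, however many there turn out to be, costs only a small constant in the overall success probability.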

To address the second issue, observe that the $\epsilon$-irreducibility condition implies that all unrecovered clusters together have a $\Phi$-value of $\Omega(\epsilon)$ as a fraction of the total cost, and thus the largest unrecovered cluster has a $\Phi$-value of $\Omega(\epsilon/q)$, where $q$ is the number of newly discovered clusters. We can show that $\tilde{O}(q/\epsilon)$ samples suffice to identify a cluster with $\Phi$-value at least $\Omega(\epsilon/q)$, and thus, since the value of $q$ increases as the number of samples grows, we keep sampling until $\tilde{O}(q/\epsilon)$ samples have been obtained. Note that this is a dynamic criterion, in contrast to the predetermined one in (Ailon et al., 2018b). We then choose the largest cluster $j$, carry out the rejection sampling, and estimate the approximate centroid $\tilde{\mu}_{j}$ with careful control over the number of samples.

Algorithm 3.1 The Basic Algorithm.
1: $I\leftarrow\emptyset$
2: $k\leftarrow 0$, $r\leftarrow 0$
3: repeat
4:     $r\leftarrow r+1$
5:     $Q\leftarrow\emptyset$, $S\leftarrow\emptyset$
6:     $T_{1}\leftarrow 8\epsilon^{-1}\ln(10(k+1))$   ▷ Phase 1
7:     while $Q=\emptyset$ and $|S|\leq T_{1}$ do
8:         $(x,Q,S)\leftarrow\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$
9:     end while
10:    if $Q\neq\emptyset$ then
11:        while $|S|\leq 96|Q|\ln(10(k+|Q|))/\epsilon$ do   ▷ Phase 2
12:            $(x,Q,S)\leftarrow\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$
13:        end while
14:        $j\leftarrow\operatorname*{arg\,max}_{i}|\{u\in S:u=(x,i)\}|$
15:        $q\leftarrow|Q|$
16:        $T_{2}\leftarrow 2^{12}q\ln(10(k+q))/\epsilon^{2}$   ▷ Phase 3
17:        while $|S|\leq T_{2}$ do
18:            $(x,Q,S)\leftarrow\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$
19:        end while
20:        $x_{j}^{\ast}\leftarrow\operatorname*{arg\,min}_{x:(x,j)\in S}\Phi(\{x\},\{\tilde{\mu}_{i}\}_{i\in I})$
21:        $S_{j}\leftarrow\emptyset$   ▷ Phase 4
22:        $T_{3}\leftarrow 20\epsilon^{-1}r\ln^{2}(10r)$
23:        $S_{j}\leftarrow\textsc{RejSamp}(I,\{\tilde{\mu}_{i}\}_{i\in I},T_{3},\{j\},\{x_{j}^{\ast}\})$
24:        $\tilde{\mu}_{j}\leftarrow(1/|S_{j}|)\sum_{x\in S_{j}}x$
25:        $k\leftarrow k+1$
26:        $I\leftarrow I\cup\{j\}$
27:    end if
28: until $Q=\emptyset$   ▷ no new cluster is discovered
Algorithm 3.2 $\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$: returning a single $D^{2}$-sample w.r.t. the recovered clusters $I$
1: $x\leftarrow$ a point returned by $D^{2}$-sampling (w.r.t. $\{\tilde{\mu}_{i}\}_{i\in I}$)
2: $j\leftarrow\textsc{Classify}(x)$
3: if $j\not\in I$ then
4:     $Q\leftarrow Q\cup\{j\}$
5: $S\leftarrow S\cup\{(x,j)\}$
6: return $(x,Q,S)$
Algorithm 3.3 $\textsc{RejSamp}(I,\{\tilde{\mu}_{i}\}_{i\in I},T,W,\{x_{j}^{\ast}\}_{j\in W})$: rejection sampling on heavy clusters
1: repeat
2:     $x\leftarrow$ a point returned by $D^{2}$-sampling w.r.t. $\{\tilde{\mu}_{i}\}_{i\in I}$
3:     $j\leftarrow\textsc{Classify}(x)$
4:     if $j\in W$ then
5:         add $x$ to $S_{j}$ with probability $\frac{\epsilon}{128}\cdot\frac{\Phi(\{x_{j}^{\ast}\},\{\tilde{\mu}_{i}\}_{i\in I})}{\Phi(\{x\},\{\tilde{\mu}_{i}\}_{i\in I})}$
6: until $|S_{j}|\geq T$ for all $j\in W$
7: return $\{S_{j}\}_{j\in W}$
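
A compact Python sketch of Algorithm 3.3, reusing the `cost`, `d2_sample`, and `ClusterLabeler` sketches above (the function name and interface are ours; the acceptance probability mirrors line 5 of the pseudocode):

```python
def rej_samp(rng, X, centers, labeler, T, W, ref_points, eps):
    """Sketch of RejSamp: D^2-sample repeatedly and accept a point x of a target cluster
    j in W with probability (eps/128) * Phi({x_j^*}, centers) / Phi({x}, centers), so that
    accepted samples are (approximately) uniform within cluster j.

    ref_points: dict j -> reference point x_j^*.  Returns dict j -> list of accepted samples."""
    S = {j: [] for j in W}
    while any(len(S[j]) < T for j in W):
        x = d2_sample(rng, X, centers)
        j = labeler.classify(x)
        if j in W:
            accept = (eps / 128) * cost(ref_points[j], centers) / cost(x, centers)
            if rng.random() < min(1.0, accept):
                S[j].append(x)
    return S
```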

We present our algorithm in Algorithm 3.1, without attempting to optimize the constants. In our algorithm, each round is divided into four phases. Phase 1 tests whether there are new clusters to recover, using $O(\epsilon^{-1}\ln k)$ samples, where $k$ is the number of recovered clusters. If no new clusters are found, the algorithm terminates. Phase 2 samples more points and finds the largest cluster discovered so far, which we shall aim to recover in the two remaining phases. Phase 3 samples more points and finds a reference point $x_{j}^{\ast}$ to be used in the rejection sampling of the next phase. Phase 4 executes rejection sampling using the reference point $x_{j}^{\ast}$ so that the returned samples are uniform within cluster $j$. We need to collect $\tilde{O}(k/\epsilon)$ uniform samples before calculating the approximate centroid $\tilde{\mu}_{j}$ of cluster $j$. We then declare that cluster $j$ is recovered and start the next round.

We would like to remark that the $D^{2}$-sampling which our algorithm uses has a great advantage over uniform sampling (Chien et al., 2018) when there are small and faraway clusters. Uniform sampling would require a large number of samples to find small clusters, regardless of their locations. On the other hand, faraway clusters are clearly irreducible and are rightfully expected to be recognized. $D^{2}$-sampling captures the distance information, so such small faraway clusters may have a large $\Phi$-value and can be found with far fewer samples.

Analysis.

Throughout the execution of the algorithm, we keep the loop invariant that

(3.1) $\Phi(X_{i},\tilde{\mu}_{i})\leq(1+\epsilon)\Phi(X_{i},\mu_{i})$

for all $i\in I$, i.e., the approximate centers are good for all recovered clusters.

It is clear that the main algorithm will terminate. We prove that the algorithm is correct, i.e., the algorithm finds all the clusters we aim for, with the $\tilde{\mu}_{i}$'s being good approximate centroids. We first need a lemma showing that the undiscovered clusters are heavy whenever the $\epsilon$-reducibility condition does not yet hold. All proofs are postponed to Appendix B.

Lemma 1.

Suppose the $\epsilon$-reducibility condition w.r.t. $I$ does not hold. Then for $\epsilon\in(0,1]$ it holds that $\sum_{i\not\in I}\Phi(X_{i},\tilde{C})/\sum_{i}\Phi(X_{i},\tilde{C})\geq\epsilon/4$.

Using the preceding lemma, we are ready to show that Phase 1 of Algorithm 3.1 discovers all heavy clusters with high probability.

Lemma 2.

Upon the termination of Algorithm 3.1, with probability at least $0.95$, the $\epsilon$-reducibility condition w.r.t. $I$ is satisfied and an approximate centroid $\tilde{\mu}_{i}$ satisfying (3.1) has been obtained for every $i\in I$.

Next we upper bound the number of samples. For each cluster $j\not\in I$, define its conditional sample probability

(3.2) $p_{j}=\Phi(X_{j},C)/\sum_{i\notin I}\Phi(X_{i},C),$

where $C=\{\tilde{\mu}_{i}\}_{i\in I}$ is the set of approximate centers obtained so far. For each $p_{j}$, we also define an empirical approximation $\hat{p}_{j}=s_{j}/\sum_{i\notin I}s_{i}$, where $s_{i}$ denotes the number of samples seen so far that belong to cluster $i$.

The next lemma shows that the new clusters we find in Phase 2 are heavy.

Lemma 3.

With probability at least $1-2/(100(k+q)^{2})$, it holds that $p_{j}\geq 1/(3q)$ for all $j\in W$.

To analyse Phases 3 and 4, we define an auxiliary point set

(3.3) $Y_{j}=\left\{y\in X_{j}:\Phi(\{y\},C)/\Phi(X_{j},C)\leq 2/|X_{j}|\right\}.$

Points in $Y_{j}$ are good pivot points for the rejection sampling procedure in Phase 4. We first show that we can find a good pivot point $x_{j}^{\ast}$ in Phase 3.

Lemma 4.

Assume that the event in Lemma 3 occurs. Then $x_{j}^{\ast}\in Y_{j}$ with probability at least $1-1/(100(k+q)^{2})$.

We then show that, with a good pivot point $x_{j}^{\ast}$, the rejection sampling procedure in Phase 4 obtains sufficiently many uniform samples from cluster $j$ so that we can later calculate a good approximate center.

Lemma 5.

Assume that the event in Lemma 4 occurs. By sampling $s=2^{23}\epsilon^{-4}qr\ln^{2}(10r)$ points in total, with probability at least $1-\exp(-8r)$, every cluster $j\in W$ has $T_{3}=20\epsilon^{-1}r\ln^{2}(10r)$ samples returned by the rejection sampling procedure.

Combining Lemmata 2, 3, 4 and 5, we arrive at the main conclusion of the basic algorithm.

Theorem 6.

With probability at least $0.9$, Algorithm 3.1 finds a set $I$ of clusters that satisfies Definition 2, and obtains for every $i\in I$ an approximate centroid $\tilde{\mu}_{i}$ that satisfies (3.1), using $O(\epsilon^{-4}K^{2}L^{2}\log^{2}L)$ same-cluster queries in total, where $K$ is the total number of recovered clusters and $L$ is the total number of discovered clusters.

Remark 1.

When the clusters are $\epsilon$-irreducible with respect to any subcollection of the clusters, we recover all $L$ clusters using $O(\epsilon^{-4}L^{4}\log^{2}L)$ same-cluster queries.

4. Improved Algorithm

In this section, we improve the basic algorithm by allowing the recovery of multiple clusters in each round. A particular case in which the basic algorithm suffers from a prodigal waste of samples is when there are in total $K$ clusters of approximately the same size and they are $\epsilon$-irreducible with respect to any subset $I\subset[K]$. In such a case, to recover a single cluster, our basic algorithm requires a sample of size $\tilde{O}(K/\epsilon)$ from that cluster, which in turn requires a global sample of size $\tilde{O}(K^{2}/\epsilon)$. Summing over $K$ rounds leads to a total of $O(K^{3}/\epsilon)$ samples. However, with $\tilde{O}(K^{2}/\epsilon)$ global samples, we can obtain $\tilde{O}(K/\epsilon)$ samples from every cluster and recover all clusters in the same round. This saves a factor of $K$ in the total number of samples. The goal of this section is to improve the sample/query complexity of the basic algorithm by a factor of $K$ even in the worst case.

To recover an indefinite number of clusters in each round, it is natural to generalize the basic algorithm to identify a set $W$ of clusters, instead of only the biggest one, from which we shall obtain uniform samples via rejection sampling and calculate the approximate centroids $\{\tilde{\mu}_{j}\}_{j\in W}$. Here we face an even greater challenge as to how to determine $W$. The naïve idea of including all clusters of $\Phi$-value at least $\Omega(\epsilon/r)$ runs into difficulty: as we sample more points, the value of $r$ may increase and we may need to include more and more points, so when do we stop? We may expect that the value of $r$ will stabilize after sufficiently many samples, but it is difficult to quantify “stable” and control the number of samples needed.

Our solution to overcome this difficulty is to split the clusters into bands. To explain this we introduce a few more notations. For each unrecovered cluster $j\notin I$, recall that we defined in (3.2) its conditional sample probability $p_{j}=\Phi(X_{j},C)/\sum_{i\notin I}\Phi(X_{i},C)$; we also define its empirical approximation $\hat{p}_{j}=s_{j}/\sum_{i\notin I}s_{i}$, where $s_{i}$ denotes the number of samples seen so far that belong to cluster $i$. We split $\{\hat{p}_{i}\}_{i\notin I}$ into $L+1$ bands for $L=3\log q$, where the $\ell$-th band is defined to be

$B_{\ell}=\{i\not\in I:2^{-\ell}<\hat{p}_{i}\leq 2^{-\ell+1}\}$

for $\ell\leq L$, and the last band $B_{L+1}$ consists of all the remaining clusters $i$ (i.e., those with $\hat{p}_{i}\leq 1/q^{3}$). We say a band $B_{\ell}$ is heavy if $\sum_{i\in B_{\ell}}\hat{p}_{i}\geq 1/(3L)$. The values $\hat{p}_{i}$ and the bands are dynamically adjusted as the number of samples grows.

Instead of focusing on individual clusters and identifying the heavy ones, we identify the heavy bands at an appropriate moment and recover all the clusters in those heavy bands. Intuitively, it takes far fewer samples for bands to stabilize than for individual clusters. For instance, consider a heavy band which consists of many small clusters. In this case, we will have seen most of the clusters in the band without too many samples, although many of the $\hat{p}_{i}$'s may be far off from the $p_{i}$'s. The important observation here is that we have seen those clusters; in contrast, if we considered individual clusters, we would have missed many of them and would focus on recovering only a few, which would cause a colossal waste of samples in the rejection sampling stage, for a reason similar to the one explained at the beginning of this section. Such a banding approach therefore enables us to control the number of samples needed.
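
The band-splitting step can be sketched as follows (the rounding of $L$ and the convention that the residual band $B_{L+1}$ is never selected are our own assumptions):

```python
import math

def heavy_band_clusters(p_hat):
    """Split newly discovered clusters into bands by their empirical probabilities and
    return W, the union of all heavy bands.

    p_hat: dict mapping cluster index i -> empirical conditional probability \\hat p_i.
    Band ell (1 <= ell <= L, L = 3 log2 q) holds the i with 2^{-ell} < \\hat p_i <= 2^{-ell+1};
    the residual band holds the i with \\hat p_i <= 1/q^3.  A band is heavy when its
    probabilities sum to at least 1/(3L)."""
    q = max(len(p_hat), 2)
    L = max(1, math.ceil(3 * math.log2(q)))
    bands = {ell: [] for ell in range(1, L + 2)}
    for i, p in p_hat.items():
        if p <= 0 or p <= q ** -3:
            bands[L + 1].append(i)                           # residual band B_{L+1}
        else:
            ell = min(L, math.floor(-math.log2(p)) + 1)      # 2^{-ell} < p <= 2^{-ell+1}
            bands[ell].append(i)
    W = []
    for ell in range(1, L + 1):                              # residual band never selected here
        if sum(p_hat[i] for i in bands[ell]) >= 1 / (3 * L):
            W.extend(bands[ell])
    return W
```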

Algorithm 4.1 The Improved Algorithm
1: $I\leftarrow\emptyset$
2: $k\leftarrow 0$, $r\leftarrow 0$, $K\leftarrow 1/2$
3: repeat
4:     $K\leftarrow 2K$
5:     repeat
6:         $r\leftarrow r+1$
7:         $Q\leftarrow\emptyset$, $S\leftarrow\emptyset$
8:         $T_{1}\leftarrow 8\epsilon^{-1}\ln(10(k+1))$   ▷ Phase 1
9:         while $Q=\emptyset$ and $|S|\leq T_{1}$ do
10:            $(x,Q,S)\leftarrow\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$
11:        end while
12:        if $Q\neq\emptyset$ and $k<K$ then
13:            repeat   ▷ Phase 2
14:                $(x,Q,S)\leftarrow\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$
15:                for each $i\in Q$ do
16:                    $\hat{p}_{i}\leftarrow|\{u\in S:u=(x,i)\}|/|S|$
17:                end for
18:                split the clusters into bands $\{B_{\ell}\}$
19:                $W\leftarrow$ all clusters in heavy bands
20:            until $|S|\geq 1600|W|\log|Q|\ln(10(k+|Q|))/\epsilon$
21:            $q\leftarrow|Q|$
22:            $T_{2}\leftarrow 2^{17}|W|\log q\ln(10(k+q))/\epsilon^{2}$   ▷ Phase 3
23:            while $|S|\leq T_{2}$ do
24:                $(x,Q,S)\leftarrow\textsc{Sample}(I,\{\tilde{\mu}_{i}\}_{i\in I},Q,S)$
25:            end while
26:            for each $j\in W$ do
27:                $x_{j}^{\ast}\leftarrow\operatorname*{arg\,min}_{x:(x,j)\in S}\Phi(\{x\},\{\tilde{\mu}_{i}\}_{i\in I})$
28:                $S_{j}\leftarrow\emptyset$
29:            end for
30:            $T_{3}\leftarrow 30K/\epsilon$   ▷ Phase 4
31:            $\{S_{j}\}_{j\in W}\leftarrow\textsc{RejSamp}(I,\{\tilde{\mu}_{i}\}_{i\in I},T_{3},W,\{x_{j}^{\ast}\}_{j\in W})$
32:            for each $j\in W$ do
33:                $\tilde{\mu}_{j}\leftarrow(1/|S_{j}|)\sum_{x\in S_{j}}x$
34:            end for
35:            $k\leftarrow k+|W|$
36:            $I\leftarrow I\cup W$
37:        end if
38:    until $Q=\emptyset$ or $k\geq K$
39: until $Q=\emptyset$ and $k\leq K$

Our improved algorithm is presented in Algorithm 4.1. Each round remains divided into four phases. Phase 1 still tests for new clusters. Starting from Phase 2 comes the change: instead of recovering only the largest sampled cluster $j$, we identify a subset $W$ of indices of the newly discovered clusters and recover all clusters in $W$. Phase 2 samples more points and identifies this subset $W$ by splitting the discovered clusters into bands and choosing all clusters in the heavy bands. Phase 3 samples more points and finds a reference point $x_{j}^{\ast}$ for each $j\in W$. Phase 4 keeps sampling points until each cluster $j\in W$ sees $\Theta(\epsilon^{-1}|W|r)$ points and then calculates the approximate center $\tilde{\mu}_{j}$ for each $j\in W$. The clusters in $W$ are then added to $I$ as recovered clusters before the algorithm starts a new round.

A technical subtlety lies in controlling the failure probability in our new algorithm. In the basic algorithm, the failure probability in the $r$-th round is $a_{r}=1/(r\operatorname{poly}(\log r))$. If we recover $w_{r}$ clusters in the $r$-th round, the failure probability for each of the $w_{r}$ clusters needs to be $a_{r}/w_{r}$, and as a consequence, $\tilde{O}(w_{r}^{2}/a_{r})$ samples are needed in the $r$-th round. In the worst case this becomes $\tilde{O}(K^{3})$ samples, the same as in the basic algorithm, when both $w_{r}=\Theta(K)$ and $a_{r}=\tilde{\Theta}(1/K)$. To resolve this, we guess $K=1,2,\dots$ in powers of $2$ and assign $a_{r}$ to be $w_{r}/K$. For each fixed guess $K$, the number of samples is bounded by $\tilde{O}(K^{2})$; iterating over the guesses incurs only an additional $\log K$ factor.

The analysis of our improved algorithm follows the same outline as that of the basic algorithm, and we only present the changes below. All proofs are postponed to Appendix C. The next lemma offers a guarantee similar to Lemma 2.

Lemma 1.

Upon the termination of Algorithm 4.1, with probability at least $0.95$, the $\epsilon$-reducibility condition w.r.t. $I$ is satisfied and an approximate centroid $\tilde{\mu}_{i}$ satisfying (3.1) has been obtained for every $i\in I$.

Next we upper bound the number of samples with a lemma analogous to Lemma 3.

Lemma 2.

With probability at least $1-1/(50(k+q)^{2})$, it holds that $p_{j}\geq 1/(70|B_{\ell(j)}|\log q)$ for all $j\in W$.

Recall that $Y_{j}$ was defined in (3.3) for all $j\in W$. The following two lemmata are the analogues of Lemmata 4 and 5, respectively.

Lemma 3.

Assume that the event in Lemma 2 holds. With probability at least $1-1/(25(k+q)^{2})$, it holds that $x_{j}^{\ast}\in Y_{j}$ for all $j\in W$.

Lemma 4.

Assume that the event in Lemma 3 holds. By sampling $s=2^{28}\epsilon^{-4}|W|K\log q$ points in total, with probability at least $1-|W|\exp(-8K)$, every cluster $j\in W$ has $T_{3}=30K/\epsilon$ samples returned by the rejection sampling procedure.

Now we are ready to prove the main theorem of the improved algorithm.

Theorem 5.

With probability at least $0.9$, Algorithm 4.1 (the improved algorithm) finds all the clusters and obtains for every $i$ an approximate centroid $\tilde{\mu}_{i}$ that satisfies (3.1), using $O(\epsilon^{-4}KL^{2}\log^{3}K\log L)$ same-cluster queries in total, where $K$ is the number of recovered clusters and $L$ is the total number of discovered clusters.

Remark 2.

The preceding theorem improves the query complexity of the basic algorithm by about a factor of $K$, as desired. In particular, when the clusters are $\epsilon$-irreducible with respect to any subcollection of the clusters, we recover all $L$ clusters using $O(\epsilon^{-4}L^{3}\log^{3}L)$ same-cluster queries, better than the basic algorithm by about a factor of $L$.

5. Noisy Oracles

Our algorithm can be extended to the case where the same-cluster oracle errs with a constant probability $p<1/2$. The full details are postponed to Appendix D in the full version.

Theorem 1.

Suppose that the same-cluster oracle errs with a constant probability $p<1/2$. With probability at least $0.6$, there is an algorithm that finds all the clusters and obtains for every $i$ an approximate centroid $\tilde{\mu}_{i}$ that satisfies (3.1), using $\tilde{O}(\epsilon^{-6}L^{6}K)$ same-cluster queries in total.

6. Experiments

In this section, we conduct experimental studies of our algorithms. As mentioned before, all previous algorithms need to know $K$, and it is impossible to convert them to the case of unknown $K$ (with the only exception of (Chien et al., 2018)). In particular, we note the inapplicability of the $k$-means algorithm and of the algorithm in (Ailon et al., 2018b).

  • The $k$-means algorithm is designed for an unsupervised learning task and always returns spherical clusters, so it is ill-suited to clusters of arbitrary shape, and its accuracy would have to be examined in terms of misclassified points. Our problem is semi-supervised with a same-cluster oracle, which can be used to recover clusters of arbitrary shapes and guarantees no misclassification. The assessment of an algorithm is therefore the number of discovered clusters and the number of samples. Owing to the very different nature of the two problems, the $k$-means algorithm should not be used for our problem or compared with algorithms designed specifically for it.

  • The algorithm in (Ailon et al., 2018b) runs in $k$ rounds and takes $2^{12}k^{3}/\epsilon^{2}$ samples in the first step of each round. With $k=10$ and $\epsilon=0.1$, it needs to sample at least $4\times 10^{8}$ points in each round, usually more than the size of a dataset. (Our algorithm can be easily simplified to handle such cases; see the subsection titled “Algorithms” below.)

The only exception, the algorithm in (Chien et al., 2018), is simply uniform sampling, for which whether or not $K$ is known is not critical. This uniform sampling algorithm will be referred to as Uniform below.

Datasets.   We test our algorithms using both synthetic and real world datasets.

For the synthetic datasets, we generate $n$ points that belong to $K$ clusters in the $d$-dimensional Euclidean space. Since in the real world the cluster sizes typically follow a power-law distribution, we assign to each cluster a random size drawn from the widely used Zipf distribution, which has density function $f(x)\propto x^{-\alpha}$, where $x$ is the size of the cluster and $\alpha$ a parameter controlling the skewness of the dataset. We then choose a center $\mu_{i}$ for each cluster $i\in[K]$ in a manner to be specified later and generate the points in the cluster from the multivariate Gaussian distribution $N(\mu_{i},\sigma^{2}I_{d})$, where $I_{d}$ denotes the $d\times d$ identity matrix.

Now we specify how to choose the centers. In practice, clusters in a dataset are not always well separated; there can be clusters whose centers are close to each other. We thus use an additional parameter $p$ to characterize this phenomenon. In the default setting of $p=0$, the centers of all $K$ clusters are drawn uniformly at random from $[0,b]^{d}$. When $p>0$, we first partition the clusters into groups as follows: think of each cluster as a node; for each pair of clusters, with probability $p$ we add an edge between the two nodes; each connected component of the resulting graph forms a group. Next, for each group of clusters, we pick a random cluster and choose its center $\mu$ uniformly at random in $[0,b]^{d}$. For each of the remaining clusters in the group, we choose its center uniformly at random in the neighborhood of radius $\rho$ centered at $\mu$.

We use the following set of parameters as the default setting for our synthetic datasets: $n=10^{6}$, $K=100$, $\alpha=2.5$, $\sigma=0.3$, $b=5$, $d=10$ and $\rho=0.1$.
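
A sketch of this generator is given below. Two details are our own simplifications of the description above: cluster sizes use rank-based Zipf weights rather than i.i.d. Zipf draws, and the first cluster encountered in each group serves as its anchor.

```python
import numpy as np

def make_synthetic(n=10**6, K=100, alpha=2.5, sigma=0.3, b=5.0, d=10,
                   p=0.0, rho=0.1, seed=0):
    """Generate (points, labels) for the synthetic setting described above."""
    rng = np.random.default_rng(seed)
    # Zipf-style sizes: weights proportional to rank^{-alpha}, rescaled to roughly sum to n
    w = np.arange(1, K + 1, dtype=float) ** (-alpha)
    sizes = np.maximum(1, np.round(n * w / w.sum()).astype(int))
    # group clusters: each pair of clusters is joined by an edge with probability p
    parent = list(range(K))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(K):
        for j in range(i + 1, K):
            if rng.random() < p:
                parent[find(i)] = find(j)
    # one anchor center per group in [0, b]^d; other centers uniform in the radius-rho ball
    anchors, centers = {}, np.empty((K, d))
    for i in range(K):
        g = find(i)
        if g not in anchors:
            anchors[g] = rng.uniform(0.0, b, size=d)
            centers[i] = anchors[g]
        else:
            direction = rng.normal(size=d)
            direction /= np.linalg.norm(direction)
            centers[i] = anchors[g] + rho * rng.random() ** (1.0 / d) * direction
    points = np.vstack([rng.normal(centers[i], sigma, size=(sizes[i], d)) for i in range(K)])
    labels = np.repeat(np.arange(K), sizes)
    return points, labels
```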

We also use two real-world datasets, Shuttle and Kdd. Each feature in the real-world datasets is normalized to have zero mean and unit standard deviation.

Algorithms.   We compare the three algorithms listed below. A cluster is said to be heavy when the number of (uniform) samples obtained from this cluster exceeds a predetermined threshold, which is set to $10$ in our experiments.

  • Uniform: This is uniform sampling. That is, we keep getting random samples from the dataset one by one and identify their label by same-cluster queries. We recover a cluster when it becomes heavy.

  • Basic: This is based on our basic algorithm (Algorithm 3.1). We recover clusters one at a time, as each becomes heavy.

  • Batched: This is a simplified version of the improved algorithm (Algorithm 4.1). Instead of partitioning clusters into bands and then recovering all clusters in the heavy bands, in each round we simply keep sampling points until the fraction of points in heavy clusters exceeds a half, at which moment we try to recover all heavy clusters in a batch.

Recall that in our basic and improved algorithms, we always “ignore” the samples belonging to the unrecovered clusters in the previous rounds, which will not affect our theoretical analysis. But in practice it is reasonable to reuse these “old” samples in the succeeding rounds so that we are able to recover some clusters earlier. Because of such a sample-reuse procedure, the practical performance difference between the Basic and Batched algorithms becomes less significant. Recall that in theory, the main improvement of Batched over Basic is that we try to avoid wasting too many samples in each round by recovering possibly more than one cluster at a time. But still, as we shall see shortly, Batched outperforms Basic in all metrics, though sometimes the gaps may not seem significant.

Measurements.  To measure the same-cluster query complexity, we introduce two assessments. The first is fixed-budget, where given a fixed number of same-cluster queries, we compare the number of clusters recovered by different algorithms. The second is fixed-recovery, where each algorithm needs to recover a predetermined number of clusters, and we compare the numbers of same-cluster queries the algorithms use.

We also report the error of the approximate centroid of each recovered cluster. For a recovered cluster $X_{i}$, let $\mu_{i}$ be its centroid and $\hat{\mu}_{i}$ the approximate center. We define the centroid approximation error to be $\left(\Phi(X_{i},\hat{\mu}_{i})-\Phi(X_{i},\mu_{i})\right)/\Phi(X_{i},\mu_{i})$.
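
In code (reusing the `cost` sketch from Section 1; the function name is ours), the error of one recovered cluster is simply:

```python
def centroid_error(X_i, mu_hat):
    """Centroid approximation error (Phi(X_i, mu_hat) - Phi(X_i, mu_i)) / Phi(X_i, mu_i)."""
    mu_i = X_i.mean(axis=0)                 # the true centroid of the recovered cluster
    return (cost(X_i, mu_hat) - cost(X_i, mu_i)) / cost(X_i, mu_i)
```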

For Basic and Batched, we further compare their running time and round usage. We note that a small round usage is very useful if we want to efficiently parallelize the learning process.

Finally, we would like to mention one subtlety in measuring the number of same-cluster queries. All algorithms that we test are sampling based, and for each sampled point we need to identify its label by same-cluster queries, which, in the worst case, takes $k$ queries, where $k$ is the number of clusters discovered so far. In practice, however, a good query order/strategy can save us a lot of same-cluster queries. We apply the following query strategy in all tested algorithms.

Heuristic 1.

For each discovered cluster, we maintain its approximate center (using the samples we have obtained). When a new point is sampled, we query the discovered clusters in increasing order of their centers' distances to the new point, as follows. Suppose that the new point is $x$. We calculate the distances between $x$ and all approximate centers $\hat{\mu}_{i}$, and sort them in increasing order, say, $d(x,\hat{\mu}_{1})\leq d(x,\hat{\mu}_{2})\leq\cdots$. Then for $i=1,2,\dots$ sequentially, we check whether $x$ belongs to cluster $i$ by querying whether $x$ and some sample point from cluster $i$ belong to the same cluster. If $x$ does not belong to any discovered cluster, a new cluster is created.

We will also use this heuristic to test the effectiveness of the approximate centers for classifying new points. That is, for a newly inserted point $q$, we count the number of same-cluster queries needed to correctly classify $q$ using the approximate centers and Heuristic 1.
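
A sketch of Heuristic 1 (the function name and the `(centers, reps, oracle)` interface are our own):

```python
def classify_with_centers(x, centers, reps, oracle):
    """Heuristic 1: query discovered clusters in increasing order of the distance from x to
    their approximate centers; return (cluster index, number of same-cluster queries used).

    centers: list of approximate centers mu_hat_i; reps: one known sample point per
    discovered cluster; oracle(a, b): same-cluster query."""
    x = np.asarray(x, dtype=float)
    order = sorted(range(len(centers)),
                   key=lambda i: float(np.sum((x - np.asarray(centers[i])) ** 2)))
    for queries, i in enumerate(order, start=1):
        if oracle(x, reps[i]):
            return i, queries
    return len(centers), len(centers)        # x matches no discovered cluster: new cluster
```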

Computation Environments.  All algorithms are implemented in C++, and all experiments were conducted on a Dell PowerEdge T630 server with two Intel Xeon E5-2667 v4 3.2GHz CPUs (eight cores each) and 256GB memory.

6.1. Results

The three algorithms, Uniform, Basic and Batched, are compared on the same-cluster query complexity, the quality of the returned approximate centers, the running time and the number of rounds. All results are the average values of $100$ runs of the experiments.

Figure 1. Performance comparison on synthetic datasets (panels for $p=0$, $p=0.1$ and $p=0.3$; left column: fixed-budget, right column: fixed-recovery).
Figure 2. Performance comparison on the Shuttle dataset (left: fixed-budget, right: fixed-recovery). The curves for Basic and Batched almost overlap for fixed-recovery.
Figure 3. Performance comparison on the Kdd dataset (left: fixed-budget, right: fixed-recovery).

Query Complexity. For each of the synthetic and real-world datasets, we measure the query complexities under both fixed-budget and fixed-recovery settings.

The results for the synthetic dataset, the Shuttle dataset and the Kdd dataset are plotted in Figures 1, 2 and 3, respectively. In all figures, the left column corresponds to the fixed-budget case and the right column to the fixed-recovery case. We note that some points for Uniform are missing since they fall outside the plotted range. The exact values of all experiments can be found in Appendix E of the full version.

For all datasets, Basic and Batched significantly outperform Uniform. Take the Kdd dataset for example: with a query budget of $10^{5}$, Uniform recovers $10$ clusters while Basic and Batched recover around $18$ clusters. In order to recover $16$ clusters, Uniform requires $200{,}000$ queries per cluster, while Basic and Batched require no more than $2{,}500$ queries per cluster. Even on small datasets such as Shuttle, Uniform needs about twice the number of queries to recover all clusters compared with Basic and Batched. We also observe that Batched performs slightly better than Basic.

On the synthetic datasets, one can see that a higher collision probability $p$ makes clustering more difficult for all algorithms. Given a query budget of $2\times 10^{5}$, Basic and Batched can recover about $70$ clusters when $p=0$ (no collision), but only about $60$ clusters when $p=0.3$.

We also observe that Batched performs better than Basic on all datasets and in all settings, except the fixed-recovery setting on the Shuttle dataset, in which the two have almost the same performance.

Center Approximation.  We compare the approximation error of each algorithm to show the quality of the returned approximate centers. We run all three algorithms in the fixed-recovery setting because we can only measure the error of recovered clusters. We examine the median error among all recovered clusters and observe that the centroid approximation errors are indeed similar on all datasets when the same number of clusters is recovered. Detailed results are listed in Tables 6.2 to 6.4. We observe that all centroid approximation error rates are below 10%: the error rates are about 9% on the synthetic datasets and about 5% on the two real-world datasets. There is no significant difference between the error rates of different algorithms; the maximum difference is no more than 2%. These consistent results indicate that the approximate centers computed by all algorithms are of good quality.

#Clusters (Algorithms) 20 30 40 50 60 70
$p=0$    Basic   8.70% 8.70% 8.50% 8.41% 8.43% 8.37%
         Batched 8.74% 8.60% 8.56% 8.64% 8.43% 8.51%
         Uniform 9.90% 9.43% 9.58% 9.22% 9.24% 9.18%
$p=0.1$  Basic   8.86% 8.68% 8.72% 8.39% 8.51% 8.45%
         Batched 8.76% 8.64% 8.45% 8.49% 8.50% 8.42%
         Uniform 9.59% 9.64% 9.39% 9.42% 9.30% 9.18%
$p=0.3$  Basic   8.87% 8.68% 8.56% 8.40% 8.34% 8.38%
         Batched 8.95% 8.64% 8.57% 8.37% 8.35% 8.39%
         Uniform 9.75% 9.44% 9.45% 9.35% 9.25% 9.28%
Table 6.2. Error rates on synthetic datasets
#Clusters 2 3 4 5 6 7
Basic 9.72% 4.99% 6.28% 5.40% 5.85% 5.91%
Batched 8.94% 4.78% 6.33% 5.44% 5.89% 5.66%
Uniform 9.32% 4.92% 5.79% 5.67% 5.65% 5.50%
Table 6.3. Error rates on Shuttle dataset
#Clusters 6 8 10 12 14 16
Basic 3.70% 3.83% 4.87% 5.75% 5.95% 5.68%
Batched 3.57% 3.68% 4.49% 4.39% 5.04% 5.53%
Uniform 3.96% 5.21% 4.92% 4.83% 4.57% 5.43%
Table 6.4. Error rates on Kdd dataset

Time and Rounds.  We compare the running time and the number of rounds between Basic and Batched, with results presented in Figures 4 to 6. Here we report the running time of the sampling stage, showing that the reduced round usage of Batched also leads to a reduced sampling time; the time saved in the querying stage has already been discussed in the query complexity part. The only exception is the Shuttle dataset, where the performance of Basic and Batched is almost the same; this may be because the number of clusters in Shuttle is too small ($7$ in total) for the algorithms to make any difference. We remark that fewer rounds imply a shorter waiting time for synchronization in the parallel learning setting and a smaller communication cost in the distributed learning setting.

Figure 4. Running time and round usage on synthetic datasets (panels for $p=0$, $p=0.1$ and $p=0.3$).
Figure 5. Running time and round usage on the Shuttle dataset.
Figure 6. Running time and round usage on the Kdd dataset.

Classification Using Approximate Centers.  We now measure the quality of the approximate centers outputted by our algorithms for point classification using Heuristic 1. That is, after the algorithm terminates and outputs the set of approximate centers, we try to cluster new (unclassified) points using Heuristic 1 and count the number of same-cluster queries used for each new point. Since there is only a very small number of points in the dataset that have been sampled/classified during the run of Basic and Batched, we simply use the whole dataset as test points. This should not introduce much bias to our measurement.

Our results for the synthetic and the real-world datasets are plotted in Figures 7 and 8, respectively. We observe that for all three tested algorithms, the majority of the points in the dataset need only one same-cluster query for classification; in other words, we only need to find their nearest approximate centers to determine their cluster labels. Except for the Shuttle dataset, for which about 10-15% of the points need at least three same-cluster queries, in all other datasets the fraction of points which need at least three queries is negligible. Overall, the performance of the three tested algorithms is comparable.

Figure 7. Number of same-cluster queries on synthetic datasets (panels for $p=0$, $p=0.1$ and $p=0.3$).
Figure 8. Number of same-cluster queries on real datasets (Shuttle and Kdd).

Summary.  Briefly summarizing our experimental results, we observe that both Basic and Batched significantly outperform Uniform in terms of query complexities. All three algorithms have similar center approximation quality. Batched outperforms Basic by a large margin in both running time and round usage on the synthetic and Kdd datasets.

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments and helpful suggestions. Yan Song and Qin Zhang are supported in part by NSF IIS-1633215 and CCF-1844234.

References

  • Ailon et al. (2018a) Nir Ailon, Anup Bhattacharya, and Ragesh Jaiswal. 2018a. Approximate Correlation Clustering Using Same-Cluster Queries. In LATIN. 14–27.
  • Ailon et al. (2018b) Nir Ailon, Anup Bhattacharya, Ragesh Jaiswal, and Amit Kumar. 2018b. Approximate Clustering with Same-Cluster Queries. In ITCS. 40:1–40:21.
  • Arthur and Vassilvitskii (2007) David Arthur and Sergei Vassilvitskii. 2007. k-means++: the advantages of careful seeding. In SODA. 1027–1035.
  • Ashtiani et al. (2016) Hassan Ashtiani, Shrinu Kushagra, and Shai Ben-David. 2016. Clustering with Same-Cluster Queries. In NIPS. 3216–3224.
  • Bressan et al. (2020) Marco Bressan, Nicolò Cesa-Bianchi, Silvio Lattanzi, and Andrea Paudice. 2020. Exact Recovery of Mangled Clusters with Same-Cluster Queries. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.).
  • Chien et al. (2018) I (Eli) Chien, Chao Pan, and Olgica Milenkovic. 2018. Query K-means Clustering and the Double Dixie Cup Problem. In NeurIPS. 6650–6659.
  • Firmani et al. (2016) Donatella Firmani, Barna Saha, and Divesh Srivastava. 2016. Online Entity Resolution Using an Oracle. PVLDB 9, 5 (2016), 384–395.
  • Gamlath et al. (2018) Buddhima Gamlath, Sangxia Huang, and Ola Svensson. 2018. Semi-Supervised Algorithms for Approximately Optimal and Accurate Clustering. In ICALP. 57:1–57:14.
  • Huleihel et al. (2019) Wasim Huleihel, Arya Mazumdar, Muriel Médard, and Soumyabrata Pal. 2019. Same-Cluster Querying for Overlapping Clusters. In NeurIPS. 10485–10495.
  • Inaba et al. (1994) Mary Inaba, Naoki Katoh, and Hiroshi Imai. 1994. Applications of Weighted Voronoi Diagrams and Randomization to Variance-Based k-Clustering (Extended Abstract). In SOCG. 332–339.
  • Jaiswal et al. (2014) Ragesh Jaiswal, Amit Kumar, and Sandeep Sen. 2014. A Simple D2{D}^{2}-Sampling Based PTAS for kk-Means and Other Clustering Problems. Algorithmica 70, 1 (2014), 22–46.
  • Kumar et al. (2010) Amit Kumar, Yogish Sabharwal, and Sandeep Sen. 2010. Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57, 2 (2010), 5:1–5:32.
  • Mazumdar and Saha (2017a) Arya Mazumdar and Barna Saha. 2017a. Clustering with Noisy Queries. In NIPS. 5788–5799.
  • Mazumdar and Saha (2017b) Arya Mazumdar and Barna Saha. 2017b. Query Complexity of Clustering with Side Information. In NIPS. 4682–4693.
  • Saha and Subramanian (2019) Barna Saha and Sanjay Subramanian. 2019. Correlation Clustering with Same-Cluster Queries Bounded by Optimal Cost. In ESA. 81:1–81:17.
  • Verroios and Garcia-Molina (2015) Vasilis Verroios and Hector Garcia-Molina. 2015. Entity Resolution with crowd errors. In ICDE. 219–230.
  • Vesdapunt et al. (2014) Norases Vesdapunt, Kedar Bellare, and Nilesh N. Dalvi. 2014. Crowdsourcing Algorithms for Entity Resolution. PVLDB 7, 12 (2014), 1071–1082.
  • Wang et al. (2012) Jiannan Wang, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2012. CrowdER: Crowdsourcing Entity Resolution. PVLDB 5, 11 (2012), 1483–1494.
  • Wang et al. (2013) Jiannan Wang, Guoliang Li, Tim Kraska, Michael J. Franklin, and Jianhua Feng. 2013. Leveraging transitive relations for crowdsourced joins. In SIGMOD. 229–240.

Appendix A Supplementary Preliminaries

The following forms of Chernoff bounds will be used repeatedly in the analysis of our algorithms.

Lemma 1 (Additive Chernoff).

Let X1,,XNX_{1},\dots,X_{N} be i.i.d. Bernoulli random variables such that 𝔼Xi=p\operatorname{\mathbb{E}}X_{i}=p. Let p^=(i=1NXi)/N\hat{p}=(\sum_{i=1}^{N}X_{i})/N. Then Pr{p^>p+t}eNp(ep/t)Nt\Pr\left\{\hat{p}>p+t\right\}\leq e^{-Np}(ep/t)^{Nt}.

Lemma 2 (Multiplicative Chernoff).

Let X1,,XNX_{1},\dots,X_{N} be i.i.d. Bernoulli random variables such that 𝔼Xi=p\operatorname{\mathbb{E}}X_{i}=p. Let p^=(i=1NXi)/N\hat{p}=(\sum_{i=1}^{N}X_{i})/N. Then for δ(0,1)\delta\in(0,1), it holds that Pr{p^(1δ)p}exp(12δ2Np)\Pr\left\{\hat{p}\leq(1-\delta)p\right\}\leq\exp(-\frac{1}{2}\delta^{2}Np).
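As a quick numerical sanity check of the multiplicative form (a simulation sketch only, not part of any proof), the following Python snippet compares the empirical frequency of the deviation event in Lemma 2 with the stated bound, for arbitrarily chosen parameters.

import numpy as np

# Arbitrary parameters for the sanity check.
N, p, delta, trials = 200, 0.2, 0.5, 200_000
rng = np.random.default_rng(0)

# Each trial draws N i.i.d. Bernoulli(p) variables and forms the empirical mean.
p_hat = rng.binomial(N, p, size=trials) / N
empirical = np.mean(p_hat <= (1 - delta) * p)

bound = np.exp(-0.5 * delta**2 * N * p)  # multiplicative Chernoff bound
print(f"empirical = {empirical:.1e}, Chernoff bound = {bound:.1e}")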

Appendix B Omitted Proofs in Section 3

We need two lemmata from (Ailon et al., 2018b).

Lemma 1 ((Ailon et al., 2018b)).

It holds that

Φ({x},C)Φ(Xj,C)ϵ641|Xj|.\frac{\Phi(\{x\},C)}{\Phi(X_{j},C)}\geq\frac{\epsilon}{64}\cdot\frac{1}{\left|X_{j}\right|}.
Lemma 2 ((Ailon et al., 2018b)).

It holds that

Φ(Yj,C)Φ(Xj,C)ϵ128,\frac{\Phi(Y_{j},C)}{\Phi(X_{j},C)}\geq\frac{\epsilon}{128},

and

Φ({y},C)Φ({x},C)ϵ128.\frac{\Phi(\{y\},C)}{\Phi(\{x\},C)}\leq\frac{\epsilon}{128}.

B.1. Proof of Lemma 1

Proof.

We adapt the proof of Lemma 15 in (Ailon et al., 2018b) to our irreducibility assumption. Observe that

iΦ(Xi,C~)\displaystyle\sum_{i}\Phi(X_{i},\tilde{C}) =iIΦ(Xi,C~)+iIΦ(Xi,C~)\displaystyle=\sum_{i\in I}\Phi(X_{i},\tilde{C})+\sum_{i\not\in I}\Phi(X_{i},\tilde{C})
iIΦ(Xi,C~)+ϵ/41ϵ/4iIΦ(Xi,C~)\displaystyle\leq\sum_{i\in I}\Phi(X_{i},\tilde{C})+\frac{\epsilon/4}{1-\epsilon/4}\sum_{i\in I}\Phi(X_{i},\tilde{C})
(1+ϵ2)iIΦ(Xi,C~)\displaystyle\leq\left(1+\frac{\epsilon}{2}\right)\sum_{i\in I}\Phi(X_{i},\tilde{C})

Splitting the sum on the left-hand side into iIi\in I and iIi\not\in I,

iIΦ(Xi,C~)+iIΦ(Xi,C~)(1+ϵ2)iIΦ(Xi,C~),\sum_{i\in I}\Phi(X_{i},\tilde{C})+\sum_{i\not\in I}\Phi(X_{i},\tilde{C})\leq\left(1+\frac{\epsilon}{2}\right)\sum_{i\in I}\Phi(X_{i},\tilde{C}),

and so

iIΦ(Xi,C~)ϵ2iIΦ(Xi,C~)\sum_{i\not\in I}\Phi(X_{i},\tilde{C})\leq\frac{\epsilon}{2}\sum_{i\in I}\Phi(X_{i},\tilde{C})

Comparing with the true centers, we have

iIΦ(Xi,C)ϵ2(1+ϵ)2iIΦ(Xi,C)ϵiIΦ(Xi,C).\sum_{i\not\in I}\Phi(X_{i},C)\leq\frac{\epsilon}{2}(1+\epsilon)^{2}\sum_{i\in I}\Phi(X_{i},C)\leq\epsilon\sum_{i\in I}\Phi(X_{i},C).

By the ϵ\epsilon-irreducibility assumption, there exists an I\ell\not\in I such that

iIΦ(Xi,C)>Φ(X,C)>ϵiIΦ(Xi,μi),\sum_{i\not\in I}\Phi(X_{i},C)>\Phi(X_{\ell},C)>\epsilon\sum_{i\in I}\Phi(X_{i},\mu_{i}),

and so we have a contradiction. The claimed result in the lemma therefore holds. ∎

B.2. Proof of Lemma 2

Proof.

By Lemma 1, the probability that no undiscovered cluster is seen among the T1T_{1} samples (Line 6) is at most

(1ϵ4)T1e2ln(10(k+1))=1100(k+1)2.\left(1-\frac{\epsilon}{4}\right)^{T_{1}}\leq e^{-2\ln(10(k+1))}=\frac{1}{100(k+1)^{2}}.

The maintenance of the loop invariant (3.1) in each round is a corollary of Lemma 1. With T3=20ϵ1rln2(10r)T_{3}=20\epsilon^{-1}r\ln^{2}(10r) (see Line 22), the centroid estimate μ~j\tilde{\mu}_{j} satisfies (3.1) with probability at least 11/(20rln2(10r))1-1/(20r\ln^{2}(10r)).

Taking a union bound over all rounds, the total failure probability is at most

i=11100i2+r=1120rln2(10r)0.05.\sum_{i=1}^{\infty}\frac{1}{100i^{2}}+\sum_{r=1}^{\infty}\frac{1}{20r\ln^{2}(10r)}\leq 0.05.\qed

B.3. Proof of Lemma 3

Proof.

Upon the termination of the while loop in Phase 2, let ss^{\prime} denote the number of samples belonging to the clusters in QQ. We claim that s6qln(10(k+q))s^{\prime}\geq 6q\ln(10(k+q)) with probability at least 11/(100(k+q)2)1-1/(100(k+q)^{2}). Observe that s96qln(10(k+q))/ϵs\geq 96q\ln(10(k+q))/\epsilon at this point; it then follows from a Chernoff bound (Lemma 2) that

Pr{s<6ln(10(k+q))}exp(1214sϵ4)1100(k+q)2.\Pr\{s^{\prime}<6\ln(10(k+q))\}\leq\exp\left(-\frac{1}{2}\cdot\frac{1}{4}\cdot\frac{s\epsilon}{4}\right)\leq\frac{1}{100(k+q)^{2}}.

We choose jj to be the newly discovered cluster with the most samples, so that p^j1/q\hat{p}_{j}\geq 1/q. If pj<1/(3q)p_{j}<1/(3q), then by a Chernoff bound (Lemma 1),

Pr{p^j1q}(13e113)s23q1100(k+q)2.\Pr\left\{\hat{p}_{j}\geq\frac{1}{q}\right\}\leq\left(\frac{\frac{1}{3}e}{1-\frac{1}{3}}\right)^{s^{\prime}\frac{2}{3q}}\leq\frac{1}{100(k+q)^{2}}.\qed

B.4. Proof of Lemma 4

Proof.

It suffices to show that each cluster jWj\in W sees a sample in YjY_{j}. The probability that a sample lies in YjY_{j} is

Φ(Yj,C)Φ(Xj,C)Φ(Xj,C)Φ(X,C)ϵ128(ϵ413q)=ϵ2293q,\frac{\Phi(Y_{j},C)}{\Phi(X_{j},C)}\cdot\frac{\Phi(X_{j},C)}{\Phi(X,C)}\geq\frac{\epsilon}{128}\cdot\left(\frac{\epsilon}{4}\cdot\frac{1}{3q}\right)=\frac{\epsilon^{2}}{2^{9}\cdot 3q},

where we used Lemma 2 of (Ailon et al., 2018b) for the first factor and Lemma 3 for the second.

Therefore if we sample T2=2103qln(10(k+q))/ϵ2T_{2}=2^{10}\cdot 3q\ln(10(k+q))/\epsilon^{2} points in total (Line 16), cluster jj does not see a point in YjY_{j} with probability at most

(1ϵ2293q)T2e2ln(10(k+q))1100(k+q)2.\left(1-\frac{\epsilon^{2}}{2^{9}\cdot 3q}\right)^{T_{2}}\leq e^{-2\ln(10(k+q))}\leq\frac{1}{100(k+q)^{2}}.\qed

B.5. Proof of Lemma 5

Proof.

Let jWj\in W. The rejection sampling has a valid probability threshold (at most 11) by the second part of Lemma 2. The rejection sampling includes a sampled point in XjX_{j} with probability

xXjΦ({x},C)Φ(X,C)(ϵ128Φ({xj},C)Φ({x},C))\displaystyle\quad\ \sum_{x\in X_{j}}\frac{\Phi(\{x\},C)}{\Phi(X,C)}\cdot\left(\frac{\epsilon}{128}\cdot\frac{\Phi(\{x_{j}^{\ast}\},C)}{\Phi(\{x\},C)}\right)
=|Xj|ϵ128Φ({xj},C)Φ(X,C)\displaystyle=\left|X_{j}\right|\cdot\frac{\epsilon}{128}\frac{\Phi(\{x_{j}^{\ast}\},C)}{\Phi(X,C)}
=|Xj|ϵ128Φ(Xj,C)Φ(X,C)Φ({xj},C)Φ(Xj,C)\displaystyle=\left|X_{j}\right|\cdot\frac{\epsilon}{128}\cdot\frac{\Phi(X_{j},C)}{\Phi(X,C)}\cdot\frac{\Phi(\{x_{j}^{\ast}\},C)}{\Phi(X_{j},C)}
|Xj|ϵ128ϵ413qϵ64|Xj|ϵ32153q,\displaystyle\geq\left|X_{j}\right|\cdot\frac{\epsilon}{128}\cdot\frac{\epsilon}{4}\cdot\frac{1}{3q}\cdot\frac{\epsilon}{64|X_{j}|}\geq\frac{\epsilon^{3}}{2^{15}\cdot 3q},

where the first inequality uses Lemma 1 to lower bound the last factor.

Hence, by sampling s=223ϵ4qrln2(10r)s=2^{23}\epsilon^{-4}qr\ln^{2}(10r) points in total, we expect to see at least sϵ3/(2153q)>2T3s\epsilon^{3}/(2^{15}\cdot 3q)>2T_{3} points returned by the rejection sampling procedure. Using the multiplicative Chernoff bound (Lemma 2), the rejection sampling returns at most T3=20ϵ1rln2(10r)T_{3}=20\epsilon^{-1}r\ln^{2}(10r) points with probability at most

exp(1214sϵ32153q)exp(8r).\exp\left(-\frac{1}{2}\cdot\frac{1}{4}\cdot\frac{s\epsilon^{3}}{2^{15}\cdot 3q}\right)\leq\exp\left(-8r\right).\qed

B.6. Proof of Theorem 6

Proof.

In each round, by Lemmata 3, 4 and 5, Phases 2–4 fail with probability at most 3/(100(k+q)2)+exp(8r)3/(100(k+q)^{2})+\exp(-8r). Combining with Lemma 2 and summing over all rounds, we see that the overall failure probability is at most

0.05+i=1(3100i2+exp(8i))0.1.0.05+\sum_{i=1}^{\infty}\left(\frac{3}{100i^{2}}+\exp(-8i)\right)\leq 0.1.

Let qrq_{r} denote the number of newly discovered clusters in the rr-th round. There are KK rounds in total and the total number of samples is therefore

O(r=1Kqrrlog2Lϵ4)=O(K2Llog2Lϵ4),O\left(\sum_{r=1}^{K}\frac{q_{r}r\log^{2}L}{\epsilon^{4}}\right)=O\left(\frac{K^{2}L\log^{2}L}{\epsilon^{4}}\right),

where we used the fact that qrLq_{r}\leq L. For each sample we need to call the same-cluster query O(L)O(L) times to obtain its cluster index, and thus the query complexity is O(ϵ4K2L2log2L)O(\epsilon^{-4}K^{2}L^{2}\log^{2}L). ∎

Appendix C Omitted Proofs in Section 4

C.1. Proof of Lemma 1

Proof.

As argued in the proof of Lemma 2, the probability that no undiscovered cluster is seen among the T1T_{1} samples (see Line 8 in Algorithm 4.1) is at most 1/(100(k+1)2)1/(100(k+1)^{2}).

Regarding the maintenance of the loop invariant (3.1), with T3=30K/ϵT_{3}=30K/\epsilon (see Line 22 in Algorithm 4.1), each centroid estimate μ~i\tilde{\mu}_{i} satisfies (3.1) with probability at least 11/(30K)1-1/(30K). Taking a union bound, we know that (3.1) holds for all iWi\in W in each single round with probability at least 1|W|/(30K)1-|W|/(30K).

Let wrw_{r} denote the number of newly recovered clusters in the rr-th round, then rwrK\sum_{r}w_{r}\leq K. Taking a union bound over all rounds as in Lemma 2, we see that the failure probability is at most

i=11100i2+rwr30K=π2600+1300.05.\sum_{i=1}^{\infty}\frac{1}{100i^{2}}+\sum_{r}\frac{w_{r}}{30K}=\frac{\pi^{2}}{600}+\frac{1}{30}\leq 0.05.\qed

C.2. Proof of Lemma 2

Proof.

Let ss^{\prime} denote the number of samples belonging to the clusters in QQ. We claim that upon the termination of the while loop in Phase 2, it holds that s100|W|logqln(10(k+q))s^{\prime}\geq 100|W|\log q\ln(10(k+q)) with probability at least 11/(100(k+q)2)1-1/(100(k+q)^{2}). Observe that s1600|W|logqln(10(k+q))/ϵs\geq 1600|W|\log q\ln(10(k+q))/\epsilon at this point; it then follows from a Chernoff bound (Lemma 2) that

Pr{s<100ln(10(k+q))}\displaystyle\Pr\{s^{\prime}<100\ln(10(k+q))\} exp(1214sϵ4)1100(k+q)2.\displaystyle\leq\exp\left(-\frac{1}{2}\cdot\frac{1}{4}\cdot\frac{s\epsilon}{4}\right)\leq\frac{1}{100(k+q)^{2}}.

Observe that every cluster ii in a heavy band BB_{\ell} satisfies

p^ijBpj32|B|16|B|L118|B|logq.\hat{p}_{i}\geq\frac{\sum_{j\in B_{\ell}}p_{j}}{\frac{3}{2}|B_{\ell}|}\geq\frac{1}{6|B_{\ell}|L}\geq\frac{1}{18|B_{\ell}|\log q}.

If pi<1/(70|B|logq)p_{i}<1/(70|B_{\ell}|\log q), then by a Chernoff bound (Lemma 1),

Pr{p^i118|B|logq}(170e118170)s(118170)1|W|logq1100(k+q)3.\displaystyle\Pr\left\{\hat{p}_{i}\geq\frac{1}{18|B_{\ell}|\log q}\right\}\leq\left(\frac{\frac{1}{70}e}{\frac{1}{18}-\frac{1}{70}}\right)^{s^{\prime}(\frac{1}{18}-\frac{1}{70})\frac{1}{|W|\log q}}\leq\frac{1}{100(k+q)^{3}}.

Taking a union bound over all |W|q|W|\leq q clusters yields the claimed result. ∎

C.3. Proof of Lemma 3

Proof.

It suffices to show that each cluster jWj\in W sees a sample in YjY_{j}. The probability that a sample lies in YjY_{j} is

Φ(Yj,C)Φ(Xj,C)Φ(Xj,C)Φ(X,C)ϵ128(ϵ4170|B(j)|logq)=ϵ221035|B(j)|logq.\frac{\Phi(Y_{j},C)}{\Phi(X_{j},C)}\cdot\frac{\Phi(X_{j},C)}{\Phi(X,C)}\geq\frac{\epsilon}{128}\left(\frac{\epsilon}{4}\!\cdot\!\frac{1}{70|B_{\ell(j)}|\log q}\right)=\frac{\epsilon^{2}}{2^{10}\!\cdot\!35\!\cdot\!|B_{\ell(j)}|\log q}.

Therefore if we sample T2210105|W|logqln(5(k+q))/ϵ2T_{2}\geq 2^{10}\cdot 105\cdot|W|\log q\ln(5(k+q))/\epsilon^{2} points in total (see Line 22 in Algorithm 4.1), each cluster jWj\in W does not see a point in YjY_{j} with probability at most

(1ϵ221035|B(j)|logq)T2e3ln(5(k+q))125(k+q)3\displaystyle\left(1-\frac{\epsilon^{2}}{2^{10}\cdot 35\cdot|B_{\ell(j)}|\log q}\right)^{T_{2}}\leq e^{-3\ln(5(k+q))}\leq\frac{1}{25(k+q)^{3}}

and taking a union bound over all jWj\in W yields the claim. ∎

C.4. Proof of Lemma 4

Proof.

Let jWj\in W. The rejection sampling has a valid probability threshold (at most 11) by the second part of Lemma 2. The rejection sampling includes a sampled point in XjX_{j} with probability

xXjΦ({x},C)Φ(X,C)(ϵ128Φ({xj},C)Φ({x},C))\displaystyle\sum_{x\in X_{j}}\frac{\Phi(\{x\},C)}{\Phi(X,C)}\cdot\left(\frac{\epsilon}{128}\cdot\frac{\Phi(\{x_{j}^{\ast}\},C)}{\Phi(\{x\},C)}\right)
=|Xj|ϵ128Φ({xj},C)Φ(X,C)\displaystyle=|X_{j}|\cdot\frac{\epsilon}{128}\frac{\Phi(\{x_{j}^{\ast}\},C)}{\Phi(X,C)}
=|Xj|ϵ128Φ(Xj,C)Φ(X,C)Φ({xj},C)Φ(Xj,C)\displaystyle=|X_{j}|\cdot\frac{\epsilon}{128}\cdot\frac{\Phi(X_{j},C)}{\Phi(X,C)}\cdot\frac{\Phi(\{x_{j}^{\ast}\},C)}{\Phi(X_{j},C)}
|Xj|ϵ128ϵ4170|B(j)|logqϵ64|Xj|\displaystyle\geq|X_{j}|\cdot\frac{\epsilon}{128}\cdot\frac{\epsilon}{4}\cdot\frac{1}{70|B_{\ell(j)}|\log q}\cdot\frac{\epsilon}{64|X_{j}|}
ϵ3222|B(j)|logq\displaystyle\geq\frac{\epsilon^{3}}{2^{22}|B_{\ell(j)}|\log q}
ϵ3222|W|logq.\displaystyle\geq\frac{\epsilon^{3}}{2^{22}|W|\log q}.

Let s=228ϵ4|W|Klogqs=2^{28}\epsilon^{-4}|W|K\log q. By sampling ss points, it follows from the multiplicative Chernoff bound (Lemma 2) that, for each jWj\in W, the rejection sampling returns at most T3=30K/ϵT_{3}=30K/\epsilon points (see Line 30 in Algorithm 4.1) with probability at most

exp(1214sϵ3222|W|logq)exp(8K).\exp\left(-\frac{1}{2}\cdot\frac{1}{4}\cdot\frac{s\epsilon^{3}}{2^{22}|W|\log q}\right)\leq\exp\left(-8K\right).

Taking a union bound over all jWj\in W yields the claimed result. ∎

C.5. Proof of Theorem 5

Proof.

Consider the inner repeat-until loop for a fixed value of KK. Let wrw_{r} denote the number of newly recovered clusters in the rr-th round, then wrK\sum w_{r}\leq K.

As in the proof of Theorem 6, the overall failure probability can be upper bounded by

0.05+i=13100i2+rwrexp(8K)\displaystyle 0.05+\sum_{i=1}^{\infty}\frac{3}{100i^{2}}+\sum_{r}w_{r}\exp(-8K) =0.05+π2200+Kexp(8K)\displaystyle=0.05+\frac{\pi^{2}}{200}+K\exp(-8K)
0.1\displaystyle\leq 0.1

and the total number of samples is

O(r=1RwrKlog2Lϵ4)=O(K2log2Lϵ4).O\left(\frac{\sum_{r=1}^{R}w_{r}K\log^{2}L}{\epsilon^{4}}\right)=O\left(\frac{K^{2}\log^{2}L}{\epsilon^{4}}\right).

Since there are logK\log K repetitions and the number of samples is dominated by that in the last repetition, the total number of samples is O(ϵ4K2logKlog2L)O(\epsilon^{-4}K^{2}\log K\log^{2}L). For each sample we need to call the same-cluster query O(L)O(L) times to obtain its cluster index, and thus the query complexity is O(ϵ4K2LlogKlog2L)O(\epsilon^{-4}K^{2}L\log K\log^{2}L). ∎

Appendix D Noisy Oracles

Algorithm D.1 Algorithm for the case of a noisy oracle. The oracle errs with a constant probability p<1/2p<1/2. The constants C1,C2,C_{1},C_{2},\dots in the algorithm below all depend on pp.
1:II\leftarrow\emptyset
2:k0k\leftarrow 0
3:r0r\leftarrow 0
4:K1/2K\leftarrow 1/2
5:repeat
6:     K2KK\leftarrow 2K
7:     repeat
8:         rr+1r\leftarrow r+1
9:         QQ\leftarrow\emptyset
10:         SS\leftarrow\emptyset
11:         T18ϵ1ln(10(k+1))T_{1}\leftarrow 8\epsilon^{-1}\ln(10(k+1))
12:         SD2-sample T1 points w.r.t. {μ~i}iIS\leftarrow\text{$D^{2}$-sample $T_{1}$ points w.r.t.\ $\{\tilde{\mu}_{i}\}_{i\in I}$}
13:         newfalsenew\leftarrow false
14:         for each xSx\in S do
15:              jCheckCluster(x,I)j\leftarrow\textsc{CheckCluster}(x,I)
16:              if j=nullj=null then
17:                  newtruenew\leftarrow true
18:                  break loop
19:              end if
20:         end for
21:         if new=truenew=true then
22:              q1/2q\leftarrow 1/2
23:              repeat
24:                  q2qq\leftarrow 2q
25:                  TC2ϵ2q2poly(log((k+q)/ϵ))T\leftarrow C_{2}\epsilon^{-2}q^{2}\operatorname{poly}(\log((k+q)/\epsilon))
26:                  SD2-sample T points w.r.t. {μ~i}iIS\leftarrow\text{$D^{2}$-sample $T$ points w.r.t.\ $\{\tilde{\mu}_{i}\}_{i\in I}$}
27:                  (Q,(Zi)iQ)FindClusters(S)(Q,(Z_{i})_{i\in Q})\leftarrow\textsc{FindClusters}(S)
28:                  Q{iQ:|Zi|=Ω(TlogT)}Q\leftarrow\{i\in Q:|Z_{i}|=\Omega(\sqrt{T}\log T)\}
29:              until |Q|q/2|Q|\geq q/2
30:              for each iQi\in Q do
31:                  p^i|Zi|/T\hat{p}_{i}\leftarrow|Z_{i}|/T
32:              end for
33:              Split the clusters into bands {B}\{B_{\ell}\}
34:              WW\leftarrow all clusters in heavy bands
35:              q|Q|q\leftarrow|Q|
36:              for each jWj\in W do
37:                  xjargminxYjΦ({x},{μ~i}iI)x_{j}^{\ast}\leftarrow\operatorname*{arg\,min}_{x\in Y_{j}}\Phi(\{x\},\{\tilde{\mu}_{i}\}_{i\in I})
38:                  SjS_{j}\leftarrow\emptyset
39:              end for
40:              wr|W|w_{r}\leftarrow|W|
41:              T330K/ϵT_{3}\leftarrow 30K/\epsilon
42:              {Sj}jWRejSampNoisy(I,{μ~i}iI,T3,W,{xj}jW)\{S_{j}\}_{j\in W}\leftarrow\textsc{RejSampNoisy}(I,\{\tilde{\mu}_{i}\}_{i\in I},T_{3},W,\{x_{j}^{\ast}\}_{j\in W})
43:              for each jWj\in W do
44:                  μ~j(1/|Sj|)xSjx\tilde{\mu}_{j}\leftarrow(1/|S_{j}|)\sum_{x\in S_{j}}x
45:              end for
46:              kk+|W|k\leftarrow k+|W|
47:              IIWI\leftarrow I\cup W
48:              Retain C4K/ϵC_{4}K/\epsilon points in each ZjZ_{j} for all jWj\in W
49:         end if
50:     until Q=Q=\emptyset or kKk\geq K
51:until Q=Q=\emptyset and kKk\leq K
Algorithm D.2 CheckCluster(x,I)\textsc{CheckCluster}(x,I): Check if a point xx belongs to some cluster iIi\in I using the set of representative points {Zi}iI\{Z_{i}\}_{i\in I}.
1:for each iIi\in I do
2:     if xx belongs to the same cluster as the majority of points in ZiZ_{i} then
3:         return ii
4:return nullnull

In this section we consider the noisy same-cluster oracle which errs with a constant probability p<1/2p<1/2. We assume that the same-cluster oracle returns the same answer on repetition for the same pair of points. The immediate consequence is that the Classify function needs to be amended as we cannot keep one representative point ziz_{i} from each cluster ii but have to keep a set of points ZiZ_{i} from cluster ii to tolerate oracle error. We present the substitute for Classify, called CheckCluster, in Algorithm D.2. The following is the guarantee of the CheckCluster algorithm.
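To make the majority-vote idea concrete, here is a minimal Python sketch of the logic behind CheckCluster (a simplified illustration, not the implementation used in Algorithm D.1); the names Z and noisy_same_cluster are hypothetical stand-ins for the retained representative sets and the noisy oracle.

def check_cluster(x, I, Z, noisy_same_cluster):
    """Return the index i in I for which a strict majority of noisy
    same-cluster answers places x in cluster i, or None if no such i exists.

    I:                  iterable of discovered cluster indices
    Z:                  dict mapping cluster index -> list of representative points
    noisy_same_cluster: oracle that errs with some fixed probability p < 1/2
    """
    for i in I:
        votes = sum(1 for z in Z[i] if noisy_same_cluster(x, z))
        if 2 * votes > len(Z[i]):  # strict majority says "same cluster"
            return i
    return None

By Lemma 1 below, keeping a representative set of size at least CK/epsilon per cluster suffices for all such majority votes to be correct with high probability.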

Lemma 1.

Suppose that |I|K|I|\leq K and MC1(K/ϵ)cM\leq C_{1}(K/\epsilon)^{c} for some absolute constants C1C_{1} and cc. There exists a constant C=C(p,c)>0C=C(p,c)>0 such that when |Zi|CK/ϵ|Z_{i}|\geq CK/\epsilon for all iIi\in I, all MM calls to the CheckCluster function are correct with probability at least 0.990.99.

Proof.

By a Chernoff bound (Lemma 2), the algorithm yields the correct behaviour w.r.t. each cluster iIi\in I with probability at least 1exp((2p1)2pCK/(8ϵ))=1exp(CK/ϵ)1-\exp(-(2p-1)^{2}pCK/(8\epsilon))=1-\exp(-C^{\prime}K/\epsilon). Taking a union bound over all iIi\in I and all MM calls yields an overall failure probability of at most M|I|exp(CK/ϵ)C1(K/ϵ)c+1exp(CK/ϵ)0.01M|I|\exp(-C^{\prime}K/\epsilon)\leq C_{1}(K/\epsilon)^{c+1}\exp(-C^{\prime}K/\epsilon)\leq 0.01, provided CC^{\prime} is large enough. Then we let C=8C/(p(2p1)2)C=8C^{\prime}/(p(2p-1)^{2}). ∎

In order to obtain the representative samples from each set, we deploy a reduction to the stochastic block model. This model considers a graph G=(V,E)G=(V,E) of nn nodes, which is partitioned as V=V1V2VkV=V_{1}\cup V_{2}\cup\cdots\cup V_{k}. An edge is added between two nodes belonging to the same partition with probability at least 1p1-p and between two nodes in different partitions with probability at most qq. It is known how to find the (large) clusters in this model.

Lemma 2 ((Mazumdar and Saha, 2017a)).

There exists a polynomial time algorithm that, given an instance of a stochastic block model on nn nodes, retrieves all clusters of size Ω(nlogn)\Omega(\sqrt{n}\log n) using O(n3/(2p1)4)O(n^{3}/(2p-1)^{4}) queries with probability 11/n1-1/n.
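For intuition, the reduction can be realized by querying the noisy oracle on pairs of sampled points and handing the resulting graph to a cluster-recovery routine as in Lemma 2. The Python sketch below queries all pairs, which is cruder than a query-bounded procedure but suffices to illustrate the construction; recover_large_clusters is only a hypothetical placeholder for such a routine, not an actual API.

import itertools

def build_query_graph(S, noisy_same_cluster):
    """Build the graph of the stochastic-block-model reduction: nodes are the
    sampled points in S, and an edge joins a pair whenever the noisy oracle
    answers "same cluster" (correct with probability 1 - p for same-cluster
    pairs). Returns an adjacency list keyed by position in S."""
    adj = {u: set() for u in range(len(S))}
    for u, v in itertools.combinations(range(len(S)), 2):
        if noisy_same_cluster(S[u], S[v]):
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Hypothetical usage with a recovery routine in the spirit of Lemma 2:
# adj = build_query_graph(S, noisy_same_cluster)
# clusters = recover_large_clusters(adj)  # placeholder; returns clusters of size Omega(sqrt(n) log n)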

Our main theorem about the noisy same-cluster oracle is the following. We do not attempt to optimize the number of same-cluster queries.

Theorem 3.

Suppose that the same-cluster oracle errs with a constant probability p<1/2p<1/2. With probability at least 0.60.6, Algorithm D.1 finds all the clusters and obtains for every ii an approximate centroid μ~i\tilde{\mu}_{i} that satisfies (3.1), using O~p(ϵ6L6K)\tilde{O}_{p}(\epsilon^{-6}L^{6}K) same-cluster queries in total.

Proof.

The analysis is similar to that of the improved algorithm (Algorithm 4.1) and we shall only highlight the changes. For now let us assume that all CheckCluster calls return correct results (we shall verify this at the end of the proof).

The analysis of Phases 1 and 4 remains the same as the analysis for the improved algorithm (Algorithm 4.1).

In Phase 2, all clusters jWj\in W satisfy pj=Ω(1/(|W|logq))p_{j}=\Omega(1/(|W|\log q)) and hence each such cluster receives in expectation at least s′′=Ω(ϵ|S|/(|W|logq))s^{\prime\prime}=\Omega(\epsilon|S|/(|W|\log q)) sampled points. We require s′′|S|log|S|s^{\prime\prime}\geq\sqrt{|S|}\log|S| (by Lemma 2), which leads to |S|=Ω(ϵ2q2log2qlog2(q/ϵ))|S|=\Omega(\epsilon^{-2}q^{2}\log^{2}q\log^{2}(q/\epsilon)). By a Chernoff bound, we see that O(ϵ2q2poly(log((k+q)/ϵ)))O(\epsilon^{-2}q^{2}\operatorname{poly}(\log((k+q)/\epsilon))) samples suffice to guarantee correctness. However, we do not know the number of newly discovered clusters qq when taking the samples. Hence we guess qq over the powers of 22 sequentially and stop when we obtain more than q/2q/2 new clusters of size at least Ω(|S|log|S|)\Omega(\sqrt{|S|}\log|S|). For each guess of qq, the number of same-cluster queries is Op(|S|3)=O~p(ϵ6q6)O_{p}(|S|^{3})=\tilde{O}_{p}(\epsilon^{-6}q^{6}). There can be at most logK\log K guesses and thus the number of queries in this phase is O~p(ϵ6q6)\tilde{O}_{p}(\epsilon^{-6}q^{6}).

In Phase 3, we would need each cluster jWj\in W to see in expectation Θ(1/ϵ)\Theta(1/\epsilon) samples. This is automatically achieved by the number of samples in the analysis above for Phase 2, with an appropriate adjustment of constants.

For each fixed value of KK, the number of calls to CheckCluster is O(K2/ϵ4)O(K^{2}/\epsilon^{4}) and Lemma 1 applies, verifying our earlier claim that all calls to CheckCluster are correct.

For each fixed value of KK, in each round, Phase 2 uses O~(ϵ6L6)\tilde{O}(\epsilon^{-6}L^{6}) oracle calls (Lemma 2) and there are KK rounds; each call to Algorithm D.2 CheckCluster uses O(K2/ϵ)O(K^{2}/\epsilon) oracle calls and there are O(K2/ϵ4)O(K^{2}/\epsilon^{4}) calls to CheckCluster. Hence the total number of oracle calls for each fixed value of KK is O~(L6K/ϵ6)+O(K2/ϵK2/ϵ4)=O~(L6K/ϵ6)\tilde{O}(L^{6}K/\epsilon^{6})+O(K^{2}/\epsilon\cdot K^{2}/\epsilon^{4})=\tilde{O}(L^{6}K/\epsilon^{6}).

Since there are logK\log K guesses for the value of KK, the overall number of oracle calls is O~(L6K/ϵ6)\tilde{O}(L^{6}K/\epsilon^{6}). ∎

Algorithm D.3 RejSampNoisy(I,{μ~i}iI,T,W,{xj}jW)\textsc{RejSampNoisy}(I,\{\tilde{\mu}_{i}\}_{i\in I},T,W,\{x_{j}^{\ast}\}_{j\in W}): Rejection sampling on heavy clusters
1:repeat
2:     xx\leftarrow a point returned by D2D^{2}-sampling (w.r.t. {μ~i}iI\{\tilde{\mu}_{i}\}_{i\in I})
3:     jCheckCluster(x,W)j\leftarrow\textsc{CheckCluster}(x,W)
4:     if jnullj\neq null and jWj\in W then
5:         Add xx to SjS_{j} with probability ϵ128Φ({xj},{μ~i}iI)Φ({x},{μ~i}iI)\frac{\epsilon}{128}\cdot\frac{\Phi(\{x_{j}^{\ast}\},\{\tilde{\mu}_{i}\}_{i\in I})}{\Phi(\{x\},\{\tilde{\mu}_{i}\}_{i\in I})}
6:until |Sj|T|S_{j}|\geq T for all jWj\in W
7:return {Sj}jW\{S_{j}\}_{j\in W}
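Algorithm D.3 combines D^2-sampling with a rejection step. The Python sketch below illustrates only these two numerical ingredients, assuming the data points and the current approximate centers are given as NumPy arrays; it is a sketch for intuition, not the full procedure.

import numpy as np

def phi(y, centers):
    """Phi({y}, C): squared distance from y to its nearest center
    (assumes at least one center has been found)."""
    return min(float(np.sum((y - c) ** 2)) for c in centers)

def d2_sample(points, centers, rng):
    """Draw one point with probability proportional to its squared distance
    to the nearest approximate center (D^2-sampling). Assumes at least one
    point has positive distance to the centers."""
    d2 = np.array([phi(x, centers) for x in points])
    return points[rng.choice(len(points), p=d2 / d2.sum())]

def accept_prob(x, x_star, centers, eps):
    """Acceptance probability of the rejection step (Line 5 of Algorithm D.3):
    (eps / 128) * Phi({x_j^*}, centers) / Phi({x}, centers). A D^2-sampled x
    has positive Phi, so the ratio is well defined."""
    return (eps / 128) * phi(x_star, centers) / phi(x, centers)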

Appendix E Omitted Tables of Experiment Results

#Queries Algo. 1.0E4 2.5E4 5.0E4 7.5E4 1.0E5 1.2E5 1.5E5 1.75E5 2.0E5
p=0p=0 Basic 16.96 24.88 36.95 43.51 49.34 54.53 60.47 64.62 68.25
Batched 17.45 26.00 37.98 45.56 51.92 56.56 62.85 66.59 70.68
Uniform 12.97 19.30 25.75 30.73 34.77 37.59 41.62 44.48 48.01
p=0.1p=0.1 Basic 17.78 25.10 33.90 40.52 46.93 51.14 56.66 60.84 65.05
Batched 18.31 25.85 35.27 43.28 49.60 53.78 59.14 63.50 68.55
Uniform 13.11 19.40 25.88 30.63 34.73 37.68 41.73 44.41 47.54
p=0.3p=0.3 Basic 17.26 26.38 33.73 40.22 45.62 49.03 54.09 57.94 60.65
Batched 17.91 27.59 35.44 42.56 47.74 51.53 56.56 60.04 62.99
Uniform 13.17 19.26 25.95 31.17 34.77 38.11 41.72 45.04 47.87
Table E.5. Performance comparison on synthetic datasets, fixed-budget
#Clusters Algo. 20 30 40 50 60 70
p=0p=0 Basic 789.57 1128.73 1459.13 1978.24 2416.48 2997.13
Batched 722.21 1050.19 1363.88 1846.77 2316.00 2845.01
Uniform 1275.11 2260.90 3359.38 4338.60 5237.93 6261.74
p=0.1p=0.1 Basic 618.83 1233.11 1787.45 2262.58 2778.48 3235.86
Batched 615.86 1160.61 1625.75 2053.64 2596.92 3059.33
Uniform 1284.39 2297.05 3336.56 4366.22 5265.16 6187.66
p=0.3p=0.3 Basic 678.37 1109.64 1803.80 2438.27 3198.09 4447.11
Batched 630.94 1056.14 1664.32 2269.10 2954.42 4089.53
Uniform 1264.46 2201.10 3317.04 4307.54 5239.21 6074.59
Table E.6. Performance comparison on synthetic datasets, fixed-recovery
#Queries Basic Batched Uniform
5.00E+03 8.33 8.89 5.03
2.00E+04 13.08 13.65 6.54
4.00E+04 16.54 17.05 8.21
6.00E+04 17.20 17.64 9.23
8.00E+04 17.50 18.07 9.79
1.00E+05 17.80 18.25 9.95

Fixed-budget

#Clusters Basic Batched Uniform
6 210.68 192.10 986.56
8 420.47 351.37 3624.32
10 861.65 747.29 6694.91
12 1252.59 996.43 79672.90
14 1838.29 1475.32 142161.65
16 2484.36 1901.43 201012.82

Fixed-recovery

Table E.7. Performance comparison on Kdd dataset
#Queries Basic Batched Uniform
1.00E+03 3.22 3.22 3.00
5.00E+03 4.89 4.92 3.51
1.00E+04 5.48 5.55 4.06
2.00E+04 6.14 6.31 4.83
3.00E+04 6.55 6.61 5.13

Fixed-budget

#Clusters Basic Batched Uniform
3 86.03 86.65 79.10
4 443.60 435.29 1233.67
5 725.46 711.47 3154.66
6 2053.65 2036.89 6988.62
7 4050.22 4050.03 7799.98

Fixed-recovery

Table E.8. Performance comparison on Shuttle dataset