This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Correlation Clustering Algorithm for Dynamic Complete Signed Graphs: An Index-based Approach

Ali Shakiba
(Department of Computer Science, Vali-e-Asr University of Rafsanjan, Rafsanjan, Iran.
[email protected];[email protected]
)
Abstract

In this paper, we reduce the complexity of approximating the correlation clustering problem from π’ͺ​(mΓ—(2+α​(G))+n)\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right) to π’ͺ​(m+n)\mathcal{O}\left(m+n\right) for any given value of Ξ΅\varepsilon for a complete signed graph with nn vertices and mm positive edges where α​(G)\alpha(G) is the arboricity of the graph. Our approach gives the same output as the original algorithm and makes it possible to implement the algorithm in a full dynamic setting where edge sign flipping and vertex addition/removal are allowed. Constructing this index costs π’ͺ​(m)\mathcal{O}\left(m\right) memory and π’ͺ​(m×α​(G))\mathcal{O}\left(m\times\alpha(G)\right) time. We also studied the structural properties of the non-agreement measure used in the approximation algorithm. The theoretical results are accompanied by a full set of experiments concerning seven real-world graphs. These results shows superiority of our index-based algorithm to the non-index one by a decrease of %34 in time on average.

Keywords: Correlation clustering β‹…\cdot Dynamic graphs β‹…\cdot Online Algorithms

1 Introduction

Clustering is one of the most studied problems in machine learning with various applications in analyzing and visualizing large datasets. There are various models and technique to obtain a partition of elements, such that elements belonging to different partitions are dissimilar to each other and the elements in the same partition are very similar to each other.

The problem of correlation clustering, introduced in [1], is known to be an NP-hard problem for the disagree minimization. Therefore, several different approximation solutions based on its IP formulation exist in the literature. Recently, the idea of a 2-approximation algorithm in [1] is extended in [4] for constructing a π’ͺ​(1)\mathcal{O}\left(1\right)-approximation algorithm. The experiments in [4] show acceptable performance for this algorithm in practice, although its theoretical guarantee can be too high, e.g. 1 4421\,442 for Ξ²=Ξ»=136\beta=\lambda=\frac{1}{36}. In [3], this algorithm is extended to an online setting where just vertex additions are allowed, and whenever a new vertex is added, it reveals all its positively signed edges. Shakiba in [12] studied the effect of vertex addition/removal and edge sign flipping in the underlying graph to the final clustering result, in order to make the algorithm suitable for dynamic graphs. However, one bottleneck in this way is computing the values of NonAgreement among the edges and identifying the Ξ΅\varepsilon-lightness of vertices. The current paper proposes a novel indexing scheme to remedy this and make the algorithm efficient, not just in terms of dynamic graphs, but for even dynamic hyper-parameter Ξ΅\varepsilon. Our proposed method, in comparison with the online method of [3] is that we allow a full dynamic setting, i.e. vertex addition/removal and edge’s sign flipping. It is known that any online algorithm for the correlation clustering problem has at least Ω​(n)\Omega(n)-approximation ratio [10]. Note that the underlying algorithm used in the current paper is consistent, as is shown via experimental results [3].

The rest of the paper is organized as follows: In Section 1.1, we highlight our contributions. This is followed by a reminding some basic algorithms and results in Section 2. Then, we introduce the novel indexing structure in Section 3.1 and show how it can be employed to enhance the running-time of the approximate correlation clustering algorithm. Then, we show how to maintain the proposed indices in a full dynamic settings in Section 3.2. In Section 4, we give an extensive experiments which accompanies the theoretical results and show the effectiveness of the proposed indexing structure. Finally, a conclusion is drawn.

1.1 Our Contribution

In this paper, we simply ask

β€œHow can one reduce the time to approximate a correlation clustering of the input graph [4] for varying values of Ξ΅\varepsilon?”

We also ask

β€œHow can we make the solution to the first question an online solution for dynamic graphs?”

Our answer to the first question is devising a novel indexing-structure which is constructed based on the structural properties of the approximation algorithm and its NonAgreement measure. As our experiments in Section 4 show, the proposed method enhanced the total running-time of querying the clustering for about %34\%34 on average for seven real-world datasets. Then, we make this structure online to work with dynamic graphs based on theoretical results in [12]. The construction of the index itself is highly parallelizable, up to the number of the vertices in the input graph. The idea for parallelization is simple: construct each ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) in the ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right) with a separate parallel thread. We also study the intrinsic structures in the NonAgreement measure, to bake more efficient algorithms for index-maintenance due to updates to the underlying graph. More precisely, we show that using the proposed index structure, we can find a correlation clustering for a graph for any given value of Ξ΅\varepsilon in time π’ͺ​(m+n)\mathcal{O}\left(m+n\right), compared to the π’ͺ​(mΓ—(2+α​(G))+n)\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right) time for the CC. The pre-processing time of the ICC would be π’ͺ​(m×α​(G))\mathcal{O}\left(m\times\alpha(G)\right) with π’ͺ​(m)\mathcal{O}\left(m\right) space complexity.

2 Preliminaries

Let G=(V,E)G=(V,E) be a complete undirected signed graph with |V|=n\left|V\right|=n vertices. The set of edges EE is naturally partitioned into positive and negative signed edges, E+E^{+} and Eβˆ’E^{-}, respectively. Then, we use mm to denote |E+|\left|E^{+}\right|. The correlation clustering problem is defined as

cost⁑(π’ž)=βˆ‘{u,v}∈E+u∈Ci,v∈Cj,iβ‰ j1+βˆ‘{u,v}∈Eβˆ’u,v∈Ci1,\operatorname{cost}(\mathcal{C})=\sum_{\begin{subarray}{c}\left\{u,v\right\}\in E^{+}\\ u\in C_{i},v\in C_{j},i\neq j\end{subarray}}1+\sum_{\begin{subarray}{c}\left\{u,v\right\}\in E^{-}\\ u,v\in C_{i}\end{subarray}}1, (1)

where π’ž={C1,…,Cβ„“}\mathcal{C}=\left\{C_{1},\ldots,C_{\ell}\right\} is a clustering. Note that this is the min-disagree variant of the problem.

The constant factor approximation algorithm of [4] is based on two main quantities: (1) Ξ΅\varepsilon-agreement of a positively signed edge {u,v}\left\{u,v\right\}, i.e. uu and vv are in Ξ΅\varepsilon-agreement if and only if NonAgreementG​(u,v)=|NG​(u)​Δ​NG​(v)|max⁑{|NG​(u)|,|NG​(v)|}<Ξ΅\textsc{NonAgreement}_{G}\left(u,v\right)=\frac{\left|N_{G}(u)\Delta N_{G}(v)\right|}{\max\left\{\left|N_{G}(u)\right|,\left|N_{G}(v)\right|\right\}}<\varepsilon, and (2) Ξ΅\varepsilon-lightness, where a vertex uu is said to be Ξ΅\varepsilon-light if AgreeCntG+​(u)|NG+​(u)|<Ξ΅\frac{\textsc{AgreeCnt}_{G^{+}}(u)}{\left|N_{G^{+}}(u)\right|}<\varepsilon where AgreeCntG+​(u)=|{w∈V|u​ and ​v​ are in ​Ρ​-agreement}|\textsc{AgreeCnt}_{G^{+}}(u)=\left|\left\{w\in V|u\text{ and }v\text{ are in }\varepsilon\text{-agreement}\right\}\right|. Note that a vertex which is not Ξ΅\varepsilon-light is called Ξ΅\varepsilon-heavy. This is a 2+4Ξ΅+1Ξ΅22+\frac{4}{\varepsilon}+\frac{1}{\varepsilon^{2}}-approximation algorithm, as is shown in [3]. This algorithm is described in Algorithm 1, which we will refer to the CC algorithm, for short.

Algorithm 1 CorrelationClustering(GG) [4]
1:procedureΒ CorrelationClustering(G,Ξ΅G,\varepsilon)
2:Β Β Β Β Β Let G+=G​[E+]G^{+}=G[E^{+}] where E+E^{+} is the set of edges whose sign is ++
3:Β Β Β Β Β Discard all edges whose endpoints are not in Ξ΅\varepsilon-agreement
4:Β Β Β Β Β Discard all edges between two Ξ΅\varepsilon-light vertices
5:Β Β Β Β Β Let G+~\widetilde{G^{+}} be the sparsified graph G+G^{+} after performing previous two operations
6:Β Β Β Β Β Let π’ž\mathcal{C} be the collection of connected components in G+~\widetilde{G^{+}}
7:Β Β Β Β Β return π’ž\mathcal{C} as the output clustering
8:endΒ procedure

Shakiba in [12] studied theoretical foundation of the CC algorithm in a full dynamic setting. The following result is a summary of Table 1, Corollary 1, and Theorem 4 in [12].

Theorem 1.

Suppose the sign of an edge u={u,v}u=\left\{u,v\right\} is flipped. Then, the non-agreement and Ξ΅\varepsilon-lightness of vertices other than the ones whose distance to either uu and vv is more than two would not change.

The arboricity of the graph GG is the minimum number of edge-disjoint spanning forests into which GG can be decomposed. The following lemma for arboricity is useful in bounding the number of operations.

Lemma 1.

Lemma 2 in [2] Suppose the graph G=(V,E)G=(V,E) has nn vertices with mm edges. Then,

βˆ‘{u,v}∈Emin⁑{degG⁑(u),degG⁑(v)}≀2​a​(G)Γ—m.\sum_{\left\{u,v\right\}\in E}\min\left\{\deg_{G}(u),\deg_{G}(v)\right\}\leq 2a(G)\times m. (2)

3 Proposed Method

In this section, we describe our novel indexing structure. This structure allows dynamic queries of the correlation clustering with varying values of Ξ΅\varepsilon for dynamic graphs. The proposed algorithm which uses the indexing structure would be called ICC, or indexed-based correlation clustering.

3.1 Indexing structure

For an edge e={u,v}e=\left\{u,v\right\} with positive sign, we define its Ξ΅\varepsilon-agreement distance as NonAgreementG+​(u,v)\textsc{NonAgreement}_{G^{+}}\left(u,v\right). Intuitively, this is the infimum of the values Ξ΅\varepsilon which the nodes uu and vv are not in Ξ΅\varepsilon-agreement. Let define the set β„°={NonAgreementG+​(u,v)|e={u,v}∈E+}\mathcal{E}=\left\{\textsc{NonAgreement}_{G^{+}}\left(u,v\right)|e=\left\{u,v\right\}\in E^{+}\right\}. Without loss of generality, let β„°={Ξ΅0,…,Ξ΅β„“βˆ’1}\mathcal{E}=\left\{\varepsilon_{0},\ldots,\varepsilon_{\ell-1}\right\} with the ordering min⁑ℰ=Ξ΅0<Ξ΅1<β‹―<Ξ΅β„“βˆ’1=max⁑ℰ\min\mathcal{E}=\varepsilon_{0}<\varepsilon_{1}<\cdots<\varepsilon_{\ell-1}=\max\mathcal{E}. For a fixed value of Ξ΅\varepsilon, let GΞ΅+=(V,EΞ΅+)G^{+}_{\varepsilon}=(V,E^{+}_{\varepsilon}) where EΞ΅+={e={u,v}∈E+|NonAgreementG+​(u,v)<Ξ΅}E^{+}_{\varepsilon}=\left\{e=\left\{u,v\right\}\in E^{+}|\textsc{NonAgreement}_{G^{+}}\left(u,v\right)<\varepsilon\right\}.

Observation 1.

For all Ρ≀Ρ0\varepsilon\leq\varepsilon_{0}, GΞ΅+G^{+}_{\varepsilon} is the null graph, i.e. a graph on all nodes without any edges. Moreover, for all Ξ΅>Ξ΅β„“\varepsilon>\varepsilon_{\ell}, GΞ΅+=G+G^{+}_{\varepsilon}=G^{+}.

Next, we introduce the key ingredient to our indexing structure, called ℕ​𝔸​𝕆\mathbb{NAO}.

Definition 1 (NonAgreement Node Ordering).

The Ξ΅\varepsilon-agreement ordering for each node v∈Vv\in V, denoted by ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right), is defined as an ordered subset of vertices in GG where:

  1. 1.

    node u∈Vu\in V appears in the ordering ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) if and only if e={u,v}e=\left\{u,v\right\} is a positive edge in GG.

  2. 2.

    for each two distinct vertices u,w∈Vu,w\in V which appear in ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right),

    NonAgreementG+​(v,u)<NonAgreementG+​(v,w),\textsc{NonAgreement}_{G^{+}}\left(v,u\right)<\textsc{NonAgreement}_{G^{+}}\left(v,w\right), (3)

    implies uu appears before ww.

  3. 3.

    for each node uβˆˆβ„•β€‹π”Έβ€‹π•†β€‹(v)u\in\mathbb{NAO}\left(v\right), its Ξ΅\varepsilon-agreement distance is also stored with that node.

The NonAgreement node ordering of the graph GG is defined as ℕ​𝔸​𝕆​(G)={(v,ℕ​𝔸​𝕆​(v))|v∈V}\mathbb{NAO}\left(G\right)=\left\{(v,\mathbb{NAO}\left(v\right))|v\in V\right\}.

In other words, the ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is a sorted array of neighboring nodes of vv in G+G^{+} in their Ξ΅\varepsilon-agreement distance value. An example NonAgreement node ordering for all vertices in a sample graph is illustrated in Figure 1.

Refer to caption
Figure 1: An illustrative example of ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) for an example graph.

The space and construction time complexities of the ℕ​𝔸​𝕆​(G)\mathbb{NAO}(G) are investigated in the next two lemmas.

Lemma 2.

The NonAgreement node ordering for a graph GG, ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right), can be represented in π’ͺ​(m)\mathcal{O}\left(m\right) memory.

Proof.

The number of nodes inside ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) equals to the degG+⁑(v)+1\deg_{G^{+}}(v)+1, for all vertices in NG+​(v)N_{G^{+}}(v) as well as the vertex vv itself. Note that it is not required to explicitly store the vertex vv itself in the ordering. Cumulatively, the total size required for representing ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is 2Γ—m2\times m entries. ∎

Lemma 3.

The time complexity to construct the ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) for all vertices v∈Vv\in V is π’ͺ​(mΓ—(α​(G)+lg⁑m))\mathcal{O}\left(m\times\left(\alpha(G)+\lg m\right)\right) where α​(G)\alpha(G) is the arboricity of the graph GG.

Proof.

To compute the value of NonAgreementG+​(u,v)\textsc{NonAgreement}_{G^{+}}\left(u,v\right) for all edges e={u,v}∈E+e=\left\{u,v\right\}\in E^{+}, one needs to compute |NG+​(u)​Δ​NG+​(v)|\left|N_{G^{+}}(u)\Delta N_{G^{+}}(v)\right|. This requires π’ͺ​(degG+⁑(u)+degG+⁑(v))\mathcal{O}\left(\deg_{G^{+}}(u)+\deg_{G^{+}}(v)\right) operations to compute the symmetric difference, given the adjacency lists of uu and vv are sorted. However, we can compute their intersection and use it to compute the NonAgreement as follows

NonAgreementG+​(u,v)=degG+⁑(u)+degG+⁑(v)βˆ’2Γ—|NG+​(u)∩NG+​(v)|max⁑{NG+​(u),NG+​(v)}.\textsc{NonAgreement}_{G^{+}}\left(u,v\right)=\frac{\deg_{G^{+}}(u)+\deg_{G^{+}}(v)-2\times\left|N_{G^{+}}(u)\cap N_{G^{+}}(v)\right|}{\max\left\{N_{G^{+}}(u),N_{G^{+}}(v)\right\}}. (4)

Hence, the total time required to compute the values of NonAgreement is equal to

βˆ‘{u,v}∈E+min⁑{degG+⁑(u),degG+⁑(v)},\sum_{\left\{u,v\right\}\in E^{+}}\min\left\{\deg_{G^{+}}(u),\deg_{G^{+}}(v)\right\},

which is known to be bounded by 2​α​(G)Γ—m2\alpha(G)\times m (Lemma 1). Moreover, each ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is of length 1+degG+⁑(v)1+\deg_{G^{+}}(v) which requires sorting by the value of their NonAgreement in π’ͺ​(degG+⁑(v)​lg⁑(degG+⁑(v)))\mathcal{O}\left(\deg_{G^{+}}(v)\lg\left(\deg_{G^{+}}(v)\right)\right), which accumulates to π’ͺ​(m​lg⁑m)\mathcal{O}\left(m\lg m\right). ∎

The correlation clustering corresponding to a given value of Ξ΅\varepsilon and a graph GG is the set of connected components of the graph G+~\widetilde{G^{+}}. Given access to ℕ​𝔸​𝕆​(G)={ℕ​𝔸​𝕆​(v)|v∈V​(G)}\mathbb{NAO}\left(G\right)=\left\{\mathbb{NAO}\left(v\right)|v\in V(G)\right\}, one can respond to the following queries: (1) Is the vertex vv an Ξ΅\varepsilon-heavy vertex or an Ξ΅\varepsilon-light? (2) Are the endpoints of an edge {u,v}\left\{u,v\right\} in Ξ΅\varepsilon-agreement? As we will show in this section, both of these questions can be answered in π’ͺ​(1)\mathcal{O}\left(1\right) time using the ℕ​𝔸​𝕆\mathbb{NAO} structure. For a vertex v∈Vv\in V, its Ξ΅\varepsilon-light threshold is defined as 𝕃​𝕋​ℍΡ​(v)=βŒˆΞ΅β€‹degG+⁑(v)βŒ‰+1\mathbb{LTH}_{\varepsilon}\left(v\right)=\left\lceil\varepsilon\deg_{G^{+}}(v)\right\rceil+1.

Lemma 4.

For an Ξ΅\varepsilon and a vertex v∈Vv\in V, vv is Ξ΅\varepsilon-heavy if and only if the Ξ΅\varepsilon-agreement distance of the 𝕃​𝕋​ℍΡ​(v)\mathbb{LTH}_{\varepsilon}\left(v\right)-th smallest vertex in ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is less than Ξ΅\varepsilon. Otherwise, it is Ξ΅\varepsilon-light.

Proof.

The vertices uu and vv would be in Ξ΅\varepsilon-agreement if and only if NonAgreementG+​(u,v)<Ξ΅\textsc{NonAgreement}_{G^{+}}\left(u,v\right)<\varepsilon. Whenever the Ξ΅\varepsilon-agreement distance of the 𝕃​𝕋​ℍΡ​(v)\mathbb{LTH}_{\varepsilon}\left(v\right)-th smallest vertex in ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is less than Ξ΅\varepsilon, it means that there are at least βŒˆΞ΅β€‹degG+⁑(v)βŒ‰+1\left\lceil\varepsilon\deg_{G^{+}}(v)\right\rceil+1 vertices, including the vv itself, in Ξ΅\varepsilon-agreement with vv. This is equivalent the Ξ΅\varepsilon-heavyness of the vertex vv. On the other hand, if the Ξ΅\varepsilon-agreement distance of the 𝕃​𝕋​ℍΡ​(v)\mathbb{LTH}_{\varepsilon}\left(v\right)-th smallest vertex in ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is greater than or equal to Ξ΅\varepsilon, this is equivalent to that the number of vertices in Ξ΅\varepsilon-agreement with vv is less than Ρ×degG+⁑(v)\varepsilon\times\deg_{G^{+}}(v), i.e. vv is Ξ΅\varepsilon-light. ∎

Lemma 5.

Given access to ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right), for all values of Ξ΅\varepsilon and any vertex v∈Vv\in V, identifying the Ξ΅\varepsilon-lightness of vv can be accomplished in π’ͺ​(1)\mathcal{O}\left(1\right).

Proof.

This is implied by Lemma 4, assuming that the ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) is implemented as an array. ∎

Given access to the index ℕ​𝔸​𝕆​(G)={(v,ℕ​𝔸​𝕆​(v))|v∈V}\mathbb{NAO}\left(G\right)=\left\{(v,\mathbb{NAO}\left(v\right))|v\in V\right\} for a graph GG, a query simply asks for a clustering of the graph for a given value of Ξ΅\varepsilon.

Theorem 2.

Given access to the index ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right), computing a clustering for a given parameter Ξ΅\varepsilon can be accomplished in π’ͺ​(m​(a​(n)+1))\mathcal{O}\left(m\left(a(n)+1\right)\right) amortized time, where a​(n)a(n) is the slowly growing inverse of the single-valued Ackermann’s function.

Proof.

Intuitively, we are going to use ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right) to construct the graph G+~\widetilde{G^{+}} and compute its connected components incrementally. More formally, we start by putting each vertex to its isolated set in a disjoin-union structure. Then, we identify Ξ΅\varepsilon-heavy vertices β„‹βŠ†V\mathcal{H}\subseteq V, which takes π’ͺ​(m)\mathcal{O}\left(m\right). Next, consider ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) for vβˆˆβ„‹v\in\mathcal{H}. For each vertex uβˆˆβ„•β€‹π”Έβ€‹π•†β€‹(v)u\in\mathbb{NAO}\left(v\right) whose key is smaller than Ξ΅\varepsilon and uβˆˆβ„‹u\in\mathcal{H}, we call Union​(u,v)\textsc{Union}(u,v), which merges the cluster which uu and vv belong. These operations takes π’ͺ​(mΓ—a​(n))\mathcal{O}\left(m\times a(n)\right) amortized time [5] using a Disjoint-Set data structure. The correctness of the query algorithm is implied by the Definition 1 and the Lemmas 4 and 5. ∎

Before going any further, we need to state the following results, which intuitively investigate the behavior of the Ξ΅\varepsilon-agreement and Ξ΅\varepsilon-light of edges and vertices through the β„°\mathcal{E}.Let

β„°={NonAgreementG+​(u,v)|{u,v}∈E+}={Ξ΅0,Ξ΅1,…,Ξ΅β„“βˆ’1},\mathcal{E}=\left\{\textsc{NonAgreement}_{G^{+}}\left(u,v\right)|\left\{u,v\right\}\in E^{+}\right\}=\left\{\varepsilon_{0},\varepsilon_{1},\ldots,\varepsilon_{\ell-1}\right\}, (5)

where Ξ΅iβˆ’1<Ξ΅i\varepsilon_{i-1}<\varepsilon_{i} for i=1,…,β„“βˆ’1i=1,\ldots,\ell-1. Moreover, assume that Ξ΅β„“>max⁑ℰ\varepsilon_{\ell}>\max\mathcal{E}.

Observation 2.

For any Ρ≀Ρ0\varepsilon\leq\varepsilon_{0}, the endpoints of all positive signed edges {u,v}\left\{u,v\right\} are not in Ξ΅\varepsilon-agreement.

Observation 3.

For any Ξ΅>Ξ΅β„“βˆ’1\varepsilon>\varepsilon_{\ell-1}, the endpoints of all positive signed edges {u,v}\left\{u,v\right\} are in Ξ΅\varepsilon-agreement.

By the Observation 2, the correlation clustering output by the Algorithm 1 for all values of Ρ≀Ρ0\varepsilon\leq\varepsilon_{0} would be the collection of singleton vertices, i.e. {{v}|v∈V}\left\{\left\{v\right\}|v\in V\right\}. However, we cannot say that for Ξ΅>Ξ΅e​l​lβˆ’1\varepsilon>\varepsilon_{ell-1}, the output is a single cluster, i.e. the set VV. Why is that? As Ξ΅\varepsilon increases, the number of edges in Ξ΅\varepsilon-agreement would increase, however, the number of Ξ΅\varepsilon-light vertices would increase, too (By Corollary 1, which follows soon). Hence, there is a trade-off for choosing a suitable value of Ξ΅\varepsilon to get the minimum number of clusters. We would discuss this issue further in Section 4 (Experiments).

Theorem 3.

Let Ξ΅<Ξ΅β€²\varepsilon<\varepsilon^{\prime} and u,v∈Vu,v\in V be two distinct vertices. If uu and vv are in Ξ΅\varepsilon-agreement, then they would be in Ξ΅β€²\varepsilon^{\prime}-agreement, too. Also, if uu and vv are not in Ξ΅β€²\varepsilon^{\prime}-agreement, then they would not be in Ξ΅\varepsilon-agreement.

Proof.

The proof is a direct implication of the NonAgreement definition. Given that uu and vv are in Ξ΅\varepsilon-agreement, we have NonAgreementG+​(u,v)<Ξ΅<Ξ΅β€²\textsc{NonAgreement}_{G^{+}}\left(u,v\right)<\varepsilon<\varepsilon^{\prime}, i.e. uu and vv are in Ξ΅β€²\varepsilon^{\prime}-agreement, too. Similarly, given that uu and vv are not in Ξ΅β€²\varepsilon^{\prime}-agreement, we have NonAgreementG+​(u,v)β‰₯Ξ΅β€²>Ξ΅\textsc{NonAgreement}_{G^{+}}\left(u,v\right)\geq\varepsilon^{\prime}>\varepsilon, i.e. uu and vv are not in Ξ΅\varepsilon-agreement, too. ∎

Theorem 4.

Let Ξ΅<Ξ΅β€²\varepsilon<\varepsilon^{\prime} and u∈Vu\in V. If uu is Ξ΅\varepsilon-light, then uu would be Ξ΅β€²\varepsilon^{\prime}-light, too. Also, if uu is Ξ΅β€²\varepsilon^{\prime}-heavy, then it would be Ξ΅\varepsilon-heavy.

Proof.

This is implied by the definition of Ρ\varepsilon-lightness and Theorem 3. ∎

Two other important cases are not considered in Theorem 4, i.e. (1) Is it possible that a vertex uu be Ξ΅\varepsilon-heavy, but becomes Ξ΅β€²\varepsilon^{\prime}-light for some values Ξ΅<Ξ΅β€²\varepsilon<\varepsilon^{\prime}?, and (2) Is it possible that a Ξ΅\varepsilon-light vertex becomes Ξ΅β€²\varepsilon^{\prime}-heavy for some values of Ξ΅<Ξ΅β€²\varepsilon<\varepsilon^{\prime}? The answer to both of these questions is affirmative. By Theorems 3 and 4 and the previous discussion, we can state the following corollary.

Corollary 1.

Let Ξ΅<Ξ΅β€²\varepsilon<\varepsilon^{\prime}.

  1. 1.

    The number of positively signed edges whose endpoints are in Ξ΅β€²\varepsilon^{\prime}-agreement is greater than or equal to the number of positive edges whose endpoints are in Ξ΅\varepsilon-agreement. In other words, the agreement relation is monotone.

  2. 2.

    The number of vertices which are Ξ΅β€²\varepsilon^{\prime}-light can be either greater or less than or even equal to the number of Ξ΅\varepsilon-light vertices. In other words, the lightness relation is not necessarily monotone.

The Baseline idea is to recompute the NonAgreement for each new value of Ξ΅\varepsilon, which takes π’ͺ​(m×α​(G))\mathcal{O}\left(m\times\alpha(G)\right), as is described in the proof of Lemma 2, deciding on the Ξ΅\varepsilon-heaviness of the vertices in GG in time π’ͺ​(n)\mathcal{O}\left(n\right) and computing the connected components of the graph G+~\widetilde{G^{+}} as the output in time π’ͺ​(m+n)\mathcal{O}\left(m+n\right). Totally, the time complexity of the Baseline is π’ͺ​(mΓ—(2+α​(G))+n)\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right).

To conclude, using the index structure we invest π’ͺ​(m)\mathcal{O}\left(m\right) space (Lemma 2) and π’ͺ​(mΓ—(α​(G)+lg⁑m))\mathcal{O}\left(m\times\left(\alpha(G)+\lg m\right)\right) (Lemma 3) time to construct the ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right) which make it possible to answer each query with varying Ξ΅\varepsilon values in time π’ͺ​(m​(a​(n)+1))\mathcal{O}\left(m(a(n)+1)\right) (Theorem 2). Comparing this with π’ͺ​(mΓ—(2+α​(G))+n)\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right) time for the Baseline reveals that our index-based structure makes query times faster for variable values of Ξ΅\varepsilon.

Given access to NAO, the algorithm 2 constructs the graph G+~\widetilde{G^{+}}. Note that the output of the algorithms 1 and 2 are the same. As answering whether a vertex vv is Ξ΅\varepsilon-heavy or the endpoints of an edge {u,v}\left\{u,v\right\} are in Ξ΅\varepsilon-agreement can be answered in π’ͺ​(1)\mathcal{O}\left(1\right) time, the running-time of the algorithm 2 would be π’ͺ​(max⁑{|V|,|E+|})\mathcal{O}\left(\max\left\{|V|,|E^{+}|\right\}\right) for the for loop and π’ͺ​(|VG+~|+|EG+~+|)\mathcal{O}\left(|V_{\widetilde{G^{+}}}|+|E^{+}_{\widetilde{G^{+}}}|\right) to compute the connected components of G+~\widetilde{G^{+}}. Totally, the running-time of the query algorithm in the ICC would be π’ͺ​(m+n)\mathcal{O}\left(m+n\right), compared to the π’ͺ​(mΓ—(2+α​(G))+n)\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right) time for the CC.

To model queries over a graph for various values of Ξ΅\varepsilon, we use a Ξ΅\varepsilon-Schedule defined as

Ρ​-Schedule={Ξ΅0<Ξ΅1<β‹―<Ξ΅β„“}βŠ†[0,2].\varepsilon\textsc{-Schedule}=\left\{\varepsilon_{0}<\varepsilon_{1}<\cdots<\varepsilon_{\ell}\right\}\subseteq[0,2]. (6)

Note that we consider Ρ​-Schedule\varepsilon\textsc{-Schedule} as a set of strictly increasing real numbers, just for the sake of simplicity in notation. In a real life scenario, any time one can ask for a clustering with any value of Ξ΅\varepsilon.

Algorithm 2 The index-based correlation clustering algorithm with ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right) structure.
1:procedureΒ IndexCorrelationClustering(G,ℕ​𝔸​𝕆​(G),Ξ΅G,\mathbb{NAO}\left(G\right),\varepsilon)
2:     for all v∈Vv\in V do
3:Β Β Β Β Β Β Β Β Β Use the ℕ​𝔸​𝕆​(()​v)\mathbb{NAO}\left((\right)v) to identify and remove all the edges {v,u}\left\{v,u\right\} which are in Ξ΅\varepsilon-NonAgreement
4:Β Β Β Β Β Β Β Β Β Use the ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) to identify and remove edges {u,v}\left\{u,v\right\} where uu and vv are both Ξ΅\varepsilon-light
5:Β Β Β Β Β endΒ for
6:Β Β Β Β Β return The connected components of the remaining graph as the clustering output.
7:endΒ procedure

3.2 Maintaining the index

The index structure introduced in Section 3.1 is used to compute a clustering of a static graph for user-defined dynamic values of Ξ΅\varepsilon. In this section, we revise this structure to make it suitable for computing a clustering of a dynamic graph for dynamic values of Ξ΅\varepsilon.

There are three different operations applicable to the underlying graph of the CorrelationClustering: (1) flipping the sign of an edge e={u,v}e=\left\{u,v\right\}, (2) adding a new vertex vv, and (3) removing an existing vertex vv. These operations are considered in Lemmas 6, 7, and 8, respectively.

Shakiba in [12] has shown that flipping the sign of an edge {u,v}\left\{u,v\right\}, the Ξ΅\varepsilon-agreement of edges whose both endpoints are not in the union of the positive neighborhood of its endpoints does not change (Propositions 2 and 4 in [12]). Therefore, we just need to compute the Ξ΅\varepsilon-agreement for positive edges {x,w}\left\{x,w\right\} where x,w∈(NG+​(u)βˆͺNG+​(v))x,w\in\left(N_{G^{+}}(u)\cup N_{G^{+}}(v)\right).

Lemma 6.

Assuming the sign of an edge e={u,v}e=\left\{u,v\right\} is flipped.

  1. 1.

    Assuming the sign of edge was ++ prior to the flipping, then uu and vv would be no more in Ξ΅\varepsilon-agreement since their edge is now negatively signed. The vertices vv and uu are removed from the arrays ℕ​𝔸​𝕆​(u)\mathbb{NAO}\left(u\right) and ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right), respectively. Moreover, we need to update the values of the non-agreement for vertices uu and vv in ℕ​𝔸​𝕆​(w)\mathbb{NAO}\left(w\right) for all vertices wβˆˆβ„•β€‹π”Έβ€‹π•†β€‹(u)βˆͺℕ​𝔸​𝕆​(v)w\in\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right).

  2. 2.

    Assuming the sign of edge was βˆ’- prior to the flipping, then their NonAgreementG+​(u,v)\textsc{NonAgreement}_{G^{+}}\left(u,v\right) is computed and the vertices vv and uu are added to proper sorted place in ℕ​𝔸​𝕆​(u)\mathbb{NAO}\left(u\right) and ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right), respectively. Moreover, we need to recompute the NonAgreementG+​(x,w)\textsc{NonAgreement}_{G^{+}}\left(x,w\right) for all x,wβˆˆβ„•β€‹π”Έβ€‹π•†β€‹(u)βˆͺℕ​𝔸​𝕆​(v)x,w\in\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right) and update their Ξ΅\varepsilon-agreement values.

Applying these changes is applicable in π’ͺ​(βˆ‘{x,w}∈E+∩XΓ—X(log⁑degG+⁑(x)+log⁑degG+⁑(w)))\mathcal{O}\left(\sum_{\left\{x,w\right\}\in E^{+}\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right) where X=ℕ​𝔸​𝕆​(u)βˆͺℕ​𝔸​𝕆​(v)X=\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right).

Proof.

The correctness of this lemma follows from the discussion just before it. Therefore, we would just give the time analysis. In the first case, finding the vertices in each ℕ​𝔸​𝕆\mathbb{NAO} is of π’ͺ​(log⁑degG+⁑(u)+log⁑degG+⁑(v))\mathcal{O}\left(\log\deg_{G^{+}}(u)+\log\deg_{G^{+}}(v)\right). The removal would be π’ͺ​(1)\mathcal{O}\left(1\right) amortized time.

The second case, i.e. flipping the sign from βˆ’- to ++, would cost more, as it may require changing all the positive edges with both endpoints the in the set XX. In the worst case, one may assume that all the Ξ΅\varepsilon-agreement between the edges with both endpoints inside XX is changed. Therefore, we need to update each of them on their corresponding ℕ​𝔸​𝕆\mathbb{NAO} indices. We need to add vertices vv and uu to the ℕ​𝔸​𝕆​(u)\mathbb{NAO}(u) and ℕ​𝔸​𝕆​(v)\mathbb{NAO}(v), respectively, based on NonAgreementG+​(u,v)\textsc{NonAgreement}_{G^{+}}\left(u,v\right). Finding their correct position inside sorted ℕ​𝔸​𝕆\mathbb{NAO}s is possible with π’ͺ​(log⁑degG+⁑(u)+log⁑degG+⁑(v))\mathcal{O}\left(\log\deg_{G^{+}}(u)+\log\deg_{G^{+}}(v)\right) comparisons. Again, adding them at their corresponding position would take π’ͺ​(1)\mathcal{O}\left(1\right) amortized time. For the other positive edges with both endpoints in the set XX, such as {x,w}\left\{x,w\right\}, we might need to update their Ξ΅\varepsilon-agreement. Without loss of generality, assume that degG+⁑(x)≀degG+⁑(w)\deg_{G^{+}}(x)\leq\deg_{G^{+}}(w). Doing so requires re-computation of NonAgreementG+​(x,w)\textsc{NonAgreement}_{G^{+}}\left(x,w\right), in π’ͺ​(min⁑{degG+⁑(x),degG+⁑(w)})\mathcal{O}\left(\min\left\{\deg_{G^{+}}(x),\deg_{G^{+}}(w)\right\}\right), querying the NonAgreement inside ℕ​𝔸​𝕆​(x)\mathbb{NAO}\left(x\right), in time π’ͺ​(log⁑degG+⁑(x))\mathcal{O}\left(\log\deg_{G^{+}}(x)\right), and then possible updating its position, in both ℕ​𝔸​𝕆​(x)\mathbb{NAO}\left(x\right) and ℕ​𝔸​𝕆​(w)\mathbb{NAO}\left(w\right) requires π’ͺ​(log⁑degG+⁑(x)+log⁑degG+⁑(w))\mathcal{O}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right). In the worst case, all the elements in XX may need update. Therefore, this case can be accomplished in

π’ͺ​(βˆ‘{x,w}∈E+∩XΓ—X(log⁑degG+⁑(x)+log⁑degG+⁑(w))),\mathcal{O}\left(\sum_{\left\{x,w\right\}\in E^{+}\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right),

in the worst-case. ∎

Using Lemma 6, we can state the NAO-FlipEdge​(u,v)\textsc{NAO-FlipEdge}(u,v) algorithm (Algorithm 3) which flips the sign of the edge {u,v}\left\{u,v\right\}. The correctness and the running-time of this algorithm follows directly from Lemma 6.

Algorithm 3 The NAO-FlipEdge​(u,v)\textsc{NAO-FlipEdge}(u,v) algorithm to apply the effect of flipping the sign of the edge (u,v)(u,v) in ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right).
1:procedureΒ NAO-FlipEdge(u,vu,v)
2:Β Β Β Β Β Let π’œβ†β„•β€‹π”Έβ€‹π•†β€‹(u)βˆͺℕ​𝔸​𝕆​(v)\mathcal{A}\leftarrow\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right).
3:Β Β Β Β Β ifΒ Sign of edge {u,v}\left\{u,v\right\} is positiveΒ then
4:Β Β Β Β Β Β Β Β Β Update the graph G+G^{+} by removing edge {u,v}\left\{u,v\right\}.
5:Β Β Β Β Β Β Β Β Β Remove the vertices uu and vv from ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) and ℕ​𝔸​𝕆​(u)\mathbb{NAO}\left(u\right), respectively.
6:Β Β Β Β Β elseβ–·\triangleright Sign of edge {u,v}\left\{u,v\right\} is negative
7:Β Β Β Β Β Β Β Β Β Update the graph G+G^{+} by adding edge {u,v}\left\{u,v\right\}.
8:Β Β Β Β Β Β Β Β Β Compute NonAgreementG+​(u,v)\textsc{NonAgreement}_{G^{+}}\left(u,v\right).
9:Β Β Β Β Β Β Β Β Β Add the vertices uu and vv to ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) and ℕ​𝔸​𝕆​(u)\mathbb{NAO}\left(u\right), respectively.
10:Β Β Β Β Β endΒ if
11:Β Β Β Β Β for allΒ wβˆˆπ’œw\in\mathcal{A}Β do
12:Β Β Β Β Β Β Β Β Β Recompute NonAgreementG+​(u,w)\textsc{NonAgreement}_{G^{+}}\left(u,w\right) and update ℕ​𝔸​𝕆​(u)\mathbb{NAO}\left(u\right) and ℕ​𝔸​𝕆​(w)\mathbb{NAO}\left(w\right).
13:Β Β Β Β Β Β Β Β Β Recompute NonAgreementG+​(v,w)\textsc{NonAgreement}_{G^{+}}\left(v,w\right) and update ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) and ℕ​𝔸​𝕆​(w)\mathbb{NAO}\left(w\right).
14:Β Β Β Β Β endΒ for
15:endΒ procedure

There are cases where a set of positive edges Ev+E^{+}_{v} are also given for a newly added vertex vv. One can first add the vertex with all negative edges, and afterwards, flip all the edges in Ev+E^{+}_{v}, so they would become positively signed. However, a batch operation would give us higher performance in practice.

Lemma 7.

Assume that a vertex vv is added to graph GG with new positive signed edges Ev+E^{+}_{v}. Then,

  1. 1.

    If Ev+=βˆ…E^{+}_{v}=\emptyset, then ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right) does not need any updates and we just add a new ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) to ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right) with vv as its only element in constant-time.

  2. 2.

    Otherwise, let X=NG+​(v)βˆͺ(βˆͺx∈NG+​(v)NG+​(x))X=N_{G^{+}}(v)\cup\left(\cup_{x\in N_{G^{+}}(v)}N_{G^{+}}(x)\right). We need to compute the NonAgreementG+​(x,w)\textsc{NonAgreement}_{G^{+}}\left(x,w\right) for each positive edges {x,w}\left\{x,w\right\} with x,w∈Xx,w\in X after adding all the new positive edges in Ev+E^{+}_{v}, and possibly update ℕ​𝔸​𝕆​(x)\mathbb{NAO}\left(x\right) and ℕ​𝔸​𝕆​(w)\mathbb{NAO}\left(w\right). This would require

    π’ͺ​(βˆ‘{x,w}∈(E+βˆͺEv+)∩XΓ—X(log⁑degG+⁑(x)+log⁑degG+⁑(w))),\mathcal{O}\left(\sum_{\left\{x,w\right\}\in(E^{+}\cup E^{+}_{v})\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right), (7)

    operations in the worst case.

Proof.

The correctness would follow immediately from the Lemma 6. Again, the first case can be accomplished in π’ͺ​(1)\mathcal{O}\left(1\right) time. For the second case, we need to calculate |X|\left|X\right| values of NonAgreement, and add them or update them in their corresponding ℕ​𝔸​𝕆\mathbb{NAO}s. This would require

π’ͺ​(βˆ‘{x,w}∈(E+βˆͺEv+)∩XΓ—X(log⁑degG+⁑(x)+log⁑degG+⁑(w))),\mathcal{O}\left(\sum_{\left\{x,w\right\}\in(E^{+}\cup E^{+}_{v})\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right), (8)

by exactly the same discussion as in Lemma 6. ∎

Remark 1.

In comparing the time required for Lemmas 6 and 7, please note that the size of the set XX can be much larger in Lemma 7 than the one in Lemma 6, depending on the size of Ev+E^{+}_{v} for the new vertex vv.

The NAO-AddVertex​(x,Nx)\textsc{NAO-AddVertex}(x,N_{x}) algorithm (Algorithm 4) adds a new vertex xx to G+G^{+} and all its neighboring positively signed edges NxN_{x}, and updates the ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right). The correctness and the running-time of this algorithm follows directly from Lemma 7.

Algorithm 4 The NAO-AddVertex​(x,Nx)\textsc{NAO-AddVertex}(x,N_{x}) algorithm to add a new vertex xx and all its positive neighbors NxN_{x} to ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right).
1:procedureΒ NAO-AddVertex(x,Nxx,N_{x})
2:Β Β Β Β Β Update graph G+G^{+} by adding the new vertex xx.
3:Β Β Β Β Β Construct a new ℕ​𝔸​𝕆​(x)\mathbb{NAO}\left(x\right) and add it to ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right).
4:Β Β Β Β Β Let X←Nxβˆͺ(βˆͺz∈NxNG+​(z))X\leftarrow N_{x}\cup\left(\cup_{z\in N_{x}}N_{G^{+}}(z)\right).
5:     for all y∈Xy\in X do
6:Β Β Β Β Β Β Β Β Β for allΒ z∈Xβˆ–yz\in X\setminus yΒ do
7:Β Β Β Β Β Β Β Β Β Β Β Β Β Β Update ℕ​𝔸​𝕆​(y)\mathbb{NAO}\left(y\right) and ℕ​𝔸​𝕆​(z)\mathbb{NAO}\left(z\right) if the NonAgreementG+​(y,z)\textsc{NonAgreement}_{G^{+}}\left(y,z\right) changes.
8:Β Β Β Β Β Β Β Β Β endΒ for
9:Β Β Β Β Β endΒ for
10:endΒ procedure

For the last operation, removing an existing vertex vv with a set of positive adjacent edges Ev+E^{+}_{v}, can be accomplished by first flipping the sign of all of its adjacent edges in Ev+E^{+}_{v} from ++ to βˆ’-, and afterwards, removing its ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) from ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right). Similar to vertex addition, we can do it also in batch-mode, hoping for better performance in practice. The algorithm for the NAO-RemoveVertex​(x)\textsc{NAO-RemoveVertex}(x) is similar to the NAO-AddVertex​(x,Nx)\textsc{NAO-AddVertex}(x,N_{x}) algorithm (Algorithm 4).

Lemma 8.

Assume that an existing vertex vv is removed from the graph GG with the set of adjacent positive signed edges Ev+E^{+}_{v}. Then,

  1. 1.

    If Ev+=βˆ…E^{+}_{v}=\emptyset, then ℕ​𝔸​𝕆​(v)\mathbb{NAO}\left(v\right) has a single element, itself, and can be easily removed from ℕ​𝔸​𝕆​(G)\mathbb{NAO}\left(G\right). Nothing else would require a change, so it is of constant-time complexity.

  2. 2.

    Otherwise, let X=NG+​(v)βˆͺ(βˆͺx∈NG+​(v)NG+​(x))X=N_{G^{+}}(v)\cup\left(\cup_{x\in N_{G^{+}}(v)}N_{G^{+}}(x)\right). We need to compute the NonAgreementG+​(x,w)\textsc{NonAgreement}_{G^{+}}\left(x,w\right) for each positive edges {x,w}\left\{x,w\right\} with x,w∈Xx,w\in X after removing all the edges in Ev+E^{+}_{v}, and possibly update ℕ​𝔸​𝕆​(x)\mathbb{NAO}\left(x\right) and ℕ​𝔸​𝕆​(w)\mathbb{NAO}\left(w\right). In the worst-case, this would require

    π’ͺ​(βˆ‘{x,w}∈(E+βˆͺEv+)∩XΓ—X(log⁑degG+⁑(x)+log⁑degG+⁑(w))),\mathcal{O}\left(\sum_{\left\{x,w\right\}\in(E^{+}\cup E^{+}_{v})\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right), (9)

    operations.

Proof.

The correctness of this lemma is implied by the Lemma 6. The discussion of the running-time is exactly the same as in the proof of Lemma 7. ∎

4 Experiments

To evaluate the proposed method, we used 7 graphs which are formed by user-user interactions. These datasets are described in Table 1 and are accessible through the SNAP [8]. The smallest dataset consists of 22 470 nodes with 171 002 edges and the largest one consists of 317 080 nodes with 1 049 866 edges. In all these graphs, we consider the existing edges as positively signed and non-existing edges as negatively signed.

The distribution of the NonAgreements for each dataset is illustrated in Figures 2(a) to 8(a). The distribution of NonAgreements of the edges in almost all the datasets obeys the normal distribution, except small imperfections in Arxiv ASTRO-PH (Figure 2(a)), Arxiv COND-MAT (Figure 4(a)), and DBLP (Figure 8(a)). Moreover, more detailed statistics on these distributions are given in Tables 2. One single observation is that the most frequent value of the NonAgreement in all the sample datasets is 11. Why is that? Without loss of generality, assume that degG⁑(u)β‰₯degG⁑(v)\deg_{G}(u)\geq\deg_{G}(v). Then, |NG​(v)|=2​|NG​(u)∩NG​(v)|\left|N_{G}(v)\right|=2\left|N_{G}(u)\cap N_{G}(v)\right|. By the assumption |NG​(u)|β‰₯2​|NG​(u)∩NG​(v)|\left|N_{G}(u)\right|\geq 2\left|N_{G}(u)\cap N_{G}(v)\right|, i.e. the intersection of the neighborhood of vertices uu and vv consists of at most half of the neighborhood of uu. Also, exactly half of the neighborhood of vv falls at the intersection of the neighborhood of uu and vv. Intuitively, the vertex uu has clustering preference with extra vertices at least the number of vertices which both uu and vv have clustering preference. Similarly, the vertex vv has clustering preference with exactly half extra vertices which both uu and vv have clustering preference.

For each dataset, we have used a different set of Ξ΅\varepsilon-Schedules, depending on the distribution of their NonAgreements. More precisely: (1) we have sorted the values of non-agreements in each dataset in a non-decreasing order, with repetitions. (2) Then, we have selected 2121 distinct values equally spaces of these values. (3) The Ξ΅\varepsilon-Schedule was set to the selected values in the second step, which appended the value of 0 at the beginning and the value of 1.991.99 to the end if either does not exists. Totally, the number of Ξ΅\varepsilon-Schedule for each dataset is either 2121, 2222 or 2323.

An interesting observation is the number of clusters for Ξ΅β‰ˆ1βˆ’\varepsilon\approx 1^{-} in Figures 2(b) to 8(b). Note that we use 1βˆ’1^{-} to denote the interval [1βˆ’Ο΅,1][1-\epsilon,1] for some non-zero constant Ο΅>0\epsilon>0. When Ξ΅β‰ˆ1\varepsilon\approx 1 but less than 11, the number of clusters is minimum. As it gets closer to 11, the number of clusters increases with a much greater descent than the decrease in the number of clusters as it gets close to 11 from 0.

As in Corollary 1, by increasing the Ξ΅\varepsilon, the number of vertices in Ξ΅\varepsilon-agreement is a non-decreasing function, which is confirmed by the plots in Figures 2(a) to 8(a) as the number of vertices in Ξ΅\varepsilon-non-agreement is given by a non-increasing function. By closer visual inspection of these figures, we can see that the shape of the plot for the number of Ξ΅\varepsilon-non-agreement vertices in all these graphs is almost the same, with inflection point around the value of Ξ΅β‰ˆ1\varepsilon\approx 1. This is due to the intrinsic nature of the NonAgreementG​(u,v)\textsc{NonAgreement}_{G}\left(u,v\right).

Similarly, the non-monotonicity result stated in Corollary 1 is observed in the same figures for the number of Ξ΅\varepsilon-light vertices. By a visual inspection, the trend of the number of Ξ΅\varepsilon-light vertices for almost all datasets, except for the Arxiv HEP-TH (Figure 5(a)) and the EU-Email (Figure 7(a)), is the same: the number of Ξ΅\varepsilon-light vertices increases as Ξ΅\varepsilon increases up to some point (first interval), then decreases slightly (second interval), and finally increases and would be asymptotically equal to the number of vertices in the graphs (last interval). For Arxiv HEP-TH and EU-Email, we have the same trend, however, the second interval is very small.

Table 1: Description of the datasets.
Dataset Nodes Edges
Arxiv ASTRO-PH [7] 18 772 198 110
MUSAE-Facebook [11] 22 470 171 002
Arxiv COND-MAT [7] 23 133 93 497
Arxiv HEP-TH [6] 27 770 352 807
Enron-Email [9] 36 692 183 831
EU-Email [7] 265 214 420 045
DBLP [13] 317 080 1 049 866
Table 2: Statistics of NonAgreement in each dataset.
Dataset Distinct Minimum Maximum Top 2 frequent values
Arxiv ASTRO-PH 16 436 0.015 873 0 1.967 21 1 β€” 9 664 0.5 β€” 2 992
MUSAE-Facebook 12 988 0.031 746 1.978 02 1 β€” 18 534 1.25 β€” 2 346
Arxiv COND-MAT 3 893 0.044 444 4 1.964 91 1 β€” 12 018 0.5 β€” 4 954
Arxiv HEP-TH 23 285 0.086 956 5 1.976 19 1 β€” 24 852 1.333 33 β€” 4 910
Enron-Email 20 273 0.090 909 1 1.954 55 1 β€” 31 704 0.5 β€” 6 796
EU-Email 28 612 0.117 647 1.990 52 1 β€” 441 520 0.997 658 β€” 682
DBLP 11 611 0.013 793 1 1.981 13 1 β€” 167 590 0.5 β€” 67 238

All the algorithms for the naive and the proposed index-based correlation clustering algorithms are implemented in C++111https://github.com/alishakiba/Correlation-Clustering-Algorithm-for-Dynamic-Complete-Signed-Graphs-An-Index-based-Approach without any parallelization and the experiments are done using an Ubuntu 22.04.1 TLS with an Intel Core i7-10510U CPU @ 1.80GHz with 12 GB of RAM. The time for running the naive correlation clustering algorithm (Algorithm 1), denoted here as CC, as well as the time for the index-based correlation clustering algorithm denoted as ICC, is given in Figures 2(d) to 8(d)). Note that the time reported for the ICC in these figures does not include the required time for constructing the ℕ​𝔸​𝕆\mathbb{NAO}s, as they are constructed once and used throughout the Ξ΅\varepsilon-Schedule. The running-time to read the graph as well as constructing the ℕ​𝔸​𝕆\mathbb{NAO}s is reported in Table 3 in milliseconds. The CC and ICC algorithms are the same, except that in CC, the non-agreement values of the edges and the Ξ΅\varepsilon-lightness of the vertices are computed for each given value of Ξ΅\varepsilon, however, in ICC these are computed and stored in the proposed ℕ​𝔸​𝕆\mathbb{NAO}s structure once and used for clustering with respect to different values of Ξ΅\varepsilon. As it can be observed in Figures 2(d) to 8(d), the running time for the ICC, excluding the time to construct the ℕ​𝔸​𝕆\mathbb{NAO}s for once, is largely smaller than the one for CC. On average, our approach for the described Ξ΅\varepsilon-Schedule lead to %25\%25 decrease in clustering time. This enhancement comes at the cost of pre-computing the ℕ​𝔸​𝕆\mathbb{NAO}s, which costs on average %34\%34 of the time for a single run of CC, which is quite small and makes the ICC efficient in cases where one requires to have multiple clustering for various values of Ξ΅\varepsilon.

Table 3: Time to construct the ℕ​𝔸​𝕆\mathbb{NAO} for each dataset in milliseconds.
Dataset Time to construct ℕ​𝔸​𝕆\mathbb{NAO} (ms) Time to read graph (ms)
Arxiv ASTRO-PH 1 677 543
MUSAE-Facebook 1 391 427
Arxiv COND-MAT 521 346
Arxiv HEP-TH 12 530 725
Enron-Email 2 498 525
EU-Email 4 479 958
DBLP 4 620 2 167
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 2: The graph Arxiv ASTRO-PH.
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 3: The graph MUSAE-Facebook.
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 4: The graph Arxiv COND-MAT.
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 5: The graph Arxiv HEP-TH.
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 6: The graph Enron-Email.
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 7: The graph EU-Email.
Refer to caption
(a) NonAgreements
Refer to caption
(b) Number of clusters
Refer to caption
(c) The number of NonAgreement and Ξ΅\varepsilon-light edges
Refer to caption
(d) Clustering time in milliseconds
Figure 8: The graph DBLP.

5 Conclusion

In this paper, we proposed a novel indexing structure to decrease the overall running-time of an approximation algorithm for the correlation clustering problem. This structure can be constructed in π’ͺ​(m×α​(G))\mathcal{O}\left(m\times\alpha(G)\right) time with π’ͺ​(m)\mathcal{O}\left(m\right) memory. Then, we can output a correlation clustering for any value of Ξ΅\varepsilon in π’ͺ​(m+n)\mathcal{O}\left(m+n\right), compared with π’ͺ​(mΓ—(2+α​(G))+n)\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right) time complexity of the ordinary correlation clustering algorithm. Moreover, the proposed index can be efficiently maintained during updates to the underlying graph, including edge sign flip, vertex addition and vertex deletion. The theoretical results are accompanied with practical results in the experiments using seven real world graphs. The experimental results show about %34 decrease in the running-time of queries.

A future research direction would be studying this algorithm in parallel frameworks such as Map-Reduce and make it scalable to very Big graphs. Another research direction would be enhancing the approximation guarantee of the algorithm, or devising more efficient algorithms in terms of approximation ratio.

References

  • [1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine learning, 56(1):89–113, 2004.
  • [2] Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal on computing, 14(1):210–223, 1985.
  • [3] Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, and Nikos Parotsidis. Online and consistent correlation clustering. In Kamalika Chaudhuri, Stefanie Jegelka, LeΒ Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 4157–4179. PMLR, 17–23 Jul 2022.
  • [4] Vincent Cohen-Addad, Silvio Lattanzi, Slobodan MitroviΔ‡, Ashkan Norouzi-Fard, Nikos Parotsidis, and Jakub Tarnawski. Correlation clustering in constant many parallel rounds. In International Conference on Machine Learning (ICML), pages 2069–2078. PMLR, 2021.
  • [5] ThomasΒ H Cormen, CharlesΒ E Leiserson, RonaldΒ L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2022.
  • [6] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, 2005.
  • [7] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.
  • [8] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • [9] Jure Leskovec, KevinΒ J Lang, Anirban Dasgupta, and MichaelΒ W Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.
  • [10] Claire Mathieu, Ocan Sankur, and Warren Schudy. Online correlation clustering. In 27th International Symposium on Theoretical Aspects of Computer Science-STACS 2010, pages 573–584, 2010.
  • [11] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding, 2019.
  • [12] Ali Shakiba. Online correlation clustering for dynamic complete signed graphs. arXiv preprint arXiv:2211.07000, 2022.
  • [13] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, pages 1–8, 2012.