Correlation Clustering Algorithm for Dynamic Complete Signed Graphs: An Index-based Approach

Ali Shakiba

(Department of Computer Science, Vali-e-Asr University of Rafsanjan, Rafsanjan, Iran.
[email protected];[email protected])

Abstract

In this paper, we reduce the complexity of approximating the correlation clustering problem from $\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right)$ to $\mathcal{O}\left(m+n\right)$ for any given value of $\varepsilon$ for a complete signed graph with $n$ vertices and $m$ positive edges where $\alpha(G)$ is the arboricity of the graph. Our approach gives the same output as the original algorithm and makes it possible to implement the algorithm in a full dynamic setting where edge sign flipping and vertex addition/removal are allowed. Constructing this index costs $\mathcal{O}\left(m\right)$ memory and $\mathcal{O}\left(m\times\alpha(G)\right)$ time. We also studied the structural properties of the non-agreement measure used in the approximation algorithm. The theoretical results are accompanied by a full set of experiments concerning seven real-world graphs. These results shows superiority of our index-based algorithm to the non-index one by a decrease of %34 in time on average.

Keywords: Correlation clustering $\cdot$ Dynamic graphs $\cdot$ Online Algorithms

1 Introduction

Clustering is one of the most studied problems in machine learning with various applications in analyzing and visualizing large datasets. There are various models and technique to obtain a partition of elements, such that elements belonging to different partitions are dissimilar to each other and the elements in the same partition are very similar to each other.

The problem of correlation clustering, introduced in [1], is known to be an NP-hard problem for the disagree minimization. Therefore, several different approximation solutions based on its IP formulation exist in the literature. Recently, the idea of a 2-approximation algorithm in [1] is extended in [4] for constructing a $\mathcal{O}\left(1\right)$ -approximation algorithm. The experiments in [4] show acceptable performance for this algorithm in practice, although its theoretical guarantee can be too high, e.g. $1\,442$ for $\beta=\lambda=\frac{1}{36}$ . In [3], this algorithm is extended to an online setting where just vertex additions are allowed, and whenever a new vertex is added, it reveals all its positively signed edges. Shakiba in [12] studied the effect of vertex addition/removal and edge sign flipping in the underlying graph to the final clustering result, in order to make the algorithm suitable for dynamic graphs. However, one bottleneck in this way is computing the values of NonAgreement among the edges and identifying the $\varepsilon$ -lightness of vertices. The current paper proposes a novel indexing scheme to remedy this and make the algorithm efficient, not just in terms of dynamic graphs, but for even dynamic hyper-parameter $\varepsilon$ . Our proposed method, in comparison with the online method of [3] is that we allow a full dynamic setting, i.e. vertex addition/removal and edge’s sign flipping. It is known that any online algorithm for the correlation clustering problem has at least $\Omega(n)$ -approximation ratio [10]. Note that the underlying algorithm used in the current paper is consistent, as is shown via experimental results [3].

The rest of the paper is organized as follows: In Section 1.1, we highlight our contributions. This is followed by a reminding some basic algorithms and results in Section 2. Then, we introduce the novel indexing structure in Section 3.1 and show how it can be employed to enhance the running-time of the approximate correlation clustering algorithm. Then, we show how to maintain the proposed indices in a full dynamic settings in Section 3.2. In Section 4, we give an extensive experiments which accompanies the theoretical results and show the effectiveness of the proposed indexing structure. Finally, a conclusion is drawn.

1.1 Our Contribution

In this paper, we simply ask

“How can one reduce the time to approximate a correlation clustering of the input graph [4] for varying values of $\varepsilon$ ?”

We also ask

“How can we make the solution to the first question an online solution for dynamic graphs?”

Our answer to the first question is devising a novel indexing-structure which is constructed based on the structural properties of the approximation algorithm and its NonAgreement measure. As our experiments in Section 4 show, the proposed method enhanced the total running-time of querying the clustering for about $\%34$ on average for seven real-world datasets. Then, we make this structure online to work with dynamic graphs based on theoretical results in [12]. The construction of the index itself is highly parallelizable, up to the number of the vertices in the input graph. The idea for parallelization is simple: construct each $\mathbb{NAO}\left(v\right)$ in the $\mathbb{NAO}\left(G\right)$ with a separate parallel thread. We also study the intrinsic structures in the NonAgreement measure, to bake more efficient algorithms for index-maintenance due to updates to the underlying graph. More precisely, we show that using the proposed index structure, we can find a correlation clustering for a graph for any given value of $\varepsilon$ in time $\mathcal{O}\left(m+n\right)$ , compared to the $\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right)$ time for the CC. The pre-processing time of the ICC would be $\mathcal{O}\left(m\times\alpha(G)\right)$ with $\mathcal{O}\left(m\right)$ space complexity.

2 Preliminaries

Let $G=(V,E)$ be a complete undirected signed graph with $\left|V\right|=n$ vertices. The set of edges $E$ is naturally partitioned into positive and negative signed edges, $E^{+}$ and $E^{-}$ , respectively. Then, we use $m$ to denote $\left|E^{+}\right|$ . The correlation clustering problem is defined as

\operatorname{cost}(\mathcal{C})=\sum_{\begin{subarray}{c}\left\{u,v\right\}\in E^{+}\\ u\in C_{i},v\in C_{j},i\neq j\end{subarray}}1+\sum_{\begin{subarray}{c}\left\{u,v\right\}\in E^{-}\\ u,v\in C_{i}\end{subarray}}1,

(1)

where $\mathcal{C}=\left\{C_{1},\ldots,C_{\ell}\right\}$ is a clustering. Note that this is the min-disagree variant of the problem.

The constant factor approximation algorithm of [4] is based on two main quantities: (1) $\varepsilon$ -agreement of a positively signed edge $\left\{u,v\right\}$ , i.e. $u$ and $v$ are in $\varepsilon$ -agreement if and only if $\textsc{NonAgreement}_{G}\left(u,v\right)=\frac{\left|N_{G}(u)\Delta N_{G}(v)\right|}{\max\left\{\left|N_{G}(u)\right|,\left|N_{G}(v)\right|\right\}}<\varepsilon$ , and (2) $\varepsilon$ -lightness, where a vertex $u$ is said to be $\varepsilon$ -light if $\frac{\textsc{AgreeCnt}_{G^{+}}(u)}{\left|N_{G^{+}}(u)\right|}<\varepsilon$ where $\textsc{AgreeCnt}_{G^{+}}(u)=\left|\left\{w\in V|u\text{ and }v\text{ are in }\varepsilon\text{-agreement}\right\}\right|$ . Note that a vertex which is not $\varepsilon$ -light is called $\varepsilon$ -heavy. This is a $2+\frac{4}{\varepsilon}+\frac{1}{\varepsilon^{2}}$ -approximation algorithm, as is shown in [3]. This algorithm is described in Algorithm 1, which we will refer to the CC algorithm, for short.

Algorithm 1 CorrelationClustering(

G

) [4]

1:procedure CorrelationClustering(

G,\varepsilon

)

2: Let

G^{+}=G[E^{+}]

where

E^{+}

is the set of edges whose sign is

+

3: Discard all edges whose endpoints are not in

\varepsilon

-agreement

4: Discard all edges between two

\varepsilon

-light vertices

5: Let

\widetilde{G^{+}}

be the sparsified graph

G^{+}

after performing previous two operations

6: Let

\mathcal{C}

be the collection of connected components in

\widetilde{G^{+}}

7: return

\mathcal{C}

as the output clustering

8:end procedure

Shakiba in [12] studied theoretical foundation of the CC algorithm in a full dynamic setting. The following result is a summary of Table 1, Corollary 1, and Theorem 4 in [12].

Theorem 1.

Suppose the sign of an edge $u=\left\{u,v\right\}$ is flipped. Then, the non-agreement and $\varepsilon$ -lightness of vertices other than the ones whose distance to either $u$ and $v$ is more than two would not change.

The arboricity of the graph $G$ is the minimum number of edge-disjoint spanning forests into which $G$ can be decomposed. The following lemma for arboricity is useful in bounding the number of operations.

Lemma 1.

Lemma 2 in [2] Suppose the graph $G=(V,E)$ has $n$ vertices with $m$ edges. Then,

\sum_{\left\{u,v\right\}\in E}\min\left\{\deg_{G}(u),\deg_{G}(v)\right\}\leq 2a(G)\times m.

(2)

3 Proposed Method

In this section, we describe our novel indexing structure. This structure allows dynamic queries of the correlation clustering with varying values of $\varepsilon$ for dynamic graphs. The proposed algorithm which uses the indexing structure would be called ICC, or indexed-based correlation clustering.

3.1 Indexing structure

For an edge $e=\left\{u,v\right\}$ with positive sign, we define its $\varepsilon$ -agreement distance as $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)$ . Intuitively, this is the infimum of the values $\varepsilon$ which the nodes $u$ and $v$ are not in $\varepsilon$ -agreement. Let define the set $\mathcal{E}=\left\{\textsc{NonAgreement}_{G^{+}}\left(u,v\right)|e=\left\{u,v\right\}\in E^{+}\right\}$ . Without loss of generality, let $\mathcal{E}=\left\{\varepsilon_{0},\ldots,\varepsilon_{\ell-1}\right\}$ with the ordering $\min\mathcal{E}=\varepsilon_{0}<\varepsilon_{1}<\cdots<\varepsilon_{\ell-1}=\max\mathcal{E}$ . For a fixed value of $\varepsilon$ , let $G^{+}_{\varepsilon}=(V,E^{+}_{\varepsilon})$ where $E^{+}_{\varepsilon}=\left\{e=\left\{u,v\right\}\in E^{+}|\textsc{NonAgreement}_{G^{+}}\left(u,v\right)<\varepsilon\right\}$ .

Observation 1.

For all $\varepsilon\leq\varepsilon_{0}$ , $G^{+}_{\varepsilon}$ is the null graph, i.e. a graph on all nodes without any edges. Moreover, for all $\varepsilon>\varepsilon_{\ell}$ , $G^{+}_{\varepsilon}=G^{+}$ .

Next, we introduce the key ingredient to our indexing structure, called $\mathbb{NAO}$ .

Definition 1 (NonAgreement Node Ordering).

The $\varepsilon$ -agreement ordering for each node $v\in V$ , denoted by $\mathbb{NAO}\left(v\right)$ , is defined as an ordered subset of vertices in $G$ where:

1.

node $u\in V$ appears in the ordering $\mathbb{NAO}\left(v\right)$ if and only if $e=\left\{u,v\right\}$ is a positive edge in $G$ .
2.

for each two distinct vertices $u,w\in V$ which appear in $\mathbb{NAO}\left(v\right)$ ,

$\textsc{NonAgreement}_{G^{+}}\left(v,u\right)<\textsc{NonAgreement}_{G^{+}}\left(v,w\right),$ (3)

implies $u$ appears before $w$ .
3.

for each node $u\in\mathbb{NAO}\left(v\right)$ , its $\varepsilon$ -agreement distance is also stored with that node.

The NonAgreement node ordering of the graph $G$ is defined as $\mathbb{NAO}\left(G\right)=\left\{(v,\mathbb{NAO}\left(v\right))|v\in V\right\}$ .

In other words, the $\mathbb{NAO}\left(v\right)$ is a sorted array of neighboring nodes of $v$ in $G^{+}$ in their $\varepsilon$ -agreement distance value. An example NonAgreement node ordering for all vertices in a sample graph is illustrated in Figure 1.

Refer to caption — Figure 1: An illustrative example of $\mathbb{NAO}\left(v\right)$ for an example graph.

The space and construction time complexities of the $\mathbb{NAO}(G)$ are investigated in the next two lemmas.

Lemma 2.

The NonAgreement node ordering for a graph $G$ , $\mathbb{NAO}\left(G\right)$ , can be represented in $\mathcal{O}\left(m\right)$ memory.

Proof.

The number of nodes inside $\mathbb{NAO}\left(v\right)$ equals to the $\deg_{G^{+}}(v)+1$ , for all vertices in $N_{G^{+}}(v)$ as well as the vertex $v$ itself. Note that it is not required to explicitly store the vertex $v$ itself in the ordering. Cumulatively, the total size required for representing $\mathbb{NAO}\left(v\right)$ is $2\times m$ entries. ∎

Lemma 3.

The time complexity to construct the $\mathbb{NAO}\left(v\right)$ for all vertices $v\in V$ is $\mathcal{O}\left(m\times\left(\alpha(G)+\lg m\right)\right)$ where $\alpha(G)$ is the arboricity of the graph $G$ .

Proof.

To compute the value of $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)$ for all edges $e=\left\{u,v\right\}\in E^{+}$ , one needs to compute $\left|N_{G^{+}}(u)\Delta N_{G^{+}}(v)\right|$ . This requires $\mathcal{O}\left(\deg_{G^{+}}(u)+\deg_{G^{+}}(v)\right)$ operations to compute the symmetric difference, given the adjacency lists of $u$ and $v$ are sorted. However, we can compute their intersection and use it to compute the NonAgreement as follows

\textsc{NonAgreement}_{G^{+}}\left(u,v\right)=\frac{\deg_{G^{+}}(u)+\deg_{G^{+}}(v)-2\times\left|N_{G^{+}}(u)\cap N_{G^{+}}(v)\right|}{\max\left\{N_{G^{+}}(u),N_{G^{+}}(v)\right\}}.

(4)

Hence, the total time required to compute the values of NonAgreement is equal to

\sum_{\left\{u,v\right\}\in E^{+}}\min\left\{\deg_{G^{+}}(u),\deg_{G^{+}}(v)\right\},

which is known to be bounded by $2\alpha(G)\times m$ (Lemma 1). Moreover, each $\mathbb{NAO}\left(v\right)$ is of length $1+\deg_{G^{+}}(v)$ which requires sorting by the value of their NonAgreement in $\mathcal{O}\left(\deg_{G^{+}}(v)\lg\left(\deg_{G^{+}}(v)\right)\right)$ , which accumulates to $\mathcal{O}\left(m\lg m\right)$ . ∎

The correlation clustering corresponding to a given value of $\varepsilon$ and a graph $G$ is the set of connected components of the graph $\widetilde{G^{+}}$ . Given access to $\mathbb{NAO}\left(G\right)=\left\{\mathbb{NAO}\left(v\right)|v\in V(G)\right\}$ , one can respond to the following queries: (1) Is the vertex $v$ an $\varepsilon$ -heavy vertex or an $\varepsilon$ -light? (2) Are the endpoints of an edge $\left\{u,v\right\}$ in $\varepsilon$ -agreement? As we will show in this section, both of these questions can be answered in $\mathcal{O}\left(1\right)$ time using the $\mathbb{NAO}$ structure. For a vertex $v\in V$ , its $\varepsilon$ -light threshold is defined as $\mathbb{LTH}_{\varepsilon}\left(v\right)=\left\lceil\varepsilon\deg_{G^{+}}(v)\right\rceil+1$ .

Lemma 4.

For an $\varepsilon$ and a vertex $v\in V$ , $v$ is $\varepsilon$ -heavy if and only if the $\varepsilon$ -agreement distance of the $\mathbb{LTH}_{\varepsilon}\left(v\right)$ -th smallest vertex in $\mathbb{NAO}\left(v\right)$ is less than $\varepsilon$ . Otherwise, it is $\varepsilon$ -light.

Proof.

The vertices $u$ and $v$ would be in $\varepsilon$ -agreement if and only if $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)<\varepsilon$ . Whenever the $\varepsilon$ -agreement distance of the $\mathbb{LTH}_{\varepsilon}\left(v\right)$ -th smallest vertex in $\mathbb{NAO}\left(v\right)$ is less than $\varepsilon$ , it means that there are at least $\left\lceil\varepsilon\deg_{G^{+}}(v)\right\rceil+1$ vertices, including the $v$ itself, in $\varepsilon$ -agreement with $v$ . This is equivalent the $\varepsilon$ -heavyness of the vertex $v$ . On the other hand, if the $\varepsilon$ -agreement distance of the $\mathbb{LTH}_{\varepsilon}\left(v\right)$ -th smallest vertex in $\mathbb{NAO}\left(v\right)$ is greater than or equal to $\varepsilon$ , this is equivalent to that the number of vertices in $\varepsilon$ -agreement with $v$ is less than $\varepsilon\times\deg_{G^{+}}(v)$ , i.e. $v$ is $\varepsilon$ -light. ∎

Lemma 5.

Given access to $\mathbb{NAO}\left(v\right)$ , for all values of $\varepsilon$ and any vertex $v\in V$ , identifying the $\varepsilon$ -lightness of $v$ can be accomplished in $\mathcal{O}\left(1\right)$ .

Proof.

This is implied by Lemma 4, assuming that the $\mathbb{NAO}\left(v\right)$ is implemented as an array. ∎

Given access to the index $\mathbb{NAO}\left(G\right)=\left\{(v,\mathbb{NAO}\left(v\right))|v\in V\right\}$ for a graph $G$ , a query simply asks for a clustering of the graph for a given value of $\varepsilon$ .

Theorem 2.

Given access to the index $\mathbb{NAO}\left(G\right)$ , computing a clustering for a given parameter $\varepsilon$ can be accomplished in $\mathcal{O}\left(m\left(a(n)+1\right)\right)$ amortized time, where $a(n)$ is the slowly growing inverse of the single-valued Ackermann’s function.

Proof.

Intuitively, we are going to use $\mathbb{NAO}\left(G\right)$ to construct the graph $\widetilde{G^{+}}$ and compute its connected components incrementally. More formally, we start by putting each vertex to its isolated set in a disjoin-union structure. Then, we identify $\varepsilon$ -heavy vertices $\mathcal{H}\subseteq V$ , which takes $\mathcal{O}\left(m\right)$ . Next, consider $\mathbb{NAO}\left(v\right)$ for $v\in\mathcal{H}$ . For each vertex $u\in\mathbb{NAO}\left(v\right)$ whose key is smaller than $\varepsilon$ and $u\in\mathcal{H}$ , we call $\textsc{Union}(u,v)$ , which merges the cluster which $u$ and $v$ belong. These operations takes $\mathcal{O}\left(m\times a(n)\right)$ amortized time [5] using a Disjoint-Set data structure. The correctness of the query algorithm is implied by the Definition 1 and the Lemmas 4 and 5. ∎

Before going any further, we need to state the following results, which intuitively investigate the behavior of the $\varepsilon$ -agreement and $\varepsilon$ -light of edges and vertices through the $\mathcal{E}$ .Let

\mathcal{E}=\left\{\textsc{NonAgreement}_{G^{+}}\left(u,v\right)|\left\{u,v\right\}\in E^{+}\right\}=\left\{\varepsilon_{0},\varepsilon_{1},\ldots,\varepsilon_{\ell-1}\right\},

(5)

where $\varepsilon_{i-1}<\varepsilon_{i}$ for $i=1,\ldots,\ell-1$ . Moreover, assume that $\varepsilon_{\ell}>\max\mathcal{E}$ .

Observation 2.

For any $\varepsilon\leq\varepsilon_{0}$ , the endpoints of all positive signed edges $\left\{u,v\right\}$ are not in $\varepsilon$ -agreement.

Observation 3.

For any $\varepsilon>\varepsilon_{\ell-1}$ , the endpoints of all positive signed edges $\left\{u,v\right\}$ are in $\varepsilon$ -agreement.

By the Observation 2, the correlation clustering output by the Algorithm 1 for all values of $\varepsilon\leq\varepsilon_{0}$ would be the collection of singleton vertices, i.e. $\left\{\left\{v\right\}|v\in V\right\}$ . However, we cannot say that for $\varepsilon>\varepsilon_{ell-1}$ , the output is a single cluster, i.e. the set $V$ . Why is that? As $\varepsilon$ increases, the number of edges in $\varepsilon$ -agreement would increase, however, the number of $\varepsilon$ -light vertices would increase, too (By Corollary 1, which follows soon). Hence, there is a trade-off for choosing a suitable value of $\varepsilon$ to get the minimum number of clusters. We would discuss this issue further in Section 4 (Experiments).

Theorem 3.

Let $\varepsilon<\varepsilon^{\prime}$ and $u,v\in V$ be two distinct vertices. If $u$ and $v$ are in $\varepsilon$ -agreement, then they would be in $\varepsilon^{\prime}$ -agreement, too. Also, if $u$ and $v$ are not in $\varepsilon^{\prime}$ -agreement, then they would not be in $\varepsilon$ -agreement.

Proof.

The proof is a direct implication of the NonAgreement definition. Given that $u$ and $v$ are in $\varepsilon$ -agreement, we have $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)<\varepsilon<\varepsilon^{\prime}$ , i.e. $u$ and $v$ are in $\varepsilon^{\prime}$ -agreement, too. Similarly, given that $u$ and $v$ are not in $\varepsilon^{\prime}$ -agreement, we have $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)\geq\varepsilon^{\prime}>\varepsilon$ , i.e. $u$ and $v$ are not in $\varepsilon$ -agreement, too. ∎

Theorem 4.

Let $\varepsilon<\varepsilon^{\prime}$ and $u\in V$ . If $u$ is $\varepsilon$ -light, then $u$ would be $\varepsilon^{\prime}$ -light, too. Also, if $u$ is $\varepsilon^{\prime}$ -heavy, then it would be $\varepsilon$ -heavy.

Proof.

This is implied by the definition of $\varepsilon$ -lightness and Theorem 3. ∎

Two other important cases are not considered in Theorem 4, i.e. (1) Is it possible that a vertex $u$ be $\varepsilon$ -heavy, but becomes $\varepsilon^{\prime}$ -light for some values $\varepsilon<\varepsilon^{\prime}$ ?, and (2) Is it possible that a $\varepsilon$ -light vertex becomes $\varepsilon^{\prime}$ -heavy for some values of $\varepsilon<\varepsilon^{\prime}$ ? The answer to both of these questions is affirmative. By Theorems 3 and 4 and the previous discussion, we can state the following corollary.

Corollary 1.

Let $\varepsilon<\varepsilon^{\prime}$ .

1.

The number of positively signed edges whose endpoints are in $\varepsilon^{\prime}$ -agreement is greater than or equal to the number of positive edges whose endpoints are in $\varepsilon$ -agreement. In other words, the agreement relation is monotone.
2.

The number of vertices which are $\varepsilon^{\prime}$ -light can be either greater or less than or even equal to the number of $\varepsilon$ -light vertices. In other words, the lightness relation is not necessarily monotone.

The Baseline idea is to recompute the NonAgreement for each new value of $\varepsilon$ , which takes $\mathcal{O}\left(m\times\alpha(G)\right)$ , as is described in the proof of Lemma 2, deciding on the $\varepsilon$ -heaviness of the vertices in $G$ in time $\mathcal{O}\left(n\right)$ and computing the connected components of the graph $\widetilde{G^{+}}$ as the output in time $\mathcal{O}\left(m+n\right)$ . Totally, the time complexity of the Baseline is $\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right)$ .

To conclude, using the index structure we invest $\mathcal{O}\left(m\right)$ space (Lemma 2) and $\mathcal{O}\left(m\times\left(\alpha(G)+\lg m\right)\right)$ (Lemma 3) time to construct the $\mathbb{NAO}\left(G\right)$ which make it possible to answer each query with varying $\varepsilon$ values in time $\mathcal{O}\left(m(a(n)+1)\right)$ (Theorem 2). Comparing this with $\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right)$ time for the Baseline reveals that our index-based structure makes query times faster for variable values of $\varepsilon$ .

Given access to NAO, the algorithm 2 constructs the graph $\widetilde{G^{+}}$ . Note that the output of the algorithms 1 and 2 are the same. As answering whether a vertex $v$ is $\varepsilon$ -heavy or the endpoints of an edge $\left\{u,v\right\}$ are in $\varepsilon$ -agreement can be answered in $\mathcal{O}\left(1\right)$ time, the running-time of the algorithm 2 would be $\mathcal{O}\left(\max\left\{|V|,|E^{+}|\right\}\right)$ for the for loop and $\mathcal{O}\left(|V_{\widetilde{G^{+}}}|+|E^{+}_{\widetilde{G^{+}}}|\right)$ to compute the connected components of $\widetilde{G^{+}}$ . Totally, the running-time of the query algorithm in the ICC would be $\mathcal{O}\left(m+n\right)$ , compared to the $\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right)$ time for the CC.

To model queries over a graph for various values of $\varepsilon$ , we use a $\varepsilon$ -Schedule defined as

\varepsilon\textsc{-Schedule}=\left\{\varepsilon_{0}<\varepsilon_{1}<\cdots<\varepsilon_{\ell}\right\}\subseteq[0,2].

(6)

Note that we consider $\varepsilon\textsc{-Schedule}$ as a set of strictly increasing real numbers, just for the sake of simplicity in notation. In a real life scenario, any time one can ask for a clustering with any value of $\varepsilon$ .

Algorithm 2 The index-based correlation clustering algorithm with

\mathbb{NAO}\left(G\right)

structure.

1:procedure IndexCorrelationClustering(

G,\mathbb{NAO}\left(G\right),\varepsilon

)

2: for all

v\in V

3: Use the

\mathbb{NAO}\left((\right)v)

to identify and remove all the edges

\left\{v,u\right\}

which are in

\varepsilon

-NonAgreement

4: Use the

\mathbb{NAO}\left(v\right)

to identify and remove edges

\left\{u,v\right\}

where

u

and

v

are both

\varepsilon

-light

5: end for

6: return The connected components of the remaining graph as the clustering output.

7:end procedure

3.2 Maintaining the index

The index structure introduced in Section 3.1 is used to compute a clustering of a static graph for user-defined dynamic values of $\varepsilon$ . In this section, we revise this structure to make it suitable for computing a clustering of a dynamic graph for dynamic values of $\varepsilon$ .

There are three different operations applicable to the underlying graph of the CorrelationClustering: (1) flipping the sign of an edge $e=\left\{u,v\right\}$ , (2) adding a new vertex $v$ , and (3) removing an existing vertex $v$ . These operations are considered in Lemmas 6, 7, and 8, respectively.

Shakiba in [12] has shown that flipping the sign of an edge $\left\{u,v\right\}$ , the $\varepsilon$ -agreement of edges whose both endpoints are not in the union of the positive neighborhood of its endpoints does not change (Propositions 2 and 4 in [12]). Therefore, we just need to compute the $\varepsilon$ -agreement for positive edges $\left\{x,w\right\}$ where $x,w\in\left(N_{G^{+}}(u)\cup N_{G^{+}}(v)\right)$ .

Lemma 6.

Assuming the sign of an edge $e=\left\{u,v\right\}$ is flipped.

1.

Assuming the sign of edge was $+$ prior to the flipping, then $u$ and $v$ would be no more in $\varepsilon$ -agreement since their edge is now negatively signed. The vertices $v$ and $u$ are removed from the arrays $\mathbb{NAO}\left(u\right)$ and $\mathbb{NAO}\left(v\right)$ , respectively. Moreover, we need to update the values of the non-agreement for vertices $u$ and $v$ in $\mathbb{NAO}\left(w\right)$ for all vertices $w\in\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right)$ .
2.

Assuming the sign of edge was $-$ prior to the flipping, then their $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)$ is computed and the vertices $v$ and $u$ are added to proper sorted place in $\mathbb{NAO}\left(u\right)$ and $\mathbb{NAO}\left(v\right)$ , respectively. Moreover, we need to recompute the $\textsc{NonAgreement}_{G^{+}}\left(x,w\right)$ for all $x,w\in\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right)$ and update their $\varepsilon$ -agreement values.

Applying these changes is applicable in $\mathcal{O}\left(\sum_{\left\{x,w\right\}\in E^{+}\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right)$ where $X=\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right)$ .

Proof.

The correctness of this lemma follows from the discussion just before it. Therefore, we would just give the time analysis. In the first case, finding the vertices in each $\mathbb{NAO}$ is of $\mathcal{O}\left(\log\deg_{G^{+}}(u)+\log\deg_{G^{+}}(v)\right)$ . The removal would be $\mathcal{O}\left(1\right)$ amortized time.

The second case, i.e. flipping the sign from $-$ to $+$ , would cost more, as it may require changing all the positive edges with both endpoints the in the set $X$ . In the worst case, one may assume that all the $\varepsilon$ -agreement between the edges with both endpoints inside $X$ is changed. Therefore, we need to update each of them on their corresponding $\mathbb{NAO}$ indices. We need to add vertices $v$ and $u$ to the $\mathbb{NAO}(u)$ and $\mathbb{NAO}(v)$ , respectively, based on $\textsc{NonAgreement}_{G^{+}}\left(u,v\right)$ . Finding their correct position inside sorted $\mathbb{NAO}$ s is possible with $\mathcal{O}\left(\log\deg_{G^{+}}(u)+\log\deg_{G^{+}}(v)\right)$ comparisons. Again, adding them at their corresponding position would take $\mathcal{O}\left(1\right)$ amortized time. For the other positive edges with both endpoints in the set $X$ , such as $\left\{x,w\right\}$ , we might need to update their $\varepsilon$ -agreement. Without loss of generality, assume that $\deg_{G^{+}}(x)\leq\deg_{G^{+}}(w)$ . Doing so requires re-computation of $\textsc{NonAgreement}_{G^{+}}\left(x,w\right)$ , in $\mathcal{O}\left(\min\left\{\deg_{G^{+}}(x),\deg_{G^{+}}(w)\right\}\right)$ , querying the NonAgreement inside $\mathbb{NAO}\left(x\right)$ , in time $\mathcal{O}\left(\log\deg_{G^{+}}(x)\right)$ , and then possible updating its position, in both $\mathbb{NAO}\left(x\right)$ and $\mathbb{NAO}\left(w\right)$ requires $\mathcal{O}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)$ . In the worst case, all the elements in $X$ may need update. Therefore, this case can be accomplished in

\mathcal{O}\left(\sum_{\left\{x,w\right\}\in E^{+}\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right),

in the worst-case. ∎

Using Lemma 6, we can state the $\textsc{NAO-FlipEdge}(u,v)$ algorithm (Algorithm 3) which flips the sign of the edge $\left\{u,v\right\}$ . The correctness and the running-time of this algorithm follows directly from Lemma 6.

Algorithm 3 The

\textsc{NAO-FlipEdge}(u,v)

algorithm to apply the effect of flipping the sign of the edge

(u,v)

\mathbb{NAO}\left(G\right)

1:procedure NAO-FlipEdge(

u,v

)

2: Let

\mathcal{A}\leftarrow\mathbb{NAO}\left(u\right)\cup\mathbb{NAO}\left(v\right)

3: if Sign of edge

\left\{u,v\right\}

is positive then

4: Update the graph

G^{+}

by removing edge

\left\{u,v\right\}

5: Remove the vertices

u

and

v

from

\mathbb{NAO}\left(v\right)

and

\mathbb{NAO}\left(u\right)

, respectively.

6: else

\triangleright

Sign of edge

\left\{u,v\right\}

is negative

7: Update the graph

G^{+}

by adding edge

\left\{u,v\right\}

8: Compute

\textsc{NonAgreement}_{G^{+}}\left(u,v\right)

9: Add the vertices

u

and

v

\mathbb{NAO}\left(v\right)

and

\mathbb{NAO}\left(u\right)

, respectively.

10: end if

11: for all

w\in\mathcal{A}

12: Recompute

\textsc{NonAgreement}_{G^{+}}\left(u,w\right)

and update

\mathbb{NAO}\left(u\right)

and

\mathbb{NAO}\left(w\right)

13: Recompute

\textsc{NonAgreement}_{G^{+}}\left(v,w\right)

and update

\mathbb{NAO}\left(v\right)

and

\mathbb{NAO}\left(w\right)

14: end for

15:end procedure

There are cases where a set of positive edges $E^{+}_{v}$ are also given for a newly added vertex $v$ . One can first add the vertex with all negative edges, and afterwards, flip all the edges in $E^{+}_{v}$ , so they would become positively signed. However, a batch operation would give us higher performance in practice.

Lemma 7.

Assume that a vertex $v$ is added to graph $G$ with new positive signed edges $E^{+}_{v}$ . Then,

1.

If $E^{+}_{v}=\emptyset$ , then $\mathbb{NAO}\left(G\right)$ does not need any updates and we just add a new $\mathbb{NAO}\left(v\right)$ to $\mathbb{NAO}\left(G\right)$ with $v$ as its only element in constant-time.

Otherwise, let $X=N_{G^{+}}(v)\cup\left(\cup_{x\in N_{G^{+}}(v)}N_{G^{+}}(x)\right)$ . We need to compute the $\textsc{NonAgreement}_{G^{+}}\left(x,w\right)$ for each positive edges $\left\{x,w\right\}$ with $x,w\in X$ after adding all the new positive edges in $E^{+}_{v}$ , and possibly update $\mathbb{NAO}\left(x\right)$ and $\mathbb{NAO}\left(w\right)$ . This would require

\mathcal{O}\left(\sum_{\left\{x,w\right\}\in(E^{+}\cup E^{+}_{v})\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right),

(7)

operations in the worst case.

Proof.

The correctness would follow immediately from the Lemma 6. Again, the first case can be accomplished in $\mathcal{O}\left(1\right)$ time. For the second case, we need to calculate $\left|X\right|$ values of NonAgreement, and add them or update them in their corresponding $\mathbb{NAO}$ s. This would require

\mathcal{O}\left(\sum_{\left\{x,w\right\}\in(E^{+}\cup E^{+}_{v})\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right),

(8)

by exactly the same discussion as in Lemma 6. ∎

Remark 1.

In comparing the time required for Lemmas 6 and 7, please note that the size of the set $X$ can be much larger in Lemma 7 than the one in Lemma 6, depending on the size of $E^{+}_{v}$ for the new vertex $v$ .

The $\textsc{NAO-AddVertex}(x,N_{x})$ algorithm (Algorithm 4) adds a new vertex $x$ to $G^{+}$ and all its neighboring positively signed edges $N_{x}$ , and updates the $\mathbb{NAO}\left(G\right)$ . The correctness and the running-time of this algorithm follows directly from Lemma 7.

Algorithm 4 The

\textsc{NAO-AddVertex}(x,N_{x})

algorithm to add a new vertex

x

and all its positive neighbors

N_{x}

\mathbb{NAO}\left(G\right)

1:procedure NAO-AddVertex(

x,N_{x}

)

2: Update graph

G^{+}

by adding the new vertex

x

3: Construct a new

\mathbb{NAO}\left(x\right)

and add it to

\mathbb{NAO}\left(G\right)

4: Let

X\leftarrow N_{x}\cup\left(\cup_{z\in N_{x}}N_{G^{+}}(z)\right)

5: for all

y\in X

6: for all

z\in X\setminus y

7: Update

\mathbb{NAO}\left(y\right)

and

\mathbb{NAO}\left(z\right)

if the

\textsc{NonAgreement}_{G^{+}}\left(y,z\right)

changes.

8: end for

9: end for

10:end procedure

For the last operation, removing an existing vertex $v$ with a set of positive adjacent edges $E^{+}_{v}$ , can be accomplished by first flipping the sign of all of its adjacent edges in $E^{+}_{v}$ from $+$ to $-$ , and afterwards, removing its $\mathbb{NAO}\left(v\right)$ from $\mathbb{NAO}\left(G\right)$ . Similar to vertex addition, we can do it also in batch-mode, hoping for better performance in practice. The algorithm for the $\textsc{NAO-RemoveVertex}(x)$ is similar to the $\textsc{NAO-AddVertex}(x,N_{x})$ algorithm (Algorithm 4).

Lemma 8.

Assume that an existing vertex $v$ is removed from the graph $G$ with the set of adjacent positive signed edges $E^{+}_{v}$ . Then,

1.

If $E^{+}_{v}=\emptyset$ , then $\mathbb{NAO}\left(v\right)$ has a single element, itself, and can be easily removed from $\mathbb{NAO}\left(G\right)$ . Nothing else would require a change, so it is of constant-time complexity.

Otherwise, let $X=N_{G^{+}}(v)\cup\left(\cup_{x\in N_{G^{+}}(v)}N_{G^{+}}(x)\right)$ . We need to compute the $\textsc{NonAgreement}_{G^{+}}\left(x,w\right)$ for each positive edges $\left\{x,w\right\}$ with $x,w\in X$ after removing all the edges in $E^{+}_{v}$ , and possibly update $\mathbb{NAO}\left(x\right)$ and $\mathbb{NAO}\left(w\right)$ . In the worst-case, this would require

\mathcal{O}\left(\sum_{\left\{x,w\right\}\in(E^{+}\cup E^{+}_{v})\cap X\times X}\left(\log\deg_{G^{+}}(x)+\log\deg_{G^{+}}(w)\right)\right),

(9)

operations.

Proof.

The correctness of this lemma is implied by the Lemma 6. The discussion of the running-time is exactly the same as in the proof of Lemma 7. ∎

4 Experiments

To evaluate the proposed method, we used 7 graphs which are formed by user-user interactions. These datasets are described in Table 1 and are accessible through the SNAP [8]. The smallest dataset consists of 22 470 nodes with 171 002 edges and the largest one consists of 317 080 nodes with 1 049 866 edges. In all these graphs, we consider the existing edges as positively signed and non-existing edges as negatively signed.

The distribution of the NonAgreements for each dataset is illustrated in Figures 2(a) to 8(a). The distribution of NonAgreements of the edges in almost all the datasets obeys the normal distribution, except small imperfections in Arxiv ASTRO-PH (Figure 2(a)), Arxiv COND-MAT (Figure 4(a)), and DBLP (Figure 8(a)). Moreover, more detailed statistics on these distributions are given in Tables 2. One single observation is that the most frequent value of the NonAgreement in all the sample datasets is $1$ . Why is that? Without loss of generality, assume that $\deg_{G}(u)\geq\deg_{G}(v)$ . Then, $\left|N_{G}(v)\right|=2\left|N_{G}(u)\cap N_{G}(v)\right|$ . By the assumption $\left|N_{G}(u)\right|\geq 2\left|N_{G}(u)\cap N_{G}(v)\right|$ , i.e. the intersection of the neighborhood of vertices $u$ and $v$ consists of at most half of the neighborhood of $u$ . Also, exactly half of the neighborhood of $v$ falls at the intersection of the neighborhood of $u$ and $v$ . Intuitively, the vertex $u$ has clustering preference with extra vertices at least the number of vertices which both $u$ and $v$ have clustering preference. Similarly, the vertex $v$ has clustering preference with exactly half extra vertices which both $u$ and $v$ have clustering preference.

For each dataset, we have used a different set of $\varepsilon$ -Schedules, depending on the distribution of their NonAgreements. More precisely: (1) we have sorted the values of non-agreements in each dataset in a non-decreasing order, with repetitions. (2) Then, we have selected $21$ distinct values equally spaces of these values. (3) The $\varepsilon$ -Schedule was set to the selected values in the second step, which appended the value of $0$ at the beginning and the value of $1.99$ to the end if either does not exists. Totally, the number of $\varepsilon$ -Schedule for each dataset is either $21$ , $22$ or $23$ .

An interesting observation is the number of clusters for $\varepsilon\approx 1^{-}$ in Figures 2(b) to 8(b). Note that we use $1^{-}$ to denote the interval $[1-\epsilon,1]$ for some non-zero constant $\epsilon>0$ . When $\varepsilon\approx 1$ but less than $1$ , the number of clusters is minimum. As it gets closer to $1$ , the number of clusters increases with a much greater descent than the decrease in the number of clusters as it gets close to $1$ from $0$ .

As in Corollary 1, by increasing the $\varepsilon$ , the number of vertices in $\varepsilon$ -agreement is a non-decreasing function, which is confirmed by the plots in Figures 2(a) to 8(a) as the number of vertices in $\varepsilon$ -non-agreement is given by a non-increasing function. By closer visual inspection of these figures, we can see that the shape of the plot for the number of $\varepsilon$ -non-agreement vertices in all these graphs is almost the same, with inflection point around the value of $\varepsilon\approx 1$ . This is due to the intrinsic nature of the $\textsc{NonAgreement}_{G}\left(u,v\right)$ .

Similarly, the non-monotonicity result stated in Corollary 1 is observed in the same figures for the number of $\varepsilon$ -light vertices. By a visual inspection, the trend of the number of $\varepsilon$ -light vertices for almost all datasets, except for the Arxiv HEP-TH (Figure 5(a)) and the EU-Email (Figure 7(a)), is the same: the number of $\varepsilon$ -light vertices increases as $\varepsilon$ increases up to some point (first interval), then decreases slightly (second interval), and finally increases and would be asymptotically equal to the number of vertices in the graphs (last interval). For Arxiv HEP-TH and EU-Email, we have the same trend, however, the second interval is very small.

Table 1: Description of the datasets.

Dataset	Nodes	Edges
Arxiv ASTRO-PH [7]	18 772	198 110
MUSAE-Facebook [11]	22 470	171 002
Arxiv COND-MAT [7]	23 133	93 497
Arxiv HEP-TH [6]	27 770	352 807
Enron-Email [9]	36 692	183 831
EU-Email [7]	265 214	420 045
DBLP [13]	317 080	1 049 866

Table 2: Statistics of NonAgreement in each dataset.

Dataset	Distinct	Minimum	Maximum	Top 2 frequent values
Arxiv ASTRO-PH	16 436	0.015 873 0	1.967 21	1 — 9 664	0.5 — 2 992
MUSAE-Facebook	12 988	0.031 746	1.978 02	1 — 18 534	1.25 — 2 346
Arxiv COND-MAT	3 893	0.044 444 4	1.964 91	1 — 12 018	0.5 — 4 954
Arxiv HEP-TH	23 285	0.086 956 5	1.976 19	1 — 24 852	1.333 33 — 4 910
Enron-Email	20 273	0.090 909 1	1.954 55	1 — 31 704	0.5 — 6 796
EU-Email	28 612	0.117 647	1.990 52	1 — 441 520	0.997 658 — 682
DBLP	11 611	0.013 793 1	1.981 13	1 — 167 590	0.5 — 67 238

All the algorithms for the naive and the proposed index-based correlation clustering algorithms are implemented in C++¹¹1https://github.com/alishakiba/Correlation-Clustering-Algorithm-for-Dynamic-Complete-Signed-Graphs-An-Index-based-Approach without any parallelization and the experiments are done using an Ubuntu 22.04.1 TLS with an Intel Core i7-10510U CPU @ 1.80GHz with 12 GB of RAM. The time for running the naive correlation clustering algorithm (Algorithm 1), denoted here as CC, as well as the time for the index-based correlation clustering algorithm denoted as ICC, is given in Figures 2(d) to 8(d)). Note that the time reported for the ICC in these figures does not include the required time for constructing the $\mathbb{NAO}$ s, as they are constructed once and used throughout the $\varepsilon$ -Schedule. The running-time to read the graph as well as constructing the $\mathbb{NAO}$ s is reported in Table 3 in milliseconds. The CC and ICC algorithms are the same, except that in CC, the non-agreement values of the edges and the $\varepsilon$ -lightness of the vertices are computed for each given value of $\varepsilon$ , however, in ICC these are computed and stored in the proposed $\mathbb{NAO}$ s structure once and used for clustering with respect to different values of $\varepsilon$ . As it can be observed in Figures 2(d) to 8(d), the running time for the ICC, excluding the time to construct the $\mathbb{NAO}$ s for once, is largely smaller than the one for CC. On average, our approach for the described $\varepsilon$ -Schedule lead to $\%25$ decrease in clustering time. This enhancement comes at the cost of pre-computing the $\mathbb{NAO}$ s, which costs on average $\%34$ of the time for a single run of CC, which is quite small and makes the ICC efficient in cases where one requires to have multiple clustering for various values of $\varepsilon$ .

Table 3: Time to construct the

\mathbb{NAO}

for each dataset in milliseconds.

Dataset	Time to construct $\mathbb{NAO}$ (ms)	Time to read graph (ms)
Arxiv ASTRO-PH	1 677	543
MUSAE-Facebook	1 391	427
Arxiv COND-MAT	521	346
Arxiv HEP-TH	12 530	725
Enron-Email	2 498	525
EU-Email	4 479	958
DBLP	4 620	2 167

5 Conclusion

In this paper, we proposed a novel indexing structure to decrease the overall running-time of an approximation algorithm for the correlation clustering problem. This structure can be constructed in $\mathcal{O}\left(m\times\alpha(G)\right)$ time with $\mathcal{O}\left(m\right)$ memory. Then, we can output a correlation clustering for any value of $\varepsilon$ in $\mathcal{O}\left(m+n\right)$ , compared with $\mathcal{O}\left(m\times\left(2+\alpha(G)\right)+n\right)$ time complexity of the ordinary correlation clustering algorithm. Moreover, the proposed index can be efficiently maintained during updates to the underlying graph, including edge sign flip, vertex addition and vertex deletion. The theoretical results are accompanied with practical results in the experiments using seven real world graphs. The experimental results show about %34 decrease in the running-time of queries.

A future research direction would be studying this algorithm in parallel frameworks such as Map-Reduce and make it scalable to very Big graphs. Another research direction would be enhancing the approximation guarantee of the algorithm, or devising more efficient algorithms in terms of approximation ratio.

References

[1] Nikhil Bansal, Avrim Blum, and Shuchi Chawla. Correlation clustering. Machine learning, 56(1):89–113, 2004.
[2] Norishige Chiba and Takao Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal on computing, 14(1):210–223, 1985.
[3] Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, and Nikos Parotsidis. Online and consistent correlation clustering. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 4157–4179. PMLR, 17–23 Jul 2022.
[4] Vincent Cohen-Addad, Silvio Lattanzi, Slobodan Mitrović, Ashkan Norouzi-Fard, Nikos Parotsidis, and Jakub Tarnawski. Correlation clustering in constant many parallel rounds. In International Conference on Machine Learning (ICML), pages 2069–2078. PMLR, 2021.
[5] Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, 2022.
[6] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187, 2005.
[7] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densification and shrinking diameters. ACM transactions on Knowledge Discovery from Data (TKDD), 1(1):2–es, 2007.
[8] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[9] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.
[10] Claire Mathieu, Ocan Sankur, and Warren Schudy. Online correlation clustering. In 27th International Symposium on Theoretical Aspects of Computer Science-STACS 2010, pages 573–584, 2010.
[11] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding, 2019.
[12] Ali Shakiba. Online correlation clustering for dynamic complete signed graphs. arXiv preprint arXiv:2211.07000, 2022.
[13] Jaewon Yang and Jure Leskovec. Defining and evaluating network communities based on ground-truth. In Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics, pages 1–8, 2012.