
Finding All Bounded-Length Simple Cycles in a Directed Graph

Anshul Gupta and Toyotaro Suzumura
[email protected], [email protected]
IBM T.J. Watson Research Center
Yorktown Heights, NY 10598, USA
Abstract

A new efficient algorithm is presented for finding all simple cycles that satisfy a length constraint in a directed graph. When the number of vertices is non-trivial, most cycle-finding problems are of practical interest for sparse graphs only. We show that for a class of sparse graphs in which the vertex degrees are almost uniform, our algorithm can find all cycles of length less than or equal to k in O((c+n)(k-1)d^k) steps, where n is the number of vertices, c is the total number of cycles discovered, d is the average degree of the graph's vertices, and k > 1. While our analysis of the running time addresses only a class of sparse graphs, we provide empirical and experimental evidence of the efficiency of the algorithm for general sparse graphs. This algorithm is a significant improvement over the only other deterministic algorithm for this problem known to us; it also lends itself to massive parallelism. Experimental results of a serial implementation on some large real-world graphs are presented.

Key Words: Graph algorithms, Cycles, Circuits, Directed graphs.

1 Introduction

Finding all simple cycles in a directed graph is a classical computer science problem, for which an elegant algorithm was published by Johnson [10]. An optimal and asymptotically faster algorithm for undirected graphs was proposed by Birmelé et al. [6].

The number of cycles in a complete graph grows exponentially with the number of vertices, n. Even sparse graphs, such as planar graphs, can have an exponential number of cycles [8]. A more tractable problem that arises in real applications is that of finding all simple cycles of a given length k or of length bounded by k, where k is typically a small integer. This has applications in the analysis of social networks, communication graphs, financial transaction graphs, etc. Despite its several applications, to the best of our knowledge, no deterministic algorithm specifically for this problem has been proposed to date. The closest problem that has been studied is that of finding the c shortest simple cycles, by Agarwal and Ramachandran [2, 3]. Their O(cne) algorithm for enumerating the first c cycles in nondecreasing order of length in a graph with n vertices and e edges can easily be adapted to find all cycles with at most k edges by stopping as soon as the first cycle of length greater than k is found. Monien [13] gave an O(ne) algorithm for finding and reporting a cycle of length k for any k ≥ 3, where n is the number of vertices and e is the number of edges in the graph. Alon et al. [4] presented an algorithm that improved this bound to O(e^(2-2/k)), reducing the running time for small values of k.

In this paper, we present a relatively simple, but fast and powerful new algorithm for finding all simple cycles of length less than or equal to k in sparse directed graphs. We show that for graphs with a nearly uniform vertex degree d, our algorithm takes O((c+n)(k-1)d^k) time, where c is the total number of cycles discovered, d = e/n, and k > 1. While our analysis of the running time addresses only a class of sparse graphs, we provide empirical and experimental evidence of the efficiency of the algorithm for general sparse graphs as well. For the class of graphs analyzed, the running time increases linearly with the size of the graph or the number of cycles discovered, because for a given k and a given graph, kd^k is a constant, albeit a relatively large one. The algorithm is not only practical to implement, but also lends itself to easy and scalable parallelization, thus extending its suitability for real applications.

The remainder of the paper is organized as follows. We start with an overview of Johnson’s algorithm [10] for finding all simple cycles in Section 2, since our algorithm uses a similar overall strategy to avoid futile searches. We describe our algorithm in detail in Section 3. Sections 4 and 5 contain correctness proofs and complexity analysis, respectively. Experimental results on large graphs related to real applications are presented in Section 6. Finally, Section 7 contains concluding remarks and directions for future work.

2 Johnson’s algorithm for finding all simple cycles

Johnson's algorithm [10] combines depth-first search (DFS) and backtracking with an astute strategy of blocking and unblocking vertices to prevent fruitless searches, and can find all simple cycles in a directed graph in O((c+1)(n+e)) time. It remains the best-known algorithm for this problem to date. At the time of its publication, it was a significant improvement over the previous fastest algorithm [15], whose worst-case complexity is O(e·n·(c+1)).

In each of its steps, Johnson's algorithm searches for cycles that start and end at a vertex s, in the strongly connected component [14] of the subgraph induced by vertices in {s, s+1, …, n}. Thus, step s outputs all cycles in which s is the vertex with the smallest index (henceforth referred to as the "smallest vertex"). Such cycle searches are performed for all values of s from 1 to n-1 for which the strongly connected component has more than 1 vertex.

As the algorithm proceeds with DFS starting at vertex s, it stacks and blocks the vertices encountered along the way. The search either leads to s, in which case a cycle has been discovered, or leads to a blocked vertex. Backtracking from an unsuccessful search unstacks the vertices, but leaves them blocked. It also adds the current vertex to the Blist of the most recently unstacked vertex, thus leaving a backward trail along an unsuccessful path. Blists are data structures used to remember unsuccessful portions of a path that has been explored. The blocked vertices prevent the search from following the same unsuccessful path, or a portion of it, more than once while the vertices responsible for the failure of the path are still on the stack. When backtracking from a successful path, the algorithm unblocks each unstacked vertex, and recursively unblocks the vertices in the Blist of each vertex being unblocked. The rationale is that a vertex v being unstacked was the reason for blocking the vertices in the backward trails emanating from it. As soon as v leaves the stack after a successful search, all the vertices that it was responsible for blocking can be considered for inclusion in a cycle in a subsequent visit via a DFS path that reaches v with a different stack. For a more detailed description and analysis of the algorithm, the reader is referred to the original paper [10].

3 Algorithm for length-constrained simple cycles

We now present an efficient algorithm for finding all elementary cycles of length less than or equal to k in a directed graph G = (V, E), where V = {1, 2, …, n} and (u, v) ∈ E iff there is an edge with source u ∈ V and destination v ∈ V.

1. function LC_CYCLES (G, k)
       // G = (V, E), V = {1, 2, …, n}
2.   for s = 1, n
3.     H^s = (W^s, F^s), where W^s = {s, s+1, …, n} and
         (u, v) ∈ F^s if u, v ≥ s and (u, v) ∈ E;
4.     G^s = (V^s, E^s), where V^s ⊆ W^s is the set of all vertices reachable from s
         via BFS with k-1 levels and (u, v) ∈ E^s if u, v ∈ V^s and (u, v) ∈ F^s;
5.     for v ∈ V^s
6.       Lock(v) = ∞;
7.       Blist(v) = ∅;
8.     end for
9.     blen = CYCLE_SEARCH (G^s, s, k, 0);
10.  end for
11. end LC_CYCLES
Figure 1: Outline of the LC_CYCLES procedure for finding all simple cycles of length less than or equal to k in a directed graph G = (V, E).

3.1 The outer loop

Figure 1 shows the outer loop of our algorithm. Like Johnson's algorithm, our algorithm searches for cycles whose smallest vertex is s in successive subgraphs G^s for s ∈ {1, 2, …, n}. However, the subgraph G^s is constructed quite differently. Let H^s be the subgraph of G induced by {s, s+1, …, n}. Then G^s is the subgraph of H^s induced by all the vertices in H^s that are reachable by breadth-first search (BFS) on H^s up to k-1 levels starting with vertex s. Clearly, any vertex in H^s that is not in G^s cannot belong to a cycle that includes s and has at most k edges.
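The construction of G^s described above can be sketched as a bounded BFS. The following Python fragment is a minimal illustration, not the paper's implementation; the adjacency-dictionary representation and the name bounded_subgraph are our own choices, and vertices are assumed to be numbered from 0:

```python
from collections import deque

def bounded_subgraph(adj, s, k):
    """Sketch of the construction of G^s: keep the vertices >= s that
    are reachable from s within k-1 BFS levels of H^s, together with
    the induced edges. Any vertex outside this set cannot lie on a
    cycle through s with at most k edges."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if dist[u] == k - 1:
            continue                     # do not expand beyond k-1 levels
        for v in adj.get(u, ()):
            if v >= s and v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # induced edge set: both endpoints must survive the BFS restriction
    return {u: [v for v in adj.get(u, ()) if v in dist] for u in dist}
```

For example, on the graph {0: [1], 1: [2, 0], 2: [0, 3], 3: [0]} with s = 0 and k = 3, vertex 3 is pruned because its distance from 0 is 3 > k-1.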

1. integer function CYCLE_SEARCH (G, v, k, flen)
       // G = (V, E), V ⊆ {s, s+1, …, n}
2.   blen = ∞;
3.   Lock(v) = flen;
4.   PUSH_STACK (v);
5.   for w ∈ Adjacency(v)
6.     if (w == s) then
7.       Output (Stack ∪ {s}) as a new cycle of length flen + 1;
8.       blen = 1;
9.     else if ((flen + 1) < Lock(w) and (flen + 1) < k) then
10.      blen = MIN (blen, 1 + CYCLE_SEARCH (G, w, k, flen + 1));
11.    end if
12.  end for
13.  if (blen < ∞) then                    // True only if at least one cycle found
14.    RELAX_LOCKS (v, k, blen);
15.  else
16.    for w ∈ Adjacency(v) if (v ∉ Blist(w)) then Blist(w) = Blist(w) ∪ {v};
17.  end if
18.  POP_STACK (v);
19.  return (blen);
20. end CYCLE_SEARCH
Figure 2: The CYCLE_SEARCH algorithm for finding all simple cycles of length at most k that include vertex s in a directed graph G = (V, E) whose smallest vertex is s. Note that the forward distance from s, flen, is incremented at each recursive call.

3.2 Cycle search with a given origin

At the heart of our algorithm is the function CYCLE_SEARCH, shown in Figure 2, for finding all simple cycles of length at most k that originate and terminate at vertex s in subgraph G^s. When called to explore vertex v, the input parameter flen is the length of the current forward path from s to v. The function returns the length of the shortest path from v to s when the current visit to v results in the detection of a cycle; otherwise, it returns ∞.

In addition to a stack, this function uses two other key data structures.

The first one is the array Lock, which stores an integer value for each vertex and helps avoid futile searches. For any vertex v that has been reached at least once and is not currently on the stack, Lock(v) contains one plus the maximum possible length of any path from s to v that has the possibility of reaching back to s with at most k total edges in it. This value is used to permit only those paths that reach v with flen < Lock(v) to visit it, and to block all other paths (Line 9). Note that Line 9 also blocks any path that already has k-1 edges and whose next edge does not lead to s. As a path is being explored, Lock(v) is set to flen when the path reaches v. If this path is eventually unsuccessful, then this value of Lock indicates that a future path reaching v has the potential to succeed with its current stack only if it reaches v with a smaller flen value. A cycle is detected when an edge (v, s) is encountered (Line 6). At this point, blen, the backward distance from s, is set to 1 (Line 8). Upon return from every successful call to CYCLE_SEARCH, blen at the calling vertex is updated (Line 10) so that the shortest backward distance from s for the current vertex v can be returned.

The second key data structure of function CYCLE_SEARCH is the Blist associated with each vertex w, which stores the indices of the source vertices v of all incoming edges (v, w) for which no cycle containing (v, w) is found with the current state of the search stack (Line 16). It can be implemented using a data structure such as B-trees [9], which permit fast insertion and look-up. Recall that the role of Lock(v) is to allow vertex v to be included only in those paths from s that have the potential to terminate at s within k or fewer total hops (edges). When the most recent visit to v results in an unsuccessful search, this is achieved by setting and leaving the value of Lock(v) equal to flen. When v is part of a successful search, then, with the help of the Blist data structures, the function RELAX_LOCKS, called at Line 14 in CYCLE_SEARCH, appropriately sets the Lock values of v and of the vertices that can reach it, based on the length of the current shortest path from v to s.

1. function RELAX_LOCKS (u, k, blen)
2.   if (Lock(u) < k - blen + 1) then
3.     Lock(u) = k - blen + 1;
4.     for (w ∈ Blist(u))
5.       if (w ∉ Stack) RELAX_LOCKS (w, k, blen + 1);
6.     end for
7.   end if
8. end RELAX_LOCKS
Figure 3: The recursive procedure for relaxing the Lock values when backtracking along a path that is part of at least one cycle. Note that the backward distance from s, blen, is incremented at each recursive call.

3.3 Backtracking and relaxing Lock

Figure 3 shows the recursive procedure for reassigning the Lock values when a cycle is detected. This function is called for a vertex u with a corresponding integer blen, which is the length of the currently shortest known path from u to s. The value of Lock(u) is increased to k - blen + 1, if possible (Line 3). An increase in Lock(u) is equivalent to relaxing the conditions for u to be included in a future search, because inclusion is permitted only for those paths from s that reach u with a length strictly less than Lock(u). If blen is the length of the shortest path from u to s, then a new path from s to u with length k - blen or less can succeed. Relaxing Lock(u) can also potentially relax Lock(w) if edge (w, u) was part of a previous unsuccessful search that led to w being added to the Blist of u. This process of going back to relax Lock values proceeds recursively, adding 1 to blen for each backward step (Line 5). This backward traversal is gated by the condition on Line 2 and follows only those backward tracks along which the Lock values can be increased.
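To make the interplay between CYCLE_SEARCH and RELAX_LOCKS concrete, the two procedures of Figures 2 and 3 can be sketched in runnable Python. This is a minimal illustration under our own representation choices (dictionaries for Lock and Blist, explicit state arguments, vertices numbered from 0); as noted in Section 5, passing the full graph instead of an explicitly constructed G^s does not affect correctness when s is the smallest vertex on the cycles sought:

```python
from math import inf

def cycle_search(adj, s, k, v, flen, lock, blist, stack, on_stack, out):
    """Figure 2 sketch: explore v at forward distance flen from s, report
    each cycle through s with at most k edges, and return the shortest
    backward distance from v to s over the cycles found, or infinity."""
    blen = inf
    lock[v] = flen
    stack.append(v)
    on_stack.add(v)
    for w in adj.get(v, ()):
        if w == s:
            out.append(stack + [s])        # new cycle of length flen + 1
            blen = 1
        elif flen + 1 < lock.get(w, inf) and flen + 1 < k:
            blen = min(blen, 1 + cycle_search(adj, s, k, w, flen + 1,
                                              lock, blist, stack, on_stack, out))
    if blen < inf:                         # at least one cycle found below v
        relax_locks(v, k, blen, lock, blist, on_stack)
    else:                                  # leave a backward trail (Line 16)
        for w in adj.get(v, ()):
            blist.setdefault(w, set()).add(v)
    stack.pop()
    on_stack.discard(v)
    return blen

def relax_locks(u, k, blen, lock, blist, on_stack):
    """Figure 3 sketch: raise Lock(u) to k - blen + 1 and propagate
    backward along Blist entries while Lock values can still increase."""
    if lock.get(u, inf) < k - blen + 1:
        lock[u] = k - blen + 1
        for w in blist.get(u, ()):
            if w not in on_stack:
                relax_locks(w, k, blen + 1, lock, blist, on_stack)
```

For instance, on the graph {0: [1], 1: [2, 0], 2: [0, 3], 3: [0]} with s = 0 and k = 3, the search reports the cycles (0, 1, 0) and (0, 1, 2, 0) and blocks the 4-cycle through vertex 3 at the length test on Line 9.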

3.4 Performance gains through sorting and parallelism

Note that the number of vertices and edges in subgraph H^s on Line 3 of function LC_CYCLES in Figure 1 monotonically decreases with s. Preprocessing graph G to sort and renumber its vertices in decreasing order of the sum of their incoming and outgoing edges ensures that the density, or average degree (the ratio of the number of edges to the number of vertices), of H^s also decreases monotonically with s. This can potentially improve performance in two ways. First, on average, it is likely to reduce the size of the search space for constructing G^s, as well as the size of G^s. Second, it has the potential to reduce the overall amount of search in CYCLE_SEARCH. Recall that for each vertex s, CYCLE_SEARCH looks for cycles starting and ending at s only. All other cycles that are traversed in the process are ignored. Processing denser subgraphs (presumably with more cycles) with a root vertex s of greater connectivity earlier is likely to reduce the number of valid cycles that are traversed but whose reporting is delayed until they are traversed again in the right G^s. In Section 6, we show the improvement in running time achieved by this preprocessing step.
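The preprocessing step above can be sketched as follows. The helper name renumber_by_degree and the adjacency-dictionary representation are ours, not the paper's:

```python
def renumber_by_degree(adj, n):
    """Relabel vertices 0..n-1 in decreasing order of total degree
    (in-degree plus out-degree), so that the subgraphs induced by
    vertices >= s tend to become sparser as s grows."""
    deg = [0] * n
    for u, nbrs in adj.items():
        deg[u] += len(nbrs)              # out-degree of u
        for v in nbrs:
            deg[v] += 1                  # in-degree contribution to v
    order = sorted(range(n), key=lambda u: -deg[u])
    new_id = {old: new for new, old in enumerate(order)}
    return {new_id[u]: [new_id[v] for v in adj.get(u, ())] for u in range(n)}
```

The relabeled graph has exactly the same cycles as the original (up to renaming of vertices), so the preprocessing changes only the order in which cycle searches are rooted, not the output set.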

Function LC_CYCLES can be easily and scalably parallelized, which significantly increases the algorithm's application potential for solving real-world problems involving very large graphs. All calls to CYCLE_SEARCH on Line 9 are independent and can be performed in parallel. The amount of work in CYCLE_SEARCH generally decreases as s increases, especially if the vertices of G are numbered in decreasing order of their degrees. In addition to the size of G^s, the amount of computation in CYCLE_SEARCH also depends on the number of cycles that it detects; i.e., the number of cycles in which s is the smallest vertex. Therefore, an efficient parallel implementation would need to carefully batch groups of starting vertices s in order to avoid load imbalance. However, other than this relatively minor consideration, parallelization of LC_CYCLES is quite straightforward.

4 Correctness proof

The following three lemmas show that function CYCLE_SEARCH in Figure 2 finds every cycle of length less than or equal to k that includes vertex s in G^s exactly once.

Lemma 1.

Function CYCLE_SEARCH does not output any cycle of length greater than k.

Proof.

By construction, in CYCLE_SEARCH(G, v, k, flen), flen is the length of the path from s to v. Line 9 ensures that flen < k. Therefore, if an edge (v, s) is encountered on Line 6, the length of the cycle is flen + 1, which is ≤ k. ∎

Lemma 2.

Function CYCLE_SEARCH outputs every cycle of length k or less in subgraph G^s.

Proof.

Assume that Lemma 2 is false, so CYCLE_SEARCH never reaches Line 7 with v = v_l and stack (s, v_1, v_2, …, v_l) for some valid cycle (s, v_1, v_2, …, v_l, s) in G^s. Without loss of generality, also assume that among the valid cycles for which this lemma fails, (s, v_1, v_2, …, v_l, s) is the first one.

Since (s, v_1, v_2, …, v_l, s) is a valid cycle of length l+1,

l + 1 ≤ k.     (1)

Since (s, v_1, v_2, …, v_l, s) is a cycle in G^s, s must be in the adjacency list of v_l, and if CYCLE_SEARCH(G^s, v_l, k, l) reaches Line 6 with stack (s, v_1, v_2, …, v_l), the cycle is guaranteed to be detected. Therefore, for Lemma 2 to fail, the algorithm must be blocked at some edge other than (v_l, s) in this cycle. Without loss of generality, assume that (v_i, v_{i+1}) is the first blocked edge of this cycle, where v_0 = s and 0 ≤ i < l.

Since i < l < k, edge (v_i, v_{i+1}) can be blocked only if

i + 1 ≥ Lock(v_{i+1})     (2)

on Line 9, because flen for v_i is i. This cannot happen if this is the first visit to v_{i+1}, because all Lock values were initialized to ∞ in LC_CYCLES before the call to CYCLE_SEARCH. There are only two other possibilities for Inequality 2 to be true.

The first possibility is that the last visit to v_{i+1}, which would have happened with a stack different from the current one, did not yield a valid cycle. In this case, Lock(v_{i+1}) would have been set equal to the flen for that visit. Let us denote that flen by f. By Inequality 2, it follows that

f ≤ i + 1.     (3)

Since (s, v_1, v_2, …, v_l, s) is a cycle, we know that a path from v_{i+1} to s of length l - i exists. Also, since v_{i+1} was visited earlier with a different stack, a cycle other than (s, v_1, v_2, …, v_l, s) containing v_{i+1} exists. The length q of this cycle is f + l - i, or f = q - l + i. By Inequality 3, q - l + i ≤ i + 1, or q ≤ l + 1. Therefore, by Inequality 1, q ≤ k. This means that this other cycle containing v_{i+1} meets the length constraint and is a valid cycle. Since (s, v_1, v_2, …, v_l, s) is the first cycle to be missed by the algorithm, the previous visit to v_{i+1}, which set Lock(v_{i+1}) to a value less than or equal to i + 1, must have produced a cycle of length q. This leads to a contradiction, because we are considering the scenario in which the last visit to v_{i+1} did not yield a valid cycle. This rules out the first possibility.

The second possibility is that the last visit to v_{i+1} did yield a valid cycle, and therefore function RELAX_LOCKS (Figure 3) was called with v_{i+1} as the first argument. Let b be the value of blen in that call to RELAX_LOCKS. By design of RELAX_LOCKS, the following must hold upon return from that call:

Lock(v_{i+1}) ≥ k - b + 1.     (4)

By virtue of (s, v_1, v_2, …, v_l, s) being a cycle, there is a path of length l - i from v_{i+1} to s. This path must have been successfully traversed during the last visit to v_{i+1}, because (v_i, v_{i+1}) is the first edge of any valid cycle to be blocked. Therefore, Line 10 in the last call to CYCLE_SEARCH(G^s, v_{i+1}, k, flen) ensures that

b ≤ l - i.     (5)

From Inequalities 2 and 4, it follows that i + 1 ≥ k - b + 1, or b ≥ k - i. Combining this with Inequality 5 yields k ≤ l, which contradicts Inequality 1. Therefore, the second possibility is also ruled out.

In conclusion, no edge of the cycle (s, v_1, v_2, …, v_l, s) can be blocked in CYCLE_SEARCH if l < k, and this cycle is guaranteed to be reported by the algorithm. This completes the proof. ∎

Lemma 3.

Function CYCLE_SEARCH does not output any cycle (s, v_1, v_2, …, v_l, s), where l < k, more than once.

Proof.

Line 5 of CYCLE_SEARCH visits each element in the adjacency list of v exactly once. After edge (v_l, s) is detected and v_l is unstacked at the end of CYCLE_SEARCH(G^s, v_l, k, l), the search can never get back to v_l with the same stack again, and the path (s, v_1, v_2, …, v_l) is never explored further again. ∎

Lemmas 1–3 prove that CYCLE_SEARCH yields all valid cycles starting and ending at s in G^s exactly once. Since CYCLE_SEARCH is called once for each s ∈ {1, 2, …, n} in LC_CYCLES, it can be concluded that LC_CYCLES yields exactly one instance of every valid cycle in the original graph G.

5 Time complexity

We first analyze the running time of CYCLE_SEARCH for finding all cycles starting and ending at vertex s in graph G^s. In CYCLE_SEARCH, the Lock values monotonically decrease until a valid cycle is found. Additionally, a vertex cannot be revisited without decreasing its Lock value. This ensures that no vertex is visited more than k-1 times before a valid cycle is detected. After a cycle is found, the Lock values of some vertices are relaxed, or increased. During the process of relaxing the Lock values, function RELAX_LOCKS does not traverse an edge unless a Lock value can be increased. Even in the worst case, this cannot happen to any vertex more than k-1 times. Therefore, the time associated with finding a valid cycle cannot exceed O((k-1)(|V^s| + |E^s|)), and the time complexity of CYCLE_SEARCH for finding all valid cycles in a given subgraph G^s is O((c_s + 1)(k-1)(|V^s| + |E^s|)), where c_s is the number of valid cycles in which the smallest vertex is s.

Recall that G^s consists of s and all vertices greater than s that are reachable from s within k-1 steps of BFS. If G is a sparse graph with a uniform degree d, then |V^s| + |E^s| is O(d^k) for all s. Therefore, the complexity of the overall algorithm for finding all cycles of length k or less in a graph with a uniform degree d is O(Σ_{s=1}^{n} (c_s + 1)(k-1)d^k) = O((c + n)(k-1)d^k), where k > 1 and c = Σ_{s=1}^{n} c_s is the total number of valid cycles in G.

In our analysis, we use d^k as the upper bound on the size of any G^s in the call to CYCLE_SEARCH. For sufficiently large graphs, the size of G^s would not increase with n as long as d = e/n stays constant. G^s consists of all vertices within a maximum distance of k-1 from s, and the size of this feasible neighborhood of s depends only on the structure of G (unless G is a scale-free network, which is rare [7]). The simplifying assumption of uniform vertex degrees was made to arrive at a closed-form expression for an upper bound on the size of G^s. For sufficiently large n, the size of G^s is likely to be independent of n even without this assumption. As a result, for a given k and a given d, the overall time of our algorithm would increase only linearly with the number of valid cycles and the number of vertices in the graph. In Section 6, we experimentally examine the growth of the running time with respect to k, and therefore with c, which itself is likely to grow with k.

It should be noted that explicit construction of G^s is neither necessary for the correctness of the algorithm, nor does it have a significant effect on its time complexity. By construction, function CYCLE_SEARCH does not visit any vertex whose minimum distance from s is k or greater. By using G^s instead of H^s, CYCLE_SEARCH avoids testing unnecessary edges at the periphery of the feasible neighborhood of s on Line 9. Additionally, explicitly constructing G^s with its own local indices improves the memory access locality of a computer implementation of the algorithm and yields better performance.

6 Experimental results

Graphs         n        e        d      σ_d
amazon0505     4.10e5   3.36e6   8.18   3.07
as-caida       2.65e4   1.07e5   4.03   33.4
email-EuAll    2.65e5   4.20e5   1.58   9.97
soc-pokec      1.63e6   3.06e7   18.8   32.0
web-BerkStan   6.85e5   7.60e6   11.1   16.3
web-Google     8.76e5   5.11e6   5.83   6.59
wiki-topcats   1.79e6   2.85e7   15.9   30.4
Table 1: Description of the directed graphs used in the experiments in this paper. As elsewhere in the paper, n refers to the number of vertices, e refers to the number of directed edges, d is the average out-degree (the ratio e/n) of the graph, and σ_d is the standard deviation of the out-degrees of the vertices.

We present experimental results on a set of large directed graphs obtained from the SNAP data set [12]. Table 1 contains detailed information about each graph. These graphs include social and communication networks, web graphs, product relationships, professional networks, etc. They range in average density from roughly 1.6 to 18.8. The standard deviation of the out-degrees shows that, with the exception of amazon0505, the graphs are fairly nonuniform. This is not surprising, since it has been observed that many social networks are highly skewed graphs and some even exhibit power-law characteristics [5]. For the results reported in this section, a serial implementation of the algorithm in Figures 1–3 was run on a single 3.3 GHz Intel Xeon E5-2667 core. This section contains results from our implementation only, because we were unable to find any other practical algorithms or software to solve this problem for comparison on large graphs.

Graphs         k=4               k=5                          k=6
               c        T        c        T       T (uns)     c        T
amazon0505     1.35e7   1.50     6.45e07  8       11          3.04e08  41
as-caida       4.65e6   3.67     1.47e08  252     401         7.64e09  11168
email-EuAll    6.68e6   5.0      2.95e08  407     472         1.50e10  27130
soc-pokec      5.21e8   1820     1.34e10  45606   69187       6.11e11  1.24e6
web-BerkStan   8.56e8   329      7.55e10  58702   62043       -        -
web-Google     3.39e7   3.35     3.11e08  44      97          3.35e09  638
wiki-topcats   1.93e8   521      6.36e09  17165   1.03e6      3.28e11  1.96e6
Table 2: Total number of cycles of length between 3 and k, for k = 4, 5, 6, and the time to find them in some directed graphs obtained from the SNAP repository [12]. For k = 5, the running time without sorting the vertices by degree is also included ("T (uns)") to show the benefit of sorting. The number of cycles is denoted by c, and the time T is in seconds.

Table 2 shows the number of cycles detected and the running time of our algorithm for k = 4, 5, and 6. While our algorithm detects all simple cycles of length less than or equal to k, we do not count and report cycles of length one (self-edges) and two (bidirectional edges) in this table. The table shows that the algorithm successfully detected up to hundreds of billions of cycles on large graphs using a single CPU, with only one run (the web-BerkStan graph for k = 6) failing to complete in several days; it had to be aborted. By default, we sort and renumber the vertices of the graph in decreasing order of total degree, as discussed in Section 3.4. For k = 5, we also show the running time with this sorting turned off to show the benefit of sorting. The time is worse without sorting for every graph in our test set. In most cases, sorting results in an improvement by a relatively small factor; however, for one graph (wiki-topcats), the difference is so significant that the algorithm may not be practical without sorting.

Graphs         k=4       k=5       k=6
amazon0505     8.02e-12  8.40e-13  8.98e-14
as-caida       9.90e-10  4.03e-10  6.80e-11
email-EuAll    3.71e-08  3.45e-08  2.29e-08
soc-pokec      8.32e-12  3.67e-13  1.33e-14
web-BerkStan   8.46e-12  1.16e-12  -
web-Google     2.78e-11  5.24e-12  9.70e-13
wiki-topcats   1.39e-11  6.60e-13  7.36e-14
Table 3: The ratio T / ((c + n)(k-1)·min(e, d^k)) for k = 4, 5, 6 for the graphs from the SNAP data set.

In Section 5, we derived O((c + n)(k-1)d^k) as the expression for the time complexity of our algorithm for graphs with a uniform out-degree d. We tested the graphs in our test suite from the SNAP data set against this expression for the growth of running time with k and c. Table 3 shows the ratio T / ((c + n)(k-1)·min(e, d^k)) for k = 4, 5, 6 for these graphs. We used min(e, d^k) in the denominator because d^k is meant to represent the maximum number of edges in each G^s in the CYCLE_SEARCH algorithm (Figure 2) for a given k, and in practice it cannot exceed the total number of edges e in the original graph. Table 3 shows that this ratio decreases as k (and therefore c) increases for all the graphs in our test set. Even though our graphs are quite diverse in origin and density, and the distribution of edges over the vertices is highly nonuniform, they all share this trend. The analytically derived running time bound used the simplifying assumption that the vertices have a uniform degree d. The experimental observation that, despite nonuniform degrees, the growth in the running time with k and c for every test case is slower than that suggested by the analytical bound is therefore an important finding and reinforces the practicality of our algorithm.
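As a quick sanity check, an entry of Table 3 can be reproduced directly from the rounded values published in Tables 1 and 2; small deviations are due to rounding of those figures:

```python
# amazon0505, k = 4: n and e from Table 1, c and T from Table 2
n, e = 4.10e5, 3.36e6
c, T, k = 1.35e7, 1.50, 4
d = e / n                                   # average out-degree
ratio = T / ((c + n) * (k - 1) * min(e, d**k))
# agrees with the 8.02e-12 reported in Table 3 to within rounding error
assert abs(ratio - 8.02e-12) / 8.02e-12 < 0.05
```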

7 Concluding remarks and future work

While enumerating simple cycles in directed graphs is a classical computer science problem, in most real applications on very large graphs, only a small fraction of the potentially enormous number of cycles are of interest. There has been relatively little research on discovering interesting cycles and algorithms for these are typically expensive [1], unless the search space can be substantially narrowed based on the definition of interesting cycles, for example, cycles with obligatory vertices or edges [11]. One class of cycle enumeration and detection problems of practical interest involves finding cycles of a limited length. We devised an efficient and practical algorithm for finding all length-constrained simple cycles in a directed graph.

We analytically and experimentally demonstrated the efficiency of the algorithm and showed that a serial implementation could detect billions of cycles on large graphs with millions of edges in a reasonable amount of time. In the future, it would be interesting to study the results of a parallel implementation on even larger graphs. On the analysis side, it would be interesting to explore the derivation of tighter time bounds without simplifying assumptions about the structure of the graphs.

References

  • [1] Florian Adriaens, Cigdem Aslay, Tijl De Bie, Aristides Gionis, and Jefrey Lijffijt. Discovering interesting cycles in directed graphs. In CIKM ’19: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1191–1200, November 2019.
  • [2] Udit Agarwal and Vijaya Ramachandran. Finding k simple paths and cycles. In Proceedings of the International Symposium on Algorithms and Computation (ISAAC), December 2016.
  • [3] Udit Agarwal and Vijaya Ramachandran. Fine-grained complexity for sparse graphs. In Proceedings of the 50th ACM SIGACT Symposium on Theory of Computing (STOC), 2018.
  • [4] N. Alon, R. Yuster, and U. Zwick. Finding and counting given length cycles. Algorithmica, 17:209–223, 1997.
  • [5] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, October 1999.
  • [6] Etienne Birmelé et al. Optimal listing of cycles and st-paths in undirected graphs. In Proceedings of the 2013 Annual ACM-SIAM Symposium on Discrete Algorithms, 2013.
  • [7] Anna D. Broido and Aaron Clauset. Scale-free networks are rare. Nature Communications, 10:1017, March 2019.
  • [8] Kevin Buchin, Christian Knauer, Klaus Kriegel, Andre Schulz, and Raimund Seidel. On the number of cycles in planar graphs. In International Computing and Combinatorics Conference, pages 97–107, 2007.
  • [9] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms (Second Ed.). MIT Press, McGraw-Hill, New York, NY, 2001.
  • [10] Donald B. Johnson. Finding all the elementary circuits of a directed graph. SIAM Journal on Computing, 4(1):77–84, 1975.
  • [11] Steffen Klamt and Axel von Kamp. Computing paths and cycles in biological interaction graphs. BMC Bioinformatics, 10:181, 2009.
  • [12] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.
  • [13] B. Monien. How to find long paths efficiently. Annals of Discrete Mathematics, 25:239–254, 1985.
  • [14] Robert E. Tarjan. Depth-first search and linear graph algorithms. SIAM Journal on Computing, 1(2):146–160, 1972.
  • [15] Robert E. Tarjan. Enumeration of the elementary circuits of a directed graph. SIAM Journal on Computing, 2(3):211–216, 1973.