Distributed D-core Decomposition over Large Directed Graphs

Xuankun Liao Hong Kong Baptist University [email protected] , Qing Liu Hong Kong Baptist University [email protected] , Jiaxin Jiang Hong Kong Baptist University [email protected] , Xin Huang Hong Kong Baptist University [email protected] , Jianliang Xu Hong Kong Baptist University [email protected] and Byron Choi Hong Kong Baptist University [email protected]

Abstract.

Given a directed graph $G$ and integers $k$ and $l$ , a D-core is the maximal subgraph $H\subseteq G$ such that for every vertex of $H$ , its in-degree and out-degree are no smaller than $k$ and $l$ , respectively. For a directed graph $G$ , the problem of D-core decomposition aims to compute the non-empty D-cores for all possible values of $k$ and $l$ . In the literature, several peeling-based algorithms have been proposed to handle D-core decomposition. However, the peeling-based algorithms that work in a sequential fashion and require global graph information during processing are mainly designed for centralized settings, which cannot handle large-scale graphs efficiently in distributed settings. Motivated by this, we study the distributed D-core decomposition problem in this paper. We start by defining a concept called anchored coreness, based on which we propose a new H-index-based algorithm for distributed D-core decomposition. Furthermore, we devise a novel concept, namely skyline coreness, and show that the D-core decomposition problem is equivalent to the computation of skyline corenesses for all vertices. We design an efficient D-index to compute the skyline corenesses distributedly. We implement the proposed algorithms under both vertex-centric and block-centric distributed graph processing frameworks. Moreover, we theoretically analyze the algorithm and message complexities. Extensive experiments on large real-world graphs with billions of edges demonstrate the efficiency of the proposed algorithms in terms of both the running time and communication overhead.

PVLDB Reference Format:
PVLDB, 14(1): XXX-XXX, 2020.
doi:XX.XX/XXX.XX ^†^†This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097.
doi:XX.XX/XXX.XX

1. Introduction

Graph is a widely used data structure to depict entities and their relationships. In a directed graph, edges have directions to represent links from one vertex to another, which has many real-life applications, e.g., the following relationship in online social networks such as Twitter, the money flow in financial networks, the traffic route in road networks, and the message forwarding path in communication networks. Among many graph analysis algorithms, cohesive subgraph analysis is to discover densely connected subgraphs under a cohesive subgraph model. A well-known model used for undirected graphs is $k$ -core, which requires every vertex in the subgraph to have at least $k$ neighbors (Seidman, 1983). As a directed version of $k$ -core, $D$ -core, a.k.a. $(k,l)$ -core, is the maximal directed subgraph such that every vertex has at least $k$ in-neighbors and $l$ out-neighbors within this subgraph (Giatsidis et al., 2013). For example, in Figure 1(a), the whole directed graph $G$ is a (2, 2)-core since every vertex has an in-degree of at least 2 and an out-degree of at least 2.

Refer to caption — (a) A directed graph $G$

As a foundation of D-core discovery, the problem of D-core decomposition aims to compute the non-empty D-cores of a directed graph for all possible values of $k$ and $l$ . D-core decomposition has a number of applications. It has been used to build coreness-based indexes for speeding up community search (Fang et al., 2018; Chen et al., 2020), to measure influence in social networks (Garcia et al., 2017), to evaluate graph collaboration features of communities (Giatsidis et al., 2013), to visualize and characterize complex networks (Montresor et al., 2012), and to discover hubs and authorities of directed networks (Soldano et al., 2017). For example, based on the D-core decomposition results, we can index a graph’s vertices by their corenesses using a table (Fang et al., 2018) or D-Forest (Chen et al., 2020); then, D-core-based community search can be accelerated by looking up the table or D-Forest directly, instead of performing the search from scratch. In the literature, peeling-based algorithms have been proposed for D-core decomposition in centralized settings (Giatsidis et al., 2013; Fang et al., 2018). They work in a sequential fashion to remove disqualified vertices one by one from a graph. That is, they first determine all possible values of $k$ (i.e., from 0 to the maximum in-degree of the graph). Next, for each value $k$ , they compute $(k,l)$ -cores for all possible values of $l$ by iteratively deleting the vertices with the smallest out-degree. Figure 1(b) shows the D-core decomposition results of $G$ for different $k$ and $l$ values.

In this paper, we study the problem of D-core decomposition in distributed settings, where the input graph $G$ is stored on a collection of machines and each machine holds only a partial subgraph of $G$ . The motivation is two-folded. First, due to the large size of graph data, D-core decomposition necessitates huge memory space, which may exceed the capacity of a single machine. For example, the existing algorithms could not work for billion-scale graphs due to excessive memory space costs (Fang et al., 2018). Second, in practical applications, many large graphs are inherently distributed over a collection of machines, making distributed processing a natural solution (Linial, 1992; Aridhi et al., 2016; Montresor et al., 2012; Mandal and Hasan, 2017).

However, the existing peeling-based algorithms are not efficient when extended to distributed settings. In particular, when computing the $(k,l)$ -cores for a given $k$ , the algorithms need to iteratively find the vertices with the smallest out-degree to delete and then update the out-degrees for the remaining vertices, until the graph becomes empty. This process (i) is not parallelizable since the update of out-degrees in each iteration depends on the vertices deletion in the previous iteration and (ii) entails expensive network communications since it needs global graph information.

To address these issues, we design new distributed D-core decomposition algorithms by exploiting the relationships between a vertex and its neighbors. First, inspired by the notion of k-list (Fang et al., 2018), we propose an anchored coreness-based algorithm. Specifically, for a vertex $v$ , if we fix the value of $k_{v}$ , we can compute the maximum value of $l_{v}$ such that $v$ is contained in the $(k_{v},l_{v})$ -core. We call this pair $(k_{v},l_{v})$ an anchored coreness of $v$ . For example, for vertex $v_{2}$ in Figure 1, when $k_{v_{2}}=0$ , the maximum value of $l_{v_{2}}$ is 2 since $v_{2}\in(0,2)$ -core but $v_{2}\notin(0,3)$ -core. Hence, $(0,2)$ is an anchored coreness of $v_{2}$ . The other anchored corenesses of $v_{2}$ are $(1,2)$ , $(2,2)$ , and $(3,1)$ . Once we have computed the anchored corenesses for every vertex, we can easily derive the D-cores from these anchored corenesses. Specifically, given integers $k$ and $l$ , the $(k,l)$ -core consists of the vertices whose anchored coreness $(k_{v},l_{v})$ satisfies $k_{v}=k$ and $l_{v}\geq l$ . Thus, the problem of distributed D-core decomposition is equivalent to computing the anchored corenesses in a distributed way. To do so, we first exploit the in-degree relationship between a vertex and its in-neighbors and define an in-H-index, based on which we compute the maximum value of $k$ for each vertex. Then, we study the property of $(k,0)$ -core and define an out-H-index. On the basis of that, for each possible value of $k$ with respect to a vertex, we iteratively compute the corresponding upper bound of $l$ simultaneously. Finally, we utilize the definition of D-core to iteratively refine all the upper bounds to obtain the anchored corenesses of all vertices.

Note that the anchored coreness-based algorithm first fixes one dimension and then computes the anchored corenesses for the other dimension, which may lead to suboptimal performance. To improve performance, we further propose a novel concept, called skyline coreness, and develop a skyline coreness-based algorithm. Specifically, we say the pair $(k_{v},l_{v})$ is a skyline coreness of a vertex $v$ , if there is no other pair $(k_{v}^{\prime},l_{v}^{\prime})$ such that $k_{v}^{\prime}\geq k_{v}$ , $l_{v}^{\prime}\geq l_{v}$ , and $v\in(k_{v}^{\prime},l_{v}^{\prime})$ -core. For example, in Figure 1, the skyline corenesses of $v_{2}$ are $\{(2,2),(3,1)\}$ . Compared with anchored corenesses, a vertex’s skyline corenesses contain fewer pairs of $(k_{v},l_{v})$ . Nevertheless, based on the skyline corenesses, we can still easily find all the D-cores containing the corresponding vertex. To be specific, if $(k_{v},l_{v})$ is a skyline coreness of $v$ , then $v$ is also in the $(k,l)$ -cores with $k\leq k_{v}$ and $l\leq l_{v}$ . The basic idea of the skyline coreness-based algorithm is to use neighbors’ skyline corenesses to iteratively estimate the skyline corenesses of each vertex. To this end, we define a new index, called D-index, for each vertex based on the following unique property of skyline corenesses. If $(k_{v},l_{v})$ is one of the skyline corenesses of a vertex $v$ , we have (i) $v$ has at least $k_{v}$ in-neighbors such that each of these in-neighbors, $v_{i}$ , has a skyline coreness $(k_{v_{i}}^{\prime},l_{v_{i}}^{\prime})$ satisfying $k_{v_{i}}^{\prime}\geq k_{v}$ and $l_{v_{i}}^{\prime}\geq l_{v}$ ; and (ii) $v$ has at least $l_{v}$ out-neighbors such that each of these out-neighbors, $v_{j}$ , has a skyline coreness $(k_{v_{j}}^{\prime\prime},l_{v_{j}}^{\prime\prime})$ satisfying $k_{v_{j}}^{\prime\prime}\geq k_{v}$ and $l_{v_{j}}^{\prime\prime}\geq l_{v}$ . With this property, we design a distributed algorithm to iteratively compute the D-index for each vertex with its neighbors’ D-indexes. To deal with the combinatorial blow-ups in the computation of D-indexes, we further develop three optimization strategies to improve efficiency.

We implement our algorithms under two well-known distributed graph processing frameworks, i.e., vertex-centric (Malewicz et al., 2010; gir, 2012; Salihoglu and Widom, 2013; Low et al., 2012) and block-centric (Tian et al., 2013; Yan et al., 2014; Fan et al., 2017). Empirical results on small graphs demonstrate that our algorithms run faster than the peeling-based algorithm by up to 3 orders of magnitude. For larger graphs with more than 50 million edges, the peeling-based algorithm cannot finish within 5 days, while our algorithms can finish within 1 hour for most datasets. Moreover, our proposed algorithms require less than 100 rounds to converge for most datasets, and more than $90\%$ vertices can converge within 10 rounds.

This paper’s main contributions are summarized as follows:

•

For the first time in the literature, we study the problem of distributed D-core decomposition over large directed graphs.
•

We develop an anchored coreness-based distributed algorithm using well-defined in-H-index and out-H-index. To efficiently compute the anchored corenesses, we propose tight upper bounds that can be iteratively refined to exact anchored corenesses with reduced network communications.
•

We further propose a novel concept of skyline coreness and show that the problem is equivalent to the computation of skyline corenesses for all vertices. A new two-dimensional D-index that unifies the in- and out-neighbor relationships, together with three optimization strategies, is designed to compute the skyline corenesses distributedly.
•

Both theoretical analysis and empirical evaluation validate the efficiency of our algorithms for distributed D-core decomposition.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 formally defines the problem. Sections 4 and 5 propose two distributed algorithms for computing anchored coreness and skyline coreness, respectively. Experimental results are reported in Section 6. Finally, Section 7 concludes the paper.

2. Related Work

In this section, we review the related work in two aspects, i.e., core decomposition and distributed graph computation.

Core Decomposition. As a well-known dense subgraph model, a $k$ -core is the maximal subgraph of an undirected graph such that every vertex has at least $k$ neighbors within this subgraph (Seidman, 1983). The core decomposition task aims at finding the $k$ -cores for all possible values of $k$ in a graph. Many efficient algorithms have been proposed to handle core decomposition over an undirected graph, such as peeling-based algorithms (Batagelj and Zaversnik, 2003; Cheng et al., 2011; Khaouid et al., 2015), disk-based algorithm (Cheng et al., 2011; Khaouid et al., 2015), semi-external algorithm (Wen et al., 2016), streaming algorithms (Saríyüce et al., 2013; Sarıyüce et al., 2016), parallel algorithms (Dasari et al., 2014; Esfandiari et al., 2018), and distributed algorithms (Aridhi et al., 2016; Montresor et al., 2012; Mandal and Hasan, 2017). It is worth mentioning that the distributed algorithms for $k$ -core decomposition (Aridhi et al., 2016; Montresor et al., 2012; Mandal and Hasan, 2017) cannot be used for distributed D-core decomposition. Specifically, the distributed $k$ -core decomposition algorithms use the neighbors’ corenesses to estimate a vertex’s coreness, where all neighbors are of the same type. For D-core, a vertex’s neighbors include in-neighbors and out-neighbors, which affect each other and should be considered simultaneously. If we consider only one type of neighbors, we cannot get the correct answer. Inspired by the H-index-based computation for core decomposition (Lü et al., 2016) and nucleus decomposition (Sariyüce et al., 2018), we apply a similar idea in the design of distributed algorithms. Nevertheless, our technical novelty lies in the non-trivial extension of H-index from one-dimensional undirected coreness to two-dimensional anchored/skyline coreness, which needs to consider the computations of in-degrees and out-degrees simultaneously in a unified way.

In addition, core decomposition has been studied for different types of networks, such as weighted graphs (Eidsaa and Almaas, 2013; Zhou et al., 2021), uncertain graphs (Bonchi et al., 2014; Peng et al., 2018), bipartite graphs (Liu et al., 2019), temporal graphs (Galimberti et al., 2021; Wu et al., 2015), and heterogeneous information networks (Fang et al., 2020). Recently, a new problem of distance-generalized core decomposition has been studied by considering vertices’ $h$ -hop connectivity (Bonchi et al., 2019; Liu et al., 2021). Note that a directed graph can be viewed as a bipartite graph. After transforming a directed graph to a bipartite graph, the $(k,l)$ -core in the directed graph has a corresponding $(\alpha,\beta)$ -core in the bipartite graph (Liu et al., 2019), but not vice versa. Therefore, the problems of $(k,l)$ -core decomposition and $(\alpha,\beta)$ -core decomposition are not equivalent, and $(\alpha,\beta)$ -core decomposition algorithms cannot be used in our work.

Distributed Graph Computation. In the literature, there exist various distributed graph computing models and systems to support big graph analytics. Among them, the vertex-centric framework (Malewicz et al., 2010; Low et al., 2012; McCune et al., 2015) and the block-centric framework (Tian et al., 2013; Yan et al., 2014) are two most popular frameworks.

The vertex-centric framework assumes that each vertex is associated with one computing node and communication occurs through edges. The workflow of the vertex-centric framework consists of a set of synchronous supersteps. Within each superstep, the vertices execute a user-defined function asynchronously after receiving messages from their neighbors. If a vertex does not receive any message, it will be marked as inactive. The framework stops once all vertices become inactive. Typical vertex-centric systems include Pregel (Malewicz et al., 2010), Giraph (gir, 2012), GPS (Salihoglu and Widom, 2013), and GraphLab (Low et al., 2012). For the block-centric framework, one computing node stores the vertices within a block together and communication occurs between blocks after the computation within a block reaches convergence. Compared with the vertex-centric framework, the block-centric framework can reduce the network traffic and better balance the workload among nodes. Distributed graph processing systems such as Giraph++ (Tian et al., 2013), Blogel (Yan et al., 2014), and GRAPE (Fan et al., 2017) belong to the block-centric framework.

Note that in this paper we mainly focus on algorithm designs for distributed D-core decomposition. To demonstrate the flexibility of our proposed algorithms, we implement them for performance evaluation in both vertex-centric and block-centric frameworks.

3. Problem Formulation

In this paper, we consider a directed, unweighted simple graph $G=(V_{G},E_{G})$ , where $V_{G}$ and $E_{G}$ are the set of vertices and edges, respectively. Each edge $e\in E_{G}$ has a direction. For an edge $e=\langle u,v\rangle\in E_{G}$ , we say $u$ is an in-neighbor of $v$ and $v$ is an out-neighbor of $u$ . Correspondingly, $N^{in}_{G}(v)$ and $N^{out}_{G}(v)$ are respectively denoted as the in-neighbor set and out-neighbor set of a vertex $v$ in $G$ . We define three kinds of degrees for a vertex $v$ as follows: (1) the in-degree is the number of $v$ ’s in-neighbors in $G$ , i.e., $deg^{in}_{G}(v)=|N^{in}_{G}(v)|$ ; (2) the out-degree is the number of $v$ ’s out-neighbors in $G$ , i.e., $deg^{out}_{G}(v)=|N^{out}_{G}(v)|$ ; (3) the degree is the sum of its in-degree and out-degree, i.e., $deg_{G}(v)=deg^{in}_{G}(v)+deg^{out}_{G}(v)$ . Based on the in-degree and out-degree, we give a definition of D-core as follows.

Definition 3.0.

D-core (Giatsidis et al., 2013). Given a directed graph $G=(V_{G},E_{G})$ and two integers $k$ and $l$ , a D-core of $G$ , also denoted as $(k,l)$ -core, is the maximal subgraph $H=(V_{H},E_{H})\subseteq G$ such that $\forall v\in V_{H}$ , $deg_{H}^{in}(v)\geq k$ and $deg_{H}^{out}(v)\geq l$ .

According to Definition 3.1, a D-core should satisfy both the degree constraints and the size constraint. The degree constraints ensure the cohesiveness of D-core in terms of in-degree and out-degree. The size constraint guarantees the uniqueness of the D-core, i.e., for a specific pair of $(k,l)$ , there exists at most one D-core in $G$ . Moreover, D-core has a partial nesting property as follows.

Property 3.1.

Partial Nesting. Given two D-cores, $(k_{1},l_{1})$ -core $D_{1}$ and $(k_{2},l_{2})$ -core $D_{2}$ , $D_{1}$ is nested in $D_{2}$ (i.e., $D_{1}\subseteq D_{2}$ ) if $k_{1}\geq k_{2}$ and $l_{1}\geq l_{2}$ . Note that if $k_{1}\geq k_{2}$ and $l_{1}<l_{2}$ , or $k_{1}<k_{2}$ and $l_{1}\geq l_{2}$ , $D_{1}$ and $D_{2}$ may be not nested in each other.

Example 3.0.

In Figure 2, the directed subgraph $H_{1}$ induced by the vertices $v_{1}$ , $v_{4}$ , $v_{5}$ , and $v_{6}$ is a $(2,2)$ -core since $\forall v\in V_{H_{1}}$ , $deg^{in}_{H_{1}}(v)=deg^{out}_{H_{1}}(v)=2$ . Moreover, $H_{1}$ $\subseteq$ $H_{2}=(2,0)$ -core, $H_{1}$ $\subseteq$ $H_{3}=(1,1)$ -core. On the other hand, $H_{2}$ $\nsubseteq$ $H_{3}$ and $H_{3}$ $\nsubseteq$ $H_{2}$ , due to the non-overlapping vertices $v_{2}$ , $v_{3}$ , and $v_{7}$ .

In this paper, we study the problem of D-core decomposition to find all D-cores of a directed graph $G$ in distributed settings. In-memory algorithms of D-core decomposition have been studied in (Giatsidis et al., 2013; Fang et al., 2018), assuming that the entire graph and associated structures can fit into the memory of a single machine. To our best knowledge, the problem of distributed D-core decomposition, considering a large graph distributed over a collection of machines, has not been investigated in the literature. We formulate a new problem of distributed D-core decomposition as follows.

Problem 1.

(Distributed D-core Decomposition). Given a directed graph $G=(V_{G},E_{G})$ that is distributed in a collection of machines $\{M_{i}:$ a machine $M_{i}$ holds a partial subgraph $G_{i}\subseteq G$ , $1\leq i\leq n\}$ where $n\geq 2$ and $\cup_{i=1}^{n}G_{i}=G$ , the problem of distributed D-core decomposition is to find all D-cores of $G$ using $n$ machines, i.e., identifying the $(k,l)$ -cores with all possible $(k,l)$ pairs.

Consider applying D-core decomposition on $G$ in Figure 2. We can obtain a total of 9 different D-cores: $(0,0)$ -core $=$ $(1,0)$ -core $=$ $G$ ; $(0,1)$ -core $=$ $(1,1)$ -core $=$ $H_{3}$ ; $(0,2)$ -core $=$ the subgraph of $G$ induced by the vertices in $V_{H_{1}}\cup\{v_{7}\}$ ; $(1,2)$ -core $=$ $(2,1)$ -core $=$ $(2,2)$ -core $=$ $H_{1}$ ; $(2,0)$ -core $=$ $H_{2}$ .

In the following two sections, we propose two new distributed algorithms for D-core decomposition. Without loss of generality, we mainly present the algorithms under the vertex-centric framework. At the end of Sections 4 and 5, we discuss how to extend our proposed algorithms to the block-centric framework.

4. Distributed Anchored Coreness-Based Algorithm

In this section, we first give a definition of anchored coreness, which is useful for D-core decomposition. Then, we present a vertex-centric distributed algorithm for anchored coreness computation. Finally, we analyze the correctness and complexity of our proposed algorithm, and discuss its block-centric extension.

4.1. Anchored Coreness

Recall that, in the undirected $k$ -core model (Seidman, 1983), every vertex $v$ has a unique value called coreness, i.e., the maximum value $k\in\mathbb{N}_{0}$ such that $v$ is contained in a non-empty $k$ -core. Similarly, we give a definition of anchored coreness for directed graphs as follows.

Definition 4.0.

(Anchored Coreness). Given a directed graph $G$ and an integer $k$ , the anchored coreness of a vertex $v\in V_{G}$ is a pair $(k,l_{max}(v,k))$ , where $l_{max}(v,k)=\max\limits_{l\in\mathbb{N}_{0}}\{l\ |\ \exists(k,l)\text{-core }H\subseteq G\wedge v\in V_{H}\}$ . The entire anchored corenesses of the vertex $v$ are defined as $\Phi(v)=\{(k^{\prime},l_{max}(v,k^{\prime}))\ |\ 0\leq k^{\prime}\leq k_{max}(v)\}$ , where $k_{max}(v)=\max\limits_{k^{\prime\prime}\in\mathbb{N}_{0}}\{k^{\prime\prime}\ |\ \exists(k^{\prime\prime},0)\text{-core }H\wedge v\in V_{H}\}$ .

Different from the undirected coreness, the anchored coreness is a two-dimensional feature of in-degree and out-degree in directed graphs. For example, consider the graph $G$ in Figure 1 and $k=3$ , the anchored coreness of vertex $v_{2}$ is $(3,1)$ , as $l_{max}(v_{2},3)=1$ . Correspondingly, $\Phi(v_{2})=\{(0,2),(1,2),(2,2),(3,1)\}$ . The anchored corenesses can facilitate the distributed D-core decomposition as follows. According to Property 3.1, for a vertex $v$ with the anchored coreness of $(k,l)$ , $v$ belongs to any $(k,l^{\prime})$ -core with $l^{\prime}\leq l$ . Hence, as long as we compute the anchored corenesses of $v$ for each possible $k$ , we can get all D-cores containing $v$ . As a result, for a given directed graph $G$ , the problem of D-core decomposition is equivalent to computing the entire anchored corenesses for every vertex $v\in V_{G}$ , i.e., $\{\Phi(v)|v\in V_{G}\}$ .

4.2. Distributed Anchored Coreness Computing

In this section, we present a distributed algorithm for computing the entire anchored corenesses for every vertex in $G$ .

Overview. To handle the anchored coreness computation simultaneously in a distributed setting, we propose a distributed vertex-centric algorithm to compute all feasible anchored corenesses $(k,l)$ ’s for a vertex $v$ . The general idea is to first identify the feasible range of $k\in[0,k_{max}(v)]$ by exploring $(k,0)$ -cores and then refine an estimated upper bound of $l_{max}(v,k)$ to be exact for all possible values of $k$ . The framework is outlined in Algorithm 1, which gives an overview of the anchored coreness updating procedure in three phases: 1) deriving $k_{max}(v)$ ; 2) computing the upper bound of $l_{max}(v,k)$ for each $k$ ; and 3) refining the upper bound to the exact anchored coreness $l_{max}(v,k)$ . Note that in the second and third phases, the upper bound of $l_{max}(v,k)$ can be computed and refined in batch, instead of one by one sequentially, for different values of $k\in[0,k_{max}(v)]$ .

Table 1. An illustration of distributed D-core decomposition using Algorithm 1 on graph

G

in Figure 2.

		Vertices
		$v_{1}$	$v_{2}$	$v_{3}$	$v_{4}$	$v_{5}$	$v_{6}$	$v_{7}$	$v_{8}$
Phase I	$\mathrm{iH}^{(0)}(v)$	3	2	2	2	2	3	1	2
	$\mathrm{iH}^{(1)}(v)$	2	2	2	2	2	2	1	2
	$\mathrm{iH}^{(2)}(v)$ $=$ $k_{max}(v)$	2	2	2	2	2	2	1	2
Phase II	$\forall k\in[0,k_{max}(v)]$ , $\mathrm{oH}_{G[k]}^{(0)}(v)$	3; 3; 3	0; 0; 0	0; 0; 0	5; 5; 5	3; 3; 3	2; 2; 2	2; 2	2; 2; 2
	$\forall k\in[0,k_{max}(v)]$ , $\mathrm{oH}_{G[k]}^{(1)}(v)$	2; 2; 2	0; 0; 0	0; 0; 0	2; 2; 2	2; 2; 2	2; 2; 2	2; 2	1; 1; 0
	$\forall k\in[0,k_{max}(v)]$ , $\mathrm{oH}_{G[k]}^{(2)}(v)$	2; 2; 2	0; 0; 0	0; 0; 0	2; 2; 2	2; 2; 2	2; 2; 2	2; 2	1; 1; 0
Phase III	$\forall k\in[0,k_{max}(v)]$ , $l_{upp}(k,v)$	2; 2; 2	0; 0; 0	0; 0; 0	2; 2; 2	2; 2; 2	2; 2; 2	2; 2	1; 1; 0
	$\forall k\in[0,k_{max}(v)]$ , $l^{\prime}_{upp}(k,v)$	2; 2; 2	0; 0; 0	0; 0; 0	2; 2; 2	2; 2; 2	2; 2; 2	2; 1	1; 1; 0
	$\forall k\in[0,k_{max}(v)]$ , $l_{max}(k,v)$	2; 2; 2	0; 0; 0	0; 0; 0	2; 2; 2	2; 2; 2	2; 2; 2	2; 1	1; 1; 0

Input: directed graph

G

, vertex

v

Output: anchored corenesses of vertex

v

1 Compute

k_{max}(v)

for vertex

v

using Algorithm 2;

2 Compute the upper bounds

l_{upp}(k,v)

where

k\in[0,k_{max}]

, by invoking Algorithm 3;

3 Refine the upper bounds

l_{upp}(k,v)

to anchored corenesses

l_{max}(k,v)

using Algorithm 4;

return the entire anchored corenesses of

v

\Phi(v)

;

Algorithm 1 Distributed Anchored Coreness Computation: routine executed by a vertex

v

Phase I: Computing the in-degree limit $k_{max}(v)$ . To compute $k_{max}(v)$ , first, we introduce a concept of H-index (Hirsch, 2005). Specifically, given a collection of integers $S$ , the H-index of $S$ is a maximum integer $h$ such that $S$ has at least $h$ integer elements whose values are no less than $h$ , denoted as $\mathcal{H}(S)$ . For example, given $S=\{1,2,3,3,4,6\}$ , H-index $\mathcal{H}(S)=3$ , as $S$ has at least 3 elements whose values are no less than 3. Based on H-index, we give a new definition of $n$ -order in-H-index for directed graph.

Definition 4.0.

( $n$ -order in-H-index). Given a vertex $v$ in $G$ , the $n$ -order in-H-index of $v$ , denoted by $\mathrm{iH}_{G}^{(n)}(v)$ , is defined as

(1)

\mathrm{iH}_{G}^{(n)}(v)=\begin{cases}deg^{in}_{G}(v),&n=0\\ \mathcal{H}(I),&n>0\end{cases}

where the integer set $I=\{\mathrm{iH}_{G}^{(n-1)}(u)|u\in N^{in}_{G}(v)\}$ .

Theorem 4.3 (Convergence).

(2)

k_{max}(v)=\lim_{n\rightarrow\infty}\mathrm{iH}_{G}^{(n)}(v)

Proof.

Due to space limitation, we give a proof sketch here. The detailed proof can be found in (Liao et al., 2022). First, we prove that $\mathrm{iH}_{G}^{(n)}(v)$ is non-increasing with the increase of order $n$ . Thus, $\mathrm{iH}_{G}^{(n)}(v)$ finally converges to an integer when $n$ is big enough. Then, we prove $k_{max}(v)\leq\mathrm{iH}_{G^{\prime}}^{(\infty)}(v)\leq\mathrm{iH}_{G}^{(\infty)}(v)$ , where $G^{\prime}\subseteq G$ is a subgraph induced by the vertices $v^{\prime}$ with $k_{max}(v^{\prime})\geq k_{max}(v)$ . Also, we know $k_{max}(v)\geq\mathrm{iH}_{G}^{(\infty)}(v)$ by definition. Hence, $k_{max}(v)=\mathrm{iH}_{G}^{(\infty)}(v)$ . ∎

According to Theorem 2, $\mathrm{iH}_{G}^{(n)}(v)$ finally converges to $k_{max}(v)$ , based on which we present a distributed algorithm as shown in Algorithm 2 to compute $k_{max}(v)$ . Algorithm 2 has an initialization step (lines 1-4), and two update procedures after receiving one message (lines 5-7) and all messages (lines 8-11). It first uses 0 to initialize the set $I$ , which keeps the latest $n$ -order in-H-indexes of $v$ ’s in-neighbors (lines 1-2). Then, the algorithm sets the $n$ -order in-H-index of $v$ to its in-degree (line 3) and sends the message ¡ $v$ , $\mathrm{iH}(v)$ ¿ to all its out-neighbors (line 4). When $v$ receives a message ¡ $v^{\prime}$ , $\mathrm{iH}(v^{\prime})$ ¿ from its in-neighbor $v^{\prime}$ , the algorithm updates the $n$ -order in-H-index of $v^{\prime}$ (line 5). If $\mathrm{iH}(v^{\prime})<\mathrm{iH}(v)$ , it means the $n$ -order in-H-index of $v$ may decrease. Thus, flag is set to True to indicate the re-computation of $v$ ’s $n$ -order in-H-index (line 7). After receiving all massages, if flag is True, Algorithm 2 updates $v$ ’s $n$ -order in-H-index $\mathrm{iH}(v)$ and inform all its out-neighbors if $\mathrm{iH}(v)$ decreases (lines 9-11). Algorithm 2 completes and returns $\mathrm{iH}(v)$ as $k_{max}(v)$ when there is no vertex broadcasting messages (line 12).

Example 4.0.

We use the directed graph $G$ in Figure 2 to illustrate Algorithm 2, whose calculation process is shown in Table 1. We take vertex $v_{1}$ as an example. First, $v_{1}$ ’s 0-order in-H-index is initialized with its in-degree, i.e., $\mathrm{iH}_{G}^{(0)}(v_{1})=3$ . Then, Algorithm 2 iteratively computes $\mathrm{iH}_{G}^{(n)}(v_{1})$ . After one iteration, the 1-order in-H-index of $v_{1}$ has converged to $\mathcal{S}(\mathrm{iH}_{G}^{(0)}(v_{4}),\mathrm{iH}_{G}^{(0)}(v_{6}),\mathrm{iH}_{G}^{(0)}(v_{7}))$ = $\mathcal{S}(2,3,1)$ = 2. Thus, $k_{max}(v_{1})=\mathrm{iH}_{G}^{(2)}(v_{1})=\mathrm{iH}_{G}^{(1)}(v_{1})=2$ .

Input: directed graph

G

, vertex

v

Output:

k_{max}(v)

Initializations

1 for each $v^{\prime}\in N^{in}_{G}(v)$ do

I[v^{\prime}]\leftarrow 0

;

\mathrm{iH}(v)\leftarrow deg^{in}_{G}(v)

; Send message

\langle v

\mathrm{iH}(v)\rangle

to all out-neighbors of

v

;

On receiving message $\langle v^{\prime}$ , $\mathrm{iH}(v^{\prime})\rangle$ from $v$ ’s in-neighbor $v^{\prime}$

I[v^{\prime}]\leftarrow\mathrm{iH}(v^{\prime})

; if $\mathrm{iH}(v^{\prime})$ ¡ $\mathrm{iH}(v)$ then

flag\leftarrow

True

After receiving all messages

8 if $flag=True$ then

9 if $\mathcal{H}(I)$ ¡ $\mathrm{iH}(v)$ then

\mathrm{iH}(v)\leftarrow\mathcal{H}(I)

;

flag\leftarrow

False; Send message

\langle v

\mathrm{iH}(v)\rangle

to all out-neighbors of

v

When no vertex broadcasts messages

return $k_{max}(v)\leftarrow\mathrm{iH}(v)$ ;

Algorithm 2 Computing

k_{max}(v)

Phase II: Computing the upper bounds of $l_{max}(k,v)$ . In a distributed setting, the computation of $l_{max}(k,v)$ faces technical challenges. It is difficult to compute $l_{max}(k,v)$ by making use of only the “intermediate” neighborhood information. Because some vertices $u\in N_{G}(v)$ may become disqualified and thus be removed from the candidate set of $(k,l_{max}(k,v))$ -core during the iteration process. Even worse, verifying the candidacy of $u$ requires a large number of message exchanges between vertices. To address these issues, we design a novel upper bound for $l_{max}(k,v)$ , denoted by $l_{upp}(k,v)$ , which can be iteratively computed with “intermediate” corenesses to reduce communication costs. To start with, we give a new definition of $n$ -order out-H-index, similar to Definition 4.2.

Definition 4.0.

( $n$ -order out-H-index). Given a vertex $v$ in $G$ , the $n$ -order out-H-index of $v$ , denoted as $\mathrm{oH}_{G}^{(n)}(v)$ , is defined as

(3)

\mathrm{oH}_{G}^{(n)}(v)=\begin{cases}deg^{out}_{G}(v),&n=0\\ \mathcal{H}(O),&n>0\end{cases}

where $O=\{\mathrm{oH}_{G}^{(n-1)}(u)|u\in N^{out}_{G}(v)\}$ .

Based on $\mathrm{oH}_{G}^{(n)}(v)$ , we have the following theorem.

Theorem 4.6.

Given a vertex $v$ in $G$ and an integer $k\in[0,k_{max}(v)]$ , let $G[k]$ be the subgraph of $G$ induced by the vertices in $V_{k}=\{u\ |\ u\in V_{G}\wedge k_{max}(u)\geq k\}$ . Then, it holds that

(4)

l_{max}(k,v)\leq\lim_{n\rightarrow\infty}\mathrm{oH}_{G[k]}^{(n)}(v).

Proof.

Similar to Theorem 2, we can prove $\lim\limits_{n\rightarrow\infty}\mathrm{oH}_{G[k]}^{(n)}(v)$ $=l^{\prime}$ such that $v\in(0,l^{\prime})$ -core of $G[k]$ but $v\notin(0,l^{\prime}+1)$ -core of $G[k]$ . Then, we have the following relationship for the D-cores of $G[k]$ : $(k,l_{max}(k,v))$ -core $\subseteq$ $(0,l_{max}(k,v))$ -core $\subseteq$ $(0,l^{\prime})$ -core. According to the partial nesting property of D-core, $l^{\prime}\geq l_{max}(k,v)$ holds. ∎

Theorem 4 indicates that $\lim\limits_{n\rightarrow\infty}\mathrm{oH}_{G[k]}^{(n)}(v)$ can be served as an upper bound of $l_{max}(k,v)$ , i.e., $l_{upp}(k,v)=\lim\limits_{n\rightarrow\infty}\mathrm{oH}_{G[k]}^{(n)}(v)$ . Thus, we can compute $l_{upp}(k,v)$ by iteratively calculating the $n$ -order out-H-index of $v$ in the directed subgraph $G[k]$ . Moreover, to efficiently compute $l_{upp}(k,v)$ for all values $k\in[0,k_{max}(v)]$ in parallel, our distributed algorithm should send updating messages in batch and compute $l_{upp}(k,v)$ simultaneously.

Input: directed graph

G

, vertex

v

k_{max}(v)

Output: the upper bounds

l_{upp}(k,v)

for

k\in[0,k_{max}(v)]

Initializations

1 for each $k\in[0,k_{max}(v)]$ do

2 for each $v^{\prime}\in N^{out}_{G[k]}(v)$ do

I[k][v^{\prime}]\leftarrow 0

;

\mathrm{oH}_{v}[k]\leftarrow deg^{out}_{G[k]}(v)

;

change[k]\leftarrow

True;

6Send message

\langle v

\mathrm{oH}_{v}[\cdot],change[\cdot]\rangle

to all in-neighbors of

v

;

On receiving message $\langle v^{\prime}$ , $\mathrm{oH}_{v^{\prime}}[\cdot],change[\cdot]\rangle$ from $v$ ’s out-neighbor $v^{\prime}$

8 for each $k\in[0,k_{max}(v)]$ do

9 if $change[k]=True$ then

I[k][v^{\prime}]\leftarrow\mathrm{oH}[k]

;

11 if $\mathrm{oH}_{v^{\prime}}[k]$ ¡ $\mathrm{oH}_{v}[k]$ then

flag[k]\leftarrow

True

After receiving all messages

16 for each $k\in[0,k_{max}(v)]$ do

17 if $flag[k]=True$ then

18 if $\mathcal{H}(I[k])$ ¡ $\mathrm{oH}_{v}[k]$ then

\mathrm{oH}_{v}[k]\leftarrow\mathcal{H}(I[k])

;

change[k]\leftarrow

False;

22if $\exists k\in[0,k_{max}(v)]$ , $change[k]=True$ then

23 Send message

\langle v

\mathrm{oH}_{v}[\cdot],change[\cdot]\rangle

to all in-neighbors of

v

;

When no vertex broadcasts messages

l_{upp}[\cdot]\leftarrow\mathrm{oH}_{v}[\cdot]

;

Algorithm 3 Computing Upper Bounds

l_{upp}(k,v)

Based on the above discussion, we propose a distributed algorithm for computing the upper bounds $l_{upp}(k,v)$ . Algorithm 3 presents the detailed procedure. First, it initializes the $n$ -order out-H-index of $v$ for each possible value of $k$ and sends them to $v$ ’s in-neighbors (lines 1-5). When $v$ receives a message from its out-neighbor $v^{\prime}$ , $v$ updates the $n$ -order out-H-index of $v^{\prime}$ for subsequent calculation (lines 6-10). After receiving all messages, $v$ updates its own $n$ -order out-H-index for each possible value of $k$ (lines 11-14). If any $n$ -order out-H-indexes of $v$ decreases, $v$ informs all its in-neighbors (lines 15-16). Finally, when there is no vertex broadcasting messages, we get the upper bound $l_{upp}(k,v)$ for each $k\in[0,k_{max}(v)]$ (line 17).

Example 4.0.

We illustrate Algorithm 3 by continuing Example 4.4. As shown in Table 1, since $k_{max}(v_{1})=2$ , we first initialize the $0$ -order out-H-indexes of $v_{1}$ as $\mathrm{oH}_{G[k]}^{(0)}(v_{1})=3$ for each $k\in\{0,1,2\}$ . After one iteration of computing the $n$ -order out-H-indexes, all $1$ -order out-H-indexes of $v_{1}$ have converged to 2. Thus, we have $\mathrm{oH}_{G[0]}^{(1)}(v_{1})=2$ , $\mathrm{oH}_{G[1]}^{(1)}(v_{1})=2$ , $\mathrm{oH}_{G[2]}^{(1)}(v_{1})=2$ .

Phase III: Refining $l_{upp}(k,v)$ to $l_{max}(k,v)$ . Finally, we present the third phase of refining the upper bound $l_{upp}(k,v)$ to get the exact anchored coreness $l_{max}(k,v)$ . To this end, we first present the following theorem.

Theorem 4.8.

Given a vertex $v$ in $G$ and an integer $k$ , if $(k,l_{upp}(k,v))$ is an anchored coreness of $v$ , it should satisfy two constraints on in-neighbors and out-neighbors: (i) $v$ has at least $k$ in-neighbors $v^{\prime}$ such that $l_{upp}(k,v^{\prime})\geq l_{upp}(k,v)$ ; and (ii) $v$ has at least $l_{upp}(k,v)$ out-neighbors $v^{\prime\prime}$ such that $l_{upp}(k,v^{\prime\prime})\geq l_{upp}(k,v)$ .

Theorem 4.8 obviously holds, according to Def. 3.1 of D-core and the upper bound $l_{upp}(k,v)\geq l_{max}(k,v)$ . Based on Theorem 4.8, we can refine $l_{upp}(k,v)$ decrementally by checking the upper bounds $l_{upp}(k,v^{\prime})$ ’s of $v$ ’s in- and out-neighbors. If $v$ satisfies the above two constraints in Theorem 4.8, $l_{upp}(k,v)$ keeps unchanged; otherwise, $l_{upp}(k,v)$ decreases by 1 as the current $(k,l_{upp}(k,v))$ is not an anchored coreness of $v$ . The above process needs to repeat for all vertices and all possible values of $k$ , until none of $(k,l_{upp}(k,v))$ changes. Finally, we obtain all anchored corenesses $\{\Phi(v)|v\in V_{G}\}$ .

Algorithm 4 outlines the procedure of the distributed refinement phase. First, the algorithm initializes some auxiliary structures and broadcast $v$ ’s upper bound $l_{upp}(k,v)$ for each possible $k\in[0,k_{max}(v)]$ (lines 1-3). When it receives a message from $v$ ’s neighbor $v^{\prime}$ , the algorithm updates the upper bound set for $v^{\prime}$ (lines 4-7). After receiving all messages, the algorithm refines $l_{upp}(k,v)$ for each $k\in[0,k_{max}(v)]$ based on Theorem 4.8 (lines 8-13). If there exists such a $(k,l_{upp}(k,v))$ whose $l_{upp}(k,v)$ is decreased, the algorithm broadcasts the new upper bound set to $v$ ’s neighbors (lines 14-15). As soon as there are no vertex broadcasting messages, Algorithm 4 terminates and we get all anchored corenesses of $v$ (lines 16-17).

Example 4.0.

Continue Example 4.7 to illustrate Algorithm 4 in Phase III, which refines the upper bound $l_{upp}(k,v_{1})$ to the exact $l_{max}(k,v_{1})$ . For $k_{max}(v_{1})=3$ and each $k\in[0,k_{max}(v_{1})]$ , Table 1 reports the final results $l_{upp}(k,v_{1})$ $=$ $l_{max}(k,v_{1})=2$ . Therefore, the entire anchored corenesses of $v_{1}$ are $\Phi(v_{1})=\{(0,2),(1,2),(2,2)\}$ .

Input: graph

G

, vertex

v

k_{max}(v)

, upper bounds

l_{upp}[\cdot]

Output: the entire anchored corenesses of

v

\Phi(v)

Initializations

1 for each $k\in[0,k_{max}(v)]$ do

change[k]\leftarrow

True;

l[k][v]\leftarrow 0

;

4Send message

\langle v

l_{upp}[\cdot],change[\cdot]\rangle

to all neighbors of

v

;

On receiving message $\langle v^{\prime}$ , $l_{upp}[\cdot],change[\cdot]\rangle$ from $v$ ’s neighbor $v^{\prime}$

6 for each $k\in[0,k_{max}(v)]$ do

7 if $change[k]=True$ then

l[k][v^{\prime}]\leftarrow l_{upp}[k]

;

flag[k]\leftarrow

True

After receiving all messages

12 for each $k\in[0,k_{max}(v)]$ do

13 if $flag[k]=True$ then

V^{\prime}\leftarrow\{v^{\prime}|v^{\prime}\in N^{in}_{G}(v)\wedge l[k][v^{\prime}]\geq l_{upp}[k]\}

;

V^{\prime\prime}\leftarrow\{v^{\prime\prime}|v^{\prime\prime}\in N^{out}_{G}(v)\wedge l[k][v^{\prime\prime}]\geq l_{upp}[k]\}

; if $|V^{\prime}|<k$ or $|V^{\prime\prime}|<l_{upp}[k]$ then

l_{upp}[k]\leftarrow l_{upp}[k]-1

;

change[k]\leftarrow

True;

19if $\exists$ $k\in[0,k_{max}(v)]$ such that $change[k]=True$ then

20 Send message

\langle v

l_{upp}[\cdot],change[\cdot]\rangle

to all neighbors of

v

;

When no vertex broadcasts messages

23 for each $k\in[0,k_{max}(v)]$ do

24 Add

(k,l_{upp}[k])

to the anchored corenesses

\Phi(v)

;

Algorithm 4 Anchored Coreness Refinement

4.3. Algorithm Analysis and Extension

Complexity analysis. We first analyze the time, space, message complexities of Algorithm 1. Let the edge size $|E_{G}|=m$ , the maximum in-degree $\Delta_{in}=\max_{v\in V_{G}}deg^{in}_{G}(v)$ , the maximum out-degree $\Delta_{out}=\max_{v\in V_{G}}deg^{out}_{G}(v)$ , and the maximum degree $\Delta=\max_{v\in V_{G}}deg_{G}(v)$ . In addition, let $R_{AC-I}$ , $R_{AC-II}$ , and $R_{AC-III}$ be the number of convergence rounds required by the three phases in Algorithm 1, respectively. Let be the total number of converge rounds in Algorithm 1 as $R_{AC}=R_{AC-I}+R_{AC-II}+R_{AC-III}$ and $R_{AC}\in O(\Delta)$ . We have the following theorems (their detailed proofs can be found in (Liao et al., 2022)):

Theorem 4.10.

(Time and Space Complexities) Algorithm 1 takes $O(R_{AC}\cdot\Delta_{in}\cdot\Delta)$ time and $O(\Delta_{in}\cdot\Delta)$ space. The total time and space complexities for computing all vertices’ corenesses are $O(R_{AC}\cdot\Delta_{in}\cdot m)$ and $O(\Delta_{in}\cdot m)$ , respectively.

Theorem 4.11.

(Message Complexity) The message complexity (i.e., the total number of times that a vertex send messages) of Algorithm 1 is $O(\Delta_{in}\cdot\Delta_{out}\cdot\Delta)$ . The total message complexity for computing all vertices’ corenesses is $O(\Delta_{in}\cdot\Delta_{out}\cdot m)$ .

Block-centric extension of Algorithm 1. We further discuss to extend the vertex-centric D-core decomposition in Algorithm 1 to the block-centric framework. The extension can be easily achieved by changing the update operation after receiving all messages. That is, instead of having Algorithms 2, 3 and 4 perform the update operation only once after receiving all messages, in the block-centric framework, the algorithms should update the H-indexes multiple times until the local block converges. For example, Algorithm 2 computes the $n$ -order in-H-index of $v$ only once in each round (lines 10-13). In contrast, the block-centric version should compute $v$ ’s $n$ -order in-H-index iteratively with $v$ ’s in-neighbors, that are located in the same block as $v$ , before broadcasting messages to other blocks to enter the next round. Note that in the worst case, for block-centric algorithms, every vertex converges within the block after computing the in-H-index/out-H-index only once, which is the same as vertex-centric algorithms. Therefore, the worse-case cost of block-centric algorithms is the same as that of vertex-centric algorithms.

5. Distributed Skyline Coreness-Based Algorithm

In this section, we propose a novel concept of skyline coreness, which is more elegant than the anchored coreness. Then, we give a new definition of $n$ -order D-index for computing skyline corenesses. Based on the D-index, we propose a distributed algorithm for skyline coreness computation to accomplish D-core decomposition.

5.1. Skyline Coreness

Motivation. The motivation for proposing another skyline coreness lies in an important observation that the anchored corenesses $(k,l)$ ’s may have redundancy. For example, in Figure 1, the vertex $v_{2}$ has four anchored corenesses, i.e., $\Phi(v_{2})=\{$ (0, 2), (1, 2), (2, 2), (3, 1) $\}$ . According to D-core’s partial nesting property, if $v_{2}\in(2,2)$ -core, $v_{2}$ must also belong to $(0,2)$ -core and $(1,2)$ -core. Thus, it is sufficient and more efficient to keep the coreness of $v_{2}$ as $\{$ (2, 2), (3, 1) $\}$ , which uses (2, 2)-core to represent other two D-cores (0, 2)-core and (1, 2)-core. This elegant representation is termed as skyline coreness, which can facilitate space saving and fast computation of D-core decomposition. Based on the above observation, we formally define the dominance operation and skyline coreness as follows.

Definition 5.0.

(Dominance Operations). Given two coreness pairs $(k,l)$ and $(k^{\prime},l^{\prime})$ , we define two operations ‘ $\prec$ ’ and ‘ $\preceq$ ’ to compare them: (i) $(k^{\prime},l^{\prime})\prec(k,l)$ indicates that $(k,l)$ dominates $(k^{\prime},l^{\prime})$ , i.e., either $k^{\prime}<k$ , $l^{\prime}\leq l$ hold or $k^{\prime}\leq k$ , $l^{\prime}<l$ hold; and (ii) $(k^{\prime},l^{\prime})\preceq(k,l)$ represents that $k^{\prime}\leq k$ , $l^{\prime}\leq l$ hold.

Definition 5.0.

(Skyline Coreness). Given a vertex $v$ in a directed graph $G$ and a coreness pair $(k,l)$ , we say that $(k,l)$ is a skyline coreness of $v$ iff it satisfies that (i) $v\in(k,l)$ -core; and (ii) there exist no other pair $(k^{\prime},l^{\prime})$ such that $(k,l)\prec(k^{\prime},l^{\prime})$ and $v\in(k^{\prime},l^{\prime})$ -core. We use ${\mathsf{SC}}(v)$ to denote the entire skyline corenesses of the vertex $v$ , i.e., ${\mathsf{SC}}(v)=\{(k,l)\ |\ (k,l)$ is a skyline coreness of $v\}$ .

In other words, the skyline coreness of a vertex $v$ is a non-dominated pair $(k,l)$ whose corresponding $(k,l)$ -core contains $v$ . For instance, vertex $v_{2}$ has the skyline corenesses ${\mathsf{SC}}(v_{2})=\{(2,2),(3,1)\}$ in Figure 1, reflecting that no other coreness $(k,l)$ can dominate any skyline coreness in ${\mathsf{SC}}(v_{2})$ . According to D-core’s partial nesting property, for a skyline coreness $(k,l)$ of $v$ , $v$ is contained in the $(k^{\prime},l^{\prime})$ -core with $(k^{\prime},l^{\prime})\prec(k,l)$ . Therefore, if we compute all skyline corenesses ${\mathsf{SC}}(v)$ for a vertex $v$ , we can find all D-cores the vertex $v$ belonging to. As a result, the problem of D-core decomposition is equivalent to computing the entire skyline corenesses for every vertex in $G$ , i.e., $\{{\mathsf{SC}}(v)|v\in V_{G}\}$ .

Structural properties of skyline coreness. We analyze the structural properties of skyline coreness.

Property 5.1.

Let $(k_{v},l_{v})$ be a skyline coreness of $v$ , the following properties hold:

(I)

There exist $k_{v}$ in-neighbors $v^{\prime}\in N_{G}^{in}(v)$ such that $(k_{v},l_{v})$ $\preceq$ $(k_{v^{\prime}},l_{v^{\prime}})$ , and also $l_{v}$ out-neighbors $v^{\prime\prime}\in N_{G}^{out}(v)$ such that $(k_{v},l_{v})$ $\preceq$ $(k_{v^{\prime\prime}},l_{v^{\prime\prime}})$ .
(II)

Two cases cannot hold in either way: there exist $k_{v}+1$ in-neighbors $v^{\prime}\in N_{G}^{in}(v)$ such that $(k_{v}+1,l_{v})$ $\preceq$ $(k_{v^{\prime}},l_{v^{\prime}})$ , or $l_{v}$ out-neighbors $v^{\prime\prime}\in N_{G}^{out}(v)$ such that $(k_{v}+1,l_{v})$ $\preceq$ $(k_{v^{\prime\prime}},l_{v^{\prime\prime}})$ .
(III)

Two cases cannot hold in either way: there exist $k_{v}$ in-neighbors $v^{\prime}\in N_{G}^{in}(v)$ such that $(k_{v},l_{v}+1)$ $\preceq$ $(k_{v^{\prime}},l_{v^{\prime}})$ , or $l_{v}+1$ out-neighbors $v^{\prime\prime}\in N_{G}^{out}(v)$ such that $(k_{v},l_{v}+1)$ $\preceq$ $(k_{v^{\prime\prime}},l_{v^{\prime\prime}})$ .

Proof.

First, we prove Property 5.1(I). Let $D_{1}$ be the $(k_{v},l_{v})$ -core of $G$ , we have $deg_{D_{1}}^{in}(v)\geq k_{v}$ and $deg_{D_{1}}^{out}(v)\geq l_{v}$ . For $\forall v^{\prime}\in(N_{D_{1}}^{in}(v)\cup N_{D_{1}}^{out}(v))$ , $v^{\prime}$ may be in the $(k^{\prime},l^{\prime})$ -core with $k_{v}\leq k^{\prime}\leq k_{v^{\prime}}$ and $l_{v}\leq l^{\prime}\leq l_{v^{\prime}}$ . Therefore, (I) of Property 5.1 holds.

Next, we prove Property 5.1(II). Assume that $v$ has $k_{v}+1$ in-neighbors $V^{\prime}$ and $l_{v}$ out-neighbors $V^{\prime\prime}$ satisfying the constraints of (II). Then, $V^{\prime}$ and $V^{\prime\prime}$ must be in the $(k_{v}+1,l_{v})$ -core. Moreover, $v$ $\cup$ $(k_{v}+1,l_{v})$ -core is also a $(k_{v}+1,l_{v})$ -core. Hence, $(k_{v}+1,l_{v})$ rather than $(k_{v},l_{v})$ is a skyline coreness of $v$ , which contradicts to the condition of Property 5.1. Therefore, the assumption does not hold.

Finally, Property 5.1(III) can be proved in the same way of Property 5.1(II). It is omitted due to space limitation. ∎

For example, $(2,2)$ is a skyline coreness of $v_{2}$ in Figure 1. The in-neighbors of $v_{2}$ are $v_{3}$ , $v_{4}$ , $v_{5}$ , and $v_{7}$ , whose skyline corenesses are $\{(3,3)\}$ , $\{(2,2)\}$ , $\{(3,3)\}$ , and $\{(2,2),(3,1)\}$ , respectively. These four vertices all have skyline corenesses that dominate or are identical to $v_{2}$ ’s skyline coreness $(2,2)$ . But only two vertices $v_{3}$ and $v_{5}$ have skyline corenesses that dominate $(k_{v_{2}}+1,l_{v_{2}})=(3,2)$ . Hence, $(3,2)$ is not a skyline coreness of $v_{2}$ . Property 5.1 reveals the relationships among vertices’ skyline corenesses, based on which we propose an algorithm for skyline coreness computation in the next subsection.

5.2. Distributed Skyline Corenesses Computing

We begin with a novel concept of D-index.

Definition 5.0.

(D-index). Given two sets of pairs of integers $R_{in}$ , $R_{out}$ $\subseteq\mathbb{N}_{0}\times\mathbb{N}_{0}$ , the D-index of $R_{in}$ and $R_{out}$ is denoted by $\mathcal{D}(R_{in},R_{out})$ $\subseteq\mathbb{N}_{0}\times\mathbb{N}_{0}$ , where each element $(k,l)\in\mathcal{D}(R_{in},R_{out})$ satisfies: (i) there exist at least $k$ pairs $(k_{i},l_{i})\in R_{in}$ such that $(k,l)\preceq(k_{i},l_{i})$ for $1\leq i\leq k$ ; (ii) there exist at least $l$ pairs $(k_{j},l_{j})\in R_{out}$ such that $(k,l)\preceq(k_{j},l_{j})$ for $1\leq j\leq l$ ; (iii) there does not exist another $(k^{\prime},l^{\prime})\in\mathbb{N}_{0}\times\mathbb{N}_{0}$ satisfying the above conditions (1) and (2), and $(k,l)\prec(k^{\prime},l^{\prime})$ .

The idea of D-index is very similar to H-index. Actually, the D-index is an extension of H-index to handle two-dimensional integer pairs. For $\mathcal{D}(R_{in},R_{out})$ , it finds a series of $(k,l)$ skyline pairs such that each has at least $k$ dominated pairs in $R_{in}$ and at least $l$ dominated pairs in $R_{out}$ , using a joint indexing way. For example, let $R_{in}=\{(1,1),(2,2)\}$ and $R_{out}=\{(3,3),(4,4)\}$ , then $\mathcal{D}(R_{in},R_{out})=\{(1,2)\}$ . Note that $\mathcal{D}(R_{in},R_{out})\neq\mathcal{D}(R_{out},R_{in})$ may hold for the D-index, as $\mathcal{D}(R_{out},R_{in})=\{(2,1)\}\neq\{(1,2)\}=\mathcal{D}(R_{in},R_{out})$ in this example. Next, we introduce another concept of $n$ -order D-index for distributed D-core decomposition.

Definition 5.0.

( $n$ -order D-index). Given a vertex $v$ in $G$ , the $n$ -order D-index of $v$ , denoted by $D^{(n)}(v)$ $\subseteq\mathbb{N}_{0}\times\mathbb{N}_{0}$ , is defined as

(5)

D^{(n)}(v)=\begin{cases}\{(deg_{G}^{in}(v),deg_{G}^{out}(v))\},&n=0\\ \mathcal{D}(R_{in}^{(n-1)}(v),R_{out}^{(n-1)}(v)),&n>0\end{cases}

Here, $R_{in}^{(n-1)}(v)$ $=$ $\{(k_{u},l_{u})\in D^{(n-1)}(u)\ |\ u\in N_{G}^{in}(v)\}$ and $R_{out}^{(n-1)}(v)$ $=$ $\{(k_{u},l_{u})\in D^{(n-1)}(u)\ |\ u\in N_{G}^{out}(v)\}$ . Note that $D^{(n)}(v)$ is the largest non-dominated D-index such that it dominates or at least is identical to $\mathcal{D}(R_{in}^{(n-1)}(v),R_{out}^{(n-1)}(v))$ , for each $R_{in}^{(n-1)}(v)$ $\in D^{(n-1)}(u_{1})\times\ldots\times D^{(n-1)}(u_{i})$ when $N_{G}^{in}(v)=\{u_{1},\ldots,u_{i}\}$ and each $R_{out}^{(n-1)}(v)$ $\in D^{(n-1)}(u_{1})\times\ldots\times D^{(n-1)}(u_{j})$ when $N_{G}^{out}(v)=\{u_{1},\ldots,u_{j}\}$ .

The $n$ -order D-index $D^{(n)}(v)$ may contain more than one pair $(k,l)$ , i.e., $|D^{(n)}(v)|\geq 1$ . Note that $R_{in}^{(n-1)}(v)$ and $R_{out}^{(n-1)}(v)$ consist of one pair $(k_{u},l_{u})$ for each in-neighbor $u\in N^{in}_{G}(v)$ and each out-neighbor $u\in N^{out}_{G}(v)$ , respectively. Therefore, there exist multiple combinations of $R_{out}^{(n-1)}(v)$ and $R_{out}^{(n-1)}(v)$ . Moreover, $D^{(n)}(v)$ should consider all combinations of $R_{out}^{(n-1)}(v)$ and $R_{out}^{(n-1)}(v)$ , and finally select the “best” choice as the largest non-dominated set of D-index $\mathcal{D}(R_{in}^{(n-1)}(v),R_{out}^{(n-1)}(v))$ .

For two pair sets $R_{1}$ , $R_{2}$ $\subseteq\mathbb{N}_{0}\times\mathbb{N}_{0}$ , we say $R_{2}\preceq R_{1}$ if and only if $\forall(k,l)\in R_{2}$ , $\exists(k^{\prime},l^{\prime})\in R_{1}$ such that $(k,l)$ $\preceq$ $(k^{\prime},l^{\prime})$ . Then, we have the following theorem of $n$ -order D-index convergence.

Theorem 5.5 ( $n$ -order D-index Convergence).

For a vertex $v$ in $G$ , it holds that

(6)

{\mathsf{SC}}(v)=\lim_{n\rightarrow\infty}D^{(n)}(v)

Proof.

The proof can be similarly done as Theorem 2. ∎

By Theorem 6, we can compute vertices’ skyline corenesses via iteratively computing their $n$ -order D-indexes until convergence.

5.3. Algorithms and Optimizations

A naive implementation of the distributed algorithm to compute $D^{(n)}(v)$ may suffer from serious performance problems, due to the combinatorial blow-ups in a large number of choices of $R_{in}^{(n-1)}(v)$ and $R_{out}^{(n-1)}(v)$ . Thus, we first tackle three critical issues for fast distributed computation of $n$ -order D-index.

Optimization-1: Fast computation of D-index $\mathcal{D}(R_{in},R_{out})$ . The first issue is, given $R_{in}$ and $R_{out}$ , how to compute D-index $\mathcal{D}(R_{in},R_{out})$ . A straightforward way is to list all candidate pairs and return the pairs satisfying Def. 5.3. According to conditions (1)&(2) in Def. 5.3, if $(k,l)$ belongs to D-index, there exists at least $k$ pairs of $R_{in}$ satisfying the dominance relationship. Therefore, $0\leq k\leq|R_{in}|$ . Similarly, $0\leq l\leq|R_{out}|$ . Thus, there are a total of $(|R_{1}|+1)\cdot(|R_{2}|+1)$ candidate pairs to be checked, which is costly for large $|R_{1}|$ and $|R_{2}|$ . In addition, the basic operation of D-index computation is frequently invoked in the process of computing $D^{(n)}(v)$ . Hence, it is necessary to develop faster algorithms. To this end, we try to reduce the pairs for examination as many as possible through the following two optimizations.

•

Reducing the ranges of $k$ and $l$ . For conditions (1)&(2) in Def. 5.3, if $(k,l)$ belongs to $\mathcal{D}(R_{in},R_{out})$ , there exist at least $k$ pairs $(k_{i},l_{i})$ in $R_{in}$ such that $(k,l)\preceq(k_{i},l_{i})$ . In other words, at least $k$ pairs $(k_{i},l_{i})$ in $R_{in}$ have $k_{i}\geq k$ . Thus, the maximum $k$ is denoted by $k_{max}$ $=$ $\mathcal{H}(I_{k})$ , where $I_{k}=\{k_{i}\ |\ (k_{i},l_{i})\in R_{in}\}$ . Similarly, we can also obtain the maximum $l$ , denoted by $l_{max}$ , as $l_{max}$ $=$ $\mathcal{H}(O_{l})$ , where $O_{l}=\{l_{j}\ |\ (k_{j},l_{j})\in R_{out}\}$ . Since $\mathcal{H}(I_{k})\leq|R_{in}|$ and $\mathcal{H}(O_{l})\leq|R_{out}|$ , the total number of candidate pairs decreases.
•

Pruning disqualified candidate pairs. Let $(k,l)$ $\in$ $\mathcal{D}(R_{in},R_{out})$ . According to condition (3) in Def. 5.3, if $(k^{\prime},l^{\prime})$ $\in$ $\mathcal{D}(R_{in},R_{out})$ with $k^{\prime}<k$ , $l^{\prime}$ must satisfy $l^{\prime}>l$ . Otherwise, $(k^{\prime},l^{\prime})\prec(k,l)$ and $(k^{\prime},l^{\prime})$ $\notin$ $\mathcal{D}(R_{in},R_{out})$ . This rule can be used to prune disqualified pairs based on the found skyline corenesses.

Optimization-2: Fast computation of n-order D-index $D^{(n)}(v)$ . The second issue is the computation of $D^{(n)}(v)$ . By Def. 5.4, both $R_{in}^{(n-1)}(v)$ and $R_{out}^{(n-1)}(v)$ may have multiple instances. Hence, a straightforward way is to compute the D-index for every instance and finally integrate them together. In total, we need to compute the D-index $O(\prod_{v^{\prime}\in N_{G}^{in}(v)}|D^{(n-1)}(v^{\prime})|$ $\cdot$ $\prod_{v^{\prime\prime}\in N_{G}^{out}(v)}|D^{(n-1)}(v^{\prime\prime})|)$ times, which is very inefficient. Actually, several redundant computations occur due to many independent instances in the D-index computation. For example, in one instance, we have verified that $(k,l)$ belongs to the $n$ -order D-index. Then, there is no need to verify $(k,l)$ in other instances. This motivates us to devise a more efficient method to compute $D^{(n)}(v)$ , which requires D-index computation only once. Specifically, we first compute $k_{max}$ and $l_{max}$ . Then, we enumerate candidate pairs for dominance checking. Here, we highlight two differences from the original D-index computation method.

•

The difference of $k_{max}$ and $l_{max}$ computations. For $k_{max}$ and $l_{max}$ in D-index computation, $I_{k}$ (resp. $O_{l}$ ) is formed by just adding $k_{i}$ (resp. $l_{i}$ ) from each pair in $R_{in}$ . For $n$ -order D-index computation, the vertex’s $(n-1)$ -order D-index may have more than one pairs. We should select the maximum $k_{i}$ and $l_{i}$ among these pairs. Specifically, for $v$ ’s $n$ -order D-index computation, to compute $k_{max}$ , $I_{k}(v)=\{k_{i}\ |\ v^{\prime}\in N_{G}^{in}(v),k_{i}=\max_{(k^{\prime}_{i},l^{\prime}_{i})\in D^{(n-1)}(v^{\prime})}(k^{\prime}_{i})\}$ . In the same way, $I_{l}(v)=\{l_{j}\ |\ v^{\prime}\in N_{G}^{out}(v),l_{j}=\max_{(k^{\prime}_{j},l^{\prime}_{j})\in D^{(n-1)}(v^{\prime})}(l^{\prime}_{j})\}$ .
•

The difference of dominance checking. For a candidate pair $(k,l)$ , the D-index computation should find the pairs in $R_{in}$ and $R_{out}$ that dominate or are identical to $(k,l)$ . To compute $D^{(n)}(v)$ , we should find all $v$ ’s neighbors $v^{\prime}$ whose $(n-1)$ -order D-index has a pair dominating or identical to $(k,l)$ . If $D^{(n-1)}(v^{\prime})$ has multiple pairs, we need to examine the dominance relationship for each of these pairs with $(k,l)$ . Once one pair dominates or is identical to $(k,l)$ , such $v^{\prime}$ is identified.

Input: directed graph

G

, vertex

v

Output: the skyline corenesses

{\mathsf{SC}}(v)

Initializations

1 Compute

\mathrm{iH}^{(\infty)}_{G}(v)

and

\mathrm{oH}^{(\infty)}_{G}(v)

using Algorithm 2;

D_{v}=\{(\mathrm{iH}^{(\infty)}_{G}(v),\mathrm{oH}^{(\infty)}_{G}(v))\}

;

3 Send message

\langle v

D_{v}\rangle

to all neighbors of

v

;

On receiving message $\langle v^{\prime}$ , $D_{v^{\prime}}\rangle$ from $v$ ’s neighbor $v^{\prime}$

D_{k}[v^{\prime}]\leftarrow 0

;

D_{l}[v^{\prime}]\leftarrow 0

;

D[v^{\prime}]\leftarrow D_{v^{\prime}}

;

7 for each $(k,l)\in D_{v^{\prime}}$ do

D_{k}[v^{\prime}]\leftarrow\max(D_{k}[v^{\prime}],k)

;

D_{l}[v^{\prime}]\leftarrow\max(D_{l}[v^{\prime}],l)

;

flag\leftarrow

True;

After receiving all messages

11 if $flag=True$ then

12 Apply Algorithm 6 on

n

-order D-index computation;

13 if $D[v]\neq D$ then

D[v]\leftarrow D

;

D_{v}\leftarrow D

;

15 Send message

\langle v

D_{v}\rangle

to all neighbors of

v

;

When no vertex broadcasts messages

return

{\mathsf{SC}}(v)\leftarrow D[v]

;

Algorithm 5 Distributed Skyline Corenesses Computation Algorithm: routine executed by vertex

v

Output:

v

’s

n

-order D-index

D\leftarrow\varnothing

;

l_{min}\leftarrow 0

;

I_{k}=\{D_{k}[v^{\prime}]|v^{\prime}\in N_{G}^{in}(v)\}

;

k_{max}\leftarrow\mathcal{H}(I_{k})

;

O_{l}=\{D_{l}[v^{\prime}]|v^{\prime}\in N_{G}^{out}(v)\}

;

l_{max}\leftarrow\mathcal{H}(O_{l})

;

5for $k\leftarrow k_{max}$ to $0$ do

l\leftarrow l_{max}

;

7 while $l>l_{min}$ do

V_{1}=\{v^{\prime}|v^{\prime}\in N_{G}^{in}(v),

and

\exists(k^{\prime},l^{\prime})\in D[v^{\prime}],

(k,l)\preceq(k^{\prime},l^{\prime})\}

;

V_{2}=\{v^{\prime}|v^{\prime}\in N_{G}^{out}(v),

and

\exists(k^{\prime},l^{\prime})\in D[v^{\prime}],

(k,l)\preceq(k^{\prime},l^{\prime})\}

;

10 if $|V_{1}|\geq k$ $\wedge$ $|V_{2}|\geq l$ then

l_{min}\leftarrow l

;

D\leftarrow D\cup(k,l)

;

l\leftarrow l-1

;

Algorithm 6

n

-order D-index Computation

Table 2. An illustration of distributed skyline coreness computation using Algorithm 5 on graph

G

in Figure 2.

	Vertices
	$v_{1}$	$v_{2}$	$v_{3}$	$v_{4}$	$v_{5}$	$v_{6}$	$v_{7}$	$v_{8}$
$D^{(0)}(v)$	$\{(2,2)\}$	$\{(2,0)\}$	$\{(2,0)\}$	$\{(2,2)\}$	$\{(2,2)\}$	$\{(2,2)\}$	$\{(1,2)\}$	$\{(2,1)\}$
$D^{(1)}(v)$	$\{(2,2)\}$	$\{(2,0)\}$	$\{(2,0)\}$	$\{(2,2)\}$	$\{(2,2)\}$	$\{(2,2)\}$	$\{(0,2),(1,1)\}$	$\{(1,1),(2,0)\}$
$D^{(2)}(v)$	$\{(2,2)\}$	$\{(2,0)\}$	$\{(2,0)\}$	$\{(2,2)\}$	$\{(2,2)\}$	$\{(2,2)\}$	$\{(0,2),(1,1)\}$	$\{(1,1),(2,0)\}$

Optimization-3: Tight initialization. Finally, we present an optimization for $D^{(n)}(v)$ computation using a tight initialization. In Def. 5.4, the $0$ -order D-index is initialized with the vertex’s in-degree and out-degree. The optimization idea is that if we tightly initialize the vertex’s $0$ -order D-index with smaller values (denoted by $D^{(0)}(v)=(k_{0}(v),l_{0}(v))$ ), the $n$ -order D-index can converge faster to the exact skyline coreness. Here, we highlight two principles to find such $(k_{0}(v),l_{0}(v))$ : (i) $k_{0}(v)\leq\max_{(k_{i},l_{i})\in{\mathsf{SC}}(v)}k_{i}$ and $l_{0}(v)\leq\max_{(k_{i},l_{i})\in{\mathsf{SC}}(v)}l_{i}$ , otherwise the $D^{(n)}(v)$ cannot converge to ${\mathsf{SC}}(v)$ ; (ii) $(k_{0}(v),l_{0}(v))$ should be easy to compute in distributed settings. As a result, we present the following theorem.

Theorem 5.6.

For any vertex $v$ in $G$ , it holds that $k_{max}(v)\geq\max\{k_{i}\ |\ (k_{i},l_{i})\in{\mathsf{SC}}(v)\}$ and $l_{max}(v)\geq\max\{l_{i}\ |\ (k_{i},l_{i})\in{\mathsf{SC}}(v)\}$ , where $l_{max}(v)$ $=$ $\max\{l\ |\ v\in(0,l)\text{-core }\wedge v\notin(0,l+1)\text{-core}\}$ .

Theorem 5.6 offers two tight upper bounds for $k_{0}$ and $l_{0}$ , i.e., $k_{max}(v)$ and $l_{max}(v)$ , respectively. In addition, according to Theorems 2 and 4, $k_{max}(v)$ and $l_{max}(v)$ can be computed by iteratively computing $v$ ’s $n$ -order in-H-index and out-H-index, respectively. Therefore, we initialize $D^{(0)}(v)=(k_{max}(v),l_{max}(v))$ .

Algorithms. Based on the above theoretical analytics and optimizations, we present the distributed skyline corenesses computation algorithm in Algorithm 5. At the initialization phase, the algorithm computes $\mathrm{iH}^{(\infty)}_{G}(v)$ and $\mathrm{oH}^{(\infty)}_{G}(v)$ using Algorithm 2 and uses them to initialize the 0-order D-index of $v$ , which is broadcast to all neighbors of $v$ (lines 1-3). When $v$ receives a message from its neighbor $v^{\prime}$ , Algorithm 5 updates the $n$ -order D-index of $v^{\prime}$ that is stored in $v$ ’s node, and finds the maximum values in each pair of $k$ and $l$ (lines 4-8). After $v$ receives all messages, Algorithm 5 computes the $n$ -order D-index for $v$ , which is described in Algorithm 6. Then, it broadcasts to all neighbors of $v$ if the $n$ -order D-index changes (lines 9-13). When there is no vertex broadcasting messages, Algorithm 5 returns the latest $n$ -order D-index as skyline corenesses (line 14).

Next, we present the procedure of Algorithm 6 for $n$ -order D-index computation. It first computes $k_{max}$ and $l_{max}$ as shown in Optimization-1 and Optimization-2 (lines 2-3), which help to determine the range of candidate pairs. Then, the algorithm enumerates all candidate pairs $(k,l)$ and examines whether $(k,l)$ belongs to the $n$ -order D-index of $v$ (lines 6-11). Note that $l_{min}$ keeps the minimal value of $l$ for the remaining candidate pairs, which is used to prune disqualified pairs.

Example 5.0.

We use the graph $G$ in Figure 2 to illustrate Algorithm 5. Table 2 reports the process of computing skyline corenesses. Take vertex $v_{7}$ as an example. First, the 0-order D-index of $v_{7}$ is initialized with $\{(1,2)\}$ , i.e., $D^{(0)}(v_{7})=\{(1,2)\}$ . Then, we iteratively compute the $n$ -order D-index for $v_{7}$ . We can observe that after one iteration only, the $1$ -order D-index of $v_{7}$ has converged as $D^{(2)}(v_{7})=D^{(1)}(v_{7})=$ $\{(0,2),(1,1)\}$ . Thus, the entire skyline corenesses of $v_{7}$ are ${\mathsf{SC}}(v_{7})=$ $\{(0,2),(1,1)\}$ .

5.4. Algorithm Analysis and Extension

Complexity analysis. Let $R_{SC}$ be the number of convergence rounds taken by Algorithm 5. In practice, our algorithms achieve $R_{SC}\leq R_{AC}\ll\Delta$ on real datasets. We show the time, space, and message complexities of Algorithm 5 below.

Theorem 5.8.

(Time and Space Complexities) Algorithm 5 takes $O(R_{SC}\cdot\Delta_{in}\cdot\Delta_{out})$ time and $O(\Delta\cdot\min\{\Delta_{in},\Delta_{out}\})$ space. The total time and space complexities for computing all vertices’ corenesses are $O(R_{SC}\cdot\Delta_{in}\cdot m)$ and $O(\min\{\Delta_{in},\Delta_{out}\}\cdot m)$ , respectively.

Theorem 5.9.

(Message Complexity) The message complexity of Algorithm 5 is $O(\Delta^{2})$ . The total message complexity for computing all vertices’ corenesses is $O(\Delta\cdot m)$ .

Through the above analysis, we can see that the skyline coreness-based approach in Algorithm 5 takes less space and runs much faster than the anchored coreness approach in Algorithm 1.

Block-centric extension. Algorithm 5 can be easily extended to the block-centric framework. The only difference is that each machine iteratively computes the $n$ -order D-index locally until the algorithm converges within the local block, before broadcasting to other blocks (lines 9-13 of Algorithm 5).

6. Performance Evaluation

In this section, we empirically evaluate our proposed algorithms. We conduct our experiments on a collection of Amazon EC2 r5.2x large instances, each powered by 8 vCPUs and 64GB memory. The network bandwidth is up to 10G Gb/s. All experiments are implemented in C++ on the Ubuntu 18.04 operating system.

Datasets. We use 11 real-world graphs in our experiments. Table 3 shows the statistics of these graphs. Specifically, Wiki-vote¹¹1http://snap.stanford.edu/data/index.html is a voting graph; Email-EuAll¹ is a communication graph; Amazon¹ is a product co-purchasing graph; Hollywood² is an actors collaboration graph; Pokec¹, Live Journal¹, and Slashdot¹ are social graphs; Citation¹ is a citation graph; UK-2002²²2http://law.di.unimi.it/datasets.php, IT-2004², and UK-2005² are web graphs.

Table 3. Statistics of the datasets (

\bm{deg_{avg}}

represents the average degree; K =

10^{3}

, M =

10^{6}

, and B =

10^{9}

)

Dataset	Abbr.	$\bm{\|V_{G}\|}$	$\bm{\|E_{G}\|}$	$\bm{deg_{avg}}$	$\bm{k_{max}}$	$\bm{l_{max}}$
Wiki-vote	WV	7.1K	103.6K	14.57	19	15
Email-EuAll	EE	265.2K	420K	1.58	28	28
Slashdot	SL	82.1K	948.4K	11.54	54	9
Amazon	AM	400.7K	3.2M	7.99	10	10
Citation	CT	3.7M	16.5M	4.37	1	1
Pokec	PO	1.6M	30.6M	18.75	32	31
Live Journal	LJ	4.8M	69.0M	14.23	253	254
Hollywood	HW	2.1M	228.9M	105.00	1,297	99
UK-2002	UK2	18.5M	298.1M	16.09	942	99
UK-2005	UK5	39.4M	936.3M	23.73	584	99
IT-2004	IT	41.2M	1.1B	27.87	3,198	990

Algorithms. We compare five algorithms in our experiments.

•

AC-V and AC-B: The distributed anchored coreness-based D-core decomposition algorithms implemented in the vertex-centric and block-centric frameworks, respectively.
•

SC-V and SC-B: The distributed skyline coreness-based D-core decomposition algorithms implemented in the vertex-centric and block-centric frameworks, respectively.
•

Peeling: The distributed version of the peeling algorithm for D-core decomposition (Fang et al., 2018), in which one machine is assigned as the coordinator to collect global graph information and dispatch decomposition tasks.

We employ GRAPE (Fan et al., 2017) as the block-centric framework and use the hash partitioner for graph partitioning by default. For the sake of fairness, we also employ GRAPE to simulate the vertex-centric framework. In specific, at each round, all vertices within a block execute computations only once and when all vertices complete the computation, the messages will be broadcast to their neighbors.

Parameters and Metrics. The parameters tested in experiments include $\#$ machines and graph size, whose default settings are 8 and $100\%\cdot|V_{G}|$ , respectively. The performance metrics evaluated include $\#$ iterations required for convergence, convergence rate (i.e., the percentage of vertices who have computed the coreness), running time (in seconds), and communication overhead (i.e., the total messages sent by all vertices).

6.1. Convergence Evaluation

The first set of experiments evaluates the convergence of our proposed algorithms.

Table 4.

\#

Iterations required for the algorithms

Algorithms		Datasets
Algorithms		WV	EE	SL	AM	CT
Upper Bound		1,167	7,636	5,064	2,757	793
AC-V	Phase I	19	17	40	16	32
	Phase II	32	19	53	64	32
	Phase III	33	22	61	61	2
	Total	84	58	154	141	66
AC-B	Phase I	14	14	35	13	28
	Phase II	15	7	43	30	28
	Phase III	16	21	45	25	2
	Total	45	42	123	68	58
SC-V		33	19	61	65	2
SC-B		17	6	46	25	2

Exp-1: Evaluation on the number of iterations. We start by evaluating $\#$ iterations required for our algorithms to converge. Note that an iteration here refers to a cycle of the algorithm receiving messages, performing computations, and broadcasting messages. Table 4 reports the results on datasets WV, EE, SL, AM, and CT. We make several observations. First, for every graph, all of our proposed algorithms have much less iterations than the upper bound (i.e., the maximum degree of the graph), which demonstrates the efficiency of our algorithms. Second, the iterations of SC-V and SC-B are less than those of AC-V and AC-B. This is because the computation of anchored corenesses is more cumbersome than that of skyline corenesses. Hence, both AC-V and AC-B take more iterations. Third, for the same type of algorithms, i.e., AC or SC, the algorithm implemented in the block-centric framework takes less iterations than that in the vertex-centric framework. The reason is that the block-centric framework allows algorithms to use vertices located in the same block to converge locally within a single round, which leads to faster convergence.

Exp-2: Evaluation on the convergence rate. Since different vertices require different numbers of iterations to converge, in this experiment, we evaluate the algorithms’ convergence rates. Figure 3 shows the results on Amazon. As expected, the algorithms implemented in the block-centric framework converge faster. For example, in Figure 3(d), after 8 iterations, the convergence rates of SC-V and SC-B reach $89.9\%$ and $98.6\%$ , respectively. Moreover, most vertices can converge within just a few iterations. Specifically, for SC-B, more than $95\%$ vertices converge within 5 iterations. In addition, SC algorithms have faster convergence rates than AC algorithms. For example, AC-B takes 68 iterations to reach convergence while SC-B takes 25 iterations.

6.2. Efficiency Evaluation

Next, we evaluate the efficiency of our proposed algorithms against the state-of-the-art peeling algorithm, denoted as Peeling. Note that if an algorithm cannot finish within 5 days, we denote it by ’INF’.

Exp-3: Our algorithms vs. Peeling. We first compare the performance of our proposed algorithms with Peeling under the default experiment settings. Figures 4(a) and 4(b) report the results in terms of the running time and communication overhead, respectively. First, we can see that Peeling cannot finish within 5 days on the large-scale graphs with more than 50 million edges, including LJ, HW, UK2 UK5, and IT, while our algorithms can finish within 1 hour for most of these datasets and no more than 10 hours on the largest billion-scale graph for our fastest algorithm. Moreover, for the datasets where Peeling can finish, our algorithms outperform Peeling by up to 3 orders of magnitude. This well demonstrates the efficiency of our proposed algorithms. Second, SC-V and SC-B perform better than AC-V and AC-B in terms of both the running time and communication overhead. For example, on the biggest graph IT with over a billion edges, the improvement is nearly 1 order of magnitude. This is because SC-V and SC-B compute less corenesses than AC-V and AC-B. Third, AC-V (resp. SC-V) is better than AV-B (resp. SC-B) in terms of the running time while it is opposite in terms of the communication overhead. This is due to the effect of straggler (Ananthanarayanan et al., 2010). Specifically, for the block-centric framework, in each iteration, the algorithms use block information to converge locally (i.e., within each block); they cannot start the next iteration until all blocks have converged. There are machines where some blocks may converge very slowly, which deteriorates the overall performance of the block-centric algorithms.

Exp-4: Effect of the number of machines. Next, we vary the number of machines from 2 to 16 and test its effect on performance. Figure 5 reports the results for datasets UK2 and HW. As shown in Figures 5(a) and 5(b), all of our algorithms take less running time when the number of machines increases. This is because the more the machines, the stronger the computing power our algorithms can take advantage of, thanks to their distributed designs. In addition, Figures 5(c) and 5(d) show that the communication overheads of all algorithms do not change with the number of machines. This is because the communication overhead is determined by the convergence rate of the algorithms, which is not influenced by the number of machines.

Exp-5: Effect of dataset cardinality. We evaluate the effect of cardinality for our proposed algorithms on datasets PO and UK5. For this purpose, we extract a set of subgraphs from the original graphs by randomly selecting different fractions of vertices, which varies from $20\%$ to $100\%$ . As shown in Figure 6, both the running time and communication overhead increase with the dataset cardinality. This is expected because the larger the dataset, the more the corenesses of the vertices to compute, resulting in to poorer performance.

Exp-6: Effect of partition strategies. We evaluate the effect of different partition strategies in block-centric algorithms, i.e., AC-B and SC-B. Specifically, we compare four partition strategies, including SEG (Fan et al., 2017), HASH (Fan et al., 2017), FENNEL (Tsourakakis et al., 2014), and METIS (Karypis and Kumar, 1998).

•

SEG is a built-in partitioner of GRAPE. Let $C$ be the maximum cardinality of partitioned subgraphs. For a vertex $v$ with its ID $v_{id}\in[0,n-1]$ , $v$ is allocated to the $i$ -th subgraph, where $i=v_{id}/C$ .
•

HASH is also a built-in partitioner of GRAPE. Let $N$ be the number of partitioned subgraphs. For a vertex $v$ with its ID $v_{id}\in[0,n-1]$ , $v$ is allocated to the $i$ -th subgraph, where $i=v_{id}\%N$ .
•

FENNEL subsumes two popular heuristics to partition the graph: the folklore heuristic that places a vertex to the subgraph with the fewest non-neighbors, and the degree-based heuristic that uses different heuristics to place a vertex based on its degree.
•

METIS is a popular edge-cut partitioner that partitions the graph into subgraphs with minimum crossing edges.

Figure 7 shows the results. We can observe that HASH has the best performance in terms of running time on most datasets, but it has the worse performance in terms of communication cost. This is because HASH has more balanced partitions (i.e., each partition has almost an equal number of vertices) while METIS and FENNEL have higher locality, which leads to more running time, due to the effect of straggler, but less communication overhead. Considering the importance of efficiency in practice, we employ HASH as the default partition strategy in our experiments, as mentioned earlier.

7. Conclusion

In this paper, we study the problem of D-core decomposition in distributed settings. To handle distributed D-core decomposition, we propose two efficient algorithms, i.e., the anchored-coreness-based algorithm and skyline-coreness-based algorithm. Specifically, the anchored-coreness-based algorithm employs in-H-index and out-H-index to compute the anchored corenesses in a distributed way; the skyline-coreness-based algorithm uses a newly designed index, called D-index, for D-core decomposition through skyline coreness computing. Both theoretical analysis and empirical evaluation show the efficiency of our proposed algorithms with fast convergence rates.

As for future work, we are interested to study how to further improve the performance of the skyline-coreness-based algorithm, in particular how to accelerate the computation of D-index on each single machine. We are also interested to investigate efficient algorithms of distributed D-core maintenance for dynamic graphs.

References

(1)
gir (2012) 2012. Giraph. https://giraph.apache.org/.
Ananthanarayanan et al. (2010) Ganesh Ananthanarayanan, Srikanth Kandula, Albert G. Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. 2010. Reining in the Outliers in Map-Reduce Clusters using Mantri. In OSDI. 265–278.
Aridhi et al. (2016) Sabeur Aridhi, Martin Brugnara, Alberto Montresor, and Yannis Velegrakis. 2016. Distributed k-core decomposition and maintenance in large dynamic graphs. In Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. 161–168.
Batagelj and Zaversnik (2003) Vladimir Batagelj and Matjaz Zaversnik. 2003. An O (m) algorithm for cores decomposition of networks. arXiv preprint cs/0310049 (2003).
Bonchi et al. (2014) Francesco Bonchi, Francesco Gullo, Andreas Kaltenbrunner, and Yana Volkovich. 2014. Core decomposition of uncertain graphs. In Proceedings of the 20th International Conference on Knowledge Discovery and Data Mining. 1316–1325.
Bonchi et al. (2019) Francesco Bonchi, Arijit Khan, and Lorenzo Severini. 2019. Distance-generalized core decomposition. In Proceedings of the 2019 International Conference on Management of Data. 1006–1023.
Chen et al. (2020) Yankai Chen, Jie Zhang, Yixiang Fang, Xin Cao, and Irwin King. 2020. Efficient community search over large directed graphs: An augmented index-based approach. In International Joint Conference on Artificial Intelligence. 3544–3550.
Cheng et al. (2011) James Cheng, Yiping Ke, Shumo Chu, and M Tamer Özsu. 2011. Efficient core decomposition in massive networks. In 2011 IEEE 27th International Conference on Data Engineering. 51–62.
Dasari et al. (2014) Naga Shailaja Dasari, Ranjan Desh, and Mohammad Zubair. 2014. ParK: An efficient algorithm for k-core decomposition on multicore processors. In 2014 IEEE International Conference on Big Data. 9–16.
Eidsaa and Almaas (2013) Marius Eidsaa and Eivind Almaas. 2013. S-core network decomposition: A generalization of k-core analysis to weighted networks. Physical Review E 88, 6 (2013), 062819.
Esfandiari et al. (2018) Hossein Esfandiari, Silvio Lattanzi, and Vahab Mirrokni. 2018. Parallel and streaming algorithms for k-core decomposition. In International Conference on Machine Learning. 1397–1406.
Fan et al. (2017) Wenfei Fan, Jingbo Xu, Yinghui Wu, Wenyuan Yu, and Jiaxin Jiang. 2017. GRAPE: Parallelizing sequential graph computations. Proceedings of the VLDB Endowment 10, 12 (2017), 1889–1892.
Fang et al. (2018) Yixiang Fang, Zhongran Wang, Reynold Cheng, Hongzhi Wang, and Jiafeng Hu. 2018. Effective and efficient community search over large directed graphs. IEEE Transactions on Knowledge and Data Engineering 31, 11 (2018), 2093–2107.
Fang et al. (2020) Yixiang Fang, Yixing Yang, Wenjie Zhang, Xuemin Lin, and Xin Cao. 2020. Effective and Efficient Community Search over Large Heterogeneous Information Networks. Proceedings of the VLDB Endowment 13, 6 (2020), 854–867.
Galimberti et al. (2021) Edoardo Galimberti, Martino Ciaperoni, Alain Barrat, Francesco Bonchi, Ciro Cattuto, and Francesco Gullo. 2021. Span-core Decomposition for Temporal Networks: Algorithms and Applications. ACM Transactions on Knowledge Discovery from Data 15, 1 (2021), 2:1–2:44.
Garcia et al. (2017) David Garcia, Pavlin Mavrodiev, Daniele Casati, and Frank Schweitzer. 2017. Understanding popularity, reputation, and social influence in the twitter society. Policy & Internet 9, 3 (2017), 343–364.
Giatsidis et al. (2013) Christos Giatsidis, Dimitrios M Thilikos, and Michalis Vazirgiannis. 2013. D-cores: measuring collaboration of directed graphs based on degeneracy. Knowledge and Information Systems 35, 2 (2013), 311–343.
Hirsch (2005) Jorge E. Hirsch. 2005. H-index. https://en.wikipedia.org/wiki/H-index/.
Karypis and Kumar (1998) George Karypis and Vipin Kumar. 1998. Metis: A software package for partitioning unstructured graphs. Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices Version 4 (1998).
Khaouid et al. (2015) Wissam Khaouid, Marina Barsky, Venkatesh Srinivasan, and Alex Thomo. 2015. K-core decomposition of large networks on a single pc. Proceedings of the VLDB Endowment 9, 1 (2015), 13–23.
Liao et al. (2022) Xunkun Liao, Qing Liu, Jiaxin Jiang, Xin Huang, Jianliang Xu, and Byron Choi. 2022. Distributed D-core Decomposition over Large Directed Graphs. In arXivpreprint arXiv:XXXXXXX.
Linial (1992) Nathan Linial. 1992. Locality in Distributed Graph Algorithms. SIAM J. Comput. 21, 1 (1992), 193–201.
Liu et al. (2019) Boge Liu, Long Yuan, Xuemin Lin, Lu Qin, Wenjie Zhang, and Jingren Zhou. 2019. Efficient (a, $\beta$ )-core Computation: an Index-based Approach. In International World Wide Web Conference. 1130–1141.
Liu et al. (2021) Qing Liu, Xuliang Zhu, Xin Huang, and Jianliang Xu. 2021. Local Algorithms for Distance-generalized Core Decomposition over Large Dynamic Graphs. The Proceedings of the VLDB Endowment (2021).
Low et al. (2012) Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M Hellerstein. 2012. Distributed graphlab: A framework for machine learning in the cloud. arXiv preprint arXiv:1204.6078 (2012).
Lü et al. (2016) Linyuan Lü, Tao Zhou, Qian-Ming Zhang, and H Eugene Stanley. 2016. The H-index of a network node and its relation to degree and coreness. Nature Communications 7, 1 (2016), 1–7.
Malewicz et al. (2010) Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 International Conference on Management of Data. 135–146.
Mandal and Hasan (2017) Aritra Mandal and Mohammad Al Hasan. 2017. A distributed k-core decomposition algorithm on spark. In IEEE International Conference on Big Data. 976–981.
McCune et al. (2015) Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking Like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. Comput. Surveys 48, 2 (2015), 25:1–25:39.
Montresor et al. (2012) Alberto Montresor, Francesco De Pellegrini, and Daniele Miorandi. 2012. Distributed k-core decomposition. IEEE Transactions on Parallel and Distributed Systems 24, 2 (2012), 288–300.
Peng et al. (2018) You Peng, Ying Zhang, Wenjie Zhang, Xuemin Lin, and Lu Qin. 2018. Efficient Probabilistic K-Core Computation on Uncertain Graphs. In IEEE International Conference on Data Engineering. 1192–1203.
Salihoglu and Widom (2013) Semih Salihoglu and Jennifer Widom. 2013. Gps: A graph processing system. In Proceedings of the 25th International Conference on Scientific and Statistical Database Management. 1–12.
Saríyüce et al. (2013) Ahmet Erdem Saríyüce, Buğra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Ümit V Çatalyürek. 2013. Streaming algorithms for k-core decomposition. Proceedings of the VLDB Endowment 6, 6 (2013), 433–444.
Sarıyüce et al. (2016) Ahmet Erdem Sarıyüce, Buğra Gedik, Gabriela Jacques-Silva, Kun-Lung Wu, and Ümit V Çatalyürek. 2016. Incremental k-core decomposition: algorithms and evaluation. The VLDB Journal 25, 3 (2016), 425–447.
Sariyüce et al. (2018) Ahmet Erdem Sariyüce, C. Seshadhri, and Ali Pinar. 2018. Local Algorithms for Hierarchical Dense Subgraph Discovery. Proceedings of the VLDB Endowment 12, 1 (2018), 43–56.
Seidman (1983) Stephen B Seidman. 1983. Network structure and minimum degree. Social Networks 5, 3 (1983), 269–287.
Soldano et al. (2017) Henry Soldano, Guillaume Santini, Dominique Bouthinon, and Emmanuel Lazega. 2017. Hub-authority cores and attributed directed network mining. In International Conference on Tools with Artificial Intelligence. 1120–1127.
Tian et al. (2013) Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, and John McPherson. 2013. From” think like a vertex” to” think like a graph”. Proceedings of the VLDB Endowment 7, 3 (2013), 193–204.
Tsourakakis et al. (2014) Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. FENNEL: Streaming Graph Partitioning for Massive Scale Graphs. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. 333–342.
Wen et al. (2016) Dong Wen, Lu Qin, Ying Zhang, Xuemin Lin, and Jeffrey Xu Yu. 2016. I/O efficient Core Graph Decomposition at web scale. In International Conference on Data Engineering. 133–144.
Wu et al. (2015) Huanhuan Wu, James Cheng, Yi Lu, Yiping Ke, Yuzhen Huang, Da Yan, and Hejun Wu. 2015. Core decomposition in large temporal graphs. In IEEE International Conference on Big Data. 649–658.
Yan et al. (2014) Da Yan, James Cheng, Yi Lu, and Wilfred Ng. 2014. Blogel: A block-centric framework for distributed computation on real-world graphs. Proceedings of the VLDB Endowment 7, 14 (2014), 1981–1992.
Zhou et al. (2021) Wei Zhou, Hong Huang, Qiang-Sheng Hua, Dongxiao Yu, Hai Jin, and Xiaoming Fu. 2021. Core decomposition and maintenance in weighted graph. World Wide Web 24, 2 (2021), 541–561.

Appendix A Proof of Theorem 2

Theorem 4.1. (Convergence).

k_{max}(v)=\lim_{n\rightarrow\infty}\mathrm{iH}_{G}^{(n)}(v)

Proof.

First, we prove that $\mathrm{iH}_{G}^{(n)}(v)$ is non-increasing with the increase of order $n$ . Then, we prove that $k_{max}(v)\leq\mathrm{iH}_{G^{\prime}}^{(\infty)}(v)\leq\mathrm{iH}_{G}^{(\infty)}(v)$ , where $G^{\prime}\subseteq G$ is a subgraph induced by the vertices $v^{\prime}$ with $k_{max}(v^{\prime})\geq k_{max}(v)$ . Also, we know $k_{max}(v)\geq\mathrm{iH}_{G}^{(\infty)}(v)$ by definition. Hence, $k_{max}(v)=\mathrm{iH}_{G}^{(\infty)}(v)$ . Complete proof is given here:

First, we prove that $\mathrm{iH}_{G}^{(n)}(v)$ is non-increasing with the increase of $n$ through mathematical induction. (1) It is straightforward that $\mathrm{iH}_{G}^{(0)}(v)$ $\geq$ $\mathrm{iH}_{G}^{(1)}(v)$ holds according to the definition of $\mathrm{iH}_{G}^{(n)}(v)$ . (2) Assume that $\mathrm{iH}_{G}^{(m)}(v)\geq\mathrm{iH}_{G}^{(m+1)}(v)$ holds. We have $\mathrm{iH}_{G}^{(m+2)}(v)=\mathcal{H}(\mathrm{iH}_{G}^{(m+1)}(u_{1}),$ $...,\mathrm{iH}_{G}^{(m+1)}(u_{k_{i}}))\leq$ $\mathcal{H}(\mathrm{iH}_{G}^{(m)}(u_{1}),...,\mathrm{iH}_{G}^{(m)}(u_{k_{i}}))=\mathrm{iH}_{G}^{(m+1)}(v)$ , i.e., $\mathrm{iH}_{G}^{(m+1)}(v)\geq\mathrm{iH}_{G}^{(m+2)}(v)$ . Thus, if $\mathrm{iH}_{G}^{(n)}(v)\geq\mathrm{iH}_{G}^{(n+1)}(v)$ holds for $n=m$ , it also holds for $n=m+1$ . Combining (1) and (2), $\mathrm{iH}_{G}^{(n)}(v)$ is non-increasing with the increase of $n$ . Moreover, since $\mathrm{iH}_{G}^{(n)}(v)$ is a positive integer, $\mathrm{iH}_{G}^{(n)}(v)$ can converge to a certain value.

Next, we prove that $\mathrm{iH}_{G}^{(\infty)}(v)\geq k_{max}(v)$ . To this end, we introduce two properties of $\mathrm{iH}_{G}^{(n)}(v)$ : (i) if $G^{\prime}\subseteq G$ , for $\forall v\in V_{G^{\prime}}$ and $\forall n\in\mathbb{N}_{0}$ , we have $\mathrm{iH}_{G^{\prime}}^{(n)}(v)\leq\mathrm{iH}_{G}^{(n)}(v)$ ; and (ii) for $\forall v\in V_{G}$ and $\forall n\in\mathbb{N}_{0}$ , we have $\mathrm{iH}_{G}^{(n)}(v)\geq indeg_{min}(G)$ , where $indeg_{min}(G)$ is the minimum in-degree of $G$ . Let $G^{\prime}\subseteq G$ be the subgraph of $G$ induced by vertex $v$ and the vertices $v^{\prime}$ with $k_{max}(v^{\prime})\geq k_{max}(v)$ . We have $\mathrm{iH}_{G}^{(\infty)}(v)\geq\mathrm{iH}_{G^{\prime}}^{(\infty)}(v)\geq indeg_{min}(G^{\prime})\geq k_{max}(v)$ .

Then, we prove that $k_{max}(v)\geq\mathrm{iH}_{G}^{(\infty)}(v)$ . To this end, we construct a subgraph $G^{\prime\prime}\subseteq G$ induced by vertex $v$ and the vertices $v^{\prime\prime}$ satisfying $\mathrm{iH}_{G}^{(\infty)}(v^{\prime\prime})\geq\mathrm{iH}_{G}^{(\infty)}(v)$ . Obviously, $G^{\prime\prime}$ is a $(\mathrm{iH}_{G}^{(\infty)}(v),0)$ -core. Hence, $k_{max}(v)\geq\mathrm{iH}_{G}^{(\infty)}(v)$ .

Finally, combining the above three proofs, we obtain Theorem 4.1. ∎

Appendix B Proof of Theorem 4.10

Theorem 4.4. (Time and Space Complexities). Algorithm 1 takes $O(R_{AC}\cdot\Delta_{in}\cdot\Delta)$ time and $O(\Delta_{in}\cdot\Delta)$ space. The total time and space complexities for computing all vertices’ corenesses are $O(R_{AC}\cdot\Delta_{in}\cdot m)$ and $O(\Delta_{in}\cdot m)$ , respectively..

Proof.

Algorithm 1 consists of three phases in Algorithms 2, 3, and 4. Algorithm 2 iteratively computes the $n$ -order in-H-index for all vertices. The time complexity of Algorithm 2 is mainly determined by the number of rounds that need to compute the $n$ -order in-H-index and also the processing time of each round, which are $R_{AC-I}$ and $O(deg^{in}_{G}(v))$ , respectively. Hence, the time complexity of Algorithm 2 is $O(R_{AC-I}\cdot deg^{in}_{G}(v))$ . Next, Algorithm 3 iteratively computes $l_{upp}(k,v)$ . Each round of Algorithm 3 takes $O(deg^{in}_{G}(v)\cdot deg^{out}_{G}(v))$ time. Algorithm 4 iteratively refines $l_{upp}(k,v)$ to $l_{max}(k,v)$ . Each round of Algorithm 4 takes $O(deg^{in}_{G}(v)\cdot deg_{G}(v))$ time. Hence, the time complexities of Algorithms 3 and 4 are $O(R_{AC-II}\cdot deg^{in}_{G}(v)\cdot deg^{out}_{G}(v))$ and $O(R_{AC-III}\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ , respectively. Therefore, the time complexity of Algorithm 1 is $O(R_{AC-I}\cdot deg^{in}_{G}(v)+R_{AC-II}\cdot deg^{in}_{G}(v)\cdot deg^{out}_{G}(v)+R_{AC-III}\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(R_{AC}\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(R_{AC}\cdot\Delta_{in}\cdot\Delta)$ . The total time complexity for all vertices is $O(\sum\limits_{v\in V_{G}}R_{AC}\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(R_{AC}\cdot\Delta_{in}\cdot\sum\limits_{v\in V_{G}}deg_{G}(v))$ $=$ $O(R_{AC}\cdot\Delta_{in}\cdot m)$ .

Next, we analyze the space complexity. Algorithm 2 computes $\mathrm{iH}_{G}^{(n)}(v)$ by storing $\mathrm{iH}_{G}^{(n-1)}(u)$ for $u\in N^{in}(v)$ , which costs $O(deg^{in}_{G}(v))$ space. Next, Algorithm 3 computes $l_{upp}(k,v)$ . For each $k\in[0,k_{max}]$ , it stores the out-H-indexes of the out-neighbors in $O(deg^{out}_{G}(v))$ space. Hence, Phase II requires $O(deg^{in}_{G}(v)\cdot deg^{out}_{G}(v))$ space. Phase III refines $l_{upp}(k,v)$ to $l_{max}(k,v)$ in Algorithm 4. For each $k\in[0,k_{max}]$ , it stores $l_{upp}(k,v^{\prime})$ for all $v$ ’s neighbors in $O(deg^{in}_{G}(v)\cdot deg_{G}(v))$ space. Overall, the space complexity of Algorithm 1 is $O(deg^{in}_{G}(v)+deg^{in}_{G}(v)\cdot deg^{out}_{G}(v)+deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(\Delta_{in}\cdot\Delta)$ . The total space complexity for all vertices is $O(\sum\limits_{v\in V_{G}}deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(\Delta_{in}\cdot m)$ . ∎

Appendix C Proof of Theorem 4.11

Theorem 4.5. (Message Complexity). The message complexity (i.e., the total number of times that a vertex send messages) of Algorithm 1 is $O(\Delta_{in}\cdot\Delta_{out}\cdot\Delta)$ . The total message complexity for computing all vertices’ corenesses is $O(\Delta_{in}\cdot\Delta_{out}\cdot m)$ .

Proof.

We analyze the message complexity in terms of two parts: Algorithm 2 and Algorithms 3 and 4. First, Algorithm 2 calculate $k_{max}(v)$ by iteratively computing $n$ -order in-H-index of $v$ . When $n$ -order in-H-index is not equal to $(n-1)$ -order in-H-index, Algorithm 2 also sends messages to all $v$ ’s out-neighbors. $v$ sends message at most $deg^{in}_{G}(v)$ times. Hence, the message complexity of Algorithm 1 is $O(deg^{in}_{G}(v)\cdot deg^{out}_{G}(v))$ .

Next, for each value of $k$ , Algorithm 3 first computes the upper bound $l_{upp}(k,v)$ and then Algorithm 4 refine $l_{upp}(k,v)$ to $l_{max}(k,v)$ . Algorithms 3 obtains $l_{upp}(k,v)$ by iteratively computing $n$ -order out-H-index of $v$ . When $n$ -order out-H-index of $v$ decreases, Algorithm 3 sends messages to $v$ ’s in-neighbors. Algorithm 3 continues until the $n$ -order out-H-index of $v$ equals $l_{upp}(k,v)$ . Then, Algorithms 4 refines $l_{upp}(k,v)$ by gradually decreasing it to $l_{max}(k,v)$ . Each time $l_{upp}(k,v)$ decreases, Algorithm 4 sends messages to $v$ ’s neighbors. Since $l_{max}(k,v)$ is at least 1 and there are at most $deg^{in}_{G}(v)$ values of $k$ , the total messages sent by $v$ is $deg^{out}_{G}(v)\cdot deg^{in}_{G}(v)\cdot deg_{G}(v)$ in worst. Overall, the message complexity of Algorithms 3 and 4 is $O(deg^{out}_{G}(v)\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ .

As a result, the message complexity of Algorithm 1 is $O(deg^{in}_{G}(v)\cdot deg^{out}_{G}(v)+deg^{out}_{G}(v)\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(deg^{out}_{G}(v)\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(\Delta_{in}\cdot\Delta_{out}\cdot\Delta)$ . The total message complexity for all vertices is $O(\sum\limits_{v\in V_{G}}deg^{out}_{G}(v)\cdot deg^{in}_{G}(v)\cdot deg_{G}(v))$ $=$ $O(\Delta_{in}\cdot\Delta_{out}\cdot\sum\limits_{v\in V_{G}}deg_{G}(v))$ $=$ $O(\Delta_{in}\cdot\Delta_{out}\cdot m)$ . ∎

Appendix D Proof of Theorem 5.8

Theorem 5.3. (Time and Space Complexities). Algorithm 5 takes $O(R_{SC}\cdot\Delta_{in}\cdot\Delta_{out})$ time in $O(\Delta\cdot\min\{\Delta_{in},\Delta_{out}\})$ space. The total time and space complexities for computing all vertices’ corenesses are $O(R_{SC}\cdot\Delta_{in}\cdot m)$ and $O(\min\{\Delta_{in},\Delta_{out}\}\cdot m)$ , respectively.

Proof.

The time complexity of Algorithm 5 is dominated by the iterative computation of $n$ -order D-index. In each round, Algorithm 6 examines at most $deg^{in}_{G}(v)\cdot deg^{out}_{G}(v)$ pairs. Therefore, the time complexity of Algorithm 5 is $O(R_{SC}\cdot deg^{in}_{G}(v)\cdot deg^{out}_{G}(v))$ $=$ $O(R_{SC}\cdot\Delta_{in}\cdot\Delta_{out})$ . The total time complexity for all vertices is $O(\sum\limits_{v\in V_{G}}R_{SC}\cdot deg^{in}_{G}(v)\cdot deg^{out}_{G}(v))$ $=$ $O(R_{SC}\cdot\Delta_{in}\cdot\sum\limits_{v\in V_{G}}deg_{G}(v))$ $=$ $O(R_{SC}\cdot\Delta_{in}\cdot m)$ .

Next, we analyze the space complexity. To compute $v$ ’s $n$ -order D-index, Algorithm 5 stores the $(n-1)$ -order D-indexes of all $v$ ’s neighbors. The size of vertex $v$ ’s $n$ -order D-index is bounded by $\min\{\Delta_{in},\Delta_{out}\}$ . Thus, the space complexity of Algorithm 5 is $O(deg_{G}(v)\cdot\min\{\Delta_{in},\Delta_{out}\})$ $=$ $O(\Delta\cdot\min\{\Delta_{in},\Delta_{out}\})$ . The total time complexity for all vertices is $O(\sum\limits_{v\in V_{G}}\min\{deg^{in}_{G}(v),deg^{out}_{G}(v)\}\cdot deg^{out}_{G}(v))$ $=$ $O(\min\{\Delta_{in},\Delta_{out}\}\cdot\sum\limits_{v\in V_{G}}deg_{G}(v))$ $=$ $O(\min\{\Delta_{in},\Delta_{out}\}\cdot m)$ . ∎

Appendix E Proof of Theorem 5.9

Theorem 5.4. (Message Complexity). The message complexity of Algorithm 5 is $O(\Delta^{2})$ . The total message complexity for computing all vertices’ corenesses is $O(\Delta\cdot m)$ .

Proof.

The message complexity of Algorithm 5 is dominated by the computation of $n$ -order D-index for vertex $v$ . When the $n$ -order D-index changes from the $(n-1)$ -order D-index, $v$ broadcasts messages to its neighbors. The $n$ -order D-index contains skyline pairs. Assume that in each round the skyline pair decreases one dimension by one at most. There are a total of $deg_{G}(v)$ rounds for $v$ . Hence, the message complexity of Algorithm 5 is $O(deg_{G}(v)\cdot deg_{G}(v))$ $=$ $O(\Delta^{2})$ . The total message complexity for all vertices is $O(\sum\limits_{v\in V_{G}}deg_{G}(v)\cdot deg_{G}(v))$ $=$ $O(\Delta\cdot\sum\limits_{v\in V_{G}}deg_{G}(v))$ $=$ $O(\Delta\cdot m)$ . ∎

Appendix F Performance comparisons on a single machine

This set of experiments evaluates the performance of our proposed algorithms and the state-of-the-art centralized method (Fang et al., 2018) over a single machine. Figure 8 shows the running time results. We can observe that for most of the small graphs, the peeling algorithm is more efficient than our algorithms. The reason is two-folded: (1) our proposed algorithms require iterative computations of in-H-index/out-H-index/D-index for every vertex before convergence, which is time consuming in centralized settings; (2) on a single machine, the vertex state update is transparent to other vertices in the peeling algorithm while it needs message communication exchanges in our algorithms, which incur a higher overhead. Nevertheless, our proposed algorithms run much faster than the peeling method in distributed settings as reported in Section 6.2. On the other hand, for billion-scale graphs UK5 and IT, all algorithms could not finish, due to memory overflows. This indeed motivates the need of distributed D-core decomposition over large directed graphs using a collection of machines.

Distributed D-core Decomposition over Large Directed Graphs

Abstract.

1. Introduction

2. Related Work

3. Problem Formulation

Definition 3.0.

Property 3.1.

Example 3.0.

Problem 1.

4. Distributed Anchored Coreness-Based Algorithm

4.1. Anchored Coreness

Definition 4.0.

4.2. Distributed Anchored Coreness Computing

Definition 4.0.

Theorem 4.3 (Convergence).

Proof.

Example 4.0.

Definition 4.0.

Theorem 4.6.

Proof.

Example 4.0.

Theorem 4.8.

Example 4.0.

4.3. Algorithm Analysis and Extension

Theorem 4.10.

Theorem 4.11.

5. Distributed Skyline Coreness-Based Algorithm

5.1. Skyline Coreness

Definition 5.0.

Definition 5.0.

Property 5.1.

Proof.

5.2. Distributed Skyline Corenesses Computing

Definition 5.0.

Definition 5.0.

Theorem 5.5 (nn-order D-index Convergence).

Proof.

5.3. Algorithms and Optimizations

Theorem 5.6.

Example 5.0.

5.4. Algorithm Analysis and Extension

Theorem 5.8.

Theorem 5.9.

6. Performance Evaluation

6.1. Convergence Evaluation

6.2. Efficiency Evaluation

7. Conclusion

References

Appendix A Proof of Theorem 2

Proof.

Appendix B Proof of Theorem 4.10

Proof.

Appendix C Proof of Theorem 4.11

Proof.

Appendix D Proof of Theorem 5.8

Proof.

Appendix E Proof of Theorem 5.9

Proof.

Appendix F Performance comparisons on a single machine

Theorem 5.5 ( $n$ -order D-index Convergence).