This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Distributed Solution for Efficient K Shortest Paths Computation over Dynamic Road Networks

Ziqiang Yu1, Xiaohui Yu2∗, Nick Koudas3, Yueting Chen2, Yang Liu4
*Corresponding author 1Yantai University  2York University  3University of Toronto  4Wilfrid Laurier University
Abstract

The problem of identifying the k-shortest paths (KSPs for short) in a dynamic road network is essential to many location-based services. Road networks are dynamic in the sense that the weights of the edges in the corresponding graph constantly change over time, representing evolving traffic conditions. Very often such services have to process numerous KSP queries over large road networks at the same time, thus there is a pressing need to identify distributed solutions for this problem. However, most existing approaches are designed to identify KSPs on a static graph in a sequential manner (i.e., the (i+1)th(i+1)^{th} shortest path is generated based on the ithi^{th} shortest path), restricting their scalability and applicability in a distributed setting. We therefore propose KSP-DG, a distributed algorithm for identifying k-shortest paths in a dynamic graph. It is based on partitioning the entire graph into smaller subgraphs, and reduces the problem of determining KSPs into the computation of partial KSPs in relevant subgraphs, which can execute in parallel on a cluster of servers. A distributed two-level index called DTLP is developed to facilitate the efficient identification of relevant subgraphs. A salient feature of DTLP is that it indexes a set of virtual paths that are insensitive to varying traffic conditions in an efficient and compact fashion, leading to very low maintenance cost in dynamic road networks. This is the first treatment of the problem of processing KSP queries over dynamic road networks. Extensive experiments conducted on real road networks confirm the superiority of our proposal over baseline methods.

Index Terms:
kk shortest paths, road networks, dynamic graph, distributed index.

1 Introduction

In this work, we are concerned with identifying k-shortest paths (KSPs for short) over dynamic road networks. That is, for a given dynamic road network and a pair of origin and destination, identify the k shortest loop-less paths according to a predefined measure (e.g., travel time). The road network can be considered a graph where the intersections/endpoints are represented as vertices and roads as edges. It is dynamic in the sense that the travel time (or other similar measures such as congestion level) changes over time, corresponding to evolving weights of edges in the graph.

Identifying KSPs in a dynamic road network is an essential building block in many location-based services such as route navigation and ride-sharing [1, 4, 2, 3]. Most of the existing work to address the KSP problem in a road network (or more generally, in a graph) assumes a centralized approach [5, 6, 7, 8, 9, 10, 11, 12, 13, 14], and is incapable of handling large volumes of concurrent queries over a dynamic graph for two main reasons. First, the query processing strategies employed are sequential, namely they generate the (i+1)th(i+1)^{th} shortest path based on the ithi^{th} shortest path, which limits their scalability with respect to the number of concurrent queries. Second, some previous approaches [8, 10, 11] require building a heavy-weight path index, which is costly and easily becomes invalid once the edge weights in the graph change.

Distributed algorithms have recently been developed to compute shortest paths in static graphs [15, 16, 17, 18, 19, 20], but also cannot be directly adopted to process KSP queries in dynamic road networks because they also need to require building indexes beforehand to enhance search efficiency, which is impractical for dynamic graphs.

In order to tackle the aforementioned problems, we have proposed a distributed solution that can be deployed on a cluster of servers to handle KSP queries in a distributed fashion over large dynamic graphs. At a very high level, this solution adopts a divide-and-conquer strategy: the large graph is partitioned into multiple subgraphs maintained on different servers; given a query, we first compute in parallel partial shortest paths in the subgraphs on different servers, which are subsequently merged to construct the k shortest paths. Although similar strategies have been adopted for distributed query evaluation on graphs [21, 22, 23, 24, 25], the processing of KSP queries presents some unique challenges: (1) It is non-trivial to identify the partitioned subgraphs containing the required partial shortest paths. (2) It is vital but difficult to build an effective index for identifying KSPs that is easy to maintain in the presence of constantly changing weights.

With these challenges in mind, we first propose a Distributed Two-Level Path (DTLP) index. This index is based on partitioning any graph GG into multiple subgraphs, where two subgraphs may overlap at a small number of vertices called boundary vertices but have no common edges. For any two boundary vertices in the same subgraph, we compute specific paths between them called bounding paths. The boundary vertices and the bounding paths in all subgraphs form the first level of DTLP, which provides a lower bound of the shortest distance between a pair of boundary vertices. These bounding paths do not change due to varying weights, making the index easily maintainable. However, the volume of bounding paths in a subgraph can be prohibitively large. We thus propose a tree-based index structure to compactly store the bounding paths without sacrificing maintenance efficiency. The second level, is a skeleton graph GλG_{\lambda}, in which the vertices correspond to the boundary vertices of all subgraphs, and there exists an edge between a pair of vertices in GλG_{\lambda} if and only if there is a set of bounding paths between the corresponding vertices in the original graph, with the weight of the edge computed based on that set of bounding paths. Graph GλG_{\lambda} serves the purpose of supplying an approximate search direction for identifying KSPs.

We propose an iterative algorithm, KSP-DG, to identify KSPs in Dynamic Graphs based on DTLP, which follows a ”filter-and-refine” strategy. At each iteration of the algorithm, we first use the skeleton graph GλG_{\lambda} in the filter step to compute a shortest path in GλG_{\lambda} that has not been examined before. Then the subgraphs covering the vertices on the path (i.e., boundary vertices in subgraphs) are selected for further examination. In the refine step, kk shortest paths between the boundary vertices are generated from each subgraph, which are combined to form complete paths in GG. These paths are then used to update the list of shortest paths in GG that have been obtained so far. This process terminates when there is no more change in the list, and the final result is guaranteed to be the KSPs. The algorithm is distributed by design, in that the partitioning of the graph allows the refine step to run in parallel over the cluster, and the small footprint of the skeleton graph GλG_{\lambda} lends itself well to be replicated to any node in the cluster where it is needed for generating the shortest paths in GλG_{\lambda}.

In the refine step of KSP-DG, a frequently invoked but expensive operation is to compute the partial KSPs between boundary vertices in given subgraphs. In order to speed up this computation, We propose a progressive version of the widely adopted Yen’s algorithm [6], which was originally designed for computing pair-wise shortest distances in a graph, to improve the search efficiency through the following the enhancements. We call this new algorithm PYen (for Progressive Yen’s algorithm). (1) PYen parallelizes the generation of each potential shortest path and can prune unpromising potential shortest paths early before the complete generation; and (2) it fully reuses identified shortest paths from previous results and only compute shortest paths for unknown vertex pairs incrementally to avoid redundant computation.

In summary, we make the following contributions.

  • We introduce a comprehensive distributed solution for processing KSP queries, utilizing a divide-and-conquer strategy. This independent exploration across different divisions allows our solution to be easily scaled across multiple servers and executed in parallel, thereby accelerating the search process.

  • We devise DTLP, a dual-level path index suitable for deployment in a distributed setting to facilitate the processing of KSP queries over dynamic graphs.

  • We condense the extensive volume of bounding paths of DTLP into a space-efficient modified prefix tree using locality sensitive hashing. This enhances the space efficiency without compromising the indexing power.

  • We propose KSP-DG that decomposes the problem of identifying KSPs in the entire graph into parallel search for partial KSPs in different subgraphs, suitable for distributed processing on a cluster. Furthermore, we present PYen, a Progressive Yen’s algorithm, to expedite the computation of partial KSPs within subgraphs, a crucial operation frequently used by KSP-DG.

  • We conduct extensive experiments on real road networks to evaluate the performance of the proposed approach, confirming its effectiveness and superiority over other approaches across a variety of settings.

The rest of the paper is organized as follows. Section 2 defines the problem of finding KSPs. Section 3 presents the DTLP index, and its maintenance strategy is described in Section 4. Section 5 discusses KSP-DG as well as further optimizations. Section 6 presents the distributed implementation of DTLP and KSP-DG, and experimentally evaluates the performance of our proposal. Section 7 discusses related work, and Section 8 concludes this paper.

2 Problem Definition

Definition 1 ( Dynamic undirected graph).

A dynamic graph GG = (𝒱\mathcal{V}, \mathcal{E}, 𝒲\mathcal{W}) consists of (1) a finite set of vertices 𝒱\mathcal{V}, (2) a set of edges 𝒱×𝒱\mathcal{E}\subseteq\mathcal{V}\times\mathcal{V}, where ei,je_{i,j} (ei,je_{i,j}\in\mathcal{E}) denotes an edge between vertices viv_{i} and vjv_{j}, and (3) a set of non-negative weights 𝒲\mathcal{W}, where wi,j𝒲w_{i,j}\in\mathcal{W} is the weight of edge ei,je_{i,j}, which may change by a negative or non-negative value Δw\Delta w at any time point.

In what follows, we present the terminologies and solutions in the context of graphs instead of road networks, in view of their potential applicability in other scenarios involving dynamic undirected graphs.

Definition 2 (Subgraph).

A graph SG{SG}=(𝒱\mathcal{V}^{\prime}, \mathcal{E}^{\prime}, 𝒲\mathcal{W}^{\prime}) is a subgraph of the graph GG= (𝒱\mathcal{V}, \mathcal{E}, 𝒲\mathcal{W}) iff

  • 𝒱𝒱\mathcal{V}^{\prime}\subseteq\mathcal{V},

  • \mathcal{E}^{\prime}\subseteq\mathcal{E} (em,nvm,vn𝒱)\land(e_{m,n}\in\mathcal{E}^{\prime}\to v_{m},v_{n}\in\mathcal{V}^{\prime}),

  • 𝒲𝒲\mathcal{W}^{\prime}\subseteq\mathcal{W} (wm,n𝒲em,n)\land(w_{m,n}\in\mathcal{W}^{\prime}\to e_{m,n}\in\mathcal{E}^{\prime}).

Definition 3 (Path, Distance of Path).

Path P(s,t)P(s,t) from one vertex vsv_{s} (the source vertex) to another vertex vtv_{t} (the destination vertex) in graph GG is a sequence of vertices v0\langle\texttt{v}_{0}= vsv_{s},\cdots vl\texttt{v}_{l},\cdots,vn\texttt{v}_{n} =vtv_{t}\rangle such that l[1,n]\forall l\in[1,n], ei,je_{i,j}\in\mathcal{E} if vl1=vi\texttt{v}_{l-1}=v_{i} and vl=vj\texttt{v}_{l}=v_{j}. In this work, we only consider simple paths, i.e., paths with no repeat vertices. The distance of P(s,t)P(s,t) is defined as D(P(s,t))D(P(s,t))=i=1nwi1,i\sum\limits_{i=1}^{n}w_{i-1,i}.

Definition 4 (k-shortest path query (KSP query)).

For a given pair of source and destination vertices, vsv_{s} and vtv_{t}, the kk-shortest path query, q(vs,vt)q(v_{s},v_{t}), identifies the set of kk paths from vsv_{s} to vtv_{t} in graph GG, 𝒫s,t={P1(s,t),,Pk(s,t)}\mathcal{P}_{s,t}=\{{P_{1}}(s,t),\ldots,P_{k}(s,t)\}, such that D(Pi(s,t))D(Pi+1(s,t))(i[1,k1]D(P_{i}(s,t))\leq D(P_{i+1}(s,t))(i\in[1,k-1]), and P(s,t)𝒫s,t\forall P(s,t)\notin\mathcal{P}_{s,t}D(P(s,t))D(Pk(s,t))D(P(s,t))\geq D(P_{k}(s,t)).

To ensure timely processing of the queries, we assume that query processing takes place in main memory. Moreover, as graph GG constantly evolves as queries arrive, we use a buffer GcurrG_{curr} to model the current version of GG to ensure unambiguous semantics of the query answers. This buffer is updated continuously and asynchronously as the graph changes. At fixed time intervals, a snapshot GcurrG_{curr} is taken, and the answer to an incoming KSP query is processed against the most recent snapshot. Each query answer has a timestamp indicating the moment at which the answer is exact.

3 Distributed Two-Level Path Index

A naive approach to process KSP queries is to compute from scratch the shortest paths on graph GG directly each time a query arrives. In practice, however, both the size and the dynamic nature of graphs, as well the vast volume of concurrent queries, make it infeasible to process queries this way. For this reason we consider an index structure that could scale and speed up the processing.

3.1 Desiderata

Our problem setting necessitates the following desirable properties in an index to support processing KSP queries.

(1) Low maintenance cost for dynamic graphs. Some existing index structures for processing shortest path queries maintain the shortest path between selected “landmark” vertices. However, when the edge weights of the graph constantly change, this becomes extremely expensive to maintain. As such, one of our requirements is that the index must be easy to maintain for dynamic graphs and imposes as little overhead as possible when the graph evolves.

(2) Suitable for deployment in a distributed setting. As the size of the graph and the number of concurrent queries keep increasing, a distributed index becomes a more attractive solution than a centralized one due to its ability to scale out. Since the graph GG itself may have to be partitioned and stored across the cluster, we should be able to maintain parts of the index on different nodes in a cluster corresponding to subgraphs of GG.

(3) Supporting distributed algorithms. With the graph partitioned and subgraphs stored across the cluster, query processing takes place on different nodes in parallel. It is thus necessary for the index to assist the identification of relevant subgraphs that may contain pieces of the shortest paths to avoid examination of all subgraphs. Such an index should provide sufficient pruning power, and at the same time guarantee the correctness of the query result.

3.2 Overview of DTLP

In view of the above requirements, we devise an index called DTLP that facilitates distributed KSP query processing over dynamic graphs. Based on a partitioning of GG, DTLP has a two-level structure: the first level indexes each subgraph by maintaining a list of bounding paths (Section 3.4) between any pair of boundary vertices (Section 3.3). This provides the basis for computing the lower bounds of the shortest distances between the boundary vertices. The second level keeps a skeleton graph (Section 3.6) with all boundary vertices in all subgraphs, and it is computed based on the bounding paths identified in the first level. Fig. 1 illustrates the structure of the DTLP index.

Refer to caption
Figure 1: Schematic Diagram of DTLP

3.3 Graph Partitioning and Boundary Vertices

The construction of DTLP starts with partitioning graph G into multiple subgraphs such that each subgraph has at most zz vertices. Different subgraphs may share vertices but not edges. For graph partitioning, we start from any vertex and traverse the graph GG using a breadth-first strategy to generate the subgraphs. The set of subgraphs is denoted as 𝒮\mathcal{S}={SG1\{{SG}_{1}, \cdots, SGn}SG_{n}\} such that (1) 𝒱1𝒱n=𝒱\mathcal{V}_{1}\cup\cdots\cup\mathcal{V}_{n}=\mathcal{V}; (2) 1n={\mathcal{E}}_{1}\cup\cdots\cup{\mathcal{E}}_{n}=\mathcal{E}; (3) 𝒲1𝒲n=𝒲{\mathcal{W}}_{1}\cup\cdots\cup{\mathcal{W}}_{n}=\mathcal{W}, where nn is the number of subgraphs and SGi={𝒱i,i,𝒲i}SG_{i}=\{\mathcal{V}_{i},\mathcal{E}_{i},\mathcal{W}_{i}\} (i[1,n])(i\in[1,n]).

The vertices shared by two or more subgraphs are called boundary vertices. Evidently, any path from a non-boundary vertex in SGiSG_{i} to a non-boundary vertex in SGjSG_{j} must pass through one or more boundary vertices, as those boundary vertices are the only ”contact vertices” between subgraphs.

Example 1.

Figure 2 gives an example of graph GG that is partitioned into four subgraphs in Figure 3, where the threshold zz (maximum number of vertices in a subgraph) is set to 6. The boundary vertices in each subgraph are shaded.

Refer to caption
Figure 2: Graph GG
Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Refer to caption
(d)
Figure 3: Subgraphs of GG

3.4 Bounding Paths

Once graph partitions are in place, we move on to identify for each pair of boundary vertices in a subgraph a set of bounding paths. Bounding paths are specific paths in the subgraph serving as a reference to establish the bound distance, a stable lower bound of the shortest distance between the respective boundary vertices. The bounding paths do not change when the edge weights in the graph change, but the bound distances have to be updated to reflect the new information on edge weights. Those bounding paths will in turn be used to construct the skeleton graph.

As a first attempt, we consider utilizing the path(s) with the fewest edges between two boundary vertices as the bounding path(s). For any bounding path P(i,j)P(i,j) with mm edges between two boundary vertices viv_{i} and vjv_{j} in subgraph SGSG, the corresponding bound distance, denoted by BD(P(i,j))BD(P(i,j)), is computed as the sum of the mm smallest edge weights in SGSG. It is easy to show that BD(P(i,j))BD(P(i,j)) is not greater than the shortest distance between viv_{i} and vjv_{j} in SGSG, but it very likely has a large discrepancy with the actual shortest distance. Next, we seek to further improve this bound in two ways. First, we identify more than one bounding path between a pair of boundary vertices, which provides more bound distances and we can choose the maximum one to narrow down the discrepancy with the actual shortest distance.

Second, we decompose the edge weight to finer granularity to provide better resolution when computing the bound distance. In particular, we assign virtual fragments (vfrags for short) for each edge in the graph GG and the number of vfrags of any edge ei,je_{i,j} equals to wi,j0w^{0}_{i,j}, where wi,j0w^{0}_{i,j} is the initial weight of ei,je_{i,j} at the beginning of the DTLP construction. Such vfrags are additive, i.e., the number of vfrags on a path, ϕ(P(i,j))\phi(P(i,j)), is the sum of the number of vfrags on each edge in the path. We call the weight of each vfrag in ei,je_{i,j} the unit weight, defined as wi,j/wi,j0{w_{i,j}}/{w^{0}_{i,j}}, where wi,jw_{i,j} is the current weight of edge ei,je_{i,j}, which varies over time. For any edge, the number of vfrags always remains the same, but the unit weight will change with varying weight.

Refer to caption
(a) SG4SG_{4}
Refer to caption
(b) SG4SG_{4}
Figure 4: Subgraphs SG4SG_{4} and SG4SG^{\prime}_{4}

With the two improvements above, we compute the bounding path and the bound distance as follows. For two boundary vertices viv_{i} and vjv_{j}, given a configurable parameter ξ\xi, we compute a set i,j\mathcal{B}_{i,j} consisting of at most ξ\xi bounding paths that contain the least number of vfrags (where bounding paths containing the same number of vfrags are counted as only one path). Here, Pl(i,j){P^{\prime}_{l}}(i,j) (l[1,|i,j|]l\in[1,|\mathcal{B}_{i,j}|]) is called a bounding path, and P1(i,j){P^{\prime}_{1}}(i,j) represents the bounding path with the fewest vfrags.

For a bounding path Pl(i,j){{P_{l}}^{\prime}}(i,j) from viv_{i} to vjv_{j} in subgraph SGSG, we can compute a corresponding bound distance, which is the sum of the ϕ(Pl(i,j))\phi({{P_{l}}^{\prime}}(i,j)) smallest unit weights in SGSG, denoted by BD(Pl(i,j))BD({{P_{l}}^{\prime}}(i,j)).

Example 2.

For boundary vertices v13v_{13} and v14v_{14} in SG4SG_{4} in Fig. 4(a), P1(13,14){P^{\prime}_{1}}(13,14)=v13,v16,v14\langle v_{13},v_{16},v_{14}\rangle and P2(13,14){P^{\prime}_{2}}(13,14)=v13,v18,v17,v16,v14\langle v_{13},v_{18},v_{17},v_{16},v_{14}\rangle are the bounding paths if ξ=2\xi=2. As to P1(13,14){P^{\prime}_{1}}(13,14), ϕ(P1(13,14))=8\phi({P^{\prime}_{1}}(13,14))=8 and all unit weights of vfrags in SG4SG_{4} are 1 initially, so BD(P1(13,14))=8BD({P^{\prime}_{1}}(13,14))=8. When SG4SG_{4} change to SG4SG^{\prime}_{4}, the unit weights in SG4SG^{\prime}_{4} are updated to (1/3,3)(1/3,3), (1/2,4)(1/2,4), (1,8)(1,8), and (2,3)(2,3), meaning there are 3 vfrags of unit weight 1/31/3, 4 vfrags of unit weight 1/21/2, and so on. Thus, BD(P1(13,14))BD({P^{\prime}_{1}}(13,14)) can be computed using the 8 smallest unit weights, consisting of 3 unit weights of 1/31/3, 4 unit weights of 1/21/2, and 1 unit weight of 11, with the result being 4.

Bounding paths are be computed in the initial graph offline and always remain the same regardless of the changing weights.

3.5 Lower Bound Distances

After identifying the set of bounding paths between viv_{i} and vjv_{j} in subgraph SGSG, we aim to determine the bounding path that would impose as tight a lower bound on the shortest distance as possible. This translates to locating the maximal bound distance less than or equal to the shortest distance corresponding to this set of bounding paths. It will be referred to as the lower bound distance between viv_{i} and vjv_{j}, and the corresponding bounding path is called the lower bounding path.

To reach a tighter lower bound, recall that for each bounding path pl(i,j){p_{l}}^{\prime}(i,j) between viv_{i} and vjv_{j} in SGSG, its bound distance (BD(pl(i,j))BD({p_{l}}^{\prime}(i,j))) computed using vfrags is never greater than its actual distance (D(pl(i,j))D({p_{l}}^{\prime}(i,j))). Consequently, if D(pl(i,j))BD(pg(i,j))D({p_{l}}^{\prime}(i,j))\leq BD({p_{g}}^{\prime}(i,j)), we can conclude D(pl(i,j))D(pg(i,j))D({p_{l}}^{\prime}(i,j))\leq D({p_{g}}^{\prime}(i,j)). Exploiting this relationship allows us to identify the shortest path between viv_{i} and vjv_{j} among the set of bounding paths in many cases. If this fails, however, we can still utilize the bounding path with the maximal bound distance as the lower bounding path. We detail this approach below.

Definition 5 (Lower Bounding Path).

Let i,j\mathcal{B}_{i,j} be a set of bounding paths between viv_{i} and vjv_{j} in SGSG and P1(i,j){P_{1}}(i,j) be the shortest path connecting these two vertices in the original graph GG. A bounding path Pb(i,j){P^{\prime}_{b}}(i,j) is the lower bounding path between viv_{i} and vjv_{j} if it satisfies one of the following conditions:

(1) D(Pb(i,j))=D(P1(i,j))D({P^{\prime}_{b}}(i,j))=D({P_{1}}(i,j)); or

(2) the bound distance of Pb(i,j){P^{\prime}_{b}}(i,j) is maximal among all bounding paths in i,j\mathcal{B}_{i,j}.

Definition 6 (Lower Bound Distance).

Following Definition 5, the lower bound distance between viv_{i} and vjv_{j}, denoted by LBD(i,j)LBD(i,j), equals to D(P1(i,j))D({P_{1}}(i,j)) if condition (1) is established, or BD(Pb(i,j))BD({P^{\prime}_{b}}(i,j)) if condition (2) is met.

Next, we introduce Theorem1111We have excluded the proofs for the theorems and lemmas in consideration of space limitations; however, they can be found in the appendix. to help identify the lower bound distance for a pair of boundary vertices.

Theorem 1.

Let i,j\mathcal{B}_{i,j}={Pl(i,j)}\{{P^{\prime}_{l}(i,j)}\} (l[1,r]l\in[1,r], r=|i,j|r=|\mathcal{B}_{i,j}|) be the set of bounding paths connecting viv_{i} and vjv_{j} in subgraph SGSG, and let Pu(i,j)P^{\prime}_{u}(i,j) (u[1,r]u\in[1,r]) be the path whose actual distance in SGSG is the shortest within i,j\mathcal{B}_{i,j}. Assuming that the paths in i,j\mathcal{B}_{i,j} are sorted in ascending order based on their bound distances, we can make the following claims:

  1. 1.

    If BD(Pl(i,j))D(Pu(i,j))BD(P^{\prime}_{l}(i,j))\leq D(P^{\prime}_{u}(i,j)) and BD(Pl+1BD(P^{\prime}_{l+1} (i,j))(i,j)) \geq D(Pu(i,j))D(P^{\prime}_{u}(i,j)) (l+1[1,r]l+1\in[1,r]), then Pu(i,j)P^{\prime}_{u}(i,j) is the shortest path between viv_{i} and vjv_{j} in SGSG. In this case, Pu(i,j)P^{\prime}_{u}(i,j) is the lower bounding path.

  2. 2.

    If BD(Pr(i,j))<D(Pu(i,j))BD(P^{\prime}_{r}(i,j))<D(P^{\prime}_{u}(i,j)), then BD(Pr(i,j))BD(P^{\prime}_{r}(i,j)) must be less than the actual shortest distance between viv_{i} and vjv_{j} in SGSG and Pr(i,j)P^{\prime}_{r}(i,j) is the lower bounding path.

Proof.

To show the first claim holds, suppose the shortest path from viv_{i} to vjv_{j} is not Pu(i,j)P^{\prime}_{u}(i,j) but another path denoted by Pf(i,j)P^{\prime}_{f}(i,j) that is not covered by i,j\mathcal{B}_{i,j}. Based on this assumption, it can be inferred that ϕ(Pf(i,j))>ϕ(Pr(i,j))\phi(P^{\prime}_{f}(i,j))>\phi(P^{\prime}_{r}(i,j)), so BD(Pf(i,j))>BD(Pr(i,j))BD(P^{\prime}_{f}(i,j))>BD(P^{\prime}_{r}(i,j)). As the actual distance of a path is no less than its bound distance, we have D(Pf(i,j))>BD(Pr(i,j))BD(Pl+1(i,j))D(P^{\prime}_{f}(i,j))>BD(P^{\prime}_{r}(i,j))\geq BD(P^{\prime}_{l+1}(i,j)). Also, because BD(Pl+1(i,j))>D(Pu(i,j))BD(P^{\prime}_{l+1}(i,j))>D(P^{\prime}_{u}(i,j)), we have D(Pf(i,j))>D(Pu(i,j))D(P^{\prime}_{f}(i,j))>D(P^{\prime}_{u}(i,j)) which is a contradiction.

The second claim is true if Pu(i,j)P^{\prime}_{u}(i,j) is the shortest path from viv_{i} to vjv_{j} in SGSG; otherwise, we infer that the shortest path from viv_{i} to vjv_{j} is not in i,j\mathcal{B}_{i,j}. In this case, the bound distance of this shortest path must be greater than BD(Pr(i,j))BD(P^{\prime}_{r}(i,j)), so its actual distance is also greater than BDBD (Pr(i,j))(P^{\prime}_{r}(i,j)). ∎

Refer to caption
(a)
Refer to caption
(b) SG4SG_{4}
Refer to caption
(c) SG4SG_{4}
Refer to caption
(d) SG4SG_{4}
Figure 5: Example for Theorem 1
Example 3.

This example illustrates the two cases in Theorem 1. For the graph in Fig. 5(a), P1(s,t)P^{\prime}_{1}(s,t)=vs,v1,vt\langle v_{s},v_{1},v_{t}\rangle, P2(s,t)P^{\prime}_{2}(s,t) = vs,v2,v3,\langle v_{s},v_{2},v_{3}, vtv_{t}\rangle, and P3(s,t)P^{\prime}_{3}(s,t)=vs,v4,v5,v6,vt\langle v_{s},v_{4},v_{5},v_{6},v_{t}\rangle are bounding paths between vsv_{s} and vtv_{t} when ξ=3\xi=3. If the weights change as shown in Fig. 5(b), P3(s,t)P^{\prime}_{3}(s,t) becomes the bounding path with the shortest distance and the unit weights in this graph are updated to (2, 4), (4, 3), and (8, 2). As BD(P1(s,t))=4BD(P^{\prime}_{1}(s,t))=4, BD(P2(s,t))=6BD(P^{\prime}_{2}(s,t))=6, and BD(P3(s,t))=8BD(P^{\prime}_{3}(s,t))=8, the bound distance of BDBD(P3(s,t))(P^{\prime}_{3}(s,t)) equals to its actual distance. Hence, P3(s,t)P^{\prime}_{3}(s,t) is the shortest path from vsv_{s} to vtv_{t}, which follows the first claim in Theorem 1.

For the scenario shown as Fig. 5(c), P1(s,t)P^{\prime}_{1}(s,t), P2(s,t)P^{\prime}_{2}(s,t), and P3(s,t)P^{\prime}_{3}(s,t) are still the bounding paths connecting vsv_{s} and vtv_{t}. When the graph in Figure 5(c) changes to the one in Fig. 5(d), there are five vfrags with unit weight 1. In this case, BD(P1(s,t))=2BD(P^{\prime}_{1}(s,t))=2, BD(P2(s,t))=3BD(P^{\prime}_{2}(s,t))=3, and BD(P3(s,t))=4BD(P^{\prime}_{3}(s,t))=4. Because D(P3(s,t))>BD(P3(s,t))D(P^{\prime}_{3}(s,t))>BD(P^{\prime}_{3}(s,t)), it is uncertain whether P3(s,t)P^{\prime}_{3}(s,t) is the shortest path from vsv_{s} to vtv_{t}, but one can guarantee that the shortest distance between vsv_{s} and vtv_{t} is greater than BD(P3(s,t))BD(P^{\prime}_{3}(s,t)), conforming to the second claim in Theorem 1.

Based on Theorem 1, we can identify the lower bounding paths and the lower bound distances between any pair of boundary vertices in each subgraph. Since any pair of boundary vertices vv can co-occur in more than one subgraph, there may be multiple lower bound distances associated with them, each for one subgraph. We call the least of these lower bound distances the minimum lower bound distance, denoted by MBD(i,j)MBD(i,j), that provides the basis for constructing the skeleton graph in the Section 3.6.

Refer to caption
Figure 6: Skeleton Graph GλG_{\lambda}

3.6 Skeleton Graph

The skeleton graph GλG_{\lambda} contains all boundary vertices of all subgraphs. Any pair of boundary vertices viv_{i} and vjv_{j} within the same subgraph is connected by edge ei,je^{\prime}_{i,j} with its weight being the minimum lower bound distance between viv_{i} and vjv_{j}. The rationale behind introducing the skeleton graph is that KSPs between two vertices in the original graph GG possibly pass through the same sequence of boundary vertices as their shortest paths in GλG_{\lambda}. Thus, GλG_{\lambda} can provide an approximate search guideline to identify the different subgraphs in finding KSPs between a pair of vertices in GG. Fig. 6 shows a skeleton graph GλG_{\lambda} corresponding to the graph GG in Fig. 2.

4 Maintenance of DTLP

As the edge weights of the graph evolve over time, the main task in maintaining DTLP is to update the lower bound distances between boundary vertices in each subgraph. After that, the lower bound distances between boundary vertices can be easily deduced, so do the edge weights in the skeleton graph GλG_{\lambda}. In this section, we first introduce an Edges and Bounding Paths Inverted Index EBP-II in Section 4.1 to ease the maintenance of lower bound distances, and then transform EBP-II into a prefix tree-based index structure with higher storage efficiency in Section 4.2.

4.1 EBP-II

EBP-II is an inverted index built for each subgraph. Each entity in EBP-II is a key-value pair, where the edges appearing in the bounding paths are viewed as keys and the value of each key (i.e., an edge) is the set of bounding paths containing the corresponding edge. The design of EBP-II is such that when the weight of an edge changes by Δw\Delta w, we can immediately identify the bounding paths containing that edge, and update their actual distances by Δw\Delta w. Subsequently, we compute the unit weights of the edge based on its new weight, and then update the bound distances of the bounding paths in the subgraph. Finally, for any two boundary vertices, if their bounding paths are impacted by the update on the actual or bound distances, we can easily update their lower bound distance as per Theorem 1. Fig. 7 shows a subgraph instance as well as the bounding paths between the boundary vertices v1v_{1} and v10v_{10}, and its corresponding EBP-II is illustrated in Fig. 8.

Refer to caption
Figure 7: Bounding paths (P1,,P6P_{1},\cdots,P_{6}) between v1v_{1} and v10v_{10} (ξ=2\xi=2).
Refer to caption
Figure 8: EBP-II instance. Corresponding to Fig.7, P1P_{1} appears in the values of keys (i.e., edges) e1,4e_{1,4}, e4,7e_{4,7}, e7,10e_{7,10}.

4.2 Compaction of EBP-II

EBP-II supports efficient maintenance of lower bound distances. However, it requires large storage overhead due to two major reasons: 1) a subgraph could contain a large number of edges, and therefore indexing them using edges as keys and set of bounding paths as values could lead to large storage overhead; and 2) there could be many duplicate bounding paths associated with different keys, which unnecessarily consumes extra storage. In view of these issues, we introduce an algorithm to compact EBP-II without sacrificing the maintenance efficiency.

The compaction strategy first partitions the key-value pairs in EBP-II into different sets with the purpose of assigning the values having as many common paths as possible into the same set. In this way, the shared paths corresponding to different values within the same set have a higher likelihood of being compacted. Following this, a Modified Prefix Tree (MPTree for short) is introduced to condense the repeated bounding paths in the values within each group. We describe the procedure in more detail below.

4.2.1 Partitioning EBP-II.

Suppose 𝒫\mathcal{P} and 𝒫\mathcal{P}^{\prime} are two bounding path sets associated with two distinct edges. The compaction ratio for storing 𝒫\mathcal{P} and 𝒫\mathcal{P}^{\prime} refers to the proportion of common paths between 𝒫\mathcal{P} and 𝒫\mathcal{P}^{\prime} (which are stored only once) among all paths in these two sets. Ideally, their compaction ratio can reach |𝒫𝒫||𝒫𝒫|\tfrac{|\mathcal{P}\cap\mathcal{P}^{\prime}|}{|\mathcal{P}\cup\mathcal{P}^{\prime}|}, i.e., their Jaccard similarity. It is more likely to achieve a higher compaction ratio if values with a higher Jaccard similarity are grouped into the same set. As such, we employ Locality Sensitive Hashing (LSH) [26] to hash the edges whose values have high Jaccard similarity into the same set with the following procedure.

Step 1. Transforming EBP-II into a matrix called PE-Matrix, where each bounding path becomes a row and edges correspond to columns. If a bounding path includes an edge, the corresponding position of the matrix is set to 1 and 0 otherwise. The PE-Matrix helps depict the relationship between bounding paths and edges using 0s and 1s, simplifying the subsequent computation of a signature matrix with Minhash [27]. Fig. 9 gives the PE-Matrix of EP-Index in Fig. 8.

Step 2. Generating a signature matrix of PE-Matrix with MinHash [27], denoted by Sig-Matrix. The columns of Sig-Matrix still correspond to all edges, and rows correspond to hh hash functions. We employ each hash function to hash the row numbers of PE-Matrix and select a hashed row number as the signature for each column of Sig-Matrix. The signatures produced by hh hash functions constitute hh rows of Sig-Matrix. Similar sequences of signatures in Sig-Matrix indicate edges likely to share common paths. Example 4 illustrates the generation of the Sig-Matrix.

Step 3. Employing LSH to divide columns (edges) in a Sig-Matrix into different sets. More specifically, the rows of Sig-Matrix are first partitioned into bb bands. In every band, the sequence of hb\frac{h}{b} signatures in every column is used to hash the columns into different sets such that any two columns hashed into the same set are identical in at least one band. As such, the columns within the same set are likely to have more common bounding paths than those in different sets.

Refer to caption
Figure 9: PE-Matrix

For the edges within the same partitioned set, we propose the MPTree to further compact their bounding path sets, which will be discussed in Section 4.2.2.

Example 4.

For the given PE-Matrix in Fig.9, we introduce two hash functions h1(x)=(x+1){h_{1}}(x)={(x+1)} mod 4 and h2(x)=(5x+1){h_{2}}(x)={(5x+1)} mod 4 to compute the corresponding Sig-Matrix that is initialized as depicted by Fig.10(a). Next, we process each row rr of PE-Matrix as follows. First, we compute h1(r){h_{1}(r)} and h2(r){h_{2}(r)}, shown in the right part of Fig.9, where rr is the row number. Second, for each column cc in rr do the following: 1) If cc has 0 in row rr, do nothing; 2) otherwise, all rows in column cc of Sig-Matrix are set to the smaller values between the current ones and {h1(r),h2(r)}\{{h_{1}}(r),{h_{2}}(r)\}. Fig. 10(b) shows the procedure of computing the Sig-Matrix, where the edges with the same color are partitioned into the same group.

Refer to caption
(a) Initialized Sig-Matrix
Refer to caption
(b) SG4SG_{4}
Figure 10: Sig-Matrix

4.2.2 MPTree

MPTree is a modified prefix tree used to compact the sets of bounding paths corresponding to the edges within the same partitioned set by LSH. Prior to constructing MPTree, we first sort the bounding paths in each set in descending order based on their frequency. Next, we initialize MPTree as a root node, and then insert each edge and its bounding paths (as nodes) into the MPTree with the following procedure.

(1) For an edge ei,je_{i,j} and the set of its bounding paths 𝒫i,j\mathcal{P}_{i,j}={p0,,pl}\{p_{0},\cdots,p_{l}\} to be inserted into MPTree, we first combine them as a sequence L=p0,,pl,ei,j{L}={\langle p_{0},\cdots,p_{l},e_{i,j}\rangle}, and then identify the longest matching prefix of L{L} in the MPTree, denoted by L~\tilde{L}. Please note that L~\tilde{L} does not need to start from the root but possibly from any node in the MPTree.

(2) If L~\tilde{L} exists, the remaining part of L{L} will be directly appended to L~\tilde{L}. Otherwise, L{L} will be inserted at the root. Here, pjp_{j} (j[0,l]j\in[0,l]) and ei,je_{i,j} are called normal and tail nodes respectively.

(3) When L~\tilde{L} is inserted, the root node will record ei,je_{i,j} that has a reference to the tail node ei,je_{i,j}; meanwhile, the tail node ei,je_{i,j} keeps |𝒫i||\mathcal{P}_{i}| (i.e., the size of 𝒫i\mathcal{P}_{i}) that is used to efficiently identify the bounding paths set 𝒫i,j\mathcal{P}_{i,j} for ei,je_{i,j} in the MPTree.

After all edges and their bounding path sets are inserted, the construction of MPTree is complete. Since each partitioned set of edges corresponds to a MPTree and a subgraph usually has multiple partitioned sets, we thus merge these MPTrees into a Global MPTree (G-MPTree for short) for the subgraph. In the merging process, we add a common parent node for the root of each MPTree as the root of G-MPTree. The root of G-MPTree records the edges kept by each of its child nodes. Fig. 11 shows a G-MPTree that consists of two MPTrees corresponding to the green and purple groups of edges shown in Fig. 10(b).

Refer to caption
Figure 11: G-MPTree

4.3 Maintenance of lower bound distances

When the weight of an edge ei,je_{i,j} changes by Δw\Delta w, we update the lower bound distances as follows. First, we identify the bounding paths covering ei,je_{i,j} based on G-MPTree. In particular, we first visit the root of G-MPTree to identify the subtree containing ei,je_{i,j}, and then locate the tail node ei,je_{i,j} in the subtree based on the reference to ei,je_{i,j} kept by the root. Subsequently, we traverse upward |𝒫i||\mathcal{P}_{i}| steps from ei,je_{i,j}, where the visited normal nodes form 𝒫i\mathcal{P}_{i}, i.e., the set of bounding paths containing ei,je_{i,j}. Finally, as discussed in Section 4.1, we update the actual and bound distances of the bounding paths in 𝒫i\mathcal{P}_{i}. After that, the lower bound distances between the corresponding boundary vertices as specified in Section 4.1 can be efficiently updated.

5 KSP-DG algorithm

In this section, we first discuss the theoretical basis of KSP-DG (Section 5.1), and then present the detailed algorithm (Section 5.2). Subsequently, we propose a progressive Yen’s algorithm to optimize KSP-DG ( Section 5.3) and prove its correctness (Section 5.4). For clarity, we use Pi(s,t){P_{i}}(s,t) and Pjλ(s,t){P^{\lambda}_{j}}(s,t) to denote the ithi^{th} and the jthj^{th} shortest path in the original graph GG and the skeleton graph GλG_{\lambda} respectively. For a given query q(vs,vt)q(v_{s},v_{t}), we assume vsv_{s} and vtv_{t} are both boundary vertices in GG for ease of presentation, which means both of them are in GλG_{\lambda}. Cases where vsv_{s} and vtv_{t} are not boundary vertices in GG are discussed in Section 5.

5.1 Underpinnings of KSP-DG

For a given query q(vs,vt)q(v_{s},v_{t}), the basic idea of KSP-DG is to use the shortest paths between vsv_{s} and vtv_{t} in GλG_{\lambda} one by one (in increasing order of distance) as a reference path to identify the corresponding shortest paths in GG that have the same sequence of boundary vertices. This iterative process continues until the KSPs for the query are found. For this to succeed, one key observation is that the path between vsv_{s} and vtv_{t} in GλG_{\lambda} is not longer than the path linking vsv_{s} and vtv_{t} in GG with the same sequence of boundary vertices. which is formally presented as follows.

Lemma 1.

Given two boundary vertices viv_{i} and vjv_{j} in GλG_{\lambda}, if the shortest path between viv_{i} and vjv_{j} in GλG_{\lambda}, P1λ(i,j){P^{\lambda}_{1}}(i,j), contains only viv_{i} and vjv_{j}, then D(P1λ(i,j))D(P1(i,j))D({P^{\lambda}_{1}}(i,j))\leq D({P_{1}}(i,j)), where D(P1λ(i,j))D({P^{\lambda}_{1}}(i,j)) and D(P1(i,j))D({P_{1}}(i,j)) are the shortest distances between viv_{i} to vjv_{j} in GλG_{\lambda} and GG respectively.

Proof.

Assume for the sake of contradiction that D(P1λ(i,j))D({P^{\lambda}_{1}}(i,j)) >> D(P1(i,j))D({P_{1}}(i,j)). By definition, the weight of edge (viv_{i}, vjv_{j}) in GλG_{\lambda} is the minimum lower bound distance between viv_{i} and vjv_{j}, which is not greater than D(P1(i,j))D({P_{1}}(i,j)). Therefore, if D(P1λ(i,j))D({P^{\lambda}_{1}}(i,j)) >> D(P1(i,j))D({P_{1}}(i,j)), then there must exist one or more vertices between viv_{i} and vjv_{j} in P1λ(i,j){P^{\lambda}_{1}}(i,j), which contradicts the initial assumption. ∎

Theorem 2.

vs,vtGλ\forall v_{s},v_{t}\in G_{\lambda}, D(P1λ(s,t))D(P1(s,t))D({P^{\lambda}_{1}}(s,t))\leq D(P_{1}(s,t)).

Proof.

For the shortest path P1(s,t)P_{1}(s,t) between vsv_{s} and vtv_{t} in GG, there must be a corresponding path in graph GλG_{\lambda} with the same sequence of boundary vertices as those present in P1(s,t)P_{1}(s,t); let us denote this path by Pfλ(s,t){P^{\lambda}_{f}}(s,t). For any pair of adjacent vertices viv_{i} and vjv_{j} in this sequence of boundary vertices (i.e., there exists an edge (vi,vj)(v_{i},v_{j}) in Pfλ(s,t){P^{\lambda}_{f}}(s,t)), it follows from Lemma 1 that the distance between viv_{i} and vjv_{j} in Pfλ(i,j){P^{\lambda}_{f}}(i,j) cannot be greater than the distance connecting viv_{i} and vjv_{j} in P1(i,j)P_{1}(i,j). Hence, D(Pfλ(s,t))D(P1(s,t))D({P^{\lambda}_{f}}(s,t))\leq D(P_{1}(s,t)). If Pfλ(s,t){P^{\lambda}_{f}}(s,t) is the shortest path between vsv_{s} and vtv_{t} in GλG^{\lambda}, then we immediately have D(P1λ(s,t))D(P1(s,t))D({P^{\lambda}_{1}}(s,t))\leq D(P_{1}(s,t)); otherwise, there must exist a shortest path P1λ(s,t){P^{\lambda}_{1}}(s,t) such that D(P1λ(s,t))D(Pfλ(s,t))D({P^{\lambda}_{1}}(s,t))\leq D({P^{\lambda}_{f}}(s,t)). As D(Pfλ(s,t))D(P1(s,t))D({P^{\lambda}_{f}}(s,t))\leq D(P_{1}(s,t)), we have D(P1λ(s,t))D({P^{\lambda}_{1}}(s,t)) \leq D(P1(s,t))D(P_{1}(s,t)). ∎

5.2 KSP-DG Algorithm in Detail

KSP-DG is designed to run in a distributed cluster with the master-worker model. The master node maintains the original graph GG, receives the weight updates for edges in GG, and serves as the entry point of KSP queries. DTLP, as the fundamental index structure, is distributed across workers. In particular, the subgraphs resulting from partitioning GG are allocated to different workers on a many-to-one basis based on their load. Each worker maintains the assigned subgraphs as well as the bounding paths between boundary vertices (i.e., the first level of DTLP) within each subgraph. Moreover, a copy of the skeleton graph GλG_{\lambda} (the second level of the DTLP index) is also kept on each worker.

Using the DLTP index, KSP-DG adopts a filter-and-refine strategy to iteratively find KSPs for the query q(vs,vt)q(v_{s},v_{t}). Each iteration of KSP-DG consists of two steps: filter and refine. Without loss of generality, we describe the two steps for the ithi^{th} iteration.

In the filter step, we compute the ithi^{th} shortest path between vsv_{s} and vtv_{t} in the skeleton graph GλG_{\lambda}, and use it as the reference path. This path corresponds to a sequence of boundary vertices, denoted by \mathcal{R}, between vsv_{s} and vtv_{t} in GG. This step can be executed on one of the workers responsible for processing this query.

In the refine step, we aim to compute the corresponding k shortest paths connecting vsv_{s} and vtv_{t} in GG that traverse the same sequence of boundary vertices as those present in the reference path. Since any two adjacent boundary vertices in \mathcal{R} must belong to the same subgraph, we can identify partial k shortest paths for each pair of adjacent boundary vertices in \mathcal{R} from the corresponding subgraphs individually. This operation can be carried out by the workers maintaining those respective subgraphs in parallel. Next, the generated partial k shortest paths are reported back by these workers to the worker responsible for this query, which will then merge the partial k shortest paths received to form kk complete shortest paths, which all share the same sequence of boundary vertices as the reference path.

In each iteration, the generated k shortest paths corresponding to the reference path, called the candidate KSPs, are used to update a list \mathcal{L} of the shortest paths that have been obtained so far. \mathcal{L} keeps only k shortest paths found so far in ascending order of distance. After using the generated candidate KSPs to update \mathcal{L} in the ithi^{th} iteration, if the distance of the kthk^{th} path in \mathcal{L} is not greater than that of the reference path generated in the (i+1)th(i+1)^{th} iteration, the algorithm terminates, and the paths in \mathcal{L} are the final answer, i.e., the KSPs between vsv_{s} and vtv_{t} in GG. Otherwise, the iteration continues.

Algorithm 1 shows the procedure of KSP-DG. It first initializes the parameters in Line 1, and then executes the filter and refine steps in each iteration in Lines 2-12, where Piλ(s,t)P^{\lambda}_{i}(s,t) denotes the ithi^{th} reference path. The function candidateKSP is to identify the candidate KSPs for a given reference path, and its pseudo-code is shown in Algorithm 2. Line 6 in Algorithm 2 uses a Progressive Yen’s (PYen’s for short) Algorithm that will be discussed in Section 5.3.2 to compute the kk-shortest paths in subgraph SG between the jthj^{th} and (j+1)th(j+1)^{th} vertices of the reference path.

0:  GλG_{\lambda}, q(vs,vt)q(v_{s},v_{t}), 𝒮\mathcal{S}={SG1\{{SG}_{1}, \cdots, SGn}SG_{n}\};
0:  KSPs from vsv_{s} to vtv_{t} in GG;
1:  =ϕ\mathcal{L}=\phi; Dist=Dist=\infty; i=1i=1;
2:  while =ϕ\mathcal{L}=\phi |||| DistD(Pi+1λ(s,t))Dist\leq D({P^{\lambda}_{i+1}}(s,t)) do
3:     𝒞={\mathcal{C}}=candidateKSP (𝒮\mathcal{S}, Piλ(s,t){P^{\lambda}_{i}}(s,t));
4:     Add 𝒞{\mathcal{C}} into {\mathcal{L}};
5:     if ||>k|{\mathcal{L}}|>k then
6:        Keep the k shortest paths in \mathcal{L} and remove others;
7:     end if
8:     Dist = the distance of the kthk^{th} path in \mathcal{L};
9:     i++i++;
10:  end while
11:  return \mathcal{L};
Algorithm 1 KSP-DG
0:  𝒮\mathcal{S}={SG1\{{SG}_{1}, \cdots, SGn}SG_{n}\}, Piλ(s,t)P^{\lambda}_{i}(s,t);
0:  Candidate KSPs from vsv_{s} to vtv_{t} in GG;
1:  Set 𝒞=ϕ\mathcal{C}=\phi; Set 𝒴=ϕ\mathcal{Y}=\phi; jj=1;
2:  while j<|Piλ(s,t)|j<|P^{\lambda}_{i}(s,t)| do
3:     Identify 𝒰\mathcal{U}, the set of subgraphs containing the jthj^{th} and (j+1)th(j+1)^{th} vertices in Piλ(s,t)P^{\lambda}_{i}(s,t), vj\texttt{v}_{j} and vj+1\texttt{v}_{j+1};
4:     𝒴=ϕ\mathcal{Y}=\phi;
5:     for all subgraph SGSG \in 𝒰\mathcal{U} do
6:        𝒴\mathcal{Y}=𝒴\mathcal{Y} \cup PYen(vj\texttt{v}_{j},vj+1\texttt{v}_{j+1},SGSG);
7:     end for
8:     Keep only kk shortest paths in 𝒴\mathcal{Y};
9:     𝒞\mathcal{C}=𝒞\mathcal{C} \Join 𝒴\mathcal{Y};
10:     Keep only kk shortest paths in 𝒞\mathcal{C};
11:     j++j++;
12:  end while
13:  Return 𝒞\mathcal{C};
Algorithm 2 candidateKSP
Example 5.

Suppose v4v_{4} and v13v_{13} in Fig. 2 are the source and destination vertices respectively, and k=2k=2. In the first iteration, KSP-DG identifies the first reference path from v4v_{4} to v13v_{13} in GλG_{\lambda} (shown as Fig. 6) as P1λ(4,13)=v4,v6,v9,v13{P^{\lambda}_{1}(4,13)}=\langle v_{4},v_{6},v_{9},v_{13}\rangle with distance 19. Next, KSP-DG computes k=2k=2 shortest paths between any two adjacent boundary vertices as shown in the following table, where the third column shows the subgraphs involved. Then, KSP-DG joins the partial shortest paths to generate k=2k=2 candidate shortest paths from v4v_{4} to v13v_{13}, denoted by P1=v4,v6,v9,v11,v12,v13{P_{1}}=\langle v_{4},v_{6},v_{9},v_{11},v_{12},v_{13}\rangle with distance 19 and P2=v4,v5,v6,v9,v11,v12,v13{P_{2}}=\langle v_{4},v_{5},v_{6},v_{9},v_{11},v_{12},v_{13}\rangle with distance 25. So, ={P1,P2}\mathcal{L}=\{{P_{1}},{P_{2}}\}.

adjacent boundary vertices partial shortest paths involved subgraphs
(v4(v_{4}, v6)v_{6}) v4,v5,v6\langle v_{4},v_{5},v_{6}\rangle, v4,v6\langle v_{4},v_{6}\rangle SG1SG_{1}, SG2SG_{2}
(v6(v_{6}, v9)v_{9}) v6,v9\langle v_{6},v_{9}\rangle SG2SG_{2}
(v9(v_{9}, v13)v_{13}) v9,v11,v12,v13\langle v_{9},v_{11},v_{12},v_{13}\rangle v9,v11,v10,v14,v13\langle v_{9},v_{11},v_{10},v_{14},v_{13}\rangle SG3SG_{3}

Since the second reference path is P2λ(4,13){P^{\lambda}_{2}}(4,13)=v4,v9,v13\langle v_{4},v_{9},v_{13}\rangle with distance 22, and D(P2)>D(P2λ(4,13))D(P_{2})>D({P^{\lambda}_{2}}(4,13)), KSP-DG continues onto the second iteration, where it calculates candidate KSPs w.r.t. P2λ(4,13){P^{\lambda}_{2}}(4,13), denoted by P3P_{3}=v4,v7,v8,v9,v11,v12,v13\langle v_{4},v_{7},v_{8},v_{9},v_{11},v_{12},v_{13}\rangle with distance 22, and P4P_{4}=v4,v7,v8,v9,v11,\langle v_{4},v_{7},v_{8},v_{9},v_{11}, v10,v14,v13v_{10},v_{14},v_{13}\rangle with distance 34. Then \mathcal{L} is updated to {P1,P3}\{P_{1},P_{3}\}. After identifying the third reference path P3λ(4,13)P^{\lambda}_{3}(4,13)=v4,v6,v10,v13\langle v_{4},v_{6},v_{10},v_{13}\rangle with distance 25, it is safe for KSP-DG to conclude that P1P_{1} and P3P_{3} are the two shortest paths from v4v_{4} to v13v_{13}, as D(P3)<Dist(P3λ(4,13))D(P_{3})<Dist(P^{\lambda}_{3}(4,13)).

Finding KSPs in directed graphs. KSP-DG can be adapted for KSP queries in directed graphs with minor DTLP index modifications. Instead of computing a single lower bound distance for a pair of boundary vertices viv_{i} and vjv_{j}, we maintain two sets of bounding paths and their corresponding lower bound distances, one for each direction (from viv_{i} to vjv_{j} and vice versa). This modification transforms the skeleton graph into a directed one with two edges connecting each adjacent vertex pair, enabling KSP-DG to operate effectively for KSP search.

Non-boundary Vertices as Source or Destination. In scenarios where vsv_{s} and/or vtv_{t} are not boundary vertices, we address this by initially treating them as boundary vertices within subgraph SGxSG_{x} and SGySG_{y} respectively. This tactical maneuver involves connecting vsv_{s} to every boundary vertex viv_{i} in SGxSG_{x} and vtv_{t} to every boundary vertex vjv_{j} in SGySG_{y} within the skeleton graph GλG_{\lambda}. These connections are made with edge weights set to the minimum lower bound distance between vsv_{s} and viv_{i}, and similarly for vtv_{t}. Afterward, both vsv_{s} and vtv_{t} are incorporated into GλG_{\lambda}, and the KSP-DG process, as outlined in Algorithm 1, is executed as if vsv_{s} and vtv_{t} were boundary vertices.

5.3 Optimization on Computing Partial KSPs

In the refinement step of KSP-DG, a computationally intensive yet frequently used operation involves calculating partial KSPs for specified pairs of boundary vertices within subgraphs. Typically, this task is performed using Yen’s algorithm, which can be inefficient. To address this inefficiency, we introduce PYen, a Progressive Yen’s algorithm designed to expedite this computation. In this section, we first provide an overview of Yen’s algorithm and then delve into the optimizations introduced in PYen.

5.3.1 Yen’s algorithm

In Yen’s algorithm, given source and destination vertices vsv_{s} and vtv_{t} in a subgraph SGSG, it iteratively identifies the (i+1)th(i+1)^{th} shortest path Pi+1(s,t)P_{i+1}(s,t) based on the previously determined ithi^{th} shortest path Pi(s,t)P_{i}(s,t) (1i<k1\leq i<k). The algorithm computes a deviation path Pli(s,t)P^{l}i(s,t) for each vertex vlv_{l} (excluding vtv_{t}) in Pi(s,t)P{i}(s,t), referred to as a deviation vertex. A deviation path Pli(s,t)P^{l}i(s,t) comprises two partial paths: the base partial path from vsv_{s} to vlv_{l} following Pi(s,t)P{i}(s,t) and the spur partial path, which represents the shortest path from vlv_{l} to vtv_{t} in the subgraph with the weight of edge el,l+1e_{l,l+1} set to infinity. Here, vlv_{l} and vl+1v_{l+1} are adjacent vertices in Pi(s,t){P_{i}}(s,t). Since the base partial path is already known, the focus is on computing the spur partial path, typically done using existing shortest path algorithms like Dijkstra. Once all deviation paths of Pi(s,t){P_{i}}(s,t) are determined, the shortest one among them becomes the next shortest path Pi+1(s,t)P_{i+1}(s,t). This process continues until the KSPs between vsv_{s} and vtv_{t} are determined.

5.3.2 PYen algorithm

PYen has the following optimizations over Yen’s algorithm.

Parallel deviation path identification. Following insights from Para-Yen[28] that parallelized deviation path computation, PYen improves Yen’s efficiency by initiating separate search instances to compute spur partial paths from each vertex in Pi(s,t)P_{i}(s,t) to the destination concurrently. This fully parallelized approach yields significant speedup over Yen’s algorithm, particularly when ample computing resources, such as CPU cores, enable greater parallelism. In our distributed solution, where computing resources are already stretched with three parallel levels for processing KSP queries, computing partial KSPs, and identifying deviation paths, PYen shifts focus from single shortest path parallelization, as in Para-Yen[28], to optimizing search efficiency by reducing redundant computations and early termination of unpromising deviation path exploration.

Avoiding repetitive computation. Yen’s algorithm redundantly computes the same shortest paths for spur partial paths, whereas PYen mitigates this by reusing previously computed shortest paths. Unlike prior work, such as Feng’s algorithm [29], which precomputes and indexes the shortest paths to the destination vertex, PYen addresses dynamic graphs where precomputed paths often become invalid due to changing edge weights. Furthermore, PYen tackles the unique challenge of reusing previously identified shortest paths in parallel deviation path computation, a departure from the sequential processing in Feng’s algorithm [29].

Challenge: In Yen’s algorithm, each deviation vertex must find a loop-free spur path by avoiding specific impassable edges. To achieve this, each deviation vertex corresponds to a residual subgraph, created by excluding impassable edges from the original subgraph, facilitating the computation of the corresponding spur path. However, disparities in the shortest paths between the same vertices in different residual subgraphs hinder straightforward reuse. Specifically, for any two deviation vertices, the one with more impassable edges involves a residual subgraph nested within the other’s residual subgraph. Due to parallel deviation path computation, the spur path of the vertex with more impassable edges is often computed before the other. Consequently, the shortest paths computed in its subgraph cannot be directly applied to the other, as they tend to be longer in the former’s subgraph.

Solution: To tackle this challenge, we index only those shortest paths that are entirely consistent in both the residual subgraph they are computed and the original subgraph. Such paths inherently cannot be shorter within any other subgraph version and can thus be used for computing the spur path from any deviation vertex, provided it lacks impassable edges related to the spur path. To facilitate this approach, PYen maintains a lightweight index consisting of a distance array 𝒜D\mathcal{A}_{D} and a path array 𝒜P\mathcal{A}_{P}. These arrays store identified shortest distances and paths from deviation vertices to vtv_{t}, with sizes matching the vertex count of the subgraph. We sort the vertices in the subgraph based on their identifiers in ascending order and arrange the items in 𝒜D\mathcal{A}_{D} and 𝒜P\mathcal{A}_{P} so that each item corresponds to an vertex in the same order. For a vertex viv_{i} (vivtv_{i}\neq v_{t}) in the subgraph, we let 𝒜D[vi]\mathcal{A}_{D}[v_{i}] and 𝒜P[vi]\mathcal{A}_{P}[v_{i}] represent the values of viv_{i} in 𝒜D\mathcal{A}_{D} and 𝒜P\mathcal{A}_{P} respectively, where 𝒜D[vi]\mathcal{A}_{D}[v_{i}] is the shortest distance from viv_{i} to vtv_{t}, while 𝒜P[vi]\mathcal{A}_{P}[v_{i}] keeps the identifier of the vertex next to viv_{i} in the shortest path from viv_{i} to vtv_{t}.

All values in 𝒜D\mathcal{A}_{D} and 𝒜P\mathcal{A}_{P} are initialized to \infty and null respectively. Once the shortest path P(i,t)P(i,t) from viv_{i} to vtv_{t} is identified and identical with the shortest path from viv_{i} to vtv_{t} in SGSG, the shortest distances from all vertices in P(i,t)P(i,t) to vtv_{t} are determined and stored in 𝒜D\mathcal{A}_{D}. Meanwhile, P(i,t)P(i,t) is inserted into 𝒜P\mathcal{A}_{P}. With 𝒜D\mathcal{A}_{D} and 𝒜P\mathcal{A}_{P}, an search instance first accesses 𝒜D[vh]\mathcal{A}_{D}[v_{h}] to detect whether the shortest distance SD(vh,vt)SD(v_{h},v_{t}) from vhv_{h} to vtv_{t} is known when visiting a vertex vhv_{h} for computing a spur partial path. If SD(vh,vt)SD(v_{h},v_{t}) has been previously determined (i.e., not \infty), we can easily obtain the shortest path from vhv_{h} to vtv_{t}, P(h,t)P(h,t), from 𝒜P\mathcal{A}_{P} by sequentially obtaining the identifier of the next vertex in P(h,t)P(h,t) from vhv_{h}. If P(h,t)P(h,t) contains no impassable edges, it can be readily used to expedite the spur path computation; otherwise, the shortest path from vhv_{h} to vtv_{t} needs to be recomputed.

Early termination of unpromising deviation paths. Deviation paths that are guaranteed not to be the partial KSPs are pruned early in PYen. Once the ithi^{th} shortest path Pi(s,t)P_{i}(s,t) is found and (ki)(k-i) shortest paths remain to be computed, we only keep track of the already identified (ki)(k-i) shortest deviation paths. We use the (ki)th(k-i)^{th} deviation path (PkiP^{k-i}) as a reference to prune deviation paths that show no promise. Specifically, if the current distance of a deviation path being explored for Pi(s,t)P_{i}(s,t) exceeds that of PkiP^{k-i} before it is fully generated, we safely discard it. In contrast, deviation paths with a smaller distance than PkiP^{k-i} are utilized to update the maintained (ki)(k-i) shortest deviation paths. After all deviation paths of Pi(s,t)P_{i}(s,t) have been processed in this way, the shortest one among the maintained (ki)(k-i) deviation paths is the next shortest path Pi+1(s,t)P_{i+1}(s,t).

5.4 Analysis of KSP-DG

Correctness. The KSP-DG algorithm is provably correct, as shown below.

Lemma 2.

Let Piλ(s,t){P^{\lambda}_{i}}(s,t) be the ithi^{th} reference path from vsv_{s} to vtv_{t} and 𝒞i\mathcal{C}_{i} be the set of candidate KSPs w.r.t. Piλ(s,t){P^{\lambda}_{i}}(s,t). We have Pi(s,t)𝒞i\forall P_{i}(s,t)\in\mathcal{C}_{i}, D(Piλ(s,t))D(P^{\lambda}_{i}(s,t)) \leq D(Pi(s,t))D(P_{i}(s,t)).

Proof.

Let 𝒮i\mathcal{S}_{i} denote the sequence of boundary vertices on Piλ(s,t){P^{\lambda}_{i}}(s,t) in graph GG, where vsv_{s} and vtv_{t} are viewed as boundary vertices. Since Pi(s,t)𝒞iP_{i}(s,t)\in\mathcal{C}_{i}, the sequence of boundary vertices on Pi(s,t)P_{i}(s,t) is the same as 𝒮i\mathcal{S}_{i} in graph GG. For any two adjacent boundary vertices vjv_{j} and vj+1v_{j+1} in 𝒮i\mathcal{S}_{i}, based on Lemma 1, we infer that the shortest distance between vjv_{j} and vj+1v_{j+1} in skeleton graph GλG_{\lambda} is not greater than their shortest distance in graph GG. Accumulating the shortest distances of all pairs of adjacent vertices in 𝒮i\mathcal{S}_{i} on GλG_{\lambda} and GG, we have D(Piλ(s,t))D(P^{\lambda}_{i}(s,t)) \leq D(Pi(s,t))D(P_{i}(s,t)). ∎

Theorem 3.

In the ithi^{th} (i1i\geq 1) iteration of KSP-DG, if D(Pk)D(Pi+1λ(s,t))D(P_{k})\leq D({P^{\lambda}_{i+1}}(s,t)), where PkP_{k} is the kthk^{th} path in \mathcal{L}, then the paths in \mathcal{L} are KSPs from vsv_{s} to vtv_{t} in GG, .

Proof.

Suppose for the sake of contradiction that there exists a path Pf(s,t){P_{f}}(s,t) with a smaller distance than PkP_{k} and not in \mathcal{L}. Pf(s,t){P_{f}}(s,t) is thus not a candidate shortest path generated based on any reference path Pjλ(s,t){P^{\lambda}_{j}}(s,t) (j[1,i]j\in[1,i]). Therefore, there must be another reference path Pfλ(s,t){P^{\lambda}_{f}}(s,t) that matches Pf(s,t){P_{f}}(s,t) and D(Pfλ(s,t))D(Pi+1λ(s,t))D({P^{\lambda}_{f}}(s,t))\geq D({P^{\lambda}_{i+1}}(s,t)). Based on the given condition D(Pk)D(Pi+1λ(s,t))D(P_{k})\leq D({P^{\lambda}_{i+1}}(s,t)), we can infer that D(Pk)D(Pfλ(s,t))D(P_{k})\leq D({P^{\lambda}_{f}}(s,t)). Because D(Pfλ(s,t))D(Pf(s,t))D({P^{\lambda}_{f}}(s,t))\leq D({P_{f}}(s,t)) follows from Lemma 2, we have D(Pk)D(Pf(s,t))D(P_{k})\leq D({P_{f}}(s,t)), which contradicts the initial assumption. ∎

Computation Cost. KSP-DG comprises two primary operations: (1) identifying reference paths and (2) computing candidate KSPs for each reference path. In Operation (1), the worker computes a total of rr reference paths with a time complexity of O(r(|λ|+|𝒱λ|)log|𝒱λ|)O(r(|\mathcal{E}_{\lambda}|+|\mathcal{V}_{\lambda}|)\log{|\mathcal{V}_{\lambda}|}). Here, |𝒱λ||\mathcal{V}_{\lambda}| and |λ||\mathcal{E}_{\lambda}| refer to the vertex and edge counts in the skeleton graph GλG_{\lambda}. In Operation (2), during each KSP-DG iteration, partial KSPs are computed between adjacent boundary vertices in PλP_{\lambda}. As such, there are |Pλ|1{|P_{\lambda}|}-1 pairs of adjacent boundary vertices to process in each iteration, resulting in a computation complexity of O(k|Pλ|(|𝒱A|+|A|)log|𝒱A|)O(k\cdot{|P_{\lambda}|}\cdot(|\mathcal{V}_{A}|+|\mathcal{E}_{A}|)\log{|\mathcal{V}_{A}|}) per iteration, where |𝒱A||\mathcal{V}_{A}| and |A||\mathcal{E}_{A}| represent the maximum number of vertices and edges of each relevant subgraph. Given that this operation is distributed across ww workers and executed rr times, the overall computation cost for KSP-DG is expressed as: O(r(|λ|+|𝒱λ|)log|𝒱λ|)O(r(|\mathcal{E}_{\lambda}|+|\mathcal{V}_{\lambda}|)\log{|\mathcal{V}_{\lambda}|}) + O(kr(|𝒱A|+|A|)log|𝒱A||Pλ|wO(k\cdot r\cdot(|\mathcal{V}_{A}|+|\mathcal{E}_{A}|)\log{|\mathcal{V}_{A}|}\cdot\left\lceil\frac{|P_{\lambda}|}{w}\right\rceil).

6 Experiments

6.1 Implementation of KSP-DG on Storm

We implement KSP-DG on Apache Storm[30], a popular distributed stream processing framework, to evaluate its performance. Following the Storm paradigm, KSP-DG is designed as a “topology” of a directed acyclic graph with Spouts and Bolts as nodes of this graph. In Storm, spouts and bolts, as logical processors, are connected with streams, where each stream is an unbounded sequence of tuples containing the data to be processed. A spout acts as a source of streams in a topology, and every bolt processes the tuples according to a user-specified logic.

In the KSP-DG topology, only one Spout is set to receive edge weight updates and new KSP queries. This Spout maintains the partitioning of subgraphs to assist itself in sending edge weight update directly to corresponding subgraphs. SubgraphBolts are responsible for maintaining a set of subgraphs, while QueryBolts are in charge of executing KSP queries, where each query is processed by one unique QueryBolt.

Refer to caption
Figure 12: Framework for Deploying KSP-DG on Storm

Figure 12 presents the procedure of processing q(vs,vt)q(v_{s},v_{t}) by KSP-DG on Storm. When query qq arrives at the Spout, we first detect whether vsv_{s} and vtv_{t} are boundary vertices. If not, they will be processed by Step 1 first; otherwise, qq will be directly processed by Step 2.

Step 1 : The Spout sends vsv_{s} and vtv_{t} to the SubgraphBolt(s) that maintains the subgraph(s) covering vsv_{s} and vtv_{t} respectively. Then each of the relevant SubgraphBolt(s) computes the lower bound distances from vsv_{s} and vtv_{t} to each boundary vertex in the subgraph(s) covering vsv_{s} and vtv_{t} respectively. Next, those identified lower bound distances as well as vsv_{s} and vtv_{t} are delivered to every QueryBolt, which inserts vsv_{s} and vtv_{t} into the skeleton graph GλG_{\lambda} using the principle discussed in Section  5.

Step 2 : The Spout assigns qq to a QueryBolt QBiQB_{i}, which computes a reference path PλP^{\lambda} for this query, and then broadcasts PλP^{\lambda} and qq to all SubgraphBolts. Next, every SubgraphBolt identifies the subgraphs within those it maintains that cover a pair of two adjacent vertices in PλP^{\lambda}, and then generates partial KSPs between each pair of vertices from each such subgraph. These partial KSPs and qq are then returned to QBi{QB}_{i}. After receiving all partial KSPs related to PλP^{\lambda}, QBi{QB}_{i} joins the received partial paths to generate candidate KSPs corresponding to PλP^{\lambda}. These candidate KSPs are used to update the list of paths obtained so far. When the identified KSPs satisfy the terminating condition, QBi{QB}_{i} outputs the final result; otherwise, it generates the next reference path and continues onto the next iteration.

6.2 Experiment Setup and Datasets

The system is deployed on a cluster of 10 servers from a public cloud service provider, and each server has a quadcore CPU of 2.5GHz and 32GB memory. These servers are connected via Ethernet with a bandwidth of 1Gbps. We use as the datasets four real road networks with travel times from New York, Colorado, Florida, and Central USA [31], which are directed graphs, denoted by NY, COL, FLA, and CUSA respectively. The number of vertices and edges in these graphs are given in Table I, along with the number of subgraphs and skeleton graphs that can be obtained when zz (the maximum subgraph size) takes on their typical values. Moreover, we employ 20 hash functions with in the form of hi(r)=(air+1)modch_{i}(r)=(a_{i}\cdot r+1)\bmod c (i[1,20]i\in[1,20]) in LSH to partition EBP-II and the number of bands bb is set to 2 as discussed in Section 4.2.1. Notably, cc is the maximum prime number based on the the number of rows in PE-Matrix, and aia_{i} sequentially takes prime numbers ranging from 2 to 71, totaling 20 prime numbers.

TABLE I: Statistics on the Road Network Datasets
road network #vertices #edges z #subgraphs (nb>n_{b}>5) #vertices of GλG_{\lambda}
NY 264,346 733,846 200 4,173 (1,586) 24461
COL 435,666 1,057,066 200 8,001 (2,004) 27,665
FLA 1,070,376 2,712,798 500 13,701 (3,682) 52,640
CUSA 14,081,816 34,292,496 1000 121,725 (18,251) 514,618

We use the (normalized) travel time on each road in the road networks as the edge weights in the graphs. Since only one snapshot of the travel times in each road network is given in the dataset, we adopt a well-established model [32] to dynamically vary the travel time in each road to simulate real-world traffic conditions. We use α\alpha to represent the percentage of edges that change weights at each snapshot, and [τ,τ][-\tau,\tau] to denote the range of weight variation. We apply identical changes to the weights of the two edges in the opposite direction between a pair of vertices to simulate varying undirected graphs; for CUSA, we also experiment with the case where the weights of the opposite edges change independently to simulate varying directed graphs. All results shown are the average of 20 runs on the cluster of 10 servers unless otherwise specified.

6.3 Evaluation of DTLP

We evaluate the influence of parameters zz, ξ\xi, α\alpha, and τ\tau on the performance of DTLP, which are summarized in Table II. The subgraph size zz significantly impacts the scale of the skeleton graph |Gλ|{|G^{\lambda}|}, the number of relevant subgraphs for each query (rr), and the computation cost of partial KSPs within a relevant subgraph (trt_{r}). Ideally, zz should be relatively small to ensure that rr for each query exceeds the number of distributed workers (ww), facilitating effective parallel computation of partial KSPs and reducing trt_{r}. However, setting zz excessively small is unwise. An overly small zz increases |Gλ||G^{\lambda}|, elevates the cost of computing the reference path based on GλG^{\lambda}, and results in an excess of relevant subgraphs for each query, exceeding ww and inflating scheduling costs. In light of these considerations, it is advisable to set zz to be three or four orders of magnitude smaller than the graph size, particularly for graphs containing hundreds of thousands of vertices.

TABLE II: Summary of Parameters Used in Evaluation
Parameters Meaning Default value
zz size of subgraph 100 (NY), 200 (COL) 400 (FLA), 1000 (CUSA)
ξ\xi
number of bounding paths between
a pair of boundary vertices
10
α\alpha percentage of edges changing weights at each snapshot 50%
τ\tau range of edge weight variation 50%
Refer to caption
(a) On NY
Refer to caption
(b) On COL
Refer to caption
(c) On FLA
Refer to caption
(d) on CUSA
Refer to caption
(e) MPTree v.s. EBP-II on Memory
Figure 13: Construction Cost (Building Time and Memory Consumption)
Refer to caption
(a) Varying Graph Size
Refer to caption
(b) Varying ξ\xi
Refer to caption
(c) Varying α\alpha
Refer to caption
(d) Varying zz
Refer to caption
(e) Maintenance Comparison
Figure 14: Maintenance Cost

6.3.1 Construction Cost

Figures 15(a)-15(d) depict the time and memory usage to build DTLP for NY, COL, FLA, and CUSA varying the values of zz. Building time first decreases and then increases as zz grows. This is because the number of subgraphs is reduced when the road network is partitioned into larger subgraphs, resulting in less subgraphs being assigned to each server, so the total building time declines. When zz grows beyond a certain threshold (e.g., zz=100 for NY), however, this decrease is outweighed by the increase in the number, as well as, the average length of bounding paths in each subgraph. For the same reason, memory consumption by MPTree and the skeleton graph shows a trend similar to that of building time. Moreover, we compare the time for building DTLP in the directed and undirected graphs using CUSA. Fig. 15(d) shows that the building time for the directed graph is double that of the undirected graph, as two sets of bounding paths (in opposite directions) for each pair of boundary vertices have to be computed in directed graphs. For the same reason, the maintenance cost of DTLP for directed graphs is also almost doubled, as shown in Fig. 14(d). Fig. 15(e) depicts the memory consumption comparison of MPTree and EBP-II on NY. MPTree consumes significantly less memory than EBP-II due to the compaction of duplicate bounding paths, while its maintenance cost is merely a little greater than that of EBP-II, as shown in Fig. 14(e).

We further evaluate the influence of the size of a graph on the building time of DTLP. For this purpose, we choose five subgraphs from COL with 50K, 100K, 150K, 200K, and 250K vertices respectively, and use NgN_{g} to denote their sizes. We measure the time for building DTLP for these selected graphs, and the results are shown in Fig. 14(a) (left vertical axis). Apparently, the construction cost of DTLP increases almost linearly with the size of the graph, as the cost of computing bounding paths is roughly proportional to NgN_{g}.

6.3.2 Maintenance Cost

We study the trend of the maintenance cost of DTLP on graphs of varying sizes. We change the weights of half of the edges in each graph. Therefore, the number of varying weights is directly proportional to NgN_{g}, the size of the graph examined. As shown in Fig. 14(a) (right vertical axis), there is an approximately linear ascending trend in the maintenance time with the size of a graph, as the maintenance time is heavily affected by the number of the varying weights which is proportional to NgN_{g}.

We further evaluate the time required to update DTLP with different values of α\alpha and ξ\xi, and the results are shown in Fig. 14(b) and Fig. 14(c) respectively. In these two groups of experiments, all weight updates are fed into the system as a batch, and the maintenance time of DTLP is measured as the time between receiving the new weights and finishing updating the DTLP index. Fig. 14(b) shows an ascending trend of the maintenance cost w.r.t. ξ\xi, as larger values of ξ\xi cause more bounding paths with bound distances to be updated, resulting in higher maintenance costs. However, the rate of growth slows down when ξ\xi exceeds a certain value (e.g., ξ\xi=15 in FLA) because the number of bounding paths in some subgraphs stops increasing when ξ\xi is large enough. Fig. 14(c) shows an ascending trend with increasing α\alpha, as more weights need to be processed.

6.4 Evaluation of KSP-DG

Our next task is to study the impact of different parameters (zz, α\alpha, ξ\xi, and kk) on the performance of KSP-DG. The default value of kk is 2, but we vary it as needed for our evaluation.

Refer to caption
(a) On NY
Refer to caption
(b) On COL
Refer to caption
(c) On FLA
Refer to caption
(d) on CUSA
Refer to caption
(e) MPTree v.s. EBP-II on Memory
Figure 15: Construction Cost (Building Time and Memory Consumption)
Refer to caption
(a) Processing Time w.r.t. zz (on NY)
Refer to caption
(b) Processing Time w.r.t. zz (on CUSA)
Refer to caption
(c) Processing Time w.r.t. NqN_{q}
Refer to caption
(d) Processing Time w.r.t. ξ\xi in NY
Refer to caption
(e) Processing Time w.r.t. τ\tau in NY
Figure 16: Influence of parameters on Processing Time
Refer to caption
(a) Comparison on NY
Refer to caption
(b) Comparison on COL
Refer to caption
(c) Comparison on FLA
Refer to caption
(d) Comparison on CUSA
Refer to caption
(e) Comparison w.r.t. k on FLA
Figure 17: Processing Time Comparison with Baselines

6.4.1 Number of Iterations

The number of iterations required by KSP-DG with varying values of ξ\xi, α\alpha, kk, and τ\tau are shown in Figures LABEL:iterations-xi-LABEL:iterations-varying-rate. kk is set to 50 for effective measuring the the number of iterations as the variation is more apparent when kk takes larger values. Fig. LABEL:iterations-xi shows that the number of iterations significantly decreases with increasing ξ\xi, which is as expected, because when more bounding paths are indexed, it narrows the gap between the lower bound distance and the actual shortest distance for the given pair of vertices and thus reduces the number of iterations needed. However, as a higher construction and maintenance cost of DTLP is associated with a larger ξ\xi, the value of ξ\xi has to be chosen in a way that balances the processing time of KSP-DG and the construction/maintenance time of DTLP.

As shown in Fig. LABEL:iterations-varying-range, the number of iterations increases with growing τ\tau (the varying range of weights). The reason is that, greater variation in the weights would loosen the lower bound distance and thus weaken the pruning power of the skeleton graph GλG_{\lambda}. Moreover, greater values of k would incur more iterations in KSP-DG, as shown in Fig. LABEL:iterations-k. But the good news is that the rate of increase is very small when k is not large (i.e., k<30k<30), which should be sufficient for most applications. The influence of α\alpha, as shown in Fig. LABEL:iterations-varying-rate, differs from one dataset to another, implying that its effect may depend on the particular distribution of edges with varying weights. However, it is apparent that the numbers of iterations for all the datasets are small when the weights of the graph are not changing dramatically (i.e., α<30%\alpha<30\%).

TABLE III: Number of Vertices in Skeleton Graph GλG_{\lambda} with Varying zz
NY,COL FLA z=100 z=350 z=150 z=400 z=200 z=450 z=250 z=500 z=300 z=550
GλG_{\lambda} (NY) 32,534 27,668 24,461 22,604 20,775
GλG_{\lambda} (COL) 36,831 30,886 27,655 25,329 23,271
GλG_{\lambda} (FLA) 60,125 57,085 54,695 52,640 50,411

6.4.2 Query Processing Time

Figures 16(a)-16(b) depict the influence of parameters zz and kk on the query processing time using KSP-DG. In this group of experiments, we randomly generate 1,000 queries (Nq=1,000N_{q}=1,000), feed them into the system simultaneously, and measure the total processing time of all the queries. As can be observed from the plots, the processing time first decreases and then increases as zz grows. This is because as zz increases, the number of subgraphs decreases, and so does the number of boundary vertices, which in turn leads to a smaller skeleton graph GλG_{\lambda} (shown in Table III). Roughly speaking, a smaller GλG_{\lambda} means fewer vertices in a reference path and fewer subgraphs to be explored in each iteration. Moreover, the cost of generating partial k shortest paths within each subgraph grows very slowly when the size of the subgraph is small. Therefore, the processing time decreases as zz increases but is still small. However, when zz grows beyond a certain threshold (e.g.,  200 for k=2k=2 on NY), the cost of computing partial k shortest paths in subgraphs increases significantly and dominates the overall cost. Consequently, the processing time starts to grow as zz becomes sufficiently large. In the following discussion, we set the value of zz to 200, 200, 500, and 1000 in NY, COL, FLA, and CUSA respectively, unless otherwise specified.

From each figure we also observe that the processing time of KSP-DG increases linearly with kk, which is because larger values of kk lead to more candidate k shortest paths being generated in each iteration. Moreover, the number of iterations also increases with k, as shown in Figure LABEL:iterations-k.

The scalability of KSP-DG w.r.t. the number of concurrent queries is evaluated, and the results are shown in Fig. 16(c). We generate multiple batches of queries with different batch sizes (number of queries), and feed each batch into the system to measure the total processing time. From the curves in Fig. 16(c), we observe that running time of KSP-DG increases approximately linearly w.r.t. the number of queries with a low rate of growth, benefiting from its distributed paradigm.

Figures 16(d)-16(e) present the impact of ξ\xi and τ\tau on the running time of KSP-DG. We only display the performance on NY; similar trends are observed on the other datasets. Fig. 16(d) shows that the running time decreases with increasing ξ\xi. This is because a larger ξ\xi leads to a smaller number of iterations, as demonstrated in Fig. LABEL:iterations-xi. This trend is more apparent when k takes larger values as more iterations are required by KSP-DG when kk is large (as shown in Fig. LABEL:iterations-k). The relationship of τ\tau and the processing time is evaluated in Fig. 16(e), where the processing time slowly increases when τ\tau grows, as a larger τ\tau leads to a greater number of iterations (Fig. LABEL:iterations-varying-range).

6.5 Comparison with Baseline Algorithms

We compare KSP-DG to FindKSP [5], Para-Yen[29], and Yen’s algorithm [6] on scalability with respect to the number of queries and kk. Since FindKSP, Para-Yen and Yen’s algorithm are centralized, we run these three algorithms on every server individually and then distribute all queries to the adopted servers randomly for fair comparison. To evaluate PYen’s effectiveness, we included two KSP-DG versions, namely KSP-DG-Yen and Para-KSP-DG, which employ Yen [6] and Para-Yen [29] for computing partial KSPs in place of PYen respectively, serving as baseline comparisons. We refer to KSP-DG, KSP-DG-Yen, and Para-KSP-DG collectively as “KSP-DG-based algorithms”.

Figures 17(a)-17(d) show the scalability comparison of the all algorithms when processing an increasing number of queries in various graphs. KSP-DG-based algorithms outperform the three centralized algorithms by exhibiting a lower rate of processing time growth. Their advantage stems from the ability to decompose the KSP search problem into smaller parallel procedures, which is challenging for the sequential processing required by FindKSP, Para-Yen, and Yen’s algorithm. The performance advantages of KSP-DG-based algorithms are particularly evident in large graphs (e.g., CUSA), making them highly suitable for large graph scenarios with substantial concurrent queries.

Moreover, KSP-DG outperforms KSP-DG-Yen and Para-KSP-Yen on each dataset due to the effectiveness of PYen in computing partial KSPs. This advantage becomes more remarkable as kk increases, as depicted by Fig.17(e). This is because PYen’s optimizations, especially the reuse of previous computations, become more effective with higher kk values. In contrast, Para-Yen’s approach of parallelizing single shortest path computations for each deviation path within KSP-DG’s resource-intensive processes fails to expedite partial KSP computation and instead adds scheduling overhead, especially noticeable as kk grows.

Refer to caption
(a) Building Time w.r.t. NsN_{s}
Refer to caption
(b) Processing Time w.r.t. NsN_{s}
Refer to caption
(c) Processing Time w.r.t. k
Refer to caption
(d) Scalability Comparison
Refer to caption
(e) Relative Speedups
Figure 18: Scalability

6.6 Scaling-out

To further evaluate the horizontal scalability of DTLP and the search algorithms, we extend the cluster to 20 servers and the results are shown in Figures 18(a)-18(e). As is clear from Fig. 18(a), the building time of DTLP decreases when more servers are introduced, as DTLP can distribute the load to all servers, and thus is able to utilize the power of a cluster.

Fig. 18(b) demonstrates the performance of KSP-DG with a varying number of servers. We feed a batch of 1,000 queries into KSP-DG with a different number of servers employed. In each figure, we observe a notable reduction of processing cost when more servers are used, which testifies to the horizontal scalability of KSP-DG. We also feed KSP-DG with 1,000 queries with different values of kk to further validate its scalability by varying the number of servers, and the results are illustrated in Fig. 18(c). It is observed that the running time of KSP-DG significantly decreases with more servers being used, regardless of the value of kk.

Fig. 18(d) depicts the scalability of the six algorithms when processing the same group of queries on a different number of servers. The results show that the three KSP-DG based algorithms always outperform FindKSP, Para-Yen and Yen’s algorithm. The relative speedups of the three algorithms with a varying number of servers are shown in Fig. 18(e). It can be observed that the relative speedup of each algorithm grows linearly with the number of servers.

7 Related work

Centralized KSP Algorithms. Yen’s algorithm [6] identifies KSPs based on a deviation paradigm, which first computes the shortest path, and then generates all candidate paths that deviate from this path by applying Dijkstra’s algorithm repeatedly, from which the shortest is selected as the next shortest path. It repeats the above steps until KSPs have been determined. Some methods [34, 9, 35, 8, 10, 11, 5, 28] are proposed to further optimize the generation of candidate paths. [9, 34, 35] partition the candidate shortest paths into equivalence classes and safely prune specific classes that are impossible to contain k shortest paths. Others adopt the Shortest Path Tree (SPT for short) [8, 10, 11, 29, 35, 5, 36] to help identify candidate shortest paths, where SPT maintains the shortest paths from all vertices to the terminal vertex.

All of the above proposals suffer from the following drawbacks if directly applied to our problem. First, most of them require access to the entire graph during their operation; as a result the only option is to replicate the entire dynamic graph on each server, allowing them to operate in a distributed fashion. This however has significant cost and scalability implications. Second, the majority of these algorithms adopt a sequential strategy that necessitates a search for the shortest paths one after the other, which limits their ability to handle many concurrent queries in a distributed setting. Finally, some of them require building a path index such as SPT [8, 10, 11] for every query, which is too heavy-weight for dynamic graphs as the index often become invalid due to varying weights.

Distributed SSP Algorithms. Past work focuses on identifying a Single Shortest Path (SSP) [15, 16, 17, 18, 37, 19, 20, 38] over a static graph in a distributed fashion. [15, 16, 17, 18] aim to determine the SSP in a communication network, which is distributed if each vertex represents a processor. Others investigate the distributed computation of the SSP on a cluster of servers [19, 20, 38]. Li et al. investigate the SSP in multimodal road network that includes buses, trains, cars, bikes, and so on. Qiu et al. [20] identify the SSP based on the principle of pruned landmark labeling [39], while Aridhi et al. [19] propose a distributed algorithm to identify an approximate shortest path in a large-scale network with MapReduce.

These distributed SSP algorithms suffer from the following problems when applied to our setting. First, they are designed for static graphs, and thus most index structures employed in those approaches such as k-shortcut hopset [17, 18, 37] and node labels [20] quickly become obsolete in a dynamic graph rendering them unusable. Second, all of them focus on the search for a single shortest path and its nontrivial to extend, to the case of finding KSPs.

SSP in Dynamic Graphs. Previous research has addressed SSP computation in dynamic graphs[22, 40]. Shang et al. investigated monitoring the shortest path in a dynamic graph, yet their centralized search algorithm, reliant on a road network expansion tree, proves challenging to deploy in a distributed setting. Yang et al. [22] propose a distributed algorithm CANDS to identify the SSP in a dynamic graph. It partitions the entire graph into subgraphs residing on different servers, and indexes the shortest path between any pair of boundary vertices within each subgraph. Although seemingly similar to our solution, this approach would not work well when directly applied to our problem. First, its search procedure is essentially sequential. For a given query, it starts from the subgraph covering the source vertex and iteratively expands to other subgraphs via the indexed shortest paths using Dijkstra’s algorithm until reaching the subgraph containing the destination vertex. The examined subgraphs are not known initially and they have to be explored in order. Second, the indexed shortest paths in CANDS require frequent re-computation in a dynamic graph, which can be quite expensive. Finally, CANDS is designed for identifying single shortest path, and it is non-trivial to adapt this paradigm to identify KSPs.

8 Conclusions and Future Work

We focus on the problem of identifying KSPs over road networks and propose a suite of distributed solutions. Consisting of the bounding paths and the skeleton graph, DTLP provides reference paths assisting the identification of relevant subgraphs that need to be explored to identify KSPs. Since the bounding paths compacted by G-MPTrees in DTLP do not change with varying weights, DTLP is light-weight for dynamic graphs with low maintenance cost. Based on DTLP, the KSP-DG algorithm is designed to run in a distributed setting. We have demonstrated through extensive experiments that our proposal significantly outperforms baseline methods across a variety of settings.

As future work, one may consider addressing variants of the problem studied in this paper. For example, a constrained version of the KSP query may require all shortest paths to pass through some designated vertices; another version may involve limiting the diversity of the shortest paths to below a certain threshold. These variants have important practical applications, and it is worthwhile to investigate their solutions in a distributed environment.

References

  • [1] S. Shang, L. Chen, Z. Wei, C. S. Jensen, J.-R. Wen, and P. Kalnis, “Collective travel planning in spatial networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 5, pp. 1132–1146, 2016.
  • [2] L. Chen, Y. Gao, Z. Fang, X. Miao, C. S. Jensen, and C. Guo, “Real-time distributed co-movement pattern detection on streaming trajectories,” Proceedings of the VLDB Endowment, vol. 12, no. 10, pp. 1208–1220, 2019.
  • [3] L. Chen, Q. Zhong, X. Xiao, Y. Gao, P. Jin, and C. S. Jensen, “Price-and-time-aware dynamic ridesharing,” in 2018 IEEE 34th international conference on data engineering (ICDE).   IEEE, 2018, pp. 1061–1072.
  • [4] Z. Yu, X. Yu, N. Koudas, Y. Liu, Y. Li, Y. Chen, and D. Yang, “Distributed processing of k shortest path queries over dynamic road networks,” in SIGMOD, 2020, pp. 665–679.
  • [5] H. Liu, C. Jin, B. Yang, and A. Zhou, “Finding top-k shortest paths with diversity,” IEEE Transactions on Knowledge and Data Engineering, vol. 30, no. 3, pp. 488–502, 2018.
  • [6] J. Y. Yen, “Finding the k shortest loopless paths in a network,” Management Science, vol. 17, no. 11, pp. 712–716, 1971.
  • [7] N. Katoh, T. Ibaraki, and H. Mine, “An efficient algorithm for k shortest simple paths,” Networks, vol. 12, no. 4, pp. 411–427, 1982.
  • [8] D. Eppstein, “Finding the k shortest paths,” SIAM Journal on computing, vol. 28, no. 2, pp. 652–673, 1998.
  • [9] J. Hershberger, M. Maxel, and S. Suri, “Finding the k shortest simple paths: A new algorithm and its implementation,” ACM Transactions on Algorithms, vol. 3, no. 4, pp. 45–es, 2007.
  • [10] J. Gao, H. Qiu, X. Jiang, T. Wang, and D. Yang, “Fast top-k simple shortest paths discovery in graphs,” in CIKM, 2010, pp. 509–518.
  • [11] J. Gao, J. Yu, H. Qiu, X. Jiang, T. Wang, and D. Yang, “Holistic top-k simple shortest path join in graphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 4, pp. 665–677, 2012.
  • [12] B. Zheng, C. Huang, C. S. Jensen, L. Chen, N. Q. V. Hung, G. Liu, G. Li, and K. Zheng, “Online trichromatic pickup and delivery scheduling in spatial crowdsourcing,” in ICDE.   IEEE, 2020, pp. 973–984.
  • [13] B. Zheng, K. Zheng, C. S. Jensen, N. Q. V. Hung, H. Su, G. Li, and X. Zhou, “Answering why-not group spatial keyword queries,” IEEE Trans. Knowl. Data Eng., vol. 32, no. 1, pp. 26–39, 2020.
  • [14] B. Zheng, L. Bi, J. Cao, H. Chai, J. Fang, L. Chen, Y. Gao, X. Zhou, and C. S. Jensen, “Speaknav: Voice-based route description language understanding for template driven path search,” Proc. VLDB Endow., vol. 14, no. 12, pp. 3056–3068, 2021.
  • [15] K. M. Chandy and J. Misra, “Distributed computation on graphs: Shortest path algorithms,” Programming Techniques and Data Structures, pp. 833–837, 1982.
  • [16] B. Awerbuch, “Randomized distributed shortest paths algorithms,” in STOC, 1989, pp. 490–500.
  • [17] M. Elkin, “Distributed exact shortest paths in sublinear time,” in STOC, 2017, pp. 757–770.
  • [18] M. Ghaffari and J. Li, “Improved distributed algorithms for exact shortest paths,” in STOC, 2018, p. 431–444.
  • [19] S. Aridhi, P. Lacomme, L. Ren, and B. Vincent, “A mapreduce-based approach for shortest path problem in large-scale networks,” Engineering Applications of Artificial Intelligence, vol. 41, pp. 151–165, 2015.
  • [20] K. Qiu, Y. Zhu, J. Yuan, J. Zhao, X. Wang, and T. Wolf, “Parapll: Fast parallel shortest-path distance query on large-scale weighted graphs,” in ICPP, 2018, pp. 1–10.
  • [21] W. Fan, X. Wang, and Y. Wu, “Performance guarantees for distributed reachability queries,” in VLDB, 2012, pp. 1304–1315.
  • [22] D. Yang, D. Zhang, K.-L. Tan, J. Cao, and F. Le Mouël, “Cands: continuous optimal navigation via distributed stream processing,” in VLDB, 2014, pp. 137–148.
  • [23] S. Shang, L. Chen, Z. Wei, C. S. Jensen, K. Zheng, and P. Kalnis, “Trajectory similarity join in spatial networks,” in VLDB, 2017, pp. 1178–1189.
  • [24] B. Yao, W. Zhang, Z.-J. Wang, Z. Chen, S. Shang, K. Zheng, and M. Guo, “Distributed in-memory analytics for big temporal data,” in DASFAA, 2018.
  • [25] L. Chen, S. Shang, C. S. Jensen, B. Yao, and P. Kalnis, “Parallel semantic trajectory similarity join,” in ICDE, 2020, pp. 997–1008.
  • [26] R. Anand and U. Jeffrey David, Mining of massive datasets.   Cambridge university press, 2011.
  • [27] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets.   USA: Cambridge University Press, 2014.
  • [28] D. Ajwani, E. Duriakova, N. Hurley, U. Meyer, and A. Schickedanz, “An empirical comparison of k-shortest simple path algorithms on multicores,” in ICPP, 2018.
  • [29] G. Feng, “Finding k shortest simple paths in directed graphs: A node classification algorithm,” Networks, pp. 6–17, 2014.
  • [30] “Apache storm,” http://storm.apache.org/, 2019.
  • [31] DIMACS, “http://users.diag.uniroma1.it/challenge9,” 2005.
  • [32] F. Bernhard, G. Martin, and G. Stefan, “Time-varying travel times in vehicle routing,” Transportation Science, vol. 38, no. 2, pp. 121–255, 2004.
  • [33] B. Hannah, D. Daniel, G. Andrew, M.-H. Matthias, P. Thomas, S. Peter, W. Dorothea, and R. F. Werneck, “Route planning in transportation networks,” Algorithm Engineering, pp. 19–80, 2016.
  • [34] J. Hershberger and S. Suri, “Vickrey prices and shortest paths: What is an edge worth?” in IEEE International Conference on Cluster Computing, 2001, pp. 252–259.
  • [35] L. Chang, X. Lin, L. Qin, J. X. Yu, and J. Pei, “Efficiently computing top-k shortest path join,” in EDBT, 2015, pp. 133–144.
  • [36] Z. Luo, L. Li, M. Zhang, W. Hua, Y. Xu, and X. Zhou, “Diversified top-k route planning in road network,” in VLDB, 2022, pp. 3199–3212.
  • [37] S. Forster and D. Nanongkai, “A faster distributed single-source shortest paths algorithm,” in IEEE 59th Annual Symposium on Foundations of Computer Science, 2018, pp. 686–697.
  • [38] Y. Li, Y. Yuan, Y. Wang, X. Lian, Y. Ma, and G. Wang, “Distributed multimodal path queries,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 7, pp. 3196–3210, 2022.
  • [39] T. Akiba, Y. Iwata, and Y. Yoshida, “Fast exact shortest-path distance queries on large networks by pruned landmark labeling,” in SIGMOD, 2013, pp. 349–360.
  • [40] S. Shang, L. Chen, Z.-W. Wei, D.-H. Guo, and J.-R. Wen, “Dynamic shortest path monitoring in spatial networks,” Journal of Computer Science and Technology, vol. 31, pp. 637–648, 2016.

Acknowledgments

This work was supported by NSFC Grants [No.62172351] and NSERC [RGPIN-2022-04623].

[Uncaptioned image] Ziqiang Yu received his Ph.D. degree in computer science in 2015 from Shandong University, China. He is currently an associate professor in the School of Computer and Control Engineering, Yantai University, China. His research interests mainly focus on spatial-temporal data processing and graph computing.
[Uncaptioned image] Xiaohui Yu received his Ph.D. degree in computer science in 2006 from the University of Toronto, Canada. He is currently an associate professor in the School of Information Technology, York University, Toronto. His research interests are in the areas of database systems and data mining.
[Uncaptioned image] Nick Koudas received his PhD degree from the University of Toronto. He is currently a professor at the department of computer science at the university of Toronto. His research interests are in the areas of data systems, analysis of big data, data science and applied machine learning.
[Uncaptioned image] Yueting Chen received his Master degree in computer science in 2017 from Shandong University, China. He is now a Ph.D. candidate in Computer Science at York university starting in 2017, supervised by Xiaohui Yu. His research interests mainly focus on databases and data management systems.
[Uncaptioned image] Yang Liu received her Ph.D. degree in computer science and engineering in 2008 from York University, Canada. She is currently an associate professor in the Department of Physics and Computer Science, Wilfrid Laurier University, Canada. Her main areas of research are data mining and information retrieval.