DAWN: Matrix Operation-Optimized Algorithm for Shortest Paths Problem on Unweighted Graphs

Yelai Feng^{1, 2}, Huaixi Wang^{1, *}, Yining Zhu³, Xiandong Liu⁴, Hongyi Lu², Qing Liu⁵ fengyelai,[email protected] ¹ College of Electronic Engineering, National University of Defense Technology, Hefei, China
²College of Computer Science and Technology, National University of Defense Technology, Changsha, China
³Ningbo Institute of Technolog, Zhejiang University, Ningbo, China
⁴PerfXLab Technologies Co.,Ltd, Beijing, China
⁵College of Electrical and Computer Engineering, Technical University of Munich, Munich, Germany

(2024)

Abstract.

The shortest paths problem is a fundamental challenge in graph theory, with a broad range of potential applications. The algorithms based on matrix multiplication exhibits excellent parallelism and scalability, but is constrained by high memory consumption and algorithmic complexity. Traditional shortest paths algorithms are limited by priority queues, such as BFS and Dijkstra algorithm, making the improvement of their parallelism a focal issue. We propose a matrix operation-optimized algorithm, which offers improved parallelism, reduced time complexity, and lower memory consumption. The novel algorithm requires $O(E_{wcc}(i))$ and $O(S_{wcc}\cdot E_{wcc})$ times for single-source and all-pairs shortest paths problems, respectively, where $S_{wcc}$ and $E_{wcc}$ denote the number of nodes and edges included in the largest weakly connected component in graph. To evaluate the effectiveness of the novel algorithm, we tested it using graphs from SuiteSparse Matrix Collection and Gunrock benchmark dataset. Our algorithm outperformed the BFS implementations from Gunrock and GAP (the previous state-of-the-art solution), achieving an average speedup of 3.769 $\times$ and 9.410 $\times$ , respectively.

Graph Theory, Shortest Paths, Parallel Computing, Matrix Operation

^†^†journalyear: 2024^†^†copyright: acmlicensed^†^†conference: Proceedings of the 38th ACM International Conference on Supercomputing; June 4–7, 2024; Kyoto, Japan^†^†booktitle: Proceedings of the 38th ACM International Conference on Supercomputing (ICS ’24), June 4–7, 2024, Kyoto, Japan^†^†doi: 10.1145/3650200.3656600^†^†isbn: 979-8-4007-0610-3/24/06^†^†ccs: Computing methodologies Massively parallel algorithms^†^†ccs: Theory of computation Parallel algorithms

1. Introduction

The shortest paths problem, a fundamental problem in graph theory and network science, has garnered interest from researchers across various disciplines such as transportation planning, computer science, network science, and applied mathematics (Dreyfus, 1969; Bertsekas, 1998; Tarjan, 1983; Waissi, 1994; Bertsekas and Tsitsiklis, 1991). As the scale of the graph increases, serial algorithms struggle to adapt to changes, and prompting researchers to explore parallel computing as a solution to the shortest paths problem.

The state-of-the-art solution for SSSP (Single-Source Shortest Paths) problem is the BFS algorithm on unweighted graph and $\Delta$ -stepping Dijkstra’s algorithm on weighted graph (Meyer and Sanders, 2003), which have garnered significant attention (Chan, 2012; Davidson et al., 2014; Busato and Bombieri, 2015; Surve and Shah, 2017; Wang et al., 2019). Currently, there are several solutions available in the industry for rapidly computing the APSP (All-Pairs Shortest Paths) problem on large-scale clusters (sao et al., 2021; Sao et al., 2020).

Timothy M et al. proposed two novel APSP algorithms based on the BFS algorithm, which require $O(\frac{mn}{\log n})$ ( $m\gg n\log^{2}n$ ) and $O(\frac{mn\log\log n}{\log n}+\frac{n^{2}\log^{2}\log n}{\log n})$ times for all graphs (Chan, 2012), where $m$ and $n$ respectively represents the number of nodes and edges in graph. The APSP algorithms based on matrix multiplication, such as Seidel’s algorithm (Seidel, 1995), Galil and Margalit’s algorithm (Galil and Margalit, 1997), reduces the time complexity by divide-and-conquer strategy, these approaches require significant memory resources for maintaining intermediate matrices.

We introduce a novel algorithm named DAWN (Distance Assessment algorithm With matrix operations on Networks), which is based on matrix operation-optimizing. DAWN requires $O(m)$ space and $O(E_{wcc}(i))$ time on unweighted graphs for SSSP tasks. It is also capable of processing APSP tasks and requires $O(S_{wcc}\cdot E_{wcc})$ time. Here, $S_{wcc}$ and $E_{wcc}$ denote the number of nodes and edges included in the largest WCC (Weakly Connected Component) in the graphs, and $i$ is the source node of the SSSP task.

The main contributions of this work are as follows:

(1)

We propose a matrix operation-optimized algorithm, which requires $O(m)$ space and $O(E_{wcc}(i))$ times on the unweighted graphs for SSSP problem, respectively. In contrast to the prevalent optimization of state-of-the-art BFS implementation, which commonly rely on priority queues, our approach leverages matrix operations to endow DAWN with enhanced parallelism.
(2)

We propose BOVM (Boolean Vector-Matrix Operation) method, which make DAWN to require $O(\frac{\epsilon(i)}{2}m)$ time for SSSP tasks on unweighted graphs, where $\epsilon(i)$ is the eccentricity of node $i$ . Further, we propose an SOVM (Sparse Optimized Boolean Vector-Matrix Operation) method to significantly improve the performance of DAWN on sparse graphs, reducing the time requirements to $O(E_{wcc}(i))$ for SSSP tasks and $O(S_{wcc}\cdot E_{wcc})$ for APSP tasks.
(3)

DAWN achieves superior performance compared to Gunrock while utilizing fewer GPU memory resources. It successfully completes the SSSP task on graphs with 214 million nodes and 936 million edges using an RTX 3080TI, a task unattainable by Gunrock. Prior to DAWN, algorithms based on matrix multiplication used a divide-and-conquer strategy, such as Seidel’s algorithm (Seidel, 1995), which generated numerous intermediate matrices and required excessive memory.

In Section 2, we present an overview of the typical shortest paths algorithms. In Section 3, we describe the design of the DAWN and propose optimization methods to make it more widely applicable to various graphs. In Section 4, we conducted comparative experiments of several implementations across various platforms, which demonstrates the high efficiency of DAWN. In Section 5, we summarize the work in this paper and outline future research directions. Table 1 lists the notations used throughout the paper.

Table 1. Definition of notations

Notation	Definition
$A$	Adjacency matrix of unweighted graphs
$E_{wcc}$	Maximum edge count of the largest WCC
$S_{wcc}$	Maximum node count of the largest WCC
$E_{wcc}(i)$	Edge count of the largest WCC of node $i$
$S_{wcc}(i)$	Node count of the largest WCC of node $i$
$\epsilon(i)$	Eccentricity of node $i$
$\epsilon_{max}$	Maximal eccentricity of the graph
$n,m$	Number of nodes and edges in graph
$p$	Average connected probability
$CSR$	Compressed Sparse Row format
$CSC$	Compressed Sparse Column format

2. Related Works

The shortest paths problem is a classic problem in graph theory and network science. In this section, we introduce the typical algorithms for solving the SSSP and APSP problems. In addition, we will introduce several APSP algorithms based on matrix multiplication.

2.1. SSSP algorithm

Dijkstra’s algorithm is a common SSSP algorithm (Dijkstra et al., 1959), and the main optimization methods are priority queue of binary heap and Fibonacci heap (Johnson, 1977; Raman, 1997; Thorup, 2000; Fredman and Tarjan, 1987; Moret and Shapiro, 1992; Chen et al., 2007).

Meyer et al. proposed an optimized Dijkstra’s algorithm, which is a parallel version for a large class of graphs (Meyer and Sanders, 2003). The best parallel version of the $\Delta$ -stepping Dijkstra’s algorithm takes $O(D\cdot L\cdot\log n+\log^{2}n)$ time and $O(n+m+D\cdot L\cdot\log n)$ work on average, where $L$ denotes the maximum shortest paths weight from the source node ( $s$ ) to any node reachable from $s$ , and $D$ represents the maximum node degree (Meyer and Sanders, 2003).

2.2. BFS Algorithm

Scott Beamer et al. proposed a hybrid BFS algorithm that combines a conventional top-down approach with a novel bottom-up approach (Beamer et al., 2012). In the top-down approach, nodes within the active frontier seek unvisited child nodes, whereas in the bottom-up approach, unvisited nodes seek parents within the active frontier. Scott Beamer et al. further optimized the performance of direction-optimized BFS in various application scenarios (Beamer et al., 2013; Buluç et al., 2017), ultimately integrating it into GAP (Graph Algorithm Platform benchmark Suite). GAP is a portable high-performance baseline which includes representative implementations of state-of-the-art performance, and is intended to help graph processing research by standardizing evaluations(Beamer et al., 2015).

Julian Shun and Laxman Dhulipala et al. inspired by the direction-optimized BFS algorithm, achieve close to the same efficiency (time and space) as the optimized BFS of Beamer et. al. and using a much simpler implementation code (Shun and Blelloch, 2013; Shun et al., 2015). Further, they promoted widespread applications of this optimization approach and constructed a high-performance computing framework called GBBS (Graph Based Benchmark Suite) which based on the Ligra/Ligra+/Julienne graph processing frameworks (Shun et al., 2016; Dhulipala et al., 2017; Shun, 2020; Dhulipala et al., 2018), including betweenness centrality, graph radius estimation, graph connectivity, PageRank and single-source shortest paths etc..

2.3. Matrix algorithm

The shortest paths algorithm based on matrix squaring multiplication has received extensive attention (Farbey et al., 1967; Takaoka, 1998; Mulmuley and Shah, 2000). The matrix squaring multiplication algorithm reduces the number of matrix multiplications from $n$ to $\log n$ , but it requires storing many intermediate matrices. When the graph is large, the algorithm needs to consume a significant amount of space.

Seidel’s algorithm is the APSP algorithm based on matrix multiplication (Seidel, 1995), which is suitable for unweighted undirected graphs and time complexity is $O(\log n\cdot n^{\omega})$ . The Seidel’s algorithm requires numerous memory to maintain intermediate matrices, due to reduce the time complexity of GEMM (General Matrix-Matrix multiplication) via divide-and-conquer strategy (Strassen et al., 1969; Coppersmith and Winograd, 1987; Williams, 2012).

Arlazarov et al. proposed the APSP algorithm based on the boolean matrix multiplication on the unweighted graphs (Arlazarov et al., 1970), which has alleviated the issue of excessive memory requirements associated with matrix multiplication-based APSP algorithms to a limited extent.

3. Methods

In this section, we will introduce the DAWN and the technical details of BOVM and SOVM. In Section 3.1, we will illustrate the principle of DAWN. We will then discuss the optimization of boolean vector-matrix operation (BOVM) that enables DAWN’s efficiency in Section 3.2. Furthermore, we will expand on BOVM and introduce its extension to sparse matrices, known as sparse optimized boolean vector-matrix operation (SOVM), in Section 3.3. In Section 3.5, we use an example to demonstrate difference between BFS and DAWN.

3.1. Principle

DAWN relies on the result of matrix multiplication to assist in determining which edge visits can be skipped. However, matrix multiplication is a costly operation, requiring $O(n^{3})$ time and $O(n^{2})$ space.

Our main contribution is the simplification of matrix multiplication, which does not mean that we can compute matrix multiplication faster, but rather that we focus on only a portion of the results of matrix multiplication for the shortest path problem. Specifically, our new approach is able to determine which rows and columns of the matrix multiplication result have an impact on the shortest path problem.

Figure 1 illustrates the correspondence between Boolean matrix operations and shortest path discovery in a graph. The blue markers indicate the result vector, while the red markers indicate a particular row of the adjacency matrix.

Refer to caption — Figure 1. Example for the Shortest Paths Discovery in the Boolean Matrix

Lemma 3.1.

(Harary et al., 1965) In the matrix $A^{k}=\big{(}a_{i,j}^{(k)}\big{)}_{n\times n}$ , the element $a_{i,j}^{(k)}$ represents the number of paths with length $k$ from $v_{i}$ to $v_{j}$ .

Theorem 3.2.

In unweighted graphs, the length of the shortest path from $v_{i}$ to $v_{j}$ is $k_{min}$ , if and only if $a_{i,j}^{(k_{min})}\not=0\land i\not=j\land\sum_{k=1}^{k_{min}-1}a_{i,j}^{(k)}=0$ .

Fact 1.

In unweighted graphs, any shortest paths of length $k$ can be expressed as the connection of two shortest paths with lengths $k-1$ and $1$ , where $k\geq 2$ .

Obviously, we can obtain an expression for all paths,

(1)

\begin{split}P(i,j)_{min}&=\sum_{k=1}^{\epsilon(i)}a_{i,j}^{(k)}=\sum_{k=1}^{k_{min}-1}a_{i,j}^{(k)}+a_{i,j}^{(k_{min})}+\sum_{k=k_{min}+1}^{\epsilon(i)}a_{i,j}^{(k)},\end{split}

where $1\leq k_{min}\leq\epsilon(i)$ . It is evident that the sum of array is minimized when the first non-zero value of $a_{i,j}^{(k)}$ is encountered.

Theorem 3.2 and Fact 1 state the sufficient condition for breaking the loop and ending it:

(1)

DAWN already found the shortest paths between all pairs of nodes in the graph;
(2)

The distance vector does not change when a loop ends, which means no new paths were found in the loop.

3.2. BOVM

We describe the vector-vector multiplication as follows,

(2)

a_{i,j}^{(2)}=a_{i,1}a_{1,j}+a_{i,2}a_{2,j}\cdots+a_{i,n}a_{n,j},

which denotes the collection of path combinations from $i$ to $j$ through any node. Thus, it is unnecessary to consider all possible combinations when determining the presence of a path from i to j; only cases where $a_{i,l}>0\wedge a_{l,j}>0$ can affect the value of $a_{i,j}^{(2)}$ . If $a_{i,j}^{(2)}$ represents any value greater than 0, it signifies the existence of a shortest path from node $i$ to $j$ . Consequently, we can simplify this formula by utilizing a Boolean data type.

We converted the Formula 2 to a Boolean-type as follows,

(3)

a_{i,j}^{(2)}=\sum_{l=0}^{n-1}\alpha[l]\wedge\beta[l],

which requires $O(n)$ time. Since only the non-zero elements in $\alpha$ and $\beta$ affect the multiplication result, we can compress the vectors by retaining only their non-zero elements, and get Formula 4 as follows,

(4)

a_{i,j}^{(2)}=\sum_{l=0}^{len_{\gamma}-1}\alpha[\gamma[l]],

where $len_{\gamma}$ represents the length of $\gamma$ and $\gamma$ is a compressed version of $\beta$ , containing only the indices of non-zero elements.

Input: CSC,

\alpha

\beta

, distance, step, is_converged

Output: distance, is_converged

1 while step ¡ n do

2 step

\Leftarrow

step + 1 ;

3 for i $\in$ [0,n-1] & $\alpha$ [i] = false do

4 start

\Leftarrow

CSC.columns_ptr[i] ;

5 end

\Leftarrow

CSC.columns_ptr[i + 1] ;

6 while (j $\in$ [start, end-1]) & ( $\alpha$ [CSC.row[j]] = true) & (CSC.row[k] != i) do

\beta

[i]

\Leftarrow

true ;

8 distance[i]

\Leftarrow

step ;

9 if is_converged = true then

10 is_converged

\Leftarrow

false;

12 end if

13 break;

15 end while

// The vector

\alpha

and

\beta

are mandatory and cannot be replaced by search vector

distance

\alpha\Leftarrow merge(\alpha,\beta)

;

\beta.swap(NULL)

;

19 end for

20 if is_converged = true then

21 break ;

23 end if

25 end while

26return distance;

Algorithm 1 Boolean Vector-Matrix Operation

In Formula 4, if the value of $\alpha[\gamma[l]]$ is 1, the result $a_{i,j}^{(2)}$ will be 1. Once the first element of $true$ is obtained, the sum will always yield a value of $true$ and indicates that the path exists from $i$ to $j$ . Hence, we let the loop end at the time. We can ensure that the path we first discover is the shortest path by Theorem 3.2, and skip the computation of any paths ended with $j$ in next operations.

We can extend this vector operation to the CSC format matrix to assist DAWN in reducing neighbor node access and discovering the shortest path. We get the BOVM as the Algorithm 1, where $\alpha$ and $\beta$ are the dense vector, $step$ represents the steps of the shortest paths in the iteration. Line 4 of Algorithm 1 implements the Formula 4, while lines 7-8 implement the stopping criterion provided by Fact 1.

Next, we discuss the time complexity of DAWN based on the BOVM, and the formula as follows,

(5)		$\displaystyle T(n)$	$\displaystyle=\sum_{x=1}^{\epsilon(i)-1}\big{[}\sum_{i=0}^{n-len[\alpha_{x}]}D_{exit}(i,x)\big{]},$
(6)		$\displaystyle T(n)$	$\displaystyle<\sum_{x=1}^{\epsilon(i)-1}\big{[}1-\frac{1-p}{\epsilon(i)-1}x\big{]}\cdot m<\frac{1+p}{2}\epsilon(i)m.$

where $D_{exit}(i,x)$ represents the index value of element in $A[i]$ when DAWN exits the loop $x$ , $len[\alpha_{x}]$ is the length of $\alpha$ in the loop $x$ (Refer to Algorithm 1 Line 1 to 5), $p$ represents the average connection probability of graph.

DAWN requires $O(\frac{1+p}{2}\epsilon_{max}\cdot nm)$ and $O(\frac{1+p}{2}\epsilon(i)m)$ time for APSP and SSSP problem, respectively.

3.3. Sparse Optimized Operation

The performance of BOVM on sparse graphs, especially those with large diameters, is often limited by the expensive cost of vector-matrix multiplication, making it difficult to outperform BFS. Reducing the number of vector multiplications has become a critical issue in enabling DAWN to be widely used.

We propose the method of SOVM to optimized DAWN on the sparse graphs, which combines graph traversal algorithms with vector-matrix multiplication, limiting the operation to nodes and their neighboring nodes. Specifically, we first obtain the set of neighboring nodes, exclude nodes that have already appeared in the result vector, then calculate the vector multiplication values of these nodes, obtaining paths of length step with target nodes in the neighboring nodes set, and finally update the shortest paths in the result vector.

Although the process is complex, we can simplify it by utilizing the properties of Boolean matrix operations. It is important to note that SOVM operates on CSR matrices, while BOVM operates on CSC matrices. The boolean vector-matrix multiplication is as follows,

(7)

\displaystyle\gamma[i]=\sum_{l=0}^{n}\alpha[l]\wedge A[i][l].

If we use matrix $A=\{\beta_{0},\beta_{1},\cdots\beta_{n-2},\beta_{n-1}\}$ , we can simplify the boolean vector-matrix multiplication as follows,

(8)

\displaystyle\gamma=\prod_{k=0}^{len_{\beta^{\prime}}-1}\beta_{\beta^{\prime}[k]},

where $\beta^{\prime}$ is the compress version of $\beta$ . Formula 8 indicates that the BOVM can be achieved by computing multiple inner products of vectors in succession.

If we transpose the matrix A to a CSR matrix $A\_CSR=\{\alpha_{0},\alpha_{1},\cdots\alpha_{n-2},\alpha_{n-1}\}$ , Formula 8 can be simplified as follows,

(9)

\displaystyle\beta=\bigcup_{k=0}^{len_{\beta^{\prime}}-1}\alpha_{\beta^{\prime}[k]},

and it means that we can use $len_{\beta^{\prime}}$ times of array merges to replace boolean vector-matrix multiplication in the SSSP tasks. We get the optimized method as the Algorithm 2.

Input: CSR,

\alpha

\beta

, step, distance

Output: distance

1 while step ¡ n do

2 step

\Leftarrow

step + 1 ;

3 while i $\in$ [0, n-1] & ( $\alpha$ [i] = true) do

4 start

\Leftarrow

CSR.row_ptr[i];

5 end

\Leftarrow

CSR.row_ptr[i + 1];

6 while (j $\in$ [start, end-1]) & (distance[CSR.col[j]] = 0) do

\beta

[CSR.col[j]]

\Leftarrow

true;

8 distance[CSR.col[j]]

\Leftarrow

step;

9 if is_converged = true then

10 is_converged

\Leftarrow

false;

12 end if

14 end while

16 end while

\alpha

.swap(NULL);

// Reset the vector by swap

18 if is_converged = true then

19 break ;

21 end if

23 end while

24return distance;

Algorithm 2 Sparse Optimized Boolean Vector-Matrix Operation

Algorithm 2 utilizes a simpler method to merge vectors, and is particularly interested in the newly added elements of $\beta$ after merging these arrays. We aim to skip any duplicate elements since these shortest paths have already been discovered. We only visit the edges and update the shortest path when the element is missing in the $\beta$ array.

Specifically, SOVM starts from the set of neighbor nodes, skips all nodes that have already appeared in the result vector (line 1), finds the target nodes in neighboring nodes set that have not yet appeared in the result vector (line 4), and then updates their shortest paths. Formula 9 provides theoretical support for such operations, and SOVM can automatically exclude the cycles without additional judgment.

Hence, we get the time complexity of DAWN based on SOVM to solve SSSP task of node $i$ is as follows,

(10)

\displaystyle T(n)=\sum_{j=0}^{S_{wcc}(i)}d^{+}(j)=E_{wcc}(i),

where $d^{+}(j)$ is the out-degree of node $j$ . $S_{wcc}(i)$ and $E_{wcc}(i)$ denotes the number of nodes and edges included in the largest WCC (Weakly Connected Component) to which node $i$ belongs. The time complexity of DAWN for APSP is determined by the largest WCC in the graph,

(11)		$\displaystyle T(n)$	$\displaystyle=\sum_{i=0}^{n-1}E_{wcc}(i)=S_{wcc}\cdot E_{wcc}+\sum_{i=0}^{n-1-S_{wcc}}E_{wcc}(i),$
(12)		$\displaystyle T(n)$	$\displaystyle<2S_{wcc}\cdot E_{wcc},$

where $S_{wcc}$ and $E_{wcc}$ denote the number of nodes and edges included in the largest WCC in graph. The time complexity of DAWN based on the SOVM is $O(S_{wcc}\cdot E_{wcc})$ for APSP tasks.

In summary, DAWN based on SOVM achieves better time complexity, requiring $O(S_{wcc}\cdot E_{wcc})$ and $O(E_{wcc}(i))$ time for APSP and SSSP tasks on the unweighted graphs, compared to BFS which requires $O(nm)$ and $O(m)$ , respectively. It is important to note that this complexity improvement only occurs in non-connected graphs, whereas in connected graphs, DAWN and BSF both require $O(nm)$ and $O(m)$ time for APSP and SSSP tasks.

3.4. Memory

In this section, we elaborate on how DAWN achieves reduced memory usage compared to BFS. Typically, the memory requirements of the BFS algorithm can be divided into three components: CSR matrix, the distance vector, and the priority queue. This implies that BFS cannot operate with less than $4m+8n$ bytes of memory.

DAWN’s memory requirements also consist of three components: the CSR matrix, the distance vector, and two boolean arrays. The two boolean arrays are utilized to store the paths updated in the previous and current iterations (details in Algorithm 2). As DAWN is a backward BFS algorithm, we can maintain a boolean array on the GPU instead of distance vector, with path length updates occurring in memory. The GPU memory is byte-addressable, and even boolean variables are allocated a byte of space.

Therefore, DAWN necessitates a minimum of $4m+3n$ bytes of memory. We can get the formula as follows,

(13)

\eta=\frac{4m+3n}{4m+8n}=\frac{4D+3}{4D+8},

where $D$ represents the average degree of the graph.

For instance, when considering the theoretical minimum memory usage, DAWN requires only 91.58% of the memory used by Gunrock on the graph uk-2005. While the theoretical difference is approximately 8.4%, in experiments conducted under constrained GPU memory conditions, DAWN can solve the BFS task on uk-2005, whereas Gunrock fails to allocate sufficient GPU memory. It is noteworthy that as the sparsity of the graph increases, the memory advantage of DAWN becomes more pronounced.

3.5. Difference

To further examine the differences between DAWN and BFS, we present the technical details used in these two algorithms. Algorithm 3 describes the general BFS algorithm, and the benchmark implementations from GAP and Gunrock both employed more sophisticated optimization techniques in the experiment, where $pq$ represents the priority queue and $source$ is the source node of the SSSP task.

Input: CSR, pq, source, distance

Output: distance

1 pq.push({source,distance}) ;

2 while !pq.empty() do

3 i

\Leftarrow

pq.top; pq.pop();

4 start

\Leftarrow

CSR.row_ptr[i];

5 end

\Leftarrow

CSR.row_ptr[i + 1];

6 while j $\in$ [start, end-1] do

7 index

\Leftarrow

CSR.col[j];

8 if distance[index] = 0 then

9 new_dist

\Leftarrow

distance + 1;

10 distance[index]

\Leftarrow

new_dist;

11 pq.push({index,new_dist}) ;

13 end if

15 end while

17 end while

18return distance;

Algorithm 3 General BFS

For Line 15 of Algorithm 2, if no new shortest paths are found in this loop, then exit, according to Fact 1. We utilize step to mark the current node being visited as a neighbor node of the source node at the layer $step$ in DAWN. Lines 4-6 of Algorithm 2 indicate that some edge visitations can be skipped. This means that the nodes that have already been visited in the previous layer do not need to be visited again, as the shortest path has already been determined. The theorem supporting this decision is referred to as Theorem 3.2, which states that the first discovered path from the source node to a reachable node is the shortest path. The processing steps of DAWN is as follows,

(1)

Firstly, DAWN reads the input vector and identifies the single-step reachable nodes from the source node $s$ ,
(2)

Then, searching for the single-step reachable nodes from the updated nodes which updating in the previous step, while skipping nodes that have already been discovered to have a path from the source node $s$ and reachable nodes with an out-degree of 0,
(3)

Next, DAWN repeats the second step until output vector stabilizes.
(4)

Finally, exit loop and output the result vector.

On the other hand, in BFS, the operations of accessing nodes and edges, and checking whether the path needs to be updated are necessary for every node and edge, refer to Line 6-10 in Algorithm 3.

We illustrate the difference between the BFS algorithm and DAWN through a example in Figure 2. The red color indicates the nodes and edges that are visited in the current step, the blue nodes and edges represent that have already been visited, and the green edges represent that have not yet been visited. The $input$ represents the paths updated in the previous iteration, while the $output$ indicates the paths updated in the current iteration. The $result$ signifies the outcome of the algorithm, specifically denoting the shortest path lengths from the source node to other nodes in the graph.

In the third step of Figure 2, node $5$ has four outbound edges to node $1,4,6,10$ . BFS must traverse these edges, and then BFS would note that the destination vertex of these edges had already been visited and the destination vertex would not be in the output frontier. However, DAWN does not traverse these edges. The compressed vector for node 5 is $\{1,2,3,6,10\}$ , and $result$ is $\{1,1,1,0,0,0,1,0,0,0\}$ . The values of $result[1]$ , $result[2]$ , $result[3]$ , and $result[6]$ are all 1, indicating that DAWN will skip these edges, because the shortest paths of them were already found.

In the SSSP task for node $s$ , the BFS algorithm visited a total of 10 nodes and 17 edges, while DAWN visited only 8 nodes and 12 edges, resulting in 2 fewer nodes and 5 fewer edges being visited by DAWN.

Overall, the fundamental difference between DAWN and BFS lies in whether the algorithm relies on a priority queue to prevent revisiting nodes and edges.

4. Results

In this section, we will outline the experimental setup and present initial experimental data. Following this, we proceed to show the performance of DAWN with regard to scalability and its effect in accelerating the SSSP task.

4.1. Experiment Introduction

In our experimental trials, we leverage a set of 66 general graphs sourced from the SuiteSparse Matrix Collection and the Gunrock benchmark datasets (Davis and Hu, 2011; Wang et al., 2016).

In scenarios where a node is not part of the largest WCC of the graph, but instead resides in a smaller connected component, DAWN has the potential to accomplish the task in constant time, while BFS requires the construction of a priority queue.

Therefore, we have established a randomly generated set comprising 500 nodes, where each node executes the SSSP task 64 times. All computations are conducted within this node-set. It is noteworthy that these nodes are not exclusively part of the largest connected component, and our dataset includes non-connected graphs. We underscore our focus on evaluating the performance of the BFS algorithm across diverse graph types, including connected and non-connected graphs, as well as both generated and real-world graphs.

Performing the task consecutively serves to minimize the influence of external factors on experimental results, such as interference from background processes. We adopted the arithmetic mean as the anticipated value and subjected the sample distribution to a t-distribution test. After eliminating samples that deviated from the assumptions of the t-distribution, and computing the mean of the remaining samples, we get the final result.

Table 2. Parameters of the Test Machine

Hardware	Machine1	Machine2
CPU	Intel Core i5-13600KF	AMD EPYC Milan 7T83
RAM	32GB	128GB
GPU	NVIDIA RTX 3080TI	–
Compiler	NVCC and GCC 9.4.0	GCC 9.4.0
OS	Ubuntu 20.04	Ubuntu 20.04
Toolkit	CUDA 12.1	–

Table 3. Parameters of the Algorithms

Abbreviation	Solution
Gunrock	The BFS algorithm running on RXT3080TI,
	from the Gunrock (Wang et al., 2017; Osama et al., 2022)
GAP	Direction-optimizing BFS algorithm, from
	the GAP (Beamer et al., 2012), running on I5-13600KF
DAWN	DAWN with RTX 3080TI
DAWN(20)	DAWN with I5-13600KF

Table 4. Parameters of the Experimental Graphs (Nodes and Edges)

Nodes
¡ 100K	100K $\thicksim$ 500K	500K $\thicksim$ 5M	5M $\thicksim$ 100M	¿ 100M
14	24	14	13	1

Edges
¡ 1M	1M $\thicksim$ 5M	5M $\thicksim$ 20M	20M $\thicksim$ 500M	¿ 500M
16	16	23	8	3

The parameters of the test machine are detailed in Table 2. Our comparison includes various versions of the BFS algorithm, and the results are presented in Table 3. We provide accessible links to graphs: [Dataset]. The number of nodes in these graphs ranges up to $139M$ , with edges extending up to $921M$ . The parameters of the experimental graphs are detailed in Table 4.

Specifically, the results for DAWN running on GPUs were obtained using a thread block size of 1024, a configuration viable on GPUs since the Pascal architecture introduced in 2016. Although the optimal block partitioning scheme depends on several factors (e.g., matrix density, shared memory size, bandwidth, etc.), we adopt a fixed block size to enhance result reproducibility.

Gunrock is a CUDA library for graph-processing designed specifically for the GPU, which achieves a better balance between performance and expressiveness via coupling high-performance GPU computing primitives and optimization strategies, particularly in the area of fine-grained load balancing. (Wang et al., 2017; Osama et al., 2022, 2023; Wang et al., 2016).

We strongly encourage readers to delve into the provided codebase and verify the reported results. The code and more information for our algorithm are available on [GitHub]. In the repository, we offer additional insights into the actual running times and graph details for each proposed solution, accompanied by a description of artifacts and evaluation methodologies. These details are provided to enhance the reproducibility of any results presented in this paper.

4.2. Scalability

It is important to validate DAWN’s feature of high parallelism and scalability. We measure the scalability of DAWN using multi-threading efficiency as follows, simplified from Gustafson-Barsis’s law(Gustafson, 1988),

(14)

\eta_{t}=\frac{T_{B}}{T_{N}\times N},

where $T_{B}$ represents the baseline execution time, $T_{N}$ represents the execution time of the program with $N$ threads, and $N$ is the number of threads. In Table 5 and 6, the multi-threading efficiency for DAWN based on SOVM and BFS API from GAP on the I5-13600KF and EPYC Milan 7T83 are depicted, respectively.

Table 5. The multi-threading efficiency of DAWN and GAP on I5-13600KF

Thread	1	3	6	12	20
DAWN(20)	100%	99.72%	98.35%	77.96%	37.54%
GAP	100%	96.84%	89.51%	66.28%	23.43%

Table 6. The multi-threading efficiency of DAWN and GAP on EPYC Milan 7T83

Thread	4	8	16	32	64
DAWN(20)	100%	99.45%	97.73%	84.10%	58.97%
GAP	100%	95.69%	88.12%	71.49%	40.16%

When utilizing up to 32 threads on the EPYC processor, the core frequency remains constant at 3.5 GHz. However, when scaling up to 64 threads, the frequency of all cores diminishes to 2.54 GHz. In contrast, the I5 processor does not experience any reduction in core frequency. It is important to note that the exact performance gains are contingent upon the particular hardware configuration utilized, and considerations such as power and thermal constraints impose limitations on the maximum achievable performance.

The I5 processor integrates a combination of performance and efficiency cores, where the former delivers higher clock speeds, and the latter excels in power efficiency. Hence, DAWN achieves a linear speedup when scaling from 1 thread to 6 threads, but the performance improvement slows down when scaling from 6 to 12 threads due to the performance gap between the two types of cores. Additionally, the I5 processor does not have enough physical cores to achieve significant performance gains beyond 14 threads.

Across diverse hardware configurations, DAWN demonstrates better multi-threading efficiency compared to GAP. Futhurmore, the scalability of the algorithm is influenced by a considerable number of factors, among which the characteristic of the graph is a significant factor.

Figure 3 and 4 illustrate DAWN’s capability for speedup across different thread counts on some of the graphs. Specifically, on the mycielskian16, which is a dense graph with a low-diameter of 2, DAWN exhibits lower thread efficiency compared to other sparse graphs. This phenomenon underscores the impact of multiple factors on the increase or decrease in algorithm performance. For instance, Graph mouse_gene is denser than mycielskian16, and with a diameter of 12, yet DAWN exhibits superior thread efficiency on mouse_gene. Therefore, we emphasize the comprehensive performance of the algorithm across a wider variety of graphs.

4.3. Performance Comparison with GAP

In the experiment, DAWN based on SOVM demonstrated remarkable performance and also exhibits high scalability and parallelism. We have included a figure illustrating the distribution of the speedup of DAWN over GAP across all graphs in the test dataset.

Table 7. The speedup of DAWN over GAP

Speedup	¡1 $\times$	1 $\times$ $\thicksim$ 2 $\times$	2 $\times$ $\thicksim$ 4 $\times$	4 $\times$ $\thicksim$ 16 $\times$	¿16 $\times$
DAWN(20)	4	15	24	17	6
DAWN	12	20	11	9	14

In Table 7, the speedup for DAWN based on SOVM over BFS API from GAP on an I5-13600KF is depicted, with the values derived from the mean of repeated experiments, following the methodology outlined in Section 4.1. The first row shows the speedup of DAWN(20) over BFS API from GAP, both on I5-13600KF, and the next row shows the speedup of DAWN on RTX3080TI over BFS API from GAP on I5-13600KF. Due to the significant increase in scalability and parallelism, DAWN based on the SOVM outperformed GAP in most graphs (62 out of 66), achieving an impressive average speedup of 3.769 $\times$ .

However, the DAWN algorithm demonstrates comparatively lower performance in four specific graphs (coPapersDBLP, com-DBLP, coAuthorsDBLP, coPapersCiteseer), all representing citation and collaboration networks. These graph types are characterized by high clustering coefficients and relatively short average shortest paths. Despite the deployment of a more potent processor, BFS API from Gunrock falls short of surpassing the performance of GAP on these graphs. Nevertheless, in other scale-free graphs such as social networks and the internet, the DAWN algorithm exhibits superior performance.

Numerous well-established studies have presented evidence that the eccentricity of the real graphs is $\log n$ (Bollobás, 1981; Albert et al., 1999; Cohen and Havlin, 2003; Riordan et al., 2004; Ganesh and Xue, 2007; Takes and Kosters, 2011). Therefore, we get the small-world graphs (23 out of 66) which the average shortest path in the graph is less than $\log n$ , includes the citation and collaboration networks mentioned before.

The Direction-Optimizing BFS algorithm will achieve the speedups when the active frontier is a substantial fraction of the total graph, which commonly occurs in small-world graphs (Beamer et al., 2012). However, DAWN outperforms GAP on the most small-world graphs (19 out of 23) and achieves an average speedup of 2.332 $\times$ . Furthermore, in other real graph with a high-diameter such as road networks, DAWN achieves an average speedup of 4.483 $\times$ over GAP.

Figure 5 shows the running time for DAWN(20) and GAP. The y-axis represents the average running time, with each marker representing the running time on a graph. GAP instances are indicated by blue markers, while DAWN instances are represented by black markers.

4.4. Performance Comparison with Gunrock

In Table 8, the first row shows the speedup of DAWN(20) on an I5-13600KF over BFS API from Gunrock on RTX3080TI. The next rows shows the speedup of DAWN over BFS API from Gunrock, both on RTX3080TI. Figure 6 illustrates the running time for DAWN and BFS API from Gunrock. Red markers correspond to DAWN, while green markers represent Gunrock. Impressively, DAWN outperformed Gunrock in the majority of graphs (63 out of 66), achieving an average speedup of 9.410 $\times$ . On the Graphs uk-2005 and arabic-2005, Gunrock encountered an out-of-memory error, thus preventing the acquisition of runtime data for these two graphs. The testing machine equipped with 12GB of physical GPU memory, with 9.7GB available. The available GPU memory for both DAWN and Gunrock is identical, indicating that when executing similar tasks, DAWN requires less GPU memory compared to Gunrock.

Table 8. The speedup of DAWN over Gunrock

Speedup	¡1 $\times$	1 $\times$ $\thicksim$ 2 $\times$	2 $\times$ $\thicksim$ 4 $\times$	4 $\times$ $\thicksim$ 16 $\times$	¿16 $\times$
DAWN(20)	5	5	10	23	21
DAWN	3	16	22	6	19

Apart from the aforementioned two graphs, DAWN demonstrates performance inferior to that of Gunrock on the Graph web-BerkStan, and also falls short compared to DAWN(20) and GAP running on CPU. This phenomenon may be attributed to the scale-free nature of web-BerkStan, leading to load imbalance during computation and significantly impacting algorithm performance. Gunrock, possessing robust load balancing capabilities, holds an advantage in such scenarios. It is crucial to note that DAWN does not prioritize load balancing as the primary focus of our investigation. Nonetheless, despite these challenges, DAWN outperforms Gunrock on the majority of graphs due to algorithmic optimizations.

4.5. Performance on Different Platforms

The differences in the speedup distribution of DAWN compared to different algorithms are attributed to the nature of the graphs, such as the number of nodes and the average shortest path length. Therefore, we will proceed to compare the performance of DAWN on the different platforms. Figure 7 illustrates the performance gap of DAWN between CPU and GPU. Light purple bars indicate cases where DAWN’s performance on CPU is inferior to that on GPU, while dark purple bars represent the opposite scenario.

In more than half of the graphs (37 out of 66), DAWN(20) exhibits superior performance compared to DAWN. For instance, on web graphs, DAWN(20) and the GAP algorithm achieved enhanced performance, which demonstrates that the powerful single-core performance of CPUs provide better acceleration for algorithms.

However, this single-core performance acceleration has its limitations. Once the graph size exceeds one million nodes, the advantage of single-core performance can no longer compensate for the performance disparity induced by a greater number of cores. Furthermore, on graphs with a smaller number of nodes, the communication overhead between the CPU and GPU appears more costly than computational expenses, leading to inferior performance compared to algorithms running on CPU.

DAWN(20) achieved performance superiority on graphs with an average of 0.209 million nodes and 5.854 million edges (considering undirected edges as two directed edges). On the other hand, DAWN demonstrated performance superiority on graphs with an average of 13.820 million nodes and 146.592 million edges.

In summary, DAWN is more efficient and yielding a higher speedup when compared to Gunrock and GAP.

5. Conclusion

In this paper, we propose an enhanced BFS algorithm, which requires $O(E_{wcc}(i))$ and $O(S_{wcc}\cdot E_{wcc})$ time for solving SSSP and APSP problems on the unweighted graph, respectively.

Our research involved a performance evaluation of DAWN, GAP, and Gunrock across various platforms, using graphs from SuiteSparse Matrix Collection and Gunrock benchmark dataset. DAWN achieves average speedup of 3.769 $\times$ and 9.410 $\times$ , over GAP and Gunrock respectively.

The experiment underscores that DAWN based on the SOVM, exhibits remarkable scalability and efficiency in addressing the shortest paths problem on modern processors. The efficient utilization of computational resources is a significant factor contributing to its exceptional performance. These results highlight the potential of DAWN as a powerful tool for graph analytics, particularly in applications that require high processing speed and efficiency.

Our future research will focus on addressing the balance between optimizing matrix operations and managing the consumption of (min,+) operations. This focus is aimed to expand the applicability of DAWN on weighted graphs, transforming it from a promising proof-of-concept to a practical tool that can be used in various real-world graph analysis applications in the future.

References

(1)
Albert et al. (1999) Réka Albert, Hawoong Jeong, and Albert-László Barabási. 1999. Diameter of the world-wide web. nature 401, 6749 (1999), 130–131.
Arlazarov et al. (1970) Vladimir L’vovich Arlazarov, Yefim A Dinitz, MA Kronrod, and IgorAleksandrovich Faradzhev. 1970. On economical construction of the transitive closure of an oriented graph. In Doklady Akademii Nauk, Vol. 194. Russian Academy of Sciences, 487–488.
Beamer et al. (2012) Scott Beamer, Krste Asanovic, and David Patterson. 2012. Direction-optimizing breadth-first search. In SC’12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1–10.
Beamer et al. (2015) Scott Beamer, Krste Asanović, and David Patterson. 2015. The GAP benchmark suite. arXiv preprint arXiv:1508.03619 (2015).
Beamer et al. (2013) Scott Beamer, Aydin Buluc, Krste Asanovic, and David Patterson. 2013. Distributed memory breadth-first search revisited: Enabling bottom-up search. In 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum. IEEE, 1618–1627.
Bertsekas (1998) Dimitri Bertsekas. 1998. Network optimization: continuous and discrete models. 8, 2 (1998), 91–112.
Bertsekas and Tsitsiklis (1991) Dimitri P Bertsekas and John N Tsitsiklis. 1991. An analysis of stochastic shortest path problems. Mathematics of Operations Research 16, 3 (1991), 580–595.
Bollobás (1981) Béla Bollobás. 1981. The diameter of random graphs. Trans. Amer. Math. Soc. 267, 1 (1981), 41–52.
Buluç et al. (2017) Aydin Buluç, Scott Beamer, Kamesh Madduri, Krste Asanovic, and David Patterson. 2017. Distributed-memory breadth-first search on massive graphs. arXiv preprint arXiv:1705.04590 (2017).
Busato and Bombieri (2015) Federico Busato and Nicola Bombieri. 2015. An efficient implementation of the Bellman-Ford algorithm for Kepler GPU architectures. IEEE Transactions on Parallel and Distributed Systems 27, 8 (2015), 2222–2233.
Chan (2012) Timothy M Chan. 2012. All-pairs shortest paths for unweighted undirected graphs in o (mn) time. ACM Transactions on Algorithms (TALG) 8, 4 (2012), 1–17.
Chen et al. (2007) Mo Chen, Rezaul Alam Chowdhury, Vijaya Ramachandran, David Lan Roche, and Lingling Tong. 2007. Priority queues and dijkstra’s algorithm. (2007).
Cohen and Havlin (2003) Reuven Cohen and Shlomo Havlin. 2003. Scale-Free Networks Are Ultrasmall. Phys. Rev. Lett. 90 (Feb 2003), 058701. Issue 5. https://doi.org/10.1103/PhysRevLett.90.058701
Coppersmith and Winograd (1987) D. Coppersmith and S. Winograd. 1987. Matrix Multiplication via Arithmetic Progressions. In Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (New York, New York, USA) (STOC ’87). Association for Computing Machinery, New York, NY, USA, 1–6. https://doi.org/10.1145/28395.28396
Davidson et al. (2014) Andrew Davidson, Sean Baxter, Michael Garland, and John D Owens. 2014. Work-efficient parallel GPU methods for single-source shortest paths. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. IEEE, 349–359.
Davis and Hu (2011) Timothy A. Davis and Yifan Hu. 2011. The University of Florida Sparse Matrix Collection. ACM Trans. Math. Softw. 38, 1 (dec 2011).
Dhulipala et al. (2017) Laxman Dhulipala, Guy Blelloch, and Julian Shun. 2017. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. 293–304.
Dhulipala et al. (2018) Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2018. Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA).
Dijkstra et al. (1959) Edsger W Dijkstra et al. 1959. A note on two problems in connexion with graphs. Numerische mathematik 1, 1 (1959), 269–271.
Dreyfus (1969) Stuart E Dreyfus. 1969. An appraisal of some shortest-path algorithms. Operations research 17, 3 (1969), 395–412.
Farbey et al. (1967) BA Farbey, AH Land, and JD Murchland. 1967. The cascade algorithm for finding all shortest distances in a directed graph. Management Science 14, 1 (1967), 19–28.
Fredman and Tarjan (1987) Michael L Fredman and Robert Endre Tarjan. 1987. Fibonacci heaps and their uses in improved network optimization algorithms. Journal of the ACM (JACM) 34, 3 (1987), 596–615.
Galil and Margalit (1997) Zvi Galil and Oded Margalit. 1997. All Pairs Shortest Distances for Graphs with Small Integer Length Edges. Information and Computation 134, 2 (1997), 103–139.
Ganesh and Xue (2007) Ayalvadi Ganesh and Feng Xue. 2007. On the connectivity and diameter of small-world networks. Advances in Applied Probability 39, 4 (2007), 853–863. https://doi.org/10.1239/aap/1198177228
Gustafson (1988) John L Gustafson. 1988. Reevaluating Amdahl’s law. Commun. ACM 31, 5 (1988), 532–533.
Harary et al. (1965) Frank Harary, Robert Zane Norman, and Dorwin Cartwright. 1965. Structural models: An introduction to the theory of directed graphs. 5, 1 (1965), 111–115.
Johnson (1977) Donald B Johnson. 1977. Efficient algorithms for shortest paths in sparse networks. Journal of the ACM (JACM) 24, 1 (1977), 1–13.
Meyer and Sanders (2003) Ulrich Meyer and Peter Sanders. 2003. $\Delta$ -stepping: a parallelizable shortest path algorithm. Journal of Algorithms 49, 1 (2003), 114–152.
Moret and Shapiro (1992) Bernard ME Moret and Henry D Shapiro. 1992. An Empirical Assessment of Algorithms for Constructing a Minimum Spanning Tree. Computational Support for Discrete Mathematics 15 (1992), 99–117.
Mulmuley and Shah (2000) Ketan Mulmuley and Pradyut Shah. 2000. A lower bound for the shortest path problem. In Proceedings 15th Annual IEEE Conference on Computational Complexity. IEEE, 14–21.
Osama et al. (2022) Muhammad Osama, Serban D. Porumbescu, and John D. Owens. 2022. Essentials of Parallel Graph Analytics. In Proceedings of the Workshop on Graphs, Architectures, Programming, and Learning (GrAPL 2022). 314–317. https://doi.org/10.1109/IPDPSW55747.2022.00061
Osama et al. (2023) Muhammad Osama, Serban D. Porumbescu, and John D. Owens. 2023. A Programming Model for GPU Load Balancing. In Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (Montreal, QC, Canada) (PPoPP ’23). Association for Computing Machinery, New York, NY, USA, 79–91.
Raman (1997) Rajeev Raman. 1997. Recent results on the single-source shortest paths problem. ACM SIGACT News 28, 2 (1997), 81–87.
Riordan et al. (2004) Oliver Riordan et al. 2004. The diameter of a scale-free random graph. Combinatorica 24, 1 (2004), 5–34.
Sao et al. (2020) Piyush Sao, Ramakrishnan Kannan, Prasun Gera, and Richard Vuduc. 2020. A supernodal all-pairs shortest path algorithm. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 250–261.
sao et al. (2021) piyush sao, Hao lu, Ramakrishnan Kannan, Vijay Thakkar, Richard Vuduc, and Thomas Potok. 2021. Scalable All-Pairs Shortest Paths for Huge Graphs on Multi-GPU Clusters. In Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing (Virtual Event, Sweden) (HPDC ’21). Association for Computing Machinery, New York, NY, USA, 121–131.
Seidel (1995) Raimund Seidel. 1995. On the all-pairs-shortest-path problem in unweighted undirected graphs. Journal of computer and system sciences 51, 3 (1995), 400–403.
Shun (2020) Julian Shun. 2020. Practical parallel hypergraph algorithms. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. 232–249.
Shun and Blelloch (2013) Julian Shun and Guy E Blelloch. 2013. Ligra: a lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming. 135–146.
Shun et al. (2015) Julian Shun, Laxman Dhulipala, and Guy E Blelloch. 2015. Smaller and faster: Parallel processing of compressed graphs with Ligra+. In 2015 Data Compression Conference. IEEE, 403–412.
Shun et al. (2016) Julian Shun, Farbod Roosta-Khorasani, Kimon Fountoulakis, and Michael W Mahoney. 2016. Parallel Local Graph Clustering. Proceedings of the VLDB Endowment 9, 12 (2016).
Strassen et al. (1969) Volker Strassen et al. 1969. Gaussian elimination is not optimal. Numerische mathematik 13, 4 (1969), 354–356.
Surve and Shah (2017) Ganesh G Surve and Medha A Shah. 2017. Parallel implementation of bellman-ford algorithm using CUDA architecture. In 2017 International conference of Electronics, Communication and Aerospace Technology (ICECA), Vol. 2. IEEE, 16–22.
Takaoka (1998) Tadao Takaoka. 1998. Subcubic cost algorithms for the all pairs shortest path problem. Algorithmica 20, 3 (1998), 309–318.
Takes and Kosters (2011) Frank W. Takes and Walter A. Kosters. 2011. Determining the Diameter of Small World Networks (CIKM ’11). Association for Computing Machinery, New York, NY, USA, 1191–1196. https://doi.org/10.1145/2063576.2063748
Tarjan (1983) Robert Endre Tarjan. 1983. Data structures and network algorithms. 1, 3 (1983), 39–45.
Thorup (2000) Mikkel Thorup. 2000. On RAM priority queues. SIAM J. Comput. 30, 1 (2000), 86–109.
Waissi (1994) Gary R Waissi. 1994. Network Flows: Theory, Algorithms, and Applications.
Wang et al. (2019) Hao Wang, Liang Geng, Rubao Lee, Kaixi Hou, Yanfeng Zhang, and Xiaodong Zhang. 2019. SEP-Graph: finding shortest execution paths for graph processing under a hybrid framework on GPU. In Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming. 38–52.
Wang et al. (2016) Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2016. Gunrock: A High-Performance Graph Processing Library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (Barcelona, Spain) (PPoPP ’16). Association for Computing Machinery, New York, NY, USA, Article 11, 12 pages.
Wang et al. (2017) Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T. Riffel, and John D. Owens. 2017. Gunrock: GPU Graph Analytics. ACM Transactions on Parallel Computing 4, 1 (Aug. 2017), 3:1–3:49.
Williams (2012) Virginia Vassilevska Williams. 2012. Multiplying matrices faster than Coppersmith-Winograd. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing. 887–898.