MV4PG: Materialized Views for Property Graphs

Chaijun Xu¹, Xingdi Wei¹, Yu Zhang¹, Kaiwei Li², Xiaowei Zhu², Ke Huang², Tao Wang², Shipeng Qi³ ¹ University of Science and Technology of China, Hefei, China ² Tsinghua University, Beijing, China ³ Huazhong University of Science and Technology, Wuhan, China {xuchaijun,wxd352541141}@mail.ustc.edu.cn, [email protected],
[email protected], {coolerzxw,wangtao.waves,qishipengqsp}@gmail.com, [email protected]

Abstract

Graph databases are getting more and more attention in the highly interconnected data domain, and the demand for efficient querying of big data is increasing. We noticed that there are duplicate patterns in graph database queries, and the results of these patterns can be stored as materialized views first, which can speed up the query rate. So we propose materialized views on property graphs, including three parts: view creation, view maintenance, and query optimization using views, and we propose for the first time an efficient templated view maintenance method for containing variable-length edges, which can be applied to multiple graph databases.

In order to verify the effect of materialized views, we prototype on TuGraph and experiment on both TuGraph and Neo4j. The experiment results show that our query optimization on read statements is much higher than the additional view maintenance cost brought by write statements. The speedup ratio of the whole workload reaches up to 28.71x, and the speedup ratio of a single query reaches up to nearly 100x.

Index Terms:

materialized views, property graph, incremental maintenance, optimization, Cypher

I Introduction

With the booming development of big data, graph databases are increasingly crucial in applications such as social network analysis, recommendation systems, financial fraud detection, and transaction analysis [1]. Research into further accelerating the query rate of graph databases has also attracted increasing amounts of people.

In the workload of graph databases, there are often some patterns that will be queried repeatedly. If the results of these patterns are stored in advance, the query rate can be greatly accelerated by directly querying the stored results when needed, such an idea is very similar to that of materialized views defined in relational databases [2, 3, 4, 5, 6, 7, 8, 9, 10]. However, certain types of queries in graph databases, such as reachability queries containing variable-length edges, receive little attention in the context of relational databases. As a result, the implementation of materialized views in graph databases is still a fresh area waiting to be developed.

Some research has been conducted on materialized views in graph databases such as [11, 12, 13]. These studies leverage materialized views to optimize query performance. However, when the graph database is updated, the consistency between the view and the original database might be broken, requiring maintenance of the view to restore consistency. Notably, these works do not address the maintenance of variable-length edges, which, as previously mentioned, play a crucial role in specific types of queries within graph databases and warrant more focused attention.

Our proposed approach. To address this issue, we propose an approach that consists of three key components: view creation, view maintenance, and query optimization using views. In the view creation phase, we define the creation statements based on the syntax of GQL [14], the standard graph database query language. This ensures compatibility and standardization across graph database systems. In the view maintenance phase, the maintenance template is predefined at the time of view creation. When the graph data is updated, the template parameters are substituted to generate specific maintenance statements, ensuring that the view remains consistent with the underlying graph data. In the query optimization phase using views, materialized views are ranked by their optimization potential, matched to the query graph via subgraph matching, and integrated into the query to improve execution efficiency.

Let’s assume a graph database is used to efficiently manage complex relationships and connections. The approach can be illustrated through the following three examples, each demonstrating one of the three key components:

Example 1 (View Creation). As shown in Figure 1, to determine which Post each Comment belongs to, we query the graph database using the pattern graph depicted in Figure 1(a). This pattern graph includes variable-length edges (denoted by *..), where each edge represents a path with at least one hop and potentially infinite hops. However, querying uncapped variable-length edges can be time-consuming.

To optimize this, we can store the query result as a new edge labeled ROOT_POST, as shown in Figure 1(b). The subsequent sections on view maintenance and query optimization will build upon this materialized views.

Refer to caption — (a) Pattern Graph of Example Query

Example 2 (View Maintenance). When the graph data is updated, such as deleting a Comment or linking a new Comment to a different Post, the consistency of the materialized view and original graph data is disrupted, requiring the view to be maintained.

For efficient maintenance, only the affected views are incrementally updated, rather than being fully regenerated. For example, with the view in Figure 1(b), if a node is deleted in the graph database, all impacted ROOT_POST edges must be removed. As shown in Figure 1(a), the deleted node could be at n or m or inside the variable-length edge r. The three pattern graphs in Figure 2 are then used to identify the corresponding n, m node pairs. Here $L, $K, and $V represent the label, primary key name, and primary key value, respectively. Based on these patterns, the corresponding n, m pairs are identified, and the affected ROOT_POST edges are deleted accordingly.

To reduce the maintenance time, we can automatically generate a view maintenance template during view creation. When a node is deleted, we simply replace $L, $K, and $V with the node’s actual information, eliminating the need to regenerate the maintenance statement each time.

Example 3 (Query Optimization Using Views). Once the materialized view is in place, queries involving Comment-Post relationships can be optimized. Instead of traversing variable-length edges in real-time, the query leverages the precomputed ROOT_POST edges.

Given the query pattern graph in Figure 3(a), we observe that the edge between n and m matches the view in Figure 1(b). As a result, this edge can be replaced with the view edge ROOT_POST, optimizing the query.

Contributions. In this paper, we propose an innovative solution for creating and maintaining materialized views on property graphs, with a focus on optimizing queries involving variable-length edges. A central feature of our approach is the Mv4pg system, which seamlessly integrates view creation, maintenance, and query optimization into a unified framework. At its core is a templated view maintenance strategy that automatically generates maintenance templates during view creation. This method significantly reduces the overhead of regenerating maintenance statements after graph updates. By replacing only the relevant parts of the template with actual deletion information, we eliminate the need for redundant computations, ensuring both efficiency and scalability. Our solution is highly applicable to a wide range of property graph use cases, offering substantial performance improvements in query optimization.

The contributions of this paper are threefold:

•

We propose a view creation language that conforms to the GQL standard syntax, facilitating easier adoption of our solution into existing graph query environments.
•

We introduce an efficient templated view maintenance approach for property graphs with variable-length edges.
•

We demonstrate the practical effectiveness of the Mv4pg through a prototype implementation on TuGraph [15] (with code available on github [16]), and comprehensive testing on both TuGraph and Neo4j [17]. The results show that views significantly reduce the execution time of the overall workload, achieving a maximum speedup of 28.71×, with the highest speedup for a single statement approaching 100×. Furthermore, the optimization brought by views far exceeds the additional maintenance cost associated with write statements.

TABLE I: Comparison of View Implementation Features in Graph Databases

	IVM4PGQ [18]	KASKADE [11]	MVS&QP [19]	IS4VOPG [12]	Mv4pg
Inclusion of variable-length edges	Y	Y	Y	N	Y
View selection	N	A	A	M	M
View maintenance	Y	N	N	Y	Y
Query optimization	N	Y	Y	Y	Y
Multi-DBMS implementations	N	N	N	Y	Y
Experimental evaluation	N	Y	Y	Y	Y

Note: ‘Y’ indicates that the feature is available, while ‘N’ means the feature is not available or not mentioned. In the view selection row, ‘A’ and ‘M’ represent automatic and manual, respectively.

II Preliminaries and Related Work

II-A Preliminaries

Property Graph Model. The property graph model, proposed by Rodriguez and Neubauer in 2010 [20], utilizes nodes to represent entities, edges to denote relationships, and both nodes and edges can have their associated properties.

We adopt the definition of the property graph database model given by Angles [21], and the schema is supported by Neo4j and TuGraph. Assume that L, P, V represent the sets of labels, property names, and property values, respectively.

Definition 1 (Property Graph ( $\mathcal{PG}$ )).

A property graph is defined as a tuple G = (N, E, $\rho$ , $\lambda$ , $\pi$ ) where:

•

$N$ is a finite set of nodes;
•

$E$ is a finite set of edges such that $N\cap E=\emptyset$ ;
•

$\rho:E\rightarrow(N\times N)$ is a total function mapping edges to ordered pairs of nodes;
•

$\lambda:(N\cup E)\rightarrow$ $\mathcal{P}(\textbf{L})$ is a total function mapping nodes and edges to sets of labels (including the empty set), and $\mathcal{P}(\textbf{L})$ represents the power set of L;
•

$\pi:(N\cup E)\times$ P $\rightarrow$ $\mathcal{P}(\textbf{V})$ is a partial function mapping properties of nodes/edges to sets of values.

For $\lambda$ , the set of labels associated with a single node or a single edge in this paper always contains exactly one label.

Query Language. The most popular query language for property graphs is Cypher [22]. Originating from Neo4j, it has been adopted by other vendors, including Amazon Neptune, Agens Graph, AnzoGraph, Katana Graph, Memgraph, RedisGraph, and SAP HANA [23]. To find all Comments posted under each Post, the Cypher query can be written as follows:

⬇

MATCH (n:Comment)-[r:replyOf*..]->(m:Post)

RETURN n,m

Here the MATCH clause corresponds to the pattern graph of Figure 1(a).

II-B Related Work

Materialized Views in RDBMS. Materialized views have been studied in relational databases for several decades, and there is a great deal of work on them. [2, 3, 4, 5, 6] discusses the efficient maintenance of materialized views, including different kinds of views such as distributed and non-distributed aggregation functions, views containing join operators, etc. [7, 8, 9, 10] uses materialized views to reduce hits to the original database and thus optimize query efficiency, [7] discusses the timing of view optimization, suggesting that blindly using views may instead be more ineffective, [10] proposes a very efficient view-matching algorithm that can quickly find the right view to optimize query.

Materialized Views in Property Graph DBMS. Table I shows related work on the materialized view of graph databases in recent years. [18] first proposed incremental view maintenance in property graphs, converting graph query statements into the form of a relational algebra, using incremental maintenance algorithms in relational databases, but is more limited to theory and does not give evaluation results. [11] automatically selects the required views for the workload and optimizes the query for rewriting using the views, focusing on the more specific queries in graph databases that contain variable-length edges, but not discussing the view maintenance part. [19], like [11], focuses only on view selection and query optimization on static graphs, and does not focus on the incremental maintenance component. [12] discusses materialized views in property graphs more systematically, including incremental maintenance and query optimization using views, and also uses a series of transformation rules to implement materialized views in both RDBMS and GDBMS, but does not take into account the more specific queries in graph databases that contain variable-length edges. In contrast, Mv4pg discusses both materialized view maintenance and query optimization using views, and gives a corresponding incremental maintenance approach for queries containing variable-length edges in graph databases, and the final experimental results in both Neo4j and TuGraph prove that Mv4pg plays a great role in query rate improvement.

III Overview of Mv4pg

Figure 4 provides an overview of our Mv4pg system, designed to address the challenges of efficient view creation, maintenance, and query optimization in graph databases. For a given workload, we first manually extract common parts as views and materialize them in the graph database (Neo4j or TuGraph). Meanwhile, the View Maintenance Statement Generator creates View Maintenance Statement Templates based on the View Creation Statements. When modifications occur in the graph data as part of the workload, the modified information is injected into the template parameters, generating the View Maintenance Statements, which are then executed for database maintenance. For the queries in the workload, the Query Optimization Using Views module (see Section V for details) is applied to optimize them before execution in the graph database.

Table II provides a summary of the notations and their definitions used throughout this paper.

TABLE II: Notations

Notations	Definitions
$\mathcal{Q}$ , $\mathcal{Q}_{o}$	(optimized) pattern graph of a cypher query
$\mathcal{V}$	pattern graph of a view
$\mathbb{V}$	the set of $\mathcal{V}$
$\mathbb{V^{\prime}}$	sorted $\mathbb{V}$
$s_{vc}$	view creation statement
$\mathbb{S}_{vc}$	all view creation statements in the graph
$\mathcal{N}$	node information including labels and properties
$\mathcal{E}$	edge information including $\mathcal{N}$ of source node,
$\mathcal{E}$	$\mathcal{N}$ of end node, labels and properties of the edge itself
$\mathcal{MP}$	path in $\mathcal{Q}$ that matches the view

IV View Creation And Maintenance

IV-A View Creation

As shown in Figure 4, the public patterns of the query statements are manually extracted from the workload, and the corresponding view creation statements are given, which are directly inputted into the graph database system. At the same time, for each view creation statement, we will generate the corresponding view maintenance statement template, which is put into Section IV-B to talk about it in detail.

Since existing graph database systems do not support the materialized view, we use part of the GQL grammar to extend the Cypher grammar, the relevant core grammar is shown in Figure 5.

(View)	$v$	::=	CREATE VIEW VName AS
			‘(’ CONSTRUCT P MATCH P ‘)’;
(PatElem)	P	::=	( P ${}_{\textsf{N}}$ ( P ${}_{\textsf{S}}$ )* ) ;
(PatSeg)	P ${}_{\textsf{S}}$	::=	P ${}_{\textsf{R}}$ P ${}_{\textsf{N}}$ ;
(NodePat)	P ${}_{\textsf{N}}$	::=	‘(’ ( Var )? ( NodeLabels )? ( Prop )? ‘)’ ;
(RelPat)	P ${}_{\textsf{R}}$	::=	( ‘<-’ R ? ‘->’ )
			$\mid$ ( ‘<-’ R ? ‘-’ )
			$\mid$ ( ‘-’ R ? ‘->’ )
			$\mid$ ( ‘-’ R ? ‘-’ )
			;
(RelInfo)	R	::=	‘[’ ( Var )? ( RelTypes )? Range?
			( Prop )? ‘]’ ;

Figure 5: Syntax of View Creation

The main part of our view is the pattern element (PatElem) of Match clause that represents a path from the start to the end. The PatElem of the Construct clause has only two nodes and one view edge, the two nodes are the source and the end of the PatElem of the Match clause. The role of the view is to compress a path that may contain variable-length edges into a single view edge, speeding up the query rate.

Example. A concrete example of the view definition is as follows:

⬇

CREATE VIEW ROOT_POST AS (

CONSTRUCT (c)-[r:ROOT_POST]->(p)

MATCH (c:Comment)-[:replyOf*..]->(p:Post)

)

This statement creates a view called ROOT_POST, which identifies all (c,p) pairs where a node c of type Comment connects via a variable-length edge to a node p of type Post. For each such pair, an edge of type ROOT_POST is created with c pointing to p. The replyOf*.. notation, specifically replyOf*1.. $\infty$ , indicates that the edge has a variable length, with a minimum length of 1 and no upper limit ( $\infty$ ).

IV-B View Maintenance

After the view is created, any write operations such as creating, deleting, or property updating in a graph database require updating the view to maintain consistency. As shown in Figure 6, we predefine maintenance statement templates for views. When the graph database is modified, these templates are iterated over, with the relevant parameters replaced, and the resulting statements are executed in the graph database system.

Since the views in this paper do not involve properties, we focus solely on creation and deletion operations. Specifically, we handle four cases: creating a node, deleting a node, creating an edge, and deleting an edge. Next, we will explain how to generate maintenance statement templates for these scenarios.

Create a node. Since simply creating a node, which is not connected to any other node, does not affect the path-based view described in this paper, it does not need to be processed.

Delete a node. When deleting a node in the graph database, we need to delete all the view edges associated with that node, as shown in the algorithm 1.

0: Deleted Node Information

\mathcal{N}

, View Creation Statements

\mathbb{S}_{vc}

0: View Maintenance Template map

\mathcal{M}_{\textsf{VMT}}

1: Initialize

\mathcal{M}_{\textsf{VMT}}

as an empty map

2: for

s_{vc}

\mathbb{S}_{vc}

3: Initialize

\mathcal{S}_{\textsf{VMT}}

as an empty set

4: for

node

s_{vc}

.Match.nodes do

\mathcal{S}_{\textsf{VMT}}

.insert(ReplNode(

node

\mathcal{N}

))

6: end for

7: for

e

s_{vc}

.Match.VLEdges do

n\leftarrow e

.minLength

m\leftarrow e

.maxLength

10: if

m==\infty

then

11: for

i=1

to max

(n-1,1)

12: if

i<

max

(n-1,1)

then

13:

s_{vm}

\leftarrow

ReplNodeInVlen(

(i,i),(n-i,\infty),e

\mathcal{N}

)

14:

\mathcal{S}_{\textsf{VMT}}

.insert(

s_{vm}

)

15: else

16:

s_{vm}

\leftarrow

ReplNodeInVlen((

i

\infty

), (1,

\infty

), e,

\mathcal{N}

)

17:

\mathcal{S}_{\textsf{VMT}}

.insert(

s_{vm}

)

18: end if

19: end for

20: else

21: for

i=1

m-1

122:

s_{vm}

\leftarrow

ReplNodeInVlen(

(

i,i

), (max(

n-i

,1),

m-i

e

\mathcal{N}

)

23:

\mathcal{S}_{\textsf{VMT}}

.insert(

s_{vm}

)

24: end for

25: end if

26: end for

27:

\mathcal{M}_{\textsf{VMT}}

[

s_{vc}

.ViewName]

\leftarrow

\mathcal{S}_{\textsf{VMT}}

28: end for

ALGORITHM 1 View Maintenance Template for Deleting a Node

In the path of a view, nodes may appear in two places, one is explicit nodes and the other is implicitly in variable-length edges. Lines 4-6 of the algorithm 1 iterates through all the explicit nodes and calls ReplNode to generate a maintenance statement, which instantiates the node to the deleted node $\mathcal{N}$ and leaves the rest of the view unchanged, which finds all view edges when $\mathcal{N}$ is at the certain position when all view edges are found and deleted; Lines 7-26 deal with the case in variable-length edges, setting n,m as the minimum and maximum hops of variable-length edges, respectively. We need to traverse all possible locations of the $\mathcal{N}$ , so we traverse from the smallest to the largest distance of $\mathcal{N}$ from the starting node of the variable-length edge, and we don’t need to consider the case where the distance from the start node or the end node is 0 here, because a distance of 0 is an overlap with the start node or the end node, which is the case in lines 4-6 of the algorithm 1.

Because the maximum hop may equal $\infty$ , we discuss in two cases, if the maximum hop is infinite, the only limit is the number of hops is greater than or equal to n. When the distance between $\mathcal{N}$ and the start node of e is $i<n-1$ , the distance between $\mathcal{N}$ and the end node of e must be $>=n-i$ , such as lines 13-14; when the distance between $\mathcal{N}$ and the start node of e is $i>=n-1$ , then the distance between $\mathcal{N}$ and the end node of e can be $>=1$ , so this case can be merged into one statement, such as lines 16-17, there will be a total of $n-2+1=n-1$ statements. Of course, if $n<2$ , there will also be 1 statement.

If the maximum hop is not infinite, then traverse the distance from $\mathcal{N}$ to the start node from $1$ to $m-1$ , generating a total of m-1 statements, as shown in lines 21-23. When the distance from $\mathcal{N}$ to the start node of e is i, the distance from $\mathcal{N}$ to the end node of e will be $max(n-i,1)$ at minimum, and the maximum will be $m-i$ .

Here the ReplNodeInVlen function expands the variable-length edge e into three parts: edge-node-edge. The first parameter $(i,i)$ of ReplNodeInVlen, i.e., the minimum and maximum hops of the first edge, denotes the distance between the $\mathcal{N}$ and the start node of e, the second parameter is the minimum and maximum hops of the second edge, which denotes the distance between the $\mathcal{N}$ and the end node of e, and the intermediate node is instantiated with the $\mathcal{N}$ .

Create an edge or Delete an edge. When creating or deleting an edge, the overall idea is similar to deleting a node, creating or deleting all view edges associated with that edge, as shown in algorithm 2.

0: Modified Edge Information

\mathcal{E}

, View Creation Statements

\mathbb{S}_{vc}

, isCreate

0: View Maintenance Template map

\mathcal{M}_{\textsf{VMT}}

1: Initialize

\mathcal{M}_{\textsf{VMT}}

as an empty map

2: for

s_{vc}

\mathbb{S}_{vc}

3: Initialize

\mathcal{S}_{\textsf{VMT}}

as an empty set

4: for e in

s_{vc}

.Match.FLEdges do

\mathcal{S}_{\textsf{VMT}}

.insert(ReplEdge(e,

\mathcal{E}

))

6: end for

7: for e in

s_{vc}

.Match.VLEdges do

n\leftarrow e

.minLength

m\leftarrow e

.maxLength

10: if

m==\infty

then

11: for

i=0

to max

(n-1,0)

12: if

i<

max

(n-1,0)

then

113:

s_{vm}

\leftarrow

ReplEdgeInVlen(

(i,i),(n-1-i,\infty),e

\mathcal{E}

, isCreate)

14:

\mathcal{S}_{\textsf{VMT}}

.insert(

s_{vm}

)

15: else

216:

s_{vm}

\leftarrow

ReplEdgeInVlen(

(i,

\infty

), (0,

\infty

e

\mathcal{E}

, isCreate)

17:

\mathcal{S}_{\textsf{VMT}}

.insert(

s_{vm}

)

18: end if

19: end for

20: else

21: for

i=0

m-1

322:

s_{vm}

\leftarrow

ReplEdgeInVlen(

4 (

i,i

), (max

(n-1-i,0),m-1-i

e

\mathcal{E}

, isCreate)

23:

\mathcal{S}_{\textsf{VMT}}

.insert(

s_{vm}

)

24: end for

25: end if

26: end for

27:

\mathcal{M}_{\textsf{VMT}}

[

s_{vc}

.ViewName]

\leftarrow

\mathcal{S}_{\textsf{VMT}}

28: end for

ALGORITHM 2 View Maintenance Template for Creating or Deleting an Edge

The maintenance statement for creating or deleting an edge is similar to that for deleting a node; first, all explicit edges are found and instantiated, and the replacement will instantiate the start node, the end node, and the edge itself all in $\mathcal{E}$ . Then deal with the case in the variable-length edge e, also divided into two cases according to whether the maximum hop is infinite or not like algorithm 1, but changing $i$ to the distance from the starting node of e to the starting node in $\mathcal{E}$ , the specific process will not be repeated. The first parameter of ReplEdgeInVlen is the minimum and maximum number of hops for variable-length edges between the start node in $\mathcal{E}$ and the start node of e. The second parameter is the minimum and maximum number of hops for variable-length edges between the end node in $\mathcal{E}$ and the end node of e. The final addition of an extra isCreate compared to algorithm 1 to indicate whether to create an edge or to delete an edge does not affect the core of the maintenance statement, which will be explained in the subsequent examples.

Example for View Maintenance.

Listing 1: View Example for Maintenance

⬇

CREATE VIEW INDIRECT_KNOW AS (

CONSTRUCT (s)-[r:INDIRECT_KNOW]->(d)

MATCH (s:Person)-[:knows*3..]->(d:Person)

)

For the view created by LABEL:lst:view_maintenance_example, we generate templates for three view maintenance sets based on algorithm 1 and algorithm 2, representing deleting a node, creating an edge, and deleting an edge, respectively.

Listing 2: Maintenance template for Deleting a node

⬇

MATCH (s:Person:$L{$K:$V})-[:knows*3..]->(d:Person)

WITH s,d MATCH (s)-[r:INDIRECT_KNOW NoDupEdge]->(d) DELETE r

MATCH (s:Person)-[:knows*3..]->(d:Person:$L{$K:$V})

WITH s,d MATCH (s)-[r:INDIRECT_KNOW NoDupEdge]->(d) DELETE r

MATCH (s:Person)-[:knows*1]->(:$L{$K:$V})-[:knows*2..]->(d:Person)

WITH s,d MATCH (s)-[r:INDIRECT_KNOW NoDupEdge]->(d) DELETE r

MATCH (s:Person)-[:knows*2..]->(:$L{$K:$V})-[:knows*1..]->(d:Person)

WITH s,d MATCH (s)-[r:INDIRECT_KNOW NoDupEdge ]->(d) DELETE r

LABEL:lst:view_maintenance_node demonstrates the maintenance template for the view shown in LABEL:lst:view_maintenance_example when deleting a node with a total of four statements, i.e., $|$ $\mathcal{S}_{\textsf{VMT}}$ $|=4$ . In algorithm 1, the first two statements are generated by lines 4-6. The last two statements are for the case where the deleted node is in the variable-length edge, and since the maximum number of hops is infinite, they correspond to lines 11-19 in algorithm 1. In the example, $n=3,m=\infty$ , $i=1\ to\ 2$ . When $i$ is $1$ , corresponding to lines 12-14 of LABEL:lst:view_maintenance_node, the third statement of LABEL:lst:view_maintenance_node is generated, i.e., the deleted node is greater than or equal to $2$ from the end node when the distance from the start node is $1$ ; When $i$ is $2$ , corresponding to lines 15-18 of algorithm 1, when the deleted node is at a distance greater than or equal to $2$ from the start node, it is all at a distance greater than or equal to $1$ from the end node, so they can be combined as a single statement, i.e., the last statement of LABEL:lst:view_maintenance_node.

It can be noted that each statement has an identical clause headed by the WITH keyword, which is the part that removes the view edges from each pair of views found for their start and end nodes. One thing to note here is that for each pair of (s,d) found, if there are multiple view edges in s and d, not all of them are deleted, but only one of them is deleted, because each view edge actually corresponds to one of the cases in which all the nodes and edges are instantiated in the $\mathcal{V}$ , and also corresponds to each pair of s,d that we found here.

This is correct for a single statement, since a Match statement such as

MATCH (s:Person:$L{$K:$V})-[:knows*3..]->(d:Person),
the result of each graph instance matched by the Match statement corresponds to a different graph instance matched in the view creation statement, so for each pair of s,d matched, deleting or creating one of their edges is consistent with the database.

But if there are multiple statements, different maintenance statements may match to the same view, meaning that the deleted node may appear in two locations at the same time making the deleted view duplicated. For example, if Person{id:1} is deleted, the first statement of LABEL:lst:view_maintenance_node may match this case:

MATCH (s:Person{id:1})-[:knows]->(:Person{id:2})-[:knows]->(d:Person{id:1}).
However, the second statement will match the same case, which could result in the removal of two view edges (if any) of s->d when deleting, but only one view edge should be removed for the same instance. So we need to save the matched graph instances, i.e., the actual graph data corresponding to each node and each edge in the pattern graph of the view $\mathcal{V}$ , to make sure that only one view edge is deleted or created for the same graph instance.

Listing 3: Maintenance template for Creating an edge

⬇

MATCH (s:Person)-[:knows*0]->(:$SL{$SK:$SV})-[@R:knows]->(:$DL{$DK:$DV})-[*2..]->(d:Person) WHERE id(@R)=$RID

WITH s,d CREATE (s)-[r:INDIRECT_KNOW]->(d)

MATCH (s:Person)-[:knows*1]->(:$SL{$SK:$SV})-[@R:knows]->(:$DL{$DK:$DV})-[*1..]->(d:Person) WHERE id(@R)=$RID

WITH s,d CREATE (s)-[r:INDIRECT_KNOW]->(d)

MATCH (s:Person)-[:knows*2..]->(:$SL{$SK:$SV})-[@R:knows]->(:$DL{$DK:$DV})-[*0..]->(d:Person) WHERE id(@R)=$RID

WITH s,d CREATE (s)-[r:INDIRECT_KNOW]->(d)

LABEL:lst:view_maintenance_edge demonstrates the maintenance template for the view shown in LABEL:lst:view_maintenance_example when creating an edge, by Union combining the three statements into one, i.e., $|$ $\mathcal{S}_{\textsf{VMT}}$ $|=3$ . All statements are for the case where the created edge is in the variable-length edge, generated by lines 11-19 of algorithm 2. Deleting an edge is largely similar to creating an edge, except that the statement following the WITH keyword is different; when deleting an edge, the portion following the WITH keyword is the same as the maintenance template for deleting a node. Another point of detail is that edges don’t necessarily have a primary key, so we use the internal id of the graph database system to uniquely lock the edge.

Correctness. When creating or deleting data, the purpose of the maintenance statement is to ensure the consistency of the view and the graph database, i.e., the view after maintenance and the view after deleting the view to re-execute the create view statement need to be identical. The Match part of our generated maintenance statement has matched all possible graph instances affected by the changed data, as long as we ensure that for each graph instance to delete/create and only delete/create a view can ensure the consistency of the view. As previously discussed, we need to save each graph instance and just skip it the next time we match to the same graph instance, which will serve as a future work.

Complexity Discussion. Assuming the number of edges that the view maintenance statement will delete or create is $N$ , the time complexity of maintenance is often $O(N)$ . Although it is possible that the maintenance statement may traverse illegal nodes or edges when matching feasible paths, in fact, the more illegal nodes or edges traversed in the maintenance statement, the more traversal can be reduced by the query optimization in Section V, and the better the optimization effect of the view.

In practice, in most cases, the $N$ involved in a single node or a single edge is not large, so the maintenance time required for small-scale updates under incremental maintenance will not increase significantly, which can be verified in Section VI.

V Query Optimization Using Views

The ultimate goal of building materialized views is to optimize query efficiency. Figure 7 shows the basic flow of Query Optimization Using Views, divided into two output routes, one route is executed in TuGraph for the main part in the middle, and the other route first relies on TuGraph to directly output the optimized query statement, which is then executed in Neo4j.

0: pattern graph

\mathcal{Q}

, the set of all view pattern graphs

\mathbb{V}

0: optimized pattern graph

\mathcal{Q}_{o}

\mathbb{V^{\prime}}

\leftarrow

SortByOptEff(

\mathbb{V}

)

2: for

\mathcal{V}

\mathbb{V^{\prime}}

3: Init()

4: while MatchView(

\mathcal{Q}

\mathcal{V}

, matchResult) do

5: ChangePG(

\mathcal{Q}

\mathcal{V}

, matchResult)

6: Reset()

7: end while

8: end for

\mathcal{Q}_{o}

\leftarrow

\mathcal{Q}

ALGORITHM 3 View-Based Optimizer for Pattern Graph

At the heart of Figure 7 is the View-Based Optimizer, we use algorithm 3 to describe its exact process. The algorithm first sorts all the views through the SortByOptEff function, then iterates through each view according to the order, continuously matches the view with the query graph and changes the query graph according to the matchResult(The Init and Reset functions empty the matchResult and other global variables), and finally outputs the optimized query graph $\mathcal{Q}_{o}$ .

V-A Sorting by Optimization Effect

Different view traversal orders lead to different optimization results. For example, assuming that $\mathbb{V}$ have two views in Figure 9 and Figure 9, for the query pattern graph in Figure 12, if View1 is used first to optimize, then the optimized result is shown in Figure 12, which only applies View1; if View2 is used first to optimize, then the optimized result is shown in Figure 12, which only applies View2.

So the SortByOptEff function sorts the collection of views from largest to smallest according to the optimization effect of the views. Assume $\textsf{DBHit}_{\textsf{}}$ represents the total number of accesses made by all operators to the storage engine to retrieve necessary data during the execution plan, e.g., accessing a node or an edge in the database is counted as one time. The subscript noV and V stand for “no Views” and “with Views”, respectively. The currently adopted practice is to consider only the optimization effect of the view itself, i.e., $\textsf{DBHit}_{\textsf{noV}}$ - $\textsf{DBHit}_{\textsf{V}}$ . And $\textsf{DBHit}_{\textsf{V}}$ = $|N_{\$SL}|+2*|E_{\$VL}|$ ( $\$SL$ and $\$VL$ are the labels for the start node of view and view edge, respectively, $|N_{\$SL}|$ is the number of nodes in the graph database labeled $\$SL$ , $|E_{\$VL}|$ is the number of edges in the graph database labeled $\$VL$ ), i.e., the cost is to traverse the starting node, then traversing all view edges adjacent to the starting node and obtaining the corresponding end node. So the final optimization effect formula is shown in Equation 1.

	ViewOptEff	$\displaystyle=\textsf{DBHit}_{\textsf{noV}}-\textsf{DBHit}_{\textsf{V}}$		(1)
		$\displaystyle=\textsf{DBHit}-(\|N_{\$SL}\|+2*\|E_{\$VL}\|)$		(1)

In this example, the optimization effect of View1 is the difference between Figure 12 and Figure 12, where Figure 12 goes from a node of label Comment through two hops to a node of label Tag, which is the original query cost of View1, while Figure 12 goes from a node of label Comment through View1 directly to a node of label Tag, where the query cost becomes the number of View1, and the two differences are the effect of optimization. The $|N_{\$SL}|$ , $|E_{\$VL}|$ and DBHit will have an initial value when the view is created. Both $|N_{\$SL}|$ and $|E_{\$VL}|$ can be easily maintained by changing them when the corresponding nodes or edges are created or deleted. But the DBHit is more difficult to maintain, We use an estimation approach for maintenance, specifically we will save the value of initial $\textsf{DBHit}/(|initialN_{\$SL}|+2*|initialE_{\$VL}|)$ as the $optRate$ , assuming that the ratio $optRate$ constant, then the value of DBHit is shown in Equation 2.

\textsf{DBHit}=(|N_{\$SL}|+2*|E_{\$VL}|)*optRate

(2)

\mathcal{Q}

\mathcal{V}

, matchResult =

\emptyset

0: matchResult

\mathcal{Q}_{o}

1: if matchResult.size ==

\mathcal{V}

.nodes.size then

2: return true

3: end if

4: if nowVN == null then

5: tempVN

\leftarrow

nowVN

6: nextVN

\leftarrow

\mathcal{V}

.startNode

7: for n in

\mathcal{Q}

.nodes do

8: if NodeCanMatch(nextVN, n) then

9: matchResult.add(nextVN, n)

10: nowVN

\leftarrow

nextVN

11: if MatchView(

\mathcal{Q}

\mathcal{V}

, matchResult) then

12: return true

13: end if

14: nowVN

\leftarrow

tempVN

15: matchResult.delete(nextVN)

16: end if

17: end for

18: else

19: tempVN

\leftarrow

nowVN

20: nextVN, nextViewRelp

\leftarrow

GetNext(nowVN)[0]

21: for n, r in GetNext(matchResult[nowVN]) do

22: if NodeCanMatch(nextVN, n) and RelpCanMatch(nextViewRelp, r) then

23: matchResult.add(nextVN, n)

24: nowVN

\leftarrow

nextVN

25: if MatchView(

\mathcal{Q}

\mathcal{V}

, matchResult) then

26: return true

27: end if

28: nowVN

\leftarrow

tempVN

29: matchResult.delete(nextVN)

30: end if

31: end for

32: end if

33:

34: return false

ALGORITHM 4 Matching View Algorithm

V-B Match View

The algorithm of MatchView is shown in algorithm 4, the general idea is similar to the VF2 algorithm [24], which is to do matching between $\mathcal{V}$ and $\mathcal{Q}$ from the start node of $\mathcal{V}$ . The matchResult stores the matching result, nowVN is a global variable, which indicates the latest view node that has been matched so far. When matchResult.size and $\mathcal{V}$ .nodes.size are equal, then the match is complete and returns true; conversely, if there is no feasible match, false is returned in the last line. Specifically, lines 4-17 iterate over all the nodes in $\mathcal{Q}$ to match the $\mathcal{V}$ .startNode. If the NodeCanMatch function is true, the match result is saved and the MatchView function is called recursively, if the recursively called MatchView function finds a feasible match, then it returns true, otherwise, it backs up nowVN to the state before the match and deletes the match result in matchResult; Lines 18-32 are executed after matching the $\mathcal{V}$ .startNode with $\mathcal{Q}$ , and firstly call GetNext(nowVN) function to get the unmatched neighbors of nowVN. Since $\mathcal{V}$ is a path, the set size of GetNext(nowVN) is 1. Then get the set of unmatched neighbor nodes and neighbor edges of matchResult[nowVN](matchResult[nowVN] is the node in $\mathcal{Q}$ that matches with nowVN), iterate through the set, and call NodeCanMatch and RelpCanMatch functions to determine whether it can match, if it can be matched, it is handled the same as the previous lines 9-15.

When the MatchView function finds a feasible match, the ChangePG function changes $\mathcal{Q}$ based on the match. For path $\mathcal{MP}$ in $\mathcal{Q}$ that matches the view, ChangePG replaces the path between the start and end nodes in $\mathcal{MP}$ with a view edge.

Because ChangePG replaces all nodes and edges in $\mathcal{MP}$ other than the start and end nodes with the view edge, if there are other references to these replaced nodes or edges in $\mathcal{Q}$ (e.g., in other clauses such as WHERE, WITH, RETURN, etc.), it will affect the execution of the query statement, so an attribute isReferenced of whether or not the nodes and edges are referenced is added to each node and edge at the parser stage. And in the NodeCanMatch(nextVN, n) function, we need to first determine whether the label is the same, if nextVN is not the start or end node of $\mathcal{V}$ , then the isRenferenced attribute of n needs to be false, and the node of the degree of outgoing and incoming degrees add up to 2, that is, n will have no neighbors outside $\mathcal{MP}$ ; In RelpCanMatch, we also need to make sure that the labels are the same, that there are no other references, and that the direction, min-hop and max-hop are the same.

V-C Analysis

Termination. Let the number of node in the graph $G$ be $G_{N}$ and the number of edges be $G_{E}$ . Every time MatchView matches and calls ChangePG to change $\mathcal{Q}$ , at least one edge becomes a view edge, so after at most $|\mathcal{Q}_{E}|$ matches, all the edges will become view edges, i.e., no more matches can be made, and algorithm 3 must terminate.

Complexity Analysis. SortByOptEff costs $|\mathbb{VQ}|*log(|\mathbb{VQ}|)$ .

The worst case of MatchView iterates through all possible matches, matching node in $\mathcal{V}$ with every node in $\mathcal{Q}$ , i.e., $|\mathcal{Q}_{N}|^{|\mathcal{VQ}_{N}|}$ , as mentioned before for each view MatchView can match successfully up to $|\mathcal{Q}_{E}|$ times, so time complexity is $O(\displaystyle\sum_{\mathcal{VQ}\ in\ \mathbb{VQ}}|\mathcal{Q}_{E}|*|\mathcal{Q}_{N}|^{|\mathcal{VQ}_{N}|})$ .

ChangePG replaces the matched paths in the query graph with view edges, i.e., remove the paths first and then insert the view edges, the complexity is $O(|\mathcal{VQ}_{N}|)$ for each call. The number of calls is the same as the number of successful MatchView matches, the total complexity is $\displaystyle O(\sum_{\mathcal{VQ}\ in\ \mathbb{VQ}}|\mathcal{Q}_{E}|*|\mathcal{VQ}_{N}|)$ , compared to the MatchView can be ignored.

In summary, the total time complexity is $\displaystyle O(|\mathbb{VQ}|*log(|\mathbb{VQ}|)+\sum_{\mathcal{VQ}\ in\ \mathbb{VQ}}|\mathcal{Q}_{E}|*|\mathcal{Q}_{N}|^{|\mathcal{VQ}_{N}|})$ . The graph schema under the property graph model is relatively simple, the number of views will not be very large, and the number of nodes and edges in $\mathcal{Q}$ and $\mathcal{V}$ is also very small, and the estimation of MatchView is in the worst-case scenario, in fact, due to the presence of filtering operations such as labeling for nodes, edges, etc., the real traversal number is much less than the worst-case scenario, so the time taken to optimize the query graph using the view is completely negligible compared to the actual query time, which can be verified in the Section VI.

VI Evaluation

Since view creation and view maintenance are costly, we need to evaluate the overall optimization effect after using views to prove the positive effect of MV4PG constructed in this paper. Meanwhile, to demonstrate the scalability of the approach in this paper, we have experimented on both TuGraph and Neo4j.

VI-A Experimental Setup

Environment. We conducted experiments on a server with Ubuntu 22.04.2, two 24-core Intel Xeon Gold 6248R Processors, and 256 GB RAM. The graph databases we evaluated are Tugraph v4.1.0 and Neo4j v4.4.2.

Dataset. We use the following two data sets for evaluation:

•

SNB: the social network benchmark provided by LDBC [25], which simulates real online social networking scenarios, is a better way to evaluate the performance of a graph database in real-world scenarios. The SNB dataset used in this paper contains 3,181,724 nodes and 17,256,038 edges.
•

Finbench: LDBC FinBench(Financial Benchmark) [26] intends to define a benchmark characterized by special data and query patterns in financial industry to test graph database systems to make the evaluation of graph databases representative, reliable and comparable, especially in financial scenarios. The Finbench dataset used in this paper contains 5,376,981 nodes and 26,633,151 edges.

Workload. We designed a small workload for each dataset to test the effect of views, each workload has 10 test statements, including 7 read statements and 3 write statements, the write statements include the creation of edges, the deletion of edges, and the deletion of nodes in three cases. For the specific content of the workload, see [16]. To ensure accuracy and consistency in our experiments, each query was run five times consecutively, and the average value was taken as the execution time. For write statements, we execute a recover statement after each execution to restore the database to its initial state. For example, if Q8 creates an edge, then the recover statement will delete the edge.

VI-B Evaluation Results

View Creation. We created three views for each dataset based on the statements in the workload, and the time spent creating the views is shown in Table III.

TABLE III: View Creation Time

View Creation Time(s)	TuGraph	Neo4j
ROOT_POST(SNB)	20.79	35.68
COMMENT_PLACE(SNB)	58.61	19.93
PERSON_PLACE(SNB)	154.48	12.00
ACCOUNT_LOAN(Finbench)	406.15	185.71
PERSON_COMPANY(Finbench)	46.85	16.45
COMPANY_LOAN(Finbench)	25.51	25.70

Workload Results.

•

SNB: Figure 14 shows the results of SNB’s workload tested in TuGraph and Figure 14 shows the results of SNB’s workload tested in Neo4j where ORI_QUERY and OPT_QUERY represent the execution time without view optimization and with view optimization, respectively, and OPT_VPG represents the time required by the algorithm 3 to change the pattern graph. And it can be seen that the OPT_VPG is very short and negligible compared with OPT_QUERY. The OPT_VPG in Neo4j is longer than in TuGraph because, when executing in Neo4j, the TuGraph interface is called to get the optimized statement (as shown in Figure 7), which introduces additional transmission time.

Table IV summarizes the speedup ratios of SNB’s workload in TuGraph and Neo4j, where CE, DE, and DV denote creating edge, deleting edge, and deleting node, respectively. Table V counts the speedup ratios of the whole Workload, where $W_{ori}$ denotes the execution time of the unoptimized workload, $W_{opt}$ denotes the execution time of the optimized workload, and $MV$ denotes the total time for the creation of all materialized views.

Figure 13: SNB’s workload results in TuGraph

Figure 14: SNB’s workload results in Neo4j

TABLE IV: Speed-up ratio of SNB

TuGraph Neo4j

Q1 2.27 1.87

Q2 2.20 3.80

Q3 45.00 4.41

Q4 2.36 2.97

Q5 2.81 1.06

Q6 60.99 1.35

Q7 96.91 1.27

Q8(CE) 0.91 0.99

Q9(DE) 0.87 0.40

Q10(DV) 0.68 0.30

TABLE V: workload speed-up ratio of SNB

TuGraph Neo4j

$W_{ori}/W_{opt}$ 28.71 1.64

$W_{ori}/(MV+W_{opt}$ ) 16.19 1.32

Figure 15: Finbench’s workload results in TuGraph

Figure 16: Finbench’s workload results in Neo4j

TABLE VI: View Opt of Finbench

TuGraph Neo4j

Q1 1.94 1.71

Q2 2.15 3.30

Q3 4.32 4.42

Q4 5.18 1.62

Q5 3.54 4.49

Q6 3.49 8.94

Q7 12.68 4.34

Q8(CE) 0.91 0.91

Q9(DE) 0.60 0.34

Q10(DV) 0.68 0.29

TABLE VII: workload speed-up ratio of Finbench

TuGraph Neo4j

$W_{ori}/W_{opt}$ 9.49 2.47

$W_{ori}/(MV+W_{opt})$ 4.43 1.19
•

Finbench: Figure 16 shows the results of Finbench’s workload tested in TuGraph and Figure 16 shows the results of Finbench’s workload tested in Neo4j. Table VI summarizes the speedup ratios of Finbench’s workload in TuGraph and Neo4j, and Table VII counts the speedup ratios of the whole Workload.

VI-C Analysis

Correctness Verification. Comparing the results returned by each read statement, we find that they are the same. Because the view creation statement in the evaluation guarantees that there will not be a single identical node appearing multiple times in the same graph instance as previously mentioned, after executing each write statement, we find that the number of views and the number of results found in the Match clause of the view creation statement are the same, i.e., they satisfy the data consistency, and after executing the corresponding recover operation, the number of views is also the same as the number before executing the write statement.

General Analysis of Performance. From Table IV and Table VI, we can see that each read statement has a relatively good acceleration ratio, and the best acceleration ratio is even close to 100x in TuGraph. For write statements, the acceleration ratios are all close to or over 30%, which is acceptable given the very short absolute time for small updates, especially when compared to the optimization benefits achieved for read queries.

From Table V and Table VII, we can see that the speedup ratio of the whole workload is also good, and the best speedup ratio in TuGraph is as high as 28.71x. There is also a good optimization effect after adding the time of view creation, that is to say the optimization effect of adding views is greater than the additional overhead that the views bring.

Performance Analysis for Reading Statements. It can be seen that the optimization of certain statements in TuGraph is particularly good, and the optimization in Neo4j is worse than TuGraph in general, which is because the Cost-Based Optimizer used in Neo4j chooses a better execution plan, resulting in a decrease in the view optimization effect(We filed an issue on the TuGraph repository [27] and got a positive response!). But in general, the optimization effect of views is obvious in both graph databases, from which we select a few statements to analyze the effect of view optimization. We define the following two metrics for in-depth analysis.

Definition 2.

Rows denotes the number of data rows passed between operators in the execution plan.

Metric Rows indicates the size of the intermediate results when the query is executed. A decrease in the number of intermediate results will result in less data being passed between multiple operators, thereby reducing processing time and improving query performance.

Definition 3.

DBHit denotes the number of times an operator accesses the storage engine while retrieving necessary data in the execution plan.

Metric DBHit as mentioned in indicates the number of times an operator needs to access the storage engine. The decrease of DBHit will lead to improved query performance.

Figure 17 shows the analysis results of the execution plan of Q3 in SNB on TuGraph. Q3 optimizes the two-hop path as n1->n2->n3 into a single edge from n1 to n3 using a view. It can be seen that the value of DBHit is significantly reduced, especially for the Expand node from n2 to n3, which originally had nearly 300 million DBHit. However, the total DBHit after view optimization is only over 4 million, which is only one-sixtieth of the original. The final speedup ratio also reached 45.00.

Figure 18 shows the analysis results of the execution plan of Q3 in SNB on Neo4j. Without view optimization, Neo4j chooses to traverse in two paths and finally merges the two paths at n2. The optimized execution plan is similar to TuGraph. Although this results in a small difference in the number of DBHit, both in the millions, the extra time consumed by NodeHashJoin still results in a speedup ratio of 4.41.

Performance Analysis for Writing Statements.

The efficiency of the write statements in Table IV and Table VI is relatively high, but they only create or delete a very small number of nodes or edges. To further illustrate the speed of view maintenance, we increased the number of edges deleted to $10^{0},10^{1},...,10^{4}$ , because most updates in dynamic graphs do not involve a very large amount of data; a range of $1$ to $10^{4}$ deleted edges can cover the vast majority of cases where the number of the total edges is on the order of tens of millions. We conducted tests in both TuGraph and Neo4j, and the speedup ratios are shown in Figure 19.

As we can see, the speedup ratio in TuGraph drops significantly when the number of edges reaches $10^{4}$ . Our analysis indicates that this is not an issue with our incremental maintenance algorithm, which is also validated by the results from Neo4j. Instead, it’s because TuGraph cannot locate a specific edge based on its edge ID. During maintenance, it must first obtain the starting node through $SK:$SV in LABEL:lst:view_maintenance_edge, then traverse all adjacent edges of the starting node, and use an Edge Filter to exclude edges that do not match the specified edge ID. When the starting node has a large number of adjacent edges, the extra traversal becomes substantial, leading to decreased efficiency. We also measured the time required by the Edge Filter during deletion in TuGraph and found that when deleting $10^{4}$ edges, the Edge Filter accounts for 93% of the total time. After removing the time taken by the Edge Filter, the speedup ratios all exceed 20%.

In Neo4j, the speedup ratio is relatively stable, remaining at 20% to 30% overall, and the DBHit and the number of deletions show a perfect linear relationship. This indicates that the overhead introduced by our maintenance algorithm is linear. In absolute time, the additional cost of view maintenance is much smaller than the efficiency gains of query optimization. However, for graph data that is updated extremely frequently, real-time incremental maintenance may have a negative impact, which will be addressed in future work.

VII Conclusion and Future Work

This paper implements materialized views on property graphs and proposes for the first time a method to maintain materialized views containing variable-length edges using maintenance statement templates. Then, experiments were conducted on both TuGraph and Neo4j, and both showed excellent optimization results, demonstrating the effectiveness of the proposed views.

However, there are still some areas for improvement in current Mv4pg. In the view creation part, we will try to automatically select and materialize the most optimized views based on the workload and the graph schema, rather than manually selecting them as we do now. In the view maintenance part, as mentioned in Section IV-B, it is necessary to handle the special case of having the same graph instance in different statements during maintenance. This involves saving the matched graph instances and not matching them if they are the same. Additionally, we will consider seeking a balance between performance and consistency to find a better maintenance strategy, such as delayed updates. In the part of query optimization using views, the current method for selecting the view optimization order in Section V may not accurately represent the optimization effect of the views. In the future, we will consider using a Cost-Based Optimizer and a greedy algorithm to provide the query cost after applying each remaining view at each step, selecting the execution plan with the lowest cost. All in all, graph database materialized views are still a very new field and the issues we mentioned above will be left for future work.

References

[1] W. Fan, “Big graphs: challenges and opportunities,” VLDB, vol. 15, no. 12, pp. 3782–3797, 2022.
[2] T. Palpanas, R. Sidle, R. Cochrane, and H. Pirahesh, “Incremental maintenance for non-distributive aggregate functions,” in VLDB’02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 2002, pp. 802–813.
[3] H. Gupta and I. S. Mumick, “Incremental maintenance of aggregate and outerjoin expressions,” Information Systems, vol. 31, no. 6, pp. 435–464, 2006.
[4] K. Yi, H. Yu, J. Yang, G. Xia, and Y. Chen, “Efficient maintenance of materialized top-k views,” in Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405), 2003, pp. 189–200.
[5] C. Koch, Y. Ahmad, O. Kennedy, M. Nikolic, A. Nötzli, D. Lupei, and A. Shaikhha, “Dbtoaster: higher-order delta processing for dynamic, frequently fresh views,” The VLDB Journal, vol. 23, pp. 253–278, 2014.
[6] P.-A. Larson and J. Zhou, “Efficient maintenance of materialized outer-join views,” in 2007 IEEE 23rd International Conference on Data Engineering. IEEE, 2006, pp. 56–65.
[7] S. Chaudhuri, R. Krishnamurthy, S. Potamianos, and K. Shim, “Optimizing queries with materialized views,” in Proceedings of the Eleventh International Conference on Data Engineering. IEEE, 1995, pp. 190–200.
[8] A. Y. Halevy, “Theory of answering queries using views,” ACM SIGMOD Record, vol. 29, no. 4, pp. 40–47, 2000.
[9] S. Flesca and S. Greco, “Rewriting queries using views,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 6, pp. 980–995, 2001.
[10] J. Goldstein and P.-Å. Larson, “Optimizing queries using materialized views: a practical, scalable solution,” ACM SIGMOD Record, vol. 30, no. 2, pp. 331–342, 2001.
[11] J. M. da Trindade, K. Karanasos, C. Curino, S. Madden, and J. Shun, “Kaskade: Graph views for efficient graph analytics,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE). IEEE, 2020, pp. 193–204.
[12] S. Han and Z. G. Ives, “Implementation strategies for views over property graphs,” Proceedings of the ACM on Management of Data, vol. 2, no. 3, pp. 1–26, 2024.
[13] W. Fan, X. Wang, and Y. Wu, “Answering pattern queries using views,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, pp. 326–341, 2016.
[14] “Information technology – Database languages – GQL,” International Organization for Standardization and International Electrotechnical Commission, Standard, 2024.
[15] TuGraph-family, “tugraph-db v4.1.0,” https://github.com/TuGraph-family/tugraph-db/releases/tag/v4.1.0, 2023.
[16] s4plus GraphDB, “Materialized Views for Property Graphs,” https://github.com/S4Plus/MV4PG, 2024.
[17] J. Guia, V. G. Soares, and J. Bernardino, “Graph databases: Neo4j analysis,” in ICEIS (1), 2017, pp. 351–356.
[18] G. Szárnyas, “Incremental view maintenance for property graph queries,” in Proceedings of the 2018 International Conference on Management of Data, 2018, pp. 1843–1845.
[19] Y. Pang, L. Zou, J. X. Yu, and L. Yang, “Materialized view selection & view-based query planning for regular path queries,” Proc. ACM Manag. Data, vol. 2, no. 3, May 2024. [Online]. Available: https://doi.org/10.1145/3654955
[20] M. A. Rodriguez and P. Neubauer, “Constructions from dots and lines,” arXiv preprint arXiv:1006.2361, 2010.
[21] R. Angles, “The property graph database model,” in 12th AMW, 2018.
[22] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, and A. Taylor, “Cypher: An evolving query language for property graphs,” in SIGMOD, 2018, pp. 1433–1445.
[23] openCypher, “Usage of cypher,” https://opencypher.org/projects/, 2023.
[24] L. P. Cordella, P. Foggia, C. Sansone, and M. Vento, “A (sub) graph isomorphism algorithm for matching large graphs,” IEEE transactions on pattern analysis and machine intelligence, vol. 26, no. 10, pp. 1367–1372, 2004.
[25] O. Erling, A. Averbuch, J. Larriba-Pey, H. Chafi, A. Gubichev, A. Prat, M.-D. Pham, and P. Boncz, “The ldbc social network benchmark: Interactive workload,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015, pp. 619–630.
[26] S. Qi, H. Lin, Z. Guo, G. Szárnyas, B. Tong, Y. Zhou, B. Yang, J. Zhang, Z. Wang, Y. Shen et al., “The ldbc financial benchmark,” arXiv preprint arXiv:2306.15975, 2023.
[27] s4plus GraphDB, “Generating Better Execution Plans,” https://github.com/TuGraph-family/tugraph-db/issues/733, 2024.

	TuGraph	Neo4j
Q1	2.27	1.87
Q2	2.20	3.80
Q3	45.00	4.41
Q4	2.36	2.97
Q5	2.81	1.06
Q6	60.99	1.35
Q7	96.91	1.27
Q8(CE)	0.91	0.99
Q9(DE)	0.87	0.40
Q10(DV)	0.68	0.30

	TuGraph	Neo4j
$W_{ori}/W_{opt}$	28.71	1.64
$W_{ori}/(MV+W_{opt}$ )	16.19	1.32

	TuGraph	Neo4j
Q1	1.94	1.71
Q2	2.15	3.30
Q3	4.32	4.42
Q4	5.18	1.62
Q5	3.54	4.49
Q6	3.49	8.94
Q7	12.68	4.34
Q8(CE)	0.91	0.91
Q9(DE)	0.60	0.34
Q10(DV)	0.68	0.29

	TuGraph	Neo4j
$W_{ori}/W_{opt}$	9.49	2.47
$W_{ori}/(MV+W_{opt})$	4.43	1.19

MV4PG: Materialized Views for Property Graphs

Abstract

Index Terms:

I Introduction

II Preliminaries and Related Work

II-A Preliminaries

Definition 1 (Property Graph (𝒫​𝒢\mathcal{PG})).

II-B Related Work

III Overview of Mv4pg

IV View Creation And Maintenance

IV-A View Creation

IV-B View Maintenance

V Query Optimization Using Views

V-A Sorting by Optimization Effect

V-B Match View

V-C Analysis

VI Evaluation

VI-A Experimental Setup

VI-B Evaluation Results

VI-C Analysis

Definition 2.

Definition 3.

VII Conclusion and Future Work

References

Definition 1 (Property Graph ( $\mathcal{PG}$ )).