Encoder-Decoder Architecture for Supervised Dynamic Graph Learning: A Survey

Yue Cai Zhu1, Fuyuan Lyu2, Chengming Hu3, Xi Chen4, Xue Liu5
School of Computer Science
McGill University
Montreal, Canada
Email: [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract

In recent years, the prevalent online services generate a sheer volume of user activity data. Service providers collect these data in order to perform client behavior analysis, and offer better and more customized services. Majority of these data can be modeled and stored as graph, such as the social graph in Facebook, user-video interaction graph in Youtube. These graphs need to evolve over time to capture the dynamics in the real world, leading to the invention of dynamic graphs. However, the temporal information embedded in the dynamic graphs brings new challenges in analyzing and deploying them. Events staleness, temporal information learning and explicit time dimension usage are some example challenges in dynamic graph learning.

In order to offer a convenient reference to both the industry and academia, this survey presents the Three Stages Recurrent Temporal Learning Framework based on dynamic graph evolution theories, so as to interpret the learning of temporal information with a generalized framework. Under this framework, this survey categories and reviews different intelligent encoder-decoder architectures for supervised dynamic graph learning. We believe that this survey could supply useful guidelines to researchers and engineers in finding suitable graph structures for their dynamic learning tasks.

Index Terms:

Dynamic Graph, Supervised Learning. Temporal Learning, Encoder-Decoder Architecture, Continuous Time Dynamic Graph, Discrete Time Dynamic Graph, Streaming Graph

1 Introduction

In the data explosion era, the amount of data increases exponentially. Most of data can be viewed as a graph. Graph is a data structure that consists of nodes and edges. It is designed to model and store data that contains not only features for different entities, but also relations between them. Graph analysis has long been an important research topic. Previous works [1, 2, 3] assume that the underling graph is a static graph which does not change over time. But in real world, the entities modeled as graph present different temporal dynamic in node features and relations. Dynamic graph is developed to model and store such an evolving graph. The extra time dimension brings temporal information to the graph’s representation and reveals the causality embedded in its network dynamic[4]. However, such temporal information also increases the difficulty of analyzing graphs.

In recent year, utilizing machine learning techniques to analyze dynamic graph has become an emerging research topic[5, 6, 7]. Moreover, the prevalent online services generate a sheer volume of relational data, transaction data and interaction data. They are modeled and stored as attributed dynamic graphs, such as social graph in Facebook [8] and user video interaction graph in Youtube[9]. Those dynamic graph databases with rich attributes make supervised dynamic graph learning feasible and urge the industry to look for effective supervised dynamic graph learning methods[10, 7, 11] . Therefore, we believe a survey of such methods is extremely helpful for the industry and the research community to exploit the potential of those databases.

There are multiple well-written surveys to summarize dynamic graph representation learning algorithms. Kazemi et al.[12] focus on a broad topics of dynamic graph representation learning. Skarding et al.[13] specialize in Graph Neural Network models for dynamic graph. They all follow the encoder-decoder learning framework proposed by Hamilton et al.[14]. With this framework, encoder generates graph embedding at node level, and decoder use the embedding to perform prediction/classification. Practitioners could assemble different encoder and decoder combinations to best fit their machine learning task. Moreover, the decoder can be modified to perform all dynamic graph learning tasks introduced in later sections. However, the aforementioned surveys do not focus on supervised learning methods, and they do not discuss how temporal information is learnt.

Our work is different from the previous ones mainly in that we develop the Three Stages Recurrent Temporal Learning Framework based on dynamic graph evolution theories. We use this framework to explain how dynamic graphs evolve over time and how different algorithms can learn the temporal information. It also gives a general form of these algorithms.

In the development of the mentioned framework, we found that using time as an input feature enables learning algorithms to recognize temporal periodicity and vector clock[15]. Vector clock is recently introduced to describe the phenomena that message sent from neighboring nodes to the target node requires different traversing time depending on their connection pattern. This motivates us to categorize different algorithms by whether time is learnt implicitly or explicitly, namely Implicit Time and Explicit Time Learning Algorithm. Only Explicit Time learning Algorithms are capable to perform time prediction tasks which predict when a given graph updating event would happen. Time prediction task is recently recognized as one of the goals of dynamic graph learning.[7, 16]

As a summary, we make the following contributions in this survey paper:

•

The Three Stages Recurrent Temporal Learning Framework for dynamic graph learning. Under this framework, we discuss how temporal information is learnt by different dynamic graph learning algorithms.
•

A list of different goals of dynamic graph learning which includes time prediction
•

A review of the recent development of supervised dynamic graph learning for Discrete Time Dynamic Graph and Continuous Time Dynamic Graph
•

Some interesting future research directions in dynamic graph learning according to the topics discussed

The organization of this survey is as follows: Sec. 2 introduces some background knowledge of machine learning in dynamic graph; Sec. 3 describes the taxonomy in this survey, mainly on the categories of dynamic graph, the encoder-decoder learning framework and the motivation of implicit/explicit time learning models categorization. Fig. 3 illustrates the taxonomy in this survey; in Sec. 4, we introduce the Three Stages Recurrent Temporal Learning Framework. With this framework we discuss how temporal information is learnt; We will then start the review from algorithms designed for Discrete Time Dynamic Graph (DTDG) at Sec. 5, and then the algorithms designed for Continuous Time Dynamic Graph (CTDG) at Sec. 6. Potential future directions are discussed in Sec. 7. In Appendix, table summarizes the notations and abbreviations used in this work.

2 Machine Learning In Dynamic Graph

2.1 Supervised Learning in Graph

Supervised machine learning typically trains a machine learning model with historical data and conducts prediction or classification with the trained model during inference time. In order to perform supervised training, the ground truth must be available in the historical data which is referred as label by convention. Supervised learning in graphs differs from traditional machine learning in that it could be further categorized as node focus task, edge focus task and graph focus task[17]. In the following section, we use classification task as examples. Such paradigm can be easily extended to regression tasks.

We can denote a graph as $G=(V,E,\mathbf{X})$ with $V=\{v_{1},v_{2},\dots,v_{i},\dots,v_{m}\}$ as the set of nodes in $G$ where $v_{i}\in G\wedge i\in[1,|V|]$ , and $E=\{e_{i,j}\}$ as the set of edges in $G$ where $e_{i,j}=(v_{i},v_{j},f_{i,j})$ , $v_{i},v_{j}\in V$ and $f_{i,j}$ represents the feature of edge $e_{i,j}$ . $X$ is the node feature matrix with each row vector $x_{v_{i}}\in X$ stores the features for node $v_{i}\in V$ . Namely, there exists a one to one mapping between $V$ and the row vectors in $X$ . $\mathbf{X}\in\mathcal{R}^{|V|\times d}$ and $d$ is the number of features for a given node.

Then these three tasks can be defined as:

Definition 1.

Node Focus Task: given a one to one mapping between $V_{\text{known}}=\{v_{1},v_{2},\cdots,v_{i}\}$ with $V_{\text{known}}\subset V$ and the label set $Y_{\text{known}}=\{y_{1},y_{2},\cdots,y_{i}\}$ . The learning purpose is to predict $y_{k}$ for $v_{k}\in V_{\text{unknown}}$ where $V_{\text{unknown}}\cap V_{\text{known}}=\phi$ , $V_{\text{unknown}}\subseteq V-V_{\text{known}}$ .

Fig. 2(a) illustrates the idea of node focus task. The color represents the node attribute. And each node has its label $y$ . Node focus task is to predict the unknown node labels. As an example of node focus task, in community detection, the label assigned to each node would be the name of its associated community. If two nodes have the same label, they belong to the same community. It is also straightforward to see that node classification is node focus task.

Definition 2.

Edge Focus Task: given a one to one mapping between $E_{\text{known}}\subset E$ and their corresponding labels $Y_{\text{known}}$ . The learning purpose is to predict $y_{i,j}$ for $e_{i,j}\in E_{\text{unknown}}$ where $E_{\text{unknown}}=E-E_{\text{known}}$

As shown in Fig. 2(b), given a graph with node attribute and observed edges, edge focus task is to predict whether an edge exists between two given nodes.

Definition 3.

Graph Focus Task: given a collection of graphs or sub-graphs $\{G_{i}\}$ and a one to one mapping between $\{G_{i}\}$ and the label set $\{y_{i}\}$ , the learning purpose is to predict $y_{j}$ for $G_{j}\notin\{G_{i}\}$ .

Given multiple graphs as in Fig. 2(c), some of them has known labels but some not. Graph focus task is to predict the unknown graph label.

To sum up, the label $y$ to be predicted could be a cell in the adjacency matrix $A$ or an edge feature in edge focus tasks, one particular attribute in node attributes matrix $X$ for node focus tasks, or an numerical interpretation of the state tuple for graph focus tasks.

2.2 Extrapolation and Interpolation Learning in Dynamic Graph

In real world applications, we are facing the challenge of learning the network dynamic, namely the repetitive pattern in a dynamic graph’s evolution over time. Dynamic graphs have one more dimension than static graphs have, which is time. Time dimension is usually stored as the timestamp when the observation of the graph or its components happens.

Let’s denote a dynamic graph as $G_{T}=O_{T}$ , where $T=[t_{1}:t_{n}]$ is the time span from $t_{1}$ to $t_{n}$ and is referred as the observation period. $O_{T}=\{o_{t_{1}},o_{t_{2}},\cdots,o_{t_{n}}\}$ is the set of observations which are performed within the observation period $T$ . The observation could be a snapshot of graph $G_{t}=(V_{t},E_{t},\mathbf{X}_{t})$ where $V_{t}$ , $E_{t}$ and $\mathbf{X}_{t}$ are the snapshot of the nodes, edges and node features at time $t$ , a single node updating event $o_{t}=v_{i,t}$ or edge updating event $o_{t}=e_{\{i,j\},t}$ at time $t$ . Supervised learning in dynamic graphs could be extrapolation learning, interpolation learning or time prediction.

In Extrapolation Learning, the purpose is to predict the label $Y_{t_{n+1}}$ at a future time based on the previous observations of a particular dynamic graph $G_{T}$ and ground truth labels $Y_{T}$ with the observation period $T=[t_{1}:t_{n}]$ . While in interpolation Learning, the objective is to estimate the missing labels $Y_{t_{i}}$ such that $t_{i}\in T$ and $Y_{t_{i}}\notin Y_{T}$ . Extrapolation task predicts the future based on historical data while interpolation task gives estimation for the past. Therefore, interpolation task is mainly used in data imputation. As an example, in stock market analysis, if the purpose is to predict the future trend, then it is extrapolation learning; if the purpose is to fill some missing data, then it is interpolation learning.

In Time Prediction, the goal is to predict the time for a given incoming event. For example, predicting when the next crisis event would happen in Integrated Crisis Early Warning System (ICEWS) [7, 16]. The different learning tasks for supervised dynamic graph learning are summarized in Table I

TABLE I: Supervised Learning task in Dynamic Graph

	Extrapolation	Interpolation	Time Prediction
Node Focus	Predict the target node’s attribute	Estimate the target node’s missing attribute	Predict when a given node updating
	in the future	in the past	event will happen
Edge Focus	Predict the target edge’s status	Estimate the target edge’s status	Predict when a given link will be
	in the future	in the past	added or deleted
Graph Focus	Predict the given dynamic graph’s attribute	Estimate the given graph’s missing attribute	Predict when the targeted graph will reach
	in the future	in the past	to a given state

3 Taxonomy

3.1 Dynamic Graph Storage Model

A dynamic graph represents a graph evolving over time. Different kinds of graph evolve differently. Some change very fast but some others not. For example, a telephone SS7 voice network generates server logs in each node every second, while an social interactive network modeled from customer reviews has no updates for days. Zaki et el.[18] summarized that dynamic graph can be modeled as either Discrete Time Dynamic Graph (DTDG) or Continuous Time Dynamic Graph (CTDG) based on how the temporal information is expressed regarding to the evolution of dynamic graph. DTDG is a list of snapshots and each of them keeps the graph status at a certain moment. Meanwhile, CTDG can be viewed as a stream of graph updating event. The definitions of these two graph representation models will be given out in Sec. 5 and 6 respectively. Such modeling paradigms are well adopted in the community [10, 19, 6, 12].

For a graph whose nodes or edges are frequently updated, it is more memory efficient to store it with DTDG modeling due to its snapshot-based representation[4]. However, it may induce some important temporal information loss if the observation frequency is not appropriately set. For example, if strong periodicity presents in nodes’ updating events, each node’s activity reach the peak at noon everyday. But the observation frequency is set to be every 24 hours, then such periodicity pattern is never captured in the resulting snapshots. In comparison, CTDG can capture all temporal information as it is an event-based representation[4, 20].

Due to their pros and cons, the most important decision to make is which storage model to use. Such decision must be made at the early stage of a dynamic graph learning use case. The selected storage model also limits the options of machine learning algorithms to only those are compatible. Algorithms that are designed specifically for DTDG, such as DySat[5] and STGCN[21], cannot be applied directly with CTDG storage model. Similarly, algorithms such as TGAT[6] that are designed for CTDG cannot be applied with DTDG either. In order to offer a fast reference for practitioners the first hierarchy in our taxonomy to categorize different algorithms is the compatible graph storage model.

3.2 Encoder And Decoder Learning Framework

Graph learning algorithms are better considered to be under the encoder-decoder framework[14]. We follow the same framework when reviewing supervised dynamic graph learning algorithms. As shown in Fig. 3, the encoder turns the observations of a dynamic graph into its latent representation that is node-based. This latent representation is called graph embedding. The decoder decodes the generated graph embedding and give the prediction or classification result. Under this framework, researchers can experiment on different encoder-decoder combinations to find the best one fitting the particular learning task [12].

The modification of decoder to perform the three graph learning tasks is simple. Node focus task is straightforward, since the embedding generated by encoder is node-based. We can modify the decoder to take the concatenation of two nodes’ embedding as input to perform edge focus task, which is referred as the pair-wise decoder[14]. In the graph focus task, practitioners need to find an aggregation method to aggregate all nodes’ embedding into the graph embedding. By modifying the decoder as described above, one can use the same encoder and decoder combination to perform all three dynamic graph learning problems. Therefore, algorithms fitting with such encoder-decoder framework are considered as general purposes learning methods.

3.3 Implicit And Explicit Learning Model

The last hierarchy of the taxonomy in this survey paper is whether a given algorithm uses time as an input feature explicitly or implicitly. If time is used implicitly, it is not fed into the model as an independent feature. For example, the algorithm with a static graph encoder and LSTM decoder, does not use time explicitly. The success of temporal learning relies on the time based ordering of the input snapshots and the regular observation frequency. We refer this kind of learning model as Implicit Time Learning Model.

Meanwhile, if a given algorithm takes time as an independent input feature, it is referred as Explicit Time Learning Model. Examples include TGAT[6] and TGNN[22] which use the time encoding Time2Vec[23] as a node feature. Explicit Time Learning Model is capable of periodicity recognition and vector clock recognition, for which Implicit Time Learning Model is incapable.

3.3.1 Periodicity Recognition

In temporal data, periodicity is a frequent seen phenomena. For example, $50\%$ to $70\%$ of human movements can be explained by periodic behavior pattern[24]. Periodic pattern can be learnt by sequential model in DTDG setting. The learning model infers the underlying timestamps by the ordering and position of the input snapshots. However, in CTDG, this is unfeasible without the explicit use of time. Kazemi et al.[23] proposed Time2Vec to encode time and help the downstream learning model to recognize periodic pattern as well as linear time pattern.

TGAT[6] applies time2vec in message passing to assign more weight to neighbors with similar periodic patterns to target nodes. Explicit use of time with Time2Vec helps CTDG learning better learn periodical patterns.

3.3.2 Vector Clock Recognition

Information flowing between two given nodes needs to traverse their shortest path. The difference in path length and edge property results in the difference in the flowing information’s arrival time at the target node. Temporal Distance between starting nodes and destination nodes is one metric to evaluate how much time is needed for the information to flow from the starting one to the destination[25]. Because the temporal distance from the target node to its neighbors has different value, most up-to-date information with respect to its neighbors is sent at different timestamp. This phenomena is described by the concept Vector Clock[15]. Since different nodes have different vector clocks,they evolve differently from each other. Temporal Point Process(TPP)[26] is one of the temporal learning methods that is capable to learn the impact from vector clock. With TPP and the explicit use of time, a learning algorithm could recognize the impacts from a given node’s vector clock to its evolution.

Moreover, the explicit use of time makes time prediction tasks feasible in dynamic graph learning. For the best of our knowledge, time prediction tasks could only be done with explicit time learning methods. Figure 1 shows the overall architecture of how this survey organizes different learning methods.

4 Temporal Pattern Learning

In this section, we present the Three Stages Recurrent Temporal Learning Framework. This framework describes a general form of dynamic graph learning algorithms, how they learn and apply the temporal pattern for different graph learning tasks. In some circumstances, node attributes are not available. Such dynamic graphs are called non-attributed dynamic graph. Conversely, dynamic graphs with attributes are called attributed dynamic graph. Previous works are limited to either attributed or non-attributed dynamic graphs[27, 28, 29]. For the best of our knowledge, Three Stages Recurrent Temporal Learning Framework is the only framework which could be generalized to both attributed and non-attributed dynamic graph learning. Subsection 4.1 presents the idea of the proposed framework and briefly introduces its three stages. Subsection 4.2, 4.3 and 4.4 explain the three stages in detail. Subsection 4.5 explains how to apply Three Stages Recurrent Temporal Learning Framework in attributed and non-attributed dynamic graphs. The equations presented in this section assumes the given learning task is an extrapolation task. With some modification, they can be applied to interpolation tasks as well.

4.1 Three Stages Recurrent Temporal Learning Framework

Three Stages Recurrent Temporal Learning Framework describes how a learning algorithms learns the temporal pattern. A temporal pattern is a repetitive pattern in the given dynamic graph’s evolution. A particular learning algorithm could learn and use it to perform those mentioned graph learning tasks. Given that the state of a dynamic graph $G_{t}$ at time $t$ is described by a timestamped state tuple $(E_{t},\mathbf{X_{t}},t)$ , in which $E_{t}$ and $\mathbf{X_{t}}$ is the edge connections and node attributes matrix observed at time $t$ . The temporal pattern could be expressed the following function:

(\hat{E}_{t_{n+1}},\mathbf{\hat{X}}_{t_{n+1}},t_{n+1})=\text{tp}(\{E_{T}\},\{\mathbf{X_{T}}\},T)

(1)

where $\{E_{T}\}$ and $\{\mathbf{X_{T}}\}$ are the set of observations of $E$ and $\mathbf{X}$ in time period $T=[t_{0},t_{1},\cdots,t_{n}]$ . $\hat{E}_{t_{n+1}}$ and $\mathbf{\hat{X}_{t_{n+1}}}$ are the prediction of $E$ and $\mathbf{X}$ for timestamp $t_{n+1}$ based on the observed history.

To predict any missing entry in the state tuple, a learning algorithm needs an output function $\text{out}(\cdot)$ which takes the predicted state $(\hat{E}_{t_{n+1}},\mathbf{\hat{X}_{t_{n+1}}},t_{n+1})$ and the known state $(E_{t_{n+1}}-e_{\{i,j\},t_{n+1}}),\mathbf{X_{t_{n+1}}},t_{n+1})$ as input. For example, in edge focus task, to predict the status of a given edge $e_{\{i,j\},t_{n+1}}\in E_{t_{n+1}}$ between $v_{i}$ and $v_{j}$ , the output function $\text{out}(\cdot)$ could be written as:

\begin{split}e_{\{i,j\},t_{n+1}}&=\text{out}((\hat{E}_{t_{n+1}},\mathbf{\hat{X}}_{t_{n+1}},t_{n+1}),\\ &(E_{t_{n+1}}-e_{\{i,j\},t_{n+1}}),\mathbf{X_{t_{n+1}}},t_{n+1}))\end{split}

(2)

A supervised dynamic graph learning algorithm needs to learn the temporal pattern $\text{tp}(\cdot)$ from the ground true history.

Three Stages Recurrent Temporal Learning Framework assumes $\text{tp}(\cdot)$ to be a three stages process such that it is a composite of the three functions:


$\displaystyle\mathbf{X^{\prime}}_{t_{n+1}}$	$\displaystyle=\text{asu}(\{\mathbf{X}_{T}\},T,t_{n+1})$	(3a)
$\displaystyle\mathbf{\hat{E}}_{t_{n+1}}$	$\displaystyle=\text{ap}(\{E_{T}\},\{\mathbf{X}_{T}\},\mathbf{X^{\prime}}_{t_{n+1}},T,t_{n+1})$	(3b)
$\displaystyle\mathbf{\hat{X}}_{t_{n+1}}$	$\displaystyle=\text{mp}(\hat{E}_{t_{n+1}},\mathbf{X^{\prime}}_{t_{n+1}})$	(3c)
tp	$\displaystyle=\text{mp}\circ\text{ap}\circ\text{asu}$	(3d)

Functions $\text{asu}(\cdot)$ , $\text{ap}(\cdot)$ and $\text{mp}(\cdot)$ each represents an intermediate stage in a dynamic graph’s evolution which will be detailed in following subsections. The operation $\circ$ is function composition.

As shown in Fig. 4, the first stage is the Attribute Self-Updating defined as in Eqn. (3a). This stage captures the impact from the external factors to the graph evolution and gives out $\mathbf{X^{\prime}_{t_{n+1}}}$ as an estimation of $X$ at $t_{n+1}$ . Here we present a model with only node attributes self-updating. Readers can extend it to edge attributes and graph attributes self-updating with similar schema. The change of attributes triggers the second stage named Association Process. As in Eqn. (3b), the association process describes the evolution of connection pattern. It generates new connection pattern based on the current connection pattern, timestamp and the self-updated attributes. The new connection pattern triggers the last stage which is the Message Passing. As defined as Eqn. (3c), this stage integrates the impact of attributes self-updating and association process to generate the attributes for the next state. Consequently, the result of message passing would be the starting point of attributes self-updating in the next round of the evolution. A dynamic graph’s evolution can be described as a recurrent process of these three stages.

4.2 Attributes Self-Updating

The first stage in Three Stages Recurrent Temporal Learning Frameworkis the attributes self-updating. This stage captures the impact of external factors to the evolution of the given dynamic graph’s attribute. Depending on the context, the attributes being impacted could be node attributes, edge attributes or graph attributes. The change of attributes alternatively drives the evolution of the given dynamic graph’s connection pattern[27]. As an example, in a client-item knowledge graph, the unobserved change of the client’s status would triggers the change of his interested items.

Definition 4.

Attributes Self Updating: the change of node, edge or graph’s attributes resulted from external factors. Its equation $\text{asu}(\cdot)$ is defined in Eqn. (3a)

As in our example, this stage only takes input the history of the node attributes and the timestamps. Its output $\mathbf{X^{\prime}}_{t_{n+1}}$ is the estimated node attributes for the next timestamp $t_{n+1}$ .

4.3 Association Process

The change of attributes triggers the development of a new connection pattern. The process that describes the evolution of a dynamic graph’s connection is called Association Process.

Definition 5.

Association Process: The process that a particular dynamic graph develops, abandons or modifies the edges between its nodes. Its equation $\text{ap}(\cdot)$ is defined in Eqn. (3b).

As shown in Eqn. (3b), the association process outputs the future connection pattern based on the given dynamic graph’s evolution history, the estimated future attributes from the first stage, and the timestamps.

4.4 Message Passing

In graph analysis, we believe nodes are impacted by their neighbours. To learn how nodes are impacted by their neighbours, Message Passing is developed for Graph Neural Network(GNN)s[30, 31, 2].

Definition 6.

Message Passing[30], also known as Affinity Propagation[32, 33] and Communication[34], is a local neighborhood information aggregation method, which updates node attributes by aggregating messages received from neighboring nodes and the connected edges. Its equation $\text{mp}(\cdot)$ is defined in Eqn. (3c)

Recent advancements in DTDG learning apply message passing to generate node-based graph embedding. The resulted graph embedding is considered to be a latent representation of the underlying graph’s network structure and nodes/edges attributes[14, 3]. Therefore, it contains important information for graph related machine learning tasks.

4.5 Generalization To Attributed And Non-attributed Dynamic Graphs

Three Stages Recurrent Temporal Learning Framework describes how attributed dynamic graph evolves by Eqn. (3). Supervised dynamic graph learning algorithms learns the temporal pattern $F$ by optimizing the trainable weights in Eqn. (3) and (2) to best fit the input history.

When generalized to non-attributed dynamic graph, the impact from node attributes are usually ignored since they are not available. The only driver considered in the graph’s evolution is the association process.[4, 27]. In this case we can drop Eqn. (3a) from Eqn. (3d) and setting Eqn. (3b) to take input only $\{E_{T}\}$ and the timestamps as follows:

\hat{E}_{t_{n+1}}=\text{ap}(\{E_{T}\},T,t_{n+1}).

(4)

TGN[10] and APAN[35] adopts Three Stages Recurrent Temporal Learning Framework by applying the node memory units to capture the attribute self-updating.

Know-Evolve[7], DyRep[34] and TGAT[6] adopt Three Stages Recurrent Temporal Learning Framework by setting Eqn. (3a) to always return the last observation in the input attributes history $\{\mathbf{X}_{T}\}$ .

The learning of temporal pattern $F$ is achieved by applying temporal learning algorithms in the dynamic graph encoder or directly in the decoder. Commonly-used temporal learning algorithms include RNN family neural networks, 1-D convolutional networks, time series analysis methods and attention networks. Recent trend is to explore the use of Temporal Point Process in temporal learning[7, 34].

5 Discrete Time Dynamic Graph Learning

Dynamic network represented by DTDG has a discretized time dimension. Each observation in the given dynamic graph $G_{T}$ is expressed as a snapshot of the given graph attached with the observation timestamp. DTDG is defined as follow:

Definition 7.

Discrete Time Dynamic Graph (DTDG): a dynamic graph $G_{T}=O_{T}$ for a time span $T=[t_{1}:t_{n}]$ is stored as DTDG, if each stored observation $o_{t_{i}}$ in $O_{T}$ is a snapshot of the given graph $o_{t_{i}}=(V_{t_{i}},E_{t_{i}},\mathbf{X}_{t_{i}})$ where $V_{t_{i}}$ , $E_{t_{i}}$ and $\mathbf{X}_{t_{i}}$ are nodes, edges and node features matrix observed at $t_{i}$ .

The data pipeline that makes the observation and stores the snapshot in a database is usually running in a regular frequency depending on the requirement, such as once per hour or once per day. The timestamps are usually as simple as ordered integers instead of actual date and time (e.g., $T=[1,2,3,\dots,n]$ ).

The most seen form of supervised DTDG learning is the static graph encoder - sequential decoder framework. As shown in Fig. 5(a), Algorithms following this framework use a static graph encoder to generate embedding for each snapshot and pass those embedding to a supervised sequential decoder for inference. Because the dynamic network has already been sliced into snapshots, there is no explicit usage of temporal information in the encoder. The encoder only captures the graph structure, property and attributes for each snapshot. Temporal information is learnt through the sequential decoder.

There are some emerging attempts to learn temporal pattern as well as graph topology and attributes in the encoder. These algorithms follow the dynamic graph encoder - simple decoder framework as shown in Fig. 5(b). Instead of generating node-wise graph embedding for each snapshot in each inference, the dynamic graph encoder only generates one embedding recursively based on inputs in the past and the current input snapshot at each inference. Table II lists all encoders for supervised DTDG learning in this survey. So far as we summarize, all supervised DTDG learning methods do not learn the graph attribute self updating process.

TABLE II: DTDG Graph Encoders

Time Model	Graph Type	Encoder
Implicit	non-attributed static graph	Shallow Embedding[36, 37, 38, 39, 40, 41, 42, 43, 44, 45]
		autoencoder Based Embedding[46, 47]
	attributed static graph	Message Passing GNN[17, 1, 48, 31]
		Graph Convolution Network[49, 50, 51, 31, 52, 53, 54]
		Graph Attention Network[2, 55, 5]
	dynamic graph	Kalman Filter Based Encoder[56]
		EvolveGCN[57]
		Spatial-Temporal Graph Convolution Network[21]
Explicit	N.A.	N.A.

5.1 Static Graph Encoder

5.1.1 Non-Attributed Static Graph Encoder

Static graph encoders could be categorized as non-attributed and attributed graph encoder. Non-attributed static graph encoders are not widely discussed in academia these days, due to their incapability to leverage graph attributes. This motivates the recent exploration in attributed static graph embedding approaches. We will first review briefly the non-attributed static graph embedding methods, and then provide an in-depth review of attributed static graph embedding methods. For those interested in a more detailed non-attributed static graph representation learning algorithms, we refer them to read the recent work of Hamilton et al. [14], Cui et al.[58] and kazemi et al.[12].

A basic form of non-attributed static graph encoder is the shallow embedding approach, which aims to transform graph structure and property to node-level graph embedding [58]. However, graph attributes are not considered in shallow embedding. Shallow embedding includes

•

Matrix Factorization based approaches such as Laplacian Eigenmaps[36], Graph Factorization[37], GraRep[38], HOPE[39] and M-NMF[40];
•

Random walk based approaches such as DeepWalk[41], TADW[42], HARP[43] and node2vec[44].
•

Other approaches, such as LINE[45] .

Shallow embedding encoder is simply an embedding look up based on node ids, there is no parameter sharing between nodes. Hence, the computation is inefficient and the trained encoder cannot be used for new graphs with unseen nodes [14].

To overcome these challenges and leverage graph attributes in the embedding generating process, multiple approaches are proposed to parameterized graph embedding (i.e, parameter sharing between nodes).

Deep Neural Graph Representation (DNGR)[46] and Structural Deep Network Embeddings (SDNE)[47] apply autoencoder[59] to map a high-dimension node similarity matrix to a low-dimension node embedding. These two approaches enable sharing of parameters, which can provide an efficient computation and be applied with unseen nodes.

5.1.2 Attributed Static Graph Encoder

To leverage graph attributes in static graph embedding generation, attributed static graph encoders are proposed. Attributed static graph encoders are based on Graph Neural Network (GNN)[17]. Such algorithms follow a message passing schema to aggregate neighborhood information and generate embedding for the target node [1]. In this survey, we focus on some widely used GNN based attributed graph encoders which are the foundation of dynamic graph encoders to review in later section.

GNN assigns state $\mathbf{h}_{v_{i}}$ to node $v_{i}$ of the input graph. Given mappings $N(v_{i})$ to be all neighbors of node $v_{i}$ , mapping $E(v_{i})$ to be all edges connecting node $v_{i}$ and its neighbors $N(v_{i})$ , $f_{w}(\cdot)$ to be the neighborhood information aggregating function, state $\mathbf{h}_{v_{i}}$ at the $k$ layer is defined as in Eqn. (5).

\mathbf{h}_{v_{i}}^{k}=f_{w}(\mathbf{h}^{k-1}_{v_{i}},E(v_{i}),\mathbf{h}^{k-1}_{N(v_{i})}),

(5)

which can be written as Eqn. (6) when there is no positional information for the neighbours.

\mathbf{h}_{v_{i}}^{k}=\sum_{u\in N(v_{i})}f_{w}(\mathbf{h}^{k-1}_{v_{i}},e(v_{i},u),\mathbf{h}^{k-1}_{u}).

(6)

The output state from the last layer $n$ is the graph embedding $\mathbf{z}_{v_{i}}$ :

\mathbf{z}_{v_{i}}=\mathbf{h}_{v_{i}}^{n}.

(7)

This state updating process is described as a message passing process in [1, 48]. In each iteration, message from each node is passed through the edges to their neighbours. To learn local neighborhood structure, $k$ is required to be a small value, such as $k=2$ in the learning of $2$ -hop neighborhood information. With Banach’s fixed point theorem, the state values will converge with the update of state in iterations [60, 1]. Therefore, to learn the global graph structure, we can just continue the iterative updating until the change in the state value between two consecutive iterations is close to 0 as shown in Eqn. (8). When the state values converge, the global graph structure and property is embedded in the resulting embedding.

\mathbf{h}_{v_{i}}^{k}-\mathbf{h}_{v_{i}}^{k-1}\approx 0.

(8)

Stacking the converged state value $z$ of each node together to produce $Z$ , we obtain the node-wise embedding of the input graph. GraphSAGE[31] and column network[61] are following the same schema with different neighborhood information aggregation functions.

Inspired by GNN, Graph Convolution Network (GCN)[49] is a generalization of Convolution Neural Network (CNN)[62] to static graph data structure with spectral method. GCN applies graph Laplacian to generate the feature maps which are shared within the whole graph as in CNN. Given the adjacency matrix $\mathbf{A}$ , $\mathbf{\hat{A}}=\mathbf{A}+\mathbf{I}_{N}$ , the diagonal degree matrix $\mathbf{\hat{D}}:\mathbf{\hat{D}}_{ii}=\sum_{j}\mathbf{\hat{A}}_{ij}$ , the graph Laplacian is calculated as follows:

\mathbf{L}=\mathbf{\hat{D}}^{\frac{1}{2}}\mathbf{\hat{A}}\mathbf{\hat{D}}^{\frac{1}{2}}.

(9)

Moreover, a convolution layer with the output of $d$ feature maps is calculated as follows:

\mathbf{Z}=\mathbf{L}\mathbf{X}\mathbf{W},

(10)

where $\mathbf{X}\in\mathcal{R}^{|V|\times d}$ is the input node feature matrix with $d$ features and $\mathbf{W}\in\mathcal{R}^{d\times d^{\prime}}$ is the layer weight matrix to generate $d^{\prime}$ feature maps. The output $\mathbf{Z}$ of a hidden convolution layer is usually passed to an ReLU activation function, and the output of activation function will then be passed to the next layer as the input. As an example, a classic two layers GCN with softmax activation in the output layer has the form in Eqn. (11).

\mathbf{Z}=\text{softmax}(\mathbf{L}\cdot\text{ReLU}(\mathbf{L}\mathbf{X}\mathbf{W}^{\text{hidden}})\cdot\mathbf{W}^{\text{out}}).

(11)

One example of GCN as static graph encoder to learn the neighborhood structure for each snapshot and a sequential decoder to learn the temporal pattern between snapshots is AddGraph[51] .

Similar to GCN, Graph Attention Network (GAT) [2] applies the attention mechanism [63, 64] to determine the importance of neighboring nodes in the neighborhood aggregation for the target node:

\text{attn}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{softmax}(\text{score}(\mathbf{Q},\mathbf{K}))\mathbf{V},

(12)

where $\mathbf{Q}$ is the linear projection of the target node’s input state $\mathbf{h}_{v_{i}}$ from the previous layer. $\mathbf{K}$ and $\mathbf{V}$ are the linear projections of the input state of the $v_{i}$ ’s neighboring nodes:

\begin{split}\mathbf{Q}&=\mathbf{h}_{v_{i}}\mathbf{W}_{q}\\ \mathbf{K}&=[\mathbf{h}_{v_{1}},\mathbf{h}_{v_{2}},\cdots,\mathbf{h}_{v_{j}},\cdots,\mathbf{h}_{v_{n}}]\mathbf{W}_{k},\forall v_{j}\in N(v_{i})\\ \mathbf{V}&=[\mathbf{h}_{v_{1}},\mathbf{h}_{v_{2}},\cdots,\mathbf{h}_{v_{j}},\cdots,\mathbf{h}_{v_{n}}]\mathbf{W}_{v},\forall v_{j}\in N(v_{i})\end{split}

(13)

The function $\text{score}(\cdot)$ calculates a score representing how well $\mathbf{Q}$ align to $\mathbf{K}$ . The alignment scores for $v_{i}$ ’s neighborhood $N(v_{i})$ are then normalized by a softmax layer. The output of the whole attention layer is the product of the normalized alignment score and the linear projection $\mathbf{V}$ .

In Transformer [64], the score function is the dot product of $Q$ and $K$ , while the score function is a leaky ReLU layer in Graph Attention Network (GAT)[2, 55]. GAT is used as static graph encoder to learn DTDG in Dysat[5]. All attributed static graph encoders can be used to generate embedding in a supervised or unsupervised manner[31, 50, 49, 2, 61].

5.2 Sequential Decoder

In supervised DTDG learning, the temporal pattern is revealed by the change between snapshots along the time dimension. By sorting the static graph embedding generated from the series of snapshots according to their timestamps and treating them as sequential data, temporal pattern can be learnt via a sequential decoder. However, sequential decoder does not use time explicitly and is not capable to perform time predicting tasks. Table III lists different kind of decoders summarized in this work.

TABLE III: Decoders

Storage Model	Time Model	Decoder
DTDG/CTDG	Implicit	Time Series Analysis Methods [65, 66, 67]
		RNN family[59, 68, 69, 52, 53, 54, 70, 71, 72]
		1-D Convolution Network [73]
		Positional Temporal Self Attention[5]
CTDG	Explicit	Temporal Point Process Decoder[7, 74]

5.2.1 Time Series Analysis Methods

DTDG can be considered as a time series of snapshots. Hence, traditional time series methods can be used naturally as decoders to learn temporal patterns. Exponential Moving Average (EMA) and Auto-Regressive Integrated Moving Average (ARIMA) are two widely-used methods, which are used as decoders in previous works [65, 66, 67]. As an example, Eqn. (14) shows the equation of EMA with a predefined smoothing factor $\alpha\in(0,1)$ .

\centering\mathbf{Z}_{t_{n+1}}=\sum_{i=0}^{n}\alpha(1-\alpha)^{i}\mathbf{Z}_{t_{n-i}}.\@add@centering

(14)

5.2.2 Recurrent Neural Network (RNN) family

RNN based methods are frequently-used supervised sequential decoder and are capable to learn temporal correlation. A basic form of RNN decoders is defined by a recurrent state function which takes the previous state and the current embedding as input[59]:

\mathbf{h}_{v_{i},t_{n}}=\text{RNN}(\mathbf{h}_{v_{i},t_{n-1}},\mathbf{z}_{v_{i},t_{n}}).

(15)

The generated state is used as the input in the output unit which is usually a Feed forward Neural Network (FNN):

o_{v_{i},t_{n}}=\text{FNN}(\mathbf{h}_{v_{i},t_{n}}).

(16)

One frequently used RNN model is the Long Short Term Memory (LSTM)[68] which better learns long-term temporal pattern. For example, Seo et al.[69] applied the spectral GCN[75] as static graph encoder and LSTM as decoder; Narayan et al.[76] used a different GCN[52] as static graph encoder and LSTM as decoder; Manessi et al.[53] and Mohanty et al.[54] also used different version of GCN as static graph encoder and LSTM as decoder to perform dynamic graph learning. Yuan et al.[70] applies message passing GNN as static graph encoder and a four gates LSTM as sequential decoder to learn the dynamic graph constructed from video frames and performs object detection. DyGGNN[71] uses a gated graph neural network[77] as encoder and LSTM as decoder. Bogaets et al.[72] applies CNN as static graph encoder and LSTM as sequential decoder in traffic forecasting for road network in the city of Xi’an.

5.2.3 1-D Convolution Neural Network (Conv1d)

Conv1d is used in time series analysis[78]. Therefore, it is naturally to use as the sequential decoder to learn the temporal pattern. Unlike RNN families, Conv1d only learns short-term temporal patterns in the given time frame. Periodicity of those short term temporal patterns could be learnt by stacking multiple Conv1d layers. By carefully setting the size of each feature map, Conv1d inputs the embedding of a given node for $n$ most recent snapshots $[\mathbf{z}_{v_{i},t_{1}},\mathbf{z}_{v_{i},t_{2}},\cdots,\mathbf{z}_{v_{i},t_{n}}]$ , and generates the decoded state $\mathbf{h}_{v_{i},t_{n}}$ for node $v_{i}$ at time $t_{n}$ :

\mathbf{h}_{v_{i},t_{n}}=\text{Conv1d}([\mathbf{z}_{v_{i},t_{1}},\mathbf{z}_{v_{i},t_{2}},\cdots,\mathbf{z}_{v_{i},t_{n}}]).

(17)

The decoded state is then pass to the output unit as in Eqn. (16).

GraphTCN[73] applies a GAT based static graph encoder and Conv1d as the sequential decoder to learn the spatial and temporal information in Human Trajectory Prediction.

5.2.4 Temporal Self-Attention (TSA)

Attention mechanism is proved to perform very well in sequential data learning [63, 79]. As in Eqn. (18), given the node embedding $\mathbf{Z}_{v_{i}}$ of the target node $v_{i}$ , each element encodes a snapshot from the observation period $[t_{1},t_{2},\cdots,t_{n-1},t_{n}]$ , TSA uses each element in $\mathbf{Z}_{v_{i}}$ as a query to attend over the whole input history $\mathbf{Z}_{v_{i},[t_{1}:t_{n}]}$ to generate the temporal graph embedding for the corresponding snapshot. The output of a TSA layer $\mathbf{H}_{v_{i}}$ has the same time dimension as the input, such that it is feasible to stack multiple TSA layers to learn the evolution of its local neighborhood structure and attributes over time.

\begin{split}\mathbf{Z}_{v_{i}}&=[\mathbf{z}_{v_{i},t_{1}},\mathbf{z}_{v_{i},t_{2}},\cdots,\mathbf{z}_{v_{i},t_{n}}]\\ \mathbf{Q}_{t_{i}}&=(\mathbf{z}_{v_{i},t_{i}}+\mathbf{P}_{t_{i}})\mathbf{W}_{q},\\ \mathbf{K}&=(\mathbf{Z}_{v_{i}}+\mathbf{P})\mathbf{W}_{k},\\ \mathbf{V}&=(\mathbf{Z}_{v_{i}}+\mathbf{P})\mathbf{W}_{v},\\ \mathbf{h}_{v_{i},t_{i}}&=\text{attn}(\mathbf{Q}_{t_{i}},\mathbf{K},\mathbf{V}),\\ \mathbf{H}_{v_{i}}&=[\mathbf{h}_{v_{i},t_{1}},\mathbf{h}_{v_{i},t_{2}},\cdots,\mathbf{h}_{v_{i},t_{n}}].\end{split}

(18)

As in Eqn. (12), Dysat[5] applied TSA as the sequential decoder in DTDG learning. Their score function is shown in Eqn. (19).

\text{score}(\mathbf{Q}_{t_{i}},\mathbf{K})=\frac{\mathbf{Q}_{t_{i}}\mathbf{K}^{\textbf{T}}}{\sqrt{d^{\prime}}}.

(19)

To the best of our knowledge, there is no explicit time sequential decoder proposed for supervised DTDG learning. Because the snapshots are taken in a regular frequency, temporal information is revealed in the ordering and the position of the snapshots. However, without using time as a learning feature, implicit time sequential decoder is not capable to perform time prediction tasks as discussed in Sec. 2.

5.3 Dynamic DTDG Encoder

5.3.1 Implicit Time DTDG Encoder

Recent developments commonly use attributed dynamic graph embedding as an encoder to learn both graph structures and temporal correlations. The generated dynamic graph embedding is then fed to a simple supervised predictive model as an decoder to conduct prediction.

Kalman Filter Based Encoder

Kalman Filter[80], also called Linear Quadratic Estimation, is widely used in sensor signal refinement. It is efficient in handling uncertainty caused by random external factors. When considering node properties as sensor measures, Kalman Filter can be used to generate dynamic node embedding. A hidden state matrix $\mathbf{H}_{t-1}$ at time $t-1$ is formed by stacking the hidden state of each node in the graph or the local neighborhoods. Kalman Filter based encoder is a two-step recurrent process that includes the prediction step and the hidden state updating step. Given the hidden state $\mathbf{H}_{t-1}$ at snapshot $t-1$ and its estimated covariance matrix $\mathbf{\hat{P}}_{t-1}$ , the embedding matrix $\mathbf{Z}_{t}$ at time $t$ and its predicted covariance matrix $\mathbf{P}_{t}$ is calculated as in Eqn. (20), where $\mathbf{W}$ and $\mathbf{B}$ are trainable parameters, and $\mathbf{Q}_{t}$ is a random noise drawn from a zero-mean Gaussian distribution, and $N_{t-1}$ is a control factor that could be simply 0 as in Sarkar et al.[56] or neighborhood embedding aggregation.

	$\displaystyle\mathbf{Z}_{t}$	$\displaystyle=\mathbf{WH}_{t-1}+\mathbf{BN}_{t-1},$		(20)
	$\displaystyle\mathbf{P}_{t}$	$\displaystyle=\mathbf{W\hat{P}}_{t-1}\mathbf{W}^{\textbf{T}}+\mathbf{Q}_{t}.$		(20)

Once a new observation of node attributes $\mathbf{X}_{t}$ at time $t$ is obtained, the Kalman gain $\mathbf{K}_{t}$ is defined as in Eqn. (21), where $\mathbf{J}$ is a trainable parameter.

\mathbf{K}_{t}=\mathbf{P}_{t}\mathbf{J}^{\textbf{T}}(\mathbf{JP}_{t}\mathbf{J}^{\textbf{T}}+\text{Cov}(\mathbf{X}_{t}))^{-1}.

(21)

The hidden state $\mathbf{H}_{t}$ and the estimated covariance $\mathbf{\hat{P}}_{t}$ are updated as follows:

	$\displaystyle\mathbf{H}_{t}$	$\displaystyle=\mathbf{Z}_{t}+\mathbf{K}_{t}(\mathbf{X}_{t}-\mathbf{JZ}_{t}),$		(22)
	$\displaystyle\mathbf{\hat{P}}_{t}$	$\displaystyle=\mathbf{P}_{t}-\mathbf{K}_{t}\mathbf{JP}_{t}.$		(22)

EvolveGCN

The idea of EvolveGCN[57] is simple and interesting. In order to to make the model adaptable to new added nodes, EvolveGCN focuses on training an RNN model to learn the temporal dynamic presented in the underling GCN. Namely, the parameters in the underling GCN are not learned. They are predicted by an RNN model. There are two versions of the EvolveGCN unit, which are the hidden unit EGCU-H and the output unit EGCU-O. EGCU-H takes the input of last layers output states $\mathbf{H}^{l-1}_{t}$ and the parameters from the last time step $\mathbf{W}^{l}_{t-1}$ , and then outputs the parameter for the current time step $\mathbf{W}^{l}_{t}$ as shown in Eqn. (23).

\begin{split}\mathbf{W}^{l}_{t}&=\text{GRU}(\mathbf{H}^{l-1}_{t},\mathbf{W}^{l}_{t-1}),\\ \mathbf{H}^{l+1}_{t}&=\text{GNN}(\mathbf{A}_{t},\mathbf{H}^{l}_{t},\mathbf{W}^{l}_{t}).\end{split}

(23)

Similarly, EGCU-O calculates the parameter for the underling GCN at the current time step by LSTM, but takes only the parameter from the last time step as input as follows:

\begin{split}\mathbf{W}^{l}_{t}&=\text{LSTM}(\mathbf{W}^{l}_{t-1}),\\ \mathbf{H}^{l+1}_{t}&=\text{GNN}(\mathbf{A}_{t},\mathbf{H}^{l}_{t},\mathbf{W}^{l}_{t}).\end{split}

(24)

Spatial Temporal Graph Convolutional Network (ST-GCN)

ST-GCN is developed to learn both the spatial and temporal patterns for human action recognition[21]. An ST-GCN layer is composed by two operations: the spatial convolution and temporal convolution. Spatial convolution learns the graph structure pattern. It adds a partitioning strategy to spectral GCN as described in Eqn. 10 to assign different weights to the nodes in different partitions, so as to learn the feature importance for different partitions based on their spatial information. There are three partitioning strategies proposed in [21]:

•

uni-labeling: all nodes are assigned in the same partition.
•

distance partitioning: the target node $v_{i}$ is assigned to partition $0$ , and the partitioning of its neighbors is determined by the length of their shortest paths to $v_{i}$ .
•

spatial configuration: the partition is determined by the given node’s distance to the graph’s centroid, as shown in Eqn. (25), where $r_{i}$ is the distance between $v_{i}$ and the graph centroid, and $l_{ti}(\cdot)$ is a function whose output is the partition label of the input node $v_{j}$ at time $t$ in the state calculation for node $v_{i}$ .

l_{ti}(v_{tj})=\begin{cases}0&\text{if $r_{j}=r_{i}$}\\ 1&\text{if $r_{j}<r_{i}$}\\ 2&\text{if $r_{j}>r_{i}$}\end{cases}

(25)

After the partitions are generated, the graph Laplacian will be broken down accordingly. With uni-labeling partitioning, the resulted graph Laplacian is $I_{N}+A$ which is exactly same as the GCN proposed by Kipf et al.[50]. With distance partitioning and spatial partitioning, the graph Laplacian is dismantled into multiple tensors $A_{j}$ according to the partition label such that the sum of those tensors equals to $I_{N}+A$ as follows:

\mathbf{I}_{N}+\mathbf{A}=\sum_{j}\mathbf{A}_{j}.

(26)

Each $A_{j}$ has its own learnable weight $M$ , and the GCN encoder in ST-GCN is calculated as shown in Eqn. (27) where $\otimes$ denotes the element-wise multiplication.

\mathbf{Z}=\sum_{j}\mathbf{D}_{j}^{\frac{1}{2}}(\mathbf{A}_{j}\otimes\mathbf{M}_{j})\mathbf{D}_{j}^{\frac{1}{2}}\mathbf{X}\mathbf{W}.

(27)

The resulting embedding for the input snapshots are ordered based on their timestamps as the output embedding.

As shown in Fig. 6, The temporal convolution performs 2-D convolution for each node along its time dimension $T$ and feature dimension $D$ to learn the temporal pattern.

An ST-GCN layers can have multiple temporal and spatial convolutions. Multiple ST-GCN layers can be stacked together to construct a deep dynamic graph encoder for better expressive power. The implementation in Yan et al.[21] ensembles an ST-GCN layer with one spatial convolution in between two temporal convolutions.

Explicit use of time in DTDG has not been explored in the community to the best of our knowledge. Neither does time prediction task for DTDG. It is an interesting direction of future work to develop an explicit time learning model for DTDG and perform time prediction tasks.

6 Continuous Time Dynamic Graph Model

DTDG is a well-explored dynamic graph model, which offers plenty of learning algorithms for downstream applications. However, as discussed above, it is possible to lose important temporal information when DTDG storage models are applied with inappropriate observation frequency. To preserve all temporal information, we can use Continuous Time Dynamic Graph(CTDG) storage model. CTDG is also called streaming graph. Under this storage model, the dynamic graph is modeled and stored as the graph updating event stream. Because all the changes and their timestamps are kept in the database, there is no loss of temporal information.

Definition 8.

Continuous Time Dynamic Graph (CTDG): a dynamic graph $G_{T}$ with $T=[t_{0}:t_{n}]$ is regarded as a CTDG model if it is stored as $G_{T}=(G_{t_{0}},O_{[t_{1}:t_{n}]})$ with $O_{[t_{1}:t_{n}]}$ to be a collection of timestamped graph updating events observed during the time span $[t_{1}:t_{n}]$ , $G_{t_{0}}$ to be its initial state at $t_{0}$ . Each event $o_{t_{i}}\in O_{[t_{1}:t_{n}]}$ could be either a node updating event $x_{v_{i},t_{i}}$ or edge updating event $e_{\{i,j\},t_{j}}$ .

CTDG learning algorithm aims to learn the network evolution embedded in the events stream. As shown in Fig. 5(c), the framework for CTDG learning algorithms is very similar to that of the dynamic DTDG encoder (the simple decoder framework in Fig. 5(b)). In this framework, the input is the observed updating event stream, and then the dynamic CTDG encoder transforms the input event stream to a node-wise graph embedding that learns the graph’s temporal pattern in its evolution. The encoder learns the regular temporal pattern as described by Eqn. (1). Hence it can not only be used directly with common decoders such as MLP[81] and SVM[82], but also be used with sequential decoders introduced in Sec. 5. Moreover, this framework is commonly applied to analyze interaction networks[7, 74], transaction networks[35], and knowledge graphs in recommendation systems[35].

There are three major challenges in the supervised learning of CTDG: Event Expiration, Computational Exhaustive in Adjacency Matrix Retrieval, and Temporal Information Learning.

6.0.1 Event Expiration

This is the staleness problem proposed by Kazemi et al.[12] for CTDG learning. How can we determine if long-ago updating events have large impacts on the current nodes? For example, the relations between users in a social network are defined by their phone call activities, and a phone number is previously abandoned and is recently assigned to a new user. In this case, the previous events for the node represented by this number should be expired and no longer reflect its current owner’s social relationship, and this node should be counted as a new user without history.

6.0.2 Computational Exhaustive in Adjacency Matrix Retrieval

CTDG is stored as a collection of updating events. To obtain its adjacency matrix, we need to scan the whole history and construct the relation between nodes to fill in the cells in the adjacency matrix for each source and destination node pair, according to the observed events and their timestamp. However, the computing resources are costly, so that we try to avoid the adjacency matrix construction in CTDG learning.

6.0.3 Temporal information Learning

Each observation in CTDG has its own timestamp and hence rich temporal information can be learnt by analyzing their timestamps. The challenge of learning the temporal information is how we can properly use the timestamps as features to learn from.

In the following sections, we will review different dynamic graph encoders for CTDG and discuss how these graph encoders tackle the aforementioned challenges. All summarized CTDG encoders are listed in Table IV:

TABLE IV: CTDG Graph Encoders

Time Model	Graph type	Encoder
Implicit	Non-attributed Dynamic Graph	Temporal Random Walk-Based Encoder[83, 84, 85, 34]
Explicit	Attributed Dynamic Graph	Temporal Attention Based Encoder [6, 19]
		RNN based Encoder[10, 86, 87, 7, 34, 35]

6.1 Implicit Time CTDG Encoder

The temporal process regarding to the graph topology can be learnt through the Temporal Random Walk-Based Encoder. Temporal Random Walk is defined as a random walk performed respecting to the timestamp of the edge updating events[4, 88, 20, 89]. Once the sets of temporal walks are sampled, random walk-based static graph encoders can be used to generate the embedding. Nguyen et al.[83] propose a temporal random walk-based encoder for non-attributed dynamic graphs with three strategies to select the next hop in a walk. De Winter et al.[84] and Bastas et al.[85] convert the CTDG to DTDG and perform temporal random walks. Since random walk-based approaches are generally applied in non-attributed dynamic graphs, practitioners have to combine them with a decoder that utilizes the graph attributes for attributed dynamic network. One potential direction of future works can extend temporal random walk-based encoders for attributed dynamic graph by biasing the hop selection based on TPP. The temporal attentive aggregation for neighborhood message passing in DyREP[34] is an example, in which the maximum walk length is 1 and the probability of next hop is calculated based on TPP.

6.2 Explicit Time CTDG Encoder

CTDG is stored as a stream of observed updating events $O(T)$ which could be viewed as sequential data. Therefore sequential learning models are naturally used in dynmaic CTDG encoder to transform a given node’s updating events to its embedding. One of such kind of encoders is the Temporal Attention based encoder.

6.2.1 Temporal Attention Based CTDG Encoder

Inspired by the success of network attention mechanism in learning sequence data[64], Xu et al. proposed the Temporal Attention mechanism as the dynamic graph encoder in Temporal Graph Attention Network (TGAT). It assumes that more critical temporal information are revealed in the relative time span, compared to the absolute time value. However attributes self-updating is not considered in its temporal learning framework.

With this assumption, Temporal Attention applies Time2Vec[23] to capture critical temporal information from the dynamic graph. Time2Vec aims to generate a simple vector representation of time so as to enable different learning algorithms to learn the temporal correlation as well as the periodicity with the explicit use of time. Given a scalar notion of time $\Delta t$ , which could be time difference in CTDG setting, its naive Time2Vec encoding $\text{t2v}(\Delta t)$ , is a vector of predefined size $d$ and calculated as:

\text{t2v}(\Delta t)[i]=\begin{cases}w_{i}\Delta t+\varphi_{i}&\text{if $i=1$},\\ f(w_{i}\Delta t+\varphi_{i})&\text{if $1<i\leq d$},\end{cases}

(28)

where $w_{i}$ and $\varphi_{i}$ are trainable weight or predefined weight, and $f$ is a periodic function, such as $\text{sin}(\cdot)$ and $\text{cos}(\cdot)$ . Time2Vec helps the learning model to learn temporal correlation by simple linear time projection when $i=1$ ; and it helps to learn the periodicity of time by kernel learning when $1<i\leq d$ [6].

TGAT applied a modified version of Time2Vec to calculate the functional encoding of a given node $v_{i}$ at time $t$ :


$\displaystyle\mathbf{h}_{v_{i},v_{j}}$	$\displaystyle=\mathbf{x}_{v_{j},t_{j}}\|\|\mathbf{e}_{ij,t_{j}}\|\|\text{t2v}(t_{i}-t_{j})$	(29a)
$\displaystyle\begin{split}\mathbf{H}_{v_{i}}&=[\mathbf{h}_{v_{i},v_{i}},\mathbf{h}_{v_{i},v_{1}},\cdots,\mathbf{h}_{v_{i},v_{j}},\cdots,\mathbf{h}_{v_{i},v_{n}}]^{\textbf{T}}\\ &\forall v_{j}\in N(v_{i})\end{split}$		(29b)
$\displaystyle\begin{split}\text{t2v}(\Delta t)&=\sqrt{\frac{1}{d}}[\text{cos}(w_{1}\Delta t),\text{sin}(w_{1}\Delta t),\cdots,\\ &\text{cos}(w_{d}\Delta t),\text{sin}(w_{d}\Delta t)]\end{split}$		(29c)

where $\mathbf{x}_{v_{i},t_{i}}$ is the node feature vector for node $v_{i}$ at time $t_{i}$ ; $\mathbf{e}_{ij,t_{j}}$ is the edge feature vector for the edge between node $v_{i}$ and $v_{j}$ at time $t_{j}$ . In Eqn. (29a), if $v_{j}=v_{i}$ , $\mathbf{e}_{ij,t_{j}}$ is vector of zeros.

With the functional encoding, temporal attention calculates the query, key and value for $v_{i}$ at time $t_{i}$ as:

\begin{split}\mathbf{Q}_{v_{i},t_{i}}&=[\mathbf{H}_{v_{i}}]_{0}\mathbf{W}_{Q},\\ \mathbf{K}_{v_{i},t_{i}}&=[\mathbf{H}_{v_{i}}]_{1:n}\mathbf{W}_{K},\\ \mathbf{V}_{v_{i},t_{i}}&=[\mathbf{H}_{v_{i}}]_{1:n}\mathbf{W}_{V},\end{split}

(30)

and the output state of the target node $v_{i}$ is given by Eqn. (12):

\mathbf{z}_{v_{i},t_{i}}=\text{attn}(\mathbf{Q}_{v_{i},t_{i}},\mathbf{K}_{v_{i},t_{i}},\mathbf{V}_{v_{i},t_{i}}).

(31)

6.2.2 RNN-based CTDG Encoder

As discussed in Sec. 4, external factors have major contribution to a dynamic graph’s evolution. RNN-based CTDG encoder follows the Three Stages Recurrent Temporal Learning Frameworkto capture the external factors contributing to the node attribute self-update. RNN-based encoder first generates impression from the observed events related to the target node by a memory function, and then maintains its memory related to the target node by feeding the impression to a sequential model. Node embedding is generated from the maintained memory optionally together with other factors, such as node embedding from other kinds of encoder, embedding from the observed events, and embedding from the timestamps.

Temporal Graph Neural Network (TGN)[10] provides a general framework for RNN based encoder. TGN consists of two components which are the memory component and embedding component. The memory component represents the model’s memory for a given node’s history. we can denote the memory of node $v_{i}$ at time $t$ by vector $\mathbf{s}_{v_{i},t}$ . $\mathbf{s}_{v_{i},t}$ is updated when an updating event involving node $v_{i}$ is encountered. If the event is a node-wise event with new node attribute vector $\mathbf{x}_{v_{i},t}$ , the message to $\mathbf{s}_{v_{i},t}$ will be calculated as shown in Eqn. (32), where $t^{-}$ is the timestamp for the last observation for node $v_{i}$ before $t$ and $\Delta t$ is the time span between $t^{-}$ and $t$ :

\mathbf{m}_{v_{i},t}=\text{msg}_{n}(\mathbf{s}_{v_{i},t^{-}},\Delta t,\mathbf{x}_{v_{i},t})

(32)

For an interaction event with new edge feature vector $\mathbf{e}_{ij,t}$ indicating node $v_{i}$ as source and $v_{j}$ as destination, the message is calculated as shown in Eqn. (33a) and (33b), respectively.


$\displaystyle\mathbf{m}_{v_{i},t}$	$\displaystyle=\text{msg}_{s}(\mathbf{s}_{v_{i},t^{-}},\mathbf{s}_{v_{j},t^{-}},\Delta t,\mathbf{e}_{ij,t},$	(33a)
$\displaystyle\mathbf{m}_{v_{j},t}$	$\displaystyle=\text{msg}_{d}(\mathbf{s}_{v_{j},t^{-}},\mathbf{s}_{v_{i},t^{-}},\Delta t,\mathbf{e}_{ij,t}).$	(33b)

For an undirected network, $\text{msg}_{s}$ and $\text{msg}_{d}$ share the same parameters.

Once the messages for all input events with the target node $v_{i}$ involved are generated, they are aggregated as described in Eqn. (34), where $t^{-}<t_{1},\dots,t_{b}<=t$ :

\mathbf{s}_{v_{i},t}=\text{mem}(\mathbf{s}_{v_{i},t^{-}},\text{agg}(\mathbf{m}_{v_{i},t_{1}},\dots,\mathbf{m}_{v_{i},t_{b}}))

(34)

The $\text{agg}(\cdot)$ refers to a aggregation function, which can be a trainable deep learning layer (e.g., RNN, LSTM), or a simple aggregation without trainable weight (e.g., the mean of those messages, the most recent message). The $\text{mem}(\cdot)$ function is a deep learning layer representing the memory of the model, and should be selected from the RNN family. The output of the memory component $s_{i,t}$ can be used directly as the resulted node embedding as in Jodie[86, 87], Know-Evolve[7] and DyREP[34].

However, as proposed by Kazemi et al. [12], there is a so-called memory staleness problem for nodes that are not active for a relatively long time depending on the context. The memory $\mathbf{s}_{v_{i},t^{-}}$ of these kind of node is not updated for a long time. Once a new event arrives, the outdated memory has the same impact as a recent memory. When calculating the new memory $\mathbf{s}_{v_{i},t}$ , this is not desirable when the events happened long time ago has not much impact to the graph’s evolution. For example, two users with past frequent connections may not be currently connected in a call social network, since one of the users changes the phone number. A new event for an abandoned number represents that this number is recycled and used by a different user, and the history for this number is no longer meaningful. To overcome this challenge, TGN proposed the embedding component which uses a time embedding [23, 6] as one of the input features in inference and could be trained to recognize the impact of staled memory.

The embedding component $\mathbf{z}_{v_{i},t}$ has a general form as the following message passing schema:

\mathbf{z}_{v_{i},t}=\sum_{v_{j}\in N_{t}(v_{i})}f(\mathbf{s}_{v_{i},t},\mathbf{s}_{v_{j},t_{j}},e_{ij,t_{j}},\mathbf{x}_{v_{i},t},\mathbf{x}_{v_{j},t}),

(35)

where $f(\cdot)$ is a neighbour node aggregation function and $t_{j}$ is the timestamp that the last edge updating event observed between $v_{i}$ and $v_{j}$ .

The embedding component $\mathbf{z}_{v_{i},t}$ has different implementations and can generate embedding without neighborhood information aggregation. For example, it could be as simple as just the identify function of the memory $\mathbf{s}_{v_{i},t}$ ; the time projection used in Jodie [87]:

\mathbf{z}_{v_{i},t}=(1+\Delta tw)\cdot\mathbf{s}_{v_{i},t}.

(36)

A MLP-TGAT network structure takes the states of the target node $v_{i}$ and its neighbors from the last layer and the Time2Vec embedding of their last updating events’ time stamps $\text{t2v}(t-\mathbf{T}_{N(v_{i})})$ as input, the input state $\mathbf{h}_{v_{i},t}^{(1)}$ of the first layer is the memory $\mathbf{s}_{v_{i},t}$ and the output state $\mathbf{h}_{v_{i},t}^{(l)}$ of the last layer would be the resulting embedding:

\begin{split}\mathbf{h}_{v_{i},t}^{(1)}&=\mathbf{s}_{v_{i},t}\\ \mathbf{\hat{h}}_{v_{i},t}^{(l)}&=\text{TGAT}(\mathbf{h}_{v_{i},t}^{(l-1)},\text{t2v}(0),\mathbf{H}_{N(v_{i})}^{(l-1)},\text{t2v}(t-\mathbf{T}_{N(v_{i})}))\\ \mathbf{h}_{v_{i},t}^{(l)}&=\text{MLP}(\mathbf{h}_{v_{i},t}^{(l-1)},\mathbf{\hat{h}}_{v_{i},t}^{(l)})\end{split}

(37)

Similarly, Temporal Graph Sum is proposed in TGN [10]:

\begin{split}\mathbf{\hat{h}}_{v_{i},t}^{(l)}&=\text{ReLU}(\sum_{v_{j}\in N(v_{i})}\mathbf{W}_{1}^{(l)}(\mathbf{h}_{v_{j},t_{j}}^{(l-1)},\text{t2v}(t-t_{j}))),\\ \mathbf{h}_{v_{i}}^{(l)}&=\mathbf{W}_{2}^{(l)}(\mathbf{h}_{v_{i},t}^{(l-1)}||\mathbf{\hat{h}}_{v_{i},t}^{(l)}).\end{split}

(38)

APAN[35] applies an asynchronous mail propagator to over come the out-of-order event arrivial issue in TGN’s node memory units design. In on-line training, graph updating events are not guaranteed to arrive in timestamp order. This will bring instability to RNN based CTDG encoder such as TGN. The asynchronous mail propagator fixes this issue by storing the incoming events in their timestamp order.

6.3 Explicit Time CTDG Decoder

As discussed, the decoder in CTDG learning can be a simple supervised classifier, such as MLP[81] and SVM[82]. The application of sequential decoders is introduced in Sec. 5. Temporal Point Process Based Decoder in CTDG learning is introduced in this section.

Temporal Point Process Based Decoder

Know-Evolve[7] and TDIG-MPNN[74] applies TPP as decoder in edge focus tasks and time predicting tasks with CTDG setting.

Instead of modeling the set of observations $O_{[0:t]}$ as sequential data, temporal point process models $O_{[0:t]}$ as a random process with parameters the input time $t$ and the observations $O_{[0:t^{-}]}$ before $t$ [26]:

f(t,O_{[0:t^{-}]})=\lambda(t,O_{[0:t^{-}]})\text{exp}(-\int_{t^{-}}^{t}\lambda(t,O_{[0:t^{-}]})dt),

(39)

where $f(t,O_{[0:t^{-}]})$ is the probability density function representing an event occurs at time $t$ given the previous observations. $t^{-}$ is the time for the last observation right before $t$ . $\lambda(\cdot)$ is the conditional intensity function. Its form depends on the temporal pattern to capture, such as Poisson process, Hawkes process[90], Self-correcting process[91], Power Law and Rayleigh process[92]. In Know-Evolve, the conditional intensity function $\lambda(\cdot)$ for the target edge $e_{i,j}$ at time $t$ is described as follows:

\lambda_{i,j}(t|t^{-})=\text{exp}(\mathbf{z}_{v_{i},t^{-}}^{\textbf{T}}\cdot\mathbf{W}_{i,j}\cdot\mathbf{z}_{v_{j},t^{-}})\dot{(}t-t^{-}),

(40)

where $\mathbf{W}_{i,j}\in\mathcal{R}^{d\times d}$ is the unique trainable parameter regarding the edge $e_{i,j}$ . For a Rayleigh process, the survival term in Eqn. (39) is calculated as:

\text{exp}(-\int_{t^{-}}^{t}\lambda_{i,j}(t|t^{-})dt)=\lambda_{i,j}(t|t^{-})\cdot(t^{2}-(t^{-})^{2}).

(41)

In the inference, practitioners can estimate the probability that an event happens for the given edge in the next moment, by the product of the resulted probability density and the time duration in interest to perform edge focus task and time predicting task. Graph Hawkes Neural Network applies Hawkes process in its decoder[93] in a similar manner. It would be interesting for a future work to extend TPP based decoder for node focus task as well as graph focus task. Also, applying TPP based decoder in DTDG learning for time predicting task is an interesting future work as well.

7 Challenges And Future Works

In this section, we will highlight some challenges and interesting future works in the supervised learning for both DTDG and CTDG.

7.0.1 Explicit Time DTDG Learning

DTDG is a very well-explored dynamic graph model. Because its snapshot based representation fits naturally with static graph encoder, most of the works focus on performing node focusing, edge focusing and graph focusing tasks with the framework of static graph encoder and sequential decoder. However, since time is implicitly learnt in the decoder with this framework, it is not capable to perform time predicting tasks. To the best of our knowledge, time predicting task is never considered with DTDG. Explicit time prediction in DTDG learning will be an important research direction in our opinion. For example, in financial stock price prediction, we can model each company in the investment portfolio as a node based on the background of the companies, and then the relation between nodes can be formulated as a graph. In addition to predicting the future price of a particular company’s stock, investors are also interested in knowing the time required to hold the stock until the targeted share price is reached. To facilitate explicit time prediction in both supervised and unsupervised DTDG learning, an explicit time DTDG encoder or decoder is required. Future works could be conducted on developing and applying explicit time supervised DTDG learning algorithms to predict the number of snapshots required for a particular change to happen.

7.0.2 Large Scale Dynamic Graph Learning

GNN based learning models suffer from high memory and CPU power requirement, which becomes more critical in large scale dynamic graph learning. We observe a recent trend in solving this challenge in static graph learning[94, 95, 96], while there is still large improvement room to address this issue in dynamic graph learning. The optimization for existing dynamic graph learning methods to accommodate large scale dynamic graph is a meaningful and interesting future work.

7.0.3 The Significance of Temporal Pattern

As discussed in Sec. 4, the evolution of dynamic network can be described by Eqn (1), where the future state of a dynamic graph is correlated to its past. The ultimate goal of dynamic graph learning is to estimate this function and apply it in different graph learning tasks. What if the evolution is just a random event that is not correlated with a graph’s history? Or more possibly, there are some randomness in the evolution that cannot be explained or learnt. In this case, we believe the performance gain by applying dynamic graph learning methods over static graph learning methods depends on how much randomness presented in the temporal process. Considering the additional complexity of dynamic graph modeling and learning over static ones, the performance gain via evaluating the significance of temporal pattern in a dynamic graph is very useful for industry practitioners to choose an appropriate model with low cost.

7.0.4 Implementation Tools for Dynamic Graph Learning

There are Tensorflow[97], Pytorch[98] and Scikit-Learn[99] to help write data pipelines in machine learning algorithm implementations. Some necessary and useful tools, such as DGL[100], Spektral[101], can help implement static graph learning algorithms. However, there are less library supports for implementing dynamic graph learning. A future work of creating such convenient tools will be beneficial for both the industry as well as the research community.

8 Conclusion

As a data structure, graph data modeling attracts the data science community by its extraordinary expressive power. The world is a dynamic system evolving along time, and so does the data generated from real world activity. Dynamic graph is the extension of naive graph modeling to capture this temporal dynamic. Moreover, with the recent development of big data technology and online services, recording and storing graph attributes become feasible. This urges the need for effective supervised dynamic graph algorithms to perform different machine learning tasks with better accuracy. To provide a comprehensive reference for academia researchers and industry practitioners, this survey is conducted on the following scopes:

•

Background of dynamic graph learning with a full-scale systematic summary from storage model, learning purpose, algorithm architecture framework.
•

We propose the Three Stages Recurrent Temporal Learning Framework. Based on the proposed temporal learning framework, we discuss how temporal information is learnt by different dynamic graph learning algorithms. Three Stages Recurrent Temporal Learning Framework also provides a general mathematical form of dynamic graph learning algorithms As far as we know, this is the first and only temporal learning framework which could be applied to both attributed and non-attributed dynamic graph learning.
•

Supervised dynamic graph learning algorithms with detailed introduction from DTDG compatible algorithms to CTDG compatible algorithms, from encoders to decoders and from implicit time algorithms to explicit time algorithms.
•

Future research directions according to the topics discussed

We hope that with this survey paper, we can offer a convenient reference to industry practitioners and facilitate future research for the academic community.

References

[1] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International conference on machine learning. PMLR, 2017, pp. 1263–1272.
[2] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
[3] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, “A comprehensive survey on graph neural networks,” IEEE transactions on neural networks and learning systems, 2020.
[4] N. Masuda and R. Lambiotte, Guide To Temporal Networks, A. World Scientific, 2020, vol. 6.
[5] A. Sankar, Y. Wu, L. Gou, W. Zhang, and H. Yang, “Dysat: Deep neural representation learning on dynamic graphs via self-attention networks,” in Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 519–527.
[6] D. Xu, C. Ruan, E. Korpeoglu, S. Kumar, and K. Achan, “Inductive representation learning on temporal graphs,” arXiv preprint arXiv:2002.07962, 2020.
[7] R. Trivedi, H. Dai, Y. Wang, and L. Song, “Know-evolve: Deep temporal reasoning for dynamic knowledge graphs,” in international conference on machine learning. PMLR, 2017, pp. 3462–3471.
[8] J. Weaver and P. Tarjan, “Facebook linked data via the graph api,” Semantic Web, vol. 4, no. 3, pp. 245–250, 2013.
[9] F. Benevenuto, F. Duarte, T. Rodrigues, V. A. Almeida, J. M. Almeida, and K. W. Ross, “Understanding video interactions in youtube,” in Proceedings of the 16th ACM international conference on Multimedia, 2008, pp. 761–764.
[10] E. Rossi, B. Chamberlain, F. Frasca, D. Eynard, F. Monti, and M. Bronstein, “Temporal graph networks for deep learning on dynamic graphs,” arXiv preprint arXiv:2006.10637, 2020.
[11] J. Li, H. Dani, X. Hu, J. Tang, Y. Chang, and H. Liu, “Attributed network embedding for learning in a dynamic environment,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, 2017, pp. 387–396.
[12] S. M. Kazemi, R. Goel, K. Jain, I. Kobyzev, A. Sethi, P. Forsyth, and P. Poupart, “Representation learning for dynamic graphs: A survey.” Journal of Machine Learning Research, vol. 21, no. 70, pp. 1–73, 2020.
[13] J. Skardinga, B. Gabrys, and K. Musial, “Foundations and modelling of dynamic networks using dynamic graph neural networks: A survey,” IEEE Access, 2021.
[14] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation learning on graphs: Methods and applications,” arXiv preprint arXiv:1709.05584, 2017.
[15] L. Lamport, “Time, clocks, and the ordering of events in a distributed system,” in Concurrency: the Works of Leslie Lamport, 2019, pp. 179–196.
[16] E. Boschee, J. Lautenschlager, S. O’Brien, S. Shellman, J. Starz, and M. Ward, “Integrated crisis early warning system (icews) coded event data,” URL: https://dataverse. harvard. edu/dataverse/icews, 2015.
[17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “The graph neural network model,” IEEE transactions on neural networks, vol. 20, no. 1, pp. 61–80, 2008.
[18] A. Zaki, M. Attia, D. Hegazy, and S. Amin, “Comprehensive Survey on Dynamic Graph Models,” International Journal of Advanced Computer Science and Applications, vol. 7, no. 2, 2016.
[19] D. Xu, C. Ruan, S. Kumar, E. Korpeoglu, and K. Achan, “Self-attention with functional time representation learning,” 2019.
[20] D. Kempe, J. Kleinberg, and A. Kumar, “Connectivity and inference problems for temporal networks,” Journal of Computer and System Sciences, vol. 64, no. 4, pp. 820–842, 2002.
[21] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
[22] Y. Ma, Z. Guo, Z. Ren, J. Tang, and D. Yin, “Streaming graph neural networks,” in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 719–728.
[23] S. M. Kazemi, R. Goel, S. Eghbali, J. Ramanan, J. Sahota, S. Thakur, S. Wu, C. Smyth, P. Poupart, and M. Brubaker, “Time2vec: Learning a vector representation of time,” arXiv preprint arXiv:1907.05321, 2019.
[24] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: user movement in location-based social networks,” in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 1082–1090.
[25] B. B. Xuan, A. Ferreira, and A. Jarry, “Computing shortest, fastest, and foremost journeys in dynamic networks,” International Journal of Foundations of Computer Science, vol. 14, no. 02, pp. 267–285, 2003.
[26] J. G. Rasmussen, “Temporal point processes: the conditional intensity function,” Lecture Notes, Jan, 2011.
[27] R. Toivonen, L. Kovanen, M. Kivelä, J.-P. Onnela, J. Saramäki, and K. Kaski, “A comparative study of social network models: Network evolution models and nodal attribute models,” Social networks, vol. 31, no. 4, pp. 240–254, 2009.
[28] D. Farine, “The dynamics of transmission and the dynamics of networks,” Journal of Animal Ecology, vol. 86, no. 3, pp. 415–418, 2017.
[29] O. Artime, J. J. Ramasco, and M. San Miguel, “Dynamics on networks: competition of temporal and topological correlations,” Scientific reports, vol. 7, no. 1, pp. 1–10, 2017.
[30] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, “Neural message passing for quantum chemistry,” in International Conference on Machine Learning. PMLR, 2017, pp. 1263–1272.
[31] W. L. Hamilton, R. Ying, and J. Leskovec, “Inductive representation learning on large graphs,” arXiv preprint arXiv:1706.02216, 2017.
[32] K. Wang, J. Zhang, D. Li, X. Zhang, and T. Guo, “Adaptive affinity propagation clustering,” arXiv preprint arXiv:0805.1096, 2008.
[33] D. Dueck, Affinity propagation: clustering data by passing messages. Citeseer, 2009.
[34] R. Trivedi, M. Farajtabar, P. Biswal, and H. Zha, “Dyrep: Learning representations over dynamic graphs,” in International conference on learning representations, 2019.
[35] X. Wang, D. Lyu, M. Li, Y. Xia, Q. Yang, X. Wang, X. Wang, P. Cui, Y. Yang, B. Sun et al., “Apan: Asynchronous propagation attention network for real-time temporal graph embedding,” in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2628–2638.
[36] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[37] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A. J. Smola, “Distributed large-scale natural graph factorization,” in Proceedings of the 22nd international conference on World Wide Web, 2013, pp. 37–48.
[38] S. Cao, W. Lu, and Q. Xu, “Grarep: Learning graph representations with global structural information,” in Proceedings of the 24th ACM international on conference on information and knowledge management, 2015, pp. 891–900.
[39] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 1105–1114.
[40] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang, “Community preserving network embedding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[41] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, 2014, pp. 701–710.
[42] C. Yang, Z. Liu, D. Zhao, M. Sun, and E. Chang, “Network representation learning with rich text information,” in Twenty-fourth international joint conference on artificial intelligence, 2015.
[43] H. Chen, B. Perozzi, Y. Hu, and S. Skiena, “Harp: Hierarchical representation learning for networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[44] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 855–864.
[45] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei, “Line: Large-scale information network embedding,” in Proceedings of the 24th international conference on world wide web, 2015, pp. 1067–1077.
[46] S. Cao, W. Lu, and Q. Xu, “Deep neural networks for learning graph representations,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[47] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, 2016, pp. 1225–1234.
[48] J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, “Graph neural networks: A review of methods and applications,” AI Open, vol. 1, pp. 57–81, 2020.
[49] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
[50] ——, “Variational graph auto-encoders,” arXiv preprint arXiv:1611.07308, 2016.
[51] L. Zheng, Z. Li, J. Li, Z. Li, and J. Gao, “Addgraph: Anomaly detection in dynamic graph using attention-based temporal gcn.” in IJCAI, 2019, pp. 4419–4425.
[52] M. Niepert, M. Ahmed, and K. Kutzkov, “Learning convolutional neural networks for graphs,” in International conference on machine learning. PMLR, 2016, pp. 2014–2023.
[53] F. Manessi, A. Rozza, and M. Manzo, “Dynamic graph convolutional networks,” Pattern Recognition, vol. 97, p. 107000, 2020.
[54] S. Mohanty and A. Pozdnukhov, “Graph cnn+ lstm framework for dynamic macroscopic traffic congestion prediction,” in International Workshop on Mining and Learning with Graphs, 2018.
[55] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of rectified activations in convolutional network,” arXiv preprint arXiv:1505.00853, 2015.
[56] P. Sarkar, S. M. Siddiqi, and G. J. Gordon, “A latent space approach to dynamic embedding of co-occurrence data,” in Artificial Intelligence and Statistics. PMLR, 2007, pp. 420–427.
[57] A. Pareja, G. Domeniconi, J. Chen, T. Ma, T. Suzumura, H. Kanezashi, T. Kaler, T. Schardl, and C. Leiserson, “Evolvegcn: Evolving graph convolutional networks for dynamic graphs,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5363–5370.
[58] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 5, pp. 833–852, 2018.
[59] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” nature, vol. 323, no. 6088, pp. 533–536, 1986.
[60] S. Banach, “Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales,” Fund. math, vol. 3, no. 1, pp. 133–181, 1922.
[61] T. Pham, T. Tran, D. Phung, and S. Venkatesh, “Column networks for collective classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
[62] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
[63] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
[65] P. R. da Silva Soares and R. B. C. Prudêncio, “Time series based link prediction,” in The 2012 international joint conference on neural networks (IJCNN). IEEE, 2012, pp. 1–7.
[66] Z. Huang and D. K. Lin, “The time-series link prediction problem with applications in communication surveillance,” INFORMS Journal on Computing, vol. 21, no. 2, pp. 286–303, 2009.
[67] İ. Güneş, Ş. Gündüz-Öğüdücü, and Z. Çataltepe, “Link prediction using time series of neighborhood-based node similarity scores,” Data Mining and Knowledge Discovery, vol. 30, no. 1, pp. 147–180, 2016.
[68] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[69] Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson, “Structured sequence modeling with graph convolutional recurrent networks,” in International Conference on Neural Information Processing. Springer, 2018, pp. 362–373.
[70] Y. Yuan, X. Liang, X. Wang, D.-Y. Yeung, and A. Gupta, “Temporal dynamic graph lstm for action-driven video object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1801–1810.
[71] A. Taheri, K. Gimpel, and T. Berger-Wolf, “Learning to represent the evolution of dynamic graphs with recurrent models,” in Companion Proceedings of The 2019 World Wide Web Conference, 2019, pp. 301–307.
[72] T. Bogaerts, A. D. Masegosa, J. S. Angarita-Zapata, E. Onieva, and P. Hellinckx, “A graph cnn-lstm neural network for short and long-term traffic forecasting based on trajectory data,” Transportation Research Part C: Emerging Technologies, vol. 112, pp. 62–77, 2020.
[73] C. Wang, S. Cai, and G. Tan, “Graphtcn: Spatio-temporal interaction modeling for human trajectory prediction,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 3450–3459.
[74] X. Chang, X. Liu, J. Wen, S. Li, Y. Fang, L. Song, and Y. Qi, “Continuous-time dynamic graph learning via neural interaction processes,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 145–154.
[75] M. Defferrard, X. Bresson, and P. Vandergheynst, “Convolutional neural networks on graphs with fast localized spectral filtering,” Advances in neural information processing systems, vol. 29, pp. 3844–3852, 2016.
[76] A. Narayan and P. H. Roe, “Learning graph dynamics using deep neural networks,” IFAC-PapersOnLine, vol. 51, no. 2, pp. 433–438, 2018.
[77] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” arXiv preprint arXiv:1511.05493, 2015.
[78] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[79] Y.-H. Chen and J.-T. Chien, “Continuous-time attention for sequential learning,” in Proc. of AAAI Conference on Aritificial Intelligence, 2021.
[80] R. E. Kalman, “A new approach to linear filtering and prediction problems,” 1960.
[81] M. W. Gardner and S. Dorling, “Artificial neural networks (the multilayer perceptron)—a review of applications in the atmospheric sciences,” Atmospheric environment, vol. 32, no. 14-15, pp. 2627–2636, 1998.
[82] J. A. Suykens and J. Vandewalle, “Least squares support vector machine classifiers,” Neural processing letters, vol. 9, no. 3, pp. 293–300, 1999.
[83] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim, “Continuous-time dynamic network embeddings,” in Companion Proceedings of the The Web Conference 2018, 2018, pp. 969–976.
[84] S. De Winter, T. Decuypere, S. Mitrović, B. Baesens, and J. De Weerdt, “Combining temporal aspects of dynamic networks with node2vec for a more efficient dynamic link prediction,” in 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2018, pp. 1234–1241.
[85] N. Bastas, T. Semertzidis, A. Axenopoulos, and P. Daras, “evolve2vec: Learning network representations using temporal unfolding,” in International Conference on Multimedia Modeling. Springer, 2019, pp. 447–458.
[86] S. Kumar, X. Zhang, and J. Leskovec, “Learning dynamic embeddings from temporal interactions,” arXiv preprint arXiv:1812.02289, 2018.
[87] ——, “Predicting dynamic embedding trajectory in temporal interaction networks,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1269–1278.
[88] P. Holme, “Network reachability of real-world contact sequences,” Physical Review E, vol. 71, no. 4, p. 046119, 2005.
[89] K. A. Berman, “Vulnerability of scheduled networks and a generalization of menger’s theorem,” Networks: An International Journal, vol. 28, no. 3, pp. 125–134, 1996.
[90] A. G. Hawkes, “Spectra of some self-exciting and mutually exciting point processes,” Biometrika, vol. 58, no. 1, pp. 83–90, 1971.
[91] V. Isham and M. Westcott, “A self-correcting point process,” Stochastic processes and their applications, vol. 8, no. 3, pp. 335–347, 1979.
[92] K. Miller, R. Bernstein, and L. Blumenson, “Generalized rayleigh processes,” Quarterly of Applied Mathematics, vol. 16, no. 2, pp. 137–145, 1958.
[93] Z. Han, Y. Ma, Y. Wang, S. Günnemann, and V. Tresp, “Graph hawkes neural network for forecasting on temporal knowledge graphs,” arXiv preprint arXiv:2003.13432, 2020.
[94] Z. Jia, S. Lin, M. Gao, M. Zaharia, and A. Aiken, “Improving the accuracy, scalability, and performance of graph neural networks with roc,” Proceedings of Machine Learning and Systems, vol. 2, pp. 187–198, 2020.
[95] Y. Bai, C. Li, Z. Lin, Y. Wu, Y. Miao, Y. Liu, and Y. Xu, “Efficient data loader for fast sampling-based gnn training on large graphs,” IEEE Transactions on Parallel and Distributed Systems, vol. 32, no. 10, pp. 2541–2556, 2021.
[96] C. Gallicchio and A. Micheli, “Fast and deep graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 3898–3905.
[97] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: A system for large-scale machine learning,” in 12th $\{$ USENIX $\}$ symposium on operating systems design and implementation ( $\{$ OSDI $\}$ 16), 2016, pp. 265–283.
[98] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, pp. 8026–8037, 2019.
[99] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in python,” the Journal of machine Learning research, vol. 12, pp. 2825–2830, 2011.
[100] M. Wang, L. Yu, D. Zheng, Q. Gan, Y. Gai, Z. Ye, M. Li, J. Zhou, Q. Huang, C. Ma et al., “Deep graph library: Towards efficient and scalable deep learning on graphs.” 2019.
[101] D. Grattarola and C. Alippi, “Graph neural networks in tensorflow and keras with spektral,” arXiv preprint arXiv:2006.12138, 2020.

TABLE V: List Of Notations

Notation	Description
$\alpha$	The smoothing factor of EMA
$G$	A graph
$V$	Set of nodes $V={v_{1},v_{2},\cdots v_{n}}$ ; the input value $\mathbf{V}$ of the attention layer
$V(T)$	Node-wise events observed at period $T$
$E$	Set of edges $E={e_{1},e_{2},\cdots e_{n}}$
$E(T)$	Interaction events observed at period $T$
$v$	A node $v$ in $V$
$e$	An edge $e$ in $E$
$e_{i,j}$	the edge start from node $i$ to $j$
$\mathbf{e}_{ij,t}$	the edge feature vector for edge $e_{i,j}$ at time $t$
$t$	Time step / event time
$t^{-}$	Time step just before time $t$
$T$	Time duration
$G_{T}$	Dynamic graph in time duration $T$
$O_{T}$	Observations of a dynamic graph in time duration $T$
$G_{t}$	Dynamic graph at time $t$
$o_{t}$	Observation of a dynamic graph at time $t$ , $o_{t}\in O_{T}$
$o_{t}(v_{i})$	A node updating event for node $v_{i}$ observed at time $t$
$e_{\{i,j\},t})$	An edge updating event between node $v_{i}$ and $v_{j}$ observed at time $t$
$Y$	Set of labels in a data set
$y_{i}$	A particular label $y_{i}$ in $Y$
$\mathbf{X}$	Nodes feature matrix for $V$
$\mathbf{x}$	Node feature vector in $X$
$w,b,j,\varphi$	Learnable or preset model parameters in scala form
$\mathbf{w,b,j,\varphi}$	Learnable or preset model parameters in vector form
$\mathbf{W},\mathbf{B},\mathbf{J}$	Learnable or preset model parameters in matrix form
$f,\lambda$	Model functions
$h,c$	Hidden states in scala form
$\mathbf{h,c}$	Hidden states in vector form
$\mathbf{H},\mathbf{C}$	Hidden states in matrix form
$\mathbf{h}_{v_{i}}^{k}$	Hidden state of node $v_{i}$ at layer $k$ in a learning model
$N(v_{i})$	Neighborhood function which returns the neighboring nodes for the input node $v_{i}$
$E(v_{i})$	A function returns all edges connecting $v_{i}$ to its neighbors
$\mathbf{N_{t}}$	Neighborhood aggregation matrix, formed by stacking each node’s neighboring aggregation
$\mathbf{Q}$	The input query of the attention layer
$\mathbf{K}$	The input key of the attention layer; Kalman gain in Kalman filter
$\mathbf{P}$	Positional encoding in TSA; Covariance matrix in Kalman filter
$\mathbf{K^{\textbf{T}}}$	The superscript $T$ in bold text represents the transpose operation in matrix
$\mathbf{z}_{v_{i}}$	Embedding vector for node $v_{i}$
$\mathbf{Z}$	Graph embedding matrix
$\mathbf{A}$	Adjacency matrix
$\mathbf{\hat{D}}$	Diagonal degree matrix
$\mathbf{L}$	Graph labracian
$\mathbf{m}_{v_{i},t}$	Message passed to node $v_{i}$ at time $t$
$\text{out}(\cdot)$	output function of a learning model
$\text{tp}(\cdot)$	Temporal Pattern to learn in Three Stages Recurrent Temporal Learning Framework
$\text{asu}(\cdot)$	Attributes Self-Updating function in Three Stages Recurrent Temporal Learning Framework
$\text{ap}(\cdot)$	Association Process function Three Stages Recurrent Temporal Learning Framework
$\text{mp}(\cdot)$	Message Passing function Three Stages Recurrent Temporal Learning Framework
$\text{attn}(\cdot)$	Attention layer
$\text{msg}(\cdot)$	message generating function in TGN
$\text{mem}(\cdot)$	The memory of a TGN model
$\text{agg}(\cdot)$	message aggregating function in TGN
$\text{t2v}(\cdot)$	Time2Vec embedding
$\otimes$	Element-wise multiplication
$\|\|$	Concatenation

TABLE VI: List Of Abbreviations

Abbreviation	Description
APAN	Asynchronous Propagation Attention Network
CNN	Convolutional Neural Network
Conv1d	1-d Convolution Neural Network
CTDG	Continuous Time Dynamic Graph
DeepWalk	A graph embedding method based on deep learning and random walk
DNGR	Deep Neural Graph Representation
DTDG	Discrete Time Dynamic Graph
DyGGNN	Dynamic Gated Graph Neural Network
DySAT	Dynamic Self-Attention Network
EvolveGCN	Evolving GCN
EGCU	Evolving GCN unit
EMA	Exponential Moving Average
ARIMA	Auto-Regressive Integrated Moving Average
FNN	Feedforward Neural Network
GCN	Graph Convolution Neural Network
GNN	Graph Neural Network
GraRep	Graph representation with global structural information
GRU	Gated Recurrent Unit
HARP	Hierarchical Representation Learning
HOPE	High Order Proximity Preserved Embedding
Jodie	a coupled recurrent neural network model that learns the embedding trajectories of users and items
Know-Evolve	Deep evolutionary knowledge network for dynamic knowledge graph learning
LINE	Large-scale Information Network Embedding
LSTM	Long Short Term Memory
M-NMF	Modularized Non-negative Matrix Factorization
RNN	Recurrent Neural Network
SDNE	Structural Deep Network Embedding
softmax	softmax layer
STGCN	Spatial Temporal Graph Convolution Network
TADW	Text Associated DeepWalk
TGN	Temporal Graph Network
TGAT	Temporal Graph Attention Network
TGNN	Temporal Graph Neural Network
TPP	Temporal Point Process
TSA	Temporal Self-Attention