Automated Dilated Spatio-Temporal Synchronous Graph Modeling for Traffic Prediction

Guangyin Jin*, Fuxian Li*, Jinlei Zhang, Mudan Wang and Jincai Huang

F. Li and M. Wang are with Beijing National Research Center for Information Science and Technology (BNRist), Department of Electronic Engineering, Tsinghua University, Beijing 100084, China.
E-mail: [email protected], [email protected]
G. Jin and J. Huang are with College of Systems Engineering, National University of Defense Technology, Changsha, China.
E-mail: [email protected], [email protected]
J. Zhang is with State Key Laboratory of Rail Traffic Control and Safety, Beijing Jiaotong University, Beijing 100044, China.
E-mail: [email protected]
*Both authors contributed equally to this research.

Abstract

Accurate traffic prediction is a challenging task in intelligent transportation systems because of the complex spatio-temporal dependencies in transportation networks. Many existing works combine sophisticated temporal modeling approaches with graph convolution networks (GCNs) to capture short-term and long-term spatio-temporal dependencies. However, these separate modules with complicated designs can restrict the effectiveness and efficiency of spatio-temporal representation learning. Furthermore, most previous works adopt fixed graph construction methods to characterize the global spatio-temporal relations, which limits the learning capability of the model across different time periods and even different data scenarios. To overcome these limitations, we propose an automated dilated spatio-temporal synchronous graph network for traffic prediction, named Auto-DSTSGN. Specifically, we design an automated dilated spatio-temporal synchronous graph (Auto-DSTSG) module that captures short-term and long-term spatio-temporal correlations by stacking deeper layers with dilation factors in increasing order. Further, we propose a graph structure search approach that automatically constructs spatio-temporal synchronous graphs adapted to different data scenarios. Extensive experiments on four real-world datasets demonstrate that our model achieves improvements of about $10\%$ over state-of-the-art methods. Source code is available at https://github.com/jinguangyin/Auto-DSTSGN.

Index Terms:

Traffic prediction, spatio-temporal modeling, graph neural networks, automated machine learning

I Introduction

Traffic prediction plays a basic but crucial role in intelligent transportation systems and has been widely deployed in online services such as navigation and ride-hailing. Effective spatio-temporal modeling is key to obtaining more precise predictions. In recent years, most works have used spatial modules, including convolutional neural networks (CNNs) and graph neural networks (GNNs), together with temporal modules such as recurrent neural networks (RNNs) and temporal convolution networks (TCNs) for spatio-temporal modeling [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19], but the complex correlations at the spatio-temporal scale are still difficult to learn.

First, both geo-spatial relations and pattern similarities exist in a traffic network at the same time. As shown in Figure 1(a), we can construct a spatial graph and a temporal graph to characterize, respectively, the proximity (e.g., a tourist district and a business district) and similar patterns (e.g., two different tourist districts) between different nodes. Second, the two kinds of dependencies not only exist within the same time interval but also have influences across different time steps. The correlations across different time steps can be short-term (e.g., time step $t_{1}$ to $t_{2}$) or long-term (e.g., time step $t_{1}$ to $t_{12}$), as shown in Figure 1(b). Although many graph-based deep learning models have been shown to be effective in traffic prediction, there are still at least two limitations.

Figure 1: Example of spatial and temporal graph construction and spatio-temporal dependencies in a network. The blue, green and red dotted arrows represent the cross-time self-connection, cross-time geo-spatial relations and cross-time pattern similarities, respectively.

(a) Capturing complex long-short term spatio-temporal dependencies. Most works employ GNNs and separate temporal learning components to capture the spatial and temporal correlations, respectively [20]. For temporal modeling, the RNN is a classical method but suffers from vanishing or exploding gradients when learning long-range sequences [21]. Variants such as LSTM [22] and GRU [23] alleviate the gradient problem to some extent, but still suffer from low computational efficiency due to their recurrent structure. Self-attention mechanisms such as the Transformer [24, 3] are designed to capture long-term dependencies, but they are still time-consuming during the training and inference phases. TCN balances performance and efficiency, but it struggles to capture long-term dependencies with fixed kernel sizes [2]. To address this problem, Graph WaveNet [6] integrates GCN and dilated TCN to capture long-short term dynamics by stacking layers with dilation factors in increasing order. However, a framework with separate modules still struggles to capture more complex dependencies at the spatio-temporal scale, as shown in Figure 1(b). To enhance the learning capability for spatio-temporal dependencies, STSGCN [25] and STFGNN [26] construct spatio-temporal synchronous graphs (STSGs). However, their fixed receptive field limits their capability to learn both short-term and long-term spatio-temporal dependencies.

(b) Flexible spatio-temporal graph construction. In most previous works, the spatio-temporal graph is manually designed and fixed across different time periods [2, 25, 5, 26, 27, 28, 29]. Moreover, the graph construction method cannot be adjusted even across different datasets. Hence, this approach can hardly characterize diverse spatio-temporal relations for different time periods and different datasets. The learnable mask [25, 26] and the adaptive graph [6, 30, 31, 14, 32] were introduced to overcome the limitation of a fixed spatio-temporal graph, but they still have disadvantages. On the one hand, their modeling is global, so they fail to characterize spatio-temporal relations for different time periods. On the other hand, they cannot be associated with the characteristics of the data itself, so they have weak interpretability.

To address the above problems, we propose a novel framework for traffic prediction, called Automated Dilated Spatio-Temporal Synchronous Graph Network (Auto-DSTSGN). Specifically, we design a dilated spatio-temporal synchronous graph framework to flexibly capture short-term and long-term complex spatio-temporal dependencies. Further, a graph structure search operation is proposed to automatically construct flexible and diverse STSGs according to the data input in different time periods. Our main contributions are summarized as follows:

• We design a dilated spatio-temporal synchronous graph framework to capture spatio-temporal correlations efficiently, whose receptive field grows by stacking deeper layers with dilation factors in increasing order. This framework can capture both short-term and long-term dependencies with relatively low computational burden and GPU occupancy.

• We propose the graph structure search operation based on the DARTS framework. As far as we know, it is the first attempt to search adjacency matrices rather than neural architectures by automated machine learning methods.

• We conduct extensive experiments on four public traffic datasets. The experimental results demonstrate that our model obtains improvements of 4.9% to 10.3% over the state-of-the-art baselines.

II Related Work

II-A Traffic Prediction

In recent years, spatio-temporal graph modeling has become a mainstream method for traffic prediction. Most of these works combine spatial graph convolution networks (GCNs) with a temporal encoder to capture the complex spatio-temporal dependencies. STGCN [2] first integrates TCNs and GCNs for spatio-temporal modeling. Based on this, ASTGCN [8] adopts an attention mechanism to enhance the representation learning capability of GCNs and TCNs, and Graph WaveNet [6] combines an adaptive graph with dilated temporal convolution networks. The RNN is a widely used model for sequence learning, and its variant, the gated recurrent unit (GRU), is used in DCRNN [5] and T-GCN [33] to capture temporal correlations of hidden representations from GCNs. In addition, the self-attention mechanism is a powerful tool not only for spatial dynamic learning but also for capturing temporal dependencies [24]. Both STGNN [3] and GMAN [34] adopt the self-attention mechanism in spatial GCNs and in temporal dependency learning. Some of the most recent works introduce novel frameworks in this field. STSGCN [25] first proposes the framework of spatio-temporal synchronous modeling. Based on this, STFGNN [26] presents an informative fusion graph and parallel TCNs to further improve spatio-temporal dependency learning. AutoSTG [35] first combines a neural architecture search approach with a spatio-temporal graph framework to improve the adaptability to different data. However, these existing works struggle both to flexibly capture long-short term complex spatio-temporal dependencies and to characterize the spatio-temporal relations in different periods. Different from them, our model balances short-term and long-term spatio-temporal correlations through the dilation mechanism, and the graph search operation in our model automatically constructs diverse spatio-temporal relations for different datasets.

II-B Automated Machine Learning

Automated Machine Learning (AutoML) aims to obtain appropriate features or models for various downstream tasks [36]. This field can be roughly divided into two main categories: automated feature engineering and automated model design. Automated feature engineering aims to synthesize or select informative features for model training [37, 38, 39], while automated model design aims to select appropriate models or construct reasonable model architectures. Neural architecture search (NAS) is one of the most important directions in automated model design and is widely used in deep learning. There are three mainstream types of NAS methods: reinforcement learning-based [40, 41, 42], evolutionary learning-based [43, 44, 45] and gradient-based [46, 47, 48]. Among these three types, gradient-based methods are relatively more efficient. Since efficiency is important in traffic prediction, we adopt the gradient-based framework DARTS [46] in this paper. Different from previous works, we employ DARTS to search graph structures rather than neural architectures.

III Preliminary

III-A Problem Definition

Given the traffic sensor set $V$ with $N$ nodes ($|V|=N$), the sensor network can be defined as a graph $\mathcal{G}=(V,E,A)$. $E$ denotes the set of edges, whose relations between different nodes are characterized by the adjacency matrix $A$. The graph signal at time step $t$ contains $d$-dimensional original traffic features (e.g., speed and volume) and is defined as $\mathbf{X}_{\mathcal{G}}^{(t)}\in\mathbb{R}^{N\times d}$. The aim of the traffic prediction task on the sensor network is to learn a non-linear function $f(\cdot)$ that forecasts the next $T^{\prime}$-step graph signals from the historical $T$-step graph signals. The mathematical form is defined as follows:

$$[\mathbf{X}_{\mathcal{G}}^{(t-T+1)},\cdots,\mathbf{X}_{\mathcal{G}}^{(t)}]\xrightarrow{f(\cdot)}[\mathbf{X}_{\mathcal{G}}^{(t+1)},\cdots,\mathbf{X}_{\mathcal{G}}^{(t+T^{\prime})}]. \tag{1}$$
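As an illustration of the mapping in Eq. (1), the following minimal sketch splits a sequence of graph signals into (history, future) pairs. The function name `make_windows` and the argument `T_out` (standing for $T^{\prime}$) are hypothetical and not part of the original work; each element of `signals` stands for one graph signal and can be any object.

```python
from typing import List, Sequence, Tuple


def make_windows(signals: Sequence, T: int, T_out: int) -> List[Tuple[list, list]]:
    """Split a sequence of graph signals into (history, future) pairs.

    signals[t] stands for the graph signal at time step t; each pair holds
    T historical steps followed by the next T_out steps.
    """
    pairs = []
    # `end` is the first index of the future window, so the history is
    # signals[end - T:end] and the future is signals[end:end + T_out].
    for end in range(T, len(signals) - T_out + 1):
        history = list(signals[end - T:end])
        future = list(signals[end:end + T_out])
        pairs.append((history, future))
    return pairs
```

With 12 historical steps and 12 future steps, a sequence of length $L$ would yield $L-23$ such pairs under this sketch.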

III-B Meta Graph Construction

In this paper, the graph that characterizes the spatio-temporal relations in one time step is called the meta graph, which is the basic unit of STSGs. To take both geographical proximity and pattern similarity into account, we introduce two types of meta graph: the spatial graph $A_{SG}\in\mathbb{R}^{N\times N}$ and the temporal graph $A_{TG}\in\mathbb{R}^{N\times N}$. The adjacency matrix of the spatial graph can be formulated as:

$$A^{ij}_{SG}=\begin{cases}1,&\text{if } v_{i} \text{ connects to } v_{j}\\ 0,&\text{otherwise}\end{cases} \tag{2}$$

Dynamic Time Warping (DTW) algorithm is adopted to calculate the similarity of two time series [49], which can characterize the pattern similarity between different nodes. For example, given two time series $X=(x_{1},x_{2},\cdots,x_{m})$ and $Y=(y_{1},y_{2},\cdots,y_{n})$ , DTW is a dynamic programming algorithm defined as:

$$D(i,j)=|x_{i}-y_{j}|+\min\left(D(i-1,j),D(i,j-1),D(i-1,j-1)\right), \tag{3}$$

where $D(i,j)$ denotes the shortest distance between the sub-sequences $X_{s}=(x_{1},x_{2},\cdots,x_{i})$ and $Y_{s}=(y_{1},y_{2},\cdots,y_{j})$. As a result, we obtain $DTW(X,Y)=D(m,n)$ as the final distance between $X$ and $Y$. This method does not require the input sequences to have equal length, and it reveals the similarity of two time series better than the Euclidean distance. Thus, we can define the adjacency matrix of the temporal graph through the DTW distance as follows:

$$A^{ij}_{TG}=\begin{cases}1,&\text{if } DTW(X^{i},X^{j})<\epsilon\\ 0,&\text{otherwise}\end{cases} \tag{4}$$

where $X^{i}$ and $X^{j}$ are speed data series attached to node $i$ and node $j$ respectively. $\epsilon$ is a threshold to control the sparsity of $A_{TG}$ , the setting of which is the same as that in [26].
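The recurrence in Eq. (3) and the thresholding in Eq. (4) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the boundary conditions ($D(0,0)=0$ and infinite cost for the other border cells) are the usual DTW convention and are assumed here, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """DTW distance D(m, n) computed with the recurrence of Eq. (3)."""
    m, n = len(x), len(y)
    # Assumed boundary: D(0, 0) = 0 and every other border cell is infinite.
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[m][n]


def temporal_graph(series: Sequence[Sequence[float]], eps: float) -> List[List[int]]:
    """Binary adjacency matrix of Eq. (4): 1 where the DTW distance is below eps."""
    n = len(series)
    return [
        [1 if dtw_distance(series[i], series[j]) < eps else 0 for j in range(n)]
        for i in range(n)
    ]
```

Note that the two sequences may have different lengths, and that the diagonal of the resulting matrix is 1 for any positive threshold because the DTW distance of a series to itself is 0.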

III-C Spatio-Temporal Synchronous Graph Modeling

Spatio-temporal synchronous graph (STSG) is a special architecture to establish the unified spatio-temporal correlations by graph structures, which is first proposed in [25]. Given a spatial graph with $N$ nodes, the adjacency matrix of STSG in [25] is designed as a $3N\times 3N$ expanded matrix, as shown in Fig. 2(a). In [26], the STSG is extended to $4N\times 4N$ and improved by involving more informative graphs such as temporal graph, which is shown in Fig. 2(b). Although this modeling approach can characterize more complex spatio-temporal relations, the larger expanded STSG could bring the problem of high computational overhead.

IV Methodology

The overview of our model is illustrated in Figure 3. To enhance the capability of representation learning, we employ a fully connected layer at the input of our model to transform the input graph signal into a high-dimensional space. Then we design multiple stacked layers for extracting spatio-temporal structural information. At each layer, we design an Automated Dilated Spatio-Temporal Synchronous Graph Convolution (Auto-DSTSG) Module with multiple parallel blocks for modeling complex spatio-temporal dependencies. Specifically, the Auto-DSTSG Module in each layer expands the receptive field through increasing dilation factors to capture both short-term and long-term complex spatio-temporal dependencies, which addresses the first limitation in Sec. I. In each parallel Auto-DSTSG block, we also propose the Graph Structure Search (GSS) Operation to automatically construct the adjacency matrix of the spatio-temporal synchronous graph (STSG), which can characterize the flexible and diverse spatio-temporal relations in different time periods. This deals with the second limitation in Sec. I. Graph convolution layers are then adopted to capture the spatio-temporal correlations based on the constructed STSGs. Each block also contains several supporting operators: cropping, gated linear units (GLUs) and max pooling. The cropping operation ensures dimensional consistency with the output. The gated linear units increase the nonlinearity of the graph convolutional features. Max pooling preserves the most distinctive features. Their details are given in subsection IV-A. In addition, we design a Dilated Temporal Convolution Module to enhance the global temporal correlations, whose output is aggregated with the output of the Auto-DSTSG Module in each layer. We further adopt a residual connection and a skip connection in each layer of Auto-DSTSGN. The hidden information of each layer flows to the next layer after passing through the residual connection, and it is also output through the skip connection. At the output end of the framework, the outputs from the skip connections of all layers are summed up as the input to the MLP to obtain the predictions.

In the following subsections, we introduce the Automated Dilated Spatio-Temporal Synchronous Graph Convolution Module, the Graph Structure Search Operation, the Dilated Temporal Convolution Module and some other components in detail. In addition, we give a brief overview of the optimization algorithm of graph structure search. To facilitate understanding of the following expressions, we list the definitions of some important notations and operators used throughout the methodology in Table I.

TABLE I: The definition of some important notations and operators in our methodology.

Symbols	Definition
$\mathcal{F}_{G}(\cdot)$	Graph structure search operation
$A_{ST}$	Final adjacency matrices of STSGs
$\mathcal{M}_{A}$	Mixed adjacency matrices of STSGs
$\mathcal{M}_{1}$	Mixed main-diagonal matrices of STSGs
$\mathcal{M}_{2}$	Mixed sub-diagonal matrices of STSGs
$\mathcal{C}(\cdot)$	Cropping operation for STSGs

IV-A Automated Dilated Spatio-Temporal Synchronous Graph Convolution Module

Recall that a normal spatial GCN layer [50] has the form:

$$\mathbf{Z}=\sigma(\hat{\mathbf{A}}\cdot\mathbf{X}\cdot\mathbf{\Theta}), \tag{5}$$

where $\mathbf{X}\in\mathbb{R}^{N\times D}$ and $\mathbf{Z}\in\mathbb{R}^{N\times D^{\prime}}$ denote the input and output node embeddings of the GCN layer, respectively. $\mathbf{\Theta}\in\mathbb{R}^{D\times D^{\prime}}$ denotes the shared weight for the nodes’ feature mapping and $\sigma(\cdot)$ denotes the activation function. $\hat{\mathbf{A}}\in\mathbb{R}^{N\times N}$ denotes the normalized adjacency matrix, which represents the message passing between one-hop neighbors.

To capture the complex spatio-temporal correlations, the normal spatial GCN can be extended to the spatio-temporal scale. A special GCN-based framework called the spatio-temporal synchronous graph convolution network was proposed in [25], but this modeling approach has two main limitations. The first is that the fixed spatio-temporal receptive field limits the learning capability for long-term dependencies. To capture longer-term spatio-temporal dependencies with this framework, the receptive field must become larger. Whenever the receptive field is expanded by one unit, the amount of computation in the matrix multiplication grows rapidly (quadratically in the number of covered time steps for a dense adjacency matrix). The second limitation is that the adjacency matrix of the STSG is shared by all spatio-temporal synchronous graph convolution modules, so it cannot reflect the diverse spatio-temporal correlations in different periods. In addition, a manually designed deterministic adjacency matrix can hardly characterize the complex relations between different spatio-temporal nodes for different datasets.

To overcome these problems, we design the Auto-DSTSG module with multiple parallel blocks. The form of the Auto-DSTSG block is defined as:

$$\mathbf{Z}=\sigma(\mathcal{C}(\mathcal{F}_{G}(\mathbf{X}(t,k,d))\cdot\mathbf{\Theta})), \tag{6}$$

$$\mathbf{X}(t,k,d)=[\mathbf{x}(t-d\times(k-1)),\dots,\mathbf{x}(t-d),\mathbf{x}(t)], \tag{7}$$

where $\mathbf{X}(t,k,d)\in\mathbb{R}^{kN\times D}$ and $\mathbf{Z}\in\mathbb{R}^{N\times D^{\prime}}$ denote the input and output representations, respectively. $k$ denotes the kernel size and $d$ denotes the dilation factor, which controls the skipping distance in the input sequence. $[\cdot]$ denotes the concatenation operation. Compared with the normal GCN, the receptive field is expanded $d\times k$ times when the spatio-temporal synchronous graph covers $k$ time steps. Here $\mathcal{F}_{G}(\cdot)$ represents the graph structure search operation, whose output during the searching phase is a mixed adjacency matrix $\mathcal{M}_{A}\in\mathbb{R}^{kN\times kN}$. This operation adjusts the structure of STSGs according to the data, so it can characterize complex spatio-temporal relations in different periods and even different datasets. $\mathcal{C}(\cdot)$ denotes the cropping operation, which reduces the number of rows from $kN$ to $N$. The common method is to select the graph data of the middle or last time step as the final output, as described in [3]. When $\mathcal{F}_{G}(\cdot)$ is a fixed adjacency matrix with $k=1,d=1$ and $\mathcal{C}(\cdot)$ is an identity mapping, Eq. (6) is equivalent to Eq. (5).
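To make the indexing of Eq. (7) concrete, the following sketch gathers the $k$ dilated time steps ending at $t$ and stacks their node features into a $kN\times D$ matrix. The names `dilated_indices` and `dilated_input` and the nested-list data layout are illustrative assumptions, not the authors' code.

```python
from typing import List

Matrix = List[List[float]]


def dilated_indices(t: int, k: int, d: int) -> List[int]:
    """Time steps t - d*(k-1), ..., t - d, t used in Eq. (7)."""
    return [t - d * s for s in range(k - 1, -1, -1)]


def dilated_input(x: List[Matrix], t: int, k: int, d: int) -> Matrix:
    """Stack the N x D signals of the selected steps into a kN x D matrix.

    x[step] is assumed to be the N x D node-feature matrix at that step.
    """
    steps = dilated_indices(t, k, d)
    if steps[0] < 0 or t >= len(x):
        raise ValueError("requested time steps fall outside the sequence")
    stacked: Matrix = []
    for step in steps:
        stacked.extend(x[step])
    return stacked
```

For example, with $k=2$ and $d=4$ the block ending at step 11 would read steps 7 and 11.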

IV-A1 Dilated Spatio-temporal Synchronous Graph Convolution Framework

The framework of stacked dilated temporal convolution was first proposed in [51]. Specifically, long-term correlations can be captured effectively by stacking deeper layers with dilation factors in increasing order. Inspired by this, we extend the framework to spatio-temporal synchronous graph convolution. In this framework, even a small kernel size suffices for learning long-term dependencies, so we do not need large STSGs to cover long-range spatio-temporal correlations. Small kernel sizes also greatly reduce the computational burden and memory occupancy. Thus, the kernel size $k$ is fixed as 2 in our model. Suppose the input length of our model is 12 time steps; we can then design a four-layer framework with the sequence of dilation factors $[1,2,4,4]$ to cover the whole receptive field of the 12 time steps, as shown in Figure 3(b). In this way, the short-term spatio-temporal correlations are captured in shallower layers while the long-term dynamics are extracted in deeper layers, so both short-term and long-term spatio-temporal dependencies are taken into account.
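The claim that dilation factors $[1,2,4,4]$ with kernel size 2 cover 12 time steps can be checked with a short calculation. The sketch below assumes that no padding is used, so a layer with dilation $d$ and kernel size $k$ shortens the sequence by $d\times(k-1)$ steps; the helper names are hypothetical.

```python
from typing import List, Sequence


def layer_lengths(T: int, dilations: Sequence[int], k: int = 2) -> List[int]:
    """Sequence length after each layer, assuming no padding."""
    lengths = [T]
    for d in dilations:
        lengths.append(lengths[-1] - d * (k - 1))
    return lengths


def receptive_field(dilations: Sequence[int], k: int = 2) -> int:
    """Number of input steps seen by one output step of the stack."""
    return 1 + sum(d * (k - 1) for d in dilations)
```

Under this assumption a 12-step input shrinks to lengths 11, 9, 5 and finally 1, and the single remaining output step sees all 12 input steps.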

IV-A2 Graph Structure Search

For each Auto-DSTSG block, the most important part is graph structure search operation $\mathcal{F}_{G}(\cdot)$ , which can flexibly construct the adjacency matrix of spatio-temporal synchronous graph (STSG). According to [25, 26], there are three constraints to construct STSG: a) the meta graph on the main diagonal in STSG must be spatial graph (SG) or temporal graph (TG). b) the matrix on the sub diagonal in STSG must not be the zero matrix. c) the complete adjacency matrix of STSG is assumed to be symmetric.

The meta graphs on the main diagonal of the STSG determine the dependencies within each time step, while the meta graph on the sub diagonal controls the correlations between different time steps. When the size of the STSG is fixed as $2N\times 2N$, there are four possible options on the main diagonal: $[TG,TG]$, $[TG,SG]$, $[SG,TG]$ and $[SG,SG]$. There are three possible options on the sub diagonal: $TG$, $SG$ and $TC$, where $TC$ denotes the identity matrix that describes self-connectivity. We divide the complete adjacency matrix into two groups according to the candidate sub-matrices on the main diagonal and on the sub diagonal. It is worth mentioning that this grouping method can also be extended to cases where $k$ is larger. The search space of these two groups is determined by the possible options discussed above.
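The two candidate groups for a $2N\times 2N$ STSG can be enumerated as in the sketch below. It assumes that each sub-diagonal candidate is placed in the upper-right block with its transpose in the lower-left block, which keeps the complete matrix symmetric when the meta graphs are symmetric; the helper names and the dictionary layout are illustrative choices, not the authors' code.

```python
from itertools import product
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def zeros(n: int) -> Matrix:
    return [[0] * n for _ in range(n)]


def identity(n: int) -> Matrix:
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def transpose(a: Matrix) -> Matrix:
    return [list(row) for row in zip(*a)]


def block2x2(a: Matrix, b: Matrix, c: Matrix, d: Matrix) -> Matrix:
    """Assemble [[a, b], [c, d]] from four N x N blocks."""
    top = [ra + rb for ra, rb in zip(a, b)]
    bottom = [rc + rd for rc, rd in zip(c, d)]
    return top + bottom


def candidate_sets(sg: Matrix, tg: Matrix) -> Tuple[Dict, Dict]:
    """Return the main-diagonal and sub-diagonal candidate groups."""
    n = len(sg)
    meta = {"SG": sg, "TG": tg}
    # Four options: one meta graph per time step on the main diagonal.
    main = {
        (p, q): block2x2(meta[p], zeros(n), zeros(n), meta[q])
        for p, q in product(("TG", "SG"), repeat=2)
    }
    # Three options: TG, SG or the identity TC on the off-diagonal blocks.
    off = dict(meta, TC=identity(n))
    sub = {
        name: block2x2(zeros(n), g, transpose(g), zeros(n))
        for name, g in off.items()
    }
    return main, sub
```

Combining one candidate from each group gives $4\times 3=12$ possible STSG structures for $k=2$.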

In Figure 4, we use an example to illustrate the selection process of the graph structure search operation. Suppose that, in a traffic scenario during rush hours as shown in Figure 4(a), the pattern similarities between districts with the same function are more significant (e.g., districts $B$ and $C$). The temporal graph then better characterizes the dependencies within each time step and can be selected as the meta graph on the main diagonal of the STSG. Assume that a congestion event occurs in $C$; the congestion will propagate over time from $C$ to the spatially adjacent node $A$, as shown by the red arrow in Figure 4(a). The spatial graph then better characterizes the dependencies across time steps and can be selected as the meta graph on the sub diagonal. In Figure 4(b), nodes with similar patterns are marked by the same color, significant dependencies are depicted by solid lines and insignificant dependencies are depicted by dotted lines. Finally, we sum the selected options on the main diagonal and the sub diagonal to obtain the complete STSG, as shown in Figure 4(c).

Accordingly, we design a two-group graph structure search method inspired by the DARTS framework [46]. Although adjacency matrices of STSGs are not neural architectures, they determine the mode of message passing in graph convolution networks. Thus, the structure of the spatio-temporal synchronous adjacency matrices can be seen as a special architecture in our model. Similar to the mixed operation in the classical DARTS framework, the mixed adjacency matrix generated in the searching phase is formulated as follows:

$$\mathcal{M}_{1}=\sum_{m_{1}\in\mathbf{M_{1}}}\frac{\exp(\alpha_{m_{1}})}{\sum_{m_{1}^{\prime}\in\mathbf{M_{1}}}\exp(\alpha_{m_{1}^{\prime}})}m_{1}, \tag{8}$$

$$\mathcal{M}_{2}=\sum_{m_{2}\in\mathbf{M_{2}}}\frac{\exp(\alpha_{m_{2}})}{\sum_{m_{2}^{\prime}\in\mathbf{M_{2}}}\exp(\alpha_{m_{2}^{\prime}})}m_{2}, \tag{9}$$

$$\mathcal{M}_{A}=\mathcal{M}_{1}+\mathcal{M}_{2}, \tag{10}$$

where $m_{1}\in\mathbf{R}^{2N\times 2N}$ and $m_{2}\in\mathbf{R}^{2N\times 2N}$ are respectively the candidate matrix of the case of main diagonal and the sub-diagonal, $\mathbf{M_{1}}=\{M_{1}^{(1)},M_{1}^{(2)},\cdots\}$ and $\mathbf{M_{2}}=\{M_{2}^{(1)},M_{2}^{(2)},\cdots\}$ are the set of pre-defined candidate matrices of the two groups. $\mathcal{M}_{A}$ is the complete mixed adjacency matrix as the final output from GSS operation in searching phase. $\alpha_{m_{1}}$ and $\alpha_{m_{2}}$ are respectively the learnable parameters to weight the candidate matrix $m_{1}$ and $m_{2}$ .

For each group, the best sub-graph structure is the candidate that yields the lowest validation loss. During the training phase, we replace each of the mixed matrices $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ with the candidate of highest confidence (i.e., the largest $\alpha$) to get the final graph structure $A_{ST}$, which can be formulated as follows:

$$A_{ST}=\Big(\operatorname*{arg\,max}_{m_{1}\in\mathbf{M_{1}}}\alpha_{m_{1}}\Big)+\Big(\operatorname*{arg\,max}_{m_{2}\in\mathbf{M_{2}}}\alpha_{m_{2}}\Big). \tag{11}$$
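The softmax mixing of Eqs. (8)-(10) and the discrete selection of Eq. (11) can be sketched with plain Python lists. The function names are hypothetical, the candidate matrices in any usage are toy values, and the numerically stabilised softmax is an implementation choice of this sketch.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix(candidates: Sequence[Matrix], alphas: Sequence[float]) -> Matrix:
    """Softmax-weighted sum of candidate matrices, as in Eqs. (8) and (9)."""
    weights = softmax(alphas)
    n = len(candidates[0])
    return [
        [sum(w * c[i][j] for w, c in zip(weights, candidates)) for j in range(n)]
        for i in range(n)
    ]


def add(a: Matrix, b: Matrix) -> Matrix:
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def mixed_adjacency(main, alpha1, sub, alpha2) -> Matrix:
    """Mixed matrix used during the searching phase, Eq. (10)."""
    return add(mix(main, alpha1), mix(sub, alpha2))


def final_adjacency(main, alpha1, sub, alpha2) -> Matrix:
    """Highest-scoring candidate of each group, summed as in Eq. (11)."""
    best1 = main[max(range(len(alpha1)), key=lambda i: alpha1[i])]
    best2 = sub[max(range(len(alpha2)), key=lambda i: alpha2[i])]
    return add(best1, best2)
```

With equal scores every candidate in a group receives the same weight, so the mixed matrix is the plain average of the group's candidates.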

IV-A3 Mixed-Hop Graph Convolution

With the mixed adjacency matrix $\mathcal{M}_{A}$ during searching phase or the final adjacency matrix $A_{ST}$ during training phase, we also adopt the mixed-hop mechanism to capture complex spatio-temporal information from different hops. In addition, gated linear unit (GLU) is used as the mapping function for each hop. The graph convolution operation for each hop can be formulated as:

$$\mathcal{H}_{i}=GLU_{i}(\mathcal{C}(\mathbb{A}^{i}\cdot X)), \tag{12}$$

where $\mathbb{A}^{i}\in\mathbb{R}^{2N\times 2N}$ denotes the $i$-th hop adjacency matrix during the searching phase or the training phase, $X\in\mathbb{R}^{2N\times D}$ denotes the input data and $\mathcal{H}_{i}\in\mathbb{R}^{N\times D}$ is the output of the $i$-th hop graph. $GLU_{i}(\cdot)$ is the individual operation for the $i$-th hop graph convolution, which is formulated as:

$$GLU_{i}(X_{c})=(X_{c}\cdot W_{1}+b_{1})\odot\sigma(X_{c}\cdot W_{2}+b_{2}), \tag{13}$$

where $W_{1},W_{2}\in\mathbb{R}^{D\times D^{\prime}}$ , $b_{1},b_{2}\in\mathbb{R}^{D^{\prime}}$ are weights and bias of GLU, $\odot$ represents element-wise product, $\sigma$ denotes the Sigmoid function and $X_{c}\in R^{N\times D}$ is the output from the cropping operation. Finally, we adopt the max-pooling approach to aggregate the graph information from different hops as the output of the Auto-DSTSG block, which is defined as:

$$\mathcal{H}_{\mathbf{g}}=MaxPooling([\mathcal{H}_{1},\cdots,\mathcal{H}_{i}])\in\mathbb{R}^{N\times D}. \tag{14}$$
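Eqs. (12)-(14) can be illustrated with a small pure-Python sketch. It makes two assumptions that the text leaves open: the $i$-th hop is computed by applying the adjacency matrix $i$ times, and the cropping operation keeps the rows of the last time step. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
GluParams = Tuple[Matrix, List[float], Matrix, List[float]]  # W1, b1, W2, b2


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def glu(x: Matrix, w1: Matrix, b1: List[float], w2: Matrix, b2: List[float]) -> Matrix:
    """Gated linear unit of Eq. (13): (x W1 + b1) * sigmoid(x W2 + b2)."""
    lin, gate = matmul(x, w1), matmul(x, w2)
    return [
        [(l + p) * sigmoid(g + q) for l, p, g, q in zip(lr, b1, gr, b2)]
        for lr, gr in zip(lin, gate)
    ]


def mixed_hop_block(adj: Matrix, x: Matrix, hop_params: Sequence[GluParams], n_nodes: int) -> Matrix:
    """Crop and gate every hop (Eq. 12), then max-pool over hops (Eq. 14)."""
    hops = []
    h = x
    for w1, b1, w2, b2 in hop_params:
        h = matmul(adj, h)      # assumed: hop i applies the adjacency i times
        cropped = h[-n_nodes:]  # assumed: cropping keeps the last time step
        hops.append(glu(cropped, w1, b1, w2, b2))
    # Element-wise maximum across hops for every node and feature.
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*hops)]
```

Each hop has its own GLU parameters, which matches the statement that $GLU_{i}(\cdot)$ is an individual operation per hop.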

The spatio-temporal input data is processed by multiple Auto-DSTSG blocks independently and in parallel in each layer. The output of the Auto-DSTSG module in each layer is formulated as:

$$X_{\mathbf{g}}=[\mathcal{H}_{\mathbf{g}}^{0},\dots,\mathcal{H}_{\mathbf{g}}^{T-d-1}]\in\mathbb{R}^{N\times(T-d)\times D}, \tag{15}$$

where $T$ denotes the number of time steps of the input data and $\mathcal{H}_{\mathbf{g}}^{i}$ denotes the output of the $i$-th Auto-DSTSG block.

IV-B Dilated Temporal Convolution Module

The weight sharing mechanism in temporal convolution is conducive to learning global temporal dependencies for individual nodes [26, 52]. Although Auto-DSTSG module can flexibly capture the complex spatio-temporal correlations in a unified framework, its weight non-sharing mechanism for different blocks is more conducive to capturing diverse local spatio-temporal dependencies for different time periods rather than global dependencies for each node.

The dilated temporal convolution operation is defined as:

$$\mathbf{x}\star\mathbf{f}_{\Theta}(t)=\sum_{s=0}^{k-1}\mathbf{f}_{\Theta}(s)\,\mathbf{x}(t-d\times s), \tag{16}$$

where $\mathbf{x}\in\mathbb{R}^{T}$ denotes the given 1D sequence input, $\mathbf{f}_{\Theta}\in\mathbb{R}^{k}$ denotes a temporal convolutional filter with learnable weights $\Theta$, $t$ denotes the current time step and $d$ denotes the dilation factor. Similar to [51], we obtain a larger receptive field by increasing the dilation factor of the temporal convolution as the layers go deeper. To keep the receptive field consistent with that of the Auto-DSTSG module in each layer, the kernel size $k$ of the temporal convolution is fixed as 2 with the sequence of dilation factors $[1,2,4,4]$.

To better control the information flow and reserve the useful information, we adopt the gating mechanism in this case. Similar to [53], a simple gated temporal convolution network only contains an output gate, which is expressed as:

$$X_{\mathbf{f}}=\tanh(\mathbf{\Theta_{1}}\star X+\mathbf{b_{1}})\odot\sigma(\mathbf{\Theta_{2}}\star X+\mathbf{b_{2}}), \tag{17}$$

where $X\in\mathbb{R}^{N\times T\times D}$ is the given input, $X_{\mathbf{f}}\in\mathbb{R}^{N\times(T-d(k-1))\times D^{\prime}}$ is the output (i.e., $T-d$ time steps for $k=2$), $\mathbf{\Theta_{1}}$, $\mathbf{\Theta_{2}}$, $\mathbf{b_{1}}$ and $\mathbf{b_{2}}$ are model parameters, $\odot$ is the element-wise product, $\tanh(\cdot)$ is the activation function of the outputs, and $\sigma(\cdot)$ is the sigmoid function, which controls the ratio of information passed on to the next layer.
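For a single node and a single feature channel, the dilated convolution of Eq. (16) and the gating of Eq. (17) reduce to the sketch below. It assumes that the filter weight is indexed by the tap position $s$ and that no padding is applied; the function names are hypothetical and the scalar biases are a simplification.

```python
import math
from typing import List, Sequence


def dilated_conv(x: Sequence[float], f: Sequence[float], d: int) -> List[float]:
    """Valid dilated 1D convolution: out(t) = sum_s f[s] * x[t - d*s]."""
    k = len(f)
    start = d * (k - 1)  # first step with a complete set of taps
    return [sum(f[s] * x[t - d * s] for s in range(k)) for t in range(start, len(x))]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gated_tcn(x: Sequence[float], f1: Sequence[float], b1: float,
              f2: Sequence[float], b2: float, d: int) -> List[float]:
    """tanh branch multiplied element-wise by a sigmoid gate, as in Eq. (17)."""
    content = dilated_conv(x, f1, d)
    gate = dilated_conv(x, f2, d)
    return [math.tanh(c + b1) * sigmoid(g + b2) for c, g in zip(content, gate)]
```

With $k=2$ the output has $T-d$ steps, which matches the length of the Auto-DSTSG output in Eq. (15) so that the two can be summed.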

To enhance the global temporal correlations for individual nodes, we aggregate the output of the dilated temporal convolution module and the output of Auto-DSTSG module in each layer. The output of each layer in Auto-DSTSGN can be defined as:

$$X_{\mathbf{out}}=\mathbf{Agg}(X_{\mathbf{g}},\ X_{\mathbf{f}}), \tag{18}$$

where $X_{\mathbf{out}}$ denotes the output from each layer, $X_{\mathbf{g}}$ and $X_{\mathbf{f}}$ are respectively the output of Auto-DSTSG module and dilated temporal convolution module. $\mathbf{Agg}(\cdot)$ denotes the aggregation function, which is set as sum function in our model.

IV-C Other Components

IV-C1 Residual connection

Residual connection is an effective approach to overcome the problem of vanishing gradient in deep neural networks. As shown in Figure 3(a), we employ it in each layer of our model, which is defined as:

$$X_{l}=X+FC(\mathcal{F}(X)), \tag{19}$$

where $X$ denotes the original input of each layer, $\mathcal{F}(\cdot)$ denotes the neural network mapping in each layer, $FC(\cdot)$ denotes the linear mapping and $X_{l}$ is the output of each layer.

IV-C2 Skip connection

Skip connection is adopted to fully exploit different level information and aggregate them to obtain powerful representation. As shown in Figure 3(a), each layer has a skip connection, which is defined as:

$$X_{s}=\sum_{i=0}^{l}FC_{i}(\mathcal{F}(X)), \tag{20}$$

where $X_{s}$ denotes the final representation obtained by aggregating the skip connections and $FC_{i}(\cdot)$ denotes the linear mapping applied at the $i$-th layer.

IV-C3 Input layer and output layer

At the top of our proposed framework, we use a fully connected layer to map the raw input into a high-dimensional tensor, which can enhance the capability of representation learning in deep neural networks.

In the output layer, we adopt a series of two-layer fully connected networks that do not share weights to produce the predictions at different time steps, which is expressed as follows:

\hat{y}_{i}=ReLU(X_{s}\cdot W_{i}^{1}+b_{i}^{1})\cdot W_{i}^{2}+b_{i}^{2},

(21)

where $X_{s}$ denotes the aggregated representation from the skip connections and $\hat{y}_{i}$ denotes the predicted result at the $i$-th step. In this manner, we obtain the predictions for the next $T$ steps by concatenating the predictions at each step, which is expressed as:

\hat{Y}=[\hat{y}_{1},\hat{y}_{2},\dots,\hat{y}_{T}].

(22)
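The per-step output heads of Eqs. (21) and (22) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released code: the helper names (`linear`, `init_head`, `predict_all_steps`), the toy dimensions and the uniform initialization are hypothetical, and a single representation vector stands in for $X_{s}$.

```python
import random


def linear(x, weight, bias):
    """Affine map x·W + b for a vector x (len d_in) and a d_in × d_out matrix."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(len(bias))]


def relu(v):
    return [max(0.0, a) for a in v]


def init_head(d_in, d_hidden, rng):
    """One two-layer head (W^1, b^1, W^2, b^2) with its own parameters."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
    b1 = [0.0] * d_hidden
    w2 = [[rng.uniform(-0.1, 0.1)] for _ in range(d_hidden)]
    b2 = [0.0]
    return w1, b1, w2, b2


def predict_all_steps(x_s, heads):
    """Apply head i to x_s to get y_hat_i, then collect all steps (Eq. 22)."""
    preds = []
    for w1, b1, w2, b2 in heads:
        hidden = relu(linear(x_s, w1, b1))
        preds.append(linear(hidden, w2, b2)[0])
    return preds


rng = random.Random(0)
heads = [init_head(d_in=8, d_hidden=4, rng=rng) for _ in range(12)]  # T = 12
x_s = [rng.uniform(-1.0, 1.0) for _ in range(8)]
y_hat = predict_all_steps(x_s, heads)
```

Because every head owns its parameters, each horizon step is fitted independently from the same shared representation.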

Finally, we select L1 loss as the loss function in our model:

L(Y,\hat{Y})=|Y-\hat{Y}|.

(23)

IV-D Searching Algorithm

In the graph structure search operation, all computations are differentiable. Similar to the DARTS framework, a bi-level gradient-based optimization algorithm can be employed to alternately update the weight parameters $\theta$ of the network (including the parameters in TCNs, GLUs, FCs and MLP) and the architecture parameters $\omega$ (including the scores of candidate matrices). As shown in Algorithm 1, the weight parameters (Line 4-5) and architecture parameters (Line 6-7) are alternately updated based on the training and validation sets, respectively, until the stopping criterion is met. Then, the structures of STSGs can be obtained by selecting the candidate operations with the highest operation scores.

Algorithm 1 Optimization algorithm of Auto-DSTSGN.

Input: Traffic data from road networks $[V_{1},\dots,V_{N}]$, adjacency matrix of spatial graph $A_{s}$ and temporal graph $A_{t}$
Output: The learned STSGs in each Auto-DSTSG module and the predictions in next 12 steps;
1: Build $\mathbb{D}_{train}$, $\mathbb{D}_{valid}$ from $[V_{1},\dots,V_{N_{e}}]$, $A_{s}$ and $A_{t}$;
2: Initialize the graph structure parameters $\omega$ and the weight parameters $\theta$;
3: do:
4:   Sample $\mathbb{D}_{batch}$ from $\mathbb{D}_{train}$;
5:   $\theta\leftarrow\theta-\mu_{\theta}\nabla_{\theta}L(\theta,\omega,\mathbb{D}_{batch})$, where $\mu_{\theta}$ is learning rate;
6:   Sample $\mathbb{D}_{batch}$ from $\mathbb{D}_{valid}$;
7:   $\omega\leftarrow\omega-\mu_{\theta}\nabla_{\omega}L(\theta,\omega,\mathbb{D}_{batch})$, where $\mu_{\theta}$ is learning rate;
8: until stopping criteria is met;
9: Get the graph structures, and further train the model on $\mathbb{D}_{train}$
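The alternating updates of Algorithm 1 can be illustrated on a toy problem. The sketch below is a first-order approximation under explicit assumptions and is not the authors' code: the model mixes two hypothetical candidate operations (identity and a constant) with softmax-normalized scores $\omega$, the loss is mean squared error rather than the L1 loss of Eq. (23) so that finite-difference gradients are smooth, and the data, learning rates and function names are made up for illustration.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def loss(theta, omega, batch):
    """MSE of a toy mixed-operation model (assumption: the paper uses L1)."""
    p = softmax(omega)
    err = 0.0
    for x, y in batch:
        mixed = p[0] * x + p[1] * 1.0  # candidate 0: identity, candidate 1: constant
        err += (theta * mixed - y) ** 2
    return err / len(batch)


def num_grad(f, value, eps=1e-5):
    """Central finite-difference derivative of a scalar function."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)


def search(train, valid, steps=300, lr_theta=0.1, lr_omega=0.05):
    theta, omega = 1.0, [0.0, 0.0]
    for _ in range(steps):
        # Weight step: update theta on the training data.
        theta -= lr_theta * num_grad(lambda t: loss(t, omega, train), theta)
        # Architecture step: update the candidate scores on the validation data.
        grads = []
        for i in range(len(omega)):
            def f(w, i=i):
                trial = list(omega)
                trial[i] = w
                return loss(theta, trial, valid)
            grads.append(num_grad(f, omega[i]))
        omega = [w - lr_omega * g for w, g in zip(omega, grads)]
    # Keep the candidate with the highest score.
    best = max(range(len(omega)), key=lambda i: omega[i])
    return theta, omega, best


train = [(x, 2.0 * x) for x in (-2.0, -1.0, 1.0, 2.0)]
valid = [(x, 2.0 * x) for x in (-1.5, -0.5, 0.5, 1.5)]
theta, omega, best = search(train, valid)
```

In this toy setting the identity candidate is the only informative one, so its score should end up highest; after selection, the chosen structure would be retrained on the training set as in the last step of Algorithm 1.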

V Experiments

In this section, we conduct extensive experiments on four public datasets to answer the following research questions:

• RQ1: How does our proposed Auto-DSTSGN perform compared with the state-of-the-art baselines in traffic prediction?
• RQ2: How does our model perform compared with different variants in the ablation study?
• RQ3: What kind of graph structures can be searched by the GSS mechanism in our model? For different datasets, what are the differences in the search results?
• RQ4: How do the efficiency and GPU occupancy of our model compare with other models?
• RQ5: How do the model parameters (e.g., the max hop of GCN layers) affect the performance of our model?

V-A Datasets and Settings

We evaluate our model on PEMS03, PEMS04, PEMS07 and PEMS08, which are collected from the Caltrans Performance Measurement System (PeMS). The time granularity of all datasets is set to 5 minutes. The spatial graph for each dataset is constructed based on the road network topology. Z-score normalization is applied to the traffic flow data. Detailed statistics of the datasets are shown in Table II.

TABLE II: Dataset description and statistics.

Datasets	#Nodes	#Edges	#TimeSteps	#TimeRange
PEMS03	358	547	26208	9/1/2018 - 11/30/2018
PEMS04	307	340	16992	1/1/2018 - 2/28/2018
PEMS07	883	866	28224	5/1/2017 - 8/31/2017
PEMS08	170	295	17856	7/1/2016 - 8/31/2016

TABLE III: Performance comparison of baseline models and Auto-DSTSGN. The sub-optimal results are marked by an asterisk.

Datasets	Metric	FC-LSTM	DCRNN	STGCN	ASTGCN(r)	GWN	STSGCN	STFGNN	STGODE	AutoSTG	Auto-DSTSGN
PEMS03	MAE	21.33 $\pm$ 0.24	18.18 $\pm$ 0.15	17.49 $\pm$ 0.46	17.69 $\pm$ 1.43	19.85 $\pm$ 0.03	17.48 $\pm$ 0.15	16.77 $\pm$ 0.09	16.53 $\pm$ 0.10	16.27* $\pm$ 0.27	14.59 $\pm$ 0.05
	MAPE(%)	23.33 $\pm$ 1.23	18.91 $\pm$ 0.82	17.15 $\pm$ 0.45	19.40 $\pm$ 2.24	19.31 $\pm$ 0.49	16.78 $\pm$ 0.20	16.30 $\pm$ 0.09	16.68 $\pm$ 0.05	16.10* $\pm$ 0.03	14.22 $\pm$ 0.16
	RMSE	35.11 $\pm$ 0.50	30.31 $\pm$ 0.25	30.12 $\pm$ 0.70	29.66 $\pm$ 1.68	32.94 $\pm$ 0.18	29.21 $\pm$ 0.56	28.34 $\pm$ 0.46	27.79 $\pm$ 0.32	27.63* $\pm$ 0.78	25.17 $\pm$ 0.24
PEMS04	MAE	27.14 $\pm$ 0.20	24.70 $\pm$ 0.22	22.70 $\pm$ 0.64	22.93 $\pm$ 1.29	25.45 $\pm$ 0.03	21.19 $\pm$ 0.10	19.83* $\pm$ 0.06	20.84 $\pm$ 0.07	20.38 $\pm$ 0.09	18.85 $\pm$ 0.08
	MAPE(%)	18.20 $\pm$ 0.40	17.12 $\pm$ 0.37	14.59 $\pm$ 0.21	16.56 $\pm$ 1.36	17.29 $\pm$ 0.24	13.90 $\pm$ 0.05	13.02 $\pm$ 0.05	13.76 $\pm$ 0.04	14.12 $\pm$ 0.02	13.21* $\pm$ 0.02
	RMSE	41.59 $\pm$ 0.21	38.12 $\pm$ 0.26	35.55 $\pm$ 0.75	35.22 $\pm$ 1.90	39.70 $\pm$ 0.04	33.65 $\pm$ 0.20	31.88* $\pm$ 0.14	32.84 $\pm$ 0.19	32.51 $\pm$ 0.12	30.48 $\pm$ 0.17
PEMS07	MAE	29.98 $\pm$ 0.42	25.30 $\pm$ 0.52	25.38 $\pm$ 0.49	28.05 $\pm$ 2.34	26.85 $\pm$ 0.05	24.26 $\pm$ 0.14	22.07* $\pm$ 0.11	23.02 $\pm$ 0.15	23.22 $\pm$ 0.33	20.08 $\pm$ 0.08
	MAPE(%)	13.20 $\pm$ 0.53	11.66 $\pm$ 0.33	11.08 $\pm$ 0.18	13.92 $\pm$ 1.65	12.12 $\pm$ 0.41	10.21 $\pm$ 1.65	9.21* $\pm$ 0.07	10.09 $\pm$ 0.09	9.95 $\pm$ 0.01	8.57 $\pm$ 0.05
	RMSE	45.94 $\pm$ 0.57	38.58 $\pm$ 0.70	38.78 $\pm$ 0.58	42.57 $\pm$ 3.31	42.78 $\pm$ 0.07	39.03 $\pm$ 0.27	35.80* $\pm$ 0.18	37.48 $\pm$ 0.39	36.47 $\pm$ 0.47	33.02 $\pm$ 0.12
PEMS08	MAE	22.20 $\pm$ 0.18	17.86 $\pm$ 0.03	18.02 $\pm$ 0.14	18.61 $\pm$ 0.40	19.13 $\pm$ 0.08	17.13 $\pm$ 0.09	16.64 $\pm$ 0.09	16.79 $\pm$ 0.08	16.37* $\pm$ 0.12	14.74 $\pm$ 0.04
	MAPE(%)	14.20 $\pm$ 0.59	11.45 $\pm$ 0.03	11.40 $\pm$ 0.10	13.08 $\pm$ 1.00	12.68 $\pm$ 0.57	10.96 $\pm$ 0.07	10.60 $\pm$ 0.06	10.58 $\pm$ 0.04	10.36* $\pm$ 0.03	9.45 $\pm$ 0.03
	RMSE	34.06 $\pm$ 0.32	27.83 $\pm$ 0.05	27.76 $\pm$ 0.20	28.16 $\pm$ 0.48	31.05 $\pm$ 0.07	26.80 $\pm$ 0.18	26.22 $\pm$ 0.15	26.01 $\pm$ 0.14	25.46* $\pm$ 0.18	23.76 $\pm$ 0.05

TABLE IV: Ablation experiments.

Dataset	Model&Variants	MAE	MAPE(%)	RMSE
PEMS04	Auto-DSTSGN	18.85	13.21	30.48
	w/o GCN	23.45	16.96	36.53
	w/o TCN	19.33	14.12	30.79
	w/o Dilation	19.18	13.76	30.71
	w/o GSS	19.48	14.06	30.72
	GSS Random	19.68	14.31	31.26
	w/o DTW	19.27	13.84	31.05
PEMS08	Auto-DSTSGN	14.74	9.45	23.76
	w/o GCN	18.59	12.73	28.91
	w/o TCN	15.22	10.12	24.02
	w/o Dilation	15.10	10.16	23.95
	w/o GSS	15.31	10.15	24.21
	GSS Random	15.31	10.15	24.21
	w/o DTW	15.07	9.97	24.10

Each dataset is chronologically split with 60% for training, 20% for validation and 20% for testing. We use the historical traffic flow in the last hour to forecast the traffic flow in the next hour. The spatial graph and temporal graph are constructed following [26]. Our model is implemented in PyTorch 1.5 on an NVIDIA Tesla V100 GPU. The max hop of graph convolution in each Auto-DSTSG block is set as 2 by default. The dimension of hidden representations is set as 40. We use the Adam optimizer with a batch size of 64 and a learning rate of 0.001. Our model is evaluated five times on each dataset. During the search process, we utilize an early stopping strategy for graph structure search with tolerance 15 for 60 epochs. During the training process, we reinitialize the optimizer and employ early stopping with tolerance 30 for 200 epochs.
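The data preparation described above can be sketched in a few lines. At 5-minute granularity, one hour corresponds to 12 steps, so each sample pairs 12 historical steps with the next 12. The function names are hypothetical, and the order of operations (windowing first, then splitting the samples chronologically) is an assumption, since the text does not specify it.

```python
def make_windows(series, history=12, horizon=12):
    """Slide over a time-ordered series and emit (past, future) pairs."""
    samples = []
    for start in range(len(series) - history - horizon + 1):
        x = series[start:start + history]
        y = series[start + history:start + history + horizon]
        samples.append((x, y))
    return samples


def chronological_split(samples, train_ratio=0.6, valid_ratio=0.2):
    """Split without shuffling so that the test samples are the most recent."""
    n_train = int(len(samples) * train_ratio)
    n_valid = int(len(samples) * valid_ratio)
    train = samples[:n_train]
    valid = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]
    return train, valid, test


# Toy series of 100 time steps for a single sensor.
series = list(range(100))
samples = make_windows(series)
train, valid, test = chronological_split(samples)
```

Keeping the split chronological avoids leaking future observations into the training set.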

V-B Overall Performance (RQ1)

We compare our model with the following nine state-of-the-art baselines from recent years:

• FC-LSTM: This is a variant of Long Short-Term Memory Network, which adopts fully connected hidden units [22]. We set the number of hidden layers as 1 and the number of hidden units as 64.
• DCRNN: Diffusion Convolution Recurrent Neural Network, which integrates GCNs into encoder-decoder RNNs [5]. The hop of diffusion graph convolution is set as 2 and the hidden dimension is set as 64.
• STGCN: Spatio-Temporal Graph Convolution Network, which employs GCNs and TCNs for spatio-temporal learning [2]. Each spatio-temporal cell in this model contains two TCNs and one GCN. The number of spatio-temporal cells is set as 2 and the hidden dimension is set as 64.
• ASTGCN: Attention based Spatial Temporal Graph Convolution Network, which utilizes spatial and temporal attention mechanisms [8]. Similar to STGCN, there are two spatio-temporal cells in this model and the hidden dimension is set as 64.
• Graph WaveNet (GWN): Graph WaveNet adopts GCNs with an adaptive adjacency matrix and 1D dilated TCNs for spatio-temporal modeling [6]. Each layer in this model contains a gated TCN and a spatial GCN. The number of stacked layers in this model is set as 8 with the dilation rate [1, 2, 1, 2, 1, 2, 1, 2, 1, 2] and the hidden dimension is set as 64.
• STSGCN: Spatial-Temporal Synchronous Graph Convolution Network, which utilizes multiple STSG modules for modeling localized spatio-temporal joint dependencies [25]. The size of the spatial-temporal synchronous graph is set as $3N\times 3N$, the number of STSG layers is set as 3, and the hidden dimension is set as 64 in this model.
• STFGNN: Spatial-Temporal Fusion Graph Convolution Network, which utilizes spatio-temporal fusion graph convolution and parallel TCNs for learning localized and global spatio-temporal dependencies, respectively [26]. The size of the spatial-temporal fusion graph is set as $4N\times 4N$, the number of layers is set as 3, and the hidden dimension is set as 64 in this model.
• STGODE: Spatial-Temporal Graph Ordinary Differential Equation Network, which integrates the tensor-based ordinary differential equation into GCN modules [54]. The number of graph ODE layers is set as 6 and the hidden dimension is set as 64 in this model.
• AutoSTG: Automated Spatio-Temporal Graph Network, which integrates NAS with spatio-temporal graph learning modules [35]. The number of spatio-temporal graph learning cells is set as 5, the number of mixed operations is set as 6 in each cell, and the hidden dimension is set as 64 in this model.

The evaluation metrics are mean absolute error (MAE), root mean squared error (RMSE) and mean absolute percentage error (MAPE), averaged over five runs for one-hour-ahead prediction. For all three metrics, smaller values mean better performance. They are defined as follows:

{\rm RMSE}(\widehat{y}_{i},y_{i})=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{(y_{i}-\widehat{y}_{i})^{2}}},

(24)

{\rm MAE}(\widehat{y}_{i},y_{i})=\frac{1}{n}\sum_{i=1}^{n}{|y_{i}-\widehat{y}_{i}|},\quad\quad

(25)

{\rm MAPE}(\widehat{y}_{i},y_{i})=\frac{1}{n}\sum_{i=1}^{n}{\frac{|y_{i}-\widehat{y}_{i}|}{y_{i}}}.\quad

(26)

where $\widehat{y}_{i}$ denotes the prediction results and ${y}_{i}$ denotes the ground-truths.
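Eqs. (24) to (26) translate directly into code. The following is a minimal sketch with hypothetical function names; it assumes that all ground-truth values are non-zero (Eq. (26) divides by $y_{i}$) and returns MAPE as a fraction, which would be multiplied by 100 to obtain the percentages reported in Table III.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (24)."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error, Eq. (25)."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction, Eq. (26).

    Assumes every ground-truth value is non-zero.
    """
    n = len(y_true)
    return sum(abs(y - p) / y for y, p in zip(y_true, y_pred)) / n


y_true = [100.0, 200.0]
y_pred = [110.0, 180.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

For this toy pair the absolute errors are 10 and 20, giving an MAE of 15, an RMSE of $\sqrt{250}$ and a MAPE of 0.1 (10%).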

From the results in Table III, we can observe that our model Auto-DSTSGN consistently outperforms the sub-optimal baselines with 4.9% $\sim$ 10.3% improvements in terms of MAE on the four datasets, which demonstrates the superiority of our proposed method. In order to make the comparison results more intuitive visually, we also provide the box plot of Table III, as shown in Figure 5. Next, we analyze and compare the strengths of our proposed model with some well-performing baselines.
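The quoted improvement range can be reproduced from the MAE rows of Table III by comparing Auto-DSTSGN with the asterisk-marked runner-up on each dataset. The helper name `relative_improvement` is hypothetical; the numbers are copied from the table.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction of an error metric relative to the baseline."""
    return (baseline - ours) / baseline * 100.0


# (runner-up MAE, Auto-DSTSGN MAE) taken from Table III.
mae_pairs = {
    "PEMS03": (16.27, 14.59),  # AutoSTG
    "PEMS04": (19.83, 18.85),  # STFGNN
    "PEMS07": (22.07, 20.08),  # STFGNN
    "PEMS08": (16.37, 14.74),  # AutoSTG
}

improvements = {
    name: round(relative_improvement(base, ours), 1)
    for name, (base, ours) in mae_pairs.items()
}
# improvements -> {"PEMS03": 10.3, "PEMS04": 4.9, "PEMS07": 9.0, "PEMS08": 10.0}
```

The smallest gain (4.9%) is on PEMS04 and the largest (10.3%) on PEMS03, matching the range stated above.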

AutoSTG is the only baseline that integrates neural architecture search to adjust its architecture to the data. From the comparison results, AutoSTG significantly outperforms most non-automated state-of-the-art models such as DCRNN and Graph WaveNet, since it can search the optimal neural architectures for different data scenarios. However, AutoSTG is still weaker than our model, for two main reasons. First, the spatio-temporal synchronous graphs of our model can characterize more complex dynamics than normal graphs. Second, AutoSTG focuses on neural architecture search but ignores the importance of informative spatio-temporal graph construction. Both STFGNN and STSGCN adopt spatio-temporal synchronous graphs, but they can hardly capture the short-term and long-term spatio-temporal correlations flexibly. Our model presents the dilated spatio-temporal synchronous graph convolution framework to balance the learning of short-term and long-term dependencies. Both STFGNN and STGODE adopt informative adjacency matrices for learning spatio-temporal dependencies. However, these two models design the spatio-temporal synchronous graphs or neural architectures manually and empirically, so they are difficult to adapt to different data scenarios. In contrast, our model can construct different spatio-temporal synchronous graphs for different data scenarios based on automated machine learning, which enhances the adaptability and generalizability of our model.

In summary, the dilated spatio-temporal synchronous graph structures in our model can flexibly characterize the short-term and long-term spatio-temporal dependencies, and the automated graph structure search mechanism helps our model adapt to different data inputs and achieve the optimal dilated spatio-temporal synchronous graph modeling. These are the reasons why our model outperforms the other baselines significantly.

V-C Ablation Study (RQ2)

We conduct an ablation study on PEMS04 and PEMS08 to evaluate the effectiveness of the key components in our model. As shown in Table IV, we compare Auto-DSTSGN with the following variants: 1) w/o GCN, which removes all the Auto-DSTSG modules from our model. 2) w/o TCN, which removes all the dilated temporal convolution modules from our model. 3) w/o Dilation, which removes the dilation mechanism in the Auto-DSTSG module from our model. 4) w/o GSS, which replaces the Auto-DSTSG module with the fixed STFGNN module [26]. 5) GSS Random, which samples a complete adjacency matrix from the search space in the graph structure search module during the search process. 6) w/o DTW, which replaces DTW with the Pearson coefficient to construct the temporal graphs.

From the experimental results, we can find that Auto-DSTSGN outperforms all the ablation variants. Compared with the results of w/o Dilation, Auto-DSTSGN improves by 1.7% and 4.0% in terms of MAE and MAPE on PEMS04, and by 2.4% and 7.0% in terms of MAE and MAPE on PEMS08, which illustrates the effectiveness of the dilation mechanism in learning long-term spatio-temporal dependencies. Compared with the results of w/o GSS, Auto-DSTSGN improves by 3.2% and 6.1% in terms of MAE and MAPE on PEMS04, and by 3.7% and 6.9% in terms of MAE and MAPE on PEMS08. Compared with the results of GSS Random, Auto-DSTSGN improves by 4.2% and 7.7% in terms of MAE and MAPE on PEMS04, and by 5.5% and 12.4% in terms of MAE and MAPE on PEMS08, which illustrates the effectiveness of the graph structure search module in learning diverse spatio-temporal correlations and adapting to different data. The performance falls significantly without graph convolutions or temporal convolutions (w/o GCN and w/o TCN), demonstrating the effectiveness of these two parts for spatio-temporal representation learning. Moreover, when we replace DTW with the Pearson coefficient to construct the temporal graphs (w/o DTW), the performance drops on both PEMS04 and PEMS08. The reason is that the Pearson coefficient can only characterize the linear correlations between different time series, while DTW can better characterize non-linear time series similarity.
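The difference between the two similarity measures can be seen on a pair of toy series that have the same shape but are shifted in time. The sketch below uses the textbook dynamic-programming form of DTW with absolute difference as the local cost; the function names and the example series are illustrative assumptions, and the paper's exact DTW variant for building temporal graphs is not reproduced here.

```python
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length series."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


peak_early = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
peak_late = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # same shape, one step later

dist = dtw_distance(peak_early, peak_late)
corr = pearson(peak_early, peak_late)
```

DTW can warp the time axis so that the two peaks align, giving a distance of 0 here, whereas the Pearson coefficient of the same pair is only 0.4 because it compares values at identical time indices.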

V-D Case Study (RQ3)

We select PEMS04 and PEMS08 to further investigate the relations between the attributes of the meta graphs and the optimal structures of the STSGs learned on them. For the two meta graphs, SG and TG, we choose the mean degree as the most important attribute. A higher mean degree of a graph means stronger correlations among its nodes. Specifically, a higher mean degree of SG represents stronger spatial correlations, while a higher mean degree of TG represents stronger temporal correlations. Thus, empirically, we need more graph convolution operations on the corresponding graphs to achieve a larger receptive field for capturing long-range spatio-temporal correlations. We count the average number of SGs and TGs in the learned structures of STSGs on the two datasets. All the results are shown in Table V. We observe that the STSGs learned on PEMS04 contain more TGs, while the STSGs learned on PEMS08 contain more SGs. This can be explained by the mean degrees of TG and SG on the two datasets. Since a higher mean degree characterizes stronger correlations between different nodes, our model can obtain a stronger capability for message passing at the spatio-temporal scale by adjusting the structures of STSGs automatically. In addition, we also visualize the learned structures of STSGs on PEMS04 and PEMS08 in Figure 6.

TABLE V: The attributes of two datasets and the corresponding structures of STSGs.

Objects	Attributes	PEMS04	PEMS08
Datasets	Mean degree of SG	2.21	3.22
Datasets	Mean degree of TG	4.51	1.27
Structures of STSGs	Average # of SGs	35	68
Structures of STSGs	Average # of TGs	47	30
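The mean degree used as the graph attribute in Table V can be computed from an adjacency matrix as sketched below. The text does not state how the degrees were obtained, so this sketch assumes an unweighted, undirected graph without self-loops, and the function name `mean_degree` is hypothetical.

```python
def mean_degree(adjacency):
    """Average number of neighbors per node for a binary adjacency matrix."""
    n = len(adjacency)
    degrees = [sum(1 for value in row if value != 0) for row in adjacency]
    return sum(degrees) / n


# Toy path graph 0 - 1 - 2: degrees are 1, 2 and 1.
path_graph = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
avg = mean_degree(path_graph)
```

Under these assumptions the mean degree equals $2|E|/N$ for a graph with $|E|$ edges and $N$ nodes.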

V-E Efficiency and Occupancy Analysis (RQ4)

Time consumption and memory occupancy on GPU are two intuitive metrics that reflect the time and space complexity of different models. They are also two important metrics that measure the efficiency and scalability of a model in industrial scenarios. We select the three best baselines, STFGNN, STGODE and AutoSTG, to compare with our model on the two metrics. The results are shown in Figure 7. From the absolute perspective, the efficiency of our model during the training phase is in general slightly better than that of STFGNN and STGODE, and the GPU occupancy of our model is significantly lower than theirs. We incorporate the dilation mechanism into the spatio-temporal synchronous graph modeling, so our model significantly reduces model complexity compared with STFGNN. STGODE employs multiple neural ODE architectures, which greatly increase the time and space complexity of the model. From the relative perspective, the GPU occupancy of our model is almost the same during the search phase and the training phase, whereas AutoSTG occupies significantly more GPU memory in the search phase. This indicates that graph structure search is more lightweight and efficient than neural architecture search, because graph structure search only involves simple matrix calculations whereas neural architecture search involves complex computations of neural network structures with learnable parameters.

V-F Parameters Study (RQ5)

To further investigate the effectiveness of our model, we conduct a parameter study on PEMS04 and PEMS08, covering the dimension of hidden representations $D$ and the max hop of graph convolution $H$ in each block. The experimental results are shown in Figure 8. We can find that MAE, MAPE and RMSE on the two datasets are optimal when $D$ is equal to 48. When $D$ is too small, the learning capability of our model becomes worse, resulting in poor prediction performance. When $D$ is too large, the three metrics on both PEMS04 and PEMS08 also become worse, because an overly large hidden dimension causes over-fitting. For parameter $H$, we can observe that MAE, MAPE and RMSE on PEMS04 achieve the optimal results when $H$ is equal to 2. On PEMS08, MAE, MAPE and RMSE obtain the best values when $H$ is equal to 3. This implies that aggregating neighbor information of an appropriate order in the traffic network can better learn the spatial dependencies, and that an excessively large number of hops leads to the over-smoothing phenomenon of GCNs.

VI Conclusion

We propose a novel automated dilated spatio-temporal synchronous graph convolution framework to capture complex spatio-temporal dependencies for traffic prediction. Our model can not only capture complex long-term and short-term spatio-temporal dependencies, but also more flexibly characterize the spatio-temporal relations of different time steps and even different scenarios through graph structure search. Extensive experiments on four real-world datasets demonstrate the superiority of our model in prediction accuracy compared with other state-of-the-art baselines. In addition, while maintaining accuracy, our model also takes into account both efficiency and GPU occupancy, which provides a solid foundation for its deployment in industrial scenarios. In this paper, we make a first attempt to adopt an automated machine learning approach in graph structure search for diverse spatio-temporal relations. In future work, we will extend this method to more general spatio-temporal prediction scenarios.

References

[1] X. Kong, W. Xing, X. Wei, P. Bao, J. Zhang, and W. Lu, “Stgat: Spatial-temporal graph attention networks for traffic flow forecasting,” IEEE Access, vol. PP, pp. 1–1, 07 2020.
[2] B. Yu, H. Yin, and Z. Zhu, “Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting,” 07 2018, pp. 3634–3640.
[3] X. Wang, Y. Ma, Y. Wang, W. Jin, X. Wang, J. Tang, C. Jia, and J. Yu, “Traffic flow prediction via spatial temporal graph neural network,” in Proceedings of The Web Conference 2020, ser. WWW ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 1082–1092. [Online]. Available: https://doi.org/10.1145/3366423.3380186
[4] Z. Pan, Y. Liang, W. Wang, Y. Yu, Y. Zheng, and J. Zhang, “Urban traffic prediction from spatio-temporal data using deep meta learning,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’19, 2019, p. 1720–1730.
[5] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion convolutional recurrent neural network: Data-driven traffic forecasting,” in Proc. of ICLR, 2018.
[6] Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang, “Graph wavenet for deep spatial-temporal graph modeling,” in Proc. of IJCAI, 2019.
[7] G. Jin, H. Yan, F. Li, J. Huang, and Y. Li, “Spatio-temporal dual graph neural networks for travel time estimation,” arXiv preprint arXiv:2105.13591, 2021.
[8] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, “Attention based spatial-temporal graph convolutional networks for traffic flow forecasting,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 922–929, 07 2019.
[9] L. Liu, J. Zhen, G. Li, G. Zhan, Z. He, B. Du, and L. Lin, “Dynamic spatial-temporal representation learning for traffic flow prediction,” IEEE Transactions on Intelligent Transportation Systems, 2020.
[10] J. Liu, N. Wu, Y. Qiao, and Z. Li, “Short-term traffic flow forecasting using ensemble approach based on deep belief networks,” IEEE Transactions on Intelligent Transportation Systems, 2020.
[11] W. Liao, B. Zeng, J. Liu, P. Wei, and X. Cheng, “Taxi demand forecasting based on the temporal multimodal information fusion graph neural network,” Applied Intelligence, pp. 1–14, 2022.
[12] S. T. Ikram, V. Priya, B. Anbarasu, X. Cheng, M. R. Ghalib, and A. Shankar, “Prediction of iiot traffic using a modified whale optimization approach integrated with random forest classifier,” The Journal of Supercomputing, pp. 1–32, 2022.
[13] G. Jin, M. Wang, J. Zhang, H. Sha, and J. Huang, “Stgnn-tte: Travel time estimation via spatial–temporal graph neural network,” Future Generation Computer Systems, vol. 126, pp. 70–81, 2022.
[14] G. Jin, C. Liu, Z. Xi, H. Sha, Y. Liu, and J. Huang, “Adaptive dual-view wavenet for urban spatial–temporal event prediction,” Information Sciences, vol. 588, pp. 315–330, 2022.
[15] J. James, “Graph construction for traffic prediction: A data-driven approach,” IEEE Transactions on Intelligent Transportation Systems, 2022.
[16] G. Jin, H. Sha, Y. Feng, Q. Cheng, and J. Huang, “Gsen: An ensemble deep learning benchmark model for urban hotspots spatiotemporal prediction,” Neurocomputing, vol. 455, pp. 353–367, 2021.
[17] G. Jin, C. Zhu, X. Chen, H. Sha, X. Hu, and J. Huang, “Ufsp-net: a neural network with spatio-temporal information fusion for urban fire situation prediction,” in IOP Conference Series: Materials Science and Engineering, vol. 853, no. 1. IOP Publishing, 2020, p. 012050.
[18] L. Liu, J. Chen, H. Wu, J. Zhen, G. Li, and L. Lin, “Physical-virtual collaboration modeling for intra-and inter-station metro ridership prediction,” IEEE Transactions on Intelligent Transportation Systems, 2020.
[19] G. Jin, Q. Wang, C. Zhu, Y. Feng, J. Huang, and X. Hu, “Urban fire situation forecasting: Deep sequence learning with spatio-temporal dynamics,” Applied Soft Computing, vol. 97, p. 106730, 2020.
[20] J. Ye, J. Zhao, K. Ye, and C. Xu, “How to build a graph-based deep learning architecture in traffic domain: A survey,” IEEE Transactions on Intelligent Transportation Systems, 2020.
[21] D. Servan-Schreiber, A. Cleeremans, and J. McClelland, “Learning sequential structure in simple recurrent networks,” Advances in neural information processing systems, vol. 1, 1988.
[22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, pp. 1735–80, 12 1997.
[23] K. Cho, B. van Merrienboer, Çaglar Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in EMNLP, 2014.
[24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017. [Online]. Available: http://arxiv.org/abs/1706.03762
[25] C. Song, Y. Lin, S. Guo, and H. Wan, “Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, 2020, pp. 914–921.
[26] M. Li and Z. Zhu, “Spatial-temporal fusion graph neural networks for traffic flow forecasting,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, pp. 4189–4196, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/16542
[27] G. Jin, Y. Cui, L. Zeng, H. Tang, Y. Feng, and J. Huang, “Urban ride-hailing demand prediction with multiple spatio-temporal information fusion network,” Transportation Research Part C: Emerging Technologies, vol. 117, p. 102665, 2020.
[28] G. Jin, H. Yan, F. Li, Y. Li, and J. Huang, “Hierarchical neural architecture search for travel time estimation,” in Proceedings of the 29th International Conference on Advances in Geographic Information Systems, 2021, pp. 91–94.
[29] G. Jin, Z. Xi, H. Sha, Y. Feng, and J. Huang, “Deep multi-view spatiotemporal virtual graph neural network for significant citywide ride-hailing demand prediction,” arXiv preprint arXiv:2007.15189, 2020.
[30] Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang, Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks. New York, NY, USA: Association for Computing Machinery, 2020, p. 753–763. [Online]. Available: https://doi.org/10.1145/3394486.3403118
[31] B. Lu, X. Gan, H. Jin, L. Fu, and H. Zhang, “Spatiotemporal adaptive gated graph convolution network for urban traffic flow forecasting,” in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1025–1034.
[32] F. Li, J. Feng, H. Yan, G. Jin, D. Jin, and Y. Li, “Dynamic graph convolutional recurrent network for traffic prediction: Benchmark and solution,” arXiv preprint arXiv:2104.14917, 2021.
[33] L. Zhao, Y. Song, C. Zhang, Y. Liu, P. Wang, T. Lin, M. Deng, and H. Li, “T-gcn: A temporal graph convolutional network for traffic prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3848–3858, 2019.
[34] C. Zheng, X. Fan, C. Wang, and J. Qi, “Gman: A graph multi-attention network for traffic prediction,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 1234–1241, 04 2020.
[35] Z. Pan, S. Ke, X. Yang, Y. Liang, Y. Yu, J. Zhang, and Y. Zheng, “Autostg: Neural architecture search for predictions of spatio-temporal graph,” in Proceedings of the Web Conference 2021, 2021, pp. 1846–1855.
[36] Q. Yao, M. Wang, Y. Chen, W. Dai, Y.-F. Li, W.-W. Tu, Q. Yang, and Y. Yu, “Taking human out of learning applications: A survey on automated machine learning,” arXiv preprint arXiv:1810.13306, 2018.
[37] A. Kaul, S. Maheshwary, and V. Pudi, “Autolearn—automated feature generation and selection,” in 2017 IEEE International Conference on data mining (ICDM). IEEE, 2017, pp. 217–226.
[38] J. M. Kanter and K. Veeramachaneni, “Deep feature synthesis: Towards automating data science endeavors,” in 2015 IEEE international conference on data science and advanced analytics (DSAA). IEEE, 2015, pp. 1–10.
[39] M. Christ, N. Braun, J. Neuffer, and A. W. Kempa-Liehr, “Time series feature extraction on basis of scalable hypothesis tests (tsfresh–a python package),” Neurocomputing, vol. 307, pp. 72–77, 2018.
[40] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural architecture search,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 19–34.
[41] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean, “Efficient neural architecture search via parameters sharing,” in International Conference on Machine Learning. PMLR, 2018, pp. 4095–4104.
[42] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
[43] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized evolution for image classifier architecture search,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 4780–4789.
[44] Y. Sun, B. Xue, M. Zhang, G. G. Yen, and J. Lv, “Automatically designing cnn architectures using the genetic algorithm for image classification,” IEEE transactions on cybernetics, vol. 50, no. 9, pp. 3840–3854, 2020.
[45] L. Xie and A. Yuille, “Genetic cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1379–1388.
[46] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” in International Conference on Learning Representations, 2018.
[47] X. Chen, L. Xie, J. Wu, and Q. Tian, “Progressive differentiable architecture search: Bridging the depth gap between search and evaluation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1294–1303.
[48] T. Li, J. Zhang, K. Bao, Y. Liang, Y. Li, and Y. Zheng, “Autost: Efficient neural architecture search for spatio-temporal prediction,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 794–802.
[49] P. Tormene, T. Giorgino, S. Quaglini, and M. Stefanelli, “Matching incomplete time series with dynamic time warping: An algorithm and an application to post-stroke rehabilitation,” vol. 45, 2009. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/19111449
[50] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proc. of ICLR, 2017.
[51] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in ICLR, 2016.
[52] C.-W. Huang and S. S. Narayanan, “Deep convolutional recurrent neural network with attention mechanism for robust speech emotion recognition,” in 2017 IEEE international conference on multimedia and expo (ICME). IEEE, 2017, pp. 583–588.
[53] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in International conference on machine learning. PMLR, 2017, pp. 933–941.
[54] Z. Fang, Q. Long, G. Song, and K. Xie, “Spatial-temporal graph ode networks for traffic flow forecasting,” in Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021, pp. 364–373.