
FMA-ETA: Estimating Travel Time Entirely
Based on FFN With Attention

Yiwen Sun1   Yulu Wang1   Kun Fu2   Zheng Wang2   Ziang Yan1
Changshui Zhang1   Jieping Ye2
1Department of Automation, Institute for Artificial Intelligence, Tsinghua University (THUAI), Beijing, China
2DiDi AI Labs, Beijing, China
syw17@mails.tsinghua.edu.cn, wangyulu18@mails.tsinghua.edu.cn
fukunkunfu@didiglobal.com, wangzhengzwang@didiglobal.com
yza18@mails.tsinghua.edu.cn, zcs@mail.tsinghua.edu.cn, yejieping@didiglobal.com
Abstract

Estimated time of arrival (ETA) is one of the most important services in intelligent transportation systems and has become a challenging spatial-temporal (ST) data mining task in recent years. Nowadays, deep learning based methods, specifically recurrent neural network (RNN) based ones, are adapted to model ST patterns from massive data for ETA and have become the state-of-the-art. However, RNNs suffer from slow training and inference speed, as their structure is unfriendly to parallel computing. To solve this problem, we propose a novel, concise and effective framework mainly based on the feed-forward network (FFN) for ETA, namely FFN with Multi-factor self-Attention (FMA-ETA). The novel Multi-factor self-attention mechanism is proposed to deal with different categories of features and aggregate their information purposefully. Extensive experimental results on a real-world vehicle travel dataset show that FMA-ETA is competitive with state-of-the-art methods in terms of prediction accuracy while offering significantly better inference speed.

1 Introduction

Estimated time of arrival (ETA), or travel time prediction, is universally considered as estimating the travel time of a given route between a pair of origin and destination locations [22]. As an essential component of artificial intelligence for transportation, ETA influences route planning, navigation and vehicle dispatching, which are fundamental for ride-hailing platforms such as DiDi and Uber [22, 19]. ETA is a representative and challenging sequence learning and data mining task that has attracted much attention [21, 7, 3, 27, 22, 19, 14].

Since 2018, deep learning [13] based methods [19, 14, 22], which significantly outperform non-deep-learning methods [21, 7, 3, 27], mine spatial-temporal correlations concurrently and effectively from large-scale data and have become the state-of-the-art. The general sequential semantic information extractor of these state-of-the-art methods, such as WDR [22], DeepTTE [19] and DeepTravel [28], is mainly one Recurrent Neural Network (RNN) [8, 9, 4] variant, the Long Short-Term Memory network (LSTM) [6]. RNN adopts a recurrent structure to model sequences and extract semantic information, which also determines its restricted inference speed due to non-parallelization.

In this paper, we discuss the possibility of mainly adopting FFN to mine spatial-temporal information from sequential massive data for ETA, as illustrated in Fig. 1. FFN is parallelizable and naturally beneficial for fast ETA inference while maintaining accuracy, which is an industry pain point for ride-hailing platforms. However, depending completely on FFN, the model can hardly capture the dependencies between links.

Figure 1: The conceptual demonstration of ETA and two kinds of candidate sequence feature extractors based on deep learning. ETA refers to estimating the travel time along the given route between the origin and destination. In real application scenarios, the RNN, such as the current state-of-the-art LSTM, is complex and slow. FFN with our proposed Multi-factor Attention (FMA-ETA) is promising for the future of ETA due to its simplicity, high speed and effectiveness.

Is there a novel structure that helps FFN analyze sequence semantic information effectively while exploiting its obvious advantage in inference speed for ETA? Following this line, we present a novel Multi-factor Attention specially designed for ETA, a sequence learning task affected by various factors. FMA-ETA, which is mainly based on FFN with Multi-factor Attention, is proposed as a better sequence feature extractor than the RNN that has been state-of-the-art since 2018.

The main contributions in this work are as follows:

  • We propose a novel deep learning based ETA framework, FMA-ETA, which to the best of our knowledge is the first deep learning framework entirely based on FFN with attention.

  • We propose a novel Multi-factor Attention mechanism for effectively learning the time dependencies and semantic information between time steps of the sequence. Through sufficient experiments, we find that for ETA, Multi-factor Attention is better than the Multi-head attention [18] that is famous in natural language processing. Besides, Multi-factor attention can be adopted for, and may also be promising in, other sequence learning tasks affected by various factors.

  • We evaluate FMA-ETA on a massive real-world dataset containing over 500 million trajectories from a famous ride-sharing platform. The abundant experimental results demonstrate that FMA-ETA’s estimation precision is comparable with the state-of-the-art RNN based method, WDR. Moreover, FMA-ETA significantly improves the inference speed over WDR.

We organize the paper as follows. Section 2 briefly summarizes the background of ETA, sequence learning and the attention mechanism. Section 3 introduces the overall framework of FMA-ETA, followed by a detailed description of the general Multi-factor attention. In Section 4, we elaborate on the reasons why we propose Multi-factor attention. In Section 5, experimental comparisons on a large-scale real-world dataset are presented to show the excellent accuracy and inference speed of FMA-ETA. Finally, this paper is concluded and further work is discussed in Section 6.

2 Background

In this section, we briefly overview the background of our work, including estimated time of arrival and the attention mechanism.

2.1 Estimated time of arrival

Estimated time of arrival (ETA) is a challenging problem in the field of intelligent transportation systems. There are two representative approaches to ETA: the route-based method and the data-driven method. The route-based method formulates the travel time of a given route as the summation of the time spent on each road segment and at each intersection. Traditional machine learning methods such as dynamic Bayesian networks [7], least-square minimization [26] and pattern matching [3] are typical approaches to capturing spatial-temporal features in the route-based method. However, dividing the original trajectory results in the accumulation of local errors. The data-driven method has shifted from traditional methods such as TEMP [20] and the time-dependent landmark graph [25] to deep-learning based methods [22, 5]. MURAT [14] uses multi-task learning and graph convolutional networks to assist a residual block in predicting the travel time from the departure to the destination without a given trajectory. In recent years, researchers have conducted more explorations on applying deep learning methods to ETA, such as DeepTravel [28], DeepTTE [19], Deepi2t [12] and WDR [22]. These methods apply different approaches to modeling spatial information, but they all use LSTM [6] to extract features from time series. However, the inference speed of LSTM-based models is too slow for practical application. In this work, we propose FMA-ETA, which can sufficiently handle the above problem.

2.2 Attention mechanism

Attention is a very effective mechanism in natural language processing [1], image captioning [23] and other research areas [16]. The attention mechanism has an outstanding ability to capture semantic dependencies. Common attention mechanisms include local attention, global attention [15] and self-attention [18]. Transformer [18] is a novel sequence-to-sequence network entirely based on FFN with Multi-head self-attention. It achieves promising results in translation with a faster speed than RNN-based models, and self-attention has since become a hot topic in neural network attention research. Self-attention is calculated by:

\text{Self-attention} = \operatorname{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}\right)\mathbf{V} \quad (1)

where $\mathbf{Q}$, $\mathbf{K}$ and $\mathbf{V}$ are the query, key and value matrices, and $d$ is the dimension of the key and query matrices, which usually have the same dimension. Self-attention has proved useful in a wide variety of tasks including sequential recommendation [10], reading comprehension [24], speech recognition [17] and traffic flow prediction [29].
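For concreteness, Eq. (1) can be written as a few lines of PyTorch; the tensor shapes and the toy example below are illustrative, not taken from the paper.

```python
import math

import torch


def self_attention(q, k, v):
    """Scaled dot-product self-attention of Eq. (1).

    q, k, v: tensors of shape (batch, seq_len, d).
    """
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d)  # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)                       # attention weight of each query over all keys
    return torch.matmul(weights, v)                               # (batch, seq_len, d)


# Toy usage: a batch of 2 sequences, 10 steps, 16-dimensional features.
x = torch.randn(2, 10, 16)
out = self_attention(x, x, x)  # plain self-attention: Q = K = V = x
```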

Deep learning based ETA models are mostly RNN-based. RNN has problems when dealing with long-range dependencies. LSTM can alleviate this problem to some extent, but in practice it still struggles with long-range dependencies. Moreover, the inference time of LSTM-based models is too long for practical application, so it is worthwhile to introduce the attention mechanism to the ETA problem.

3 Model Architecture

We first give a precise mathematical definition of estimated time of arrival with reference to [22].

Definition 3.1 (Estimated time of arrival).

For a collection of historical trips $D=\left\{s_{i},e_{i},d_{i},\boldsymbol{p}_{i}\right\}_{i=1}^{N}$, where $s_{i}$ is the departure time of the $i$-th trajectory, $e_{i}$ is the arrival time of the $i$-th trajectory, $d_{i}$ is the driver ID, $\boldsymbol{p}_{i}$ is the link sequence of the trajectory, and $N$ is the total number of samples. The ground truth travel time is computed by $y_{i}=e_{i}-s_{i}$. Here the link sequence $\boldsymbol{p}_{i}$ can be represented as $\boldsymbol{p}_{i}=\left\{l_{i1},l_{i2},\cdots,l_{iT_{i}}\right\}$, where $l_{ij}$ is the $j$-th link in the $i$-th trajectory and $T_{i}$ is the length of the link sequence.

Since 2018, the main component that most state-of-the-art methods use to capture spatial-temporal patterns for ETA has been the RNN (specifically, the LSTM). RNN is a famous general sequence feature extractor for various sequence learning subfields, such as speech signal processing and natural language processing. In this paper, we break this stereotype and present an ETA framework entirely based on FFN and a novel Multi-factor Attention, FMA-ETA. We introduce the overall structure of FMA-ETA as well as the proposed Multi-factor attention in the next two subsections.

3.1 Overall framework

The first main step is the sophisticated feature engineering, where we follow [22]. Rich features from massive raw data are the key input for the deep learning model. Features can be divided into the following two categories.

(1) Global features are sparse, and one trajectory corresponds to one global representation, such as driverID, day of week and departure time slice. Embedding [2] is adopted for the dimensionality reduction of these sparse features (a minimal embedding sketch follows this feature list).

(2) Sequential features are related to each link of the trajectory, for instance, the length of the link (road segment), speed (road condition), link time (related to road condition) and the embedding of linkID. These four factors influence ETA from different perspectives.
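As a rough illustration of how the sparse global features could be embedded, the sketch below uses PyTorch's nn.Embedding; the vocabulary sizes and embedding dimension are hypothetical, not values reported in the paper.

```python
import torch
import torch.nn as nn


class GlobalFeatureEmbedding(nn.Module):
    """Embeds sparse global features (driverID, day of week, departure time slice)."""

    def __init__(self, num_drivers=100_000, num_time_slices=288, emb_dim=16):
        super().__init__()
        self.driver = nn.Embedding(num_drivers, emb_dim)
        self.weekday = nn.Embedding(7, emb_dim)
        self.time_slice = nn.Embedding(num_time_slices, emb_dim)

    def forward(self, driver_id, weekday, time_slice):
        # Inputs: (batch,) integer index tensors; output: (batch, 3 * emb_dim).
        return torch.cat(
            [self.driver(driver_id), self.weekday(weekday), self.time_slice(time_slice)],
            dim=-1,
        )
```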

We then describe the overall framework of FMA-ETA, as shown in Fig. 2.

Figure 2: FMA-ETA’s two main components for sequential features: FFN and Multi-factor Attention.

Two main components of FMA-ETA are Multi-factor Attention and FFN.

(1) Sequential features are processed by our Multi-factor Attention, which will be discussed in detail in the next subsection. This component fully explores the relationships between different links in each trajectory.

(2) The parallelizable FFN is the main reason for the simplicity and fast inference that are our greatest advantages compared with RNN. The front FFNs are applied to each sequential factor, as well as to the concatenated factors, to mine the spatial-temporal patterns of each single aspect. The last FFN aggregates the information from the separate and combined sequential representations as well as the embeddings of the global features.

The regressor is one linear layer with ReLU [11] as the activation function. The objective function of the overall deep learning model is the mean absolute percentage error (MAPE), which is a common relative loss function for ETA. FMA-ETA’s parameters are trained through:

\min_{\theta}\sum_{i}\left|\frac{y_{i}-y_{i}^{\prime}}{y_{i}}\right|, \quad (2)

where $y_{i}^{\prime}$ is the ETA of the $i$-th query, $y_{i}$ is the ground truth time and $\theta$ denotes all the parameters of FMA-ETA.
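For reference, a PyTorch rendering of this objective might look like the following minimal sketch; averaging over the batch instead of summing leads to the same optimum, and the small epsilon guard is our own addition rather than part of the paper.

```python
import torch


def mape_loss(y_pred, y_true, eps=1e-6):
    """MAPE objective of Eq. (2), averaged over the batch.

    eps guards against division by zero (our addition, not from the paper).
    """
    return torch.mean(torch.abs((y_true - y_pred) / (y_true + eps)))
```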

3.2 Multi-factor Attention

ETA is a challenging and complex problem due to the fact that various factors affect the accuracy of prediction, such as the link length as well as its road condition. Therefore, unlike in natural language processing where one word can be represented by a single embedding vector, different sequential features ought to be treated more specifically for ETA. Our Multi-factor Attention mechanism is proposed to let different sequential factors mine their patterns and their impact on ETA in different subspaces, as shown in the upper left corner of Fig. 2. Self-attention follows [18], and we add position encoding, residual connections, layer normalization and dropout after self-attention, also following [18]. The combined sequential features also capture the spatial-temporal patterns as a whole through FFN with self-attention.

In Fig. 2, we show the Multi-factor Attention mechanism with three factors. For an arbitrary number of factors $n$, the general Multi-factor Attention can be expressed as:

\text{Multi-factor}(\text{factor}_{1},\ldots,\text{factor}_{n}) = \text{Concat}(\text{self}_{1},\ldots,\text{self}_{n},\text{self}_{\text{all}})\,\mathbf{W}^{O}, \quad (3)
\text{where}\ \ \text{self}_{i} = \text{Self-attention}\left(\text{factor}_{i}\mathbf{W}_{i}^{Q},\ \text{factor}_{i}\mathbf{W}_{i}^{K},\ \text{factor}_{i}\mathbf{W}_{i}^{V}\right), \quad i=1,\ldots,n,
\text{self}_{\text{all}} = \text{Self-attention}\left(\text{factor}_{C}\mathbf{W}_{\text{all}}^{Q},\ \text{factor}_{C}\mathbf{W}_{\text{all}}^{K},\ \text{factor}_{C}\mathbf{W}_{\text{all}}^{V}\right),
\text{factor}_{C} = \text{Concat}(\text{factor}_{1},\ldots,\text{factor}_{n}).

Here $\text{factor}_{i}$ is the $i$-th factor of the ETA problem, $\mathbf{W}_{i}^{*}$ are the learned parameters in the FFN layers of the $i$-th factor, and $\mathbf{W}_{\text{all}}^{*}$ are the learned parameters in the FFN layers of the combined features. For our FMA-ETA, the sequential factors are (1) the length of the link, (2) the road condition speed, (3) the corresponding link time and (4) the embedding of linkID. Hence, we adopt the Multi-factor Attention version with four factors, i.e., $n=4$. By concatenating the separate and combined sequence representations, our Multi-factor Attention completes a multi-level and detailed extraction of the spatial-temporal dependencies of the sequence data.
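To make Eq. (3) concrete, the following is a self-contained PyTorch sketch under our own assumptions about hidden sizes (d_k and d_out are illustrative); it omits the position encoding, residual connections, layer normalization and dropout mentioned above and is not the released implementation.

```python
import math

import torch
import torch.nn as nn


def self_attention(q, k, v):
    """Scaled dot-product self-attention, as in Eq. (1)."""
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(q.size(-1))
    return torch.matmul(torch.softmax(scores, dim=-1), v)


class MultiFactorAttention(nn.Module):
    """Sketch of Eq. (3): one self-attention branch per factor plus one on the concatenated factors."""

    def __init__(self, factor_dims, d_k=64, d_out=128):
        super().__init__()
        # Per-factor projections W_i^Q, W_i^K, W_i^V.
        self.q_proj = nn.ModuleList([nn.Linear(d, d_k) for d in factor_dims])
        self.k_proj = nn.ModuleList([nn.Linear(d, d_k) for d in factor_dims])
        self.v_proj = nn.ModuleList([nn.Linear(d, d_k) for d in factor_dims])
        # Projections W_all^Q, W_all^K, W_all^V for the concatenated factors.
        d_cat = sum(factor_dims)
        self.q_all = nn.Linear(d_cat, d_k)
        self.k_all = nn.Linear(d_cat, d_k)
        self.v_all = nn.Linear(d_cat, d_k)
        # Output projection W^O over the concatenated branch outputs.
        self.w_o = nn.Linear(d_k * (len(factor_dims) + 1), d_out)

    def forward(self, factors):
        # factors: list of n tensors, each of shape (batch, seq_len, factor_dims[i]).
        outs = [
            self_attention(q(f), k(f), v(f))
            for f, q, k, v in zip(factors, self.q_proj, self.k_proj, self.v_proj)
        ]
        f_cat = torch.cat(factors, dim=-1)  # factor_C = Concat(factor_1, ..., factor_n)
        outs.append(self_attention(self.q_all(f_cat), self.k_all(f_cat), self.v_all(f_cat)))
        return self.w_o(torch.cat(outs, dim=-1))  # (batch, seq_len, d_out)
```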

4 Why Multi-factor Attention

In this section, we discuss the motivation for proposing multi-factor attention. The RNN-based model achieves good evaluation metrics on the ETA problem. However, it has slow training and inference speed, making it difficult to apply to practical problems. FFN is a promising way to accelerate the model, but on its own it performs poorly on sequence learning and has trouble with long-range dependencies. Our proposed multi-factor attention can solve the above problems. We analyze and compare the total computational complexity and the number of sequential operations of RNN and of FFN with Multi-factor Attention.

As shown in Table 1, multi-factor attention only needs $O(1)$ sequential operations while RNN requires $O(n)$. As for computational complexity, when the length of the sequence $n$ is smaller than the dimension of the features $d$, our multi-factor attention is also faster than RNN.

Self-attention, especially multi-head attention, has achieved good results in sequence learning. Why not multi-head attention? For the ETA problem, many different factors affect the outcome and the traffic state is complex and dynamic. Experiments in Section 5 show that multi-head attention does not perform well on complex problems in the transportation system such as ETA. Our multi-factor attention focuses on both separate and combined features; in this way we promote different subspaces to analyze the effect of a certain factor's pattern on ETA. The evaluation metrics show that the model with multi-factor attention performs better than the model with multi-head attention on the ETA problem.

Hence, multi-factor attention is more effective for extracting systematic and comprehensive spatial-temporal patterns compared with multi-head attention. Considering the speed promotion of FFN, FFN with multi-factor attention has a great advantage in intelligent transportation system (ITS) tasks. Multi-factor attention is a general method and may also be promising for other time series forecasting tasks.

Table 1: The per-layer complexity and sequential operations of different methods*

Method                      Complexity per Layer      Sequential Operations
RNN                         $\mathcal{O}(nd^{2})$     $\mathcal{O}(n)$
Multi-factor Attention      $\mathcal{O}(n^{2}d)$     $\mathcal{O}(1)$

* $n$ is the length of the sequence, $d$ is the dimension of the features.

5 Results

5.1 Dataset

We evaluate our model on a large-scale real-world floating-car trajectory dataset, Beijing 2018, collected by a ride-hailing platform. It contains hundreds of millions of desensitized trajectories from Beijing taxi drivers over more than 4 months in 2018. The dataset covers different types of roads in Beijing urban areas, including local streets and freeways. We filter out abnormal data where the driving time is less than 60 s or the speed exceeds 120 km/h. We divide the dataset into a training set (the first 16 weeks of data), a validation set (the middle 2 weeks of data) and a test set (the last 2 weeks of data).
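As an illustration, the filtering rule could be applied as in the sketch below; the column names (departure_time, arrival_time, distance_m) are assumptions for this example, not fields of the released dataset.

```python
import pandas as pd


def filter_abnormal_trips(trips: pd.DataFrame) -> pd.DataFrame:
    """Keep trips whose duration is at least 60 s and whose average speed is at most 120 km/h.

    Assumed columns: departure/arrival timestamps in seconds and trip distance in meters.
    """
    duration_s = trips["arrival_time"] - trips["departure_time"]
    speed_kmh = (trips["distance_m"] / 1000.0) / (duration_s / 3600.0)
    return trips[(duration_s >= 60) & (speed_kmh <= 120)]
```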

5.2 Compared methods

We compared the proposed FMA-ETA with the following competitors:

(1) Route-ETA: a representative traditional non-deep-learning method. It divides the trajectory into several links and intersections. The travel time $t_{i}$ on the $i$-th link of the trajectory is calculated by dividing the link's length by the estimated speed on that link. The delay time $c_{j}$ at the $j$-th intersection is provided by a real-time traffic monitoring system. The final arrival time is the sum of the estimated time spent in each subsection.

(2) WDR(RNN): a deep learning method achieving state-of-the-art performance on the ETA problem. WDR is a joint model including a wide module, a deep module and a recurrent module. It can effectively use the dense features, high-dimensional sparse features and local features of the road sequence in traffic information. Here we use a vanilla RNN in the recurrent module.

(3) WDR(LSTM): a variant of WDR(RNN). Here we use LSTM in the recurrent module of WDR and make no changes to the other parts of WDR.

(4) WD-FFN: a variant of WDR. It replaces the recurrent module with a deep module; here we use a multi-layer perceptron network for comparison.

(5) WD-Resnet: a variant of WDR. It replaces the recurrent module with a deep module; here we use a residual structure to extract features.

(6) Multi-head attention: a variant of WDR. We use FFN with the multi-head attention mechanism instead of RNN to extract features from the sequential data.

If this work is accepted, we will open source the code of the proposed deep learning based model, FMA-ETA.

5.3 Experimental Settings

In our experiments, all models are implemented in PyTorch. They are trained and evaluated on a single NVIDIA Tesla P40 GPU. The number of training iterations for the deep learning based methods is 3.5 million. We train the deep learning based methods with back-propagation (BP), and the loss function is the MAPE loss. We choose Adam as the optimizer due to its good performance. The batch size is 256 and the initial learning rate is 0.0002.
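A minimal training loop consistent with these settings might look as follows; model and train_loader are placeholders, and this is an illustrative sketch rather than the code used for the reported experiments.

```python
import torch


def train(model, train_loader, num_iterations=3_500_000, lr=2e-4, device="cuda"):
    """Train with the reported settings: Adam, initial lr 2e-4, batch size 256, MAPE loss.

    `model` maps a batch of features to predicted travel times; `train_loader`
    yields (features, travel_time) batches of size 256. Both are assumed here.
    """
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < num_iterations:
        for features, y_true in train_loader:
            y_true = y_true.to(device)
            y_pred = model(features.to(device))
            loss = torch.mean(torch.abs((y_true - y_pred) / y_true))  # MAPE loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= num_iterations:
                break
```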

5.4 Evaluation Metrics

To evaluate and compare the performance of FMA-ETA and the other baselines, we use three evaluation metrics: Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE) and Root Mean Square Error (RMSE):

\mathrm{MAE}(y,y^{\prime})=\frac{1}{N}\sum_{i=1}^{N}\left|y_{i}-y_{i}^{\prime}\right| \quad (4)
\mathrm{RMSE}(y,y^{\prime})=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-y_{i}^{\prime}\right)^{2}} \quad (5)

where $y_{i}^{\prime}$ is the predicted travel time and $y_{i}$ is the ground truth travel time. The calculation of MAPE is shown in Section 3.
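For reference, the three metrics can be computed from arrays of predicted and ground-truth travel times with a straightforward NumPy sketch:

```python
import numpy as np


def evaluate(y_true, y_pred):
    """Return MAE (sec), RMSE (sec) and MAPE (%) as defined in Eqs. (2), (4) and (5)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0
    return mae, rmse, mape
```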

5.5 Experimental Results and Analysis

Table 2: The results of different methods

Method                    MAE (sec)    RMSE (sec)    MAPE (%)    Latency (ms)*
Route-ETA                 69.008       106.966       25.010      0.179
WD-FFN                    57.797       93.588        21.106      0.344
WD-Resnet                 57.064       92.241        21.015      0.454
WDR(RNN)                  55.284       90.836        19.677      1.107
WDR(LSTM)                 55.227       90.480        19.598      1.109
Multi-head attention      55.145       90.101        19.678      0.635
FMA-ETA (ours)            54.642       88.794        19.618      0.866

* Latency is the average inference time of the models.

Table 2 shows the three general evaluation metrics for the ETA problem. Our FMA-ETA outperforms all competitors in terms of the MAE and RMSE metrics and achieves similar results to the state-of-the-art method WDR(LSTM) in terms of the MAPE metric. The detailed analysis of the experimental results is as follows.

(1) The representative non-deep-learning method, Route-ETA, performs worse than the deep learning based methods. This shows that the data-driven method is more effective than the route-based method, and that deep methods are suitable for modeling a complex transportation system given massive spatio-temporal data.

(2) Models with a recurrent module perform better than models that only use deep modules without an attention mechanism: WDR(RNN) and WDR(LSTM) achieve better results than WD-FFN and WD-Resnet. WDR(LSTM) performs best on the MAPE metric, because its gated units can handle long-term dependencies to a certain extent. The deep modules with attention achieve better results than WDR on the MAE and RMSE metrics, which means the attention mechanism can help extract features and resolve long-range dependencies in long sequences.

(3) Our FMA-ETA performs best on the MAE and RMSE metrics, which means our method is well suited to the ETA problem. FMA-ETA outperforms WDR(LSTM) by 1.05% in terms of MAE and 1.86% in terms of RMSE, while being only 0.1% worse in terms of the MAPE metric. Considering all three evaluation metrics, our FMA-ETA performs best on the ETA problem.

(4) As can be seen in the "Latency" column of Table 2, our FMA-ETA speeds up inference by 21.8% compared with WDR(LSTM). Route-ETA has the shortest latency of 0.179 ms, but its performance on the evaluation metrics is poor. FFN-based methods without an attention mechanism are fast, but they incur a great loss on the evaluation metrics. The model with multi-head attention is faster than FMA-ETA, but its performance is worse. If the performance of an ETA model is not good enough, many downstream tasks in intelligent transportation systems, such as route planning, navigation and vehicle dispatching, will be greatly affected. Therefore, we should increase the inference speed of the model while ensuring accuracy, and currently only our FMA-ETA reaches this goal.

Our FMA-ETA performs well on the ETA problem, and it greatly improves the inference speed compared with the state-of-the-art method WDR(LSTM). FMA-ETA achieves clear improvements over WDR(LSTM) on the MAE and RMSE metrics. Taking both the evaluation metrics and the speed into account, our method is the most suitable method for the ETA problem.

5.6 Speed comparison of different methods

Figure 3: The inference time of WDR(LSTM) and FMA-ETA.

As we analyzed above, the previous state-of-the-art method for the ETA problem, WDR(LSTM), takes a long time for inference, which makes it hard to apply in a real-time traffic system. FFN can greatly accelerate the inference speed of the model for the ETA problem, as in WD-FFN and WD-Resnet, but it causes a large decrease in accuracy, as can be seen in Table 2. The attention mechanism can help FFN effectively extract sequence features. The existing multi-head attention improves the inference speed, but it still brings a great loss of accuracy. Our goal is to increase the inference speed of the ETA model while ensuring that the evaluation metrics do not degrade; from the average inference speed in Table 2, only FMA-ETA achieves this goal. We further explore the inference speed of WDR(LSTM) and FMA-ETA with different sequence lengths. We randomly sample 50 samples at each sequence length for WDR(LSTM) and FMA-ETA, and then plot the scatter points in Figure 3. The curves in Figure 3 are obtained by fitting the sampled points with a logarithmic fit.

As illustrated by the figure, our FMA-ETA is obviously faster than WDR(LSTM) when the sequence length is larger than 180. The LSTM-based model is fast on short sequences, but its inference time increases rapidly as the sequence becomes longer. In actual ride data, long sequences are common, so FMA-ETA is more applicable to practical problems.

6 Conclusion and Future Work

In this paper, to the best of our knowledge, we are the first to estimate travel time entirely based on FFN with attention, by presenting FMA-ETA. This idea is novel and quite different from RNN based methods, which have been state-of-the-art since 2018. Furthermore, we propose a novel Multi-factor self-attention mechanism for FFN to better mine sequence semantic information for ETA, which is affected by various factors. Through sufficient experiments on a massive real-world dataset from a famous intelligent travel platform, we conclude that FMA-ETA achieves slight improvements over other state-of-the-art methods in terms of prediction precision. Most importantly, our method significantly speeds up the inference process compared with RNN based methods. The Multi-factor self-attention mechanism is also verified by experiments to be superior to the popular Multi-head self-attention proposed for natural language processing. Future efforts will be made to adopt our Multi-factor self-attention for other sequence learning tasks that are also affected by many complex factors. Besides, we plan to conduct a series of online tests for FMA-ETA and decide whether we can adopt this promising deep learning framework for large-scale practical application.

Broader Impact

We present the statement of the broader impact of our paper as follows:

a) This research is beneficial for many other tasks in ITS, such as route planning, navigation and vehicle dispatching. FMA-ETA, a significantly faster and more accurate framework for ETA, benefits ride-hailing platforms such as DiDi and Uber by providing a better user experience. Furthermore, our method promotes the long-term development of ITS and spatial-temporal sequential prediction;

b) We are convinced that nobody will be put at a disadvantage by our work. On the contrary, our research indirectly makes travel more convenient for many people and helps environmental protection;

c) Our framework is a potential framework for online application, which reflects its practical application value. If our model is selected for practical application, the ride-hailing platform will also run multi-directional tests in order to avoid economic loss should our method fail;

d) We ensure that the method does not leverage any biases in the data. The experiments are carried out on a large-scale real-world vehicle travel dataset. The data include more than 500 million trajectories and cover almost all road types. Therefore, the distribution is close to that of the real world.

References

  • [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
  • [2] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • [3] Hao Chen, Hesham A Rakha, and Catherine C McGhee. Dynamic travel time prediction using pattern recognition. In 20th World Congress on Intelligent Transportation Systems. TU Delft, 2013.
  • [4] Elman and Jeffrey L. Finding structure in time. Cognitive science, 14(2):179–211, 1990.
  • [5] Kun Fu, Fanlin Meng, Jieping Ye, and Zheng Wang. Compacteta: A fast inference system for travel time prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
  • [6] Sepp Hochreiter and Jürgen Schmidhuber. Lstm can solve hard long time lag problems. In Advances in neural information processing systems, pages 473–479, 1997.
  • [7] Aude Hofleitner, Ryan Herring, Pieter Abbeel, and Alexandre Bayen. Learning the dynamics of arterial traffic from probe data using a dynamic bayesian network. IEEE Transactions on Intelligent Transportation Systems, 13(4):1679–1693, 2012.
  • [8] John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  • [9] Jordan and Michael I. Serial order: A parallel distributed processing approach. In Advances in psychology, volume 121, pages 471–495. Elsevier, 1997.
  • [10] Wang-Cheng Kang and Julian McAuley. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197–206. IEEE, 2018.
  • [11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [12] Wuwei Lan, Yanyan Xu, and Bin Zhao. Travel time estimation without road networks: an urban morphological layout representation approach. arXiv preprint arXiv:1907.03381, 2019.
  • [13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
  • [14] Yaguang Li, Kun Fu, Zheng Wang, Cyrus Shahabi, Jieping Ye, and Yan Liu. Multi-task representation learning for travel time estimation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1695–1704, 2018.
  • [15] Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
  • [16] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems, pages 3165–3174, 2019.
  • [17] Julian Salazar, Katrin Kirchhoff, and Zhiheng Huang. Self-attention networks for connectionist temporal classification in speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7115–7119. IEEE, 2019.
  • [18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [19] Dong Wang, Junbo Zhang, Wei Cao, Jian Li, and Yu Zheng. When will you arrive? estimating travel time based on deep neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [20] Hongjian Wang, Xianfeng Tang, Yu-Hsuan Kuo, Daniel Kifer, and Zhenhui Li. A simple baseline for travel time estimation using large-scale trip data. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2):1–22, 2019.
  • [21] Yilun Wang, Yu Zheng, and Yexiang Xue. Travel time estimation of a path using sparse trajectories. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 25–34. ACM, 2014.
  • [22] Zheng Wang, Kun Fu, and Jieping Ye. Learning to estimate the travel time. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 858–866, 2018.
  • [23] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [24] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. Qanet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541, 2018.
  • [25] Jing Yuan, Yu Zheng, Xing Xie, and Guangzhong Sun. T-drive: Enhancing driving directions with taxi drivers’ intelligence. IEEE Transactions on Knowledge and Data Engineering, 25(1):220–232, 2011.
  • [26] Xianyuan Zhan, Samiul Hasan, Satish V Ukkusuri, and Camille Kamga. Urban link travel time estimation using large-scale taxi data with partial information. Transportation Research Part C: Emerging Technologies, 33:37–49, 2013.
  • [27] Faming Zhang, Xinyan Zhu, Tao Hu, Wei Guo, Chen Chen, and Lingjia Liu. Urban link travel time prediction based on a gradient boosting method considering spatiotemporal correlations. ISPRS International Journal of Geo-Information, 5(11):201, 2016.
  • [28] Hanyuan Zhang, Hao Wu, Weiwei Sun, and Baihua Zheng. Deeptravel: a neural network based travel time estimation model with auxiliary supervision. arXiv preprint arXiv:1802.02147, 2018.
  • [29] Zheng Zhu, Wei Wu, Wei Zou, and Junjie Yan. End-to-end flow correlation tracking with spatial-temporal attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 548–557, 2018.