
MSTFormer: Motion Inspired Spatial-temporal Transformer with Dynamic-aware Attention for long-term Vessel Trajectory Prediction

Huimin Qiang, Zhiyuan Guo, Shiyuan Xie, Xiaodong Peng
Abstract

Incorporating dynamics knowledge into the model is critical for achieving accurate trajectory prediction while considering the spatial and temporal characteristics of the vessel. However, existing methods rarely consider the underlying dynamics knowledge and directly use machine learning algorithms to predict the trajectories. Intuitively, a vessel's motion follows the laws of dynamics, e.g., the speed of a vessel decreases when turning a corner. Yet, it is challenging to combine dynamics knowledge and neural networks due to their inherent heterogeneity. Against this background, we propose MSTFormer, a motion inspired vessel trajectory prediction method based on the Transformer. The contribution of this work is threefold. First, we design a data augmentation method to describe the spatial features and motion features of the trajectory. Second, we propose a Multi-head Dynamic-aware Self-attention mechanism to focus on trajectory points with frequent motion transformations. Finally, we construct a knowledge-inspired loss function to further boost the performance of the model. Experimental results on real-world datasets show that our strategy not only effectively improves long-term predictive capability but also outperforms backbones on cornering data. The ablation analysis further confirms the efficacy of the proposed method. To the best of our knowledge, MSTFormer is the first neural network model for trajectory prediction fused with vessel motion dynamics, providing a worthwhile direction for future research. The source code is available at https://github.com/simple316/MSTFormer.

Index Terms:
Long-term prediction, dynamics knowledge representation, spatio-temporal information, vessel trajectory.

I Introduction

Maritime transportation is the foundation of international trade and the global economy; more than 80% of the world's products are transported by sea. Despite the disruption of COVID-19, the decline in maritime traffic in 2020 was still smaller than expected (https://unctad.org/webflyer/review-maritime-transport-2021). With the gradual easing of the pandemic and the increasing demand for maritime transportation, the issue of vessel safety and security has received increasing attention from industry and academia [1]. Among related tasks, accurate vessel trajectory prediction plays a critical role in collision avoidance [2], abnormal trajectory detection [3], navigation safety guarantee [4], port management [5], etc.

With the improvement of sensor accuracy, multi-source data is collected for maritime Situation Awareness (SA) and vessel trajectory prediction assistance, including Automatic Identification System (AIS) data [6], SAR satellite data [7], Vessel Monitoring Systems (VMS) data [8], Long-Range Identification and Tracking (LRIT) data [9], etc. According to a recent review [10], 51 of the 57 surveyed studies are based on AIS data for vessel trajectory prediction, which indicates a growing academic interest in exploring new approaches to maritime traffic analysis based on AIS data. Besides, AIS transponders are widely equipped on vessels, and a large amount of static and dynamic information is acquired to provide data support for such studies. However, AIS data usually include significant data flaws and poor data consistency, which pose many problems for data processing. At the same time, existing methods use cleaned time-series data directly and do not explore the deep dynamic information contained in the data. Therefore, how to deeply mine the spatial-temporal information of the trajectory implied in the AIS data is still an open problem.

Many researchers have devoted their efforts to predicting vessels' trajectories from AIS data [6, 11, 12]. The existing vessel trajectory prediction methods can be classified into four categories: simulation-based, statistics-based, deep learning-based, and hybrid-based. Generally, simulation-based methods simulate vessel behavior by constructing a motion model [13], but their predictions are unreliable when the trajectory is sparse [10]. Statistics-based methods usually search for matches or build statistical models such as nearest neighbor search [14, 15], Markov chains [8], filters [16], etc. These methods are beneficial for long-term trajectory prediction but are computationally expensive and sensitive to parameters. Deep learning-based methods abstract the data into high-level features by stacking multiple non-linear layers to learn the complex dependencies in the trajectory samples [10]. While deep learning-based methods have shown fast prediction and strong generalization abilities, they are empirically unsuitable for predicting medium- and long-term trajectories and struggle to effectively capture long-term data dependencies [12]. In addition, purely deep learning-based methods are less accurate in predicting the trajectory of corners [17].

To solve these problems of predicting longer trajectories, it is natural to consider hybrid methods that combine deep learning with other methods. There are two mainstream ways of hybrid methods. The first is to cluster all historical trajectories, design cluster classifiers, and train their own local behavior network for each cluster [18, 2]. The second is to extract the typical waterway and use it to correct the prediction results of the neural network [12, 19, 20]. From a higher dimensional perspective, these methods use pre-existing or extracted knowledge to assist neural networks in predicting trajectories. However, the separation of knowledge from the network’s learning process restricts the potential of deep learning to extract the underlying dynamics knowledge. In summary, to overcome the above issue and achieve accurate vessel trajectory prediction performance for long-term trajectory, it is necessary to handle three important problems as follows:

  1. How to deeply extract effective spatial-temporal information from massive historical AIS data?

  2. How to keep a focus on navigation state changes while ensuring the long-range dependence of trajectories?

  3. How to reliably integrate dynamic knowledge into the learning process of neural networks and improve its comprehension?

To achieve this goal, we propose MSTFormer, a motion inspired spatial-temporal transformer for long-term vessel trajectory prediction. MSTFormer extracts spatial characteristics and simple dynamics knowledge from the Augmented Trajectory Matrix (ATM), guarantees long-term dependence through the Multi-head Dynamic-aware Self-attention mechanism, and is coupled with a knowledge inspired loss function. The main contributions of this paper are summarized as follows:

  1. We propose a novel data augmentation approach based on the AIS trajectory point information. A series of matrices is constructed from adjacent trajectory information, incorporating basic dynamics knowledge.

  2. We develop a Multi-head Dynamic-aware Self-attention mechanism that focuses on capturing changes in the vessel's motion and building long-term dependence on the trajectory.

  3. We design a knowledge inspired loss function based on prediction correction.

  4. The proposed MSTFormer is assessed on realistic vessel trajectories in the Gulf of Mexico. Our method outperforms previous state-of-the-art approaches on long-term and cornering trajectories.

The remainder of this paper is organized as follows. Section II provides a summary of research on state-of-the-art trajectory prediction methods. The MSTFormer for vessel trajectory prediction is presented in Section III. The superior performance of our method is demonstrated by comparative experiments and effect analysis in Section IV. Section V concludes this paper by discussing the research findings and potential directions for future investigation.

II Related Works

In this section, we briefly review deep learning-based and hybrid-based trajectory prediction methods, as they are more relevant to our work. In addition, transformer-based methods for time series data prediction have attracted substantial attention due to their effectiveness. We review this line of research in the third subsection.

II-A Vessel Trajectory Prediction Using Deep Learning-based Methods

Researchers have used various deep learning models for trajectory prediction, including the Recurrent Neural Network (RNN) [21], Gated Recurrent Unit (GRU) [22], Long Short-Term Memory (LSTM) [4], and Transformer [23]. Among them, LSTM and its variants have become the mainstream approach for trajectory prediction due to their effectiveness in modeling time-series dependencies. To model bidirectional dependencies, the bidirectional LSTM (Bi-LSTM) [24] has been introduced, maintaining the relevance between historical and future time-series data. Wang and Fu [25] proposed an attention-based Bi-LSTM to model multiple dependencies adaptively. Moreover, several methods combine the bidirectional structure, attention mechanism, and encoder-decoder in various ways [26, 27, 28], achieving superior performance. Besides, other modern network architectures, such as graph convolutional networks [3] and Transformers [23], have also been employed to predict vessel trajectories.

II-B Vessel Trajectory Prediction Using Hybrid-based Methods

Deep learning-based methods generally exhibit poor performance in long-term trajectory prediction [11]. Therefore, some studies have attempted knowledge-aided prediction by combining neural networks with other methods. Some methods [29, 30] separate the regions using clustering and then train a regional neural network for prediction. Instead of clustering regions, a hybrid architecture [18] with three modules (trajectory clustering, trajectory classification, and trajectory prediction) was developed. Another study [31] used hierarchical DBSCAN (HDBSCAN) for trajectory clustering and a bidirectional GRU for classification and prediction. In addition, there are methods that use extracted waterways or similar historical trajectories to assist neural network predictions. A novel hybrid framework [12] consists of three parts: a neural network that predicts sog and cog, a correction of cog using the waterway and a Particle Filter (PF), and finally a dynamical model that calculates the position. A similarity-search prediction model comprising DTW for search and LSTM for spatial distance prediction was also explored [32]. However, these methods use knowledge to assist the network prediction separately. As a result, the prediction accuracy decreases when the vessel's motion changes drastically, because these methods do not learn the dynamics themselves.

II-C Recent Advances in Transformer Network

The outstanding results of Transformers [33] in natural language processing (NLP) have attracted much attention in the time series field [34]. For time series modeling, the ability of Transformers to capture long-range dependencies and interactions is particularly inspiring, and several Transformer variants have been effectively used for prediction tasks [35, 36]. For example, AST trains a sparse Transformer model using a generative adversarial encoder-decoder architecture [37], and to capture temporal correlations of various ranges, Pyraformer adopts a hierarchical pyramidal attention module organized along a binary tree [38]. However, most of these methods are designed for short-term prediction, and their computational efficiency decreases for long-sequence time-series prediction. LogTrans integrates a sparse bias, the LogSparse mask for long sequences, into the self-attention model to reduce computational complexity [39]. Similarly, Informer chooses $O(\log L)$ dominating queries based on the Query-Key similarities to improve computational efficiency; additionally, it develops a generative-style decoder to directly provide long-term predictions and prevent accumulated error [40]. There are also Transformer variants that extract spatial and temporal features simultaneously. Besides a temporal Transformer block to capture temporal dependencies, Traffic Transformer creates an extra graph neural network unit to capture spatial dependencies [41]. Likewise, the Spatial-temporal Transformer designs a spatial Transformer block to predict traffic flow [42], and the Spatial-temporal graph Transformer uses an attention-based graph convolution mechanism [43] for pedestrian trajectory prediction. However, no Transformer variant yet fully exploits the characteristics of vessel trajectories.

III MSTFormer: Knowledge Inspired Spatial-temporal Transformer

In this paper, our goal is to maintain the long-term trajectory dependency and model the spatial-temporal motion features using dynamics knowledge to predict moving vessels' positions further ahead. To achieve superior performance, we propose the MSTFormer-based trajectory prediction method, illustrated in Fig. 1. The proposed MSTFormer consists of three main components: the Augmented Trajectory Matrix (ATM), the network structure with Multi-head Dynamic-aware Self-attention, and the knowledge inspired loss function. First, the Augmented Trajectory Matrix utilizes dynamics knowledge to efficiently represent spatial features. Second, the network structure with Multi-head Dynamic-aware Self-attention is designed to better model long-term trajectory dependency; a CNN is employed to extract spatial features from the ATMs. Finally, the knowledge-inspired loss function calculates the loss using the haversine formula after the motion model converts the features extracted by MSTFormer into predicted positions.

Figure 1: Framework of the MSTFormer.

III-A Data Augmentation and Problem Formulation

It is critical to use historical vessel trajectory information to predict future trajectories. We define the trajectory time series and the Augmented Trajectory Matrix to describe the vessel trajectory prediction problem.

III-A1 TTS: Trajectory Time Series


The original data is divided into trajectory segments $Traj_i$ based on time and distance differences and then uniformly resampled to one point per minute using cubic spline interpolation. The vessel trajectory $Traj_i$ is represented by a series of timestamped trajectory points $p_n=[t_n, lon_n, lat_n, sog_n, cog_n, heading_n]$ with $n\in\{1,2,\ldots,N\}$ obtained from AIS devices, i.e., $Traj_i=\{p_1,p_2,\ldots,p_N\}$, where $N$ is the number of timestamped trajectory points in $Traj_i$ and $i\in\{1,2,\ldots,I\}$, where $I$ is the number of trajectory segments. Here, $t_n$, $lon_n$, $lat_n$, $sog_n$, $cog_n$, and $heading_n$ represent the time stamp, longitude, latitude, speed over ground, course over ground, and heading of the vessel, respectively.

We use the differences of the data as input because the differenced series is smoother, which helps to improve the prediction accuracy. The sequence data used is represented as $\Delta Traj_i=\{\Delta p_1,\Delta p_2,\ldots,\Delta p_{N-1}\}$, where $\Delta p_m=p_n-p_{n-1}$ with $m\in\{1,2,\ldots,N-1\}$. In this work, the input time series of MSTFormer is $\mathbf{X}=\{\mathbf{x}_1^t,\ldots,\mathbf{x}_{L_x}^t\mid\mathbf{x}_i^t\in\mathbb{R}^{d_x}\}$ with $\mathbf{x}_i^t=[\Delta sog_i,\Delta cog_i]$. The input data is split into encoded data and decoded data, represented as $\mathbf{X}_{enc}$ and $\mathbf{X}_{dec}$. The true trajectory is $\mathbf{Y}=\{\mathbf{y}_1^t,\ldots,\mathbf{y}_{L_y}^t\mid\mathbf{y}_i^t\in\mathbb{R}^{d_y}\}$ with $\mathbf{y}_i^t=[lon_i,lat_i]$. The historical trajectory point where the prediction begins is called the prediction point and is denoted as $\mathbf{P}=[lon,lat,sog,cog]$.
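To make the data preparation concrete, the following minimal Python sketch builds the differenced input $\mathbf{X}=[\Delta sog,\Delta cog]$, the position targets, and the prediction point from one trajectory segment. The $(N,6)$ array layout and the `hist_len` argument are assumptions introduced only for illustration; the encoder/decoder split described above would be applied afterwards.

```python
import numpy as np

def build_inputs(segment, hist_len):
    """Build the differenced input X = [dsog, dcog], the position targets Y, and the
    prediction point P from one trajectory segment.

    `segment` is assumed to be an (N, 6) array with columns
    [t, lon, lat, sog, cog, heading], one row per minute (hypothetical layout);
    `hist_len` is the number of observed points before prediction starts.
    """
    d_sog = np.diff(segment[:, 3])              # delta sog between consecutive points
    d_cog = np.diff(segment[:, 4])              # delta cog between consecutive points
    X = np.stack([d_sog, d_cog], axis=-1)       # one x_i = [dsog_i, dcog_i], shape (N-1, 2)
    Y = segment[hist_len:, 1:3]                 # future [lon, lat] targets
    P = segment[hist_len - 1, [1, 2, 3, 4]]     # prediction point [lon, lat, sog, cog]
    return X, Y, P
```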

III-A2 ATM: Augmented Trajectory Matrix


The ATM is constructed from two adjacent trajectory points $p_n$ and $p_{n+1}$. In Fig. 1, the position in the center of the matrix is taken as the current position $[lon_n,lat_n]$, and the weight of the center position $ATM[l_c]$ is set to $L$. The representation locations $l_h$ and $l_s$ in the ATM are obtained from the directions given by $cog_n$ and $heading_n$, and are assigned as $ATM[l_h]=H$ and $ATM[l_s]=S\cdot sog_n$ in Algorithm 1. The representation location $l_n$ of $p_{n+1}$ in the ATM is then calculated based on the positional relationship between $p_n$ and $p_{n+1}$, and $ATM[l_n]$ is Gaussianised in the matrix with Gaussian kernel $G$. The constants $L$, $H$, $S$, $sogLine$, and $Grid$ are set to 0.8, 0.2, 0.02, 3, and 0.25, respectively, where $Grid$ indicates that one grid cell represents 0.25 km. The proposed ATM is well suited to the analysis of trajectory time series data, as it efficiently captures spatial features and basic motion states.

Algorithm 1 Augmented Trajectory Matrix (ATM)
1: function getATM($p_n, p_{n+1}, L, H, S, G, sogLine, Grid$)
2:     $ATM[l_c] \leftarrow L$
3:     $l_h \leftarrow [c_x - \cos(cog_n),\ c_y + \sin(cog_n)]$
4:     $ATM[l_h] \leftarrow H$
5:     for $a = 1$ to $sogLine$ do
6:         $l_s \leftarrow [c_x + a\cos(cog_n),\ c_y - a\sin(cog_n)]$
7:         $ATM[l_s] \leftarrow S \cdot sog_n$
8:     end for
9:     $dis_x \leftarrow Haversine(lon_n, lat_{n+1}, lon_n, lat_n)$
10:    $dis_y \leftarrow Haversine(lon_{n+1}, lat_n, lon_n, lat_n)$
11:    $\Delta lat \leftarrow lat_{n+1} - lat_n$
12:    $\Delta lon \leftarrow lon_{n+1} - lon_n$
13:    $l_n \leftarrow [\operatorname{sign}(\Delta lat) \cdot dis_x / Grid,\ \operatorname{sign}(\Delta lon) \cdot dis_y / Grid]$
14:    $ATM[l_n] \leftarrow G$
15:    return $ATM$
16: end function
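As an illustration of Algorithm 1, the sketch below builds one ATM in NumPy. The matrix size, the Gaussian width, interpreting $l_n$ as an offset from the centre cell, and treating the course in degrees are assumptions not fixed by the text; the constants follow the values given above.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, R=6371.0):
    """Great-circle distance in km (a mean Earth radius is assumed here for brevity)."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

def get_atm(p_n, p_n1, size=21, L=0.8, H=0.2, S=0.02, sog_line=3, grid=0.25, sigma=1.0):
    """Build one Augmented Trajectory Matrix from two adjacent points (sketch of Algorithm 1)."""
    _, lon_n, lat_n, sog_n, cog_n, _ = p_n      # Algorithm 1 uses cog_n for the direction cells
    _, lon_n1, lat_n1, *_ = p_n1
    atm = np.zeros((size, size))
    cx = cy = size // 2
    atm[cx, cy] = L                             # weight of the current position l_c
    rad = np.radians(cog_n)
    atm[int(round(cx - np.cos(rad))), int(round(cy + np.sin(rad)))] = H          # heading cell l_h
    for a in range(1, sog_line + 1):            # speed ray l_s along the course
        atm[int(round(cx + a * np.cos(rad))), int(round(cy - a * np.sin(rad)))] = S * sog_n
    dis_x = haversine_km(lon_n, lat_n1, lon_n, lat_n)    # north-south displacement (km)
    dis_y = haversine_km(lon_n1, lat_n, lon_n, lat_n)    # east-west displacement (km)
    li = int(np.clip(cx + np.sign(lat_n1 - lat_n) * dis_x / grid, 0, size - 1))
    lj = int(np.clip(cy + np.sign(lon_n1 - lon_n) * dis_y / grid, 0, size - 1))
    xs, ys = np.meshgrid(np.arange(size), np.arange(size), indexing="ij")
    atm += np.exp(-((xs - li) ** 2 + (ys - lj) ** 2) / (2 * sigma ** 2))         # Gaussianised l_n
    return atm
```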

III-A3 Problem Formulation


The vessel trajectory prediction problem is a classic time-series prediction task that can be regarded as learning a function $\mathcal{F}$ to estimate the most likely traffic characteristics of a future time period given historical vessel trajectories. In this paper, we define the vessel trajectory information in the maritime domain as the attribute features of the TTS and ATM in the network. Thus, the prediction framework in this paper can be given by

$\hat{\mathbf{Y}}=\mathcal{F}\left(\mathbf{X},\mathbf{M},\mathbf{P};\mathbb{C}\right),$ (1)

where $\mathcal{F}$ is the MSTFormer network learned from historical data. $\mathbf{X}$ and $\hat{\mathbf{Y}}$ denote the historical trajectory and the predicted trajectory, i.e., $\mathbf{X}=\{\mathbf{x}_1^t,\ldots,\mathbf{x}_{L_x}^t\mid\mathbf{x}_i^t\in\mathbb{R}^{d_x}\}$ and $\hat{\mathbf{Y}}=\{\hat{\mathbf{y}}_1^t,\ldots,\hat{\mathbf{y}}_{L_y}^t\mid\hat{\mathbf{y}}_i^t\in\mathbb{R}^{d_y}\}$. $\mathbf{M}$ denotes the Augmented Trajectory Matrices obtained by augmenting the raw time-series data. $\mathbf{P}$ is the prediction point. $\mathbb{C}$ represents the set of network parameters in Eq. 1.

III-B MSTFormer structure with Dynamic-aware Attention

This subsection focuses on how to build the structure of MSTFormer to incorporate dynamics knowledge and model longer spatial-temporal correlations, and how to predict vessel trajectories based on MSTFormer. We first introduce the data embedding of MSTFormer, which includes position embedding and time embedding. Then, we introduce the Multi-head Dynamic-aware Self-attention proposed to focus on long-term trajectory dependence and motion state changes more effectively. Finally, the complete network structure of MSTFormer, including encoders with the distilling operation and a generative-style decoder, is applied to the TTS and ATMs for feature learning.

III-B1 Data Embedding


Unlike LSTM or RNN, the Transformer has neither recurrence nor convolution. Instead, it models the sequence information using position embedding. For each position $pos$ in the vanilla Transformer, there is the fixed position embedding

$PE_{(pos,n)}=\begin{cases}\sin\left(pos/10000^{2n/d_{model}}\right),&n\%2=1\\ \cos\left(pos/10000^{2n/d_{model}}\right),&n\%2=0\end{cases}$ (2)

where $n$ is the dimension and $d_{model}$ is the feature dimension after token embedding. This function makes it possible to encode both absolute and relative positions.

In order to capture long-range dependence and exploit the global context, hierarchical time embedding [40, 36] is utilized. This data embedding method has become popular for lessening the impact of Query-Key mismatches between the encoder and decoder. A trainable stamp embedding $TE_{(pos,i)}$ employs all global time stamp layers and follows the same embedding strategy as $PE_{(pos,i)}$ in Eq. 2. Based on the temporal characteristics of the vessel trajectory data, this paper utilizes five layers (minute, hour, day, week, month) of time embedding, as shown in Fig. 2. This approach better captures the periodicity of trajectories at different time lengths.

In Fig. 2, we first adjust the dimensionality of the input data $\mathbf{x}_{i}^{t}$ to a $d_{model}$-dimensional vector $\mathbf{u}_{i}^{t}$ using a 1-D convolutional filter (kernel width = 3, stride = 1). Thus, the feeding vector is given by

$\mathbf{X}_{\text{feed}[i]}^{t}=\alpha\mathbf{u}_{i}^{t}+\operatorname{PE}_{(ind,\cdot)}+\sum_{l}\left[\operatorname{TE}_{(ind,\cdot)}\right]_{l},$ (3)

where $ind=L_{x}\times(t-1)+i$, $i$ is the index of the input data, and $l$ is the layer index of the time embedding, i.e., $i\in\{1,\ldots,L_{x}\}$ and $l\in\{1,\ldots,L_{l}\}$. $\alpha$ is a factor that balances the token projection and the embeddings.

Figure 2: Data embedding of the MSTFormer.
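A compact PyTorch sketch of the data embedding in Eqs. 2-3 is given below: a 1-D convolutional token embedding, a fixed sinusoidal position embedding (written here with the standard even/odd sine-cosine assignment), and five trainable time-stamp embeddings that are summed. The stamp vocabulary sizes and $d_{model}=512$ are assumptions.

```python
import math
import torch
import torch.nn as nn

class DataEmbedding(nn.Module):
    """Sketch of Eqs. (2)-(3): conv token embedding + sinusoidal PE + learnable time stamps."""
    def __init__(self, d_in=2, d_model=512, max_len=5000, alpha=1.0):
        super().__init__()
        self.token = nn.Conv1d(d_in, d_model, kernel_size=3, padding=1)   # kernel width 3, stride 1
        self.alpha = alpha
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)                                # fixed position embedding
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # one trainable embedding table per global time-stamp layer (sizes are assumptions)
        sizes = [60, 24, 32, 54, 13]              # minute, hour, day, week, month
        self.stamps = nn.ModuleList(nn.Embedding(n, d_model) for n in sizes)

    def forward(self, x, marks):
        # x: (B, L, d_in) differenced series; marks: (B, L, 5) integer time stamps
        u = self.token(x.transpose(1, 2)).transpose(1, 2)                 # (B, L, d_model)
        te = sum(emb(marks[..., i]) for i, emb in enumerate(self.stamps))
        return self.alpha * u + self.pe[: x.size(1)] + te                 # Eq. (3)
```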

III-B2 Multi-head Dynamic-aware Attention


The Transformer adopts the formulation of [33], replacing the single attention function in Eq. 4 with multi-head attention using $H$ discrete learned projections, as in Eq. 5:

$\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V.$ (4)

$\operatorname{MultiHead}(Q,K,V)=\operatorname{Concat}\left(\ldots,\operatorname{head}_{i},\ldots\right)W^{O},$ (5)
where $\operatorname{head}_{i}=\operatorname{Attention}\left(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}\right).$

The multiple independent attention heads allow for parallel computation, which improves computational efficiency. However, due to the quadratic complexity in sequence length, the long-term prediction problem still requires a large amount of computation. Our method chooses representative Queries in Fig. 3(a) and Values in Fig. 3(b) to take part in the computation, ensuring accuracy while reducing computational complexity.

In Fig. 3, the input $\mathbf{X}_{\text{feed}}\in\mathbb{R}^{L_{seq}\times d_{model}}$ of the attention block is first linearly projected using $\boldsymbol{w}_{Q},\boldsymbol{w}_{K},\boldsymbol{w}_{V}\in\mathbb{R}^{d_{model}\times d_{model}}$, so that $Q=\mathbf{X}_{\text{feed}}\cdot\boldsymbol{w}_{Q}$; the Key and Value are obtained similarly. We then split $Q$ into multiple heads and transpose it, so that $Q\in\mathbb{R}^{N_{head}\times L_{seq}\times L_{hSeq}}$ with $d_{model}=N_{head}\times L_{hSeq}$. In each head $i$, we select $M$ representative queries $\tilde{\boldsymbol{Q}}_{h}^{i}$ from $\boldsymbol{Q}_{h}^{i}$ by

$\tilde{\boldsymbol{Q}}_{h}^{i}=\operatorname{Select}(\boldsymbol{Q}_{h}^{i})=\operatorname{Select}(\boldsymbol{Q}_{h}^{i},\operatorname{Sort}(\mathcal{I}_{h}^{i}(\mathbf{X})),M),$ (6)

where $\boldsymbol{Q}=\{\boldsymbol{Q}_{h}^{1},\ldots,\boldsymbol{Q}_{h}^{N_{head}}\mid\boldsymbol{Q}_{h}^{i}\in\mathbb{R}^{L_{seq}\times L_{hSeq}}\}$, $\mathcal{I}=\{\mathcal{I}_{h}^{1},\ldots,\mathcal{I}_{h}^{N_{head}}\mid\mathcal{I}_{h}^{i}\in\mathbb{R}^{L_{seq}}\}$, and $M=c\cdot\ln L_{Q}$. $\mathcal{I}_{h}^{i}$ represents the importance list of $\boldsymbol{Q}_{h}^{i}$ at different positions in the $i$-th head. It can be obtained by

$\mathcal{I}_{h}^{i}(\mathbf{X})=\mathcal{I}_{pos}(\mathbf{X}_{ind})+w_{s}\mathcal{I}_{va}(\mathbf{X}_{\Delta sog})+w_{c}\mathcal{I}_{va}(\mathbf{X}_{\Delta cog}),$ (7)
$\mathcal{I}_{pos}(\mathbf{X}_{ind})=\frac{1}{\ln(\mathbf{X}_{ind}+D)},$ (8)
$\mathcal{I}_{va}(\mathbf{X}_{val})=\ln\left(\min\left(\max\left(\lvert\mathbf{X}_{val}\rvert-\frac{\sum\lvert\mathbf{X}_{val}\rvert}{\mathbf{X}_{len}},0\right),1\right)\right),$ (9)

where $\mathbf{X}_{ind}$ is the index of the historical trajectory time series data, and $\mathbf{X}_{\Delta sog}$ and $\mathbf{X}_{\Delta cog}$ are the speed difference and course difference between two trajectory points. $w_{s}$ and $w_{c}$ regulate the proportion of the three influences in the different heads. $D$ is a constant that guarantees $\mathcal{I}_{pos}\in[0,1]$. We obtain the most important queries by sorting $\mathcal{I}(\mathbf{X})$. Then $\tilde{Q}\_K$ is given by

$\tilde{Q}\_K=\operatorname{softmax}\left(\frac{\tilde{Q}K^{T}}{\sqrt{d_{k}}}\right).$ (10)

For the unselected positions, we fill in $\overline{V}$, and the attention formulation is

$\operatorname{Attention}(Q,K,V)=\operatorname{Concat}(\overline{V_{noselect}},\tilde{Q}\_K\cdot V).$ (11)
Figure 3: Multi-head Dynamic-aware Attention.
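The core of the dynamic-aware selection in Eqs. 6-9 can be sketched as follows; the per-head weights $w_s$, $w_c$, the constant $D$, and the small epsilon that guards against $\ln(0)$ are assumed values rather than ones given in the paper. The selected queries then enter Eqs. 10-11, with the unselected rows filled by $\overline{V}$.

```python
import math
import torch

def importance(idx, d_sog, d_cog, w_s=1.0, w_c=1.0, D=3.0):
    """Dynamic-aware importance of Eqs. (7)-(9); w_s, w_c, and D are assumed values."""
    i_pos = 1.0 / torch.log(idx.float() + D)                  # Eq. (8): decays with position index
    def i_val(v):                                             # Eq. (9): reward large |dsog| / |dcog|
        dev = v.abs() - v.abs().mean(dim=-1, keepdim=True)
        return torch.log(dev.clamp(min=0.0, max=1.0) + 1e-9)  # eps avoids log(0), not in the paper
    return i_pos + w_s * i_val(d_sog) + w_c * i_val(d_cog)    # Eq. (7)

def select_queries(Q, scores, c=3):
    """Eq. (6): keep the M = c*ln(L_Q) queries with the highest importance scores."""
    L_q = Q.size(-2)
    M = max(1, int(c * math.log(L_q)))
    top = scores.topk(M, dim=-1).indices                      # positions of the selected queries
    gathered = Q.gather(-2, top.unsqueeze(-1).expand(*top.shape, Q.size(-1)))
    return gathered, top
```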

When the data enters the second layer of MSTFormer, its importance can no longer be determined from dynamics knowledge. In this paper, we therefore use the ProbSparse Self-attention of Informer [40] above the first layer of MSTFormer. First, $\tilde{K}$ is randomly sampled from $K$, where the number of sampled keys is $c\cdot\ln L_{K}$. Then the sampled $Q\_\tilde{K}=Q\tilde{K}^{T}$ is calculated and the importance is computed by

$\mathcal{I}(\mathbf{X})=\operatorname{max}(Q\_\tilde{K})-\operatorname{mean}(Q\_\tilde{K}).$ (12)

III-B3 The Entire Structure of MSTFormer


The entire structure of MSTFormer mainly consists of a CNN and a variant of the Transformer, as shown in Fig. 1. The CNN focuses on extracting spatial features from the trajectory data and learning the basic dynamics encoded in the ATMs. In contrast, the Transformer primarily captures the temporal features of the time series data, paying attention to changes in speed and course.

After dividing the data into encoder data and decoder data, we evenly select a portion of the trajectory points to construct the ATMs. The same three-layer convolutional network is used for the ATMs of the encoder and the decoder:

$Conv=\operatorname{MaxPool3d}(\operatorname{ReLU}(\operatorname{Conv3d}(ATM))),$ (13)
$\operatorname{CNN}=\operatorname{Linear}(Conv1,Conv2,Conv3),$ (14)

where $kernel_{Conv3d}=(1,3,3)$, $kernel_{MaxPool3d}=(1,2,2)$, and the number of channels remains unchanged in $Conv1$; $kernel_{Conv3d}=(3,3,3)$, $kernel_{MaxPool3d}=(3,2,2)$, and the channels change from 1 to 16 in $Conv2$; and $kernel_{Conv3d}=(3,3,3)$, $kernel_{MaxPool3d}=(1,2,2)$, and the channels change from 16 to the length of the time series data in $Conv3$. In the first layer, the network mainly extracts the shape features in the ATM and learns the motion relationships between vessel speed, course, and the next-moment trajectory point position. In the subsequent layers, we apply deep convolution to extract the connections between the ship's motion behaviors at different moments.
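A hedged PyTorch sketch of Eqs. 13-14 is shown below. The kernel sizes and channel counts follow the text above; the padding, the ATM grid size, and the final linear projection to $d_{model}$ are assumptions.

```python
import torch.nn as nn

class ATMEncoder(nn.Module):
    """Sketch of Eqs. (13)-(14): three Conv3d+ReLU+MaxPool3d blocks over the stack of ATMs,
    flattened and projected to one d_model vector per time step."""
    def __init__(self, seq_len, d_model=512):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv3d(1, 1, (1, 3, 3), padding=(0, 1, 1)),
                                   nn.ReLU(), nn.MaxPool3d((1, 2, 2)))
        self.conv2 = nn.Sequential(nn.Conv3d(1, 16, (3, 3, 3), padding=1),
                                   nn.ReLU(), nn.MaxPool3d((3, 2, 2)))
        self.conv3 = nn.Sequential(nn.Conv3d(16, seq_len, (3, 3, 3), padding=1),
                                   nn.ReLU(), nn.MaxPool3d((1, 2, 2)))
        self.proj = nn.LazyLinear(d_model)        # Eq. (14): flatten each step and project

    def forward(self, atms):
        # atms: (B, 1, n_atm, H, W), a stack of Augmented Trajectory Matrices
        h = self.conv3(self.conv2(self.conv1(atms)))          # (B, seq_len, d', h', w')
        return self.proj(h.flatten(2))                        # (B, seq_len, d_model)
```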

The MSTFormer mainly consists of encoders and decoders. The prediction results can be obtained by

$\hat{\mathcal{T}}=\operatorname{Dec}(\operatorname{Enc}(\mathbf{X}_{enc},\mathbf{M}_{enc};\mathbb{C}),\mathbf{X}_{dec},\mathbf{M}_{dec};\mathbb{C}),$ (15)

where $\mathbf{X}_{enc}$, $\mathbf{X}_{dec}$, $\mathbf{M}_{enc}$, and $\mathbf{M}_{dec}$ are the encoded and decoded data split from the sets $\mathbf{X}$ and $\mathbf{M}$. In the encoder, the input time series data is first processed by embedding and attention, and the resulting tensor is combined with the tensor obtained by the CNN:

$\mathcal{A}_{enc}=\operatorname{Norm}(\operatorname{D\text{-}atten}(\operatorname{Embedding}(\mathbf{X}_{enc}))+\mathbf{X}_{enc}),$ (16)
$\mathcal{C}_{enc}=\operatorname{Norm}(\mathcal{A}_{enc}+\operatorname{CNN}(\mathbf{M}_{enc})),$ (17)
$\operatorname{Enc}(\mathbf{X}_{enc},\mathbf{M}_{enc};\mathbb{C})=\operatorname{conv1d}(\operatorname{relu}(\operatorname{conv1d}(\mathcal{C}_{enc}))).$ (18)

In the first layer of the encoder, $\operatorname{D\text{-}atten}$ and $\operatorname{Norm}$ denote the Multi-head Dynamic-aware Self-attention and layer normalization. In the other layers, $\operatorname{D\text{-}atten}$ denotes Multi-head ProbSparse self-attention, and $\mathcal{C}_{enc}=\operatorname{Norm}(\mathcal{A}_{enc})$ in Eq. 18.
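A sketch of one encoder layer (Eqs. 16-18) in PyTorch is given below, where `d_atten` and `cnn` stand for the attention and ATM-CNN modules sketched earlier and are assumed to return tensors of shape (B, L, $d_{model}$). Adding the residual to the embedded series rather than the raw $\mathbf{X}_{enc}$, and the 1x1 conv1d feed-forward pair, are assumptions about details the text leaves implicit.

```python
import torch.nn as nn

class MSTFormerEncoderLayer(nn.Module):
    """Sketch of Eqs. (16)-(18): dynamic-aware attention on the embedded series, fusion with
    the CNN features of the ATMs, then the conv1d pair of the distilling block."""
    def __init__(self, d_atten, cnn, d_model=512, d_ff=2048):
        super().__init__()
        self.d_atten, self.cnn = d_atten, cnn
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.conv1 = nn.Conv1d(d_model, d_ff, kernel_size=1)
        self.conv2 = nn.Conv1d(d_ff, d_model, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x_emb, atms):
        a = self.norm1(self.d_atten(x_emb) + x_emb)               # Eq. (16)
        c = self.norm2(a + self.cnn(atms))                        # Eq. (17)
        y = self.conv2(self.act(self.conv1(c.transpose(1, 2))))   # Eq. (18)
        return y.transpose(1, 2)
```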

In MSTFormer, the decoder is implemented by improving the generative structure of Informer [40]:

$\mathcal{A}_{dec}=\operatorname{Norm}(\operatorname{D\text{-}atten}(\operatorname{Embedding}(\mathbf{X}_{dec}))+\mathbf{X}_{dec}),$ (19)
$\mathcal{C}_{dec}=\operatorname{Norm}(\mathcal{A}_{dec}+\operatorname{CNN}(\mathbf{M}_{dec})),$ (20)
$\mathcal{D}_{dec}=\operatorname{Norm}(\operatorname{atten}(\mathcal{C}_{dec},\mathbf{X}_{enc},\mathbf{X}_{enc})+\mathcal{C}_{dec}),$ (21)
$\operatorname{Dec}(\mathbf{X}_{dec},\mathbf{M}_{dec};\mathbb{C})=\operatorname{conv1d}(\operatorname{relu}(\operatorname{conv1d}(\mathcal{D}_{dec}))),$ (22)

where $\operatorname{atten}$ is the vanilla attention mechanism. This generative-style decoder of MSTFormer has only one layer.

III-C Knowledge Inspired Loss Function

Some methods use neural networks to predict the sog and cog of a vessel and then obtain its latitude and longitude by building a physical motion model [19, 12]. MSTFormer uses $\Delta sog$ and $\Delta cog$ for prediction, and the differenced data series is more stationary, which helps to improve prediction accuracy. Additionally, by directly adding the motion model to the loss function to determine the vessel's position, the model can better understand the assessment mechanism.

First, the knowledge inspired loss function performs data recovery on the predicted difference results of the MSTFormer model by

$(\hat{sog}_{i},\tilde{cog}_{i})=\begin{cases}(sog,cog)+\hat{\mathcal{T}}_{i},&i=1\\(\hat{sog}_{i-1},\hat{cog}_{i-1})+\hat{\mathcal{T}}_{i},&i>1\end{cases}$ (23)
$\hat{cog}=\begin{cases}\tilde{cog}+360,&\tilde{cog}<0\\\tilde{cog}-360,&\tilde{cog}>360\end{cases}$ (24)

where $(lon,lat,sog,cog)$ is the prediction point and $\hat{\mathcal{T}}_{i}$ is the network output sequence.

Here, we could use the prediction point's $(sog,cog)$ directly to infer the position of the first predicted point, but since the speed and course are instantaneous values, direct calculation introduces a bias. Therefore, in this paper, the output of the network is treated as a correction of the instantaneous values, and the average speed and average course between two points are calculated using Eq. 23. Meanwhile, each recovered value $\tilde{cog}_{i}$ is corrected using Eq. 24. Then, we use the corrected $\hat{sog}$ and $\hat{cog}$ sequences together with the $lat$ and $lon$ of the prediction point to calculate the predicted location

$(\hat{lon}_{i},\hat{lat}_{i})=\begin{cases}\mathcal{M}(lon,lat,\hat{sog}_{i},\hat{cog}_{i}),&i=1\\\mathcal{M}(\hat{lon}_{i-1},\hat{lat}_{i-1},\hat{sog}_{i},\hat{cog}_{i}),&i>1\end{cases}$ (25)

where $\mathcal{M}$ is the motion model used to obtain the predicted $\hat{lon}$ and $\hat{lat}$. $\mathcal{M}$ is calculated as follows

$distance=\hat{sog}\times T_{inter}\times N,$
$\delta=distance/R(lat),$
$\hat{lat}=\arcsin(\sin(lat)\cos(\delta)+\cos(lat)\sin(\delta)\cos(\hat{cog})),$
$\hat{lon}=lon+\arctan\left(\frac{\sin(\hat{cog})\sin(\delta)\cos(lat)}{\cos(\delta)-\sin(lat)\sin(\hat{lat})}\right),$ (26)

where $T_{inter}$ is the time interval between trajectory points. Since the Earth is not a perfect sphere, $R(lat)$ in Eq. 27 is used to calculate the radius of the Earth at different latitudes:

$R(lat)=\sqrt{\frac{\left(A^{2}\cos lat\right)^{2}+\left(B^{2}\sin lat\right)^{2}}{(A\cos lat)^{2}+(B\sin lat)^{2}}},$ (27)

where $A$ is the equatorial radius and $B$ is the polar radius in WGS-84, with $A=6378137.0$ and $B=6356752.3142$.
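The motion model $\mathcal{M}$ of Eqs. 25-27 amounts to one dead-reckoning step on the (latitude-dependent) sphere. In the Python sketch below, the speed and angle units, folding the factor $N$ into the step length, and the use of `atan2` for quadrant-safe longitude recovery are assumptions.

```python
import math

A, B = 6378137.0, 6356752.3142            # WGS-84 equatorial and polar radii (metres)

def earth_radius(lat_rad):
    """Eq. (27): local Earth radius as a function of latitude (radians)."""
    c, s = math.cos(lat_rad), math.sin(lat_rad)
    return math.sqrt(((A * A * c) ** 2 + (B * B * s) ** 2) / ((A * c) ** 2 + (B * s) ** 2))

def motion_step(lon, lat, sog, cog, t_inter_s=60.0):
    """Sketch of M in Eqs. (25)-(26): dead-reckon one step of t_inter seconds.
    sog is assumed in m/s and lon/lat/cog in degrees."""
    lat1, lon1, brg = map(math.radians, (lat, lon, cog))
    delta = (sog * t_inter_s) / earth_radius(lat1)               # angular distance delta
    lat2 = math.asin(math.sin(lat1) * math.cos(delta) +
                     math.cos(lat1) * math.sin(delta) * math.cos(brg))
    lon2 = lon1 + math.atan2(math.sin(brg) * math.sin(delta) * math.cos(lat1),
                             math.cos(delta) - math.sin(lat1) * math.sin(lat2))
    return math.degrees(lon2), math.degrees(lat2)
```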

Finally, we can use the ‘haversine’ formula to calculate the great-circle distance between the true trajectory and predicted points.

$\Delta lat=lat-\hat{lat},$
$\Delta lon=lon-\hat{lon},$
$a=\sin^{2}(\Delta lat/2)+\cos(\hat{lat})\cos(lat)\sin^{2}(\Delta lon/2),$
$dis=2R(\hat{lat})\arcsin(\sqrt{a}),$ (28)

where $(lon,lat)$ is the true trajectory point and $(\hat{lon},\hat{lat})$ is the predicted point. The loss can be obtained by

$loss=\frac{\sum_{n=1}^{N}dis}{N}.$ (29)
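A PyTorch sketch of the resulting loss (Eqs. 28-29) follows; a fixed mean Earth radius is used here instead of $R(lat)$ purely to keep the example short.

```python
import torch

def haversine_loss(lon_pred, lat_pred, lon_true, lat_true, R=6371.0):
    """Mean great-circle distance (km) between predicted and true points, Eqs. (28)-(29)."""
    lon_p, lat_p, lon_t, lat_t = map(torch.deg2rad, (lon_pred, lat_pred, lon_true, lat_true))
    dlat, dlon = lat_t - lat_p, lon_t - lon_p
    a = torch.sin(dlat / 2) ** 2 + torch.cos(lat_p) * torch.cos(lat_t) * torch.sin(dlon / 2) ** 2
    dis = 2 * R * torch.asin(torch.sqrt(a.clamp(0.0, 1.0)))
    return dis.mean()
```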

IV Experimental Results and Analysis

IV-A Data Processing

We downloaded the AIS data from NOAA (National Oceanic and Atmospheric Administration) (https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2021/). The NOAA AIS data covers 82 different types of vessels, including 11,401 vessels from January 1 to December 31, 2021. We used data in the longitude range of -98.5 to -80 and the latitude range of 17 to 31, which covers the Gulf of Mexico. The raw data has various problems, including non-uniformity of the measurement scale, missing data, and imbalanced classes. Therefore, we performed the necessary data preprocessing procedures, including cleaning abnormal data, data interpolation, data sampling, and data standardization.

We first purged abnormal data, such as records with abnormal speed or lost heading, and vessels that did not leave port for the whole year. For data with time intervals of more than one hour between consecutive track points, we performed segmentation. Then, we adjusted the time interval of the trajectory sequence to one minute by independently interpolating the longitude and latitude using cubic spline interpolation. Next, we used the interpolated latitude and longitude to calculate the sog and cog of the trajectory points and filled the vessel's heading with the first value. After preprocessing, 2,378 vessels of 35 types remained. We randomly selected 68 vessels from each type, or all of them if there were not enough. Finally, we computed the z-scores of $\Delta sog$ and $\Delta cog$.
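The resampling step can be sketched with SciPy's cubic splines as follows; the timestamp units, the flat-Earth bearing/distance approximations, and the speed units are assumptions made only for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def resample_segment(t, lon, lat, step_s=60.0):
    """Interpolate lon/lat to a 1-minute grid and recompute sog/cog from the result.
    `t` is assumed to be in seconds and strictly increasing."""
    grid = np.arange(t[0], t[-1], step_s)
    lon_i, lat_i = CubicSpline(t, lon)(grid), CubicSpline(t, lat)(grid)
    dlat, dlon = np.diff(lat_i), np.diff(lon_i)
    coslat = np.cos(np.radians(lat_i[:-1]))
    cog = (np.degrees(np.arctan2(dlon * coslat, dlat)) + 360) % 360      # rough bearing (deg)
    dist_km = np.hypot(dlat, dlon * coslat) * 111.32                      # rough km per degree
    sog = dist_km / (step_s / 3600.0)                                     # km/h on the new grid
    return grid[1:], lon_i[1:], lat_i[1:], sog, cog
```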

Four datasets with different prediction durations were generated, as listed in Table I. Each dataset consists of three parts: encoded data, decoded data, and prediction data, denoted by $\mathbf{X}_{enc}$, $\mathbf{X}_{dec}$, and $\mathbf{Y}$ in Fig. 1. In addition, we subsetted dataset 2 by selecting data points with above-average rates of change of speed and course to create the cornerset. This dataset was specifically used to evaluate the predictive ability of different models on cornering trajectories. We split each dataset, using 70% of the data for training, 10% for validation, and 20% for testing.

TABLE I: Datasets
Dataset | Encoded Length | Decoded Length | Prediction Horizon | Time Interval | Size
1 | 72 (3.6h) | 48 | 24 (1.2h) | 3 minutes | 43000
2 | 72 (1.2h) | 48 | 24 (0.4h) | 1 minute | 172100
3 | 48 (2.4h) | 24 | 48 (2.4h) | 3 minutes | 43000
4 | 48 (0.8h) | 24 | 48 (0.8h) | 1 minute | 172100
cornerset | 72 (1.2h) | 48 | 24 (0.4h) | 1 minute | 32600

IV-B Experimental Settings

Figure 4: Performance of MSTFormer under different proportions of ATMs.
Figure 5: Performance of MSTFormer under different numbers of selected Queries.

Our method aims to improve the long-term vessel trajectory prediction performance of backbone models by utilizing a motion-inspired model structure and loss function. Therefore, we compared the proposed MSTFormer with traditional methods and several state-of-the-art machine learning approaches to illustrate the model structure's superior prediction ability, including SVR [44], RFR [45], LSTM, Seq2Seq-LSTM [4], Seq2Seq-LSTM with attention [26], and the Transformer [33].

To ensure the reliability of the results, we used the same training parameters for all deep learning models. We set the number of training epochs to the ratio of the training dataset length to the batch size, while the batch size and dropout were set to 50 and 0.05, respectively. To improve convergence efficiency, the Adam optimizer with a learning rate of 5e-6 was utilized, with the learning rate updated by multiplying by $0.5^{epoch-1}$. In addition, we set up a supervisory mechanism whereby the model stopped training when the error on the evaluation set exceeded the optimal error plus $\alpha$; in this paper, $\alpha$ equals 0.5. All models have three layers: two encoder layers and one decoder layer.
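A minimal sketch of this training setup, assuming hypothetical `train_one_epoch(model, optimizer)` and `evaluate(model) -> float` helpers:

```python
import torch

def train_with_supervision(model, train_one_epoch, evaluate, num_epochs, alpha=0.5, lr=5e-6):
    """Adam at lr=5e-6, learning rate scaled by 0.5**(epoch-1), and the supervisory stop:
    training ends once the validation error exceeds the best error plus alpha."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda e: 0.5 ** e)
    best_val = float("inf")
    for _ in range(num_epochs):
        train_one_epoch(model, optimizer)
        val_err = evaluate(model)
        best_val = min(best_val, val_err)
        if val_err > best_val + alpha:
            break
        scheduler.step()
    return model
```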

Two crucial hyperparameters of our model are the proportion of Augmented Trajectory Matrices and the number of selected Queries in the dynamic-aware attention, since they have the largest effect on the prediction results. To select proper values, we performed two experiments. (1) Proportion of Augmented Trajectory Matrices: we chose the proportion of ATMs in the series data from [1, 1/2, 1/3, 1/4, 1/5] to assess the model's performance. Fig. 4 shows the DIS, MSE, MAE, and RMSE of MSTFormer under the different settings. We observe that when more data are augmented, the model overfits easily, leading to worse results; when the proportion of ATMs is set to 1/4, MSTFormer performs best. (2) Number of selected Queries: we fixed the proportion of ATMs to 1/4 and selected the hyperparameter $c$ of the selected Queries from [1, 2, 3, 5, 8, 11, 15], where $c$ controls the number of queries selected to participate in the computation in Eq. 6. When the hyperparameter is set to 15, all 72 queries are involved in the attention calculation. Fig. 5 shows the DIS, MSE, MAE, and RMSE of MSTFormer under the different numbers of selected Queries. Our model performs best when the hyperparameter is set to 3.

IV-C Results and Analysis

IV-C1 Prediction Accuracy

Figure 6: Performance of different models.
TABLE II: Experimental comparison with baselines and backbones
Evaluation metric | SVR | RFR | LSTM | LSTMED | LSTMED-attention | Transformer | MSTFormer
MSE (10^{-3}) | 1.8779 | 1.6682 | 1.9280 | 1.5328 | 1.2401 | 1.4138 | 0.8946
MAE (10^{-2}) | 1.9213 | 1.9034 | 2.3574 | 2.1339 | 1.6878 | 1.7374 | 1.3384
DIS (KM) | 3.1642 | 3.1323 | 3.8756 | 3.4873 | 2.7721 | 2.8634 | 2.2057
RMSE (10^{-2}) | 4.3334 | 4.0843 | 4.3909 | 3.9151 | 3.5216 | 3.7600 | 2.9911
MAPE (10^{-4}) | 4.2332 | 4.2131 | 5.3702 | 4.6916 | 3.7678 | 3.8813 | 2.9342
MSPE (10^{-6}) | 1.0996 | 0.9846 | 1.2079 | 0.8296 | 0.7385 | 0.8489 | 0.4793

We compare the performance of MSTFormer with the backbones and baseline approaches on dataset 1. Table II shows that MSTFormer performs better than traditional methods such as SVR and RFR and deep learning methods such as LSTM and LSTMED; the former consider only temporal dependencies, while the latter are ineffective in long-term cases. Meanwhile, the experiments show that the addition of the self-attention mechanism significantly increases the network's capacity to predict long-term trajectory sequences. MSTFormer outperforms its backbones, reducing the DIS by 22.9% over the Transformer in Table II, verifying that dynamic knowledge boosts the prediction.

The visual comparisons of the various models are shown in Fig. 6 to further assess their prediction capability. Fig. 6(a) demonstrates that the prediction accuracy of the different models decreases as the curvature of the historical trajectory increases; however, MSTFormer is less affected and performs better. Meanwhile, MSTFormer identifies and predicts the vessel's deceleration and stops better than the other models in Fig. 6(b). Fig. 6(c) indicates the correction capability of MSTFormer: the prediction may be far from the true trajectory in the middle of the prediction horizon but gradually approaches the true value over time. In addition, MSTFormer also achieves excellent performance on mid-corner and 90-degree-turn trajectories in Fig. 6(d). In the last image, although the model accurately predicts the turnaround of the vessel, the subsequent turn prediction is biased; however, the final position is very close to the true position, which again demonstrates the correction ability of the model. In summary, when the prediction horizon is more than one hour, vessels have more freedom and are more difficult to predict, and only accurate prediction of speed and course can achieve better performance.

IV-C2 long-term Prediction

TABLE III: Experimental comparison with different prediction horizons
Dataset | History (h) | Horizon (h) | Method | MSE (10^{-4}) | MAE (10^{-2}) | DIS (KM) | RMSE (10^{-2}) | MAPE (10^{-4}) | MSPE (10^{-7})
1 | 3.6 | 1.2 | LSTMED-attention | 12.4018 | 1.6878 | 2.7721 | 3.5216 | 3.7678 | 7.3857
1 | 3.6 | 1.2 | Transformer | 14.1382 | 1.7374 | 2.8634 | 3.7600 | 3.8813 | 8.4892
1 | 3.6 | 1.2 | MSTFormer | 8.9467 | 1.3384 | 2.2057 | 2.9911 | 2.9342 | 4.7934
2 | 1.2 | 0.4 | LSTMED-attention | 1.3745 | 0.6382 | 1.0787 | 1.1724 | 1.3983 | 0.7602
2 | 1.2 | 0.4 | Transformer | 0.9741 | 0.3609 | 0.6453 | 0.9869 | 0.7959 | 0.5356
2 | 1.2 | 0.4 | MSTFormer | 0.7411 | 0.3212 | 0.5778 | 0.8608 | 0.6956 | 0.3764
3 | 2.4 | 2.4 | LSTMED-attention | 115.9209 | 6.0370 | 9.9655 | 10.7666 | 13.7565 | 74.9649
3 | 2.4 | 2.4 | Transformer | 247.3624 | 9.4731 | 15.5236 | 15.7277 | 21.9762 | 168.4004
3 | 2.4 | 2.4 | MSTFormer | 43.4369 | 3.2935 | 5.3971 | 6.5906 | 7.2012 | 24.380
4 | 0.8 | 0.8 | LSTMED-attention | 5.5873 | 1.0424 | 1.7310 | 2.3637 | 2.3175 | 3.2291
4 | 0.8 | 0.8 | Transformer | 5.8840 | 1.1356 | 1.8817 | 2.4257 | 2.5235 | 3.1283
4 | 0.8 | 0.8 | MSTFormer | 4.1699 | 0.8508 | 1.4189 | 2.0420 | 1.8413 | 2.1340

We further compare MSTFormer's performance across different prediction horizons; the results are summarized in Table III. The findings indicate that as the prediction horizon lengthens, the predictive accuracy of MSTFormer declines. It is worth noting that the accuracy of the model improves as the number of observed trajectory points increases. In a related study by Xiao et al. [11] that utilized mined waterway patterns to assist network prediction, the prediction performance was reported to be excellent: the probability errors for over 50% of vessels in the 30-minute and 60-minute traffic predictions ranged from 0.6-2 km and 2-4.5 km, respectively. In comparison, our study's average errors for the 24-minute and 72-minute predictions were 0.778 km and 2.2057 km across various vessels, respectively.

To explain the results in detail, we selected some representative predictions and visualized them in Fig. 7. Our method demonstrates a high degree of overlap between the true and predicted trajectories when the horizon is 0.4 hours. However, as the number of historical trajectory points decreases and the prediction horizon increases, the accuracy of our predictions decreases accordingly. Despite this, our method still accurately recognizes the vessel's cornering characteristics in Fig. 7. As the prediction horizon continues to increase, the trajectories exhibit increasingly complex navigational behavior and prediction uncertainty increases. Nevertheless, MSTFormer is still able to achieve excellent results in the sixth figure. Notably, our model is able to accurately predict the vessels' turns with a final location difference of less than 0.1 KM, as demonstrated in the seventh figure. As a result, we conclude that MSTFormer outperforms other models in short- and long-term trajectory prediction but struggles in the ultra-long-term domain (horizons exceeding 1.5 h).

Figure 7: Results for different prediction horizons.

IV-C3 Cornering Trajectory Prediction


To test the superiority of the model in predicting cornering data, we generated the cornerset. First, we calculated the average speed rate of change and the average course rate of change of dataset 2 as 0.1912 and 4.2746, respectively. Then we selected the data in dataset 2 that exceeded these averages. The final cornerset's average speed change rate and average course change rate are 0.5539 and 10.9726.

Table IV shows the prediction results of MSTFormer on the cornerset, where the DIS increases from 0.5778 to 1.3781 and the MSE even increases from 0.7411 to 2.5058 compared to Table III. This indicates that, compared with the original dataset 2, the data in the cornerset contain more information and are harder to predict, which is a great challenge for all models. To observe the predictive ability of other models on the cornerset, we selected the backbones that performed better in the prediction accuracy experiments in Table II, namely LSTMED-attention and Transformer. Surprisingly, the RMSE rises from 1.5829 for MSTFormer to 4.9627 for LSTMED-attention, a full 213.5% increase; the Transformer is even 261.5% higher than MSTFormer.

TABLE IV: Experimental comparison with different models on cornerset
Metric | LSTMED-attention | Transformer | MSTFormer
MSE (10^{-4}) | 24.6290 | 32.7519 | 2.5058
MAE (10^{-2}) | 2.8034 | 3.2227 | 0.8274
DIS (KM) | 4.5522 | 5.2144 | 1.3781
RMSE (10^{-2}) | 4.9627 | 5.7229 | 1.5829
MAPE (10^{-4}) | 5.9454 | 6.8489 | 1.7493
MSPE (10^{-7}) | 12.7457 | 17.2154 | 1.2272

IV-C4 Ablation Study


In this subsection, we evaluate the effectiveness of the individual structures designed in MSTFormer on the trajectory prediction task. The prediction results of adding the different motion inspired modules to the backbone Transformer are shown in Table V. When we add the information extracted from the Augmented Trajectory Matrices using the CNN to the network, the DIS is reduced from 2.8634 to 2.6331, a reduction of 8.04%, and the MSE is reduced by 12.7%. When we change the original attention mechanism to Multi-head Dynamic-aware Self-attention, the prediction error does not decrease but instead increases considerably. However, when the knowledge inspired loss function is added to the above structure, the prediction error drops sharply, and the DIS decreases by 16.2% from 2.6331 to 2.2057. To demonstrate the effectiveness of the dynamic-aware attention mechanism, we also removed the CNN, and the model still achieved good results. This shows that the latter two structures need to be used together to obtain superior performance. In addition, using the non-corrective loss function increases the DIS from 2.2057 to 2.2650. Overall, the model achieves the best prediction accuracy when all structures are used together.

TABLE V: Ablation study of MSTFormer
Evaluation metric | Transformer | +CNN | +CNN +Dynamic-aware Atten | MSTFormer | +Dynamic-aware Atten +Knowledge Loss | +CNN +Dynamic-aware Atten +Knowledge Inspired Loss (No correction)
MSE (10^{-3}) | 1.4138 | 1.2336 | 2.4223 | 0.8946 | 0.9287 | 0.9348
MAE (10^{-2}) | 1.7374 | 1.6015 | 2.6690 | 1.3384 | 1.3839 | 1.3765
DIS (KM) | 2.8634 | 2.6331 | 4.3762 | 2.2057 | 2.2741 | 2.2650
RMSE (10^{-2}) | 3.7600 | 3.5122 | 4.9217 | 2.9911 | 3.0475 | 3.0575
MAPE (10^{-4}) | 3.8813 | 3.5384 | 5.9532 | 2.9342 | 3.0212 | 3.0061
MSPE (10^{-6}) | 0.8489 | 0.72032 | 1.3812 | 0.4793 | 0.4918 | 0.5037

V Conclusion

In this paper, we propose MSTFormer, a motion inspired Transformer incorporating vessel dynamics knowledge. MSTFormer first uses the Augmented Trajectory Matrix (ATM) to express the vessel's motion state and spatial features. Then, dynamic-aware self-attention is proposed to enable the backbone Transformer to sense motion changes. Finally, a knowledge inspired loss function based on prediction correction is used to train the network to learn geodesy while understanding the dynamics embedded in the preceding constructions. The experimental outcomes prove the superiority of MSTFormer over competing methods on normal and corner datasets. In addition, the ablation analysis validates the effectiveness of each designed module.

There are several aspects worthy of investigation in future work. First, ATMs only fuse the motion of vessels; how to incorporate external information, such as coastlines and currents, to achieve better prediction performance is a promising direction. Second, the ablation analysis shows that our proposed attention mechanism improves the performance of the model only when it works together with the loss function, and the reason behind this remains unclear. Finally, although the proposed model clearly outperforms competing methods on corner data, accurately predicting trajectories on such data remains a challenge.

We also anticipate that the idea of incorporating dynamic knowledge with modern neural network architectures sheds light on future research on accurate vessel trajectory prediction.

References

  • [1] I. Ashraf, Y. Park, S. Hur, S. W. Kim, R. Alroobaea, Y. B. Zikria, and S. Nosheen, “A survey on cyber security threats in iot-enabled maritime industry,” IEEE Transactions on Intelligent Transportation Systems, 2022.
  • [2] B. Murray and L. P. Perera, “An ais-based multiple trajectory prediction approach for collision avoidance in future vessels,” in International Conference on Offshore Mechanics and Arctic Engineering, vol. 58851, p. V07BT06A031, American Society of Mechanical Engineers, 2019.
  • [3] R. W. Liu, M. Liang, J. Nie, Y. Yuan, Z. Xiong, H. Yu, and N. Guizani, “Stmgcn: Mobile edge computing-empowered vessel trajectory prediction using spatio-temporal multi-graph convolutional network,” IEEE Transactions on Industrial Informatics, 2022.
  • [4] R. W. Liu, M. Liang, J. Nie, W. Y. B. Lim, Y. Zhang, and M. Guizani, “Deep learning-powered vessel trajectory prediction for improving smart traffic services in maritime internet of things,” IEEE Transactions on Network Science and Engineering, vol. 9, no. 5, pp. 3080–3094, 2022.
  • [5] M. Liang, Y. Zhan, and R. W. Liu, “Mvffnet: Multi-view feature fusion network for imbalanced ship classification,” Pattern Recognition Letters, vol. 151, pp. 26–32, 2021.
  • [6] Y. Su, J. Du, Y. Li, X. Li, R. Liang, Z. Hua, and J. Zhou, “Trajectory forecasting based on prior-aware directed graph convolutional neural network,” IEEE Transactions on Intelligent Transportation Systems, 2022.
  • [7] S. Brusch, S. Lehner, T. Fritz, M. Soccorsi, A. Soloviev, and B. van Schie, “Ship surveillance with terrasar-x,” IEEE transactions on geoscience and remote sensing, vol. 49, no. 3, pp. 1092–1103, 2010.
  • [8] S. Guo, C. Liu, Z. Guo, Y. Feng, F. Hong, and H. Huang, “Trajectory prediction for ocean vessels base on k-order multivariate markov chain,” in International Conference on Wireless Algorithms, Systems, and Applications, pp. 140–150, Springer, 2018.
  • [9] A. Alessandrini, F. Mazzarella, and M. Vespe, “Estimated time of arrival using historical vessel tracking data,” IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 1, pp. 7–15, 2018.
  • [10] X. Zhang, X. Fu, Z. Xiao, H. Xu, and Z. Qin, “Vessel trajectory prediction in maritime transportation: Current approaches and beyond,” IEEE Transactions on Intelligent Transportation Systems, 2022.
  • [11] Z. Xiao, L. Ponnambalam, X. Fu, and W. Zhang, “Maritime traffic probabilistic forecasting based on vessels’ waterway patterns and motion behaviors,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 11, pp. 3122–3134, 2017.
  • [12] Z. Xiao, X. Fu, L. Zhang, W. Zhang, R. W. Liu, Z. Liu, and R. S. M. Goh, “Big data driven vessel trajectory and navigating state prediction with adaptive learning, motion modeling and particle filtering techniques,” IEEE Transactions on Intelligent Transportation Systems, 2020.
  • [13] P. Last, C. Bahlke, M. Hering-Bertram, and L. Linsen, “Comprehensive analysis of automatic identification system (ais) data in regard to vessel movement prediction,” The Journal of Navigation, vol. 67, no. 5, pp. 791–809, 2014.
  • [14] S. Hexeberg, A. L. Flåten, E. F. Brekke, et al., “Ais-based vessel trajectory prediction,” in 2017 20th International Conference on Information Fusion (Fusion), pp. 1–8, IEEE, 2017.
  • [15] B. R. Dalsnes, S. Hexeberg, A. L. Flåten, B.-O. H. Eriksen, and E. F. Brekke, “The neighbor course distribution method with gaussian mixture models for ais-based vessel trajectory prediction,” in 2018 21st International Conference on Information Fusion (FUSION), pp. 580–587, IEEE, 2018.
  • [16] Y. Lian, L. Yang, L. Lu, J. Sun, and Y. Lu, “Research on ship ais trajectory estimation based on particle filter algorithm,” in 2019 11th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC), vol. 1, pp. 305–308, IEEE, 2019.
  • [17] D.-w. Gao, Y.-s. Zhu, J.-f. Zhang, Y.-k. He, K. Yan, and B.-r. Yan, “A novel mp-lstm method for ship trajectory prediction based on ais data,” Ocean Engineering, vol. 228, p. 108956, 2021.
  • [18] B. Murray and L. P. Perera, “A dual linear autoencoder approach for vessel trajectory prediction using historical ais data,” Ocean Engineering, vol. 209, p. 107478, 2020.
  • [19] L.-z. Sang, X.-p. Yan, A. Wall, J. Wang, and Z. Mao, “Cpa calculation method based on ais position prediction,” The journal of navigation, vol. 69, no. 6, pp. 1409–1426, 2016.
  • [20] P. Last, M. Hering-Bertram, and L. Linsen, “Interactive history-based vessel movement prediction,” IEEE Intelligent Systems, vol. 34, no. 6, pp. 3–13, 2019.
  • [21] S. Capobianco, L. M. Millefiori, N. Forti, P. Braca, and P. Willett, “Deep learning methods for vessel trajectory prediction based on recurrent neural networks,” IEEE Transactions on Aerospace and Electronic Systems, vol. 57, no. 6, pp. 4329–4346, 2021.
  • [22] K. Bao, J. Bi, M. Gao, Y. Sun, X. Zhang, and W. Zhang, “An improved ship trajectory prediction based on ais data using mha-bigru,” Journal of Marine Science and Engineering, vol. 10, no. 6, p. 804, 2022.
  • [23] D. Nguyen and R. Fablet, “Traisformer-a generative transformer for ais trajectory prediction,” arXiv preprint arXiv:2109.0395, 2021.
  • [24] M. Gao, G. Shi, and S. Li, “Online prediction of ship behavior with automatic identification system sensor data using bidirectional long short-term memory recurrent neural network,” Sensors, vol. 18, no. 12, p. 4211, 2018.
  • [25] C. Wang and Y. Fu, “Ship trajectory prediction based on attention in bidirectional recurrent neural networks,” in 2020 5th International Conference on Information Science, Computer Technology and Transportation (ISCTT), pp. 529–533, IEEE, 2020.
  • [26] Z. Wang, X. Su, and Z. Ding, “Long-term traffic prediction based on lstm encoder-decoder architecture,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 10, pp. 6561–6571, 2020.
  • [27] S. Zhang, L. Wang, M. Zhu, S. Chen, H. Zhang, and Z. Zeng, “A bi-directional lstm ship trajectory prediction method based on attention mechanism,” in 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), vol. 5, pp. 1987–1993, IEEE, 2021.
  • [28] J. Sekhon and C. Fleming, “A spatially and temporally attentive joint trajectory prediction framework for modeling vessel intent,” in Learning for Dynamics and Control, pp. 318–327, PMLR, 2020.
  • [29] W. Li, C. Zhang, J. Ma, and C. Jia, “Long-term vessel motion predication by modeling trajectory patterns with ais data,” in 2019 5th International Conference on Transportation Information and Safety (ICTIS), pp. 1389–1394, IEEE, 2019.
  • [30] Y. Suo, W. Chen, C. Claramunt, and S. Yang, “A ship trajectory prediction framework based on a recurrent neural network,” Sensors, vol. 20, no. 18, p. 5133, 2020.
  • [31] B. Murray and L. P. Perera, “An ais-based deep learning framework for regional ship behavior prediction,” Reliability Engineering & System Safety, vol. 215, p. 107819, 2021.
  • [32] D. Alizadeh, A. A. Alesheikh, and M. Sharif, “Vessel trajectory prediction using historical automatic identification system data,” The Journal of Navigation, vol. 74, no. 1, pp. 156–174, 2021.
  • [33] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [34] Q. Wen, T. Zhou, C. Zhang, W. Chen, Z. Ma, J. Yan, and L. Sun, “Transformers in time series: A survey,” arXiv preprint arXiv:2202.07125, 2022.
  • [35] B. Lim, S. Ö. Arık, N. Loeff, and T. Pfister, “Temporal fusion transformers for interpretable multi-horizon time series forecasting,” International Journal of Forecasting, vol. 37, no. 4, pp. 1748–1764, 2021.
  • [36] T. Zhou, Z. Ma, Q. Wen, X. Wang, L. Sun, and R. Jin, “Fedformer: Frequency enhanced decomposed transformer for long-term series forecasting,” arXiv preprint arXiv:2201.12740, 2022.
  • [37] S. Wu, X. Xiao, Q. Ding, P. Zhao, Y. Wei, and J. Huang, “Adversarial sparse transformer for time series forecasting,” Advances in neural information processing systems, vol. 33, pp. 17105–17115, 2020.
  • [38] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, and S. Dustdar, “Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting,” in International Conference on Learning Representations, 2021.
  • [39] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,” Advances in neural information processing systems, vol. 32, 2019.
  • [40] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, and W. Zhang, “Informer: Beyond efficient transformer for long sequence time-series forecasting,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, pp. 11106–11115, 2021.
  • [41] L. Cai, K. Janowicz, G. Mai, B. Yan, and R. Zhu, “Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting,” Transactions in GIS, vol. 24, no. 3, pp. 736–755, 2020.
  • [42] M. Xu, W. Dai, C. Liu, X. Gao, W. Lin, G.-J. Qi, and H. Xiong, “Spatial-temporal transformer networks for traffic flow forecasting,” arXiv preprint arXiv:2001.02908, 2020.
  • [43] C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi, “Spatio-temporal graph transformer networks for pedestrian trajectory prediction,” in European Conference on Computer Vision, pp. 507–523, Springer, 2020.
  • [44] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, and V. Vapnik, “Support vector regression machines,” Advances in neural information processing systems, vol. 9, 1996.
  • [45] L. Breiman, “Random forests,” Machine learning, vol. 45, pp. 5–32, 2001.