
Spatio-Temporal Graph Dual-Attention Network for Multi-Agent Prediction and Tracking

Jiachen Li, Hengbo Ma, Zhihao Zhang, Jinning Li, and Masayoshi Tomizuka,  J. Li, H. Ma, Z. Zhang, J. Li and M. Tomizuka are with the Department of Mechanical Engineering, University of California, Berkeley, CA 94720, USA (Email: {jiachen_li, hengbo_ma, zhihaozhang, jinning_li, tomizuka}@berkeley.edu)
Abstract

An effective understanding of the environment and accurate trajectory prediction of surrounding dynamic obstacles are indispensable for intelligent mobile systems (e.g. autonomous vehicles and social robots) to achieve safe and high-quality planning when navigating in highly interactive and crowded scenarios. Due to the existence of frequent interactions and uncertainty in scene evolution, the prediction system is expected to perform relational reasoning over different entities and to provide a distribution of future trajectories for each agent. In this paper, we propose a generic generative neural system (called STG-DAT) for multi-agent trajectory prediction involving heterogeneous agents. The system takes a step forward to explicit interaction modeling by incorporating relational inductive biases with a dynamic graph representation and leverages both trajectory and scene context information. We also employ an efficient kinematic constraint layer for vehicle trajectory prediction, which not only ensures physical feasibility but also enhances model performance. Moreover, the proposed prediction model can be easily adopted by multi-target tracking frameworks; empirical results show that it improves tracking accuracy. The proposed system is evaluated on three public benchmark datasets for trajectory prediction, where the agents cover pedestrians, cyclists and on-road vehicles. The experimental results demonstrate that our model achieves better performance than various baseline approaches in terms of prediction and tracking accuracy.

Index Terms:
Trajectory prediction, multi-target tracking, spatio-temporal graph, graph neural network, interaction modeling, relational reasoning

I Introduction

In order to navigate safely in dense traffic or crowded areas full of vehicles and pedestrians, it is crucial for autonomous vehicles or mobile robots to forecast and track the future behaviors of surrounding interactive agents accurately and efficiently [1]. For short-term prediction, purely physics-based methods may be acceptable. However, due to the uncertain nature of future situations, a long-term prediction system should not only allow for interaction modeling between different agents, but also figure out the traversable regions delimited by road layouts as well as the right of way dictated by traffic rules. Fig. 1 illustrates several traffic scenarios where interactions happen frequently and the drivable areas are heavily defined by road geometries. For instance, at the entrance of roundabouts or unsignalized intersections, the future behavior of an entering vehicle highly depends on whether the conflicting vehicles will yield and leave enough space for it to merge. In addition, for vehicle trajectory prediction, kinematic constraints should be satisfied to make the trajectories feasible and smooth.

Refer to caption
Figure 1: Typical traffic scenarios with large uncertainty and interactions among multiple entities. The first column is adapted from [2]. The upper figure was captured in a highway ramp merging scenario, where lane change behavior with negotiation happens frequently. The lower figure was captured in a roundabout and an unsignalized intersection scenario, where yielding and stopping behaviors happen frequently. The other two columns show the occupancy density maps and the velocity fields of the scenarios, which are generated based on the training data to provide statistical context information.

There have been extensive studies on the prediction of a single target entity that consider the influences of its surrounding entities [3, 4, 5, 6]. However, such approaches only capture one-way interactions, ignoring influences in the opposite direction. Recent works have tried to address this issue by simultaneously forecasting for multiple agents [7, 8, 9]. However, most of these methods employ concatenation or social pooling operations to blend the features of different agents without explicit relational reasoning. Moreover, they are not able to model higher-order interactions (indirect influences) beyond adjacent entities. In this work, we take a step forward to model the interactions explicitly with a spatio-temporal graph representation and attention-based message passing rules. Our model enables permutation invariance and mutual effects between pairs of entities.

In the deep learning literature, the soft attention mechanism is widely adopted to model spatial relations due to its flexibility and interpretability. Temporal relations are usually modeled by recurrent neural networks, which naturally attenuate the effect of distant history information. One of the most closely related works is Social Attention [10]. However, in some cases earlier information may also be important. Therefore, in order to figure out which of the other agents have the most significant influence on a certain agent, as well as the relative importance of different time steps, we propose a spatio-temporal graph attention mechanism that is applied to both topological and temporal features.

Besides the trajectory prediction task, the proposed approach can also be incorporated into multi-target tracking frameworks, owing to its permutation invariance and flexibility with respect to the number of agents. More specifically, it can serve as the process (prediction) model in the prior update of recursive Bayesian state estimation. The model has advantages in handling occlusions and missing observations, due to its capability of long-term prediction. In this paper, we adopt the multi-target tracking framework proposed in [11] and compare the tracking performance of our model against other widely used models.

This paper is a significant extension of our prior work [12], where we presented a modified Wasserstein generative modeling method, which is adopted as the basis of training the proposed model. The method in [12] was only able to predict the interactive behaviors of two vehicles in a single scenario, while the proposed approach handles multiple heterogeneous agents in different scenarios simultaneously. The main contributions of this paper are summarized as follows:

  • We propose a multi-agent, generative trajectory forecasting system with relational reasoning on heterogeneous, interactive agents. The system is applied to predict pedestrian and vehicle trajectories across different scenarios.

  • We propose a spatio-temporal dual-attention mechanism for representation learning on spatio-temporal dynamic graphs, which can figure out relative significance of the information about different surrounding agents at each time step.

  • We incorporate an efficient kinematic constraint module similar to [12] to ensure physical feasibility for vehicle trajectory prediction. This constraint layer can not only smooth the trajectories and reduce prediction error, but also enhance the model robustness to noisy data.

  • We validate the proposed system on multiple trajectory forecasting benchmark datasets. The approach achieves state-of-the-art prediction accuracy, and the model is also shown to enhance multi-target tracking performance.

The remainder of the paper is organized as follows. Section II provides a brief overview on state-of-the-art related research. Section III introduces preliminary background of the proposed approach. Section IV presents a generic problem formulation for the trajectory prediction task. Section V illustrates the proposed forecasting system. In Section VI, the proposed system is applied to interactive trajectory prediction of pedestrians and vehicles using real-world benchmark datasets. The performance is compared with various baseline methods in terms of widely-used evaluation metrics. A comprehensive ablative analysis is also provided to illustrate the necessity of each component. The prediction model is also applied to a vehicle tracking framework. Finally, Section VII concludes the paper. The details of data preprocessing are introduced in the appendix.

II Related Work

In this section, we provide a concise literature review on related research and illustrate the distinctions and advantages of the proposed generative trajectory prediction approach, which can also be leveraged by multi-object tracking frameworks to enhance tracking performance.

II-A Interaction-Aware Trajectory Prediction

Extensive research has been conducted on trajectory prediction of humans and autonomous agents (e.g. on-road vehicles, mobile robots, etc). Early literature mainly introduced physics-based or rule-based approaches, such as state estimation techniques based on kinematic models. These methods do not consider mutual influence between intelligent agents, and they only perform well in short-term prediction tasks with limited model flexibility [13, 14]. As machine learning techniques were studied more extensively, people began to employ learning-based models for prediction purposes, such as hidden Markov models [15], Gaussian mixture models [16], dynamic Bayesian networks [17], and inverse reinforcement learning [18]. In recent years, researchers have proposed various learning-based prediction models, which enable more flexibility and capacity to capture underlying interactive behavior patterns [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 4, 30, 31, 32, 33, 34, 35, 36, 37]. The prediction hypotheses can be directly obtained from the model outputs. However, in such an end-to-end fashion, physical feasibility constraints are usually ignored, which may result in implausible forecasts. In this paper, we address multi-agent interaction modeling and introduce a probabilistic prediction system based on a deep generative framework. Our approach also explicitly considers the feasibility constraints of vehicles.

II-B Multi-Target Tracking and State Estimation

Although much research has been conducted on end-to-end tracking through real-time detection in the computer vision community, here we only focus on approaches based on recursive Bayesian state estimation, which consists of a prior (prediction) update and a posterior (measurement) update. One of the challenges in the tracking problem is noisy or missing observations (occlusions). Kinematics-based models tend to be sensitive to measurement noise and lose track when the targets are occluded for a period of time. Moreover, they are not able to consider interactive behaviors between tracked targets, since the target states are propagated independently. Approaches to handle these issues were proposed in [11], where a learning-based behavior model is employed to improve the tracking accuracy. However, that behavior model requires specific assumptions on agent numbers, agent roles and scenarios, which significantly restricts its applicability. The prediction model proposed in this paper can serve as a strong alternative to the original behavior model, bringing more versatility and larger model capacity.

II-C Relational Reasoning and Graph Networks

In general, the objective of relational reasoning is to reason about different entities and their relations from observed structured or unstructured data, such as image pixels [38], words or sentences [39], human skeletons [40] and interactive navigating agents [41, 42, 43]. Popular techniques for relational reasoning and interaction modeling in earlier literature include, but are not limited to, the social pooling mechanism [5], the convolutional pooling mechanism [7], and the soft attention mechanism [10]. Recently, graph networks have proved to be effective for relational reasoning on graph-structured data, where there is no restriction on the message passing rules. In traffic scenarios, a typical representation of the whole scene is a graph where nodes are agents and edges are their relationships. Most existing works focus on approximation functions parameterized by deep neural networks due to their high flexibility, which leads to graph neural networks (GNN). In this paper, we present a graph neural network with both topological and temporal attention mechanisms to capture underlying interaction patterns and jointly predict future behaviors.

II-D Deep Generative Models

Our approach is also related to deep generative models, which have been widely applied to representation learning and distribution approximation tasks [44, 45]. One of the advantages of generative modeling lies in learning the data distribution without supervision. Coupled with highly flexible deep networks, deep generative models have achieved satisfying performance in image generation, style transfer, sequence synthesis tasks, etc. Although the variational auto-encoder, a highly flexible latent variable model with an encoder-decoder architecture, tries to make the posterior of the latent variable as similar as possible to its prior (usually a normal distribution), the two distributions do not match well in many tasks, which breaks the consistency of the model. Also, although generative adversarial networks have achieved satisfying performance on image generation tasks, they usually suffer from mode collapse, especially when applied to sequential data under the conditional setting. In order to mitigate these drawbacks, the Wasserstein auto-encoder (WAE) [46], proposed from the optimal transport point of view and combined with information theory, encourages consistency between the encoded latent distribution and the prior distribution. A variant of the variational auto-encoder was proposed in [47], and a modified approach was proposed in our previous work [12]. In this paper, we adopt a similar Wasserstein generative method to [12] as the basis of model training, and significantly extend pair-wise prediction to multi-agent prediction.

III Preliminaries

In this section, we first provide a high-level summary of the basics of graph neural networks. Then, we concisely introduce the Wasserstein generative modeling method proposed in [12].

III-A Graph Neural Network

The graph neural network is a class of deep learning models that operate directly on graph structures, which naturally incorporates relational inductive bias into the model design. In the context of graph neural networks, most graphs are attributed (with node attributes and/or edge attributes and/or global attributes). Generally, there are three basic operations in graph representation learning with GNN: edge update, node update and global update [48]. Note that the global update is optional and is applied only if the graph has global attributes. More formally, denote a graph with $n$ nodes as $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, where $\mathcal{V}=\{v_{i},i\in\{1,...,n\}\}$ is the set of node attributes and $\mathcal{E}=\{e_{ij},i,j\in\{1,...,n\}\}$ is the set of edge attributes. Denote $u$ as the global attribute. Then, the three update operations can be written as

$$\begin{aligned}
e^{\prime}_{ij} &= \phi^{e}(e_{ij}, v_{i}, v_{j}, u), &\qquad \bar{e}^{\prime}_{i} &= \rho^{e\rightarrow v}(E^{\prime}_{i}),\\
v^{\prime}_{i} &= \phi^{v}(\bar{e}^{\prime}_{i}, v_{i}, u), &\qquad \bar{e}^{\prime} &= \rho^{e\rightarrow u}(E^{\prime}),\\
u^{\prime} &= \phi^{u}(\bar{e}^{\prime}, \bar{v}^{\prime}, u), &\qquad \bar{v}^{\prime} &= \rho^{v\rightarrow u}(V^{\prime}),
\end{aligned} \tag{1}$$

where $E^{\prime}_{i}=\{e^{\prime}_{ij},\ j\in N(i)\}$, $E^{\prime}=\bigcup_{i} E^{\prime}_{i}$, $V^{\prime}=\{v^{\prime}_{i},\ i=1,...,n\}$, and $N(i)$ denotes the neighbors of node $i$. $\phi^{e}(\cdot)$, $\phi^{v}(\cdot)$ and $\phi^{u}(\cdot)$ are neural networks, while $\rho^{e\rightarrow v}(\cdot)$, $\rho^{e\rightarrow u}(\cdot)$ and $\rho^{v\rightarrow u}(\cdot)$ are aggregation functions with the property of permutation invariance.

In this work, it is natural to represent intelligent agents as nodes and their relations as edges. We only apply the edge update and node update, since there is no global attribute in our setup.
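As a concrete illustration, a minimal PyTorch sketch of one round of edge and node updates (the first two operations in Eq. (1), i.e. our setup without global attributes) is given below; the module structure, dimensions and ReLU activations are illustrative assumptions rather than the exact implementation.

import torch
import torch.nn as nn

class EdgeNodeUpdate(nn.Module):
    """One round of message passing: edge update phi^e, aggregation rho^{e->v}, node update phi^v."""
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # phi^e: updates each edge from its attribute and both endpoint node attributes
        self.phi_e = nn.Sequential(nn.Linear(edge_dim + 2 * node_dim, edge_dim), nn.ReLU())
        # phi^v: updates each node from its attribute and the aggregated incident edges
        self.phi_v = nn.Sequential(nn.Linear(edge_dim + node_dim, node_dim), nn.ReLU())

    def forward(self, v, e, senders, receivers):
        # v: [n, node_dim] node attributes; e: [m, edge_dim] edge attributes;
        # senders/receivers: [m] integer node indices of each directed edge
        e_new = self.phi_e(torch.cat([e, v[senders], v[receivers]], dim=-1))
        # rho^{e->v}: permutation-invariant sum of incoming edge attributes per node
        agg = v.new_zeros(v.size(0), e_new.size(-1))
        agg.index_add_(0, receivers, e_new)
        v_new = self.phi_v(torch.cat([agg, v], dim=-1))
        return v_new, e_new

Here the sum aggregation is one valid permutation-invariant choice of $\rho^{e\rightarrow v}(\cdot)$; mean or max aggregation would serve equally well.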

III-B Wasserstein Generative Modeling

The Wasserstein distance is defined in a metric space $(\chi,\rho)$:

$$W_{\rho}(Q,P)=\sup_{\lVert f\rVert_{\text{Lip}}\leq 1}\int f\,(dQ-dP), \tag{2}$$

where $\lVert f\rVert_{\text{Lip}}$ is the Lipschitz constant of the function $f$. By the Kantorovich-Rubinstein duality, we can formulate Eq. (2) as an optimal transport problem, which is given by

$$W_{\rho}(Q,P)=\inf_{\mathbf{M}}\int\rho(x,x^{\prime})\,d\mathbf{M}=\inf_{\mathbf{M}}\mathbf{E}_{\mathbf{M}}[\rho], \tag{3}$$

where $\mathbf{M}(x,x^{\prime})$ is the coupling distribution of $x\sim P$ and $x^{\prime}\sim Q$, which is a probability measure on $\chi\times\chi$.

Define

$$p_{G}(x):=\int_{z}p_{G}(x|z)\,p_{z}(z)\,\mathrm{d}z,\quad\forall x\in\chi. \tag{4}$$

Following the WAE [46] with the assumption that $p_{G}(x|z)$ is deterministic, we have

$$\inf_{M\in\mathcal{P}(P_{X},P_{G})}\mathbf{E}_{M}[\rho(X,Y)]=\inf_{Q_{Z}=P_{Z}}\mathbf{E}_{P_{X}}\mathbf{E}_{Q(Z|X)}[\rho(X,G(Z))], \tag{5}$$

where $X\sim P_{X}$, $Z\sim Q(Z|X)$, and $Q_{Z}$ is the marginal distribution of $Z$. We relax the optimization problem as:

$$\begin{aligned}
\min_{Q(Z|X)\in\mathcal{Q}}\ &\mathbf{E}_{P_{X}}\mathbf{E}_{Q(Z|X)}[\rho(X,G(Z))]\\
\text{s.t.}\ &\mathbf{E}_{p(x)}[\mathcal{D}_{\text{KL}}(Q(Z|X)\,\|\,P(Z))]\leq\epsilon_{1},\\
&\mathcal{D}(Q_{Z},P_{Z})\leq\epsilon_{2},
\end{aligned} \tag{6}$$

where $\epsilon_{1}$ and $\epsilon_{2}$ are pre-defined constants. Then, the dual optimization problem is formulated as

$$\begin{aligned}
\max_{\alpha,\beta}\ \min_{Q(Z|X)\in\mathcal{Q}}\ &\mathbf{E}_{P_{X}}\mathbf{E}_{Q(Z|X)}[\rho(X,G(Z))]\\
&+\alpha\,\mathbf{E}_{p(x)}[\mathcal{D}_{\text{KL}}(Q(Z|X)\,\|\,P(Z))]+\beta\,\mathcal{D}(Q_{Z},P_{Z}),
\end{aligned} \tag{7}$$

where $\alpha$ and $\beta$ are chosen from a proper range.

After some algebraic derivation, we obtain the following equivalent optimization problem

$$\begin{aligned}
&\max\ I_{\phi}(x,z)\\
&\ \text{s.t.}\ \ \mathcal{D}_{\text{KL}}[p_{\phi}(z)\,\|\,p(z)]\leq\epsilon_{1},\ \ \mathcal{D}_{\text{KL}}[p_{\phi}(x,z)\,\|\,p_{\theta}(x,z)]\leq\epsilon_{2}\\
\Leftrightarrow\ &\min_{0<1-\alpha<\beta}\ -(1-\alpha)\,I_{\phi}(x,z)+(\beta+\alpha-1)\,\mathcal{D}_{\text{KL}}[p_{\phi}(z)\,\|\,p(z)]+\mathcal{D}_{\text{KL}}[p_{\phi}(x,z)\,\|\,p_{\theta}(x,z)],
\end{aligned} \tag{8}$$

where $I_{\phi}(x,z)$ is the mutual information between $x$ and $z$.

Please refer to our prior work [12] for more details of the derivation. The goal of both WAE and the vanilla variational auto-encoder (VAE) is to learn the data distribution. According to the numerical experiments in [12], the vanilla VAE tends to learn a distribution with smaller variance and suffers from the mode collapse issue, while WAE is better at capturing the true data distribution. WAE also learns a better latent representation than VAE owing to the regularization terms.

IV Problem Formulation

The objective is to predict future trajectories for multiple interactive agents based on their historical states and context information. The prediction system can also be incorporated into any multi-target tracking framework. Without loss of generality, we assume $N$ agents are navigating in the observation area, and they are divided into $M$ categories. In this work, the involved agents include vehicles, pedestrians and cyclists. We denote a set of agent trajectories as

$$\mathbf{T}^{1:T}=\{\bm{\tau}_{i}^{1:T}\ |\ \bm{\tau}_{i}^{k}=(x_{i}^{k},y_{i}^{k},v_{i}^{k},\psi_{i}^{k}),\ T=T_{h}+T_{f},\ i=1,...,N\}, \tag{9}$$

where $T_{h}$ is the history horizon and $T_{f}$ is the forecasting horizon. $(x_{i}^{k},y_{i}^{k})$ is the position, $v_{i}^{k}$ is the velocity, and $\psi_{i}^{k}$ is the heading angle of agent $i$ at time $k$. The coordinates can be either in the world space or in the image pixel space. We also denote a sequence of context information (raw images, semantic maps or tensors that include other relevant information) as

$$\begin{aligned}
\mathbf{C}_{g}^{1:T}&=\{\mathbf{c}^{1:T},\ T=T_{h}+T_{f}\}\ \text{(global)},\\
\mathbf{C}_{l}^{1:T}&=\{\mathbf{c}_{i}^{1:T},\ T=T_{h}+T_{f},\ i=1,...,N\}\ \text{(local)},
\end{aligned} \tag{10}$$

which indicates components of the high-definition maps (e.g. road geometries, road lanes, drivable areas, traffic signs, etc). The future information is accessible during training. We aim to approximate the conditional distribution $p(\mathbf{T}^{T_{h}+1:T_{h}+T_{f}}|\mathbf{T}^{1:T_{h}},\mathbf{C}^{1:T_{h}})$. The number of involved agents can be flexible in different cases. In multi-target tracking tasks, the prediction model is applied iteratively.
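To make the interfaces of Eqs. (9)-(10) concrete, the sketch below lays out the corresponding array shapes; the agent count, horizons and local map resolution are illustrative assumptions, not values prescribed by the formulation.

import numpy as np

N, T_h, T_f = 12, 8, 12              # agents, history and forecasting horizons (illustrative)
T = T_h + T_f

# Trajectories of Eq. (9): (x, y, v, psi) for every agent at every time step
trajectories = np.zeros((N, T, 4), dtype=np.float32)

# Local context of Eq. (10): per-agent occupancy density map (H x W x 1)
# stacked with a velocity field (H x W x 2); 64 x 64 is an assumed resolution
H, W = 64, 64
context_local = np.zeros((N, T, H, W, 3), dtype=np.float32)

# The model approximates p(T^{T_h+1 : T_h+T_f} | T^{1:T_h}, C^{1:T_h})
history, future = trajectories[:, :T_h], trajectories[:, T_h:]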

V Method: STG-DAT

In this section, we first provide an overview of the key modules and the architecture of the proposed generative trajectory prediction system. The detailed model design of each module will then be further illustrated.

V-A System Overview

The detailed architecture of STG-DAT is shown in Fig. 2, where a standard encoder-decoder architecture is employed. There are three key components: a deep feature extractor, an encoder with spatio-temporal graph generation and a dual-attention network, and a decoder with a kinematic constraint layer. First, the feature extractor takes in both history and future information and outputs state, relation and context feature embeddings. The information contains the trajectories of the involved interactive agents, and a sequence of context density maps and mean velocity fields. The scene images or semantic maps can also be included if they are available in the dataset. Since neural networks are capable of extracting highly flexible features, we choose a multi-layer perceptron (MLP) to generate the state and relation embeddings and a convolutional neural network (CNN) to generate the context embedding. The extracted features are utilized to generate spatio-temporal graphs for the history and the future, respectively. The node attributes are updated by a spatio-temporal graph attention mechanism. Then, the updated node attributes are transformed from the feature space into a latent space by an encoding function, following the derivation of conditional generative modeling. Finally, the decoder, based on a recurrent neural network, generates feasible and human-like future trajectories for all the involved agents. The number of agents can vary across cases thanks to the weight sharing and permutation invariance of the graph representation. All the components are implemented with deep neural networks, so they can be trained end-to-end efficiently and consistently.

Refer to caption
Figure 2: The detailed architecture of STG-DAT, which consists of three key components: (a) A deep feature extractor which extracts state, relation and context features from the trajectories of agents, the sequences of occupancy density maps and velocity fields. The red dashed lines indicate sharing parameters. (b) An encoder which includes a graph dual-attention network that processes spatio-temporal graphs and generates abstract node attributes containing interaction information, and an encoding function which maps the node attributes to a latent space. During the testing phase, the encoding function is not used. (c) A decoder which samples future trajectory hypotheses satisfying physical constraints for each agent. The bottom portion of the figure presents some details of (a)-(c). $||$ denotes the concatenation operation. MLP refers to multi-layer perceptron. CNN refers to convolutional neural network.

V-B Feature Extraction

The feature extractor consists of three parts: State MLP, Relation MLP, and Context CNN. The operations below are applied at each time step, and a sequence of state, relation, and context feature embeddings can be obtained.

  • State MLP: It embeds the position, velocity and heading information into a state feature vector for each agent. Different types of agents use distinct state embedding functions, and agents of the same type share the same one. In this paper, we consider three types: vehicles, cyclists and pedestrians. The state embedding (SE) of agent $i$ at time $k$ is obtained by

    $$SE_{i}^{k}=\text{MLP}_{s}(\bm{\tau}_{i}^{k}). \tag{11}$$
  • Relation MLP: It embeds the relative information between each pair of agents into a relation feature vector. We differentiate the edges with opposite directions between the same pair of nodes. The relative information can be either the distance and relative angle (in a 2D polar coordinate) or the differences between the positions of the two agents along two perpendicular axes (in a 2D Cartesian coordinate). We use the latter in this work, since it is simpler to compute and its performance is comparable to the former. More specifically, consider a pair of agents $i$ and $j$. When calculating the relation embedding associated with edge $e_{ij}$ in a Cartesian coordinate, we set agent $i$ as the origin and its heading as the positive direction. The relative position, velocity and heading angle of agent $j$ with respect to agent $i$ can then be calculated and denoted as $\bm{\phi}_{ij}^{k}$. The relation embedding (RE) is obtained by

    $$RE_{ij}^{k}=\text{MLP}_{r}(\bm{\phi}_{ij}^{k}). \tag{12}$$
  • Context CNN: It extracts spatial features for each agent from a local occupancy density map ($H\times W\times 1$) as well as heuristic features from a local velocity field ($H\times W\times 2$) centered on the corresponding agent; a code sketch of the three extractors follows this list. The reason for using occupancy density maps instead of real scene images is to remove redundant information and to efficiently represent data-driven drivable regions. This information provides prior knowledge of common driving behaviors in specific areas of the scene. The context embedding (CE) of agent $i$ at time $k$ is obtained by

    $$CE_{i}^{k}=\text{CNN}(\bm{c}_{i}^{k}). \tag{13}$$
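A simplified sketch of the three extractors is given below. The MLP depths and widths follow the implementation details in Section VI-D, while the small CNN is an illustrative stand-in for the ResNet18 backbone used in our implementation; input and output dimensions are assumptions.

import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Per-time-step embeddings: state (Eq. 11), relation (Eq. 12), context (Eq. 13)."""
    def __init__(self, state_dim=64, rel_dim=16):
        super().__init__()
        # State MLP on (x, y, v, psi); one such MLP per agent type
        self.mlp_s = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU(),
                                   nn.Linear(128, state_dim))
        # Relation MLP on the relative state phi_ij expressed in agent i's frame
        self.mlp_r = nn.Sequential(nn.Linear(4, 128), nn.ReLU(),
                                   nn.Linear(128, 128), nn.ReLU(),
                                   nn.Linear(128, rel_dim))
        # Context CNN on the stacked density map (1 channel) and velocity field (2 channels)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, state_dim))

    def forward(self, tau, phi, c):
        # tau: [N, 4] agent states; phi: [M, 4] pairwise relative states; c: [N, 3, H, W]
        return self.mlp_s(tau), self.mlp_r(phi), self.cnn(c)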

V-C Encoder with Graph Dual-Attention Network

After obtaining the extracted features, a history spatio-temporal graph (HG) and a future spatio-temporal graph (FG) are generated to represent the information related to the involved agents. Here, the state features and context features are concatenated to serve as the (agent) node attributes, whereas the relation features serve as edge attributes. The HG and FG contain different time steps, and they are processed in a similar fashion by the graph dual-attention network. In a specific case, the number of nodes (agents) in both HG and FG is assumed to be fixed, which implies that the same agents appear over the whole horizon. An edge is eliminated at a certain time step if the Euclidean distance between the two agents is larger than a threshold $d$. Therefore, the graph connectivity and topology may vary across time steps.

The proposed graph dual-attention network consists of two consecutive layers: a topological attention layer, which updates node attributes from the spatial or topological perspective, and a temporal attention layer, which outputs a high-level feature embedding for each node. The temporal attention layer summarizes both the topological and temporal information and figures out the relative significance of the information at each time step. Following the notation in Section III and assuming there are $n$ nodes (agents) in total, we denote a graph as $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, where $\mathcal{V}=\{v_{i}\in\mathbb{R}^{D_{n}},i\in\{1,...,n\}\}$ and $\mathcal{E}=\{e_{ij}\in\mathbb{R}^{D_{e}},i,j\in\{1,...,n\}\}$. $D_{n}$ and $D_{e}$ are the dimensions of node attributes and edge attributes, respectively.

V-C1 Topological Attention Layer

The inputs of this layer are the original spatio-temporal graphs. The output is a new set of node attributes $\bar{\mathcal{V}}=\{\bar{v}^{k}_{i}\in\mathbb{R}^{\bar{D}_{n}},\ i\in\{1,...,n\},\ k\in\{1,...,T\},\ T=T_{h}+T_{f}\}$, which can capture local structural properties. The topological attention coefficients $\alpha^{k}_{ij}$ (showing the significance of node $j$ w.r.t. node $i$) are calculated by

$$\alpha^{k}_{ij}=\frac{\exp\!\big(-A_{ij}(\lambda\lVert v^{k}_{i}-v^{k}_{j}\rVert^{2}+\mu\lVert e^{k}_{ij}\rVert^{2})\big)}{\sum_{p\in N(i)}\exp\!\big(-A_{ip}(\lambda\lVert v^{k}_{i}-v^{k}_{p}\rVert^{2}+\mu\lVert e^{k}_{ip}\rVert^{2})\big)}, \tag{14}$$

where $N(i)$ denotes the first-order neighbors of node $i$ (including $i$ itself). $A_{ij}$ is a prior attention coefficient which provides inductive bias from prior knowledge; $\lambda$ and $\mu$ are weight parameters that adjust the relative importance of node attributes and edge attributes in computing the attention coefficients. The underlying intuition is that agents with node attributes similar to those of the objective agent, or with small spatial distance to it, tend to be more correlated with it and thus should receive more attention. In this work, we set $A_{ij}=1$, implying no prior attention bias; more exploration on incorporating prior knowledge is left for future work. Then the node attributes are updated by

$$\bar{v}^{k}_{i}=\sum_{j\in N(i)}f_{\text{act}}(\alpha^{k}_{ij}W_{n}v^{k}_{j}), \tag{15}$$

where $f_{\text{act}}(\cdot)$ is an activation function and $W_{n}$ is a learnable parameter matrix. The above procedure is applied at each time step, and the weight matrices are shared across time steps. We also employ the multi-head attention mechanism [49] to boost model performance by adjusting $\lambda$ and $\mu$, where the node attributes obtained with different attention coefficients are concatenated into a whole vector. The above message passing procedure can be applied multiple times to capture higher-order interactions, with an additional edge update procedure following the form of Eq. (1). More specifically, the updated edge attributes can be computed by

$$\bar{e}^{k}_{ij}=\text{MLP}\left(\left[\bar{v}^{k}_{i},\bar{v}^{k}_{j},e^{k}_{ij}\right]\right). \tag{16}$$
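The following sketch implements one pass of Eqs. (14)-(15) at a single time step under the setting $A_{ij}=1$ used in this work; the dense adjacency representation and tensor layouts are assumptions for illustration.

import torch
import torch.nn.functional as F

def topological_attention(v, e, adj, W_n, lam=1.0, mu=1.0):
    """One topological attention update (Eqs. 14-15) at one time step.
    v: [n, D_n] node attributes; e: [n, n, D_e] edge attributes;
    adj: [n, n] 0/1 connectivity with ones on the diagonal (N(i) includes i);
    W_n: [D_n, D_bar] learnable projection."""
    dist2 = torch.cdist(v, v).pow(2)            # ||v_i - v_j||^2 for all pairs
    edge2 = e.pow(2).sum(dim=-1)                # ||e_ij||^2
    logits = -(lam * dist2 + mu * edge2)        # A_ij = 1: no prior attention bias
    logits = logits.masked_fill(adj == 0, float('-inf'))
    alpha = F.softmax(logits, dim=1)            # Eq. (14): normalized over N(i)
    # Eq. (15): v_bar_i = sum_{j in N(i)} f_act(alpha_ij * W_n v_j), LeakyReLU as f_act;
    # alpha is zero outside N(i), so summing over all j is equivalent
    msgs = F.leaky_relu(alpha.unsqueeze(-1) * (v @ W_n).unsqueeze(0))
    return msgs.sum(dim=1)                      # [n, D_bar] updated node attributes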

V-C2 Temporal Attention Layer

The input of this layer is the output of the topological attention layer, i.e., a set of node attributes $\bar{\mathcal{V}}=\{\bar{v}^{k}_{i}\in\mathbb{R}^{\bar{D}_{n}},\ i\in\{1,...,n\},\ k\in\{1,...,T\}\}$. The output is a set of highly abstract node attributes $\widetilde{\mathcal{V}}=\{\widetilde{v}_{i}\in\mathbb{R}^{\widetilde{D}_{n}},\ i\in\{1,...,n\}\}$, which are further processed by the downstream modules. The temporal attention coefficients $\beta^{k}_{i}$ are computed as

$$\begin{aligned}
\beta^{hk}_{i}&=\frac{\exp\!\big(f_{\text{act}}(\bar{v}^{k\top}_{i}w)\big)}{\sum^{T_{h}}_{k^{\prime}=1}\exp\!\big(f_{\text{act}}(\bar{v}^{k^{\prime}\top}_{i}w)\big)},\quad(1\leq k\leq T_{h})\\
\beta^{fk}_{i}&=\frac{\exp\!\big(f_{\text{act}}(\bar{v}^{k\top}_{i}w)\big)}{\sum^{T_{h}+T_{f}}_{k^{\prime}=T_{h}+1}\exp\!\big(f_{\text{act}}(\bar{v}^{k^{\prime}\top}_{i}w)\big)},\quad(T_{h}+1\leq k\leq T_{h}+T_{f})
\end{aligned} \tag{17}$$

where $w\in\mathbb{R}^{\bar{D}_{n}}$ is a weight vector parameterizing the attention function. Then the node attributes are updated by

$$\widetilde{v}^{h}_{i}=\sum^{T_{h}}_{k=1}f_{\text{act}}(\beta^{hk}_{i}w^{\top}\bar{v}^{k}_{i}),\quad\widetilde{v}^{f}_{i}=\sum^{T_{h}+T_{f}}_{k=T_{h}+1}f_{\text{act}}(\beta^{fk}_{i}w^{\top}\bar{v}^{k}_{i}). \tag{18}$$

The multi-head attention mechanism can also be employed by learning different $w$ and fusing the information via averaging or concatenation operations.
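A minimal sketch of Eqs. (17)-(18) for one node over one time segment (history or future) follows. Note that, as written in Eq. (18), each attention head produces a scalar summary; the multi-head concatenation mentioned above stacks these into the final attribute vector. The tensor layout is an assumption.

import torch
import torch.nn.functional as F

def temporal_attention(v_bar, w):
    """Temporal attention (Eqs. 17-18) for one node over one time segment.
    v_bar: [T, D_bar] topologically-updated attributes of the node;
    w: [D_bar] learnable weight vector of one attention head."""
    proj = v_bar @ w                             # w^T v_bar_i^k for every step k
    beta = F.softmax(F.leaky_relu(proj), dim=0)  # Eq. (17): temporal coefficients
    # Eq. (18): sum_k f_act(beta_i^k * w^T v_bar_i^k); one scalar per head,
    # concatenated across heads to form the abstract node attribute
    return F.leaky_relu(beta * proj).sum()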

V-C3 Feature Encoding

For each agent, the history and future node attributes are concatenated and mapped by an encoding function $f_{\text{enc}}$ to obtain a latent variable $z_{i}$, which is given by

$$z_{i}=f_{\text{enc}}([\widetilde{v}^{h}_{i}\,||\,\widetilde{v}^{f}_{i}]). \tag{19}$$

The underlying intuition is that, during the training phase, the latent variable is able to encode the future information conditioned on the given history information, and it is trained to be consistent with the prior distribution by a regularization term in the loss function. During the testing phase, although the future information is not available, it can be implicitly obtained by sampling the latent variable from the prior distribution.

V-D Decoder with Kinematic Constraint

Refer to caption
Figure 3: The diagram of the kinematic bicycle model, adapted from [50]. The model equations are provided in Eq. (20).

We impose a kinematic constraint layer after the decoder gated recurrent unit (GRU) to enforce feasible trajectory prediction. Agents of the same type share the same GRU unit, while different types use distinct ones. This is reasonable since the behavior patterns, speed ranges and traversable areas may differ across agent types.

The bicycle model is a widely used nonlinear model that approximates the kinematics of vehicles, as shown in Fig. 3. Here we adopt its discretized form, which is given by

$$\left\{\begin{aligned}
x(k+1)&=x(k)+v(k)\cos(\psi(k)+\beta(k))\,\Delta T,\\
y(k+1)&=y(k)+v(k)\sin(\psi(k)+\beta(k))\,\Delta T,\\
\psi(k+1)&=\psi(k)+\frac{v(k)}{l_{r}}\sin\beta(k)\,\Delta T,\\
v(k+1)&=v(k)+a(k)\,\Delta T,\\
\beta(k+1)&=\beta(k)+\dot{\beta}(k)\,\Delta T,
\end{aligned}\right. \tag{20}$$

where $(x,y)$ are the coordinates of the center of mass, $\psi$ is the inertial heading and $v$ is the speed of the vehicle. $\beta$ is the angle of the current velocity of the center of mass with respect to the longitudinal axis of the car. $l_{f}$ and $l_{r}$ denote the distance from the center of mass of the vehicle to the front and rear axles, respectively. The state of a vehicle at time step $k$ is denoted as $\mathbf{s}(k)=[x(k),y(k),\psi(k),v(k),\beta(k)]^{\top}$, and the control input at time step $k$ as $\mathbf{u}(k)=[a(k),\dot{\beta}(k)]^{\top}$. Eq. (20) can then be expressed compactly as $\mathbf{s}(k+1)=\mathbf{f}(\mathbf{s}(k),\mathbf{u}(k))$.

The prediction system is expected to provide the position distribution of each agent at each time step. The distribution of the control input $\mathbf{u}(k)$ is assumed to be a multivariate Gaussian distribution at each time step, parameterized by the output of the GRU cell. We provide an illustrative example for agent $i$. The inputs of the GRU are the node attribute $\widetilde{v}_{i}$ at the first step and zero paddings for the following steps. The outputs are the raw $\mathbf{u}(k)$ at each step, which are truncated by a saturation function in order to restrict the control actions to the feasible range. Then the kinematic cell takes in the current state and action, and outputs the state at the next step. This procedure is iterated until the prediction horizon is reached. If $l_{r}$ is not known a priori or cannot be observed during testing, we can approximate it with a constant based on the statistics of the training data.
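A minimal sketch of this decoding loop through Eq. (20) is shown below, assuming the saturated control sequence has already been produced by the GRU; the wheelbase value and the 0.5 s step (matching the 10-step / 5.0 s setting used on the ID dataset) are illustrative assumptions.

import numpy as np

def bicycle_step(s, u, l_r, dT):
    """One step of the discretized kinematic bicycle model, Eq. (20).
    s = [x, y, psi, v, beta]; u = [a, beta_dot]."""
    x, y, psi, v, beta = s
    a, beta_dot = u
    return np.array([x + v * np.cos(psi + beta) * dT,
                     y + v * np.sin(psi + beta) * dT,
                     psi + v / l_r * np.sin(beta) * dT,
                     v + a * dT,
                     beta + beta_dot * dT])

def rollout(s0, controls, l_r=1.5, dT=0.5):
    """Iterate s(k+1) = f(s(k), u(k)) until the prediction horizon is reached.
    controls: [T_f, 2] saturated control actions decoded by the GRU."""
    traj, s = [], s0
    for u in controls:
        s = bicycle_step(s, u, l_r, dT)
        traj.append(s[:2])                       # keep the predicted position
    return np.array(traj)                        # [T_f, 2]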

In order to propagate the uncertainty to future time steps, two options are available with tradeoffs. The details are introduced in the following.

\bullet Nonlinear system with Monte Carlo sampling:

We can approximate the distribution of position by a set of Monte Carlo samples, which is highly flexible without any restrictions on the nonlinear dynamic system. The particle samples are propagated by the nonlinear bicycle model. The prediction accuracy tends to improve as the number of particles increases. However, the number of samples may need to be adjusted according to real-time requirements. If the explicit probability density function of the state variable is required, the kernel density estimation technique can be employed in a non-parametric way.
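A sketch of this option, reusing bicycle_step from the rollout sketch above, could look as follows; the particle count and the fixed random seed are illustrative.

import numpy as np

def monte_carlo_propagate(s0, mu_u, Sigma_u, l_r=1.5, dT=0.5, n_particles=100):
    """Propagate position uncertainty with particles through the nonlinear model.
    mu_u: [T_f, 2] control means; Sigma_u: [T_f, 2, 2] control covariances."""
    rng = np.random.default_rng(0)
    particles = np.tile(s0, (n_particles, 1))    # [P, 5] identical initial states
    positions = []
    for mu, Sigma in zip(mu_u, Sigma_u):
        u = rng.multivariate_normal(mu, Sigma, size=n_particles)  # sampled controls
        particles = np.stack([bicycle_step(s, ui, l_r, dT)
                              for s, ui in zip(particles, u)])
        positions.append(particles[:, :2].copy())
    return np.stack(positions)                   # [T_f, P, 2] position samples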

\bullet Linearized system with Gaussian assumption:

Since $\mathbf{u}(k)$ follows a Gaussian distribution, we can obtain an analytic distribution of position by linearizing the bicycle model at the current state. It is easy to show that the position distribution is also Gaussian. This option is simple to implement and computationally efficient, but its flexibility is very limited. This restriction leads to lower prediction accuracy in general, especially when the real distribution is multi-modal.

For a nonlinear system with a Gaussian assumption, the expectation and covariance of the state variable can be propagated in a fashion similar to the prior (prediction) update of the extended Kalman filter [51]. We can linearize the system around the current state $\mathbf{s}(k)$:

$$\mathbf{s}(k+1)=\mathbf{Df}_{s}(k)\,\mathbf{s}(k)+\mathbf{Df}_{u}(k)\,\mathbf{u}(k), \tag{21}$$
$$\mathbf{u}(k)\sim\mathcal{N}(\bm{\mu}_{u}(k),\bm{\Sigma}_{uu}(k)), \tag{22}$$

where $\mathbf{Df}_{s}(k)$ and $\mathbf{Df}_{u}(k)$ are the Jacobian matrices defined below:

$$\mathbf{Df}_{s}(k)=\begin{bmatrix}1&0&-\Delta T\,v_{y}(k)&\Delta T\,v_{x}(k)&-\Delta T\,v_{y}(k)\\ 0&1&\Delta T\,v_{x}(k)&\Delta T\,v_{y}(k)&\Delta T\,v_{x}(k)\\ 0&0&1&\frac{\Delta T}{l_{r}}\sin\beta(k)&\frac{\Delta T\,v(k)}{l_{r}}\cos\beta(k)\\ 0&0&0&1&0\\ 0&0&0&0&1\end{bmatrix},\qquad \mathbf{Df}_{u}(k)=\begin{bmatrix}0&0\\ 0&0\\ 0&0\\ \Delta T&0\\ 0&\Delta T\end{bmatrix}, \tag{23}$$

where $\gamma(k)=\psi(k)+\beta(k)$, $v_{x}(k)=v(k)\cos\gamma(k)$, and $v_{y}(k)=v(k)\sin\gamma(k)$. Then we have the distribution of $\mathbf{s}(k+1)$, which is expressed as

$$\mathbf{s}(k+1)\sim\mathcal{N}(\bm{\mu}_{s}(k+1),\bm{\Sigma}_{ss}(k+1)), \tag{24}$$

where

$$\begin{aligned}
\bm{\mu}_{s}(k+1)&=\mathbf{Df}_{s}(k)\,\bm{\mu}_{s}(k)+\mathbf{Df}_{u}(k)\,\bm{\mu}_{u}(k),\\
\bm{\Sigma}_{ss}(k+1)&=\mathbf{Df}_{s}(k)\,\bm{\Sigma}_{ss}(k)\,\mathbf{Df}_{s}(k)^{\top}+\mathbf{Df}_{u}(k)\,\bm{\Sigma}_{uu}(k)\,\mathbf{Df}_{u}(k)^{\top},
\end{aligned} \tag{25}$$

with the initial condition $\mathbf{s}(0)\sim\mathcal{N}(\bm{\mu}_{s}(0),\bm{\Sigma}_{ss}(0))$.
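The prior update of Eqs. (23)-(25) can be sketched as below; the Jacobian entries transcribe Eq. (23) directly, and the default wheelbase and step size are assumptions.

import numpy as np

def linearized_step(mu_s, Sigma_ss, mu_u, Sigma_uu, l_r=1.5, dT=0.5):
    """One EKF-style prior update of the state Gaussian, Eqs. (23)-(25).
    mu_s = [x, y, psi, v, beta]; mu_u = [a, beta_dot]."""
    _, _, psi, v, beta = mu_s
    gamma = psi + beta
    vx, vy = v * np.cos(gamma), v * np.sin(gamma)
    Df_s = np.array([[1, 0, -dT * vy, dT * vx, -dT * vy],
                     [0, 1,  dT * vx, dT * vy,  dT * vx],
                     [0, 0, 1, dT / l_r * np.sin(beta), dT * v / l_r * np.cos(beta)],
                     [0, 0, 0, 1, 0],
                     [0, 0, 0, 0, 1]])
    Df_u = np.array([[0, 0], [0, 0], [0, 0], [dT, 0], [0, dT]])
    mu_next = Df_s @ mu_s + Df_u @ mu_u                               # Eq. (25), mean
    Sigma_next = Df_s @ Sigma_ss @ Df_s.T + Df_u @ Sigma_uu @ Df_u.T  # Eq. (25), covariance
    return mu_next, Sigma_next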

Refer to caption
Figure 4: The diagram of the recurrent decoder with kinematic constraint layer. The recurrent process consists of two phases: burn-in phase and prediction phase. In the burn-in phase, the history groundtruth is used as the input of GRU at each step for initialization purpose. In the prediction phase, the output position at the last step will serve as the input of the next step. The iteration continues until the prediction horizon is reached.

V-E Loss Function and Training

In this part, we present the loss function of our model, which is based on the optimization formulation of the Wasserstein generative modeling introduced in Section III. To keep consistent with Section III, we use the same notation: $x$ denotes the predicted trajectories, $z$ denotes the latent variable, and $y$ denotes the condition variable.

The optimization problem can be formulated in the same way as in [12], which is written as

$$\begin{aligned}
\min_{\theta,\phi,\ \text{s.t.}\ 0<1-\alpha<\beta}\ &-\mathbf{E}_{p_{\phi}(z|x,y)}[\log p_{\theta}(x|z,y)]\\
&+\alpha\,\mathbf{E}_{p(x|y)}\big[D_{\text{KL}}[p_{\phi}(z|x,y)\,\|\,p(z|y)]\big]\\
&+\beta\,D(p_{\phi}(z|y),p(z|y)).
\end{aligned} \tag{26}$$

The detailed derivation can be found in [12]. Since the whole system is fully differentiable, we can train the network in an end-to-end fashion with the Adam optimizer [52]. The loss function is given by

$$\begin{aligned}
\mathcal{L}=\ &\gamma\,\mathbf{E}_{j\in\{1,...,N_{b}\}}\left\lVert\bm{\tau}_{j}^{T_{h}+1:T_{h}+T_{f}}-\hat{\bm{\tau}}_{j}^{T_{h}+1:T_{h}+T_{f}}\right\rVert^{2}\\
&+\alpha\,\mathbf{E}_{p(x|y)}\big[D_{\text{KL}}[p_{\phi}(z|x,y)\,\|\,p(z|y)]\big]\\
&+\beta\,\text{MMD}(p_{\phi}(z|y),p(z|y)),
\end{aligned} \tag{27}$$

where

$$\begin{aligned}
p_{\phi}(z|x,y)&=\mathcal{N}(\text{MLP}(\tilde{v}_{h},\tilde{v}_{f}),\mathbf{I}),\\
\tilde{v}_{h}&=\text{GDAT}(\text{FE}(y)),\quad\tilde{v}_{f}=\text{GDAT}(\text{FE}(x)),\\
y&=\{\mathbf{T}_{1:T_{h}},\mathbf{C}_{1:T_{h}}\},\\
x&=\{\mathbf{T}_{T_{h}+1:T_{h}+T_{f}},\mathbf{C}_{T_{h}+1:T_{h}+T_{f}}\}.
\end{aligned} \tag{28}$$

FE is the deep feature extractor and GDAT is the proposed graph dual-attention network. $\gamma$ is a weight parameter that adjusts the relative importance of the reconstruction loss, $N_{b}$ is the total number of training agents, $D_{\text{KL}}$ is the Kullback-Leibler divergence, and MMD is the maximum mean discrepancy. If $\gamma\gg\alpha,\beta$, the loss function degenerates to the mean squared error loss. The whole model is trained in an end-to-end fashion.
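A condensed sketch of Eq. (27) is given below; the RBF kernel for the MMD term, the unit-covariance posterior of Eq. (28), and the weight values are illustrative assumptions.

import torch

def mmd_rbf(z_q, z_p, sigma=1.0):
    """Maximum mean discrepancy between posterior and prior latent samples."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return k(z_q, z_q).mean() + k(z_p, z_p).mean() - 2 * k(z_q, z_p).mean()

def stg_dat_loss(pred, gt, mu_z, z_q, gamma=1.0, alpha=0.1, beta=0.1):
    """Loss of Eq. (27): reconstruction + KL regularizer + MMD regularizer.
    pred/gt: [N_b, T_f, 2] trajectories; mu_z: [N_b, D_z] posterior means
    (unit covariance as in Eq. 28); z_q: [N_b, D_z] posterior samples."""
    recon = (pred - gt).pow(2).sum(dim=-1).mean()    # reconstruction term
    kl = 0.5 * mu_z.pow(2).sum(dim=-1).mean()        # KL( N(mu, I) || N(0, I) )
    z_p = torch.randn_like(z_q)                      # samples from the prior
    return gamma * recon + alpha * kl + beta * mmd_rbf(z_q, z_p)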

TABLE I: ADE / FDE (meters) Comparisons (ETH & UCY datasets).
Scenes S-LSTM S-GAN CGNS Social-BiGAT Trajectron STG-DAT
ETH 1.09 / 2.35 0.81 / 1.52 0.62 / 1.40 0.69 / 1.29 0.48 / 0.93 0.38 / 0.77
HOTEL 0.79 / 1.76 0.72 / 0.61 0.70 / 0.93 0.49 / 1.01 0.29 / 0.54 0.25 / 0.39
UNIV 0.67 / 1.40 0.60 / 1.26 0.48 / 1.22 0.55 / 1.32 0.44 / 0.93 0.41 / 0.82
ZARA1 0.47 / 1.00 0.34 / 0.69 0.32 / 0.59 0.30 / 0.63 0.35 / 0.68 0.23 / 0.50
ZARA2 0.56 / 1.17 0.42 / 0.84 0.35 / 0.71 0.36 / 0.75 0.36 / 0.70 0.21 / 0.46
AVG 0.72 / 1.54 0.58 / 1.18 0.49 / 0.97 0.48 / 1.00 0.46 / 0.94 0.30 / 0.59
TABLE II: ADE / FDE (pixels) Comparisons (SDD dataset).
S-LSTM S-GAN CGNS DESIRE Trajectron STG-DAT (same node) STG-DAT
33.19 / 56.38 24.81 / 38.62 15.60 / 28.20 19.30 / 34.12 17.38 / 31.46 14.55 / 23.54 13.25 / 21.94

VI Experiments

In this section, we validate the proposed method on three publicly available benchmark datasets for trajectory prediction of traffic participants. The results are analyzed and compared with state-of-the-art baselines.

VI-A Datasets

Here we briefly introduce the datasets below. Please refer to the appendix for data processing details.

\bullet ETH [53] and UCY [54]: These two datasets are usually used together in the literature; they include bird's-eye-view videos and annotations of pedestrians in both indoor and outdoor scenarios. The trajectories were extracted in the world space (unit: meters).

\bullet Stanford Drone Dataset (SDD) [55]: The dataset also contains a set of bird's-eye-view images and the corresponding trajectories of involved entities. It was collected in multiple scenarios on a university campus full of interacting pedestrians, cyclists and vehicles. The trajectories were extracted in the image pixel space.

\bullet INTERACTION Dataset (ID) [2]: The dataset contains naturalistic motions of various traffic participants in a variety of highly interactive driving scenarios. Trajectory data was collected using drones and traffic cameras. The high-definition maps of the scenarios and the agents' trajectories are provided. We consider three types of scenarios: roundabout (RA), unsignalized intersection (UI) and highway ramp (HR). The trajectories were extracted in the world space (unit: meters).

VI-B Evaluation Metrics

We evaluate the model performance in terms of average displacement error (ADE) and final displacement error (FDE), the same metrics as used in [5, 6, 56]. ADE is defined as the average distance between the predicted trajectories and the groundtruth over all the involved entities within the prediction horizon. FDE is defined as the deviated distance at the last predicted time step. For the ETH, UCY and SDD datasets, we predicted the future 12 time steps (4.8 s) based on the historical 8 time steps (3.2 s). For the ID dataset, we predicted the future 10 time steps (5.0 s) based on the historical 4 time steps (2.0 s).
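For reference, both metrics can be computed as in the short sketch below (the array layout is an assumption):

import numpy as np

def ade_fde(pred, gt):
    """pred, gt: [N, T_f, 2] predicted and groundtruth positions.
    ADE: mean displacement over all agents and predicted steps;
    FDE: mean displacement at the last predicted step."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # [N, T_f] Euclidean errors
    return dist.mean(), dist[:, -1].mean()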

VI-C Baseline Methods

  • Probabilistic LSTM (P-LSTM) [57]: The backbone is an encoder-decoder architecture with a vanilla LSTM. In order to incorporate uncertainty, a noise term sampled from a normal distribution is added to the input, which results in a probabilistic model.

  • Social LSTM (S-LSTM) [5]: The trajectories are encoded with an LSTM layer, whose hidden states serve as the input of the proposed social pooling layer, which handles interaction modeling implicitly.

  • Social GAN (S-GAN) [6]: The model introduces a generative adversarial learning scheme into S-LSTM to further improve performance.

  • Social Attention (S-ATT) [10]: The model is based on the architecture of Structural-RNN [58], which deals with spatio-temporal graphs with recurrent neural networks.

  • DESIRE [20]: The model is a deep stochastic inverse optimal control framework based on conditional variational auto-encoder with RNN encoders and decoders. A ranking module for sampled trajectories was introduced to indicate their likelihood.

  • Social-BiGAT [56]: The model is a graph-based generative adversarial network, which is based on a graph attention network. A recurrent encoder-decoder architecture is trained via an adversarial scheme.

  • Trajectron [29]: The model combines tools from recurrent sequence modeling and variational deep generative modeling to produce a distribution of future trajectories.

TABLE III: ADE / FDE (meters) Comparisons (ID dataset).
Baseline Methods
Scenes Time P-LSTM S-LSTM S-GAN S-ATT CGNS Trajectron
RA 1.0s 0.13 / 0.15 0.16 / 0.20 0.13 / 0.16 0.14 / 0.17 0.11 / 0.16 0.09 / 0.13
2.0s 0.27 / 0.35 0.25 / 0.39 0.22 / 0.35 0.23 / 0.39 0.21 / 0.32 0.18 / 0.30
3.0s 0.58 / 0.83 0.52 / 0.80 0.45 / 0.72 0.53 / 0.77 0.44 / 0.68 0.40 / 0.61
4.0s 0.87 / 1.35 0.84 / 1.33 0.70 / 1.19 0.78 / 1.25 0.69 / 1.03 0.64 / 0.98
5.0s 1.36 / 1.88 1.29 / 1.80 1.13 / 1.51 1.22 / 1.69 1.01 / 1.45 0.95 / 1.32
UI 1.0s 0.11 / 0.20 0.12 / 0.17 0.11 / 0.16 0.12 / 0.18 0.11 / 0.14 0.10 / 0.14
2.0s 0.28 / 0.49 0.25 / 0.48 0.24 / 0.40 0.26 / 0.43 0.24 / 0.38 0.23 / 0.35
3.0s 0.45 / 0.88 0.41 / 0.83 0.39 / 0.77 0.41 / 0.80 0.38 / 0.76 0.36 / 0.72
4.0s 0.83 / 1.64 0.77 / 1.47 0.74 / 1.31 0.75 / 1.39 0.70 / 1.18 0.69 / 1.10
5.0s 1.34 / 2.36 1.31 / 2.20 1.21 / 1.98 1.29 / 2.06 1.06 / 1.90 1.01 / 1.84
HR 1.0s 0.07 / 0.11 0.06 / 0.10 0.06 / 0.08 0.06 / 0.09 0.05 / 0.08 0.05 / 0.07
2.0s 0.21 / 0.39 0.19 / 0.35 0.17 / 0.31 0.17 / 0.32 0.16 / 0.29 0.15 / 0.27
3.0s 0.55 / 1.02 0.49 / 0.92 0.43 / 0.81 0.45 / 0.84 0.40 / 0.75 0.38 / 0.70
4.0s 0.90 / 1.58 0.81 / 1.43 0.71 / 1.26 0.74 / 1.31 0.66 / 1.17 0.62 / 1.09
5.0s 1.51 / 2.57 1.36 / 2.31 1.20 / 2.04 1.25 / 2.12 1.12 / 1.90 1.04 / 1.77
STG-DAT
Scenes Time $\mathbf{T}$ $\mathbf{T}+\mathbf{C}-\mathbf{ATT}$ $\mathbf{T}+\mathbf{C}$ $\mathbf{T}+\mathbf{C}+\mathbf{K}$
RA 1.0s 0.07 / 0.11 0.09 / 0.13 0.08 / 0.13 0.06 / 0.10
2.0s 0.19 / 0.31 0.22 / 0.34 0.16 / 0.29 0.14 / 0.25
3.0s 0.35 / 0.57 0.43 / 0.66 0.31 / 0.54 0.26 / 0.47
4.0s 0.58 / 0.92 0.70 / 1.05 0.51 / 0.83 0.42 / 0.71
5.0s 0.90 / 1.25 1.08 / 1.56 0.85 / 1.17 0.68 / 1.01
UI 1.0s 0.06 / 0.10 0.09 / 0.16 0.08 / 0.14 0.07 / 0.11
2.0s 0.20 / 0.32 0.23 / 0.36 0.19 / 0.30 0.16 / 0.28
3.0s 0.37 / 0.65 0.41 / 0.71 0.34 / 0.59 0.30 / 0.55
4.0s 0.60 / 0.98 0.72 / 1.19 0.55 / 0.91 0.49 / 0.83
5.0s 0.97 / 1.69 1.22 / 1.91 0.88 / 1.50 0.77 / 1.26
HR 1.0s 0.04 / 0.07 0.05 / 0.07 0.04 / 0.07 0.04 / 0.06
2.0s 0.14 / 0.25 0.15 / 0.27 0.13 / 0.24 0.12 / 0.22
3.0s 0.35 / 0.66 0.37 / 0.70 0.34 / 0.64 0.31 / 0.58
4.0s 0.58 / 1.02 0.61 / 1.08 0.56 / 0.99 0.51 / 0.90
5.0s 0.97 / 1.65 1.03 / 1.75 0.95 / 1.61 0.86 / 1.46

VI-D Implementation Details

A batch size of 64 was used, and the models were trained for 100 epochs with early stopping using the Adam optimizer with an initial learning rate of 0.001. The models were trained on a single NVIDIA TITAN X GPU. We used a split of 70%, 10% and 20% for training, validation and testing, respectively. During the testing phase, the data processing time for a single time step is around 8 ms on average. Under our setting, predicting the future 10 time steps takes around 80 ms (equivalent to 12.5 Hz).

The details of our model architecture are introduced in the following.

  • Deep Feature Extractor (FE): The State MLP and Relation MLP both have three hidden layers with 128 hidden units. The Context CNN adopts the same backbone structure of ResNet18 [59], which is trained from scratch.

  • Graph Dual-Attention Network (GDAT): The dimensions of node attributes and edge attributes are 64 and 16, respectively. These dimensions are fixed in different rounds of message passing. The activation functions in the attention mechanism are LeakyReLU.

  • Encoding Function: The encoding function is a three-layer MLP with 128 hidden units. The dimension of latent variable is 32.

  • Decoding Function: The decoding function is a GRU recurrent layer with 128 hidden units.

VI-E Quantitative Analysis

\bullet ETH and UCY Datasets: The comparison of the proposed STG-DAT and baseline methods in terms of ADE and FDE is shown in Table I. Some of the reported statistics are adopted from the original papers. All the baseline methods deal with interaction modeling in a specific way. The S-LSTM employs a social pooling mechanism to model the interactions between entities. The S-GAN improves the performance by introducing deep generative modeling. The CGNS combines conditional latent space learning and variational divergence minimization to further enhance the generation capability. The Trajectron adopts a graph-structured model with sequence modeling. Both Social-BiGAT and our method leverage the trajectory and context information, but in different ways. Our model achieves better performance owing to the explicit interaction modeling with graph neural networks and the more compact distribution learning with conditional Wasserstein generative modeling. In general, our approach achieves the smallest ADE and FDE across different scenes. The average ADE / FDE are reduced by 34.8% / 37.2% compared to the best baseline (Trajectron).

\bullet Stanford Drone Dataset: The comparison of results is provided in Table II, where the ADE and FDE are reported in pixel distance. Note that besides pedestrians we also included the cyclists and vehicles in the dataset. In general, the relative performance of the baseline methods is consistent with the observations on the ETH / UCY datasets. Our approach achieves the best performance in terms of prediction error, which implies the superiority of explicit interaction modeling and the necessity of leveraging both trajectory and context information. In order to show the effectiveness of distinct node embedding functions for different agent types, we provide an ablative result treating all the agents as the same type, which leads to an increase in the prediction error. For the full model, the ADE / FDE are reduced by 15.1% / 22.4% with respect to the best baseline (CGNS).

\bullet INTERACTION Dataset: We finally compare the model performance on the real-world driving dataset in Table III. For a fair comparison, we only involved the baseline approaches whose code is publicly available and can be adapted to the same setup as our approach. Although we trained a unified prediction model on different scenarios simultaneously, we analyzed the results for each type of scenario separately. In the HR scenarios, the results of the baseline methods are comparable while our model achieves the best performance. The behavior patterns of vehicles in HR scenarios are relatively easy to forecast, since most vehicles are car following without highly interactive behaviors. In the RA and UI scenarios, however, the superiority of the proposed system is more distinguishable due to frequent interactions. The P-LSTM performs the worst since it predicts future trajectories for each agent individually without considering their relations. Although all the other baseline approaches incorporate interaction modeling by different strategies, which further reduces the prediction error, our model still performs the best. This implies the advantages of the graph dual-attention network for interaction modeling, as well as the kinematic layer for feasibility constraints. By using our full model $\mathbf{T}+\mathbf{C}+\mathbf{K}$, the 5.0s ADE / FDE are reduced by 28.4% / 23.5%, 33.7% / 31.5% and 17.3% / 17.5% in RA, UI and HR with respect to the best baseline (Trajectron), respectively.

We also tested the tracking performance of our approach and the baseline methods, which is shown in Table IV. The constant velocity model (CVM) and constant acceleration model (CAM) are widely used linear vehicle kinematics models with a constant velocity / acceleration assumption, and they are employed frequently in the multi-target tracking literature. A Gaussian noise term is injected at each time step to model uncertainty. The results show that tracking with learning-based models performs consistently better than with CVM and CAM, owing to their ability to make interaction-aware predictions. Our approach STG-DAT achieves a significantly higher accuracy.

VI-F Qualitative and Ablative Analysis

Refer to caption
Figure 5: Qualitative results on the SDD dataset. The green mask represents the predicted distribution and the yellow, blue and red lines represent historical observation, groundtruth and a trajectory hypothesis sampled from the distribution with the smallest error, respectively.
Refer to caption
Figure 6: Qualitative and ablative results on the ID dataset. The green mask represents the predicted distribution and the yellow, blue and red lines represent historical observation, groundtruth and a trajectory hypothesis sampled from the distribution with the smallest error, respectively.
TABLE IV: Tracking Performance of Vehicle Positions and Velocities
Method Position (m) Velocity (m/s)
CVM 0.025 0.231
CAM 0.021 0.186
P-LSTM 0.014 0.108
S-LSTM 0.013 0.101
S-GAN 0.011 0.087
S-ATT 0.012 0.096
CGNS 0.010 0.077
Trajectron 0.009 0.061
STG-DAT (Linearization) 0.007 0.035
STG-DAT (Monte Carlo) 0.005 0.030

We qualitatively evaluate prediction hypotheses for typical testing cases on the SDD and ID datasets in Fig. 5 and Fig. 6, respectively. Although we jointly predict all the agents in a scene, we only show predictions for a subset for clarity. The figures show that our approach can handle different challenging scenarios (e.g. intersections, roundabouts) and diverse behaviors (e.g. going straight, turning, waiting, stopping) of vehicles and pedestrians. Generally, the groundtruth trajectories are close to the mean of the predicted distribution, and the model also allows for uncertainty.

We also conducted comprehensive ablative analysis on the ID dataset to demonstrate relative significance of context information, dual-attention mechanism and the kinematic constraint layer for vehicle trajectory prediction. The descriptions of compared model settings are provided below:

  • $\mathbf{T}$: This is the model without the kinematic layer, which only uses trajectory information.

  • $\mathbf{T}+\mathbf{C}-\mathbf{ATT}$: This is the model without the dual-attention mechanism or the kinematic constraint layer. We used equal attention in this model setting instead.

  • 𝐓+𝐂\mathbf{T}+\mathbf{C}: This is the model without kinematic constraint layer.

  • 𝐓+𝐂+𝐊\mathbf{T}+\mathbf{C}+\mathbf{K}: This is the whole proposed model including all the components.

The ADE / FDE of each model setting are shown in the lower part of Table III.

\bullet $\mathbf{T}$ versus $\mathbf{T}+\mathbf{C}$: We show the effectiveness of employing scene context information. $\mathbf{T}$ is the model that uses only trajectory information, while $\mathbf{T}+\mathbf{C}$ further employs context information. Both models directly output the position displacements $(\Delta x^{k},\Delta y^{k})$ at each step, which are aggregated to obtain complete trajectories (see the sketch below). We observe little difference in prediction errors over short horizons, while the gap grows as the horizon extends. The reason is that vehicle trajectories within a short period can usually be well approximated by a constant velocity model and are not heavily restricted or affected by the static environmental context. As the forecasting horizon increases, however, the effects of context constraints can no longer be ignored, which leads to a larger performance gain from leveraging context information. Compared with $\mathbf{T}$, the 5.0s ADE / FDE of $\mathbf{T}+\mathbf{C}$ are reduced by 5.6% / 6.4%, 9.3% / 7.7% and 2.1% / 2.4% in the RA, UI and HR scenarios, respectively. This implies that the context information has larger effects on the prediction in the RA and UI scenarios, where the influence of road geometry cannot be ignored. The context information helps little in the HR scenarios, since most vehicles go straight on highways. In Fig. 6, the predicted distribution of $\mathbf{T}+\mathbf{C}$ is more compliant with roadways to avoid collisions, and the vehicles near the “yield” or “stop” signs tend to yield or stop, whereas $\mathbf{T}$ generates samples that fall outside feasible areas or violate traffic rules.
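Concretely, aggregating the per-step displacements into an absolute trajectory reduces to a cumulative sum; a minimal sketch (function and variable names are our own):

import numpy as np

def displacements_to_trajectory(p0, deltas):
    """Integrate predicted displacements into absolute positions.

    p0:     (2,) last observed position (x, y).
    deltas: (horizon, 2) predicted per-step displacements.
    """
    return p0 + np.cumsum(deltas, axis=0)  # (horizon, 2) trajectory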

\bullet $\mathbf{T}+\mathbf{C}-\mathbf{ATT}$ versus $\mathbf{T}+\mathbf{C}$: We show the effectiveness of the proposed dual-attention mechanism. $\mathbf{T}+\mathbf{C}-\mathbf{ATT}$ uses equal attention coefficients in both the topological and temporal layers (illustrated in the sketch below). According to the statistics reported in Table III, compared with equal attention, employing the dual-attention mechanism to figure out the relative importance within the topological structure and across different time steps reduces the 5.0s ADE / FDE by 21.3% / 25.0%, 27.9% / 21.5% and 7.8% / 8.0% in the RA, UI and HR scenarios, respectively. The improvement in RA and UI is more significant due to frequent interactions.
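For intuition, the ablation simply replaces the learned attention coefficients with uniform weights over each node's neighbors; the sketch below reduces the message passing to a single weighted aggregation and is not the exact network:

import numpy as np

def aggregate(features, neighbors, attention=None):
    """Aggregate neighbor features for one node.

    features:  (num_nodes, dim) node embeddings.
    neighbors: index list of the node's neighbors.
    attention: optional (len(neighbors),) learned coefficients;
               if None, equal attention 1/|N| is used (the ablation).
    """
    if attention is None:
        attention = np.full(len(neighbors), 1.0 / len(neighbors))
    return attention @ features[neighbors]  # weighted sum of neighbor features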

\bullet $\mathbf{T}+\mathbf{C}$ versus $\mathbf{T}+\mathbf{C}+\mathbf{K}$: We show the effectiveness of the kinematic constraint layer. Different from $\mathbf{T}$ and $\mathbf{T}+\mathbf{C}$, which directly output position displacements, the GRU unit in $\mathbf{T}+\mathbf{C}+\mathbf{K}$ outputs control actions, which are integrated by the bicycle model to obtain complete trajectories (see the sketch below). According to Table III, employing the kinematic constraint layer to regularize the learning-based prediction hypotheses further reduces the 5.0s ADE / FDE by 20.0% / 13.7%, 12.5% / 16.0% and 9.5% / 9.3% in the RA, UI and HR scenarios, respectively. Owing to the restriction imposed by the kinematic model, infeasible movements are filtered out and the model is less likely to overfit noisy data or outliers. Moreover, the improvement in RA and UI is more significant than in HR. The reason is that most vehicles in HR go straight along the road, and their behaviors can be well approximated by linear models, whereas the frequent turning behaviors in RA and UI need to be constrained by more sophisticated models. We also visualize the predicted trajectories in Fig. 6, where those of $\mathbf{T}+\mathbf{C}+\mathbf{K}$ are smoother and more plausible.
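For intuition, a minimal rollout of a kinematic bicycle model that integrates predicted control actions (acceleration and steering angle) into a trajectory is sketched below; the wheelbase, time step and naming are illustrative assumptions:

import numpy as np

def bicycle_rollout(state, controls, L=2.7, dt=0.1):
    """Integrate control actions with a kinematic bicycle model.

    state:    [x, y, heading, speed] at the last observed step.
    controls: (horizon, 2) rows of [acceleration, steering angle].
    L:        wheelbase in meters (illustrative value).
    """
    x, y, th, v = state
    traj = []
    for a, delta in controls:
        x += v * np.cos(th) * dt           # advance position
        y += v * np.sin(th) * dt
        th += v / L * np.tan(delta) * dt   # update heading
        v += a * dt                        # update speed
        traj.append((x, y))
    return np.array(traj)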

VII Conclusion

In this paper, we propose a generic system for multi-agent trajectory prediction named STG-DAT, which considers context information, trajectories of heterogeneous interactive agents and physical feasibility constraints. In order to effectively model the interactions between different entities, we design a graph dual-attention network to extract features from spatio-temporal dynamic graphs. Wasserstein generative modeling is employed as the basis for training the whole framework. STG-DAT is validated on both pedestrian and vehicle trajectory prediction tasks on multiple benchmark datasets. The experimental results show that our approach achieves state-of-the-art prediction performance compared with multiple baseline methods. Moreover, the proposed prediction model can be easily adopted by multi-target tracking frameworks, which is empirically shown to enhance tracking accuracy.

Appendix A Data Preprocessing

In this section, we provide supplementary details of the data preprocessing. The pipeline is illustrated in Fig. 7.

A-A Global Context Information

In order to provide better global context information, we designed two representations: the occupancy density map and the mean velocity field. After constructing this global context information offline, we performed decentralized online localization for each target agent to obtain its local context information, which was used in both the training and testing phases.

Occupancy Density Map The density map describes the normalized frequency distribution of all the agents' locations. For a specific scene, we first split the map into a number of bin areas, which are 1m$\times$1m squares. Without loss of generality, we denote this histogram as $B$ and all the agents in different frames as a set $\{o_{k,p}\}$, where $p$ is the agent index and $k$ is the frame index. We obtain the global representation of density by calculating $B_{i,j}=\sum_{k,p}\phi(o_{k,p},i,j)$, where $i,j$ are the indices of the histogram and $\phi(o_{k,p},i,j)$ is an indicator function that equals 1 if $o_{k,p}$ is located in the bin area indicated by index $i,j$ and 0 otherwise. We then normalize this density map by dividing all bin values by their sum and use the normalized histogram as the occupancy density map.
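A minimal sketch of this computation with a 2D histogram follows; the map extent, bin layout and names are illustrative assumptions:

import numpy as np

def occupancy_density_map(positions, x_range, y_range, bin_size=1.0):
    """Normalized frequency of agent locations over 1m x 1m bins.

    positions: (num_points, 2) array of all agents' (x, y) locations
               pooled over all frames.
    """
    x_edges = np.arange(x_range[0], x_range[1] + bin_size, bin_size)
    y_edges = np.arange(y_range[0], y_range[1] + bin_size, bin_size)
    B, _, _ = np.histogram2d(positions[:, 0], positions[:, 1],
                             bins=[x_edges, y_edges])
    return B / B.sum()  # normalize so all bin values sum to one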

Mean Velocity Field Similarly, we also create a map of the velocity field, which consists of 1m$\times$1m square areas. We denote the whole map as $VF$ and the bin indexed by $i,j$ as $VF(i,j)$. $VF(i,j)$ is a two-dimensional vector representing the average velocity along the vertical and horizontal axes of all the agents in this area. More formally,

$$VF(i,j)_{x}=\frac{1}{N}\sum_{k,p}\phi(o_{k,p},i,j)\,v_{k,p}^{x},\qquad VF(i,j)_{y}=\frac{1}{N}\sum_{k,p}\phi(o_{k,p},i,j)\,v_{k,p}^{y}, \tag{29}$$

where $v_{k,p}=(v_{k,p}^{x},v_{k,p}^{y})$ is the velocity of agent $p$ at frame $k$ and $N$ is the number of agents located in the bin area $(i,j)$.
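Using the same binning as the density map above, the mean velocity field averages the velocities of the agents falling into each bin; a sketch under the same assumptions:

import numpy as np

def mean_velocity_field(positions, velocities, x_edges, y_edges):
    """Per-bin average (vx, vy) of the agents located in each bin."""
    ix = np.digitize(positions[:, 0], x_edges) - 1
    iy = np.digitize(positions[:, 1], y_edges) - 1
    VF = np.zeros((len(x_edges) - 1, len(y_edges) - 1, 2))
    N = np.zeros(VF.shape[:2])
    for i, j, v in zip(ix, iy, velocities):
        if 0 <= i < N.shape[0] and 0 <= j < N.shape[1]:
            VF[i, j] += v   # accumulate (vx, vy)
            N[i, j] += 1    # count agents in this bin
    return VF / np.maximum(N, 1)[..., None]  # mean; avoid division by zero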

A-B Localization for Local Context Information

After obtaining the global context offline, our model utilizes a decentralized method to perform localization for each agent during training and testing. Given the location and moving direction of an agent at the current time step, we crop a local context map from the global context, centered on the agent and aligned with its moving direction (see the sketch below). All the agents share the same size of local context map.
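A minimal sketch of this localization step is given below; it samples the global map with nearest-neighbor lookup after rotating the crop into the agent's heading frame (crop size, resolution and names are assumptions):

import numpy as np

def local_context(global_map, center, heading, size=20, res=1.0):
    """Crop a size x size local map centered on the agent and
    aligned with its moving direction."""
    half = size // 2
    local = np.zeros((size, size))
    c, s = np.cos(heading), np.sin(heading)
    for r in range(size):
        for q in range(size):
            # local grid coordinates relative to the agent
            dx, dy = (q - half) * res, (r - half) * res
            # rotate into the global frame and shift to the agent
            gx = center[0] + c * dx - s * dy
            gy = center[1] + s * dx + c * dy
            i, j = int(round(gx / res)), int(round(gy / res))
            if 0 <= i < global_map.shape[0] and 0 <= j < global_map.shape[1]:
                local[r, q] = global_map[i, j]
    return local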

Figure 7: The pipeline of data preprocessing. The raw data are first down-sampled to be compatible with our experiment setup.

References

  • [1] S. Lefèvre, D. Vasquez, and C. Laugier, “A survey on motion prediction and risk assessment for intelligent vehicles,” ROBOMECH journal, vol. 1, no. 1, p. 1, 2014.
  • [2] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kümmerle, H. Königshof, C. Stiller, A. de La Fortelle, and M. Tomizuka, “INTERACTION Dataset: An INTERnational, Adversarial and Cooperative moTION Dataset in Interactive Driving Scenarios with Semantic Maps,” arXiv:1910.03088 [cs, eess], 2019.
  • [3] J. Hong, B. Sapp, and J. Philbin, “Rules of the road: Predicting driving behavior with a convolutional model of semantic interactions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8454–8462.
  • [4] Y. Chai, B. Sapp, M. Bansal, and D. Anguelov, “Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction,” arXiv preprint arXiv:1910.05449, 2019.
  • [5] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971.
  • [6] A. Gupta, J. Johnson, L. Fei-Fei, S. Savarese, and A. Alahi, “Social gan: Socially acceptable trajectories with generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2255–2264.
  • [7] N. Deo and M. M. Trivedi, “Convolutional social pooling for vehicle trajectory prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 1468–1476.
  • [8] N. Deo, A. Rangesh, and M. M. Trivedi, “How would surround vehicles move? a unified framework for maneuver classification and motion prediction,” IEEE Transactions on Intelligent Vehicles, vol. 3, no. 2, pp. 129–140, 2018.
  • [9] T. Zhao, Y. Xu, M. Monfort, W. Choi, C. Baker, Y. Zhao, Y. Wang, and Y. N. Wu, “Multi-agent tensor fusion for contextual trajectory prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12126–12134.
  • [10] A. Vemula, K. Muelling, and J. Oh, “Social attention: Modeling attention in human crowds,” in 2018 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2018, pp. 1–7.
  • [11] J. Li, W. Zhan, Y. Hu, and M. Tomizuka, “Generic tracking and probabilistic prediction framework and its application in autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 9, pp. 3634–3649, 2019.
  • [12] H. Ma, J. Li, W. Zhan, and M. Tomizuka, “Wasserstein generative learning with kinematic constraints for probabilistic interactive driving behavior prediction,” in 2019 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2019, pp. 2477–2483.
  • [13] W. Liu, H. He, and F. Sun, “Vehicle state estimation based on minimum model error criterion combining with extended kalman filter,” Journal of the Franklin Institute, vol. 353, no. 4, pp. 834–856, 2016.
  • [14] J. Scharcanski, A. B. de Oliveira, P. G. Cavalcanti, and Y. Yari, “A particle-filtering approach for vehicular tracking adaptive to occlusions,” IEEE Transactions on Vehicular Technology, vol. 60, no. 2, pp. 381–389, 2010.
  • [15] W. Wang, J. Xi, and D. Zhao, “Learning and inferring a driver’s braking action in car-following scenarios,” IEEE Transactions on Vehicular Technology, vol. 67, no. 5, pp. 3887–3899, 2018.
  • [16] W. Zhan, L. Sun, Y. Hu, J. Li, and M. Tomizuka, “Towards a fatality-aware benchmark of probabilistic reaction prediction in highly interactive driving scenarios,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2018, pp. 3274–3280.
  • [17] D. Kasper, G. Weidl, T. Dang, G. Breuel, A. Tamke, A. Wedel, and W. Rosenstiel, “Object-oriented bayesian networks for detection of lane change maneuvers,” IEEE Intelligent Transportation Systems Magazine, vol. 4, no. 3, pp. 19–31, 2012.
  • [18] L. Sun, W. Zhan, and M. Tomizuka, “Probabilistic prediction of interactive driving behavior via hierarchical inverse reinforcement learning,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2018, pp. 2111–2117.
  • [19] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Soft + hardwired attention: An lstm framework for human trajectory prediction and abnormal event detection,” Neural Networks, vol. 108, pp. 466–478, 2018.
  • [20] N. Lee, W. Choi, P. Vernaza, C. B. Choy, P. H. Torr, and M. Chandraker, “Desire: Distant future prediction in dynamic scenes with interacting agents,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 336–345.
  • [21] A. Rudenko, L. Palmieri, M. Herman, K. M. Kitani, D. M. Gavrila, and K. O. Arras, “Human motion trajectory prediction: A survey,” arXiv preprint arXiv:1905.06113, 2019.
  • [22] Y. Xu, Z. Piao, and S. Gao, “Encoding crowd interaction with deep neural network for pedestrian trajectory prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5275–5284.
  • [23] J. Liang, L. Jiang, J. C. Niebles, A. G. Hauptmann, and L. Fei-Fei, “Peeking into the future: Predicting future person activities and locations in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5725–5734.
  • [24] Y. Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha, “Trafficpredict: Trajectory prediction for heterogeneous traffic-agents,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 6120–6127.
  • [25] J. Li, H. Ma, and M. Tomizuka, “Conditional generative neural system for probabilistic trajectory prediction,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 6150–6156.
  • [26] Z. Li, B. Wang, J. Gong, T. Gao, C. Lu, and G. Wang, “Development and evaluation of two learning-based personalized driver models for pure pursuit path-tracking behaviors,” in 2018 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2018, pp. 79–84.
  • [27] X. Huang, S. G. McGill, B. C. Williams, L. Fletcher, and G. Rosman, “Uncertainty-aware driver trajectory prediction at urban intersections,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 9718–9724.
  • [28] S. Su, C. Peng, J. Shi, and C. Choi, “Potential field: Interpretable and unified representation for trajectory prediction,” arXiv preprint arXiv:1911.07414, 2019.
  • [29] B. Ivanovic and M. Pavone, “The trajectron: Probabilistic multi-agent trajectory modeling with dynamic spatiotemporal graphs,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2375–2384.
  • [30] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “Precog: Prediction conditioned on goals in visual multi-agent settings,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2821–2830.
  • [31] J. Li, F. Yang, M. Tomizuka, and C. Choi, “Evolvegraph: Multi-agent trajectory prediction with dynamic relational reasoning,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [32] P. Xue, J. Liu, S. Chen, Z. Zhuoli, Y. Huo, and N. Zheng, “Crossing-road pedestrian trajectory prediction via encoder-decoder lstm,” 2019, pp. 2027–2033.
  • [33] A. Sadeghian, V. Kosaraju, A. Sadeghian, N. Hirose, H. Rezatofighi, and S. Savarese, “Sophie: An attentive gan for predicting paths compliant to social and physical constraints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1349–1358.
  • [34] P. Zhang, W. Ouyang, P. Zhang, J. Xue, and N. Zheng, “Sr-lstm: State refinement for lstm towards pedestrian trajectory prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 12085–12094.
  • [35] C. Choi, J. H. Choi, J. Li, and S. Malla, “Shared cross-modal trajectory prediction for autonomous driving,” arXiv preprint arXiv:2011.08436, 2020.
  • [36] S. H. Park, G. Lee, J. Seo, M. Bhat, M. Kang, J. Francis, A. Jadhav, P. P. Liang, and L.-P. Morency, “Diverse and admissible trajectory forecasting through multimodal context understanding,” in European Conference on Computer Vision.   Springer, 2020, pp. 282–298.
  • [37] N. Rhinehart, K. M. Kitani, and P. Vernaza, “R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 772–788.
  • [38] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
  • [39] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” arXiv preprint arXiv:1703.03130, 2017.
  • [40] T. Kipf, E. Fetaya, K.-C. Wang, M. Welling, and R. Zemel, “Neural relational inference for interacting systems,” in International Conference on Machine Learning, 2018, pp. 2688–2697.
  • [41] V. Zambaldi, D. Raposo, A. Santoro, V. Bapst, Y. Li, I. Babuschkin, K. Tuyls, D. Reichert, T. Lillicrap, E. Lockhart et al., “Relational deep reinforcement learning,” arXiv preprint arXiv:1806.01830, 2018.
  • [42] C. Choi and B. Dariush, “Looking to relations for future trajectory forecast,” arXiv preprint arXiv:1905.08855, 2019.
  • [43] X. Ma, J. Li, M. J. Kochenderfer, D. Isele, and K. Fujimura, “Reinforcement learning for autonomous driving with latent state inference and spatial-temporal relationships,” arXiv preprint arXiv:2011.04251, 2020.
  • [44] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [45] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [46] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein auto-encoders,” arXiv preprint arXiv:1711.01558, 2017.
  • [47] S. Zhao, J. Song, and S. Ermon, “Infovae: Information maximizing variational autoencoders,” arXiv preprint arXiv:1706.02262, 2017.
  • [48] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., “Relational inductive biases, deep learning, and graph networks,” arXiv preprint arXiv:1806.01261, 2018.
  • [49] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” arXiv preprint arXiv:1710.10903, 2017.
  • [50] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent Vehicles Symposium (IV).   IEEE, 2015, pp. 1094–1099.
  • [51] D. Simon, “Kalman filtering with state constraints: a survey of linear and nonlinear algorithms,” IET Control Theory & Applications, vol. 4, no. 8, pp. 1303–1318, 2010.
  • [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [53] S. Pellegrini, A. Ess, and L. Van Gool, “Improving data association by joint modeling of pedestrian trajectories and groupings,” in European conference on computer vision.   Springer, 2010, pp. 452–465.
  • [54] L. Leal-Taixé, M. Fenzi, A. Kuznetsova, B. Rosenhahn, and S. Savarese, “Learning an image-based motion context for multiple people tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3542–3549.
  • [55] A. Robicquet, A. Sadeghian, A. Alahi, and S. Savarese, “Learning social etiquette: Human trajectory understanding in crowded scenes,” in ECCV.   Springer, 2016, pp. 549–565.
  • [56] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, H. Rezatofighi, and S. Savarese, “Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks,” in Advances in Neural Information Processing Systems, 2019, pp. 137–146.
  • [57] J. Li, H. Ma, and M. Tomizuka, “Interaction-aware multi-agent tracking and probabilistic behavior prediction via adversarial learning,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6658–6664.
  • [58] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deep learning on spatio-temporal graphs,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5308–5317.
  • [59] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.