¹¹institutetext: Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA ¹¹email: {yiqizhon,uneumann}@usc.edu ²²institutetext: Cooperative Medianet Innovation Center (CMIC), Shanghai Jiao Tong University, Shanghai, China
²²email: {0107nzy,sihengc}@sjtu.edu.cn

Aware of the History: Trajectory Forecasting with the Local Behavior Data

Yiqi Zhong 1University of Southern California, Los Angeles, CA 90089, USA 1{yiqizhon,uneumann}@usc.edu 0000-0002-0928-8018 Zhenyang Ni 2Shanghai Jiao Tong University, Shanghai, China
2{0107nzy,sihengc}@sjtu.edu.cn 0000-0001-7134-620X Siheng Chen 2Shanghai Jiao Tong University, Shanghai, China
2{0107nzy,sihengc}@sjtu.edu.cn ¹¹footnotemark: 1¹¹footnotemark: 1 0000-0001-6199-529X Ulrich Neumann 1University of Southern California, Los Angeles, CA 90089, USA 1{yiqizhon,uneumann}@usc.edu Corresponding authoursCorresponding authours 0000-0001-8977-7112 1University of Southern California, Los Angeles, CA 90089, USA 1{yiqizhon,uneumann}@usc.edu 0000-0002-0928-8018 2Shanghai Jiao Tong University, Shanghai, China
2{0107nzy,sihengc}@sjtu.edu.cn 0000-0001-7134-620X 2Shanghai Jiao Tong University, Shanghai, China
2{0107nzy,sihengc}@sjtu.edu.cn 0000-0001-6199-529X 1University of Southern California, Los Angeles, CA 90089, USA 1{yiqizhon,uneumann}@usc.edu 0000-0001-8977-7112

Abstract

The historical trajectories previously passing through a location may help infer the future trajectory of an agent currently at this location. Despite great improvements in trajectory forecasting with the guidance of high-definition maps, only a few works have explored such local historical information. In this work, we re-introduce this information as a new type of input data for trajectory forecasting systems: the local behavior data, which we conceptualize as a collection of location-specific historical trajectories. Local behavior data helps the systems emphasize the prediction locality and better understand the impact of static map objects on moving agents. We propose a novel local-behavior-aware (LBA) prediction framework that improves forecasting accuracy by fusing information from observed trajectories, HD maps, and local behavior data. Also, where such historical data is insufficient or unavailable, we employ a local-behavior-free (LBF) prediction framework, which adopts a knowledge-distillation-based architecture to infer the impact of missing data. Extensive experiments demonstrate that upgrading existing methods with these two frameworks significantly improves their performances. Especially, the LBA framework boosts the SOTA methods’ performance on the nuScenes dataset by at least 14% for the $K=1$ metrics. ¹¹1Code is at https://github.com/Kay1794/LocalBehavior-based-trajectory-prediction

Keywords:

Trajectory forecasting, Historical data, Knowledge-distillation

1 Introduction

Refer to caption — Figure 1: A) Compared to previous works that mostly rely on HD maps and agents’ past trajectories, we additionally input the local behavior data to the prediction framework. B) For agents to be predicted in the scene, we follow these steps to retrieve their local behavior data for the framework input composition.

Trajectory forecasting aims to predict an agent’s future trajectory based on its past trajectory and surrounding scene information. This task is essential in a variety of applications, including self-driving cars [27], surveillance systems [8], robotics [2], and human behavior analysis [36]. Prior prediction systems primarily use deep-learning-based models (e.g., LSTM, temporal convolutions) to exploit limited information such as past trajectories [1, 21, 30]. Recent efforts also reveal that forecasting ability will improve as more scene information is introduced into the input. One type of scene information, for example, is the past trajectories of a target agent’s neighboring agents. To date, many graph-neural-network-based methods have explored the potential of agents’ motion features and interactive motion relations to improve predictions [20, 35, 42]. Recently, high-definition (HD) maps are incorporated as an additional type of scene information [9, 11, 13, 23, 41] to provide geometric priors.

Besides the widely used HD maps and agents’ past trajectories, we propose a novel Local-Behavior-Aware (LBA) prediction framework that takes a new type of scene information as the input, which we term as local behavior data. The local behavior data is defined as a collection of historical trajectories at an agent’s current location. Fig. 1 (A) shows the three components of the LBA framework input. Taking local behavior data as the input brings two benefits to the task. First, the data provides location-specific motion patterns, which helps the model effectively narrow down the search space for future trajectories. Most of the existing prediction models solely rely on the features learned from the static HD map to infer such information [15, 10, 13]. In comparison, taking local behavior data as the input immediately equips the model with this information, making the model more tractable and robust. Second, local behavior data provides complementary information to augment static maps into dynamic behavioral maps. The map prior in the current literature is limited to static geometric information. The rich dynamic information brought by this new input would help the model better understand the impact of map objects on moving agents.

Many car companies and navigation apps are collecting such local behavior data. Yet, sometimes this data is insufficient or is yet to be gathered (e.g. when a self-driving car explores new areas). Therefore, we further propose a Local-Behavior-Free (LBF) prediction framework that only takes the current agents’ observed trajectories and HD maps as the input, when the local behavior data is unavailable. Inspired by recent development in knowledge distillation, we use our pre-trained LBA prediction framework as the teacher network during the training phase. This teacher network guides the LBF student network in inferring the features of the absent local behavior data. The intuition behind this design is that a traffic agent’s movement at a particular location is confined to a limited number of possibilities. The teacher network essentially provides the ground truth of the movement pattern, making it plausible for the student network to learn the inference of the pattern given the current scene information.

LBA and LBF frameworks both have strong generalizability to be adopted by a wide range of existing trajectory forecasting methods. In Sec. 4.2 and Sec. 5.2, we showcase the implementation methodology on how to upgrade existing systems to the LBA/LBF framework respectively. We then implement and validate the two frameworks based on three state-of-the-art trajectory forecasting methods, P2T [9], LaneGCN [23], and DenseTNT [15].

In summary, this work has three major contributions to the literature:

•

We propose a Local-Behavior-Aware prediction framework. It enables most of the existing methods to incorporate the local behavior data, a new type of system input which contains the local historical information.
•

We further introduce a Local-Behavior-Free prediction framework, which adopts a knowledge-distillation-based architecture to infer local behavioral information when it is temporarily unavailable.
•

We conduct extensive experiments on published benchmarks (Argoverse [6], nuScenes [3]), and validate that upgrading the SOTA methods to LBA/LBF frameworks consistently improves their performances by significant margins on various metrics. Especially, the LBA framework improves the SOTA methods on the nuScenes dataset by at least 14% on the K=1 metrics.

2 Related Work

2.1 Historical Behaviors in Trajectory Forecasting

Historical behaviors are very helpful to trajectory forecasting since they reveal the motion pattern of agents. Several previous works have made significant progress in this direction by adopting memory-based designs. MANTRA [28] uses the observed past trajectories and the map as the key to query the possible hidden features of future trajectories. Similarly, Zhao et al. [45] build an expert repository to retrieve the possible destinations by matching the current trajectory with the ones in the repository. MemoNet [40] also considers memorizing the destinations of each agent and first applies the memory mechanism to multi-agent trajectory forecasting. Compared to those memory-based methods, our work: i) regards the historical behaviors as system inputs to benefit the task from the perspective of enriching scene information; ii) directly uses geometric coordinates to query related historical information which emphasizes the data locality and is more interpretable and robust.

2.2 Scene Representation in Trajectory Forecasting

To use a new type of scene information in systems requires fusing it with existing scene information sources. After reviewing how scene information is encoded in previous methods, we see two main types of scene representations: 1) rasterized-image-based representations [5, 20], 2) graph-based representations [11, 12, 23, 39]. Rasterized-image-based representations render static HD maps and motion data into a birds’ eye view (BEV) image, using various colors to represent different objects. Systems with this scene representation tend to use standard convolutional neural networks (CNNs) backbones to extract scene features [5, 9, 13, 20, 31]. These methods transform scenes into image coordinates systems.

Graph-based scene representations become popular with the recent development of graph learning [34] and attention mechanism [37]. These methods build a graph that can be either directional or non-directional, and use techniques such as graph neural network (GNN) [34] or attention-based operations [37] to enable the interaction among map objects and agents. The nodes are the semantic map objects, e.g., lane segments and agents. The edges are defined by heuristics, which can be the spatial distances between the two nodes or the semantic labels of the two nodes. Systems using graph-based scene representations [11, 12, 15, 23, 44] independently encode each map object to make the graph more informative. Graph-based scene representations have an outstanding information fusion capability and are substantially explored recently.

Other methods that do not strictly fall into the above two categories can share some properties with one or both of them. For example, TPCN [41] uses point cloud representations for the scene and does not manually specify the interaction structures. Yet, by using PointNet++ [32] to aggregate each point’s neighborhood information, TPCN still technically defines local complete subgraphs for the scene where each point is a node connected with its neighbors.

To show the generalizability of the proposed frameworks, in this work we introduce an implementation methodology for upgrading forecasting systems that use either rasterized-image-based or graph-based scene representations to the LBA and LBF frameworks in Sec 4.2 and 5.2 respectively.

2.3 Knowledge Distillation

Our LBF framework is inspired by the recent study of knowledge distillation (KD). Knowledge distillation is a technique that compresses a larger teacher network to a smaller student network by urging the student to mimic the teacher at the intermediate feature or output level [19]. KD is widely used for various tasks, including object detection [7, 16], semantic segmentation [18, 25] and tracking [26]. Its usage is still being explored. For example, researchers start to use KD for collaborative perception tasks [22]. In our work, the proposed LBF framework uses KD in the trajectory forecasting task. Compared to previous works that pay more attention to model compression, we seek to compress the volume of input data. We use the offline framework as the teacher network, which takes each agent’s local behavior data as the input along with HD maps and the agents’ observed trajectories, while the student network (i.e., the online framework) only uses the later two data modalities as the input without requiring local behavior data. Our experiments show that KD-based information compression significantly boosts the performance of trajectory forecasting.

3 Formulation of Local Behavior

In trajectory forecasting, the observed trajectories of the agents previously passing through a location may help infer the future trajectory of an agent currently at the location. In this work, we collect such historical information and reformulate it to make it become one of the inputs for the trajectory forecasting system. We name this new type of scene information as the local behavior data. In this section, we introduce its formulation and the methodology about how to retrieve such data from existing datasets.

Consider a trajectory forecasting dataset with $S$ scenes and the $p$ th scene has $K_{p}$ agents. The observed trajectory and ground truth future trajectory of agent $i$ in scene $p$ are denoted respectively as $\mathbf{X}^{(i,p)}$ and $\mathbf{Y}^{(i,p)}$ , where $\mathbf{X}^{(i,p)}\!\in\!\mathbb{R}^{\mathbf{T}^{-}\times 2}$ and $\mathbf{Y}^{(i,p)}\!\in\!\mathbb{R}^{\mathbf{T}^{+}\times 2}$ . Each $\mathbf{X}^{(i,p)}$ or $\mathbf{Y}^{(i,p)}$ consists of two-dimensional coordinates at $\mathbf{T}^{\!-}$ or $\mathbf{T}^{\!+}$ timestamps. Note that the coordinate system used for the trajectories is the global coordinate system aligned with the global geometric map. In this work, we specifically denote two special items in $\mathbf{X}^{(i,p)}$ . We use $\mathbf{X}_{1}^{(i,p)}\!\in\!\mathbb{R}^{2}$ to represent the location of agent $i$ in scene $s$ at the first timestamp, i.e., the first observed location of the agent. Accordingly, we use $\mathbf{X}_{\mathbf{T}^{-}}^{(i,p)}\!\in\!\mathbb{R}^{2}$ to denote the agent location at timestamp $\mathbf{T}^{-}$ , i.e., its current location. By gathering all the observed trajectories in this dataset, we build a behavior database $\mathcal{D}_{\mathcal{B}}=\{\mathbf{X}^{(i,p)},p\in\{1,2,\cdots,S\},i\in\{1,2,\cdots,K_{p}\}\}_{i,p}$ . We can query the local behavior from $\mathcal{D}_{\mathcal{B}}$ ; namely, the local behavior data of agent $i$ in scene $p$ is

\mathcal{B}^{(i,p)}_{\epsilon}=\{\mathbf{X}^{(j,q)}|\left\|\mathbf{X}_{\mathbf{T}^{-}}^{(i,p)}-\mathbf{X}_{1}^{(j,q)}\right\|_{2}<\epsilon,\mathbf{X}^{(j,q)}\in\mathcal{D}_{\mathcal{B}}\},

(1)

where $\epsilon$ is an adjustable hyper-parameter defining the radius of the neighboring area of a location. The size of $\mathcal{B}^{(i,p)}_{\epsilon}$ refers to the number of observed trajectories in $\mathcal{B}^{(i,p)}_{\epsilon}$ . Fig. 1 (B) shows the steps to query local behavior data from $\mathcal{D}_{\mathcal{B}}$ .

4 Local-Behavior-Aware Prediction

In real life, local behavior data has been widely collected by navigation apps and car companies. To use such data to benefit the trajectory forecasting task, we propose a Local-Behavior-Aware (LBA) prediction framework. In this section, we first demonstrate the generic pipeline of the LBA prediction framework and its major components. Then, we introduce the implementation strategy of the framework which especially emphasizes the representation of local behavior data and its corresponding scene encoder design.

4.1 Framework Pipeline

Like a typical prediction framework pipeline adopted by previous works [11, 20, 31] (see Figure 2 (a)), our LBA framework follows an encoder-decoder structure (see Figure 2 (b)). The encoder $\Theta$ extracts features from multiple scene information sources. Then, the subsequent decoder $\Phi$ generates the predicted future trajectories $\widehat{\mathcal{Y}}$ based on the scene features.

The main difference between LBA and previous frameworks is the design of the encoder $\Theta$ . Our scene encoder involves two steps: 1) scene modeling, which assembles data from all three sources to generate a comprehensive representation of the current scene, and 2) scene feature extraction, which extracts the high-dimensional features from the generated scene representation. The way to represent a scene would largely determine the inner structure of a scene encoder.

In the current literature, the two most frequently used scene representations are graph-based and rasterized-image-based representations. To show that most of the existing forecasting systems can be upgraded to fit the proposed LBA prediction framework, we will describe the implementation strategies for the graph-based and rasterized-image-based systems.

4.2 Implementation

4.2.1 Graph-based Systems.

The existing graph-based systems represent the whole scene information into a scene graph $G(V,E)$ , where $V$ is the node set that contains both the map object node set $V_{\mathcal{M}}$ and the agent node sets $V_{\mathcal{X}}$ , and $E$ is the edge set that reflects the internal interactions between nodes. For each map object node, its associated node attributes include the geometric coordinates, reflecting the static physical location of the map object. For each agent node, its associated node attributes include the two-dimensional coordinates across various time-stamps, reflecting the movement information of the agent. The characteristics of the graph-based representation are that i) it is compact and effective, leading to an efficient system; ii) it enables effective modeling of the interactions among objects (both map objects and agents) in the scene, which is crucial for understanding complicated dynamic relations in traffic scenarios.

To implement the graph-based LBA prediction system, we emphasize the strategy of incorporating local behavior data into the scene graph and enabling the feature interaction between local behavior data and other nodes.

Representation of Local Behavior Data. Given the $i$ th agent in the $p$ th scene, we can query its specific local behavior data $\mathcal{B}^{(i,p)}_{\epsilon}$ from the behavior database $\mathcal{D}_{\mathcal{B}}$ . For each individual observed trajectory in $\mathcal{B}^{(i,p)}_{\epsilon}$ , we create a local behavior node. This step results in a local behavior node set $V_{\mathcal{B}}$ , which has the same size as the local behavior data.

Scene Encoder. As demonstrated in Fig. 4, the scene encoder of the LBA graph-based systems includes the scene graph initialization, individual node feature extraction, and interactive node feature extraction. To initialize the local-behavior-aware scene graph $G^{\prime}(V^{\prime},E)$ , we add the local behavior node set to the original graph $G(V,E)$ , where the updated node set is $V^{\prime}=V\cup V_{\mathcal{B}}$ and the updated edge set is $E^{\prime}=E\cup\{(v_{m},v_{n})|v_{m}\in V^{\prime},v_{n}\in V_{\mathcal{B}}\}_{m,n}$ . With this graph, the local behavior data will participate in the feature interaction procedure (some methods may further update the edge set based on the node distance [22, 43, 15]). The output of the scene encoder $f_{\mathcal{S}}$ will also be local-behavior-aware. We use three feature encoders ( $\Theta_{\mathcal{M}}$ , $\Theta_{\mathcal{X}}$ , $\Theta_{\mathcal{B}}$ ) to extract the features of map objects $f_{\mathcal{M}}$ , agents’ observed trajectories $f_{\mathcal{X}}$ , and the local behavior data $f_{\mathcal{B}}$ , respectively. Here each scene node will obtain its corresponding node features individually. To capture internal interactions, we use an interaction module $\mathcal{I}$ , which is either GNN-based or attention-based, depending on the design of the original system. The interaction module aggregates information from all three scene components.

The architecture of $\Theta_{\mathcal{M}}$ , $\Theta_{\mathcal{X}}$ and $\mathcal{I}$ can remain unchanged from the original system structure for our implementation. As for $\Theta_{\mathcal{B}}$ , since each independent behavior data is essentially an observed trajectory, we can directly adopt the structure of the trajectory encoder $\Theta_{\mathcal{X}}$ in the system as our behavior encoder structure. We can also use simple encoder structures, such as multi-layer perceptrons (MLPs), to keep the system light-weight.

4.2.2 Rasterized-image-based Systems.

Rasterized-image-based systems represent the whole scene information as a rasterized scene image (see an example in Fig. 4). The rasterized scene image reflects the HD map objects (junctions, lanes) and the agent trajectory information as BEV images with various colors according to their semantic labels. This representation method essentially transforms a global coordinate system into an image coordinate system, where each location on the map can be represented by a pixel coordinate. The main characteristic of rasterization-based representations is that they can leverage established CNN backbones, such as ResNet [17], to extract image features.

To implement the rasterized-image-based LBA prediction system, we need to render local behavior data to an image that has the same coordinate system with the original scene image, which ensures consistency and compatibility.

Representation of Local Behavior Data. We seek to render local behavior data into a behavior probability map $\mathcal{P}_{\mathcal{B}}$ . It is an image whose pixel value reflects an agent’s moving probability from the current pixel to another pixel in the image. For the $i$ th agent at the $p$ th scene, $\mathcal{P}^{(i,p)}_{\mathcal{B}}$ is formulated as a single channel image of size $(H,W,1)$ that shares the same coordinate system with the rasterized scene image, where $W,H$ are the width and the height of $\mathcal{P}_{\mathcal{B}}$ . To generate such a behavior probability map, we initialize each pixel value in the image $\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y)$ to be 0. We then enumerate each trajectory $f_{\mathcal{X}}\in\mathcal{B}_{\epsilon}^{(i,p)}$ and add the corresponding information into $\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y)$ by adding 1 to the pixel value once each trajectory covers that pixel. In the end, we normalize $\mathcal{P}^{(i,p)}_{\mathcal{B}}$ by dividing every pixel by the maximum value of the pixels, specifically

\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y)=\frac{\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y)}{\max(\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y))},

(2)

where $0\leq x<W,0\leq y<H$ . Fig. 4 (b) illustrates the local behavior data of the agent represented as a red rectangle in Fig. 4 (a). Since the configuration of the rasterized scene image and the behavior probability map are the same, this probability map indicates how likely, according to local behavior data, the agent at the current pixel will pass a certain pixel on the scene image.

Scene Encoder. As shown in Fig. 4, the scene encoder of the rasterized image-based LBA systems includes image rendering, image feature extraction, and concatenation-based aggregation. To achieve scene modeling, we render both the rasterized scene image and the behavior probability map. To comprehensively extract scene features, we use two separate encoders, $\Theta_{\mathcal{S}}$ , which extracts features from the rasterized scene image $\mathcal{S}$ , and $\Theta_{\mathcal{B}}$ , which extracts features from the behavior probability map $\mathcal{P}_{\mathcal{B}}$ . The architecture of $\Theta_{\mathcal{S}}$ can remain identical with the one in the original systems. To implement $\Theta_{\mathcal{B}}$ , we can adopt any established images’ feature extraction networks, such as ResNet [17]. Afterwards, the output of the two encoders will be concatenated channel-wise to build the output of the scene encoder, which is now local-behavior-aware.

5 Local-Behavior-Free Prediction

As mentioned in Sec 1, there will be scenarios where the local behavior data is yet to be gathered or insufficient. To handle such situations, we propose a Local-Behavior-Free (LBF) prediction framework based on knowledge distillation.

The training of the LBF prediction framework follows a teacher-student structure. The teacher network is implemented using the LBA framework introduced in Sec 4, which uses local behavior data as the third input; while the student network only takes static HD map information and the agents’ observed trajectories as the input. The knowledge-distillation-based training strategy enhances the training of the LBF prediction framework by urging the student network to imitate the teacher network in the intermediate feature levels when processing the same data samples.

The intuition behind this design is that given a specific location, the number of possible movement patterns of an traffic agent is limited. With the guidance from the teacher network that is trained on local behavior data, it is feasible for the student network to learn the reasoning of the movement pattern based on the static map objects and agents’ observed trajectories.

5.1 Framework Pipeline

The LBF framework includes a teacher network and a student network. See Fig. 2 (b), (c). The teacher network follows the LBA framework. For the student network, we remove the input stream of local behavior data in the LBA framework and add a behavior estimator $\Omega_{\mathcal{B}}$ to the pipeline. $\Omega_{\mathcal{B}}$ takes the intermediate scene features $f^{\prime}_{\mathcal{S}}$ from the scene encoder $\Theta$ as the input, and outputs the estimated local behavior features $\widehat{f}_{\mathcal{B}}$ . Note that $f^{\prime}_{\mathcal{S}}$ is not involved with local behavioral information. Next, we let $\widehat{f}_{\mathcal{B}}$ join the scene encoder along with $f^{\prime}_{\mathcal{S}}$ for the final feature generation. The scene encoder outputs the updated $f_{\mathcal{S}}$ , which contains the estimated local behavioral information, to the decoder $\Phi$ . The decoder then processes $f_{\mathcal{S}}$ and generates the predicted future trajectories $\widehat{\mathcal{Y}}$ . The core step of the LBF student network is to link the behavior estimator and the scene encoder.

5.2 Implementation

Like Sec 4.2, this section introduces the implementation of the student networks in the LBF prediction framework based on two scene representations.

5.2.1 Graph-based Systems.

In the teacher network (the LBA prediction system), the local behavior data goes through an encoder to obtain behavior features $f_{\mathcal{B}}$ . In the student network (the LBF version), we use a behavior estimator $\Omega_{\mathcal{B}}$ to estimate the behavior features even when the original local behavior data is not available. We implement the behavior estimator $\Omega_{\mathcal{B}}$ by a graph neural network. Its input includes the features of the map objects $f_{\mathcal{M}}$ and the features of the agents’ observed trajectories $f_{\mathcal{X}}$ , which include all scene information in hand. It outputs the estimated behavior features $\widehat{f}_{\mathcal{B}}$ . After $f_{\mathcal{M}}$ and $f_{\mathcal{X}}$ are interacted in the interaction module, we aggregate its output $\widehat{f}^{\prime}_{\mathcal{S}}$ and the estimated behavior features $\widehat{f}_{\mathcal{B}}$ in a fusion module $\Psi$ to form the final scene feature $f_{\mathcal{S}}$ . We implement $\Psi$ by an attention-based network. The pipeline of this procedure is shown in Fig. 6. During training, the estimated behavior features $\widehat{f}_{\mathcal{B}}$ in the student network can be supervised by the behavior features ${f_{\mathcal{B}}}$ in the teacher network through a knowledge-distillation loss.

5.2.2 Rasterized-image-based Systems.

In the teacher network (the LBA prediction system), the behavior probability map leads to behavior features $f_{\mathcal{B}}$ . In the student network (the LBF version), we use a behavior estimator $\Omega_{\mathcal{B}}$ to estimate such behavior features even the behavior probability map is not available. We implement the behavior estimator $\Omega_{\mathcal{B}}$ by a CNN-based network. Its input is the output of the scene encoder $\Theta_{\mathcal{S}}$ , which is the intermediate scene feature $f^{\prime}_{\mathcal{S}}$ . Its output is the estimated behavior features $\widehat{f}_{\mathcal{B}}$ . We then concatenate $\widehat{f}_{\mathcal{B}}$ and $f^{\prime}_{\mathcal{S}}$ to form the final scene feature $f_{\mathcal{S}}$ . The pipeline of this procedure is shown in Fig. 6. During training, the estimated behavior features $\widehat{f}_{\mathcal{B}}$ in the student network can be supervised by the behavior features ${f_{\mathcal{B}}}$ in the teacher network through a knowledge-distillation loss.

Note that both graph-based and rasterized-image-based LBF systems follow the same design rationale: estimating local behavior data based on the other known scene information. The difference between the two types of systems is in the fusion step. In the graph-based system, we use a trainable fusion module; and in the rasterized-image-based system, since the scene image and the behavior probability map share the same coordinate system, we simply concatenate them.

5.3 Training with Knowledge Distillation

To train such a teacher-student framework, we pre-train an LBA prediction network as the teacher network, and then use the intermediate features from the teacher network as the supervision signals to guide the training of the student network. We consider features from the teacher network that are leveraged to guide the student network training as $\mathbf{F}_{t}$ and the corresponding features from the student network as $\mathbf{F}_{s}$ . Note that $\mathbf{F}_{s}$ may include but are not limited to the reconstructed local behavior features and the the scene encoder’s outputs .

The training loss thereafter contains the trajectory forecasting loss $\mathcal{L}_{pred}$ , which is defined by the original system and identical with the training loss of the teacher network as well as a KD loss. In this work, we use $\ell_{2}$ loss as the KD loss and set $\lambda_{kd}$ as the adjustable weight of the KD loss. The overall loss function aggregates the loss over all the samples; that is,

\mathcal{L}=\Sigma_{i}{\left(\lambda_{kd}\left\|\mathbf{F}^{i}_{s}-\widehat{\mathbf{F}}^{i}_{t}\right\|_{2}+\mathcal{L}^{i}_{pred}\right)}.

(3)

6 Experiments

6.1 Experimental Setup

Datasets. We consider two widely used trajectory forecasting benchmarks, nuScenes [3] and Argoverse [6]. nuScenes collected 1000 scenes. Each scene is 20s long with a sampling rate of 2 Hz. The instances for trajectory forecasting tasks were split into train, validation and test splits, which respectively entail 32186, 8560 and 9041 instances. Each agent in the instances had 2s’ observed trajectories and the ground truth of 6s’ future trajectories. Argoverse collected over 30K scenarios. Each scenario is sampled at 10 Hz. The train/ val/ test splits had 205942/ 39472/ 78143 sequences respectively. Each agent in the scene had 2s’ observed trajectories and the ground-truth of 3s’ future trajectories.

Database Construction. To evaluate local-behavior-aware (LBA) framework without ground-truth data leakage, we only use the 2s’ observed trajectories of all agents in each data split to build the behavior database $\mathcal{D_{B}}$ for each corresponding phase (e.g. testing phase only uses the test split). For the local-behavior-free (LBF) framework, we only use the observed trajectories from the training split during the training phase. Please see appendix for detailed information of behavior database construction.

Metrics. We adopt three widely used metrics for quantitative evaluation: minimum average displacement error among $K$ predictions (minADE_K), minimum final displacement error among $K$ predictions (minFDE_K) and the missing rate (MR). minADE evaluates the minimum average prediction error of all timestamps among the predicted $K$ trajectories; minFDE is the minimum error of the final position among the $K$ predictions; MR_K is the ratio of agents whose minFDE_K is larger than 2.0 meters.

6.2 Implementation Details

We pick three SOTA trajectory forecasting methods: LaneGCN [23], DenseTNT [15] and P2T [9], and adapt them to our behavior-aware framework. We use their official code packages as the implementation start code. The baseline performances in Table 1 and 2 are reproduced by ourselves for variables controlling.

LaneGCN and DenseTNT are graph-based methods. We use stacked linear residual blocks mentioned in [23] as our behavior encoder $E_{\mathcal{B}}$ for both methods. In the LBF prediction, we use attention based architecture to implement the behavior estimator $\Omega_{\mathcal{B}}$ as well as the auxiliary fusion module $\Psi$ .

P2T is a rasterization-based method. We use behavioral probability map to represent the local behavior data, and ResNet [17] as the encoder backbone to extract features from the behavioral probability map. For the LBF prediction, we use three 1D Convolutional Layers to implement the behavior estimator $\Omega_{\mathcal{B}}$ .

To train the network, we adopt the hyper-parameter configuration from each method’s official instructions. More implementation details are in the appendix.

6.3 Evaluation

We evaluate LaneGCN and P2T on nuScenes (see the quantitative results in Table 1). For Argoverse, we evaluate LaneGCN and DenseTNT (see Table 2). We also show the qualitative evaluation in Fig. 7.

In all experiments, when upgraded to either the LBA or LBF framework, the baseline methods significantly improve in performance. Unconventionally, our proposed frameworks bring consistent improvements to various models (P2T, LaneGCN, DenseTNT) across all metrics. This is a substantive progress compared to previous methods, because SOTA methods [12, 14, 41, 43] usually only show improvements in one or some metrics but not all metrics. Furthermore, on both datasets, the gains brought by local behavior data are consistently larger on $K=1$ metrics than on other metrics. This may result from the raise of average performance of the worst prediction among K predictions, as local behavior data efficiently narrows down the solution search space (explained in Sec 1).

Table 1: Evaluation results on nuScenes [3] dataset test split.

Method	Framework	minADE₁	minFDE₁	minADE₁₀	minFDE₁₀
P2T [9]	Baseline	4.60	10.80	1.17	2.15
	LBA	3.78 $\downarrow 18$ %	9.25 $\downarrow 14$ %	1.08 $\downarrow 8$ %	2.04 $\downarrow 5$ %
	LBF	4.04 $\downarrow 12\%$	9.54 $\downarrow 12\%$	1.15 $\downarrow 2\%$	2.11 $\downarrow 2\%$
LaneGCN [23]	Baseline	6.17	12.34	1.82	2.98
	LBA	2.72 $\downarrow 56$ %	6.78 $\downarrow 45$ %	0.95 $\downarrow 48$ %	1.85 $\downarrow 38$ %
	LBF	5.58 $\downarrow 10$ %	11.47 $\downarrow 7$ %	1.67 $\downarrow 8$ %	2.66 $\downarrow 11$ %

Interestingly, the prediction performance of the LBF framework occasionally surpasses that of the LBA framework, even though the LBF framework lacks local behavior data input. Table 2 shows an example of this case, with the LaneGCN results on Argoverse. Our educated conjecture is that sometimes the LBF framework enjoys more data representativeness thanks to the reconstructed behavioral features; whereas, in the meantime, the LBA framework may be using pre-gathered local behavior data of a small size in its testing phase.

Besides the comparison with the baselines, we also show the comparison with the other published works on the benchmarks; see Tables 4 and 4.

Table 2: Evaluation results on Argoverse [6] dataset test split.

Method	Framework	minADE₁	minFDE₁	minADE₆	minFDE₆
LaneGCN [23]	Baseline	1.74	3.89	0.87	1.37
	LBA	1.64 $\downarrow 6\%$	3.61 $\downarrow 7\%$	0.84 $\downarrow 3\%$	1.30 $\downarrow 5\%$
	LBF	1.61 $\downarrow 7\%$	3.54 $\downarrow 9\%$	0.85 $\downarrow 2\%$	1.31 $\downarrow 4\%$
DenseTNT [15]	Baseline	1.70	3.72	0.90	1.33
	LBA	1.65 $\downarrow 3\%$	3.57 $\downarrow 4\%$	0.88 $\downarrow 2\%$	1.26 $\downarrow 5\%$
	LBF	1.67 $\downarrow 2$ %	3.63 $\downarrow 2$ %	0.89 $\downarrow 2$ %	1.29 $\downarrow 3$ %

Table 3: nuScenes benchmark comparison.

Method	minADE₁₀	MR₅	MR₁₀	minFDE₁
P2T[9]	1.16	64%	46%	10.50
MHA-JAM[29]	1.24	59%	46%	8.57
SGNet[38]	1.40	67%	52%	9.25
Trajectron++[33]	1.51	70%	57%	9.52
M-SCOUT[4]	1.92	78%	78%	9.29
GOHOME[12]	1.15	57%	47%	6.99
P2T-LBA	1.08	57%	41%	9.25
P2T-LBF	1.15	61%	46%	9.37
LaneGCN-LBA	0.95	49%	36%	6.78
LaneGCN-LBF	1.67	75%	68%	11.47

Table 4: Argoverse benchmark comparison.

Method	minADE₁	minADE₆	Brier-FDE₆
LaneRCNN[43]	1.69	0.90	2.15
DenseTNT[15]	1.68	0.88	1.98
GOHOME[12]	1.69	0.94	1.98
MMTransformer[24]	1.77	0.84	2.03
LaneGCN[23]	1.71	0.87	2.06
LaneGCN-LBA	1.64	0.84	2.00
LaneGCN-LBF	1.61	0.85	2.00
DenseTNT-LBA	1.65	0.88	1.93
DenseTNT-LBF	1.67	0.89	1.96

Figure 7: LaneGCN qualitative results on
Argoverse val split.

Table 5: minFDE₁ of LaneGCN on Argoverse val split among agents having different sizes of local behavior data.

Size	% of agents	Baseline	LBA
$\left[0,4\right)$	24%	3.33	3.36 $\uparrow 1\%$
$\left[4,8\right)$	14%	3.25	3.19 $\downarrow 2\%$
$\left[8,12\right)$	10%	3.22	3.13 $\downarrow 3\%$
$\left[12,16\right)$	9%	3.00	2.87 $\downarrow 4\%$
$\left[16,\infty\right)$	43%	2.66	2.60 $\downarrow 2\%$

6.4 Ablation Study

We conduct the ablation study on the hyper-parameters regarding local behavior data generation and our prediction frameworks.

Local Behavior Data Size. Regarding the relationship between prediction performance and the local behavior data size of each agent, Table 6.3 shows: when there is few local behavior data, prediction is not as accurate as the baseline; but, even a small size of local behavior data can improve the prediction.

Local Range $\epsilon$ . The adjustable parameter $\epsilon$ defines the local radius (see Sec 3). See Table 7 for the impact of $\epsilon$ on LaneGCN performance with Argoverse test split. When $\epsilon$ increases, the performance of the LBA framework first go up but later go down. It matches our intuition that a large $\epsilon$ will introduce distractions. The LBF framework, however, shows less confusion brought by the pre-trained LBA network, demonstrating stronger robustness to $\epsilon$ .

Knowledge Distillation Parameters. We study the impact of knowledge-distillation-related parameters, i.e., the loss weight $\lambda_{kd}$ and the number of times that intermediate features are involved in the KD loss (denoted as KD times). See the results in Table 7. Detailed information about the choices of intermediate features is in the appendix. We see that firstly, the knowledge distillation structure does help the framework infer the impact of local behavior data. The framework, across all parameter settings in Table 7, outperforms the baseline method (in Table 2, row 1) and the model without KD loss (in Table 7, row 1). Secondly, the comparatively similar results across all the settings show that the LBF framework is relatively robust to the KD hyper-parameters.

Table 6: Impact of the behavioral data’s local range. A larger

\epsilon

brings a larger size of local behavior data, but when

\epsilon

is too large, it can also introduce confusion due to the loss of the data locality.

Framework	$\epsilon$	minADE₁	minFDE₁
LBA	0.5m	1.64	3.61
	1.0m	1.62	3.58
	1.5m	1.71	3.86
LBF	0.5m	1.61	3.54
	1.0m	1.62	3.57
	1.5m	1.63	3.59

Table 7: Impact of the number of times that KD supervision gets applied in the training phase and the weight of the KD loss.

KD times	$\lambda_{kd}$	minADE₁	minFDE₁
N/A	0	1.68	3.72
2	1	1.62	3.57
2	1.5	1.61	3.54
2	2	1.62	3.56
1	1.5	1.64	3.61
3	1.5	1.61	3.56

7 Conclusion

In this work, we re-introduce the local historical trajectories as a new type of data input to the trajectory forecasting task, referred as local behavior data. To adapt to this new data input and fully exploit its value, we propose a behavior-aware framework and a behavior-free framework for trajectory forecasting. The behavior-free framework, especially, adopts a knowledge-distillation architecture to estimate the impact of local behavior data. Extensive experiments on published benchmarks validate that the proposed frameworks significantly improve the performances of SOTA methods on prevalent metrics.

Limitations. Local historical information reveals local motion patterns with high fidelity, but there are always outliers. For use cases of great safety concerns (e.g. autonomous driving), historical data may provide good reference but should not be the only reference. Also, the motion patterns of a certain location can vary over time. To optimize the benefits of the LBA and LBF framework, future research should explore the historical data gathering strategies.

Acknowledgement:

National Natural Science Foundation of China under Grant 62171276, the Science and Technology Commission of Shanghai Municipal under Grant 21511100900, CCF-DiDi GAIA Research Collaboration Plan 202112 and CALT 2021-01.

References

[1] Alahi, A., Goel, K., Ramanathan, V., Robicquet, A., Fei-Fei, L., Savarese, S.: Social lstm: Human trajectory prediction in crowded spaces. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 961–971 (2016)
[2] Bennewitz, M., Burgard, W., Thrun, S.: Learning motion patterns of persons for mobile service robots. In: Proceedings 2002 IEEE International Conference on Robotics and Automation (Cat. No. 02CH37292). vol. 4, pp. 3601–3606. IEEE (2002)
[3] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11621–11631 (2020)
[4] Carrasco, S., Llorca, D.F., Sotelo, M.Á.: Scout: Socially-consistent and understandable graph attention network for trajectory prediction of vehicles and vrus. arXiv preprint arXiv:2102.06361 (2021)
[5] Casas, S., Luo, W., Urtasun, R.: Intentnet: Learning to predict intention from raw sensor data. In: Conference on Robot Learning. pp. 947–956. PMLR (2018)
[6] Chang, M.F., Lambert, J., Sangkloy, P., Singh, J., Bak, S., Hartnett, A., Wang, D., Carr, P., Lucey, S., Ramanan, D., et al.: Argoverse: 3d tracking and forecasting with rich maps. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8748–8757 (2019)
[7] Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object detection models with knowledge distillation. Advances in neural information processing systems 30 (2017)
[8] De Leege, A., van Paassen, M., Mulder, M.: A machine learning approach to trajectory prediction. In: AIAA Guidance, Navigation, and Control (GNC) Conference. p. 4782 (2013)
[9] Deo, N., Trivedi, M.M.: Trajectory forecasts in unknown environments conditioned on grid-based plans. arXiv preprint arXiv:2001.00735 (2020)
[10] Deo, N., Wolff, E., Beijbom, O.: Multimodal trajectory prediction conditioned on lane-graph traversals. In: 5th Annual Conference on Robot Learning (2021)
[11] Gao, J., Sun, C., Zhao, H., Shen, Y., Anguelov, D., Li, C., Schmid, C.: Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11525–11533 (2020)
[12] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., Moutarde, F.: Gohome: Graph-oriented heatmap output forfuture motion estimation. arXiv preprint arXiv:2109.01827 (2021)
[13] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., Moutarde, F.: Home: Heatmap output for future motion estimation. arXiv preprint arXiv:2105.10968 (2021)
[14] Gilles, T., Sabatini, S., Tsishkou, D., Stanciulescu, B., Moutarde, F.: Thomas: Trajectory heatmap output with learned multi-agent sampling. arXiv:2110.06607 (2021)
[15] Gu, J., Sun, C., Zhao, H.: Densetnt: End-to-end trajectory prediction from dense goal sets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15303–15312 (2021)
[16] Hao, Y., Fu, Y., Jiang, Y.G., Tian, Q.: An end-to-end architecture for class-incremental object detection with knowledge distillation. In: 2019 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2019)
[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
[18] He, T., Shen, C., Tian, Z., Gong, D., Sun, C., Yan, Y.: Knowledge adaptation for efficient semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 578–587 (2019)
[19] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
[20] Hu, Y., Chen, S., Zhang, Y., Gu, X.: Collaborative motion prediction via neural motion message passing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6319–6328 (2020)
[21] Kim, B., Kang, C.M., Kim, J., Lee, S.H., Chung, C.C., Choi, J.W.: Probabilistic vehicle trajectory prediction over occupancy grid map via recurrent neural network. In: 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC). pp. 399–404. IEEE (2017)
[22] Li, Y., Ren, S., Wu, P., Chen, S., Feng, C., Zhang, W.: Learning distilled collaboration graph for multi-agent perception. arXiv preprint arXiv:2111.00643 (2021)
[23] Liang, M., Yang, B., Hu, R., Chen, Y., Liao, R., Feng, S., Urtasun, R.: Learning lane graph representations for motion forecasting. In: European Conference on Computer Vision. pp. 541–556. Springer (2020)
[24] Liu, Y., Zhang, J., Fang, L., Jiang, Q., Zhou, B.: Multimodal motion prediction with stacked transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7577–7586 (2021)
[25] Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z., Wang, J.: Structured knowledge distillation for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2604–2613 (2019)
[26] Liu, Y., Dong, X., Lu, X., Khan, F.S., Shen, J., Hoi, S.: Teacher-students knowledge distillation for siamese trackers. arXiv preprint arXiv:1907.10586 (2019)
[27] Luo, Y., Cai, P., Bera, A., Hsu, D., Lee, W.S., Manocha, D.: Porca: Modeling and planning for autonomous driving among many pedestrians. IEEE Robotics and Automation Letters 3(4), 3418–3425 (2018)
[28] Marchetti, F., Becattini, F., Seidenari, L., Bimbo, A.D.: Mantra: Memory augmented networks for multiple trajectory prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7143–7152 (2020)
[29] Messaoud, K., Deo, N., Trivedi, M.M., Nashashibi, F.: Trajectory prediction for autonomous driving based on multi-head attention with joint agent-map representation. arXiv preprint arXiv:2005.02545 (2020)
[30] Nikhil, N., Tran Morris, B.: Convolutional neural network for trajectory prediction. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops. pp. 0–0 (2018)
[31] Phan-Minh, T., Grigore, E.C., Boulton, F.A., Beijbom, O., Wolff, E.M.: Covernet: Multimodal behavior prediction using trajectory sets. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14074–14083 (2020)
[32] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv preprint arXiv:1706.02413 (2017)
[33] Salzmann, T., Ivanovic, B., Chakravarty, P., Pavone, M.: Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVIII 16. pp. 683–700. Springer (2020)
[34] Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE transactions on neural networks 20(1), 61–80 (2008)
[35] Sun, C., Karlsson, P., Wu, J., Tenenbaum, J.B., Murphy, K.: Stochastic prediction of multi-agent interactions from partial observations. arXiv preprint arXiv:1902.09641 (2019)
[36] Sun, C., Shrivastava, A., Vondrick, C., Sukthankar, R., Murphy, K., Schmid, C.: Relational action forecasting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 273–283 (2019)
[37] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems. pp. 5998–6008 (2017)
[38] Wang, C., Wang, Y., Xu, M., Crandall, D.J.: Stepwise goal-driven networks for trajectory prediction. arXiv preprint arXiv:2103.14107 (2021)
[39] Xu, C., Li, M., Ni, Z., Zhang, Y., Chen, S.: Groupnet: Multiscale hypergraph neural networks for trajectory prediction with relational reasoning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6498–6507 (2022)
[40] Xu, C., Mao, W., Zhang, W., Chen, S.: Remember intentions: Retrospective-memory-based trajectory prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6488–6497 (2022)
[41] Ye, M., Cao, T., Chen, Q.: Tpcn: Temporal point cloud networks for motion forecasting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11318–11327 (2021)
[42] Yu, C., Ma, X., Ren, J., Zhao, H., Yi, S.: Spatio-temporal graph transformer networks for pedestrian trajectory prediction. In: European Conference on Computer Vision. pp. 507–523. Springer (2020)
[43] Zeng, W., Liang, M., Liao, R., Urtasun, R.: Lanercnn: Distributed representations for graph-centric motion forecasting. arXiv preprint arXiv:2101.06653 (2021)
[44] Zhao, H., Gao, J., Lan, T., Sun, C., Sapp, B., Varadarajan, B., Shen, Y., Shen, Y., Chai, Y., Schmid, C., et al.: Tnt: Target-driven trajectory prediction. arXiv preprint arXiv:2008.08294 (2020)
[45] Zhao, H., Wildes, R.P.: Where are you heading? dynamic trajectory prediction with expert goal examples. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7629–7638 (2021)

Appendix — Aware of the History: Trajectory Forecasting with the Local Behavior Data Yiqi Zhong Zhenyang Ni Siheng ChenUlrich Neumann

Appendix 0.A Dataset Construction

In this work, we use two widely used autonomous driving benchmark: nuScenes[3] and Argoverse[6]. Here we describe how we derive local behavior data from each dataset for the experiments use.

0.A.1 nuScenes

nuScenes collects scene from four different places from Singapore and Boston, USA. The four places are labeled as singapore-onenorh, boston-seaport, singapore-queenstown and singapore-hollandvillage by the dataset. For each data split, we collect all the 2-second observed trajectories in the four places separately to build the behavior database $\mathcal{D}_{\mathcal{B}}$ . Since the sample rate of nuScenes is 2Hz, each 2-second observed trajectory suppose to have two-dimensional coordinates at 5 timestamps. However, in the dataset, there are missing timestamps for some observed trajectories. To make the learning procedure more stable, we only pick the observed trajectories those have no missing data. Meanwhile, we also filter out the static trajectories (whose speed is lower than 2m/s). Table 8 shows the size of the behavior database for each data split in nuScenes dataset.

Table 8: The number of historical observed trajectories contained in the behavior database for each location in nuScenes dataset. We build the database for each split separately to simulate the real-world scenario.

Location Label	train	val	test
singapore-onenorth	92893	54878	50570
boston-seaport	708527	226813	163812
singapore-queenstown	59702	3696	26568
singapore-hollandvillage	92924	N/A	7340

0.A.2 Argoverse

Argoverse collects data from Pittsburgh and Miami. We build the behavior $\mathcal{D}_{\mathcal{B}}$ for each city of each data split using the all 2-second observed trajectories correspondingly. In argoverse, since the sample rate is 10Hz, for each 2-second observed trajectory, there are 20 timestamps geometric position data. Similar to what we do for nuScenes, we only pick the observed trajectory that has no unavailable timestamp and whose average speed is larger than 2m/s. The statistic of the behavior database size is shown in Table 9

Table 9: The number of historical observed trajectories contained in the behavior database for each location in Argoverse dataset.

Location	train	val	test
Miami	1532574	281809	538004
Pittsburg	980827	162470	583311

Appendix 0.B Extensive Discussion

0.B.1 Practicability.

When introducing a new data source to trajectory forecasting, researchers should evaluate its accessibility in real-world applications. Our experiments use a collection of available agents’ observed trajectories in each dataset to build the behavior database $\mathcal{D}_{\mathcal{B}}$ . The datasets used in this work both contain the global coordinates for each recorded trajectory. This property of the datasets enables us to mount the trajectories to the physical location on the global map, which is fundamental to the concept of local behavior data. Judging our framework in real-world applications, the global coordinates of the map and the agents are relatively easy to retrieve through techniques such as the Global Positioning System (GPS). Once the global coordinates are retrieved, gathering local behavior data in the real world becomes feasible.

Furthermore, we argue that in the real-world application, the benefit brought by the local behavior data will be even more significant, since the behavior data gathering procedure in can last longer time and collect more data. Despite the improvement brought by the local behavior data in our experiments on the published datasets, in the Argoverse training split, there are still 5% of agents have no available local behavior data.

0.B.2 Local Behavior Information Visualization

In this paper, we exploit the local behavior data from the perspective of agent current locations. To more clearly visualize the information hidden in the local behavior data, here we show visualizations of the local behavior data for every lane segment on the map, see Figure 8.

In the figure, we focus on the visualization of two types of the information, one is the average speed of the trajectories collected from each lane segment and the other is the ratio of the trajectories that show turning actions (including turning and the lane changing). From the visualization we can see besides the location-specific information that is hidden in the local behavior data, such as the speed limit, there are more interesting information that worth further exploit. Those information includes some implicit rules of human driving behaviors that are hard to be learnt by limited number of observed trajectories, such as the turning intentions are much higher for the agents on the lanes that have intersections ahead even if the lane does not directly lead to a right-turn lane or left-turn lane.

Appendix 0.C Implementation

We pick three SOTA methods for implementation. The implemented works in our paper are all based on existing official code packages. We only modify the scene encoder related modules while remain the other parts exactly the same as the baseline methods, including the decoders and the encoders for HD maps and motion data. For the LBF training, the network parameters are randomly initialized by pytorch, i.e. we did not use LBA teacher network as the initialize ion.

0.C.1 LaneGCN[23]

Implementation details: We upgrade LaneGCN to Local-Behavior-Aware and Local-Behavior-Free framework based on their official code (https://github.com/uber-research/LaneGCN). During the training, on both nuScenes and Argoverse dataset, we use 2 NVIDIA GeForce RTX 3090 GPUs with the batch size of 32. The initial learning rate is 1e-3 and will decay to 1e-4 at epoch 32. The total training epoch number is 36. We replicate the exact training scheduler of the LaneGCN listed in the readme file of their official code package. We use $\epsilon=0.5$ and $\lambda_{kd}=1.5$ as the default setting.

Dataset Preprocessing: For the argoverse dataset, we use their official code to generate the preprocessed data, including the lane graph and the motion data. For the nuScenes dataset, we follow the instruction in their paper[23], generating the lane graph using the lane information from nuScenes dataset. In the nuScenes, besides the lane information, there are also other map objects like sidewalks. Because in the LaneGCN paper, there is no indication for how to process those map objects, we decide to ignore them during our data preprocessing. It may explain that why the baseline performance of the LaneGCN on nuScenes (See Table 1 in the paper) is not as good as it shows on Argoverse. It also explains by the boost brought by the local behavior data is significant even compared to other methods: local behavior data provide extra complementary information to the system.

Network Architecture: We show the architecture of the implementation on LaneGCN based framework in Figure 9.

In the network architecture drawn in the figure, B2A is the auxiliary fusion module for LBF framework. When applied knowledge distillation loss, we use the output of B2A, output of A2A II from the LBF as $\mathbf{F}_{s}$ and the output of B2A and the output of A2A as $\mathbf{F}_{t}$ . In the Table 4 of the paper, to do the abalation study, we pick an intermediate layer inside the PredNet as the third supervised feature for comparison.

Detailed inner structure of behavior encoder, behavior estimator and auxiliary fusion module B2A are in Figure 10

0.C.2 DenseTNT[15]

Implementation Details. In the experiment of DenseTNT, we adopt their official code package from https://github.com/Tsinghua-MARS-Lab/DenseTNT. We use $\epsilon=0.5$ and $\lambda_{kd}=1.5$ as the default setting. We use 8 NVIDIA GeForce RTX 3090 GPUs with a batch size of 64 for training. Different from their paper which claims a two-stage training strategy, in their official code, they train the network in a end-to-end style. We adopt the training strategy in the official code and set the learning rate with an initial value of 0.001 decays to 30% every 5 epochs. The total training epoch number is 30. The hidden size of the feature vectors is set to 128. The head number of our goal set predictor is 12. No data augmentation is used.

Dataset Preprocessing: We directly use their preprocess code to process the Argoverse dataset.

Network Architecture: We show the implemented architecture of DenseTNT in Figure 11. The inner architecture of behavior estimator and the behavior encoder are identical to the ones in the LaneGCN, which is shown in Figure 10. In the training of LBF framework , $\mathbf{F}_{s}$ includes the output of Behavior estimator and the output of Dense goal encoder while $\mathbf{F}_{t}$ contains the target goal features of the Sparse context encoder and the output of the Dense goal encoder correspondingly.

0.C.3 P2T[9]

Implementation Details: The implementation of P2T is based on their published code package https://github.com/nachiket92/P2T. For the baseline result, we directly use their released pre-trained model which is included in their code package. We use one NVIDIA GeForce 1080 Ti GPU for training and the batch size is 32. P2T trains the network in a three-stage style. It first trains a reward network for 25 epochs with the learning rate as 1e-4 and then trains a coarse trajectory predictor for 100 epochs whose learning rate is 1e-3. Afterwards, it will train a finetuned trajectory predictor for 400 epochs with the learning rate of 1e-4. We directly follow the default training strategy stated in the official code package for the network training.

Dataset Preprocssing: We use the provided preprocess code to generate rasterization of map images for each data sample. For the behavior probability maps, we use the equation:

\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y)=\frac{\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y)}{\max(\mathcal{P}^{(i,p)}_{\mathcal{B}}(x,y))},

(4)

which is stated in the paper to generate each probability map.

Network Architecture: P2T uses reinforcement learning to solve the prediction problem. Since we only modify the scene encoder part, in Figure 12, we only shows the detailed modification in the encoder part and skip the description of the reward model and the decoder.

The behavior encoder is one linear layer with activation and the output channel is 16. The behavior estimator is implemented as stacked 3-layer-conv2D structure, the kernel size is 1 and the output feature channel is 16. During the training of LBF framework, we apply KD loss on the output of the behavior estimator as well as the input of the Decoder.

Appendix 0.D Additional Experiments

0.D.1 Ablation study on LBF system design.

The LBF system has three factors: if using i) knowledge distillation (KD), ii) LBA as the teacher net in KD (to provide local behavior features guidance), iii) behavior estimator. Tab 10 studies each factor for LaneGCN on Argoverse, and it shows: i) A surpassed baseline (self-KD is effective); ii) B surpassed A (local behavior features from LBA network are effective), iii) C surpassed B (the behavior estimator is effective).

Method	KD	Behavior Estimator	minADE₁	minFDE₁
baseline	no teacher		1.71 -	3.78 -
A	self-KD		1.66 $\downarrow 3\%$	3.67 $\downarrow 3\%$
B	LBA as teacher		1.65 $\downarrow 3\%$	3.63 $\downarrow 4\%$
C (LBF)	LBA as teacher	✓	1.60 $\downarrow 6\%$	3.54 $\downarrow 6\%$

Table 10: Each of three factors counts validated on Argo test split

0.D.2 Comparison with memory-based method.

Table 11 shows that the proposed LBA/ LBF significantly outperform SOTA memory-based method, MANTRA[28], on Argoverse test split (numbers from its original paper). As mentioned in Sec 2, previous memory-based methods also leverage historical info, but three major differences mark the novelty of our work: i) we directly use historical trajectories as system input to avoid information loss, ii) we explicitly emphasize the spatial locality, iii) based on knowledge distillation, LBF does not need extra input while a memory-based method needs to store a huge memory bank.

Method	minADE ${}_{1}\downarrow$	minFDE ${}_{1}\downarrow$	minADE ${}_{6}\downarrow$	minFDE ${}_{6}\downarrow$
MANTRA	2.36 -	5.31 -	1.22 -	2.30 -
LaneGCN-LBA (ours)	1.62 $\downarrow 30\%$	3.58 $\downarrow 33\%$	0.84 $\downarrow 31\%$	1.30 $\downarrow 43\%$
LaneGCN-LBF (ours)	1.60 $\downarrow 32\%$	3.54 $\downarrow 33\%$	0.85 $\downarrow 30\%$	1.31 $\downarrow 43\%$

Table 11: Proposed LBA and LBF systems outperform MANTRA.

0.D.3 Performance gains along the time

In our paper, we only use the a few second local behavior data to avoid data snooping and for fair comparison. But in practice, by “local,” local behavior data just means the data’s starting position is at an agent’s current location; such data can be a long trajectory reflecting long-term behavior. We want to demonstrate the potential of the local behavior data for long-term prediction in real world application by using fig 8 to show that more gain from local behavior data as prediction time goes on. One reason is that as real human behavior recordings, local behavior data can suppress error accumulation in a prediction model.