
EDA: Evolving and Distinct Anchors for Multimodal Motion Prediction

Longzhong Lin1,2, Xuewu Lin2, Tianwei Lin2, Lichao Huang2, Rong Xiong1, Yue Wang1 (corresponding author)
Abstract

Motion prediction is a crucial task in autonomous driving, and one of its major challenges lies in the multimodality of future behaviors. Many successful works have utilized mixture models which require identification of positive mixture components, and correspondingly fall into two main lines: prediction-based and anchor-based matching. The prediction clustering phenomenon in prediction-based matching makes it difficult to pick representative trajectories for downstream tasks, while the anchor-based matching suffers from a limited regression capability. In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multimodal motion prediction based on mixture models. We enable anchors to evolve and redistribute themselves under specific scenes for an enlarged regression capacity. Furthermore, we select distinct anchors before matching them with the ground truth, which results in impressive scoring performance. Our approach enhances all metrics compared to the baseline MTR, particularly with a notable relative reduction of 13.5% in Miss Rate, resulting in state-of-the-art performance on the Waymo Open Motion Dataset. Code is available at https://github.com/Longzhong-Lin/EDA.

Introduction

Figure 1: The outcomes from different matching paradigms. All of the strategies share the same network structure with 64 learnable queries. The top 6 predictions are selected from the original ones by non-maximum suppression (NMS).
Figure 2: The demonstration of different matching paradigms with a 2-layer decoder. Each subfigure displays a workflow on the left and corresponding illustration on the right. Objects with the same internal color belong to the same mixture component. The numbers attached to each component represent the scores. (a) and (b) respectively present the prediction-based and anchor-based matching. (c) demonstrates the design of proposed Evolving and Distinct Anchors (EDA), where the anchors for the 2nd layer are updated using the outputs from the 1st layer. Additionally, a selection of distinct anchors is applied before matching. As a result, the yellow component in the 2nd layer is excluded since it is close to the purple one but has a lower score.

In the field of autonomous driving, motion prediction is an important task that contributes to scene understanding and safe planning. Motion prediction utilizes historical agent states and road maps to predict the future trajectories of traffic participants. In recent years, a growing number of research works (2023; 2023; 2022a; 2022; 2021; 2021; 2020; 2019; 2018; 2017) have focused on motion prediction. A major challenge of motion forecasting is the multimodality of future behaviors, which means an agent could carry out one of many underlying possibilities.

A bunch of works (Ngiam et al. 2021; Varadarajan et al. 2022; Shi et al. 2022a; Chai et al. 2019) have adopted mixture models, like Gaussian Mixture Model (GMM), to represent multimodal future behaviors and have gained great success, where potential trajectories are modeled as scored components. These approaches typically employ a winner-takes-all regression loss in conjunction with a classification term, which necessitates identifying the positive and negative mixture components. For selecting positive components, there are two main categories of existing methods: prediction-based and anchor-based matching.

The prediction-based matching methods (Ngiam et al. 2021; Varadarajan et al. 2022) choose the predicted trajectory that is closest to the ground truth as the positive component, which is demonstrated in Fig. 2(a). Predictions generated by these methods faithfully reflect the high degree of uncertainty in future behaviors, which results in an originally lower minimum error and miss rate (Fig. 1(a)). However, as illustrated in Fig. 1(c), the output trajectories from prediction-based matching tend to cluster around the most probable regions and similar scores are assigned to such predictions, making it difficult to pick representative trajectories for downstream tasks (Fig. 1(b)).

As demonstrated in Fig. 2(b), the anchor-based matching methods (Shi et al. 2022a; Chai et al. 2019) associate each component with an anchor endpoint or trajectory, and select as positive the component corresponding to the predefined anchor closest to the ground truth. The introduction of spatial priors considerably alleviates the burden of optimization in classification, and the methods prefer to generate trajectories around the predefined anchors. Nevertheless, to reduce computational costs and prevent compromising the scoring performance (Shi et al. 2022a), the anchors are usually distributed in a sparser manner compared to the outputs from prediction-based matching. Hence the regression capability of the model is limited, which is shown in Fig. 1(a).

In this paper, we introduce a novel paradigm, named Evolving and Distinct Anchors (EDA), to define the positive and negative components for multi-modal motion prediction based on mixture models. As illustrated in Fig. 2(c), we first pre-define anchors and then update them by the intermediate outputs, hence the name Evolving Anchors. On the one hand, we utilize spatial priors in the form of predefined anchors to alleviate the difficulties in trajectory scoring posed by prediction-based matching approaches. On the other hand, we allow anchors to redistribute themselves based on predictions under specific scenes for a promoted regression capability compared to the vanilla anchor-based matching. As the anchors evolve multiple times, we observe that the prediction clustering issue previously presented in prediction-based matching arises and becomes pronounced, which continues to bother the optimization in scoring trajectories. In order to mitigate the ambiguity in classification caused by the gathering problem, inspired by Dense Distinct Query (Zhang et al. 2023) for object detection, we select Distinct Anchors through non-maximum suppression (NMS) before matching them with the ground truth, as demonstrated in Fig. 2(c). The adoption of distinct anchors also encourages the model to prioritize the most probable component among similar ones, facilitating the selection of representative predictions for downstream jobs. It turns out that our method leverages the benefits of both anchor-based and prediction-based matching (as shown in Fig. 1), and achieves state-of-the-art performance on the Waymo Open Motion Dataset (Ettinger et al. 2021).

Our contributions can be summarized as follows:

  1. We propose the Evolving Anchors for multimodal motion prediction based on mixture models, where we pre-define spatial anchors and then update them by the intermediate outputs. This novel strategy strikes a balance between the existing anchor-based and prediction-based matching approaches.

  2. We adopt Distinct Anchors to address the ambiguity in classification induced by prediction clustering phenomena. Employing NMS on anchors before matching them with the ground truth, we reduce the optimization difficulty in trajectory scoring and enhance the selection of representative predictions for subsequent tasks.

  3. We have performed experiments on the Waymo Open Motion Dataset (2021). With the assistance of Evolving and Distinct Anchors, our single model surpasses the performance of previous ensemble-free approaches, exhibiting improvements on all metrics compared to the baseline MTR (Shi et al. 2022a), particularly with a significant relative reduction of 13.5% in Miss Rate.

Related Work

Architectures for Motion Prediction

In recent times, there has been a significant increase in the study of motion prediction owing to the rising interest in autonomous driving. Motion prediction involves using the past agent states and road maps to forecast the future paths of traffic participants. Early studies (Chai et al. 2019; Casas et al. 2020; Park et al. 2020; Gilles et al. 2021; Casas, Sadat, and Urtasun 2021) commonly rasterize the inputs into images and capture the contextual information through CNNs. LaneGCN (Liang et al. 2020) and LaneRCNN (Zeng et al. 2021) construct lane graphs to efficiently represent the topology of road maps. Recent works (Gu, Sun, and Zhao 2021; Varadarajan et al. 2022; Shi et al. 2022a) have widely adopted the VectorNet (Gao et al. 2020) representation scheme, which regards the road maps as polylines. As Transformers (Vaswani et al. 2017) have gained popularity, an increasing number of studies (Liu et al. 2021; Ngiam et al. 2021; Jia et al. 2023) have utilized the attention mechanism to encode scene context. Encouraged by the successful application of DETR (Carion et al. 2020), many Transformer-based models (Girgis et al. 2021; Varadarajan et al. 2022; Nayakanti et al. 2023) have adopted learnable queries in the decoder to generate multiple potential future trajectories. In our study, we utilize the architecture presented in MTR (Shi et al. 2022a), which is an advanced transformer framework incorporating a local attention based encoder and a decoder with intention queries.

Modeling for Multimodal Future Motion

Previous studies have investigated different approaches for modeling multimodal future behaviors. Earlier generative models (Lee et al. 2017; Gupta et al. 2018; Rhinehart, Kitani, and Vernaza 2018; Rhinehart et al. 2019) generate a collection of samples to represent the distribution of future behaviors. Many other works (Chai et al. 2019; Mercat et al. 2020; Ngiam et al. 2021) have utilized mixture models to parameterize multi-modal predictions, which mainly fall into two lines: prediction-based and anchor-based matching, as elaborated in the introduction. In prediction-based matching methods (Ngiam et al. 2021; Varadarajan et al. 2022; Nayakanti et al. 2023), the positive mixture component is chosen by directly comparing predicted trajectories to the ground truth. Some models (Tang and Salakhutdinov 2019; Girgis et al. 2021) using a loss based on the EM algorithm can also be viewed as prediction-based matching when the KL term converges. Due to the challenge of selecting representative future trajectories, these methods have opted to use well-designed aggregation techniques (Varadarajan et al. 2022; Nayakanti et al. 2023), or to directly utilize an end-to-end version (Ngiam et al. 2021; Girgis et al. 2021). However, their scoring performance still lags behind that of anchor-based matching methods. The anchor-based matching (Chai et al. 2019; Zhao et al. 2021) regards as positive the component whose predefined anchor is closest to the ground truth. The HOME series (Gilles et al. 2021, 2022) and DenseTNT (Gu, Sun, and Zhao 2021) can be considered as variations of anchor-based matching, where the anchors are the grids in heatmaps or target candidates placed on roads, but they require an additional sampling process to obtain the final predictions. The MTR (Shi et al. 2022a) achieves remarkable scoring performance using predefined anchors, while its end-to-end prediction-based matching version demonstrates significantly better performance in terms of minimum error and miss rate.
Motivated by the findings, we propose a novel matching paradigm to exploit the regression potential hidden by the state-of-the-art anchor-based matching strategy.

Dense Distinct Query for Label Assignment

According to Zhang et al., considering one-to-one label assignment in object detection, sparse queries cannot ensure a high recall, while dense queries inevitably bring more similar queries and face optimization challenges in classification. Therefore, they propose Dense Distinct Queries (DDQ), in which dense queries are first laid and then distinct queries are selected for one-to-one assignments. Inspired by DDQ (Zhang et al. 2023), we adopt distinct anchors to mitigate the ambiguity in trajectory scoring induced by prediction clustering phenomena.

Evolving and Distinct Anchors

Figure 3: The illustration of the EDA paradigm. (a) shows an instance of the overall architecture with a 6-layer decoder and anchors evolving at the 2nd and 4th layers. (b) reveals the details in each decoder layer, where distinct anchors are selected before matching. Components that correspond to the excluded anchors, such as the yellow one in the picture, are considered neutral.

For identifying positive components, there are two primary strategies within the existing mixture-model based methods. The prediction-based matching directly compares the predicted trajectories $\{P_i\}_{i=1}^{N_C}$ with the ground truth $G$:

$Distance(P_i, G),\ i = 1, \cdots, N_C,$  (1)

where $N_C$ denotes the number of components. In anchor-based matching, the spatial anchors $\{A_i\}_{i=1}^{N_C}$ are linked to each component and matched with the ground truth $G$:

$Distance(A_i, G),\ i = 1, \cdots, N_C.$  (2)

In this study, we present Evolving and Distinct Anchors (EDA), a novel paradigm to define the positive and negative mixture components by:

$Distance(A_{E_j}, G),\ j \in \mathcal{I}_D,$  (3)

where $A_E$ denotes the evolving anchors, and $\mathcal{I}_D$ is the index set of distinct anchors. The main idea is illustrated in Fig. 3. In the following, we first introduce the encoder-decoder structure upon which our method is built. Subsequently, we provide detailed descriptions of the proposed Evolving Anchors and Distinct Anchors respectively.
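The three matching rules of Eqs. (1)-(3) can be sketched side by side. This is a minimal NumPy illustration, not the paper's implementation: the function name and the average pointwise L2 distance are our assumptions.

```python
import numpy as np

def match_positive(trajs, anchors, gt, distinct_idx=None):
    """Pick the positive mixture component under each matching paradigm.

    trajs:        (N_C, T, 2) predicted trajectories (prediction-based matching)
    anchors:      (N_C, T, 2) anchor trajectories    (anchor-based matching)
    gt:           (T, 2)      ground-truth trajectory
    distinct_idx: indices of distinct anchors kept after NMS (EDA)
    """
    def dist(x):  # average pointwise L2 distance to the ground truth
        return np.linalg.norm(x - gt, axis=-1).mean(-1)

    pred_based = int(np.argmin(dist(trajs)))       # Eq. (1)
    anchor_based = int(np.argmin(dist(anchors)))   # Eq. (2)
    if distinct_idx is not None:                   # Eq. (3): match only distinct anchors
        j = int(np.argmin(dist(anchors[distinct_idx])))
        eda = int(distinct_idx[j])
    else:
        eda = anchor_based
    return pred_based, anchor_based, eda
```

Components excluded by `distinct_idx` can be neither positive nor negative, matching the neutral treatment described later.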

Network Architecture

We have implemented our ideas on a cutting-edge encoder-decoder structure, as the one presented in MTR (Shi et al. 2022a). This transformer framework employs an encoder with local self-attention for scene context modeling, in addition to a multi-layer decoder that incorporates learnable intention queries to predict multimodal trajectories.

It is important to note that our approach presented in this paper is centered on the design of loss. Consequently, the proposed Evolving and Distinct Anchors (EDA) can be readily applied to any network structure that includes a multi-layer decoder.

Evolving Anchors

Although the spatial priors significantly alleviate the challenge in classification optimization, the vanilla anchor-based matching encounters a limitation in its regression capability, which will be demonstrated later. Regarding the above issue and encouraged by the successful adoption of multi-layer decoders in motion prediction (2021; 2022a), we naturally consider enabling anchors to evolve through multiple decoder layers for an enlarged regression capacity.

Taking a 6-layer decoder as an example, as illustrated in Fig. 3(a), we can implement twice-evolving anchors by updating the anchors with the outputs from the 2nd and 4th layers, in which the evolving anchors for the $n$-th layer are:

$$A_E^{(n)} = \begin{cases} A, & n = 1, 2 \\ P^{(2)}, & n = 3, 4 \\ P^{(4)}, & n = 5, 6 \end{cases}$$  (4)

where we have omitted the index subscripts for simplicity.

In short, the evolving anchors are initially predefined and later adjusted by the intermediate outputs of the decoder layers, which means the anchors are allowed to redistribute themselves under specific scenes.
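The anchor schedule of Eq. (4) amounts to a simple per-layer lookup. A minimal sketch, assuming the 6-layer decoder and the update layers of the paper's example (names are illustrative):

```python
def evolving_anchors(predefined, layer_outputs, update_layers=(2, 4), num_layers=6):
    """Anchor schedule for a multi-layer decoder (Eq. 4, updates at layers 2 and 4).

    predefined:    initial anchors A (e.g. 64 k-means intention trajectories)
    layer_outputs: dict mapping layer index n -> predicted trajectories P^(n)
    Returns a dict: layer n -> the anchor set A_E^(n) used for matching there.
    """
    anchors, current = {}, predefined
    for n in range(1, num_layers + 1):
        anchors[n] = current
        if n in update_layers:        # evolve: replace anchors with this layer's outputs
            current = layer_outputs[n]
    return anchors
```

With `update_layers=(2, 4)` this reproduces Eq. (4): layers 1-2 use $A$, layers 3-4 use $P^{(2)}$, and layers 5-6 use $P^{(4)}$.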

Figure 4: Layer outputs from different matching paradigms under the same scene. The \star represents the anchor endpoint. The typical trajectories are highlighted in bright colors, with each color indicating the same component across various methods, whereas the remaining ones are displayed in gray.

Effects of Evolving Anchors.

The vanilla anchor-based matching, as presented in Fig. 4, tends to make relatively small adjustments to the predefined anchors in each layer. This is because making significant changes to the anchor that hits the ground truth would result in a considerable regression loss, while refinements to unlikely ones are not encouraged. Besides, the anchors are usually distributed in a sparser manner to reduce computational costs and avoid compromising the scoring performance (Shi et al. 2022a). Therefore, the regression capability of the model is limited by the anchor-based matching with static anchors.

Correspondingly, making anchors adjustable motivates the model to modify unreasonable components to a larger degree, as illustrated in Fig. 4. Nevertheless, substantial refinements are made only when the potential benefits of achieving successful regression outweigh the expected cost of mistakenly making substantial adjustments. Hence the modifications to anchors are restrained and progressive in evolving anchors. In contrast, without the constraints from predefined anchors, the prediction-based matching would generate trajectories gathering around the most probable regions, even in the earlier layers, as shown in Fig. 4.

Therefore, the proposed Evolving Anchors achieves a balance between the anchor-based and prediction-based matching, where one can adjust the extent of modifications to predefined anchors through the frequency of anchor updates.

Distinct Anchors

Although predicting trajectories that cluster around the most probable regions contributes to better coverage of future behaviors with high uncertainty in prediction-based matching, this preference also introduces a serious issue of ambiguity in the scoring task. With multiple gathering outcomes, it becomes difficult for the model to distinguish the actual one closest to the ground truth. Hence the model tends to output similar scores for such predictions, making it hard to pick representative trajectories for downstream tasks.

In our proposed evolving anchors, as stated in the above analysis on effects of evolving anchors, the more frequently we update anchors, the greater the opportunity for substantial adjustments to unreal components. However, this also increases the potential for the phenomenon of prediction clustering. Such patterns can be observed intuitively in Fig. 5. As a result, this issue continues to pose a challenge for optimization in classification, particularly when updating the anchors multiple times.

Taking inspiration from DDQ (Zhang et al. 2023) in the object detection domain, we attempt to adopt distinct anchors to improve scoring performance. Specifically, we apply non-maximum suppression (NMS) to the anchors for each decoder layer prior to matching them with the ground truth during training, as illustrated in Fig. 3(b). Mixture components that correspond to the excluded anchors will neither serve as positive nor negative samples. Through the aforementioned operations:

  • We prevent the labeling of similar anchors as opposite, which significantly reduces the optimization difficulty for the classification task.

  • Moreover, the model is encouraged to prioritize the most probable trajectory among the similar ones, making it easier to select the representative predictions using simple post-processing techniques such as NMS.
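The selection step above can be sketched as a greedy endpoint NMS over the anchors; the score-descending ordering and the plain L2 distance here are illustrative assumptions:

```python
import numpy as np

def select_distinct_anchors(endpoints, scores, threshold):
    """Greedy endpoint NMS selecting distinct anchors before matching.

    endpoints: (N_C, 2) anchor endpoints
    scores:    (N_C,)   component scores
    Returns indices of kept (distinct) anchors; suppressed components
    are treated as neutral (neither positive nor negative samples).
    """
    order = np.argsort(-scores)  # visit highest-scoring anchors first
    keep = []
    for i in order:
        # keep anchor i only if it is far from every anchor already kept
        if all(np.linalg.norm(endpoints[i] - endpoints[j]) > threshold for j in keep):
            keep.append(int(i))
    return np.array(keep)
```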

Figure 5: A typical example illustrating the prediction clustering phenomenon in evolving anchors.

Training Losses

We train the model with a combination of winner-takes-all regression loss and classification term, which is commonly used in mixture-model based methods (Chai et al. 2019; Nayakanti et al. 2023). Same as MTR (Shi et al. 2022a), we employ a Gaussian regression loss. Instead of Cross Entropy (CE) in MTR, we use Binary Cross Entropy (BCE) for classification loss, which is suitable for arbitrary numbers of mixture components filtered by distinct anchors. Please refer to the Appendix for more implementation details.
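A hedged sketch of the BCE classification term with neutral components excluded (NumPy stand-in; the paper's exact weighting and reduction may differ, and `eda_classification_loss` is our name):

```python
import numpy as np

def eda_classification_loss(logits, positive_idx, neutral_mask):
    """BCE over mixture-component scores with neutral samples removed.

    logits:       (N_C,) component scores before sigmoid
    positive_idx: index of the positive component (distinct anchor closest to GT)
    neutral_mask: (N_C,) bool, True for components whose anchors were suppressed
                  by NMS; they contribute nothing to the loss.
    """
    p = 1.0 / (1.0 + np.exp(-logits))        # sigmoid scores
    targets = np.zeros_like(logits)
    targets[positive_idx] = 1.0
    bce = -(targets * np.log(p) + (1 - targets) * np.log(1 - p))
    return bce[~neutral_mask].mean()         # average over positive + negative only
```

BCE treats each component independently, which is what makes an arbitrary (NMS-dependent) number of scored components unproblematic, unlike a softmax-based CE over a fixed set.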

Table 1: Top 6 metrics on the validation set of Waymo Open Motion Dataset (Ettinger et al. 2021). The terms “original”, “scaled” and “rank” under the “mAP” heading respectively represent the results upon the original, scaled and ranking-oriented top 6 scores, as elaborated in implementation details.
| Anchor Evolving Times | Classification Loss | Distinct Anchors | mAP ↑ (original) | mAP ↑ (scaled) | mAP ↑ (rank) | minADE ↓ | minFDE ↓ | Miss Rate ↓ |
|---|---|---|---|---|---|---|---|---|
| 0 | CE | | 0.4059 | 0.4167 | 0.4121 | 0.6012 | 1.2277 | 0.1348 |
| 0 | BCE | | 0.4053 | 0.4171 | 0.4126 | 0.6050 | 1.2376 | 0.1357 |
| 1 | CE | | 0.4013 | 0.4211 | 0.4183 | 0.5867 | 1.2109 | 0.1240 |
| 1 | BCE | | 0.4060 | 0.4255 | 0.4228 | 0.5838 | 1.2012 | 0.1221 |
| 1 | BCE | ✓ | 0.4173 | 0.4221 | 0.4278 | 0.5776 | 1.1895 | 0.1203 |
| 2 | CE | | 0.3868 | 0.4107 | 0.4101 | 0.5881 | 1.2145 | 0.1227 |
| 2 | BCE | | 0.3957 | 0.4236 | 0.4207 | 0.5888 | 1.2144 | 0.1229 |
| 2 | BCE | ✓ | 0.4235 | 0.4251 | 0.4353 | 0.5708 | 1.1730 | 0.1178 |
| 5 | CE | | 0.3647 | 0.4051 | 0.4002 | 0.5996 | 1.2444 | 0.1264 |
| 5 | BCE | | 0.3675 | 0.4063 | 0.4037 | 0.5998 | 1.2412 | 0.1272 |
| 5 | BCE | ✓ | 0.4186 | 0.4185 | 0.4322 | 0.5817 | 1.2056 | 0.1245 |
Figure 6: Minimum Error (left) and Miss Rate (right) on original 64 components for each decoder layer.

Experiments

Experimental Setup

Dataset and metrics.

We assess our method on the large-scale Waymo Open Motion Dataset (WOMD) proposed by Ettinger et al., which extracts interesting behaviors from actual traffic scenes. The WOMD (Ettinger et al. 2021) includes 487k training scenes, 44k validation and 44k testing scenes, where each scene contains up to 8 target agents. Each agent is comprised of 1 second of historical states and 8 seconds of future information. The long time horizon challenges the model’s capacity to capture a broad field of view and adapt to a vast output space for trajectories.

Due to the complexity of reasoning about numerous potential future behaviors, benchmark metrics limit the number of trajectories under consideration. The official website offers an evaluation on submissions with up to 6 motion predictions for each target agent, returning metrics including minADE (Minimum Average Displacement Error), minFDE (Minimum Final Displacement Error), Miss Rate, Overlap Rate, mAP and Soft mAP. Hence the top 6 metrics we provide are obtained from the official evaluation server, whereas we utilize a local evaluation tool based on the official API to compute metrics on a greater number of mixture components.

Implementation details.

Our design is built upon the state-of-the-art MTR framework (Shi et al. 2022a), where we adopt the default settings of the network structure and training configuration. We train the model for 30 epochs on 16 GPUs (NVIDIA RTX 3090) with a batch size of 80 scenes. The predefined anchors we use are the 64 intention points generated by a k-means clustering algorithm on the training set, as in MTR. To achieve a more stable matching, for all anchors except the predefined ones, we assign labels based on the full trajectories of the intermediate outputs that act as evolving anchors.

For evaluation, we pick the top 6 predictions by employing NMS on the endpoints of the 64 predicted trajectories. Following Shi et al., the distance threshold $\sigma$ is scaled proportionally to the length $L$ of the trajectory with the highest confidence: $\sigma = \min[3.5, \max[2.5, 2.5 + 1.5 \times (L - 10)/(50 - 10)]]$. The same NMS distance threshold is also applied to the selection of distinct anchors. To improve the mAP metrics, the MTR (2022a) scales the original top 6 scores for each sample by dividing them by their sum, making the scores comparable across different agents. In our view, it also makes sense to consider the rank of trajectories within a sample when comparing predictions across different agents. Therefore, we add a rank-related integer to the original scores (which range between 0 and 1), so that when computing the mAP metrics, the top-ranked trajectories of all samples are sorted at the top, followed by the 2nd-ranked, 3rd-ranked, and so on. For instance, we add 5 for the top-ranked trajectory, 4 for the 2nd-ranked, 3 for the 3rd-ranked, and so forth. In order to align with previous works, we still present the mAP metrics upon the original and scaled scores in the following ablation study.
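The threshold formula and the rank-oriented scoring above translate directly into code (function names are ours; the rank offset generalizes the "+5, +4, +3, ..." example to any number of predictions):

```python
def nms_threshold(L):
    """NMS distance threshold scaled with the length L of the top-scoring
    trajectory, per the formula from Shi et al. (2022a)."""
    return min(3.5, max(2.5, 2.5 + 1.5 * (L - 10) / (50 - 10)))

def rank_oriented_scores(scores):
    """Add a rank-related integer to the original scores (each in [0, 1]),
    so top-ranked trajectories of all agents sort before all 2nd-ranked ones,
    and so on. With 6 predictions, the top-ranked score gets +5, the 2nd +4..."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    out = list(scores)
    for rank, i in enumerate(order):
        out[i] = scores[i] + (len(scores) - 1 - rank)  # integer offset by rank
    return out
```

The integer offsets dominate the fractional original scores, so the global sorting used by mAP is driven first by per-agent rank and only then by raw confidence.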

Table 2: Performance comparison on the validation and test sets of Waymo Open Motion Dataset (Ettinger et al. 2021).
| Set | Method | Soft mAP ↑ | mAP ↑ | minADE ↓ | minFDE ↓ | Miss Rate ↓ | Overlap Rate ↓ |
|---|---|---|---|---|---|---|---|
| Test | MotionCNN (2022) | - | 0.2136 | 0.7400 | 1.4936 | 0.2091 | 0.1560 |
| Test | ReCoAt (2022) | - | 0.2711 | 0.7703 | 1.6668 | 0.2437 | 0.1642 |
| Test | DenseTNT (2021) | - | 0.3281 | 1.0387 | 1.5514 | 0.1573 | 0.1779 |
| Test | SceneTransformer (2021) | - | 0.2788 | 0.6117 | 1.2116 | 0.1564 | 0.1473 |
| Test | HDGT (2023) | - | 0.2854 | 0.5933 | 1.2055 | 0.1511 | - |
| Test | MTR (2022a) | 0.4216 | 0.4129 | 0.6050 | 1.2207 | 0.1351 | 0.1277 |
| Test | MTR++ (2023) | 0.4414 | 0.4329 | 0.5906 | 1.1939 | 0.1298 | 0.1281 |
| Test | EDA (Ours) | 0.4510 | 0.4401 | 0.5718 | 1.1702 | 0.1169 | 0.1266 |
| Val | MTR (2022a) | - | 0.4164 | 0.6046 | 1.2251 | 0.1366 | - |
| Val | MTR++ (2023) | - | 0.4351 | 0.5912 | 1.1986 | 0.1296 | - |
| Val | EDA (Ours) | 0.4462 | 0.4353 | 0.5708 | 1.1730 | 0.1178 | 0.1273 |

Ablation Study

We first investigate the impacts of Evolving Anchors, and then assess the effectiveness of Distinct Anchors. All models are evaluated on the validation set of WOMD (Ettinger et al. 2021). In terms of mAP metrics, the results based on the original, scaled, and ranking-oriented top 6 scores are all presented, as described in the implementation details.

Evolving Anchors.

Starting from the baseline with zero anchor updates, which is actually the MTR (Shi et al. 2022a) that uses anchor-based matching with static anchors, we apply various numbers of anchor updates to explore the effects of evolving anchors. Upon the adopted 6-layer decoder, we update anchors at the 3rd layer for once-evolving anchors, at the 2nd and 4th layers for twice-evolving anchors, and at every layer but the final one for 5 anchor updates. The corresponding top 6 metrics are displayed in the rows highlighted in gray of Table 1, while the results on the original 64 components are included in Fig. 6.

Fig. 6 shows that the regression capacity of model improves as the number of anchor updates increases, with a significant enhancement each time the anchors evolve. This finding supports the idea that evolving anchors present opportunities to unlock the potential in regression hidden by the vanilla anchor-based matching. And the more frequently we update the anchors, the greater the potential for adjustments to enhance the regression.

However, as illustrated in Fig. 5, the phenomenon of prediction clustering also becomes severe when the anchors are updated more times, since the increased freedom in modifying the predefined anchors results in outputs more resembling those from the prediction-based matching. This issue adversely affects the performance of trajectory scoring, leading to a decline in top 6 metrics when two or more anchor updates are employed, as presented in Table 1.

Distinct Anchors.

We utilize the BCE loss to accommodate varying numbers of the mixture components selected for distinct anchors, which is different from the MTR (Shi et al. 2022a) using the CE loss. Hence we begin by assessing the influence of various options for the classification loss. From both Fig. 6 and Table 1, it can be observed that, overall, the BCE loss leads to only marginal differences in the results, along with a slightly better mAP. This suggests that the BCE loss can be considered a reasonable substitute for the CE classification loss.

After validating the impact of the BCE loss, we now evaluate the efficacy of Distinct Anchors. As seen in Table 1, the use of distinct anchors brings a considerable enhancement in the top 6 metrics for models with evolving anchors. Moreover, the progress, particularly in mAP (e.g., +0.5%, +1.46%, +2.85% for 1, 2, 5 anchor updates respectively upon ranking-oriented scores), becomes more notable with a higher frequency of anchor evolving. Nevertheless, the regression metrics on the original 64 mixture components, as shown in Fig. 6, do not exhibit a significant improvement. This evidence indicates that the adoption of distinct anchors facilitates both the selection of representative behaviors and the scoring performance, which is hindered by the prediction clustering phenomenon.

But the benefits of distinct anchors are not limitless. As depicted in Table 1, even with the help of distinct anchors, the performance of 5 anchor updates cannot surpass that of twice-evolving anchors. And the unusual deterioration in Miss Rate when using distinct anchors with 5 anchor updates (Fig. 6) implies that the model may still be plagued by too many anchor updates.

Benchmark Results

We evaluate the model that performs the best in our ablation study, namely twice-evolving and distinct anchors with ranking-oriented top 6 scores, on the test set of WOMD (Ettinger et al. 2021). We need to point out that the model for testing is trained solely on the WOMD training set without any ensemble techniques applied, consistent with our baseline MTR (Shi et al. 2022a).

As shown in Table 2, our single model outperforms previous ensemble-free approaches on the WOMD. The proposed EDA has demonstrated significant improvements in all performance metrics compared to the baseline MTR on both the validation and test sets. Specifically, there is a relative improvement of 13.5% on Miss Rate, 5.5% on minADE, and 4.1% on minFDE, as well as a +2.94% absolute growth in Soft mAP on the test set. Furthermore, the performance of our EDA surpasses that of MTR++ (Shi et al. 2023), the latest improved version of MTR, on both the validation and test sets of WOMD. It is worth noting that MTR++ primarily enhances the network structure of MTR, while our approach centers on the design of loss, which means that combining the two complementary refinements has the potential to yield even more remarkable performance. Please refer to the Appendix for more experimental results.

Conclusions

In this paper, we present Evolving and Distinct Anchors (EDA), a novel paradigm to define the positive and negative components for multi-modal motion prediction based on mixture models. We pre-define anchors and update them with intermediate outputs and pick distinct anchors before matching them with the ground truth. Allowing the anchors to evolve and redistribute themselves under specific scenes promotes the regression capability of model. The adoption of distinct anchors addresses the ambiguity in classification induced by the prediction clustering issue, and facilitates the selection of representative predictions for downstream tasks. It turns out that our approach exhibits a significant improvement compared to the baseline MTR, achieving state-of-the-art performance on the Waymo Open Motion Dataset.

References

  • Carion et al. (2020) Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European conference on computer vision, 213–229. Springer.
  • Casas et al. (2020) Casas, S.; Gulino, C.; Liao, R.; and Urtasun, R. 2020. Spagnn: Spatially-aware graph neural networks for relational behavior forecasting from sensor data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 9491–9497. IEEE.
  • Casas, Sadat, and Urtasun (2021) Casas, S.; Sadat, A.; and Urtasun, R. 2021. Mp3: A unified model to map, perceive, predict and plan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14403–14412.
  • Chai et al. (2019) Chai, Y.; Sapp, B.; Bansal, M.; and Anguelov, D. 2019. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction. arXiv preprint arXiv:1910.05449.
  • Ettinger et al. (2021) Ettinger, S.; Cheng, S.; Caine, B.; Liu, C.; Zhao, H.; Pradhan, S.; Chai, Y.; Sapp, B.; Qi, C. R.; Zhou, Y.; et al. 2021. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9710–9719.
  • Gao et al. (2020) Gao, J.; Sun, C.; Zhao, H.; Shen, Y.; Anguelov, D.; Li, C.; and Schmid, C. 2020. Vectornet: Encoding hd maps and agent dynamics from vectorized representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11525–11533.
  • Gilles et al. (2021) Gilles, T.; Sabatini, S.; Tsishkou, D.; Stanciulescu, B.; and Moutarde, F. 2021. Home: Heatmap output for future motion estimation. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 500–507. IEEE.
  • Gilles et al. (2022) Gilles, T.; Sabatini, S.; Tsishkou, D.; Stanciulescu, B.; and Moutarde, F. 2022. Gohome: Graph-oriented heatmap output for future motion estimation. In 2022 international conference on robotics and automation (ICRA), 9107–9114. IEEE.
  • Girgis et al. (2021) Girgis, R.; Golemo, F.; Codevilla, F.; Weiss, M.; D’Souza, J. A.; Kahou, S. E.; Heide, F.; and Pal, C. 2021. Latent variable sequential set transformers for joint multi-agent motion prediction. arXiv preprint arXiv:2104.00563.
  • Gu, Sun, and Zhao (2021) Gu, J.; Sun, C.; and Zhao, H. 2021. Densetnt: End-to-end trajectory prediction from dense goal sets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 15303–15312.
  • Gupta et al. (2018) Gupta, A.; Johnson, J.; Fei-Fei, L.; Savarese, S.; and Alahi, A. 2018. Social gan: Socially acceptable trajectories with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2255–2264.
  • Huang, Mo, and Lv (2022) Huang, Z.; Mo, X.; and Lv, C. 2022. ReCoAt: A deep learning-based framework for multi-modal motion prediction in autonomous driving application. In 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), 988–993. IEEE.
  • Jia et al. (2023) Jia, X.; Wu, P.; Chen, L.; Liu, Y.; Li, H.; and Yan, J. 2023. Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Konev, Brodt, and Sanakoyeu (2022) Konev, S.; Brodt, K.; and Sanakoyeu, A. 2022. MotionCNN: a strong baseline for motion prediction in autonomous driving. arXiv preprint arXiv:2206.02163.
  • Lee et al. (2017) Lee, N.; Choi, W.; Vernaza, P.; Choy, C. B.; Torr, P. H.; and Chandraker, M. 2017. Desire: Distant future prediction in dynamic scenes with interacting agents. In Proceedings of the IEEE conference on computer vision and pattern recognition, 336–345.
  • Liang et al. (2020) Liang, M.; Yang, B.; Hu, R.; Chen, Y.; Liao, R.; Feng, S.; and Urtasun, R. 2020. Learning lane graph representations for motion forecasting. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, 541–556. Springer.
  • Liu et al. (2021) Liu, Y.; Zhang, J.; Fang, L.; Jiang, Q.; and Zhou, B. 2021. Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7577–7586.
  • Mercat et al. (2020) Mercat, J.; Gilles, T.; El Zoghby, N.; Sandou, G.; Beauvois, D.; and Gil, G. P. 2020. Multi-head attention for multi-modal joint vehicle motion forecasting. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 9638–9644. IEEE.
  • Nayakanti et al. (2023) Nayakanti, N.; Al-Rfou, R.; Zhou, A.; Goel, K.; Refaat, K. S.; and Sapp, B. 2023. Wayformer: Motion forecasting via simple & efficient attention networks. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2980–2987. IEEE.
  • Ngiam et al. (2021) Ngiam, J.; Vasudevan, V.; Caine, B.; Zhang, Z.; Chiang, H.-T. L.; Ling, J.; Roelofs, R.; Bewley, A.; Liu, C.; Venugopal, A.; et al. 2021. Scene transformer: A unified architecture for predicting future trajectories of multiple agents. In International Conference on Learning Representations.
  • Park et al. (2020) Park, S. H.; Lee, G.; Seo, J.; Bhat, M.; Kang, M.; Francis, J.; Jadhav, A.; Liang, P. P.; and Morency, L.-P. 2020. Diverse and admissible trajectory forecasting through multimodal context understanding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, 282–298. Springer.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  • Rhinehart, Kitani, and Vernaza (2018) Rhinehart, N.; Kitani, K. M.; and Vernaza, P. 2018. R2p2: A reparameterized pushforward policy for diverse, precise generative path forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), 772–788.
  • Rhinehart et al. (2019) Rhinehart, N.; McAllister, R.; Kitani, K.; and Levine, S. 2019. Precog: Prediction conditioned on goals in visual multi-agent settings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2821–2830.
  • Shi et al. (2022a) Shi, S.; Jiang, L.; Dai, D.; and Schiele, B. 2022a. Motion transformer with global intention localization and local movement refinement. Advances in Neural Information Processing Systems, 35: 6531–6543.
  • Shi et al. (2022b) Shi, S.; Jiang, L.; Dai, D.; and Schiele, B. 2022b. MTR-A: 1st Place Solution for 2022 Waymo Open Dataset Challenge – Motion Prediction. arXiv:2209.10033.
  • Shi et al. (2023) Shi, S.; Jiang, L.; Dai, D.; and Schiele, B. 2023. MTR++: Multi-Agent Motion Prediction with Symmetric Scene Modeling and Guided Intention Querying. arXiv preprint arXiv:2306.17770.
  • Tang and Salakhutdinov (2019) Tang, C.; and Salakhutdinov, R. R. 2019. Multiple futures prediction. Advances in neural information processing systems, 32.
  • Varadarajan et al. (2022) Varadarajan, B.; Hefny, A.; Srivastava, A.; Refaat, K. S.; Nayakanti, N.; Cornman, A.; Chen, K.; Douillard, B.; Lam, C. P.; Anguelov, D.; et al. 2022. Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction. In 2022 International Conference on Robotics and Automation (ICRA), 7814–7821. IEEE.
  • Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
  • Wilson et al. (2021) Wilson, B.; Qi, W.; Agarwal, T.; Lambert, J.; Singh, J.; Khandelwal, S.; Pan, B.; Kumar, R.; Hartnett, A.; Pontes, J. K.; Ramanan, D.; Carr, P.; and Hays, J. 2021. Argoverse 2: Next Generation Datasets for Self-driving Perception and Forecasting. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021).
  • Ye, Cao, and Chen (2021) Ye, M.; Cao, T.; and Chen, Q. 2021. Tpcn: Temporal point cloud networks for motion forecasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11318–11327.
  • Zeng et al. (2021) Zeng, W.; Liang, M.; Liao, R.; and Urtasun, R. 2021. Lanercnn: Distributed representations for graph-centric motion forecasting. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 532–539. IEEE.
  • Zhang et al. (2023) Zhang, S.; Wang, X.; Wang, J.; Pang, J.; Lyu, C.; Zhang, W.; Luo, P.; and Chen, K. 2023. Dense Distinct Query for End-to-End Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7329–7338.
  • Zhao et al. (2021) Zhao, H.; Gao, J.; Lan, T.; Sun, C.; Sapp, B.; Varadarajan, B.; Shen, Y.; Shen, Y.; Chai, Y.; Schmid, C.; et al. 2021. Tnt: Target-driven trajectory prediction. In Conference on Robot Learning, 895–904. PMLR.
  • Zhou et al. (2023) Zhou, Z.; Wang, J.; Li, Y.-H.; and Huang, Y.-K. 2023. Query-Centric Trajectory Prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17863–17873.
  • Zhou et al. (2022) Zhou, Z.; Ye, L.; Wang, J.; Wu, K.; and Lu, K. 2022. Hivt: Hierarchical vector transformer for multi-agent motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8823–8833.

Appendix

Implementation Details

Architecture details.

We implement our method based on the Motion TRansformer (MTR) (Shi et al. 2022a). To predict the motion of a target agent, we adopt the agent-centric strategy, in which all inputs are normalized to a coordinate system centered on the agent in question. MTR uses the vectorized representation (Gao et al. 2020) to arrange the input agent states and road maps as polylines. The encoder extracts the scene context information, including agent and map features, with local self-attention over the input polylines. A dense future prediction module generates the future states of all surrounding agents to enrich their features. Taking the context features as keys and values, a multi-layer decoder incorporates learnable intention queries to produce multimodal trajectories. The learnable queries are initialized with intention endpoints generated by a k-means clustering algorithm on the training set. These endpoints also serve as the predefined anchors for both the vanilla anchor-based matching and our EDA, as shown in Fig. 7. In the cross-attention of each decoder layer, dynamic map collection gathers the map features closest to the latest predicted trajectories for querying, and the query position embedding is continuously updated using the layer outputs. We adopt the default parameters of MTR in our experiments, which are shown in Table 3.
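As a reference, the intention endpoints that initialize the queries can be reproduced by running k-means over the agent-centric ground-truth trajectory endpoints of each category. Below is a minimal numpy sketch; the function name and its defaults are ours for illustration and are not taken from the MTR codebase:

```python
import numpy as np

def kmeans_intention_points(endpoints, num_clusters=64, num_iters=50, seed=0):
    """Cluster ground-truth trajectory endpoints into intention points.

    endpoints: (N, 2) array of agent-centric endpoint coordinates.
    Returns a (num_clusters, 2) array of cluster centers, usable as
    predefined anchors.
    """
    rng = np.random.default_rng(seed)
    # initialize centers from randomly chosen endpoints
    centers = endpoints[rng.choice(len(endpoints), num_clusters, replace=False)]
    for _ in range(num_iters):
        # assign each endpoint to its nearest center
        dists = np.linalg.norm(endpoints[:, None] - centers[None], axis=-1)
        assign = dists.argmin(axis=1)
        # update centers; keep the old center if a cluster becomes empty
        for k in range(num_clusters):
            members = endpoints[assign == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    return centers
```

In practice, MTR runs such a clustering separately per agent category (vehicle, pedestrian, cyclist), yielding the per-class anchor distributions of Fig. 7.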

Refer to caption
Figure 7: The distribution of predefined anchors (shown as \star) for each type of traffic participant.
Table 3: The architecture parameters.
Encoder:
number of layers 6
number of map polylines 768
number of points in each polyline 20
number of attention neighbors 16
hidden feature dimension 256
Decoder:
number of layers 6
number of components 64
number of nearest map polylines 128

The mixture model loss.

Given the context X and the ground truth Y, the aim of optimization is to maximize the log-likelihood:

\begin{aligned}
\log p_{\theta}(Y|X) &= E_{Z\sim q(Z)}\left[\log\frac{p_{\theta}(Y,Z|X)}{p_{\theta}(Z|Y,X)}\right] \\
&= E_{Z\sim q(Z)}\left[\log\frac{p_{\theta}(Y,Z|X)}{q(Z)}\right] + E_{Z\sim q(Z)}\left[\log\frac{q(Z)}{p_{\theta}(Z|Y,X)}\right] \\
&= E_{Z\sim q(Z)}[\log p_{\theta}(Y,Z|X)] + H[q] + KL[q(Z)\,\|\,p_{\theta}(Z|Y,X)] \\
&\geq E_{Z\sim q(Z)}[\log p_{\theta}(Y,Z|X)] + H[q] \\
&= E_{Z\sim q(Z)}[\log p_{\theta}(Y|Z,X)] + E_{Z\sim q(Z)}[\log p_{\theta}(Z|X)] + H[q] \\
&= E_{Z\sim q(Z)}[\log p_{\theta}(Y|Z,X)] - KL[q(Z)\,\|\,p_{\theta}(Z|X)],
\end{aligned}

where θ denotes the learnable weights of the neural network, and Z is the discrete latent variable indicating the component of the mixture model. If we take:

q(Z) = \mathds{1}(Z = z^{*}),

where z* corresponds to the positive component whose anchor or prediction is closest to the ground truth; namely, we utilize a winner-takes-all strategy.

Correspondingly, the term E_{Z∼q(Z)}[log p_θ(Y|Z,X)] is the regression objective between the prediction of the positive component Ŷ* and the ground truth Y, where we model the probability using a Gaussian distribution:

L_{reg} = \log p_{\theta}(Y|Z=z^{*},X) = \log\mathcal{N}(Y|\hat{Y}^{*}).

Since H[q] is irrelevant to the learnable weights θ, the other term −KL[q(Z)‖p_θ(Z|X)] is equivalent to the Cross Entropy loss on the predicted scores Ẑ:

\begin{aligned}
-KL[q(Z)\,\|\,p_{\theta}(Z|X)] &\Leftrightarrow E_{Z\sim q(Z)}[\log p_{\theta}(Z|X)] \\
&= -CE[\mathds{1}(Z=z^{*}),\hat{Z}].
\end{aligned}

To accommodate the varying number of mixture components selected as distinct anchors, we employ a Binary Cross Entropy loss for classification, which our experiments have shown to be a viable alternative to the CE loss:

L_{cls} = -BCE[\mathds{1}(Z_{d}=z_{d}^{*}),\hat{Z}_{d}],

where the subscript d indicates the components corresponding to the distinct anchors. The positive prediction Ŷ* in regression should likewise be replaced by Ŷ_d^*, meaning that only the selected anchors have the chance to be positive.

In conclusion, the mixture model loss can be expressed as a combination of a regression term and a classification term:

\mathcal{L}_{\mathrm{mixture\ model}} = \lambda_{reg}L_{reg} + \lambda_{cls}L_{cls}.
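Putting the pieces together, the winner-takes-all mixture loss can be sketched in a few lines of numpy. The helper below is our own illustration, not the released implementation: the names, the unit-variance Gaussian, and writing both terms as negated log-likelihoods to be minimized are our assumptions.

```python
import numpy as np

def wta_mixture_loss(pred_trajs, pred_scores, anchor_trajs, gt_traj, distinct_mask):
    """Winner-takes-all mixture loss with distinct-anchor masking.

    pred_trajs:    (K, T, 2) predicted trajectories.
    pred_scores:   (K,) predicted probabilities in (0, 1).
    anchor_trajs:  (K, T, 2) (evolving) anchors used for matching.
    gt_traj:       (T, 2) ground-truth trajectory.
    distinct_mask: (K,) boolean mask of anchors kept by NMS.
    Returns (scalar loss, index of the positive component).
    """
    # match the ground truth against anchors, not predictions,
    # and only the distinct anchors may become positive
    dist = np.linalg.norm(anchor_trajs - gt_traj, axis=-1).mean(axis=-1)  # (K,)
    dist = np.where(distinct_mask, dist, np.inf)
    pos = int(dist.argmin())

    # regression: unit-variance Gaussian NLL of the positive prediction
    reg = 0.5 * ((pred_trajs[pos] - gt_traj) ** 2).sum()

    # classification: binary cross entropy over the distinct components
    labels = np.zeros(len(pred_scores))
    labels[pos] = 1.0
    p = np.clip(pred_scores, 1e-6, 1 - 1e-6)
    bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    cls = bce[distinct_mask].sum()

    return reg + cls, pos
```

Note that anchors suppressed by `distinct_mask` contribute to neither the matching nor the classification term, matching the definition of L_cls above.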

The details in EDA.

The proposed Evolving and Distinct Anchors (EDA) is centered on the design of the loss: the anchors used to identify positive components are updated with the layer outputs and selected to be distinct before matching. Therefore, the paradigm can be readily applied to any network structure that includes a multi-layer decoder. When selecting distinct anchors through NMS, we sort the anchors by the scores from each layer, since these scores represent the probabilities of the corresponding anchors. Below is pseudo-code in PyTorch (Paszke et al. 2019) demonstrating the EDA paradigm:

for layer_idx in range(num_decoder_layers):
    # predictions of the current layer
    pred_trajs, pred_scores = pred_list[layer_idx]
    # evolving anchors for the current layer
    anchor_trajs = evolving_anchors[layer_idx]
    # selection of distinct anchors
    distinct_mask = nms(anchor_trajs, pred_scores)
    # compare anchors with the ground truth
    distance = compute_distance(anchor_trajs, gt)
    # compute the loss
    loss = mixture_model_loss(
        pred_trajs, pred_scores, gt,
        distinct_mask, distance,
    )
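The `nms` step in the pseudo-code can be realized as greedy suppression over anchor endpoints: anchors are visited in descending score order, and an anchor is discarded when its endpoint lies within a threshold of an already-kept, higher-scoring anchor. The sketch below is our own illustration; the threshold value and exact distance measure are assumptions, not the released code:

```python
import numpy as np

def nms(anchor_trajs, pred_scores, dist_thresh=2.5):
    """Greedy NMS over anchor endpoints; returns a boolean keep-mask of shape (K,)."""
    endpoints = anchor_trajs[:, -1]           # (K, 2) final positions
    order = np.argsort(-pred_scores)          # highest score first
    keep = np.zeros(len(pred_scores), dtype=bool)
    for idx in order:
        # suppress anchors too close to an already-kept, higher-scoring one
        if keep.any():
            d = np.linalg.norm(endpoints[keep] - endpoints[idx], axis=-1)
            if (d < dist_thresh).any():
                continue
        keep[idx] = True
    return keep
```

The same routine, applied to predicted trajectories instead of anchors, also serves to pick the top 6 predictions reported in the figures.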

Training details.

To align with the baseline, in addition to the mixture model loss, we also incorporate a loss for dense future prediction and an L1 loss on the agent velocity. As shown in Table 4, we adopt the same training configuration settings as in MTR (Shi et al. 2022a), where we train a single model for all three categories without any data augmentation.

Table 4: The training configuration.
number of epochs 30
batch size 80
optimizer AdamW
initial learning rate 0.0001
learning rate schedule epoch 22,24,26,28
learning rate decay factor 0.5
weight decay 0.01
attention dropout 0.1
regression weight 1.0
classification weight 1.0
dense regression weight 1.0
velocity regression weight 0.5
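The learning-rate schedule in Table 4 is a step schedule: the rate is halved after each listed milestone epoch. A plain-Python sketch of the resulting rate (whether the decay takes effect at the start or the end of a milestone epoch is our assumption):

```python
def lr_at_epoch(epoch, initial_lr=1e-4, milestones=(22, 24, 26, 28), decay=0.5):
    """Learning rate under a step schedule: multiply by `decay`
    once each milestone epoch has been reached."""
    lr = initial_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay
    return lr
```

In PyTorch this corresponds to a `MultiStepLR` scheduler with the milestones and gamma from Table 4.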

Per-class Results of EDA

We report the per-category performance of our EDA (the same model as in the benchmark results) on the Waymo Open Motion Dataset (Ettinger et al. 2021) for reference, as shown in Table 5.

Table 5: Per-class Performance on the validation and test sets of Waymo Open Motion Dataset (Ettinger et al. 2021).
Set Category mAP \uparrow minADE \downarrow minFDE \downarrow Miss Rate \downarrow
Test Vehicle 0.4807 0.6808 1.3921 0.1164
Pedestrian 0.4390 0.3426 0.7080 0.0670
Cyclist 0.4008 0.6920 1.4106 0.1673
Avg 0.4401 0.5718 1.1702 0.1169
Val Vehicle 0.4810 0.6820 1.3990 0.1176
Pedestrian 0.4279 0.3431 0.7124 0.0658
Cyclist 0.3970 0.6873 1.4077 0.1700
Avg 0.4353 0.5708 1.1730 0.1178

Inference Consumption

As a paradigm for defining positive components in mixture-model-based methods, EDA should, in theory, add no burden to the model during inference. To validate this empirically, we measure the average single-scene inference consumption of our baseline MTR and of EDA with varying evolving frequencies on the Waymo dataset, using an RTX 3090 GPU. As shown in Table 6, all models exhibit identical inference latency, GPU memory usage, and computational cost.

Table 6: The inference consumption of EDA.
Method Evolving Times Latency Memory Cost MACs
MTR - 62ms 1.83G 30.2G
EDA 1 62ms 1.83G 30.2G
EDA 2 62ms 1.83G 30.2G
EDA 5 62ms 1.83G 30.2G

Combined Effects of EDA

Our strong performance stems from the combined effect of evolving anchors and distinct anchors. The former enhances the regression capacity but also leads to the prediction clustering issue, which is then addressed by the latter. To further confirm these conclusions, we conduct an experiment on the effect of using distinct anchors alone. As shown in Table 7, distinct anchors by themselves bring little to no improvement, since the predefined anchors are already uniformly distributed.

Table 7: The performance of only applying distinct anchors.
Distinct Anchors mAP \uparrow minADE \downarrow minFDE \downarrow MR \downarrow
✗ 0.4171 0.6050 1.2376 0.1357
✓ 0.4208 0.6037 1.2342 0.1359

Ablation Study on Predefined Anchors

Our ablation experiments with different numbers of anchors further demonstrate the generalizability of our method. As shown in Table 8, EDA yields substantial performance improvements across different sets of predefined anchors. We also observe that with fewer anchors, more evolving steps help to enhance the regression capability, which is consistent with our analysis of the effects of evolving anchors.

Table 8: The additional ablation on anchors.
Number of Anchors Evolving Times Distinct Anchors mAP \uparrow minADE \downarrow minFDE \downarrow MR \downarrow
16 0 0.4009 0.6553 1.4137 0.1769
1 0.4138 0.6038 1.2750 0.1483
2 0.4207 0.5886 1.2299 0.1378
5 0.4201 0.5833 1.2041 0.1365
100 0 0.4158 0.6023 1.2263 0.1333
1 0.4302 0.5789 1.1917 0.1216
2 0.4231 0.5838 1.2083 0.1236

Additional Results on Argoverse 2

We conduct experiments on another large-scale dataset, Argoverse 2 (Wilson et al. 2021), using the SOTA method QCNet (Zhou et al. 2023) as the baseline. As shown in Table 9, our method achieves consistent improvements.

The decoder of the original QCNet consists of two stages: prediction-based proposal generation and anchor-based refinement. To better isolate the improvements brought by our matching paradigm, we first construct a pure anchor-based variant of QCNet with minimal modifications and then apply EDA to this variant. To keep the computational cost comparable (see the MACs column in Table 9), we use only 16 predefined anchor trajectories and employ NMS to select the top 6 predictions for evaluation. Apart from the metrics commonly used on Argoverse 2, we also report mAP computed with combined endpoint thresholds of 2, 4, and 6 m.

Table 9 presents the performance of the pure anchor-based variant of QCNet (Anchor), ours (Anchor + EDA), and the original QCNet (Original) as a reference. Due to its limited refinement capacity and the sparsity of the predefined anchors, the pure anchor-based variant struggles to generate accurate predictions. In contrast, EDA significantly improves the prediction quality and achieves the best mAP among all models, indicating that our method not only enhances the regression capability but also exhibits remarkable scoring performance.

Table 9: Top 6 metrics on the validation set of Argoverse 2.
QCNet mAP \uparrow b-minFDE \downarrow minADE \downarrow MR \downarrow MACs
Original 0.50 1.86 0.72 0.15 30G
Anchor 0.48 3.04 1.20 0.51 31G
Anchor + EDA 0.58 2.10 0.81 0.23 31G

Qualitative Results

We provide some qualitative results of our EDA on the Waymo Open Motion Dataset (Ettinger et al. 2021) in Fig. 8.

Refer to caption
Figure 8: Qualitative results of our EDA on the Waymo Open Motion Dataset (Ettinger et al. 2021). The top 6 predictions are displayed, with the darker colors indicating the higher scores. The rectangular shapes indicate the current positions of the agents, where the red one is the target agent, and the blue ones represent the surrounding agents.