

Equivariant Map and Agent Geometry for Autonomous Driving Motion Prediction

Yuping Wang University of Michigan
Ann Arbor, United States
[email protected]
   Jier Chen Shanghai Jiao Tong University
Shanghai, China
[email protected]
Abstract

In autonomous driving, deep-learning-enabled motion prediction is a popular topic. A critical gap in traditional motion prediction methodologies lies in ensuring equivariance under Euclidean geometric transformations and maintaining invariant interaction relationships. This research introduces a novel solution by employing EqMotion, a theoretically geometric equivariant and interaction invariant motion prediction model for particles and humans, and integrating agent-equivariant high-definition (HD) map features for context-aware motion prediction in autonomous driving. The use of EqMotion as the backbone marks a significant departure from existing methods by rigorously ensuring motion equivariance and interaction invariance. Equivariance here implies that an output motion must be equally transformed under the same Euclidean transformation as an input motion, while interaction invariance preserves the manner in which agents interact despite transformations. These properties make the network robust to arbitrary Euclidean transformations and contribute to more accurate predictions. In addition, we introduce an equivariant method to process the HD map to enrich the spatial understanding of the network while preserving the overall network equivariance property. By applying these techniques, our model achieves high prediction accuracy while maintaining a lightweight design and efficient data utilization.

Index Terms:
Autonomous Driving, Motion Prediction, Graph Convolution Networks, Equivariant Networks, HD Map

I Introduction

Autonomous driving hinges on the ability to accurately predict vehicle trajectories, a task that ensures safety and efficiency. Recent advancements in machine learning, particularly deep learning, offer promising avenues to address these challenges. However, existing approaches often falter, ignoring the critical need for equivariant motion prediction, where predictions respect the underlying geometric transformations of the environment. Concurrently, invariant interaction reasoning - preserving the interaction dynamics despite transformations - is paramount in multi-agent scenarios. This research employs EqMotion[1], designed with these principles in mind, as its backbone, and innovatively incorporates agent-equivariant maps. These maps, capturing high-definition features without being tied to a specific agent motion, provide a richer spatial understanding. Together, this combination paves the way for a novel approach in autonomous vehicle motion prediction, bridging theory and practical applicability to craft a more robust and insightful predictive model.

EqMotion[1], a groundbreaking motion prediction model, has made strides in various domains but has not explicitly adapted itself to the specific challenges of autonomous vehicle tasks. Its theoretical foundations in equivariance and invariant interaction reasoning are robust, yet the model lacks tailored features to address the complex dynamics and real-world transformations inherent in autonomous driving. A significant limitation is EqMotion’s omission of map integration, leaving a gap in contextual understanding of the environment. Without maps, the model’s ability to accurately predict intricate vehicular motions may be constrained. These limitations underscore the necessity for a specialized approach, inspiring this research. By building upon EqMotion’s backbone and incorporating HD map features and motion equivariant maps, this work aims to bridge these gaps, forging a novel path in autonomous vehicle motion prediction.

Map information, particularly lane centerlines, plays a crucial role in motion prediction for autonomous vehicles, offering vital context on road geometry and traffic constraints. When comparing representations, vectorized representations, as implemented in models like VectorNet[2], provide significant advantages over rasterized image representations typically used in Convolutional Neural Networks (CNNs)[3]. Vectorized representations retain the geometric structure and can be manipulated through mathematical transformations, facilitating a more compact and expressive encoding of spatial relationships. This approach enhances generalizability and reduces computational complexity. On the other hand, rasterized representations, while popular in traditional CNN-based models, may lead to a loss of geometric fidelity and incur higher computational costs. The principle of equivariance in vectorized representations further bolsters these advantages by ensuring consistent behavior under geometric transformations, enhancing the model’s robustness, and maintaining the essential spatial relationships between objects. By preserving this geometric integrity, equivariant vectorized representations offer a more nuanced and efficient approach for motion prediction in complex driving scenarios.

In this research, we have made significant strides in autonomous vehicle motion prediction by introducing several key contributions. First, we propose a novel framework that effectively leverages the equivariance property of map information, particularly in the context of lane centerlines. By embracing this geometric principle, our framework ensures consistent behavior under spatial transformations, offering a more robust and context-aware model. Second, we employ a transformer model on the vectorized map representation to extract the vital lane context. This approach allows us to capture intricate spatial relationships within the road environment, leading to more nuanced trajectory predictions. Finally, we employ EqMotion[1] to learn the equivariant agent geometric features and reason about the invariant interactions between agents. We conduct our experiments on an autonomous driving dataset, and the results not only affirm the efficacy of our approach but also represent a significant advancement in the field, offering a more robust and effective model for motion prediction in autonomous driving.

II Related Work

II-A Vehicle Trajectory Prediction

Autonomous vehicle motion prediction has been a focal area of research, with diverse methodologies and frameworks being proposed, a number of them utilizing deep neural networks[4, 5, 6]. LSTM-based models like Social-LSTM[7] have been utilized for capturing temporal dependencies. Convolutional Neural Networks (CNNs) have been employed to process rasterized map data[3]. More recently, attention mechanisms like Transformer models have been adapted for motion prediction[8]. Graph-based approaches like Graph Neural Networks have been explored for multi-agent interaction[9, 10, 11]. VectorNet[2] has introduced the use of vectorized representations, highlighting the importance of preserving geometric relationships. These diverse approaches lay a rich foundation for ongoing innovations in the field.

II-B Map Information Encoding

Map information encoding is a critical aspect of motion prediction, with two main approaches: rasterized and vectorized representations. Rasterized encoding, which converts map information into pixel grids, has been employed in many CNN-based models[3, 12]. It offers simplicity and compatibility with standard image processing techniques but suffers from loss of geometric precision and can be computationally expensive. In contrast, vectorized representations, like those used in VectorNet[2], preserve the geometric structure of the map, enabling accurate encoding of spatial relationships. Vectorized encoding ensures scalability and fidelity but may require more complex processing to fully exploit its potential. Recent advancements in transformer-based models have demonstrated the efficacy of vectorized encoding in autonomous driving applications[13]. The choice between these approaches hinges on specific requirements, balancing simplicity with geometric accuracy.

II-C Equivariant Feature Extraction

The concept of equivariance has become particularly prominent in the realm of 2D image analysis. Owing to the susceptibility of CNN structures to rotational alterations, there has been a pursuit of designs that embrace rotation-equivariance, such as the implementation of directional convolutional filters[14]. In Graph Neural Networks, [15] achieves both rotation and translation equivariance by using tensor field neural networks. In these works, equivariant models have shown their advantage of an efficient training process by avoiding data augmentation. A myriad of research initiatives have proposed equivariant layers tailor-made for specific tasks, including protein structure decoding[16]. However, many remain limited to state prediction and overlook sequence data. Few attempts, like one that focuses on coordinate processing, strive for motion equivariance. Our model innovates an equivariant map processor on top of the equivariant sequence predictor, EqMotion[1], to predict the trajectories for autonomous vehicles.

III Problem Formulation

In this section, we define the formulation of the motion prediction task, which is to predict the ego agent motion given the past motion of itself and neighbor agents, plus the lane centerlines near the ego agent.

Our first input consists of the historical trajectories of $N$ agents, including both the ego agent and its neighbors. For each agent $a$ at each timestep $t$ in the input sequence, we have its location vector $x_a^t$, given by its $x$, $y$ coordinates. Thus, we have $N \times T_{\text{in}}$ coordinate pairs, which we denote as $X \in \mathbb{R}^{N \times T_{\text{in}} \times 2}$. Our second input is a set of $Q$ lane centerlines, each represented by $L$ coordinate points, each point again given by its $x$, $y$ coordinates. Thus, we have $Q \times L$ coordinate pairs, which we denote as the map matrix $M \in \mathbb{R}^{Q \times L \times 2}$. Our output $\hat{y}$ contains the coordinates of the ego agent over the next $T_{\text{out}}$ timesteps; it therefore consists of $T_{\text{out}}$ coordinate pairs, $\hat{y} \in \mathbb{R}^{T_{\text{out}} \times 2}$.

During training and validation, we have access to the ground-truth future trajectory of the ego agent, $y \in \mathbb{R}^{T_{\text{out}} \times 2}$. Our learning goal is to have $\hat{y}$ mimic $y$ as closely as possible. Note that $N$, $T_{\text{in}}$, $T_{\text{out}}$, $Q$, and $L$ are fixed configuration parameters. When a scene contains fewer than $N$ agents or fewer than $Q$ centerlines, we pad the invalid entries of the matrices with 0.
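For concreteness, the input and output shapes can be sketched as follows; the variable names are illustrative (not from any released code), and the values follow Table I.

```python
import numpy as np

# Configuration parameters (values as in Table I)
N, T_in, T_out = 4, 20, 30   # agents, input timesteps, output timesteps
Q, L = 10, 100               # lane centerlines, points per centerline

# Inputs: agent history X and map M; missing agents/lanes are zero-padded
X = np.zeros((N, T_in, 2))   # past x, y coordinates of ego + neighbor agents
M = np.zeros((Q, L, 2))      # lane-centerline waypoints

# Output: predicted ego trajectory over the next T_out timesteps
y_hat = np.zeros((T_out, 2))
```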

IV Methodology

Figure 1: Model Architecture

IV-A Equivariant Map Feature

Map features provide the context for our ego motion prediction. To account for the equivariant relationship between the map and the ego agent, we first translate the map coordinate frame to align with the current location of the ego agent $a$ at time $t$, denoted $x_a^t$. For all centerlines and centerline points $M_l^q$ in the map, with $l \in [0, L]$ and $q \in [0, Q]$, we have:

M_{\text{centered},l}^{q} = M_{l}^{q} - x_{a}^{t}. (1)

In addition to translating the map, we also rotate the map to align with the ego agent's heading, which is computed from the current velocity:

v_{a}^{t} = x_{a}^{t} - x_{a}^{t-1}, (2)
\theta_{a}^{t} = \operatorname{arctan2}(v_{a}^{t}[1], v_{a}^{t}[0]), (3)
R_{\theta} = \begin{bmatrix} \cos(\theta_{a}^{t}) & -\sin(\theta_{a}^{t}) \\ \sin(\theta_{a}^{t}) & \cos(\theta_{a}^{t}) \end{bmatrix}, (4)
M_{\text{rotated},l}^{q} = R_{\theta} \cdot M_{\text{centered},l}^{q}. (5)
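A minimal NumPy sketch of Eqs. (1)-(5); the function name and array layout are our own, for illustration.

```python
import numpy as np

def transform_map_to_agent_frame(M, x_t, x_t_prev):
    """Translate and rotate lane centerlines into the ego agent's frame.

    M        : (Q, L, 2) centerline waypoints in world coordinates
    x_t      : (2,) ego position at time t
    x_t_prev : (2,) ego position at time t-1
    """
    # Eq. (1): translate so the ego agent sits at the origin
    M_centered = M - x_t

    # Eqs. (2)-(4): heading from the current velocity, then the rotation matrix
    v = x_t - x_t_prev
    theta = np.arctan2(v[1], v[0])
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])

    # Eq. (5): apply R to every waypoint (row vectors, hence the transpose)
    return M_centered @ R.T
```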

IV-B Explanation on Map Equivariance

We want to show that our map transformation is equivariant to the agent's motion. This means that for an arbitrary agent movement, the map's coordinates in the agent frame shift by the same amount. We denote each coordinate pair in the map $M$ as $m_i$ with $i \in [0, L \times Q]$. The agent's current position is $x_a^t$ and its heading is $\theta$.

IV-B1 Translation Equivariance

If the agent moves by a translation vector $b$, its new position is:

\tilde{x}_{a}^{t} = x_{a}^{t} + b. (6)

The transformed coordinates of $m_i$ after this translation, relative to the agent's new position, are:

\tilde{m_{i}} = m_{i} - \tilde{x}_{a}^{t} = m_{i} - (x_{a}^{t} + b) = m_{i} - x_{a}^{t} - b. (7)

This is equivalent to translating the original transformed coordinates $m_i - x_a^t$ by $-b$.

IV-B2 Rotation Equivariance

If the agent rotates by an angle $\phi$ in the counter-clockwise direction, then relative to the agent the map rotates by $\phi$ in the clockwise direction, so the rotation applied to the map becomes $R_{-\phi+\theta}$. The transformed coordinates of $m_i$ after this rotation, relative to the agent, are:

\check{m_{i}} = R_{(-\phi+\theta)} \cdot \tilde{m_{i}} = R_{-\phi} R_{\theta} \cdot \tilde{m_{i}}. (8)

This is equivalent to rotating the original rotated map of Eq. (5) in the clockwise direction by $\phi$.

Thus, the operation of transforming map coordinates to the frame of a moving agent is equivariant with respect to the agent’s movement.
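The translation case (Eqs. 6-7) can be checked numerically in a few lines; this is a sketch with arbitrary values.

```python
import numpy as np

rng = np.random.default_rng(0)
m_i = rng.normal(size=2)        # one map waypoint
x_t = rng.normal(size=2)        # agent position
b   = np.array([3.0, -1.5])     # arbitrary agent translation

before = m_i - x_t              # agent-centered coordinates, Eq. (1)
after  = m_i - (x_t + b)        # centered w.r.t. the moved agent, Eq. (7)
assert np.allclose(after, before - b)   # shifted by exactly -b
```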

IV-C Map Feature Encoding

Given that the map feature is now equivariant to the agent motion, we further expand its dimension space for the downstream backbone. Specifically, we reshape the rotated map $M_{\text{rotated}}$ so that the lanes and the waypoints per lane are vectorized into $M_{\text{vectorized}}$, where $M_{\text{vectorized},n}$ is a single coordinate pair and $n \in [0, L \times Q]$. We then encode $M_{\text{vectorized}}$ with a Transformer [17] layer that is composed of the following:

  1. Multi-Head Self-Attention

     \operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V, (9)

     where

     Q = M_{\text{vectorized}} \cdot W_{Q}, (10)
     K = M_{\text{vectorized}} \cdot W_{K}, (11)
     V = M_{\text{vectorized}} \cdot W_{V}, (12)

     and $W_Q$, $W_K$, $W_V$ are the matrices for linear transformation, while $d_k$ is the dimensionality of the keys in the multi-head attention mechanism.

  2. Add and Norm

     Z = \operatorname{LayerNorm}(M_{\text{vectorized}} + \operatorname{Attention}(Q, K, V)) (13)

  3. Feed-Forward Neural Network (FFN)

     \operatorname{FFN}(Z) = \operatorname{ReLU}(ZW_{1} + b_{1})W_{2} + b_{2}, (14)
     Y = \operatorname{LayerNorm}(Z + \operatorname{FFN}(Z)). (15)

The resulting vector $Y$ is then concatenated with the agent features and fed to the downstream network:

M_{\text{features}} = [X; Y] (16)
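A minimal PyTorch sketch of this map encoder (Eqs. 9-16), assuming a single standard Transformer encoder layer; the module name, layer widths, and head count are illustrative rather than the exact configuration.

```python
import torch
import torch.nn as nn

class MapEncoder(nn.Module):
    """Encodes vectorized, agent-aligned lane waypoints with one Transformer layer."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(2, d_model)              # lift (x, y) pairs to d_model
        self.encoder = nn.TransformerEncoderLayer(     # self-attention, add & norm,
            d_model=d_model, nhead=n_heads,            # and FFN, Eqs. (9)-(15)
            dim_feedforward=4 * d_model, batch_first=True)

    def forward(self, m_rotated):
        # m_rotated: (B, Q, L, 2) -> vectorize lanes and waypoints: (B, Q*L, 2)
        B = m_rotated.shape[0]
        m_vec = m_rotated.reshape(B, -1, 2)
        return self.encoder(self.proj(m_vec))          # Y: (B, Q*L, d_model)

# Eq. (16): concatenate the map tokens with the agent features along the token
# axis (here the agent features are assumed to be already embedded to d_model).
encoder = MapEncoder()
m_rotated = torch.zeros(1, 10, 100, 2)
agent_features = torch.zeros(1, 4, 64)
m_features = torch.cat([agent_features, encoder(m_rotated)], dim=1)
```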

IV-D Geometric Feature and Pattern Feature Learning

Here we employ the EqMotion network [1]. Specifically, given the combined features $M_{\text{features}}$ from above, we obtain the initial geometric feature $G^{0} \in \mathbb{R}^{N \times T_{\text{out}}}$ and pattern feature $H^{0} \in \mathbb{R}^{N \times hidden\_dim}$:

G^{0}, H^{0} = \mathfrak{F}_{\text{FeatureInitLayer}}(M_{\text{features}}). (17)

We then apply a graph convolution network to infer the relationships between the agents and the map context:

e_{ij} = \mathfrak{F}_{\text{GraphConvolutionLayer}}(G^{0}, H^{0}). (18)

With the above, we sequentially apply two networks to learn the geometric and pattern features in an interleaving fashion. This step repeats $P$ times, where $P$ is a hyperparameter:

G^{p} = \mathfrak{F}_{\text{GeometricLayer}}(G^{p-1}, H^{p-1}, e_{ij}), (19)
H^{p} = \mathfrak{F}_{\text{PatternLayer}}(G^{p-1}, H^{p-1}). (20)

For the detailed implementation of the above layers please refer to EqMotion[1].
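The interleaved updates of Eqs. (17)-(20) can be sketched as follows; the layer arguments are placeholders standing in for the corresponding EqMotion layers, not their actual implementation.

```python
def eqmotion_feature_learning(m_features, feat_init, graph_conv,
                              geometric_layers, pattern_layers):
    """Interleaved geometric/pattern feature learning, Eqs. (17)-(20).

    feat_init, graph_conv, and the two layer lists are callables standing in
    for the EqMotion layers; geometric_layers and pattern_layers have length P.
    """
    G, H = feat_init(m_features)                 # Eq. (17): initial G^0, H^0
    e_ij = graph_conv(G, H)                      # Eq. (18): interaction edges
    for geo, pat in zip(geometric_layers, pattern_layers):
        # Eqs. (19)-(20): both updates read the previous G^{p-1}, H^{p-1};
        # the tuple assignment evaluates both right-hand sides first.
        G, H = geo(G, H, e_ij), pat(G, H)
    return G, H
```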

IV-E Output Decoder

We take the final geometric feature $G^{P} \in \mathbb{R}^{N \times hidden\_dim}$ and decode it with a 4-layer MLP, which produces the output trajectory $\hat{y} \in \mathbb{R}^{T_{\text{out}} \times 2}$. We also add back the current position of the agent:

\hat{y} = \operatorname{MLP}(G^{P} - \bar{G^{P}}) + \bar{G^{P}} + \bar{X_{0}}. (21)
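A simplified PyTorch sketch of this decoder; the centering and de-centering of Eq. (21) is abbreviated to subtracting the feature mean and adding back the agent's current position, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    """4-layer MLP decoding the final geometric feature into T_out (x, y) points."""
    def __init__(self, hidden_dim=64, t_out=30):
        super().__init__()
        self.t_out = t_out
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, t_out * 2))

    def forward(self, G_P, x_current):
        # Center the geometric feature, decode offsets, then shift the result
        # to the agent's current position (simplified reading of Eq. (21)).
        G_mean = G_P.mean(dim=-1, keepdim=True)
        offsets = self.mlp(G_P - G_mean).reshape(-1, self.t_out, 2)
        return offsets + x_current.unsqueeze(1)   # (B, T_out, 2)

decoder = TrajectoryDecoder()
y_hat = decoder(torch.zeros(8, 64), torch.zeros(8, 2))   # (8, 30, 2)
```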

IV-F Loss Function

We use the Average Displacement Error as the loss function:

\text{ADE} = \frac{1}{T}\sum_{t=1}^{T}\|\hat{y}_{t} - y_{t}\|_{2}. (22)
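In code, Eq. (22) is simply the mean per-timestep L2 distance; a PyTorch sketch:

```python
import torch

def ade_loss(y_hat, y):
    """Average Displacement Error, Eq. (22).

    y_hat, y: (B, T_out, 2) predicted and ground-truth trajectories.
    """
    return torch.linalg.norm(y_hat - y, dim=-1).mean()
```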

V Experiments

V-A Dataset and Evaluation Metrics

In our work, we used the Argoverse motion forecasting dataset[18], a comprehensive and well-regarded collection specifically designed for autonomous vehicle applications. Argoverse 1 provides a training set of around 200k scenes, each of which includes high-definition maps and both ego and neighbor agent trajectories. By leveraging this dataset, we were able to validate and benchmark our model's performance in realistic scenarios, ensuring that the findings are representative and applicable to real-world autonomous driving contexts. In this dataset, agent history is composed of coordinates from the past 2 seconds, and agent futures are the next 3 seconds, both sampled at 10 Hz.

In our comprehensive evaluation, we utilized both Average Displacement Error (ADE) and Final Displacement Error (FDE) metrics at multiple time intervals (1s, 2s, and 3s) to thoroughly assess the predictive capabilities of our model. ADE measures the average difference between predicted and actual trajectories over specified time horizons, providing insights into prediction accuracy. On the other hand, FDE quantifies the endpoint difference between predicted and actual trajectories, offering a distinct perspective on model performance. By employing both ADE and FDE at different time intervals, we gained a comprehensive understanding of our model’s accuracy in both intermediate and long-term predictions. This multi-faceted evaluation approach enabled us to identify not only immediate accuracy but also the ability of our model to sustain accurate predictions over longer horizons. These metrics reinforce the rigor of our evaluation process and highlight our commitment to providing reliable and insightful autonomous vehicle motion predictions.
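As a reference, the two metrics can be evaluated at the 1s, 2s, and 3s horizons as sketched below; at the dataset's 10 Hz sampling rate these horizons correspond to the first 10, 20, and 30 predicted steps.

```python
import torch

def displacement_errors(y_hat, y, horizons=(10, 20, 30)):
    """ADE/FDE at multiple horizons; y_hat, y are (B, T_out, 2) sampled at 10 Hz."""
    dist = torch.linalg.norm(y_hat - y, dim=-1)          # (B, T_out) per-step error
    metrics = {}
    for h in horizons:
        metrics[f"ADE@{h / 10:.0f}s"] = dist[:, :h].mean().item()    # average error
        metrics[f"FDE@{h / 10:.0f}s"] = dist[:, h - 1].mean().item() # endpoint error
    return metrics
```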

V-B Implementation Details

During training, we use the Adam optimizer with a learning rate of 1e-5. We train our network for 20 epochs with a batch size of 512. Our training hardware is an NVIDIA RTX 3060 Ti with 8 GB of graphics memory. The training loss decay can be seen in Figure 2.
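A runnable sketch of this training setup, with a placeholder model and synthetic data standing in for the real network and data loader.

```python
import torch
import torch.nn as nn

# Placeholder model and one synthetic batch, for illustration only
model = nn.Sequential(nn.Flatten(), nn.Linear(4 * 20 * 2, 30 * 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)     # Adam, lr = 1e-5

X = torch.randn(512, 4, 20, 2)    # one batch of agent histories (batch size 512)
y = torch.randn(512, 30, 2)       # corresponding ground-truth futures

for epoch in range(20):           # 20 epochs (here over a single synthetic batch)
    optimizer.zero_grad()
    y_hat = model(X).reshape(-1, 30, 2)
    loss = torch.linalg.norm(y_hat - y, dim=-1).mean()        # ADE loss, Eq. (22)
    loss.backward()
    optimizer.step()
```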

Figure 2: Training Loss Decay

In our model, we used the configuration parameters as in Table I.

Table I: Configuration Parameters
Description Name Value
Length of input sequence $T_{\text{in}}$ 20
Length of output sequence $T_{\text{out}}$ 30
Number of total agents in the scene $N$ 4
Number of lane centerlines in the map $Q$ 10
Number of points to represent each centerline $L$ 100
Number of repeats of the feature learning layers $P$ 20
Size of all hidden dimensions $hidden\_dim$ 64
Number of layers in MLP 4
Number of heads in transformer 12

V-C Baseline Model

To measure the performance of our model, we developed a Long Short Term Memory (LSTM) based network to serve as the baseline. LSTM is a popular network that is able to learn a temporal sequence such as in [7] and [19], in which LSTM is used to predict vehicle behavior at intersections.

Figure 3: LSTM Baseline Architecture

In our LSTM network, we rasterize the map into a $200 \times 200$ boolean image in which lane-centerline pixels have a value of 1 and all others have 0. The image then passes through two convolutional layers, each with $3 \times 3$ kernels followed by a MaxPool layer with stride 2. The convolved feature then goes through a 4-layer MLP to produce a feature vector, which we concatenate with the feature vector produced by the LSTM. Lastly, we use a 4-layer MLP to produce the output trajectory.
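A minimal sketch of this baseline; the convolution channel counts and MLP widths are illustrative, since the text does not specify them.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    """CNN over a rasterized map + LSTM over ego history (illustrative widths)."""
    def __init__(self, hidden=64, t_out=30):
        super().__init__()
        self.t_out = t_out
        # Two conv layers with 3x3 kernels, each followed by stride-2 max pooling
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten())
        self.map_mlp = nn.Sequential(            # 4-layer MLP over map features
            nn.Linear(16 * 50 * 50, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden))
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(               # 4-layer MLP output decoder
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, t_out * 2))

    def forward(self, history, raster):
        # history: (B, T_in, 2) ego track; raster: (B, 1, 200, 200) boolean map
        map_feat = self.map_mlp(self.cnn(raster.float()))
        _, (h, _) = self.lstm(history)           # final hidden state, (1, B, hidden)
        fused = torch.cat([h[-1], map_feat], dim=-1)
        return self.head(fused).reshape(-1, self.t_out, 2)
```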

V-D Comparative Studies

To provide a better insight into the effectiveness of using the map feature processing and encoding logic via Transformer blocks, we performed experiments with these modifications.

  1. Skipping the map rotation logic in Eq. (5) and performing only the translation.

  2. Replacing the transformer block with a self-attention layer[20] with a graph size of 64.

  3. Replacing the transformer block with the same CNN model as in the baseline.

  4. Ablating the map processing logic to measure the impact of the map information.

Table II: Comparison of Different Models
Method ADE at 1s ADE at 2s ADE at 3s FDE at 1s FDE at 2s FDE at 3s Parameters Training Time
LSTM Baseline 0.922 1.827 2.957 1.806 3.994 6.503 10.1M 56min
Our model, no map 0.602 1.037 1.667 0.982 2.182 3.715 5.8M 213min
Our model, translated map only, attention 0.606 1.040 1.663 0.997 2.181 3.682 8.9M 227min
Our model, rotated map, attention 0.588 1.036 1.661 0.987 2.180 3.662 8.9M 224min
Our model, translated map only, transformer 0.556 0.989 1.619 0.894 2.023 3.490 10.1M 195min
Our model, rotated map, transformer 0.549 0.967 1.591 0.895 1.996 3.437 10.1M 211min
Our model, rotated map, CNN 0.812 1.320 2.051 1.252 2.64 4.41 47M 18hrs
Figure 4: Visualizations of the predicted trajectories, agent history input, and ground truth trajectories.

V-E Quantitative Results

The quantitative results are shown in Table II. Rows 1 - 2 are the results for not using a map. Rows 3 - 7 are the results for using a map.

Our LSTM baseline has higher errors in ADE and FDE, signifying less accurate predictions, while having a moderate model size and a reasonable training time. By replacing the LSTM backbone with EqMotion[1], we see an immediate improvement in both ADE and FDE compared to the baseline, as well as reduced complexity in terms of the number of parameters. The lengthier training time may indicate a more intricate learning process. We observe that regardless of the map feature encoder used, aligning the map with the agent heading yields a performance improvement. Note that rotating the map does not incur an increase in the number of parameters, so it does not affect model size or training time. Using a transformer model rather than a simpler self-attention one costs 1.2M more parameters but delivers better performance. Lastly, supplying a rasterized view of the map to the model performs worse than our equivariant map processor, even with many more parameters and a longer training time.

In conclusion, our models with map information generally outperform the non-map models in prediction accuracy. Architectural choices such as map rotation and the attention/transformer encoder showed noticeable improvements. The model with the rotated map and transformer architecture appears to be the overall better choice, given that the models in rows 3-6 have similar sizes. Careful consideration is needed for the CNN variant, as it has higher complexity and a significantly longer training time. Further experiments with hyperparameter tuning, different feature engineering, or additional data might lead to more robust insights.

V-F Qualitative Results

Some visualizations of our model predictions are shown in Figure 4. We observe that the predicted trajectories stay close to the ground truth, suggesting our model predicts reasonably well. Additionally, the ground-truth data suffers from sampling noise, which appears as spikes in the plots. Our prediction, however, produces a much smoother and more realistic trajectory.

VI Conclusion

In conclusion, this research presents an innovative integration of EqMotion and HD map features, a first in autonomous vehicle motion prediction. The theoretical rigor and empirical success of this approach promise a transformative impact on the field, leading to more robust, accurate, and generalized motion prediction. The synergy between geometric understanding, interaction reasoning, and environmental mapping sets a new benchmark for autonomous driving technology, with far-reaching implications for the future of transportation.

References

  • [1] C. Xu, R. T. Tan, Y. Tan, S. Chen, Y. G. Wang, X. Wang, and Y. Wang, “Eqmotion: Equivariant multi-agent motion prediction with invariant interaction reasoning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1410–1420.
  • [2] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 525–11 533.
  • [3] C. Choi, J. H. Choi, J. Li, and S. Malla, “Shared cross-modal trajectory prediction for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
  • [4] J. Li, H. Ma, Z. Zhang, J. Li, and M. Tomizuka, “Spatio-temporal graph dual-attention network for multi-agent prediction and tracking,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 10 556–10 569, 2021.
  • [5] J. Li, F. Yang, M. Tomizuka, and C. Choi, “Evolvegraph: Multi-agent trajectory prediction with dynamic relational reasoning,” NeurIPS, vol. 33, pp. 19 783–19 794, 2020.
  • [6] J. Li, H. Ma, and M. Tomizuka, “Conditional generative neural system for probabilistic trajectory prediction,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019.
  • [7] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971.
  • [8] J. Ngiam, V. Vasudevan, B. Caine, Z. Zhang, H. L. Chiang, J. Ling, R. Roelofs, A. Bewley, C. Liu, A. Venugopal, D. J. Weiss, B. Sapp, Z. Chen, and J. Shlens, “Scene transformer: A unified architecture for predicting future trajectories of multiple agents,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
  • [9] D. Cao, J. Li, H. Ma, and M. Tomizuka, “Spectral temporal graph neural network for trajectory prediction,” in 2021 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2021.
  • [10] H. Girase, H. Gang, S. Malla, J. Li, A. Kanehara, K. Mangalam, and C. Choi, “Loki: Long term and key intentions for trajectory prediction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9803–9812.
  • [11] F.-Y. Sun, I. Kauvar, R. Zhang, J. Li, M. J. Kochenderfer, J. Wu, and N. Haber, “Interaction modeling with multiplex attention,” Advances in Neural Information Processing Systems, vol. 35, 2022.
  • [12] H. Ma, Y. Sun, J. Li, and M. Tomizuka, “Multi-agent driving behavior prediction across different scenarios with self-supervised domain knowledge,” in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC).   IEEE, 2021, pp. 3122–3129.
  • [13] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “VAD: vectorized scene representation for efficient autonomous driving,” CoRR, vol. abs/2303.12077, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.12077
  • [14] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia, “Rotation equivariant vector field networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5048–5057.
  • [15] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley, “Tensor field networks: Rotation-and translation-equivariant neural networks for 3d point clouds,” arXiv preprint arXiv:1802.08219, 2018.
  • [16] C. Shi, C. Wang, J. Lu, B. Zhong, and J. Tang, “Protein sequence and structure co-design with equivariant translation,” in The Eleventh International Conference on Learning Representations, 2023.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
  • [18] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019.
  • [19] Y. Jeong, S. Kim, and K. Yi, “Surround vehicle motion prediction using lstm-rnn for motion planning of autonomous vehicles at multi-lane turn intersections,” IEEE Open Journal of Intelligent Transportation Systems, vol. 1, pp. 2–14, 2020.
  • [20] J. Li, F. Yang, H. Ma, S. Malla, M. Tomizuka, and C. Choi, “Rain: Reinforced hybrid attention inference network for motion forecasting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.