Social Occlusion Inference with Vectorized Representation for Autonomous Driving
Abstract
Autonomous vehicles must be capable of handling occlusion in the environment to ensure safe and efficient driving. The social occlusion inference task focuses on inferring occupancy from agent behaviors as a remedy for perceptual deficiencies. We identify visible trajectories, road context, and occlusion information as the three key environmental elements, represented by vectors in our method. Accordingly, this paper introduces a novel social occlusion inference approach that learns a mapping from these three types of vectors to an occupancy grid map (OGM) representing the view of the ego vehicle. Specifically, the vectorized inputs are encoded by a polyline encoder that aggregates vector-level features into polyline-level features. Since vehicles are constrained by the road and affected by other agents and occlusion areas, we exploit a transformer module that models the high-order social interactions among the three types of polylines. Importantly, to address the inconsistency between the input and output modalities and to introduce prior knowledge of occlusion, we propose occlusion queries that fuse the polyline features and generate the OGM without any visual input. We evaluate our approach on the INTERACTION dataset, where it achieves performance on par with or better than the baselines. An ablation study demonstrates that all three key input elements enhance the performance of our network.
Index Terms:
Occlusion inference, Vectorized representation, Social interaction, Transformer

I Introduction
Perception of the environment is an integral part of ensuring the safety and efficiency of autonomous driving. However, the driving environment of autonomous vehicles can be full of occlusion caused by static or dynamic obstacles, especially in urban areas [1]. Humans also suffer from visual limitations when driving. Still, they can usually sense abnormal behaviors of surrounding vehicles based on intuition and experience and infer that there may be moving pedestrians or vehicles in occluded areas [2]. To be safe and efficient, autonomous vehicles should likewise be capable of making inferences about occluded regions to make up for the lack of perception [3].
Occlusion inference in autonomous driving estimates whether occluded regions are free or occupied space. Afolabi et al. [4] exploited the fact that the actions of other intelligent agents also carry useful information. In addition to the sensors equipped on the ego vehicle, they modeled other vehicles as additional sensors to infer the occluded regions of a map, coining the term People as Sensors (PaS). They used occupancy grid map (OGM) [5] representations to describe the occluded regions. Itkina et al. [6] extended PaS to incorporate distributional multimodality and a multi-agent framework. They learned a driver sensor model that maps observed driver trajectories to different occupancy patterns ahead of the driver and fused all the OGMs inferred from driver sensors into the environment map. Even though they took a fusion-based approach to multimodality, their method still lacks global interaction information because the environment map is divided. Moreover, neither of the above methods takes into account road context or occlusion information, both of which influence vehicle trajectories.

Inspired by VectorNet [7], we use vectors to represent the dynamic trajectories, road context, and occlusion information. All the static and dynamic information has a structured physical meaning. As illustrated in Fig. 1, the structured features can be represented as points, polygons, or curves in geographic coordinates. In particular, occlusion in the environment is caused by other vehicles blocking the view of the ego vehicle, which prevents the ego vehicle from detecting vehicles in the invisible areas. We use vectors to represent the boundaries of the occluded areas.
Our approach aims to compensate for perceptual deficits by making inferences about unseen regions, thereby reducing uncertainty. OGMs are well suited to representing occlusion [8, 9]. The inference task then becomes determining whether each grid cell in the occluded regions is occupied or free, as illustrated in Fig. 1. To solve the proposed social occlusion inference task, we use a learning-based method to reason about occluded spaces. With the vectorized vehicle trajectories, road context, and occlusion information as input, an attention-based polyline encoder aggregates the features of vector sets into polyline features. We then employ an interaction-aware transformer block to model the social interactions among the polyline features. Inspired by DETR [10], we propose occlusion queries to tackle the modality mismatch between input and output and to introduce prior knowledge of occlusion. Stacked cross-attention and self-attention layers let the occlusion queries fuse the encoded polyline features, and their outputs are concatenated to form the predicted result. In summary, our contributions are:
- We present a novel hierarchical transformer framework that enables end-to-end training for the social occlusion inference task, learning a mapping from three types of polylines to an OGM.
- We propose occlusion queries to tackle the modality mismatch between input and output; they introduce prior knowledge of occlusion and fuse the polyline features through the cross-attention mechanism.
- We validate our framework on the INTERACTION dataset [11], where the proposed vector-based approach achieves state-of-the-art performance.
II Related Work
Uncertainty Awareness. Autonomous vehicles must be aware of uncertainty to ensure safety. Some prior work focuses on motion planning in the presence of uncertainty [12, 13, 14, 15]. Taş and Stiller [12] define criteria that measure the available margins to a collision so that the vehicle remains collision-free under the worst-case evolution. Sun et al. [13] propose a social perception scheme that learns a cost function from observed vehicles to avoid collisions. Rezaee et al. [14] present a reinforcement learning-based solution that manages uncertainty by optimizing for the worst-case outcome. Nager et al. [15] describe a motion planning pipeline that accounts for hypothetical hidden agents. Ren et al. [16] study a motion prediction task that considers unseen vehicles. All these approaches consider occlusion in prediction or planning tasks, whereas we focus on inferring occlusion as a remedy for perceptual deficiencies.
Social Interaction Modeling. Attention mechanisms are the preferred method for modeling the interactions between vehicles. Graph attention networks [17, 18, 19, 20, 7] apply a self-attention mechanism to model the interactions between nodes along the edges of a graph. Yuan et al. [21] propose a transformer-based approach that models the time and social dimensions for trajectory forecasting. Our method uses a transformer encoder to jointly model the interactions between agents, static road context, and occlusions.
Social Occlusion Inference. To date, few approaches directly model pedestrian behaviors or vehicle trajectories to estimate occlusion. The PaS approaches proposed by Afolabi et al. [4] and Itkina et al. [6] are the most closely related to our work. Afolabi et al. [4] introduce and formalize the concept of PaS for imputing maps. To model the driver as a sensor, they define five categories of vehicle actions: moving fast, moving slow, accelerating, decelerating, and stopped. They cluster vehicle trajectories into these actions and then learn occupancy probabilities of the OGM for each action. This approach is evaluated on a simulated crosswalk scenario with a single driver sensor and a single occluded pedestrian, and it does not account for the multimodality of the occlusion inference task. Itkina et al. [6] extend PaS to incorporate distributional multimodality in a multi-agent framework, presenting a two-stage occlusion inference algorithm. In the first stage, a driver sensor model is learned to map a vehicle's trajectory to a discrete set of possibilities for the local OGM ahead of the driver, using a conditional variational autoencoder (CVAE) to account for multimodality. In the second stage, all the local OGMs are fused into the environment map to obtain the global OGM, using a multi-agent sensor fusion mechanism based on evidential theory. However, these approaches neglect the social interactions between agents and the interactions between agents and the environment. We propose to use a transformer encoder to model the high-order interactions between the different elements.

III Method
III-A Problem Formulation
We assume the receptive field of the ego vehicle to be a neighborhood bounding box. Under this circumstance, we represent the bounding box as an OGM $\mathcal{G} \in \mathbb{R}^{L \times W}$, where $L$ and $W$ are its length and width, respectively. Occupied grid cells have an occupancy probability of 1, while free grid cells have a probability of 0. We consider grid cells to be occluded if their centers lie in invisible areas. To record the occluded grids, we introduce a second OGM $\mathcal{O} \in \{0,1\}^{L \times W}$, in which 1 indicates an occluded grid and 0 a visible grid. The inference task aims to infer the occupancy probability of each grid cell in occlusion (i.e., where $\mathcal{O}_{ij} = 1$). The social occlusion inference task can thus be formulated as predicting an output $\hat{\mathcal{G}}$ at the current time conditioned on the past and present states of visible traffic agents and the road context within the neighborhood.
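As a concrete illustration of this grid construction, the following minimal Python sketch builds the two maps; the occupancy and visibility predicates, grid size, and resolution are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of the two grid maps in Sec. III-A. The occupancy and
# visibility predicates below are hypothetical stand-ins for the dataset's
# ground truth and the ego vehicle's visibility polygon.
L, W, res = 40, 40, 1.0

def is_occupied(cx: float, cy: float) -> bool:
    return 10.0 <= cx <= 14.0 and 20.0 <= cy <= 22.0  # one toy vehicle footprint

def is_visible(cx: float, cy: float) -> bool:
    return cx <= 12.0                                  # toy visibility boundary

G = np.zeros((L, W))  # occupancy OGM: 1 = occupied, 0 = free
O = np.zeros((L, W))  # occlusion OGM: 1 = occluded, 0 = visible
for i in range(L):
    for j in range(W):
        cx, cy = (i + 0.5) * res, (j + 0.5) * res      # cell center
        G[i, j] = float(is_occupied(cx, cy))
        O[i, j] = float(not is_visible(cx, cy))        # center-in-shadow rule
```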
For the vectorized inputs, we define three types of polylines, $\mathcal{P}^{traj}$, $\mathcal{P}^{road}$, and $\mathcal{P}^{occ}$, that represent trajectory, road context, and occlusion, respectively. Let $\mathcal{P} = \mathcal{P}^{traj} \cup \mathcal{P}^{road} \cup \mathcal{P}^{occ}$; each polyline $P \in \mathcal{P}$ denotes a set of vectors $\{v_1, v_2, \ldots, v_n\}$. We define the features of the vectors belonging to the different polyline types as:

$$v^{traj} = [x_s, y_s, x_e, y_e, t], \qquad v^{road} = [x_s, y_s, x_e, y_e, c], \qquad v^{occ} = [x_s, y_s, x_e, y_e] \qquad (1)$$

where $v^{traj}$, $v^{road}$, and $v^{occ}$ are vectors of trajectory, road context, and occlusion, respectively; $(x_s, y_s)$ and $(x_e, y_e)$ are the coordinates of the start and end points of the vector; $t$ in $v^{traj}$ is the end timestamp of the trajectory clip; and $c$ in $v^{road}$ determines the type of road context.

Given a mapping model $f$ with parameters $\theta$, the occlusion inference task is formulated as:

$$\hat{\mathcal{G}} = f\left(\mathcal{P}^{traj}, \mathcal{P}^{road}, \mathcal{P}^{occ}; \theta\right) \qquad (2)$$
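The following sketch shows how such vector features might be assembled in practice; zero-padding the occlusion vectors to a common width, so that all three types can share one encoder, is our assumption rather than the paper's stated design.

```python
import numpy as np

# Sketch of Eq. (1): every vector carries start/end coordinates plus a
# type-specific attribute (end timestamp t for trajectory, class id c for
# road context). Occlusion vectors are zero-padded to the common width.
def traj_vector(xs, ys, xe, ye, t):
    return np.array([xs, ys, xe, ye, t], dtype=np.float32)

def road_vector(xs, ys, xe, ye, c):
    return np.array([xs, ys, xe, ye, c], dtype=np.float32)

def occ_vector(xs, ys, xe, ye):
    return np.array([xs, ys, xe, ye, 0.0], dtype=np.float32)  # zero-padded

# A polyline is a set of consecutive vectors, e.g. a three-step trajectory:
polyline = np.stack([
    traj_vector(0.0, 0.0, 1.0, 0.2, t=-0.2),
    traj_vector(1.0, 0.2, 2.1, 0.5, t=-0.1),
    traj_vector(2.1, 0.5, 3.0, 0.9, t=0.0),
])
```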

III-B Model Framework
Fig. 2 shows the overall structure of our occlusion inference approach. Vectorized inputs are encoded by the polyline encoder to generate polyline features, which are then concatenated with their type embeddings through an MLP layer. We adopt an interaction-aware transformer with residual connections to model the social interactions between the three types of polylines. Stacked cross-attention and self-attention layers fuse the embedded polyline features with the proposed occlusion queries, which are ultimately transformed and concatenated to obtain the final result $\hat{\mathcal{G}}$.
III-C Vector Encoder
The process of encoding the vectorized input consists of two primary stages. In the first stage, vector features are aggregated into polyline features. In the second stage, the polyline features are encoded with interaction awareness.
1) Polyline Encoder: All three types of polylines are encoded by a shared polyline encoder based on the attention mechanism. As illustrated in Fig. 3, given a polyline $P$, its vector features are first embedded through an MLP layer to expand and unify the dimensions, and then added to position embeddings to obtain the vector embeddings. Specifically, we introduce a learnable token and concatenate it with the other vector embeddings to form the input of a multi-head self-attention (MSA) layer. After the MSA layer, the information of all other tokens is aggregated into the learnable token, which is treated as the polyline feature.
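A minimal PyTorch sketch of this aggregation scheme is given below; the feature dimensions, sequence length, and the use of `nn.MultiheadAttention` are illustrative choices rather than the paper's released code.

```python
import torch
import torch.nn as nn

class PolylineEncoder(nn.Module):
    """Sketch of the polyline encoder: embed vectors with an MLP, add
    position embeddings, prepend a learnable token, and let one MSA layer
    aggregate the whole polyline into that token."""

    def __init__(self, in_dim=5, d_model=128, max_len=64, n_heads=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable token
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vectors):                    # (B, n, in_dim)
        e = self.mlp(vectors) + self.pos[:, :vectors.size(1)]
        x = torch.cat([self.token.expand(e.size(0), -1, -1), e], dim=1)
        x, _ = self.msa(x, x, x)                   # aggregate into the token
        return x[:, 0]                             # polyline feature (B, d_model)

p = PolylineEncoder()(torch.randn(2, 10, 5))       # two polylines of 10 vectors
```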
2) Interaction-Aware Transformer: Social interaction modeling is a standard method for capturing the interactions between agents [21, 22]. In addition, agent behaviors are constrained by the road context and influenced by the occlusion areas. Therefore, we go one step further by jointly modeling the interactions between agents, road context, and occlusion for the inference task. We use a transformer [23] encoder composed of 6 layers to model the high-order interactions among these features. Each transformer layer consists of an MSA block followed by a position-wise fully connected feed-forward network (FFN), with a residual connection after each block followed by layer normalization (LN).
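Under the same assumptions, this interaction-aware stage can be sketched directly with the standard PyTorch encoder stack; the hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

d_model = 128
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                   dim_feedforward=512, batch_first=True)
interaction_encoder = nn.TransformerEncoder(layer, num_layers=6)

# One row per polyline: trajectory, road-context, and occlusion features
# (with type embeddings already added) are modeled jointly.
polylines = torch.randn(2, 32, d_model)
context = interaction_encoder(polylines)  # same shape, now interaction-aware
```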
III-D Decoder

The occlusion inference task predicts an OGM, while the model takes vectorized input. To address this modality mismatch between input and output, we propose occlusion queries, as illustrated in Fig. 4. Considering the occlusion OGM $\mathcal{O}$, blue grids represent invisible areas while white grids represent visible areas. We reshape it into a sequence of flattened patches $\mathcal{O}_p \in \mathbb{R}^{N \times P^2}$, where $P \times P$ is the resolution of each map patch and $N = LW/P^2$ is the number of patches. After position embedding, we obtain the occlusion queries. Meanwhile, the occlusion queries introduce prior knowledge of the occlusion information, which allows the model to focus on the regions of interest.
The occlusion queries first pass through a self-attention module to learn each other's position information and to increase their dimension to match that of the polyline features. The decoder then stacks cross-attention and self-attention layers: the occlusion queries extract polyline features in the cross-attention modules and interact with each other in the self-attention modules. Finally, we obtain the output $\hat{\mathcal{G}}$ by reshaping and concatenating the occlusion queries.
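The sketch below reflects our reading of this decoder; the patch size, layer count, and the use of `nn.TransformerDecoder` (whose layers interleave self-attention with cross-attention to the encoder memory) are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class OcclusionDecoder(nn.Module):
    """Sketch of Sec. III-D: patches of the occlusion OGM become queries
    that read the encoded polylines via cross-attention; refined queries
    are projected back to per-patch occupancy."""

    def __init__(self, P=5, L=40, W=40, d_model=128, n_layers=3):
        super().__init__()
        self.P, self.n = P, (L // P) * (W // P)           # patch size, count
        self.embed = nn.Linear(P * P, d_model)            # patch -> query
        self.pos = nn.Parameter(torch.zeros(1, self.n, d_model))
        dec = nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=512,
                                         batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=n_layers)
        self.head = nn.Linear(d_model, P * P)             # query -> patch

    def forward(self, occ_map, polyline_feats):
        B, P = occ_map.size(0), self.P
        patches = (occ_map.unfold(1, P, P).unfold(2, P, P)  # (B, L/P, W/P, P, P)
                          .reshape(B, self.n, P * P))
        q = self.embed(patches) + self.pos                  # occlusion queries
        q = self.decoder(q, polyline_feats)                 # self- + cross-attn
        return torch.sigmoid(self.head(q))  # (B, n, P*P); fold back to (L, W)

out = OcclusionDecoder()(torch.rand(2, 40, 40), torch.randn(2, 32, 128))
```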
III-E Loss Function
The occlusion inference task is equivalent to a binary classification for each grid cell in occlusion. We train our model to optimize the objective:

$$\mathcal{L} = \mathcal{L}_{map} + \alpha \mathcal{L}_{occ} + \beta \mathcal{L}_{nz} \qquad (3)$$

where $\alpha$ and $\beta$ are two scalars that balance the three loss terms; $\mathcal{L}_{map}$ is a standard grid-wise binary cross-entropy loss between $\hat{\mathcal{G}}$ and $\mathcal{G}$ to learn the global representation; and $\mathcal{L}_{occ}$ is another cross-entropy loss between $\hat{\mathcal{G}} \odot \mathcal{O}$ and $\mathcal{G} \odot \mathcal{O}$, where $\odot$ denotes the element-wise product, to make the model focus on the grids in occlusion. Due to occupancy class imbalance, $\mathcal{L}_{nz}$ is proposed to constrain the values of $\hat{\mathcal{G}} \odot \mathcal{O}$ not to be all zeros:

$$\mathcal{L}_{nz} = \max\left(0,\; \tau - \sum_{i,j} \big(\hat{\mathcal{G}} \odot \mathcal{O}\big)_{ij}\right) \qquad (4)$$

where $\tau > 0$ is a margin hyperparameter.
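A sketch of this objective under our reconstruction of Eqs. (3) and (4) follows; the weights `alpha` and `beta` and the margin `tau` are illustrative, not the trained values.

```python
import torch
import torch.nn.functional as F

def occlusion_loss(pred, gt, occ, alpha=1.0, beta=0.1, tau=1.0):
    """Sketch of Eq. (3): a global grid-wise BCE, an occlusion-focused BCE
    over masked grids, and a hinge term keeping the prediction inside the
    occluded region from collapsing to all zeros (our reading of Eq. (4))."""
    l_map = F.binary_cross_entropy(pred, gt)
    l_occ = F.binary_cross_entropy(pred * occ, gt * occ)
    l_nz = F.relu(tau - (pred * occ).sum(dim=(-2, -1))).mean()
    return l_map + alpha * l_occ + beta * l_nz

pred = torch.rand(2, 40, 40)                    # predicted occupancy probs
gt = (torch.rand(2, 40, 40) > 0.9).float()      # toy ground-truth OGM
occ = (torch.rand(2, 40, 40) > 0.5).float()     # toy occlusion OGM
loss = occlusion_loss(pred, gt, occ)
```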
IV Experiments
This section describes the experimental settings, including the dataset, data processing, baselines, and metrics.
IV-A Dataset and Data Processing
The INTERACTION dataset [11] contains naturalistic motions of various traffic participants in a variety of highly interactive driving scenarios recorded in different countries. To enrich the interactions between vehicles, we pick a location in an unsignalized intersection scenario. Vehicles that are continuously present for a sufficient duration serve as ego vehicles. We sample the presence time of each ego vehicle as the current time (time $t$) and build an ego OGM around it to represent its receptive field. For the road context, we consider the elements within the receptive field to form the vectors. For the trajectories, we form the trajectory vectors from the past data of all visible vehicles.
IV-B Baselines
We compare our results against the approaches using PaS [4, 6]. K-means PaS [4] and GMM PaS [6] use k-means and a Gaussian mixture model (GMM), respectively, to cluster the trajectories and then map the clusters to different occupancy patterns. Multi-Agent PaS [6] uses a CVAE to train the driver sensor model and fuses all the local OGMs into the global OGM using the Dempster-Shafer rule. We report both their average results and the Top 3 results they propose.
To compare with the vector-based approach, we also design a visual-based approach that maps the semantic map and an OGM carrying historical trajectory information to the ground-truth OGM. In the input OGM, occupancy is represented as follows: 0 indicates that a grid is free, while 1 indicates that it is occupied; intermediate values embed the trajectories by marking grids that were occupied within the past second (the larger the value, the closer to the current time). We use the Swin Transformer [24] to encode the input maps and keep the same decoder as the vector-based approach.
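The trajectory-fading input encoding can be sketched as follows; the ten-frame history and linear fade are illustrative assumptions consistent with the description above.

```python
import numpy as np

# Sketch of the visual-baseline input encoding: free cells are 0, currently
# occupied cells are 1, and cells occupied within the past second fade
# toward 0 the older the occupancy is.
L, W, steps = 40, 40, 10
history = np.random.rand(steps, L, W) > 0.97   # toy occupancy history, oldest first

ogm = np.zeros((L, W), dtype=np.float32)
for k in range(steps):
    value = (k + 1) / steps                    # later frames get larger values
    ogm[history[k]] = value                    # most recent occupancy wins
```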
IV-C Metrics
To evaluate the performance of our model, we use three metrics: accuracy, mean squared error (MSE), and image similarity (IS) [25]. For the classification metric (occupied or free), we threshold the occupancy probability, treating values above 0.5 as occupied and values below 0.5 as free, and then compute the accuracy. We also use the MSE to measure the predicted occupancy probabilities. Because the existence of vehicles is stochastic, we cannot precisely infer the occupancy of each grid solely from observed agent behaviors. Therefore, we use the IS metric to evaluate the relative similarity between the predicted OGM and the ground-truth OGM. The IS metric is computed as:
$$\mathrm{IS}(m_1, m_2) = \sum_{c} \big[ \psi(m_1, m_2, c) + \psi(m_2, m_1, c) \big], \qquad \psi(m_1, m_2, c) = \frac{1}{\#_c(m_1)} \sum_{p_1 :\, m_1[p_1] = c} \min \big\{ d(p_1, p_2) \,\big|\, m_2[p_2] = c \big\} \qquad (5)$$

where $m_1$ and $m_2$ are two OGMs, $c$ represents the two occupancy classes, $p_1$ and $p_2$ are the 2D spatial coordinates of cells in $m_1$ and $m_2$, $d(p_1, p_2)$ gives the Manhattan distance between coordinates, and $\#_c(m_1)$ is the number of cells in $m_1$ with class $c$.
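A compact sketch of the IS computation is shown below; the zero convention for a class absent from either map and the use of SciPy's taxicab distance transform are our implementation choices, not prescribed by [25].

```python
import numpy as np
from scipy.ndimage import distance_transform_cdt

def image_similarity(m1, m2):
    """Sketch of the IS metric (Eq. (5)) following Birk and Carpin [25]:
    for each class, average the Manhattan distance from every cell of that
    class in one map to the nearest same-class cell in the other, summed
    over both directions and both classes. Maps are binary (1 = occupied)."""
    def psi(a, b, c):
        src = (a == c)
        if not src.any() or not (b == c).any():
            return 0.0  # our convention when a class is absent
        # taxicab distance from each cell to the nearest cell of class c in b
        dist = distance_transform_cdt(b != c, metric="taxicab")
        return dist[src].mean()
    return sum(psi(m1, m2, c) + psi(m2, m1, c) for c in (0, 1))

m = (np.random.rand(40, 40) > 0.9).astype(int)
print(image_similarity(m, m))   # 0.0: identical maps are maximally similar
```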
Because free grids far outnumber occupied grids, we evaluate each class separately in addition to the overall metrics.
TABLE I: Comparison with the baselines. Bold denotes the best result for each metric.

| Method | Acc. Occ. | Acc. Free | Acc. Overall | MSE Occ. | MSE Free | MSE Overall | IS Occ. | IS Free | IS Overall |
|---|---|---|---|---|---|---|---|---|---|
| K-means PaS [4] | 0.834 | 0.680 | 0.682 | 0.146 | 0.198 | 0.197 | 1.373 | 0.027 | 1.400 |
| GMM PaS [6] | **0.838** | 0.660 | 0.663 | **0.144** | 0.205 | 0.204 | 1.393 | 0.029 | 1.423 |
| Multi-Agent PaS Avg. [6] | 0.660 | 0.722 | 0.722 | 0.303 | 0.171 | 0.173 | 1.336 | 0.017 | 1.353 |
| Multi-Agent PaS Top 3 [6] | 0.746 | 0.778 | 0.774 | 0.233 | 0.136 | 0.140 | 1.220 | 0.011 | 1.232 |
| Visual-based model | 0.658 | 0.732 | 0.731 | 0.314 | 0.182 | 0.179 | 1.082 | 0.017 | 1.103 |
| Vector-based model | 0.763 | **0.827** | **0.826** | 0.216 | **0.099** | **0.101** | **0.147** | **0.006** | **0.153** |
TABLE II: Ablation study on the three types of input vectors.

| Context | Acc. Occ. | Acc. Free | Acc. Overall | MSE Occ. | MSE Free | MSE Overall | IS Occ. | IS Free | IS Overall |
|---|---|---|---|---|---|---|---|---|---|
| Traj. Only | 0.742 | 0.813 | 0.812 | 0.250 | 0.112 | 0.114 | 0.186 | 0.008 | 0.196 |
| Traj. + Occ. | 0.749 | 0.817 | 0.815 | 0.241 | 0.109 | 0.111 | 0.177 | 0.008 | 0.186 |
| Traj. + Road. | 0.758 | 0.824 | 0.822 | 0.224 | 0.101 | 0.103 | 0.155 | 0.006 | 0.162 |
| Traj. + Road. + Occ. | 0.763 | 0.827 | 0.826 | 0.216 | 0.099 | 0.101 | 0.147 | 0.006 | 0.153 |

V Results
V-A Baseline Comparisons
We compare our results with the baselines in Table I; bold denotes the best result for each metric. To ensure a fair comparison, we use exactly the same training and test datasets for the visual representation and the vectorized representation. Comparing against our visual-based model shows that, within our framework, vectors represent the structured elements more efficiently and with less information loss than an OGM. Multi-Agent PaS [6] has two sets of metrics, the average and the Top 3, where the Top 3 takes the best metric across the three most likely modes of the CVAE. On almost all metrics, our approach outperforms the results reported in [4] and [6], including their Top 3 results. We note that our results make significant progress on the IS metrics (note that IS values are divided by 100), which shows that, by modeling the high-order social interactions between polyline-level features, our predicted OGM has a structure more similar to the ground-truth OGM.
Accuracy and MSE for the occupied class are the only two metrics on which our method performs worse than the baselines. However, the improvement in the occupied-class metrics of K-means PaS [4] and GMM PaS [6] comes at the cost of a reduction in the free-class metrics, which contribute significantly to the overall metrics. While these two baselines tend to predict grids as occupied, our approach better balances the inference performance across the two classes.
V-B Ablation Studies
The input contains three types of vectors: trajectories, road context, and occlusion. We study their contributions to the occlusion inference task in Table II, where "Traj." refers to trajectory vectors, "Road." to road-context vectors, and "Occ." to occlusion vectors. Three additional experiments verify the impact of road context and occlusion on model performance. Across the four rows, the two additional kinds of vectors consistently improve our model; removing either one hurts performance. Moreover, the second and third rows indicate that the road-context vectors contribute more than the occlusion vectors.
V-C Occupancy Visualization
We visualize the outputs of our occlusion inference approach in Fig. 5. Our method successfully infers the existence of vehicles in occlusion, and their approximate locations, from the behaviors of the observed vehicles.
VI Conclusion
This paper presents our approach to the social occlusion inference task. We propose to use vectors to represent the agent trajectories, road context, and occlusion. With this representation, we design a hierarchical transformer network to infer the OGM, in which a polyline encoder aggregates vector-level features into polyline-level features and an interaction-aware transformer encoder models the high-order interactions between polylines. Occlusion queries are proposed to fuse the polyline features through the decoder and generate the inference result. Our model is trained with a three-part loss function that focuses on the occluded regions. Experiments show that our vector-based approach achieves performance on par with or better than the visual-based approach, as well as state-of-the-art results. In the future, OGMs with a finer grid resolution may be used to represent occupancy. We also plan to introduce social occlusion inference into trajectory prediction and motion planning tasks to reduce uncertainty.
References
- [1] M. Bouton, A. Nakhaei, K. Fujimura, and M. J. Kochenderfer, “Scalable decision making with sensor occlusions for autonomous driving,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), 2018, pp. 2076–2081.
- [2] R. Senanayake, S. O’Callaghan, and F. Ramos, “Learning highly dynamic environments with stochastic variational inference,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 2532–2539.
- [3] V. Guizilini, R. Senanayake, and F. Ramos, “Dynamic hilbert maps: Real-time occupancy predictions in changing environments,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 4091–4097.
- [4] O. Afolabi, K. Driggs-Campbell, R. Dong, M. J. Kochenderfer, and S. S. Sastry, "People as sensors: Imputing maps from human actions," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 2342–2348.
- [5] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.
- [6] M. Itkina, Y.-J. Mun, K. Driggs-Campbell, and M. J. Kochenderfer, “Multi-agent variational occlusion inference using people as sensors,” in 2022 International Conference on Robotics and Automation (ICRA), 2022, pp. 4585–4591.
- [7] J. Gao, C. Sun, H. Zhao, Y. Shen, D. Anguelov, C. Li, and C. Schmid, "Vectornet: Encoding hd maps and agent dynamics from vectorized representation," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11522–11530.
- [8] K. Doherty, J. Wang, and B. Englot, “Bayesian generalized kernel inference for occupancy map prediction,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017, pp. 3118–3124.
- [9] J. Wang and B. Englot, “Fast, accurate gaussian process occupancy maps via test-data octrees and nested bayesian fusion,” in 2016 IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 1003–1010.
- [10] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer, 2020, pp. 213–229.
- [11] W. Zhan, L. Sun, D. Wang, H. Shi, A. Clausse, M. Naumann, J. Kummerle, H. Konigshof, C. Stiller, A. de La Fortelle, et al., “Interaction dataset: An international, adversarial and cooperative motion dataset in interactive driving scenarios with semantic maps,” arXiv preprint arXiv:1910.03088, 2019.
- [12] O. S. Tas and C. Stiller, “Limited visibility and uncertainty aware motion planning for automated driving,” in 2018 IEEE Intelligent Vehicles Symposium (IV), 2018, pp. 1171–1178.
- [13] L. Sun, W. Zhan, C.-Y. Chan, and M. Tomizuka, “Behavior planning of autonomous cars with social perception,” in 2019 IEEE Intelligent Vehicles Symposium (IV), 2019, pp. 207–213.
- [14] K. Rezaee, P. Yadmellat, and S. Chamorro, “Motion planning for autonomous vehicles in the presence of uncertainty using reinforcement learning,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 3506–3511.
- [15] Y. Nager, A. Censi, and E. Frazzoli, “What lies in the shadows? safe and computation-aware motion planning for autonomous vehicles using intent-aware dynamic shadow regions,” in 2019 International Conference on Robotics and Automation (ICRA), 2019, pp. 5800–5806.
- [16] X. Ren, T. Yang, L. E. Li, A. Alahi, and Q. Chen, "Safety-aware motion prediction with unseen vehicles for autonomous driving," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 15711–15720.
- [17] Y. Hoshen, “Vain: Attentional multi-agent predictive modeling,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [18] C. Yu, X. Ma, J. Ren, H. Zhao, and S. Yi, “Spatio-temporal graph transformer networks for pedestrian trajectory prediction,” in European Conference on Computer Vision. Springer, 2020, pp. 507–523.
- [19] V. Kosaraju, A. Sadeghian, R. Martín-Martín, I. Reid, H. Rezatofighi, and S. Savarese, “Social-bigat: Multimodal trajectory forecasting using bicycle-gan and graph attention networks,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [20] Y. Huang, H. Bi, Z. Li, T. Mao, and Z. Wang, “Stgat: Modeling spatial-temporal interactions for human trajectory prediction,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 6271–6280.
- [21] Y. Yuan, X. Weng, Y. Ou, and K. Kitani, “Agentformer: Agent-aware transformers for socio-temporal multi-agent forecasting,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9793–9803.
- [22] J. Li, F. Yang, M. Tomizuka, and C. Choi, "Evolvegraph: Multi-agent trajectory prediction with dynamic relational reasoning," Advances in Neural Information Processing Systems, vol. 33, pp. 19783–19794, 2020.
- [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10012–10022.
- [25] A. Birk and S. Carpin, “Merging occupancy grid maps from multiple robots,” Proceedings of the IEEE, vol. 94, no. 7, pp. 1384–1397, 2006.