
Spatial-Temporal Perception with Causal Inference for Naturalistic Driving Action Recognition

Qing Chang1*, Wei Dai2*, Zhihao Shuai3, Limin Yu2, Yutao Yue3†
1School of Mechanical Engineering, Nanjing University of Science and Technology
2School of Advanced Technology, Xi’an Jiaotong-Liverpool University
3The Hong Kong University of Science and Technology (Guangzhou)
*These authors contributed equally. †Corresponding author: [email protected]
Abstract

Naturalistic driving action recognition is essential for vehicle cabin monitoring systems. However, the complexity of real-world backgrounds presents significant challenges for this task, and previous approaches have struggled with practical implementation due to their limited ability to observe subtle behavioral differences and effectively learn inter-frame features from video. In this paper, we propose a novel Spatial-Temporal Perception (STP) architecture that emphasizes both temporal information and spatial relationships between key objects, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without requiring multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. Subsequently, these dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features at different scales, STP can perceive subtle behavioral changes in challenging scenarios. Additionally, we introduce a causal-aware module to explore relationships between video frame features, significantly enhancing detection efficiency and performance. We validate the effectiveness of our approach using two publicly available driver distraction detection benchmarks. The results demonstrate that our framework achieves state-of-the-art performance.

Index Terms:
driver action recognition, causal inference, car cabin monitoring

I Introduction

Naturalistic driving action recognition (DAR) is a critical component of vehicle cabin monitoring systems. Its primary objectives are distracted behavior detection and temporal action localization (TAL) within untrimmed video sequences, both of which are vital for improving driving safety and fostering effective driver-vehicle interaction.

Figure 1: Illustrations of challenging cases in driving action recognition. (a) Calling and (b) drinking scenes, where the objects are partially visible and the lighting conditions are unstable. (c) Variations in the distance d between key points assist in identifying behavior categories and temporal localization.

Recent advancements in DAR have been propelled by the powerful representation capabilities of deep learning [1]. Several approaches build on general human action recognition backbones, leveraging 3D convolutional neural networks (CNNs) [5] and vision transformers [28]. Among these, the temporal shift module (TSM) [17] has demonstrated effectiveness in learning features from adjacent frames.

Despite this progress, DAR remains a challenging task due to the complex nature of the vehicle cabin environment, which provides limited distinguishing features. As shown in Fig. 1, drivers often exhibit highly similar movements of body parts (e.g., eating and drinking), which can easily confuse detectors. Additionally, variable lighting conditions in the cabin and the duration of input video clips pose further challenges in modeling long-sequence feature relationships.

Figure 2: The overall architecture of STP.

Several studies have sought to address these challenges. Khan et al. [10] combined depth and infrared inputs through late fusion to improve the robustness of driving behavior detection. Ma et al. [20] proposed a multi-scale attention module for multi-view image fusion, while Kuang et al. [12] developed a multi-camera DAR model that trains a single-camera feature extractor to boost performance. However, these methods require additional input types and depend on single-mode classification pipelines, which reduces their practicality on real-world hardware and compromises efficiency. Additionally, they often neglect the temporal correlations between frames.

After thorough analysis, we argue that two features of the video merit attention: temporal information and the spatial distance between objects of interest. Temporal information is directly derived from video clips, providing fine-grained visual features essential for action recognition [8, 7]. Furthermore, as illustrated in Fig. 1(c), changes in the distance between a driver’s mouth and a cup offer cues for identifying the start and end of actions. By aligning and fusing these two feature types, the model can better focus on key regions to perceive the action, thereby mitigating irrelevant interference.

To this end, we present a novel Spatial-Temporal Perception (STP) network that emphasizes both temporal information and spatial relationships between objects of interest, incorporating a causal decoder to perform behavior recognition and temporal action localization. Without relying on multi-view or multimodal input, STP directly extracts temporal and spatial distance features from RGB video clips. These dual features are jointly encoded by maximizing the expected likelihood across all possible permutations of the factorization order. By integrating temporal and spatial features, STP is capable of detecting subtle behavioral changes in challenging scenarios. Furthermore, we introduce a causal-aware module to analyze relationships between video frame features, significantly improving detection efficiency and performance. We validate the effectiveness of our approach on two publicly available driver distraction detection benchmarks: Drive&Act and SynDD2. The results demonstrate that our framework achieves state-of-the-art performance in both driver action recognition and temporal action localization.

II Methodology

II-A Overview

The overall architecture of STP is illustrated in Fig. 2. Given a video with T frames, denoted as X\in\mathbb{R}^{T\times 3\times H\times W}, STP aims to integrate the temporal features of video clips with the spatial relationships between key points to enhance driver action recognition and temporal localization. The video clip is first processed in parallel by two heads to extract the clip context and the key point positions. These two outputs are then combined: the clip context is aggregated with position embeddings to generate temporal features, and the key points are refined into per-frame distance features using graph convolutional networks (GCNs) [11]. The two types of features are interactively fused for spatial alignment and passed to a stacked multi-scale channel transformer for context encoding. The proposed causal-aware module further explores the relationships between the feature sequences. Finally, these hybrid features are decoded by the classification and localization heads to identify the behavior pattern and the start and end frames of the action.
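As a rough illustration of this data flow, the pipeline can be written as a PyTorch-style module. All sub-module names, tensor shapes, and the query input below are assumptions made for the sketch, not the paper's exact implementation.

```python
import torch.nn as nn

class STPSketch(nn.Module):
    """Hypothetical skeleton of the STP pipeline described above."""
    def __init__(self, video_backbone, pose_head, gcn, interaction,
                 encoder, causal_decoder, cls_head, loc_head):
        super().__init__()
        self.video_backbone = video_backbone    # clip context (temporal features)
        self.pose_head = pose_head              # per-frame key point positions
        self.gcn = gcn                          # key points -> distance features
        self.interaction = interaction          # dual feature interaction (Eqs. 1-2)
        self.encoder = encoder                  # stacked multi-scale channel transformer
        self.causal_decoder = causal_decoder    # causal-aware query decoding (Eq. 4)
        self.cls_head, self.loc_head = cls_head, loc_head

    def forward(self, clip, queries):
        # clip: (B, T, 3, H, W) RGB video; queries: learnable action queries
        temporal = self.video_backbone(clip)          # clip context + position embeddings
        distance = self.gcn(self.pose_head(clip))     # inter-keypoint distance features
        fused = self.encoder(self.interaction(temporal, distance))  # align + encode
        hidden = self.causal_decoder(queries, fused)  # queries attend to frame features
        return self.cls_head(hidden), self.loc_head(hidden)  # category, (start, end)
```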

II-B Context Joint Encoding

In the feature extraction stage, two lightweight extraction networks are employed to obtain dual features. However, the dense temporal features are not aligned with the sparse distance features between nodes extracted by GCNs. To address this misalignment, we first introduce an interaction mechanism to spatially align the dual feature sets. These aligned features are then fed into stacked multi-scale channel transformer layers for fusion and calibration, enabling the integration of both global and local information [4, 13].

Dual Feature Interaction. Feature interaction transfers and aligns features between different modalities. Let the T-frame dual feature clips be represented as X^{p}=\left\{x_{t}^{p}\right\}_{t=1}^{T} and X^{d}=\left\{x_{t}^{d}\right\}_{t=1}^{T}, where x_{t}^{p}, x_{t}^{d}\in\mathbb{R}^{C\times L} denote the temporal and distance features at timestamp t, respectively. The dual features are updated by shifting the last k feature channels of each modality as follows:

\hat{x}_{t}^{p}=\operatorname{MLP}\left(x_{t}^{p}[:-k],\,x_{t}^{d}[-k:]\right), (1)
\hat{x}_{t}^{d}=\operatorname{MLP}\left(x_{t}^{d}[:-k],\,x_{t}^{p}[-k:]\right), (2)

where [\cdot,\cdot] denotes channel-wise concatenation, and \operatorname{MLP} refers to a fully connected layer. This process incurs minimal computational cost. As a result, dual-feature interaction efficiently aligns and integrates information across modalities.
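A minimal sketch of this interaction, assuming dual features stored as tensors of shape (B, T, C, L) and a single linear layer standing in for each MLP:

```python
import torch
import torch.nn as nn

class DualFeatureInteraction(nn.Module):
    """Channel-shift interaction of Eqs. (1)-(2); shapes and layer widths are assumed."""
    def __init__(self, channels: int, k: int):
        super().__init__()
        self.k = k
        self.mlp_p = nn.Linear(channels, channels)  # MLP for the temporal branch
        self.mlp_d = nn.Linear(channels, channels)  # MLP for the distance branch

    def forward(self, x_p: torch.Tensor, x_d: torch.Tensor):
        # x_p, x_d: (B, T, C, L) temporal and distance features
        k = self.k
        # Swap the last k feature channels between the two modalities (channel dim = 2).
        mix_p = torch.cat([x_p[:, :, :-k], x_d[:, :, -k:]], dim=2)
        mix_d = torch.cat([x_d[:, :, :-k], x_p[:, :, -k:]], dim=2)
        # Apply the fully connected layer over the channel axis.
        x_p_hat = self.mlp_p(mix_p.transpose(2, 3)).transpose(2, 3)
        x_d_hat = self.mlp_d(mix_d.transpose(2, 3)).transpose(2, 3)
        return x_p_hat, x_d_hat
```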

Multi-Scale Encoder. We adopt a stacked multi-scale channel transformer layer within our recurrence mechanism to facilitate the reuse of hidden states from preceding segments. For a longer sequence F obtained via dual-feature interaction, consider extracting two segments \tilde{S}=F_{1:t} and S=F_{t+1:2t} for illustration. We process the initial segment \tilde{S} and retain the resultant content representations \tilde{H}^{(m)} for each layer m. Let Q=H^{(m-1)} and K,V=\left[\tilde{H}^{(m-1)},H^{(m-1)}\right]. When processing the subsequent segment, the attention update, integrating memory, can be formulated as follows:

H^{(m)}=\operatorname{Softmax}\left(\frac{QK^{T}}{\sqrt{D}}\right)V, (3)

where D denotes the embedding dimension. Consequently, once the representations \tilde{H}^{(m-1)} have been obtained, the attention update operates independently of the variable \tilde{S}.
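The recurrence can be sketched as a single-head attention step in which the cached representations of the previous segment are detached (no gradient) and concatenated into the keys and values. The projection matrices below are assumed parameters for illustration only:

```python
import torch
import torch.nn.functional as F

def recurrent_attention(h_curr: torch.Tensor, h_mem: torch.Tensor,
                        w_q: torch.Tensor, w_k: torch.Tensor,
                        w_v: torch.Tensor) -> torch.Tensor:
    """Sketch of Eq. (3). h_curr: (t, D) states H^{(m-1)} of the current segment;
    h_mem: (t, D) cached states of the previous segment; w_*: (D, D) projections."""
    q = h_curr @ w_q
    kv = torch.cat([h_mem.detach(), h_curr], dim=0)   # memory is reused, not re-computed
    k, v = kv @ w_k, kv @ w_v
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(0, 1) / d ** 0.5, dim=-1)  # (t, 2t)
    return attn @ v                                             # H^{(m)}: (t, D)
```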

II-C Causal Query Decoding

Causal-aware Module. To ensure the query attends equally to the embeddings of each time frame and fully explores the causal relationships between video frames, we propose a causal-aware module based on cross-attention masks. Specifically, the output of the module for video embeddings is computed as follows:

y_{i}=\frac{\sum_{j}M_{ij}\exp\left(Q_{i}K^{T}(x_{j})\right)V(x_{j})}{\sum_{j}M_{ij}\exp\left(Q_{i}K^{T}(x_{j})\right)\mathbf{1}_{L}}, (4)

where M_{ij} denotes the mask for the i-th query Q_{i} on the j-th frame x_{j}. As the current time step t is typically smaller than the number of tokens n, M_{ij}=1 if i\geq j\left\lfloor\frac{n}{t}\right\rfloor, and M_{ij}=0 otherwise. Here \mathbf{1}_{L}=[1,\cdots,1]^{T}\in\mathbb{R}^{L\times 1} is a vector of ones. We illustrate the masking process of our module in Fig. 3. This approach ensures that initial queries focus on early visual embeddings, while final queries can access embeddings from various time frames to capture causal relationships across time.
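A simplified sketch of Eq. (4) follows, assuming each frame is summarized by a single embedding vector so that the \mathbf{1}_{L} term reduces to a scalar sum; the per-row max subtraction is only for numerical stability:

```python
import torch

def causal_aware_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           t_steps: int) -> torch.Tensor:
    """Sketch of the causal-aware cross-attention in Eq. (4).
    q: (num_queries, D) learnable queries; k, v: (n, D) frame embeddings."""
    n = k.size(0)
    i = torch.arange(q.size(0)).unsqueeze(1)          # query index i
    j = torch.arange(n).unsqueeze(0)                  # frame index j
    mask = (i >= j * (n // t_steps)).float()          # M_ij = 1 iff i >= j * floor(n / t)
    logits = q @ k.transpose(0, 1)                    # (num_queries, n)
    logits = logits - logits.max(dim=1, keepdim=True).values  # stabilize exp()
    scores = mask * torch.exp(logits)
    return (scores @ v) / scores.sum(dim=1, keepdim=True).clamp_min(1e-6)
```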

Prediction Head. The prediction head consists of a classification head and a localization head. A feed-forward network (FFN) [27] is used to predict parameters, and the prediction head is formulated as:

[\boldsymbol{T_{s}},\boldsymbol{T_{e}}]=\operatorname{FFN}_{reg}(\mathbf{Q}), (5)
\mathcal{C}=\operatorname{FFN}_{cls}(\mathbf{Q}), (6)

where \boldsymbol{T_{s}} and \boldsymbol{T_{e}} represent the start and end frames of the action, and \mathcal{C} denotes the predicted action category.
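A sketch of the two heads, with hidden width and class count chosen only for illustration:

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Sketch of Eqs. (5)-(6): regression and classification FFNs over the decoded queries."""
    def __init__(self, dim: int = 256, num_classes: int = 16):
        super().__init__()
        self.ffn_reg = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 2))            # [T_s, T_e]
        self.ffn_cls = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, num_classes))  # action logits C

    def forward(self, queries: torch.Tensor):
        # queries: (num_queries, dim) produced by the causal-aware decoder
        return self.ffn_reg(queries), self.ffn_cls(queries)
```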

Figure 3: Causal-aware module incrementally exposes video frames to learnable queries to decouple spatial and temporal features.

Total Loss. Given matched ground truth labels for the prediction queries, we calculate the corresponding loss for each matched pair. The overall loss of our model includes both classification loss and regression loss:

\mathcal{L}_{total}=\lambda_{cls}\mathcal{L}_{cls}+\lambda_{reg}\mathcal{L}_{reg}, (7)

where \mathcal{L}_{cls} is the focal loss with \gamma=2.0 and \alpha=0.25, and \mathcal{L}_{reg} is the smooth-l_{1} loss for TAL.
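For matched query/ground-truth pairs, Eq. (7) can be computed as below; torchvision's sigmoid focal loss is used here as a stand-in for the focal loss variant, and classification targets are assumed to be one-hot float vectors:

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def total_loss(cls_logits: torch.Tensor, cls_targets: torch.Tensor,
               reg_pred: torch.Tensor, reg_targets: torch.Tensor,
               lambda_cls: float = 1.0, lambda_reg: float = 1.5) -> torch.Tensor:
    """Sketch of Eq. (7): focal classification loss plus smooth-L1 localization loss."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets,
                               alpha=0.25, gamma=2.0, reduction="mean")
    l_reg = F.smooth_l1_loss(reg_pred, reg_targets)   # regression of (start, end)
    return lambda_cls * l_cls + lambda_reg * l_reg
```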

III Experiments and Results

III-A Datasets and Evaluation Metrics

Drive&Act [22] is widely used for driver activity recognition tasks. It contains 9.6 million frames across three modalities (RGB, IR, and depth) and five different camera views. The dataset provides three levels of activity labels: action units, fine-grained activities, and coarse tasks. In this paper, we focus on the fine-grained RGB modality from the top-right view.

SynDD2 [26] includes IR and RGB videos, along with annotation files, collected from three in-vehicle cameras located at the dashboard, rearview mirror, and top-right corner of the window. The dataset covers two types of activities: distracted activities and gaze zones, each with and without appearance obstructions such as hats or sunglasses.

Evaluation Metrics. We follow the official evaluation metrics for Drive&Act, using Mean-1 Accuracy (average per-class accuracy) as the primary metric, and Top-1 Accuracy for implementation assessment. For SynDD2, we evaluate temporal action localization and recognition performance using the average overlap score (AO-Score), which is defined as follows:

os(p,g)=\frac{\max(\min(ge,pe)-\max(gs,ps),\,0)}{\max(ge,pe)-\min(gs,ps)}, (8)

where gs and ge represent the start and end times of the ground-truth activity g, respectively. The variable p denotes the best predicted activity of the same category as g, while os refers to the highest overlap. The overlap between g and p is defined as the ratio of the intersection time to the union time of the two activities. After matching each ground-truth activity in order of start time, any unmatched ground-truth or predicted activities are assigned an overlap score of 0.
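The per-activity overlap in Eq. (8) is a temporal IoU and can be computed directly:

```python
def overlap_score(gs: float, ge: float, ps: float, pe: float) -> float:
    """Overlap of Eq. (8) between a ground-truth activity (gs, ge) and a prediction (ps, pe)."""
    intersection = max(min(ge, pe) - max(gs, ps), 0.0)
    union = max(ge, pe) - min(gs, ps)
    return intersection / union if union > 0 else 0.0
```

For example, a prediction spanning 2-6 s against a ground truth spanning 3-7 s yields an overlap of 3/5 = 0.6; the AO-Score then averages these overlaps over matched and unmatched activities.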

TABLE I: Comparison with popular methods on Drive&Act.
Method Modality Mean-1 (%) ↑ Top-1 (%) ↑
ResNet [9] IR, Depth 51.08 56.43
UniFormerV2 [14] RGB, IR, Depth 61.58 78.63
MDBU (I3D) [25] IR, NIR 62.02 76.91
DFS [16] IR, Depth 63.12 77.61
TSM [17] IR 59.81 67.75
TransDARC [24] RGB 60.10 76.17
UniFormerV2 [14] RGB 61.79 76.71
STP (Ours) RGB 63.82 78.32
TABLE II: Comparison with popular methods on SynDD2.
Method Multi-View Setting AO-Score ↑
M2DAR [21] Right, Dashboard 0.5921
MCPRL [30] Right, Rear, Dashboard 0.6080
SKKU  [23] Right, Rear, Dashboard 0.7798
APC [15] Dashboard 0.7046
AMA [29] Right 0.7459
STP (Ours) Right 0.7823

III-B Implementation Details

We utilize the pre-trained VideoMAEv2 [28] and OpenPose [3] models as the backbones for video feature extraction and spatial pose estimation, respectively. Following [6], the input video is sampled with a temporal stride of 8, each frame is resized to 224×224, and only 13 key points are used per frame. The Multi-Scale Encoder consists of 6 layers, with 4 heads and 256-dimensional embeddings. In the training stage, we use the AdamW [19] optimizer with an initial learning rate of 1e-3 and a cosine decay learning rate schedule [18] with power set to 0.9. During inference, the initial predictions are filtered by SoftNMS [2] with a threshold of 0.2. The weights \lambda_{cls} and \lambda_{reg} are set to 1 and 1.5, respectively.
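A sketch of the corresponding optimizer setup, using PyTorch's CosineAnnealingLR as a stand-in for the cosine decay schedule (the SoftNMS post-processing step is omitted here):

```python
import torch

def build_training_setup(model: torch.nn.Module, total_steps: int):
    """Illustrative optimizer/scheduler configuration matching the hyperparameters above."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    # Cosine decay of the learning rate over the whole training schedule.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```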

TABLE III: Effects of each component in our method.
Spatial Feature   Temporal Feature   Spatial-Temporal Feature   Causal-aware Module   AO-Score ↑
✓   ✗   ✗   ✗   0.7223
✗   ✓   ✗   ✗   0.7298
✗   ✗   ✓   ✗   0.7443
✗   ✗   ✓   ✓   0.7823
TABLE IV: The comparison of the model efficiency results.
Modality Method Latency (ms) ↓ #Params ↓
Dual UniFormerV2 [14] 33.0 47.2M
Dual DFS [16] 28.0 38.8M
Single TSM [17] 15.0 25.3M
Single I3D [25] 18.3 28.0M
Single STP (Ours) 14.2 23.7M

III-C Quantitative Results

Table I shows the results of our method on the Drive&Act test set, compared against popular single-modal and multi-modal driving action recognition methods. STP achieves the highest Mean-1 accuracy of 63.82% and a competitive Top-1 accuracy of 78.32%, outperforming all compared single-modal methods and surpassing the multi-modal baselines on the primary Mean-1 metric. This demonstrates that our method achieves high-precision driver action recognition without requiring additional input modalities.

Table II shows the results on the SynDD2 dataset, where methods are grouped by the camera views they use. Without complex multi-view fusion, our approach achieves state-of-the-art performance with an AO-Score of 0.7823, demonstrating the effectiveness of the proposed Spatial-Temporal Perception. Notably, our method relies only on the RGB input from the camera on the right side of the driver, which greatly reduces hardware cost in practical deployments.

III-D Ablation Study and Visualization

To validate the effectiveness of our STP model, we conducted several ablation experiments on the SynDD2 dataset.

Ablation Study. We present the results of the ablation studies in Table III. We progressively add the spatial-temporal perception structure and the causal-aware module, and report the corresponding AO-Score. The results indicate that using spatial or temporal features alone provides only limited improvements. In contrast, the proposed spatial-temporal perception significantly enhances the AO-Score. Furthermore, the causal-aware module effectively captures relationships between video frames, leading to additional performance gains.

Efficiency Comparison. Model efficiency is critical for real-time driver monitoring systems. We further evaluate the efficiency of the model in terms of latency and parameter size, as shown in Table IV. For a fair comparison, we categorize the current popular methods into dual- and single-modality inputs, ensuring consistent input cropping. The results demonstrate that our method not only achieves superior performance but also retains the efficiency benefits of lower latency and a reduced parameter count typical of single-modality inputs.

Results Visualization. As shown in Table V, we further visualize several challenging cases and their corresponding results from the keypoint detection stage. These examples clearly demonstrate that changes in the distance between key points (such as between the fingers and mouth) provide valuable prior knowledge, enabling the model to make accurate inferences. Even when actions appear similar, our method can accurately distinguish between the driver’s Calling and Eating actions and predict the precise start and end times of these actions.

TABLE V: Sample visualizations of the process of keypoint detection.
Input Results
[Uncaptioned image] Calling
[Uncaptioned image] Eating
[Uncaptioned image] Drinking

IV Conclusion

In this paper, we introduce a novel Spatial-Temporal Perception (STP) architecture designed to enhance action recognition and temporal action localization by capturing both the temporal dynamics and spatial relationships between key objects. Unlike multimodal approaches, STP directly extracts temporal and spatial distance features from RGB video clips, encoding these dual features by optimizing the likelihood across various factorization orders. This integration allows STP to detect subtle behavioral changes, even in complex scenarios. Furthermore, the inclusion of a causal-aware module improves detection efficiency by exploring the relationships between video frame features. Validated on two publicly available driver distraction detection benchmarks, our approach achieves state-of-the-art performance, highlighting its effectiveness and potential for broader applications.

Acknowledgment

This work was supported by the Guangzhou-HKUST(GZ) Joint Funding Program (Grant No. 2023A03J0008) and the Education Bureau of Guangzhou Municipality.

References

  • [1] V. A. Adewopo, N. Elsayed, Z. ElSayed, M. Ozer, A. Abdelgawad, and M. Bayoumi. A review on action recognition for accident detection in smart city transportation systems. Journal of Electrical Systems and Information Technology, 10(1):57, 2023.
  • [2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pages 5561–5569, 2017.
  • [3] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7291–7299, 2017.
  • [4] Q. Chang and Y. Tong. A hybrid global-local perception network for lane detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 981–989, 2024.
  • [5] H. Chen. Skateboardai: The coolest video action recognition for skateboarding (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 16184–16185, 2023.
  • [6] X. Dong, R. Zhao, H. Sun, D. Wu, J. Wang, X. Zhou, J. Liu, S. Cui, and Z. He. Multi-attention transformer for naturalistic driving action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5435–5441, 2023.
  • [7] R. Guan, S. Yao, K. L. Man, X. Zhu, Y. Yue, J. Smith, E. G. Lim, and Y. Yue. Asy-vrnet: Waterway panoptic driving perception model based on asymmetric fair fusion of vision and 4d mmwave radar. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12479–12486. IEEE, 2024.
  • [8] C. He, C. Fang, Y. Zhang, T. Ye, K. Li, L. Tang, Z. Guo, X. Li, and S. Farsiu. Reti-diff: Illumination degradation image restoration with retinex-based latent diffusion model. arXiv preprint arXiv:2311.11638, 2023.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [10] S. S. Khan, Z. Shen, H. Sun, A. Patel, and A. Abedi. Supervised contrastive learning for detecting anomalous driving behaviours from multimodal videos. In 2022 19th Conference on Robots and Vision (CRV), pages 16–23. IEEE, 2022.
  • [11] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [12] J. Kuang, W. Li, F. Li, J. Zhang, and Z. Wu. Mifi: Multi-camera feature integration for robust 3d distracted driver activity recognition. IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [13] S. Lai, T. Xue, H. Xiao, L. Hu, J. Wu, N. Feng, R. Guan, H. Liao, Z. Li, and Y. Yue. Drive: Dependable robust interpretable visionary ensemble framework in autonomous driving. arXiv preprint arXiv:2409.10330, 2024.
  • [14] K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, L. Wang, and Y. Qiao. Uniformerv2: Unlocking the potential of image vits for video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1632–1643, 2023.
  • [15] R. Li, C. Wu, L. Li, Z. Shen, T. Xu, X.-j. Wu, X. Li, J. Lu, and J. Kittler. Action probability calibration for efficient naturalistic driving action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5270–5277, 2023.
  • [16] D. Lin, P. H. Y. Lee, Y. Li, R. Wang, K.-H. Yap, B. Li, and Y. S. Ngim. Multi-modality action recognition based on dual feature shift in vehicle cabin monitoring. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6480–6484. IEEE, 2024.
  • [17] J. Lin, C. Gan, and S. Han. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7083–7093, 2019.
  • [18] I. Loshchilov and F. Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [19] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [20] Y. Ma, V. Sanchez, S. Nikan, D. Upadhyay, B. Atote, and T. Guha. Real-time driver monitoring systems through modality and view analysis. arXiv preprint arXiv:2210.09441, 2022.
  • [21] Y. Ma, L. Yuan, A. Abdelraouf, K. Han, R. Gupta, Z. Li, and Z. Wang. M2dar: Multi-view multi-scale driver action recognition with vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5287–5294, 2023.
  • [22] M. Martin, A. Roitberg, M. Haurilet, M. Horne, S. Reiß, M. Voit, and R. Stiefelhagen. Drive&act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2801–2810, 2019.
  • [23] H.-H. Nguyen, C. D. Tran, L. H. Pham, D. N.-N. Tran, T. H.-P. Tran, D. K. Vu, Q. P.-N. Ho, N. D.-M. Huynh, H.-M. Jeon, H.-J. Jeon, et al. Multi-view spatial-temporal learning for understanding unusual behaviors in untrimmed naturalistic driving videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7144–7152, 2024.
  • [24] K. Peng, A. Roitberg, K. Yang, J. Zhang, and R. Stiefelhagen. Transdarc: Transformer-based driver activity recognition with latent space feature calibration. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 278–285. IEEE, 2022.
  • [25] A. Roitberg, K. Peng, Z. Marinov, C. Seibold, D. Schneider, and R. Stiefelhagen. A comparative analysis of decision-level fusion for multimodal driver behaviour understanding. In 2022 IEEE Intelligent Vehicles Symposium (IV), pages 1438–1444. IEEE, 2022.
  • [26] M. Shaiqur Rahman, J. Wang, S. Velipasalar Gursoy, D. Anastasiu, S. Wang, and A. Sharma. Synthetic distracted driving (syndd2) dataset for analyzing distracted behaviors and various gaze zones of a driver. arXiv e-prints, pages arXiv–2204, 2022.
  • [27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  • [28] L. Wang, B. Huang, Z. Zhao, Z. Tong, Y. He, Y. Wang, Y. Wang, and Y. Qiao. Videomae v2: Scaling video masked autoencoders with dual masking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14549–14560, 2023.
  • [29] T. Zhang, Q. Wang, X. Dong, W. Yu, H. Sun, X. Zhou, A. Zhen, S. Cui, D. Wu, and Z. He. Augmented self-mask attention transformer for naturalistic driving action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7108–7114, 2024.
  • [30] W. Zhou, Y. Qian, Z. Jie, and L. Ma. Multi view action recognition for distracted driver behavior localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5375–5380, 2023.