Uncertainty-aware Bridge based Mobile-Former Network for Event-based Pattern Recognition
Abstract
The mainstream human activity recognition (HAR) algorithms are developed based on RGB cameras, which are easily influenced by low-quality images (e.g., low illumination, motion blur). Meanwhile, the privacy issues raised by ultra-high-definition (HD) RGB cameras have attracted more and more attention. Inspired by the success of event cameras, which offer high dynamic range, no motion blur, and low energy consumption, we propose to recognize human actions based on the event stream. We propose a lightweight uncertainty-aware information propagation based Mobile-Former network for efficient pattern recognition, which aggregates the MobileNet and Transformer networks effectively. Specifically, we first embed the event images into feature representations using a stem network, then feed them into uncertainty-aware Mobile-Former blocks for local and global feature learning and fusion. Finally, the features from the MobileNet and Transformer branches are concatenated for pattern recognition. Extensive experiments on multiple event-based recognition datasets fully validate the effectiveness of our model. The source code of this work will be released at https://github.com/Event-AHU/Uncertainty_aware_MobileFormer.
I INTRODUCTION
Human Activity Recognition (HAR) is one of the most critical tasks in computer vision and has developed significantly in recent years [1, 23] with the help of deep learning. Usually, these models are designed for video frames captured by RGB cameras and are widely used in practical applications; for example, analyzing human behavior enables pre-event prevention, in-process monitoring, and post-event inspection in the security monitoring field, as well as intelligent refereeing in sports. Although RGB camera based HAR works well in simple scenarios, issues caused by imaging quality, such as low illumination and fast motion, may severely limit its applications. On the other hand, privacy protection is also widely discussed in human-centered research. Awkwardly, the ethical problems caused by high-quality data and the data quality problems caused by low-quality video both call for new behavior recognition paradigms.
Recently, the event camera (also termed Dynamic Vision Sensor, DVS), a bio-inspired sensor, has drawn more and more attention from researchers [41] [42] [40]. Different from the RGB camera, which records the scene into video frames in a synchronous way, each pixel in the event camera is triggered asynchronously, emitting an event point if and only if the variation of intensity exceeds a given threshold. Due to this unique imaging principle, the event camera shows the following advantages: high dynamic range, low energy consumption, and dense temporal resolution with sparse spatial output [17]. Therefore, it performs well even in low-illumination, overexposure, and fast-motion scenarios. Also, the spatial resolution of event cameras keeps increasing, with high-resolution sensors such as the CeleX-V [6] and PROPHESEE cameras already available. These features all inspired us to address the pain points of HAR using an event camera.
In this work, we propose a new lightweight event-based recognition model by aggregating the MobileNet [19] and Transformer [21] networks through an uncertainty-aware bridge module. The key insight of this work is that the CNN models local features well while the Transformer performs better at mining long-range relations. Unlike the typical dual-stream fusion network architecture, we feed the event data into the MobileNet branch and input randomly initialized tokens into the Transformer branch, as shown in Fig. 1. Inspired by Mobile-Former [7], we propose an enhanced bridge module that connects the two parallel branches while taking the uncertainty of information propagation into account. The intuition behind this design is that the two branches focus on different types of feature learning; therefore, the information from different samples, or from the same sample at different time steps, may be asymmetrical, and the decision of which branch should transmit richer information to the other carries a certain level of uncertainty. Specifically, we model the uncertainty using a Gaussian distribution and adopt two MLP layers to predict its mean and variance. Then, the re-parameterization trick [12] is adopted to sample a new feature from the Gaussian distribution and fuse it with the other branch's features. Extensive experiments validate the effectiveness of this uncertainty-aware information propagation module between the dual branches.
To sum up, the contributions of this paper can be summarized as the following three aspects:
We propose a novel lightweight uncertainty-aware Mobile-Former framework for event-based pattern recognition. It is a parallel dual-branch framework that simultaneously models the local and global features and effectively regulates the information flow between the dual branches.
We propose a new uncertainty-aware bridge block which effectively boosts the feature interaction and fusion between the local CNN features and global Transformer features.
Extensive experiments conducted on multiple widely used event-based recognition benchmark datasets fully demonstrate the effectiveness of the proposed model.
II RELATED WORK
In this section, we review related works on event-based recognition and uncertainty-aware learning.
II-A Event-based Recognition
Current works on event-based recognition can be divided into three streams, including CNN based [43], SNN based [15, 13], and GNN based models [4, 5], owing to the flexible representation of event streams. For the CNN based models, Wang et al. [43] propose to identify human gaits using an event camera and design a CNN model for recognition. As the third generation of neural networks, SNNs are also adopted to encode the event stream for energy-efficient recognition. To be specific, Diehl et al. [11] propose a weight and threshold balancing method to achieve efficient ANN-to-SNN conversion. Perez-Nieves et al. [32] propose a sparse backpropagation method for SNNs that is faster and more memory efficient.
For the point cloud based representation, Wang et al. [39] treat the event stream as space-time event clouds and adopt PointNet [33] as the backbone for gesture recognition. Vemprala et al. [37] propose the event variational auto-encoder (eVAE) to achieve compact representation learning directly from the asynchronous event points. Fang et al. [14] propose spike-element-wise (SEW) residual learning for deep SNNs, which effectively addresses the vanishing/exploding gradient problems. Meng et al. [29] propose an accurate and low-latency SNN based on the Differentiation on Spike Representation (DSR) method. TORE [2], short for Time-Ordered Recent Event volumes, compactly stores raw spike timing information. VMV-GCN [48], proposed by Xie et al., is a voxel-wise graph learning model that fuses multi-view volumetric data. Li et al. [26] introduce the Transformer network to learn event-based representations in a native vectorized tensor way. Different from these works, in this paper, we design a novel uncertainty-aware Mobile-Former network that effectively aggregates the CNN and Transformer.
II-B Uncertainty-aware Learning
Uncertainty-aware learning is widely exploited in machine learning and computer vision tasks. Specifically, Li et al. [25] propose a dual uncertainty-aware pseudo-labeling method for self-training to achieve knowledge transfer. Wang et al. [38] propose an uncertainty-aware clustering framework for unsupervised domain adaptive object re-identification. Qin et al. [34] adopt an uncertainty-aware aggregation method for federated open-set domain adaptation to generate a global model from all client models. Fang et al. [16] propose a novel uncertainty-aware salient object detection model, which uses multiple supervision signals to teach the network to focus not only on salient regions but also on the pixels surrounding the contours of salient objects. Zhang et al. [50] propose a unified blind image quality assessment (BIQA) model together with a hinge constraint to regularize uncertainty estimation during optimization. Le et al. [24] propose an uncertainty-aware label distribution learning approach to improve the robustness of deep models against uncertainty and ambiguity in facial expression recognition. Different from existing works, we model the message propagation between the CNN and Transformer networks using an uncertainty-aware learning approach, which further improves the final recognition performance.
III METHODOLOGY
III-A Overview
As shown in Fig. 1, given the event streams, we first stack them into event images and extract their features using a StemNet. The proposed uncertainty-aware Mobile-Former block is stacked to form the backbone network. Specifically, we adopt MobileNet as the CNN branch to extract local features and utilize the Transformer to model long-range relations. Note that the Transformer takes randomly initialized tokens as input. To boost the interaction between the dual branches, we propose a novel uncertainty-aware bridge module to control the message passing. For the feature flow from the CNN to the Transformer branch, we adopt two MLPs to predict the mean and variance of a Gaussian distribution. Then, we sample a feature vector from the Gaussian distribution via the re-parameterization trick and aggregate it with the input of the Transformer branch using a cross-attention mechanism. Similar operations are conducted for the controllable information flow from the Transformer to the CNN branch. The outputs of the CNN and Transformer branches of the last uncertainty-aware Mobile-Former block are concatenated as the final feature representation. A classification head consisting of two dense layers is used for classification.
III-B Network Architecture
Input Representation and Embedding. Each point in the event stream is usually represented as a quadruple $(x, y, t, p)$, where $(x, y)$ are the spatial coordinates, $t$ is the timestamp, and $p$ denotes the polarity (e.g., $+1$ and $-1$ denote positive and negative event points). In this work, we stack the event stream into multiple event images due to their simplicity and effectiveness. Specifically, we first split the event stream into a fixed number of tensor tubes, then transform each split into one event frame. The obtained event images are visualized in Fig. 3.
After we obtain the event images, we resize them to a fixed resolution and design a StemNet to project them into feature embeddings. Specifically, a 3D convolutional layer is adopted to achieve this embedding and obtain the feature maps.
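To make the event-to-frame conversion and the stem embedding concrete, the following PyTorch sketch stacks raw $(x, y, t, p)$ events into polarity count frames and projects them with a single 3D convolution; the frame count, resolution, kernel size, and channel numbers are illustrative placeholders rather than the exact settings of our StemNet.

```python
import numpy as np
import torch
import torch.nn as nn

def events_to_frames(x, y, t, p, num_frames=8, height=240, width=304):
    """Split an event stream into equal-duration slices and stack each slice
    into a 2-channel (positive/negative polarity) count image.
    The frame count and resolution here are illustrative choices."""
    frames = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    t_norm = (t - t.min()) / max(t.max() - t.min(), 1e-9)   # normalize timestamps to [0, 1]
    idx = np.minimum((t_norm * num_frames).astype(int), num_frames - 1)
    pol = (p > 0).astype(int)                                # 0: negative, 1: positive
    np.add.at(frames, (idx, pol, y, x), 1.0)                 # accumulate event counts
    return torch.from_numpy(frames)                          # (T, 2, H, W)

# A hypothetical StemNet: a single 3D convolution projecting stacked event
# frames into feature maps (channel count and stride are placeholders).
stem = nn.Conv3d(in_channels=2, out_channels=16, kernel_size=3, stride=2, padding=1)

# Toy usage with random events.
n = 10000
xs = np.random.randint(0, 304, n)
ys = np.random.randint(0, 240, n)
ts = np.sort(np.random.rand(n))
ps = np.random.choice([-1, 1], n)
frames = events_to_frames(xs, ys, ts, ps)                    # (8, 2, 240, 304)
feats = stem(frames.permute(1, 0, 2, 3).unsqueeze(0))        # (1, 16, 4, 120, 152)
```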
Uncertainty-aware Bridge based Mobile-Former. Given the feature embeddings of the event streams, we design a novel uncertainty-aware Mobile-Former network that consists of lightweight mobile layers, Transformer layers, and uncertainty-aware bridge (UA-Bridge) modules. As shown in Fig. 1, the event feature embeddings are fed into a 3D convolutional layer and a dynamic ReLU (DY-ReLU) layer. Then, two depth-wise 3D convolutional layers are utilized for local feature mining. The output then passes through another DY-ReLU layer and two 3D convolutional layers. Note that the local feature learning in the MobileNet branch also takes the information from the Transformer branch into account by dynamically updating the parameters of the ReLU layer.
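For readers unfamiliar with dynamic ReLU, the following simplified sketch shows how the activation parameters of the MobileNet branch can be predicted from the global Transformer tokens; the class name, the pooling over tokens, and the two-piece formulation are our own simplifications of DY-ReLU, not the exact layer used in our implementation.

```python
import torch
import torch.nn as nn

class SimpleDyReLU(nn.Module):
    """A rough sketch of dynamic ReLU: a piecewise-linear activation whose
    per-channel slopes/offsets are predicted from the global Transformer
    tokens by a small MLP (K=2 pieces is our simplification)."""
    def __init__(self, channels, token_dim):
        super().__init__()
        self.fc = nn.Linear(token_dim, 4 * channels)   # 2 slopes + 2 offsets per channel
        self.channels = channels

    def forward(self, x, tokens):
        # x: (B, C, T, H, W) local features; tokens: (B, M, D) global tokens
        theta = self.fc(tokens.mean(dim=1))            # (B, 4C), pooled over tokens
        a1, a2, b1, b2 = theta.view(-1, 4, self.channels).unbind(dim=1)
        a1 = a1 + 1.0                                  # initialize around a plain ReLU (slopes 1 and 0)
        shape = (-1, self.channels, 1, 1, 1)
        y1 = a1.view(shape) * x + b1.view(shape)
        y2 = a2.view(shape) * x + b2.view(shape)
        return torch.maximum(y1, y2)                   # element-wise max over the two pieces
```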

The input feature embeddings $\mathcal{F}_{m}$ are also passed into the UA-Bridge module, which contains two MLPs for Gaussian distribution estimation. This module adaptively controls the message propagation between the dual branches. Mathematically, the mean $\mu$ and variance $\sigma$ of the message to be propagated can be written as:
$\mu = \text{MLP}_{\mu}(\mathcal{F}_{m}), \quad \sigma = \text{MLP}_{\sigma}(\mathcal{F}_{m})$,   (1)
Then, a multivariate Gaussian distribution can be built using the predicted mean and variance. The filtered feature $\hat{\mathcal{F}}_{m}$ passed from the MobileNet to the Transformer branch can be sampled from this Gaussian distribution using the reparameterization trick [22]:
$\hat{\mathcal{F}}_{m} = \mu + \epsilon \odot \sigma, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$,   (2)
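A minimal PyTorch sketch of the UA-Bridge idea described by Eq. (1) and Eq. (2) is given below; the module and variable names are ours, and predicting the log-variance instead of the variance is a common numerical-stability choice we assume here.

```python
import torch
import torch.nn as nn

class UABridge(nn.Module):
    """Sketch of the uncertainty-aware bridge: two MLPs predict the mean and
    (log-)variance of a Gaussian over the message, and a sample is drawn via
    the reparameterization trick."""
    def __init__(self, dim):
        super().__init__()
        self.mean_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.logvar_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feat):
        mu = self.mean_mlp(feat)            # Eq. (1): predicted mean
        logvar = self.logvar_mlp(feat)      # log-variance keeps sigma positive
        sigma = torch.exp(0.5 * logvar)
        eps = torch.randn_like(sigma)       # Eq. (2): reparameterization trick
        return mu + eps * sigma             # sampled message for the other branch
```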
After that, we adopt a cross-attention layer to aggregate $\hat{\mathcal{F}}_{m}$ with the Transformer tokens $\mathcal{Z}$, which can be formulated as:
$\text{CrossAttn}(Q, K, V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$,   (3)
where $d$ is the dimension of the input feature vectors. For the cross-attention from MobileNet to Transformer, the sampled feature $\hat{\mathcal{F}}_{m}$ is reshaped into a feature vector and treated as the Query ($Q$) and Key ($K$), while the Transformer tokens serve as the Value ($V$). For the opposite direction, the global tokens of the Transformer are used as the Query and Key, and the local CNN features are the Value.
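The scaled dot-product cross-attention of Eq. (3) can be sketched as follows; the shapes are placeholders, and the usage example shows one illustrative fusion direction (global tokens attending over the sampled local features), while the exact Query/Key/Value assignments of our model follow the description above.

```python
import torch
import torch.nn.functional as F

def cross_attention(q, k, v):
    """Eq. (3): softmax(Q K^T / sqrt(d)) V, with d the feature dimension."""
    d = q.size(-1)
    attn = F.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v

# Illustrative fusion step: 6 global tokens attend over the sampled local
# features (flattened CNN positions). Shapes here are placeholders.
B, L, M, D = 2, 49, 6, 192
f_hat = torch.randn(B, L, D)       # feature sampled by the UA-Bridge
tokens = torch.randn(B, M, D)      # randomly initialized global tokens
fused_tokens = cross_attention(tokens, f_hat, f_hat)   # (B, M, D)
```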
For the Transformer branch, we take the randomly initialized tokens as input and fuse them with the CNN features using the cross-attention layer mentioned above. Then, standard multi-head self-attention (MHSA) and feed-forward networks are adopted for long-range global feature learning. The output tokens are used to update the parameters of the DY-ReLU layer and are also fed into the UA-Bridge module for aggregation with the CNN features. Similar operations are conducted in the subsequent Mobile-Former blocks.
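A minimal sketch of this Transformer-side update using PyTorch's built-in layers is shown below; the head count, hidden sizes, and pre-norm arrangement are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

dim, heads, num_tokens = 192, 4, 6
mhsa = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
norm1, norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

tokens = torch.randn(2, num_tokens, dim)                  # fused global tokens
x = norm1(tokens)
tokens = tokens + mhsa(x, x, x, need_weights=False)[0]    # multi-head self-attention
tokens = tokens + ffn(norm2(tokens))                      # feed-forward network
```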
Classification Head and Loss Function. The local features from the CNN blocks and the global features from the Transformer blocks are concatenated and fed into two fully connected layers for pattern recognition. The cross-entropy loss function is adopted for training the whole framework in an end-to-end manner, which can be formulated as:
$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log\hat{y}_{i,c}$,   (4)
where $N$ denotes the batch size and $C$ denotes the number of event classes. $y_{i,c}$ and $\hat{y}_{i,c}$ represent the ground-truth and predicted class labels of the $i$-th event sample, respectively.
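A small sketch of the classification head and loss under assumed feature dimensions; `nn.CrossEntropyLoss` realizes Eq. (4) with integer class labels, and the hidden size and class count below are placeholders.

```python
import torch
import torch.nn as nn

num_classes, cnn_dim, token_dim, hidden = 101, 192, 192, 256   # illustrative sizes
head = nn.Sequential(
    nn.Linear(cnn_dim + token_dim, hidden),
    nn.ReLU(),
    nn.Linear(hidden, num_classes),
)
criterion = nn.CrossEntropyLoss()                  # implements Eq. (4)

local_feat = torch.randn(8, cnn_dim)               # pooled MobileNet-branch feature
global_feat = torch.randn(8, token_dim)            # pooled Transformer-branch feature
logits = head(torch.cat([local_feat, global_feat], dim=-1))
loss = criterion(logits, torch.randint(0, num_classes, (8,)))
```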
IV EXPERIMENTS
IV-A Dataset and Evaluation Metric
IV-B Implementation Details
Our proposed lightweight uncertainty-aware bridge based Mobile-Former framework can be optimized in an end-to-end manner. The learning rate and weight decay are set to 0.0001 and 0.1, respectively. AdamW [27] is selected as the optimizer, and the model is trained for a total of 60 epochs. In our implementation, 12 blocks are stacked as the backbone network. For the input of the Transformer branch, we randomly initialize 6 tokens; other settings are also tested and discussed in the experiments. We select 8 event frames as the input of the MobileNet branch. Our code is implemented based on the PyTorch [31] framework, and the experiments are conducted on a server with an NVIDIA V100 GPU.
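The training setup can be sketched as follows, with `model` and `train_loader` standing in for our full network and event-frame loader; only the optimizer, learning rate, weight decay, and epoch count follow the settings reported above.

```python
import torch

# Placeholders for the full uncertainty-aware Mobile-Former and the event-frame loader.
model = torch.nn.Linear(16, 101)
train_loader = [(torch.randn(4, 16), torch.randint(0, 101, (4,))) for _ in range(10)]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(60):                    # trained for 60 epochs in total
    for frames, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
```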
IV-C Comparison with Other SOTA Algorithms
Results on ASL-DVS [5]. As shown in Table I, our proposed method achieves 0.999 on this benchmark dataset, which is a new SOTA performance. The compared method M-LSTM, which adopts a learnable event representation, is still inferior to our method. Several graph-based event recognition models are also worse than ours, including EV-VGCNN, VMV-GCN, and EV-Gait-3DGraph. Therefore, we can draw the conclusion that our proposed lightweight model is effective for event-based pattern recognition.
Method | Accuracy
EST [18] | 0.979
AMAE [10] | 0.984
M-LSTM [3] | 0.980
MVF-Net [9] | 0.971
EventNet [35] | 0.833
RG-CNNs [5] | 0.901
EV-VGCNN [8] | 0.983
VMV-GCN [49] | 0.989
EV-Gait-3DGraph [44] | 0.738
Ours | 0.999
Results on N-Caltech101 [30]. As shown in Table II, our model achieves 0.798 top-1 accuracy on this benchmark dataset, which is significantly better than the compared methods. To be specific, our model outperforms ResNet-50 by +0.161 on the top-1 metric and beats the second-ranked VMV-GCN by +0.020. Thanks to the uncertainty-aware local and global feature learning, our model achieves superior performance, which fully validates its effectiveness.
Method | Top-1 accuracy
EventNet [35] | 0.425
Gabor-SNN [36] | 0.196
RG-CNNs [5] | 0.657
VMV-GCN [49] | 0.778
EV-VGCNN [8] | 0.748
EST [18] | 0.753
ResNet-50 [28] | 0.637
MVF-Net [9] | 0.687
M-LSTM [3] | 0.738
AMAE [10] | 0.694
HATS [36] | 0.642
Ours | 0.798
Results on DVS128-Gait-Day [46]. This dataset is specifically proposed for human gait recognition by Wang et al. As shown in Table III, EV-Gait-3DGraph already achieves a strong result on this dataset, and our proposed method performs better than this 3D graph based recognition model. The superior results on this dataset fully demonstrate that our model works well on event-based recognition.
IV-D Ablation Study
To help the readers better understand the effectiveness of aggregating the MobileNet and Transformer branches, we isolate these components in Table IV to validate their contributions. To be specific, when only the MobileNet or the Transformer branch is used for recognition, the results are 76.53 and 58.01, respectively. When both branches are aggregated using the cross-attention mechanism, the result improves to 76.83, which demonstrates that fusing the local and global features is effective for event-based recognition. For the ReLU activation layer, its parameters can be predicted by MLPs in an online manner to attain greater flexibility and robustness; the result improves further to 77.94 when the dynamic ReLU (DY-ReLU) is adopted.
In this work, we exploit uncertainty-aware message propagation between the MobileNet and Transformer branches to achieve effective feature aggregation. As shown in Table IV, the result improves to 79.80 when this module is adopted. This experiment fully validates the effectiveness of controllable information propagation for event-based recognition.
No. | Mobile | Former | UAB | CA | DY-ReLU | Results
1 | ✓ | | | | | 76.53
2 | | ✓ | | | | 58.01
3 | ✓ | ✓ | | ✓ | | 76.83
4 | ✓ | ✓ | | ✓ | ✓ | 77.94
5 | ✓ | ✓ | ✓ | ✓ | ✓ | 79.80
IV-E Parameter Analysis
When building our network, there are multiple flexible design choices that can influence the final recognition results. This section explores the following aspects and reports the corresponding results on the N-Caltech101 dataset in Fig. 2.
1). Number and dimension of input tokens for the Transformer: We vary the dimension of the input tokens among 64, 128, 192, and 256. From Fig. 2 (a), it is easy to find that the best performance is achieved when 192 is used. For the number of input tokens, we set it to 1, 3, 6, and 9, and the corresponding top-1 accuracies are 74.56, 76.47, 79.85, and 73.72, as shown in Fig. 2 (b).
2). Number of input frames for MobileNet: We input 4, 8, and 12 event frames into the MobileNet branch, and the corresponding results are 78.16, 79.85, and 71.01, as shown in Fig. 2 (c). One can find that the best result is obtained when eight frames are used. Introducing more event frames brings in a substantial amount of redundant information, making it challenging for the model to extract useful features. An excessive number of frames also increases model complexity, making training more difficult and potentially increasing the risk of overfitting. This phenomenon is also referred to as information overload.

3). Number of uncertainty-aware bridge based Mobile-Former blocks: When building our backbone network, different numbers of uncertainty-aware Mobile-Former layers can be used. In this part, we test 9, 12, and 14 layers and obtain top-1 results of 74.22, 79.85, and 72.87, respectively, as shown in Fig. 2 (d).
IV-F Visualization
In addition to the aforementioned quantitative experimental analysis, we also provide visualizations to better assist readers in comprehending the effectiveness of our model. As shown in Fig. 3, we first give a visualization of the MobileNet feature maps. One can find that our model performs well in capturing the active event regions. As shown in Fig. 4, we also present the top-5 recognition results and their corresponding confidence scores for the model predictions. It is evident that our approach can accurately identify the patterns captured by the event camera. As shown in Fig. 5, we randomly select samples from 10 classes of the N-Caltech101 dataset and project their features into a 2D space. It is easy to find that our model performs well in separating these categories.



V Conclusion
In this work, we propose to recognize objects based on event streams. We propose a lightweight uncertainty-aware information propagation based Mobile-Former network for efficient pattern recognition, which aggregates the MobileNet and Transformer networks effectively. Extensive experiments on multiple event-based recognition datasets fully validate the effectiveness of our model. In future work, we will consider knowledge distillation strategies to further enhance the final recognition performance.
References
- [1] T. Ahmad, L. Jin, X. Zhang, L. Lin, and G. Tang. Graph convolutional neural network for action recognition: A comprehensive survey. IEEE Transactions on Artificial Intelligence, 2021.
- [2] R. Baldwin, R. Liu, M. M. Almatrafi, V. K. Asari, and K. Hirakawa. Time-ordered recent event (tore) volumes for event cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [3] A. Behera, A. Keidel, and B. Debnath. Context-driven multi-stream lstm (m-lstm) for recognizing fine-grained activity of drivers. In German Conference on Pattern Recognition, pages 298–314. Springer, 2018.
- [4] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos. Graph-based object classification for neuromorphic vision sensing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 491–501, 2019.
- [5] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos. Graph-based spatio-temporal feature learning for neuromorphic vision sensing. IEEE Transactions on Image Processing, 29:9084–9098, 2020.
- [6] S. Chen and M. Guo. Live demonstration: Celex-v: a 1m pixel multi-mode event-based sensor. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1682–1683. IEEE, 2019.
- [7] Y. Chen, X. Dai, D. Chen, M. Liu, X. Dong, L. Yuan, and Z. Liu. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5270–5279, 2022.
- [8] Y. Deng, H. Chen, H. Chen, and Y. Li. Evvgcnn: A voxel graph cnn for event-based object classification. arXiv preprint arXiv:2106.00216, 1(2):6, 2021.
- [9] Y. Deng, H. Chen, and Y. Li. Mvf-net: A multi-view fusion network for event-based object classification. IEEE Transactions on Circuits and Systems for Video Technology, 32(12):8275–8284, 2021.
- [10] Y. Deng, Y. Li, and H. Chen. Amae: Adaptive motion-agnostic encoder for event-based object classification. IEEE Robotics and Automation Letters, 5(3):4596–4603, 2020.
- [11] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In 2015 International joint conference on neural networks (IJCNN), pages 1–8. ieee, 2015.
- [12] C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
- [13] H. Fang, A. Shrestha, Z. Zhao, and Q. Qiu. Exploiting neuron and synapse filter dynamics in spatial temporal learning of deep spiking neural network. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, pages 2799–2806, 2021.
- [14] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian. Deep residual learning in spiking neural networks. NeurIPS, 2021.
- [15] W. Fang, Z. Yu, Y. Chen, T. Masquelier, T. Huang, and Y. Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
- [16] Y. Fang, H. Zhang, J. Yan, W. Jiang, and Y. Liu. Udnet: Uncertainty-aware deep network for salient object detection. Pattern Recognition, 134:109099, 2023.
- [17] G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, et al. Event-based vision: A survey. IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020.
- [18] D. Gehrig, A. Loquercio, K. G. Derpanis, and D. Scaramuzza. End-to-end learning of representations for asynchronous event-based data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5633–5643, 2019.
- [19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
- [20] A. Khamparia, S. K. Singh, and A. K. Luhach. Svm-pca based handwritten devanagari digit character recognition. Recent Advances in Computer Science and Communications (Formerly: Recent Patents on Computer Science), 14(1):48–53, 2021.
- [21] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah. Transformers in vision: A survey. arXiv preprint arXiv:2101.01169, 2021.
- [22] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28, 2015.
- [23] Y. Kong and Y. Fu. Human action recognition and prediction: A survey. arXiv preprint arXiv:1806.11230, 2018.
- [24] N. Le, K. Nguyen, Q. Tran, E. Tjiputra, B. Le, and A. Nguyen. Uncertainty-aware label distribution learning for facial expression recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6088–6097, 2023.
- [25] D. Li, Z. Zhang, C. Shan, and L. Wang. Incremental pedestrian attribute recognition via dual uncertainty-aware pseudo-labeling. IEEE Transactions on Information Forensics and Security, 2023.
- [26] Z. Li, M. S. Asif, and Z. Ma. Event transformer. arXiv preprint arXiv:2204.05172, 2022.
- [27] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [28] S. Mascarenhas and M. Agarwal. A comparison between vgg16, vgg19 and resnet50 architecture frameworks for image classification. In 2021 International conference on disruptive technologies for multi-disciplinary research and applications (CENTCON), volume 1, pages 96–99. IEEE, 2021.
- [29] Q. Meng, M. Xiao, S. Yan, Y. Wang, Z. Lin, and Z.-Q. Luo. Training high-performance low-latency spiking neural networks by differentiation on spike representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12444–12453, 2022.
- [30] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience, 9:437, 2015.
- [31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
- [32] N. Perez-Nieves and D. Goodman. Sparse spiking gradient descent. Advances in Neural Information Processing Systems, 34:11795–11808, 2021.
- [33] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
- [34] Z. Qin, L. Yang, F. Gao, Q. Hu, and C. Shen. Uncertainty-aware aggregation for federated open set domain adaptation. IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [35] Y. Sekikawa, K. Hara, and H. Saito. Eventnet: Asynchronous recursive event processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3887–3896, 2019.
- [36] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman. Hats: Histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1731–1740, 2018.
- [37] S. Vemprala, S. Mian, and A. Kapoor. Representation learning for event-based visuomotor policies. Advances in Neural Information Processing Systems, 34:4712–4724, 2021.
- [38] P. Wang, C. Ding, W. Tan, M. Gong, K. Jia, and D. Tao. Uncertainty-aware clustering for unsupervised domain adaptive object re-identification. IEEE Transactions on Multimedia, 2022.
- [39] Q. Wang, Y. Zhang, J. Yuan, and Y. Lu. Space-time event clouds for gesture recognition: From rgb cameras to event cameras. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1826–1835. IEEE, 2019.
- [40] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu. Visevent: Reliable object tracking via collaboration of frame and event flows. IEEE Transactions on Cybernetics, 2023.
- [41] X. Wang, Y. Rong, S. Wang, Y. Chen, Z. Wu, B. Jiang, Y. Tian, and J. Tang. Unleashing the power of cnn and transformer for balanced rgb-event video recognition. arXiv preprint arXiv:2312.11128, 2023.
- [42] X. Wang, Z. Wu, Y. Rong, L. Zhu, B. Jiang, J. Tang, and Y. Tian. Sstformer: bridging spiking neural network and memory support transformer for frame-event based recognition. arXiv preprint arXiv:2308.04369, 2023.
- [43] Y. Wang, B. Du, Y. Shen, K. Wu, G. Zhao, J. Sun, and H. Wen. Ev-gait: Event-based robust gait recognition using dynamic vision sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6358–6367, 2019.
- [44] Y. Wang, B. Du, Y. Shen, K. Wu, G. Zhao, J. Sun, and H. Wen. Ev-gait: Event-based robust gait recognition using dynamic vision sensors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6358–6367, 2019.
- [45] Y. Wang, X. Zhang, Y. Shen, B. Du, G. Zhao, L. Cui, and H. Wen. Event-stream representation for human gaits identification using deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3436–3449, 2021.
- [46] Y. Wang, X. Zhang, Y. Shen, B. Du, G. Zhao, L. Cui, and H. Wen. Event-stream representation for human gaits identification using deep neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [47] K. Xia, J. Huang, and H. Wang. Lstm-cnn architecture for human activity recognition. IEEE Access, 8:56855–56866, 2020.
- [48] B. Xie, Y. Deng, Z. Shao, H. Liu, and Y. Li. Vmv-gcn: Volumetric multi-view based graph cnn for event stream classification. IEEE Robotics and Automation Letters, 7(2):1976–1983, 2022.
- [49] B. Xie, Y. Deng, Z. Shao, H. Liu, and Y. Li. Vmv-gcn: Volumetric multi-view based graph cnn for event stream classification. IEEE Robotics and Automation Letters, 7(2):1976–1983, 2022.
- [50] W. Zhang, K. Ma, G. Zhai, and X. Yang. Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Transactions on Image Processing, 30:3474–3486, 2021.