
¹ Anhui University, Hefei, 230601, Anhui Province, China
Emails: {jy_0x4f, 17398389386}@163.com; {e21301283, E02114335, E02114336}@stu.ahu.edu.cn; {chenlan, xiaowang}@ahu.edu.cn

Learning Bottleneck Transformer for Event Image-Voxel Feature Fusion based Classification
(Corresponding author: Lan Chen)

Chengguo Yuan¹, Yu Jin¹, Zongzhen Wu¹, Fanting Wei¹, Yangzirui Wang¹, Lan Chen¹ (✉), Xiao Wang¹
Abstract

Recognizing target objects with event-based cameras has drawn increasing attention in recent years. Existing works usually represent the event streams as point clouds, voxels, images, etc., and learn feature representations using various deep neural networks. Their final results may be limited by two factors: monotonous modal expressions and the design of the network structure. To address these challenges, this paper proposes a novel dual-stream framework for event representation, extraction, and fusion. The framework simultaneously models two common representations: event images and event voxels. By utilizing Transformer and Structured Graph Neural Network (GNN) architectures, spatial information and three-dimensional stereo information can be learned separately. Additionally, a bottleneck Transformer is introduced to facilitate the fusion of the dual-stream information. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on two widely used event-based classification datasets. The source code of this work is available at: https://github.com/Event-AHU/EFV_event_classification

Keywords:
Event Camera · Graph Neural Networks · Transformer Network · Bottleneck Fusion.

1 Introduction

Recognizing the category of a given object is a fundamental problem in computer vision. Most previous classification models are developed for frame-based cameras; in other words, these recognition models focus on encoding and learning representations of RGB frames. With the rapid development of deep learning, frame-based classification has achieved significant improvements in recent years. Representative deep models (e.g., AlexNet [1], ResNet [2], and the Transformer [3]) and datasets (e.g., ImageNet [4]) have been proposed one after another. However, the recognition performance in challenging scenarios is still far from satisfactory, including heavy occlusion, fast motion, and low illumination.

Figure 1: Comparison of frame- and event-based cameras (video: https://youtu.be/6xOmo7Ikwzk). (a, b) show representative samples in regular scenarios, low illumination, and fast motion. (c, d) illustrate the different types of raw data representation of frame- and event-based cameras.

To improve object recognition in challenging scenarios, some researchers have started leveraging other sensors to obtain more effective signal inputs, thus enhancing recognition performance [5]. Among them, one of the most representative sensors is the event camera, also known as the DVS (Dynamic Vision Sensor), which has been widely exploited in computer vision [6, 7, 8]. This paper focuses on using event cameras for object recognition. As shown in Fig. 1, different from the frame-based camera, which records the light intensity of all pixels simultaneously, the event camera captures pulse signals asynchronously based on changes in light intensity, recording binary values of either zero or one. Typically, an increase in brightness is denoted as an ON event, while a decrease corresponds to an OFF event. An event pulse signal can be represented as a quadruple $(x, y, t, p)$, where $(x, y)$ denotes the spatial position, $t$ denotes the timestamp, and $p$ denotes the polarity, i.e., an ON/OFF event. Many works demonstrate that the event camera offers advantages in high dynamic range (HDR), high temporal resolution, low-latency response, and robustness. Therefore, utilizing event cameras for object recognition is a research direction with great research value and practical potential.
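To make the data format concrete, the following is a minimal Python/NumPy sketch of how a stream of $(x, y, t, p)$ quadruples can be accumulated into a two-channel ON/OFF event image; the array layout of the event stream is an assumption for illustration.

```python
import numpy as np

def events_to_image(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate (x, y, t, p) events into a 2-channel ON/OFF count image.

    `events` is assumed to be an (N, 4) array with columns x, y, t, p,
    where p > 0 marks an ON event (brightness increase) and p <= 0 an OFF event.
    """
    img = np.zeros((2, height, width), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    pol = (events[:, 3] > 0).astype(np.int64)   # channel 0: OFF, channel 1: ON
    np.add.at(img, (pol, y, x), 1.0)            # count events per pixel and polarity
    return img
```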

Recently, researchers have conducted studies on object recognition using event cameras and have proposed various approaches to address this task, including CNNs (Convolutional Neural Networks) [9], GNNs (Graph Neural Networks) [10], Transformers [3], etc. Although these methods achieve good accuracy by representing and learning events from different perspectives, they are still limited in the following aspects. Firstly, they rely on a single event representation form, such as images, point clouds, or voxels, which may limit the expressiveness and versatility of the learned features; different representation forms capture different aspects of the data, and using only one may lose valuable information. Secondly, current methods are constrained to a single deep learning architecture, such as a CNN, GNN, or Transformer, for feature learning; each architecture has its own strengths and limitations in capturing different types of patterns and dependencies, and restricting the choice to one architecture may fail to exploit their complementary strengths. To address these limitations, it is desirable to integrate multiple event representation forms and leverage the combined power of different deep learning architectures, for example, by developing fusion techniques or hybrid architectures that effectively capture the diverse features and dependencies present in event data. By doing so, we can potentially enhance the performance and flexibility of event-based object recognition methods.

To address the aforementioned issues, in this work we propose an effective dual-stream event information processing framework, referred to as EFV, as shown in Fig. 2. Specifically, we first transform the dense event point signals into event image and event voxel representations. For the image-frame input, we utilize a spatio-temporal Transformer network to learn spatio-temporal features. For the voxel input, considering the sparsity of events, we employ a top-K selection method to sample meaningful signals for constructing a structured graph, and then use a GNN (Graph Neural Network) to learn these volumetric structured features. Importantly, we introduce the Bottleneck Transformer to integrate these two types of feature representations, which are ultimately fed into a dense layer for classification. Our proposed EFV thus combines efficient event information processing, integration of multiple feature representations, spatio-temporal modeling capability, consideration of event sparsity, and accurate classification.

To sum up, the main contributions of this work can be concluded as the following two aspects:

\bullet We propose an effective framework for event-based recognition, utilizing Event Image-Voxel feature representation and fusion.

\bullet The introduction of the Bottleneck Transformer enables the interaction and fusion of dual-stream information, leading to improved recognition results.

2 Related Work

In this section, we give an introduction to Event-based Recognition (see also https://github.com/Event-AHU/Event_Camera_in_Top_Conference), Graph Neural Networks, and the Bottleneck Transformer.

Event-based Recognition. Current research on event-based recognition can be divided into three distinct streams: CNN-based [11], SNN (Spiking Neural Network)-based [12, 13], and GNN-based models [14, 15, 16]. For the CNN-based models, Wang et al. [11] proposed an event-based gait recognition (EV-Gait) method, which effectively removes noise via motion consistency. SNNs are also utilized to encode the event stream for energy-efficient recognition. Diehl et al. [17] put forward a highly efficient ANN-to-SNN conversion method based on balancing weights and thresholds, which achieves lower latency and requires fewer operations. In [18], a sparse backpropagation method for SNNs was introduced by redefining the surrogate gradient function. Fang et al. [19] propose the spike element-wise (SEW) ResNet to implement residual learning for deep SNNs, and show that SEW ResNet can easily implement identity mapping and overcome the vanishing/exploding gradient problem of Spiking ResNet. Wang et al. propose a hybrid SNN-ANN framework for RGB-Event based recognition, termed SSTFormer, which fuses a memory-support Transformer and spiking neural networks [20]. Jiang et al. propose to aggregate event points and voxels using absorbing graph neural networks for event-based recognition [21].

For point cloud based representations, Wang et al. [22] treat the event stream as a set of 3D points in space-time, i.e., space-time event clouds, and adopt the PointNet [23] architecture, which directly takes the point cloud as input and outputs a class label for the entire input or a segment/part label for each point. Xie et al. [24] propose VMV-GCN, a voxel-wise graph learning model designed to integrate multi-view volumetric data. Li et al. [25] introduce a Transformer network that directly processes event sequences in their native vector-tensor format to effectively capture the temporal and spatial correlations of raw events, thereby generating effective spatio-temporal features for the task. Different from previous works, this paper designs an event recognition method based on a Transformer and a graph convolutional neural network, which exchanges bimodal information through bottleneck tokens and learns a unified feature representation, so as to represent event data more effectively.

Graph Neural Networks. One notable application of GNNs to event data is gait recognition. Wang et al. [16] propose a 3D graph neural network specifically designed for gait recognition, which leverages the graph structure to capture the spatial and temporal dependencies in gait patterns. Bi et al. introduce residual Graph Convolutional Neural Networks (RG-CNN) and Graph2Grid blocks [15, 14], which exploit the graph structure to extract spatial and temporal information from event data. The Asynchronous, Event-based Graph Neural Network (AEGNN) proposed in [26] processes events as “evolving” spatio-temporal graphs. In the field of object recognition, Li et al. [27] introduce SlideGCN, a GNN-based model that focuses on fast graph construction using a radius search algorithm. Different from previous works, we adopt a Graph Convolutional Network (GCN) to process the graph data and connect the outputs of the GCN and ST-Transformer modules for accurate event-based pattern recognition.

Bottleneck Transformer. The traditional Transformer model suffers from large computation and memory overhead when processing large images. Srinivas et al. [28] propose a network architecture called the Bottleneck Transformer, which reduces the dimensionality of spatial attention by introducing a “bottleneck layer” between high-resolution and low-resolution representations of the input features, thereby reducing computational cost and increasing model scalability. Li et al. [29] introduce a local multi-head self-attention mechanism and a novel position encoding method to address the scalability bottleneck of Transformers under GPU memory constraints. Nagrani et al. [30] propose a Multimodal Bottleneck Transformer (MBT) and guide the bottleneck tokens to exchange information across modalities. Song et al. [31] propose BS2T, a model that captures long-range dependencies between pixels in hyperspectral images by leveraging the self-attention mechanism of Transformers. In this work, we introduce the bottleneck Transformer to promote the fusion of dual-stream information and improve the performance of modality fusion.

3 Our Proposed Approach

3.1 Overview

Given an input event stream consisting of hundreds of thousands of events, our approach involves several steps to enhance the representation. Initially, we employ event frame stacking and voxel construction techniques to generate event frame and voxel representations, respectively. We then use these two intermediate representations, namely the event frames and the voxel graph, to capture the spatio-temporal relationships within the event stream. To further improve the feature descriptors of the event frames and the graph-based event representation, we propose a novel dual-branch learning network. Finally, we combine these representations into a comprehensive representation of the event data, enabling effective recognition. The overall framework is depicted in Fig. 2. In the following sections, we provide a detailed explanation of each module.

3.2 Network Architecture

Input Representation. Considering the large amount of data and computational complexity, it is necessary to employ down-sampling techniques to reduce the number of events. In this paper, we adopt two kinds of sampling techniques to obtain compressed event representations. We first transform the asynchronous event flow into synchronous event images by stacking the events within a time interval based on the exposure time. We also employ voxelization to obtain voxel representations. Specifically, given the original event stream $\mathcal{E}$ with range $H, W, T$, we divide the spatio-temporal 3D space into voxels, with the size of each voxel being $h^{\prime}, w^{\prime}, t^{\prime}$. Hence, each voxel generally contains several events, and the resulting event voxels in spatio-temporal space are of size $H/h^{\prime}, W/w^{\prime}, T/t^{\prime}$. In practice, the above voxelization usually still produces tens of thousands of voxels. To further reduce the number of voxels and alleviate the effect of noisy voxels, we adopt a voxel selection process that keeps the top $K$ voxels ranked by the number of events contained in each voxel. Let $\mathcal{O}=\{o_{1},o_{2},\cdots,o_{K}\}$ denote the collection of the final selected voxels. Each event voxel $o_{i}$ is associated with a feature descriptor $\textbf{a}_{i}\in\mathbb{R}^{C}$ which integrates the attributes (polarity) of its involved events. Hence, each $o_{i}\in\mathcal{O}$ is represented as $o_{i}=(x_{i},y_{i},t_{i},\textbf{a}_{i})$, where $x_{i},y_{i},t_{i}$ denotes the 3D coordinate of the voxel.
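A minimal sketch of the voxelization and top-$K$ selection described above is given below. The event array layout and the two-channel ON/OFF polarity histogram used as the descriptor $\textbf{a}_i$ are illustrative assumptions; the voxel size (4, 4, 4) and $K=512$ follow the settings reported in Section 4.2.

```python
import numpy as np

def voxelize_topk(events: np.ndarray, voxel_size=(4, 4, 4), k=512):
    """Voxelize (x, y, t, p) events and keep the K voxels containing the most events.

    Returns the 3D coordinates (K, 3) of the selected voxels and a simple
    per-voxel ON/OFF polarity histogram used as the feature descriptor a_i.
    """
    coords = np.floor(events[:, :3] / np.asarray(voxel_size)).astype(np.int64)
    uniq, inv, counts = np.unique(coords, axis=0, return_inverse=True, return_counts=True)
    inv = inv.reshape(-1)
    # Two-channel histogram per voxel: number of OFF and ON events it contains.
    feats = np.zeros((len(uniq), 2), dtype=np.float32)
    np.add.at(feats, (inv, (events[:, 3] > 0).astype(np.int64)), 1.0)
    # Keep the top-K most populated voxels to suppress noisy voxels and reduce cost.
    keep = np.argsort(-counts)[:k]
    return uniq[keep].astype(np.float32), feats[keep]
```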

Figure 2: An overview of our proposed Image-Voxel Feature Learning framework for event-based recognition.

Graph Neural Networks for Event Voxel Encoding. We construct a geometric neighboring graph $G^{o}(V^{o},E^{o})$ for the voxel event data $\mathcal{O}$. To be specific, each node $v_{i}\in V^{o}$ represents a voxel $o_{i}=(x_{i},y_{i},t_{i},\textbf{a}_{i})\in\mathcal{O}$, which is described by a feature vector $\textbf{a}_{i}\in\mathbb{R}^{C}$. An edge $e_{ij}\in E^{o}$ exists between nodes $v_{i}$ and $v_{j}$ if the Euclidean distance between their 3D coordinates is less than a threshold $R$. We adopt Gaussian Mixture Model (GMM) convolution to learn effective representations for the voxel graph. To be specific, in each GCN layer, each event node $v_{i}$ aggregates the features from its adjacent nodes as

$f^{\prime}_{d}(v_{i}) \leftarrow \sigma\Big(\sum_{v\in V}\omega_{d}(v_{i},v)\,f(v)\Big), \quad d=1,2,\cdots,D$   (1)

where $\sigma(\cdot)$ denotes the activation function, such as ReLU, $V$ denotes the adjacent nodes of $v_{i}$, and $\omega_{d}(v_{i},v)$ denotes the learnable convolution kernel weights. Finally, we adopt average graph pooling to obtain the global representation of the voxel graph.
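A minimal sketch of this voxel-graph branch follows, assuming PyTorch Geometric as the backend (its GMMConv layer for the GMM convolution, radius_graph for the threshold-$R$ edge construction, and global_mean_pool for the average graph pooling); the hidden sizes are illustrative choices, not taken from the paper.

```python
import torch
from torch_geometric.nn import GMMConv, global_mean_pool, radius_graph

class VoxelGraphEncoder(torch.nn.Module):
    """Voxel-graph branch: GMM convolutions over a radius graph, then average pooling."""

    def __init__(self, in_dim: int = 2, hid_dim: int = 64, out_dim: int = 256, kernel_size: int = 8):
        super().__init__()
        self.conv1 = GMMConv(in_dim, hid_dim, dim=3, kernel_size=kernel_size)
        self.conv2 = GMMConv(hid_dim, out_dim, dim=3, kernel_size=kernel_size)

    def forward(self, pos, feat, batch, r: float = 2.0):
        # Connect voxels whose 3D coordinates are closer than the threshold R.
        edge_index = radius_graph(pos, r=r, batch=batch)
        # Pseudo-coordinates for the GMM kernels: relative 3D offsets of connected voxels.
        pseudo = pos[edge_index[1]] - pos[edge_index[0]]
        h = torch.relu(self.conv1(feat, edge_index, pseudo))
        h = torch.relu(self.conv2(h, edge_index, pseudo))
        return global_mean_pool(h, batch)  # one global voxel-graph descriptor per sample
```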

Spatial-temporal Transformer for Event Frame Encoding. After a series of data augmentations, each video sample yields $T$ event frames of size $H\times W$. We extract initial CNN features and embed the event frames through a StemNet (ResNet18 [2] in our experiments). After obtaining the initial features, we design an ST-Transformer module to further enhance the representation of spatio-temporal information. The proposed module consists of multi-head self-attention (MSA), MLP, and LayerNorm (LN). As shown in Fig. 2, the $T$ event frames are divided into $N$ patches in the spatial dimension, yielding $T\times N$ tokens. We add learnable position encodings to these tokens and feed them into the ST-Transformer module to fully extract enhanced spatio-temporal features, as shown in Eq. 2 and Eq. 3:

$Y = X^{in} + \mathrm{MSA}(\mathrm{LN}(X^{in}))$   (2)
$X^{out} = Y + \mathrm{MLP}(\mathrm{LN}(Y))$   (3)
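A minimal PyTorch sketch of one ST-Transformer block implementing Eq. 2 and Eq. 3 is given below; the head count and MLP expansion ratio are illustrative choices, not values specified in the paper.

```python
import torch
import torch.nn as nn

class STTransformerBlock(nn.Module):
    """One block of the ST-Transformer: Y = X + MSA(LN(X)); X_out = Y + MLP(LN(Y))."""

    def __init__(self, dim: int = 512, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x):                                  # x: (B, T*N, dim) tokens with position encoding
        h = self.ln1(x)
        y = x + self.msa(h, h, h, need_weights=False)[0]   # Eq. 2
        return y + self.mlp(self.ln2(y))                   # Eq. 3
```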

Bottleneck Transformer. To achieve interaction between the event image and event voxel representations and learn a unified spatio-temporal context representation, we design a Fusion Transformer module and introduce the bottleneck mechanism. Specifically, let $X^{image}\in\mathbb{R}^{T\times N\times d}$ and $X^{voxel}\in\mathbb{R}^{1\times d}$ denote the outputs of the ST-Transformer and GNN modules, respectively. We first collect the $T\times N$ image tokens and $T\times N$ randomly initialized bottleneck tokens together and feed them into the Fusion Transformer, which includes multi-head self-attention (MSA) and MLP sub-modules, i.e.,

$F^{1} = [X^{image}, X^{bottleneck}] \in \mathbb{R}^{2\times T\times N\times d}$   (4)
$\widetilde{F}^{1} = \mathrm{FusionTransformer}(F^{1})$   (5)

We then split $\widetilde{F}^{1}$ into two parts, i.e., the image feature representation $\widetilde{F}^{image}$ and the bottleneck feature representation $\widetilde{F}^{bottleneck}$. The latter is concatenated with $X^{voxel}$ and fed into the Fusion Transformer module for interactive learning of the two representations. Similarly,

$F^{2} = [X^{bottleneck}, X^{voxel}] \in \mathbb{R}^{(T\times N+1)\times d}$   (6)
$\widetilde{F}^{2} = \mathrm{FusionTransformer}(F^{2})$   (7)

Finally, we concatenate $\widetilde{F}^{2}$ and $\widetilde{F}^{image}$ and flatten them into a single feature representation. After that, we utilize a two-layer MLP to output the final class label prediction, as shown in Fig. 2. We adopt the Negative Log-Likelihood (NLL) loss [32] to train the whole network.
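A minimal sketch of this two-stage bottleneck fusion (Eq. 4-7) is shown below. The shared Fusion Transformer is simplified to a single standard encoder layer, and the number of bottleneck tokens defaults to 64, matching the 8 frames × 8 tokens used in our experiments; these simplifications are assumptions, not the exact implementation.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """Fuse image tokens and the voxel feature through learnable bottleneck tokens."""

    def __init__(self, dim: int = 512, num_bottleneck: int = 64, num_heads: int = 8):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(num_bottleneck, dim))  # X^bottleneck
        self.fusion = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)

    def forward(self, x_image, x_voxel):
        # x_image: (B, T*N, d) from the ST-Transformer; x_voxel: (B, 1, d) from the GNN branch.
        B, L, _ = x_image.shape
        btl = self.bottleneck.unsqueeze(0).expand(B, -1, -1)
        # Stage 1 (Eq. 4-5): image tokens interact with the bottleneck tokens.
        f1 = self.fusion(torch.cat([x_image, btl], dim=1))
        f_image, f_btl = f1[:, :L], f1[:, L:]
        # Stage 2 (Eq. 6-7): the updated bottleneck tokens interact with the voxel feature.
        f2 = self.fusion(torch.cat([f_btl, x_voxel], dim=1))
        # Concatenate and flatten; a two-layer MLP classifier follows (not shown).
        return torch.cat([f2, f_image], dim=1).flatten(1)
```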

4 Experiment

4.1 Dataset and Evaluation Metric

In this work, we utilize three datasets, namely ASL-DVS [15], DVS128-Gait-Day [33], and N-MNIST [34], to evaluate our proposed model. A brief introduction to these datasets follows:

\bullet ASL-DVS [15]: This dataset consists of 100,800 samples, with 4,200 samples for each letter. It focuses on the 24 letters representing the handshapes of American Sign Language. Each video in this dataset has a duration of approximately 100 milliseconds. The authors captured these samples using an iniLabs DAVIS240c camera under realistic conditions.

\bullet DVS128-Gait-Day [33]: This dataset is proposed for event-based gait recognition. It contains 4,000 videos corresponding to 20 classes. 20 volunteers were recruited for data collection using a DVS128 Dynamic Vision Sensor (pixel resolution $128\times 128$).

\bullet N-MNIST [34]: This dataset is obtained by recording the display equipment while visualizing the original MNIST images ($28\times 28$ pixels). The ATIS event camera is used for data collection, and each event sample lasts about 10 ms. There are 70,000 event files in this dataset; the training and testing subsets contain 60,000 and 10,000 videos, respectively. The resolution of this dataset is $28\times 28$.

Note that the top-1 and top-5 accuracy are employed as the evaluation metrics throughout our study.
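For reference, a minimal PyTorch sketch of how top-1 and top-5 accuracy can be computed from model logits; the function and variable names are illustrative.

```python
import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, ks=(1, 5)):
    """Return top-k accuracies for each k in `ks`, given (B, num_classes) logits."""
    _, pred = logits.topk(max(ks), dim=1)            # (B, max_k) predicted class indices
    correct = pred.eq(labels.unsqueeze(1))           # (B, max_k) hit mask
    return [correct[:, :k].any(dim=1).float().mean().item() for k in ks]
```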

4.2 Implementation Details

Our proposed dual-stream event-based recognition framework can be trained in an end-to-end manner. The initial learning rate is set to 0.001 and multiplied by 0.1 every 60 epochs. We select eight frames for each video sample and divide each frame into eight tokens. For the constructed voxel graph, the threshold $R$ is set to 2. The scale of the voxel grid is (4, 4, 4) for the ASL-DVS dataset. We select 512 voxels as graph nodes for the structured graph representation learning. Our code is implemented in Python 3.8 and trained on a server with RTX3090 GPUs.
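A minimal sketch of the stated learning-rate schedule (initial rate 0.001, multiplied by 0.1 every 60 epochs) using PyTorch's StepLR; the optimizer choice, the placeholder model, and the total number of epochs are assumptions not specified in the paper.

```python
import torch

model = torch.nn.Linear(512, 24)        # placeholder for the full dual-stream network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=60, gamma=0.1)

for epoch in range(180):
    # ... one training epoch over event images and voxel graphs goes here ...
    optimizer.step()                     # stands in for the per-batch parameter updates
    scheduler.step()                     # lr becomes 1e-4 at epoch 60, 1e-5 at epoch 120
```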

Table 1: Results (top-1 accuracy) on the ASL-DVS [15] dataset.
EST [35]: 0.979 | AMAE [36]: 0.984 | M-LSTM [37]: 0.980 | MVF-Net [38]: 0.971 | EventNet [39]: 0.833
RG-CNNs [15]: 0.901 | EV-VGCNN [40]: 0.983 | VMV-GCN [41]: 0.989 | EV-Gait-3DGraph [9]: 0.738 | Ours: 0.996

4.3 Comparison with Other SOTA Algorithms

As shown in Table 1, previous works already achieve high performance on the ASL-DVS [15] dataset, for example, EST [35] (0.979), AMAE [36] (0.984), M-LSTM [37] (0.980), and MVF-Net [38] (0.971). Note that the GCN-based model VMV-GCN [41] achieves a better result, i.e., 0.989 top-1 accuracy. Thanks to the spatial-temporal feature learning and fusion network proposed in this work, we set a new state-of-the-art performance on this dataset, i.e., 0.996. On the N-MNIST [34] dataset, as shown in Table 2, we also achieve highly competitive performance compared with recent strong models. These comparisons fully validate the effectiveness of our proposed framework for event-based recognition. We provide visualizations of the top-5 recognition results and the feature distribution to better illustrate our results, as shown in Fig. 3.

Table 2: Results (top-1 accuracy, %) on the N-MNIST [34] dataset.
EST [35]: 99.0 | M-LSTM [37]: 98.6 | MVF-Net [38]: 98.1 | Gabor-SNN [42]: 83.7 | EvS-S [27]: 99.1
HATS [42]: 99.1 | EventNet [39]: 75.2 | RG-CNNs [15]: 99.0 | EV-VGCNN [40]: 99.4 | Ours: 98.9

4.4 Ablation Study

To help researchers better understand the proposed method, in this section we conduct comprehensive component analysis experiments on the DVS128-Gait-Day and ASL-DVS datasets to examine the influence of each component on the overall model.

Component Analysis. Table 3 shows the effect of different components on the experimental results. In this part, we do not use the Bottleneck, and the dataset used is DVS128-Gait-Day. Event image only indicates that we only transform the event flow into synchronous event images, which achieves 95.2. Event voxel only indicates that we only employ voxelization to obtain the compressed event representation, which achieves 98.0. We also use the event image and event voxel together when obtaining the event representation, denoted Event Image + Voxel, which achieves 98.7. Comparing the above three cases, we conclude that using the event image and event voxel together achieves higher performance, which reflects the effectiveness of our method.

Figure 3: Visualization of top-5 recognition results and feature distribution on the ASL-DVS dataset.

Effect of Bottleneck. In this paper, we use the Bottleneck Transformer to enhance performance when fusing the two modules. As shown in Table 4, w/o Bottleneck Feature means we do not feed the learnable tokens into the Bottleneck, which yields 98.5. w/o FusionFormer means we do not use the Fusion Transformer before the linear layer, in other words, we use all the components proposed in this paper except the Bottleneck, and the result is 98.3. Compared with these two settings, the result increases after introducing the Bottleneck, which indicates that the Bottleneck is a better choice for our framework. At the same time, a comparison with the results in Table 1 shows that the result drops when there is no Bottleneck, which also demonstrates that the Bottleneck has a positive effect on our proposed model.

Table 3: Ablation study on DVS128-Gait-Day dataset [33].
Index Component Results
1 Event image only 95.2
2 Event voxel only 98.0
3 Event Image + Voxel 98.7
Table 4: Ablation study on ASL-DVS [15].
Index Component Results
1 w/o Bottleneck Feature 98.5
2 w/o FusionFormer 98.3

4.5 Parameter Analysis

The storage size of our proposed model is 220.3 MB, and it takes 16.7 ms to process each video in the ASL-DVS dataset.

5 Conclusion

Previous event-based recognition approaches typically represent event streams as point clouds, voxels, or images, and employ various deep neural networks to learn feature representations. However, these approaches are usually challenged by monotonous modal expressions and the design of the network structure. To overcome these challenges, this paper introduces a novel dual-stream framework for event representation, extraction, and fusion. The proposed framework simultaneously models two common representations: event images and event voxels. By leveraging Transformer and Structured Graph Neural Network (GNN) architectures, spatial information and three-dimensional stereo information can be learned separately. Moreover, the introduction of a bottleneck Transformer facilitates the fusion of the dual-stream information. Extensive experiments were conducted on two widely used event-based classification datasets, and the results demonstrate that our proposed framework achieves state-of-the-art performance. These findings highlight the effectiveness of the dual-stream framework in addressing the limitations of existing approaches and improving accuracy in event-based object recognition.

Acknowledgement: This work is supported by the National Natural Science Foundation of China (No. 62102205).

References

  • [1] M. ul Hassan, “Alexnet imagenet classification with deep convolutional neural networks,” 2018.
  • [2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, 2017.
  • [5] Z. Sun, Q. Ke, H. Rahmani, M. Bennamoun, G. Wang, and J. Liu, “Human action recognition from various data modalities: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3200–3225, 2023.
  • [6] X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” arXiv preprint arXiv:2108.05015, 2021.
  • [7] C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, J. Zhang, Y. Wang, and Y. Tian, “Revisiting color-event based tracking: A unified network, dataset, and metric,” arXiv preprint arXiv:2211.11010, 2022.
  • [8] L. Zhu, X. Wang, Y. Chang, J. Li, T. Huang, and Y. Tian, “Event-based video reconstruction via potential-assisted spiking neural network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3594–3604.
  • [9] Y. Wang, B. Du, Y. Shen, K. Wu, G. Zhao, J. Sun, and H. Wen, “Ev-gait: Event-based robust gait recognition using dynamic vision sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6358–6367.
  • [10] Y. Wang, X. Zhang, Y. Shen, B. Du, G. Zhao, L. C. C. Lizhen, and H. Wen, “Event-stream representation for human gaits identification using deep neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [11] Y. Wang, B. Du, Y. Shen, K. Wu, G. Zhao, J. Sun, and H. Wen, “Ev-gait: Event-based robust gait recognition using dynamic vision sensors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6358–6367.
  • [12] H. Fang, A. Shrestha, Z. Zhao, and Q. Qiu, “Exploiting neuron and synapse filter dynamics in spatial temporal learning of deep spiking neural network,” arXiv preprint arXiv:2003.02944, 2020.
  • [13] W. Fang, Z. Yu, Y. Chen, T. Masquelier, T. Huang, and Y. Tian, “Incorporating learnable membrane time constant to enhance learning of spiking neural networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2661–2671.
  • [14] Y. Bi, A. Chadha, A. Abbas, E. Bourtsoulatze, and Y. Andreopoulos, “Graph-based object classification for neuromorphic vision sensing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 491–501.
  • [15] ——, “Graph-based spatio-temporal feature learning for neuromorphic vision sensing,” IEEE Transactions on Image Processing, vol. 29, pp. 9084–9098, 2020.
  • [16] Y. Wang, X. Zhang, Y. Shen, B. Du, G. Zhao, L. Cui, and H. Wen, “Event-stream representation for human gaits identification using deep neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3436–3449, 2021.
  • [17] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, “Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing,” in 2015 International Joint Conference on Neural Networks (IJCNN).   IEEE, 2015, pp. 1–8.
  • [18] N. Perez-Nieves and D. Goodman, “Sparse spiking gradient descent,” Advances in Neural Information Processing Systems, vol. 34, pp. 11795–11808, 2021.
  • [19] W. Fang, Z. Yu, Y. Chen, T. Huang, T. Masquelier, and Y. Tian, “Deep residual learning in spiking neural networks,” Advances in Neural Information Processing Systems, vol. 34, pp. 21056–21069, 2021.
  • [20] X. Wang, Z. Wu, Y. Rong, L. Zhu, B. Jiang, J. Tang, and Y. Tian, “Sstformer: Bridging spiking neural network and memory support transformer for frame-event based recognition,” arXiv preprint arXiv:2308.04369, 2023.
  • [21] B. Jiang, C. Yuan, X. Wang, Z. Bao, L. Zhu, and B. Luo, “Point-voxel absorbing graph representation learning for event stream based recognition,” arXiv preprint arXiv:2306.05239, 2023.
  • [22] Q. Wang, Y. Zhang, J. Yuan, and Y. Lu, “Space-time event clouds for gesture recognition: From rgb cameras to event cameras,” in 2019 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2019, pp. 1826–1835.
  • [23] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 652–660.
  • [24] B. Xie, Y. Deng, Z. Shao, H. Liu, and Y. Li, “Vmv-gcn: Volumetric multi-view based graph cnn for event stream classification,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1976–1983, 2022.
  • [25] Z. Li, M. S. Asif, and Z. Ma, “Event transformer,” arXiv preprint arXiv:2204.05172, 2022.
  • [26] S. Schaefer, D. Gehrig, and D. Scaramuzza, “Aegnn: Asynchronous event-based graph neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12371–12381.
  • [27] Y. Li, H. Zhou, B. Yang, Y. Zhang, Z. Cui, H. Bao, and G. Zhang, “Graph-based asynchronous event processing for rapid object recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 934–943.
  • [28] A. Srinivas, T.-Y. Lin, N. Parmar, J. Shlens, P. Abbeel, and A. Vaswani, “Bottleneck transformers for visual recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16519–16529.
  • [29] S. Li, X. Jin, Y. Xuan, X. Zhou, W. Chen, Y.-X. Wang, and X. Yan, “Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting,” Advances in neural information processing systems, vol. 32, 2019.
  • [30] A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun, “Attention bottlenecks for multimodal fusion,” Advances in Neural Information Processing Systems, vol. 34, pp. 14200–14213, 2021.
  • [31] R. Song, Y. Feng, W. Cheng, Z. Mu, and X. Wang, “Bs2t: Bottleneck spatial–spectral transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2022.
  • [32] L. J. Miranda, “Understanding softmax and the negative log-likelihood,” ljvmiranda921.github.io, 2017.
  • [33] Y. Wang, X. Zhang, Y. Shen, B. Du, G. Zhao, L. C. C. Lizhen, and H. Wen, “Event-stream representation for human gaits identification using deep neural networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [34] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, “Converting static image datasets to spiking neuromorphic datasets using saccades,” Frontiers in neuroscience, vol. 9, p. 437, 2015.
  • [35] D. Gehrig, A. Loquercio, K. G. Derpanis, and D. Scaramuzza, “End-to-end learning of representations for asynchronous event-based data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5633–5643.
  • [36] Y. Deng, Y. Li, and H. Chen, “Amae: Adaptive motion-agnostic encoder for event-based object classification,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4596–4603, 2020.
  • [37] M. Cannici, M. Ciccone, A. Romanoni, and M. Matteucci, “A differentiable recurrent surface for asynchronous event-based data,” in European Conference on Computer Vision.   Springer, 2020, pp. 136–152.
  • [38] Y. Deng, H. Chen, and Y. Li, “Mvf-net: A multi-view fusion network for event-based object classification,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [39] Y. Sekikawa, K. Hara, and H. Saito, “Eventnet: Asynchronous recursive event processing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3887–3896.
  • [40] Y. Deng, H. Chen, H. Chen, and Y. Li, “Evvgcnn: A voxel graph cnn for event-based object classification,” arXiv preprint arXiv:2106.00216, vol. 1, no. 2, p. 6, 2021.
  • [41] B. Xie, Y. Deng, Z. Shao, H. Liu, and Y. Li, “Vmv-gcn: Volumetric multi-view based graph cnn for event stream classification,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1976–1983, 2022.
  • [42] A. Sironi, M. Brambilla, N. Bourdis, X. Lagorce, and R. Benosman, “Hats: Histograms of averaged time surfaces for robust event-based object classification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1731–1740.