
Spiking Tucker Fusion Transformer for Audio-Visual Zero-Shot Learning

Wenrui Li, Penghong Wang, Ruiqin Xiong,  Xiaopeng Fan This work was supported in part by the National Key R&D Program of China (2021YFF0900500), and the National Natural Science Foundation of China (NSFC) under grants 62441202, U22B2035. (Corresponding author: Xiaopeng Fan.)Wenrui Li, Penghong Wang and Xiaopeng Fan are with the Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China. (e-mail: [email protected]; [email protected]; [email protected]).Ruiqin Xiong is with the School of Electronic Engineering and Computer Science, Institute of Digital Media, Peking University, Beijing 100871, China (e-mail: [email protected])
Abstract

Spiking neural networks (SNNs), which efficiently encode temporal sequences, have shown great potential for extracting audio-visual joint feature representations. However, coupling SNNs (binary spike sequences) with Transformers (floating-point sequences) to jointly explore temporal-semantic information still faces challenges. In this paper, we introduce a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning (ZSL). The STFT leverages temporal and semantic information from different time steps to generate robust representations. A time-step factor (TSF) is introduced to dynamically synthesize the subsequent inference information. To guide the formation of input membrane potentials and reduce spike noise, we propose a global-local pooling (GLP) that combines max and average pooling operations. Furthermore, the thresholds of the spiking neurons are dynamically adjusted based on semantic and temporal cues. Integrating the temporal and semantic information extracted by SNNs and Transformers is difficult, as a straightforward bilinear model leads to a large increase in the number of parameters. To address this, we introduce a temporal-semantic Tucker fusion module, which achieves multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. Our experimental results demonstrate the effectiveness of the proposed approach in achieving state-of-the-art performance on three benchmark datasets. The harmonic mean (HM) improvements on VGGSound, UCF101, and ActivityNet are around 15.4%, 3.9%, and 14.9%, respectively.

Index Terms:
Audio-visual zero-shot learning, spiking neural network, low-rank approximation.

I Introduction

The task of audio-visual zero-shot learning (ZSL) [1, 2, 3] aims to classify objects or scenes by utilizing both audio and visual modalities, even when labeled data is not available. Conventional supervised audio-visual approaches are trained with a large number of labeled instances for each class. To address the constraints of traditional supervised audio-visual methods, the generalized zero-shot learning (GZSL) setting has been proposed [4, 5]. GZSL methods permit models to identify and classify instances from both seen and unseen classes, thereby facilitating more practical and scalable solutions for audio-visual classification and recognition tasks.

Figure 1: Illustration of our proposed STFT for audio-visual GZSL. The SNN uses the time-step factor to dynamically synthesize the output temporal information. The audio and visual encoders use the latent knowledge combiner to explore semantic information with latent cues. After temporal-semantic Tucker fusion, the fused features are further reasoned over by the cross-modal transformer. Information from seen training classes can be transferred to unseen test classes through textual embeddings.

To obtain more robust audio-visual feature representations, most existing methods model and align the temporal and semantic features of the input separately. CJME [6] projects the audio-visual and textual modalities into a shared space and calculates their similarity, clustering features of the same category in the shared space using a triplet loss. Mercea et al. [2] introduce a lightweight processing framework that achieves excellent results by utilizing cross-attention to exchange information between the audio and visual modalities. TCaF [3] preprocesses temporal information and verifies its importance in the interaction of audio-visual modalities. Spiking Neural Networks (SNNs) provide significant advantages for audio-visual representation. Firstly, they efficiently encode temporal information by mimicking the spike timing of biological neurons, allowing precise modeling of dynamic events over time. This spike-timing-dependent plasticity (STDP) enables SNNs to capture fine-grained temporal patterns crucial for understanding complex audio-visual data. Secondly, SNNs offer high stability and robustness to noise, making them resilient to variations and disturbances in real-world data. This robustness is particularly beneficial in scenarios where audio and visual inputs are degraded or incomplete. Thirdly, integrating SNNs with Transformers enhances the extraction of both temporal and semantic features: SNNs manage precise timing aspects, while Transformers excel at capturing contextual relationships, resulting in a comprehensive multimodal feature representation. This combination has demonstrated state-of-the-art performance in tasks such as audio-visual zero-shot classification, as shown by Li et al. [5]. The aforementioned studies have demonstrated the powerful potential of SNNs in audio-visual joint learning. However, efficiently coupling SNNs with Transformers still faces the following challenges:

1) Time Steps: Currently, most SNNs obtain the final output by averaging the outputs of neurons over a fixed number of time steps. These approaches not only overlook the importance of various layers in encoding temporal sequences but also cause significant fluctuations in SNN performance.

2) Spiking Redundancy: SNN outputs exhibit redundancy, with noise spikes present in both the temporal and spatial dimensions that are highly correlated with spike firing frequency and neuron position. Finding a balance between spiking neuron firing frequency and accuracy is crucial for reducing the redundancy of SNNs.

3) Output Heterogeneity: There is a significant difference in the output data distributions of SNNs and Transformers, which are binary spike sequences and floating-point features, respectively. Efficiently integrating features from these different data distributions is important to unleash the potential of SNNs.

To address the aforementioned challenges, we propose a new Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning, as illustrated in Fig. 1. Firstly, we introduce the time-step factors (TSF), which dynamically measure the significance of each time step in influencing the SNN's output. By efficiently utilizing the outputs from different time steps, these importance factors guide the synthesis of subsequent inference information. Additionally, we propose a global-local pooling (GLP) that combines the max and average pooling operations to guide the formation of the input membrane potential. The thresholds of the spiking neurons are adjusted dynamically based on semantic and temporal information cues. This helps reduce the generation of spike noise and improves the model's robustness. In terms of integrating the temporal information extracted by SNNs and the semantic information extracted by Transformers, a straightforward approach is to use a bilinear model for complete second-order interaction. However, this can lead to a significant increase in the number of parameters. We introduce a temporal-semantic Tucker fusion module to deal with this challenge. This module achieves multi-scale fusion of SNN and Transformer outputs at a very low cost while maintaining full second-order interactions. We also show qualitative comparison results with the recent SOTA method MDFT at the bottom of Fig. 1. In sports classes with frequent changes in motion information, STFT demonstrates superiority over MDFT due to its lower spiking redundancy.

To sum up, our proposed STFT aims to address the challenges of time steps, spiking redundancy, and output heterogeneity in coupling SNNs and Transformers, enabling efficient fusion and interaction between temporal and semantic information. The main contributions of this paper are as follows:

  • We propose a novel Spiking Tucker Fusion Transformer (STFT) for audio-visual zero-shot learning. STFT efficiently couples SNNs with Transformers and combines the temporal and semantic information from different time steps to form robust representations.

  • The temporal-semantic tucker fusion is proposed to achieve multi-scale fusion of SNN and Transformer outputs while maintaining full second-order interactions. This module effectively integrates the temporal and semantic information, providing a comprehensive representation for audio-visual data.

  • To reduce the spike noise, we dynamically adjust the thresholds of the spiking neurons based on semantic and temporal information. The GLP is proposed to guide the formation of input membrane potentials based on their global and local characteristics.

Extensive experimental results show that STFT outperforms state-of-the-art methods. The ablation study also demonstrates the effectiveness of each key component of our proposed model.

II Related Work

II-A Audio-Visual Zero-Shot Learning

With the development of deep learning [7, 8, 9, 10, 11] in recent years, audio-visual zero-shot learning has gained significant attention due to its potential applications in various domains such as violence detection [12], aerial scene recognition [13], speech recognition [14, 15] and video classification [16, 17, 18, 19, 20, 21, 22, 23, 24, 25]. MARBLE [26] provides a comprehensive benchmark for evaluating AI in music understanding, addressing the need for deep music representations, large-scale datasets, and a universal community-driven standard. IcoCap [27] improves video captioning by using easily-learned image semantics to diversify video content, helping captioners focus on relevant information. Finsta [28] enhances video-language models by using fine-grained spatio-temporal alignment with scene graph structures, improving performance on various tasks without needing to retrain from scratch. Chen et al. [29] propose Co-Meta Learning, which improves self-supervised speaker verification by leveraging complementary audio-visual information and updating network parameters using a disagreement strategy and meta learning. MAViL [30] uses three forms of self-supervision to learn audio-visual representations, achieving state-of-the-art performance in audio-video classification and enhancing both multimodal and unimodal tasks. SEEG [31] generates semantic-aware gestures by decoupling semantic-irrelevant information and leveraging semantic learning, outperforming other methods in semantic expressiveness and evaluations on various benchmarks. Hong et al. [4] incorporate a novel loss function that aligns video and audio features in the hyperbolic space, along with exploring the use of multiple adaptive curvatures for hyperbolic projections. Gowda et al. [32] discuss the issue of invalidation of the zero-shot setting in action recognition due to the overlap between classes in the pre-training and the evaluation datasets. They also highlight similar issues in few-shot action recognition and provide their splits for future evaluation in the field. Narayan et al. [33] incorporate a feedback loop from a semantic embedding decoder to refine the generated features iteratively. These synthesized features, along with their corresponding latent embeddings, are then transformed into discriminative features and utilized for classification, reducing ambiguities between categories. Wu et al. [34] propose the Dual Attention Matching (DAM) module, which spans longer video durations to better model high-level event information and captures local temporal details using a global cross-check mechanism. MA [35] focuses on improving weakly-supervised audio-visual video parsing by using cross-modal correspondence and contrastive learning to generate reliable event labels and address audio-visual asynchrony. Wu et al. [36] propose the switchable LSTM framework to manage the generation and retrieval of nouns from external knowledge. Our proposed knowledge slots are primarily used for cross-modal fusion and semantic reasoning across different types of data (audio and visual). In contrast, the external knowledge in [36] is specifically tailored for enhancing language models by incorporating external visual knowledge for novel object captioning. Yang et al. [37] provide a comprehensive framework for multiple knowledge representations, which is crucial for understanding the integration of different modalities in ZSL. Yan et al. [38] introduce a semantics-guided approach for zero-shot learning, which aligns closely with our temporal-semantic integration. Li et al. [5, 39] first demonstrate the potential of SNNs in audio-visual zero-shot learning. By efficiently extracting temporal information from different modalities using SNNs, they achieve significant improvements in ZSL. However, due to the output heterogeneity between the SNN and the Transformer, their model's performance on seen classes tends to decline. Therefore, how to alleviate this challenge is key to unleashing the potential of audio-visual ZSL.

II-B Spiking Neural Network

Spiking Neural Networks (SNNs) are biologically inspired models with time-evolving states [40, 41]. Unlike traditional neural networks that use continuous-valued activations, SNNs communicate through discrete spikes, which are analogous to action potentials in biological neurons. Each neuron in an SNN receives input from neighboring neurons and generates a spike when the combined signals surpass a certain threshold. The precise timing of these spikes is crucial, as it corresponds to the timing of action potentials in biological neurons. Spikes are transmitted between neurons through synapses, whose weights determine the strength of the connections. The correlation between pre- and post-synaptic spikes is commonly used to train an SNN by adjusting the synaptic weights. In SNNs, information is encoded in the precise timing of spikes, allowing them to capture the temporal dynamics of data. The timing of spikes carries important information about the input, and the interactions between neurons are determined by the arrival times of these spikes. Currently, a significant number of researchers have been studying the intrinsic nature of SNNs, including attention mechanisms [42, 43, 44], deep SNNs [45, 46, 47, 48], and simulations of biological visual pathways [49]. In addition to these investigations, SNNs have found wide-ranging applications in various fields, such as image classification [50, 51], speech recognition [52, 53], object detection [54, 55], and multimedia learning [56, 5].

TABLE I: Key Notations and Descriptions
Notation Description
(G)ZSL: (Generalized) Zero-Shot Learning
STFT: Spiking Tucker Fusion Transformer
TSF: Time-Step Factor
GLP: Global-Local Pooling
HM: Harmonic Mean
SNN: Spiking Neural Network
$\boldsymbol{a}_{i}\in\mathbb{R}^{a_{\text{in}}\times h_{\text{emb}}}$: Audio feature vector for sample $i$
$\boldsymbol{v}_{i}\in\mathbb{R}^{v_{\text{in}}\times h_{\text{emb}}}$: Visual feature vector for sample $i$
$\boldsymbol{t}_{i}\in\mathbb{R}^{h_{\text{emb}}}$: Textual embedding for sample $i$
$E_{a},E_{v}$: Audio encoder, visual encoder
$\boldsymbol{A}_{t}\in\mathbb{R}^{a_{\text{in}}\times h_{\text{emb}}}$: Output of the audio encoder
$\boldsymbol{V}_{t}\in\mathbb{R}^{v_{\text{in}}\times h_{\text{emb}}}$: Output of the visual encoder
$\boldsymbol{R}_{a},\boldsymbol{R}_{v}\in\mathbb{R}^{h_{\text{emb}}\times h_{\text{emb}}}$: Audio and visual latent semantic representations
$\boldsymbol{S}_{c},\boldsymbol{\mathcal{G}}$: Combined spiking output and Tucker core tensor
$\boldsymbol{K}_{t}\in\mathbb{R}^{h_{\text{emb}}\times h_{\text{emb}}}$: Latent knowledge slots
$\boldsymbol{P}_{a},\boldsymbol{P}_{v}$: Projections of audio and visual features
$\boldsymbol{\mathcal{T}}_{a}\in\mathbb{R}^{d_{as}\times d_{at}\times K_{a}}$: Audio tensor in Tucker decomposition
$\boldsymbol{\mathcal{T}}_{v}\in\mathbb{R}^{d_{vs}\times d_{vt}\times K_{v}}$: Visual tensor in Tucker decomposition
$\boldsymbol{U}^{(s)}\in\mathbb{R}^{d_{s}\times n_{a}}$: Factor matrices for spatial dimensions
$\boldsymbol{U}^{(t)}\in\mathbb{R}^{d_{t}\times n_{v}}$: Factor matrices for temporal dimensions
$\boldsymbol{U}^{(k)}\in\mathbb{R}^{K\times n_{k}}$: Factor matrices for latent dimensions

III Our Method

The architecture of the STFT is illustrated in Fig. 2 which consists of four primary components: spatial-temporal SNN, latent semantic reasoning module, temporal-semantic tucker fusion and joint reasoning module.

In the training phase, the seen-class training set, denoted as $\boldsymbol{\mathcal{X}}=\{(\boldsymbol{a}_{i}^{x},\boldsymbol{v}_{i}^{x},\boldsymbol{t}_{i}^{x})\}$, consists of $N$ samples. Here, $\boldsymbol{a}_{i}^{x}$ represents the audio feature, $\boldsymbol{v}_{i}^{x}$ the visual feature, and $\boldsymbol{t}_{i}^{x}$ the textual embedding of the corresponding ground-truth class. The goal of STFT is to learn a projection function $f(\boldsymbol{a}_{i}^{x},\boldsymbol{v}_{i}^{x})\mapsto\boldsymbol{g}_{j}^{x}$, where $\boldsymbol{g}_{j}^{x}$ represents the class-level textual embedding for class $j$. This projection function is learned using the seen-class training set $\boldsymbol{\mathcal{X}}$. In the testing phase, the unseen testing set $\boldsymbol{\mathcal{Y}}=\{(\boldsymbol{a}_{i}^{y},\boldsymbol{v}_{i}^{y},\boldsymbol{t}_{i}^{y})\}$ is projected with the same function, $f(\boldsymbol{a}_{i}^{y},\boldsymbol{v}_{i}^{y})\mapsto\boldsymbol{g}_{j}^{y}$. Overall, STFT aims to learn a projection that maps audio and visual features to class-level textual embeddings, allowing unseen testing samples to be projected into the same embedding space. Table I lists the notations and their descriptions in detail.

III-A Latent Semantic Information Modeling

III-A1 Audio and visual encoder

We employ the pre-trained SeLaVi model [57] to accurately and effectively extract audio and visual features, as described in [2]. To further investigate the connections between contextual semantic information, we introduce audio and visual encoders, denoted as $E_{a}$ and $E_{v}$. The outputs of the audio and visual encoders can be written as $\boldsymbol{\mathcal{A}}^{t}=E_{a}(\boldsymbol{\mathcal{X}}_{a})$ and $\boldsymbol{\mathcal{V}}^{t}=E_{v}(\boldsymbol{\mathcal{X}}_{v})$, where $\boldsymbol{\mathcal{A}}^{t}\in\mathbb{R}^{a_{in}\times h_{emb}}$ and $\boldsymbol{\mathcal{V}}^{t}\in\mathbb{R}^{v_{in}\times h_{emb}}$. Each modality encoder consists of two linear layers, $f_{1}^{s}$ and $f_{2}^{s}$ for $s\in\{\boldsymbol{a}_{t},\boldsymbol{v}_{t}\}$, with $f_{1}^{s}:\mathbb{R}^{s_{in}\times h_{in}}\rightarrow\mathbb{R}^{s_{in}\times h_{hid}}$ and $f_{2}^{s}:\mathbb{R}^{s_{in}\times h_{hid}}\rightarrow\mathbb{R}^{s_{in}\times h_{emb}}$. Each linear layer is followed by batch normalization, a ReLU activation function, and dropout with rate $d_{enc}$.
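
To make the encoder concrete, here is a minimal PyTorch sketch of one modality encoder with the layer sizes and dropout described above. The class name `ModalityEncoder`, the exact normalization placement, and the assumption that inputs are flattened to shape (batch, $h_{in}$) are illustrative, not the authors' released implementation.

```python
import torch.nn as nn

class ModalityEncoder(nn.Sequential):
    """Two linear layers, each followed by BatchNorm, ReLU and dropout,
    mapping h_in -> h_hid -> h_emb as described in Sec. III-A1.
    Inputs are assumed to be flattened to shape (batch, h_in)."""
    def __init__(self, h_in=512, h_hid=512, h_emb=512, d_enc=0.20):
        super().__init__(
            nn.Linear(h_in, h_hid),
            nn.BatchNorm1d(h_hid),
            nn.ReLU(inplace=True),
            nn.Dropout(d_enc),
            nn.Linear(h_hid, h_emb),
            nn.BatchNorm1d(h_emb),
            nn.ReLU(inplace=True),
            nn.Dropout(d_enc),
        )
```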

III-A2 Latent semantic reasoning module

To better explore the potential relationships between semantic features within different modalities, we introduce the latent semantic reasoning module. We have observed that semantic features in audio and visual data are correlated across different temporal dimensions. Therefore, we propose the Latent Knowledge Combiner (LKC) to dynamically update the latent semantic features of the two modalities, optimizing the feature representations of each modality. The LKC assists in exploring and aligning latent cross-modal relationships, enabling the extraction of more robust multimodal feature representations.

The LKC maintains a set of latent knowledge slots denoted as $\boldsymbol{K}=\{\boldsymbol{K}_{1},\boldsymbol{K}_{2},\ldots,\boldsymbol{K}_{n}\}$. These knowledge slots represent the latent semantic features shared between the two modalities. The LKC is illustrated in the purple area of Fig. 2. It computes the importance of each latent knowledge slot based on the input vectors and effectively combines them. Mathematically, this process can be expressed as follows:

\boldsymbol{K}_{oa}=\sum_{i=1}^{k}\phi(\boldsymbol{K}_{i}\boldsymbol{\mathcal{A}}_{t})\boldsymbol{\mathcal{A}}_{t},\quad\boldsymbol{K}_{ov}=\sum_{i=1}^{k}\phi(\boldsymbol{K}_{i}\boldsymbol{\mathcal{V}}_{t})\boldsymbol{\mathcal{V}}_{t},  (1)

where $\phi(x)=1/(1+e^{-x})$ is the sigmoid function. We propose a gate function to selectively retain the fused features, defined as:

\boldsymbol{P}_{a}=\mathrm{ReLU}(\boldsymbol{W}_{oa}\boldsymbol{K}_{oa}+b_{oa}),  (2)
\boldsymbol{P}_{v}=\mathrm{ReLU}(\boldsymbol{W}_{ov}\boldsymbol{K}_{ov}+b_{ov}),

where $\boldsymbol{W}_{oa}\in\mathbb{R}^{h_{emb}\times h_{emb}}$ and $\boldsymbol{W}_{ov}\in\mathbb{R}^{h_{emb}\times h_{emb}}$ are learnable weight matrices, and $b_{oa}$ and $b_{ov}$ are bias terms. The latent knowledge is updated sequentially to connect the features of different modalities with the previous latent knowledge slots $\boldsymbol{K}_{t-1}\in\mathbb{R}^{h_{emb}\times h_{emb}}$. The latent knowledge slots are updated as follows:

\boldsymbol{K}_{t}=\alpha(\boldsymbol{P}_{a}\boldsymbol{K}_{oa}+\boldsymbol{P}_{v}\boldsymbol{K}_{ov})+(1-\alpha)\boldsymbol{K}_{t-1},  (3)

where $\alpha$ is a learnable coefficient that dynamically adjusts the formation of the latent knowledge. A self-attention function is employed to further infer the inherent relationship between the audio and visual features using the latent knowledge. Formally, the outputs of the latent semantic reasoning module, $\boldsymbol{R}_{a}^{t}\in\mathbb{R}^{h_{emb}\times h_{emb}}$ and $\boldsymbol{R}_{v}^{t}\in\mathbb{R}^{h_{emb}\times h_{emb}}$, are defined as:

\boldsymbol{R}_{a}^{t}=\mathrm{MLP}(\mathrm{LN}(\mathrm{SA}(\boldsymbol{K}_{oa}^{t})))+\mathrm{SA}(\boldsymbol{K}_{oa}^{t}),  (4)
\boldsymbol{R}_{v}^{t}=\mathrm{MLP}(\mathrm{LN}(\mathrm{SA}(\boldsymbol{K}_{ov}^{t})))+\mathrm{SA}(\boldsymbol{K}_{ov}^{t}),

where $\mathrm{SA}(\cdot)$ denotes the self-attention function, $\mathrm{LN}(\cdot)$ denotes layer normalization, and $\mathrm{MLP}(\cdot)$ denotes a multi-layer perceptron.
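
Before the self-attention step of Eq. (4), the LKC update of Eqs. (1)-(3) can be sketched as follows. The element-wise gating in Eqs. (2)-(3), the slot initialization, and the class and variable names (`LatentKnowledgeCombiner`, `n_slots`) are our reading of the text and purely illustrative.

```python
import torch
import torch.nn as nn

class LatentKnowledgeCombiner(nn.Module):
    """Sketch of Eqs. (1)-(3): score each latent knowledge slot with a sigmoid,
    combine the slots with the modality features, gate the result, and update
    the shared slots with a learnable mixing coefficient alpha."""
    def __init__(self, n_slots=3, h_emb=512):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(n_slots, h_emb, h_emb) * 0.02)
        self.gate_a = nn.Linear(h_emb, h_emb)   # W_oa, b_oa in Eq. (2)
        self.gate_v = nn.Linear(h_emb, h_emb)   # W_ov, b_ov in Eq. (2)
        self.alpha = nn.Parameter(torch.tensor(0.5))

    def forward(self, A_t, V_t, K_prev):
        # Eq. (1): importance-weighted combination of the slots with each modality.
        K_oa = sum(torch.sigmoid(K_i @ A_t) * A_t for K_i in self.slots)
        K_ov = sum(torch.sigmoid(K_i @ V_t) * V_t for K_i in self.slots)
        # Eq. (2): gates deciding how much of the fused features to retain.
        P_a = torch.relu(self.gate_a(K_oa))
        P_v = torch.relu(self.gate_v(K_ov))
        # Eq. (3): sequential update of the latent knowledge slots.
        K_t = self.alpha * (P_a * K_oa + P_v * K_ov) + (1 - self.alpha) * K_prev
        return K_oa, K_ov, K_t
```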

Figure 2: The overall architecture of STFT. The SNN thresholds are adjusted dynamically based on semantic and temporal information cues. The spatial-temporal SNN uses the GLP to refine the input features and the time-step factor to optimize the final output. The latent knowledge slots $\boldsymbol{K}_{t}$ explore and align the latent semantic relationships between different modalities. The cross-modal transformers in the joint reasoning module share weights.

III-B Spatial-Temporal SNN

Unlike existing SNNs in the field of multimodal learning, we have specifically designed our SNN for the audio-visual domain. Firstly, we recognize the importance of temporal encoding by leveraging the information from different time steps in the SNN. We propose a time step factor (TSF) to dynamically fuse the outputs from different time steps. Additionally, to reduce the spiking noise in the SNN output and enhance the model’s robustness, we introduce a global-local pooling (GLP) to improve the overall performance and stability of the SNN by combining the max and average pooling operations.

Our SNN network consists of three convolution SNN blocks, each comprising a convolution operation layer followed by a LIF-based layer [58]. The LIF model consists of an integration phase, where the neuron accumulates input currents, and a firing phase, where the neuron generates a spike and resets its membrane potential. Specifically, the dynamics of a LIF neuron can be described by the following equation:

\tau_{m}\frac{d\boldsymbol{V}(t)}{dt}=-\boldsymbol{V}(t)+R\boldsymbol{I}(t),  (5)

where $\tau_{m}$ is the membrane time constant, $\boldsymbol{V}(t)$ is the membrane potential at time $t$, $R$ is the membrane resistance, and $\boldsymbol{I}(t)$ is the input current at time $t$. To compute the input of the $i$-th LIF neuron, $\boldsymbol{I}_{i}(t)$, we apply a convolution operation and batch normalization to the output of the previous layer $\boldsymbol{P}(t)$:

\boldsymbol{I}_{i}(t)=\mathrm{BN}(\mathrm{CONV}(\boldsymbol{W}_{p},\boldsymbol{P}(t))),  (6)

where $\boldsymbol{W}_{p}$ represents the weight matrix, $\mathrm{BN}(\cdot)$ represents batch normalization, and $\mathrm{CONV}(\cdot)$ represents the convolution operation. When the membrane potential reaches the threshold value $v_{th}$, the neuron generates a spike and the membrane potential is reset to the reset potential $V_{rest}$.
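
For reference, a discrete-time LIF update corresponding to Eqs. (5)-(6) can be sketched as below; the Euler discretization, decay constant, and hard-reset rule are common choices and may differ from the exact LIF layer of [58].

```python
import torch

def lif_step(v, x, v_th, tau_m=2.0, v_reset=0.0):
    """One discrete LIF step: leak the membrane potential, integrate the input
    current x (the batch-normalized convolution output of Eq. (6)), emit a
    binary spike where the threshold v_th is crossed, and reset fired neurons.
    A sketch of Eq. (5) under Euler discretization."""
    v = v + (x - v) / tau_m                     # leaky integration (R folded into x)
    spike = (v >= v_th).float()                 # Heaviside firing
    v = torch.where(spike.bool(), torch.full_like(v, v_reset), v)
    return spike, v
```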

To optimize the distribution of input features before processing by the LIF neurons, we propose the GLP as shown in Fig. 3. The max pooling operation captures the global maximum value of the input features, which represents the overall variation in the input distribution. The average pooling operation calculates the average value of the input features, which highlights the importance of local salient regions. By combining the outputs of max and average pooling, the GLP module provides guidance for generating the input features based on the global and local characteristics. The overall process can be written as follows:

\boldsymbol{P}_{all}=\frac{1}{2}(\boldsymbol{P}_{max}+\boldsymbol{P}_{avg})+\beta\boldsymbol{P}_{max}+(1-\beta)\boldsymbol{P}_{avg},  (7)
\hat{\boldsymbol{I}}_{i}(t)=\phi(\boldsymbol{P}_{all}\boldsymbol{I}_{i}(t)+\boldsymbol{I}_{i}(t)),

where $\boldsymbol{P}_{max}$ and $\boldsymbol{P}_{avg}$ correspond to the max- and average-pooled features, and $\beta$ is a learnable coefficient. Indeed, the relationship between the output of the SNN and the corresponding time steps is crucial: effectively utilizing the outputs from different time steps can significantly influence the final performance. A common method is to assign equal weights to each time step and compute the average output across all time steps to obtain the final result. However, this method overlooks the diversity among different time steps. To deal with this, we propose the TSF, which dynamically adjusts the weights of the SNN outputs at different time steps. By considering the importance of each time step, the SNN can effectively capture the temporal dynamics and encode relevant information at different time scales. The final output of the spatial-temporal SNN can be summarized by the following equation:

\boldsymbol{S}_{c}=\sum_{t=1}^{T}\max\!\left(\frac{e^{\boldsymbol{I}_{i}(t)}}{\sum_{j=1}^{T}e^{\boldsymbol{I}_{i}(j)}}\right)\boldsymbol{I}_{i}(t),  (8)

where $\max(\cdot)$ returns the largest element of its input and $c\in\{a_{t},v_{t}\}$. This dynamic adjustment of weights allows the SNN to adaptively emphasize the contributions of different time steps based on their significance, leading to a more fine-grained representation of temporal sequences.
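
The GLP gating of Eq. (7) and the time-step weighting of Eq. (8) can be sketched as follows. The pooling axes, the batched [N, C, H, W] layout, and the reduction used for the max term are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def global_local_pool(I_t, beta=0.5):
    """Eq. (7): mix global max pooling and average pooling into P_all, then use
    it as a sigmoid gate on the input current I_t of shape [N, C, H, W]."""
    p_max = F.adaptive_max_pool2d(I_t, 1)       # global maximum response
    p_avg = F.adaptive_avg_pool2d(I_t, 1)       # local/average saliency response
    p_all = 0.5 * (p_max + p_avg) + beta * p_max + (1 - beta) * p_avg
    return torch.sigmoid(p_all * I_t + I_t)

def time_step_fusion(currents):
    """Eq. (8): weight each time step by the maximum of its softmax-normalized
    response over time and sum, rather than plain averaging. `currents` is a
    list of per-step tensors I(1), ..., I(T)."""
    stacked = torch.stack(currents)                 # [T, ...]
    weights = torch.softmax(stacked, dim=0)         # softmax across time steps
    w_max = weights.flatten(1).max(dim=1).values    # the max(.) term, one scalar per step
    return sum(w * x for w, x in zip(w_max, stacked))
```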

To reduce the spiking noise, we dynamically adjust the thresholds of the LIF neurons based on the current output of the SNN and the pooling matrix. Specifically, we use the entropy of the current SNN output to represent the amount of information contained in the features. If the information content is rich, it indicates a more informative scene, and we need to increase the threshold to suppress spike noise. The threshold adjustment for the audio and visual modalities of the SNN can be expressed as follows:

V_{th/aud}^{t}=\left(\phi(\boldsymbol{P}_{all})+\mathcal{N}(\boldsymbol{S}_{a}^{t})\log(\mathcal{N}(\boldsymbol{S}_{a}))\right)V_{th/aud}^{t-1},  (9)
V_{th/vis}^{t}=\left(\phi(\boldsymbol{P}_{all})+\mathcal{N}(\boldsymbol{S}_{v}^{t})\log(\mathcal{N}(\boldsymbol{S}_{v}))\right)V_{th/vis}^{t-1},

where $\mathcal{N}(\cdot)$ represents the normalization operation.
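
A sketch of the dynamic threshold update in Eq. (9) for one modality is given below; taking $\mathcal{N}(\cdot)$ as a softmax over elements and reducing the pooled gate by its mean are our assumptions, not details stated in the paper.

```python
import torch

def update_threshold(v_th_prev, p_all, s_step, s_all, eps=1e-8):
    """Eq. (9): scale the previous threshold by a sigmoid gate on the pooled
    features plus an entropy-style term of the normalized spiking outputs, so
    that information-rich inputs adjust the threshold and suppress spike noise."""
    n_step = torch.softmax(s_step.flatten(), dim=0)     # N(S^t)
    n_all = torch.softmax(s_all.flatten(), dim=0)       # N(S)
    info = (n_step * torch.log(n_all + eps)).sum()      # entropy-style information term
    return (torch.sigmoid(p_all).mean() + info) * v_th_prev
```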

III-C Temporal-Semantic Tucker Fusion

In this paper, we utilize the spatial-temporal SNN and the latent semantic reasoning (LSR) module to extract temporal and semantic information, respectively. However, these two types of network outputs have significantly different data distributions: one is a binary sequence, while the other is a floating-point feature. This poses a challenge for effectively fusing the outputs while preserving their complex, high-level interactions. A powerful recently proposed solution for feature fusion is bilinear modeling, which can encode fully parameterized bilinear interactions. First, the semantic and temporal features of each modality are projected into embedding vectors $\boldsymbol{R}_{a}\in\mathbb{R}^{d_{as}}$, $\boldsymbol{R}_{v}\in\mathbb{R}^{d_{vs}}$, $\boldsymbol{S}_{a}\in\mathbb{R}^{d_{at}}$ and $\boldsymbol{S}_{v}\in\mathbb{R}^{d_{vt}}$, respectively. The bilinear models for the audio and visual pipelines can be written as:

\boldsymbol{Y}_{a}=\boldsymbol{\mathcal{T}}_{a}\times_{1}\boldsymbol{R}_{a}\times_{2}\boldsymbol{S}_{a},\quad\boldsymbol{Y}_{v}=\boldsymbol{\mathcal{T}}_{v}\times_{1}\boldsymbol{R}_{v}\times_{2}\boldsymbol{S}_{v},  (10)

where $\boldsymbol{\mathcal{T}}_{a}\in\mathbb{R}^{d_{as}\times d_{at}\times K_{a}}$ and $\boldsymbol{\mathcal{T}}_{v}\in\mathbb{R}^{d_{vs}\times d_{vt}\times K_{v}}$ are the full tensors and $\times_{i}$ denotes the $i$-mode product. However, the number of parameters in the full tensor $\boldsymbol{\mathcal{T}}_{c}$ can be very large. We therefore propose the temporal-semantic Tucker fusion, which factorizes the full tensor $\boldsymbol{\mathcal{T}}_{c}$ following [59]. The decomposition of the full tensor $\boldsymbol{\mathcal{T}}$ is defined as:

\boldsymbol{\mathcal{T}}:=\boldsymbol{\mathcal{G}}\times_{1}\boldsymbol{U}^{(s)}\times_{2}\boldsymbol{U}^{(t)}\times_{3}\boldsymbol{U}^{(k)},  (11)

where $\boldsymbol{\mathcal{G}}$ is the core tensor, and $\boldsymbol{U}^{(s)}\in\mathbb{R}^{d_{s}\times n_{a}}$, $\boldsymbol{U}^{(t)}\in\mathbb{R}^{d_{t}\times n_{v}}$ and $\boldsymbol{U}^{(k)}\in\mathbb{R}^{K\times n_{k}}$ are the factor matrices. Using this decomposition to factorize the full tensors $\boldsymbol{\mathcal{T}}_{a}$ and $\boldsymbol{\mathcal{T}}_{v}$, Eq. (10) can be rewritten as:

\boldsymbol{Y}_{a}=\boldsymbol{\mathcal{G}}_{a}\times_{1}(\boldsymbol{R}_{a}^{\top}\boldsymbol{U}^{s}_{a})\times_{2}(\boldsymbol{S}_{a}^{\top}\boldsymbol{U}^{t}_{a})\times_{3}\boldsymbol{U}^{k}_{a},  (12)
\boldsymbol{Y}_{v}=\boldsymbol{\mathcal{G}}_{v}\times_{1}(\boldsymbol{R}_{v}^{\top}\boldsymbol{U}^{s}_{v})\times_{2}(\boldsymbol{S}_{v}^{\top}\boldsymbol{U}^{t}_{v})\times_{3}\boldsymbol{U}^{k}_{v}.

We can then perform bilinear interaction to capture the complex relationships between the temporal and semantic information and project them into a lower-dimensional representation. This process is defined as:

\boldsymbol{Y}_{a}=\boldsymbol{\mathcal{G}}_{a}\times_{1}\widetilde{\boldsymbol{R}}_{a}\times_{2}\widetilde{\boldsymbol{S}}_{a},  (13)
\boldsymbol{Y}_{v}=\boldsymbol{\mathcal{G}}_{v}\times_{1}\widetilde{\boldsymbol{R}}_{v}\times_{2}\widetilde{\boldsymbol{S}}_{v},

where $\widetilde{\boldsymbol{R}}_{a}=\boldsymbol{R}_{a}^{\top}\boldsymbol{U}^{s}_{a}\in\mathbb{R}^{n_{a}\times s_{a}}$, $\widetilde{\boldsymbol{R}}_{v}=\boldsymbol{R}_{v}^{\top}\boldsymbol{U}^{s}_{v}\in\mathbb{R}^{n_{v}\times s_{v}}$, $\widetilde{\boldsymbol{S}}_{a}=\boldsymbol{S}_{a}^{\top}\boldsymbol{U}^{t}_{a}\in\mathbb{R}^{n_{a}\times t_{a}}$ and $\widetilde{\boldsymbol{S}}_{v}=\boldsymbol{S}_{v}^{\top}\boldsymbol{U}^{t}_{v}\in\mathbb{R}^{n_{v}\times t_{v}}$.
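
A compact sketch of the Tucker fusion in Eqs. (11)-(13) for one modality is shown below, using a small core tensor and three factor projections instead of the full bilinear tensor. The class name, the batched shapes, and the rank defaults (matching the rank-60 setting discussed in Sec. IV-C3) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TuckerFusion(nn.Module):
    """Sketch of Eqs. (11)-(13): project the semantic vector R and the temporal
    vector S with factor matrices U^(s), U^(t), contract them with a core
    tensor G, and map the result out with U^(k). The rank (n_s, n_t, n_k)
    bounds the parameter count while keeping full second-order interactions."""
    def __init__(self, d_s, d_t, d_out, n_s=60, n_t=60, n_k=60):
        super().__init__()
        self.U_s = nn.Linear(d_s, n_s, bias=False)    # factor matrix U^(s)
        self.U_t = nn.Linear(d_t, n_t, bias=False)    # factor matrix U^(t)
        self.core = nn.Parameter(torch.randn(n_s, n_t, n_k) * 0.02)  # core tensor G
        self.U_k = nn.Linear(n_k, d_out, bias=False)  # factor matrix U^(k)

    def forward(self, R, S):
        r = self.U_s(R)    # R~ = R^T U^(s), shape [B, n_s]
        s = self.U_t(S)    # S~ = S^T U^(t), shape [B, n_t]
        y = torch.einsum('bs,stk,bt->bk', r, self.core, s)  # bilinear interaction, Eq. (13)
        return self.U_k(y)
```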

III-D Joint Reasoning Module

After integrating the temporal and semantic features from different modalities, we propose a cross-modal transformer to further reason about the implicit feature correspondences within each modality. We establish residual connections between the two modalities to capture the complementary information between them. Layer normalization is applied to mitigate the impact of feature variations. The cross-modal transformer contains a stack of standard transformer layers to obtain a joint temporal-semantic representation. The cross-modal transformer blocks for the two modalities share weights, which can be summarized as follows:

\boldsymbol{Q}_{av}=\mathrm{MHCA}(\boldsymbol{Y}_{a},\boldsymbol{Y}_{v}),  (14)
\boldsymbol{Z}_{av}=\mathrm{MLP}(\mathrm{LN}(\boldsymbol{Q}_{av}))+\boldsymbol{Q}_{av},

where $\mathrm{MHCA}(\cdot)$ represents multi-head cross attention. The ultimate goal of our model is to predict the text category based on the audio-visual inputs. To project the joint audio-visual features into the same space as the text features, we construct projection and reconstruction layers. The projection layer maps the audio-visual features to align with the text features. The reconstruction layer helps to preserve the relevant information while discarding any noise or irrelevant details that may have been introduced during the projection. Both the projection and reconstruction layers have a similar structure, consisting of two linear layers $f_{3}^{m}:\mathbb{R}^{s_{in}\times h_{emb}}\rightarrow\mathbb{R}^{s_{in}\times h_{hid}}$ and $f_{4}^{m}:\mathbb{R}^{s_{in}\times h_{hid}}\rightarrow\mathbb{R}^{s_{in}\times h_{out}}$, followed by dropout regularization with rate $d_{proj}$. The final audio-visual joint feature embeddings are obtained as:

\boldsymbol{\mathcal{F}}_{av}=\boldsymbol{Pro}_{av}(\boldsymbol{Z}_{av}),  (15)

where $\boldsymbol{Pro}_{av}(\cdot)$ is the projection function. The final textual labeled embedding $\boldsymbol{\mathcal{F}}_{tex}$ is obtained through the word projection layer $\boldsymbol{Pro}_{tex}$. The architecture of $\boldsymbol{Pro}_{tex}$ is similar to that of $\boldsymbol{Pro}_{av}$, with dropout rate $d_{text}$.
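
A sketch of one shared-weight cross-modal block of Eq. (14) is given below, built on multi-head cross attention with the 8-head, 64-dimension-per-head setting mentioned in Sec. III-E; the exact residual and normalization placement is our reading of the text rather than the released code.

```python
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Sketch of Eq. (14): multi-head cross attention between the fused audio
    and visual embeddings, followed by LayerNorm + MLP with a residual
    connection. The same block (shared weights) serves both modalities."""
    def __init__(self, h_emb=512, n_heads=8):
        super().__init__()
        self.mhca = nn.MultiheadAttention(h_emb, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(h_emb)
        self.mlp = nn.Sequential(nn.Linear(h_emb, h_emb), nn.ReLU(),
                                 nn.Linear(h_emb, h_emb))

    def forward(self, y_a, y_v):
        # Queries from one modality attend to keys/values from the other.
        q_av, _ = self.mhca(y_a, y_v, y_v)
        return self.mlp(self.norm(q_av)) + q_av   # Z_av in Eq. (14)
```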

III-E Training Strategy

The STFT is trained on an Nvidia V100S GPU. The audio and visual embeddings are extracted using the pretrained SeLaVi model [57]. In STFT, we set $a_{in}=512$, $h_{emb}=512$, $h_{hid}=512$, $h_{out}=300$ and $h_{proj}=64$. On the VGGSound, UCF, and ActivityNet datasets, the dropout rates are $d_{enc}=(0.20,0.25,0.10)$, $d_{dec}=(0.25,0.20,0.15)$, and $d_{text}=(0.1,0.1,0.1)$, respectively. The cross-modal transformer has 8 attention heads, each of dimension 64. We use Adam as the optimizer and train STFT for 60 epochs with a learning rate of 0.0001. To update the parameters more effectively, STFT uses a combination of the triplet loss $\mathcal{L}_{t}$, the projection loss $\mathcal{L}_{p}$, and the reconstruction loss $\mathcal{L}_{r}$.

III-E1 Triplet loss.

The triplet loss compares the distances between anchor samples, positive samples, and negative samples in the joint audio-visual embedding space. The triplet loss $\mathcal{L}_{t}$ can be written as:

\mathcal{L}_{t}=[\gamma+\boldsymbol{\mathcal{F}}_{av}^{+}-\boldsymbol{\mathcal{F}}_{tex}^{+}]_{+}+[\gamma+\boldsymbol{\mathcal{F}}_{av}^{-}-\boldsymbol{\mathcal{F}}_{tex}^{+}]_{+},  (16)

where $\gamma$ represents the margin parameter that defines the minimum separation between negative pairs of different modalities and truly matching audio-visual embeddings, $\boldsymbol{\mathcal{F}}_{tex}$ represents the textual embeddings, $[x]_{+}\equiv\max(x,0)$, and $\boldsymbol{\mathcal{F}}_{av}^{+}$ and $\boldsymbol{\mathcal{F}}_{av}^{-}$ correspond to positive and negative examples, respectively.

III-E2 Projection loss.

The projection loss reduces the distance between the output joint embeddings from the projection layer and the corresponding textual labeled embeddings; it can be written as:

\mathcal{L}_{p}=\frac{1}{n}\sum_{i=1}^{n}\left\|\boldsymbol{\mathcal{F}}_{av}-\boldsymbol{\mathcal{F}}_{tex}\right\|_{2},  (17)

where $n$ is the number of samples.

III-E3 Reconstruction loss.

The reconstruction loss is proposed to ensure that the original data distribution is maintained when projecting audio-visual features into the shared embedding space. The reconstruction loss $\mathcal{L}_{r}$ can be written as:

\mathcal{L}_{r}=\frac{1}{n}\sum_{i=1}^{n}\left\|\boldsymbol{\mathcal{F}}_{av}^{rec}-\boldsymbol{\mathcal{F}}_{tex}\right\|_{2},  (18)

where $\boldsymbol{\mathcal{F}}_{av}^{rec}$ is the output of the reconstruction layer. The total loss is formulated as $\mathcal{L}_{all}=0.5\,\mathcal{L}_{t}+0.5\,(\mathcal{L}_{p}+\mathcal{L}_{r})$.
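
The three losses and their fixed weighting can be sketched as a single function. Reading Eqs. (16)-(18) with Euclidean distances between the embeddings is our interpretation; the margin value and the distance choice are illustrative assumptions rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def stft_loss(f_av_pos, f_av_neg, f_tex_pos, f_av_rec, gamma=0.2):
    """Sketch of the total loss L_all = 0.5*L_t + 0.5*(L_p + L_r), with the
    triplet (Eq. 16), projection (Eq. 17) and reconstruction (Eq. 18) terms
    interpreted as Euclidean distances between embeddings."""
    dist = lambda a, b: F.pairwise_distance(a, b)
    # Triplet term: matching audio-visual/text pairs closer than mismatched ones.
    l_t = F.relu(gamma + dist(f_av_pos, f_tex_pos) - dist(f_av_neg, f_tex_pos)).mean()
    # Projection term: joint embeddings stay close to the class-level text embedding.
    l_p = dist(f_av_pos, f_tex_pos).mean()
    # Reconstruction term: reconstructed embeddings also stay close to the text.
    l_r = dist(f_av_rec, f_tex_pos).mean()
    return 0.5 * l_t + 0.5 * (l_p + l_r)
```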

IV Experiment

In this paper, we evaluate our proposed model in both ZSL and GZSL settings. Following [2], we utilize the mean class accuracy to measure the effectiveness of the models in classification tasks. For the ZSL evaluation, we specifically focus on analyzing the performance of the models on test samples from the subset of unseen test classes. In the GZSL evaluation, we evaluate the models on the entire test set, which includes both seen (S) and unseen (U) classes. This comprehensive evaluation enables us to calculate the harmonic mean, which is given by the formula $\mathrm{HM}=\frac{2\,\mathrm{U}\,\mathrm{S}}{\mathrm{U}+\mathrm{S}}$. The harmonic mean provides a balanced measure of the model's overall performance in the GZSL scenario.
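
A small helper reproducing this formula (the example values are the UCF-GZSL numbers reported in Table II):

```python
def harmonic_mean(seen_acc: float, unseen_acc: float) -> float:
    """GZSL harmonic mean HM = 2*S*U / (S + U)."""
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# e.g. the UCF-GZSL row of Table II: harmonic_mean(56.47, 22.89) ~= 32.58
```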

TABLE II: The performance of our STFT and state-of-the-art baselines for audio-visual (G)ZSL on three benchmark datasets.
Type Model VGGSound-GZSL UCF-GZSL ActivityNet-GZSL
S U HM \uparrow ZSL \uparrow S U HM \uparrow ZSL \uparrow S U HM \uparrow ZSL \uparrow
ZSL SJE [60] 48.33 1.10 2.15 4.06 63.10 16.77 26.50 18.93 4.61 7.04 5.57 7.08
DEVISE [61] 36.22 1.07 2.08 5.59 55.59 14.94 23.56 16.09 3.45 8.53 4.91 8.53
APN [62] 7.48 3.88 5.11 4.49 28.46 16.16 20.61 16.44 9.84 5.76 7.27 6.34
VAEGAN [63] 12.77 0.95 1.77 1.91 17.29 8.47 11.37 11.11 4.36 2.14 2.87 2.40
Audio-visual ZSL CJME [6] 8.69 4.78 6.17 5.16 26.04 8.21 12.48 8.29 5.55 4.75 5.12 5.84
AVGZSLNet [1] 18.05 3.48 5.83 5.28 52.52 10.90 18.05 13.65 8.93 5.04 6.44 5.40
AVCA [2] 14.90 4.00 6.31 6.00 51.53 18.43 27.15 20.01 24.86 8.02 12.13 9.13
TCaF [3] 9.64 5.91 7.33 6.06 58.60 21.74 31.72 24.81 18.70 7.50 10.71 7.91
AVMST [5] 14.14 5.28 7.68 6.61 44.08 22.63 29.91 28.19 17.75 9.90 12.71 10.37
Hyperalignment [4] 13.22 5.01 7.27 6.14 57.28 17.83 27.19 19.02 23.50 8.47 12.46 9.83
Hypersingle [4] 9.79 6.23 7.62 6.46 52.67 19.04 27.97 22.09 23.60 10.13 14.18 10.80
Hypermultiple [4] 15.02 6.75 9.32 7.97 63.08 19.10 29.32 22.24 23.38 8.67 12.65 9.50
MDFT [39] 16.14 5.97 8.72 7.13 48.79 23.11 31.36 31.53 18.32 10.55 13.39 12.55
STFT (ours) 19.22 6.81 10.06 8.24 56.47 22.89 32.58 29.72 22.34 11.73 15.38 12.91

IV-A Dataset Statistics

In this study, we conducted experiments and evaluated the proposed models using three benchmark datasets: ActivityNet, VGGSound, and UCF101. These datasets were chosen to provide a diverse range of audio-visual data and cover various domains, enabling a comprehensive evaluation of the proposed models’ performance. The statistics of these datasets are as follows: 1). ActivityNet contains a wide variety of human activities along with the corresponding videos. The dataset consists of approximately 200 different activity classes and more than 20,000 videos with an average duration of about 2 minutes per video. 2). UCF101 dataset consists of more than 13,000 videos collected from YouTube, with an average duration of around 7 seconds per video. The videos cover a wide range of human actions in various contexts and provide a challenging dataset for action recognition algorithms. 3). VGGSound dataset consists of more than 200 different classes and includes thousands of audio clips obtained from online sources.

IV-B Results Comparison

In Table II, we demonstrate the superiority of the proposed STFT compared to state-of-the-art (SOTA) methods. On the VGGSound dataset, STFT achieves significant improvements over TCaF with a 37.2% increase in HM and a 35.9% increase in ZSL scores. On the UCF101 dataset, STFT achieves an HM of 32.58 and a ZSL score of 29.72. Compared to the best current method MDFT [39], STFT achieves a 3.9% improvement in HM but experiences a slight decrease in ZSL. It is worth noting that both our STFT and MDFT employ SNNs as temporal encoders. However, due to the different output data distributions between the SNN and the Transformer used in MDFT, MDFT's performance on seen classes is not satisfactory. To address this issue, we propose the temporal-semantic Tucker fusion, which achieves a 15.7% improvement on seen classes. On the ActivityNet dataset, STFT obtains an HM score of 15.38 and a ZSL score of 12.91, surpassing AVCA's HM score of 12.13 and ZSL score of 9.13. Compared to AVMST, STFT achieves a 21% improvement in HM and a 24.5% improvement in ZSL. Overall, our STFT model demonstrates superior performance compared to existing methods on various evaluation metrics across the three datasets.

While the proposed method shows significant improvements on the VGGSound-GZSL dataset compared to others, it is important to note that the MDFT method requires data enhancement to convert RGB images to events, inherently introducing additional computational complexity. MDFT utilizes an Event Generative Model (EGM) to convert RGB images into event streams, eliminating background scene bias and capturing motion information. However, this conversion adds computational overhead, requiring the processing of high-resolution image data to generate events and the use of Spiking Neural Networks (SNNs) to handle the sparse event data efficiently. In contrast, our proposed method avoids this complexity by directly modeling the audio-visual data without converting RGB images to events. This design choice allows our method to maintain competitive performance across different datasets while being more computationally efficient.

Moreover, we observed a slight decline in ZSL performance on the UCF101 dataset. This decline can be attributed to the fixed rank constraint used in the semantic-temporal Tucker fusion module, which may not fully capture the complex temporal dynamics of the UCF101 dataset. To address this issue, we suggest dynamically adjusting the rank constraint based on the singular values of the input data in the future. Additionally, significant variations in activity patterns within the UCF101 dataset may introduce redundancy at higher time steps, negatively impacting ZSL performance. We suggest reducing redundancy through an optimized temporal encoding process and exploring different configurations of the time-step factor to enhance temporal feature integration.

TABLE III: Ablation study of different loss items.
Loss UCF-GZSL
S U HM \uparrow ZSL \uparrow
W/o $\mathcal{L}_{p}$+$\mathcal{L}_{r}$ 48.76 17.21 25.44 19.46
W/o $\mathcal{L}_{p}$ 53.14 18.21 27.13 23.14
W/o $\mathcal{L}_{r}$ 51.47 19.33 28.10 23.81
STFT 56.47 22.89 32.58 29.72
TABLE IV: Ablation study of different model components.
Components UCF-GZSL
S U HM \uparrow ZSL \uparrow
W/o GLP 52.88 18.72 27.65 24.79
W/o TSF 52.14 19.44 28.32 25.52
W/o DTH 53.79 21.72 30.94 27.41
W/o LKC 49.13 22.67 31.02 28.96
STFT 56.47 22.89 32.58 29.72
Figure 3: Ablation study of the impact of different time steps, rank constraints, and fixed thresholds on the HM and ZSL performance on the UCF101 dataset. Panels: (a) HM at different time steps; (b) HM at different rank constraints; (c) HM at different fixed thresholds; (d) ZSL at different time steps; (e) ZSL at different rank constraints; (f) ZSL at different fixed thresholds.

IV-C Ablation Study

IV-C1 The effectiveness of different model components

To demonstrate the effectiveness of each component in our model, we conducted extensive experiments on the UCF dataset, as shown in Table IV. The models without the Latent Knowledge Combiner, Global-Local Pooling module, Time-Step Factor, and Dynamic Threshold Adjustment module are denoted as "W/o LKC," "W/o GLP," "W/o TSF," and "W/o DTH," respectively. Among these components, the GLP module in the SNN has the most significant impact on the overall performance of the model. GLP guides the generation of SNN outputs by incorporating both global and local characteristics, enhancing the fusion of spatial and temporal features. The TSF is the next most influential component. When TSF is removed, our model experiences a performance decrease of 15.1% in HM and 20.3% in ZSL. TSF enables our model to dynamically adjust the weights of different time steps based on the output of the SNN, improving the efficiency of temporal information extraction. The DTH adjusts the threshold of the SNN based on the amount of input information and the GLP, which alleviates the spiking noise and improves the robustness of the model. Lastly, the LKC computes the importance of each latent knowledge slot and effectively combines them.

Furthermore, in Table III, we also demonstrate the impact of different loss items on the model performance. We observe that utilizing the complete loss function yields the best HM and ZSL performance across the UCF-GZSL, VGGSound-GZSL, and ActivityNet-GZSL datasets. When both the p\mathcal{L}_{p} and r\mathcal{L}_{r} losses are removed, our model experiences a decrease of 28.1% in HM and 57.9% in ZSL. This experiment verifies the indispensable role of each loss function in the model training process, highlighting the importance of incorporating all loss items to ensure enhanced GZSL and ZSL performance.

TABLE V: The comparison of different combinations of loss weights.
α\alpha β\beta S U HM \uparrow ZSL \uparrow
0.2 0.8 55.12 20.11 29.47 25.52
0.8 0.2 58.13 20.89 30.74 27.16
0.7 0.3 56.94 21.12 30.81 26.93
0.3 0.7 54.69 21.78 31.15 28.47
0.5 0.5 56.47 22.89 32.58 29.72
TABLE VI: The effectiveness of combining SNN and Transformer.
Model S U HM \uparrow ZSL \uparrow
Transformer+MLP 62.43 14.31 23.28 21.03
Spikformer+SNN 28.12 23.69 25.72 28.54
Transformer+SNN (STFT) 56.47 22.89 32.58 29.72
TABLE VII: The ablation study of the impact of different latent knowledge slots on three benchmark datasets.
Number of slots VGGSound-GZSL UCF-GZSL ActivityNet-GZSL
HM ZSL HM ZSL HM ZSL
1 9.14 6.99 28.46 29.33 13.96 11.63
2 9.31 7.56 30.71 29.14 13.87 11.96
3 9.48 8.03 32.58 30.72 14.35 12.14
4 10.06 8.24 31.47 30.11 15.12 12.03
5 9.36 8.13 31.47 30.11 15.38 12.91
6 9.88 7.93 31.75 30.54 15.21 12.64

IV-C2 The impact of TSF in different time step

The performance variation of the model with and without TSF at different time steps is shown in Fig. 3(a) and 3(d). It is evident that when TSF is introduced, the performance of the model becomes more stable and improves across different time steps. However, there is a slight decrease in both the HM and ZSL metrics when the number of time steps increases from 8 to 16. This is because a larger number of time steps leads to increased redundancy in the SNN outputs and significantly higher computational cost. Overall, removing TSF has a more significant impact on the ZSL performance, especially at low and high time steps. The TSF enhances the stability and performance of the model across different time steps, providing a more efficient way to combine the outputs of the SNN at various time steps and obtain a more comprehensive feature representation.

IV-C3 The impact of rank constraint in $\boldsymbol{\mathcal{T}}_{c}$

In Fig. 3(b) and 3(e), we show the impact of different rank constraints in $\boldsymbol{\mathcal{T}}_{c}$ on the model performance at different time steps. A lower rank constraint represents faster inference speed and fewer model parameters, while a higher rank constraint indicates more information and more preserved features. Generally, higher ranks achieve higher accuracy, particularly at lower time steps such as 2 and 4. When the rank is set to 80, the performance of STFT continues to improve as the number of time steps increases, while the other curves show a slight decrease. This improvement may be attributed to the richer information fusion during feature combination. However, when the rank is set to 60, the ZSL performance is higher than that of rank 80 at time step 8. This is because the output of the SNN is sparse, and a lower rank constraint can filter out redundant features. Considering practical requirements, we believe that selecting a rank of 60 ensures performance while improving model efficiency.

IV-C4 The importance of different spiking thresholds

In Fig. 3(c) and 3(f), we demonstrate the impact of dynamic threshold adjustment on the model performance using multimodal and unimodal inputs with different fixed thresholds. Our model is highly sensitive to the spike thresholds of neurons, and as a result, the model performance experiences significant variations with different fixed training thresholds. When using dynamic thresholds, the STFT method outperforms all the fixed threshold methods. Obviously, multimodal inputs yield a great improvement compared to unimodal inputs, highlighting the importance of joint learning. The dynamic threshold adjustment module dynamically adjusts the model threshold by measuring the amount of information in the current input, which leads to enhanced feature representations and highlights the significance of incorporating multimodal inputs for better performance.

IV-C5 The effectiveness of each loss items

The total loss is formulated as $\mathcal{L}_{all}=\alpha\mathcal{L}_{t}+\beta(\mathcal{L}_{p}+\mathcal{L}_{r})$, where $\alpha$ and $\beta$ are hyperparameters. Additional experiments on these loss-weight hyperparameters are reported in Table V, where equal weights for the loss items outperform the other weight combinations.

IV-C6 The effectiveness of combining SNN and Transformer

In Table VI, we replace the Transformer with Spikformer (full-spike) and the SNN with an MLP to verify the effectiveness of combining the SNN and Transformer. "Transformer+MLP" (full-float) performs best on the seen classes but has a significant gap to our model on unseen classes and ZSL. "Spikformer+SNN" (full-spike) struggles on seen classes. This experiment demonstrates the strong domain adaptation ability of SNNs in zero-shot learning. Therefore, combining the SNN with the Transformer leverages the characteristics of both types of models and performs best on the HM metric.

TABLE VIII: The statistics for our VGGSound, UCF, and ActivityNet (G)ZSLcls datasets.
Dataset | #Classes: all / train / val(U) / test(U) | #Videos: test(U)
VGGSound-GZSLcls 271 138 69 64 3200
UCF-GZSLcls 48 30 12 6 845
ActivityNet-GZSLcls 198 99 51 48 4052
TABLE IX: The ablation study on only using Transformer or SNN.
Model S U HM \uparrow ZSL \uparrow
Only Transformer 58.41 15.23 24.16 22.31
Only SNN 26.14 22.95 24.44 27.72
STFT 56.47 22.89 32.58 29.72

IV-C7 The impact of different latent knowledge slots

We show the impact of various latent knowledge slots on model performance in Table VII. These knowledge slots symbolize the latent semantic features present between two modalities. Effectively integrating knowledge slots from diverse modalities helps in discovering and aligning latent cross-modal relationships, facilitating the extraction of stronger multimodal feature representations. We increment the number of latent knowledge slots from 1 to 6 and perform an ablation study on the UCF101, VGGSound, and ActivityNet datasets.

Overall, the performance changes on the VGGSound and ActivityNet datasets are relatively smooth. However, noticeable fluctuations occur on the UCF101 dataset. Table VII shows the performance changes for different numbers of latent knowledge slots on VGGSound, where the optimal performance is achieved with four slots. Table VII also shows the performance changes on the UCF101 dataset, indicating a general upward trend in the model's performance as the number of slots increases. With three slots, there is an improvement of 14.48% and 4.7% on the HM and ZSL metrics, respectively, compared to having just one slot. Table VII further reveals the performance on the ActivityNet dataset, where the best performance is achieved with five slots.

Having more slots means more latent semantic features, but it can also introduce unnecessary redundancy. Thus, choosing an optimal number of slots is essential for improving model performance.

IV-C8 The effectiveness of only using Transformer or SNN

Table IX evaluates the effectiveness of combining SNN and Transformer models. Comparing “Only Transformer” and “Only SNN” models, the Transformer-only model excels in performance on seen classes but notably underperforms on unseen classes. Conversely, the SNN-only model struggles with seen data but performs better on unseen classes. This indicates SNN’s strong domain adaptation capabilities in zero-shot learning scenarios. The STFT model, which integrates both SNN and Transformer, successfully combines the strengths of both. It achieves the highest score on the HM metric, demonstrating superior overall performance and balance between seen and unseen data.

TABLE X: The parameter comparison on VGGSound dataset.
Model S U HM \uparrow ZSL \uparrow #params GFLOPS
AVCA [2] 14.90 4.00 6.31 6.00 1.69M 2.36
AVMST [5] 14.14 5.28 7.68 6.61 6.32M 5.12
MDFT [39] 16.14 5.97 8.72 7.13 5.51M 5.62
STFT 19.22 6.81 10.06 8.24 4.16M 4.27

IV-D Parameter analysis

As shown in Table X, "#params" represents the number of parameters and "GFLOPS" represents the computational cost during training. Overall, our model exhibits strong performance in both the number of parameters and the computational cost, while ensuring high classification effectiveness. Compared to MDFT, our model reduces the parameters by approximately 32% and reduces the GFLOPS from 5.62 to 4.27.

IV-E Different audio/video feature extraction networks

TABLE XI: We conduct evaluations of STFT along with state-of-the-art (G)ZSL methods on the VGGSoundcls, UCFcls, and ActivityNetcls datasets using features extracted from audio/video classification networks.
Type Model VGGSound-GZSLcls UCF-GZSLcls ActivityNet-GZSLcls
S U HM \uparrow ZSL \uparrow S U HM \uparrow ZSL \uparrow S U HM \uparrow ZSL \uparrow
ZSL ALE [64] 26.13 1.72 3.23 4.97 45.42 29.09 35.47 32.30 0.89 6.16 1.55 6.16
SJE [60] 16.94 2.72 4.69 3.22 19.39 32.47 24.28 32.47 37.92 1.22 2.35 4.35
DEVISE [61] 29.96 1.94 3.64 4.72 29.58 34.80 31.98 35.48 0.17 5.84 0.33 5.84
APN [62] 6.46 6.13 6.29 6.50 13.54 28.44 18.35 29.69 3.79 3.39 3.58 3.97
Audio-visual ZSL CJME [6] 10.86 2.22 3.68 3.72 33.89 24.82 28.65 29.01 10.75 5.55 7.32 6.29
AVGZSLNet [1] 15.02 3.19 5.26 4.81 74.79 24.15 36.51 31.51 13.70 5.96 8.30 6.39
AVCA [2] 12.63 6.19 8.31 6.91 63.15 30.72 41.34 37.72 16.77 7.04 9.92 7.58
TCaF [3] 12.63 6.72 8.77 7.41 67.14 40.83 50.78 44.64 30.12 7.65 12.20 7.96
Hyperalignment [4] 12.50 6.44 8.50 7.25 57.13 33.86 42.52 39.80 29.77 8.77 13.55 9.13
Hypersingle [4] 12.56 5.03 7.18 5.47 63.47 34.85 44.99 39.86 24.61 10.10 14.32 10.37
Hypermultiple [4] 15.62 6.00 8.67 7.31 74.26 35.79 48.30 52.11 36.98 9.60 15.25 10.39
STFT (ours) 11.74 8.83 10.08 8.79 61.42 43.81 51.14 49.74 25.12 9.83 14.13 9.46
Figure 4: Visualization examples on UCF101. We give t-SNE visualization results for five categories which can be categorized into two parent classes: “Sports” and “Instrument”.

We expand our methodology by incorporating features from audio and video classification networks into our model training and evaluation process. Specifically, we use C3D [65], a network pre-trained on the Sports1M [66] dataset for video classification, to extract visual features. For audio feature extraction, we use VGGish [67], pre-trained on the Youtube-8M [68] dataset. To create a unified feature representation for each video, we average these extracted features over time, resulting in a 4096-dimensional visual feature vector and a 128-dimensional audio feature vector.
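A minimal sketch of this temporal averaging step, assuming the per-segment features have already been extracted by C3D and VGGish (array shapes follow the dimensions stated above; the function and variable names are ours):

```python
import numpy as np

def temporal_average(segment_features: np.ndarray) -> np.ndarray:
    """Average per-segment features over time to obtain one clip-level vector.

    segment_features: (T, D) array, e.g. T C3D segments with D = 4096 or
    T VGGish frames with D = 128.
    """
    return segment_features.mean(axis=0)

# Hypothetical usage:
# visual_vec = temporal_average(c3d_features)     # -> (4096,)
# audio_vec  = temporal_average(vggish_features)  # -> (128,)
```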

To adjust the audio features derived from the Youtube-8M pre-trained network, we make changes to the dataset splits for VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL. We remove the test unseen classes that overlap with the Youtube-8M dataset, leading to modified dataset splits named VGGSound-GZSLcls, UCF-GZSLcls, and ActivityNet-GZSLcls. More details on these adjustments are available in Table VIII.
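As a simple sketch of how such a split can be constructed (exact class-name matching across datasets is an assumption of this snippet):

```python
def build_cls_test_split(test_unseen_classes: set, youtube8m_classes: set) -> set:
    """Drop test unseen classes that also appear in the YouTube-8M label set,
    yielding the *-GZSLcls test split (a sketch; name normalisation omitted)."""
    return {c for c in test_unseen_classes if c not in youtube8m_classes}
```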

Table XI presents the comparative results of our STFT model against the baseline models using the audio and video classification features described above. STFT performs strongly across all three datasets. On VGGSound-GZSLcls, it achieves a harmonic mean (HM) of 10.08% and a ZSL accuracy of 8.79%, outperforming TCaF, which records an HM of 8.77% and a ZSL accuracy of 7.41%. On UCF-GZSLcls, STFT reaches an HM of 51.14% and a ZSL accuracy of 49.74%, surpassing both AVCA and AVGZSLNet. Likewise, on ActivityNet-GZSLcls, STFT outperforms both AVCA and AVGZSLNet in terms of HM and ZSL accuracy. We attribute these results to the temporal-semantic Tucker fusion module, which effectively combines the SNN and Transformer outputs for multi-scale fusion.

Figure 5: Qualitative comparison results with MDFT.

IV-F Visualization results

We use t-SNE visualization to demonstrate the advantages of the proposed STFT in exploring intrinsic correlations within multimodal data, as shown in Fig. 4. On the UCF101 dataset, STFT clusters features from the same parent category and separates features from different parent categories. For example, features such as "Basketball" and "BasketballDunk", which belong to the same parent class "Sports", are drawn closer together, while features from different parent classes, such as "PlayingCello" ("Instrument") and "HandstandWalking" ("Sports"), are pushed apart. These visualizations illustrate how our method captures correlations between different types of data.
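A generic sketch of how such a t-SNE plot can be produced from the learned joint embeddings (the perplexity, colouring, and output path are illustrative choices, not the paper's settings):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings: np.ndarray, labels: np.ndarray, out_path: str = "tsne.png") -> None:
    """Project (N, D) embeddings to 2-D with t-SNE and scatter-plot them by class."""
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(embeddings)
    for cls in np.unique(labels):
        mask = labels == cls
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=str(cls))
    plt.legend(fontsize=6)
    plt.savefig(out_path, dpi=200)
```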

IV-G Qualitative Comparison

We present qualitative comparison results with the recent SOTA method MDFT in Fig. 5. MDFT focuses on decoupling motion and background information, whereas our model couples the outputs of the SNN and Transformer effectively. In this paper, we address the challenges of time steps and spiking redundancy in SNNs, as well as the output heterogeneity between the SNN and Transformer. In sports classes with frequent changes in motion information, STFT outperforms MDFT owing to its reduced spiking redundancy.

IV-H Limitations

Although our model achieves SOTA performance in HM on all three benchmark datasets, we observed a slight decrease in ZSL accuracy on UCF101. This may be caused by the fixed rank constraint. A potential solution is to set the rank constraint of the temporal-semantic Tucker fusion dynamically, based on the singular values of the input information. Such an adaptive rank-adjustment strategy could potentially improve the ZSL performance.
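One way such an adaptive rank could be chosen, as a sketch only, is from the energy captured by the singular values of the input feature matrix; the 95% energy threshold and the rank cap below are assumptions, not tuned values.

```python
import torch

def adaptive_rank(features: torch.Tensor, energy: float = 0.95, max_rank: int = 64) -> int:
    """Smallest rank whose leading singular values retain `energy` of the
    spectrum of a (B, D) feature matrix (illustrative strategy only)."""
    s = torch.linalg.svdvals(features)                      # singular values, descending
    cum = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)   # cumulative energy
    k = int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1
    return min(k, max_rank)
```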

IV-I Scalability Discussion

The proposed Spiking Tucker Fusion Transformer (STFT) is designed with scalability in mind, handling larger datasets and more complex audio-visual sequences. Its temporal-semantic fusion module is based on Tucker decomposition, which enables multi-scale fusion of the SNN and Transformer outputs while keeping the number of parameters manageable, so computational efficiency is maintained as datasets grow. Efficient data loading and batching strategies keep memory usage within bounds on larger datasets. In addition, the STFT adapts to sequences of different complexity through dynamic mechanisms such as the TSF, which synthesizes temporal information, and the GLP, which reduces spike noise and enhances robustness. Together, these components allow the model to manage sequence complexity effectively and support its applicability to real-world scenarios involving large datasets and complex audio-visual sequences.
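To make the role of the TSF concrete, the following hypothetical sketch weights each SNN time step with a learnable factor and synthesises a single temporal representation; it is a generic illustration of the idea, not the paper's exact TSF.

```python
import torch
import torch.nn as nn

class TimeStepAggregation(nn.Module):
    """Learnable per-time-step factors score the SNN outputs and a
    softmax-weighted sum synthesises the temporal feature (a sketch)."""
    def __init__(self, num_steps: int):
        super().__init__()
        self.factors = nn.Parameter(torch.zeros(num_steps))  # one factor per time step

    def forward(self, spikes: torch.Tensor) -> torch.Tensor:
        # spikes: (T, B, D) spike-based features from the SNN encoder
        w = torch.softmax(self.factors, dim=0)                # (T,) time-step weights
        return torch.einsum('t,tbd->bd', w, spikes)           # weighted temporal sum
```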

V Conclusion

In conclusion, this paper introduces the Spiking Tucker Fusion Transformer (STFT) model for audio-visual zero-shot learning. The STFT model effectively combines Spiking Neural Networks (SNNs) and Transformers, integrating temporal and semantic information to generate robust representations. By introducing the time-step factor (TSF), the significance of each time step in influencing the SNN's output is measured dynamically, leading to improved performance. To guide the formation of input membrane potentials and reduce spike noise, a global-local pooling (GLP) method is proposed. Additionally, the thresholds of the spiking neurons are adjusted dynamically based on semantic and temporal cues, enhancing the model's robustness. We further propose a temporal-semantic Tucker fusion module that achieves multi-scale fusion of the SNN and Transformer outputs while maintaining full second-order interactions. The experimental results demonstrate that the proposed STFT model outperforms existing methods in audio-visual zero-shot learning tasks.

References

  • [1] Pratik Mazumder, Pravendra Singh, Kranti Kumar Parida, and Vinay P Namboodiri. Avgzslnet: Audio-visual generalized zero-shot learning by reconstructing label features from multi-modal embeddings. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3090–3099, 2021.
  • [2] Otniel-Bogdan Mercea, Lukas Riesch, A Koepke, and Zeynep Akata. Audio-visual generalised zero-shot learning with cross-modal attention and language. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10553–10563, 2022.
  • [3] Otniel-Bogdan Mercea, Thomas Hummel, A Sophia Koepke, and Zeynep Akata. Temporal and cross-modal attention for audio-visual zero-shot learning. In European Conference on Computer Vision, pages 488–505. Springer, 2022.
  • [4] Jie Hong, Zeeshan Hayder, Junlin Han, Pengfei Fang, Mehrtash Harandi, and Lars Petersson. Hyperbolic audio-visual zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7873–7883, 2023.
  • [5] Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Hengyu Man, and Xiaopeng Fan. Modality-fusion spiking transformer network for audio-visual zero-shot learning. In 2023 IEEE International Conference on Multimedia and Expo (ICME), pages 426–431. IEEE, 2023.
  • [6] Kranti Parida, Neeraj Matiyali, Tanaya Guha, and Gaurav Sharma. Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 3251–3260, 2020.
  • [7] Zhuangzhuang Chen, Jin Zhang, Zhuonan Lai, Jie Chen, Zun Liu, and Jianqiang Li. Geometry-aware guided loss for deep crack recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4703–4712, June 2022.
  • [8] Zhuangzhuang Chen, Jin Zhang, Zhuonan Lai, Guanming Zhu, Zun Liu, Jie Chen, and Jianqiang Li. The devil is in the crack orientation: A new perspective for crack detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6653–6663, October 2023.
  • [9] Yuqin Cao, Xiongkuo Min, Wei Sun, and Guangtao Zhai. Attention-guided neural networks for full-reference and no-reference audio-visual quality assessment. IEEE Transactions on Image Processing, 32:1882–1896, 2023.
  • [10] Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping Zhang, Xiaokang Yang, and Xinping Guan. A multimodal saliency model for videos with high audio-visual correspondence. IEEE Transactions on Image Processing, 29:3805–3819, 2020.
  • [11] Valentina Sanguineti, Pietro Morerio, Alessio Del Bue, and Vittorio Murino. Unsupervised synthetic acoustic image generation for audio-visual scene understanding. IEEE Transactions on Image Processing, 31:7102–7115, 2022.
  • [12] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020.
  • [13] Di Hu, Xuhong Li, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, and Dejing Dou. Cross-task transfer for geotagged audiovisual aerial scene recognition. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIV 16, pages 68–84. Springer, 2020.
  • [14] Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE transactions on pattern analysis and machine intelligence, 44(12):8717–8727, 2018.
  • [15] Bo Xu, Cheng Lu, Yandong Guo, and Jacob Wang. Discriminative multi-modality speech recognition. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 14433–14442, 2020.
  • [16] Edgar Schonfeld, Sayna Ebrahimi, Samarth Sinha, Trevor Darrell, and Zeynep Akata. Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8247–8255, 2019.
  • [17] Vinay Kumar Verma, Gundeep Arora, Ashish Mishra, and Piyush Rai. Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4281–4289, 2018.
  • [18] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5542–5551, 2018.
  • [19] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1004–1013, 2018.
  • [20] Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9844–9854, 2019.
  • [21] Elyor Kodirov, Tao Xiang, and Shaogang Gong. Semantic autoencoder for zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3174–3183, 2017.
  • [22] Alina Roitberg, Manuel Martinez, Monica Haurilet, and Rainer Stiefelhagen. Towards a fair evaluation of zero-shot action recognition using external data. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pages 0–0, 2018.
  • [23] Biagio Brattoli, Joseph Tighe, Fedor Zhdanov, Pietro Perona, and Krzysztof Chalupka. Rethinking zero-shot video classification: End-to-end training for realistic applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4613–4623, 2020.
  • [24] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International conference on machine learning, pages 2152–2161. PMLR, 2015.
  • [25] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015.
  • [26] Ruibin Yuan, Yinghao Ma, Yizhi Li, Ge Zhang, Xingran Chen, Hanzhi Yin, Le Zhuo, Yiqi Liu, Jiawen Huang, Zeyue Tian, Binyue Deng, Ningzhi Wang, Chenghua Lin, Emmanouil Benetos, Anton Ragni, Norbert Gyenge, Roger Dannenberg, Wenhu Chen, Gus Xia, Wei Xue, Si Liu, Shi Wang, Ruibo Liu, Yike Guo, and Jie Fu. Marble: Music audio representation benchmark for universal evaluation. In Advances in Neural Information Processing Systems, volume 36, pages 39626–39647, 2023.
  • [27] Yuanzhi Liang, Linchao Zhu, Xiaohan Wang, and Yi Yang. Icocap: Improving video captioning by compounding images. IEEE Transactions on Multimedia, 26:4389–4400, 2024.
  • [28] Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-Seng Chua, and Shuicheng Yan. Enhancing video-language representations with structural spatio-temporal alignment. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–18, 2024.
  • [29] Hui Chen, Hanyi Zhang, Longbiao Wang, Kong Aik Lee, Meng Liu, and Jianwu Dang. Self-supervised audio-visual speaker representation with co-meta learning. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023.
  • [30] Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, and Christoph Feichtenhofer. Mavil: Masked audio-video learners. In Advances in Neural Information Processing Systems, volume 36, pages 20371–20393, 2023.
  • [31] Yuanzhi Liang, Qianyu Feng, Linchao Zhu, Li Hu, Pan Pan, and Yi Yang. Seeg: Semantic energized co-speech gesture generation. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10463–10472, 2022.
  • [32] Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, and Marcus Rohrbach. A new split for evaluating true zero-shot action recognition. In DAGM German Conference on Pattern Recognition, pages 191–205. Springer, 2021.
  • [33] Sanath Narayan, Akshita Gupta, Fahad Shahbaz Khan, Cees GM Snoek, and Ling Shao. Latent embedding feedback and discriminative features for zero-shot classification. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 479–495. Springer, 2020.
  • [34] Yu Wu, Linchao Zhu, Yan Yan, and Yi Yang. Dual attention matching for audio-visual event localization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6292–6300, 2019.
  • [35] Yu Wu and Yi Yang. Exploring heterogeneous clues for weakly-supervised audio-visual video parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1326–1335, 2021.
  • [36] Yu Wu, Lu Jiang, and Yi Yang. Switchable novel object captioner. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1):1162–1173, 2022.
  • [37] Yi Yang, Yueting Zhuang, and Yunhe Pan. Multiple knowledge representation for big data artificial intelligence: framework, applications, and case studies. Frontiers of Information Technology & Electronic Engineering, 22(12):1551–1558, 2021.
  • [38] Caixia Yan, Xiaojun Chang, Minnan Luo, Huan Liu, Xiaoqin Zhang, and Qinghua Zheng. Semantics-guided contrastive network for zero-shot object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(3):1530–1544, 2024.
  • [39] Wenrui Li, Xi-Le Zhao, Zhengyu Ma, Xingtao Wang, Xiaopeng Fan, and Yonghong Tian. Motion-decoupled spiking transformer for audio-visual zero-shot learning. In Proceedings of the 31st ACM International Conference on Multimedia, MM ’23, page 3994–4002, 2023.
  • [40] Jianhao Ding, Tong Bu, Zhaofei Yu, Tiejun Huang, and Jian Liu. Snn-rat: Robustness-enhanced spiking neural network through regularized adversarial training. Advances in Neural Information Processing Systems, 35:24780–24793, 2022.
  • [41] Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothée Masquelier, and Anthony Maida. Deep learning in spiking neural networks. Neural networks, 111:47–63, 2019.
  • [42] Man Yao, Huanhuan Gao, Guangshe Zhao, Dingheng Wang, Yihan Lin, Zhaoxu Yang, and Guoqi Li. Temporal-wise attention spiking neural networks for event streams classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10221–10230, 2021.
  • [43] Man Yao, Guangshe Zhao, Hengyu Zhang, Yifan Hu, Lei Deng, Yonghong Tian, Bo Xu, and Guoqi Li. Attention spiking neural networks. IEEE transactions on pattern analysis and machine intelligence, 2023.
  • [44] Man Yao, Jiakui Hu, Guangshe Zhao, Yaoyuan Wang, Ziyang Zhang, Bo Xu, and Guoqi Li. Inherent redundancy in spiking neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16924–16934, October 2023.
  • [45] Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. Advances in Neural Information Processing Systems, 34:21056–21069, 2021.
  • [46] Yanqi Chen, Zhaofei Yu, Wei Fang, Zhengyu Ma, Tiejun Huang, and Yonghong Tian. State transition of dendritic spines improves learning of sparse spiking neural networks. In International Conference on Machine Learning, pages 3701–3715. PMLR, 2022.
  • [47] Yanqi Chen, Zhaofei Yu, Wei Fang, Tiejun Huang, and Yonghong Tian. Pruning of deep spiking neural networks through gradient rewiring. arXiv preprint arXiv:2105.04916, 2021.
  • [48] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2661–2671, 2021.
  • [49] Liwei Huang, Zhengyu Ma, Liutao Yu, Huihui Zhou, and Yonghong Tian. Deep spiking neural networks with high representation similarity model visual pathways of macaque and mouse. arXiv preprint arXiv:2303.06060, 2023.
  • [50] Qi Xu, Yaxin Li, Jiangrong Shen, Pingping Zhang, Jian K Liu, Huajin Tang, and Gang Pan. Hierarchical spiking-based model for efficient image classification with enhanced feature extraction and encoding. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [51] Xinjian Gao, Tingting Mu, John Yannis Goulermas, Jeyarajan Thiyagalingam, and Meng Wang. An interpretable deep architecture for similarity learning built upon hierarchical concepts. IEEE Transactions on Image Processing, 29:3911–3926, 2020.
  • [52] Qingyu Wang, Tielin Zhang, Minglun Han, Yi Wang, Duzhen Zhang, and Bo Xu. Complex dynamic neurons improved spiking transformer network for efficient automatic speech recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 102–109, 2023.
  • [53] Jibin Wu, Emre Yılmaz, Malu Zhang, Haizhou Li, and Kay Chen Tan. Deep spiking neural networks for large vocabulary automatic speech recognition. Frontiers in neuroscience, 14:199, 2020.
  • [54] Biswadeep Chakraborty, Xueyuan She, and Saibal Mukhopadhyay. A fully spiking hybrid neural network for energy-efficient object detection. IEEE Transactions on Image Processing, 30:9014–9029, 2021.
  • [55] Yajing Zheng, Zhaofei Yu, Song Wang, and Tiejun Huang. Spike-based motion estimation for object tracking through bio-inspired unsupervised learning. IEEE Transactions on Image Processing, 32:335–349, 2022.
  • [56] Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Xiaopeng Fan, and Yonghong Tian. Neuron-based spiking transmission and reasoning network for robust image-text retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 33(7):3516–3528, 2023.
  • [57] Yuki Asano, Mandela Patrick, Christian Rupprecht, and Andrea Vedaldi. Labelling unlabelled videos from scratch with multi-modal self-supervision. Advances in Neural Information Processing Systems, 33:4660–4671, 2020.
  • [58] Corinne Teeter, Ramakrishnan Iyer, Vilas Menon, Nathan Gouwens, David Feng, Jim Berg, Aaron Szafer, Nicholas Cain, Hongkui Zeng, Michael Hawrylycz, et al. Generalized leaky integrate-and-fire models classify multiple neuron types. Nature communications, 9(1):709, 2018.
  • [59] Hedi Ben-Younes, Rémi Cadene, Matthieu Cord, and Nicolas Thome. Mutan: Multimodal tucker fusion for visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2612–2620, 2017.
  • [60] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2927–2936, 2015.
  • [61] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. Advances in neural information processing systems, 26, 2013.
  • [62] Wenjia Xu, Yongqin Xian, Jiuniu Wang, Bernt Schiele, and Zeynep Akata. Attribute prototype network for zero-shot learning. Advances in Neural Information Processing Systems, 33:21969–21980, 2020.
  • [63] Yongqin Xian, Saurabh Sharma, Bernt Schiele, and Zeynep Akata. f-vaegan-d2: A feature generating framework for any-shot learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10275–10284, 2019.
  • [64] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence, 38(7):1425–1438, 2015.
  • [65] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497, 2015.
  • [66] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [67] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin Wilson. Cnn architectures for large-scale audio classification. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2017.
  • [68] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark, 2016.
Wenrui Li received the B.S. degree from the School of Information and Software Engineering, University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2021. He is currently working toward the Ph.D. degree with the School of Computer Science, Harbin Institute of Technology (HIT), Harbin, China. His research interests include multimedia search, joint source-channel coding, and spiking neural networks. He has authored or co-authored more than 15 technical articles in refereed international journals and conferences. He also serves as a reviewer for IEEE TCSVT, IEEE TMM, NeurIPS, ECCV, AAAI, and ACM MM.
Penghong Wang received the M.S. degree in computer science and technology from Taiyuan University of Science and Technology, Taiyuan, China, in 2020. He is currently pursuing the Ph.D. degree with the School of Computer Science, Harbin Institute of Technology, Harbin, China. His main research interests include wireless sensor networks, joint source-channel coding, and computer vision. He has authored or co-authored more than 20 technical articles in refereed international journals and conferences. He also serves as a reviewer for IEEE TVT, IEEE TAES, IEEE CE, IEEE IOTJ, NeurIPS, ECCV, AAAI, and ACM MM.
Ruiqin Xiong (Senior Member, IEEE) received the B.S. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2001, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2007. From 2002 to 2007, he was a Research Intern with Microsoft Research Asia. From 2007 to 2009, he was a Senior Research Associate with the University of New South Wales, Sydney, NSW, Australia. In 2010, he joined the School of Electronic Engineering and Computer Science, Peking University, Beijing, where he is currently a Professor. He has authored or coauthored more than 140 technical articles in refereed international journals and conferences. His research interests include image and video processing, statistical image modeling, deep learning, neuromorphic camera, and computational imaging. He was a recipient of the Best Student Paper Award from the SPIE Conference on Visual Communications and Image Processing in 2005 and the Best Paper Award from the IEEE Visual Communications and Image Processing in 2011. He was a co-recipient of the Best Student Paper Award from the IEEE Visual Communications and Image Processing in 2017.
Xiaopeng Fan (Senior Member, IEEE) received the B.S. and M.S. degrees from the Harbin Institute of Technology (HIT), Harbin, China, in 2001 and 2003, respectively, and the Ph.D. degree from the Hong Kong University of Science and Technology, Hong Kong, in 2009. In 2009, he joined HIT, where he is currently a Professor. From 2003 to 2005, he was with Intel Corporation, China, as a Software Engineer. From 2011 to 2012, he was with Microsoft Research Asia, as a Visiting Researcher. From 2015 to 2016, he was with the Hong Kong University of Science and Technology, as a Research Assistant Professor. He has authored one book and more than 170 articles in refereed journals and conference proceedings. His research interests include video coding and transmission, image processing, and computer vision. He was the Program Chair of PCM2017, Chair of IEEE SGC2015, and Co-Chair of MCSN2015. He was an Associate Editor for IEEE 1857 Standard in 2012. He was the recipient of Outstanding Contributions to the Development of IEEE Standard 1857 by IEEE in 2013.