
Training High-Performance Low-Latency Spiking Neural Networks by Differentiation on Spike Representation

Qingyan Meng1,2, Mingqing Xiao3, Shen Yan4, Yisen Wang3,5, Zhouchen Lin3,5,6, Zhi-Quan Luo1,2
1The Chinese University of Hong Kong, Shenzhen  2Shenzhen Research Institute of Big Data
3Key Lab. of Machine Perception (MoE), School of Artificial Intelligence, Peking University
4Center for Data Science, Peking University  5Institute for Artificial Intelligence, Peking University
6Peng Cheng Laboratory
[email protected], {mingqing_xiao, yanshen, yisen.wang, zlin}@pku.edu.cn,
[email protected]
Corresponding author.
Abstract

The Spiking Neural Network (SNN) is a promising energy-efficient AI model when implemented on neuromorphic hardware. However, it is a challenge to train SNNs efficiently due to their non-differentiability. Most existing methods either suffer from high latency (i.e., long simulation time steps) or cannot achieve as high performance as Artificial Neural Networks (ANNs). In this paper, we propose the Differentiation on Spike Representation (DSR) method, which achieves performance competitive with ANNs yet with low latency. First, we encode the spike trains into a spike representation using (weighted) firing rate coding. Based on the spike representation, we systematically derive that the spiking dynamics with common neural models can be represented as a sub-differentiable mapping. With this viewpoint, the proposed DSR method trains SNNs through gradients of the mapping and avoids the common non-differentiability problem in SNN training. We then analyze the error of representing the specific mapping with the forward computation of the SNN. To reduce this error, we propose to train the spike threshold in each layer and to introduce a new hyperparameter for the neural models. With these components, the DSR method can achieve state-of-the-art SNN performance with low latency on both static and neuromorphic datasets, including CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10.

1 Introduction

Inspired by biological neurons that communicate using spikes, Spiking Neural Networks (SNNs) have recently received surging attention. Their promise lies in energy-efficient computation on neuromorphic hardware [35, 7, 39], whereas deep Artificial Neural Networks (ANNs) require substantial power consumption.

However, the training of SNNs is a major challenge [49] since information in SNNs is transmitted through non-differentiable spike trains. Specifically, the non-differentiability in SNN computation hampers the effective use of gradient-based backpropagation methods. To tackle this problem, the surrogate gradient (SG) method [37, 52, 45, 16, 59] and the ANN-to-SNN conversion method [4, 10, 44, 42, 56] have been proposed and have yielded the best performance. In the SG method, an SNN is regarded as a recurrent neural network (RNN) and trained with the backpropagation through time (BPTT) framework; during backpropagation, the gradients of the non-differentiable spike functions are approximated by surrogate gradients. Although the SG method can train SNNs with low latency (i.e., short simulation time steps), it cannot achieve performance comparable to leading ANNs. Besides, the adopted BPTT framework needs to backpropagate gradients through both the layer-by-layer spatial domain and the temporal domain, leading to long training time and high memory cost, which further limits the use of large-scale network architectures. On the other hand, the ANN-to-SNN conversion method directly determines the network weights of an SNN from a corresponding ANN, relying on the connection between the firing rates of the SNN and the activations of the ANN. The conversion method enables the obtained SNN to perform nearly as well as its ANN counterpart. However, intolerably high latency is typically required, since only a large number of time steps can make the firing rates closely approach the high-precision activation values of ANNs [42, 22]. Overall, SNNs obtained by these two widely used methods either cannot compete with their ANN counterparts or suffer from high latency.

Table 1: Comparison of the ANN-to-SNN conversion, surrogate gradient (SG), and DSR methods with respect to latency, performance with low latency, and applicability to neuromorphic data.
Property                       Conversion        SG           DSR
Latency                        High              Low          Low
Performance w/ low latency     Low               Medium       High
Neuromorphic data              Non-applicable    Applicable   Applicable

In this paper, we overcome both the low performance and the high latency issues by introducing the Differentiation on Spike Representation (DSR) method to train SNNs. First, we treat the (weighted) firing rates of the spiking neurons as the spike representation. Based on this representation, we show that the forward computation of an SNN with common spiking neurons can be represented as a sub-differentiable mapping. We then derive the backpropagation algorithm while treating the spike representation as the information carrier. In this way, our method encodes the temporal information into the spike representation and backpropagates through sub-differentiable mappings of it, avoiding the calculation of gradients at each time step. To effectively train SNNs with low latency, we further study the representation error caused by the SNN-to-mapping approximation, and propose to train the spike thresholds and to introduce a new hyperparameter for the spiking neural models to reduce this error. With these methods, we can train high-performance low-latency SNNs. Tab. 1 compares the properties of the DSR method with those of the other methods. Our main contributions are summarized as follows:

  • 1.

    We systematically study the spike representation for common spiking neural models, and propose the DSR method that uses the representation to train SNNs by backpropagation. The proposed method avoids the non-differentiability problem in SNN training and does not require the costly error backpropagation through the temporal domain.

  • 2.

    We propose to train the spike thresholds and introduce a new hyperparameter for the spiking neural models to reduce the representation error. The two techniques greatly help the DSR method to train SNNs with high performance and low latency.

  • 3.

    Our model achieves competitive or state-of-the-art (SOTA) SNN performance with low latency on the CIFAR-10 [29], CIFAR-100 [29], ImageNet [8], and DVS-CIFAR10 [32] datasets. Furthermore, the experiments also demonstrate the effectiveness of the DSR method under ultra-low latency and with deep network structures.

2 Related Work

Many works seek biological plausibility in training SNNs [5, 26, 31] using learning rules derived from the Hebbian rule [21]. However, these methods cannot achieve competitive performance and are not applicable to complicated datasets. Besides the brain-inspired methods, SNN learning methods can be mainly categorized into two classes: ANN-to-SNN conversion [10, 56, 44, 42, 24, 11, 27, 19, 18, 12] and direct training [2, 23, 58, 59, 36, 53, 52, 1, 37, 57, 55, 45, 16, 13, 14]. We discuss both the conversion and the direct training methods, and then analyze the information representation used in them.

ANN-to-SNN Conversion

The feasibility of the conversion method relies on the fact that the firing rates of an SNN can be estimated by the activations of an ANN with corresponding architecture and weights [42]. With this method, the parameters of a target SNN are directly determined from a source ANN, and the performance of the target SNN is expected to be close to that of the source ANN. Many effective techniques have been proposed to reduce the performance gap, such as weight normalization [44], temporal switch coding [18], the rate norm layer [12], and bias shift [10]. Recently, the conversion method has achieved high-performance ANN-to-SNN conversion [33, 56, 10], even on ImageNet. However, the good performance comes at the expense of high latency, since only high latency can make the firing rates closely approach the high-precision activations. This fact hurts the energy efficiency of SNNs when using the conversion method. Furthermore, the conversion method is not suitable for neuromorphic data. In this paper, we borrow the idea of ANN-SNN mapping to design the backpropagation algorithm for training SNNs. However, unlike the usual ANN-to-SNN conversion methods, the proposed DSR method obtains high performance with low latency on both static and neuromorphic data.

Direct Training

Inspired by the immense success of gradient descent-based algorithms for training ANNs, some works regard an SNN as an RNN and directly train it with the BPTT method. This scheme typically leverages surrogate gradients to deal with the discontinuous spike functions [2, 52, 59, 37], or calculates the gradients of the loss with respect to spike times [36, 54, 60]. Of the two, the surrogate gradient method achieves better performance with lower latency [16, 59]. However, these approaches need to backpropagate error signals through time steps and thus suffer from high computational costs during training [9]. Furthermore, the inaccurate approximations for computing the gradients or the "dead neuron" problem [45] limit the training effect and the use of large-scale network architectures. The proposed method uses the spike representation to calculate the gradients of the loss and does not need to backpropagate errors through time steps; therefore, it avoids these common problems of direct training. A few works [50, 51] also use a similar idea of decoupling the forward and backward passes to train feedforward SNNs; however, they systematically analyze neither the representation schemes nor the representation error, and they cannot achieve accuracy comparable to ours, even with high latency.

Information Representation in SNNs

In SNNs, information is carried by some representation of spike trains [38]. There are mainly two representation schemes: temporal coding and rate coding, which treat exact firing times and firing rates, respectively, as the information carrier. Temporal coding is adopted by some direct training methods that calculate gradients with respect to spike times [36, 54, 60] and by a few ANN-to-SNN methods [18, 48]. With temporal coding, those methods typically enjoy low energy consumption on neuromorphic chips due to sparse spikes. However, they either require chip-unfriendly neuron settings [18, 48, 60] or only perform well on simple datasets. Rate coding is adopted by most ANN-to-SNN methods [10, 56, 44, 42, 11, 27, 19, 12] and many direct training methods [55, 51]. Rate coding-based methods typically achieve better performance than those with temporal coding. Furthermore, recent progress shows the potential of rate coding-based methods for low latency or sparse firing [55], making it possible to reach the same or even a better level of energy efficiency than the temporal coding scheme. In this paper, we adopt the rate coding scheme to train SNNs.

3 Proposed Differentiation on Spike Representation (DSR) Method

3.1 Spiking Neural Models

Spiking neurons imitate biological neurons that communicate with each other by spike trains. In this paper, we adopt the widely used integrate-and-fire (IF) model and leaky integrate-and-fire (LIF) model [3], both of which are simplified models characterizing the process of spike generation. Each IF or LIF neuron integrates the received spikes into its membrane potential V(t), and the dynamics of the membrane potential can be formally described as

\text{IF:}\quad \frac{\mathrm{d}V(t)}{\mathrm{d}t}=I(t), \qquad V<V_{th}, \qquad (1)
\text{LIF:}\quad \tau\frac{\mathrm{d}V(t)}{\mathrm{d}t}=-(V(t)-V_{rest})+I(t), \qquad V<V_{th}, \qquad (2)

where V_{rest} is the resting potential, \tau is the time constant, V_{th} is the spike threshold, and I is the input current, which is related to the received spikes. Once the membrane potential V exceeds the predefined threshold V_{th} at a time t_{f}, the neuron fires a spike and resets its membrane potential to the resting potential V_{rest}. The output spike train can be expressed using the Dirac delta function as s(t)=\sum_{t_{f}}\delta(t-t_{f}).

In practice, discretization for the dynamics is required. The discretized model is described as:

U[n]=f(V[n-1],I[n]), \qquad (3a)
s[n]=H(U[n]-V_{th}), \qquad (3b)
V[n]=U[n]-V_{th}s[n], \qquad (3c)

where U[n] is the membrane potential before resetting, s[n]\in\{0,1\} is the output spike, n=1,2,\cdots,N is the time step index, N is the latency, H(x) is the Heaviside step function, and f is the membrane potential update function. In the discretization, both V[0] and V_{rest} are set to 0 for simplicity, and therefore V_{th}>0. The function f(\cdot,\cdot) for the IF and LIF models can be described as:

\text{IF:}\quad f(V,I)=V+I, \qquad (4)
\text{LIF:}\quad f(V,I)=e^{-\frac{\Delta t}{\tau}}V+\left(1-e^{-\frac{\Delta t}{\tau}}\right)I, \qquad (5)

where \Delta t<\tau is the discretization step for the LIF model. In practice, we set \Delta t to be much less than \tau. Different from other literature [52, 17, 10], we explicitly introduce the hyperparameter \Delta t to ensure a large feasible region for \tau, since the discretization of the LIF model is only valid when \tau>\Delta t [17]. For example, \tau=1 is allowed in our setting, while some other works prohibit it since they set \Delta t=1. We use the "reduce by subtraction" method [55, 51] for resetting the membrane potential in Eq. (3c). Combining Eqs. (3a) and (3c), we get a more concise update rule for the membrane potential:

V[n]=f(V[n-1],I[n])-V_{th}s[n]. \qquad (6)

Eq. 6 is used to define the forward pass of SNNs.
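
For concreteness, the following short NumPy sketch (our own illustrative implementation, not the authors' released code; all parameter values are arbitrary) simulates a single IF or LIF neuron according to Eqs. (3a)-(3c) and (4)-(6):

import numpy as np

def simulate_neuron(input_current, v_th=1.0, model="LIF", tau=2.0, dt=0.5):
    """Simulate one IF/LIF neuron for N time steps and return its spike train."""
    input_current = np.asarray(input_current, dtype=float)
    v = 0.0                                   # V[0] = 0 and V_rest = 0, as in the paper
    spikes = np.zeros_like(input_current)
    decay = np.exp(-dt / tau)                 # only used by the LIF model
    for n, i_n in enumerate(input_current):
        # Eq. (3a): membrane potential before resetting
        if model == "IF":
            u = v + i_n                       # Eq. (4)
        else:
            u = decay * v + (1.0 - decay) * i_n   # Eq. (5)
        s = float(u >= v_th)                  # Eq. (3b): Heaviside firing condition
        spikes[n] = s
        v = u - v_th * s                      # Eq. (3c): reset by subtraction
    return spikes

# Example: a constant input current of 0.3 to an IF neuron with V_th = 1
print(simulate_neuron(np.full(10, 0.3), model="IF"))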

3.2 Forward Pass

In this paper, we consider L-layer feedforward SNNs with the IF or LIF models. According to Eqs. 6 and 4, the spiking dynamics of an SNN with the IF model can be described as:

\mathbf{V}^{i}[n]=\mathbf{V}^{i}[n-1]+V_{th}^{i-1}\mathbf{W}^{i}\mathbf{s}^{i-1}[n]-V_{th}^{i}\mathbf{s}^{i}[n], \qquad (7)

where i=1,2,\cdots,L is the layer index, \mathbf{s}^{0} are the input data to the network, \mathbf{s}^{i} are the output spike trains of the i^{\text{th}} layer for i=1,2,\cdots,L, and \mathbf{W}^{i} are the trainable synaptic weights from the (i-1)^{\text{th}} layer to the i^{\text{th}} layer. Spikes are generated according to Eq. (3b), and V_{th}^{i-1}\mathbf{W}^{i}\mathbf{s}^{i-1}[n] is treated as the input current to the i^{\text{th}} layer. Furthermore, the spike threshold is shared by all IF neurons in a particular layer. Similarly, according to Eqs. 6 and 5, the spiking dynamics of an SNN with the LIF model can be written as:

\mathbf{V}^{i}[n]=\exp\left(-\frac{\Delta t}{\tau^{i}}\right)\mathbf{V}^{i}[n-1]+\left(1-\exp\left(-\frac{\Delta t}{\tau^{i}}\right)\right)\frac{V_{th}^{i-1}}{\Delta t}\mathbf{W}^{i}\mathbf{s}^{i-1}[n]-V_{th}^{i}\mathbf{s}^{i}[n], \qquad (8)

where \Delta t is set to be a positive number much less than \tau^{i}, and it appears to simplify the analysis of the spike representation schemes in Sec. 3.3. In Eqs. 7 and 8, we only consider fully connected layers; however, other neural network components such as convolution, skip connections, and average pooling can also be adopted.

Figure 1: The pipeline of the proposed DSR method. The left part shows the forward computation of an SNN. The right part shows the error backpropagation through the sub-differentiable mapping g_{\mathbf{W}^{i}}.

The input \mathbf{s}^{0} to the SNN can be either neuromorphic data or static data (e.g., images). While neuromorphic data are naturally suited to SNNs, for static data we repeatedly apply the same input to the first layer at each time step [55, 59, 42]. With this method, the first layer can be treated as the spike-train data generator. We use the spike trains \mathbf{s}^{L} as the output data of the SNN; this setting is more biologically plausible than prohibiting firing in the last layer and using the membrane potentials as the network output [50, 30].

3.3 Spike Representation

In this subsection, we show that the forward computation of each layer of an SNN with IF or LIF neurons can be represented as a sub-differentiable mapping that uses the spike representation as the information carrier, where the spike representation is obtained by (weighted) firing rate coding. Specifically, denoting by \mathbf{s}^{i} the output spike trains of the i^{\text{th}} layer, the relationship between the SNN and the mapping can be expressed as

\operatorname{Rep}(\mathbf{s}^{i})\approx g_{\mathbf{W}^{i}}(\operatorname{Rep}(\mathbf{s}^{i-1})),\quad i=1,2,\cdots,L, \qquad (9)

where \operatorname{Rep}(\mathbf{s}^{i}) is the spike representation of \mathbf{s}^{i}, \mathbf{W}^{i} are the SNN parameters of the i^{\text{th}} layer, and g_{\mathbf{W}^{i}} is a sub-differentiable mapping also parameterized by \mathbf{W}^{i}. The SNN parameters \mathbf{W}^{i} can then be learned through gradients of g_{\mathbf{W}^{i}}. The SNN-to-mapping representation is illustrated in Fig. 1.

We first use weighted firing rate coding to derive the formulae of the spike representation \operatorname{Rep}(\mathbf{s}^{i}) and the sub-differentiable mapping g_{\mathbf{W}^{i}} for the LIF model. We then briefly introduce the formulae for the IF model, which are simple extensions of those for the LIF model. Training SNNs based on these spike representation schemes is described in Sec. 3.4.

3.3.1 Spike Representation for the LIF Model

We first consider the LIF model defined by Eqs. (3a) to (3c) and (5). To simplify the notation, define \lambda=\exp(-\frac{\Delta t}{\tau}). We further define \hat{I}[N]=\frac{\sum_{n=1}^{N}\lambda^{N-n}I[n]}{\sum_{n=1}^{N}\lambda^{N-n}} as the weighted average input current up to time step N, and \hat{a}[N]=\frac{V_{th}\sum_{n=1}^{N}\lambda^{N-n}s[n]}{\sum_{n=1}^{N}\lambda^{N-n}\Delta t} as the scaled weighted firing rate. We treat \hat{a}[N] as the spike representation of the spike train \{s[n]\}_{n=1}^{N} for the LIF model. The key idea is to directly determine the relationship between \hat{I}[N] and \hat{a}[N] through a (sub-)differentiable mapping.

In detail, combining Eqs. 6 and 5, and multiplying the combined equation by \lambda^{N-n}, we have

\lambda^{N-n}V[n]=\lambda^{N-n+1}V[n-1]+(1-\lambda)\lambda^{N-n}I[n]-\lambda^{N-n}V_{th}s[n]. \qquad (10)

Summing Eq. 10 over n=1 to N, we get

V[N]=(1-\lambda)\sum_{n=1}^{N}\lambda^{N-n}I[n]-\sum_{n=1}^{N}\lambda^{N-n}V_{th}s[n]. \qquad (11)

Dividing Eq. 11 by \Delta t\sum_{n=1}^{N}\lambda^{N-n} and rearranging the terms, we have

\hat{a}[N]=\frac{(1-\lambda)\hat{I}[N]}{\Delta t}-\frac{V[N]}{\Delta t\sum_{n=1}^{N}\lambda^{N-n}}. \qquad (12)

Note that we can further approximate \frac{1-\lambda}{\Delta t} in Eq. 12 by \frac{1}{\tau}, since \lim_{\Delta t\rightarrow 0}\frac{1-\lambda}{\Delta t}=\frac{1}{\tau} and we set \Delta t\ll\tau. Then we have

\hat{a}[N]\approx\frac{\hat{I}[N]}{\tau}-\frac{V[N]}{\Delta t\sum_{n=1}^{N}\lambda^{N-n}}. \qquad (13)

Eq. 13 is the basic formula for determining the mapping from \hat{I}[N] to \hat{a}[N]. Note that in Eq. 13, the term \frac{V[N]}{\Delta t\sum_{n=1}^{N}\lambda^{N-n}} cannot be determined from \hat{I}[N] alone. However, since \hat{a}[N]\in[0,\frac{V_{th}}{\Delta t}] and assuming V_{th} is small, we can ignore this term and further approximate \hat{a}[N] as

\lim_{N\rightarrow\infty}\hat{a}[N]\approx\operatorname{clamp}\left(\lim_{N\rightarrow\infty}\frac{\hat{I}[N]}{\tau},0,\frac{V_{th}}{\Delta t}\right), \qquad (14)

where \operatorname{clamp}(x,a,b)\triangleq\max(a,\min(x,b)). The detailed derivation and the mild assumptions behind Eq. 14 are given in the Supplementary Materials. Applying Eq. 14 to feedforward SNNs with multiple LIF neurons, we obtain Proposition 1, which is used to train SNNs.

Proposition 1.

Consider an SNN with LIF neurons defined by Eq. 8. Define \hat{\mathbf{a}}^{0}[N]=\frac{\sum_{n=1}^{N}\lambda_{i}^{N-n}\mathbf{s}^{0}[n]}{\sum_{n=1}^{N}\lambda_{i}^{N-n}\Delta t} and \hat{\mathbf{a}}^{i}[N]=\frac{V_{th}^{i}\sum_{n=1}^{N}\lambda_{i}^{N-n}\mathbf{s}^{i}[n]}{\sum_{n=1}^{N}\lambda_{i}^{N-n}\Delta t} for i=1,2,\cdots,L, where \lambda_{i}=\exp(-\frac{\Delta t}{\tau^{i}}). Further define the sub-differentiable mappings

\mathbf{z}^{i}=\operatorname{clamp}\left(\frac{1}{\tau^{i}}\mathbf{W}^{i}\mathbf{z}^{i-1},0,\frac{V_{th}^{i}}{\Delta t}\right),\quad i=1,2,\cdots,L.

If \lim_{N\rightarrow\infty}\hat{\mathbf{a}}^{i}[N]=\mathbf{z}^{i} for i=0,1,\cdots,L-1, then \hat{\mathbf{a}}^{i+1}[N] approximates \mathbf{z}^{i+1} when N\rightarrow\infty.
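
As an informal numerical check of Eq. 14 and Proposition 1 (our own illustration with arbitrarily chosen parameters, not an experiment from the paper), the following snippet simulates one LIF neuron under noisy input currents and compares the scaled weighted firing rate \hat{a}[N] with \operatorname{clamp}(\hat{I}[N]/\tau,0,V_{th}/\Delta t):

import numpy as np

rng = np.random.default_rng(0)
tau, dt, v_th, N = 2.0, 0.05, 0.2, 2000
lam = np.exp(-dt / tau)

I = 4.0 + 0.2 * rng.standard_normal(N)        # i.i.d. input currents I[n]
v, spikes = 0.0, np.zeros(N)
for n in range(N):
    u = lam * v + (1.0 - lam) * I[n]          # Eq. (5)
    s = float(u >= v_th)
    spikes[n] = s
    v = u - v_th * s                          # reset by subtraction

w = lam ** (N - 1 - np.arange(N))             # weights lambda^{N-n} for n = 1, ..., N
a_hat = v_th * np.sum(w * spikes) / (np.sum(w) * dt)     # scaled weighted firing rate
i_hat = np.sum(w * I) / np.sum(w)                        # weighted average input current
target = np.clip(i_hat / tau, 0.0, v_th / dt)            # clamp(I_hat / tau, 0, V_th / dt)
print(a_hat, target)

The two printed values should agree up to a gap of roughly V_{th}/\tau, consistent with the approximation error discussed in the Supplementary Materials.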

3.3.2 Spike Representation for the IF Model

We then consider the IF model defined by Eqs. (3a) to (3c) and (4). Define \bar{I}[N]=\frac{1}{N}\sum_{n=1}^{N}I[n] as the average input current up to time step N, and a[N]=\frac{1}{N}\sum_{n=1}^{N}V_{th}s[n] as the scaled firing rate. We treat a[N] as the spike representation of the spike train \{s[n]\}_{n=1}^{N} for the IF model. Using arguments similar to those in Sec. 3.3.1, the relationship between \bar{I}[N] and a[N] can be determined as

\lim_{N\rightarrow\infty}a[N]=\operatorname{clamp}\left(\lim_{N\rightarrow\infty}\bar{I}[N],0,V_{th}\right). \qquad (15)

The detailed assumptions and derivation for Eq. 15 are given in the Supplementary Materials. With Eq. 15, we obtain Proposition 2 for training feedforward SNNs with the IF model.

Proposition 2.

Consider an SNN with IF neurons defined by Eq. 7. Define \mathbf{a}^{0}[N]=\frac{1}{N}\sum_{n=1}^{N}\mathbf{s}^{0}[n] and \mathbf{a}^{i}[N]=\frac{1}{N}\sum_{n=1}^{N}V_{th}^{i}\mathbf{s}^{i}[n] for i=1,2,\cdots,L. Further define the sub-differentiable mappings:

\mathbf{z}^{i}=\operatorname{clamp}\left(\mathbf{W}^{i}\mathbf{z}^{i-1},0,V_{th}^{i}\right),\quad i=1,2,\cdots,L.

If \lim_{N\rightarrow\infty}\mathbf{a}^{0}[N]=\mathbf{z}^{0}, then \lim_{N\rightarrow\infty}\mathbf{a}^{i}[N]=\mathbf{z}^{i}.
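
Similarly, the following toy check (our own construction with random weights, not an experiment from the paper) simulates a two-layer fully connected IF network with the forward pass of Eq. 7 and compares its scaled firing rates with the mappings in Proposition 2; the gap should shrink as N grows. The real-valued input is fed at every time step, so \mathbf{a}^{0}[N] equals it exactly.

import numpy as np

rng = np.random.default_rng(1)
N, v_th = 1000, [1.0, 1.0]
W = [rng.uniform(-0.5, 1.0, size=(4, 3)), rng.uniform(-0.5, 1.0, size=(2, 4))]
x = rng.uniform(0.0, 1.0, size=3)              # limit of a^0[N]: a constant-rate input

# Ideal sub-differentiable mappings z^i = clamp(W^i z^{i-1}, 0, V_th^i)
z = x
for Wi, thr in zip(W, v_th):
    z = np.clip(Wi @ z, 0.0, thr)

# Simulated SNN with IF neurons following Eq. (7)
V = [np.zeros(4), np.zeros(2)]
counts = [np.zeros(4), np.zeros(2)]
for n in range(N):
    s_prev, scale = x, 1.0                     # s^0[n] is the real-valued input, unit scale
    for i in range(2):
        V[i] += scale * (W[i] @ s_prev)        # input current V_th^{i-1} W^i s^{i-1}[n]
        s = (V[i] >= v_th[i]).astype(float)    # Eq. (3b)
        V[i] -= v_th[i] * s                    # reset by subtraction
        counts[i] += s
        s_prev, scale = s, v_th[i]

a_last = v_th[1] * counts[1] / N               # scaled firing rate of the last layer
print(np.abs(a_last - z).max())                # small, and shrinks as N grows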

3.4 Differentiation on Spike Representation

In this subsection, we use the spike representations of the IF and LIF models to derive the backpropagation training algorithm for SNNs, based on Propositions 1 and 2. The procedure is illustrated in Fig. 1.

Define the spike representation operator \operatorname{r}(\cdot) on a spike train s=(s[1],\cdots,s[N]) such that \operatorname{r}(s)=\frac{1}{N}\sum_{n=1}^{N}V_{th}s[n] for the IF model and \operatorname{r}(s)=\frac{V_{th}\sum_{n=1}^{N}\lambda^{N-n}s[n]}{\sum_{n=1}^{N}\lambda^{N-n}\Delta t} for the LIF model, where \lambda=\exp(-\frac{\Delta t}{\tau}). With the spike representation, we define the final output of the SNN as \mathbf{o}^{L}=\operatorname{r}(\mathbf{s}^{L}), where \mathbf{s}^{L} are the output spike trains of the last layer and \operatorname{r}(\cdot) is applied element-wise. We use cross-entropy as the loss function \ell.

The proposed DSR method backpropagates error signals based on the representation of the spike trains in each layer, \mathbf{o}^{i}=\operatorname{r}(\mathbf{s}^{i}), where i=1,2,\cdots,L is the layer index. By the chain rule, the required gradient \frac{\partial\ell}{\partial\mathbf{W}^{i}} can be computed as

\frac{\partial\ell}{\partial\mathbf{W}^{i}}=\frac{\partial\ell}{\partial\mathbf{o}^{i}}\frac{\partial\mathbf{o}^{i}}{\partial\mathbf{W}^{i}},\qquad\frac{\partial\ell}{\partial\mathbf{o}^{i}}=\frac{\partial\ell}{\partial\mathbf{o}^{i+1}}\frac{\partial\mathbf{o}^{i+1}}{\partial\mathbf{o}^{i}}, \qquad (16)

where \frac{\partial\mathbf{o}^{i+1}}{\partial\mathbf{o}^{i}} and \frac{\partial\mathbf{o}^{i}}{\partial\mathbf{W}^{i}} can be computed with Propositions 1 and 2. Specifically, from Sec. 3.3, we have

\mathbf{o}^{i}=\operatorname{r}(\mathbf{s}^{i})\approx\operatorname{clamp}(\mathbf{W}^{i}\operatorname{r}(\mathbf{s}^{i-1}),0,b_{i}),\quad i=1,2,\cdots,L, \qquad (17)

where b_{i}=V_{th}^{i} for the IF model and b_{i}=\frac{V_{th}^{i}}{\Delta t} for the LIF model. Therefore, we can calculate \frac{\partial\mathbf{o}^{i+1}}{\partial\mathbf{o}^{i}} and \frac{\partial\mathbf{o}^{i}}{\partial\mathbf{W}^{i}} based on Eq. 17. The pseudocode of the proposed DSR method can be found in the Supplementary Materials.
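
The following PyTorch-style sketch shows one way to realize Eqs. 16 and 17 for a fully connected layer of IF neurons; it reflects our own simplified reading of the method rather than the authors' released code. The layer's forward value is the simulated spike representation, while gradients flow through the sub-differentiable clamp mapping. For brevity, the simulation drives the neurons with the time-averaged current \mathbf{W}^{i}\mathbf{o}^{i-1} instead of the actual spike trains \mathbf{s}^{i-1}[n] used in Algorithm 1.

import torch
import torch.nn as nn

class DSRLayerIF(nn.Module):
    """One fully connected layer of IF neurons trained with a DSR-style scheme."""

    def __init__(self, d_in, d_out, v_th=1.0, n_steps=20):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out, bias=False)   # synaptic weights W^i
        self.v_th = v_th
        self.n_steps = n_steps

    def forward(self, o_prev):
        # o_prev: spike representation r(s^{i-1}) of the previous layer, shape (batch, d_in).
        # Differentiable surrogate of Eq. (17): clamp(W^i o^{i-1}, 0, V_th^i).
        surrogate = torch.clamp(self.fc(o_prev), 0.0, self.v_th)

        # Non-differentiable forward pass: simulate the IF neurons and read out r(s^i).
        with torch.no_grad():
            current = self.fc(o_prev)
            v = torch.zeros_like(current)
            rate = torch.zeros_like(current)
            for _ in range(self.n_steps):
                v = v + current                        # Eqs. (3a) and (4)
                s = (v >= self.v_th).float()           # Eq. (3b)
                v = v - self.v_th * s                  # Eq. (3c)
                rate = rate + s
            o = self.v_th * rate / self.n_steps        # spike representation r(s^i)

        # Forward value: the simulated representation; backward: gradients of the surrogate.
        return surrogate + (o - surrogate).detach()

Because the returned value equals the simulated representation \operatorname{r}(\mathbf{s}^{i}) while its gradient equals that of the clamp surrogate, stacking such layers and applying standard automatic differentiation reproduces the chain rule in Eq. 16 without unrolling the simulation over time.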

With the proposed DSR method, we avoid two common problems in SNN training. First, the method does not require backpropagation through the temporal domain, which improves the training efficiency compared with BPTT-type methods, especially when the number of time steps is not ultra-small. Second, the method does not need to handle the non-differentiability of spike functions, since the signals are backpropagated through sub-differentiable mappings. Although there exists representation error due to the finite number of time steps, it can be reduced, as described in Sec. 4.

4 Reducing Representation Error

Propositions 1 and 2 show that the (weighted) firing rates gradually approach or converge to the output of a sub-differentiable mapping, and Sec. 3.4 shows that we can train SNNs by backpropagation using the spike representation. In practice, however, we want to simulate SNNs with only a small number of time steps for the sake of low energy consumption, and the low latency introduces representation error that hinders effective training. In this section, we study the representation error, and propose to train the spike thresholds and to introduce a new hyperparameter for the neural models in order to reduce this error.

The representation error e_{r} can be decomposed as e_{r}=e_{q}+e_{d}, where e_{q} is the "quantization error" and e_{d} is the "deviation error". The quantization error e_{q} arises from the finite precision of the firing rate when the input currents are assumed to be the same at all time steps; for the IF neuron, for example, the firing rate can only take values of the form \frac{n}{N}, where n\in\mathbb{N} and N is the number of time steps. The deviation error e_{d} arises from the inconsistency of the input currents at different time steps. For example, when the average input current is 0, the output firing rate is supposed to be 0; however, it can be significantly larger than 0 if the input currents are positive during the first few time steps.

Figure 2: The firing rate of the IF neuron approximates the clamp function given unchanged input currents. Spike generation of IF neurons without or with the new hyperparameter for spiking neurons is governed by Eqs. (3b) and (20), respectively. Here we set the threshold \theta=1 and the latency N=5.

From a statistical perspective, the expectation of e_{d} is 0, assuming i.i.d. input currents at different time steps. Therefore, the deviation error e_{d} will not affect training too much when stochastic optimization algorithms are used. Next, we analyze e_{q} for the IF model and then propose methods to reduce it; similar arguments can be derived for the LIF model. Given an unchanged input current I^{*} at all time steps to an IF neuron, the scaled average firing rate a[N]=\frac{1}{N}\sum_{n=1}^{N}V_{th}s[n] can be determined as

a[N]=\frac{V_{th}}{N}\cdot\operatorname{clamp}\left(\left\lfloor\frac{NI^{*}}{\theta}\right\rfloor,0,N\right), \qquad (18)

shown as the red curve in Fig. 2, where \lfloor\cdot\rfloor is the floor operator. Inspired by Eq. 18, we propose two methods to reduce the quantization error.

Training the Spike Threshold

From Fig. 2 and Eq. 18, we observe that using small spike thresholds can reduce the quantization error. However, small thresholds also weaken the approximation capacity of the SNN, since the scaled (weighted) firing rates are then restricted to a small range. Inspired by activation clipping methods for training quantized neural networks [6], we treat the spike threshold of each layer as a trainable parameter and include an L2-regularizer on the thresholds in the loss function to balance the tradeoff between quantization error and approximation capacity. To train the spike thresholds by backpropagation, we calculate the gradients with respect to them based on the spike representation introduced in Sec. 3.3. For example, using Eq. 15, for an IF neuron with average input current I^{*} and steady scaled firing rate a^{*}, we have

\frac{\partial a^{*}}{\partial V_{th}}=\begin{cases}1, & \text{if }I^{*}>V_{th},\\ 0, & \text{otherwise}.\end{cases} \qquad (19)

Then we can calculate the gradient of the loss function with respect to the threshold by the chain rule, and a similar calculation applies to LIF neurons. In practice, since we use mini-batch optimization methods to train SNNs, the gradient for each threshold is proportional to the batch size by the chain rule, so we scale the gradient according to the batch size and the spiking neural model.
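
As a hedged sketch of how threshold training can be implemented (our own illustration; the initialization and the regularizer weight are arbitrary choices, not values reported in the paper), V_{th}^{i} can be declared as a learnable parameter serving as the upper clamp bound in Eq. 17, so that automatic differentiation yields exactly the gradient in Eq. 19, and an L2 penalty on the thresholds is added to the loss:

import torch
import torch.nn as nn

class ClampLayerWithLearnableThreshold(nn.Module):
    def __init__(self, d_in, d_out, v_th_init=6.0):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out, bias=False)        # synaptic weights W^i
        self.v_th = nn.Parameter(torch.tensor(v_th_init))   # trainable spike threshold V_th^i

    def forward(self, x):
        # clamp(Wx, 0, V_th) with a learnable upper bound; d(output)/d(V_th) is 1 where
        # Wx exceeds V_th and 0 elsewhere, matching Eq. (19).
        return torch.minimum(torch.relu(self.fc(x)), self.v_th)

def total_loss(ce_loss, layers, reg_weight=1e-3):
    # Cross-entropy plus an L2 regularizer on all spike thresholds.
    return ce_loss + reg_weight * sum(layer.v_th ** 2 for layer in layers)

Since the per-element gradients are summed into the scalar threshold, the gradient magnitude grows with the batch size, which is why the gradient is rescaled for different batch sizes and spiking neural models as noted above.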

Introducing a New Hyperparameter for the Neural Models

We introduce a new hyperparameter for spiking neurons that controls the firing condition in order to reduce the quantization error. Formally, we replace Eq. (3b) with

s[n]=H(U[n]-\alpha V_{th}) \qquad (20)

to obtain a new firing mechanism, where \alpha\in[0,1] is a hyperparameter. For the IF model with the new firing mechanism and \alpha=0.5, using the same notation as in Eq. 18, the scaled firing rate becomes

a[N]=\frac{V_{th}}{N}\cdot\operatorname{clamp}\left(\left[\frac{NI^{*}}{\theta}\right],0,N\right), \qquad (21)

shown as the green curve in Fig. 2, where [\cdot] is the rounding operator. From Fig. 2, we can see that the maximum absolute quantization error is halved under this mechanism. Furthermore, since \alpha=0.5 minimizes the average absolute quantization error, it is the best choice for the IF model. For the LIF model, the best choice of \alpha depends on the latency N, so we choose \alpha in our experiments to minimize the average absolute quantization error.
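
A quick numerical illustration of Eqs. 18 and 21 (our own, with threshold 1 and N=5 to match the setting of Fig. 2): simulating an IF neuron under a constant input current shows that firing at U[n]\geq 0.5V_{th} (Eq. 20 with \alpha=0.5) replaces the floor in Eq. 18 by rounding and roughly halves the maximum quantization error.

import numpy as np

def if_rate(i_const, v_th=1.0, n_steps=5, alpha=1.0):
    """Scaled firing rate of an IF neuron under a constant input current."""
    v, fires = 0.0, 0
    for _ in range(n_steps):
        v += i_const
        if v >= alpha * v_th:                  # Eq. (3b) when alpha = 1, Eq. (20) otherwise
            v -= v_th
            fires += 1
    return v_th * fires / n_steps              # scaled firing rate a[N]

xs = np.linspace(0.0, 1.2, 241)
ideal = np.clip(xs, 0.0, 1.0)                  # clamp(I*, 0, V_th)
err_floor = np.abs(np.array([if_rate(x) for x in xs]) - ideal)              # Eq. (18)
err_round = np.abs(np.array([if_rate(x, alpha=0.5) for x in xs]) - ideal)   # Eq. (21)
print(err_floor.max(), err_round.max())        # roughly V_th/N = 0.2 vs. V_th/(2N) = 0.1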

Table 2: Performance on CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10. For the first three datasets, we categorize the methods into 4 classes: ANN, the ANN-to-SNN method, the direct training method, and our proposed method. Different types of methods are separated by horizontal lines. We bold the best result for the LIF model, and underline the best result for the IF model.
Dataset       Method                 Network            Neural Model            Time Steps   Accuracy
CIFAR-10      ANN^1                  PreAct-ResNet-18   /                       /            95.41%
              ANN-to-SNN [10]        ResNet-20          IF                      128          93.56%
              ANN-to-SNN [19]        VGG-16             IF                      2048         93.63%
              ANN-to-SNN [56]        VGG-like           IF                      600          94.20%
              Tandem Learning [51]   CIFARNet           IF                      8            90.98%
              ASF-BP [50]            VGG-7              IF                      400          91.35%
              STBP [52]              CIFARNet           LIF                     12           90.53%
              IDE [55]               CIFARNet-F         LIF                     100          92.52% ± 0.17%
              STBP-tdBN [59]         ResNet-19          LIF                     6            93.16%
              TSSL-BP [58]           CIFARNet           LIF w/ synaptic model   5            91.41%
              DSR (ours)             PreAct-ResNet-18   IF                      20           95.24% ± 0.17%
              DSR (ours)             PreAct-ResNet-18   LIF                     20           95.40% ± 0.15%
CIFAR-100     ANN^1                  PreAct-ResNet-18   /                       /            78.12%
              ANN-to-SNN [10]        ResNet-20          IF                      400-600      69.82%
              ANN-to-SNN [19]        VGG-16             IF                      768          70.09%
              ANN-to-SNN [56]        VGG-like           IF                      300          71.84%
              Hybrid Training [41]   VGG-11             LIF                     125          67.84%
              DIET-SNN [40]          VGG-16             LIF                     5            69.67%
              IDE [55]               CIFARNet-F         LIF                     100          73.07% ± 0.21%
              DSR (ours)             PreAct-ResNet-18   IF                      20           78.20% ± 0.13%
              DSR (ours)             PreAct-ResNet-18   LIF                     20           78.50% ± 0.12%
ImageNet      ANN^1                  PreAct-ResNet-18   /                       /            70.79%
              ANN-to-SNN [44]        ResNet-34          IF                      2000         65.47%
              ANN-to-SNN [19]        ResNet-34          IF                      4096         69.89%
              Hybrid Training [41]   ResNet-34          LIF                     250          61.48%
              STBP-tdBN [59]         ResNet-34          LIF                     6            63.72%
              SEW ResNet [16]        SEW ResNet-34      LIF                     4            67.04%
              SEW ResNet [16]        SEW ResNet-18      LIF                     4            63.18%
              DSR (ours)             PreAct-ResNet-18   IF                      50           67.74%
DVS-CIFAR10   ASF-BP [50]            VGG-7              IF                      50           62.50%
              Tandem Learning [51]   7-layer CNN        IF                      20           65.59%
              STBP [53]              7-layer CNN        LIF                     40           60.50%
              STBP-tdBN [59]         ResNet-19          LIF                     10           67.80%
              Fang et al. [17]       7-layer CNN        LIF                     20           74.80%
              DSR (ours)             VGG-11             IF                      20           75.03% ± 0.39%
              DSR (ours)             VGG-11             LIF                     20           77.27% ± 0.24%
^1 Self-implemented results for ANN.

5 Experiments

We first evaluate the proposed DSR method and compare it with other works on visual object recognition benchmarks, including CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10. We then demonstrate the effectiveness of our method as the number of time steps decreases and as the network becomes deeper, and we test the effectiveness of the methods for reducing the representation error. Please refer to the Supplementary Materials for experiment details. Our code is available at https://github.com/qymeng94/DSR.

5.1 Comparison to the State-of-the-Art

The comparison on CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10 is shown in Tab. 2.

For the CIFAR-10 and CIFAR-100 datasets, we use pre-activation ResNet-18 [20] as the network architecture. Tab. 2 shows that the proposed DSR method outperforms all other methods on CIFAR-10 and CIFAR-100 with 20 time steps for both the IF and the LIF models, based on 3 runs of experiments. In particular, our method achieves accuracies 5%-10% higher than other methods on CIFAR-100. Furthermore, the obtained SNNs have similar or even better performance compared with ANNs with the same network architectures. Although some direct training methods use fewer time steps than ours, our method still achieves better performance when the number of time steps is N=15, 10, and 5, as shown in Fig. 3(a).

For the ImageNet dataset, we also use the pre-activation ResNet-18 network architecture. To accelerate training, we adopt the hybrid training technique [41, 40], and considering the data complexity and the 1000 classes, we use a moderate number of time steps to achieve satisfactory results. Our proposed method outperforms the direct training methods even when they use larger network architectures. Although some ANN-to-SNN methods reach better accuracy, they use many more time steps than ours.

For the neuromorphic DVS-CIFAR10, we adopt the VGG-11 architecture [46] and conduct 3 runs of experiments for each neural model. It can be found in Tab. 2 that the proposed method outperforms other SOTA methods with low latency using both the IF and the LIF models.

5.2 Model Validation and Ablation Study

Figure 3: (a) Accuracies achieved by DSR method on CIFAR-10 with low latency. The PreAct-ResNet-18 network architecture is used. Results are based on 3 runs of experiments. (b) Accuracies achieved by DSR method on CIFAR-10 with different network architectures. The number of time steps is 20. Narrow network architectures are used.
Effectiveness of the Proposed Method with Low Latency

We validate that the proposed method can achieve competitive performance even with ultra-low latency, as shown in Fig. 3(a). Each model is trained from scratch. From 20 down to 5 time steps, our models suffer an accuracy drop of less than 1%. The results with 5 time steps also outperform the other SOTA results shown in Tab. 2. More training details can be found in the Supplementary Materials.

Effectiveness of the Proposed Method with Deep Network Structure

Many SNN learning methods cannot adapt to deep network architectures, limiting the potential of SNNs: the error of gradient approximation or ANN-to-SNN conversion accumulates through layers, or the methods are computationally too expensive for large-scale network structures. In this part, we test the proposed method on CIFAR-10 using pre-activation ResNets with different depths, namely 20, 32, 44, 56, and 110 layers. Note that the channel sizes are smaller than in PreAct-ResNet-18, since deep networks with channel sizes as large as PreAct-ResNet-18's perform only marginally better and are harder to train, even as ANNs. More details about the network architectures can be found in the Supplementary Materials. Results are shown in Fig. 3(b). The figure shows that our method is effective on deep networks (>100 layers) and performs better with deeper network structures, which indicates the great potential of our method to achieve more advanced performance when using very deep networks.

Ablation Study on Methods to Reduce Representation Error

We conduct an ablation study on the representation error reduction methods, namely training the threshold and introducing a new hyperparameter for the neural models. The models are trained on CIFAR-10 with the PreAct-ResNet-18 structure and 20 time steps, and the results are shown in Tab. 3. The experiments imply that the representation error significantly hinders training and demonstrate the effectiveness of the two methods for reducing it. Note that the threshold training method also helps stabilize training, since the results become unstable for large thresholds without it (e.g., the standard deviation is 1.84% when V_{th}=6). Furthermore, the average accuracy without either method is better than that with only the firing mechanism modification, possibly due to the instability of the results when V_{th} is large.

Table 3: Ablation study on the representation error reduction methods on CIFAR-10. The PreAct-ResNet-18 architecture with the IF model is used, and results are based on 3 runs of experiments. ‘F’ means firing mechanism modification, and ‘T’ means threshold training.
Setting                       Accuracy
DSR, init. V_th = 6           95.24% ± 0.17%
DSR w/o F, init. V_th = 6     92.88% ± 0.25%
DSR w/o T, V_th = 6           90.45% ± 1.84%
DSR w/o T, V_th = 2           90.47% ± 0.12%
DSR w/o F&T, V_th = 6         92.59% ± 0.81%

6 Conclusion and Discussions

In this work, we show that the forward computation of SNNs can be represented as a sub-differentiable mapping. Based on this SNN-to-mapping representation, we propose the DSR method for training SNNs, which avoids the non-differentiability problem in SNN training and does not require backpropagation through the temporal domain. We also analyze the representation error caused by the small number of time steps, and propose to train the thresholds and to introduce a new hyperparameter for the IF and LIF models to reduce it. With these error reduction methods, we can train SNNs with low latency using the DSR method. Experiments show that the proposed method achieves SOTA performance on mainstream vision tasks and demonstrate its effectiveness under ultra-low latency and with very deep network structures.

Societal impact and limitations. As for societal impact, there is no direct negative societal impact since this work only focuses on training SNNs. In fact, the development of high-performance low-latency SNNs allows SNNs to replace ANNs in some real-world tasks. This replacement will alleviate the huge energy consumption by ANNs and reduce carbon dioxide emissions. As for limitations, the DSR method may suffer from a certain degree of performance drop when the latency is extremely low (e.g., with only 2 or 3 time steps), since the method requires relatively accurate spike representation to conduct backpropagation.

Acknowledgment

We thank Jiancong Xiao and Zeyu Qin for useful discussion. We thank Jin Wang for pointing out typos. The work of Z.-Q. Luo was supported by the National Natural Science Foundation of China under Grant 61731018, and the Guangdong Provincial Key Laboratory of Big Data Computation Theories and Methods. Z. Lin was supported by the NSF China (No. 61731018), NSFC Tianyuan Fund for Mathematics (No. 12026606), Project 2020BD006 supported by PKU-Baidu Fund, and Qualcomm. The work of Yisen Wang was partially supported by the National Natural Science Foundation of China under Grant 62006153, and Project 2020BD006 supported by PKU-Baidu Fund.

References

  • [1] Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In NIPS, 2018.
  • [2] Sander M Bohte, Joost N Kok, and Johannes A La Poutré. Spikeprop: backpropagation for networks of spiking neurons. In ESANN, 2000.
  • [3] Anthony N Burkitt. A review of the integrate-and-fire neuron model: I. homogeneous synaptic input. Biological cybernetics, 95(1):1–19, 2006.
  • [4] Yongqiang Cao, Yang Chen, and Deepak Khosla. Spiking deep convolutional neural networks for energy-efficient object recognition. IJCV, 113(1):54–66, 2015.
  • [5] Natalia Caporale and Yang Dan. Spike timing–dependent plasticity: a hebbian learning rule. Annu. Rev. Neurosci., 31:25–46, 2008.
  • [6] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
  • [7] Mike Davies, Narayan Srinivasa, Tsung-Han Lin, Gautham Chinya, Yongqiang Cao, Sri Harsha Choday, Georgios Dimou, Prasad Joshi, Nabil Imam, Shweta Jain, et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.
  • [8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • [9] Lei Deng, Yujie Wu, Xing Hu, Ling Liang, Yufei Ding, Guoqi Li, Guangshe Zhao, Peng Li, and Yuan Xie. Rethinking the performance comparison between snns and anns. Neural Networks, 121:294–307, 2020.
  • [10] Shikuang Deng and Shi Gu. Optimal conversion of conventional artificial neural networks to spiking neural networks. In ICLR, 2021.
  • [11] Peter U Diehl, Daniel Neil, Jonathan Binas, Matthew Cook, Shih-Chii Liu, and Michael Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In IJCNN, 2015.
  • [12] Jianhao Ding, Zhaofei Yu, Yonghong Tian, and Tiejun Huang. Optimal ann-snn conversion for fast and accurate inference in deep spiking neural networks. In IJCAI, 2021.
  • [13] Steve K Esser, Rathinakumar Appuswamy, Paul Merolla, John V Arthur, and Dharmendra S Modha. Backpropagation for energy-efficient neuromorphic computing. In NeurIPS, 2015.
  • [14] Steven K. Esser, Paul A. Merolla, John V. Arthur, Andrew S. Cassidy, Rathinakumar Appuswamy, Alexander Andreopoulos, David J. Berg, Jeffrey L. McKinstry, Timothy Melano, Davis R. Barch, Carmelo di Nolfo, Pallab Datta, Arnon Amir, Brian Taba, Myron D. Flickner, and Dharmendra S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. PNAS, 113(41):11441–11446, 2016.
  • [15] Wei Fang, Yanqi Chen, Jianhao Ding, Ding Chen, Zhaofei Yu, Huihui Zhou, Yonghong Tian, and other contributors. Spikingjelly. https://github.com/fangwei123456/spikingjelly, 2020.
  • [16] Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian. Deep residual learning in spiking neural networks. In NIPS, 2021.
  • [17] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks. In ICCV, 2021.
  • [18] Bing Han and Kaushik Roy. Deep spiking neural network: Energy efficiency through time based coding. In ECCV, 2020.
  • [19] Bing Han, Gopalakrishnan Srinivasan, and Kaushik Roy. RMP-SNN: residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In CVPR, 2020.
  • [20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [21] Donald Olding Hebb. The organisation of behaviour: a neuropsychological theory. Science Editions New York, 1949.
  • [22] Yangfan Hu, Huajin Tang, Yueming Wang, and Gang Pan. Spiking deep residual network. arXiv preprint arXiv:1805.01352, 2018.
  • [23] Dongsung Huh and Terrence J. Sejnowski. Gradient descent for spiking neural networks. In NIPS, 2018.
  • [24] Eric Hunsberger and Chris Eliasmith. Spiking deep networks with lif neurons. arXiv preprint arXiv:1510.08829, 2015.
  • [25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [26] Saeed Reza Kheradpisheh, Mohammad Ganjtabesh, Simon J Thorpe, and Timothée Masquelier. Stdp-based spiking deep convolutional neural networks for object recognition. Neural Networks, 99:56–67, 2018.
  • [27] Sei Joon Kim, Seongsik Park, Byunggook Na, and Sungroh Yoon. Spiking-yolo: Spiking neural network for energy-efficient object detection. In AAAI, 2020.
  • [28] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [29] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [30] Chankyu Lee, Syed Shakib Sarwar, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in Neuroscience, 14:119, 2020.
  • [31] Robert Legenstein, Dejan Pecevski, and Wolfgang Maass. A learning theory for reward-modulated spike-timing-dependent plasticity with application to biofeedback. PLoS computational biology, 4(10):e1000180, 2008.
  • [32] Hongmin Li, Hanchao Liu, Xiangyang Ji, Guoqi Li, and Luping Shi. Cifar10-dvs: an event-stream dataset for object classification. Frontiers in neuroscience, 11:309, 2017.
  • [33] Yuhang Li, Shikuang Deng, Xin Dong, Ruihao Gong, and Shi Gu. A free lunch from ann: Towards efficient, accurate spiking neural networks calibration. In ICML, 2021.
  • [34] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In ICLR, 2017.
  • [35] Paul A Merolla, John V Arthur, Rodrigo Alvarez-Icaza, Andrew S Cassidy, Jun Sawada, Filipp Akopyan, Bryan L Jackson, Nabil Imam, Chen Guo, Yutaka Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
  • [36] Hesham Mostafa. Supervised learning based on temporal coding in spiking neural networks. TNNLS, 29(7):3227–3235, 2017.
  • [37] Emre O Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019.
  • [38] Stefano Panzeri and Simon R Schultz. A unified approach to the study of temporal, correlational, and rate coding. Neural Computation, 13(6):1311–1349, 2001.
  • [39] Jing Pei, Lei Deng, Sen Song, Mingguo Zhao, Youhui Zhang, Shuang Wu, Guanrui Wang, Zhe Zou, Zhenzhi Wu, Wei He, et al. Towards artificial general intelligence with hybrid tianjic chip architecture. Nature, 572(7767):106–111, 2019.
  • [40] Nitin Rathi and Kaushik Roy. Diet-snn: Direct input encoding with leakage and threshold optimization in deep spiking neural networks. arXiv preprint arXiv:2008.03658, 2020.
  • [41] Nitin Rathi, Gopalakrishnan Srinivasan, Priyadarshini Panda, and Kaushik Roy. Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In ICLR, 2019.
  • [42] Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in neuroscience, 11:682, 2017.
  • [43] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
  • [44] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience, 13:95, 2019.
  • [45] Sumit Bam Shrestha and Garrick Orchard. SLAYER: spike layer error reassignment in time. In NIPS, 2018.
  • [46] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2014.
  • [47] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958, 2014.
  • [48] Christoph Stöckl and Wolfgang Maass. Optimized spiking neurons can classify images with high accuracy through temporal coding with two spikes. Nature Machine Intelligence, pages 1–9, 2021.
  • [49] Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothée Masquelier, and Anthony Maida. Deep learning in spiking neural networks. Neural Networks, 111:47–63, 2019.
  • [50] Hao Wu, Yueyi Zhang, Wenming Weng, Yongting Zhang, Zhiwei Xiong, Zheng-Jun Zha, Xiaoyan Sun, and Feng Wu. Training spiking neural networks with accumulated spiking flow. In AAAI, 2021.
  • [51] Jibin Wu, Yansong Chua, Malu Zhang, Guoqi Li, Haizhou Li, and Kay Chen Tan. A tandem learning rule for effective training and rapid inference of deep spiking neural networks. TNNLS, 2021.
  • [52] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12:331, 2018.
  • [53] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, Yuan Xie, and Luping Shi. Direct training for spiking neural networks: Faster, larger, better. In AAAI, 2019.
  • [54] Timo C Wunderlich and Christian Pehle. Event-based backpropagation can compute exact gradients for spiking neural networks. Scientific Reports, 11(1):1–17, 2021.
  • [55] Mingqing Xiao, Qingyan Meng, Zongpeng Zhang, Yisen Wang, and Zhouchen Lin. Training feedback spiking neural networks by implicit differentiation on the equilibrium state. In NIPS, 2021.
  • [56] Zhanglu Yan, Jun Zhou, and Weng-Fai Wong. Near lossless transfer learning for spiking neural networks. In AAAI, 2021.
  • [57] Yukun Yang, Wenrui Zhang, and Peng Li. Backpropagated neighborhood aggregation for accurate training of spiking neural networks. In ICML, 2021.
  • [58] Wenrui Zhang and Peng Li. Temporal spike sequence learning via backpropagation for deep spiking neural networks. In NIPS, 2020.
  • [59] Hanle Zheng, Yujie Wu, Lei Deng, Yifan Hu, and Guoqi Li. Going deeper with directly-trained larger spiking neural networks. In AAAI, 2021.
  • [60] Shibo Zhou, Xiaohua Li, Ying Chen, Sanjeev T Chandrasekaran, and Arindam Sanyal. Temporal-coded deep spiking neural network with easy training and robust performance. In AAAI, 2021.

Appendix A Details about Spike Representation

A.1 Derivation for Eq. 14

In this subsection, we consider the LIF model defined by Eqs. (3a) to (3c) and (5), and derive Eq. 14 from Eq. 13 in the main content under mild assumptions.

From the main content, we have derived that

\hat{a}[N]\approx\frac{\hat{I}[N]}{\tau}-\frac{V[N]}{\Delta t\sum_{n=1}^{N}\lambda^{N-n}}, \qquad (22)

as shown in Eq. 13. Since the LIF neuron is supposed to fire no or very few spikes when \frac{\hat{I}[N]}{\tau}<0 and to fire at almost every time step when \frac{\hat{I}[N]}{\tau}>\frac{V_{th}}{\Delta t}, we can separate the accumulated membrane potential V[N] into two parts: one part V^{-}[N] represents the "exceeded" membrane potential that does not contribute to spike firing, and the other part V^{+}[N] represents the "remaining" membrane potential. In detail, the "exceeded" membrane potential V^{-}[N] can be calculated as

V^{-}[N]=\begin{cases}\left(\Delta t\sum_{n=1}^{N}\lambda^{N-n}\right)\frac{\hat{I}[N]}{\tau}, & \frac{\hat{I}[N]}{\tau}<0,\\ \left(\Delta t\sum_{n=1}^{N}\lambda^{N-n}\right)\left(\frac{\hat{I}[N]}{\tau}-\frac{V_{th}}{\Delta t}\right), & \frac{\hat{I}[N]}{\tau}>\frac{V_{th}}{\Delta t},\\ 0, & \text{otherwise}.\end{cases} \qquad (23)

And the "remaining" membrane potential can be calculated as V^{+}[N]=V[N]-V^{-}[N]. With this decomposition of the membrane potential V[N] and the fact that \Delta t\sum_{n=1}^{N}\lambda^{N-n}=\tau when N\rightarrow\infty and \Delta t\rightarrow 0, we can further approximate \hat{a}[N] from Eq. 22 as

\lim_{N\rightarrow\infty}\hat{a}[N]\approx\lim_{N\rightarrow\infty}\operatorname{clamp}\left(\frac{\hat{I}[N]}{\tau}-\frac{V^{+}[N]}{\tau},0,\frac{V_{th}}{\Delta t}\right), \qquad (24)

if the limit of the right-hand side exists.

Now we want to find conditions under which the term \frac{V^{+}[N]}{\tau} in Eq. 24 can be ignored. In the case V^{-}[N]\neq 0, the magnitude of the membrane potential |V[n]| would gradually increase with time. After introducing V^{-}, the "remaining" membrane potential V^{+}[N] typically does not diverge over time. In fact, V^{+}[N] is typically bounded in [0,V_{th}] when N\rightarrow\infty, except in the extreme case where the input currents at different time steps are distributed extremely unevenly. So we simply assume that V^{+}[N]\in[0,V_{th}]. Furthermore, if we set a threshold V_{th} significantly smaller than the magnitude of \hat{I}[N], the term \frac{V^{+}[N]}{\tau} can be ignored. Then from Eq. 24, we have

\lim_{N\rightarrow\infty}\hat{a}[N]\approx\operatorname{clamp}\left(\lim_{N\rightarrow\infty}\frac{\hat{I}[N]}{\tau},0,\frac{V_{th}}{\Delta t}\right). \qquad (25)

That is, we can approximate \hat{a}[N] by \operatorname{clamp}\left(\frac{\hat{I}[N]}{\tau},0,\frac{V_{th}}{\Delta t}\right), with an approximation error bounded by \frac{V_{th}}{\tau} when N\rightarrow\infty.

In summary, we derive Eq. 14 from Eq. 13 in the main content under the following mild conditions:

  • 1.

    The LIF neuron fires no or finitely many spikes as N\rightarrow\infty when \frac{\hat{I}[N]}{\tau}<0, and it fails to fire at only finitely many time steps as N\rightarrow\infty when \frac{\hat{I}[N]}{\tau}>\frac{V_{th}}{\Delta t}.

  • 2.

    V^{+}[N]\in[0,V_{th}].

A.2 Derivation for Eq. 15

In this subsection, we consider the IF model defined by Eqs. (3a) to (3c) and (4) and derive Eq. 15 in the main content under mild assumptions.

Combining Eqs. (4) and (3c), and taking the summation over n=1n=1 to NN, we can get

V[N]V[0]=n=1NI[n]Vthn=1Ns[n].{}V[N]-V[0]=\sum_{n=1}^{N}I[n]-V_{th}\sum_{n=1}^{N}s[n]. (26)

Define the scaled firing rate until the time step NN as a[N]=1Nn=1NVths[n]a[N]=\frac{1}{N}\sum_{n=1}^{N}V_{th}s[n], and the average input current as I¯[N]=1Nn=1NI[n]\bar{I}[N]=\frac{1}{N}\sum_{n=1}^{N}I[n]. Dividing Eq. 26 by NN, we have

a[N]=I¯[N]V[N]N.{}a[N]=\bar{I}[N]-\frac{V[N]}{N}. (27)

Using arguments similar to those in Section A.1, we can get

limNa[N]=limNclamp(I¯[N]V+[N]N,0,Vth),\displaystyle\lim\limits_{N\rightarrow\infty}a[N]=\lim\limits_{N\rightarrow\infty}\operatorname{clamp}\left(\bar{I}[N]-\frac{V^{+}[N]}{N},0,V_{th}\right), (28)

if the limit of the right-hand side exists. Here V^{+}[N]=V[N]-V^{-}[N] and

V[N]=min(max(n=1NI[n]NVth,0),n=1NI[n]).{}\small V^{-}[N]=\min\left(\max\left(\sum_{n=1}^{N}I[n]-NV_{th},0\right),\sum_{n=1}^{N}I[n]\right). (29)

Similar to the LIF model, the “remaining” membrane potential V^{+}[N] for the IF model is typically bounded in [0,V_{th}] as N\rightarrow\infty, except in extreme cases. For example, if \bar{I}[N]>0 and the input current is non-zero only at the last \frac{N}{2} time steps, then V^{+}[N] can grow with N and become unbounded as N\rightarrow\infty. However, such extreme cases do not occur in SNN computation with normal input data. Therefore, we assume V^{+}[N]\in[0,V_{th}] and obtain

\displaystyle\lim\limits_{N\rightarrow\infty}a[N]=\operatorname{clamp}\left(\lim\limits_{N\rightarrow\infty}\bar{I}[N],0,V_{th}\right), (30)

if the limit of \bar{I}[N] exists.
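As a quick numerical check of Eq. 30, the sketch below simulates a single IF neuron with reset by subtraction under a constant input current and compares its scaled firing rate a[N] with \operatorname{clamp}(\bar{I},0,V_{th}). The discrete update used here is an assumed standard form of Eqs. (3a) to (3c) and (4), written only for illustration.

import numpy as np

def scaled_rate_vs_clamp(i_const, v_th=1.0, n_steps=1000):
    """Simulate one IF neuron under constant input and compare its scaled
    firing rate with clamp(I, 0, v_th)."""
    v, spike_count = 0.0, 0
    for _ in range(n_steps):
        v += i_const                   # integrate the input current (assumed update)
        s = int(v >= v_th)             # fire when the threshold is reached
        spike_count += s
        v -= v_th * s                  # reset by subtraction
    a = v_th * spike_count / n_steps   # scaled firing rate a[N]
    return a, float(np.clip(i_const, 0.0, v_th))

for i in (-0.2, 0.3, 0.7, 1.5):
    print(i, scaled_rate_vs_clamp(i))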

In summary, we derive Eq. 15 in the main content under the following mild conditions:

  1. The IF neuron fires no spikes or only finitely many spikes as N\rightarrow\infty when \bar{I}[N]<0, and it fails to fire at only a finite number of time steps as N\rightarrow\infty when \bar{I}[N]>V_{th}.

  2. V^{+}[N]\in[0,V_{th}].

Appendix B Pseudocode of the Proposed DSR Method

We present the pseudocode of one iteration of SNN training with the DSR method in Algorithm 1 for better illustration.

Algorithm 1 One iteration of SNN training with the proposed DSR method.
1:Input: Time steps N; network depth L; network parameters \mathbf{W}^{1},\cdots,\mathbf{W}^{L},V_{th}^{1},\cdots,V_{th}^{L}; input data \mathbf{x}; label \mathbf{y}; other hyperparameters.
2:Output: Trained network parameters \mathbf{W}^{1},\cdots,\mathbf{W}^{L},V_{th}^{1},\cdots,V_{th}^{L}.
3:
4:for n=1,2,,Nn=1,2,\cdots,N do
5:     for i=1,2,,Li=1,2,\cdots,L do
6:         if the IF model is used then
7:              Calculate 𝐬i[n]\mathbf{s}^{i}[n] by Eq. 7;
8:         else if the LIF model is used then
9:              Calculate 𝐬i[n]\mathbf{s}^{i}[n] by Eq. 8;
10:         end if
11:         if n=Nn=N then
12:              if the IF model is used then
13:                  \mathbf{o}^{i}=\frac{1}{N}\sum_{n=1}^{N}V_{th}^{i}\mathbf{s}^{i}[n];
14:              else if the LIF model is used then
15:                  \mathbf{o}^{i}=\frac{V_{th}^{i}\sum_{n=1}^{N}\lambda^{N-n}\mathbf{s}^{i}[n]}{\sum_{n=1}^{N}\lambda^{N-n}\Delta t};
16:              end if
17:         end if
18:     end for
19:end for
20:Calculate the loss \ell based on 𝐨L\mathbf{o}^{L} and 𝐲\mathbf{y}.
21:
22:Calculate 𝐨L\frac{\partial\ell}{\partial\mathbf{o}^{L}};
23:for i=L,L1,,1i=L,L-1,\cdots,1 do
24:     Calculate \frac{\partial\mathbf{o}^{i}}{\partial\mathbf{o}^{i-1}}, \frac{\partial\mathbf{o}^{i}}{\partial\mathbf{W}^{i}}, and \frac{\partial\mathbf{o}^{i}}{\partial V_{th}^{i}} by Eq. 17;
25:     𝐖i=𝐨i𝐨i𝐖i\frac{\partial\ell}{\partial\mathbf{W}^{i}}=\frac{\partial\ell}{\partial\mathbf{o}^{i}}\frac{\partial\mathbf{o}^{i}}{\partial\mathbf{W}^{i}};
26:     Vthi=𝐨i𝐨iVthi\frac{\partial\ell}{\partial V_{th}^{i}}=\frac{\partial\ell}{\partial\mathbf{o}^{i}}\frac{\partial\mathbf{o}^{i}}{\partial V_{th}^{i}};
27:     if i1i\neq 1 then
28:         𝐨i1=𝐨i𝐨i𝐨i1\frac{\partial\ell}{\partial\mathbf{o}^{i-1}}=\frac{\partial\ell}{\partial\mathbf{o}^{i}}\frac{\partial\mathbf{o}^{i}}{\partial\mathbf{o}^{i-1}};
29:     end if
30:     Update 𝐖i,Vthi\mathbf{W}^{i},V_{th}^{i} based on 𝐖i\frac{\partial\ell}{\partial\mathbf{W}^{i}}, Vthi\frac{\partial\ell}{\partial V_{th}^{i}}.
31:end for
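To complement the pseudocode, the following is a minimal single-layer PyTorch sketch of the mechanism behind Algorithm 1, simplified so that each layer receives the previous layer's spike representation as a constant input current. The forward pass simulates IF neurons and outputs the scaled firing rate, while the backward pass uses the sub-gradient of the mapping \operatorname{clamp}(x,0,V_{th}) from Eq. 15. The class names, the fully connected layer, and the exact treatment of the threshold gradient are illustrative assumptions, not the paper's implementation of Eq. 17.

import torch
import torch.nn as nn

class IFSpikeRepresentation(torch.autograd.Function):
    """Forward: simulate IF neurons and return the scaled firing rate.
    Backward: differentiate the surrogate mapping clamp(x, 0, v_th)."""

    @staticmethod
    def forward(ctx, x, v_th, n_steps):
        # x: (assumed) constant input current per time step, shape (batch, features)
        v = torch.zeros_like(x)
        acc = torch.zeros_like(x)
        for _ in range(n_steps):
            v = v + x                    # integrate the input current
            s = (v >= v_th).float()      # fire when the threshold is crossed
            v = v - v_th * s             # reset by subtraction
            acc = acc + s
        ctx.save_for_backward(x, v_th)
        return v_th * acc / n_steps      # scaled firing rate (spike representation)

    @staticmethod
    def backward(ctx, grad_out):
        x, v_th = ctx.saved_tensors
        inside = ((x >= 0) & (x <= v_th)).float()    # sub-gradient of clamp(x, 0, v_th)
        grad_x = grad_out * inside
        grad_vth = (grad_out * (x > v_th).float()).sum()
        return grad_x, grad_vth, None

class DSRLayer(nn.Module):
    """One fully connected spiking layer trained via its spike representation."""

    def __init__(self, in_features, out_features, n_steps=20):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)
        self.v_th = nn.Parameter(torch.tensor(1.0))   # trainable threshold
        self.n_steps = n_steps

    def forward(self, o_prev):
        x = self.fc(o_prev)              # input current from the previous representation
        return IFSpikeRepresentation.apply(x, self.v_th, self.n_steps)

Stacking several such layers and calling backward() on a loss then propagates gradients through the spike representations, in the spirit of the backward pass in Algorithm 1.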

Appendix C Implementation Details

C.1 Dataset Description and Preprocessing

CIFAR-10 and CIFAR-100

The CIFAR-10 dataset [29] contains 60,000 32×\times32 color images in 10 different classes, which can be separated into 50,000 training samples and 10,000 testing samples. We apply data normalization to ensure that input images have zero mean and unit variance. We apply random cropping and horizontal flipping for data augmentation. The CIFAR-100 dataset [29] is similar to CIFAR-10 except that there are 100 classes of objects. We use the same data preprocessing as CIFAR-10. These two datasets are licensed under MIT.
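For reference, the CIFAR preprocessing described above can be written with torchvision transforms roughly as follows; the normalization statistics and the 4-pixel crop padding are illustrative assumptions rather than values reported in this paper.

import torchvision.transforms as T

# Commonly used CIFAR-10 statistics; the paper only specifies zero-mean,
# unit-variance normalization, so these exact numbers are assumptions.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2470, 0.2435, 0.2616)

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),   # random cropping (padding size assumed)
    T.RandomHorizontalFlip(),      # horizontal flipping
    T.ToTensor(),
    T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])

test_transform = T.Compose([
    T.ToTensor(),
    T.Normalize(CIFAR10_MEAN, CIFAR10_STD),
])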

ImageNet

The ImageNet-1K dataset [8] spans 1000 object classes and contains 1,281,167 training images, 50,000 validation images and 100,000 test images. This dataset is licensed under Custom (non-commercial). We apply data normalization to ensure zero mean and unit variance for input images. Moreover, we apply random resized cropping and horizontal flipping for data augmentation.

Table 4: Network architectures for PreAct-ResNet-20, PreAct-ResNet-32, PreAct-ResNet-44, PreAct-ResNet-56, and PreAct-ResNet-110. Each network begins with a conv (3\times3, 16) layer and ends with average pooling followed by a 10-d fully connected layer. In between, the networks stack three groups of basic blocks, \left(\begin{array}{c}3\times 3,16\\ 3\times 3,16\end{array}\right), \left(\begin{array}{c}3\times 3,32\\ 3\times 3,32\end{array}\right), and \left(\begin{array}{c}3\times 3,64\\ 3\times 3,64\end{array}\right), with each group repeated 3, 5, 7, 9, or 18 times for the 20-, 32-, 44-, 56-, and 110-layer networks, respectively.
DVS-CIFAR10

The DVS-CIFAR10 dataset [32] is a neuromorphic dataset converted from CIFAR-10 using an event-based sensor. It contains 10,000 event-based images with a resolution of 128\times128 pixels. The images are in 10 classes, with 1,000 examples in each class. The dataset is licensed under CC BY 4.0. Since each spike train contains more than one million events, we split the events into 20 slices and integrate the events in each slice into one frame. More details about the transformation can be found in [17]. To conduct training and testing, we split the whole dataset into 9,000 training images and 1,000 test images. Both the event-to-frame integration and the data split are handled with the SpikingJelly [15] framework. We also reduce the spatial resolution from 128\times128 to 48\times48 and apply random cropping for data augmentation.
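As a rough sketch (not the exact script used in this paper), the event-to-frame integration with SpikingJelly can be written as follows; the arguments follow the CIFAR10DVS dataset class of the toolkit, whose exact signature may differ across SpikingJelly versions, so treat them as assumptions.

from spikingjelly.datasets.cifar10_dvs import CIFAR10DVS

# Integrate the raw event streams into 20 frames per sample;
# split_by='number' slices the events of each sample into groups
# containing roughly equal numbers of events.
dataset = CIFAR10DVS(
    root='./data/cifar10_dvs',
    data_type='frame',
    frames_number=20,
    split_by='number',
)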

C.2 Batch Normalization

Batch Normalization (BN) [25] is a widely used technique in the deep learning community to stabilize signal propagation and accelerate training. In this paper, BN is adopted in the network architectures. However, since the input data for SNNs have an additional time dimension when compared to input image data for ANNs, we need to make the BN components suitable for SNNs.

In this paper, we combine the time dimension and the batch dimension into one and then conduct BN. In detail, consider a batch of temporal data \mathbf{x}\in\mathbb{R}^{B\times N} with batch size B and temporal dimension N such that \mathbf{x}=(\mathbf{x}^{(1)},\cdots,\mathbf{x}^{(B)}), where \mathbf{x}^{(i)}\in\mathbb{R}^{N} for i=1,2,\cdots,B. Then define \mu and \sigma^{2} to be the mean and variance of the reshaped data (x^{(1)}[1],\cdots,x^{(1)}[N],\cdots,x^{(B)}[1],\cdots,x^{(B)}[N]). With the defined \mu and \sigma^{2}, BN transforms the original data \mathbf{x}^{(i)} to \hat{\mathbf{x}}^{(i)} as

𝐱^(i)=γ𝐱(i)μσ2+ϵ+β,\hat{\mathbf{x}}^{(i)}=\gamma\frac{\mathbf{x}^{(i)}-\mu}{\sqrt{\sigma^{2}+\epsilon}}+\beta, (31)

where γ\gamma and β\beta are learnable parameters, and ϵ\epsilon is a small positive number to guarantee valid division.
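A minimal PyTorch sketch of this operation is given below, assuming spiking feature maps stored as a tensor of shape (N, B, C, H, W) (time, batch, channel, height, width); the module name and tensor layout are illustrative, not the paper's code.

import torch
import torch.nn as nn

class TimeMergedBatchNorm2d(nn.Module):
    """Apply BN after folding the time dimension into the batch dimension."""

    def __init__(self, num_channels: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, B, C, H, W) with N time steps and batch size B (assumed layout)
        n, b, c, h, w = x.shape
        x = x.reshape(n * b, c, h, w)   # merge the time and batch dimensions
        x = self.bn(x)                  # statistics over time, batch, and space
        return x.reshape(n, b, c, h, w)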

C.3 Network Architectures

We use the pre-activation ResNet-18 [20] network architecture to conduct experiments on CIFAR-10, CIFAR-100, and ImageNet. To make the network architecture implementable on neuromorphic chips, we add spiking neurons after pooling operations and the last fully connected classifier. Furthermore, we replace all max pooling with average pooling. To stabilize the (weighted) firing rates of the output layer, we also introduce an additional BN operation between the last fully connected classifier and the last spiking neuron layer. The network contains four groups of basic block [20] structures, with channel sizes 64, 128, 256, and 512, respectively.

To test the effectiveness of the proposed DSR method with deeper network structures, we conduct experiments with pre-activation ResNets of 20, 32, 44, 56, and 110 layers, whose architectures are shown in Tab. 4. We also add additional spiking neuron layers and BN layers as we do for the PreAct-ResNet-18 structure.

We use the VGG-11 [46] network architecture to conduct experiments on DVS-CIFAR10. To enhance the generalization capacity of the network, we further add dropout [47] layers after the spiking neurons, with the probability of zeroing elements set to 0.1. We keep only one fully connected layer to reduce the number of neurons.

C.4 Training Hyperparameters

First, we consider the hyperparameters of the IF model. We set the initial threshold of each layer to V_{th}^{i}=6 and restrict it to be no less than 0.01 during training. We set \alpha in Eq. 20 to 0.5.

Then we consider the hyperparameters of the LIF model. We fix the time constant \tau^{i} of the i^{\text{th}} layer to 1 for each i. The settings of the initial V_{th}^{i}, the lower bound of V_{th}^{i}, \Delta t, and \alpha change with the number of time steps, as shown in Tab. 5.

Table 5: Hyperparameters for the LIF model given different numbers of time steps. “LB for V_{th}^{i}” means the lower bound for V_{th}^{i}.
Time Steps Initial V_{th}^{i} LB for V_{th}^{i} \Delta t \alpha
20 0.3 0.0005 0.05 0.3
15 0.3 0.0005 0.05 0.4
10 0.3 0.0005 0.05 0.4
5 0.6 0.001 0.1 0.5

Next, we consider the optimization hyperparameters. We use cosine annealing [34] as the learning rate schedule for all datasets. Other hyperparameters can be found in Tab. 6. Also note that we change the initial learning rate from 0.1 to 0.05 when using 5 time steps.

Table 6: Optimization hyperparameters for training on CIFAR-10, CIFAR-100, ImageNet, and DVS-CIFAR10.
Dataset Optimizer Epochs lr Batch size
CIFAR-10 SGD [43] 200 0.1 128
CIFAR-100 SGD [43] 200 0.1 128
ImageNet Adam [28] 90 0.001 144
DVS-CIFAR10 SGD [43] 300 0.05 128
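As a rough illustration of this optimization setup for CIFAR-10 (the momentum and weight-decay values are assumptions, since they are not listed above), the optimizer and learning rate schedule can be configured as:

import torch
import torch.nn as nn

model = nn.Linear(10, 10)   # placeholder for the actual SNN with trainable thresholds

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,              # initial learning rate for CIFAR-10 (Tab. 6)
    momentum=0.9,        # assumed value, not reported above
    weight_decay=5e-4,   # assumed value, not reported above
)
# Cosine annealing over the 200 training epochs used for CIFAR-10.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)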

Appendix D Firing Sparsity

To achieve low energy consumption on neuromorphic hardware, the number of spikes generated by an SNN should be small. The firing rate is therefore an important quantity for measuring the energy efficiency of SNNs. We calculate the average firing rates of the trained SNNs on CIFAR-10, as shown in Fig. 4. The results show that the firing rates of all layers are below 20%, and many layers have firing rates of no more than 5%. Taking all neuron layers together, the total firing rate is between 7.5% and 9.5% for different numbers of time steps and both neuron models. Since the firing rate does not increase as the number of time steps decreases, the proposed method can achieve satisfactory performance with both low latency and high firing sparsity.

Figure 4: Firing rates of the SNNs trained by the proposed DSR method on CIFAR-10 with the PreAct-ResNet-18 structure and different numbers of time steps. Firing rates of the different spiking neuron layers are shown. (a) Firing rates for the IF model. (b) Firing rates for the LIF model.
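For reference, the average firing rate of each layer can be computed from recorded binary spike tensors as in the following sketch; the list-of-tensors recording format is an assumption made for illustration.

import torch

def firing_rates(spikes_per_layer):
    """Compute the average firing rate of each spiking neuron layer.

    spikes_per_layer: a list of binary tensors, one per layer, each of shape
    (N, B, ...) holding 0/1 spikes over N time steps and batch size B.
    """
    return [s.float().mean().item() for s in spikes_per_layer]

# Example: two layers with random spike trains of roughly 10% density.
dummy = [torch.bernoulli(0.1 * torch.ones(20, 4, 128)) for _ in range(2)]
print(firing_rates(dummy))   # both values close to 0.1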

Appendix E Weight Quantization

In our experiments, the network weights are 32-bit. However, we can also adopt low-bit weights when implementing our method on neuromorphic hardware, by combining it with existing quantization algorithms. Weights on neuromorphic hardware are generally 8-bit, so we quantize the weights of our trained SNNs to 8 bits and even 4 bits using the straight-through estimation method; the results on CIFAR-10 are shown in Tab. 7.
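A minimal sketch of symmetric per-tensor uniform quantization with a straight-through gradient is given below; it illustrates the general technique and is not necessarily the exact quantization scheme used for Tab. 7.

import torch

class STEQuantize(torch.autograd.Function):
    """Symmetric uniform weight quantization with a straight-through gradient."""

    @staticmethod
    def forward(ctx, w: torch.Tensor, num_bits: int) -> torch.Tensor:
        qmax = 2 ** (num_bits - 1) - 1                 # e.g., 127 for 8-bit weights
        scale = w.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale
        return torch.round(w / scale).clamp(-qmax, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: pass the gradient unchanged to w.
        return grad_output, None

def quantize_weights(w: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    return STEQuantize.apply(w, num_bits)

# Example: quantize a weight tensor to 8 bits.
w = torch.randn(64, 64)
w_q = quantize_weights(w, num_bits=8)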

Table 7: Performances on CIFAR-10 with network weights of different precisions. The PreAct-ResNet-18 architecture with 20 time steps is used.
Neural Model 32 bits 8 bits 4 bits
IF 95.38% 95.45% 95.31%
LIF 95.63% 95.65% 95.39%