
Ultra-low Latency Adaptive Local Binary Spiking Neural Network with Accuracy Loss Estimator

Changqing Xu, Yijian Pei, Zili Wu, Yi Liu, Yintang Yang (Changqing Xu and Yijian Pei contributed equally)
Abstract

Spiking neural networks (SNNs) are brain-inspired models with strong spatio-temporal information processing capacity and high computational energy efficiency. However, as SNNs grow deeper, the memory required to store their weights has attracted increasing attention. Inspired by quantization techniques for artificial neural networks (ANNs), binarized SNNs (BSNNs) have been introduced to address this memory problem. Due to the lack of suitable learning algorithms, BSNNs are usually obtained by ANN-to-SNN conversion, so their accuracy is limited by the trained ANNs. In this paper, we propose an ultra-low latency adaptive local binary spiking neural network (ALBSNN) with an accuracy loss estimator, which dynamically selects the network layers to be binarized by evaluating the error caused by the binarized weights during training, thereby preserving network accuracy. Experimental results show that this method can reduce storage space by more than 20% without losing network accuracy. At the same time, to accelerate training, a global average pooling (GAP) layer built from a combination of convolution and pooling replaces the fully connected layers, so that SNNs can obtain good recognition accuracy with a small number of time steps. In the extreme case of using only one time step, we still achieve 92.92%, 91.63%, and 63.54% testing accuracy on three different datasets: Fashion-MNIST, CIFAR-10, and CIFAR-100, respectively.

Introduction

Courbariaux et al. proposed BinaryConnect (Courbariaux, Bengio, and David 2015), which pioneered the study of binary neural networks. Binarization not only minimizes the storage usage and computational complexity of a model but also reduces the storage resources needed for deployment and greatly accelerates inference. In the field of CNNs, many such algorithms have been proposed and satisfactory progress has been made. Spiking neural networks, as the third generation of neural networks, are a computational paradigm that simulates the biological brain, based on the dynamic activation of binary neurons and event-driven computation (Tavanaei et al. 2019; Illing, Gerstner, and Brea 2019). By exploiting the temporal sparsity of binary spike trains, they can improve computational energy efficiency on dedicated hardware (Mead 1990). The combination of SNNs and binary networks has therefore attracted growing attention (Lu and Sengupta 2020; Kheradpisheh, Mirsadeghi, and Masquelier 2022; Srinivasan and Roy 2019). However, training SNNs remains challenging because of their non-differentiable activation function. To maintain good accuracy, some researchers obtain parameters from pre-trained ANNs (Lu and Sengupta 2020; Wang et al. 2020; Cao, Chen, and Khosla 2015), but such pre-training gives up the advantage of SNNs in temporal and spatial information processing. In recent years, some studies have successfully trained binarized SNNs directly. For example, Jang et al. (Jang, Skatchkovsky, and Simeone 2021) used Bayesian learning to directly train binarized SNNs (BSNNs), and Kheradpisheh et al. (Kheradpisheh, Mirsadeghi, and Masquelier 2022) used time-to-first-spike coding in direct training.

We argue that it is more reasonable to train BSNNs directly, but this requires a suitable SNN structure and appropriate improvements to the learning algorithm. We therefore propose an Accuracy Loss Estimator (ALE) and a Global Average Pooling (GAP) Layer and use them to construct an ultra-low latency adaptive local binary spiking neural network. We directly train the network with the iterative neuron model in (Wu et al. 2018), and the binarized layers are selected automatically by ALE to avoid the large accuracy loss of naive direct training. We then use the GAP Layer instead of fully connected layers to reduce computation and restructure the output layer of the SNN, alleviating the long training time of directly trained BSNNs. Object recognition experiments are conducted on three datasets, Fashion-MNIST, CIFAR-10, and CIFAR-100, with a comprehensive comparison against other BSNNs. The experiments verify the effectiveness of ALE and demonstrate the advantages of our method in terms of accuracy and training time.

Related Works

Binary Spiking Neural Networks

Generally, network quantization can target two aspects: weights and inputs (Qin et al. 2020). Due to the characteristics of SNNs, there is no need to additionally quantize the network input. Recently, several works have combined SNNs with binarization. Lu et al. (Lu and Sengupta 2020) proposed B-SNN, which is obtained by converting pre-trained binary CNNs into BSNNs. Roy et al. (Roy, Chakraborty, and Roy 2019) analyzed the results of combining different binary neurons with various weight binarization methods. Kheradpisheh et al. (Kheradpisheh, Mirsadeghi, and Masquelier 2022) proposed BS4NN and explored simple non-leaky integrate-and-fire neurons, time-to-first-spike coding, and binarized weights in backpropagation. Jang et al. (Jang, Skatchkovsky, and Simeone 2021) proposed BISNN, which uses Bayesian learning to train SNNs with binarized weights. However, much of this work focuses on approximating full precision weights or reducing gradient errors to learn discrete parameters. For BSNNs, the first and last layers are usually kept non-binarized to reduce the accuracy drop, based on experimental experience (Deng et al. 2021). This method usually works, but there is still room for improvement.

Training of Binary Spiking Neural Networks

The training methods of BSNNs are also receiving increasing attention. Recently, Mirsadeghi et al. (Mirsadeghi et al. 2021) proposed the STiDi-BP algorithm, which avoids reverse recursive gradient computation while achieving good performance with binarized weights. Wang et al. (Wang et al. 2020) proposed a weights-thresholds balance conversion method that scales full precision weights into binarized weights by changing the corresponding thresholds of spiking neurons, thereby obtaining BSNNs efficiently. Roy et al. (Roy, Chakraborty, and Roy 2019) trained ANNs with constrained weights and activations and deployed them as SNNs with binarized weights. The BS4NN proposed by Kheradpisheh et al. (Kheradpisheh, Mirsadeghi, and Masquelier 2022) takes advantage of the temporal dimension and performs better than a simple BNN with the same architecture. Current BSNN training methods mainly binarize all weights, which fails to balance accuracy against the degree of quantization. Furthermore, SNNs usually require sufficient time steps to simulate neural dynamics and encode information, and also take a long time to converge, which brings huge computational costs (Sengupta et al. 2019).

Approach

In this section, we first introduce the neuron model, the learning method for binary spiking neural networks, and the binarization method. We then introduce our proposed accuracy loss estimator and the GAP Layer.

Iterative Leaky Integrate-and-Fire Neuron Model

In this paper, we use an iterative Leaky Integrate-and-Fire (LIF) neuron model to construct networks. First, we introduce the classic LIF model, which is defined as:

$\tau\frac{du(t)}{dt}=-u(t)+I(t),\quad u<V_{th}$ (1)

where $u(t)$ is the membrane voltage of the neuron at time $t$, $\tau$ is the decay constant of the membrane potential, and $I(t)$ is the input from the presynaptic neuron. When the membrane potential $u$ exceeds the threshold $V_{th}$, the neuron fires a spike and then returns to the resting potential.

Then, the LIF neuron model is converted into an iterative version that is easy to program. Specifically, the iterative version is obtained from the membrane state at the previous time step and the presynaptic input:

$u(t_{i})=u(t_{i-1})e^{\frac{t_{i-1}-t_{i}}{\tau}}+I^{\prime}(t_{i})$ (2)

where $u(t_{i-1})$ is the membrane voltage at time step $t_{i-1}$ and $I^{\prime}(t_{i})$ is the input from the presynaptic neuron at time step $t_{i}$.

When the neuron output at the previous time step is zero, the membrane voltage decays rather than resets. This process can be expressed simply as:

$u^{l+1}_{p}(t_{i+1})=\tau u^{l+1}_{p}(t_{i})\left(1-o^{l+1}_{p}(t_{i})\right)+\sum^{l_{max}}_{q=1}w_{pq}\,o^{l}_{q}(t_{i+1})$ (3)

where $u^{l+1}_{p}(t_{i+1})$ is the membrane voltage of the $p$th neuron of the $(l+1)$th layer at time step $t_{i+1}$, $o^{l+1}_{p}(t_{i})$ is the output of the $p$th neuron of the $(l+1)$th layer at time step $t_{i}$, $\tau$ is the decay factor, $w_{pq}$ is the weight of the $q$th synapse to the $p$th neuron, and $l_{max}$ is the total number of neurons in the $l$th layer.

Finally, a step function $f(x)$ is used to represent whether the neuron's membrane voltage reaches the threshold voltage $V_{th}$ and fires a spike:

$o^{l+1}_{p}(t_{i+1})=f\left(u^{l+1}_{p}(t_{i+1})\right)$ (4)

where the step function is $f(x)=\begin{cases}1 & x\geq V_{th}\\ 0 & x<V_{th}\end{cases}$
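To make the update rule concrete, here is a minimal PyTorch sketch of Eqs. (3) and (4) for a single fully connected layer. It is an illustration under the parameter values used later in the paper ($V_{th}=0.5$, $\tau=0.25$), not the authors' code; the layer sizes are placeholders, and the surrogate gradient needed for training is introduced later.

```python
import torch

V_TH = 0.5   # firing threshold V_th (value used in the paper)
TAU = 0.25   # membrane decay factor tau (value used in the paper)

def lif_step(u_prev, o_prev, weighted_input):
    """One iterative LIF update following Eqs. (3)-(4).

    u_prev:         membrane voltage u^{l+1}(t_i)
    o_prev:         spike output o^{l+1}(t_i) at the previous step (0/1)
    weighted_input: sum_q w_pq * o^l_q(t_{i+1}), e.g. a linear/conv output
    """
    # Decay the old membrane voltage; (1 - o_prev) resets neurons that fired.
    u = TAU * u_prev * (1.0 - o_prev) + weighted_input
    # Step function f: fire a spike wherever the threshold is reached.
    o = (u >= V_TH).float()
    return u, o

# Toy usage: 4 neurons driven by 8 binary inputs for 3 time steps.
torch.manual_seed(0)
w = torch.randn(4, 8)                       # synaptic weights w_pq (hypothetical)
inputs = (torch.rand(3, 8) > 0.5).float()   # presynaptic spikes per time step
u, o = torch.zeros(4), torch.zeros(4)
for t in range(3):
    u, o = lif_step(u, o, inputs[t] @ w.t())
    print(f"t={t}: spikes={o.tolist()}")
```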

Accuracy loss estimator for weight binarization

To reduce the accuracy drop of BSNNs, the first and last layers are usually kept non-binarized based on engineering experience, which means that the weight precision of the first and last layers plays an important role in the inference of the neural network (Deng et al. 2021). However, according to our study, which layers should be binarized depends on the structure of the neural network and the characteristics of the dataset, and keeping the first and last layers at full precision is not always the best solution. Therefore, we propose ALE, which automatically selects binarized and non-binarized network layers during training by estimating the effect of each layer on network accuracy.

First of all, we use the Manhattan distance between the approximate binarized weights and the full precision weights as the error estimate of the binarized weights, $w_{loss}^{l}$, calculated as:

$w_{loss}^{l}=\sum^{n}_{i=1}|w_{i}^{l}-bw_{i}^{l}|,\quad l=1,2,3\ldots L$ (5)

where $w^{l}_{i}$ is the $i$th full precision weight of the $l$th layer and $bw^{l}_{i}$ is the $i$th approximate (binarized) weight of the $l$th layer.

For a BSNN, each output channel of a spiking convolution layer corresponds to one extracted feature. We therefore use the average error per feature, $A^{l}$, to estimate the error caused by the binarized weights:

$A^{l}=\frac{w^{l}_{loss}}{c_{out}^{l}}$ (6)

where $c_{out}^{l}$ is the number of output channels of the $l$th layer.

Besides the error caused by binarization, we also use the size of the weight storage space as a criterion for selecting binarized layers: layers with many weights should have a higher probability of being selected for binarization. Because the binarization error estimate $A$ is computed from $w_{loss}$, we also use $w_{loss}$ to estimate the difference in weight storage space between layers, denoted $M$, as follows:

$M^{l}=\frac{\theta_{max}^{l}-\theta_{1}^{l}}{2}$ (7)

where $\theta_{max}^{l}$ is the $A^{l}$ obtained when the number of output channels of the $l$th layer equals 1, and $\theta_{1}^{l}$ is the $A^{l}$ obtained when the number of output channels of the $l$th layer equals the total number of weights. For example, for a weight tensor of shape $[output\ channel, input\ channel, kernel\ size, kernel\ size]=[10,10,3,3]$, its $\theta_{max}$ is the $A^{l}$ of the reshaped tensor $[1,100,3,3]$, and $\theta_{1}$ is the $A^{l}$ of the reshaped tensor $[100,1,3,3]$. These values can be obtained quickly using Eqs. (5) and (7).

To simplify the calculation of $M$, we use $A^{l}$ to estimate $\theta_{1}^{l}$ and $\theta_{max}^{l}$ based on an empirical relationship between the binarization error estimates of weights with different shapes:

$\frac{w^{l}_{loss}}{w^{l^{\prime}}_{loss}}\approx\sqrt{\left(\frac{c_{out}^{l}}{c_{out}^{l^{\prime}}}\right)^{2}\cdot\frac{c_{in}^{l}}{c_{in}^{l^{\prime}}}}$ (8)

where $w_{loss}^{l}$, $c_{out}^{l}$, and $c_{in}^{l}$ are the weight error, the number of output channels, and the number of input channels of the $l$th layer, respectively, and $w_{loss}^{l^{\prime}}$, $c_{out}^{l^{\prime}}$, and $c_{in}^{l^{\prime}}$ are the corresponding quantities for the $l$th layer's reshaped weights.

Furthermore, we consider the influence of binarized weights at different layers in the forward pass and in backpropagation. Experiments on the layer hierarchy show that binarizing the first layer has a larger impact on inference accuracy, while for backpropagation the last layer has a larger impact on training. We therefore assign the first and last layers the same degree of influence on the result and use a parabola $F(x)$ to describe this:

$F(x)=\epsilon\left(x-\frac{sumL+1}{2}\right)^{2}$ (9)

where $x$ is the index of the layer, $\epsilon$ is a factor equal to $\frac{4\eta}{sumL^{2}}$, $sumL$ is the total number of layers, and $\eta$ is a variable that we set to 1 by default.

We combine $A^{l}$, $M^{l}$, and $F(x)$ to obtain the criterion $R(x)$ for selecting binarized layers:

$R(x)=\begin{cases}\left(\frac{1}{A^{l}+M^{l}}\right)F(x), & x\leq\frac{sumL+1}{2}\\ \left(\frac{1}{A^{l}+M^{l}}\right)\log_{10}(K)\,F(x), & x>\frac{sumL+1}{2}\end{cases}$ (10)

where $K$ is the number of classes in the dataset. Different selection strategies can be applied to the value of $R(x)$ to satisfy different applications; we discuss these strategies in detail in the experiment section.
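As a rough illustration of how these quantities could be combined, the sketch below computes $w_{loss}$, $A^{l}$, $M^{l}$ (via the reshaping example above rather than the Eq. (8) approximation), $F(x)$, and $R(x)$ for a toy stack of convolution layers. This is our own simplified reading of Eqs. (5)-(10), not the authors' implementation: the per-channel binarizer is a placeholder for the 3-filter approximation introduced in the next subsection, and the layer list is hypothetical.

```python
import math
import torch
import torch.nn as nn

def approx_binarize(w):
    # Per-output-channel 1-bit approximation (alpha_c * sign(w_c)), used here
    # as a stand-in for the 3-filter approximation of Eqs. (11)-(13).
    alpha = w.abs().mean(dim=(1, 2, 3), keepdim=True)
    return alpha * w.sign()

def w_loss(w):
    # Eq. (5): Manhattan distance between full precision and binarized weights.
    return (w - approx_binarize(w)).abs().sum().item()

def ale_scores(conv_layers, num_classes, eta=1.0):
    """Score every layer with R(x) of Eq. (10); larger R -> keep full precision."""
    sum_l = len(conv_layers)
    mid = (sum_l + 1) / 2.0
    eps = 4.0 * eta / sum_l ** 2                       # epsilon in Eq. (9)
    scores = []
    for x, layer in enumerate(conv_layers, start=1):
        w = layer.weight.data
        c_out, c_in, kh, kw = w.shape
        a = w_loss(w) / c_out                          # Eq. (6)
        # theta_max / theta_1: A^l of the weights reshaped to one output channel
        # and to c_out*c_in output channels (cf. the [10,10,3,3] example).
        theta_max = w_loss(w.reshape(1, c_out * c_in, kh, kw))
        theta_1 = w_loss(w.reshape(c_out * c_in, 1, kh, kw)) / (c_out * c_in)
        m = (theta_max - theta_1) / 2.0                # Eq. (7)
        f = eps * (x - mid) ** 2                       # Eq. (9)
        r = f / (a + m)                                # Eq. (10), first branch
        if x > mid:
            r *= math.log10(num_classes)               # Eq. (10), second branch
        scores.append(r)
    return scores

# Toy example: six conv layers; the layers with the largest scores would be
# kept at full precision, the rest binarized.
layers = [nn.Conv2d(3, 128, 3), nn.Conv2d(128, 256, 3), nn.Conv2d(256, 512, 3),
          nn.Conv2d(512, 1024, 3), nn.Conv2d(1024, 512, 3), nn.Conv2d(512, 10, 3)]
print(ale_scores(layers, num_classes=10))
```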

Backpropagation with adaptive local binarization

For the binarization of the weights, we use 3 binary weight bases to approximate the full precision weights. That is, a linear combination of 3 binary filters $B$ with coefficients $\alpha$ is used to represent the full precision weight $W$:

$W\approx\alpha_{1}B_{1}+\alpha_{2}B_{2}+\alpha_{3}B_{3}$ (11)

Then we calculate the value of each binarized weight $B$ following (Lin, Zhao, and Pan 2017):

$B_{i}=\mathrm{sign}\left(W-\mathrm{mean}(W)+(i-2)\,\mathrm{std}(W)\right),\quad i=1,2,3$ (12)

where $\mathrm{mean}(W)$ and $\mathrm{std}(W)$ are the mean and standard deviation of $W$, respectively.

Once $B$ is obtained, $\alpha$ is easily determined by solving:

$\min_{\alpha}J(\alpha)=\|W-B\alpha\|^{2}$ (13)
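The following sketch shows one way this 3-filter approximation could be computed in PyTorch: the bases follow Eq. (12), and the coefficients $\alpha$ solve the least-squares problem of Eq. (13). It is an illustrative reimplementation in the spirit of ABC-Net (Lin, Zhao, and Pan 2017), not the authors' code.

```python
import torch

def binarize_3(W):
    """Approximate W with sum_i alpha_i * B_i (Eqs. (11)-(13))."""
    mean, std = W.mean(), W.std()
    # Eq. (12): three shifted sign bases, i = 1, 2, 3.
    Bs = [torch.sign(W - mean + (i - 2) * std) for i in (1, 2, 3)]
    # Eq. (13): least-squares fit of alpha over the flattened weights.
    B = torch.stack([b.reshape(-1) for b in Bs], dim=1)       # (n, 3)
    w = W.reshape(-1, 1)                                       # (n, 1)
    alpha = torch.linalg.lstsq(B, w).solution.squeeze(1)       # (3,)
    return alpha, Bs

# Example: approximate a random bank of 3x3 conv kernels.
torch.manual_seed(0)
W = torch.randn(16, 8, 3, 3)
alpha, Bs = binarize_3(W)
W_approx = sum(a * b for a, b in zip(alpha, Bs))
print("alpha:", alpha.tolist())
print("reconstruction error:", (W - W_approx).abs().mean().item())
```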

In the forward pass, the network selects, according to ALE, whether to binarize the weights of each layer, and uses the selected weights to compute the output $O$ of each layer:

$O=\begin{cases}\sum_{m=1}^{3}\alpha_{m}\,Conv(B_{m},A), & \text{binarization}\\ Conv(W,A), & \text{otherwise}\end{cases}$ (14)

where $Conv(\cdot)$ denotes the convolution function.

BSNNs are affected by both binarized weights and binary inputs, so the backpropagation process must be reconsidered. We use the Dirac function to generate the spikes of the SNN. Because the Dirac function is non-differentiable, an approximate (surrogate) gradient function is used instead of its derivative in backpropagation (Neftci, Mostafa, and Zenke 2019; Wu et al. 2018), defined as follows:

$h(u)=\frac{1}{a}\,\mathrm{sign}\left(|u-V_{th}|<\frac{a}{2}\right)$ (15)

where $u$ is the membrane voltage, $V_{th}$ is the threshold, and $a$ is a parameter that determines the sharpness of the curve.

Using the chain rule, the error gradient with respect to the presynaptic weight $W$ is

$\frac{\partial L}{\partial W}=\frac{\partial L}{\partial O}\frac{\partial O}{\partial W}=\frac{\partial L}{\partial O}\left(\frac{1}{a}\,\mathrm{sign}\left(|u-V_{th}|<\frac{a}{2}\right)\right)$ (16)

where $L$ is the loss function and $\mathrm{sign}$ is the signum function.

Moreover, the weight binarization function is itself a step function, so the straight-through estimator (STE) (Bengio, Léonard, and Courville 2013) is used to propagate gradients through it:

$\frac{\partial L}{\partial W}\overset{STE}{=}\frac{\partial L}{\partial O}\frac{\partial O}{\partial B}\frac{\partial Htanh}{\partial W}=\frac{\partial L}{\partial O}\frac{\partial O}{\partial B}=\frac{\partial L}{\partial B}$ (17)

where $O$ and $Htanh$ are the output tensor of a convolution and the hard-tanh function, respectively.
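As an illustration of how these two gradient tricks might be wired together in PyTorch, the sketch below defines a spike function whose backward pass uses the rectangular surrogate of Eq. (15), and a sign function whose backward pass is the straight-through estimator of Eq. (17) clipped as in hard-tanh. The class names and the clipping range are our own choices, not taken from the paper.

```python
import torch

V_TH, A = 0.5, 1.0   # threshold and surrogate width used in the paper

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with the rectangular surrogate gradient of Eq. (15)."""
    @staticmethod
    def forward(ctx, u):
        ctx.save_for_backward(u)
        return (u >= V_TH).float()

    @staticmethod
    def backward(ctx, grad_out):
        (u,) = ctx.saved_tensors
        # h(u) = (1/a) * 1[|u - V_th| < a/2]
        surrogate = (torch.abs(u - V_TH) < A / 2).float() / A
        return grad_out * surrogate

class SignSTE(torch.autograd.Function):
    """sign() whose gradient is passed straight through (hard-tanh style)."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # dHtanh/dw = 1 inside [-1, 1], 0 outside.
        return grad_out * (w.abs() <= 1).float()

# Quick check that gradients flow through both functions.
u = torch.randn(5, requires_grad=True)
w = torch.randn(5, requires_grad=True)
(SpikeFn.apply(u) * SignSTE.apply(w)).sum().backward()
print(u.grad, w.grad)
```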

Fig. 1 shows the network layer with ALE and its workflow. First, the layer uses the flag obtained from the Box to determine whether it uses binarized weights, and convolves the selected weights with the input. At the current time step, the Box stores the selection results of the previous time step, and these results decide whether the binarized weights are used. ALE then recalculates the value of $R$ and updates the selection results in the Box: it computes the binarized weight $BW$ from the original weight $W1$, uses both to obtain $R$, and finally derives the new selection result from $R$ and the selection criterion, writing it back to the Box.

Figure 1: Network layers with ALE. The Box is used to record the indices of layers that need to be binarized. The flag determines whether the binarized weights are used for the convolution calculation. $W1$, $BW$, and $W2$ represent the original weights, the binarized weights, and the weights selected for the convolution calculation, respectively, and $Conv$ is the convolution function.

GAP Layer

Figure 2: The overall structure of GAP Layer.
Figure 3: The overall structure of Adaptive Local Binary Spiking Neural Network.


Because spiking neurons produce binary outputs, classification based directly on the results of only a few time steps is extremely sensitive to noise. Therefore, spike trains over a long period are usually used to indicate the degree of response to each category, which causes extra computational cost. To address this problem, we borrow global average pooling from CNNs (Lin, Chen, and Yan 2013) and apply it in SNNs to reduce the number of time steps.

The GAP Layer consists of a convolutional layer and a global average pooling layer (Lin, Chen, and Yan 2013). The convolution layer adjusts the number of output channels to the number of classes in the dataset, and the global average pooling layer converts the feature maps into a classification vector that is directly related to the final classification result. The overall structure of the GAP Layer is shown in Fig. 2: the number of output channels is first adjusted to the number of classes by convolution, then global average pooling transforms the spatial average of each feature map from the last layer into a confidence value for the corresponding category, which is used as the recognition probability. Just as in CNNs, global average pooling enforces a correspondence between feature maps and categories and integrates global spatial information (Lin, Chen, and Yan 2013). Introducing these advantages into SNNs alleviates the excessive cost of long time windows, so SNNs can achieve competitive classification accuracy in a few time steps, or even a single time step, compared to existing state-of-the-art SNNs.
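A minimal PyTorch sketch of such a GAP head is shown below; the channel count, spatial size, and kernel size are placeholders, since the paper does not specify them at this level of detail.

```python
import torch
import torch.nn as nn

class GAPLayer(nn.Module):
    """Conv mapping features to one channel per class, then global average pooling."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.conv(x)           # adjust channels to the number of classes
        return x.mean(dim=(2, 3))  # spatial average -> per-class confidence

# Example: spike feature maps from a last conv block (batch 4, 512 channels, 8x8).
head = GAPLayer(in_channels=512, num_classes=10)
spikes = (torch.rand(4, 512, 8, 8) > 0.5).float()
print(head(spikes).shape)   # torch.Size([4, 10])
```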

Adaptive Local Binary Spiking Neural Network

The overall structure of the proposed Adaptive Local Binary Spiking Neural Network (ALBSNN) is illustrated in Fig. 3. The network consists of $N$ end-to-end spiking convolution blocks and a GAP Layer block. Each spiking convolution block consists of an ALE, a spiking convolution layer, a batch normalization layer, and an average pooling layer. ALE decides whether the weights are binarized, and the spiking convolution layer extracts image features. The GAP Layer alleviates the excessive cost of the time steps. A compact sketch of this layer layout is given below.
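The skeleton below is one possible static reading of the CIFAR-10/CIFAR-100 structure listed later in Table 1, intended only to show the block layout; the ALE weight selection and the LIF spiking dynamics (run per time step) are deliberately omitted, and the padding choices are assumptions.

```python
import torch
import torch.nn as nn

class ALBSNNSkeleton(nn.Module):
    """Static layer skeleton: spiking conv blocks followed by a GAP head.

    ALE and the per-time-step LIF dynamics are omitted; only the layout
    128C3-256C3-AP2-512C3-AP2-1024C3-512C3-GAP is sketched.
    """
    def __init__(self, num_classes=10):
        super().__init__()
        cfg = [(3, 128, False), (128, 256, True), (256, 512, True),
               (512, 1024, False), (1024, 512, False)]
        blocks = []
        for c_in, c_out, pool in cfg:
            blocks += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out)]
            if pool:
                blocks.append(nn.AvgPool2d(2))
        self.features = nn.Sequential(*blocks)
        self.gap_conv = nn.Conv2d(512, num_classes, 3, padding=1)  # GAP Layer conv

    def forward(self, x):
        x = self.features(x)
        x = self.gap_conv(x)
        return x.mean(dim=(2, 3))   # global average pooling -> class scores

print(ALBSNNSkeleton()(torch.rand(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```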

Experiments

In this section, we evaluate our proposed Adaptive Local Binary Spiking Neural Network (ALBSNN) on three datasets, Fashion-MNIST (Xiao, Rasul, and Vollgraf 2017), CIFAR-10, and CIFAR-100 (Krizhevsky, Hinton et al. 2009), which are popular benchmarks for SNNs. Fashion-MNIST is a fashion product image dataset with 10 classes; its 70,000 grayscale images of size 28 × 28 are divided into 60,000 training images and 10,000 test images. CIFAR-10 and CIFAR-100 consist of 3-channel RGB images of size 32 × 32, with 50,000 training images and 10,000 test images. CIFAR-10 has 10 classes and CIFAR-100 has 100 classes, with images divided equally among classes. We first verify the effectiveness of ALE and study the factors that affect it. Then we try four different selection criteria for binarization and compare them. Finally, we compare ALBSNN with several previously reported state-of-the-art results using the same or similar networks.

Experimental Setup

Dataset                Structure
Fashion-MNIST          16C3(Encoding)-16C3-AP2-64C3-64C3-AP2-256C3-1024C3-GAP
CIFAR-10, CIFAR-100    128C3(Encoding)-256C3-AP2-512C3-AP2-1024C3-512C3-GAP

Table 1: Network structures.
Parameter        Fashion-MNIST   CIFAR-10   CIFAR-100
$V_{th}$         0.5             0.5        0.5
$\tau$           0.25            0.25       0.25
a                1               1          1
learning rate    0.001           0.001      0.001
batch size       16              16         64
time step        1               1          1
optimizer        Adam            Adam       Adam
criterion        MSE             MSE        Cross-Entropy

Table 2: Parameter settings.

All experiments reported below are conducted on an NVIDIA Tesla V100 GPU. Our proposed ALBSNN is implemented in the PyTorch framework (Paszke et al. 2019). Only one time step is used to demonstrate the ultra-low latency of ALBSNN. Adam is applied as the optimizer (Kingma and Ba 2014). All results reported in this paper are averages over five repeated runs.

We apply the following data augmentation during training: (1) pad the original image with a padding size of 4, (2) randomly crop a 32-pixel patch, (3) flip the image horizontally with probability 0.5, and (4) normalize the image with a standard deviation of 0.5. For testing, only normalization is applied (Shorten and Khoshgoftaar 2019).
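For reference, this augmentation pipeline could be expressed with torchvision transforms roughly as follows; the mean value of 0.5 is an assumption, since the paper only states the standard deviation.

```python
from torchvision import transforms

# Training-time augmentation: pad 4, random 32x32 crop, horizontal flip
# with probability 0.5, then normalize (std 0.5; mean 0.5 is assumed).
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

# Test-time: normalization only.
test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])
```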

We use the iterative LIF model and the approximate gradient for network training. The first convolutional layer acts as an encoding layer; the network structures for the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets are shown in Table 1. Batch normalization (BN) (Ioffe and Szegedy 2015) is applied between the convolution calculation and the activation function. All convolution operations used in the experiments are provided by PyTorch. The hyperparameters of the networks used in our experiments are shown in Table 2. The learning rate follows a cosine annealing schedule (Loshchilov and Hutter 2016). Unless otherwise specified, the testing accuracies on Fashion-MNIST and CIFAR-10 are reported after training for 20 epochs, and CIFAR-100 is trained for 200 epochs.

Effectiveness of ALE

To validate the effectiveness of ALE, we compare ALBSNN with an SNN with full precision weights (FPSNN), a fully binarized SNN (BSNN), and a BSNN whose first and last layers are non-binarized (FLNBSNN) on the Fashion-MNIST, CIFAR-10, and CIFAR-100 datasets. For a fair comparison, ALBSNN is configured to select two layers to keep at full precision. Table 3 shows the accuracy of the four methods. The FPSNN and BSNN results are obtained with STBP (Wu et al. 2018) and ABC-Net (Lin, Zhao, and Pan 2017). Compared with FPSNN, the BSNN, FLNBSNN, and ALBSNN all lose some accuracy due to binarization. Among the three binarized SNNs, ALBSNN achieves the best accuracy because the ALE block helps it select more suitable layers based on the network structure and dataset. For Fashion-MNIST and CIFAR-10, ALBSNN drops only 0.20% and 0.52% accuracy relative to FPSNN, respectively. ALBSNN drops 3.49% accuracy on CIFAR-100 because it is limited to selecting only two layers to keep at full precision; the section on selection criteria below shows how different selection criteria improve ALBSNN.

There is an interesting phenomenon: FLNBSNN and ALBSNN both select the first and sixth layers as non-binarized layers for the CIFAR-10 and CIFAR-100 datasets, yet ALBSNN obtains better accuracy. Upon further study, we find that ALBSNN does not always select the first and sixth layers as non-binarized layers; for example, on CIFAR-10 it selects only the sixth layer at the first iteration of training. In a further experiment, we force the first layer of FLNBSNN to be binarized during the first forward pass so that it behaves like ALBSNN, with all other conditions unchanged, and observe that the recognition accuracy on CIFAR-10 improves from 89.17% to 90.02%. We therefore believe that using binarized weights on the initial input acts as beneficial noise and improves the robustness of the network.

Dataset          Method     Full precision layers   Acc (%)
Fashion-MNIST    BSNN       -                       92.38
                 FLNBSNN    1,6                     92.92
                 ALBSNN     1,2                     93.10
                 FPSNN      all                     93.30
CIFAR-10         BSNN       -                       88.53
                 FLNBSNN    1,6                     89.17
                 ALBSNN     1,6                     90.12
                 FPSNN      all                     90.64
CIFAR-100        BSNN       -                       57.68
                 FLNBSNN    1,6                     63.31
                 ALBSNN     1,6                     63.54
                 FPSNN      all                     67.03

Table 3: Accuracy of different methods.

Rethinking local binarization

Comparing the binarized-layer selection results for Fashion-MNIST, CIFAR-10, and CIFAR-100, we find that the selection is related to the complexity of the dataset and the network structure. As shown in Table 3, ALBSNN chooses the same full precision layers as FLNBSNN for CIFAR-10 and CIFAR-100. Examining the network structures we used, we find that if the final output channel count is relatively small and the sizes of the weights in adjacent layers differ greatly, ALE may find a better binarization scheme; however, if the size of the weights increases or decreases gradually across the network, FLNBSNN is already a good solution. Since the weights of common networks generally change smoothly layer by layer, the selection of ALE tends to be similar to FLNBSNN. Of course, if the non-binarized layers are not limited to two, ALE can still obtain a better binarization scheme by evaluating the error caused by the binarized weights. In summary, the selection result of ALE is mainly related to the complexity of the dataset and the structure of the neural network.

Impact of selection criteria

Dataset          Selection criterion   Full precision layers   Acc (%)
Fashion-MNIST    SC1                   1                       92.81
                 SC2                   1,6                     93.10
                 SC3                   1,2,7                   93.26
                 SC4                   1,3,7                   93.21
CIFAR-10         SC1                   1                       89.49
                 SC2                   1,6                     90.12
                 SC3                   1,2,6                   90.12
                 SC4                   1,5,6                   90.15
CIFAR-100        SC1                   1                       60.86
                 SC2                   1,6                     63.54
                 SC3                   1,2,6                   64.23
                 SC4                   1,5,6                   64.54

Table 4: Accuracy of different selection criteria.

In the previous section, to make a fair comparison with FLNBSNN, we selected the two layers with the largest $R$ values as full precision layers. In this section, we evaluate four different selection criteria, SC1, SC2, SC3, and SC4, to show the impact of the selection criterion on the accuracy of ALBSNN. SC1 uses the mean $R$ of all layers as the baseline: a layer whose $R$ exceeds the mean is kept at full precision. SC2 uses the $R$ of the last layer as the baseline: a layer whose $R$ exceeds this baseline is non-binarized. For SC3, the first and last layers are kept at full precision, the mean $R$ of the other layers is the baseline, and any other layer whose $R$ exceeds this baseline is also kept at full precision. For SC4, the first and last layers are kept at full precision, and the remaining layer whose $R$ is closest to the mean $R$ of the other layers is also kept at full precision.

As Table 4 shows, ALE obtains different binarization schemes for different network structures and datasets under the different selection criteria. For CIFAR-100, the accuracy improves from 63.54% to 64.54% by adding only one non-binarized layer. In practice, the selection criterion can be chosen according to the requirements on accuracy and weight storage space.

Comparison with other methods

Dataset         Method         Structure
Fashion-MNIST   BS4NN          600FC-600FC-10
                SSTiDi-BP      20C5-MP2-40C5-MP2-1000FC-10
                ALBSNN         20C3-MP2-40C3-MP2-1000C3-10
CIFAR-10        Roy-SVGG10     128C3×2-MP-256C3×2-MP2-512C3×2-MP2-1024FC-1024FC-10
                Wang-SVGG10    128C3×2-MP2-256C3×2-MP2-512C3×2-MP2-1024FC-1024FC-10
                ALBSNN         128C3-256C3-AP2-512C3-AP2-1024C3-512C3-10
CIFAR-100       Roy-SVGG100    64C3×2-MP2-128C3×2-MP2-256C3×3-MP2-(512C3×3-MP2)×2-4096FC-4096FC-100
                Wang-SVGG100   128C3×2-MP2-256C3×2-MP2-512C3-512C3-MP2-1024FC-1024FC-512FC-100
                ALBSNN         128C3-256C3-AP2-512C3-AP2-1024C3-512C3-100

Table 5: Network structures of the different methods.
Dataset         Method         Learning                   Epochs   Time steps   Weight storage space (normalized)   Acc (%)
Fashion-MNIST   BS4NN          Temporal backpropagation   500      100          1.85                                87.50
                SSTiDi-BP      Temporal backpropagation   -        100          3.09                                92.00
                ALBSNN         STBP                       20       1            1                                   91.83
CIFAR-10        Roy-SVGG10     ANN2SNN                    150      -            1.26                                88.27
                Wang-SVGG10    ANN2SNN                    500      100          1.26                                90.19
                ALBSNN         STBP                       50       1            1                                   91.63
CIFAR-100       Roy-SVGG100    ANN2SNN                    400      -            2.76                                54.44
                Wang-SVGG100   ANN2SNN                    500      300          1.18                                62.02
                ALBSNN         STBP                       200      1            1                                   63.54

Table 6: Comparison with different methods.

In this section, we compare our proposed ALBSNN with several previously reported state-of-the-art methods using the same or similar networks. For Fashion-MNIST, BS4NN (Kheradpisheh, Mirsadeghi, and Masquelier 2022) is trained with a simple fully connected network, while (Mirsadeghi et al. 2021) uses a higher-performance convolutional network for recognition (we denote this network SSTiDi-BP); both networks are learned with temporal backpropagation. For a fair comparison, we replace the fully connected layer with the GAP Layer and build an ALBSNN with a similar network structure. For the CIFAR-10 and CIFAR-100 datasets, the network structures used by (Roy, Chakraborty, and Roy 2019) and (Wang et al. 2020) are both modified VGG networks (Simonyan and Zisserman 2014), which we denote Roy-SVGG10 and Wang-SVGG10, respectively. They do not train the SNN directly but instead use ANN-to-SNN conversion. For CIFAR-10 and CIFAR-100, we also build ALBSNNs with similar network structures for comparison.

Table 5 lists the network structures of the different methods, in which 128C3×2 represents 2 convolution blocks, each with 128 3×3 filters, AP2 represents an average pooling layer with 2×2 filters, MP2 represents a max pooling layer with 2×2 filters, and 600FC denotes a fully connected layer of 600 neurons. Table 6 shows the results of the different methods, with weight storage space normalized to the ALBSNN baseline. For Fashion-MNIST, our recognition accuracy is on the same level as the state-of-the-art networks, but we use less training time and save more than 45% of the storage resources. For CIFAR-10 and CIFAR-100, our proposed ALBSNN obtains 91.63% and 63.54% accuracy, respectively. Compared with Wang-SVGG10, ALBSNN achieves 1.44% and 1.52% higher average testing accuracy with only one time step and fewer epochs. For weight storage space, ALBSNN obtains reductions of more than 20% and 15% on CIFAR-10 and CIFAR-100, respectively.

Conclusion

This paper proposes a construction method for an ultra-low latency Adaptive Local Binary Spiking Neural Network with an Accuracy Loss Estimator, which balances the pros and cons of full precision and binarized weights by choosing binarized or non-binarized weights adaptively. Our proposed network satisfies the requirements of network quantization while keeping high recognition accuracy. We also address the long training time of BSNNs: because of their binary outputs, SNNs usually need to run many time steps to obtain reasonable results, so we propose the GAP Layer, in which a convolution layer replaces the fully connected layer and a global average pooling layer handles the binary output of the SNN. Experiments on Fashion-MNIST, CIFAR-10, and CIFAR-100 show that our method not only saves storage resources and training time but also achieves competitive classification accuracy compared with existing state-of-the-art BSNNs.

References

  • Bengio, Léonard, and Courville (2013) Bengio, Y.; Léonard, N.; and Courville, A. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
  • Cao, Chen, and Khosla (2015) Cao, Y.; Chen, Y.; and Khosla, D. 2015. Spiking deep convolutional neural networks for energy-efficient object recognition. International Journal of Computer Vision, 113(1): 54–66.
  • Courbariaux, Bengio, and David (2015) Courbariaux, M.; Bengio, Y.; and David, J.-P. 2015. Binaryconnect: Training deep neural networks with binary weights during propagations. Advances in neural information processing systems, 28.
  • Deng et al. (2021) Deng, L.; Wu, Y.; Hu, Y.; Liang, L.; Li, G.; Hu, X.; Ding, Y.; Li, P.; and Xie, Y. 2021. Comprehensive snn compression using admm optimization and activity regularization. IEEE transactions on neural networks and learning systems.
  • Illing, Gerstner, and Brea (2019) Illing, B.; Gerstner, W.; and Brea, J. 2019. Biologically plausible deep learning—but how far can we go with shallow networks? Neural Networks, 118: 90–101.
  • Ioffe and Szegedy (2015) Ioffe, S.; and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, 448–456. PMLR.
  • Jang, Skatchkovsky, and Simeone (2021) Jang, H.; Skatchkovsky, N.; and Simeone, O. 2021. BiSNN: Training spiking neural networks with binary weights via Bayesian learning. In 2021 IEEE Data Science and Learning Workshop (DSLW), 1–6. IEEE.
  • Kheradpisheh, Mirsadeghi, and Masquelier (2022) Kheradpisheh, S. R.; Mirsadeghi, M.; and Masquelier, T. 2022. Bs4nn: Binarized spiking neural networks with temporal coding and learning. Neural Processing Letters, 54(2): 1255–1273.
  • Kingma and Ba (2014) Kingma, D.; and Ba, J. 2014. Adam: A Method for Stochastic Optimization. Computer Science.
  • Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
  • Lin, Chen, and Yan (2013) Lin, M.; Chen, Q.; and Yan, S. 2013. Network in network. arXiv preprint arXiv:1312.4400.
  • Lin, Zhao, and Pan (2017) Lin, X.; Zhao, C.; and Pan, W. 2017. Towards accurate binary convolutional neural network. Advances in neural information processing systems, 30.
  • Loshchilov and Hutter (2016) Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
  • Lu and Sengupta (2020) Lu, S.; and Sengupta, A. 2020. Exploring the connection between binary and spiking neural networks. Frontiers in Neuroscience, 14: 535.
  • Mead (1990) Mead, C. 1990. Neuromorphic electronic systems. Proceedings of the IEEE, 78(10): 1629–1636.
  • Mirsadeghi et al. (2021) Mirsadeghi, M.; Shalchian, M.; Kheradpisheh, S. R.; and Masquelier, T. 2021. STiDi-BP: Spike time displacement based error backpropagation in multilayer spiking neural networks. Neurocomputing, 427: 131–140.
  • Neftci, Mostafa, and Zenke (2019) Neftci, E. O.; Mostafa, H.; and Zenke, F. 2019. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6): 51–63.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; and Chintala, S. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library.
  • Qin et al. (2020) Qin, H.; Gong, R.; Liu, X.; Bai, X.; Song, J.; and Sebe, N. 2020. Binary neural networks: A survey. Pattern Recognition, 105: 107281.
  • Roy, Chakraborty, and Roy (2019) Roy, D.; Chakraborty, I.; and Roy, K. 2019. Scaling deep spiking neural networks with binary stochastic activations. In 2019 IEEE International Conference on Cognitive Computing (ICCC), 50–58. IEEE.
  • Sengupta et al. (2019) Sengupta, A.; Ye, Y.; Wang, R.; Liu, C.; and Roy, K. 2019. Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in neuroscience, 13: 95.
  • Shorten and Khoshgoftaar (2019) Shorten, C.; and Khoshgoftaar, T. M. 2019. A survey on image data augmentation for deep learning. Journal of big data, 6(1): 1–48.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Srinivasan and Roy (2019) Srinivasan, G.; and Roy, K. 2019. Restocnet: Residual stochastic binary convolutional spiking neural network for memory-efficient neuromorphic computing. Frontiers in neuroscience, 13: 189.
  • Tavanaei et al. (2019) Tavanaei, A.; Ghodrati, M.; Kheradpisheh, S. R.; Masquelier, T.; and Maida, A. 2019. Deep learning in spiking neural networks. Neural networks, 111: 47–63.
  • Wang et al. (2020) Wang, Y.; Xu, Y.; Yan, R.; and Tang, H. 2020. Deep spiking neural networks with binary weights for object recognition. IEEE Transactions on Cognitive and Developmental Systems, 13(3): 514–523.
  • Wu et al. (2018) Wu, Y.; Deng, L.; Li, G.; Zhu, J.; and Shi, L. 2018. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in neuroscience, 12: 331.
  • Xiao, Rasul, and Vollgraf (2017) Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.