Q-SpiNN: A Framework for Quantizing
Spiking Neural Networks

Rachmad Vidya Wicaksana Putra¹, Muhammad Shafique² ¹Technische Universität Wien (TU Wien), Vienna, Austria
²New York University Abu Dhabi (NYUAD), Abu Dhabi, United Arab Emirates
Email: [email protected], [email protected]

Abstract

A prominent technique for reducing the memory footprint of Spiking Neural Networks (SNNs) without decreasing the accuracy significantly is quantization. However, the state-of- the-art only focus on employing the weight quantization directly from a specific quantization scheme, i.e., either the post-training quantization (PTQ) or the in-training quantization (ITQ), and do not consider (1) quantizing other SNN parameters (e.g., neurons’ membrane potential), (2) exploring different combinations of quantization approaches (i.e., quantization schemes, precision levels, and rounding schemes), and (3) selecting the SNN model with a good memory-accuracy trade-off at the end. Therefore, the memory saving offered by these state-of-the-art to meet the targeted accuracy is limited, thereby hindering processing SNNs on the resource-constrained systems (e.g., the IoT-Edge devices). Towards this, we propose Q-SpiNN, a novel quantization framework for memory-efficient SNNs. The key mechanisms of the Q-SpiNN are: (1) employing quantization for different SNN parameters based on their significance to the accuracy, (2) exploring different combinations of quantization schemes, precision levels, and rounding schemes to find efficient SNN model candidates, and (3) developing an algorithm that quantifies the benefit of the memory-accuracy trade-off obtained by the candidates, and selects the Pareto-optimal one. The experimental results show that, for the unsupervised network, the Q-SpiNN reduces the memory footprint by ca. 4x, while maintaining the accuracy within 1% from the baseline on the MNIST dataset. For the supervised network, the Q-SpiNN reduces the memory by ca. 2x, while keeping the accuracy within 2% from the baseline on the DVS-Gesture dataset.

I Introduction

SNN models have been proposed to solve various data analytic tasks, such as digit classification, object detection, and hand gesture recognition [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. To achieve high accuracy, many large-sized SNN models have been developed, as they have shown better capability for learning the input features than the small ones. For instance, a large SNN model that occupies $>$ 100MB of memory with 32-bit floating-point format (FP32), achieves ca. 92% accuracy on MNIST dataset [13]. Meanwhile, a small model that occupies ca. 0.3MB of memory with FP32, achieves ca. 75% accuracy, as shown in Figs. 1(a)-(b). Consequently, the state-of-the-art SNN models typically have large number of parameters that need to be considered in both the training and the inference phases. Therefore, they incur large memory footprint, which hinder their applicability for the resource-constrained systems, such as the IoT-Edge devices.

To address these issues, prior works have proposed different methodologies, such as (1) reduction of SNN operations via stochastic neuron operations [14], neuron elimination [15], and weight pruning [16]; and (2) quantization [16, 17, 18]. Among these techniques, quantization is a prominent one that incurs relatively low overhead, since it only needs to reduce the data precision. Besides memory saving, the reduced precision also leads to other advantages, e.g., faster computation and lower power/energy consumption. However, reducing the precision of SNN parameters leads to accuracy degradation if it is not performed carefully due to the information loss, as shown in Fig. 1(c). The results show that a network with W(Q1.4) or 6-bit of fixed-point weights, suffers from accuracy drop, compared to the 32-bit floating-point (FP32). Here, the Q1.4 format denotes 1 sign bit, 1 integer bit, and 4 fractional bits¹¹1Note: In this paper, the fixed-point format is represented as Qi.f, with 1 sign bit, i integer bits, and f fractional bits [19]. The value of i for each parameter depends on the range of its integer values. The detailed discussion of the fixed-point format is presented in Section II-B..

Targeted Research Problem: If and how can we employ the quantization on SNNs to maximize the memory saving, while maintaining the accuracy. An efficient solution to this problem will improve the applicability of the SNN systems on resource-constrained devices.

Refer to caption — Figure 1: (a) A fully-connected SNN architecture with the spike-timing- dependent plasticity based unsupervised learning (U-SNN), and its detailed information will be discussed further in Section II-A. (b) Accuracy and memory footprint of the U-SNN with different number of neurons. (c) Accuracy of the 400-neurons U-SNN with different precision levels.

I-A State-of-the-art and Their Limitations

The state-of-the-art have employed quantization to reduce the precision of the weights by directly using a specific quantization scheme, i.e., either the post-training quantization (PTQ) or the in-training quantization (ITQ) [16, 17, 18]. However, they have several drawbacks as they do not consider:

•

quantization for other SNN parameters (e.g., the neurons’ membrane potential) that occupy a considerable amount of memory during the SNN processing [14][20],
•

exploring different combinations of quantization approaches (i.e., quantization schemes, precision levels, and rounding schemes) to find the SNN model that fulfills the targeted accuracy and achieves high memory saving.

Therefore, the memory saving offered by these state-of-the-art to meet the targeted accuracy are limited, thereby hindering the deployment of SNNs on the resource-constrained devices. To highlight the targeted problem and the limitations of the state-of-the-art, we perform an experimental case study, as discussed below.

I-B Motivational Case Study and Key Challenges

We observe that, apart from the weights, there are other SNN parameters that can be quantized to further reduce the memory footprint, e.g., neurons’ membrane and threshold potentials (discussed further in Section II-A). To see the potential of such an idea, we study the impact of different precision levels (bitwidth) for different SNN parameters on the accuracy through experiments using PyTorch-based simulation on GPGPU, i.e., Nvidia RTX 2080 Ti (the detailed experimental setup is explained in Section IV). Fig. 2 shows the experimental results, from which we make the following key observations.

•

Different parameters may have different integer bitwidth, as they have different range of values.
•

Different combinations of precision levels (bitwidth) may achieve comparable accuracy, but occupy different memory footprint. For instance, W(Q1.16)-N(FP32), W(FP32)-N(Q11.16), and W(Q1.16)-N(Q11.16) obtain about 84% accuracy, while consuming ca. 1.2MB, 0.68MB, 1.19MB, and 0.67MB respectively; see label- in Fig. 2.
•

Less memory footprint requires less number of memory accesses, and thereby less access energy. This potentially improves the energy-efficiency of SNN processing, as the memory accesses dominate the energy of SNN processing (i.e., 50%-75% of total system energy) [21].

Although the quantization effectively reduces the memory footprint, it leads to accuracy degradation if the quantization process is not performed carefully. Furthermore, finding the appropriate quantization levels for different SNN parameters is challenging, as the number of potential combinations of precision levels is large. Therefore, the key challenge is how to effectively perform quantization and exploit the trade-off between memory and accuracy, so that the memory footprint is reduced and the targeted accuracy are met.

I-C Our Novel Contributions

To address the above challenges, we propose Q-SpiNN, a novel Quantization framework for Spiking Neural Networks, through the following mechanisms (the overview is in Fig. 3).

•

Employ quantization for different SNN parameters based on their significance to the accuracy, that are analyzed by observing the accuracy obtained under different precision.
•

Explore different combinations of quantization schemes, precision levels, and rounding schemes to find the SNN models that meet the user-targeted accuracy, and refer them to as the solution candidates.
•

Develop and employ an algorithm to select the SNN model from the given candidates. It quantifies the benefit of the memory-accuracy trade-off obtained by the candidates using the proposed reward function, and then selects the one with the highest benefit.

Key Results: We evaluated the Q-SpiNN using PyTorch- based simulation on the GPGPU and the Embedded GPU. The experimental results show that, for the unsupervised SNN, Q-SpiNN achieves 4x memory saving, while maintaining the accuracy within 1% from the baseline on the MNIST. For the supervised one, it achieves 2x memory saving, with the accuracy within 2% from the baseline on the DVS-Gesture.

II Background and Related Work

II-A Spiking Neural Networks (SNNs)

An SNN model is composed of the network architecture, the neuron and synapse model, the spike coding, and the learning rule [22]. There are two major learning approaches that determine how the SNN models are designed and trained, i.e., the unsupervised learning and the supervised learning. In this paper, we evaluate our Q-SpiNN framework for both learning approaches to show its generality for different SNN designs. For the unsupervised SNN, we consider the single-layer network that employs the spike-timing-dependent plasticity (STDP) learning [15]. For the supervised one, we consider the multi-layer network that employs the deep continuous local learning (DECOLLE) [8]. We select them since they show the state-of-the-art accuracy with relatively low memory and compute costs, compared to other designs with same approach.

A Single-Layer SNN with Unsupervised Local Learning (U-SNN): This network consists of a single fully-connected (FC) layer. Each input pixel is converted into the rate-coded spikes which are transferred to all neurons. Each neuron generates spikes that inhibit other neurons, thereby enabling competition among neurons, as shown in Fig. 1(a). Here, the pair-wise weight-dependent STDP learning rule is used, as it defines the maximum allowed weights, which is suitable for fixed-point format (see Eq. 1).

\vspace{-0.3cm}\small\begin{split}\Delta w=\begin{cases}-\eta_{pre}x_{post}w^{\mu}&\text{on}\;\text{presynaptic spike}\\ \eta_{post}x_{pre}(w_{m}-w)^{\mu}&\text{on}\;\text{postsynaptic spike}\end{cases}\end{split}

(1)

$\Delta w$ denotes the weight update, $\eta_{pre}$ and $\eta_{post}$ denote the learning rate for pre- and post-synaptic spike, while $x_{pre}$ and $x_{post}$ denote the pre- and post-synaptic traces, respectively. $w_{m}$ denotes the maximum allowed weight, $w$ denotes the current weight, and $\mu$ denotes the weight dependence factor.

Here, the conductance-based Leaky Integrate-and-Fire (LIF) neuron model is used, since it has low complexity [23]. Its membrane potential ( $V$ ) increases each time a presynaptic spike comes, otherwise it decreases. If the $V$ reaches the threshold potential ( $V_{th}$ ), a spike is emitted, then it goes to the reset potential ( $V_{reset}$ ). To prevent a neuron from dominating the spike firing, the $V_{th}$ is defined as $V_{th}+\theta$ with $\theta$ refers to as the adaptation potential, which adds to $V_{th}$ each time the neuron fires a spike. The synapse is modeled by a conductance, which increases by weight ( $w$ ) when a presynaptic spike comes, and otherwise it decreases. Note, we quantize these weights and neuron parameters to get memory saving.

A Multi-Layer SNN with Supervised Local Learning (S-SNN): This network consists of three convolutional (CONV) layers and one FC layer, as shown in Fig. 4. Each layer is trained using the supervised deep continuous local learning (DECOLLE) [8], whose idea is to use a surrogate gradient for minimizing the local (layer-wise) loss function, so that the readout unit can produce the targeted output ( $\hat{y}$ ). The difference between the readout output ( $y$ ) and the target ( $\hat{y}$ ) denotes the error that is used to train the weights (red dashed-line). In this manner, the loss function minimization can be performed directly in the spiking environment.

The dynamics of each layer are based on the current-based LIF neuron model, and expressed as Eq. LABEL:Eq_NeuronSynapse. $V_{i}^{l}[n]$ denotes the membrane potential of neuron- $i$ in layer- $l$ at timestep- $n$ , while $w_{ij}$ denotes the weight between the pre-synaptic neuron- $j$ and the post-synaptic neuron- $i$ . A spike $S_{i}^{l}[n]$ is emitted at timestep- $n$ if $V_{i}^{l}[n]$ reaches the threshold ( $V_{th}$ ) through the $\Theta$ function, where $\Theta(x)=1$ if $x\geq 0$ , and otherwise 0. $P$ and $Q$ denote the traces of the membrane and the current-based synapse respectively, while $R$ denotes the refractory state and $\rho$ is the inhibition weight. $\alpha=exp(-\frac{\Delta t}{\tau_{mem}})$ , $\beta=exp(-\frac{\Delta t}{\tau_{syn}})$ , and $\gamma=exp(-\frac{\Delta t}{\tau_{ref}})$ denote the decay of the $V$ , $Q$ , and $R$ , respectively. For the detailed discussion on the DECOLLE, we refer to the original paper [8]. We quantize these weights and neuron parameters to get memory saving.

\vspace{-0.4cm}\small\begin{split}V_{i}^{l}[n]&=\sum\limits_{j}w_{ij}^{l}P_{j}^{l}[n]-\rho R_{i}^{l}[n]\\ S_{i}^{l}[n]&=\Theta(V_{i}^{l}[n]-V_{th})\\ P_{j}^{l}[n+1]&=\alpha P_{j}^{l}[n]+Q_{j}^{l}[n]\\ Q_{j}^{l}[n+1]&=\beta Q_{j}^{l}[n]+S_{j}^{l-1}[n]\\ R_{i}^{l}[n+1]&=\gamma R_{i}^{l}[n]+S_{i}^{l}[n]\\ \end{split}\vspace{-0.6cm}

(2)

II-B Fixed-Point Representation and Rounding Schemes

The fixed-point format is represented as Qi.f, that consists of 1 sign bit, i integer bits, and f fractional bits, and follows the 2’s complement format. Given the fixed-point Qi.f, the range of representable values is $[-2^{\texttt{i}},2^{\texttt{i}}-2^{\texttt{-f}}]$ and the precision is $\epsilon=2^{\texttt{-f}}$ . In the quantization process, a rounding scheme is required, and we consider the widely used ones, i.e., truncation, rounding-to-the-nearest, and stochastic [24] [25].

Truncation (TR) keeps the f bits and discards the other bits from the fractional part. Hence, the output fixed-point for the given real number $x$ and configuration Qi.f, is defined as $TR(x,\texttt{Qi.f})=\lfloor x\rfloor$ .

Rounding-to-the-Nearest (RN) rounds the value, that is half-way between two representable values ( $\lfloor x\rfloor+\frac{\epsilon}{2}$ ), by rounding it up. Hence, the output fixed-point for the given real number $x$ and configuration Qi.f, is defined as

\small\begin{split}RN(x,\texttt{Qi.f})=\begin{cases}\lfloor x\rfloor&\text{if}\;\lfloor x\rfloor\leq x<\lfloor x\rfloor+\displaystyle\frac{\epsilon}{2}\\ \lfloor x\rfloor+\epsilon&\text{if}\;\lfloor x\rfloor+\displaystyle\frac{\epsilon}{2}\leq x<\lfloor x\rfloor+\epsilon\end{cases}\end{split}\vspace{-0.2cm}

(3)

Stochastic Rounding (SR) rounds the value using a non- deterministic approach. Given a random value $P\in[0,1)$ that is drawn from a uniform random number generator, the output fixed-point for the real number $x$ and configuration Qi.f is

\small\begin{split}SR(x,\texttt{Qi.f})=\begin{cases}\lfloor x\rfloor&\text{if}\;P\geq\displaystyle\frac{x-\lfloor x\rfloor}{\epsilon}\\ \lfloor x\rfloor+\epsilon&\text{if}\;P<\displaystyle\frac{x-\lfloor x\rfloor}{\epsilon}\end{cases}\end{split}\vspace{-0.1cm}

(4)

II-C Quantization Schemes

There are two widely used quantization schemes in the neural network models, i.e., the Post-Training Quantization, and the In-Training Quantization (or the Quantization-aware Training) [26], whose key mechanisms are shown in Fig. 5.

Post-Training Quantization (PTQ) trains an SNN model with a floating-point precision (e.g., FP32) and results in a trained model. Afterwards, the quantization is performed on the trained model with the given Qi.f precision, resulting in a quantized model for the inference phase.

In-Training Quantization (ITQ) quantizes an SNN model with the given Qi.f precision during the training phase. Therefore, the trained model is already in a quantized form and can be used for the inference phase. The quantization is typically performed using the simulated quantization [26].

III Our Q-SpiNN Framework

III-A Overview

The Q-SpiNN framework employs the following key mechanisms for obtaining memory-efficient SNNs, while maintaining the accuracy (the overview is shown in Fig. 6).

•
Quantization of different parameters (Section III-B): It is performed through the following means.
- –
  
  Maximizing the quantization for each SNN parameter.
- –
  
  Defining the precision level (bitwidth) for each parameter based on its significance, that is obtained by analyzing the accuracy under different precision levels.
•
Design exploration of different quantization approaches (Section III-C): It is done through the following means.
- –
  
  Observing the accuracy obtained by different quantization schemes (i.e., PTQ and ITQ), different precision levels, and different rounding schemes (i.e., TR, RN, and SR),
- –
  
  Selecting the SNN models that meet the targeted accuracy as the solution candidates.
•
An SNN model selection (Section III-D): It searches for an appropriate SNN model from the given candidates through the following means.
- –
  
  Quantifying the benefit of the memory-accuracy trade-off obtained by the SNN model candidates using our proposed multi-objective reward function,
- –
  
  Selecting the SNN model with the highest benefit.

III-B Quantization of Different SNN Parameters

Different SNN designs may have different SNN parameters that can be quantized, as discussed in Section II-A. Therefore, to provide a generic solution for any SNN designs, we propose a significance-aware quantization steps (overview is in Fig. 7). The idea is to maximize the quantization for each SNN parameter, and define the precision level for each parameter based on its significance to the accuracy. For the given SNN model (in FP32), we first determine the parameters to be quantized by manually selecting them. Afterward, we analyze the significance of each parameter to determine the integer and fractional bitwidth. For the integer part, the bitwidth requirement is analyzed by observing the range of parameter values when running the given workload. For the fractional part, there are two cases. If the parameter is a constant, then the bitwidth depends on the parameter value; and otherwise (if parameter is a variable), the bitwidth requirement is analyzed by gradually reducing its precision and observing the output accuracy. In this manner, the impact of the parameters’ bitwidth on accuracy is systematically explored.

Case Study: We provide a case study to show how the proposed quantization steps are done for the unsupervised SNN (U-SNN) with MNIST. First, we select the $w$ , $V$ , $V_{th}$ , $V_{reset}$ , and $\theta$ as the parameters to be quantized (see SNN fundamentals in Section II-A). $V_{reset}$ and $\theta$ are constants, while others (i.e., $w$ , $V$ , and $V_{th}$ ) are variables in the training.

•

For constant parameters: We quantize the constants based on their values (see Table I). For instance, $V_{reset}$ = -65mV and it is represented using 8 bits with Q7.0 format, while $\theta$ = 0.05mV and it is represented using 8 bits with Q1.6 format. In this manner, 24 bits are saved for $V_{reset}$ and $\theta$ , compared to the FP32.
•
For variable parameters: We perform experiments to obtain the ranges of parameter values (see the values in Table I).
- –
  
  For the integer part, we define the integer bitwidth based on the observed ranges, i.e., $V_{th}$ , $V$ , and $w$ need 11 bits, 11 bits, and 1 bit of integer, respectively.
- –
  
  For the fractional part, we gradually reduce the precision and observing the output accuracy to study the impact of different precision levels. Therefore, we perform a design exploration, which is discussed further in Section III-C.

TABLE I: Observed U-SNNs’ Parameter Values for MNIST Workload.

Parameters	Value	Description
$V_{reset}$	-65mV	shown by label- in Fig. 8(a)
$\theta$	0.05mV	shown by label- in Fig. 8(a)
$V_{th}$	-52mV – 1271.88mV	shown by label- in Fig. 8(a)
$V$	-887.29mV – 1250.18mV	shown by label- in Fig. 8(a)
$w$	0 – 0.7	shown by label- in Fig. 8(b)

III-C Exploration of Different Quantization Approaches

To find the effective configuration of quantization for the given SNN, a design exploration of different quantization approaches is required. Therefore, we comprehensively study the impact of different quantization schemes (i.e., PTQ and ITQ), different precision levels, and different rounding schemes (i.e., TR, RN, and SR), and select the ones that meet the user- targeted accuracy. Towards this, we devise a search algorithm across a range of selected parameters to systematically perform the exploration, whose steps are the following (pseudo-code is in Alg. 1, considering an example for the U-SNN case).

•

We train the given model with a floating-point precision (Alg. 1: line 3), and the test accuracy of the trained model is considered as the baseline accuracy (Alg. 1: line 4). Then, we perform the PTQ and the ITQ schemes, subsequently.
•

For the PTQ, the quantization is performed on the trained model, then the accuracy is evaluated (Alg. 1: lines 12-16). Meanwhile for the ITQ, we quantize the given SNN model during the training phase. Therefore, the trained model is already in a quantized form, and can be used for the accuracy evaluation (Alg. 1: lines 18-23).
•

For both schemes, we reduce the precision of each parameter using a nested for-loop (Alg. 1: lines 7-9), and in each step, we explore the use of different rounding schemes (Alg. 1: line 10). The depth of the loop depends on the parameters (e.g., we consider $w$ , $V$ , and $V_{th}$ for U-SNN case). If the accuracy is within the target, then the model is selected as a solution candidate. Otherwise, the currently investigated precision and the lower precision (if any) for the corresponding parameter, are not considered in next exploration steps (Alg. 1: lines 24-33). Therefore, the design space is reduced and the exploration is performed efficiently.

This exploration populates the SNN models that meet the target accuracy (i.e., the solution candidates). To select the appropriate model out of them, we propose a model selection algorithm, which is discussed in the Section III-D.

Algorithm 1 Pseudo-code of the proposed exploration

0: (1) SNN: floating-point model (

m_{fp}

), accuracy (

acc_{fp}

), parameters (

m_{fp}.w,m_{fp}.V

, and

m_{fp}.V_{th}

); // for the U-SNN case study; (2) Maximum allowed accuracy degradation (

acc_{target}

); (3) Quantization schemes (

QS=[PTQ,ITQ]

); (4) Precision levels (

QL

); //

QL

is the user-defined fractional bitwidth sorted in a descending order, e.g.,

QL=[16,14,...,0]

; (5) Rounding schemes (

RS=[TR,RN,SR]

);

0: SNN model candidates (

C

);

0: Initialization:

C=[]

;

c=1

; Process:

\overline{m}_{fp}\leftarrow train(m_{fp})

;

acc_{fp}=test(\overline{m}_{fp})

;

5: for (

qs=1

len(QS)

) do

Nw=len(QL)

;

Nv=len(QL)

;

Nt=len(QL)

;

7: for (

iw=1

Nw

) do

8: for (

iv=1

Nv

) do

9: for (

it=1

Nt

) do

10: for (

rs=1

len(RS)

) do

11: if (

QS[qs]==PTQ

) then

12:

w_{q}=quantize(\overline{m}_{fp}.w,QL[iw],RS[rs])

;

13:

V_{q}=quantize(\overline{m}_{fp}.V,QL[iv],RS[rs])

;

14:

V_{thq}=quantize(\overline{m}_{fp}.V_{th},QL[it],RS[rs])

;

15:

\overline{m}_{q}\leftarrow substitute(\overline{m}_{fp},(w_{q},V_{q},V_{thq}))

;

16:

acc_{q}=test(\overline{m}_{q})

;

17: else

18:

w_{q}=quantize(m_{fp}.w,QL[iw],RS[rs])

;

19:

V_{q}=quantize(m_{fp}.V,QL[iv],RS[rs])

;

20:

V_{thq}=quantize(m_{fp}.V_{th},QL[it],RS[rs])

;

21:

m_{q}\leftarrow substitute(m_{fp},(w_{q},V_{q},V_{thq}))

;

22:

\overline{m}_{q}\leftarrow train(m_{q})

;

23:

acc_{q}=test(\overline{m}_{q})

;

24: if

(acc_{q}\geq(acc_{fp}-acc_{target}))

then

25:

C[c]=\overline{m}_{q}

;

26:

c\mathrel{+}=1

;

27: else

28: if

(iw\geq 1)\&(iv==1)\&(it==1)

then

29:

Nw=iw-1

;

30: else if

(iw==1)\&(iv\geq 1)\&(it==1)

then

31:

Nv=iv-1

;

32: else if

(iw==1)\&(iv==1)\&(it\geq 1)

then

33:

Nt=it-1

;

34: return

C

;

34:

III-D SNN Model Selection Algorithm

We obtain a set of model candidates from the exploration in Section III-C. Afterwards, we need to select the Pareto-optimal model out of the candidates, while considering the accuracy and the memory footprint. Towards this, we propose an SNN model selection algorithm that quantifies the benefit of the memory-accuracy trade-off obtained by the candidates using the proposed multi-objective reward function. The idea of our reward function is to prioritize the model with higher accuracy and smaller memory footprint, which is expressed as Eq. 5.

\small R(acc_{q},m_{norm})=acc_{q}-\mu\;m_{norm}

(5)

\small m_{norm}=\frac{mem_{q}}{mem_{fp}}

(6)

\small\begin{split}mem&=mem\_w+mem\_n=Nw\;Bw+\sum_{k}Nn^{k}\;Bn^{k}\end{split}\vspace{-0.1cm}

(7)

$acc_{q}$ denotes the test accuracy of the quantized SNN model, $m_{norm}$ denotes the normalized memory footprint, and the coefficient $\mu$ is the weight to trade-off between memory and accuracy. Note $acc$ , $m_{norm}$ , and $\mu$ have a value range of [0,1]. $m_{norm}$ is obtained from the ratio between the memory of quantized model ( $mem_{q}$ ) and floating-point model ( $mem_{fp}$ ), as stated in Eq. 6. The memory footprint ( $mem$ ) is estimated by the total memory required by the weights ( $mem\_w$ ) and neuron parameters ( $mem\_n$ ), as shown in Eq. 7. $mem\_w$ is obtained by multiplying the number of weights ( $Nw$ ) and the respective bitwidth ( $Bw$ ). Similar approach is used for neuron parameters, i.e., multiplying the number of parameter ( $Nn$ ) and the bitwidth ( $Bn$ ). Since the neuron has several parameters ( $k$ ) which may have different precision, $mem\_n$ is defined as the total bits from all neuron parameters.

IV Evaluation Methodology

Fig. 9 shows the experimental setup for evaluating the Q-SpiNN framework. We use the PyTorch-based simulation to evaluate the accuracy of the unsupervised SNN [27] and the supervised SNN [8], estimate the memory, and select the SNN model. We run the simulations on GPGPU (i.e., Nvidia GeForce RTX 2080 Ti) and Embedded GPU (i.e., Nvidia Jetson Nano) to show the applicability of the Q-SpiNN framework on different hardwares with different compute and memory capabilities.

Networks: We use networks with different architectures, number of layers, and learning rules to show the generality of our Q-SpiNN. For the unsupervised SNN, we consider a single-layer FC network with the STDP, as shown in Fig. 1(a), while for the supervised SNN, we consider a multi-layer CONV network with the DECOLLE, as shown in Fig. 4.

Datasets: We use the MNIST dataset [13] for the U-SNN, and the DVS-Gesture dataset [28] for the S-SNN. In the MNIST, there are 60,000 images for the training and 10,000 images for the test, each having a dimension of 28×28 pixels. Meanwhile, the DVS-Gesture, which is obtained using a Dynamic Vision Sensor (DVS), consists of 1,342 instances of a set of 11 hand and arm gestures. They are collected from 29 subjects under 3 lighting conditions. Gestures from 23 subjects are used as the training set, and the remaining 6 subjects are used as test set. Each gesture consists of the stream of events and lasts for 6s. The event streams were downsized from 128 × 128 to 32 × 32 and binned in frames of 1ms [8].

Comparisons: We use networks with different precision levels as the comparison partners, for both the unsupervised (i.e., U-SNN) and supervised (i.e., S-SNN) cases. For the U-SNN, we consider a network with 400 excitatory neurons with 1 training epoch (i.e., using the STDP during forward propagation). For the S-SNN, we train the network using the DECOLLE with 200 epochs. For both cases, the baseline refers to the network with FP32 precision for all parameters.

Quantization Format: We use the W(X)-N(Y) format to represent a model with X precision for the weights and Y precision for the neuron parameters (see Section III-B). For conciseness, we simply use W(Qi.f) to represent a model with W(Qi.f)-N(FP32) precision, and N(Qi.f) to represent a model with W(FP32)-N(Qi.f) precision. Furthermore, since there are several neuron parameters involved in the quantization process, their integer part is simply written as i, e.g., N(Qi.8) means that each neuron parameter employs integer bitwidth based on its value range and 8-bit fraction.

V Results and Discussions

V-A Impact of Different Quantization Approaches on Accuracy

Accuracy of the Unsupervised SNN: In the U-SNN case, we quantize the weights ( $w$ ) and the neuron parameters (i.e., $V_{reset}$ , $\theta$ , $V$ , and $V_{th}$ ), and the experimental results are shown in Fig. 10. Here, N(Qi.f) represents the precision of variables $V$ and $V_{th}$ . Notable accuracy degradation from the baseline accuracy is observed when the weights’ bitwidth is reduced to the 4-bit fraction, as pointed by label- for the PTQ and label- for the ITQ. The reason is that the 4-bit fraction (or fewer) for weights does not have sufficient levels of value to modulate the input spikes, thereby making the learning process ineffective. Meanwhile, quantizing $V$ and $V_{th}$ with the same fractional bits (i.e., 4 bits) still maintains the accuracy compared to the baseline, as shown by label- and label- for the PTQ and the ITQ, respectively. The reason is that the values for updating the $V$ and $V_{th}$ can be represented using fewer fractional bits than the ones for updating the weights $w$ . These also indicate that the weights are more significant than the neuron parameters, as their small update can change the accuracy significantly. Hence, quantizing all parameters of the U-SNN also leads to a notable accuracy degradation when the fractional bitwidth is reduced to 4 bits (or fewer), as pointed by label- and label- for the PTQ and the ITQ, respectively

Accuracy of the Supervised SNN: In the S-SNN case, we quantize the weights ( $w$ ), and the neuron parameters (i.e., $\alpha$ , $\beta$ , $\gamma$ , $P$ , $Q$ , $R$ , and $V$ ), and the experimental results are presented in Fig. 10. Here, N(Qi.f) represents the precision of variables $P$ , $Q$ , $R$ , and $V$ . Notable accuracy degradation from the baseline accuracy is observed when reducing the fractional bits of either the weights and the neuron parameters to 10 bits (and fewer), indicating that the weights and the neuron parameters have comparable significance to the accuracy. These also indicate that the S-SNN requires considerable bitwidth to maintain the high accuracy for the DVS-Gesture dataset. The reason is that, the DVS-Gesture is a relatively complex dataset because, besides considering the stream of events in each frame, the network has to draw a correct conclusion of a gesture for the complete stream of events from multiple frames. Therefore, it requires considerable bitwidth to distinguish a gesture from other gestures in each frame and in a complete stream of events.

Additional Discussion: We also make the following observations across different network types (i.e., U-SNN and S-SNN) and different quantization approaches.

•

The SR scheme generally achieves slightly better accuracy than other rounding schemes, because this scheme is not biased towards a specific rounding direction, thereby having a higher probability of values that obtain higher accuracy. However, it consumes the highest hardware resource as it needs a random number generator.
•

Different combinations of the quantization and rounding schemes achieve various accuracy, but their values are not significantly different. Users can decide the quantization and rounding schemes, as well as the parameters to be quantized, that are suitable for the target applications, considering the accuracy and memory constraints, and the exploration cost. Therefore, the overhead depends on the selected scheme.

V-B SNN Model Selection with the Memory-Accuracy Trade-Off

To find the SNN model that offers a good memory-accuracy trade-off, we employ the proposed reward function in Eq. 5 that quantifies the trade-off benefit of the given model. To do this, we need to define the coefficient $\mu$ in the reward function. Small $\mu$ means that the function prioritizes the weight of accuracy more than the memory. On the other hand, large $\mu$ means that the function prioritizes the weight of memory more than the accuracy. The users can define the value of $\mu$ based on their preferences to meet the design specifications. In this work, for the exploration purpose, we define the value of coefficient $\mu\in\{0.01,0.1,0.2,0.3,0.4,0.5,1\}$ .

Model Selection for the Unsupervised SNN: We apply the proposed reward function to the explored U-SNN models and the results are provided in Fig. 12(a) and Table II, from which we make the following observations.

•

Label- : The model that has the highest reward for $\mu=0.01$ is the one that employs W(Q1.8)-N(Qi.8) precision, and achieves 86.56% accuracy and 3.194x memory saving.
•

Label- : The model that has the highest reward for $\mu=0.3$ is the one that employs W(Q1.6)-N(FP32) precision, and achieves 86.27% accuracy and 3.94x memory saving.
•

Label- : The model that has the highest reward for $\mu\in\{0.1,0.2,0.4,0.5,1\}$ is the one that employs W(Q1.6)- N(Qi.6) precision, and achieves 86.24% accuracy and 3.987x memory saving.

These results show that larger $\mu$ shifts the preferred model towards the one with smaller memory, which typically has lower accuracy. Meanwhile, smaller $\mu$ shifts the preferred model towards the one with higher accuracy, which typically has larger memory footprint. If the maximum tolerance of accuracy degradation is 1% from the baseline, then the model with W(Q1.6)-N(Qi.6) precision is the Pareto-optimal one, with 86.24% accuracy and 3.987x memory saving.

Model Selection for the Supervised SNN: We also apply the proposed reward function to the explored S-SNNs models and the results are provided in Fig. 12(b) and Table II, from which we make the following observations.

•

Label- : The model that has the highest reward for $\mu=0.01$ is the one that employs W(FP32)-N(Qi.16) precision, and achieves 96.14% accuracy and 1.165x memory saving.
•

Label- : The model that has the highest reward for $\mu=0.1$ is the one that employs W(Q1.14)-N(Qi.14) precision, and achieves 95.14% accuracy and 1.926x memory saving.
•

Label- : The model that has the highest reward for $\mu\in\{0.2,0.3,0.4,0.5,1\}$ is the one that employs W(Q1.12)- N(Qi.12) precision, and achieves 94.14% accuracy and 2.132x memory saving.

Here, similar trend regarding the impact of $\mu$ is also observed, e.g., larger $\mu$ shifts the preferred model towards the one with smaller memory and lower accuracy. If the maximum tolerance of accuracy degradation is only 1% from the baseline, then the model with W(FP32)-N(Qi.16) precision is selected. If we relax the tolerance to 2%, it suggests different Pareto-optimal SNN model, i.e., the model with W(Q1.14)-N(Qi.14) precision, 95.14% accuracy, and 1.926x memory saving.

The above results and discussion show that our Q-SpiNN framework provides (1) a comprehensive information about the accuracy and the memory of the given SNN models under different quantization approaches, and (2) an effective model selection to find the efficient SNN model. Moreover, the users can set $\mu$ with their preferred value in the reward function to select an SNN model that meets their design requirements.

VI Conclusion

We propose the Q-SpiNN framework for quantizing SNNs through (1) quantization of different parameters, (2) exploration that considers different quantization schemes, precision levels, and rounding schemes, and (3) employment of a reward function for model selection. For the unsupervised SNN, the Q-SpiNN obtains ca. 4x memory saving and keeps the accuracy within 1% from the baseline on the MNIST. For the supervised one, it obtains ca. 2x memory saving, and keeps the accuracy within 2% from the baseline on the DVS-Gesture. Therefore, our framework would enable the SNN systems to be deployable on the resource-constrained devices.

TABLE II: The best accuracy and the memory savings, across different quantization schemes, precision levels, and rounding schemes

Precision	U-SNN		S-SNN
	Accuracy	Memory	Accuracy	Memory
	Accuracy	Saving	Accuracy	Saving
Baseline	87.20%	1.000x	96.18%	1.000x
W(Q1.16)	86.54%	1.771x	95.79%	1.395x
W(Q1.14)	86.67%	1.990x	95.14%	1.479x
W(Q1.12)	86.66%	2.271x	95.14%	1.573x
W(Q1.10)	86.52%	2.644x	91.18%	1.680x
W(Q1.8)	86.56%	3.165x	87.14%	1.803x
W(Q1.6)	86.27%	3.940x	82.75%	1.934x
W(Q1.4)	76.66%	5.218x	78.14%	2.110x
N(Qi.16)	86.36%	1.002x	96.14%	1.165x
N(Qi.14)	86.63%	1.002x	95.14%	1.182x
N(Qi.12)	86.53%	1.002x	94.53%	1.200x
N(Qi.10)	86.53%	1.003x	90.44%	1.218x
N(Qi.8)	86.67%	1.003x	86.44%	1.238x
N(Qi.6)	86.57%	1.003x	82.10%	1.257x
N(Qi.4)	86.58%	1.003x	78.10%	1.277x
W(Q1.16)-N(Qi.16)	86.53%	1.778x	95.49%	1.739x
W(Q1.14)-N(Qi.14)	86.67%	1.999x	95.14%	1.926x
W(Q1.12)-N(Qi.12)	86.68%	2.284x	94.14%	2.132x
W(Q1.10)-N(Qi.10)	86.52%	2.663x	90.10%	2.404x
W(Q1.8)-N(Qi.8)	86.56%	3.194x	86.79%	2.756x
W(Q1.6)-N(Qi.6)	86.24%	3.987x	81.79%	3.228x
W(Q1.4)-N(Qi.4)	76.71%	5.306x	77.40%	3.895x

Acknowledgment

This work was partly supported by Indonesia Endowment Fund for Education (LPDP) Scholarship Program, from the Ministry of Finance, Indonesia.

References

[1] M. Pfeiffer and T. Pfeil, “Deep learning with spiking neurons: Opportunities and challenges,” Frontiers in Neuroscience, vol. 12, 2018.
[2] A. Tavanaei et al., “Deep learning in spiking neural networks,” Neural Networks, vol. 111, pp. 47–63, 2019.
[3] P. Diehl and M. Cook, “Unsupervised learning of digit recognition using spike-timing-dependent plasticity,” Frontiers in Computational Neuroscience, vol. 9, p. 99, 2015.
[4] H. Hazan et al., “Unsupervised learning with self-organizing spiking neural networks,” in Proc. of IJCNN, July 2018, pp. 1–6.
[5] D. J. Saunders et al., “Stdp learning of image patches with convolutional spiking neural networks,” in Proc. of IJCNN, July 2018, pp. 1–7.
[6] D. J. Saunders et al., “Locally connected spiking neural networks for unsupervised feature learning,” Neural Networks, vol. 119, 2019.
[7] H. Hazan et al., “Lattice map spiking neural networks (lm-snns) for clustering and classifying image data,” Annals of Mathematics and Artificial Intelligence, Sep. 2019.
[8] J. Kaiser et al., “Synaptic plasticity dynamics for deep continuous local learning (decolle),” Frontiers in Neuroscience, vol. 14, p. 424, 2020.
[9] R. Massa et al., “An efficient spiking neural network for recognizing gestures with a dvs camera on the loihi neuromorphic processor,” in Proc. of IJCNN, 2020, pp. 1–9.
[10] R. V. W. Putra et al., “Sparkxd: A framework for resilient and energy- efficient spiking neural network inference using approximate dram,” arXiv, vol. 2103.00421, 2021.
[11] V. Venceslai et al., “Neuroattack: Undermining spiking neural networks security through externally triggered bit-flips,” in Proc. of IJCNN, 2020.
[12] R. V. W. Putra et al., “Spikedyn: A framework for energy-efficient spiking neural networks with continual and unsupervised learning capabilities in dynamic environments,” arXiv, vol. 2103.00424, 2021.
[13] Y. Lecun et al., “Gradient-based learning applied to document recognition,” Proc. of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[14] S. Sen et al., “Approximate computing for spiking neural networks,” in Proc. of DATE, March 2017, pp. 193–198.
[15] R. V. W. Putra and M. Shafique, “Fspinn: An optimization framework for memory-and energy-efficient spiking neural networks,” IEEE TCAD, vol. 39, no. 11, pp. 3601–3613, 2020.
[16] N. Rathi et al., “Stdp-based pruning of connections and weight quantization in spiking neural networks for energy-efficient recognition,” IEEE TCAD, vol. 38, no. 4, pp. 668–677, April 2019.
[17] M. Sorbaro et al., “Optimizing the energy consumption of spiking neural networks for neuromorphic applications,” Frontiers in Neuroscience, vol. 14, p. 662, 2020.
[18] C. Zou et al., “A novel conversion method for spiking neural network using median quantization,” in ISCAS, 2020, pp. 1–5.
[19] A. Granas and J. Dugundji, Fixed Point Theory. Springer, 2003.
[20] A. Roy et al., “A programmable event-driven architecture for evaluating spiking neural networks,” in Proc. of ISLPED, July 2017, pp. 1–6.
[21] S. Krithivasan et al., “Dynamic spike bundling for energy-efficient spiking neural networks,” in Proc. of ISLPED, July 2019, pp. 1–6.
[22] M. Mozafari et al., “Spyketorch: Efficient simulation of convolutional spiking neural networks with at most one spike per neuron,” Frontiers in Neuroscience, vol. 13, p. 625, 2019.
[23] E. M. Izhikevich, “Which model to use for cortical spiking neurons?” IEEE TNN, vol. 15, no. 5, pp. 1063–1070, Sep. 2004.
[24] M. Hopkins et al., “Stochastic rounding and reduced-precision fixed-point arithmetic for solving neural ordinary differential equations,” Philosophical Transactions of the Royal Society A, vol. 378, 2020.
[25] S. Gupta et al., “Deep learning with limited numerical precision,” in Proc. of the ICML, 2015, p. 1737–1746.
[26] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv, vol. 1806.08342, 2018.
[27] H. Hazan et al., “Bindsnet: A machine learning-oriented spiking neural networks library in python,” Frontiers in Neuroinformatics, 2018.
[28] A. Amir et al., “A low power, fully event-based gesture recognition system,” in Proc. of CVPR, 2017, pp. 7388–7397.

Q-SpiNN: A Framework for Quantizing Spiking Neural Networks

Abstract

I Introduction

I-A State-of-the-art and Their Limitations

I-B Motivational Case Study and Key Challenges

I-C Our Novel Contributions

II Background and Related Work

II-A Spiking Neural Networks (SNNs)

II-B Fixed-Point Representation and Rounding Schemes

II-C Quantization Schemes

III Our Q-SpiNN Framework

III-A Overview

III-B Quantization of Different SNN Parameters

III-C Exploration of Different Quantization Approaches

III-D SNN Model Selection Algorithm

IV Evaluation Methodology

V Results and Discussions

V-A Impact of Different Quantization Approaches on Accuracy

V-B SNN Model Selection with the Memory-Accuracy Trade-Off

VI Conclusion

Acknowledgment

References

Q-SpiNN: A Framework for Quantizing
Spiking Neural Networks