Direct Implementation of Neuromorphic Temporal Neural Networks Using Off-the-Shelf CMOS Technology
Abstract
Temporal Neural Networks (TNNs) use time as a resource to represent and process information, mimicking the behavior of the human neocortex. This work presents an architecture and design approach targeting the implementation of TNNs using off-the-shelf CMOS design and synthesis tools. Gate-level designs are given for a neuron and a column of neurons as building blocks for TNNs. A column consists of multiple excitatory neurons based on the SRM0 neuron model, followed by winner-take-all (WTA) inhibition and is capable of online, unsupervised, and localized learning using spike timing dependent plasticity (STDP).
This effort targets the direct implementation of TNN neurons and columns in standard CMOS. In a direct implementation, the hardware clock cycle also serves as the basic time unit for TNN temporal processing. We explore the scalable design space in three dimensions: 1) number of synapses per neuron p (from 64 to 1K), 2) number of neurons per column q (from 8 to 16), and 3) silicon process node (from 45nm to 7nm). Our results show that a single 10x784 column, with 10 neurons and 784 synapses/neuron, implemented in 7nm CMOS, is capable of processing each MNIST image in 8.85 ns while consuming only 179 µW and requiring only 0.037 mm2 die area. We believe TNNs can facilitate extremely energy-efficient edge-native online sensory processing.
1 Introduction
1.1 Temporal Neural Networks
Temporal Neural Networks (TNNs) [smith2017space] are a class of spiking neural networks (SNNs) that encode and process information in temporal form using precise spike timings to represent information [vanrullen2005spike], as contrasted with SNNs that use spike rates for information encoding and processing. Furthermore, unlike artificial neural networks (e.g. CNNs, RNNs, DNNs), TNNs generally strive for biological plausibility with the goal of achieving brain-like capabilities and efficiencies.
The taxonomy in Figure 1 further highlights the distinctive features of TNNs [smith2017space]. TNNs are composed of neurons that perform temporal functions on values encoded as relative spike timings. Furthermore, TNNs are trained via Spike Timing Dependent Plasticity (STDP) [guyonneau2005neurons]. With STDP, a synapse's weight is updated based only on the relative timings of its incoming spike (from a pre-synaptic neuron) and the outgoing spike (if any) from its post-synaptic neuron. Hence, the STDP training paradigm has the potential for enabling unsupervised, localized, online, and continuous learning, without requiring the compute-intensive backpropagation that is employed in all deep neural networks (DNNs).

1.2 Motivation and Contributions
This paper explores the feasibility and potential of implementing TNNs using standard off-the-shelf digital CMOS technology and design tools. Due to the neuromorphic attributes of TNNs, they offer the potential for a new computing fabric that performs human-like sensory processing with extreme energy efficiency. This effort is a first step towards realizing a silicon neocortex composed of a complex of TNNs.
Previous research has shown that TNNs can be built from functional building blocks that model a neuron and a column of neurons, analogous to building blocks in biological neural networks. We define a “column” as a collection of parallel excitatory neurons, followed by winner-take-all (WTA) inhibition across the neurons. In this paper we present the gate-level designs of neurons and columns. We evaluate our designs based on the metrics of: Area (mm2) for complexity, Delay (ns) for performance, and Power (mW) for efficiency. The key contributions of this work include:
• An architecture and a design approach for implementing practical TNNs using off-the-shelf CMOS technology and design tools.
• Gate-level designs of: 1) a scalable neuron, with a ramp-no-leak (RNL) response function and STDP learning rules, and 2) a scalable column, with winner-take-all (WTA) lateral inhibition.
• Parameterized gate-count equations for exploring the design space by scaling the synapse count p per neuron and the neuron count q per column.
• Post-synthesis evaluation of the Area, Delay, and Power metrics of the scalable neuron and column designs, based on a standard cell library for the 45nm process node.
• Results on the Area, Delay, and Power metrics of the scalable neuron and column designs as the process node is scaled from 45nm down to 7nm.
• A prototype TNN column (10x784), with 10 neurons and 784 synapses/neuron, capable of processing MNIST-type images in an online fashion.
2 Related Work
In this section, we summarize previous works in several related domains and highlight the differences of our work relative to these previous works.
2.1 Deep Neural Networks
Deep Neural Networks (DNNs), including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), are currently dominant paradigms for performing human-like sensory processing using machine learning (ML) techniques. DNNs can effectively support ML applications such as image classification and speech recognition, providing on-par or even superior performance to humans [pouyanfar2018survey].
DNNs typically employ: 1) tensor processing with high-dimensional matrix multiplications, and 2) supervised global learning with backpropagation and stochastic gradient descent. Although highly effective, neither technique is biologically plausible, and both require an enormous amount of linear algebraic computation. Since 2012, the computation required for training larger and deeper DNNs has been doubling every 3.4 months, i.e., increasing at a rate of 10x/year [openai].
To address this exponential computation demand, especially for DNN training, many specialized accelerators have been developed. Examples include Minerva [reagen2016minerva], Eyeriss [chen2016eyeriss, chen2019eyeriss], the DianNao family [chen2016diannao], and the Google TPU [jouppi2017datacenter], mostly aimed at improving inference efficiency in hardware [chen2020survey]. There have also been digital [eldredge1994rrann, esser2015backpropagation, girau1996line, cloutier1994hardware] and analog [withagen1994implementing, dolenko1995tolerance] approaches toward hardware implementations of backpropagation. These accelerators are being deployed in both data centers and edge devices. High-end mobile SoCs now incorporate specialized processing units for accelerating AI/ML workloads (e.g. Apple NPU, Qualcomm HTA, Huawei NPU, Mediatek APU, etc.). Most of these accelerators provide more efficient tensor processing hardware by: 1) increasing the number of smaller MAC (multiply-accumulate) units; 2) lowering the precision of weights via quantization; and 3) employing various forms of network pruning to achieve sparser computation with lower memory requirements.
TNNs are fundamentally different and are based on a computing paradigm that does not involve tensor processing. TNNs perform computation based on spike timing relationships and do not use linear algebraic operations. The learning paradigm for TNNs is localized and unsupervised, and is capable of on-line and continuous learning. Hence, inferencing and training can occur simultaneously.
2.2 Spiking Neural Networks
Spiking Neural Networks (SNNs) use spikes to communicate between neurons and are a broader class of neural networks that includes TNNs, as shown in Figure 1. However, most SNNs in the literature employ rate encoding, whereas TNNs use temporal encoding. In rate encoding, spike rates are used to represent values; in temporal encoding, precise spike times are used.
Many SNNs also use backpropagation for learning, similar to DNNs. TNNs use spike times for temporal encoding and STDP for local learning. Multilayer SNNs proposed in [o2016deep, lee2016training, neftci2017stochastic] employ both rate encoding and backpropagation, whereas SNNs in [diehl2015unsupervised, tavanaei2017multi, tavanaei2017spiking, querlioz2013immunity, brader2007learning] use rate encoding coupled with STDP. The works in [liu2017mt, mostafa2017supervised] implement feedforward SNNs with temporal encoding but use backpropagation for learning.

The work that comes the closest to ours is from [kheradpisheh2018stdp, mozafari2019bio]. That work uses both temporal encoding and STDP learning, very similar to our approach. However, their TNN approach also incorporates a conventional convolution structure with weight sharing. In our TNN organization, we assume unique weights for synapses, and there is no weight sharing. The column design we use can be viewed as a structural transpose of their conventional feature map organization (as illustrated in Figure 2). The number of feature maps per layer in their TNN is equal to the number of neurons per column in our TNN, and the number of columns in our layer is equal to the size of their 2D feature map (height x width).
2.3 Neuromorphic Hardware
In recent years several experimental neuromorphic chips have been introduced. Of them, we highlight three digital implementations here.
TrueNorth is a digital CMOS neuromorphic chip introduced by IBM in 2014 [merolla2014million], fabricated in 28 nm CMOS technology. It consists of one million neurons distributed across 4096 cores, with each core containing 256 leaky-integrate-and-fire (LIF) neurons, each with 256 synapses. Synaptic weights can be set to either 0 or 1 (1-bit weights). The chip runs at 1 kHz with 4-bit resolution for time steps (0 to 15). Time delays are stored and processed as 4-bit binary values, making this an indirect implementation. Spikes are communicated as packets using address event representation. The authors use offline training for their performance demonstrations. TrueNorth achieves a commendable 20 mW/cm2 power density, which is about three orders of magnitude more efficient than typical CPUs.
Intel unveiled Loihi in 2018 as their flagship neuromorphic chip built in 14 nm FinFET process [davies2018loihi]. The main difference between Loihi and TrueNorth is that Loihi supports STDP-based learning. Each chip consists of 130,000 LIF neurons distributed across 128 cores, and each core consists of 1024 neurons, each with 1024 synapses. It can support 1-9 bit synaptic weights. Spikes are represented as packetized messages. Intel has provided Loihi boards in the cloud for access and experimentation by a community of researchers.
ODIN [frenkel20180] is a digital neuromorphic ASIC designed in 28 nm CMOS, consisting of 256 neurons with 256 synapses each. It implements a version of STDP, namely, Spike Dependent Synaptic Plasticity (SDSP) in order to support localized learning. It implements a 3-bit synaptic weight. Spikes are represented as AER packets.
Our goal is not to create another hardware platform for experimentation, as is the case with the prior neuromorphic chip implementations, but a new computation model and processor architecture for designing and implementing application-specific neuromorphic processors in the form of TNNs. In the long run, we are also interested in developing a suite of software tools to support this new design framework.
2.4 Temporal Encoding and Processing
A distinctive attribute of TNNs is the use of temporal encoding instead of the rate encoding used in most ANNs and SNNs. With temporal encoding, information is represented by the precise relative timings of spikes. In TNNs, computation occurs in waves, or volleys, of spikes. In each wave, a volley of spikes arrives at a neuron's inputs with at most one spike at each synaptic input. The value represented by each input spike is based on its spike time relative to the first spike in the volley. The first spike in the volley represents 0, and subsequent spikes are assigned increasing values based on increasing delays relative to the leading spike. Temporal encoding is illustrated in Figure 3. If no spike occurs on an input, it is assigned a spike time of ∞. This is in contrast with rate encoding, where spike rates are used to communicate information.

For rate encoding, a much larger time window is needed for each computation wave, in order to sample enough spikes on an input to determine its spiking rate. Temporal encoding is supported by experiments in neuroscience, whereas rate encoding has been shown to be implausibly slow compared to biology [thorpe1989biological, thorpe1990spike]. It has been proposed that computational waves are synchronized by inhibitory gamma oscillations that provide a temporal coding window of 5-10 msec [gray1989oscillatory] (allowing for an inhibitory phase of the cycle). Neuron spiking behavior has been shown to be repeatable to within about 1 msec [butts2007temporal, mainen1995reliability]. A 5-10 msec window along with the 1 msec encoding precision implies that only 3-4 bits are needed to encode the possible spike times within a volley.
In this work, we employ temporal encoding and processing. In our direct implementation of TNNs using standard CMOS, we use the actual hardware clock cycle (alpha cycle) as the basic unit of time for temporal encoding and processing. We adopt 3-bit temporal encoding for values ranging from 0 to 7. When using pulse-based, unary encoding for spike times, at least 8 alpha cycles are required to traverse each parallel column of neurons (beta cycle). For a TNN with multiple layers of columns, traversing the entire network takes even more time, but the network will typically be pipelined. The pipelining depth determines the frequency of the computation waves (gamma cycle). For example, when pipelined between TNN layers, the gamma cycle extends for 7+8=15 alpha cycles, as shown in Figure 3. It should be emphasized that the proposed design uses only two clocks, one to represent the basic time unit (alpha clock or aclk) and the other to represent the gamma computational wave (gamma clock or gclk). The beta cycle in Figure 3 is used only to illustrate the coding window.
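To make the encoding concrete, the following is a minimal Python sketch of 3-bit relative temporal encoding (our functional simulation models are written in Python; the function and constant names here are ours, for illustration):

```python
TMAX = 7  # 3-bit encoding: valid spike times are 0..7; None models "no spike" (infinity)

def encode_volley(raw_times):
    """Shift a volley so the earliest spike is at t=0 and clamp to the 3-bit range."""
    finite = [t for t in raw_times if t is not None]
    if not finite:
        return [None] * len(raw_times)
    t0 = min(finite)  # the leading spike in the volley represents value 0
    return [None if t is None else min(t - t0, TMAX) for t in raw_times]

# Example: four input lines; line 2 never spikes.
print(encode_volley([12, 15, None, 19]))  # -> [0, 3, None, 7]
```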
3 Excitatory Neuron Model
This work focuses on the SRM0 excitatory neuron model, based on the popular biologically inspired Spike Response Model [kistler1997reduction], as shown in Figure 4. This section presents the components of this model, specifically for inference, along with their detailed gate-level designs. STDP is discussed later in Section 5.
3.1 Synaptic Response Functions

A synapse connects the axon (output) of a pre-synaptic neuron to a dendrite (input) of the post-synaptic neuron. An SRM0 neuron takes multiple input spikes and generates a response function for each, scaled by its corresponding synaptic weight. All the individual response functions are then integrated to form the neuron's membrane potential. When (and if) the membrane potential crosses a threshold, the neuron fires an output spike on its axon. Four examples of discretized response functions, namely biexponential, piecewise linear, step-no-leak (SNL) and ramp-no-leak (RNL), are shown in Figure 5. SNL increases the membrane potential by the weight value instantaneously when an input spike arrives. The RNL function increases by a unit step at every time unit until it reaches its peak (equal to the weight) and then remains constant until it is reset prior to the next computation cycle. As opposed to SNL, the RNL response function introduces a temporal aspect into the computation through its gradually rising ramp. We focus on the RNL response function in this paper due to its temporal computational benefits.
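The two no-leak response functions can be modeled in a few lines of Python. Whether the ramp's first unit step lands on the arrival cycle or the following cycle is a modeling choice; the sketch below assumes the former:

```python
def snl_response(weight, t_spike, t):
    """Step-no-leak: jumps to `weight` at the input spike time, then holds."""
    return weight if t_spike is not None and t >= t_spike else 0

def rnl_response(weight, t_spike, t):
    """Ramp-no-leak: rises one unit per time step after the spike, saturating at `weight`."""
    if t_spike is None or t < t_spike:
        return 0
    return min(t - t_spike + 1, weight)

# Weight 5, spike at t=2: the RNL ramp reaches its peak of 5 at t=6 and holds.
print([rnl_response(5, 2, t) for t in range(8)])  # [0, 0, 1, 2, 3, 4, 5, 5]
```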

3.2 SRM0 Neuron Model

Figure 6 shows the block diagram for the proposed SRM0 neuron with p synapses and the ramp-no-leak response function (as shown in Figure 5). Note that this model applies to inference; learning is discussed later. Both the input spikes (x1,…,xp) and the output spike (z) are represented using 8-unit time signal pulses (spanning a beta cycle). In general, the pulse width is equal to the maximum weight. Synaptic weights range between 0 and 7 and are encoded as 3-bit binary values, which are read out via readout logic. A 3-bit FSM serves as the readout logic, generating the appropriate up-step (1 or 0) at every time instant in the form of 1-cycle wide pulses. These 1-bit up-step pulses are then counted and accumulated over time to calculate the aggregate potential of the neuron, with the help of a simple accumulator and comparator which comprise the neuron body. An interesting aspect of this design is that the 3-bit FSM in the readout logic itself stores the synaptic weight, so no additional hardware is required for weight storage. Although illustrated as two separate units in Figure 6 for simplicity, the synaptic weight and readout logic share a single 3-bit FSM. The two main neuronal components shown in Figure 6, namely the FSM and the neuron body, are explained below.
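Behaviorally, the SRM0 neuron of Figure 6 reduces to summing the RNL responses and reporting the first threshold crossing. The following is a minimal sketch under the same assumptions as above (the evaluation horizon and function names are ours):

```python
def rnl_response(weight, t_spike, t):
    """Ramp-no-leak response (as in the earlier sketch)."""
    if t_spike is None or t < t_spike:
        return 0
    return min(t - t_spike + 1, weight)

def srm0_output_time(spike_times, weights, theta, horizon=16):
    """First time step at which the summed RNL responses reach theta (None if never)."""
    for t in range(horizon):
        potential = sum(rnl_response(w, ts, t) for w, ts in zip(weights, spike_times))
        if potential >= theta:
            return t  # in hardware this launches the 8-cycle output pulse
    return None

# Three synapses of weight 3 spiking at t=0 reach theta=8 at t=2 (3 -> 6 -> 9).
print(srm0_output_time([0, 0, 0, None], [3, 3, 3, 7], theta=8))  # -> 2
```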
3.3 FSM: Synaptic Weights and Readout Logic

Each synapse has its own FSM, as shown in Figure 7, to store and read out its weight. The goal of this FSM is to read out the 3-bit binary weight in a serial, thermometer-encoded fashion without changing the weight itself. For example, if the synaptic weight is 5 or "101", the output should be a 5-unit time signal pulse starting from the time of arrival of the input spike. As discussed before, our design has two clocks, one to represent the basic time unit (aclk) and the other to represent the gamma computational wave (gclk). Each of the three flipflops in the FSM is an asynchronous set-reset flipflop with the input spike-gated aclk (tclk, as shown in Figure 7) as the asynchronous input and gclk as the synchronous clock input. The 3-bit FSM starts counting down when an input spike arrives and wraps around over 8 cycles (the width of the input pulse). This ensures that the synaptic weight is unchanged when readout is completed. The FSM outputs a 1 in every state until a transition occurs from S0 to S7, at which point the output gets latched to 0. The latch is released when gclk arrives. This mechanism ensures that for a synaptic weight of n, an n-unit time pulse is generated at the output. The synaptic weight can only be modified when the synchronizing clock (gclk) arrives. At that point, the count is increased by 1 if inc is set, decreased by 1 if dec is set, and left unchanged otherwise, with saturation at 0 and 7. The inc and dec signals never arrive together, and they are only enabled when there is no input spike, so as not to overlap with the duration of the asynchronous tclk input.
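A cycle-level Python model of this readout behavior is sketched below (the tick bookkeeping and names are ours); it shows how the down-counting wrap both produces the thermometer pulse and restores the weight:

```python
def fsm_readout(weight):
    """Serial thermometer readout of a 3-bit weight: a `weight`-cycle pulse of 1s."""
    state, latched, out = weight, False, []
    for _ in range(8):                  # 8 tclk ticks (one coding window)
        if state == 0:
            latched = True              # S0 -> S7 wrap: output latches to 0
        out.append(0 if latched else 1)
        state = (state - 1) % 8         # count down, wrapping 0 -> 7
    return out                          # after 8 ticks, state == weight again

print(fsm_readout(5))  # [1, 1, 1, 1, 1, 0, 0, 0]
```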
3.4 Neuron Body

Using ripple carry adders as fundamental units, an efficient p-input accumulator is designed by integrating a (p-1)-input parallel combinational counter and a (log2p+1)-bit adder into one design [parhami1995accumulative]. Figure 8 shows the logic diagram for a 16-input accumulator with integrated output spike generation. For a p-input accumulator, p-1 inputs are accumulated into a (log2p)-bit output, which is then zero-extended and added to the previously stored (log2p+1)-bit value from the register. The one remaining input bit is directly provided as a carry-in to the last stage (highlighted by the pink box in Figure 8). Zero-extension is implemented virtually by using a half adder (HA) for the MSB in the last stage. Comparison with the threshold θ is implemented simply by initializing the (log2p+1)-bit accumulating register with the signed 2's complement representation of (-θ). Thus, comparison with the threshold reduces to checking the (log2p+1)th bit of the output: if it is '0', the output is non-negative, i.e., the accumulated value has crossed the threshold, so the complement of the (log2p+1)th bit denotes the occurrence of an output spike. An 8-unit time pulse is enforced at the output spike by using a 3-bit counter triggered by this (log2p+1)th bit. This bit is also used as a control signal to a multiplexer which resets the register to (-θ) at the next cycle.
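The negative-threshold trick can be illustrated with a small sketch: initialize the register to -θ in 2's complement and watch the sign bit (register width and reset behavior are simplified here, and the names are ours):

```python
def accumulate_and_fire(up_steps_per_cycle, theta, nbits):
    """Model the Figure 8 accumulator: start at -theta (2's complement);
    a cleared sign bit means the running sum has reached the threshold."""
    mask = (1 << nbits) - 1
    acc = (-theta) & mask                     # 2's-complement encoding of -theta
    fired = []
    for s in up_steps_per_cycle:              # s = number of 1-bit up-steps this cycle
        acc = (acc + s) & mask
        sign = (acc >> (nbits - 1)) & 1       # MSB: 1 while the sum is still negative
        fired.append(sign == 0)               # spike indicator = complement of MSB
    return fired

# theta = 8, 11-bit register (log2(1024)+1): three weight-3 ramps rising together.
print(accumulate_and_fire([3, 3, 3], theta=8, nbits=11))  # [False, False, True]
```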
The gate counts for the neuron body components are:

Gates_FA+HA = 5*(p-1) + 2 = 5p - 3
Gates_register = 5*(log2p + 1) = 5log2p + 5
Gates_mux = 3log2p + 4 (for the (log2p+1)-bit 2-to-1 mux)
Gates_counter = 23

plus 2 extra gates (1 INV and 1 OR). Summing these terms gives the total neuron body gate count:

Gates_neuron-body = 5p + 8log2p + 31    (1)
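As a sanity check on Equation (1) as reconstructed above, a one-line computation for the largest neuron we synthesize (function name ours):

```python
import math

def neuron_body_gates(p):
    """Neuron body gate count per Equation (1): counter/adder, register, mux, output counter."""
    log2p = math.log2(p)
    return (5 * (p - 1) + 2) + (5 * log2p + 5) + (3 * log2p + 4) + 23 + 2

print(neuron_body_gates(1024))  # 5231.0 -- a small fraction of the ~102K-gate neuron
```

This is consistent with the observation in Section 6 that the neuron body contributes only a small share of the total neuron complexity.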
4 Column Architecture
In this section, we define our column architecture and present a scalable implementation of it. As illustrated in Figure 9, a column is a collection of multiple excitatory neurons operating in parallel, followed by winner-take-all (WTA) inhibition across the neurons. The column inhibition feature is necessary to prevent runaway spiking behavior. Columns can be used as functional building blocks for TNNs.

4.1 Column Operation

In this subsection, we briefly explain the functionality of a typical column before moving on to its implementation in the following subsections. Figure 10 illustrates the operation of an example column with 8 RNL neurons and 64 synapses (each neuron has 8 input synapses and a threshold value of 8). The maximum synaptic weight value is also 8 in this example. Every intersection point in the synaptic crossbar is a synapse, with its weight value shown inside the circle; the absence of a value indicates a weight of 0. Neuron 1 (driving output z1) has only one of its inputs driving a synapse with a nonzero weight, so its body potential ramps up by one unit per cycle and crosses the threshold at t = 7. This value (7) is shown as the input to 1-WTA inhibition in Figure 10. Neuron 4 (driving z4) receives input spikes on three of its inputs, so their combined ramps cross the threshold much sooner: its output spike time is t = 2, as shown in the figure. The z4 output is the first output spike, so it is selected as the "winner" by WTA inhibition. Hence z4 = 2; all the other outputs are inhibited (spike time ∞).

4.2 Winner-Take-All (WTA) Inhibition
Winner-take-all (WTA) lateral inhibition selects the first (or first k as in k-WTA) spiking neuron and allows its output spike to pass through intact, while inhibiting or nullifying the outputs of the other neighboring neurons. STDP is typically performed after lateral inhibition. Figure 11 shows the logic diagram for lateral inhibition across the q neurons in a column.
(z1,…,zq) denote the output spikes of the q neurons, in pulse format. The blue box illustrates the latch-based logic implementation of a less-than-or-equal temporal operator, which operates on edge signals. It has a 'data' input and an 'inhibit' input. If the inhibit arrives prior to the data input, the latter is inhibited; otherwise it is allowed to pass. The first spike is found through a big 'OR' gate and is fed back, through a pulse-to-edge conversion logic, as the inhibit input for all later arriving pulses. Note that this implementation works as expected even with pulse inputs, because the inhibit signal is in edge form and any input pulse arriving after the inhibit is blocked. Only a single pulse is output, because later pulses are blocked by the less-than-or-equal logic before reaching the 'OR' gate. Ties are broken by selecting the first spiking neuron with the lowest index.
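Functionally, 1-WTA reduces to selecting the earliest spike, with the lowest index breaking ties. A minimal Python sketch (names ours) that reproduces the Figure 10 example:

```python
def wta_1(spike_times):
    """1-WTA lateral inhibition: pass only the earliest spike; lowest index wins ties."""
    candidates = [(t, i) for i, t in enumerate(spike_times) if t is not None]
    if not candidates:
        return [None] * len(spike_times)
    t_win, i_win = min(candidates)  # tuple min implements the tie-breaking rule
    return [t_win if i == i_win else None for i in range(len(spike_times))]

# Column example from Figure 10: neuron 4 fires first (t=2) and wins; neuron 1 (t=7) is inhibited.
print(wta_1([7, None, None, 2, None, None, None, None]))
# -> [None, None, None, 2, None, None, None, None]
```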
The gate count for the WTA inhibition logic is:

Gates_WTA = 3q + (q - 1) + c ≈ 4q    (2)

where 3q accounts for the per-neuron less-than-or-equal latches (3 gates each), (q - 1) for the OR tree, and the small constant c for the shared pulse-to-edge conversion logic.
4.3 Column Implementation
Given the neuron and WTA designs from the previous sections, we can now build a parallel column of q neurons. As shown in Figure 9, each column also requires a synaptic crossbar connecting the p inputs to the q excitatory neurons, yielding p×q synapses. For a column to be able to learn, STDP capability must also be incorporated into the column. Due to the important role STDP plays in a column, we cover STDP separately in the next section.
A single (p×q) column with p×q synapses and q excitatory neurons, supported by WTA lateral inhibition and STDP update rules, becomes a fully operational TNN, capable of performing inference and online continuous learning. Columns can also be used as building blocks for creating larger TNNs, by stacking multiple small columns to form a stack/layer of multiple columns, and by cascading multiple stacks/layers into a large multi-layer TNN.
The total column gate count is then:

Gates_column = q * Gates_neuron + Gates_WTA    (3)

where Gates_neuron includes the synaptic weight/readout FSMs, STDP logic, and neuron body of one neuron.
5 STDP learning
STDP is the distinctive ingredient for TNNs. The STDP concept has been around for quite a while and many specific ideas for STDP have been proposed and simulated, as we briefly discuss below. In this work we propose a novel and rigorous STDP algorithm that is both effective in learning and implementable using off-the-shelf digital CMOS technology.
5.1 Prior Work
The basic idea behind the STDP learning mechanism is the temporal correlation between an input spike at a synapse and its neuron's output spike; stronger correlation should lead to an increase in the synaptic weight. Although this principle traces back to Hebb [hebb1949organization], Levy and Steward [levy1983temporal] formalized the classic STDP rule: if the input spike precedes the output spike of a neuron, the corresponding synaptic weight is increased; if the output spike precedes the input spike, the weight is decreased. Several follow-on works have provided experimental support for STDP [markram1997regulation, bi1998synaptic]. These research efforts have led to more functionally rich STDP update rules.
Although relatively limited, attempts have been made in the past towards hardware implementations of STDP. Seo et al. [seo201145nm] implemented a fully digital silicon synapse programmable via a probabilistic variant of STDP. However, that implementation requires a custom SRAM design. Other approaches for hardware STDP implementation have been predominantly in the analog or mixed signal domain [schemmel2006implementing, arthur2006learning, tanaka2009cmos, cruz2012energy]. To our knowledge, the work by Cassidy et al. [cassidy2011combinational] is the closest to ours in terms of a fully synthesizable digital implementation of STDP.
5.2 Proposed STDP Update Rules
The learning method presented here is a customized version of the classic Spike Timing Dependent Plasticity (STDP), and is implemented locally at each synapse as shown in Figure 12. The proposed STDP learning rules are summarized in Table 1.
Input Conditions | Weight Update
---|---
x(t) ≠ ∞; y(t) ≠ ∞; x(t) ≤ y(t) | Δw = +B(µ_capture) · max(B(µ_min), B(F))   (capture)
x(t) ≠ ∞; y(t) ≠ ∞; x(t) > y(t) | Δw = -B(µ_minus)   (minus)
x(t) ≠ ∞; y(t) = ∞ | Δw = +B(µ_search)   (search)
x(t) = ∞; y(t) ≠ ∞ | Δw = -B(µ_backoff) · max(B(µ_min), B(F))   (backoff)

Here, x(t) and y(t) represent the input and output spikes respectively, Δw denotes the change in weight, and B(µ) represents a Bernoulli random variable with probability µ.
STDP update rules are divided into four major cases, corresponding to the combinations of the input and output spikes being present (≠ ∞) or absent (= ∞). When both are present, two sub-cases are formed based on the relative timing of the input and output spikes, in the classical STDP manner [bi1998synaptic].
The STDP update function either increments the weight by Δw (up to a maximum of wmax = 7), decrements the weight by Δw (down to a minimum of 0), or leaves the weight unchanged. The Δw values are defined using Bernoulli random variables (BRVs) with parameterized learning probabilities, denoted as B(µ) with a descriptive subscript. Using BRVs, we perform a small number of relatively large (unit) updates to the weight, as opposed to the large number of small floating point updates that is done commonly. BRVs are also well-suited to hardware implementation because a linear feedback shift register (LFSR) network can provide pseudo-random binary values. F is a stabilizing function of the current weight that makes the weights sticky at both ends (0 and 7) [kheradpisheh2018stdp].
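A behavioral Python sketch of these rules is given below. It mirrors Table 1 as reconstructed above; the placement of the stabilization term on the capture and backoff cases, the default probabilities, and the illustrative F profile are our assumptions, not prescriptions:

```python
import random

W_MAX = 7

def brv(mu):
    """Bernoulli random variable B(mu); in hardware these bits come from an LFSR network."""
    return 1 if random.random() < mu else 0

def F(w):
    """Illustrative stabilization profile: near zero at w = 0 and w = W_MAX,
    making weights 'sticky' at both ends (the exact values are design parameters)."""
    return (w / W_MAX) * (1 - w / W_MAX)

def stdp_update(w, x, y, mu_capture=0.5, mu_minus=0.5,
                mu_search=0.1, mu_backoff=0.1, mu_min=0.05):
    """One STDP update for weight w; x, y are spike times (None means no spike)."""
    stab = brv(mu_min) | brv(F(w))            # max(B(mu_min), B(F)) as an OR of BRVs
    if x is not None and y is not None:
        dw = brv(mu_capture) * stab if x <= y else -brv(mu_minus)  # capture / minus
    elif x is not None:
        dw = brv(mu_search)                    # input spike only: search
    elif y is not None:
        dw = -brv(mu_backoff) * stab           # output spike only: backoff
    else:
        dw = 0                                 # neither spikes: no update
    return min(max(w + dw, 0), W_MAX)          # saturate at 0 and 7
```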
5.3 Proposed STDP Implementation

Our proposed STDP logic implementation is shown in Figure 13. Here, we assume that all the required BRVs are generated by an LFSR network shared by many neurons or columns; hence, LFSRs are not included in the hardware evaluation below. LFSRs can be shared across all or multiple neurons in a column, across multiple columns in a layer, or even across multiple layers in a TNN. Introducing LFSRs into our implementation is straightforward once their distribution in the multi-layer TNN is determined. The proposed STDP logic has the following three components:
5.3.1 Case Generation Logic
The case generation logic at each synapse compares the input spiketime (xi) of its corresponding synapse with the neuronal output spiketime (z) and generates 4 control signals corresponding to the 4 cases in Table 1. Case 5 is eliminated since inc and dec will automatically be 0, as required, if none of the 4 cases are 1. Note that STDP is performed at the end of a computational cycle, i.e., after all non-null inputs have arrived and produced responses. Hence, information from any pulse signal needs to be maintained, or latched, until the end of the cycle for computing STDP decisions. This is equivalent to converting the pulse signals to edge signals, and the logic implementation for this is essentially a 1-bit accumulator (an SR flipflop with an XOR gate). There is a pulse-to-edge conversion logic for every input synapse (xi), whereas the conversion logic for the output (z) is shared across all synapses. A latch with 3 gates implements the temporal operation of less-than-or-equal (≤) on edge signals. To illustrate with the example xi ≤ z: the input (xi) is passed to the output only if it arrives before or at the same time as the inhibit input (z); otherwise the output stays null. The logic equations implemented for the four STDP cases are as follows:
• Case 1: case1 = (xi ≤ z) AND Z
• Case 2: case2 = Xi AND Z AND NOT (xi ≤ z)
• Case 3: case3 = Xi AND NOT Z
• Case 4: case4 = NOT Xi AND Z

Here Xi and Z denote the pulse-to-edge converted (presence) signals for xi and z, and (xi ≤ z) is the output of the less-than-or-equal latch.
Note that only one of the four outputs can have a value of 1 for a particular computational cycle and all the outputs are edge signals.
5.3.2 Stabilization Function Logic
The stabilization function generates a BRV whose probability F depends on the synaptic weight w. The stabilization LFSR generates 6 outputs, corresponding to the 6 values of the F function that are neither 0 nor 1 for 3-bit weights. These outputs are passed through the stabilization function logic to produce the BRV B(F), where F is the stabilizing factor used in Table 1; B(F) is selected by an 8-to-1 multiplexer controlled by the 3-bit weight. The max operation in the weight stabilization terms in Table 1 is then implemented simply by OR'ing B(F) with the output of the min LFSR.
5.3.3 Inc/Dec Logic
The outputs from the stabilization logic are combined with the case signals and the outputs of the capture, minus, search and backoff LFSRs to generate the inc and dec outputs. The capture and search terms are OR'ed together to form inc, and the minus and backoff terms are OR'ed together to form dec:
stab = B(µ_min) OR B(F)    (4)
inc = (case1 AND B(µ_capture) AND stab) OR (case3 AND B(µ_search))    (5)
dec = (case2 AND B(µ_minus)) OR (case4 AND B(µ_backoff) AND stab)    (6)
6 Neuron and Column Designs
We have implemented functional simulation models of the scalable neuron and column designs in Python. We have also implemented these designs in Verilog and generated synthesis results based on a 45nm standard cell library. This section presents our evaluation of these designs.
6.1 Methodology
We perform two types of evaluation to determine area (A), critical path delay (D), computation time (T) and power (P) for the neuron and column designs, namely gate-level and circuit-level evaluations.
In gate level evaluation, we derive equations for A, D, T and P based on the gate count and number of signal transitions, parameterized in terms of number of neurons (q) and number of synapses per neuron (p). The procedure is as follows:
• Gate count can be used as a surrogate for area, since die area is proportional to the number of gates.
• Time for a single computation cycle can be calculated from the critical path delay, expressed as a total number of gate delays. Assuming 3-bit weights, the computational cycle requires at least 8 clock cycles; the first and last input spikes can be separated by up to 7 clock cycles. After the last input spike arrives, it can take up to 6 more cycles for the RNL response function to reach its peak, 1 cycle for restoring the weights, and a final cycle for the STDP update; hence 8 cycles are added. This sets the gamma cycle: T = (7+8) * (critical path delay) = 15 * (critical path delay).
• Gate count can also be used to estimate static power consumption, and the number of gate transitions can be used to estimate dynamic power consumption. These estimates are overly pessimistic since spikes are sparse in general (typically only about 10% of the lines carry spikes), which brings dynamic power down significantly. Sparsity can easily be introduced into our equations by applying a factor of 0.1 to dynamic power, as in the sketch below.
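The gate-level estimation procedure above can be captured in two small Python helpers (the names and the default sparsity constant are ours, for illustration):

```python
def computation_time(cp_delay_ns):
    """Gamma cycle: (7 + 8) = 15 alpha cycles, each bounded by the critical path delay."""
    return 15 * cp_delay_ns

def power_estimate(gate_count, transitions, leak_per_gate, switch_per_gate, sparsity=0.1):
    """Static power tracks gate count; dynamic power tracks transitions,
    derated by the expected spike sparsity (~10% of lines spiking)."""
    return gate_count * leak_per_gate + sparsity * transitions * switch_per_gate
```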
Circuit level results are obtained from the post-synthesis output generated by Synopsys Design Compiler, using the open-source 45 nm Nangate standard cell library [knudsen2008nangate]. We use the low-power process corner for synthesis. Area, power and critical path delay are obtained directly from the Synopsys tool, and computation time is derived by multiplying the critical path delay by 15. The post-synthesis designs have been functionally verified in both Python and Verilog using Synopsys VCS.
6.2 Neuron Gate Level Evaluation
The equations derived in this section can be used to project the A, D, T and P metrics for an arbitrary number of synapses per neuron.
6.2.1 Area (A)
As per the methodology above, gate count can be used to estimate area by multiplying it by the area per gate (Area_gate). Hence,

A_neuron = Gates_neuron * Area_gate    (7)
6.2.2 Critical Path Delay (D)
The critical path in a neuron is the path from the FSM to the output of the accumulator (shown by the red arrow in Figure 8). Its delay is estimated by adding up all the gate delays along the path and multiplying by the delay per gate (Delay_gate). Hence,

D_neuron = N_CP(p) * Delay_gate    (8)

where N_CP(p), the number of gates along the critical path, grows logarithmically with p.
6.2.3 Power (P)
All gates in the FSM undergo at most 2 signal transitions during the 8 states. This is due to the nature of the RNL response function: the pulses are consecutive in time, resulting in only 2 transitions, one at the beginning and one at the end. Every gate except those in the last stage of the accumulator (marked by the purple box in Figure 8) undergoes at most 2 transitions from the incoming FSM pulses. For a worst-case power estimate, the last-stage gates are assumed to undergo 15 transitions in 15 clock cycles. The last accumulator stage consists of log2p FAs and 1 HA, followed by the (log2p+1)-bit mux, the (log2p+1)-bit register and 2 extra gates. Using the constants for leakage power per gate (Leak_gate) and unit switching power per gate (Switch_gate), we can derive the following equations for static and dynamic power:

P_static = Gates_neuron * Leak_gate    (9)
P_dynamic = [2*(Gates_neuron - Gates_last-stage) + 15*Gates_last-stage] * Switch_gate    (10)

where Gates_last-stage is the gate count of the last accumulator stage.
Gate level equations derived above indicate near-linear scaling of area and power, and logarithmic scaling of delay with respect to the number of synapses. We verify this with the post-synthesis results for 45nm CMOS generated by Synopsys Design Compiler in the next subsection.
6.3 Neuron Circuit Level Evaluation
6.3.1 Gate Complexity Breakdown

Figure 14 shows the breakdown of the various components of the neuron in terms of gate count, scaled across synaptic counts ranging from 64 to 1024. We clearly observe near-linear scaling of the total gate count relative to the number of synapses. Interestingly, the gate count for the neuron body is quite small, whereas storage for the synaptic weights incurs the largest gate count. The weights constitute almost 50% of the entire neuron complexity and the STDP logic 40%, while the neuron body and readout logic account for the remaining 10%. The total gate count for a neuron with 1024 synapses is 102,432 gates.
6.3.2 ATP Complexity (45 nm CMOS)
Synapses | Gate Count | Area [µm2] | Critical Path Delay [ns] | Power [µW]
---|---|---|---|---
64 | 6471 | 6504.23 | 1.93 | 31.39
128 | 12859 | 12931.86 | 2.16 | 62.41
256 | 25673 | 25818.49 | 2.41 | 124.69
512 | 51258 | 51558.51 | 2.64 | 249.02
1024 | 102432 | 103042.81 | 2.82 | 497.70
Table 2 above shows the A, D and P values for a neuron with STDP-programmable synapses, in 45 nm digital CMOS technology. A temporal neuron with 1024 synapses consumes 0.1 mm2 of area and 0.5 mW of power. Its critical path delay of 2.82 ns maps to a maximum clock frequency of 354.6 MHz. This is the maximum rate at which the actual hardware clock (aclk, the alpha cycle) can be run. The near-linear scaling of A and P, and the logarithmic scaling of D, relative to synaptic count can be verified from the values in the table.
6.4 Column Gate Level Evaluation
Gate level equations are derived below for a column having q neurons, each with p synapses.
6.4.1 Area (A)
Since gate count can be directly used to estimate area,
A_column = Gates_column * Area_gate    (11)
6.4.2 Computation Time (T)
Time for a single computational cycle (T), i.e., the gamma cycle, is obtained by multiplying the critical path delay by 15, as described above. Since the critical path for a column is the same as that for a neuron,

T_column = 15 * D_neuron = 15 * N_CP(p) * Delay_gate    (12)
6.4.3 Power (P)
Static power consumption is proportional to the gate count. For an estimate of the dynamic power consumption, each gate in the WTA lateral inhibition logic is assumed to have 2 transitions (including 1 after reset). As before, we derive:

P_static,column = Gates_column * Leak_gate    (13)
P_dynamic,column = [q * Transitions_neuron + 2 * Gates_WTA] * Switch_gate    (14)

where Transitions_neuron is the per-neuron transition count from Equation (10).
6.5 Column Circuit Level Evaluation
6.5.1 Gate Complexity Breakdown

Figure 15 shows the breakdown of the various components of a column with 16 neurons in terms of gate count, scaled across synaptic counts ranging from 64 to 1024. We again observe near-linear scaling of the total column gate count relative to the synaptic count. Interestingly, WTA lateral inhibition incurs a negligible gate count. The excitatory column constitutes almost 50% of the entire column complexity, while the STDP logic incurs about half that of the excitatory column. The total number of gates in a column of 16 neurons with 1024 synapses each is 1,769,619.
6.5.2 ATP Complexity (45 nm CMOS)
Neurons / | Gate | Area | Comp. | Power |
---|---|---|---|---|
Synapses | Count | [mm2] | Time [ns] | [mW] |
8 / 64 | 55369 | 0.056 | 28.95 | 0.25 |
10 / 128 | 138332 | 0.14 | 32.4 | 0.62 |
10 / 784 | 784435 | 0.786 | 40.8 | 3.8 |
16 / 1024 | 1769619 | 1.787 | 42.3 | 7.9 |
In Table 3, we present 45 nm CMOS results for four column configurations: 1) a small 8x64 column, 2) a medium 10x128 column, 3) a prototype 10x784 column, and 4) a large 16x1024 column. The gamma cycles for the smallest and largest columns are 28.95 ns (34.54 MHz) and 42.3 ns (23.64 MHz) respectively. Our prototype 10x784 column consumes 3.8 mW of power and 0.786 mm2 of die area, roughly half that of the largest column and about 14 times that of the smallest column. The prototype column can process a single 28x28 MNIST image in one gamma cycle of 40.8 ns.
7 Technology Scaling Evaluation
7.1 Methodology
In this section, we scale our results for the neuron and column designs from the 45 nm process node down to 7 nm using simple scaling rules, as explained below. This helps us gauge the potential for implementing TNN cores as a new type of core in mobile SoCs with heterogeneous cores. We use the family of Apple mobile SoCs from A7 to A13 as the reference for deriving technology scaling ratios for the 28 nm, 16 nm, 10 nm and 7 nm process nodes [wiki:tc]. Based on the Apple A7, A9, A11 and A13 SoCs, these nodes translate to transistor densities of 10 MT/mm2, 22 MT/mm2, 46 MT/mm2 and 85 MT/mm2 respectively, where MT/mm2 stands for million transistors per mm2. To obtain 45 nm transistor densities for the neuron and column, we first derive an approximate transistor count for each design by multiplying its gate count by 4 (since a NAND2 gate has 4 transistors) and then divide by the corresponding area.
We perform technology scaling for area and power by multiplying them by the ratio of transistor densities between the source and target nodes. For critical path (CP) delay, we use the square root of the above ratio. In the following two subsections, we provide technology scaling results for 1) a neuron with 1024 synapses, and 2) a prototype (10x784) column with p = 784 and q = 10.
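The scaling procedure fits in a few lines of Python (names ours); the density table is the one given above, and the example reproduces the first and last rows of Table 4:

```python
import math

# Transistor densities in MT/mm^2: 45 nm from our synthesis, the rest from Apple SoCs.
DENSITY = {45: 4, 28: 10, 16: 22, 10: 46, 7: 85}

def scale(area, delay, power, src_node, dst_node):
    """Scale area and power by the density ratio; scale delay by its square root."""
    r = DENSITY[src_node] / DENSITY[dst_node]
    return area * r, delay * math.sqrt(r), power * r

# 1024-synapse neuron, 45 nm -> 7 nm (compare Table 4):
area, delay, power = scale(103042.81, 2.82, 497.70, 45, 7)
print(f"{area:.0f} um2, {delay:.2f} ns, {power:.1f} uW")  # 4849 um2, 0.61 ns, 23.4 uW
```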
7.2 Neuron Scaling
In this subsection, we present technology scaling results for a neuron with 1024 synapses, capable of online STDP learning. Note that neurons in the neocortex typically have 1,000-10,000 synapses each. As per the methodology above, we calculate the transistor count for such a neuron to be 409,728. Given its area of 0.1 mm2, this translates to 4 MT/mm2 for the 45 nm process node. Table 4 shows the technology scaling results for the five process nodes down to 7 nm. From 45 nm to 7 nm, area and power decrease by more than 20x, and critical path delay by almost 5x. The values indicate that a state-of-the-art 100 mm2 chip fabricated in 7 nm CMOS can accommodate 20,622 such neurons (206 neurons/mm2), each consuming a minimal 23.4 µW. The critical path delay implies a maximum clock rate of 1.639 GHz. We emphasize that this is a computationally powerful neuron capable of local, online, continuous STDP learning at each of its 1024 synapses.
Tech. Node | Transistor Density | Area [µm2] | CP Delay [ns] | Power [µW]
---|---|---|---|---
45 nm | 4 MT/mm2 | 103042.81 | 2.82 | 497.70
28 nm | 10 MT/mm2 | 41217.12 | 1.78 | 199.08
16 nm | 22 MT/mm2 | 18735.06 | 1.20 | 90.49
10 nm | 46 MT/mm2 | 8960.24 | 0.83 | 43.28
7 nm | 85 MT/mm2 | 4849.07 | 0.61 | 23.42
7.3 Column Scaling
Here, we perform a similar analysis for our prototype column with 10 neurons and 784 synapses/neuron. These parameters are chosen because a 10x784 column can directly process MNIST images (28x28 pixels) and perform unsupervised clustering into 10 clusters. Table 5 shows the corresponding results, using the same transistor densities derived above. Interestingly, 2,702 such columns can be accommodated in a die area of 100 mm2 in 7 nm CMOS. A single 10x784 column in 7 nm CMOS can process one MNIST image in 8.85 ns while consuming only 0.179 mW (179 µW). Our experiments, outside the scope of this paper, indicate that each such column is capable of performing unsupervised MNIST clustering on par with classic k-means. This gives us confidence in the sheer compute capability that is feasible while consuming only hundreds of microwatts.
Tech. Node | Transistor Density | Area [µm2] | Comp. Time [ns] | Power [mW]
---|---|---|---|---
45 nm | 4 MT/mm2 | 786354.21 | 40.8 | 3.802
28 nm | 10 MT/mm2 | 314541.68 | 25.8 | 1.520
16 nm | 22 MT/mm2 | 142973.49 | 17.4 | 0.691
10 nm | 46 MT/mm2 | 68378.63 | 12.03 | 0.330
7 nm | 85 MT/mm2 | 37004.9 | 8.85 | 0.179
8 Discussion and Analysis
8.1 Scaling Trends
Our proposed neuron design consists of three main functional components: the synaptic weights (including readout logic), the neuron body, and STDP. Among these, the synaptic weights and STDP are localized to each synapse, which implies they scale linearly with the number of synapses (p). The neuron body, however, is shared by all of its synapses, and is primarily a p-input accumulator with O(p) complexity. Hence, the neuron body also scales near-linearly with p. This is well supported by our post-synthesis results in the previous section.
A TNN column consists of an excitatory column with q neurons and pxq synaptic weights (p synapses per neuron) followed by WTA inhibition. WTA inhibition is simply a priority-based selection logic and has negligible complexity as compared to the excitatory column. This implies that a column’s complexity scales near-linearly with respect to pxq. For a given q, a column’s complexity is linear with respect to p, whereas for a given p, it is linear with respect to q. Again, this is corroborated by our post-synthesis results in the previous section.
Across the five technology nodes, ranging from 45nm to 7nm, area and power for each subsequent generation scale by a factor of 0.4, 0.45, 0.48 and 0.54, respectively. The critical path delay scales with the square root of these scaling factors, namely, 0.63, 0.67, 0.69 and 0.73, respectively.
The three scaling trends above can be used to generate area, power and delay projections for any pxq column, for any of the above process nodes. This helps us to perform comparisons with other related works across different values of p, q, and silicon process nodes.
8.2 Choices in Response Function
We chose the ramp-no-leak (RNL) response function for all of the designs in this paper in order to exploit the training benefit of the ramp's temporal processing. However, a very popular and widely used response function is the step-no-leak (SNL). It is the simplest, although biologically less plausible, response function model and has been shown in the literature to be computationally efficient. To adapt our neuron design to the SNL response function, the main changes required are: 1) simplify the FSM for the weights by eliminating the tclk logic, 2) add a simple input pulse-based selection logic to pass the weights to the accumulator, and 3) widen each accumulator input from 1 bit to 3 bits. Changes 1) and 2) reduce the per-synapse complexity of the synaptic weights plus readout logic, while 3) increases the neuron body complexity.
Furthermore, if we use a 0→1 transition edge instead of a pulse to represent a spike, the accumulator can be replaced with a traditional p-input (3-bit wide) adder, and the counter at the output can be replaced by simple pulse-to-edge conversion logic. Our gate-level analysis shows an overall 20% reduction in gate-count complexity for SNL as compared to RNL. It is interesting to note that RNL adds temporal functionality to the model without incurring significant overhead. This efficiency mainly arises from two design optimizations: 1) sharing the synaptic weight FSM for generating the RNL up-steps, and 2) time-multiplexing all up-steps for each synapse into a 1-bit signal that is fed to the accumulator.
Any arbitrary response function can be discretized and represented by a set of up-steps and down-steps distributed over time. The main difference lies in the readout logic: for every synaptic weight, the output becomes an n-bit, 1-cycle wide pulse in 2's complement representation that gets accumulated. This increases the complexity of the readout logic and the accumulator width. However, since the biggest contributors to neuron complexity, namely the synaptic weights and STDP, remain unchanged, any response function can be implemented with complexity in the close ballpark of the RNL model.
8.3 Comparison with TrueNorth and Loihi
IBM's TrueNorth has a neuron density of about 2700 neurons/mm2, roughly 28x denser than our 97 neurons/mm2 at the same 28 nm process node. However, our neuron design supports 3-bit synaptic weights with STDP implemented locally at each synapse, in contrast to TrueNorth's 1-bit weights with no STDP. At an operating frequency of 100 kHz, our neuron consumes about 9.47 pJ/spike at 28 nm, which is about 3x more energy efficient than TrueNorth's 26 pJ/spike.
Intel's Loihi has a neuron density of 2184 neurons/mm2 at 14 nm, about 40x denser than our 53 neurons/mm2 at 16 nm. Loihi consumes 15 pJ/spike, about 3.3x higher than our 4.5 pJ/spike at 100 kHz at 16 nm. Again, our neuron design includes the synaptic weights, readout logic, and STDP hardware, whereas Loihi uses a separate SRAM memory for storing synaptic weights.
The above comparisons are qualitative and mainly for providing a context for this work. While not as dense, our neurons can be quite competitive in energy efficiency. Our goal is to create an architecture and framework for designing and implementing application-specific neuromorphic processors that are extremely energy efficient and can be deployed in edge devices for always-on online sensory processing.
9 Conclusions
Some of the key results from this work include: 1) a gate-level design of a neuron with 1024 synapses that requires 4849 µm2 of die area, consumes 23.42 µW of power, and incurs a critical path delay of 0.61 ns in 7 nm CMOS; and 2) a gate-level design of a prototype column with 10 neurons (each with 784 synapses) that can process a 28x28 MNIST image in 8.85 ns while consuming only 179 µW of power and 0.037 mm2 of die area.
Some of our key observations include: 1) about 50% of a neuron's complexity comes from the synaptic weights, 40% from STDP, and the remaining 10% from the neuron body; 2) a column's complexity is mainly determined by its excitatory column's size and configuration; and 3) different response functions have affinities toward different spike representations. For example, while an n-cycle wide pulse (n = wmax + 1) is most suitable for RNL, SNL is more efficient when transition edges (1→0 or 0→1) are used to represent spikes.
Current computing demand for supporting machine learning workloads is increasing at the rate of 10x per year or doubling every 3.4 months [openai]. At best, Moore’s law is only doubling every 2 years. The gap is widening at the rate of 8x per year or 500x every 3 years. Can current computing systems that have their roots dating back to Alan Turing and John von Neumann effectively address this widening gap?
Is there a better computing paradigm that can more efficiently support human-like sensory processing and human-like continuous unsupervised learning? We need computing systems that are multiple orders of magnitude more energy efficient. As computer architects, we need to go back to biology and reverse architect the human neocortex and attempt to replicate its design using modern silicon technology. We believe TNN-based neuromorphic processors have the potential of being one small step in this direction, and this paper is our first attempt at laying out a microarchitecture for such TNNs.