Towards Efficient In-memory Computing Hardware for Quantized Neural Networks: State-of-the-art, Open Challenges and Perspectives
Abstract
The amount of data processed in the cloud, the development of Internet-of-Things (IoT) applications, and growing data privacy concerns force the transition from cloud-based to edge-based processing. Limited energy and computational resources at the edge push the transition from traditional von Neumann architectures to In-memory Computing (IMC), especially for machine learning and neural network applications. Network compression techniques are applied to implement a neural network on limited hardware resources. Quantization is one of the most efficient network compression techniques, reducing the memory footprint, latency, and energy consumption. This paper provides a comprehensive review of IMC-based Quantized Neural Networks (QNN) and links software-based quantization approaches to IMC hardware implementation. Moreover, open challenges, QNN design requirements, recommendations, and perspectives, along with an IMC-based QNN hardware roadmap, are provided.
Index Terms:
In-memory Computing, Quantized Neural Network, Hardware, Quantization
I Introduction
The recently growing data privacy concerns have led to the demand to reduce cloud-based processing and to move to local on-edge processing without sharing the data with a server. Moreover, the development of edge devices and IoT applications created the need to deploy machine learning algorithms and neural networks on low-power devices. At the same time, the neural network models trained on the cloud are becoming larger and more complex, and the energy consumption of the data centers supporting such AI-related tasks is expected to grow exponentially in the next few years [1]. Therefore, the development and advancement of on-edge processing, for both algorithms and hardware, are critical.
To move neural network computations to the edge, neural network compression techniques and efficient hardware designs are essential. The memory consumed by a state-of-the-art neural network can reach hundreds of megabytes, especially when 32-bit floating-point data representation is used. Quantization is one such compression technique: it replaces floating-point computations with low-precision fixed-point ones, reducing the memory footprint, latency, energy consumption, hardware complexity, and computational complexity of a network [2].
This paper provides a comprehensive review of Quantized Neural Network (QNN) implementations on In-memory Computing (IMC) platforms, related challenges, and open problems. Compared to the previous surveys on QNNs [2, 3], this work links software-based QNN designs to IMC hardware implementations, and identifies the related issues. We analyze state-of-the-art IMC-based QNN implementations, identify open challenges and requirements, and provide recommendations and perspectives.
The manuscript is organized as follows. Section II focuses on the IMC background, corresponding IMC devices, and state-of-the-art IMC architectures. Section III provides an overview of quantization methods, and QNN training, links quantization schemes with IMC architecture designs, and discusses QNN mapping to IMC hardware. Section IV reviews state-of-the-art IMC implementations of QNN, and compares different quantized IMC designs. Section V discusses the open challenges and requirements for efficient IMC-based QNN hardware along with recommendations and perspectives. Section VI summarises the paper.
II In-memory Computing Background
The memory wall and the power-efficiency ceiling of traditional von Neumann architectures have pushed the development of new hardware designs for specific applications, such as neural networks. Moreover, the large amount of data moved between the memory and the processor in von Neumann architectures creates the demand for more efficient alternatives for these applications. The bottleneck of von Neumann architectures is the bus bandwidth between memory and processor. According to [4], the bandwidth of this bus reaches 167 GB/s, while the read bandwidth inside traditional SRAM memories is 328 TB/s. Also, the data transmission energy between the memory and the processor (42 pJ) is 26 times higher than the energy required for the read-out operation (1.6 pJ) [4]. The In-memory Computing (IMC) architecture is an efficient solution for implementing the Matrix-Vector Multiplication (MVM) operations of machine learning algorithms and neural networks.

The devices used for an IMC architecture shown in Fig. 1 can be divided into volatile, e.g. static random access memory (SRAM) and dynamic random access memory (DRAM), and non-volatile, e.g. resistive random access memory (RRAM), phase change memory (PCM), spin-transfer torque magnetoresistive random access memory (STT-MRAM), ferroelectric RAM (FeRAM), and ferroelectric field-effect transistors (FeFETs) [5]. SRAM and DRAM memories are more mature than non-volatile memories. Being less mature, non-volatile memories, especially RRAMs and PCMs, suffer from various device non-idealities, including limited endurance, conductance drift, cells stuck at faulty values, and non-linearity of the switching curve. However, non-volatile memories allow multi-level storage, scalability, and high computational density [8]. For example, an RRAM cell with a selector device (1T1R cell) occupies a much smaller area, in terms of the technology feature size, than an SRAM cell [9].
To implement an IMC architecture, the IMC devices representing the neural network weights are connected in a crossbar structure, as shown in Fig. 1 (b). The computation in a crossbar can be performed in either the digital or the analog domain. In the digital domain, the crossbar with IMC devices is used similarly to a memory, and the multiplication of each crossbar input with each weight is performed separately. In the analog domain, the MVM operation is performed in a single cycle: the voltage inputs to the crossbar are multiplied by the conductances of the IMC devices, and the output current through a crossbar column corresponds to a single entry of the MVM result. To perform MVM in the analog domain, the inputs are applied either through digital-to-analog converters (DACs) for analog inputs, or through simple 1-bit comparators when high-bit inputs are represented by time-encoded binary signals followed by partial-sum calculation. The main problems of a crossbar architecture include sneak path currents [10], IR drop [11], device imperfections [12, 13], and the non-linearity of selector devices usually connected in series with an IMC device [14]. The typical size of an IMC crossbar is 128×128 or 256×256 [9]. There have been attempts to fabricate larger arrays; however, larger arrays aggravate the effects of sneak path currents and IR drop, especially for non-volatile devices.
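As an illustration of the analog-domain operation described above, the following sketch models an idealized crossbar (no sneak paths, IR drop, or device variation) in NumPy; the conductance and voltage ranges are arbitrary assumptions chosen for the example.

```python
import numpy as np

# Idealized analog-domain MVM on a crossbar: weights are stored as conductances G
# (rows x columns), inputs are row voltages V. By Ohm's and Kirchhoff's laws, the
# current collected on each column is I = V @ G, i.e. one MVM in a single read cycle.

rng = np.random.default_rng(0)
G = rng.uniform(1e-6, 1e-4, size=(128, 64))   # device conductances in siemens (assumed range)
V = rng.uniform(0.0, 0.2, size=128)           # read voltages applied to the 128 rows

I_columns = V @ G                              # column currents = analog MVM result
print(I_columns.shape)                         # (64,) -- one accumulated value per column
```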
The typical IMC architecture is hierarchical (Fig. 1 (b)). Several crossbar arrays with peripheral circuits, including multiplexers, analog-to-digital converters (ADCs), shift-and-add units, local registers, and control circuits, are connected to form a compute element [7]. Several compute elements with other peripheral circuits, e.g. activation units and buffers, form a tile. In turn, several tiles are connected through a network-on-chip (NoC) with routers that direct the signals.
III Quantization in IMC-based Neural Network Architectures
III-A Quantization Methods
Quantizing a neural network implies quantizing its weights, activations, or both. Quantization methods can be divided into uniform and non-uniform, e.g. logarithmic and codebook quantization. Uniform quantization is a simple, widely used method, which divides the quantized interval into equally spaced sub-intervals. Fig. 2 (a) illustrates the corresponding quantization equations for both methods, where $w_q$ and $w$ represent the quantized and full-precision weights respectively, $\mathrm{round}(\cdot)$ rounds to the nearest integer, and $\mathrm{clip}(\cdot)$ restricts the value within the quantization range. In uniform quantization, the dynamic range of the quantized values is smaller than in non-uniform quantization. To improve this, a layer-wise or channel-wise scaling factor is used at the cost of computational complexity. Non-uniform quantization methods, e.g. logarithmic quantization, have a wider dynamic range, allowing better representation of the weights. In Fig. 2 (a), 4-bit radix-4 logarithmic quantization is taken as an example of non-uniform quantization: it approximates the absolute value of a weight by the nearest power of four, with decision boundaries placed at the mid-points between adjacent quantized values.
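The uniform scheme with a layer-wise scaling factor can be summarized in a few lines. The following NumPy sketch assumes symmetric signed quantization and a clipping range set by the maximum absolute weight, which is one common choice rather than the only one.

```python
import numpy as np

def uniform_quantize(w, n_bits=4):
    """Symmetric uniform quantization of a weight tensor to n_bits (signed)."""
    q_max = 2 ** (n_bits - 1) - 1                     # e.g. 7 for 4-bit signed weights
    scale = np.max(np.abs(w)) / q_max                 # layer-wise scaling factor (assumed clipping range)
    q = np.clip(np.round(w / scale), -q_max, q_max)   # round to the integer grid, then clip
    return q * scale, q, scale                        # de-quantized weights, integer codes, scale

w = np.random.randn(256) * 0.1
w_hat, codes, s = uniform_quantize(w, n_bits=4)
print(np.max(np.abs(w - w_hat)))                      # quantization error bounded by scale / 2
```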
The other possible quantization method is codebook quantization, which is useful when the data or weight distribution does not follow a linear or logarithmic distribution. One such method uses k-means clustering to find the quantized values for a codebook [15]. The codebook can also be formed by selecting the most frequent values among the weights in a neural network and can be updated during training [16]. Reading the codebook can bring additional computational and hardware overhead to a neural network implementation.
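A minimal sketch of codebook quantization in the spirit of [15] is given below, using k-means (here via scikit-learn, an assumed tooling choice) to build the codebook and a per-weight index to read from it.

```python
import numpy as np
from sklearn.cluster import KMeans

# Codebook (k-means) quantization sketch: cluster the weights into 2**n_bits centroids
# and replace every weight by its nearest centroid.
w = np.random.randn(1024) * 0.05                # toy full-precision weights (assumed)
n_bits = 3
km = KMeans(n_clusters=2 ** n_bits, n_init=10).fit(w.reshape(-1, 1))
codebook = km.cluster_centers_.ravel()          # 2**n_bits representative values
codes = km.labels_                              # per-weight index into the codebook
w_hat = codebook[codes]                         # de-quantized weights read via the codebook
print(np.mean((w - w_hat) ** 2))                # quantization error (mean squared)
```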
An important part of quantization is determining the clipping range. For neural network weights, the clipping range is static and determined during training. For activations, however, the situation differs, as the inputs change during inference. Quantization of the activations can be divided into dynamic and static quantization, depending on how the clipping of the function is performed. In dynamic quantization, the clipping range is computed dynamically for each activation, leading to higher accuracy but more complex computations. In static quantization, the clipping range is pre-calculated [2].
Different parts of the network exhibit different levels of abstraction and can be affected by the quantization differently [17]. Therefore, it is useful to set different quantization parameters, scale factors, and bit-width for different parts of the network. This is defined as mixed-precision quantization (MPQ). MPQ methods are divided into layer-level and fine-grained MPQs (channel/weight-level) [18, 19]. MPQ parameters can be determined by certain rules, e.g. input layers are more sensitive to the quantization than the other layers in a neural network. Also, MPQ policies can be optimized using differentiable optimization [20] or reinforcement learning algorithms [21]. MPQ reduces the model size and computation energy demands while keeping high-performance accuracy. However, hardware support and additional hardware overhead are required, especially when fine-grained MPQs are adopted.

III-B Quantizing and Training a Neural Network
There are two main methods to quantize a neural network: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) (Fig. 2 (b)). In PTQ, a well-trained full-precision model is quantized after training. In QAT, the model is quantized during training. PTQ is faster but leads to lower accuracy than QAT. The main problem of QAT is the zero-gradient issue caused by the stair-like nature of the quantization functions; therefore, the traditional stochastic gradient descent (SGD) algorithm with backpropagation cannot be directly applied to QNN training.
The issue of zero gradients in QNN training can be solved either by backpropagating approximated gradients or by gradually quantizing the network while training (Fig. 2 (c)). The gradient approximation method is called the straight-through estimator (STE), where the Jacobian of the quantization function is approximated by the identity matrix. Models trained using STE can achieve accuracy comparable to full-precision models [22]. In general, STE-based methods use full-precision gradients to update the neural network.
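A minimal STE sketch in PyTorch is shown below: the forward pass applies a hard (here binary) quantizer, while the backward pass replaces its zero gradient with the identity Jacobian, so full-precision gradients reach the latent weights. The binarizer and the absence of gradient clipping are simplifying assumptions.

```python
import torch

class STEQuantize(torch.autograd.Function):
    """Quantize in the forward pass; pass gradients straight through in the backward pass."""

    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)        # hard quantization: zero gradient almost everywhere

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output          # identity Jacobian: the gradient flows through unchanged

w = torch.randn(16, requires_grad=True)
loss = STEQuantize.apply(w).sum()
loss.backward()
print(w.grad)                       # all ones: full-precision gradients reach the latent weights
```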
The other method to solve the zero-gradient problem in QNN training is gradual quantization and the application of "soft" quantization methods that converge to "hard" quantization over time. One such method is additive noise annealing (ANA) [23], which injects noise into the quantized variables during the first iterations of training. Another method is adding a regularizer to the SGD update [24]. In addition, to quantize the network gradually, it can first be trained with full-precision weights and activations, then with quantized weights, and finally with quantized activations [25]. However, all these training methods require hardware support for full-precision computation, which makes them challenging to implement on low-power edge devices, especially on IMC architectures.
III-C Quantization in In-memory Computing Architectures
High-precision QNN weight quantization schemes can be implemented in a crossbar-based architecture either using high-precision devices allowing multi-level storage or by combining several fixed-precision n-bit IMC crossbar cells to represent a single weight. IMC devices are quantized by default and can represent only a limited number of states. For example, in SRAM, the state of a cell is quantized to two possible levels. In RRAMs, the number of possible conductance states is also limited, so the devices are quantized to a certain precision. The quantization of the crossbar outputs and activations is defined by the ADC and DAC precision [9] and by multilevel reading/writing drivers [26, 27]. Moreover, ADC and DAC precision also affects the effective resolution of the neural network weights, e.g. low-precision ADCs cannot capture all the variations of high-precision weights [9].
In the case of combining several n-bit crossbar cells to form a single weight, different memory cells represent the bits of the weight from the least significant bit (LSB) to the most significant bit (MSB). Partial sums are then computed to obtain the final result (the sum of all the outputs from the different cells), which causes hardware overhead due to the required ADC resolution. The necessary ADC resolution can be estimated as $b_{ADC} \approx b_{DAC} + b_{W} + \log_2 N$, where $b_{DAC}$ is the DAC resolution, $N$ is the number of crossbar rows, and $b_{W}$ is the number of bits in the weights [9]. As ADCs are responsible for up to 90% of the area and power consumption in a compute element [28], the quantization method, the number of bits in weights and activations, and the circuit and architecture design of an IMC core are critical for the hardware efficiency of a QNN implementation. Moreover, the partial-sum overhead also depends on input slicing and input resolution (activation precision in a QNN).
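The following NumPy sketch illustrates the bit-slicing and shift-and-add recombination described above, assuming unsigned 8-bit weights stored as 2-bit cells; slice widths, signed arithmetic, and ADC quantization of the partial sums are design-specific and omitted here.

```python
import numpy as np

# An 8-bit (unsigned) weight matrix is split into 2-bit slices, each stored in its own
# crossbar column group; the column outputs are recombined by shift-and-add.
bits_w, bits_cell = 8, 2
n_slices = bits_w // bits_cell                      # 4 slices from LSB to MSB

rng = np.random.default_rng(1)
W = rng.integers(0, 2 ** bits_w, size=(64, 32))     # integer weights (assumed unsigned)
x = rng.integers(0, 16, size=64)                    # integer inputs (e.g. 4-bit activations)

# Slice k holds bits [k*2, k*2+1] of every weight.
slices = [(W >> (k * bits_cell)) & (2 ** bits_cell - 1) for k in range(n_slices)]

# Each slice produces a partial MVM; shift-and-add reconstructs the full-precision result.
partials = [x @ s for s in slices]
y = sum(p << (k * bits_cell) for k, p in enumerate(partials))
assert np.array_equal(y, x @ W)                     # matches the un-sliced computation
```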

III-D Mapping of QNN to IMC architecture
Mapping of a QNN architecture to IMC hardware depends on the IMC devices and the corresponding hardware design. There are several mapping challenges corresponding to different parts of the design: the non-linear distribution of quantized levels in IMC devices, mapping quantized weights to several crossbar cells, mapping unrolled convolution kernels to IMC crossbars, and mapping large weight matrices to smaller crossbar arrays.
In high-precision devices, e.g. RRAMs, the quantization levels represented by the device conductance are often distributed non-uniformly. Mapping uniformly distributed quantized weights to such devices can therefore be challenging and can degrade the performance accuracy. When quantized high-precision weights are mapped to low-precision crossbar cells, the weights can be mapped either to several columns in a single crossbar or to several crossbars; this depends on the particular design and should be optimized according to the implemented hardware and the availability of peripheral circuits, e.g. ADCs. When convolution kernels are mapped to IMC crossbars to implement a CNN, unrolling 3D kernels into vertical columns can cause circuit overhead, including interconnects and buffers, and dataflow overhead [29]. When large weight matrices are mapped to smaller crossbar arrays, the hardware overhead of ADCs should be considered; the search for the optimal crossbar size can be formulated as an optimization problem to ensure efficient mapping of the neural network architecture to IMC hardware [30].
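To illustrate the last case, the sketch below partitions a large weight matrix into 128×128 blocks, computes a partial MVM per block (one per crossbar), and accumulates the partial outputs digitally; the tile size and the full-precision accumulation are assumptions made for the example.

```python
import numpy as np

def tiled_mvm(W, x, tile=128):
    """Map a large weight matrix onto tile x tile crossbar arrays and sum partial outputs."""
    rows, cols = W.shape
    y = np.zeros(cols)
    for r in range(0, rows, tile):               # each row block drives a different crossbar
        for c in range(0, cols, tile):
            W_sub = W[r:r + tile, c:c + tile]    # weights programmed into one crossbar
            y[c:c + tile] += x[r:r + tile] @ W_sub   # partial sum accumulated digitally
    return y

W = np.random.randn(512, 300)                    # layer larger than a single 128x128 array
x = np.random.randn(512)
assert np.allclose(tiled_mvm(W, x), x @ W)       # tiled result matches the dense MVM
```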
TABLE I: Summary of IMC-based QNN hardware designs.

| Work | Device | CMOS | Bits (W, A, I)∗ | Network | Details |
| --- | --- | --- | --- | --- | --- |
| Binary designs | | | | | |
| [31] | SRAM | 65nm | 1, 1.5∗1, 1 | CNN | Fabricated SRAM-based IMC macro |
| [32] | SRAM | 65nm | 1, 1-5, 1 | 3-layer ANN∗2 | Fabricated SRAM bitcell array for BNN |
| [33] | SRAM, RRAM | 65nm | 1, 1, 1 | VGG | Simulated XNOR-based BNN |
| [34] | RRAM | 65nm | 1, 1, 1 | 5-layer CNN | Simulated fully-parallel BNN |
| [35] | RRAM | 90nm | 1, 1, 1 | - | Fabricated XNOR-RRAM prototype chip |
| [36] | MRAM | 22nm | 1, 4, 1 | 6-layer CNN | Fabricated macro with high parallelism |
| Ternary designs | | | | | |
| [37] | SRAM | 65nm | 1.5, 1.5, - | 13-layer DNN∗2 | Reconfigurable accelerator with ternarized mask |
| [38] | RRAM | 130nm | 1.5, 1.5, 1.5 | VGG | Gated XNOR with near-threshold sense amplifier |
| [39] | MRAM | 45nm | 1.5, 8, - | CNN | Sparse and fast additions for IMC |
| [40] | FeFET | 22nm | 1.5, 4, - | ResNet-20 | Analog crossbar with FeFET and tunnel junction resistor |
| Higher precision designs | | | | | |
| [41] | SRAM | 28nm | 8, 8, 8 | AlexNet | CNN with sandwich-shaped SRAM |
| [4] | SRAM | 28nm | 1-8∗4 | CNN | Hybrid in-/near-memory compute SRAM for IMC |
| [42] | SRAM | 7nm | 4, 4, 4 | - | Fabricated reconfigurable FinFET SRAM IMC macro |
| [43] | SRAM | 28nm | 4-8, 2-8, 10-20 | ResNet-20 | SRAM-based inference and training accelerator |
| [44] | SRAM | 55nm | 1-8 | - | Configurable hybrid SRAM macro |
| [45] | SRAM | 65nm | 1-8, 2-8, - | VGG, ResNet | IMC macro with zero-activation and zero-weight skipping |
| [46] | SRAM | 28nm | 2-4, 4-8, - | ResNet-20, LSTM | IMC macro with variable precision quantization |
| [47] | SRAM | 5nm | 4, 4, 14 | - | Digital IMC macro |
| [48] | RRAM | - | 8, -, 16 | CNN, ConvLSTM | Fabricated network with on-chip learning |
| [49] | RRAM | 130nm | 3, 8, 8∗5 | 5-layer CNN | Fabricated fully implemented network with on-chip fine-tuning |
| [50] | RRAM | 32nm | 16, 16, 16∗5 | VGG | Full pipelined accelerator architecture ISAAC |
| [51] | RRAM | - | 16, 16, 16∗5 | AlexNet, VGG | Full accelerator architecture Pipelayer for training and inference |
| [28] | RRAM | 32nm | 16, 16, 16∗5 | VGG, LSTM | Full programmable accelerator architecture PUMA with compiler |
| [52] | RRAM | 40nm | 1-8, 3, - | VGG-8 | Reconfigurable IMC macro with sparsity control |
| [53] | RRAM | 45nm | -, 6-10, - | - | IMC macro with configurable precision and layer-wise quantization |
| [54] | PCM | 14nm | 8, -, 8 | ANN, ResNet-9 | IMC macro with local digital processing |
| [55] | PCM | 40nm | 2-8, 5-19, 1-8 | ResNet-20 | IMC with PCM macro |
| [56] | FeFET | - | 3, -, - | - | FeFET-based IMC with charge sharing |
| [57] | DRAM | 65nm | 1-8, 2-16, 2-8 | ResNet-20 | Gain-cell eDRAM for IMC |
| [58] | DRAM | 65nm | 8, 8, 8 | 6-layer CNN | Charge-based eDRAM for IMC |
| [59] | Flash | 65nm | 8, -, 8 | LeNet-5 | 3-D NAND Flash IMC array |

∗: W, A, I - weights, activations, inputs; ∗1: ternary; ∗2: fully-connected; ∗3: different architectures; ∗4: several cases; ∗5: serial 1-bit input.
IV In-memory Computing Hardware for QNN
IV-A Binarized Neural Networks and QNNs with Ternary Weights
A binarized neural network (BNN) is a type of QNN relying on 1-bit weights (+1 and -1 in software) and activations [8]. Fabricated binary SRAM-based IMC designs are shown in [31, 32]. The efficient implementation of BNNs in IMC hardware is based on the XNOR operation, which reduces ADC overhead by using 1-bit sense amplifiers [34] for binarized activations. An XNOR-based implementation can be 30 times more efficient than sequential row-by-row read-out. Comparing binarized SRAM- and RRAM-based networks, the RRAM-based implementation is 5.8 times more energy-efficient than the 8T SRAM-based design [33]. Another fabricated RRAM-based BNN prototype with flash ADCs is shown in [35]. The use of flash ADCs is acceptable for BNNs; however, for higher-bit designs, flash ADCs can bring significant area and power overhead [8]. A fabricated MRAM-based IMC macro for 1-bit operations is shown in [36].
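The XNOR-popcount principle behind these designs can be verified with a few lines of NumPy: with activations and weights restricted to {-1, +1} and encoded as bits, the dot product reduces to a popcount of an XNOR, which is what 1-bit sense amplifiers and counters evaluate in hardware. The vector length below is an arbitrary choice for the example.

```python
import numpy as np

# Binary dot product via XNOR-popcount: with a, w in {-1, +1} encoded as bits {0, 1},
# dot(a, w) = 2 * popcount(XNOR(a_bits, w_bits)) - N.
rng = np.random.default_rng(2)
N = 128
a = rng.choice([-1, 1], size=N)                 # binarized activations
w = rng.choice([-1, 1], size=N)                 # binarized weights

a_bits = (a > 0).astype(np.uint8)               # map -1 -> 0, +1 -> 1
w_bits = (w > 0).astype(np.uint8)
xnor = 1 - (a_bits ^ w_bits)                    # 1 wherever the signs agree
popcount = int(xnor.sum())                      # evaluated by 1-bit sensing and counting
assert 2 * popcount - N == int(a @ w)           # equals the signed dot product
```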
QNN designs with ternary weights use quantized values of -1, 0, and +1. Ternary-weight networks demonstrate higher accuracy than BNNs and better sparsity, which allows skipping the operations related to zero weights [39]. Ternary networks can be implemented using RRAM-based crossbar arrays, where positive and negative weights are represented by two crossbar cells and a subtractor is used to calculate the final value of the weight, which also allows zero weights to be represented [60]. The implementation of a ternarized accelerator using SRAMs is also possible [37]; in [37], the network is ternarized by introducing a mask, and the design aims for reconfigurability across different types of neural network architectures. [39] demonstrates an STT-MRAM-based ternary neural network implementation with sparse and fast addition. Ternary weights in a FeFET-based architecture are illustrated in [40], and [38] demonstrates a ternary XNOR network with RRAM devices.
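A sketch of the differential two-cell mapping used in such ternary designs is shown below: each weight is the difference between a "positive" and a "negative" column conductance, so a zero weight leaves both cells unprogrammed. Normalized conductances and an ideal subtractor are assumed.

```python
import numpy as np

# Differential mapping of ternary weights {-1, 0, +1} onto two crossbar columns:
# W = G_pos - G_neg, so zero weights correspond to both cells left in the low state.
rng = np.random.default_rng(3)
W = rng.choice([-1, 0, 1], size=(64, 16))       # ternary weight matrix
G_pos = (W > 0).astype(float)                   # normalized conductance in the "+" column
G_neg = (W < 0).astype(float)                   # normalized conductance in the "-" column
x = rng.integers(0, 4, size=64).astype(float)   # input activations

y = x @ G_pos - x @ G_neg                       # subtractor combines the two column currents
assert np.allclose(y, x @ W)                    # equivalent to the ternary MVM
```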
IV-B Higher-Bit Fixed-point IMC Computations for QNNs
The most common IMC architectures for fixed-point computations are based on SRAM and RRAM memories. As SRAM is a mature memory technology, many fabricated SRAM-based IMC architectures have been demonstrated [41, 4, 42]. An SRAM implementation of a combined inference and training accelerator is illustrated in [43]. Fabricated RRAM-based IMC macro designs are shown in [48, 49]: in [48], on-chip learning is considered, and in [49], on-chip fine-tuning is demonstrated. RRAM-based accelerator-level designs with several levels of architecture hierarchy are shown in [50, 51, 28, 61]. In these QNN designs, the datapath, routing, and arrangement of the network nodes are considered in addition to control and computation peripherals [8].
The general trend towards general-purpose QNN accelerators rather than architecture-specific designs leads to the development of configurable bit precision in IMC architectures. Several works implement configurable macros for IMC accelerators that can vary the bit precision of the weights [44, 45, 46, 47]. Some recent RRAM-based IMC architectures also support configurable bit operations [52, 53]. Apart from SRAM and RRAM, other devices are less popular for IMC hardware; however, there are several architectures based on PCMs [54, 55], FeFETs [56], DRAMs [57, 58], and flash memories [59].
Table I summarizes QNN hardware designs for IMC. Fig. 3 compares the power and area efficiency of different IMC hardware designs as a function of weight precision for the architectures in Table I. It is important to note that ADC precision affects power efficiency significantly: in [32], the design with a 1-bit ADC is 30 times more efficient than with a 5-bit ADC. The precision of the weights also affects power efficiency: in [44], the power efficiency drops from 40.2 TOPS/W to 0.6 TOPS/W for 1-bit and 8-bit weights, respectively. The power efficiency of an IMC macro and that of the whole system, including several macros connected with peripheral circuits, differ significantly, e.g. in [45] the IMC macro is 5 times more efficient than the overall system. Power efficiency is also affected by the technology node; for example, for a 4-bit neural network, the 7nm SRAM-based IMC architecture [42] is more than 100 times more efficient than the 55nm SRAM-based architecture [44].

V Open Challenges, Requirements, Recommendations, and Perspectives
The roadmap for IMC-based QNN architectures with recent achievements and open challenges is illustrated in Fig. 4. The recommendations for future development of IMC-based QNN hardware include general improvement in QNN hardware designs, on-chip training, mixed-precision quantization and design reconfigurability support, automation of the quantization policies search, development of software-hardware co-design frameworks, and integration of IMC architectures to the traditional designs.
V-A Efficient QNN Inference Architecture
Despite the recent development of IMC inference hardware designs, there are still challenges that should be addressed to move from IMC accelerators developed in research laboratories to efficient commercial solutions for low-power IMC-based QNN hardware. They include the hardware overhead in crossbar macros, consideration of architecture hierarchy and related challenges, and hardware non-idealities affecting performance accuracy.
Even though the IMC crossbar architecture can be highly efficient for MVM operations, the hardware overhead caused by peripheral and control circuits can diminish the benefits of even small low-power non-volatile memories in the design. In crossbar macros, the hardware overhead comes from ADCs and partial sums when high-precision weights are split into low-precision crossbar cells [9]. ADC design improvements, implementation of low-power converters, reducing the number of ADCs and the ADC resolution, and relying on approximate computing can benefit IMC hardware designs [8].
An efficient IMC hierarchy design is also important. The interconnection of processing elements, computation blocks, and tiles also affects design efficiency. The architecture hierarchy in SRAM-based IMC designs is explored more than in non-volatile memory-based designs. Data movement between the layers, requirements for additional storage, and circuits interconnection in each level of the IMC architecture hierarchy should still be improved further. Moreover, several solutions have been explored to move from a traditional monolithic chip design to 2.5D integration or chiplet-based designs [7].
IMC designs based on non-volatile devices are prone to non-idealities. Even though quantized and binarized architectures are affected less by noise and device variations [63], device-to-device and cycle-to-cycle variability cause errors propagating through the network and affecting the performance accuracy. In addition, the immaturity of non-volatile memories leads to device fault issues and conductance drift with time also reducing performance accuracy [8]. These issues along with the device endurance should still be addressed at the device and material level. Moreover, the 3D stacking capability of IMC devices and crossbar cells should also be developed, as 3D integration is useful to decrease the length of interconnect wires to increase the chip density and reduce IR drop.
V-B QNN Training on Chip
One of the main problems of QNN training is the requirement of full-precision gradients for weight updates. This complicates neural network training on low-power hardware and on IMC architectures that do not support full-precision operations. Several techniques have been proposed to reduce the number of full-precision weights, e.g. mixing full-precision weights with quantized weights during training [64]. However, hardware support for full-precision operations is still necessary in this case. Therefore, algorithms for QNN training based only on quantized values still need to be developed. Moreover, it is important to adapt these algorithms to IMC hardware by relying on hardware-friendly operations and by including IMC hardware considerations, e.g. ADC precision, partial sums, and IMC hardware non-idealities.
Neural network training on a chip can also be limited by device non-idealities. While IMC hardware non-idealities are not the main concern in SRAM-based architectures, RRAM-based IMC architectures suffer from endurance issues and variability. On-chip training requires a write endurance far higher than most current RRAM devices can provide [9]. Another issue for on-chip training with non-volatile IMC devices, e.g. RRAMs and PCMs, is high write latency: the write latency of RRAMs and PCMs usually reaches 100-150 ns, while the write latency of SRAMs can be below 1 ns [65]. Non-linearity in the switching behavior of some non-volatile IMC devices, such as RRAMs, also requires additional hardware overhead to verify and correct the programmed conductance values during on-chip training [8].
On-chip learning and backpropagation circuits create an overhead for additional memory and registers to store intermediate outputs for gradient calculation, control circuits and flexible peripheral circuits for a crossbar array. Moreover, neural network training still requires higher bit precision than inference, which also leads to higher resolution ADCs. The development of efficient and flexible hardware support for on-chip training is an open challenge.
In addition, QNN IMC hardware should support state-of-the-art software-based training techniques. One example is normalization fusion [66], where batch normalization layers are merged into the preceding fully-connected or convolutional layers. Implementing such a method on IMC hardware might bring accuracy degradation if the trained neural network is directly quantized and deployed on IMC accelerators. Overall, it is important to transfer software-based methods for QNNs to IMC hardware.
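For reference, the arithmetic of normalization fusion for a fully-connected layer is sketched below (the convolutional case scales each output channel analogously); the layer shapes and parameter values are arbitrary assumptions for the example.

```python
import numpy as np

def fuse_batchnorm(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding fully-connected layer's weights and bias."""
    scale = gamma / np.sqrt(var + eps)           # one scaling factor per output neuron
    W_fused = W * scale[:, None]                 # W has shape (out_features, in_features)
    b_fused = (b - mean) * scale + beta
    return W_fused, b_fused

out_f, in_f = 8, 16
W, b = np.random.randn(out_f, in_f), np.random.randn(out_f)
gamma, beta = np.random.rand(out_f) + 0.5, np.random.randn(out_f)
mean, var = np.random.randn(out_f), np.random.rand(out_f) + 0.1
x = np.random.randn(in_f)

W_f, b_f = fuse_batchnorm(W, b, gamma, beta, mean, var)
bn_out = gamma * ((W @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
assert np.allclose(W_f @ x + b_f, bn_out)        # fused layer reproduces FC + BatchNorm
```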
V-C IMC Hardware Support for Mixed-precision Designs and Reconfigurable Precision
Recently, mixed-precision quantization has become an essential part of IMC hardware design, as it is more effective than fixed-precision quantization [67]. IMC hardware support for mixed-precision designs is an open challenge. It implies that IMC macros should support different numbers of bits for weights and activations and be reconfigurable for different precision schemes, which also leads to hardware overhead related to computational blocks and control circuits. Moreover, the hardware under-utilization problem should be considered in such architectures, and reconfiguration schemes for effective hardware resource utilization in IMC macros are required for mixed-precision designs.
V-D Moving Towards General-Purpose Architectures by Improving IMC Design Reconfigurability
Most of the hardware for IMC architectures is designed for specific applications and specific QNN types. However, the overall trend in IMC designs is to move towards general-purpose chips [62]. To make a step towards general-purpose functionality, apart from the support of reconfigurable precision, IMC designs should support different workloads, different directions of data movement, quantization schemes, data types, and both spatial and temporal domain computations. The distribution of the workloads in different types of neural networks varies, e.g. ResNet workloads and data distribution for pattern recognition are different from LSTM workloads for language processing. Moreover, there is also a variation of the workloads within a neural network for different layers, e.g. convolution layers and fully-connected layers process the data differently and IMC hardware should support reconfigurability to be programmed accordingly. The issue of different workloads leads to unbalanced execution time and resource utilization in different layers [62].
Reconfigurability requires the hardware support of flexible interconnects. The support of different types of quantization schemes is important, as different applications rely on different data distributions. Moreover, the support of corresponding arithmetic operations and related IMC hardware macros is essential. At the same time, it is important to identify the trade-off between the support of different operations and quantization schemes and related hardware overhead. Overall, to satisfy different tasks, network structures, quantization methods, precision, and latency requirements, IMC QNN architectures should be reconfigurable.
Reconfigurability is also useful for balancing energy consumption and performance accuracy. There is a trade-off between the resources required for the task execution and the performance accuracy in low-power on-edge designs. In general, low-precision architectures usually lead to lower energy consumption but lower accuracy. Flexible precision and reconfigurability of IMC architectures can be used for scenarios when the power source is not available for on-edge devices, so the application can run under constrained power resources at the cost of reduced performance accuracy.
V-E Automating the Quantization Policies Search in IMC Architectures
Quantization policies have several parameters to optimize when deploying to IMC hardware, including bit widths, scaling factors, quantization thresholds, and clipping ranges. This problem becomes even more complex for mixed-precision quantization, where different layers can have different quantization policies. Moreover, when deploying on IMC hardware, the performance accuracy can also be affected by IMC hardware non-idealities [68]. Therefore, it is important to include hardware-related evaluation when optimizing quantization policies. It is difficult to find the optimum policies manually, especially as the size of the network increases. To automate the search for optimum quantization policies, it can be formulated as a Hardware-Aware Neural Architecture Search (HW-NAS) problem. Such an approach is presented in [67], where optimum bit widths for weights and ADC precision for IMC architectures are searched using a reinforcement learning approach. The CMQ framework [69] also searches for the optimum quantization threshold and bit width using a differentiable search approach. The Gibbon framework [70] searches for the optimum bit width and ADC precision along with the neural network architecture and crossbar-related hardware parameters using an evolutionary algorithm-based approach. Further development of such frameworks is useful, as there are no unified frameworks considering the joint optimization of neural network architecture parameters, hardware parameters, and quantization policies. Moreover, the existing frameworks are designed for specific architectures and specific problems and require modifications for general use. Therefore, the development of a unified joint optimization framework for HW-NAS including quantization policy optimization is an open challenge.
V-F Software-Hardware Co-design
Software-hardware co-design connects the software-based design of a neural network to its hardware implementation and implies the optimization of all levels of the design, from the devices and circuit macros to architectures and algorithms [62]. Currently available frameworks for IMC architectures, e.g. PUMAsim [28] or NeuroSim [71], focus on a specific IMC design, do not support different quantization techniques, especially mixed-precision quantization, and have limited support for various IMC devices. There is still a lack of a universal EDA toolchain for large-scale implementations supporting various IMC hardware designs [62], including different types of neural networks, IMC architectures with various IMC devices and related non-idealities, various macros, and different interconnect designs for the architecture blocks. Software-hardware co-design frameworks should include more QNN features, support different quantization methods, and include more mapping techniques. Overall, an efficient automated compiler for mapping a QNN to an IMC-based hardware implementation supporting a wide range of IMC designs is still an open challenge.
V-G Integration of IMC Architectures to Traditional Hardware Designs and Combining Different Types of Hardware
While IMC architectures are efficient for certain operations, e.g. MVMs, they cannot fully replace other types of hardware. Therefore, it is important to integrate IMC architectures with traditional hardware designs and with different types of hardware. In particular, this approach will also benefit QNN training on a chip, where floating-point operations are still necessary and require different hardware or additional DSP blocks. The efficient integration of various types of IMC hardware together is also necessary. There has been a successful attempt to combine an RRAM-based IMC accelerator with conventional SRAM-based memories and an embedded processor [72]. The efficient integration of non-volatile memories with traditional memories, e.g. SRAM, is important to overcome external memory requirements and latency issues. Large-scale integration of relatively novel IMC architectures with traditional hardware designs can bring many benefits in terms of architecture efficiency and is the next step in IMC hardware development.
VI Conclusion
This paper reviews state-of-the-art designs of IMC-based QNN hardware implementations and compares the QNN designs with different IMC devices. To improve the efficiency of IMC-based QNNs, different levels of the design from IMC devices and architectures to QNN algorithms should be improved simultaneously. The main challenges and future directions in IMC-based QNN hardware research include further improvement of the IMC-based inference engines for QNNs, efficient on-chip training with quantized gradients and weight updates, hardware support of mixed-precision quantization, reconfigurability of the designs, automation of the optimum quantization policies search along with optimized hardware parameters, software-hardware co-design, and integration of IMC architectures to traditional hardware designs.
References
- [1] A. S. Andrae, “New perspectives on internet electricity use in 2030,” Engineering and Applied Science Letters, vol. 3, no. 2, pp. 19–31, 2020.
- [2] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” arXiv preprint arXiv:2103.13630, 2021.
- [3] Y. Guo, “A survey on methods and theories of quantized neural networks,” arXiv preprint arXiv:1808.04752, 2018.
- [4] J. Wang, X. Wang, C. Eckert, A. Subramaniyan, R. Das, D. Blaauw, and D. Sylvester, “A 28-nm compute sram with bit-serial logic/arithmetic operations for programmable in-memory vector computing,” IEEE Journal of Solid-State Circuits, vol. 55, no. 1, pp. 76–86, 2019.
- [5] A. Sebastian, M. Le Gallo, R. Khaddam-Aljameh, and E. Eleftheriou, “Memory devices and applications for in-memory computing,” Nature nanotechnology, vol. 15, no. 7, pp. 529–544, 2020.
- [6] D. Ielmini and G. Pedretti, “Device and circuit architectures for in-memory computing,” Advanced Intelligent Systems, vol. 2, no. 7, p. 2000040, 2020.
- [7] G. Krishnan, S. K. Mandal, M. Pannala, C. Chakrabarti, J.-S. Seo, U. Y. Ogras, and Y. Cao, “Siam: Chiplet-based scalable in-memory acceleration with mesh for deep neural networks,” ACM Transactions on Embedded Computing Systems (TECS), vol. 20, no. 5s, pp. 1–24, 2021.
- [8] O. Krestinskaya, L. Zhang, and K. N. Salama, “Towards efficient rram-based quantized neural networks hardware: State-of-the-art and open issues,” in 2022 IEEE 22nd International Conference on Nanotechnology (NANO). IEEE, 2022, pp. 465–468.
- [9] I. Chakraborty, M. Ali, A. Ankit, S. Jain, S. Roy, S. Sridharan, A. Agrawal, A. Raghunathan, and K. Roy, “Resistive crossbars as approximate hardware building blocks for machine learning: Opportunities and challenges,” Proceedings of the IEEE, vol. 108, no. 12, pp. 2276–2310, 2020.
- [10] M. A. Zidan, H. A. H. Fahmy, M. M. Hussain, and K. N. Salama, “Memristor-based memory: The sneak paths problem and solutions,” Microelectronics journal, vol. 44, no. 2, pp. 176–183, 2013.
- [11] M. E. Fouda, S. Lee, J. Lee, G. H. Kim, F. Kurdahi, and A. M. Eltawi, “Ir-qnn framework: An ir drop-aware offline training of quantized crossbar arrays,” IEEE Access, vol. 8, pp. 228 392–228 408, 2020.
- [12] X. Sun, W. Khwa, Y. Chen, C. Lee, H. Lee, S. Yu, R. Naous, J. Wu, T. Chen, X. Bao et al., “Pcm-based analog compute-in-memory: impact of device non-idealities on inference accuracy,” IEEE Transactions on Electron Devices, vol. 68, no. 11, pp. 5585–5591, 2021.
- [13] O. Krestinskaya, A. Irmanova, and A. P. James, “Memristive non-idealities: Is there any practical implications for designing neural network chips?” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
- [14] L. Zhang, S. Cosemans, D. J. Wouters, G. Groeseneken, M. Jurczak, and B. Govoreanu, “Selector design considerations and requirements for 1s1r rram crossbar array,” in 2014 IEEE 6th International Memory Workshop (IMW). IEEE, 2014, pp. 1–4.
- [15] Y. Gong, L. Liu, M. Yang, and L. D. Bourdev, “Compressing deep convolutional networks using vector quantization,” CoRR, vol. abs/1412.6115, 2014. [Online]. Available: http://arxiv.org/abs/1412.6115
- [16] C.-F. Teng, C.-H. D. Wu, A. K.-S. Ho, and A.-Y. A. Wu, “Low-complexity recurrent neural network-based polar decoder with weight quantization mechanism,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 1413–1417.
- [17] T. Chu, Q. Luo, J. Yang, and X. Huang, “Mixed-precision quantized neural networks with progressively decreasing bitwidth,” Pattern Recognition, vol. 111, p. 107647, 2021.
- [18] N. Kim, D. Shin, W. Choi, G. Kim, and J. Park, “Exploiting retraining-based mixed-precision quantization for low-cost dnn accelerator design,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 2925–2938, 2020.
- [19] D. T. Nguyen, H. Kim, and H.-J. Lee, “Layer-specific optimization for mixed data flow with mixed precision in fpga design for cnn-based object detectors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 6, pp. 2450–2464, 2020.
- [20] H. V. Habi, R. H. Jennings, and A. Netzer, “Hmq: Hardware friendly mixed precision quantization block for cnns,” in European Conference on Computer Vision. Springer, 2020, pp. 448–463.
- [21] A. Elthakeb, P. Pilligundla, F. Mireshghallah, A. Yazdanbakhsh, S. Gao, and H. Esmaeilzadeh, “Releq: an automatic reinforcement learning approach for deep quantization of neural networks,” in NeurIPS ML for Systems workshop, 2018, 2019.
- [22] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” Advances in neural information processing systems, vol. 28, 2015.
- [23] M. Spallanzani, L. Cavigelli, G. P. Leonardi, M. Bertogna, and L. Benini, “Additive noise annealing and approximation properties of quantized neural networks,” arXiv preprint arXiv:1905.10452, 2019.
- [24] Y. Bai, Y.-X. Wang, and E. Liberty, “Proxquant: Quantized neural networks via proximal operators,” arXiv preprint arXiv:1810.00861, 2018.
- [25] J. Yang, X. Shen, J. Xing, X. Tian, H. Li, B. Deng, J. Huang, and X.-s. Hua, “Quantization networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 7308–7316.
- [26] Y. Yilmaz and P. Mazumder, “A drift-tolerant read/write scheme for multilevel memristor memory,” IEEE Transactions on Nanotechnology, vol. 16, no. 6, pp. 1016–1027, 2017.
- [27] A. Ciprut and E. G. Friedman, “Modeling size limitations of resistive crossbar array with cell selectors,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 286–293, 2016.
- [28] A. Ankit, I. E. Hajj, S. R. Chalamalasetti, G. Ndu, M. Foltin, R. S. Williams, P. Faraboschi, W.-m. W. Hwu, J. P. Strachan, K. Roy et al., “Puma: A programmable ultra-efficient memristor-based accelerator for machine learning inference,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019, pp. 715–731.
- [29] X. Peng, R. Liu, and S. Yu, “Optimizing weight mapping and data flow for convolutional neural networks on rram based processing-in-memory architecture,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2019, pp. 1–5.
- [30] S. Negi, I. Chakraborty, A. Ankit, and K. Roy, “Nax: Neural architecture and memristive xbar based accelerator co-design,” in Proceedings of the 59th ACM/IEEE Design Automation Conference, ser. DAC ’22. New York, NY, USA: Association for Computing Machinery, 2022, p. 451–456. [Online]. Available: https://doi.org/10.1145/3489517.3530476
- [31] W.-S. Khwa, J.-J. Chen, J.-F. Li, X. Si, E.-Y. Yang, X. Sun, R. Liu, P.-Y. Chen, Q. Li, S. Yu et al., “A 65nm 4kb algorithm-dependent computing-in-memory sram unit-macro with 2.3 ns and 55.8 tops/w fully parallel product-sum operation for binary dnn edge processors,” in 2018 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2018, pp. 496–498.
- [32] C. Yu, T. Yoo, T. T.-H. Kim, K. C. T. Chuan, and B. Kim, “A 16k current-based 8t sram compute-in-memory macro with decoupled read/write and 1-5bit column adc,” in 2020 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 2020, pp. 1–4.
- [33] X. Sun, R. Liu, X. Peng, and S. Yu, “Computing-in-memory with sram and rram for binary neural networks,” in 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT). IEEE, 2018, pp. 1–4.
- [34] X. Sun, X. Peng, P.-Y. Chen, R. Liu, J.-s. Seo, and S. Yu, “Fully parallel rram synaptic array for implementing binary neural network with (+ 1,- 1) weights and (+ 1, 0) neurons,” in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2018, pp. 574–579.
- [35] S. Yin, X. Sun, S. Yu, and J.-s. Seo, “High-throughput in-memory computing for binary deep neural networks with monolithically integrated rram and 90-nm cmos,” IEEE Transactions on Electron Devices, vol. 67, no. 10, pp. 4185–4192, 2020.
- [36] P. Deaville, B. Zhang, L.-Y. Chen, and N. Verma, “A maximally row-parallel mram in-memory-computing macro addressing readout circuit sensitivity and area,” in ESSCIRC 2021-IEEE 47th European Solid State Circuits Conference (ESSCIRC). IEEE, 2021, pp. 75–78.
- [37] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda et al., “Brein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 tops at 0.6 w,” IEEE Journal of Solid-State Circuits, vol. 53, no. 4, pp. 983–994, 2017.
- [38] A. Laborieux, M. Bocquet, T. Hirtzlin, J.-O. Klein, L. H. Diez, E. Nowak, E. Vianello, J.-M. Portal, and D. Querlioz, “Low power in-memory implementation of ternary neural networks with resistive ram-based synapse,” in 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS). IEEE, 2020, pp. 136–140.
- [39] S. Zhu, L. H. Duong, H. Chen, D. Liu, and W. Liu, “Fat: An in-memory accelerator with fast addition for ternary weight neural networks,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022.
- [40] D. Saito, T. Kobayashi, H. Koga, N. Ronchi, K. Banerjee, Y. Shuto, J. Okuno, K. Konishi, L. Di Piazza, A. Mallik et al., “Analog in-memory computing in fefet-based 1t1r array for edge ai applications,” in 2021 Symposium on VLSI Technology. IEEE, 2021, pp. 1–2.
- [41] J. Yang, Y. Kong, Z. Wang, Y. Liu, B. Wang, S. Yin, and L. Shi, “24.4 sandwich-ram: An energy-efficient in-memory bwn architecture with pulse-width modulation,” in 2019 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2019, pp. 394–396.
- [42] Q. Dong, M. E. Sinangil, B. Erbagci, D. Sun, W.-S. Khwa, H.-J. Liao, Y. Wang, and J. Chang, “15.3 a 351tops/w and 372.4 gops compute-in-memory sram macro in 7nm finfet cmos for machine-learning applications,” in 2020 IEEE International Solid-State Circuits Conference-(ISSCC). IEEE, 2020, pp. 242–244.
- [43] J.-W. Su, X. Si, Y.-C. Chou, T.-W. Chang, W.-H. Huang, Y.-N. Tu, R. Liu, P.-J. Lu, T.-W. Liu, J.-H. Wang et al., “Two-way transpose multibit 6t sram computing-in-memory macro for inference-training ai edge chips,” IEEE Journal of Solid-State Circuits, vol. 57, no. 2, pp. 609–624, 2021.
- [44] Z. Zhang, J.-J. Chen, X. Si, Y.-N. Tu, J.-W. Su, W.-H. Huang, J.-H. Wang, W.-C. Wei, Y.-C. Chiu, J.-M. Hong et al., “A 55nm 1-to-8 bit configurable 6t sram based computing-in-memory unit-macro for cnn-based ai edge processors,” in 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC). IEEE, 2019, pp. 217–218.
- [45] J. Yue, X. Feng, Y. He, Y. Huang, Y. Wang, Z. Yuan, M. Zhan, J. Liu, J.-W. Su, Y.-L. Chung et al., “15.2 a 2.75-to-75.9 tops/w computing-in-memory nn processor supporting set-associate block-wise zero skipping and ping-pong cim with simultaneous computation and weight updating,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64. IEEE, 2021, pp. 238–240.
- [46] R. Guo, Z. Yue, X. Si, T. Hu, H. Li, L. Tang, Y. Wang, L. Liu, M.-F. Chang, Q. Li et al., “15.4 a 5.99-to-691.1 tops/w tensor-train in-memory-computing processor using bit-level-sparsity-based optimization and variable-precision quantization,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64. IEEE, 2021, pp. 242–244.
- [47] H. Fujiwara, H. Mori, W.-C. Zhao, M.-C. Chuang, R. Naous, C.-K. Chuang, T. Hashizume, D. Sun, C.-F. Lee, K. Akarvardar et al., “A 5-nm 254-tops/w 221-tops/mm 2 fully-digital computing-in-memory macro supporting wide-range dynamic-voltage-frequency scaling and simultaneous mac and write operations,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65. IEEE, 2022, pp. 1–3.
- [48] Z. Wang, C. Li, P. Lin, M. Rao, Y. Nie, W. Song, Q. Qiu, Y. Li, P. Yan, J. P. Strachan et al., “In situ training of feed-forward and recurrent convolutional memristor networks,” Nature Machine Intelligence, vol. 1, no. 9, pp. 434–442, 2019.
- [49] P. Yao, H. Wu, B. Gao, J. Tang, Q. Zhang, W. Zhang, J. J. Yang, and H. Qian, “Fully hardware-implemented memristor convolutional neural network,” Nature, vol. 577, no. 7792, pp. 641–646, 2020.
- [50] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44, no. 3, pp. 14–26, 2016.
- [51] L. Song, X. Qian, H. Li, and Y. Chen, “Pipelayer: A pipelined reram-based accelerator for deep learning,” in 2017 IEEE international symposium on high performance computer architecture (HPCA). IEEE, 2017, pp. 541–552.
- [52] W. Li, S. Huang, X. Sun, H. Jiang, and S. Yu, “Secure-rram: A 40nm 16kb compute-in-memory macro with reconfigurability, sparsity control, and embedded security,” in 2021 IEEE Custom Integrated Circuits Conference (CICC). IEEE, 2021, pp. 1–2.
- [53] Z. Zhu, H. Sun, Y. Lin, G. Dai, L. Xia, S. Han, Y. Wang, and H. Yang, “A configurable multi-precision cnn computing framework based on single bit rram,” in 2019 56th ACM/IEEE Design Automation Conference (DAC). IEEE, 2019, pp. 1–6.
- [54] R. Khaddam-Aljameh, M. Stanisavljevic, J. F. Mas, G. Karunaratne, M. Braendli, F. Liu, A. Singh, S. M. Müller, U. Egger, A. Petropoulos et al., “Hermes core–a 14nm cmos and pcm-based in-memory compute core using an array of 300ps/lsb linearized cco-based adcs and local digital processing,” in 2021 Symposium on VLSI Circuits. IEEE, 2021, pp. 1–2.
- [55] W.-S. Khwa, Y.-C. Chiu, C.-J. Jhang, S.-P. Huang, C.-Y. Lee, T.-H. Wen, F.-C. Chang, S.-M. Yu, T.-Y. Lee, and M.-F. Chang, “A 40-nm, 2m-cell, 8b-precision, hybrid slc-mlc pcm computing-in-memory macro with 20.5-65.0 tops/w for tiny-ai edge devices,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65. IEEE, 2022, pp. 1–3.
- [56] C. Matsui, K. Toprasertpong, S. Takagi, and K. Takeuchi, “Energy-efficient reliable hzo fefet computation-in-memory with local multiply & global accumulate array for source-follower & charge-sharing voltage sensing,” in 2021 Symposium on VLSI Technology. IEEE, 2021, pp. 1–2.
- [57] S. Xie, C. Ni, P. Jain, F. Hamzaoglu, and J. P. Kulkarni, “Gain-cell cim: Leakage and bitline swing aware 2t1c gain-cell edram compute in memory design with bitline precharge dacs and compact schmitt trigger adcs,” in 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits). IEEE, 2022, pp. 112–113.
- [58] S. Xie, C. Ni, A. Sayal, P. Jain, F. Hamzaoglu, and J. P. Kulkarni, “16.2 edram-cim: Compute-in-memory design with reconfigurable embedded-dynamic-memory array realizing adaptive data converters and charge-domain computing,” in 2021 IEEE International Solid-State Circuits Conference (ISSCC), vol. 64. IEEE, 2021, pp. 248–250.
- [59] M. Kim, M. Liu, L. R. Everson, and C. H. Kim, “An embedded nand flash-based compute-in-memory array demonstrated in a standard logic process,” IEEE Journal of Solid-State Circuits, vol. 57, no. 2, pp. 625–638, 2021.
- [60] Q. Liu, B. Gao, P. Yao, D. Wu, J. Chen, Y. Pang, W. Zhang, Y. Liao, C.-X. Xue, W.-H. Chen et al., “33.2 a fully integrated analog reram based 78.4 tops/w compute-in-memory chip with fully parallel mac computing,” in 2020 ieee international solid-state circuits conference-(isscc). IEEE, 2020, pp. 500–502.
- [61] A. Ankit, I. El Hajj, S. R. Chalamalasetti, S. Agarwal, M. Marinella, M. Foltin, J. P. Strachan, D. Milojicic, W.-M. Hwu, and K. Roy, “Panther: A programmable architecture for neural network training harnessing energy-efficient reram,” IEEE Transactions on Computers, vol. 69, no. 8, pp. 1128–1142, 2020.
- [62] W. Zhang, B. Gao, J. Tang, P. Yao, S. Yu, M.-F. Chang, H.-J. Yoo, H. Qian, and H. Wu, “Neuro-inspired computing chips,” Nature electronics, vol. 3, no. 7, pp. 371–382, 2020.
- [63] A. James, Y. Toleubay, O. Krestinskaya, and C. Reghuvaran, “Inference dropouts in binary weighted analog memristive crossbar,” IEEE Transactions on Nanotechnology, vol. 21, pp. 271–277, 2022.
- [64] J. Chen, L. Liu, Y. Liu, and X. Zeng, “A learning framework for n-bit quantized neural networks toward fpgas,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 3, pp. 1067–1081, 2020.
- [65] W. Haensch, A. Raghunathan, K. Roy, B. Chakrabart, C. M. Phatak, C. Wang, and S. Guha, “A co-design view of compute in-memory with non-volatile elements for neural networks,” arXiv preprint arXiv:2206.08735, 2022.
- [66] W. Jung, D. Jung, B. Kim, S. Lee, W. Rhee, and J. H. Ahn, “Restructuring batch normalization to accelerate cnn training,” Proceedings of Machine Learning and Systems, vol. 1, pp. 14–26, 2019.
- [67] S. Huang, A. Ankit, P. Silveira, R. Antunes, S. R. Chalamalasetti, I. El Hajj, D. E. Kim, G. Aguiar, P. Bruel, S. Serebryakov et al., “Mixed precision quantization for reram-based dnn inference accelerators,” in 2021 26th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2021, pp. 372–377.
- [68] O. Krestinskaya, K. N. Salama, and A. P. James, “Automating analogue ai chip design with genetic search,” Advanced Intelligent Systems, vol. 2, no. 8, p. 2000075, 2020.
- [69] J. Peng, H. Liu, Z. Zhao, Z. Li, S. Liu, and Q. Li, “Cmq: Crossbar-aware neural network mixed-precision quantization via differentiable architecture search,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2022.
- [70] H. Sun, C. Wang, Z. Zhu, X. Ning, G. Dai, H. Yang, and Y. Wang, “Gibbon: efficient co-exploration of nn model and processing-in-memory architecture,” in 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2022, pp. 867–872.
- [71] X. Peng, S. Huang, Y. Luo, X. Sun, and S. Yu, “Dnn+ neurosim: An end-to-end benchmarking framework for compute-in-memory accelerators with versatile device technologies,” in 2019 IEEE international electron devices meeting (IEDM). IEEE, 2019, pp. 32–5.
- [72] M. Chang, S. D. Spetalnick, B. Crafton, W.-S. Khwa, Y.-D. Chih, M.-F. Chang, and A. Raychowdhury, “A 40nm 60.64 tops/w ecc-capable compute-in-memory/digital 2.25 mb/768kb rram/sram system with embedded cortex m3 microprocessor for edge recommendation systems,” in 2022 IEEE International Solid-State Circuits Conference (ISSCC), vol. 65. IEEE, 2022, pp. 1–3.