PartitionPIM: Practical Memristive Partitions for Fast Processing-in-Memory

Orian Leitersdorf, Ronny Ronen, and Shahar Kvatinsky
(2021)
Abstract.

Digital memristive processing-in-memory overcomes the memory wall through a fundamental storage device capable of stateful logic within crossbar arrays. Dynamically dividing the crossbar arrays by adding memristive partitions further increases parallelism, thereby overcoming an inherent trade-off in memristive processing-in-memory. The algorithmic topology of partitions is highly unique, and was recently exploited to accelerate multiplication (11× with 32 partitions) and sorting (14× with 16 partitions). Yet, the physical implementation of memristive partitions, such as the peripheral decoders and the control messages, has never been considered and may lead to vast impracticality. This paper overcomes that challenge with several novel techniques, presenting efficient practical designs of memristive partitions. We begin by formalizing the algorithmic properties of memristive partitions into serial, parallel, and semi-parallel operations. Peripheral overhead is addressed via a novel technique of half-gates that enables efficient decoding with negligible overhead. Control overhead is addressed by carefully reducing the operation set of memristive partitions, with negligible performance impact, utilizing techniques such as shared indices and pattern generators. Ultimately, these efficient practical solutions, combined with the vast algorithmic potential, may revolutionize digital memristive processing-in-memory.

copyright: none; journal year: 2022; doi: xx.xxxx/xxxxxxx.xxxxxxx

1. Introduction

Logic and storage are united in processing-in-memory (PIM) (Balasubramonian and Grot, 2016) solutions to overcome the data-transfer bottleneck (Pedram et al., 2017). Memristive processing-in-memory (Talati et al., 2020) is rapidly emerging as an implementation of real processing-in-memory that utilizes the memristor (Chua, 1971), a device inherently capable of both storage and logic. Yet, the logic supported by memristive processing-in-memory is basic (a logic gate such as NOR (Kvatinsky et al., 2014)), requiring many cycles for arithmetic operations (e.g., 320 cycles for 32-bit addition (Ronen et al., 2021), 11264 cycles for 32-bit multiplication (Haj-Ali et al., 2018a)). This leads to a trade-off between latency and throughput, diminishing the benefit of memristive processing-in-memory. Recent works have overcome this trade-off by proposing the concept of memristive partitions (Gupta et al., 2018) that increases parallelism to achieve arithmetic that is both fast and high-throughput (Leitersdorf et al., 2021; Lu et al., 2021; Alam et al., 2022). Nevertheless, the implementation of partitions was never discussed, and naive solutions lead to vast impracticality. This paper utilizes several novel techniques to design practical memristive partitions that maintain the immense algorithmic potential, thereby advancing memristive processing-in-memory towards a practical solution that may vastly accelerate large-scale applications.

The emerging stateful-logic technique enables parallel row or column gates within a memristive crossbar array in constant time. Binary information is represented by the resistance of a memristor, and stateful logic enables logic operations among memristors (Reuben et al., 2017; Borghetti et al., 2010; Kvatinsky et al., 2014; Gupta et al., 2018). The crossbar array stores a single bit per memristor, and applying voltages across bitlines (wordlines) of the crossbar induces parallel logic within rows (columns). For example, the bit-wise NOR of two columns can be computed and stored in a third column, within a single cycle, by applying only three voltages on the bitlines, as illustrated in Figure 1 (Kvatinsky et al., 2014). Examples of stateful-logic techniques include MAGIC (Kvatinsky et al., 2014) and FELIX (Gupta et al., 2018), enabling various logic gates, such as NOT, NOR, NAND, OR, and Minority3, in a single cycle.

In single-row arithmetic, each row performs computation independently on different input. As such, column operations enable parallel vectored execution across all rows (Ben-Hur et al., 2020), independent of the vector dimension (row count). Due to its potential high throughput, such arithmetic gained recent attention, including single-row addition (Leitersdorf et al., 2021; Gupta et al., 2018; Ronen et al., 2021; Ben-Hur et al., 2020; Eliahu et al., 2020) and multiplication (Haj-Ali et al., 2018a; Imani et al., 2019; Haj-Ali et al., 2018b). While logic is parallelized across rows, the gates in a single-row algorithm occur serially (Reuben et al., 2017). For example, element-wise multiplication of two n-dimensional vectors of N-bit numbers requires serial execution of O(N^2) gates, hence O(N^2) latency (independent of n) (Haj-Ali et al., 2018a).

Refer to caption
Figure 1. Stateful logic within a crossbar with parallelism across all rows by applying voltages across the bitlines.
Refer to caption
Figure 2. Memristive crossbar with row partitions, and the different types of partition-based computation: (a) serial, (b) parallel, and (c,d) semi-parallel. The dynamic section division is shown in dashed orange, inputs are blue, and outputs are green.

Memristive partitions (Gupta et al., 2018) accelerate stateful logic algorithms by enabling concurrent row or column operations using transistors that divide crossbar wordlines (bitlines) into independent partitions. Voltages are applied on the bitlines (wordlines), yet the transistors ensure isolation between the wordlines (bitlines) of distinct stateful-logic gates in the same row (column) to enable concurrent operation. For single-row arithmetic, partitions enable execution of multiple parallel gates per row, alleviating the single-gate per cycle constraint, while still executing in parallel across all rows (Gupta et al., 2018; Leitersdorf et al., 2021; Lu et al., 2021; Alam et al., 2022).

Since the proposition of partitions (Gupta et al., 2018), algorithmic works (Leitersdorf et al., 2021; Lu et al., 2021; Alam et al., 2022; Gupta et al., 2018) have exploited partitions to accelerate single-row multiplication by 11× (using 32 partitions) (Leitersdorf et al., 2021) [Footnote: To isolate the effect of partitions, this result compares MultPIM (Leitersdorf et al., 2021) to its optimized serial implementation. Note that algorithmic area (memristor footprint) and energy (total gate count) are increased by 1.4× and 2.1×, respectively. See Section 5.] and sorting by 14× (using 16 partitions) (Alam et al., 2022). MultPIM (Leitersdorf et al., 2021) exploited interesting properties that arise when dynamically dividing partitions to develop fast shifting (in constant time) and broadcasting (in logarithmic time) techniques. Preliminary analysis from previous work estimates low area overhead for the transistors, such as 3% for 32 partitions (Gupta et al., 2018). This suggests memristive partitions can, e.g., accelerate multiplication latency by 11× while only increasing crossbar area by 1.03×. Yet, the peripheral circuits and control required to support partitions were never previously discussed. The peripheral circuits relate to the decoders that apply voltages across bitlines and wordlines, while control refers to the controller messages sent to crossbars to convey operations. Without efficient designs of these, realizing the potential of partitions may be over-optimistic, leading to impractical designs.

This paper proposes efficient designs of partitions. Section 2 formalizes the algorithmic potential implied by previous works into serial, parallel and semi-parallel operations, and proposes efficient periphery fully enabling these operations through a novel technique of half-gates. Section 2 also identifies control overhead as an inherent challenge in this unlimited model, which is addressed in Section 3 (standard model) via intra-partition restrictions and in Section 4 (minimal model) via inter-partition patterns. Section 5 presents the trade-off between control overhead and performance, analyzing multiplication (Leitersdorf et al., 2021) as a case study. This paper contributes:

  1. Partition Models: Presents well-defined algorithmic models detailing the capabilities of memristive partitions, categorizing operations as serial, parallel, and semi-parallel.

  2. Periphery: Proposes efficient crossbar periphery for each model, based on a technique of half-gates, incurring even slightly less overhead than a crossbar without partitions.

  3. Control: Drastically reduces control-message length by restricting operation sets, in the standard and minimal models, with shared indices and pattern generators, while causing only minimal impact on partition performance.

  4. Results: Case study of parallel multiplication, showing a trade-off between control overhead and performance. Control overhead is reduced by 17× while latency is only increased by 1.3× (that is, a 9× latency improvement over a baseline with no partitions rather than the theoretical 11×).

2. Unlimited Design

Refer to caption
Figure 3. (a) The periphery for a baseline crossbar with no partitions, with each decoder unit consisting of analog multiplexers (per bitline, not shown in the figure) and CMOS decoders (Talati, 2018). (b) A naive impractical approach at utilizing the column decoder for the unlimited model, and (c) the proposed approach based on the half-gates technique.


We formalize the operations enabled by partitions into serial, parallel, or semi-parallel; the unlimited model supports all of the possibilities for each. Semi-parallel operations recently emerged (Leitersdorf et al., 2021), enabling efficient communication between partitions that significantly improves results over serial and parallel partition parallelism alone, such as a 4× latency improvement in multiplication (Leitersdorf et al., 2021; Lu et al., 2021). We then detail a naive approach to crossbar periphery for the unlimited model, which we replace with an efficient solution based on the novel half-gates technique. Finally, we identify control overhead as a critical inherent challenge in the unlimited model, which is resolved in Sections 3 and 4 by carefully reducing the operation set.

2.1. Model

Memristive partitions enable a unique parallelism that may be exploited for efficient techniques. Consider inserting k−1 transistors at fixed locations into each row of the n×n crossbar, as illustrated in Figure 2. The transistors dynamically isolate different parts of each row to enable concurrent execution, essentially dividing the crossbar partitions into sections (dashed orange) such that each section may perform a column operation. Initial works (Gupta et al., 2018; Lu et al., 2021; Alam et al., 2022) utilized partitions in a binary fashion: either each partition is a section (parallel), or the entire crossbar is one section (serial). A recent work (Leitersdorf et al., 2021) demonstrated the potential of semi-parallelism, significantly improving solutions that utilize only serial and parallel operations, e.g., a 4× improvement in latency for multiplication (Leitersdorf et al., 2021; Lu et al., 2021). We define these various parallelism forms:

  • Serial (Figure 2(a)): When all of the transistors are conducting, the crossbar is equivalent to one without partitions. Therefore, only a single gate may operate per cycle.

  • Parallel (Figure 2(b)): When none of the transistors are conducting, k gates may operate concurrently as part of a single operation, one gate within each partition.

  • Semi-Parallel (Figures 2(c,d)): When only some of the transistors are conducting, multiple gates may operate concurrently, between partitions. Essentially, an operation corresponds to a set of disjoint intervals of partitions that represent the underlying sections.
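The three forms above can be captured in a small model. The following sketch (all names are illustrative, not from the paper) classifies a section division, given as disjoint intervals of partitions, into serial, parallel, or semi-parallel:

```python
# Hypothetical sketch: modeling a partition operation as a division of the
# k partitions into sections (disjoint, contiguous intervals).

def classify_operation(sections, k):
    """Classify a section division over k partitions.

    `sections` is a sorted list of inclusive (start, end) partition
    intervals that together cover partitions 0..k-1.
    """
    # Validate: intervals are contiguous, disjoint, and cover 0..k-1.
    assert sections[0][0] == 0 and sections[-1][1] == k - 1
    for (s1, e1), (s2, e2) in zip(sections, sections[1:]):
        assert e1 + 1 == s2, "sections must be contiguous and disjoint"

    if len(sections) == 1:
        return "serial"        # all transistors conducting: one gate total
    if len(sections) == k:
        return "parallel"      # no transistors conducting: a gate per partition
    return "semi-parallel"     # only some transistors conducting

# Figure 2(a): the whole crossbar is one section
print(classify_operation([(0, 3)], 4))                          # serial
# Figure 2(b): every partition is its own section
print(classify_operation([(0, 0), (1, 1), (2, 2), (3, 3)], 4))  # parallel
# Figure 2(c)-style: sections of two partitions each
print(classify_operation([(0, 1), (2, 3)], 4))                  # semi-parallel
```
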

2.2. Periphery

We propose a low-overhead peripheral design that supports the unlimited model. We begin with a short background on the peripheral circuits in a crossbar without partitions, and then discuss a naive approach for partition periphery that is not practical. Conversely, we introduce a unique technique of half-gates, which enables practical partition periphery with very low overhead.

Traditional crossbar periphery applies voltages across bitlines or wordlines to induce vectored logic. Without loss of generality, we discuss only bitlines and assume a single two-input gate type (e.g., NOR) [Footnote: The proposed designs can be generalized to support additional types of gates (e.g., NAND, OR), including gates with more than two inputs (Gupta et al., 2018).]. The periphery for such a crossbar consists of a column decoder that receives indices InA, InB, and Out, and applies V_IN on bitlines InA/InB and V_OUT on bitline Out, as illustrated in Figure 3(a). The column decoder is composed of three decoder units, each of which receives a single index and outputs a fixed voltage at that index. Each decoder unit consists of a CMOS decoder and an analog multiplexer for each bitline: the CMOS decoder provides the selection for the analog multiplexers of each bitline (Talati, 2018; Reuben et al., 2017; Ben Hur and Kvatinsky, 2016).

The naive approach to supporting the unlimited model replicates the overall column-decoder structure to enable serial, parallel, and semi-parallel operations. Supporting serial operations requires a single column decoder across all bitlines. Supporting parallel operations requires a column decoder for every partition individually. For semi-parallel operations, decoders are needed for every possible configuration of the sections. Overall, the naive approach requires a stack of Ω(k^2) column decoders (for the unlimited model), which is clearly not practical, see Figure 3(b). Note that this is in addition to the signals that control the transistors. Other naive solutions will likely also suffer from this overhead.

Table 1. Opcode for an Individual Partition
Opcode | Operation               Opcode | Operation
000    | –                       100    | Gate(InA, ?) → ?
001    | ? → Out                 101    | Gate(InA, ?) → Out
010    | Gate(?, InB) → ?        110    | Gate(InA, InB) → ?
011    | Gate(?, InB) → Out      111    | Gate(InA, InB) → Out

Our proposed approach is based on half-gates: we eliminate the stack of column decoders and utilize only a single column decoder per partition, yet introduce an opcode for each partition, as shown in Figure 3(c). We describe the basic idea through the following example: to support a gate whose inputs are in partition p1 and whose output is in partition p2 (both in the same section), (1) the column decoder of p1 applies only the input voltages without applying the output voltages, and (2) the column decoder of p2 applies only the output voltages without applying the input voltages. Essentially, p1 trusts that a different partition will apply the output voltages, and p2 trusts that a different partition will apply the input voltages. While each gate on its own is not valid (a half-gate), their combination is valid. Table 1 details the various possible opcode states of each column decoder, where "?" represents not applying voltages for that part of the gate and "–" represents not applying voltages at all (for intermediate partitions). An example opcode setting is shown in Figure 4 for the operation from Figure 2(d). Note that the opcode decoding is simple: two bits are the enables for the input decoder units, and the other bit is the enable for the output decoder unit.
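As an illustration of Table 1 and the half-gates idea, the following sketch (function and variable names are hypothetical) assigns opcodes to the partitions of a section for a gate whose inputs reside in one partition and whose output resides in another; intermediate partitions receive the idle opcode 000:

```python
# Illustrative sketch of the half-gates opcode assignment (Table 1): each
# partition's 3-bit opcode enables its InA, InB, and Out decoder units.

def assign_opcodes(k, input_part, output_part):
    """Opcodes for one gate whose two inputs lie in partition `input_part`
    and whose output lies in `output_part` (both in the same section)."""
    opcodes = ["000"] * k  # '-': intermediate partitions apply no voltages
    if input_part == output_part:
        opcodes[input_part] = "111"   # full gate: Gate(InA, InB) -> Out
    else:
        # Half-gates: inputs-only in one partition, output-only in the other.
        opcodes[input_part] = "110"   # Gate(InA, InB) -> ?
        opcodes[output_part] = "001"  # ? -> Out
    return opcodes

# Inputs in partition 0, output in partition 2 (partitions 0-2 one section):
print(assign_opcodes(4, 0, 2))  # ['110', '000', '001', '000']
```
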

In terms of the decoder structure, the proposed design is nearly identical to the baseline crossbar. In essence, the proposed decoder is identical to concatenating multiple baseline n/k-column decoders horizontally (thus the width remains at O(k · (n/k)) = O(n)). The analog multiplexers within the decoders remain completely identical, and the only difference is the CMOS decoders that generate the selects for the analog multiplexers. Rather than a single CMOS n-decoder per decoder unit, the proposed solution requires k CMOS n/k-decoders. In terms of CMOS gates, the proposed solution requires fewer gates than the baseline crossbar as log2(n/k) < log2(n). This saving in gates comes at the cost of increased control complexity.

2.3. Control

The proposed periphery decodes a relatively long message sent from the controller that details the operation. We demonstrate that this length is nearly optimal for the unlimited model. The proposed peripheral decoding requires 3k·log2(n/k) + 3k + (k−1) bits to encode an operation that may occur in a single cycle (indices, opcodes, and transistor selects, respectively), for k evenly-spaced partitions amongst n bitlines. For k = 32 and n = 1024, this requires 607 bits, compared to the 30 bits required in a crossbar without partitions. This 20× difference is concerning as the communication architecture between the controller and the crossbars must support 20× larger messages, incurring massive area and energy overhead. We prove that this message length is nearly optimal (and not a result of poor decoding) through a combinatorial analysis:

  • Serial: There are C(n, 2) · (n − 2) different possible serial operations (choices for InA, InB, and Out).

  • Parallel: There are [C(n/k, 2) · (n/k − 2)]^k possible parallel operations (each partition performs a different gate).

  • Semi-Parallel: We do not count semi-parallel operations for the unlimited model. This is valid as we seek a lower bound.

Refer to caption
Figure 4. Demonstration of the opcodes for each partition in the half-gate technique for the operation from Figure 2(d).

We find over 2^443 different operations, thus log2(2^443) = 443 is a lower bound on the message length. Therefore, any implementation of the unlimited model will require a length of at least 443 bits (compared to 30 without partitions). Note that 443 is not very far from 607, and that attaining this lower bound likely requires complex decoding. This challenge was not considered by previous works (Gupta et al., 2018), and is addressed in the next sections.
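The message-length expression and the combinatorial lower bound above can be checked numerically; the following sketch reproduces the paper's figures for k = 32 and n = 1024 (variable names are illustrative):

```python
from math import comb, log2

# Numeric check of the unlimited-model message length and lower bound.
n, k = 1024, 32

# Proposed decoding: three log2(n/k)-bit indices and a 3-bit opcode per
# partition, plus k-1 transistor selects.
msg_bits = 3 * k * int(log2(n // k)) + 3 * k + (k - 1)
print(msg_bits)  # 607

# Baseline crossbar without partitions: three log2(n)-bit indices.
print(3 * int(log2(n)))  # 30

# Lower bound: count serial plus parallel operations (semi-parallel omitted).
serial = comb(n, 2) * (n - 2)
parallel = (comb(n // k, 2) * (n // k - 2)) ** k
print(int(log2(serial + parallel)))  # 443
```
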

3. Standard Design

The standard model addresses the control-overhead challenge of the unlimited model. We seek to significantly reduce the control-message length while having low impact on performance. We still support serial, parallel, and semi-parallel operations, yet we eliminate the less-common operations, primarily by restricting the allowed intra-partition patterns.

3.1. Model

We set the following criteria in addition to Section 2.1:

  • Identical Indices: The input and output intra-partition indices of the gates must be identical. Figure 2(d) is supported as the indices within each partition are identical (in the example, the inputs are always the first/second columns, and the output is always the fourth column). That is, for evenly distributed partitions, the indices modulo n/k must be identical.

  • No Split-Input: Gate inputs InA and InB must belong to the same partition, for each gate [Footnote: This affects serial operations as well. Serial algorithms may overcome this limitation by copying one of the inputs, or by adjusting the mapping algorithms (Ben-Hur et al., 2020).].

  • Uniform Direction: The direction ("inputs left of outputs", or "outputs left of inputs") is identical in all concurrent gates for semi-parallel operations.

Note that all of the examples in Figure 2 are supported, as well as the general-purpose partition techniques from MultPIM (Leitersdorf et al., 2021) [Footnote: While the techniques are supported, the original MultPIM algorithm slightly violates the Identical Indices criterion with the first/last partitions. Section 5 replaces those illegal operations with supported alternatives as part of the evaluation.].

3.2. Periphery

The peripheral modifications from the unlimited model to the standard model are two-fold: the intra-partition indices are shared (following the Identical Indices criteria), and the opcodes are automatically generated from the section division (following the No Split-Input and Uniform Direction criteria). The periphery is slightly reduced compared to that of a baseline crossbar without partitions, suggesting low overhead.

3.2.1. Intra-Partition Indices

The modification from unlimited is rather simple: the indices provided to each decoder are identical (see Figure 5). Note that this enables an additional optimization: shared CMOS decoders. Recall that each of the column decoders is composed of input/output decoder units, and that each such unit is composed of a CMOS decoder and analog multiplexers (for each bitline) – the CMOS decoder provides the select lines for the analog multiplexers (Talati, 2018). Given that the indices are now shared across partitions, the CMOS decoders may be shared while the analog multiplexers remain unchanged (compared to a baseline crossbar). Therefore, this has the potential to further reduce the proposed peripheral overhead compared to a baseline crossbar.

3.2.2. Opcode Generation

The No Split-Input and Uniform Direction criteria, combined with a unique observation, enable opcode derivation from the section division and a single enable bit per partition. First, we note that there may exist multiple valid section divisions for a single semi-parallel operation; for example, any semi-parallel operation that does not utilize the first partition may set the first transistor to either conducting or non-conducting. We restrict this degree of freedom by defining a tight section division as one in which no section can be split further; e.g., the first transistor would be non-conducting in the previous example. In a tight section division, the first and last partitions of each section always contain either an input or an output, and the middle partitions (if they exist) are always unused. The only exception is sections that do not contain any gate. Therefore, given only the section division, the direction of the operation, and an indication of whether each section contains a gate, the opcodes of all partitions may be generated.

Opcode generation is achieved following this observation, as illustrated in Figure 5. Given the transistor selects (which are chosen to define a tight section division), an enable bit for each partition, and a general direction bit ("inputs left of outputs", or "outputs left of inputs"), the opcode generator computes the opcodes of all partitions (which are input to the decoders). For the direction "inputs left of outputs", the input bits of an opcode are logical one if the transistor to the left of that partition is selected, and the output bit is logical one if the transistor to the right is selected (vice versa for "outputs left of inputs"), with the opcode ANDed with the enable of that partition. Therefore, the opcode for a partition may be derived from the selects of the transistors to its left and right, the enable of that partition, and the general direction. Such decoding may be achieved with two 2:1 multiplexers per partition (only O(k) gates total, negligible compared to the O(n·log k) gates elsewhere).
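Under assumed bit conventions (select = 1 meaning the transistor is non-conducting, i.e., a section boundary, with the crossbar edges treated as permanent boundaries), the opcode-generation rule above can be sketched as:

```python
# Sketch of standard-model opcode generation; names and conventions are
# assumptions for illustration, not the paper's exact circuit.

def generate_opcodes(selects, enables, inputs_left):
    """Opcodes for all partitions from k-1 transistor selects (tight section
    division), per-partition enables, and the general direction bit."""
    k = len(enables)
    # Boundary indicator to the left/right of each partition; the crossbar
    # edges always act as section boundaries.
    bounds = [1] + list(selects) + [1]
    opcodes = []
    for p in range(k):
        left, right = bounds[p], bounds[p + 1]
        if inputs_left:
            in_bit, out_bit = left, right   # inputs at section start
        else:
            in_bit, out_bit = right, left   # outputs at section start
        e = enables[p]
        # Both input bits move together (No Split-Input); AND with enable.
        opcodes.append(f"{in_bit & e}{in_bit & e}{out_bit & e}")
    return opcodes

# Two sections of two partitions each (k = 4), inputs left of outputs:
print(generate_opcodes([0, 1, 0], [1, 1, 1, 1], True))
# ['110', '001', '110', '001']
```

Note that a partition forming a section on its own (boundaries on both sides) correctly receives the full-gate opcode 111, and a disabled partition receives 000, matching Table 1.
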

Refer to caption
Figure 5. The periphery of the standard model, sharing the indices across the decoders and generating opcodes (each orange box consists of two 2:1 multiplexers, see Section 3.2).

3.3. Control

The message length of the unlimited model is far reduced in the standard model by index sharing and opcode generation. Standard decoding uses 3·log2(n/k) + (2k − 1) + 1 bits (indices; enables and transistor selects; and direction, respectively), nearly eliminating the bottleneck of unlimited (3k·log2(n/k)) and mildly reducing the rest (4k − 1). The message length is reduced from 607 to 79 bits, a 7.7× improvement (for k = 32 and n = 1024). From a combinatorial analysis, we find at least 2 · Σ_{m=1}^{k} C(k−1, m−1) · C(n/k, 2) · (n/k − 2) supported operations, thus a 46-bit lower bound, not very far from 79 bits.
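A quick numeric check of the expression and lower bound above, for k = 32 and n = 1024 (variable names are illustrative):

```python
from math import ceil, comb, log2

# Standard-model message length: shared indices, enables plus transistor
# selects, and one direction bit.
n, k = 1024, 32
msg_bits = 3 * int(log2(n // k)) + (2 * k - 1) + 1
print(msg_bits)  # 79

# Lower bound: sum over the number of sections m of the boundary placements,
# times the gates expressible per section, times the two directions.
ops = 2 * sum(comb(k - 1, m - 1) for m in range(1, k + 1)) \
        * comb(n // k, 2) * (n // k - 2)
print(ceil(log2(ops)))  # 46
```
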

4. Minimal Design

The control overhead in Section 3 remains relatively high due to the transistor selects and enables: any section division is supported. The minimal design addresses that concern by requiring that the section division follows several carefully-chosen inter-partition patterns.

4.1. Model

The standard model supports any division of partitions into sections, but in practice, many divisions are typically not used. For example, Figure 2(d) is rarely used – e.g., not at all in MultPIM. We identify two criteria typically followed:

  • Uniform Partition-Distance: The partition distance of a gate is the distance between its input and output partitions (e.g., (1,1) for Figure 2(c); (0,1,0) for Figure 2(d)). Gates performed concurrently should all have identical partition distance.

  • Periodic: The gates must repeat periodically, e.g., every T partitions (for T greater than the partition distance).

Examples (a), (b), and (c) from Figure 2 are supported, as well as the partition techniques from MultPIM (Leitersdorf et al., 2021) [Footnote: While the techniques are supported, the MultPIM algorithm slightly violates the Periodic criterion; see Section 5 for the alternative.]. Note that typical usage of partitions already follows the above restrictions, suggesting that the minimal model is general-purpose.

4.2. Periphery

The decoder for the minimal design replaces the opcode generator from standard, following these key observations:

  • Input opcodes can be derived from a Range Generator, outputting logical one every period T, from p_start to p_end. This may be accomplished with two shifters (for p_start and p_end) and a decoder (for T).

  • Output opcodes can be derived by shifting the input opcodes by the partition distance according to the global direction (up to a k-position shift in either direction).

  • Transistor selects can be derived from the input and output opcodes. For example, if the global direction is "inputs left of outputs", then a separation transistor is non-conducting if there is an output to its left or an input to its right.

The overall periphery is similar to that of the standard model, while replacing the opcode generator with the above shifters and decoder. Note that the periphery overhead here is relatively low as the shifters and decoder operate on width k (rather than n).

4.3. Control

The moderate-length control message of the standard model is drastically reduced in the minimal model, attaining 3·log2(n/k) + 3·log2(k) + log2(k) + 1 bits (intra-partition indices, range indices, partition distance, and global direction, respectively). For n = 1024 and k = 32, this improves from 607 bits (unlimited) and 79 bits (standard) to only 36 bits. Interestingly, this small message is still capable of supporting most of the operations used in algorithms, see Section 5. We find a lower bound of at least 25 bits from all non-input-split serial operations being supported.
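Collecting the message-length expressions of the three models and the baseline, a quick check for n = 1024 and k = 32 (variable names are illustrative):

```python
from math import log2

# Control-message lengths of the three partition models and the baseline
# crossbar without partitions.
n, k = 1024, 32
unlimited = 3 * k * int(log2(n // k)) + 3 * k + (k - 1)
standard = 3 * int(log2(n // k)) + (2 * k - 1) + 1
minimal = 3 * int(log2(n // k)) + 3 * int(log2(k)) + int(log2(k)) + 1
baseline = 3 * int(log2(n))
print(unlimited, standard, minimal, baseline)  # 607 79 36 30
```
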

5. Evaluation

We analyze the unlimited, standard, and minimal models, presenting a trade-off between overhead and performance. We evaluate the effect of the standard and minimal models on performance by examining their effect on MultPIM (Leitersdorf et al., 2021) as a case study. While the proposed mechanisms can be generalized to three-input gates (e.g., Minority3), we assume up to two inputs for simplicity and thus consider the NOT/NOR implementation of MultPIM. While the partition techniques proposed in MultPIM are supported by the standard and minimal models, MultPIM includes various operations that are not supported. Those operations are replaced with alternatives that are compatible, yet require additional latency; the details are provided as part of the modified cycle-accurate simulations [Footnote: Available at https://github.com/oleitersdorf/PartitionPIM.].

We demonstrate a 9× latency improvement for multiplication with the proposed minimal model compared to an optimized serial algorithm, requiring only approximately 1.4× area and 1.2× control overhead. Figure 6 shows the results for 32-bit multiplication, comparing latency, control overhead, and area.

Figure 6. Comparison of (a) latency, (b) control overhead, and (c) algorithmic area for 32-bit multiplication under the proposed partition models and a serial baseline.

5.1. Latency

The standard and minimal implementations of MultPIM incur, relative to unlimited, a slight latency increase of 1.23× and 1.32×, respectively, as seen in Figure 6(a). Yet, this latency is still 9.2× and 8.6× faster, respectively, compared to an optimized serial multiplier.

5.2. Control

The control overhead results follow from the expressions derived in the previous sections, see Figure 6(b). Assuming k = 32 and n = 1024 with NOT/NOR gate types, the control-message length for unlimited is 607 bits, 79 bits for standard, and 36 bits for minimal, compared to 30 bits for a baseline crossbar without partitions. That is, through the novel techniques (shared indices and pattern generators) and restrictions, the minimal model achieves a low control overhead of only 1.2×, compared to 20× for unlimited.

5.3. Area

Area overhead is composed of physical overhead (mentioned in the previous sections), and algorithmic overhead that arises from requiring additional intermediate memristors within the crossbar.

5.3.1. Physical Overhead

The half-gates technique, utilized in all three models, achieves low peripheral overhead compared to a baseline crossbar (without partitions). In fact, the peripheral complexity of the proposed solutions is slightly lower than that of a baseline crossbar as the decoder widths are smaller, see Section 2.2. Furthermore, the crossbar and the analog multiplexers remain completely unchanged. While additional logic is required in the standard and minimal solutions for the decoders and shifters, that overhead is relatively low considering that they operate on width k and not n. Exact results depend heavily on the specific crossbar structure (e.g., 1R, 1S1R, 1T1R), requiring a full physical design of a crossbar array and periphery. Regardless, considering that the peripheral complexity is slightly decreased compared to the baseline, we conclude that the peripheral area overhead is negligible compared to the algorithmic area overhead (and the potential 20× control overhead).

5.3.2. Algorithmic Overhead

Algorithmic area overhead, shown in Figure 6(c), is based on the number of memristors required. All parallel approaches have higher area overhead than the serial approach as utilizing parallel operations requires intermediate memristors per partition rather than per crossbar. The minor differences in area overhead between the unlimited, standard, and minimal approaches originate from the alternatives to the unsupported operations.

5.4. Energy

Energy consumption for stateful logic is dominated by the memristor switching energy (Talati, 2018). Therefore, energy is approximated by the total gate count (Ronen et al., 2021). For MultPIM, the energy overhead is 2.1× from serial to parallel: while the latency is improved, more gates are executed due to the partition parallelism.

6. Conclusion

The algorithmic potential of emerging memristive partitions is highly unique, and may advance digital memristive processing-in-memory by overcoming an inherent trade-off that leads to slow arithmetic operations. Nevertheless, the physical design of partitions has never been discussed, and naive designs for periphery and control may incur massive overhead that leads to vast impracticality. This paper proposes efficient periphery and control through three potential designs with varying flexibility: unlimited, standard, and minimal. We demonstrate efficient periphery for all three models by utilizing a novel technique of half-gates, yet identify control overhead as an inherent concern in the unlimited model. We drastically reduce this control overhead by carefully minimizing the operation set with negligible performance impact, utilizing techniques such as shared indices and pattern generators. Through a case study of multiplication, we conclude that the proposed practical designs of partitions, coupled with the previous algorithmic works, suggest that partitions will be a crucial element in the integration of memristive processing-in-memory into computing devices.

Acknowledgment

This work was supported in part by the European Research Council through the European Union's Horizon 2020 Research and Innovation Programme under Grant 757259, and in part by the Israel Science Foundation under Grant 1514/17.

References

  • Alam et al. (2022) M. Alam et al. 2022. Sorting in Memristive Memory. JETC (2022).
  • Balasubramonian and Grot (2016) R. Balasubramonian and B. Grot. 2016. Near-Data Processing. IEEE Micro (2016).
  • Ben-Hur et al. (2020) R. Ben-Hur et al. 2020. SIMPLER MAGIC: Synthesis and Mapping of In-Memory Logic Executed in a Single Row to Improve Throughput. IEEE TCAD (2020).
  • Ben Hur and Kvatinsky (2016) R. Ben Hur and S. Kvatinsky. 2016. Memristive memory processing unit (MPU) controller for in-memory processing. In IEEE ICSEE.
  • Borghetti et al. (2010) J. Borghetti et al. 2010. ‘Memristive’ switches enable ‘stateful’ logic operations via material implication. Nature (2010).
  • Chua (1971) L. Chua. 1971. Memristor-The missing circuit element. IEEE TCT (1971).
  • Eliahu et al. (2020) A. Eliahu et al. 2020. abstractPIM: Bridging the Gap Between Processing-In-Memory Technology and Instruction Set Architecture. In IFIP/IEEE VLSI-SOC.
  • Gupta et al. (2018) S. Gupta et al. 2018. FELIX: Fast and Energy-Efficient Logic in Memory. In ICCAD.
  • Haj-Ali et al. (2018a) A. Haj-Ali et al. 2018a. Efficient Algorithms for In-Memory Fixed Point Multiplication Using MAGIC. In IEEE ISCAS.
  • Haj-Ali et al. (2018b) A. Haj-Ali et al. 2018b. IMAGING: In-Memory AlGorithms for Image processiNG. IEEE TCAS-I (2018).
  • Imani et al. (2019) M. Imani et al. 2019. FloatPIM: In-Memory Acceleration of Deep Neural Network Training with High Precision. In ISCA.
  • Kvatinsky et al. (2014) S. Kvatinsky et al. 2014. MAGIC—Memristor-Aided Logic. IEEE TCAS-II (2014).
  • Leitersdorf et al. (2021) O. Leitersdorf et al. 2021. Making Memristive Processing-in-Memory Reliable. In IEEE ICECS.
  • Leitersdorf et al. (2021) O. Leitersdorf et al. 2021. MultPIM: Fast Stateful Multiplication for Processing-in-Memory. IEEE TCAS-II (2021).
  • Lu et al. (2021) Z. Lu et al. 2021. RIME: A Scalable and Energy-Efficient Processing-In-Memory Architecture for Floating-Point Operations. In ASP-DAC.
  • Pedram et al. (2017) A. Pedram et al. 2017. Dark Memory and Accelerator-Rich System Optimization in the Dark Silicon Era. IEEE Design & Test (2017).
  • Reuben et al. (2017) J. Reuben et al. 2017. Memristive logic: A framework for evaluation and comparison. In PATMOS.
  • Ronen et al. (2021) R. Ronen et al. 2021. The Bitlet Model: A Parameterized Analytical Model to Compare PIM and CPU Systems. JETC (2021).
  • Talati (2018) N. Talati. 2018. Design of a Memory Controller to Support PIM Operations Using RRAM. Master’s thesis. Viterbi Faculty of Electrical Engineering, Technion – Israel Institute of Technology, Haifa, Israel.
  • Talati et al. (2020) N. Talati et al. 2020. mMPU—A Real Processing-in-Memory Architecture to Combat the von Neumann Bottleneck. Springer Singapore, Singapore, 191–213.