VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs
Abstract
Deep Learning (DL) acceleration support in CPUs has recently gained a lot of traction, with several companies (Arm, Intel, IBM) announcing products with specialized matrix engines accessible via GEMM instructions. CPUs are pervasive and need to handle diverse requirements across DL workloads running in edge/HPC/cloud platforms. Therefore, as DL workloads embrace sparsity to reduce the computations and memory size of models, it is also imperative for CPUs to add support for sparsity to avoid under-utilization of the dense matrix engine and inefficient usage of the caches and registers. This work presents VEGETA, a set of ISA and microarchitecture extensions over dense matrix engines to support flexible structured sparsity for CPUs, enabling programmable support for diverse DL models with varying degrees of sparsity. Compared to the state-of-the-art (SOTA) dense matrix engine in CPUs, a VEGETA engine provides 1.09×, 2.20×, 3.74×, and 3.28× speed-ups when running 4:4 (dense), 2:4, 1:4, and unstructured (95%) sparse DNN layers.
I Introduction
Deep learning (DL) is used in various domains including computer vision, recommendation systems, and natural language processing [53, 40, 30]. However, training and inference for these DL models require a large number of computations and a large amount of memory. To accelerate them, high-end GPUs (such as those with NVIDIA Tensor Cores [2]), domain-specific accelerators (such as Google's TPUs [30]), and high-performance CPUs [45, 26], along with optimized libraries and frameworks [5, 11, 14, 7], have been introduced.
Although CPUs have received relatively less attention in the DL era compared with GPUs and domain-specific accelerators, CPUs are widely used at datacenters thanks to their flexibility [53]. In fact, there are cases where CPUs are more suitable than GPUs/accelerators as the primary processor for processing deep neural networks (DNNs). For example, edge devices with tight area/power budgets [46] prefer CPUs since GPUs/accelerators cannot be easily deployed in addition to CPUs. Moreover, a large number of server-class CPUs are (and will be) used in datacenters to process High Performance Computing (HPC) workloads, and adding/maintaining additional accelerators just for DL workloads would increase complexity and cost [18]. Furthermore, offloading a modest-size task to an accelerator or a GPU might not be the best option if the offloading overhead is relatively substantial [48]. Finally, large DNNs are hard to run on GPUs or accelerators due to their smaller memory capacity compared with that of CPUs [32]. Considering these reasons, it is not surprising that many CPU vendors including Arm [9], Intel [26], and IBM [51] have started deploying dense matrix engines along with conventional scalar and vector engines to accelerate GEMM (i.e., general matrix multiplication), which is at the core of DL models [1]. Since these engines only focus on dense matrix computations, a challenge arises as sparsity becomes pervasive in DL. Thus, it is time to ponder how to utilize these engines efficiently for sparse DNNs on CPUs, similar to how we have evolved scalar/vector engines over many years.
Recent research has focused on adding hardware support in DNN accelerators to handle sparsity as it can be used to improve power and performance by skipping or gating ineffectual computations (since anything multiplied by zero is zero) and to reduce memory capacity and bandwidth requirements by saving only non-zero values and their indices in a compressed manner [23, 43, 42, 13]. CPUs, however, bring in the following additional challenges. (i) Being general-purpose, CPUs are particularly sensitive to the amount of hardware devoted to improving a subset of workloads. (ii) At the same time, new functional units in CPUs need to be able to support a wide variety of use cases. (iii) CPUs need programmability support to handle sparsity unlike offload accelerators with custom DMA engines for sparse accesses.
In this paper, we introduce VEGETA, Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs. To expose VEGETA to programmers, we extend the ISA with new instructions and registers, which enables accelerating sparse DNNs by removing redundant computations and reducing memory traffic. We also introduce a new light-weight matrix engine that enhances a systolic array and register file to support flexible N:M structured sparsity. Further, we demonstrate how VEGETA can accelerate both structured and unstructured sparse-dense GEMMs (SPMMs) via software transformations of the sparse matrix. This is the first work, to the best of our knowledge, to demonstrate sparsity support in a CPU matrix engine. We believe this work is extremely timely given growing interest in industry products for sparsity, such as the Sparse Tensor Core (STC) in NVIDIA's Ampere GPU [39] and the sparsity-aware NPU in a Samsung Mobile AP [28].
Summary of Contributions:
- We introduce VEGETA ISA extensions that include new instructions and registers for supporting sparse tile operations along with software optimizations.
- We propose new architectural extensions to a systolic-array-based matrix engine to support flexible N:M structured sparsity in terms of sparsity pattern and granularity.
- We explore different VEGETA engine design choices to understand the performance and area trade-offs.
- Across a suite of GEMM/SPMM kernels, we observe 1.09×, 2.20×, 3.74×, and 3.28× speed-ups compared to a SOTA dense array when running 4:4 (dense), 2:4, 1:4, and unstructured (95%) sparse DNN layers, respectively.
II Background
II-A GEMM and DNN Layers
A DNN comprises multiple layers of potentially different types. Two of the most time-consuming layer types are fully connected layers, which are widely used in Multi-Layer Perceptrons/Transformers, and convolutional layers [12]. The underlying building block of these popular and compute-intensive DNN layers is GEMM. Fully connected layers are composed of GEMMs naturally, while modern high-performance convolutional layers are implemented with GEMMs as their innermost operation [17].
II-B Systolic Array (SA)
In the last few years, systolic arrays [33] have become a prime design choice for accelerating dense GEMM, especially in the context of DNNs, due to their simple and efficient architecture. The most well-known commercial systolic array architecture is in Google's TPU [30]. A systolic array is a two-dimensional array composed of highly scalable, pipelined, and homogeneous processing elements (PEs). Each PE is responsible for a Multiply-and-Accumulate (MAC) operation and for forwarding operands to the neighboring PEs. An element of one matrix is pre-loaded into the PE (often called the stationary element [12]). The elements of the remaining matrices are streamed to the left and top edges of the systolic array. Then, they are propagated to the PEs using the horizontal and vertical links between neighboring PEs. This maximizes data reuse between PEs, thus minimizing the number of reads and writes between the array and external buffers.
II-C Sparsity in DNNs


Even though the number of computations required for executing a DNN is huge, there is an opportunity to reduce that significantly by leveraging sparsity in DNNs. We describe different dimensions of sparsity in DNN inference.
Sparsity source. Weights and input activations are the main sources of sparsity in DNNs. Weight (static) sparsity is derived by pruning some edges in a DNN with a small sacrifice in accuracy [20, 19]. This leads to zeros in the weight matrix. Input (dynamic) sparsity is usually induced by a popular activation function, Rectified Linear Unit (ReLU), which clips negative inputs to zero, leading to a significant fraction of zero-valued input activations for the next layer.
Sparsity degree. Sparsity degree is the fraction of zeros in a given matrix. The degree of weight sparsity can often be tuned during the pruning process. However, the degree of input sparsity is usually non-deterministic and unpredictable since the actual values of input activations are determined at runtime.
Sparsity pattern. The non-zeros for a DNN may have a pattern, such as certain channels of weights being all zero (through channel pruning) or each block of M elements having at most N non-zeros (often called N:M sparsity [52]). "Unstructured sparsity" indicates the lack of a pattern.
Sparsity granularity. Sparsity patterns may exist at different granularities. For example, "network-wise N:M sparsity" indicates all layers in a network have the same N:M ratio, while "layer-wise N:M sparsity" means different layers may have different N:M ratios. During execution, usually a layer is decomposed into 2D tiles; each tile is composed of rows of vectors. Similar to the network granularity explained above, sparsity patterns could exist at the tile or row granularity. In Figure 1, we compare matrices with different sparsity patterns/granularities, with a tile size of 8×8. Previous HW support focused on either network-wise [39, 36, 56] or layer-wise N:M sparsity [37], while we target row-wise N:M sparsity to provide broader flexibility as well as cover some unstructured sparsity, as shown in Table I.
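As a concrete check of these definitions, the sketch below (our own illustrative code, not part of VEGETA) tests whether a row satisfies a given N:M pattern, i.e., whether every consecutive block of M elements holds at most N non-zeros:

```cpp
#include <cstdio>
#include <vector>

// Returns true if every consecutive block of M elements in `row`
// contains at most N non-zero values (i.e., the row is N:M sparse).
bool is_nm_sparse(const std::vector<float>& row, int n, int m) {
    for (size_t base = 0; base + m <= row.size(); base += m) {
        int nonzeros = 0;
        for (int i = 0; i < m; ++i)
            if (row[base + i] != 0.0f) ++nonzeros;
        if (nonzeros > n) return false;
    }
    return true;
}

int main() {
    // One 8-element row: each block of 4 has at most 2 non-zeros -> 2:4 sparse.
    std::vector<float> row = {0.0f, 1.5f, 0.0f, -2.0f, 3.0f, 0.0f, 0.0f, 0.25f};
    std::printf("1:4? %d  2:4? %d  4:4? %d\n",
                is_nm_sparse(row, 1, 4), is_nm_sparse(row, 2, 4),
                is_nm_sparse(row, 4, 4));
}
```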
III Motivation and VEGETA Overview
III-A Vector vs. Matrix Engine
Single Instruction Multiple Data (SIMD) and/or short vector execution have been supported in mainstream CPUs for several generations. Due to the smaller granularity of vectors compared to matrices, the industry has started integrating matrix engines inside CPUs [26, 9, 24]. A matrix engine can provide power-efficient high compute throughput via simple control logic and significant internal data reuse. For example, in Intel's upcoming Sapphire Rapids processors, the peak matrix compute capability per cycle of each core is 8× the vector capability [27]. In Figure 3, we show effective compute throughputs of sparse/dense matrix/vector engines on a convolutional layer with different densities, derived from a roofline model. We assume 64 and 512 GFLOPS for the vector and matrix engines, respectively, with a memory bandwidth of 94 GB/s [25]. For the 100% dense case, the dense matrix (vector) and sparse matrix (vector) engines achieve the same compute throughput since no computation can be skipped. We observe that sparse engines outperform dense engines significantly by skipping ineffectual computations, especially when density is low. Also, there is a significant gap in compute throughput between vector and matrix engines. Moreover, due to the smaller granularity of vector instructions, the same GEMM kernel requires many more instructions to be executed when using vector engines, contributing to the runtime gap shown in Figure 4 (we estimated these using a cycle-accurate simulator, MacSim [31]). When memory bound, i.e., at extremely low density and thus arithmetic intensity, vector compute throughput is sufficient, so a sparse vector engine performs similarly to a sparse matrix engine.
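A minimal sketch of such a roofline comparison, under our own simplifying assumptions (the stated 64/512 GFLOPS peaks and 94 GB/s; a hypothetical arithmetic intensity; a sparsity-aware engine's compute roof scaled by 1/density; memory traffic assumed unchanged):

```cpp
#include <algorithm>
#include <cstdio>

// Effective (dense-equivalent) GFLOP/s from a simple roofline:
// min(compute roof, arithmetic intensity * bandwidth).
// A sparsity-aware engine skips the zero-valued MACs, so its dense-equivalent
// compute roof scales by 1/density; memory traffic is assumed unchanged here.
double effective_gflops(double peak_gflops, double bw_gbs,
                        double arith_intensity, double density,
                        bool sparsity_aware) {
    double compute_roof = sparsity_aware ? peak_gflops / density : peak_gflops;
    double memory_roof  = arith_intensity * bw_gbs;
    return std::min(compute_roof, memory_roof);
}

int main() {
    const double vec = 64.0, mat = 512.0, bw = 94.0;  // GFLOPS, GFLOPS, GB/s
    const double ai  = 4.0;  // hypothetical FLOPs per byte for one conv layer
    for (double density : {1.0, 0.5, 0.25, 0.05}) {
        std::printf("density %.2f: dense-mat %.0f, sparse-mat %.0f, "
                    "dense-vec %.0f, sparse-vec %.0f GFLOP/s\n",
                    density,
                    effective_gflops(mat, bw, ai, density, false),
                    effective_gflops(mat, bw, ai, density, true),
                    effective_gflops(vec, bw, ai, density, false),
                    effective_gflops(vec, bw, ai, density, true));
    }
}
```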


III-B Structured vs. Unstructured Sparsity
CPUs are general-purpose processors. Ideally, a CPU should be able to support any sparsity pattern, including unstructured sparsity. However, this introduces two practical challenges.
Challenge 1: Programmability. GEMM implementations typically partition the matrices into tiles and iterate over the tiles to optimize data reuse in caches, facilitate parallelization, etc. Innermost GEMM kernels are often optimized for a predetermined tile size, using instructions operating on fixed size tiles [17]. The irregular nature of unstructured sparsity means we cannot know tile sizes a priori, and further, they may all be different. In the context of how software is written, this makes it tricky to define an easy-to-use ISA.

Challenge 2: Implementation overheads. There are trade-offs between different options for the register file (RF) and systolic array (SA) when supporting sparsity. The RF feeding the SA could hold a dense (i.e., conventional, with zeros) or sparse/compressed representation of each tile; if sparse, we need indexing logic and metadata to match each non-zero to the appropriate values from the other matrix. Also, the SA could be comprised of conventional PEs (i.e., dense) or be enhanced to be sparsity-aware and skip zeros, at the cost of additional interconnects inside each PE [43]. Naturally, all these structures and interconnects add overhead.
To address these challenges, we make the following design decisions: (i) We limit our scope to sparsity support for DL workloads, where the typical sparsity degree is up to 95%. Further, we add HW support for flexible N:M structured sparsity, leveraging insights from a recent work [52] which has shown that adopting layer-wise N:M sparsity achieves better accuracy compared to network-wise. We also show how the HW can support unstructured sparsity by transforming the target matrix using row-wise N:M sparsity. (ii) We add sparsity support in both the RF (Section IV) and SA (Section V) to achieve efficient utilization of the register storage and MACs.
III-C HW Support for Flexible N:M Structured Sparsity
Figure 5 shows the under-utilization challenges for a conventional WS systolic array when used with sparse weights. A WS systolic array keeps the weight values stationary over time while inputs and outputs are streamed in a skewed manner. If sparse weights are used with 2:4/1:4 structured sparsity, 50%/75% of the PEs are mapped with zero weights, which causes useless computation. This work proposes an enhanced systolic-array-based matrix engine which maps only non-zero weight values and distributes/reduces the correct input/output elements at the right time, leveraging the strengths of a systolic array in the presence of some irregularity in the inputs. This also requires logic in a PE to pick the right input elements for MACs. The indexing logic and the input distribution and output reduction logic to support flexible N:M structured sparsity (i.e., the purple boxes in Figure 5) are presented in Section V. In terms of sparsity granularity, we support not only network-, layer-, and tile-wise, but also row-wise N:M sparsity. In Table I, we compare the supported sparsity granularity of our design and previous works that support N:M sparsity.
III-D Transforming Unstructured to Row-Wise N:M Sparsity
While native support for unstructured sparsity can provide higher accuracy with extreme sparsity, the area overhead to support it, in the form of a highly reconfigurable network-on-chip (NoC) [43] and a sizable sparse controller [23, 16], is only justifiable in standalone accelerators [6, 44], not within CPUs. However, given an unstructured sparse tile, one can derive a row-wise N:M sparse tile that covers all non-zeros in the given sparse tile by selecting an appropriate N:M pattern for each row. For example, assuming 1:4, 2:4, and 4:4 are the available sparsity patterns, one can analyze each row of the target unstructured tile to find the sparsest N:4 pattern that covers all non-zeros in the row. Then, each row can be compressed using the corresponding N:4 sparsity. For example, the first and the second rows of Figure 1 (a) would be compressed with 2:4 while the third and the fourth would be compressed with 1:4 to guarantee that no non-zero value is lost. This transformation does not cause any accuracy drop since it is lossless, meaning that all non-zeros in the original unstructured sparse matrix still exist in the corresponding structured sparse matrix. Figure 1 (c) is derived using this transformation from Figure 1 (a), covering all non-zeros (similarly, tile-wise 2:4 is used to derive Figure 1 (b) from Figure 1 (a)). We use this transformation to leverage unstructured sparsity with VEGETA and show the estimated performance gain in Section VI-E.
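A minimal sketch of this per-row selection, assuming M = 4 and available patterns 1:4/2:4/4:4 (function names and layout are ours, not from the paper):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Pick the sparsest N in {1, 2, 4} such that every 4-element block of
// `row` has at most N non-zeros, so no non-zero value is dropped.
int pick_n_for_row(const std::vector<float>& row) {
    int worst = 0;  // max non-zeros seen in any block of 4
    for (size_t base = 0; base + 4 <= row.size(); base += 4) {
        int nz = 0;
        for (int i = 0; i < 4; ++i)
            if (row[base + i] != 0.0f) ++nz;
        worst = std::max(worst, nz);
    }
    if (worst <= 1) return 1;   // compress this row as 1:4
    if (worst <= 2) return 2;   // compress this row as 2:4
    return 4;                   // keep this row dense (4:4)
}

int main() {
    std::vector<std::vector<float>> tile = {
        {0, 2, 0, 0,  0, 0, 1, 0},   // at most 1 non-zero per block -> 1:4
        {3, 0, 0, 5,  0, 0, 0, 0},   // at most 2 non-zeros per block -> 2:4
        {1, 1, 1, 0,  0, 4, 0, 0},   // a block with 3 non-zeros      -> 4:4
    };
    for (size_t r = 0; r < tile.size(); ++r)
        std::printf("row %zu -> %d:4\n", r, pick_n_for_row(tile[r]));
}
```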
III-E VEGETA Design Overview
This work presents VEGETA, which includes ISA extensions and microarchitecture support for structured/unstructured sparsity using flexible N:M fine-grained structured sparsity HW [52, 55] in CPUs. We present a detailed design for specific and important design points (1:4, 2:4, 4:4) to explain the extensions to both the ISA and the microarchitecture, but both can naturally be extended to different block sizes (M).
We use a 32×16 conventional WS systolic array as the baseline, inspired by RASA [29] and Intel's TMUL [26]. Comparing against a dense matrix engine rather than a vector engine provides a strong baseline due to the huge gap in compute throughput between typical matrix engines and vector engines (Section III-A). Our proposed VEGETA engine maintains the same number of MAC units as the baseline, adding the ability to skip zero-valued weights via new control logic, multiplexers, and adders for reductions, along with some wider data paths. We target mixed precision with BF16/FP32, which is widely used for both inference and training on commercial devices [26, 47, 3].
IV VEGETA Instruction Set Architecture
IV-A Register File Support

Inspired by Intel AMX [26], we assume there are eight 1 KB tile registers (treg0-7), each comprising 16 rows of 64 Bytes. We define a tile as a 2D block comprising rows of data separated by a fixed stride. We define an effective tile as the larger sparse tile captured by the non-zeros present in the compressed tile (along with metadata) in the case of a sparse matrix. A tile register can hold 16×32 BF16 elements or 16×16 FP32 elements. To support 2:4 and 1:4 sparsity, we introduce aliased tile registers of size 2 KB (utile register or ureg) and 4 KB (vtile register or vreg), respectively. One ureg is composed of two consecutive tregs, while one vreg is composed of two consecutive uregs, as shown in Figure 6.
Next, we introduce metadata registers (mreg0-7) to store metadata for sparse tiles. As shown in Figure 2, a pair of bits in the metadata represents the position of one non-zero element within a block of the compressed sparse matrix. Since a single row of the tile register holds 32 non-zero BF16 elements, the corresponding metadata register row holds 32×2 = 64 bits of metadata. Hence, an mreg has 16 rows, each with 64 bits of metadata, for a total of 128 Bytes. Note that while a treg can hold 16×32 BF16 elements, a treg and an mreg used for 2:4 sparsity can store the data for an effective tile whose dimension is 16×64. Similarly, when an mreg is used for 1:4 sparsity, a treg and an mreg can represent an effective tile whose dimension is 16×128.
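To make the compressed-values-plus-metadata layout concrete, here is an illustrative sketch (helper names are ours) that compresses one block of four values under 2:4 sparsity into two kept values and two 2-bit position indices, the encoding described above:

```cpp
#include <array>
#include <cstdint>
#include <cstdio>

// Compress one 2:4 block: keep the (up to) two non-zero values and record
// each one's position within the block as a 2-bit index.
// Returns packed metadata: bits [1:0] = index of the first kept value,
// bits [3:2] = index of the second kept value.
uint8_t compress_2of4(const std::array<float, 4>& block,
                      std::array<float, 2>& kept) {
    uint8_t meta = 0;
    int out = 0;
    for (int i = 0; i < 4 && out < 2; ++i) {
        if (block[i] != 0.0f) {
            kept[out] = block[i];
            meta |= static_cast<uint8_t>(i) << (2 * out);
            ++out;
        }
    }
    for (; out < 2; ++out) kept[out] = 0.0f;  // pad if fewer than 2 non-zeros
    return meta;
}

int main() {
    std::array<float, 4> block = {0.0f, 1.25f, 0.0f, -3.0f};
    std::array<float, 2> kept{};
    uint8_t meta = compress_2of4(block, kept);
    // A treg row would hold 32 such kept values; the matching mreg row
    // would hold 32 x 2 = 64 bits of these indices.
    std::printf("kept = {%g, %g}, metadata = 0x%x\n", kept[0], kept[1], meta);
}
```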
Instruction | Operands | Explanation
---|---|---
tile_load_t | dst: treg, src: ptr[TILE] | Loads a 1 KB tile (a dense tile or the non-zero values of a compressed sparse tile) from memory into a treg
tile_load_u | dst: ureg, src: ptr[TILE] | Loads a 2 KB tile from memory into a ureg
tile_load_v | dst: vreg, src: ptr[TILE] | Loads a 4 KB tile from memory into a vreg
tile_load_m | dst: mreg, src: ptr | Loads 128 B of sparsity metadata from memory into an mreg
tile_store_t | dst: ptr[TILE], src: treg | Stores a 1 KB tile from a treg to memory
tile_gemm | dst/src: treg (C), src: treg (A), treg (B) | Dense (4:4) tile GEMM, C += A × B
tile_spmm_u | dst/src: treg (C), src: treg (A), mreg, ureg (B) | Tile SPMM with a 2:4 compressed sparse A tile and a dense 2 KB B tile
tile_spmm_v | dst/src: treg (C), src: treg (A), mreg, vreg (B) | Tile SPMM with a 1:4 compressed sparse A tile and a dense 4 KB B tile
tile_spmm_r | dst/src: treg (C), src: treg (A), mreg, B tile | Tile SPMM with a row-wise N:4 compressed sparse A tile; per-row N is stored as extra metadata
IV-B VEGETA Instructions
Table II summarizes the VEGETA instructions using the aforementioned registers. tile_load_t, tile_load_u, and tile_load_v load a 1 KB, 2 KB, and 4 KB tile from the specified address to a treg, ureg, and vreg, respectively, while tile_store_t stores a 1 KB tile from a treg to memory. tile_load_t can be used to load either a dense tile or the non-zero values of a compressed sparse tile. In the latter case, the 1 KB tile has an effective tile size of 2 KB or 4 KB for 2:4 and 1:4 sparsity ratios, respectively, as mentioned above. Furthermore, the load of a sparse tile must be accompanied by a corresponding tile_load_m instruction, which loads 128 B of metadata from memory into an mreg.
The tile_gemm, tile_spmm_u, and tile_spmm_v instructions perform a tile matrix multiply and accumulate, C += A × B, where A and B are BF16 tiles and C is the FP32 output. A holds a tile of a sparse matrix. With data in compressed format, A holds a fixed number of non-zeros, but with higher sparsity (smaller N:M), its effective size is larger. In contrast, the actual size of the dense B tile must grow as A gets sparser. For example, assuming a dense A tile is 16×32 (1 KB), the effective size of A for a 2:4 sparsity ratio is 16×64 (2 KB), while the non-zero values could still fit into a 1 KB treg. Thus, the corresponding B tile should be 64×16, which will fit into a 2 KB ureg. Similarly, for a 1:4 ratio, the effective size of A is 16×128 (4 KB) and the corresponding B tile (128×16) will fit into a 4 KB vreg. Note that the corresponding output tile C is a constant size (16×16) and fits in a 1 KB treg. We call tile_gemm a VEGETA tile GEMM instruction and tile_spmm_u and tile_spmm_v VEGETA tile SPMM instructions.
To summarize, tile_gemm performs a dense (4:4) GEMM operation on 1 KB treg inputs, while tile_spmm_u performs an SPMM operation where A is a 2:4 compressed sparse 1 KB tile, B is a dense 2 KB tile, and the output C is a dense 1 KB tile. Thus, tile_spmm_u calculates C (16×16) += A (16×64) × B (64×16). Similarly, tile_spmm_v calculates C (16×16) += A (16×128) × B (128×16). Note that C is used as both an input and an output.
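As a functional reference for these semantics (not the hardware dataflow), a minimal C++ sketch is shown below; the array shapes follow the effective 2:4 tile sizes above, plain float stands in for BF16, and all names are ours rather than part of the ISA:

```cpp
#include <array>
#include <cstdint>

// C[16][16] += expand(A, meta)[16][64] * B[64][16]
// A: 16x32 non-zero values of a 2:4 sparse 16x64 tile (row-major).
// meta: 2-bit position of each non-zero inside its 4-wide block.
void tile_spmm_u_ref(std::array<std::array<float, 16>, 16>& C,
                     const std::array<std::array<float, 32>, 16>& A,
                     const std::array<std::array<uint8_t, 32>, 16>& meta,
                     const std::array<std::array<float, 16>, 64>& B) {
    for (int i = 0; i < 16; ++i) {
        for (int k = 0; k < 32; ++k) {          // 32 non-zeros per row of A
            int block = k / 2;                  // 2 non-zeros per 4-wide block
            int col   = 4 * block + meta[i][k]; // effective column in [0, 64)
            float a   = A[i][k];
            for (int j = 0; j < 16; ++j)
                C[i][j] += a * B[col][j];       // 32 effectual MACs per output
        }
    }
}
```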
The number of useful MAC operations required to calculate C is the same for tile_gemm, tile_spmm_u, and tile_spmm_v (8192). For each output element, the number of effectual MAC computations is 32. Finally, to support SPMM with a row-wise N:4 sparse matrix A, we introduce tile_spmm_r. tile_spmm_r calculates C += A × B, where the number of rows of the sparse weight tile A can vary from 8 to 32, depending on the N:4 sparsity of each row (the per-row sparsity is stored as extra metadata, at most 32×2 bits, or 8 B).
We show an example SPMM kernel assuming 2:4 sparsity using VEGETA instructions in Listing 1. M, N, and K indicate the size of each GEMM dimension, while the corresponding tile sizes are 16, 16, and 64, respectively. We store the values of one of the input matrices in a transposed manner in the tile registers. We implemented an optimized version of this kernel to evaluate our architecture, which we introduce in Section V.
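As a rough illustration of how such a kernel is structured, the sketch below uses hypothetical C++ intrinsics named after the Table II instructions; the actual VEGETA intrinsics, their signatures, the addressing, and Listing 1 itself may differ.

```cpp
#include <cstdint>

// Hypothetical intrinsics mirroring the Table II instructions; each load
// takes a base pointer and a row stride in bytes, AMX-style.
void tile_load_t(int treg, const void* base, int stride);  // 1 KB tile
void tile_load_u(int ureg, const void* base, int stride);  // 2 KB tile
void tile_load_m(int mreg, const void* base, int stride);  // 128 B metadata
void tile_spmm_u(int c_treg, int a_treg, int a_mreg, int b_ureg);
void tile_store_t(void* base, int stride, int treg);

// C (MxN, FP32) += A (MxK, 2:4-sparse BF16, stored compressed) * B (KxN, BF16).
// M, N, K are the effective (uncompressed) GEMM dimensions.
void spmm_2of4(const uint16_t* A_vals, const uint8_t* A_meta,
               const uint16_t* B, float* C, int M, int N, int K) {
    const int Mt = 16, Nt = 16, Kt = 64;        // effective tile sizes for 2:4
    for (int m = 0; m < M; m += Mt)
        for (int n = 0; n < N; n += Nt) {
            tile_load_t(2, &C[m * N + n], N * 4);              // C tile
            for (int k = 0; k < K; k += Kt) {
                // A tile: 16 rows x 32 non-zeros (1 KB) plus 16 x 64 b metadata.
                tile_load_t(0, &A_vals[m * (K / 2) + k / 2], K);
                tile_load_m(0, &A_meta[(m * K + k) / 8], K / 8);
                tile_load_u(1, &B[k * N + n], N * 2);          // dense 64x16 B
                tile_spmm_u(2, /*A vals*/0, /*A meta*/0, /*B*/1);
            }
            tile_store_t(&C[m * N + n], N * 4, 2);
        }
}
```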
IV-C Flexibility in the Block Size, M
In this work, we assume M = 4 to explain our extension in detail, which can handle different fine-grained structured sparsity patterns including 1:4/2:4/4:4, but our extension is not limited to this specific value of M. While we use M = 4 in our implementation, our approach can be extended to larger M by modifying the tile/metadata registers and ISA to support larger input registers. A larger M provides greater flexibility to the sparse model design and may result in improved accuracy [52], but would cost more HW.
V VEGETA Engine Architecture
V-A Processing Units and Processing Elements

VEGETA Engine | # of PE rows | # of PE columns | # of MACs per PE | # of input elements per PE row per cycle | # of PUs per PE (B) | | Supported sparsity | Example from Prior Work
---|---|---|---|---|---|---|---|---
VEGETA-D-1-1 | 32 | 16 | 1 | 1 | 1 | 16 | Dense | RASA-SM [29]
VEGETA-D-1-2 | 16 | 16 | 2 | 2 | 1 | 16 | Dense | RASA-DM [29]
VEGETA-D-16-1 | 32 | 1 | 16 | 1 | 16 | 1 | Dense | Intel TMUL-like [26]
VEGETA-S-1-2 | 16 | 16 | 2 | 8 | 1 | 16 | 1:4, 2:4, 4:4 | New design
VEGETA-S-2-2 | 16 | 8 | 4 | 8 | 2 | 8 | 1:4, 2:4, 4:4 | New design
VEGETA-S-4-2 | 16 | 4 | 8 | 8 | 4 | 4 | 1:4, 2:4, 4:4 | New design
VEGETA-S-8-2 | 16 | 2 | 16 | 8 | 8 | 2 | 1:4, 2:4, 4:4 | New design
VEGETA-S-16-2 | 16 | 1 | 32 | 8 | 16 | 2 | 1:4, 2:4, 4:4 | New design
Processing Unit (PU). A PU is composed of a number of MAC units that contribute to the same output element. In a WS systolic array, the dot product to produce each output element is mapped to a column of PUs. Partially accumulated results trickle down a column, and the final result of the dot product exits the bottom of the array. In a conventional design, a PU is composed of one MAC unit, the height of the array is the length of the dot product, and each PU produces a single partial sum each cycle. We can, however, break the set of operations for each dot product into pieces, or "lanes." If we pack multiple MAC units into a PU, each PU can work on all lanes in parallel, and we can scale down the height of the systolic array. In this case, we must combine the partial sum for each lane at the bottom of the systolic array. We call the number of lanes, i.e., the number of MAC units in a PU, the reduction factor, R. R also indicates how many partial sums need to be reduced at the bottom of the systolic array to generate a single output element.
Processing Element (PE). We group PUs that share the same eastbound inputs and output buffers into a PE. That is, peer-to-peer forwarding happens between PEs, and the input fed from the west is broadcast to all PUs in a PE. The more PUs in a PE, the narrower the systolic array, and the more we amortize the overhead of the horizontal pipeline buffer. This improvement in area and power comes at the cost of lower achievable frequency since the broadcast must reach more PUs. We call the number of PUs in a PE the broadcast factor, B. We label PE designs as PE-B-R. For example, PE-1-1 indicates each PE has one single-MAC PU, while PE-4-2 indicates each PE has four two-MAC PUs (PU-2s).
SPU and SPE. We enhance PUs and PEs to support tile SPMM with flexible N:M structured sparsity. To distinguish them, we call a PU and PE without sparsity support a Dense Processing Unit (DPU) and Dense Processing Element (DPE), respectively, while we call a PU and PE with sparsity support a Sparse Processing Unit (SPU) and Sparse Processing Element (SPE), respectively. DPEs and SPEs are the building blocks for the VEGETA-D (for dense) and VEGETA-S (for dense and sparse) engines, which we explain in the following sub-sections. The main differences between a DPE and an SPE are the M-to-1 MUXes and a metadata buffer added to each weight buffer. Each cycle, an SPE receives multiple input elements and uses this extra hardware to select, for each weight, the one corresponding to its index. To support flexible N:M structured sparsity, we choose R so that at most two input blocks need to be fed into a single row; since we use M = 4, we use R = 2 for our SPEs. In summary, we use DPE-B-R and SPE-B-R to indicate the broadcast factor (B) and the reduction factor (R) of each PE. For example, the broadcast factor (B) of SPE-2-2 (shown in Figure 7 (b)) is 2, which indicates that a single SPE-2-2 is composed of two SPU-2s.

Execution flow of DPE/SPE: DPE-1-1, a conventional single-MAC PE, operates as follows. First, a weight element is loaded into its weight buffer. Then, an input element and partial sum element are streamed from the left and top ports, respectively. The input element is multiplied with the element in the weight buffer and accumulated with the partial sum element. The generated partial sum flows to the PE below the current PE, and the input element flows to the PE on the right.
An SPE is composed of B SPUs, each of which comprises R MAC units. Different SPUs calculate partial sums in parallel for different output elements, while the MAC units in an SPU calculate partial sums in parallel for the same output element. An SPE-2-2 receives two input blocks (instead of one input element) from the west and broadcasts them to its SPUs. An input block is fed into each 4-to-1 MUX in an SPU, and using the metadata for the corresponding weight value in the weight buffer, the corresponding input element from the block is selected. Then, each selected input element is multiplied by the corresponding weight element and accumulated. Since each SPU generates R partial sums, and there are B SPUs, an SPE forwards a total of B×R partial sums (for example, an SPE-2-2 will generate four partial sums) to the south port.
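To make the selection step concrete, below is a small behavioral model (our own abstraction, not RTL) of one SPU with R = 2 handling the 2:4 case: each MAC picks one element of the broadcast 4-element input block using its weight's 2-bit metadata index.

```cpp
#include <array>
#include <cstdint>

// One SPU with R = 2 MAC units processing a 2:4 weight block.
// `block` is the 4-element input block broadcast from the west,
// `w` holds the two stationary non-zero weights, `meta` their 2-bit
// positions, and `psum` the two partial sums flowing south.
void spu2_step_2of4(const std::array<float, 4>& block,
                    const std::array<float, 2>& w,
                    const std::array<uint8_t, 2>& meta,
                    std::array<float, 2>& psum) {
    for (int lane = 0; lane < 2; ++lane) {
        float in = block[meta[lane] & 0x3];  // 4-to-1 MUX controlled by metadata
        psum[lane] += w[lane] * in;          // MAC
    }
    // At the bottom of the array, the two lane partial sums that belong to
    // the same output element are added by the reduction unit.
}
```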
V-B VEGETA Matrix Engine for Tile-Wise N:M Sparsity

Using DPEs and SPEs, we show how to build various VEGETA engines. We use VEGETA-{S (for sparse), D (for dense)}-B-R to indicate the type of the base PE of a VEGETA engine. For example, VEGETA-D-1-1 has DPE-1-1s as its PEs, while VEGETA-S-2-2 is composed of SPE-2-2s. In Figure 7, we show VEGETA-D-1-1 and VEGETA-S-2-2 as examples. Note that adders (or adder trees) are needed at the bottom of the VEGETA engine if the reduction factor R > 1, to generate the final output element by reducing the R partial sums. A VEGETA engine is a 2D array of PEs, where neighboring PEs are connected. Since a column of SPUs cooperates to calculate a single output element, and the number of effectual computations per output element is 32 for the VEGETA tile GEMM/SPMM instructions, the number of PE rows can be derived as 32/R. With the number of PUs in a PE (B), the number of MACs in a PU (R), and the total number of MAC units in a VEGETA engine, the number of PE columns can then be derived by dividing the total MAC count by (32/R)×B×R. In Table III, we summarize different VEGETA engine design choices.
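Since the array shape follows mechanically from these relations (32 effectual MACs per output and 512 MAC units total), a tiny sketch reproducing the shapes listed in Table III may help; the function and variable names are ours.

```cpp
#include <cstdio>

// Derive the PE-array shape of a VEGETA engine from its broadcast factor B
// (PUs per PE) and reduction factor R (MACs per PU), assuming 512 MAC units
// and 32 effectual MACs per output element.
void derive_shape(const char* name, int B, int R) {
    int rows = 32 / R;                 // a column of PUs covers one dot product
    int cols = 512 / (rows * B * R);   // remaining MACs spread across columns
    std::printf("%s: %d x %d PEs, %d MACs/PE\n", name, rows, cols, B * R);
}

int main() {
    derive_shape("VEGETA-D-1-1", 1, 1);    // 32 x 16, the baseline dense array
    derive_shape("VEGETA-S-2-2", 2, 2);    // 16 x 8
    derive_shape("VEGETA-S-16-2", 16, 2);  // 16 x 1
}
```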
Each row of a VEGETA-S engine has a special unit called an Input Selector to select the right input blocks to support flexible structured sparsity. In Figure 8, we show how different VEGETA tile GEMM/SPMM instructions are executed on VEGETA-S-2-2 focusing on one SPE and the corresponding input selector. For all three cases, non-zero values in each row of a weight tile are mapped onto a column of SPU-2s (i.e. 2 columns of MAC units). For example, weight elements 1-32 are mapped onto the first column of SPU-2 while weight elements 33-64 are mapped onto the second column of SPU-2.
4:4 structured sparsity: For tile_gemm (4:4 sparsity), the weight tile is dense; thus, there are four effectual computations for an output element per input block (of size 4), and a half input block (two input elements) is fed into a row of PEs from the west port and broadcast to each SPU in an SPE. Since the weight tile is dense for tile_gemm, the first element of an input block is multiplied with the first weight element in an SPU, while the second element of the input block is multiplied with the second weight element in the SPU. Similar to a classic systolic array, the input is fed in a skewed manner so that the reduction of partial sums happens in a spatio-temporal manner along a column of MAC units. Once it reaches the bottom of the array, the partial sums from each MAC column in an SPU are accumulated in a reduction unit (an adder in this case) to get the final output element.
2:4 structured sparsity: For tile_spmm_u (2:4 sparsity), a full input block (instead of a half block) is fed into a row of PEs from the left and broadcast to each SPU in an SPE. The 4-to-1 MUXes in an SPU choose the corresponding input element from the input block to be used in a MAC unit. In this case, the same block is shared between the MAC units in an SPU since there are two effectual computations for an output element per input block due to the 2:4 structured sparsity.
1:4 structured sparsity: Finally, for tile_spmm_v (1:4 sparsity), there is only one effectual computation for an output element per input block. Thus, two input blocks are fed into a row of PEs from the left and broadcast to each SPU in an SPE. Each MAC unit in an SPU gets an input block and chooses and computes with the corresponding input element in the block. Unlike tile_spmm_u, there is only one non-zero element in a weight block due to the 1:4 structured sparsity. Thus, the two weight elements in an SPU belong to two different weight blocks, which implies that they need two different input blocks for computation. In Figure 9, we show a detailed cycle-level visualization of how a VEGETA-S-2-2 executes a tile_spmm_v, which computes a tile SPMM with a 1:4 sparse weight tile, a dense input tile, and a dense output tile.
V-C Optimizations
Pipelining. So far, we have shown how one VEGETA tile GEMM/SPMM instruction can be executed on a VEGETA engine. For modest-sized tiles, filling and draining the systolic array for a tile GEMM/SPMM instruction can significantly lower PE utilization. A recent work RASA [29] introduced pipelining the execution of tile GEMM instructions on a dense systolic array-based matrix engine; this overlaps different stages of execution for different instructions. We extend the pipelining concept and show how different tile SPMM instructions can be executed concurrently.
Similar to the stages defined in the previous work, the Weight Load (WL), Feed First (FF), Feed Second (FS), and Drain (DR) stages are used for pipelining both the VEGETA-D and VEGETA-S designs. WL is the first stage of the execution on the systolic array, where the (stationary) weight elements are loaded into the corresponding PEs from the north ports, taking one cycle per PE row. Next, during the FF stage, the corresponding input and output elements are fed from the west and north ports, taking as many cycles as the number of columns in an input tile. This stage ends when the top-left PE stops receiving input/output elements. Since the inputs are streamed in a skewed manner, the remaining rows of the systolic array are still receiving new input elements from the west ports. This stage is called the FS stage and ends when no more input elements come in from the west ports. Finally, to finish the remaining calculations in the systolic array, additional cycles are needed during the DR stage, followed by the cycles required for reduction at the bottom. Note that we need an additional stage for reduction after the DR stage to wait for the remaining values in the reduction units. Also, unlike the conventional systolic array where each PE receives one input, weight, and output element at a time, multiple input blocks, weight elements, and output elements must be fed to each SPE. Figure 9 shows the streaming pattern of input element blocks.
The array height and width can be reduced by increasing B and R, so larger B and R might reduce the latency of a single instruction. However, a larger B causes lower frequency, while a larger R causes longer reduction latency and a larger reduction unit, so we cannot arbitrarily increase them. Furthermore, to properly overlap different stages, it is crucial to balance the latencies of the different stages. Different VEGETA-S designs can use the same pipelining stages since they follow the same streaming pattern.
Figure 10 (a) and (b) show the pipelining examples for VEGETA-D-1-2 and VEGETA-S-16-2, assuming no dependency between instructions. With pipelining, multiple instructions may be executed in the systolic array concurrently, but no two of them may be in the same stage of execution, since that would oversubscribe at least one hardware resource. Since the latency of each stage is known, the scheduler can easily enforce this. We observe that the next instruction can be executed after 16 cycles for VEGETA-S-16-2, which is the same as VEGETA-D-1-2, since it is limited by the total number of MAC units (maximum compute throughput). Due to its smaller array dimensions, the latency of each instruction on VEGETA-S-16-2 is shorter than that on VEGETA-D-1-2.
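A toy sketch of that scheduling rule (stage latencies are parameters; the values below are illustrative, not those of any particular design):

```cpp
#include <array>
#include <cstdio>

// Earliest issue cycle for the next tile instruction such that it never
// occupies the same pipeline stage (WL/FF/FS/DR) as the previous one.
// `lat` holds the four stage latencies in cycles.
int earliest_issue(int prev_issue, const std::array<int, 4>& lat) {
    // Issuing the next instruction `gap` cycles later shifts each of its
    // stages by `gap`, so avoiding same-stage overlap requires
    // gap >= latency of every stage, i.e., the longest single stage.
    int max_stage = 0;
    for (int l : lat) max_stage = (l > max_stage) ? l : max_stage;
    return prev_issue + max_stage;
}

int main() {
    // Example: if every stage takes 16 cycles, back-to-back independent
    // instructions can issue every 16 cycles.
    std::array<int, 4> lat = {16, 16, 16, 16};
    std::printf("next issue at cycle %d\n", earliest_issue(0, lat));
}
```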

Output forwarding. Pipelining allows concurrent execution of independent instructions. However, it is often not allowed if the instructions have a dependence between them. Since VEGETA tile GEMM/SPMM instructions perform accumulation, the destination register is a source as well; thus, for two back-to-back tile multiplication instructions with the same destination register, the second cannot begin execution until the first finishes. A traditional approach to resolve this kind of pipeline stall is data forwarding. We extend this "output forwarding (OF)" concept to matrix engines. In Figure 10 (c) and (d), we show the pipelining examples of dependent instructions for VEGETA-S-16-2 with and without output forwarding. Since we feed a tile into the systolic array over multiple cycles, we only need to be sure that particular elements will be ready before we need to feed them. A key insight is that the register reads and writes of the C tile follow exactly the same order in a systolic array. This is because every output element is calculated a fixed number of cycles after it is fed into the systolic array. For example, when executing on VEGETA-S-16-2, a tile_spmm_u starts reading the C tile when the FF stage begins. In the same order, the C tile is written back a fixed number of cycles later, so that the dependent instruction can start reading the same tile. This can significantly reduce the stalls which would have occurred between instructions with a dependency without OF, but a bypass buffer is required to keep and forward the data before writing it back to the tile register.
V-D Flexibility in the Block Size, M
Similar to the VEGETA ISA support for a broader range of sparsity, VEGETA engines can be easily extended for larger M. For example, to support the M = 16 case (for 1:16, 2:16, 4:16, 8:16, and 16:16), a 16-to-1 MUX (or five 4-to-1 MUXes) is needed per MAC unit, which would take an input block composed of 16 elements and select the corresponding input element. B and R should be configured considering the balance of pipeline stages, data reuse, critical path, etc.
V-E Row-Wise N:M Sparsity Support
VEGETA engines support row-wise N:M sparsity, which can be used to leverage unstructured sparsity using the method described in Section III-D. In Figure 11, we show a row-wise sparse matrix using 4:4, 2:4, and 1:4 sparsity for Row 1, Rows 2/3, and the remaining rows, respectively. A row with 4:4 maps to an SPE-1-4 column, and the outputs of the four SPU columns reduce to generate one partial sum. A row with 2:4 maps to an SPE-2-2 column, and the outputs of each pair of SPU columns reduce, resulting in two partial sums per SPE column. A row with 1:4 maps to an SPE-4-1 column, generating four partial sums per SPE column. Using this mapping, we can ensure all columns are fully utilized while allowing different N:4 sparsity across different rows. The number of columns of a weight tile is equal to four times the number of PE rows, keeping all columns fully utilized.

Similar to Figure 8 (b), an input block (4 elements) is fed into a row of SPEs from the left and broadcast to each MAC in an SPE. In Figure 11, the number of rows of a weight tile can vary based on the N:4 sparsity combinations in a tile. When a row-wise N:4 sparse tile has n4 rows with 4:4 sparsity, n2 rows with 2:4 sparsity, and n1 rows with 1:4 sparsity, the number of SPE columns used in VEGETA-S and the number of rows of the weight tile can be derived as n4 + n2/2 + n1/4 and n4 + n2 + n1, respectively. The number of weight-tile rows can vary from 8 to 32 depending on the sparsity degree of the tile. Our approach requires having consecutive groups of rows of the weight tile with the same sparsity degree (e.g., groups of 2 for 2:4 and 4 for 1:4); we call this pseudo row-wise N:M sparsity. We can employ a simple reordering in the input/output DMA engines to group input rows with the same sparsity and reorder the output elements back to their original order. Since this only needs an extra row of adders, the hardware overhead is negligible compared to VEGETA-S, which supports tile-wise flexible N:M.
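The relations above can be written out as a small helper (names are ours): given how many rows of a tile use each N:4 pattern, it returns the SPE columns consumed and the weight rows mapped.

```cpp
#include <cstdio>

// For a (pseudo) row-wise N:4 sparse weight tile with n4 rows of 4:4,
// n2 rows of 2:4, and n1 rows of 1:4 sparsity:
//  - a 4:4 row occupies a whole SPE column (SPE-1-4 mapping),
//  - a 2:4 row occupies half an SPE column (SPE-2-2 mapping),
//  - a 1:4 row occupies a quarter of an SPE column (SPE-4-1 mapping).
struct RowWiseMapping { double spe_columns; int weight_rows; };

RowWiseMapping map_row_wise(int n4, int n2, int n1) {
    RowWiseMapping m;
    m.spe_columns = n4 + n2 / 2.0 + n1 / 4.0;  // SPE columns consumed
    m.weight_rows = n4 + n2 + n1;              // rows of the weight tile mapped
    return m;
}

int main() {
    // With 8 SPE columns available: all 4:4 fits 8 rows, all 1:4 fits 32 rows.
    RowWiseMapping dense  = map_row_wise(8, 0, 0);
    RowWiseMapping sparse = map_row_wise(0, 0, 32);
    std::printf("all 4:4 -> %.1f cols, %d rows; all 1:4 -> %.1f cols, %d rows\n",
                dense.spe_columns, dense.weight_rows,
                sparse.spe_columns, sparse.weight_rows);
}
```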
V-F Integration of VEGETA Engines with CPU

In Figure 12, we show how we integrate VEGETA in a CPU, with the components that we modify or introduce marked in red. Our baseline core design maintains separate physical and architectural register files. We extend this to the tile registers and metadata registers for VEGETA; this includes enhancements to the register file itself, allocator, renamer, scheduler, and ROB to add the ability to manage the new registers, track dependencies on them, and support precise exceptions. We also enhance the scheduler to track static and dynamic information of instructions being executed in VEGETA engines to support pipelining and output forwarding at the right time without interrupting scalar/vector instructions. A tile_load_t (or tile_store_t) is converted into 16 memory requests, each loading (or storing) 64 Bytes (a cache line) through the load/store queue, imposing no extra implications on cache/memory coherence or consistency.
VI Evaluation
Workloads | Dimensions | # of MACs |
---|---|---|
ResNet50-L1 | K=64, C=256, Y=56, X=56, R=1, S=1 | 51,380,224 |
ResNet50-L2 | K=64, C=64, Y=56, X=56, R=3, S=3 | 115,605,504 |
ResNet50-L3 | K=256, C=64, Y=56, X=56, R=1, S=1 | 51,380,224 |
ResNet50-L4 | K=128, C=128, Y=28, X=28, R=3, S=3 | 115,605,504 |
ResNet50-L5 | K=512, C=128, Y=28, X=28, R=1, S=1 | 51,380,224 |
ResNet50-L6 | K=256, C=256, Y=14, X=14, R=3, S=3 | 115,605,504 |
BERT-L1 | M=512, N=768, K=768 | 301,989,888 |
BERT-L2 | M=512, N=512, K=768 | 201,326,592 |
BERT-L3 | M=512, N=768, K=512 | 201,326,592 |
GPT-L1 | M=256, N=256, K=2,048 | 134,217,728 |
GPT-L2 | M=512, N=512, K=2,048 | 536,870,912 |
GPT-L3 | M=256, N=256, K=12,288 | 805,306,368 |


VI-A VEGETA Implementation
We modified LLVM [35] for the new VEGETA ISA and implemented VEGETA C++ intrinsics. Next, we wrote GEMM/SPMM kernels that exploit layer-wise N:M sparsity using VEGETA intrinsics. Since there is no commercial CPU that can execute VEGETA instructions, we developed a Pintool, an instrumentation tool using Pin APIs [38], which registers instrumentation callback routines that emulate each of the VEGETA instructions described in Table II. Then, we generated traces of the kernels, containing every executed instruction along with its dynamic information, using the Pintool, and extended MacSim [31] to handle VEGETA instructions and registers along with the different executions of matrix engines. Finally, we simulated the GEMM/SPMM kernels using the generated traces on MacSim.
We also developed RTL designs to explore different VEGETA engines with different B and R. We model baseline dense matrix engines with the RASA-SM, RASA-DM, and Intel TMUL-like configs through VEGETA-D-1-1, VEGETA-D-1-2, and VEGETA-D-16-1, respectively. We estimate the performance of an engine with the NVIDIA STC-like config using VEGETA-S-1-2, forcing only 2:4 support. We synthesized each RTL design using the Synopsys DC compiler with the Nangate 15nm Technology Library. We used the post-layout design for the area/power/timing numbers of the designs shown in Table III.
VI-B Experimental Setup
Although VEGETA is not limited to a single use case, thanks to its generic SPMM instructions that can be used in various kernels, we use DL inference as a use case to show the performance of VEGETA. As DNN compression is done offline (usually once before deployment) for inference [55, 52, 56, 4], its cost is amortized over multiple inferences; thus, the reported inference performance does not include it. For the workloads, we choose representative DNN layers from ResNet50 [21], BERT [15], and GPT-3 [10]. The parameters for the layers are summarized in Table IV. To convert convolutional layers of ResNet50 into GEMM kernels, we use the dimensions derived by applying the image-to-column (im2col) algorithm. We run the DNN layers with 1:4/2:4/4:4 structured sparsity on the different VEGETA engines listed in Table III using MacSim. We set the frequency of the core to 2 GHz and the fetch/issue/retire width to four, with 16 pipeline stages, 97 ROB entries, and 96 load buffer entries. To focus on the performance trade-offs of different VEGETA designs, we assume that the data is prefetched into the L2 cache.
VI-C Performance Analysis
Figure 13 shows the runtime for various DNN layers. For this experiment, we assume that all the matrix engines run at 0.5 GHz. We chose 0.5 GHz since it met timing for all the matrix designs used in the evaluation, as derived from the corresponding RTL implementations. We normalized the runtime using the longest runtime (the runtime of GPT-L3 on RASA-SM). We first observe that RASA-SM suffers from under-utilization of its processing elements due to the mismatch of the matrix engine pipeline stages (WL/FF/FS/DR), resulting in the highest runtime. RASA-DM is a state-of-the-art matrix engine for CPUs and achieves good throughput by matching the latencies of its matrix engine pipeline stages. Compared to RASA-DM, our sparse engine designs perform comparably for the dense workload, showing a performance gain of up to 7%. The performance gain mainly comes from the reduced latency of the drain stage, achieved by reducing the width of the SA, and from output forwarding.
Since VEGETA-D engines are not able to leverage sparsity, they cannot skip ineffectual computations and show the same performance for 2:4 and 1:4 structured sparsity, unlike VEGETA-S engines. Also, the design with the STC-like config can only accelerate 2:4 sparsity, while our VEGETA-S designs can accelerate various fine-grained structured sparsity patterns. For layers with 2:4 structured sparsity, the matrix engine using the STC-like config shows a 16% runtime reduction on average compared with RASA-DM. Using VEGETA-S-16-2, an additional 18% runtime reduction is achieved compared to the design with the STC-like config. Finally, with output forwarding, runtime is reduced by another 32%. For layers with 1:4 structured sparsity, the design with the STC-like config does not show better performance than with 2:4 structured sparsity since it cannot exploit the extra zeros to skip more ineffectual computations, unlike our VEGETA-S designs. We observe that VEGETA-S-1-2 shows a 51% runtime reduction on average compared with RASA-DM since it can skip all ineffectual computations. Using our VEGETA-S-16-2 engine, an additional 8% runtime reduction is achieved. Finally, with output forwarding, runtime is reduced by another 37% by resolving output dependencies early.
VI-D Area and Power Analysis

In Figure 14, we show the normalized area and frequency for different VEGETA engines. First, we observe that when we increase the number of PUs in a PE (B), the area of the VEGETA engines decreases due to the lower number of horizontal pipeline buffers, as described in Section V-A. Since we add only small modules to support sparsity, the VEGETA-S design with the largest area overhead compared with RASA-SM incurs only a 6% area overhead. Moreover, by increasing B, VEGETA-S-8-2 and VEGETA-S-16-2 show lower area compared to RASA-SM and the state-of-the-art dense matrix engine for CPUs (RASA-DM). This is because the overhead gets amortized and compensated as more MACs share the data, reducing the total pipeline registers. Power shows a similar trend. When we vary B for VEGETA-S-B-2 as 1, 2, 4, 8, and 16, the power overhead (both static and dynamic) is 17%, 8%, 4%, 3%, and 1% compared with the baseline. Meanwhile, a higher B limits the maximum frequency due to the increased wire length for broadcasting across PUs.
VI-E Analysis for Unstructured Sparsity Support Using VEGETA
We convert unstructured sparse matrices into row-wise N:4 sparse matrices to accelerate SPMM with VEGETA, as discussed in Section V-E. Since there is no prior work on CPU sparse matrix engines, we use a few SOTA sparse accelerators [43, 37] as baseline matrix engines for comparison. As shown in Table I, S2TA [37] naturally supports layer-wise N:M and could potentially be enhanced to support tile-wise N:M, while VEGETA can support pseudo row-wise and row-wise N:M. SIGMA [43] can leverage unstructured sparsity, but it comes with area overhead (we normalize performance by area). For a conservative evaluation, we assume that these baselines are also enhanced to fully hide fill and drain overhead through perfect pipelining. Unlike the layer-wise N:4 evaluation, it is not straightforward to implement optimized kernels using VEGETA instructions, so we use an analytical roofline model, leaving the development of optimized SW kernels using VEGETA instructions as future work. We use the same workloads, but induce random, unstructured sparsity of varying degrees and report the average speed-up normalized to a dense engine in Figure 15. A smaller sparsity granularity increases the possibility of finding an N:M pattern that covers the non-zeros. For example, it is unlikely that an entire unstructured sparse layer exhibits a certain N:M sparsity; thus, layer-wise does not show much performance improvement over dense. In contrast, row-wise achieves 2.36× and 3.28× speed-ups at 90% and 95% sparsity degrees. SIGMA performs better than the others at extremely high sparsity degrees (95%), but it is inefficient for modest sparsity degrees (the target of our work), indicating that its extra area overhead is not useful there.
VII Related Work
CPU support to run GEMMs efficiently. SAVE [18] is a sparsity-aware CPU vector engine that skips redundant computations in sparse GEMM operations to accelerate sparse DNN and high-performance computing workloads. Similarly, SparCE [46] also increases the utilization of vector engines by tracking the general-purpose registers with zeros. With extremely high sparsity, a program becomes memory bound due to the low arithmetic intensity, making vector compute throughput sufficient, and vector engines like SAVE/SparCE can match the performance of a matrix engine. Otherwise, we expect them to be significantly slower than VEGETA due to their lower compute throughput. ZCOMP [8] introduces a vector ISA extension to reduce cross-layer communication for DNNs. Our work is orthogonal and complementary to those works since we are targeting a matrix engine that operates on tiles of the matrix instead of individual vectors. RASA [29] proposes control and data optimizations for CPU matrix engines to improve utilization via efficient pipelining and overlap. It divides a matrix multiplication into different sub-stages on the systolic array and introduces optimizations that pipeline and overlap the different stages. It inspired the sub-stage pipelining in our work. However, it does not consider SPMM, and its design cannot be used directly as a sparse matrix engine.
Handling dynamic sparsity. Since dynamic (input) sparsity is hard to predict, it has to be handled at runtime, unlike static sparsity which is usually pre-determined. We could use compaction of tile registers to build non-zero tiles, similar to the approach that SAVE [18] used for merging vector registers to remove zeros. However, this is not practical for a matrix engine due to the high probability of conflicts across different tiles, since the number of operands in a vector register is 32 while that of a tile register is 512 (16×32). There could be an efficient way to exploit dynamic sparsity on matrix engines in CPUs without much overhead, but we leave it as future work.
Sparse DNN accelerators. There have been several papers focusing on SPMM/SPGEMM acceleration with standalone accelerators [42, 43, 23, 49, 50, 41, 54]. Sparse-TPU [22] and the work from Kung et al. [34] introduce packing algorithms targeting unstructured sparsity for systolic-array-like architectures. Being standalone accelerators with large area-power budgets, these works have explored co-design of compression formats, dataflow, indexing logic, and control structures. CPU matrix engines, on the other hand, have a tight coupling with the RF and operate under strict area-power budgets, motivating our solution of enhancing a dense systolic array with a pipeline-friendly dataflow. Zhu et al. [56] introduce a sparse tensor core by extending the NVIDIA Tensor Core using vector-wise pruning and encoding to enforce structured sparsity. NVIDIA has also recently introduced the Sparse Tensor Core as a sparse matrix engine in its Ampere GPUs to accelerate sparse DNNs with 2:4 structured sparsity [39]. STA [36] proposes a systolic tensor array for inference on mobile devices with density-bound block sparsity, which is similar to N:M structured sparsity, and S2TA [37] extends STA to flexible N:M sparsity. None of the previous works supports row-wise N:M sparsity, which this work shows to be a promising technique to cover unstructured sparsity. Also, prior works do not consider dividing the execution of tile computations into fine-grained stages for pipelining without resource conflicts, which is critical when integrating into CPUs.
VIII Conclusion
This work adds flexible N:M structured sparsity support to CPU matrix engines via the VEGETA ISA and engines supporting various sparsity granularities. We propose several design choices for the VEGETA engine, studying kernel performance, clock frequency, and area trade-offs. Also, we show how the row-wise N:M sparsity feature of VEGETA can efficiently accelerate unstructured sparsity by transforming it into row-wise N:M sparsity. We believe this work opens up opportunities for future work in enhancing the design further for dynamic sparsity and studying the interaction of the sparse matrix engine with the rest of the CPU memory hierarchy.
IX Acknowledgement
This work was funded by an award from Intel’s Corporate Research Council. We thank Raveesh Garg, Po-An Tsai, and Vivek Sarkar for profound discussions on this work.
References
- [1] “https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/,” 2015.
- [2] “Nvidia tesla v100 gpu architecture.” 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
- [3] “Tensorfloat-32 in the a100 gpu accelerates ai training, hpc up to 20x,” 2019, https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/.
- [4] “Nvidia ampere ga102 gpu architecture.” 2021, https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf.
- [5] “oneapi deep neural network library (onednn),” 2021, https://github.com/oneapi-src/oneDNN.
- [6] “Sambanova whitepaper,” 2021. [Online]. Available: https://sambanova.ai/wp-content/uploads/2021/06/SambaNova_RDA_Whitepaper_English.pdf
- [7] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
- [8] B. Akin, Z. A. Chishti, and A. R. Alameldeen, “Zcomp: Reducing dnn cross-layer memory footprint using vector extensions,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: Association for Computing Machinery, 2019, p. 126–138. [Online]. Available: https://doi.org/10.1145/3352460.3358305
- [9] ARM, “Powering the edge: Driving optimal performance with the ethos-n77 npu,” Whitepaper, 2019.
- [10] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
- [11] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An automated end-to-end optimizing compiler for deep learning,” in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’18. USA: USENIX Association, 2018, p. 579–594.
- [12] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
- [13] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
- [14] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
- [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
- [16] T. Geng, A. Li, T. Wang, C. Wu, Y. Li, R. Shi, A. Tumeo, S. Che, S. Reinhardt, and M. Herbordt, “Awb-gcn: A graph convolutional network accelerator with runtime workload rebalancing,” MICRO, 2020.
- [17] E. Georganas, K. Banerjee, D. Kalamkar, S. Avancha, A. Venkat, M. Anderson, G. Henry, H. Pabst, and A. Heinecke, “Harnessing deep learning via a single building block,” in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 222–233.
- [18] Z. Gong, H. Ji, C. W. Fletcher, C. J. Hughes, S. Baghsorkhi, and J. Torrellas, “Save: Sparsity-aware vector engine for accelerating dnn training and inference on cpus,” in MICRO, 2020.
- [19] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
- [20] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’15. Cambridge, MA, USA: MIT Press, 2015, p. 1135–1143.
- [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [22] X. He, S. Pal, A. Amarnath, S. Feng, D.-H. Park, A. Rovinski, H. Ye, Y. Chen, R. Dreslinski, and T. Mudge, “Sparse-tpu: Adapting systolic arrays for sparse matrices,” in Proceedings of the 34th ACM International Conference on Supercomputing, ser. ICS ’20. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3392717.3392751
- [23] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator for sparse tensor algebra,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52. New York, NY, USA: Association for Computing Machinery, 2019, p. 319–333. [Online]. Available: https://doi.org/10.1145/3352460.3358275
- [24] IBM, “Power isa version 3.1,” Whitepaper, 2020.
- [25] Intel, “Theoretical maximum memory bandwidth for intel® core™ x-series processors,” https://www.intel.com/content/www/us/en/support/articles/000056722/processors/intel-core-processors.html.
- [26] Intel, “Intel 64 and ia-32 architectures software developer’s manual,” 2020.
- [27] Intel, “Presentation deck: Intel architecture day 2021,” 2021, https://download.intel.com/newsroom/2021/client-computing/intel-architecture-day-2021-presentation.pdf.
- [28] J.-W. Jang, S. Lee, D. Kim, H. Park, A. S. Ardestani, Y. Choi, C. Kim, Y. Kim, H. Yu, H. Abdel-Aziz, J.-S. Park, H. Lee, D. Lee, M. W. Kim, H. Jung, H. Nam, D. Lim, S. Lee, J.-H. Song, S. Kwon, J. Hassoun, S. Lim, and C. Choi, “Sparsity-aware and re-configurable npu architecture for samsung flagship mobile soc,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 15–28.
- [29] G. Jeong, E. Qin, A. Samajdar, C. J. Hughes, S. Subramoney, H. Kim, and T. Krishna, “Rasa: Efficient register-aware systolic array matrix engine for cpu,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021, pp. 253–258.
- [30] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-L. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
- [31] H. Kim et al., “Macsim: A cpu-gpu heterogeneous simulation framework user guide,” 2012.
- [32] K. Yamada, W. Li, and P. Dubey, “Intel’s mlperf results show robust cpu-based training performance for a range of workloads,” 2020, https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intels-mlper-results.html.
- [33] H.-T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37–46, 1982.
- [34] H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 821–834. [Online]. Available: https://doi.org/10.1145/3297858.3304028
- [35] C. Lattner and V. Adve, “Llvm: a compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization, 2004. CGO 2004., 2004, pp. 75–86.
- [36] Z.-G. Liu, P. N. Whatmough, and M. Mattina, “Systolic tensor array: An efficient structured-sparse gemm accelerator for mobile cnn inference,” IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 34–37, 2020.
- [37] Z.-G. Liu, P. N. Whatmough, Y. Zhu, and M. Mattina, “S2ta: Exploiting structured sparsity for energy-efficient mobile cnn acceleration,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 573–586.
- [38] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” SIGPLAN Not., vol. 40, no. 6, p. 190–200, Jun. 2005. [Online]. Available: https://doi.org/10.1145/1064978.1065034
- [39] A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, “Accelerating sparse deep neural networks,” arXiv preprint arXiv:2104.08378, 2021.
- [40] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, “Deep learning recommendation model for personalization and recommendation systems,” CoRR, vol. abs/1906.00091, 2019. [Online]. Available: https://arxiv.org/abs/1906.00091
- [41] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “Outerspace: An outer product based sparse matrix multiplication accelerator,” in ISCA, 2018.
- [42] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 27–40.
- [43] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 58–70.
- [44] K. Rocki, D. Van Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, “Fast stencil-code computation on a wafer-scale processor,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020, pp. 1–14.
- [45] A. Rodriguez et al., “Lower numerical precision deep learning inference and training,” Whitepaper, 2018.
- [46] S. Sen, S. Jain, S. Venkataramani, and A. Raghunathan, “Sparce: Sparsity aware general-purpose core extensions to accelerate deep neural networks,” IEEE Trans. Comput., vol. 68, no. 6, p. 912–925, Jun. 2019. [Online]. Available: https://doi.org/10.1109/TC.2018.2879434
- [47] S. Wang and P. Kanwar, “Bfloat16: The secret to high performance on cloud tpus,” 2019, https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
- [48] A. Sriraman and A. Dhanotia, “Accelerometer: Understanding acceleration opportunities for data center overheads at hyperscale,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20. New York, NY, USA: Association for Computing Machinery, 2020, p. 733–750. [Online]. Available: https://doi.org/10.1145/3373376.3378450
- [49] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, “Matraptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 766–780.
- [50] N. Srivastava, H. Jin, S. Smith, H. Rong, D. Albonesi, and Z. Zhang, “Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 689–702.
- [51] W. J. Starke, B. W. Thompto, J. A. Stuecheli, and J. E. Moreira, “Ibm’s power10 processor,” IEEE Micro, vol. 41, no. 2, pp. 7–14, 2021.
- [52] W. Sun, A. Zhou, S. Stuijk, R. Wijnhoven, A. O. Nelson, H. Li, and H. Corporaal, “Dominosearch: Find layer-wise fine-grained n:m sparse schemes from dense neural networks,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
- [53] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang, “Machine learning at facebook: Understanding inference at the edge,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344.
- [54] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “Sparch: Efficient architecture for sparse matrix multiplication,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 261–274.
- [55] A. Zhou, Y. Ma, J. Zhu, J. Liu, Z. Zhang, K. Yuan, W. Sun, and H. Li, “Learning n:m fine-grained structured sparse neural networks from scratch,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=K9bw7vqp_s
- [56] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus,” in MICRO, 2019.