
VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs

Geonhwa Jeong1, Sana Damani1§, Abhimanyu Rajeshkumar Bambhaniya1, Eric Qin1§,
Christopher J. Hughes2, Sreenivas Subramoney2, Hyesoon Kim1, and Tushar Krishna1

1Georgia Institute of Technology, 2Intel Labs
1{geonhwa.jeong, sdamani, abambhaniya3, ecqin}@gatech.edu, [email protected], [email protected]
2{christopher.j.hughes, sreenivas.subramoney}@intel.com
Abstract

Deep Learning (DL) acceleration support in CPUs has recently gained a lot of traction, with several companies (Arm, Intel, IBM) announcing products with specialized matrix engines accessible via GEMM instructions. CPUs are pervasive and need to handle diverse requirements across DL workloads running in edge/HPC/cloud platforms. Therefore, as DL workloads embrace sparsity to reduce the computations and memory size of models, it is also imperative for CPUs to add support for sparsity to avoid under-utilization of the dense matrix engine and inefficient usage of the caches and registers. This work presents VEGETA, a set of ISA and microarchitecture extensions over dense matrix engines to support flexible structured sparsity for CPUs, enabling programmable support for diverse DL models with varying degrees of sparsity. Compared to the state-of-the-art (SOTA) dense matrix engine in CPUs, a VEGETA engine provides 1.09×, 2.20×, 3.74×, and 3.28× speed-ups when running 4:4 (dense), 2:4, 1:4, and unstructured (95%) sparse DNN layers.

§ Sana Damani and Eric Qin are now at NVIDIA and Meta, respectively.

I Introduction

Deep learning (DL) is used in various domains including computer vision, recommendation systems, and natural language processing [53, 40, 30]. However, training and inference for these DL models require a large amount of computation and memory. To accelerate them, high-end GPUs (such as those with NVIDIA Tensor Cores [2]), domain-specific accelerators (such as Google’s TPUs [30]), and high-performance CPUs [45, 26], along with optimized libraries and frameworks [5, 11, 14, 7], have been introduced.

Although CPUs have received relatively less attention in the DL era compared with GPUs and domain-specific accelerators, CPUs are widely used in datacenters thanks to their flexibility [53]. In fact, there are cases where CPUs are more suitable than GPUs/accelerators as the primary processor for processing deep neural networks (DNNs). For example, edge devices with tight area/power budgets [46] prefer CPUs since GPUs/accelerators cannot be easily deployed in addition to CPUs. Moreover, a large number of server-class CPUs are (and will be) used in datacenters to process High Performance Computing (HPC) workloads, and adding/maintaining additional accelerators just for DL workloads would increase complexity and cost [18]. Furthermore, offloading a modest-size task to an accelerator or a GPU might not be the best option if the offloading overhead is relatively substantial [48]. Finally, large DNNs are hard to run on GPUs or accelerators because their memory capacity is smaller than that of CPUs [32]. Considering these reasons, it is not surprising that many CPU vendors, including Arm [9], Intel [26], and IBM [51], have started deploying dense matrix engines along with conventional scalar and vector engines to accelerate GEMM (i.e., general matrix multiplication), which is at the core of DL models [1]. Since these engines only focus on dense matrix computations, a challenge arises as sparsity becomes pervasive in DL. Thus, it is time to ponder how to utilize these engines efficiently for sparse DNNs on CPUs, similar to how we have evolved scalar/vector engines over many years.

Recent research has focused on adding hardware support in DNN accelerators to handle sparsity as it can be used to improve power and performance by skipping or gating ineffectual computations (since anything multiplied by zero is zero) and to reduce memory capacity and bandwidth requirements by saving only non-zero values and their indices in a compressed manner [23, 43, 42, 13]. CPUs, however, bring in the following additional challenges. (i) Being general-purpose, CPUs are particularly sensitive to the amount of hardware devoted to improving a subset of workloads. (ii) At the same time, new functional units in CPUs need to be able to support a wide variety of use cases. (iii) CPUs need programmability support to handle sparsity unlike offload accelerators with custom DMA engines for sparse accesses.

In this paper, we introduce VEGETA, Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs. To expose VEGETA to programmers, we extend the ISA with new instructions and registers, which enables accelerating sparse DNNs by removing redundant computations and reducing memory traffic. We also introduce a new light-weight matrix engine that enhances a systolic array and register file to support flexible N:M structured sparsity. Further, we demonstrate how VEGETA can accelerate both structured and unstructured sparse-dense GEMMs (SPMMs) via software transformations of the sparse matrix. This is the first work, to the best of our knowledge, to demonstrate sparsity support in a CPU matrix engine. We believe this work is extremely timely given the growing interest in industry products for sparsity, such as the Sparse Tensor Core (STC) in NVIDIA’s Ampere GPU [39] and the sparsity-aware NPU in a Samsung mobile AP [28].

Summary of Contributions:

  • We introduce VEGETA ISA extensions that include new instructions and registers for supporting sparse tile operations along with software optimizations.

  • We propose new architectural extensions to a systolic-array-based matrix engine to support flexible N:M structured sparsity in terms of sparsity pattern and granularity.

  • We explore different VEGETA engine design choices to understand the performance and area trade-offs.

  • Across a suite of GEMM/SPMM kernels, we observe 1.09×, 2.20×, 3.74×, and 3.28× speed-ups compared to a SOTA dense array when running 4:4 (dense), 2:4, 1:4, and unstructured (95%) sparse DNN layers, respectively.

II Background

II-A GEMM and DNN Layers

A DNN comprises multiple layers of potentially different types. Two of the most time-consuming layer types are fully connected layers, which are widely used in Multi-Layer Perceptrons/Transformers, and convolutional layers [12]. The underlying building block of these popular and compute-intensive DNN layers is GEMM. Fully connected layers are composed of GEMMs naturally, while modern high-performance convolutional layers are implemented with GEMMs as their innermost operation [17].

II-B Systolic Array (SA)

In the last few years, systolic arrays [33] have become a prime design choice for accelerating dense GEMM, especially in the context of DNNs, due to their simple and efficient architecture. The most well-known commercial systolic array architecture is in Google’s TPU [30]. A systolic array is a two-dimensional array composed of highly scalable, pipelined, and homogeneous processing elements (PEs). Each PE is responsible for a Multiply-and-Accumulate (MAC) operation and for forwarding operands to the neighboring PEs. An element of one matrix is pre-loaded into each PE (often called the stationary element [12]). The elements of the remaining matrices are streamed to the left and top edges of the systolic array. Then, they are propagated to the PEs using the horizontal and vertical links between neighboring PEs. This maximizes data reuse between PEs, thus minimizing the number of reads and writes between the array and external buffers.
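As a concrete illustration of this dataflow, the loop nest below is a minimal functional sketch of a WS systolic array (our own model for exposition, not cycle-accurate and not taken from this paper): each PE(r, c) holds one stationary weight, input rows stream across the array, and partial sums accumulate down each column.

#include <cstddef>
#include <vector>

// Functional model of a weight-stationary systolic array computing Out = In x W.
// W (R x C) is pre-loaded into the PEs; In (T x R) is streamed from the left;
// partial sums trickle down each PE column and exit at the bottom of the array.
void ws_systolic_gemm(const std::vector<std::vector<float>>& W,
                      const std::vector<std::vector<float>>& In,
                      std::vector<std::vector<float>>& Out) {
  const std::size_t R = W.size(), C = W[0].size(), T = In.size();
  Out.assign(T, std::vector<float>(C, 0.0f));
  for (std::size_t t = 0; t < T; ++t)       // each streamed input row
    for (std::size_t c = 0; c < C; ++c)     // each PE column (one output element)
      for (std::size_t r = 0; r < R; ++r)   // reduction down the column
        Out[t][c] += In[t][r] * W[r][c];    // MAC performed by PE(r, c)
}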

II-C Sparsity in DNNs

Figure 1: Comparison of unstructured sparsity, tile-wise 2:4 sparsity, and row-wise N:4 sparsity (different rows may have different N values between 0 and 4). The tile size is 8×8 in this example and the block size is 4.
Figure 2: Compression of a matrix with N:M structured sparsity. M determines the bit width of the indices.

Even though the number of computations required for executing a DNN is huge, there is an opportunity to reduce that significantly by leveraging sparsity in DNNs. We describe different dimensions of sparsity in DNN inference.

Sparsity source. Weights and input activations are the main sources of sparsity in DNNs. Weight (static) sparsity is derived by pruning some edges in a DNN with a small sacrifice in accuracy [20, 19]. This leads to zeros in the weight matrix. Input (dynamic) sparsity is usually induced by a popular activation function, Rectified Linear Unit (ReLU), which clips negative inputs to zero, leading to a significant fraction of zero-valued input activations for the next layer.

Sparsity degree. Sparsity degree is the fraction of zeros in a given matrix. The degree of weight sparsity can often be tuned during the pruning process. However, the degree of input sparsity is usually non-deterministic and unpredictable since the actual values of input activations are determined at runtime.

Sparsity pattern. The non-zeros for a DNN may have a pattern, such as certain channels of weights being all zero (through channel pruning) or each block with M elements having at most N non-zeros (often called N:M sparsity [52]). “Unstructured sparsity” indicates the lack of a pattern.

Sparsity granularity. Sparsity patterns may exist at different granularities. For example, “network-wise N:M sparsity” indicates all layers in a network have the same N:M ratio, while “layer-wise N:M sparsity” means different layers may have different N:M ratios. During execution, a layer is usually decomposed into 2D tiles; each tile is composed of rows of vectors. Similar to the network granularity explained before, sparsity patterns can exist at the tile or row granularity. In Figure 1, we compare matrices with different sparsity patterns/granularities, with a tile size of 8×8. Previous HW support focused on either network-wise [39, 36, 56] or layer-wise N:M sparsity [37], while we target row-wise N:M sparsity to provide broader flexibility as well as support for some unstructured sparsity, as shown in Table I.

III Motivation and VEGETA Overview

III-A Vector vs. Matrix Engine

Single Instruction Multiple Data (SIMD) and/or short vector execution have been supported in mainstream CPUs for several generations. Due to the smaller granularity of vectors compared to matrices, the industry has started integrating matrix engines inside CPUs [26, 9, 24]. A matrix engine can provide power-efficient high compute throughput via simple control logic and significant internal data reuse. For example, in Intel’s upcoming Sapphire Rapids processors, the peak matrix compute capability per cycle of each core is 8× the vector capability [27]. In Figure 3, we show effective compute throughputs of sparse/dense matrix/vector engines on a convolutional layer with different densities derived from a roofline model. We assume 64 and 512 GFLOPS for the vector and matrix engines, respectively, with a memory bandwidth of 94 GB/s [25]. For the 100% dense case, the dense matrix (vector) and sparse matrix (vector) engines achieve the same compute throughput since no computation can be skipped. We observe that sparse engines outperform dense engines significantly by skipping ineffectual computations, especially when density is low. Also, there is a significant gap in compute throughput between vector and matrix engines. Moreover, due to the smaller granularity of vector instructions, the same GEMM kernel requires many more instructions to be executed when using vector engines, contributing to the runtime gap shown in Figure 4 (we estimated them using a cycle-accurate simulator, MacSim [31]). When memory bound, i.e., at extremely low density and thus arithmetic intensity, vector compute throughput is sufficient, so a sparse vector engine performs similarly to a sparse matrix engine.
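The gap in Figure 3 can be reproduced qualitatively with a simple roofline estimate. The sketch below is our own reading of that model (the 64/512 GFLOPS peaks and 94 GB/s bandwidth come from the text; the arithmetic intensity value and the assumption that a sparsity-aware engine's compute roof scales with 1/density are ours):

#include <algorithm>
#include <cstdio>
#include <initializer_list>

// Effective (dense-equivalent) throughput = min(compute roof, memory roof).
// A sparsity-aware engine skips zeros, so its compute roof grows as density drops.
double effective_gflops(double peak_gflops, double bw_gbs,
                        double intensity_flops_per_byte, double density,
                        bool sparsity_aware) {
  double compute_roof = sparsity_aware ? peak_gflops / density : peak_gflops;
  double memory_roof = bw_gbs * intensity_flops_per_byte;
  return std::min(compute_roof, memory_roof);
}

int main() {
  const double kVectorPeak = 64.0, kMatrixPeak = 512.0, kBandwidthGBs = 94.0;
  const double kIntensity = 16.0;  // hypothetical FLOPs/byte for the layer
  for (double density : {1.0, 0.5, 0.25, 0.05})
    std::printf("density %.2f: sparse matrix %7.1f GFLOPS, sparse vector %7.1f GFLOPS\n",
                density,
                effective_gflops(kMatrixPeak, kBandwidthGBs, kIntensity, density, true),
                effective_gflops(kVectorPeak, kBandwidthGBs, kIntensity, density, true));
  return 0;
}

At high density the matrix engine's 8× compute advantage dominates; at very low density both engines hit the bandwidth roof, matching the observation above.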

Figure 3: Effective compute throughput for dense/sparse vector/matrix engines using a roofline model.
Figure 4: Executed instruction counts and runtime ratio comparison on a CPU with matrix engines on GEMM workloads with equal-sized dimensions.

III-B Structured vs. Unstructured Sparsity

CPUs are general-purpose processors. Ideally, a CPU should be able to support any sparsity pattern, including unstructured sparsity. However, this introduces two practical challenges.

Challenge 1: Programmability. GEMM implementations typically partition the matrices into tiles and iterate over the tiles to optimize data reuse in caches, facilitate parallelization, etc. Innermost GEMM kernels are often optimized for a predetermined tile size, using instructions operating on fixed size tiles [17]. The irregular nature of unstructured sparsity means we cannot know tile sizes a priori, and further, they may all be different. In the context of how software is written, this makes it tricky to define an easy-to-use ISA.

Figure 5: Comparison of utilization of PEs in a Weight Stationary (WS) systolic array and VEGETA engines with sparse weights.

Challenge 2: Implementation overheads. There are trade-offs between different options for the register file (RF) and systolic array (SA) when supporting sparsity. The RF feeding the SA could hold a dense (i.e., conventional, with zeros) or sparse/compressed representation of each tile; if sparse, we need indexing logic and metadata to match each non-zero to the appropriate values from the other matrix. Also, the SA could be comprised of conventional PEs (i.e., dense) or be enhanced to be sparsity-aware and skip zeros, at the cost of additional interconnects inside each PE [43]. Naturally, all these structures and interconnects add overhead.

To address these challenges, we make the following design decisions: (i) We limit our scope to sparsity support for DL workloads, where the typical sparsity degree is up to 95%. Further, we add HW support for flexible N:M structured sparsity, leveraging insights from a recent work [52] which showed that layer-wise N:M sparsity achieves better accuracy than network-wise sparsity. We also show how the HW can support unstructured sparsity by transforming the target matrix using row-wise N:M sparsity. (ii) We add sparsity support in both the RF (Section IV) and SA (Section V) to achieve efficient utilization of the register storage and MACs.

TABLE I: Comparison of sparsity granularity of previous works.
Design | Network-wise | Layer-wise | Tile-wise | Row-wise
NVIDIA STC [39] | ✓ | | |
STA [36] | ✓ | | |
S2TA [37] | ✓ | ✓ | ✓* |
VEGETA | ✓ | ✓ | ✓ | ✓
* They do not claim tile-wise support, but their design can be extended to it.

III-C HW Support for Flexible N:M Structured Sparsity

Figure 5 shows under-utilization challenges for a conventional WS systolic array when used with sparse weights. A WS systolic array keeps the weight values stationary over time while inputs and outputs are streamed in a skewed manner. If sparse weights with 2:4/1:4 structured sparsity are used, only 50%/25% of PEs are mapped with non-zero weights, while the rest perform useless computation. This work proposes an enhanced systolic-array-based matrix engine which maps only non-zero weight values and distributes/reduces the correct input/output elements at the right time, leveraging the strengths of a systolic array in the presence of some irregularity in the inputs. This also requires logic in a PE to pick the right input elements for MACs. The indexing logic and the input distribution and output reduction logic to support flexible N:M structured sparsity (i.e., the purple boxes in Figure 5) are presented in Section V. In terms of sparsity granularity, we support not only network/layer/tile-wise but also row-wise N:M sparsity. In Table I, we compare the supported sparsity granularity of our design and previous works that support N:M sparsity.

III-D Transforming Unstructured to Row-Wise N:M Sparsity

While native support for unstructured sparsity can provide higher accuracy with extreme sparsity, the area overhead to support it, in the form of a highly reconfigurable network-on-chip (NoC) [43] and a sizable sparse controller [23, 16], is only justifiable in standalone accelerators [6, 44], not within CPUs. However, given an unstructured sparse tile, one can derive a row-wise N:M sparse tile that covers all non-zeros in the given sparse tile by selecting an appropriate N:M for each row. For example, assuming 1:4, 2:4, and 4:4 are the available sparsity patterns, one can analyze each row of the target unstructured tile to find the sparsest N:M pattern that covers all non-zeros in the row. Then, each row can be compressed using the corresponding N:M sparsity. For example, the first and second rows of Figure 1 (a) would be compressed with 2:4 while the third and fourth would be compressed with 1:4 to guarantee that no non-zero values are lost. This transformation does not cause any accuracy drop since it is lossless, meaning that all non-zeros in the original unstructured sparse matrix still exist in the corresponding structured sparse matrix. Figure 1 (c) is derived from Figure 1 (a) using this transformation, covering all non-zeros (similarly, tile-wise 2:4 is used to derive Figure 1 (b) from Figure 1 (a)). We use this transformation to leverage unstructured sparsity using VEGETA and show the estimated performance gain in Section VI-E.
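A minimal sketch of the per-row selection step (our own illustrative code, assuming the 1:4/2:4/4:4 patterns named above and a row length that is a multiple of 4): for each row, find the largest non-zero count in any 4-element block and round it up to the next supported N.

#include <algorithm>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Returns the smallest supported N such that every 4-element block of the row
// has at most N non-zeros, so compressing the row as N:4 is lossless.
int row_wise_n_of_4(const std::vector<float>& row) {
  int worst_block = 0;  // max non-zeros observed in any block of 4
  for (std::size_t b = 0; b < row.size(); b += 4) {
    int nnz = 0;
    for (std::size_t i = b; i < b + 4; ++i) nnz += (row[i] != 0.0f);
    worst_block = std::max(worst_block, nnz);
  }
  for (int n : {1, 2, 4})  // supported patterns: 1:4, 2:4, 4:4
    if (worst_block <= n) return n;
  return 4;  // unreachable for blocks of size 4
}

For the tile in Figure 1 (a), this procedure would return 2 for the first two rows and 1 for the third and fourth rows, consistent with the example above.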

III-E VEGETA Design Overview

This work presents VEGETA, which includes ISA extensions and microarchitecture support for structured/unstructured sparsity using flexible N:M fine-grained structured sparsity HW [52, 55] in CPUs. We present a detailed design for specific, important points (1:4, 2:4, 4:4) to explain the extensions to both the ISA and the microarchitecture, but both can naturally be extended to different block sizes (M).

We use a 32×16 conventional WS systolic array as the baseline, inspired by RASA [29] and Intel’s TMUL [26]. Comparing against a dense matrix engine rather than a vector engine provides a strong baseline due to the huge gap in compute throughput between typical matrix engines and vector engines (Section III-A). Our proposed VEGETA engine maintains the same number of MAC units as the baseline, adding the ability to skip zero-valued weights via new control logic, multiplexers, and adders for reductions, along with some wider data paths. We target mixed precision with BF16/FP32, which is widely used for both inference and training on commercial devices [26, 47, 3].

IV VEGETA Instruction Set Architecture

IV-A Register File Support

Figure 6: VEGETA tile registers and metadata registers.

Inspired by Intel AMX [26], we assume there are eight 1 KB tile registers (treg0-7), each comprising 16 rows of 64 Bytes. We define a tile as a 2D block comprising rows of data separated by a fixed stride. We define an effective tile as the larger sparse tile captured by the non-zeros present in the compressed tile (along with metadata) in the case of a sparse matrix. A tile register can hold 16×32 BF16 elements or 16×16 FP32 elements. To support 2:4 and 1:4 sparsity, we introduce aliased tile registers of size 2 KB (utile register or ureg) and 4 KB (vtile register or vreg), respectively. One ureg is composed of two consecutive tregs, while one vreg is composed of two consecutive uregs, as shown in Figure 6.

Next, we introduce metadata registers (mreg0-7) to store metadata for sparse tiles. As shown in Figure 2, a pair of bits in the metadata represents the position of one non-zero element in a block of the compressed sparse matrix. Since a single row of the tile register holds 32 non-zero BF16 elements, the corresponding metadata register row holds 32×2 bits of metadata. Hence, an mreg has 16 rows, each with 64 bits of metadata, for a total of 128 Bytes. Note that while a treg can hold 16×32 BF16 elements, a treg and an mreg for 2:4 sparsity can be used to store data for an effective tile whose dimension is 16×64. Similarly, when an mreg is used for 1:4 sparsity, a treg and an mreg can represent an effective tile whose dimension is 16×128.
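To make the layout concrete, the sketch below compresses one dense row into the treg/mreg format described above for the 2:4 case (an illustrative encoding consistent with Figure 2; the struct and function names are ours, and BF16 values are carried as raw 16-bit words):

#include <cstddef>
#include <cstdint>
#include <vector>

struct CompressedRow {
  std::vector<std::uint16_t> values;  // kept non-zero BF16 values (2 per 4-wide block)
  std::vector<std::uint8_t> meta2b;   // 2-bit position of each kept value within its block
};

// Assumes the row is already 2:4 sparse (at most 2 non-zeros per 4-element block).
CompressedRow compress_2_4(const std::vector<std::uint16_t>& dense_row) {
  CompressedRow out;
  for (std::size_t b = 0; b < dense_row.size(); b += 4) {
    int kept = 0;
    for (int i = 0; i < 4 && kept < 2; ++i) {
      if (dense_row[b + i] != 0) {  // keep a non-zero and remember its position
        out.values.push_back(dense_row[b + i]);
        out.meta2b.push_back(static_cast<std::uint8_t>(i));
        ++kept;
      }
    }
    while (kept < 2) {  // pad so every block contributes exactly two slots
      out.values.push_back(0);
      out.meta2b.push_back(0);
      ++kept;
    }
  }
  return out;
}

Compressing a 16×64 effective tile this way yields 16×32 stored values (one 1 KB treg) plus 64 bits of metadata per row (one 128 B mreg).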

TABLE II: VEGETA Instructions (MD refers to the metadata of a tile).
Instruction | Operands | Explanation
tile_load_t | dst: treg, src: ptr[TILE] | Load 1 KB from ptr[TILE] to treg.
tile_load_u | dst: ureg, src: ptr[TILE] | Load 2 KB from ptr[TILE] to ureg.
tile_load_v | dst: vreg, src: ptr[TILE] | Load 4 KB from ptr[TILE] to vreg.
tile_load_m | dst: mreg, src: ptr[MD] | Load 128 B from ptr[MD] to mreg.
tile_store_t | dst: ptr[TILE], src: treg | Store 1 KB from treg to ptr[TILE].
tile_gemm | dst, src0: treg, src1: treg, src2: treg | Multiply dense tile src1 with dense tile src2, and add the result back to tile dst.
tile_spmm_u | dst, src0: treg, src1: treg, src2: ureg | Multiply sparse tile src1 (2:4) with dense tile src2, and add the result back to tile dst.
tile_spmm_v | dst, src0: treg, src1: treg, src2: vreg | Multiply sparse tile src1 (1:4) with dense tile src2, and add the result back to tile dst.
tile_spmm_r | dst, src0: ureg, src1: treg, src2: ureg | Multiply sparse tile src1 (row-wise N:4) with dense tile src2, and add the result back to tile dst.

IV-B VEGETA Instructions

Table II summarizes the VEGETA instructions using the aforementioned registers. tile_load_t, tile_load_u, and tile_load_v load a 1 KB, 2 KB, and 4 KB tile from the specified address to a treg, ureg, and vreg, respectively, while tile_store_t stores a 1 KB tile from a treg to memory. tile_load_t can be used to load either a dense tile or the non-zero values of a compressed sparse tile. In the latter case, the 1 KB tile has an effective tile size of 2 KB or 4 KB for 2:4 and 1:4 sparsity ratios, respectively, as mentioned above. Furthermore, the load of a sparse tile must be accompanied by a corresponding tile_load_m instruction, which loads 128 B of metadata from memory to an mreg.

The tile_gemm, tile_spmm_u, and tile_spmm_v instructions perform a tile matrix multiply and accumulate, C += A × B, where A and B are BF16 tiles and C is the FP32 output. A holds a tile of a sparse matrix. With data in compressed format, A holds a fixed number of non-zeros, but with higher sparsity (smaller N:M), its effective size is larger. In contrast, the actual size of the dense B tile must grow as A gets sparser. For example, assuming a dense A tile is 16×32, the effective size of A for a 2:4 sparsity ratio is 16×64 (2 KB), while the non-zero values still fit into a 1 KB treg. Thus, the corresponding B tile should be 64×16, which fits into a 2 KB ureg. Similarly, for a 1:4 ratio, the effective size of A is 4 KB and the corresponding B tile fits into a 4 KB vreg. Note that the corresponding output tile C is a constant size (16×16, FP32) and fits in a 1 KB treg. We call tile_gemm a VEGETA tile GEMM instruction, and tile_spmm_u and tile_spmm_v VEGETA tile SPMM instructions.

To summarize, tile_gemm performs a dense (4:4) GEMM operation on 1 KB treg inputs, while tile_spmm_u performs an SPMM operation where A is a 2:4 compressed sparse 1 KB tile, B is a dense 2 KB tile, and the output C is a dense 1 KB tile. Thus, tile_spmm_u calculates C (16×16, FP32) += A (16×64, BF16) × B (64×16, BF16). Similarly, tile_spmm_v calculates C (16×16, FP32) += A (16×128, BF16) × B (128×16, BF16). Note that C is used as both input and output.
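A functional (non-pipelined) reading of the tile_spmm_u semantics is sketched below, using plain float arrays in place of BF16/FP32 registers; it is our own emulation of the definition above, not the hardware datapath:

#include <cstdint>

// C (16x16) += A (2:4-compressed: 16 rows x 32 stored non-zeros) x B (64x16).
// A_meta holds the 2-bit column of each stored non-zero within its 4-wide block.
void emulate_tile_spmm_u(float C[16][16], const float A_vals[16][32],
                         const std::uint8_t A_meta[16][32], const float B[64][16]) {
  for (int i = 0; i < 16; ++i) {
    for (int s = 0; s < 32; ++s) {        // 32 stored non-zeros per row of A
      int block = s / 2;                  // two non-zeros kept per 4-wide block
      int k = block * 4 + A_meta[i][s];   // effective column in the 16x64 A tile
      for (int j = 0; j < 16; ++j)
        C[i][j] += A_vals[i][s] * B[k][j];
    }
  }
}

For tile_spmm_v, the same skeleton applies with one stored non-zero per block (block index = s) and a 128×16 B tile.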

Listing 1: An implementation of SPMM.
for (int i = 0; i < Dm/Tm; i++) {
  for (int j = 0; j < Dn/Tn; j++) {
    for (int k = 0; k < Dk/Tk; k++) {
      // C[i][j] += A[i][k] * B[k][j]
      tile_load_u ureg0, tileB[j][k];
      tile_load_t treg2, tileC[i][j];
      tile_load_t treg3, tileA[i][k];
      tile_load_m mreg3, metadataA[i][k];
      tile_spmm_u treg2, treg3, ureg0;
      tile_store_t tileC[i][j], treg2;
    }
  }
}

The number of useful MAC operations required to calculate C is the same (8192) for tile_gemm, tile_spmm_u, and tile_spmm_v. For each output element, the number of effectual MAC computations is 32. Finally, to support SPMM with a row-wise N:M sparse matrix A, we introduce tile_spmm_r. tile_spmm_r calculates C (R×16, FP32) += A (R×64, BF16) × B (64×16, BF16), where R can vary from 8 to 32 depending on the N:4 sparsity of each row (which is stored as extra metadata, at most 32×2 bits, or 8 B).

We show an example SPMM kernel assuming A with 2:4 sparsity using VEGETA instructions in Listing 1. Dm, Dn, and Dk indicate the size of each dimension while Tm, Tn, and Tk show the corresponding tile sizes. In this case, Tm, Tn, and Tk are 16, 16, and 64, respectively. We store the values of matrix B in a transposed manner in the tile registers. We implemented an optimized version of this kernel to evaluate our architecture that we introduce in Section V.

IV-C Flexibility in the Block Size, M

In this work, we assume M = 4 to explain our extension in detail, which can handle different fine-grained structured sparsity patterns including 1:4/2:4/4:4, but our extension is not limited to this specific value of M. While we use M = 4 in our implementation, our approach can be extended to M = 2^m by modifying the tile/metadata registers and ISA to support larger input registers. A larger M provides greater flexibility to the sparse model design and may result in improved accuracy [52], but would cost more HW.

V VEGETA Engine Architecture

V-A Processing Units and Processing Elements

Figure 7: Two VEGETA designs: VEGETA-D-1-1 and VEGETA-S-2-2.
TABLE III: Different VEGETA-D and VEGETA-S Designs.
VEGETA Engine | N_rows | N_cols | # of MACs per PE (α×β) | # of inputs per PE | Broadcast Factor (α) | Drain Latency | Supported Sparsity | Example from Prior Work
VEGETA-D-1-1 | 32 | 16 | 1 | 1 | 1 | 16 | Dense | Conventional SA [33], RASA-SM [29]
VEGETA-D-1-2 | 16 | 16 | 2 | 2 | 1 | 16 | Dense | RASA-DM [29]
VEGETA-D-16-1 | 32 | 1 | 16 | 1 | 16 | 1 | Dense | Intel TMUL [26]-inspired unit
VEGETA-S-1-2 | 16 | 16 | 2 | 8 | 1 | 16 | 1:4, 2:4, 4:4 | New design
VEGETA-S-2-2 | 16 | 8 | 4 | 8 | 2 | 8 | 1:4, 2:4, 4:4 | New design
VEGETA-S-4-2 | 16 | 4 | 8 | 8 | 4 | 4 | 1:4, 2:4, 4:4 | New design
VEGETA-S-8-2 | 16 | 2 | 16 | 8 | 8 | 2 | 1:4, 2:4, 4:4 | New design
VEGETA-S-16-2 | 16 | 1 | 32 | 8 | 16 | 2 | 1:4, 2:4, 4:4 | New design

Processing Unit (PU). A PU is composed of a number of MAC units that contribute to the same output element. In a WS systolic array, the dot product to produce each output element is mapped to a column of PUs. Partially accumulated results trickle down a column, and the final result of the dot product exits the bottom of the array. In a conventional design, a PU is composed of one MAC unit, the height of the array is the length of the dot product, and each PU produces a single partial sum each cycle. We can, however, break the set of operations for each dot product into pieces, or “lanes.” If we pack multiple MAC units into a PU, each PU can work on all lanes in parallel, and we can scale down the height of the systolic array. In this case, we must combine the partial sum for each lane at the bottom of the systolic array. We call the number of lanes (i.e., the number of MAC units in a PU), or the reduction factor, β. β also indicates how many partial sums need to be reduced at the bottom of the systolic array to generate a single output element.

Processing Element (PE). We group PUs that share the same eastbound inputs and output buffers into a PE. That is, peer-to-peer forwarding happens between PEs, and the input fed from the west is broadcast to all PUs in a PE. The more PUs in a PE, the narrower the systolic array, and the more we amortize the overhead of the horizontal pipeline buffer. This improvement in area and power comes at the cost of lower achievable frequency since the broadcast must reach more PUs. We call the number of PUs in a PE, or the broadcast factor, α. We label PE designs as PE-α-β. For example, PE-1-1 indicates each PE has one single-MAC PU, while PE-4-2 indicates each PE has four two-MAC PUs (PU-2s).

SPU and SPE. We enhance PUs and PEs to support tile SPMM with flexible N:M structured sparsity. To distinguish them, we call a PU and PE without sparsity support a Dense Processing Unit (DPU) and Dense Processing Element (DPE), respectively, while we call a PU and PE with sparsity support a Sparse Processing Unit (SPU) and Sparse Processing Element (SPE), respectively. DPEs and SPEs are the building blocks for VEGETA-D (for dense) and VEGETA-S (for dense and sparse) engines, which we explain in the following sub-sections. The main differences between a DPE and an SPE are the M-to-1 MUXes and a metadata buffer added to each weight buffer. Each cycle, an SPE receives multiple input elements and uses this extra hardware to select, for each weight, the one corresponding to its index. To support flexible N:M structured sparsity, we choose β as M/2; this ensures that input elements need only be fed into a single row. Since we use M = 4, we use β = 2 for our SPEs. In summary, we use DPE-α-β and SPE-α-β to indicate the broadcast factor (α) and the reduction factor (β) of each PE. For example, the broadcast factor (α) of SPE-2-2 (shown in Figure 7 (b)) is 2, which indicates that a single SPE-2-2 is composed of two (= α) SPU-2s.

Figure 8: Comparison of executions of tile_gemm, tile_spmm_u, and tile_spmm_v on VEGETA-S-2-2.

Execution flow of DPE/SPE: DPE-1-1, a conventional single-MAC PE, operates as follows. First, a weight element is loaded into its weight buffer. Then, an input element and partial sum element are streamed from the left and top ports, respectively. The input element is multiplied with the element in the weight buffer and accumulated with the partial sum element. The generated partial sum flows to the PE below the current PE, and the input element flows to the PE on the right.

An SPE is composed of α SPUs, each of which comprises β MAC units. Different SPUs calculate partial sums in parallel for different output elements, while the MAC units in an SPU calculate partial sums in parallel for the same output element. An SPE-2-2 receives two (= β) input blocks (instead of one input element) from the west and broadcasts them to its α SPUs. An input block is fed into each M-to-1 MUX in an SPU, and using the metadata for the corresponding weight value in the weight buffer, the corresponding input element from the block is selected. Then, each selected input element is multiplied by the corresponding weight element and accumulated. Since each SPU generates β partial sums, and there are α SPUs, an SPE forwards a total of α×β partial sums (for example, SPE-2-2 generates 2×2 = 4 partial sums) to the south port.

V-B VEGETA Matrix Engine for Tile-Wise N:M Sparsity

Figure 9: Cycle-level visualization of the tile_spmm_v instruction on VEGETA-S-2-2 with 1:4 structured sparsity for matrix A, with dimensions A: 16×128 (yellow), B: 128×16 (magenta), C: 16×16 (green).

Using DPEs and SPEs, we show how to build various VEGETA engines. We use -{S (for sparse) | D (for dense)}-α-β to indicate the type of the base PE of a VEGETA engine. For example, VEGETA-D-1-1 has DPE-1-1s as its PEs, while VEGETA-S-2-2 is composed of SPE-2-2s. In Figure 7, we show VEGETA-D-1-1 and VEGETA-S-2-2 as examples. Note that adders (or adder trees) are needed at the bottom of the VEGETA engine if the reduction factor β > 1 to generate the final output element by reducing partial sums. A VEGETA engine is a 2D array of N_rows × N_cols PEs, where neighboring PEs are connected. Since a column of SPUs cooperates to calculate a single output element, N_rows can be derived as N_rows = (# of effectual computations per output element) / β. The number of effectual computations per output element is 32 for the VEGETA tile GEMM/SPMM instructions. With N_rows, the number of PUs in a PE (α), the number of MACs in a PU (β), and the total number of MAC units in a VEGETA engine, N_cols can be derived as N_cols = (# of total MAC units) / (N_rows × α × β). In Table III, we summarize different VEGETA engine design choices.
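As a quick check of these formulas, the snippet below recomputes N_rows and N_cols for the VEGETA-S configurations in Table III (512 total MAC units, matching the 32×16 baseline, and 32 effectual MACs per output element):

#include <cstdio>

int main() {
  const int total_macs = 512;      // same MAC budget as the 32x16 baseline
  const int macs_per_output = 32;  // effectual computations per output element
  struct Cfg { const char* name; int alpha, beta; };
  const Cfg cfgs[] = {{"VEGETA-S-1-2", 1, 2},  {"VEGETA-S-2-2", 2, 2},
                      {"VEGETA-S-4-2", 4, 2},  {"VEGETA-S-8-2", 8, 2},
                      {"VEGETA-S-16-2", 16, 2}};
  for (const Cfg& c : cfgs) {
    int n_rows = macs_per_output / c.beta;                  // N_rows = 32 / beta
    int n_cols = total_macs / (n_rows * c.alpha * c.beta);  // N_cols formula above
    std::printf("%s: N_rows=%d, N_cols=%d\n", c.name, n_rows, n_cols);
  }
  return 0;
}

The printed values match the N_rows/N_cols columns of Table III.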

Each row of a VEGETA-S engine has a special unit called an Input Selector to select the right input blocks to support flexible structured sparsity. In Figure 8, we show how different VEGETA tile GEMM/SPMM instructions are executed on VEGETA-S-2-2 focusing on one SPE and the corresponding input selector. For all three cases, non-zero values in each row of a weight tile are mapped onto a column of SPU-2s (i.e. 2 columns of MAC units). For example, weight elements 1-32 are mapped onto the first column of SPU-2 while weight elements 33-64 are mapped onto the second column of SPU-2.

4:4 structured sparsity: For tile_gemm (4:4 sparsity), the weight tile is dense; thus, there are four effectual computations for an output element per input block (of size 4). A half input block (two input elements) is fed into a row of PEs from the west port, and they are broadcast to each SPU in an SPE. Since the weight tile is dense for tile_gemm, the first element of an input block is multiplied with the first weight element in an SPU, while the second element of the input block is multiplied with the second weight element in the SPU. Similar to a classic systolic array, the input is fed in a skewed manner so that the reduction of partial sums happens in a spatio-temporal manner along a column of MAC units. Once it reaches the bottom of the array, the partial sums from each MAC column in an SPU are accumulated in a reduction unit (an adder in this case) to get the final output element.

2:4 structured sparsity: For tile_spmm_u (2:4 sparsity), an input block (instead of a half block) is fed into a row of PEs from the left, and it is broadcast to each SPU in an SPE. The 4-to-1 MUXes in an SPU choose the corresponding input element from the input block to be used in a MAC unit. In this case, the same block is shared between the MAC units in an SPU since there are two effectual computations for an output element per input block due to the 2:4 structured sparsity.

1:4 structured sparsity: Finally, for tile_spmm_v (1:4 sparsity), there is only one effectual computation for an output element per input block. Thus, two input blocks are fed into a row of PEs from the left, and they are broadcast to each SPU in an SPE. Each MAC unit in an SPU gets an input block and chooses and computes with the corresponding input element in the block. Unlike tile_spmm_u, there is only one non-zero element in a weight block due to the 1:4 structured sparsity. Thus, the two weight elements in an SPU belong to two different weight blocks, which implies that they need two different input blocks for computation. In Figure 9, we show a detailed cycle-level visualization of how a VEGETA-S-2-2 executes a tile_spmm_v, which computes a tile SPMM with 1:4 sparse A (weight), dense B (input), and dense C (output).

V-C Optimizations

Pipelining. So far, we have shown how one VEGETA tile GEMM/SPMM instruction can be executed on a VEGETA engine. For modest-sized tiles, filling and draining the systolic array for a tile GEMM/SPMM instruction can significantly lower PE utilization. A recent work, RASA [29], introduced pipelining of the execution of tile GEMM instructions on a dense systolic-array-based matrix engine; this overlaps different stages of execution for different instructions. We extend the pipelining concept and show how different tile SPMM instructions can be executed concurrently.

Similar to the stages defined in the previous work, Weight Load (WL), Feed First (FF), Feed Second (FS), and Drain (DR) stages are used for pipelining both VEGETA-D and VEGETA-S designs. WL is the first stage of execution on the systolic array, where weight elements (which will be stationary) are loaded into the corresponding PEs from the north ports; this takes N_rows cycles. Next, during the FF stage, the corresponding input and output elements are fed from the west and north ports, which takes T_n cycles, where T_n is the number of columns in an input tile. This stage ends when the top-left PE stops receiving input/output elements. Since the inputs are streamed in a skewed manner, the remaining rows of the systolic array are still receiving new input elements from the west ports. This stage is called the FS stage and ends when no more input elements come in from the west ports, which takes N_rows - 1 cycles. Finally, to finish the remaining calculations in the systolic array, the DR stage takes N_cols cycles, followed by the cycles required for reduction at the bottom. Note that we need an additional stage for reduction after the DR stage to wait for the remaining values in the reduction units. Also, unlike a conventional systolic array where each PE receives one input, weight, and output element at a time, β input blocks, α weight elements, and α output elements must be fed to each SPE. Figure 9 shows the streaming pattern of input element blocks.

N_rows and N_cols can be reduced by increasing α and β, so larger α and β may reduce the latency of a single instruction. However, a larger α lowers the achievable frequency, while a larger β lengthens the reduction latency and enlarges the reduction unit, so we cannot increase them arbitrarily. Furthermore, to properly overlap different stages, it is crucial to balance the latencies of the different stages. Different VEGETA-S designs can use the same pipelining stages since they follow the same streaming pattern.
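To make the stage balance concrete, the sketch below evaluates the stage latencies given above for VEGETA-D-1-2 and VEGETA-S-16-2 (our own back-of-the-envelope calculation; treating the issue interval as the longest stage is our simplification of the scheduling rule described next):

#include <algorithm>
#include <cstdio>

int main() {
  struct Cfg { const char* name; int n_rows, n_cols, t_n; };
  const Cfg cfgs[] = {{"VEGETA-D-1-2", 16, 16, 16}, {"VEGETA-S-16-2", 16, 1, 16}};
  for (const Cfg& c : cfgs) {
    int wl = c.n_rows;      // Weight Load
    int ff = c.t_n;         // Feed First (columns of the input tile)
    int fs = c.n_rows - 1;  // Feed Second
    int dr = c.n_cols;      // Drain (reduction tail follows)
    int issue_interval = std::max({wl, ff, fs, dr});  // longest stage gates back-to-back issue
    std::printf("%s: WL=%d FF=%d FS=%d DR=%d -> next instruction after %d cycles\n",
                c.name, wl, ff, fs, dr, issue_interval);
  }
  return 0;
}

Both configurations admit a new instruction every 16 cycles, consistent with the observation in the next paragraph, while the much shorter drain stage of VEGETA-S-16-2 shortens each instruction's latency.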

Figure 10 (a) and (b) show the pipelining examples for VEGETA-D-1-2 and VEGETA-S-16-2, assuming no dependency between instructions. With pipelining, multiple instructions may be executed in the systolic array concurrently, but no two of them may be in the same stage of execution, since that would oversubscribe at least one hardware resource. Since the latency of each stage is known, the scheduler can easily enforce this. We observe that the next instruction can be executed after 16 cycles for VEGETA-S-16-2, which is the same as for VEGETA-D-1-2, since both are limited by the total number of MAC units (maximum compute throughput). Due to the smaller N_cols, the latency of each instruction for VEGETA-S-16-2 is shorter than that of VEGETA-D-1-2.

Figure 10: Pipelining on VEGETA-D-1-2/VEGETA-S-16-2.

Output forwarding. Pipelining allows concurrent execution of independent instructions. However, it is often not allowed if the instructions have a dependence between them. Since VEGETA tile GEMM/SPMM instructions perform accumulation, the destination register is a source as well; thus, for two back-to-back tile multiplication instructions with the same destination register, the second cannot begin execution until the first finishes. A traditional approach to resolve this kind of pipeline stall is data forwarding. We extend this “output forwarding (OF)” concept to matrix engines. In Figure 10 (c) and (d), we show the pipelining examples of dependent instructions for VEGETA-S-16-2 with and without output forwarding. Since we feed a C tile into the systolic array over multiple cycles, we only need to be sure that particular C elements are ready before we need to feed them. A key insight is that the register reads/writes of the C tile follow exactly the same order in a systolic array. This is because every output element is calculated N_rows + log2(β) cycles after it is fed into the systolic array. For example, when executing on VEGETA-S-16-2, a tile_spmm_u starts reading the C tile when the FF stage begins (cycle N_rows + 1). In the same order, the C tile is written back from cycle 2 × N_rows + log2(β), so that a dependent instruction can start reading the same C tile. This can significantly reduce the stalls which would have occurred between instructions with a dependency without OF, but a bypass buffer is required to keep and forward the data before writing it back to the tile register.

V-D Flexibility in the Block Size, M

Similar to the case of the VEGETA ISA supporting a broader range of sparsity, VEGETA engines can be easily extended for larger M. For example, to support the M = 16 case (for 1:16, 2:16, 4:16, 8:16, and 16:16), a 16-to-1 MUX (or five 4-to-1 MUXes) is needed per MAC unit, which would take an input block composed of 16 elements and select the corresponding input element. α and β should be configured considering the balance of pipeline stages, data reuse, critical path, etc.

V-E Row-Wise N:M Sparsity Support

VEGETA engines support row-wise N:M sparsity, which can be used to leverage unstructured sparsity using the method described in Section III-D. In Figure 11, we show a row-wise sparse matrix using 4:4, 2:4, and 1:4 at Row 1, Rows 2/3, and Rows H_A-3 to H_A, respectively. A row with 4:4 maps to an SPE-1-4 column, and the outputs of the four SPU columns reduce to generate one partial sum. A row with 2:4 maps to an SPE-2-2 column, and the outputs of each pair of SPU columns reduce, resulting in two partial sums per SPE column. A row with 1:4 maps to an SPE-4-1 column, generating 4 partial sums per SPE column. Using this mapping, we can ensure all columns are fully utilized while allowing different N:M sparsity across different rows. The number of columns of a weight tile, W_A, is equal to M × N_rows to keep all columns fully utilized.

Figure 11: Mapping of a row-wise N:M sparse matrix on VEGETA-S.

Similar to Figure 8 (b), an input block (4 elements) is fed into a row of SPEs from the left and broadcast to each MAC in an SPE. In Figure 11, the number of rows of a weight tile can vary based on the N:M sparsity combinations in the tile. When a row-wise N:M sparse tile has N_{4:4} rows with 4:4 sparsity, N_{2:4} rows with 2:4 sparsity, and N_{1:4} rows with 1:4 sparsity, the number of columns in VEGETA-S and the number of rows of the weight tile, H_A, can be derived as N_cols = N_{4:4} + N_{2:4}/2 + N_{1:4}/4 and H_A = N_{4:4} + N_{2:4} + N_{1:4}. H_A can vary from 8 to 32 depending on the sparsity degree of the tile. Our approach requires consecutive groups of rows of the weight tile to have the same sparsity degree (e.g., 2 rows for 2:4 and 4 rows for 1:4); we call this pseudo row-wise N:M sparsity. We can employ a simple reordering in the input/output DMA engines to group input rows with the same sparsity and reorder the output elements back to their original order. Since this only needs an extra row of adders, the hardware overhead is negligible compared to VEGETA-S, which supports tile-wise flexible N:M.
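A quick numeric instance of these formulas (the row counts are hypothetical, chosen by us for illustration):

#include <cstdio>

int main() {
  // Hypothetical row-wise sparse weight tile: 2 rows of 4:4, 8 rows of 2:4, 8 rows of 1:4.
  int n44 = 2, n24 = 8, n14 = 8;
  int n_cols = n44 + n24 / 2 + n14 / 4;  // SPE columns used: 2 + 4 + 2 = 8
  int h_a = n44 + n24 + n14;             // effective weight-tile rows: 18
  // Fits an engine with 8 SPE columns (e.g., VEGETA-S-2-2 in Table III).
  std::printf("N_cols = %d, H_A = %d (H_A ranges from 8 to 32)\n", n_cols, h_a);
  return 0;
}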

V-F Integration of VEGETA Engines with CPU

Figure 12: Overview of VEGETA in a CPU. We highlight the components that include our contributions in red.

In Figure 12, we show how we integrate VEGETA in a CPU. We mark the components that we modify or introduce to integrate VEGETA in red. Our baseline core design maintains separate physical and architectural register files. We extend this to tile registers and metadata registers for VEGETA; this includes enhancements to the register file itself, allocator, renamer, scheduler, and ROB to add the ability to manage the new registers, track dependencies on them, and support precise exceptions. We also enhance the scheduler to track static and dynamic information of instructions being executed in VEGETA engines to support pipelining and output forwarding at the right time without interrupting scalar/vector instructions. A tile_load_t (or tile_store_t) is converted into 16 memory requests, each loading (or storing) 64 Bytes (a cache line) through the load/store queue, imposing no extra implications on cache/memory coherence or consistency.

VI Evaluation

TABLE IV: Dimensions of DNN layers used in this evaluation.
Workloads Dimensions # of MACs
ResNet50-L1 K=64, C=256, Y=56, X=56, R=1, S=1 51,380,224
ResNet50-L2 K=64, C=64, Y=56, X=56, R=3, S=3 115,605,504
ResNet50-L3 K=256, C=64, Y=56, X=56, R=1, S=1 51,380,224
ResNet50-L4 K=128, C=128, Y=28, X=28, R=3, S=3 115,605,504
ResNet50-L5 K=512, C=128, Y=28, X=28, R=1, S=1 51,380,224
ResNet50-L6 K=256, C=256, Y=14, X=14, R=3, S=3 115,605,504
BERT-L1 M=512, N=768, K=768 301,989,888
BERT-L2 M=512, N=512, K=768 201,326,592
BERT-L3 M=512, N=768, K=512 201,326,592
GPT-L1 M=256, N=256, K=2,048 134,217,728
GPT-L2 M=512, N=512, K=2,048 536,870,912
GPT-L3 M=256, N=256, K=12,288 805,306,368
Figure 13: Normalized runtime with different matrix engines in Table III. We use reddish colors, black, greenish colors, and blue for dense matrix engines [29, 26], a design using the STC-like config [39], VEGETA-S designs, and VEGETA-S with OF, respectively.
Figure 14: Area and power normalized to RASA-SM and frequency for different VEGETA engines. V indicates VEGETA.

VI-A VEGETA Implementation

We modified LLVM [35] for the new VEGETA ISA and implemented VEGETA C++ intrinsics. Next, we wrote GEMM/SPMM kernels that exploit layer-wise N:M sparsity using VEGETA intrinsics. Since there is no commercial CPU that can execute VEGETA instructions, we developed a Pintool, an instrumentation tool using Pin APIs [38], which registers instrumentation callback routines that emulate each of the VEGETA instructions described in Table II. Then, we generated traces of the kernels, containing every executed instruction with dynamic information, using the Pintool, and extended MacSim [31] to handle VEGETA instructions and registers along with the different executions of matrix engines. Finally, we simulated the GEMM/SPMM kernels using the generated traces on MacSim.

We also developed RTL designs to explore different VEGETA engines with different α and β. We model the baseline dense matrix engines (RASA-SM, RASA-DM, and the Intel TMUL-like config) with VEGETA-D-1-1, VEGETA-D-1-2, and VEGETA-D-16-1, respectively. We estimate the performance of an engine with the NVIDIA STC-like config using VEGETA-S-1-2 with only 2:4 support enabled. We synthesized each RTL design using the Synopsys DC compiler with the Nangate 15nm Technology Library. We used the post-layout design for the area/power/timing numbers of the designs shown in Table III.

VI-B Experimental Setup

Although VEGETA is not limited to a single use case, thanks to its generic SPMM instructions that can be used in various kernels, we use DL inference as a use case to show the performance of VEGETA. As DNN compression is done offline (usually once before deployment) for inference [55, 52, 56, 4], its cost is amortized across multiple inferences; thus, the reported inference performance does not include it. For the workloads, we choose representative DNN layers from ResNet50 [21], BERT [15], and GPT-3 [10]. The parameters for the layers are summarized in Table IV. To convert the convolutional layers of ResNet50 to GEMM kernels, we use the dimensions derived by applying the image-to-column (im2col) algorithm. We run the DNN layers with 1:4/2:4/4:4 structured sparsity on the different VEGETA engines listed in Table III using MacSim. We set the frequency of the core to 2 GHz and the fetch/issue/retire width to four, with 16 pipeline stages, 97 ROB entries, and 96 load buffer entries. To focus on the performance trade-offs of different VEGETA designs, we assume that the data is prefetched into the L2 cache.
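For reference, the sketch below shows the standard im2col mapping we assume when reading Table IV (unit stride and the listed Y/X as output dimensions are our assumptions; the MAC counts it reproduces match the table):

#include <cstdio>

int main() {
  struct Layer { const char* name; long K, C, Y, X, R, S; };
  const Layer layers[] = {{"ResNet50-L2", 64, 64, 56, 56, 3, 3},
                          {"ResNet50-L4", 128, 128, 28, 28, 3, 3}};
  for (const Layer& l : layers) {
    long M = l.K;               // GEMM rows: output channels
    long N = l.Y * l.X;         // GEMM columns: output spatial positions
    long Kd = l.C * l.R * l.S;  // GEMM depth: input channels x filter window
    std::printf("%s: GEMM M=%ld, N=%ld, K=%ld, MACs=%ld\n", l.name, M, N, Kd, M * N * Kd);
  }
  return 0;
}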

VI-C Performance Analysis

Figure 13 shows the runtime for various DNN layers. For this experiment, we assume that all the matrix engines run at 0.5 GHz. We chose 0.5 GHz since it met timing for all the matrix designs used in the evaluation, based on the corresponding RTL implementations. We normalize the runtime by the longest runtime (the runtime of GPT-L3 with RASA-SM). We first observe that RASA-SM suffers from under-utilization of processing elements due to the mismatch of the matrix engine pipeline stages (WL/FF/FS/DR), resulting in the highest runtime. RASA-DM is a state-of-the-art matrix engine for CPUs and achieves good throughput by matching the latencies of its matrix engine pipeline stages. Compared to RASA-DM, our sparse engine designs perform comparably on the dense workload, showing a performance gain of up to 7%. The performance gain comes mainly from the reduced latency of the drain stage, achieved by reducing the width of the SA, and from output forwarding.

Since VEGETA-D engines are not able to leverage sparsity, they cannot skip ineffectual computations and show the same performance with 2:4 and 1:4 structured sparsity, unlike VEGETA-S engines. Also, the design with the STC-like config can only accelerate 2:4 sparsity, while our VEGETA-S designs can accelerate various fine-grained structured sparsities. For layers with 2:4 structured sparsity, the matrix engine using the STC-like config shows a 16% runtime reduction on average compared with RASA-DM. Using VEGETA-S-16-2, an additional 18% runtime reduction is achieved compared to the design with the STC-like config. Finally, with output forwarding, another 32% of the runtime is eliminated. For layers with 1:4 structured sparsity, the design with the STC-like config does not show better performance than with 2:4 structured sparsity since it cannot exploit the extra zeros to skip additional ineffectual computations, unlike our VEGETA-S designs. We observe that VEGETA-S-1-2 shows a 51% runtime reduction on average compared with RASA-DM since it can skip all ineffectual computations. Using our VEGETA-S-16-2 engine, an additional 8% runtime reduction is achieved. Finally, with output forwarding, another 37% of the runtime is eliminated by resolving output dependencies early.

VI-D Area and Power Analysis

Figure 15: Average of normalized speed-ups of different sparsity granularity HW support and sparsity degrees using workloads in Table IV.

In Figure 14, we show the normalized area and frequency for different VEGETA engines. First, we observe that when we increase the number of PUs in a PE (α), the area of the VEGETA engines decreases due to the smaller number of horizontal pipeline buffers, as described in Section V-A. Since we add only small modules to support sparsity, the VEGETA-S design with the largest area overhead compared with RASA-SM incurs only a 6% area overhead. Moreover, by increasing α, VEGETA-S-8-2 and VEGETA-S-16-2 show lower area compared to RASA-SM or the state-of-the-art dense matrix engine for CPUs (RASA-DM). This is because the overhead gets amortized and compensated as more MACs share the data, reducing the total pipeline registers. Power shows a similar trend. When we vary α for VEGETA-S-α-2 as 1, 2, 4, 8, and 16, the power overhead (both static and dynamic) is 17%, 8%, 4%, 3%, and 1% compared with the baseline. Meanwhile, higher α limits the maximum frequency due to the increased wire length for broadcasting across PUs.

VI-E Analysis for Unstructured Sparsity Support Using VEGETA

We convert unstructured sparse matrices into row-wise N:4 sparse matrices to accelerate SPMM with VEGETA, as discussed in Section V-E. Since there is no prior work on CPU sparse matrix engines, we use a few SOTA sparse accelerators [43, 37] as baseline matrix engines for comparison. As shown in Table I, S2TA [37] naturally supports layer-wise N:M and could potentially be enhanced to support tile-wise N:M, while VEGETA can support pseudo row-wise and row-wise N:M. SIGMA [43] can leverage unstructured sparsity, but it comes with area overhead (we normalize performance by area). For a conservative evaluation, we assume that they are also enhanced to fully hide the fill and drain overhead through perfect pipelining. Unlike the layer-wise N:4 evaluation, it is not straightforward to implement optimized kernels using VEGETA instructions, so we use an analytical roofline model, leaving the development of optimized SW kernels using VEGETA instructions as future work. We use the same workloads, but induce random, unstructured sparsity of varying degrees and report the average speed-up normalized to a dense engine in Figure 15. A smaller sparsity granularity increases the possibility of finding an N:M sparsity that covers the non-zeros. For example, it is unlikely that an entire unstructured sparse layer exhibits a certain N:M sparsity; thus, layer-wise support does not show much performance improvement over dense. In contrast, row-wise achieves 2.36× and 3.28× speed-ups at 90% and 95% sparsity degrees. SIGMA performs better than the others at extremely high sparsity degrees (>95%), but it is inefficient for modest sparsity degrees (the target of our work), indicating that its extra area overhead is not useful there.

VII Related Work

CPU support to run GEMMs efficiently. SAVE [18] is a sparsity-aware CPU vector engine that skips redundant computations in sparse GEMM operations to accelerate sparse DNN and high-performance computing workloads. Similarly, SparCE [46] also increases the utilization of vector engines by tracking general-purpose registers with zeros. With extremely high sparsity, a program becomes memory-bound due to the low arithmetic intensity, making vector compute throughput sufficient, and vector engines like SAVE/SparCE can match the performance of a matrix engine. Otherwise, we expect them to be significantly slower than VEGETA due to their lower compute throughput. ZCOMP [8] introduces a vector ISA extension to reduce cross-layer communication for DNNs. Our work is orthogonal and complementary to these works since we target a matrix engine that operates on tiles of the matrix instead of individual vectors. RASA [29] proposes control and data optimizations for CPU matrix engines to improve utilization via efficient pipelining and overlap. It divides a matrix multiplication into different sub-stages on the systolic array and introduces optimizations that pipeline and overlap the different stages. It heavily inspired the sub-stage pipelining in our work. However, it does not consider SPMM, and its design cannot be used directly as a sparse matrix engine.

Handling dynamic sparsity. Since dynamic (input) sparsity is hard to predict, it has to be handled at runtime, unlike static sparsity, which is usually pre-determined. We could use compaction of tile registers to build non-zero tiles, similar to the approach that SAVE [18] used for merging vector registers to remove zeros. However, this is not practical for a matrix engine due to the high probability of conflicts across different tiles, since the number of operands in a vector register is 32 while that of a tile register is 512 (16×32). There could be an efficient way to exploit dynamic sparsity on matrix engines in CPUs without much overhead, but we leave it as future work.

Sparse DNN accelerators. There have been several papers focusing on SPMM/SPGEMM acceleration with standalone accelerators [42, 43, 23, 49, 50, 41, 54]. Sparse-TPU [22] and the work from Kung et al. [34] introduce packing algorithms targeting unstructured sparsity for systolic-array-like architectures. Being standalone accelerators with large area-power budgets, these works have explored co-design of compression formats, dataflow, indexing logic, and control structures. CPU matrix engines, on the other hand, are tightly coupled with the RF and operate under strict area-power budgets, motivating our solution of enhancing a dense systolic array with a pipeline-friendly dataflow. Zhu et al. [56] introduce a sparse tensor core by extending the NVIDIA Tensor Core using vector-wise pruning and encoding to enforce structured sparsity. NVIDIA has also recently introduced the Sparse Tensor Core, a sparse matrix engine in its Ampere GPUs [39], to accelerate sparse DNNs with 2:4 structured sparsity. STA [36] proposes a systolic tensor array for inference on mobile devices with density-bound block sparsity, which is similar to N:M structured sparsity, and S2TA [37] extended STA for flexible N:M sparsity. None of the previous works supports row-wise N:M sparsity, which this work shows to be a promising technique to cover unstructured sparsity. Also, prior works do not consider dividing the execution of tile computations into fine-grained stages for pipelining without resource conflicts, which is critical when integrating into CPUs.

VIII Conclusion

This work adds flexible N:M structured sparsity support to CPU matrix engines via the VEGETA ISA and engines supporting various sparsity granularities. We propose several design choices for the VEGETA engine, studying kernel performance, clock frequency, and area trade-offs. We also show how VEGETA's row-wise N:M sparsity feature can efficiently accelerate unstructured sparse layers by transforming unstructured sparsity into row-wise N:M sparsity. We believe this work opens up opportunities for future work on enhancing the design for dynamic sparsity and on studying the interaction of the sparse matrix engine with the rest of the CPU memory hierarchy.

IX Acknowledgements

This work was funded by an award from Intel's Corporate Research Council. We thank Raveesh Garg, Po-An Tsai, and Vivek Sarkar for insightful discussions on this work.

References

  • [1] P. Warden, “Why gemm is at the heart of deep learning,” 2015, https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/.
  • [2] “Nvidia tesla v100 gpu architecture.” 2017, https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
  • [3] “Tensorfloat-32 in the a100 gpu accelerates ai training, hpc up to 20x,” 2020, https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/.
  • [4] “Nvidia ampere ga102 gpu architecture.” 2021, https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf.
  • [5] “oneapi deep neural network library (onednn),” 2021, https://github.com/oneapi-src/oneDNN.
  • [6] “Sambanova whitepaper,” 2021. [Online]. Available: https://sambanova.ai/wp-content/uploads/2021/06/SambaNova_RDA_Whitepaper_English.pdf
  • [7] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
  • [8] B. Akin, Z. A. Chishti, and A. R. Alameldeen, “Zcomp: Reducing dnn cross-layer memory footprint using vector extensions,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52.   New York, NY, USA: Association for Computing Machinery, 2019, p. 126–138. [Online]. Available: https://doi.org/10.1145/3352460.3358305
  • [9] ARM, “Powering the edge: Driving optimal performance with the ethos-n77 npu,” Whitepaper, 2019.
  • [10] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
  • [11] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, M. Cowan, H. Shen, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, “Tvm: An automated end-to-end optimizing compiler for deep learning,” in Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’18.   USA: USENIX Association, 2018, p. 579–594.
  • [12] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
  • [13] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, “Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 292–308, 2019.
  • [14] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer, “cudnn: Efficient primitives for deep learning,” arXiv preprint arXiv:1410.0759, 2014.
  • [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).   Minneapolis, Minnesota: Association for Computational Linguistics, Jun. 2019, pp. 4171–4186. [Online]. Available: https://aclanthology.org/N19-1423
  • [16] T. Geng, A. Li, T. Wang, C. Wu, Y. Li, R. Shi, A. Tumeo, S. Che, S. Reinhardt, and M. Herbordt, “Awb-gcn: A graph convolutional network accelerator with runtime workload rebalancing,” MICRO, 2020.
  • [17] E. Georganas, K. Banerjee, D. Kalamkar, S. Avancha, A. Venkat, M. Anderson, G. Henry, H. Pabst, and A. Heinecke, “Harnessing deep learning via a single building block,” in 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020, pp. 222–233.
  • [18] Z. Gong, H. Ji, C. W. Fletcher, C. J. Hughes, S. Baghsorkhi, and J. Torrellas, “Save: Sparsity-aware vector engine for accelerating dnn training and inference on cpus,” in MICRO, 2020.
  • [19] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
  • [20] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, ser. NIPS’15.   Cambridge, MA, USA: MIT Press, 2015, p. 1135–1143.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [22] X. He, S. Pal, A. Amarnath, S. Feng, D.-H. Park, A. Rovinski, H. Ye, Y. Chen, R. Dreslinski, and T. Mudge, “Sparse-tpu: Adapting systolic arrays for sparse matrices,” in Proceedings of the 34th ACM International Conference on Supercomputing, ser. ICS ’20.   New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3392717.3392751
  • [23] K. Hegde, H. Asghari-Moghaddam, M. Pellauer, N. Crago, A. Jaleel, E. Solomonik, J. Emer, and C. W. Fletcher, “Extensor: An accelerator for sparse tensor algebra,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’52.   New York, NY, USA: Association for Computing Machinery, 2019, p. 319–333. [Online]. Available: https://doi.org/10.1145/3352460.3358275
  • [24] IBM, “Power isa version 3.1,” Whitepaper, 2020.
  • [25] Intel, “Theoretical maximum memory bandwidth for intel® core™ x-series processors,” https://www.intel.com/content/www/us/en/support/articles/000056722/processors/intel-core-processors.html.
  • [26] Intel, “Intel 64 and ia-32 architectures software developer’s manual,” 2020.
  • [27] Intel, “Presentation deck: Intel architecture day 2021,” 2021, https://download.intel.com/newsroom/2021/client-computing/intel-architecture-day-2021-presentation.pdf.
  • [28] J.-W. Jang, S. Lee, D. Kim, H. Park, A. S. Ardestani, Y. Choi, C. Kim, Y. Kim, H. Yu, H. Abdel-Aziz, J.-S. Park, H. Lee, D. Lee, M. W. Kim, H. Jung, H. Nam, D. Lim, S. Lee, J.-H. Song, S. Kwon, J. Hassoun, S. Lim, and C. Choi, “Sparsity-aware and re-configurable npu architecture for samsung flagship mobile soc,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), 2021, pp. 15–28.
  • [29] G. Jeong, E. Qin, A. Samajdar, C. J. Hughes, S. Subramoney, H. Kim, and T. Krishna, “Rasa: Efficient register-aware systolic array matrix engine for cpu,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021, pp. 253–258.
  • [30] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, “In-datacenter performance analysis of a tensor processing unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA), 2017.
  • [31] H. Kim et al., “Macsim: A cpu-gpu heterogeneous simulation framework user guide,” 2012.
  • [32] K. Yamada, W. Li, and P. Dubey, “Intel’s mlperf results show robust cpu-based training performance for a range of workloads,” 2020, https://www.intel.com/content/www/us/en/artificial-intelligence/posts/intels-mlper-results.html.
  • [33] H.-T. Kung, “Why systolic architectures?” Computer, no. 1, pp. 37–46, 1982.
  • [34] H. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’19.   New York, NY, USA: Association for Computing Machinery, 2019, p. 821–834. [Online]. Available: https://doi.org/10.1145/3297858.3304028
  • [35] C. Lattner and V. Adve, “Llvm: a compilation framework for lifelong program analysis & transformation,” in International Symposium on Code Generation and Optimization, 2004. CGO 2004., 2004, pp. 75–86.
  • [36] Z.-G. Liu, P. N. Whatmough, and M. Mattina, “Systolic tensor array: An efficient structured-sparse gemm accelerator for mobile cnn inference,” IEEE Computer Architecture Letters, vol. 19, no. 1, pp. 34–37, 2020.
  • [37] Z.-G. Liu, P. N. Whatmough, Y. Zhu, and M. Mattina, “S2ta: Exploiting structured sparsity for energy-efficient mobile cnn acceleration,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 573–586.
  • [38] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, “Pin: Building customized program analysis tools with dynamic instrumentation,” SIGPLAN Not., vol. 40, no. 6, p. 190–200, jun 2005. [Online]. Available: https://doi.org/10.1145/1064978.1065034
  • [39] A. Mishra, J. A. Latorre, J. Pool, D. Stosic, D. Stosic, G. Venkatesh, C. Yu, and P. Micikevicius, “Accelerating sparse deep neural networks,” arXiv preprint arXiv:2104.08378, 2021.
  • [40] M. Naumov, D. Mudigere, H. M. Shi, J. Huang, N. Sundaraman, J. Park, X. Wang, U. Gupta, C. Wu, A. G. Azzolini, D. Dzhulgakov, A. Mallevich, I. Cherniavskii, Y. Lu, R. Krishnamoorthi, A. Yu, V. Kondratenko, S. Pereira, X. Chen, W. Chen, V. Rao, B. Jia, L. Xiong, and M. Smelyanskiy, “Deep learning recommendation model for personalization and recommendation systems,” CoRR, vol. abs/1906.00091, 2019. [Online]. Available: https://arxiv.org/abs/1906.00091
  • [41] S. Pal, J. Beaumont, D.-H. Park, A. Amarnath, S. Feng, C. Chakrabarti, H.-S. Kim, D. Blaauw, T. Mudge, and R. Dreslinski, “Outerspace: An outer product based sparse matrix multiplication accelerator,” in ISCA, 2018.
  • [42] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally, “Scnn: An accelerator for compressed-sparse convolutional neural networks,” in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017, pp. 27–40.
  • [43] E. Qin, A. Samajdar, H. Kwon, V. Nadella, S. Srinivasan, D. Das, B. Kaul, and T. Krishna, “Sigma: A sparse and irregular gemm accelerator with flexible interconnects for dnn training,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA).   IEEE, 2020, pp. 58–70.
  • [44] K. Rocki, D. Van Essendelft, I. Sharapov, R. Schreiber, M. Morrison, V. Kibardin, A. Portnoy, J. F. Dietiker, M. Syamlal, and M. James, “Fast stencil-code computation on a wafer-scale processor,” in SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.   IEEE, 2020, pp. 1–14.
  • [45] A. Rodrigues et al., “Lower numerical precision deep learning inference and training,” Whitepaper, 2018.
  • [46] S. Sen, S. Jain, S. Venkataramani, and A. Raghunathan, “Sparce: Sparsity aware general-purpose core extensions to accelerate deep neural networks,” IEEE Trans. Comput., vol. 68, no. 6, p. 912–925, Jun. 2019. [Online]. Available: https://doi.org/10.1109/TC.2018.2879434
  • [47] S. Wang and P. Kanwar, “Bfloat16: The secret to high performance on cloud tpus,” 2019, https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus.
  • [48] A. Sriraman and A. Dhanotia, “Accelerometer: Understanding acceleration opportunities for data center overheads at hyperscale,” in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS ’20.   New York, NY, USA: Association for Computing Machinery, 2020, p. 733–750. [Online]. Available: https://doi.org/10.1145/3373376.3378450
  • [49] N. Srivastava, H. Jin, J. Liu, D. Albonesi, and Z. Zhang, “Matraptor: A sparse-sparse matrix multiplication accelerator based on row-wise product,” in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020, pp. 766–780.
  • [50] N. Srivastava, H. Jin, S. Smith, H. Rong, D. Albonesi, and Z. Zhang, “Tensaurus: A versatile accelerator for mixed sparse-dense tensor computations,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 689–702.
  • [51] W. J. Starke, B. W. Thompto, J. A. Stuecheli, and J. E. Moreira, “Ibm’s power10 processor,” IEEE Micro, vol. 41, no. 2, pp. 7–14, 2021.
  • [52] W. Sun, A. Zhou, S. Stuijk, R. Wijnhoven, A. O. Nelson, H. Li, and H. Corporaal, “Dominosearch: Find layer-wise fine-grained n:m sparse schemes from dense neural networks,” in Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [53] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang, “Machine learning at facebook: Understanding inference at the edge,” in 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2019, pp. 331–344.
  • [54] Z. Zhang, H. Wang, S. Han, and W. J. Dally, “Sparch: Efficient architecture for sparse matrix multiplication,” in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020, pp. 261–274.
  • [55] A. Zhou, Y. Ma, J. Zhu, J. Liu, Z. Zhang, K. Yuan, W. Sun, and H. Li, “Learning n:m fine-grained structured sparse neural networks from scratch,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=K9bw7vqp_s
  • [56] M. Zhu, T. Zhang, Z. Gu, and Y. Xie, “Sparse tensor core: Algorithm and hardware co-design for vector-wise sparse neural networks on modern gpus,” in MICRO, 2019.