© 2024 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
This work has been accepted as a poster at the FPT 2024 (International Conference on Field Programmable Technology) Proceedings. It will appear in the proceedings as a two-page manuscript and on the IEEE website soon.
FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs
Abstract
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are few hardware accelerators tailored for this mechanism, even though accelerating it is the first step before designing an accelerator for a complete model. This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing UltraScale+ FPGAs. Experimental results show that it can attain a maximum throughput, number of parallel attention heads, embedding dimension, and tile size of 328 giga operations per second (GOPS), 8, 768, and 64, respectively, on the U55C. Furthermore, it is 3.28× and 2.6× faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU, respectively. It is also 1.3× faster than the fastest state-of-the-art FPGA-based accelerator.
Index Terms:
FPGA, Transformer, Attention-based Neural Networks, High-Level Synthesis, Natural Language Processing, Hardware Accelerators.
I Introduction
Transformer neural networks have demonstrated significant advancements in natural language processing (NLP) [1, 2], machine translation [3], computer vision [4], and other domains in recent years. Numerous transformer-based models have surfaced, including full transformers containing an encoder and decoder [2], BERT [5, 6], Transformer-XL [7], ALBERT [8], T5 [9], Routing Transformers [10], StructBERT [11], and more. These models contain a remarkable feature, the multi-head attention (MHA) mechanism, which differs from traditional convolutional neural network (CNN), recurrent neural network (RNN), and long short-term memory (LSTM) models. It is even replacing RNNs and LSTMs for NLP tasks, as well as convolutional layers in CV tasks, because it enables a high level of computational parallelism for both the training and inference phases. This makes it well suited for acceleration on hardware such as GPUs and FPGAs.
Nevertheless, the attention mechanism incurs high computational costs due to intensive matrix computations and intricate data flows [12]. It consumes a significant amount of runtime in many existing TNNs, ranging from about 38% to 64% as the sequence length (the number of tokens in the input sequence) varies from 64 to 256 [13]. Unfortunately, general-purpose platforms such as GPUs and CPUs are inefficient for processing TNNs due to low computational efficiency, underutilized memory bandwidth, and significant compilation overheads [14]. In contrast, FPGAs have gained widespread use for accelerating DNNs due to their high level of parallelism and low latency [15, 16]. Many works focus on parallelizing computations to accelerate CNNs, LSTMs, and graph convolutional networks (GCNs) [17, 18, 19, 20] on FPGAs. Recently, some works have successfully built FPGA or application-specific integrated circuit (ASIC) hardware accelerators for transformers [21, 22, 23]. Most of these works compress the model using different weight-pruning strategies to accelerate attention, and they reduce latency by incorporating sparse matrices. Thus, they have specialized sparse architectures for specific applications. However, different applications require different sparsity patterns, necessitating the redesign of hardware architectures for optimal results, which is a time-consuming and challenging task. ASICs are designed for a specific model and configuration, so they perform poorly on different models, or even on the same model with varying configurations [24]. Custom FPGA accelerators also lack the flexibility to be reconfigured for a different model during runtime.
Thus, a versatile accelerator is needed that can efficiently handle dense matrix computations across various TNN applications. The study in [21] utilizes logic resources to implement a systolic array (SA) for parallelism, leaving unused digital signal processing (DSP) resources that are capable of high-speed computation at higher frequencies. DSP consumption also depends on the implementation method. For example, most accelerators [23, 25, 26, 27] used high-level synthesis (HLS) tools, while some used a hardware description language (HDL) [28, 29, 30] for the design. While HLS takes less implementation time than HDL, it is challenging to write efficient HLS code that can effectively manage certain FPGA resources, such as DSPs, for optimal performance [18]. Analyses in [31, 22, 32, 33, 34] showed that MHA occupies most of the storage and has the highest number of operations. Since the block RAM (BRAM) capacity of FPGAs typically falls below 36 MB, input matrices must be partitioned into tiles. However, formulating an ideal partition scheme that aligns well with the architecture poses a considerable challenge.
In this paper, we present FAMOUS, a flexible accelerator designed to adapt to a wide range of TNN applications. Our HLS-based code is optimized to utilize more DSPs and BRAMs in parallel. FAMOUS integrates efficient tiling along with enhanced parallel computation and communication to accelerate the attention mechanism as much as possible.
The contributions of this paper are:
- A novel architecture that ensures high utilization of BRAM and DSP, enhancing parallel processing of the attention mechanism of the transformer and achieving low latency.
- An efficient tiling of weight matrices to accommodate large models in on-chip memory.
- A parameterized HLS code that enables users to modify some parameters at design time from the HLS tool.
- A runtime programmable feature that enables users to modify some parameters at runtime from software.
- A theoretical model to validate both predicted and experimental latency.
II Background
II Background
There are several building blocks in transformers, as shown in Fig. 1. An input sequence of tokens is converted into embeddings. The positional encoder enables the model to consider the order of tokens in a sequence by adding positional information to the embeddings; it generates vectors that give context according to a word's position in a sentence. The vectors are then linearly transformed into three tensors, Q (queries), K (keys), and V (values), by multiplying the embedding matrix with three weight matrices. The encoder block handles these tensors, transforming them into a higher-level representation that encapsulates crucial information, ensuring that features and contextual relationships within the input sequence are properly captured. The encoder architecture comprises two main sub-layers: (1) the self-attention mechanism and (2) the position-wise feed-forward network. The self-attention mechanism enables the model to assess different segments of an input sequence simultaneously. It captures long-range relationships by measuring attention scores and utilizing multi-head projections for various input representations, allowing it to learn complex patterns, dependencies, and relationships effectively. The position-wise feed-forward network (FFN), which is equivalent to a multilayer perceptron (MLP), applies linear transformations to every position independently in the input sequence. Two linear transformations are executed, consisting mainly of matrix-vector multiplications. The first linear transformation is followed by an activation function such as the Rectified Linear Unit (ReLU) or Gaussian Error Linear Unit (GeLU), whereas the second is not. Furthermore, each sub-layer includes a residual connection combined with layer normalization (LN), which mitigates the vanishing gradient problem during training. Residual addition and LN layers are inserted after each MHA and FFN; they mainly involve element-wise matrix additions and nonlinear functions.

The decoder block illustrated in Fig. 1 is responsible for generating the output sequence based on the encoded representations supplied by the encoder. Like the encoder, the decoder consists of a stack of N identical layers. Each layer within the decoder contains three sub-layers: (1) the masked attention mechanism, which resembles the encoder's self-attention but includes a masking feature that restricts the output's dependency to known preceding outputs; (2) an attention layer that directs its focus to the encoder's output, enabling the decoder to emphasize relevant sections of the input sequence for each output element; and (3) a position-wise feed-forward network.

As illustrated in Fig. 2, the scaled dot-product attention in each head is a crucial part of the multi-head attention layer. The attention scores are computed by performing the dot product of the Q and K matrices and subsequently scaling it down by the square root of the 2nd dimension of the K matrix. This scaling is essential to prevent the dot products from becoming excessively large, contributing to the stabilization of gradients during the training process. Subsequently, the scaled dot products undergo the softmax function, resulting in the computation of attention weights. These weights are then used to perform a weighted sum of the value vectors. The ultimate output is the projection of the concatenated sequences from all heads.
The output of the MHA can be represented by Equations 1 & 2. The input sequence X is linearly mapped into the Q, K, and V matrices using weights and biases. The parameter d_k is the 2nd dimension of Q_i and K_i. d_model is a hyperparameter called the embedding dimension, h is the number of heads, and d_k = d_model/h. The subscript i is the index of the attention heads.
$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i, \quad Q_i = XW_i^{Q} + b_i^{Q},\; K_i = XW_i^{K} + b_i^{K},\; V_i = XW_i^{V} + b_i^{V}$   (1)
$\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$   (2)
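For reference, the following is a minimal, self-contained C++ sketch of Equation 1 for a single attention head in floating point. It is only an illustration of the math: the hardware operates on 8-bit fixed-point data with per-module parallel units, and the max-subtraction in the softmax is the usual numerical-stability convention rather than a detail of FAMOUS.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Single-head scaled dot-product attention (Eq. 1), row-major float matrices.
// Q, K, V are SL x dk; the returned matrix is SL x dk.
std::vector<float> attention_head(const std::vector<float>& Q,
                                  const std::vector<float>& K,
                                  const std::vector<float>& V,
                                  int SL, int dk) {
    std::vector<float> S(SL * SL), out(SL * dk, 0.0f);
    const float scale = 1.0f / std::sqrt(static_cast<float>(dk));
    for (int i = 0; i < SL; ++i) {
        // Scaled scores for row i: S[i][j] = (Q_i . K_j) / sqrt(dk)
        float maxv = -1e30f;
        for (int j = 0; j < SL; ++j) {
            float dot = 0.0f;
            for (int k = 0; k < dk; ++k) dot += Q[i * dk + k] * K[j * dk + k];
            S[i * SL + j] = dot * scale;
            maxv = std::max(maxv, S[i * SL + j]);
        }
        // Row-wise softmax (max subtracted for numerical stability)
        float sum = 0.0f;
        for (int j = 0; j < SL; ++j) {
            S[i * SL + j] = std::exp(S[i * SL + j] - maxv);
            sum += S[i * SL + j];
        }
        // Weighted sum of the value vectors: out_i = sum_j softmax(S)_ij * V_j
        for (int j = 0; j < SL; ++j) {
            const float w = S[i * SL + j] / sum;
            for (int k = 0; k < dk; ++k) out[i * dk + k] += w * V[j * dk + k];
        }
    }
    return out;
}
```

The concatenation and output projection of Equation 2 would then combine the h per-head outputs.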
III Related work
Several FPGA and ASIC accelerators exist for accelerating attention mechanisms. The ASIC design in [22] exploited parallelism and datapath specialization to significantly improve performance and energy efficiency. Another ASIC, ELSA [13], used specialized approximation algorithms to reduce computation. The SpAtten [33] ASIC utilized sparsity and quantization to reduce computations and memory accesses. A hardware-software co-design framework called Sanger [12] enabled dynamic sparsity through a reconfigurable ASIC architecture. The FPGA accelerator proposed by Lu et al. [21] is the first hardware architecture to accelerate both the multi-head attention (MHA) layer and the feedforward network (FFN) of the transformer; however, their implementation was done in HDL for a single attention head. Ye et al. [35] proposed an FPGA accelerator for MHA with a reconfigurable architecture, efficient systolic arrays, and a hardware-friendly radix-2 softmax, but they did not consider maximizing BRAM and DSP usage to maximize parallelism. Fujimaki et al. [36] also proposed a systolic-array-based accelerator for the attention mechanism within a CNN, where MHA consumed 63% of the computation time; however, their target was a CNN, not a TNN. A shared computing architecture is implemented in [37], where a parallel computing array is shared between MHA and FFNs for a CNN application. A novel structural pruning method and the associated FPGA accelerator were proposed in [38] to reduce the memory footprint. All of the existing hardware architectures are designed for a specific TNN and a specific sparsity pattern; they lack the flexibility to reconfigure the computing structure for different applications during runtime. Furthermore, they have not explored which tile size and what utilization of BRAMs and DSPs could achieve optimum parallelism.
IV Accelerator Architecture
The core of the accelerator was designed in C using the Vitis high-level synthesis (HLS) 2022.2.1 tool. Functional verification was performed through its C simulation and C/RTL co-simulation features. This section describes the HLS design technique, which generates an optimized architecture utilizing most of the BRAMs and DSPs in the processing elements, ensuring high parallelism.
IV-A Overall Structure
The overall structure of the accelerator is shown in Fig. 3. There are three main processing modules, denoted QKVPM, QKPM, and SVPM according to the outputs they produce. The number of instances of these modules depends on the number of attention heads (h). Each module contains an array of processing elements (PEs). A PE comprises a DSP48 performing multiply-accumulate (MAC) operations. The number of PEs (t) depends on the unrolling factor of the inner loop and the initiation interval of the pipelined outer loop. The PE array's data access pattern and computational requirements differ across modules; therefore, the modules are defined separately with distinct sets of PE arrays, which enables each module to be optimized independently. Input data and weights are stored in multiple BRAMs to enable parallel access.
In our architecture, each PE is independent, with its own local memory, control, and computing unit. The weights (W^Q, W^K, W^V) for generating the query (Q), key (K), and value (V) matrices are declared as separate two-dimensional arrays of size (d_model/h × TS) in HLS. TS is the tile size, which represents the dimension of the sub-matrices into which the larger weight matrices are divided. The number of heads and tiles, together with the array-partition directive in HLS, determines how the arrays are partitioned to generate multiple two-port BRAMs. Due to the limited ports of BRAMs, array partitioning and data loading are efficiently managed to ensure that data required simultaneously by a DSP are stored in separate BRAMs. The Q, K, and V matrices of size (SL × d_model/h) are stored in intermediate buffers, which are also implemented as BRAMs. SL is the sequence length.
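As an illustration of how such buffers might be expressed in HLS, the sketch below declares the per-head tiles and partitions them so that operands needed in the same cycle come from separate BRAMs. The array sizes, data type, and partition factors are our assumptions for illustration, not the authors' source code.

```cpp
// Illustrative HLS-style buffer declarations for one attention head.
// DM = embedding dimension, H = number of heads, TS = tile size, SL = sequence
// length; the values match the largest configuration reported in the paper.
// In Vitis HLS an ap_fixed/ap_int type would normally replace 'signed char'.
#define DM 768
#define H 8
#define TS 64
#define SL 64
#define DK (DM / H)  // per-head dimension (96)

typedef signed char din_t;  // stand-in for an 8-bit fixed-point type

void head_buffers() {
    din_t Wq[DK][TS], Wk[DK][TS], Wv[DK][TS];  // one weight tile per matrix
    din_t Xt[SL][TS];                          // one input tile
    din_t Qb[SL][DK], Kb[SL][DK], Vb[SL][DK];  // per-head Q/K/V buffers
    // Partition along the dimension read in parallel so that operands needed
    // in the same cycle by different DSPs come from separate two-port BRAMs.
#pragma HLS ARRAY_PARTITION variable=Wq cyclic factor=16 dim=2
#pragma HLS ARRAY_PARTITION variable=Xt cyclic factor=16 dim=2
    (void)Wk; (void)Wv; (void)Qb; (void)Kb; (void)Vb;
    // ... Algorithms 1-3 operate on these buffers ...
}
```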

IV-A1 QKVPM module
The QKVPM module generates the query, key, and value matrices. This module contains the W^Q, W^K, and W^V BRAMs, and the input BRAMs, from which data is accessed in parallel by parallel DSP units. The arrays used in this module are divided into sub-arrays using our tiling technique to fit into the BRAMs.
Experimental findings indicated that a tile size of 64 is optimal for HLS to partition arrays within a reasonable compilation time (under 36 hours) for a state-of-the-art (SOTA) transformer encoder. The impact of tile size on performance is discussed in Section VI. The number of loop iterations of the module depends on the tile size; there are a total of (d_model/TS) tiles, or iterations. At each iteration, the W^Q, W^K, W^V, and input BRAMs are loaded with distinct data, and then the computations start in the PEs. Simultaneously, the biases for the Q, K, and V matrices are loaded into registers from off-chip memory while the module performs computations; they are then added to the Q, K, and V matrices. Algorithm 1 outlines the computations of this module, where the 2nd loop (line 6) is pipelined, causing the innermost loop (line 8) to be fully unrolled. This generates one PE per unrolled iteration of the innermost loop.
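Since Algorithm 1 is not reproduced here, the HLS-style sketch below shows our interpretation of its tiled loop nest for one tile iteration; the loop ordering, the weight-tile layout (DK × TS, accessed as its transpose), and the pragma placement are assumptions rather than the authors' exact code.

```cpp
// Sketch of one QKVPM tile iteration: accumulate the partial product of the
// current input tile Xt (SL x TS) and weight tile Wq (DK x TS) into the
// per-head query buffer Qb (SL x DK). The same pattern is repeated for K and V
// and for each of the DM/TS tiles.
enum { SL = 64, TS = 64, DK = 96 };

void qkv_tile(const signed char Xt[SL][TS], const signed char Wq[DK][TS],
              int Qb[SL][DK]) {
row:
    for (int i = 0; i < SL; ++i) {
col:
        for (int j = 0; j < DK; ++j) {
#pragma HLS PIPELINE II=1  // pipelining this loop fully unrolls the 'mac' loop
            int acc = 0;
mac:
            for (int k = 0; k < TS; ++k) {
#pragma HLS UNROLL  // TS parallel multiply-accumulate PEs
                acc += static_cast<int>(Xt[i][k]) * static_cast<int>(Wq[j][k]);
            }
            Qb[i][j] += acc;  // running sum across tiles
        }
    }
}
```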
IV-A2 QKPM module
The QKPM module performs the matrix-matrix multiplication between the Q and K matrices. As these matrices are relatively small, they are not tiled. Algorithm 2 describes these operations. The innermost loop (line 6) is fully unrolled, generating one PE per unrolled iteration for this module. This module contains the Q and K BRAMs from which data is retrieved by the DSP units. As the division operation outlined in Equation 1 is executed within this module using lookup tables (LUTs), the number of parallel operations is constrained to prevent excessive utilization of LUTs. A matrix (S) of attention weights is generated within this module and stored either in BRAMs or registers. These values are then forwarded to the non-linear softmax activation function, which, as described in HLS, is implemented using LUTs and FFs.
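The steps that distinguish this module from a plain matrix product are the 1/sqrt(d_k) scaling and the row-wise softmax. A small floating-point sketch of that step is shown below; the hardware instead uses fixed-point arithmetic with LUT-based division and exponentiation.

```cpp
#include <cmath>

// Scale one row of raw scores s[j] = Q_i . K_j by 1/sqrt(dk) and apply softmax
// so the row becomes a set of attention weights that sum to 1.
void scale_softmax_row(float* s, int SL, int dk) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(dk));
    float sum = 0.0f;
    for (int j = 0; j < SL; ++j) {
        s[j] = std::exp(s[j] * scale);  // exponentiate the scaled score
        sum += s[j];
    }
    for (int j = 0; j < SL; ++j) s[j] /= sum;  // normalize the row
}
```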
IV-A3 SVPM module
The output matrix (S) derived from the softmax operation is transmitted to the SVPM module (Algorithm 3), where it undergoes matrix-matrix multiplication with the value (V) matrix. Algorithm 3 fully unrolls the innermost loop (line 6), resulting in SL PEs. The output from this module is referred to as the attention score.
IV-B Tiling Technique
As transformer models tend to be large, tiling helps prevent excessive utilization of on-chip memory and computing units. It also ensures that the HLS tool can effectively partition arrays and pipeline or unroll the loops to minimize latency within a short compilation time. Fig. 4 describes our tiling strategy. The weight matrices are partitioned into tiles, allowing the BRAMs to be loaded with partial data retrieved from off-chip memory. They are tiled along the second dimension (the columns of the matrix) only, because the first dimension (the rows of the matrix) is already reduced by the number of heads. Thus, they are loaded (d_model/TS) times. The input buffers of each attention head are declared as a two-dimensional array of size (SL × TS); tiling is therefore applied along the columns of the input matrix as well, and they are also loaded (d_model/TS) times. At each iteration, the data for only one tile is loaded first. The PEs then perform computations on this data, storing the results in intermediate buffers, where they are accumulated with the results from previous iterations. Consequently, the final output is the cumulative sum of the partial outputs computed for all the tiles.
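The correctness of this scheme rests on the fact that a full matrix product can be rewritten as a sum of tile products; the small check below illustrates the idea with arbitrary dimensions (for FAMOUS, d_model = 768 and TS = 64 give 768/64 = 12 tile loads per weight matrix). The layout here is a generic row-major product and is only illustrative of the accumulation, not of the on-chip data layout.

```cpp
// A (SL x DM) * (DM x DK) product computed as the sum of DM/TS partial
// products over (SL x TS) input tiles and (TS x DK) weight tiles. The result
// is identical to the untiled product; only the loading order changes.
enum { SL = 4, DM = 8, TS = 2, DK = 3 };  // small illustrative sizes

void tiled_matmul(const int X[SL][DM], const int W[DM][DK], int out[SL][DK]) {
    for (int i = 0; i < SL; ++i)
        for (int j = 0; j < DK; ++j) out[i][j] = 0;
    for (int t = 0; t < DM / TS; ++t)          // one iteration per tile load
        for (int i = 0; i < SL; ++i)
            for (int j = 0; j < DK; ++j)
                for (int k = 0; k < TS; ++k)   // partial product of tile t
                    out[i][j] += X[i][t * TS + k] * W[t * TS + k][j];
}
```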

IV-C Runtime Programmable Feature
Parameters such as the number of attention heads, embedding dimension, and sequence length are runtime programmable in our design. These parameters can be sent to FAMOUS from software using the steps shown in Fig. 6. TNN models are trained using the PyTorch framework, and the resulting models are saved as '.pth' files. These files are passed to a Python interpreter script to extract the number of attention heads, the embedding dimension, and the sequence length. These values differ across applications, but FAMOUS does not need resynthesis for each one; only the software code needs minor modifications. The Xilinx SDK tool was used to write the software in C++, which runs on the processor. The extracted data, such as the number of attention heads and the embedding dimension, is used in this software. Based on this data, the processor generates instructions and control signals for the accelerator, allowing it to activate different parts of the hardware.
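A hypothetical sketch of the final step on the processor side is shown below. The base address, register offsets, and register names are placeholders chosen for illustration and do not reflect the actual FAMOUS register map.

```cpp
#include <cstdint>

// Write the parameters extracted from the trained model into the accelerator's
// AXI-Lite control registers and start a run. All addresses are placeholders.
static volatile uint32_t* const MHA_CTRL =
    reinterpret_cast<volatile uint32_t*>(0x44A00000u);
enum { REG_CTRL = 0x00 / 4, REG_HEADS = 0x10 / 4, REG_DMODEL = 0x18 / 4, REG_SL = 0x20 / 4 };

void configure_and_start(uint32_t heads, uint32_t d_model, uint32_t seq_len) {
    MHA_CTRL[REG_HEADS]  = heads;    // e.g., 8 (up to the synthesized maximum)
    MHA_CTRL[REG_DMODEL] = d_model;  // e.g., 768
    MHA_CTRL[REG_SL]     = seq_len;  // e.g., 64
    MHA_CTRL[REG_CTRL]   = 0x1;      // ap_start-style kick-off
}
```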
V Overall System
Fig. 5 shows the complete system design for running the multi-head attention layer on the FPGA platforms used in our experiments, the U200 (UltraScale+ XCU200-FSGD2104-2-E) and the U55C (UltraScale+ XCU55C-FSVH2892-2L-E). Each design parameter can be programmed during runtime, up to a maximum value, by the MicroBlaze (µB) softcore processor.


The overall system was designed in the Vivado 2022.1.2 design suite. It contains a custom IP block for the MHA accelerator, which is exported from HLS. The inputs and weights are fetched from off-chip high-bandwidth memory (HBM) using AXI4 master interfaces [39] when the load instruction from the accelerator controller is received, according to demand. The accelerator receives control signals from the processor through an AXI4-Lite slave interface. µB can access the HBM banks, which are connected to the MHA accelerator; it is used to transfer data from the HBMs to the BRAMs and also sends control signals to the accelerator. The boards are connected to a host PC through USB-JTAG and PCIe 3.0 interfaces. This host can communicate with the other IPs (except µB) using the DMA/Bridge Subsystem for PCI Express IP [40] in the system, but PCIe communication was not needed in this work. µB uses the AXI Timer [41] to measure the latency, which covers the time between the start and stop signals from the custom IP module. The host, connected via the JTAG cable [42], displays the results on a terminal using the UARTlite interface [43].
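For illustration, a simplified version of such a latency measurement using the standard xtmrctr driver is sketched below; the device-ID macro comes from the generated board support package and the wait-for-done logic depends on the actual block design, so both are assumptions here.

```cpp
// Measure the accelerator's run time in clock cycles with the AXI Timer.
// XPAR_AXI_TIMER_0_DEVICE_ID is the BSP-generated device ID for this design;
// error handling is omitted for brevity.
#include "xtmrctr.h"

u32 measure_attention_cycles(void) {
    XTmrCtr timer;
    XTmrCtr_Initialize(&timer, XPAR_AXI_TIMER_0_DEVICE_ID);
    XTmrCtr_Reset(&timer, 0);
    XTmrCtr_Start(&timer, 0);            // started with the IP's start signal
    /* ... poll the custom IP's done/stop signal here ... */
    XTmrCtr_Stop(&timer, 0);             // stopped on the IP's stop signal
    return XTmrCtr_GetValue(&timer, 0);  // elapsed cycles of the timer clock
}
```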
VI Evaluation and Results
Table I illustrates the runtime-programmable capability, resource utilization, and performance of FAMOUS. Synthesis was performed once for a constant tile size. The design parameters, i.e., the embedding dimension (d_model), number of heads (h), and sequence length (SL), were configured before synthesis with fixed values of 768, 8, and 64, respectively, according to a variant of BERT [6] and the available FPGA resources. They were then dynamically adjusted during runtime using µB. Hence, FAMOUS can be synthesized for a fixed amount of resources while remaining flexible enough to accommodate smaller configurations as needed. The tile size can be adjusted only before synthesis. The data was quantized into 8-bit fixed-point numbers. Quantization for various applications may lead to accuracy loss, although accuracy was not our primary focus; if a larger bit width is necessary, the design can easily be adjusted by modifying certain parameters in the HLS code at design time, which will affect resource utilization and latency. Tests 1, 2, and 3 show how the number of heads can be varied dynamically within the same accelerator, affecting the latency and throughput, where throughput is defined as the number of giga operations per second (GOPS). On the Alveo U55C, the lowest latency of 0.94 ms and the highest throughput of 328 GOPS were achieved for 8 parallel heads. Tests 4 and 5 show the effect of varying the embedding dimension, where latency increased and GOPS decreased for a larger dimension. The sequence length was varied dynamically for tests 6, 7, and 8, and performance deteriorated as the length increased. Resource utilization remained unchanged from tests 1 to 8 because the accelerator was synthesized only once with a constant tile size, while the other parameters were reconfigured at runtime from software.
Tests 9 and 10 had different tile sizes, necessitating resynthesis of the accelerator, which resulted in different resource utilization. Resource utilization decreased with a reduction in tile size, leading to increased latency and decreased GOPS, because a smaller tile size requires more frequent loading of each tile from external memory to on-chip memory. Tests 11 and 12 demonstrate the performance and resource utilization of FAMOUS on the Alveo U200, highlighting its portability. We ensured high resource utilization, with 46% of the DSPs, 78% of the BRAMs, and 98% of the LUTs. Further DSP utilization was not feasible, as it would have exceeded the capacity of the LUTs. The optimal number of attention heads operating in parallel was determined to be 8 and 6 for a tile size of 64 on the Alveo U55C and U200, respectively; otherwise, the LUTs become over-utilized. Reducing the tile size helped decrease resource consumption, although at the expense of speed. Six parallel attention heads were feasible on the U200, and this decrease in parallelism led to an increase in latency.
| Test no. | Sequence Length | Embedding Dimension | Number of Heads | Tile Size | FPGA | Data Format | DSPs | BRAMs (18 Kb) | LUTs | FFs | Latency (ms) | GOPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| #1 | 64 | 768 | 8 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 0.94 | 328 |
| #2 | 64 | 768 | 4 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 1.401 | 220 |
| #3 | 64 | 768 | 2 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 2.281 | 135 |
| #4 | 64 | 512 | 8 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 0.597 | 184 |
| #5 | 64 | 256 | 8 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 0.352 | 312 |
| #6 | 128 | 768 | 8 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 2 | 314 |
| #7 | 32 | 768 | 8 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 0.534 | 285 |
| #8 | 16 | 768 | 8 | 64 | Alveo U55C | 8-bit fixed | 4157 (46%) | 3148 (78%) | 1284782 (98%) | 661996 (25%) | 13 | 16 |
| #9 | 64 | 768 | 8 | 32 | Alveo U55C | 8-bit fixed | 3636 (40%) | 2636 (65%) | 746769 (57%) | 587337 (22%) | 1.155 | 267 |
| #10 | 64 | 768 | 8 | 16 | Alveo U55C | 8-bit fixed | 2996 (33%) | 2380 (59%) | 607554 (46%) | 529543 (20%) | 1.563 | 197 |
| #11 | 64 | 768 | 6 | 64 | Alveo U200 | 8-bit fixed | 3306 (48%) | 2740 (63%) | 1048022 (88%) | 625983 (26%) | 0.977 | 315 |
| #12 | 64 | 512 | 6 | 64 | Alveo U200 | 8-bit fixed | 3306 (48%) | 2740 (63%) | 1048022 (88%) | 625983 (26%) | 0.604 | 182 |
| Platform | Intel E5 CPU [34] | NVIDIA V100 GPU [44] | Intel Xeon CPU [35] | NVIDIA P100 GPU [35] | FAMOUS (Alveo U55C FPGA) | FAMOUS (Alveo U55C FPGA) |
|---|---|---|---|---|---|---|
| Topologies | 64, 768, 12 | 64, 512, 4 | 64, 512, 8 | 64, 512, 4 | 64, 768, 8 | 64, 512, 8 |
| GOP | 0.308 | 0.11 | 0.11 | 0.11 | 0.308 | 0.11 |
| Latency (ms) | 1.1 | 1.5578 | 1.96 | 0.496 | 0.94 | 0.597 |
| GOPS | 280 | 71 | 56 | 221 | 328 | 184 |
Table II shows a comparison of FAMOUS with some GPUs and CPUs running at approximately 1.5 GHz. Topologies are given as sequence length, embedding dimension, and number of heads. Assuming that the attention heads operate in parallel, their number should not impact the latency; therefore, we did not alter the number of attention heads, even though other works used different numbers. However, we varied the embedding dimension in line with the other studies to ensure a fair comparison. We achieved 3.28×, 2.6×, and 1.17× speedups, and corresponding increases in throughput, compared to the Intel Xeon Gold 5220R CPU, NVIDIA V100 GPU, and Intel E5 2698 v4 CPU, respectively, because of higher parallelism.
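These ratios follow directly from the latencies and operation counts in Table II, comparing columns with matching topologies:

$\frac{1.96\ \text{ms}}{0.597\ \text{ms}} \approx 3.28,\qquad \frac{1.5578\ \text{ms}}{0.597\ \text{ms}} \approx 2.6,\qquad \frac{1.1\ \text{ms}}{0.94\ \text{ms}} \approx 1.17,\qquad \text{GOPS} = \frac{0.308\ \text{GOP}}{0.94\ \text{ms}} \approx 328.$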
Table III compares our FPGA-based accelerator with several ASIC-based accelerators tailored for attention mechanisms. The ASICs utilize sparsity to mitigate resource consumption and computation time for specific applications, operating at a frequency of 1 GHz. Consequently, some ASICs in the table exhibit higher throughput than ours. Conversely, our accelerator employs dense matrices, ensuring no loss of accuracy for any transformer-based application, and it provides the flexibility to run many models without requiring redesign or resynthesis. Table IV compares the latency of our accelerator with other FPGA-based accelerators. The works in [21], [35], and [44] reported latency for single-head attention calculation, and their embedding dimensions in this table are also smaller. The number of operations (GOP) reported for the transformer base model in [44] is equivalent to the operations of a single self-attention head, and [44] compared its latency with [21], which reported latency for single-head computation based on the given resource usage and algorithm. Ye et al. [35] compared their results with [21], so it was assumed that they also measured latency and resources for single-head computation. Since our results were for 8 self-attention heads, we multiplied the results of the other works by 8 for a fair comparison. Table IV also includes Peng et al. [25], although they implemented a complete transformer model; however, they provided latency details for each layer, allowing us to extract the latency specifically for the attention layer.
| Works | Calabash [34] | Lu et al. [21] | Ye et al. [35] | Li et al. [44] | Peng et al. [25] | FAMOUS |
|---|---|---|---|---|---|---|
| Topologies | 64, 768, 12 | 64, 512, 8 | 64, 512, 4 | 64, 512, 4 | 32, 800, 4 | 64, 768, 8 |
| FPGAs | Xilinx VU9P | Xilinx VU13P | Alveo U250 | Xilinx VU37P | Alveo U200 | Alveo U55C |
| Data format | 16-bit fixed | 8-bit fixed | 16-bit fixed | 8-bit fixed | – | 8-bit fixed |
| Method | HDL | HDL | HDL | HLS | HLS | HLS |
| DSPs | 4227 | 129 | 4189 | 1260 | 623 | 4157 |
| BRAMs | 640 | 498 | 1781 | 448 | – | 3148 |
| GOPS | 1288 | 128 | 171 | 72 | 97 | 623 |
| Latency (ms) | 0.239 (a) | 0.8536 (b) | 0.642 | 1.5264 | 1.706 (c) | 0.494 |
(a) Q, K, V matrix computation time ignored.
(b) Time adjusted for 8 attention heads.
(c) Time extracted for the attention mechanism from a full transformer.
To ensure a fair comparison, the latency presented in this table is specifically for the computation of the attention mechanism, excluding the latency associated with load and store operations for the accelerator. Our latency is lower and our GOPS is higher than those of all the other works except Calabash [34], which excluded the computation time for the Q, K, and V matrices. Calabash reports its performance as a speedup over the Intel E5 CPU; therefore, its latency and GOPS were calculated using the Intel E5 CPU's data from Table II. In summary, we were able to utilize more DSPs and BRAMs in parallel, even with an HLS implementation, achieving the lowest latency for a larger embedding dimension.
VII Analytical Model for Latency
The parameters that affect the resource utilization and performance of FAMOUS are the tile size or number of tiles in the attention module, the number of attention heads, the sequence length, and the embedding dimension when the bit width is fixed. An analytical model was developed to establish the relationship between these parameters and latency. This model aids in estimating both latency and the value of the parameters before synthesis.
The design is modular, and each module is implemented as a function with loops. Thus, the latency of the modules depends on the loop iteration latency, which in turn depends on the loop pipeline and unrolling pragmas. For the nested loops in the modules, the second loop from the last is pipelined, resulting in a complete unroll of the innermost loop. The outermost loop has no pragmas, to avoid complicated pipeline depths and resource requirements. The pipelined loop latency (PLL) can be calculated by Equation 3. If the loop is enclosed by another loop, then the total latency (TL) is given by Equation 4. Here, the Trip_count (TC) is the number of iterations of a loop, and the initiation interval (II) is the latency between the initiation of two consecutive iterations. The Pipeline_Depth (PD) is the latency to finish one iteration; it depends on the sequential and parallel operations executed in one iteration, and different modules of FAMOUS can have different pipeline depths. The latency is measured in clock cycles (cc).
$PLL = (TC - 1) \times II + PD$   (3)
$TL = TC_{outer} \times PLL$   (4)
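A small sketch implementing Equations 3 and 4 is shown below; the example numbers in the comment are illustrative and are not taken from the paper.

```cpp
// Latency of a pipelined loop (Eq. 3) and of a pipelined loop nested inside a
// non-pipelined outer loop (Eq. 4); all quantities are in clock cycles.
// Example: TC = 64, II = 1, PD = 10 gives PLL = 73 cc; with an outer trip
// count of 12, TL = 12 * 73 = 876 cc.
unsigned pipelined_loop_latency(unsigned TC, unsigned II, unsigned PD) {
    return (TC - 1) * II + PD;  // Eq. 3
}

unsigned nested_loop_latency(unsigned TC_outer, unsigned TC, unsigned II, unsigned PD) {
    return TC_outer * pipelined_loop_latency(TC, II, PD);  // Eq. 4
}
```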
Equations 3 & 4 are generalized equations whose variables differ for the different modules of FAMOUS, as shown in the following equations.
Equations (5)–(12) instantiate Equations 3 and 4 with each module's trip counts, initiation intervals, and pipeline depths, yielding the latencies $L_{LI}$, $L_{LB}$, $L_{LIA}$, $L_{LWA}$, $L_{MHA}$, $L_{BA}$, $L_{S}$, and $L_{SV}$ of the corresponding load and compute tasks,
where PD_L includes the time required to establish communication with the HBM using the AXI master interface (7 cc), read an address location (1 cc), load (1 cc) and store (1 cc) data from and to that address, and convert floating-point data to fixed point (3 cc), for tasks such as loading all inputs (LI) and all biases (LB), as well as loading the inputs (LIA) and weights (LWA) for each attention head. PD_MHA includes the time required to load (1 cc), multiply (2 cc), add (1 cc), and store (1 cc) when computing self-attention (SA) in the QKVPM module. PD_BA includes the latency associated with the loading, adding, and storing operations of the bias-addition (BA) task. PD_S is the time required to compute the score (S) in the QKPM module, and PD_SV equals SL in the computation of SV within the SVPM module. Experimental results showed that the latencies obtained from these expressions closely match the measured values.
Equation 13 represents the total latency ($Latency_{cc}$) in clock cycles (cc), calculated by summing Equations 5 through 12. Equation 14 converts clock cycles into milliseconds (ms).
$Latency_{cc} = L_{LI} + L_{LB} + L_{LIA} + L_{LWA} + L_{MHA} + L_{BA} + L_{S} + L_{SV}$   (13)
$Latency_{ms} = \dfrac{Latency_{cc}}{f_{clk}\,[\mathrm{MHz}] \times 10^{3}}$   (14)
For instance, the analytical model predicts a latency of 0.98 ms at 400 MHz for the configuration of test 1 in Table I, closely matching the experimental result of 0.94 ms. Likewise, for test 6, the analytical model estimates a latency of 1.9 ms, which is very close to the 2 ms observed experimentally. The other configurations in the same table also agree with the analytical model.
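As a quick consistency check of Equation 14, the conversion can be written as a one-line helper; for test 1, the predicted 0.98 ms at 400 MHz corresponds to 0.98 × 10⁻³ s × 400 × 10⁶ Hz = 392,000 clock cycles.

```cpp
// Convert a cycle count into milliseconds at a clock frequency given in MHz
// (Eq. 14). E.g., cycles_to_ms(392000, 400.0) returns 0.98.
double cycles_to_ms(unsigned long long cycles, double f_clk_mhz) {
    return static_cast<double>(cycles) / (f_clk_mhz * 1e3);
}
```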
VIII Conclusion & Future Works
In this research, a flexible FPGA-based accelerator was designed for the multi-head attention (MHA) layer of a transformer neural network (TNN) using a high-level synthesis (HLS) tool. The accelerator architecture leverages the parallelism of the FPGA and the inherent parallelism of the MHA layer. Resources including BRAMs, DSPs, and LUTs were utilized in an optimized way on the Alveo U55C and U200 FPGA platforms to maximize parallelism and minimize latency. The accelerator is runtime programmable to support different topologies without going through the synthesis steps again. An efficient tiling and loading scheme for the weight arrays was implemented to accommodate large models in on-chip memory without exceeding the capacity of the computation units. Experimental results demonstrate that our design achieves a maximum throughput of 328 GOPS, surpassing some CPUs and GPUs. It achieves throughput comparable to state-of-the-art ASIC accelerators, which operate at higher frequencies and leverage sparsity to reduce computation and resources. Moreover, it achieves a latency lower than that of the fastest state-of-the-art FPGA-based accelerator. In this paper, the architecture supports only the attention module, but in future work it will be expanded, using the same architectural concept, to support the full encoder and eventually both the encoder and decoder of the transformer.
References
- [1] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
- [2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [3] K. Song, K. Wang, H. Yu, Y. Zhang, Z. Huang, W. Luo, X. Duan, and M. Zhang, “Alignment-enhanced transformer for constraining nmt with pre-specified translations,” in AAAI Conference on Artificial Intelligence, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:213842037
- [4] T. Wang, L. Gong, C. Wang, Y. Yang, Y. Gao, X. Zhou, and H. Chen, “ViA: A Novel Vision-Transformer Accelerator Based on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 41, no. 11, pp. 4088–4099, Nov. 2022. [Online]. Available: https://ieeexplore.ieee.org/document/9925700/
- [5] J. J. Lin, R. Nogueira, and A. Yates, “Pretrained transformers for text ranking: Bert and beyond,” Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:222310837
- [6] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [7] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-xl: Attentive language models beyond a fixed-length context,” ArXiv, vol. abs/1901.02860, 2019. [Online]. Available: https://api.semanticscholar.org/CorpusID:57759363
- [8] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
- [9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
- [10] A. Roy, M. Saffar, A. Vaswani, and D. Grangier, “Efficient content-based sparse attention with routing transformers,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 53–68, 2021. [Online]. Available: https://aclanthology.org/2021.tacl-1.4
- [11] W. Wang, B. Bi, M. Yan, C. Wu, Z. Bao, J. Xia, L. Peng, and L. Si, “Structbert: Incorporating language structures into pre-training for deep language understanding,” arXiv preprint arXiv:1908.04577, 2019.
- [12] L. Lu, Y. Jin, H. Bi, Z. Luo, P. Li, T. Wang, and Y. Liang, “Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture,” in MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO ’21. New York, NY, USA: Association for Computing Machinery, 2021, p. 977–991. [Online]. Available: https://doi.org/10.1145/3466752.3480125
- [13] T. J. Ham, Y. Lee, S. H. Seo, S. Kim, H. Choi, S. J. Jung, and J. W. Lee, “ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks,” in 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Jun. 2021, pp. 692–705, iSSN: 2575-713X.
- [14] S. Zeng, J. Liu, G. Dai, X. Yang, T. Fu, H. Wang, W. Ma, H. Sun, S. Li, Z. Huang, Y. Dai, J. Li, Z. Wang, R. Zhang, K. Wen, X. Ning, and Y. Wang, “Flightllm: Efficient large language model inference with a complete mapping flow on fpgas,” in Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, New York, USA, 2024. [Online]. Available: https://doi.org/10.1145/3626202.3637562
- [15] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, “[dl] a survey of fpga-based neural network inference accelerators,” ACM Trans. Reconfigurable Technol. Syst., vol. 12, no. 1, mar 2019. [Online]. Available: https://doi.org/10.1145/3289185
- [16] M. Rognlien, Z. Que, J. G. F. Coutinho, and W. Luk, “Hardware-Aware Optimizations for Deep Learning Inference on Edge Devices,” in Applied Reconfigurable Computing. Architectures, Tools, and Applications, L. Gan, Y. Wang, W. Xue, and T. Chau, Eds. Cham: Springer Nature Switzerland, 2022, vol. 13569, pp. 118–133, series Title: Lecture Notes in Computer Science. [Online]. Available: https://link.springer.com/10.1007/978-3-031-19983-7_9
- [17] S. K. Venkataramanaiah, H.-S. Suh, S. Yin, E. Nurvitadhi, A. Dasu, Y. Cao, and J.-S. Seo, “Fpga-based low-batch training accelerator for modern cnns featuring high bandwidth memory,” in 2020 IEEE/ACM International Conference On Computer Aided Design (ICCAD), 2020, pp. 1–8.
- [18] E. Kabir, D. Coble, J. N. Satme, A. R. Downey, J. D. Bakos, D. Andrews, and M. Huang, “Accelerating lstm-based high-rate dynamic system models,” in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL), 2023, pp. 327–332.
- [19] B. Zhang, H. Zeng, and V. Prasanna, “Accelerating large scale gcn inference on fpga,” in 2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020, pp. 241–241.
- [20] Z. Que, M. Loo, H. Fan, M. Pierini, A. Tapper, and W. Luk, “Optimizing Graph Neural Networks for Jet Tagging in Particle Physics on FPGAs,” in 2022 32nd International Conference on Field-Programmable Logic and Applications (FPL). Belfast, United Kingdom: IEEE, Aug. 2022, pp. 327–333. [Online]. Available: https://ieeexplore.ieee.org/document/10035216/
- [21] S. Lu, M. Wang, S. Liang, J. Lin, and Z. Wang, “Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer,” in 2020 IEEE 33rd International System-on-Chip Conference (SOCC). Las Vegas, NV, USA: IEEE, Sep. 2020, pp. 84–89. [Online]. Available: https://ieeexplore.ieee.org/document/9524802/
- [22] T. J. Ham, S. Jung, S. Kim, Y. H. Oh, Y. Park, Y. Song, J.-H. Park, S. Lee, K. Park, J. W. Lee, and D.-K. Jeong, “A3: Accelerating attention mechanisms in neural networks with approximation,” 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 328–341, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:211296403
- [23] H. Peng, S. Huang, S. Chen, B. Li, T. Geng, A. Li, W. Jiang, W. Wen, J. Bi, H. Liu, and C. Ding, “A length adaptive algorithm-hardware co-design of transformer on FPGA through sparse attention and dynamic pipelining,” in Proceedings of the 59th ACM/IEEE Design Automation Conference. San Francisco California: ACM, Jul. 2022, pp. 1135–1140. [Online]. Available: https://dl.acm.org/doi/10.1145/3489517.3530585
- [24] S. Hur, S. Na, D. Kwon, J. Kim, A. Boutros, E. Nurvitadhi, and J. Kim, “A fast and flexible fpga-based accelerator for natural language processing neural networks,” ACM Trans. Archit. Code Optim., vol. 20, no. 1, feb 2023. [Online]. Available: https://doi.org/10.1145/3564606
- [25] H. Peng, S. Huang, T. Geng, A. Li, W. Jiang, H. Liu, S. Wang, and C. Ding, “Accelerating Transformer-based Deep Learning Models on FPGAs using Column Balanced Block Pruning,” in 2021 22nd International Symposium on Quality Electronic Design (ISQED). Santa Clara, CA, USA: IEEE, Apr. 2021, pp. 142–148. [Online]. Available: https://ieeexplore.ieee.org/document/9424344/
- [26] Z. Jiang, D. Yin, E. E. Khoda, V. Loncar, E. Govorkova, E. Moreno, P. Harris, S. Hauck, and S.-C. Hsu, “Ultra Fast Transformers on FPGAs for Particle Physics Experiments.”
- [27] F. Wojcicki, Z. Que, A. D. Tapper, and W. Luk, “Accelerating Transformer Neural Networks on FPGAs for High Energy Physics Experiments,” in 2022 International Conference on Field-Programmable Technology (ICFPT). Hong Kong: IEEE, Dec. 2022, pp. 1–8. [Online]. Available: https://ieeexplore.ieee.org/document/9974463/
- [28] Y. Chen, T. Li, X. Chen, Z. Cai, and T. Su, “High-Frequency Systolic Array-Based Transformer Accelerator on Field Programmable Gate Arrays,” Electronics, vol. 12, no. 4, p. 822, Jan. 2023, number: 4 Publisher: Multidisciplinary Digital Publishing Institute. [Online]. Available: https://www.mdpi.com/2079-9292/12/4/822
- [29] X. Yang and T. Su, “EFA-Trans: An Efficient and Flexible Acceleration Architecture for Transformers,” Electronics, vol. 11, no. 21, p. 3550, Oct. 2022. [Online]. Available: https://www.mdpi.com/2079-9292/11/21/3550
- [30] Y. Bai and F. University, “LTrans-OPU: A Low-Latency FPGA-Based Overlay Processor for Transformer Networks.”
- [31] P. Ganesh, Y. Chen, X. Lou, M. A. Khan, Y. Yang, H. Sajjad, P. Nakov, D. Chen, and M. Winslett, “Compressing large-scale transformer-based models: A case study on bert,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 1061–1080, 2021.
- [32] B. Li, S. Pandey, H. Fang, Y. Lyv, J. Li, J. Chen, M. Xie, L. Wan, H. Liu, and C. Ding, “FTRANS: energy-efficient acceleration of transformers using FPGA,” in Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design. Boston Massachusetts: ACM, Aug. 2020, pp. 175–180. [Online]. Available: https://dl.acm.org/doi/10.1145/3370748.3406567
- [33] H. Wang, Z. Zhang, and S. Han, “Spatten: Efficient sparse attention architecture with cascade token and head pruning,” in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021, pp. 97–110.
- [34] Z. Luo, L. Lu, Y. Jin, L. Jia, and Y. Liang, “Calabash: Accelerating Attention Using a Systolic Array Chain on FPGAs,” in 2023 33rd International Conference on Field-Programmable Logic and Applications (FPL). Gothenburg, Sweden: IEEE, Sep. 2023, pp. 242–247. [Online]. Available: https://ieeexplore.ieee.org/document/10296242/
- [35] W. Ye, X. Zhou, J. Zhou, C. Chen, and K. Li, “Accelerating Attention Mechanism on FPGAs based on Efficient Reconfigurable Systolic Array,” ACM Transactions on Embedded Computing Systems, vol. 22, no. 6, pp. 1–22, Nov. 2023. [Online]. Available: https://dl.acm.org/doi/10.1145/3549937
- [36] S. Fujimaki, Y. Inoue, D. Hisano, K. Maruta, Y. Nakayama, and Y. Hara-Azumi, “A Self-Attention Network for Deep JSCCM: The Design and FPGA Implementation,” in GLOBECOM 2022 - 2022 IEEE Global Communications Conference. Rio de Janeiro, Brazil: IEEE, Dec. 2022, pp. 6390–6395. [Online]. Available: https://ieeexplore.ieee.org/document/10001518/
- [37] R. Qiao, X. Guo, W. Mao, J. Li, and H. Lu, “FPGA-based design and implementation of the location attention mechanism in neural networks,” Journal of Intelligent & Fuzzy Systems, vol. 43, no. 4, pp. 5309–5323, Aug. 2022. [Online]. Available: https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/JIFS-212273
- [38] X. Zhang, Y. Wu, P. Zhou, X. Tang, and J. Hu, “Algorithm-hardware Co-design of Attention Mechanism on FPGA Devices,” ACM Transactions on Embedded Computing Systems, vol. 20, no. 5s, pp. 1–24, Oct. 2021. [Online]. Available: https://dl.acm.org/doi/10.1145/3477002
- [39] “AMD Technical Information Portal — docs.amd.com,” https://docs.amd.com/r/en-US/ug1399-vitis-hls/AXI4-Master-Interface, [Accessed 21-07-2024].
- [40] “Introduction • DMA/Bridge Subsystem for PCI Express Product Guide (PG195) • Reader • Documentation Portal.” [Online]. Available: https://docs.xilinx.com/r/en-US/pg195-pcie-dma
- [41] “AMD Technical Information Portal — docs.amd.com,” https://docs.amd.com/v/u/en-US/axi_timer_ds764, [Accessed 15-07-2024].
- [42] “Programmers — digilent.com,” https://digilent.com/shop/fpga-boards/programmers/, [Accessed 15-07-2024].
- [43] “AMD Technical Information Portal — docs.amd.com,” https://docs.amd.com/v/u/en-US/axi_uartlite_ds741, [Accessed 15-07-2024].
- [44] T. Li, F. Zhang, X. Fan, J. Shen, W. Guo, and W. Cao, “Unified Accelerator for Attention and Convolution in Inference Based on FPGA,” in 2023 IEEE International Symposium on Circuits and Systems (ISCAS). Monterey, CA, USA: IEEE, May 2023, pp. 1–5. [Online]. Available: https://ieeexplore.ieee.org/document/10182145/
- [45] G. Shen, J. Zhao, Q. Chen, J. Leng, C. Li, and M. Guo, “SALO: an efficient spatial accelerator enabling hybrid sparse attention mechanisms for long sequences,” in Proceedings of the 59th ACM/IEEE Design Automation Conference. San Francisco California: ACM, Jul. 2022, pp. 571–576. [Online]. Available: https://dl.acm.org/doi/10.1145/3489517.3530504
- [46] J. Zhao, L. Feng, S. Sinha, W. Zhang, Y. Liang, and B. He, “Performance Modeling and Directives Optimization for High-Level Synthesis on FPGA,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 39, no. 7, pp. 1428–1441, Jul. 2020. [Online]. Available: https://ieeexplore.ieee.org/document/8695879/