ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
Abstract
Large Language Models (LLMs) have revolutionized natural language processing tasks. However, their practical application is constrained by substantial memory and computational demands. Post-training quantization (PTQ) is considered an effective method to accelerate LLM inference. Despite its growing popularity in LLM model compression, PTQ deployment faces two major challenges. First, low-bit quantization leads to performance degradation. Second, restricted by the limited integer computing unit types on GPUs, quantized matrix operations with different precisions cannot be effectively accelerated. To address these issues, we introduce ABQ-LLM, a novel arbitrary-bit quantization algorithm and inference framework. It achieves superior performance across various quantization settings and enables efficient arbitrary-precision quantized inference on the GPU. ABQ-LLM introduces several key innovations: (1) a distribution correction method for transformer blocks to mitigate distribution differences caused by full quantization of weights and activations, improving performance at low bit-widths; (2) a bit balance strategy to counteract performance degradation from asymmetric distribution issues at very low bit-widths (e.g., 2-bit); (3) an innovative quantization acceleration framework that reconstructs quantized matrix multiplication for arbitrary precision combinations based on BTC (Binary TensorCore) equivalents, eliminating the limitations of INT4/INT8 computing units. ABQ-LLM converts each component's bit-width gain into an actual acceleration gain, maximizing performance under mixed precision (e.g., W6A6, W2A8). Under the W2*A8 quantization configuration on the LLaMA-7B model, it achieves a WikiText2 perplexity of 7.59 (a 2.17 reduction from AffineQuant's 9.76). Compared to SmoothQuant, we realize a 1.6× acceleration improvement and a 2.7× memory compression gain.
1 Introduction
Recent advancements in large language models (LLMs) (Bubeck et al. 2023; Touvron et al. 2023a, b) have demonstrated impressive capabilities across various natural language benchmarks, including reasoning (Clark et al. 2019, 2018), cognitive processing (Xu et al. 2023a; Hardy et al. 2023), and dialogue generation (Hu et al. 2023). However, these models are characterized by a substantial number of parameters, posing significant challenges in terms of memory consumption and bandwidth (Zheng et al. 2024; Kim et al. 2023).
Post-training quantization (PTQ) effectively reduces both computational and storage requirements. This technique significantly accelerates model inference by converting the weights and activation values of large language models (LLMs) from high-precision floating-point numbers to low-precision integer values for storage, and by using efficient integer matrix multiplication operators to handle the bulk of the matrix multiplication computation during inference. Currently, 80% of the computation and parameter access in LLMs is concentrated on general matrix multiplication (GEMM) and matrix-vector multiplication (GEMV) operations; during autoregressive decoding in particular, all GEMM operations degrade into GEMV operations because only a single token is generated per step. Consequently, the efficiency of GEMV computation and memory access directly determines the efficiency and power consumption of LLM inference.
To improve GEMM/GEMV memory access efficiency, LLM inference typically employs a quantized inference strategy. The current mainstream approach is weight-only quantization, where the kernel performs the actual computation on dequantized FP16 values. However, this approach offers limited performance improvement in highly parallel scenarios. To further enhance quantized inference performance, the industry is pursuing full quantization of both weights and activation values to reduce activation memory access and to leverage the higher computational power of quantized kernels, such as those provided by NVIDIA. However, current industry practices in weight and activation full quantization (WA full quantization) face several limitations. NVIDIA provides only a limited set of hardware-accelerated instructions (Lin et al. 2024a; Ashkboos et al. 2024; Zhao et al. 2024), which constrains the design space for quantization algorithms. Other quantization combinations (e.g., W4A8 or W2A4) require type conversion to W8A8 or W4A4 during computation, leading to inefficiency (Lin et al. 2024b). Furthermore, because of the GEMV pattern, additional padding is required when the batch size is less than 8, resulting in inefficient matrix multiplication for W4A4 and W8A8. Finally, WA fully quantized models encounter significant accuracy challenges at low bit-widths (e.g., W2A8, W2A6).

In this paper, we introduce a novel quantization framework for PTQ, called ABQ-LLM. By examining the quantization sensitivity of components within the transformer block (Figure 1) and the attention map before and after quantization (Figure 2), we find that the down_proj linear layer and the attention map are particularly sensitive to quantization. To address this, we propose a double cosine similarity distribution correction and an attention map distribution bootstrap for the output of down_proj. This method calibrates the quantization constants and restores the model's performance at low bit-widths such as W6A6, W4A4, and W2A8. Additionally, we analyze performance degradation in low-bit quantization and address the asymmetric loss issue in low-bit representations such as INT2 using a bit balance strategy. Finally, we implement a customized software engine that supports fully quantized inference for arbitrary precision combinations based on BTC equivalents, fully exploiting the advantages of quantized models under mixed precision. Our contributions are summarized as follows:
- We propose a novel block-wise distribution correction and compensation scheme in the PTQ domain to mitigate the distribution discrepancy caused by full quantization of weights and activations, thereby improving model performance at low bit-widths.
- We address the asymmetric loss problem at very low bit-widths such as INT2 and significantly improve model performance under INT2 quantization configurations with a bit balance strategy.
- We propose a software engine that, for the first time in the LLM field, achieves quantization freedom. It eliminates the limitations of INT4/INT8 computing units and effectively avoids the GEMV problem. Under the LLaMA-7B W2A8 configuration, it delivers a 1.6× speedup over SmoothQuant, achieving SOTA performance.

2 Related Work
LLM quantization can be broadly divided into weight-only quantization and weight-activation quantization.
Weight-only quantization. To alleviate computational burdens, some studies focus on weight-only quantization. LLM.int8() (Dettmers et al. 2022) achieves accurate INT8 quantization by retaining significant channels. GPTQ (Frantar et al. 2022) uses Hessian-based error compensation to reduce quantization errors in LLMs, enabling 3-bit quantization. AWQ (Lin et al. 2024a) and OWQ (Lee et al. 2024) significantly enhance quantized model performance by considering the impact of activation outliers on weight quantization. Methods like QuIP (Chee et al. 2024), QuIP# (Tseng et al. 2024), and AQLM (Egiazarian et al. 2024) facilitate 2-bit quantization through learnable codebooks or additional fine-tuning. Approaches such as (Dettmers et al. 2023; Shang et al. 2023; Huang et al. 2024) improve PTQ performance through unstructured mixed-precision fine-grained weight grouping. Additionally, research such as (Dettmers et al. 2024; Xu et al. 2023b; Arshia et al. 2022; Bondarenko, Del Chiaro, and Nagel 2024) employs parameter-efficient fine-tuning (PEFT) techniques to compress weights through fine-tuning.
Weight-activation quantization. Weight-activation quantization differs from weight-only quantization by quantizing both weights and activations (including KV caches) to accelerate LLM inference. The main challenge in quantizing activations is handling outliers, which can cause significant quantization errors. To address this issue, ZeroQuant (Yao et al. 2022) proposes a fine-grained, hardware-friendly quantization scheme for weights and activations. SmoothQuant (Xiao et al. 2023) shifts the quantization difficulty from activations to weights through mathematically equivalent transformations, achieving W8A8 quantization. (Shao et al. 2023; Ma et al. 2024b; Hu et al. 2024a) enhance performance by training the quantization parameters. Restricted by GPU instruction support, however, these works can only perform actual inference with W8A8 kernels, even when they achieve lower quantization bit-widths (e.g., W6A6).
3 Method
In this section, we provide a detailed introduction to our ABQ-LLM. We first describe the distribution correction and bit balance strategy and then introduce our arbitrary-bit inference framework.

3.1 Preliminary
(Xiao et al. 2023) achieves WA full quantization by scaling activation outliers, but this increases the range variability of the weights, making weight quantization more sensitive. Conversely, (Lin et al. 2024a) optimizes weight quantization by scaling weights, which significantly increases the diversity of activations, complicating activation quantization. These approaches highlight the drawbacks of manually setting the scaling balance factor between activations and weights, which makes it challenging to achieve a good balance. To address this issue, we introduce a distribution correction-guided scaling method. Following the approach of (Shao et al. 2023), we set the balance vector between weights and activations as a learnable parameter and add learnable clipping parameters for the weights. Employing our distribution correction and bit balance strategy to optimize model performance, our objective is as follows:
$$\arg\min_{s,\;\gamma,\;\beta}\;\left\lVert\, Q_w\!\big(\operatorname{clip}(W,\gamma,\beta)\cdot s\big)\;Q_a\!\big(X / s\big) \;-\; W X \,\right\rVert \tag{1}$$

where $W$ and $X$ are the full-precision weight and activation, $Q_w(\cdot)$ and $Q_a(\cdot)$ denote the quantizers of the weight and activation, $\operatorname{clip}(\cdot)$ denotes the clipping operation, $s$ is the scale factor, and $\gamma$ and $\beta$ control the clipping range of the weight.
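For concreteness, the following is a minimal PyTorch-style sketch of an objective of this form, assuming per-channel fake quantization, a per-input-channel balance vector `s`, and scalar clipping parameters `gamma` and `beta`; the function and variable names are illustrative rather than the paper's implementation, and the straight-through estimator needed to train through rounding is omitted.

```python
import torch

def fake_quant(t, bits, per_channel_dim=None):
    # Uniform asymmetric fake quantization (quantize, then dequantize).
    qmax = 2 ** bits - 1
    if per_channel_dim is None:
        tmin, tmax = t.min(), t.max()
    else:
        dims = [d for d in range(t.dim()) if d != per_channel_dim]
        tmin = t.amin(dim=dims, keepdim=True)
        tmax = t.amax(dim=dims, keepdim=True)
    step = (tmax - tmin).clamp(min=1e-8) / qmax
    zp = torch.round(-tmin / step)
    q = torch.clamp(torch.round(t / step) + zp, 0, qmax)
    return (q - zp) * step

def eq1_objective(W, X, s, gamma, beta, w_bits=2, a_bits=8):
    # W: [out, in] weight, X: [tokens, in] activation, s: [in] balance vector.
    W_clipped = torch.minimum(torch.maximum(W, beta * W.min()), gamma * W.max())
    W_q = fake_quant(W_clipped * s, w_bits, per_channel_dim=0)   # Q_w(clip(W, γ, β) · s)
    X_q = fake_quant(X / s, a_bits)                              # Q_a(X / s)
    return torch.linalg.norm(X_q @ W_q.T - X @ W.T)              # ‖Q_w(...) Q_a(...) − W X‖
```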
3.2 Improving Quantization by Distribution Correction
We observed significant variations in quantization sensitivity across the different layers of LLM models, with some layers having a critical impact on quantization performance. To validate this, as shown in Figure 1, we quantized the various components of the LLaMA-7B model under weight-activation full quantization. While quantizing the gate_proj and up_proj layers of the MLP and the attention layers resulted in only minor performance degradation, quantizing the down_proj linear layer caused a substantial performance drop. This indicates that addressing down_proj quantization is crucial for performance recovery. Further analysis revealed that the primary cause of this degradation is the quantization of the down_proj activations: at low bit-widths such as INT4, INT3, and INT2, the limited representation range causes a significant shift in the model distribution compared to full precision. As illustrated in Figure 3, during the block-wise quantization calibration process, we apply a double logarithm of cosine similarity loss on the output of down_proj to correct the distribution of the quantized model. We call this loss function the DLC loss:
$$\mathcal{L}_{\mathrm{DLC}} \;=\; -\log\!\big(\cos\langle X_q^{\,i},\, X_f^{\,i}\rangle\big) \;-\; \log\!\big(\cos\langle X_q^{\,i},\, X_{fq}^{\,i}\rangle\big) \tag{2}$$

where $X_q^{\,i}$ represents the quantized output of the $i$-th transformer block, $X_f^{\,i}$ represents the full-precision output of the $i$-th transformer block, and $X_{fq}^{\,i}$ represents the full-precision output of the $i$-th transformer block whose input originates from the quantized output of the $(i{-}1)$-th transformer block.
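A minimal PyTorch sketch of this loss is given below; the pairing of the two cosine terms follows our reading of Eq. (2), and the function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def dlc_loss(x_q, x_f, x_fq, eps=1e-6):
    # x_q : quantized output of block i
    # x_f : full-precision output of block i (full-precision input)
    # x_fq: full-precision output of block i computed on the quantized
    #       output of block i-1
    cos_f = F.cosine_similarity(x_q.flatten(1), x_f.flatten(1), dim=-1)
    cos_fq = F.cosine_similarity(x_q.flatten(1), x_fq.flatten(1), dim=-1)
    return (-torch.log(cos_f.clamp(min=eps))
            - torch.log(cos_fq.clamp(min=eps))).mean()
```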
Additionally, we analyzed the cosine similarity between the activations at the input and output of the decoder blocks in the LLaMA-7B model. The results revealed significantly different similarities for the initial and final blocks, indicating their considerable impact on model inference performance. In response, we applied distribution compensation vectors to the down_proj layers of these blocks to correct the distribution discrepancies using Eq. (3).
$$W_q \;=\; \mathrm{clamp}\!\left(\left\lfloor \frac{W \odot \delta_1 + \delta_2}{h} \right\rceil + z,\; 0,\; 2^{N}-1\right) \tag{3}$$

where $\lfloor\cdot\rceil$ denotes the round operation, $N$ represents the target bit-width, $h$ denotes the step size, and $z$ is the zero-point. $W_q$ and $W$ denote the quantized and full-precision weight, respectively. The vectors $\delta_1$ and $\delta_2$ are distribution compensation vectors, where $\delta_1 \neq \mathbf{1}$ or $\delta_2 \neq \mathbf{0}$ indicates that compensation is performed, and $\delta_1 = \mathbf{1}$, $\delta_2 = \mathbf{0}$ indicates no compensation.


To further enhance the performance of the quantized model, we analyzed the change in the attention map distribution before and after quantization, as shown in Figure 2. In the full-precision model, attention is heavily focused on the first token, highlighting its key role in guiding text generation, consistent with the findings of LLM-QAT (Liu et al. 2023). However, quantization disrupts this attention pattern, diminishing the focus on the first token. To restore the model's attention during quantization, we introduce an attention-aware KL divergence (AKL) loss to reconstruct the attention map:
$$\mathcal{L}_{\mathrm{AKL}} \;=\; D_{\mathrm{KL}}\!\left(A_f^{\,i} \,\big\|\, A_q^{\,i}\right) \tag{4}$$

where $A_q^{\,i}$ denotes the quantized attention map output of the $i$-th transformer block, while $A_f^{\,i}$ refers to the full-precision attention map output of the same block.
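A minimal PyTorch sketch of such an attention-aware KL term follows; the direction of the divergence (full-precision map as the reference distribution) and the tensor layout are our assumptions.

```python
import torch

def akl_loss(attn_q, attn_f, eps=1e-8):
    # attn_q / attn_f: quantized / full-precision attention maps of shape
    # [batch, heads, query, key]; each row already sums to 1 after softmax.
    p = attn_f.clamp(min=eps)                        # reference (full precision)
    q = attn_q.clamp(min=eps)                        # quantized
    return (p * (p.log() - q.log())).sum(dim=-1).mean()
```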
Finally, we combine the DLC loss and the AKL loss, and our overall optimization objective becomes:
$$\arg\min_{s,\;\gamma,\;\beta,\;\delta_1,\;\delta_2}\;\big(\mathcal{L}_{\mathrm{DLC}} + \mathcal{L}_{\mathrm{AKL}}\big) \tag{5}$$

where $s$, $\gamma$, $\beta$, $\delta_1$, and $\delta_2$ are the parameters of the $i$-th transformer block obtained after calibration. When the distributions of the quantized output and the full-precision output match, the loss is close to 0, which effectively guides the quantization process.
3.3 Bit Balance Strategy
Typically, pre-trained LLM weights exhibit a near-normal distribution characterized by strong symmetry. Using Q-Q (quantile-quantile) plots, we confirmed this symmetry in the weight distributions of pre-trained models (see Appendix A). However, standard INT2 quantization is limited to four values, so its symmetric quantization grid is either {-2, -1, 0, 1} or {-1, 0, 1, 2}; both are themselves asymmetric and therefore disrupt the originally symmetric weight distribution (see Appendix A). This asymmetry leads to significant performance degradation, as shown in Table 1: perplexity worsens by 0.46 from W4A16 to W3A16 but by 5.19 from W3A16 to W2A16, a sharp decline. To address this impact on LLM quantization, we adopt a bit balance strategy similar to (Li et al. 2016; Ma et al. 2024a), extending the INT2 symmetric quantization space to {-2, -1, 0, 1, 2}. This modification restores the WikiText2 perplexity to 7.50, which is within a reasonable range of the W3A16 result.
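The following toy sketch contrasts round-to-nearest weight quantization onto the asymmetric INT2 grid with the balanced grid {-2, -1, 0, 1, 2}; it only illustrates the grid extension and is not the paper's exact quantizer.

```python
import torch

def quantize_to_grid(w, levels):
    # Round-to-nearest onto a fixed integer grid with a per-row scale.
    levels = torch.tensor(levels, dtype=w.dtype)
    scale = w.abs().amax(dim=1, keepdim=True) / levels.abs().max()
    q = torch.round(w / scale).clamp(float(levels.min()), float(levels.max()))
    return q * scale

w = torch.randn(4096, 4096)                        # near-symmetric pre-trained weight
w_int2 = quantize_to_grid(w, [-2, -1, 0, 1])       # standard INT2 grid (asymmetric)
w_bal  = quantize_to_grid(w, [-2, -1, 0, 1, 2])    # bit-balanced grid (symmetric)
print((w - w_int2).pow(2).mean().item(),           # larger reconstruction error
      (w - w_bal).pow(2).mean().item())            # smaller reconstruction error
```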
3.4 Custom Software Engine
Reconstructing Arbitrary Bit Computation. To support W1A1 quantization, NVIDIA introduced 1-bit (INT1) TensorCores in the Turing and later architectures. However, W1A1 quantization has not been widely adopted because of its significant performance degradation. Through a mathematical analysis of quantized matrix multiplication, we find that any combination of quantization bit-widths can be decomposed into a superposition of 1-bit matrix multiplications. Assume the weights of a neural network layer are quantized to q bits and the input activations are quantized to p bits; the matrix multiplication of the weight $W$ and activation $X$ then yields a 32-bit output $O$. The key observation is that the scalar value at any position of $W$ or $X$ can be decomposed into a series of 1-bit scalars, so a scalar operation of any precision combination decomposes into 1-bit operations and shift operations. For example, a 2-bit $x$ can be expressed as:
$$x \;=\; 2^{1}\,x^{(1)} + 2^{0}\,x^{(0)} \tag{6}$$

where $x^{(1)}, x^{(0)} \in \{0, 1\}$. We use $\phi(\cdot,\cdot)$ to denote a computational operation whose inputs are 1-bit data and whose output is 32-bit. Thus, the original scalar-level arbitrary-precision computation can be represented as:

$$w \cdot x \;=\; \left(\sum_{i=0}^{q-1} 2^{i}\, w^{(i)}\right)\!\left(\sum_{j=0}^{p-1} 2^{j}\, x^{(j)}\right) \;=\; \sum_{i=0}^{q-1}\sum_{j=0}^{p-1} 2^{\,i+j}\,\phi\!\left(w^{(i)},\, x^{(j)}\right) \tag{7}$$
The above procedure generalizes to matrix multiplication with any combination of bit-widths p and q; the detailed formulas are given in Appendix B. Using these transformations, we decompose the operation of an arbitrary quantization combination into a superposition of 1-bit matrix multiplications, enabling the underlying implementation to invoke high-throughput binary TensorCore instructions.
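As a sanity check, the following NumPy sketch verifies this decomposition numerically for a W2A8-style configuration (unsigned integer codes; zero-points and scales are assumed to be folded into the dequantization step).

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 8, 2                       # activation bits p, weight bits q
M, K, N = 4, 128, 16

X = rng.integers(0, 2 ** p, size=(M, K)).astype(np.int64)   # p-bit activation codes
W = rng.integers(0, 2 ** q, size=(K, N)).astype(np.int64)   # q-bit weight codes

# 1-bit planes: X = sum_j 2^j X^(j),  W = sum_i 2^i W^(i)
X_planes = [(X >> j) & 1 for j in range(p)]
W_planes = [(W >> i) & 1 for i in range(q)]

# Superposition of p*q binary matrix products (what BMMA computes),
# each scaled by 2^(i+j) and accumulated into a 32-bit-style result.
O = sum((2 ** (i + j)) * (X_planes[j] @ W_planes[i])
        for i in range(q) for j in range(p))

assert np.array_equal(O, X @ W)   # identical to the full integer product
```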
Model | Bits | WikiText2 | C4
---|---|---|---
LLaMA-7B | W4A16 | 5.83 | 7.29
LLaMA-7B | W3A16 | 6.29 | 8.01
LLaMA-7B | W2A16 | 11.48 | 15.74
LLaMA-7B | W2*A16 | 7.50 | 9.86
Engine Implementation. NVIDIA GPUs contain many processing elements called Streaming Multiprocessors (SMs) and use a large number of threads to execute computing tasks in parallel. Threads are organized into thread blocks, which are the smallest scheduling unit on an SM; the computation target is therefore decomposed and mapped onto thread blocks as Thread Block Tiles to achieve parallelism. As shown in Figure 4(a), for a GEMM task of shape M×N×K, each thread block computes a BM×BN output block, decomposed into sub-GEMM tasks of shape BM×BN×BK. Our engine converts a quantized matrix multiplication with bit-widths {p, q} into a special accumulation of p*q binarized matrix multiplications, so the true computation of a thread block tile is p*BM × q*BN. First, to improve memory access continuity, we propose a BitPacking strategy that decomposes a b-bit quantized tensor into b binary matrices, where b is the quantization bit-width. Taking the input X as an example, its bit-level layout changes from [M, K, p] to [p, M, K].

All threads within a thread block share the same shared memory (SMEM) space. Within each thread block, threads are further organized into warps of 32 consecutive threads. The warps collaborate to load the A matrix (p*BM × BK) and B matrix (BK × q*BN) required for the thread block tile computation from global memory (GL) and cache them in SMEM. Thanks to BitPacking, reading the p single-bit row-major BM×BK tiles and writing p*BM*BK bits to SMEM is efficient and contiguous. Because a thread block contains multiple warps, the thread block tile is further decomposed into Warp Tiles to achieve warp-level parallelism, with each warp computing a WM×WN sub-task. In the preparation stage, the A matrix (WM×WK, row-major) and B matrix (WK×WN, col-major) are loaded independently from SMEM into fragment registers (FR). The computation is then decomposed into WM_TILES*WARP_N_TILES TensorCore MMA (matrix-multiply-accumulate) operations. Since A and B are binarized matrices, we actually use Binary TensorCore MMA (BMMA), whose computing power is 8 times and 4 times that of the INT8 and INT4 TensorCores, respectively.

All warps collaboratively complete the Thread Block Tile computation, with the results held in each warp's C fragments, and each warp independently writes its results back to SMEM. As shown in Eq. (10), the output tile (p*BM × q*BN) is then globally reduced into the final BM×BN result, where each BM×BN sub-tile is multiplied by its bit-stacked scaling factor; we call this step Bit Reduction. As the final step, the warps collaboratively load the final result from SMEM and write it back to the target location in GL.
We implement the above computation process as a GPU kernel, called ABQKernel. As shown in Figure 4(b), ABQKernel replaces all GEMM operations in the decoder layer, together with the necessary BitPacking, quantization, and dequantization operations, to achieve arbitrary-bit quantized inference of the LLaMA model. We carefully manage the overhead of the quantization operators by fusing them into existing operators, and weight BitPacking is performed offline for greater efficiency.
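A NumPy sketch of the BitPacking layout transform is shown below, assuming unsigned quantization codes; the final packing of each bit-plane into 32-bit words is one plausible realization, not necessarily the engine's exact storage format.

```python
import numpy as np

def bit_pack(X, bits):
    # Split an [M, K] tensor of `bits`-bit codes into `bits` binary planes,
    # i.e. change the bit-level layout from [M, K, bits] to [bits, M, K],
    # then pack every 32 consecutive bits of a row into one uint32 word.
    M, K = X.shape
    assert K % 32 == 0
    planes = np.stack([(X >> b) & 1 for b in range(bits)])            # [bits, M, K]
    planes = planes.reshape(bits, M, K // 32, 32).astype(np.uint64)
    weights = (1 << np.arange(32, dtype=np.uint64))
    return (planes * weights).sum(axis=-1).astype(np.uint32)          # [bits, M, K/32]

X = np.random.randint(0, 256, size=(16, 4096))   # 8-bit activation codes
packed = bit_pack(X, bits=8)
print(packed.shape)                              # (8, 16, 128)
```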
GPU Kernel Optimization. When M=1, a GEMM problem of shape M×N×K degenerates into a GEMV problem and shifts from being computation-bound to memory-bound, which becomes a performance bottleneck for model inference. When ordinary TensorCores are used for accelerated computation, the M dimension is tiled in groups of 8, requiring padding whenever M < 8 and leading to up to 87.5% redundant computation. Thanks to the reconstruction of the computation and the BitPacking strategy, for a {p, q} configuration the actual computing task undertaken by ABQKernel is p*M × q*N × K. Expanding the M dimension effectively reduces the redundant computation when calling TensorCores, and when p*M ≥ 8 and p*M % 8 = 0, padding can be avoided entirely. We call this optimization strategy GEMV Elimination. In addition, Computational and Pipeline Optimization, Auto Kernel Search, and Bank Conflicts Elimination are also applied; for details, see Appendix D.
4 Experiments
Bits | Method | LLaMA-7B | LLaMA-13B | LLaMA-2-7B | LLaMA-2-13B | ||||
WikiText2 | C4 | WikiText2 | C4 | WikiText2 | C4 | WikiText2 | C4 | ||
W6A6 | SmoothQuant | 6.03 | 7.47 | 5.42 | 6.97 | 6.20 | 7.76 | 5.18 | 7.67 |
OmniQuant | 5.96 | 7.43 | 5.28 | 6.84 | 5.87 | 7.48 | 5.14 | 6.74 | |
I-LLM | 5.84 | 7.32 | 5.23 | 6.79 | 5.68 | 7.27 | 5.10 | 6.74 | |
ABQ-LLM | 5.81 | 7.27 | 5.21 | 6.77 | 5.63 | 7.21 | 5.00 | 6.64 | |
W4A4 | SmoothQuant | 22.25 | 32.22 | 40.05 | 47.18 | 83.12 | 77.27 | 35.88 | 43.19 |
OmniQuant | 11.26 | 14.51 | 10.87 | 13.78 | 14.26 | 18.02 | 12.30 | 14.55 | |
AffineQuant | 10.28 | 13.64 | 10.32 | 13.44 | 12.69 | 15.76 | 11.45 | 13.97 | |
I-LLM | 9.10 | 12.33 | 7.99 | 10.96 | 10.55 | 12.92 | 9.76 | 12.57 | |
ABQ-LLM | 8.63 | 12.10 | 7.69 | 10.90 | 9.31 | 12.85 | 8.62 | 11.47 | |
W2A8 | OmniQuant | 15.70 | 26.44 | 13.50 | 19.01 | 37.95 | 103.39 | 21.74 | 31.72 |
AffineQuant | 9.76 | 15.52 | 9.21 | 12.55 | 1483 | 4688 | 12.30 | 29.32 | |
I-LLM | 14.08 | 18.89 | 11.80 | 16.19 | 123.93 | 200.54 | 25.74 | 40.59 | |
ABQ-LLM | 11.35 | 15.41 | 9.20 | 12.48 | 13.47 | 17.82 | 13.24 | 18.07 | |
W2*A8 | ABQ-LLM | 7.59 | 10.00 | 6.49 | 8.53 | 7.85 | 10.33 | 6.65 | 10.01 |
4.1 Experimental Setup
Baseline. For weight-only quantization, we compare our approach with GPTQ (Frantar et al. 2022), AWQ (Lin et al. 2024a), OmniQuant (Shao et al. 2023), and AffineQuant (Ma et al. 2024b). For weight-activation quantization, we benchmark our method against SmoothQuant (Xiao et al. 2023), OmniQuant (Shao et al. 2023), and I-LLM (Hu et al. 2024b).
Models and Datasets. We primarily evaluate our method on LLaMA (7B-13B) (Touvron et al. 2023a) and LLaMA-2 (7B-13B) (Touvron et al. 2023b). Following previous work (Shao et al. 2023; Ma et al. 2024b), we evaluate the quantized models by reporting the perplexity of language generation on WikiText2 (Merity et al. 2016) and C4 (Raffel et al. 2020). To assess performance on zero-shot tasks, we select several popular benchmarks, including PIQA (Bisk et al. 2020), ARC (Clark et al. 2018), BoolQ (Clark et al. 2019), HellaSwag (Zellers et al. 2019), and Winogrande (Sakaguchi et al. 2021), using the lm-evaluation-harness (Gao et al. 2021).
Calibration. We initialize the balance vectors for weights and activations following (Xiao et al. 2023), with the learnable clipping parameters for the weights set to 1. For the distribution compensation vectors, we initialize $\delta_1$ as an all-ones vector and $\delta_2$ as an all-zeros vector, so that the compensation starts at zero. Using the AdamW optimizer (Loshchilov and Hutter 2017) with no weight decay, we set learning rates of 5e-3 for the balance vectors and 1e-2 for the clipping parameters and compensation vectors. The calibration data consists of 128 randomly selected 2048-token segments from WikiText2. Calibration is conducted on an NVIDIA A800-40G GPU with a batch size of 1 for 20 epochs. We perform per-token quantization for activations and the KV cache, and per-channel quantization for weights. By default, activations and the KV cache use the same quantization bit-width.
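For readers who want a concrete picture, the following PyTorch-style sketch outlines one block-wise calibration step under the settings above; the block interfaces, parameter grouping, and loss wiring are illustrative assumptions, reusing the `dlc_loss` and `akl_loss` sketches from Section 3.2.

```python
import torch

def calibrate_block(block_q, block_fp, calib_pairs,
                    balance_params, clip_and_comp_params,
                    dlc_loss, akl_loss, epochs=20):
    # block_q : quantized transformer block with learnable balance, clipping,
    #           and compensation parameters; block_fp: frozen FP16 counterpart.
    # calib_pairs: iterable of (full-precision input, quantized-path input),
    #              e.g. 128 segments of 2048 tokens with batch size 1.
    opt = torch.optim.AdamW(
        [{"params": balance_params, "lr": 5e-3},
         {"params": clip_and_comp_params, "lr": 1e-2}],
        weight_decay=0.0)
    for _ in range(epochs):
        for x_fp, x_q in calib_pairs:
            with torch.no_grad():
                x_f, attn_f = block_fp(x_fp)    # full-precision reference output
                x_fq, _ = block_fp(x_q)         # FP block applied to quantized input
            x_out, attn_q = block_q(x_q)
            loss = dlc_loss(x_out, x_f, x_fq) + akl_loss(attn_q, attn_f)
            opt.zero_grad()
            loss.backward()
            opt.step()
```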

4.2 Experiments on Language Generation Tasks
Language generation capability is central to large language models (LLMs). To validate our performance on challenging quantization tasks, we first compare perplexity, a crucial metric for language generation, against the baselines. As shown in Table 2, ABQ-LLM delivers outstanding performance across various quantization configurations, surpassing state-of-the-art methods such as AffineQuant and I-LLM. Notably, in the INT2 setting, applying the bit balance strategy yields significant improvements at minimal cost: the W2*A8 configurations substantially outperform the W2A8 configurations. Specifically, the perplexity on the WikiText2 and C4 datasets decreases by an average of 1.42 and 2.11 points, respectively, for W2*A8 compared to W4A4. These findings validate the effectiveness of our distribution correction and bit balance strategy. The results of weight-only quantization are presented in Table 6 in Appendix C, and additional results for full WA quantization are provided in Table 7 in Appendix C.
Model | Bits | Method | PiQA | ARC-e | ARC-c | BoolQ | HellaSwag | Winogrande | Avg. |
LLaMA-13B | W4A4 | OmniQuant | 69.69 | 47.30 | 33.10 | 62.84 | 58.96 | 55.80 | 54.37 |
AffineQuant | 66.32 | 43.90 | 29.61 | 64.10 | 56.88 | 54.70 | 52.58 | ||
I-LLM | 67.95 | 48.15 | 34.47 | 62.29 | 63.13 | 59.98 | 55.99 | ||
ABQ-LLM | 71.82 | 47.60 | 35.67 | 63.52 | 64.31 | 57.54 | 56.74 | ||
W2A8 | OmniQuant | 66.76 | 45.62 | 30.20 | 61.13 | 52.93 | 55.72 | 52.06 | |
AffineQuant | 71.00 | 46.70 | 32.33 | 62.23 | 58.62 | 63.53 | 55.73 | ||
I-LLM | 67.46 | 43.73 | 29.69 | 62.41 | 53.37 | 55.09 | 51.95 | ||
ABQ-LLM | 72.03 | 46.72 | 31.74 | 65.17 | 58.71 | 62.50 | 56.15 | ||
W2*A8 | ABQ-LLM | 74.91 | 54.92 | 38.65 | 68.53 | 68.21 | 66.54 | 61.96 | |
LLaMA-2-13B | W4A4 | OmniQuant | 67.08 | 45.66 | 32.25 | 63.73 | 58.39 | 54.61 | 53.62 |
AffineQuant | 67.68 | 46.63 | 32.85 | 65.90 | 60.62 | 54.14 | 54.63 | ||
I-LLM | 68.00 | 45.74 | 30.97 | 64.55 | 60.62 | 54.22 | 54.01 | ||
ABQ-LLM | 69.04 | 47.01 | 33.53 | 64.74 | 62.70 | 54.38 | 55.23 | ||
W2A8 | OmniQuant | 62.67 | 38.80 | 28.41 | 62.11 | 49.04 | 51.69 | 48.78 | |
AffineQuant | 61.31 | 38.51 | 26.96 | 62.04 | 41.92 | 50.74 | 46.91 | ||
I-LLM | 61.86 | 38.67 | 26.45 | 62.17 | 43.30 | 51.85 | 47.38 | ||
ABQ-LLM | 64.30 | 40.19 | 29.78 | 63.18 | 49.58 | 52.17 | 49.87 | ||
W2*A8 | ABQ-LLM | 73.50 | 49.79 | 35.15 | 70.15 | 67.45 | 58.87 | 59.15 |
4.3 Experiments on Zero-Shot Tasks
To further validate our model, we compare zero-shot accuracy with the baseline methods, as shown in Table 3. Our ABQ-LLM method outperforms the previous methods in most cases. Notably, after applying the bit balance strategy, the performance of W2*A8 improves significantly, by 7.50 points on average. Combining the performance on both language generation and zero-shot tasks, we conclude that ABQ-LLM achieves state-of-the-art results on challenging quantization tasks. See Appendix C for more results and analysis of the quantized configurations.
4.4 Inference Engine Evaluation
Kernel Benchmark. We evaluated the GEMV speedup of our ABQKernel on three matrix dimensions from LLaMA-7B and compared it with the quantization kernels provided by cuBLAS and CUTLASS. Note that CUTLASS supports only W4A4 and W8A8, while cuBLAS supports only W8A8 quantized operations. Our experiments were conducted on two different GPUs, the RTX 4080 and the RTX 3070. Figure 5 presents the results: our ABQKernel achieves superior acceleration across all matrix sizes. In particular, for special bit combinations such as W2A8 and W2A4, ABQKernel significantly outperforms the baselines, since cuBLAS and CUTLASS must fall back to W8A8 and W4A4 for computation. In the W2A8 configuration, our throughput is 7.47× that of W8A8 with CUTLASS and cuBLAS on the (1, 4096) × (4096, 4096) dimension.

End-to-end throughput. As shown in Figure 6, we integrate our ABQKernel into FastTransformer and compare it with the FP16 implementation of FastTransformer and the INT8 implementation of SmoothQuant. Compared to FP16, our scheme achieves a 2.95× inference acceleration and a 4.8× memory compression gain, requiring only 10 GB of memory for inference on the LLaMA-30B model, which is less than the memory required for LLaMA-7B in FP16. Additionally, our scheme achieves a 1.6× speedup and a 2.7× memory compression gain over SmoothQuant, significantly outperforming current mainstream inference methods. This substantial improvement reduces the cost of LLM services and facilitates their practical deployment.
Kernel Optimization Ablation. Table 4 presents the impact of the individual optimization techniques on kernel inference latency for the GEMV operation of dimension (1, 4096) × (4096, 4096). Our ABQKernel already significantly outperforms the CUTLASS W8A8 kernel before optimization. By further employing pipeline optimization, GEMV elimination, and auto kernel search, we reduce latency by 7.47× overall, with a corresponding 7.47× increase in throughput, significantly outperforming CUTLASS.
Method | Latency(us) | TOPS
---|---|---
CUTLASS | 49.96 | 0.67
Native_kernel | 20.05 | 1.67
+ Pipeline Optimization | 14.66 | 2.28
+ Eliminate GEMV | 10.92 | 3.07
+ Auto Kernel Search | 6.68 | 5.01
5 Conclusion
We present an arbitrary-bit quantization inference framework called ABQ-LLM. Through an in-depth analysis of LLM quantization, we introduce a distribution correction and a bit balance strategy to enhance model performance. We then design a novel arbitrary-bit inference engine to fully leverage the advantages of LLM quantization. Extensive experimental results demonstrate that ABQ-LLM achieves outstanding performance across various quantization configurations, including W6A6, W4A4, and W2A8. Moreover, ABQKernel consistently outperforms both CUTLASS and cuBLAS in all configurations. Our end-to-end inference is 1.6× faster than the industry SOTA, SmoothQuant, with a 2.7× memory compression gain.
References
- Arshia et al. (2022) Arshia, F. Z.; Keyvanrad, M. A.; Sadidpour, S. S.; and Mohammadi, S. M. R. 2022. PeQA: A Massive Persian Question-Answering and Chatbot Dataset. In 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE), 392–397. IEEE.
- Ashkboos et al. (2024) Ashkboos, S.; Mohtashami, A.; Croci, M. L.; Li, B.; Jaggi, M.; Alistarh, D.; Hoefler, T.; and Hensman, J. 2024. Quarot: Outlier-free 4-bit inference in rotated llms. arXiv preprint arXiv:2404.00456.
- Bisk et al. (2020) Bisk, Y.; Zellers, R.; Gao, J.; Choi, Y.; et al. 2020. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 7432–7439.
- Bondarenko, Del Chiaro, and Nagel (2024) Bondarenko, Y.; Del Chiaro, R.; and Nagel, M. 2024. Low-Rank Quantization-Aware Training for LLMs. arXiv preprint arXiv:2406.06385.
- Bubeck et al. (2023) Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S.; et al. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
- Chee et al. (2024) Chee, J.; Cai, Y.; Kuleshov, V.; and De Sa, C. M. 2024. Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems, 36.
- Clark et al. (2019) Clark, C.; Lee, K.; Chang, M.-W.; Kwiatkowski, T.; Collins, M.; and Toutanova, K. 2019. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
- Clark et al. (2018) Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; and Tafjord, O. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- Dettmers et al. (2022) Dettmers, T.; Lewis, M.; Belkada, Y.; and Zettlemoyer, L. 2022. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. Advances in Neural Information Processing Systems, 35: 30318–30332.
- Dettmers et al. (2024) Dettmers, T.; Pagnoni, A.; Holtzman, A.; and Zettlemoyer, L. 2024. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36.
- Dettmers et al. (2023) Dettmers, T.; Svirschevski, R.; Egiazarian, V.; Kuznedelev, D.; Frantar, E.; Ashkboos, S.; Borzunov, A.; Hoefler, T.; and Alistarh, D. 2023. Spqr: A sparse-quantized representation for near-lossless llm weight compression. arXiv preprint arXiv:2306.03078.
- Egiazarian et al. (2024) Egiazarian, V.; Panferov, A.; Kuznedelev, D.; Frantar, E.; Babenko, A.; and Alistarh, D. 2024. Extreme compression of large language models via additive quantization. arXiv preprint arXiv:2401.06118.
- Frantar et al. (2022) Frantar, E.; Ashkboos, S.; Hoefler, T.; and Alistarh, D. 2022. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
- Gao et al. (2021) Gao, L.; Tow, J.; Biderman, S.; Black, S.; DiPofi, A.; Foster, C.; Golding, L.; Hsu, J.; McDonell, K.; Muennighoff, N.; et al. 2021. A framework for few-shot language model evaluation. Version v0. 0.1. Sept, 10: 8–9.
- Hardy et al. (2023) Hardy, M.; Sucholutsky, I.; Thompson, B.; and Griffiths, T. 2023. Large language models meet cognitive science: LLMs as tools, models, and participants. In Proceedings of the annual meeting of the cognitive science society, volume 45.
- Hu et al. (2024a) Hu, X.; Chen, Y.; Yang, D.; Zhou, S.; Yuan, Z.; Yu, J.; and Xu, C. 2024a. I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models. arXiv preprint arXiv:2405.17849.
- Hu et al. (2024b) Hu, X.; Chen, Y.; Yang, D.; Zhou, S.; Yuan, Z.; Yu, J.; and Xu, C. 2024b. I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models. arXiv preprint arXiv:2405.17849.
- Hu et al. (2023) Hu, Z.; Feng, Y.; Luu, A. T.; Hooi, B.; and Lipani, A. 2023. Unlocking the potential of user feedback: Leveraging large language model as user simulators to enhance dialogue system. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 3953–3957.
- Huang et al. (2024) Huang, W.; Liu, Y.; Qin, H.; Li, Y.; Zhang, S.; Liu, X.; Magno, M.; and Qi, X. 2024. Billm: Pushing the limit of post-training quantization for llms. arXiv preprint arXiv:2402.04291.
- Kim et al. (2023) Kim, S.; Hooper, C.; Gholami, A.; Dong, Z.; Li, X.; Shen, S.; Mahoney, M. W.; and Keutzer, K. 2023. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629.
- Lee et al. (2024) Lee, C.; Jin, J.; Kim, T.; Kim, H.; and Park, E. 2024. Owq: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 13355–13364.
- Li et al. (2016) Li, F.; Liu, B.; Wang, X.; Zhang, B.; and Yan, J. 2016. Ternary weight networks. arXiv preprint arXiv:1605.04711.
- Lin et al. (2024a) Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.-M.; Wang, W.-C.; Xiao, G.; Dang, X.; Gan, C.; and Han, S. 2024a. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems, 6: 87–100.
- Lin et al. (2024b) Lin, Y.; Tang, H.; Yang, S.; Zhang, Z.; Xiao, G.; Gan, C.; and Han, S. 2024b. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. arXiv preprint arXiv:2405.04532.
- Liu et al. (2023) Liu, Z.; Oguz, B.; Zhao, C.; Chang, E.; Stock, P.; Mehdad, Y.; Shi, Y.; Krishnamoorthi, R.; and Chandra, V. 2023. Llm-qat: Data-free quantization aware training for large language models. arXiv preprint arXiv:2305.17888.
- Loshchilov and Hutter (2017) Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Ma et al. (2024a) Ma, S.; Wang, H.; Ma, L.; Wang, L.; Wang, W.; Huang, S.; Dong, L.; Wang, R.; Xue, J.; and Wei, F. 2024a. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764.
- Ma et al. (2024b) Ma, Y.; Li, H.; Zheng, X.; Ling, F.; Xiao, X.; Wang, R.; Wen, S.; Chao, F.; and Ji, R. 2024b. Affinequant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544.
- Merity et al. (2016) Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
- Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1–67.
- Sakaguchi et al. (2021) Sakaguchi, K.; Bras, R. L.; Bhagavatula, C.; and Choi, Y. 2021. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9): 99–106.
- Shang et al. (2023) Shang, Y.; Yuan, Z.; Wu, Q.; and Dong, Z. 2023. Pb-llm: Partially binarized large language models. arXiv preprint arXiv:2310.00034.
- Shao et al. (2023) Shao, W.; Chen, M.; Zhang, Z.; Xu, P.; Zhao, L.; Li, Z.; Zhang, K.; Gao, P.; Qiao, Y.; and Luo, P. 2023. Omniquant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137.
- Touvron et al. (2023a) Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Touvron et al. (2023b) Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Tseng et al. (2024) Tseng, A.; Chee, J.; Sun, Q.; Kuleshov, V.; and De Sa, C. 2024. Quip#: Even better LLM quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396.
- Xiao et al. (2023) Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; and Han, S. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 38087–38099. PMLR.
- Xu et al. (2023a) Xu, P.; Shao, W.; Zhang, K.; Gao, P.; Liu, S.; Lei, M.; Meng, F.; Huang, S.; Qiao, Y.; and Luo, P. 2023a. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265.
- Xu et al. (2023b) Xu, Y.; Xie, L.; Gu, X.; Chen, X.; Chang, H.; Zhang, H.; Chen, Z.; Zhang, X.; and Tian, Q. 2023b. Qa-lora: Quantization-aware low-rank adaptation of large language models. arXiv preprint arXiv:2309.14717.
- Yao et al. (2022) Yao, Z.; Yazdani Aminabadi, R.; Zhang, M.; Wu, X.; Li, C.; and He, Y. 2022. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35: 27168–27183.
- Zellers et al. (2019) Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- Zhao et al. (2024) Zhao, Y.; Lin, C.-Y.; Zhu, K.; Ye, Z.; Chen, L.; Zheng, S.; Ceze, L.; Krishnamurthy, A.; Chen, T.; and Kasikci, B. 2024. Atom: Low-bit quantization for efficient and accurate llm serving. Proceedings of Machine Learning and Systems, 6: 196–209.
- Zheng et al. (2024) Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
Appendix A Bit Balance Analysis
As shown in Figure 7, there is a clear asymmetry in the weights of o_proj after INT2 quantization, particularly evident in the 10th and 20th blocks. This asymmetry shifts the distribution of the model's weights, resulting in significant performance loss. After applying the bit balance strategy, however, the weight distribution closely aligns with that of the full-precision model, as illustrated in the subsequent figure. This alignment preserves the original symmetric distribution of the weights and greatly enhances model performance.

Appendix B Arbitrary Bit Matrix Calculation
Formally, given a q-bit weight matrix $W$ and a p-bit activation matrix $X$, we first decompose them into 1-bit matrices $W^{(i)}$ and $X^{(j)}$:

$$W = \sum_{i=0}^{q-1} 2^{i}\, W^{(i)}, \qquad X = \sum_{j=0}^{p-1} 2^{j}\, X^{(j)} \tag{8}$$

Letting BMMA denote 1-bit matrix multiplication, the BMMA operation can be invoked $p \times q$ times to compute the series of 1-bit matrix multiplication components:

$$C^{(i,j)} = \mathrm{BMMA}\!\left(X^{(j)},\, W^{(i)}\right), \qquad i = 0,\dots,q-1,\;\; j = 0,\dots,p-1 \tag{9}$$

Finally, we scale all 1-bit matrix multiplication components with their bit-stacked scaling factors and sum them to obtain the 32-bit output matrix $O$:

$$O = \sum_{i=0}^{q-1}\sum_{j=0}^{p-1} 2^{\,i+j}\, C^{(i,j)} \tag{10}$$

Through this transformation, the operation of an arbitrary quantization combination is decomposed into a special superposition of 1-bit matrix multiplications, so that the underlying layer can be implemented by calling high-throughput BMMA instructions.
Appendix C Full Results for Quantization Algorithms
In the main paper, we mainly report metrics for per-channel/per-token quantization. However, ABQ-LLM is naturally orthogonal to per-group quantization, and we validate this on LLaMA-7B with W4A4 g128 per-group quantization; the results are shown in Table 5. The experimental results demonstrate that, in most cases, our method outperforms another state-of-the-art (SOTA) method, Atom (Zhao et al. 2024), which employs per-group quantization. Additionally, our method exhibits a perplexity degradation of only about 0.5 compared to FP16, indicating excellent performance.
Furthermore, we present the perplexity (PPL) metrics and zero-shot performance of ABQ-LLM under various quantization configurations. Unfortunately, most prior work has not disclosed complete experimental data, with only PPL metrics for common configurations such as W4A16, W3A16, W2A16, W4A4, and W6A6 being published. Additionally, the full zero-shot results were not made publicly available. We aim to establish a comprehensive quantization baseline for post-training quantization (PTQ) by making our extensive data publicly accessible. However, due to limitations in time and computational resources, we can only provide our own comprehensive experimental results and cannot fully reproduce the results of other baselines.
Bits | Method | PPL | Accuracy(%) | |||||||
---|---|---|---|---|---|---|---|---|---|---|
Wikitext2 | C4 | PiQA | ARC-e | ARC-c | BoolQ | HellaSwag | Winogrande | Avg. | ||
FP16 | - | 5.67 | 7.08 | 77.47 | 52.48 | 41.46 | 73.08 | 73.00 | 67.07 | 64.09 |
W4A4 g128 | Atom | 6.16 | 7.70 | 76.28 | 52.10 | 38.99 | 69.79 | 69.81 | 63.69 | 61.78 |
W4A4 g128 | ABQ-LLM | 6.05 | 7.61 | 76.55 | 51.81 | 38.14 | 71.65 | 70.22 | 63.22 | 61.93 |
To ensure the reproducibility of our results, we specify that for the PPL test, we follow the GPTQ method by setting the sentence length to 2048. It is important to note that different sentence lengths may affect the accuracy of the metrics. Additionally, for the zero-shot accuracy test, two metrics may be used: acc or acc_norm. If acc_norm is available, we report acc_norm; otherwise, we report the acc metric. Finally, we have published the quantized model series on the Huggingface platform. Table 6 presents the perplexity metrics for weight-only quantization, while Table 7 displays the perplexity metrics for various configurations where both weights and activations are quantized simultaneously. The zero-shot accuracies of LLaMA-7B, LLaMA-13B, LLaMA-2-7B, and LLaMA-2-13B are shown in Tables 9, 10, 8, and 11, respectively. Extensive experiments demonstrate that our method consistently outperforms other baselines across all quantization configurations, achieving state-of-the-art performance in both weight-only quantization and weight-activation quantization.
Bits | Method | LLaMA-7B | LLaMA-13B | LLaMA-2-7B | LLaMA-2-13B | ||||
WikiText2 | C4 | WikiText2 | C4 | WikiText2 | C4 | WikiText2 | C4 | ||
FP16 | - | 5.67 | 7.08 | 5.09 | 6.61 | 5.47 | 6.97 | 4.88 | 6.46 |
W4A16 | GPTQ | 6.13 | 7.43 | 5.40 | 6.84 | 5.83 | 7.37 | 5.13 | 6.70 |
AWQ | 6.08 | 7.52 | 5.34 | 6.86 | 6.15 | 7.68 | 5.12 | 6.74 | |
OmniQuant | 5.86 | 7.34 | 5.21 | 6.76 | 5.74 | 7.35 | 5.02 | 6.65 | |
AffineQuant | 5.84 | 7.30 | 5.20 | 6.75 | 5.69 | 7.29 | 5.01 | 6.64 | |
ABQ-LLM | 5.83 | 7.29 | 5.19 | 6.75 | 5.64 | 7.20 | 5.01 | 6.63 | |
W3A16 | GPTQ | 8.06 | 9.49 | 6.76 | 8.16 | 8.37 | 9.81 | 6.44 | 8.02 |
AWQ | 11.88 | 13.26 | 7.45 | 9.13 | 24.00 | 23.85 | 10.45 | 13.07 | |
OmniQuant | 6.49 | 8.19 | 5.68 | 7.32 | 6.58 | 8.65 | 5.58 | 7.44 | |
AffineQuant | 6.30 | 8.03 | 5.60 | 7.20 | 6.55 | 8.57 | 5.62 | 7.56 | |
ABQ-LLM | 6.29 | 8.01 | 5.56 | 7.24 | 6.28 | 8.10 | 5.44 | 7.26 | |
W2A16 | GPTQ | 2.1e3 | 689.13 | 5.5e3 | 2.5e3 | 7.7e3 | NAN | 2.1e3 | 323.12 |
OmniQuant | 15.47 | 24.89 | 13.21 | 18.31 | 37.37 | 90.64 | 17.21 | 26.76 | |
AffineQuant | 9.53 | 14.89 | 7.54 | 12.46 | 35.07 | 572.22 | 12.42 | 23.67 | |
ABQ-LLM | 11.48 | 15.74 | 9.34 | 12.28 | 13.11 | 17.81 | 13.09 | 20.49 | |
W2*A16 | ABQ-LLM | 7.50 | 9.86 | 6.64 | 8.43 | 7.82 | 10.33 | 6.52 | 7.87 |
Bits | Method | LLaMA-7B | LLaMA-13B | LLaMA-2-7B | LLaMA-2-13B | ||||
Wikitext2 | C4 | Wikitext2 | C4 | Wikitext2 | C4 | Wikitext2 | C4 | ||
FP16 | - | 5.67 | 7.08 | 5.09 | 6.61 | 5.47 | 6.97 | 4.88 | 6.46 |
W8A8 | SmoothQuant | 5.73 | - | 5.13 | - | 5.54 | - | 4.95 | - |
ABQ-LLM | 5.68 | 7.09 | 5.10 | 6.62 | 5.48 | 6.99 | 4.89 | 6.47 | |
W4A8 | AWQ | 6.33 | - | 5.59 | - | 6.28 | - | 5.25 | - |
QuaRot | 5.93 | - | 5.29 | - | 5.73 | - | 5.07 | - | |
Atom | 6.03 | - | 5.41 | - | 5.91 | - | 5.16 | - | |
Qserve | 5.93 | - | 5.28 | - | 5.75 | - | 5.12 | - | |
OmniQuant | 5.87 | 7.34 | - | - | - | - | - | - | |
ABQ-LLM | 5.84 | 7.32 | 5.22 | 6.77 | 5.67 | 7.24 | 5.01 | 6.64 | |
W4A6 | OmniQuant | 6.09 | 7.63 | - | - | - | - | - | - |
ABQ-LLM | 6.01 | 7.58 | 5.37 | 6.96 | 5.89 | 7.56 | 5.17 | 6.87 | |
W3A8 | ABQ-LLM | 6.30 | 8.04 | 5.59 | 7.26 | 6.27 | 8.14 | 5.45 | 7.27 |
W3A6 | ABQ-LLM | 6.60 | 8.47 | 5.84 | 7.61 | 6.56 | 8.65 | 5.92 | 7.90 |
W3A4 | ABQ-LLM | 12.16 | 17.19 | 9.96 | 14.36 | 13.65 | 19.00 | 20.35 | 20.09 |
W2*A8 | ABQ-LLM | 7.59 | 10.00 | 6.49 | 8.53 | 7.85 | 10.33 | 6.65 | 10.01 |
W2*A6 | ABQ-LLM | 8.08 | 10.89 | 6.99 | 9.30 | 9.08 | 11.74 | 10.91 | 14.89 |
Model | Bits | Method | PiQA | ARC-e | ARC-c | BoolQ | HellaSwag | Winogrande | Avg. |
---|---|---|---|---|---|---|---|---|---|
LLaMA-2-7B | FP16 | - | 76.98 | 53.57 | 40.61 | 71.07 | 72.96 | 67.24 | 63.73 |
W4A16 | ABQ-LLM | 77.36 | 53.82 | 39.50 | 70.76 | 71.76 | 66.85 | 63.34 | |
W3A16 | ABQ-LLM | 76.55 | 53.66 | 39.24 | 63.97 | 68.93 | 65.90 | 61.38 | |
W2*A16 | ABQ-LLM | 72.79 | 48.14 | 35.32 | 63.94 | 63.21 | 61.79 | 57.53 | |
W8A8 | ABQ-LLM | 76.76 | 53.53 | 40.35 | 71.19 | 72.84 | 66.92 | 63.60 | |
W6A6 | ABQ-LLM | 76.77 | 53.20 | 40.44 | 71.37 | 71.81 | 66.22 | 63.30 | |
W4A8 | ABQ-LLM | 76.76 | 53.11 | 39.07 | 67.86 | 71.38 | 66.92 | 62.52 | |
W4A6 | ABQ-LLM | 76.06 | 52.56 | 38.82 | 67.22 | 70.35 | 64.33 | 61.60 | |
W4A4 | ABQ-LLM | 68.55 | 44.11 | 31.31 | 63.12 | 55.29 | 53.82 | 52.70 | |
W3A8 | ABQ-LLM | 76.55 | 52.44 | 38.99 | 65.77 | 68.34 | 66.61 | 61.45 | |
W3A6 | ABQ-LLM | 74.26 | 51.47 | 37.62 | 65.22 | 66.74 | 63.22 | 59.76 | |
W3A4 | ABQ-LLM | 64.47 | 39.56 | 27.47 | 58.81 | 49.28 | 53.28 | 48.81 | |
W2*A8 | ABQ-LLM | 72.74 | 47.81 | 35.07 | 62.78 | 63.31 | 61.09 | 57.13 | |
W2*A6 | ABQ-LLM | 71.98 | 45.54 | 31.66 | 62.14 | 60.14 | 55.25 | 54.45 |
Model | Bits | Method | PiQA | ARC-e | ARC-c | BoolQ | HellaSwag | Winogrande | Avg. |
LLaMA-7B | FP16 | - | 77.47 | 52.48 | 41.46 | 73.08 | 73.00 | 67.07 | 64.09 |
W4A16 | AffineQuant | 77.53 | 51.85 | 38.65 | 70.89 | 71.49 | 66.93 | 62.89 | |
ABQ-LLM | 77.80 | 51.55 | 39.59 | 72.66 | 71.41 | 66.14 | 63.19 | ||
W3A16 | ABQ-LLM | 75.78 | 47.81 | 37.20 | 71.98 | 68.80 | 65.82 | 61.23 | |
W2*A16 | ABQ-LLM | 74.04 | 46.71 | 35.49 | 65.81 | 48.01 | 62.82 | 55.48 | |
W8A8 | ABQ-LLM | 77.25 | 52.56 | 41.72 | 72.97 | 73.05 | 67.32 | 64.14 | |
W6A6 | SmoothQuant | 76.75 | 51.64 | 39.88 | 71.75 | 71.67 | 65.03 | 62.81 | |
OmniQuant | 77.09 | 51.89 | 40.87 | 72.53 | 71.61 | 65.03 | 63.17 | ||
I-LLM | 76.99 | 52.66 | 40.78 | 72.94 | 71.31 | 65.67 | 63.39 | ||
ABQ-LLM | 78.07 | 52.81 | 40.10 | 71.90 | 71.80 | 66.53 | 63.53 | ||
W4A8 | OmniQuant | 77.36 | 51.85 | 38.65 | 70.67 | 71.20 | 64.71 | 62.40 | |
ABQ-LLM | 77.63 | 52.40 | 38.10 | 73.43 | 71.81 | 65.51 | 63.14 | ||
W4A6 | OmniQuant | 75.73 | 51.52 | 38.31 | 68.28 | 70.79 | 65.27 | 61.64 | |
ABQ-LLM | 76.28 | 50.72 | 39.68 | 70.52 | 70.35 | 64.88 | 62.06 | ||
W4A4 | OmniQuant | 66.15 | 45.20 | 31.14 | 63.51 | 56.44 | 53.43 | 52.65 | |
AffineQuant | 69.37 | 42.55 | 31.91 | 63.73 | 57.65 | 55.33 | 53.42 | ||
ABQ-LLM | 69.97 | 45.88 | 33.44 | 62.87 | 58.47 | 54.53 | 54.19 | ||
W3A8 | ABQ-LLM | 75.78 | 48.95 | 38.05 | 70.18 | 68.70 | 65.74 | 61.23 | |
W3A6 | ABQ-LLM | 75.57 | 49.03 | 38.57 | 69.57 | 67.12 | 62.35 | 60.37 | |
W3A4 | ABQ-LLM | 64.20 | 41.04 | 28.50 | 61.80 | 49.71 | 51.54 | 49.45 | |
W2*A8 | ABQ-LLM | 73.29 | 46.72 | 34.73 | 63.61 | 62.18 | 61.01 | 56.92 | |
W2*A6 | ABQ-LLM | 71.98 | 46.00 | 33.53 | 63.95 | 59.19 | 58.72 | 55.56 |
Model | Bits | Method | PiQA | ARC-e | ARC-c | BoolQ | HellaSwag | Winogrande | Avg. |
---|---|---|---|---|---|---|---|---|---|
LLaMA-13B | FP16 | - | 79.10 | 59.89 | 44.45 | 68.01 | 76.21 | 70.31 | 66.32 |
W4A16 | AffineQuant | 78.84 | 59.55 | 43.52 | 69.48 | 75.18 | 69.38 | 65.99 | |
ABQ-LLM | 78.89 | 59.43 | 43.25 | 70.37 | 75.38 | 69.22 | 66.09 | ||
W3A16 | ABQ-LLM | 77.86 | 57.62 | 42.15 | 66.45 | 73.03 | 69.53 | 64.44 | |
W2*A16 | ABQ-LLM | 75.03 | 52.99 | 38.48 | 66.42 | 68.69 | 66.53 | 61.36 | |
W8A8 | ABQ-LLM | 78.73 | 59.01 | 44.11 | 68.17 | 75.96 | 70.09 | 66.01 | |
W6A6 | SmoothQuant | 77.91 | 56.60 | 42.40 | 64.95 | 75.36 | 69.36 | 64.43 | |
OmniQuant | 78.40 | 57.28 | 42.91 | 67.00 | 75.82 | 68.27 | 64.94 | ||
I-LLM | 77.48 | 56.94 | 44.03 | 64.92 | 75.24 | 69.14 | 64.62 | ||
ABQ-LLM | 78.40 | 57.62 | 42.01 | 66.82 | 75.54 | 69.61 | 65.00 | ||
W4A8 | ABQ-LLM | 78.56 | 58.75 | 42.92 | 68.56 | 75.20 | 70.56 | 65.76 | |
W4A6 | ABQ-LLM | 77.64 | 56.94 | 42.32 | 67.06 | 74.23 | 68.19 | 64.39 | |
W4A4 | OmniQuant | 69.69 | 47.30 | 33.10 | 62.84 | 58.96 | 55.80 | 54.62 | |
AffineQuant | 66.32 | 43.90 | 29.61 | 64.10 | 56.88 | 54.70 | 52.58 | ||
I-LLM | 67.95 | 48.15 | 34.47 | 62.29 | 63.13 | 59.98 | 55.99 | ||
ABQ-LLM | 71.82 | 47.60 | 35.67 | 63.52 | 64.31 | 57.54 | 56.74 | ||
W3A8 | ABQ-LLM | 77.81 | 58.16 | 42.41 | 68.47 | 73.15 | 69.37 | 64.90 | |
W3A4 | ABQ-LLM | 64.79 | 42.21 | 30.54 | 60.55 | 55.59 | 53.51 | 51.20 | |
W3A6 | ABQ-LLM | 76.71 | 56.19 | 40.53 | 66.39 | 71.59 | 66.14 | 62.93 | |
W2*A8 | ABQ-LLM | 74.92 | 54.92 | 38.65 | 68.53 | 68.21 | 66.54 | 61.96 | |
W2*A6 | ABQ-LLM | 73.71 | 51.89 | 36.60 | 64.98 | 65.55 | 63.22 | 59.33 |
Model | Bits | Method | PiQA | ARC-e | ARC-c | BoolQ | HellaSwag | Winogrande | Avg. |
---|---|---|---|---|---|---|---|---|---|
LLaMA-2-13B | FP16 | - | 79.05 | 57.91 | 44.19 | 69.02 | 76.60 | 69.69 | 66.07 |
W4A16 | ABQ-LLM | 78.72 | 58.16 | 44.03 | 63.94 | 75.56 | 69.13 | 64.92 | |
W3A16 | ABQ-LLM | 77.48 | 55.39 | 43.86 | 67.58 | 72.63 | 67.56 | 64.08 | |
W2*A16 | ABQ-LLM | 75.41 | 51.39 | 36.43 | 72.63 | 67.03 | 60.62 | 60.59 | |
W8A8 | ABQ-LLM | 79.22 | 57.66 | 43.94 | 68.62 | 76.47 | 69.46 | 65.90 | |
W6A6 | ABQ-LLM | 78.62 | 55.81 | 43.69 | 66.33 | 75.51 | 68.27 | 64.70 | |
W4A8 | ABQ-LLM | 78.67 | 57.15 | 43.52 | 64.13 | 75.31 | 69.53 | 64.53 | |
W4A6 | ABQ-LLM | 77.97 | 57.11 | 43.26 | 67.37 | 74.14 | 66.30 | 64.36 | |
W4A4 | ABQ-LLM | 69.04 | 47.01 | 33.53 | 64.74 | 62.70 | 54.38 | 55.23 | |
W3A8 | ABQ-LLM | 77.53 | 56.36 | 42.74 | 68.71 | 72.87 | 66.06 | 64.05 | |
W3A6 | ABQ-LLM | 76.22 | 53.37 | 40.70 | 68.04 | 71.09 | 66.14 | 62.59 | |
W3A4 | ABQ-LLM | 63.60 | 42.55 | 29.35 | 58.62 | 52.62 | 53.19 | 49.99 | |
W2*A8 | ABQ-LLM | 73.50 | 49.79 | 35.15 | 70.15 | 67.45 | 58.88 | 59.15 |
Appendix D GPU Kernel Optimization Details
GEMV Elimination. Our custom engine decomposes the operation of an arbitrary quantization combination into a superposition of 1-bit matrix multiplications. For GEMV problems, the multiple single-bit GEMVs are thereby converted into a GEMM, fully utilizing the TensorCores while reducing or even avoiding the redundant computation caused by padding. For a {p, q} configuration, the actual computing task undertaken by ABQKernel is p*M × q*N × K; when p*M % MMA_M = 0, we achieve zero redundant computation. As shown in Figure 8, directly calling the INT8 TensorCore to execute a W2A8 GEMV has two problems: 1. converting W2 to W8 online requires additional instructions; 2. when M = 1 and MMA_M = 8, calling the TensorCore introduces 87.5% redundant computation. In contrast, ABQKernel accelerates W2A8 directly without any redundant computation.
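A small Python sketch of the padding arithmetic behind this claim (illustrative only):

```python
def padding_waste(m_rows, mma_m=8):
    # Fraction of redundant rows when an M-row operand must be padded up to
    # the TensorCore tile height MMA_M.
    padded = -(-m_rows // mma_m) * mma_m      # round M up to a multiple of MMA_M
    return (padded - m_rows) / padded

# Plain W2A8 GEMV (M = 1) on the INT8 TensorCore: pad 1 row up to 8.
print(padding_waste(1))        # 0.875 -> 87.5% redundant computation

# ABQ-LLM for {p = 8, q = 2}: the activation side becomes p*M = 8 rows.
print(padding_waste(8 * 1))    # 0.0   -> padding fully eliminated
```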

Computational Pipeline Optimization. Figure 9 illustrates our optimization of the computation pipeline to enhance inference performance on the widely adopted Ampere architecture. At the shared memory level, we perform asynchronous copies from global memory to shared memory using the cp.async instruction to hide the latency of subsequent memory accesses. Before the first loop iteration, a synchronization instruction ensures that TILE-0, the data required by the first iteration, has been written to shared memory. While the first iteration is being processed, TILE-1, the data needed for the second iteration, is asynchronously written to shared memory; this overlap of the second iteration's data transfer with the first iteration's computation masks the shared memory access latency of subsequent iterations. For register-level optimization within the loop, when k=0, TILE-0 data is loaded from shared memory into the first set of registers A0 and B0, while the data required for k=1 is preloaded into registers A1 and B1. Once the data in A0 and B0 is ready, the bmma operation is performed on A0 and B0. At k=1, the data A1 and B1 needed for bmma has already been preloaded at k=0, and the data required for k=2 is preloaded into registers A0 and B0, and so on. By double-buffering the registers, data is written from shared memory to registers during the TensorCore bmma computation, effectively masking the register access time. Here, bmma denotes a 1-bit matrix multiplication operation, and TILE data refers to a block of data obtained by partitioning the original operands.

Auto Kernel Search. To utilize every level of GPU memory more efficiently, a tiling strategy is essential during computation, and the tile sizes are crucial for GEMM performance. Before launching an arbitrary-precision inference operator, we therefore benchmark different tile sizes and select the best implementation. As shown in Figure 4, for a classical GEMM task, the M × N × K problem is tiled by BM × BN × BK at the thread block level, and each thread block is tiled by WM × WN × WK at the warp level. Each warp then uses the MMA_M × MMA_N × MMA_K tile size supported by the TensorCores of the target GPU, and we search over different tile shapes for the three levels. For an arbitrary-precision operator, the number of weight bits (q) and activation bits (p) must also be considered, so the search space is larger than for classical GEMM. Given the BTC tile sizes MMA_M = 8, MMA_N = 8, MMA_K = 128, the number of weight warps is W_WARPS_NUM = BN × q / WN and the number of activation warps is X_WARPS_NUM = BM × p / WM; the total number of warps must satisfy 1 ≤ X_WARPS_NUM × W_WARPS_NUM ≤ 32. To reduce the search space, we fix WK = MMA_K = 128 and restrict BK to 128, 256, 384, or 512. BM, BN, BK, WM, and WN cannot grow without bound because the shared memory and register usage of a single thread block is limited. For a given {p, q} configuration, the design process of a candidate instance in the search space is summarized as follows:
1. Determine the number and layout of warps in the thread block based on expert experience, for example X_WARPS_NUM × W_WARPS_NUM = 1 × 4.
2. Determine the Thread Block Tile (BM, BN, BK) with the smallest redundant padding based on the quantization bit-width p of the activation and the M dimension of the computing task.
3. Calculate the size of the Warp Tile (WM, WN, WK) from the warp layout and the Thread Block Tile. Ultimately, we benchmark the operator at the candidate tile sizes and adopt the fastest implementation, as sketched below.
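The following Python sketch enumerates such a search space under the stated constraints; the candidate ranges for BM, BN, WM, and WN are illustrative assumptions, and in practice each surviving candidate would be compiled and benchmarked offline.

```python
MMA_M, MMA_N, MMA_K = 8, 8, 128   # BTC tile sizes
WK = MMA_K                        # the search fixes WK = MMA_K = 128

def candidate_tiles(p, q):
    # Enumerate (BM, BN, BK, WM, WN) candidates for a {p, q} bit configuration,
    # keeping only layouts with 1 <= X_WARPS_NUM * W_WARPS_NUM <= 32.
    for BM in (MMA_M, 2 * MMA_M, 4 * MMA_M):
        for BN in (MMA_N, 2 * MMA_N, 4 * MMA_N, 8 * MMA_N):
            for BK in (128, 256, 384, 512):
                for WM in (MMA_M, 2 * MMA_M):
                    for WN in (MMA_N, 2 * MMA_N, 4 * MMA_N):
                        if (BM * p) % WM or (BN * q) % WN:
                            continue
                        x_warps = BM * p // WM      # X_WARPS_NUM
                        w_warps = BN * q // WN      # W_WARPS_NUM
                        if 1 <= x_warps * w_warps <= 32:
                            yield dict(BM=BM, BN=BN, BK=BK, WM=WM, WN=WN,
                                       warps=x_warps * w_warps)

# W2A8 (activation p=8, weight q=2): each surviving candidate would be timed
# and the fastest kernel instance selected.
print(len(list(candidate_tiles(p=8, q=2))))
```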


Bank Conflicts Elimination. To attain high bandwidth, shared memory is divided into equally sized memory modules called banks. Each bank is 4 bytes wide, and every 128 contiguous bytes of shared memory span the 32 banks. Threads within a warp can access data from different banks simultaneously in a single memory request; however, if multiple threads concurrently access data in the same bank, a bank conflict arises, lowering throughput.
Figure 10 shows a scenario in which bank conflicts occur (BM=8, BK=512), where E denotes an int32 element, B denotes a bank, and T denotes a thread. When data is loaded from global memory to shared memory, each thread accesses 16 bytes to fully utilize the memory bandwidth; the 8×512-bit data is transferred in four phases, each reading 128 bytes (i.e., two consecutive rows in the figure), and no bank conflicts occur in any phase. However, when data is loaded from shared memory to registers, each thread copies 4 bytes, since the 8x8x128 BMMA stores matrix A (8×128 bits) and matrix B (8×128 bits) in 32-bit registers. When copying one matrix block (a same-colored block), threads T0~3, T8~11, T16~19, and T24~27 all access banks B0~3, while T4~7, T12~15, T20~23, and T28~31 all access banks B16~19, resulting in 4-way bank conflicts.
As depicted in Figure 11, we employ a swizzle operation to address this issue. When global memory is loaded into shared memory, the addresses are swizzled so that each 8×128-bit block is scattered across the 32 banks. When T0~31 then access an 8×128-bit block, the data comes from banks B0~31 respectively, so no bank conflict arises, further improving memory-access efficiency.
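The following CUDA C++ sketch illustrates one possible XOR-based swizzle for the layout of Figure 10 (BM = 8, BK = 512, i.e., 16 int32 words per row handled in 16-byte chunks) and verifies on the host that the 32 lanes of a warp then read an 8×128-bit block from 32 distinct banks. The exact swizzle function used in ABQ-LLM may differ; this is an assumed variant that reproduces the conflict-free access pattern described above.

```cuda
// Sketch of an XOR-based swizzle that spreads the 8x128-bit fragment rows
// across all 32 shared-memory banks (BM = 8, BK = 512 bits, i.e. 16 int32
// words per row, copied in 16-byte chunks of 4 words). Illustrative only.
#include <cstdio>

constexpr int WORDS_PER_ROW = 16;   // 512 bits per row / 32-bit words
constexpr int WORDS_PER_CHUNK = 4;  // 16-byte vectorized copy unit

// Map a logical (row, word) coordinate to the swizzled word offset inside the
// shared-memory tile. The 4-word chunk index is XOR-ed with (row >> 1) so that
// the first 128-bit block of each of the 8 rows lands in a different bank group.
__host__ __device__ inline int swizzle(int row, int word) {
    int chunk    = word / WORDS_PER_CHUNK;
    int in_chunk = word % WORDS_PER_CHUNK;
    int s_chunk  = chunk ^ ((row >> 1) & 3);
    return row * WORDS_PER_ROW + s_chunk * WORDS_PER_CHUNK + in_chunk;
}

int main() {
    // Emulate the shared->register load: lane t reads word (t % 4) of the
    // first 128-bit block in row (t / 4); record which bank it touches.
    bool bank_used[32] = {false};
    for (int lane = 0; lane < 32; ++lane) {
        int row = lane / 4, word = lane % 4;   // block 0 of each row
        int bank = swizzle(row, word) % 32;    // consecutive 4-byte words map to consecutive banks
        std::printf("lane %2d -> bank %2d\n", lane, bank);
        if (bank_used[bank]) std::printf("  conflict!\n");
        bank_used[bank] = true;
    }
    return 0;
}
```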
End-to-end inference latency (ms) and memory usage (GB) of LLaMA models at different sequence lengths under various precision settings.

LLaMA-7B
Method | Latency@128 (ms) | Memory@128 (GB) | Latency@256 (ms) | Memory@256 (GB) | Latency@512 (ms) | Memory@512 (GB) | Latency@1024 (ms) | Memory@1024 (GB)
FP16 | 1490.5 | 13.47 | 3005.95 | 13.534 | 6090.97 | 13.662 | 12561.82 | 13.918
W8A16 (CUTLASS) | 868.35 | 7.394 | 1755.62 | 7.458 | 3594.95 | 7.586 | 7559.22 | 7.842
W8A8 (SmoothQuant) | 832.25 | 7.394 | 1684.85 | 7.458 | 3445.86 | 7.586 | 7257.02 | 7.842
W4A16 (CUTLASS) | 642.24 | 4.258 | 1312.91 | 4.322 | 2707.26 | 4.45 | 5786.8 | 4.706
W2A8 (ABQ-LLM) | 505.60 | 2.784 | 1041.39 | 2.848 | 2167.30 | 2.976 | 4657.18 | 3.232

LLaMA-13B
Method | Latency@128 (ms) | Memory@128 (GB) | Latency@256 (ms) | Memory@256 (GB) | Latency@512 (ms) | Memory@512 (GB) | Latency@1024 (ms) | Memory@1024 (GB)
FP16 | 2726.66 | 25.6 | 5481.96 | 25.696 | 11071.81 | 25.92 | 22559.77 | 26.304
W8A16 (CUTLASS) | 1439.66 | 13.524 | 2900.46 | 13.62 | 5922.28 | 13.844 | 12257.27 | 14.228
W8A8 (SmoothQuant) | 1431.2 | 13.526 | 2874.51 | 13.622 | 5867.42 | 13.846 | 12193.68 | 14.23
W4A16 (CUTLASS) | 999.1 | 7.444 | 2020.99 | 7.54 | 4155.94 | 7.764 | 8750.98 | 8.148
W2A8 (ABQ-LLM) | 766.28 | 4.564 | 1550.21 | 4.66 | 3234.67 | 4.884 | 6885.69 | 5.268

LLaMA-30B
Method | Latency@128 (ms) | Memory@128 (GB) | Latency@256 (ms) | Memory@256 (GB) | Latency@512 (ms) | Memory@512 (GB) | Latency@1024 (ms) | Memory@1024 (GB)
FP16 | 3759.08 | 65.534 | 7540.17 | 65.726 | 15241.36 | 66.11 | 31073.23 | 66.878
W8A16 (CUTLASS) | 3032.64 | 32.418 | 6111.43 | 32.642 | 12371.66 | 33.026 | 25477.58 | 33.794
W8A8 (SmoothQuant) | 3057.96 | 32.418 | 6155.38 | 32.642 | 12465.96 | 33.026 | 25678.66 | 33.794
W4A16 (CUTLASS) | 1938.2 | 17.178 | 3924.2 | 17.402 | 8011.57 | 17.786 | 16680.64 | 18.554
W2A8 (ABQ-LLM) | 1730.51 | 9.616 | 3473.66 | 9.84 | 7133.13 | 10.224 | 14845.67 | 10.992
Kernel throughput (TOPS) for representative GEMM shapes (M,K)x(K,N) under different weight/activation bit-width combinations, comparing our arbitrary-precision kernels (Ours) with CUTLASS and cuBLAS on RTX 3070 and RTX 4080.

(1,4096)x(4096,4096)
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 5.126408 | 5.003512 | 4.993599 | 5.016534 | 1.844438 | 1.820546 | 1.415388 | 1.409558 | 1.167617 | 0.990153 | 0.853578 | 0.760755 | |
CUTLASS(TOPS) | 1.395987 | 1.395987 | 0.671532 | 0.671532 | 1.395987 | 0.671532 | 1.395987 | 0.671532 | 0.671532 | 0.671532 | 0.671532 | 0.671532 | |
RTX3070 | CUBLAS(TOPS) | - | - | 0.662154 | 0.662154 | - | 0.662154 | - | 0.662154 | 0.662154 | 0.662154 | 0.662154 | 0.662154 |
(1,1024)x(1024,8192)
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | |
Ours(TOPS) | 3.959401 | 3.815584 | 3.710145 | 3.636041 | 3.27549 | 3.088989 | 2.729302 | 2.493001 | 1.107551 | 0.907098 | 0.774627 | 0.699006 | |
CUTLASS(TOPS) | 2.43737 | 2.43737 | 0.570752 | 0.570752 | 2.43737 | 0.570752 | 2.43737 | 0.570752 | 0.570752 | 0.570752 | 0.570752 | 0.570752 | |
RTX3070 | CUBLAS(TOPS) | - | - | 0.547283 | 0.547283 | - | 0.547283 | - | 0.547283 | 0.547283 | 0.547283 | 0.547283 | 0.547283 |
(1,11008)x(11008, 4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 2.831276 | 2.845823 | 2.851906 | 2.85108 | 1.893524 | 1.89145 | 1.429216 | 1.424223 | 1.063806 | 0.9578 | 0.781478 | 0.688792 | |
CUTLASS(TOPS) | 1.437404 | 1.437404 | 0.702909 | 0.702909 | 1.437404 | 0.702909 | 1.437404 | 0.702909 | 0.702909 | 0.702909 | 0.702909 | 0.702909 | |
RTX3070 | CUBLAS(TOPS) | - | - | 0.655704 | 0.655704 | - | 0.655704 | - | 0.655704 | 0.655704 | 0.655704 | 0.655704 | 0.655704 |
(1,5120)x(5120, 5120) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 2.77896 | 2.744274 | 2.769214 | 2.72645 | 1.946023 | 1.925246 | 1.485292 | 1.47769 | 1.207433 | 1.019555 | 0.88083 | 0.777159 | |
CUTLASS(TOPS) | 1.456822 | 1.456822 | 0.752888 | 0.752888 | 1.456822 | 0.752888 | 1.456822 | 0.752888 | 0.752888 | 0.752888 | 0.752888 | 0.752888 | |
RTX3070 | CUBLAS(TOPS) | - | - | 0.655704 | 0.655704 | - | 0.655704 | - | 0.655704 | 0.655704 | 0.655704 | 0.655704 | 0.655704 |
(1,4096)x(4096, 11008) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 2.93919 | 2.945875 | 2.932437 | 2.928438 | 2.048219 | 2.042158 | 1.558931 | 1.556203 | 1.263128 | 1.06387 | 0.915512 | 0.804458 | |
CUTLASS(TOPS) | 1.44785 | 1.44785 | 0.757006 | 0.757006 | 1.44785 | 0.757006 | 1.44785 | 0.757006 | 0.757006 | 0.757006 | 0.757006 | 0.757006 | |
RTX3070 | CUBLAS(TOPS) | - | - | 0.726139 | 0.726139 | - | 0.726139 | - | 0.726139 | 0.726139 | 0.726139 | 0.726139 | 0.726139
(4,4096)x(4096,4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 19.85037 | 18.573332 | 17.173864 | 15.596383 | 7.165928 | 7.040589 | 5.588709 | 5.465432 | 4.427809 | 3.841075 | 3.17604 | 2.952538 | |
CUTLASS(TOPS) | 5.029238 | 5.029238 | 2.600067 | 2.600067 | 5.029238 | 2.600067 | 5.029238 | 2.600067 | 2.600067 | 2.600067 | 2.600067 | 2.600067 | |
RTX3070 | CUBLAS(TOPS) | - | - | 2.623434 | 2.623434 | - | 2.623434 | - | 2.623434 | 2.623434 | 2.623434 | 2.623434 | 2.623434 |
(4,1024)x(1024,8192) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 14.911489 | 13.707592 | 12.752676 | 12.651737 | 11.246954 | 10.272099 | 8.611968 | 7.994145 | 4.100613 | 3.487998 | 2.770024 | 2.715393 | |
CUTLASS(TOPS) | 7.790775 | 7.790775 | 2.292671 | 2.292671 | 2.292671 | 2.292671 | 7.790775 | 2.292671 | 2.292671 | 2.292671 | 2.292671 | 2.292671 | |
RTX3070 | CUBLAS(TOPS) | - | - | 2.652782 | 2.652782 | - | 2.652782 | - | 2.652782 | 2.652782 | 2.652782 | 2.652782 | 2.652782 |
(4,11008)x(11008, 4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 11.352832 | 11.176699 | 10.902382 | 10.897325 | 7.418883 | 7.308671 | 5.631782 | 5.359304 | 4.201856 | 3.729234 | 2.658403 | 2.734334 | |
CUTLASS(TOPS) | 6.089864 | 6.089864 | 2.852206 | 2.852206 | 6.089864 | 2.852206 | 6.089864 | 2.852206 | 2.852206 | 2.852206 | 2.852206 | 2.852206 | |
RTX3070 | CUBLAS(TOPS) | - | - | 2.804943 | 2.804943 | - | 2.804943 | - | 2.804943 | 2.804943 | 2.804943 | 2.804943 | 2.804943
(4,5120)x(5120, 5120) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 11.023197 | 10.805677 | 10.686147 | 10.633989 | 7.62955 | 7.509258 | 5.885227 | 5.81653 | 4.70967 | 4.021154 | 3.319287 | 3.049934 | |
CUTLASS(TOPS) | 5.41928 | 5.41928 | 2.892696 | 2.892696 | 5.41928 | 2.892696 | 5.41928 | 2.892696 | 2.892696 | 2.892696 | 2.892696 | 2.892696 | |
RTX3070 | CUBLAS(TOPS) | - | - | 3.024624 | 3.024624 | - | 3.024624 | - | 3.024624 | 3.024624 | 3.024624 | 3.024624 | 3.024624 |
(4,4096)x(4096, 11008) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 11.65253 | 11.47451 | 11.370432 | 11.249153 | 8.044027 | 7.938343 | 6.158429 | 6.063344 | 4.936044 | 4.181711 | 3.516512 | 3.152066 | |
CUTLASS(TOPS) | 5.755347 | 5.755347 | 2.913199 | 2.913199 | 5.755347 | 2.913199 | 5.755347 | 2.913199 | 2.913199 | 2.913199 | 2.913199 | 2.913199 | |
RTX3070 | CUBLAS(TOPS) | - | - | 3.150072 | 3.150072 | - | 3.150072 | - | 3.150072 | 3.150072 | 3.150072 | 3.150072 | 3.150072 |
(8,4096)x(4096,4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 34.711864 | 29.218008 | 25.562555 | 23.108412 | 14.124731 | 13.319459 | 10.805161 | 10.304756 | 8.570437 | 7.295355 | 6.372617 | 5.418213 | |
CUTLASS(TOPS) | 10.443984 | 10.443984 | 5.328563 | 5.328563 | 10.443984 | 5.328563 | 10.443984 | 5.328563 | 5.328563 | 5.328563 | 5.328563 | 5.328563 | |
RTX3070 | CUBLAS(TOPS) | - | - | 5.136078 | 5.136078 | - | 5.136078 | - | 5.136078 | 5.136078 | 5.136078 | 5.136078 | 5.136078 |
(8,1024)x(1024,8192) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 25.143295 | 24.340204 | 22.991055 | 17.975151 | 21.315987 | 15.096982 | 14.203727 | 10.091 | 8.056054 | 6.547737 | 5.476622 | 4.921968 | |
CUTLASS(TOPS) | 11.837081 | 11.837081 | 4.52688 | 4.52688 | 11.837081 | 4.52688 | 11.837081 | 4.52688 | 4.52688 | 4.52688 | 4.52688 | 4.52688 | |
RTX3070 | CUBLAS(TOPS) | - | - | 5.051333 | 5.051333 | - | 5.051333 | - | 5.051333 | 5.051333 | 5.051333 | 5.051333 | 5.051333 |
(8,11008)x(11008, 4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 21.93853 | 21.763004 | 21.38822 | 21.141041 | 14.794146 | 14.292913 | 10.0646 | 9.831863 | 8.391643 | 6.331276 | 5.216482 | 5.386425 | |
CUTLASS(TOPS) | 11.559231 | 11.559231 | 5.788068 | 5.788068 | 11.559231 | 5.788068 | 11.559231 | 5.788068 | 5.788068 | 5.788068 | 5.788068 | 5.788068 | |
RTX3070 | CUBLAS(TOPS) | - | - | 5.433702 | 5.433702 | - | 5.433702 | - | 5.433702 | 5.433702 | 5.433702 | 5.433702 | 5.433702 |
(8,5120)x(5120, 5120) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 21.438643 | 20.998667 | 20.403486 | 20.144592 | 15.058184 | 14.257557 | 11.512318 | 11.014898 | 9.303804 | 7.909779 | 6.563432 | 5.816199 | |
CUTLASS(TOPS) | 11.438784 | 11.438784 | 5.650824 | 5.650824 | 11.438784 | 5.650824 | 11.438784 | 5.650824 | 5.650824 | 5.650824 | 5.650824 | 5.650824 | |
RTX3070 | CUBLAS(TOPS) | - | - | 5.843657 | 5.843657 | - | 5.843657 | - | 5.843657 | 5.843657 | 5.843657 | 5.843657 | 5.843657 |
(8,4096)x(4096, 11008) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 22.735727 | 22.486816 | 21.946047 | 21.290138 | 15.927655 | 14.94954 | 12.047882 | 11.759899 | 9.831679 | 8.178876 | 6.951003 | 5.942341 | |
CUTLASS(TOPS) | 12.133122 | 12.133122 | 5.794113 | 5.794113 | 12.133122 | 5.794113 | 12.133122 | 5.794113 | 5.794113 | 5.794113 | 5.794113 | 5.794113 | |
RTX3070 | CUBLAS(TOPS) | - | - | 6.094711 | 6.094711 | - | 6.094711 | - | 6.094711 | 6.094711 | 6.094711 | 6.094711 | 6.094711 |
(1,4096)x(4096,4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 5.666263 | 5.596584 | 5.177539 | 5.192204 | 4.932711 | 5.103254 | 5.230327 | 3.984436 | 5.089779 | 4.697921 | 2.73454 | 3.997423 | |
CUTLASS(TOPS) | 4.378407 | 4.378407 | 2.483177 | 2.483177 | 4.378407 | 2.483177 | 4.378407 | 2.483177 | 2.483177 | 2.483177 | 2.483177 | 2.483177 | |
RTX4080 | CUBLAS(TOPS) | - | - | 2.132917 | 2.132917 | - | 2.132917 | - | 2.132917 | 2.132917 | 2.132917 | 2.132917 | 2.132917 |
(1,1024)x(1024,8192) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 3.544786 | 3.494135 | 3.491157 | 3.335505 | 3.090738 | 3.463115 | 3.280737 | 2.427181 | 2.593226 | 2.545285 | 2.018674 | 2.512498 | |
CUTLASS(TOPS) | 1.994158 | 1.994158 | 1.595792 | 1.595792 | 1.994158 | 1.595792 | 1.994158 | 1.595792 | 1.595792 | 1.595792 | 1.595792 | 1.595792 | |
RTX4080 | CUBLAS(TOPS) | - | - | 1.842555 | 1.842555 | - | 1.842555 | - | 1.842555 | 1.842555 | 1.842555 | 1.842555 | 1.842555 |
(1,11008)x(11008, 4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 11.459206 | 11.172798 | 8.219525 | 8.073364 | 6.89011 | 8.299166 | 7.945147 | 5.589236 | 6.926538 | 6.253657 | 3.5678 | 5.010184 | |
CUTLASS(TOPS) | 5.738564 | 5.738564 | 3.577838 | 3.577838 | 5.738564 | 3.577838 | 5.738564 | 3.577838 | 3.577838 | 3.577838 | 3.577838 | 3.577838 | |
RTX4080 | CUBLAS(TOPS) | - | - | 2.782205 | 2.782205 | - | 2.782205 | - | 2.782205 | 2.782205 | 2.782205 | 2.782205 | 2.782205 |
(1,5120)x(5120, 5120) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 8.179976 | 8.036415 | 7.981294 | 7.326846 | 7.348096 | 7.640651 | 7.2295 | 4.169042 | 6.045578 | 4.925341 | 3.0402 | 4.270938 | |
CUTLASS(TOPS) | 4.469664 | 4.469664 | 2.111863 | 2.111863 | 4.469664 | 2.111863 | 4.469664 | 2.111863 | 2.111863 | 2.111863 | 2.111863 | 2.111863 | |
RTX4080 | CUBLAS(TOPS) | - | - | 2.338113 | 2.338113 | - | 2.338113 | - | 2.338113 | 2.338113 | 2.338113 | 2.338113 | 2.338113
(1,4096)x(4096, 11008) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 13.542058 | 13.110614 | 12.246419 | 9.048911 | 7.191247 | 10.480065 | 7.804325 | 4.767432 | 6.815571 | 5.946993 | 3.366875 | 5.155972 | |
CUTLASS(TOPS) | 7.373691 | 7.373691 | 3.04093 | 3.04093 | 7.373691 | 3.04093 | 7.373691 | 3.04093 | 3.04093 | 3.04093 | 3.04093 | 3.04093 | |
RTX4080 | CUBLAS(TOPS) | - | - | 2.862102 | 2.862102 | - | 2.862102 | - | 2.862102 | 2.862102 | 2.862102 | 2.862102 | 2.862102 |
(4,4096)x(4096,4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 24.408194 | 20.693401 | 20.195993 | 20.795177 | 18.665907 | 20.496012 | 19.760592 | 12.737339 | 18.525913 | 16.639837 | 8.414879 | 11.580844 | |
CUTLASS(TOPS) | 16.965191 | 16.965191 | 11.346475 | 11.346475 | 16.965191 | 11.346475 | 16.965191 | 11.346475 | 11.346475 | 11.346475 | 11.346475 | 11.346475 | |
RTX4080 | CUBLAS(TOPS) | - | - | 9.346264 | 9.346264 | - | 9.346264 | - | 9.346264 | 9.346264 | 9.346264 | 9.346264 | 9.346264 |
(4,1024)x(1024,8192) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 14.124137 | 14.265563 | 14.515171 | 13.744966 | 13.02943 | 12.581451 | 11.312963 | 9.696109 | 10.008553 | 10.04691 | 5.635082 | 8.187905 | |
CUTLASS(TOPS) | 10.125006 | 10.125006 | 7.514869 | 7.514869 | 10.125006 | 7.514869 | 10.125006 | 7.514869 | 7.514869 | 7.514869 | 7.514869 | 7.514869 | |
RTX4080 | CUBLAS(TOPS) | - | - | 5.807355 | 5.807355 | - | 5.807355 | - | 5.807355 | 5.807355 | 5.807355 | 5.807355 | 5.807355 |
(4,11008)x(11008, 4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 44.499241 | 34.592556 | 31.936174 | 30.138264 | 26.151152 | 29.124102 | 30.114992 | 20.55409 | 25.744062 | 21.976168 | 11.041816 | 16.608015 | |
CUTLASS(TOPS) | 22.360314 | 22.360314 | 14.27062 | 14.27062 | 22.360314 | 14.27062 | 22.360314 | 14.27062 | 14.27062 | 14.27062 | 14.27062 | 14.27062 | |
RTX4080 | CUBLAS(TOPS) | - | - | 10.484434 | 10.484434 | - | 10.484434 | - | 10.484434 | 10.484434 | 10.484434 | 10.484434 | 10.484434 |
(4,5120)x(5120, 5120) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 32.262131 | 32.364094 | 31.850698 | 26.226151 | 20.416708 | 24.156643 | 26.690235 | 14.489883 | 18.306963 | 18.400719 | 8.766748 | 13.434794 | |
CUTLASS(TOPS) | 15.777285 | 15.777285 | 8.469383 | 8.469383 | 15.777285 | 8.469383 | 15.777285 | 8.469383 | 8.469383 | 8.469383 | 8.469383 | 8.469383 | |
RTX4080 | CUBLAS(TOPS) | - | - | 6.160892 | 6.160892 | - | 6.160892 | - | 6.160892 | 6.160892 | 6.160892 | 6.160892 | 6.160892 |
(4,4096)x(4096, 11008) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 52.217018 | 50.157482 | 45.968418 | 31.412338 | 28.61311 | 30.551258 | 30.689667 | 17.375622 | 22.56772 | 21.429615 | 10.303197 | 16.78546 | |
CUTLASS(TOPS) | 29.325113 | 29.325113 | 12.020953 | 12.020953 | 29.325113 | 12.020953 | 29.325113 | 12.020953 | 12.020953 | 12.020953 | 12.020953 | 12.020953 | |
RTX4080 | CUBLAS(TOPS) | - | - | 9.30319 | 9.30319 | - | 9.30319 | - | 9.30319 | 9.30319 | 9.30319 | 9.30319 | 9.30319 |
(8,4096)x(4096,4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 41.439137 | 41.30854 | 41.419498 | 39.114296 | 37.503555 | 37.120361 | 40.611 | 24.405922 | 32.121555 | 24.77029 | 16.262033 | 17.331835 | |
CUTLASS(TOPS) | 33.269646 | 33.269646 | 22.390638 | 22.390638 | 33.269646 | 22.390638 | 33.269646 | 22.390638 | 22.390638 | 22.390638 | 22.390638 | 22.390638 | |
RTX4080 | CUBLAS(TOPS) | - | - | 18.537868 | 18.537868 | - | 18.537868 | - | 18.537868 | 18.537868 | 18.537868 | 18.537868 | 18.537868
(8,1024)x(1024,8192) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 28.155737 | 27.346548 | 26.414156 | 20.811686 | 20.343021 | 20.249033 | 20.602325 | 18.68187 | 19.067791 | 18.388329 | 11.315893 | 11.722744 | |
CUTLASS(TOPS) | 16.941822 | 16.941822 | 11.685739 | 11.685739 | 16.941822 | 11.685739 | 16.941822 | 11.685739 | 11.685739 | 11.685739 | 11.685739 | 11.685739 | |
RTX4080 | CUBLAS(TOPS) | - | - | 11.626171 | 11.626171 | - | 11.626171 | - | 11.626171 | 11.626171 | 11.626171 | 11.626171 | 11.626171 |
(8,11008)x(11008, 4096) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 70.776772 | 63.515327 | 62.89724 | 53.603973 | 50.18964 | 44.507675 | 52.819912 | 35.874935 | 41.623066 | 34.309536 | 20.632343 | 20.644436 | |
CUTLASS(TOPS) | 42.806332 | 42.806332 | 28.412611 | 28.412611 | 42.806332 | 28.412611 | 42.806332 | 28.412611 | 28.412611 | 28.412611 | 28.412611 | 28.412611 | |
RTX4080 | CUBLAS(TOPS) | - | - | 20.376932 | 20.376932 | - | 20.376932 | - | 20.376932 | 20.376932 | 20.376932 | 20.376932 | 20.376932 |
(8,5120)x(5120, 5120) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 62.192532 | 58.058117 | 56.991051 | 38.496239 | 36.794827 | 36.71896 | 41.916618 | 27.445724 | 35.866901 | 26.239592 | 16.931917 | 20.035217 | |
CUTLASS(TOPS) | 35.998506 | 35.998506 | 16.818377 | 16.818377 | 35.998506 | 16.818377 | 35.998506 | 16.818377 | 16.818377 | 16.818377 | 16.818377 | 16.818377 | |
RTX4080 | CUBLAS(TOPS) | - | - | 11.887421 | 11.887421 | - | 11.887421 | - | 11.887421 | 11.887421 | 11.887421 | 11.887421 | 11.887421 |
(8,4096)x(4096, 11008) | |||||||||||||
w2a2 | w2a4 | w2a6 | w2a8 | w3a3 | w3a8 | w4a4 | w4a8 | w5a5 | w6a6 | w7a7 | w8a8 | ||
Ours(TOPS) | 103.437386 | 66.689888 | 63.139633 | 49.792355 | 51.081207 | 50.856277 | 52.395653 | 33.181564 | 43.949936 | 34.850956 | 19.520435 | 21.211935 | |
CUTLASS(TOPS) | 55.537205 | 55.537205 | 24.016449 | 24.016449 | 55.537205 | 24.016449 | 55.537205 | 24.016449 | 24.016449 | 24.016449 | 24.016449 | 24.016449 | |
RTX4080 | CUBLAS(TOPS) | - | - | 18.105219 | 18.105219 | - | 18.105219 | - | 18.105219 | 18.105219 | 18.105219 | 18.105219 | 18.105219