AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization
Abstract
Model quantization has become a crucial technique to address the large memory consumption and long inference times of LLMs. Mixed-precision quantization, which distinguishes between important and unimportant parameters, stands out among numerous quantization schemes because it strikes a balance between precision and compression rate. However, existing approaches can only identify important parameters through qualitative analysis and manual experiments, without quantitatively analyzing how their importance is determined. We propose a new criterion, termed “precision alignment”, to build a quantitative framework that holistically evaluates the importance of parameters in mixed-precision quantization. Our observations on floating-point addition under various real-world scenarios suggest that two addends should have identical precision; otherwise, the information in the higher-precision number is wasted. This observation offers an essential principle for determining the precision of each parameter in matrix multiplication. As the first step towards applying the above discovery to large model inference, we develop a dynamic KV-Cache quantization technique that effectively reduces memory access latency. Unlike existing quantization approaches that focus on memory saving, this work directly aims to accelerate LLM inference by quantizing floating-point numbers on the fly. The proposed technique attains a 25% saving of memory access and delivers up to 1.3× speedup in the computation of attention during the decoding phase of LLMs, with almost no loss of precision.
Code — https://github.com/AlignedQuant/AlignedKV
Introduction
Large Language Models (LLMs) are increasingly becoming a fundamental force revolutionizing human life. The emergent capabilities of LLMs, however, depend on their huge number of parameters, which incur significant overhead in terms of large memory consumption, intensive computational cost, and high memory latency. As an effective way to deal with these problems, model quantization has attracted substantial research effort (Zhu et al. 2023; Yuan et al. 2024). Among these works, some methods, e.g., integer-only networks, aim to improve computational speed, while others are dedicated to reducing memory latency by cutting down the volume of memory access. Our approach belongs to the latter category.
An essential problem for quantization is to achieve a high compression ratio while maintaining a sufficient level of precision. Among various quantization methods, mixed-precision quantization stands out for its ability to reach such a balance. Researchers (Dettmers et al. 2022; Lee et al. 2024) have already noticed that not all values in a model are equally important, and quantizing important values to higher precision can reduce the precision loss caused by quantization. However, three questions remain to be systematically resolved before we can exploit the full potential of mixed-precision quantization:
• How to define the importance of parameters (i.e., elements in the weights and KV-Cache)?
• What accuracy is required for important parameters and for unimportant parameters, respectively?
• Are important parameters always important?
In this paper, we introduce a new criterion, designated as “precision alignment”, to holistically answer the first two questions. The criterion is inspired by the uncertainty calculation commonly used in physics (Johnson 2020). When performing an addition, the uncertainty of the result is close to the bigger of the two addends' uncertainties, while the smaller one has much less impact. Therefore, the most economical solution is to align the uncertainties of the two addends to the bigger one. By applying the criterion to the additions inside matrix multiplication, we reach a solution to the first two questions: the precision of a parameter should be chosen so that all parameters involved in the same addition have consistent uncertainty.
For the third question, our analysis in general tends to give a negative answer, although in many cases the answer can be yes. A similar conclusion is reached by (Dong et al. 2024). In summary, although there are strategies to predict future importance (Xiao et al. 2023b; Sheng et al. 2023), static quantization methods have inherent limitations. In this work, we develop a dynamic KV-Cache quantization scheme to avoid these limitations.
Based on the precision alignment criterion, we develop a quantization framework that systematically evaluates the importance of parameters and dynamically determines the number of bits to be read. As the first step towards applying the criterion to large model computations, we develop a dynamic KV-Cache quantization scheme that reduces the volume of memory access for the KV-Cache by 25% and achieves an up to 1.3× speedup in the attention block with almost no loss of precision during the decoding phase of LLMs. Our contributions are as follows:
1. We propose a metric for quantitatively evaluating the importance and precision of each parameter in mixed-precision quantization. To the best of the authors' knowledge, this is the first work to enable a quantitative measurement in this area.
2. We propose a dynamic quantization scheme that allows for on-demand data quantization without the need to predict the future importance of data in advance. The key idea of our work is to read only the bits needed for the parameters in the KV-Cache during the calculation of the LLM's attention block. For some parameters, only the first 11 or 12 bits are necessary to guarantee the accuracy of the result, even though they are stored in the Float16 format.
3. We found that the columns of K are not equally important, and the rows of V similarly vary in importance. Our observations suggest that different columns of K should be quantized to different bit widths based on their importance, and the rows of V follow the same pattern.
Related Works
LLM Quantization with Outliers
Model quantization is a technique commonly used to reduce memory usage and accelerate inference when deploying large models. During quantization, outliers have a greater impact on the result and require higher precision than normal parameters (Dettmers et al. 2022). To deal with these outliers, some approaches (Li et al. 2023; Lee et al. 2024; Dettmers et al. 2023; Kim et al. 2023) treat outlier and normal weights differently. They decompose the weight matrix into a dense quantized matrix that stores the regular parameters and a sparse high-precision matrix that stores the outliers, and compute with them separately. Other methods blend outliers into normal parameters. Rotation (Lin et al. 2024a) treats matrices as combinations of vectors and randomly rotates them to hide outliers behind normal values. SmoothQuant (Xiao et al. 2023a) and AWQ (Lin et al. 2024b) scale outliers to fit the other elements and multiply them by coefficients after dequantization.
However, the methods above can only identify important parameters through qualitative analysis and manual experiments. Only the largest elements are selected as outliers to retain full precision, without quantitative support. On the other hand, methods that analyze the importance of parameters quantitatively (Park et al. 2024; Kloberdanz and Le 2023) only consider the importance of a matrix or a layer from a macroscopic perspective. To the best of the authors' knowledge, this is the first work to systematically analyze the importance of every parameter in a quantitative manner.
KV-Cache Compression and Quantization
KV-Cache is an essential mechanism to save computation, but it comes at the expense of huge storage consumption and memory access latency, which makes reducing this expense necessary. Early works (Liu et al. 2024b; Ge et al. 2023; Zhang et al. 2024b) evict useless tokens to save memory. Building on this idea, some methods (Zhang et al. 2024a; Yang et al. 2024a) find that the required number of tokens decreases as the layer becomes deeper. Quantization methods like KIVI (Liu et al. 2024c) also take different measures to quantize the KV-Cache to four or even two bits. KIVI suggests that the key cache should be quantized per channel and the value cache per token.
As the self-attention mechanism naturally assigns importance to tokens through the attention score, mixed-precision quantization has been introduced for the quantization of KV-Cache. Some methods (Yang et al. 2024b; He et al. 2024; Yue et al. 2024; Liu et al. 2024a) quantize tokens to different precisions based on their scores, while others (Kang et al. 2024; Dong et al. 2024) pick outliers out of the matrices and store them at high precision.
Currently, most methods aim to save the memory occupied by the KV-Cache; the latency associated with quantization and dequantization, nevertheless, receives little attention. This paper is the first work targeting the reduction of KV-Cache memory access latency. As memory latency exceeds computational latency by an order of magnitude (Megha et al. 2023), our work shows that KV-Cache latency is worth significant optimization effort.
Precision Alignment Criterion
Precision Alignment in Uncertainty Calculation
In physics, measurements of physical quantities generally cannot be exact. Therefore, in numerical representation, they are usually expressed in the form $x \pm \Delta x$, indicating that the true value falls within this range. Physics has also established a set of rules for calculating with such values, which is called uncertainty calculation (Johnson 2020).
Uncertainty calculation is closely related to LLM quantization. From this perspective, we can represent the pre-quantization matrix $W$ as the sum of a quantized matrix $W_q$ and an uncertainty $\Delta W$. Therefore, we can express the matrix multiplication of the activation values $A$ and the quantized weights $W$ in the following form of uncertainty computation ($K$ and $V$ follow the same pattern):
$$A \times W = A \times (W_q \pm \Delta W) = A W_q \pm A\,\Delta W.$$
We can consider the matrix multiplication above as a combination of two fundamental operations:
• Multiplication of a number from $A$ and a number from $W$, which corresponds to scalar multiplication in uncertainty calculation:
$$a \times (w \pm \Delta w) = a w \pm |a|\,\Delta w;$$
• Addition of the multiplication results, which corresponds to addition in uncertainty calculation:
$$(x \pm \Delta x) + (y \pm \Delta y) = (x + y) \pm \sqrt{\Delta x^2 + \Delta y^2}.$$
In uncertainty addition, we can find that the uncertainty of the result equals $\sqrt{\Delta x^2 + \Delta y^2}$. On the other hand, the cost of representing $x \pm \Delta x$ is proportional to $-\log_2 \Delta x$, because we can reduce the uncertainty to $\Delta x / 2$ when we use one more bit in $x$'s representation. So we can keep the uncertainty of the result at $\sqrt{\Delta x^2 + \Delta y^2}$ at the cost of $-\log_2 \Delta x - \log_2 \Delta y$, which equals $-\log_2(\Delta x\,\Delta y)$.
We have the mean value inequality $\sqrt{\Delta x^2 + \Delta y^2} \ge \sqrt{2\,\Delta x\,\Delta y}$, where the condition for equality is $\Delta x = \Delta y$. In other words, for a fixed bit cost (i.e., a fixed product $\Delta x\,\Delta y$), the resulting uncertainty is minimized when the two uncertainties are equal, so the addition is most efficient when $\Delta x = \Delta y$. We call this the “precision alignment criterion”.
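To make the criterion concrete, the following Python sketch (our own illustration, not part of the released code) sweeps every way of splitting a fixed bit budget between two addends and confirms that the resulting uncertainty $\sqrt{\Delta x^2 + \Delta y^2}$ is smallest when the two uncertainties coincide:

```python
# Numerical check of the precision alignment criterion: for a fixed bit budget
# -log2(dx * dy), the sum's uncertainty sqrt(dx^2 + dy^2) is minimized at dx == dy.
import math

budget = 20                      # total bits spent on the two addends
best = None
for bits_x in range(1, budget):  # try every split of the budget
    bits_y = budget - bits_x
    dx, dy = 2.0 ** -bits_x, 2.0 ** -bits_y   # one extra bit halves an uncertainty
    result_uncertainty = math.sqrt(dx ** 2 + dy ** 2)
    if best is None or result_uncertainty < best[0]:
        best = (result_uncertainty, bits_x, bits_y)

print(best)  # minimum reached at bits_x == bits_y == 10, i.e. dx == dy
```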
Precision Alignment in Floating-Point Addition
Floating-point numbers are widely used in computers. Particularly, in deep learning, the float16 (also known as half-precision floating point, FP16) is the most frequently used format. The float16 format consists of 1 sign bit, 5 exponent bits, and 10 mantissa bits, as shown in Figure 1.
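As a quick illustration (our own snippet, not from the paper's code), the raw bits of a float16 value can be inspected and split into these three fields:

```python
# View a float16 value as its 16 raw bits and separate sign, exponent, and mantissa.
import numpy as np

raw = int(np.float16(-1.5).view(np.uint16))
sign, exponent, mantissa = raw >> 15, (raw >> 10) & 0x1F, raw & 0x3FF
print(f"{raw:016b}", sign, exponent, mantissa)   # 1 01111 1000000000 -> -1.5
```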

The principle outlined in the previous subsection applies to floating-point addition if we treat the uncertainty as the precision of the floating-point representation. This is easiest to see in the context of vertical (column-wise) addition, as shown in Figure 2. The figure demonstrates a case in which the precisions of the two addends differ. The last few digits of the second addend have no corresponding positions in the first addend, nor do they have corresponding positions in the result. Thus, these digits are wasted because they never contribute to the result.

To avoid loading such wasted bits, we want the following two conditions to hold:
• Precision Alignment on Addends: We expect the addends to have the same number of decimal places. Otherwise, some digits have no counterparts to be added to.
• Precision Alignment on Results: We expect each addend to have the same number of decimal places as the final result. Otherwise, some digits cannot be stored in the result.
These two expectations are equivalent in the vast majority of cases. They simply come from different perspectives.
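The wasted-bit effect is easy to reproduce. The snippet below (our own illustration using NumPy's float16) adds a small value to a much larger one; because the small addend's mantissa bits fall below the larger addend's precision, they never reach the result:

```python
# Demonstrate "wasted bits" in float16 addition with mismatched magnitudes.
import numpy as np

a = np.float16(2048.0)   # at this magnitude one float16 ulp is 2.0
b = np.float16(0.75)     # smaller addend whose fine-grained bits cannot line up with a

print(a + b)                    # 2048.0 -> every bit of b is wasted
print(a + np.float16(2.0))      # 2050.0 -> only bits at a's precision contribute
```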
Precision Alignment in GEMM
As mentioned earlier, matrix multiplication consists of two fundamental operations: scalar multiplication and addition. We can understand matrix multiplication from a 3D perspective, as Basil Hosmer described (Hosmer 2023). Matrix multiplication corresponds to a large cube, with the two input matrices and the result corresponding to three faces of the cube, as Figure 3 shows. Under this correspondence, each scalar multiplication is represented by a small cube, while summing up the results of the scalar multiplications corresponds to adding the small cubes along the horizontal dimension.

The previous subsection suggests that addition is most efficient when the addends have consistent precision. This principle also applies to the additions inside General Matrix Multiply (GEMM). We can treat the addition in GEMM from two perspectives:
• Precision Alignment on Addends: The precision of the intermediate results in GEMM should be consistent.
• Precision Alignment on Results: The precision of each intermediate result in GEMM should be the same as that of the final result.
A Framework for Aligning Precision
In the preceding sections, we showed that it is preferable to align the addends with each other and with the final result. Based on this criterion, we develop a framework that proposes the precision of each parameter in the weights and KV-Cache so that all additions are precision-aligned.
To apply the principle of precision alignment, we must be aware of certain information about the intermediate results and the final outcomes. It should be noted that we only need the approximate magnitudes of these values, which can be estimated before the actual computation. In the following discussion, we will assume that we have already obtained the magnitudes of these values.

Rule 1 (Addend Precision Alignment): We can consider the problem from the perspective of addend alignment. The objective is to keep the precision consistent across all intermediate results. This means that when there is an inconsistency in precision, we can slightly reduce the precision of the higher-precision elements to achieve uniformity among the intermediate results. The final effect of this adjustment is that the precision of all intermediate results aligns with the lowest precision among them, as illustrated in Figure 4.

Rule 2 (Result Precision Alignment): We can also consider the problem from the perspective of result alignment. Here we aim for the precision of all intermediate results to be the same as the precision of the final result. If any intermediate result has a precision higher than that of the final result, we can reduce it to match the final result's precision, as shown in Figure 5.
Both perspectives offer a solution to determine the precision of the intermediate results. As mentioned earlier, these two forms are equivalent, allowing us to choose one according to the computation scenario. When the results can be estimated more accurately, we can use Rule 2 to align the precision of intermediate results. Otherwise, when the results are difficult to estimate and the number of addends is relatively small and manageable, we can choose Rule 1.
Once we obtain the precision of the intermediate results, we can infer the precision of the weights (K-Cache, V-Cache). Since scalar multiplication does not change the relative uncertainty (uncertainty/value), and the quantization bit width is only related to the relative uncertainty of the data, we can directly use the bit width of the intermediate results as the quantization bit width for the weights (K-Cache, V-Cache).
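A minimal sketch of this inference step is given below. It is our own reading of Rules 1 and 2 rather than the released kernel: every intermediate product is aligned to the absolute precision of the largest one, so smaller products need fewer mantissa bits, and the same bit count carries over to the corresponding weight or KV-Cache element. The function name `required_bits` and the minimum-bit floor are our assumptions.

```python
# Sketch: derive per-element mantissa bit widths from estimated product magnitudes.
import numpy as np

MANTISSA_BITS = 10                       # float16 mantissa width

def required_bits(est_partials, min_bits=1):
    """est_partials: estimated magnitudes of the intermediate products of one dot product."""
    exps = np.floor(np.log2(np.abs(est_partials) + 1e-30))
    slack = exps.max() - exps            # how far each product sits below the largest one
    bits = np.clip(MANTISSA_BITS - slack, min_bits, MANTISSA_BITS)
    return bits.astype(int)              # mantissa bits to load for each element

# Example: partial products spanning several orders of magnitude.
partials = np.array([3.1, 0.4, 0.02, 2.7, 0.0007])
print(required_bits(partials))           # -> [10  7  3 10  1]
```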
So far, we have proposed a framework to determine the bit width of quantization. In the next section, we introduce an application of this framework to the KV-Cache, named “AlignedKV”. This work is devoted to cutting down the total memory access of the KV-Cache and reducing the associated memory latency, which is the bottleneck of LLM inference speed.
AlignedKV: Quantizing KV-Cache with Precision Alignment Criterion
Dynamic Quantization
Currently, the vast majority of quantization methods are static, with only a few works employing data-dependent dynamic quantization (Wang, Zhang, and Han 2021), since most quantization techniques aim to save storage space, for which static quantization suffices.
However, in the context of the KV-Cache, dynamic quantization offers compelling advantages. First, dynamic quantization can reduce memory access latency without adding extra delays, which matters because memory access latency is the bottleneck for inference speed in this stage. Second, static quantization poses a significant problem for mixed-precision quantization of the KV-Cache, as it requires determining the importance of elements at quantization time. Although numerous methods are available to predict the importance of elements to address this issue (i.e., the persistence-of-importance hypothesis) (Ge et al. 2023; Liu et al. 2024b), these predictions are always subject to counterexamples (Dong et al. 2024).
We propose a method to dynamically quantize the KV-Cache, which quantizes data at the moment it is used. When storing the KV-Cache, we maintain additional data structures that record the approximate magnitudes of the KV-Cache elements, which is one of the preconditions for precision alignment. Afterwards, during the computation of $QK^T$ and $SV$ (the product of the attention scores $S$ and $V$), we use the precision alignment criterion to determine the required precision for each element in K and V.


We then retrieve only the necessary number of bits from memory on an as-needed basis, as illustrated in Figure 6. According to the results of the precision alignment principle, we read all 16 bits for important parameters, the first 12 bits for less important parameters, and the first 8 bits for particularly unimportant parameters. The remaining bits are filled with 100…, which places the reconstructed value at the midpoint of the truncated range and thus minimizes the precision loss.
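The truncation-and-fill step can be sketched as follows (our own NumPy illustration, not the CUDA kernel): a float16 value is viewed as a 16-bit integer, only its leading `keep_bits` bits are preserved, and the dropped tail is replaced by the 100… pattern.

```python
# Sketch of reading only the leading bits of float16 values and filling the tail.
import numpy as np

def truncate_fp16(x, keep_bits):
    """Keep the leading `keep_bits` of each float16 value and fill the rest with 100..."""
    raw = x.astype(np.float16).view(np.uint16)
    drop = 16 - keep_bits
    if drop == 0:
        return x.astype(np.float16)
    mask = np.uint16((0xFFFF << drop) & 0xFFFF)   # keeps the leading bits
    fill = np.uint16(1 << (drop - 1))             # the 100... pattern in the dropped tail
    return ((raw & mask) | fill).view(np.float16)

x = np.array([0.12345, -3.7, 100.0], dtype=np.float16)
for keep in (16, 12, 8):
    print(keep, truncate_fp16(x, keep))
```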

At the same time, as the GPU hardware can only read memory in minimum units of several bytes, we rearrange the storage structure to ensure continuity for hardware access, as shown in Figure 8.
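One possible arrangement, shown below purely as an illustration (the exact layout in Figure 8 may differ), splits each float16 value into a high-byte plane and a low-byte plane so that a partial read of the leading 8 bits touches a single contiguous buffer:

```python
# Illustrative byte-plane layout for partial reads of float16 values (assumption).
import numpy as np

def split_planes(kv_fp16):
    raw = kv_fp16.view(np.uint16)
    high = (raw >> np.uint16(8)).astype(np.uint8)   # sign, exponent, top mantissa bits
    low = (raw & np.uint16(0xFF)).astype(np.uint8)  # remaining mantissa bits
    return high, low

def load(high, low, need_full_precision):
    """Reassemble values; the low-byte plane is fetched only where full precision is needed."""
    raw = (high.astype(np.uint16) << np.uint16(8)).astype(np.uint16)
    raw[need_full_precision] |= low[need_full_precision].astype(np.uint16)
    raw[~need_full_precision] |= np.uint16(0x80)    # fill the skipped tail with 100...
    return raw.view(np.float16)

k = np.array([0.5, 2.25, -7.0, 0.0039], dtype=np.float16)
high, low = split_planes(k)
print(load(high, low, np.array([True, False, True, False])))
```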
Quantization of K-Cache
AlignedKV follows the process outlined below, applicable to both K-Cache and V-Cache:
1. Obtain the approximate magnitude information of the values in the KV-Cache via an additional data structure, preparing the data for precision alignment. We only need the order-of-magnitude information.
2. Apply our framework to determine the required precision for loading each data element.
3. Perform truncation on the data elements using the dynamic quantization method while loading data from GPU memory into the L1 cache.
Among these, the latter two have already been addressed in the previous sections, leaving only the first step, which requires specific strategies based on the characteristics of the data.
Based on observations of the K-Cache, we found that its data distribution exhibits a columnar characteristic: the data within a single column are relatively similar, while the data across different columns vary significantly, as shown in Figure 7. In other words, we can use a single value from a column (specifically, the maximum value) to represent the order of magnitude of that column.
Thus, we maintain the maximum value of each column of the K-Cache, recorded as $K_{max}$. When computing $QK^T$, we first use the value of $Q$ along with the values of $K_{max}$ to estimate the order of magnitude of the intermediate results. Then, we determine the aligned precision of the intermediate results by Rule 1 (Addend Precision Alignment), which allows us to infer the required bit width for each element in the K-Cache. All of this is derived quantitatively, rather than merely through qualitative analysis.
Then, we load the required bits of each parameter from memory using the proposed dynamic quantization method. As a result, we avoid loading a large number of redundant bits from GPU memory, which significantly reduces the memory access latency of this part.
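The K-Cache path can be sketched as follows (our own simplification, mirroring the framework sketch above): the per-column maxima $K_{max}$ give a magnitude estimate for each column's partial products $q_j K[:, j]$, and Rule 1 aligns every column to the coarsest precision among them.

```python
# Sketch: per-column bit widths for the K-Cache derived from Q and the column maxima.
import numpy as np

def k_cache_bit_widths(q, k_max, mantissa_bits=10, min_bits=1):
    """Per-column bit widths for computing q @ K^T with aligned precision."""
    est_partials = np.abs(q) * np.abs(k_max)          # magnitude estimate of q_j * K[:, j]
    exps = np.floor(np.log2(est_partials + 1e-30))
    slack = exps.max() - exps
    return np.clip(mantissa_bits - slack, min_bits, mantissa_bits).astype(int)

q = np.array([0.8, -0.01, 2.3, 0.05])
k_max = np.array([4.0, 3.5, 0.02, 6.0])               # per-column maxima of the K-Cache
print(k_cache_bit_widths(q, k_max))                    # columns with small products need fewer bits
```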
Quantization of V-Cache
Table 1: Proportions of results in each relative error range for the $QK^T$ and $SV$ GEMMs.
Type | Method | 0 | (0, 1/1024) | [1/1024, 1/512) | [1/512, 1/256) | [1/256, 1/128) | [1/128, inf)
---|---|---|---|---|---|---|---
$QK^T$ | AlignedKV | 56.30% | 37.00% | 5.42% | 0.73% | 0.31% | 0.25%
$QK^T$ | Truncated to 13 bits | 18.65% | 32.70% | 36.21% | 10.26% | 1.36% | 0.83%
$SV$ | AlignedKV | 76.12% | 18.14% | 3.61% | 1.29% | 0.39% | 0.44%
$SV$ | Truncated to 13 bits | 20.04% | 29.47% | 26.66% | 13.82% | 5.71% | 4.30%
Similar to the quantization of the K-Cache, the quantization of the V-Cache also requires approximate order-of-magnitude information about the addends or results in order to apply the precision alignment criterion. However, unlike the K-Cache, the V-Cache does not exhibit a columnar distribution, which prevents us from applying the quantization method used for the K-Cache. Fortunately, we observed that the magnitude differences among the elements of the score matrix (recorded as $S$) are quite significant, as shown in Figure 7. This means that we only need to compute the product of a subset of the larger elements of $S$ with the corresponding rows of $V$ to obtain an approximate estimate of the final results.
We first select the top-$k$ largest values from the score matrix $S$ using an approximate top-$k$ algorithm (an exact top-$k$ algorithm is avoided for speed) and multiply them by their corresponding rows of $V$ to obtain an estimate of the $SV$ result, referred to as $\widetilde{SV}$. Subsequently, we determine the aligned precision of the intermediate results by Rule 2 (Result Precision Alignment) using $\widetilde{SV}$. As with the K-Cache, the alignment of the intermediate results allows us to infer the required bit width for each element in the V-Cache, and we then reduce memory access latency by reading only the necessary bits of each element.
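The V-Cache path can be sketched as below (our own simplification with assumed helpers): an approximate top-k over a score row yields a cheap estimate of $SV$, and Rule 2 limits every partial product $s_i V[i,:]$ to the result's precision, giving per-row bit widths. The stored per-row magnitude `row_mag` is our assumption, not a structure described above.

```python
# Sketch: per-row bit widths for the V-Cache via a top-k estimate of S @ V and Rule 2.
import numpy as np

def v_cache_bit_widths(s, v_topk, topk_idx, row_mag, mantissa_bits=10, min_bits=1):
    sv_est = s[topk_idx] @ v_topk                      # cheap estimate of S @ V
    result_exp = np.floor(np.log2(np.abs(sv_est).max() + 1e-30))
    partial_exp = np.floor(np.log2(np.abs(s) * row_mag + 1e-30))
    slack = result_exp - partial_exp                   # rows far below the result's scale
    return np.clip(mantissa_bits - slack, min_bits, mantissa_bits).astype(int)

s = np.array([0.90, 0.05, 0.04, 0.01])                 # one row of softmax scores
v = np.random.randn(4, 8).astype(np.float32)           # toy V-Cache
topk_idx = np.argsort(s)[-2:]                          # stand-in for the approximate top-k
print(v_cache_bit_widths(s, v[topk_idx], topk_idx, np.abs(v).max(axis=1)))
```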
Experiments
Experiments Setup
We performed comprehensive experiments to validate the effectiveness of the proposed method. We conducted our experiments on the Llama-2-7B model with our own KV-Cache implementation written in CUDA, running on an NVIDIA V100 GPU. It must be noted that the proposed method lends itself to efficient hardware implementation, and dedicated hardware could be significantly more efficient.
Reduction in Bit Widths

In this experiment, we apply AlignedKV to the actual KV-Cache data generated by Llama-2-7B and statistically analyze the number of bits required for each element of the KV-Cache after alignment. The results are illustrated in Figure 9. They clearly demonstrate that with AlignedKV, the average bit width per element decreases from 16 to approximately 12, resulting in a significant reduction of memory access latency during KV-Cache accesses. The average bit width further drops as the context length increases, indicating that our method has significant potential for longer contexts.
Accuracy of GEMM Result
In quantization, it is essential to consider not only the compression rate but also the accuracy. In the previous experiment, we also analyze the impact of AlignedKV on the accuracy of the results. We record the changes in the $QK^T$ and $SV$ GEMM results after applying AlignedKV and represent these changes through the distribution of relative errors.
We use the following formula to calculate the relative error for each value in the results:
$$\text{RelativeError} = \frac{\left| x_{\text{AlignedKV}} - x_{\text{original}} \right|}{\left| x_{\text{original}} \right|}.$$
We calculate the relative error for each value in the result matrix and observe the distribution, as shown in Table 1. Notably, although AlignedKV reduces the bit widths of the KV-Cache by about 4 bits on average, most of the results remain unchanged. In contrast, uniformly removing 3 bits from each element of the KV-Cache yields significantly poorer results, despite this approach achieving a lower compression rate than AlignedKV. This indicates that by quantitatively analyzing the bit-width requirement of each element and bringing them into a state of “alignment”, we can achieve both a high compression rate and high accuracy.
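For reference, the proportions in Table 1 can be reproduced with a few lines of the following form (our own sketch; it assumes the reference results contain no exact zeros):

```python
# Bucket element-wise relative errors into the ranges used by Table 1.
import numpy as np

def error_proportions(aligned, reference):
    """Percentage of elements in each relative-error range (assumes reference != 0)."""
    rel = np.abs(aligned - reference) / np.abs(reference)
    edges = [1 / 1024, 1 / 512, 1 / 256, 1 / 128]
    buckets = [np.mean(rel == 0), np.mean((rel > 0) & (rel < edges[0]))]
    buckets += [np.mean((rel >= lo) & (rel < hi)) for lo, hi in zip(edges, edges[1:])]
    buckets.append(np.mean(rel >= edges[-1]))
    return np.round(100 * np.array(buckets), 2)        # same layout as Table 1's columns

# Example with synthetic data standing in for the GEMM results.
ref = np.random.randn(1000).astype(np.float32)
approx = ref * (1 + np.random.uniform(-1 / 800, 1 / 800, size=ref.shape)).astype(np.float32)
print(error_proportions(approx, ref))
```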
Runtime of AlignedKV
We also evaluated the actual runtime. We run the inference of a Llama-2-7B model with a 128-token input and up to 8192 generated tokens (disregarding the model's maximum context length of 4096, because we only care about the time cost). To highlight the effectiveness of AlignedKV, we only measure the computations involving the KV-Cache, namely:
$$S = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right), \qquad O = SV.$$
When the context length reaches hundreds of thousands or even millions, the memory access latency of the KV-Cache will become the bottleneck for the entire model’s running speed. At this point, the benefits of AlignedKV will be substantial.
We compare AlignedKV with the native Torch implementation and two other quantization methods, KIVI (Liu et al. 2024c) and GEAR (Kang et al. 2024) (both set to 10 bits for fairness). We also set up a control group using AlignedKV's code to isolate the memory access reduction brought by AlignedKV while ignoring the additional overhead associated with CUDA program execution; the only difference is that the control group always loads every element as a full 16-bit value.

The runtime results are displayed in Figure 10. As the token position increases, the length of the KV-Cache grows, at which point the advantage of AlignedKV in reducing memory access latency becomes apparent. When the generation length reaches around 7000 to 8000 tokens, AlignedKV demonstrates the shortest execution time among all experimental groups, surpassing the native Torch implementation. The CUDA implementation in Torch is highly optimized; we expect an even greater performance improvement once our implementation is further optimized.
End-to-End Accuracy Performance
We now present the end-to-end accuracy of AlignedKV, as AlignedKV causes almost no loss of precision. We benchmark it on the CoQA (Reddy, Chen, and Manning 2019), TruthfulQA (Lin, Hilton, and Evans 2021), and GSM8K (Cobbe et al. 2021) tasks using default settings, as KIVI does. We use an open-source evaluation repository (Gao et al. 2024) to conduct this experiment. The results are shown in Table 2.
Table 2: End-to-end accuracy on CoQA, TruthfulQA, and GSM8K.
Model | Method | CoQA | TruthfulQA | GSM8K
---|---|---|---|---
Llama-2-7B | origin | 63.88 | 32.31 | 13.80
Llama-2-7B | AlignedKV | 63.88 | 32.31 | 13.95
Llama-2-7B | KIVI | 64.42 | 33.90 | 12.66
Through these experiments, we find that the results of AlignedKV are very close to those of the original model, which means AlignedKV has virtually no impact on the output of LLMs. This is as expected, since the GEMM results produced by AlignedKV are already accurate and the remaining errors are diluted at higher levels.
Table 3: Example continuations generated by greedy search.
Method | Output
---|---
origin | who is at the front. I am very glad to hear that he…
AlignedKV | who is at the front. I am very glad to hear that he…
KIVI | who is in the army, and who is now in the hospital…
To further demonstrate that our method incurs no loss in end-to-end accuracy, we used the model to generate sentences by greedy search. The results generated by AlignedKV are completely consistent with those produced by the original model. In contrast, although the sentences generated by KIVI are coherent, the initial words differ from those generated by the original model. A brief example is shown in Table 3.
Conclusion
A quantitative theoretical framework is greatly needed for mixed-precision quantization methods, which still rely on experiments to select a preferable bit width. To address this issue, we proposed the precision alignment criterion, which makes it feasible to derive the optimal bit width for each parameter through theoretical analysis rather than relying solely on experimentation. This approach allows us to achieve both high compression rates and high accuracy.
Furthermore, based on this theory, we implemented a dynamic quantization method that quantizes the KV-Cache to reduce memory access latency. This method significantly accelerates the inference of the attention component during the decoding phase with no impact on the accuracy of the final result.
Acknowledgments
This work is supported by Sunlune Company. We sincerely thank all the reviewers for their valuable feedback and suggestions.
References
- Cobbe et al. (2021) Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; Hesse, C.; and Schulman, J. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.
- Dettmers et al. (2022) Dettmers, T.; Lewis, M.; Belkada, Y.; and Zettlemoyer, L. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35: 30318–30332.
- Dettmers et al. (2023) Dettmers, T.; Svirschevski, R.; Egiazarian, V.; Kuznedelev, D.; Frantar, E.; Ashkboos, S.; Borzunov, A.; Hoefler, T.; and Alistarh, D. 2023. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv:2306.03078.
- Dong et al. (2024) Dong, S.; Cheng, W.; Qin, J.; and Wang, W. 2024. QAQ: Quality Adaptive Quantization for LLM KV Cache. arXiv preprint arXiv:2403.04643.
- Gao et al. (2024) Gao, L.; Tow, J.; Abbasi, B.; Biderman, S.; Black, S.; DiPofi, A.; Foster, C.; Golding, L.; Hsu, J.; Le Noac’h, A.; Li, H.; McDonell, K.; Muennighoff, N.; Ociepa, C.; Phang, J.; Reynolds, L.; Schoelkopf, H.; Skowron, A.; Sutawika, L.; Tang, E.; Thite, A.; Wang, B.; Wang, K.; and Zou, A. 2024. A framework for few-shot language model evaluation.
- Ge et al. (2023) Ge, S.; Zhang, Y.; Liu, L.; Zhang, M.; Han, J.; and Gao, J. 2023. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801.
- He et al. (2024) He, Y.; Zhang, L.; Wu, W.; Liu, J.; Zhou, H.; and Zhuang, B. 2024. ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification. arXiv preprint arXiv:2405.14256.
- Hosmer (2023) Hosmer, B. 2023. Inside the Matrix: Visualizing Matrix Multiplication, Attention and Beyond. https://pytorch.org/blog/inside-the-matrix/. Accessed on August 4, 2024.
- Johnson (2020) Johnson, L. 2020. How to Calculate Uncertainty. https://sciencing.com/how-to-calculate-uncertainty-13710219.html. Accessed on August 3, 2024.
- Kang et al. (2024) Kang, H.; Zhang, Q.; Kundu, S.; Jeong, G.; Liu, Z.; Krishna, T.; and Zhao, T. 2024. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527.
- Kim et al. (2023) Kim, S.; Hooper, C.; Gholami, A.; Dong, Z.; Li, X.; Shen, S.; Mahoney, M. W.; and Keutzer, K. 2023. Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629.
- Kloberdanz and Le (2023) Kloberdanz, E.; and Le, W. 2023. MixQuant: Mixed Precision Quantization with a Bit-width Optimization Search. arXiv preprint arXiv:2309.17341.
- Lee et al. (2024) Lee, C.; Jin, J.; Kim, T.; Kim, H.; and Park, E. 2024. Owq: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 13355–13364.
- Li et al. (2023) Li, S.; Ning, X.; Hong, K.; Liu, T.; Wang, L.; Li, X.; Zhong, K.; Dai, G.; Yang, H.; and Wang, Y. 2023. Llm-mq: Mixed-precision quantization for efficient llm deployment. In The Efficient Natural Language and Speech Processing Workshop with NeurIPS, volume 9.
- Lin et al. (2024a) Lin, H.; Xu, H.; Wu, Y.; Cui, J.; Zhang, Y.; Mou, L.; Song, L.; Sun, Z.; and Wei, Y. 2024a. Rotation and Permutation for Advanced Outlier Management and Efficient Quantization of LLMs. arXiv preprint arXiv:2406.01721.
- Lin et al. (2024b) Lin, J.; Tang, J.; Tang, H.; Yang, S.; Chen, W.-M.; Wang, W.-C.; Xiao, G.; Dang, X.; Gan, C.; and Han, S. 2024b. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceedings of Machine Learning and Systems, 6: 87–100.
- Lin, Hilton, and Evans (2021) Lin, S.; Hilton, J.; and Evans, O. 2021. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- Liu et al. (2024a) Liu, R.; Bai, H.; Lin, H.; Li, Y.; Gao, H.; Xu, Z.; Hou, L.; Yao, J.; and Yuan, C. 2024a. IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact. arXiv preprint arXiv:2403.01241.
- Liu et al. (2024b) Liu, Z.; Desai, A.; Liao, F.; Wang, W.; Xie, V.; Xu, Z.; Kyrillidis, A.; and Shrivastava, A. 2024b. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36.
- Liu et al. (2024c) Liu, Z.; Yuan, J.; Jin, H.; Zhong, S.; Xu, Z.; Braverman, V.; Chen, B.; and Hu, X. 2024c. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750.
- Megha et al. (2023) Megha, A.; Asfandyar, Q.; Nikhil, S.; Linden, L.; Julian, Q.; and Daya, K. 2023. LLM Inference Performance Engineering: Best Practices. https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices. Accessed on August 13, 2024.
- Park et al. (2024) Park, Y.; Hyun, J.; Cho, S.; Sim, B.; and Lee, J. W. 2024. Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs. arXiv preprint arXiv:2402.10517.
- Reddy, Chen, and Manning (2019) Reddy, S.; Chen, D.; and Manning, C. D. 2019. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7: 249–266.
- Sheng et al. (2023) Sheng, Y.; Zheng, L.; Yuan, B.; Li, Z.; Ryabinin, M.; Chen, B.; Liang, P.; Ré, C.; Stoica, I.; and Zhang, C. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, 31094–31116. PMLR.
- Wang, Zhang, and Han (2021) Wang, H.; Zhang, Z.; and Han, S. 2021. Spatten: Efficient sparse attention architecture with cascade token and head pruning. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 97–110. IEEE.
- Xiao et al. (2023a) Xiao, G.; Lin, J.; Seznec, M.; Wu, H.; Demouth, J.; and Han, S. 2023a. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, 38087–38099. PMLR.
- Xiao et al. (2023b) Xiao, G.; Tian, Y.; Chen, B.; Han, S.; and Lewis, M. 2023b. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
- Yang et al. (2024a) Yang, D.; Han, X.; Gao, Y.; Hu, Y.; Zhang, S.; and Zhao, H. 2024a. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference. arXiv preprint arXiv:2405.12532.
- Yang et al. (2024b) Yang, J. Y.; Kim, B.; Bae, J.; Kwon, B.; Park, G.; Yang, E.; Kwon, S. J.; and Lee, D. 2024b. No token left behind: Reliable kv cache compression via importance-aware mixed precision quantization. arXiv preprint arXiv:2402.18096.
- Yuan et al. (2024) Yuan, Z.; Shang, Y.; Zhou, Y.; Dong, Z.; Xue, C.; Wu, B.; Li, Z.; Gu, Q.; Lee, Y. J.; Yan, Y.; et al. 2024. Llm inference unveiled: Survey and roofline model insights. arXiv preprint arXiv:2402.16363.
- Yue et al. (2024) Yue, Y.; Yuan, Z.; Duanmu, H.; Zhou, S.; Wu, J.; and Nie, L. 2024. Wkvquant: Quantizing weight and key/value cache for large language models gains more. arXiv preprint arXiv:2402.12065.
- Zhang et al. (2024a) Zhang, Y.; Gao, B.; Liu, T.; Lu, K.; Xiong, W.; Dong, Y.; Chang, B.; Hu, J.; Xiao, W.; et al. 2024a. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling. arXiv preprint arXiv:2406.02069.
- Zhang et al. (2024b) Zhang, Z.; Sheng, Y.; Zhou, T.; Chen, T.; Zheng, L.; Cai, R.; Song, Z.; Tian, Y.; Ré, C.; Barrett, C.; et al. 2024b. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36.
- Zhu et al. (2023) Zhu, X.; Li, J.; Liu, Y.; Ma, C.; and Wang, W. 2023. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633.