An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks
Abstract
Despite their tremendous success and versatility, Large Language Models (LLMs) suffer from inference inefficiency while relying on advanced computational infrastructure.
To address these challenges and make LLMs more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of 1.58-bit LLMs with ternary weight matrices.
Particularly focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n \times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(n^2/\log n)$, a logarithmic factor improvement over the standard $O(n^2)$ vector-matrix multiplication.
Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of the approach with respect to both time and memory, as we observed a reduction in inference time of up to 29x and in memory usage of up to 6x. Our code is publicly available on GitHub: https://github.com/UIC-InDeXLab/RSR.
1 Introduction
Large Language Models (LLMs) have achieved remarkable success and demonstrated versatility across a wide range of domains, yet they encounter significant challenges related to inference inefficiency. These models demand substantial computational resources, including specialized, costly GPUs with ample memory to achieve real-time inference. This inefficiency leads to slower response times, elevated energy consumption, higher operational expenses, and, perhaps more importantly, limited accessibility for everyday users who lack access to advanced computational infrastructure.
Given these limitations, current deployments of LLMs on typical consumer devices rely predominantly on API calls to powerful, remote servers (OpenAI, 2024; AI21 Labs, 2024; Hu et al., 2021; Hugging Face, 2023). While this approach enables users to leverage LLMs without needing advanced hardware, it introduces additional costs and delays due to network dependency, along with potential privacy concerns stemming from data transmitted to and processed by external servers (Yao et al., 2024; Pearce et al., 2023; Das et al., 2024; Finlayson et al., 2024).
Consequently, optimizing inference time and memory efficiency on standard, widely available hardware has become essential to make LLMs more practical and accessible for broader, real-world applications.
To that end, recent efforts have focused on quantizing the weights of LLMs, and more generally, Deep Neural Networks (DNNs), to enhance their computational efficiency and reduce energy consumption (Moons et al., 2017; Hubara et al., 2018; Wang et al., 2023; Chen et al., 2024; Ma et al., 2024). In particular, limiting the weights to ternary values in 1.58-bit LLMs has been demonstrated to preserve accuracy comparable to that of general LLMs, thereby offering a more effective alternative for inference tasks (Ma et al., 2024).
Expanding on recent advancements of ternary-weight LLMs (and DNNs), in this paper, we propose algorithms that improve their inference time and memory efficiency. Our approach makes deploying these models viable on a broader range of devices, including those with limited computational power, ultimately making these tools more widely accessible and cost-effective.
Specifically, viewing the inference process as a sequence of multiplications of activation vectors by ternary weight matrices, we make a critical observation: once the model is trained, the weight matrices remain fixed and do not change.
Therefore, focusing on matrix multiplication as the bottleneck operation, we preprocess the weight matrices of the trained model and create indices that enable efficient multiplication during the inference time. At a high level, the indices are bucketized permutation lists. Interestingly, by replacing each weight matrix with its preprocessed indices, our approach reduces the space complexity required for storing the weights by a logarithmic factor.
We propose two algorithms for efficient multiplication of input vectors with the preprocessed weight matrices. At a high level, our algorithms transform each ternary matrix into two binary matrices (note that our algorithms readily apply to DNNs with binary weight matrices, including BitNet (Wang et al., 2023)), while partitioning each matrix into a set of column blocks. Then, permuting the rows based on a lexicographic order, they compute a set of aggregate values that contribute to different elements of the result vector. This process guarantees a time complexity of $O(n^2 \log\log n / \log n)$ for $n \times n$ matrices in our first algorithm. Furthermore, introducing an efficient approach for writing the aggregate values into the result vector, our second algorithm improves the time complexity to $O(n^2/\log n)$. The run-time of our algorithms further improves by fine-tuning their blocking parameter.
Many widely used LLMs are characterized by substantial weight matrix sizes. For instance, the weight matrices of GPT-3 have dimension 12,288 (Tsai, 2020; Tech, 2023; Brown, 2020), and this value is even greater for GPT-4. Consequently, achieving even a logarithmic factor improvement can have a significant impact, potentially resulting in up to a 13x reduction in inference time ($\log_2 12{,}288 \approx 13.6$) for models such as GPT-3.
In addition to theoretical analysis, we perform rigorous experiments to evaluate the time and memory consumption of our algorithms in practice. Confirming our theoretical findings, our experiments demonstrate the superiority of our algorithms over the standard multiplication approach. In particular, our algorithms reduced the inference time by up to 29x and the memory usage by up to 6x.
1.1 Summary of Contribution
In summary, our contributions are as follows:
• (Section 3) We provide a formal definition of the vector-ternary-matrix multiplication problem and demonstrate its reduction to the vector-binary-matrix multiplication problem.
• (Section 4) Following the observation that weight matrices are fixed once a model is trained, we preprocess them and construct indices that enable the development of our efficient algorithms while reducing the memory requirements by a logarithmic factor.
• (Section 5) We introduce the RSR algorithm with a time complexity of $O(n^2 \log\log n / \log n)$ for vector-binary-matrix multiplication over $n \times n$ matrices. We further optimize this algorithm and introduce RSR++, which achieves a faster running time of $O(n^2/\log n)$.
• (Sections 6 and 7) We conduct various experiments to demonstrate the applicability of our algorithms for matrix multiplication under different implementation configurations. We show that we can achieve up to 29x faster run time and 6x less space usage on the matrix multiplication task. Finally, we discuss the advantages and limitations of our algorithms in Section 7.
2 Background
1.58-bit LLMs limit their weights to ternary values (Ma et al., 2024). Generally speaking, ternary weights can be considered for a wide range of DNNs (Moons et al., 2017; Hubara et al., 2018; Yuan et al., 2024), significantly enhancing the computational efficiency of these large-scale models.
A trained 1.58-bit model consists of a collection of ternary weight matrices used for inference. More precisely, each neural layer (see Figure 1) employed by these models incorporates ternary weights. The weights are learned during the training phase and are fixed at inference time. During inference, given an input vector, the feedforward process can be viewed as a series of ternary matrix multiplications applied to the input vector to produce the final output. For instance, consider the two consecutive layers shown in a multi-layer perceptron in Figure 1. Each ternary weight matrix $W_\ell$ represents the edge weights between layers $\ell - 1$ and $\ell$, with layer 0 being the input vector. The feedforward step is a chain of matrix products and activation functions that maps an input vector $x$ to an output vector $y$; it can be represented as $a_\ell = \sigma(a_{\ell-1} W_\ell)$ for $\ell = 1, \dots, L$ (with $a_0 = x$ and $y = a_L$), where $\sigma$ is the activation function (e.g., sigmoid or ReLU).
Figure 2 visualizes the bottleneck operation of the inference process: the multiplication of each activation vector $a_{\ell-1}$ by the weight matrix $W_\ell$ to generate the vector $a_\ell$.
Building on this background, our work focuses on accelerating the bottleneck operation, the vector-ternary-matrix product. By improving the speed of this operation, we proportionally accelerate the entire inference process of a quantized LLM (or, more generally, a neural network), as each matrix product in the feedforward procedure is computed sequentially until the output layer. We reformulate this problem as a vector-binary-matrix multiplication, thereby enabling compatibility with both binary and ternary matrices.
Previous research has explored enhancing matrix multiplication through hardware-level accelerations (Guo et al., 2024; Wang et al., 2024). In contrast, our approach focuses on designing an algorithm that provides a guaranteed logarithmic improvement over naive matrix multiplication at the application layer, independent of specific hardware architectures.
In the subsequent sections, we concentrate on vector-ternary-matrix multiplication, which constitutes the primary focus of this study.
3 Preliminaries
In this section, we formally build the notation required for the following sections, followed by a formulation of the problem we address in this work.
3.1 Notation
We denote vectors by lowercase letters (e.g., $v$) and matrices by capital letters. For example, $A \in \Sigma^{n \times m}$ is a matrix of size $n \times m$ with elements belonging to a set of permissible values $\Sigma$. Specifically, if $\Sigma = \{0, 1\}$ (resp. $\Sigma = \{-1, 0, 1\}$), then $A$ is called a binary (resp. ternary) matrix.
Let $\Pi_n$ denote the set of all bijective permutation functions $\pi: [n] \to [n]$. For a vector $v$ of size $n$ and a permutation $\pi \in \Pi_n$, the permuted vector is represented as $\pi(v)$, where each element is repositioned such that $\pi(v)[\pi(i)] = v[i]$, i.e., the element at position $i$ of $v$ moves to position $\pi(i)$ of the permuted vector. We extend $\pi$ to matrices by permuting their rows. When $\pi$ is clear from the context, we may simplify the notation and drop it. Let $\mathcal{L}_n$ denote the set of all ordered lists of length $n$ with entries in $[n]$, and let $\mathcal{L}_{\le n}$ represent the set of sorted lists of length at most $n$. We use $A[:, j]$ to indicate the $j$-th column of a matrix $A$. Similarly, we use $A[i]$ to refer to the $i$-th row of the matrix and $A[i, j]$ to indicate a single element.
3.2 Problem Formulation
Given a vector $v$ of size $n$ and a ternary matrix $A$ of size $n \times n$, our objective is to efficiently compute the product $v \cdot A$. In this setting, the vector $v$ serves as the activation output from the previous layer of the neural network, and $A$ is the weight matrix (see Figures 1 and 2).
Here, we focus on a single multiplication instance and aim to accelerate this operation beyond the standard naive vector-matrix multiplication complexity of $O(n^2)$. We initially focus on square matrices of size $n \times n$ for clarity in exposition. However, our approach is applicable more broadly and extends to general matrices of dimensions $n \times m$, as discussed in Section 7.
A key observation is that the weight matrices are fixed after a model is trained, while the vector $v$ is provided at inference time. This allows us to preprocess each weight matrix to enable a faster vector-matrix product – and hence more efficient inference.
Problem 1 (Vector-Ternary-Matrix Product).
Given an input vector $v \in \mathbb{R}^n$ and a pre-defined ternary matrix $A \in \{-1, 0, 1\}^{n \times n}$, compute the product $v \cdot A$.
3.3 Solution Overview
Initially, we use the following lemma to reduce our problem to a vector-binary-matrix product.
Lemma 3.1.
Any ternary matrix $A$ can be expressed as $A = A^{+} - A^{-}$, where $A^{+}$ and $A^{-}$ are binary matrices.
Proof.
We can build $A^{+}$ and $A^{-}$ according to the following equation:
$$A^{+}[i, j] = \begin{cases} 1 & \text{if } A[i, j] = 1 \\ 0 & \text{otherwise,} \end{cases} \qquad A^{-}[i, j] = \begin{cases} 1 & \text{if } A[i, j] = -1 \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$
∎
As a result, we focus on solving the following problem.
Problem 2 (Vector-Binary-Matrix Product).
Given an input vector $v \in \mathbb{R}^n$ and a pre-defined binary matrix $B \in \{0, 1\}^{n \times n}$, compute the product $v \cdot B$.
As a result of Lemma 3.1, any time guarantee established for this reduced version applies equivalently to the original problem, up to a constant factor.
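To make the reduction concrete, the following is a minimal NumPy sketch (our own illustrative code, not the paper's released implementation; the function name is ours) that builds the two binary matrices of Lemma 3.1 and checks that $v \cdot A = v \cdot A^{+} - v \cdot A^{-}$:

```python
import numpy as np

def ternary_to_binary_pair(A: np.ndarray):
    """Lemma 3.1: split a ternary matrix A (entries in {-1, 0, 1}) into A = A_plus - A_minus."""
    A_plus = (A == 1).astype(np.int8)    # 1 exactly where A[i, j] == 1
    A_minus = (A == -1).astype(np.int8)  # 1 exactly where A[i, j] == -1
    return A_plus, A_minus

rng = np.random.default_rng(0)
A = rng.integers(-1, 2, size=(8, 8))     # a small random ternary matrix
v = rng.standard_normal(8)
A_plus, A_minus = ternary_to_binary_pair(A)
assert np.allclose(v @ A, v @ A_plus - v @ A_minus)   # two binary products recover v . A
```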
We introduce the Redundant Segment Reduction (RSR) algorithm, which optimizes vector-matrix multiplication by identifying and reusing redundant computations. The RSR algorithm begins by partitioning the matrix into smaller blocks, each with the same number of rows but a reduced number of columns.
Within each block, similar segments across different columns are identified. We will calculate the dot product for each unique segment only once. This approach reduces the time for computing the product of the vector and each matrix block. To maximize the length of similar segments across columns, we also apply a permutation to the matrix blocks. In the final step, we aggregate the results of the vector’s product with each matrix block to obtain the overall product. An overview of this approach is illustrated in Figure 3.
In Section 4, we present the preprocessing phase of our algorithm, detailing the construction of the essential data structures utilized during inference. Subsequently, in Section 5, we describe the inference-time operations, illustrating how our algorithm accelerates the multiplication process, achieving a logarithmic improvement by reducing the time complexity of the standard vector-matrix multiplication from $O(n^2)$ to $O(n^2/\log n)$.
4 Preprocessing: Index Construction

In this section, we outline the preprocessing phase of our approach, during which we construct the essential initial data structures for the matrix in Problem 2. These structures are designed to optimize and accelerate the multiplication process at inference time (explained in Section 5).
Given a ternary matrix $A$ representing the weights between two layers of a ternary NN, we first express it as the difference of two binary matrices, $A = A^{+} - A^{-}$, as described in Lemma 3.1. Let $B$ be either $A^{+}$ or $A^{-}$. At a high level, during preprocessing, the RSR algorithm partitions the columns of $B$ into a set of column blocks, associating with each block a row permutation as its index to identify the longest common segments across the columns. This permutation is used later at inference time for efficient vector-to-matrix multiplication.
The space complexity for representing an $n \times n$ matrix is $O(n^2)$. Interestingly, as we shall explain in Section 4.5, the space complexity of our indices is $O(n^2/\log n)$, a logarithmic-factor reduction in the space required for representing $B$.
4.1 Step 1: Column Blocking
The first step of our preprocessing is Column Blocking – the process of partitioning consecutive columns of the original binary matrix to construct a set of smaller, compact matrices, each representing a distinct block of columns.
Definition 4.1 (-Column Block).
Let $k \in [n]$ and $i \in \left[\lceil n/k \rceil\right]$. We define $B_i^{(k)}$ as the $n \times k$ matrix containing columns $(i-1)\,k + 1$ through $i \cdot k$ of $B$. This construction yields a series of submatrices that partition $B$ into contiguous column blocks. Each $B_i^{(k)}$ is called a $k$-Column Block of $B$.
For example, partitioning a binary matrix $B$ with $k = 2$ yields the $2$-Column Blocks $B_1^{(2)}, B_2^{(2)}, \dots$, where $B_1^{(2)}$ contains columns 1–2 of $B$, $B_2^{(2)}$ contains columns 3–4, and so on.
When $k$ is clear from the context, we use $B_i$ instead of $B_i^{(k)}$ to denote the $i$-th block.
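As a small illustration (our own code; it assumes, for simplicity, that $k$ divides the number of columns), column blocking amounts to slicing consecutive groups of $k$ columns:

```python
import numpy as np

def column_blocks(B: np.ndarray, k: int):
    """Partition the columns of a binary matrix B into consecutive k-Column Blocks."""
    n_cols = B.shape[1]
    assert n_cols % k == 0, "assume k divides the number of columns for simplicity"
    return [B[:, i:i + k] for i in range(0, n_cols, k)]
```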
4.2 Step 2: Row Permutation
The next step after column blocking is binary row ordering, where the rows of each column block are sorted according to a lexicographic order.
Definition 4.2 (Binary Row Order).
Let $B$ be a binary matrix with $n$ rows and $k$ columns. For each row $B[i]$, let $\nu(B[i])$ be the binary value obtained by concatenating its entries $B[i, 1] B[i, 2] \cdots B[i, k]$. For example, if $B[i] = (1, 0, 1)$, then $\nu(B[i]) = (101)_2 = 5$. The Binary Row Order of $B$, denoted $\pi_B$, is defined as a permutation on $[n]$ such that the rows of $\pi_B(B)$ are sorted based on their corresponding binary values in ascending order. That is, for all $i < j$, $\nu(\pi_B(B)[i]) \le \nu(\pi_B(B)[j])$. We call $\pi_B(B)$ the Binary Row Ordered version of $B$. (In general, any binary matrix whose rows follow this lexicographic order is called a Binary Row Ordered matrix in this paper.)
To further clarify the Binary Row Order, let us consider the following example, where $B_1$ is a block with two columns and six rows: three rows equal to $(0, 0)$, two rows equal to $(0, 1)$, and one row equal to $(1, 1)$. Its Binary Row Order $\pi_1$ (where $\pi_1(i)$ denotes the $i$-th value of the permutation list) rearranges the rows so that
$$\pi_1(B_1) = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 0 & 0 \\ 0 & 1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix}. \tag{2}$$
The right-hand side results from applying $\pi_1$ to the rows of $B_1$.
We define $B'(k)$ as the $2^k \times k$ binary matrix that has exactly one row for each binary value from $0$ to $2^k - 1$ and is also Binary Row Ordered. For example,
$$B'(2) = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}.$$
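The Binary Row Order of a block and the matrix $B'(k)$ can be computed with a few lines of NumPy; the sketch below is our own illustration (note that `np.argsort` returns the permutation in the inverse convention of the definition above, i.e., `perm[j]` is the original row placed at sorted position $j$, which is equally usable since `block[perm]` is Binary Row Ordered and `v[perm]` is the correspondingly permuted vector):

```python
import numpy as np

def binary_row_order(block: np.ndarray) -> np.ndarray:
    """Permutation that sorts the rows of a binary block by their binary (lexicographic) value."""
    k = block.shape[1]
    weights = 2 ** np.arange(k - 1, -1, -1)          # read each row as a k-bit integer, MSB first
    return np.argsort(block @ weights, kind="stable")

def B_prime(k: int) -> np.ndarray:
    """The 2^k x k Binary Row Ordered matrix with one row per binary value 0 .. 2^k - 1."""
    return (np.arange(2 ** k)[:, None] >> np.arange(k - 1, -1, -1)) & 1
```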
4.3 Step 3: Segmentation
Next, we segment each binary row ordered matrix (the sorted column blocks) into groups of rows with the same binary value representation, enabling further matrix compaction.
Definition 4.3 (Segmentation List).
We define a function $\mathrm{seg}(\cdot)$ that, given a Binary Row Ordered matrix of size $n \times k$, returns the list of boundary indices specifying the ranges of rows that share the same binary value representation.
For example, the segmentation of a column block $B_i$ is defined as $\mathrm{seg}(\pi_i(B_i))$, where $\pi_i$ denotes the Binary Row Order of $B_i$. For simplicity, we denote this segmentation as $S_i = \langle s_1, s_2, \dots \rangle$. Then, all rows in the range $[s_j, s_{j+1})$ have the same binary value representation. Note that $2^k$ provides an upper bound on the size of $S_i$.
Consider the Binary Row Ordered matrix shown in Example 2. The Segmentation list of this matrix is $\langle 1, 4, 6 \rangle$ because the first row is the beginning of the $(0,0)$ rows; then, the $(0,1)$ rows start from index $4$, and the $(1,1)$ row starts at index $6$.
For a Binary Row Ordered matrix with $k$ columns, we extend this concept to define the Full Segmentation as the list of exactly $2^k$ elements, where the element corresponding to value $r$ indicates the first index of a row that has the binary value $r$. For example, the Full Segmentation of the matrix in Example 2 is $\langle 1, 4, 6, 6 \rangle$. Since we do not have any rows for the value $(1,0)$, we assign it the same boundary as the next value. See Table 1 for an illustration.
Row Binary Value | Start Index |
00 | 1 |
01 | 4 |
10 | 6 |
11 | 6 |
From now on, we use the same notation, $S_i$, to denote the Full Segmentation of a Binary Row Ordered matrix.
Proposition 4.4.
For a Binary Row Ordered binary matrix with $n$ rows, $k$ columns, and Full Segmentation $S$, the value $S[r+1] - S[r]$ represents the number of rows whose binary value corresponds to $r$, for $0 \le r < 2^k - 1$. The number of rows for the value $2^k - 1$ is $n - S[2^k - 1] + 1$.
Given the Full Segmentation of a Binary Row Ordered matrix, a compact representation of the matrix can be obtained by retaining only the first index of each unique row value in the segmentation list. In the following sections, for simplicity, we write $S_i$ instead of $\mathrm{seg}(\pi_i(B_i))$ to denote the Full Segmentation of the permuted Column Block $B_i$.
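A short sketch of the Full Segmentation computation (our own code; it uses 0-indexed boundaries and stores one extra boundary at the end, so that segment $r$ covers positions `seg[r]` up to `seg[r+1]`, which carries the same information as the 1-indexed lists used in the text):

```python
import numpy as np

def full_segmentation(sorted_values: np.ndarray, k: int) -> np.ndarray:
    """Boundary positions (length 2^k + 1) of every binary value in an ascending value array."""
    return np.searchsorted(sorted_values, np.arange(2 ** k + 1))

# Example 2: sorted row values 00, 00, 00, 01, 01, 11 -> integer values [0, 0, 0, 1, 1, 3]
seg = full_segmentation(np.array([0, 0, 0, 1, 1, 3]), k=2)
print(seg)  # [0 3 5 5 6]: value 0 spans rows 0..2, value 1 spans 3..4, value 2 is empty, value 3 is row 5
```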
4.4 Preprocessing Algorithm
Putting all three steps together, the preprocessing algorithm is described in Algorithm 1. The algorithm starts by selecting a parameter $k$ such that $2^k \le n$. Next, it computes the $k$-Column Blocks of $B$. Hereafter, we denote these blocks as $B_i$ rather than $B_i^{(k)}$, for $i \in \left[\lceil n/k \rceil\right]$. For each $B_i$, the algorithm proceeds to calculate both the Binary Row Order $\pi_i$ and the Full Segmentation $S_i$.
Figure 3 (Preprocessing – the left figure) provides a visual example of the preprocessing steps.
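For concreteness, a compact NumPy sketch of the whole index construction might look as follows (our own illustrative code, not the paper's implementation; it assumes $k$ divides the number of columns and uses the 0-indexed boundary convention introduced above):

```python
import numpy as np

def preprocess(B: np.ndarray, k: int):
    """Algorithm 1 (sketch): build a (permutation, full segmentation) index per k-Column Block of B."""
    assert B.shape[1] % k == 0, "assume k divides the number of columns for simplicity"
    weights = 2 ** np.arange(k - 1, -1, -1)                  # read each k-bit row as an integer
    index = []
    for start in range(0, B.shape[1], k):
        block = B[:, start:start + k]                        # Step 1: column blocking
        values = block @ weights
        perm = np.argsort(values, kind="stable")             # Step 2: Binary Row Order
        seg = np.searchsorted(values[perm], np.arange(2 ** k + 1))  # Step 3: Full Segmentation
        index.append((perm, seg))
    return index   # B itself is no longer needed at inference time
```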
4.5 Complexity Analysis
To simplify the complexities, in the following we provide the time and space complexities for an $n \times n$ weight matrix. Note that storing the weight matrix itself has a space complexity of $O(n^2)$.
Time Complexity:
During the preprocessing, we read all elements of the input matrix $B$, which requires $O(n^2)$ time. Blocking the matrix into $k$-Column Blocks has a time complexity linear in the number of blocks, while determining each permutation corresponds to performing a bucket sort on the rows to achieve the lexicographic ordering, which requires $O(n \cdot k + 2^k)$ time per block. Consequently, the total preprocessing time is $O(n^2)$. Given that the input matrix size is $n \times n$, this complexity is asymptotically optimal, as any algorithm requires reading the matrix in $\Omega(n^2)$.
Output Space Complexity:
The output of the preprocessing phase is a permutation and a segmentation list as each block's index. Recall that each segmentation list has at most $2^k$ entries, where $2^k \le n$. Therefore, storing each block index requires $O(n)$ space. The total number of column blocks is $\lceil n/k \rceil$. Hence, the total space complexity of the preprocessing output is $O(n^2/k)$. Particularly, when $k = \Theta(\log n)$, the space complexity is $O(n^2/\log n)$. As we shall explain in Section 5, the inference algorithms only require access to the block indices. Therefore, by replacing the weight matrices with the block indices, we reduce the space complexity by a logarithmic factor.
Theorem 4.5.
Let a ternary weight matrix $A$ be expressed using the binary matrices $A^{+}$ and $A^{-}$ as $A = A^{+} - A^{-}$. Then representing $A$ using the block indices of $A^{+}$ and $A^{-}$ produced by Algorithm 1 requires space of complexity $O(n^2/\log n)$.
5 Inference Time: Vector-to-Matrix Multiplication
To simplify the complexity analysis without loss of generality, we assume the weight matrix is of size $n \times n$. Specifically, given a preprocessed binary matrix $B$, we aim to efficiently compute the product $v \cdot B$ when a vector $v$ is provided at inference time (see Figure 2 – the right side).
5.1 Segmented Sum Computation
Recall that during the preprocessing time, for each column block $B_i$, we construct a permutation function $\pi_i$ and a full segmentation list $S_i$ as its index.
At inference time, given a vector $v$, our objective is to efficiently compute the multiplication of $v$ by each column block $B_i$, using this index.
Our first step is computing the Segmented Sums over the permuted vector $\pi_i(v)$ – i.e., the sum over each interval of the permuted vector that corresponds to a group of identical rows in the binary matrix $B_i$.
Definition 5.1 (Segmented Sum).
Consider the permutation function $\pi_i$ and the full segmentation list $S_i$ of a column block $B_i$. Let $v' = \pi_i(v)$ be the permutation of the vector $v$. The Segmented Sum of $v$ on $B_i$ is defined as a vector $v_s$ of size $2^k$ where, for each $r \in \{0, \dots, 2^k - 1\}$,
$$v_s[r] = \sum_{j = S_i[r]}^{e_r} v'[j], \qquad e_r = \begin{cases} S_i[r+1] - 1 & \text{if } r < 2^k - 1 \\ n & \text{otherwise,} \end{cases} \tag{3}$$
i.e., the sum of the entries of $v'$ that fall in the $r$-th segment (an empty segment contributes $0$).
For example, for the Full Segmentation $S_1 = \langle 1, 4, 6, 6 \rangle$ defined for Example 2 and a given vector $v$, we have
$$v_s = \big( v'[1] + v'[2] + v'[3],\;\; v'[4] + v'[5],\;\; 0,\;\; v'[6] \big), \tag{4}$$
where $v' = \pi_1(v)$, and the third element is $0$ because there is no row with value $(1,0)$ in the segmentation.
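As an illustrative sketch (our own code, following the 0-indexed boundary convention of the preprocessing sketch above), the Segmented Sum can be obtained with a single pass over the permuted vector via prefix sums:

```python
import numpy as np

def segmented_sum(v: np.ndarray, perm: np.ndarray, seg: np.ndarray) -> np.ndarray:
    """Segmented Sum v_s: one entry per binary row value; empty segments contribute 0."""
    vp = v[perm]                                    # the permuted vector pi(v)
    prefix = np.concatenate(([0.0], np.cumsum(vp)))
    return prefix[seg[1:]] - prefix[seg[:-1]]       # sum of vp over each segment [seg[r], seg[r+1])

# Example 2: segments [0, 3), [3, 5), [5, 5) (empty), [5, 6) on a hypothetical vector
v = np.arange(1.0, 7.0)
perm = np.arange(6)   # identity permutation, for illustration only
print(segmented_sum(v, perm, np.array([0, 3, 5, 5, 6])))   # [6. 9. 0. 6.]
```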
5.2 RSR
We are now ready to describe our algorithm, RSR, for computing $v \cdot B$ at inference time. Given the input vector $v$, for each column block $B_i$, in Step 1 we use the permutation function $\pi_i$ and the full segmentation list $S_i$ to compute the Segmented Sum $v_s$ (Definition 5.1).
Then, we use Lemma 5.2 to calculate $v \cdot B_i$.
Lemma 5.2.
For each column block $B_i$, we have $v \cdot B_i = v_s \cdot B'(k)$, where $v_s$ is the Segmented Sum of $v$ on $B_i$.
Proof.
Define $u = v_s \cdot B'(k)$. At first, we should show that the dimensions match, so that the product on the right-hand side is valid. We know that $v_s$ has $2^k$ entries (see Definition 5.1). In addition, based on Definition 4.3, we know that in the Full Segmentation list there is exactly one element for each possible binary value between $0$ and $2^k - 1$, and $B'(k)$ has exactly one row for each such value. So the product $v_s \cdot B'(k)$ is well defined and has size $k$, the same as $v \cdot B_i$.
Now, we focus on a single column of $B_i$ (the $j$-th column). We show that $v \cdot B_i[:, j] = v_s \cdot B'(k)[:, j]$, where $\cdot$ here denotes the dot product of two vectors. Based on Definition 5.1, we know that the $r$-th element of $v_s$ is the $r$-th segment sum of $v' = \pi_i(v)$. This segment is the sum of some consecutive elements of $v'$. Based on Definition 4.2, the permutation is designed such that identical rows of $B_i$ are positioned together. This means that all the elements of $v'$ that are in this segment (we denote the set of their indices by $I_r$) are multiplied by the same binary value in the $j$-th column; call this value $b_{r,j}$. As a result, we can factor out $b_{r,j}$ for each segment in the dot product. We can rewrite the dot product as:
$$v \cdot B_i[:, j] = \sum_{r=0}^{2^k - 1} \sum_{t \in I_r} v'[t]\, b_{r,j} \tag{6}$$
$$= \sum_{r=0}^{2^k - 1} b_{r,j} \sum_{t \in I_r} v'[t] \tag{7}$$
$$= \sum_{r=0}^{2^k - 1} b_{r,j}\, v_s[r], \tag{8}$$
where in the last line we used the fact that $\sum_{t \in I_r} v'[t] = v_s[r]$.
Each $b_{r,j}$ is the single binary value that column $j$ of the permuted block takes on the rows of segment $r$. Based on Definition 4.2, the applied permutation sorts all rows based on their lexicographic order, which is exactly the row order of $B'(k)$. This means that $b_{r,j} = B'(k)[r, j]$ and, as a result,
$$\sum_{r=0}^{2^k - 1} b_{r,j}\, v_s[r] = \sum_{r=0}^{2^k - 1} B'(k)[r, j]\, v_s[r] \tag{9}$$
$$= v_s \cdot B'(k)[:, j]. \tag{10}$$
As a result, for all columns $j$, the dot products are equal, i.e., $v \cdot B_i[:, j] = v_s \cdot B'(k)[:, j]$. Consequently, the vector-matrix products are also equal. ∎
Following Lemma 5.2, as the final step at inference time, in Step 2 we calculate the product $v_s \cdot B'(k)$ for each $k$-Column Block $B_i$. Algorithm 2 shows the pseudo-code of this algorithm.
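A compact NumPy sketch of the inference-time procedure (our own illustrative code, reusing the index layout of the preprocessing sketch in Section 4.4; Step 2 below is a standard product with $B'(k)$, which is exactly the part RSR++ accelerates next):

```python
import numpy as np

def rsr_multiply(v: np.ndarray, index, k: int) -> np.ndarray:
    """Compute v . B from the preprocessed per-block (permutation, segmentation) index."""
    Bk = (np.arange(2 ** k)[:, None] >> np.arange(k - 1, -1, -1)) & 1   # B'(k): 2^k x k
    out = np.empty(k * len(index))
    for i, (perm, seg) in enumerate(index):
        vp = v[perm]                                     # Step 1: permute v ...
        prefix = np.concatenate(([0.0], np.cumsum(vp)))
        vs = prefix[seg[1:]] - prefix[seg[:-1]]          # ... and take the segmented sums
        out[i * k:(i + 1) * k] = vs @ Bk                 # Step 2: multiply by B'(k) (Lemma 5.2)
    return out

# Usage: rsr_multiply(v, preprocess(B, k), k) matches v @ B for a binary matrix B.
```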
5.2.1 Complexity Analysis
For each column block $B_i$, Step 1 takes $O(n)$ time since it makes a pass over each element of $v$ exactly once. Step 2 computes the product of a vector of size $2^k$ with a matrix of dimensions $2^k \times k$, resulting in a time complexity of $O(k \cdot 2^k)$.
The entire process is applied to $\lceil n/k \rceil$ $k$-Column Blocks. Consequently, the total query time is $O\!\left(\frac{n}{k} (n + k \cdot 2^k)\right)$. For any selection of $k$, the overall running time can be written as $O\!\left(\frac{n^2}{k} + n \cdot 2^k\right)$.
Specifically, setting $k = \log n - \log\log n$ results in a time complexity of $O\!\left(\frac{n^2 \log\log n}{\log n}\right)$.
Theorem 5.3.
The RSR algorithm solves Problem 1 with a time complexity of $O\!\left(\frac{n^2 \log\log n}{\log n}\right)$.
5.2.2 The optimal $k$
While we could prove Theorem 5.3 by choosing $k = \log n - \log\log n$, this choice might not be optimal, i.e., it may not minimize the run-time of the algorithm.
The optimal value of $k$ can be found using Equation 11:
$$k^{*} = \underset{1 \le k \le \log n - \log\log n}{\arg\min}\; \frac{n}{k}\left(n + k \cdot 2^{k}\right). \tag{11}$$
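Because the feasible range contains only $O(\log n)$ candidate values, the tuning step is cheap; the sketch below (our own code, which scores candidates with the cost expressions derived above rather than measured run times, and uses a linear scan instead of binary search) illustrates the idea for both RSR and RSR++:

```python
import math

def optimal_k(n: int, rsr_plus_plus: bool = False) -> int:
    """Pick the k in the feasible range that minimizes the modeled per-inference cost."""
    def cost(k: int) -> float:
        step2 = 2 ** k if rsr_plus_plus else k * 2 ** k    # per-block Step-2 cost
        return (n / k) * (n + step2)                       # n/k blocks, each paying O(n) for Step 1
    upper = int(math.log2(n)) if rsr_plus_plus else max(1, int(math.log2(n) - math.log2(math.log2(n))))
    return min(range(1, upper + 1), key=cost)

print(optimal_k(4096), optimal_k(4096, rsr_plus_plus=True))   # modeled optima for n = 4096
```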
5.3 RSR++
Having presented RSR, we now present our faster algorithm, RSR++, which focuses on improving Step 2 of RSR. This step involves computing the product of a vector of size $2^k$ with the binary matrix $B'(k)$ of size $2^k \times k$ (Line 5 of Algorithm 2).
The standard vector-matrix multiplication, in this case, takes $O(k \cdot 2^k)$ time. However, the special structure of the matrix $B'(k)$ allows us to reduce it to $O(2^k)$. Algorithm 3 shows the pseudo-code for this multiplication. See Figure 4 for a visualization of this approach.

Let $u$ denote the final result of this product, $u = v_s \cdot B'(k)$, a vector of size $k$. We build $u$ starting from the $k$-th element down to the first one. (i) The $k$-th element can be calculated by summing up the elements of $v_s$ with odd indices. For example, in the RSR++ visualization in Figure 4, the sum of the odd-index elements (orange cells) in the top row is 23. Then, (ii) we calculate a vector of size $2^{k-1}$ in which we sum each two consecutive elements of $v_s$. Now the $(k-1)$-th element of $u$ can be calculated by applying the same process as in (i), this time on the new vector (see rows 2 and 3 in the RSR++ visualization in Figure 4). We continue this process by repeating steps (i) and (ii) to build all elements of $u$.
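The following sketch (our own code; it assumes $v_s$ is 0-indexed and that row $r$ of $B'(k)$ is the $k$-bit binary representation of $r$ with the most significant bit in the first column) implements this subroutine and compares it against the direct product:

```python
import numpy as np

def rsr_pp_step2(vs: np.ndarray, k: int) -> np.ndarray:
    """Compute vs . B'(k) in O(2^k) by alternating odd-index sums and pairwise halving."""
    u = np.empty(k)
    cur = vs
    for col in range(k - 1, -1, -1):       # fill u from its last (least significant) column backwards
        u[col] = cur[1::2].sum()           # (i) rows whose current lowest bit is 1
        cur = cur[0::2] + cur[1::2]        # (ii) sum each two consecutive elements; the vector halves
    return u

vs = np.arange(1.0, 5.0)                   # for k = 2: one entry per row 00, 01, 10, 11 of B'(2)
print(rsr_pp_step2(vs, 2), vs @ np.array([[0, 0], [0, 1], [1, 0], [1, 1]]))   # both print [7. 6.]
```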
5.3.1 Complexity Analysis
Step (i) is linear in the size of the current vector at each iteration (see Line 5 of Algorithm 3), and Step (ii) also takes time linear in the size of the current vector. Since the vector is halved at every iteration, the overall time for calculating $v_s \cdot B'(k)$ using this subroutine is $O(2^k + 2^{k-1} + \dots + 1) = O(2^k)$.
We can follow the same process as in the analysis of RSR, but this time using RSR++ as a subroutine for Line 5 of Algorithm 2. Based on the same analysis as in Section 5.2.1, the inference time would be $O\!\left(\frac{n^2}{k} + \frac{n \cdot 2^k}{k}\right)$; for $k = \log n$, this results in a running time of $O\!\left(\frac{n^2}{\log n}\right)$.
Theorem 5.4.
The RSR++ algorithm solves Problem 1 with a time complexity of $O\!\left(\frac{n^2}{\log n}\right)$.
5.3.2 The optimal $k$
The process for finding the optimal value of $k$ for RSR++ is the same as the one for RSR, except that the binary search is applied in the range $[1, \log n]$ to optimize the following equation:
$$k^{*} = \underset{1 \le k \le \log n}{\arg\min}\; \frac{n}{k}\left(n + 2^{k}\right). \tag{12}$$
6 Experiments
In this section, we evaluate our proposed algorithms' time and memory performance in real-world scenarios. First, in Section 6.1, we detail the native C++ implementation of both RSR and RSR++, comparing their runtime performance under a fair setting to validate our theoretical guarantees. Next, in Section 6.2, we explore the process of determining the optimal value of $k$, identifying the best $k$ for each input size $n$. Finally, in Section 6.3, we implement and evaluate the algorithms using Python's NumPy package, providing insights into their performance in a more practical and widely used computational environment. Our code is publicly available in our GitHub repository (https://github.com/UIC-InDeXLab/RSR).
6.1 Native Implementation
In this section, we focus on verifying the theoretical foundation of our proposed methods, RSR and RSR++, using native implementations in C++. We selected C++ as the implementation language because, being a compiled language, it is less susceptible to runtime noise, allowing the measured inference time to better reflect the methods' time complexity. As a baseline, we also implemented standard matrix multiplication, which we refer to as Standard throughout this section.
For input, we assume access to a binary weight matrix of size $n \times n$ that is available during preprocessing. At inference time, given an input vector $v$ of size $n$, our aim is to compute the vector-matrix product, with both the matrix and the vector values generated randomly. These parameters align with the typical sizes of weight matrices and input vectors commonly encountered in deep learning and large language model (LLM) inference tasks. We then measure inference time across varying values of $n$ for each method. We also utilize the optimal value of $k$ for each $n$, which is discussed in detail in the next section.
Figure 5 presents the inference times for these methods. Notably, our proposed algorithms achieve up to a 29x speedup over the Standard baseline for the largest matrix sizes tested, representing a substantial improvement that could significantly impact DNN and LLM architectures, enabling them to utilize binary and ternary weights more effectively.
Figure 6 illustrates the difference in inference time between RSR++ and RSR for varying $n$. In this figure, we can see a notable improvement when using RSR++ in Step 2 of the inference time (see Algorithm 3). The improvement percentage is computed as $(t_{\mathrm{RSR}} - t_{\mathrm{RSR++}}) / t_{\mathrm{RSR}} \times 100$.


6.2 Optimization
As outlined in Section 5.2.2, the optimal value of $k$ for RSR (resp. RSR++) is determined through a binary search over the interval $[1, \log n - \log\log n]$ (resp. $[1, \log n]$). Once the optimal $k$ is identified for each $n$, it is then applied in the subsequent experiments. Figures 7(a) and 7(b) show the running time with respect to different $k$ values. As $n$ increases, the optimal value of $k$ also increases.


6.3 Matrix Multiplication Using NumPy



In this section, we investigate the practical performance improvements achieved by RSR, extending beyond native simulation environments. To this end, we use NumPy, a state-of-the-art library for matrix multiplication upon which numerous high-performance computing libraries are based. We implement RSR in Python using NumPy’s built-in functionalities and compare the inference time and memory consumption between RSR and NumPy’s np.dot() function (referred to as NumPy).
6.3.1 Inference Time
As in previous sections, let $W$ denote our weight matrix and $v$ the input vector, where our goal is to compute $v \cdot W$. We run the experiments on both binary and ternary matrices. The weight matrix is available prior to inference time, so we can run preprocessing on it. For the parameter $k$, we use the optimal values previously determined in Section 6.2.
We start with a small matrix size and double $n$ in each experiment up to the largest size tested. Each multiplication is executed four times, with results averaged to mitigate noise. Figures 8(a) and 8(b) present results for matrix multiplication using both the naive NumPy method and RSR for binary and ternary weight matrices, respectively. As shown, RSR achieves up to a 24x speedup over the NumPy baseline on the largest binary matrices, illustrating that our algorithm not only meets its theoretical guarantees but also outperforms state-of-the-art practical methods.
6.3.2 Memory consumption
RSR requires a preprocessing step to compute permutations and segments, and only these two structures need to be stored to carry out inference with the final weights. This preprocessing enables substantial compression. Figure 9 illustrates the memory consumption in megabytes for NumPy and RSR. At the peak of the preprocessing procedure, both the weight matrix and the segment data are stored in memory (indicated by the green line); however, after computation, only the segmentation lists and permutations are kept, and the original matrix is cleared from memory. For the largest matrices tested, the storage required is reduced to less than 17% of the original matrix size (a 5.99x improvement), highlighting the benefits of this approach for large-scale storage and data transfer.
In summary, our algorithm RSR provides practical advantages for deploying large models. For instance, companies training new large language models (LLMs) could preprocess their weights to release only the final segments, permutations, and the optimal parameter $k$. This would result in up to 24x faster inference times and up to 5.99x memory reduction, significantly easing both storage and transfer demands.
7 Discussion
In this section, we discuss some of the advantages and limitations of our proposed algorithms and outline potential directions for future research in this area.
7.1 Advantages
• Our RSR++ algorithm shows up to a 29x improvement in inference time and up to a 6x reduction in memory usage, as observed in our experiments. Note that, following our complexity analyses, these factors grow with the size of the network layers. This significant improvement indicates that the algorithm can support larger hidden layers and matrices to enhance model accuracy while requiring fewer computational resources.
• The efficiency of our algorithm leads to reduced energy consumption, further contributing to the model's practicality and sustainability.
• Our approach enables the deployment of more advanced models on ordinary devices, such as personal computers, with limited memory and processing power.
• By processing one column block at a time, the algorithm only requires memory equivalent to the size of a single block, which is significantly less than that needed for full matrix multiplication.
• Last but not least, our method is deployable on top of any of the current or future ternary models, without requiring any fine-tuning or re-training of the LLM or DNN. Once a model is trained, our preprocessing step can be applied, allowing the preprocessed model to be utilized at any future point.
7.2 Limitations
• Our algorithms can be parallelized by executing block products concurrently, as each block product in Step 2 of the inference time is independent. Additionally, further parallelization can be achieved by applying the block product simultaneously on different columns of the binary-encoded matrix $B'(k)$. However, RSR++ is not easily parallelizable due to the sequential nature of the operations applied to the vector in Step 2. Future work could explore leveraging specific hardware instructions for blocking, permutations, segmentations, and block products to further optimize these steps at the hardware level.
• The preprocessing time of our algorithms is $O(n^2)$ per weight matrix, and this step must be applied to all weight matrices of an LLM or DNN. Although this process can be time-intensive, it is performed only once for a trained model.
7.3 Generalization
Our proposed algorithms are applicable to any binary or ternary matrix of size $n \times m$. Additionally, they can be extended to any $b$-bit quantized matrix or quantized model. The approach begins by expressing the $b$-bit matrix as the sum of binary matrices. Each of these binary matrices is processed individually, and the results are combined at the end. The decomposition of the $b$-bit matrix into binary matrices can be obtained by applying Lemma 3.1 recursively to the matrix.
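As an illustration of one such decomposition (our own sketch; it uses a bit-plane expansion of a non-negative $b$-bit matrix with power-of-two scaling, rather than the recursive construction mentioned above), the quantized product can be assembled from binary products, each of which could be handled by RSR:

```python
import numpy as np

def bitplane_decompose(W: np.ndarray, b: int):
    """Write a non-negative b-bit integer matrix as sum_j 2^j * B_j with binary matrices B_j."""
    return [((W >> j) & 1).astype(np.int8) for j in range(b)]

rng = np.random.default_rng(1)
W = rng.integers(0, 2 ** 4, size=(8, 8))      # a hypothetical 4-bit quantized weight matrix
v = rng.standard_normal(8)
planes = bitplane_decompose(W, 4)
result = sum((2 ** j) * (v @ Bj) for j, Bj in enumerate(planes))   # each v @ Bj could use RSR
assert np.allclose(result, v @ W)
```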
8 Conclusion
This paper presented algorithms that significantly improve inference time and memory efficiency for 1.58-bit LLMs with ternary weights. By preprocessing weight matrices to create indices, our approach reduces the storage complexity for maintaining the model weights by a logarithmic factor. It also enhances inference efficiency, achieving a time complexity of $O(n^2/\log n)$.
In addition to the theoretical guarantees, our experimental results demonstrated the practical advantages of our methods, with observed reductions in inference time of up to 29x and memory usage of up to 6x, underscoring the potential of our approach to make LLMs more accessible and cost-effective.
References
- AI21 Labs (2024) AI21 Labs. Ai21 studio, 2024. URL https://www.ai21.com/studio. Accessed: 2024-11-09.
- Brown (2020) Brown, T. B. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- Chen et al. (2024) Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., Qiao, Y., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062, 2024.
- Das et al. (2024) Das, B. C., Amini, M. H., and Wu, Y. Security and privacy challenges of large language models: A survey. arXiv preprint arXiv:2402.00888, 2024.
- Finlayson et al. (2024) Finlayson, M., Ren, X., and Swayamdipta, S. Logits of api-protected llms leak proprietary information. arXiv preprint arXiv:2403.09539, 2024.
- Guo et al. (2024) Guo, H., Brandon, W., Cholakov, R., Ragan-Kelley, J., Xing, E. P., and Kim, Y. Fast matrix multiplications for lookup table-quantized llms. arXiv preprint arXiv:2407.10960, 2024.
- Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Hubara et al. (2018) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2018.
- Hugging Face (2023) Hugging Face. Autotrain by hugging face. Accessed: 2024-11-09, 2023. https://huggingface.co/autotrain.
- Ma et al. (2024) Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., and Wei, F. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
- Moons et al. (2017) Moons, B., Goetschalckx, K., Van Berckelaer, N., and Verhelst, M. Minimum energy quantized neural networks. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pp. 1921–1925. IEEE, 2017.
- OpenAI (2024) OpenAI. Models, 2024. URL https://platform.openai.com/docs/models. Accessed: 2024-11-09.
- Pearce et al. (2023) Pearce, H., Tan, B., Ahmad, B., Karri, R., and Dolan-Gavitt, B. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP), pp. 2339–2356. IEEE, 2023.
- Tech (2023) Tech, W. G. The journey of openai gpt models. https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2, 2023. Accessed: 2024-11-03.
- Tsai (2020) Tsai, A. Explaining gpt-3 architecture and working. https://medium.com/@tsaiabhi.cool/explaining-gpt-3-architecture-and-working-d0219c79202c, 2020. Accessed: 2024-11-03.
- Wang et al. (2023) Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y., and Wei, F. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.
- Wang et al. (2024) Wang, L., Ma, L., Cao, S., Zhang, Q., Xue, J., Shi, Y., Zheng, N., Miao, Z., Yang, F., Cao, T., et al. Ladder: Enabling efficient Low-Precision deep learning computing through hardware-aware tensor transformation. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 307–323, 2024.
- Yao et al. (2024) Yao, Y., Duan, J., Xu, K., Cai, Y., Sun, Z., and Zhang, Y. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, pp. 100211, 2024.
- Yuan et al. (2024) Yuan, Z., Zhou, R., Wang, H., He, L., Ye, Y., and Sun, L. Vit-1.58 b: Mobile vision transformers in the 1-bit era. arXiv preprint arXiv:2406.18051, 2024.