
An Efficient Matrix Multiplication Algorithm for Accelerating Inference in Binary and Ternary Neural Networks

Mohsen Dehghankar    Mahdi Erfanian    Abolfazl Asudeh
Abstract

Despite their tremendous success and versatility, Large Language Models (LLMs) suffer from inference inefficiency while relying on advanced computational infrastructure.

To address these challenges and make LLMs more accessible and cost-effective, in this paper, we propose algorithms to improve the inference time and memory efficiency of 1.58-bit LLMs with ternary weight matrices.

Particularly focusing on matrix multiplication as the bottleneck operation of inference, we observe that, once trained, the weight matrices of a model no longer change. This allows us to preprocess these matrices and create indices that help reduce the storage requirements by a logarithmic factor while enabling our efficient inference algorithms. Specifically, for an $n \times n$ weight matrix, our efficient algorithm guarantees a time complexity of $O(\frac{n^{2}}{\log n})$, a logarithmic factor improvement over the standard $O(n^{2})$ vector-matrix multiplication.

Besides theoretical analysis, we conduct extensive experiments to evaluate the practical efficiency of our algorithms. Our results confirm the superiority of the approach both with respect to time and memory, as we observed a reduction in inference time of up to 29x and memory usage of up to 6x. Our code is publicly available on GitHub: https://github.com/UIC-InDeXLab/RSR.


1 Introduction

Large Language Models (LLMs) have achieved remarkable success and demonstrated versatility across a wide range of domains, yet they encounter significant challenges related to inference inefficiency. These models demand substantial computational resources, including specialized, costly GPUs with ample memory to achieve real-time inference. This inefficiency leads to slower response times, elevated energy consumption, higher operational expenses, and, perhaps more importantly, limited accessibility for everyday users who lack access to advanced computational infrastructure.

Given these limitations, current deployments of LLMs on typical consumer devices rely predominantly on API calls to powerful, remote servers (OpenAI, 2024; AI21 Labs, 2024; Hu et al., 2021; Hugging Face, 2023). While this approach enables users to leverage LLMs without needing advanced hardware, it introduces additional costs and delays due to network dependency, along with potential privacy concerns stemming from data transmitted to and processed by external servers (Yao et al., 2024; Pearce et al., 2023; Das et al., 2024; Finlayson et al., 2024).

Consequently, optimizing inference time and memory efficiency on standard, widely available hardware has become essential to make LLMs more practical and accessible for broader, real-world applications.

To that end, recent efforts have focused on quantizing the weights of LLMs, and more generally, Deep Neural Networks (DNNs), to enhance their computational efficiency and reduce energy consumption (Moons et al., 2017; Hubara et al., 2018; Wang et al., 2023; Chen et al., 2024; Ma et al., 2024). Particularly, limiting the weights to ternary values $\{-1,0,1\}$ in 1.58-bit LLMs has been demonstrated to preserve accuracy comparable to that of general LLMs, thereby offering a more effective alternative for inference tasks (Ma et al., 2024).

Expanding on recent advancements of ternary-weight LLMs (and DNNs), in this paper, we propose algorithms that improve their inference time and memory efficiency. Our approach makes deploying these models viable on a broader range of devices, including those with limited computational power, ultimately making these tools more widely accessible and cost-effective.

Specifically, viewing the inference process as a sequence of multiplications of activation vectors by ternary weight matrices, we make a critical observation: once the model is trained, the weight matrices remain fixed and do not change.

Therefore, focusing on matrix multiplication as the bottleneck operation, we preprocess the weight matrices of the trained model and create indices that enable efficient multiplication during the inference time. At a high level, the indices are bucketized permutation lists. Interestingly, by replacing each weight matrix with its preprocessed indices, our approach reduces the space complexity required for storing the weights by a logarithmic factor.

We propose two algorithms for efficiently multiplying input vectors by the preprocessed weight matrices. At a high level, our algorithms transform each ternary matrix into two binary matrices (note that our algorithms readily apply to DNNs with binary weight matrices, including BitNet (Wang et al., 2023)), while partitioning each matrix into a set of column blocks. Then, permuting the rows based on a lexical order, they compute a set of aggregate values that contribute to different elements of the result vector. This process guarantees a time complexity of $O(\frac{n^{2}}{\log(n)-\log\log(n)})$ for $n\times n$ matrices in our first algorithm. Furthermore, by introducing an efficient approach for writing the aggregate values into the result vector, our second algorithm improves the time complexity to $O(\frac{n^{2}}{\log(n)})$. The run-time of our algorithms further improves by fine-tuning their blocking parameter.

Many widely used LLMs are characterized by substantial weight matrix sizes. For instance, the matrix size of GPT-3 is 12,288 ($\approx 2^{13}$) (Tsai, 2020; Tech, 2023; Brown, 2020), and this value is even greater for GPT-4. Consequently, achieving even a logarithmic factor improvement can have a significant impact, potentially resulting in up to a 13x reduction in inference time for models such as GPT-3.

In addition to theoretical analysis, we perform rigorous experiments to evaluate the time and memory consumption of our algorithms in practice. Confirming our theoretical findings, our experiments demonstrate the superiority of our algorithms over the standard $O(n^{2})$ multiplication approach. In particular, our algorithms reduced the inference time by up to 29x and memory usage by up to 6x.

1.1 Summary of Contribution

In summary, our contributions are as follows:

  • (Section 3) We provide a formal definition of the vector-ternary-matrix multiplication problem and demonstrate its reduction to the vector-binary-matrix multiplication problem.

  • (Section 4) Following the observation that weight matrices are fixed once a model is trained, we preprocess them and construct indices that enable the development of our efficient algorithms while reducing the memory requirements by a logarithmic factor.

  • (Section 5) We introduce the RSR algorithm with a time complexity of $O\left(\frac{n^{2}}{\log(n)-\log(\log(n))}\right)$ for vector-binary-matrix multiplication for $n$ by $n$ matrices. We further optimize this algorithm and introduce RSR++, which achieves a faster running time of $O(\frac{n^{2}}{\log(n)})$.

  • (Sections 6 and 7) We conduct various experiments to demonstrate the applicability of our algorithms for matrix multiplication using different implementation configurations. We show that we can achieve up to 29x faster run time and 6x less space usage on the matrix multiplication task. Finally, we discuss the advantages and limitations of our algorithms in Section 7.

2 Background

1.58-bit LLMs limit their weights to ternary values $\{-1,0,1\}$ (Ma et al., 2024). Generally speaking, ternary weights can be considered for a wide range of DNNs (Moons et al., 2017; Hubara et al., 2018; Yuan et al., 2024), significantly enhancing the computational efficiency of these large-scale models.

A trained 1.58-bit model consists of a collection of ternary weights used for inference. More precisely, each neural layer (see Figure 1) employed by these models incorporates ternary weights. The weights are learned during the training phase and are fixed at inference time. During inference, given an input vector, the feedforward process can be viewed as a series of ternary matrix multiplications applied to the input vector to produce the final output. For instance, consider the two consecutive layers shown in the multi-layer perceptron in Figure 1. Each ternary weight matrix $W^{i}$ represents the edge weights between layers $(i-1)$ and $i$, with layer 0 being the input vector. The feedforward step is a chain of matrix products and activation functions that maps the input vector $\vec{x}$ to the output vector $\vec{y}$; it can be represented as follows (with $\vec{a}^{0}=\vec{x}$), where $f$ is the activation function (e.g., sigmoid or ReLU).

$$\vec{z}^{k}=\vec{a}^{k-1}W^{k}, \qquad \vec{a}^{k}=f(\vec{z}^{k})$$

Figure 2 visualizes the bottleneck operation of the inference process: the multiplication of each activation vector $\vec{a}^{k-1}$ by the weight matrix $W^{k}$ to generate the vector $\vec{z}^{k}$.
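To make this concrete, the following minimal NumPy sketch (ours, with illustrative names and random ternary weights rather than trained ones) runs the feedforward chain described above:

```python
import numpy as np

# Illustrative sketch: a feedforward pass through ternary weight matrices,
# z^k = a^{k-1} W^k and a^k = f(z^k). Weights are random here; in a trained
# 1.58-bit model they would be the learned ternary matrices.

def f(z):
    return np.maximum(z, 0.0)  # activation function, e.g., ReLU

rng = np.random.default_rng(0)
n = 8
weights = [rng.integers(-1, 2, size=(n, n)) for _ in range(3)]  # W^1, W^2, W^3

a = rng.standard_normal(n)   # a^0 = x, the input vector
for W in weights:
    a = f(a @ W)             # the vector-matrix product is the bottleneck
y = a                        # output vector
```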

Building on this background, our work focuses on accelerating the bottleneck operation, the vector-ternary-matrix product. By improving the speed of this operation, we proportionally accelerate the entire inference process of a quantized LLM (or, more generally, a neural network), as each matrix product in the feedforward procedure is computed sequentially until the output layer. We reformulate this problem as a vector-binary-matrix multiplication, thereby enabling compatibility with both binary and ternary matrices.

Previous research has explored enhancing matrix multiplication through hardware-level accelerations  (Guo et al., 2024; Wang et al., 2024). In contrast, our approach focuses on designing an algorithm that provides a guaranteed logarithmic improvement over naive matrix multiplication at the application layer, independent of specific hardware architectures.

In the subsequent sections, we concentrate on vector-ternary-matrix multiplication, which constitutes the primary focus of this study.

3 Preliminaries

In this section, we formally build the notation required for the following sections, followed by a formulation of the problem we address in this work.

3.1 Notation

We denote vectors by $\vec{v}$ and matrices by capital letters. For example, $A\in E^{n\times m}$ is a matrix of size $n\times m$ with elements belonging to a set of permissible values, $E$. Specifically, if $E=\{0,1\}$ (resp. $E=\{-1,0,1\}$), then $A\in E^{n\times m}$ is called a binary (resp. ternary) matrix.

Figure 1: A feedforward neural network with $\ell=3$ layers and $n_{i}$ nodes per hidden layer.

Let $\Sigma_{n}$ denote the set of all bijective permutation functions $\sigma:\{1,2,\dots,n\}\rightarrow\{1,2,\dots,n\}$. For a vector $\vec{v}$, the permuted vector under $\sigma$ is represented as $\pi_{\sigma}(\vec{v})$, where each element is repositioned such that $\pi_{\sigma}(\vec{v})[i]=\vec{v}[\sigma(i)]$, moving element $\sigma(i)$ of $\vec{v}$ to position $i$ of the permuted vector. We extend $\pi_{\sigma}$ to matrices by permuting their rows. When $\sigma$ is clear from the context, we may simplify $\pi_{\sigma}$ to $\pi$. Let $\mathcal{L}^{\mathbb{N}}_{n}$ denote the set of all ordered lists of length $n$ with entries in $\mathbb{N}$, and let $\mathcal{L}^{\mathbb{N}}_{\leq n}$ represent the set of sorted lists of length at most $n$. We use $A[:,j]$ to indicate the $j$-th column of a matrix $A$. Similarly, we use $A[i,:]$ to refer to the $i$-th row of the matrix $A$ and $A[i,j]$ to indicate a single element.

Figure 2: Illustration of vector-to-matrix multiplication as the core inference operation.

3.2 Problem Formulation

Given a vector $\vec{v}\in\mathbb{R}^{n}$ and a ternary matrix $A\in\{-1,0,1\}^{n\times n}$, our objective is to efficiently compute the product $\vec{v}\cdot A$. In this setting, the vector $\vec{v}$ serves as the activation output from the previous layer of the neural network, and $A$ is the weight matrix (see Figures 1 and 2).

Here, we focus on a single multiplication instance and aim to accelerate this operation beyond the standard naive vector-matrix multiplication complexity of $O(n^{2})$. We initially focus on square matrices of size $n\times n$ for clarity in exposition. However, our approach is applicable more broadly and extends to general matrices of dimensions $n\times m$, as discussed in Section 7.

A key observation is that the weight matrices are fixed after a model is trained, while the vector $\vec{v}$ is provided at inference time. This allows us to preprocess each weight matrix $A$ to enable a faster vector-matrix product, and hence more efficient inference.

Problem 1 (Vector-Ternary-Matrix Product).

Given an input vector $\vec{v}\in\mathbb{R}^{n}$ and a pre-defined ternary matrix $A\in\{-1,0,1\}^{n\times n}$, compute the product $\vec{v}\cdot A$.

3.3 Solution Overview

Initially, we use the following lemma to reduce our problem to a Vector-Binary-Matrix product.

Lemma 3.1.

Any ternary matrix $A$ can be expressed as $A=B^{(1)}-B^{(2)}$, where $B^{(1)}$ and $B^{(2)}$ are binary matrices.

Proof.

We can build $B^{(1)}$ and $B^{(2)}$ according to the following equation:

$$B^{(1)}[i,j]=\begin{cases}0&\text{if }A[i,j]\in\{-1,0\},\\ 1&\text{if }A[i,j]=1\end{cases}
\qquad
B^{(2)}[i,j]=\begin{cases}0&\text{if }A[i,j]\in\{0,1\},\\ 1&\text{if }A[i,j]=-1\end{cases} \tag{1}$$
∎
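As a quick illustration, the following NumPy sketch (ours; the example matrix is made up) builds $B^{(1)}$ and $B^{(2)}$ from a ternary matrix and checks the identity of Lemma 3.1:

```python
import numpy as np

def decompose_ternary(A):
    """Lemma 3.1: split a ternary matrix into two binary matrices, A = B1 - B2."""
    B1 = (A == 1).astype(np.int8)    # 1 exactly where A[i, j] = +1
    B2 = (A == -1).astype(np.int8)   # 1 exactly where A[i, j] = -1
    return B1, B2

A = np.array([[1, -1, 0],
              [0,  1, -1],
              [-1, 0,  1]])
B1, B2 = decompose_ternary(A)
assert np.array_equal(B1 - B2, A)
# Hence v . A can be computed as (v . B1) - (v . B2).
```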

As a result, we focus on solving the following problem.

Problem 2 (Vector-Binary-Matrix Product).

Given an input vector $\vec{v}\in\mathbb{R}^{n}$ and a pre-defined binary matrix $B\in\{0,1\}^{n\times n}$, compute the product $\vec{v}\cdot B$.

As a result of Lemma 3.1, any time guarantee established for this reduced version applies equivalently to the original problem up to a constant factor.

We introduce the Redundant Segment Reduction (RSR) algorithm, which optimizes vector-matrix multiplication by identifying and reusing redundant computations. The RSR algorithm begins by partitioning the matrix into smaller blocks, each with $n$ rows but a reduced number of columns.

Within each block, similar segments across different columns are identified. We will calculate the dot product for each unique segment only once. This approach reduces the time for computing the product of the vector and each matrix block. To maximize the length of similar segments across columns, we also apply a permutation to the matrix blocks. In the final step, we aggregate the results of the vector’s product with each matrix block to obtain the overall product. An overview of this approach is illustrated in Figure 3.

In Section 4, we present the preprocessing phase of our algorithm, detailing the construction of essential data structures utilized during inference. Subsequently, in Section 5, we describe the inference-time operations, illustrating how our algorithm accelerates the multiplication process, achieving a logarithmic improvement by reducing the $O(n^{2})$ time complexity of standard vector-matrix multiplication to $O\left(\frac{n^{2}}{\log(n)}\right)$.

4 Preprocessing: Index Construction

Figure 3: A visualization of the Redundant Segment Reduction method: the calculation of $\vec{v}\cdot B$. In this example, $k=2$.

In this section, we outline the preprocessing phase of our approach, during which we construct the essential initial data structures for the matrix $B$ in Problem 2. These structures are designed to optimize and accelerate the multiplication process at inference time (explained in Section 5).

Given a ternary matrix $A=W^{i}$, representing the weights between two layers of a ternary NN, we first present it as the difference of two binary matrices, $A=B^{(1)}-B^{(2)}$, as described in Lemma 3.1. Let $B$ be either $B^{(1)}$ or $B^{(2)}$. At a high level, during preprocessing, the RSR algorithm partitions the columns of $B$ into a set of column blocks, associating with each block a row permutation as its index to identify the longest common segments across the columns. This permutation is used later at inference time for efficient vector-to-matrix multiplication.

The space complexity for representing a matrix $B$ is $O(n^{2})$. Interestingly, as we shall explain in § 4.5, the space complexity of our indices is $O(\frac{n^{2}}{\log n})$, a logarithmic-factor reduction in the space required for representing $B$.

4.1 Step 1: Column Blocking

The first step of our preprocessing is Column Blocking: the process of partitioning consecutive columns of the original binary matrix $B$ to construct a set of smaller, compact matrices, each representing a distinct block of columns.

Definition 4.1 (kk-Column Block).

Let $B\in\mathbb{R}^{m\times n}$ and $i\in\{1,2,\cdots,\lceil\frac{n}{k}\rceil\}$. We define $B^{[k]}_{i}$ as an $m\times k$ matrix containing columns $(i-1)\cdot k+1$ through $\min(i\cdot k,n)$. This construction yields a series of submatrices $B^{[k]}_{i}$ that partition $B$ into $\lceil\frac{n}{k}\rceil$ contiguous column blocks. Each $B^{[k]}_{i}$ is called a $k$-Column Block of $B$.

For example, consider the following binary matrix,

$$B=\begin{bmatrix}0&1&1&1&0&1\\ 0&0&0&1&1&1\\ 0&1&1&1&1&0\\ 1&1&0&0&1&0\\ 0&0&1&1&0&1\\ 0&0&0&0&1&0\end{bmatrix}$$

The $2$-Column Blocks of $B$ are as follows:

$$B^{[2]}_{1}=\begin{bmatrix}0&1\\ 0&0\\ 0&1\\ 1&1\\ 0&0\\ 0&0\end{bmatrix},\quad B^{[2]}_{2}=\begin{bmatrix}1&1\\ 0&1\\ 1&1\\ 0&1\\ 1&0\\ 0&1\end{bmatrix},\quad B^{[2]}_{3}=\begin{bmatrix}0&1\\ 1&1\\ 1&0\\ 1&0\\ 0&1\\ 1&0\end{bmatrix}$$

When $k$ is clear from the context, we use $B_{i}$ instead of $B^{[k]}_{i}$ to denote the $i$-th block.

4.2 Step 2: Row Permutation

The next step after column blocking is binary row ordering, where the rows of each column block are sorted according to a lexicographic order.

Definition 4.2 (Binary Row Order).

Let $B_{i}\in\mathbb{R}^{m\times k}$ be a binary matrix. For each row $B_{i}[r,:]$, let $B_{i}[r,:]_{2}$ be the binary value corresponding to the concatenation $B_{i}[r,1]\cdots B_{i}[r,k]$. For example, if $B_{i}[r,:]=[1,0,1,1]$, then $B_{i}[r,:]_{2}=(1011)_{2}$. The Binary Row Order of $B_{i}$, denoted as $\pi_{\sigma_{B_{i}}}$, is defined as a permutation on $B_{i}$ such that the rows of $\pi_{\sigma_{B_{i}}}(B_{i})$ are sorted based on their corresponding binary value in ascending order. That is, $\forall r\neq s\leq m$, if $B_{i}[r,:]_{2}<B_{i}[s,:]_{2}$, then $\sigma_{B_{i}}(r)<\sigma_{B_{i}}(s)$. We call $\pi_{\sigma_{B_{i}}}(B_{i})$ the Binary Row Ordered version of $B_{i}$. (In general, any binary matrix following this lexicographic order is called a Binary Row Ordered matrix in this paper.)

To further clarify the Binary Row Order, let us consider the following example, where $B_{i}$ is a block with two columns. The permutation function $\sigma_{B_{i}}=\langle 2,5,6,1,3,4\rangle$ (here $\sigma_{B_{i}}(j)$ is the $j$-th value in $\sigma_{B_{i}}$; for example, $\sigma_{B_{i}}(2)=5$) provides a Binary Row Order $\pi_{\sigma_{B_{i}}}(B_{i})$ of the matrix $B_{i}$.

$$B_{i}=\begin{bmatrix}0&1\\ 0&0\\ 0&1\\ 1&1\\ 0&0\\ 0&0\end{bmatrix}\Longrightarrow\pi_{\sigma_{B_{i}}}(B_{i})=\begin{bmatrix}0&0\\ 0&0\\ 0&0\\ 0&1\\ 0&1\\ 1&1\end{bmatrix} \tag{2}$$

The right-hand side results from $00_{2}<01_{2}<11_{2}$.

We define $Bin_{[k]}$ as a binary $2^{k}\times k$ matrix that has exactly one row for each binary value from $0$ to $2^{k}-1$ and is also Binary Row Ordered. For example,

$$Bin_{[1]}=\begin{bmatrix}0\\ 1\end{bmatrix},\quad Bin_{[2]}=\begin{bmatrix}0&0\\ 0&1\\ 1&0\\ 1&1\end{bmatrix},\quad Bin_{[3]}=\begin{bmatrix}0&0&0\\ 0&0&1\\ 0&1&0\\ 0&1&1\\ 1&0&0\\ 1&0&1\\ 1&1&0\\ 1&1&1\end{bmatrix}$$
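The following short NumPy sketch (ours, not the released implementation) computes the Binary Row Order of a block and constructs $Bin_{[k]}$:

```python
import numpy as np

def binary_row_order(Bi):
    """Permutation sigma (0-based) such that Bi[sigma] is Binary Row Ordered."""
    # np.lexsort uses its last key as the primary key, so feed the columns
    # right-to-left; the first column then acts as the most significant bit.
    return np.lexsort(Bi.T[::-1])

def bin_k(k):
    """The 2^k x k Binary Row Ordered matrix with one row per k-bit value."""
    values = np.arange(2 ** k)[:, None]
    return (values >> np.arange(k - 1, -1, -1)) & 1

Bi = np.array([[0, 1], [0, 0], [0, 1], [1, 1], [0, 0], [0, 0]])
sigma = binary_row_order(Bi)   # [1, 4, 5, 0, 2, 3], i.e., <2,5,6,1,3,4> 1-based
print(Bi[sigma])               # rows sorted as 00, 00, 00, 01, 01, 11
print(bin_k(2))                # [[0,0],[0,1],[1,0],[1,1]]
```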

4.3 Step 3: Segmentation

Next, we segment each binary row ordered matrix (the sorted column blocks) into groups of rows with the same binary value representation, enabling further matrix compaction.

Definition 4.3 (Segmentation List).

We define a function $\mathcal{S}$ such that, given a Binary Row Ordered matrix $B$ of size $m\times k$, $\mathcal{S}(B)$ gives the list of boundary indices specifying the ranges of rows with the same binary value representation.

For example, the segmentation of a column block $B_{i}$ is defined as $\mathcal{S}(\pi_{\sigma_{B_{i}}}(B_{i}))$. For simplicity, we denote this segmentation as $\mathcal{S}(B_{i})$. Specifically, let $\ell=\mathcal{S}(B_{i})[j]$. Then all rows in the range $\big[\ell,\mathcal{S}(B_{i})[j+1]\big)$ have the binary value representation $B_{i}[\ell,:]_{2}$. Note that $\min\{2^{k},m\}$ provides an upper bound on the size of $\mathcal{S}(B_{i})$.

Consider the Binary Row Ordered matrix $\pi_{\sigma_{B_{i}}}(B_{i})$ shown in Example 2. The Segmentation list of this matrix is $[1,4,6]$ because the first row is the beginning of the $00$ rows; then $01$ starts at index $4$, and $11$ starts at index $6$.

For a Binary Row Ordered matrix of dimensions $m\times k$, we extend this concept to define the Full Segmentation as a list of exactly $2^{k}$ elements, where the $j$-th element of the list indicates the first index of a row that has the binary value $j$. For example, the Full Segmentation of the matrix $\pi_{\sigma_{B_{i}}}(B_{i})$ in Example 2 is $[1,4,6,6]$. Since there are no rows for $10$, we assign it the same boundary as the next value. See Table 1 for an illustration.

Row Binary Value Start Index
00 1
01 4
10 6
11 6
Table 1: The Full Segmentation of Example 2. There is no starting index for row $10$, so we skip it by using the same start index as the next available value. The Full Segmentation list is the second column of this table.

From now on, we use the same notation 𝒮\mathcal{S} to denote the Full Segmentation for a Binary Row Ordered matrix.

Proposition 4.4.

For a binary matrix $B_{i}\in\{0,1\}^{m\times k}$ and $j<2^{k}$, $\mathcal{S}(B_{i})[j+1]-\mathcal{S}(B_{i})[j]$ represents the number of rows in $B_{i}$ whose binary value corresponds to $j$. The number of rows for $j=2^{k}$ is $m+1-\mathcal{S}(B_{i})[j]$.

Given the Full Segmentation of a Binary Row Ordered matrix, a compact representation of the matrix can be obtained by retaining only the first index of each unique row value in the segmentation list. In the following sections, for simplicity, we use $L_{i}$ instead of $\mathcal{S}(B_{i})$ to denote the Full Segmentation of the permuted Column Block $B_{i}$.

4.4 Preprocessing Algorithm

Putting all three steps together, the preprocessing algorithm is described in Algorithm 1. The algorithm starts by selecting a parameter $k$ such that $k\leq\log_{2}(n)$. Next, it computes the $k$-Column Blocks of $B$. Hereafter, we denote these blocks as $B_{i}$ rather than $B^{[k]}_{i}$ for $i\leq\lceil\frac{n}{k}\rceil$. For each $B_{i}$, the algorithm proceeds to calculate both the Binary Row Order $\sigma_{B_{i}}$ and the Full Segmentation $L_{i}$.

Figure 3 (Preprocessing – the left figure) provides a visual example of the preprocessing steps.

Algorithm 1 Preprocess
1: Input: binary matrix $B$ and a parameter $k$
2: Output: permutations $\sigma_{B_{i}}$ and segmentations $L_{i}=\mathcal{S}(B_{i})$
3: Create $k$-Column Blocks $B_{i}$  ▷ Blocking
4: for each block $B_{i}$ do
5:   $\sigma_{B_{i}}\leftarrow$ Binary Row Order of $B_{i}$  ▷ Permutation
6:   $L_{i}\leftarrow$ Segments of $\pi_{\sigma_{B_{i}}}(B_{i})$  ▷ Segmentation
7: end for
8: Return: $\{\sigma_{B_{i}}\mid\forall i\}$ and $\{L_{i}\mid\forall i\}$
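A compact NumPy sketch of Algorithm 1 (ours, not the authors' released code; permutations are stored 0-based and Full Segmentations 1-based, matching the convention of the examples above):

```python
import numpy as np

def preprocess(B, k):
    """Return per-block Binary Row Orders (0-based) and Full Segmentations (1-based)."""
    perms, segs = [], []
    for start in range(0, B.shape[1], k):              # Step 1: k-Column Blocks
        Bi = B[:, start:start + k]
        kb = Bi.shape[1]
        vals = Bi @ (1 << np.arange(kb - 1, -1, -1))   # binary value of each row
        sigma = np.argsort(vals, kind="stable")        # Step 2: Binary Row Order
        # Step 3: Full Segmentation: first (1-based) index of each value 0..2^kb-1
        Li = np.searchsorted(vals[sigma], np.arange(2 ** kb)) + 1
        perms.append(sigma)
        segs.append(Li)
    return perms, segs

B = np.array([[0, 1, 1, 1, 0, 1], [0, 0, 0, 1, 1, 1], [0, 1, 1, 1, 1, 0],
              [1, 1, 0, 0, 1, 0], [0, 0, 1, 1, 0, 1], [0, 0, 0, 0, 1, 0]])
perms, segs = preprocess(B, k=2)
print(segs[0])   # [1 4 6 6], the Full Segmentation of Table 1
```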

4.5 Complexity Analysis

To simplify the complexities, in the following we provide the time and space complexities for an $n$ by $n$ weight matrix. Note that storing the weight matrix has a space complexity of $O(n^{2})$.

Time Complexity:

During preprocessing, we read all elements of the input matrix $B$, which requires $O(n^{2})$ time. Blocking the matrix into $k$-Column Blocks has a time complexity linear in the number of blocks, while determining the permutation corresponds to performing a bucket sort on the rows to achieve the lexicographic ordering, which requires $O(n)$. Consequently, the total preprocessing time is $O(n^{2})$. Given that the input matrix size is $O(n^{2})$, this complexity is asymptotically optimal, as any algorithm requires reading the matrix $B$ in $O(n^{2})$.

Output Space Complexity:

The output of the preprocessing phase is a permutation and a segmentation list as the index of each block. Recall that $|L_{i}|=|\mathcal{S}(B_{i})|=2^{k}$, where $k\leq\log_{2}(n)$. Therefore, storing each block index requires $O(n)$ space. The total number of column blocks is $\frac{n}{k}$. Hence, the total space complexity of the preprocessing output is $O(\frac{n^{2}}{k})$. In particular, when $k=\log(n)$, the space complexity is $O(\frac{n^{2}}{\log(n)})$. As we shall explain in § 5, the inference algorithms only require access to the block indices. Therefore, by replacing the weight matrices with the block indices, we reduce the space complexity by a logarithmic factor.

Theorem 4.5.

Let a ternary weight matrix $A=W^{\ell}$ be expressed using the binary matrices $B^{(1)}$ and $B^{(2)}$ as $A=B^{(1)}-B^{(2)}$. Then representing $A$ using the block indices of $B^{(1)}$ and $B^{(2)}$ produced by Algorithm 1 requires space of complexity $O\left(\frac{n^{2}}{\log(n)}\right)$.

5 Inference Time: Vector-to-Matrix Multiplication

To simplify the complexity analysis without loss of generality, we assume the weight matrix $W^{\ell}$ is of size $n\times n$. Specifically, given a preprocessed $n\times n$ binary matrix $B$, we aim to efficiently compute the product $\vec{v}\cdot B$ when a vector $\vec{v}\in\mathbb{R}^{n}$ is provided at inference time (see the right side of Figure 2).

5.1 Segmented Sum Computation

Recall that during preprocessing, for each column block $B_{i}$, we construct a permutation function $\sigma_{B_{i}}$ and a full segmentation list $L_{i}$ as its index.

At inference time, given a vector $\vec{v}$, our objective is to efficiently compute the multiplication of $\vec{v}$ by each column block $B_{i}$, using $(\sigma_{B_{i}},L_{i})$.

Our first step is computing the Segmented Sums over the permuted vector $\pi_{\sigma_{B_{i}}}(\vec{v})$, i.e., the sum over each interval in the permuted vector that corresponds to a group of identical rows in the binary matrix $B_{i}$.

Definition 5.1 (Segmented Sum).

Consider the permutation function $\sigma_{B_{i}}$ and the full segmentation list $L_{i}$ for a column block $B_{i}$. Let $\vec{v}_{\pi}=\pi_{\sigma_{B_{i}}}(\vec{v})$ be the permutation of the vector $\vec{v}\in\mathbb{R}^{n}$. The Segmented Sum of $\vec{v}_{\pi}$ on $L_{i}$ is defined as a vector $SS_{L_{i}}(\vec{v}_{\pi})$ of size $|L_{i}|$ where,

$$SS_{L_{i}}(\vec{v}_{\pi})[j]=\begin{cases}\sum^{L_{i}[j+1]-1}_{k=L_{i}[j]}\vec{v}_{\pi}[k]&\text{if }j<|L_{i}|,\\[4pt] \sum^{n}_{k=L_{i}[j]}\vec{v}_{\pi}[k]&\text{if }j=|L_{i}|\end{cases} \tag{3}$$

For example, for the Full Segmentation defined for Example 2 and a given vector $\vec{v}=[3,2,4,5,9,1]$, we have:

$$SS(\vec{v})=[9,14,0,1] \tag{4}$$

where $9=3+2+4$, $14=5+9$, $1=1$, and the third element is $0$ because there is no $10_{2}$ in the segmentation.

Computing the Segmented Sums using Equation 3 requires first computing the permuted vector $\vec{v}_{\pi}$. Instead, to efficiently compute the Segmented Sums in place (without computing the permuted vector), we use Equation 5:

$$SS_{L_{i}}(\vec{v}_{\pi})[j]=SS_{L_{i},\sigma_{B_{i}}}(\vec{v})[j]=\begin{cases}\sum^{L_{i}[j+1]-1}_{k=L_{i}[j]}\vec{v}\left[\sigma_{B_{i}}(k)\right]&\text{if }j<|L_{i}|,\\[4pt] \sum^{n}_{k=L_{i}[j]}\vec{v}\left[\sigma_{B_{i}}(k)\right]&\text{if }j=|L_{i}|\end{cases} \tag{5}$$
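A small NumPy sketch of Equation (5) (ours; it assumes a 1-based Full Segmentation list, as in the examples above):

```python
import numpy as np

def segmented_sum(v, sigma, Li):
    """Compute SS_{L_i, sigma_{B_i}}(v) without materializing the permuted vector."""
    bounds = np.append(Li - 1, len(v))          # 0-based segment boundaries
    out = np.zeros(len(Li))
    for j in range(len(Li)):
        idx = sigma[bounds[j]:bounds[j + 1]]    # original positions sigma(k)
        out[j] = v[idx].sum()
    return out

v = np.array([3, 2, 4, 5, 9, 1])
Li = np.array([1, 4, 6, 6])
# The paper's Equation (4) example applies the segmentation to v directly,
# which corresponds to an identity permutation here.
print(segmented_sum(v, np.arange(6), Li))       # [ 9. 14.  0.  1.]
```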

5.2 RSR

We are now ready to describe our algorithm RSR for computing $\vec{v}\cdot B$ at inference time. Given the input vector $\vec{v}$, for each column block $B_{i}$, in Step 1 we use the permutation function $\sigma_{B_{i}}$ and the full segmentation list $L_{i}$ to compute the Segmented Sum $SS_{L_{i},\sigma_{B_{i}}}(\vec{v})$ using Equation 5.

Then, we use Lemma 5.2 to calculate $\vec{v}\cdot B_{i}$.

Lemma 5.2.

$\vec{v}\cdot B_{i}=SS_{L_{i},\sigma_{B_{i}}}(\vec{v})\cdot Bin_{[k]}$.

Proof.

Define $\vec{u}=SS_{L_{i},\sigma_{B_{i}}}(\vec{v})$. First, we show that $|\vec{u}|=2^{k}$, and as a result, the product on the right-hand side is valid. We know that $|SS_{L_{i},\sigma_{B_{i}}}(\vec{v})|=|L_{i}|$ (see Definition 5.1). In addition, based on Definition 4.3, the Full Segmentation list contains exactly one element for each of the $2^{k}$ possible binary values. So $|L_{i}|=|\mathcal{S}(B_{i})|=2^{k}\Longrightarrow|\vec{u}|=2^{k}$.

Now, we focus on a single column of $B_{i}$ (the $j$-th column). We show that $\langle\vec{v}\cdot B_{i}[:,j]\rangle=\langle\vec{u}\cdot Bin_{[k]}[:,j]\rangle$, where $\langle\cdot\rangle$ denotes the dot product of two vectors. Based on Definition 5.1, the $l$-th element of $\vec{u}$ is the $l$-th segment sum of $\pi_{\sigma_{B_{i}}}(\vec{v})$, i.e., the sum of some consecutive elements of $\pi_{\sigma_{B_{i}}}(\vec{v})$. Based on Definition 4.2, this permutation is designed such that identical rows of $B_{i}$ are positioned together. This means that all the elements of $\vec{v}$ that fall in this segment (we denote by $S_{l}$ the set of indices of $\vec{v}$ that lie in the segment) are multiplied by the same binary value in $B_{i}$; call this value $b_{l}$. As a result, we can factor $b_{l}$ out of each segment of $\vec{v}$ in the dot product and rewrite the dot product as:

$$\langle\vec{v}\cdot B_{i}[:,j]\rangle=\sum^{n}_{q=1}\vec{v}[q]\cdot B_{i}[q,j] \tag{6}$$
$$=\sum^{2^{k}}_{l=1}\Big(\sum_{e\in S_{l}}\vec{v}[e]\Big)\cdot b_{l} \tag{7}$$
$$=\sum^{2^{k}}_{l=1}\vec{u}[l]\cdot b_{l} \tag{8}$$

where in the last line we used the fact that $\forall l,\ \vec{u}[l]=\sum_{e\in S_{l}}\vec{v}[e]$.

Each $b_{l}$ is the single binary value in column $j$ of $B_{i}$ that corresponds to the segment $S_{l}$. Based on Definition 4.2, the applied permutation sorts all columns based on the lexicographic order of the rows. This means that $Bin_{[k]}[l,j]=b_{l}$, and as a result,

$$\sum^{2^{k}}_{l=1}\vec{u}[l]\cdot b_{l}=\sum^{2^{k}}_{l=1}\vec{u}[l]\cdot Bin_{[k]}[l,j] \tag{9}$$
$$=\langle\vec{u}\cdot Bin_{[k]}[:,j]\rangle \tag{10}$$

As a result, for all columns $j$, the dot products are equal: $\langle\vec{v}\cdot B_{i}[:,j]\rangle=\langle\vec{u}\cdot Bin_{[k]}[:,j]\rangle$. Consequently, the vector-matrix products are also equal. ∎

Following Lemma 5.2, as the final step at inference time, in Step 2 we calculate the product $SS_{L_{i},\sigma_{B_{i}}}(\vec{v})\cdot Bin_{[k]}$ for each $k$-Column Block $B_{i}$. Algorithm 2 shows the pseudo-code of this algorithm.

Algorithm 2 RSR (Inference Time)
1: Input: vector $\vec{v}\in\mathbb{R}^{n}$, binary matrix $B$
2: Output: result $\vec{r}=\vec{v}\cdot B$ where $\vec{r}\in\mathbb{R}^{n}$
3: for each block $B_{i}$ do
4:   $\vec{u}\leftarrow SS_{L_{i},\sigma_{B_{i}}}(\vec{v})$  ▷ Step 1; Segmented Sum (Eq. 5)
5:   $\vec{r}_{i}\leftarrow\vec{u}\cdot Bin_{[k]}$  ▷ Step 2; Block Product
6: end for
7: $\vec{r}\leftarrow(\vec{r}_{1},\vec{r}_{2},\cdots,\vec{r}_{\lceil\frac{n}{k}\rceil})$  ▷ Concatenate block results
8: Return: $\vec{r}$
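A NumPy sketch of Algorithm 2 (ours; it assumes the block indices $(\sigma_{B_i}, L_i)$ come from the preprocessing sketch above and that $k$ divides $n$, so every block product uses the full $Bin_{[k]}$):

```python
import numpy as np

def bin_k(k):
    values = np.arange(2 ** k)[:, None]
    return (values >> np.arange(k - 1, -1, -1)) & 1

def rsr_multiply(v, perms, segs, k):
    """Compute v . B from the preprocessed block indices (Algorithm 2)."""
    Bk = bin_k(k)
    parts = []
    for sigma, Li in zip(perms, segs):
        bounds = np.append(Li - 1, len(v))
        u = np.array([v[sigma[bounds[j]:bounds[j + 1]]].sum()   # Step 1: Eq. (5)
                      for j in range(len(Li))])
        parts.append(u @ Bk)                                    # Step 2: block product
    return np.concatenate(parts)

# With B, perms, segs from the preprocessing sketch and a random v of size n,
# np.allclose(rsr_multiply(v, perms, segs, k=2), v @ B) should hold.
```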

5.2.1 Complexity Analysis

For each column block $B_{i}$, Step 1 takes $O(n)$ since it makes a pass over each element of $\vec{v}$ exactly once. Step 2 computes the product of a vector of size $2^{k}$ with a matrix of dimensions $2^{k}\times k$, resulting in a time complexity of $O(k\cdot 2^{k})$.

The entire process is applied to $\frac{n}{k}$ $k$-Column Blocks. Consequently, the total query time is $O\left(\frac{n}{k}(n+k\cdot 2^{k})\right)$. For any selection of $k$ with $k\cdot 2^{k}\leq n$, the overall running time can be written as $O\left(\frac{n^{2}}{k}\right)$.

Specifically, setting $k=\log(\frac{n}{\log(n)})$ results in a time complexity of $O\left(\frac{n^{2}}{\log(\frac{n}{\log(n)})}\right)=O\left(\frac{n^{2}}{\log(n)-\log(\log(n))}\right)$.

Theorem 5.3.

The RSR algorithm solves Problem 1 with a time complexity of $O\left(\frac{n^{2}}{\log(n)-\log(\log(n))}\right)$.

5.2.2 The optimal kk

While we could prove Theorem 5.3 by choosing $k=\log(\frac{n}{\log(n)})$, this choice might not be optimal, i.e., it may not minimize the run-time of the algorithm.

The optimal value of $k$ can be found using Equation 11.

$$k^{opt}=\underset{k\in[1,\,\log(n)-\log(\log(n))]}{\text{argmin}}\ \frac{n}{k}(n+k2^{k}) \tag{11}$$

To find the optimal value of $k$ based on Equation 11, we apply a binary search on $k$ in the range $[1,\log(n)-\log(\log(n))]$. In § 6, we run experiments to find the optimal value of $k$ and to evaluate the impact of choosing different values for it.
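For illustration, a direct sketch (ours) of Equation (11): enumerate the candidate range and take the minimizer of the cost model; a binary search over the same range can replace the enumeration when the cost is unimodal.

```python
import math

def optimal_k_rsr(n):
    """Minimize (n/k)(n + k*2^k) over k in [1, log(n) - log(log(n))]."""
    max_k = max(1, int(math.log2(n / math.log2(n))))
    cost = lambda k: (n / k) * (n + k * 2 ** k)
    return min(range(1, max_k + 1), key=cost)

print(optimal_k_rsr(2 ** 13))   # e.g., for a GPT-3-scale layer width
```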

5.3 RSR++

Having presented RSR, we now present our faster algorithm RSR++, which improves Step 2 of RSR. This step involves computing the product of a vector $\vec{u}$ of size $2^{k}$ and the binary matrix $Bin_{[k]}$ of size $2^{k}\times k$ (Line 5 of Algorithm 2).

The standard vector-matrix multiplication in this case takes $O(k\cdot 2^{k})$ time. However, the special structure of the matrix $Bin_{[k]}$ allows us to reduce it to $O(2^{k})$. Algorithm 3 shows the pseudo-code for this multiplication; see Figure 4 for a visualization of this approach.

Figure 4: RSR++ visualization at inference time. The difference is in the block product computation in Step 2. See Algorithm 3.

Let $\vec{r}$ denote the final result of this product, $\vec{u}\cdot Bin_{[k]}$. We build $\vec{r}$ starting from the $k$-th element down to the first one. (i) The $k$-th element can be calculated by summing up the elements of $\vec{u}$ with odd indices. For example, in the RSR++ visualization in Figure 4, the sum of the odd-index elements (orange cells) in the top row is 23. Then, (ii) we calculate a vector $\vec{x}$ of size $\frac{|\vec{u}|}{2}$ by summing each two consecutive elements of $\vec{u}$. Now the $(k-1)$-th element of $\vec{r}$ can be calculated by applying the same process as in (i), this time on $\vec{x}$ (see rows 2 and 3 of the RSR++ visualization in Figure 4). We continue by repeating steps (i) and (ii) to build all $k$ elements of $\vec{r}$.

5.3.1 Complexity Analysis

Step (i) is linear in the size of $\vec{x}$ at each iteration (see Line 5 of Algorithm 3). As a result, its total running time is $\sum^{1}_{i=k}O(2^{i})=O(2^{k})$. Step (ii) also takes time linear in the size of $\vec{x}$. Consequently, calculating $\vec{u}\cdot Bin_{[k]}$ using this subroutine takes $O(2^{k})$ overall.

We can follow the same analysis as for RSR, but this time using RSR++ as a subroutine for Line 5 of Algorithm 2. Based on the analysis in § 5.2.1, the inference time becomes $O(\frac{n}{k}(n+2^{k}))$; for $k=\log(n)$, this results in a running time of $O\left(\frac{n^{2}}{\log(n)}\right)$.

Theorem 5.4.

The RSR++ algorithm solves Problem 1 with a time complexity of $O\left(\frac{n^{2}}{\log(n)}\right)$.

5.3.2 The optimal kk

The process for finding the optimal value of $k$ for RSR++ is the same as the one for RSR, except that the binary search is applied in the range $[1,\log(n)]$ to optimize the following equation:

$$k^{opt}=\underset{k\in[1,\,\log(n)]}{\text{argmin}}\ \frac{n}{k}(n+2^{k}) \tag{12}$$
Algorithm 3 RSR++ (Step 2 in Inference Time)
1: Input: vector $\vec{u}\in\mathbb{R}^{2^{k}}$ and matrix $Bin_{[k]}$
2: Output: result $\vec{r}=\vec{u}\cdot Bin_{[k]}$
3: Initialize $\vec{x}\leftarrow\vec{u}$
4: for $i$ from $k$ to $1$ do
5:   $\vec{r}_{i}\leftarrow$ sum of odd-indexed elements of $\vec{x}$
6:   Initialize $\vec{v}$ to a vector of size $\frac{|\vec{x}|}{2}$
7:   for $j$ from $1$ to $\frac{|\vec{x}|}{2}$ do
8:     $\vec{v}[j]\leftarrow\vec{x}[2j-1]+\vec{x}[2j]$
9:   end for
10:   $\vec{x}\leftarrow\vec{v}$
11: end for
12: Return: $\vec{r}$
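A NumPy sketch of Algorithm 3 (ours), with a sanity check against the direct product with $Bin_{[k]}$:

```python
import numpy as np

def rsr_pp_block_product(u, k):
    """Compute r = u . Bin_[k] in O(2^k) by repeatedly halving the working vector."""
    x = np.asarray(u, dtype=float)
    r = np.zeros(k)
    for i in range(k - 1, -1, -1):
        r[i] = x[1::2].sum()        # (i) sum of odd-indexed entries of x
        x = x[0::2] + x[1::2]       # (ii) pairwise sums halve the length of x
    return r

k = 3
u = np.arange(1, 2 ** k + 1, dtype=float)
Bk = (np.arange(2 ** k)[:, None] >> np.arange(k - 1, -1, -1)) & 1
assert np.allclose(rsr_pp_block_product(u, k), u @ Bk)
```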

6 Experiments

In this section, we evaluate our proposed algorithms' time and memory performance in real-world scenarios. First, in Section 6.1, we detail the native C++ implementation of both RSR and RSR++, comparing their runtime performance under a fair setting to validate our theoretical guarantees. Next, in Section 6.2, we explore the process of determining the optimal value of $k$, identifying the best $k$ for each input size $n$. Finally, in Section 6.3, we implement and evaluate the algorithms using Python's NumPy package, providing insights into their performance in a more practical and widely used computational environment. Our code is publicly available at https://github.com/UIC-InDeXLab/RSR.

6.1 Native Implementation

In this section, we focus on verifying the theoretical foundation of our proposed methods, RSR and RSR++, using native implementations in C++. We selected C++ as the implementation language because, being a compiled language, it is less susceptible to runtime noise, allowing the measured inference time to better reflect the exact time complexity of each method. As a baseline, we also implemented Standard Matrix Multiplication, which we refer to as Standard throughout this section.

For input, we assume access to a binary weight matrix $B^{n\times n}$, where $2^{11}\leq n\leq 2^{16}$, during preprocessing. At inference time, given an input vector $\vec{v}$ of size $n$, our aim is to compute $\vec{v}\cdot B$, with both $B$ and $\vec{v}$ generated randomly. These parameters align with typical sizes of weight matrices and input vectors commonly encountered in deep learning and large language model (LLM) inference tasks. We then measure inference time across varying values of $n$ for each method. We also utilize the optimal value of $k$ for each $n$, which will be discussed in detail in the next section.

Figure 5 presents the inference times for these methods. Notably, our proposed algorithms achieve up to a 29x speedup over the Standard baseline when $n=2^{16}$, representing a substantial improvement that could significantly impact DNN and LLM architectures, enabling them to utilize binary and ternary weights more effectively.

Figure 6 illustrates the difference between RSR++ and RSR in inference time. In this figure, we can see up to a 25% improvement when using RSR++ in Step 2 of the inference (see Algorithm 3). The improvement percentage is computed using the formula $\frac{T(\texttt{RSR})-T(\texttt{RSR++})}{T(\texttt{RSR})}\times 100$.

Figure 5: Comparison of RSR, RSR++, and Standard on native C++ implementation for Binary Matrix Multiplication. The speedup values are between RSR++ and Standard. Each value is the average of 10 different runs.
Figure 6: The comparison of RSR and RSR++. The percentage shows how much improvement we get using RSR++ relative to RSR. This is a native C++ implementation.

6.2 kk Optimization

As outlined in Section 5.2.2, the optimal value of $k$ for RSR (resp. RSR++) is determined through a binary search over the interval $[1,\log(\frac{n}{\log(n)})]$ (resp. $[1,\log(n)]$). Once the optimal $k$ is identified for each $n$, it is applied in the subsequent experiments. Figures 7(a) and 7(b) show the running time with respect to different $k$ values. As $n$ increases, the optimum value of $k$ also increases.

(a) RSR  (b) RSR++
Figure 7: Finding the optimum $k$ for each $n$. The red dots show the best $k$ for each $n$, i.e., the one that results in the best running time. Having calculated the optimum $k$ values, we use them in the other experiments.

6.3 Matrix Multiplication Using NumPy

(a) Binary  (b) Ternary
Figure 8: Comparison of the time required to perform a vector-to-weight-matrix multiplication for (8(a)) Binary and (8(b)) Ternary weights, between NumPy and RSR. Each data point is an average of 4 different runs.
Figure 9: Memory consumption of RSR after preprocessing is done, compared to the memory required for Standard matrix multiplication (NumPy). In RSR, we only store the permutations and segmentation lists in memory.

In this section, we investigate the practical performance improvements achieved by RSR, extending beyond native simulation environments. To this end, we use NumPy, a state-of-the-art library for matrix multiplication upon which numerous high-performance computing libraries are based. We implement RSR in Python using NumPy’s built-in functionalities and compare the inference time and memory consumption between RSR and NumPy’s np.dot() function (referred to as NumPy).

6.3.1 Inference Time

As in previous sections, let $A^{n\times n}$ denote our weight matrix and $\vec{v}$ the input vector, where our goal is to compute $\vec{v}\cdot A$. We run the experiments on both Binary and Ternary matrices. The matrix is available prior to inference time, so we can run preprocessing on it. For the parameter $k$, we use the optimal values determined previously in Section 6.2.

We initialize $n$ at $2^{11}$ and double its size in each experiment up to $2^{15}$. Each multiplication is executed four times, with results averaged to mitigate noise. Figures 8(a) and 8(b) present results for matrix multiplication using both the naive NumPy method and RSR for Binary and Ternary weight matrices, respectively. As shown, RSR achieves up to a 24x speedup over the NumPy baseline on binary matrices of size $n=2^{15}$, illustrating that our algorithm not only meets theoretical guarantees but also outperforms state-of-the-art practical methods.

6.3.2 Memory consumption

RSR requires a preprocessing step to compute the permutations and segments, where only these two structures need to be stored in place of the weight matrix. This preprocessing enables substantial compression. Figure 9 illustrates the memory consumption in megabytes for NumPy and RSR. At the peak of the preprocessing procedure, both the weight matrix $A$ and the segment data are stored in memory (indicated by the green line); however, after computation, only the segmentation lists and permutations are stored, and $A$ is cleared from memory. For a matrix of size $n=2^{16}$, the storage required is reduced to less than 17% of the original matrix size (a 5.99x improvement), highlighting the benefits of this approach for large-scale storage and data transfer.

In summary, our algorithm RSR provides practical advantages for deploying large models. For instance, companies training new large language models (LLMs) could preprocess their weights and release only the final segments, permutations, and the optimal parameter $k$. This would result in up to 24x faster inference times and up to 5.99x memory reduction, significantly easing both storage and transfer demands.

7 Discussion

In this section, we discuss some of the advantages and limitations of our proposed algorithms and outline potential directions for future research in this area.

7.1 Advantages

  • Our RSR++ algorithm shows up to a 29x improvement in inference time and a 6x reduction in memory usage, as observed in our experiments. Note that, following our complexity analyses, these gains grow with the size of the network layers. This significant improvement indicates that the algorithm can support larger hidden layers and matrices to enhance model accuracy while requiring fewer computational resources.

  • The efficiency of our algorithm leads to reduced energy consumption, further contributing to the model’s practicality and sustainability.

  • Our approach enables the deployment of more advanced models on ordinary devices, such as personal computers, with limited memory and processing power.

  • By processing one column block at a time, the algorithm only requires memory equivalent to the size of a single block, which is significantly less than that needed for full matrix multiplication.

  • Last but not least, our method is deployable on top of any of the current or future ternary models, without requiring any fine-tuning or re-training of the LLM or DNN. Once a model is trained, our preprocessing step can be applied, allowing the preprocessed model to be utilized at any future point.

7.2 Limitations

  • Our algorithms can be parallelized by executing block products concurrently, as each block product in Step 2 of the inference is independent. Additionally, further parallelization can be achieved by applying the block product simultaneously on different columns of the binary-encoded matrix $Bin_{[k]}$. However, RSR++ is not easily parallelizable due to the sequential nature of the operations applied to the vector $\vec{u}$ in Step 2. Future work could explore leveraging specific hardware instructions for operations such as blocking, permutations, segmentations, and block products, to further optimize these steps at the hardware level.

  • The preprocessing time of our algorithms is $O(n^{2})$, and this step must be applied to all weight matrices of an LLM or DNN. Although this process can be time-intensive, it is performed only once for a trained model.

7.3 Generalization

Our proposed algorithms are applicable to any binary or ternary matrix of size $n\times m$. Additionally, they can be extended to any $q$-bit quantized matrix or quantized model. The approach begins by expressing the $q$-bit matrix as the sum of $2^{q-2}$ binary matrices. Each of these binary matrices is processed individually, and the results are combined at the end. The decomposition of the $q$-bit matrix into $2^{q-2}$ binary matrices can be obtained by applying Lemma 3.1 recursively to the matrix.

8 Conclusion

This paper presented algorithms that significantly improve inference time and memory efficiency for 1.58-bit LLMs with ternary weights. By preprocessing weight matrices to create indices, our approach reduces the storage complexity for maintaining the model weights by a logarithmic factor. It enhances inference efficiency, achieving a time complexity of $O(\frac{n^{2}}{\log n})$.

In addition to the theoretical guarantees, our experiment results demonstrated the practical advantages of our methods, with observed reductions in inference time of up to 29x and memory usage of up to 6x, underscoring the potential of our approach to make LLMs more accessible and cost-effective.

References

  • AI21 Labs (2024) AI21 Labs. Ai21 studio, 2024. URL https://www.ai21.com/studio. Accessed: 2024-11-09.
  • Brown (2020) Brown, T. B. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Chen et al. (2024) Chen, M., Shao, W., Xu, P., Wang, J., Gao, P., Zhang, K., Qiao, Y., and Luo, P. Efficientqat: Efficient quantization-aware training for large language models. arXiv preprint arXiv:2407.11062, 2024.
  • Das et al. (2024) Das, B. C., Amini, M. H., and Wu, Y. Security and privacy challenges of large language models: A survey. arXiv preprint arXiv:2402.00888, 2024.
  • Finlayson et al. (2024) Finlayson, M., Ren, X., and Swayamdipta, S. Logits of api-protected llms leak proprietary information. arXiv preprint arXiv:2403.09539, 2024.
  • Guo et al. (2024) Guo, H., Brandon, W., Cholakov, R., Ragan-Kelley, J., Xing, E. P., and Kim, Y. Fast matrix multiplications for lookup table-quantized llms. arXiv preprint arXiv:2407.10960, 2024.
  • Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Hubara et al. (2018) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2018.
  • Hugging Face (2023) Hugging Face. Autotrain by hugging face. Accessed: 2024-11-09, 2023. https://huggingface.co/autotrain.
  • Ma et al. (2024) Ma, S., Wang, H., Ma, L., Wang, L., Wang, W., Huang, S., Dong, L., Wang, R., Xue, J., and Wei, F. The era of 1-bit llms: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764, 2024.
  • Moons et al. (2017) Moons, B., Goetschalckx, K., Van Berckelaer, N., and Verhelst, M. Minimum energy quantized neural networks. In 2017 51st Asilomar Conference on Signals, Systems, and Computers, pp.  1921–1925. IEEE, 2017.
  • OpenAI (2024) OpenAI. Models, 2024. URL https://platform.openai.com/docs/models. Accessed: 2024-11-09.
  • Pearce et al. (2023) Pearce, H., Tan, B., Ahmad, B., Karri, R., and Dolan-Gavitt, B. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP), pp.  2339–2356. IEEE, 2023.
  • Tech (2023) Tech, W. G. The journey of openai gpt models. https://medium.com/walmartglobaltech/the-journey-of-open-ai-gpt-models-32d95b7b7fb2, 2023. Accessed: 2024-11-03.
  • Tsai (2020) Tsai, A. Explaining gpt-3 architecture and working. https://medium.com/@tsaiabhi.cool/explaining-gpt-3-architecture-and-working-d0219c79202c, 2020. Accessed: 2024-11-03.
  • Wang et al. (2023) Wang, H., Ma, S., Dong, L., Huang, S., Wang, H., Ma, L., Yang, F., Wang, R., Wu, Y., and Wei, F. Bitnet: Scaling 1-bit transformers for large language models. arXiv preprint arXiv:2310.11453, 2023.
  • Wang et al. (2024) Wang, L., Ma, L., Cao, S., Zhang, Q., Xue, J., Shi, Y., Zheng, N., Miao, Z., Yang, F., Cao, T., et al. Ladder: Enabling efficient Low-Precision deep learning computing through hardware-aware tensor transformation. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pp. 307–323, 2024.
  • Yao et al. (2024) Yao, Y., Duan, J., Xu, K., Cai, Y., Sun, Z., and Zhang, Y. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, pp.  100211, 2024.
  • Yuan et al. (2024) Yuan, Z., Zhou, R., Wang, H., He, L., Ye, Y., and Sun, L. Vit-1.58 b: Mobile vision transformers in the 1-bit era. arXiv preprint arXiv:2406.18051, 2024.