Faster Walsh-Hadamard Transform and Matrix Multiplication over Finite Fields using Lookup Tables

Josh Alman (Columbia University; [email protected])
Abstract

We use lookup tables to design faster algorithms for important algebraic problems over finite fields. These faster algorithms, which only use arithmetic operations and lookup table operations, may help to explain the difficulty of determining the complexities of these important problems. Our results over a constant-sized finite field are as follows.

The Walsh-Hadamard transform of a vector of length $N$ can be computed using $O(N\log N/\log\log N)$ bit operations. This generalizes to any transform defined as a Kronecker power of a fixed matrix. By comparison, the Fast Walsh-Hadamard transform (similar to the Fast Fourier transform) uses $O(N\log N)$ arithmetic operations, which is believed to be optimal up to constant factors.

Any algebraic algorithm for multiplying two $N\times N$ matrices using $O(N^{\omega})$ operations can be converted into an algorithm using $O(N^{\omega}/(\log N)^{\omega/2-1})$ bit operations. For example, Strassen’s algorithm can be converted into an algorithm using $O(N^{2.81}/(\log N)^{0.4})$ bit operations. It remains an open problem with practical implications to determine the smallest constant $c$ such that Strassen’s algorithm can be implemented to use $c\cdot N^{2.81}+o(N^{2.81})$ arithmetic operations; using a lookup table allows one to save a super-constant factor in bit operations.

1 Introduction

When $N$ is a power of $2$, the Walsh-Hadamard transform is defined recursively by $H_{2}=\begin{bmatrix}1&1\\ 1&-1\end{bmatrix}$, and

$$H_{N}=\begin{bmatrix}H_{N/2}&H_{N/2}\\ H_{N/2}&-H_{N/2}\end{bmatrix}.$$

A common task in many areas of computation is to compute the length-$N$ Walsh-Hadamard transform, i.e., given as input a length-$N$ vector $v\in\mathbb{F}^{N}$, compute the vector $H_{N}v$. The most straightforward algorithm would compute this using $O(N^{2})$ arithmetic operations, but the fast Walsh-Hadamard transform (FWHT) algorithm can compute this using only $O(N\log N)$ arithmetic operations. It is widely believed that $\Omega(N\log N)$ arithmetic operations are necessary, and a substantial amount of work has gone into proving this in restricted arithmetic models (for instance, this is known to hold for arithmetic circuits with ‘bounded coefficients’; see, e.g., [Lok09, Section 3.3]) and into studying conjectures which would imply this (for instance, the now-refuted conjecture that the Walsh-Hadamard transform is rigid [AW17]); see, for instance, the survey [Lok09].
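To make this arithmetic-model baseline concrete, here is a minimal Python sketch of the fast Walsh-Hadamard transform, directly following the recursive definition above; the function name `fwht` and the choice to work over the integers are illustrative choices, not from the paper.

```python
def fwht(v):
    """Fast Walsh-Hadamard transform of a vector whose length is a power of 2.

    Follows the recursion H_N = [[H_{N/2}, H_{N/2}], [H_{N/2}, -H_{N/2}]],
    using O(N log N) additions and subtractions in total.
    """
    n = len(v)
    if n == 1:
        return list(v)
    half = n // 2
    top = fwht(v[:half])   # H_{N/2} applied to the first half of v
    bot = fwht(v[half:])   # H_{N/2} applied to the second half of v
    return [t + b for t, b in zip(top, bot)] + [t - b for t, b in zip(top, bot)]

# Example: the length-4 transform of (1, 0, 0, 0) is (1, 1, 1, 1).
print(fwht([1, 0, 0, 0]))
```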

A natural question arises: why restrict ourselves to arithmetic models of computation? The Walsh-Hadamard transform is commonly used in practice, and speedups using non-arithmetic operations could be impactful. Nonetheless, there has not been much work on non-arithmetic algorithms for the Walsh-Hadamard transform.

A related problem is matrix multiplication: given as input two $n\times n$ matrices, compute their product. The asymptotically fastest known algorithm is algebraic, and uses $O(n^{2.373})$ arithmetic operations. That said, non-algebraic algorithmic techniques, especially lookup tables, have been used since the work of [ADKF70] to design more practical, ‘combinatorial’ algorithms for a variant on this problem called Boolean matrix multiplication (given two matrices whose entries are from $\{0,1\}$, the goal there is to multiply them over the AND-OR semiring, or equivalently, to determine which entries of their product over $\mathbb{R}$ are nonzero, but not necessarily what nonzero values they take on). These techniques save logarithmic factors over the straightforward $O(n^{3})$ time algorithm – the best algorithm along these lines runs in $n^{3}\cdot\frac{(\log\log n)^{O(1)}}{(\log n)^{3}}$ time [Yu18] in the standard word RAM model – but they are considered more practical than the algebraic algorithms (which save polynomial factors in $n$). Lookup tables have also been used in practice to approximately multiply matrices [JPK+20, BG21].

In this paper, we show that lookup tables can be used to speed up the asymptotically fastest known algorithms for both the Walsh-Hadamard transform and (exact) matrix multiplication over finite fields. We will show, for instance, that only $o(N\log N)$ bit operations suffice to compute the length-$N$ Walsh-Hadamard transform over finite fields when we augment the arithmetic model to allow for lookup tables. This may help to explain the difficulty of proving lower bounds for this problem, and help to guide future work on arithmetic circuit lower bounds (since any lower bounds technique would need to fail when lookup tables are allowed).

We focus here on constant-sized finite fields, though our algorithms generalize to larger finite fields as well. Our algorithms are simple modifications of the usual recursive algorithms for solving these problems, and we describe below how they compare favorably to algorithms which are used in practice.

1.1 Model of computation

As discussed, these problems are typically studied in the arithmetic circuit model, wherein an algorithm may only perform arithmetic operations ($+,-,\times,\div$) over the field $\mathbb{F}$ applied to inputs and fixed constants, and the arithmetic complexity of the algorithm is the number of such operations. The asymptotically fastest known algorithms for the problems we study here all fit within this model. However, we would also like the ability to use lookup tables, so we consider a more general model: the bit operations model.

The bit complexity of a RAM algorithm is the number of operations on bits performed by the algorithm. This is the natural model with the most direct comparison to the arithmetic model, since an algorithm with arithmetic complexity $T$ over a constant-sized finite field naturally has bit complexity $O(T)$. This model is often used as a more realistic version of the arithmetic model (see e.g., [Pan81, Lin91, VDHL13, EDS18]).

One can see (via a simple tree data structure, for instance) that a lookup table with $b$-bit keys and values can be implemented so that values can be looked up and changed using $O(b)$ bit operations.
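For illustration, the following is a small Python sketch (one possible implementation, not taken from the paper) of such a lookup table stored as a binary trie, where each lookup or update walks one node per key bit and hence costs $O(b)$ operations for $b$-bit keys.

```python
class BitTrieTable:
    """Lookup table keyed by b-bit strings, stored as a binary trie."""

    def __init__(self):
        self.root = {}

    def set(self, key_bits, value):
        node = self.root
        for bit in key_bits:              # one trie level per key bit
            node = node.setdefault(bit, {})
        node['value'] = value

    def get(self, key_bits):
        node = self.root
        for bit in key_bits:              # again O(b) steps for a b-bit key
            node = node[bit]
        return node['value']

table = BitTrieTable()
table.set('1011', 'payload')
print(table.get('1011'))                  # -> 'payload'
```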

We note before moving on that all the algorithms in this paper will only perform arithmetic operations and lookup table operations; one could also define a nonstandard model of computation which allows for just these two types of operations, and get the same results.

1.2 Results: Walsh-Hadamard Transform

Our first result is a faster algorithm for the Walsh-Hadamard transform.

Theorem 1.1.

Let $\mathbb{F}_{q}$ be a finite field of size $q=O(1)$, let $n$ be a positive integer, and let $N=2^{n}$. There is an algorithm for computing the length-$N$ Walsh-Hadamard transform over $\mathbb{F}_{q}$ that uses $O\left(\frac{N\log N}{\log\log N}\right)$ bit operations.

By comparison, the fast Walsh-Hadamard transform algorithm, which uses $\Theta(N\log N)$ arithmetic operations, would take $\Theta(N\log N)$ bit operations. $\Theta(N\log N)$ arithmetic operations is widely believed to be optimal over any field (whose characteristic is not 2; the problem is trivial in that case), but our algorithm improves on this using lookup tables.

Our algorithm uses the same recursive approach as the fast Walsh-Hadamard transform; our main idea is to use results from a lookup table to quickly jump forward many recursive layers at a time.

Although the Walsh-Hadamard transform is most often applied over the real or complex numbers, such as in signal processing and data compression, it has been applied over finite fields in areas including coding theory [RL01, XQ19], cryptographic protocols [HK10, MQ19], and learning algorithms [LC20].

Our algorithm also generalizes directly to any transform defined by Kronecker powers of a fixed matrix over a finite field. For instance, given as input the $2^{n}$ coefficients of a multilinear polynomial in $n$ variables over $\mathbb{F}_{2}$, we can compute its evaluation on all $2^{n}$ inputs from $\mathbb{F}_{2}^{n}$ in $O(2^{n}\cdot n/\log n)$ bit operations, improving over the usual recursive algorithm by a factor of $\log n$ (this problem corresponds to computing the linear transform defined by Kronecker powers of the matrix $\begin{bmatrix}1&1\\ 1&0\end{bmatrix}$).

1.3 Results: Matrix Multiplication

Our result for matrix multiplication shows how to convert any algebraic algorithm into one which uses a lookup table to save a superconstant factor.

Theorem 1.2.

For any finite field $\mathbb{F}_{q}$ of size $q=O(1)$, suppose there is an algebraic algorithm for multiplying $n\times n$ matrices over $\mathbb{F}_{q}$ in $O(n^{\tau})$ arithmetic operations for some constant $\tau>2$. Then, there is another algorithm for multiplying $n\times n$ matrices over $\mathbb{F}_{q}$ which uses $O(n^{\tau}/(\log n)^{\tau/2-1})$ bit operations.

This speeds up the standard implementation of the algebraic algorithm by a factor of $\Theta((\log n)^{\tau/2-1})$. For instance, Strassen’s algorithm [Str69] gives $\tau=2.81$, resulting in an algorithm using $O(n^{2.81}/(\log n)^{0.4})$ bit operations by Theorem 4.2, and the asymptotically fastest known algorithm [CW90, AW21] gives $\tau=2.373$, resulting in an algorithm using $O(n^{2.373}/(\log n)^{0.186})$ bit operations by Theorem 4.2.

Notably, much work has gone into improving the leading constant in the running time of Strassen’s algorithm. Strassen’s original algorithm [Str69] has leading constant $7$ (meaning, it uses $7n^{\log_{2}(7)}+o(n^{\log_{2}(7)})$ operations), and Winograd [Win71] improved this to $6$. This was believed to be optimal due to lower bounds by Probert [Pro76] and Bshouty [Bsh95]. However, in a recent breakthrough, Karstadt and Schwartz [KS20] gave a new ‘change of basis’ approach, and used it to improve the leading constant to $5$. They showed that $5$ is optimal for their new approach, and later work showed it is also optimal for the more general ‘sparse decomposition’ approach [BS19]. The fact that we achieve an asymptotic speedup in this paper (which one can view as achieving leading constant $\varepsilon$ for any $\varepsilon>0$ in bit complexity) may help to explain the difficulty of extending these lower bounds on the constant beyond restricted classes of algorithmic approaches.

Our approach for matrix multiplication is one that is commonly used in practice: use iterations of a recursive algebraic algorithm until the matrices one would like to multiply are sufficiently small, and then use a different algorithm optimized for small matrices. When this is used in practice, an optimized version of the straightforward (cubic-time) algorithm is used for small matrices, giving a constant factor improvement to the running time; see e.g., [HSHVDG16]. We implement this approach by instead using lookup tables to multiply superconstant-sized matrices very quickly, and thus get an asymptotic improvement.

Matrix multiplication over finite fields has many applications. One prominent example is Boolean matrix multiplication, which has a simple randomized reduction to matrix multiplication over any field: set each entry of one of the matrices to 0 independently with probability $1/2$, then multiply the two matrices and check which entries are nonzero. Hence our algorithm gives an asymptotic speedup for applications of Boolean matrix multiplication such as detecting triangles in graphs.
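As an illustration of this reduction, here is a small Python sketch (not from the paper); it works over $\mathbb{F}_{2}$, uses the naive cubic product purely for readability where one would really call a fast matrix multiplication routine, and the name `boolean_mm_via_field` and the trial count are illustrative.

```python
import random

def boolean_mm_via_field(A, B, trials=20):
    """Boolean matrix multiplication via randomized reduction to GF(2) products.

    Each trial zeroes every entry of B independently with probability 1/2 and
    multiplies over GF(2).  A Boolean-product entry that is 0 is never reported;
    an entry that is 1 is detected with probability 1/2 in each trial.
    """
    n = len(A)
    out = [[0] * n for _ in range(n)]
    for _ in range(trials):
        Bz = [[b if random.random() < 0.5 else 0 for b in row] for row in B]
        for i in range(n):
            for k in range(n):
                # Naive product for readability; a fast algorithm would be used here.
                if sum(A[i][j] * Bz[j][k] for j in range(n)) % 2:
                    out[i][k] = 1
    return out

print(boolean_mm_via_field([[1, 0], [0, 1]], [[0, 1], [1, 0]]))  # [[0, 1], [1, 0]] w.h.p.
```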

2 Preliminaries

2.1 Notation

Throughout this paper, we write $\log$ to denote the base $2$ logarithm. For a positive integer $n$, we use the notation $[n]:=\{1,2,3,\ldots,n\}$.

2.2 Kronecker Products

If $\mathbb{F}$ is a field, and $A\in\mathbb{F}^{N\times N}$ and $B\in\mathbb{F}^{M\times M}$ are matrices, their Kronecker product $A\otimes B\in\mathbb{F}^{(NM)\times(NM)}$ is a matrix given by

$$A\otimes B=\begin{bmatrix}B[1,1]\cdot A&B[1,2]\cdot A&\cdots&B[1,M]\cdot A\\ B[2,1]\cdot A&B[2,2]\cdot A&\cdots&B[2,M]\cdot A\\ \vdots&\vdots&\ddots&\vdots\\ B[M,1]\cdot A&B[M,2]\cdot A&\cdots&B[M,M]\cdot A\end{bmatrix}.$$

For a positive integer $n$, we write $A^{\otimes n}\in\mathbb{F}^{N^{n}\times N^{n}}$ for the $n$th Kronecker power of $A$, which is the Kronecker product of $n$ copies of $A$.
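For concreteness, the following short Python sketch (not from the paper) implements this definition, using the block convention above in which block $(i,j)$ of $A\otimes B$ is $B[i,j]\cdot A$.

```python
def kron(A, B):
    """Kronecker product A (x) B with block (i, j) equal to B[i][j] * A."""
    n, m = len(A), len(B)
    out = [[0] * (n * m) for _ in range(n * m)]
    for bi in range(m):                     # block row index into B
        for bj in range(m):                 # block column index into B
            for i in range(n):
                for j in range(n):
                    out[bi * n + i][bj * n + j] = B[bi][bj] * A[i][j]
    return out

def kron_power(A, n):
    """n-th Kronecker power of A (the Kronecker product of n copies of A)."""
    result = A
    for _ in range(n - 1):
        result = kron(result, A)
    return result

H2 = [[1, 1], [1, -1]]
H4 = kron_power(H2, 2)   # the 4x4 Walsh-Hadamard matrix of Section 2.3
```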

2.3 Matrices of interest

For a positive integer $N$, we write $I_{N}$ to denote the $N\times N$ identity matrix. If $N$ is a power of $2$, we write $H_{N}$ to denote the $N\times N$ Walsh-Hadamard transform, given by $H_{N}=H_{2}^{\otimes\log N}$, where

$$H_{2}=\begin{bmatrix}1&1\\ 1&-1\end{bmatrix}.$$

3 Walsh-Hadamard Transform

Let $\mathbb{F}_{q}$ be the finite field of size $q$, and let $N$ be a power of $2$. In this section, we give an algorithm for computing the length-$N$ Walsh-Hadamard transform, $H_{N}$, over $\mathbb{F}_{q}$. The key idea behind our new algorithm is to pick a $K=O(\log_{q}N)$ such that $q^{K}\ll N$, and first create a lookup table of the length-$K$ Walsh-Hadamard transforms of all vectors $v\in\mathbb{F}_{q}^{K}$. We will then use this lookup table in conjunction with the following standard recursive approach for computing Kronecker powers (sometimes called Yates’ algorithm), which we will apply with $M=H_{K}$:

Lemma 3.1.

Let $\mathbb{F}_{q}$ be any constant-sized finite field, let $m,d$ be positive integers, and let $M\in\mathbb{F}_{q}^{d\times d}$ be any matrix. Suppose we are given an algorithm which, on input $v\in\mathbb{F}_{q}^{d}$, outputs $Mv$ in time $T$. Then, there is an algorithm which, on input $z\in\mathbb{F}_{q}^{d^{m}}$, outputs $M^{\otimes m}z$ in time $O(T\cdot m\cdot d^{m-1})$.

Proof.

By definition of the Kronecker product, we can write $M^{\otimes m}$ as a $d\times d$ block matrix (where each block is a $d^{m-1}\times d^{m-1}$ matrix) as

$$\begin{aligned}M^{\otimes m}&=\begin{bmatrix}M[1,1]\cdot M^{\otimes(m-1)}&M[1,2]\cdot M^{\otimes(m-1)}&\cdots&M[1,d]\cdot M^{\otimes(m-1)}\\ M[2,1]\cdot M^{\otimes(m-1)}&M[2,2]\cdot M^{\otimes(m-1)}&\cdots&M[2,d]\cdot M^{\otimes(m-1)}\\ \vdots&\vdots&\ddots&\vdots\\ M[d,1]\cdot M^{\otimes(m-1)}&M[d,2]\cdot M^{\otimes(m-1)}&\cdots&M[d,d]\cdot M^{\otimes(m-1)}\end{bmatrix}\\ &=\begin{bmatrix}M[1,1]\cdot I_{d^{m-1}}&M[1,2]\cdot I_{d^{m-1}}&\cdots&M[1,d]\cdot I_{d^{m-1}}\\ M[2,1]\cdot I_{d^{m-1}}&M[2,2]\cdot I_{d^{m-1}}&\cdots&M[2,d]\cdot I_{d^{m-1}}\\ \vdots&\vdots&\ddots&\vdots\\ M[d,1]\cdot I_{d^{m-1}}&M[d,2]\cdot I_{d^{m-1}}&\cdots&M[d,d]\cdot I_{d^{m-1}}\end{bmatrix}\times\begin{bmatrix}M^{\otimes(m-1)}&&&\\ &M^{\otimes(m-1)}&&\\ &&\ddots&\\ &&&M^{\otimes(m-1)}\end{bmatrix}.\end{aligned}$$

Thus, we can multiply $M^{\otimes m}$ times $z\in\mathbb{F}_{q}^{d^{m}}$ with a two-step process (corresponding to multiplying the matrix on the right times $z$, then the matrix on the left times the result):

  1. Partition $z\in\mathbb{F}_{q}^{d^{m}}$ into $d$ vectors $z_{1},\ldots,z_{d}\in\mathbb{F}_{q}^{d^{m-1}}$. Recursively compute $u_{i}=M^{\otimes(m-1)}z_{i}\in\mathbb{F}_{q}^{d^{m-1}}$ for each $i\in\{1,\ldots,d\}$.

  2. For each $j\in\{1,\ldots,d^{m-1}\}$, let $x_{j}\in\mathbb{F}_{q}^{d}$ be the vector consisting of, for each $i\in\{1,\ldots,d\}$, entry $j$ of vector $u_{i}$. Use the given algorithm to compute $y_{j}=Mx_{j}$.

Finally, we output the appropriate concatenation of the $y_{j}$ vectors (where the first $d^{m-1}$ entries are the first entries of all the $y_{j}$ vectors, the second $d^{m-1}$ entries are the second entries of all the $y_{j}$ vectors, and so on).

Our algorithm makes $d$ recursive calls in the first step, and calls the given algorithm $d^{m-1}$ times in the second step. Hence, the total running time, $E(d^{m})$, has the recurrence

$$E(d^{m})=d\cdot E(d^{m-1})+d^{m-1}\cdot T.$$

This solves, as desired, to $E(d^{m})=O(T\cdot m\cdot d^{m-1})$. ∎
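To make the recursion in Lemma 3.1 concrete, here is a minimal Python sketch (not from the paper); the helper name `kron_apply` and the `apply_M` callback, which plays the role of the given algorithm for multiplying by $M$, are illustrative.

```python
def kron_apply(apply_M, d, z):
    """Apply M^{(x) m} to z, where len(z) == d**m and apply_M(x) returns M x
    for a length-d vector x, following the two-step recursion of Lemma 3.1."""
    if len(z) == d:
        return apply_M(z)
    block = len(z) // d
    # Step 1: recursively apply M^{(x)(m-1)} to each of the d blocks of z.
    u = [kron_apply(apply_M, d, z[i * block:(i + 1) * block]) for i in range(d)]
    # Step 2: for each position j, apply M to the vector (u_1[j], ..., u_d[j]).
    y = [apply_M([u[i][j] for i in range(d)]) for j in range(block)]
    # Concatenate: output block i consists of entry i of y_1, ..., y_{d^{m-1}}.
    return [y[j][i] for i in range(d) for j in range(block)]

# Example with M = H_2, so kron_apply computes the Walsh-Hadamard transform.
apply_H2 = lambda x: [x[0] + x[1], x[0] - x[1]]
print(kron_apply(apply_H2, 2, [1, 0, 0, 0]))   # -> [1, 1, 1, 1]
```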

We can now give our main algorithm for the Walsh-Hadamard transform:

Theorem 3.2.

Let $\mathbb{F}_{q}$ be a finite field of size $q=O(1)$, let $n$ be a positive integer, and let $N=2^{n}$. There is an algorithm for computing the length-$N$ Walsh-Hadamard transform over $\mathbb{F}_{q}$ that uses $O\left(\frac{N\log N}{\log\log N}\right)$ bit operations.

Proof.

Let $k=\left\lfloor\log\left(\frac{n}{2\log q}\right)\right\rfloor$, and let $K=2^{k}\leq\frac{n}{2\log q}$. We begin by iterating over all vectors $v\in\mathbb{F}_{q}^{K}$, computing $H_{K}v$, and storing it in a lookup table. We can do this in a straightforward way: there are $q^{K}$ such vectors, and each can be computed using $O(K^{2})$ additions and subtractions over $\mathbb{F}_{q}$, so the total time to create this lookup table is at most

$$O(q^{K}\cdot K^{2})\leq O(q^{n/(2\log q)}\cdot(\log N)^{2})\leq O(\sqrt{N}\cdot(\log N)^{2})\leq O(N^{0.6}).$$

(This simple time bound could be improved, but we won’t bother here since it won’t substantially contribute to the final running time.)

Our goal now is, given $v\in\mathbb{F}_{q}^{N}$, to compute $H_{N}v$. Assume for now that $k$ divides $n$. We will apply Lemma 3.1 with $M=H_{K}$ (and hence $d=K$) and $m=n/k$, which will multiply the matrix $(H_{K})^{\otimes n/k}=H_{N}$ times the vector $v$, as desired. Each time that algorithm needs to multiply $H_{K}$ times a vector of length $K$, we do so by looking up the answer from the lookup table. Hence, $T$ in Lemma 3.1 will be the time to do one lookup from this table whose keys and values have length $O(\log(q^{K}))=O(K)$, so $T=O(K)$.

The total number of bit operations of our algorithm is thus, as desired,

$$O\left(N^{0.6}+T\cdot\frac{n}{k}\cdot K^{n/k-1}\right)=O\left(N^{0.6}+T\cdot\frac{n}{k\cdot K}\cdot N\right)=O\left(N^{0.6}+\frac{n}{k}\cdot N\right)=O\left(\frac{N\log N}{\log\log N}\right).$$

Finally, consider when $k$ does not divide $n$. Let $n^{\prime}$ be the largest multiple of $k$ that is smaller than $n$, so $n-k<n^{\prime}<n$. By the usual recursive approach (e.g., one recursive step of the algorithm presented in Lemma 3.1), it suffices to first perform $2^{n-n^{\prime}}$ instances of a length-$2^{n^{\prime}}$ Walsh-Hadamard transform, and then perform $2^{n^{\prime}}$ instances of a length-$2^{n-n^{\prime}}$ Walsh-Hadamard transform. We now count the number of bit operations for these two steps.

Using the same algorithm as above, since $k$ divides $n^{\prime}$, a length-$2^{n^{\prime}}$ Walsh-Hadamard transform can be performed using $O(\frac{n^{\prime}\cdot 2^{n^{\prime}}}{\log n^{\prime}})$ bit operations. Hence, the total bit operations for the first step is $O(2^{n-n^{\prime}}\cdot\frac{n^{\prime}\cdot 2^{n^{\prime}}}{\log n^{\prime}})\leq O(\frac{n\cdot 2^{n}}{\log n})=O(N\log N/\log\log N)$.

Using the usual fast Walsh-Hadamard transform, a length-$2^{n-n^{\prime}}$ Walsh-Hadamard transform can be performed using $O(2^{n-n^{\prime}}\cdot(n-n^{\prime}))$ bit operations. Hence, the total bit operations for the second step is $O(2^{n^{\prime}}\cdot 2^{n-n^{\prime}}\cdot(n-n^{\prime}))\leq O(2^{n}\cdot k)=O(N\log\log N)$. We thus get the desired total running time. ∎
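The following Python sketch (not from the paper) puts the two ingredients of this proof together over a small field: it builds the table of all length-$K$ transforms and then runs the recursion of Lemma 3.1 with table lookups as the base case. The function names are illustrative, arithmetic is done mod $q$, and the simplifying assumption is that the input length is a power of $K$ (the general case is handled as above).

```python
from itertools import product

def build_wht_table(q, K):
    """Precompute H_K v (entries mod q) for every v in F_q^K, keyed by v."""
    H = [[1]]
    while len(H) < K:                      # grow H by the recursive definition
        H = [row + row for row in H] + [row + [(-x) % q for x in row] for row in H]
    return {v: tuple(sum(H[i][j] * v[j] for j in range(K)) % q for i in range(K))
            for v in product(range(q), repeat=K)}

def wht_with_table(q, K, table, v):
    """Walsh-Hadamard transform over F_q, assuming len(v) is a power of K;
    every length-K sub-transform is answered by a single table lookup."""
    if len(v) == K:
        return list(table[tuple(v)])
    block = len(v) // K
    u = [wht_with_table(q, K, table, v[i * block:(i + 1) * block]) for i in range(K)]
    y = [table[tuple(u[i][j] for i in range(K))] for j in range(block)]
    return [y[j][i] for i in range(K) for j in range(block)]

K = 4
table = build_wht_table(3, K)                      # all 3**4 = 81 small transforms
print(wht_with_table(3, K, table, [1] + [0] * 15)) # H_16 e_1 over F_3: sixteen 1s
```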

4 Matrix Multiplication

Fast algebraic algorithms for matrix multiplication over a field $\mathbb{F}$ critically rely on algebraic identities which take the following form, for positive integers $t,r$, formal variables $X_{i,j},Y_{j,k},Z_{i,k}$ for $i,j,k\in[t]$, and field coefficients $\alpha_{i,j,\ell},\beta_{j,k,\ell},\gamma_{i,k,\ell}\in\mathbb{F}$ for $i,j,k\in[t]$ and $\ell\in[r]$:

$$\sum_{i=1}^{t}\sum_{j=1}^{t}\sum_{k=1}^{t}X_{i,j}Y_{j,k}Z_{i,k}=\sum_{\ell=1}^{r}\left(\sum_{i=1}^{t}\sum_{j=1}^{t}\alpha_{i,j,\ell}X_{i,j}\right)\left(\sum_{j=1}^{t}\sum_{k=1}^{t}\beta_{j,k,\ell}Y_{j,k}\right)\left(\sum_{i=1}^{t}\sum_{k=1}^{t}\gamma_{i,k,\ell}Z_{i,k}\right).\qquad(1)$$

As we will see next, an identity (1) can be used to design a matrix multiplication algorithm which runs using only $O(n^{\log_{t}(r)})$ field operations. For instance, Strassen’s algorithm [Str69] gives an identity (1) with $t=2$ and $r=7$, yielding exponent $\log_{2}(7)<2.81$. The matrix multiplication exponent $\omega$ is defined as the infimum of all numbers such that, for every $\varepsilon>0$, there is a sufficiently large $t$ for which one gets an identity (1) with $r\leq t^{\omega+\varepsilon}$. Indeed, it is known [Str73] that any algebraic algorithm for matrix multiplication can be converted into an identity (1) yielding the same running time in this way, up to a constant factor, so $\omega$ captures the best exponent from any possible algebraic algorithm. The fastest known matrix multiplication algorithm [CW90, AW21] shows that $\omega<2.37286$.

The standard recursive algorithm for multiplying matrices using identity (1) works as described in Algorithm 1 below. It recurses until it gets to a base case of multiplying $n\times n$ matrices when $n\leq S$ for some parameter $S$. In the usual algorithm, one picks $S$ to be a constant, so that such matrices can be multiplied in a constant number of operations; in our improvement, we will pick a larger $S$ and multiply such matrices using a lookup table.

Algorithm 1 Recursive matrix multiplication algorithm, using identity (1), with base case size $S$
1:procedure MM($X,Y\in\mathbb{F}^{n\times n}$)
2:     if $n\leq S$ then
3:         Use base case procedure to multiply $X\times Y$ and output the result.
4:     else
5:         Partition $X$ into a $t\times t$ block matrix, with blocks $X_{i,j}\in\mathbb{F}^{(n/t)\times(n/t)}$ for $i,j\in[t]$.
6:         Similarly partition $Y$ into a $t\times t$ block matrix, with blocks $Y_{j,k}\in\mathbb{F}^{(n/t)\times(n/t)}$ for $j,k\in[t]$.
7:         For $i,k\in[t]$, set the matrices $Z_{i,k}\in\mathbb{F}^{(n/t)\times(n/t)}$ to initially be all 0.
8:         for $\ell\in[r]$ do
9:              Compute $A_{\ell}:=\sum_{i,j\in[t]}\alpha_{i,j,\ell}X_{i,j}$.
10:              Compute $B_{\ell}:=\sum_{j,k\in[t]}\beta_{j,k,\ell}Y_{j,k}$.
11:              Compute $C_{\ell}=\textsc{MM}(A_{\ell},B_{\ell})$ $\triangleright$ Recursively multiply the $(n/t)\times(n/t)$ matrices $A_{\ell}\times B_{\ell}$.
12:              for $i,k\in[t]$ do
13:                  Add $\gamma_{i,k,\ell}C_{\ell}$ to $Z_{i,k}$.
14:              end for
15:         end for
16:         Output $Z\in\mathbb{F}^{n\times n}$ given by the blocks $Z_{i,k}$.
17:     end if
18:end procedure
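For concreteness, here is a Python sketch (not from the paper) of this recursion instantiated with Strassen’s identity ($t=2$, $r=7$), working mod $q$ and taking the base-case procedure as a parameter; the function name `mm_recursive` and the assumption that $n$ is a power of $2$ are illustrative simplifications. With $S$ a constant and a naive base case this is the standard implementation; Theorem 4.2 below instead plugs in a lookup-table base case.

```python
def mm_recursive(X, Y, q, S, base_case):
    """Algorithm 1 with Strassen's identity (t = 2, r = 7), entries mod q.
    Assumes n = len(X) is a power of 2; uses base_case(X, Y) once n <= S."""
    n = len(X)
    if n <= S:
        return base_case(X, Y)
    h = n // 2
    blk = lambda M, r_, c_: [row[c_ * h:(c_ + 1) * h] for row in M[r_ * h:(r_ + 1) * h]]
    add = lambda A, B: [[(a + b) % q for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    sub = lambda A, B: [[(a - b) % q for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    X11, X12, X21, X22 = blk(X, 0, 0), blk(X, 0, 1), blk(X, 1, 0), blk(X, 1, 1)
    Y11, Y12, Y21, Y22 = blk(Y, 0, 0), blk(Y, 0, 1), blk(Y, 1, 0), blk(Y, 1, 1)
    rec = lambda A, B: mm_recursive(A, B, q, S, base_case)
    M1 = rec(add(X11, X22), add(Y11, Y22))       # the r = 7 recursive products
    M2 = rec(add(X21, X22), Y11)
    M3 = rec(X11, sub(Y12, Y22))
    M4 = rec(X22, sub(Y21, Y11))
    M5 = rec(add(X11, X12), Y22)
    M6 = rec(sub(X21, X11), add(Y11, Y12))
    M7 = rec(sub(X12, X22), add(Y21, Y22))
    Z11 = add(sub(add(M1, M4), M5), M7)          # combine with the gamma coefficients
    Z12 = add(M3, M5)
    Z21 = add(M2, M4)
    Z22 = add(add(sub(M1, M2), M3), M6)
    return [Z11[i] + Z12[i] for i in range(h)] + [Z21[i] + Z22[i] for i in range(h)]
```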
Lemma 4.1.

Algorithm 1 correctly outputs the product $X\times Y$.

Proof.

For a fixed $i,k\in[t]$, we can see that lines 8-15 of the algorithm will set $Z_{i,k}$ to be the following matrix:

$$\sum_{\ell=1}^{r}\gamma_{i,k,\ell}\cdot C_{\ell}=\sum_{\ell=1}^{r}\gamma_{i,k,\ell}\cdot\left(\sum_{i=1}^{t}\sum_{j=1}^{t}\alpha_{i,j,\ell}X_{i,j}\right)\left(\sum_{j=1}^{t}\sum_{k=1}^{t}\beta_{j,k,\ell}Y_{j,k}\right).$$

Notice in particular that this is the coefficient of the formal variable $Z_{i,k}$ in the right-hand side of the identity (1). Hence, it is also the coefficient of that variable in the left-hand side, namely,

$$\sum_{j=1}^{t}X_{i,j}Y_{j,k}.$$

This is the desired output $(i,k)$ block when multiplying matrices $X,Y$. ∎

Normally one would analyze the recursion for the number of operations performed by this algorithm when $S$ is a constant, and conclude a total operation count of $O(n^{\log_{t}(r)})$. Here we will improve on this over finite fields using a lookup table:

Theorem 4.2.

Fix an identity (1) and let $\tau=\log_{t}(r)$. For any finite field $\mathbb{F}_{q}$ of size $q=O(1)$, and any positive integer $n$, there is an algorithm for multiplying $n\times n$ matrices over $\mathbb{F}_{q}$ which uses $O(n^{\tau}/(\log n)^{\tau/2-1})$ bit operations.

Proof.

Let $s=\left\lfloor\sqrt{\frac{1}{4}\log n/\log q}\right\rfloor$. We begin by iterating over all positive integers $s^{\prime}\leq s$ and all pairs of $s^{\prime}\times s^{\prime}$ matrices $A,B\in\mathbb{F}_{q}^{s^{\prime}\times s^{\prime}}$, computing their product $A\times B$, and storing it in another lookup table. Similar to before, the running time for this step won’t substantially contribute to the final running time, so we give a straightforward upper bound on the time it takes. The number of pairs of matrices we need to multiply is at most $s\cdot(q^{s^{2}})^{2}$, and the time to multiply each pair is at most $O(s^{3})$ by the straightforward algorithm. The total time to create the lookup table is hence at most

$$O(s^{4}\cdot(q^{s^{2}})^{2})\leq O(q^{\frac{1}{2}\log n/\log q}\cdot(\log n)^{2})\leq O(n^{0.5}\cdot(\log n)^{2}).$$

Note that this table’s keys and values are strings of length $O(\log (q^{s^{2}})^{2})\leq O(\log n)$, so lookup table operations can be performed using $O(\log n)$ bit operations.

We now use Algorithm 1, with base case size $S=s$. The base case procedure is that, whenever we need to multiply two $s^{\prime}\times s^{\prime}$ matrices for $s^{\prime}\leq S$, we find the result in the lookup table.

Let $E(m)$ denote the running time of this algorithm to multiply two $m\times m$ matrices. We get $E(m)=O(\log n)$ if $m\leq s$, and if $m>s$, the recurrence

$$E(m)\leq r\cdot E(m/t)+O(m^{2}),\qquad(2)$$

recalling that $r,t$ are constants given in the identity (1). The $O(m^{2})$ term in the right-hand side of Equation (2) counts the bit operations to do a constant number of additions and scalar multiplications of $m/t\times m/t$ matrices. Solving this recurrence yields

$$E(n)=O\left(\left(\frac{n}{s}\right)^{\log_{t}(r)}\cdot\log n\right)=O\left(\frac{n^{\tau}}{(\log n)^{\tau/2-1}}\right),$$

as desired. ∎
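The two new ingredients of this proof, building the table of all small products and answering base cases by lookup, can be sketched in Python as follows (not from the paper; the names `build_mm_table` and `base_case_lookup` are illustrative). Combined with the Strassen sketch after Algorithm 1, one would call `mm_recursive(X, Y, q, S=s, base_case=base_case_lookup(table))`.

```python
from itertools import product

def build_mm_table(q, s):
    """Precompute A*B mod q for all pairs of s'-by-s' matrices, s' <= s."""
    table = {}
    for sp in range(1, s + 1):
        for a in product(range(q), repeat=sp * sp):
            for b in product(range(q), repeat=sp * sp):
                A = [a[i * sp:(i + 1) * sp] for i in range(sp)]
                B = [b[i * sp:(i + 1) * sp] for i in range(sp)]
                table[(sp, a, b)] = tuple(
                    sum(A[i][k] * B[k][j] for k in range(sp)) % q
                    for i in range(sp) for j in range(sp))
    return table

def base_case_lookup(table):
    """Base-case procedure for Algorithm 1: answer small products by lookup."""
    def base_case(X, Y):
        sp = len(X)
        a = tuple(x for row in X for x in row)   # flatten the inputs into keys
        b = tuple(y for row in Y for y in row)
        C = table[(sp, a, b)]
        return [list(C[i * sp:(i + 1) * sp]) for i in range(sp)]
    return base_case

table = build_mm_table(q=2, s=2)                 # all pairs of 1x1 and 2x2 matrices over F_2
base = base_case_lookup(table)
print(base([[1, 1], [0, 1]], [[1, 0], [1, 1]]))  # -> [[0, 1], [1, 1]] over F_2
```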

5 Conclusion

We showed that for two important open problems – determining the complexity of the Walsh-Hadamard transform, and determining the leading constant of Strassen’s algorithm for matrix multiplication – asymptotic improvements over the conjectured optimal arithmetic bounds are possible if one is allowed to use bit operations rather than just arithmetic operations.

Our algorithms only made use of arithmetic operations and lookup table operations, so they could extend to other models of computation as well. One natural question is whether they extend to the standard word RAM model with word size $w=O(\log n)$ for input size $n$. Indeed, operations for the lookup tables we use (with keys and values of $O(\log n)$ bits) require only $O(1)$ word operations in this model.

It is not hard to see that our algorithm for matrix multiplication can be implemented to take advantage of this model, improving Theorem 4.2 to running time $O(n^{\tau}/(\log n)^{\tau/2})$ (improving by a factor of $\log n$).

On the other hand, our algorithm for the Walsh-Hadamard transform seemingly cannot be implemented in the word RAM model to get asymptotic savings. The main culprit is step 2 in the proof of Lemma 3.1: if we want to efficiently use the lookup table to compute $y_{j}=Mx_{j}$, then we first have to permute the bits of the $u_{i}$ vectors so that each $x_{j}$ fits in a single word. In other words, we are given the $u_{i}$ vectors in order, and would like to permute their entries to get the $x_{j}$ vectors in order instead.

Let $s=\Theta(d^{m}/w)$ be the number of words that our input fits into. In general it is known that performing a fixed permutation of the bits of a string contained in $s$ words, for $s\geq w$, requires $\Theta(s\log s)$ word operations [BMM97]. However, our particular permutation can be broken up into $s/w$ different permutations on $w$ words (e.g., the first words in the descriptions of $x_{1},\ldots,x_{w}$ are a permutation of the first words in the descriptions of $u_{1},\ldots,u_{w}$). It can thus be performed in only $O(s\log w)$ word operations.

Since this must be done at all $m$ levels of recursion, it incurs an additional $\Theta(m\cdot d^{m}\cdot\frac{\log w}{w})$ word operations. With the parameter setting we ultimately use in Theorem 3.2, this is $\Theta(N)$ word operations to perform all these permutations. By comparison, it is not too difficult to implement the usual fast Walsh-Hadamard transform to use a total of $O(N)$ word operations as well. Hence, the time to perform permutations (which doesn’t come into play in the arithmetic or bit operation models) swamps any other computational savings in this model, and another approach is needed for a speedup.

Acknowledgements

I would like to thank Dylan McKay and Ryan Williams for invaluable discussions throughout this project, and anonymous reviewers for helpful comments. This research was supported in part by a grant from the Simons Foundation (Grant Number 825870 JA).

References

  • [ADKF70] Vladimir L’vovich Arlazarov, Yefim A Dinitz, MA Kronrod, and Igor Aleksandrovich Faradzhev. On economical construction of the transitive closure of an oriented graph. In Doklady Akademii Nauk, volume 194, pages 487–488. Russian Academy of Sciences, 1970.
  • [AW17] Josh Alman and Ryan Williams. Probabilistic rank and matrix rigidity. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC), pages 641–652, 2017.
  • [AW21] Josh Alman and Virginia Vassilevska Williams. A refined laser method and faster matrix multiplication. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 522–539. SIAM, 2021.
  • [BG21] Davis Blalock and John Guttag. Multiplying matrices without multiplying. In International Conference on Machine Learning, pages 992–1004. PMLR, 2021.
  • [BMM97] Andrej Brodnik, Peter Bro Miltersen, and J Ian Munro. Trans-dichotomous algorithms without multiplication - some upper and lower bounds. In Workshop on Algorithms and Data Structures, pages 426–439. Springer, 1997.
  • [BS19] Gal Beniamini and Oded Schwartz. Faster matrix multiplication via sparse decomposition. In The 31st ACM Symposium on Parallelism in Algorithms and Architectures, pages 11–22, 2019.
  • [Bsh95] Nader H Bshouty. On the additive complexity of $2\times 2$ matrix multiplication. Information processing letters, 56(6):329–335, 1995.
  • [CW90] Don Coppersmith and Shmuel Winograd. Matrix multiplication via arithmetic progressions. Journal of symbolic computation, 9(3):251–280, 1990.
  • [EDS18] Mohab Safey El Din and Éric Schost. Bit complexity for multi-homogeneous polynomial system solving—application to polynomial minimization. Journal of Symbolic Computation, 87:176–206, 2018.
  • [HK10] Tor Helleseth and Alexander Kholosha. New binomial bent functions over the finite fields of odd characteristic. In 2010 IEEE International Symposium on Information Theory, pages 1277–1281. IEEE, 2010.
  • [HSHVDG16] Jianyu Huang, Tyler M Smith, Greg M Henry, and Robert A Van De Geijn. Strassen’s algorithm reloaded. In SC’16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 690–701. IEEE, 2016.
  • [JPK+20] Yongkweon Jeon, Baeseong Park, Se Jung Kwon, Byeongwook Kim, Jeongin Yun, and Dongsoo Lee. Biqgemm: matrix multiplication with lookup table for binary-coding-based quantized dnns. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. IEEE, 2020.
  • [KS20] Elaye Karstadt and Oded Schwartz. Matrix multiplication, a little faster. Journal of the ACM (JACM), 67(1):1–31, 2020.
  • [LC20] Ting Liu and Xuechen Chen. Deep learning-based belief propagation algorithm over non-binary finite fields. In 2020 International conference on wireless communications and signal processing (WCSP), pages 164–169. IEEE, 2020.
  • [Lin91] Andrzej Lingas. Bit complexity of matrix products. Information processing letters, 38(5):237–242, 1991.
  • [Lok09] Satyanarayana V Lokam. Complexity lower bounds using linear algebra. Foundations and Trends® in Theoretical Computer Science, 4(1–2):1–155, 2009.
  • [MQ19] Sihem Mesnager and Longjiang Qu. On two-to-one mappings over finite fields. IEEE Transactions on Information Theory, 65(12):7884–7895, 2019.
  • [Pan81] V Pan. The bit-complexity of arithmetic algorithms. Journal of Algorithms, 2(2):144–163, 1981.
  • [Pro76] Robert L Probert. On the additive complexity of matrix multiplication. SIAM Journal on Computing, 5(2):187–203, 1976.
  • [RL01] B Sundar Rajan and Moon Ho Lee. Quasicyclic dyadic codes in walsh-hadamard domain. In Proceedings. 2001 IEEE International Symposium on Information Theory (IEEE Cat. No. 01CH37252), page 37. IEEE, 2001.
  • [Str69] Volker Strassen. Gaussian elimination is not optimal. Numerische mathematik, 13(4):354–356, 1969.
  • [Str73] Volker Strassen. Vermeidung von divisionen. Journal für die reine und angewandte Mathematik, 264:184–202, 1973.
  • [VDHL13] Joris Van Der Hoeven and Grégoire Lecerf. On the bit-complexity of sparse polynomial and series multiplication. Journal of Symbolic Computation, 50:227–254, 2013.
  • [Win71] Shmuel Winograd. On multiplication of $2\times 2$ matrices. Linear algebra and its applications, 4(4):381–388, 1971.
  • [XQ19] Guangkui Xu and Longjiang Qu. Three classes of minimal linear codes over the finite fields of odd characteristic. IEEE Transactions on Information Theory, 65(11):7067–7078, 2019.
  • [Yu18] Huacheng Yu. An improved combinatorial algorithm for boolean matrix multiplication. Information and Computation, 261:240–247, 2018.