
On Algebraic Constructions of Neural Networks with Small Weights

Kordag Mehmet Kilic, Jin Sima and Jehoshua Bruck Electrical Engineering, California Institute of Technology, USA, {kkilic,jsima,bruck}@caltech.edu
Abstract

Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (the number of distinct functions they can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. The trade-offs among the weight sizes, circuit size and depth are a well-studied topic both in circuit complexity theory and in the practice of neural computation. We propose a new approach for studying these complexity trade-offs by considering a related algebraic framework. Specifically, given a single linear equation with arbitrary coefficients, we would like to express it using a system of linear equations with smaller (even constant) coefficients. The techniques we developed are based on Siegel’s Lemma for the bounds, anti-concentration inequalities for the existential results and extensions of Sylvester-type Hadamard matrices for the constructions.

We explicitly construct a constant weight, optimal size matrix to compute the EQUALITY function (checking if two integers expressed in binary are equal). Computing EQUALITY with a single linear equation requires exponentially large weights. In addition, we prove the existence of the best-known weight size (linear) matrices to compute the COMPARISON function (comparing between two integers expressed in binary). In the context of the circuit complexity theory, our results improve the upper bounds on the weight sizes for the best-known circuit sizes for EQUALITY and COMPARISON.

I Introduction

An n-input Boolean function is a mapping from \{0,1\}^{n} to \{0,1\}. In other words, it is a partitioning of n-bit binary vectors into two sets with labels 0 and 1. In general, we can use systems of linear equations as descriptive models of these two sets of binary vectors. For example, the solution set of the equation \sum_{i=1}^{n}x_{i}=k is the set of n-bit binary vectors X=(x_{1},\dots,x_{n}) with each x_{i}\in\{0,1\} and exactly k 1s. We can ask three important questions: How expressive can a single linear equation be? How many equations do we need to describe a Boolean function? Could we simulate a single equation by a system of equations with smaller integer weights?

Let us begin with an example: the 3-input PARITY function, which labels binary vectors with an odd number of 1s as 1. We can write it in the following form:

PARITY(X)=𝟙{(22x3+21x2+20x1){1,2,4,7}}\text{PARITY}(X)=\mathds{1}\Big{\{}(2^{2}x_{3}+2^{1}x_{2}+2^{0}x_{1})\in\{1,2,4,7\}\Big{\}} (1)

where \mathds{1}\{\cdot\} is the indicator function with outputs 0 or 1. We express the binary vectors as integers by using binary expansions. It can be shown that if the weights are allowed to be exponentially large in n, every Boolean function can be expressed in this form.

Now, suppose that we are only allowed to use a single equality check in an indicator function. Considering the 33-input PARITY function, we can simply obtain

[202122202122202122202122][x1x2x3]=[1247]\begin{bmatrix}2^{0}&2^{1}&2^{2}\\ 2^{0}&2^{1}&2^{2}\\ 2^{0}&2^{1}&2^{2}\\ 2^{0}&2^{1}&2^{2}\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\end{bmatrix}=\begin{bmatrix}1\\ 2\\ 4\\ 7\end{bmatrix} (2)

None of the above equations can be satisfied if X is labeled as 0. Conversely, if X satisfies one of the above equations, we can label it as 1. For an arbitrary Boolean function of n inputs, if we list every integer associated with vectors labeled as 1, the number of rows may become exponentially large in n. Nevertheless, for PARITY, we can compute the function by the following system of equations using smaller weights.

[111111][x1x2x3]=[13]\begin{bmatrix}1&1&1\\ 1&1&1\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\end{bmatrix}=\begin{bmatrix}1\\ 3\end{bmatrix} (3)

Not only is the number of equations reduced, but the weights are also smaller. This phenomenon motivates the following question: For which Boolean functions can we obtain such simplifications in the number of equations and the weight sizes? For PARITY, this simplification is possible because it is a symmetric Boolean function, i.e., the output depends only on the number of 1s in the input X. We are particularly interested in such simplifications from large weights to small weights for a class of Boolean functions called threshold functions. Note that we use the word “large” to refer to quantities exponentially large in n and the word “small” to refer to quantities polynomially large (including O(n^{0})=O(1)) in n.
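As a quick sanity check of this simplification, the following short Python sketch (our own illustration, not part of the original construction) enumerates all 3-bit inputs and confirms that the small-weight system in (3) accepts exactly the vectors with an odd number of 1s:

from itertools import product

def parity(x):
    # 3-input PARITY: output 1 iff the number of 1s is odd
    return sum(x) % 2

def small_weight_system(x):
    # X is labeled 1 iff it satisfies one of the equations in (3):
    # x1 + x2 + x3 = 1  or  x1 + x2 + x3 = 3
    return 1 if sum(x) in (1, 3) else 0

assert all(parity(x) == small_weight_system(x) for x in product((0, 1), repeat=3))
print("The system in (3) computes the 3-input PARITY function.")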

I-A Threshold Functions and Neural Networks

Threshold functions are commonly studied in Boolean analysis and machine learning, as they form the basis of neural networks. A threshold function computes a weighted sum of its binary inputs and feeds it to a threshold or an equality check. If the sum is fed to the former, we call the function a linear threshold function (see (4)); if it is fed to the latter, we call it an exact threshold function (see (5)). We can write an n-input threshold function using the indicator function, where the w_{i} are integer weights and b is a bias term.

f𝒯(X)\displaystyle f_{\mathcal{L}\mathcal{T}}(X) =𝟙{i=1nwixib}\displaystyle=\mathds{1}\Big{\{}\sum_{i=1}^{n}w_{i}x_{i}\geq b\Big{\}} (4)
f(X)\displaystyle f_{\mathcal{E}}(X) =𝟙{i=1nwixi=b}\displaystyle=\mathds{1}\Big{\{}\sum_{i=1}^{n}w_{i}x_{i}=b\Big{\}} (5)

A device computing the corresponding threshold function is called a gate of that type. To illustrate the concept, we define the COMPARISON function (denoted by COMP), which computes whether an n-bit integer X is greater than or equal to another n-bit integer Y. Its exact counterpart is the EQUALITY function (denoted by EQ), which checks whether two n-bit integers are equal (see Figure 1). For example, we can write the EQ function in the following form.

EQ(X,Y)=𝟙{i=1n2i1(xiyi)=0}\text{EQ}(X,Y)=\mathds{1}\Big{\{}\sum_{i=1}^{n}2^{i-1}(x_{i}-y_{i})=0\Big{\}} (6)
Figure 1: The 33-input COMP and EQ functions for integers XX and YY computed by linear threshold and exact threshold gates. A gate with an 𝒯\mathcal{L}\mathcal{T}(or \mathcal{E}) inside is a linear (or exact) threshold gate. More explicitly, we can write COMP(X,Y)=𝟙{4x3+2x2+x14y3+2y2+y1}\text{COMP}(X,Y)=\mathds{1}\{4x_{3}+2x_{2}+x_{1}\geq 4y_{3}+2y_{2}+y_{1}\} and EQ(X,Y)=𝟙{4x3+2x2+x1=4y3+2y2+y1}\text{EQ}(X,Y)=\mathds{1}\{4x_{3}+2x_{2}+x_{1}=4y_{3}+2y_{2}+y_{1}\}
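As a concrete illustration of these single-gate forms (a minimal sketch of ours, assuming the powers-of-two weights shown in Figure 1), the following Python code checks the 3-input COMP and EQ gates exhaustively:

from itertools import product

W = [1, 2, 4]  # weights 2^0, 2^1, 2^2 for (x1, x2, x3) and (y1, y2, y3)

def comp(x, y):
    # COMP(X, Y) = 1{ sum_i 2^(i-1) (x_i - y_i) >= 0 }
    return int(sum(w * (xi - yi) for w, xi, yi in zip(W, x, y)) >= 0)

def eq(x, y):
    # EQ(X, Y) = 1{ sum_i 2^(i-1) (x_i - y_i) = 0 }
    return int(sum(w * (xi - yi) for w, xi, yi in zip(W, x, y)) == 0)

for x in product((0, 1), repeat=3):
    for y in product((0, 1), repeat=3):
        X = sum(w * xi for w, xi in zip(W, x))  # integer encoded by x
        Y = sum(w * yi for w, yi in zip(W, y))  # integer encoded by y
        assert comp(x, y) == int(X >= Y) and eq(x, y) == int(X == Y)
print("Single gates with powers-of-two weights compute COMP and EQ.")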

In general, it is proven that the weights of a threshold function can be represented by O(n\log{n}) bits and that this bound is tight [1, 3, 11, 19]. However, it is possible to construct “small” weight threshold circuits to compute any threshold function [2, 8, 12, 23]. This transformation from a circuit of depth d with exponentially large weights in n to another circuit with polynomially large weights in n is typically within a constant factor of the depth (e.g., d+1 or 3d+3 depending on the context) [8, 25]. For instance, such a transformation would follow immediately if we could replace any “large” weight threshold function with a “small” weight depth-2 circuit, so that the new depth becomes 2d.

It is possible to reduce polynomial size weights to constant weights by recursively replicating the gates that feed the top gate (see Figure 2). Nevertheless, this inevitably introduces a polynomial blow-up in the circuit size. We emphasize that our focus is to achieve this weight size reduction from polynomial weights to constant weights with at most a constant factor blow-up in the circuit size.

Figure 2: An example of a weight transformation for a single gate (in black) to construct constant weight circuits. Different gates are colored in red, blue, and violet and depending on the weight size, each gate is replicated a number of times. In this example, each weight in this construction is at most 3.

For neural networks and learning tasks, the weights are typically finite precision real numbers. To make the computations faster and more power efficient, the weights can be quantized to small integers with a loss in the accuracy of the network output [14, 13]. In practice, given that the neural circuit is large in size and deep in depth, this loss in accuracy might be tolerated. We are interested in how much the size and depth must increase when the weights are made as small as possible. More specifically, our goal is to provide insight into the computation of single threshold functions with large weights using threshold circuits with small weights, by relating theoretical upper and lower bounds to the best known constructions.

In this manner, for the ternary weight case (\{-1,0,1\} as in [17]), we give an explicit and optimal size depth-2 circuit construction for the EQ function. The optimality is guaranteed by achieving the theoretical lower bounds asymptotically up to vanishing terms [16, 20]. We also prove an existential result for COMP constructions that reduces the weight size of the best known constructions. It is not known whether constant weight constructions exist for arbitrary threshold functions without a polynomial blow-up in the circuit size and an increase in the depth.

I-B Bijective Mappings from Finite Fields to Integers

Choosing powers of two as the weight set appears to be important. In fact, we can expand any arbitrary weight in binary and use powers of two as a fixed weight set for any threshold function [16]. This weight set is a choice of convenience and a simple answer to the following question: How small can the elements of \mathcal{W}=\{w_{1},w_{2},\cdots,w_{n}\} be if \mathcal{W} has all distinct subset sums (DSS)? If the weights satisfy this property, called the DSS property, they define a bijection between \{0,1\}^{n} and a set of integers. Erdős conjectured that the largest weight w\in\mathcal{W} must be at least c_{0}2^{n} for some c_{0}>0, and therefore choosing powers of two as the weight set would be asymptotically optimal. The best known construction of such a weight set has largest element 0.22002\cdot 2^{n}, and currently the best lower bound is \Omega(2^{n}/\sqrt{n}) [5, 7, 9].

Now, let us consider the following linear equation where the weights are fixed to the ascending powers of two but xix_{i}s are not necessarily binary. We denote the powers of two by the vector wbw_{b}.

wbTx=i=1n2i1xi=0w_{b}^{T}x=\sum_{i=1}^{n}2^{i-1}x_{i}=0 (7)

As the weights of wbw_{b} define a bijection between nn-bit binary vectors and integers, wbTx=0w_{b}^{T}x=0 does not admit a non-trivial solution for the alphabet {1,0,1}n\{-1,0,1\}^{n}. This is a necessary and sufficient condition to compute the EQ function given in (6). We extend this property to mm many rows to define EQ matrices which give a bijection between {0,1}n\{0,1\}^{n} and m\mathbb{Z}^{m}. Thus, an EQ matrix can be used to compute the EQ function in (6) and our goal is to use smaller weight sizes in the matrix.

Definition 1.

A matrix Am×nA\in\mathbb{Z}^{m\times n} is an EQ matrix if the homogeneous system Ax=0Ax=0 has no non-trivial solutions in {1,0,1}n\{-1,0,1\}^{n}.

Let Am×nA\in\mathbb{Z}^{m\times n} be an EQ matrix with the weight constraint WW\in\mathbb{Z} such that |aij|W|a_{ij}|\leq W for all i,ji,j and let RR denote the rate of the matrix AA, which is n/mn/m. It is clear that any full-rank square matrix is an EQ matrix with R=1R=1. Given any WW, how large can this RR be? For the maximal rate, a necessary condition can be proven by Siegel’s Lemma [22].
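Definition 1 can be checked by brute force for small matrices. The helper below is our own sketch (not from the paper): it tests whether Ax=0 has only the trivial solution over \{-1,0,1\}^{n}; for instance, the single row of powers of two from (7) forms a 1 x n EQ matrix with rate R = n.

from itertools import product
import numpy as np

def is_eq_matrix(A):
    # A is an EQ matrix iff Ax = 0 has no non-trivial solution in {-1,0,1}^n
    A = np.asarray(A)
    for x in product((-1, 0, 1), repeat=A.shape[1]):
        if any(x) and not np.any(A @ np.array(x)):
            return False
    return True

print(is_eq_matrix([[1, 2, 4, 8]]))    # True: powers of two, rate R = 4
print(is_eq_matrix([[1, 1], [2, 2]]))  # False: x = (1, -1) is a non-trivial solution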

Lemma 1 (Siegel’s Lemma (modified)).

Consider any integer matrix Am×nA\in\mathbb{Z}^{m\times n} with m<nm<n and |aij|W|a_{ij}|\leq W for all i,ji,j and some integer WW. Then, Ax=0Ax=0 has a non-trivial solution for an integer vector xnx\in\mathbb{Z}^{n} such that x(nW)mnm||x||_{\infty}\leq(\sqrt{n}W)^{\frac{m}{n-m}}.

It is shown that if m=o(n/lognW)m=o(n/\log{nW}), then any Am×nA\in\mathbb{Z}^{m\times n} with weight constraint WW admits a non-trivial solution in {1,0,1}n\{-1,0,1\}^{n} and cannot be an EQ matrix, i.e., R=O(lognW)R=O(\log{nW}) is tight [16]. A similar result can be obtained by the matrix generalizations of Erdős’ Distinct Subset Sum problem [6].

When m=O(n/lognW)m=O(n/\log{nW}), the story is different. If m=O(n/logn)m=O(n/\log{n}), there exists a matrix A{1,1}m×nA\in\{-1,1\}^{m\times n} such that every non-trivial solution of Ax=0Ax=0 satisfies maxj|xj|c0nmnm\max_{j}|x_{j}|\geq c_{0}\sqrt{n}^{\frac{m}{n-m}} for a positive constant c0c_{0}. This is given by Beck’s Converse Theorem on Siegel’s Lemma [4].

For an explicit construction, it is possible to achieve the optimal rate R=O(lognW)R=O(\log{nW}) if we allow W=poly(n)W=poly(n). This can be done by the Chinese Remainder Theorem (CRT) and the Prime Number Theorem (PNT) [2, 12, 16]. It is known that CRT can be used to define residue codes [18]. For an integer xx and modulo base pp, we denote the modulo operation by [x]p[x]_{p}, which maps the integer to a value in {0,,p1}\{0,...,p-1\}. Suppose 0Z<2n0\leq Z<2^{n} for an integer ZZ. One can encode this integer by the mm-tuple (d1,d2,,dm)(d_{1},d_{2},\cdots,d_{m}) where [Z]pi=di[Z]_{p_{i}}=d_{i} for a prime number pip_{i}. Since we can also encode ZZ by its binary expansion, the CRT gives a bijection between m\mathbb{Z}^{m} and {0,1}n\{0,1\}^{n} as long as p1pm>2np_{1}\cdots p_{m}>2^{n}. By taking modulo pip_{i} of Equation (7), we can obtain the following matrix, defined as a CRT matrix:

\begin{bmatrix}[l][2^{0}]_{3}&[2^{1}]_{3}&[2^{2}]_{3}&[2^{3}]_{3}&[2^{4}]_{3}&[2^{5}]_{3}&[2^{6}]_{3}&[2^{7}]_{3}\\ [2^{0}]_{5}&[2^{1}]_{5}&[2^{2}]_{5}&[2^{3}]_{5}&[2^{4}]_{5}&[2^{5}]_{5}&[2^{6}]_{5}&[2^{7}]_{5}\\ [2^{0}]_{7}&[2^{1}]_{7}&[2^{2}]_{7}&[2^{3}]_{7}&[2^{4}]_{7}&[2^{5}]_{7}&[2^{6}]_{7}&[2^{7}]_{7}\\ [2^{0}]_{11}&[2^{1}]_{11}&[2^{2}]_{11}&[2^{3}]_{11}&[2^{4}]_{11}&[2^{5}]_{11}&[2^{6}]_{11}&[2^{7}]_{11}\end{bmatrix} (8)
=\begin{bmatrix}[r]1&2&1&2&1&2&1&2\\ 1&2&4&3&1&2&4&3\\ 1&2&4&1&2&4&1&2\\ 1&2&4&8&5&10&9&7\end{bmatrix}_{4\times 8} (9)

We have Z<256Z<256 since n=8n=8 and 35711=11553\cdot 5\cdot 7\cdot 11=1155. Therefore, this CRT matrix is an EQ matrix. In general, by the PNT, one needs O(n/logn)O(n/\log{n}) many rows to ensure that p1pm>2np_{1}\cdots p_{m}>2^{n}. Moreover, WW is bounded by the maximum prime size pmp_{m}, which is O(n)O(n) again by the PNT.
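For illustration, the CRT matrix in (8)-(9) can be generated directly from its definition. The following sketch (our own code, assuming the indexing [2^{i-1}]_{p_j} for i = 1, ..., n) reproduces the 4 x 8 matrix and checks the condition p_{1}\cdots p_{m}>2^{n}:

from math import prod
import numpy as np

def crt_matrix(primes, n):
    # Row j holds the residues [2^(i-1)] mod p_j for i = 1, ..., n
    return np.array([[pow(2, i, p) for i in range(n)] for p in primes])

primes, n = [3, 5, 7, 11], 8
A = crt_matrix(primes, n)
print(A)                        # the 4 x 8 matrix in (9)
assert prod(primes) > 2 ** n    # 1155 > 256, so the matrix is an EQ matrix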

However, it is known that constant weight EQ matrices with asymptotically optimal rate exist by Beck’s Converse Theorem on Siegel’s Lemma [4, 16]. In this paper, we give an explicit construction where W=1W=1 and asymptotic efficiency in rate is achieved up to vanishing terms. It is in fact an extension of Sylvester-type Hadamard matrices.

\begin{bmatrix}[r]1&1&1&1&1&1&1&0\\ 1&-1&0&1&-1&0&0&1\\ 1&1&1&-1&-1&-1&0&0\\ 1&-1&0&-1&1&0&0&0\end{bmatrix}_{4\times 8} (10)

One can verify that the matrix in (10) is an EQ matrix; its rate is 2, which is twice the trivial rate and equal to the rate of the CRT matrix in (9). Therefore, due to its optimal weight size, this construction can replace the CRT matrices in the constructions that use EQ matrices.

We can also focus on qq-ary representation of integers in a similar fashion by extending the definition of EQ matrices to this setting. In our analysis, we always treat qq as a constant value.

Definition 2.

A matrix Am×nA\in\mathbb{Z}^{m\times n} is an EQq matrix\textup{EQ}_{q}\textup{ matrix} if the homogeneous system Ax=0Ax=0 has no non-trivial solutions in {q+1,,q1}n\{-q+1,\dots,q-1\}^{n}.

If q=2q=2, then we drop qq from the notation and say that the matrix is an EQ matrix. For the EQq\text{EQ}_{q} matrices, the optimal rate given by Siegel’s Lemma is still R=O(lognW)R=O(\log{nW}) and constant weight constructions exist. We give an extension of our construction to EQq\text{EQ}_{q} matrices where W=1W=1 and asymptotic efficiency in rate is achieved up to constant terms.

I-C Maximum Distance Separable Extensions of EQ Matrices

Residue codes are treated as Maximum Distance Separable (MDS) codes because one can extend the CRT matrix by adding more prime numbers to the matrix (without increasing nn) so that the resulting integer code achieves the Singleton bound in Hamming distance [24]. However, we do not say a CRT matrix is an MDS matrix as this refers to another concept.

Definition 3.

An integer matrix Am×nA\in\mathbb{Z}^{m\times n} (mnm\leq n) is MDS if and only if no m×mm\times m submatrix BB is singular.

Definition 4.

An integer matrix Arm×nA\in\mathbb{Z}^{rm\times n} is MDS for qq-ary bijections with MDS rate rr and EQ rate R=n/mR=n/m if and only if every m×nm\times n submatrix BB is an EQq\textup{EQ}_{q} matrix.

Because Definition 4 considers solutions over a restricted alphabet, we denote such matrices as RMDSq\textit{RMDS}_{q}. Remarkably, as qq\rightarrow\infty, both MDS definitions become the same. Similar to the EQq\text{EQ}_{q} definition, we drop the qq from the notation when q=2q=2. A CRT matrix is not MDS, however, it can be RMDSq\text{RMDS}_{q}.

We can demonstrate the difference of both MDS definitions by the following matrix. This matrix is an RMDS matrix with EQ rate 22 and MDS rate 5/45/4 because any 4×84\times 8 submatrix is an EQ matrix. This is in fact the same matrix in (8) with an additional row with entries [2i1]13[2^{i-1}]_{13} for i{1,,8}i\in\{1,...,8\}.

[1212121212431243124124121248510971248361211]5×8\displaystyle\begin{bmatrix}[r]1&2&1&2&1&2&1&2\\ 1&2&4&3&1&2&4&3\\ 1&2&4&1&2&4&1&2\\ 1&2&4&8&5&10&9&7\\ 1&2&4&8&3&6&12&11\end{bmatrix}_{5\times 8} (11)

Here, the determinant of the 5×55\times 5 submatrix given by the first five columns is 0. Thus, this matrix is not an MDS matrix.

In this work, we are interested in RMDSq\text{RMDS}_{q} matrices. We show that for a constant weight RMDSq\text{RMDS}_{q} matrix, the MDS rate bound r=O(1)r=O(1) is tight. We also provide an existence result for such RMDSq\text{RMDS}_{q} matrices where the weight size is bounded by O(r)O(r) given that the EQ rate is O(logn)O(\log{n}).

The following is a summary of our contributions in this paper.

  • In Section II, we explicitly give a rate-efficient EQ matrix construction with constant entries {1,0,1}\{-1,0,1\} where the optimality is guaranteed up to vanishing terms. This solves an open problem in [16] and [20].

  • In Section III, we prove that the MDS rate rr of an RMDSq\text{RMDS}_{q} matrix with entries from an alphabet 𝒬\mathcal{Q} with cardinality kk should satisfy rkk+1r\leq k^{k+1}. Therefore, constant weight RMDSq\text{RMDS}_{q} matrices can at most achieve r=O(1)r=O(1).

  • In Section III, we provide an existence result for RMDS_{q} matrices with W=O(r) and the optimal EQ rate O(\log{n}). In contrast, the best known results give W=O(rn) with the optimal EQ rate O(\log{n}).

  • In Section IV, we apply our results to circuit complexity theory to obtain better weight sizes with asymptotically no trade-off in the circuit size for the depth-2 EQ and COMP constructions, as shown in Table I.

TABLE I: Results for the Depth-2 Circuit Constructions

Function | This Work: Weight Size | This Work: Constructive | Previous Works: Weight Size | Previous Works: Constructive
EQ       | O(1)                   | Yes                     | O(1) [16]                   | No
         |                        |                         | O(n) [20]                   | Yes
COMP     | O(n)                   | No                      | O(n^2) [2]                  | Yes

II Rate-efficient Constructions with Constant Alphabet Size

The m×nm\times n EQ matrix construction we give here is based on Sylvester’s construction of Hadamard matrices. It is an extension of it to achieve higher rates with the trade-off that the matrix is no longer full-rank.

Theorem 1.

Suppose we are given an EQ matrix A0{1,0,1}m0×n0A_{0}\in\{-1,0,1\}^{m_{0}\times n_{0}}. At iteration kk, we construct the following matrix AkA_{k}:

Ak=[Ak1Ak1Imk1Ak1Ak10]A_{k}=\begin{bmatrix}[c]A_{k-1}&A_{k-1}&I_{m_{k-1}}\\ A_{k-1}&-A_{k-1}&0\end{bmatrix} (12)

AkA_{k} is an EQ matrix with mk=2km0m_{k}=2^{k}m_{0}, nk=2kn0(k2m0n0+1)n_{k}=2^{k}n_{0}(\frac{k}{2}\frac{m_{0}}{n_{0}}+1) for any integer k0k\geq 0.

Proof.

We will apply induction. The case k=0k=0 is trivial by assumption. For the system Akx=zA_{k}x=z, let us partition the vector xx and zz in the following way:

[Ak1Ak1Imk1Ak1Ak10][x(1)x(2)x(3)]=[z(1)z(2)]\begin{bmatrix}[c]A_{k-1}&A_{k-1}&I_{m_{k-1}}\\ A_{k-1}&-A_{k-1}&0\end{bmatrix}\begin{bmatrix}x^{(1)}\\ x^{(2)}\\ x^{(3)}\end{bmatrix}=\begin{bmatrix}z^{(1)}\\ z^{(2)}\end{bmatrix} (13)

Then, setting z=0z=0, we have Ak1x(1)=Ak1x(2)A_{k-1}x^{(1)}=A_{k-1}x^{(2)} by the second row block. Hence, the first row block of the construction implies 2Ak1x(1)+x(3)=02A_{k-1}x^{(1)}+x^{(3)}=0. Each entry of x(3)x^{(3)} is either 0 or a multiple of 22. Since xi{1,0,1}x_{i}\in\{-1,0,1\}, we see that x(3)=0x^{(3)}=0. Then, we obtain Ak1x(1)=Ak1x(2)=0A_{k-1}x^{(1)}=A_{k-1}x^{(2)}=0. Applying the induction hypothesis on Ak1x(1)A_{k-1}x^{(1)} and Ak1x(2)A_{k-1}x^{(2)}, we see that Akx=0A_{k}x=0 admits a unique trivial solution in {1,0,1}nk\{-1,0,1\}^{n_{k}}.

It is not difficult to see that m_{k}=2^{k}m_{0}. For n_{k}, we can apply induction using the recursion n_{k}=2n_{k-1}+m_{k-1}, which yields the stated formula. ∎

To construct the matrix given in (10), we can use A0=I1=[1]A_{0}=I_{1}=\begin{bmatrix}1\end{bmatrix}, which is a trivial rate EQ matrix, and take k=2k=2. By rearrangement of the columns, we also see that there is a 4×44\times 4 Sylvester-type Hadamard matrix in this matrix.
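A direct implementation of the recursion (12) is straightforward. The sketch below (our own code) builds A_{k} from A_{0}=[1] using NumPy block matrices and, for small k, verifies the EQ property by the same brute-force check over \{-1,0,1\}^{n_k} used earlier:

from itertools import product
import numpy as np

def next_eq_matrix(A):
    # One step of the recursion in (12): [[A, A, I], [A, -A, 0]]
    m, n = A.shape
    top = np.hstack([A, A, np.eye(m, dtype=int)])
    bottom = np.hstack([A, -A, np.zeros((m, m), dtype=int)])
    return np.vstack([top, bottom])

def is_eq_matrix(A):
    return not any(any(x) and not np.any(A @ np.array(x))
                   for x in product((-1, 0, 1), repeat=A.shape[1]))

A = np.array([[1]])   # A_0: the trivial 1 x 1 EQ matrix
for k in (1, 2):
    A = next_eq_matrix(A)
    print(k, A.shape, is_eq_matrix(A))   # (2, 3) and (4, 8), both EQ; k = 2 gives (10)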

For the sake of completeness, we will revisit Siegel’s Lemma to prove the best possible rate one can obtain for an EQ matrix with weights {1,0,1}\{-1,0,1\} as similarly done in [16]. We note that the following lemma gives the best possible rate upper bound to the best of our knowledge and our construction asymptotically shows the sharpness of Siegel’s Lemma similar to Beck’s result.

Lemma 2.

For any {1,0,1}m×n\{-1,0,1\}^{m\times n} EQ matrix, R=nm12logn+1R=\frac{n}{m}\leq\frac{1}{2}\log{n}+1.

Proof.

By Siegel’s Lemma, we know that for a matrix Am×nA\in\mathbb{Z}^{m\times n}, the homogeneous system Ax=0Ax=0 attains a non-trivial solution in {1,0,1}n\{-1,0,1\}^{n} when

x(nW)mnm21ϵ||x||_{\infty}\leq(\sqrt{n}W)^{\frac{m}{n-m}}\leq 2^{1-\epsilon} (14)

for some ϵ>0\epsilon>0. Then, since W=1W=1, we deduce that R=nm12(1ϵ)logn+1R=\frac{n}{m}\geq\frac{1}{2(1-\epsilon)}\log{n}+1. Obviously, an EQ matrix cannot obtain such a rate. Taking ϵ0\epsilon\rightarrow 0, we conclude the best upper bound is R12logn+1R\leq\frac{1}{2}\log{n}+1. ∎

Using Lemma 2 for our construction with A0=[1]A_{0}=\begin{bmatrix}1\end{bmatrix}, we compute the upper bound on the rate and the real rate

Rupper\displaystyle R_{upper} =k+1+log(k+2)2,Rconstr=k2+1\displaystyle=\frac{k+1+\log{(k+2)}}{2},\hphantom{a}R_{constr}=\frac{k}{2}+1 (15)
RupperRconstr\displaystyle\frac{R_{upper}}{R_{constr}} =1+log(k+2)1k+2RupperRconstr1\displaystyle=1+\frac{\log{(k+2)}-1}{k+2}\implies\frac{R_{upper}}{R_{constr}}\sim 1 (16)

which concludes that the construction is optimal in rate up to vanishing terms. By deleting columns, we can generalize the result to any n\in\mathbb{Z} to achieve optimality up to a constant factor of 2.

We conjecture that one can extend the columns of any Hadamard matrix to obtain an EQ matrix with entries {1,0,1}\{-1,0,1\} achieving rate optimality up to vanishing terms. In this case, the Hadamard conjecture implies a rich set of EQ matrix constructions.

Given Ax=zmAx=z\in\mathbb{Z}^{m}, we note that there is a linear time decoding algorithm to find x{0,1}nx\in\{0,1\}^{n} uniquely, given in the appendix. This construction can also be generalized to EQq\text{EQ}_{q} matrices with the same proof idea. In this case, the rate is optimal up to a factor of qq.

Theorem 2.

Suppose we are given an EQq\text{EQ}_{q} matrix A0{1,0,1}m0×n0A_{0}\in\{-1,0,1\}^{m_{0}\times n_{0}}. At iteration kk, we construct the following matrix AkA_{k}:

[Ak1Ak1Ak1Ak1Imk1Ak1Ak10000Ak1Ak100000Ak10]\begin{bmatrix}[c]A_{k-1}&A_{k-1}&A_{k-1}&\cdots&A_{k-1}&I_{m_{k-1}}\\ A_{k-1}&-A_{k-1}&0&\cdots&0&0\\ 0&A_{k-1}&-A_{k-1}&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&-A_{k-1}&0\end{bmatrix} (17)

AkA_{k} is an EQq\text{EQ}_{q} matrix with mk=qkm0m_{k}=q^{k}m_{0}, nk=qkn0(kqm0n0+1)n_{k}=q^{k}n_{0}(\frac{k}{q}\frac{m_{0}}{n_{0}}+1) for any integer k0k\geq 0.
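The q-ary recursion (17) can be implemented in the same way: the first block row stacks q copies of A_{k-1} followed by an identity block, and the remaining block rows form a chain of differences. A minimal sketch under these assumptions (our own code, shown for q = 3 starting from A_{0}=[1]) is:

from itertools import product
import numpy as np

def next_eqq_matrix(A, q):
    # One step of the recursion in (17)
    m, n = A.shape
    Z, I = np.zeros((m, n), dtype=int), np.eye(m, dtype=int)
    top = np.hstack([A] * q + [I])
    chain = [np.hstack([Z] * i + [A, -A] + [Z] * (q - 2 - i) + [np.zeros((m, m), dtype=int)])
             for i in range(q - 1)]
    return np.vstack([top] + chain)

def is_eqq_matrix(A, q):
    # EQ_q: Ax = 0 has no non-trivial solution with entries in {-(q-1), ..., q-1}
    alphabet = range(-(q - 1), q)
    return not any(any(x) and not np.any(A @ np.array(x))
                   for x in product(alphabet, repeat=A.shape[1]))

q = 3
A = next_eqq_matrix(np.array([[1]]), q)   # a 3 x 4 matrix
print(A)
print(is_eqq_matrix(A, q))                # True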

III Bounds on the Rate and the Weight Size of RMDSq\text{RMDS}_{q} Matrices

Similar to a CRT-based RMDS matrix, is there a way to extend the EQ matrix given in Section II without a large trade-off in the weight size? We give an upper bound on the MDS rate rr based on the alphabet size of an RMDSq\text{RMDS}_{q} matrix.

Theorem 3.

An RMDSq\text{RMDS}_{q} matrix Arm×nA\in\mathbb{Z}^{rm\times n} with entries from 𝒬={q1,,qk}\mathcal{Q}=\{q_{1},...,q_{k}\} satisfies rkk+1r\leq k^{k+1} given that n>kn>k.

Proof.

The proof is based on a simple counting argument. We rearrange the rows of a matrix in blocks BiB_{i} for i{1,,kk+1}i\in\{1,\cdots,k^{k+1}\} assuming n>kn>k. Here, each block contains rows starting with a prefix of length k+1k+1 in the lexicographical order of the indices, i.e.,

A=[B1B2Bkk+1]=[q1q1q1q1q1q2qkqkqk]A=\begin{bmatrix}B_{1}\\ B_{2}\\ \vdots\\ B_{k^{k+1}}\end{bmatrix}=\begin{bmatrix}q_{1}&q_{1}&\cdots&q_{1}&\cdots\\ q_{1}&q_{1}&\cdots&q_{2}&\cdots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ q_{k}&q_{k}&\cdots&q_{k}&\cdots\end{bmatrix} (18)

For instance, the block B_{1} contains the rows starting with the prefix [q_{1}\ q_{1}\ \cdots\ q_{1}] of length k+1. For each block B_{i}, there is a vector x\in\{-1,0,1\}^{n} such that B_{i}x=0: by the Pigeonhole Principle, two of the k+1 prefix entries must be equal, and placing a +1 and a -1 at those two positions (and 0 elsewhere) gives a non-trivial solution.

Since the matrix is RMDS_{q}, any selection of m rows must form an EQ_{q} matrix. Therefore, each block B_{i} can contain at most m-1 rows. Counting the rows over all k^{k+1} blocks, rm\leq k^{k+1}(m-1)\leq k^{k+1}m and consequently, r\leq k^{k+1}. ∎

The condition that n>kn>k is trivial in the sense that if the weights are constant, then kk is a constant. Therefore, the MDS rate can only be a constant due to Theorem 3.

Corollary 3.1.

An \text{RMDS}_{q} matrix A\in\mathbb{Z}^{rm\times n} with weight size W=O(1) can at most achieve the MDS rate r=O(1).

In Section II, we saw that CRT-based EQ matrix constructions are not optimal in terms of the weight size. For RMDSq\text{RMDS}_{q} matrices, we now show that the weight size can be reduced by a factor of nn. The idea is to use the probabilistic method on the existence of RMDSq\text{RMDS}_{q} matrices.

Lemma 3.

Suppose that a{W,,W}na\in\{-W,\dots,W\}^{n} is uniformly distributed and xi{q+1,,q1}0x_{i}\in\{-q+1,\dots,q-1\}\setminus 0 for i{1,,n}i\in\{1,\dots,n\} are fixed where qq is a constant. Then, for some constant CC,

Pr(aTx=0)CnWPr(a^{T}x=0)\leq\frac{C}{\sqrt{n}W} (19)
Proof.

We start with a statement of the Berry-Esseen Theorem.

Lemma 4 (Berry-Esseen Theorem).

Let X1,,XnX_{1},\dots,X_{n} be independent centered random variables with finite third moments 𝔼[|Xi3|]=ρi\mathbb{E}[|X_{i}^{3}|]=\rho_{i} and let σ2=i=1n𝔼[Xi2]\sigma^{2}=\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}]. Then, for any t>0t>0,

|Pr(i=1nXit)Φ(t)|Cσ3i=1nρi\Big{|}Pr\Big{(}\sum_{i=1}^{n}X_{i}\leq t\Big{)}-\Phi(t)\Big{|}\leq C\sigma^{-3}\sum_{i=1}^{n}\rho_{i} (20)

where CC is an absolute constant and Φ(t)\Phi(t) is the cumulative distribution function of 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}).

Since the density of a normal random variable is uniformly bounded by 1/2πσ21/\sqrt{2\pi\sigma^{2}}, we obtain the following.

Pr(|i=1nXi|t)2t2πσ2+2Cσ3i=1nρiPr\Big{(}\Big{|}\sum_{i=1}^{n}X_{i}\Big{|}\leq t\Big{)}\leq\frac{2t}{\sqrt{2\pi\sigma^{2}}}+2C\sigma^{-3}\sum_{i=1}^{n}\rho_{i} (21)

Let a(n1)a^{(n-1)} denote the (a1,,an1)(a_{1},\dots,a_{n-1}) and similarly, let x(n1)x^{(n-1)} denote (x1,,xn1)(x_{1},\dots,x_{n-1}). By the Total Probability Theorem, we have

Pr(\displaystyle Pr( aTx=0)=Pr(a(n1)Tx(n1)|xn|{W,,W})\displaystyle a^{T}x=0)=Pr\Bigg{(}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\in\{-W,\dots,W\}\Bigg{)}
Pr(aTx=0|a(n1)Tx(n1)|xn|{W,,W})\displaystyle Pr\Bigg{(}a^{T}x=0\Big{|}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\in\{-W,\dots,W\}\Bigg{)}
=Pr(a(n1)Tx(n1)|xn|{W,,W})12W+1\displaystyle=Pr\Bigg{(}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\in\{-W,\dots,W\}\Bigg{)}\frac{1}{2W+1} (22)

where the last line follows from the fact that Pr(an=k)Pr(a_{n}=k) for some k{W,,W}k\in\{-W,\dots,W\} is 12W+1\frac{1}{2W+1}. The other conditional probability term is 0 because Pr(an=k)Pr(a_{n}=k) for any k{W,,W}k\not\in\{-W,\dots,W\} is 0. We will apply Berry-Esseen Theorem to find an upper bound on the first term.

We note that 𝔼[ai2xi2]=xi2(2W+1)2112\mathbb{E}[a_{i}^{2}x_{i}^{2}]=x_{i}^{2}\frac{(2W+1)^{2}-1}{12} and 𝔼[|ai3xi3|]=|xi|322W+1(W(W+1)2)2\mathbb{E}[|a_{i}^{3}x_{i}^{3}|]=|x_{i}|^{3}\frac{2}{2W+1}\Big{(}\frac{W(W+1)}{2}\Big{)}^{2}. Then,

Pr(\displaystyle Pr\Bigg{(} |a(n1)Tx(n1)|xn||W)2W|xn|2π(2W+1)2112i=1n1xi2\displaystyle\Bigg{|}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\Bigg{|}\leq W\Bigg{)}\leq\frac{2W|x_{n}|}{\sqrt{2\pi}\sqrt{\frac{(2W+1)^{2}-1}{12}\sum_{i=1}^{n-1}x_{i}^{2}}}
+C22W+1(W(W+1)2)2i=1n1|xi|3((2W+1)2112)32(i=1n1xi2)32\displaystyle\hphantom{0000000000}+C\frac{\frac{2}{2W+1}\Big{(}\frac{W(W+1)}{2}\Big{)}^{2}\sum_{i=1}^{n-1}|x_{i}|^{3}}{\Big{(}\frac{(2W+1)^{2}-1}{12}\Big{)}^{\frac{3}{2}}\Big{(}\sum_{i=1}^{n-1}x_{i}^{2}\Big{)}^{\frac{3}{2}}} (23)

One can check that the whole RHS is of the form \frac{C^{\prime}}{\sqrt{n-1}} for a constant C^{\prime}, as we assume that q is a constant: we can bound the |x_{i}| in the numerator by q-1 and the |x_{i}| in the denominator by 1, and the orders of the W terms in the numerator and denominator are the same, so a limit argument bounds their ratio as well. Therefore, for some C^{\prime\prime}>0,

Pr(aTx=0)Cn112W+1C′′nWPr(a^{T}x=0)\leq\frac{C^{\prime}}{\sqrt{n-1}}\frac{1}{2W+1}\leq\frac{C^{\prime\prime}}{\sqrt{n}W} (24)

Lemma 3 is related to the Littlewood-Offord problem, and anti-concentration inequalities are typically used in this framework [10, 21]. We specifically use the Berry-Esseen Theorem. Building on this lemma and the union bound, we have the following theorem.
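Lemma 3 can also be sanity-checked numerically. The Monte Carlo sketch below (our own experiment with an arbitrary fixed x, not part of the proof) estimates Pr(a^{T}x=0) for uniform a in \{-W,\dots,W\}^{n} and compares it to the 1/(\sqrt{n}W) scaling:

import numpy as np

rng = np.random.default_rng(0)

def estimate_collision_prob(n, W, q=3, trials=200_000):
    # Fix a non-zero x with entries in {-(q-1), ..., q-1} \ {0}
    x = rng.choice([v for v in range(-(q - 1), q) if v != 0], size=n)
    a = rng.integers(-W, W + 1, size=(trials, n))   # uniform rows in {-W, ..., W}^n
    return np.mean(a @ x == 0)

for n, W in [(16, 1), (64, 1), (64, 4), (256, 4)]:
    print(n, W, estimate_collision_prob(n, W), 1 / (np.sqrt(n) * W))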

Theorem 4.

An RMDSq\text{RMDS}_{q} matrix Mrm×nM\in\mathbb{Z}^{rm\times n} with entries in {W,,W}\{-W,\dots,W\} exists if m=Ω(n/logn)m=\Omega(n/\log{n}) and W=O(r)W=O(r).

Proof.

Let AA be any m×nm\times n submatrix of MM. Then, by the union bound (as done in [15]), we have

Pr(M is not RMDSq)=P\displaystyle Pr(\text{M is not RMDS}_{q})=P (rmm)Pr(A is not EQq)\displaystyle\leq\binom{rm}{m}Pr(\text{A is not EQ}_{q})
rmemPr(A is not EQq)\displaystyle\leq r^{m}e^{m}Pr(\text{A is not EQ}_{q}) (25)

We again use the union bound to sum over all x{q+1,,q1}n0x\in\{-q+1,\dots,q-1\}^{n}\setminus 0 the event that Ax=0Ax=0 by counting non-zero entries in xx by kk. Also, notice the independence of the events that aiTx=0a_{i}^{T}x=0 for i{1,,m}i\in\{1,\dots,m\}. Let x(k)x^{(k)} denote an arbitrary x{q+1,,q1}n0x\in\{-q+1,\dots,q-1\}^{n}\setminus 0 with kk non-zero entries. Then,

Pr(A is not EQq)\displaystyle Pr(\text{A is not EQ}_{q}) x{q+1,,q1}n0i=1mPr(aiTx=0)\displaystyle\leq\sum_{x\in\{-q+1,\dots,q-1\}^{n}\setminus 0}\prod_{i=1}^{m}Pr(a_{i}^{T}x=0) (26)
k=1n(nk)(2(q1))kPr(aTx(k)=0)m\displaystyle\leq\sum_{k=1}^{n}\binom{n}{k}(2(q-1))^{k}Pr(a^{T}x^{(k)}=0)^{m} (27)

We can use Lemma 3 to bound the Pr(aTx(k)=0)Pr(a^{T}x^{(k)}=0) term. Therefore,

P\displaystyle P rmemk=1n(nk)(2(q1))k(CkW)m\displaystyle\leq r^{m}e^{m}\sum_{k=1}^{n}\binom{n}{k}(2(q-1))^{k}\Big{(}\frac{C}{\sqrt{k}W}\Big{)}^{m} (28)
=k=1T1(nk)(2(q1))k(rC1kW)m\displaystyle\hphantom{000}=\sum_{k=1}^{T-1}\binom{n}{k}(2(q-1))^{k}\Big{(}\frac{rC_{1}}{\sqrt{k}W}\Big{)}^{m}
+k=Tn(nk)(2(q1))k(rC1kW)m\displaystyle\hphantom{000000}+\sum_{k=T}^{n}\binom{n}{k}(2(q-1))^{k}\Big{(}\frac{rC_{1}}{\sqrt{k}W}\Big{)}^{m} (29)

for some C1,TC_{1},T\in\mathbb{Z}. We bound the first summation by using (nk)(2(q1))k(n(q1))k\binom{n}{k}(2(q-1))^{k}\leq(n(q-1))^{k} and choosing k=1k=1 for the probability term. This gives a geometric sum from k=1k=1 to k=T1k=T-1.

k=1T1(n(q1))k(rC1W)m=(rC1W)m((n(q1))T1n(q1)1)\displaystyle\sum_{k=1}^{T-1}(n(q-1))^{k}\Big{(}\frac{rC_{1}}{W}\Big{)}^{m}=\Big{(}\frac{rC_{1}}{W}\Big{)}^{m}\Big{(}\frac{(n(q-1))^{T}-1}{n(q-1)-1}\Big{)} (30)

For the second term, we bound the probability term by its largest value \frac{rC_{1}}{\sqrt{T}W} and use \sum_{k=T}^{n}\binom{n}{k}(2(q-1))^{k}\leq\sum_{k=0}^{n}\binom{n}{k}(2(q-1))^{k}=(2(q-1)+1)^{n}. Hence,

P\displaystyle P (rC1W)mC2(n(q1))T+(2(q1)+1)n(rC1TW)m\displaystyle\leq\Big{(}\frac{rC_{1}}{W}\Big{)}^{m}C_{2}(n(q-1))^{T}+(2(q-1)+1)^{n}\Big{(}\frac{rC_{1}}{\sqrt{T}W}\Big{)}^{m} (31)
2C3Tlogn(q1)mlogW/rC1\displaystyle\leq 2^{C_{3}T\log{n(q-1)}-m\log{W/rC_{1}}}
+2nlog(2(q1)+1)mlog(TW/rC1)\displaystyle\hphantom{00000}+2^{n\log{(2(q-1)+1)}-m\log{(\sqrt{T}W/rC_{1})}} (32)

We take T=O(n^{c}) for some 0<c<1. It is easy to see that if m=\Omega(n/\log{n}) and W=O(r), both terms vanish as n goes to infinity.

When r=1r=1, we prove an existential result on EQ matrices already known in [16]. We remark that the proof technique for Theorem 4 is not powerful enough to obtain non-trivial bounds on WW when m=1m=1 to attack Erdős’ conjecture on the DSS weight sets.

The CRT gives an explicit way to construct \text{RMDS}_{q} matrices in \mathbb{Z}^{rm\times n}. By the PNT, we obtain p_{rm}=O(rm\log{rm})=O(rn) given that r=O(n^{c}) for some c>0 and m=O(n/\log{n}). Therefore, Theorem 4 provides a factor of O(n) reduction in the weight size. However, the modular arithmetic properties of the CRT do not carry over to \text{RMDS}_{q} matrices, so an \text{RMDS}_{q} matrix cannot replace a CRT matrix in general (see the Appendix).

IV Applications of the Algebraic Results to Neural Circuits

In this section, we will give EQ and COMP threshold circuit constructions. We note that in our analysis, the size of the bias term is ignored.

One can construct a depth-22 exact threshold circuit with small weights to compute the EQ function [16]. For the first layer, we select the weights for each exact threshold gate as the rows of the EQ matrix. Then, we connect the outputs of the first layer to the top gate, which just computes the mm-input AND function (i.e. 𝟙{z1++zm=m}\mathds{1}\{z_{1}+...+z_{m}=m\} for zi{0,1}z_{i}\in\{0,1\}). In Figure 3, we give an example of EQ constructions.

Figure 3: An example of EQ(X,Y)\text{EQ}(X,Y) constructions with 8 xi=xiyix_{i}^{\prime}=x_{i}-y_{i} inputs (or 16 if xix_{i}s and yiy_{i}s are counted separately) and 5 exact threshold gates (including the top gate). The black(or red) edges correspond to the edges with weight 1 (or -1).
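A small simulation of this depth-2 construction (our own sketch, using the 4 x 8 EQ matrix from (10) for 8-bit inputs) checks that the first-layer exact gates followed by an AND top gate compute EQ:

from itertools import product
import numpy as np

# EQ matrix from (10); its rows are the weights of the first-layer exact gates.
A = np.array([[1,  1, 1,  1,  1,  1, 1, 0],
              [1, -1, 0,  1, -1,  0, 0, 1],
              [1,  1, 1, -1, -1, -1, 0, 0],
              [1, -1, 0, -1,  1,  0, 0, 0]])

def eq_circuit(x, y):
    # First layer: one exact threshold gate per row, applied to x' = x - y
    z = (A @ (np.array(x) - np.array(y)) == 0).astype(int)
    # Top gate: m-input AND, i.e. 1{z_1 + ... + z_m = m}
    return int(z.sum() == len(z))

for x in product((0, 1), repeat=8):
    for y in product((0, 1), repeat=8):
        assert eq_circuit(x, y) == int(x == y)
print("The depth-2 circuit built from (10) computes EQ on 8-bit inputs.")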

We roughly follow the previous works for the construction of COMP [2]. First, let us define F^{(l)}(X,Y)=\sum_{i=l+1}^{n}2^{i-l-1}(x_{i}-y_{i}) so that \mathds{1}\{F^{(l)}(X,Y)\geq 0\} is an (n-l)-bit COMP function for l<n, and F^{(0)}(X,Y)=F(X,Y). We say that X\geq Y when F(X,Y)\geq 0 and X<Y otherwise. We also denote by X^{(l)} the most significant (n-l)-tuple of an n-tuple vector X. We have the following observation.

Lemma 5.

Let F(l)(X,Y)=i=l+1n2il1(xiyi)F^{(l)}(X,Y)=\sum_{i=l+1}^{n}2^{i-l-1}(x_{i}-y_{i}) and F(0)(X,Y)=F(X,Y)F^{(0)}(X,Y)=F(X,Y) for X,Y{0,1}nX,Y\in\{0,1\}^{n}. Then,

F(X,Y)>0\displaystyle F(X,Y)>0 l:F(l)(X,Y)=1\displaystyle\Leftrightarrow\exists l:F^{(l)}(X,Y)=\hphantom{-}1 (33)
F(X,Y)<0\displaystyle F(X,Y)<0 l:F(l)(X,Y)=1\displaystyle\Leftrightarrow\exists l:F^{(l)}(X,Y)=-1 (34)
Proof.

It is easy to see that if X-Y=(0,0,\dots,0,1,\times,\dots,\times), where the vector has a number of leading 0s and each \times denotes any element of \{-1,0,1\}, then F(X,Y)>0 (this is called the domination property). Similarly, for F(X,Y)<0, the vector X-Y must have the form (0,0,\dots,0,-1,\times,\dots,\times), and F(X,Y)=0 if and only if X-Y=(0,\dots,0). The converse directions hold similarly. ∎
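Lemma 5 can be verified exhaustively for small n. The sketch below (our own check) computes F^{(l)}(X,Y) for every l and confirms the equivalences (33) and (34):

from itertools import product

def F(x, y, l=0):
    # F^(l)(X, Y) = sum_{i=l+1}^{n} 2^(i-l-1) (x_i - y_i)
    n = len(x)
    return sum(2 ** (i - l - 1) * (x[i - 1] - y[i - 1]) for i in range(l + 1, n + 1))

n = 6
for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        values = [F(x, y, l) for l in range(n)]
        assert (F(x, y) > 0) == (1 in values)
        assert (F(x, y) < 0) == (-1 in values)
print("Lemma 5 verified exhaustively for n =", n)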

By searching for the (nl)(n-l)-tuple vectors (XY)(l)=(0,,0,1)(X-Y)^{(l)}=(0,...,0,-1) for all l{0,,n1}l\in\{0,...,n-1\}, we can compute the COMP. We claim that if we have an RMDSq\text{RMDS}_{q} matrix Arm×nA\in\mathbb{Z}^{rm\times n}, we can detect such vectors by solving A(l)(XY)(l)=anlA^{(l)}(X-Y)^{(l)}=-a_{n-l} where A(l)A^{(l)} is a truncated version of AA with the first nln-l columns and anla_{n-l} is the (nl)(n-l)-th column. Specifically, we obtain the following:

X<Y\displaystyle X<Y l=0n1𝟙{A(l)(XY)(l)=anl}rm\displaystyle\Rightarrow\sum_{l=0}^{n-1}\mathds{1}\{A^{(l)}(X-Y)^{(l)}=-a_{n-l}\}\geq rm (35)
XY\displaystyle X\geq Y l=0n1𝟙{A(l)(XY)(l)=anl}<n(m1)\displaystyle\Rightarrow\sum_{l=0}^{n-1}\mathds{1}\{A^{(l)}(X-Y)^{(l)}=-a_{n-l}\}<n(m-1) (36)

Here, the indicator function works row-wise, i.e., we have the output vector z{0,1}rmnz\in\{0,1\}^{rmn} such that zrml+i=𝟙{(A(l)(XY)(l))i=(anl)i}z_{rml+i}=\mathds{1}\{(A^{(l)}(X-Y)^{(l)})_{i}=-(a_{n-l})_{i}\} for i{1,,rm}i\in\{1,\dots,rm\} and l{0,,n1}l\in\{0,\dots,n-1\}. We use an RMDS3\text{RMDS}_{3} matrix in the construction to map {1,0,1}nl\{-1,0,1\}^{n-l} vectors bijectively to integer vectors with large Hamming distance. Note that each exact threshold function can be replaced by two linear threshold functions and a summation layer (i.e. 𝟙{F(X)=0}=𝟙{F(X)0}+𝟙{F(X)0}1)\mathds{1}\{F(X)=0\}=\mathds{1}\{F(X)\geq 0\}+\mathds{1}\{-F(X)\geq 0\}-1) which can be absorbed to the top gate (see Figure 4) . This increases the circuit size only by a factor of two.

Figure 4: A construction of an arbitrary exact threshold function (𝟙{F(X)=0}\mathds{1}\{F(X)=0\}) using two linear threshold functions (𝟙{F(X)0}\mathds{1}\{F(X)\geq 0\} and 𝟙{F(X)0}\mathds{1}\{F(X)\leq 0\}) and a summation node. This summation node can be removed if its output is connected to another gate due to linearity.

If F(X,Y)<0F(X,Y)<0, zrml+iz_{rml+i} should be 1 for all i{1,,rm}i\in\{1,\dots,rm\} for some ll by Lemma 5. For F(X,Y)0F(X,Y)\geq 0 and any ll, the maximum number of 1s that can appear in zrml+iz_{rml+i} is upper bounded by m1m-1 because A is RMDS3\text{RMDS}_{3}. Therefore, the maximum number of 1s that can appear in zz is upper bounded by n(m1)n(m-1). A sketch of the construction is given in Figure 5.

Figure 5: A sketch of the COMP(X,Y)\text{COMP}(X,Y) construction where xi=xiyix_{i}^{\prime}=x_{i}-y_{i} using linear threshold gates. Each color specifies an ll value in the construction. If X<YX<Y, all the rmrm many gates in at least one of the colors will give all 11s at the output. Otherwise, all the rmrm many gates in a color will give at most (m1)(m-1) many 11s at the output.

In order to make both cases separable, we choose r=nr=n. At the second layer, the top gate is a MAJORITY gate (𝟙{i=1rmnzi<n(m1)}=𝟙{i=1rmnzin(m1)+1}\mathds{1}\{\sum_{i=1}^{rmn}z_{i}<n(m-1)\}=\mathds{1}\{\sum_{i=1}^{rmn}-z_{i}\geq-n(m-1)+1\}). By Theorem 4, there exists an RMDS3\text{RMDS}_{3} matrix with m=O(n/logn)m=O(n/\log{n}) and W=O(n)W=O(n). Thus, there are rmn+1=O(n3/logn)rmn+1=O(n^{3}/\log{n}) many gates in the circuit, which is the best known result. The same size complexity can be achieved by a CRT-based approach with W=O(n2)W=O(n^{2}) [2].

V Conclusion

We explicitly constructed a rate-efficient constant weight EQ matrix for a previously existential result. Namely, we proved that the CRT matrix is not optimal in weight size to compute the EQ function and obtained the optimal EQ function constructions. For the COMP function, the weight size complexity is improved by a linear factor, using RMDSq\text{RMDS}_{q} matrices and their existence. An open problem is whether similar algebraic constructions can be found for general threshold functions so that these ideas can be developed into a weight quantization technique for neural networks.

Acknowledgement

This research was partially supported by The Carver Mead New Adventure Fund.

References

  • [1] Noga Alon and Văn H Vũ “Anti-Hadamard matrices, coin weighing, threshold gates, and indecomposable hypergraphs” In Journal of Combinatorial Theory, Series A 79.1 Elsevier, 1997, pp. 133–160
  • [2] Kazuyuki Amano and Akira Maruoka “On the complexity of depth-2 circuits with threshold gates” In International Symposium on Mathematical Foundations of Computer Science, 2005, pp. 107–118 Springer
  • [3] László Babai, Kristoffer Arnsfelt Hansen, Vladimir V Podolskii and Xiaoming Sun “Weights of exact threshold functions” In International Symposium on Mathematical Foundations of Computer Science, 2010, pp. 66–77 Springer
  • [4] József Beck “Siegel’s Lemma is sharp” In A journey through discrete mathematics Springer, 2017, pp. 165–206
  • [5] Tom Bohman “A construction for sets of integers with distinct subset sums” In the electronic journal of combinatorics, 1998, pp. R3–R3
  • [6] Simone Costa, Stefano Della Fiore and Marco Dalai “Variations on the Erdős Distinct-sums Problem” In arXiv preprint arXiv:2107.07885, 2021
  • [7] Quentin Dubroff, Jacob Fox and Max Wenqiang Xu “A Note on the Erdős Distinct Subset Sums Problem” In SIAM Journal on Discrete Mathematics 35.1 SIAM, 2021, pp. 322–324
  • [8] Mikael Goldmann and Marek Karpinski “Simulating threshold circuits by majority circuits” In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, 1993, pp. 551–560
  • [9] Richard Guy “Unsolved problems in number theory” Springer Science & Business Media, 2004
  • [10] Gábor Halász “Estimates for the concentration function of combinatorial number theory and probability” In Periodica Mathematica Hungarica 8.3-4 Springer, 1977, pp. 197–211
  • [11] Johan Håstad “On the size of weights for threshold gates” In SIAM Journal on Discrete Mathematics 7.3 SIAM, 1994, pp. 484–492
  • [12] Thomas Hofmeister “A note on the simulation of exponential threshold weights” In International Computing and Combinatorics Conference, 1996, pp. 136–141 Springer
  • [13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv and Yoshua Bengio “Binarized neural networks” In Advances in neural information processing systems 29, 2016, pp. 4107–4115
  • [14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko “Quantization and training of neural networks for efficient integer-arithmetic-only inference” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713
  • [15] Sankeerth Rao Karingula and Shachar Lovett “Singularity of Random Integer Matrices with Large Entries” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021), 2021 Schloss Dagstuhl-Leibniz-Zentrum für Informatik
  • [16] Kordag Mehmet Kilic and Jehoshua Bruck “Neural Network Computations with DOMINATION Functions” In 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 1029–1034 IEEE
  • [17] Fengfu Li, Bo Zhang and Bin Liu “Ternary weight networks” In arXiv preprint arXiv:1605.04711, 2016
  • [18] David Mandelbaum “Error correction in residue arithmetic” In IEEE Transactions on Computers 100.6 IEEE, 1972, pp. 538–545
  • [19] Saburo Muroga “Threshold logic and its applications”, 1971
  • [20] Vwani P Roychowdhury, Alon Orlitsky and Kai-Yeung Siu “Lower bounds on threshold and related circuits via communication complexity” In IEEE Transactions on Information Theory 40.2 IEEE, 1994, pp. 467–474
  • [21] Mark Rudelson and Roman Vershynin “The Littlewood–Offord problem and invertibility of random matrices” In Advances in Mathematics 218.2 Elsevier, 2008, pp. 600–633
  • [22] Carl L Siegel “Über einige anwendungen diophantischer approximationen” In On Some Applications of Diophantine Approximations Springer, 2014, pp. 81–138
  • [23] Kai-Yeung Siu and Jehoshua Bruck “On the power of threshold circuits with small weights” In SIAM Journal on Discrete Mathematics 4.3 SIAM, 1991, pp. 423–435
  • [24] Thian Fatt Tay and Chip-Hong Chang “A non-iterative multiple residue digit error detection and correction algorithm in RRNS” In IEEE transactions on computers 65.2 IEEE, 2015, pp. 396–408
  • [25] Gal Vardi and Ohad Shamir “Neural Networks with Small Weights and Depth-Separation Barriers” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 19433–19442 URL: https://proceedings.neurips.cc/paper/2020/file/e1fe6165cad3f7f3f57d409f78e4415f-Paper.pdf

The Decoding Algorithm: Suppose that Akx=zA_{k}x=z is given where AkA_{k} is an m×nm\times n EQ matrix given in Theorem 1 starting with A0=[1]A_{0}=\begin{bmatrix}1\end{bmatrix}. One can obtain a linear time decoding algorithm in nn by exploiting the recursive structure of the construction. Let us partition xx and zz in the way given in (13). It is clear that x(3)=[z(1)+z(2)]2x^{(3)}=[z^{(1)}+z^{(2)}]_{2}. Also, after computing x(3)x^{(3)}, we find that Ak1x(1)=(z(1)+z(2)x(3))/2A_{k-1}x^{(1)}=(z^{(1)}+z^{(2)}-x^{(3)})/2 and Ak1x(2)=(z(1)z(2)x(3))/2A_{k-1}x^{(2)}=(z^{(1)}-z^{(2)}-x^{(3)})/2. These operations can be done in O(mk1)O(m_{k-1}) time complexity. Let T(m)T(m) denote the time to decode zmz\in\mathbb{Z}^{m}. Then, T(m)=2T(m/2)+O(m)T(m)=2T(m/2)+O(m) and by the Master Theorem, T(m)=O(mlogm)=O(n)T(m)=O(m\log{m})=O(n).
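A direct implementation of this recursive decoder (our own sketch, following the partition in (13) for the construction started from A_{0}=[1]) is given below; it reproduces the worked example that follows.

import numpy as np

def decode(z):
    # Invert A_k x = z for the Theorem 1 construction with A_0 = [1].
    # z has length m_k = 2^k; the output x is ordered as (x^(1), x^(2), x^(3)).
    z = np.asarray(z, dtype=int)
    m = len(z)
    if m == 1:
        return [int(z[0])]                 # base case: A_0 = [1], so x = z
    z1, z2 = z[: m // 2], z[m // 2:]
    x3 = (z1 + z2) % 2                     # x^(3) = [z^(1) + z^(2)]_2
    x1 = decode((z1 + z2 - x3) // 2)       # A_{k-1} x^(1) = (z^(1) + z^(2) - x^(3)) / 2
    x2 = decode((z1 - z2 - x3) // 2)       # A_{k-1} x^(2) = (z^(1) - z^(2) - x^(3)) / 2
    return x1 + x2 + [int(v) for v in x3]

print(decode([4, -2, -1, 0]))   # [0, 1, 0, 0, 1, 1, 1, 0]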

An Example for the Decoding: Consider the following system Ax = b:

[11111110110110011111110011011000][x1x2x3x4x5x6x7x8]=[4210]\begin{bmatrix}[r]1&1&1&1&1&1&1&0\\ 1&-1&0&1&-1&0&0&1\\ 1&1&1&-1&-1&-1&0&0\\ 1&-1&0&-1&1&0&0&0\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\\ x_{4}\\ x_{5}\\ x_{6}\\ x_{7}\\ x_{8}\end{bmatrix}=\begin{bmatrix}4\\ -2\\ -1\\ 0\end{bmatrix}

Here is the first step of the algorithm:

a=[42],b=[10][x7x8]=[a+b]2=[[3]2[2]2]=[10]a=\begin{bmatrix}4\\ -2\end{bmatrix},\hphantom{a}b=\begin{bmatrix}-1\\ 0\end{bmatrix}\Rightarrow\begin{bmatrix}x_{7}\\ x_{8}\end{bmatrix}=[a+b]_{2}=\begin{bmatrix}[r][3]_{2}\\ [-2]_{2}\end{bmatrix}=\begin{bmatrix}1\\ 0\end{bmatrix}

Then, we obtain the two following systems:

[111110][x1x2x3]=a+b[10]2=[11]\displaystyle\begin{bmatrix}[r]1&1&1\\ 1&-1&0\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\\ \end{bmatrix}=\frac{a+b-\begin{bmatrix}1\\ 0\end{bmatrix}}{2}=\begin{bmatrix}1\\ -1\end{bmatrix}
[111110][x4x5x6]=ab[10]2=[21]\displaystyle\begin{bmatrix}[r]1&1&1\\ 1&-1&0\end{bmatrix}\begin{bmatrix}x_{4}\\ x_{5}\\ x_{6}\\ \end{bmatrix}=\frac{a-b-\begin{bmatrix}1\\ 0\end{bmatrix}}{2}=\begin{bmatrix}2\\ -1\end{bmatrix}

For the first system, we find that x3=[11]2=0x_{3}=[1-1]_{2}=0. Then,

x1=1102=0\displaystyle x_{1}=\frac{1-1-0}{2}=0 (37)
x2=1+102=1\displaystyle x_{2}=\frac{1+1-0}{2}=1 (38)

For the second system, we similarly find that x6=[21]2=1x_{6}=[2-1]_{2}=1. Then,

x4=2112=0\displaystyle x_{4}=\frac{2-1-1}{2}=0 (39)
x5=2+112=1\displaystyle x_{5}=\frac{2+1-1}{2}=1 (40)

Thus, we have x=[01001110]Tx=\begin{bmatrix}0&1&0&0&1&1&1&0\end{bmatrix}^{T}.

A Modular Arithmetical Property of the CRT Matrix: Consider the following equation with powers of two in 8 variables and the corresponding 4×84\times 8 CRT matrix, say AA.

wbTx=i=182i1xi=0w_{b}^{T}x=\sum_{i=1}^{8}2^{i-1}x_{i}=0 (41)
\begin{bmatrix}[l][2^{0}]_{3}&[2^{1}]_{3}&[2^{2}]_{3}&[2^{3}]_{3}&[2^{4}]_{3}&[2^{5}]_{3}&[2^{6}]_{3}&[2^{7}]_{3}\\ [2^{0}]_{5}&[2^{1}]_{5}&[2^{2}]_{5}&[2^{3}]_{5}&[2^{4}]_{5}&[2^{5}]_{5}&[2^{6}]_{5}&[2^{7}]_{5}\\ [2^{0}]_{7}&[2^{1}]_{7}&[2^{2}]_{7}&[2^{3}]_{7}&[2^{4}]_{7}&[2^{5}]_{7}&[2^{6}]_{7}&[2^{7}]_{7}\\ [2^{0}]_{11}&[2^{1}]_{11}&[2^{2}]_{11}&[2^{3}]_{11}&[2^{4}]_{11}&[2^{5}]_{11}&[2^{6}]_{11}&[2^{7}]_{11}\end{bmatrix} (42)
=\begin{bmatrix}[r]1&2&1&2&1&2&1&2\\ 1&2&4&3&1&2&4&3\\ 1&2&4&1&2&4&1&2\\ 1&2&4&8&5&10&9&7\end{bmatrix}_{4\times 8} (43)

For the prime p_{i}, the elements in row i are congruent to the corresponding elements of w_{b}. Hence, the i-th element of the vector Ax must be divisible by p_{i} whenever w_{b}^{T}x=0. For instance, one can pick x=\begin{bmatrix}2&1&1&3&0&1&-1&0\end{bmatrix}^{T} as a solution of w_{b}^{T}x=0, and we see that Ax=\begin{bmatrix}12&15&14&33\end{bmatrix}^{T}=\begin{bmatrix}4\cdot 3&3\cdot 5&2\cdot 7&3\cdot 11\end{bmatrix}^{T}. This property is essential in the construction of small weight depth-2 circuits for arbitrary threshold functions, whereas \text{RMDS}_{q} matrices do not necessarily behave in this manner.
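This modular property is easy to check numerically; a short sketch (our own verification of the example above) follows.

import numpy as np

primes, n = [3, 5, 7, 11], 8
wb = np.array([2 ** i for i in range(n)])                          # powers-of-two weights
A = np.array([[pow(2, i, p) for i in range(n)] for p in primes])   # the CRT matrix in (43)

x = np.array([2, 1, 1, 3, 0, 1, -1, 0])
assert wb @ x == 0                   # x is a solution of w_b^T x = 0
print(A @ x)                         # [12 15 14 33]
print((A @ x) % primes)              # [0 0 0 0]: the i-th entry is divisible by p_i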