
On Algebraic Constructions of Neural Networks with Small Weights

Kordag Mehmet Kilic, Jin Sima and Jehoshua Bruck Electrical Engineering, California Institute of Technology, USA, {kkilic,jsima,bruck}@caltech.edu
Abstract

Neural gates compute functions based on weighted sums of the input variables. The expressive power of neural gates (the number of distinct functions they can compute) depends on the weight sizes and, in general, large weights (exponential in the number of inputs) are required. The trade-offs among the weight sizes, circuit size and depth are a well-studied topic both in circuit complexity theory and in the practice of neural computation. We propose a new approach for studying these complexity trade-offs by considering a related algebraic framework. Specifically, given a single linear equation with arbitrary coefficients, we would like to express it using a system of linear equations with smaller (even constant) coefficients. The techniques we developed are based on Siegel’s Lemma for the bounds, anti-concentration inequalities for the existential results and extensions of Sylvester-type Hadamard matrices for the constructions.

We explicitly construct a constant weight, optimal size matrix to compute the EQUALITY function (checking if two integers expressed in binary are equal). Computing EQUALITY with a single linear equation requires exponentially large weights. In addition, we prove the existence of the best-known weight size (linear) matrices to compute the COMPARISON function (comparing between two integers expressed in binary). In the context of the circuit complexity theory, our results improve the upper bounds on the weight sizes for the best-known circuit sizes for EQUALITY and COMPARISON.

I Introduction

An n-input Boolean function is a mapping from \{0,1\}^{n} to \{0,1\}. In other words, it is a partitioning of n-bit binary vectors into two sets with labels 0 and 1. In general, we can use systems of linear equations as descriptive models of these two sets of binary vectors. For example, the solution set of the equation \sum_{i=1}^{n}x_{i}=k is the set of n-bit binary vectors X=(x_{1},\dots,x_{n}) with each x_{i}\in\{0,1\} and exactly k 1s. We can ask three important questions: How expressive can a single linear equation be? How many equations do we need to describe a Boolean function? Could we simulate a single equation by a system of equations with smaller integer weights?

Let us begin with an example: the 3-input PARITY function, which labels binary vectors with an odd number of 1s as 1. We can write it in the following form:

PARITY(X)=𝟙{(22x3+21x2+20x1){1,2,4,7}}\text{PARITY}(X)=\mathds{1}\Big{\{}(2^{2}x_{3}+2^{1}x_{2}+2^{0}x_{1})\in\{1,2,4,7\}\Big{\}} (1)

where \mathds{1}\{\cdot\} is the indicator function with outputs 0 or 1. We express the binary vectors as integers by using binary expansions. It can be shown that if the weights are allowed to be exponentially large in n, every Boolean function can be expressed in this form.

Now, suppose that we are only allowed to use a single equality check in an indicator function. Considering the 33-input PARITY function, we can simply obtain

[202122202122202122202122][x1x2x3]=[1247]\begin{bmatrix}2^{0}&2^{1}&2^{2}\\ 2^{0}&2^{1}&2^{2}\\ 2^{0}&2^{1}&2^{2}\\ 2^{0}&2^{1}&2^{2}\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\end{bmatrix}=\begin{bmatrix}1\\ 2\\ 4\\ 7\end{bmatrix} (2)

None of the above equations can be satisfied if X is labeled as 0. Conversely, if X satisfies one of the above equations, we can label it as 1. For an arbitrary Boolean function of n inputs, if we list every integer associated with vectors labeled as 1, the number of rows may become exponentially large in n. Nevertheless, for PARITY, we can compute the function by the following system of equations using smaller weights.

[111111][x1x2x3]=[13]\begin{bmatrix}1&1&1\\ 1&1&1\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\end{bmatrix}=\begin{bmatrix}1\\ 3\end{bmatrix} (3)

Not only is the number of equations reduced, but the weights are also smaller. This phenomenon motivates the following question: For which Boolean functions can we obtain such simplifications in the number of equations and the weight sizes? For PARITY, this simplification is possible because it is a symmetric Boolean function, i.e., the output depends only on the number of 1s in the input X. We are particularly interested in such simplifications from large weights to small weights for a class of Boolean functions called threshold functions. Note that we use the word “large” to refer to quantities exponentially large in n and the word “small” to refer to quantities polynomially large (including O(n^{0})=O(1)) in n.
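As a quick sanity check of this simplification, the following short Python sketch (our own illustration, not part of the original construction) enumerates all 3-bit inputs and confirms that the small-weight system in (3) accepts exactly the vectors with an odd number of 1s:

from itertools import product

def parity(x):
    # 3-input PARITY: output 1 iff the number of 1s is odd
    return sum(x) % 2

def small_weight_system(x):
    # X is labeled 1 iff it satisfies one of the equations in (3):
    # x1 + x2 + x3 = 1  or  x1 + x2 + x3 = 3
    return 1 if sum(x) in (1, 3) else 0

assert all(parity(x) == small_weight_system(x) for x in product((0, 1), repeat=3))
print("The system in (3) computes the 3-input PARITY function.")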

I-A Threshold Functions and Neural Networks

Threshold functions are commonly studied in Boolean analysis and machine learning, as they form the basis of neural networks. A threshold function computes a weighted sum of its binary inputs and feeds it to a threshold or an equality check. If the sum is fed to the former, we call the function a linear threshold function (see (4)); if it is fed to the latter, we call it an exact threshold function (see (5)). We can write an n-input threshold function using the indicator function, where the w_{i} are integer weights and b is a bias term.

f𝒯(X)\displaystyle f_{\mathcal{L}\mathcal{T}}(X) =𝟙{i=1nwixib}\displaystyle=\mathds{1}\Big{\{}\sum_{i=1}^{n}w_{i}x_{i}\geq b\Big{\}} (4)
f(X)\displaystyle f_{\mathcal{E}}(X) =𝟙{i=1nwixi=b}\displaystyle=\mathds{1}\Big{\{}\sum_{i=1}^{n}w_{i}x_{i}=b\Big{\}} (5)

A device computing the corresponding threshold function is called a gate of that type. To illustrate the concept, we define the COMPARISON function (denoted by COMP), which computes whether an n-bit integer X is greater than or equal to another n-bit integer Y. Its exact counterpart is the EQUALITY function (denoted by EQ), which checks whether two n-bit integers are equal (see Figure 1). For example, we can write the EQ function in the following form.

EQ(X,Y)=𝟙{i=1n2i1(xiyi)=0}\text{EQ}(X,Y)=\mathds{1}\Big{\{}\sum_{i=1}^{n}2^{i-1}(x_{i}-y_{i})=0\Big{\}} (6)
Figure 1: The 33-input COMP and EQ functions for integers XX and YY computed by linear threshold and exact threshold gates. A gate with an 𝒯\mathcal{L}\mathcal{T}(or \mathcal{E}) inside is a linear (or exact) threshold gate. More explicitly, we can write COMP(X,Y)=𝟙{4x3+2x2+x14y3+2y2+y1}\text{COMP}(X,Y)=\mathds{1}\{4x_{3}+2x_{2}+x_{1}\geq 4y_{3}+2y_{2}+y_{1}\} and EQ(X,Y)=𝟙{4x3+2x2+x1=4y3+2y2+y1}\text{EQ}(X,Y)=\mathds{1}\{4x_{3}+2x_{2}+x_{1}=4y_{3}+2y_{2}+y_{1}\}
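As a concrete illustration of these single-gate forms (a minimal sketch of ours, assuming the powers-of-two weights shown in Figure 1), the following Python code checks the 3-input COMP and EQ gates exhaustively:

from itertools import product

W = [1, 2, 4]  # weights 2^0, 2^1, 2^2 for (x1, x2, x3) and (y1, y2, y3)

def comp(x, y):
    # COMP(X, Y) = 1{ sum_i 2^(i-1) (x_i - y_i) >= 0 }
    return int(sum(w * (xi - yi) for w, xi, yi in zip(W, x, y)) >= 0)

def eq(x, y):
    # EQ(X, Y) = 1{ sum_i 2^(i-1) (x_i - y_i) = 0 }
    return int(sum(w * (xi - yi) for w, xi, yi in zip(W, x, y)) == 0)

for x in product((0, 1), repeat=3):
    for y in product((0, 1), repeat=3):
        X = sum(w * xi for w, xi in zip(W, x))  # integer encoded by x
        Y = sum(w * yi for w, yi in zip(W, y))  # integer encoded by y
        assert comp(x, y) == int(X >= Y) and eq(x, y) == int(X == Y)
print("Single gates with powers-of-two weights compute COMP and EQ.")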

In general, it is proven that the weights of a threshold function can be represented by O(n\log{n}) bits and that this bound is tight [1, 3, 11, 19]. However, it is possible to construct “small” weight threshold circuits to compute any threshold function [2, 8, 12, 23]. This transformation from a circuit of depth d with exponentially large weights in n to another circuit with polynomially large weights in n is typically within a constant factor of the depth (e.g., d+1 or 3d+3 depending on the context) [8, 25]. For instance, such a transformation would follow immediately if we could replace any “large” weight threshold function with a “small” weight depth-2 circuit, so that the new depth becomes 2d.

It is possible to reduce polynomial size weights to constant weights by recursively replicating the gates that feed the top gate (see Figure 2). Nevertheless, this inevitably introduces a polynomial blow-up in the circuit size. We emphasize that our focus is to achieve this weight size reduction from polynomial weights to constant weights with at most a constant factor blow-up in the circuit size.

Figure 2: An example of a weight transformation for a single gate (in black) to construct constant weight circuits. Different gates are colored in red, blue, and violet and depending on the weight size, each gate is replicated a number of times. In this example, each weight in this construction is at most 3.

For neural networks and learning tasks, the weights are typically finite precision real numbers. To make the computations faster and more power efficient, the weights can be quantized to small integers with a loss in the accuracy of the network output [14, 13]. In practice, given that the neural circuit is large in size and deep in depth, this loss in accuracy might be tolerated. We are interested in how much the size and depth must increase when the weights are made as small as possible. More specifically, our goal is to provide insight into the computation of single threshold functions with large weights using threshold circuits with small weights, by relating theoretical upper and lower bounds to the best known constructions.

In this manner, for the ternary weight case (\{-1,0,1\} as in [17]), we give an explicit and optimal size depth-2 circuit construction for the EQ function. The optimality is guaranteed by achieving the theoretical lower bounds asymptotically up to vanishing terms [16, 20]. We also prove an existential result for COMP constructions that reduces the weight size of the best known constructions. It is not known whether constant weight constructions exist for arbitrary threshold functions without a polynomial blow-up in the circuit size and an increase in the depth.

I-B Bijective Mappings from Finite Fields to Integers

Choosing powers of two as the weight set appears to be important. In fact, we can expand any arbitrary weight in binary and use powers of two as a fixed weight set for any threshold function [16]. This weight set is a choice of convenience and a simple answer to the following question: How small can the elements of \mathcal{W}=\{w_{1},w_{2},\cdots,w_{n}\} be if \mathcal{W} has all distinct subset sums (DSS)? If the weights satisfy this property, called the DSS property, they define a bijection between \{0,1\}^{n} and a set of integers. Erdős conjectured that the largest weight w\in\mathcal{W} must be at least c_{0}2^{n} for some c_{0}>0, and therefore choosing powers of two as the weight set would be asymptotically optimal. The best known construction of such a weight set has largest element 0.22002\cdot 2^{n}, and currently the best lower bound is \Omega(2^{n}/\sqrt{n}) [5, 7, 9].

Now, let us consider the following linear equation where the weights are fixed to the ascending powers of two but xix_{i}s are not necessarily binary. We denote the powers of two by the vector wbw_{b}.

wbTx=i=1n2i1xi=0w_{b}^{T}x=\sum_{i=1}^{n}2^{i-1}x_{i}=0 (7)

As the weights of wbw_{b} define a bijection between nn-bit binary vectors and integers, wbTx=0w_{b}^{T}x=0 does not admit a non-trivial solution for the alphabet {1,0,1}n\{-1,0,1\}^{n}. This is a necessary and sufficient condition to compute the EQ function given in (6). We extend this property to mm many rows to define EQ matrices which give a bijection between {0,1}n\{0,1\}^{n} and m\mathbb{Z}^{m}. Thus, an EQ matrix can be used to compute the EQ function in (6) and our goal is to use smaller weight sizes in the matrix.

Definition 1.

A matrix Am×nA\in\mathbb{Z}^{m\times n} is an EQ matrix if the homogeneous system Ax=0Ax=0 has no non-trivial solutions in {1,0,1}n\{-1,0,1\}^{n}.

Let Am×nA\in\mathbb{Z}^{m\times n} be an EQ matrix with the weight constraint WW\in\mathbb{Z} such that |aij|W|a_{ij}|\leq W for all i,ji,j and let RR denote the rate of the matrix AA, which is n/mn/m. It is clear that any full-rank square matrix is an EQ matrix with R=1R=1. Given any WW, how large can this RR be? For the maximal rate, a necessary condition can be proven by Siegel’s Lemma [22].
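Definition 1 can be checked by brute force for small matrices. The helper below is our own sketch (not from the paper): it tests whether Ax=0 has only the trivial solution over \{-1,0,1\}^{n}; for instance, the single row of powers of two from (7) forms a 1 x n EQ matrix with rate R = n.

from itertools import product
import numpy as np

def is_eq_matrix(A):
    # A is an EQ matrix iff Ax = 0 has no non-trivial solution in {-1,0,1}^n
    A = np.asarray(A)
    for x in product((-1, 0, 1), repeat=A.shape[1]):
        if any(x) and not np.any(A @ np.array(x)):
            return False
    return True

print(is_eq_matrix([[1, 2, 4, 8]]))    # True: powers of two, rate R = 4
print(is_eq_matrix([[1, 1], [2, 2]]))  # False: x = (1, -1) is a non-trivial solution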

Lemma 1 (Siegel’s Lemma (modified)).

Consider any integer matrix Am×nA\in\mathbb{Z}^{m\times n} with m<nm<n and |aij|W|a_{ij}|\leq W for all i,ji,j and some integer WW. Then, Ax=0Ax=0 has a non-trivial solution for an integer vector xnx\in\mathbb{Z}^{n} such that x(nW)mnm||x||_{\infty}\leq(\sqrt{n}W)^{\frac{m}{n-m}}.

It is shown that if m=o(n/lognW)m=o(n/\log{nW}), then any Am×nA\in\mathbb{Z}^{m\times n} with weight constraint WW admits a non-trivial solution in {1,0,1}n\{-1,0,1\}^{n} and cannot be an EQ matrix, i.e., R=O(lognW)R=O(\log{nW}) is tight [16]. A similar result can be obtained by the matrix generalizations of Erdős’ Distinct Subset Sum problem [6].

When m=O(n/lognW)m=O(n/\log{nW}), the story is different. If m=O(n/logn)m=O(n/\log{n}), there exists a matrix A{1,1}m×nA\in\{-1,1\}^{m\times n} such that every non-trivial solution of Ax=0Ax=0 satisfies maxj|xj|c0nmnm\max_{j}|x_{j}|\geq c_{0}\sqrt{n}^{\frac{m}{n-m}} for a positive constant c0c_{0}. This is given by Beck’s Converse Theorem on Siegel’s Lemma [4].

For an explicit construction, it is possible to achieve the optimal rate R=O(lognW)R=O(\log{nW}) if we allow W=poly(n)W=poly(n). This can be done by the Chinese Remainder Theorem (CRT) and the Prime Number Theorem (PNT) [2, 12, 16]. It is known that CRT can be used to define residue codes [18]. For an integer xx and modulo base pp, we denote the modulo operation by [x]p[x]_{p}, which maps the integer to a value in {0,,p1}\{0,...,p-1\}. Suppose 0Z<2n0\leq Z<2^{n} for an integer ZZ. One can encode this integer by the mm-tuple (d1,d2,,dm)(d_{1},d_{2},\cdots,d_{m}) where [Z]pi=di[Z]_{p_{i}}=d_{i} for a prime number pip_{i}. Since we can also encode ZZ by its binary expansion, the CRT gives a bijection between m\mathbb{Z}^{m} and {0,1}n\{0,1\}^{n} as long as p1pm>2np_{1}\cdots p_{m}>2^{n}. By taking modulo pip_{i} of Equation (7), we can obtain the following matrix, defined as a CRT matrix:

\begin{bmatrix}[l][2^{0}]_{3}&[2^{1}]_{3}&[2^{2}]_{3}&[2^{3}]_{3}&[2^{4}]_{3}&[2^{5}]_{3}&[2^{6}]_{3}&[2^{7}]_{3}\\ [2^{0}]_{5}&[2^{1}]_{5}&[2^{2}]_{5}&[2^{3}]_{5}&[2^{4}]_{5}&[2^{5}]_{5}&[2^{6}]_{5}&[2^{7}]_{5}\\ [2^{0}]_{7}&[2^{1}]_{7}&[2^{2}]_{7}&[2^{3}]_{7}&[2^{4}]_{7}&[2^{5}]_{7}&[2^{6}]_{7}&[2^{7}]_{7}\\ [2^{0}]_{11}&[2^{1}]_{11}&[2^{2}]_{11}&[2^{3}]_{11}&[2^{4}]_{11}&[2^{5}]_{11}&[2^{6}]_{11}&[2^{7}]_{11}\end{bmatrix} (8)
=\begin{bmatrix}[r]1&2&1&2&1&2&1&2\\ 1&2&4&3&1&2&4&3\\ 1&2&4&1&2&4&1&2\\ 1&2&4&8&5&10&9&7\end{bmatrix}_{4\times 8} (9)

We have Z<256Z<256 since n=8n=8 and 35711=11553\cdot 5\cdot 7\cdot 11=1155. Therefore, this CRT matrix is an EQ matrix. In general, by the PNT, one needs O(n/logn)O(n/\log{n}) many rows to ensure that p1pm>2np_{1}\cdots p_{m}>2^{n}. Moreover, WW is bounded by the maximum prime size pmp_{m}, which is O(n)O(n) again by the PNT.
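For illustration, the CRT matrix in (8)-(9) can be generated directly from its definition. The following sketch (our own code, assuming the indexing [2^{i-1}]_{p_j} for i = 1, ..., n) reproduces the 4 x 8 matrix and checks the condition p_{1}\cdots p_{m}>2^{n}:

from math import prod
import numpy as np

def crt_matrix(primes, n):
    # Row j holds the residues [2^(i-1)] mod p_j for i = 1, ..., n
    return np.array([[pow(2, i, p) for i in range(n)] for p in primes])

primes, n = [3, 5, 7, 11], 8
A = crt_matrix(primes, n)
print(A)                        # the 4 x 8 matrix in (9)
assert prod(primes) > 2 ** n    # 1155 > 256, so the matrix is an EQ matrix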

However, it is known that constant weight EQ matrices with asymptotically optimal rate exist by Beck’s Converse Theorem on Siegel’s Lemma [4, 16]. In this paper, we give an explicit construction where W=1W=1 and asymptotic efficiency in rate is achieved up to vanishing terms. It is in fact an extension of Sylvester-type Hadamard matrices.

\begin{bmatrix}[r]1&1&1&1&1&1&1&0\\ 1&-1&0&1&-1&0&0&1\\ 1&1&1&-1&-1&-1&0&0\\ 1&-1&0&-1&1&0&0&0\end{bmatrix}_{4\times 8} (10)

One can verify that the matrix in (10) is an EQ matrix; its rate is 2, which is twice the trivial rate and equal to the rate of the CRT matrix in (9). Therefore, due to its optimal weight size, this construction can replace the CRT matrices in the constructions that use EQ matrices.

We can also focus on qq-ary representation of integers in a similar fashion by extending the definition of EQ matrices to this setting. In our analysis, we always treat qq as a constant value.

Definition 2.

A matrix Am×nA\in\mathbb{Z}^{m\times n} is an EQq matrix\textup{EQ}_{q}\textup{ matrix} if the homogeneous system Ax=0Ax=0 has no non-trivial solutions in {q+1,,q1}n\{-q+1,\dots,q-1\}^{n}.

If q=2q=2, then we drop qq from the notation and say that the matrix is an EQ matrix. For the EQq\text{EQ}_{q} matrices, the optimal rate given by Siegel’s Lemma is still R=O(lognW)R=O(\log{nW}) and constant weight constructions exist. We give an extension of our construction to EQq\text{EQ}_{q} matrices where W=1W=1 and asymptotic efficiency in rate is achieved up to constant terms.

I-C Maximum Distance Separable Extensions of EQ Matrices

Residue codes are treated as Maximum Distance Separable (MDS) codes because one can extend the CRT matrix by adding more prime numbers to the matrix (without increasing nn) so that the resulting integer code achieves the Singleton bound in Hamming distance [24]. However, we do not say a CRT matrix is an MDS matrix as this refers to another concept.

Definition 3.

An integer matrix Am×nA\in\mathbb{Z}^{m\times n} (mnm\leq n) is MDS if and only if no m×mm\times m submatrix BB is singular.

Definition 4.

An integer matrix Arm×nA\in\mathbb{Z}^{rm\times n} is MDS for qq-ary bijections with MDS rate rr and EQ rate R=n/mR=n/m if and only if every m×nm\times n submatrix BB is an EQq\textup{EQ}_{q} matrix.

Because Definition 4 considers solutions over a restricted alphabet, we denote such matrices as RMDSq\textit{RMDS}_{q}. Remarkably, as qq\rightarrow\infty, both MDS definitions become the same. Similar to the EQq\text{EQ}_{q} definition, we drop the qq from the notation when q=2q=2. A CRT matrix is not MDS, however, it can be RMDSq\text{RMDS}_{q}.

We can demonstrate the difference of both MDS definitions by the following matrix. This matrix is an RMDS matrix with EQ rate 22 and MDS rate 5/45/4 because any 4×84\times 8 submatrix is an EQ matrix. This is in fact the same matrix in (8) with an additional row with entries [2i1]13[2^{i-1}]_{13} for i{1,,8}i\in\{1,...,8\}.

[1212121212431243124124121248510971248361211]5×8\displaystyle\begin{bmatrix}[r]1&2&1&2&1&2&1&2\\ 1&2&4&3&1&2&4&3\\ 1&2&4&1&2&4&1&2\\ 1&2&4&8&5&10&9&7\\ 1&2&4&8&3&6&12&11\end{bmatrix}_{5\times 8} (11)

Here, the determinant of the 5×55\times 5 submatrix given by the first five columns is 0. Thus, this matrix is not an MDS matrix.

In this work, we are interested in RMDSq\text{RMDS}_{q} matrices. We show that for a constant weight RMDSq\text{RMDS}_{q} matrix, the MDS rate bound r=O(1)r=O(1) is tight. We also provide an existence result for such RMDSq\text{RMDS}_{q} matrices where the weight size is bounded by O(r)O(r) given that the EQ rate is O(logn)O(\log{n}).

The following is a summary of our contributions in this paper.

  • In Section II, we explicitly give a rate-efficient EQ matrix construction with constant entries {1,0,1}\{-1,0,1\} where the optimality is guaranteed up to vanishing terms. This solves an open problem in [16] and [20].

  • In Section III, we prove that the MDS rate rr of an RMDSq\text{RMDS}_{q} matrix with entries from an alphabet 𝒬\mathcal{Q} with cardinality kk should satisfy rkk+1r\leq k^{k+1}. Therefore, constant weight RMDSq\text{RMDS}_{q} matrices can at most achieve r=O(1)r=O(1).

  • In Section III, we provide an existence result for RMDS_{q} matrices with W=O(r) and the optimal EQ rate O(\log{n}). In contrast, the best known results give W=O(rn) with the optimal EQ rate O(\log{n}).

  • In Section IV, we apply our results to circuit complexity theory to obtain better weight sizes with asymptotically no trade-off in the circuit size for the depth-2 EQ and COMP constructions, as shown in Table I.

TABLE I: Results for the Depth-2 Circuit Constructions

Function | This Work: Weight Size | This Work: Constructive | Previous Works: Weight Size | Previous Works: Constructive
EQ       | O(1)                   | Yes                     | O(1) [16]                   | No
         |                        |                         | O(n) [20]                   | Yes
COMP     | O(n)                   | No                      | O(n^2) [2]                  | Yes

II Rate-efficient Constructions with Constant Alphabet Size

The m×nm\times n EQ matrix construction we give here is based on Sylvester’s construction of Hadamard matrices. It is an extension of it to achieve higher rates with the trade-off that the matrix is no longer full-rank.

Theorem 1.

Suppose we are given an EQ matrix A0{1,0,1}m0×n0A_{0}\in\{-1,0,1\}^{m_{0}\times n_{0}}. At iteration kk, we construct the following matrix AkA_{k}:

Ak=[Ak1Ak1Imk1Ak1Ak10]A_{k}=\begin{bmatrix}[c]A_{k-1}&A_{k-1}&I_{m_{k-1}}\\ A_{k-1}&-A_{k-1}&0\end{bmatrix} (12)

AkA_{k} is an EQ matrix with mk=2km0m_{k}=2^{k}m_{0}, nk=2kn0(k2m0n0+1)n_{k}=2^{k}n_{0}(\frac{k}{2}\frac{m_{0}}{n_{0}}+1) for any integer k0k\geq 0.

Proof.

We will apply induction. The case k=0k=0 is trivial by assumption. For the system Akx=zA_{k}x=z, let us partition the vector xx and zz in the following way:

[Ak1Ak1Imk1Ak1Ak10][x(1)x(2)x(3)]=[z(1)z(2)]\begin{bmatrix}[c]A_{k-1}&A_{k-1}&I_{m_{k-1}}\\ A_{k-1}&-A_{k-1}&0\end{bmatrix}\begin{bmatrix}x^{(1)}\\ x^{(2)}\\ x^{(3)}\end{bmatrix}=\begin{bmatrix}z^{(1)}\\ z^{(2)}\end{bmatrix} (13)

Then, setting z=0z=0, we have Ak1x(1)=Ak1x(2)A_{k-1}x^{(1)}=A_{k-1}x^{(2)} by the second row block. Hence, the first row block of the construction implies 2Ak1x(1)+x(3)=02A_{k-1}x^{(1)}+x^{(3)}=0. Each entry of x(3)x^{(3)} is either 0 or a multiple of 22. Since xi{1,0,1}x_{i}\in\{-1,0,1\}, we see that x(3)=0x^{(3)}=0. Then, we obtain Ak1x(1)=Ak1x(2)=0A_{k-1}x^{(1)}=A_{k-1}x^{(2)}=0. Applying the induction hypothesis on Ak1x(1)A_{k-1}x^{(1)} and Ak1x(2)A_{k-1}x^{(2)}, we see that Akx=0A_{k}x=0 admits a unique trivial solution in {1,0,1}nk\{-1,0,1\}^{n_{k}}.

It is not difficult to see that m_{k}=2^{k}m_{0}. For n_{k}, we can apply induction using the recursion n_{k}=2n_{k-1}+m_{k-1}, which yields the stated formula. ∎

To construct the matrix given in (10), we can use A0=I1=[1]A_{0}=I_{1}=\begin{bmatrix}1\end{bmatrix}, which is a trivial rate EQ matrix, and take k=2k=2. By rearrangement of the columns, we also see that there is a 4×44\times 4 Sylvester-type Hadamard matrix in this matrix.
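A direct implementation of the recursion (12) is straightforward. The sketch below (our own code) builds A_{k} from A_{0}=[1] using NumPy block matrices and, for small k, verifies the EQ property by the same brute-force check over \{-1,0,1\}^{n_k} used earlier:

from itertools import product
import numpy as np

def next_eq_matrix(A):
    # One step of the recursion in (12): [[A, A, I], [A, -A, 0]]
    m, n = A.shape
    top = np.hstack([A, A, np.eye(m, dtype=int)])
    bottom = np.hstack([A, -A, np.zeros((m, m), dtype=int)])
    return np.vstack([top, bottom])

def is_eq_matrix(A):
    return not any(any(x) and not np.any(A @ np.array(x))
                   for x in product((-1, 0, 1), repeat=A.shape[1]))

A = np.array([[1]])   # A_0: the trivial 1 x 1 EQ matrix
for k in (1, 2):
    A = next_eq_matrix(A)
    print(k, A.shape, is_eq_matrix(A))   # (2, 3) and (4, 8), both EQ; k = 2 gives (10)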

For the sake of completeness, we will revisit Siegel’s Lemma to prove the best possible rate one can obtain for an EQ matrix with weights {1,0,1}\{-1,0,1\} as similarly done in [16]. We note that the following lemma gives the best possible rate upper bound to the best of our knowledge and our construction asymptotically shows the sharpness of Siegel’s Lemma similar to Beck’s result.

Lemma 2.

For any {1,0,1}m×n\{-1,0,1\}^{m\times n} EQ matrix, R=nm12logn+1R=\frac{n}{m}\leq\frac{1}{2}\log{n}+1.

Proof.

By Siegel’s Lemma, we know that for a matrix Am×nA\in\mathbb{Z}^{m\times n}, the homogeneous system Ax=0Ax=0 attains a non-trivial solution in {1,0,1}n\{-1,0,1\}^{n} when

x(nW)mnm21ϵ||x||_{\infty}\leq(\sqrt{n}W)^{\frac{m}{n-m}}\leq 2^{1-\epsilon} (14)

for some ϵ>0\epsilon>0. Then, since W=1W=1, we deduce that R=nm12(1ϵ)logn+1R=\frac{n}{m}\geq\frac{1}{2(1-\epsilon)}\log{n}+1. Obviously, an EQ matrix cannot obtain such a rate. Taking ϵ0\epsilon\rightarrow 0, we conclude the best upper bound is R12logn+1R\leq\frac{1}{2}\log{n}+1. ∎

Using Lemma 2 for our construction with A0=[1]A_{0}=\begin{bmatrix}1\end{bmatrix}, we compute the upper bound on the rate and the real rate

Rupper\displaystyle R_{upper} =k+1+log(k+2)2,Rconstr=k2+1\displaystyle=\frac{k+1+\log{(k+2)}}{2},\hphantom{a}R_{constr}=\frac{k}{2}+1 (15)
RupperRconstr\displaystyle\frac{R_{upper}}{R_{constr}} =1+log(k+2)1k+2RupperRconstr1\displaystyle=1+\frac{\log{(k+2)}-1}{k+2}\implies\frac{R_{upper}}{R_{constr}}\sim 1 (16)

which concludes that the construction is optimal in rate up to vanishing terms. By deleting columns, we can generalize the result to any n\in\mathbb{Z} to achieve optimality up to a constant factor of 2.

We conjecture that one can extend the columns of any Hadamard matrix to obtain an EQ matrix with entries {1,0,1}\{-1,0,1\} achieving rate optimality up to vanishing terms. In this case, the Hadamard conjecture implies a rich set of EQ matrix constructions.

Given Ax=zmAx=z\in\mathbb{Z}^{m}, we note that there is a linear time decoding algorithm to find x{0,1}nx\in\{0,1\}^{n} uniquely, given in the appendix. This construction can also be generalized to EQq\text{EQ}_{q} matrices with the same proof idea. In this case, the rate is optimal up to a factor of qq.

Theorem 2.

Suppose we are given an EQq\text{EQ}_{q} matrix A0{1,0,1}m0×n0A_{0}\in\{-1,0,1\}^{m_{0}\times n_{0}}. At iteration kk, we construct the following matrix AkA_{k}:

[Ak1Ak1Ak1Ak1Imk1Ak1Ak10000Ak1Ak100000Ak10]\begin{bmatrix}[c]A_{k-1}&A_{k-1}&A_{k-1}&\cdots&A_{k-1}&I_{m_{k-1}}\\ A_{k-1}&-A_{k-1}&0&\cdots&0&0\\ 0&A_{k-1}&-A_{k-1}&\cdots&0&0\\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&\cdots&-A_{k-1}&0\end{bmatrix} (17)

AkA_{k} is an EQq\text{EQ}_{q} matrix with mk=qkm0m_{k}=q^{k}m_{0}, nk=qkn0(kqm0n0+1)n_{k}=q^{k}n_{0}(\frac{k}{q}\frac{m_{0}}{n_{0}}+1) for any integer k0k\geq 0.
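The q-ary recursion (17) can be implemented in the same way: the first block row stacks q copies of A_{k-1} followed by an identity block, and the remaining block rows form a chain of differences. A minimal sketch under these assumptions (our own code, shown for q = 3 starting from A_{0}=[1]) is:

from itertools import product
import numpy as np

def next_eqq_matrix(A, q):
    # One step of the recursion in (17)
    m, n = A.shape
    Z, I = np.zeros((m, n), dtype=int), np.eye(m, dtype=int)
    top = np.hstack([A] * q + [I])
    chain = [np.hstack([Z] * i + [A, -A] + [Z] * (q - 2 - i) + [np.zeros((m, m), dtype=int)])
             for i in range(q - 1)]
    return np.vstack([top] + chain)

def is_eqq_matrix(A, q):
    # EQ_q: Ax = 0 has no non-trivial solution with entries in {-(q-1), ..., q-1}
    alphabet = range(-(q - 1), q)
    return not any(any(x) and not np.any(A @ np.array(x))
                   for x in product(alphabet, repeat=A.shape[1]))

q = 3
A = next_eqq_matrix(np.array([[1]]), q)   # a 3 x 4 matrix
print(A)
print(is_eqq_matrix(A, q))                # True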

III Bounds on the Rate and the Weight Size of RMDSq\text{RMDS}_{q} Matrices

Similar to a CRT-based RMDS matrix, is there a way to extend the EQ matrix given in Section II without a large trade-off in the weight size? We give an upper bound on the MDS rate rr based on the alphabet size of an RMDSq\text{RMDS}_{q} matrix.

Theorem 3.

An RMDSq\text{RMDS}_{q} matrix Arm×nA\in\mathbb{Z}^{rm\times n} with entries from 𝒬={q1,,qk}\mathcal{Q}=\{q_{1},...,q_{k}\} satisfies rkk+1r\leq k^{k+1} given that n>kn>k.

Proof.

The proof is based on a simple counting argument. We rearrange the rows of a matrix in blocks BiB_{i} for i{1,,kk+1}i\in\{1,\cdots,k^{k+1}\} assuming n>kn>k. Here, each block contains rows starting with a prefix of length k+1k+1 in the lexicographical order of the indices, i.e.,

A=[B1B2Bkk+1]=[q1q1q1q1q1q2qkqkqk]A=\begin{bmatrix}B_{1}\\ B_{2}\\ \vdots\\ B_{k^{k+1}}\end{bmatrix}=\begin{bmatrix}q_{1}&q_{1}&\cdots&q_{1}&\cdots\\ q_{1}&q_{1}&\cdots&q_{2}&\cdots\\ \vdots&\vdots&\ddots&\vdots&\ddots\\ q_{k}&q_{k}&\cdots&q_{k}&\cdots\end{bmatrix} (18)

For instance, the block B_{1} contains the rows starting with the prefix [q_{1}\ q_{1}\ \cdots\ q_{1}] of length k+1. For each block B_{i}, there is a vector x\in\{-1,0,1\}^{n} such that B_{i}x=0: by the Pigeonhole Principle, two of the k+1 prefix entries must be equal, and placing a +1 and a -1 at those two positions (and 0 elsewhere) gives a non-trivial solution.

Since the matrix is RMDS_{q}, any selection of m rows must form an EQ_{q} matrix. Therefore, each block B_{i} can contain at most m-1 rows. Counting the rows over all k^{k+1} blocks, rm\leq k^{k+1}(m-1)\leq k^{k+1}m and consequently, r\leq k^{k+1}. ∎

The condition that n>kn>k is trivial in the sense that if the weights are constant, then kk is a constant. Therefore, the MDS rate can only be a constant due to Theorem 3.

Corollary 3.1.

An \text{RMDS}_{q} matrix A\in\mathbb{Z}^{rm\times n} with weight size W=O(1) can at most achieve the MDS rate r=O(1).

In Section II, we saw that CRT-based EQ matrix constructions are not optimal in terms of the weight size. For RMDSq\text{RMDS}_{q} matrices, we now show that the weight size can be reduced by a factor of nn. The idea is to use the probabilistic method on the existence of RMDSq\text{RMDS}_{q} matrices.

Lemma 3.

Suppose that a{W,,W}na\in\{-W,\dots,W\}^{n} is uniformly distributed and xi{q+1,,q1}0x_{i}\in\{-q+1,\dots,q-1\}\setminus 0 for i{1,,n}i\in\{1,\dots,n\} are fixed where qq is a constant. Then, for some constant CC,

Pr(aTx=0)CnWPr(a^{T}x=0)\leq\frac{C}{\sqrt{n}W} (19)
Proof.

We start with a statement of the Berry-Esseen Theorem.

Lemma 4 (Berry-Esseen Theorem).

Let X1,,XnX_{1},\dots,X_{n} be independent centered random variables with finite third moments 𝔼[|Xi3|]=ρi\mathbb{E}[|X_{i}^{3}|]=\rho_{i} and let σ2=i=1n𝔼[Xi2]\sigma^{2}=\sum_{i=1}^{n}\mathbb{E}[X_{i}^{2}]. Then, for any t>0t>0,

|Pr(i=1nXit)Φ(t)|Cσ3i=1nρi\Big{|}Pr\Big{(}\sum_{i=1}^{n}X_{i}\leq t\Big{)}-\Phi(t)\Big{|}\leq C\sigma^{-3}\sum_{i=1}^{n}\rho_{i} (20)

where CC is an absolute constant and Φ(t)\Phi(t) is the cumulative distribution function of 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}).

Since the density of a normal random variable is uniformly bounded by 1/2πσ21/\sqrt{2\pi\sigma^{2}}, we obtain the following.

Pr(|i=1nXi|t)2t2πσ2+2Cσ3i=1nρiPr\Big{(}\Big{|}\sum_{i=1}^{n}X_{i}\Big{|}\leq t\Big{)}\leq\frac{2t}{\sqrt{2\pi\sigma^{2}}}+2C\sigma^{-3}\sum_{i=1}^{n}\rho_{i} (21)

Let a(n1)a^{(n-1)} denote the (a1,,an1)(a_{1},\dots,a_{n-1}) and similarly, let x(n1)x^{(n-1)} denote (x1,,xn1)(x_{1},\dots,x_{n-1}). By the Total Probability Theorem, we have

Pr(\displaystyle Pr( aTx=0)=Pr(a(n1)Tx(n1)|xn|{W,,W})\displaystyle a^{T}x=0)=Pr\Bigg{(}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\in\{-W,\dots,W\}\Bigg{)}
Pr(aTx=0|a(n1)Tx(n1)|xn|{W,,W})\displaystyle Pr\Bigg{(}a^{T}x=0\Big{|}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\in\{-W,\dots,W\}\Bigg{)}
=Pr(a(n1)Tx(n1)|xn|{W,,W})12W+1\displaystyle=Pr\Bigg{(}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\in\{-W,\dots,W\}\Bigg{)}\frac{1}{2W+1} (22)

where the last line follows from the fact that Pr(an=k)Pr(a_{n}=k) for some k{W,,W}k\in\{-W,\dots,W\} is 12W+1\frac{1}{2W+1}. The other conditional probability term is 0 because Pr(an=k)Pr(a_{n}=k) for any k{W,,W}k\not\in\{-W,\dots,W\} is 0. We will apply Berry-Esseen Theorem to find an upper bound on the first term.

We note that 𝔼[ai2xi2]=xi2(2W+1)2112\mathbb{E}[a_{i}^{2}x_{i}^{2}]=x_{i}^{2}\frac{(2W+1)^{2}-1}{12} and 𝔼[|ai3xi3|]=|xi|322W+1(W(W+1)2)2\mathbb{E}[|a_{i}^{3}x_{i}^{3}|]=|x_{i}|^{3}\frac{2}{2W+1}\Big{(}\frac{W(W+1)}{2}\Big{)}^{2}. Then,

Pr(\displaystyle Pr\Bigg{(} |a(n1)Tx(n1)|xn||W)2W|xn|2π(2W+1)2112i=1n1xi2\displaystyle\Bigg{|}\frac{{a^{(n-1)}}^{T}x^{(n-1)}}{|x_{n}|}\Bigg{|}\leq W\Bigg{)}\leq\frac{2W|x_{n}|}{\sqrt{2\pi}\sqrt{\frac{(2W+1)^{2}-1}{12}\sum_{i=1}^{n-1}x_{i}^{2}}}
+C22W+1(W(W+1)2)2i=1n1|xi|3((2W+1)2112)32(i=1n1xi2)32\displaystyle\hphantom{0000000000}+C\frac{\frac{2}{2W+1}\Big{(}\frac{W(W+1)}{2}\Big{)}^{2}\sum_{i=1}^{n-1}|x_{i}|^{3}}{\Big{(}\frac{(2W+1)^{2}-1}{12}\Big{)}^{\frac{3}{2}}\Big{(}\sum_{i=1}^{n-1}x_{i}^{2}\Big{)}^{\frac{3}{2}}} (23)

One can check that the whole RHS is of the form \frac{C^{\prime}}{\sqrt{n-1}} for a constant C^{\prime}, as we assume that q is a constant: we can bound the |x_{i}| in the numerator by q-1 and the |x_{i}| in the denominator by 1, and the orders of the W terms in the numerator and denominator are the same, so a limit argument bounds their ratio as well. Therefore, for some C^{\prime\prime}>0,

Pr(aTx=0)Cn112W+1C′′nWPr(a^{T}x=0)\leq\frac{C^{\prime}}{\sqrt{n-1}}\frac{1}{2W+1}\leq\frac{C^{\prime\prime}}{\sqrt{n}W} (24)

Lemma 3 is related to the Littlewood-Offord problem, and anti-concentration inequalities are typically used in this framework [10, 21]. We specifically use the Berry-Esseen Theorem. Building on this lemma and the union bound, we have the following theorem.
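Lemma 3 can also be sanity-checked numerically. The Monte Carlo sketch below (our own experiment with an arbitrary fixed x, not part of the proof) estimates Pr(a^{T}x=0) for uniform a in \{-W,\dots,W\}^{n} and compares it to the 1/(\sqrt{n}W) scaling:

import numpy as np

rng = np.random.default_rng(0)

def estimate_collision_prob(n, W, q=3, trials=200_000):
    # Fix a non-zero x with entries in {-(q-1), ..., q-1} \ {0}
    x = rng.choice([v for v in range(-(q - 1), q) if v != 0], size=n)
    a = rng.integers(-W, W + 1, size=(trials, n))   # uniform rows in {-W, ..., W}^n
    return np.mean(a @ x == 0)

for n, W in [(16, 1), (64, 1), (64, 4), (256, 4)]:
    print(n, W, estimate_collision_prob(n, W), 1 / (np.sqrt(n) * W))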

Theorem 4.

An RMDSq\text{RMDS}_{q} matrix Mrm×nM\in\mathbb{Z}^{rm\times n} with entries in {W,,W}\{-W,\dots,W\} exists if m=Ω(n/logn)m=\Omega(n/\log{n}) and W=O(r)W=O(r).

Proof.

Let AA be any m×nm\times n submatrix of MM. Then, by the union bound (as done in [15]), we have

Pr(M is not RMDSq)=P\displaystyle Pr(\text{M is not RMDS}_{q})=P (rmm)Pr(A is not EQq)\displaystyle\leq\binom{rm}{m}Pr(\text{A is not EQ}_{q})
rmemPr(A is not EQq)\displaystyle\leq r^{m}e^{m}Pr(\text{A is not EQ}_{q}) (25)

We again use the union bound to sum over all x{q+1,,q1}n0x\in\{-q+1,\dots,q-1\}^{n}\setminus 0 the event that Ax=0Ax=0 by counting non-zero entries in xx by kk. Also, notice the independence of the events that aiTx=0a_{i}^{T}x=0 for i{1,,m}i\in\{1,\dots,m\}. Let x(k)x^{(k)} denote an arbitrary x{q+1,,q1}n0x\in\{-q+1,\dots,q-1\}^{n}\setminus 0 with kk non-zero entries. Then,

Pr(A is not EQq)\displaystyle Pr(\text{A is not EQ}_{q}) x{q+1,,q1}n0i=1mPr(aiTx=0)\displaystyle\leq\sum_{x\in\{-q+1,\dots,q-1\}^{n}\setminus 0}\prod_{i=1}^{m}Pr(a_{i}^{T}x=0) (26)
k=1n(nk)(2(q1))kPr(aTx(k)=0)m\displaystyle\leq\sum_{k=1}^{n}\binom{n}{k}(2(q-1))^{k}Pr(a^{T}x^{(k)}=0)^{m} (27)

We can use Lemma 3 to bound the Pr(aTx(k)=0)Pr(a^{T}x^{(k)}=0) term. Therefore,

P\displaystyle P rmemk=1n(nk)(2(q1))k(CkW)m\displaystyle\leq r^{m}e^{m}\sum_{k=1}^{n}\binom{n}{k}(2(q-1))^{k}\Big{(}\frac{C}{\sqrt{k}W}\Big{)}^{m} (28)
=k=1T1(nk)(2(q1))k(rC1kW)m\displaystyle\hphantom{000}=\sum_{k=1}^{T-1}\binom{n}{k}(2(q-1))^{k}\Big{(}\frac{rC_{1}}{\sqrt{k}W}\Big{)}^{m}
+k=Tn(nk)(2(q1))k(rC1kW)m\displaystyle\hphantom{000000}+\sum_{k=T}^{n}\binom{n}{k}(2(q-1))^{k}\Big{(}\frac{rC_{1}}{\sqrt{k}W}\Big{)}^{m} (29)

for some C1,TC_{1},T\in\mathbb{Z}. We bound the first summation by using (nk)(2(q1))k(n(q1))k\binom{n}{k}(2(q-1))^{k}\leq(n(q-1))^{k} and choosing k=1k=1 for the probability term. This gives a geometric sum from k=1k=1 to k=T1k=T-1.

k=1T1(n(q1))k(rC1W)m=(rC1W)m((n(q1))T1n(q1)1)\displaystyle\sum_{k=1}^{T-1}(n(q-1))^{k}\Big{(}\frac{rC_{1}}{W}\Big{)}^{m}=\Big{(}\frac{rC_{1}}{W}\Big{)}^{m}\Big{(}\frac{(n(q-1))^{T}-1}{n(q-1)-1}\Big{)} (30)

For the second term, we bound the probability term by its largest value \frac{rC_{1}}{\sqrt{T}W} and use \sum_{k=T}^{n}\binom{n}{k}(2(q-1))^{k}\leq\sum_{k=0}^{n}\binom{n}{k}(2(q-1))^{k}=(2(q-1)+1)^{n}. Hence,

P\displaystyle P (rC1W)mC2(n(q1))T+(2(q1)+1)n(rC1TW)m\displaystyle\leq\Big{(}\frac{rC_{1}}{W}\Big{)}^{m}C_{2}(n(q-1))^{T}+(2(q-1)+1)^{n}\Big{(}\frac{rC_{1}}{\sqrt{T}W}\Big{)}^{m} (31)
2C3Tlogn(q1)mlogW/rC1\displaystyle\leq 2^{C_{3}T\log{n(q-1)}-m\log{W/rC_{1}}}
+2nlog(2(q1)+1)mlog(TW/rC1)\displaystyle\hphantom{00000}+2^{n\log{(2(q-1)+1)}-m\log{(\sqrt{T}W/rC_{1})}} (32)

We take T=O(n^{c}) for some 0<c<1. It is easy to see that if m=\Omega(n/\log{n}) and W=O(r), both terms vanish as n goes to infinity.

When r=1r=1, we prove an existential result on EQ matrices already known in [16]. We remark that the proof technique for Theorem 4 is not powerful enough to obtain non-trivial bounds on WW when m=1m=1 to attack Erdős’ conjecture on the DSS weight sets.

The CRT gives an explicit way to construct \text{RMDS}_{q} matrices in \mathbb{Z}^{rm\times n}. By the PNT, we obtain p_{rm}=O(rm\log{rm})=O(rn) given that r=O(n^{c}) for some c>0 and m=O(n/\log{n}). Therefore, Theorem 4 provides a factor of O(n) reduction in the weight size. However, the modular arithmetic properties of the CRT do not carry over to \text{RMDS}_{q} matrices, so an \text{RMDS}_{q} matrix cannot replace a CRT matrix in general (see the Appendix).

IV Applications of the Algebraic Results to Neural Circuits

In this section, we will give EQ and COMP threshold circuit constructions. We note that in our analysis, the size of the bias term is ignored.

One can construct a depth-22 exact threshold circuit with small weights to compute the EQ function [16]. For the first layer, we select the weights for each exact threshold gate as the rows of the EQ matrix. Then, we connect the outputs of the first layer to the top gate, which just computes the mm-input AND function (i.e. 𝟙{z1++zm=m}\mathds{1}\{z_{1}+...+z_{m}=m\} for zi{0,1}z_{i}\in\{0,1\}). In Figure 3, we give an example of EQ constructions.

Figure 3: An example of EQ(X,Y)\text{EQ}(X,Y) constructions with 8 xi=xiyix_{i}^{\prime}=x_{i}-y_{i} inputs (or 16 if xix_{i}s and yiy_{i}s are counted separately) and 5 exact threshold gates (including the top gate). The black(or red) edges correspond to the edges with weight 1 (or -1).
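A small simulation of this depth-2 construction (our own sketch, using the 4 x 8 EQ matrix from (10) for 8-bit inputs) checks that the first-layer exact gates followed by an AND top gate compute EQ:

from itertools import product
import numpy as np

# EQ matrix from (10); its rows are the weights of the first-layer exact gates.
A = np.array([[1,  1, 1,  1,  1,  1, 1, 0],
              [1, -1, 0,  1, -1,  0, 0, 1],
              [1,  1, 1, -1, -1, -1, 0, 0],
              [1, -1, 0, -1,  1,  0, 0, 0]])

def eq_circuit(x, y):
    # First layer: one exact threshold gate per row, applied to x' = x - y
    z = (A @ (np.array(x) - np.array(y)) == 0).astype(int)
    # Top gate: m-input AND, i.e. 1{z_1 + ... + z_m = m}
    return int(z.sum() == len(z))

for x in product((0, 1), repeat=8):
    for y in product((0, 1), repeat=8):
        assert eq_circuit(x, y) == int(x == y)
print("The depth-2 circuit built from (10) computes EQ on 8-bit inputs.")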

We roughly follow the previous works for the construction of COMP [2]. First, let us define F^{(l)}(X,Y)=\sum_{i=l+1}^{n}2^{i-l-1}(x_{i}-y_{i}) so that \mathds{1}\{F^{(l)}(X,Y)\geq 0\} is an (n-l)-bit COMP function for l<n, and F^{(0)}(X,Y)=F(X,Y). We say that X\geq Y when F(X,Y)\geq 0 and X<Y otherwise. We also denote by X^{(l)} the most significant (n-l)-tuple of an n-tuple vector X. We have the following observation.

Lemma 5.

Let F(l)(X,Y)=i=l+1n2il1(xiyi)F^{(l)}(X,Y)=\sum_{i=l+1}^{n}2^{i-l-1}(x_{i}-y_{i}) and F(0)(X,Y)=F(X,Y)F^{(0)}(X,Y)=F(X,Y) for X,Y{0,1}nX,Y\in\{0,1\}^{n}. Then,

F(X,Y)>0\displaystyle F(X,Y)>0 l:F(l)(X,Y)=1\displaystyle\Leftrightarrow\exists l:F^{(l)}(X,Y)=\hphantom{-}1 (33)
F(X,Y)<0\displaystyle F(X,Y)<0 l:F(l)(X,Y)=1\displaystyle\Leftrightarrow\exists l:F^{(l)}(X,Y)=-1 (34)
Proof.

It is easy to see that if X-Y=(0,0,\dots,0,1,\times,\dots,\times), where the vector has a number of leading 0s and each \times denotes any element of \{-1,0,1\}, then F(X,Y)>0 (this is called the domination property). Similarly, for F(X,Y)<0, the vector X-Y must have the form (0,0,\dots,0,-1,\times,\dots,\times), and F(X,Y)=0 if and only if X-Y=(0,\dots,0). The converse directions hold similarly. ∎
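Lemma 5 can be verified exhaustively for small n. The sketch below (our own check) computes F^{(l)}(X,Y) for every l and confirms the equivalences (33) and (34):

from itertools import product

def F(x, y, l=0):
    # F^(l)(X, Y) = sum_{i=l+1}^{n} 2^(i-l-1) (x_i - y_i)
    n = len(x)
    return sum(2 ** (i - l - 1) * (x[i - 1] - y[i - 1]) for i in range(l + 1, n + 1))

n = 6
for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        values = [F(x, y, l) for l in range(n)]
        assert (F(x, y) > 0) == (1 in values)
        assert (F(x, y) < 0) == (-1 in values)
print("Lemma 5 verified exhaustively for n =", n)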

By searching for the (nl)(n-l)-tuple vectors (XY)(l)=(0,,0,1)(X-Y)^{(l)}=(0,...,0,-1) for all l{0,,n1}l\in\{0,...,n-1\}, we can compute the COMP. We claim that if we have an RMDSq\text{RMDS}_{q} matrix Arm×nA\in\mathbb{Z}^{rm\times n}, we can detect such vectors by solving A(l)(XY)(l)=anlA^{(l)}(X-Y)^{(l)}=-a_{n-l} where A(l)A^{(l)} is a truncated version of AA with the first nln-l columns and anla_{n-l} is the (nl)(n-l)-th column. Specifically, we obtain the following:

X<Y\displaystyle X<Y l=0n1𝟙{A(l)(XY)(l)=anl}rm\displaystyle\Rightarrow\sum_{l=0}^{n-1}\mathds{1}\{A^{(l)}(X-Y)^{(l)}=-a_{n-l}\}\geq rm (35)
XY\displaystyle X\geq Y l=0n1𝟙{A(l)(XY)(l)=anl}<n(m1)\displaystyle\Rightarrow\sum_{l=0}^{n-1}\mathds{1}\{A^{(l)}(X-Y)^{(l)}=-a_{n-l}\}<n(m-1) (36)

Here, the indicator function works row-wise, i.e., we have the output vector z{0,1}rmnz\in\{0,1\}^{rmn} such that zrml+i=𝟙{(A(l)(XY)(l))i=(anl)i}z_{rml+i}=\mathds{1}\{(A^{(l)}(X-Y)^{(l)})_{i}=-(a_{n-l})_{i}\} for i{1,,rm}i\in\{1,\dots,rm\} and l{0,,n1}l\in\{0,\dots,n-1\}. We use an RMDS3\text{RMDS}_{3} matrix in the construction to map {1,0,1}nl\{-1,0,1\}^{n-l} vectors bijectively to integer vectors with large Hamming distance. Note that each exact threshold function can be replaced by two linear threshold functions and a summation layer (i.e. 𝟙{F(X)=0}=𝟙{F(X)0}+𝟙{F(X)0}1)\mathds{1}\{F(X)=0\}=\mathds{1}\{F(X)\geq 0\}+\mathds{1}\{-F(X)\geq 0\}-1) which can be absorbed to the top gate (see Figure 4) . This increases the circuit size only by a factor of two.

Figure 4: A construction of an arbitrary exact threshold function (𝟙{F(X)=0}\mathds{1}\{F(X)=0\}) using two linear threshold functions (𝟙{F(X)0}\mathds{1}\{F(X)\geq 0\} and 𝟙{F(X)0}\mathds{1}\{F(X)\leq 0\}) and a summation node. This summation node can be removed if its output is connected to another gate due to linearity.

If F(X,Y)<0F(X,Y)<0, zrml+iz_{rml+i} should be 1 for all i{1,,rm}i\in\{1,\dots,rm\} for some ll by Lemma 5. For F(X,Y)0F(X,Y)\geq 0 and any ll, the maximum number of 1s that can appear in zrml+iz_{rml+i} is upper bounded by m1m-1 because A is RMDS3\text{RMDS}_{3}. Therefore, the maximum number of 1s that can appear in zz is upper bounded by n(m1)n(m-1). A sketch of the construction is given in Figure 5.

Figure 5: A sketch of the COMP(X,Y)\text{COMP}(X,Y) construction where xi=xiyix_{i}^{\prime}=x_{i}-y_{i} using linear threshold gates. Each color specifies an ll value in the construction. If X<YX<Y, all the rmrm many gates in at least one of the colors will give all 11s at the output. Otherwise, all the rmrm many gates in a color will give at most (m1)(m-1) many 11s at the output.

In order to make both cases separable, we choose r=nr=n. At the second layer, the top gate is a MAJORITY gate (𝟙{i=1rmnzi<n(m1)}=𝟙{i=1rmnzin(m1)+1}\mathds{1}\{\sum_{i=1}^{rmn}z_{i}<n(m-1)\}=\mathds{1}\{\sum_{i=1}^{rmn}-z_{i}\geq-n(m-1)+1\}). By Theorem 4, there exists an RMDS3\text{RMDS}_{3} matrix with m=O(n/logn)m=O(n/\log{n}) and W=O(n)W=O(n). Thus, there are rmn+1=O(n3/logn)rmn+1=O(n^{3}/\log{n}) many gates in the circuit, which is the best known result. The same size complexity can be achieved by a CRT-based approach with W=O(n2)W=O(n^{2}) [2].

V Conclusion

We explicitly constructed a rate-efficient constant weight EQ matrix for a previously existential result. Namely, we proved that the CRT matrix is not optimal in weight size to compute the EQ function and obtained the optimal EQ function constructions. For the COMP function, the weight size complexity is improved by a linear factor, using RMDSq\text{RMDS}_{q} matrices and their existence. An open problem is whether similar algebraic constructions can be found for general threshold functions so that these ideas can be developed into a weight quantization technique for neural networks.

Acknowledgement

This research was partially supported by The Carver Mead New Adventure Fund.

References

  • [1] Noga Alon and Văn H Vũ “Anti-Hadamard matrices, coin weighing, threshold gates, and indecomposable hypergraphs” In Journal of Combinatorial Theory, Series A 79.1 Elsevier, 1997, pp. 133–160
  • [2] Kazuyuki Amano and Akira Maruoka “On the complexity of depth-2 circuits with threshold gates” In International Symposium on Mathematical Foundations of Computer Science, 2005, pp. 107–118 Springer
  • [3] László Babai, Kristoffer Arnsfelt Hansen, Vladimir V Podolskii and Xiaoming Sun “Weights of exact threshold functions” In International Symposium on Mathematical Foundations of Computer Science, 2010, pp. 66–77 Springer
  • [4] József Beck “Siegel’s Lemma is sharp” In A journey through discrete mathematics Springer, 2017, pp. 165–206
  • [5] Tom Bohman “A construction for sets of integers with distinct subset sums” In the electronic journal of combinatorics, 1998, pp. R3–R3
  • [6] Simone Costa, Stefano Della Fiore and Marco Dalai “Variations on the Erdős Distinct-sums Problem” In arXiv preprint arXiv:2107.07885, 2021
  • [7] Quentin Dubroff, Jacob Fox and Max Wenqiang Xu “A Note on the Erdős Distinct Subset Sums Problem” In SIAM Journal on Discrete Mathematics 35.1 SIAM, 2021, pp. 322–324
  • [8] Mikael Goldmann and Marek Karpinski “Simulating threshold circuits by majority circuits” In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, 1993, pp. 551–560
  • [9] Richard Guy “Unsolved problems in number theory” Springer Science & Business Media, 2004
  • [10] Gábor Halász “Estimates for the concentration function of combinatorial number theory and probability” In Periodica Mathematica Hungarica 8.3-4 Springer, 1977, pp. 197–211
  • [11] Johan Håstad “On the size of weights for threshold gates” In SIAM Journal on Discrete Mathematics 7.3 SIAM, 1994, pp. 484–492
  • [12] Thomas Hofmeister “A note on the simulation of exponential threshold weights” In International Computing and Combinatorics Conference, 1996, pp. 136–141 Springer
  • [13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv and Yoshua Bengio “Binarized neural networks” In Advances in neural information processing systems 29, 2016, pp. 4107–4115
  • [14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko “Quantization and training of neural networks for efficient integer-arithmetic-only inference” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713
  • [15] Sankeerth Rao Karingula and Shachar Lovett “Singularity of Random Integer Matrices with Large Entries” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021), 2021 Schloss Dagstuhl-Leibniz-Zentrum für Informatik
  • [16] Kordag Mehmet Kilic and Jehoshua Bruck “Neural Network Computations with DOMINATION Functions” In 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 1029–1034 IEEE
  • [17] Fengfu Li, Bo Zhang and Bin Liu “Ternary weight networks” In arXiv preprint arXiv:1605.04711, 2016
  • [18] David Mandelbaum “Error correction in residue arithmetic” In IEEE Transactions on Computers 100.6 IEEE, 1972, pp. 538–545
  • [19] Saburo Muroga “Threshold logic and its applications”, 1971
  • [20] Vwani P Roychowdhury, Alon Orlitsky and Kai-Yeung Siu “Lower bounds on threshold and related circuits via communication complexity” In IEEE Transactions on Information Theory 40.2 IEEE, 1994, pp. 467–474
  • [21] Mark Rudelson and Roman Vershynin “The Littlewood–Offord problem and invertibility of random matrices” In Advances in Mathematics 218.2 Elsevier, 2008, pp. 600–633
  • [22] Carl L Siegel “Über einige anwendungen diophantischer approximationen” In On Some Applications of Diophantine Approximations Springer, 2014, pp. 81–138
  • [23] Kai-Yeung Siu and Jehoshua Bruck “On the power of threshold circuits with small weights” In SIAM Journal on Discrete Mathematics 4.3 SIAM, 1991, pp. 423–435
  • [24] Thian Fatt Tay and Chip-Hong Chang “A non-iterative multiple residue digit error detection and correction algorithm in RRNS” In IEEE transactions on computers 65.2 IEEE, 2015, pp. 396–408
  • [25] Gal Vardi and Ohad Shamir “Neural Networks with Small Weights and Depth-Separation Barriers” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 19433–19442 URL: https://proceedings.neurips.cc/paper/2020/file/e1fe6165cad3f7f3f57d409f78e4415f-Paper.pdf

The Decoding Algorithm: Suppose that Akx=zA_{k}x=z is given where AkA_{k} is an m×nm\times n EQ matrix given in Theorem 1 starting with A0=[1]A_{0}=\begin{bmatrix}1\end{bmatrix}. One can obtain a linear time decoding algorithm in nn by exploiting the recursive structure of the construction. Let us partition xx and zz in the way given in (13). It is clear that x(3)=[z(1)+z(2)]2x^{(3)}=[z^{(1)}+z^{(2)}]_{2}. Also, after computing x(3)x^{(3)}, we find that Ak1x(1)=(z(1)+z(2)x(3))/2A_{k-1}x^{(1)}=(z^{(1)}+z^{(2)}-x^{(3)})/2 and Ak1x(2)=(z(1)z(2)x(3))/2A_{k-1}x^{(2)}=(z^{(1)}-z^{(2)}-x^{(3)})/2. These operations can be done in O(mk1)O(m_{k-1}) time complexity. Let T(m)T(m) denote the time to decode zmz\in\mathbb{Z}^{m}. Then, T(m)=2T(m/2)+O(m)T(m)=2T(m/2)+O(m) and by the Master Theorem, T(m)=O(mlogm)=O(n)T(m)=O(m\log{m})=O(n).
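A direct implementation of this recursive decoder (our own sketch, following the partition in (13) for the construction started from A_{0}=[1]) is given below; it reproduces the worked example that follows.

import numpy as np

def decode(z):
    # Invert A_k x = z for the Theorem 1 construction with A_0 = [1].
    # z has length m_k = 2^k; the output x is ordered as (x^(1), x^(2), x^(3)).
    z = np.asarray(z, dtype=int)
    m = len(z)
    if m == 1:
        return [int(z[0])]                 # base case: A_0 = [1], so x = z
    z1, z2 = z[: m // 2], z[m // 2:]
    x3 = (z1 + z2) % 2                     # x^(3) = [z^(1) + z^(2)]_2
    x1 = decode((z1 + z2 - x3) // 2)       # A_{k-1} x^(1) = (z^(1) + z^(2) - x^(3)) / 2
    x2 = decode((z1 - z2 - x3) // 2)       # A_{k-1} x^(2) = (z^(1) - z^(2) - x^(3)) / 2
    return x1 + x2 + [int(v) for v in x3]

print(decode([4, -2, -1, 0]))   # [0, 1, 0, 0, 1, 1, 1, 0]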

An Example for the Decoding: Consider the following system Ax = b:

[11111110110110011111110011011000][x1x2x3x4x5x6x7x8]=[4210]\begin{bmatrix}[r]1&1&1&1&1&1&1&0\\ 1&-1&0&1&-1&0&0&1\\ 1&1&1&-1&-1&-1&0&0\\ 1&-1&0&-1&1&0&0&0\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\\ x_{4}\\ x_{5}\\ x_{6}\\ x_{7}\\ x_{8}\end{bmatrix}=\begin{bmatrix}4\\ -2\\ -1\\ 0\end{bmatrix}

Here is the first step of the algorithm:

a=[42],b=[10][x7x8]=[a+b]2=[[3]2[2]2]=[10]a=\begin{bmatrix}4\\ -2\end{bmatrix},\hphantom{a}b=\begin{bmatrix}-1\\ 0\end{bmatrix}\Rightarrow\begin{bmatrix}x_{7}\\ x_{8}\end{bmatrix}=[a+b]_{2}=\begin{bmatrix}[r][3]_{2}\\ [-2]_{2}\end{bmatrix}=\begin{bmatrix}1\\ 0\end{bmatrix}

Then, we obtain the two following systems:

[111110][x1x2x3]=a+b[10]2=[11]\displaystyle\begin{bmatrix}[r]1&1&1\\ 1&-1&0\end{bmatrix}\begin{bmatrix}x_{1}\\ x_{2}\\ x_{3}\\ \end{bmatrix}=\frac{a+b-\begin{bmatrix}1\\ 0\end{bmatrix}}{2}=\begin{bmatrix}1\\ -1\end{bmatrix}
[111110][x4x5x6]=ab[10]2=[21]\displaystyle\begin{bmatrix}[r]1&1&1\\ 1&-1&0\end{bmatrix}\begin{bmatrix}x_{4}\\ x_{5}\\ x_{6}\\ \end{bmatrix}=\frac{a-b-\begin{bmatrix}1\\ 0\end{bmatrix}}{2}=\begin{bmatrix}2\\ -1\end{bmatrix}

For the first system, we find that x3=[11]2=0x_{3}=[1-1]_{2}=0. Then,

x1=1102=0\displaystyle x_{1}=\frac{1-1-0}{2}=0 (37)
x2=1+102=1\displaystyle x_{2}=\frac{1+1-0}{2}=1 (38)

For the second system, we similarly find that x6=[21]2=1x_{6}=[2-1]_{2}=1. Then,

x4=2112=0\displaystyle x_{4}=\frac{2-1-1}{2}=0 (39)
x5=2+112=1\displaystyle x_{5}=\frac{2+1-1}{2}=1 (40)

Thus, we have x=[01001110]Tx=\begin{bmatrix}0&1&0&0&1&1&1&0\end{bmatrix}^{T}.

A Modular Arithmetical Property of the CRT Matrix: Consider the following equation with powers of two in 8 variables and the corresponding 4×84\times 8 CRT matrix, say AA.

wbTx=i=182i1xi=0w_{b}^{T}x=\sum_{i=1}^{8}2^{i-1}x_{i}=0 (41)
\begin{bmatrix}[l][2^{0}]_{3}&[2^{1}]_{3}&[2^{2}]_{3}&[2^{3}]_{3}&[2^{4}]_{3}&[2^{5}]_{3}&[2^{6}]_{3}&[2^{7}]_{3}\\ [2^{0}]_{5}&[2^{1}]_{5}&[2^{2}]_{5}&[2^{3}]_{5}&[2^{4}]_{5}&[2^{5}]_{5}&[2^{6}]_{5}&[2^{7}]_{5}\\ [2^{0}]_{7}&[2^{1}]_{7}&[2^{2}]_{7}&[2^{3}]_{7}&[2^{4}]_{7}&[2^{5}]_{7}&[2^{6}]_{7}&[2^{7}]_{7}\\ [2^{0}]_{11}&[2^{1}]_{11}&[2^{2}]_{11}&[2^{3}]_{11}&[2^{4}]_{11}&[2^{5}]_{11}&[2^{6}]_{11}&[2^{7}]_{11}\end{bmatrix} (42)
=\begin{bmatrix}[r]1&2&1&2&1&2&1&2\\ 1&2&4&3&1&2&4&3\\ 1&2&4&1&2&4&1&2\\ 1&2&4&8&5&10&9&7\end{bmatrix}_{4\times 8} (43)

For the prime p_{i}, the elements in row i are congruent to the corresponding elements of w_{b}. Hence, the i-th element of the vector Ax must be divisible by p_{i} whenever w_{b}^{T}x=0. For instance, one can pick x=\begin{bmatrix}2&1&1&3&0&1&-1&0\end{bmatrix}^{T} as a solution of w_{b}^{T}x=0, and we see that Ax=\begin{bmatrix}12&15&14&33\end{bmatrix}^{T}=\begin{bmatrix}4\cdot 3&3\cdot 5&2\cdot 7&3\cdot 11\end{bmatrix}^{T}. This property is essential in the construction of small weight depth-2 circuits for arbitrary threshold functions, whereas \text{RMDS}_{q} matrices do not necessarily behave in this manner.
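This modular property is easy to check numerically; a short sketch (our own verification of the example above) follows.

import numpy as np

primes, n = [3, 5, 7, 11], 8
wb = np.array([2 ** i for i in range(n)])                          # powers-of-two weights
A = np.array([[pow(2, i, p) for i in range(n)] for p in primes])   # the CRT matrix in (43)

x = np.array([2, 1, 1, 3, 0, 1, -1, 0])
assert wb @ x == 0                   # x is a solution of w_b^T x = 0
print(A @ x)                         # [12 15 14 33]
print((A @ x) % primes)              # [0 0 0 0]: the i-th entry is divisible by p_i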