On Algebraic Constructions of Neural Networks with Small Weights
Abstract
Neural gates compute functions based on weighted sums of the input variables. The expressive power of a neural gate (the number of distinct functions it can compute) depends on its weight sizes and, in general, large weights (exponential in the number of inputs) are required. Studying the trade-offs among weight sizes, circuit size and depth is a well-studied topic both in circuit complexity theory and in the practice of neural computation. We propose a new approach for studying these complexity trade-offs by considering a related algebraic framework. Specifically, given a single linear equation with arbitrary coefficients, we would like to express it using a system of linear equations with smaller (even constant) coefficients. The techniques we developed are based on Siegel's Lemma for the bounds, anti-concentration inequalities for the existential results and extensions of Sylvester-type Hadamard matrices for the constructions.
We explicitly construct a constant weight, optimal size matrix to compute the EQUALITY function (checking if two integers expressed in binary are equal). Computing EQUALITY with a single linear equation requires exponentially large weights. In addition, we prove the existence of the best-known weight size (linear) matrices to compute the COMPARISON function (comparing two integers expressed in binary). In the context of circuit complexity theory, our results improve the upper bounds on the weight sizes for the best-known circuit sizes for EQUALITY and COMPARISON.
I Introduction
An $n$-input Boolean function is a mapping from $\{0,1\}^n$ to $\{0,1\}$. In other words, it is a partitioning of the $n$-bit binary vectors into two sets with labels $0$ and $1$. In general, we can use systems of linear equations as descriptive models of these two sets of binary vectors. For example, the solution set of the equation $x_1 + x_2 + \cdots + x_n = k$ is the set of $n$-bit binary vectors where each $x_i \in \{0,1\}$ and $k$ is the number of 1s in the vectors. We can ask three important questions: How expressive can a single linear equation be? How many equations do we need to describe a Boolean function? Could we simulate a single equation by a system of equations with smaller integer weights?
Let us begin with an example: the $n$-input PARITY function, where we label binary vectors with an odd number of 1s as 1. We can write it in the following form:

$f_{\mathrm{PAR}}(X) = \mathbb{1}\{\sum_{i=1}^{n} 2^{i-1} x_i \in S_{\mathrm{PAR}}\}$ (1)

where $\mathbb{1}\{\cdot\}$ is the indicator function with outputs 0 or 1 and $S_{\mathrm{PAR}}$ denotes the set of integers whose binary expansion contains an odd number of 1s. We express the binary vectors as integers by using binary expansions. Thus, it can be shown that if the weights are exponentially large in $n$, we can express all Boolean functions in this form.
Now, suppose that we are only allowed to use a single equality check in each indicator function. Considering the $n$-input PARITY function, we can simply obtain

$\mathbb{1}\{\sum_{i=1}^{n} 2^{i-1} x_i = 1\}, \quad \mathbb{1}\{\sum_{i=1}^{n} 2^{i-1} x_i = 2\}, \quad \mathbb{1}\{\sum_{i=1}^{n} 2^{i-1} x_i = 4\}, \quad \ldots$ (2)

with one equation for each integer whose binary expansion contains an odd number of 1s.
None of the above equations can be satisfied if $X$ is labeled as 0. Conversely, if $X$ satisfies one of the above equations, we can label it as 1. For an arbitrary Boolean function of $n$ inputs, if we list every integer associated with vectors labeled as 1, the number of rows may become exponentially large in $n$. Nevertheless, in this fashion, we can compute the PARITY function by the following system of equations using smaller weights.
$\mathbb{1}\{\sum_{i=1}^{n} x_i = 1\}, \quad \mathbb{1}\{\sum_{i=1}^{n} x_i = 3\}, \quad \mathbb{1}\{\sum_{i=1}^{n} x_i = 5\}, \quad \ldots$ (3)
Not only is there a simplification in the number of equations, but the weights are also reduced to smaller sizes. This phenomenon motivates the following question with more emphasis: For which Boolean functions could we obtain such simplifications in the number of equations and the weight sizes? For PARITY, this simplification is possible because it is a symmetric Boolean function, i.e., the output depends only on the number of 1s of the input $X$. We are particularly interested in such simplifications from large weights to small weights for a class of Boolean functions called threshold functions. Note that we use the word “large” to refer to quantities exponentially large in $n$ and the word “small” to refer to quantities polynomially large (including constant) in $n$.
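The PARITY simplification above can be checked mechanically. Below is a small Python sketch (function names and encodings are ours, not from the paper): the first routine evaluates PARITY through the integer encoding with exponentially large power-of-two weights, the second through roughly $n/2$ equality checks with unit weights, and a brute-force loop confirms the two descriptions agree.

```python
from itertools import product

def parity_large_weights(x):
    # One membership check on the integer encoding sum_i 2^(i-1) x_i:
    # output 1 iff the encoded integer has an odd number of 1s in binary.
    s = sum(2**i * xi for i, xi in enumerate(x))
    return bin(s).count("1") % 2

def parity_small_weights(x):
    # ceil(n/2) equality checks with unit weights: 1 iff sum_i x_i is odd.
    n, s = len(x), sum(x)
    return int(any(s == t for t in range(1, n + 1, 2)))

# The two descriptions agree on every input for small n.
for n in range(1, 7):
    for x in product([0, 1], repeat=n):
        assert parity_large_weights(x) == parity_small_weights(x)
```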
I-A Threshold Functions and Neural Networks
Threshold functions are commonly studied in Boolean analysis and machine learning as they form the basis of neural networks. Threshold functions compute a weighted summation of binary inputs and feed it to a threshold or equality check. If this sum is fed to the former, we call the functions linear threshold functions (see (4)). If it is fed to the latter, we call them exact threshold functions (see (5)). We can write an $n$-input threshold function using the indicator function, where the $w_i$s are integer weights and $b$ is a bias term.
$f(X) = \mathbb{1}\{\sum_{i=1}^{n} w_i x_i \ge b\}$ (4)

$f(X) = \mathbb{1}\{\sum_{i=1}^{n} w_i x_i = b\}$ (5)
A device computing the corresponding threshold function is called a gate of that type. To illustrate the concept, we define the COMPARISON (denoted by COMP) function, which computes whether an $n$-bit integer $X$ is greater than another $n$-bit integer $Y$. The exact counterpart of it is defined as the EQUALITY (denoted by EQ) function, which checks if two $n$-bit integers are equal (see Figure 1). For example, we can write the EQ function in the following form.
$\mathrm{EQ}(X, Y) = \mathbb{1}\{\sum_{i=1}^{n} 2^{i-1} x_i - \sum_{i=1}^{n} 2^{i-1} y_i = 0\}$ (6)
In general, it is proven that the weights of a threshold function can be represented by $O(n \log n)$ bits and this is tight [1, 3, 11, 19]. However, it is possible to construct “small” weight threshold circuits to compute any threshold function [2, 8, 12, 23]. This transformation from a circuit of depth $d$ with exponentially large weights in $n$ to another circuit with polynomially large weights in $n$ is typically within a constant factor of the depth (e.g., $2d$ or $3d$ depending on the context) [8, 25]. For instance, such a transformation would simply follow if we could replace any “large” weight threshold function with a “small” weight depth-2 circuit so that the new depth becomes $2d$.
It is possible to reduce polynomial size weights to constant weights by recursively replicating the gates that are fed to the top gate (see Figure 2). Nevertheless, this would inevitably introduce a polynomial size blow-up in the circuit size. We emphasize that our focus is to achieve this weight size reduction from polynomial weights to constant weights with at most a constant factor blow-up in the circuit size.
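As a concrete sketch of the replication idea (helper names are ours): an integer weight $w$ on a gate feeding the top gate can be traded for $|w|$ unit-weight copies of that gate, with negative weights handled by negated wiring; the size blow-up is the sum of the absolute weights, hence polynomial for polynomial weights.

```python
from itertools import product

def top_gate(weights, bias, gate_outputs):
    # Linear threshold top gate: 1{ sum_i w_i g_i >= b }.
    return int(sum(w * g for w, g in zip(weights, gate_outputs)) >= bias)

def top_gate_unit_weights(weights, bias, gate_outputs):
    # Same function with unit weights only: wire |w| copies of each gate
    # into the top gate (a sign flip models negated wiring).
    fan_in = []
    for w, g in zip(weights, gate_outputs):
        fan_in.extend([g if w > 0 else -g] * abs(w))
    return int(sum(fan_in) >= bias)

# The replicated circuit computes the same function on all gate outputs.
for g in product([0, 1], repeat=3):
    assert top_gate([5, -3, 2], 1, g) == top_gate_unit_weights([5, -3, 2], 1, g)
```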
For neural networks and learning tasks, the weights are typically finite precision real numbers. To make the computations faster and more power efficient, the weights can be quantized to small integers with a loss in the accuracy of the network output [14, 13]. In practice, given that the neural circuit is large in size and deep in depth, this loss in accuracy might be tolerated. We are interested in the trade-off in the increase of size and depth while using as small weights as possible. More specifically, our goal is to provide insight into the computation of single threshold functions with large weights using threshold circuits with small weights, by relating theoretical upper and lower bounds to the best known constructions.
In this manner, for the ternary weight case ($\{-1,0,1\}$, as in [17]), we give an explicit and optimal size circuit construction for the EQ function using depth-2 circuits. This optimality is guaranteed by achieving the theoretical lower bounds asymptotically up to vanishing terms [16, 20]. We also prove an existential result on the COMP constructions to reduce the weight size of the best known constructions. It is not known whether constant weight constructions exist without a polynomial blow-up in the circuit size and an increase in the depth for arbitrary threshold functions.
I-B Bijective Mappings from Finite Fields to Integers
It seems that choosing powers of two as the weight set is important. In fact, we can expand any arbitrary weight in binary and use powers of two as a fixed weight set for any threshold function [16]. This weight set is a choice of convenience and a simple solution for the following question: How small could the elements of a weight set $W$ be if $W$ has all distinct subset sums (DSS)? If the weights satisfy this property, called the DSS property, they can define a bijection between $\{0,1\}^n$ and a set of integers. Erdős conjectured that the largest weight must be at least $c\,2^n$ for some constant $c > 0$ and therefore, choosing powers of two as the weight set is asymptotically optimal. The best known construction for such weight sets yields a largest element of roughly $0.22 \cdot 2^n$ and currently, the best lower bound is $\Omega(2^n/\sqrt{n})$ [5, 7, 9].
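The DSS property itself is easy to test by brute force for small sets. The sketch below (ours) confirms it for powers of two, refutes it for a set with a subset-sum collision, and checks a denser five-element set; whether that particular set appears in the DSS literature is our assumption, the code only verifies the property directly.

```python
from itertools import combinations

def has_distinct_subset_sums(weights):
    # DSS property: all 2^n subset sums (including the empty sum) are distinct.
    sums = set()
    for r in range(len(weights) + 1):
        for c in combinations(weights, r):
            s = sum(c)
            if s in sums:
                return False
            sums.add(s)
    return True

assert has_distinct_subset_sums([1, 2, 4, 8])        # powers of two
assert not has_distinct_subset_sums([1, 2, 3])       # 1 + 2 = 3
assert has_distinct_subset_sums([6, 9, 11, 12, 13])  # denser than powers of two
```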
Now, let us consider the following linear equation where the weights are fixed to the ascending powers of two but the $z_i$s are not necessarily binary. We denote the powers of two by the vector $d = (2^0, 2^1, \ldots, 2^{n-1})$.

$d^T z = \sum_{i=1}^{n} 2^{i-1} z_i = 0$ (7)

As the weights of $d$ define a bijection between $n$-bit binary vectors and integers, (7) does not admit a non-trivial solution for the alphabet $z \in \{-1,0,1\}^n$. This is a necessary and sufficient condition to compute the EQ function given in (6). We extend this property to many rows to define EQ matrices, which give a bijection between $\{0,1\}^n$ and a subset of $\mathbb{Z}^m$. Thus, an EQ matrix can be used to compute the EQ function in (6) and our goal is to use smaller weight sizes in the matrix.
Definition 1.
A matrix $A \in \mathbb{Z}^{m \times n}$ is an EQ matrix if the homogeneous system $Az = 0$ has no non-trivial solutions in $\{-1, 0, 1\}^n$.
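Definition 1 can be verified directly by exhaustive search for small matrices. The sketch below (ours; exponential in $n$, so only for toy examples) also takes an alphabet parameter $q$ so that it covers the $q$-ary extension discussed later.

```python
from itertools import product

def is_eq_matrix(A, q=1):
    # A is an EQ_q matrix iff Az = 0 has no non-trivial solution
    # with entries in {-q, ..., q}.
    n = len(A[0])
    for z in product(range(-q, q + 1), repeat=n):
        if any(z) and all(sum(a * zi for a, zi in zip(row, z)) == 0 for row in A):
            return False
    return True

# A single row of powers of two is an EQ matrix (distinct subset sums),
# while a row of equal weights is not.
assert is_eq_matrix([[1, 2, 4, 8]])
assert not is_eq_matrix([[1, 1, 1, 1]])   # z = (1, -1, 0, 0) solves it
```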
Let $A \in \mathbb{Z}^{m \times n}$ be an EQ matrix with the weight constraint $q$, i.e., $|a_{ij}| \le q$ for all $i, j$, and let $\rho = n/m$ denote the rate of the matrix $A$. It is clear that any full-rank square matrix is an EQ matrix with $\rho = 1$. Given any $q$, how large can this $\rho$ be? For the maximal rate, a necessary condition can be proven by Siegel's Lemma [22].
Lemma 1 (Siegel’s Lemma (modified)).
Consider any integer matrix $A \in \mathbb{Z}^{m \times n}$ with $n > m$ and $|a_{ij}| \le q$ for all $i, j$ and some integer $q$. Then, $Az = 0$ has a non-trivial solution for an integer vector $z$ such that $\|z\|_\infty \le (\sqrt{n}\, q)^{\frac{m}{n-m}}$.
It is shown that if $\rho > 1 + \log_2 q + \frac{1}{2} \log_2 n$, then any $A$ with weight constraint $q$ admits a non-trivial solution in $\{-1,0,1\}^n$ and cannot be an EQ matrix, i.e., this rate bound is tight [16]. A similar result can be obtained by the matrix generalizations of Erdős' Distinct Subset Sum problem [6].
When the weights are constrained to be small, the story is different. If $n > m$, there exists a matrix $A$ such that every non-trivial integer solution of $Az = 0$ satisfies $\|z\|_\infty \ge c\,(\sqrt{n}\, q)^{\frac{m}{n-m}}$ for a positive constant $c$. This is given by Beck's Converse Theorem on Siegel's Lemma [4].
For an explicit construction, it is possible to achieve the optimal rate if we allow the weight size to grow with $n$. This can be done by the Chinese Remainder Theorem (CRT) and the Prime Number Theorem (PNT) [2, 12, 16]. It is known that the CRT can be used to define residue codes [18]. For an integer $a$ and modulo base $p$, we denote the modulo operation by $[a]_p$, which maps the integer to a value in $\{0, 1, \ldots, p-1\}$. Suppose $0 \le a < 2^n$ for an integer $a$. One can encode this integer by the $m$-tuple $([a]_{p_1}, [a]_{p_2}, \ldots, [a]_{p_m})$ where each $p_i$ is a prime number. Since we can also encode $a$ by its binary expansion, the CRT gives a bijection between $\{0,1\}^n$ and these $m$-tuples as long as $2^n \le p_1 p_2 \cdots p_m$. By taking modulos of Equation (7), we can obtain the following matrix, defined as a CRT matrix:
$A_{\mathrm{CRT}} = \begin{pmatrix} [2^0]_{p_1} & [2^1]_{p_1} & \cdots & [2^{n-1}]_{p_1} \\ [2^0]_{p_2} & [2^1]_{p_2} & \cdots & [2^{n-1}]_{p_2} \\ \vdots & \vdots & \ddots & \vdots \\ [2^0]_{p_m} & [2^1]_{p_m} & \cdots & [2^{n-1}]_{p_m} \end{pmatrix}$ (8)
If $A_{\mathrm{CRT}} z = 0$ for $z \in \{-1,0,1\}^n$, then $\sum_{i=1}^{n} 2^{i-1} z_i \equiv 0 \pmod{p_j}$ for every $j$, and since $|\sum_{i=1}^{n} 2^{i-1} z_i| < 2^n \le p_1 \cdots p_m$, the solution must be trivial. Therefore, this CRT matrix is an EQ matrix. In general, by the PNT, one needs $m = \Theta(n/\log n)$ many rows to ensure that $2^n \le p_1 \cdots p_m$. Moreover, the weight size is bounded by the maximum prime size $p_m$, which is $O(n)$, again by the PNT.
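The CRT construction is straightforward to reproduce. In the sketch below (ours), the rows are the powers of two reduced modulo a few primes whose product exceeds $2^n$, and the EQ property over $\{-1,0,1\}^n$ is confirmed by brute force; the weight size is bounded by the largest prime, as stated above.

```python
from itertools import product

def crt_matrix(n, primes):
    # Row i holds the powers of two reduced modulo the i-th prime:
    # A[i][j] = 2^j mod p_i.
    return [[pow(2, j, p) for j in range(n)] for p in primes]

# For n = 8 we need primes whose product is at least 2^8 = 256.
A = crt_matrix(8, [3, 5, 7, 11])          # 3 * 5 * 7 * 11 = 1155

# Brute-force check of the EQ property over {-1, 0, 1}^8.
for z in product([-1, 0, 1], repeat=8):
    if any(z):
        assert any(sum(a * zi for a, zi in zip(row, z)) != 0 for row in A)

assert max(max(row) for row in A) == 10   # bounded by the largest prime, 11
```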
However, it is known that constant weight EQ matrices with asymptotically optimal rate exist by Beck's Converse Theorem on Siegel's Lemma [4, 16]. In this paper, we give an explicit construction where the entries are in $\{-1,0,1\}$ and asymptotic efficiency in rate is achieved up to vanishing terms. It is in fact an extension of Sylvester-type Hadamard matrices.
(10) |
One can verify that the matrix in (10) is an EQ matrix and that its rate is twice the trivial rate, matching the rate of the CRT matrix in (8). Therefore, due to the optimality in the weight size, this construction can replace the CRT matrices in the constructions of EQ matrices.
We can also focus on the $q$-ary representation of integers in a similar fashion by extending the definition of EQ matrices to this setting. In our analysis, we always treat $q$ as a constant value.
Definition 2.
A matrix $A \in \mathbb{Z}^{m \times n}$ is an $\mathrm{EQ}_q$ matrix if the homogeneous system $Az = 0$ has no non-trivial solutions in $\{-q, \ldots, 0, \ldots, q\}^n$.
If $q = 1$, then we drop the $q$ from the notation and say that the matrix is an EQ matrix. For $\mathrm{EQ}_q$ matrices, the optimal rate given by Siegel's Lemma is still $O(\log n)$ and constant weight constructions exist. We give an extension of our construction to $\mathrm{EQ}_q$ matrices where the entries are constant in size and asymptotic efficiency in rate is achieved up to constant terms.
I-C Maximum Distance Separable Extensions of EQ Matrices
Residue codes are treated as Maximum Distance Separable (MDS) codes because one can extend the CRT matrix by adding more prime numbers to the matrix (without increasing $n$) so that the resulting integer code achieves the Singleton bound in Hamming distance [24]. However, we do not say a CRT matrix is an MDS matrix as this refers to another concept.
Definition 3.
An integer matrix $A \in \mathbb{Z}^{m \times n}$ ($m \le n$) is MDS if and only if no $m \times m$ submatrix is singular.
Definition 4.
An integer matrix $A \in \mathbb{Z}^{m \times n}$ is MDS for $q$-ary bijections with MDS rate $k/m$ and EQ rate $n/k$ if and only if every $k \times n$ submatrix is an $\mathrm{EQ}_q$ matrix.
Because Definition 4 considers solutions over a restricted alphabet, we denote such matrices as $\mathrm{RMDS}_q$. Remarkably, as $q \to \infty$, both MDS definitions become the same. Similar to the $\mathrm{EQ}_q$ definition, we drop the $q$ from the notation when $q = 1$. A CRT matrix is not MDS; however, it can be $\mathrm{RMDS}$.
We can demonstrate the difference between the two MDS definitions by the following matrix. This matrix is an RMDS matrix because any submatrix obtained by deleting one row is an EQ matrix. It is in fact the same matrix as in (8) with an additional row of entries $[2^{j-1}]_{p_{m+1}}$ for $j = 1, \ldots, n$.
(11) |
Here, the determinant of the submatrix given by the first five columns is 0. Thus, this matrix is not an MDS matrix.
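Under the row-subset reading of Definition 4 (our interpretation), the RMDS property can also be brute-forced for toy sizes. The sketch below checks that every pair of rows of a small CRT matrix is an EQ matrix, while the classical MDS property already fails for a $2 \times 2$ submatrix, mirroring the distinction drawn above.

```python
from itertools import combinations, product

def is_eq_matrix(A, q=1):
    n = len(A[0])
    return not any(
        any(z) and all(sum(a * zi for a, zi in zip(row, z)) == 0 for row in A)
        for z in product(range(-q, q + 1), repeat=n)
    )

def is_rmds(A, k, q=1):
    # Every choice of k rows must form an EQ_q matrix.
    return all(is_eq_matrix(list(rows), q) for rows in combinations(A, k))

# CRT-style matrix for n = 4 with primes 3, 5, 7.
A = [[pow(2, j, p) for j in range(4)] for p in (3, 5, 7)]
assert is_rmds(A, 2)
# Yet A is not MDS in the classical sense: the first two columns of the
# rows for primes 3 and 7 form the singular submatrix [[1, 2], [1, 2]].
assert A[0][0] * A[2][1] - A[0][1] * A[2][0] == 0
```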
In this work, we are interested in $\mathrm{RMDS}$ matrices. We show that for a constant weight $\mathrm{RMDS}$ matrix, the MDS rate bound is tight. We also provide an existence result for such matrices where the weight size is bounded by $O(n)$ given that the EQ rate is optimal.
The following is a summary of our contributions in this paper.

- In Section II, we give an explicit constant weight ($\{-1,0,1\}$) EQ matrix construction whose rate is optimal up to vanishing terms, based on an extension of Sylvester-type Hadamard matrices.
- In Section III, we prove an upper bound on the MDS rate of an $\mathrm{RMDS}_q$ matrix in terms of its alphabet cardinality. Therefore, constant weight $\mathrm{RMDS}_q$ matrices can at most achieve a constant MDS rate.
- In Section III, we provide an existence result for $\mathrm{RMDS}$ matrices with $O(n)$ weight size and the optimal EQ rate. In contrast, the best known results require larger weight sizes at the optimal EQ rate.
II Rate-efficient Constructions with Constant Alphabet Size
The EQ matrix construction we give here is based on Sylvester's construction of Hadamard matrices. It extends that construction to achieve higher rates, with the trade-off that the matrix is no longer full-rank.
Theorem 1.
Suppose we are given an EQ matrix $A^{(l)}$. At iteration $l+1$, we construct the following matrix $A^{(l+1)}$:
(12) |
$A^{(l+1)}$ is an EQ matrix for any integer $l \ge 0$.
Proof.
We will apply induction. The base case is trivial by assumption. For the system $A^{(l+1)} x = 0$, let us partition the vector $x$ into two halves $x_1$ and $x_2$ in the following way:
(13) |
Then, setting , we have by the second row block. Hence, the first row block of the construction implies . Each entry of is either or a multiple of . Since , we see that . Then, we obtain . Applying the induction hypothesis on and , we see that admits a unique trivial solution in .
To see that is not difficult. For , we can apply induction. ∎
To construct the matrix given in (10), we can use a trivial rate EQ matrix as the initial matrix and iterate the construction. By a rearrangement of the columns, we also see that there is a Sylvester-type Hadamard matrix inside this matrix.
For the sake of completeness, we will revisit Siegel's Lemma to prove the best possible rate one can obtain for an EQ matrix with bounded weights, as similarly done in [16]. We note that the following lemma gives the best possible rate upper bound to the best of our knowledge, and our construction asymptotically shows the sharpness of Siegel's Lemma, similar to Beck's result.
Lemma 2.
For any EQ matrix with weight constraint $q$, $\rho \le 1 + \log_2 q + \frac{1}{2} \log_2 n$.
Proof.
By Siegel's Lemma, we know that for a matrix $A \in \mathbb{Z}^{m \times n}$ with weight constraint $q$, the homogeneous system $Az = 0$ attains a non-trivial solution in $\{-1,0,1\}^n$ when

$(\sqrt{n}\, q)^{\frac{m}{n-m}} < 2$ (14)

since any non-trivial integer solution with $\|z\|_\infty < 2$ lies in $\{-1,0,1\}^n$. Taking logarithms, we deduce that such a solution exists whenever $\rho = n/m > 1 + \log_2 q + \frac{1}{2} \log_2 n$. Obviously, an EQ matrix cannot attain such a rate. We conclude that the best upper bound is $\rho \le 1 + \log_2 q + \frac{1}{2} \log_2 n$. ∎
Using Lemma 2 for our construction with $q = 1$, we compute the upper bound on the rate and the actual rate of the construction,

(15)
(16)

which shows that the construction is optimal in rate up to vanishing terms. By deleting columns, we can generalize the result to any $n$ to achieve optimality up to a constant factor.
We conjecture that one can extend the columns of any Hadamard matrix to obtain an EQ matrix with $\{-1,0,1\}$ entries achieving rate optimality up to vanishing terms. In this case, the Hadamard conjecture implies a rich set of EQ matrix constructions.
Given $b = Ax$, we note that there is a linear time decoding algorithm to find $x$ uniquely, given in the appendix. This construction can also be generalized to $\mathrm{EQ}_q$ matrices with the same proof idea. In this case, the rate is optimal up to a constant factor.
Theorem 2.
Suppose we are given an $\mathrm{EQ}_q$ matrix $A^{(l)}$. At iteration $l+1$, we construct the following matrix $A^{(l+1)}$:
(17) |
$A^{(l+1)}$ is an $\mathrm{EQ}_q$ matrix for any integer $l \ge 0$.
III Bounds on the Rate and the Weight Size of RMDS Matrices
Similar to a CRT-based RMDS matrix, is there a way to extend the EQ matrix given in Section II without a large trade-off in the weight size? We give an upper bound on the MDS rate based on the alphabet size of an $\mathrm{RMDS}_q$ matrix.
Theorem 3.
An $\mathrm{RMDS}_q$ matrix with entries from an alphabet of cardinality $s$ satisfies an upper bound on its MDS rate depending only on $s$, given that $n$ is sufficiently large.
Proof.
The proof is based on a simple counting argument. We rearrange the rows of the matrix into blocks, where each block contains the rows starting with a common prefix of a given length, in the lexicographical order of the indices, i.e.,
(18) |
For instance, a block contains the rows starting with a common prefix. It is easy to see that there is a vector that will make at least one of the blocks admit a solution, because it is guaranteed that some column will be equal to one of the preceding columns by the Pigeonhole Principle.
Since the matrix is RMDS, any selection of rows should form an EQ matrix. Therefore, no block should contain more than the allowed number of rows. Again, by the Pigeonhole Principle, the claimed bound on the MDS rate follows. ∎
The condition is trivial in the sense that if the weights are constant, then the alphabet cardinality is a constant. Therefore, the MDS rate can only be a constant due to Theorem 3.
Corollary 3.1.
An $\mathrm{RMDS}$ matrix with constant weight size can at most achieve a constant MDS rate.
In Section II, we saw that CRT-based EQ matrix constructions are not optimal in terms of the weight size. For $\mathrm{RMDS}$ matrices, we now show that the weight size can be reduced. The idea is to use the probabilistic method to show the existence of $\mathrm{RMDS}$ matrices.
Lemma 3.
Suppose that $a$ is uniformly distributed over $\{-q, \ldots, q\}^n$ and the $x_i$ for $i = 1, \ldots, n$ are fixed values from an alphabet of constant size, with $k$ of them non-zero. Then, for some constant $C > 0$,

$\Pr\{a^T x = 0\} \le \dfrac{C}{q\sqrt{k}}$ (19)
Proof.
We start with a statement of the Berry-Esseen Theorem.
Lemma 4 (Berry-Esseen Theorem).
Let $X_1, \ldots, X_n$ be independent centered random variables with finite third moments, and let $S = \sum_{i=1}^{n} X_i$, $\sigma^2 = \sum_{i=1}^{n} \mathbb{E}[X_i^2]$ and $\beta = \sum_{i=1}^{n} \mathbb{E}[|X_i|^3]$. Then, for any $t$,

$\big| \Pr\{S \le t\sigma\} - \Phi(t) \big| \le C_0 \beta / \sigma^3$ (20)

where $C_0$ is an absolute constant and $\Phi$ is the cumulative distribution function of the standard normal distribution.
Since the density of a standard normal random variable is uniformly bounded by $\frac{1}{\sqrt{2\pi}}$, we obtain the following.
(21) |
Let denote the and similarly, let denote . By the Total Probability Theorem, we have
(22) |
where the last line follows from the fact that for some is . The other conditional probability term is because for any is . We will apply Berry-Esseen Theorem to find an upper bound on the first term.
We note that and . Then,
(23) |
One can check that the whole RHS is of the desired form for a constant factor, as we assume that the alphabet size is a constant. We can bound the terms in the numerator and the denominator by their extreme values. The orders of the terms are the same and we can use a limit argument to bound them as well. Therefore, for some constant,
(24) |
∎
Lemma 3 is related to the Littlewood-Offord problem, and anti-concentration inequalities are typically used in this framework [10, 21]. We specifically use the Berry-Esseen Theorem. Building on this lemma and the union bound, we have the following theorem.
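As a numerical illustration of the anti-concentration phenomenon behind Lemma 3 (a Monte Carlo sketch with our own parameter choices, not the paper's proof), the point probability $\Pr\{a^T x = 0\}$ for $a$ uniform on $\{-q, \ldots, q\}^n$ and a fixed non-zero $x$ shrinks as $n$ grows, consistent with a $1/(q\sqrt{n})$-type decay:

```python
import random

def hit_rate(n, q, trials=10_000, seed=1):
    # Monte Carlo estimate of P(a^T x = 0) for a uniform on {-q,...,q}^n
    # and the fixed all-ones vector x.
    random.seed(seed)
    hits = sum(
        sum(random.randint(-q, q) for _ in range(n)) == 0
        for _ in range(trials)
    )
    return hits / trials

# Quadrupling n roughly halves the collision probability (1/sqrt(n) decay).
p_small, p_large = hit_rate(25, 1), hit_rate(100, 1)
assert p_large < p_small
```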
Theorem 4.
An $\mathrm{RMDS}$ matrix with entries in $\{-q, \ldots, q\}$ and the optimal EQ rate exists if $q$ grows linearly in $n$ and $n$ is sufficiently large.
Proof.
Let $B$ be any submatrix of $A$ obtained by selecting rows. Then, by the union bound (as done in [15]), we have
(25) |
We again use the union bound to sum over all the events, counting the number of non-zero entries of the solution vector by $k$. Also, notice the independence of the events across the rows of $B$. Let $x$ denote an arbitrary vector with $k$ non-zero entries. Then,
(26)
(27)
We can use Lemma 3 to bound the term. Therefore,
(28)
(29)
for some . We bound the first summation by using and choosing for the probability term. This gives a geometric sum from to .
(30) |
For the second term, we take the highest value term and . Hence,
(31)
(32)
It is easy to see that, with an appropriate choice of the parameters, both terms vanish as $n$ goes to infinity.
∎
As a special case, this argument recovers an existential result on EQ matrices already known in [16]. We remark that the proof technique for Theorem 4 is not powerful enough to obtain non-trivial bounds in the regime needed to attack Erdős' conjecture on the DSS weight sets.
The CRT gives an explicit way to construct $\mathrm{RMDS}$ matrices, with the weight size determined by the size of the primes via the PNT. Theorem 4 therefore yields a weight size reduction over this approach. However, the modular arithmetical properties of the CRT do not carry over to $\mathrm{RMDS}$ matrices. Therefore, an $\mathrm{RMDS}$ matrix cannot replace a CRT matrix in general (see Appendix).
IV Applications of the Algebraic Results to Neural Circuits
In this section, we will give EQ and COMP threshold circuit constructions. We note that in our analysis, the size of the bias term is ignored.
One can construct a depth-2 exact threshold circuit with small weights to compute the EQ function [16]. For the first layer, we select the weights of each exact threshold gate as the rows of an EQ matrix. Then, we connect the outputs of the first layer to the top gate, which just computes the $m$-input AND function (i.e., $\mathbb{1}\{\sum_{i=1}^{m} z_i = m\}$ on the first-layer outputs $z_i$). In Figure 3, we give an example of EQ constructions.
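A depth-2 EQ circuit of this kind can be simulated directly. In the sketch below (ours), the first layer applies one exact gate per row of an EQ matrix to $z = x - y$, the top gate is the AND written as an exact check $\sum z_i = m$, and a CRT-style matrix supplies small first-layer weights; brute force confirms the circuit computes EQ.

```python
from itertools import product

def eq_circuit(A):
    # Depth-2 exact threshold circuit from an EQ matrix A: layer 1 has one
    # exact gate per row applied to z = x - y; layer 2 is an m-input AND,
    # itself an exact gate 1{sum = m}.
    m = len(A)
    def circuit(x, y):
        z = [xi - yi for xi, yi in zip(x, y)]
        gates = [int(sum(a * zi for a, zi in zip(row, z)) == 0) for row in A]
        return int(sum(gates) == m)
    return circuit

# Small-weight first layer from a CRT-style EQ matrix for n = 6.
A = [[pow(2, j, p) for j in range(6)] for p in (3, 5, 7)]   # 3*5*7 = 105 >= 2^6
eq = eq_circuit(A)
for x in product([0, 1], repeat=6):
    for y in product([0, 1], repeat=6):
        assert eq(x, y) == int(x == y)
```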
We roughly follow the previous works for the construction of the COMP [2]. First, we work with the difference vector $Z = X - Y$, whose entries $z_i = x_i - y_i$ lie in $\{-1, 0, 1\}$. We say that $X > Y$ when $\mathrm{COMP}(X, Y) = 1$ and vice versa. We also denote by $Z^{(j)}$ the most significant $j$-tuple of an $n$-tuple vector $Z$. We have the following observation.
Lemma 5.
Let $Z = X - Y$ with $z_i = x_i - y_i \in \{-1, 0, 1\}$ for $i = 1, \ldots, n$. Then, reading $Z$ from the most significant entry,

$X > Y \iff Z = (0, \ldots, 0, 1, *, \ldots, *)$ (33)
$X = Y \iff Z = (0, \ldots, 0)$ (34)

where each $*$ denotes any of $\{-1, 0, 1\}$.
Proof.
It is easy to see that if $Z = (0, \ldots, 0, 1, *, \ldots, *)$, where the vector has a number of leading 0s and the $*$s denote any of $\{-1, 0, 1\}$, then $X > Y$ (this is called the domination property): the leading $+1$ entry contributes more than the remaining entries can cancel. Similarly, for $X < Y$, the vector should have the form $(0, \ldots, 0, -1, *, \ldots, *)$, and $X = Y$ if and only if $Z = 0$. The converse holds similarly. ∎
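The domination property gives an immediate sequential reference implementation of COMP, which we use below only to sanity-check Lemma 5 against integer comparison; the bit order is most significant first, and the function names are ours.

```python
from itertools import product

def comp(x, y):
    # Scan z = x - y from the most significant entry: X > Y iff the first
    # non-zero entry of z equals +1 (the domination property).
    for xi, yi in zip(x, y):
        if xi != yi:
            return int(xi - yi == 1)
    return 0   # x == y

def to_int(bits):
    return int("".join(map(str, bits)), 2)

# Agreement with integer comparison for all pairs of 6-bit inputs.
for x in product([0, 1], repeat=6):
    for y in product([0, 1], repeat=6):
        assert comp(x, y) == int(to_int(x) > to_int(y))
```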
By searching for the prefix vectors $Z^{(j)} = (0, \ldots, 0, 1)$ for all $j$, we can compute the COMP. We claim that if we have an $\mathrm{RMDS}$ matrix $A$, we can detect such vectors by solving a system whose coefficient matrix is a truncated version of $A$ consisting of its first $j$ columns and whose right-hand side involves the $j$-th column. Specifically, we obtain the following:
(35)
(36)
Here, the indicator function works row-wise, i.e., each entry of the output vector is the indicator of the corresponding row equation. We use an $\mathrm{RMDS}$ matrix in the construction to map vectors bijectively to integer vectors with large Hamming distance. Note that each exact threshold function can be replaced by two linear threshold functions and a summation layer (i.e., $\mathbb{1}\{s = b\} = \mathbb{1}\{s \ge b\} + \mathbb{1}\{-s \ge -b\} - 1$), which can be absorbed into the top gate (see Figure 4). This increases the circuit size only by a factor of two.
If $X > Y$, the first-layer outputs corresponding to the correct prefix length should all be 1 by Lemma 5. In the other case, for any prefix length, the maximum number of 1s that can appear among the outputs is upper bounded because $A$ is $\mathrm{RMDS}$. A sketch of the construction is given in Figure 5.
In order to make both cases separable, we choose the threshold of the top gate accordingly. At the second layer, the top gate is a MAJORITY gate. By Theorem 4, there exists an $\mathrm{RMDS}$ matrix with $O(n)$ weight size and the optimal EQ rate. Thus, there are $O(n/\log n)$ many gates in the circuit, which is the best known result. The same size complexity can be achieved by a CRT-based approach with larger weights [2].
V Conclusion
We explicitly constructed a rate-efficient constant weight EQ matrix for a previously existential result. Namely, we proved that the CRT matrix is not optimal in weight size to compute the EQ function and obtained optimal EQ function constructions. For the COMP function, the weight size complexity is improved by a linear factor, using $\mathrm{RMDS}$ matrices and their existence. An open problem is whether similar algebraic constructions can be found for general threshold functions so that these ideas can be developed into a weight quantization technique for neural networks.
Acknowledgement
This research was partially supported by The Carver Mead New Adventure Fund.
References
- [1] Noga Alon and Văn H Vũ “Anti-Hadamard matrices, coin weighing, threshold gates, and indecomposable hypergraphs” In Journal of Combinatorial Theory, Series A 79.1 Elsevier, 1997, pp. 133–160
- [2] Kazuyuki Amano and Akira Maruoka “On the complexity of depth-2 circuits with threshold gates” In International Symposium on Mathematical Foundations of Computer Science, 2005, pp. 107–118 Springer
- [3] László Babai, Kristoffer Arnsfelt Hansen, Vladimir V Podolskii and Xiaoming Sun “Weights of exact threshold functions” In International Symposium on Mathematical Foundations of Computer Science, 2010, pp. 66–77 Springer
- [4] József Beck “Siegel’s Lemma is sharp” In A journey through discrete mathematics Springer, 2017, pp. 165–206
- [5] Tom Bohman “A construction for sets of integers with distinct subset sums” In The Electronic Journal of Combinatorics, 1998, pp. R3–R3
- [6] Simone Costa, Stefano Della Fiore and Marco Dalai “Variations on the Erdős Distinct-sums Problem” In arXiv preprint arXiv:2107.07885, 2021
- [7] Quentin Dubroff, Jacob Fox and Max Wenqiang Xu “A Note on the Erdős Distinct Subset Sums Problem” In SIAM Journal on Discrete Mathematics 35.1 SIAM, 2021, pp. 322–324
- [8] Mikael Goldmann and Marek Karpinski “Simulating threshold circuits by majority circuits” In Proceedings of the twenty-fifth annual ACM symposium on Theory of computing, 1993, pp. 551–560
- [9] Richard Guy “Unsolved problems in number theory” Springer Science & Business Media, 2004
- [10] Gábor Halász “Estimates for the concentration function of combinatorial number theory and probability” In Periodica Mathematica Hungarica 8.3-4 Springer, 1977, pp. 197–211
- [11] Johan Håstad “On the size of weights for threshold gates” In SIAM Journal on Discrete Mathematics 7.3 SIAM, 1994, pp. 484–492
- [12] Thomas Hofmeister “A note on the simulation of exponential threshold weights” In International Computing and Combinatorics Conference, 1996, pp. 136–141 Springer
- [13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv and Yoshua Bengio “Binarized neural networks” In Advances in neural information processing systems 29, 2016, pp. 4107–4115
- [14] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko “Quantization and training of neural networks for efficient integer-arithmetic-only inference” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713
- [15] Sankeerth Rao Karingula and Shachar Lovett “Singularity of Random Integer Matrices with Large Entries” In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2021), 2021 Schloss Dagstuhl-Leibniz-Zentrum für Informatik
- [16] Kordag Mehmet Kilic and Jehoshua Bruck “Neural Network Computations with DOMINATION Functions” In 2021 IEEE International Symposium on Information Theory (ISIT), 2021, pp. 1029–1034 IEEE
- [17] Fengfu Li, Bo Zhang and Bin Liu “Ternary weight networks” In arXiv preprint arXiv:1605.04711, 2016
- [18] David Mandelbaum “Error correction in residue arithmetic” In IEEE Transactions on Computers 100.6 IEEE, 1972, pp. 538–545
- [19] Saburo Muroga “Threshold logic and its applications”, 1971
- [20] Vwani P Roychowdhury, Alon Orlitsky and Kai-Yeung Siu “Lower bounds on threshold and related circuits via communication complexity” In IEEE Transactions on Information Theory 40.2 IEEE, 1994, pp. 467–474
- [21] Mark Rudelson and Roman Vershynin “The Littlewood–Offord problem and invertibility of random matrices” In Advances in Mathematics 218.2 Elsevier, 2008, pp. 600–633
- [22] Carl L Siegel “Über einige anwendungen diophantischer approximationen” In On Some Applications of Diophantine Approximations Springer, 2014, pp. 81–138
- [23] Kai-Yeung Siu and Jehoshua Bruck “On the power of threshold circuits with small weights” In SIAM Journal on Discrete Mathematics 4.3 SIAM, 1991, pp. 423–435
- [24] Thian Fatt Tay and Chip-Hong Chang “A non-iterative multiple residue digit error detection and correction algorithm in RRNS” In IEEE transactions on computers 65.2 IEEE, 2015, pp. 396–408
- [25] Gal Vardi and Ohad Shamir “Neural Networks with Small Weights and Depth-Separation Barriers” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 19433–19442 URL: https://proceedings.neurips.cc/paper/2020/file/e1fe6165cad3f7f3f57d409f78e4415f-Paper.pdf
The Decoding Algorithm: Suppose that $b = Ax$ is given where $A$ is an EQ matrix given in Theorem 1. One can obtain a linear time decoding algorithm by exploiting the recursive structure of the construction. Let us partition $x$ and $b$ in the way given in (13); the recursion reduces the system to two half-size systems. Let $T(n)$ denote the time to decode $x$. Then, $T(n) = 2T(n/2) + O(1)$ and by the Master Theorem, $T(n) = O(n)$.
An Example for the Decoding: Consider the following system Ax = b:
Here is the first step of the algorithm:
Then, we obtain the two following systems:
For the first system, we find that . Then,
(37)
(38)
For the second system, we similarly find that . Then,
(39)
(40)
Thus, we have .
A Modular Arithmetical Property of the CRT Matrix: Consider the following equation with powers of two in 8 variables and the corresponding CRT matrix, say $A$.
(41) |
(42)
(43)
For each prime $p$, the elements in the corresponding row of $A$ are congruent to the powers of two modulo $p$. Consequently, the corresponding entry of $Az$ is divisible by $p$ whenever the original equation holds modulo $p$. This property is essential in the construction of small weight depth-2 circuits for arbitrary threshold functions, while $\mathrm{RMDS}$ matrices do not necessarily behave in this manner.