
Graduate School of Science and Engineering, Yamagata University, Jonan 4-3-16, Yonezawa-shi, Yamagata 992-8510, Japan. [email protected]

School of Science and Engineering, Yamagata University, Jonan 4-3-16, Yonezawa-shi, Yamagata 992-8510, Japan.

2012 ACM Subject Classification: Theory of computation → Models of computation

Exponential Lower Bounds for Threshold Circuits of Sub-Linear Depth and Energy

Kei Uchizawa    Haruki Abe
Abstract

In this paper, we investigate the computational power of threshold circuits and other theoretical models of neural networks in terms of the following four complexity measures: size (the number of gates), depth, weight, and energy. Here, the energy complexity of a circuit measures the sparsity of its computation, and is defined as the maximum number of gates outputting non-zero values, taken over all input assignments.

As our main result, we prove that any threshold circuit $C$ of size $s$, depth $d$, energy $e$ and weight $w$ satisfies $\log(rk(M_C)) \leq ed(\log s + \log w + \log n)$, where $rk(M_C)$ is the rank of the communication matrix $M_C$ of the $2n$-variable Boolean function that $C$ computes. Thus, such a threshold circuit $C$ is able to compute only a Boolean function whose communication matrix has rank bounded by a product of logarithmic factors of $s,w$ and linear factors of $d,e$. This implies an exponential lower bound on the size of threshold circuits of even sub-linear depth if the energy and weight are sufficiently small. For example, we obtain an exponential lower bound $s = 2^{\Omega(n^{1/3})}$ even for threshold circuits of depth $n^{1/3}$, energy $n^{1/3}$ and weight $2^{o(n^{1/3})}$. We also show that the inequality is tight up to a constant factor when the depth $d$ and energy $e$ satisfy $ed = o(n/\log n)$.

For other models of neural networks, such as discretized ReLU circuits and discretized sigmoid circuits, we prove that a similar inequality also holds for a discretized circuit $C$: $\log(rk(M_C)) = O(ed(\log s + \log w + \log n)^3)$. We obtain this inequality by showing that any discretized circuit can be simulated by a threshold circuit with a moderate increase in size, depth, energy and weight. Thus, if we consider the number of non-zero output values as a measure of the sparse activity of a neural network, our results suggest that larger depth linearly helps neural networks to acquire sparse activity.

keywords:
Circuit complexity, disjointness function, equality function, neural networks, threshold circuits, ReLU circuits, sigmoid circuits, sparse activity

1 Introduction

Background. DiCarlo and Cox argued that constructing good internal representations is crucial for performing visual information processing, such as object recognition, in neural networks in the brain [6]. Here, an internal representation is described by a vector in a very high-dimensional space, where each axis is one neuron's activity and the dimensionality equals the number (e.g., $\sim$1 million) of neurons in a feedforward neural network. They call representations good if a pair of images that are hard to distinguish in the input space yields representations that are easy to separate by a simple classifier such as a linear classifier. While such internal representations are likely to play a fundamental role in information processing in the brain, it is also known that a neuron needs relatively high energy to be active [19, 30], and hence neural networks are forced to acquire representations supported by only a small number of active neurons [8]. These observations pose a question: for what information processing can neural networks construct good internal representations?

In [44], Uchizawa et al. address this question from the viewpoint of circuit complexity. More formally, they employed threshold circuits as a model of neural networks [26, 27, 31, 34, 36, 37, 38], and introduced a complexity measure, called energy complexity, for the sparsity of their internal representations. A threshold circuit is a feedforward logic circuit whose basic computational element computes a linear threshold function, and the energy of a circuit is defined as the maximum number of internal gates outputting ones over all input assignments. (See also [7, 17, 35, 39, 47] for studies on the energy complexity of other types of logic circuits.) Uchizawa et al. then show that energy complexity is closely related to the rank of linear decision trees. In particular, they prove that any linear decision tree with $l$ leaves can be simulated by a threshold circuit of size $O(l)$ and energy $O(\log l)$. Thus, even logarithmic-energy threshold circuits have certain computational power: any linear decision tree with a polynomial number of leaves can be simulated by a polynomial-size and logarithmic-energy threshold circuit.

Following [44], a sequence of papers established relations among other major complexity measures such as size (the number of gates), depth, weight and fan-in [24, 40, 41, 45, 43, 42, 46]. In particular, Uchizawa and Takimoto [45] showed that any threshold circuit $C$ of depth $d$ and energy $e$ requires size $s = 2^{\Omega(n/e^d)}$ if $C$ computes a function of high bounded-error communication complexity such as the Inner-Product function. Even for functions of low communication complexity, an exponential lower bound on the size is known for constant-depth threshold circuits: any threshold circuit $C$ of depth $d$ and energy $e$ requires size $s = 2^{\Omega(n/e2^{e+d}\log^e n)}$ if $C$ computes the parity function [43]. These results provide exponential lower bounds if the depth is constant and the energy is sub-linear [45] or sub-logarithmic [43], while both the Inner-Product function and the parity function are computable by linear-size, constant-depth, and linear-energy threshold circuits. Thus these results imply that energy complexity is strongly related to the representational power of threshold circuits. However, these lower bounds break down when we consider threshold circuits of larger depth and energy, say, non-constant depth and sub-linear energy.

Our Results for Threshold Circuits. In this paper, we prove that simple Boolean functions are hard even for sub-linear depth and sub-linear energy threshold circuits. Let $C$ be a threshold circuit with Boolean input variables $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and $\mathbf{y} = (y_1, y_2, \dots, y_n)$. The communication matrix $M_C$ of $C$ is a $2^n \times 2^n$ matrix where each row (resp., each column) is indexed by an assignment $\mathbf{a} \in \{0,1\}^n$ to $\mathbf{x}$ (resp., $\mathbf{b} \in \{0,1\}^n$ to $\mathbf{y}$), and the value $M_C[\mathbf{a}, \mathbf{b}]$ is defined to be the output of $C$ given $\mathbf{a}$ and $\mathbf{b}$. We denote by $rk(M_C)$ the rank of $M_C$ over $\mathbb{F}_2$. Our main result is the following relation among size, depth, energy and weight.

Theorem 1.1.

Let $s, d, e$ and $w$ be integers satisfying $2 \leq s, d$, $10 \leq e$ and $1 \leq w$. If a threshold circuit $C$ computes a Boolean function of $2n$ variables, and has size $s$, depth $d$, energy $e$ and weight $w$, then it holds that

\log(rk(M_{C}))\leq ed(\log s+\log w+\log n). \qquad (1)

The theorem implies exponential lower bounds for sub-linear depth and sub-linear energy threshold circuits. As an example, let us consider the complement of the disjointness function, $\neg\mathrm{DISJ}_n$, defined as follows: for $2n$ input variables $x_1, \dots, x_n$ and $y_1, \dots, y_n$,

\neg\mathrm{DISJ}_{n}(\mathbf{x},\mathbf{y})=\bigvee_{i=1}^{n}x_{i}\wedge y_{i}.

We note that $\neg\mathrm{DISJ}_n$ is a biologically motivated Boolean function: Maass [22] defined it to model coincidence detection or pattern matching, and Lynch and Musco [20] introduced a related problem, called the Filter problem, for studying theoretical aspects of spiking neural networks. Since $\neg\mathrm{DISJ}_n$ is the complement of the well-studied disjointness function, the rank of $M_{\neg\mathrm{DISJ}_n}$ is $2^n$ [15]. Thus, the theorem implies that

n\leq ed(\log s+\log w+\log n) \qquad (2)

holds if a threshold circuit $C$ computes $\neg\mathrm{DISJ}_n$. Rearranging Eq. (2), we obtain a lower bound $2^{n/(ed)}/(wn) \leq s$, which is exponential in $n$ if both $d$ and $e$ are sub-linear and $w$ is sub-exponential. For example, we obtain an exponential lower bound $s = 2^{\Omega(n^{1/3})}$ even for threshold circuits of depth $n^{1/3}$, energy $n^{1/3}$ and weight $2^{o(n^{1/3})}$. We can obtain similar lower bounds for the Inner-Product function and the equality function, since their communication matrices also have high rank.
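For concreteness, the rearrangement from Eq. (2) to this size bound is elementary:

n\leq ed(\log s+\log w+\log n)\;\Longleftrightarrow\;\frac{n}{ed}-\log w-\log n\leq\log s\;\Longleftrightarrow\;\frac{2^{n/(ed)}}{wn}\leq s.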

Comparing the lower bound $s = 2^{\Omega(n/e^d)}$ given in [45] to ours, our lower bound is meaningful only for sub-exponential weight, but improves on it in two ways: the lower bound is exponential even if $d$ is sub-linear, and it provides a nontrivial lower bound under a much weaker condition on the Boolean function: threshold circuits need exponential size even for Boolean functions whose communication matrices have rank $2^{\Omega(n)}$.

Threshold circuits have received considerable attention in circuit complexity, and a number of lower bound arguments have been developed for threshold circuits under restrictions on computational resources including size, depth, energy and weight [1, 2, 4, 10, 11, 14, 16, 24, 29, 33, 43, 45, 46]. However, these arguments are designed for constant-depth threshold circuits, and hence cannot provide meaningful bounds when the depth is not constant. In particular, $\neg\mathrm{DISJ}_n$ is computable by a depth-2 and linear-size threshold circuit. Thus, directly applying known techniques is unlikely to yield an exponential lower bound for $\neg\mathrm{DISJ}_n$.

To complement Theorem 1.1, we also show that the lower bound is tight up to a constant factor if the product of $e$ and $d$ is small:

Theorem 1.2.

For any integers $e$ and $d$ such that $2 \leq e$ and $2 \leq d$, $\mathrm{DISJ}_n$ is computable by a threshold circuit of size

s\leq(e-1)(d-1)\cdot 2^{\frac{n}{(e-1)(d-1)}},

depth $d$, energy $e$ and weight

w\leq\left(\frac{n}{(e-1)(d-1)}\right)^{2}.

Substituting the $s, d, e$ and $w$ of the threshold circuit given in Theorem 1.2 into the right-hand side of Eq. (2), we have

ed(\log s+\log w+\log n)\leq ed\left(\frac{n}{(e-1)(d-1)}+\log\bigl((e-1)(d-1)\bigr)+\log\left(\frac{n}{(e-1)(d-1)}\right)^{2}+\log n\right)\leq 4n+O(ed\log n),

which almost matches the left-hand side of Eq. (2) if $ed = o(n/\log n)$. Thus, Theorem 1.1 neatly captures the computational aspect of threshold circuits computing $\mathrm{DISJ}_n$. Recall that any linear decision tree with a polynomial number of leaves can be simulated by a polynomial-size and logarithmic-energy threshold circuit [44]. Also, it is known that any Boolean function is computable by a threshold circuit of energy one if exponential size is allowed [24]. Thus, we believe that the condition $ed = o(n/\log n)$ is not too restrictive. We also show that the lower bound is tight for the equality function.

Our Result for Discretized Circuits. Besides threshold circuits, we consider another well-studied model of neural networks, in which the activation function and the weights of a computational element are discretized (such as discretized sigmoid or ReLU circuits). The size, depth, energy and weight are important parameters for artificial neural networks as well. Size and depth are major topics in the success of deep learning. Energy is related to important techniques for deep learning such as regularization, sparse coding, and sparse autoencoders [12, 18, 28]. Weight resolution is closely related to chip resources in neuromorphic hardware systems [32], and quantization schemes have received attention [5, 13].

We define analogous notions of energy and weight for a discretized circuit, and show that any discretized circuit can be simulated by a threshold circuit with a moderate increase in size, depth, energy, and weight. Consequently, combining with Theorem 1.1, we can show that, for discretized circuits, the logarithm of the rank is bounded by a product of polylogarithmic factors of $s, w$ and linear factors of $d, e$. For example, we obtain the following for sigmoid circuits:

Theorem 1.3.

If a sigmoid circuit $C$ of size $s$, depth $d$, energy $e$, and weight $w$ computes a Boolean function $f$, then it holds that

\log(rk(M_{C}))=O(ed(\log s+\log w+\log n)^{3}).

Maass, Schnitger and Sontag [21] showed that a sigmoid circuit can be simulated by a threshold circuit, but their simulation was optimized to be depth-efficient and did not consider energy. Thus, their result does not serve our purpose.

Theorems 1.1 and 1.3 imply that threshold circuits and discretized circuits are able to compute only Boolean functions of bounded rank. Thus, we can view these theorems as bounds on the corresponding concept classes. According to the bound, $c$ times larger depth is comparable to $2^c$ times larger size. Thus, large depth could enormously help neural networks to increase their expressive power. Also, the bound suggests that increasing depth could help a neural network acquire sparse activity when there are hardware constraints on both the number of neurons and the weight resolution. These observations may shed some light on the reasons for the success of deep learning.

Organization. The rest of the paper is organized as follows. In Section 2, we define the terms needed for our analysis. In Section 3, we present our main lower bound result. In Section 4, we show the tightness of the bound by constructing a threshold circuit computing $\mathrm{DISJ}_n$. In Section 5, we show that a discretized circuit can be simulated by a threshold circuit. In Section 6, we conclude with some remarks.

2 Preliminaries

For an integer $n$, we denote by $[n]$ the set $\{1, 2, \dots, n\}$. The base of the logarithm is two unless stated otherwise. In Section 2.1, we define terms for threshold circuits and discretized circuits. In Section 2.2, we define the communication matrix, and present some known facts.

2.1 Circuit Model

In Sections 2.1.1 and 2.1.2, we give definitions of threshold and discretized circuits, respectively.

2.1.1 Threshold Circuits

Let $k$ be a positive integer. A threshold gate $g$ with $k$ input variables $\xi_1, \xi_2, \dots, \xi_k$ has weights $w_1, w_2, \dots, w_k$ and a threshold $t$. We define the output $g(\xi_1, \xi_2, \dots, \xi_k)$ of $g$ as

g(\xi_{1},\xi_{2},\dots,\xi_{k})=\mathrm{sign}\left(\sum_{i=1}^{k}w_{i}\xi_{i}-t\right)=\left\{\begin{array}{ll}1&\mbox{ if }t\leq\sum_{i=1}^{k}w_{i}\xi_{i};\\ 0&\mbox{ otherwise.}\end{array}\right.
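As a concrete illustration, this gate semantics can be written in a few lines of Python (a sketch of our own; the function name `threshold_gate` is ours):

```python
def threshold_gate(weights, t, inputs):
    # Output 1 iff the weighted sum of the inputs reaches the threshold t.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0

# Example: a two-input AND gate as a threshold gate (weights 1, 1 and threshold 2).
assert [threshold_gate([1, 1], 2, (a, b)) for a in (0, 1) for b in (0, 1)] == [0, 0, 0, 1]
```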

To evaluate the weight resolution, we assume each synaptic weight to be discrete, so that $w_1, w_2, \dots, w_k$ are integers. The weight $w_g$ of $g$ is defined as the maximum of the absolute values of $w_1, w_2, \dots, w_k$. In other words, we assume that $w_1, w_2, \dots, w_k$ are $O(\log w_g)$-bit coded discrete values. Throughout the paper, we allow a gate to have both positive and negative weights, although biological neurons are either excitatory (all weights positive) or inhibitory (all weights negative). As mentioned in [22], this relaxation has essentially no impact on circuit complexity investigations, unless one cares about a constant blowup in computational resources, because a single gate with positive and negative weights can be simulated by a pair of excitatory and inhibitory gates.

A threshold circuit $C$ is a combinatorial circuit consisting of threshold gates, and is expressed by a directed acyclic graph. The nodes of in-degree 0 correspond to input variables, and the other nodes correspond to gates. Let $G$ be the set of gates in $C$. For each gate $g \in G$, the level of $g$, denoted by $\mathrm{lev}(g)$, is defined as the length of a longest path from an input variable to $g$ in the underlying graph of $C$. For each $l \in [d]$, where $d$ is the depth of $C$ defined below, we define $G_l$ as the set of gates in the $l$th level:

G_{l}=\{g\in G\mid\mathrm{lev}(g)=l\}.

Given an input assignment $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$ to $(\mathbf{x}, \mathbf{y})$, the outputs of the gates in $C$ are inductively determined from the bottom level.

In this paper, we consider a threshold circuit $C$ for a Boolean function $f: \{0,1\}^{2n} \to \{0,1\}$. Thus, $C$ has $2n$ Boolean input variables $\mathbf{x} = (x_1, x_2, \dots, x_n)$ and $\mathbf{y} = (y_1, y_2, \dots, y_n)$, and a unique output gate, denoted by $g^{\mathrm{clf}}$, which is a linear classifier separating the internal representations given by the gates in the lower levels (possibly together with input variables). Consider a gate $g$ in $C$. Let $w^x_1, w^x_2, \dots, w^x_n$ (resp., $w^y_1, w^y_2, \dots, w^y_n$) be the weights for $x_1, x_2, \dots, x_n$ (resp., $y_1, y_2, \dots, y_n$), and let $t_g$ be the threshold of $g$. For each gate $h$ directed to $g$, let $w_{h,g}$ be the weight of $g$ for the output of $h$. Then the output $g(\mathbf{x}, \mathbf{y})$ of $g$ is defined as

g(\mathbf{x},\mathbf{y})=\mathrm{sign}\left(p_{g}(\mathbf{x},\mathbf{y})-t_{g}\right),

where $p_g(\mathbf{x}, \mathbf{y})$ denotes the potential of $g$ invoked by the input variables and gates:

p_{g}(\mathbf{x},\mathbf{y})=\sum_{i=1}^{n}w^{x}_{i}x_{i}+\sum_{i=1}^{n}w^{y}_{i}y_{i}+\sum_{l=1}^{\mathrm{lev}(g)-1}\sum_{h\in G_{l}}w_{h,g}h(\mathbf{x},\mathbf{y}).

We sometimes write $p^x_g(\mathbf{x})$ (resp., $p^y_g(\mathbf{y})$) for the potential invoked by $\mathbf{x}$ (resp., $\mathbf{y}$):

p^{x}_{g}(\mathbf{x})=\sum_{i=1}^{n}w^{x}_{i}x_{i}\quad\mbox{ and }\quad p^{y}_{g}(\mathbf{y})=\sum_{i=1}^{n}w^{y}_{i}y_{i}.

Although the inputs to $g$ are not only $\mathbf{x}$ and $\mathbf{y}$ but also the outputs of gates in the lower levels, we write $g(\mathbf{x}, \mathbf{y})$ for the output of $g$, because $\mathbf{x}$ and $\mathbf{y}$ inductively determine the output of $g$. We say that $C$ computes a Boolean function $f: \{0,1\}^{2n} \to \{0,1\}$ if $g^{\mathrm{clf}}(\mathbf{a}, \mathbf{b}) = f(\mathbf{a}, \mathbf{b})$ for every $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$.

Let $C$ be a threshold circuit. We define the size $s$ of $C$ as the number of gates in $C$, and the depth $d$ of $C$ as the level of $g^{\mathrm{clf}}$. We define the energy $e$ of $C$ as

e=\max_{(\mathbf{a},\mathbf{b})\in\{0,1\}^{2n}}\sum_{g\in G}g(\mathbf{a},\mathbf{b}).

We define the weight $w$ of $C$ as the maximum of the weights of the gates in $C$: $w = \max_{g \in G} w_g$.
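The definitions above can be checked mechanically; the following sketch (ours, purely illustrative) evaluates a small layered threshold circuit on all assignments and takes the maximum number of firing gates as the energy:

```python
from itertools import product

def gate_out(weights, t, vals):
    # sign semantics of Section 2.1.1: output 1 iff the potential reaches t
    return 1 if sum(w * v for w, v in zip(weights, vals)) >= t else 0

def gate_values(gates, assignment):
    """gates: list of (input_indices, weights, threshold) in topological order.
    An index i < len(assignment) is an input variable; n + j is the j-th gate."""
    vals = list(assignment)
    for ins, ws, t in gates:
        vals.append(gate_out(ws, t, [vals[i] for i in ins]))
    return vals[len(assignment):]          # outputs of all gates

# A depth-2 circuit computing XOR(x0, x1): g2 = AND(g0, g1) is the output gate.
gates = [([0, 1], [1, 1], 1),              # g0 = x0 OR x1
         ([0, 1], [-1, -1], -1),           # g1 = NOT(x0 AND x1)
         ([2, 3], [1, 1], 2)]              # g2 = g0 AND g1
energy = max(sum(gate_values(gates, a)) for a in product((0, 1), repeat=2))
print(energy)                              # 3: on input (1,0), all of g0, g1, g2 fire
```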

2.1.2 Discretized Circuits

Let $\varphi$ be an activation function, and let $\delta$ be a discretizer that maps a real number to a number representable with bitwidth $b$. We define the discretized activation function $\delta\circ\varphi$ as the composition of $\varphi$ and $\delta$, that is, $\delta\circ\varphi(x) = \delta(\varphi(x))$ for any number $x$. We say that $\delta\circ\varphi$ has a silent range on an interval $I$ if $\delta\circ\varphi(x) = 0$ for $x \in I$, and $\delta\circ\varphi(x) \neq 0$ otherwise. For example, if we use the ReLU function as the activation function $\varphi$, then $\delta\circ\varphi$ has a silent range on $I = (-\infty, 0]$ for any discretizer $\delta$. If we use the sigmoid function as the activation function $\varphi$ and a linear partition as the discretizer $\delta$, then $\delta\circ\varphi$ has a silent range on $I = (-\infty, t_{\max}]$, where $t_{\max} = \ln(1/(2^b - 1))$ and $\ln$ is the natural logarithm.
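As a quick numerical check of the sigmoid example, assume a linear-partition discretizer that rounds a value in $[0,1)$ down to $b$ bits (the precise form of the discretizer is our assumption, for illustration only):

```python
import math

b = 4                                                # bitwidth of the discretizer
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
delta = lambda v: math.floor(v * 2 ** b) / 2 ** b    # linear partition into 2^b levels

t_max = math.log(1.0 / (2 ** b - 1))                 # claimed boundary of the silent range
for x in (t_max - 1.0, t_max - 1e-9, t_max + 1e-3, 0.0, 3.0):
    print(x, delta(sigmoid(x)))                      # zero below t_max, non-zero above
```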

Let $\delta\circ\varphi$ be a discretized activation function with a silent range. A $(\delta\circ\varphi)$-gate $g$ with $k$ input variables $\xi_1, \xi_2, \dots, \xi_k$ has weights $w_1, w_2, \dots, w_k$ and a threshold $t$, where the weights and the threshold are discretized by $\delta$. The output $g(\xi_1, \xi_2, \dots, \xi_k)$ of $g$ is then defined as

g(\xi_{1},\xi_{2},\dots,\xi_{k})=\delta\circ\varphi\left(\sum_{i=1}^{k}w_{i}\xi_{i}-t\right).

A $(\delta\circ\varphi)$-circuit is a combinatorial circuit consisting of $(\delta\circ\varphi)$-gates, except that the top gate $g^{\mathrm{clf}}$ is a threshold gate, that is, a linear classifier. We define the size and depth of a $(\delta\circ\varphi)$-circuit in the same way as for a threshold circuit. We define the energy $e$ of a $(\delta\circ\varphi)$-circuit as the maximum number of gates outputting non-zero values in the circuit:

e=\max_{(\mathbf{a},\mathbf{b})\in\{0,1\}^{2n}}\sum_{g\in G}\llbracket g(\mathbf{a},\mathbf{b})\neq 0\rrbracket,

where $\llbracket \mathrm{P} \rrbracket$ for a statement $\mathrm{P}$ denotes the function that outputs one if $\mathrm{P}$ is true, and zero otherwise. We define the weight $w$ of $C$ as $w = 2^{2b}$, where $2b$ is the bitwidth possibly needed to represent a potential value invoked by a single input of a gate in $C$.

2.2 Communication Matrix and its Rank

Let $Z \subseteq \{0,1\}^n$. For a Boolean function $f: Z \times Z \to \{0,1\}$, we define the communication matrix $M_f$ over $Z$ as a $|Z| \times |Z|$ matrix where each row and each column are indexed by $\mathbf{a} \in Z$ and $\mathbf{b} \in Z$, respectively, and each entry is defined as $M_f(\mathbf{a}, \mathbf{b}) = f(\mathbf{a}, \mathbf{b})$. We denote by $rk(M_f)$ the rank of $M_f$ over $\mathbb{F}_2$. If a circuit $C$ computes $f$, we may write $M_C$ instead of $M_f$. If a Boolean function $f$ does not have an obvious separation of the input variables into $\mathbf{x}$ and $\mathbf{y}$, we assume a separation that maximizes $rk(M_f)$.

Let $k$ and $n$ be natural numbers such that $k \leq n$. Let

Z_{k}=\{\mathbf{a}\in\{0,1\}^{n}\mid\mbox{the number of ones in }\mathbf{a}\mbox{ is at most }k\}.

The $k$-disjointness function $\mathrm{DISJ}_{n,k}$ over $Z_k$ is defined as follows:

\mathrm{DISJ}_{n,k}(\mathbf{x},\mathbf{y})=\bigwedge_{i=1}^{n}\overline{x_{i}}\vee\overline{y_{i}},

where the input assignments are chosen from $Z_k$. The book [15] contains a simple proof showing that $M_{\mathrm{DISJ}_{n,k}}$ has full rank.

Theorem 2.1 (Theorem 13.10 [15]).

$rk(M_{\mathrm{DISJ}_{n,k}})=\sum_{i=0}^{k}\binom{n}{i}$. In particular, $rk(M_{\mathrm{DISJ}_{n,n}})=2^{n}$.
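Theorem 2.1 can be verified directly for small parameters; the sketch below (our code) builds $M_{\mathrm{DISJ}_{n,k}}$ and computes its rank over $\mathbb{F}_2$ by Gaussian elimination on integer bitmasks:

```python
from math import comb

def gf2_rank(rows):
    """Rank over F2 of a 0/1 matrix; each row is an int bitmask."""
    basis = {}                          # leading-bit position -> reduced row
    for r in rows:
        while r:
            lead = r.bit_length() - 1
            if lead not in basis:
                basis[lead] = r
                break
            r ^= basis[lead]
    return len(basis)

n, k = 4, 2
Z = [a for a in range(2 ** n) if bin(a).count("1") <= k]
rows = []
for a in Z:
    row = 0
    for j, b in enumerate(Z):
        if a & b == 0:                  # DISJ(a, b) = 1 iff no common 1-position
            row |= 1 << j
    rows.append(row)

assert gf2_rank(rows) == sum(comb(n, i) for i in range(k + 1))   # = 11 for n=4, k=2
```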

$\neg\mathrm{DISJ}_n$ is the complement of $\mathrm{DISJ}_{n,n}$. We can obtain the same bound for $\neg\mathrm{DISJ}_n$, as follows:

Corollary 2.2.

$rk(M_{\neg\mathrm{DISJ}_n})=2^{n}$.

We also use well-known facts on the rank. Let $A$ and $B$ be two matrices of the same dimensions. We denote by $A + B$ the sum of $A$ and $B$, and by $A \circ B$ the Hadamard product of $A$ and $B$.

Fact 15.

For two matrices $A$ and $B$ of the same dimensions, we have

(i)

$rk(A+B)\leq rk(A)+rk(B)$;

(ii)

$rk(A\circ B)\leq rk(A)\cdot rk(B)$.
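Both facts are standard over any field; over $\mathbb{F}_2$, where matrix addition is entrywise XOR and the Hadamard product is entrywise AND, they can be sanity-checked on random 0/1 matrices (our code, reusing the bitmask rank routine from above):

```python
import random

def gf2_rank(rows):
    basis = {}
    for r in rows:
        while r:
            lead = r.bit_length() - 1
            if lead not in basis:
                basis[lead] = r
                break
            r ^= basis[lead]
    return len(basis)

m = 6
for _ in range(100):
    A = [random.getrandbits(m) for _ in range(m)]
    B = [random.getrandbits(m) for _ in range(m)]
    S = [a ^ b for a, b in zip(A, B)]      # A + B over F2
    H = [a & b for a, b in zip(A, B)]      # Hadamard product of 0/1 matrices
    assert gf2_rank(S) <= gf2_rank(A) + gf2_rank(B)       # Fact 15(i)
    assert gf2_rank(H) <= gf2_rank(A) * gf2_rank(B)       # Fact 15(ii)
```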

3 Lower Bound for Threshold Circuits

In this section, we prove the inequality relating the rank of the communication matrix to the size, depth, energy and weight.

Theorem 3.1 (Theorem 1.1 restated).

Let $s, d, e$ and $w$ be integers satisfying $2 \leq s, d$, $11 \leq e$ and $1 \leq w$. Suppose a threshold circuit $C$ computes a Boolean function of $2n$ variables, and has size $s$, depth $d$, energy $e$, and weight $w$. Then it holds that

\log(rk(M_{C}))\leq ed(\log s+\log w+\log n).

We prove the theorem by showing that $M_C$ is a sum of matrices, each of which corresponds to an internal representation that arises in $C$. Since $C$ has bounded energy, the number of internal representations is also bounded. We then show by the inclusion-exclusion principle that each matrix corresponding to an internal representation has bounded rank. Thus, Fact 15 implies the theorem.

Proof 3.2.

Let $C$ be a threshold circuit that computes a Boolean function of $2n$ variables, and has size $s$, depth $d$, energy $e$ and weight $w$. Let $G$ be the set of gates in $C$. For $l \in [d]$, let $G_l$ be the set of gates in the $l$th level of $C$. Without loss of generality, we assume that $G_d = \{g^{\mathrm{clf}}\}$. We evaluate the rank of $M_C$, and prove that

rk(M_{C})\leq\left(\frac{c\cdot s}{e-1}\right)^{e-1}\cdot\left(\left(\frac{c\cdot s}{e-1}\right)^{e-1}\cdot(2nw+1)^{e-1}\right)^{d-1}\cdot(2nw+1), \qquad (4)

where $c < 3$ is an absolute constant, arising from the standard bound $\sum_{k=0}^{m}\binom{s}{k}\leq(cs/m)^{m}$ for $1 \leq m \leq s$ (one may take $c$ to be Euler's number). Equation (4) implies that

rk(M_{C})\leq\left(\frac{c\cdot s}{e-1}\cdot(2nw+1)\right)^{(e-1)d}\leq\left(snw\right)^{ed},

where the last inequality holds since $e \geq 11$. Taking the logarithm of the inequality, we obtain the theorem.

Below we verify that Eq. (4) holds. Let $\mathbf{P} = (P_1, P_2, \dots, P_d)$, where $P_l$ is a subset of $G_l$ for each $l \in [d]$. Given an input $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$, we say that the internal representation $\mathbf{P}$ arises for $(\mathbf{a}, \mathbf{b})$ if, for every $l \in [d]$, $g(\mathbf{a}, \mathbf{b}) = 1$ for every $g \in P_l$, and $g(\mathbf{a}, \mathbf{b}) = 0$ for every $g \in G_l \setminus P_l$. We denote by $\mathbf{P}^*(\mathbf{a}, \mathbf{b})$ the internal representation that arises for $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$. We then define $\mathcal{P}_1$ as the set of internal representations that arise for inputs $(\mathbf{a}, \mathbf{b})$ such that $g^{\mathrm{clf}}(\mathbf{a}, \mathbf{b}) = 1$:

\mathcal{P}_{1}=\{\mathbf{P}^{*}(\mathbf{a},\mathbf{b})\mid g^{\mathrm{clf}}(\mathbf{a},\mathbf{b})=1\}.

Note that, for any $\mathbf{P} = (P_1, P_2, \dots, P_d) \in \mathcal{P}_1$,

|P_{1}|+|P_{2}|+\cdots+|P_{d-1}|\leq e-1

and $|P_d| = 1$. Thus we have

|\mathcal{P}_{1}|\leq\sum_{k=0}^{e-1}\binom{s}{k}\leq\left(\frac{c\cdot s}{e-1}\right)^{e-1}. \qquad (5)

For each $\mathbf{P} \in \mathcal{P}_1$, let $M_{\mathbf{P}}$ be a $2^n \times 2^n$ matrix such that, for every $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$,

M_{\mathbf{P}}(\mathbf{a},\mathbf{b})=\left\{\begin{array}{ll}1&\mbox{ if }\mathbf{P}=\mathbf{P}^{*}(\mathbf{a},\mathbf{b});\\ 0&\mbox{ if }\mathbf{P}\neq\mathbf{P}^{*}(\mathbf{a},\mathbf{b}).\end{array}\right.

By the definitions of $\mathcal{P}_1$ and $M_{\mathbf{P}}$, we have

M_{C}=\sum_{\mathbf{P}\in\mathcal{P}_{1}}M_{\mathbf{P}},

and hence Fact 15(i) implies that

rk(M_{C})\leq\sum_{\mathbf{P}\in\mathcal{P}_{1}}rk(M_{\mathbf{P}}).

Thus Eq. (5) implies that

rk(M_{C})\leq\left(\frac{c\cdot s}{e-1}\right)^{e-1}\cdot\max_{\mathbf{P}\in\mathcal{P}_{1}}rk(M_{\mathbf{P}}).

We complete the proof by showing that, for any $\mathbf{P} \in \mathcal{P}_1$, it holds that

rk(M_{\mathbf{P}})\leq\left(\left(\frac{c\cdot s}{e-1}\right)^{e-1}\cdot(2nw+1)^{e-1}\right)^{d-1}\cdot(2nw+1).

In the following argument, we consider an arbitrary fixed internal representation $\mathbf{P} = (P_1, P_2, \dots, P_d)$ in $\mathcal{P}_1$. We call a gate a threshold function if the inputs of the gate consist only of $\mathbf{x}$ and $\mathbf{y}$. For each $g \in G$, we denote by $\tau[g, \mathbf{P}]$ the threshold function defined as

\tau[g,\mathbf{P}](\mathbf{x},\mathbf{y})=\mathrm{sign}\left(p^{x}_{g}(\mathbf{x})+p^{y}_{g}(\mathbf{y})+t_{g}[\mathbf{P}]\right),

where $t_g[\mathbf{P}]$ is the effective threshold of $g$ under the assumption that the internal representation $\mathbf{P}$ arises:

t_{g}[\mathbf{P}]=\sum_{l=1}^{\mathrm{lev}(g)-1}\sum_{h\in P_{l}}w_{h,g}-t_{g}.

For each $l \in [d]$, we define the set $T_l$ of threshold functions as

T_{l}=\{\tau[g,\mathbf{P}]\mid g\in G_{l}\}.

Since every gate in $G_1$ is a threshold function, $T_1$ is identical to $G_1$.

For any set $T$ of threshold functions, we denote by $M[T]$ the $2^n \times 2^n$ matrix such that, for every $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$,

M[T](\mathbf{a},\mathbf{b})=\left\{\begin{array}{ll}1&\mbox{ if }\forall\tau\in T,\ \tau(\mathbf{a},\mathbf{b})=1;\\ 0&\mbox{ if }\exists\tau\in T,\ \tau(\mathbf{a},\mathbf{b})=0.\end{array}\right.

It is well-known that the rank of $M[T]$ is bounded [9, 10], as follows.

Claim 16.

$rk(M[T])\leq(2nw+1)^{|T|}$.

We give a proof of the claim in the Appendix for completeness.

For each $l \in [d]$, based on $P_l$ in $\mathbf{P}$, we define a set $Q_l$ of threshold functions as

Q_{l}=\{\tau[g,\mathbf{P}]\mid g\in P_{l}\}\subseteq T_{l},

and a family $\mathcal{T}(Q_l)$ of sets of threshold functions as

\mathcal{T}(Q_{l})=\{T\subseteq T_{l}\mid Q_{l}\subseteq T\mbox{ and }|T|\leq e-1\}.

Following the inclusion-exclusion principle, we define a $2^n \times 2^n$ matrix

H[Q_{l}]=\sum_{T\in\mathcal{T}(Q_{l})}(-1)^{|T|-|Q_{l}|}M[T].

We can show that $M_{\mathbf{P}}$ is expressed as the Hadamard product of $H[Q_1], H[Q_2], \dots, H[Q_d]$:

Claim 17.
M_{\mathbf{P}}=H[Q_{1}]\circ H[Q_{2}]\circ\dots\circ H[Q_{d}].

The proof of the claim is given in the Appendix.

We finally evaluate $rk(M_{\mathbf{P}})$. Claim 17 and Fact 15(ii) imply that

rk(M_{\mathbf{P}})=rk\left(H[Q_{1}]\circ H[Q_{2}]\circ\dots\circ H[Q_{d}]\right)\leq\prod_{l=1}^{d}rk(H[Q_{l}]). \qquad (6)

Since

|\mathcal{T}(Q_{l})|\leq\left(\frac{c\cdot s}{e-1}\right)^{e-1},

Fact 15(i) and Claim 16 imply that

rk(H[Q_{l}])\leq\sum_{T\in\mathcal{T}(Q_{l})}rk(M[T])\leq\left(\frac{c\cdot s}{e-1}\right)^{e-1}\cdot(2nw+1)^{e-1} \qquad (7)

for every $l \in [d-1]$, and

rk(H[Q_{d}])\leq 2nw+1. \qquad (8)

Equations (6)-(8) imply that

rk(M_{\mathbf{P}})\leq\left(\left(\frac{c\cdot s}{e-1}\right)^{e-1}\cdot(2nw+1)^{e-1}\right)^{d-1}\cdot(2nw+1),

as desired. We thus have verified Eq. (4).

Combining Corollary 2.2 and Theorem 3.1, we obtain the following corollary:

Corollary 3.3.

Let $s, d, e$ and $w$ be integers satisfying $2 \leq s, d$, $10 \leq e$ and $1 \leq w$. Suppose a threshold circuit $C$ of size $s$, depth $d$, energy $e$, and weight $w$ computes $\neg\mathrm{DISJ}_n$. Then it holds that

n\leq ed(\log s+\log w+\log n).

Equivalently, we have $2^{n/(ed)}/(nw) \leq s$.

Theorem 3.1 implies lower bounds for other Boolean functions whose communication matrices have high rank. For example, consider the Boolean function $\mathrm{EQ}_n$ asking whether $\mathbf{x} = \mathbf{y}$:

\mathrm{EQ}_{n}(\mathbf{x},\mathbf{y})=\bigwedge_{i=1}^{n}\overline{x_{i}\oplus y_{i}}.

Since $M_{\mathrm{EQ}_n}$ is the identity matrix, which has full rank, we have the same lower bound.

Corollary 3.4.

Let $s, d, e$ and $w$ be integers satisfying $2 \leq s, d$, $10 \leq e$ and $1 \leq w$. Suppose a threshold circuit $C$ of size $s$, depth $d$, energy $e$, and weight $w$ computes $\mathrm{EQ}_n$. Then it holds that

n\leq ed(\log s+\log w+\log n).

Equivalently, we have $2^{n/(ed)}/(nw) \leq s$.

4 Tightness of the Lower Bound

In this section, we show that the lower bound given in Theorem 3.1 is almost tight if the depth and energy are small.

4.1 Definitions

Let $z$ be a positive integer, and let $f$ be a Boolean function of $2n$ variables. We say that $f$ is $z$-piecewise with $f_1, f_2, \dots, f_z$ if the following conditions are satisfied: let

B_{j}=\{i\in[n]\mid x_{i}\mbox{ or }y_{i}\mbox{ is fed into }f_{j}\};

then

(i)

$B_1, B_2, \dots, B_z$ form a partition of $[n]$;

(ii)

$|B_j| \leq \lceil n/z \rceil$ for every $j \in [z]$;

(iii)
f(\mathbf{x},\mathbf{y})=\bigvee_{j=1}^{z}f_{j}(\mathbf{x},\mathbf{y})\quad\mbox{or}\quad f(\mathbf{x},\mathbf{y})=\overline{\bigvee_{j=1}^{z}f_{j}(\mathbf{x},\mathbf{y})}.

We say that a set of threshold gates sharing input variables is a neural set, and a neural set is selective if at most one of the gates in the set outputs one for any input assignment. A selective neural set $S$ computes a Boolean function $f$ if for every assignment in $f^{-1}(0)$, no gate in $S$ outputs one, while for every assignment in $f^{-1}(1)$, exactly one gate in $S$ outputs one. We define the size and weight of $S$ as $|S|$ and $\max_{g \in S} w_g$, respectively.

By a DNF-like construction, we can obtain a selective neural set of exponential size that computes $f$ for any Boolean function $f$.

Theorem 4.1.

For any Boolean function $f$ of $n$ variables, there exists a selective neural set of size $2^n$ and weight one that computes $f$.
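The DNF-like construction behind Theorem 4.1 puts one weight-one gate per satisfying assignment, firing on exactly that assignment; a sketch (ours, with hypothetical helper names):

```python
from itertools import product

def selective_neural_set(f, n):
    """One threshold gate per satisfying assignment; all weights are +-1,
    and at most one gate fires on any input (selectivity)."""
    gates = []
    for a in product((0, 1), repeat=n):
        if f(a):
            weights = [1 if bit else -1 for bit in a]
            gates.append((weights, sum(a)))      # fires iff the input equals a
    return gates

def fires(gates, x):
    return [1 if sum(w * v for w, v in zip(ws, x)) >= t else 0 for ws, t in gates]

# Check on the 3-variable majority function.
f = lambda a: sum(a) >= 2
S = selective_neural_set(f, 3)
for x in product((0, 1), repeat=3):
    assert sum(fires(S, x)) == (1 if f(x) else 0)   # selective and computes f
```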

4.2 Upper Bounds

The following lemma shows that we can construct threshold circuits of small energy for piecewise functions.

Lemma 4.2.

Let $e$ and $d$ be integers satisfying $2 \leq e$ and $2 \leq d$, and let $z = (e-1)(d-1)$. Suppose $f: \{0,1\}^{2n} \to \{0,1\}$ is a $z$-piecewise function with $f_1, f_2, \dots, f_z$. If $f_j$ is computable by a selective neural set of size at most $s'$ and weight $w'$ for every $j \in [z]$, then $f$ is computable by a threshold circuit of size

s\leq z\cdot s^{\prime}+1,

depth $d$, energy $e$ and weight

w\leq\frac{2n}{z}\cdot w^{\prime}.

Clearly, $\mathrm{DISJ}_n$ is $z$-piecewise, and so the lemma gives our upper bound for $\mathrm{DISJ}_n$.

Theorem 4.3 (Theorem 1.2 restated).

For any integers $e$ and $d$ such that $2 \leq e$ and $2 \leq d$, $\mathrm{DISJ}_n$ is computable by a threshold circuit of size

s\leq(e-1)(d-1)\cdot 2^{\frac{n}{(e-1)(d-1)}},

depth $d$, energy $e$ and weight

w\leq\left(\frac{n}{(e-1)(d-1)}\right)^{2}.

We can also obtain a similar upper bound for $\mathrm{EQ}_n$.

Theorem 4.4.

For any integers $e$ and $d$ such that $2 \leq e$ and $2 \leq d$, $\mathrm{EQ}_n$ is computable by a threshold circuit of size

s\leq(e-1)(d-1)\cdot 2^{\frac{2n}{(e-1)(d-1)}},

depth $d$, energy $e$ and weight

w\leq\frac{n}{(e-1)(d-1)}.

5 Simulating Discretized Circuits

In this section, we show that any discretized circuit can be simulated by a threshold circuit with a moderate increase in size, depth, energy, and weight. Thus, a similar inequality holds for discretized circuits. Recall that the energy of a discretized circuit is defined differently from that of a threshold circuit.

Theorem 5.1 (Theorem 1.3 restated).

Let $\delta$ be a discretizer and $\varphi$ be an activation function such that $\delta\circ\varphi$ has a silent range. If a $(\delta\circ\varphi)$-circuit $C$ of size $s$, depth $d$, energy $e$, and weight $w$ computes a Boolean function $f$, then it holds that

\log(rk(M_{C}))=O(ed(\log s+\log w+\log n)^{3}).

Our simulation is based on a binary search over the potential of a discretized gate; we employ a conversion technique from a linear decision tree to a threshold circuit presented in [44].
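The idea can be illustrated as follows: when $\delta\circ\varphi$ is non-decreasing (as for the discretized ReLU and sigmoid above), the output of a gate is determined by binary search over the breakpoints of $\delta\circ\varphi$, where each step is a single threshold comparison of the potential; the technique of [44] then turns such a decision tree into an energy-efficient threshold circuit. A sketch under this monotonicity assumption (the breakpoints and levels below form an arbitrary illustrative discretizer, not one fixed by the paper):

```python
import bisect

def discretized_output(potential, breakpoints, levels):
    """levels[i] is the value of the discretized activation on the i-th interval
    determined by the sorted breakpoints; each bisect step is one threshold test."""
    return levels[bisect.bisect_left(breakpoints, potential)]

# Example: a ReLU discretized by rounding up to integers and clipping at 3.
breakpoints = [0, 1, 2, 3]
levels = [0, 1, 2, 3, 3]
for p in (-5, 0.5, 2.5, 10):
    print(p, discretized_output(p, breakpoints, levels))   # 0, 1, 3, 3
```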

6 Conclusions

In this paper, we considered threshold circuits and other theoretical models of neural networks in terms of four complexity measures: size, depth, energy and weight. We proved that if the communication matrix of a Boolean function $f$ has rank $2^{\Omega(n)}$, then any threshold circuit of sub-linear depth, sub-linear energy and sub-exponential weight needs exponential size to compute $f$.

We believe that circuit complexity arguments provide insights into neural computation. Besides Marr's famous three-level approach to understanding brain computation (the computational, algorithmic, and implementation levels) [25], Valiant [48] added the requirement that such an understanding has to incorporate some understanding of the quantitative constraints faced by cortex. We expect that circuit complexity arguments can shed light on such quantitative constraints through complexity measures. Maass et al. [23] pointed out the difficulty of uncovering a neural algorithm employed by the brain, because its hardware could be extremely adapted to the task, so that the algorithm vanishes: even if we knew its precise structure, connectivity, and vast array of numerical parameters to the last detail (an artificial neural network obtained by deep learning being a case in point), it would still be hard to extract the algorithm implemented in the network. A lower bound does not provide a description of an explicit neural algorithm, but it could give hints about the computational nature of neural computation, because a lower bound argument necessarily deals with every algorithm that a theoretical model of a neural network is able to implement. We expect that developing lower bound proof techniques for theoretical models of neural networks can push this line of research forward.

References

  • [1] Kazuyuki Amano. On the size of depth-two threshold circuits for the Inner Product mod 2 function. In Alberto Leporati, Carlos Martín-Vide, Dana Shapira, and Claudio Zandron, editors, Language and Automata Theory and Applications, pages 235–247, Cham, 2020. Springer International Publishing.
  • [2] Kazuyuki Amano and Akira Maruoka. On the complexity of depth-2 circuits with threshold gates. In Proc. of the 30th international conference on Mathematical Foundations of Computer Science, pages 107–118, 2005.
  • [3] Boris Barbour, Nicolas Brunel, Vincent Hakim, and Jean-Pierre Nadal. What can we learn from synaptic weight distributions? Trends in Neurosciences, 30(12):622–629, 2007. URL: https://www.sciencedirect.com/science/article/pii/S0166223607002615, doi:https://doi.org/10.1016/j.tins.2007.09.005.
  • [4] Ruiwen Chen, Rahul Santhanam, and Srikanth Srinivasan. Average-case lower bounds and satisfiability algorithms for small threshold circuits. Theory of Computing, 14(9):1–55, 2018.
  • [5] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Binaryconnect: Training deep neural networks with binary weights during propagations. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 3123–3131, Cambridge, MA, USA, 2015. MIT Press.
  • [6] James J. DiCarlo and David D. Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, 2007. doi:10.1016/j.tics.2007.06.010.
  • [7] Krishnamoorthy Dinesh, Samir Otiv, and Jayalal Sarma. New bounds for energy complexity of boolean functions. Theoretical Computer Science, 845:59–75, 2020.
  • [8] Peter Földiák. Sparse coding in the primate cortex. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1064–1068. MIT Press, 2003.
  • [9] Jürgen Forster, Matthias Krause, Satyanarayana V. Lokam, Rustam Mubarakzjanov, Niels Schmitt, and Hans Ulrich Simon. Relations between communication complexity, linear arrangements, and computational complexity. In Proc. of the 21st International Conference on Foundations of Software Technology and Theoretical Computer Science, pages 171–182, 2001.
  • [10] András Hajnal, Wolfgang Maass, Pavel Pudlák, Márió Szegedy, and György Turán. Threshold circuits of bounded depth. Journal of Computer and System Sciences, 46:129–154, 1993.
  • [11] Johan Håstad and Mikael Goldmann. On the power of small-depth threshold circuits. Computational Complexity, 1(2):113–129, 1991.
  • [12] Yunlong He, Koray Kavukcuoglu, Yun Wang, Arthur Szlam, and Yanjun Qi. Unsupervised feature learning by deep sparse coding. In Proc. of SIAM International Conference on Data Mining, pages 902–910, 2014.
  • [13] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2018. URL: http://jmlr.org/papers/v18/16-456.html.
  • [14] Russell Impagliazzo, Ramamohan Paturi, and Michael E. Saks. Size-depth tradeoffs for threshold circuits. SIAM Journal on Computing, 26(3):693–707, 1997.
  • [15] Stasys Jukna. Extremal Combinatorics with Applications in Computer Science. Springer-Verlag Berlin Heidelberg, 2011.
  • [16] D. M. Kane and R. Williams. Super-linear gate and super-quadratic wire lower bounds for depth-two and depth-three threshold circuits. In Proc. of the 48th Annual ACM Symposium on Theory of Computing, pages 633–643, 2016.
  • [17] O. M. Kasim-zade. On a measure of active circuits of functional elements (Russian). Mathematical problems in cybernetics “Nauka”, (4):218–228, 1992.
  • [18] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng. Efficient sparse coding algorithms. In Proc. of the 19th Advances in Neural Information Processing Systems, pages 801–808, 2006.
  • [19] Peter Lennie. The cost of cortical computation. Current Biology, 13:493–497, 2003.
  • [20] Nancy Lynch and Cameron Musco. A Basic Compositional Model for Spiking Neural Networks, pages 403–449. Springer Nature Switzerland, 2022. doi:10.1007/978-3-031-15629-8_22.
  • [21] Wolfgang Maass, Georg Schnitger, and Eduardo D. Sontag. On the computational power of sigmoid versus boolean threshold circuits. In Proc. of 32nd Annual Symposium of Foundations of Computer Science, pages 767–776, 1991. doi:10.1109/SFCS.1991.185447.
  • [22] Wolfgang Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997.
  • [23] Wolfgang Maass, Christos H. Papadimitriou, Santosh Vempala, and Robert Legenstein. Brain Computation: A Computer Science Perspective, pages 184–199. Springer International Publishing, 2019. doi:10.1007/978-3-319-91908-9_11.
  • [24] Hiroki Maniwa, Takayuki Oki, Akira Suzuki, Kei Uchizawa, and Xiao Zhou. Computational power of threshold circuits of energy at most two. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E101.A(9):1431–1439, 2018.
  • [25] David Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W.H.Freeman & Co Ltd, 1982.
  • [26] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5:115–133, 1943.
  • [27] Marvin Minsky and Seymour Papert. Perceptrons: An Introduction to Computational Geometry. MIT Press, 1988.
  • [28] Andrew Ng. Sparse autoencoder. CS294A Lecture notes, 2011.
  • [29] Noam Nisan. The communication complexity of threshold gates. Proceedings of “Combinatorics, Paul Erdős is Eighty”, pages 301–315, 1993.
  • [30] Bruno A Olshausen and David J Field. Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4):481–487, 2004.
  • [31] Ian Parberry. Circuit Complexity and Neural Networks. MIT Press, 1994.
  • [32] Thomas Pfeil, Tobias Potjans, Sven Schrader, Wiebke Potjans, Johannes Schemmel, Markus Diesmann, and Karlheinz Meier. Is a 4-bit synaptic weight resolution enough? - constraints on enabling spike-timing dependent plasticity in neuromorphic hardware. Frontiers in Neuroscience, 6:90, 2012. URL: https://www.frontiersin.org/article/10.3389/fnins.2012.00090, doi:10.3389/fnins.2012.00090.
  • [33] Alexander A. Razborov and Alexander A. Sherstov. The sign-rank of AC0. SIAM Journal on Computing, 39(5):1833–1855, 2010.
  • [34] Frank Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.
  • [35] Janio Carlos Nascimento Silva and Uéverton S. Souza. Computing the best-case energy complexity of satisfying assignments in monotone circuits. Theoretical Computer Science, 932:41–55, 2022. URL: https://www.sciencedirect.com/science/article/pii/S0304397522004777, doi:https://doi.org/10.1016/j.tcs.2022.08.005.
  • [36] K. Y. Siu and J. Bruck. On the power of threshold circuits with small weights. SIAM Journal on Discrete Mathematics, 4(3):423–435, 1991.
  • [37] Kai-Yeung Siu, Vwani Roychowdhury, and Thomas Kailath. Discrete Neural Computation: A Theoretical Foundation. Prentice Hall, 1995.
  • [38] Kai-Yeung Siu and Vwani P. Roychowdhury. On optimal depth threshold circuits for multiplication and related problems. SIAM Journal on Discrete Mathematics, 7(2):284–292, 1994.
  • [39] Xiaoming Sun, Yuan Sun, Kewen Wu, and Zhiyu Xia. On the relationship between energy complexity and other boolean function measures. In Proc. of the 25th International Computing and Combinatorics Conference, pages 516–528, 2019.
  • [40] Akira Suzuki, Kei Uchizawa, and Xiao Zhou. Energy-efficient threshold circuits computing MOD functions. In Proc. of the 17th Computing: the Australasian Theory Symposium, pages 105–110, 2011.
  • [41] Akira Suzuki, Kei Uchizawa, and Xiao Zhou. Energy-efficient threshold circuits computing MOD functions. International Journal of Foundations of Computer Science, 24(1):15–29, 2013.
  • [42] Kei Uchizawa. Lower bounds for threshold circuits of bounded energy. Interdisciplinary Information Sciences, 20(1):27–50, 2014.
  • [43] Kei Uchizawa. Size, Depth and Energy of Threshold Circuits Computing Parity Function. In Yixin Cao, Siu-Wing Cheng, and Minming Li, editors, 31st International Symposium on Algorithms and Computation (ISAAC 2020), volume 181 of Leibniz International Proceedings in Informatics (LIPIcs), pages 54:1–54:13, Dagstuhl, Germany, 2020. Schloss Dagstuhl–Leibniz-Zentrum für Informatik. URL: https://drops.dagstuhl.de/opus/volltexte/2020/13398, doi:10.4230/LIPIcs.ISAAC.2020.54.
  • [44] Kei Uchizawa, Rodney Douglas, and Wolfgang Maass. On the computational power of threshold circuits with sparse activity. Neural Computation, 18(12):2994–3008, 2008.
  • [45] Kei Uchizawa and Eiji Takimoto. Exponential lower bounds on the size of constant-depth threshold circuits with small energy complexity. Theoretical Computer Science, 407(1–3):474–487, 2008.
  • [46] Kei Uchizawa, Eiji Takimoto, and Takao Nishizeki. Size-energy tradeoffs of unate circuits computing symmetric Boolean functions. Theoretical Computer Science, 412:773–782, 2011.
  • [47] M. N. Vaintsvaig. On the power of networks of functional elements (Russian). Doklady Akademii Nauk, 139(2):320–323, 1961.
  • [48] Leslie G Valiant. What must a global theory of cortex explain? Current Opinion in Neurobiology, 25:15–19, 2014. doi:https://doi.org/10.1016/j.conb.2013.10.006.

7 Appendix

7.1 Proof of Claim 16

Let $z = |T|$, and let $\tau_1, \tau_2, \dots, \tau_z$ be an arbitrary ordering of the threshold functions in $T$. For each $k \in [z]$, we define

R_{k}=\{p^{x}_{\tau_{k}}(\mathbf{a})\mid\mathbf{a}\in\{0,1\}^{n}\}.

Since each of the $n$ input variables contributes an integer between $-w$ and $w$ to the potential $p^x_{\tau_k}$, we have $|R_k| \leq 2nw + 1$. For $\mathbf{r} = (r_1, r_2, \dots, r_z) \in R_1 \times R_2 \times \dots \times R_z$, we define the combinatorial rectangle $R(\mathbf{r}) = X(\mathbf{r}) \times Y(\mathbf{r})$, where

X(\mathbf{r})=\{\mathbf{x}\mid\forall k\in[z],\ p^{x}_{\tau_{k}}(\mathbf{x})=r_{k}\}

and

Y(\mathbf{r})=\{\mathbf{y}\mid\forall k\in[z],\ t_{\tau_{k}}\leq r_{k}+p^{y}_{\tau_{k}}(\mathbf{y})\}.

Clearly, all the rectangles are disjoint, and hence $M[T]$ can be expressed as a sum of rank-1 matrices given by the $R(\mathbf{r})$'s, taken over all $\mathbf{r}$. Thus Fact 15(i) implies that its rank is at most $|R_1 \times R_2 \times \dots \times R_z| \leq (2nw+1)^z$.

7.2 Proof of Claim 17

Consider an arbitrary fixed assignment $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$. We show that

H[Q_{1}](\mathbf{a},\mathbf{b})\, H[Q_{2}](\mathbf{a},\mathbf{b})\cdots H[Q_{d}](\mathbf{a},\mathbf{b})=0

if $M_{\mathbf{P}}(\mathbf{a}, \mathbf{b}) = 0$, and

H[Q_{1}](\mathbf{a},\mathbf{b})\, H[Q_{2}](\mathbf{a},\mathbf{b})\cdots H[Q_{d}](\mathbf{a},\mathbf{b})=1

if $M_{\mathbf{P}}(\mathbf{a}, \mathbf{b}) = 1$. We write $\mathbf{P}^* = (P^*_1, P^*_2, \dots, P^*_d)$ for $\mathbf{P}^*(\mathbf{a}, \mathbf{b})$ for notational simplicity.

Suppose $M_{\mathbf{P}}(\mathbf{a}, \mathbf{b}) = 0$. In this case, we have $\mathbf{P} \neq \mathbf{P}^*$, and hence there exists a level $l \in [d]$ such that $P_l \neq P^*_l$ while $P_{l'} = P^*_{l'}$ for every $l' \in [l-1]$. For such $l$, it holds that

\tau[g,\mathbf{P}^{*}](\mathbf{a},\mathbf{b})=\tau[g,\mathbf{P}](\mathbf{a},\mathbf{b}) \qquad (9)

for every $g \in G_l$. We show that $H[Q_l](\mathbf{a}, \mathbf{b}) = 0$ by considering two cases: $P_l \backslash P^*_l \neq \emptyset$ and $P_l \subset P^*_l$.

Consider the case where $P_l \backslash P^*_l \neq \emptyset$; then there exists $g \in P_l \backslash P^*_l$. Since $g \notin P^*_l$, we have $\tau[g, \mathbf{P}^*](\mathbf{a}, \mathbf{b}) = 0$. Thus, Eq. (9) implies that $\tau[g, \mathbf{P}](\mathbf{a}, \mathbf{b}) = 0$, and hence $M[T](\mathbf{a}, \mathbf{b}) = 0$ for every $T$ such that $Q_l \subseteq T$. Therefore, for every $T \in \mathcal{T}(Q_l)$, we have $M[T](\mathbf{a}, \mathbf{b}) = 0$, and hence

H[Q_{l}](\mathbf{a},\mathbf{b})=\sum_{T\in\mathcal{T}(Q_{l})}(-1)^{|T|-|Q_{l}|}M[T](\mathbf{a},\mathbf{b})=0.

Consider the other case where $P_l \subset P^*_l$. Let $Q^*_l = \{\tau[g, \mathbf{P}^*] \mid g \in P^*_l\}$. Equation (9) implies that $M[T](\mathbf{a}, \mathbf{b}) = 1$ if $T$ satisfies $Q_l \subseteq T \subseteq Q^*_l$, and $M[T](\mathbf{a}, \mathbf{b}) = 0$ otherwise. Thus,

H[Q_{l}](\mathbf{a},\mathbf{b})=\sum_{T\in\mathcal{T}(Q_{l})}(-1)^{|T|-|Q_{l}|}M[T](\mathbf{a},\mathbf{b})=\sum_{Q_{l}\subseteq T\subseteq Q^{*}_{l}}(-1)^{|T|-|Q_{l}|}.

Therefore, by the binomial theorem,

H[Q_{l}](\mathbf{a},\mathbf{b})=\sum_{k=0}^{|Q^{*}_{l}|-|Q_{l}|}\binom{|Q^{*}_{l}|-|Q_{l}|}{k}(-1)^{k}=(1-1)^{|Q^{*}_{l}|-|Q_{l}|}=0.
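The cancellation used here is an instance of the identity $\sum_{Q\subseteq T\subseteq Q^{*}}(-1)^{|T|-|Q|}=0$ whenever $Q \subsetneq Q^*$ (and $=1$ when $Q = Q^*$), which can also be checked by brute force (our code):

```python
from itertools import combinations

def alternating_sum(Q, Qstar):
    # sum over all T with Q subset-of T subset-of Qstar of (-1)^(|T| - |Q|)
    extra = sorted(Qstar - Q)
    return sum((-1) ** r * len(list(combinations(extra, r)))
               for r in range(len(extra) + 1))

assert alternating_sum({1}, {1, 2, 3, 4}) == 0   # Q strictly inside Qstar
assert alternating_sum({1, 2}, {1, 2}) == 1      # Q equals Qstar
```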

Suppose $M_{\mathbf{P}}(\mathbf{a}, \mathbf{b}) = 1$. In this case, we have $\mathbf{P} = \mathbf{P}^*$. Thus, for every $l \in [d]$, Eq. (9) implies that $M[T](\mathbf{a}, \mathbf{b}) = 1$ if $T = Q_l$, and $M[T](\mathbf{a}, \mathbf{b}) = 0$ otherwise. Therefore,

H[Q_{l}](\mathbf{a},\mathbf{b})=\sum_{T\in\mathcal{T}(Q_{l})}(-1)^{|T|-|Q_{l}|}M[T](\mathbf{a},\mathbf{b})=(-1)^{|Q_{l}|-|Q_{l}|}=1. \qquad (10)

Consequently

H[Q_{1}](\mathbf{a},\mathbf{b})\, H[Q_{2}](\mathbf{a},\mathbf{b})\cdots H[Q_{d}](\mathbf{a},\mathbf{b})=1,

as desired.

7.3 Proof of Lemma 4.2

Let $z = (e-1)(d-1)$. For simplicity, we assume that $n$ is divisible by $z$. Let $f: \{0,1\}^{2n} \to \{0,1\}$ be $z$-piecewise with $f_1, f_2, \dots, f_z$. We first prove the lemma for the case where $f$ satisfies

f(\mathbf{x},\mathbf{y})=\bigvee_{j=1}^{z}f_{j}(\mathbf{x},\mathbf{y}).

Let us relabel $f_l$ for $l \in [z]$ as $f_{k,l}$ for $k \in [e-1]$ and $l \in [d-1]$. We write $B_{k,l} = \{i \in [n] \mid x_i \mbox{ or } y_i \mbox{ is fed into } f_{k,l}\}$ for $k \in [e-1]$ and $l \in [d-1]$, each of which contains $n/z$ integers. Thus, we have

f(\mathbf{x},\mathbf{y})=\bigvee_{k\in[e-1]}\bigvee_{l\in[d-1]}f_{k,l}(\mathbf{x},\mathbf{y}).

By the assumption, $f_{k,l}$ is computable by a selective neural set $S_{k,l}$ for every pair of $k \in [e-1]$ and $l \in [d-1]$.

We construct the desired threshold circuit CC by arranging and connecting the selective neural sets, where CC has a simple layered structure consisting of the selective neural sets. After we complete the construction of CC, we show that CC computes ff, and evaluate its size, depth, energy, and weight.

We start from the bottom level. First, we place in the first level the gates in $S_{k,1}$ for every $k \in [e-1]$. Then, for each $l$, $2 \leq l \leq d-1$, we add at the $l$th level the gates $g \in S_{k,l}$ for every $k \in [e-1]$, and connect the outputs of all the gates at the lower levels to every $g \in S_{k,l}$ with weight $-2nw'/z$. For $g \in S_{k,l}$ and $(\mathbf{a}, \mathbf{b}) \in \{0,1\}^{2n}$, we denote by $g^C(\mathbf{a}, \mathbf{b})$ the output of the gate $g$ placed in $C$. Finally, we add $g^{\mathrm{clf}}$, which computes a disjunction of all the gates in levels $1, 2, \dots, d-1$:

g^{\mathrm{clf}}(\mathbf{x},\mathbf{y})=\mathrm{sign}\left(\sum_{k\in[e-1]}\sum_{l\in[d-1]}\sum_{g\in S_{k,l}}g^{C}(\mathbf{x},\mathbf{y})-1\right).

We now show that $C$ computes $f$. By construction, the following claim is easy to verify:

Claim 22.

For any $g \in S_{k,l}$,

g^{C}(\mathbf{x},\mathbf{y})=\left\{\begin{array}{ll}g(\mathbf{x},\mathbf{y})&\mbox{if every gate at levels }1,\dots,l-1\mbox{ outputs zero};\\ 0&\mbox{otherwise}.\end{array}\right.
Proof 7.1.

If every gate at levels $1, \dots, l-1$ outputs zero, the potential of $g^C$ is identical to that of $g$, and hence $g^C(\mathbf{x}, \mathbf{y}) = g(\mathbf{x}, \mathbf{y})$. Otherwise, there is a gate outputting one at a lower level. Since $g^C$ receives the output of that gate, the value $2nw'/z$ is subtracted from the potential of $g^C$. Since $g$ receives at most $2n/z$ inputs with positive weights bounded by $w'$, the potential of $g^C$ now falls below its threshold; otherwise, $g$ would output one for every input assignment, which, since $S_{k,l}$ is selective, would imply that $f_{k,l}$ is a constant function, contradicting the fact that $f$ is $z$-piecewise.

Suppose $f(\mathbf{a}, \mathbf{b}) = 0$. In this case, for every $k \in [e-1]$, $l \in [d-1]$, and $g \in S_{k,l}$, we have $g(\mathbf{a}, \mathbf{b}) = 0$. Therefore, Claim 22 implies that no gate in $C$ outputs one, and hence $g^{\mathrm{clf}}(\mathbf{a}, \mathbf{b}) = 0$.

Suppose $f(\mathbf{a}, \mathbf{b}) = 1$. In this case, there exists $l^* \in [d-1]$ such that $f_{k,l^*}(\mathbf{a}, \mathbf{b}) = 1$ for some $k \in [e-1]$, while $f_{k,l}(\mathbf{a}, \mathbf{b}) = 0$ for every $1 \leq k \leq e-1$ and $1 \leq l \leq l^* - 1$. Since $S_{k,l^*}$ computes $f_{k,l^*}$, Claim 22 implies that there exists $g \in S_{k,l^*}$ such that $g^C(\mathbf{a}, \mathbf{b}) = 1$, which implies that $g^{\mathrm{clf}}(\mathbf{a}, \mathbf{b}) = 1$.

Finally, we evaluate the size, depth, energy, and weight of $C$. Since $C$ contains at most $s'$ gates for each of the $z = (e-1)(d-1)$ pairs of $k \in [e-1]$ and $l \in [d-1]$, we have in total $s \leq zs' + 1$; the additional one corresponds to the output gate. Because the gates $g \in S_{k,l}$ are placed at the $l$th level for $l \in [d-1]$, the level of $g^{\mathrm{clf}}$ is clearly $d$, and hence $C$ has depth $d$. Claim 22 implies that if there is a gate outputting one at level $l$, then no gate at a higher level outputs one. In addition, since each $S_{k,l}$ is selective, at most one gate $g \in S_{k,l}$ outputs one. Therefore, at most $e-1$ gates at the $l$th level output one, together with $g^{\mathrm{clf}}$ possibly outputting one. Thus, $C$ has energy at most $e$. Any connection in $C$ has weight at most $w'$ or $2nw'/z$. Thus, the weight of $C$ is at most $2nw'/z$.

Consider the other case where

f(𝐱,𝐲)=j=1zfj(𝐱,𝐲)¯.f({\mathbf{x}},{\mathbf{y}})=\overline{\bigvee_{j=1}^{z}f_{j}({\mathbf{x}},{\mathbf{y}})}.

We can obtain the desired circuit C by the same construction as above, except that g^{clf} computes the complement of a disjunction of all the gates in the levels 1,2,…,d−1:

gclf(𝐱,𝐲)=sign(k[e1]l[d1]gSk,lgC(𝐱,𝐲)).g^{\mathrm{clf}}({\mathbf{x}},{\mathbf{y}})=\mathrm{sign}\left(\sum_{k\in[e-1]}\sum_{l\in[d-1]}\sum_{g\in S_{k,l}}-g^{C}({\mathbf{x}},{\mathbf{y}})\right).
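To illustrate the construction, the following Python sketch (our illustration for the first case, not part of the proof) instantiates C for a toy setting with n=4, z=2, e=2, and d=3, so that level l hosts the single selective set S_{1,l} computing f_l(𝐱,𝐲)=⋁_{i∈B_l}x_i∧y_i via the gates of Claim 26 below; the brute-force loop confirms that C computes f_1∨f_2 and never spends energy more than e=2:

# A minimal sketch (not part of the proof): the layered construction for
# f = f_1 OR f_2 with n = 4, z = 2, e = 2, d = 3, so levels 1 and 2 host
# the selective sets S_{1,1} and S_{1,2} (0-based variable indices).
from itertools import product

n, z = 4, 2
wp = n // z                         # w' = n/z bounds every gate weight
blocks = [(0, 1), (2, 3)]           # the blocks B_1 and B_2
inhib = -2 * n * wp // z            # inhibitory weight -2nw'/z

def subsets(block):
    """All non-empty B with B subseteq B_j."""
    return [[block[i] for i in range(len(block)) if m >> i & 1]
            for m in range(1, 1 << len(block))]

def potential(B, block, x, y):
    """Potential minus threshold of the gate g_j^B of Claim 26."""
    p = sum((len(B) if i in B else -len(B)) * x[i] for i in block)
    p += sum(y[i] for i in B)
    return p - (len(B) ** 2 + 1)    # threshold t = |B|^2 + 1

def run(x, y):
    total_ones, fired_below = 0, 0
    for block in blocks:            # level l hosts the set S_{1,l}
        level_ones = sum(
            1 for B in subsets(block)
            if potential(B, block, x, y) + inhib * fired_below >= 0)
        fired_below += level_ones
        total_ones += level_ones
    out = 1 if total_ones >= 1 else 0        # g_clf = sign(sum - 1)
    return out, total_ones + out             # (output, energy spent)

for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        out, energy = run(x, y)
        assert out == int(any(a and b for a, b in zip(x, y)))
        assert energy <= 2                   # energy is at most e = 2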

7.4 Proof of Theorem 4.3

For simplicity, we consider the case where n is a multiple of z. It suffices to show that DISJ_n is z-piecewise, with each f_j computable by a selective neural set of size s′=2^{n/z} and weight w′=n/z.

We can verify that DISJ_n is z-piecewise, since

\mathrm{DISJ}_{n}({\mathbf{x}},{\mathbf{y}})=\overline{\bigvee_{j=1}^{z}f_{j}({\mathbf{x}},{\mathbf{y}})},

where

f_{j}({\mathbf{x}},{\mathbf{y}})=\bigvee_{i\in B_{j}}x_{i}\wedge y_{i}

and B_{j}=\{i\in[n]\mid(j-1)n/z+1\leq i\leq jn/z\}.
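For a small instance, the decomposition can be checked by brute force. The following snippet is an illustration only (assuming n=4, z=2, and that DISJ_n(𝐱,𝐲)=1 exactly when 𝐱 and 𝐲 share no common 1-position), not part of the proof:

# Brute-force check that DISJ_n equals the complement of the OR of the
# block functions f_j; n = 4 and z = 2, so B_1 = {0,1} and B_2 = {2,3}
# in 0-based indices.
from itertools import product

n, z = 4, 2
blocks = [range(j * n // z, (j + 1) * n // z) for j in range(z)]

for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        disj = int(not any(x[i] and y[i] for i in range(n)))
        f = [int(any(x[i] and y[i] for i in B)) for B in blocks]
        assert disj == int(not any(f))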

Consider an arbitrary fixed j∈[z]. Below we show that f_j is computable by a selective neural set of size s′≤2^{n/z} and weight w′≤n/z.

Recall that, for a statement P, we denote by \llbracket P\rrbracket the function that outputs one if P is true, and zero otherwise. Then f_j can be expressed as

fj(𝐱,𝐲)=BBjFjB(𝐱,𝐲)\displaystyle f_{j}({\mathbf{x}},{\mathbf{y}})=\bigvee_{\emptyset\subset B\subseteq B_{j}}F^{B}_{j}({\mathbf{x}},{\mathbf{y}}) (13)

where

F^{B}_{j}({\mathbf{x}},{\mathbf{y}})=\bigwedge_{i\in B}\llbracket x_{i}=1\rrbracket\wedge\bigwedge_{i\in B_{j}\backslash B}\llbracket x_{i}=0\rrbracket\wedge\bigvee_{i\in B}y_{i}.

For any assignment 𝐚{0,1}n{\mathbf{a}}\in\{0,1\}^{n}, we define B(𝐚)B^{*}({\mathbf{a}}) as

B(𝐚)={iBjai=1}.B^{*}({\mathbf{a}})=\{i\in B_{j}\mid a_{i}=1\}.

Then, for every (𝐚,𝐛){0,1}2n({\mathbf{a}},{\mathbf{b}})\in\{0,1\}^{2n},

FjB(𝐚,𝐛)={1 if B=B(𝐚) and iB,bi=1;0 otherwise.\displaystyle F^{B}_{j}({\mathbf{a}},{\mathbf{b}})=\left\{\begin{array}[]{ll}1&\mbox{ if }B=B^{*}({\mathbf{a}})\mbox{ and }\exists i\in B,b_{i}=1;\\ 0&\mbox{ otherwise}.\end{array}\right. (16)

The function F^B_j is computable by a single threshold gate, as the following claim shows.

Claim 26.

For any ∅⊂B⊆B_j, F^B_j can be computed by a threshold gate g^B_j whose weights w^x_i, w^y_i and threshold t are defined as follows: for every i∈[n],

w^{x}_{i}=\left\{\begin{array}[]{ll}0&\mbox{ if }i\not\in B_{j};\\ |B|&\mbox{ if }i\in B;\\ -|B|&\mbox{ if }i\in B_{j}\backslash B,\end{array}\right.

and

w^{y}_{i}=\left\{\begin{array}[]{ll}1&\mbox{ if }i\in B;\\ 0&\mbox{ otherwise},\end{array}\right.

and t=|B|^{2}+1.

Proof 7.2.

Suppose FjB(𝐱,𝐲)=1F^{B}_{j}({\mathbf{x}},{\mathbf{y}})=1, that is, B=B(𝐱)B=B^{*}({\mathbf{x}}), and there exists iBi^{*}\in B such that yi=1y_{i^{*}}=1. Then, we have

iBjwixxi+iBwiyyi|B||B|+1=t;\sum_{i\in B_{j}}w^{x}_{i}x_{i}+\sum_{i\in B}w^{y}_{i}y_{i}\geq|B|\cdot|B|+1=t;

thus, gjBg^{B}_{j} outputs one.

Suppose F^B_j(𝐱,𝐲)=0. There are two cases: either B≠B*(𝐱), or B=B*(𝐱) and y_i=0 for every i∈B.

Consider the first case. If there exists iB(𝐱)\Bi^{*}\in B^{*}({\mathbf{x}})\backslash B, then

iBjwixxiiBwixxi+wix|B||B||B|.\sum_{i\in B_{j}}w^{x}_{i}x_{i}\leq\sum_{i\in B}w^{x}_{i}x_{i}+w^{x}_{i^{*}}\leq|B|\cdot|B|-|B|.

Thus,

iBjwixxi+iBwiyyi(|B||B||B|)+|B||B|2<t,\sum_{i\in B_{j}}w^{x}_{i}x_{i}+\sum_{i\in B}w^{y}_{i}y_{i}\leq(|B|\cdot|B|-|B|)+|B|\leq|B|^{2}<t,

and consequently, g^B_j outputs zero. If B*(𝐱)⊂B, then

iBjwixxi(|B|1)|B|.\sum_{i\in B_{j}}w^{x}_{i}x_{i}\leq(|B|-1)\cdot|B|.

Thus,

iBjwixxi+iBwiyyi((|B|1)|B|)+|B||B|2<t,\sum_{i\in B_{j}}w^{x}_{i}x_{i}+\sum_{i\in B}w^{y}_{i}y_{i}\leq((|B|-1)\cdot|B|)+|B|\leq|B|^{2}<t,

and hence, gjBg^{B}_{j} outputs zero.

In the second case, we have

iBwiyyi=0.\sum_{i\in B}w^{y}_{i}y_{i}=0.

Thus,

iBjwixxi+iBwiyyi|B||B|<t,\sum_{i\in B_{j}}w^{x}_{i}x_{i}+\sum_{i\in B}w^{y}_{i}y_{i}\leq|B|\cdot|B|<t,

and hence, gjBg^{B}_{j} outputs zero.

For any 𝐚∈{0,1}^n, Eq. (16) implies that only the gate g^{B*(𝐚)}_j can output one. Thus, by Eq. (13), the selective neural set

Sj={gjBBBj}S_{j}=\{g^{B}_{j}\mid\emptyset\subset B\subseteq B_{j}\}

computes f_j. Since |B_j|≤n/z, we have |S_j|≤2^{n/z}. Moreover, Claim 26 implies that every weight is bounded by |B|≤n/z, and hence w′≤n/z.
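Claim 26 and the selectivity of S_j can also be checked mechanically for a small block. The following snippet is a minimal illustration (a single block B_j={0,1,2} of size n/z=3, 0-based indices), not part of the proof:

# Check that at most one gate g_j^B fires on any input (selectivity) and
# that the OR of the outputs equals f_j(x, y) = OR_{i in B_j} (x_i AND y_i).
from itertools import product

Bj = (0, 1, 2)

def gate(B, x, y):
    # Weights and threshold from Claim 26: w^x_i = +|B| on B, -|B| on
    # B_j \ B, w^y_i = 1 on B, and threshold t = |B|^2 + 1.
    p = sum((len(B) if i in B else -len(B)) * x[i] for i in Bj)
    p += sum(y[i] for i in B)
    return 1 if p >= len(B) ** 2 + 1 else 0

Bs = [[i for i in Bj if m >> i & 1] for m in range(1, 1 << len(Bj))]

for x in product((0, 1), repeat=len(Bj)):
    for y in product((0, 1), repeat=len(Bj)):
        outs = [gate(B, x, y) for B in Bs]
        assert sum(outs) <= 1                   # the set S_j is selective
        assert max(outs) == int(any(a and b for a, b in zip(x, y)))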

7.5 Proof of Theorem 4.4

Let z=(e1)(d1)z=(e-1)(d-1). EQn\mathrm{EQ}_{n} is zz-piecewise, since

EQn(𝐱,𝐲)=j=1zfj(𝐱,𝐲)¯\mathrm{EQ}_{n}({\mathbf{x}},{\mathbf{y}})=\overline{\bigvee_{j=1}^{z}f_{j}({\mathbf{x}},{\mathbf{y}})}

where

fj(𝐱,𝐲)=iBjxiyif_{j}({\mathbf{x}},{\mathbf{y}})=\bigvee_{i\in B_{j}}x_{i}\oplus y_{i}

and B_{j}=\{i\in[n]\mid(j-1)n/z+1\leq i\leq jn/z\}. An argument similar to the proof of Theorem 4.3 then shows that each f_j is computable by a selective neural set of size s′=2^{2n/z} and weight w′=1.
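The same brute-force check works for the equality function; the snippet below (illustration only, with n=4 and z=2) confirms the decomposition:

# Check that EQ_n equals the complement of the OR of the block functions
# f_j(x, y) = OR_{i in B_j} (x_i XOR y_i); n = 4 and z = 2.
from itertools import product

n, z = 4, 2
blocks = [range(j * n // z, (j + 1) * n // z) for j in range(z)]

for x in product((0, 1), repeat=n):
    for y in product((0, 1), repeat=n):
        f = [int(any(x[i] ^ y[i] for i in B)) for B in blocks]
        assert int(x == y) == int(not any(f))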

7.6 Proof of Theorem 5.1

Let δ be a discretizer and φ be an activation function such that δ∘φ has a silent range I. In the proof, we consider only the case where I is an open interval I=(t_min,t_max), because the proofs for the other cases are similar. Let C be a (δ∘φ)-circuit of size s, depth d, energy e, and weight w. We obtain the desired threshold circuit C′ by giving a procedure by which any (δ∘φ)-gate g in C can be safely replaced by a set of threshold gates.

Let g be an arbitrary (δ∘φ)-gate in C that computes g(𝐱,𝐲)=δ∘φ(p(𝐱,𝐲)). We first consider the interval [t_max,∞). Let P_g be the set of potential values of g that are at least t_max:

Pg={p(𝐚,𝐛)(𝐚,𝐛){0,1}2n,tmaxp(𝐚,𝐛)}.P_{g}=\{p({\mathbf{a}},{\mathbf{b}})\mid({\mathbf{a}},{\mathbf{b}})\in\{0,1\}^{2n},t_{\max}\leq p({\mathbf{a}},{\mathbf{b}})\}.

Since the activation function and weights are discretized, we have |Pg|=O((s+n)w)|P_{g}|=O((s+n)w).

We perform a binary search over P_g, and construct a threshold gate that outputs one exactly for the inputs (𝐚,𝐛) such that p(𝐚,𝐛) takes a particular value in P_g. For any Q⊆P_g, we define mid(Q) as the median of the integers in Q, and Q^+ (resp., Q^−) as the upper (resp., lower) half of Q:

Q^{+}=\{p\in Q\mid\mathrm{mid}(Q)\leq p\}\quad\mbox{ and }\quad Q^{-}=\{p\in Q\mid p<\mathrm{mid}(Q)\}.

If Q contains an even number of values, we take the greater of the two middle values as mid(Q).
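In code, the split reads as follows (a small sketch; Q stands for any set of integer potential values):

# Median and split used by the binary search; for an even |Q|, the greater
# of the two middle values serves as the median, as described above.
def mid(Q):
    return sorted(Q)[len(Q) // 2]

def split(Q):
    m = mid(Q)
    return {p for p in Q if m <= p}, {p for p in Q if p < m}  # (Q+, Q-)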

Let 𝐬 be a binary string. We construct a threshold gate g_𝐬 by induction on the length of 𝐬. For two strings 𝐭 and 𝐬, we write 𝐭≺𝐬 if 𝐭 is a proper prefix of 𝐬, and we denote by 𝐭0 (resp., 𝐭1) the string 𝐭 followed by 0 (resp., by 1).

As the base of our construction, we consider the empty string ϵ. Let P_ϵ=P_g. We construct a threshold gate g_ϵ that computes

gϵ(𝐱,𝐲)=sign(p(𝐱,𝐲)tg[ϵ]),g_{\epsilon}({\mathbf{x}},{\mathbf{y}})=\mathrm{sign}\left(p({\mathbf{x}},{\mathbf{y}})-t_{g}[\epsilon]\right),

where tg[ϵ]=mid(Pϵ)t_{g}[\epsilon]=\mathrm{mid}(P_{\epsilon}).

Suppose that we have constructed the gates g_𝐭 for every 𝐭 with |𝐭|≤k−1. Consider a string 𝐬 of length k; by the induction hypothesis, we have a gate g_𝐭 for every 𝐭≺𝐬. Let 𝐬′ be the string obtained by dropping the last symbol of 𝐬. Then, we define

P𝐬={P𝐬+if the last symbol of 𝐬 is 1;P𝐬if the last symbol of 𝐬 is 0.P_{\mathbf{s}}=\left\{\begin{array}[]{ll}P_{{\mathbf{s}}^{\prime}}^{+}&\mbox{if the last symbol of ${\mathbf{s}}$ is 1};\\ P_{{\mathbf{s}}^{\prime}}^{-}&\mbox{if the last symbol of ${\mathbf{s}}$ is 0}.\end{array}\right.

Let W=3(s+n)wW=3(s+n)w. We construct a threshold gate g𝐬g_{{\mathbf{s}}} as follows:

g𝐬(𝐱,𝐲)=sign(p(𝐱,𝐲)+𝐭𝐬w𝐭,𝐬g𝐭(𝐱,𝐲)tg[𝐬]),g_{{\mathbf{s}}}({\mathbf{x}},{\mathbf{y}})=\mathrm{sign}\left(p({\mathbf{x}},{\mathbf{y}})+\sum_{{\mathbf{t}}\prec{\mathbf{s}}}w_{{\mathbf{t}},{\mathbf{s}}}\cdot g_{{\mathbf{t}}}({\mathbf{x}},{\mathbf{y}})-t_{g}[{{\mathbf{s}}}]\right),

where w𝐭,𝐬w_{{\mathbf{t}},{\mathbf{s}}} is the weight of the output of g𝐭g_{{\mathbf{t}}} and is defined as

w𝐭,𝐬={Wif 𝐭1 is a prefix of 𝐬;Wif 𝐭0 is a prefix of 𝐬;0otherwise,w_{{\mathbf{t}},{\mathbf{s}}}=\left\{\begin{array}[]{ll}W&\mbox{if ${\mathbf{t}}1$ is a prefix of ${\mathbf{s}}$};\\ -W&\mbox{if ${\mathbf{t}}0$ is a prefix of ${\mathbf{s}}$};\\ 0&\mbox{otherwise},\end{array}\right.

and t_g[𝐬]=mid(P_𝐬)+W·N_1(𝐬), where N_1(𝐬) is the number of ones in 𝐬; the term W·N_1(𝐬) offsets the feedback from the gates g_𝐭 such that 𝐭1 is a prefix of 𝐬 when they all output one.

We repeatedly apply the above procedure until |P_𝐬|=1. Since we perform a binary search over P_g, we construct O(|P_g|)=O((s+n)w) gates in total, and the length of 𝐬 is O(log(s+n)+log w) for every gate g_𝐬.
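The gadget can be exercised on a toy example. The following sketch is our illustration (hypothetical integer potentials P_g with t_max=5; W stands in for 3(s+n)w; dphi is a hypothetical stand-in for δ∘φ), not the authors' code; it numerically checks Claims 27 and 28 below, and also that weighting each leaf gate by the corresponding output value recovers the value of g, mirroring the replacement step described after Claim 28:

# Toy instance of the binary-search gadget over hypothetical potentials.
Pg = {5, 7, 8, 11, 13}                 # all values at least t_max = 5
W = 3 * max(abs(p) for p in Pg)        # stands in for W = 3(s+n)w

def mid(Q):
    return sorted(Q)[len(Q) // 2]      # greater middle value for even |Q|

def build(P, s=""):
    """Map each string s to the set P_s, splitting until |P_s| = 1."""
    tree = {s: P}
    if len(P) > 1:
        m = mid(P)
        tree.update(build({p for p in P if m <= p}, s + "1"))   # P_s^+
        tree.update(build({p for p in P if p < m}, s + "0"))    # P_s^-
    return tree

tree = build(Pg)
leaves = [s for s in tree if len(tree[s]) == 1]

def fire(p):
    """Output of every gate g_s on a potential value p."""
    out = {}
    for s in sorted(tree, key=len):                  # prefixes come first
        pot = p - mid(tree[s]) - W * s.count("1")    # p - t_g[s]
        for k in range(len(s)):                      # feedback from g_t
            pot += (W if s[k] == "1" else -W) * out[s[:k]]
        out[s] = 1 if pot >= 0 else 0
    return out

for p in Pg:           # Claim 27(ii): exactly the leaf with P_s = {p} fires
    out = fire(p)
    assert [s for s in leaves if out[s]] == [s for s in leaves
                                             if tree[s] == {p}]
for p in (-2, 0, 4):   # Claim 28: below t_max, no gate fires
    assert not any(fire(p).values())

dphi = lambda p: p // 2        # hypothetical stand-in for delta o phi
for p in Pg:                   # the leaves recover g's output value
    out = fire(p)
    assert sum(dphi(min(tree[s])) * out[s] for s in leaves) == dphi(p)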

Consider the strings 𝐬{\mathbf{s}} for which we have constructed g𝐬g_{{\mathbf{s}}}. Let

Sg={𝐬2|P𝐬|} and Lg={𝐬|P𝐬|=1}.S_{g}=\{{\mathbf{s}}\mid 2\leq|P_{\mathbf{s}}|\}\quad\mbox{ and }\quad L_{g}=\{{\mathbf{s}}\mid|P_{\mathbf{s}}|=1\}.

For each (𝐚,𝐛)∈{0,1}^{2n} with t_max≤p(𝐚,𝐛), we denote by 𝐬*(𝐚,𝐛)∈L_g the unique string that satisfies P_{𝐬*(𝐚,𝐛)}={p(𝐚,𝐛)}. The following claims show that the gates g_𝐬 suffice to simulate g.

Claim 27.

Let (𝐚,𝐛){0,1}2n({\mathbf{a}},{\mathbf{b}})\in\{0,1\}^{2n} be an input assignment that satisfies tmaxp(𝐚,𝐛)t_{\max}\leq p({\mathbf{a}},{\mathbf{b}}).

(i)

For 𝐬Sg{\mathbf{s}}\in S_{g}, g𝐬g_{\mathbf{s}} outputs one if and only if 𝐬1{\mathbf{s}}1 is a prefix of 𝐬(𝐚,𝐛){\mathbf{s}}^{*}({\mathbf{a}},{\mathbf{b}}).

(ii)

For 𝐬Lg{\mathbf{s}}\in L_{g}, g𝐬g_{{\mathbf{s}}} outputs one if and only if 𝐬=𝐬(𝐚,𝐛){\mathbf{s}}={\mathbf{s}}^{*}({\mathbf{a}},{\mathbf{b}}).

Proof 7.3.

Consider an arbitrary input assignment (𝐚,𝐛){0,1}2n({\mathbf{a}},{\mathbf{b}})\in\{0,1\}^{2n} that satisfies tmaxp(𝐚,𝐛)t_{\max}\leq p({\mathbf{a}},{\mathbf{b}}). For notational simplicity, we write 𝐬{\mathbf{s}}^{*} for 𝐬(𝐚,𝐛){\mathbf{s}}^{*}({\mathbf{a}},{\mathbf{b}}).

Proof of (i). We verify the claim by induction on the length of 𝐬{\mathbf{s}}. For the base case, we consider ϵ{\mathbf{\epsilon}}. It suffices to show that gϵ(𝐚,𝐛)=1g_{\epsilon}({\mathbf{a}},{\mathbf{b}})=1 if the first symbol of 𝐬{\mathbf{s}}^{*} is 1; otherwise, gϵ(𝐚,𝐛)=0g_{\epsilon}({\mathbf{a}},{\mathbf{b}})=0. If the first symbol is 1, we have p(𝐚,𝐛)Pϵ+p({\mathbf{a}},{\mathbf{b}})\in P^{+}_{\epsilon}, which implies that tg[ϵ]p(𝐚,𝐛)t_{g}[\epsilon]\leq p({\mathbf{a}},{\mathbf{b}}). Thus, gϵ(𝐚,𝐛)=1g_{\epsilon}({\mathbf{a}},{\mathbf{b}})=1. Similarly, if the first symbol is zero, we have p(𝐚,𝐛)Pϵp({\mathbf{a}},{\mathbf{b}})\in P^{-}_{\epsilon}, implying that p(𝐚,𝐛)<tg[ϵ]p({\mathbf{a}},{\mathbf{b}})<t_{g}[\epsilon]. Thus, gϵ(𝐚,𝐛)=0g_{\epsilon}({\mathbf{a}},{\mathbf{b}})=0.

As the induction hypothesis, we assume that, for a positive integer k, every gate g_𝐭 with |𝐭|≤k−1 outputs one if and only if 𝐭1 is a prefix of 𝐬*. We then consider a string 𝐬 of length k.

We first verify that g_𝐬 outputs zero if 𝐬 is not a prefix of 𝐬*. If 𝐬 is not a prefix of 𝐬*, there exists a prefix 𝐭′ of 𝐬 such that 𝐭′1 (resp., 𝐭′0) is a prefix of 𝐬, whereas 𝐭′0 (resp., 𝐭′1) is a prefix of 𝐬*.

Consider the case where 𝐭0{\mathbf{t}}^{\prime}0 is a prefix of 𝐬{\mathbf{s}}^{*} (i.e., 𝐭1{\mathbf{t}}^{\prime}1 is a prefix of 𝐬{\mathbf{s}}). In this case, the induction hypothesis implies that: g𝐭(𝐚,𝐛)=0g_{{\mathbf{t}}^{\prime}}({\mathbf{a}},{\mathbf{b}})=0. In addition, since 𝐭1{\mathbf{t}}^{\prime}1 is a prefix of 𝐬{\mathbf{s}}, we have w𝐭,𝐬=Ww_{{\mathbf{t}}^{\prime},{\mathbf{s}}}=W. Thus, the potential of g𝐬g_{{\mathbf{s}}} for (𝐚,𝐛)({\mathbf{a}},{\mathbf{b}}) is at most

p(𝐚,𝐛)+𝐭𝐬w𝐭,𝐬g𝐭(𝐚,𝐛)tg[𝐬]\displaystyle p({\mathbf{a}},{\mathbf{b}})+\sum_{{\mathbf{t}}\prec{\mathbf{s}}}w_{{\mathbf{t}},{\mathbf{s}}}\cdot g_{{\mathbf{t}}}({\mathbf{a}},{\mathbf{b}})-t_{g}[{{\mathbf{s}}}]
\displaystyle\leq p(𝐚,𝐛)+W(N1(𝐬)1)mid(P𝐬)WN1(𝐬)\displaystyle p({\mathbf{a}},{\mathbf{b}})+W\cdot(N_{1}({\mathbf{s}})-1)-\mathrm{mid}(P_{{\mathbf{s}}})-W\cdot N_{1}({\mathbf{s}})
\displaystyle\leq p(𝐚,𝐛)mid(P𝐬)W,\displaystyle p({\mathbf{a}},{\mathbf{b}})-\mathrm{mid}(P_{{\mathbf{s}}})-W,

which is less than zero because |p(𝐚,𝐛)|≤(s+n)w, |mid(P_𝐬)|≤(s+n)w, and W=3(s+n)w.

Consider the case in which 𝐭1{\mathbf{t}}^{\prime}1 is a prefix of 𝐬{\mathbf{s}}^{*} (i.e., 𝐭0{\mathbf{t}}^{\prime}0 is a prefix of 𝐬{\mathbf{s}}). In this case, the induction hypothesis implies that g𝐭(𝐚,𝐛)=1g_{{\mathbf{t}}^{\prime}}({\mathbf{a}},{\mathbf{b}})=1. In addition, since 𝐭0{\mathbf{t}}^{\prime}0 is a prefix of 𝐬{\mathbf{s}}, w𝐭,𝐬=Ww_{{\mathbf{t}}^{\prime},{\mathbf{s}}}=-W. Thus, the potential of g𝐬g_{{\mathbf{s}}} for (𝐚,𝐛)({\mathbf{a}},{\mathbf{b}}) is at most

p(𝐚,𝐛)+𝐭𝐬w𝐭,𝐬g𝐭(𝐚,𝐛)tg[𝐬]\displaystyle p({\mathbf{a}},{\mathbf{b}})+\sum_{{\mathbf{t}}\prec{\mathbf{s}}}w_{{\mathbf{t}},{\mathbf{s}}}\cdot g_{{\mathbf{t}}}({\mathbf{a}},{\mathbf{b}})-t_{g}[{{\mathbf{s}}}]
\displaystyle\leq p(𝐚,𝐛)+(WN1(𝐬)W)mid(P𝐬)WN1(𝐬)\displaystyle p({\mathbf{a}},{\mathbf{b}})+(W\cdot N_{1}({\mathbf{s}})-W)-\mathrm{mid}(P_{{\mathbf{s}}})-W\cdot N_{1}({\mathbf{s}})
\displaystyle\leq p(𝐚,𝐛)mid(P𝐬)W,\displaystyle p({\mathbf{a}},{\mathbf{b}})-\mathrm{mid}(P_{{\mathbf{s}}})-W,

which is again less than zero for the same reason.

Suppose 𝐬{\mathbf{s}} is a prefix of 𝐬{\mathbf{s}}^{*}. The induction hypothesis implies that the potential of g𝐬g_{\mathbf{s}} is equal to

p(𝐚,𝐛)+𝐭𝐬w𝐭,𝐬g𝐭(𝐚,𝐛)tg[𝐬]\displaystyle p({\mathbf{a}},{\mathbf{b}})+\sum_{{\mathbf{t}}\prec{\mathbf{s}}}w_{{\mathbf{t}},{\mathbf{s}}}\cdot g_{{\mathbf{t}}}({\mathbf{a}},{\mathbf{b}})-t_{g}[{{\mathbf{s}}}]
=\displaystyle= p(𝐚,𝐛)+WN1(𝐬)mid(P𝐬)WN1(𝐬)\displaystyle p({\mathbf{a}},{\mathbf{b}})+W\cdot N_{1}({\mathbf{s}})-\mathrm{mid}(P_{{\mathbf{s}}})-W\cdot N_{1}({\mathbf{s}})
=\displaystyle= p(𝐚,𝐛)mid(P𝐬).\displaystyle p({\mathbf{a}},{\mathbf{b}})-\mathrm{mid}(P_{{\mathbf{s}}}).

Thus, similar to the base case, we can show that g_𝐬(𝐚,𝐛)=1 if 𝐬1 is a prefix of 𝐬*, and g_𝐬(𝐚,𝐛)=0 otherwise. More formally, if 𝐬1 is a prefix of 𝐬*, we have p(𝐚,𝐛)∈P^+_𝐬, which implies that mid(P_𝐬)≤p(𝐚,𝐛); thus, g_𝐬(𝐚,𝐛)=1. If 𝐬0 is a prefix of 𝐬*, we have p(𝐚,𝐛)∈P^−_𝐬, which implies that p(𝐚,𝐛)<mid(P_𝐬); thus, g_𝐬(𝐚,𝐛)=0.

Proof of (ii). Consider 𝐬Lg{\mathbf{s}}\in L_{g}. Similar to claim (i), we can verify that g𝐬(𝐚,𝐛)=0g_{\mathbf{s}}({\mathbf{a}},{\mathbf{b}})=0 if 𝐬𝐬{\mathbf{s}}\neq{\mathbf{s}}^{*}. If 𝐬=𝐬{\mathbf{s}}={\mathbf{s}}^{*}, claim (i) also implies that the potential of g𝐬g_{{\mathbf{s}}^{*}} is equal to

p(𝐚,𝐛)+𝐭𝐬w𝐭,𝐬g𝐭(𝐚,𝐛)tg[𝐬]\displaystyle p({\mathbf{a}},{\mathbf{b}})+\sum_{{\mathbf{t}}\prec{\mathbf{s}}^{*}}w_{{\mathbf{t}},{\mathbf{s}}^{*}}\cdot g_{{\mathbf{t}}}({\mathbf{a}},{\mathbf{b}})-t_{g}[{{\mathbf{s}}^{*}}]
=\displaystyle= p({\mathbf{a}},{\mathbf{b}})+W\cdot N_{1}({\mathbf{s}}^{*})-\mathrm{mid}(P_{{\mathbf{s}}^{*}})-W\cdot N_{1}({\mathbf{s}}^{*})
=\displaystyle= p(𝐚,𝐛)mid(P𝐬).\displaystyle p({\mathbf{a}},{\mathbf{b}})-\mathrm{mid}(P_{{\mathbf{s}}^{*}}).

Since P𝐬={p(𝐚,𝐛)}P_{{\mathbf{s}}^{*}}=\{p({\mathbf{a}},{\mathbf{b}})\}, mid(P𝐬)=p(𝐚,𝐛)\mathrm{mid}(P_{{\mathbf{s}}^{*}})=p({\mathbf{a}},{\mathbf{b}}). Thus, we have

p(𝐚,𝐛)mid(P𝐬)=0,\displaystyle p({\mathbf{a}},{\mathbf{b}})-\mathrm{mid}(P_{{\mathbf{s}}^{*}})=0,

which implies that g𝐬(𝐚,𝐛)=1g_{{\mathbf{s}}^{*}}({\mathbf{a}},{\mathbf{b}})=1.

Claim 28.

For any (𝐚,𝐛){0,1}2n({\mathbf{a}},{\mathbf{b}})\in\{0,1\}^{2n} that satisfies p(𝐚,𝐛)<tmaxp({\mathbf{a}},{\mathbf{b}})<t_{\max}, every gate g𝐬g_{\mathbf{s}} outputs zero.

Proof 7.4.

Since p(𝐚,𝐛)<t_max≤mid(P_g), we have g_ϵ(𝐚,𝐛)=0. By an argument similar to that of Claim 27, every gate g_𝐬 such that 𝐬 contains the symbol 1 also outputs zero. If 𝐬 consists only of 0s, we have p(𝐚,𝐛)<t_max≤mid(P_𝐬), since every value in P_𝐬⊆P_g is at least t_max; this implies that g_𝐬(𝐚,𝐛)=0.

Claims 27 and 28 imply that we can safely replace g by the gates g_𝐬, 𝐬∈S_g∪L_g: for each 𝐬∈L_g, we connect the output of g_𝐬, with weight w_{g,g′}·(δ∘φ(p)) where p is the unique value in P_𝐬, to every gate g′ to which g is originally connected in C. Since |S_g∪L_g|=O((s+n)w) and the length of 𝐬 is O(log(s+n)+log w), the size and depth increase by these factors, respectively.

We can construct another set of threshold gates for (−∞,t_min] in a similar manner. That is, it suffices to apply the above procedure to the gate obtained by multiplying the weights and threshold of g by −1, together with the interval [−t_min,∞).

Using the above procedure, we replace every gate g∈G∖{g^{clf}} in C with a set of threshold gates, completing the construction of C′. Clearly, the size of C′ is O(s·(s+n)w). Since g_𝐬 receives the outputs of g_𝐭 for every 𝐭≺𝐬 and the length of 𝐬 is O(log(s+n)+log w), the depth of C′ is O(d·(log(s+n)+log w)). Claims 27 and 28 imply that, for each gate of C outputting a non-zero value, only the gates g_𝐬 indexed by prefixes of 𝐬* (together with the leaf g_{𝐬*}) output one, so the energy of C′ is O(e·(log(s+n)+log w)). Clearly, the weight of C′ is W=O((s+n)w). Thus, applying Theorem 3.1 to C′, log(rk(M_C)) is bounded by

O(e(\log(s+n)+\log w)\cdot d(\log(s+n)+\log w)\cdot(\log s+\log w+\log n))
=O(ed(\log s+\log n+\log w)^{3}),

as desired.