
Finding Everything within Random Binary Networks

Kartik Sreenivasan    Shashank Rajput    Jy-yong Sohn    Dimitris Papailiopoulos
Abstract

A recent work by Ramanujan et al. (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks. A follow-up line of theoretical work provides justification of these findings by proving that slightly overparameterized neural networks, with commonly used continuous-valued random initializations, can indeed be pruned to approximate any target network. In this work, we show that the amplitude of those random weights does not even matter. We prove that any target network can be approximated up to arbitrary accuracy by simply pruning a random network of binary $\{\pm 1\}$ weights that is only a polylogarithmic factor wider and deeper than the target network.


1 Introduction

As the number of parameters of state-of-the-art networks continues to increase, pruning has become a prime choice for sparsifying and compressing a model. A rich and long body of research, dating back to the 80s, shows that one can prune most networks to a tiny fraction of their size, while maintaining high accuracy (Mozer and Smolensky, 1989; Hassibi and Stork, 1993; Levin et al., 1994; LeCun et al., 1990; Han et al., 2015a,b; Li et al., 2016; Wen et al., 2016; Hubara et al., 2016, 2017; He et al., 2017; Wu et al., 2016; Zhu et al., 2016; He et al., 2018; Zhu and Gupta, 2017; Cheng et al., 2019; Blalock et al., 2020; Deng et al., 2020).

A downside of most classic pruning approaches is that they sparsify a model once it is trained to full accuracy, followed by significant fine-tuning, resulting in a computationally burdensome procedure. Frankle and Carbin (2018) conjectured the existence of lottery tickets, i.e., sparse subnetworks at (or near) initialization, that can be trained—just once—to reach the accuracy of state-of-the-art dense models. This may help alleviate the computational burden of prior approaches, as training is predominantly carried out on a much sparser model. The conjectured existence of these lucky tickets is referred to as the Lottery Ticket Hypothesis (LTH). Frankle and Carbin (2018) and Frankle et al. (2020) show that not only do lottery tickets exist, but also that the cost of “winning the lottery” is not very high.

Along the LTH literature, a curious phenomenon was observed: even at initialization and in the complete absence of training, one can find subnetworks of the random initial model that have prediction accuracy far beyond random guessing (Zhou et al., 2019; Ramanujan et al., 2020; Wang et al., 2019). Ramanujan et al. (2020) reported this in its most striking form: state-of-the-art accuracy models for CIFAR10 and ImageNet simply reside within slightly larger, yet completely random networks, and appropriate pruning—and mere pruning—can reveal them! This “pruning is all you need” phenomenon is sometimes referred to as the Strong Lottery Ticket Hypothesis.

A recent line of work attempts to establish the theoretical validity of the Strong LTH by studying the following non-algorithmic question:

Can a random network be pruned to approximate a target function $f(x)$?

Here, $f$ represents a bounded-range labeling function that acts on inputs $x\in\mathcal{X}$, and is itself a neural network of finite width and depth. This assumption is not limiting, as neural networks are universal approximators (Stinchombe, 1989; Barron, 1993; Scarselli and Tsoi, 1998; Klusowski and Barron, 2018; Perekrestenko et al., 2018; Hanin, 2019; Kidger and Lyons, 2020). Note that the answer to the above question is trivial if one does not constrain the size of the random initial network, for all interesting cases of “random”. Indeed, if we start with a random neural network that is exponentially wider than the one representing $f$, by sheer luck one can always find weights, for each layer, near-identical to those of any target neural network representing $f$. Achieving this result with constrained overparameterization, i.e., the degree by which the random network to be pruned is wider/deeper than $f$, is precisely why this question is challenging.

Malach et al. (2020) were the first to prove that the Strong LTH is true, assuming polynomial-sized overparameterization. Specifically, under some mild assumptions, they showed that to approximate a target network of width $d$ and depth $l$ to within error $\varepsilon$, it suffices to prune a random network of width $\widetilde{\mathcal{O}}(d^{2}l^{2}/\varepsilon^{2})$ and depth $2l$. Pensia et al. (2020) offered an exponentially tighter bound using a connection to the SubsetSum problem. They showed that to approximate a target network within error $\varepsilon$, it is sufficient to prune a randomly initialized network of width $\mathcal{O}(d\log(dl/\varepsilon))$ and depth $2l$. A corresponding lower bound for constant-depth networks was also established. Orseau et al. (2020) were also able to reduce the dependence on $\varepsilon$ to logarithmic. They show that in order to approximate a target network within error $\varepsilon$, it suffices to prune a random network of width $\mathcal{O}(d^{2}\log(dl/\varepsilon))$ if the weights are initialized with the hyperbolic distribution. However, this bound on overparameterization is still polynomial in the width $d$.

The above theoretical studies have focused exclusively on continuous distributions for initialization. However, in the experimental work by Ramanujan et al. (2020), the authors obtain the best performance by pruning networks of scaled, binary weights. Training binary networks has been studied extensively in the past (Courbariaux et al., 2015; Simons and Lee, 2019) as they are compute-, memory-, and hardware-efficient, though in many cases they suffer from a significant loss of accuracy. The findings of Ramanujan et al. (2020) suggest that the accuracy loss may not be fundamental to networks of binary weights, when such networks are learned by pruning. Arguably, since “carving out” subnetworks of random models is expressive enough to approximate a target function, e.g., according to (Pensia et al., 2020; Malach et al., 2020), one is left to wonder about the importance of weights altogether. So perhaps, binary weights is all you need.

Diffenderfer and Kailkhura (2021) showed that indeed scaled binary networks can be pruned to approximate any target function. The required overparameterization is similar to that of Malach et al. (2020), i.e., polynomial in the width, depth, and error of the approximation. Hence, in a similar vein to the improvement that Pensia et al. (2020) offered over the bounds of Malach et al. (2020), we explore whether such an improvement is possible on the results of Diffenderfer and Kailkhura (2021).

Our Contributions:

In this work, we offer an exponential improvement to the theoretical bounds of Diffenderfer and Kailkhura (2021), establishing the following.

Theorem 1.

(informal) Consider a randomly initialized, FC, binary $\{\pm 1\}$ network of ReLU activations, with depth $\Theta\left(l\log\frac{dl}{\varepsilon}\right)$ and width $\Theta\left(d\log^{2}\frac{dl}{\varepsilon\delta}\right)$, whose last layer consists of scaled binary weights $\{\pm C\}$. Then there is always a constant $C$ such that this network can be pruned to approximate any FC ReLU network of depth $l$ and width $d$, up to error $\varepsilon>0$, with probability at least $1-\delta$.

Therefore, we show that in order to approximate any target network, it suffices to prune a binary network that is only logarithmically overparameterized (Figure 1). In contrast to Diffenderfer and Kailkhura (2021), our construction only requires that the last layer be scaled, while the rest of the network is purely binary $\{\pm 1\}$. We show a comparison of the known Strong LTH results in Table 1.

Reference | Width | Depth | Total Params | Weights
Malach et al. (2020) | $\widetilde{\mathcal{O}}(d^{2}l^{2}/\varepsilon^{2})$ | $2l$ | $\widetilde{\mathcal{O}}(d^{2}l^{3}/\varepsilon^{2})$ | Real
Orseau et al. (2020) | $\mathcal{O}(d^{2}\log(dl/\varepsilon))$ | $2l$ | $\mathcal{O}(d^{2}l\log(dl/\varepsilon))$ | Real (Hyperbolic)
Pensia et al. (2020) | $\mathcal{O}(d\log(dl/\min\{\varepsilon,\delta\}))$ | $2l$ | $\mathcal{O}(dl\log(dl/\min\{\varepsilon,\delta\}))$ | Real
Diffenderfer and Kailkhura (2021) | $\mathcal{O}((ld^{3/2}/\varepsilon)+ld\log(ld/\delta))$ | $2l$ | $\mathcal{O}((l^{2}d^{3/2}/\varepsilon)+l^{2}d\log(ld/\delta))$ | $\{\pm\varepsilon\}$
Ours, Theorem 1 | $\mathcal{O}(d\log^{2}(dl/\varepsilon\delta))$ | $\mathcal{O}(l\log(dl/\varepsilon))$ | $\mathcal{O}(dl\log^{3}(dl/\varepsilon\delta))$ | Binary $\{\pm 1\}$ [1]
[1] The weights of all layers are purely binary $\{\pm 1\}$, except for the last layer, which is scaled so that its weights lie in $\{\pm\varepsilon^{\prime}\}$ where $\varepsilon^{\prime}=(\varepsilon/d^{2}l)^{l}$.
Table 1: Comparison of the upper bounds on the overparameterization needed to approximate a target network (of width $d$ and depth $l$) within error $\varepsilon>0$, with probability at least $1-\delta$, by pruning a randomly initialized network.
Figure 1: Approximating a target network with high accuracy by pruning an overparameterized random binary network. In this paper, we show that logarithmic overparameterization in both width and depth is sufficient.

In light of our theoretical results, one may wonder why the literature on training binary networks, i.e., assigning a sign pattern to a fixed architecture, reports a loss of accuracy, e.g., (Rastegari et al., 2016). Is this an algorithmic artifact, or does pruning random signs offer higher expressivity than assigning the signs? We show that there exist target functions that can be well approximated by pruning binary networks, yet none of the possible binary, fully connected networks of the same architecture can approximate them.

Proposition 1.

(informal) There exists a function $f$ that can be represented by pruning a random 2-layer binary network of width $d$, but not by any 2-layer fully connected binary network of width $d$.

Note that although finding a subnetwork of a random binary network results in a “ternary” architecture (e.g., $0$ becomes a possible weight), the total number of possible choices of subnetworks is $2^{N}$, where $N$ is the total number of weights. This is equal to the total number of sign assignments of the same FC network. Yet, as shown in the proposition above, pruning a random FC network is provably more expressive than finding a sign assignment for the same architecture.

2 Preliminaries and Problem Setup

Let $f({\bm{x}}):\mathbb{R}^{d_{0}}\rightarrow\mathbb{R}$ be the target FC network with $l$ layers and ReLU activations, represented as

f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}}))),

where ${\bm{x}}\in\mathbb{R}^{d_{0}}$ is the input, $\sigma({\bm{z}})=\max\{{\bm{z}},0\}$ is the ReLU activation, and ${\bm{W}}_{i}\in\mathbb{R}^{d_{i}\times d_{i-1}}$ is the weight matrix of layer $i\in[l]$. With slight abuse of terminology, we will refer to $f$ as a network, as opposed to a labeling function. We then consider a binary network of depth $l^{\prime}$,

g({\bm{x}})=\sigma((\varepsilon^{\prime}{\bm{B}}_{l^{\prime}})\sigma({\bm{B}}_{l^{\prime}-1}\dots\sigma({\bm{B}}_{1}{\bm{x}}))),

where ${\bm{B}}_{i}\in\{-1,+1\}^{d_{i}^{\prime}\times d_{i-1}^{\prime}}$ is a binary weight matrix, with all weights drawn uniformly at random from $\{\pm 1\}$, for all layers $i\in[l^{\prime}]$, and the last layer is multiplied by a factor $\varepsilon^{\prime}>0$. The scaling factor is calculated precisely in Section 3.2.3, where we show that it is unavoidable for function approximation (i.e., regression), though not for classification.

Our goal is to find the smallest network $g$ that contains a subnetwork $\tilde{g}$ which approximates $f$ well. More precisely, we will bound the overparameterization of the binary network under which one can find supermask matrices ${\bm{M}}_{i}\in\{0,1\}^{d_{i}^{\prime}\times d_{i-1}^{\prime}}$, for each layer $i\in[l^{\prime}]$, such that the pruned network

\tilde{g}({\bm{x}})=\sigma(\varepsilon^{\prime}({\bm{M}}_{l^{\prime}}\odot{\bm{B}}_{l^{\prime}})\sigma(({\bm{M}}_{l^{\prime}-1}\odot{\bm{B}}_{l^{\prime}-1})\dots\sigma(({\bm{M}}_{1}\odot{\bm{B}}_{1}){\bm{x}})))

is $\varepsilon$-close to $f$ in the sense of uniform approximation over the unit ball, i.e.,

\max_{{\bm{x}}\in\mathbb{R}^{d_{0}}:\|{\bm{x}}\|\leq 1}\|f({\bm{x}})-\tilde{g}({\bm{x}})\|\leq\varepsilon

for some desired $\varepsilon>0$. In this paper, we show that $g$ only needs to be polylogarithmically larger than the target network $f$ to have this property. We formalize this and provide a proof in the following sections.

Henceforth, we denote $[k]=\{1,2,\dots,k\}$ for a positive integer $k$. Unless otherwise specified, $\|\cdot\|$ refers to the $\ell_{2}$ norm. We also use the max norm of a matrix, defined as $\|{\bm{A}}\|_{\max}:=\max_{ij}|A_{ij}|$. The element-wise product between two matrices ${\bm{A}}$ and ${\bm{B}}$ is denoted by ${\bm{A}}\odot{\bm{B}}$. We assume without loss of generality that the weights are specified in the base-10 system. However, since we do not specify the base of the logarithm explicitly in our computations, we use the $\Theta(\cdot)$ notation to hide constant factors that may arise from choosing different bases.

3 Strong Lottery Tickets by Binary Expansion

In this section, we formally present our approximation results. We show that in order to approximate any target network $f({\bm{x}})$ within arbitrary approximation error $\varepsilon$, it suffices to prune a random binary network $g({\bm{x}})$ (binary $\{\pm 1\}$ in all layers except the scaled last layer) that is just polylogarithmically deeper and wider than the target network.

3.1 Main Result

First, we point out that the scaling factor $\varepsilon^{\prime}$ in the final layer of $g({\bm{x}})$ is necessary for achieving arbitrarily small approximation error for any target network $f({\bm{x}})$. In other words, it is impossible to approximate an arbitrary target network with a purely binary $\{\pm 1\}$ network, regardless of the overparameterization. To see this, note that for the simple target function $f(x)=\varepsilon x$, $x\in[0,1]$, with $\varepsilon\in[0.5,1)$, the best approximation possible by a binary network is $g(x)=x$, and therefore $\max_{x\in\mathbb{R}:|x|\leq 1}|f(x)-g(x)|\geq(1-\varepsilon)$ for any binary network $g$. We will show that just by allowing the weights of the final layer to be scaled, we can provide a uniform approximation guarantee while the rest of the network remains binary $\{\pm 1\}$. Formally, we have the following theorem:

Theorem 1.

Consider the set of FC ReLU networks $\mathcal{F}$ defined as

\mathcal{F}=\{f:f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}}))),\ \forall i\ {\bm{W}}_{i}\in\mathbb{R}^{d_{i}\times d_{i-1}},\ \|{\bm{W}}_{i}\|\leq 1\},

and let $d=\max_{i}d_{i}$. For arbitrary target approximation error $\varepsilon$, let $g({\bm{x}})=\sigma(\varepsilon^{\prime}{\bm{B}}_{l^{\prime}}\sigma({\bm{B}}_{l^{\prime}-1}\dots\sigma({\bm{B}}_{1}{\bm{x}})))$ (here $\varepsilon^{\prime}=(\varepsilon/d^{2}l)^{l}$) be a randomly initialized network with depth $l^{\prime}=\Theta(l\log(d^{2}l/\varepsilon))$, such that every weight is drawn uniformly from $\{-1,+1\}$ and the layer widths are $\Theta\left(\log(d^{2}l/\varepsilon)\cdot\log\left(\frac{dl\log^{2}(d^{2}l/\varepsilon)}{\delta}\right)\right)$ times wider than those of $f({\bm{x}})$.

Then, with probability at least $1-\delta$, for every $f\in\mathcal{F}$, there exist pruning matrices ${\bm{M}}_{i}$ such that

\max_{{\bm{x}}\in\mathbb{R}^{d_{0}}:\|{\bm{x}}\|\leq 1}|f({\bm{x}})-\tilde{g}({\bm{x}})|\leq\varepsilon

holds where

\tilde{g}({\bm{x}}):=\sigma(\varepsilon^{\prime}({\bm{M}}_{l^{\prime}}\odot{\bm{B}}_{l^{\prime}})\sigma(({\bm{M}}_{l^{\prime}-1}\odot{\bm{B}}_{l^{\prime}-1})\dots\sigma(({\bm{M}}_{1}\odot{\bm{B}}_{1}){\bm{x}}))).
Remark 1.

The dimensions of the weight matrices of $g({\bm{x}})$ in Theorem 1 are specified more precisely below. Let $p=d^{2}l/\varepsilon$. Since $l^{\prime}=l\log(p)$, we have $\lfloor\log(p)\rfloor$ layers in $g({\bm{x}})$ that approximate each layer in $f({\bm{x}})$. For each $i\in[l]$, the dimension of ${\bm{B}}_{(i-1)\lfloor\log(p)\rfloor+1}$ is

\Theta\left(d_{i-1}\log(p)\log\left(\frac{dl\log^{2}(p)}{\delta}\right)\right)\times d_{i-1},

the dimension of ${\bm{B}}_{i\lfloor\log(p)\rfloor}$ is

d_{i}\times\Theta\left(d_{i-1}\log(p)\log\left(\frac{dl\log^{2}(p)}{\delta}\right)\right),

and the remaining ${\bm{B}}_{(i-1)\lfloor\log(p)\rfloor+k}$, where $1<k<\lfloor\log(p)\rfloor$, have dimension

\Theta\left(d_{i-1}\log(p)\log\left(\frac{dl\log^{2}(p)}{\delta}\right)\right)\times\Theta\left(d_{i-1}\log(p)\log\left(\frac{dl\log^{2}(p)}{\delta}\right)\right).
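To make these dimensions concrete, the following sketch computes the layer shapes of Remark 1 for a given target architecture. The helper name remark1_dims, the choice of base-2 logarithms, and the unit constants hidden inside the $\Theta(\cdot)$ notation are assumptions made purely for illustration.

```python
import math

def remark1_dims(dims, l, eps, delta):
    """Sketch of the layer shapes in Remark 1 (all constants inside Theta(.) set to 1).

    dims: [d_0, d_1, ..., d_l] target layer widths; l: target depth.
    Returns, for each target layer i, the shapes of the floor(log p)
    binary layers of g that replace it.
    """
    d = max(dims)
    p = d ** 2 * l / eps                          # p = d^2 l / eps
    n = int(math.floor(math.log2(p)))             # floor(log p) binary layers per target layer
    shapes = []
    for i in range(1, l + 1):
        wide = int(dims[i - 1] * math.log2(p)
                   * math.log2(d * l * math.log2(p) ** 2 / delta))
        block = [(wide, dims[i - 1])]             # B_{(i-1) floor(log p) + 1}
        block += [(wide, wide)] * max(n - 2, 0)   # intermediate layers
        block += [(dims[i], wide)]                # B_{i floor(log p)}
        shapes.append(block)
    return shapes

# Example: a target network with l = 3 layers, all of width 64
blocks = remark1_dims([64, 64, 64, 64], l=3, eps=0.01, delta=0.01)
print(len(blocks[0]), blocks[0][0], blocks[0][-1])
```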

3.2 Proof of Theorem 1

First, we show in Section 3.2.1 that any target network $f({\bm{x}})\in\mathcal{F}$ can be approximated within $\varepsilon>0$ by another network $\hat{g}_{p}({\bm{x}})$ whose weights have finite precision of at most $p$ digits, where $p$ is logarithmic in $d$, $l$, and $1/\varepsilon$.

Then, in Section 3.2.2, we show that any finite-precision network can be represented exactly by a binary network in which all the weights are binary ($\pm 1$) except for the last layer, whose weights are scaled binary ($\pm\varepsilon^{\prime}$). The proof sketch is as follows. First, through a simple scaling argument, we show that any finite-precision network is equivalent to a network with integer weights in every layer except the last. We then present Theorem 2, which gives a deterministic construction of a binary network, using diamond-shaped gadgets, that can be pruned to represent any integer network. Lemma 7 extends the result to the case when the network is initialized with random binary weights.

Putting these together completes the proof of Theorem 1 as shown in Section 3.2.3.

3.2.1 Logarithmic precision is sufficient

First, we consider the simplest setting, wherein the target network contains a single weight, i.e., $h(x)=\sigma(wx)$, where $x,w$ are scalars whose absolute values are bounded by $1$. This assumption can be relaxed to any finite norm bound. We begin by noting the fact that $\log(1/\varepsilon)$ digits of precision are sufficient to approximate a real number within error $\varepsilon$, as formalized below.

Fact 1.

Let $w\in\mathbb{R}$, $|w|\leq 1$, and let $\hat{w}$ be a finite-precision truncation of $w$ with $\Theta(\log(1/\varepsilon))$ digits. Then $|w-\hat{w}|\leq\varepsilon$ holds.

Now we state the result for the case when the target network contains a single weight $w$.

Lemma 1.

Consider a network $h(x)=\sigma(wx)$ where $w\in\mathbb{R}$, $|w|\leq 1$. For a given $\varepsilon>0$, let $\hat{w}$ be a finite-precision truncation of $w$ up to $\log(1/\varepsilon)$ digits and let $\hat{g}_{\log(1/\varepsilon)}(x)=\sigma(\hat{w}x)$. Then we have

\max_{x\in\mathbb{R}:|x|\leq 1}|h(x)-\hat{g}_{\log(1/\varepsilon)}(x)|\leq\varepsilon.
Proof.

By Fact 1, we know that $|w-\hat{w}|\leq\varepsilon$. Applying Cauchy–Schwarz with $|x|\leq 1$ gives us $|\hat{w}x-wx|\leq\varepsilon$. Since this holds for any $x$ and ReLU is 1-Lipschitz, the result follows. ∎

Lemma 1 can be extended to show that it suffices to consider finite-precision truncation up to $\log(d^{2}l/\varepsilon)$ digits to approximate a network of width $d$ and depth $l$. This is stated more formally below.

Lemma 2.

Consider a network $h({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))$ where ${\bm{W}}_{i}\in\mathbb{R}^{d_{i}\times d_{i-1}}$, $\|{\bm{W}}_{i}\|\leq 1$. For a given $\varepsilon>0$, define $\hat{g}_{\log(d^{2}l/\varepsilon)}({\bm{x}})=\sigma(\widehat{{\bm{W}}}_{l}\sigma(\widehat{{\bm{W}}}_{l-1}\dots\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}})))$ where $\widehat{{\bm{W}}}_{i}$ is a finite-precision truncation of ${\bm{W}}_{i}$ up to $\log(d^{2}l/\varepsilon)$ digits, and $d=\max_{i}d_{i}$. Then we have

\max_{{\bm{x}}\in\mathbb{R}^{d_{0}}:\|{\bm{x}}\|\leq 1}|h({\bm{x}})-\hat{g}_{\log(d^{2}l/\varepsilon)}({\bm{x}})|\leq\varepsilon.

We provide the proof of Lemma 2, as well as approximation results for a single neuron and a single layer, in the supplementary materials.
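As a quick numerical illustration of Lemma 2 (not part of the paper's argument), the sketch below truncates every weight of a small random network with $\|{\bm{W}}_{i}\|\leq 1$ to $\log_{10}(d^{2}l/\varepsilon)$ decimal digits and checks that the output error on sampled unit-norm inputs stays below $\varepsilon$; the dimensions and the spectral normalization are arbitrary choices.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def truncate(W, digits):
    """Keep `digits` decimal digits of each weight (truncation toward zero)."""
    s = 10.0 ** digits
    return np.trunc(W * s) / s

rng = np.random.default_rng(0)
d, l, eps = 16, 3, 1e-2
digits = int(np.ceil(np.log10(d**2 * l / eps)))   # log(d^2 l / eps) digits

# Random target network with ||W_i|| <= 1 (spectral normalization)
Ws = []
for _ in range(l):
    W = rng.standard_normal((d, d))
    Ws.append(W / np.linalg.norm(W, 2))
Ws_hat = [truncate(W, digits) for W in Ws]

def forward(mats, x):
    for W in mats:
        x = relu(W @ x)
    return x

worst = 0.0
for _ in range(1000):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                         # unit-norm input
    err = np.linalg.norm(forward(Ws, x) - forward(Ws_hat, x))
    worst = max(worst, err)
print(worst, "<=", eps, worst <= eps)
```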

3.2.2 Binary weights are sufficient

We begin by showing that any finite-precision FC ReLU network can be represented exactly as an FC ReLU network with integer weights in every layer except the last, using a simple scaling argument. Since ReLU networks are positively homogeneous, we have that $\sigma(c\cdot z)=c\cdot\sigma(z)$ for $c>0$. Given a network $g_{p}$ in which every weight has finite precision of at most $p$ digits, we can apply this property layerwise with the scaling factor $c=10^{p}$ so that

f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))
=\frac{1}{c^{l}}\sigma(c{\bm{W}}_{l}\sigma(c{\bm{W}}_{l-1}\dots\sigma(c{\bm{W}}_{1}{\bm{x}})))
=\sigma(c^{\prime}\widehat{{\bm{W}}}_{l}\sigma(\widehat{{\bm{W}}}_{l-1}\dots\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}}))) \qquad (1)

where $\widehat{{\bm{W}}}_{i}=10^{p}{\bm{W}}_{i}$ is a matrix of integer weights and $c^{\prime}=\frac{1}{c^{l}}$. Therefore, the rescaled network has integer weights in every layer except the last layer, whose weight matrix is $c^{\prime}\widehat{{\bm{W}}}_{l}=c^{-(l-1)}{\bm{W}}_{l}$.
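A minimal numerical sketch of the rescaling in Equation (1), assuming weights with at most $p=3$ decimal digits: multiplying each layer by $c=10^{p}$ yields integer matrices, and folding the factor $c^{-l}$ into the last layer recovers the original output (up to floating-point error). The helper names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
d, l, p = 8, 3, 3
c = 10.0 ** p

# A finite-precision network: every weight has at most p = 3 decimal digits
Ws = [np.round(rng.uniform(-1, 1, size=(d, d)), p) for _ in range(l)]
W_int = [np.round(c * W) for W in Ws]            # integer matrices c * W_i

def forward(mats, x, last_scale=1.0):
    for i, W in enumerate(mats):
        scale = last_scale if i == len(mats) - 1 else 1.0
        x = relu(scale * (W @ x))
    return x

x = rng.standard_normal(d)
out_original = forward(Ws, x)
# Rescaled network: integer weights everywhere, last layer scaled by c' = c^{-l}
out_rescaled = forward(W_int, x, last_scale=c ** (-l))
print(np.allclose(out_original, out_rescaled))   # True
```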

In the remaining part of this section, we show that any FC ReLU network with integer weights can be represented exactly by pruning a purely binary ($\pm 1$) FC ReLU network that is just polylogarithmically wider and deeper. More precisely, we prove the following result.

Theorem 2.

Consider the set of FC ReLU networks with integer weights $\mathcal{F}_{W}$ defined as

\mathcal{F}_{W}=\{f:f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}}))),\ \forall i\ {\bm{W}}_{i}\in\mathbb{Z}^{d_{i}\times d_{i-1}},\ \|{\bm{W}}_{i}\|_{\max}\leq W\}

where $W>0$. Define $d=\max_{i}d_{i}$ and let $g({\bm{x}})=\sigma({\bm{B}}_{l^{\prime}}\sigma({\bm{B}}_{l^{\prime}-1}\dots\sigma({\bm{B}}_{1}{\bm{x}})))$ be a network with depth $l^{\prime}=\Theta(l\log|W|)$, where every weight is drawn uniformly at random from $\{-1,+1\}$ and the layer widths are $\Theta\left(\log|W|\cdot\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)$ times wider than those of $f({\bm{x}})$.

Then, with probability at least $1-\delta$, for every $f\in\mathcal{F}_{W}$, there exist pruning matrices ${\bm{M}}_{i}$ such that

f({\bm{x}})=\tilde{g}({\bm{x}})

holds for any ${\bm{x}}\in\mathbb{R}^{d_{0}}$, where $\tilde{g}({\bm{x}}):=\sigma(({\bm{M}}_{l^{\prime}}\odot{\bm{B}}_{l^{\prime}})\sigma(({\bm{M}}_{l^{\prime}-1}\odot{\bm{B}}_{l^{\prime}-1})\dots\sigma(({\bm{M}}_{1}\odot{\bm{B}}_{1}){\bm{x}})))$.

Remark 2.

The dimensions of the weight matrices of $g({\bm{x}})$ in Theorem 2 are specified more precisely below. Note that we have $\lfloor\log|W|\rfloor$ layers in $g({\bm{x}})$ that exactly represent each layer in $f({\bm{x}})$. For each $i\in[l]$, the dimension of ${\bm{B}}_{(i-1)\lfloor\log|W|\rfloor+1}$ is

\Theta\left(d_{i-1}\log|W|\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)\times d_{i-1},

the dimension of ${\bm{B}}_{i\lfloor\log|W|\rfloor}$ is

d_{i}\times\Theta\left(d_{i-1}\log|W|\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right),

and the remaining ${\bm{B}}_{(i-1)\lfloor\log|W|\rfloor+k}$, where $1<k<\lfloor\log|W|\rfloor$, have dimension

\Theta\left(d_{i-1}\log|W|\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)\times\Theta\left(d_{i-1}\log|W|\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right).
Remark 3.

Note that $\tilde{g}({\bm{x}})$ is exactly equal to $f({\bm{x}})$. Furthermore, we provide a uniform guarantee for all networks in $\mathcal{F}_{W}$ by pruning a single over-parameterized network, as in Pensia et al. (2020).

Remark 4.

Theorem 2 can be made into a deterministic construction for any fixed target network, thereby avoiding the $\log(1/\delta)$ overparameterization. We extend to the random initialization setting by resampling the construction a sufficient number of times.

Remark 5.

To resolve issues of numerical overflow, we can insert scaling neurons after every layer.

Remark 6.

The integer assumption can easily be converted to a finite-precision assumption using a simple scaling argument. Since all networks in practice use finite-precision arithmetic, Theorem 2 may be of independent interest to the reader. However, we emphasize here that there is no approximation error in this setting. Practitioners who are interested in a small error ($10^{-k}$) can simply apply Theorem 1 and incur an overparameterization factor of $\mathcal{O}(k)$.

The proof of Theorem 2 will first involve a deterministic construction for a binary network that gives us the desired guarantee. We then extend to the random binary initialization. The construction is based on a diamond-shaped gadget that allows us to approximate a single integer weight by pruning a binary ReLU network with just logarithmic overparameterization.

First, consider a target network that contains just a single integer weight, i.e., $h(x)=\sigma(wx)$. We will show that there exists a binary FC ReLU network $g(x)$ which can be pruned to approximate $h(x)$.

Lemma 3.

Consider a network $h(x)=\sigma(wx)$ where $w\in\mathbb{Z}$. Then there exists an FC ReLU binary network $g(x)$ of width and depth $\mathcal{O}(\log|w|)$ that can be pruned to $\tilde{g}(x)$ so that $\tilde{g}(x)=h(x)$ for all $x\in\mathbb{R}$.

Proof. Note that since $w$ is an integer, it can be represented using its binary (base-2) expansion

w=\operatorname{sign}(w)\sum_{k=0}^{\lfloor\log_{2}|w|\rfloor}z_{k}\cdot 2^{k},\quad z_{k}\in\{0,1\}. \qquad (2)

For ease of notation, we will use $\log(\cdot)$ to represent $\log_{2}(\cdot)$ going forward. Denote $n=\lfloor\log_{2}|w|\rfloor$. The construction of $g_{n}(x)$ in Figure 2a shows that $2^{n}$ can be represented by a binary network with ReLU activations for any $n$. We will refer to this network as the diamond-shaped “gadget”.

Note that the expansion in Equation (2) requires $2^{k}$ for all $0\leq k\leq n=\lfloor\log|w|\rfloor$. Luckily, any of these can be represented by just pruning $g_{n}(x)$, as shown in Figure 2b.

Figure 2: (a), (b) The diamond-shaped binary ReLU networks that compute $g_{n}(x)$ and $g_{n-k}(x)$, respectively. The dashed edges are weights that have been “pruned” (set to 0). The output neuron has a simple linear activation.
Figure 3: We can use two instances of $g_{n-k}$ to create (a) $f_{n-k}^{+}$ and (b) $f_{n-k}^{-}$ to handle both positive and negative weights. The output neuron here has a linear activation.
Figure 4: Illustration of $g_{n}(x)=\sigma(\pm\sum_{k=0}^{n}2^{k}x)$, built from the gadgets $f_{k}^{+}(x)$ and $f_{k}^{-}(x)$ for $0\leq k\leq n$, which can be further pruned to approximate $f(x)=\sigma(wx)$ for any $w$ with $|w|\leq 2^{n+1}-1$.

However, since our constructions use the ReLU activation, this only works when $x\geq 0$. Using the same trick as Pensia et al. (2020), we can extend this construction by mirroring it, as shown in Figure 3. This gives us $f_{n-k}^{+}(x):=2^{n-k}x$ and $f_{n-k}^{-}(x):=-2^{n-k}x$. The correctness of this mirroring trick relies on the simple observation that $wx=\sigma(wx)-\sigma(-wx)$.

Putting these together, we get $g_{n}(x)=\sigma(\pm\sum_{k=0}^{n}2^{k}x)$, as shown in Figure 4. By pruning just the weights in the last layer, we can choose which terms to include. Setting $n=\lfloor\log|w|\rfloor$ completes the approximation.

To calculate the overparameterization required to approximate $h(x)=\sigma(wx)$, we simply count the parameters in the above construction. Each gadget $g_{k}$ is a network of width $2$ and depth $\lfloor\log|w|\rfloor$. To construct $f_{k}^{+}$, we need two such gadgets. Therefore, to construct $f_{k}^{+}$ and $f_{k}^{-}$, we need width $4$ and depth $\lfloor\log|w|\rfloor$. Repeating this for each $k\in\{1,2,\dots,\lfloor\log|w|\rfloor\}$ shows that our construction is a network of width and depth $\mathcal{O}(\log|w|)$, which completes the proof of Lemma 3. ∎
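The following sketch mirrors the construction in the proof of Lemma 3 (a simplified numerical stand-in, not the gadget network itself): the bits of $|w|$ determine which $f_{k}^{+}, f_{k}^{-}$ branches survive pruning of the last layer, and the surviving branches reproduce $\sigma(wx)$ exactly.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def pruned_gadget_output(w, x):
    """Reconstruct sigma(w*x) from the pruned gadgets of Lemma 3.

    f_k^+(x) = 2^k x is realized as relu(2^k x) - relu(-2^k x) (mirroring trick),
    and the binary expansion of |w| selects which branches survive pruning.
    """
    n = int(np.floor(np.log2(abs(w))))
    bits = [(abs(w) >> k) & 1 for k in range(n + 1)]     # z_k in {0, 1}
    total = 0.0
    for k, z_k in enumerate(bits):
        if z_k:                                          # keep this branch, else prune it
            f_plus = relu((2.0 ** k) * x)
            f_minus = relu(-(2.0 ** k) * x)
            total += np.sign(w) * (f_plus - f_minus)     # = sign(w) * 2^k * x
    return relu(total)                                   # output neuron

for w in [13, -6, 255]:
    for x in [-0.7, 0.0, 1.3]:
        assert np.isclose(pruned_gadget_output(w, x), relu(w * x))
print("sigma(w x) reproduced for all test cases")
```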

Remark 7.

The network in Fig. 4 used for proving Lemma 3 can be written as

g(x)=\sigma\left(({\bm{M}}_{{\bm{v}}}\odot{\bm{v}})^{T}\left[({\bm{M}}_{n}\odot{\bm{B}}_{n})\sigma(({\bm{M}}_{n-1}\odot{\bm{B}}_{n-1})\dots\sigma(({\bm{M}}_{1}\odot{\bm{B}}_{1})\sigma(({\bm{M}}_{{\bm{u}}}\odot{\bm{u}})x)))\right]\right),

where $\{{\bm{M}}_{i}\}_{i\in[n]},{\bm{M}}_{{\bm{v}}},{\bm{M}}_{{\bm{u}}}$ are mask matrices and $\{{\bm{B}}_{i}\}_{i\in[n]},{\bm{v}},{\bm{u}}$ are binary weight matrices. By pruning elements in ${\bm{u}}$ or ${\bm{v}}$, one can obtain $h(x)=\sigma(wx)$. We will always prune the last layer ${\bm{v}}$, as this makes the construction more efficient when we extend it to approximating a layer.

Now, we extend the construction to the case where the target function is a neural network with a single neuron, i.e., $h({\bm{x}})=\sigma({\bm{w}}^{T}{\bm{x}})$.

Lemma 4.

Consider a network $h({\bm{x}})=\sigma({\bm{w}}^{T}{\bm{x}})$ where ${\bm{w}}\in\mathbb{Z}^{d}$ and $\|{\bm{w}}\|_{\infty}\leq w_{\max}$. Then there exists an FC ReLU binary network $g({\bm{x}})$ of width $\mathcal{O}(d\log|w_{\max}|)$ and depth $\mathcal{O}(\log|w_{\max}|)$ that can be pruned to $\tilde{g}({\bm{x}})$ so that $\tilde{g}({\bm{x}})=h({\bm{x}})$ for all ${\bm{x}}\in\mathbb{R}^{d}$.

Proof.

A neuron can be written as $h({\bm{x}})=\sigma({\bm{w}}^{T}{\bm{x}})=\sigma(\sum_{i=1}^{d}w_{i}x_{i})$. Therefore, we can just repeat our construction from above for each $w_{i}$, $i\in[d]$. This results in a network of width $\mathcal{O}(d\log|w_{\max}|)$ while the depth remains unchanged at $\mathcal{O}(\log|w_{\max}|)$. ∎

Figure 5: Illustration of the construction in Lemma 5: approximating a 1-hidden-layer network $h({\bm{x}})=\sigma({\bm{1}}^{T}\sigma({\bm{W}}_{1}{\bm{x}}))$ by pruning the appropriate binary network $g({\bm{x}})$. Pruning only the last layer allows us to “re-use” weights.

Next, we describe how to approximate a single layer target network and avoid a quadratic overparameterization.

Lemma 5.

Consider a network $h({\bm{x}})=\sigma({\bm{1}}^{T}\sigma({\bm{W}}_{1}{\bm{x}}))$ where ${\bm{W}}_{1}\in\mathbb{Z}^{d_{1}\times d_{0}}$ and $\|{\bm{W}}_{1}\|_{\max}\leq W$. Then there exists an FC ReLU binary network $g({\bm{x}})$ of width $\mathcal{O}(\max\{d_{0},d_{1}\}\log|W|)$ and depth $\mathcal{O}(\log|W|)$ that can be pruned to $\tilde{g}({\bm{x}})$ so that $\tilde{g}({\bm{x}})=h({\bm{x}})$ for all ${\bm{x}}\in\mathbb{R}^{d_{0}}$.

Proof.

Note that ${\bm{1}}\in\mathbb{R}^{d_{1}}$ is the vector of $1$'s. Naively extending the construction from Lemma 4 would require us to replace each of the $d_{1}$ neurons in the first layer by a network of width $\mathcal{O}(d_{0}\log|W|)$ and depth $\mathcal{O}(\log|W|)$. This already requires a network of width $\mathcal{O}(d_{0}d_{1}\log|W|)$, which is a quadratic overparameterization. Instead, we take advantage of pruning only the last layer in the construction from Lemma 3 to re-use a sufficient number of weights and avoid the quadratic overparameterization. An illustration of this idea is shown in Figure 5. The key observation is that by pruning only the last layer of each $f_{n}(x)$ gadget, we leave it available to be re-used to approximate the weights of the remaining $(d_{1}-1)$ neurons. Therefore, the width of the network required to approximate $h({\bm{x}})$ is just $\mathcal{O}(\max\{d_{0},d_{1}\}\log|W|)$ and the depth remains $\mathcal{O}(\log|W|)$. ∎
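A back-of-the-envelope sketch of the saving described in the proof of Lemma 5, comparing the naive construction (a separate set of gadgets for every (input, output) pair) with the shared construction that prunes only the last layer; constants are dropped and the helper name is hypothetical.

```python
import math

def layer_widths(d0, d1, W):
    """Hidden width needed to represent one integer layer, up to constant factors."""
    naive = d0 * d1 * math.ceil(math.log2(W))          # a gadget block per (input, output) pair
    shared = max(d0, d1) * math.ceil(math.log2(W))     # gadget blocks re-used across outputs
    return naive, shared

print(layer_widths(d0=512, d1=512, W=10**6))   # roughly 5.2M vs 10k hidden units
```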

Now we can tie it all together to show a representation guarantee for any network in $\mathcal{F}_{W}$.

Lemma 6.

Consider a network $f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))$ where ${\bm{W}}_{i}\in\mathbb{Z}^{d_{i}\times d_{i-1}}$, $\|{\bm{W}}_{i}\|_{\max}\leq W$. Let $d=\max_{i}d_{i}$. Then there exists an FC ReLU binary network $g({\bm{x}})$ of width $\mathcal{O}(d\log|W|)$ and depth $\mathcal{O}(l\log|W|)$ that can be pruned to $\tilde{g}({\bm{x}})$ so that $\tilde{g}({\bm{x}})=f({\bm{x}})$ for all ${\bm{x}}\in\mathbb{R}^{d_{0}}$.

Proof.

Note that there is no approximation error in any of the steps above. Therefore, we can just repeat the construction above for each of the $l$ layers in $f({\bm{x}})$. This results in a network $g({\bm{x}})$ of width $\mathcal{O}(d\log|W|)$ and depth $\mathcal{O}(l\log|W|)$. A more precise description of the dimensions of each layer can be found in Remark 2. ∎

Finally, below we show that our construction extends to the setting where the binary network is randomly initialized. The proof can be found in the supplementary materials.

Lemma 7.

Consider a network $f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))$ where ${\bm{W}}_{i}\in\mathbb{Z}^{d_{i}\times d_{i-1}}$, $\|{\bm{W}}_{i}\|_{\max}\leq W$. Define $d=\max_{i}d_{i}$ and let $g({\bm{x}})$ be a randomly initialized binary network of width $\Theta\left(d\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)$ and depth $\Theta(l\log|W|)$ such that every weight is drawn uniformly from $\{-1,+1\}$. Then $g({\bm{x}})$ can be pruned to $\tilde{g}({\bm{x}})$ so that $\tilde{g}({\bm{x}})=f({\bm{x}})$ for all ${\bm{x}}\in\mathbb{R}^{d_{0}}$, with probability at least $1-\delta$.

Since every network $f\in\mathcal{F}_{W}$ satisfies the assumptions of Lemma 7, we can apply it to the entire class $\mathcal{F}_{W}$. Note that we do not need to apply a union bound over $\mathcal{F}_{W}$, since the randomness is only needed to ensure the existence of a particular deterministic network, namely $g(x)$ from Lemma 6. Therefore, a single random network is sufficient to ensure the guarantee for every $f\in\mathcal{F}_{W}$. This completes the proof of Theorem 2.

3.2.3 Putting everything together

We now put the above results together to complete the proof of Theorem 1. First, note that by Lemma 2, to approximate any $f\in\mathcal{F}$ within $\varepsilon>0$, it suffices to consider $\hat{g}({\bm{x}})$, a finite-precision version of $f$ in which the precision of each weight is at most $p=\log(d^{2}l/\varepsilon)$ digits. Now, applying the scaling trick in Equation (1), we can represent $\hat{g}({\bm{x}})$ exactly as a scaled integer network, i.e.,

\hat{g}({\bm{x}})=c^{-l}\sigma(c\widehat{{\bm{W}}}_{l}\sigma(c\widehat{{\bm{W}}}_{l-1}\dots\sigma(c\widehat{{\bm{W}}}_{1}{\bm{x}})))

where $c=10^{p}=(d^{2}l/\varepsilon)$ and all the weight matrices $c\widehat{{\bm{W}}}_{i}$ are integer. Since $\lVert{\bm{W}}_{i}\rVert_{\max}\leq 1$, it is clear that $\lVert c\widehat{{\bm{W}}}_{i}\rVert_{\max}\leq c$. Therefore, applying Theorem 2 to the integer network $c^{l}\hat{g}({\bm{x}})$ with $W=(d^{2}l/\varepsilon)$, we have the following. If $h({\bm{x}})=\sigma({\bm{B}}_{l^{\prime}}\sigma({\bm{B}}_{l^{\prime}-1}\dots\sigma({\bm{B}}_{1}{\bm{x}})))$ is a randomly initialized binary network of depth $\Theta(l\log(d^{2}l/\varepsilon))$ whose layer widths are $\Theta\left(\log(d^{2}l/\varepsilon)\log\left(\frac{dl\log^{2}(d^{2}l/\varepsilon)}{\delta}\right)\right)$ times wider than those of $\hat{g}$, then with probability $1-\delta$ it can be pruned to $\tilde{h}({\bm{x}})$ so that $c^{l}\hat{g}({\bm{x}})=\tilde{h}({\bm{x}})$ for any ${\bm{x}}$ in the unit ball. Therefore, to recover $\hat{g}({\bm{x}})$, we simply push the scaling factor $c^{-l}$ into the last layer ${\bm{B}}_{l^{\prime}}$ so that its weights become scaled binary $\{\pm(\varepsilon/d^{2}l)^{l}\}$. Combining this with the approximation guarantee between $\hat{g}({\bm{x}})$ and $f({\bm{x}})$ completes the proof.

3.3 Binary weights for classification

Theorem 2 can easily be extended to classification problems using finite-precision networks. Since $\operatorname{sign}(\cdot)$ is invariant to positive scaling, we no longer even require the final layer to be scaled. Applying the same argument as in Section 3.2.3 and then dropping the $c^{-l}$ factor gives us the following corollary.

Corollary 1.

Consider the set of binary classification FC ReLU networks $\mathcal{F}$ of width $d$ and depth $l$, where the weight matrices $\{{\bm{W}}_{i}\}_{i=1}^{l}$ have finite precision of at most $p$ digits. Let $g({\bm{x}})=\operatorname{sign}({\bm{B}}_{l^{\prime}}\sigma({\bm{B}}_{l^{\prime}-1}\dots\sigma({\bm{B}}_{1}{\bm{x}})))$ be a randomly initialized binary network with depth $l^{\prime}=\Theta(lp)$ and width $d^{\prime}=\Theta(dp\log(dlp/\delta))$ such that every weight is drawn uniformly from $\{-1,+1\}$. Then, with probability at least $1-\delta$, for every $f\in\mathcal{F}$, there exist pruning matrices $\{{\bm{M}}_{i}\}_{i=1}^{l^{\prime}}$ such that $f({\bm{x}})=\tilde{g}({\bm{x}})$ for any ${\bm{x}}$, where $\tilde{g}({\bm{x}}):=\operatorname{sign}(({\bm{M}}_{l^{\prime}}\odot{\bm{B}}_{l^{\prime}})\sigma(({\bm{M}}_{l^{\prime}-1}\odot{\bm{B}}_{l^{\prime}-1})\dots\sigma(({\bm{M}}_{1}\odot{\bm{B}}_{1}){\bm{x}})))$.
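A one-line numerical check of the positive scale-invariance used above (the toy dimensions and random draw are assumptions): multiplying the last layer by any $c>0$, in particular dropping the $c^{-l}$ factor, leaves the $\operatorname{sign}(\cdot)$ of the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
B_last = rng.choice([-1.0, 1.0], size=(1, 32))   # binary last layer
h = np.maximum(rng.standard_normal(32), 0.0)     # output of the previous (ReLU) layer
c = 1e-9                                         # any positive scale, e.g. c^{-l}
print(np.sign(B_last @ h) == np.sign(c * B_last @ h))  # True
```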

4 Conclusion

In this paper, we prove the Strong LTH for binary networks, establishing that logarithmic overparameterization is sufficient for pruning algorithms to discover accurate subnetworks within random binary models. In doing so, we provide theory supporting a wide range of experimental work in the field, e.g., showing that scaled binary networks can achieve SOTA accuracy on benchmark image datasets (Diffenderfer and Kailkhura, 2021; Ramanujan et al., 2020). Moreover, we show that only the last layer needs to be scaled binary, while the rest of the network can be purely binary $\{\pm 1\}$. It is well known in the binary network literature that a gain term (scaling the weights) makes the optimization problem more tractable (Simons and Lee, 2019). While this is known empirically, it would be interesting to study it from a theoretical perspective, so that we can identify better algorithms for finding binary networks of high accuracy.

Acknowledgements

The authors would like to thank Jonathan Frankle for early discussions on pruning random binary networks. This research was supported by ONR Grant N00014-21-1-2806.

References

  • Barron, (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945.
  • Blalock et al., (2020) Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., and Guttag, J. (2020). What is the state of neural network pruning? In Proceedings of Machine Learning and Systems 2020, pages 129–146.
  • Cheng et al., (2019) Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2019). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv:1710.09282 [cs].
  • Courbariaux et al., (2015) Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Binaryconnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363.
  • Deng et al., (2020) Deng, L., Li, G., Han, S., Shi, L., and Xie, Y. (2020). Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532.
  • Diffenderfer and Kailkhura, (2021) Diffenderfer, J. and Kailkhura, B. (2021). Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. arXiv preprint arXiv:2103.09377.
  • Frankle and Carbin, (2018) Frankle, J. and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.
  • Frankle et al., (2020) Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR.
  • (9) Han, S., Mao, H., and Dally, W. J. (2015a). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
  • (10) Han, S., Pool, J., Tran, J., and Dally, W. J. (2015b). Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626.
  • Hanin, (2019) Hanin, B. (2019). Universal function approximation by deep neural nets with bounded width and relu activations. Mathematics, 7(10):992.
  • Hassibi and Stork, (1993) Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan-Kaufmann.
  • He et al., (2018) He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. (2018). Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800.
  • He et al., (2017) He, Y., Zhang, X., and Sun, J. (2017). Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397.
  • Hubara et al., (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4114–4122.
  • Hubara et al., (2017) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898.
  • Kidger and Lyons, (2020) Kidger, P. and Lyons, T. (2020). Universal approximation with deep narrow networks. In Conference on Learning Theory, pages 2306–2327. PMLR.
  • Klusowski and Barron, (2018) Klusowski, J. M. and Barron, A. R. (2018). Approximation by combinations of relu and squared relu ridge functions with $\ell^{1}$ and $\ell^{0}$ controls. IEEE Transactions on Information Theory, 64(12):7649–7656.
  • LeCun et al., (1990) LeCun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal Brain Damage. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann.
  • Levin et al., (1994) Levin, A. U., Leen, T. K., and Moody, J. E. (1994). Fast Pruning Using Principal Components. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 35–42. Morgan-Kaufmann.
  • Li et al., (2016) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
  • Malach et al., (2020) Malach, E., Yehudai, G., Shalev-Schwartz, S., and Shamir, O. (2020). Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR.
  • Mozer and Smolensky, (1989) Mozer, M. C. and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115.
  • Orseau et al., (2020) Orseau, L., Hutter, M., and Rivasplata, O. (2020). Logarithmic pruning is all you need. Advances in Neural Information Processing Systems, 33.
  • Pensia et al., (2020) Pensia, A., Rajput, S., Nagle, A., Vishwakarma, H., and Papailiopoulos, D. (2020). Optimal lottery tickets via subsetsum: Logarithmic over-parameterization is sufficient. arXiv preprint arXiv:2006.07990.
  • Perekrestenko et al., (2018) Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. (2018). The universal approximation power of finite-width deep relu networks. arXiv preprint arXiv:1806.01528.
  • Ramanujan et al., (2020) Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. (2020). What’s Hidden in a Randomly Weighted Neural Network? arXiv:1911.13299 [cs].
  • Rastegari et al., (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer.
  • Scarselli and Tsoi, (1998) Scarselli, F. and Tsoi, A. C. (1998). Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural networks, 11(1):15–37.
  • Simons and Lee, (2019) Simons, T. and Lee, D.-J. (2019). A review of binarized neural networks. Electronics, 8(6):661.
  • Stinchombe, (1989) Stinchombe, M. (1989). Universal approximation using feed-forward networks with nonsigmoid hidden layer activation functions. Proc. IJCNN, Washington, DC, 1989, pages 161–166.
  • Wang et al., (2019) Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., and Hu, X. (2019). Pruning from Scratch. arXiv:1909.12579 [cs].
  • Wen et al., (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. (2016). Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665.
  • Wu et al., (2016) Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. (2016). Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828.
  • Zhou et al., (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. (2019). Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067.
  • Zhu et al., (2016) Zhu, C., Han, S., Mao, H., and Dally, W. J. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064.
  • Zhu and Gupta, (2017) Zhu, M. and Gupta, S. (2017). To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.

Appendix A Appendix

A.1 Proof of Lemma 2

Before proving Lemma 2 in its entirety, we first extend Lemma 1 to approximate a neuron with $\log(d/\varepsilon)$-digit precision.

Lemma 8.

Consider a network $h({\bm{x}})=\sigma({\bm{w}}^{T}{\bm{x}})$ where ${\bm{w}}\in\mathbb{R}^{d}$, $\|{\bm{w}}\|\leq 1$, and $\varepsilon>0$. Let $\hat{{\bm{w}}}$ be a coordinate-wise finite-precision truncation of ${\bm{w}}$ up to $\log(d/\varepsilon)$ digits and $\hat{g}({\bm{x}})=\sigma(\hat{{\bm{w}}}^{T}{\bm{x}})$. Then we have that

\max_{{\bm{x}}\in\mathbb{R}^{d}:\|{\bm{x}}\|\leq 1}\|h({\bm{x}})-\hat{g}({\bm{x}})\|\leq\varepsilon.
Proof.

Once again, by Fact 1, we have for each coordinate that $|{\bm{w}}_{i}-\hat{{\bm{w}}}_{i}|\leq\varepsilon/d$. Therefore, $\|{\bm{w}}-\hat{{\bm{w}}}\|\leq\sqrt{\sum_{i=1}^{d}\left(\frac{\varepsilon}{d}\right)^{2}}\leq\varepsilon$. Applying Cauchy–Schwarz with $\|{\bm{x}}\|\leq 1$, together with the 1-Lipschitzness of the ReLU, completes the proof. ∎

We now extend the result to approximating a single layer with finite-precision.

Lemma 9.

Consider a network $h({\bm{x}})=\sigma({\bm{1}}^{T}\sigma({\bm{W}}_{1}{\bm{x}}))$ where ${\bm{W}}_{1}\in\mathbb{R}^{d_{1}\times d_{0}}$, $\|{\bm{W}}_{1}\|\leq 1$, and $\varepsilon>0$. Let $\widehat{{\bm{W}}}_{1}$ be a coordinate-wise finite-precision truncation of ${\bm{W}}_{1}$ up to $\log(d^{2}/\varepsilon)$ digits and $\hat{g}({\bm{x}})=\sigma({\bm{1}}^{T}\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}}))$, where $d=\max\{d_{0},d_{1}\}$. Then we have that

\max_{{\bm{x}}\in\mathbb{R}^{d_{0}}:\|{\bm{x}}\|\leq 1}\|h({\bm{x}})-\hat{g}({\bm{x}})\|\leq\varepsilon.
Proof.

Note that for any ${\bm{x}}$ with $\|{\bm{x}}\|\leq 1$,

\lVert h({\bm{x}})-\hat{g}({\bm{x}})\rVert
=\Big\lVert\sigma\Big(\sum_{i=1}^{d_{1}}\sigma({\bm{W}}_{1}^{(i)}{\bm{x}})\Big)-\sigma\Big(\sum_{i=1}^{d_{1}}\sigma(\widehat{{\bm{W}}}_{1}^{(i)}{\bm{x}})\Big)\Big\rVert
\leq\Big\lVert\sum_{i=1}^{d_{1}}\sigma({\bm{W}}_{1}^{(i)}{\bm{x}})-\sum_{i=1}^{d_{1}}\sigma(\widehat{{\bm{W}}}_{1}^{(i)}{\bm{x}})\Big\rVert \quad\text{(since $\sigma$ is 1-Lipschitz)}
\leq\sum_{i=1}^{d_{1}}\big\lVert\sigma({\bm{W}}_{1}^{(i)}{\bm{x}})-\sigma(\widehat{{\bm{W}}}_{1}^{(i)}{\bm{x}})\big\rVert
\leq\sum_{i=1}^{d_{1}}\frac{\varepsilon}{\max\{d_{0},d_{1}\}}
\leq\varepsilon.

Since this holds for any ${\bm{x}}$, it also holds for the ${\bm{x}}$ that attains the maximum possible error, which completes the proof. ∎

We are now ready to prove Lemma 2, which states that to approximate any FC ReLU network within $\varepsilon$, it suffices to consider weights with logarithmic precision.

Lemma 2.

Consider a network $h({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))$ where ${\bm{W}}_{i}\in\mathbb{R}^{d_{i}\times d_{i-1}}$, $\|{\bm{W}}_{i}\|\leq 1$, $d_{i}\leq d$, and $\varepsilon>0$. Let $\hat{g}({\bm{x}})=\sigma(\widehat{{\bm{W}}}_{l}\sigma(\widehat{{\bm{W}}}_{l-1}\dots\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}})))$ where $\widehat{{\bm{W}}}_{i}$ is a finite-precision truncation of ${\bm{W}}_{i}$ up to $\log(d^{2}l/\varepsilon)$ digits. Then we have that

\max_{{\bm{x}}\in\mathbb{R}^{d_{0}}:\|{\bm{x}}\|\leq 1}\|h({\bm{x}})-\hat{g}({\bm{x}})\|\leq\varepsilon.
Proof.

The proof follows inductively. First, note that since the operator norm is bounded by the Frobenius norm, we have for the output of the first layer that

\lVert\sigma({\bm{W}}_{1}{\bm{x}})-\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}})\rVert
\leq\lVert{\bm{W}}_{1}{\bm{x}}-\widehat{{\bm{W}}}_{1}{\bm{x}}\rVert
\leq\lVert{\bm{W}}_{1}-\widehat{{\bm{W}}}_{1}\rVert
\leq\lVert{\bm{W}}_{1}-\widehat{{\bm{W}}}_{1}\rVert_{F}
\leq\varepsilon/l.

Next, denote the output of the first layer by ${\bm{x}}_{2}:=\sigma({\bm{W}}_{1}{\bm{x}})$ and $\widehat{{\bm{x}}}_{2}:=\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}})$. Repeating the calculation above for the second layer gives us

\lVert\sigma({\bm{W}}_{2}\sigma({\bm{W}}_{1}{\bm{x}}))-\sigma(\widehat{{\bm{W}}}_{2}\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}}))\rVert
=\lVert\sigma({\bm{W}}_{2}{\bm{x}}_{2})-\sigma(\widehat{{\bm{W}}}_{2}\widehat{{\bm{x}}}_{2})\rVert
=\lVert\sigma({\bm{W}}_{2}{\bm{x}}_{2})-\sigma(\widehat{{\bm{W}}}_{2}{\bm{x}}_{2})+\sigma(\widehat{{\bm{W}}}_{2}{\bm{x}}_{2})-\sigma(\widehat{{\bm{W}}}_{2}\widehat{{\bm{x}}}_{2})\rVert
\leq\lVert\sigma({\bm{W}}_{2}{\bm{x}}_{2})-\sigma(\widehat{{\bm{W}}}_{2}{\bm{x}}_{2})\rVert+\lVert\sigma(\widehat{{\bm{W}}}_{2}{\bm{x}}_{2})-\sigma(\widehat{{\bm{W}}}_{2}\widehat{{\bm{x}}}_{2})\rVert.

Note that the first term is bounded by $\varepsilon/l$ since $\|{\bm{x}}_{2}\|\leq 1$. Further, using the fact that $\lVert\widehat{{\bm{W}}}_{2}\rVert\leq 1$ and $\lVert{\bm{x}}_{2}-\widehat{{\bm{x}}}_{2}\rVert\leq\varepsilon/l$, as proved above, we get that

\lVert\sigma({\bm{W}}_{2}\sigma({\bm{W}}_{1}{\bm{x}}))-\sigma(\widehat{{\bm{W}}}_{2}\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}}))\rVert\leq 2\varepsilon/l.

By induction, for each layer $1\leq i\leq l$ we have that for any ${\bm{x}}$ with $\lVert{\bm{x}}\rVert\leq 1$,

\big\lVert\sigma({\bm{W}}_{i}\sigma({\bm{W}}_{i-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))-\sigma(\widehat{{\bm{W}}}_{i}\sigma(\widehat{{\bm{W}}}_{i-1}\dots\sigma(\widehat{{\bm{W}}}_{1}{\bm{x}})))\big\rVert\leq\varepsilon\cdot(i/l).

Setting $i=l$ completes the proof. ∎

Lemma 7.

Consider a network $f({\bm{x}})=\sigma({\bm{W}}_{l}\sigma({\bm{W}}_{l-1}\dots\sigma({\bm{W}}_{1}{\bm{x}})))$ where ${\bm{W}}_{i}\in\mathbb{Z}^{d_{i}\times d_{i-1}}$, $\|{\bm{W}}_{i}\|_{\max}\leq W$, and $d_{i}\leq d$. Let $g({\bm{x}})$ be a randomly initialized binary network of width $\Theta\left(d\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)$ and depth $\Theta(l\log|W|)$ such that every weight is drawn uniformly from $\{-1,+1\}$. Then $g({\bm{x}})$ can be pruned to $\tilde{g}({\bm{x}})$ so that $\tilde{g}({\bm{x}})=f({\bm{x}})$ for all ${\bm{x}}\in\mathbb{R}^{d_{0}}$, with probability at least $1-\delta$.

Proof.

Consider any one particular diamond structure in Figure 2. If we choose the weights of this network uniformly at random from $\{-1,+1\}$, then the probability that they are all $1$ is $(1/2)^{4}$. If we make the network $k$ times wider, the probability that no such all-ones diamond gadget exists among the $k$ candidates is $\left(1-(1/2)^{4}\right)^{k}$. In other words, if we define the event $A$ to be this failure event, then $P(A)=\left(1-(1/2)^{4}\right)^{k}$. Note that there are $\Theta(dl\log^{2}|W|)$ such diamond positions in our network. By symmetry, the failure probability of each of them is identical. To ensure that the overall probability of failure is at most $\delta$, taking a union bound we require

dl\log^{2}|W|\cdot\left(1-(1/2)^{4}\right)^{k}\leq\delta.

Hence, it suffices to have $k\geq\Theta\left(\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)$. In other words, a randomly initialized binary network that is $\Theta\left(\log\left(\frac{dl\log^{2}|W|}{\delta}\right)\right)$ times wider than the deterministic construction, and of depth $\Theta(l\log|W|)$, contains our deterministic construction with probability at least $1-\delta$. ∎
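The counting argument above can be checked with a short simulation, under the assumption that a diamond gadget requires one specific assignment of its 4 weights (probability $2^{-4}$); the replication factor $k$ is computed from the union bound and the per-position failure probability is estimated by Monte Carlo. All numeric choices are illustrative.

```python
import math
import numpy as np

def replication_factor(d, l, W, delta):
    """Smallest k with (#gadget positions) * (1 - 2**-4)**k <= delta."""
    n_gadgets = d * l * math.ceil(math.log2(W)) ** 2
    k = math.ceil(math.log(delta / n_gadgets) / math.log(1 - 2.0 ** -4))
    return n_gadgets, k

n_gadgets, k = replication_factor(d=8, l=2, W=16, delta=0.05)
print("gadget positions:", n_gadgets, " replication factor k:", k)

# Per-position failure probability: none of the k candidate diamonds is all +1
rng = np.random.default_rng(0)
signs = rng.integers(0, 2, size=(50_000, k, 4), dtype=np.int8)   # 1 encodes weight +1
p_fail = 1.0 - (signs == 1).all(axis=2).any(axis=1).mean()
print("per-position failure:", p_fail, " theory:", (1 - 2.0 ** -4) ** k)
print("union bound:", n_gadgets * p_fail, "<= delta =", 0.05)
```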

A.2 Observation on the power of zero

We try to understand here whether pruning is essential to the expressive power of binary networks. We note that pruning binary networks is strictly more powerful than merely sign-flipping their weights. To see this, consider the set of all networks that can be derived by pruning 1-hidden-layer binary networks of width $m$ and input dimension $d$:

\mathcal{F}_{pruned}^{m}=\bigg\{f:f({\bm{x}})=\sigma\bigg(\sum_{i=1}^{d}x_{i}\,\sigma\bigg(\sum_{j=1}^{m}v_{j}w_{j}^{i}\bigg)\bigg),\ w_{j}^{i},v_{j}\in\{+1,-1,0\}\bigg\}.

Similarly, the set of all 1-hidden layer binary networks of the same width without pruning is given by

\mathcal{F}_{bin}^{m}=\bigg\{f:f({\bm{x}})=\sigma\bigg(\sum_{i=1}^{d}x_{i}\,\sigma\bigg(\sum_{j=1}^{m}v_{j}w_{j}^{i}\bigg)\bigg),\ w_{j}^{i},v_{j}\in\{+1,-1\}\bigg\}. \qquad (3)

We prove the following proposition, which shows that $\mathcal{F}_{pruned}^{m}$ is a strictly richer class of functions, indicating that pruning is an essential part of approximating classifiers.

Proposition 1.

The function $f({\bm{x}})=\sigma\left(\sum_{i=1}^{d}i\cdot x_{i}\right)$ satisfies $f({\bm{x}})\in\mathcal{F}_{pruned}^{d}$ and $f({\bm{x}})\notin\mathcal{F}_{bin}^{d}$, i.e., without pruning, $f({\bm{x}})$ cannot be represented by a 1-hidden-layer binary network of width $d$.

Proof.

For simplicity, we consider the case when ${\bm{x}}\geq 0$ entrywise, so that the outer ReLU is equivalent to a linear activation; if the functions differ on the non-negative orthant, they are surely different. First, note that we recover $f({\bm{x}})$ from $\mathcal{F}_{pruned}^{d}$ if we set $v_{j}=1$ and $w_{j}^{i}=\mathds{1}_{j\leq i}$ for all $i,j$. Therefore, $f({\bm{x}})\in\mathcal{F}_{pruned}^{d}$ holds. To see that $f\notin\mathcal{F}_{bin}^{d}$, first note that we can replace $v_{j}w_{j}^{i}$ with $z_{j}^{i}\in\{+1,-1\}$ for all $i,j$ in Equation (3). Hence, any $g\in\mathcal{F}_{bin}^{d}$ is of the form $g({\bm{x}})=\sigma\left(\sum_{i=1}^{d}x_{i}\,\sigma\left(\sum_{j=1}^{d}z_{j}^{i}\right)\right)$. Consider the coefficient of $x_{d-1}$ in $f({\bm{x}})$, which is $(d-1)$. Since the $z_{j}^{d-1}$ cannot be set to $0$, the sum $\sum_{j=1}^{d}z_{j}^{d-1}$ always has the same parity as $d$, so the closest coefficients $g({\bm{x}})$ can realize are $d$ or $d-2$. In fact, the same obstruction holds for every coefficient of $f(x)$ whose parity differs from that of $d$. This completes the proof. ∎
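The parity obstruction in the proof can be verified exhaustively for a small width. The sketch below (an illustration, using the proof's substitution $z_{j}^{i}=v_{j}w_{j}^{i}$) enumerates all sign vectors for $d=4$ and confirms that the coefficients $1$ and $3$ of $f$ are unreachable without pruning, while every coefficient becomes reachable once weights may also be set to $0$; the proof's explicit choice $v_{j}=1$, $w_{j}^{i}=\mathds{1}_{j\leq i}$ realizes them all simultaneously.

```python
from itertools import product

d = 4
target = list(range(1, d + 1))            # coefficients of f: 1, 2, ..., d

def reachable(alphabet):
    """All values sigma(sum_j z_j) achievable with z_1, ..., z_d drawn from `alphabet`."""
    return {max(sum(z), 0) for z in product(alphabet, repeat=d)}

bin_coeffs = reachable((-1, 1))           # no pruning: weights in {+1, -1}
pruned_coeffs = reachable((-1, 0, 1))     # pruning allowed: weights in {+1, -1, 0}

print("unreachable without pruning:", [c for c in target if c not in bin_coeffs])     # [1, 3]
print("unreachable with pruning:   ", [c for c in target if c not in pruned_coeffs])  # []
```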

Remark 8.

The above proposition suggests that pruning is more powerful than merely flipping the signs of a binary network. In fact, the same argument can be extended to binary networks of any fixed width $d$ and depth $l$ to show that pruned networks are more expressive. However, it does not quantify this difference in expressivity.