Finding Everything within Random Binary Networks
Abstract
A recent work by Ramanujan et al., (2020) provides significant empirical evidence that sufficiently overparameterized, random neural networks contain untrained subnetworks that achieve state-of-the-art accuracy on several predictive tasks. A follow-up line of theoretical work justifies these findings by proving that slightly overparameterized neural networks, with commonly used continuous-valued random initializations, can indeed be pruned to approximate any target network. In this work, we show that the amplitude of those random weights does not even matter. We prove that any target network can be approximated up to arbitrary accuracy by simply pruning a random network of binary weights that is only a polylogarithmic factor wider and deeper than the target network.
1 Introduction
As the number of parameters of state-of-the-art networks continues to increase, pruning has become a prime choice for sparsifying and compressing a model. A rich and long body of research, dating back to the 80s, shows that one can prune most networks to a tiny fraction of their size while maintaining high accuracy (Mozer and Smolensky, 1989; Hassibi and Stork, 1993; Levin et al., 1994; LeCun et al., 1990; Han et al., 2015a,b; Li et al., 2016; Wen et al., 2016; Hubara et al., 2016, 2017; He et al., 2017; Wu et al., 2016; Zhu et al., 2016; He et al., 2018; Zhu and Gupta, 2017; Cheng et al., 2019; Blalock et al., 2020; Deng et al., 2020).
A downside of most classic pruning approaches is that they sparsify a model once it is trained to full accuracy, followed by significant fine-tuning, resulting in a computationally burdensome procedure. Frankle and Carbin, (2018) conjectured the existence of lottery tickets, i.e., sparse subnetworks at (or near) initialization, that can be trained—just once—to reach the accuracy of state-of-the-art dense models. This may help alleviate the computational burden of prior approaches, as training is predominantly carried out on a much sparser model. The conjectured existence of these lucky tickets is referred to as the Lottery Ticket Hypothesis (LTH). Frankle and Carbin, (2018) and Frankle et al., (2020) show that not only do lottery tickets exist, but the cost of “winning the lottery” is also not very high.
Alongside the LTH literature, a curious phenomenon was observed: even at initialization, and in the complete absence of training, one can find subnetworks of the random initial model that have prediction accuracy far beyond random guessing (Zhou et al., 2019; Ramanujan et al., 2020; Wang et al., 2019). Ramanujan et al., (2020) reported this in its most striking form: models with state-of-the-art accuracy for CIFAR10 and ImageNet simply reside within slightly larger, yet completely random networks, and appropriate pruning—and mere pruning—can reveal them! This “pruning is all you need” phenomenon is sometimes referred to as the Strong Lottery Ticket Hypothesis.
A recent line of work attempts to establish the theoretical validity of the Strong LTH by studying the following non-algorithmic question:
Can a random network be pruned to approximate a target function $f$?
Here, $f$ represents a bounded-range labeling function that acts on inputs $x$, and is itself a neural network of finite width and depth. This assumption is not limiting, as neural networks are universal approximators (Stinchombe, 1989; Barron, 1993; Scarselli and Tsoi, 1998; Klusowski and Barron, 2018; Perekrestenko et al., 2018; Hanin, 2019; Kidger and Lyons, 2020). Note that the answer to the above question is trivial if one does not constrain the size of the random initial network, for all interesting cases of “random”. Indeed, if we start with a random neural network that is exponentially wider than the one representing $f$, then by sheer luck one can always find weights, for each layer, near-identical to those of the target network. Achieving this result with constrained overparameterization, i.e., with a bound on the degree by which the random network to be pruned is wider/deeper than the target, is precisely what makes this question challenging.
Malach et al., (2020) were the first to prove that the Strong LTH is true, assuming polynomial-sized overparameterization. Specifically, under some mild assumptions, they showed that to approximate a target network of width $d$ and depth $l$ to within error $\epsilon$, it suffices to prune a random network whose width and depth are polynomial in $d$, $l$ and $1/\epsilon$. Pensia et al., (2020) offered an exponentially tighter bound using a connection to the SubsetSum problem. They showed that to approximate a target network within error $\epsilon$, it is sufficient to prune a randomly initialized network of width $O(d\log(dl/\epsilon))$ and depth $2l$. A corresponding lower bound for constant-depth networks was also established. Orseau et al., (2020) were also able to reduce the dependence on $1/\epsilon$ to logarithmic. They show that in order to approximate a target network within error $\epsilon$, it suffices to prune a random network whose width is polynomial in $d$ but only logarithmic in $1/\epsilon$, if the weights are initialized with the hyperbolic distribution. However, this bound on overparameterization is still polynomial in the width $d$.
The above theoretical studies have focused exclusively on continuous distributions for the initialization. However, in the experimental work of Ramanujan et al., (2020), the authors obtain their best performance by pruning networks of scaled, binary weights. Training binary networks has been studied extensively in the past (Courbariaux et al., 2015; Simons and Lee, 2019), as they are compute-, memory- and hardware-efficient, though in many cases they suffer from a significant loss of accuracy. The findings of Ramanujan et al., (2020) suggest that this accuracy loss may not be fundamental to networks of binary weights, when such networks are learned by pruning. Arguably, since “carving out” subnetworks of random models is expressive enough to approximate a target function, e.g., according to (Pensia et al., 2020; Malach et al., 2020), one is led to wonder about the importance of the weights altogether. So perhaps binary weights are all you need.
Diffenderfer and Kailkhura, (2021) showed that scaled binary networks can indeed be pruned to approximate any target function. The required overparameterization is similar to that of Malach et al., (2020), i.e., polynomial in the width, the depth and the inverse of the approximation error. Hence, in a similar vein to the improvement that Pensia et al., (2020) offered over the bounds of Malach et al., (2020), we explore whether such an improvement is possible for the results of Diffenderfer and Kailkhura, (2021).
Our Contributions:
In this work, we offer an exponential improvement over the theoretical bounds of Diffenderfer and Kailkhura, (2021), establishing the following.
Theorem 1.
(informal) Consider a randomly initialized, FC, binary $\{-1,+1\}$ network of ReLU activations whose depth and width exceed those of the target network by a factor that is polylogarithmic in $d$, $l$, $1/\epsilon$ and $1/\delta$, and whose last layer consists of scaled binary weights $\{-\alpha,+\alpha\}$. Then, there is always a constant $\alpha$ such that this network can be pruned to approximate any FC ReLU network of depth $l$ and width $d$, up to error $\epsilon$, with probability at least $1-\delta$.
Therefore, we show that in order to approximate any target network, it suffices to prune a binary network that is only logarithmically overparameterized (Figure 1). In contrast to Diffenderfer and Kailkhura, (2021), our construction only requires that the last layer be scaled, while the rest of the network is purely binary $\{-1,+1\}$. We show a comparison of the known Strong LTH results in Table 1.
Reference | Width | Depth | Total Params | Weights
Malach et al., (2020) | | | | Real
Orseau et al., (2020) | | | | Real (Hyperbolic)
Pensia et al., (2020) | | | | Real
Diffenderfer and Kailkhura, (2021) | | | | Binary (scaled)
Ours, Theorem 1 | | | | Binary¹
¹The weights of all layers are purely binary, i.e., $\{-1,+1\}$, except for the last layer, which is scaled so that its weights lie in $\{-\alpha,+\alpha\}$ for a scaling factor $\alpha > 0$ computed in Section 3.2.3.
In light of our theoretical results, one may wonder why a loss of accuracy is observed in the literature on training binary networks, i.e., on assigning a sign pattern to a fixed architecture, e.g., (Rastegari et al., 2016). Is this an algorithmic artifact, or does pruning random signs offer higher expressivity than assigning the signs? We show that there exist target functions that can be represented exactly by pruning binary networks, yet cannot be represented by any fully connected binary network of the same architecture.
Proposition 1.
(informal) There exists a function that can be represented by pruning a random 2-layer binary network of width $n$, but not by any 2-layer fully connected binary network of width $n$.
Note that although finding a subnetwork of a random binary network results in a “ternary” architecture (e.g., $0$ becomes a possible weight), the total number of possible choices of subnetworks is $2^{N}$, if $N$ is the total number of weights. This is equal to the total number of sign assignments of the same FC network. Yet, as shown in the proposition above, pruning a random FC network is provably more expressive than finding a sign assignment for the same architecture.
2 Preliminaries and Problem Setup
Let $f$ be the target FC network with $l$ layers and ReLU activations, represented as
$$f(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma\!\left(W_1 x\right)\right),$$
where $x \in \mathbb{R}^{d_0}$ is the input, $\sigma(\cdot)$ is the ReLU activation and $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ is the weight matrix of layer $i$. With slight abuse of terminology, we will refer to $f$ as a network, as opposed to a labeling function. We then consider a binary¹ network $g$ of depth $l'$,
$$g(x) = \alpha\, M_{l'}\,\sigma\!\left(M_{l'-1}\cdots\sigma\!\left(M_1 x\right)\right),$$
where $M_i$ is a binary weight matrix, with all weights drawn uniformly at random from $\{-1,+1\}$, for all layers $i \in [l']$, and the last layer is multiplied by a factor of $\alpha$. The scaling factor $\alpha$ is calculated precisely in Section 3.2.3, where we also show that it is unavoidable for function approximation (i.e., regression), though not for classification.
Our goal is to find the smallest network $g$ such that it contains a subnetwork $\tilde g$ which approximates $f$ well. More precisely, we will bound the overparameterization of the binary network $g$ under which one can find supermask matrices $S_i \in \{0,1\}^{d_i \times d_{i-1}}$, one for each layer $i$, such that the pruned network
$$\tilde g(x) = \alpha\,(S_{l'}\odot M_{l'})\,\sigma\!\left((S_{l'-1}\odot M_{l'-1})\cdots\sigma\!\left((S_1\odot M_1)\,x\right)\right)$$
is $\epsilon$-close to $f$ in the sense of uniform approximation over the unit ball, i.e.,
$$\max_{x:\|x\|\le 1}\ \|f(x) - \tilde g(x)\| \le \epsilon,$$
for some desired $\epsilon > 0$. In this paper, we show that $g$ only needs to be polylogarithmically larger than the target network to have this property. We formalize this and provide a proof in the following sections.
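To make the setup concrete, here is a small NumPy sketch (our own illustration; the helper names forward and sup_error are not from the paper) of a pruned forward pass $\tilde g(x) = \alpha\,(S_{l'}\odot M_{l'})\sigma(\cdots)$ together with a Monte-Carlo estimate of the uniform error over the unit ball.

```python
# A minimal NumPy sketch of the setup above: a target ReLU network f, a random
# binary network g, and a pruned network g_tilde obtained via supermasks S_i.
# All function and variable names here (forward, sup_error, ...) are
# illustrative choices, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def forward(weights, x, alpha=1.0):
    """Evaluate W_l @ relu(... relu(W_1 @ x)) with an optional output scale."""
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return alpha * (weights[-1] @ h)

# Target network f: depth l = 2, width d = 4.
W_target = [rng.standard_normal((4, 4)), rng.standard_normal((1, 4))]

# Random binary network g: same depth here, wider layers, weights in {-1,+1}.
M = [rng.choice([-1.0, 1.0], size=(16, 4)), rng.choice([-1.0, 1.0], size=(1, 16))]

# Supermasks S_i in {0,1}; a pruning algorithm would choose these. Here they are
# random only to show the mechanics of g_tilde(x) = alpha * (S_l*M_l) relu(...).
S = [rng.integers(0, 2, size=Mi.shape).astype(float) for Mi in M]
pruned = [Si * Mi for Si, Mi in zip(S, M)]

# Monte-Carlo estimate of the uniform approximation error over the unit ball.
def sup_error(f_w, g_w, alpha, trials=1000):
    errs = []
    for _ in range(trials):
        x = rng.standard_normal(4)
        x /= max(np.linalg.norm(x), 1.0)          # keep x inside the unit ball
        errs.append(np.linalg.norm(forward(f_w, x) - forward(g_w, x, alpha)))
    return max(errs)

print("estimated sup-norm gap:", sup_error(W_target, pruned, alpha=0.1))
```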
Henceforth, we denote $[n] = \{1, 2, \dots, n\}$ for a positive integer $n$. Unless otherwise specified, $\|\cdot\|$ refers to the $\ell_2$ norm. We also use the max norm of a matrix, defined as $\|W\|_{\max} = \max_{i,j}|W_{ij}|$. The element-wise product between two matrices $A$ and $B$ is denoted by $A \odot B$. We assume without loss of generality that the weights are specified in the base-10 system. However, since we do not specify the base of the logarithm explicitly in our computations, we use the notation $\log(\cdot)$ to hide constant factors that may arise from choosing different bases.
3 Strong Lottery Tickets by Binary Expansion
In this section, we formally present our approximation results. We show that in order to approximate any target network within an arbitrary approximation error $\epsilon$, it suffices to prune a random binary¹ network that is just polylogarithmically deeper and wider than the target network.
3.1 Main Result
First, we point out that the scaling factor $\alpha$ in the final layer of $g$ is necessary for achieving arbitrarily small approximation error for any target network. In other words, it is impossible to approximate an arbitrary target network with a purely binary $\{-1,+1\}$ network, regardless of the overparameterization. To see this, note that a purely binary ReLU network (pruned or not) maps integer inputs to integer outputs. Hence, for the simple target function $f(x) = x/2$ and the input $x = 1$, the best a binary network $g$ can do is output some integer $g(1) \in \mathbb{Z}$, and therefore $\max_{|x|\le 1}|f(x) - g(x)| \ge 1/2$ for any binary network $g$. We will show that just by allowing the weights of the final layer to be scaled, we can provide a uniform approximation guarantee while the rest of the network remains binary $\{-1,+1\}$. Formally, we have the following theorem:
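The obstruction above can be checked directly. The following minimal sketch (our own, with hypothetical helper names) samples purely binary ReLU networks and confirms that their outputs at the integer input $x=1$ are always integers, so no amount of pruning or widening can approximate the target value $1/2$.

```python
# Purely binary {-1,+1} ReLU networks map integer inputs to integer outputs,
# so they cannot approximate f(x) = x/2 at x = 1 to better than 1/2.
# Hypothetical helper names; a small check, not the paper's code.
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0)

def random_binary_net(widths):
    return [rng.choice([-1, 1], size=(widths[i + 1], widths[i]))
            for i in range(len(widths) - 1)]

def evaluate(weights, x):
    h = np.array([x])
    for W in weights[:-1]:
        h = relu(W @ h)
    return (weights[-1] @ h)[0]

outputs = [evaluate(random_binary_net([1, 8, 8, 1]), x=1) for _ in range(200)]
assert all(float(o).is_integer() for o in outputs)                 # always an integer
print("closest any output gets to 0.5:", min(abs(o - 0.5) for o in outputs))  # >= 0.5
```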
Theorem 1.
Consider the set of FC ReLU networks of depth $l$ and width at most $d$, with weight matrices of bounded norm, defined as
$$\mathcal{F} = \left\{ f \ :\ f(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right),\ \ \|W_i\| \le 1\ \text{ for all } i \in [l] \right\},$$
and let $\delta \in (0,1)$. For an arbitrary target approximation error $\epsilon > 0$, let $g$ be a randomly initialized binary network of depth $l'$ such that every weight is drawn uniformly from $\{-1,+1\}$, where both $l'/l$ and the ratio of the layer widths of $g$ to those of $f$ are polylogarithmic in $d$, $l$, $1/\epsilon$ and $1/\delta$.

Then, with probability at least $1-\delta$, for every $f \in \mathcal{F}$, there exist pruning matrices $S_1, \dots, S_{l'}$ such that
$$\max_{x:\|x\|\le 1}\ \left\| f(x) - \alpha\,(S_{l'}\odot M_{l'})\,\sigma\!\left(\cdots \sigma\!\left((S_1\odot M_1)\,x\right)\right)\right\| \le \epsilon$$
holds, where $\alpha > 0$ is the scaling factor of the last layer, computed in Section 3.2.3.
Remark 1.
The dimensions of the weight matrices of $g$ in Theorem 1 follow from the construction in Section 3.2: each layer $W_i$ of the target is replaced in $g$ by a block of consecutive binary layers. The first weight matrix of each block has the same input dimension as $W_i$, the last has the same output dimension as $W_i$, and both the number of layers in the block and the widths of its intermediate layers exceed those of $W_i$ by the polylogarithmic factor of Theorem 1.
3.2 Proof of Theorem 1
First, we show in Section 3.2.1 that any target network in $\mathcal{F}$ can be approximated within $\epsilon$ by another network $\hat f$ whose weights have finite precision of at most $p$ digits, where $p$ is logarithmic in $d$, $l$ and $1/\epsilon$.
Then, in Section 3.2.2, we show that any finite-precision network can be represented exactly using a binary network where all the weights are binary ($\{-1,+1\}$) except the last layer, whose weights are scaled binary ($\{-\alpha,+\alpha\}$). The proof sketch is as follows. First, through a simple scaling argument, we show that any finite-precision network is equivalent to a network with integer weights in every layer except the last. We then present Theorem 2, which gives a deterministic construction of a binary network, built from diamond-shaped gadgets, that can be pruned to represent any integer network. Lemma 7 extends the result to the case when the network is initialized with random binary weights.
3.2.1 Logarithmic precision is sufficient
First, we consider the simplest setting, wherein the target network contains a single weight, i.e., $f(x) = \sigma(wx)$, where $w$ and $x$ are scalars whose absolute values are bounded by $1$. This assumption can be relaxed to any finite norm bound. We begin by noting the fact that $\lceil\log_{10}(1/\epsilon)\rceil$ digits of precision are sufficient to approximate a real number in $[-1,1]$ within error $\epsilon$, as formalized below.
Fact 1.
Let $w \in [-1,1]$ and let $\hat w$ be a finite-precision truncation of $w$ with $p = \lceil\log_{10}(1/\epsilon)\rceil$ digits. Then $|w - \hat w| \le 10^{-p} \le \epsilon$ holds.
Now we state the result for the case when the target network contains a single weight $w$.
Lemma 1.
Consider a network $f(x) = \sigma(wx)$ where $|w| \le 1$. For a given $\epsilon > 0$, let $\hat w$ be a finite-precision truncation of $w$ up to $p = \lceil\log_{10}(1/\epsilon)\rceil$ digits and let $\hat f(x) = \sigma(\hat w x)$. Then we have
$$\max_{x:|x|\le 1}\ |f(x) - \hat f(x)| \le \epsilon.$$
Proof.
By Fact 1, we know that $|w - \hat w| \le \epsilon$. Applying Cauchy–Schwarz with $|x| \le 1$ gives us $|wx - \hat w x| \le |w - \hat w|\,|x| \le \epsilon$. Since this holds for any $|x| \le 1$ and ReLU is 1-Lipschitz, the result follows. ∎
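A quick numerical sanity check of Fact 1 and Lemma 1 (an illustration we add here, not the paper's code): truncating $w$ to $p = \lceil\log_{10}(1/\epsilon)\rceil$ decimal digits keeps the uniform error of $\sigma(wx)$ over $|x| \le 1$ below $\epsilon$.

```python
# Numeric check of Fact 1 / Lemma 1: truncating w to p = ceil(log10(1/eps))
# decimal digits keeps |w - w_hat| <= eps, and hence
# max_{|x|<=1} |relu(w x) - relu(w_hat x)| <= eps.  Illustrative only.
import math
import numpy as np

def truncate(w, digits):
    scale = 10 ** digits
    return math.trunc(w * scale) / scale

rng = np.random.default_rng(2)
eps = 1e-3
p = math.ceil(math.log10(1 / eps))

worst = 0.0
for w in rng.uniform(-1, 1, size=1000):
    w_hat = truncate(w, p)
    for x in np.linspace(-1, 1, 201):
        worst = max(worst, abs(max(w * x, 0) - max(w_hat * x, 0)))
print(p, worst, worst <= eps)   # 3, <= 0.001, True
```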
Lemma 1 can be extended to show that it suffices to consider finite-precision truncations up to $p$ digits, with $p$ logarithmic in $d$, $l$ and $1/\epsilon$, to approximate a network of width $d$ and depth $l$. This is stated more formally below.
Lemma 2.
Consider a network $f(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right)$ where $f \in \mathcal{F}$ and $\|x\| \le 1$. For a given $\epsilon > 0$, define $\hat f(x) = \hat W_l\,\sigma\!\left(\hat W_{l-1}\cdots\sigma(\hat W_1 x)\right)$, where $\hat W_i$ is a coordinate-wise finite-precision truncation of $W_i$ up to $p$ digits, with $p$ logarithmic in $d$, $l$ and $1/\epsilon$. Then we have
$$\max_{x:\|x\|\le 1}\ \|f(x) - \hat f(x)\| \le \epsilon.$$
We provide the proof of Lemma 2, as well as approximation results for a single neuron and a single layer, in the supplementary material (Appendix A.1).
3.2.2 Binary weights are sufficient
We begin by showing that any finite-precision FC ReLU network can be represented exactly as an FC ReLU network with integer weights in every layer except the last, using a simple scaling argument. Since ReLU networks are positively homogeneous, we have that $\sigma(cz) = c\,\sigma(z)$ for any $c > 0$. Given a network $\hat f$ whose weights all have finite precision of at most $p$ digits, we can apply this property layerwise with the scaling factor $10^{p}$ so that
$$\hat f(x) = \hat W_l\,\sigma\!\left(\hat W_{l-1}\cdots\sigma(\hat W_1 x)\right) = 10^{-pl}\,\big(10^{p}\hat W_l\big)\,\sigma\!\left(\big(10^{p}\hat W_{l-1}\big)\cdots\sigma\!\left(\big(10^{p}\hat W_1\big)x\right)\right), \qquad (1)$$
where each $10^{p}\hat W_i$ is a matrix of integer weights and $\alpha := 10^{-pl}$. Therefore, the rescaled network has integer weights in every layer except the last, whose effective weight matrix is $\alpha\,\big(10^{p}\hat W_l\big) = 10^{-p(l-1)}\hat W_l$.
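The rescaling in Equation (1) can be sketched in a few lines (our own illustration, assuming $p$ decimal digits per weight and the hypothetical helper forward): the finite-precision network and the integer network scaled by $10^{-pl}$ compute identical outputs.

```python
# Sketch of the rescaling in Equation (1): a network whose weights have at most
# p decimal digits equals 10**(-p*l) times a network with integer weights,
# by positive homogeneity relu(c*z) = c*relu(z) for c > 0.  Illustrative code.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def forward(weights, x, scale=1.0):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return scale * (weights[-1] @ h)

p = 2                                                                 # digits per weight
rng = np.random.default_rng(3)
W_hat = [np.round(rng.uniform(-1, 1, (3, 3)), p) for _ in range(2)]   # finite precision
W_int = [np.round(W * 10**p).astype(int) for W in W_hat]              # integer weights
alpha = 10.0 ** (-p * len(W_hat))                                     # global scale

x = rng.standard_normal(3)
assert np.allclose(forward(W_hat, x), forward(W_int, x, scale=alpha))
```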
In the remaining part of this section, we show that any FC ReLU network with integer weights can be represented exactly by pruning a purely binary FC ReLU network that is only polylogarithmically wider and deeper. More precisely, we prove the following result.
Theorem 2.
Consider the set of FC ReLU networks with integer weights defined as
$$\mathcal{G} = \left\{ h \ :\ h(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right),\ \ W_i \in \mathbb{Z}^{d_i \times d_{i-1}},\ \|W_i\|_{\max} \le w_{\max},\ d_i \le d \right\},$$
where $w_{\max}$ bounds the magnitude of the integer weights. Let $\delta \in (0,1)$, and let $g$ be a binary network whose depth and layer widths exceed those of the networks in $\mathcal{G}$ by a factor that is logarithmic in $d$, $l$, $w_{\max}$ and $1/\delta$, where every weight is generated uniformly at random from $\{-1,+1\}$.

Then, with probability at least $1-\delta$, for every $h \in \mathcal{G}$, there exist pruning matrices $S_1, \dots, S_{l'}$ such that
$$(S_{l'}\odot M_{l'})\,\sigma\!\left((S_{l'-1}\odot M_{l'-1})\cdots\sigma\!\left((S_1\odot M_1)\,x\right)\right) = h(x)$$
holds for any $x \in \mathbb{R}^{d_0}$, where $l'$ denotes the depth of $g$.
Remark 2.
The dimensions of the weight matrices of $g$ in Theorem 2 follow from the construction below: each layer $W_i$ of the integer network is represented exactly by a block of consecutive binary layers of $g$. The first weight matrix of each block has the same input dimension as $W_i$, the last has the same output dimension as $W_i$, and both the number of layers in the block and the widths of its intermediate layers exceed those of $W_i$ by the logarithmic factor of Theorem 2.
Remark 3.
Note that the pruned network is exactly equal to $h$; there is no approximation error. Furthermore, we provide a uniform guarantee for all networks in $\mathcal{G}$ by pruning a single over-parameterized network, as in Pensia et al., (2020).
Remark 4.
Theorem 2 can be made into a deterministic construction for any fixed target network, thereby avoiding the extra overparameterization needed to handle random initialization. We extend to the random-initialization setting by replicating the construction a sufficient number of times.
Remark 5.
To resolve issues of numerical overflow, we can insert scaling neurons after every layer.
Remark 6.
The integer assumption can easily be converted to a finite-precision assumption using a simple scaling argument. Since all networks in practice use finite-precision arithmetic, Theorem 2 may be of independent interest to the reader. However, we emphasize here that there is no approximation error in this setting. Practitioners who are interested in a small error $\epsilon$ can simply apply Theorem 1 and incur the corresponding overparameterization factor.
The proof of Theorem 2 will first involve a deterministic construction for a binary network that gives us the desired guarantee. We then extend to the random binary initialization. The construction is based on a diamond-shaped gadget that allows us to approximate a single integer weight by pruning a binary ReLU network with just logarithmic overparameterization.
First, consider a target network that contains just a single integer weight, i.e., $f(x) = wx$ with $w \in \mathbb{Z}$. We will show that there exists a binary FC ReLU network which can be pruned to represent $f$.
Lemma 3.
Consider a network $f(x) = wx$ where $w \in \mathbb{Z}$. Then there exists an FC ReLU binary network of width and depth $O(\log|w|)$ that can be pruned to a subnetwork $\tilde g$ so that $\tilde g(x) = wx$ for all $x \in \mathbb{R}$.
Proof. Note that since $w$ is an integer, it can be represented using its binary (base-$2$) expansion
$$w = s\sum_{i=0}^{p} \beta_i\, 2^{i}, \qquad s = \mathrm{sign}(w),\ \ \beta_i \in \{0,1\}. \qquad (2)$$
For ease of notation, we will use $G_i$ to represent the gadget computing $x \mapsto 2^{i}x$ going forward, and we denote $p = \lfloor \log_2 |w| \rfloor$. The construction of $G_i$ in Figure 2a shows that $2^{i}x$ can be represented by a binary network with ReLU activations for any $i$. We will refer to this network as the diamond-shaped “gadget”.
Note that the expansion in Equation (2) requires the terms $\beta_i 2^{i}$ with $\beta_i \in \{0,1\}$ for all $i \in \{0,\dots,p\}$. Luckily, any of these can be obtained by just pruning $G_i$, as shown in Figure 2b.
[Figures 2–4: the diamond-shaped gadget construction, its pruned variants, the mirrored gadget, and the full construction for a single integer weight.]
However, since our constructions use the ReLU activation, this only works when $x \ge 0$. Using the same trick as Pensia et al., (2020), we can extend this construction by mirroring it, as shown in Figure 3. This gives us $2^{i}x$ and $-2^{i}x$ for all $x \in \mathbb{R}$. The correctness of this mirroring trick relies on the simple observation that $x = \sigma(x) - \sigma(-x)$.
Putting these together, we obtain a network whose last layer has access to $\pm 2^{i}x$ for every $i \in \{0,\dots,p\}$, as shown in Figure 4. By pruning just the weights in the last layer, we can choose which terms to include. Keeping exactly the terms $s\,\beta_i\,2^{i}x$ that appear in the expansion (2) completes the construction and yields $wx$.
To calculate the overparameterization required to represent $f$, we simply count the parameters in the above construction. Each gadget $G_i$ is a network of width $2$ and depth $O(i)$. To compute $2^{i}x$ for all $x \in \mathbb{R}$, we need two such gadgets (one fed with $\sigma(x)$, the other with $\sigma(-x)$). Therefore, to construct both $2^{i}x$ and $-2^{i}x$, we need width $4$ and depth $O(i)$. Repeating this for each $i \in \{0,\dots,p\}$ and placing the mirrored gadgets in parallel shows that our construction is a network of width $O(p) = O(\log|w|)$ and depth $O(p) = O(\log|w|)$, which completes the proof of Lemma 3. ∎
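The construction above can be mimicked numerically. The sketch below is our own rendering of the idea (the gadget layout is an assumption, not the paper's exact figure): doubling chains of $\{-1,+1\}$ weights produce $2^{i}\,\sigma(x)$, mirroring extends this to all real $x$, and pruning the last layer selects the powers of two in the binary expansion of $w$.

```python
# A sketch of the Lemma 3 idea: with only {-1,+1} weights and ReLU, a doubling
# chain ("diamond gadget") produces 2**i * relu(x); mirroring with
# x = relu(x) - relu(-x) extends this to all real x; pruning the last layer
# selects the powers of two in the binary expansion of an integer w.
# The gadget layout below is our own illustrative rendering, not the paper's figure.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def gadget(x, i):
    """Return 2**i * relu(x) using i doubling layers of two {-1,+1}-weighted neurons."""
    h = np.array([relu(x), relu(x)])                        # two copies of relu(x)
    for _ in range(i):
        h = relu(np.array([[1.0, 1.0], [1.0, 1.0]]) @ h)    # each neuron sums both -> doubles
    return h[0]

def integer_weight_times_x(w, x, num_bits=8):
    """Compute w * x for integer w by pruning last-layer terms of mirrored gadgets."""
    sign = 1 if w >= 0 else -1
    bits = [(abs(w) >> i) & 1 for i in range(num_bits)]     # binary expansion of |w|
    total = 0.0
    for i, b in enumerate(bits):
        if b:                                               # "unpruned" last-layer weight
            pos = gadget(x, i)                              # 2**i * relu(x)
            neg = gadget(-x, i)                             # 2**i * relu(-x)
            total += sign * (pos - neg)                     # = sign * 2**i * x  (mirroring)
    return total

for w in [0, 5, -13, 42]:
    for x in [-1.7, 0.0, 2.3]:
        assert abs(integer_weight_times_x(w, x) - w * x) < 1e-9
print("pruned binary gadgets reproduce w*x exactly (up to float error)")
```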
Now, we extend the construction to the case where the target function is a neural network with a single neuron, i.e., $f(x) = \sigma(\langle w, x\rangle)$ with an integer weight vector $w \in \mathbb{Z}^{d}$.
Lemma 4.
Consider a network $f(x) = \sigma(\langle w, x\rangle)$ where $w \in \mathbb{Z}^{d}$ and $\|w\|_{\max} \le w_{\max}$. Then there exists an FC ReLU binary network of width $O(d\log w_{\max})$ and depth $O(\log w_{\max})$ that can be pruned to a subnetwork $\tilde g$ so that $\tilde g(x) = f(x)$ for all $x \in \mathbb{R}^{d}$.
Proof.
A neuron can be written as $\sigma\!\left(\sum_{j=1}^{d} w_j x_j\right)$. Therefore, we can just repeat our construction from above for each weight $w_j$ and sum the selected terms before the final ReLU. This results in a network of width $O(d\log w_{\max})$, while the depth remains unchanged at $O(\log w_{\max})$. ∎
[Figure 5: re-using the gadget outputs across neurons by pruning only the last layer.]
Next, we describe how to approximate a single layer target network and avoid a quadratic overparameterization.
Lemma 5.
Consider a network $f(x) = \sigma(Wx)$ where $W \in \mathbb{Z}^{d_1 \times d}$ and $\|W\|_{\max} \le w_{\max}$. Then there exists an FC ReLU binary network of width $O(d\log w_{\max})$ and depth $O(\log w_{\max})$ that can be pruned to a subnetwork $\tilde g$ so that $\tilde g(x) = f(x)$ for all $x \in \mathbb{R}^{d}$.
Proof.
Note that $Wx$ is the vector whose $k$-th entry is $\langle w_k, x\rangle$, where $w_k$ denotes the $k$-th row of $W$. Naively extending the construction from Lemma 4 would require us to replace each of the $d_1$ neurons in the first layer by a network of width $O(d\log w_{\max})$ and depth $O(\log w_{\max})$. This already needs a network of width $O(d_1 d\log w_{\max})$, which is a quadratic overparameterization. Instead, we take advantage of the fact that the construction from Lemma 3 is pruned only in its last layer, which lets us re-use its gadgets and avoid the quadratic overparameterization. An illustration of this idea is shown in Figure 5. The key observation is that by pruning only the last layer of each gadget, we leave the gadget available to be re-used for the weights of the remaining neurons. Therefore, the width of the network required to represent $\sigma(Wx)$ is just $O(d\log w_{\max})$ and the depth remains $O(\log w_{\max})$. ∎
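The weight-sharing idea of Lemma 5 can be sketched as follows (our own conceptual illustration with hypothetical helper names): the bank of values $\pm 2^{i} x_j$ is built once, and every output neuron of $\sigma(Wx)$ selects its own terms from that shared bank by pruning only the last layer.

```python
# Sketch of the Lemma 5 idea: build the bank of values {±2**i * x_j} once
# (shared "gadget" outputs), then let every output neuron of relu(W x) pick its
# own terms from that shared bank by pruning only the last layer.
# A conceptual illustration with hypothetical helper names.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def shared_bank(x, num_bits):
    """Bank of features 2**i * x_j for all coordinates j and bit positions i."""
    # In the construction these come from mirrored diamond gadgets; here we
    # just materialize the same values directly.
    return np.array([(2 ** i) * xj for xj in x for i in range(num_bits)])

def layer_via_pruning(W, x, num_bits=8):
    d_out, d_in = W.shape
    bank = shared_bank(x, num_bits)                    # width d_in * num_bits, built once
    masks = np.zeros((d_out, d_in * num_bits))
    for k in range(d_out):                             # each output neuron prunes the
        for j in range(d_in):                          # shared bank independently
            w = int(W[k, j])
            sign = 1 if w >= 0 else -1
            for i in range(num_bits):
                if (abs(w) >> i) & 1:
                    masks[k, j * num_bits + i] = sign  # keep +/- 2**i * x_j
    return relu(masks @ bank)

rng = np.random.default_rng(4)
W = rng.integers(-100, 100, size=(3, 5))
x = rng.standard_normal(5)
assert np.allclose(layer_via_pruning(W, x), relu(W @ x))
```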
Now we can tie it all together to show an exact representation guarantee for any network in $\mathcal{G}$.
Lemma 6.
Consider a network $h(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right)$ where $W_i \in \mathbb{Z}^{d_i \times d_{i-1}}$ and $d_i \le d$ for all $i \in [l]$. Let $w_{\max} = \max_{i}\|W_i\|_{\max}$. Then, there exists an FC ReLU binary network of width $O(d\log w_{\max})$ and depth $O(l\log w_{\max})$ that can be pruned to a subnetwork $\tilde g$ so that $\tilde g(x) = h(x)$ for all $x \in \mathbb{R}^{d_0}$.
Proof.
Note that there is no approximation error in any of the steps above. Therefore, we can just repeat the construction above for each of the $l$ layers in $h$. This results in a network of width $O(d\log w_{\max})$ and depth $O(l\log w_{\max})$. A more precise description of the dimensions of each layer can be found in Remark 2. ∎
Finally, below we show that our construction can be extended to networks with randomly initialized weights. The proof can be found in the supplementary material (Appendix A.1).
Lemma 7.
Consider a network $h(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right)$ where $W_i \in \mathbb{Z}^{d_i \times d_{i-1}}$, $d_i \le d$ and $\|W_i\|_{\max} \le w_{\max}$ for all $i \in [l]$. Let $N$ denote the number of diamond gadgets in the deterministic construction of Lemma 6, and let $g$ be a randomly initialized binary network of the same depth whose width is an $O(\log(N/\delta))$ factor larger, such that every weight is drawn uniformly from $\{-1,+1\}$. Then $g$ can be pruned to a subnetwork $\tilde g$ so that $\tilde g(x) = h(x)$ for all $x \in \mathbb{R}^{d_0}$, with probability at least $1-\delta$.
Since every network $h \in \mathcal{G}$ satisfies the assumptions of Lemma 7, we can apply it to the entire class $\mathcal{G}$. Note that we do not need to apply a union bound over $\mathcal{G}$, since the randomness is only needed to ensure the existence of one particular deterministic network, namely the construction from Lemma 6. Therefore, a single random network is sufficient to ensure the guarantee for every $h \in \mathcal{G}$. This completes the proof of Theorem 2.
3.2.3 Putting everything together
We now put the above results together to complete the proof of Theorem 1. First, note that by Lemma 2, to approximate any $f \in \mathcal{F}$ within $\epsilon$, it suffices to consider $\hat f$, a finite-precision version of $f$ in which each weight has at most $p$ digits of precision, with $p$ logarithmic in $d$, $l$ and $1/\epsilon$. Now, applying the scaling trick in Equation (1), we can represent $\hat f$ exactly as a scaled integer network, i.e.,
$$\hat f(x) = \alpha\, h(x), \qquad \alpha = 10^{-pl},$$
where all the weight matrices of $h$ are integer. Since the weights of $\hat f$ are bounded in magnitude by $1$ and have at most $p$ digits of precision, it is clear that the integer weights of $h$ are bounded in magnitude by $10^{p}$. Therefore, applying Theorem 2 to the integer network $h$ with $w_{\max} = 10^{p}$, we have the following. If $g$ is a randomly initialized binary network with the depth and width prescribed by Theorem 2, then with probability $1-\delta$, it can be pruned to $\tilde g$ so that $\tilde g(x) = h(x)$ for any $x$ in the unit ball. Therefore, to approximate $\hat f$, we simply push the scaling factor $\alpha$ into the last layer of $\tilde g$ so that its weights are now scaled binary $\{-\alpha,+\alpha\}$. Combining this with the approximation guarantee between $f$ and $\hat f$ completes the proof.
3.3 Binary weights for classification
Theorem 2 can easily be extended to classification problems using finite-precision networks. Since the predicted label (the sign of the output, or the $\arg\max$ over output coordinates) is invariant to positive scaling, we no longer even require the final layer to be scaled. Applying the same argument as Section 3.2.3 and then dropping the factor $\alpha$ gives us the following corollary.
Corollary 1.
Consider the set of binary-classification FC ReLU networks of width $d$ and depth $l$ whose weight matrices have finite precision of at most $p$ digits. Let $g$ be a randomly initialized binary network with the depth and width prescribed by Theorem 2 (with $w_{\max} = 10^{p}$), such that every weight is drawn uniformly from $\{-1,+1\}$. Then, with probability at least $1-\delta$, for every such network $f$, there exist pruning matrices $S_1, \dots, S_{l'}$ such that the pruned network $\tilde g$ satisfies $\mathrm{sign}(\tilde g(x)) = \mathrm{sign}(f(x))$ for any $x$ in the unit ball, where $l'$ denotes the depth of $g$.
4 Conclusion
In this paper, we prove the Strong LTH for binary networks, establishing that logarithmic overparameterization is sufficient for pruning algorithms to discover accurate subnetworks within random binary models. By doing this, we provide theory supporting a wide range of experimental work in the field, e.g., the finding that scaled binary networks can achieve state-of-the-art accuracy on benchmark image datasets (Diffenderfer and Kailkhura, 2021; Ramanujan et al., 2020). Moreover, we show that only the last layer needs to be scaled binary, while the rest of the network can be purely binary $\{-1,+1\}$. It is well known in the binary network literature that a gain term (scaling the weights) makes the optimization problem more tractable (Simons and Lee, 2019). While this is known empirically, it would be interesting to study it from a theoretical perspective, so that we can identify better algorithms for finding binary networks of high accuracy.
Acknowledgements
The authors would like to thank Jonathan Frankle for early discussions on pruning random binary networks. This research was supported by ONR Grant N00014-21-1-2806.
References
- Barron, (1993) Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945.
- Blalock et al., (2020) Blalock, D., Gonzalez Ortiz, J. J., Frankle, J., and Guttag, J. (2020). What is the state of neural network pruning? In Proceedings of Machine Learning and Systems 2020, pages 129–146.
- Cheng et al., (2019) Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2019). A Survey of Model Compression and Acceleration for Deep Neural Networks. arXiv:1710.09282 [cs].
- Courbariaux et al., (2015) Courbariaux, M., Bengio, Y., and David, J.-P. (2015). Binaryconnect: Training deep neural networks with binary weights during propagations. arXiv preprint arXiv:1511.00363.
- Deng et al., (2020) Deng, L., Li, G., Han, S., Shi, L., and Xie, Y. (2020). Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532.
- Diffenderfer and Kailkhura, (2021) Diffenderfer, J. and Kailkhura, B. (2021). Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. arXiv preprint arXiv:2103.09377.
- Frankle and Carbin, (2018) Frankle, J. and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations.
- Frankle et al., (2020) Frankle, J., Dziugaite, G. K., Roy, D., and Carbin, M. (2020). Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, pages 3259–3269. PMLR.
- Han et al., (2015a) Han, S., Mao, H., and Dally, W. J. (2015a). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
- Han et al., (2015b) Han, S., Pool, J., Tran, J., and Dally, W. J. (2015b). Learning both weights and connections for efficient neural networks. arXiv preprint arXiv:1506.02626.
- Hanin, (2019) Hanin, B. (2019). Universal function approximation by deep neural nets with bounded width and relu activations. Mathematics, 7(10):992.
- Hassibi and Stork, (1993) Hassibi, B. and Stork, D. G. (1993). Second order derivatives for network pruning: Optimal Brain Surgeon. In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 164–171. Morgan-Kaufmann.
- He et al., (2018) He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., and Han, S. (2018). Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800.
- He et al., (2017) He, Y., Zhang, X., and Sun, J. (2017). Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397.
- Hubara et al., (2016) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2016). Binarized neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 4114–4122.
- Hubara et al., (2017) Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., and Bengio, Y. (2017). Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898.
- Kidger and Lyons, (2020) Kidger, P. and Lyons, T. (2020). Universal approximation with deep narrow networks. In Conference on Learning Theory, pages 2306–2327. PMLR.
- Klusowski and Barron, (2018) Klusowski, J. M. and Barron, A. R. (2018). Approximation by combinations of relu and squared relu ridge functions with and controls. IEEE Transactions on Information Theory, 64(12):7649–7656.
- LeCun et al., (1990) LeCun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal Brain Damage. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2, pages 598–605. Morgan-Kaufmann.
- Levin et al., (1994) Levin, A. U., Leen, T. K., and Moody, J. E. (1994). Fast Pruning Using Principal Components. In Cowan, J. D., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 35–42. Morgan-Kaufmann.
- Li et al., (2016) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. (2016). Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
- Malach et al., (2020) Malach, E., Yehudai, G., Shalev-Schwartz, S., and Shamir, O. (2020). Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR.
- Mozer and Smolensky, (1989) Mozer, M. C. and Smolensky, P. (1989). Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in neural information processing systems, pages 107–115.
- Orseau et al., (2020) Orseau, L., Hutter, M., and Rivasplata, O. (2020). Logarithmic pruning is all you need. Advances in Neural Information Processing Systems, 33.
- Pensia et al., (2020) Pensia, A., Rajput, S., Nagle, A., Vishwakarma, H., and Papailiopoulos, D. (2020). Optimal lottery tickets via subsetsum: Logarithmic over-parameterization is sufficient. arXiv preprint arXiv:2006.07990.
- Perekrestenko et al., (2018) Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. (2018). The universal approximation power of finite-width deep relu networks. arXiv preprint arXiv:1806.01528.
- Ramanujan et al., (2020) Ramanujan, V., Wortsman, M., Kembhavi, A., Farhadi, A., and Rastegari, M. (2020). What’s Hidden in a Randomly Weighted Neural Network? arXiv:1911.13299 [cs].
- Rastegari et al., (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. (2016). Xnor-net: Imagenet classification using binary convolutional neural networks. In European conference on computer vision, pages 525–542. Springer.
- Scarselli and Tsoi, (1998) Scarselli, F. and Tsoi, A. C. (1998). Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural networks, 11(1):15–37.
- Simons and Lee, (2019) Simons, T. and Lee, D.-J. (2019). A review of binarized neural networks. Electronics, 8(6):661.
- Stinchombe, (1989) Stinchombe, M. (1989). Universal approximation using feed-forward networks with nonsigmoid hidden layer activation functions. Proc. IJCNN, Washington, DC, 1989, pages 161–166.
- Wang et al., (2019) Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., and Hu, X. (2019). Pruning from Scratch. arXiv:1909.12579 [cs].
- Wen et al., (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. (2016). Learning structured sparsity in deep neural networks. arXiv preprint arXiv:1608.03665.
- Wu et al., (2016) Wu, J., Leng, C., Wang, Y., Hu, Q., and Cheng, J. (2016). Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828.
- Zhou et al., (2019) Zhou, H., Lan, J., Liu, R., and Yosinski, J. (2019). Deconstructing lottery tickets: Zeros, signs, and the supermask. arXiv preprint arXiv:1905.01067.
- Zhu et al., (2016) Zhu, C., Han, S., Mao, H., and Dally, W. J. (2016). Trained ternary quantization. arXiv preprint arXiv:1612.01064.
- Zhu and Gupta, (2017) Zhu, M. and Gupta, S. (2017). To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878.
Appendix A
A.1 Proof of Lemma 2
Before proving Lemma 2 in its entirety, we first extend Lemma 1 to approximate a single neuron with $p$ digits of precision.
Lemma 8.
Consider a network $f(x) = \sigma(\langle w, x\rangle)$ where $w \in \mathbb{R}^{d}$ with $\|w\|_{\max} \le 1$ and $\|x\| \le 1$. Let $\hat w$ be a coordinate-wise finite-precision truncation of $w$ up to $p$ digits and $\hat f(x) = \sigma(\langle \hat w, x\rangle)$. Then we have that
$$\max_{x:\|x\|\le 1}\ |f(x) - \hat f(x)| \le \sqrt{d}\,10^{-p}.$$
Proof.
Once again, by Fact 1 applied coordinate-wise, we have that $|w_j - \hat w_j| \le 10^{-p}$ for each coordinate $j \in [d]$. Therefore, $\|w - \hat w\| \le \sqrt{d}\,10^{-p}$. Applying Cauchy–Schwarz with $\|x\| \le 1$, together with the fact that ReLU is 1-Lipschitz, completes the proof. ∎
We now extend the result to approximating a single layer with finite-precision.
Lemma 9.
Consider a network $f(x) = \sigma(Wx)$ where $W \in \mathbb{R}^{d_1 \times d}$ with $\|W\|_{\max} \le 1$ and $\|x\| \le 1$. Let $\hat W$ be a coordinate-wise finite-precision truncation of $W$ up to $p$ digits and $\hat f(x) = \sigma(\hat W x)$. Then we have that
$$\max_{x:\|x\|\le 1}\ \|f(x) - \hat f(x)\| \le \sqrt{d_1 d}\,10^{-p}.$$
Proof.
Note that for any $x$ with $\|x\| \le 1$,
$$\|\sigma(Wx) - \sigma(\hat W x)\| \le \|Wx - \hat W x\| \qquad \text{(since $\sigma$ is 1-Lipschitz)}$$
$$\le \|W - \hat W\|_F\,\|x\| \le \sqrt{d_1 d}\,10^{-p}.$$
Since this holds for any $x$ in the unit ball, it also holds for the $x$ that attains the maximum possible error, which completes the proof. ∎
We are now ready to prove Lemma 2, which states that to approximate any FC ReLU network within $\epsilon$, it suffices to simply consider weights with logarithmic precision.
Lemma 2.
Consider a network $f(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right)$ where $f \in \mathcal{F}$, $d_i \le d$ for all $i \in [l]$, and $\|x\| \le 1$. Let $\hat f(x) = \hat W_l\,\sigma\!\left(\hat W_{l-1}\cdots\sigma(\hat W_1 x)\right)$, where $\hat W_i$ is a coordinate-wise finite-precision truncation of $W_i$ up to $p$ digits, with $p$ logarithmic in $d$, $l$ and $1/\epsilon$. Then we have that
$$\max_{x:\|x\|\le 1}\ \|f(x) - \hat f(x)\| \le \epsilon.$$
Proof.
The proof follows inductively. First, note that since the operator norm is bounded by the Frobenius norm, we have for the output of the first layer that
$$\|\sigma(W_1 x) - \sigma(\hat W_1 x)\| \le \|(W_1 - \hat W_1)x\| \le \|W_1 - \hat W_1\|_F\,\|x\| \le d\,10^{-p}.$$
Next, denote the output of the first layer of $f$ by $x^{(1)} = \sigma(W_1 x)$ and that of $\hat f$ by $\hat x^{(1)} = \sigma(\hat W_1 x)$. Repeating the calculation above for the second layer gives us
$$\|\sigma(W_2 x^{(1)}) - \sigma(\hat W_2 \hat x^{(1)})\| \le \|W_2\|\,\|x^{(1)} - \hat x^{(1)}\| + \|W_2 - \hat W_2\|_F\,\|\hat x^{(1)}\|.$$
Note that the first term is bounded by $d\,10^{-p}$ since $\|W_2\| \le 1$ for networks in $\mathcal{F}$, and the first-layer bound applies. Further, using the fact that $\|\hat x^{(1)}\| \le \|\hat W_1\|\,\|x\|$, which is bounded by a constant for $p$ large enough, we get that the error after two layers is at most $O\!\left(d\,10^{-p}\right)$. By induction, for each layer $k$ we have that, for any $\|x\| \le 1$,
$$\|f_k(x) - \hat f_k(x)\| \le O\!\left(k\,d\,10^{-p}\right),$$
where $f_k$ and $\hat f_k$ denote the first $k$ layers of $f$ and $\hat f$, respectively. Setting $p = \lceil \log_{10}(c\,l\,d/\epsilon)\rceil$ for a suitable constant $c$, which is logarithmic in $d$, $l$ and $1/\epsilon$, completes the proof. ∎
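As a sanity check of Lemma 2's message (our own experiment; the spectral normalization is one way to instantiate the norm bound on $\mathcal{F}$, and the sweep over $p$ is an illustrative choice), truncating every weight to $p$ decimal digits perturbs the output on the unit ball by an amount that decays like $10^{-p}$.

```python
# Sanity check of Lemma 2's message: truncating every weight of a (normalized)
# ReLU network to p decimal digits perturbs the output on the unit ball by an
# amount that shrinks like 10**(-p).  Normalization and the sweep over p are
# our own illustrative choices.
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
rng = np.random.default_rng(5)

def forward(weights, x):
    h = x
    for W in weights[:-1]:
        h = relu(W @ h)
    return weights[-1] @ h

d, l = 20, 4
weights = []
for _ in range(l):
    W = rng.standard_normal((d, d))
    weights.append(W / np.linalg.norm(W, 2))          # operator norm <= 1

for p in [1, 2, 3, 4]:
    truncated = [np.trunc(W * 10**p) / 10**p for W in weights]
    worst = 0.0
    for _ in range(500):
        x = rng.standard_normal(d)
        x /= np.linalg.norm(x)                        # on the unit sphere
        worst = max(worst, np.linalg.norm(forward(weights, x) - forward(truncated, x)))
    print(f"p={p}: max observed error ~ {worst:.2e}")  # decays roughly like 10**(-p)
```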
Lemma 7.
Consider a network $h(x) = W_l\,\sigma\!\left(W_{l-1}\cdots\sigma(W_1 x)\right)$ where $W_i \in \mathbb{Z}^{d_i \times d_{i-1}}$, $d_i \le d$ and $\|W_i\|_{\max} \le w_{\max}$ for all $i \in [l]$. Let $N$ denote the number of diamond gadgets in the deterministic construction of Lemma 6, and let $g$ be a randomly initialized binary network of the same depth whose width is an $O(\log(N/\delta))$ factor larger, such that every weight is drawn uniformly from $\{-1,+1\}$. Then $g$ can be pruned to a subnetwork $\tilde g$ so that $\tilde g(x) = h(x)$ for all $x \in \mathbb{R}^{d_0}$, with probability at least $1-\delta$.
Proof.
Consider any one particular diamond structure in Figure 2, and let $k$ denote the (constant) number of weights in it. If we choose the weights of this structure uniformly at random from $\{-1,+1\}$, then the probability that they all carry the required signs is $2^{-k}$. If we make the network $c$ times wider, the probability that such a diamond gadget does not exist among the $c$ candidate slots is given by $(1-2^{-k})^{c}$. In other words, if we define the event $E_j$ to be the failure event for the $j$-th gadget, then $\Pr[E_j] \le (1-2^{-k})^{c}$. Note that there are $N$ such diamond structures in our network. By symmetry, the failure probability of each of them is identical. To have the overall probability of failure be at most $\delta$, taking the union bound, we require that
$$\Pr\Big[\bigcup_{j=1}^{N} E_j\Big] \le N\,(1-2^{-k})^{c} \le \delta.$$
Hence, it suffices to have $c = O\!\left(\log\frac{N}{\delta}\right)$. In other words, a randomly initialized binary network whose width is an $O\!\left(\log\frac{N}{\delta}\right)$ factor larger than that of the deterministic construction of Lemma 6 (and of the same depth) contains our deterministic construction with probability at least $1-\delta$. ∎
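The replication factor in the argument above can be computed explicitly. The snippet below (our own illustration; the gadget size $k$ and the counts are assumed values) finds the smallest widening factor $c$ for which the union bound $N(1-2^{-k})^{c} \le \delta$ holds.

```python
# The replication factor needed in Lemma 7, computed from the union bound:
# each gadget pattern of k binary weights appears in one random draw with
# probability 2**(-k), so c candidate slots miss it with probability
# (1 - 2**(-k))**c, and all N gadgets succeed with probability >= 1 - delta once
# N * (1 - 2**(-k))**c <= delta.  The numbers below are illustrative.
import math

def replication_factor(num_gadgets, k, delta):
    """Smallest c with num_gadgets * (1 - 2**(-k))**c <= delta."""
    q = 1.0 - 2.0 ** (-k)
    return math.ceil(math.log(delta / num_gadgets) / math.log(q))

N, k, delta = 10_000, 4, 0.01        # e.g., 10k diamond gadgets, 4 weights each
c = replication_factor(N, k, delta)
print(c)                              # ~ 2**k * log(N/delta), i.e. O(log(N/delta))
failure = N * (1 - 2 ** (-k)) ** c
print(failure <= delta)               # True
```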
A.2 Observation on the power of zero
We try to understand here whether pruning is essential to the expressive power of binary networks. We note that pruning binary networks is strictly more powerful than merely sign-flipping their weights. To see this, consider the set of all networks that can be derived by pruning 1-hidden-layer binary networks of width $n$ and input dimension $d$:
$$\mathcal{B}^{\mathrm{prune}}_{n} = \left\{ x \mapsto \sum_{j=1}^{n} v_j\,\sigma(\langle w_j, x\rangle)\ :\ v_j \in \{-1,0,+1\},\ w_j \in \{-1,0,+1\}^{d} \right\}.$$
Similarly, the set of all 1-hidden-layer binary networks of the same width without pruning is given by
$$\mathcal{B}_{n} = \left\{ x \mapsto \sum_{j=1}^{n} v_j\,\sigma(\langle w_j, x\rangle)\ :\ v_j \in \{-1,+1\},\ w_j \in \{-1,+1\}^{d} \right\}. \qquad (3)$$
We prove the following proposition, which shows that $\mathcal{B}^{\mathrm{prune}}_{n}$ is a strictly richer class of functions, indicating that pruning is an essential part of approximating classifiers.
Proposition 1.
The function $g^*(x) = \sigma(x_1)$, for input dimension $d \ge 2$, satisfies $g^* \in \mathcal{B}^{\mathrm{prune}}_{n}$ and $g^* \notin \mathcal{B}_{n}$, i.e., without pruning, $g^*$ cannot be represented by a single-hidden-layer binary network of width $n$.
Proof.
For simplicity, we restrict attention to inputs in the region $\{x \ge 0 \,:\, x_1 > \sum_{k\ge 2} x_k\}$ of the non-negative orthant, where $g^*$ is linear and the activation pattern of every neuron of a network in $\mathcal{B}_{n}$ is determined by the sign of its first weight. If the functions are not equal on this region, then they are surely different. First, note that we recover $g^*$ from $\mathcal{B}^{\mathrm{prune}}_{n}$ if we set $w_1 = e_1$, $v_1 = 1$, and $v_j = 0$ for all $j \ge 2$. Therefore, $g^* \in \mathcal{B}^{\mathrm{prune}}_{n}$ holds. To see that $g^* \notin \mathcal{B}_{n}$, first note that on this region we can replace $\sigma(\langle w_j, x\rangle)$ with $\langle w_j, x\rangle$ in Equation (3) for every neuron with $w_{j1} = +1$, and with $0$ for every neuron with $w_{j1} = -1$. Hence, any $h \in \mathcal{B}_{n}$ is linear on this region, of the form $h(x) = \sum_{j:\, w_{j1}=+1} v_j \langle w_j, x\rangle$. Consider the coefficient of $x_1$ in $h$, which is $\sum_{j:\, w_{j1}=+1} v_j$, and the coefficient of $x_2$, which is $\sum_{j:\, w_{j1}=+1} v_j w_{j2}$. Since no term $v_j w_{j2}$ can be set to $0$, both coefficients are sums of the same number of $\pm 1$ terms and therefore have the same parity, whereas $g^*$ requires them to equal $1$ and $0$, respectively. The same parity obstruction rules out any target whose coefficients on $x_1$ and $x_2$ have different parities. This completes the proof. ∎
Remark 8.
The above proposition suggests that pruning is more powerful than merely flipping signs of a binary network. In fact, the same argument can be extended for binary networks of any fixed width and depth to show that pruned networks are more expressive. However, it does not quantify this difference in expressivity.
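The dichotomy of Proposition 1 can be verified by brute force for tiny sizes. The enumeration below (our own illustration; the proposition itself is general) checks that $\sigma(x_1)$ is representable by pruning a width-2 binary network with $d = 2$ inputs, but by no width-2 sign assignment without pruning.

```python
# Brute-force check of the flavor of Proposition 1 for tiny sizes (d = 2 inputs,
# n = 2 hidden neurons): the function g*(x) = relu(x_1) is obtainable by pruning
# a binary network, but no sign assignment without pruning reproduces it.
# This enumeration is our own illustration; the proposition itself is general.
import itertools
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
grid = [np.array([a, b]) for a in np.linspace(-2, 2, 9) for b in np.linspace(-2, 2, 9)]

def net(Wrows, v, x):
    return sum(vj * relu(wj @ x) for wj, vj in zip(Wrows, v))

def matches_target(Wrows, v):
    return all(abs(net(Wrows, v, x) - relu(x[0])) < 1e-9 for x in grid)

def exists(weight_vals, out_vals):
    rows = list(itertools.product(weight_vals, repeat=2))          # hidden rows w_j
    for Wrows in itertools.product(rows, repeat=2):                # n = 2 hidden neurons
        for v in itertools.product(out_vals, repeat=2):
            if matches_target([np.array(r) for r in Wrows], v):
                return True
    return False

print("representable with pruning  ({-1,0,1}):", exists((-1, 0, 1), (-1, 0, 1)))  # True
print("representable without pruning ({-1,1}):", exists((-1, 1), (-1, 1)))        # False
```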