Hardness of Learning Neural Networks with
Natural Weights
Abstract
Neural networks are nowadays highly successful despite strong hardness results. The existing hardness results focus on the network architecture, and assume that the network’s weights are arbitrary. A natural approach to settle the discrepancy is to assume that the network’s weights are “well-behaved” and possess some generic properties that may allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some “random-like” properties with respect to some “natural” distributions. We prove negative results in this regard, and show that for depth-2 networks, and many “natural” weights distributions such as the normal and the uniform distribution, most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights and every input distribution. It implies that there is no generic property that holds with high probability in such random networks and allows efficient learning.
1 Introduction
Neural networks have revolutionized performance in multiple domains, such as computer vision and natural language processing, and have proven to be a highly effective tool for solving many challenging problems. This impressive practical success of neural networks is not well understood from a theoretical point of view. In particular, despite extensive research in recent years, it is not clear which models are learnable by neural network algorithms.
Historically, there were many negative results for learning neural networks, and it is now known that under certain complexity assumptions, it is computationally hard to learn the class of functions computed by a neural network, even if the architecture is very simple. Indeed, it has been shown that learning neural networks is hard already for networks of depth 2 (Klivans and Sherstov, 2006; Daniely and Shalev-Shwartz, 2016). These results hold already for improper learning, namely, where the learning algorithm is allowed to return a hypothesis that does not belong to the considered hypothesis class.
In recent years, researchers have considered several ways to circumvent the discrepancy between those hardness results and the empirical success of neural networks, namely, to understand which models are still learnable by neural network algorithms. This effort includes proving learnability of linear models, including polynomials and kernel spaces (Andoni et al., 2014; Xie et al., 2016; Daniely et al., 2016; Daniely, 2017; Brutzkus et al., 2017; Jacot et al., 2018; Du et al., 2018; Oymak and Soltanolkotabi, 2018; Allen-Zhu et al., 2018a, b; Cao and Gu, 2019; Zou and Gu, 2019; Song and Yang, 2019; Ge et al., 2019; Oymak and Soltanolkotabi, 2019; Arora et al., 2019; Cao and Gu, 2019; Ji and Telgarsky, 2019; Ma et al., 2019; Lee et al., 2019; Daniely, 2019), making assumptions on the input distribution (Li and Yuan, 2017; Brutzkus and Globerson, 2017; Du et al., 2017a, b; Du and Goel, 2018; Goel et al., 2018; Shamir, 2018), the network’s weights (Arora et al., 2014; Shamir, 2018; Das et al., 2019; Agarwal et al., 2020; Goel and Klivans, 2017), or both (Janzamin et al., 2015; Tian, 2017).
In that respect, one fantastic result that could potentially be proven is that neural networks are efficiently learnable if we assume that the network’s weights are “well-behaved”, namely, that there are some generic properties of the network’s weights that allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some “random-like” properties with respect to some “natural weights distributions” (e.g., where the weights are drawn from a normal distribution). We say that a property of the network’s weights is a natural property with respect to such a natural weights distribution if it holds with high probability. Existing hardness results focus on the network architecture, and assume that the weights are arbitrary. Thus, it is unclear whether there exists a natural property that allows efficient learning.
In this work, we investigate networks with random weights, and networks whose weights possess natural properties. We show that under various natural weights distributions most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights and every input distribution. We show that this implies that learning neural networks is hard already if their weights possess some natural property. Our hardness results are under the common assumption that refuting a random K-SAT formula is hard (the RSAT assumption). We emphasize that our results are valid for any learning algorithm, and not just common neural network algorithms.
We consider networks of depth 2 with a single output neuron, where the weights in the first layer are drawn from some natural distribution, and the weights in the second layer are all 1. We consider multiple natural weights distributions, for example, where the weight vector of each hidden neuron is drawn from a multivariate normal distribution, is distributed uniformly on the sphere, or has each of its components drawn i.i.d. from a normal, uniform or Bernoulli distribution. For each weights distribution, we show that learning such networks with high probability over the choice of the weights is hard. Thus, for such weights distributions, most networks are hard to learn. It implies that there is no generic property that holds with high probability in such random networks and allows efficient learning. Hence, if generic properties that allow efficient learning exist, then they are not natural, namely, they are rare with respect to all the natural weights distributions that we consider.
We also consider random neural networks of depth 2, where the first layer is a convolutional layer with non-overlapping patches whose filter is drawn from some natural distribution, and the weights of the second layer are all 1. We show that learning is hard also for such networks. It implies that there is no generic property that holds with high probability in such random convolutional networks and allows efficient learning.
Related work
Hardness of learning neural networks. Hardness of learning neural networks in the standard (improper and distribution-free) PAC model follows from hardness of learning intersections of halfspaces. Klivans and Sherstov (2006) showed that, assuming the hardness of the shortest vector problem, learning an intersection of n^ε halfspaces for a constant ε > 0 is hard. Daniely and Shalev-Shwartz (2016) showed that, under the RSAT assumption, learning an intersection of ω(log n) halfspaces is hard. These results imply hardness of learning depth-2 networks with n^ε and ω(log n) hidden neurons (respectively). In the agnostic model, learning halfspaces is already hard (Feldman et al., 2006; Daniely, 2016).
Learning random neural networks. Shamir (2018) considers the problem of learning neural networks where the weights are not arbitrary, but exhibit “nice” features such as non-degeneracy or some “random-like” appearance. The architecture of the networks that he considers is similar to ours. He shows that (under the RSAT assumption) no algorithm invariant to linear transformations can efficiently learn such networks if the columns of the weights matrix of the first layer are linearly independent. It implies that linear-invariant algorithms cannot learn such networks when the weights are chosen randomly. We note that this result holds only for linearly-invariant algorithms, which is a strong restriction. Standard gradient descent methods, for example, are not linearly invariant (Shamir (2018) shows that gradient descent can become linearly invariant if it is preceded by a certain preconditioning step). Our results hold for all algorithms.
In Das et al. (2019), it is shown that sufficiently deep random neural networks with the sign activation function are hard to learn in the statistical query (SQ) model. This result was recently extended by Agarwal et al. (2020) to other activation functions, including the ReLU function. While their results hold for deep networks and for SQ algorithms, our results hold for depth-2 networks and for all algorithms.
2 Preliminaries
2.1 Random Constraints Satisfaction Problems
Let be the collection of (signed) -tuples, that is, sequences for and distinct . Each defines a function by .
Let be some predicate. A -constraint with variables is a function of the form for some . An instance to the CSP problem is a -formula, i.e., a collection of -constraints (each is specified by a -tuple). The goal is to find an assignment that maximizes the fraction of satisfied constraints (i.e., constraints with ). We will allow CSP problems where varies with (but is still fixed for every ). For example, we can look at the -SAT problem.
We will consider the problem of distinguishing satisfiable from random formulas (a.k.a. the problem of refuting random formulas). For , we say that a randomized algorithm efficiently solves the problem , if is a polynomial-time algorithm such that:
• If the input is a satisfiable instance with the given numbers of variables and constraints, then the algorithm outputs “satisfiable” with probability at least 3/4, where the probability is over the randomness of the algorithm.
• If the input is a random instance with the given numbers of variables and constraints (to be precise, in a random formula the tuple defining each constraint is chosen uniformly, and independently of the other constraints), then the algorithm outputs “random” with probability at least 3/4, where the probability is over the choice of the instance and the randomness of the algorithm.
2.2 The random K-SAT assumption
Unless we face a dramatic breakthrough in complexity theory, it seems unlikely that hardness of learning can be established on standard complexity assumptions such as P ≠ NP (see Applebaum et al. (2008); Daniely et al. (2014)). Indeed, all currently known lower bounds are based on assumptions from cryptography or average-case hardness. Following Daniely and Shalev-Shwartz (2016), we will rely on an assumption about random K-SAT problems, which we outline below.
Consider a random K-SAT formula on n variables, in which each K-SAT constraint is chosen independently and uniformly from the collection of n-variate K-SAT constraints. A simple probabilistic argument shows that for some constant C (depending only on K), if the number of constraints is at least Cn, then the formula is not satisfiable w.h.p. The problem of refuting random K-SAT formulas (a.k.a. the problem of distinguishing satisfiable from random K-SAT formulas) is the refutation problem above, instantiated with the K-wise OR as the predicate.
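For concreteness, the following sketch samples a random K-SAT formula in the model described above; the clause representation (a list of signed variable indices) and the function names are our own illustration.

```python
import random

def random_ksat(n, m, k, seed=0):
    """Sample a random k-SAT formula over n variables with m clauses.
    A clause is a list of k signed literals: +i (resp. -i) means variable i
    appears positively (resp. negated). The k variables of each clause are
    distinct and chosen uniformly, independently of the other clauses."""
    rng = random.Random(seed)
    formula = []
    for _ in range(m):
        variables = rng.sample(range(1, n + 1), k)
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula

def satisfies(assignment, formula):
    """assignment: dict mapping each variable index to True/False."""
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in formula)

# A random 3-SAT formula with many more clauses than variables is
# unsatisfiable w.h.p., yet certifying this efficiently is believed hard.
formula = random_ksat(n=50, m=500, k=3)
```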
The problem of refuting random K-SAT formulas has been extensively studied during the last 50 years. It is not hard to see that the problem gets easier as the number of constraints gets larger. The currently best known algorithms (Feige and Ofek, 2004; Coja-Oghlan et al., 2004, 2010) can only refute random instances with roughly n^1.5 constraints for K = 3 and roughly n^(⌈K/2⌉) constraints for larger K. In light of that, Feige (2002) made the assumption that for K = 3, refuting random instances with Cn constraints, for every constant C, is hard (and used that to prove hardness of approximation results). Here, we put forward the following assumption.
Assumption 2.1.
Refuting random K-SAT formulas with n^f(K) constraints is hard for some f(K) = ω(1). Namely, for every d > 0 there is K such that refuting random K-SAT formulas with n variables and n^d constraints is hard.
Terminology 2.2.
A computational problem is RSAT-hard if its tractability refutes Assumption 2.1.
In addition to the performance of the best known algorithms, there is plenty of evidence for the above assumption, in the form of hardness-of-approximation results and lower bounds on various algorithms, including resolution, convex hierarchies, sum-of-squares, statistical algorithms, and more. We refer the reader to Daniely and Shalev-Shwartz (2016) for a more complete discussion.
2.3 Learning hypothesis classes and random neural networks
Let be an hypothesis class. We say that a learning algorithm efficiently (PAC) learns if for every target function and every distribution over , given only access to examples where , the algorithm runs in time polynomial in and returns with probability at least (over the internal randomness of ), a predictor such that
For a real matrix of size , let be the function , where is the ReLU function, and is the clipping operation on the interval . This corresponds to depth- networks with hidden neurons, with no bias in the first layer, and where the outputs of the first layer are simply summed and moved through a clipping non-linearity (this operation can also be easily implemented using a second layer composed of two ReLU neurons).
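For concreteness, the following is a minimal sketch of this architecture in code, under our own conventions (the columns of the weight matrix are the weight vectors of the hidden neurons, and the clipping is assumed to be to the interval [0, 1]; the exact interval is not essential to the construction).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(W, x):
    """Depth-2 network: W is a (d, q) matrix whose q columns are the weight
    vectors of the hidden neurons (no bias in the first layer); the hidden
    outputs are summed and clipped to [0, 1] (assumed clipping interval)."""
    hidden = relu(W.T @ x)            # q hidden ReLU units
    return np.clip(np.sum(hidden), 0.0, 1.0)

# A random network in the spirit of Theorem 3.1: i.i.d. Gaussian entries.
rng = np.random.default_rng(0)
d, q = 20, 5
W = rng.normal(size=(d, q))
x = rng.normal(size=d)
print(network(W, x))
```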
Let be a distribution over real matrices of size . We assume that . We say that a learning algorithm efficiently learns a random neural network with respect to (-random network, for short), if it satisfies the following property. For a random matrix drawn according to , and every distribution (that may depend on ) over , given only access to examples where , the algorithm runs in time polynomial in and returns with probability at least over the choice of and the internal randomness of , a predictor such that
Remark 2.1.
Learning an hypothesis class requires that for every target function in the class and every input distribution the learning algorithm succeeds w.h.p., while learning a random neural network requires that for a random target function and every input distribution the learning algorithm succeeds w.h.p. Thus, in the former case an adversary chooses both the target function and the input distribution, and in the latter the target function is chosen randomly and then the adversary chooses the input distribution. Therefore, the requirement from the algorithm for learning random neural networks is weaker than the requirement for learning neural networks in the standard PAC-learning model. We show hardness of learning already under this weaker requirement. Then, we show that hardness of learning random neural networks implies hardness of learning neural networks with “natural” weights under the standard PAC-learning model.
2.4 Notations and terminology
We denote by the uniform distribution over the interval in ; by the symmetric Bernoulli distribution, namely, ; by the normal distribution with mean and variance , and by the multivariate normal distribution with mean and covariance matrix . We say that a distribution over is symmetric if it is continuous and its density p satisfies p(x) = p(−x) for every x, or if it is discrete and assigns equal probability to x and −x for every x. For a matrix we denote by and the minimal and maximal singular values of . For we denote by its norm. We denote the dimensional unit sphere by . For let . We say that an algorithm is efficient if it runs in polynomial time.
3 Results
We show RSAT-hardness for learning -random networks, where corresponds either to a fully-connected layer or to a convolutional layer. It implies hardness of learning depth-2 neural networks whose weights satisfy some natural property. We focus on the case of networks with hidden neurons. We note, however, that our results can be extended to networks with hidden neurons, for any . Moreover, while we consider networks whose weights in the second layer are all 1, our results can be easily extended to networks where the weights in the second layer are arbitrary or random positive numbers.
3.1 Fully-connected neural networks
We start with random fully-connected neural networks. First, we consider a distribution over real matrices, such that the entries are drawn i.i.d. from a symmetric distribution.
We say that a random variable is -subgaussian for some if for all we have
Theorem 3.1.
Let be a symmetric random variable with variance . Assume that the random variable is -subgaussian for some fixed . Let be a small constant, let , and let be a distribution over , such that the entries are i.i.d. copies of . Then, learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
Since the normal distribution, the uniform distribution over an interval, and the symmetric Bernoulli distribution are subgaussian (Rivasplata (2012)), we have the following corollary.
Corollary 3.1.
Let be a small constant, let , and let be a distribution over , such that the entries are drawn i.i.d. from a distribution .
1. If , then learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
2. If or , then learning a -random neural network is RSAT-hard, even if the distribution is over vectors of norm at most in .
In the following theorem, we consider the case where is such that each column is drawn i.i.d. from a multivariate normal distribution.
Theorem 3.2.
Let be a positive definite matrix of size , and let be its minimal eigenvalue. Let be a small constant, let , and let be a distribution over , such that each column is drawn i.i.d. from . Then, learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
We also study the case where the distribution is such that each column is drawn i.i.d. from the uniform distribution on the sphere of radius in .
Theorem 3.3.
Let and let be a distribution over , such that each column is drawn i.i.d. from the uniform distribution over . Then, learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
From the above theorems we have the following corollary, which shows that learning neural networks (in the standard PAC-learning model) is hard already if the weights satisfy some natural property.
Corollary 3.2.
Let , and let be a distribution over from Theorems 3.2, 3.3, or from Corollary 3.1. Let be a property that holds with probability at least for a matrix drawn from . Let be an hypothesis class. Then, learning is RSAT-hard, already if the distribution is over vectors of norm bounded by the appropriate expression from Theorems 3.2, 3.3, or Corollary 3.1.
The corollary follows easily from the following argument: Assume that learns . If a matrix satisfies then for every distribution , given access to examples where , the algorithm returns with probability at least a predictor such that
Now, let . Since satisfies with probability at least , then for every distribution , given access to examples where , the algorithm returns with probability at least a predictor that satisfies the above inequality. Hence learns a -random neural network. Note that this argument holds also if is over vectors of a bounded norm.
3.2 Convolutional neural networks
We now turn to random Convolutional Neural Networks (CNN). Here, the distribution corresponds to a random convolutional layer. Our convolutional layer has a very simple structure with non-overlapping patches. Let be an integer that divides , and let . We denote by the CNN
Note that this CNN has hidden neurons. A random CNN corresponds to a random vector . Let be a distribution over . A -random CNN with hidden neurons is the CNN where is drawn from . Note that every such distribution over CNNs can be expressed by an appropriate distribution over matrices. Our results hold also if we replace the second layer of with a max-pooling layer (instead of summing and clipping).
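For concreteness, a sketch of this convolutional architecture under the same conventions as before (a single filter applied to non-overlapping patches, followed by summing and clipping to [0, 1]; all names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def cnn(w, x):
    """CNN with a single filter w of size d applied to the non-overlapping
    patches of x (whose dimension must be a multiple of d), followed by
    summing the hidden units and clipping to [0, 1] (assumed)."""
    d = len(w)
    assert len(x) % d == 0
    patches = x.reshape(-1, d)            # one row per patch
    hidden = relu(patches @ w)            # one hidden unit per patch
    return np.clip(np.sum(hidden), 0.0, 1.0)

# Example: a filter drawn uniformly from the unit sphere in R^d.
rng = np.random.default_rng(0)
d, num_patches = 8, 4
w = rng.normal(size=d)
w /= np.linalg.norm(w)
x = rng.normal(size=d * num_patches)
print(cnn(w, x))
```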
We start with random CNNs, where each component in the weight vector is drawn i.i.d. from a symmetric distribution over . In the following theorem, the function bounds the concentration of , and is needed in order to bound the support of the distribution .
Theorem 3.4.
Let . Let be a distribution over such that each component is drawn i.i.d. from a symmetric distribution over . Let be a function such that . Then, learning a -random CNN with hidden neurons is RSAT-hard, already if the distribution is over vectors of norm at most in .
If is the uniform distribution then we can choose , if it is the normal distribution then we can choose , and if it is the symmetric Bernoulli distribution then we can choose . Thus, we obtain the following corollary.
Corollary 3.3.
Let be a distribution such that each component is drawn i.i.d. from a distribution over .
1. If , then learning a -random CNN with hidden neurons is RSAT-hard, already if is over vectors of norm at most .
2. If , then learning a -random CNN with hidden neurons is RSAT-hard, already if is over vectors of norm at most .
3. If , then learning a -random CNN with hidden neurons is RSAT-hard, already if is over vectors of norm at most .
We also consider the case where is a CNN such that is drawn from a multivariate normal distribution .
Theorem 3.5.
Let be a positive definite matrix of size , and let be its minimal eigenvalue. Let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most in .
Finally, we study the case where is a CNN such that is drawn from the uniform distribution over the sphere.
Theorem 3.6.
Let be the uniform distribution over the sphere of radius in . Let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most in .
Now, the following corollary follows easily (from the same argument as in Corollary 3.2), and shows that learning CNNs (in the standard PAC-learning model) is hard already if the weights satisfy some natural property.
Corollary 3.4.
Let be a distribution over from Theorems 3.5, 3.6, or from Corollary 3.3. Let . Let be a property that holds with probability at least for a vector drawn from . Let be an hypothesis class. Then, learning is RSAT-hard, already if the distribution is over vectors of norm bounded by the appropriate expression from Theorems 3.5, 3.6, or Corollary 3.3.
3.2.1 Improving the bounds on the support of
By increasing the number of hidden neurons from to we can improve the bounds on the support of . Note that our results so far on learning random CNNs are for CNNs with input dimension where is the size of the patches. We now consider CNNs with input dimension for some integer . Note that such CNNs have hidden neurons.
Assume that there is an efficient algorithm for learning -random CNNs with input dimension , where is a distribution over . Assume that uses samples with at most inputs. We show an algorithm for learning a -random CNN with . Let be a sample, and let where for every vector , the vector is obtained from by padding it with zeros. Thus, . Note that . Also, note that for every we have . Hence, is realizable by the CNN . Now, given , the algorithm runs on and returns an hypothesis .
Therefore, if learning -random CNNs with input dimension is hard already if the distribution is over vectors of norm at most , then learning -random CNNs with input dimension is hard already if the distribution is over vectors of norm at most . Hence we have the following corollaries.
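The padding argument above can be made concrete with the following sketch (function names are ours); padding with zeros changes neither the labels produced by the larger CNN nor the Euclidean norm of the examples.

```python
import numpy as np

def pad_example(x, target_dim):
    """Pad x with zeros up to target_dim. The extra (all-zero) patches
    contribute ReLU(<w, 0>) = 0 to the sum, so the label assigned by the
    larger CNN is unchanged."""
    padded = np.zeros(target_dim)
    padded[:len(x)] = x
    return padded

def pad_sample(sample, target_dim):
    """sample: list of (x, y) pairs; returns the zero-padded sample."""
    return [(pad_example(x, target_dim), y) for x, y in sample]

# Padding also leaves the Euclidean norm unchanged, so the norm bound on the
# input distribution carries over to the larger input dimension.
x = np.random.default_rng(0).normal(size=8)
assert np.isclose(np.linalg.norm(x), np.linalg.norm(pad_example(x, 40)))
```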
Corollary 3.5.
Let be a distribution over such that each component is drawn i.i.d. from a distribution over . Let for some integer , and let .
1. If , then learning a -random CNN (with hidden neurons) is RSAT-hard, already if is over vectors of norm at most .
2. If , then learning a -random CNN (with hidden neurons) is RSAT-hard, already if is over vectors of norm at most .
Corollary 3.6.
Let be a positive definite matrix of size , and let be its minimal eigenvalue. Let for some integer , and let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most .
Corollary 3.7.
Let be the uniform distribution over the sphere of radius in . Let for some integer , and let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most .
As an example, consider a CNN with . Note that since the patch size is , each hidden neuron has input neurons feeding into it. Consider a distribution over such that each component is drawn i.i.d. from a normal distribution with . This distribution corresponds to the standard Xavier initialization. Then, by Corollary 3.5, learning a -random CNN is RSAT-hard, already if is over vectors of norm at most . By choosing an appropriate , we have that learning a -random CNN is RSAT-hard, already if is over vectors of norm at most .
Finally, note that Corollary 3.4 holds also for the values of and the bounds on the support of from Corollaries 3.5, 3.6 and 3.7.
Open Questions
Obvious open questions arising from our work are to obtain sharper norm bounds on the input distribution, and to avoid the non-linearity in the second layer (the clipping operation). Another direction is to extend the results to more weights distributions and network architectures. A potential positive result is to find some useful “unnatural” properties of the network’s weights that allow efficient learning.
4 Main proof ideas
Let where . By Daniely and Shalev-Shwartz (2016), learning is hard. We want to reduce the problem of learning to learning a -random neural network for some fixed . Let and let be a sample. Let be a distribution over the group of invertible matrices, and let . Consider the sample where for every we have . Since , we have . Thus, . Note that is a random matrix. Now, assume that there is an algorithm that learns successfully from . Consider the following algorithm . Given a sample , the algorithm runs on , and returns the hypothesis . It is not hard to show that learns successfully from . Since our goal is to reduce the problem of learning to learning a -random network where is a fixed distribution, we need to be a -random matrix. However, the distribution of depends on both and (which is an unknown matrix).
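The following sketch illustrates this transformation in one concrete instantiation (our own notation, not the exact construction used later): a sample realizable by a weight matrix W is mapped, via an invertible matrix A, to a sample realizable by the transformed weights (A^{-1})^T W, since ⟨(A^{-1})^T w, A x⟩ = ⟨w, x⟩.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(W, x):
    # Depth-2 network: sum of ReLU units, then clipping to [0, 1] (assumed).
    return np.clip(np.sum(relu(W.T @ x)), 0.0, 1.0)

rng = np.random.default_rng(0)
d, q = 6, 3
W = rng.choice([-1.0, 1.0], size=(d, q))   # weights with a "special structure"
A = rng.normal(size=(d, d))                # random matrix, invertible w.h.p.

x = rng.normal(size=d)
x_new = A @ x                              # transformed example
W_new = np.linalg.inv(A).T @ W             # transformed ("random-like") weights

# The label is preserved, so the transformed sample is realizable by W_new.
assert np.isclose(network(W, x), network(W_new, x_new))
```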
Hence, the challenge is to find a reduction that translates a sample that is realizable by to a sample that is realizable by a -random network, without knowing . In order to obtain such a reduction, we proceed in two steps. First, we show that learning neural networks of the form where , is hard already if we restrict to a set of matrices with a special structure. Then, we show a distribution such that if and has the special structure, then . This property, as we showed, enables us to reduce the problem of learning such special-structure networks to the problem of learning -random networks.
In order to obtain a special structure for , consider the class . Note that the CNNs in have hidden neurons. The networks in have three important properties: (1) They are CNNs; (2) Their patches are non-overlapping; (3) The components in the filter are in . Hardness of learning can be shown by a reduction from the RSAT problem. We defer the details of this reduction to the next sections. Let be the matrix of size that corresponds to , namely . Note that for every we have , and for every other .
We now show a distribution such that if and has a structure that corresponds to , then . We start with the case where is a distribution over matrices such that each column is drawn i.i.d. from the uniform distribution on the sphere. We say that a matrix of size is a diagonal-blocks matrix if
where each block is a diagonal matrix . We denote , and . Note that for every , the vector contains all the entries on the diagonals of blocks in the -th column of blocks in . Let be a distribution over diagonal-blocks matrices, such that the vectors are drawn i.i.d. according to the uniform distribution on . Let be a matrix that corresponds to . Note that the columns of are i.i.d. copies from the uniform distribution on . Indeed, denote . Then, for every line index we denote , where are integers and . Thus, is the line index of the block in that corresponds to the -th line in , and is the line index within the block. Now, note that
Since , and since the uniform distribution on a sphere does not change by multiplying a subset of components by , then the -th column of has the same distribution as , namely, the uniform distribution over . Also, the columns of are independent. Thus, .
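A sketch of the diagonal-blocks construction and of the sign-flip argument above, with our own (illustrative) block count, block size and indexing conventions:

```python
import numpy as np

def uniform_on_sphere(dim, radius, rng):
    v = rng.normal(size=dim)
    return radius * v / np.linalg.norm(v)

def diagonal_blocks_matrix(k, d, radius, rng):
    """Build a (k*d) x (k*d) matrix made of k x k blocks, each a d x d
    diagonal matrix: the diagonals of the blocks in the j-th column of
    blocks are the d-sized chunks of a single vector drawn uniformly
    from the sphere of the given radius in R^(k*d)."""
    M = np.zeros((k * d, k * d))
    for j in range(k):
        u = uniform_on_sphere(k * d, radius, rng)
        for i in range(k):
            M[i * d:(i + 1) * d, j * d:(j + 1) * d] = np.diag(u[i * d:(i + 1) * d])
    return M

rng = np.random.default_rng(0)
k, d = 3, 4
M = diagonal_blocks_matrix(k, d, radius=1.0, rng=rng)

# Multiplying M by a vector with +-1 entries supported on a single column of
# blocks only flips signs of coordinates of the underlying sphere vector, so
# the resulting column has the same norm (and the same distribution).
s = rng.choice([-1.0, 1.0], size=d)
col = np.zeros(k * d)
col[:d] = s
print(np.linalg.norm(M @ col))   # equals the sphere radius, here 1.0
```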
The case where is a distribution over matrices such that the entries are drawn i.i.d. from a symmetric distribution (such as , or ) can be shown in a similar way. The result for the case where is such that each column is drawn i.i.d. from a multivariate normal distribution cannot be obtained in the same way, since a might be sensitive to multiplication of its components by . However, recall that a vector in whose components are i.i.d. copies from has a multivariate normal distribution . Now, since every multivariate normal distribution can be obtained from by a linear transformation, we are able to show hardness also for the case where is such that each column is drawn i.i.d. from .
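The linear transformation used here is the standard “coloring” of a standard Gaussian vector: if z has i.i.d. standard normal entries and Σ has spectral decomposition U D U^T, then U D^{1/2} U^T z is a centered multivariate normal vector with covariance Σ. A small numerical sketch of this standard fact, with arbitrary illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# An arbitrary positive definite covariance matrix Sigma.
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)

# Spectral decomposition Sigma = U diag(eigvals) U^T, and its square root.
eigvals, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(eigvals)) @ U.T

# "Color" i.i.d. standard normal columns: each column of Sigma_half @ Z is
# a centered multivariate normal vector with covariance Sigma.
Z = rng.normal(size=(d, 100000))
X = Sigma_half @ Z
print(np.round(np.cov(X), 2))   # approximately equal to Sigma
print(np.round(Sigma, 2))
```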
Recall that in our reduction we translate to where for every we have . Therefore, we need to show that our choice of is such that it is invertible with high probability. Also, since we want to show hardness already if the input distribution is supported on a bounded domain, then we need to bound the norm of , with high probability over the choice of . This task requires a careful analysis of the spectral norm of , namely, of .
The proofs of the results regarding random CNNs follow the same ideas. The main difference is that in this case, instead of multiplying by , we multiply each patch in by an appropriate matrix.
Proof structure
We start with a few definitions. We say that a sample is scattered if are independent fair coins (in particular, they are independent of ). We say that is contained in if for every .
Let be an algorithm whose input is a sample and whose output is either “scattered” or “-realizable”, where is an hypothesis class. We say that distinguishes size- -realizable samples from scattered samples if the following holds.
• If the sample is scattered, then the algorithm outputs “scattered” with probability at least 3/4, where the probability is over the choice of the labels and the randomness of the algorithm.
• If the sample is realizable, namely, some hypothesis in the class labels every example in the sample correctly, then the algorithm outputs “-realizable” with probability at least 3/4, where the probability is over the randomness of the algorithm.
We denote by the problem of distinguishing size- -realizable samples that are contained in from scattered samples.
Now, let be an algorithm whose input is a sample and whose output is either “scattered” or “-realizable”. We say that distinguishes size- -realizable samples from scattered samples if the following holds.
• If the sample is scattered, then the algorithm outputs “scattered” with probability at least 3/4, where the probability is over the choice of the labels and the randomness of the algorithm.
• If the sample is realizable by the random network, namely, the network whose weights are a random matrix drawn from labels every example in the sample correctly, then the algorithm outputs “-realizable” with probability at least 3/4, where the probability is over the choice of the random weights and the randomness of the algorithm.
We denote by the problem of distinguishing size- -realizable samples that are contained in from scattered samples. In the case of random CNNs, we denote by the problem of distinguishing size- scattered samples that are contained in , from samples that are realizable by a random CNN where .
Recall that . As we described, hardness of learning -random neural networks where the distribution is supported on a bounded domain, can be shown by first showing hardness of learning with some , where the distribution is supported on some , and then reducing this problem to learning -random networks where the distribution is supported on some . We can show RSAT-hardness of learning by using the methodology of Daniely and Shalev-Shwartz (2016) as follows: First, show that if there is an efficient algorithm that learns where the distribution is supported on , then there is a fixed and an efficient algorithm that solves , and then show that for every fixed , the problem is RSAT-hard.
Our proof follows a slightly different path than the one described above. First, we show that if there is an efficient algorithm that learns -random neural networks where the distribution is supported on , then there is a fixed and an efficient algorithm that solves . Then, we show that for every fixed , the problem with some , is RSAT-hard. Finally, for the required matrix distributions and sets , we show a reduction from to . The main difference between this proof structure and the one described in the previous paragraph, is that here, for every distribution we need to show a reduction from to (which are decision problems), and in the previous proof structure for every we need to show a reduction between learning and learning -random networks. We chose this proof structure since here the proof for each is a reduction between decision problems, and thus we found the proofs to be simpler and cleaner this way. Other than this technical difference, both proof structures are essentially similar and follow the same ideas.
The case of random CNNs is similar, except that here we show, for each distribution over vectors, a reduction from to .
5 Proofs
5.1 Learning -random networks is harder than
Theorem 5.1.
Let be a distribution over matrices. Assume that there is an algorithm that learns -random neural networks, where the distribution is supported on . Then, there is a fixed and an efficient algorithm that solves .
Proof.
Let be an efficient learning algorithm that learns -random neural networks where the distribution is supported on . Let be such that uses a sample of size at most . Let . Let be a sample that is contained in . We will show an efficient algorithm that distinguishes whether is scattered or -realizable. This implies that the theorem holds for such that .
Given , the algorithm learns a function by running with an examples oracle that generates examples by choosing a random (uniformly distributed) example . We denote . Now, if , then returns that is -realizable, and otherwise it returns that it is scattered. Clearly, the algorithm runs in polynomial time. We now show that if is -realizable then recognizes it with probability at least , and that if is scattered then it also recognizes it with probability at least .
Assume first that is -realizable. Let be the uniform distribution over from . In this case, since is supported on , we are guaranteed that with probability at least over the choice of and the internal randomness of , we have . Therefore, the algorithm returns “-realizable”.
Now, assume that is scattered. Let be the function returned by . Let be the following function. For every , if then , and otherwise . Note that for every , if then . Therefore, . Let be the set of indices of that were not observed by . Note that given , the events are independent of one another, and each has probability . By the Hoeffding bound, we have that for at most fraction of the indices in with probability at most
Thus, for at least fraction of the indices in with probability at least . Hence,
Therefore, for large enough , with probability at least we have , and thus the algorithm returns “scattered”. ∎
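A schematic version of the distinguisher used in this proof (the acceptance threshold, the {0, 1} label convention and the rounding of the learner’s real-valued output at 1/2 are placeholders; only the structure matters: train on uniformly drawn examples from the given sample, then compare the empirical error on the sample to a threshold):

```python
import numpy as np

def distinguish(sample, learner, threshold=0.25, rng=None):
    """sample: list of (x, y) pairs with y in {0, 1}; learner: a function
    that, given an example oracle, returns a predictor h.
    Returns "realizable" if the empirical error of h on the sample is below
    the threshold, and "scattered" otherwise (the threshold is illustrative)."""
    if rng is None:
        rng = np.random.default_rng(0)

    def oracle():
        # Examples are drawn uniformly at random from the given sample.
        return sample[rng.integers(len(sample))]

    h = learner(oracle)
    # Round the (possibly real-valued) predictions to labels, as in the proof.
    err = np.mean([float((1.0 if h(x) >= 0.5 else 0.0) != y) for x, y in sample])
    return "realizable" if err < threshold else "scattered"
```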
5.2 is RSAT-hard
For a predicate we denote by the problem whose instances are collections, , of constraints, each of which is either a constraint or a constraint, and the goal is to maximize the number of satisfied constraints. Denote by the problem of distinguishing satisfiable from random formulas with variables and constraints (as before, in order to succeed, an algorithm must return “satisfiable” w.p. at least on every satisfiable formula and “random” w.p. at least on random formulas). Here, in a random formula, each constraint is chosen w.p. 1/2 to be a uniform constraint and w.p. 1/2 a uniform constraint.
We will consider the predicate defined by
We will need the following lemma from Daniely and Shalev-Shwartz (2016). For an overview of its proof, see Appendix A.
Lemma 5.1.
Daniely and Shalev-Shwartz (2016) Let with , and let and be fixed integers. The problem can be efficiently reduced to the problem .
In the following lemma, we use Lemma 5.1 in order to show RSAT-hardness of with some appropriate and .
Lemma 5.2.
Let , and let be a fixed integer. The problem , where is the ball of radius in , is RSAT-hard.
Proof.
By Assumption 2.1, there is such that is hard, where the -SAT formula is over variables. Then, by Lemma 5.1, the problem is also hard. We will reduce to . Since , it would imply that is RSAT-hard.
Let be an input for . Namely, each constraint is either a CNF or a DNF formula. Equivalently, can be written as where for every , if is a DNF formula then and , and if is a CNF formula then is the DNF obtained by negating , and . Given as above, we encode each DNF formula (with clauses) as a vector such that each clause in (a signed -tuple) is encoded by a vector as follows. First, we have . Then, for every we have , and for every variable that does not appear in the clause we have . Thus, for every , the value of indicates whether the -th variable appears in the clause as a positive literal, a negative literal, or does not appear. The encoding of is the concatenation of the encodings of its clauses.
Let . If is random then is scattered, since each constraint is with probability 1/2 a DNF formula, and with probability 1/2 a CNF formula, and this choice is independent of the choice of the literals in . Assume now that is satisfiable by an assignment . Let . Note that is realizable by the CNN with hidden neurons. Indeed, if is the encoding of a clause of , then if all the literals in the clause are satisfied by , and otherwise . Therefore, .
Note that by our construction, for every we have for large enough
∎
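The precise clause encoding is only partially visible in the text above; the following is one natural reconstruction (the leading coordinate value −(k−1) and the placement of the assignment in the filter are our assumptions) with the property used in the proof: the ReLU of the inner product between the filter and the encoding of a clause is 1 if all the literals of the clause are satisfied by the assignment, and 0 otherwise.

```python
import numpy as np

def encode_clause(clause, n, k):
    """Encode a width-k DNF term (a list of signed 1-based variable indices)
    as a vector of length n + 1: the leading coordinate is -(k - 1), and
    coordinate i is +1 / -1 if variable i appears positively / negatively,
    and 0 if it does not appear (these conventions are our reconstruction)."""
    enc = np.zeros(n + 1)
    enc[0] = -(k - 1)
    for lit in clause:
        enc[abs(lit)] = 1.0 if lit > 0 else -1.0
    return enc

def filter_from_assignment(psi):
    """psi: a +-1 assignment of length n; the filter is (1, psi)."""
    return np.concatenate(([1.0], psi))

n, k = 6, 3
psi = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
clause = [1, -2, 3]                     # the DNF term: x1 AND (NOT x2) AND x3
w = filter_from_assignment(psi)
# <w, enc> = 1 if all k literals are satisfied by psi, and <= -1 otherwise,
# so ReLU(<w, enc>) is the indicator that the term is satisfied.
print(max(0.0, float(w @ encode_clause(clause, n, k))))   # 1.0
# A DNF with several terms is encoded by concatenating the term encodings
# into non-overlapping patches; the CNN sums these indicators and clips to
# [0, 1], so it outputs 1 exactly when the DNF is satisfied by psi.
```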
5.3 Hardness of learning random fully-connected neural networks
Let . We say that a matrix of size is a diagonal-blocks matrix if
where each block is a diagonal matrix . For every let . Let be the submatrix of obtained by selecting the rows and columns in . Thus, is a matrix of size . For let be the restriction of to the coordinates .
Lemma 5.3.
Let be a diagonal-blocks matrix. Then,
Proof.
For every with we have
Hence, . ∎
5.3.1 Proof of Theorem 3.1
Let be a diagonal-blocks matrix, where each block is a diagonal matrix . Assume that for all the entries are i.i.d. copies of a random variable that has a symmetric distribution with variance . Also, assume that the random variable is -subgaussian for some fixed .
Lemma 5.4.
Proof.
Let . By Lemma 5.3, we have
(1)
Note that for every , all entries of the matrix are i.i.d. copies of .
Now, we need the following theorem:
Theorem 5.2.
Rudelson and Vershynin (2008) Let be a real random variable with expectation and variance , and assume that is -subgaussian for some . Let be an matrix whose entries are i.i.d. copies of . Then, for every we have
where and depend only on .
Lemma 5.5.
Let be a distribution over such that each entry is drawn i.i.d. from . Note that a -random network has hidden neurons. Let be a fixed integer. Then, is RSAT-hard, where is the ball of radius in .
Proof.
By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We will reduce this problem to . Given a sample with for every , we will, with probability , construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable. Note that our reduction is allowed to fail with probability . Indeed, distinguishing scattered from realizable requires success with probability and therefore reductions between such problems are not sensitive to a failure with probability .
Assuming that is invertible (note that by Lemma 5.4 it holds with probability ), let where for every we have . Note that if is scattered then is also scattered.
Assume that is realizable by the CNN with . Let be the matrix of size such that . Thus, where for every we have , and for every other . Let . Note that is realizable by . Indeed, for every we have , and . Also, note that the entries of are i.i.d. copies of . Indeed, denote . Then, for every line we denote , where are integers and . Thus, is the line index of the block in that corresponds to the -th line in , and is the line index within the block. Now, note that
Since is symmetric and , we have independently from the other entries. Thus, . Therefore, is a -random network.
Lemma 5.6.
Let be a distribution over with , such that each entry is drawn i.i.d. from . Let be a fixed integer, and let be a small constant. Then, is RSAT-hard, where is the ball of radius in .
Proof.
For integers we denote by the distribution over such that each entry is drawn i.i.d. from . Let , and let . By Lemma 5.5, the problem is RSAT-hard, where , and is the ball of radius in . We reduce this problem to , where is the ball of radius in . Note that .
Let with . For every , let be the vector obtained from by padding it with zeros. Thus, . Note that . Let . If is scattered then is also scattered. Note that if is realizable by then is realizable by where is obtained from by appending arbitrary lines. Assume that is -realizable, that is, . Then, is realizable by where is obtained from by appending lines such that each component is drawn i.i.d. from , and therefore, is -realizable. Finally, for every we have
∎
5.3.2 Proof of Theorem 3.2
Let be a distribution over with , such that each entry is drawn i.i.d. from . Let be a fixed integer. By Lemma 5.6, we have that is RSAT-hard, where is the ball of radius in . Let be the distribution over where each component is drawn i.i.d. from . Recall that (Tong (2012)). Therefore, in the distribution , the columns are drawn i.i.d. from . Let be a distribution over , such that each column is drawn i.i.d. from . By Theorem 5.1, we need to show that is RSAT-hard, where is the ball of radius in . We show a reduction from to .
Let be a sample. Let be the spectral decomposition of , and let . Recall that if then (Tong (2012)). For every , let , and let . Note that if is scattered then is also scattered. If is realizable by a -random network , then let . Note that is realizable by . Indeed, for every we have . Let and let . Since then for every . Now, since , we have for every that (i.i.d.). Therefore, , and thus . Hence, is -realizable.
We now bound the norms of the vectors in . Note that for every we have
5.3.3 Proof of Theorem 3.3
Let , and let be a diagonal-blocks matrix, where each block is a diagonal matrix . We denote , and . Note that for every , the vector contains all the entries on the diagonals of blocks in the -th column of blocks in . Assume that the vectors are drawn i.i.d. according to the uniform distribution on .
Lemma 5.7.
For some universal constant we have
Proof.
Let . For every , let be the vector that contains all the entries on the diagonals of blocks in the -th column of blocks in . That is, . Note that the vectors are i.i.d. copies from the uniform distribution on . By Lemma 5.3, we have
(2)
Note that for every , all columns of the matrix are projections of the vectors on the coordinates. That is, the -th column in is obtained by drawing from the uniform distribution on and projecting on the coordinates .
We say that a distribution is isotropic if it has mean zero and its covariance matrix is the identity. The covariance matrix of the uniform distribution on is . Therefore, the uniform distribution on is isotropic. We will need the following theorem.
Theorem 5.3.
Adamczak et al. (2012) Let and let be an matrix with independent columns drawn from an isotropic log-concave distribution. For every we have
where c and C are positive universal constants.
We show that the distribution of the columns of is isotropic and log-concave. First, since the uniform distribution on is isotropic, then its projection on a subset of coordinates is also isotropic, and thus the distribution of the columns of is isotropic. In order to show that it is log-concave, we analyze its density. Let be a random variable whose distribution is the projection of a uniform distribution on on coordinates. It is known that the probability density of is (see Fang (2018))
where . Recall that the columns of are projections of the uniform distribution over , namely, the sphere of radius and not the unit sphere. Thus, let . The probability density of is
where . We denote
By replacing with we have
Hence, we have
Since , we need to show that the function
(3)
(where ) is concave. This function can be written as , where
Recall that if is concave and non-decreasing, and is concave, then their composition is also concave. Clearly, and satisfy these conditions, and thus the function in Eq. 3 is concave. Hence is log-concave.
Let be a distribution over such that each column is drawn i.i.d. from the uniform distribution on . Note that a -random network has hidden neurons. Now, Theorem 3.3 follows immediately from Theorem 5.1 and the following lemma.
Lemma 5.8.
Let be a fixed integer. Then, is RSAT-hard, where is a ball of radius in .
Proof.
By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We will reduce this problem to . Given a sample with for every , we will, with probability , construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable. Note that our reduction is allowed to fail with probability . Indeed, distinguishing scattered from realizable requires success with probability and therefore reductions between such problems are not sensitive to a failure with probability .
Assuming that is invertible (by Lemma 5.7 it holds with probability ), let where for every we have . Note that if is scattered then is also scattered.
Assume that is realizable by the CNN with . Let be the matrix of size such that . Thus, where for every we have , and for every other . Let . Note that is realizable by . Indeed, for every we have , and . Also, note that the columns of are i.i.d. copies from the uniform distribution on . Indeed, denote . Then, for every line index we denote , where are integers and . Thus, is the line index of the block in that corresponds to the -th line in , and is the line index within the block. Now, note that
Since , and since the uniform distribution on a sphere does not change by multiplying a subset of components by , then the -th column of has the same distribution as , namely, the uniform distribution over . Also, the columns of are independent. Thus, , and therefore is a -random network.
5.4 Hardness of learning random convolutional neural networks
5.4.1 Proof of Theorem 3.4
Lemma 5.9.
Let be a fixed integer. Then, is RSAT-hard, where is the ball of radius in .
Proof.
By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We will reduce this problem to . Given a sample with for every , we will, with probability , construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable. Note that our reduction is allowed to fail with probability . Indeed, distinguishing scattered from realizable requires success with probability and therefore reductions between such problems are not sensitive to a failure with probability .
Let where each is drawn i.i.d. from . Let be a diagonal matrix. Note that is invertible with probability , since for every we have . Now, for every from , denote where for every we have . Let , and let . Note that if is scattered then is also scattered. If is realizable by a CNN , then let . Note that is realizable by . Indeed, for every and we have . Also, note that since and is symmetric, then has the distribution , and thus is a -random CNN.
The probability that has some component with , is at most . Therefore, with probability we have for every that
Thus, . ∎
5.4.2 Proof of Theorem 3.5
Assume that the covariance matrix is of size , and let . Note that a -random CNN has hidden neurons. Let be a distribution over such that each component is drawn i.i.d. from . Let be a fixed integer. By Lemma 5.9 and by choosing , we have that is RSAT-hard, where is the ball of radius in . Note that (Tong (2012)). By Theorem 5.1, we need to show that is RSAT-hard, where is the ball of radius in . We show a reduction from to .
Let be a sample. For every from , denote where for every we have . Let be the spectral decomposition of , and let . Recall that if then (Tong (2012)). Let , and let . Note that if is scattered then is also scattered. If is realizable by a -random CNN , then let . Note that is realizable by . Indeed, for every and we have . Since , the sample is -realizable.
We now bound the norms of in . Note that for every we have
Hence, .
5.4.3 Proof of Theorem 3.6
Let . Let be the uniform distribution on . Note that a -random CNN has hidden neurons. Let be a fixed integer. By Theorem 5.1, we need to show that is RSAT-hard, where is the ball of radius in . By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We reduce this problem to . Given a sample with for every , we construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable.
Let be a random orthogonal matrix of size . For every denote where for every we have . For every let , and let . Note that if is scattered then is also scattered. If is realizable by a CNN , then let . Note that is realizable by . Indeed, for every and we have
Also, note that since and is orthogonal, is a random vector on the sphere of radius in , and thus is a -random CNN.
Since is orthogonal then for every we have
Hence .
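One standard way to sample the random orthogonal matrix used in this proof is via the QR decomposition of a Gaussian matrix (with a sign correction on the columns to obtain the Haar distribution); orthogonality ensures that norms, and hence the uniform distribution on the sphere, are preserved. A small sketch:

```python
import numpy as np

def random_orthogonal(d, rng):
    """Sample a Haar-random d x d orthogonal matrix via the QR decomposition
    of a Gaussian matrix, with a sign correction on the columns."""
    G = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
d = 5
U = random_orthogonal(d, rng)

v = rng.normal(size=d)
v = 2.0 * v / np.linalg.norm(v)                 # uniform on the sphere of radius 2
assert np.allclose(U.T @ U, np.eye(d))          # U is orthogonal
assert np.isclose(np.linalg.norm(U @ v), 2.0)   # norms (and the sphere) are preserved
```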
References
- Adamczak et al. [2012] R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Condition number of a square matrix with iid columns drawn from a convex body. Proceedings of the American Mathematical Society, 140(3):987–998, 2012.
- Agarwal et al. [2020] N. Agarwal, P. Awasthi, and S. Kale. A deep conditioning treatment of neural networks. arXiv preprint arXiv:2002.01523, 2020.
- Allen-Zhu et al. [2018a] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
- Allen-Zhu et al. [2018b] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.
- Andoni et al. [2014] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1908–1916, 2014.
- Applebaum et al. [2008] B. Applebaum, B. Barak, and D. Xiao. On basing lower-bounds for learning on worst-case assumptions. In Foundations of Computer Science, 2008. FOCS’08. IEEE 49th Annual IEEE Symposium on, pages 211–220. IEEE, 2008.
- Arora et al. [2014] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
- Arora et al. [2019] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Brutzkus and Globerson [2017] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 605–614. JMLR. org, 2017.
- Brutzkus et al. [2017] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
- Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
- Coja-Oghlan et al. [2004] A. Coja-Oghlan, A. Goerdt, and A. Lanka. Strong refutation heuristics for random k-sat. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 310–321. Springer, 2004.
- Coja-Oghlan et al. [2010] A. Coja-Oghlan, C. Cooper, and A. Frieze. An efficient sparse regularity concept. SIAM Journal on Discrete Mathematics, 23(4):2000–2034, 2010.
- Daniely [2016] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105–117. ACM, 2016.
- Daniely [2017] A. Daniely. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
- Daniely [2019] A. Daniely. Neural networks learning and memorization with (almost) no over-parameterization. arXiv preprint arXiv:1911.09873, 2019.
- Daniely and Shalev-Shwartz [2016] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning dnf’s. In Conference on Learning Theory, pages 815–830, 2016.
- Daniely et al. [2014] A. Daniely, N. Linial, and S. Shalev-Shwartz. From average case complexity to improper learning complexity. In STOC, 2014.
- Daniely et al. [2016] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.
- Das et al. [2019] A. Das, S. Gollapudi, R. Kumar, and R. Panigrahy. On the learnability of deep random networks. arXiv preprint arXiv:1904.03866, 2019.
- Du and Goel [2018] S. S. Du and S. Goel. Improved learning of one-hidden-layer convolutional neural networks with overlaps. arXiv preprint arXiv:1805.07798, 2018.
- Du et al. [2017a] S. S. Du, J. D. Lee, and Y. Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.
- Du et al. [2017b] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.
- Du et al. [2018] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Fang [2018] K. W. Fang. Symmetric multivariate and related distributions. Chapman and Hall/CRC, 2018.
- Feige [2002] U. Feige. Relations between average case complexity and approximation complexity. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 534–543. ACM, 2002.
- Feige and Ofek [2004] U. Feige and E. Ofek. Easily refutable subformulas of large random 3cnf formulas. In Automata, languages and programming, pages 519–530. Springer, 2004.
- Feldman et al. [2006] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.
- Ge et al. [2019] R. Ge, R. Wang, and H. Zhao. Mildly overparametrized neural nets can memorize training data efficiently. arXiv preprint arXiv:1909.11837, 2019.
- Goel and Klivans [2017] S. Goel and A. Klivans. Learning neural networks with two nonlinear layers in polynomial time. arXiv preprint arXiv:1709.06010, 2017.
- Goel et al. [2018] S. Goel, A. Klivans, and R. Meka. Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547, 2018.
- Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- Janzamin et al. [2015] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- Ji and Telgarsky [2019] Z. Ji and M. Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- Klivans and Sherstov [2006] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.
- Lee et al. [2019] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- Li and Yuan [2017] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- Ma et al. [2019] C. Ma, L. Wu, et al. A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics. arXiv preprint arXiv:1904.04326, 2019.
- Oymak and Soltanolkotabi [2018] S. Oymak and M. Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.
- Oymak and Soltanolkotabi [2019] S. Oymak and M. Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv:1902.04674 [cs, math, stat], Feb. 2019. URL http://arxiv.org/abs/1902.04674. arXiv: 1902.04674.
- Rivasplata [2012] O. Rivasplata. Subgaussian random variables: An expository note. Internet publication, PDF, 2012.
- Rudelson and Vershynin [2008] M. Rudelson and R. Vershynin. The littlewood–offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.
- Shamir [2018] O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.
- Song and Yang [2019] Z. Song and X. Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
- Tian [2017] Y. Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3404–3413. JMLR. org, 2017.
- Tong [2012] Y. L. Tong. The multivariate normal distribution. Springer Science & Business Media, 2012.
- Xie et al. [2016] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
- Zou and Gu [2019] D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.
Appendix A From to (Daniely and Shalev-Shwartz [2016])
We outline the main ideas of the reduction.
First, we reduce to . This is done as follows. Given an instance to , by a simple greedy procedure, we try to find disjoint subsets , such that for every , the subset consists of constraints and each variable appears in at most one of the constraints in . Now, from every we construct a -constraint that is the conjunction of all constraints in . If is random, this procedure will succeed w.h.p. and will produce a random -formula. If is satisfiable, this procedure will either fail or produce a satisfiable -formula.
Now, we reduce to . This is done by replacing each constraint, with probability , with a random constraint. Clearly, if the original instance is a random instance of , then the produced instance is a random instance of . Furthermore, if the original instance is satisfied by the assignment , the same , w.h.p., will satisfy all the new constraints. The reason is that the probability that a random -constraint is satisfied by is , and hence, the probability that all new constraints are satisfied by is at least . Now, since , the last probability is .
For the full proof see Daniely and Shalev-Shwartz [2016].