Computational Complexity of Learning Neural Networks: Smoothness and Degeneracy
Abstract
Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network's weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-2 networks can be efficiently learned under the assumptions that the input distribution is Gaussian, and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-3 ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where random noise is added to the network's parameters. It implies that learning depth-3 ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-2 networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators.
1 Introduction
The computational complexity of learning neural networks has been extensively studied in recent years, and there has been much effort in obtaining both upper bounds and hardness results. Nevertheless, it is still unclear when neural networks can be learned in polynomial time, namely, under what assumptions provably efficient algorithms exist.
Existing results imply hardness already for learning depth-2 ReLU networks in the standard PAC learning framework (e.g., Klivans and Sherstov (2006); Daniely and Shalev-Shwartz (2016)). Thus, without any assumptions on the input distribution or the network's weights, efficient learning algorithms might not be achievable. Even when assuming that the input distribution is Gaussian, strong hardness results were obtained for depth-3 ReLU networks (Daniely and Vardi, 2021; Chen et al., 2022a), suggesting that assumptions merely on the input distribution might not suffice. Also, a hardness result by Daniely and Vardi (2020) shows that strong assumptions merely on the network's weights (without restricting the input distribution) might not suffice even for efficiently learning depth-2 networks. The aforementioned hardness results hold already for improper learning, namely, where the learning algorithm is allowed to return a hypothesis that does not belong to the considered hypothesis class. Thus, a combination of assumptions on the input distribution and the network's weights seems to be necessary for obtaining efficient algorithms.
Several polynomial-time algorithms for learning depth-2 neural networks have been obtained, under assumptions on the input distribution and the network's weights (Awasthi et al., 2021; Janzamin et al., 2015; Zhong et al., 2017; Ge et al., 2017; Bakshi et al., 2019; Ge et al., 2018). In these works, it is assumed that the weight matrices are non-degenerate; that is, they either assume that the condition number of the weight matrix is bounded, or make some similar non-degeneracy assumption. Specifically, Awasthi et al. (2021) gave an efficient algorithm for learning depth-2 (one-hidden-layer) ReLU networks, which may include bias terms in the hidden neurons, under the assumption that the input distribution is Gaussian and the weight matrix is non-degenerate. The non-degeneracy assumption holds w.h.p. if we add a small random noise to any weight matrix, and hence their result implies efficient learning of depth-2 ReLU networks under the Gaussian distribution in the smoothed-analysis framework.
The positive results on depth-2 networks suggest the following question:
Is there an efficient algorithm for learning ReLU networks of depth larger than 2 under the Gaussian distribution, where the weight matrices are non-degenerate, or in the smoothed-analysis framework where the network's parameters are smoothed?
In this work, we give a negative answer to this question, already for depth-3 networks (we note that in our results the neural networks have the ReLU activation also in the output neuron). We show that learning depth-3 ReLU networks under the Gaussian distribution is hard even in the smoothed-analysis framework, where random noise is added to the network's parameters. As a corollary, we show that learning depth-3 ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Our hardness results are under a well-studied cryptographic assumption on the existence of local pseudorandom generators (PRG) with polynomial stretch.
Motivated by the existing positive results on smoothed analysis for depth-2 networks, we also study whether learning depth-2 networks with smoothed parameters can be done under weak assumptions on the input distribution. Specifically, we consider the following question:
Is there an efficient algorithm for learning depth-2 ReLU networks in the smoothed-analysis framework, where both the network's parameters and the input distribution are smoothed?
We give a negative answer to this question, by showing hardness of learning depth-2 ReLU networks where random noise is added to the network's parameters, and the input distribution on $\mathbb{R}^n$ is obtained by smoothing an i.i.d. Bernoulli distribution on $\{0,1\}^n$. This hardness result is also under the assumption on the existence of local PRGs.
Related work
Hardness of learning neural networks.
Hardness of improperly learning depth-2 neural networks follows from hardness of improperly learning DNFs or intersections of halfspaces, since these classes can be expressed by depth-2 networks. Klivans and Sherstov (2006) showed, assuming the hardness of the shortest vector problem, that learning intersections of halfspaces is hard. Hardness of learning DNF formulas is implied by Applebaum et al. (2010) under a combination of two assumptions: the first is related to the planted dense subgraph problem in hypergraphs, and the second is related to local PRGs. Daniely and Shalev-Shwartz (2016) showed hardness of learning DNFs under a common assumption, namely, that refuting a random $K$-SAT formula is hard. All of the above results are distribution-free, namely, they do not imply hardness of learning neural networks under some specific distribution.
Applebaum and Raykov (2016) showed, under an assumption on a specific candidate for Goldreich's PRG (i.e., based on a specific predicate), that learning constant-depth Boolean circuits under the uniform distribution on the hypercube is hard. Daniely and Vardi (2021) proved distribution-specific hardness of learning Boolean circuits of depth-2 (namely, DNFs) and depth-3, under the assumption on the existence of local PRGs that we also use in this work. For DNF formulas, they showed hardness of learning under a distribution where each component is drawn i.i.d. from a Bernoulli distribution (which is not uniform). For depth-3 Boolean circuits, they showed hardness of learning under the uniform distribution on the hypercube. Since Boolean circuits can be expressed by ReLU networks of the same depth, these results readily translate to distribution-specific hardness of learning neural networks. Chen et al. (2022a) showed hardness of learning depth-2 neural networks under the uniform distribution on the hypercube, based on an assumption on the hardness of the Learning with Rounding (LWR) problem. Note that the input distributions in the above results are supported on the hypercube, and they do not immediately imply hardness of learning neural networks under continuous distributions.
When considering the computational complexity of learning neural networks, perhaps the most natural choice of an input distribution is the standard Gaussian distribution. Daniely and Vardi (2021) established hardness of learning depth-3 networks under this distribution, based on the assumption on the existence of local PRGs. Chen et al. (2022a) also showed hardness of learning depth-3 networks under the Gaussian distribution, but their result holds already for networks that do not have an activation function in the output neuron, and it is based on the LWR assumption. They also showed hardness of learning constant-depth ReLU networks from label queries (i.e., where the learner has the ability to query the value of the target network at any desired input) under the Gaussian distribution, based either on the decisional Diffie-Hellman or the Learning with Errors assumptions.
The above results suggest that assumptions on the input distribution might not suffice for achieving an efficient algorithm for learning depth-3 neural networks. A natural question is whether assumptions on the network weights may suffice. Daniely and Vardi (2020) showed (under the assumption that refuting a random $K$-SAT formula is hard) that distribution-free learning of depth-2 neural networks is hard already if the weights are drawn from some "natural" distribution or satisfy some "natural" properties. Thus, if we do not impose any assumptions on the input distribution, then even very strong assumptions on the network's weights do not suffice for efficient learning.
Several works in recent years have shown hardness of distribution-specific learning of shallow neural networks using gradient methods or statistical query (SQ) algorithms (Shamir, 2018; Song et al., 2017; Vempala and Wilmes, 2019; Goel et al., 2020a; Diakonikolas et al., 2020b; Chen et al., 2022a). Specifically, Goel et al. (2020a) and Diakonikolas et al. (2020b) gave superpolynomial correlational SQ (CSQ) lower bounds for learning depth-2 networks under the Gaussian distribution. Since there is a separation between CSQ and SQ, these results do not imply hardness of general SQ learning. Chen et al. (2022a) showed general SQ lower bounds for depth-3 networks under the Gaussian distribution. It is worth noting that while the SQ framework captures some variants of the gradient-descent algorithm, it does not capture, for example, stochastic gradient descent (SGD), which examines training points individually (see a discussion in Goel et al. (2020a)).
We emphasize that none of the above distribution-specific hardness results for neural networks (either for improper learning or SQ learning) holds in the smoothed analysis framework or for non-degenerate weights.
While hardness of proper learning is implied by hardness of improper learning, there are also works that show hardness of properly learning depth-2 networks under more standard assumptions (cf. Goel et al. (2020c)). Finally, distribution-specific hardness of learning a single ReLU neuron in the agnostic setting was shown in Goel et al. (2019, 2020b); Diakonikolas et al. (2020a).
Learning neural networks in polynomial time.
Awasthi et al. (2021) gave a polynomial-time algorithm for learning depth-2 (one-hidden-layer) ReLU networks, under the assumption that the input distribution is Gaussian, and the weight matrix of the target network is non-degenerate. Their algorithm is based on tensor decomposition, and it can handle bias terms in the hidden layer. Their result also implies that depth-2 ReLU networks with Gaussian inputs can be learned efficiently in the smoothed-analysis framework. Our work implies that such a result might not be possible for depth-3 networks (with activation in the output neuron). Prior to Awasthi et al. (2021), several works gave polynomial-time algorithms for learning depth-2 neural networks where the input distribution is Gaussian and the weight matrix is non-degenerate (Janzamin et al., 2015; Zhong et al., 2017; Ge et al., 2017, 2018; Bakshi et al., 2019), but these works either do not handle the presence of bias terms or do not handle the ReLU activation. Some of the aforementioned works consider networks with multiple outputs, and allow certain non-Gaussian input distributions.
Provable guarantees for learning neural networks in super-polynomial time were given in Diakonikolas et al. (2020b); Chen et al. (2022b); Diakonikolas and Kane (2020); Goel and Klivans (2019); Vempala and Wilmes (2019); Zhang et al. (2016); Goel et al. (2017). Distribution-free learning of a single ReLU neuron (in the realizable setting) can be done in polynomial time (Kalai and Sastry, 2009; Kakade et al., 2011).
2 Preliminaries
2.1 Notation
We use bold-face letters to denote vectors, e.g., $\mathbf{x} = (x_1, \ldots, x_d)$. For a vector $\mathbf{x}$ and a sequence of indices $S = (i_1, \ldots, i_k)$, we let $\mathbf{x}_S = (x_{i_1}, \ldots, x_{i_k})$, i.e., the restriction of $\mathbf{x}$ to the indices $S$. We denote by $\mathbb{1}[\cdot]$ the indicator function; for example, $\mathbb{1}[t \geq 5]$ equals $1$ if $t \geq 5$ and $0$ otherwise. For an integer $d \geq 1$ we denote $[d] = \{1, \ldots, d\}$. We denote by $\mathcal{N}(\mu, \sigma^2)$ the normal distribution with mean $\mu$ and variance $\sigma^2$, and by $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ the multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$. The identity matrix of size $d$ is denoted by $I_d$. For $\mathbf{x} \in \mathbb{R}^d$ we denote by $\|\mathbf{x}\|$ the Euclidean norm. We use $\mathrm{poly}(n)$ to denote a polynomial in $n$.
2.2 Local pseudorandom generators
An $(n,m,k)$-hypergraph is a hypergraph over $n$ vertices with $m$ hyperedges $S_1, \ldots, S_m$, each of cardinality $k$. Each hyperedge is ordered, and all the $k$ members of a hyperedge are distinct. We let $\mathcal{G}_{n,m,k}$ be the distribution over such hypergraphs in which a hypergraph is chosen by picking each hyperedge uniformly and independently at random among all the possible ordered hyperedges. Let $P : \{0,1\}^k \to \{0,1\}$ be a predicate, and let $G = (S_1, \ldots, S_m)$ be an $(n,m,k)$-hypergraph. We call Goldreich's pseudorandom generator (PRG) (Goldreich, 2000) the function $f_{P,G} : \{0,1\}^n \to \{0,1\}^m$ such that for $\mathbf{x} \in \{0,1\}^n$, we have $f_{P,G}(\mathbf{x}) = \left(P(\mathbf{x}_{S_1}), \ldots, P(\mathbf{x}_{S_m})\right)$. The integer $k$ is called the locality of the PRG. If $k$ is a constant then the PRG and the predicate are called local. We say that the PRG has polynomial stretch if $m = n^s$ for some constant $s > 1$. We let $\mathcal{F}_{P,n,m}$ denote the collection of functions $f_{P,G}$ where $G$ is an $(n,m,k)$-hypergraph. We sample a function from $\mathcal{F}_{P,n,m}$ by choosing a random hypergraph $G$ from $\mathcal{G}_{n,m,k}$.
We denote by $G \sim \mathcal{G}_{n,m,k}$ the operation of sampling a hypergraph from $\mathcal{G}_{n,m,k}$, and by $\mathbf{y} \sim \{0,1\}^m$ the operation of sampling from the uniform distribution on $\{0,1\}^m$. We say that $\mathcal{F}_{P,n,m}$ is an $\epsilon$-pseudorandom generator ($\epsilon$-PRG) if for every polynomial-time probabilistic algorithm $\mathcal{A}$ the distinguishing advantage
$$\left| \Pr_{G \sim \mathcal{G}_{n,m,k},\, \mathbf{x} \sim \{0,1\}^n}\left[ \mathcal{A}\left(G, f_{P,G}(\mathbf{x})\right) = 1 \right] \;-\; \Pr_{G \sim \mathcal{G}_{n,m,k},\, \mathbf{y} \sim \{0,1\}^m}\left[ \mathcal{A}(G, \mathbf{y}) = 1 \right] \right|$$
is at most $\epsilon$. Thus, the distinguisher $\mathcal{A}$ is given a random hypergraph $G$ and a string $\mathbf{y} \in \{0,1\}^m$, and its goal is to distinguish between the case where $\mathbf{y}$ is chosen uniformly at random, and the case where $\mathbf{y}$ is a random image of $f_{P,G}$, i.e., $\mathbf{y} = f_{P,G}(\mathbf{x})$ for a uniformly random $\mathbf{x} \in \{0,1\}^n$.
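To make the definition concrete, here is a minimal Python sketch of Goldreich's PRG and of the distinguishing game; the sampling routine, the example predicate (an XOR of two bits XORed with a majority of three bits), and the dimensions are illustrative choices, not the objects posited in Assumption 2.1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hypergraph(n, m, k):
    """Sample an (n, m, k)-hypergraph: m ordered hyperedges, each consisting of
    k distinct vertices, picked uniformly and independently."""
    return [tuple(rng.choice(n, size=k, replace=False)) for _ in range(m)]

def goldreich_prg(x, hypergraph, predicate):
    """Goldreich's PRG: the i-th output bit applies the predicate to the
    restriction of the seed x to the i-th hyperedge."""
    return np.array([predicate(x[list(S)]) for S in hypergraph], dtype=int)

def example_predicate(bits):
    # XOR of the first two bits, XORed with the majority of the last three
    # (locality k = 5); an arbitrary example, not the predicate of Assumption 2.1.
    return int(bits[0] ^ bits[1] ^ (sum(bits[2:]) >= 2))

n, m, k = 20, 50, 5
G = sample_hypergraph(n, m, k)
x = rng.integers(0, 2, size=n)                     # random seed
y_pseudo = goldreich_prg(x, G, example_predicate)  # pseudorandom string
y_random = rng.integers(0, 2, size=m)              # truly random string
# A distinguisher receives (G, y) and must tell the two cases apart.
```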
Our assumption is that local PRGs with polynomial stretch and constant distinguishing advantage exist:
Assumption 2.1.
For every constant $s > 1$, there exists a constant $k$ and a predicate $P : \{0,1\}^k \to \{0,1\}$, such that $\mathcal{F}_{P,n,n^s}$ is a $\frac{1}{3}$-PRG.
We remark that the same assumption was used by Daniely and Vardi (2021) to show hardness-of-learning results. Local PRGs have been extensively studied in the last two decades. In particular, local PRGs with polynomial stretch have been shown to have remarkable applications, such as secure computation with constant computational overhead (Ishai et al., 2008; Applebaum et al., 2017), and general-purpose obfuscation based on constant-degree multilinear maps (cf. Lin (2016); Lin and Vaikuntanathan (2016)). Significant evidence for Assumption 2.1 was shown in Applebaum (2013). Moreover, a concrete candidate for a local PRG, based on the XOR-MAJ predicate, was shown to be secure against all known attacks (Applebaum and Lovett, 2016; Couteau et al., 2018; Méaux et al., 2019; Applebaum and Raykov, 2016). See Daniely and Vardi (2021) for further discussion on the assumption, and on prior work regarding the relation between Goldreich's PRG and hardness of learning.
2.3 Neural networks
We consider feedforward ReLU networks. Starting from an input $\mathbf{x} \in \mathbb{R}^d$, each layer in the network is of the form $\mathbf{z} \mapsto \sigma(W\mathbf{z} + \mathbf{b})$, where $\sigma(z) = \max\{0, z\}$ is the ReLU activation (applied to vectors entry-wise), $W$ is the weight matrix, and $\mathbf{b}$ is the vector of bias terms. The weight vector of the $i$-th neuron in a layer is the $i$-th row of that layer's weight matrix, and its outgoing-weight vector is the $i$-th column of the next layer's weight matrix. We define the depth of the network as the number of layers. Unless stated otherwise, the output neuron also has a ReLU activation function. Note that a depth-$t$ network with activation in the output neuron has $t$ non-linear layers. A neuron which is not an input or output neuron is called a hidden neuron. We sometimes consider neural networks with multiple outputs. The parameters of a neural network are the set of its weight matrices and bias vectors. We often view the parameters as a vector obtained by concatenating these matrices and vectors. For $B > 0$, we say that the parameters are $B$-bounded if the absolute values of all weights and biases are at most $B$.
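As an illustration of this definition, the following Python sketch computes the forward pass of such a network; the dimensions and the random parameters are arbitrary placeholders.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, params):
    """Forward pass of a feedforward ReLU network.  `params` is a list of
    (W, b) pairs, one per layer; as in the definition above, the output
    neuron also has a ReLU activation."""
    z = x
    for W, b in params:
        z = relu(W @ z + b)
    return z

# Illustrative depth-3 network (three non-linear layers) with input dimension 4
# and hidden width 5; the random parameters are arbitrary placeholders.
rng = np.random.default_rng(0)
params = [
    (rng.normal(size=(5, 4)), rng.normal(size=5)),   # first hidden layer
    (rng.normal(size=(5, 5)), rng.normal(size=5)),   # second hidden layer
    (rng.normal(size=(1, 5)), rng.normal(size=1)),   # output neuron
]
print(forward(rng.normal(size=4), params))
```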
2.4 Learning neural networks and the smoothed-analysis framework
We first define learning neural networks under the standard PAC framework:
Definition 2.1 (Distribution-specific PAC learning).
Learning depth-$t$ neural networks under an input distribution $\mathcal{D}$ on $\mathbb{R}^d$ is defined by the following framework:
1. An adversary chooses a set $\boldsymbol{\theta}$ of $B$-bounded parameters for a depth-$t$ neural network $N_{\boldsymbol{\theta}} : \mathbb{R}^d \to \mathbb{R}$, as well as some $\epsilon > 0$.
2. Consider an examples oracle, such that each example $(\mathbf{x}, y)$ is drawn i.i.d. with $\mathbf{x} \sim \mathcal{D}$ and $y = N_{\boldsymbol{\theta}}(\mathbf{x})$. Then, given access to the examples oracle, the goal of the learning algorithm $\mathcal{L}$ is to return with probability at least $1 - \delta$ a hypothesis $h$ such that $\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[ \left(h(\mathbf{x}) - N_{\boldsymbol{\theta}}(\mathbf{x})\right)^2 \right] \leq \epsilon$.
We say that $\mathcal{L}$ is efficient if it runs in time polynomial in $d$, $1/\epsilon$, $1/\delta$, and $B$.
We consider learning in the smoothed-analysis framework (Spielman and Teng, 2004), which is a popular paradigm for analyzing non-worst-case computational complexity (Roughgarden, 2021). The smoothed-analysis framework has been successfully applied to many learning problems (e.g., Awasthi et al. (2021); Ge et al. (2018); Kalai et al. (2009); Bhaskara et al. (2014, 2019); Haghtalab et al. (2020); Ge et al. (2015); Kalai and Teng (2008); Brutzkus et al. (2020)). In the smoothed-analysis setting, the target network is not purely controlled by an adversary. Instead, the adversary can first generate an arbitrary network, and the parameters for this network (i.e., the weight matrices and bias terms) will be randomly perturbed to yield a perturbed network. The algorithm only needs to work with high probability on the perturbed network. This limits the power of the adversary and prevents it from creating highly degenerate cases. Formally, we consider the following framework (we note that a similar model was considered in Awasthi et al. (2021) and Ge et al. (2018)):
Definition 2.2 (Learning with smoothed parameters).
Learning depth-$t$ neural networks with smoothed parameters under an input distribution $\mathcal{D}$ is defined as follows:
1. An adversary chooses a set $\boldsymbol{\theta}$ of $B$-bounded parameters for a depth-$t$ neural network $N_{\boldsymbol{\theta}} : \mathbb{R}^d \to \mathbb{R}$, as well as some $\epsilon > 0$.
2. A perturbed set of parameters $\hat{\boldsymbol{\theta}}$ is obtained by a random perturbation of $\boldsymbol{\theta}$, namely, $\hat{\theta}_i = \theta_i + \xi_i$ for i.i.d. $\xi_i \sim \mathcal{N}(0, \tau^2)$.
3. Consider an examples oracle, such that each example $(\mathbf{x}, y)$ is drawn i.i.d. with $\mathbf{x} \sim \mathcal{D}$ and $y = N_{\hat{\boldsymbol{\theta}}}(\mathbf{x})$. Then, given access to the examples oracle, the goal of the learning algorithm $\mathcal{L}$ is to return with probability at least $1 - \delta$ (over the random perturbation and the internal randomness of $\mathcal{L}$) a hypothesis $h$ such that $\mathbb{E}_{\mathbf{x} \sim \mathcal{D}}\left[ \left(h(\mathbf{x}) - N_{\hat{\boldsymbol{\theta}}}(\mathbf{x})\right)^2 \right] \leq \epsilon$.
We say that $\mathcal{L}$ is efficient if it runs in time polynomial in $d$, $1/\epsilon$, $1/\delta$, $B$, and $1/\tau$.
Finally, we also consider a setting where both the parameters and the input distribution are smoothed:
Definition 2.3 (Learning with smoothed parameters and inputs).
Learning depth-$t$ neural networks with smoothed parameters and inputs under an input distribution $\mathcal{D}$ is defined as follows:
1. An adversary chooses a set $\boldsymbol{\theta}$ of $B$-bounded parameters for a depth-$t$ neural network $N_{\boldsymbol{\theta}} : \mathbb{R}^d \to \mathbb{R}$, as well as some $\epsilon > 0$.
2. A perturbed set of parameters $\hat{\boldsymbol{\theta}}$ is obtained by a random perturbation of $\boldsymbol{\theta}$, namely, $\hat{\theta}_i = \theta_i + \xi_i$ for i.i.d. $\xi_i \sim \mathcal{N}(0, \tau^2)$. Moreover, a smoothed input distribution $\hat{\mathcal{D}}$ is obtained from $\mathcal{D}$, such that $\mathbf{x} \sim \hat{\mathcal{D}}$ is chosen by drawing $\mathbf{x}' \sim \mathcal{D}$ and adding a random perturbation $\boldsymbol{\zeta} \sim \mathcal{N}(\mathbf{0}, \tau'^2 I_d)$.
3. Consider an examples oracle, such that each example $(\mathbf{x}, y)$ is drawn i.i.d. with $\mathbf{x} \sim \hat{\mathcal{D}}$ and $y = N_{\hat{\boldsymbol{\theta}}}(\mathbf{x})$. Then, given access to the examples oracle, the goal of the learning algorithm $\mathcal{L}$ is to return with probability at least $1 - \delta$ (over the random perturbations and the internal randomness of $\mathcal{L}$) a hypothesis $h$ such that $\mathbb{E}_{\mathbf{x} \sim \hat{\mathcal{D}}}\left[ \left(h(\mathbf{x}) - N_{\hat{\boldsymbol{\theta}}}(\mathbf{x})\right)^2 \right] \leq \epsilon$.
We say that $\mathcal{L}$ is efficient if it runs in time polynomial in $d$, $1/\epsilon$, $1/\delta$, $B$, $1/\tau$, and $1/\tau'$.
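The following Python sketch illustrates the smoothing steps of Definitions 2.2 and 2.3; the noise scale `tau`, the toy network, and the Bernoulli base input are illustrative stand-ins for the quantities fixed in the formal definitions.

```python
import numpy as np

rng = np.random.default_rng(0)

def smooth_parameters(params, tau):
    """Add i.i.d. N(0, tau^2) noise to every weight and bias (Definition 2.2)."""
    return [(W + tau * rng.normal(size=W.shape),
             b + tau * rng.normal(size=b.shape)) for W, b in params]

def smooth_input(x, tau):
    """Perturb a sample from the base input distribution (Definition 2.3)."""
    return x + tau * rng.normal(size=x.shape)

# Example: a toy one-hidden-layer network and a Bernoulli(1/2) input on {0,1}^4.
params = [(rng.normal(size=(5, 4)), rng.normal(size=5)),
          (rng.normal(size=(1, 5)), rng.normal(size=1))]
x = rng.integers(0, 2, size=4).astype(float)
params_hat = smooth_parameters(params, tau=0.01)
x_hat = smooth_input(x, tau=0.01)
```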
We emphasize that all of the above definitions consider learning in the distribution-specific setting. Thus, the learning algorithm may depend on the specific input distribution .
3 Results
As we discussed in the introduction, there exist efficient algorithms for learning depth-2 (one-hidden-layer) ReLU networks with smoothed parameters under the Gaussian distribution. We now show that such a result may not be achieved for depth-3 networks:
Theorem 3.1.
We prove the theorem in Section 4. Next, we conclude that learning depth-3 neural networks under the Gaussian distribution on $\mathbb{R}^n$ is hard in the standard PAC framework even if all weight matrices are non-degenerate, namely, when the minimal singular values of the weight matrices are lower bounded by an inverse polynomial. As we discussed in the introduction, for one-hidden-layer networks there exist efficient learning algorithms under similar assumptions.
Corollary 3.1.
The proof of the above corollary follows from Theorem 3.1, using the fact that by adding a small random perturbation to the weight matrices, we get w.h.p. non-degenerate matrices. Hence, an efficient algorithm that learns non-degenerate networks suffices for obtaining an efficient algorithm that learns under the smoothed analysis framework. See Appendix B for the formal proof.
The above hardness results consider depth-3 networks that include activation in the output neuron. Thus, the networks have three non-linear layers. We remark that these results readily imply hardness also for depth-4 networks without activation in the output, and hardness for depth-$t$ networks for any $t \geq 3$. We also note that the hardness results hold already for networks where all hidden layers are of the same width, e.g., where all hidden layers have the same width as the input dimension.
Theorem 3.1 gives a strong limitation on learning depth-3 networks in the smoothed-analysis framework. Motivated by existing positive results on smoothed analysis for depth-2 networks, we now study whether learning depth-2 networks with smoothed parameters can be done under weak assumptions on the input distribution. Specifically, we consider the case where both the parameters and the input distribution are smoothed. We show that efficiently learning depth-2 networks may not be possible with smoothed parameters when the input distribution on $\mathbb{R}^n$ is obtained by smoothing an i.i.d. Bernoulli distribution on $\{0,1\}^n$.
Theorem 3.2.
4 Proof of Theorem 3.1
The proof builds on a technique from Daniely and Vardi (2021). It proceeds by showing that an efficient algorithm for learning depth-3 neural networks with smoothed parameters under the Gaussian distribution can be used for breaking a local PRG. Intuitively, the main challenge in our proof in comparison to Daniely and Vardi (2021) is that our reduction must handle the random noise which is added to the parameters. Specifically, Daniely and Vardi (2021) define a certain examples oracle, and show that the examples returned by the oracle are realizable by some neural network which depends on the unknown seed used by the PRG. Since the network depends on this unknown seed, some of the parameters of the network are unknown, and hence it is not trivial how to define an examples oracle which is realizable by a perturbed network. Moreover, we need to handle this random perturbation without increasing the network's depth.
We now provide the formal proof. The proof relies on several lemmas which we prove in Appendix A. For a sufficiently large $n$, let $\mathcal{D}$ be the standard Gaussian distribution on the input space. Assume that there is a $\mathrm{poly}(n)$-time algorithm $\mathcal{L}$ that learns depth-3 neural networks with at most $\mathrm{poly}(n)$ hidden neurons and parameter magnitudes bounded by $\mathrm{poly}(n)$, with smoothed parameters, under the distribution $\mathcal{D}$, with accuracy, confidence, and noise-magnitude parameters that we will specify later. Let $p(n)$ be the sample complexity of $\mathcal{L}$, namely, $\mathcal{L}$ uses a sample of size at most $p(n)$ and returns with the required probability a hypothesis $h$ with small error with respect to the perturbed network. Let $c$ be a constant such that $p(n) \leq n^c$ for every sufficiently large $n$. By Assumption 2.1, there exists a constant $k$ and a predicate $P : \{0,1\}^k \to \{0,1\}$, such that $\mathcal{F}_{P,n,m}$, for a suitable $m = \mathrm{poly}(n)$ exceeding the sample complexity of $\mathcal{L}$, is a $\frac{1}{3}$-PRG. We will show an efficient algorithm $\mathcal{A}$ with distinguishing advantage greater than $\frac{1}{3}$ and thus reach a contradiction.
Throughout this proof, we will use the following notation. For a hyperedge $S = (i_1, \ldots, i_k)$ we denote by $\mathbf{e}(S) \in \{0,1\}^{nk}$ the following encoding of $S$: the vector $\mathbf{e}(S)$ is a concatenation of $k$ vectors in $\{0,1\}^n$, such that the $j$-th vector has $1$ in the $i_j$-th coordinate and $0$ elsewhere. Thus, $\mathbf{e}(S)$ consists of $k$ size-$n$ slices, each encoding a member of $S$. For $j \in [k]$ and $l \in [n]$, we denote by $\mathbf{e}(S)_{j,l}$ the $l$-th component of the $j$-th slice of $\mathbf{e}(S)$. For , let be such that for every hyperedge we have . Let be such that . Let be the density of , let , and let . Note that are the densities of the restriction of to the intervals and respectively. Let be a mapping such that for every and we have . For we denote , namely, the first components of (assuming ).
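The following Python sketch implements the encoding $\mathbf{e}(S)$ just described; the function name and the 1-based indexing convention follow the text and are otherwise arbitrary.

```python
import numpy as np

def encode_hyperedge(S, n):
    """Encode an ordered hyperedge S = (i_1, ..., i_k) over [n] as a vector in
    {0,1}^{nk}: the j-th size-n slice is the one-hot encoding of i_j.
    Indices are 1-based, as in the text."""
    k = len(S)
    e = np.zeros(n * k)
    for j, i in enumerate(S):
        e[j * n + (i - 1)] = 1.0
    return e

# Example: the hyperedge (3, 1, 5) over n = 5 vertices with locality k = 3.
print(encode_hyperedge((3, 1, 5), n=5))
```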
4.1 Defining the target network for $\mathcal{L}$
Since our goal is to use the algorithm for breaking PRGs, in this subsection we define a neural network that we will later use as a target network for . The network contains the subnetworks which we define below.
Let be a depth- neural network with input dimension , at most hidden neurons, at most output neurons (with activations in the output neurons), and parameter magnitudes bounded by (all bounds are for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be an input to such that for some hyperedge , and assume that for every we have . Fix some . Then, for with the inputs to all output neurons are at most , and for with there exists a neuron in with input at least . Recall that our definition of a neuron’s input includes the addition of the bias term. The construction of the network is given in Lemma A.3. Intuitively, the network consists of a layer that transforms w.h.p. the input to , followed by a layer that satisfies the following: Building on a lemma from Daniely and Vardi (2021) which shows that can be computed by a DNF formula, we define a layer where each output neuron corresponds to a term in the DNF formula, such that if the term evaluates to 0 then the input to the neuron is at most , and otherwise it is at least . Note that the network depends on . However, only the second layer depends on , and thus given an input we may compute the first layer even without knowing . Let be a depth- neural network with no activation function in the output neuron, obtained from by summing the outputs from all neurons .
Let be a depth- neural network with input dimension , at most hidden neurons, at most output neurons, and parameter magnitudes bounded by (for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be an input to such that for every we have . If is an encoding of a hyperedge then the inputs to all output neurons are at most , and otherwise there exists a neuron in with input at least . The construction of the network is given in Lemma A.5. Intuitively, each neuron in is responsible for checking whether violates some requirement that must hold in an encoding of a hyperedge. Let be a depth- neural network with no activation function in the output neuron, obtained from by summing the outputs from all neurons .
Let be a depth- neural network with input dimension , at most hidden neurons, output neurons, and parameter magnitudes bounded by (for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be an input to . If there exists such that then there exists a neuron in with input at least . Moreover, if for all we have then the inputs to all neurons in are at most . The construction of the network is straightforward and given in Lemma A.6. Let be a depth- neural network with no activation function in the output neuron, obtained from by summing the outputs from all neurons .
Let be a depth- network obtained from as follows. For we have . The network has at most neurons, and parameter magnitudes bounded by (all bounds are for a sufficiently large ). Finally, let be a depth- neural network such that .
4.2 Defining the noise magnitude and analyzing the perturbed network
In order to use the algorithm w.r.t. some neural network with parameters , we need to implement an examples oracle, such that the examples are labeled according to a neural network with parameters , where is a random perturbation. Specifically, we use with an examples oracle where the labels correspond to a network , obtained from (w.r.t. an appropriate in the construction of ) by adding a small perturbation to the parameters. The perturbation is such that we add i.i.d. noise to each parameter in , where the noise is distributed according to , and is small enough such that the following holds. Let be any depth- neural network parameterized by for some with at most neurons, and parameter magnitudes bounded by (note that is polynomial in ). Then with probability at least over , we have for all , and the network is such that for every input with and every neuron we have: Let be the inputs to the neuron in the computations and (respectively), then . Thus, is sufficiently small, such that w.h.p. adding i.i.d. noise to each parameter does not change the inputs to the neurons by more than . Note that such an inverse-polynomial exists, since when the network size, parameter magnitudes, and input size are bounded by some , then the input to each neuron in is -Lipschitz as a function of , and thus it suffices to choose that implies with probability at least that for a sufficiently large polynomial (see Lemma A.7 for details).
Let be the parameters of the network . Recall that the parameters vector is the concatenation of all weight matrices and bias terms. Let be the parameters of , namely, where . By our choice of and the construction of the networks , with probability at least over , for every with , the inputs to the neurons in the computation satisfy the following properties, where we denote :
(P1) If for some hyperedge , and for every we have , then the inputs to satisfy:
• If the inputs to all neurons in are at most .
• If there exists a neuron in with input at least .
(P2) If for every we have , then the inputs to satisfy:
• If is an encoding of a hyperedge then the inputs to all neurons are at most .
• Otherwise, there exists a neuron in with input at least .
(P3) The inputs to satisfy:
• If there exists such that then there exists a neuron in with input at least .
• If for all we have then the inputs to all neurons in are at most .
4.3 Stating the algorithm
Given a sequence $(S_1, y_1), \ldots, (S_m, y_m)$, where $S_1, \ldots, S_m$ are i.i.d. random hyperedges, the algorithm $\mathcal{A}$ needs to distinguish whether $\mathbf{y} = (y_1, \ldots, y_m)$ is random or $y_i = P(\mathbf{x}_{S_i})$ for all $i$, for a random seed $\mathbf{x} \in \{0,1\}^n$.
We use the efficient algorithm in order to obtain distinguishing advantage greater than as follows. Let be a random perturbation, and let be the perturbed network as defined above, w.r.t. the unknown . Note that given a perturbation , only the weights in the second layer of the subnetwork in are unknown, since all other parameters do not depend on . The algorithm runs with the following examples oracle. In the -th call, the oracle first draws such that each component is drawn i.i.d. from a Bernoulli distribution which takes the value with probability . If is an encoding of a hyperedge then the oracle replaces with . Then, the oracle chooses such that for each component , if then is drawn from , and otherwise is drawn from . Let be such that , and the other components of are drawn i.i.d. from . Note that the vector has the distribution , due to the definitions of the densities and , and since replacing an encoding of a random hyperedge by an encoding of another random hyperedge does not change the distribution of . Let be the bias term of the output neuron of . The oracle returns , where the labels are chosen as follows:
• If is not an encoding of a hyperedge, then .
• If is an encoding of a hyperedge:
– If does not have components in the interval , then if we set , and if we set .
– If has a component in the interval , then .
– If does not have components in the interval , but has a component in the interval , then the label is determined as follows:
* If then .
* If : Let be the network after omitting the neurons and their incoming and outgoing weights. Then, we set . Note that since only the second layer of depends on , we can compute without knowing .
Let $h$ be the hypothesis returned by $\mathcal{L}$. Recall that $\mathcal{L}$ uses at most $p(n)$ examples, and hence the sequence contains at least $m - p(n)$ examples that $\mathcal{L}$ cannot view. We denote the indices of these examples by $I$, and the corresponding examples by $\{(\mathbf{x}_i, y_i)\}_{i \in I}$. By additional calls to the oracle, the algorithm $\mathcal{A}$ obtains the examples that correspond to $I$. Let $\tilde{h}$ be the hypothesis obtained from $h$ by clipping its output to the interval $[0,1]$. The algorithm computes the average loss of $\tilde{h}$ on the examples indexed by $I$; if this average loss is below a fixed threshold, then $\mathcal{A}$ declares that $\mathbf{y}$ is pseudorandom, and otherwise it declares that $\mathbf{y}$ is random. We remark that the decision of our algorithm is based on $\tilde{h}$ (rather than $h$) since we need the outputs to be bounded, in order to allow using Hoeffding's inequality in our analysis, which we discuss in the next subsection.
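A minimal Python sketch of this decision step is given below; the clipping interval and the loss threshold are illustrative placeholders for the constants fixed in the analysis.

```python
import numpy as np

def distinguish(h, holdout, threshold=0.25):
    """Decision step of the distinguisher: clip the learned hypothesis to [0, 1],
    evaluate it on held-out examples that the learner never saw, and threshold
    the average squared loss.  The threshold value here is illustrative."""
    losses = [(np.clip(h(x), 0.0, 1.0) - y) ** 2 for (x, y) in holdout]
    return "pseudorandom" if np.mean(losses) <= threshold else "random"
```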
4.4 Analyzing the algorithm
Note that the algorithm $\mathcal{A}$ runs in $\mathrm{poly}(n)$ time. We now show that if $\mathbf{y}$ is pseudorandom then $\mathcal{A}$ declares "pseudorandom" with probability greater than $\frac{2}{3}$, and if $\mathbf{y}$ is random then it declares "pseudorandom" with probability less than $\frac{1}{3}$.
We start with the case where is pseudorandom. In Lemma A.8, we prove that if is pseudorandom then w.h.p. (over and the i.i.d. inputs ) the examples returned by the oracle are realized by . Thus, for all . As we show in the lemma, this claim follows by noting that the following hold w.h.p., where we denote :
• If is not an encoding of a hyperedge, then the oracle sets , and we have:
• If is an encoding of a hyperedge , then by the definition of the examples oracle we have . Hence:
– If does not have components in , then:
* If then the oracle sets . Since is pseudorandom, we have . Hence, in the computation there exists a neuron in with output at least (by Property (P1)), which implies .
– If has a component in , then the oracle sets . Also, in the computation there exists a neuron in with output at least (by Property (P3)), which implies .
– If does not have components in the interval , but has a component in the interval , then:
* If the oracle sets . Since is pseudorandom, we have . Hence, in the computation there exists a neuron in with output at least (by Property (P1)), which implies .
Recall that the algorithm is such that with probability at least (over , the i.i.d. inputs , and its internal randomness), given a size- dataset labeled by , it returns a hypothesis such that . By the definition of and the construction of , if has small error then also has small error, namely, we have . In Lemma A.9 we use the above arguments and Hoeffding’s inequality over , and prove that with probability greater than we have .
Next, we consider the case where is random. Let be such that iff does not have components in the interval , and for a hyperedge . If is random, then by the definition of our examples oracle, for every such that , we have with probability and otherwise. Also, by the definition of the oracle, is independent of and independent of the choice of the vector that corresponds to . Hence, for such with , any hypothesis cannot predict the label , and the expected loss for the example is at least . Moreover, in Lemma A.11 we show that for a sufficiently large . In Lemma A.12 we use these arguments to prove a lower bound on , and by Hoeffding’s inequality over we conclude that with probability greater than we have .
Overall, if $\mathbf{y}$ is pseudorandom then with probability greater than $\frac{2}{3}$ the algorithm declares "pseudorandom", and if $\mathbf{y}$ is random then with probability greater than $\frac{2}{3}$ it declares "random". Thus, the distinguishing advantage is greater than $\frac{1}{3}$, in contradiction to Assumption 2.1.
5 Discussion
Understanding the computational complexity of learning neural networks is a central question in learning theory. Our results imply that the assumptions which allow for efficient learning of one-hidden-layer networks might not suffice for deeper networks. Also, for depth-2 networks we show that it is not sufficient to assume that both the parameters and the inputs are smoothed. We hope that our hardness results will help focus on assumptions that may allow for efficient learning. Below we discuss several intriguing open problems.
First, we emphasize that our hardness results are for neural networks that include the ReLU activation also in the output neuron. In contrast, the positive results on learning depth-2 networks that we discussed in the introduction do not include activation in the output neuron. Therefore, as far as we are aware, there is still a gap between the upper bounds and our hardness results: (1) Under the assumption that the input is Gaussian and the weights are non-degenerate, the cases of depth-2 networks with activation in the output neuron and of depth-3 networks without activation in the output are not settled; (2) In the setting where both the parameters and the input distribution are smoothed, the case of depth-2 networks without activation in the output is not settled.
Moreover, our hardness result for depth-3 networks suggests that adding a small (inverse-polynomial) noise to the parameters is not sufficient to allow efficient learning under the Gaussian distribution. An intriguing direction is to explore the case where the noise is larger. Intuitively, the case of very large noise corresponds to learning a random network, where the weights are drawn i.i.d. from the Gaussian distribution. Hence, it would be interesting to understand whether there is an efficient algorithm for learning random ReLU networks under the Gaussian input distribution. Likewise, the effect of large noise may also be considered w.r.t. the inputs, namely, we may use larger noise for obtaining smoothed input distributions.
Acknowledgements
This work was done as part of the NSF-Simons Sponsored Collaboration on the Theoretical Foundations of Deep Learning.
References
- Applebaum [2013] B. Applebaum. Pseudorandom generators with long stretch and low locality from random local one-way functions. SIAM Journal on Computing, 42(5):2008–2037, 2013.
- Applebaum and Lovett [2016] B. Applebaum and S. Lovett. Algebraic attacks against random local functions and their countermeasures. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1087–1100, 2016.
- Applebaum and Raykov [2016] B. Applebaum and P. Raykov. Fast pseudorandom functions based on expander graphs. In Theory of Cryptography Conference, pages 27–56. Springer, 2016.
- Applebaum et al. [2010] B. Applebaum, B. Barak, and A. Wigderson. Public-key cryptography from different assumptions. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 171–180, 2010.
- Applebaum et al. [2017] B. Applebaum, I. Damgård, Y. Ishai, M. Nielsen, and L. Zichron. Secure arithmetic computation with constant computational overhead. In Annual International Cryptology Conference, pages 223–254. Springer, 2017.
- Awasthi et al. [2021] P. Awasthi, A. Tang, and A. Vijayaraghavan. Efficient algorithms for learning depth-2 neural networks with general relu activations. Advances in Neural Information Processing Systems, 34:13485–13496, 2021.
- Bakshi et al. [2019] A. Bakshi, R. Jayaram, and D. P. Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
- Bhaskara et al. [2014] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594–603, 2014.
- Bhaskara et al. [2019] A. Bhaskara, A. Chen, A. Perreault, and A. Vijayaraghavan. Smoothed analysis in unsupervised learning via decoupling. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 582–610. IEEE, 2019.
- Brutzkus et al. [2020] A. Brutzkus, A. Daniely, and E. Malach. Id3 learns juntas for smoothed product distributions. In Conference on Learning Theory, pages 902–915. PMLR, 2020.
- Chen et al. [2022a] S. Chen, A. Gollakota, A. R. Klivans, and R. Meka. Hardness of noise-free learning for two-hidden-layer neural networks. arXiv preprint arXiv:2202.05258, 2022a.
- Chen et al. [2022b] S. Chen, A. R. Klivans, and R. Meka. Learning deep relu networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707. IEEE, 2022b.
- Couteau et al. [2018] G. Couteau, A. Dupin, P. Méaux, M. Rossi, and Y. Rotella. On the concrete security of goldreich’s pseudorandom generator. In International Conference on the Theory and Application of Cryptology and Information Security, pages 96–124. Springer, 2018.
- Daniely and Shalev-Shwartz [2016] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning dnf’s. In Conference on Learning Theory, pages 815–830, 2016.
- Daniely and Vardi [2020] A. Daniely and G. Vardi. Hardness of learning neural networks with natural weights. arXiv preprint arXiv:2006.03177, 2020.
- Daniely and Vardi [2021] A. Daniely and G. Vardi. From local pseudorandom generators to hardness of learning. In Conference on Learning Theory, pages 1358–1394. PMLR, 2021.
- Diakonikolas and Kane [2020] I. Diakonikolas and D. M. Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 184–195. IEEE, 2020.
- Diakonikolas et al. [2020a] I. Diakonikolas, D. Kane, and N. Zarifis. Near-optimal sq lower bounds for agnostically learning halfspaces and relus under gaussian marginals. Advances in Neural Information Processing Systems, 33, 2020a.
- Diakonikolas et al. [2020b] I. Diakonikolas, D. M. Kane, V. Kontonis, and N. Zarifis. Algorithms and sq lower bounds for pac learning one-hidden-layer relu networks. arXiv preprint arXiv:2006.12476, 2020b.
- Ge et al. [2015] R. Ge, Q. Huang, and S. M. Kakade. Learning mixtures of gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 761–770, 2015.
- Ge et al. [2017] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
- Ge et al. [2018] R. Ge, R. Kuditipudi, Z. Li, and X. Wang. Learning two-layer neural networks with symmetric inputs. arXiv preprint arXiv:1810.06793, 2018.
- Goel and Klivans [2019] S. Goel and A. R. Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
- Goel et al. [2017] S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the relu in polynomial time. In Conference on Learning Theory, pages 1004–1042. PMLR, 2017.
- Goel et al. [2019] S. Goel, S. Karmalkar, and A. Klivans. Time/accuracy tradeoffs for learning a relu with respect to gaussian marginals. In Advances in Neural Information Processing Systems, pages 8584–8593, 2019.
- Goel et al. [2020a] S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. arXiv preprint arXiv:2006.12011, 2020a.
- Goel et al. [2020b] S. Goel, A. Gollakota, and A. Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33, 2020b.
- Goel et al. [2020c] S. Goel, A. Klivans, P. Manurangsi, and D. Reichman. Tight hardness results for training depth-2 relu networks. arXiv preprint arXiv:2011.13550, 2020c.
- Goldreich [2000] O. Goldreich. Candidate one-way functions based on expander graphs. IACR Cryptol. ePrint Arch., 2000:63, 2000.
- Haghtalab et al. [2020] N. Haghtalab, T. Roughgarden, and A. Shetty. Smoothed analysis of online and differentially private learning. Advances in Neural Information Processing Systems, 33:9203–9215, 2020.
- Ishai et al. [2008] Y. Ishai, E. Kushilevitz, R. Ostrovsky, and A. Sahai. Cryptography with constant computational overhead. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 433–442, 2008.
- Janzamin et al. [2015] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- Kakade et al. [2011] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.
- Kalai and Sastry [2009] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
- Kalai and Teng [2008] A. T. Kalai and S.-H. Teng. Decision trees are pac-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933, 2008.
- Kalai et al. [2009] A. T. Kalai, A. Samorodnitsky, and S.-H. Teng. Learning and smoothed analysis. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 395–404. IEEE, 2009.
- Klivans and Sherstov [2006] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.
- Laurent and Massart [2000] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
- Lin [2016] H. Lin. Indistinguishability obfuscation from constant-degree graded encoding schemes. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 28–57. Springer, 2016.
- Lin and Vaikuntanathan [2016] H. Lin and V. Vaikuntanathan. Indistinguishability obfuscation from ddh-like assumptions on constant-degree graded encodings. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 11–20. IEEE, 2016.
- Méaux et al. [2019] P. Méaux, C. Carlet, A. Journault, and F.-X. Standaert. Improved filter permutators: Combining symmetric encryption design, boolean functions, low complexity cryptography, and homomorphic encryption, for private delegation of computations. IACR Cryptol. ePrint Arch., 2019:483, 2019.
- Roughgarden [2021] T. Roughgarden. Beyond the worst-case analysis of algorithms. Cambridge University Press, 2021.
- Sankar et al. [2006] A. Sankar, D. A. Spielman, and S.-H. Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications, 28(2):446–476, 2006.
- Shamir [2018] O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.
- Song et al. [2017] L. Song, S. Vempala, J. Wilmes, and B. Xie. On the complexity of learning neural networks. In Advances in neural information processing systems, pages 5514–5522, 2017.
- Spielman and Teng [2004] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.
- Vempala and Wilmes [2019] S. Vempala and J. Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and sq lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.
- Zhang et al. [2016] Y. Zhang, J. D. Lee, and M. I. Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001. PMLR, 2016.
- Zhong et al. [2017] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.
Appendix A Missing lemmas for the proof of Theorem 3.1
Lemma A.1 (Daniely and Vardi [2021]).
For every predicate and , there is a DNF formula over with at most terms, such that for every hyperedge we have . Moreover, each term in is a conjunction of positive literals.
Proof.
The following proof is from Daniely and Vardi [2021], and we give it here for completeness.
We denote by the set of satisfying assignments of . Note that the size of is at most . Consider the following DNF formula over :
For a hyperedge , we have
∎
Lemma A.2.
Let . There exists an affine layer with at most outputs, weights bounded by a constant and bias terms bounded by (for a sufficiently large ), such that given an input for some hyperedge , it satisfies the following: For with all outputs are at most , and for with there exists an output greater than or equal to .
Proof.
By Lemma A.1, there exists a DNF formula over with at most terms, such that . Thus, if then all terms in are not satisfied for the input , and if then there is at least one term in which is satisfied for the input . Therefore, it suffices to construct an affine layer such that for an input , the -th output will be at most if the -th term of is not satisfied, and at least otherwise. Each term in is a conjunction of positive literals. Let be the indices of these literals. The -th output of the affine layer will be
Note that if the conjunction holds, then this expression is exactly , and otherwise it is at most . Finally, note that all weights are bounded by and all bias terms are bounded by (for large enough ). ∎
Lemma A.3.
Let . There exists a depth- neural network with input dimension , hidden neurons, at most output neurons, and parameter magnitudes bounded by (for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be such that for some hyperedge , and assume that for every we have . Then, for with the inputs to all neurons are at most , and for with there exists a neuron in with input at least . Moreover, only the second layer of depends on .
Proof.
First, we construct a depth- neural network with a single layer of non-linearity, such that for every with for every , we have . The network has hidden neurons, and computes , where is such that
Note that if then , if then , and if then . Also, note that all weights and bias terms can be bounded by (for large enough ). Moreover, the network does not depend on .
Let be such that for some hyperedge , and assume that for every we have . For such , we have . Hence, it suffices to show that we can construct an affine layer with at most outputs, weights bounded by a constant and bias terms bounded by , such that given an input it satisfies the following: For with all outputs are at most , and for with there exists an output greater than or equal to . We construct such an affine layer in Lemma A.2. ∎
Lemma A.4.
There exists an affine layer with outputs, weights bounded by a constant and bias terms bounded by (for a sufficiently large ), such that given an input , if it is an encoding of a hyperedge then all outputs are at most , and otherwise there exists an output greater than or equal to .
Proof.
Note that the input is not an encoding of a hyperedge iff at least one of the following holds:
1. At least one of the size-$n$ slices contains $1$ more than once.
2. At least one of the size-$n$ slices does not contain $1$.
3. There are two size-$n$ slices that encode the same index.
We define the outputs of our affine layer as follows. First, we have outputs that correspond to (1): in order to check whether a slice contains $1$ more than once, the output will be . Second, we have outputs that correspond to (2): in order to check whether a slice does not contain $1$, the output will be . Finally, we have outputs that correspond to (3): in order to check whether there are two slices that encode the same index, the output will be . Note that all weights are bounded by a constant and all bias terms are appropriately bounded for large enough $n$. ∎
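For intuition, the following Python sketch checks these three conditions directly on a 0/1 vector; the affine layer constructed in the lemma implements the same checks via linear outputs. The function name is ours.

```python
import numpy as np

def is_hyperedge_encoding(z, n, k):
    """Check the three conditions from the proof on a 0/1 vector z of length n*k:
    every size-n slice is one-hot (rules out (1) and (2)), and no two slices
    encode the same index (rules out (3))."""
    slices = z.reshape(k, n)
    if not all(s.sum() == 1 for s in slices):
        return False
    indices = [int(np.argmax(s)) for s in slices]
    return len(set(indices)) == k
```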
Lemma A.5.
There exists a depth- neural network with input dimension , at most hidden neurons, output neurons, and parameter magnitudes bounded by (for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be such that for every we have . If is an encoding of a hyperedge then the inputs to all neurons are at most , and otherwise there exists a neuron in with input at least .
Proof.
Let be the depth- neural network from the proof of Lemma A.3, with a single layer of non-linearity with hidden neurons, and parameter magnitudes bounded by , such that for every with for every , we have .
Let be such that for every we have . For such we have . Hence, it suffices to show that we can construct an affine layer with outputs, weights bounded by a constant and bias terms bounded by , such that given an input , if it is an encoding of a hyperedge then all outputs are at most , and otherwise there exists an output greater than or equal to . We construct such an affine layer in Lemma A.4. ∎
Lemma A.6.
There exists a depth- neural network with input dimension , at most hidden neurons, output neurons, and parameter magnitudes bounded by (for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let . If there exists such that then there exists a neuron in with input at least . If for all we have then the inputs to all neurons in are at most .
Proof.
It suffices to construct a univariate depth- network with one non-linear layer and a constant number of hidden neurons, such that for every input we have , and for every we have .
We construct as follows:
Note that all weights and bias terms are bounded by (for large enough ). ∎
Lemma A.7.
Let and . Then, there exists such that for a sufficiently large , with probability at least a vector satisfies .
Proof.
Let . Every component in has the distribution . By a standard tail bound of the Gaussian distribution, we have for every and that . Hence, for , we get
By the union bound, with probability at least , we have
Thus, for a sufficiently large , with probability at least we have . ∎
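In the spirit of Lemma A.7, the following Python sketch chooses a noise scale via a standard Gaussian tail bound and a union bound over the coordinates; the particular tail bound and the numbers in the example are illustrative, not the exact constants of the lemma.

```python
import math

def noise_scale(d, bound, delta):
    """Choose tau so that an i.i.d. N(0, tau^2) vector in R^d has all coordinates
    of magnitude at most `bound` with probability at least 1 - delta, using the
    tail bound Pr[|N(0, tau^2)| > t] <= 2 exp(-t^2 / (2 tau^2)) and a union bound:
    it suffices that 2 d exp(-bound^2 / (2 tau^2)) <= delta."""
    return bound / math.sqrt(2.0 * math.log(2.0 * d / delta))

# Example: 10^6 parameters, per-coordinate bound 0.01, failure probability 10^-6.
print(noise_scale(d=10**6, bound=0.01, delta=1e-6))
```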
Lemma A.8.
If is pseudorandom then with probability at least (over and the i.i.d. inputs ) the examples returned by the oracle are realized by .
Proof.
By our choice of , with probability at least over , we have for all , and for every with the inputs to the neurons in the computation satisfy Properties (P1) through (P3). We first show that with probability at least all examples satisfy . Hence, with probability at least , Properties (P1) through (P3) hold for the computations for all .
Note that has the Chi-squared distribution. Since is of dimension , a concentration bound by Laurent and Massart [Laurent and Massart, 2000, Lemma 1] implies that for all we have
Plugging-in , we get
Thus, we have . By the union bound, with probability at least
(for a sufficiently large ), all examples satisfy .
Thus, we showed that with probability at least (for a sufficiently large ), we have for all , and Properties (P1) through (P3) hold for the computations for all . It remains to show that if these properties hold, then the examples are realized by .
Let . For brevity, we denote , , and . Since for all , and all incoming weights to the output neuron in are , then in all incoming weights to the output neuron are in , and the bias term in the output neuron, denoted by , is in . Consider the following cases:
• If is not an encoding of a hyperedge then , and satisfies:
1. If does not have components in , then there exists a neuron in with output at least (by Property (P2)).
2. If has a component in , then there exists a neuron in with output at least (by Property (P3)).
In both cases, since all incoming weights to the output neuron in are in , and , then the input to the output neuron (including the bias term) is at most , and thus its output is .
• If is an encoding of a hyperedge , then by the definition of the examples oracle we have . Hence:
– If does not have components in , then:
* If then the oracle sets . Since is pseudorandom, we have . Hence, in the computation there exists a neuron in with output at least (by Property (P1)). Since all incoming weights to the output neuron in are in , and , then the input to the output neuron (including the bias term) is at most , and thus its output is .
– If has a component in , then . Also, in the computation there exists a neuron in with output at least (by Property (P3)). Since all incoming weights to the output neuron in are in , and , then the input to the output neuron (including the bias term) is at most , and thus its output is .
– If does not have components in the interval , but has a component in the interval , then:
* If the oracle sets . Since is pseudorandom, we have . Hence, in the computation there exists a neuron in with output at least (by Property (P1)). Since all incoming weights to the output neuron in are in , and , then the input to the output neuron (including the bias term) is at most , and thus its output is .
∎
Lemma A.9.
If is pseudorandom, then for a sufficiently large , with probability greater than we have
Proof.
By Lemma A.8, if is pseudorandom then with probability at least (over and the i.i.d. inputs ) the examples returned by the oracle are realized by . Recall that the algorithm is such that with probability at least (over , the i.i.d. inputs , and possibly its internal randomness), given a size- dataset labeled by , it returns a hypothesis such that . Hence, with probability at least the algorithm returns such a good hypothesis , given examples labeled by our examples oracle. Indeed, note that can return a bad hypothesis only if the random choices are either bad for (when used with realizable examples) or bad for the realizability of the examples returned by our oracle. By the definition of and the construction of , if has small error then also has small error, namely,
Let . Recall that by our choice of we have . Since for all , by Hoeffding's inequality we have for a sufficiently large that
Moreover, by Lemma A.8,
Overall, by the union bound we have with probability at least for sufficiently large that:
• .
• .
• .
Combining the above, we get that if is pseudorandom, then with probability greater than we have
∎
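The concentration step in the proof above is an application of Hoeffding's inequality, which for i.i.d. random variables Z_1, \dots, Z_m taking values in an interval [a,b] states that for every t > 0,
\[
\Pr\Bigl[\,\Bigl|\frac{1}{m}\sum_{i=1}^{m} Z_i - \mathbb{E}[Z_1]\Bigr| \ge t\Bigr] \le 2\exp\Bigl(-\frac{2mt^2}{(b-a)^2}\Bigr).
\]
Here it is applied to the per-example losses, which is why the losses must be bounded.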
Lemma A.10.
Let be a random vector whose components are drawn i.i.d. from a Bernoulli distribution, which takes the value with probability . Then, for a sufficiently large , the vector is an encoding of a hyperedge with probability at least .
Proof.
The vector represents a hyperedge iff in each of the size- slices in there is exactly one -bit and every two slices in encode different indices. Hence,
Since for every we have , then for a sufficiently large the above is at least
∎
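As a sanity check of the first step (writing p for the Bernoulli parameter and n for the slice size, symbols introduced only for this remark), the probability that a single size-n slice with i.i.d. Bernoulli(p) entries contains exactly one 1-bit is
\[
\binom{n}{1} p (1-p)^{\,n-1} = np(1-p)^{\,n-1},
\]
and since the slices are independent, the probability that all slices are one-hot is the product of these factors; the remaining requirement is that the encoded indices are pairwise distinct.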
Lemma A.11.
Let be the vector returned by the oracle. We have
Proof.
Let . We have
(1)
We now bound the terms in the above RHS. First, since has the Gaussian distribution, its components are drawn i.i.d. from a density function bounded by . Hence, for a sufficiently large we have
(2)
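The density bound used above is the standard Gaussian one: a \mathcal{N}(0,\sigma^2) random variable has density at most
\[
\frac{1}{\sigma\sqrt{2\pi}},
\]
so the probability that any fixed component lands in an interval of length \ell is at most \ell/(\sigma\sqrt{2\pi}); here \sigma and \ell are generic symbols standing in for the concrete quantities in the proof, and the bound is typically combined with a union bound over the components.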
Lemma A.12.
If is random, then for a sufficiently large with probability larger than we have
Proof.
Let be such that iff does not have components in the interval , and for a hyperedge . If is random, then by the definition of our examples oracle, for every such that , we have with probability and otherwise. Also, by the definition of the oracle, is independent of and independent of the choice of the vector that corresponds to . If then for a sufficiently large the hypothesis satisfies for each random example the following
In Lemma A.11, we show that . Hence,
Thus, if then we have
Therefore, for large we have
Since for all returned by the examples oracle, and the examples for are i.i.d., by Hoeffding's inequality we have for a sufficiently large that
Hence, for large enough , with probability at least we have both and , and thus
∎
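The driving calculation behind Lemma A.12 is the following standard fact, stated here for illustration with the squared loss and assuming the random label y \in \{0,1\} is a fair coin flip independent of the prediction h(x) \in [0,1]:
\[
\mathbb{E}\bigl[(h(x)-y)^2\bigr]
= \mathbb{E}\Bigl[\bigl(h(x)-\tfrac12\bigr)^2\Bigr] + \tfrac14
\;\ge\; \tfrac14 .
\]
In words, when the labels carry no information about the inputs, no hypothesis with bounded outputs can drive the expected loss below a constant, which is what separates the random case from the pseudorandom one.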
Appendix B Proof of Corollary 3.1
By the proof of Theorem 3.1, under Assumption 2.1, there is no -time algorithm that satisfies the following: Let be -bounded parameters of a depth- network , and let . Assume that , and that the widths of the hidden layers in are (i.e., the weight matrices are square). Let and let . Then, with probability at least , given access to an examples oracle for , the algorithm returns a hypothesis with .
Note that in the above, the requirements from are somewhat weaker than in our original definition of learning with smoothed parameters. Indeed, we assume that the widths of the hidden layers are and the required success probability is only (rather than ). We now explain why the hardness result holds already under these conditions:
• Note that if we change the assumption on the learning algorithm in the proof of Theorem 3.1 such that it succeeds with probability at least (rather than ), then in the case where is pseudorandom we get that the algorithm returns with probability at least (see the proof of Lemma A.9), which is still greater than . Also, the analysis of the case where is random does not change, and thus in this case returns with probability greater than . Consequently, we still get a distinguishing advantage greater than .
• Regarding the requirement on the widths, we note that in the proof of Theorem 3.1 the layers satisfy the following. The input dimension is , the width of the first hidden layer is at most , and the width of the second hidden layer is at most (all bounds are for a sufficiently large ). In order to get a network where all layers are of width , we add new neurons to the hidden layers, with incoming weights , outgoing weights , and bias terms . Then, for an appropriate choice of , even in the perturbed network the outputs of these new neurons will be w.h.p. for every input , and thus they will not affect the network's output (see the sketch below). Thus, using the same argument as in the proof of Theorem 3.1, we conclude that the hardness result holds already for networks with square weight matrices.
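For illustration, here is a minimal sketch of this padding step in Python/NumPy; the zero weights and the strongly negative placeholder bias are assumptions made for this sketch, while the exact constants (and the choice of the bias magnitude) are the ones specified in the proof.

import numpy as np

def pad_hidden_layer(W: np.ndarray, b: np.ndarray, target_width: int,
                     neg_bias: float = -1e6) -> tuple:
    # W has shape (fan_in, width). Append columns of zero incoming weights and
    # strongly negative bias terms; even after a small random perturbation, the
    # new ReLU neurons then output 0 w.h.p. on bounded inputs, so (with zero
    # outgoing weights as well) they do not affect the network's output.
    fan_in, width = W.shape
    extra = target_width - width
    W_pad = np.hstack([W, np.zeros((fan_in, extra))])
    b_pad = np.concatenate([b, np.full(extra, neg_bias)])
    return W_pad, b_pad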
Suppose that there exists an efficient algorithm that learns, in the standard PAC framework, depth- neural networks in which the minimal singular value of each weight matrix is lower bounded by for any polynomial . We will use to obtain an efficient algorithm that learns depth- networks with smoothed parameters as described above, and thus reach a contradiction.
Let be -bounded parameters of a depth- network , and let . Assume that , and that the widths of the hidden layers in are . For random and , the algorithm has access to examples labeled by . Using Lemma B.1 below with and the union bound over the two weight matrices in , we get that with probability at least (for large enough ), the minimal singular values of all weight matrices in are at least for some sufficiently large polynomial . Our algorithm will simply run . Given that the minimal singular values of the weight matrices are at least , the algorithm runs in time and returns with probability at least a hypothesis with . Overall, the algorithm runs in time, and with probability at least (over both and the internal randomness) returns a hypothesis with loss at most .
Lemma B.1 (Sankar et al. [2006], Theorem 3.3).
Let be an arbitrary square matrix in , and let be a random matrix, where each entry is drawn i.i.d. from for some . Let be the minimal singular value of the matrix . Then, for every we have
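This bound is usually quoted in the following form (the precise constant is immaterial for the argument): if \bar{A} \in \mathbb{R}^{n \times n} is arbitrary and A = \bar{A} + G, where the entries of G are i.i.d. \mathcal{N}(0,\sigma^2), then for every \varepsilon > 0,
\[
\Pr\bigl[\sigma_{\min}(A) \le \varepsilon\bigr] \;\le\; 2.35\,\frac{\varepsilon\sqrt{n}}{\sigma}.
\]
In particular, choosing \varepsilon inverse-polynomially small makes the failure probability inverse-polynomially small, which is how the lemma is applied above.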
Appendix C Proof of Theorem 3.2
The proof follows similar ideas to the proof of Theorem 3.1. The main difference is that we need to handle here a smoothed discrete input distribution rather than the standard Gaussian distribution.
For a sufficiently large , let be a distribution on , where each component is drawn i.i.d. from a Bernoulli distribution which takes the value with probability . Assume that there is a -time algorithm that learns depth- neural networks with at most hidden neurons and parameter magnitudes bounded by , with smoothed parameters and inputs, under the distribution , with and that we will specify later. Let be the sample complexity of , namely, uses a sample of size at most and returns with probability at least a hypothesis with . Note that is the distribution after smoothing with parameter , and the vector is the parameters of the target network after smoothing with parameter . Let be a constant such that for every sufficiently large . By Assumption 2.1, there exists a constant and a predicate , such that is -PRG. We will show an efficient algorithm with distinguishing advantage greater than and thus reach a contradiction.
Throughout this proof, we will use some notations from the proof of Theorem 3.1. We repeat them here for convenience. For a hyperedge we denote by the following encoding of : the vector is a concatenation of vectors in , such that the -th vector has in the -th coordinate and elsewhere. Thus, consists of size- slices, each encoding a member of . For , and , we denote . That is, is the -th component in the -th slice in . For , let be such that for every hyperedge we have . For we denote , namely, the first components of (assuming ).
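To make the slice encoding concrete, the following sketch (in Python; the function names and the parameters n and k are ours and not part of the paper's notation) maps a hyperedge to the concatenation of one-hot slices and checks whether a binary vector is such an encoding.

from typing import List, Sequence

def encode_hyperedge(edge: Sequence[int], n: int) -> List[int]:
    # One size-n slice per member of the hyperedge; the j-th slice is the
    # one-hot indicator of the j-th member.
    x: List[int] = []
    for i in edge:
        one_hot = [0] * n
        one_hot[i] = 1
        x.extend(one_hot)
    return x

def is_hyperedge_encoding(x: Sequence[int], n: int, k: int) -> bool:
    # x encodes a hyperedge iff every size-n slice contains exactly one 1-bit
    # and the slices encode pairwise-distinct indices.
    slices = [list(x[j * n:(j + 1) * n]) for j in range(k)]
    if any(sum(s) != 1 for s in slices):
        return False
    indices = [s.index(1) for s in slices]
    return len(set(indices)) == k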
C.1 Defining the target network for
Since our goal is to use the algorithm for breaking PRGs, in this subsection we define a neural network that we will later use as a target network for . The network contains the subnetworks that we define below.
Let be a depth- neural network (i.e., one layer, with activations in the output neurons) with input dimension , at most output neurons, and parameter magnitudes bounded by (all bounds are for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be an input to such that for some hyperedge . Thus, even though takes inputs in , we now consider its behavior for an input with discrete components in . Fix some . Then, for with the inputs to all output neurons are at most , and for with there exists a neuron in with input at least . Recall that our definition of a neuron’s input includes the addition of the bias term. The construction of the network is given in Lemma A.2. Note that the network depends on . Let be a depth- neural network with no activation function in the output neuron, obtained from by summing the outputs from all neurons .
Let be a depth- neural network (i.e., one layer, with activations in the output neurons) with input dimension , at most output neurons, and parameter magnitudes bounded by (for a sufficiently large ), which satisfies the following. We denote the set of output neurons of by . Let be an input to (note that it has components only in ). If is an encoding of a hyperedge then the inputs to all output neurons are at most , and otherwise there exists a neuron in with input at least . The construction of the network is given in Lemma A.4. Let be a depth- neural network with no activation function in the output neuron, obtained from by summing the outputs from all neurons .
Let be a depth- network obtained from as follows. For we have . The network has at most neurons, and parameter magnitudes bounded by (all bounds are for a sufficiently large ). Finally, let be a depth- neural network such that .
C.2 Defining the noise magnitudes and analyzing the perturbed network under perturbed inputs
In order to use the algorithm w.r.t. some neural network with parameters and a certain input distribution, we need to implement an examples oracle, such that the examples are drawn from a smoothed input distribution, and labeled according to a neural network with parameters , where is a random perturbation. Specifically, we use with an examples oracle where the input distribution is obtained from by smoothing, and the labels correspond to a network obtained from (w.r.t. an appropriate in the construction of ) by adding a small perturbation to the parameters. The smoothing magnitudes of the inputs and the network’s parameters (respectively) are such that the following hold.
We first choose the parameter as follows. Let be any depth- neural network parameterized by for some with at most neurons, and parameter magnitudes bounded by (note that is polynomial in ). Then, is such that with probability at least over , we have for all , and the network is such that for every input with and every neuron we have: Let be the inputs to the neuron in the computations and (respectively), then . Thus, is sufficiently small, such that w.h.p. adding i.i.d. noise to each parameter does not change the inputs to the neurons by more than . Note that such an inverse-polynomial exists, since when the network size, parameter magnitudes, and input size are bounded by some , then the input to each neuron in is -Lipschitz as a function of , and thus it suffices to choose that implies with probability at least that for a sufficiently large polynomial (see Lemma A.7 for details).
Next, we choose the parameter as follows. Let be any depth- neural network parameterized by with at most neurons, and parameter magnitudes bounded by . Then, is such that for every , with probability at least over the following holds for every neuron in the : Let be the inputs to the neuron in the computations and (respectively), then . Thus, is sufficiently small, such that w.h.p. adding noise to the input does not change the inputs to the neurons by more than . Note that such an inverse-polynomial exists, since when the network size and parameter magnitudes are bounded by some , then the input to each neuron in is -Lipschitz as a function of , and thus it suffices to choose that implies with probability at least that for a sufficiently large polynomial (see Lemma A.7 for details).
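To illustrate the Lipschitz argument in both of the paragraphs above (with generic symbols: d perturbed coordinates, noise scale \sigma, Lipschitz constant L, and failure probability \delta, none of which are the paper's notation), one can combine
\[
\Pr\Bigl[\|\xi\|_2 \ge \sigma\bigl(\sqrt{d} + \sqrt{2\log(1/\delta)}\bigr)\Bigr] \le \delta,
\qquad \xi \sim \mathcal{N}(0,\sigma^2 I_d),
\]
with the fact that each neuron's input moves by at most L\|\xi\|_2; choosing \sigma inverse-polynomially small then keeps this movement below any fixed inverse-polynomial margin with high probability.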
Let be the parameters of the network . Recall that the parameters vector is the concatenation of all weight matrices and bias terms. Let be the parameters of , namely, where . By our choice of and the construction of the networks , with probability at least over , for every the following holds: Let and let . Then with probability at least over the differences between inputs to all neurons in the computations and are at most . Indeed, w.h.p. for all the computations and are roughly similar (up to change of in the input to each neuron), and w.h.p. the computations and are roughly similar (up to change of in the input to each neuron). Thus, with probability at least over , the network is such that for every , we have with probability at least over that the computation satisfies the following properties, where :
(Q1) If for some hyperedge , then the inputs to satisfy:
  • If the inputs to all neurons in are at most .
  • If there exists a neuron in with input at least .
(Q2) The inputs to satisfy:
  • If is an encoding of a hyperedge then the inputs to all neurons are at most .
  • Otherwise, there exists a neuron in with input at least .
C.3 Stating the algorithm
Given a sequence , where are i.i.d. random hyperedges, the algorithm needs to distinguish whether is random or for a random . Let .
We use the efficient algorithm in order to obtain distinguishing advantage greater than as follows. Let be a random perturbation, and let be the perturbed network as defined above, w.r.t. the unknown . Note that given a perturbation , only the weights in the second layer of the subnetwork in are unknown, since all other parameters do not depend on . The algorithm runs with the following examples oracle. In the -th call, the oracle first draws such that each component is drawn i.i.d. from a Bernoulli distribution which takes the value with probability . If is an encoding of a hyperedge then the oracle replaces with . Let be such that , and the other components of are drawn i.i.d. from a Bernoulli distribution which takes the value with probability . Note that the vector has the distribution , since replacing an encoding of a random hyperedge by an encoding of another random hyperedge does not change the distribution of . Let , where . Note that has the distribution . Let be the bias term of the output neuron of . The oracle returns , where the labels are chosen as follows:
• If is not an encoding of a hyperedge, then .
• If is an encoding of a hyperedge:
  – If we set .
  – If we set .
Let be the hypothesis returned by . Recall that uses at most examples, and hence contains at least examples that cannot view. We denote the indices of these examples by , and the examples by . By additional calls to the oracle, the algorithm obtains the examples that correspond to . Let be a hypothesis such that for all we have ; thus, for , the hypothesis is obtained from by clipping the output to the interval . Let . Now, if , then returns , and otherwise it returns . We remark that the decision of our algorithm is based on (rather than ), since we need the outputs to be bounded in order to allow using Hoeffding’s inequality in our analysis, which we discuss in the next subsection.
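Schematically, the decision procedure described in this subsection can be summarized by the following sketch (a rough illustration only: the sample sizes, the clipping interval, and the acceptance threshold are placeholders for the concrete quantities used in the proof, and the squared loss is assumed here for concreteness).

def distinguish(oracle, learner, m_train: int, m_holdout: int,
                clip_lo: float, clip_hi: float, threshold: float) -> int:
    # Run the learner on the examples it is allowed to see.
    train = [oracle() for _ in range(m_train)]
    h = learner(train)

    # Obtain fresh examples, which the learner has not seen, by additional oracle calls.
    holdout = [oracle() for _ in range(m_holdout)]

    # Clip the hypothesis so that the per-example losses are bounded,
    # which is what makes Hoeffding's inequality applicable.
    def h_clipped(x):
        return min(max(h(x), clip_lo), clip_hi)

    # Empirical loss of the clipped hypothesis on the held-out examples.
    emp_loss = sum((h_clipped(x) - y) ** 2 for (x, y) in holdout) / m_holdout

    # A small empirical loss is evidence that the labels were realizable by the
    # target network, i.e., that the underlying sequence was pseudorandom.
    return 1 if emp_loss <= threshold else 0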
C.4 Analyzing the algorithm
Note that the algorithm runs in time. We now show that if is pseudorandom then returns with probability greater than , and if is random then returns with probability less than . To that end, we use similar arguments to the proof of Theorem 3.1.
In Lemma C.1, we show that if is pseudorandom then with probability at least (over and for all ) the examples returned by the oracle are realized by . Recall that the algorithm is such that with probability at least (over , the i.i.d. inputs , and possibly its internal randomness), given a size- dataset labeled by , it returns a hypothesis such that . Hence, with probability at least the algorithm returns such a good hypothesis , given examples labeled by our examples oracle. Indeed, note that can return a bad hypothesis only if the random choices are either bad for (when used with realizable examples) or bad for the realizability of the examples returned by our oracle. By the definition of and the construction of , if has small error then also has small error, namely,
Let . Recall that by our choice of we have . Since for all , by Hoeffding's inequality we have for a sufficiently large that
Moreover, by Lemma C.1,
Overall, by the union bound we have with probability at least for sufficiently large that:
• .
• .
• .
Combining the above, we get that if is pseudorandom, then with probability greater than we have
We now consider the case where is random. For an example returned by the oracle, we denote . Thus, is the input that the oracle used before adding the additional components and adding noise . Let be such that iff for some hyperedge . If is random, then by the definition of our examples oracle, for every such that , we have with probability and otherwise. Also, by the definition of the oracle, is independent of , independent of the additional components that were added, and independent of the noise that corresponds to .
If then for a sufficiently large the hypothesis satisfies for each random example the following:
In Lemma A.10, we show that for a sufficiently large we have . Hence,
Thus, if then we have
Therefore, for large we have
Since for all returned by the examples oracle, and the examples for are i.i.d., by Hoeffding's inequality we have for a sufficiently large that
Hence, for large enough , with probability at least we have both and , and thus
Overall, if is pseudorandom then with probability greater than the algorithm returns , and if is random then with probability greater than the algorithm returns . Thus, the distinguishing advantage is greater than . This concludes the proof of the theorem. It remains to prove the deferred lemma on the realizability of the examples returned by the examples oracle:
Lemma C.1.
If is pseudorandom then with probability at least over and for , the examples returned by the oracle are realized by .
Proof.
By our choice of and and the construction of , with probability at least over , we have for all , and for every the following holds: Let and let . Then with probability at least over the inputs to the neurons in the computation satisfy Properties (Q1) and (Q2). Hence, with probability at least (for a sufficiently large ), for all , and Properties (Q1) and (Q2) hold for the computations for all . It remains to show that if for all and Properties (Q1) and (Q2) hold, then the examples are realized by .
Let . We denote , namely, the -th example returned by the oracle was obtained by adding noise to . We also denote . Since for all , and all incoming weights to the output neuron in are , then in all incoming weights to the output neuron are in , and the bias term in the output neuron, denoted by , is in . Consider the following cases:
• If is not an encoding of a hyperedge then . Moreover, in the computation , there exists a neuron in with output at least (by Property (Q2)). Since all incoming weights to the output neuron in are in , and , then the input to the output neuron (including the bias term) is at most , and thus its output is .
• If is an encoding of a hyperedge , then by the definition of the examples oracle we have . Hence:
  – If then the oracle sets . Since is pseudorandom, we have . Hence, in the computation there exists a neuron in with output at least (by Property (Q1)). Since all incoming weights to the output neuron in are in , and , then the input to the output neuron (including the bias term) is at most , and thus its output is .
∎