
Computational Complexity of Learning Neural Networks: Smoothness and Degeneracy

Amit Daniely
Hebrew University and Google
[email protected]
   Nathan Srebro
TTI-Chicago
[email protected]
   Gal Vardi
TTI-Chicago and Hebrew University
[email protected]
   Collaboration on the Theoretical Foundations of Deep Learning (deepfoundations.ai)
Abstract

Understanding when neural networks can be learned efficiently is a fundamental question in learning theory. Existing hardness results suggest that assumptions on both the input distribution and the network’s weights are necessary for obtaining efficient algorithms. Moreover, it was previously shown that depth-2 networks can be efficiently learned under the assumptions that the input distribution is Gaussian and the weight matrix is non-degenerate. In this work, we study whether such assumptions may suffice for learning deeper networks and prove negative results. We show that learning depth-3 ReLU networks under the Gaussian input distribution is hard even in the smoothed-analysis framework, where random noise is added to the network’s parameters. This implies that learning depth-3 ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Moreover, we consider depth-2 networks, and show hardness of learning in the smoothed-analysis framework, where both the network parameters and the input distribution are smoothed. Our hardness results are under a well-studied assumption on the existence of local pseudorandom generators.

1 Introduction

The computational complexity of learning neural networks has been extensively studied in recent years, and there has been much effort in obtaining both upper bounds and hardness results. Nevertheless, it is still unclear when neural networks can be learned in polynomial time, namely, under what assumptions provably efficient algorithms exist.

Existing results imply hardness already for learning depth-2 ReLU networks in the standard PAC learning framework (e.g., Klivans and Sherstov (2006); Daniely and Shalev-Shwartz (2016)). Thus, without any assumptions on the input distribution or the network’s weights, efficient learning algorithms might not be achievable. Even when assuming that the input distribution is Gaussian, strong hardness results were obtained for depth-3 ReLU networks (Daniely and Vardi, 2021; Chen et al., 2022a), suggesting that assumptions merely on the input distribution might not suffice. Also, a hardness result by Daniely and Vardi (2020) shows that strong assumptions merely on the network’s weights (without restricting the input distribution) might not suffice even for efficiently learning depth-2 networks. The aforementioned hardness results hold already for improper learning, namely, where the learning algorithm is allowed to return a hypothesis that does not belong to the considered hypothesis class. Thus, a combination of assumptions on the input distribution and the network’s weights seems to be necessary for obtaining efficient algorithms.

Several polynomial-time algorithms for learning depth-2 neural networks have been obtained under assumptions on the input distribution and the network’s weights (Awasthi et al., 2021; Janzamin et al., 2015; Zhong et al., 2017; Ge et al., 2017; Bakshi et al., 2019; Ge et al., 2018). In these works, it is assumed that the weight matrices are non-degenerate: they either assume that the condition number of the weight matrix is bounded, or make some similar non-degeneracy assumption. Specifically, Awasthi et al. (2021) gave an efficient algorithm for learning depth-2 (one-hidden-layer) ReLU networks, possibly with bias terms in the hidden neurons, under the assumption that the input distribution is Gaussian and the weight matrix is non-degenerate. The non-degeneracy assumption holds w.h.p. if we add a small random noise to any weight matrix, and hence their result implies efficient learning of depth-2 ReLU networks under the Gaussian distribution in the smoothed-analysis framework.

The positive results on depth-2 networks suggest the following question:

Is there an efficient algorithm for learning ReLU networks of depth larger than 2 under the Gaussian distribution, where the weight matrices are non-degenerate, or in the smoothed-analysis framework where the network’s parameters are smoothed?

In this work, we give a negative answer to this question, already for depth-3 networks (we note that in our results the neural networks have the ReLU activation also in the output neuron). We show that learning depth-3 ReLU networks under the Gaussian distribution is hard even in the smoothed-analysis framework, where random noise is added to the network’s parameters. As a corollary, we show that learning depth-3 ReLU networks under the Gaussian distribution is hard even if the weight matrices are non-degenerate. Our hardness results are under a well-studied cryptographic assumption on the existence of local pseudorandom generators (PRG) with polynomial stretch.

Motivated by the existing positive results on smoothed-analysis in depth-2 networks, we also study whether learning depth-2 networks with smoothed parameters can be done under weak assumptions on the input distribution. Specifically, we consider the following question:

Is there an efficient algorithm for learning depth-2 ReLU networks in the smoothed-analysis framework, where both the network’s parameters and the input distribution are smoothed?

We give a negative answer to this question, by showing hardness of learning depth-2 ReLU networks where random noise is added to the network’s parameters, and the input distribution on $\mathbb{R}^d$ is obtained by smoothing an i.i.d. Bernoulli distribution on $\{0,1\}^d$. This hardness result is also under the assumption on the existence of local PRGs.

Related work

Hardness of learning neural networks.

Hardness of improperly learning depth-2 neural networks follows from hardness of improperly learning DNFs or intersections of halfspaces, since these classes can be expressed by depth-2 networks. Klivans and Sherstov (2006) showed, assuming the hardness of the shortest vector problem, that learning intersections of halfspaces is hard. Hardness of learning DNF formulas is implied by Applebaum et al. (2010) under a combination of two assumptions: the first is related to the planted dense subgraph problem in hypergraphs, and the second is related to local PRGs. Daniely and Shalev-Shwartz (2016) showed hardness of learning DNFs under a common assumption, namely, that refuting a random K-SAT formula is hard. All of the above results are distribution-free, namely, they do not imply hardness of learning neural networks under some specific distribution.

Applebaum and Raykov (2016) showed, under an assumption on a specific candidate for Goldreich’s PRG (i.e., based on a predicate called $\operatorname{XOR-MAJ}$), that learning depth-3 Boolean circuits under the uniform distribution on the hypercube is hard. Daniely and Vardi (2021) proved distribution-specific hardness of learning Boolean circuits of depth 2 (namely, DNFs) and depth 3, under the assumption on the existence of local PRGs that we also use in this work. For DNF formulas, they showed hardness of learning under a distribution where each component is drawn i.i.d. from a Bernoulli distribution (which is not uniform). For depth-3 Boolean circuits, they showed hardness of learning under the uniform distribution on the hypercube. Since these Boolean circuits can be expressed by ReLU networks of the same depth, these results readily translate to distribution-specific hardness of learning neural networks. Chen et al. (2022a) showed hardness of learning depth-2 neural networks under the uniform distribution on the hypercube, based on an assumption on the hardness of the Learning with Rounding (LWR) problem. Note that the input distributions in the above results are supported on the hypercube, and they do not immediately imply hardness of learning neural networks under continuous distributions.

When considering the computational complexity of learning neural networks, perhaps the most natural choice of input distribution is the standard Gaussian distribution. Daniely and Vardi (2021) established hardness of learning depth-3 networks under this distribution, based on the assumption on the existence of local PRGs. Chen et al. (2022a) also showed hardness of learning depth-3 networks under the Gaussian distribution, but their result holds already for networks that do not have an activation function in the output neuron, and it is based on the LWR assumption. They also showed hardness of learning constant-depth ReLU networks from label queries (i.e., where the learner has the ability to query the value of the target network at any desired input) under the Gaussian distribution, based either on the decisional Diffie-Hellman or the Learning with Errors assumptions.

The above results suggest that assumptions on the input distribution might not suffice for achieving an efficient algorithm for learning depth-3 neural networks. A natural question is whether assumptions on the network weights may suffice. Daniely and Vardi (2020) showed (under the assumption that refuting a random K-SAT formula is hard) that distribution-free learning of depth-2 neural networks is hard already if the weights are drawn from some “natural” distribution or satisfy some “natural” properties. Thus, if we do not impose any assumptions on the input distribution, then even very strong assumptions on the network’s weights do not suffice for efficient learning.

Several works in recent years have shown hardness of distribution-specific learning of shallow neural networks using gradient methods or statistical query (SQ) algorithms (Shamir, 2018; Song et al., 2017; Vempala and Wilmes, 2019; Goel et al., 2020a; Diakonikolas et al., 2020b; Chen et al., 2022a). Specifically, Goel et al. (2020a) and Diakonikolas et al. (2020b) gave superpolynomial correlational SQ (CSQ) lower bounds for learning depth-2 networks under the Gaussian distribution. Since there is a separation between CSQ and SQ, these results do not imply hardness of general SQ learning. Chen et al. (2022a) showed general SQ lower bounds for depth-3 networks under the Gaussian distribution. It is worth noting that while the SQ framework captures some variants of the gradient-descent algorithm, it does not capture, for example, stochastic gradient-descent (SGD), which examines training points individually (see a discussion in Goel et al. (2020a)).

We emphasize that none of the above distribution-specific hardness results for neural networks (either for improper learning or SQ learning) holds in the smoothed analysis framework or for non-degenerate weights.

While hardness of proper learning is implied by hardness of improper learning, there are also works that show hardness of properly learning depth-2 networks under more standard assumptions (cf. Goel et al. (2020c)). Finally, distribution-specific hardness of learning a single ReLU neuron in the agnostic setting was shown in Goel et al. (2019, 2020b); Diakonikolas et al. (2020a).

Learning neural networks in polynomial time.

Awasthi et al. (2021) gave a polynomial-time algorithm for learning depth-2 (one-hidden-layer) ReLU networks, under the assumption that the input distribution is Gaussian and the weight matrix of the target network is non-degenerate. Their algorithm is based on tensor decomposition, and it can handle bias terms in the hidden layer. Their result also implies that depth-2 ReLU networks with Gaussian inputs can be learned efficiently in the smoothed-analysis framework. Our work implies that such a result might not be possible for depth-3 networks (with activation in the output neuron). Prior to Awasthi et al. (2021), several works gave polynomial-time algorithms for learning depth-2 neural networks where the input distribution is Gaussian and the weight matrix is non-degenerate (Janzamin et al., 2015; Zhong et al., 2017; Ge et al., 2017, 2018; Bakshi et al., 2019), but these works either do not handle the presence of bias terms or do not handle the ReLU activation. Some of the aforementioned works consider networks with multiple outputs, and allow certain non-Gaussian input distributions.

Provable guarantees for learning neural networks in super-polynomial time were given in Diakonikolas et al. (2020b); Chen et al. (2022b); Diakonikolas and Kane (2020); Goel and Klivans (2019); Vempala and Wilmes (2019); Zhang et al. (2016); Goel et al. (2017). Distribution-free learning of a single ReLU neuron (in the realizable setting) can be done in polynomial time (Kalai and Sastry, 2009; Kakade et al., 2011).

2 Preliminaries

2.1 Notation

We use bold-face letters to denote vectors, e.g., $\mathbf{x}=(x_1,\ldots,x_d)$. For a vector $\mathbf{x}$ and a sequence $S=(i_1,\ldots,i_k)$ of $k$ indices, we let $\mathbf{x}_S=(x_{i_1},\ldots,x_{i_k})$, i.e., the restriction of $\mathbf{x}$ to the indices $S$. We denote by $\mathbbm{1}[\cdot]$ the indicator function; for example, $\mathbbm{1}[t\geq 5]$ equals $1$ if $t\geq 5$ and $0$ otherwise. For an integer $d\geq 1$ we denote $[d]=\{1,\ldots,d\}$. We denote by ${\cal N}(0,\sigma^2)$ the normal distribution with mean $0$ and variance $\sigma^2$, and by ${\cal N}(\mathbf{0},\Sigma)$ the multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$. The identity matrix of size $d$ is denoted by $I_d$. For $\mathbf{x}\in\mathbb{R}^d$ we denote by $\left\|\mathbf{x}\right\|$ the Euclidean norm. We use $\operatorname{poly}(a_1,\ldots,a_n)$ to denote a polynomial in $a_1,\ldots,a_n$.

2.2 Local pseudorandom generators

An $(n,m,k)$-hypergraph is a hypergraph over $n$ vertices $[n]$ with $m$ hyperedges $S_1,\ldots,S_m$, each of cardinality $k$. Each hyperedge $S=(i_1,\ldots,i_k)$ is ordered, and all $k$ members of a hyperedge are distinct. We let ${\cal G}_{n,m,k}$ be the distribution over such hypergraphs in which a hypergraph is chosen by picking each hyperedge uniformly and independently at random among all $n\cdot(n-1)\cdots(n-k+1)$ possible ordered hyperedges. Let $P:\{0,1\}^k\rightarrow\{0,1\}$ be a predicate, and let $G$ be an $(n,m,k)$-hypergraph. Goldreich’s pseudorandom generator (PRG) (Goldreich, 2000) is the function $f_{P,G}:\{0,1\}^n\rightarrow\{0,1\}^m$ such that for $\mathbf{x}\in\{0,1\}^n$, we have $f_{P,G}(\mathbf{x})=(P(\mathbf{x}_{S_1}),\ldots,P(\mathbf{x}_{S_m}))$. The integer $k$ is called the locality of the PRG. If $k$ is a constant then the PRG and the predicate $P$ are called local. We say that the PRG has polynomial stretch if $m=n^s$ for some constant $s>1$. We let ${\cal F}_{P,n,m}$ denote the collection of functions $f_{P,G}$ where $G$ is an $(n,m,k)$-hypergraph. We sample a function from ${\cal F}_{P,n,m}$ by choosing a random hypergraph $G$ from ${\cal G}_{n,m,k}$.
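To make the construction concrete, the following Python sketch (an illustration under our notation, not code from the paper) samples a random $(n,m,k)$-hypergraph from ${\cal G}_{n,m,k}$ and evaluates $f_{P,G}$ on a seed $\mathbf{x}$; the predicate used here is an arbitrary toy choice of locality $k=5$.

```python
import random

def sample_hypergraph(n, m, k, rng):
    # G ~ G_{n,m,k}: each hyperedge is an ordered k-tuple of distinct
    # vertices (0-based here), drawn uniformly and independently.
    return [tuple(rng.sample(range(n), k)) for _ in range(m)]

def goldreich_prg(P, G, x):
    # f_{P,G}(x) applies the predicate P to the restriction x_S of the
    # seed x to each hyperedge S of G.
    return [P(tuple(x[i] for i in S)) for S in G]

# Toy predicate (an arbitrary choice for illustration).
P = lambda b: (b[0] ^ b[1] ^ b[2]) | (b[3] & b[4])
rng = random.Random(0)
n, m, k = 100, 1000, 5
G = sample_hypergraph(n, m, k, rng)
x = [rng.randint(0, 1) for _ in range(n)]
y = goldreich_prg(P, G, x)  # a length-m output string in {0,1}^m
```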

We denote by $G\xleftarrow{R}{\cal G}_{n,m,k}$ the operation of sampling a hypergraph $G$ from ${\cal G}_{n,m,k}$, and by $\mathbf{x}\xleftarrow{R}\{0,1\}^n$ the operation of sampling $\mathbf{x}$ from the uniform distribution on $\{0,1\}^n$. We say that ${\cal F}_{P,n,m}$ is an $\varepsilon$-pseudorandom generator ($\varepsilon$-PRG) if for every polynomial-time probabilistic algorithm ${\cal A}$ the distinguishing advantage

$$\left|\Pr_{G\xleftarrow{R}{\cal G}_{n,m,k},\,\mathbf{x}\xleftarrow{R}\{0,1\}^{n}}\left[{\cal A}(G,f_{P,G}(\mathbf{x}))=1\right]-\Pr_{G\xleftarrow{R}{\cal G}_{n,m,k},\,\mathbf{y}\xleftarrow{R}\{0,1\}^{m}}\left[{\cal A}(G,\mathbf{y})=1\right]\right|$$

is at most $\varepsilon$. Thus, the distinguisher ${\cal A}$ is given a random hypergraph $G$ and a string $\mathbf{y}\in\{0,1\}^m$, and its goal is to distinguish between the case where $\mathbf{y}$ is chosen at random, and the case where $\mathbf{y}$ is a random image of $f_{P,G}$.

Our assumption is that local PRGs with polynomial stretch and constant distinguishing advantage exist:

Assumption 2.1.

For every constant $s>1$, there exists a constant $k$ and a predicate $P:\{0,1\}^k\rightarrow\{0,1\}$ such that ${\cal F}_{P,n,n^s}$ is a $\frac{1}{3}$-PRG.

We remark that the same assumption was used by Daniely and Vardi (2021) to show hardness-of-learning results. Local PRGs have been extensively studied in the last two decades. In particular, local PRGs with polynomial stretch have been shown to have remarkable applications, such as secure computation with constant computational overhead (Ishai et al., 2008; Applebaum et al., 2017), and general-purpose obfuscation based on constant-degree multilinear maps (cf. Lin (2016); Lin and Vaikuntanathan (2016)). Significant evidence for Assumption 2.1 was shown in Applebaum (2013). Moreover, a concrete candidate for a local PRG, based on the XOR-MAJ predicate, was shown to be secure against all known attacks (Applebaum and Lovett, 2016; Couteau et al., 2018; Méaux et al., 2019; Applebaum and Raykov, 2016). See Daniely and Vardi (2021) for further discussion on the assumption, and on prior work regarding the relation between Goldreich’s PRG and hardness of learning.

2.3 Neural networks

We consider feedforward ReLU networks. Starting from an input $\mathbf{x}\in\mathbb{R}^d$, each layer in the network is of the form $\mathbf{z}\mapsto\sigma(W_i\mathbf{z}+\mathbf{b}_i)$, where $\sigma(a)=[a]_+=\max\{0,a\}$ is the ReLU activation, applied to vectors entry-wise, $W_i$ is the weight matrix, and $\mathbf{b}_i$ is the vector of bias terms. The weights vector of the $j$-th neuron in the $i$-th layer is the $j$-th row of $W_i$, and its outgoing-weights vector is the $j$-th column of $W_{i+1}$. We define the depth of the network as the number of layers. Unless stated otherwise, the output neuron also has a ReLU activation function. Note that a depth-$k$ network with activation in the output neuron has $k$ non-linear layers. A neuron which is not an input or output neuron is called a hidden neuron. We sometimes consider neural networks with multiple outputs. The parameters of a neural network are the set of its weight matrices and bias vectors. We often view the parameters as a vector ${\boldsymbol{\theta}}\in\mathbb{R}^p$ obtained by concatenating these matrices and vectors. For $B\geq 0$, we say that the parameters are $B$-bounded if the absolute values of all weights and biases are at most $B$.
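As a minimal illustration of this convention (ours, not code from the paper), the following Python sketch computes such a network with a ReLU also in the output neuron; the layer widths are arbitrary.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases, activation_on_output=True):
    # Each layer computes z -> relu(W_i z + b_i). The depth equals
    # len(weights); with activation_on_output=True the output neuron also
    # applies a ReLU, so a depth-k network has k non-linear layers.
    z = x
    for i, (W, b) in enumerate(zip(weights, biases)):
        pre = W @ z + b
        last = (i == len(weights) - 1)
        z = pre if (last and not activation_on_output) else relu(pre)
    return z

# A depth-3 network R^4 -> R with 5 neurons in each hidden layer.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), rng.normal(size=(1, 5))]
bs = [rng.normal(size=5), rng.normal(size=5), rng.normal(size=1)]
y = forward(rng.normal(size=4), Ws, bs)
```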

2.4 Learning neural networks and the smoothed-analysis framework

We first define learning neural networks under the standard PAC framework:

Definition 2.1 (Distribution-specific PAC learning).

Learning depth-$k$ neural networks under an input distribution ${\cal D}$ on $\mathbb{R}^d$ is defined by the following framework:

  1. An adversary chooses a set of $B$-bounded parameters ${\boldsymbol{\theta}}\in\mathbb{R}^p$ for a depth-$k$ neural network $N_{\boldsymbol{\theta}}:\mathbb{R}^d\to\mathbb{R}$, as well as some $\epsilon>0$.

  2. Consider an examples oracle, such that each example $(\mathbf{x},y)\in\mathbb{R}^d\times\mathbb{R}$ is drawn i.i.d. with $\mathbf{x}\sim{\cal D}$ and $y=N_{{\boldsymbol{\theta}}}(\mathbf{x})$. Then, given access to the examples oracle, the goal of the learning algorithm ${\cal L}$ is to return with probability at least $\frac{3}{4}$ a hypothesis $h:\mathbb{R}^d\to\mathbb{R}$ such that

     $$\operatorname*{\mathbb{E}}_{\mathbf{x}\sim{\cal D}}\left[\left(h(\mathbf{x})-N_{{\boldsymbol{\theta}}}(\mathbf{x})\right)^{2}\right]\leq\epsilon~.$$

     We say that ${\cal L}$ is efficient if it runs in time $\operatorname{poly}(d,p,B,1/\epsilon)$.

We consider learning in the smoothed-analysis framework (Spielman and Teng, 2004), which is a popular paradigm for analyzing non-worst-case computational complexity (Roughgarden, 2021). The smoothed-analysis framework has been successfully applied to many learning problems (e.g., Awasthi et al. (2021); Ge et al. (2018); Kalai et al. (2009); Bhaskara et al. (2014, 2019); Haghtalab et al. (2020); Ge et al. (2015); Kalai and Teng (2008); Brutzkus et al. (2020)). In the smoothed-analysis setting, the target network is not purely controlled by an adversary. Instead, the adversary can first generate an arbitrary network, and the parameters for this network (i.e., the weight matrices and bias terms) will be randomly perturbed to yield a perturbed network. The algorithm only needs to work with high probability on the perturbed network. This limits the power of the adversary and prevents it from creating highly degenerate cases. Formally, we consider the following framework (we note that a similar model was considered in Awasthi et al. (2021) and Ge et al. (2018)):

Definition 2.2 (Learning with smoothed parameters).

Learning depth-$k$ neural networks with smoothed parameters under an input distribution ${\cal D}$ is defined as follows:

  1. An adversary chooses a set of $B$-bounded parameters ${\boldsymbol{\theta}}\in\mathbb{R}^p$ for a depth-$k$ neural network $N_{\boldsymbol{\theta}}:\mathbb{R}^d\to\mathbb{R}$, as well as some $\tau,\epsilon>0$.

  2. A perturbed set of parameters $\hat{{\boldsymbol{\theta}}}$ is obtained by a random perturbation to ${\boldsymbol{\theta}}$, namely, $\hat{{\boldsymbol{\theta}}}={\boldsymbol{\theta}}+\boldsymbol{\xi}$ for $\boldsymbol{\xi}\sim{\cal N}(\mathbf{0},\tau^2 I_p)$.

  3. Consider an examples oracle, such that each example $(\mathbf{x},y)\in\mathbb{R}^d\times\mathbb{R}$ is drawn i.i.d. with $\mathbf{x}\sim{\cal D}$ and $y=N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{x})$. Then, given access to the examples oracle, the goal of the learning algorithm ${\cal L}$ is to return with probability at least $\frac{3}{4}$ (over the random perturbation $\boldsymbol{\xi}$ and the internal randomness of ${\cal L}$) a hypothesis $h:\mathbb{R}^d\to\mathbb{R}$ such that

     $$\operatorname*{\mathbb{E}}_{\mathbf{x}\sim{\cal D}}\left[\left(h(\mathbf{x})-N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{x})\right)^{2}\right]\leq\epsilon~.$$

     We say that ${\cal L}$ is efficient if it runs in time $\operatorname{poly}(d,p,B,1/\epsilon,1/\tau)$.
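The following sketch illustrates steps 2 and 3 of Definition 2.2 for the case where ${\cal D}$ is the standard Gaussian distribution; the helper names are ours, and the network is a generic ReLU network with activation in the output neuron.

```python
import numpy as np

def smoothed_example_oracle(weights, biases, tau, d, rng):
    # Step 2: perturb every parameter with i.i.d. N(0, tau^2) noise,
    # i.e., theta_hat = theta + xi.
    Ws = [W + tau * rng.normal(size=W.shape) for W in weights]
    bs = [b + tau * rng.normal(size=b.shape) for b in biases]

    def net(x):  # forward pass with ReLU in every layer, including the output
        z = x
        for W, b in zip(Ws, bs):
            z = np.maximum(W @ z + b, 0.0)
        return z

    # Step 3: the oracle labels Gaussian inputs with the perturbed network.
    while True:
        x = rng.normal(size=d)      # x ~ N(0, I_d)
        yield x, net(x)             # y = N_{theta_hat}(x)
```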

Finally, we also consider a setting where both the parameters and the input distribution are smoothed:

Definition 2.3 (Learning with smoothed parameters and inputs).

Learning depth-$k$ neural networks with smoothed parameters and inputs under an input distribution ${\cal D}$ is defined as follows:

  1. An adversary chooses a set of $B$-bounded parameters ${\boldsymbol{\theta}}\in\mathbb{R}^p$ for a depth-$k$ neural network $N_{\boldsymbol{\theta}}:\mathbb{R}^d\to\mathbb{R}$, as well as some $\tau,\omega,\epsilon>0$.

  2. A perturbed set of parameters $\hat{{\boldsymbol{\theta}}}$ is obtained by a random perturbation to ${\boldsymbol{\theta}}$, namely, $\hat{{\boldsymbol{\theta}}}={\boldsymbol{\theta}}+\boldsymbol{\xi}$ for $\boldsymbol{\xi}\sim{\cal N}(\mathbf{0},\tau^2 I_p)$. Moreover, a smoothed input distribution $\hat{{\cal D}}$ is obtained from ${\cal D}$, such that $\hat{\mathbf{x}}\sim\hat{{\cal D}}$ is chosen by drawing $\mathbf{x}\sim{\cal D}$ and adding a random perturbation from ${\cal N}(\mathbf{0},\omega^2 I_d)$.

  3. Consider an examples oracle, such that each example $(\mathbf{x},y)\in\mathbb{R}^d\times\mathbb{R}$ is drawn i.i.d. with $\mathbf{x}\sim\hat{{\cal D}}$ and $y=N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{x})$. Then, given access to the examples oracle, the goal of the learning algorithm ${\cal L}$ is to return with probability at least $\frac{3}{4}$ (over the random perturbation $\boldsymbol{\xi}$ and the internal randomness of ${\cal L}$) a hypothesis $h:\mathbb{R}^d\to\mathbb{R}$ such that

     $$\operatorname*{\mathbb{E}}_{\mathbf{x}\sim\hat{{\cal D}}}\left[\left(h(\mathbf{x})-N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{x})\right)^{2}\right]\leq\epsilon~.$$

     We say that ${\cal L}$ is efficient if it runs in time $\operatorname{poly}(d,p,B,1/\epsilon,1/\tau,1/\omega)$.

We emphasize that all of the above definitions consider learning in the distribution-specific setting. Thus, the learning algorithm may depend on the specific input distribution ${\cal D}$.

3 Results

As we discussed in the introduction, there exist efficient algorithms for learning depth-2 (one-hidden-layer) ReLU networks with smoothed parameters under the Gaussian distribution. We now show that such a result may not be achieved for depth-3 networks:

Theorem 3.1.

Under Assumption 2.1, there is no efficient algorithm that learns depth-3 neural networks with smoothed parameters (in the sense of Definition 2.2) under the standard Gaussian distribution.

We prove the theorem in Section 4. Next, we conclude that learning depth-3 neural networks under the Gaussian distribution on $\mathbb{R}^d$ is hard in the standard PAC framework even if all weight matrices are non-degenerate, namely, when the minimal singular values of the weight matrices are lower bounded by $1/\operatorname{poly}(d)$. As we discussed in the introduction, under similar assumptions there exist efficient learning algorithms for one-hidden-layer networks.

Corollary 3.1.

Under Assumption 2.1, there is no efficient algorithm that learns depth-3 neural networks (in the sense of Definition 2.1) under the standard Gaussian distribution on $\mathbb{R}^d$, even if the smallest singular value of each weight matrix is at least $1/\operatorname{poly}(d)$.

The proof of the above corollary follows from Theorem 3.1, using the fact that by adding a small random perturbation to the weight matrices, we get w.h.p. non-degenerate matrices. Hence, an efficient algorithm that learns non-degenerate networks suffices for obtaining an efficient algorithm that learns under the smoothed analysis framework. See Appendix B for the formal proof.

The above hardness results consider depth-3 networks that include activation in the output neuron. Thus, the networks have three non-linear layers. We remark that these results readily imply hardness also for depth-4 networks without activation in the output, and hardness for depth-$k$ networks for any $k>3$. We also note that the hardness results hold already for networks where all hidden layers are of the same width, e.g., where all layers are of width $d$.

Theorem 3.1 gives a strong limitation on learning depth-3 networks in the smoothed-analysis framework. Motivated by existing positive results on smoothed-analysis in depth-2 networks, we now study whether learning depth-2 networks with smoothed parameters can be done under weak assumptions on the input distribution. Specifically, we consider the case where both the parameters and the input distribution are smoothed. We show that efficiently learning depth-2 networks may not be possible with smoothed parameters where the input distribution on $\mathbb{R}^d$ is obtained by smoothing an i.i.d. Bernoulli distribution on $\{0,1\}^d$.

Theorem 3.2.

Under Assumption 2.1, there is no efficient algorithm that learns depth-2 neural networks with smoothed parameters and inputs (in the sense of Definition 2.3), under the distribution ${\cal D}$ on $\{0,1\}^d$ where each coordinate is drawn i.i.d. from a Bernoulli distribution which takes the value $0$ with probability $\frac{1}{\sqrt{d}}$.

We prove the theorem in Appendix C. The proof follows from similar ideas to the proof of Theorem 3.1, which we discuss in the next section.

4 Proof of Theorem 3.1

The proof builds on a technique from Daniely and Vardi (2021). It follows by showing that an efficient algorithm for learning depth-3 neural networks with smoothed parameters under the Gaussian distribution can be used for breaking a local PRG. Intuitively, the main challenge in our proof in comparison to Daniely and Vardi (2021) is that our reduction must handle the random noise which is added to the parameters. Specifically, Daniely and Vardi (2021) define a certain examples oracle, and show that the examples returned by the oracle are realizable by some neural network which depends on the unknown $\mathbf{x}\in\{0,1\}^n$ used by the PRG. Since the network depends on this unknown $\mathbf{x}$, some of the parameters of the network are unknown, and hence it is not trivial how to define an examples oracle which is realizable by a perturbed network. Moreover, we need to handle this random perturbation without increasing the network’s depth.

We now provide the formal proof. The proof relies on several lemmas which we prove in Appendix A. For a sufficiently large $n$, let ${\cal D}$ be the standard Gaussian distribution on $\mathbb{R}^{n^2}$. Assume that there is a $\operatorname{poly}(n)$-time algorithm ${\cal L}$ that learns depth-3 neural networks with at most $n^2$ hidden neurons and parameter magnitudes bounded by $n^3$, with smoothed parameters, under the distribution ${\cal D}$, with $\epsilon=\frac{1}{n}$, and $\tau=1/\operatorname{poly}(n)$ that we will specify later. Let $m(n)\leq\operatorname{poly}(n)$ be the sample complexity of ${\cal L}$, namely, ${\cal L}$ uses a sample of size at most $m(n)$ and returns with probability at least $\frac{3}{4}$ a hypothesis $h$ with $\operatorname*{\mathbb{E}}_{\mathbf{z}\sim{\cal D}}\left[\left(h(\mathbf{z})-N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{z})\right)^{2}\right]\leq\epsilon=\frac{1}{n}$, where $N_{\hat{{\boldsymbol{\theta}}}}$ is the perturbed network. Let $s>1$ be a constant such that $n^s\geq m(n)+n^3$ for every sufficiently large $n$. By Assumption 2.1, there exists a constant $k$ and a predicate $P:\{0,1\}^k\to\{0,1\}$ such that ${\cal F}_{P,n,n^s}$ is a $\frac{1}{3}$-PRG. We will show an efficient algorithm ${\cal A}$ with distinguishing advantage greater than $\frac{1}{3}$, and thus reach a contradiction.

Throughout this proof, we will use the following notation. For a hyperedge $S=(i_1,\ldots,i_k)$ we denote by $\mathbf{z}^S\in\{0,1\}^{kn}$ the following encoding of $S$: the vector $\mathbf{z}^S$ is a concatenation of $k$ vectors in $\{0,1\}^n$, such that the $j$-th vector has $0$ in the $i_j$-th coordinate and $1$ elsewhere. Thus, $\mathbf{z}^S$ consists of $k$ size-$n$ slices, each encoding a member of $S$. For $\mathbf{z}\in\{0,1\}^{kn}$, $i\in[k]$ and $j\in[n]$, we denote $z_{i,j}=z_{(i-1)n+j}$. That is, $z_{i,j}$ is the $j$-th component in the $i$-th slice of $\mathbf{z}$. For $\mathbf{x}\in\{0,1\}^n$, let $P_{\mathbf{x}}:\{0,1\}^{kn}\to\{0,1\}$ be such that for every hyperedge $S$ we have $P_{\mathbf{x}}(\mathbf{z}^S)=P(\mathbf{x}_S)$. Let $c$ be such that $\Pr_{t\sim{\cal N}(0,1)}[t\leq c]=\frac{1}{n}$. Let $\mu$ be the density of ${\cal N}(0,1)$, let $\mu_-(t)=n\cdot\mathbbm{1}[t\leq c]\cdot\mu(t)$, and let $\mu_+(t)=\frac{n}{n-1}\cdot\mathbbm{1}[t\geq c]\cdot\mu(t)$. Note that $\mu_-,\mu_+$ are the densities of the restriction of $\mu$ to the intervals $t\leq c$ and $t\geq c$, respectively. Let $\Psi:\mathbb{R}^{kn}\to\{0,1\}^{kn}$ be a mapping such that for every $\mathbf{z}'\in\mathbb{R}^{kn}$ and $i\in[kn]$ we have $\Psi(\mathbf{z}')_i=\mathbbm{1}[z'_i\geq c]$. For $\tilde{\mathbf{z}}\in\mathbb{R}^{n^2}$ we denote $\tilde{\mathbf{z}}_{[kn]}=(\tilde{z}_1,\ldots,\tilde{z}_{kn})$, namely, the first $kn$ components of $\tilde{\mathbf{z}}$ (assuming $n^2\geq kn$).
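The following sketch (ours, for illustration only) implements the encoding $\mathbf{z}^S$, the threshold map $\Psi$, and inverse-CDF sampling from $\mu_+$ and $\mu_-$; it uses SciPy's `norm` for the Gaussian CDF and quantile function, and 0-based indices for simplicity.

```python
import numpy as np
from scipy.stats import norm

def encode_hyperedge(S, n):
    # z^S: k slices of length n; slice j is all-ones except a 0 at index S[j].
    z = np.ones(len(S) * n)
    for j, i_j in enumerate(S):
        z[j * n + i_j] = 0.0
    return z

def psi(z_prime, c):
    # Psi thresholds each coordinate at c: Psi(z')_i = 1[z'_i >= c].
    return (z_prime >= c).astype(int)

def sample_z_prime(z, c, rng):
    # Coordinates with z_i = 1 are drawn from mu_+ (standard Gaussian restricted
    # to t >= c); coordinates with z_i = 0 from mu_- (restricted to t <= c).
    u = rng.uniform(low=1e-12, high=1.0, size=z.shape)
    p_c = norm.cdf(c)                      # = 1/n by the choice of c
    lo = norm.ppf(u * p_c)                 # inverse-CDF sampling below c
    hi = norm.ppf(p_c + u * (1 - p_c))     # inverse-CDF sampling above c
    return np.where(z == 1, hi, lo)

n, k = 100, 5
c = norm.ppf(1.0 / n)                      # Pr[t <= c] = 1/n
rng = np.random.default_rng(0)
S = (3, 17, 42, 58, 99)
z_S = encode_hyperedge(S, n)
z_prime = sample_z_prime(z_S, c, rng)
assert np.array_equal(psi(z_prime, c), z_S.astype(int))
```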

4.1 Defining the target network for ${\cal L}$

Since our goal is to use the algorithm ${\cal L}$ for breaking PRGs, in this subsection we define a neural network $\tilde{N}:\mathbb{R}^{n^2}\to\mathbb{R}$ that we will later use as a target network for ${\cal L}$. The network $\tilde{N}$ contains the subnetworks $N_1,N_2,N_3$, which we define below.

Let $N_1$ be a depth-2 neural network with input dimension $kn$, at most $n\log(n)$ hidden neurons, at most $\log(n)$ output neurons (with activations in the output neurons), and parameter magnitudes bounded by $n^3$ (all bounds are for a sufficiently large $n$), which satisfies the following. We denote the set of output neurons of $N_1$ by ${\cal E}_1$. Let $\mathbf{z}'\in\mathbb{R}^{kn}$ be an input to $N_1$ such that $\Psi(\mathbf{z}')=\mathbf{z}^S$ for some hyperedge $S$, and assume that for every $i\in[kn]$ we have $z'_i\not\in\left(c,c+\frac{1}{n^2}\right)$. Fix some $\mathbf{x}\in\{0,1\}^n$. Then, for $S$ with $P_{\mathbf{x}}(\mathbf{z}^S)=0$ the inputs to all output neurons ${\cal E}_1$ are at most $-1$, and for $S$ with $P_{\mathbf{x}}(\mathbf{z}^S)=1$ there exists a neuron in ${\cal E}_1$ with input at least $2$. Recall that our definition of a neuron’s input includes the addition of the bias term. The construction of the network $N_1$ is given in Lemma A.3. Intuitively, the network $N_1$ consists of a layer that transforms w.h.p. the input $\mathbf{z}'$ to $\Psi(\mathbf{z}')=\mathbf{z}^S$, followed by a layer that satisfies the following: building on a lemma from Daniely and Vardi (2021) which shows that $P_{\mathbf{x}}(\mathbf{z}^S)$ can be computed by a DNF formula, we define a layer where each output neuron corresponds to a term in the DNF formula, such that if the term evaluates to $0$ then the input to the neuron is at most $-1$, and otherwise it is at least $2$. Note that the network $N_1$ depends on $\mathbf{x}$. However, only the second layer depends on $\mathbf{x}$, and thus given an input we may compute the first layer even without knowing $\mathbf{x}$. Let $N'_1:\mathbb{R}^{kn}\to\mathbb{R}$ be a depth-3 neural network with no activation function in the output neuron, obtained from $N_1$ by summing the outputs of all neurons in ${\cal E}_1$.

Let $N_2$ be a depth-2 neural network with input dimension $kn$, at most $n\log(n)$ hidden neurons, at most $2n$ output neurons, and parameter magnitudes bounded by $n^3$ (for a sufficiently large $n$), which satisfies the following. We denote the set of output neurons of $N_2$ by ${\cal E}_2$. Let $\mathbf{z}'\in\mathbb{R}^{kn}$ be an input to $N_2$ such that for every $i\in[kn]$ we have $z'_i\not\in\left(c,c+\frac{1}{n^2}\right)$. If $\Psi(\mathbf{z}')$ is an encoding of a hyperedge, then the inputs to all output neurons ${\cal E}_2$ are at most $-1$, and otherwise there exists a neuron in ${\cal E}_2$ with input at least $2$. The construction of the network $N_2$ is given in Lemma A.5. Intuitively, each neuron in ${\cal E}_2$ is responsible for checking whether $\Psi(\mathbf{z}')$ violates some requirement that must hold in an encoding of a hyperedge. Let $N'_2:\mathbb{R}^{kn}\to\mathbb{R}$ be a depth-3 neural network with no activation function in the output neuron, obtained from $N_2$ by summing the outputs of all neurons in ${\cal E}_2$.

Let $N_3$ be a depth-2 neural network with input dimension $kn$, at most $n\log(n)$ hidden neurons, $kn\leq n\log(n)$ output neurons, and parameter magnitudes bounded by $n^3$ (for a sufficiently large $n$), which satisfies the following. We denote the set of output neurons of $N_3$ by ${\cal E}_3$. Let $\mathbf{z}'\in\mathbb{R}^{kn}$ be an input to $N_3$. If there exists $i\in[kn]$ such that $z'_i\in\left(c,c+\frac{1}{n^2}\right)$, then there exists a neuron in ${\cal E}_3$ with input at least $2$. Moreover, if for all $i\in[kn]$ we have $z'_i\not\in\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, then the inputs to all neurons in ${\cal E}_3$ are at most $-1$. The construction of the network $N_3$ is straightforward and given in Lemma A.6. Let $N'_3:\mathbb{R}^{kn}\to\mathbb{R}$ be a depth-3 neural network with no activation function in the output neuron, obtained from $N_3$ by summing the outputs of all neurons in ${\cal E}_3$.

Let $N':\mathbb{R}^{kn}\to\mathbb{R}$ be a depth-3 network obtained from $N'_1,N'_2,N'_3$ as follows. For $\mathbf{z}'\in\mathbb{R}^{kn}$ we have $N'(\mathbf{z}')=\left[1-N'_1(\mathbf{z}')-N'_2(\mathbf{z}')-N'_3(\mathbf{z}')\right]_+$. The network $N'$ has at most $n^2$ neurons, and parameter magnitudes bounded by $n^3$ (all bounds are for a sufficiently large $n$). Finally, let $\tilde{N}:\mathbb{R}^{n^2}\rightarrow\mathbb{R}$ be a depth-3 neural network such that $\tilde{N}(\tilde{\mathbf{z}})=N'\left(\tilde{\mathbf{z}}_{[kn]}\right)$.
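Schematically, and assuming black-box access to the outputs of the neuron sets ${\cal E}_1,{\cal E}_2,{\cal E}_3$ (whose constructions are in Lemmas A.3, A.5 and A.6 and are not reproduced here), the target network combines the subnetworks as in the following sketch.

```python
import numpy as np

def N_prime(E1, E2, E3, z_prime):
    # E1, E2, E3 are assumed to compute the (post-ReLU) outputs of the neuron
    # sets E_1, E_2, E_3 of the depth-2 subnetworks N_1, N_2, N_3 on z' in R^{kn}.
    # N'_i sums those outputs, and the output neuron applies a final ReLU:
    # N'(z') = [1 - N'_1(z') - N'_2(z') - N'_3(z')]_+ .
    return max(0.0, 1.0 - np.sum(E1(z_prime)) - np.sum(E2(z_prime)) - np.sum(E3(z_prime)))

def N_tilde(E1, E2, E3, z_tilde, kn):
    # The target network depends only on the first kn coordinates of its input.
    return N_prime(E1, E2, E3, z_tilde[:kn])
```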

4.2 Defining the noise magnitude $\tau$ and analyzing the perturbed network

In order to use the algorithm ${\cal L}$ w.r.t. some neural network with parameters ${\boldsymbol{\theta}}$, we need to implement an examples oracle, such that the examples are labeled according to a neural network with parameters ${\boldsymbol{\theta}}+\boldsymbol{\xi}$, where $\boldsymbol{\xi}$ is a random perturbation. Specifically, we use ${\cal L}$ with an examples oracle where the labels correspond to a network $\hat{N}:\mathbb{R}^{n^2}\to\mathbb{R}$, obtained from $\tilde{N}$ (w.r.t. an appropriate $\mathbf{x}\in\{0,1\}^n$ in the construction of $N_1$) by adding a small perturbation to the parameters. The perturbation is such that we add i.i.d. noise to each parameter in $\tilde{N}$, where the noise is distributed according to ${\cal N}(0,\tau^2)$, and $\tau=1/\operatorname{poly}(n)$ is small enough such that the following holds. Let $f_{\boldsymbol{\theta}}:\mathbb{R}^{n^2}\to\mathbb{R}$ be any depth-3 neural network parameterized by ${\boldsymbol{\theta}}\in\mathbb{R}^r$ for some $r>0$, with at most $n^2$ neurons and parameter magnitudes bounded by $n^3$ (note that $r$ is polynomial in $n$). Then with probability at least $1-\frac{1}{n}$ over $\boldsymbol{\xi}\sim{\cal N}(\mathbf{0},\tau^2 I_r)$, we have $|\xi_i|\leq\frac{1}{10}$ for all $i\in[r]$, and the network $f_{{\boldsymbol{\theta}}+\boldsymbol{\xi}}$ is such that for every input $\tilde{\mathbf{z}}\in\mathbb{R}^{n^2}$ with $\left\|\tilde{\mathbf{z}}\right\|\leq 2n$ and every neuron the following holds: let $a,b$ be the inputs to the neuron in the computations $f_{\boldsymbol{\theta}}(\tilde{\mathbf{z}})$ and $f_{{\boldsymbol{\theta}}+\boldsymbol{\xi}}(\tilde{\mathbf{z}})$, respectively; then $|a-b|\leq\frac{1}{2}$. Thus, $\tau$ is sufficiently small, such that w.h.p. adding i.i.d. noise ${\cal N}(0,\tau^2)$ to each parameter does not change the inputs to the neurons by more than $\frac{1}{2}$. Note that such an inverse-polynomial $\tau$ exists, since when the network size, parameter magnitudes, and input size are bounded by some $\operatorname{poly}(n)$, the input to each neuron in $f_{\boldsymbol{\theta}}(\tilde{\mathbf{z}})$ is $\operatorname{poly}(n)$-Lipschitz as a function of ${\boldsymbol{\theta}}$, and thus it suffices to choose $\tau$ that implies with probability at least $1-\frac{1}{n}$ that $\left\|\boldsymbol{\xi}\right\|\leq\frac{1}{q(n)}$ for a sufficiently large polynomial $q(n)$ (see Lemma A.7 for details).
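To see concretely why an inverse-polynomial $\tau$ suffices (one standard argument; the precise choice is made in Lemma A.7), note that $\operatorname*{\mathbb{E}}\left\|\boldsymbol{\xi}\right\|^2=\tau^2 r$, so by Markov's inequality

$$\Pr\left[\left\|\boldsymbol{\xi}\right\|\geq\tau\sqrt{nr}\right]=\Pr\left[\left\|\boldsymbol{\xi}\right\|^2\geq\tau^2 nr\right]\leq\frac{\tau^2 r}{\tau^2 nr}=\frac{1}{n}~.$$

Hence, taking $\tau\leq\frac{1}{q(n)\sqrt{nr}}$ gives $\left\|\boldsymbol{\xi}\right\|\leq\frac{1}{q(n)}$ with probability at least $1-\frac{1}{n}$, and combining this with the $\operatorname{poly}(n)$-Lipschitzness of each neuron's input as a function of ${\boldsymbol{\theta}}$ yields the required bound.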

Let $\tilde{{\boldsymbol{\theta}}}\in\mathbb{R}^p$ be the parameters of the network $\tilde{N}$. Recall that the parameters vector $\tilde{{\boldsymbol{\theta}}}$ is the concatenation of all weight matrices and bias terms. Let $\hat{{\boldsymbol{\theta}}}\in\mathbb{R}^p$ be the parameters of $\hat{N}$, namely, $\hat{{\boldsymbol{\theta}}}=\tilde{{\boldsymbol{\theta}}}+\boldsymbol{\xi}$ where $\boldsymbol{\xi}\sim{\cal N}(\mathbf{0},\tau^2 I_p)$. By our choice of $\tau$ and the construction of the networks $N_1,N_2,N_3$, with probability at least $1-\frac{1}{n}$ over $\boldsymbol{\xi}$, for every $\tilde{\mathbf{z}}$ with $\left\|\tilde{\mathbf{z}}\right\|\leq 2n$, the inputs to the neurons ${\cal E}_1,{\cal E}_2,{\cal E}_3$ in the computation $\hat{N}(\tilde{\mathbf{z}})$ satisfy the following properties, where we denote $\mathbf{z}'=\tilde{\mathbf{z}}_{[kn]}$:

  (P1) If $\Psi(\mathbf{z}')=\mathbf{z}^S$ for some hyperedge $S$, and for every $i\in[kn]$ we have $z'_i\not\in\left(c,c+\frac{1}{n^2}\right)$, then the inputs to ${\cal E}_1$ satisfy:

    • If $P_{\mathbf{x}}(\mathbf{z}^S)=0$, the inputs to all neurons in ${\cal E}_1$ are at most $-\frac{1}{2}$.

    • If $P_{\mathbf{x}}(\mathbf{z}^S)=1$, there exists a neuron in ${\cal E}_1$ with input at least $\frac{3}{2}$.

  (P2) If for every $i\in[kn]$ we have $z'_i\not\in\left(c,c+\frac{1}{n^2}\right)$, then the inputs to ${\cal E}_2$ satisfy:

    • If $\Psi(\mathbf{z}')$ is an encoding of a hyperedge, then the inputs to all neurons in ${\cal E}_2$ are at most $-\frac{1}{2}$.

    • Otherwise, there exists a neuron in ${\cal E}_2$ with input at least $\frac{3}{2}$.

  (P3) The inputs to ${\cal E}_3$ satisfy:

    • If there exists $i\in[kn]$ such that $z'_i\in\left(c,c+\frac{1}{n^2}\right)$, then there exists a neuron in ${\cal E}_3$ with input at least $\frac{3}{2}$.

    • If for all $i\in[kn]$ we have $z'_i\not\in\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, then the inputs to all neurons in ${\cal E}_3$ are at most $-\frac{1}{2}$.

4.3 Stating the algorithm ${\cal A}$

Given a sequence $(S_1,y_1),\ldots,(S_{n^s},y_{n^s})$, where $S_1,\ldots,S_{n^s}$ are i.i.d. random hyperedges, the algorithm ${\cal A}$ needs to distinguish whether $\mathbf{y}=(y_1,\ldots,y_{n^s})$ is random, or $\mathbf{y}=(P(\mathbf{x}_{S_1}),\ldots,P(\mathbf{x}_{S_{n^s}}))=(P_{\mathbf{x}}(\mathbf{z}^{S_1}),\ldots,P_{\mathbf{x}}(\mathbf{z}^{S_{n^s}}))$ for a random $\mathbf{x}\in\{0,1\}^n$. Let ${\cal S}=((\mathbf{z}^{S_1},y_1),\ldots,(\mathbf{z}^{S_{n^s}},y_{n^s}))$.

We use the efficient algorithm ${\cal L}$ in order to obtain distinguishing advantage greater than $\frac{1}{3}$ as follows. Let $\boldsymbol{\xi}$ be a random perturbation, and let $\hat{N}$ be the perturbed network as defined above, w.r.t. the unknown $\mathbf{x}\in\{0,1\}^n$. Note that given a perturbation $\boldsymbol{\xi}$, only the weights in the second layer of the subnetwork $N_1$ in $\hat{N}$ are unknown, since all other parameters do not depend on $\mathbf{x}$. The algorithm ${\cal A}$ runs ${\cal L}$ with the following examples oracle. In the $i$-th call, the oracle first draws $\mathbf{z}\in\{0,1\}^{kn}$ such that each component is drawn i.i.d. from a Bernoulli distribution which takes the value $0$ with probability $\frac{1}{n}$. If $\mathbf{z}$ is an encoding of a hyperedge, then the oracle replaces $\mathbf{z}$ with $\mathbf{z}^{S_i}$. Then, the oracle chooses $\mathbf{z}'\in\mathbb{R}^{kn}$ such that for each component $j$, if $z_j=1$ then $z'_j$ is drawn from $\mu_+$, and otherwise $z'_j$ is drawn from $\mu_-$. Let $\tilde{\mathbf{z}}\in\mathbb{R}^{n^2}$ be such that $\tilde{\mathbf{z}}_{[kn]}=\mathbf{z}'$, and the other $n^2-kn$ components of $\tilde{\mathbf{z}}$ are drawn i.i.d. from ${\cal N}(0,1)$. Note that the vector $\tilde{\mathbf{z}}$ has the distribution ${\cal D}$, due to the definitions of the densities $\mu_+$ and $\mu_-$, and since replacing an encoding of a random hyperedge by an encoding of another random hyperedge does not change the distribution of $\mathbf{z}$. Let $\hat{b}\in\mathbb{R}$ be the bias term of the output neuron of $\hat{N}$. The oracle returns $(\tilde{\mathbf{z}},\tilde{y})$, where the label $\tilde{y}$ is chosen as follows (see the sketch after this list):

  • If $\Psi(\mathbf{z}')$ is not an encoding of a hyperedge, then $\tilde{y}=0$.

  • If $\Psi(\mathbf{z}')$ is an encoding of a hyperedge:

    • If $\mathbf{z}'$ does not have components in the interval $\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, then if $y_i=0$ we set $\tilde{y}=\hat{b}$, and if $y_i=1$ we set $\tilde{y}=0$.

    • If $\mathbf{z}'$ has a component in the interval $\left(c,c+\frac{1}{n^2}\right)$, then $\tilde{y}=0$.

    • If $\mathbf{z}'$ does not have components in the interval $\left(c,c+\frac{1}{n^2}\right)$, but has a component in the interval $\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, then the label $\tilde{y}$ is determined as follows:

      * If $y_i=1$ then $\tilde{y}=0$.

      * If $y_i=0$: Let $\hat{N}_3$ be the network $\hat{N}$ after omitting the neurons ${\cal E}_1,{\cal E}_2$ and their incoming and outgoing weights. Then, we set $\tilde{y}=[\hat{b}-\hat{N}_3(\tilde{\mathbf{z}})]_+$. Note that since only the second layer of $N_1$ depends on $\mathbf{x}$, we can compute $\hat{N}_3(\tilde{\mathbf{z}})$ without knowing $\mathbf{x}$.
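The following Python sketch (ours) summarizes the labeling rule above; `N3_hat` stands for a callable computing the perturbed network with ${\cal E}_1,{\cal E}_2$ removed, and the hyperedge-encoding check is simplified (it only verifies that each size-$n$ slice contains exactly one zero).

```python
import numpy as np

def is_hyperedge_encoding(z, n, k):
    # z in {0,1}^{kn} encodes a hyperedge only if every size-n slice has exactly
    # one 0 (distinctness of the k members is ignored in this simplified check).
    slices = z.reshape(k, n)
    return bool(np.all(np.sum(slices == 0, axis=1) == 1))

def oracle_label(z_prime, y_i, b_hat, c, n, k, N3_hat, z_tilde):
    # Schematic labeling rule of the examples oracle used by A.
    z = (z_prime >= c).astype(int)                          # Psi(z')
    if not is_hyperedge_encoding(z, n, k):
        return 0.0
    in_wide = (z_prime > c - 1.0 / n**2) & (z_prime < c + 2.0 / n**2)
    in_narrow = (z_prime > c) & (z_prime < c + 1.0 / n**2)
    if not np.any(in_wide):
        return b_hat if y_i == 0 else 0.0
    if np.any(in_narrow):
        return 0.0
    # some component in (c - 1/n^2, c + 2/n^2) but none in (c, c + 1/n^2)
    return 0.0 if y_i == 1 else max(0.0, b_hat - N3_hat(z_tilde))
```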

Let $h$ be the hypothesis returned by ${\cal L}$. Recall that ${\cal L}$ uses at most $m(n)$ examples, and hence ${\cal S}$ contains at least $n^3$ examples that ${\cal L}$ cannot view. We denote the indices of these examples by $I=\{m(n)+1,\ldots,m(n)+n^3\}$, and the examples by ${\cal S}_I=\{(\mathbf{z}^{S_i},y_i)\}_{i\in I}$. By $n^3$ additional calls to the oracle, the algorithm ${\cal A}$ obtains the examples $\tilde{{\cal S}}_I=\{(\tilde{\mathbf{z}}_i,\tilde{y}_i)\}_{i\in I}$ that correspond to ${\cal S}_I$. Let $h'$ be a hypothesis such that for all $\tilde{\mathbf{z}}\in\mathbb{R}^{n^2}$ we have $h'(\tilde{\mathbf{z}})=\max\{0,\min\{\hat{b},h(\tilde{\mathbf{z}})\}\}$; thus, for $\hat{b}\geq 0$ the hypothesis $h'$ is obtained from $h$ by clipping the output to the interval $[0,\hat{b}]$. Let $\ell_I(h')=\frac{1}{|I|}\sum_{i\in I}(h'(\tilde{\mathbf{z}}_i)-\tilde{y}_i)^2$. Now, if $\ell_I(h')\leq\frac{2}{n}$, then ${\cal A}$ returns $1$, and otherwise it returns $0$. We remark that the decision of our algorithm is based on $h'$ (rather than $h$) since we need the outputs to be bounded, in order to allow using Hoeffding’s inequality in our analysis, which we discuss in the next subsection.
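For completeness, a sketch (ours) of the final decision rule of ${\cal A}$: clip $h$ to $[0,\hat{b}]$ and threshold the empirical loss on the held-out examples at $\frac{2}{n}$.

```python
import numpy as np

def decide(h, held_out, b_hat, n):
    # h' clips h to [0, b_hat]; A outputs 1 iff the empirical squared loss
    # l_I(h') on the n^3 held-out examples is at most 2/n.
    losses = [(np.clip(h(z_t), 0.0, b_hat) - y_t) ** 2 for (z_t, y_t) in held_out]
    return 1 if float(np.mean(losses)) <= 2.0 / n else 0
```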

4.4 Analyzing the algorithm ${\cal A}$

Note that the algorithm ${\cal A}$ runs in $\operatorname{poly}(n)$ time. We now show that if ${\cal S}$ is pseudorandom then ${\cal A}$ returns $1$ with probability greater than $\frac{2}{3}$, and if ${\cal S}$ is random then ${\cal A}$ returns $1$ with probability less than $\frac{1}{3}$.

We start with the case where ${\cal S}$ is pseudorandom. In Lemma A.8, we prove that if ${\cal S}$ is pseudorandom then w.h.p. (over $\boldsymbol{\xi}\sim{\cal N}(\mathbf{0},\tau^2 I_p)$ and the i.i.d. inputs $\tilde{\mathbf{z}}_i\sim{\cal D}$) the examples $(\tilde{\mathbf{z}}_1,\tilde{y}_1),\ldots,(\tilde{\mathbf{z}}_{m(n)+n^3},\tilde{y}_{m(n)+n^3})$ returned by the oracle are realized by $\hat{N}$. Thus, $\tilde{y}_i=\hat{N}(\tilde{\mathbf{z}}_i)$ for all $i$. As we show in the lemma, this claim follows by noting that the following hold w.h.p., where we denote $\mathbf{z}'_i=(\tilde{\mathbf{z}}_i)_{[kn]}$:

  • If $\Psi(\mathbf{z}'_i)$ is not an encoding of a hyperedge, then the oracle sets $\tilde{y}_i=0$, and we have:

    • If $\mathbf{z}'_i$ does not have components in $\left(c,c+\frac{1}{n}\right)$, then there exists a neuron in ${\cal E}_2$ with output at least $\frac{3}{2}$ (by Property (P2)), which implies $\hat{N}(\tilde{\mathbf{z}}_i)=0$.

    • If $\mathbf{z}'_i$ has a component in $\left(c,c+\frac{1}{n}\right)$, then there exists a neuron in ${\cal E}_3$ with output at least $\frac{3}{2}$ (by Property (P3)), which implies $\hat{N}(\tilde{\mathbf{z}}_i)=0$.

  • If $\Psi(\mathbf{z}'_i)$ is an encoding of a hyperedge $S$, then by the definition of the examples oracle we have $S=S_i$. Hence:

    • If $\mathbf{z}'_i$ does not have components in $\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, then:

      * If $y_i=0$ then the oracle sets $\tilde{y}_i=\hat{b}$. Since ${\cal S}$ is pseudorandom, we have $P_{\mathbf{x}}(\mathbf{z}^S)=P_{\mathbf{x}}(\mathbf{z}^{S_i})=y_i=0$. Hence, in the computation $\hat{N}(\tilde{\mathbf{z}}_i)$ the inputs to all neurons in ${\cal E}_1,{\cal E}_2,{\cal E}_3$ are at most $-\frac{1}{2}$ (by Properties (P1), (P2) and (P3)), and thus their outputs are $0$. Therefore, $\hat{N}(\tilde{\mathbf{z}}_i)=\hat{b}$.

      * If $y_i=1$ then the oracle sets $\tilde{y}_i=0$. Since ${\cal S}$ is pseudorandom, we have $P_{\mathbf{x}}(\mathbf{z}^S)=P_{\mathbf{x}}(\mathbf{z}^{S_i})=y_i=1$. Hence, in the computation $\hat{N}(\tilde{\mathbf{z}}_i)$ there exists a neuron in ${\cal E}_1$ with output at least $\frac{3}{2}$ (by Property (P1)), which implies $\hat{N}(\tilde{\mathbf{z}}_i)=0$.

    • If $\mathbf{z}'_i$ has a component in $\left(c,c+\frac{1}{n^2}\right)$, then the oracle sets $\tilde{y}_i=0$. Also, in the computation $\hat{N}(\tilde{\mathbf{z}}_i)$ there exists a neuron in ${\cal E}_3$ with output at least $\frac{3}{2}$ (by Property (P3)), which implies $\hat{N}(\tilde{\mathbf{z}}_i)=0$.

    • If $\mathbf{z}'_i$ does not have components in the interval $\left(c,c+\frac{1}{n^2}\right)$, but has a component in the interval $\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, then:

      * If $y_i=1$ the oracle sets $\tilde{y}_i=0$. Since ${\cal S}$ is pseudorandom, we have $P_{\mathbf{x}}(\mathbf{z}^S)=P_{\mathbf{x}}(\mathbf{z}^{S_i})=y_i=1$. Hence, in the computation $\hat{N}(\tilde{\mathbf{z}}_i)$ there exists a neuron in ${\cal E}_1$ with output at least $\frac{3}{2}$ (by Property (P1)), which implies $\hat{N}(\tilde{\mathbf{z}}_i)=0$.

      * If $y_i=0$ the oracle sets $\tilde{y}_i=[\hat{b}-\hat{N}_3(\tilde{\mathbf{z}}_i)]_+$. Since ${\cal S}$ is pseudorandom, we have $P_{\mathbf{x}}(\mathbf{z}^S)=P_{\mathbf{x}}(\mathbf{z}^{S_i})=y_i=0$. Therefore, in the computation $\hat{N}(\tilde{\mathbf{z}}_i)$ all neurons in ${\cal E}_1,{\cal E}_2$ have output $0$ (by Properties (P1) and (P2)), and hence their contribution to the output of $\hat{N}$ is $0$. Thus, by the definition of $\hat{N}_3$, we have $\hat{N}(\tilde{\mathbf{z}}_i)=[\hat{b}-\hat{N}_3(\tilde{\mathbf{z}}_i)]_+$.

Recall that the algorithm ${\cal L}$ is such that with probability at least $\frac{3}{4}$ (over $\boldsymbol{\xi}\sim{\cal N}(\mathbf{0},\tau^2 I_p)$, the i.i.d. inputs $\tilde{\mathbf{z}}_i\sim{\cal D}$, and its internal randomness), given a size-$m(n)$ dataset labeled by $\hat{N}$, it returns a hypothesis $h$ such that $\operatorname*{\mathbb{E}}_{\tilde{\mathbf{z}}\sim{\cal D}}\left[(h(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^2\right]\leq\frac{1}{n}$. By the definition of $h'$ and the construction of $\hat{N}$, if $h$ has small error then $h'$ also has small error, namely, we have $\operatorname*{\mathbb{E}}_{\tilde{\mathbf{z}}\sim{\cal D}}\left[(h'(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^2\right]\leq\frac{1}{n}$. In Lemma A.9 we use the above arguments and Hoeffding’s inequality over $\tilde{{\cal S}}_I$, and prove that with probability greater than $\frac{2}{3}$ we have $\ell_I(h')\leq\frac{2}{n}$.
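To illustrate the role of the clipping (the exact argument is in Lemma A.9): since $h'(\tilde{\mathbf{z}})$ and $\tilde{y}_i$ lie in $[0,\hat{b}]$, and w.h.p. $\hat{b}\leq\frac{11}{10}$ (the output bias of $\tilde{N}$ is $1$ and $|\xi_i|\leq\frac{1}{10}$), the per-example losses lie in $[0,2]$, so Hoeffding's inequality over the $|I|=n^3$ held-out examples gives, for instance,

$$\Pr\left[\ell_I(h')\geq\operatorname*{\mathbb{E}}\left[(h'(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^2\right]+\frac{1}{n}\right]\leq\exp\left(-\frac{2n^3\cdot(1/n)^2}{2^2}\right)=e^{-n/2}~,$$

so when the expected loss is at most $\frac{1}{n}$, we get $\ell_I(h')\leq\frac{2}{n}$ except with exponentially small probability.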

Next, we consider the case where ${\cal S}$ is random. Let $\tilde{{\cal Z}}\subseteq\mathbb{R}^{n^2}$ be such that $\tilde{\mathbf{z}}\in\tilde{{\cal Z}}$ iff $\tilde{\mathbf{z}}_{[kn]}$ does not have components in the interval $\left(c-\frac{1}{n^2},c+\frac{2}{n^2}\right)$, and $\Psi(\tilde{\mathbf{z}}_{[kn]})=\mathbf{z}^S$ for a hyperedge $S$. If ${\cal S}$ is random, then by the definition of our examples oracle, for every $i\in[m(n)+n^3]$ such that $\tilde{\mathbf{z}}_i\in\tilde{{\cal Z}}$, we have $\tilde{y}_i=\hat{b}$ with probability $\frac{1}{2}$ and $\tilde{y}_i=0$ otherwise. Also, by the definition of the oracle, $\tilde{y}_i$ is independent of $S_i$ and independent of the choice of the vector $\tilde{\mathbf{z}}_i$ that corresponds to $\mathbf{z}^{S_i}$. Hence, for such $\tilde{\mathbf{z}}_i\in\tilde{{\cal Z}}$ with $i\in I$, no hypothesis can predict the label $\tilde{y}_i$, and the expected loss on the example is at least $\left(\frac{\hat{b}}{2}\right)^2$. Moreover, in Lemma A.11 we show that $\Pr\left[\tilde{\mathbf{z}}_i\in\tilde{{\cal Z}}\right]\geq\frac{1}{2\log(n)}$ for a sufficiently large $n$. In Lemma A.12 we use these arguments to prove a lower bound on $\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_I}\left[\ell_I(h')\right]$, and by Hoeffding’s inequality over $\tilde{{\cal S}}_I$ we conclude that with probability greater than $\frac{2}{3}$ we have $\ell_I(h')>\frac{2}{n}$.
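Concretely (the exact statement is Lemma A.12): for $\tilde{\mathbf{z}}_i\in\tilde{{\cal Z}}$ the label is $\hat{b}$ or $0$ with probability $\frac{1}{2}$ each, independently of $\tilde{\mathbf{z}}_i$, so for any value $v=h'(\tilde{\mathbf{z}}_i)$ the conditional expected loss is $\frac{1}{2}(v-\hat{b})^2+\frac{1}{2}v^2\geq\left(\frac{\hat{b}}{2}\right)^2$, and therefore

$$\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_I}\left[\ell_I(h')\right]\geq\Pr\left[\tilde{\mathbf{z}}_i\in\tilde{{\cal Z}}\right]\cdot\left(\frac{\hat{b}}{2}\right)^2\geq\frac{1}{2\log(n)}\cdot\left(\frac{\hat{b}}{2}\right)^2~,$$

which, since w.h.p. $\hat{b}\geq\frac{9}{10}$, is much larger than $\frac{2}{n}$ for a sufficiently large $n$.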

Overall, if ${\cal S}$ is pseudorandom then with probability greater than $\frac{2}{3}$ the algorithm ${\cal A}$ returns $1$, and if ${\cal S}$ is random then with probability greater than $\frac{2}{3}$ the algorithm ${\cal A}$ returns $0$. Thus, the distinguishing advantage is greater than $\frac{1}{3}$.

5 Discussion

Understanding the computational complexity of learning neural networks is a central question in learning theory. Our results imply that the assumptions which allow for efficient learning in one-hidden-layer networks might not suffice in deeper networks. Also, in depth-2 networks we show that it is not sufficient to assume that both the parameters and the inputs are smoothed. We hope that our hardness results will help focus on assumptions that may allow for efficient learning. Below we discuss several intriguing open problems.

First, we emphasize that our hardness results are for neural networks that include the ReLU activation also in the output neuron. In contrast, the positive results on learning depth-2 networks that we discussed in the introduction do not include activation in the output neuron. Therefore, as far as we are aware, there is still a gap between the upper bounds and our hardness results: (1) Under the assumption that the input is Gaussian and the weights are non-degenerate, the cases of depth-2 networks with activation in the output neuron and of depth-3 networks without activation in the output are not settled; (2) In the setting where both the parameters and the input distribution are smoothed, the case of depth-2 networks without activation in the output is not settled.

Moreover, our hardness result for depth-3 networks suggests that adding a small (yet polynomial) noise to the parameters is not sufficient to allow efficient learning under the Gaussian distribution. An intriguing direction is to explore the case where the noise is larger. Intuitively, the case of very large noise corresponds to learning a random network, where the weights are drawn i.i.d. from the Gaussian distribution. Hence, it would be interesting to understand whether there is an efficient algorithm for learning random ReLU networks under the Gaussian input distribution. Likewise, the effect of large noise may also be considered w.r.t. the inputs, namely, we may use large noise for obtaining smoothed input distributions.

Acknowledgements

This work was done as part of the NSF-Simons Sponsored Collaboration on the Theoretical Foundations of Deep Learning.

References

  • Applebaum [2013] B. Applebaum. Pseudorandom generators with long stretch and low locality from random local one-way functions. SIAM Journal on Computing, 42(5):2008–2037, 2013.
  • Applebaum and Lovett [2016] B. Applebaum and S. Lovett. Algebraic attacks against random local functions and their countermeasures. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 1087–1100, 2016.
  • Applebaum and Raykov [2016] B. Applebaum and P. Raykov. Fast pseudorandom functions based on expander graphs. In Theory of Cryptography Conference, pages 27–56. Springer, 2016.
  • Applebaum et al. [2010] B. Applebaum, B. Barak, and A. Wigderson. Public-key cryptography from different assumptions. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 171–180, 2010.
  • Applebaum et al. [2017] B. Applebaum, I. Damgård, Y. Ishai, M. Nielsen, and L. Zichron. Secure arithmetic computation with constant computational overhead. In Annual International Cryptology Conference, pages 223–254. Springer, 2017.
  • Awasthi et al. [2021] P. Awasthi, A. Tang, and A. Vijayaraghavan. Efficient algorithms for learning depth-2 neural networks with general ReLU activations. Advances in Neural Information Processing Systems, 34:13485–13496, 2021.
  • Bakshi et al. [2019] A. Bakshi, R. Jayaram, and D. P. Woodruff. Learning two layer rectified neural networks in polynomial time. In Conference on Learning Theory, pages 195–268. PMLR, 2019.
  • Bhaskara et al. [2014] A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 594–603, 2014.
  • Bhaskara et al. [2019] A. Bhaskara, A. Chen, A. Perreault, and A. Vijayaraghavan. Smoothed analysis in unsupervised learning via decoupling. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 582–610. IEEE, 2019.
  • Brutzkus et al. [2020] A. Brutzkus, A. Daniely, and E. Malach. ID3 learns juntas for smoothed product distributions. In Conference on Learning Theory, pages 902–915. PMLR, 2020.
  • Chen et al. [2022a] S. Chen, A. Gollakota, A. R. Klivans, and R. Meka. Hardness of noise-free learning for two-hidden-layer neural networks. arXiv preprint arXiv:2202.05258, 2022a.
  • Chen et al. [2022b] S. Chen, A. R. Klivans, and R. Meka. Learning deep ReLU networks is fixed-parameter tractable. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 696–707. IEEE, 2022b.
  • Couteau et al. [2018] G. Couteau, A. Dupin, P. Méaux, M. Rossi, and Y. Rotella. On the concrete security of Goldreich’s pseudorandom generator. In International Conference on the Theory and Application of Cryptology and Information Security, pages 96–124. Springer, 2018.
  • Daniely and Shalev-Shwartz [2016] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning DNF’s. In Conference on Learning Theory, pages 815–830, 2016.
  • Daniely and Vardi [2020] A. Daniely and G. Vardi. Hardness of learning neural networks with natural weights. arXiv preprint arXiv:2006.03177, 2020.
  • Daniely and Vardi [2021] A. Daniely and G. Vardi. From local pseudorandom generators to hardness of learning. In Conference on Learning Theory, pages 1358–1394. PMLR, 2021.
  • Diakonikolas and Kane [2020] I. Diakonikolas and D. M. Kane. Small covers for near-zero sets of polynomials and learning latent variable models. In 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pages 184–195. IEEE, 2020.
  • Diakonikolas et al. [2020a] I. Diakonikolas, D. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. Advances in Neural Information Processing Systems, 33, 2020a.
  • Diakonikolas et al. [2020b] I. Diakonikolas, D. M. Kane, V. Kontonis, and N. Zarifis. Algorithms and SQ lower bounds for PAC learning one-hidden-layer ReLU networks. arXiv preprint arXiv:2006.12476, 2020b.
  • Ge et al. [2015] R. Ge, Q. Huang, and S. M. Kakade. Learning mixtures of Gaussians in high dimensions. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 761–770, 2015.
  • Ge et al. [2017] R. Ge, J. D. Lee, and T. Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.
  • Ge et al. [2018] R. Ge, R. Kuditipudi, Z. Li, and X. Wang. Learning two-layer neural networks with symmetric inputs. arXiv preprint arXiv:1810.06793, 2018.
  • Goel and Klivans [2019] S. Goel and A. R. Klivans. Learning neural networks with two nonlinear layers in polynomial time. In Conference on Learning Theory, pages 1470–1499. PMLR, 2019.
  • Goel et al. [2017] S. Goel, V. Kanade, A. Klivans, and J. Thaler. Reliably learning the ReLU in polynomial time. In Conference on Learning Theory, pages 1004–1042. PMLR, 2017.
  • Goel et al. [2019] S. Goel, S. Karmalkar, and A. Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems, pages 8584–8593, 2019.
  • Goel et al. [2020a] S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. arXiv preprint arXiv:2006.12011, 2020a.
  • Goel et al. [2020b] S. Goel, A. Gollakota, and A. Klivans. Statistical-query lower bounds via functional gradients. Advances in Neural Information Processing Systems, 33, 2020b.
  • Goel et al. [2020c] S. Goel, A. Klivans, P. Manurangsi, and D. Reichman. Tight hardness results for training depth-2 ReLU networks. arXiv preprint arXiv:2011.13550, 2020c.
  • Goldreich [2000] O. Goldreich. Candidate one-way functions based on expander graphs. IACR Cryptol. ePrint Arch., 2000:63, 2000.
  • Haghtalab et al. [2020] N. Haghtalab, T. Roughgarden, and A. Shetty. Smoothed analysis of online and differentially private learning. Advances in Neural Information Processing Systems, 33:9203–9215, 2020.
  • Ishai et al. [2008] Y. Ishai, E. Kushilevitz, R. Ostrovsky, and A. Sahai. Cryptography with constant computational overhead. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 433–442, 2008.
  • Janzamin et al. [2015] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
  • Kakade et al. [2011] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. Advances in Neural Information Processing Systems, 24, 2011.
  • Kalai and Sastry [2009] A. T. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT, 2009.
  • Kalai and Teng [2008] A. T. Kalai and S.-H. Teng. Decision trees are PAC-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933, 2008.
  • Kalai et al. [2009] A. T. Kalai, A. Samorodnitsky, and S.-H. Teng. Learning and smoothed analysis. In 2009 50th Annual IEEE Symposium on Foundations of Computer Science, pages 395–404. IEEE, 2009.
  • Klivans and Sherstov [2006] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.
  • Laurent and Massart [2000] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
  • Lin [2016] H. Lin. Indistinguishability obfuscation from constant-degree graded encoding schemes. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 28–57. Springer, 2016.
  • Lin and Vaikuntanathan [2016] H. Lin and V. Vaikuntanathan. Indistinguishability obfuscation from DDH-like assumptions on constant-degree graded encodings. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 11–20. IEEE, 2016.
  • Méaux et al. [2019] P. Méaux, C. Carlet, A. Journault, and F.-X. Standaert. Improved filter permutators: Combining symmetric encryption design, Boolean functions, low complexity cryptography, and homomorphic encryption, for private delegation of computations. IACR Cryptol. ePrint Arch., 2019:483, 2019.
  • Roughgarden [2021] T. Roughgarden. Beyond the worst-case analysis of algorithms. Cambridge University Press, 2021.
  • Sankar et al. [2006] A. Sankar, D. A. Spielman, and S.-H. Teng. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM Journal on Matrix Analysis and Applications, 28(2):446–476, 2006.
  • Shamir [2018] O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.
  • Song et al. [2017] L. Song, S. Vempala, J. Wilmes, and B. Xie. On the complexity of learning neural networks. In Advances in neural information processing systems, pages 5514–5522, 2017.
  • Spielman and Teng [2004] D. A. Spielman and S.-H. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.
  • Vempala and Wilmes [2019] S. Vempala and J. Wilmes. Gradient descent for one-hidden-layer neural networks: Polynomial convergence and SQ lower bounds. In Conference on Learning Theory, pages 3115–3117. PMLR, 2019.
  • Zhang et al. [2016] Y. Zhang, J. D. Lee, and M. I. Jordan. l1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning, pages 993–1001. PMLR, 2016.
  • Zhong et al. [2017] K. Zhong, Z. Song, P. Jain, P. L. Bartlett, and I. S. Dhillon. Recovery guarantees for one-hidden-layer neural networks. In International conference on machine learning, pages 4140–4149. PMLR, 2017.

Appendix A Missing lemmas for the proof of Theorem 3.1

Lemma A.1 (Daniely and Vardi [2021]).

For every predicate P:{0,1}k{0,1}P:\{0,1\}^{k}\rightarrow\{0,1\} and 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}, there is a DNF formula ψ\psi over {0,1}kn\{0,1\}^{kn} with at most 2k2^{k} terms, such that for every hyperedge SS we have P𝐱(𝐳S)=ψ(𝐳S)P_{\mathbf{x}}(\mathbf{z}^{S})=\psi(\mathbf{z}^{S}). Moreover, each term in ψ\psi is a conjunction of positive literals.

Proof.

The following proof is from Daniely and Vardi [2021], and we give it here for completeness.

We denote by {0,1}k{\cal B}\subseteq\{0,1\}^{k} the set of satisfying assignments of PP. Note that the size of {\cal B} is at most 2k2^{k}. Consider the following DNF formula over {0,1}kn\{0,1\}^{kn}:

ψ(𝐳)=𝐛j[k]{l:xlbj}zj,l.\psi(\mathbf{z})=\bigvee_{\mathbf{b}\in{\cal B}}\bigwedge_{j\in[k]}\bigwedge_{\{l:x_{l}\neq b_{j}\}}z_{j,l}~{}.

For a hyperedge S=(i1,,ik)S=(i_{1},\ldots,i_{k}), we have

ψ(𝐳S)=1\displaystyle\psi(\mathbf{z}^{S})=1 𝐛j[k]xlbj,zj,lS=1\displaystyle\iff\exists\mathbf{b}\in{\cal B}\;\forall j\in[k]\;\forall x_{l}\neq b_{j},\;z^{S}_{j,l}=1
𝐛j[k]xlbj,ijl\displaystyle\iff\exists\mathbf{b}\in{\cal B}\;\forall j\in[k]\;\forall x_{l}\neq b_{j},\;i_{j}\neq l
𝐛j[k],xij=bj\displaystyle\iff\exists\mathbf{b}\in{\cal B}\;\forall j\in[k],\;x_{i_{j}}=b_{j}
𝐛,𝐱S=𝐛\displaystyle\iff\exists\mathbf{b}\in{\cal B},\;\mathbf{x}_{S}=\mathbf{b}
P(𝐱S)=1\displaystyle\iff P(\mathbf{x}_{S})=1
P𝐱(𝐳S)=1.\displaystyle\iff P_{\mathbf{x}}(\mathbf{z}^{S})=1~{}.
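
For concreteness, the following Python sketch (ours, not part of the original proof; the parity predicate, the values of kk and nn, and the helper names are illustrative) builds the DNF formula from the lemma for a given predicate and assignment 𝐱\mathbf{x}, and checks the claimed identity on random hyperedges.

  import itertools
  import random

  def encode(S, n):
      # z^S: k size-n slices, where slice j has a 0 at position S[j] and 1 elsewhere
      return [0 if j == i else 1 for i in S for j in range(n)]

  def build_dnf(P, x, k):
      # one term per satisfying assignment b of P, with positive literals z_{j,l} for x_l != b_j
      n = len(x)
      B = [b for b in itertools.product([0, 1], repeat=k) if P(b)]
      return [[(j, l) for j in range(k) for l in range(n) if x[l] != b[j]] for b in B]

  def eval_dnf(terms, z, n):
      return any(all(z[j * n + l] == 1 for (j, l) in term) for term in terms)

  random.seed(0)
  k, n = 3, 8
  P = lambda b: sum(b) % 2 == 1                  # illustrative predicate (parity)
  x = [random.randint(0, 1) for _ in range(n)]
  terms = build_dnf(P, x, k)
  assert len(terms) <= 2 ** k                    # at most 2^k terms, as in the lemma
  for _ in range(1000):
      S = random.sample(range(n), k)             # a random hyperedge (distinct indices)
      assert eval_dnf(terms, encode(S, n), n) == P(tuple(x[i] for i in S))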

Lemma A.2.

Let 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}. There exists an affine layer with at most 2k2^{k} outputs, weights bounded by a constant and bias terms bounded by nlog(n)n\log(n) (for a sufficiently large nn), such that given an input 𝐳S{0,1}kn\mathbf{z}^{S}\in\{0,1\}^{kn} for some hyperedge SS, it satisfies the following: For SS with P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0 all outputs are at most 1-1, and for SS with P𝐱(𝐳S)=1P_{\mathbf{x}}(\mathbf{z}^{S})=1 there exists an output greater or equal to 22.

Proof.

By Lemma A.1, there exists a DNF formula φ𝐱\varphi_{\mathbf{x}} over {0,1}kn\{0,1\}^{kn} with at most 2k2^{k} terms, such that φ𝐱(𝐳S)=P𝐱(𝐳S)\varphi_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S}). Thus, if P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0 then all terms in φ𝐱\varphi_{\mathbf{x}} are not satisfied for the input 𝐳S\mathbf{z}^{S}, and if P𝐱(𝐳S)=1P_{\mathbf{x}}(\mathbf{z}^{S})=1 then there is at least one term in φ𝐱\varphi_{\mathbf{x}} which is satisfied for the input 𝐳S\mathbf{z}^{S}. Therefore, it suffices to construct an affine layer such that for an input 𝐳S\mathbf{z}^{S}, the jj-th output will be at most 1-1 if the jj-th term of φ𝐱\varphi_{\mathbf{x}} is not satisfied, and at least 22 otherwise. Each term CjC_{j} in φ𝐱\varphi_{\mathbf{x}} is a conjunction of positive literals. Let Ij[kn]I_{j}\subseteq[kn] be the indices of these literals. The jj-th output of the affine layer will be

(lIj3zlS)3|Ij|+2.\left(\sum_{l\in I_{j}}3z^{S}_{l}\right)-3|I_{j}|+2~{}.

Note that if the conjunction CjC_{j} holds, then this expression is exactly 3|Ij|3|Ij|+2=23|I_{j}|-3|I_{j}|+2=2, and otherwise it is at most 3(|Ij|1)3|Ij|+2=13(|I_{j}|-1)-3|I_{j}|+2=-1. Finally, note that all weights are bounded by 33 and all bias terms are bounded by nlog(n)n\log(n) (for large enough nn). ∎
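
As a quick numerical sanity check (ours; the vector and the index sets below are arbitrary), the affine output above equals 2 exactly when the corresponding conjunction holds, and is at most 1-1 otherwise.

  # affine output for a term C_j with positive-literal index set I_j, on a 0/1 vector z
  def term_output(I_j, z):
      return 3 * sum(z[l] for l in I_j) - 3 * len(I_j) + 2

  z = [1, 0, 1, 1, 0, 1]
  assert term_output([0, 2, 3], z) == 2     # all literals equal 1: 3|I_j| - 3|I_j| + 2
  assert term_output([0, 1, 2], z) == -1    # a literal equals 0: at most 3(|I_j|-1) - 3|I_j| + 2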

Lemma A.3.

Let 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}. There exists a depth-22 neural network N1N_{1} with input dimension knkn, 2kn2kn hidden neurons, at most 2k2^{k} output neurons, and parameter magnitudes bounded by n3n^{3} (for a sufficiently large nn), which satisfies the following. We denote the set of output neurons of N1N_{1} by 1{\cal E}_{1}. Let 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} be such that Ψ(𝐳)=𝐳S\Psi(\mathbf{z}^{\prime})=\mathbf{z}^{S} for some hyperedge SS, and assume that for every i[kn]i\in[kn] we have zi(c,c+1n2)z^{\prime}_{i}\not\in\left(c,c+\frac{1}{n^{2}}\right). Then, for SS with P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0 the inputs to all neurons 1{\cal E}_{1} are at most 1-1, and for SS with P𝐱(𝐳S)=1P_{\mathbf{x}}(\mathbf{z}^{S})=1 there exists a neuron in 1{\cal E}_{1} with input at least 22. Moreover, only the second layer of N1N_{1} depends on 𝐱\mathbf{x}.

Proof.

First, we construct a depth-22 neural network NΨ:kn[0,1]knN_{\Psi}:{\mathbb{R}}^{kn}\rightarrow[0,1]^{kn} with a single layer of non-linearity, such that for every 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} with zi(c,c+1n2)z^{\prime}_{i}\not\in(c,c+\frac{1}{n^{2}}) for every i[kn]i\in[kn], we have NΨ(𝐳)=Ψ(𝐳)N_{\Psi}(\mathbf{z}^{\prime})=\Psi(\mathbf{z}^{\prime}). The network NΨN_{\Psi} has 2kn2kn hidden neurons, and computes NΨ(𝐳)=(f(z1),,f(zkn))N_{\Psi}(\mathbf{z}^{\prime})=(f(z^{\prime}_{1}),\ldots,f(z^{\prime}_{kn})), where f:[0,1]f:{\mathbb{R}}\rightarrow[0,1] is such that

f(t)=n2([tc]+[t(c+1n2)]+).f(t)=n^{2}\cdot\left(\left[t-c\right]_{+}-\left[t-\left(c+\frac{1}{n^{2}}\right)\right]_{+}\right)~{}.

Note that if tct\leq c then f(t)=0f(t)=0, if tc+1n2t\geq c+\frac{1}{n^{2}} then f(t)=1f(t)=1, and if c<t<c+1n2c<t<c+\frac{1}{n^{2}} then f(t)(0,1)f(t)\in(0,1). Also, note that all weights and bias terms can be bounded by n2n^{2} (for large enough nn). Moreover, the network NΨN_{\Psi} does not depend on 𝐱\mathbf{x}.

Let 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} such that Ψ(𝐳)=𝐳S\Psi(\mathbf{z}^{\prime})=\mathbf{z}^{S} for some hyperedge SS, and assume that for every i[kn]i\in[kn] we have zi(c,c+1n2)z^{\prime}_{i}\not\in\left(c,c+\frac{1}{n^{2}}\right). For such 𝐳\mathbf{z}^{\prime}, we have NΨ(𝐳)=Ψ(𝐳)=𝐳SN_{\Psi}(\mathbf{z}^{\prime})=\Psi(\mathbf{z}^{\prime})=\mathbf{z}^{S}. Hence, it suffices to show that we can construct an affine layer with at most 2k2^{k} outputs, weights bounded by a constant and bias terms bounded by n3n^{3}, such that given an input 𝐳S\mathbf{z}^{S} it satisfies the following: For SS with P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0 all outputs are at most 1-1, and for SS with P𝐱(𝐳S)=1P_{\mathbf{x}}(\mathbf{z}^{S})=1 there exists an output greater or equal to 22. We construct such an affine layer in Lemma A.2. ∎
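
The clipping function ff defined in the proof above can be checked directly; the following sketch (ours, with illustrative values of nn and cc) confirms that it outputs exactly 0 below cc, exactly 1 above c+1n2c+\frac{1}{n^{2}}, and an intermediate value inside the interval.

  import numpy as np

  def relu(t):
      return np.maximum(t, 0.0)

  def f(t, c, n):
      # two ReLUs realize a hard 0/1 threshold outside the interval (c, c + 1/n^2)
      return n**2 * (relu(t - c) - relu(t - (c + 1.0 / n**2)))

  n, c = 100, 0.5
  assert f(c - 0.3, c, n) == 0.0
  assert np.isclose(f(c + 1.0 / n**2 + 0.3, c, n), 1.0)
  assert 0.0 < f(c + 0.5 / n**2, c, n) < 1.0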

Lemma A.4.

There exists an affine layer with 2k+n2k+n outputs, weights bounded by a constant and bias terms bounded by nlog(n)n\log(n) (for a sufficiently large nn), such that given an input 𝐳{0,1}kn\mathbf{z}\in\{0,1\}^{kn}, if it is an encoding of a hyperedge then all outputs are at most 1-1, and otherwise there exists an output greater or equal to 22.

Proof.

Note that 𝐳{0,1}kn\mathbf{z}\in\{0,1\}^{kn} is not an encoding of a hyperedge iff at least one of the following holds:

  1. At least one of the kk size-nn slices in 𝐳\mathbf{z} contains 0 more than once.

  2. At least one of the kk size-nn slices in 𝐳\mathbf{z} does not contain 0.

  3. There are two size-nn slices in 𝐳\mathbf{z} that encode the same index.

We define the outputs of our affine layer as follows. First, we have kk outputs that correspond to (1). In order to check whether slice i[k]i\in[k] contains 0 more than once, the output will be 3n4(j[n]3zi,j)3n-4-(\sum_{j\in[n]}3z_{i,j}). Second, we have kk outputs that correspond to (2): in order to check whether slice i[k]i\in[k] does not contain 0, the output will be (j[n]3zi,j)3n+2(\sum_{j\in[n]}3z_{i,j})-3n+2. Finally, we have nn outputs that correspond to (3): in order to check whether there are two slices that encode the same index j[n]j\in[n], the output will be 3k4(i[k]3zi,j)3k-4-(\sum_{i\in[k]}3z_{i,j}). Note that all weights are bounded by 33 and all bias terms are bounded by nlog(n)n\log(n) for large enough nn. ∎
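
The three families of checks can be exercised numerically. The sketch below (ours, with illustrative kk and nn) verifies that a valid hyperedge encoding makes all outputs at most 1-1, while a vector with a repeated index produces an output of at least 22.

  def hyperedge_checks(z, k, n):
      # the 2k + n affine outputs from the proof, for a 0/1 vector z of length k*n
      outs = []
      for i in range(k):   # (1): slice i contains 0 more than once
          outs.append(3 * n - 4 - 3 * sum(z[i * n:(i + 1) * n]))
      for i in range(k):   # (2): slice i does not contain 0
          outs.append(3 * sum(z[i * n:(i + 1) * n]) - 3 * n + 2)
      for j in range(n):   # (3): two slices encode the same index j
          outs.append(3 * k - 4 - 3 * sum(z[i * n + j] for i in range(k)))
      return outs

  def encode(S, n):        # z^S as in the reduction
      return [0 if j == i else 1 for i in S for j in range(n)]

  k, n = 3, 6
  assert max(hyperedge_checks(encode((0, 2, 5), n), k, n)) <= -1   # valid hyperedge
  assert max(hyperedge_checks(encode((0, 2, 2), n), k, n)) >= 2    # repeated index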

Lemma A.5.

There exists a depth-22 neural network N2N_{2} with input dimension knkn, at most 2kn2kn hidden neurons, 2k+n2k+n output neurons, and parameter magnitudes bounded by n3n^{3} (for a sufficiently large nn), which satisfies the following. We denote the set of output neurons of N2N_{2} by 2{\cal E}_{2}. Let 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} be such that for every i[kn]i\in[kn] we have zi(c,c+1n2)z^{\prime}_{i}\not\in\left(c,c+\frac{1}{n^{2}}\right). If Ψ(𝐳)\Psi(\mathbf{z}^{\prime}) is an encoding of a hyperedge then the inputs to all neurons 2{\cal E}_{2} are at most 1-1, and otherwise there exists a neuron in 2{\cal E}_{2} with input at least 22.

Proof.

Let NΨ:kn[0,1]knN_{\Psi}:{\mathbb{R}}^{kn}\rightarrow[0,1]^{kn} be the depth-22 neural network from the proof of Lemma A.3, with a single layer of non-linearity with 2kn2kn hidden neurons, and parameter magnitudes bounded by n2n^{2}, such that for every 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} with zi(c,c+1n2)z^{\prime}_{i}\not\in(c,c+\frac{1}{n^{2}}) for every i[kn]i\in[kn], we have NΨ(𝐳)=Ψ(𝐳)N_{\Psi}(\mathbf{z}^{\prime})=\Psi(\mathbf{z}^{\prime}).

Let 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} be such that for every i[kn]i\in[kn] we have zi(c,c+1n2)z^{\prime}_{i}\not\in\left(c,c+\frac{1}{n^{2}}\right). For such 𝐳\mathbf{z}^{\prime} we have NΨ(𝐳)=Ψ(𝐳)N_{\Psi}(\mathbf{z}^{\prime})=\Psi(\mathbf{z}^{\prime}). Hence, it suffices to show that we can construct an affine layer with 2k+n2k+n outputs, weights bounded by a constant and bias terms bounded by n3n^{3}, such that given an input 𝐳{0,1}kn\mathbf{z}\in\{0,1\}^{kn}, if it is an encoding of a hyperedge then all outputs are at most 1-1, and otherwise there exists an output greater or equal to 22. We construct such an affine layer in Lemma A.4. ∎

Lemma A.6.

There exists a depth-22 neural network N3N_{3} with input dimension knkn, at most nlog(n)n\log(n) hidden neurons, knnlog(n)kn\leq n\log(n) output neurons, and parameter magnitudes bounded by n3n^{3} (for a sufficiently large nn), which satisfies the following. We denote the set of output neurons of N3N_{3} by 3{\cal E}_{3}. Let 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn}. If there exists i[kn]i\in[kn] such that zi(c,c+1n2)z^{\prime}_{i}\in\left(c,c+\frac{1}{n^{2}}\right) then there exists a neuron in 3{\cal E}_{3} with input at least 22. If for all i[kn]i\in[kn] we have zi(c1n2,c+2n2)z^{\prime}_{i}\not\in\left(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}\right) then the inputs to all neurons in 3{\cal E}_{3} are at most 1-1.

Proof.

It suffices to construct a univariate depth-22 network f:f:{\mathbb{R}}\to{\mathbb{R}} with one non-linear layer and a constant number of hidden neurons, such that for every input zi(c,c+1n2)z^{\prime}_{i}\in(c,c+\frac{1}{n^{2}}) we have f(zi)=2f(z^{\prime}_{i})=2, and for every zi(c1n2,c+2n2)z^{\prime}_{i}\not\in(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}) we have f(zi)=1f(z^{\prime}_{i})=-1.

We construct ff as follows:

f(zi)=\displaystyle f(z^{\prime}_{i})= (3n2)([zi(c1n2)]+[zic]+)\displaystyle(3n^{2})\left(\left[z^{\prime}_{i}-\left(c-\frac{1}{n^{2}}\right)\right]_{+}-\left[z^{\prime}_{i}-c\right]_{+}\right)-
(3n2)([zi(c+1n2)]+[zi(c+2n2)]+)1.\displaystyle(3n^{2})\left(\left[z^{\prime}_{i}-\left(c+\frac{1}{n^{2}}\right)\right]_{+}-\left[z^{\prime}_{i}-\left(c+\frac{2}{n^{2}}\right)\right]_{+}\right)-1~{}.

Note that all weights and bias terms are bounded by n3n^{3} (for large enough nn). ∎
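
The univariate gadget ff constructed above can likewise be verified numerically; in the sketch below (ours, with illustrative nn and cc) it evaluates to 22 inside the inner interval and to 1-1 outside the outer interval.

  import numpy as np

  def relu(t):
      return np.maximum(t, 0.0)

  def bump(z, c, n):
      eps = 1.0 / n**2
      up = relu(z - (c - eps)) - relu(z - c)                  # rises from 0 to eps on [c - eps, c]
      down = relu(z - (c + eps)) - relu(z - (c + 2 * eps))    # rises from 0 to eps on [c + eps, c + 2*eps]
      return 3 * n**2 * (up - down) - 1.0

  n, c = 50, 0.25
  eps = 1.0 / n**2
  assert np.isclose(bump(c + 0.5 * eps, c, n), 2.0)   # inside (c, c + 1/n^2)
  assert np.isclose(bump(c - 2 * eps, c, n), -1.0)    # left of (c - 1/n^2, c + 2/n^2)
  assert np.isclose(bump(c + 3 * eps, c, n), -1.0)    # right of (c - 1/n^2, c + 2/n^2)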

Lemma A.7.

Let q=poly(n)q=\operatorname{poly}(n) and r=poly(n)r=\operatorname{poly}(n). Then, there exists τ=1poly(n)\tau=\frac{1}{\operatorname{poly}(n)} such that for a sufficiently large nn, with probability at least 1exp(n/2)1-\exp(-n/2) a vector 𝛏𝒩(𝟎,τ2Ir)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{r}) satisfies 𝛏1q\left\|\boldsymbol{\xi}\right\|\leq\frac{1}{q}.

Proof.

Let τ=1q2rn\tau=\frac{1}{q\sqrt{2rn}}. Every component ξi\xi_{i} in 𝝃\boldsymbol{\xi} has the distribution 𝒩(0,τ2){\cal N}(0,\tau^{2}). By a standard tail bound of the Gaussian distribution, we have for every i[r]i\in[r] and t0t\geq 0 that Pr[|ξi|t]2exp(t22τ2)\Pr[|\xi_{i}|\geq t]\leq 2\exp\left(-\frac{t^{2}}{2\tau^{2}}\right). Hence, for t=1qrt=\frac{1}{q\sqrt{r}}, we get

Pr[|ξi|1qr]2exp(12τ2q2r)=2exp(2rnq22q2r)=2exp(n).\Pr\left[\left|\xi_{i}\right|\geq\frac{1}{q\sqrt{r}}\right]\leq 2\exp\left(-\frac{1}{2\tau^{2}q^{2}r}\right)=2\exp\left(-\frac{2rnq^{2}}{2q^{2}r}\right)=2\exp\left(-n\right)~{}.

By the union bound, with probability at least 1r2en1-r\cdot 2e^{-n}, we have

𝝃2r1rq2=1q2.\left\|\boldsymbol{\xi}\right\|^{2}\leq r\cdot\frac{1}{rq^{2}}=\frac{1}{q^{2}}~{}.

Thus, for a sufficiently large nn, with probability at least 1exp(n/2)1-\exp(-n/2) we have 𝝃1q\left\|\boldsymbol{\xi}\right\|\leq\frac{1}{q}. ∎
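
The choice of τ\tau can also be sanity-checked empirically. The sketch below (ours, with arbitrary polynomial choices of qq and rr) estimates the probability that the norm of 𝝃\boldsymbol{\xi} is at most 1q\frac{1}{q}.

  import numpy as np

  rng = np.random.default_rng(0)
  n = 30
  q, r = n**2, 3 * n**2                       # illustrative poly(n) choices
  tau = 1.0 / (q * np.sqrt(2 * r * n))

  xi = rng.normal(0.0, tau, size=(1000, r))   # 1000 independent draws of xi ~ N(0, tau^2 I_r)
  frac = np.mean(np.linalg.norm(xi, axis=1) <= 1.0 / q)
  print(f"fraction of draws with ||xi|| <= 1/q: {frac}")   # essentially 1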

Lemma A.8.

If 𝒮{\cal S} is pseudorandom then with probability at least 3940\frac{39}{40} (over 𝛏𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}) and the i.i.d. inputs 𝐳~i𝒟\tilde{\mathbf{z}}_{i}\sim{\cal D}) the examples (𝐳~1,y~1),,(𝐳~m(n)+n3,y~m(n)+n3)(\tilde{\mathbf{z}}_{1},\tilde{y}_{1}),\ldots,(\tilde{\mathbf{z}}_{m(n)+n^{3}},\tilde{y}_{m(n)+n^{3}}) returned by the oracle are realized by N^\hat{N}.

Proof.

By our choice of τ\tau, with probability at least 11n1-\frac{1}{n} over 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}), we have |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p], and for every 𝐳~\tilde{\mathbf{z}} with 𝐳~2n\left\|\tilde{\mathbf{z}}\right\|\leq 2n the inputs to the neurons 1,2,3{\cal E}_{1},{\cal E}_{2},{\cal E}_{3} in the computation N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) satisfy Properties (P1) through (P3). We first show that with probability at least 11n1-\frac{1}{n} all examples 𝐳~1,,𝐳~m(n)+n3\tilde{\mathbf{z}}_{1},\ldots,\tilde{\mathbf{z}}_{m(n)+n^{3}} satisfy 𝐳~i2n\left\|\tilde{\mathbf{z}}_{i}\right\|\leq 2n. Hence, with probability at least 12n1-\frac{2}{n}, Properties (P1) through (P3) hold for the computations N^(𝐳~i)\hat{N}(\tilde{\mathbf{z}}_{i}) for all i[m(n)+n3]i\in[m(n)+n^{3}].

Note that 𝐳~i2\left\|\tilde{\mathbf{z}}_{i}\right\|^{2} has the Chi-squared distribution. Since 𝐳~i\tilde{\mathbf{z}}_{i} is of dimension n2n^{2}, a concentration bound by Laurent and Massart [Laurent and Massart, 2000, Lemma 1] implies that for all t>0t>0 we have

Pr[𝐳~i2n22nt+2t]et.\Pr\left[\left\|\tilde{\mathbf{z}}_{i}\right\|^{2}-n^{2}\geq 2n\sqrt{t}+2t\right]\leq e^{-t}~{}.

Plugging-in t=n24t=\frac{n^{2}}{4}, we get

Pr[𝐳~i24n2]\displaystyle\Pr\left[\left\|\tilde{\mathbf{z}}_{i}\right\|^{2}\geq 4n^{2}\right] =Pr[𝐳~i2n23n2]\displaystyle=\Pr\left[\left\|\tilde{\mathbf{z}}_{i}\right\|^{2}-n^{2}\geq 3n^{2}\right]
Pr[𝐳~i2n23n22]\displaystyle\leq\Pr\left[\left\|\tilde{\mathbf{z}}_{i}\right\|^{2}-n^{2}\geq\frac{3n^{2}}{2}\right]
=Pr[𝐳~i2n22nn24+2n24]\displaystyle=\Pr\left[\left\|\tilde{\mathbf{z}}_{i}\right\|^{2}-n^{2}\geq 2n\sqrt{\frac{n^{2}}{4}}+2\cdot\frac{n^{2}}{4}\right]
exp(n24).\displaystyle\leq\exp\left(-\frac{n^{2}}{4}\right)~{}.

Thus, we have Pr[𝐳~i2n]exp(n24)\Pr\left[\left\|\tilde{\mathbf{z}}_{i}\right\|\geq 2n\right]\leq\exp\left(-\frac{n^{2}}{4}\right). By the union bound, with probability at least

1(m(n)+n3)exp(n24)11n1-\left(m(n)+n^{3}\right)\exp\left(-\frac{n^{2}}{4}\right)\geq 1-\frac{1}{n}

(for a sufficiently large nn), all examples (𝐳~i,y~i)(\tilde{\mathbf{z}}_{i},\tilde{y}_{i}) satisfy 𝐳~i2n\left\|\tilde{\mathbf{z}}_{i}\right\|\leq 2n.
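
As an aside, the event bounded above is extremely rare also in simulation; a quick numerical check (ours, with an illustrative nn) is the following.

  import numpy as np

  rng = np.random.default_rng(1)
  n = 30
  z = rng.standard_normal((2000, n * n))               # 2,000 draws of a standard Gaussian in R^{n^2}
  print(np.mean(np.linalg.norm(z, axis=1) >= 2 * n))   # empirically 0; the bound above gives exp(-n^2/4)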

Thus, we showed that with probability at least 12n39401-\frac{2}{n}\geq\frac{39}{40} (for a sufficiently large nn), we have |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p], and Properties (P1) through (P3) hold for the computations N^(𝐳~i)\hat{N}(\tilde{\mathbf{z}}_{i}) for all i[m(n)+n3]i\in[m(n)+n^{3}]. It remains to show that if these properties hold, then the examples (𝐳~1,y~1),,(𝐳~m(n)+n3,y~m(n)+n3)(\tilde{\mathbf{z}}_{1},\tilde{y}_{1}),\ldots,(\tilde{\mathbf{z}}_{m(n)+n^{3}},\tilde{y}_{m(n)+n^{3}}) are realized by N^\hat{N}.

Let i[m(n)+n3]i\in[m(n)+n^{3}]. For brevity, we denote 𝐳~=𝐳~i\tilde{\mathbf{z}}=\tilde{\mathbf{z}}_{i}, y~=y~i\tilde{y}=\tilde{y}_{i}, and 𝐳=𝐳~[kn]\mathbf{z}^{\prime}=\tilde{\mathbf{z}}_{[kn]}. Since |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p], and all incoming weights to the output neuron in N~\tilde{N} are 1-1, then in N^\hat{N} all incoming weights to the output neuron are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and the bias term in the output neuron, denoted by b^\hat{b}, is in [910,1110]\left[\frac{9}{10},\frac{11}{10}\right]. Consider the following cases:

  • If Ψ(𝐳)\Psi(\mathbf{z}^{\prime}) is not an encoding of a hyperedge then y~=0\tilde{y}=0, and N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) satisfies:

    1. If 𝐳\mathbf{z}^{\prime} does not have components in (c,c+1n2)\left(c,c+\frac{1}{n^{2}}\right), then there exists a neuron in 2{\cal E}_{2} with output at least 32\frac{3}{2} (by Property (P2)).

    2. If 𝐳\mathbf{z}^{\prime} has a component in (c,c+1n2)\left(c,c+\frac{1}{n^{2}}\right), then there exists a neuron in 3{\cal E}_{3} with output at least 32\frac{3}{2} (by Property (P3)).

    In both cases, since all incoming weights to the output neuron in N^\hat{N} are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and b^[910,1110]\hat{b}\in\left[\frac{9}{10},\frac{11}{10}\right], then the input to the output neuron (including the bias term) is at most 111032910<0\frac{11}{10}-\frac{3}{2}\cdot\frac{9}{10}<0, and thus its output is 0.

  • If Ψ(𝐳)\Psi(\mathbf{z}^{\prime}) is an encoding of a hyperedge SS, then by the definition of the examples oracle we have S=SiS=S_{i}. Hence:

    • If 𝐳\mathbf{z}^{\prime} does not have components in (c1n2,c+2n2)\left(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}\right), then:

      * If yi=0y_{i}=0 then the oracle sets y~=b^\tilde{y}=\hat{b}. Since 𝒮{\cal S} is pseudorandom, we have P𝐱(𝐳S)=P𝐱(𝐳Si)=yi=0P_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S_{i}})=y_{i}=0. Hence, in the computation N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) the inputs to all neurons in 1,2,3{\cal E}_{1},{\cal E}_{2},{\cal E}_{3} are at most 12-\frac{1}{2} (by Properties (P1), (P2) and (P3)), and thus their outputs are 0. Therefore, N^(𝐳~)=b^\hat{N}(\tilde{\mathbf{z}})=\hat{b}.

      * If yi=1y_{i}=1 then the oracle sets y~=0\tilde{y}=0. Since 𝒮{\cal S} is pseudorandom, we have P𝐱(𝐳S)=P𝐱(𝐳Si)=yi=1P_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S_{i}})=y_{i}=1. Hence, in the computation N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) there exists a neuron in 1{\cal E}_{1} with output at least 32\frac{3}{2} (by Property (P1)). Since all incoming weights to the output neuron in N^\hat{N} are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and b^[910,1110]\hat{b}\in\left[\frac{9}{10},\frac{11}{10}\right], the input to the output neuron (including the bias term) is at most 111032910<0\frac{11}{10}-\frac{3}{2}\cdot\frac{9}{10}<0, and thus its output is 0.

    • If 𝐳\mathbf{z}^{\prime} has a component in (c,c+1n2)\left(c,c+\frac{1}{n^{2}}\right), then y~=0\tilde{y}=0. Also, in the computation N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) there exists a neuron in 3{\cal E}_{3} with output at least 32\frac{3}{2} (by Property (P3)). Since all incoming weights to the output neuron in N^\hat{N} are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and b^[910,1110]\hat{b}\in\left[\frac{9}{10},\frac{11}{10}\right], then the input to output neuron (including the bias term) is at most 111032910<0\frac{11}{10}-\frac{3}{2}\cdot\frac{9}{10}<0, and thus its output is 0.

    • If 𝐳\mathbf{z}^{\prime} does not have components in the interval (c,c+1n2)(c,c+\frac{1}{n^{2}}), but has a component in the interval (c1n2,c+2n2)(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}), then:

      * If yi=1y_{i}=1 then the oracle sets y~=0\tilde{y}=0. Since 𝒮{\cal S} is pseudorandom, we have P𝐱(𝐳S)=P𝐱(𝐳Si)=yi=1P_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S_{i}})=y_{i}=1. Hence, in the computation N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) there exists a neuron in 1{\cal E}_{1} with output at least 32\frac{3}{2} (by Property (P1)). Since all incoming weights to the output neuron in N^\hat{N} are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and b^[910,1110]\hat{b}\in\left[\frac{9}{10},\frac{11}{10}\right], the input to the output neuron (including the bias term) is at most 111032910<0\frac{11}{10}-\frac{3}{2}\cdot\frac{9}{10}<0, and thus its output is 0.

      * If yi=0y_{i}=0 then the oracle sets y~=[b^N^3(𝐳~)]+\tilde{y}=[\hat{b}-\hat{N}_{3}(\tilde{\mathbf{z}})]_{+}. Since 𝒮{\cal S} is pseudorandom, we have P𝐱(𝐳S)=P𝐱(𝐳Si)=yi=0P_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S_{i}})=y_{i}=0. Therefore, in the computation N^(𝐳~)\hat{N}(\tilde{\mathbf{z}}) all neurons in 1,2{\cal E}_{1},{\cal E}_{2} have output 0 (by Properties (P1) and (P2)), and hence their contribution to the output of N^\hat{N} is 0. Thus, by the definition of N^3\hat{N}_{3}, we have N^(𝐳~)=[b^N^3(𝐳~)]+\hat{N}(\tilde{\mathbf{z}})=[\hat{b}-\hat{N}_{3}(\tilde{\mathbf{z}})]_{+}.

Lemma A.9.

If 𝒮{\cal S} is pseudorandom, then for a sufficiently large nn, with probability greater than 23\frac{2}{3} we have

I(h)2n.\ell_{I}(h^{\prime})\leq\frac{2}{n}~{}.
Proof.

By Lemma A.8, if 𝒮{\cal S} is pseudorandom then with probability at least 3940\frac{39}{40} (over 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}) and the i.i.d. inputs 𝐳~i𝒟\tilde{\mathbf{z}}_{i}\sim{\cal D}) the examples (𝐳~1,y~1),,(𝐳~m(n),y~m(n))(\tilde{\mathbf{z}}_{1},\tilde{y}_{1}),\ldots,(\tilde{\mathbf{z}}_{m(n)},\tilde{y}_{m(n)}) returned by the oracle are realized by N^\hat{N}. Recall that the algorithm {\cal L} is such that with probability at least 34\frac{3}{4} (over 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}), the i.i.d. inputs 𝐳~i𝒟\tilde{\mathbf{z}}_{i}\sim{\cal D}, and possibly its internal randomness), given a size-m(n)m(n) dataset labeled by N^\hat{N}, it returns a hypothesis hh such that 𝔼𝐳~𝒟[(h(𝐳~)N^(𝐳~))2]1n\operatorname*{\mathbb{E}}_{\tilde{\mathbf{z}}\sim{\cal D}}\left[(h(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^{2}\right]\leq\frac{1}{n}. Hence, with probability at least 34140\frac{3}{4}-\frac{1}{40} the algorithm {\cal L} returns such a good hypothesis hh, given m(n)m(n) examples labeled by our examples oracle. Indeed, note that {\cal L} can return a bad hypothesis only if the random choices are either bad for {\cal L} (when used with realizable examples) or bad for the realizability of the examples returned by our oracle. By the definition of hh^{\prime} and the construction of N^\hat{N}, if hh has small error then hh^{\prime} also has small error, namely,

𝔼𝐳~𝒟[(h(𝐳~)N^(𝐳~))2]𝔼𝐳~𝒟[(h(𝐳~)N^(𝐳~))2]1n.\operatorname*{\mathbb{E}}_{\tilde{\mathbf{z}}\sim{\cal D}}\left[(h^{\prime}(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^{2}\right]\leq\operatorname*{\mathbb{E}}_{\tilde{\mathbf{z}}\sim{\cal D}}\left[(h(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^{2}\right]\leq\frac{1}{n}~{}.

Let ^I(h)=1|I|iI(h(𝐳~i)N^(𝐳~i))2\hat{\ell}_{I}(h^{\prime})=\frac{1}{|I|}\sum_{i\in I}(h^{\prime}(\tilde{\mathbf{z}}_{i})-\hat{N}(\tilde{\mathbf{z}}_{i}))^{2}. Recall that by our choice of τ\tau we have Pr[b^>1110]1n\Pr[\hat{b}>\frac{11}{10}]\leq\frac{1}{n}. Since (h(𝐳~)N^(𝐳~))2[0,b^2](h^{\prime}(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^{2}\in[0,\hat{b}^{2}] for all 𝐳~n2\tilde{\mathbf{z}}\in{\mathbb{R}}^{n^{2}}, by Hoeffding’s inequality, we have for a sufficiently large nn that

Pr[|^I(h)𝔼𝒮~I^I(h)|1n]\displaystyle\Pr\left[\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\geq\frac{1}{n}\right] =Pr[|^I(h)𝔼𝒮~I^I(h)|1n|b^1110]Pr[b^1110]\displaystyle=\Pr\left[\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}\leq\frac{11}{10}\right]\cdot\Pr\left[\hat{b}\leq\frac{11}{10}\right]
+Pr[|^I(h)𝔼𝒮~I^I(h)|1n|b^>1110]Pr[b^>1110]\displaystyle\;\;\;\;+\Pr\left[\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}>\frac{11}{10}\right]\cdot\Pr\left[\hat{b}>\frac{11}{10}\right]
2exp(2n3n2(11/10)4)1+11n\displaystyle\leq 2\exp\left(-\frac{2n^{3}}{n^{2}(11/10)^{4}}\right)\cdot 1+1\cdot\frac{1}{n}
140.\displaystyle\leq\frac{1}{40}~{}.

Moreover, by Lemma A.8,

Pr[I(h)^I(h)]Pr[iI s.t. y~iN^(𝐳~i)]140.\Pr\left[\ell_{I}(h^{\prime})\neq\hat{\ell}_{I}(h^{\prime})\right]\leq\Pr\left[\exists i\in I\text{ s.t. }\tilde{y}_{i}\neq\hat{N}(\tilde{\mathbf{z}}_{i})\right]\leq\frac{1}{40}~{}.

Overall, by the union bound we have with probability at least 1(14+140+140+140)>231-\left(\frac{1}{4}+\frac{1}{40}+\frac{1}{40}+\frac{1}{40}\right)>\frac{2}{3} for sufficiently large nn that:

  • 𝔼𝒮~I^I(h)=𝔼𝐳~𝒟[(h(𝐳~)N^(𝐳~))2]1n\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})=\operatorname*{\mathbb{E}}_{\tilde{\mathbf{z}}\sim{\cal D}}\left[(h^{\prime}(\tilde{\mathbf{z}})-\hat{N}(\tilde{\mathbf{z}}))^{2}\right]\leq\frac{1}{n}.

  • |^I(h)𝔼𝒮~I^I(h)|1n\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\leq\frac{1}{n}.

  • I(h)^I(h)=0\ell_{I}(h^{\prime})-\hat{\ell}_{I}(h^{\prime})=0.

Combining the above, we get that if 𝒮{\cal S} is pseudorandom, then with probability greater than 23\frac{2}{3} we have

I(h)=(I(h)^I(h))+(^I(h)𝔼𝒮~I^I(h))+𝔼𝒮~I^I(h)0+1n+1n=2n.\ell_{I}(h^{\prime})=\left(\ell_{I}(h^{\prime})-\hat{\ell}_{I}(h^{\prime})\right)+\left(\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right)+\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\leq 0+\frac{1}{n}+\frac{1}{n}=\frac{2}{n}~{}.

Lemma A.10.

Let 𝐳{0,1}kn\mathbf{z}\in\{0,1\}^{kn} be a random vector whose components are drawn i.i.d. from a Bernoulli distribution, which takes the value 0 with probability 1n\frac{1}{n}. Then, for a sufficiently large nn, the vector 𝐳\mathbf{z} is an encoding of a hyperedge with probability at least 1log(n)\frac{1}{\log(n)}.

Proof.

The vector 𝐳\mathbf{z} represents a hyperedge iff in each of the kk size-nn slices in 𝐳\mathbf{z} there is exactly one 0-bit and no two of the kk slices in 𝐳\mathbf{z} encode the same index. Hence,

Pr[𝐳 represents a hyperedge]\displaystyle\Pr\left[\mathbf{z}\text{ represents a hyperedge}\right] =n(n1)(nk+1)(1n)k(n1n)nkk\displaystyle=n\cdot(n-1)\cdot\ldots\cdot(n-k+1)\cdot\left(\frac{1}{n}\right)^{k}\left(\frac{n-1}{n}\right)^{nk-k}
(nkn)k(n1n)k(n1)\displaystyle\geq\left(\frac{n-k}{n}\right)^{k}\left(\frac{n-1}{n}\right)^{k(n-1)}
=(1kn)k(11n)k(n1).\displaystyle=\left(1-\frac{k}{n}\right)^{k}\left(1-\frac{1}{n}\right)^{k(n-1)}~{}.

Since for every x(0,1)x\in(0,1) we have ex<1x2e^{-x}<1-\frac{x}{2}, then for a sufficiently large nn the above is at least

exp(2k2n)exp(2k(n1)n)exp(1)exp(2k)1log(n).\exp\left(-\frac{2k^{2}}{n}\right)\cdot\exp\left(-\frac{2k(n-1)}{n}\right)\geq\exp\left(-1\right)\cdot\exp\left(-2k\right)\geq\frac{1}{\log(n)}~{}.
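
The estimate above can be checked by simulation. The sketch below (ours, with illustrative kk and nn) compares the empirical probability with the intermediate bound exp(12k)\exp(-1-2k) from the proof; note that the final 1log(n)\frac{1}{\log(n)} bound only takes over once log(n)\log(n) exceeds e1+2ke^{1+2k}, i.e., for very large nn.

  import numpy as np

  rng = np.random.default_rng(0)
  k, n, trials = 3, 100, 10_000
  zeros = rng.random((trials, k, n)) < 1.0 / n              # True where a coordinate equals 0
  one_zero_per_slice = (zeros.sum(axis=2) == 1).all(axis=1)
  first_zero = zeros.argmax(axis=2)                         # position of the zero in each slice
  distinct = np.array([len(set(row)) == k for row in first_zero])
  frac = (one_zero_per_slice & distinct).mean()
  print(frac, np.exp(-1 - 2 * k))                           # empirical probability vs. exp(-1-2k)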

Lemma A.11.

Let 𝐳~n2\tilde{\mathbf{z}}\in{\mathbb{R}}^{n^{2}} be the vector returned by the oracle. We have

Pr[𝐳~𝒵~]12log(n).\Pr\left[\tilde{\mathbf{z}}\in\tilde{{\cal Z}}\right]\geq\frac{1}{2\log(n)}~{}.
Proof.

Let 𝐳=𝐳~[kn]\mathbf{z}^{\prime}=\tilde{\mathbf{z}}_{[kn]}. We have

Pr[𝐳~𝒵~]Pr[j[kn] s.t. zj(c1n2,c+2n2)]+Pr[Ψ(𝐳) does not represent a hyperedge].\Pr\left[\tilde{\mathbf{z}}\not\in\tilde{{\cal Z}}\right]\leq\Pr\left[\exists j\in[kn]\text{ s.t. }z^{\prime}_{j}\in\left(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}\right)\right]+\Pr\left[\Psi(\mathbf{z}^{\prime})\text{ does not represent a hyperedge}\right]~{}. (1)

We now bound the terms in the above RHS. First, since 𝐳\mathbf{z}^{\prime} has the Gaussian distribution, its components are drawn i.i.d. from a density function bounded by 12π\frac{1}{\sqrt{2\pi}}. Hence, for a sufficiently large nn we have

Pr[j[kn] s.t. zj(c1n2,c+2n2)]kn12π3n2=3k2πnlog(n)n.\Pr\left[\exists j\in[kn]\text{ s.t. }z^{\prime}_{j}\in\left(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}\right)\right]\leq kn\cdot\frac{1}{\sqrt{2\pi}}\cdot\frac{3}{n^{2}}=\frac{3k}{\sqrt{2\pi}\,n}\leq\frac{\log(n)}{n}~{}. (2)

Let 𝐳=Ψ(𝐳)\mathbf{z}=\Psi(\mathbf{z}^{\prime}). Note that 𝐳\mathbf{z} is a random vector whose components are drawn i.i.d. from a Bernoulli distribution, where the probability to get 0 is 1n\frac{1}{n}. By Lemma A.10, 𝐳\mathbf{z} is an encoding of a hyperedge with probability at least 1log(n)\frac{1}{\log(n)}. Combining this with Eq. (1) and (2), we get for a sufficiently large nn that

Pr[𝐳~𝒵~]log(n)n+(11log(n))112log(n),\Pr\left[\tilde{\mathbf{z}}\not\in\tilde{{\cal Z}}\right]\leq\frac{\log(n)}{n}+\left(1-\frac{1}{\log(n)}\right)\leq 1-\frac{1}{2\log(n)}~{},

as required. ∎

Lemma A.12.

If 𝒮{\cal S} is random, then for a sufficiently large nn with probability larger than 23\frac{2}{3} we have

I(h)>2n.\ell_{I}(h^{\prime})>\frac{2}{n}~{}.
Proof.

Let 𝒵~n2\tilde{{\cal Z}}\subseteq{\mathbb{R}}^{n^{2}} be such that 𝐳~𝒵~\tilde{\mathbf{z}}\in\tilde{{\cal Z}} iff 𝐳~[kn]\tilde{\mathbf{z}}_{[kn]} does not have components in the interval (c1n2,c+2n2)(c-\frac{1}{n^{2}},c+\frac{2}{n^{2}}), and Ψ(𝐳~[kn])=𝐳S\Psi(\tilde{\mathbf{z}}_{[kn]})=\mathbf{z}^{S} for a hyperedge SS. If 𝒮{\cal S} is random, then by the definition of our examples oracle, for every i[m(n)+n3]i\in[m(n)+n^{3}] such that 𝐳~i𝒵~\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}, we have y~i=b^\tilde{y}_{i}=\hat{b} with probability 12\frac{1}{2} and y~i=0\tilde{y}_{i}=0 otherwise. Also, by the definition of the oracle, y~i\tilde{y}_{i} is independent of SiS_{i} and independent of the choice of the vector 𝐳~i\tilde{\mathbf{z}}_{i} that corresponds to 𝐳Si\mathbf{z}^{S_{i}}. If b^910\hat{b}\geq\frac{9}{10} then for a sufficiently large nn the hypothesis hh^{\prime} satisfies for each random example (𝐳~i,y~i)𝒮~I(\tilde{\mathbf{z}}_{i},\tilde{y}_{i})\in\tilde{{\cal S}}_{I} the following

Pr(𝐳~i,y~i)\displaystyle\Pr_{(\tilde{\mathbf{z}}_{i},\tilde{y}_{i})} [(h(𝐳~i)y~i)215]\displaystyle\left[(h^{\prime}(\tilde{\mathbf{z}}_{i})-\tilde{y}_{i})^{2}\geq\frac{1}{5}\right]
Pr(𝐳~i,y~i)[(h(𝐳~i)y~i)215|𝐳~i𝒵~]Pr𝐳~i[𝐳~i𝒵~]\displaystyle\geq\Pr_{(\tilde{\mathbf{z}}_{i},\tilde{y}_{i})}\left[\left.(h^{\prime}(\tilde{\mathbf{z}}_{i})-\tilde{y}_{i})^{2}\geq\frac{1}{5}\;\right|\;\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}\right]\cdot\Pr_{\tilde{\mathbf{z}}_{i}}\left[\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}\right]
Pr(𝐳~i,y~i)[(h(𝐳~i)y~i)2(b^2)2|𝐳~i𝒵~]Pr𝐳~i[𝐳~i𝒵~]\displaystyle\geq\Pr_{(\tilde{\mathbf{z}}_{i},\tilde{y}_{i})}\left[\left.(h^{\prime}(\tilde{\mathbf{z}}_{i})-\tilde{y}_{i})^{2}\geq\left(\frac{\hat{b}}{2}\right)^{2}\;\right|\;\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}\right]\cdot\Pr_{\tilde{\mathbf{z}}_{i}}\left[\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}\right]
12Pr𝐳~i[𝐳~i𝒵~].\displaystyle\geq\frac{1}{2}\cdot\Pr_{\tilde{\mathbf{z}}_{i}}\left[\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}\right]~{}.

In Lemma A.11, we show that Pr𝐳~i[𝐳~i𝒵~]12log(n)\Pr_{\tilde{\mathbf{z}}_{i}}\left[\tilde{\mathbf{z}}_{i}\in\tilde{{\cal Z}}\right]\geq\frac{1}{2\log(n)}. Hence,

Pr(𝐳~i,y~i)[(h(𝐳~i)y~i)215]1212log(n)14log(n).\displaystyle\Pr_{(\tilde{\mathbf{z}}_{i},\tilde{y}_{i})}\left[(h^{\prime}(\tilde{\mathbf{z}}_{i})-\tilde{y}_{i})^{2}\geq\frac{1}{5}\right]\geq\frac{1}{2}\cdot\frac{1}{2\log(n)}\geq\frac{1}{4\log(n)}~{}.

Thus, if b^910\hat{b}\geq\frac{9}{10} then we have

𝔼𝒮~I[I(h)]1514log(n)=120log(n).\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\left[\ell_{I}(h^{\prime})\right]\geq\frac{1}{5}\cdot\frac{1}{4\log(n)}=\frac{1}{20\log(n)}~{}.

Therefore, for large nn we have

Pr[𝔼𝒮~I[I(h)]120log(n)]11n78.\Pr\left[\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\left[\ell_{I}(h^{\prime})\right]\geq\frac{1}{20\log(n)}\right]\geq 1-\frac{1}{n}\geq\frac{7}{8}~{}.

Since (h(𝐳~)y~)2[0,b^2](h^{\prime}(\tilde{\mathbf{z}})-\tilde{y})^{2}\in[0,\hat{b}^{2}] for all 𝐳~,y~\tilde{\mathbf{z}},\tilde{y} returned by the examples oracle, and the examples 𝐳~i\tilde{\mathbf{z}}_{i} for iIi\in I are i.i.d., by Hoeffding’s inequality we have for a sufficiently large nn that

Pr[|I(h)𝔼𝒮~II(h)|1n]\displaystyle\Pr\left[\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\geq\frac{1}{n}\right] =Pr[|I(h)𝔼𝒮~II(h)|1n|b^1110]Pr[b^1110]\displaystyle=\Pr\left[\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}\leq\frac{11}{10}\right]\cdot\Pr\left[\hat{b}\leq\frac{11}{10}\right]
+Pr[|I(h)𝔼𝒮~II(h)|1n|b^>1110]Pr[b^>1110]\displaystyle\;\;\;\;+\Pr\left[\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}>\frac{11}{10}\right]\cdot\Pr\left[\hat{b}>\frac{11}{10}\right]
2exp(2n3n2(11/10)4)1+11n\displaystyle\leq 2\exp\left(-\frac{2n^{3}}{n^{2}(11/10)^{4}}\right)\cdot 1+1\cdot\frac{1}{n}
18.\displaystyle\leq\frac{1}{8}~{}.

Hence, for large enough nn, with probability at least 11818=34>231-\frac{1}{8}-\frac{1}{8}=\frac{3}{4}>\frac{2}{3} we have both 𝔼𝒮~I[I(h)]120log(n)\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\left[\ell_{I}(h^{\prime})\right]\geq\frac{1}{20\log(n)} and |I(h)𝔼𝒮~II(h)|1n\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\tilde{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\leq\frac{1}{n}, and thus

I(h)120log(n)1n>2n.\ell_{I}(h^{\prime})\geq\frac{1}{20\log(n)}-\frac{1}{n}>\frac{2}{n}~{}.
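
At the heart of the argument is the fact that labels which equal b^\hat{b} with probability 12\frac{1}{2} and 0 otherwise, independently of the input, cannot be predicted. The following toy simulation (ours; the values are arbitrary) illustrates that every fixed prediction then incurs expected squared loss at least (b^2)2\left(\frac{\hat{b}}{2}\right)^{2}.

  import numpy as np

  rng = np.random.default_rng(0)
  b_hat, m = 1.0, 100_000
  y = rng.choice([0.0, b_hat], size=m)                 # unpredictable labels: b_hat w.p. 1/2, else 0
  for p in [0.0, 0.25 * b_hat, 0.5 * b_hat, b_hat]:    # a few candidate predictions
      print(p, np.mean((p - y) ** 2))                  # always at least (b_hat / 2)^2 = 0.25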

Appendix B Proof of Corollary 3.1

By the proof of Theorem 3.1, under Assumption 2.1, there is no poly(d)\operatorname{poly}(d)-time algorithm s{\cal L}_{s} that satisfies the following: Let 𝜽p{\boldsymbol{\theta}}\in{\mathbb{R}}^{p} be BB-bounded parameters of a depth-33 network N𝜽:dN_{\boldsymbol{\theta}}:{\mathbb{R}}^{d}\to{\mathbb{R}}, and let τ,ϵ>0\tau,\epsilon>0. Assume that p,B,1/ϵ,1/τpoly(d)p,B,1/\epsilon,1/\tau\leq\operatorname{poly}(d), and that the widths of the hidden layers in N𝜽N_{\boldsymbol{\theta}} are dd (i.e., the weight matrices are square). Let 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}) and let 𝜽^=𝜽+𝝃\hat{{\boldsymbol{\theta}}}={\boldsymbol{\theta}}+\boldsymbol{\xi}. Then, with probability at least 3411000\frac{3}{4}-\frac{1}{1000}, given access to an examples oracle for N𝜽^N_{\hat{{\boldsymbol{\theta}}}}, the algorithm s{\cal L}_{s} returns a hypothesis hh with 𝔼𝐱[(h(𝐱)N𝜽^(𝐱))2]ϵ\operatorname*{\mathbb{E}}_{\mathbf{x}}\left[(h(\mathbf{x})-N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{x}))^{2}\right]\leq\epsilon.

Note that in the above, the requirements from s{\cal L}_{s} are somewhat weaker than in our original definition of learning with smoothed parameters. Indeed, we assume that the widths of the hidden layers are dd and the required success probability is only 3411000\frac{3}{4}-\frac{1}{1000} (rather than 34\frac{3}{4}). We now explain why the hardness result holds already under these conditions:

  • Note that if we change the assumption on the learning algorithm in the proof of Theorem 3.1 such that it succeeds with probability at least 3411000\frac{3}{4}-\frac{1}{1000} (rather than 34\frac{3}{4}), then in the case where 𝒮{\cal S} is pseudorandom we get that the algorithm 𝒜{\cal A} returns 11 with probability at least 1(14+11000+140+140+140)1-\left(\frac{1}{4}+\frac{1}{1000}+\frac{1}{40}+\frac{1}{40}+\frac{1}{40}\right) (see the proof of Lemma A.9), which is still greater than 23\frac{2}{3}. Also, the analysis of the case where 𝒮{\cal S} is random does not change, and thus in this case 𝒜{\cal A} returns 0 with probability greater than 23\frac{2}{3}. Consequently, we still get a distinguishing advantage greater than 13\frac{1}{3}.

  • Regarding the requirement on the widths, we note that in the proof of Theorem 3.1 the layers satisfy the following. The input dimension is d=n2d=n^{2}, the width of the first hidden layer is at most 3nlog(n)d3n\log(n)\leq d, and the width of the second hidden layer is at most log(n)+2n+nlog(n)d\log(n)+2n+n\log(n)\leq d (all bounds are for a sufficiently large dd). In order to get a network where all layers are of width dd, we add new neurons to the hidden layers, with incoming weights 0, outgoing weights 0, and bias terms 1-1. Then, for an appropriate choice of τ=1/poly(n)\tau=1/\operatorname{poly}(n), even in the perturbed network the outputs of these new neurons will be 0 w.h.p. for every input 𝐳~1,,𝐳~m(n)+n3\tilde{\mathbf{z}}_{1},\ldots,\tilde{\mathbf{z}}_{m(n)+n^{3}}, and thus they will not affect the network’s output. Thus, using the same argument as in the proof of Theorem 3.1, we conclude that the hardness result holds already for networks with square weight matrices.

Suppose that there exists an efficient algorithm p{\cal L}_{p} that learns in the standard PAC framework depth-33 neural networks where the minimal singular value of each weight matrix is lower bounded by 1/q(d)1/q(d) for any polynomial q(d)q(d). We will use p{\cal L}_{p} to obtain an efficient algorithm s{\cal L}_{s} that learns depth-33 networks with smoothed parameters as described above, and thus reach a contradiction.

Let 𝜽p{\boldsymbol{\theta}}\in{\mathbb{R}}^{p} be BB-bounded parameters of a depth-33 network N𝜽:dN_{\boldsymbol{\theta}}:{\mathbb{R}}^{d}\to{\mathbb{R}}, and let τ,ϵ>0\tau,\epsilon>0. Assume that p,B,1/ϵ,1/τpoly(d)p,B,1/\epsilon,1/\tau\leq\operatorname{poly}(d), and that the widths of the hidden layers in 𝒩𝜽{\cal N}_{\boldsymbol{\theta}} are dd. For random 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}) and 𝜽^=𝜽+𝝃\hat{{\boldsymbol{\theta}}}={\boldsymbol{\theta}}+\boldsymbol{\xi}, the algorithm s{\cal L}_{s} has access to examples labeled by N𝜽^N_{\hat{{\boldsymbol{\theta}}}}. Using Lemma B.1 below with t=τdt=\frac{\tau}{d} and the union bound over the two weight matrices in N𝜽N_{\boldsymbol{\theta}}, we get that with probability at least 122.35d1110001-\frac{2\cdot 2.35}{\sqrt{d}}\geq 1-\frac{1}{1000} (for large enough dd), the minimal singular values of all weight matrices in 𝜽^\hat{{\boldsymbol{\theta}}} are at least τd1q(d)\frac{\tau}{d}\geq\frac{1}{q(d)} for some sufficiently large polynomial q(d)q(d). Our algorithm s{\cal L}_{s} will simply run p{\cal L}_{p}. Given that the minimal singular values of the weight matrices are at least 1q(d)\frac{1}{q(d)}, the algorithm p{\cal L}_{p} runs in time poly(d)\operatorname{poly}(d) and returns with probability at least 34\frac{3}{4} a hypothesis hh with 𝔼𝐱[(h(𝐱)N𝜽^(𝐱))2]ϵ\operatorname*{\mathbb{E}}_{\mathbf{x}}\left[(h(\mathbf{x})-N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{x}))^{2}\right]\leq\epsilon. Overall, the algorithm s{\cal L}_{s} runs in poly(d)\operatorname{poly}(d) time, and with probability at least 3411000\frac{3}{4}-\frac{1}{1000} (over both 𝝃\boldsymbol{\xi} and the internal randomness) returns a hypothesis hh with loss at most ϵ\epsilon.

Lemma B.1 (Sankar et al. [2006], Theorem 3.3).

Let WW be an arbitrary square matrix in d×d{\mathbb{R}}^{d\times d}, and let Pd×dP\in{\mathbb{R}}^{d\times d} be a random matrix, where each entry is drawn i.i.d. from 𝒩(0,τ2){\cal N}(0,\tau^{2}) for some τ>0\tau>0. Let σd\sigma_{d} be the minimal singular value of the matrix W+PW+P. Then, for every t>0t>0 we have

PrP[σdt]2.35tdτ.\Pr_{P}\left[\sigma_{d}\leq t\right]\leq 2.35\cdot\frac{t\sqrt{d}}{\tau}~{}.
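
As a quick numerical illustration of Lemma B.1 (ours; the matrix WW and the values of dd and τ\tau are arbitrary), even a degenerate rank-one matrix becomes well-conditioned after a small Gaussian perturbation.

  import numpy as np

  rng = np.random.default_rng(0)
  d, tau = 200, 1e-3
  W = np.ones((d, d))                                  # a degenerate (rank-1) matrix
  P = rng.normal(0.0, tau, size=(d, d))
  sigma_min = np.linalg.svd(W + P, compute_uv=False)[-1]
  t = tau / d
  print(sigma_min >= t, 2.35 * t * np.sqrt(d) / tau)   # typically True; failure probability at most 2.35/sqrt(d)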

Appendix C Proof of Theorem 3.2

The proof follows similar ideas to the proof of Theorem 3.1. The main difference is that we need to handle here a smoothed discrete input distribution rather than the standard Gaussian distribution.

For a sufficiently large nn, let 𝒟{\cal D} be a distribution on {0,1}n2\{0,1\}^{n^{2}}, where each component is drawn i.i.d. from a Bernoulli distribution which takes the value 0 with probability 1n\frac{1}{n}. Assume that there is a poly(n)\operatorname{poly}(n)-time algorithm {\cal L} that learns depth-22 neural networks with at most n2n^{2} hidden neurons and parameter magnitudes bounded by n3n^{3}, with smoothed parameters and inputs, under the distribution 𝒟{\cal D}, with ϵ=1n\epsilon=\frac{1}{n} and τ,ω=1/poly(n)\tau,\omega=1/\operatorname{poly}(n) that we will specify later. Let m(n)poly(n)m(n)\leq\operatorname{poly}(n) be the sample complexity of {\cal L}, namely, {\cal L} uses a sample of size at most m(n)m(n) and returns with probability at least 34\frac{3}{4} a hypothesis hh with 𝔼𝐳𝒟^[(h(𝐳)N𝜽^(𝐳))2]ϵ=1n\operatorname*{\mathbb{E}}_{\mathbf{z}\sim\hat{{\cal D}}}\left[\left(h(\mathbf{z})-N_{\hat{{\boldsymbol{\theta}}}}(\mathbf{z})\right)^{2}\right]\leq\epsilon=\frac{1}{n}. Note that 𝒟^\hat{{\cal D}} is the distribution 𝒟{\cal D} after smoothing with parameter ω\omega, and the vector 𝜽^\hat{{\boldsymbol{\theta}}} is the parameters of the target network after smoothing with parameter τ\tau. Let s>1s>1 be a constant such that nsm(n)+n3n^{s}\geq m(n)+n^{3} for every sufficiently large nn. By Assumption 2.1, there exists a constant kk and a predicate P:{0,1}k{0,1}P:\{0,1\}^{k}\to\{0,1\}, such that P,n,ns{\cal F}_{P,n,n^{s}} is 13\frac{1}{3}-PRG. We will show an efficient algorithm 𝒜{\cal A} with distinguishing advantage greater than 13\frac{1}{3} and thus reach a contradiction.

Throughout this proof, we will use some notations from the proof of Theorem 3.1. We repeat it here for convenience. For a hyperedge S=(i1,,ik)S=(i_{1},\ldots,i_{k}) we denote by 𝐳S{0,1}kn\mathbf{z}^{S}\in\{0,1\}^{kn} the following encoding of SS: the vector 𝐳S\mathbf{z}^{S} is a concatenation of kk vectors in {0,1}n\{0,1\}^{n}, such that the jj-th vector has 0 in the iji_{j}-th coordinate and 11 elsewhere. Thus, 𝐳S\mathbf{z}^{S} consists of kk size-nn slices, each encoding a member of SS. For 𝐳{0,1}kn\mathbf{z}\in\{0,1\}^{kn}, i[k]i\in[k] and j[n]j\in[n], we denote zi,j=z(i1)n+jz_{i,j}=z_{(i-1)n+j}. That is, zi,jz_{i,j} is the jj-th component in the ii-th slice in 𝐳\mathbf{z}. For 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}, let P𝐱:{0,1}kn{0,1}P_{\mathbf{x}}:\{0,1\}^{kn}\to\{0,1\} be such that for every hyperedge SS we have P𝐱(𝐳S)=P(𝐱S)P_{\mathbf{x}}(\mathbf{z}^{S})=P(\mathbf{x}_{S}). For 𝐳~n2\tilde{\mathbf{z}}\in{\mathbb{R}}^{n^{2}} we denote 𝐳~[kn]=(z~1,,z~kn)\tilde{\mathbf{z}}_{[kn]}=(\tilde{z}_{1},\ldots,\tilde{z}_{kn}), namely, the first knkn components of 𝐳~\tilde{\mathbf{z}} (assuming n2knn^{2}\geq kn).

C.1 Defining the target network for {\cal L}

Since our goal is to use the algorithm {\cal L} for breaking PRGs, in this subsection we define a neural network N~:n2\tilde{N}:{\mathbb{R}}^{n^{2}}\to{\mathbb{R}} that we will later use as a target network for {\cal L}. The network N~\tilde{N} contains the subnetworks N1,N2N_{1},N_{2} that we define below.

Let N1N_{1} be a depth-11 neural network (i.e., one layer, with activations in the output neurons) with input dimension knkn, at most log(n)\log(n) output neurons, and parameter magnitudes bounded by n3n^{3} (all bounds are for a sufficiently large nn), which satisfies the following. We denote the set of output neurons of N1N_{1} by 1{\cal E}_{1}. Let 𝐳{0,1}kn\mathbf{z}^{\prime}\in\{0,1\}^{kn} be an input to N1N_{1} such that 𝐳=𝐳S\mathbf{z}^{\prime}=\mathbf{z}^{S} for some hyperedge SS. Thus, even though N1N_{1} takes inputs in kn{\mathbb{R}}^{kn}, we consider now its behavior for an input 𝐳\mathbf{z}^{\prime} with discrete components in {0,1}\{0,1\}. Fix some 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}. Then, for SS with P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0 the inputs to all output neurons 1{\cal E}_{1} are at most 1-1, and for SS with P𝐱(𝐳S)=1P_{\mathbf{x}}(\mathbf{z}^{S})=1 there exists a neuron in 1{\cal E}_{1} with input at least 22. Recall that our definition of a neuron’s input includes the addition of the bias term. The construction of the network N1N_{1} is given in Lemma A.2. Note that the network N1N_{1} depends on 𝐱\mathbf{x}. Let N1:knN^{\prime}_{1}:{\mathbb{R}}^{kn}\to{\mathbb{R}} be a depth-22 neural network with no activation function in the output neuron, obtained from N1N_{1} by summing the outputs from all neurons 1{\cal E}_{1}.

Let N2N_{2} be a depth-11 neural network (i.e., one layer, with activations in the output neurons) with input dimension knkn, at most 2n2n output neurons, and parameter magnitudes bounded by n3n^{3} (for a sufficiently large nn), which satisfies the following. We denote the set of output neurons of N2N_{2} by 2{\cal E}_{2}. Let 𝐳{0,1}kn\mathbf{z}^{\prime}\in\{0,1\}^{kn} be an input to N2N_{2} (note that it has components only in {0,1}\{0,1\}) . If 𝐳\mathbf{z}^{\prime} is an encoding of a hyperedge then the inputs to all output neurons 2{\cal E}_{2} are at most 1-1, and otherwise there exists a neuron in 2{\cal E}_{2} with input at least 22. The construction of the network N2N_{2} is given in Lemma A.4. Let N2:knN^{\prime}_{2}:{\mathbb{R}}^{kn}\to{\mathbb{R}} be a depth-22 neural network with no activation function in the output neuron, obtained from N2N_{2} by summing the outputs from all neurons 2{\cal E}_{2}.

Let N:knN^{\prime}:{\mathbb{R}}^{kn}\to{\mathbb{R}} be a depth-22 network obtained from N1,N2N^{\prime}_{1},N^{\prime}_{2} as follows. For 𝐳kn\mathbf{z}^{\prime}\in{\mathbb{R}}^{kn} we have N(𝐳)=[1N1(𝐳)N2(𝐳)]+N^{\prime}(\mathbf{z}^{\prime})=\left[1-N^{\prime}_{1}(\mathbf{z}^{\prime})-N^{\prime}_{2}(\mathbf{z}^{\prime})\right]_{+}. The network NN^{\prime} has at most n2n^{2} neurons, and parameter magnitudes bounded by n3n^{3} (all bounds are for a sufficiently large nn). Finally, let N~:n2\tilde{N}:{\mathbb{R}}^{n^{2}}\rightarrow{\mathbb{R}} be a depth-22 neural network such that N~(𝐳~)=N(𝐳~[kn])\tilde{N}(\tilde{\mathbf{z}})=N^{\prime}\left(\tilde{\mathbf{z}}_{[kn]}\right).
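
For concreteness, the following Python sketch (ours, not part of the proof; the predicate, the values of kk and nn, and the helper names are illustrative) assembles NN^{\prime} from the affine layers of Lemmas A.2 and A.4 and checks its behavior on Boolean inputs: it outputs 1 exactly on encodings of hyperedges SS with P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0, and 0 otherwise.

  import itertools
  import random

  def relu(t):
      return max(t, 0.0)

  def encode(S, n):                      # z^S: slice j has a 0 at position S[j]
      return [0 if j == i else 1 for i in S for j in range(n)]

  def dnf_terms(P, x, k):                # Lemma A.1: positive-literal DNF for P_x
      n = len(x)
      B = [b for b in itertools.product([0, 1], repeat=k) if P(b)]
      return [[j * n + l for j in range(k) for l in range(n) if x[l] != b[j]] for b in B]

  def N1_prime(z, terms):                # Lemma A.2 outputs, passed through ReLUs and summed
      return sum(relu(3 * sum(z[l] for l in I) - 3 * len(I) + 2) for I in terms)

  def N2_prime(z, k, n):                 # Lemma A.4 outputs, passed through ReLUs and summed
      outs = []
      for i in range(k):
          s = sum(z[i * n:(i + 1) * n])
          outs += [3 * n - 4 - 3 * s, 3 * s - 3 * n + 2]
      for j in range(n):
          outs.append(3 * k - 4 - 3 * sum(z[i * n + j] for i in range(k)))
      return sum(relu(o) for o in outs)

  def N_prime(z, terms, k, n):           # N'(z) = [1 - N'_1(z) - N'_2(z)]_+
      return relu(1.0 - N1_prime(z, terms) - N2_prime(z, k, n))

  random.seed(0)
  k, n = 3, 8
  P = lambda b: sum(b) % 2 == 1          # illustrative predicate (parity)
  x = [random.randint(0, 1) for _ in range(n)]
  terms = dnf_terms(P, x, k)
  for _ in range(500):
      S = tuple(random.sample(range(n), k))
      expected = 0.0 if P(tuple(x[i] for i in S)) else 1.0
      assert N_prime(encode(S, n), terms, k, n) == expected
  assert N_prime([1] * (k * n), terms, k, n) == 0.0   # not an encoding of a hyperedge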

C.2 Defining the noise magnitudes τ,ω\tau,\omega and analyzing the perturbed network under perturbed inputs

In order to use the algorithm {\cal L} w.r.t. some neural network with parameters 𝜽{\boldsymbol{\theta}} and a certain input distribution, we need to implement an examples oracle, such that the examples are drawn from a smoothed input distribution, and labeled according to a neural network with parameters 𝜽+𝝃{\boldsymbol{\theta}}+\boldsymbol{\xi}, where 𝝃\boldsymbol{\xi} is a random perturbation. Specifically, we use {\cal L} with an examples oracle where the input distribution 𝒟^\hat{{\cal D}} is obtained from 𝒟{\cal D} by smoothing, and the labels correspond to a network N^:n2\hat{N}:{\mathbb{R}}^{n^{2}}\to{\mathbb{R}} obtained from N~\tilde{N} (w.r.t. an appropriate 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n} in the construction of N1N_{1}) by adding a small perturbation to the parameters. The smoothing magnitudes ω,τ\omega,\tau of the inputs and the network’s parameters (respectively) are such that the following hold.

We first choose the parameter τ=1/poly(n)\tau=1/\operatorname{poly}(n) as follows. Let f𝜽:n2f_{\boldsymbol{\theta}}:{\mathbb{R}}^{n^{2}}\to{\mathbb{R}} be any depth-22 neural network parameterized by 𝜽r{\boldsymbol{\theta}}\in{\mathbb{R}}^{r} for some r>0r>0 with at most n2n^{2} neurons, and parameter magnitudes bounded by n3n^{3} (note that rr is polynomial in nn). Then, τ\tau is such that with probability at least 11n1-\frac{1}{n} over 𝝃𝒩(𝟎,τ2Ir)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{r}), we have |ξi|110|\xi_{i}|\leq\frac{1}{10} for all i[r]i\in[r], and the network f𝜽+𝝃f_{{\boldsymbol{\theta}}+\boldsymbol{\xi}} is such that for every input 𝐳~n2\tilde{\mathbf{z}}\in{\mathbb{R}}^{n^{2}} with 𝐳~n\left\|\tilde{\mathbf{z}}\right\|\leq n and every neuron we have: Let a,ba,b be the inputs to the neuron in the computations f𝜽(𝐳~)f_{\boldsymbol{\theta}}(\tilde{\mathbf{z}}) and f𝜽+𝝃(𝐳~)f_{{\boldsymbol{\theta}}+\boldsymbol{\xi}}(\tilde{\mathbf{z}}) (respectively), then |ab|14|a-b|\leq\frac{1}{4}. Thus, τ\tau is sufficiently small, such that w.h.p. adding i.i.d. noise 𝒩(0,τ2){\cal N}(0,\tau^{2}) to each parameter does not change the inputs to the neurons by more than 14\frac{1}{4}. Note that such an inverse-polynomial τ\tau exists, since when the network size, parameter magnitudes, and input size are bounded by some poly(n)\operatorname{poly}(n), then the input to each neuron in f𝜽(𝐳~)f_{\boldsymbol{\theta}}(\tilde{\mathbf{z}}) is poly(n)\operatorname{poly}(n)-Lipschitz as a function of 𝜽{\boldsymbol{\theta}}, and thus it suffices to choose τ\tau that implies with probability at least 11n1-\frac{1}{n} that 𝝃1q(n)\left\|\boldsymbol{\xi}\right\|\leq\frac{1}{q(n)} for a sufficiently large polynomial q(n)q(n) (see Lemma A.7 for details).

Next, we choose the parameter ω=1/poly(n)\omega=1/\operatorname{poly}(n) as follows. Let f𝜽:n2f_{\boldsymbol{\theta}}:{\mathbb{R}}^{n^{2}}\to{\mathbb{R}} be any depth-22 neural network parameterized by 𝜽{\boldsymbol{\theta}} with at most n2n^{2} neurons and parameter magnitudes bounded by n3+110n^{3}+\frac{1}{10}. Then, ω\omega is such that for every 𝐳{0,1}n2\mathbf{z}\in\{0,1\}^{n^{2}}, with probability at least 1exp(n/2)1-\exp(-n/2) over 𝜻𝒩(𝟎,ω2In2)\boldsymbol{\zeta}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) the following holds for every neuron in f𝜽f_{\boldsymbol{\theta}}: if a,ba,b denote the inputs to the neuron in the computations f𝜽(𝐳)f_{\boldsymbol{\theta}}(\mathbf{z}) and f𝜽(𝐳+𝜻)f_{\boldsymbol{\theta}}(\mathbf{z}+\boldsymbol{\zeta}) (respectively), then |ab|14|a-b|\leq\frac{1}{4}. That is, ω\omega is sufficiently small such that w.h.p. adding noise 𝒩(𝟎,ω2In2){\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) to the input 𝐳\mathbf{z} does not change the inputs to the neurons by more than 14\frac{1}{4}. Note that such an inverse-polynomial ω\omega exists: when the network size and the parameter magnitudes are bounded by some poly(n)\operatorname{poly}(n), the input to each neuron in f𝜽(𝐳)f_{\boldsymbol{\theta}}(\mathbf{z}) is poly(n)\operatorname{poly}(n)-Lipschitz as a function of 𝐳\mathbf{z}, and thus it suffices to choose ω\omega such that with probability at least 1exp(n/2)1-\exp(-n/2) we have 𝜻1q(n)\left\|\boldsymbol{\zeta}\right\|\leq\frac{1}{q(n)} for a sufficiently large polynomial q(n)q(n) (see Lemma A.7 for details).
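The following is a minimal sketch of how such an inverse-polynomial noise scale can be computed, for both τ\tau and ω\omega, given a poly(n)\operatorname{poly}(n) Lipschitz bound. It relies on the standard Gaussian concentration bound stating that a draw from 𝒩(𝟎,σ2Id){\cal N}({\mathbf{0}},\sigma^{2}I_{d}) has norm at most σ(d+s)\sigma(\sqrt{d}+s) except with probability exp(s2/2)\exp(-s^{2}/2). The concrete Lipschitz bounds and parameter counts used below are hypothetical stand-ins for the poly(n)\operatorname{poly}(n) bounds from Lemma A.7, not the exact constants of the proof.

import math

def noise_scale(lipschitz_bound, dim, margin=0.25, tail=None):
    # Return sigma such that a perturbation drawn from N(0, sigma^2 I_dim) has norm at most
    # margin / lipschitz_bound, except with probability exp(-tail^2 / 2), using
    #     Pr[ ||xi|| >= sigma * (sqrt(dim) + tail) ] <= exp(-tail^2 / 2).
    # If every neuron's input is lipschitz_bound-Lipschitz in the perturbed quantity, such a
    # perturbation changes the input to each neuron by at most `margin`.
    if tail is None:
        tail = math.sqrt(dim)
    return margin / (lipschitz_bound * (math.sqrt(dim) + tail))

# Illustrative numbers only (hypothetical stand-ins for the bounds of Lemma A.7).
n = 100
p = 2 * n**2 * n**2 + n**2 + 1        # crude upper bound on the number of parameters
tau = noise_scale(lipschitz_bound=n**5, dim=p, tail=math.sqrt(n))       # parameter noise
omega = noise_scale(lipschitz_bound=n**5, dim=n**2, tail=math.sqrt(n))  # input noise
# (the additional coordinatewise requirement |xi_i| <= 1/10 for tau is not modeled here)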

Let 𝜽~p\tilde{{\boldsymbol{\theta}}}\in{\mathbb{R}}^{p} be the parameters of the network N~\tilde{N}. Recall that the parameters vector 𝜽~\tilde{{\boldsymbol{\theta}}} is the concatenation of all weight matrices and bias terms. Let 𝜽^p\hat{{\boldsymbol{\theta}}}\in{\mathbb{R}}^{p} be the parameters of N^\hat{N}, namely, 𝜽^=𝜽~+𝝃\hat{{\boldsymbol{\theta}}}=\tilde{{\boldsymbol{\theta}}}+\boldsymbol{\xi} where 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}). By our choice of τ\tau and ω\omega and the construction of the networks N1,N2N_{1},N_{2}, with probability at least 11n1-\frac{1}{n} over 𝝃\boldsymbol{\xi}, for every 𝐳{0,1}n2\mathbf{z}\in\{0,1\}^{n^{2}} the following holds: Let 𝜻𝒩(𝟎,ω2In2)\boldsymbol{\zeta}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) and let 𝐳^=𝐳+𝜻\hat{\mathbf{z}}=\mathbf{z}+\boldsymbol{\zeta}. Then with probability at least 1exp(n/2)1-\exp(-n/2) over 𝜻\boldsymbol{\zeta} the differences between the inputs to all neurons in the computations N^(𝐳^)\hat{N}(\hat{\mathbf{z}}) and N~(𝐳)\tilde{N}(\mathbf{z}) are at most 12\frac{1}{2}. Indeed, w.h.p. for all 𝐳{0,1}n2\mathbf{z}\in\{0,1\}^{n^{2}} the computations N~(𝐳)\tilde{N}(\mathbf{z}) and N^(𝐳)\hat{N}(\mathbf{z}) are close (the input to each neuron changes by at most 1/41/4), and w.h.p. the computations N^(𝐳)\hat{N}(\mathbf{z}) and N^(𝐳^)\hat{N}(\hat{\mathbf{z}}) are close (again, the input to each neuron changes by at most 1/41/4). Thus, with probability at least 11n1-\frac{1}{n} over 𝝃\boldsymbol{\xi}, the network N^\hat{N} is such that for every 𝐳{0,1}n2\mathbf{z}\in\{0,1\}^{n^{2}}, we have with probability at least 1exp(n/2)1-\exp(-n/2) over 𝜻\boldsymbol{\zeta} that the computation N^(𝐳^)\hat{N}(\hat{\mathbf{z}}) satisfies the following properties, where 𝐳:=𝐳[kn]\mathbf{z}^{\prime}:=\mathbf{z}_{[kn]} (the margin arithmetic behind these properties is spelled out after the list):

  1. (Q1)

    If 𝐳=𝐳S\mathbf{z}^{\prime}=\mathbf{z}^{S} for some hyperedge SS, then the inputs to 1{\cal E}_{1} satisfy:

    • If P𝐱(𝐳S)=0P_{\mathbf{x}}(\mathbf{z}^{S})=0 the inputs to all neurons in 1{\cal E}_{1} are at most 12-\frac{1}{2}.

    • If P𝐱(𝐳S)=1P_{\mathbf{x}}(\mathbf{z}^{S})=1 there exists a neuron in 1{\cal E}_{1} with input at least 32\frac{3}{2}.

  2. (Q2)

    The inputs to 2{\cal E}_{2} satisfy:

    • If 𝐳\mathbf{z}^{\prime} is an encoding of a hyperedge then the inputs to all neurons in 2{\cal E}_{2} are at most 12-\frac{1}{2}.

    • Otherwise, there exists a neuron in 2{\cal E}_{2} with input at least 32\frac{3}{2}.
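These margins follow by combining the two effects bounded above. For any fixed neuron, let aa, bb, and cc denote its input in the computations N~(𝐳)\tilde{N}(\mathbf{z}), N^(𝐳)\hat{N}(\mathbf{z}), and N^(𝐳^)\hat{N}(\hat{\mathbf{z}}), respectively. By the triangle inequality,

|c-a|\leq|c-b|+|b-a|\leq\frac{1}{4}+\frac{1}{4}=\frac{1}{2}~{},

so a neuron whose input in N~(𝐳)\tilde{N}(\mathbf{z}) is at most 1-1 has input at most 12-\frac{1}{2} in N^(𝐳^)\hat{N}(\hat{\mathbf{z}}), and a neuron whose input is at least 22 has input at least 32\frac{3}{2}; these are exactly the margins guaranteed by the constructions of N1N_{1} and N2N_{2} before the perturbations.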

C.3 Stating the algorithm 𝒜{\cal A}

Given a sequence (S1,y1),,(Sns,yns)(S_{1},y_{1}),\ldots,(S_{n^{s}},y_{n^{s}}), where S1,,SnsS_{1},\ldots,S_{n^{s}} are i.i.d. random hyperedges, the algorithm 𝒜{\cal A} needs to distinguish between the case where 𝐲=(y1,,yns)\mathbf{y}=(y_{1},\ldots,y_{n^{s}}) is random and the case where 𝐲=(P(𝐱S1),,P(𝐱Sns))=(P𝐱(𝐳S1),,P𝐱(𝐳Sns))\mathbf{y}=(P(\mathbf{x}_{S_{1}}),\ldots,P(\mathbf{x}_{S_{n^{s}}}))=(P_{\mathbf{x}}(\mathbf{z}^{S_{1}}),\ldots,P_{\mathbf{x}}(\mathbf{z}^{S_{n^{s}}})) for a random 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}. Let 𝒮=((𝐳S1,y1),,(𝐳Sns,yns)){\cal S}=((\mathbf{z}^{S_{1}},y_{1}),\ldots,(\mathbf{z}^{S_{n^{s}}},y_{n^{s}})).

We use the efficient algorithm {\cal L} in order to obtain a distinguishing advantage greater than 13\frac{1}{3} as follows. Let 𝝃\boldsymbol{\xi} be a random perturbation, and let N^\hat{N} be the perturbed network as defined above, w.r.t. the unknown 𝐱{0,1}n\mathbf{x}\in\{0,1\}^{n}. Note that given a perturbation 𝝃\boldsymbol{\xi}, only the weights in the second layer of the subnetwork N1N_{1} in N^\hat{N} are unknown, since all other parameters do not depend on 𝐱\mathbf{x}. The algorithm 𝒜{\cal A} runs {\cal L} with the following examples oracle. In the ii-th call, the oracle first draws 𝐳{0,1}kn\mathbf{z}^{\prime}\in\{0,1\}^{kn} such that each component is drawn i.i.d. from a Bernoulli distribution which takes the value 0 with probability 1n\frac{1}{n}. If 𝐳\mathbf{z}^{\prime} is an encoding of a hyperedge then the oracle replaces 𝐳\mathbf{z}^{\prime} with 𝐳Si\mathbf{z}^{S_{i}}. Let 𝐳{0,1}n2\mathbf{z}\in\{0,1\}^{n^{2}} be such that 𝐳[kn]=𝐳\mathbf{z}_{[kn]}=\mathbf{z}^{\prime}, and the other n2knn^{2}-kn components of 𝐳\mathbf{z} are drawn i.i.d. from a Bernoulli distribution which takes the value 0 with probability 1n\frac{1}{n}. Note that the vector 𝐳\mathbf{z} has the distribution 𝒟{\cal D}, since replacing an encoding of a random hyperedge by an encoding of another random hyperedge does not change the distribution of 𝐳\mathbf{z}^{\prime}. Let 𝐳^=𝐳+𝜻\hat{\mathbf{z}}=\mathbf{z}+\boldsymbol{\zeta}, where 𝜻𝒩(𝟎,ω2In2)\boldsymbol{\zeta}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}). Note that 𝐳^\hat{\mathbf{z}} has the distribution 𝒟^\hat{{\cal D}}. Let b^\hat{b}\in{\mathbb{R}} be the bias term of the output neuron of N^\hat{N}. The oracle returns (𝐳^,y^)(\hat{\mathbf{z}},\hat{y}), where the label y^\hat{y} is chosen as follows (a code sketch of this oracle is given after the list below):

  • If 𝐳\mathbf{z}^{\prime} is not an encoding of a hyperedge, then y^=0\hat{y}=0.

  • If 𝐳\mathbf{z}^{\prime} is an encoding of a hyperedge:

    • If yi=0y_{i}=0 we set y^=b^\hat{y}=\hat{b}.

    • If yi=1y_{i}=1 we set y^=0\hat{y}=0.
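The examples oracle just described can be summarized by the following sketch in Python. The helpers is_hyperedge_encoding and encode_hyperedge are hypothetical stand-ins for the encoding 𝐳S\mathbf{z}^{S} of a hyperedge (whose exact form is not repeated here), and the numerical details are only illustrative.

import numpy as np

rng = np.random.default_rng()

def bernoulli_block(size, n):
    # each component is 0 with probability 1/n and 1 otherwise
    return (rng.random(size) >= 1.0 / n).astype(float)

def oracle_call(S_i, y_i, b_hat, n, k, omega, is_hyperedge_encoding, encode_hyperedge):
    # One call of the examples oracle used when running L (the i-th call uses (S_i, y_i)).
    kn = k * n
    z_prime = bernoulli_block(kn, n)
    is_encoding = is_hyperedge_encoding(z_prime)
    if is_encoding:
        z_prime = encode_hyperedge(S_i)                            # replace with z^{S_i}
    z = np.concatenate([z_prime, bernoulli_block(n * n - kn, n)])  # z has distribution D
    z_hat = z + rng.normal(0.0, omega, size=n * n)                 # smoothed input, distribution D^
    if not is_encoding:
        y_hat = 0.0
    else:
        y_hat = b_hat if y_i == 0 else 0.0                         # label rule from the list above
    return z_hat, y_hat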

Let hh be the hypothesis returned by {\cal L}. Recall that {\cal L} uses at most m(n)m(n) examples, and hence 𝒮{\cal S} contains at least n3n^{3} examples that {\cal L} does not view. We denote the indices of these examples by I={m(n)+1,,m(n)+n3}I=\{m(n)+1,\ldots,m(n)+n^{3}\}, and the examples by 𝒮I={(𝐳Si,yi)}iI{\cal S}_{I}=\{(\mathbf{z}^{S_{i}},y_{i})\}_{i\in I}. By n3n^{3} additional calls to the oracle, the algorithm 𝒜{\cal A} obtains the examples 𝒮^I={(𝐳^i,y^i)}iI\hat{{\cal S}}_{I}=\{(\hat{\mathbf{z}}_{i},\hat{y}_{i})\}_{i\in I} that correspond to 𝒮I{\cal S}_{I}. Let hh^{\prime} be a hypothesis such that for all 𝐳~n2\tilde{\mathbf{z}}\in{\mathbb{R}}^{n^{2}} we have h(𝐳~)=max{0,min{b^,h(𝐳~)}}h^{\prime}(\tilde{\mathbf{z}})=\max\{0,\min\{\hat{b},h(\tilde{\mathbf{z}})\}\}, that is, for b^0\hat{b}\geq 0 the hypothesis hh^{\prime} is obtained from hh by clipping its output to the interval [0,b^][0,\hat{b}]. Let I(h)=1|I|iI(h(𝐳^i)y^i)2\ell_{I}(h^{\prime})=\frac{1}{|I|}\sum_{i\in I}(h^{\prime}(\hat{\mathbf{z}}_{i})-\hat{y}_{i})^{2}. Now, if I(h)2n\ell_{I}(h^{\prime})\leq\frac{2}{n}, then 𝒜{\cal A} returns 11, and otherwise it returns 0. We remark that the decision of our algorithm is based on hh^{\prime} (rather than hh) since we need the outputs to be bounded in order to apply Hoeffding’s inequality in the analysis given in the next subsection.
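The clipping and the final decision can likewise be sketched as follows. Here h is the (callable) hypothesis returned by {\cal L}, b_hat is the bias term of the output neuron of N^\hat{N}, and held_out is the list of the n3n^{3} additional examples indexed by II; this is only an illustrative sketch of the decision rule above.

import numpy as np

def clip_hypothesis(h, b_hat):
    # h'(z~) = max{0, min{b_hat, h(z~)}}: clip the learner's output to [0, b_hat]
    return lambda z: float(np.clip(h(z), 0.0, b_hat))

def decide(h, b_hat, held_out, n):
    # held_out: list of n^3 pairs (z_hat_i, y_hat_i) for i in I, obtained by extra oracle calls.
    # Return 1 ("pseudorandom") iff the clipped empirical squared loss is at most 2/n.
    h_prime = clip_hypothesis(h, b_hat)
    loss = np.mean([(h_prime(z_hat) - y_hat) ** 2 for (z_hat, y_hat) in held_out])
    return 1 if loss <= 2.0 / n else 0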

C.4 Analyzing the algorithm 𝒜{\cal A}

Note that the algorithm 𝒜{\cal A} runs in poly(n)\operatorname{poly}(n) time. We now show that if 𝒮{\cal S} is pseudorandom then 𝒜{\cal A} returns 11 with probability greater than 23\frac{2}{3}, and if 𝒮{\cal S} is random then 𝒜{\cal A} returns 11 with probability less than 13\frac{1}{3}. To that end, we use arguments similar to those in the proof of Theorem 3.1.

In Lemma C.1, we show that if 𝒮{\cal S} is pseudorandom then with probability at least 3940\frac{39}{40} (over 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}) and 𝜻i𝒩(𝟎,ω2In2)\boldsymbol{\zeta}_{i}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) for all i[m(n)]i\in[m(n)]) the examples (𝐳^1,y^1),,(𝐳^m(n),y^m(n))(\hat{\mathbf{z}}_{1},\hat{y}_{1}),\ldots,(\hat{\mathbf{z}}_{m(n)},\hat{y}_{m(n)}) returned by the oracle are realized by N^\hat{N}. Recall that the algorithm {\cal L} is such that with probability at least 34\frac{3}{4} (over 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}), the i.i.d. inputs 𝐳^i𝒟^\hat{\mathbf{z}}_{i}\sim\hat{{\cal D}}, and possibly its internal randomness), given a size-m(n)m(n) dataset labeled by N^\hat{N}, it returns a hypothesis hh such that 𝔼𝐳^𝒟^[(h(𝐳^)N^(𝐳^))2]1n\operatorname*{\mathbb{E}}_{\hat{\mathbf{z}}\sim\hat{{\cal D}}}\left[(h(\hat{\mathbf{z}})-\hat{N}(\hat{\mathbf{z}}))^{2}\right]\leq\frac{1}{n}. Hence, with probability at least 34140\frac{3}{4}-\frac{1}{40} the algorithm {\cal L} returns such a good hypothesis hh, given m(n)m(n) examples labeled by our examples oracle. Indeed, note that {\cal L} can return a bad hypothesis only if the random choices are either bad for {\cal L} (when used with realizable examples) or bad for the realizability of the examples returned by our oracle. By the definition of hh^{\prime} and the construction of N^\hat{N}, if hh has small error then hh^{\prime} also has small error, namely,

𝔼𝐳^𝒟^[(h(𝐳^)N^(𝐳^))2]𝔼𝐳^𝒟^[(h(𝐳^)N^(𝐳^))2]1n.\operatorname*{\mathbb{E}}_{\hat{\mathbf{z}}\sim\hat{{\cal D}}}\left[(h^{\prime}(\hat{\mathbf{z}})-\hat{N}(\hat{\mathbf{z}}))^{2}\right]\leq\operatorname*{\mathbb{E}}_{\hat{\mathbf{z}}\sim\hat{{\cal D}}}\left[(h(\hat{\mathbf{z}})-\hat{N}(\hat{\mathbf{z}}))^{2}\right]\leq\frac{1}{n}~{}.

Let ^I(h)=1|I|iI(h(𝐳^i)N^(𝐳^i))2\hat{\ell}_{I}(h^{\prime})=\frac{1}{|I|}\sum_{i\in I}(h^{\prime}(\hat{\mathbf{z}}_{i})-\hat{N}(\hat{\mathbf{z}}_{i}))^{2}. Recall that by our choice of τ\tau we have Pr[b^>1110]1n\Pr[\hat{b}>\frac{11}{10}]\leq\frac{1}{n}. Since (h(𝐳^)N^(𝐳^))2[0,b^2](h^{\prime}(\hat{\mathbf{z}})-\hat{N}(\hat{\mathbf{z}}))^{2}\in[0,\hat{b}^{2}] for all 𝐳^n2\hat{\mathbf{z}}\in{\mathbb{R}}^{n^{2}}, by Hoeffding’s inequality we have for a sufficiently large nn that

Pr[|^I(h)𝔼𝒮^I^I(h)|1n]\displaystyle\Pr\left[\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\geq\frac{1}{n}\right] =Pr[|^I(h)𝔼𝒮^I^I(h)|1n|b^1110]Pr[b^1110]\displaystyle=\Pr\left[\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}\leq\frac{11}{10}\right]\cdot\Pr\left[\hat{b}\leq\frac{11}{10}\right]
+Pr[|^I(h)𝔼𝒮^I^I(h)|1n|b^>1110]Pr[b^>1110]\displaystyle\;\;\;\;+\Pr\left[\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}>\frac{11}{10}\right]\cdot\Pr\left[\hat{b}>\frac{11}{10}\right]
2exp(2n3n2(11/10)4)1+11n\displaystyle\leq 2\exp\left(-\frac{2n^{3}}{n^{2}(11/10)^{4}}\right)\cdot 1+1\cdot\frac{1}{n}
140.\displaystyle\leq\frac{1}{40}~{}.
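For concreteness, the bound above is Hoeffding’s inequality applied with |I|=n3|I|=n^{3} samples, deviation 1n\frac{1}{n}, and summands in [0,b^2][0,\hat{b}^{2}], which is contained in [0,(11/10)2][0,(11/10)^{2}] under the conditioning on b^1110\hat{b}\leq\frac{11}{10}:

2\exp\left(-\frac{2|I|\cdot(1/n)^{2}}{\hat{b}^{4}}\right)\leq 2\exp\left(-\frac{2n^{3}\cdot(1/n)^{2}}{(11/10)^{4}}\right)=2\exp\left(-\frac{2n}{(11/10)^{4}}\right)~{},

which decays exponentially in nn; adding the 1n\frac{1}{n} term, the total is at most 140\frac{1}{40} for a sufficiently large nn. The same calculation applies to the second application of Hoeffding’s inequality in the random case below (with the threshold 18\frac{1}{8} instead of 140\frac{1}{40}).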

Moreover, by Lemma C.1,

Pr[I(h)^I(h)]Pr[iI s.t. y^iN^(𝐳^i)]140.\Pr\left[\ell_{I}(h^{\prime})\neq\hat{\ell}_{I}(h^{\prime})\right]\leq\Pr\left[\exists i\in I\text{ s.t. }\hat{y}_{i}\neq\hat{N}(\hat{\mathbf{z}}_{i})\right]\leq\frac{1}{40}~{}.

Overall, by the union bound we have with probability at least 1(14+140+140+140)>231-\left(\frac{1}{4}+\frac{1}{40}+\frac{1}{40}+\frac{1}{40}\right)>\frac{2}{3} for sufficiently large nn that:

  • 𝔼𝒮^I^I(h)=𝔼𝐳^𝒟^[(h(𝐳^)N^(𝐳^))2]1n\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})=\operatorname*{\mathbb{E}}_{\hat{\mathbf{z}}\sim\hat{{\cal D}}}\left[(h^{\prime}(\hat{\mathbf{z}})-\hat{N}(\hat{\mathbf{z}}))^{2}\right]\leq\frac{1}{n}.

  • |^I(h)𝔼𝒮^I^I(h)|1n\left|\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right|\leq\frac{1}{n}.

  • I(h)^I(h)=0\ell_{I}(h^{\prime})-\hat{\ell}_{I}(h^{\prime})=0.

Combining the above, we get that if 𝒮{\cal S} is pseudorandom, then with probability greater than 23\frac{2}{3} we have

I(h)=(I(h)^I(h))+(^I(h)𝔼𝒮^I^I(h))+𝔼𝒮^I^I(h)0+1n+1n=2n.\ell_{I}(h^{\prime})=\left(\ell_{I}(h^{\prime})-\hat{\ell}_{I}(h^{\prime})\right)+\left(\hat{\ell}_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\right)+\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\hat{\ell}_{I}(h^{\prime})\leq 0+\frac{1}{n}+\frac{1}{n}=\frac{2}{n}~{}.

We now consider the case where 𝒮{\cal S} is random. For an example 𝐳^i=𝐳i+𝜻i\hat{\mathbf{z}}_{i}=\mathbf{z}_{i}+\boldsymbol{\zeta}_{i} returned by the oracle, we denote 𝐳i=(𝐳i)[kn]{0,1}kn\mathbf{z}^{\prime}_{i}=(\mathbf{z}_{i})_{[kn]}\in\{0,1\}^{kn}. Thus, 𝐳i\mathbf{z}^{\prime}_{i} is the input that the oracle used before adding the n2knn^{2}-kn additional components and adding noise 𝜻i\boldsymbol{\zeta}_{i}. Let 𝒵{0,1}kn{\cal Z}^{\prime}\subseteq\{0,1\}^{kn} be such that 𝐳𝒵\mathbf{z}^{\prime}\in{\cal Z}^{\prime} iff 𝐳=𝐳S\mathbf{z}^{\prime}=\mathbf{z}^{S} for some hyperedge SS. If 𝒮{\cal S} is random, then by the definition of our examples oracle, for every i[m(n)+n3]i\in[m(n)+n^{3}] such that 𝐳i𝒵\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}, we have y^i=b^\hat{y}_{i}=\hat{b} with probability 12\frac{1}{2} and y^i=0\hat{y}_{i}=0 otherwise. Also, by the definition of the oracle, y^i\hat{y}_{i} is independent of SiS_{i}, independent of the n2knn^{2}-kn additional components that were added, and independent of the noise 𝜻i𝒩(𝟎,ω2In2)\boldsymbol{\zeta}_{i}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) that corresponds to 𝐳^i\hat{\mathbf{z}}_{i}.

If b^910\hat{b}\geq\frac{9}{10} then for a sufficiently large nn the hypothesis hh^{\prime} satisfies for each random example (𝐳^i,y^i)𝒮^I(\hat{\mathbf{z}}_{i},\hat{y}_{i})\in\hat{{\cal S}}_{I} the following:

Pr(𝐳^i,y^i)\displaystyle\Pr_{(\hat{\mathbf{z}}_{i},\hat{y}_{i})} [(h(𝐳^i)y^i)215]\displaystyle\left[(h^{\prime}(\hat{\mathbf{z}}_{i})-\hat{y}_{i})^{2}\geq\frac{1}{5}\right]
Pr(𝐳^i,y^i)[(h(𝐳^i)y^i)215|𝐳i𝒵]Pr[𝐳i𝒵]\displaystyle\geq\Pr_{(\hat{\mathbf{z}}_{i},\hat{y}_{i})}\left[\left.(h^{\prime}(\hat{\mathbf{z}}_{i})-\hat{y}_{i})^{2}\geq\frac{1}{5}\;\right|\;\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}\right]\cdot\Pr\left[\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}\right]
Pr(𝐳^i,y^i)[(h(𝐳^i)y^i)2(b^2)2|𝐳i𝒵]Pr[𝐳i𝒵]\displaystyle\geq\Pr_{(\hat{\mathbf{z}}_{i},\hat{y}_{i})}\left[\left.(h^{\prime}(\hat{\mathbf{z}}_{i})-\hat{y}_{i})^{2}\geq\left(\frac{\hat{b}}{2}\right)^{2}\;\right|\;\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}\right]\cdot\Pr\left[\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}\right]
12Pr[𝐳i𝒵].\displaystyle\geq\frac{1}{2}\cdot\Pr\left[\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}\right]~{}.

Here, the second inequality uses (b^/2)2(9/20)2>15(\hat{b}/2)^{2}\geq(9/20)^{2}>\frac{1}{5} (recall that b^910\hat{b}\geq\frac{9}{10}), and the last inequality holds since, conditioned on 𝐳i𝒵\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}, the label y^i\hat{y}_{i} is uniform over {0,b^}\{0,\hat{b}\} and independent of 𝐳^i\hat{\mathbf{z}}_{i}, so whatever value h(𝐳^i)[0,b^]h^{\prime}(\hat{\mathbf{z}}_{i})\in[0,\hat{b}] takes, its distance from y^i\hat{y}_{i} is at least b^2\frac{\hat{b}}{2} with probability at least 12\frac{1}{2}.

In Lemma A.10, we show that for a sufficiently large nn we have Pr[𝐳i𝒵]1log(n)\Pr\left[\mathbf{z}^{\prime}_{i}\in{\cal Z}^{\prime}\right]\geq\frac{1}{\log(n)}. Hence,

Pr(𝐳^i,y^i)[(h(𝐳^i)y^i)215]121log(n)12log(n).\displaystyle\Pr_{(\hat{\mathbf{z}}_{i},\hat{y}_{i})}\left[(h^{\prime}(\hat{\mathbf{z}}_{i})-\hat{y}_{i})^{2}\geq\frac{1}{5}\right]\geq\frac{1}{2}\cdot\frac{1}{\log(n)}\geq\frac{1}{2\log(n)}~{}.

Thus, if b^910\hat{b}\geq\frac{9}{10} then we have

𝔼𝒮^I[I(h)]1512log(n)=110log(n).\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\left[\ell_{I}(h^{\prime})\right]\geq\frac{1}{5}\cdot\frac{1}{2\log(n)}=\frac{1}{10\log(n)}~{}.

Therefore, for large nn we have

Pr[𝔼𝒮^I[I(h)]110log(n)]11n78.\Pr\left[\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\left[\ell_{I}(h^{\prime})\right]\geq\frac{1}{10\log(n)}\right]\geq 1-\frac{1}{n}\geq\frac{7}{8}~{}.

Since (h(𝐳^)y^)2[0,b^2](h^{\prime}(\hat{\mathbf{z}})-\hat{y})^{2}\in[0,\hat{b}^{2}] for all 𝐳^,y^\hat{\mathbf{z}},\hat{y} returned by the examples oracle, and the examples 𝐳^i\hat{\mathbf{z}}_{i} for iIi\in I are i.i.d., by Hoeffding’s inequality we have for a sufficiently large nn that

Pr[|I(h)𝔼𝒮^II(h)|1n]\displaystyle\Pr\left[\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\geq\frac{1}{n}\right] =Pr[|I(h)𝔼𝒮^II(h)|1n|b^1110]Pr[b^1110]\displaystyle=\Pr\left[\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}\leq\frac{11}{10}\right]\cdot\Pr\left[\hat{b}\leq\frac{11}{10}\right]
+Pr[|I(h)𝔼𝒮^II(h)|1n|b^>1110]Pr[b^>1110]\displaystyle\;\;\;\;+\Pr\left[\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\geq\frac{1}{n}\;\middle|\;\hat{b}>\frac{11}{10}\right]\cdot\Pr\left[\hat{b}>\frac{11}{10}\right]
2exp(2n3n2(11/10)4)1+11n\displaystyle\leq 2\exp\left(-\frac{2n^{3}}{n^{2}(11/10)^{4}}\right)\cdot 1+1\cdot\frac{1}{n}
18.\displaystyle\leq\frac{1}{8}~{}.

Hence, for large enough nn, with probability at least 11818=34>231-\frac{1}{8}-\frac{1}{8}=\frac{3}{4}>\frac{2}{3} we have both 𝔼𝒮^I[I(h)]110log(n)\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\left[\ell_{I}(h^{\prime})\right]\geq\frac{1}{10\log(n)} and |I(h)𝔼𝒮^II(h)|1n\left|\ell_{I}(h^{\prime})-\operatorname*{\mathbb{E}}_{\hat{{\cal S}}_{I}}\ell_{I}(h^{\prime})\right|\leq\frac{1}{n}, and thus

I(h)110log(n)1n>2n.\ell_{I}(h^{\prime})\geq\frac{1}{10\log(n)}-\frac{1}{n}>\frac{2}{n}~{}.

Overall, if 𝒮{\cal S} is pseudorandom then with probability greater than 23\frac{2}{3} the algorithm 𝒜{\cal A} returns 11, and if 𝒮{\cal S} is random then with probability greater than 23\frac{2}{3} the algorithm 𝒜{\cal A} returns 0. Thus, the distinguishing advantage is greater than 13\frac{1}{3}. This concludes the proof of the theorem. It remains to prove the deferred lemma on the realizability of the examples returned by the examples oracle:

Lemma C.1.

If 𝒮{\cal S} is pseudorandom then with probability at least 3940\frac{39}{40} over 𝛏𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}) and 𝛇i𝒩(𝟎,ω2In2)\boldsymbol{\zeta}_{i}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) for i[m(n)+n3]i\in[m(n)+n^{3}], the examples (𝐳^1,y^1),,(𝐳^m(n)+n3,y^m(n)+n3)(\hat{\mathbf{z}}_{1},\hat{y}_{1}),\ldots,(\hat{\mathbf{z}}_{m(n)+n^{3}},\hat{y}_{m(n)+n^{3}}) returned by the oracle are realized by N^\hat{N}.

Proof.

By our choice of τ\tau and ω\omega and the construction of N1,N2N_{1},N_{2}, with probability at least 11n1-\frac{1}{n} over 𝝃𝒩(𝟎,τ2Ip)\boldsymbol{\xi}\sim{\cal N}({\mathbf{0}},\tau^{2}I_{p}), we have |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p], and for every 𝐳{0,1}n2\mathbf{z}\in\{0,1\}^{n^{2}} the following holds: Let 𝜻𝒩(𝟎,ω2In2)\boldsymbol{\zeta}\sim{\cal N}({\mathbf{0}},\omega^{2}I_{n^{2}}) and let 𝐳^=𝐳+𝜻\hat{\mathbf{z}}=\mathbf{z}+\boldsymbol{\zeta}. Then with probability at least 1exp(n/2)1-\exp(-n/2) over 𝜻\boldsymbol{\zeta} the inputs to the neurons 1,2{\cal E}_{1},{\cal E}_{2} in the computation N^(𝐳^)\hat{N}(\hat{\mathbf{z}}) satisfy Properties (Q1) and (Q2). Hence, with probability at least 11n(m(n)+n3)exp(n/2)12n1-\frac{1}{n}-(m(n)+n^{3})\exp(-n/2)\geq 1-\frac{2}{n} (for a sufficiently large nn), |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p], and Properties (Q1) and (Q2) hold for the computations N^(𝐳^i)\hat{N}(\hat{\mathbf{z}}_{i}) for all i[m(n)+n3]i\in[m(n)+n^{3}]. It remains to show that if |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p] and Properties (Q1) and (Q2) hold, then the examples (𝐳^1,y^1),,(𝐳^m(n)+n3,y^m(n)+n3)(\hat{\mathbf{z}}_{1},\hat{y}_{1}),\ldots,(\hat{\mathbf{z}}_{m(n)+n^{3}},\hat{y}_{m(n)+n^{3}}) are realized by N^\hat{N}.

Let i[m(n)+n3]i\in[m(n)+n^{3}]. We denote 𝐳^i=𝐳i+𝜻i\hat{\mathbf{z}}_{i}=\mathbf{z}_{i}+\boldsymbol{\zeta}_{i}, namely, the ii-th example returned by the oracle was obtained by adding noise 𝜻i\boldsymbol{\zeta}_{i} to 𝐳i{0,1}n2\mathbf{z}_{i}\in\{0,1\}^{n^{2}}. We also denote 𝐳i=(𝐳i)[kn]{0,1}kn\mathbf{z}^{\prime}_{i}=(\mathbf{z}_{i})_{[kn]}\in\{0,1\}^{kn}. Since |ξj|110|\xi_{j}|\leq\frac{1}{10} for all j[p]j\in[p], and all incoming weights to the output neuron in N~\tilde{N} are 1-1, then in N^\hat{N} all incoming weights to the output neuron are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and the bias term in the output neuron, denoted by b^\hat{b}, is in [910,1110]\left[\frac{9}{10},\frac{11}{10}\right]. Consider the following cases:

  • If 𝐳i\mathbf{z}^{\prime}_{i} is not an encoding of a hyperedge then y^i=0\hat{y}_{i}=0. Moreover, in the computation N^(𝐳^i)\hat{N}(\hat{\mathbf{z}}_{i}), there exists a neuron in 2{\cal E}_{2} with output at least 32\frac{3}{2} (by Property (Q2)). Since all incoming weights to the output neuron in N^\hat{N} are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and b^[910,1110]\hat{b}\in\left[\frac{9}{10},\frac{11}{10}\right], the input to the output neuron (including the bias term) is at most 111032910=14<0\frac{11}{10}-\frac{3}{2}\cdot\frac{9}{10}=-\frac{1}{4}<0, and thus its output is 0.

  • If 𝐳i\mathbf{z}^{\prime}_{i} is an encoding of a hyperedge SS, then by the definition of the examples oracle we have S=SiS=S_{i}. Hence:

    • If yi=0y_{i}=0 then the oracle sets y^i=b^\hat{y}_{i}=\hat{b}. Since 𝒮{\cal S} is pseudorandom, we have P𝐱(𝐳S)=P𝐱(𝐳Si)=yi=0P_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S_{i}})=y_{i}=0. Hence, in the computation N^(𝐳^i)\hat{N}(\hat{\mathbf{z}}_{i}) the inputs to all neurons in 1,2{\cal E}_{1},{\cal E}_{2} are at most 12-\frac{1}{2} (by Properties (Q1) and (Q2)), and thus their outputs are 0. Therefore, N^(𝐳^i)=b^\hat{N}(\hat{\mathbf{z}}_{i})=\hat{b}.

    • If yi=1y_{i}=1 then the oracle sets y^i=0\hat{y}_{i}=0. Since 𝒮{\cal S} is pseudorandom, we have P𝐱(𝐳S)=P𝐱(𝐳Si)=yi=1P_{\mathbf{x}}(\mathbf{z}^{S})=P_{\mathbf{x}}(\mathbf{z}^{S_{i}})=y_{i}=1. Hence, in the computation N^(𝐳^i)\hat{N}(\hat{\mathbf{z}}_{i}) there exists a neuron in 1{\cal E}_{1} with output at least 32\frac{3}{2} (by Property (Q1)). Since all incoming weights to the output neuron in N^\hat{N} are in [1110,910]\left[-\frac{11}{10},-\frac{9}{10}\right], and b^[910,1110]\hat{b}\in\left[\frac{9}{10},\frac{11}{10}\right], the input to the output neuron (including the bias term) is at most 111032910=14<0\frac{11}{10}-\frac{3}{2}\cdot\frac{9}{10}=-\frac{1}{4}<0, and thus its output is 0.