
Hardness of Learning Neural Networks with
Natural Weights

Amit Daniely School of Computer Science and Engineering, The Hebrew University, Jerusalem, Israel and Google Research Tel-Aviv, [email protected]    Gal Vardi Weizmann Institute of Science, Israel, [email protected]
Abstract

Neural networks are nowadays highly successful despite strong hardness results. The existing hardness results focus on the network architecture, and assume that the network’s weights are arbitrary. A natural approach to settle the discrepancy is to assume that the network’s weights are “well-behaved” and possess some generic properties that may allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some “random-like” properties with respect to some “natural” distributions. We prove negative results in this regard, and show that for depth-2 networks, and many “natural” weights distributions such as the normal and the uniform distribution, most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights, and every input distribution. It implies that there is no generic property that holds with high probability in such random networks and allows efficient learning.

1 Introduction

Neural networks have revolutionized performance in multiple domains, such as computer vision and natural language processing, and have proven to be a highly effective tool for solving many challenging problems. This impressive practical success of neural networks is not well understood from the theoretical point of view. In particular, despite extensive research in recent years, it is not clear which models are learnable by neural network algorithms.

Historically, there were many negative results for learning neural networks, and it is now known that under certain complexity assumptions, it is computationally hard to learn the class of functions computed by a neural network, even if the architecture is very simple. Indeed, it has been shown that learning neural networks is hard already for networks of depth 2 (Klivans and Sherstov, 2006; Daniely and Shalev-Shwartz, 2016). These results hold already for improper learning, namely, where the learning algorithm is allowed to return a hypothesis that does not belong to the considered hypothesis class.

In recent years, researchers have considered several ways to circumvent the discrepancy between those hardness results and the empirical success of neural networks, namely, to understand which models are still learnable by neural network algorithms. This effort includes proving learnability of linear models, including polynomials and kernel spaces (Andoni et al., 2014; Xie et al., 2016; Daniely et al., 2016; Daniely, 2017; Brutzkus et al., 2017; Jacot et al., 2018; Du et al., 2018; Oymak and Soltanolkotabi, 2018; Allen-Zhu et al., 2018a, b; Cao and Gu, 2019; Zou and Gu, 2019; Song and Yang, 2019; Ge et al., 2019; Oymak and Soltanolkotabi, 2019; Arora et al., 2019; Ji and Telgarsky, 2019; Ma et al., 2019; Lee et al., 2019; Daniely, 2019), making assumptions on the input distribution (Li and Yuan, 2017; Brutzkus and Globerson, 2017; Du et al., 2017a, b; Du and Goel, 2018; Goel et al., 2018; Shamir, 2018), the network’s weights (Arora et al., 2014; Shamir, 2018; Das et al., 2019; Agarwal et al., 2020; Goel and Klivans, 2017), or both (Janzamin et al., 2015; Tian, 2017).

In that respect, one fantastic result that could potentially be proven is that neural networks are efficiently learnable if we assume that the network’s weights are “well-behaved”, namely, that there are some generic properties of the network’s weights that allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some “random-like” properties with respect to some “natural weights distributions” (e.g., where the weights are drawn from a normal distribution). We say that a property of the network’s weights is a natural property with respect to such a natural weights distribution if it holds with high probability. Existing hardness results focus on the network architecture, and assume that the weights are arbitrary. Thus, it is unclear whether there exists a natural property that allows efficient learning.

In this work, we investigate networks with random weights, and networks whose weights possess natural properties. We show that under various natural weights distributions, most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights and every input distribution. We show that this implies that learning neural networks is hard already if their weights possess some natural property. Our hardness results are under the common assumption that refuting a random K-SAT formula is hard (the RSAT assumption). We emphasize that our results are valid for any learning algorithm, and not just common neural network algorithms.

We consider networks of depth 2 with a single output neuron, where the weights in the first layer are drawn from some natural distribution, and the weights in the second layer are all 1. We consider multiple natural weights distributions, for example, where the weights vector of each hidden neuron is distributed according to a multivariate normal distribution or uniformly on the sphere, or where each of its components is drawn i.i.d. from a normal, uniform or Bernoulli distribution. For each weights distribution, we show that learning such networks with high probability over the choice of the weights is hard. Thus, for such weights distributions, most networks are hard to learn. It implies that there is no generic property that holds with high probability (e.g., with probability 0.9) in such random networks and allows efficient learning. Hence, if generic properties that allow efficient learning exist, then they are not natural, namely, they are rare with respect to all the natural weights distributions that we consider.

We also consider random neural networks of depth 2, where the first layer is a convolutional layer with non-overlapping patches such that its filter is drawn from some natural distribution, and the weights of the second layer are all 1. We show that learning is hard also for such networks. It implies that there is no generic property that holds with high probability in such random convolutional networks and allows efficient learning.

Related work

Hardness of learning neural networks. Hardness of learning neural networks in the standard (improper and distribution-free) PAC model follows from hardness of learning intersection of halfspaces. Klivans and Sherstov (2006) showed that, assuming the hardness of the shortest vector problem, learning intersection of n^{\epsilon} halfspaces for a constant \epsilon>0 is hard. Daniely and Shalev-Shwartz (2016) showed that, under the RSAT assumption, learning intersection of \omega(\log(n)) halfspaces is hard. These results imply hardness of learning depth-2 networks with n^{\epsilon} and \omega(\log(n)) hidden neurons (respectively). In the agnostic model, learning halfspaces is already hard (Feldman et al., 2006; Daniely, 2016).

Learning random neural networks. Shamir (2018) considers the problem of learning neural networks where the weights are not arbitrary, but exhibit “nice” features such as non-degeneracy or some “random-like” appearance. The architecture of the networks that he considers is similar to ours. He shows that (under the RSAT assumption) no algorithm that is invariant to linear transformations can efficiently learn such networks if the columns of the weights matrix of the first layer are linearly independent. It implies that linearly-invariant algorithms cannot learn such networks when the weights are chosen randomly. We note that this result holds only for linearly-invariant algorithms, which is a strong restriction; standard gradient descent methods, for example, are not linearly invariant (Shamir (2018) shows that gradient descent can become linearly invariant if it is preceded by a certain preconditioning step). Our results hold for all algorithms.

In Das et al. (2019), it is shown that deep random neural networks (of depth \omega(\log(n))) with the sign activation function are hard to learn in the statistical query (SQ) model. This result was recently extended by Agarwal et al. (2020) to other activation functions, including the ReLU function. While their results hold for networks of depth \omega(\log(n)) and for SQ algorithms, our results hold for depth-2 networks and for all algorithms.

Our paper is structured as follows: In Section 2 we provide notations and definitions, followed by our results in Section 3. We sketch our proof ideas in Section 4, with all proofs deferred to Section 5.

2 Preliminaries

2.1 Random Constraints Satisfaction Problems

Let {\cal X}_{n,K} be the collection of (signed) K-tuples, that is, sequences x=[(\alpha_{1},i_{1}),\ldots,(\alpha_{K},i_{K})] for \alpha_{1},\ldots,\alpha_{K}\in\{\pm 1\} and distinct i_{1},\ldots,i_{K}\in[n]. Each x\in{\cal X}_{n,K} defines a function U_{x}:\{\pm 1\}^{n}\to\{\pm 1\}^{K} by U_{x}(\psi)=(\alpha_{1}\psi_{i_{1}},\ldots,\alpha_{K}\psi_{i_{K}}).

Let P:\{\pm 1\}^{K}\to\{0,1\} be some predicate. A P-constraint with n variables is a function C:\{\pm 1\}^{n}\to\{0,1\} of the form C=P\circ U_{x} for some x\in{\cal X}_{n,K}. An instance of the CSP problem \mathrm{CSP}(P) is a P-formula, i.e., a collection J=\{C_{1},\ldots,C_{m}\} of P-constraints (each specified by a K-tuple). The goal is to find an assignment \psi\in\{\pm 1\}^{n} that maximizes the fraction of satisfied constraints (i.e., constraints with C_{i}(\psi)=1). We will allow CSP problems where P varies with n (but is still fixed for every n). For example, we can consider the \lceil\log(n)\rceil-SAT problem.
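To make these definitions concrete, here is a minimal Python sketch (the helper names are ours, not from the paper) of a signed K-tuple, the map U_{x}, and the evaluation of a P-constraint, instantiated with the \mathrm{SAT}_{K} predicate:

```python
from typing import Callable, List, Tuple

# A signed K-tuple: a list of (alpha_j, i_j) pairs with alpha_j in {+1, -1}
# and distinct indices i_j in {0, ..., n-1} (0-based here for convenience).
SignedTuple = List[Tuple[int, int]]

def U_x(x: SignedTuple, psi: List[int]) -> Tuple[int, ...]:
    """U_x(psi) = (alpha_1 * psi_{i_1}, ..., alpha_K * psi_{i_K})."""
    return tuple(alpha * psi[i] for alpha, i in x)

def constraint(P: Callable[[Tuple[int, ...]], int], x: SignedTuple,
               psi: List[int]) -> int:
    """A P-constraint C(psi) = P(U_x(psi)), returning 0 or 1."""
    return P(U_x(x, psi))

# Example: the K-SAT predicate SAT_K(z) = z_1 v ... v z_K on {+1, -1}^K,
# where +1 is interpreted as "true".
def SAT_K(z: Tuple[int, ...]) -> int:
    return int(any(zi == 1 for zi in z))

psi = [1, -1, 1, -1]                  # an assignment over n = 4 variables
x = [(1, 0), (-1, 1), (1, 3)]         # the clause  v_1 v (not v_2) v v_4
print(constraint(SAT_K, x, psi))      # prints 1: the clause is satisfied
```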

We will consider the problem of distinguishing satisfiable from random P-formulas (a.k.a. the problem of refuting random P-formulas). For m:\mathbb{N}\to\mathbb{N}, we say that a randomized algorithm {\cal A} efficiently solves the problem \mathrm{CSP}^{\mathrm{rand}}_{m(n)}(P) if {\cal A} is a polynomial-time algorithm such that:

  • If J is a satisfiable instance of \mathrm{CSP}(P) with n variables and m(n) constraints, then

    \Pr\left({\cal A}(J)=\text{``satisfiable"}\right)\geq\frac{3}{4}-o_{n}(1),

    where the probability is over the randomness of {\cal A}.

  • If J is a random instance of \mathrm{CSP}(P) with n variables and m(n) constraints (to be precise, in a random formula with n variables and m constraints, the K-tuple defining each constraint is chosen uniformly and independently from the other constraints), then

    \Pr\left({\cal A}(J)=\text{``random"}\right)\geq\frac{3}{4}-o_{n}(1),

    where the probability is over the choice of J and the randomness of {\cal A}.

2.2 The random K-SAT assumption

Unless we face a dramatic breakthrough in complexity theory, it seems unlikely that hardness of learning can be established on standard complexity assumptions such as \mathbf{P}\neq\mathbf{NP} (see Applebaum et al. (2008); Daniely et al. (2014)). Indeed, all currently known lower bounds are based on assumptions from cryptography or average case hardness. Following Daniely and Shalev-Shwartz (2016) we will rely on an assumption about random K-SAT problems which we outline below.

Let J=\{C_{1},\ldots,C_{m}\} be a random K-SAT formula on n variables. Precisely, each K-SAT constraint C_{i} is chosen independently and uniformly from the collection of n-variate K-SAT constraints. A simple probabilistic argument shows that for some constant C (depending only on K), if m\geq Cn, then J is not satisfiable w.h.p. The problem of refuting random K-SAT formulas (a.k.a. the problem of distinguishing satisfiable from random K-SAT formulas) is the problem \mathrm{CSP}^{\mathrm{rand}}_{m(n)}(\mathrm{SAT}_{K}), where \mathrm{SAT}_{K} is the predicate z_{1}\vee\ldots\vee z_{K}.
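A random formula in this sense is easy to sample; the following Python sketch (function names are ours) draws m constraints, each defined by a uniformly random signed K-tuple:

```python
import random
from typing import List, Tuple

def random_ksat_formula(n: int, m: int, K: int,
                        rng: random.Random) -> List[List[Tuple[int, int]]]:
    """A random K-SAT instance: m clauses, each a uniformly random signed
    K-tuple over n variables (distinct variables within a clause)."""
    formula = []
    for _ in range(m):
        variables = rng.sample(range(n), K)            # distinct i_1, ..., i_K
        clause = [(rng.choice([1, -1]), i) for i in variables]
        formula.append(clause)
    return formula

rng = random.Random(0)
J = random_ksat_formula(n=100, m=100 ** 2, K=3, rng=rng)   # m = n^2 constraints
print(len(J), J[0])
```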

The problem of refuting random K-SAT formulas has been extensively studied during the last 50 years. It is not hard to see that the problem gets easier as m gets larger. The currently best known algorithms (Feige and Ofek, 2004; Coja-Oghlan et al., 2004, 2010) can only refute random instances with \Omega\left(n^{\lceil\frac{K}{2}\rceil}\right) constraints for K\geq 4 and \Omega\left(n^{1.5}\right) constraints for K=3. In light of that, Feige (2002) made the assumption that for K=3, refuting random instances with C\cdot n constraints, for every constant C, is hard (and used that to prove hardness of approximation results). Here, we put forward the following assumption.

Assumption 2.1.

Refuting random K-\mathrm{SAT} formulas with n^{f(K)} constraints is hard for some f(K)=\omega(1). Namely, for every d>0 there is K such that the problem \mathrm{CSP}^{\mathrm{rand}}_{n^{d}}(\mathrm{SAT}_{K}) is hard.

Terminology 2.2.

A computational problem is RSAT-hard if its tractability refutes Assumption 2.1.

In addition to the performance of the best known algorithms, there is plenty of evidence for the above assumption, in the form of hardness-of-approximation results and lower bounds on various algorithms, including resolution, convex hierarchies, sum-of-squares, statistical algorithms, and more. We refer the reader to Daniely and Shalev-Shwartz (2016) for a more complete discussion.

2.3 Learning hypothesis classes and random neural networks

Let {\cal H}\subseteq{\mathbb{R}}^{({\mathbb{R}}^{n})} be an hypothesis class. We say that a learning algorithm {\cal L} efficiently (PAC) learns {\cal H} if for every target function f\in{\cal H} and every distribution {\cal D} over {\mathbb{R}}^{n}, given only access to examples (\mathbf{x},f(\mathbf{x})) where \mathbf{x}\sim{\cal D}, the algorithm {\cal L} runs in time polynomial in n and returns, with probability at least \frac{9}{10} (over the internal randomness of {\cal L}), a predictor h such that

\operatorname*{\mathbb{E}}_{\mathbf{x}\sim{\cal D}}\left[\left(h(\mathbf{x})-f(\mathbf{x})\right)^{2}\right]\leq\frac{1}{10}.

For a real matrix W=(\mathbf{w}_{1},\ldots,\mathbf{w}_{m}) of size n\times m, let h_{W}:{\mathbb{R}}^{n}\rightarrow[0,1] be the function h_{W}(\mathbf{x})=\left[\sum_{i=1}^{m}[\langle\mathbf{w}_{i},\mathbf{x}\rangle]_{+}\right]_{[0,1]}, where [z]_{+}=\max\{0,z\} is the ReLU function, and [z]_{[0,1]}=\min\{1,\max\{0,z\}\} is the clipping operation on the interval [0,1]. This corresponds to depth-2 networks with m hidden neurons, with no bias in the first layer, and where the outputs of the first layer are simply summed and moved through a clipping non-linearity (this operation can also be easily implemented using a second layer composed of two ReLU neurons).
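For concreteness, a minimal NumPy sketch (ours) of the network h_{W} defined above, i.e., a sum of ReLU units followed by clipping to [0,1]:

```python
import numpy as np

def h_W(W: np.ndarray, x: np.ndarray) -> float:
    """h_W(x) = clip( sum_i ReLU(<w_i, x>), 0, 1 ) for W of shape (n, m),
    whose columns w_1, ..., w_m are the first-layer weight vectors."""
    pre_activations = W.T @ x                      # shape (m,): <w_i, x> for each hidden neuron
    hidden = np.maximum(pre_activations, 0.0)      # ReLU
    return float(np.clip(hidden.sum(), 0.0, 1.0))  # second layer: sum, then clip to [0, 1]

# Example with n = 5 inputs and m = 3 hidden neurons.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
x = rng.normal(size=5)
print(h_W(W, x))
```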

Let {\cal D}_{\mathrm{mat}} be a distribution over real matrices of size n\times m. We assume that m\leq n. We say that a learning algorithm {\cal L} efficiently learns a random neural network with respect to {\cal D}_{\mathrm{mat}} (a {\cal D}_{\mathrm{mat}}-random network, for short) if it satisfies the following property. For a random matrix W drawn according to {\cal D}_{\mathrm{mat}}, and every distribution {\cal D} (that may depend on W) over {\mathbb{R}}^{n}, given only access to examples (\mathbf{x},h_{W}(\mathbf{x})) where \mathbf{x}\sim{\cal D}, the algorithm {\cal L} runs in time polynomial in n and returns, with probability at least \frac{3}{4} over the choice of W and the internal randomness of {\cal L}, a predictor h such that

\operatorname*{\mathbb{E}}_{\mathbf{x}\sim{\cal D}}\left[\left(h(\mathbf{x})-h_{W}(\mathbf{x})\right)^{2}\right]\leq\frac{1}{10}.
Remark 2.1.

Learning an hypothesis class requires that for every target function in the class and every input distribution, the learning algorithm succeeds w.h.p., while learning a random neural network requires that for a random target function and every input distribution, the learning algorithm succeeds w.h.p. Thus, in the former case an adversary chooses both the target function and the input distribution, and in the latter the target function is chosen randomly and then the adversary chooses the input distribution. Therefore, the requirement from the algorithm for learning random neural networks is weaker than the requirement for learning neural networks in the standard PAC-learning model. We show hardness of learning already under this weaker requirement. Then, we show that hardness of learning random neural networks implies hardness of learning neural networks with “natural” weights under the standard PAC-learning model.

2.4 Notations and terminology

We denote by {\cal U}([-r,r]) the uniform distribution over the interval [-r,r] in {\mathbb{R}}; by {\cal U}(\{\pm r\}) the symmetric Bernoulli distribution, namely, \Pr(r)=\Pr(-r)=\frac{1}{2}; by {\cal N}(0,\sigma^{2}) the normal distribution with mean 0 and variance \sigma^{2}; and by {\cal N}({\mathbf{0}},\Sigma) the multivariate normal distribution with mean {\mathbf{0}} and covariance matrix \Sigma. We say that a distribution over {\mathbb{R}} is symmetric if it is continuous and its density satisfies f(x)=f(-x) for every x\in{\mathbb{R}}, or it is discrete and \Pr(x)=\Pr(-x) for every x\in{\mathbb{R}}. For a matrix M we denote by s_{\min}(M) and s_{\max}(M) the minimal and maximal singular values of M. For \mathbf{x}\in{\mathbb{R}}^{n} we denote by \left\|\mathbf{x}\right\| its L_{2} norm. We denote the (n-1)-dimensional unit sphere by {\mathbb{S}}^{n-1}=\{\mathbf{x}\in{\mathbb{R}}^{n}:\left\|\mathbf{x}\right\|=1\}. For t\in{\mathbb{N}} let [t]=\{1,\ldots,t\}. We say that an algorithm is efficient if it runs in polynomial time.

3 Results

We show RSAT-hardness of learning {\cal D}_{\mathrm{mat}}-random networks, where {\cal D}_{\mathrm{mat}} corresponds either to a fully-connected layer or to a convolutional layer. It implies hardness of learning depth-2 neural networks whose weights satisfy some natural property. We focus on the case of networks with {\cal O}(\log^{2}(n)) hidden neurons. We note, however, that our results can be extended to networks with q(n) hidden neurons, for any q(n)=\omega(\log(n)). Moreover, while we consider networks whose weights in the second layer are all 1, our results can be easily extended to networks where the weights in the second layer are arbitrary or random positive numbers.

3.1 Fully-connected neural networks

We start with random fully-connected neural networks. First, we consider a distribution {\cal D}_{\mathrm{mat}} over real matrices such that the entries are drawn i.i.d. from a symmetric distribution.

We say that a random variable z is b-subgaussian for some b>0 if for all t>0 we have

\Pr\left(|z|>t\right)\leq 2\exp\left(-\frac{t^{2}}{b^{2}}\right).
Theorem 3.1.

Let z be a symmetric random variable with variance \sigma^{2}. Assume that the random variable z^{\prime}=\frac{z}{\sigma} is b-subgaussian for some fixed b. Let \epsilon>0 be a small constant, let m={\cal O}(\log^{2}(n)), and let {\cal D}_{\mathrm{mat}} be a distribution over {\mathbb{R}}^{n\times m} such that the entries are i.i.d. copies of z. Then, learning a {\cal D}_{\mathrm{mat}}-random neural network is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{\sigma} in {\mathbb{R}}^{n}.

Since the normal distribution, the uniform distribution over an interval, and the symmetric Bernoulli distribution are subgaussian (Rivasplata, 2012), we have the following corollary.

Corollary 3.1.

Let \epsilon>0 be a small constant, let m={\cal O}(\log^{2}(n)), and let {\cal D}_{\mathrm{mat}} be a distribution over {\mathbb{R}}^{n\times m} such that the entries are drawn i.i.d. from a distribution {\cal D}_{z}.

  1. If {\cal D}_{z}={\cal N}(0,\sigma^{2}), then learning a {\cal D}_{\mathrm{mat}}-random neural network is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{\sigma} in {\mathbb{R}}^{n}.

  2. If {\cal D}_{z}={\cal U}([-r,r]) or {\cal D}_{z}={\cal U}(\{\pm r\}), then learning a {\cal D}_{\mathrm{mat}}-random neural network is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{r} in {\mathbb{R}}^{n}.

In the following theorem, we consider the case where {\cal D}_{\mathrm{mat}} is such that each column is drawn i.i.d. from a multivariate normal distribution.

Theorem 3.2.

Let \Sigma be a positive definite matrix of size n\times n, and let \lambda_{\min} be its minimal eigenvalue. Let \epsilon>0 be a small constant, let m={\cal O}(\log^{2}(n)), and let {\cal D}_{\mathrm{mat}} be a distribution over {\mathbb{R}}^{n\times m} such that each column is drawn i.i.d. from {\cal N}({\mathbf{0}},\Sigma). Then, learning a {\cal D}_{\mathrm{mat}}-random neural network is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{\sqrt{\lambda_{\min}}} in {\mathbb{R}}^{n}.

We also study the case where the distribution {\cal D}_{\mathrm{mat}} is such that each column is drawn i.i.d. from the uniform distribution on the sphere of radius r in {\mathbb{R}}^{n}.

Theorem 3.3.

Let m={\cal O}(\log^{2}(n)) and let {\cal D}_{\mathrm{mat}} be a distribution over {\mathbb{R}}^{n\times m} such that each column is drawn i.i.d. from the uniform distribution over r\cdot{\mathbb{S}}^{n-1}. Then, learning a {\cal D}_{\mathrm{mat}}-random neural network is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most {\cal O}\left(\frac{n\sqrt{n}\log^{4}(n)}{r}\right) in {\mathbb{R}}^{n}.

From the above theorems we have the following corollary, which shows that learning neural networks (in the standard PAC-learning model) is hard already if the weights satisfy some natural property.

Corollary 3.2.

Let m={\cal O}(\log^{2}(n)), and let {\cal D}_{\mathrm{mat}} be a distribution over {\mathbb{R}}^{n\times m} from Theorems 3.2, 3.3, or from Corollary 3.1. Let P be a property that holds with probability at least \frac{9}{10} for a matrix W drawn from {\cal D}_{\mathrm{mat}}. Let {\cal H}=\{h_{W}:W\in{\mathbb{R}}^{n\times m},\;W\text{ satisfies }P\} be an hypothesis class. Then, learning {\cal H} is RSAT-hard, already if the distribution {\cal D} is over vectors of norm bounded by the appropriate expression from Theorems 3.2, 3.3, or Corollary 3.1.

The corollary follows easily from the following argument: Assume that {\cal L} learns {\cal H}. If a matrix W satisfies P, then for every distribution {\cal D}, given access to examples (\mathbf{x},h_{W}(\mathbf{x})) where \mathbf{x}\sim{\cal D}, the algorithm {\cal L} returns with probability at least \frac{9}{10} a predictor h such that

\operatorname*{\mathbb{E}}_{\mathbf{x}\sim{\cal D}}\left[\left(h(\mathbf{x})-h_{W}(\mathbf{x})\right)^{2}\right]\leq\frac{1}{10}.

Now, let W\sim{\cal D}_{\mathrm{mat}}. Since W satisfies P with probability at least \frac{9}{10}, then for every distribution {\cal D}, given access to examples (\mathbf{x},h_{W}(\mathbf{x})) where \mathbf{x}\sim{\cal D}, the algorithm {\cal L} returns with probability at least \frac{9}{10}\cdot\frac{9}{10}\geq\frac{3}{4} a predictor that satisfies the above inequality. Hence, {\cal L} learns a {\cal D}_{\mathrm{mat}}-random neural network. Note that this argument holds also if {\cal D} is over vectors of a bounded norm.

3.2 Convolutional neural networks

We now turn to random Convolutional Neural Networks (CNNs). Here, the distribution {\cal D}_{\mathrm{mat}} corresponds to a random convolutional layer. Our convolutional layer has a very simple structure with non-overlapping patches. Let t be an integer that divides n, and let \mathbf{w}\in{\mathbb{R}}^{t}. We denote by h_{\mathbf{w}}^{n}:{\mathbb{R}}^{n}\rightarrow[0,1] the CNN

h_{\mathbf{w}}^{n}(\mathbf{x})=\left[\sum_{i=1}^{\frac{n}{t}}[\langle\mathbf{w},(x_{t(i-1)+1},\ldots,x_{t\cdot i})\rangle]_{+}\right]_{[0,1]}.

Note that h_{\mathbf{w}}^{n} has \frac{n}{t} hidden neurons. A random CNN corresponds to a random vector \mathbf{w}\in{\mathbb{R}}^{t}. Let {\cal D}_{\mathrm{vec}} be a distribution over {\mathbb{R}}^{t}. A {\cal D}_{\mathrm{vec}}-random CNN with m hidden neurons is the CNN h_{\mathbf{w}}^{mt} where \mathbf{w} is drawn from {\cal D}_{\mathrm{vec}}. Note that every such distribution over CNNs can be expressed by an appropriate distribution {\cal D}_{\mathrm{mat}} over matrices. Our results hold also if we replace the second layer of h_{\mathbf{w}}^{n} with a max-pooling layer (instead of summing and clipping).
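A minimal NumPy sketch (ours) of h_{\mathbf{w}}^{n} with non-overlapping patches, matching the display above:

```python
import numpy as np

def h_w_cnn(w: np.ndarray, x: np.ndarray) -> float:
    """h_w^n(x) for a filter w in R^t and input x in R^n with t dividing n:
    apply <w, .> to each of the n/t non-overlapping patches, pass through the
    ReLU, sum, and clip the result to [0, 1]."""
    t = w.shape[0]
    n = x.shape[0]
    assert n % t == 0, "the patch size t must divide the input dimension n"
    patches = x.reshape(n // t, t)               # non-overlapping patches
    hidden = np.maximum(patches @ w, 0.0)        # one ReLU unit per patch
    return float(np.clip(hidden.sum(), 0.0, 1.0))

rng = np.random.default_rng(0)
w = rng.normal(size=4)      # filter of size t = 4
x = rng.normal(size=12)     # input of dimension n = 12, i.e. 3 hidden neurons
print(h_w_cnn(w, x))
```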

We start with random CNNs, where each component of the weight vector is drawn i.i.d. from a symmetric distribution {\cal D}_{z} over {\mathbb{R}}. In the following theorem, the function f(t) bounds the concentration of {\cal D}_{z} near zero, and is needed in order to bound the support of the distribution {\cal D}.

Theorem 3.4.

Let n=(n^{\prime}+1)\log^{2}(n^{\prime}). Let {\cal D}_{z}^{n^{\prime}+1} be a distribution over {\mathbb{R}}^{n^{\prime}+1} such that each component is drawn i.i.d. from a symmetric distribution {\cal D}_{z} over {\mathbb{R}}. Let f(t)>0 be a function such that \Pr_{z\sim{\cal D}_{z}}(|z|<f(t))=o_{t}(\frac{1}{t}). Then, learning a {\cal D}_{z}^{n^{\prime}+1}-random CNN h_{\mathbf{w}}^{n} with \log^{2}(n^{\prime})={\cal O}(\log^{2}(n)) hidden neurons is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{\log^{2}(n^{\prime})}{f(n^{\prime})} in {\mathbb{R}}^{n}.

If {\cal D}_{z} is the uniform distribution {\cal U}([-r,r]) then we can choose f(t)=\frac{r}{t\log(t)}, if it is the normal distribution {\cal N}(0,\sigma^{2}) then we can choose f(t)=\frac{\sigma}{t\log(t)}, and if it is the symmetric Bernoulli distribution {\cal U}(\{\pm r\}) then we can choose f(t)=r. Thus, we obtain the following corollary.

Corollary 3.3.

Let {\cal D}_{\mathrm{vec}} be a distribution such that each component is drawn i.i.d. from a distribution {\cal D}_{z} over {\mathbb{R}}.

  1. If {\cal D}_{z}={\cal U}([-r,r]), then learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} with {\cal O}(\log^{2}(n)) hidden neurons is RSAT-hard, already if {\cal D} is over vectors of norm at most \frac{n\log(n)}{r}.

  2. If {\cal D}_{z}={\cal N}(0,\sigma^{2}), then learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} with {\cal O}(\log^{2}(n)) hidden neurons is RSAT-hard, already if {\cal D} is over vectors of norm at most \frac{n\log(n)}{\sigma}.

  3. If {\cal D}_{z}={\cal U}(\{\pm r\}), then learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} with {\cal O}(\log^{2}(n)) hidden neurons is RSAT-hard, already if {\cal D} is over vectors of norm at most \frac{\log^{2}(n)}{r}.

We also consider the case where h_{\mathbf{w}}^{n} is a CNN such that \mathbf{w} is drawn from a multivariate normal distribution {\cal N}({\mathbf{0}},\Sigma).

Theorem 3.5.

Let \Sigma be a positive definite matrix of size t\times t, and let \lambda_{\min} be its minimal eigenvalue. Let n={\cal O}(t\log^{2}(t)). Then, learning a {\cal N}({\mathbf{0}},\Sigma)-random CNN h_{\mathbf{w}}^{n} (with {\cal O}(\log^{2}(n)) hidden neurons) is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n\log(n)}{\sqrt{\lambda_{\min}}} in {\mathbb{R}}^{n}.

Finally, we study the case where h_{\mathbf{w}}^{n} is a CNN such that \mathbf{w} is drawn from the uniform distribution over the sphere.

Theorem 3.6.

Let {\cal D}_{\mathrm{vec}} be the uniform distribution over the sphere of radius r in {\mathbb{R}}^{t}. Let n={\cal O}(t\log^{2}(t)). Then, learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} (with {\cal O}(\log^{2}(n)) hidden neurons) is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{\sqrt{n}\log(n)}{r} in {\mathbb{R}}^{n}.

Now, the following corollary follows easily (from the same argument as in Corollary 3.2), and shows that learning CNNs (in the standard PAC-learning model) is hard already if the weights satisfy some natural property.

Corollary 3.4.

Let {\cal D}_{\mathrm{vec}} be a distribution over {\mathbb{R}}^{t} from Theorems 3.5, 3.6, or from Corollary 3.3. Let n={\cal O}(t\log^{2}(t)). Let P be a property that holds with probability at least \frac{9}{10} for a vector \mathbf{w} drawn from {\cal D}_{\mathrm{vec}}. Let {\cal H}=\{h_{\mathbf{w}}^{n}:\mathbf{w}\in{\mathbb{R}}^{t},\;\mathbf{w}\text{ satisfies }P\} be an hypothesis class. Then, learning {\cal H} is RSAT-hard, already if the distribution {\cal D} is over vectors of norm bounded by the appropriate expression from Theorems 3.5, 3.6, or Corollary 3.3.

3.2.1 Improving the bounds on the support of {\cal D}

By increasing the number of hidden neurons from {\cal O}(\log^{2}(n)) to {\cal O}(n), we can improve the bounds on the support of {\cal D}. Note that our results so far on learning random CNNs are for CNNs with input dimension n={\cal O}(t\log^{2}(t)), where t is the size of the patches. We now consider CNNs with input dimension \tilde{n}=t^{c} for some integer c>1. Note that such CNNs have t^{c-1}={\cal O}(\tilde{n}) hidden neurons.

Assume that there is an efficient algorithm {\cal L}^{\prime} for learning {\cal D}_{\mathrm{vec}}-random CNNs with input dimension \tilde{n}=t^{c}, where {\cal D}_{\mathrm{vec}} is a distribution over {\mathbb{R}}^{t}. Assume that {\cal L}^{\prime} uses samples with at most \tilde{n}^{d}=t^{cd} inputs. We show an algorithm {\cal L} for learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} with n={\cal O}(t\log^{2}(t)). Let S=\{(\mathbf{x}_{1},h_{\mathbf{w}}^{n}(\mathbf{x}_{1})),\ldots,(\mathbf{x}_{n^{cd}},h_{\mathbf{w}}^{n}(\mathbf{x}_{n^{cd}}))\} be a sample, and let S^{\prime}=\{(\mathbf{x}^{\prime}_{1},h_{\mathbf{w}}^{n}(\mathbf{x}_{1})),\ldots,(\mathbf{x}^{\prime}_{n^{cd}},h_{\mathbf{w}}^{n}(\mathbf{x}_{n^{cd}}))\}, where for every vector \mathbf{x}\in{\mathbb{R}}^{n}, the vector \mathbf{x}^{\prime}\in{\mathbb{R}}^{\tilde{n}} is obtained from \mathbf{x} by padding it with zeros. Thus, \mathbf{x}^{\prime}=(\mathbf{x},0,\ldots,0). Note that n^{cd}>\tilde{n}^{d}. Also, note that for every i we have h_{\mathbf{w}}^{n}(\mathbf{x}_{i})=h_{\mathbf{w}}^{\tilde{n}}(\mathbf{x}^{\prime}_{i}). Hence, S^{\prime} is realizable by the CNN h_{\mathbf{w}}^{\tilde{n}}. Now, given S, the algorithm {\cal L} runs {\cal L}^{\prime} on S^{\prime} and returns the hypothesis h(\mathbf{x})={\cal L}^{\prime}(S^{\prime})(\mathbf{x}^{\prime}).
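The padding step in this reduction is straightforward; the following NumPy sketch (ours, with toy dimensions rather than the exact n={\cal O}(t\log^{2}(t)) of the text) checks that zero-padding the input does not change the CNN's output, since zero patches contribute \mathrm{ReLU}(\langle\mathbf{w},\mathbf{0}\rangle)=0:

```python
import numpy as np

def h_w_cnn(w, x):
    """h_w^n(x): non-overlapping patches, ReLU, sum, clip to [0, 1]."""
    t = w.shape[0]
    patches = x.reshape(-1, t)
    return float(np.clip(np.maximum(patches @ w, 0.0).sum(), 0.0, 1.0))

# The reduction pads an input x in R^n with zeros up to dimension n_tilde = t^c.
# Zero patches contribute ReLU(<w, 0>) = 0, so h_w^n(x) = h_w^{n_tilde}(x').
rng = np.random.default_rng(0)
t, c = 4, 3
n = t * 9                 # any multiple of t smaller than t^c works for this check
n_tilde = t ** c
w = rng.normal(size=t)
x = rng.normal(size=n)
x_padded = np.concatenate([x, np.zeros(n_tilde - n)])    # x' = (x, 0, ..., 0)
assert np.isclose(h_w_cnn(w, x), h_w_cnn(w, x_padded))   # same label under both networks
```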

Therefore, if learning {\cal D}_{\mathrm{vec}}-random CNNs with input dimension n={\cal O}(t\log^{2}(t)) is hard already if the distribution {\cal D} is over vectors of norm at most g(n), then learning {\cal D}_{\mathrm{vec}}-random CNNs with input dimension \tilde{n}=t^{c} is hard already if the distribution {\cal D} is over vectors of norm at most g(n)<g(t^{2})=g(\tilde{n}^{\frac{2}{c}}). Hence, we have the following corollaries.

Corollary 3.5.

Let {\cal D}_{\mathrm{vec}} be a distribution over {\mathbb{R}}^{t} such that each component is drawn i.i.d. from a distribution {\cal D}_{z} over {\mathbb{R}}. Let n=t^{c} for some integer c>1, and let \epsilon=\frac{3}{c}.

  1. If {\cal D}_{z}={\cal U}([-r,r]), then learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} (with {\cal O}(n) hidden neurons) is RSAT-hard, already if {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{r}.

  2. If {\cal D}_{z}={\cal N}(0,\sigma^{2}), then learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} (with {\cal O}(n) hidden neurons) is RSAT-hard, already if {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{\sigma}.

Corollary 3.6.

Let \Sigma be a positive definite matrix of size t\times t, and let \lambda_{\min} be its minimal eigenvalue. Let n=t^{c} for some integer c>1, and let \epsilon=\frac{3}{c}. Then, learning a {\cal N}({\mathbf{0}},\Sigma)-random CNN h_{\mathbf{w}}^{n} (with {\cal O}(n) hidden neurons) is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{\sqrt{\lambda_{\min}}}.

Corollary 3.7.

Let {\cal D}_{\mathrm{vec}} be the uniform distribution over the sphere of radius r in {\mathbb{R}}^{t}. Let n=t^{c} for some integer c>1, and let \epsilon=\frac{2}{c}. Then, learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} (with {\cal O}(n) hidden neurons) is RSAT-hard, already if the distribution {\cal D} is over vectors of norm at most \frac{n^{\epsilon}}{r}.

As an example, consider a CNN h_{\mathbf{w}}^{n} with n=t^{c}. Note that since the patch size is t, each hidden neuron has t input neurons feeding into it. Consider a distribution {\cal D}_{\mathrm{vec}} over {\mathbb{R}}^{t} such that each component is drawn i.i.d. from a normal distribution with \sigma=\frac{1}{\sqrt{t}}. This distribution corresponds to the standard Xavier initialization. Then, by Corollary 3.5, learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} is RSAT-hard, already if {\cal D} is over vectors of norm at most n^{\frac{3}{c}}\sqrt{t}=n^{\frac{3}{c}}\cdot n^{\frac{1}{2c}}. By choosing an appropriate c, we have that learning a {\cal D}_{\mathrm{vec}}-random CNN h_{\mathbf{w}}^{n} is RSAT-hard, already if {\cal D} is over vectors of norm at most \sqrt{n}.
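Unpacking the exponent arithmetic behind the last step (our computation; the text only says “an appropriate c”): since t=n^{\frac{1}{c}} and \sigma=\frac{1}{\sqrt{t}},

\frac{n^{\frac{3}{c}}}{\sigma}=n^{\frac{3}{c}}\cdot\sqrt{t}=n^{\frac{3}{c}}\cdot n^{\frac{1}{2c}}=n^{\frac{7}{2c}}\leq\sqrt{n}\quad\text{whenever }c\geq 7.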

Finally, note that Corollary 3.4 holds also for the values of n and the bounds on the support of {\cal D} from Corollaries 3.5, 3.6 and 3.7.

Open Questions

Obvious open questions arising from our work are to obtain sharper norm bounds on the input distribution, and to avoid the non-linearity in the second layer (the clipping operation). Another direction is to extend the results to more weights distributions and network architectures. A potential positive result is to find some useful “unnatural” properties of the network’s weights that allow efficient learning.

4 Main proof ideas

Let {\cal H}=\{h_{W}:W\in{\mathbb{R}}^{n\times k}\} where k={\cal O}(\log^{2}(n)). By Daniely and Shalev-Shwartz (2016), learning {\cal H} is hard. We want to reduce the problem of learning {\cal H} to learning a {\cal D}_{\mathrm{mat}}-random neural network for some fixed {\cal D}_{\mathrm{mat}}. Let W\in{\mathbb{R}}^{n\times k} and let S=\{(\mathbf{x}_{1},h_{W}(\mathbf{x}_{1})),\ldots,(\mathbf{x}_{m},h_{W}(\mathbf{x}_{m}))\} be a sample. Let {\cal D}_{\mathrm{mat}}^{M} be a distribution over the group of n\times n invertible matrices, and let M\sim{\cal D}_{\mathrm{mat}}^{M}. Consider the sample S^{\prime}=\{(\mathbf{x}^{\prime}_{1},h_{W}(\mathbf{x}_{1})),\ldots,(\mathbf{x}^{\prime}_{m},h_{W}(\mathbf{x}_{m}))\} where for every i\in[m] we have \mathbf{x}^{\prime}_{i}=(M^{\top})^{-1}\mathbf{x}_{i}. Since W^{\top}\mathbf{x}_{i}=W^{\top}M^{\top}(M^{\top})^{-1}\mathbf{x}_{i}=(MW)^{\top}\mathbf{x}^{\prime}_{i}, we have h_{W}(\mathbf{x}_{i})=h_{MW}(\mathbf{x}^{\prime}_{i}). Thus, S^{\prime}=\{(\mathbf{x}^{\prime}_{1},h_{MW}(\mathbf{x}^{\prime}_{1})),\ldots,(\mathbf{x}^{\prime}_{m},h_{MW}(\mathbf{x}^{\prime}_{m}))\}. Note that MW is a random matrix. Now, assume that there is an algorithm {\cal L}^{\prime} that learns successfully from S^{\prime}. Consider the following algorithm {\cal L}. Given a sample S, the algorithm {\cal L} runs {\cal L}^{\prime} on S^{\prime}, and returns the hypothesis h(\mathbf{x})={\cal L}^{\prime}(S^{\prime})((M^{\top})^{-1}\mathbf{x}). It is not hard to show that {\cal L} learns successfully from S. Since our goal is to reduce the problem of learning {\cal H} to learning a {\cal D}_{\mathrm{mat}}-random network where {\cal D}_{\mathrm{mat}} is a fixed distribution, we need MW to be a {\cal D}_{\mathrm{mat}}-random matrix. However, the distribution of MW depends on both {\cal D}_{\mathrm{mat}}^{M} and W (which is an unknown matrix).
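The identity h_{W}(\mathbf{x})=h_{MW}((M^{\top})^{-1}\mathbf{x}) driving this reduction is easy to check numerically; here is a small NumPy sanity check (ours):

```python
import numpy as np

def h_W(W, x):
    """h_W(x) = clip( sum_i ReLU(<w_i, x>), 0, 1 )."""
    return float(np.clip(np.maximum(W.T @ x, 0.0).sum(), 0.0, 1.0))

rng = np.random.default_rng(0)
n, k = 6, 3
W = rng.normal(size=(n, k))
M = rng.normal(size=(n, n))             # invertible with probability 1
x = rng.normal(size=n)
x_prime = np.linalg.solve(M.T, x)       # x' = (M^T)^{-1} x
# W^T x = W^T M^T (M^T)^{-1} x = (M W)^T x', hence the labels agree:
assert np.isclose(h_W(W, x), h_W(M @ W, x_prime))
```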

Hence, the challenge is to find a reduction that translates a sample that is realizable by h_{W} to a sample that is realizable by a {\cal D}_{\mathrm{mat}}-random network, without knowing W. In order to obtain such a reduction, we proceed in two steps. First, we show that learning neural networks of the form h_{W} where W\in{\mathbb{R}}^{n\times k} is hard already if we restrict W to a set of matrices with a special structure. Then, we show a distribution {\cal D}_{\mathrm{mat}}^{M} such that if M\sim{\cal D}_{\mathrm{mat}}^{M} and W has the special structure, then MW\sim{\cal D}_{\mathrm{mat}}. This property, as we showed, enables us to reduce the problem of learning such special-structure networks to the problem of learning {\cal D}_{\mathrm{mat}}-random networks.

In order to obtain a special structure for W, consider the class {\cal H}_{\mathrm{sign-cnn}}^{n,k}=\{h_{\mathbf{w}}^{n}:\mathbf{w}\in\{\pm 1\}^{\frac{n}{k}}\}. Note that the CNNs in {\cal H}_{\mathrm{sign-cnn}}^{n,k} have k={\cal O}(\log^{2}(n)) hidden neurons. The networks in {\cal H}_{\mathrm{sign-cnn}}^{n,k} have three important properties: (1) they are CNNs; (2) their patches are non-overlapping; (3) the components of the filter \mathbf{w} are in \{\pm 1\}. Hardness of learning {\cal H}_{\mathrm{sign-cnn}}^{n,k} can be shown by a reduction from the RSAT problem. We defer the details of this reduction to the next sections. Let W=(\mathbf{w}^{1},\ldots,\mathbf{w}^{k}) be the matrix of size n\times k that corresponds to h_{\mathbf{w}}^{n}, namely h_{W}=h_{\mathbf{w}}^{n}. Note that for every i\in[k] we have \left(\mathbf{w}^{i}_{\frac{n(i-1)}{k}+1},\ldots,\mathbf{w}^{i}_{\frac{ni}{k}}\right)=\mathbf{w}, and \mathbf{w}^{i}_{j}=0 for every other j\in[n].

We now show a distribution {\cal D}_{\mathrm{mat}}^{M} such that if M\sim{\cal D}_{\mathrm{mat}}^{M} and W has a structure that corresponds to {\cal H}_{\mathrm{sign-cnn}}^{n,k}, then MW\sim{\cal D}_{\mathrm{mat}}. We start with the case where {\cal D}_{\mathrm{mat}} is a distribution over matrices such that each column is drawn i.i.d. from the uniform distribution on the sphere. We say that a matrix M of size n\times n is a diagonal-blocks matrix if

M=\begin{bmatrix}B^{11}&\dots&B^{1k}\\ \vdots&\ddots&\vdots\\ B^{k1}&\dots&B^{kk}\end{bmatrix}

where each block B^{ij} is a diagonal matrix \mathrm{diag}(z_{1}^{ij},\ldots,z_{\frac{n}{k}}^{ij}). We denote \mathbf{z}^{ij}=(z_{1}^{ij},\ldots,z_{\frac{n}{k}}^{ij}), and \mathbf{z}^{j}=(\mathbf{z}^{1j},\ldots,\mathbf{z}^{kj})\in{\mathbb{R}}^{n}. Note that for every j\in[k], the vector \mathbf{z}^{j} contains all the entries on the diagonals of the blocks in the j-th column of blocks in M. Let {\cal D}_{\mathrm{mat}}^{M} be a distribution over diagonal-blocks matrices, such that the vectors \mathbf{z}^{j} are drawn i.i.d. according to the uniform distribution on {\mathbb{S}}^{n-1}. Let W be a matrix that corresponds to h_{\mathbf{w}}^{n}\in{\cal H}_{\mathrm{sign-cnn}}^{n,k}. Note that the columns of W^{\prime}=MW are i.i.d. copies from the uniform distribution on {\mathbb{S}}^{n-1}. Indeed, denote M^{\top}=(\mathbf{v}^{1},\ldots,\mathbf{v}^{n}). Then, for every row index i\in[n] we write i=(b-1)\left(\frac{n}{k}\right)+r, where b,r are integers and 1\leq r\leq\frac{n}{k}. Thus, b is the row index of the block in M that corresponds to the i-th row of M, and r is the row index within the block. Now, note that

W^{\prime}_{ij}=\langle\mathbf{v}^{i},\mathbf{w}^{j}\rangle=\langle\left(\mathbf{v}^{i}_{(j-1)\left(\frac{n}{k}\right)+1},\ldots,\mathbf{v}^{i}_{j\left(\frac{n}{k}\right)}\right),\mathbf{w}\rangle=\langle(B^{bj}_{r1},\ldots,B^{bj}_{r\left(\frac{n}{k}\right)}),\mathbf{w}\rangle=B^{bj}_{rr}\cdot\mathbf{w}_{r}=z^{bj}_{r}\cdot\mathbf{w}_{r}.

Since \mathbf{w}_{r}\in\{\pm 1\}, and since the uniform distribution on the sphere does not change when a subset of the components is multiplied by -1, the j-th column of W^{\prime} has the same distribution as \mathbf{z}^{j}, namely, the uniform distribution over {\mathbb{S}}^{n-1}. Also, the columns of W^{\prime} are independent. Thus, W^{\prime}\sim{\cal D}_{\mathrm{mat}}.
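The construction above can be mirrored in a few lines of NumPy (ours; small toy dimensions): build the diagonal-blocks matrix M from vectors \mathbf{z}^{j} that are uniform on the sphere, build W from a sign filter \mathbf{w}, and check that each column of W^{\prime}=MW is \mathbf{z}^{j} with coordinates flipped by the signs of \mathbf{w}:

```python
import numpy as np

rng = np.random.default_rng(0)
k, patch = 3, 4                     # k hidden neurons, patch size n/k
n = k * patch

w = rng.choice([-1.0, 1.0], size=patch)     # filter of the sign-CNN, w in {+-1}^{n/k}

# W in R^{n x k}: column i holds w on the rows of the i-th patch, zeros elsewhere.
W = np.zeros((n, k))
for i in range(k):
    W[i * patch:(i + 1) * patch, i] = w

# Diagonal-blocks matrix M: block B^{ij} = diag(z^{ij}), where the vector
# z^j = (z^{1j}, ..., z^{kj}) is uniform on the unit sphere S^{n-1}.
Z = rng.normal(size=(n, k))
Z /= np.linalg.norm(Z, axis=0)              # column j is z^j, uniform on the sphere
M = np.zeros((n, n))
for i in range(k):                          # block row index
    for j in range(k):                      # block column index
        M[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = \
            np.diag(Z[i * patch:(i + 1) * patch, j])

# Column j of W' = MW equals z^j with each coordinate multiplied by the
# corresponding +-1 entry of w, so it is again uniform on the sphere.
W_prime = M @ W
assert np.allclose(W_prime, Z * np.tile(w, k)[:, None])
```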

The case where {\cal D}_{\mathrm{mat}} is a distribution over matrices such that the entries are drawn i.i.d. from a symmetric distribution (such as {\cal U}([-r,r]), {\cal N}(0,\sigma^{2}) or {\cal U}(\{\pm r\})) can be handled in a similar way. The result for the case where {\cal D}_{\mathrm{mat}} is such that each column is drawn i.i.d. from a multivariate normal distribution {\cal N}({\mathbf{0}},\Sigma) cannot be obtained in the same way, since {\cal N}({\mathbf{0}},\Sigma) might be sensitive to multiplying a subset of the components by -1. However, recall that a vector in {\mathbb{R}}^{n} whose components are i.i.d. copies from {\cal N}(0,1) has a multivariate normal distribution {\cal N}({\mathbf{0}},I_{n}). Now, since every multivariate normal distribution {\cal N}({\mathbf{0}},\Sigma) can be obtained from {\cal N}({\mathbf{0}},I_{n}) by a linear transformation, we are able to show hardness also for the case where {\cal D}_{\mathrm{mat}} is such that each column is drawn i.i.d. from {\cal N}({\mathbf{0}},\Sigma).

Recall that in our reduction we translate S=\{(\mathbf{x}_{1},h_{W}(\mathbf{x}_{1})),\ldots,(\mathbf{x}_{m},h_{W}(\mathbf{x}_{m}))\} to S^{\prime}=\{(\mathbf{x}^{\prime}_{1},h_{W}(\mathbf{x}_{1})),\ldots,(\mathbf{x}^{\prime}_{m},h_{W}(\mathbf{x}_{m}))\} where for every i\in[m] we have \mathbf{x}^{\prime}_{i}=(M^{\top})^{-1}\mathbf{x}_{i}. Therefore, we need to show that our choice of M is such that it is invertible with high probability. Also, since we want to show hardness already if the input distribution {\cal D} is supported on a bounded domain, we need to bound the norm of \mathbf{x}^{\prime}_{i} with high probability over the choice of M. This task requires a careful analysis of the spectral norm of (M^{\top})^{-1}, namely, of \left(s_{\min}(M)\right)^{-1}.

The proofs of the results regarding random CNNs follow the same ideas. The main difference is that in this case, instead of multiplying \mathbf{x}_{i} by (M^{\top})^{-1}, we multiply each patch of \mathbf{x}_{i} by an appropriate matrix.

Proof structure

We start with a few definitions. We say that a sample S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{m}\in({\mathbb{R}}^{n}\times\{0,1\})^{m} is scattered if y_{1},\ldots,y_{m} are independent fair coins (in particular, they are independent of \mathbf{x}_{1},\ldots,\mathbf{x}_{m}). We say that S is contained in A\subseteq{\mathbb{R}}^{n} if \mathbf{x}_{i}\in A for every i\in[m].

Let {\cal A} be an algorithm whose input is a sample S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{m(n)}\in({\mathbb{R}}^{n}\times\{0,1\})^{m(n)} and whose output is either “scattered” or “{\cal H}-realizable”, where {\cal H} is an hypothesis class. We say that {\cal A} distinguishes size-m {\cal H}-realizable samples from scattered samples if the following holds.

  • If the sample S is scattered, then

    \Pr\left({\cal A}(S)=\text{``scattered"}\right)\geq\frac{3}{4}-o_{n}(1),

    where the probability is over the choice of S and the randomness of {\cal A}.

  • If the sample S satisfies h(\mathbf{x}_{i})=y_{i} for every i\in[m] for some h\in{\cal H}, then

    \Pr\left({\cal A}(S)=\text{``${\cal H}$-realizable"}\right)\geq\frac{3}{4}-o_{n}(1),

    where the probability is over the randomness of {\cal A}.

We denote by \mathrm{SCAT}_{m(n)}^{A}({\cal H}) the problem of distinguishing size-m(n) {\cal H}-realizable samples that are contained in A\subseteq{\mathbb{R}}^{n} from scattered samples.

Now, let {\cal A}^{\prime} be an algorithm whose input is a sample S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{m(n)}\in({\mathbb{R}}^{n}\times\{0,1\})^{m(n)} and whose output is either “scattered” or “{\cal D}_{\mathrm{mat}}-realizable”. We say that {\cal A}^{\prime} distinguishes size-m {\cal D}_{\mathrm{mat}}-realizable samples from scattered samples if the following holds.

  • If the sample S is scattered, then

    \Pr\left({\cal A}^{\prime}(S)=\text{``scattered"}\right)\geq\frac{3}{4}-o_{n}(1),

    where the probability is over the choice of S and the randomness of {\cal A}^{\prime}.

  • If the sample S satisfies h_{W}(\mathbf{x}_{i})=y_{i} for every i\in[m], where W is a random matrix drawn from {\cal D}_{\mathrm{mat}}, then

    \Pr\left({\cal A}^{\prime}(S)=\text{``${\cal D}_{\mathrm{mat}}$-realizable"}\right)\geq\frac{3}{4}-o_{n}(1),

    where the probability is over the choice of W and the randomness of {\cal A}^{\prime}.

We denote by \mathrm{SCAT}_{m(n)}^{A}({\cal D}_{\mathrm{mat}}) the problem of distinguishing size-m(n) {\cal D}_{\mathrm{mat}}-realizable samples that are contained in A\subseteq{\mathbb{R}}^{n} from scattered samples. In the case of random CNNs, we denote by \mathrm{SCAT}_{m(n)}^{A}({\cal D}_{\mathrm{vec}},n) the problem of distinguishing size-m(n) scattered samples that are contained in A from samples that are realizable by a random CNN h_{\mathbf{w}}^{n} where \mathbf{w}\sim{\cal D}_{\mathrm{vec}}.

Recall that {\cal H}_{\mathrm{sign-cnn}}^{n,m}=\{h_{\mathbf{w}}^{n}:\mathbf{w}\in\{\pm 1\}^{\frac{n}{m}}\}. As we described, hardness of learning {\cal D}_{\mathrm{mat}}-random neural networks where the distribution {\cal D} is supported on a bounded domain can be shown by first showing hardness of learning {\cal H}_{\mathrm{sign-cnn}}^{n,m} with some m={\cal O}(\log^{2}(n)), where the distribution {\cal D} is supported on some A^{\prime}\subseteq{\mathbb{R}}^{n}, and then reducing this problem to learning {\cal D}_{\mathrm{mat}}-random networks where the distribution {\cal D} is supported on some A\subseteq{\mathbb{R}}^{n}. We can show RSAT-hardness of learning {\cal H}_{\mathrm{sign-cnn}}^{n,m} by using the methodology of Daniely and Shalev-Shwartz (2016) as follows: First, show that if there is an efficient algorithm that learns {\cal H}_{\mathrm{sign-cnn}}^{n,m} where the distribution {\cal D} is supported on A^{\prime}\subseteq{\mathbb{R}}^{n}, then there is a fixed d and an efficient algorithm that solves \mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,m}); then show that for every fixed d, the problem \mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,m}) is RSAT-hard.

Our proof follows a slightly different path than the one described above. First, we show that if there is an efficient algorithm that learns {\cal D}_{\mathrm{mat}}-random neural networks where the distribution {\cal D} is supported on A\subseteq{\mathbb{R}}^{n}, then there is a fixed d and an efficient algorithm that solves \mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}). Then, we show that for every fixed d, the problem \mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,m}), with some A^{\prime}\subseteq{\mathbb{R}}^{n}, is RSAT-hard. Finally, for the required matrix distributions {\cal D}_{\mathrm{mat}} and sets A, we show a reduction from \mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,m}) to \mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}). The main difference between this proof structure and the one described in the previous paragraph is that here, for every distribution {\cal D}_{\mathrm{mat}}, we need to show a reduction from \mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,m}) to \mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}) (which are decision problems), whereas in the previous proof structure, for every {\cal D}_{\mathrm{mat}} we would need to show a reduction between learning {\cal H}_{\mathrm{sign-cnn}}^{n,m} and learning {\cal D}_{\mathrm{mat}}-random networks. We chose this proof structure since here the proof for each {\cal D}_{\mathrm{mat}} is a reduction between decision problems, and thus we found the proofs to be simpler and cleaner this way. Other than this technical difference, both proof structures are essentially similar and follow the same ideas.

The case of random CNNs is similar, except that here we show, for each distribution {\cal D}_{\mathrm{vec}} over vectors, a reduction from \mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,m}) to \mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{vec}},n).

5 Proofs

5.1 Learning {\cal D}_{\mathrm{mat}}-random networks is harder than \mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}})

Theorem 5.1.

Let {\cal D}_{\mathrm{mat}} be a distribution over matrices. Assume that there is an algorithm that learns {\cal D}_{\mathrm{mat}}-random neural networks, where the distribution {\cal D} is supported on A\subseteq{\mathbb{R}}^{n}. Then, there is a fixed d and an efficient algorithm that solves \mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}).

Proof.

Let {\cal L} be an efficient learning algorithm that learns {\cal D}_{\mathrm{mat}}-random neural networks where the distribution {\cal D} is supported on A. Let m(n) be such that {\cal L} uses a sample of size at most m(n). Let p(n)=9m(n)+n. Let S=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{p(n)}\in({\mathbb{R}}^{n}\times\{0,1\})^{p(n)} be a sample that is contained in A. We will show an efficient algorithm {\cal A} that distinguishes whether S is scattered or {\cal D}_{\mathrm{mat}}-realizable. This implies that the theorem holds for d such that n^{d}\geq p(n).

Given S, the algorithm {\cal A} learns a function h:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}} by running {\cal L} with an examples oracle that generates examples by choosing a random (uniformly distributed) example (\mathbf{x}_{i},y_{i})\in S. We denote \ell_{S}(h)=\frac{1}{p(n)}\sum_{i\in[p(n)]}\left(h(\mathbf{x}_{i})-y_{i}\right)^{2}. Now, if \ell_{S}(h)\leq\frac{1}{10}, then {\cal A} returns that S is {\cal D}_{\mathrm{mat}}-realizable, and otherwise it returns that it is scattered (a sketch of this procedure appears after the proof). Clearly, the algorithm {\cal A} runs in polynomial time. We now show that if S is {\cal D}_{\mathrm{mat}}-realizable then {\cal A} recognizes it with probability at least \frac{3}{4}, and that if S is scattered then it also recognizes it with probability at least \frac{3}{4}.

Assume first that S is {\cal D}_{\mathrm{mat}}-realizable. Let {\cal D}_{S} be the uniform distribution over the vectors \mathbf{x}_{i}\in{\mathbb{R}}^{n} from S. In this case, since {\cal D}_{S} is supported on A, we are guaranteed that with probability at least \frac{3}{4} over the choice of W and the internal randomness of {\cal L}, we have \ell_{S}(h)=\operatorname*{\mathbb{E}}_{\mathbf{x}\sim{\cal D}_{S}}\left[\left(h(\mathbf{x})-h_{W}(\mathbf{x})\right)^{2}\right]\leq\frac{1}{10}. Therefore, the algorithm returns “{\cal D}_{\mathrm{mat}}-realizable”.

Now, assume that S is scattered. Let h:{\mathbb{R}}^{n}\rightarrow{\mathbb{R}} be the function returned by {\cal L}. Let h^{\prime}:{\mathbb{R}}^{n}\rightarrow\{0,1\} be the following function: for every \mathbf{x}\in{\mathbb{R}}^{n}, if h(\mathbf{x})\geq\frac{1}{2} then h^{\prime}(\mathbf{x})=1, and otherwise h^{\prime}(\mathbf{x})=0. Note that for every (\mathbf{x}_{i},y_{i})\in S, if h^{\prime}(\mathbf{x}_{i})\neq y_{i} then \left(h(\mathbf{x}_{i})-y_{i}\right)^{2}\geq\frac{1}{4}. Therefore, \ell_{S}(h)\geq\frac{1}{4}\ell_{S}(h^{\prime}). Let C\subseteq[p(n)] be the set of indices of S that were not observed by {\cal L}. Note that given C, the events \{h^{\prime}(\mathbf{x}_{i})=y_{i}\}_{i\in C} are independent of one another, and each has probability \frac{1}{2}. By the Hoeffding bound, we have that h^{\prime}(\mathbf{x}_{i})\neq y_{i} for at most a \frac{1}{2}-\sqrt{\frac{\ln(n)}{n}} fraction of the indices in C with probability at most

\exp\left(-\frac{2|C|\ln(n)}{n}\right)\leq\exp\left(-\frac{2(8m(n)+n)\ln(n)}{n}\right)\leq\exp\left(-2\ln(n)\right)=\frac{1}{n^{2}}.

Thus, h^{\prime}(\mathbf{x}_{i})\neq y_{i} for at least a \frac{1}{2}-o_{n}(1) fraction of the indices in C with probability at least 1-o_{n}(1). Hence,

\ell_{S}(h)\geq\frac{1}{4}\ell_{S}(h^{\prime})\geq\frac{1}{4}\cdot\frac{|C|}{p(n)}\left(\frac{1}{2}-o_{n}(1)\right)\geq\frac{1}{4}\cdot\frac{8m(n)+n}{9m(n)+n}\left(\frac{1}{2}-o_{n}(1)\right)\geq\frac{1}{9}-o_{n}(1).

Therefore, for large enough n, with probability at least \frac{3}{4} we have \ell_{S}(h)>\frac{1}{10}, and thus the algorithm returns “scattered”. ∎
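A schematic Python sketch (ours) of the distinguisher {\cal A} from the proof above, with the learner {\cal L} abstracted as a callable that receives an examples oracle and returns a predictor:

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[List[float], float]

def distinguish(S: List[Example],
                learner: Callable[[Callable[[], Example]],
                                  Callable[[List[float]], float]]) -> str:
    """The distinguisher A from the proof: run the learner with an oracle that
    draws uniform examples from S, then threshold the empirical square loss at 1/10."""
    rng = random.Random(0)
    oracle = lambda: rng.choice(S)        # a uniformly random example from S
    h = learner(oracle)                   # the predictor returned by the learner
    loss = sum((h(x) - y) ** 2 for x, y in S) / len(S)
    return "D_mat-realizable" if loss <= 1 / 10 else "scattered"

# Toy usage: a scattered sample (random labels) and a trivial "learner" that
# ignores the oracle and always predicts 0.5, so the loss is 0.25 > 1/10.
coin = random.Random(1)
S_scattered = [([float(i)], float(coin.randint(0, 1))) for i in range(100)]
print(distinguish(S_scattered, lambda oracle: (lambda x: 0.5)))   # prints "scattered"
```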

5.2 \mathrm{SCAT}_{n^{d}}^{A}({\cal H}_{\mathrm{sign-cnn}}^{n,m}) is RSAT-hard

For a predicate P:\{\pm 1\}^{K}\to\{0,1\} we denote by \mathrm{CSP}(P,\neg P) the problem whose instances are collections J of constraints, each of which is either a P-constraint or a \neg P-constraint, and the goal is to maximize the number of satisfied constraints. Denote by \mathrm{CSP}^{\mathrm{rand}}_{m(n)}(P,\neg P) the problem of distinguishing satisfiable from random formulas with n variables and m(n) constraints (as in \mathrm{CSP}^{\mathrm{rand}}_{m(n)}(P), in order to succeed, an algorithm must return “satisfiable” w.p. at least \frac{3}{4}-o_{n}(1) on every satisfiable formula and “random” w.p. at least \frac{3}{4}-o_{n}(1) on random formulas). Here, in a random formula, each constraint is chosen w.p. \frac{1}{2} to be a uniform P-constraint and w.p. \frac{1}{2} a uniform \neg P-constraint.

We will consider the predicate TK,M:{0,1}KM{0,1}T_{K,M}:\{0,1\}^{KM}\to\{0,1\} defined by

TK,M(z)=(z1zK)(zK+1z2K)(z(M1)K+1zMK).T_{K,M}(z)=\left(z_{1}\vee\ldots\vee z_{K}\right)\wedge\left(z_{K+1}\vee\ldots\vee z_{2K}\right)\wedge\ldots\wedge\left(z_{(M-1)K+1}\vee\ldots\vee z_{MK}\right)~{}.
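For concreteness, TK,MT_{K,M} is simply an AND of MM ORs over consecutive blocks of KK literals; a purely illustrative Python evaluator (not used anywhere in the reduction) is:

```python
def T(z, K, M):
    """Evaluate T_{K,M}(z) for a 0/1 sequence z of length K*M: the conjunction
    over M consecutive blocks of the disjunction of each block."""
    assert len(z) == K * M
    return all(any(z[j * K:(j + 1) * K]) for j in range(M))

print(T([1, 0, 0, 1], K=2, M=2))  # True: each of the two clauses contains a 1
print(T([0, 0, 1, 1], K=2, M=2))  # False: the first clause is all zeros
```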

We will need the following lemma from Daniely and Shalev-Shwartz (2016). For an overview of its proof, see Appendix A.

Lemma 5.1.

Daniely and Shalev-Shwartz (2016) Let q(n)=ω(log(n))q(n)=\omega(\log(n)) with q(n)nlog(n)q(n)\leq\frac{n}{\log(n)}, and let dd and KK be fixed integers. The problem CSPndrand(SATK)\mathrm{CSP}^{\mathrm{rand}}_{n^{d}}(\mathrm{SAT}_{K}) can be efficiently reduced to the problem CSPnd1rand(TK,q(n),¬TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)},\neg T_{K,q(n)}).

In the following lemma, we use Lemma 5.1 in order to show RSAT-hardness of SCATndA(signcnnn,m)\mathrm{SCAT}_{n^{d}}^{A}({\cal H}_{\mathrm{sign-cnn}}^{n,m}) with some appropriate mm and AA.

Lemma 5.2.

Let n=(n+1)log2(n)n=(n^{\prime}+1)\log^{2}(n^{\prime}), and let dd be a fixed integer. The problem SCATndA(signcnnn,log2(n))\mathrm{SCAT}_{n^{d}}^{A}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}), where AA is the ball of radius log2(n)\log^{2}(n^{\prime}) in n{\mathbb{R}}^{n}, is RSAT-hard.

Proof.

By Assumption 2.1, there is KK such that CSP(n)d+2rand(SATK)\mathrm{CSP}^{\mathrm{rand}}_{(n^{\prime})^{d+2}}(\mathrm{SAT}_{K}) is hard, where the KK-SAT formula is over nn^{\prime} variables. Then, by Lemma 5.1, the problem CSP(n)d+1rand(TK,log2(n),¬TK,log2(n))\mathrm{CSP}^{\mathrm{rand}}_{(n^{\prime})^{d+1}}(T_{K,\log^{2}(n^{\prime})},\neg T_{K,\log^{2}(n^{\prime})}) is also hard. We will reduce CSP(n)d+1rand(TK,log2(n),¬TK,log2(n))\mathrm{CSP}^{\mathrm{rand}}_{(n^{\prime})^{d+1}}(T_{K,\log^{2}(n^{\prime})},\neg T_{K,\log^{2}(n^{\prime})}) to SCAT(n)d+1A(signcnnn,log2(n))\mathrm{SCAT}_{(n^{\prime})^{d+1}}^{A}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}). Since (n)d+1>nd(n^{\prime})^{d+1}>n^{d}, it would imply that SCATndA(signcnnn,log2(n))\mathrm{SCAT}_{n^{d}}^{A}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}) is RSAT-hard.

Let J={C1,,C(n)d+1}J=\{C_{1},\ldots,C_{(n^{\prime})^{d+1}}\} be an input for CSP(n)d+1rand(TK,log2(n),¬TK,log2(n))\mathrm{CSP}^{\mathrm{rand}}_{(n^{\prime})^{d+1}}(T_{K,\log^{2}(n^{\prime})},\neg T_{K,\log^{2}(n^{\prime})}). Namely, each constraint CiC_{i} is either a CNF or a DNF formula. Equivalently, JJ can be written as J={(C1,y1),,(C(n)d+1,y(n)d+1)}J^{\prime}=\{(C^{\prime}_{1},y_{1}),\ldots,(C^{\prime}_{(n^{\prime})^{d+1}},y_{(n^{\prime})^{d+1}})\} where for every ii, if CiC_{i} is a DNF formula then Ci=CiC^{\prime}_{i}=C_{i} and yi=1y_{i}=1, and if CiC_{i} is a CNF formula then CiC^{\prime}_{i} is the DNF obtained by negating CiC_{i}, and yi=0y_{i}=0. Given JJ^{\prime} as above, we encode each DNF formula CiC^{\prime}_{i} (with log2(n)\log^{2}(n^{\prime}) clauses) as a vector 𝐱in\mathbf{x}_{i}\in{\mathbb{R}}^{n} such that each clause [(α1,i1),,(αK,iK)][(\alpha_{1},i_{1}),\ldots,(\alpha_{K},i_{K})] in CiC^{\prime}_{i} (a signed KK-tuple) is encoded by a vector 𝐳=(z1,,zn+1)\mathbf{z}=(z_{1},\ldots,z_{n^{\prime}+1}) as follows. First, we have zn+1=(K1)z_{n^{\prime}+1}=-(K-1). Then, for every 1jK1\leq j\leq K we have zij=αjz_{i_{j}}=\alpha_{j}, and for every variable ll that does not appear in the clause we have zl=0z_{l}=0. Thus, for every 1ln1\leq l\leq n^{\prime}, the value of zlz_{l} indicates whether the ll-th variable appears in the clause as a positive literal, a negative literal, or does not appear. The encoding 𝐱i\mathbf{x}_{i} of CiC^{\prime}_{i} is the concatenation of the encodings of its clauses.

Let S={(𝐱1,y1),,(𝐱(n)d+1,y(n)d+1)}S=\{(\mathbf{x}_{1},y_{1}),\ldots,(\mathbf{x}_{(n^{\prime})^{d+1}},y_{(n^{\prime})^{d+1}})\}. If JJ is random then SS is scattered, since each constraint CiC_{i} is with probability 12\frac{1}{2} a DNF formula, and with probability 12\frac{1}{2} a CNF formula, and this choice is independent of the choice of the literals in CiC_{i}. Assume now that JJ is satisfiable by an assignment ψ{±1}n\psi\in\{\pm 1\}^{n^{\prime}}. Let 𝐰=(ψ,1){±1}n+1\mathbf{w}=(\psi,1)\in\{\pm 1\}^{n^{\prime}+1}. Note that SS is realizable by the CNN h𝐰nh_{\mathbf{w}}^{n} with log2(n)\log^{2}(n^{\prime}) hidden neurons. Indeed, if 𝐳n+1\mathbf{z}\in{\mathbb{R}}^{n^{\prime}+1} is the encoding of a clause of CiC^{\prime}_{i}, then 𝐳,𝐰=1\langle\mathbf{z},\mathbf{w}\rangle=1 if all the KK literals in the clause are satisfied by ψ\psi, and otherwise 𝐳,𝐰1\langle\mathbf{z},\mathbf{w}\rangle\leq-1. Therefore, h𝐰n(𝐱i)=yih_{\mathbf{w}}^{n}(\mathbf{x}_{i})=y_{i}.
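The following sketch (toy parameters, hypothetical helper names, 0-based variable indices) spells out the clause encoding and checks the inner-product property used above: with 𝐰=(ψ,1), a clause whose KK literals are all satisfied by ψ contributes K-(K-1)=1, while a clause with at least one violated literal contributes at most (K-2)-(K-1)=-1.

```python
import numpy as np

def encode_clause(clause, n_prime, K):
    """Encode a clause given as K pairs (alpha_j, i_j), with alpha_j in {+1,-1}
    and i_j a 0-based variable index, as a vector z in R^{n'+1} (sketch of the
    construction in the proof; the last coordinate is the bias -(K-1))."""
    z = np.zeros(n_prime + 1)
    z[n_prime] = -(K - 1)
    for alpha, i in clause:
        z[i] = alpha
    return z

n_prime, K = 5, 3
psi = np.array([1, -1, 1, 1, -1])      # an assignment in {+-1}^{n'}
w = np.append(psi, 1)                  # w = (psi, 1)

sat_clause = [(1, 0), (-1, 1), (1, 2)]      # all three literals agree with psi
unsat_clause = [(-1, 0), (-1, 1), (1, 2)]   # the first literal is violated

print(encode_clause(sat_clause, n_prime, K) @ w)    # 1.0
print(encode_clause(unsat_clause, n_prime, K) @ w)  # -1.0 (<= -1 in general)
```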

Note that by our construction, for every i[(n)d+1]i\in[(n^{\prime})^{d+1}], for large enough nn^{\prime} we have

𝐱i=log2(n)(K+(K1)2)log(n)Klog2(n).\left\|\mathbf{x}_{i}\right\|=\sqrt{\log^{2}(n^{\prime})\left(K+(K-1)^{2}\right)}\leq\log(n^{\prime})\cdot K\leq\log^{2}(n^{\prime})~{}.

5.3 Hardness of learning random fully-connected neural networks

Let n=(n+1)log2(n)n=(n^{\prime}+1)\log^{2}(n^{\prime}). We say that a matrix MM of size n×nn\times n is a diagonal-blocks matrix if

M=[B11B1log2(n)Blog2(n)1Blog2(n)log2(n)]M=\begin{bmatrix}B^{11}&\dots&B^{1\log^{2}(n^{\prime})}\\ \vdots&\ddots&\vdots\\ B^{\log^{2}(n^{\prime})1}&\dots&B^{\log^{2}(n^{\prime})\log^{2}(n^{\prime})}\end{bmatrix}

where each block BijB^{ij} is a diagonal matrix diag(z1ij,,zn+1ij)\mathrm{diag}(z_{1}^{ij},\ldots,z_{n^{\prime}+1}^{ij}). For every 1in+11\leq i\leq n^{\prime}+1 let Si={i+j(n+1):0jlog2(n)1}S_{i}=\{i+j(n^{\prime}+1):0\leq j\leq\log^{2}(n^{\prime})-1\}. Let MSiM_{S_{i}} be the submatrix of MM obtained by selecting the rows and columns in SiS_{i}. Thus, MSiM_{S_{i}} is a matrix of size log2(n)×log2(n)\log^{2}(n^{\prime})\times\log^{2}(n^{\prime}). For 𝐱n\mathbf{x}\in{\mathbb{R}}^{n} let 𝐱Silog2(n)\mathbf{x}_{S_{i}}\in{\mathbb{R}}^{\log^{2}(n^{\prime})} be the restriction of 𝐱\mathbf{x} to the coordinates SiS_{i}.

Lemma 5.3.

Let MM be a diagonal-blocks matrix. Then,

smin(M)min1in+1smin(MSi).s_{\min}(M)\geq\min_{1\leq i\leq n^{\prime}+1}s_{\min}(M_{S_{i}})~{}.
Proof.

For every 𝐱n\mathbf{x}\in{\mathbb{R}}^{n} with 𝐱=1\left\|\mathbf{x}\right\|=1 we have

M𝐱2\displaystyle\left\|M\mathbf{x}\right\|^{2} =\displaystyle= 1in+1MSi𝐱Si21in+1(smin(MSi)𝐱Si)2\displaystyle\sum_{1\leq i\leq n^{\prime}+1}\left\|M_{S_{i}}\mathbf{x}_{S_{i}}\right\|^{2}\geq\sum_{1\leq i\leq n^{\prime}+1}\left(s_{\min}(M_{S_{i}})\left\|\mathbf{x}_{S_{i}}\right\|\right)^{2}
\displaystyle\geq min1in+1(smin(MSi))21in+1𝐱Si2=(min1in+1(smin(MSi))2)𝐱2\displaystyle\min_{1\leq i\leq n^{\prime}+1}\left(s_{\min}(M_{S_{i}})\right)^{2}\sum_{1\leq i\leq n^{\prime}+1}\left\|\mathbf{x}_{S_{i}}\right\|^{2}=\left(\min_{1\leq i\leq n^{\prime}+1}\left(s_{\min}(M_{S_{i}})\right)^{2}\right)\left\|\mathbf{x}\right\|^{2}
=\displaystyle= min1in+1(smin(MSi))2.\displaystyle\min_{1\leq i\leq n^{\prime}+1}\left(s_{\min}(M_{S_{i}})\right)^{2}~{}.

Hence, smin(M)min1in+1smin(MSi)s_{\min}(M)\geq\min_{1\leq i\leq n^{\prime}+1}s_{\min}(M_{S_{i}}). ∎
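A small numerical sanity check of Lemma 5.3 (a sketch only, with arbitrary toy sizes and Gaussian diagonal entries chosen purely for illustration): build a diagonal-blocks matrix from its blocks BijB^{ij}, extract the submatrices MSiM_{S_{i}}, and compare smallest singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
b = 4    # number of block rows/columns (log^2(n') in the paper)
k = 6    # block size (n'+1 in the paper)
n = b * k

# Diagonal entries z^{ij} of each block B^{ij} = diag(z^{ij}).
z = rng.standard_normal((b, b, k))
M = np.zeros((n, n))
for i in range(b):
    for j in range(b):
        M[i * k:(i + 1) * k, j * k:(j + 1) * k] = np.diag(z[i, j])

def smin(A):
    return np.linalg.svd(A, compute_uv=False)[-1]

# S_i picks the i-th coordinate of every block, so M_{S_i} is a b x b matrix.
block_minima = [smin(M[np.ix_(np.arange(i, n, k), np.arange(i, n, k))])
                for i in range(k)]

print(smin(M) >= min(block_minima) - 1e-10)   # True, as Lemma 5.3 predicts
```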

5.3.1 Proof of Theorem 3.1

Let MM be a diagonal-blocks matrix, where each block BijB^{ij} is a diagonal matrix diag(z1ij,,zn+1ij)\mathrm{diag}(z_{1}^{ij},\ldots,z_{n^{\prime}+1}^{ij}). Assume that for all i,j,li,j,l the entries zlijz_{l}^{ij} are i.i.d. copies of a random variable zz that has a symmetric distribution 𝒟z{\cal D}_{z} with variance σ2\sigma^{2}. Also, assume that the random variable z=zσz^{\prime}=\frac{z}{\sigma} is bb-subgaussian for some fixed bb.

Lemma 5.4.
Pr(smin(M)σnlog2(n))=on(1).Pr\left(s_{\min}(M)\leq\frac{\sigma}{n^{\prime}\log^{2}(n^{\prime})}\right)=o_{n}(1)~{}.
Proof.

Let M=1σMM^{\prime}=\frac{1}{\sigma}M. By Lemma 5.3, we have

smin(M)min1in+1smin(MSi).s_{\min}(M^{\prime})\geq\min_{1\leq i\leq n^{\prime}+1}s_{\min}(M^{\prime}_{S_{i}})~{}. (1)

Note that for every ii, all entries of the matrix MSiM^{\prime}_{S_{i}} are i.i.d. copies of zz^{\prime}.

Now, we need the following theorem:

Theorem 5.2.

Rudelson and Vershynin (2008) Let ξ\xi be a real random variable with expectation 0 and variance 11, and assume that ξ\xi is bb-subgaussian for some b>0b>0. Let AA be an n×nn\times n matrix whose entries are i.i.d. copies of ξ\xi. Then, for every t0t\geq 0 we have

Pr(smin(A)tn)Ct+cnPr\left(s_{\min}(A)\leq\frac{t}{\sqrt{n}}\right)\leq Ct+c^{n}

where C>0C>0 and c(0,1)c\in(0,1) depend only on bb.

By Theorem 5.2, since each matrix MSiM^{\prime}_{S_{i}} is of size log2(n)×log2(n)\log^{2}(n^{\prime})\times\log^{2}(n^{\prime}), we have for every i[n+1]i\in[n^{\prime}+1] that

Pr(smin(MSi)tlog(n))Ct+clog2(n).Pr\left(s_{\min}(M^{\prime}_{S_{i}})\leq\frac{t}{\log(n^{\prime})}\right)\leq Ct+c^{\log^{2}(n^{\prime})}~{}.

By choosing t=1nlog(n)t=\frac{1}{n^{\prime}\log(n^{\prime})} we have

Pr(smin(MSi)1nlog2(n))Cnlog(n)+clog2(n).Pr\left(s_{\min}(M^{\prime}_{S_{i}})\leq\frac{1}{n^{\prime}\log^{2}(n^{\prime})}\right)\leq\frac{C}{n^{\prime}\log(n^{\prime})}+c^{\log^{2}(n^{\prime})}~{}.

Then, by the union bound we have

Pr(min1in+1(smin(MSi))1nlog2(n))C(n+1)nlog(n)+clog2(n)(n+1)=on(1).Pr\left(\min_{1\leq i\leq n^{\prime}+1}\left(s_{\min}(M^{\prime}_{S_{i}})\right)\leq\frac{1}{n^{\prime}\log^{2}(n^{\prime})}\right)\leq\frac{C(n^{\prime}+1)}{n^{\prime}\log(n^{\prime})}+c^{\log^{2}(n^{\prime})}(n^{\prime}+1)=o_{n}(1)~{}.

Combining this with smin(M)=σsmin(M)s_{\min}(M)=\sigma\cdot s_{\min}(M^{\prime}) and with Eq. 1, we have

Pr(smin(M)σnlog2(n))\displaystyle Pr\left(s_{\min}(M)\leq\frac{\sigma}{n^{\prime}\log^{2}(n^{\prime})}\right) =\displaystyle= Pr(smin(M)1nlog2(n))\displaystyle Pr\left(s_{\min}(M^{\prime})\leq\frac{1}{n^{\prime}\log^{2}(n^{\prime})}\right)
\displaystyle\leq Pr(min1in+1(smin(MSi))1nlog2(n))=on(1).\displaystyle Pr\left(\min_{1\leq i\leq n^{\prime}+1}\left(s_{\min}(M^{\prime}_{S_{i}})\right)\leq\frac{1}{n^{\prime}\log^{2}(n^{\prime})}\right)=o_{n}(1)~{}.

Lemma 5.5.

Let 𝒟mat{\cal D}_{\mathrm{mat}} be a distribution over n×log2(n){\mathbb{R}}^{n\times\log^{2}(n^{\prime})} such that each entry is drawn i.i.d. from 𝒟z{\cal D}_{z}. Note that a 𝒟mat{\cal D}_{\mathrm{mat}}-random network hWh_{W} has log2(n)=𝒪(log2(n))\log^{2}(n^{\prime})={\cal O}(\log^{2}(n)) hidden neurons. Let dd be a fixed integer. Then, SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}) is RSAT-hard, where AA is the ball of radius nlog2(n)σ\frac{n\log^{2}(n)}{\sigma} in n{\mathbb{R}}^{n}.

Proof.

By Lemma 5.2, the problem SCATndA(signcnnn,log2(n))\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}) where AA^{\prime} is the ball of radius log2(n)\log^{2}(n^{\prime}) in n{\mathbb{R}}^{n}, is RSAT-hard. We will reduce this problem to SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}). Given a sample S={(𝐱i,yi)}i=1nd(n×{0,1})ndS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{d}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{d}} with 𝐱ilog2(n)\left\|\mathbf{x}_{i}\right\|\leq\log^{2}(n^{\prime}) for every i[nd]i\in[n^{d}], we will, with probability 1on(1)1-o_{n}(1), construct a sample SS^{\prime} that is contained in AA, such that if SS is scattered then SS^{\prime} is scattered, and if SS is signcnnn,log2(n){\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}-realizable then SS^{\prime} is 𝒟mat{\cal D}_{\mathrm{mat}}-realizable. Note that our reduction is allowed to fail with probability on(1)o_{n}(1). Indeed, distinguishing scattered from realizable requires success with probability 34on(1)\frac{3}{4}-o_{n}(1) and therefore reductions between such problems are not sensitive to a failure with probability on(1)o_{n}(1).

Assuming that MM is invertible (note that by Lemma 5.4 it holds with probability 1on(1)1-o_{n}(1)), let S={(𝐱1,y1),,(𝐱nd,ynd)}S^{\prime}=\{(\mathbf{x}^{\prime}_{1},y_{1}),\ldots,(\mathbf{x}^{\prime}_{n^{d}},y_{n^{d}})\} where for every i[nd]i\in[n^{d}] we have 𝐱i=(M)1𝐱i\mathbf{x}^{\prime}_{i}=(M^{\top})^{-1}\mathbf{x}_{i}. Note that if SS is scattered then SS^{\prime} is also scattered.

Assume that SS is realizable by the CNN h𝐰nh_{\mathbf{w}}^{n} with 𝐰{±1}n+1\mathbf{w}\in\{\pm 1\}^{n^{\prime}+1}. Let WW be the matrix of size n×log2(n)n\times\log^{2}(n^{\prime}) such that hW=h𝐰nh_{W}=h_{\mathbf{w}}^{n}. Thus, W=(𝐰1,,𝐰log2(n))W=(\mathbf{w}^{1},\ldots,\mathbf{w}^{\log^{2}(n^{\prime})}) where for every i[log2(n)]i\in[\log^{2}(n^{\prime})] we have (𝐰(i1)(n+1)+1i,,𝐰i(n+1)i)=𝐰(\mathbf{w}^{i}_{(i-1)(n^{\prime}+1)+1},\ldots,\mathbf{w}^{i}_{i(n^{\prime}+1)})=\mathbf{w}, and 𝐰ji=0\mathbf{w}^{i}_{j}=0 for every other j[n]j\in[n]. Let W=MWW^{\prime}=MW. Note that SS^{\prime} is realizable by hWh_{W^{\prime}}. Indeed, for every i[nd]i\in[n^{d}] we have yi=h𝐰n(𝐱i)=hW(𝐱i)y_{i}=h_{\mathbf{w}}^{n}(\mathbf{x}_{i})=h_{W}(\mathbf{x}_{i}), and W𝐱i=WM(M)1𝐱i=(W)𝐱iW^{\top}\mathbf{x}_{i}=W^{\top}M^{\top}(M^{\top})^{-1}\mathbf{x}_{i}=(W^{\prime})^{\top}\mathbf{x}^{\prime}_{i}. Also, note that the entries of WW^{\prime} are i.i.d. copies of zz. Indeed, denote M=(𝐯1,,𝐯n)M^{\top}=(\mathbf{v}^{1},\ldots,\mathbf{v}^{n}). Then, for every row index i[n]i\in[n] we write i=(b1)(n+1)+ri=(b-1)(n^{\prime}+1)+r, where b,rb,r are integers and 1rn+11\leq r\leq n^{\prime}+1. Thus, bb is the row index of the block in MM that corresponds to the ii-th row of MM, and rr is the row index within the block. Now, note that

Wij\displaystyle W^{\prime}_{ij} =\displaystyle= 𝐯i,𝐰j=(𝐯(j1)(n+1)+1i,,𝐯j(n+1)i),𝐰=(Br1bj,,Br(n+1)bj),𝐰\displaystyle\langle\mathbf{v}^{i},\mathbf{w}^{j}\rangle=\langle\left(\mathbf{v}^{i}_{(j-1)(n^{\prime}+1)+1},\ldots,\mathbf{v}^{i}_{j(n^{\prime}+1)}\right),\mathbf{w}\rangle=\langle(B^{bj}_{r1},\ldots,B^{bj}_{r(n^{\prime}+1)}),\mathbf{w}\rangle
=\displaystyle= Brrbj𝐰r=zrbj𝐰r.\displaystyle B^{bj}_{rr}\cdot\mathbf{w}_{r}=z^{bj}_{r}\cdot\mathbf{w}_{r}~{}.

Since 𝒟z{\cal D}_{z} is symmetric and 𝐰r{±1}\mathbf{w}_{r}\in\{\pm 1\}, we have Wij𝒟zW^{\prime}_{ij}\sim{\cal D}_{z} independently from the other entries. Thus, W𝒟matW^{\prime}\sim{\cal D}_{\mathrm{mat}}. Therefore, hWh_{W^{\prime}} is a 𝒟mat{\cal D}_{\mathrm{mat}}-random network.
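A minimal sketch of this reduction step (toy sizes; 𝒟z taken to be the standard Gaussian for illustration): the examples are transformed by (M^\top)^{-1} and the planted weights by MM, and the pre-activations, hence the labels, are unchanged since W^\top\mathbf{x}_i = W^\top M^\top (M^\top)^{-1}\mathbf{x}_i = (W^{\prime})^\top\mathbf{x}^{\prime}_i.

```python
import numpy as np

rng = np.random.default_rng(1)
b, k = 4, 6                  # log^2(n') blocks of size n'+1 (illustrative values)
n = b * k

# Diagonal-blocks matrix M with i.i.d. standard Gaussian diagonal entries.
M = np.zeros((n, n))
for i in range(b):
    for j in range(b):
        M[i * k:(i + 1) * k, j * k:(j + 1) * k] = np.diag(rng.standard_normal(k))

# Sign-CNN weights: w in {+-1}^{n'+1} placed in disjoint blocks of the columns of W.
w = rng.choice([-1.0, 1.0], size=k)
W = np.zeros((n, b))
for i in range(b):
    W[i * k:(i + 1) * k, i] = w

x = rng.standard_normal(n)            # some example x_i
x_prime = np.linalg.solve(M.T, x)     # x'_i = (M^T)^{-1} x_i
W_prime = M @ W                       # W' = M W (entries i.i.d. copies of z)

print(np.allclose(W.T @ x, W_prime.T @ x_prime))   # True: labels are preserved
```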

By Lemma 5.4, we have with probability 1on(1)1-o_{n}(1) that for every i[nd]i\in[n^{d}],

𝐱i\displaystyle\left\|\mathbf{x}^{\prime}_{i}\right\| =\displaystyle= (M)1𝐱ismax((M)1)𝐱i=1smin(M)𝐱i=1smin(M)𝐱i\displaystyle\left\|(M^{\top})^{-1}\mathbf{x}_{i}\right\|\leq s_{\max}\left((M^{\top})^{-1}\right)\left\|\mathbf{x}_{i}\right\|=\frac{1}{s_{\min}(M^{\top})}\left\|\mathbf{x}_{i}\right\|=\frac{1}{s_{\min}(M)}\left\|\mathbf{x}_{i}\right\|
\displaystyle\leq nlog2(n)σlog2(n)nlog2(n)σ.\displaystyle\frac{n^{\prime}\log^{2}(n^{\prime})}{\sigma}\log^{2}(n^{\prime})\leq\frac{n\log^{2}(n)}{\sigma}~{}.

Finally, Theorem 3.1 follows immediately from Theorem 5.1 and the following lemma.

Lemma 5.6.

Let 𝒟mat{\cal D}_{\mathrm{mat}} be a distribution over n~×m{\mathbb{R}}^{{\tilde{n}}\times m} with m=𝒪(log2(n~))m={\cal O}(\log^{2}({\tilde{n}})), such that each entry is drawn i.i.d. from 𝒟z{\cal D}_{z}. Let dd be a fixed integer, and let ϵ>0\epsilon>0 be a small constant. Then, SCATn~dA(𝒟mat)\mathrm{SCAT}_{{\tilde{n}}^{d}}^{A}({\cal D}_{\mathrm{mat}}) is RSAT-hard, where AA is the ball of radius n~ϵσ\frac{{\tilde{n}}^{\epsilon}}{\sigma} in n~{\mathbb{R}}^{{\tilde{n}}}.

Proof.

For integers k,lk,l we denote by 𝒟matk,l{\cal D}_{\mathrm{mat}}^{k,l} the distribution over k×l{\mathbb{R}}^{k\times l} such that each entry is drawn i.i.d. from 𝒟z{\cal D}_{z}. Let c=2ϵc=\frac{2}{\epsilon}, and let n~=nc{\tilde{n}}=n^{c}. By Lemma 5.5, the problem SCATncdA(𝒟matn,m)\mathrm{SCAT}_{n^{cd}}^{A^{\prime}}({\cal D}_{\mathrm{mat}}^{n,m}) is RSAT-hard, where m=𝒪(log2(n))m={\cal O}(\log^{2}(n)), and AA^{\prime} is the ball of radius nlog2(n)σ\frac{n\log^{2}(n)}{\sigma} in n{\mathbb{R}}^{n}. We reduce this problem to SCATn~dA(𝒟matn~,m)\mathrm{SCAT}_{{\tilde{n}}^{d}}^{A}({\cal D}_{\mathrm{mat}}^{{\tilde{n}},m}), where AA is the ball of radius n~ϵσ\frac{{\tilde{n}}^{\epsilon}}{\sigma} in n~{\mathbb{R}}^{{\tilde{n}}}. Note that m=𝒪(log2(n))=𝒪(log2(n~))m={\cal O}(\log^{2}(n))={\cal O}(\log^{2}({\tilde{n}})).

Let S={(𝐱i,yi)}i=1ncd(n×{0,1})ncdS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{cd}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{cd}} with 𝐱inlog2(n)σ\left\|\mathbf{x}_{i}\right\|\leq\frac{n\log^{2}(n)}{\sigma}. For every i[ncd]i\in[n^{cd}], let 𝐱in~\mathbf{x}^{\prime}_{i}\in{\mathbb{R}}^{\tilde{n}} be the vector obtained from 𝐱i\mathbf{x}_{i} by padding it with zeros. Thus, 𝐱i=(𝐱i,0,,0)\mathbf{x}^{\prime}_{i}=(\mathbf{x}_{i},0,\ldots,0). Note that ncd=n~dn^{cd}={\tilde{n}}^{d}. Let S={(𝐱i,yi)}i=1n~dS^{\prime}=\{(\mathbf{x}^{\prime}_{i},y_{i})\}_{i=1}^{{\tilde{n}}^{d}}. If SS is scattered then SS^{\prime} is also scattered. Note that if SS is realizable by hWh_{W} then SS^{\prime} is realizable by hWh_{W^{\prime}} where WW^{\prime} is obtained from WW by appending n~n{\tilde{n}}-n arbitrary rows. Assume that SS is 𝒟matn,m{\cal D}_{\mathrm{mat}}^{n,m}-realizable, that is, W𝒟matn,mW\sim{\cal D}_{\mathrm{mat}}^{n,m}. Then, SS^{\prime} is realizable by hWh_{W^{\prime}} where WW^{\prime} is obtained from WW by appending rows whose entries are drawn i.i.d. from 𝒟z{\cal D}_{z}, and therefore, SS^{\prime} is 𝒟matn~,m{\cal D}_{\mathrm{mat}}^{{\tilde{n}},m}-realizable. Finally, for every i[n~d]i\in[{\tilde{n}}^{d}] we have

𝐱i=𝐱inlog2(n)σ=n~1clog2(n~1c)σn~2cσ=n~ϵσ.\left\|\mathbf{x}^{\prime}_{i}\right\|=\left\|\mathbf{x}_{i}\right\|\leq\frac{n\log^{2}(n)}{\sigma}=\frac{{\tilde{n}}^{\frac{1}{c}}\log^{2}({\tilde{n}}^{\frac{1}{c}})}{\sigma}\leq\frac{{\tilde{n}}^{\frac{2}{c}}}{\sigma}=\frac{{\tilde{n}}^{\epsilon}}{\sigma}~{}.

5.3.2 Proof of Theorem 3.2

Let 𝒟mat{\cal D}_{\mathrm{mat}} be a distribution over n×m{\mathbb{R}}^{n\times m} with m=log2(n)m=\log^{2}(n), such that each entry is drawn i.i.d. from 𝒩(0,1){\cal N}(0,1). Let dd be a fixed integer. By Lemma 5.6, we have that SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}) is RSAT-hard, where AA is the ball of radius nϵn^{\epsilon} in n{\mathbb{R}}^{n}. Let (𝒩(0,1))n({\cal N}(0,1))^{n} be the distribution over n{\mathbb{R}}^{n} where each component is drawn i.i.d. from 𝒩(0,1){\cal N}(0,1). Recall that (𝒩(0,1))n=𝒩(𝟎,In)({\cal N}(0,1))^{n}={\cal N}({\mathbf{0}},I_{n}) (Tong (2012)). Therefore, in the distribution 𝒟mat{\cal D}_{\mathrm{mat}}, the columns are drawn i.i.d. from 𝒩(𝟎,In){\cal N}({\mathbf{0}},I_{n}). Let 𝒟mat{\cal D}_{\mathrm{mat}}^{\prime} be a distribution over n×m{\mathbb{R}}^{n\times m}, such that each column is drawn i.i.d. from 𝒩(𝟎,Σ){\cal N}({\mathbf{0}},\Sigma). By Theorem 5.1, we need to show that SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal D}_{\mathrm{mat}}^{\prime}) is RSAT-hard, where AA^{\prime} is the ball of radius nϵλmin\frac{n^{\epsilon}}{\sqrt{\lambda_{\min}}} in n{\mathbb{R}}^{n}. We show a reduction from SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}) to SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal D}_{\mathrm{mat}}^{\prime}).

Let S={(𝐱i,yi)}i=1nd(n×{0,1})ndS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{d}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{d}} be a sample. Let Σ=UΛU\Sigma=U\Lambda U^{\top} be the spectral decomposition of Σ\Sigma, and let M=UΛ12M=U\Lambda^{\frac{1}{2}}. Recall that if 𝐰𝒩(𝟎,In)\mathbf{w}\sim{\cal N}({\mathbf{0}},I_{n}) then M𝐰𝒩(𝟎,Σ)M\mathbf{w}\sim{\cal N}({\mathbf{0}},\Sigma) (Tong (2012)). For every i[nd]i\in[n^{d}], let 𝐱i=(M)1𝐱i\mathbf{x}^{\prime}_{i}=(M^{\top})^{-1}\mathbf{x}_{i}, and let S={(𝐱1,y1),,(𝐱nd,ynd)}S^{\prime}=\{(\mathbf{x}^{\prime}_{1},y_{1}),\ldots,(\mathbf{x}^{\prime}_{n^{d}},y_{n^{d}})\}. Note that if SS is scattered then SS^{\prime} is also scattered. If SS is realizable by a 𝒟mat{\cal D}_{\mathrm{mat}}-random network hWh_{W}, then let W=MWW^{\prime}=MW. Note that SS^{\prime} is realizable by hWh_{W^{\prime}}. Indeed, for every i[nd]i\in[n^{d}] we have (W)𝐱i=WM(M)1𝐱i=W𝐱i(W^{\prime})^{\top}\mathbf{x}^{\prime}_{i}=W^{\top}M^{\top}(M^{\top})^{-1}\mathbf{x}_{i}=W^{\top}\mathbf{x}_{i}. Let W=(𝐰1,,𝐰m)W=(\mathbf{w}_{1},\ldots,\mathbf{w}_{m}) and let W=(𝐰1,,𝐰m)W^{\prime}=(\mathbf{w}^{\prime}_{1},\ldots,\mathbf{w}^{\prime}_{m}). Since W=MWW^{\prime}=MW then 𝐰j=M𝐰j\mathbf{w}^{\prime}_{j}=M\mathbf{w}_{j} for every j[m]j\in[m]. Now, since W𝒟matW\sim{\cal D}_{\mathrm{mat}}, we have for every jj that 𝐰j𝒩(𝟎,In)\mathbf{w}_{j}\sim{\cal N}({\mathbf{0}},I_{n}) (i.i.d.). Therefore, 𝐰j=M𝐰j𝒩(𝟎,Σ)\mathbf{w}^{\prime}_{j}=M\mathbf{w}_{j}\sim{\cal N}({\mathbf{0}},\Sigma), and thus W𝒟matW^{\prime}\sim{\cal D}_{\mathrm{mat}}^{\prime}. Hence, SS^{\prime} is 𝒟mat{\cal D}_{\mathrm{mat}}^{\prime}-realizable.
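A sketch of this step with illustrative dimensions: take the spectral decomposition Σ=UΛU^\top, set M=UΛ^{1/2}, map each example by (M^\top)^{-1} and each weight column by MM; the pre-activations agree, and M𝐰 has law 𝒩(𝟎,Σ) whenever 𝐰∼𝒩(𝟎,I_n).

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 3                         # illustrative dimensions

B = rng.standard_normal((n, n))
Sigma = B @ B.T + np.eye(n)         # a positive-definite covariance matrix
lam, U = np.linalg.eigh(Sigma)      # spectral decomposition Sigma = U diag(lam) U^T
M = U @ np.diag(np.sqrt(lam))       # M = U Lambda^{1/2}, so M M^T = Sigma

W = rng.standard_normal((n, m))     # columns ~ N(0, I_n), i.e. W ~ D_mat
X = rng.standard_normal((5, n))     # a few examples as rows

X_prime = np.linalg.solve(M.T, X.T).T   # x'_i = (M^T)^{-1} x_i
W_prime = M @ W                         # columns of W' ~ N(0, Sigma)

print(np.allclose(X @ W, X_prime @ W_prime))   # True: pre-activations agree
```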

We now bound the norms of the vectors 𝐱i\mathbf{x}^{\prime}_{i} in SS^{\prime}. Note that for every i[nd]i\in[n^{d}] we have

𝐱i=(M)1𝐱i=UΛ12𝐱i=Λ12𝐱iλmin12𝐱iλmin12nϵ.\left\|\mathbf{x}^{\prime}_{i}\right\|=\left\|(M^{\top})^{-1}\mathbf{x}_{i}\right\|=\left\|U\Lambda^{-\frac{1}{2}}\mathbf{x}_{i}\right\|=\left\|\Lambda^{-\frac{1}{2}}\mathbf{x}_{i}\right\|\leq\lambda_{\min}^{-\frac{1}{2}}\left\|\mathbf{x}_{i}\right\|\leq\lambda_{\min}^{-\frac{1}{2}}n^{\epsilon}~{}.

5.3.3 Proof of Theorem 3.3

Let n=(n+1)log2(n)n=(n^{\prime}+1)\log^{2}(n^{\prime}), and let MM be a diagonal-blocks matrix, where each block BijB^{ij} is a diagonal matrix diag(z1ij,,zn+1ij)\mathrm{diag}(z_{1}^{ij},\ldots,z_{n^{\prime}+1}^{ij}). We denote 𝐳ij=(z1ij,,zn+1ij)\mathbf{z}^{ij}=(z_{1}^{ij},\ldots,z_{n^{\prime}+1}^{ij}), and 𝐳j=(𝐳1j,,𝐳log2(n)j)n\mathbf{z}^{j}=(\mathbf{z}^{1j},\ldots,\mathbf{z}^{\log^{2}(n^{\prime})j})\in{\mathbb{R}}^{n}. Note that for every j[log2(n)]j\in[\log^{2}(n^{\prime})], the vector 𝐳j\mathbf{z}^{j} contains all the entries on the diagonals of blocks in the jj-th column of blocks in MM. Assume that the vectors 𝐳j\mathbf{z}^{j} are drawn i.i.d. according to the uniform distribution on r𝕊n1r\cdot{\mathbb{S}}^{n-1}.

Lemma 5.7.

For some universal constant c>0c^{\prime}>0 we have

Pr(smin(M)crnnlog5(n))=on(1).Pr\left(s_{\min}(M)\leq\frac{c^{\prime}r}{n^{\prime}\sqrt{n^{\prime}}\log^{5}(n^{\prime})}\right)=o_{n}(1)~{}.
Proof.

Let M=nrMM^{\prime}=\frac{\sqrt{n}}{r}M. For every j[log2(n)]j\in[\log^{2}(n^{\prime})], let 𝐳~jn\tilde{\mathbf{z}}^{j}\in{\mathbb{R}}^{n} be the vector that contains all the entries on the diagonals of blocks in the jj-th column of blocks in MM^{\prime}. That is, 𝐳~j=nr𝐳j\tilde{\mathbf{z}}^{j}=\frac{\sqrt{n}}{r}\mathbf{z}^{j}. Note that the vectors 𝐳~j\tilde{\mathbf{z}}^{j} are i.i.d. copies from the uniform distribution on n𝕊n1\sqrt{n}\cdot{\mathbb{S}}^{n-1}. By Lemma 5.3, we have

smin(M)min1in+1smin(MSi).s_{\min}(M^{\prime})\geq\min_{1\leq i\leq n^{\prime}+1}s_{\min}(M^{\prime}_{S_{i}})~{}. (2)

Note that for every ii, all columns of the matrix MSiM^{\prime}_{S_{i}} are projections of the vectors 𝐳~j\tilde{\mathbf{z}}^{j} onto the SiS_{i} coordinates. That is, the jj-th column in MSiM^{\prime}_{S_{i}} is obtained by drawing 𝐳~j\tilde{\mathbf{z}}^{j} from the uniform distribution on n𝕊n1\sqrt{n}\cdot{\mathbb{S}}^{n-1} and projecting onto the coordinates SiS_{i}.

We say that a distribution is isotropic if it has mean zero and its covariance matrix is the identity. The covariance matrix of the uniform distribution on 𝕊n1{\mathbb{S}}^{n-1} is 1nIn\frac{1}{n}I_{n}. Therefore, the uniform distribution on n𝕊n1\sqrt{n}\cdot{\mathbb{S}}^{n-1} is isotropic. We will need the following theorem.

Theorem 5.3.

Adamczak et al. (2012) Let m1m\geq 1 and let AA be an m×mm\times m matrix with independent columns drawn from an isotropic log-concave distribution. For every ϵ(0,1)\epsilon\in(0,1) we have

Pr(smin(A)cϵm)CmϵPr\left(s_{\min}(A)\leq\frac{c\epsilon}{\sqrt{m}}\right)\leq Cm\epsilon

where c and C are positive universal constants.

We show that the distribution of the columns of MSiM^{\prime}_{S_{i}} is isotropic and log-concave. First, since the uniform distribution on n𝕊n1\sqrt{n}\cdot{\mathbb{S}}^{n-1} is isotropic, its projection onto a subset of coordinates is also isotropic, and thus the distribution of the columns of MSiM^{\prime}_{S_{i}} is isotropic. In order to show that it is log-concave, we analyze its density. Let 𝐱k\mathbf{x}\in{\mathbb{R}}^{k} be a random variable whose distribution is the projection of the uniform distribution on 𝕊n1{\mathbb{S}}^{n-1} onto kk coordinates. It is known that the probability density of 𝐱\mathbf{x} is (see Fang (2018))

f𝐱(x1,,xk)=Γ(n/2)Γ((nk)/2)πk/2(11ikxi2)nk21,f_{\mathbf{x}}(x_{1},\ldots,x_{k})=\frac{\Gamma(n/2)}{\Gamma((n-k)/2)\pi^{k/2}}\left(1-\sum_{1\leq i\leq k}x_{i}^{2}\right)^{\frac{n-k}{2}-1}~{},

where 1ikxi2<1\sum_{1\leq i\leq k}x_{i}^{2}<1. Recall that the columns of MSiM^{\prime}_{S_{i}} are projections of the uniform distribution over n𝕊n1\sqrt{n}\cdot{\mathbb{S}}^{n-1}, namely, the sphere of radius n\sqrt{n} and not the unit sphere. Thus, let 𝐱=n𝐱\mathbf{x}^{\prime}=\sqrt{n}\mathbf{x}. The probability density of 𝐱\mathbf{x}^{\prime} is

f𝐱(x1,,xk)\displaystyle f_{\mathbf{x}^{\prime}}(x^{\prime}_{1},\ldots,x^{\prime}_{k}) =\displaystyle= 1(n)kf𝐱(x1n,,xkn)\displaystyle\frac{1}{(\sqrt{n})^{k}}f_{\mathbf{x}}\left(\frac{x^{\prime}_{1}}{\sqrt{n}},\ldots,\frac{x^{\prime}_{k}}{\sqrt{n}}\right)
=\displaystyle= 1nk/2Γ(n/2)Γ((nk)/2)πk/2(11ik(xin)2)nk21,\displaystyle\frac{1}{n^{k/2}}\cdot\frac{\Gamma(n/2)}{\Gamma((n-k)/2)\pi^{k/2}}\left(1-\sum_{1\leq i\leq k}\left(\frac{x^{\prime}_{i}}{\sqrt{n}}\right)^{2}\right)^{\frac{n-k}{2}-1}~{},

where 1ik(xi)2<n\sum_{1\leq i\leq k}(x^{\prime}_{i})^{2}<n. We denote

g(n,k)=1nk/2Γ(n/2)Γ((nk)/2)πk/2.g(n,k)=\frac{1}{n^{k/2}}\cdot\frac{\Gamma(n/2)}{\Gamma((n-k)/2)\pi^{k/2}}~{}.

By replacing kk with log2(n)\log^{2}(n^{\prime}) we have

f𝐱(x1,,xlog2(n))\displaystyle f_{\mathbf{x}^{\prime}}(x^{\prime}_{1},\ldots,x^{\prime}_{\log^{2}(n^{\prime})}) =\displaystyle= g(n,log2(n))(11n1ilog2(n)(xi)2)nlog2(n)21.\displaystyle g(n,\log^{2}(n^{\prime}))\left(1-\frac{1}{n}\sum_{1\leq i\leq\log^{2}(n^{\prime})}(x^{\prime}_{i})^{2}\right)^{\frac{n-\log^{2}(n^{\prime})}{2}-1}~{}.

Hence, we have

logf𝐱(x1,,xlog2(n))\displaystyle\log f_{\mathbf{x}^{\prime}}(x^{\prime}_{1},\ldots,x^{\prime}_{\log^{2}(n^{\prime})}) =\displaystyle=
log(g(n,log2(n)))\displaystyle\log\left(g(n,\log^{2}(n^{\prime}))\right) +\displaystyle+ (nlog2(n)21)log(11n1ilog2(n)(xi)2).\displaystyle\left(\frac{n-\log^{2}(n^{\prime})}{2}-1\right)\cdot\log\left(1-\frac{1}{n}\sum_{1\leq i\leq\log^{2}(n^{\prime})}(x^{\prime}_{i})^{2}\right)~{}.

Since the first term is a constant and nlog2(n)21>0\frac{n-\log^{2}(n^{\prime})}{2}-1>0, it suffices to show that the function

log(11n1ilog2(n)(xi)2)\log\left(1-\frac{1}{n}\sum_{1\leq i\leq\log^{2}(n^{\prime})}(x^{\prime}_{i})^{2}\right) (3)

(where 1ilog2(n)(xi)2<n\sum_{1\leq i\leq\log^{2}(n^{\prime})}(x^{\prime}_{i})^{2}<n) is concave. This function can be written as h(f(x1,,xlog2(n)))h(f(x^{\prime}_{1},\ldots,x^{\prime}_{\log^{2}(n^{\prime})})), where

h(x)=log(1+x),h(x)=\log\left(1+x\right),
f(x1,,xlog2(n))=1n1ilog2(n)(xi)2.f(x^{\prime}_{1},\ldots,x^{\prime}_{\log^{2}(n^{\prime})})=-\frac{1}{n}\sum_{1\leq i\leq\log^{2}(n^{\prime})}(x^{\prime}_{i})^{2}~{}.

Recall that if hh is concave and non-decreasing, and ff is concave, then their composition is also concave. Clearly, hh and ff satisfy these conditions, and thus the function in Eq. 3 is concave. Hence f𝐱f_{\mathbf{x}^{\prime}} is log-concave.
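As a quick numerical illustration of this concavity argument (not a substitute for the proof; all parameters are arbitrary toy values), one can check the midpoint inequality for the function in Eq. 3 at random points of its domain:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 100, 9       # ambient dimension and number of projected coordinates

def g(x):
    """The function in Eq. 3: log(1 - ||x||^2 / n), defined when ||x||^2 < n."""
    return np.log(1.0 - np.dot(x, x) / n)

ok = True
for _ in range(1000):
    a = rng.uniform(-1, 1, size=k) * np.sqrt(n / k) * 0.9   # stays in the domain
    b = rng.uniform(-1, 1, size=k) * np.sqrt(n / k) * 0.9
    ok &= g((a + b) / 2) >= (g(a) + g(b)) / 2 - 1e-12       # midpoint concavity
print(ok)   # True, consistent with the concavity of Eq. 3
```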

We now apply Theorem 5.3 on MSiM^{\prime}_{S_{i}}, and obtain that for every ϵ(0,1)\epsilon\in(0,1) we have

Pr(smin(MSi)cϵlog(n))Clog2(n)ϵ.Pr\left(s_{\min}(M^{\prime}_{S_{i}})\leq\frac{c\epsilon}{\log(n^{\prime})}\right)\leq C\log^{2}(n^{\prime})\epsilon~{}.

By choosing ϵ=1nlog3(n)\epsilon=\frac{1}{n^{\prime}\log^{3}(n^{\prime})} we have

Pr(smin(MSi)cnlog4(n))Cnlog(n).Pr\left(s_{\min}(M^{\prime}_{S_{i}})\leq\frac{c}{n^{\prime}\log^{4}(n^{\prime})}\right)\leq\frac{C}{n^{\prime}\log(n^{\prime})}~{}.

Now, by the union bound

Pr(min1in+1(smin(MSi))cnlog4(n))Cnlog(n)(n+1)=on(1).Pr\left(\min_{1\leq i\leq n^{\prime}+1}(s_{\min}(M^{\prime}_{S_{i}}))\leq\frac{c}{n^{\prime}\log^{4}(n^{\prime})}\right)\leq\frac{C}{n^{\prime}\log(n^{\prime})}\cdot(n^{\prime}+1)=o_{n}(1)~{}.

Combining this with smin(M)=rnsmin(M)s_{\min}(M)=\frac{r}{\sqrt{n}}s_{\min}(M^{\prime}) and with Eq. 2, we have

Pr(smin(M)crnnlog4(n))\displaystyle Pr\left(s_{\min}(M)\leq\frac{cr}{\sqrt{n}\cdot n^{\prime}\log^{4}(n^{\prime})}\right) =\displaystyle= Pr(smin(M)cnlog4(n))\displaystyle Pr\left(s_{\min}(M^{\prime})\leq\frac{c}{n^{\prime}\log^{4}(n^{\prime})}\right)
\displaystyle\leq Pr(min1in+1(smin(MSi))cnlog4(n))=on(1).\displaystyle Pr\left(\min_{1\leq i\leq n^{\prime}+1}(s_{\min}(M^{\prime}_{S_{i}}))\leq\frac{c}{n^{\prime}\log^{4}(n^{\prime})}\right)=o_{n}(1)~{}.

Note that

crnnlog4(n)=crn+1nlog5(n)cr2nnlog5(n)=crnnlog5(n),\frac{cr}{\sqrt{n}\cdot n^{\prime}\log^{4}(n^{\prime})}=\frac{cr}{\sqrt{n^{\prime}+1}\cdot n^{\prime}\log^{5}(n^{\prime})}\geq\frac{cr}{2\sqrt{n^{\prime}}\cdot n^{\prime}\log^{5}(n^{\prime})}=\frac{c^{\prime}r}{\sqrt{n^{\prime}}\cdot n^{\prime}\log^{5}(n^{\prime})}~{},

where c=c2c^{\prime}=\frac{c}{2}. Thus,

Pr(smin(M)crnnlog5(n))Pr(smin(M)crnnlog4(n))=on(1).Pr\left(s_{\min}(M)\leq\frac{c^{\prime}r}{\sqrt{n^{\prime}}\cdot n^{\prime}\log^{5}(n^{\prime})}\right)\leq Pr\left(s_{\min}(M)\leq\frac{cr}{\sqrt{n}\cdot n^{\prime}\log^{4}(n^{\prime})}\right)=o_{n}(1)~{}.

Let 𝒟mat{\cal D}_{\mathrm{mat}} be a distribution over n×log2(n){\mathbb{R}}^{n\times\log^{2}(n^{\prime})} such that each column is drawn i.i.d. from the uniform distribution on r𝕊n1r\cdot{\mathbb{S}}^{n-1}. Note that a 𝒟mat{\cal D}_{\mathrm{mat}}-random network hWh_{W} has log2(n)=𝒪(log2(n))\log^{2}(n^{\prime})={\cal O}(\log^{2}(n)) hidden neurons. Now, Theorem 3.3 follows immediately from Theorem 5.1 and the following lemma.

Lemma 5.8.

Let dd be a fixed integer. Then, SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}) is RSAT-hard, where AA is a ball of radius 𝒪(nnlog4(n)r){\cal O}\left(\frac{n\sqrt{n}\log^{4}(n)}{r}\right) in n{\mathbb{R}}^{n}.

Proof.

By Lemma 5.2, the problem SCATndA(signcnnn,log2(n))\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}) where AA^{\prime} is the ball of radius log2(n)\log^{2}(n^{\prime}) in n{\mathbb{R}}^{n}, is RSAT-hard. We will reduce this problem to SCATndA(𝒟mat)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{mat}}). Given a sample S={(𝐱i,yi)}i=1nd(n×{0,1})ndS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{d}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{d}} with 𝐱ilog2(n)\left\|\mathbf{x}_{i}\right\|\leq\log^{2}(n^{\prime}) for every i[nd]i\in[n^{d}], we will, with probability 1on(1)1-o_{n}(1), construct a sample SS^{\prime} that is contained in AA, such that if SS is scattered then SS^{\prime} is scattered, and if SS is signcnnn,log2(n){\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}-realizable then SS^{\prime} is 𝒟mat{\cal D}_{\mathrm{mat}}-realizable. Note that our reduction is allowed to fail with probability on(1)o_{n}(1). Indeed, distinguishing scattered from realizable requires success with probability 34on(1)\frac{3}{4}-o_{n}(1) and therefore reductions between such problems are not sensitive to a failure with probability on(1)o_{n}(1).

Assuming that MM is invertible (by Lemma 5.7 it holds with probability 1on(1)1-o_{n}(1)), let S={(𝐱1,y1),,(𝐱nd,ynd)}S^{\prime}=\{(\mathbf{x}^{\prime}_{1},y_{1}),\ldots,(\mathbf{x}^{\prime}_{n^{d}},y_{n^{d}})\} where for every ii we have 𝐱i=(M)1𝐱i\mathbf{x}^{\prime}_{i}=(M^{\top})^{-1}\mathbf{x}_{i}. Note that if SS is scattered then SS^{\prime} is also scattered.

Assume that SS is realizable by the CNN h𝐰nh_{\mathbf{w}}^{n} with 𝐰{±1}n+1\mathbf{w}\in\{\pm 1\}^{n^{\prime}+1}. Let WW be the matrix of size n×log2(n)n\times\log^{2}(n^{\prime}) such that hW=h𝐰nh_{W}=h_{\mathbf{w}}^{n}. Thus, W=(𝐰1,,𝐰log2(n))W=(\mathbf{w}^{1},\ldots,\mathbf{w}^{\log^{2}(n^{\prime})}) where for every i[log2(n)]i\in[\log^{2}(n^{\prime})] we have (𝐰(i1)(n+1)+1i,,𝐰i(n+1)i)=𝐰(\mathbf{w}^{i}_{(i-1)(n^{\prime}+1)+1},\ldots,\mathbf{w}^{i}_{i(n^{\prime}+1)})=\mathbf{w}, and 𝐰ji=0\mathbf{w}^{i}_{j}=0 for every other j[n]j\in[n]. Let W=MWW^{\prime}=MW. Note that SS^{\prime} is realizable by hWh_{W^{\prime}}. Indeed, for every i[nd]i\in[n^{d}] we have yi=h𝐰n(𝐱i)=hW(𝐱i)y_{i}=h_{\mathbf{w}}^{n}(\mathbf{x}_{i})=h_{W}(\mathbf{x}_{i}), and W𝐱i=WM(M)1𝐱i=(W)𝐱iW^{\top}\mathbf{x}_{i}=W^{\top}M^{\top}(M^{\top})^{-1}\mathbf{x}_{i}=(W^{\prime})^{\top}\mathbf{x}^{\prime}_{i}. Also, note that the columns of WW^{\prime} are i.i.d. copies from the uniform distribution on r𝕊n1r\cdot{\mathbb{S}}^{n-1}. Indeed, denote M=(𝐯1,,𝐯n)M^{\top}=(\mathbf{v}^{1},\ldots,\mathbf{v}^{n}). Then, for every row index i[n]i\in[n] we write i=(b1)(n+1)+ri=(b-1)(n^{\prime}+1)+r, where b,rb,r are integers and 1rn+11\leq r\leq n^{\prime}+1. Thus, bb is the row index of the block in MM that corresponds to the ii-th row of MM, and rr is the row index within the block. Now, note that

Wij\displaystyle W^{\prime}_{ij} =\displaystyle= 𝐯i,𝐰j=(𝐯(j1)(n+1)+1i,,𝐯j(n+1)i),𝐰=(Br1bj,,Br(n+1)bj),𝐰\displaystyle\langle\mathbf{v}^{i},\mathbf{w}^{j}\rangle=\langle\left(\mathbf{v}^{i}_{(j-1)(n^{\prime}+1)+1},\ldots,\mathbf{v}^{i}_{j(n^{\prime}+1)}\right),\mathbf{w}\rangle=\langle(B^{bj}_{r1},\ldots,B^{bj}_{r(n^{\prime}+1)}),\mathbf{w}\rangle
=\displaystyle= Brrbj𝐰r=zrbj𝐰r.\displaystyle B^{bj}_{rr}\cdot\mathbf{w}_{r}=z^{bj}_{r}\cdot\mathbf{w}_{r}~{}.

Since 𝐰r{±1}\mathbf{w}_{r}\in\{\pm 1\}, and since the uniform distribution on a sphere is invariant under multiplying a subset of the components by 1-1, the jj-th column of WW^{\prime} has the same distribution as 𝐳j\mathbf{z}^{j}, namely, the uniform distribution over r𝕊n1r\cdot{\mathbb{S}}^{n-1}. Also, the columns of WW^{\prime} are independent. Thus, W𝒟matW^{\prime}\sim{\cal D}_{\mathrm{mat}}, and therefore hWh_{W^{\prime}} is a 𝒟mat{\cal D}_{\mathrm{mat}}-random network.

By Lemma 5.7, we have with probability 1on(1)1-o_{n}(1) that for every ii,

𝐱i\displaystyle\left\|\mathbf{x}^{\prime}_{i}\right\| =\displaystyle= (M)1𝐱ismax((M)1)𝐱i=1smin(M)𝐱i=1smin(M)𝐱i\displaystyle\left\|(M^{\top})^{-1}\mathbf{x}_{i}\right\|\leq s_{\max}\left((M^{\top})^{-1}\right)\left\|\mathbf{x}_{i}\right\|=\frac{1}{s_{\min}(M^{\top})}\left\|\mathbf{x}_{i}\right\|=\frac{1}{s_{\min}(M)}\left\|\mathbf{x}_{i}\right\|
\displaystyle\leq nnlog5(n)crlog2(n)nnlog4(n)cr.\displaystyle\frac{n^{\prime}\sqrt{n^{\prime}}\log^{5}(n^{\prime})}{c^{\prime}r}\cdot\log^{2}(n^{\prime})\leq\frac{n\sqrt{n}\log^{4}(n)}{c^{\prime}r}~{}.

Thus, 𝐱i=𝒪(nnlog4(n)r)\left\|\mathbf{x}^{\prime}_{i}\right\|={\cal O}\left(\frac{n\sqrt{n}\log^{4}(n)}{r}\right). ∎

5.4 Hardness of learning random convolutional neural networks

5.4.1 Proof of Theorem 3.4

Theorem 3.4 follows immediately from Theorem 5.1 and the following lemma:

Lemma 5.9.

Let dd be a fixed integer. Then, SCATndA(𝒟zn+1,n)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{z}^{n^{\prime}+1},n) is RSAT-hard, where AA is the ball of radius log2(n)f(n)\frac{\log^{2}(n^{\prime})}{f(n^{\prime})} in n{\mathbb{R}}^{n}.

Proof.

By Lemma 5.2, the problem SCATndA(signcnnn,log2(n))\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}) where AA^{\prime} is the ball of radius log2(n)\log^{2}(n^{\prime}) in n{\mathbb{R}}^{n}, is RSAT-hard. We will reduce this problem to SCATndA(𝒟zn+1,n)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{z}^{n^{\prime}+1},n). Given a sample S={(𝐱i,yi)}i=1nd(n×{0,1})ndS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{d}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{d}} with 𝐱ilog2(n)\left\|\mathbf{x}_{i}\right\|\leq\log^{2}(n^{\prime}) for every i[nd]i\in[n^{d}], we will, with probability 1on(1)1-o_{n}(1), construct a sample SS^{\prime} that is contained in AA, such that if SS is scattered then SS^{\prime} is scattered, and if SS is signcnnn,log2(n){\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}-realizable then SS^{\prime} is 𝒟zn+1{\cal D}_{z}^{n^{\prime}+1}-realizable. Note that our reduction is allowed to fail with probability on(1)o_{n}(1). Indeed, distinguishing scattered from realizable requires success with probability 34on(1)\frac{3}{4}-o_{n}(1) and therefore reductions between such problems are not sensitive to a failure with probability on(1)o_{n}(1).

Let 𝐳=(z1,,zn+1)\mathbf{z}=(z_{1},\ldots,z_{n^{\prime}+1}) where each ziz_{i} is drawn i.i.d. from 𝒟z{\cal D}_{z}. Let M=diag(𝐳)M=\mathrm{diag}(\mathbf{z}) be a diagonal matrix. Note that MM is invertible with probability 1on(1)1-o_{n}(1), since for every i[n+1]i\in[n^{\prime}+1] we have Przi𝒟z(zi=0)Przi𝒟z(|zi|<f(n))=o(1n)Pr_{z_{i}\sim{\cal D}_{z}}(z_{i}=0)\leq Pr_{z_{i}\sim{\cal D}_{z}}(|z_{i}|<f(n^{\prime}))=o(\frac{1}{n^{\prime}}). Now, for every 𝐱i\mathbf{x}_{i} from SS, denote 𝐱i=(𝐱1i,,𝐱log2(n)i)\mathbf{x}_{i}=(\mathbf{x}^{i}_{1},\ldots,\mathbf{x}^{i}_{\log^{2}(n^{\prime})}) where for every jj we have 𝐱jin+1\mathbf{x}^{i}_{j}\in{\mathbb{R}}^{n^{\prime}+1}. Let 𝐱i=(M1𝐱1i,,M1𝐱log2(n)i)\mathbf{x}^{\prime}_{i}=(M^{-1}\mathbf{x}^{i}_{1},\ldots,M^{-1}\mathbf{x}^{i}_{\log^{2}(n^{\prime})}), and let S={(𝐱1,y1),,(𝐱nd,ynd)}S^{\prime}=\{(\mathbf{x}^{\prime}_{1},y_{1}),\ldots,(\mathbf{x}^{\prime}_{n^{d}},y_{n^{d}})\}. Note that if SS is scattered then SS^{\prime} is also scattered. If SS is realizable by a CNN h𝐰nsigncnnn,log2(n)h_{\mathbf{w}}^{n}\in{\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}, then let 𝐰=M𝐰\mathbf{w}^{\prime}=M\mathbf{w}. Note that SS^{\prime} is realizable by h𝐰nh_{\mathbf{w}^{\prime}}^{n}. Indeed, for every ii and jj we have 𝐰,M1𝐱ji=𝐰MM1𝐱ji=𝐰MM1𝐱ji=𝐰,𝐱ji\langle\mathbf{w}^{\prime},M^{-1}\mathbf{x}^{i}_{j}\rangle=\mathbf{w}^{\top}M^{\top}M^{-1}\mathbf{x}^{i}_{j}=\mathbf{w}^{\top}MM^{-1}\mathbf{x}^{i}_{j}=\langle\mathbf{w},\mathbf{x}^{i}_{j}\rangle. Also, note that since 𝐰{±1}n+1\mathbf{w}\in\{\pm 1\}^{n^{\prime}+1} and 𝒟z{\cal D}_{z} is symmetric, then 𝐰\mathbf{w}^{\prime} has the distribution 𝒟zn+1{\cal D}_{z}^{n^{\prime}+1}, and thus h𝐰nh_{\mathbf{w}^{\prime}}^{n} is a 𝒟zn+1{\cal D}_{z}^{n^{\prime}+1}-random CNN.
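A toy sketch of this transformation (with a Gaussian 𝒟z chosen purely for illustration): M=diag(𝐳) acts patch-wise, and all patch inner products are preserved since ⟨M𝐰, M^{-1}𝐱⟩ = ⟨𝐰, 𝐱⟩ for an invertible diagonal MM.

```python
import numpy as np

rng = np.random.default_rng(4)
k, b = 6, 4                           # patch size n'+1 and number of patches log^2(n')

z = rng.standard_normal(k)            # z ~ D_z^{n'+1} (Gaussian here, for illustration)
w = rng.choice([-1.0, 1.0], size=k)   # sign-CNN filter w in {+-1}^{n'+1}
patches = rng.standard_normal((b, k)) # the patches x^i_1, ..., x^i_{log^2(n')}

w_prime = z * w                       # w' = M w, distributed as D_z^{n'+1}
patches_prime = patches / z           # each patch mapped to M^{-1} x^i_j (M = diag(z))

# Patch pre-activations, and hence the CNN output, are unchanged.
print(np.allclose(patches @ w, patches_prime @ w_prime))   # True
```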

The probability that 𝐳𝒟zn+1\mathbf{z}\sim{\cal D}_{z}^{n^{\prime}+1} has some component ziz_{i} with |zi|<f(n)|z_{i}|<f(n^{\prime}) is at most (n+1)o(1n)=on(1)(n^{\prime}+1)\cdot o(\frac{1}{n^{\prime}})=o_{n}(1). Therefore, with probability 1on(1)1-o_{n}(1) we have for every i[nd]i\in[n^{d}] that

𝐱i2\displaystyle\left\|\mathbf{x}^{\prime}_{i}\right\|^{2} =\displaystyle= 1jlog2(n)M1𝐱ji21jlog2(n)(1f(n)𝐱ji)2=1(f(n))21jlog2(n)𝐱ji2\displaystyle\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|M^{-1}\mathbf{x}^{i}_{j}\right\|^{2}\leq\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left(\frac{1}{f(n^{\prime})}\left\|\mathbf{x}^{i}_{j}\right\|\right)^{2}=\frac{1}{(f(n^{\prime}))^{2}}\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|\mathbf{x}^{i}_{j}\right\|^{2}
=\displaystyle= 1(f(n))2𝐱i2log4(n)(f(n))2.\displaystyle\frac{1}{(f(n^{\prime}))^{2}}\left\|\mathbf{x}_{i}\right\|^{2}\leq\frac{\log^{4}(n^{\prime})}{(f(n^{\prime}))^{2}}~{}.

Thus, 𝐱ilog2(n)f(n)\left\|\mathbf{x}^{\prime}_{i}\right\|\leq\frac{\log^{2}(n^{\prime})}{f(n^{\prime})}. ∎

5.4.2 Proof of Theorem 3.5

Assume that the covariance matrix Σ\Sigma is of size (n+1)×(n+1)(n^{\prime}+1)\times(n^{\prime}+1), and let n=(n+1)log2(n)n=(n^{\prime}+1)\log^{2}(n^{\prime}). Note that a 𝒩(𝟎,Σ){\cal N}({\mathbf{0}},\Sigma)-random CNN h𝐰nh_{\mathbf{w}}^{n} has log2(n)=𝒪(log2(n))\log^{2}(n^{\prime})={\cal O}(\log^{2}(n)) hidden neurons. Let 𝒟vec{\cal D}_{\mathrm{vec}} be a distribution over n+1{\mathbb{R}}^{n^{\prime}+1} such that each component is drawn i.i.d. from 𝒩(0,1){\cal N}(0,1). Let dd be a fixed integer. By Lemma 5.9 and by choosing f(n)=1nlog(n)f(n^{\prime})=\frac{1}{n^{\prime}\log(n^{\prime})}, we have that SCATndA(𝒟vec,n)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{vec}},n) is RSAT-hard, where AA is the ball of radius nlog3(n)nlog(n)n^{\prime}\log^{3}(n^{\prime})\leq n\log(n) in n{\mathbb{R}}^{n}. Note that 𝒟vec=𝒩(𝟎,In+1){\cal D}_{\mathrm{vec}}={\cal N}({\mathbf{0}},I_{n^{\prime}+1}) (Tong (2012)). By Theorem 5.1, we need to show that SCATndA(𝒩(𝟎,Σ),n)\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal N}({\mathbf{0}},\Sigma),n) is RSAT-hard, where AA^{\prime} is the ball of radius λmin12nlog(n)\lambda_{\min}^{-\frac{1}{2}}n\log(n) in n{\mathbb{R}}^{n}. We show a reduction from SCATndA(𝒩(𝟎,In+1),n)\mathrm{SCAT}_{n^{d}}^{A}({\cal N}({\mathbf{0}},I_{n^{\prime}+1}),n) to SCATndA(𝒩(𝟎,Σ),n)\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal N}({\mathbf{0}},\Sigma),n).

Let S={(𝐱i,yi)}i=1nd(n×{0,1})ndS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{d}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{d}} be a sample. For every 𝐱i\mathbf{x}_{i} from SS, denote 𝐱i=(𝐱1i,,𝐱log2(n)i)\mathbf{x}_{i}=(\mathbf{x}^{i}_{1},\ldots,\mathbf{x}^{i}_{\log^{2}(n^{\prime})}) where for every jj we have 𝐱jin+1\mathbf{x}^{i}_{j}\in{\mathbb{R}}^{n^{\prime}+1}. Let Σ=UΛU\Sigma=U\Lambda U^{\top} be the spectral decomposition of Σ\Sigma, and let M=UΛ12M=U\Lambda^{\frac{1}{2}}. Recall that if 𝐰𝒩(𝟎,In+1)\mathbf{w}\sim{\cal N}({\mathbf{0}},I_{n^{\prime}+1}) then M𝐰𝒩(𝟎,Σ)M\mathbf{w}\sim{\cal N}({\mathbf{0}},\Sigma) (Tong (2012)). Let 𝐱i=((M)1𝐱1i,,(M)1𝐱log2(n)i)\mathbf{x}^{\prime}_{i}=((M^{\top})^{-1}\mathbf{x}^{i}_{1},\ldots,(M^{\top})^{-1}\mathbf{x}^{i}_{\log^{2}(n^{\prime})}), and let S={(𝐱1,y1),,(𝐱nd,ynd)}S^{\prime}=\{(\mathbf{x}^{\prime}_{1},y_{1}),\ldots,(\mathbf{x}^{\prime}_{n^{d}},y_{n^{d}})\}. Note that if SS is scattered then SS^{\prime} is also scattered. If SS is realizable by a 𝒩(𝟎,In+1){\cal N}({\mathbf{0}},I_{n^{\prime}+1})-random CNN h𝐰nh_{\mathbf{w}}^{n}, then let 𝐰=M𝐰\mathbf{w}^{\prime}=M\mathbf{w}. Note that SS^{\prime} is realizable by h𝐰nh_{\mathbf{w}^{\prime}}^{n}. Indeed, for every ii and jj we have 𝐰,(M)1𝐱ji=𝐰M(M)1𝐱ji=𝐰,𝐱ji\langle\mathbf{w}^{\prime},(M^{\top})^{-1}\mathbf{x}^{i}_{j}\rangle=\mathbf{w}^{\top}M^{\top}(M^{\top})^{-1}\mathbf{x}^{i}_{j}=\langle\mathbf{w},\mathbf{x}^{i}_{j}\rangle. Since 𝐰=M𝐰𝒩(𝟎,Σ)\mathbf{w}^{\prime}=M\mathbf{w}\sim{\cal N}({\mathbf{0}},\Sigma), the sample SS^{\prime} is 𝒩(𝟎,Σ){\cal N}({\mathbf{0}},\Sigma)-realizable.

We now bound the norms of 𝐱i\mathbf{x}^{\prime}_{i} in SS^{\prime}. Note that for every i[nd]i\in[n^{d}] we have

𝐱i2\displaystyle\left\|\mathbf{x}^{\prime}_{i}\right\|^{2} =\displaystyle= 1jlog2(n)(M)1𝐱ji2=1jlog2(n)UΛ12𝐱ji2=1jlog2(n)Λ12𝐱ji2\displaystyle\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|(M^{\top})^{-1}\mathbf{x}^{i}_{j}\right\|^{2}=\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|U\Lambda^{-\frac{1}{2}}\mathbf{x}^{i}_{j}\right\|^{2}=\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|\Lambda^{-\frac{1}{2}}\mathbf{x}^{i}_{j}\right\|^{2}
\displaystyle\leq 1jlog2(n)λmin12𝐱ji2=λmin11jlog2(n)𝐱ji2=λmin1𝐱i2.\displaystyle\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|\lambda_{\min}^{-\frac{1}{2}}\mathbf{x}^{i}_{j}\right\|^{2}=\lambda_{\min}^{-1}\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|\mathbf{x}^{i}_{j}\right\|^{2}=\lambda_{\min}^{-1}\left\|\mathbf{x}_{i}\right\|^{2}~{}.

Hence, 𝐱iλmin12𝐱iλmin12nlog(n)\left\|\mathbf{x}^{\prime}_{i}\right\|\leq\lambda_{\min}^{-\frac{1}{2}}\left\|\mathbf{x}_{i}\right\|\leq\lambda_{\min}^{-\frac{1}{2}}n\log(n).

5.4.3 Proof of Theorem 3.6

Let n=(n+1)log2(n)n=(n^{\prime}+1)\log^{2}(n^{\prime}). Let 𝒟vec{\cal D}_{\mathrm{vec}} be the uniform distribution on r𝕊nr\cdot{\mathbb{S}}^{n^{\prime}}. Note that a 𝒟vec{\cal D}_{\mathrm{vec}}-random CNN h𝐰nh_{\mathbf{w}}^{n} has log2(n)=𝒪(log2(n))\log^{2}(n^{\prime})={\cal O}(\log^{2}(n)) hidden neurons. Let dd be a fixed integer. By Theorem 5.1, we need to show that SCATndA(𝒟vec,n)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{vec}},n) is RSAT-hard, where AA is the ball of radius nlog(n)r\frac{\sqrt{n}\log(n)}{r} in n{\mathbb{R}}^{n}. By Lemma 5.2, the problem SCATndA(signcnnn,log2(n))\mathrm{SCAT}_{n^{d}}^{A^{\prime}}({\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}) where AA^{\prime} is the ball of radius log2(n)\log^{2}(n^{\prime}) in n{\mathbb{R}}^{n}, is RSAT-hard. We reduce this problem to SCATndA(𝒟vec,n)\mathrm{SCAT}_{n^{d}}^{A}({\cal D}_{\mathrm{vec}},n). Given a sample S={(𝐱i,yi)}i=1nd(n×{0,1})ndS=\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n^{d}}\in({\mathbb{R}}^{n}\times\{0,1\})^{n^{d}} with 𝐱ilog2(n)\left\|\mathbf{x}_{i}\right\|\leq\log^{2}(n^{\prime}) for every i[nd]i\in[n^{d}], we construct a sample SS^{\prime} that is contained in AA, such that if SS is scattered then SS^{\prime} is scattered, and if SS is signcnnn,log2(n){\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}-realizable then SS^{\prime} is 𝒟vec{\cal D}_{\mathrm{vec}}-realizable.

Let MM be a random orthogonal matrix of size (n+1)×(n+1)(n^{\prime}+1)\times(n^{\prime}+1). For every i[nd]i\in[n^{d}] denote 𝐱i=(𝐱1i,,𝐱log2(n)i)\mathbf{x}_{i}=(\mathbf{x}^{i}_{1},\ldots,\mathbf{x}^{i}_{\log^{2}(n^{\prime})}) where for every jj we have 𝐱jin+1\mathbf{x}^{i}_{j}\in{\mathbb{R}}^{n^{\prime}+1}. For every i[nd]i\in[n^{d}] let 𝐱i=(n+1rM𝐱1i,,n+1rM𝐱log2(n)i)\mathbf{x}^{\prime}_{i}=(\frac{\sqrt{n^{\prime}+1}}{r}M\mathbf{x}^{i}_{1},\ldots,\frac{\sqrt{n^{\prime}+1}}{r}M\mathbf{x}^{i}_{\log^{2}(n^{\prime})}), and let S={(𝐱1,y1),,(𝐱nd,ynd)}S^{\prime}=\{(\mathbf{x}^{\prime}_{1},y_{1}),\ldots,(\mathbf{x}^{\prime}_{n^{d}},y_{n^{d}})\}. Note that if SS is scattered then SS^{\prime} is also scattered. If SS is realizable by a CNN h𝐰nsigncnnn,log2(n)h_{\mathbf{w}}^{n}\in{\cal H}_{\mathrm{sign-cnn}}^{n,\log^{2}(n^{\prime})}, then let 𝐰=rn+1M𝐰\mathbf{w}^{\prime}=\frac{r}{\sqrt{n^{\prime}+1}}M\mathbf{w}. Note that SS^{\prime} is realizable by h𝐰nh_{\mathbf{w}^{\prime}}^{n}. Indeed, for every ii and jj we have

𝐰,n+1rM𝐱ji=𝐰rn+1Mn+1rM𝐱ji=𝐰,𝐱ji.\langle\mathbf{w}^{\prime},\frac{\sqrt{n^{\prime}+1}}{r}M\mathbf{x}^{i}_{j}\rangle=\mathbf{w}^{\top}\frac{r}{\sqrt{n^{\prime}+1}}M^{\top}\frac{\sqrt{n^{\prime}+1}}{r}M\mathbf{x}^{i}_{j}=\langle\mathbf{w},\mathbf{x}^{i}_{j}\rangle~{}.

Also, note that since 𝐰=n+1\left\|\mathbf{w}\right\|=\sqrt{n^{\prime}+1} and MM is orthogonal, 𝐰\mathbf{w}^{\prime} is uniformly distributed on the sphere of radius rr in n+1{\mathbb{R}}^{n^{\prime}+1}, and thus h𝐰nh_{\mathbf{w}^{\prime}}^{n} is a 𝒟vec{\cal D}_{\mathrm{vec}}-random CNN.
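A sketch of this construction with toy sizes (a Haar-random orthogonal MM drawn via scipy.stats.ortho_group; all sizes are illustrative): the rescaled rotation sends 𝐰∈{±1}^{n'+1} to a uniformly random point on the radius-r sphere while preserving every patch inner product.

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(5)
k, b, r = 6, 4, 2.5                         # n'+1, log^2(n'), and the radius r

M = ortho_group.rvs(dim=k, random_state=5)  # Haar-random orthogonal matrix
w = rng.choice([-1.0, 1.0], size=k)         # sign-CNN filter, ||w|| = sqrt(n'+1)
patches = rng.standard_normal((b, k))       # patches x^i_1, ..., x^i_{log^2(n')}

w_prime = (r / np.sqrt(k)) * (M @ w)        # w' = (r / sqrt(n'+1)) M w
patches_prime = (np.sqrt(k) / r) * patches @ M.T   # x'^i_j = (sqrt(n'+1)/r) M x^i_j

print(np.isclose(np.linalg.norm(w_prime), r))             # True: w' lies on r*S^{n'}
print(np.allclose(patches @ w, patches_prime @ w_prime))  # True: outputs preserved
```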

Since MM is orthogonal then for every i[nd]i\in[n^{d}] we have

𝐱i2\displaystyle\left\|\mathbf{x}^{\prime}_{i}\right\|^{2} =\displaystyle= 1jlog2(n)n+1rM𝐱ji2=n+1r21jlog2(n)𝐱ji2\displaystyle\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|\frac{\sqrt{n^{\prime}+1}}{r}M\mathbf{x}^{i}_{j}\right\|^{2}=\frac{n^{\prime}+1}{r^{2}}\sum_{1\leq j\leq\log^{2}(n^{\prime})}\left\|\mathbf{x}^{i}_{j}\right\|^{2}
=\displaystyle= n+1r2𝐱i2(n+1)log4(n)r2nlog2(n)r2.\displaystyle\frac{n^{\prime}+1}{r^{2}}\cdot\left\|\mathbf{x}_{i}\right\|^{2}\leq\frac{(n^{\prime}+1)\log^{4}(n^{\prime})}{r^{2}}\leq\frac{n\log^{2}(n)}{r^{2}}~{}.

Hence 𝐱inlog(n)r\left\|\mathbf{x}^{\prime}_{i}\right\|\leq\frac{\sqrt{n}\log(n)}{r}.

References

  • Adamczak et al. [2012] R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Condition number of a square matrix with iid columns drawn from a convex body. Proceedings of the American Mathematical Society, 140(3):987–998, 2012.
  • Agarwal et al. [2020] N. Agarwal, P. Awasthi, and S. Kale. A deep conditioning treatment of neural networks. arXiv preprint arXiv:2002.01523, 2020.
  • Allen-Zhu et al. [2018a] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
  • Allen-Zhu et al. [2018b] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.
  • Andoni et al. [2014] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1908–1916, 2014.
  • Applebaum et al. [2008] B. Applebaum, B. Barak, and D. Xiao. On basing lower-bounds for learning on worst-case assumptions. In Foundations of Computer Science, 2008. FOCS’08. IEEE 49th Annual IEEE Symposium on, pages 211–220. IEEE, 2008.
  • Arora et al. [2014] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
  • Arora et al. [2019] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
  • Brutzkus and Globerson [2017] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 605–614. JMLR. org, 2017.
  • Brutzkus et al. [2017] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
  • Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
  • Coja-Oghlan et al. [2004] A. Coja-Oghlan, A. Goerdt, and A. Lanka. Strong refutation heuristics for random k-sat. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 310–321. Springer, 2004.
  • Coja-Oghlan et al. [2010] A. Coja-Oghlan, C. Cooper, and A. Frieze. An efficient sparse regularity concept. SIAM Journal on Discrete Mathematics, 23(4):2000–2034, 2010.
  • Daniely [2016] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105–117. ACM, 2016.
  • Daniely [2017] A. Daniely. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
  • Daniely [2019] A. Daniely. Neural networks learning and memorization with (almost) no over-parameterization. arXiv preprint arXiv:1911.09873, 2019.
  • Daniely and Shalev-Shwartz [2016] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning dnf’s. In Conference on Learning Theory, pages 815–830, 2016.
  • Daniely et al. [2014] A. Daniely, N. Linial, and S. Shalev-Shwartz. From average case complexity to improper learning complexity. In STOC, 2014.
  • Daniely et al. [2016] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.
  • Das et al. [2019] A. Das, S. Gollapudi, R. Kumar, and R. Panigrahy. On the learnability of deep random networks. arXiv preprint arXiv:1904.03866, 2019.
  • Du and Goel [2018] S. S. Du and S. Goel. Improved learning of one-hidden-layer convolutional neural networks with overlaps. arXiv preprint arXiv:1805.07798, 2018.
  • Du et al. [2017a] S. S. Du, J. D. Lee, and Y. Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.
  • Du et al. [2017b] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.
  • Du et al. [2018] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Fang [2018] K. W. Fang. Symmetric multivariate and related distributions. Chapman and Hall/CRC, 2018.
  • Feige [2002] U. Feige. Relations between average case complexity and approximation complexity. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 534–543. ACM, 2002.
  • Feige and Ofek [2004] U. Feige and E. Ofek. Easily refutable subformulas of large random 3cnf formulas. In Automata, languages and programming, pages 519–530. Springer, 2004.
  • Feldman et al. [2006] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.
  • Ge et al. [2019] R. Ge, R. Wang, and H. Zhao. Mildly overparametrized neural nets can memorize training data efficiently. arXiv preprint arXiv:1909.11837, 2019.
  • Goel and Klivans [2017] S. Goel and A. Klivans. Learning neural networks with two nonlinear layers in polynomial time. arXiv preprint arXiv:1709.06010, 2017.
  • Goel et al. [2018] S. Goel, A. Klivans, and R. Meka. Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547, 2018.
  • Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
  • Janzamin et al. [2015] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
  • Ji and Telgarsky [2019] Z. Ji and M. Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
  • Klivans and Sherstov [2006] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.
  • Lee et al. [2019] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Li and Yuan [2017] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
  • Ma et al. [2019] C. Ma, L. Wu, et al. A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics. arXiv preprint arXiv:1904.04326, 2019.
  • Oymak and Soltanolkotabi [2018] S. Oymak and M. Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.
  • Oymak and Soltanolkotabi [2019] S. Oymak and M. Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv preprint arXiv:1902.04674, 2019.
  • Rivasplata [2012] O. Rivasplata. Subgaussian random variables: An expository note. Internet publication, PDF, 2012.
  • Rudelson and Vershynin [2008] M. Rudelson and R. Vershynin. The littlewood–offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.
  • Shamir [2018] O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.
  • Song and Yang [2019] Z. Song and X. Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
  • Tian [2017] Y. Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3404–3413. JMLR. org, 2017.
  • Tong [2012] Y. L. Tong. The multivariate normal distribution. Springer Science & Business Media, 2012.
  • Xie et al. [2016] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
  • Zou and Gu [2019] D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.

Appendix A From CSPndrand(SATK)\mathrm{CSP}^{\mathrm{rand}}_{n^{d}}(\mathrm{SAT}_{K}) to CSPnd1rand(TK,q(n),¬TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)},\neg T_{K,q(n)}) (Daniely and Shalev-Shwartz [2016])

We outline the main ideas of the reduction.

First, we reduce CSPndrand(SATK)\mathrm{CSP}^{\mathrm{rand}}_{n^{d}}(\mathrm{SAT}_{K}) to CSPnd1rand(TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)}). This is done as follows. Given an instance J={C1,,Cnd}J=\{C_{1},\ldots,C_{n^{d}}\} to CSP(SATK)\mathrm{CSP}(\mathrm{SAT}_{K}), by a simple greedy procedure, we try to find nd1n^{d-1} disjoint subsets J1,,Jnd1JJ^{\prime}_{1},\ldots,J^{\prime}_{n^{d-1}}\subset J, such that for every tt, the subset JtJ^{\prime}_{t} consists of q(n)q(n) constraints and each variable appears in at most one of the constraints in JtJ^{\prime}_{t}. Now, from every JtJ^{\prime}_{t} we construct a TK,q(n)T_{K,q(n)}-constraint that is the conjunction of all constraints in JtJ^{\prime}_{t}. If JJ is random, this procedure will succeed w.h.p. and will produce a random TK,q(n)T_{K,q(n)}-formula. If JJ is satisfiable, this procedure will either fail or produce a satisfiable TK,q(n)T_{K,q(n)}-formula.
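One possible greedy realization of this grouping step (a sketch only; the representation of constraints as lists of (sign, variable) pairs is hypothetical, and the argument above only needs that some such procedure succeeds w.h.p. on random instances):

```python
def greedy_group(constraints, q):
    """Greedily collect variable-disjoint constraints into groups of size q.
    Each constraint is a list of (sign, variable) pairs; a completed group of q
    constraints becomes one T_{K,q}-constraint (the conjunction of its members)."""
    groups, current, used_vars = [], [], set()
    for c in constraints:
        vars_c = {v for _, v in c}
        if used_vars & vars_c:        # shares a variable with the current group
            continue
        current.append(c)
        used_vars |= vars_c
        if len(current) == q:
            groups.append(current)
            current, used_vars = [], set()
    return groups

# Tiny illustration: 3-clauses over 9 variables, grouped in pairs (q = 2).
J = [[(1, 0), (-1, 1), (1, 2)], [(1, 0), (1, 3), (-1, 4)],
     [(-1, 3), (1, 4), (1, 5)], [(1, 6), (1, 7), (-1, 8)]]
print(greedy_group(J, q=2))   # one group containing the 1st and 3rd clauses
```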

Now, we reduce CSPnd1rand(TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)}) to CSPnd1rand(TK,q(n),¬TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)},\neg T_{K,q(n)}). This is done by replacing each constraint, with probability 12\frac{1}{2}, with a random ¬P\neg P constraint. Clearly, if the original instance is a random instance of CSPnd1rand(TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)}), then the produced instance is a random instance of CSPnd1rand(TK,q(n),¬TK,q(n))\mathrm{CSP}^{\mathrm{rand}}_{n^{d-1}}(T_{K,q(n)},\neg T_{K,q(n)}). Furthermore, if the original instance is satisfied by the assignment ψ{±1}n\psi\in\{\pm 1\}^{n}, the same ψ\psi, w.h.p., will satisfy all the new constraints. The reason is that the probability that a random ¬TK,q(n)\neg T_{K,q(n)}-constraint is satisfied by ψ\psi is 1(12K)q(n)1-\left(1-2^{-K}\right)^{q(n)}, and hence, the probability that all new constraints are satisfied by ψ\psi is at least 1nd1(12K)q(n)1-n^{d-1}\left(1-2^{-K}\right)^{q(n)}. Now, since q(n)=ω(log(n))q(n)=\omega(\log(n)), the last probability is 1on(1)1-o_{n}(1).

For the full proof see Daniely and Shalev-Shwartz [2016].