Hardness of Learning Neural Networks with
Natural Weights
Abstract
Neural networks are nowadays highly successful despite strong hardness results. The existing hardness results focus on the network architecture, and assume that the network’s weights are arbitrary. A natural approach to settle the discrepancy is to assume that the network’s weights are “well-behaved” and possess some generic properties that may allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some “random-like” properties with respect to some “natural” distributions. We prove negative results in this regard, and show that for depth-2 networks, and many “natural” weights distributions such as the normal and the uniform distribution, most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights and every input distribution. It implies that there is no generic property that holds with high probability in such random networks and allows efficient learning.
1 Introduction
Neural networks have revolutionized performance in multiple domains, such as computer vision and natural language processing, and have proven to be a highly effective tool for solving many challenging problems. This impressive practical success of neural networks is not well understood from a theoretical point of view. In particular, despite extensive research in recent years, it is not clear which models are learnable by neural network algorithms.
Historically, there were many negative results for learning neural networks, and it is now known that under certain complexity assumptions, it is computationally hard to learn the class of functions computed by a neural network, even if the architecture is very simple. Indeed, it has been shown that learning neural networks is hard already for networks of depth 2 (Klivans and Sherstov, 2006; Daniely and Shalev-Shwartz, 2016). These results hold already for improper learning, namely, where the learning algorithm is allowed to return a hypothesis that does not belong to the considered hypothesis class.
In recent years, researchers have considered several ways to circumvent the discrepancy between those hardness results and the empirical success of neural networks, namely, to understand which models are still learnable by neural network algorithms. This effort includes proving learnability of linear models, including polynomials and kernel spaces (Andoni et al., 2014; Xie et al., 2016; Daniely et al., 2016; Daniely, 2017; Brutzkus et al., 2017; Jacot et al., 2018; Du et al., 2018; Oymak and Soltanolkotabi, 2018; Allen-Zhu et al., 2018a, b; Cao and Gu, 2019; Zou and Gu, 2019; Song and Yang, 2019; Ge et al., 2019; Oymak and Soltanolkotabi, 2019; Arora et al., 2019; Cao and Gu, 2019; Ji and Telgarsky, 2019; Ma et al., 2019; Lee et al., 2019; Daniely, 2019), making assumptions on the input distribution (Li and Yuan, 2017; Brutzkus and Globerson, 2017; Du et al., 2017a, b; Du and Goel, 2018; Goel et al., 2018; Shamir, 2018), the network’s weights (Arora et al., 2014; Shamir, 2018; Das et al., 2019; Agarwal et al., 2020; Goel and Klivans, 2017), or both (Janzamin et al., 2015; Tian, 2017).
In that respect, one fantastic result that could potentially be proven is that neural networks are efficiently learnable if we assume that the network’s weights are “well-behaved”, namely, that there are some generic properties of the network’s weights that allow efficient learning. This approach is supported by the intuition that the weights in real-world networks are not arbitrary, but exhibit some “random-like” properties with respect to some “natural weights distributions” (e.g., where the weights are drawn from a normal distribution). We say that a property of the network’s weights is a natural property with respect to such a natural weights distribution if it holds with high probability. Existing hardness results focus on the network architecture, and assume that the weights are arbitrary. Thus, it is unclear whether there exists a natural property that allows efficient learning.
In this work, we investigate networks with random weights, and networks whose weights possess natural properties. We show that under various natural weights distributions most networks are hard to learn. Namely, there is no efficient learning algorithm that is provably successful for most weights and every input distribution. We show that this implies that learning neural networks is hard already if their weights possess some natural property. Our hardness results are under the common assumption that refuting a random K-SAT formula is hard (the RSAT assumption). We emphasize that our results are valid for any learning algorithm, and not just common neural network algorithms.
We consider networks of depth 2 with a single output neuron, where the weights in the first layer are drawn from some natural distribution, and the weights in the second layer are all 1. We consider multiple natural weights distributions, for example, where the weight vector of each hidden neuron is drawn from a multivariate normal distribution, is distributed uniformly on the sphere, or has each of its components drawn i.i.d. from a normal, uniform or Bernoulli distribution. For each weights distribution, we show that learning such networks with high probability over the choice of the weights is hard. Thus, for such weights distributions, most networks are hard to learn. It implies that there is no generic property that holds with high probability in such random networks and allows efficient learning. Hence, if generic properties that allow efficient learning exist, then they are not natural, namely, they are rare with respect to all the natural weights distributions that we consider.
We also consider random neural networks of depth 2, where the first layer is a convolutional layer with non-overlapping patches whose filter is drawn from some natural distribution, and the weights of the second layer are all 1. We show that learning is hard also for such networks. It implies that there is no generic property that holds with high probability in such random convolutional networks and allows efficient learning.
Related work
Hardness of learning neural networks. Hardness of learning neural networks in the standard (improper and distribution-free) PAC model follows from hardness of learning intersections of halfspaces. Klivans and Sherstov (2006) showed that, assuming the hardness of the shortest vector problem, learning an intersection of n^ε halfspaces for a constant ε > 0 is hard. Daniely and Shalev-Shwartz (2016) showed that, under the RSAT assumption, learning an intersection of ω(log n) halfspaces is hard. These results imply hardness of learning depth-2 networks with n^ε and ω(log n) hidden neurons (respectively). In the agnostic model, learning halfspaces is already hard (Feldman et al., 2006; Daniely, 2016).
Learning random neural networks. Shamir (2018) considers the problem of learning neural networks where the weights are not arbitrary, but exhibit “nice” features such as non-degeneracy or some “random-like” appearance. The architecture of the networks that he considers is similar to ours. He shows that (under the RSAT assumption) no algorithm invariant to linear transformations can efficiently learn such networks if the columns of the weights matrix of the first layer are linearly independent. It implies that linear-invariant algorithms cannot learn such networks when the weights are chosen randomly. We note that this result holds only for linearly-invariant algorithms, which is a strong restriction. Standard gradient descent methods, for example, are not linearly invariant (Shamir (2018) shows that gradient descent can become linearly invariant if it is preceded by a certain preconditioning step). Our results hold for all algorithms.
In Das et al. (2019), it is shown that sufficiently deep random neural networks with the sign activation function are hard to learn in the statistical query (SQ) model. This result was recently extended by Agarwal et al. (2020) to other activation functions, including the ReLU function. While their results hold for deep networks and for SQ algorithms, our results hold for depth-2 networks and for all algorithms.
2 Preliminaries
2.1 Random Constraints Satisfaction Problems
Let be the collection of (signed) -tuples, that is, sequences for and distinct . Each defines a function by .
Let be some predicate. A -constraint with variables is a function of the form for some . An instance to the CSP problem is a -formula, i.e., a collection of -constraints (each is specified by a -tuple). The goal is to find an assignment that maximizes the fraction of satisfied constraints (i.e., constraints with ). We will allow CSP problems where varies with (but is still fixed for every ). For example, we can look at the -SAT problem.
We will consider the problem of distinguishing satisfiable from random formulas (a.k.a. the problem of refuting random formulas). For , we say that a randomized algorithm efficiently solves the problem , if is a polynomial-time algorithm such that:
• If the input is a satisfiable instance with the given numbers of variables and constraints, then the algorithm outputs “satisfiable” with probability at least 3/4, where the probability is over the randomness of the algorithm.
• If the input is a random instance with the given numbers of variables and constraints (to be precise, in a random formula the tuple defining each constraint is chosen uniformly, and independently of the other constraints), then the algorithm outputs “random” with probability at least 3/4, where the probability is over the choice of the instance and the randomness of the algorithm.
2.2 The random K-SAT assumption
Unless we face a dramatic breakthrough in complexity theory, it seems unlikely that hardness of learning can be established on standard complexity assumptions such as P ≠ NP (see Applebaum et al. (2008); Daniely et al. (2014)). Indeed, all currently known lower bounds are based on assumptions from cryptography or average-case hardness. Following Daniely and Shalev-Shwartz (2016), we will rely on an assumption about random K-SAT problems, which we outline below.
Consider a random K-SAT formula on n variables, in which each K-SAT constraint is chosen independently and uniformly from the collection of n-variate K-SAT constraints. A simple probabilistic argument shows that for some constant C (depending only on K), if the number of constraints is at least Cn, then the formula is not satisfiable w.h.p. The problem of refuting random K-SAT formulas (a.k.a. the problem of distinguishing satisfiable from random K-SAT formulas) is the refutation problem above, instantiated with the K-wise OR as the predicate.
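For concreteness, the following sketch samples a random K-SAT formula in the model described above; the clause representation (a list of signed variable indices) and the function names are our own illustration.

```python
import random

def random_ksat(n, m, k, seed=0):
    """Sample a random k-SAT formula over n variables with m clauses.
    A clause is a list of k signed literals: +i (resp. -i) means variable i
    appears positively (resp. negated). The k variables of each clause are
    distinct and chosen uniformly, independently of the other clauses."""
    rng = random.Random(seed)
    formula = []
    for _ in range(m):
        variables = rng.sample(range(1, n + 1), k)
        formula.append([v if rng.random() < 0.5 else -v for v in variables])
    return formula

def satisfies(assignment, formula):
    """assignment: dict mapping each variable index to True/False."""
    return all(any((lit > 0) == assignment[abs(lit)] for lit in clause)
               for clause in formula)

# A random 3-SAT formula with many more clauses than variables is
# unsatisfiable w.h.p., yet certifying this efficiently is believed hard.
formula = random_ksat(n=50, m=500, k=3)
```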
The problem of refuting random K-SAT formulas has been extensively studied during the last 50 years. It is not hard to see that the problem gets easier as the number of constraints gets larger. The currently best known algorithms (Feige and Ofek, 2004; Coja-Oghlan et al., 2004, 2010) can only refute random instances with roughly n^1.5 constraints for K = 3 and roughly n^(⌈K/2⌉) constraints for larger K. In light of that, Feige (2002) made the assumption that for K = 3, refuting random instances with Cn constraints, for every constant C, is hard (and used that to prove hardness of approximation results). Here, we put forward the following assumption.
Assumption 2.1.
Refuting random K-SAT formulas with n^f(K) constraints is hard for some f(K) = ω(1). Namely, for every d > 0 there is K such that refuting random K-SAT formulas with n variables and n^d constraints is hard.
Terminology 2.2.
A computational problem is RSAT-hard if its tractability refutes Assumption 2.1.
In addition to the performance of the best known algorithms, there is plenty of evidence for the above assumption, in the form of hardness-of-approximation results and lower bounds on various algorithms, including resolution, convex hierarchies, sum-of-squares, statistical algorithms, and more. We refer the reader to Daniely and Shalev-Shwartz (2016) for a more complete discussion.
2.3 Learning hypothesis classes and random neural networks
Let be an hypothesis class. We say that a learning algorithm efficiently (PAC) learns if for every target function and every distribution over , given only access to examples where , the algorithm runs in time polynomial in and returns with probability at least (over the internal randomness of ), a predictor such that
For a real matrix of size , let be the function , where is the ReLU function, and is the clipping operation on the interval . This corresponds to depth- networks with hidden neurons, with no bias in the first layer, and where the outputs of the first layer are simply summed and moved through a clipping non-linearity (this operation can also be easily implemented using a second layer composed of two ReLU neurons).
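For concreteness, the following is a minimal sketch of this architecture in code, under our own conventions (the columns of the weight matrix are the weight vectors of the hidden neurons, and the clipping is assumed to be to the interval [0, 1]; the exact interval is not essential to the construction).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(W, x):
    """Depth-2 network: W is a (d, q) matrix whose q columns are the weight
    vectors of the hidden neurons (no bias in the first layer); the hidden
    outputs are summed and clipped to [0, 1] (assumed clipping interval)."""
    hidden = relu(W.T @ x)            # q hidden ReLU units
    return np.clip(np.sum(hidden), 0.0, 1.0)

# A random network in the spirit of Theorem 3.1: i.i.d. Gaussian entries.
rng = np.random.default_rng(0)
d, q = 20, 5
W = rng.normal(size=(d, q))
x = rng.normal(size=d)
print(network(W, x))
```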
Let be a distribution over real matrices of size . We assume that . We say that a learning algorithm efficiently learns a random neural network with respect to (-random network, for short), if it satisfies the following property. For a random matrix drawn according to , and every distribution (that may depend on ) over , given only access to examples where , the algorithm runs in time polynomial in and returns with probability at least over the choice of and the internal randomness of , a predictor such that
Remark 2.1.
Learning an hypothesis class requires that for every target function in the class and every input distribution the learning algorithm succeeds w.h.p., while learning a random neural network requires that for a random target function and every input distribution the learning algorithm succeeds w.h.p. Thus, in the former case an adversary chooses both the target function and the input distribution, and in the latter the target function is chosen randomly and then the adversary chooses the input distribution. Therefore, the requirement from the algorithm for learning random neural networks is weaker than the requirement for learning neural networks in the standard PAC-learning model. We show hardness of learning already under this weaker requirement. Then, we show that hardness of learning random neural networks implies hardness of learning neural networks with “natural” weights under the standard PAC-learning model.
2.4 Notations and terminology
We denote by the uniform distribution over the interval in ; by the symmetric Bernoulli distribution, namely, ; by the normal distribution with mean and variance , and by the multivariate normal distribution with mean and covariance matrix . We say that a distribution over is symmetric if it is continuous and its density p satisfies p(x) = p(−x) for every x, or if it is discrete and assigns equal probability to x and −x for every x. For a matrix we denote by and the minimal and maximal singular values of . For we denote by its norm. We denote the dimensional unit sphere by . For let . We say that an algorithm is efficient if it runs in polynomial time.
3 Results
We show RSAT-hardness for learning -random networks, where corresponds either to a fully-connected layer or to a convolutional layer. It implies hardness of learning depth-2 neural networks whose weights satisfy some natural property. We focus on the case of networks with hidden neurons. We note, however, that our results can be extended to networks with hidden neurons, for any . Moreover, while we consider networks whose weights in the second layer are all 1, our results can be easily extended to networks where the weights in the second layer are arbitrary or random positive numbers.
3.1 Fully-connected neural networks
We start with random fully-connected neural networks. First, we consider a distribution over real matrices, such that the entries are drawn i.i.d. from a symmetric distribution.
We say that a random variable is -subgaussian for some if for all we have
Theorem 3.1.
Let be a symmetric random variable with variance . Assume that the random variable is -subgaussian for some fixed . Let be a small constant, let , and let be a distribution over , such that the entries are i.i.d. copies of . Then, learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
Since the normal distribution, the uniform distribution over an interval, and the symmetric Bernoulli distribution are subgaussian (Rivasplata (2012)), we have the following corollary.
Corollary 3.1.
Let be a small constant, let , and let be a distribution over , such that the entries are drawn i.i.d. from a distribution .
1. If , then learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
2. If or , then learning a -random neural network is RSAT-hard, even if the distribution is over vectors of norm at most in .
In the following theorem, we consider the case where is such that each column is drawn i.i.d. from a multivariate normal distribution.
Theorem 3.2.
Let be a positive definite matrix of size , and let be its minimal eigenvalue. Let be a small constant, let , and let be a distribution over , such that each column is drawn i.i.d. from . Then, learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
We also study the case where the distribution is such that each column is drawn i.i.d. from the uniform distribution on the sphere of radius in .
Theorem 3.3.
Let and let be a distribution over , such that each column is drawn i.i.d. from the uniform distribution over . Then, learning a -random neural network is RSAT-hard, already if the distribution is over vectors of norm at most in .
From the above theorems we have the following corollary, which shows that learning neural networks (in the standard PAC-learning model) is hard already if the weights satisfy some natural property.
Corollary 3.2.
Let , and let be a distribution over from Theorems 3.2, 3.3, or from Corollary 3.1. Let be a property that holds with probability at least for a matrix drawn from . Let be an hypothesis class. Then, learning is RSAT-hard, already if the distribution is over vectors of norm bounded by the appropriate expression from Theorems 3.2, 3.3, or Corollary 3.1.
The corollary follows easily from the following argument: Assume that learns . If a matrix satisfies then for every distribution , given access to examples where , the algorithm returns with probability at least a predictor such that
Now, let . Since satisfies with probability at least , then for every distribution , given access to examples where , the algorithm returns with probability at least a predictor that satisfies the above inequality. Hence learns a -random neural network. Note that this argument holds also if is over vectors of a bounded norm.
3.2 Convolutional neural networks
We now turn to random Convolutional Neural Networks (CNN). Here, the distribution corresponds to a random convolutional layer. Our convolutional layer has a very simple structure with non-overlapping patches. Let be an integer that divides , and let . We denote by the CNN
Note that this CNN has hidden neurons. A random CNN corresponds to a random vector . Let be a distribution over . A -random CNN with hidden neurons is the CNN where is drawn from . Note that every such distribution over CNNs can be expressed by an appropriate distribution over matrices. Our results hold also if we replace the second layer of with a max-pooling layer (instead of summing and clipping).
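For concreteness, a sketch of this convolutional architecture under the same conventions as before (a single filter applied to non-overlapping patches, followed by summing and clipping to [0, 1]; all names are ours):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def cnn(w, x):
    """CNN with a single filter w of size d applied to the non-overlapping
    patches of x (whose dimension must be a multiple of d), followed by
    summing the hidden units and clipping to [0, 1] (assumed)."""
    d = len(w)
    assert len(x) % d == 0
    patches = x.reshape(-1, d)            # one row per patch
    hidden = relu(patches @ w)            # one hidden unit per patch
    return np.clip(np.sum(hidden), 0.0, 1.0)

# Example: a filter drawn uniformly from the unit sphere in R^d.
rng = np.random.default_rng(0)
d, num_patches = 8, 4
w = rng.normal(size=d)
w /= np.linalg.norm(w)
x = rng.normal(size=d * num_patches)
print(cnn(w, x))
```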
We start with random CNNs, where each component in the weight vector is drawn i.i.d. from a symmetric distribution over . In the following theorem, the function bounds the concentration of , and is needed in order to bound the support of the distribution .
Theorem 3.4.
Let . Let be a distribution over such that each component is drawn i.i.d. from a symmetric distribution over . Let be a function such that . Then, learning a -random CNN with hidden neurons is RSAT-hard, already if the distribution is over vectors of norm at most in .
If is the uniform distribution then we can choose , if it is the normal distribution then we can choose , and if it is the symmetric Bernoulli distribution then we can choose . Thus, we obtain the following corollary.
Corollary 3.3.
Let be a distribution such that each component is drawn i.i.d. from a distribution over .
1. If , then learning a -random CNN with hidden neurons is RSAT-hard, already if is over vectors of norm at most .
2. If , then learning a -random CNN with hidden neurons is RSAT-hard, already if is over vectors of norm at most .
3. If , then learning a -random CNN with hidden neurons is RSAT-hard, already if is over vectors of norm at most .
We also consider the case where is a CNN such that is drawn from a multivariate normal distribution .
Theorem 3.5.
Let be a positive definite matrix of size , and let be its minimal eigenvalue. Let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most in .
Finally, we study the case where is a CNN such that is drawn from the uniform distribution over the sphere.
Theorem 3.6.
Let be the uniform distribution over the sphere of radius in . Let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most in .
Now, the following corollary follows easily (from the same argument as in Corollary 3.2), and shows that learning CNNs (in the standard PAC-learning model) is hard already if the weights satisfy some natural property.
Corollary 3.4.
Let be a distribution over from Theorems 3.5, 3.6, or from Corollary 3.3. Let . Let be a property that holds with probability at least for a vector drawn from . Let be an hypothesis class. Then, learning is RSAT-hard, already if the distribution is over vectors of norm bounded by the appropriate expression from Theorems 3.5, 3.6, or Corollary 3.3.
3.2.1 Improving the bounds on the support of
By increasing the number of hidden neurons from to we can improve the bounds on the support of . Note that our results so far on learning random CNNs are for CNNs with input dimension where is the size of the patches. We now consider CNNs with input dimension for some integer . Note that such CNNs have hidden neurons.
Assume that there is an efficient algorithm for learning -random CNNs with input dimension , where is a distribution over . Assume that uses samples with at most inputs. We show an algorithm for learning a -random CNN with . Let be a sample, and let where for every vector , the vector is obtained from by padding it with zeros. Thus, . Note that . Also, note that for every we have . Hence, is realizable by the CNN . Now, given , the algorithm runs on and returns an hypothesis .
Therefore, if learning -random CNNs with input dimension is hard already if the distribution is over vectors of norm at most , then learning -random CNNs with input dimension is hard already if the distribution is over vectors of norm at most . Hence we have the following corollaries.
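The padding argument above can be made concrete with the following sketch (function names are ours); padding with zeros changes neither the labels produced by the larger CNN nor the Euclidean norm of the examples.

```python
import numpy as np

def pad_example(x, target_dim):
    """Pad x with zeros up to target_dim. The extra (all-zero) patches
    contribute ReLU(<w, 0>) = 0 to the sum, so the label assigned by the
    larger CNN is unchanged."""
    padded = np.zeros(target_dim)
    padded[:len(x)] = x
    return padded

def pad_sample(sample, target_dim):
    """sample: list of (x, y) pairs; returns the zero-padded sample."""
    return [(pad_example(x, target_dim), y) for x, y in sample]

# Padding also leaves the Euclidean norm unchanged, so the norm bound on the
# input distribution carries over to the larger input dimension.
x = np.random.default_rng(0).normal(size=8)
assert np.isclose(np.linalg.norm(x), np.linalg.norm(pad_example(x, 40)))
```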
Corollary 3.5.
Let be a distribution over such that each component is drawn i.i.d. from a distribution over . Let for some integer , and let .
1. If , then learning a -random CNN (with hidden neurons) is RSAT-hard, already if is over vectors of norm at most .
2. If , then learning a -random CNN (with hidden neurons) is RSAT-hard, already if is over vectors of norm at most .
Corollary 3.6.
Let be a positive definite matrix of size , and let be its minimal eigenvalue. Let for some integer , and let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most .
Corollary 3.7.
Let be the uniform distribution over the sphere of radius in . Let for some integer , and let . Then, learning a -random CNN (with hidden neurons) is RSAT-hard, already if the distribution is over vectors of norm at most .
As an example, consider a CNN with . Note that since the patch size is , each hidden neuron has input neurons feeding into it. Consider a distribution over such that each component is drawn i.i.d. from a normal distribution with . This distribution corresponds to the standard Xavier initialization. Then, by Corollary 3.5, learning a -random CNN is RSAT-hard, already if is over vectors of norm at most . By choosing an appropriate , we have that learning a -random CNN is RSAT-hard, already if is over vectors of norm at most .
Finally, note that Corollary 3.4 holds also for the values of and the bounds on the support of from Corollaries 3.5, 3.6 and 3.7.
Open Questions
Obvious open questions arising from our work are to obtain sharper norm bounds on the input distribution, and to avoid the non-linearity in the second layer (the clipping operation). Another direction is to extend the results to more weights distributions and network architectures. A potential positive result is to find some useful “unnatural” properties of the network’s weights that allow efficient learning.
4 Main proof ideas
Let where . By Daniely and Shalev-Shwartz (2016), learning is hard. We want to reduce the problem of learning to learning a -random neural network for some fixed . Let and let be a sample. Let be a distribution over the group of invertible matrices, and let . Consider the sample where for every we have . Since , we have . Thus, . Note that is a random matrix. Now, assume that there is an algorithm that learns successfully from . Consider the following algorithm . Given a sample , the algorithm runs on , and returns the hypothesis . It is not hard to show that learns successfully from . Since our goal is to reduce the problem of learning to learning a -random network where is a fixed distribution, we need to be a -random matrix. However, the distribution of depends on both and (which is an unknown matrix).
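The following sketch illustrates this transformation in one concrete instantiation (our own notation, not the exact construction used later): a sample realizable by a weight matrix W is mapped, via an invertible matrix A, to a sample realizable by the transformed weights (A^{-1})^T W, since ⟨(A^{-1})^T w, A x⟩ = ⟨w, x⟩.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def network(W, x):
    # Depth-2 network: sum of ReLU units, then clipping to [0, 1] (assumed).
    return np.clip(np.sum(relu(W.T @ x)), 0.0, 1.0)

rng = np.random.default_rng(0)
d, q = 6, 3
W = rng.choice([-1.0, 1.0], size=(d, q))   # weights with a "special structure"
A = rng.normal(size=(d, d))                # random matrix, invertible w.h.p.

x = rng.normal(size=d)
x_new = A @ x                              # transformed example
W_new = np.linalg.inv(A).T @ W             # transformed ("random-like") weights

# The label is preserved, so the transformed sample is realizable by W_new.
assert np.isclose(network(W, x), network(W_new, x_new))
```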
Hence, the challenge is to find a reduction that translates a sample that is realizable by to a sample that is realizable by a -random network, without knowing . In order to obtain such a reduction, we proceed in two steps. First, we show that learning neural networks of the form where , is hard already if we restrict to a set of matrices with a special structure. Then, we show a distribution such that if and has the special structure, then . This property, as we showed, enables us to reduce the problem of learning such special-structure networks to the problem of learning -random networks.
In order to obtain a special structure for , consider the class . Note that the CNNs in have hidden neurons. The networks in have three important properties: (1) They are CNNs; (2) Their patches are non-overlapping; (3) The components in the filter are in . Hardness of learning can be shown by a reduction from the RSAT problem. We defer the details of this reduction to the next sections. Let be the matrix of size that corresponds to , namely . Note that for every we have , and for every other .
We now show a distribution such that if and has a structure that corresponds to , then . We start with the case where is a distribution over matrices such that each column is drawn i.i.d. from the uniform distribution on the sphere. We say that a matrix of size is a diagonal-blocks matrix if
where each block is a diagonal matrix . We denote , and . Note that for every , the vector contains all the entries on the diagonals of blocks in the -th column of blocks in . Let be a distribution over diagonal-blocks matrices, such that the vectors are drawn i.i.d. according to the uniform distribution on . Let be a matrix that corresponds to . Note that the columns of are i.i.d. copies from the uniform distribution on . Indeed, denote . Then, for every line index we denote , where are integers and . Thus, is the line index of the block in that corresponds to the -th line in , and is the line index within the block. Now, note that
Since , and since the uniform distribution on a sphere does not change by multiplying a subset of components by , then the -th column of has the same distribution as , namely, the uniform distribution over . Also, the columns of are independent. Thus, .
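A sketch of the diagonal-blocks construction and of the sign-flip argument above, with our own (illustrative) block count, block size and indexing conventions:

```python
import numpy as np

def uniform_on_sphere(dim, radius, rng):
    v = rng.normal(size=dim)
    return radius * v / np.linalg.norm(v)

def diagonal_blocks_matrix(k, d, radius, rng):
    """Build a (k*d) x (k*d) matrix made of k x k blocks, each a d x d
    diagonal matrix: the diagonals of the blocks in the j-th column of
    blocks are the d-sized chunks of a single vector drawn uniformly
    from the sphere of the given radius in R^(k*d)."""
    M = np.zeros((k * d, k * d))
    for j in range(k):
        u = uniform_on_sphere(k * d, radius, rng)
        for i in range(k):
            M[i * d:(i + 1) * d, j * d:(j + 1) * d] = np.diag(u[i * d:(i + 1) * d])
    return M

rng = np.random.default_rng(0)
k, d = 3, 4
M = diagonal_blocks_matrix(k, d, radius=1.0, rng=rng)

# Multiplying M by a vector with +-1 entries supported on a single column of
# blocks only flips signs of coordinates of the underlying sphere vector, so
# the resulting column has the same norm (and the same distribution).
s = rng.choice([-1.0, 1.0], size=d)
col = np.zeros(k * d)
col[:d] = s
print(np.linalg.norm(M @ col))   # equals the sphere radius, here 1.0
```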
The case where is a distribution over matrices such that the entries are drawn i.i.d. from a symmetric distribution (such as , or ) can be shown in a similar way. The result for the case where is such that each column is drawn i.i.d. from a multivariate normal distribution cannot be obtained in the same way, since a might be sensitive to multiplication of its components by . However, recall that a vector in whose components are i.i.d. copies from has a multivariate normal distribution . Now, since every multivariate normal distribution can be obtained from by a linear transformation, we are able to show hardness also for the case where is such that each column is drawn i.i.d. from .
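The linear transformation used here is the standard “coloring” of a standard Gaussian vector: if z has i.i.d. standard normal entries and Σ has spectral decomposition U D U^T, then U D^{1/2} U^T z is a centered multivariate normal vector with covariance Σ. A small numerical sketch of this standard fact, with arbitrary illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# An arbitrary positive definite covariance matrix Sigma.
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)

# Spectral decomposition Sigma = U diag(eigvals) U^T, and its square root.
eigvals, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(eigvals)) @ U.T

# "Color" i.i.d. standard normal columns: each column of Sigma_half @ Z is
# a centered multivariate normal vector with covariance Sigma.
Z = rng.normal(size=(d, 100000))
X = Sigma_half @ Z
print(np.round(np.cov(X), 2))   # approximately equal to Sigma
print(np.round(Sigma, 2))
```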
Recall that in our reduction we translate to where for every we have . Therefore, we need to show that our choice of is such that it is invertible with high probability. Also, since we want to show hardness already if the input distribution is supported on a bounded domain, then we need to bound the norm of , with high probability over the choice of . This task requires a careful analysis of the spectral norm of , namely, of .
The proofs of the results regarding random CNNs follow the same ideas. The main difference is that in this case, instead of multiplying by , we multiply each patch in by an appropriate matrix.
Proof structure
We start with a few definitions. We say that a sample is scattered if are independent fair coins (in particular, they are independent of ). We say that is contained in if for every .
Let be an algorithm whose input is a sample and whose output is either “scattered” or “-realizable”, where is an hypothesis class. We say that distinguishes size- -realizable samples from scattered samples if the following holds.
• If the sample is scattered, then the algorithm outputs “scattered” with probability at least 3/4, where the probability is over the choice of the labels and the randomness of the algorithm.
• If the sample is realizable, namely, some hypothesis in the class labels every example in the sample correctly, then the algorithm outputs “-realizable” with probability at least 3/4, where the probability is over the randomness of the algorithm.
We denote by the problem of distinguishing size- -realizable samples that are contained in from scattered samples.
Now, let be an algorithm whose input is a sample and whose output is either “scattered” or “-realizable”. We say that distinguishes size- -realizable samples from scattered samples if the following holds.
• If the sample is scattered, then the algorithm outputs “scattered” with probability at least 3/4, where the probability is over the choice of the labels and the randomness of the algorithm.
• If the sample is realizable by the random network, namely, the network whose weights are a random matrix drawn from labels every example in the sample correctly, then the algorithm outputs “-realizable” with probability at least 3/4, where the probability is over the choice of the random weights and the randomness of the algorithm.
We denote by the problem of distinguishing size- -realizable samples that are contained in from scattered samples. In the case of random CNNs, we denote by the problem of distinguishing size- scattered samples that are contained in , from samples that are realizable by a random CNN where .
Recall that . As we described, hardness of learning -random neural networks where the distribution is supported on a bounded domain, can be shown by first showing hardness of learning with some , where the distribution is supported on some , and then reducing this problem to learning -random networks where the distribution is supported on some . We can show RSAT-hardness of learning by using the methodology of Daniely and Shalev-Shwartz (2016) as follows: First, show that if there is an efficient algorithm that learns where the distribution is supported on , then there is a fixed and an efficient algorithm that solves , and then show that for every fixed , the problem is RSAT-hard.
Our proof follows a slightly different path than the one described above. First, we show that if there is an efficient algorithm that learns -random neural networks where the distribution is supported on , then there is a fixed and an efficient algorithm that solves . Then, we show that for every fixed , the problem with some , is RSAT-hard. Finally, for the required matrix distributions and sets , we show a reduction from to . The main difference between this proof structure and the one described in the previous paragraph, is that here, for every distribution we need to show a reduction from to (which are decision problems), and in the previous proof structure for every we need to show a reduction between learning and learning -random networks. We chose this proof structure since here the proof for each is a reduction between decision problems, and thus we found the proofs to be simpler and cleaner this way. Other than this technical difference, both proof structures are essentially similar and follow the same ideas.
The case of random CNNs is similar, except that here we show, for each distribution over vectors, a reduction from to .
5 Proofs
5.1 Learning -random networks is harder than
Theorem 5.1.
Let be a distribution over matrices. Assume that there is an algorithm that learns -random neural networks, where the distribution is supported on . Then, there is a fixed and an efficient algorithm that solves .
Proof.
Let be an efficient learning algorithm that learns -random neural networks where the distribution is supported on . Let be such that uses a sample of size at most . Let . Let be a sample that is contained in . We will show an efficient algorithm that distinguishes whether is scattered or -realizable. This implies that the theorem holds for such that .
Given , the algorithm learns a function by running with an examples oracle that generates examples by choosing a random (uniformly distributed) example . We denote . Now, if , then returns that is -realizable, and otherwise it returns that it is scattered. Clearly, the algorithm runs in polynomial time. We now show that if is -realizable then recognizes it with probability at least , and that if is scattered then it also recognizes it with probability at least .
Assume first that is -realizable. Let be the uniform distribution over from . In this case, since is supported on , we are guaranteed that with probability at least over the choice of and the internal randomness of , we have . Therefore, the algorithm returns “-realizable”.
Now, assume that is scattered. Let be the function returned by . Let be the following function. For every , if then , and otherwise . Note that for every , if then . Therefore, . Let be the set of indices of that were not observed by . Note that given , the events are independent of one another, and each has probability . By the Hoeffding bound, we have that for at most fraction of the indices in with probability at most
Thus, for at least fraction of the indices in with probability at least . Hence,
Therefore, for large enough , with probability at least we have , and thus the algorithm returns “scattered”. ∎
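A schematic version of the distinguisher used in this proof (the acceptance threshold, the {0, 1} label convention and the rounding of the learner’s real-valued output at 1/2 are placeholders; only the structure matters: train on uniformly drawn examples from the given sample, then compare the empirical error on the sample to a threshold):

```python
import numpy as np

def distinguish(sample, learner, threshold=0.25, rng=None):
    """sample: list of (x, y) pairs with y in {0, 1}; learner: a function
    that, given an example oracle, returns a predictor h.
    Returns "realizable" if the empirical error of h on the sample is below
    the threshold, and "scattered" otherwise (the threshold is illustrative)."""
    if rng is None:
        rng = np.random.default_rng(0)

    def oracle():
        # Examples are drawn uniformly at random from the given sample.
        return sample[rng.integers(len(sample))]

    h = learner(oracle)
    # Round the (possibly real-valued) predictions to labels, as in the proof.
    err = np.mean([float((1.0 if h(x) >= 0.5 else 0.0) != y) for x, y in sample])
    return "realizable" if err < threshold else "scattered"
```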
5.2 is RSAT-hard
For a predicate we denote by the problem whose instances are collections, , of constraints, each of which is either a constraint or a constraint, and the goal is to maximize the number of satisfied constraints. Denote by the problem of distinguishing satisfiable from random formulas with variables and constraints (as before, in order to succeed, an algorithm must return “satisfiable” w.p. at least on every satisfiable formula and “random” w.p. at least on random formulas). Here, in a random formula, each constraint is chosen w.p. 1/2 to be a uniform constraint and w.p. 1/2 a uniform constraint.
We will consider the predicate defined by
We will need the following lemma from Daniely and Shalev-Shwartz (2016). For an overview of its proof, see Appendix A.
Lemma 5.1.
Daniely and Shalev-Shwartz (2016) Let with , and let and be fixed integers. The problem can be efficiently reduced to the problem .
In the following lemma, we use Lemma 5.1 in order to show RSAT-hardness of with some appropriate and .
Lemma 5.2.
Let , and let be a fixed integer. The problem , where is the ball of radius in , is RSAT-hard.
Proof.
By Assumption 2.1, there is such that is hard, where the -SAT formula is over variables. Then, by Lemma 5.1, the problem is also hard. We will reduce to . Since , it would imply that is RSAT-hard.
Let be an input for . Namely, each constraint is either a CNF or a DNF formula. Equivalently, can be written as where for every , if is a DNF formula then and , and if is a CNF formula then is the DNF obtained by negating , and . Given as above, we encode each DNF formula (with clauses) as a vector such that each clause in (a signed -tuple) is encoded by a vector as follows. First, we have . Then, for every we have , and for every variable that does not appear in the clause we have . Thus, for every , the value of indicates whether the -th variable appears in the clause as a positive literal, a negative literal, or does not appear. The encoding of is the concatenation of the encodings of its clauses.
Let . If is random then is scattered, since each constraint is with probability 1/2 a DNF formula, and with probability 1/2 a CNF formula, and this choice is independent of the choice of the literals in . Assume now that is satisfiable by an assignment . Let . Note that is realizable by the CNN with hidden neurons. Indeed, if is the encoding of a clause of , then if all the literals in the clause are satisfied by , and otherwise . Therefore, .
Note that by our construction, for every we have for large enough
∎
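The precise clause encoding is only partially visible in the text above; the following is one natural reconstruction (the leading coordinate value −(k−1) and the placement of the assignment in the filter are our assumptions) with the property used in the proof: the ReLU of the inner product between the filter and the encoding of a clause is 1 if all the literals of the clause are satisfied by the assignment, and 0 otherwise.

```python
import numpy as np

def encode_clause(clause, n, k):
    """Encode a width-k DNF term (a list of signed 1-based variable indices)
    as a vector of length n + 1: the leading coordinate is -(k - 1), and
    coordinate i is +1 / -1 if variable i appears positively / negatively,
    and 0 if it does not appear (these conventions are our reconstruction)."""
    enc = np.zeros(n + 1)
    enc[0] = -(k - 1)
    for lit in clause:
        enc[abs(lit)] = 1.0 if lit > 0 else -1.0
    return enc

def filter_from_assignment(psi):
    """psi: a +-1 assignment of length n; the filter is (1, psi)."""
    return np.concatenate(([1.0], psi))

n, k = 6, 3
psi = np.array([1.0, -1.0, 1.0, 1.0, -1.0, -1.0])
clause = [1, -2, 3]                     # the DNF term: x1 AND (NOT x2) AND x3
w = filter_from_assignment(psi)
# <w, enc> = 1 if all k literals are satisfied by psi, and <= -1 otherwise,
# so ReLU(<w, enc>) is the indicator that the term is satisfied.
print(max(0.0, float(w @ encode_clause(clause, n, k))))   # 1.0
# A DNF with several terms is encoded by concatenating the term encodings
# into non-overlapping patches; the CNN sums these indicators and clips to
# [0, 1], so it outputs 1 exactly when the DNF is satisfied by psi.
```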
5.3 Hardness of learning random fully-connected neural networks
Let . We say that a matrix of size is a diagonal-blocks matrix if
where each block is a diagonal matrix . For every let . Let be the submatrix of obtained by selecting the rows and columns in . Thus, is a matrix of size . For let be the restriction of to the coordinates .
Lemma 5.3.
Let be a diagonal-blocks matrix. Then,
Proof.
For every with we have
Hence, . ∎
5.3.1 Proof of Theorem 3.1
Let be a diagonal-blocks matrix, where each block is a diagonal matrix . Assume that for all the entries are i.i.d. copies of a random variable that has a symmetric distribution with variance . Also, assume that the random variable is -subgaussian for some fixed .
Lemma 5.4.
Proof.
Let . By Lemma 5.3, we have
(1)
Note that for every , all entries of the matrix are i.i.d. copies of .
Now, we need the following theorem:
Theorem 5.2.
Rudelson and Vershynin (2008) Let be a real random variable with expectation and variance , and assume that is -subgaussian for some . Let be an matrix whose entries are i.i.d. copies of . Then, for every we have
where and depend only on .
Lemma 5.5.
Let be a distribution over such that each entry is drawn i.i.d. from . Note that a -random network has hidden neurons. Let be a fixed integer. Then, is RSAT-hard, where is the ball of radius in .
Proof.
By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We will reduce this problem to . Given a sample with for every , we will, with probability , construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable. Note that our reduction is allowed to fail with probability . Indeed, distinguishing scattered from realizable requires success with probability and therefore reductions between such problems are not sensitive to a failure with probability .
Assuming that is invertible (note that by Lemma 5.4 it holds with probability ), let where for every we have . Note that if is scattered then is also scattered.
Assume that is realizable by the CNN with . Let be the matrix of size such that . Thus, where for every we have , and for every other . Let . Note that is realizable by . Indeed, for every we have , and . Also, note that the entries of are i.i.d. copies of . Indeed, denote . Then, for every line we denote , where are integers and . Thus, is the line index of the block in that corresponds to the -th line in , and is the line index within the block. Now, note that
Since is symmetric and , we have independently from the other entries. Thus, . Therefore, is a -random network.
Lemma 5.6.
Let be a distribution over with , such that each entry is drawn i.i.d. from . Let be a fixed integer, and let be a small constant. Then, is RSAT-hard, where is the ball of radius in .
Proof.
For integers we denote by the distribution over such that each entry is drawn i.i.d. from . Let , and let . By Lemma 5.5, the problem is RSAT-hard, where , and is the ball of radius in . We reduce this problem to , where is the ball of radius in . Note that .
Let with . For every , let be the vector obtained from by padding it with zeros. Thus, . Note that . Let . If is scattered then is also scattered. Note that if is realizable by then is realizable by where is obtained from by appending arbitrary lines. Assume that is -realizable, that is, . Then, is realizable by where is obtained from by appending lines such that each component is drawn i.i.d. from , and therefore, is -realizable. Finally, for every we have
∎
5.3.2 Proof of Theorem 3.2
Let be a distribution over with , such that each entry is drawn i.i.d. from . Let be a fixed integer. By Lemma 5.6, we have that is RSAT-hard, where is the ball of radius in . Let be the distribution over where each component is drawn i.i.d. from . Recall that (Tong (2012)). Therefore, in the distribution , the columns are drawn i.i.d. from . Let be a distribution over , such that each column is drawn i.i.d. from . By Theorem 5.1, we need to show that is RSAT-hard, where is the ball of radius in . We show a reduction from to .
Let be a sample. Let be the spectral decomposition of , and let . Recall that if then (Tong (2012)). For every , let , and let . Note that if is scattered then is also scattered. If is realizable by a -random network , then let . Note that is realizable by . Indeed, for every we have . Let and let . Since then for every . Now, since , we have for every that (i.i.d.). Therefore, , and thus . Hence, is -realizable.
We now bound the norms of the vectors in . Note that for every we have
5.3.3 Proof of Theorem 3.3
Let , and let be a diagonal-blocks matrix, where each block is a diagonal matrix . We denote , and . Note that for every , the vector contains all the entries on the diagonals of blocks in the -th column of blocks in . Assume that the vectors are drawn i.i.d. according to the uniform distribution on .
Lemma 5.7.
For some universal constant we have
Proof.
Let . For every , let be the vector that contains all the entries on the diagonals of blocks in the -th column of blocks in . That is, . Note that the vectors are i.i.d. copies from the uniform distribution on . By Lemma 5.3, we have
(2)
Note that for every , all columns of the matrix are projections of the vectors on the coordinates. That is, the -th column in is obtained by drawing from the uniform distribution on and projecting on the coordinates .
We say that a distribution is isotropic if it has mean zero and its covariance matrix is the identity. The covariance matrix of the uniform distribution on is . Therefore, the uniform distribution on is isotropic. We will need the following theorem.
Theorem 5.3.
Adamczak et al. (2012) Let and let be an matrix with independent columns drawn from an isotropic log-concave distribution. For every we have
where c and C are positive universal constants.
We show that the distribution of the columns of is isotropic and log-concave. First, since the uniform distribution on is isotropic, then its projection on a subset of coordinates is also isotropic, and thus the distribution of the columns of is isotropic. In order to show that it is log-concave, we analyze its density. Let be a random variable whose distribution is the projection of a uniform distribution on on coordinates. It is known that the probability density of is (see Fang (2018))
where . Recall that the columns of are projections of the uniform distribution over , namely, the sphere of radius and not the unit sphere. Thus, let . The probability density of is
where . We denote
By replacing with we have
Hence, we have
Since , we need to show that the function
(3)
(where ) is concave. This function can be written as , where
Recall that if is concave and non-decreasing, and is concave, then their composition is also concave. Clearly, and satisfy these conditions, and thus the function in Eq. 3 is concave. Hence is log-concave.
Let be a distribution over such that each column is drawn i.i.d. from the uniform distribution on . Note that a -random network has hidden neurons. Now, Theorem 3.3 follows immediately from Theorem 5.1 and the following lemma.
Lemma 5.8.
Let be a fixed integer. Then, is RSAT-hard, where is a ball of radius in .
Proof.
By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We will reduce this problem to . Given a sample with for every , we will, with probability , construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable. Note that our reduction is allowed to fail with probability . Indeed, distinguishing scattered from realizable requires success with probability and therefore reductions between such problems are not sensitive to a failure with probability .
Assuming that is invertible (by Lemma 5.7 it holds with probability ), let where for every we have . Note that if is scattered then is also scattered.
Assume that is realizable by the CNN with . Let be the matrix of size such that . Thus, where for every we have , and for every other . Let . Note that is realizable by . Indeed, for every we have , and . Also, note that the columns of are i.i.d. copies from the uniform distribution on . Indeed, denote . Then, for every line index we denote , where are integers and . Thus, is the line index of the block in that corresponds to the -th line in , and is the line index within the block. Now, note that
Since , and since the uniform distribution on a sphere does not change by multiplying a subset of components by , then the -th column of has the same distribution as , namely, the uniform distribution over . Also, the columns of are independent. Thus, , and therefore is a -random network.
5.4 Hardness of learning random convolutional neural networks
5.4.1 Proof of Theorem 3.4
Lemma 5.9.
Let be a fixed integer. Then, is RSAT-hard, where is the ball of radius in .
Proof.
By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We will reduce this problem to . Given a sample with for every , we will, with probability , construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable. Note that our reduction is allowed to fail with probability . Indeed, distinguishing scattered from realizable requires success with probability and therefore reductions between such problems are not sensitive to a failure with probability .
Let where each is drawn i.i.d. from . Let be a diagonal matrix. Note that is invertible with probability , since for every we have . Now, for every from , denote where for every we have . Let , and let . Note that if is scattered then is also scattered. If is realizable by a CNN , then let . Note that is realizable by . Indeed, for every and we have . Also, note that since and is symmetric, then has the distribution , and thus is a -random CNN.
The probability that has some component with , is at most . Therefore, with probability we have for every that
Thus, . ∎
5.4.2 Proof of Theorem 3.5
Assume that the covariance matrix is of size , and let . Note that a -random CNN has hidden neurons. Let be a distribution over such that each component is drawn i.i.d. from . Let be a fixed integer. By Lemma 5.9 and by choosing , we have that is RSAT-hard, where is the ball of radius in . Note that (Tong (2012)). By Theorem 5.1, we need to show that is RSAT-hard, where is the ball of radius in . We show a reduction from to .
Let be a sample. For every from , denote where for every we have . Let be the spectral decomposition of , and let . Recall that if then (Tong (2012)). Let , and let . Note that if is scattered then is also scattered. If is realizable by a -random CNN , then let . Note that is realizable by . Indeed, for every and we have . Since , the sample is -realizable.
We now bound the norms of in . Note that for every we have
Hence, .
5.4.3 Proof of Theorem 3.6
Let . Let be the uniform distribution on . Note that a -random CNN has hidden neurons. Let be a fixed integer. By Theorem 5.1, we need to show that is RSAT-hard, where is the ball of radius in . By Lemma 5.2, the problem where is the ball of radius in , is RSAT-hard. We reduce this problem to . Given a sample with for every , we construct a sample that is contained in , such that if is scattered then is scattered, and if is -realizable then is -realizable.
Let be a random orthogonal matrix of size . For every denote where for every we have . For every let , and let . Note that if is scattered then is also scattered. If is realizable by a CNN , then let . Note that is realizable by . Indeed, for every and we have
Also, note that since and is orthogonal, is a random vector on the sphere of radius in , and thus is a -random CNN.
Since is orthogonal then for every we have
Hence .
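One standard way to sample the random orthogonal matrix used in this proof is via the QR decomposition of a Gaussian matrix (with a sign correction on the columns to obtain the Haar distribution); orthogonality ensures that norms, and hence the uniform distribution on the sphere, are preserved. A small sketch:

```python
import numpy as np

def random_orthogonal(d, rng):
    """Sample a Haar-random d x d orthogonal matrix via the QR decomposition
    of a Gaussian matrix, with a sign correction on the columns."""
    G = rng.normal(size=(d, d))
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))

rng = np.random.default_rng(0)
d = 5
U = random_orthogonal(d, rng)

v = rng.normal(size=d)
v = 2.0 * v / np.linalg.norm(v)                 # uniform on the sphere of radius 2
assert np.allclose(U.T @ U, np.eye(d))          # U is orthogonal
assert np.isclose(np.linalg.norm(U @ v), 2.0)   # norms (and the sphere) are preserved
```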
References
- Adamczak et al. [2012] R. Adamczak, O. Guédon, A. Litvak, A. Pajor, and N. Tomczak-Jaegermann. Condition number of a square matrix with iid columns drawn from a convex body. Proceedings of the American Mathematical Society, 140(3):987–998, 2012.
- Agarwal et al. [2020] N. Agarwal, P. Awasthi, and S. Kale. A deep conditioning treatment of neural networks. arXiv preprint arXiv:2002.01523, 2020.
- Allen-Zhu et al. [2018a] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018a.
- Allen-Zhu et al. [2018b] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018b.
- Andoni et al. [2014] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In Proceedings of the 31st International Conference on Machine Learning, pages 1908–1916, 2014.
- Applebaum et al. [2008] B. Applebaum, B. Barak, and D. Xiao. On basing lower-bounds for learning on worst-case assumptions. In Foundations of Computer Science, 2008. FOCS’08. IEEE 49th Annual IEEE Symposium on, pages 211–220. IEEE, 2008.
- Arora et al. [2014] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In International Conference on Machine Learning, pages 584–592, 2014.
- Arora et al. [2019] S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.
- Brutzkus and Globerson [2017] A. Brutzkus and A. Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 605–614. JMLR. org, 2017.
- Brutzkus et al. [2017] A. Brutzkus, A. Globerson, E. Malach, and S. Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
- Cao and Gu [2019] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.
- Coja-Oghlan et al. [2004] A. Coja-Oghlan, A. Goerdt, and A. Lanka. Strong refutation heuristics for random k-sat. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pages 310–321. Springer, 2004.
- Coja-Oghlan et al. [2010] A. Coja-Oghlan, C. Cooper, and A. Frieze. An efficient sparse regularity concept. SIAM Journal on Discrete Mathematics, 23(4):2000–2034, 2010.
- Daniely [2016] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the forty-eighth annual ACM symposium on Theory of Computing, pages 105–117. ACM, 2016.
- Daniely [2017] A. Daniely. Sgd learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.
- Daniely [2019] A. Daniely. Neural networks learning and memorization with (almost) no over-parameterization. arXiv preprint arXiv:1911.09873, 2019.
- Daniely and Shalev-Shwartz [2016] A. Daniely and S. Shalev-Shwartz. Complexity theoretic limitations on learning dnf’s. In Conference on Learning Theory, pages 815–830, 2016.
- Daniely et al. [2014] A. Daniely, N. Linial, and S. Shalev-Shwartz. From average case complexity to improper learning complexity. In STOC, 2014.
- Daniely et al. [2016] A. Daniely, R. Frostig, and Y. Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In NIPS, 2016.
- Das et al. [2019] A. Das, S. Gollapudi, R. Kumar, and R. Panigrahy. On the learnability of deep random networks. arXiv preprint arXiv:1904.03866, 2019.
- Du and Goel [2018] S. S. Du and S. Goel. Improved learning of one-hidden-layer convolutional neural networks with overlaps. arXiv preprint arXiv:1805.07798, 2018.
- Du et al. [2017a] S. S. Du, J. D. Lee, and Y. Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017a.
- Du et al. [2017b] S. S. Du, J. D. Lee, Y. Tian, B. Poczos, and A. Singh. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017b.
- Du et al. [2018] S. S. Du, X. Zhai, B. Poczos, and A. Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
- Fang [2018] K. W. Fang. Symmetric multivariate and related distributions. Chapman and Hall/CRC, 2018.
- Feige [2002] U. Feige. Relations between average case complexity and approximation complexity. In Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, pages 534–543. ACM, 2002.
- Feige and Ofek [2004] U. Feige and E. Ofek. Easily refutable subformulas of large random 3cnf formulas. In Automata, languages and programming, pages 519–530. Springer, 2004.
- Feldman et al. [2006] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, 2006.
- Ge et al. [2019] R. Ge, R. Wang, and H. Zhao. Mildly overparametrized neural nets can memorize training data efficiently. arXiv preprint arXiv:1909.11837, 2019.
- Goel and Klivans [2017] S. Goel and A. Klivans. Learning neural networks with two nonlinear layers in polynomial time. arXiv preprint arXiv:1709.06010, 2017.
- Goel et al. [2018] S. Goel, A. Klivans, and R. Meka. Learning one convolutional layer with overlapping patches. arXiv preprint arXiv:1802.02547, 2018.
- Jacot et al. [2018] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
- Janzamin et al. [2015] M. Janzamin, H. Sedghi, and A. Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.
- Ji and Telgarsky [2019] Z. Ji and M. Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. arXiv preprint arXiv:1909.12292, 2019.
- Klivans and Sherstov [2006] A. R. Klivans and A. A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.
- Lee et al. [2019] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
- Li and Yuan [2017] Y. Li and Y. Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
- Ma et al. [2019] C. Ma, L. Wu, et al. A comparative analysis of the optimization and generalization property of two-layer neural network and random feature models under gradient descent dynamics. arXiv preprint arXiv:1904.04326, 2019.
- Oymak and Soltanolkotabi [2018] S. Oymak and M. Soltanolkotabi. Overparameterized nonlinear learning: Gradient descent takes the shortest path? arXiv preprint arXiv:1812.10004, 2018.
- Oymak and Soltanolkotabi [2019] S. Oymak and M. Soltanolkotabi. Towards moderate overparameterization: global convergence guarantees for training shallow neural networks. arXiv:1902.04674 [cs, math, stat], Feb. 2019. URL http://arxiv.org/abs/1902.04674. arXiv: 1902.04674.
- Rivasplata [2012] O. Rivasplata. Subgaussian random variables: An expository note. Internet publication, PDF, 2012.
- Rudelson and Vershynin [2008] M. Rudelson and R. Vershynin. The littlewood–offord problem and invertibility of random matrices. Advances in Mathematics, 218(2):600–633, 2008.
- Shamir [2018] O. Shamir. Distribution-specific hardness of learning neural networks. The Journal of Machine Learning Research, 19(1):1135–1163, 2018.
- Song and Yang [2019] Z. Song and X. Yang. Quadratic suffices for over-parametrization via matrix chernoff bound. arXiv preprint arXiv:1906.03593, 2019.
- Tian [2017] Y. Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3404–3413. JMLR. org, 2017.
- Tong [2012] Y. L. Tong. The multivariate normal distribution. Springer Science & Business Media, 2012.
- Xie et al. [2016] B. Xie, Y. Liang, and L. Song. Diverse neural network learns true target functions. arXiv preprint arXiv:1611.03131, 2016.
- Zou and Gu [2019] D. Zou and Q. Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.
Appendix A From to (Daniely and Shalev-Shwartz [2016])
We outline the main ideas of the reduction.
First, we reduce to . This is done as follows. Given an instance to , by a simple greedy procedure, we try to find disjoint subsets , such that for every , the subset consists of constraints and each variable appears in at most one of the constraints in . Now, from every we construct a -constraint that is the conjunction of all constraints in . If is random, this procedure will succeed w.h.p. and will produce a random -formula. If is satisfiable, this procedure will either fail or produce a satisfiable -formula.
Now, we reduce to . This is done by replacing each constraint, with probability , with a random constraint. Clearly, if the original instance is a random instance of , then the produced instance is a random instance of . Furthermore, if the original instance is satisfied by the assignment , the same , w.h.p., will satisfy all the new constraints. The reason is that the probability that a random -constraint is satisfied by is , and hence, the probability that all new constraints are satisfied by is at least . Now, since , the last probability is .
For the full proof see Daniely and Shalev-Shwartz [2016].