
Pruning Randomly Initialized Neural Networks
with Iterative Randomization

Daiki Chijiwa    Shin’ya Yamaguchi    Yasutoshi Ida    Kenji Umakoshi    Tomohiro Inoue

NTT Computer and Data Science Laboratories, NTT Corporation
NTT Social Informatics Laboratories, NTT Corporation
Corresponding author: [email protected]
Abstract

Pruning the weights of randomly initialized neural networks plays an important role in the context of the lottery ticket hypothesis. Ramanujan et al. [26] empirically showed that pruning the weights alone, without optimizing their values, can achieve remarkable performance. However, to reach the same level of performance as weight optimization, the pruning approach requires more parameters in the networks before pruning, and thus more memory space. To overcome this parameter inefficiency, we introduce a novel framework for pruning randomly initialized neural networks by iteratively randomizing the weight values (IteRand). Theoretically, we prove an approximation theorem in our framework, which indicates that the randomizing operations are provably effective in reducing the required number of parameters. We also empirically demonstrate the parameter efficiency in multiple experiments on CIFAR-10 and ImageNet. The code is available at https://github.com/dchiji-ntt/iterand.

1 Introduction

The lottery ticket hypothesis, which was originally proposed by Frankle and Carbin [6], has been an important topic in the research of deep neural networks (DNNs). The hypothesis claims that an over-parameterized DNN has a sparse subnetwork (called a winning ticket) that can achieve almost the same accuracy as the fully-trained entire network when trained independently. If the hypothesis holds for a given network, then we can reduce the computational cost by using the sparse subnetwork instead of the entire network while maintaining the accuracy [15, 22]. In addition to the practical benefit, the hypothesis also suggests that the over-parametrization of DNNs is no longer necessary and their subnetworks alone are sufficient to achieve full accuracy.

Ramanujan et al. [26] went one step further. They proposed and empirically demonstrated a conjecture related to the above hypothesis, called the strong lottery ticket hypothesis, which informally states that there exists a subnetwork in a randomly initialized neural network such that it already achieves almost the same accuracy as a fully trained network, without any optimization of the weights of the network. A remarkable consequence of this hypothesis is that neural networks could be trained by solving a discrete optimization problem. That is, we may train a randomly initialized neural network by finding an optimal subnetwork (which we call weight-pruning optimization), instead of optimizing the network weights continuously (which we call weight-optimization) with stochastic gradient descent (SGD).

However, the weight-pruning optimization requires a problematically large number of parameters in the random network before pruning. Pensia et al. [25] theoretically showed that the network width required for the weight-pruning optimization needs to be logarithmically larger than that for the weight-optimization, at least in the case of shallow networks. Therefore, the weight-pruning optimization requires more parameters, and thus more memory space, than the weight-optimization to achieve the same accuracy. In other words, under a given memory constraint, the weight-pruning optimization can reach lower final accuracy than the weight-optimization in practice.

In this paper, we propose a novel optimization method for neural networks called weight-pruning with iterative randomization (IteRand), which extends the weight-pruning optimization to overcome the parameter inefficiency. The key idea is to virtually increase the network width by randomizing pruned weights at each iteration of the weight-pruning optimization, without any additional memory consumption. Indeed, we theoretically show that the required network width can be reduced by the randomizing operations. More precisely, our theoretical result indicates that, if the number of randomizing operations is large enough, we can reduce the required network width for weight-pruning to the same as that for a network fully trained by the weight-optimization up to constant factors, in contrast to the logarithmic factors of the previous results [25, 24]. We also empirically demonstrate that, under a given amount of network parameters, IteRand boosts the accuracy of the weight-pruning optimization in multiple vision experiments.

2 Background

In this section, we review the prior works on pruning randomly initialized neural networks.

Notation and setup.

Let $d,N\in\mathbb{N}$. Let $f(x;\bm{\theta})$ be an $l$-layered ReLU neural network with an input $x\in\mathbb{R}^{d}$ and parameters $\bm{\theta}=(\theta_{i})_{1\leq i\leq n}\in\mathbb{R}^{n}$, where each weight $\theta_{i}$ is randomly sampled from a distribution $\mathcal{D}_{\mathrm{param}}$ over $\mathbb{R}$. A subnetwork of $f(x;\bm{\theta})$ is written as $f(x;\bm{\theta}\odot\mathbf{m})$, where $\mathbf{m}\in\{0,1\}^{n}$ is a binary mask and "$\odot$" denotes element-wise multiplication.

Ramanujan et al. [26] empirically observed that we can train the randomly initialized neural network $f(x;\bm{\theta})$ by solving the following discrete optimization problem, which we call weight-pruning optimization:

$\min_{\mathbf{m}\in\{0,1\}^{n}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{labeled}}}\Big[\mathcal{L}(f(x;\bm{\theta}\odot\mathbf{m}),y)\Big],$   (1)

where $\mathcal{D}_{\mathrm{labeled}}$ is a distribution over a set of labeled data $(x,y)$ and $\mathcal{L}$ is a loss function. To solve this optimization problem, Ramanujan et al. [26] proposed an optimization algorithm called edge-popup (Algorithm 1).
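To make the objective in Eq. (1) concrete, the following is a minimal NumPy sketch (our illustration, not the paper's implementation; the tiny network, loss, and all variable names are assumptions) that evaluates the loss term inside Eq. (1) for a two-layer ReLU network on a single labeled example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def masked_forward(x, W1, W2, m1, m2):
    """Forward pass of the subnetwork f(x; theta ⊙ m) for a 2-layer ReLU net."""
    h = relu((W1 * m1) @ x)      # first layer with pruned weights
    return (W2 * m2) @ h         # second layer with pruned weights

def cross_entropy(logits, y):
    """Loss term L(f(x; theta ⊙ m), y) with softmax cross-entropy."""
    z = logits - logits.max()
    return -(z[y] - np.log(np.exp(z).sum()))

# Randomly initialized weights theta (never trained) and binary masks m.
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.uniform(-1.0, 1.0, size=(d_hidden, d_in))
W2 = rng.uniform(-1.0, 1.0, size=(d_out, d_hidden))
m1 = (rng.random(W1.shape) > 0.5).astype(W1.dtype)   # keep roughly half the weights
m2 = (rng.random(W2.shape) > 0.5).astype(W2.dtype)

x, y = rng.normal(size=d_in), 2
print(cross_entropy(masked_forward(x, W1, W2, m1, m2), y))
```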

1: Initialize $\bm{\theta}\sim\mathcal{D}^{n}_{\mathrm{param}}$, $\mathbf{s}\sim\mathcal{D}^{n}_{\mathrm{score}}$;   // $\mathcal{D}_{\mathrm{param}}$ and $\mathcal{D}_{\mathrm{score}}$ are distributions over $\mathbb{R}$
2: for $k=0,\cdots,N-1$ do
3:     Sample a labeled data point $(x,y)\sim\mathcal{D}_{\mathrm{labeled}}$;
4:     $\mathbf{m},\mathbf{s}\leftarrow{\texttt{TrainMask}}(\bm{\theta},\mathbf{s};(x,y))$;   // Optimize importance scores $\mathbf{s}$ and update $\mathbf{m}$
5: end for
6: return $\mathbf{m},\bm{\theta}$;
Algorithm 1 Weight-pruning optimization by edge-popup [26]

TrainMask (Algorithm 2) is the key subroutine in Algorithm 1. It maintains a latent variable $\mathbf{s}=(s_{i})_{1\leq i\leq n}\in\mathbb{R}^{n}$, where each element $s_{i}$ represents an importance score of the corresponding weight $\theta_{i}$, and optimizes $\mathbf{s}$ instead of directly optimizing the discrete variable $\mathbf{m}$. Given the scores $\mathbf{s}$, the corresponding mask $\mathbf{m}$ is computed by the function ${\texttt{CalculateMask}}(\mathbf{s})$, which returns $\mathbf{m}=(m_{i})_{1\leq i\leq n}$ defined as follows: $m_{i}=1$ if $s_{i}$ is in the top $100(1-p)\%$ of $\{s_{i}\}_{1\leq i\leq n}$, and $m_{i}=0$ otherwise, where $p\in(0,1)$ is a hyperparameter representing the sparsity rate of the pruned network. In line 3 of Algorithm 2, $\mathrm{SGD}_{\eta,\lambda,\mu}(\mathbf{s},\mathbf{g})$ returns the value of $\mathbf{s}$ updated by stochastic gradient descent with learning rate $\eta$, weight decay $\lambda$, momentum coefficient $\mu$, and gradient vector $\mathbf{g}$.

1: Input: $\bm{\theta},\mathbf{s}\in\mathbb{R}^{n}$, $(x,y)$: a labeled data point;
2: $\mathbf{m}\leftarrow{\texttt{CalculateMask}}(\mathbf{s})$;   // Calculate the mask $\mathbf{m}$ with the current scores
3: $\mathbf{s}\leftarrow\mathrm{SGD}_{\eta,\lambda,\mu}\left(\mathbf{s},\nabla_{\overline{\mathbf{s}}=\mathbf{m}}\mathcal{L}(f(x;\bm{\theta}\odot\overline{\mathbf{s}}),y)\right)$;   // Update $\mathbf{s}$ by the gradient at $\overline{\mathbf{s}}=\mathbf{m}$
4: $\mathbf{m}\leftarrow{\texttt{CalculateMask}}(\mathbf{s})$;   // Calculate the new mask $\mathbf{m}$ with the updated scores
5: return $\mathbf{m},\mathbf{s}$;
Algorithm 2 Pseudo code of TrainMask
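For illustration, here is a minimal NumPy sketch of one TrainMask step for a toy linear model $f(x;\bm{\theta}\odot\overline{\mathbf{s}})=(\bm{\theta}\odot\overline{\mathbf{s}})\cdot x$ with a squared loss. It is our own sketch under these simplifying assumptions (plain SGD without weight decay or momentum), not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def calculate_mask(s, p):
    """CalculateMask: keep the top 100*(1-p)% of the scores, prune the rest."""
    k = int(p * s.size)                 # number of weights to prune
    m = np.ones_like(s)
    m[np.argsort(s)[:k]] = 0.0          # zero out the k lowest-scored weights
    return m

def train_mask(theta, s, x, y, p=0.5, lr=0.1):
    """One TrainMask step (Algorithm 2) for a toy linear model with squared loss."""
    m = calculate_mask(s, p)            # line 2: mask from the current scores
    pred = np.dot(theta * m, x)         # f(x; theta ⊙ m)
    # Gradient of 0.5*((theta ⊙ s̄)·x - y)^2 w.r.t. the relaxed scores s̄, evaluated at s̄ = m:
    grad_s = (pred - y) * theta * x
    s = s - lr * grad_s                 # line 3: SGD step on the scores
    return calculate_mask(s, p), s      # line 4: recompute the mask

n = 10
theta = rng.uniform(-1.0, 1.0, size=n)  # random weights, never optimized
s = rng.uniform(0.0, 1.0, size=n)       # importance scores
x, y = rng.normal(size=n), 1.0
m, s = train_mask(theta, s, x, y)
print(m)
```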

On the theoretical side, Malach et al. [19] first provided a mathematical justification of the above empirical observation. They formulated it as an approximation theorem with some assumptions on the network width as follows.

Theorem 2.1 (informal statement of Theorem 2.1 in [19])

Let $f_{\mathrm{target}}(x)$ be an $l$-layered network with bounded weight matrices, and let $g(x)$ be a randomly initialized $2l$-layered neural network. If the width of $g(x)$ is larger than that of $f_{\mathrm{target}}(x)$ by a polynomial factor, then with high probability there exists a subnetwork of $g(x)$ that approximates $f_{\mathrm{target}}(x)$.

By taking a well-trained network as $f_{\mathrm{target}}$, Theorem 2.1 indicates that pruning a sufficiently wide $g(x)$ can, in principle, reveal a subnetwork whose test accuracy is as good as that of $f_{\mathrm{target}}$.

In the follow-up works [25, 24], the assumption on the network width was improved by reducing the factor of the required width to a logarithmic term. However, Pensia et al. [25] showed that the logarithmic order is unavoidable at least in the case of $l=1$. While their result implies the optimality of the logarithmic bound, it also means that we cannot further relax the assumption on the network width as long as we work in the same setting. This indicates a limitation of the weight-pruning optimization: under a given amount of memory or network parameters, it can train only less expressive models than those trained with conventional weight-optimization such as SGD.

3 Method

In this section, we present a novel method called weight-pruning with iterative randomization (IteRand) for randomly initialized neural networks.

As discussed in Section 2, although the original weight-pruning optimization (Algorithm 1) can achieve good accuracy, it still has a limitation in the expressive power under a fixed amount of memory or network parameters. Our method is designed to overcome this limitation. The main idea is to randomize pruned weights at each iteration of the weight-pruning optimization. As we prove in Section 4, this reduces the required size of an entire network to be pruned.

We use the same notation and setup as Section 2. In addition, we assume that each weight $\theta_{i}$ of the network $f(x;\bm{\theta})$ can be re-sampled from $\mathcal{D}_{\mathrm{param}}$ at each iteration of the weight-pruning optimization.

3.1 Algorithm

Algorithm 3 describes our proposed method, IteRand, which extends Algorithm 1. The differences from Algorithm 1 are lines 5-7. IteRand has a hyperparameter $K_{\mathrm{per}}\in\mathbb{N}_{\geq 1}$ (line 5). At the $k$-th iteration, whenever $k+1$ is divisible by $K_{\mathrm{per}}$, pruned weights are randomized by the ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$ function (line 6). There are multiple possible designs for ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$, which will be discussed in the next subsection.

1: Initialize $\bm{\theta}\sim\mathcal{D}^{n}_{\mathrm{param}}$, $\mathbf{s}\sim\mathcal{D}^{n}_{\mathrm{score}}$;
2: for $k=0,\cdots,N-1$ do
3:     Sample a labeled data point $(x,y)\sim\mathcal{D}_{\mathrm{labeled}}$;
4:     $\mathbf{m},\mathbf{s}\leftarrow{\texttt{TrainMask}}(\bm{\theta},\mathbf{s};(x,y))$;
5:     if $k+1$ is divisible by $K_{\mathrm{per}}$ then   // New if-block added to Algorithm 1
6:         $\bm{\theta}\leftarrow{\texttt{Randomize}}(\bm{\theta},\mathbf{m})$;   // Randomize a subset of pruned weights
7:     end if
8: end for
9: return $\mathbf{m}$, $\bm{\theta}$;
Algorithm 3 Weight-pruning optimization with iterative randomization (IteRand)
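As a sketch of how the pieces fit together, the following toy NumPy loop mirrors Algorithm 3 on a one-layer linear model. It uses the naive randomization defined later in Section 3.2 (Eq. (2)) for brevity, and the model, data, and hyperparameter values are illustrative assumptions rather than the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def calculate_mask(s, p):
    m = np.ones_like(s)
    m[np.argsort(s)[:int(p * s.size)]] = 0.0      # prune the lowest-scored weights
    return m

def iterand(data, n, p=0.5, lr=0.1, K_per=5, num_steps=100):
    """IteRand (Algorithm 3) for a toy linear model x -> theta·x."""
    theta = rng.uniform(-1.0, 1.0, size=n)        # D_param = U[-1, 1]
    s = rng.uniform(0.0, 1.0, size=n)             # importance scores
    m = calculate_mask(s, p)
    for k in range(num_steps):
        x, y = data[k % len(data)]
        # TrainMask (Algorithm 2): straight-through SGD on the scores with squared loss.
        pred = np.dot(theta * m, x)
        s = s - lr * (pred - y) * theta * x
        m = calculate_mask(s, p)
        # New if-block (lines 5-7): re-sample the pruned weights every K_per iterations.
        if (k + 1) % K_per == 0:
            theta_new = rng.uniform(-1.0, 1.0, size=n)
            theta = theta * m + theta_new * (1 - m)   # naive randomization (Eq. (2))
    return m, theta

# Toy data from a sparse target: y = w·x with most entries of w equal to zero.
n = 20
w = np.zeros(n)
w[:3] = rng.uniform(-1.0, 1.0, size=3)
data = [(x, np.dot(w, x)) for x in rng.normal(size=(200, n))]
m, theta = iterand(data, n)
print("kept weights:", int(m.sum()), "of", n)
```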

Note that $K_{\mathrm{per}}$ controls how frequently the algorithm randomizes the pruned weights. Indeed, the total number of randomizing operations is $\lfloor N/K_{\mathrm{per}}\rfloor$. If $K_{\mathrm{per}}$ is too small, the algorithm is likely to be unstable because it may randomize even important weights before their scores are well-optimized, and the overhead of the randomizing operations can no longer be ignored. In contrast, if $K_{\mathrm{per}}$ is too large, the algorithm becomes almost the same as the original weight-pruning Algorithm 1, and thus the effect of the randomization disappears. We fix $K_{\mathrm{per}}=300$ on CIFAR-10 (about 1 epoch) and $K_{\mathrm{per}}=1000$ on ImageNet (about 1/10 of an epoch) in our experiments (Section 5).

3.2 Designs of ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$

Here, we discuss how to define the ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$ function. There are several possible ways to randomize a subset of the parameters $\bm{\theta}$.

Naive randomization.

For any distribution $\mathcal{D}_{\mathrm{param}}$, a naive definition of the randomization function (which we call naive randomization) can be given as follows.

${\texttt{Randomize}}(\bm{\theta},\mathbf{m})_{i}:=\begin{cases}\theta_{i}&(\text{if }m_{i}=1)\\ \widetilde{\theta}_{i}&(\text{otherwise})\end{cases}$

where we denote the $i$-th component of ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})\in\mathbb{R}^{n}$ as ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})_{i}$, and each $\widetilde{\theta}_{i}\in\mathbb{R}$ is a random variable with the distribution $\mathcal{D}_{\mathrm{param}}$. This can also be written in another form as

${\texttt{Randomize}}(\bm{\theta},\mathbf{m}):=\bm{\theta}\odot\mathbf{m}+\widetilde{\bm{\theta}}\odot(1-\mathbf{m}),$   (2)

where $\widetilde{\bm{\theta}}=(\widetilde{\theta}_{i})_{1\leq i\leq n}\in\mathbb{R}^{n}$ is a random variable with the distribution $\mathcal{D}^{n}_{\mathrm{param}}$.

Partial randomization.

The naive randomization (Eq. (2)) is likely to be unstable because it replaces all pruned weights with random values every $K_{\mathrm{per}}$ iterations. To increase stability, we modify the naive randomization so that it replaces only a randomly chosen subset of the pruned weights (which we call partial randomization):

${\texttt{Randomize}}(\bm{\theta},\mathbf{m}):=\bm{\theta}\odot\mathbf{m}+\left(\bm{\theta}\odot(1-\mathbf{b}_{r})+\widetilde{\bm{\theta}}\odot\mathbf{b}_{r}\right)\odot(1-\mathbf{m}),$   (3)

Figure 1: Analysis on $r$. We compare partial randomization with $r\in\{0.0,0.01,0.1,1.0\}$ applied to CNNs. The y-axis is the validation accuracy on CIFAR-10. $r=0.1$ achieves better mean accuracy for every CNN.

where $\widetilde{\bm{\theta}}\in\mathbb{R}^{n}$ is the same as in Eq. (2), $r\in[0,1]$ is a hyperparameter, and $\mathbf{b}_{r}=(b_{r,i})_{1\leq i\leq n}\in\{0,1\}^{n}$ is a binary vector whose elements are sampled from the Bernoulli distribution ${\mathrm{Bernoulli}}(r)$, i.e. $b_{r,i}=1$ with probability $r$ and $b_{r,i}=0$ with probability $1-r$.

The partial randomization replaces a randomly chosen $100r\%$ of all pruned weights with random values. Note that, when $r=1$, the partial randomization is equivalent to the naive randomization (Eq. (2)). In contrast, when $r=0$, it never randomizes any weights and thus reduces to Algorithm 1.
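A minimal NumPy sketch of the partial randomization in Eq. (3) (our illustration; the sampler stands in for $\mathcal{D}_{\mathrm{param}}$ and the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(theta, m, r=0.1, sampler=lambda size: rng.uniform(-1.0, 1.0, size=size)):
    """Partial randomization (Eq. (3)): re-sample a random fraction r of the pruned weights."""
    theta_tilde = sampler(theta.shape)                 # fresh draw from D_param
    b_r = rng.binomial(1, r, size=theta.shape)         # Bernoulli(r) selector
    kept = theta * m                                   # surviving weights are untouched
    pruned = (theta * (1 - b_r) + theta_tilde * b_r) * (1 - m)
    return kept + pruned

theta = rng.uniform(-1.0, 1.0, size=8)
m = np.array([1, 0, 1, 0, 0, 1, 0, 1], dtype=float)
print(randomize(theta, m, r=0.5))
```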

In Figure 1, we observe that $r=0.1$ works well with various network architectures on CIFAR-10, where we use the Kaiming uniform distribution (whose definition will be given in Section 5) for $\mathcal{D}_{\mathrm{param}}$ and $K_{\mathrm{per}}=300$.

4 Theoretical justification

In this section, we present a theoretical justification for our iterative randomization approach on the weight-pruning of randomly initialized neural networks.

4.1 Setup

We consider a target neural network $f:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{l}}$ of depth $l$, which is described as follows.

$f(x)=F_{l}\sigma(F_{l-1}\sigma(\cdots F_{1}(x)\cdots)),$   (4)

where $x$ is a $d_{0}$-dimensional real vector, $\sigma$ is the ReLU activation, and $F_{i}$ is a $d_{i}\times d_{i-1}$ matrix. Our objective is to approximate the target network $f(x)$ by pruning a randomly initialized neural network $g(x)$, which tends to be larger than the target network.

Similar to the previous works [19, 25], we assume that $g(x)$ is twice as deep as the target network $f(x)$. Thus, $g(x)$ can be described as

$g(x)=G_{2l}\sigma(G_{2l-1}\sigma(\cdots G_{1}(x)\cdots)),$   (5)

where $G_{j}$ is a $\widetilde{d}_{j}\times\widetilde{d}_{j-1}$ matrix ($j=1,\cdots,2l$) with $\widetilde{d}_{2i}=d_{i}$. Each element of the matrix $G_{j}$ is assumed to be drawn from the uniform distribution $U[-1,1]$. Since there is a one-to-one correspondence between pruned networks of $g(x)$ and sequences of binary matrices $M=\{M_{j}\}_{j=1,\cdots,2l}$ with $M_{j}\in\{0,1\}^{\widetilde{d}_{j}\times\widetilde{d}_{j-1}}$, every pruned network of $g(x)$ can be described as

$g_{M}(x)=(G_{2l}\odot M_{2l})\sigma((G_{2l-1}\odot M_{2l-1})\sigma(\cdots(G_{1}\odot M_{1})(x)\cdots)).$   (6)

Under this setup, we recall that the previous works showed that, with high probability, there exists a subnetwork of $g(x)$ that approximates $f(x)$ when the width of $g(x)$ is larger than that of $f(x)$ by polynomial factors [19] or logarithmic factors [25, 24].
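To make Eq. (6) concrete, here is a short NumPy sketch (our illustration; the widths are arbitrary) that evaluates a pruned network $g_{M}(x)$ from random matrices $G_{j}$ and binary masks $M_{j}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_M(x, Gs, Ms):
    """Evaluate the pruned network g_M(x) of Eq. (6): masked affine maps with ReLU in between."""
    h = x
    for j, (G, M) in enumerate(zip(Gs, Ms)):
        h = (G * M) @ h
        if j < len(Gs) - 1:              # ReLU after every layer except the last
            h = np.maximum(h, 0.0)
    return h

# Widths for l = 2 (so 2l = 4 layers), with the even widths d~_{2i} = d_i as in the setup above.
widths = [3, 10, 4, 12, 2]               # d~_0, d~_1, d~_2, d~_3, d~_4
Gs = [rng.uniform(-1.0, 1.0, size=(widths[j + 1], widths[j])) for j in range(4)]
Ms = [rng.integers(0, 2, size=G.shape).astype(float) for G in Gs]
print(g_M(rng.normal(size=3), Gs, Ms))
```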

4.2 Formulation and main results

Now we attempt to mathematically formulate our proposed method, IteRand, as an approximation problem. As described in Algorithm 3, the method consists of two steps: optimizing binary variables $M=\{M_{j}\}_{j=1,\cdots,2l}$ and randomizing pruned weights in $g(x)$. The first step can be formulated as the approximation problem of $f(x)$ by some $g_{M}(x)$ as described above. Corresponding to the second step, we introduce an idealized assumption on $g(x)$ for a given number $R\in\mathbb{N}_{\geq 1}$: each element of the weight matrix $G_{j}$ can be re-sampled with replacement from the uniform distribution $U[-1,1]$ up to $R-1$ times, for all $j=1,\cdots,2l$ (re-sampling assumption for $R$).

Under this re-sampling assumption, we obtain the following theorem.

Theorem 4.1 (Main Theorem)

Fix $\epsilon,\delta>0$, and assume that $\lVert F_{i}\rVert_{\rm Frob}\leq 1$ for all $i$. Let $R\in\mathbb{N}$, and assume that $g(x)$ satisfies the re-sampling assumption for $R$.

If $\widetilde{d}_{2i-1}\geq 2d_{i-1}\left\lceil\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}R^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)\right\rceil$ holds for all $i=1,\cdots,l$, then with probability at least $1-\delta$, there exist binary matrices $M=\{M_{j}\}_{1\leq j\leq 2l}$ such that

$\lVert f(x)-g_{M}(x)\rVert_{2}\leq\epsilon,\text{ for }\lVert x\rVert_{\infty}\leq 1.$   (7)

In particular, if $R$ is larger than $\frac{8ld_{i-1}}{\epsilon}\sqrt{d_{i}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)}$, then $\widetilde{d}_{2i-1}=2d_{i-1}$ is enough.
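The last claim follows directly from the width condition: if $R\geq\frac{8ld_{i-1}}{\epsilon}\sqrt{d_{i}\log(\frac{2ld_{i-1}d_{i}}{\delta})}$, then

$R^{2}\geq\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right),\quad\text{and hence}\quad\left\lceil\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}R^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)\right\rceil=1,$

so the condition reduces to $\widetilde{d}_{2i-1}\geq 2d_{i-1}$.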

Theorem 4.1 shows that iterative randomization is provably helpful for approximating wider networks by the weight-pruning optimization of a random network. In fact, when the number of re-samplings is sufficiently large, the required width of $g(x)$ in Theorem 4.1 is reduced to only twice the width of $f(x)$, in contrast to the prior results without the re-sampling assumption, where the required width is logarithmically wider than $f(x)$ [25, 24]. This means that, under a fixed number of parameters of $g(x)$, we can achieve higher accuracy by weight-pruning of $g(x)$ with iterative randomization, since a wider target network has a higher model capacity.

In the rest of this section, we present the core ideas by proving the simplest case ($l=d_{0}=d_{1}=1$) of the theorem, while the full proof of Theorem 4.1 is given in Appendix A. Note that the full proof is obtained essentially by applying the argument for the simplest case inductively on the widths and depths.

4.3 Proof ideas for Theorem 4.1

Let us consider the case of $l=d_{0}=d_{1}=1$. Then the target network $f(x)$ can be written as $f(x)=wx:\mathbb{R}\to\mathbb{R}$, where $w\in\mathbb{R}$, and $g(x)$ can be written as $g(x)=\mathbf{v}^{T}\sigma(\mathbf{u}x)$ where $\mathbf{u},\mathbf{v}\in\mathbb{R}^{\widetilde{d}_{1}}$. Also, subnetworks of $g(x)$ can be written as $g_{\mathbf{m}}(x)=(\mathbf{v}\odot\mathbf{m})^{T}\sigma((\mathbf{u}\odot\mathbf{m})x)$ for some $\mathbf{m}\in\{0,1\}^{\widetilde{d}_{1}}$.

There are two technical points in our proof. The first point is the following splitting of $f(x)$:

$f(x)=w\sigma(x)-w\sigma(-x),$   (8)

for any $x\in\mathbb{R}$. This splitting is very similar to the one used in the previous works [19, 25, 24]:

$f(x)=\sigma(wx)-\sigma(-wx).$   (9)

However, if we use the latter splitting Eq. (9), it turns out that we cannot obtain the lower bound of $\widetilde{d}_{2i-1}$ in Theorem 4.1 when $d_{0}>1$. (Here we do not treat this case, but the proof for $d_{0}>1$ is given in Appendix A.) Thus we need to use our splitting Eq. (8) instead.

Using Eq. (8), we can give another proof of the following approximation result without iterative randomization, which was already shown in the previous work [19].

Lemma 4.2

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let $\mathbf{u},\mathbf{v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):=\mathbf{v}^{T}\sigma(\mathbf{u}\cdot x)$. If $d\geq 2\lceil\frac{16}{\epsilon^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\mathbf{m}}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (10)

where $g_{\mathbf{m}}(x):=(\mathbf{v}\odot\mathbf{m})^{T}\sigma(\mathbf{u}\cdot x)$ for some $\mathbf{m}\in\{0,1\}^{d}$.

Proof (sketch):

We assume that $d$ is an even number, $d=2d^{\prime}$, so that we can split the index set $\{0,\cdots,d-1\}$ of the $d$ hidden neurons of $g(x)$ into $I=\{0,\cdots,d^{\prime}-1\}$ and $J=\{d^{\prime},\cdots,d-1\}$. Then we have the corresponding subnetworks $g_{I}(x)$ and $g_{J}(x)$ given by $g_{I}(x):=\sum_{k\in I}v_{k}\sigma(u_{k}x)$, $g_{J}(x):=\sum_{k\in J}v_{k}\sigma(u_{k}x)$, which satisfy $g(x)=g_{I}(x)+g_{J}(x)$.

By the splitting Eq. (8), it is enough to consider the probabilities of approximating $w\sigma(x)$ by a subnetwork of $g_{I}(x)$ and of approximating $-w\sigma(-x)$ by a subnetwork of $g_{J}(x)$. Now we have

$\mathbb{P}\left(\not\exists i\in I\text{ s.t. }|u_{i}-1|\leq\frac{\epsilon}{2},|v_{i}-w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (11)
$\mathbb{P}\left(\not\exists j\in J\text{ s.t. }|u_{j}+1|\leq\frac{\epsilon}{2},|v_{j}+w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (12)

for $d^{\prime}\geq\frac{16}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$, by a standard argument for the uniform distribution and the inequality $1+x\leq e^{x}$. By the union bound, with probability at least $1-\delta$, we have $i\in I$ and $j\in J$ such that

$\big|w\sigma(x)-v_{i}\sigma(u_{i}x)\big|\leq\frac{\epsilon}{2},$
$\big|-w\sigma(-x)-v_{j}\sigma(u_{j}x)\big|\leq\frac{\epsilon}{2}.$

Combining these inequalities, we finish the proof. \square
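For completeness, the per-neuron probability behind Eq. (11) can be spelled out as follows. For $u_{i}\sim U[-1,1]$, the event $|u_{i}-1|\leq\frac{\epsilon}{2}$ corresponds to the interval $[1-\frac{\epsilon}{2},1]$ of length $\frac{\epsilon}{2}$, so it has probability $\frac{\epsilon}{4}$; for $v_{i}\sim U[-1,1]$ and $w\in[-1,1]$, at least half of the interval $[w-\frac{\epsilon}{2},w+\frac{\epsilon}{2}]$ lies in $[-1,1]$, so $\mathbb{P}(|v_{i}-w|\leq\frac{\epsilon}{2})\geq\frac{\epsilon}{4}$. By independence, each $i\in I$ satisfies both conditions with probability at least $\frac{\epsilon^{2}}{16}$, and therefore the failure probability over $I$ is at most $\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq e^{-\frac{d^{\prime}\epsilon^{2}}{16}}\leq\frac{\delta}{2}$ whenever $d^{\prime}\geq\frac{16}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$.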

The second point of our proof is introducing projection maps to leverage the re-sampling assumption, as follows. As in the proof of Lemma 4.2, we assume that $d=2d^{\prime}$ for some $d^{\prime}\in\mathbb{N}$ and let $I=\{0,\cdots,d^{\prime}-1\}$, $J=\{d^{\prime},\cdots,d-1\}$. Now we define a projection map

$\pi:\widetilde{I}\to I,\ \ \ k\mapsto\lfloor k/R\rfloor,$   (13)

where $\widetilde{I}:=\{0,\cdots,d^{\prime}R-1\}$ and $\lfloor\cdot\rfloor$ denotes the floor function. Similarly for $J$, we can define $\widetilde{J}:=\{d^{\prime}R,\cdots,dR-1\}$ and the corresponding projection map. Using these projection maps, we can extend Lemma 4.2 to a version with the re-sampling assumption, which is the special case of Theorem 4.1:

Proposition 4.3 (Theorem 4.1 with $l=d_{0}=d_{1}=1$)

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let $\mathbf{u},\mathbf{v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):=\mathbf{v}^{T}\sigma(\mathbf{u}\cdot x)$. Let $R\in\mathbb{N}$ and assume that each element of $\mathbf{u}$ and $\mathbf{v}$ can be re-sampled with replacement up to $R-1$ times. If $d\geq 2\lceil\frac{16}{\epsilon^{2}R^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\mathbf{m}}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (14)

where $g_{\mathbf{m}}(x):=(\mathbf{v}\odot\mathbf{m})^{T}\sigma(\mathbf{u}\cdot x)$ for some $\mathbf{m}\in\{0,1\}^{d}$.

Proof (sketch):

Similarly to the proof of Lemma 4.2, we utilize the splitting Eq. (8). We mainly argue the approximation of $w\sigma(x)$, since the argument for approximating $-w\sigma(-x)$ is parallel.

By the assumption that each element of $\mathbf{u}$ and $\mathbf{v}$ can be re-sampled up to $R-1$ times, we can replace the probability in Eq. (11) in the proof of Lemma 4.2, using the projection map $\pi:\widetilde{I}\to I$, by

$\mathbb{P}\left(\not\exists i_{1},i_{2}\in\widetilde{I}\text{ s.t. }\pi(i_{1})=\pi(i_{2}),\ \ |\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2},\ \ |\widetilde{v}_{i_{2}}-w|\leq\frac{\epsilon}{2}\right),$   (15)

where $\widetilde{u}_{1},\cdots,\widetilde{u}_{d^{\prime}R},\widetilde{v}_{1},\cdots,\widetilde{v}_{d^{\prime}R}\sim U[-1,1]$. Indeed, since we have

$\#\{(i_{1},i_{2})\in\widetilde{I}\times\widetilde{I}:\pi(i_{1})=\pi(i_{2})\}=d^{\prime}R^{2},$   (16)

we can evaluate the probability Eq. (15) as

$\text{Eq.}\ (15)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}R^{2}}\leq\frac{\delta}{2},$   (17)

for $d^{\prime}\geq\frac{16}{\epsilon^{2}R^{2}}\log\left(\frac{2}{\delta}\right)$. Eq. (17) can play the same role as Eq. (11) in the proof of Lemma 4.2.
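The counting identity Eq. (16), which is what boosts the exponent from $d^{\prime}$ to $d^{\prime}R^{2}$, is easy to check numerically. A tiny Python sketch (our illustration with arbitrary small $d^{\prime}$ and $R$):

```python
import math

d_prime, R = 4, 3
I_tilde = range(d_prime * R)            # I~ = {0, ..., d'R - 1}

def pi(k, R=R):
    """Projection map of Eq. (13): each index in I has exactly R preimages in I~."""
    return math.floor(k / R)

pairs = [(i1, i2) for i1 in I_tilde for i2 in I_tilde if pi(i1) == pi(i2)]
assert len(pairs) == d_prime * R * R    # counting identity, Eq. (16)
print(len(pairs))                        # 36
```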

A parallel argument applies to the approximation of $-w\sigma(-x)$, replacing $I$ with $J$. The rest of the proof is the same as that of Lemma 4.2. \square
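To see how the bound of Proposition 4.3 scales with $R$, here is a quick numerical check (our own illustration with arbitrary $\epsilon$ and $\delta$) of the required width $2\lceil\frac{16}{\epsilon^{2}R^{2}}\log(\frac{2}{\delta})\rceil$; it shrinks quadratically in $R$ and bottoms out at $d=2$, matching the constant-factor claim of Theorem 4.1.

```python
import math

eps, delta = 0.1, 0.01

def required_width(R, eps=eps, delta=delta):
    """Width bound of Proposition 4.3: d >= 2 * ceil(16 / (eps^2 R^2) * log(2 / delta))."""
    return 2 * math.ceil(16.0 / (eps ** 2 * R ** 2) * math.log(2.0 / delta))

for R in (1, 2, 4, 8, 16, 32, 128):
    print(R, required_width(R))
```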

5 Experiments

In this section, we perform several experiments to evaluate our proposed method, IteRand (Algorithm 3). Our main aim is to empirically verify the parameter efficiency of IteRand, compared with edge-popup [26] (Algorithm 1) on which IteRand is based. Specifically, we demonstrate that IteRand can achieve better accuracy than edge-popup under a given amount of network parameters. In all experiments, we used the partial randomization with $r=0.1$ (Eq. (3)) for Randomize in Algorithm 3.

Setup.

We used two vision datasets: CIFAR-10 [13] and ImageNet [28]. CIFAR-10 is a small-scale dataset of $32\times 32$ images with 10 class labels. It has 50k images for training and 10k for testing. We randomly split the 50k training images into 45k for actual training and 5k for validation. ImageNet is a dataset of $224\times 224$ images with 1000 class labels. It has a train set of 1.28 million images and a validation set of 50k images. We randomly split the training images 99:1, and used the former for actual training and the latter for validating models. When testing models, we used the validation set of ImageNet (which we refer to as the test set). For network architectures, we used multiple convolutional neural networks (CNNs): Conv6 [6] as a shallow network and ResNets [11] as deep networks. Conv6 is a 6-layered VGG-like CNN, which is also used in the prior work [26]. ResNets are more practical CNNs with skip connections and batch normalization layers. Following the settings in Ramanujan et al. [26], we used non-affine batch normalization layers, which are layers that only normalize their inputs and do not apply any affine transform, when training ResNets with edge-popup and IteRand. All of our experiments were performed with 1 GPU (NVIDIA GTX 1080 Ti, 11GB) for CIFAR-10 and 2 GPUs (NVIDIA V100, 16GB) for ImageNet. The details of the network architectures and hyperparameters for training are given in Appendix B.

Parameter distributions.

With the same notation as Section 2, both IteRand and edge-popup require two distributions: $\mathcal{D}_{\mathrm{param}}$ and $\mathcal{D}_{\mathrm{score}}$. In our experiments, we consider the Kaiming uniform (KU) and signed Kaiming constant (SC) distributions. The KU distribution is the uniform distribution over the interval $[-\sqrt{\frac{6}{c_{\mathrm{fanin}}}},\sqrt{\frac{6}{c_{\mathrm{fanin}}}}]$, where $c_{\mathrm{fanin}}$ is the constant defined for each layer of ReLU neural networks [1, 10]. The SC distribution is the uniform distribution over the two-valued set $\{-\sqrt{\frac{2}{c_{\mathrm{fanin}}}},\sqrt{\frac{2}{c_{\mathrm{fanin}}}}\}$, which was introduced by Ramanujan et al. [26]. We fix $\mathcal{D}_{\mathrm{score}}$ to the KU distribution, and use the KU or SC distribution for $\mathcal{D}_{\mathrm{param}}$.
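As an illustration (not code from the paper; `fan_in` stands for $c_{\mathrm{fanin}}$ and the example shape is arbitrary), sampling from the KU and SC distributions can be sketched as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_uniform(shape, fan_in):
    """KU: uniform on [-sqrt(6/fan_in), sqrt(6/fan_in)]."""
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

def signed_kaiming_constant(shape, fan_in):
    """SC: uniform on the two values {-sqrt(2/fan_in), +sqrt(2/fan_in)}."""
    magnitude = np.sqrt(2.0 / fan_in)
    return magnitude * rng.choice([-1.0, 1.0], size=shape)

# Example: a 3x3 conv layer with 64 input channels has fan_in = 64 * 3 * 3.
w_ku = kaiming_uniform((128, 64, 3, 3), fan_in=64 * 3 * 3)
w_sc = signed_kaiming_constant((128, 64, 3, 3), fan_in=64 * 3 * 3)
print(w_ku.std(), np.unique(np.abs(w_sc)))
```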

5.1 Varying the network width

To demonstrate the parameter efficiency, we introduce a hyperparameter $\rho$, the width factor, for Conv6, ResNet18 and ResNet34. The details of this modification are given in Appendix B. We train and test these networks on CIFAR-10 using IteRand, edge-popup and SGD, varying the width factor $\rho$ in $\{0.25,0.5,1.0,2.0\}$ for each network (Figure 2). In the experiments for IteRand and edge-popup, we used the sparsity rate $p=0.5$ for Conv6 and $p=0.6$ for ResNet18 and ResNet34. Our method outperforms the baseline method for various widths. The difference in accuracy is large both when the width is small ($\rho\leq 0.5$) and when $\mathcal{D}_{\mathrm{param}}$ is the KU distribution, where edge-popup struggles.

Figure 2: We train and evaluate three CNNs (Conv6, ResNet18 and ResNet34) with various widths on CIFAR-10. The x-axis is the width factor $\rho$ and the y-axis is the accuracy on the test set. We plot the mean $\pm$ standard deviation over three runs for each experiment. KU/SC in the legend represents the distribution used for $\mathcal{D}_{\mathrm{param}}$.

5.2 ImageNet experiments

The parameter efficiency of our method is also confirmed on ImageNet (Figure 3(a)). ImageNet is more difficult than CIFAR-10, and thus more complexity is required for networks to achieve competitive performance. Since our method can increase the network complexity as shown in Section 4, the effect is significant on ImageNet, especially when the complexity is limited, as with ResNet34.

In addition to the parameter efficiency, we also observe the effect of iterative randomization on the behavior of the optimization process by plotting training curves (Figure 3(b)). Surprisingly, IteRand achieves significantly better performance than edge-popup at the early stage of the optimization, which indicates that the iterative randomization accelerates the optimization process, especially when the number of iterations is limited.

(a) Network Size vs. Test Accuracy. (b) Training Curves.
Figure 3: We compare IteRand with edge-popup on ImageNet. We used a fixed sparsity rate $p=0.7$ (Section 2), following Ramanujan et al. [26]. Figure (a) shows parameter-accuracy tradeoff curves on the test set. In Figure (b), we plot the train/val accuracy of ResNet50 during optimization. IteRand outperforms edge-popup from early epochs.

6 Related work

Lottery ticket hypothesis

Frankle and Carbin [6] originally proposed the lottery ticket hypothesis. Many properties and applications have been studied in subsequent works, such as transferability of winning subnetworks [22], sparsification before training [15, 31, 29] and further ablation studies on the hypothesis [34].

Zhou et al. [34] showed, surprisingly, that merely pruning randomly initialized networks, without any optimization of their weights, can serve as a training method. Ramanujan et al. [26] went further by proposing an effective pruning algorithm on random networks, called edge-popup, and achieved competitive accuracy compared with standard weight-optimization training algorithms [27]. Malach et al. [19] mathematically formalized this pruning-only approach as an approximation problem for a target network and proved it under a lower-bound condition on the width of the random networks to be pruned. Subsequent works [25, 24] successfully relaxed the lower bound to a logarithmic factor of the width of a target network. Our work can be seen as an extension of these works that allows re-sampling of the weights of the random networks a finite number $R$ of times, and we showed that the logarithmic factor can be reduced to a constant one when $R$ is large enough (see Section 4).

Neural network pruning and regrowth

Studies of finding sparse structures of neural networks date back to the late 1980s [9, 14]. There are many approaches to sparsify networks, such as magnitude-based pruning [8], $L_{0}$ regularization [17] and variational dropout [21]. Although these methods only focus on pruning unnecessary weights from the networks, there are several studies on re-adding new weights during sparsification [12] to maintain the model complexity of the sparse networks. Han et al. [7] proposed a dense-sparse-dense (DSD) training algorithm consisting of three stages: dense training, pruning, and recovering the pruned weights as zeros followed by dense training. Mocanu et al. [20] proposed sparse evolutionary training (SET), which repeats pruning and regrowth with random values at the end of each training epoch. Pruning algorithms proposed in other works [2, 23, 4] are designed to recover pruned weights by zero-initialization instead of random values, so that the recovered weights do not affect the outputs of the networks. While these methods are similar to our iterative randomization method in terms of the re-adding processes, all of them use weight-optimization to train networks including re-added weights, in contrast to our pruning-only approach.

7 Limitations

There are several limitations in our theoretical results. (1) Theorem 4.1 indicates only the existence of subnetworks that approximate a given neural network, not whether our method works well empirically. (2) The theorem focuses on the case where the parameter distribution $\mathcal{D}_{\mathrm{param}}$ is the uniform distribution over the interval $[-1,1]$. Thus, generalizing our theorem to other distributions, such as the uniform distribution over the binary values $\{-1,1\}$ [3], is left for future work. (3) The required width given in the theorem may not be optimal. Indeed, prior work [25] showed that the polynomial factors in the required width can be reduced to logarithmic ones when the number of re-sampling operations is $R=1$. Whether we can similarly reduce the required width for $R>1$ remains an open question.

Our algorithm (IteRand) and its empirical results also have several limitations. (1) Pruning randomly initialized networks without any randomization can reduce the storage cost by saving only a single random seed and the binary mask representing the optimal subnetwork. However, if we save a network pruned with IteRand in the same way, it requires more storage: $R$ random seeds and $R$ binary masks, where $R$ is the number of re-samplings. (2) Although our method can be applied with any score-based pruning algorithm (e.g. Zhou et al. [34] and Wang et al. [32]), we evaluated it only in combination with edge-popup [26], which is the state-of-the-art algorithm for pruning random networks. Since our theoretical results do not depend on any particular pruning algorithm, we expect that our method can be effectively combined with better pruning algorithms that emerge in the future. (3) We performed experiments mainly on image classification tasks. An intriguing question is how effectively our method works on other tasks such as language understanding, audio recognition, and deep reinforcement learning with various network architectures.

8 Conclusion

In this paper, we proposed a novel framework of iterative randomization (IteRand) for pruning randomly initialized neural networks. IteRand can virtually increase the network widths without any additional memory consumption, by randomizing pruned weights of the networks iteratively during the pruning procedure. We verified its parameter efficiency both theoretically and empirically.

Our results indicate that the weight-pruning of random networks may become a practical approach to train the networks when we apply the randomizing operations enough times. This opens up the possibility that the weight-pruning can be used instead of the standard weight-optimization within the same memory budget.

Acknowledgement

The authors thank Kyosuke Nishida and Hengjin Tang for valuable discussions in the early phase of our study. We also thank Osamu Saisho for helpful comments on the manuscript. We are grateful to the CCI team at NTT for building and maintaining the computational cluster on which most of our experiments were run.

References

  • [1] torch.nn.init – PyTorch 1.8.1 documentation. https://pytorch.org/docs/stable/nn.init.html. Accessed: 2021-10-21.
  • [2] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. International Conference on Learning Representations, 2018.
  • [3] James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In International Conference on Learning Representations, 2021.
  • [4] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pages 2943–2952. PMLR, 2020.
  • [5] Facebook. facebookresearch/open_lth. https://github.com/facebookresearch/open_lth. Accessed: 2021-10-21.
  • [6] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
  • [7] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. Dsd: Dense-sparse-dense training for deep neural networks. International Conference on Learning Representations, 2017.
  • [8] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pages 1135–1143, 2015.
  • [9] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems, 1:177–185, 1988.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034. IEEE Computer Society, 2015.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE Computer Society, 2016.
  • [12] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, 2021.
  • [13] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • [14] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
  • [15] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2018.
  • [16] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • [17] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. In International Conference on Learning Representations, 2018.
  • [18] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
  • [19] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.
  • [20] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.
  • [21] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.
  • [22] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [23] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pages 4646–4655. PMLR, 2019.
  • [24] Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2925–2934. Curran Associates, Inc., 2020.
  • [25] Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2599–2610. Curran Associates, Inc., 2020.
  • [26] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020.
  • [27] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [29] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6377–6389. Curran Associates, Inc., 2020.
  • [30] Ben Trevett. bentrevett/pytorch-sentiment-analysis. https://github.com/bentrevett/pytorch-sentiment-analysis. Accessed: 2021-10-21.
  • [31] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.
  • [32] Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from scratch. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12273–12280, 2020.
  • [33] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
  • [34] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Appendix A Proof of main theorem (Theorem 4.1)

A.1 Settings and main theorem

Let $d_{0},\cdots,d_{l}\in\mathbb{N}_{\geq 1}$. We consider a target neural network $f:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{l}}$ of depth $l$, which is described as follows.

$f(x)=F_{l}\sigma(F_{l-1}\sigma(\cdots F_{1}(x)\cdots)),$   (18)

where $x$ is a $d_{0}$-dimensional real vector, $\sigma$ is the ReLU activation, and $F_{i}$ is a $d_{i}\times d_{i-1}$ matrix. Our objective is to approximate the target network $f(x)$ by pruning a randomly initialized neural network $g(x)$, which tends to be larger than the target network.

Similar to the previous works [19, 25], we assume that $g(x)$ is twice as deep as the target network $f(x)$. Thus, $g(x)$ can be described as

$g(x)=G_{2l}\sigma(G_{2l-1}\sigma(\cdots G_{1}(x)\cdots)),$   (19)

where $G_{j}$ is a $\widetilde{d}_{j}\times\widetilde{d}_{j-1}$ matrix ($\widetilde{d}_{j}\in\mathbb{N}_{\geq 1}$ for $j=1,\cdots,2l$) with $\widetilde{d}_{2i}=d_{i}$. Each element of the matrix $G_{j}$ is assumed to be drawn from the uniform distribution $U[-1,1]$. Since there is a one-to-one correspondence between pruned networks of $g(x)$ and sequences of binary matrices $M=\{M_{j}\}_{j=1,\cdots,2l}$ with $M_{j}\in\{0,1\}^{\widetilde{d}_{j}\times\widetilde{d}_{j-1}}$, every pruned network of $g(x)$ can be described as

$g_{M}(x)=(G_{2l}\odot M_{2l})\sigma((G_{2l-1}\odot M_{2l-1})\sigma(\cdots(G_{1}\odot M_{1})(x)\cdots)).$   (20)

We introduce an idealized assumption on $g(x)$ for a given number $R\in\mathbb{N}_{\geq 1}$: each element of the weight matrix $G_{j}$ can be re-sampled with replacement from the uniform distribution $U[-1,1]$ up to $R-1$ times, for all $j=1,\cdots,2l$ (re-sampling assumption for $R$). Here, re-sampling with replacement means that we sample a new value for an element of $G_{j}$ and replace the old value of the element by the new one.

Under this re-sampling assumption, we describe our main theorem as follows.

Theorem A.1 (Main Theorem)

Fix $\epsilon,\delta>0$, and assume that $\lVert F_{i}\rVert_{\rm Frob}\leq 1$ for all $i$. Let $R\in\mathbb{N}$ and assume that each element of $G_{j}$ can be re-sampled with replacement from the uniform distribution $U[-1,1]$ up to $R-1$ times.

If $\widetilde{d}_{2i-1}\geq 2d_{i-1}\left\lceil\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}R^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)\right\rceil$ holds for all $i=1,\cdots,l$, then with probability at least $1-\delta$, there exist binary matrices $M=\{M_{j}\}_{1\leq j\leq 2l}$ such that

$\lVert f(x)-g_{M}(x)\rVert_{2}\leq\epsilon,\text{ for }\lVert x\rVert_{\infty}\leq 1.$   (21)

In particular, if $R$ is larger than $\frac{8ld_{i-1}}{\epsilon}\sqrt{d_{i}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)}$, then $\widetilde{d}_{2i-1}=2d_{i-1}$ is enough.

A.2 Proof of Theorem A.1

Our proof is based on the following simple observation, similar to the arguments in Malach et al. [19].

Lemma A.2

Fix some $n\in\mathbb{N}$, $\alpha\in[-1,1]$ and $\epsilon,\delta\in(0,1)$. Let $X_{1},\cdots,X_{n}\sim U[-1,1]$. If $n\geq\frac{2}{\epsilon}\log(\frac{1}{\delta})$ holds, then with probability at least $1-\delta$, we have

$|\alpha-X_{i}|\leq\epsilon,$   (22)

for some $i\in\{1,\cdots,n\}$.

Proof:

We can assume $\alpha\geq 0$ without loss of generality. By considering the half $\epsilon$-ball around $\alpha$, we have

$\mathbb{P}_{X\sim U[-1,1]}\Big[\big|\alpha-X\big|\leq\epsilon\Big]\geq\frac{\epsilon}{2}.$

Thus it follows that

$\mathbb{P}_{X_{1},\cdots,X_{n}\sim U[-1,1]}\Big[\big|\alpha-X_{i}\big|>\epsilon\text{ for all }i\Big]\leq\left(1-\frac{\epsilon}{2}\right)^{n}\leq e^{-\frac{n\epsilon}{2}}\leq\delta.$

\square

First, we consider approximating a single-variable linear function $f(x)=wx:\mathbb{R}\to\mathbb{R}$, $w\in\mathbb{R}$, by some subnetwork of a $2$-layered neural network $g(x)$ with $d$ hidden neurons, without the re-sampling assumption. Note that this is the same setting as in Malach et al. [19], but we give another proof so that we can later extend it to the one with the re-sampling assumption.

Lemma A.3

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let ${\bf u},{\bf v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):={\bf v}^{T}\sigma({\bf u}\cdot x)$. If $d\geq 2\lceil\frac{16}{\epsilon^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\bf m}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (23)

where $g_{\bf m}(x):=({\bf v}\odot{\bf m})^{T}\sigma({\bf u}\cdot x)$ for some ${\bf m}\in\{0,1\}^{d}$.

Proof:

The core idea is to decompose $wx$ as

$wx=w(\sigma(x)-\sigma(-x))=w\sigma(x)-w\sigma(-x).$   (24)

We assume that $d$ is an even number, $d=2d^{\prime}$, so that we can split an index set $\{1,\cdots,d\}$ of hidden neurons of $g(x)$ into $I=\{1,\cdots,d^{\prime}\}$ and $J=\{d^{\prime}+1,\cdots,d\}$. Then we have the corresponding subnetworks $g_{I}(x)$ and $g_{J}(x)$ given by $g_{I}(x):=\sum_{k\in I}v_{k}\sigma(u_{k}x)$, $g_{J}(x):=\sum_{k\in J}v_{k}\sigma(u_{k}x)$, which satisfy the equation $g(x)=g_{I}(x)+g_{J}(x)$.

From Eq. (24), it is enough to consider the probabilities for approximating $w\sigma(x)$ by a subnetwork of $g_{I}(x)$ and for approximating $-w\sigma(-x)$ by a subnetwork of $g_{J}(x)$. Now we have

$\mathbb{P}\left(\not\exists i\in I\text{ s.t. }|u_{i}-1|\leq\frac{\epsilon}{2},|v_{i}-w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (25)
$\mathbb{P}\left(\not\exists j\in J\text{ s.t. }|u_{j}+1|\leq\frac{\epsilon}{2},|v_{j}+w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (26)

for $d^{\prime}\geq\frac{16}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$, as in the proof of Lemma A.2. By using the union bound, with probability at least $1-\delta$, we have $i\in I$ and $j\in J$ such that

$\big|w\sigma(x)-v_{i}\sigma(u_{i}x)\big|\leq\frac{\epsilon}{2},$
$\big|-w\sigma(-x)-v_{j}\sigma(u_{j}x)\big|\leq\frac{\epsilon}{2}.$

Combining these inequalities and Eq. (24), we finish the proof. \square

Now we extend Lemma A.3 to the one with the re-sampling assumption.

Lemma A.4

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let ${\bf u},{\bf v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):={\bf v}^{T}\sigma({\bf u}\cdot x)$. Let $R\in\mathbb{N}$ and assume that each element of ${\bf u}$ and ${\bf v}$ can be re-sampled with replacement up to $R-1$ times. If $d\geq 2\lceil\frac{16}{\epsilon^{2}R^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\bf m}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (27)

where $g_{\bf m}(x):=({\bf v}\odot{\bf m})^{T}\sigma({\bf u}\cdot x)$ for some ${\bf m}\in\{0,1\}^{d}$.

Proof:

As in the proof of Lemma A.3, we assume that $d=2d^{\prime}$ and let $I=\{1,\cdots,d^{\prime}\}$, $J=\{d^{\prime}+1,\cdots,d\}$. Now we consider $\widetilde{I}=\{1,\cdots,d^{\prime}R\}$ and a projection $\pi:\widetilde{I}\to I$ defined by $\pi(k)=\lfloor(k-1)/R\rfloor+1$. Since we assumed that each element of ${\bf u}$ and ${\bf v}$ can be re-sampled up to $R-1$ times, we can replace the probability Eq. (25) in the proof of Lemma A.3 by

$\mathbb{P}\left(\not\exists i_{1},i_{2}\in\widetilde{I}\text{ s.t. }\pi(i_{1})=\pi(i_{2}),\ \ |\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2},\ \ |\widetilde{v}_{i_{2}}-w|\leq\frac{\epsilon}{2}\right),$   (28)

where $\widetilde{u}_{1},\cdots,\widetilde{u}_{d^{\prime}R},\widetilde{v}_{1},\cdots,\widetilde{v}_{d^{\prime}R}\sim U[-1,1]$. Since we have

$\#\{(i_{1},i_{2})\in\widetilde{I}\times\widetilde{I}:\pi(i_{1})=\pi(i_{2})\}=d^{\prime}R^{2},$   (29)

we can evaluate the probability Eq. (28) as

$\text{Eq.}\ (28)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}R^{2}}\leq\frac{\delta}{2},$

for $d^{\prime}\geq\frac{16}{\epsilon^{2}R^{2}}\log\left(\frac{2}{\delta}\right)$. The rest of the proof is the same as that of Lemma A.3. \square

Then, we generalize the above lemma to the case in which the target function $f(x)$ is a single-variable linear map with higher output dimension.

Lemma A.5

Fix $\epsilon,\delta\in(0,1)$, $d_{1},d_{2}\in\mathbb{N}$, ${\bf w}\in[-1,1]^{d_{2}}$. Let ${\bf u}\sim U[-1,1]^{d_{1}}$, $V\sim U[-1,1]^{d_{2}\times d_{1}}$ be uniformly random weights of a $2$-layered neural network $g(x):=V\sigma({\bf u}\cdot x)$. Let $R\in\mathbb{N}$ and assume that each element of ${\bf u}$ and $V$ can be re-sampled with replacement up to $R-1$ times.

If $d_{1}\geq 2\lceil\frac{16d_{2}}{\epsilon^{2}R^{2}}\log(\frac{2d_{2}}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\lVert{\bf w}\cdot x-g_{M}(x)\rVert_{2}\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (30)

where $g_{M}(x):=(V\odot M)\sigma({\bf u}\cdot x)$ for some $M\in\{0,1\}^{d_{2}\times d_{1}}$.

Proof:

We denote V=(Vki)1kd2,1id1V=(V_{ki})_{1\leq k\leq d_{2},1\leq i\leq d_{1}}. As in the proof of Lemma A.3, we assume d1=2d1d_{1}=2d^{\prime}_{1} and split the index set {1,,d1}\{1,\cdots,d_{1}\} into I={1,,d1}I=\{1,\cdots,d^{\prime}_{1}\}, J={d1+1,,d1}J=\{d^{\prime}_{1}+1,\cdots,d_{1}\}. Also we consider the corresponding subnetworks of g(x)g(x):

g_{I}(x):=\left(\sum_{i\in I}V_{ki}\sigma(u_{i}x)\right)_{1\leq k\leq d_{2}},\ \ g_{J}(x):=\left(\sum_{i\in J}V_{ki}\sigma(u_{i}x)\right)_{1\leq k\leq d_{2}}.

As in the proofs of Lemma A.3 and Lemma A.4, it suffices to show that, with high probability, there exists a subnetwork of $g_{I}(x)$ which approximates ${\bf w}\sigma(x)$, and simultaneously a subnetwork of $g_{J}(x)$ which approximates $-{\bf w}\sigma(-x)$.

For simplicity, we focus on $g_{I}(x)$ in the following argument; the same conclusion holds for $g_{J}(x)$ as well. Fix $k\in\{1,\cdots,d_{2}\}$. Then we consider the following probability,

\mathbb{P}\left(\not\exists i_{1},i_{2}\in\widetilde{I}\text{ s.t. }\pi(i_{1})=\pi(i_{2}),\ \ |\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2\sqrt{d_{2}}},\ \ |\widetilde{V}_{k,i_{2}}-w_{k}|\leq\frac{\epsilon}{2\sqrt{d_{2}}}\right), (31)

where $\widetilde{u}_{i},\widetilde{V}_{ki}\sim U[-1,1]$ for $i=1,\cdots,d^{\prime}_{1}R$. By using Eq. (29), if $d^{\prime}_{1}\geq\frac{16d_{2}}{\epsilon^{2}R^{2}}\log\left(\frac{2d_{2}}{\delta}\right)$, we have

\text{Eq. (31)}\leq\left(1-\frac{\epsilon^{2}}{16d_{2}}\right)^{d^{\prime}_{1}R^{2}}\leq\frac{\delta}{2d_{2}}.

Therefore, by the union bound over $k=1,\cdots,d_{2}$, with probability at least $1-\frac{\delta}{2}$ we have, for each $k$, indices $i_{1},i_{2}\in\widetilde{I}$ (depending on $k$) such that $i:=\pi(i_{1})=\pi(i_{2})$, $|\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2\sqrt{d_{2}}}$ and $|\widetilde{V}_{k,i_{2}}-w_{k}|\leq\frac{\epsilon}{2\sqrt{d_{2}}}$, and thus

\big|w_{k}\sigma(x)-V_{ki}\sigma(u_{i}x)\big|\leq\frac{\epsilon}{2\sqrt{d_{2}}},\ \ \text{ for }x\in\mathbb{R},|x|\leq 1, (32)

if we substitute $\widetilde{u}_{i_{1}}$ for $u_{i}$ and $\widetilde{V}_{k,i_{2}}$ for $V_{ki}$. We note that the choice of $\widetilde{u}_{i_{1}}$ and $\widetilde{V}_{k,i_{2}}$ may not be unique, but Eq. (32) does not depend on this choice.

The same argument yields, with probability at least $1-\frac{\delta}{2}$, a corresponding approximation of $-{\bf w}\sigma(-x)$ by a subnetwork of $g_{J}(x)$; by the union bound, both hold simultaneously with probability at least $1-\delta$. Therefore, by taking $M$ appropriately, we have

\lVert{\bf w}x-g_{M}(x)\rVert_{2}\leq\lVert{\bf w}\sigma(x)-g_{I}(x)\rVert_{2}+\lVert-{\bf w}\sigma(-x)-g_{J}(x)\rVert_{2}\leq\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon

for all $x\in\mathbb{R}$ with $|x|\leq 1$. $\square$

Next, we generalize Lemma A.5 to the multi-variable version:

Lemma A.6

Fix $\epsilon,\delta\in(0,1)$, $d_{0},d_{1},d_{2}\in\mathbb{N}$, and $W\in[-1,1]^{d_{2}\times d_{0}}$. Let $U\sim U[-1,1]^{d_{1}\times d_{0}}$, $V\sim U[-1,1]^{d_{2}\times d_{1}}$ be uniformly random weights of a $2$-layered neural network $g({\bf x}):=V\sigma(U{\bf x})$. Let $R\in\mathbb{N}$, and assume that each element of $U$ and $V$ can be re-sampled with replacement up to $R-1$ times.

If $d_{1}\geq 2d_{0}\lceil\frac{16d_{0}^{2}d_{2}}{\epsilon^{2}R^{2}}\log(\frac{2d_{0}d_{2}}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

\lVert W{\bf x}-g_{M,N}({\bf x})\rVert_{2}\leq\epsilon,\text{ for all }{\bf x}\in\mathbb{R}^{d_{0}},\lVert{\bf x}\rVert_{\infty}\leq 1, (33)

where $g_{M,N}({\bf x}):=(V\odot M)\sigma((U\odot N){\bf x})$ for some $M\in\{0,1\}^{d_{2}\times d_{1}}$ and $N\in\{0,1\}^{d_{1}\times d_{0}}$.

Proof:

Let $d^{\prime}_{1}=d_{1}/d_{0}$, where we assume $d^{\prime}_{1}\in\mathbb{N}$. We take $N$ as the following binary matrix:

N=\begin{pmatrix}\mathbf{1}&&\mbox{\large 0}\\ &\ddots&\\ \mbox{\large 0}&&\mathbf{1}\end{pmatrix},\text{ where }\mathbf{1}=\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}\in\mathbb{R}^{d^{\prime}_{1}\times 1}.

By the decomposition $U\odot N={\bf u}_{1}\oplus\cdots\oplus{\bf u}_{d_{0}}$, where each ${\bf u}_{i}$ is a $d^{\prime}_{1}\times 1$ matrix, we have

g_{M,N}({\bf x})=(V\odot M)\big(\sigma({\bf u}_{1}x_{1})\oplus\cdots\oplus\sigma({\bf u}_{d_{0}}x_{d_{0}})\big). (34)

Here, we decompose $M$ into column blocks:

M=\begin{pmatrix}M_{1}&\cdots&M_{d_{0}}\end{pmatrix},

where each $M_{i}$ is a $d_{2}\times d^{\prime}_{1}$ matrix with binary coefficients. Writing $V=\begin{pmatrix}V_{1}&\cdots&V_{d_{0}}\end{pmatrix}$ for the corresponding column blocks with $V_{i}\in\mathbb{R}^{d_{2}\times d^{\prime}_{1}}$, we have

V\odot M=\begin{pmatrix}V_{1}\odot M_{1}&\cdots&V_{d_{0}}\odot M_{d_{0}}\end{pmatrix}. (35)

By combining Eq. (34) and Eq. (35), we have

g_{M,N}({\bf x})=\sum_{1\leq i\leq d_{0}}(V_{i}\odot M_{i})\sigma({\bf u}_{i}x_{i}). (36)

Applying Lemma A.5 to each summand in Eq. (36) (the summands are mutually independent), for each fixed $i\in\{1,\cdots,d_{0}\}$ there exists $M_{i}$, with probability at least $1-\frac{\delta}{d_{0}}$, such that

\lVert{\bf w}_{i}x_{i}-(V_{i}\odot M_{i})\sigma({\bf u}_{i}x_{i})\rVert_{2}\leq\frac{\epsilon}{d_{0}},\text{ for }x_{i}\in\mathbb{R},|x_{i}|\leq 1, (37)

where ${\bf w}_{i}$ is the $i$-th column vector of $W$.

Using the union bound, we have $M_{1},\cdots,M_{d_{0}}$ satisfying Eq. (37) simultaneously with probability at least $1-\delta$. Therefore, by combining Eq. (36), Eq. (37) and the decomposition $W{\bf x}=\sum_{i}{\bf w}_{i}x_{i}$, the triangle inequality gives $\lVert W{\bf x}-g_{M,N}({\bf x})\rVert_{2}\leq\sum_{i}\lVert{\bf w}_{i}x_{i}-(V_{i}\odot M_{i})\sigma({\bf u}_{i}x_{i})\rVert_{2}\leq d_{0}\cdot\frac{\epsilon}{d_{0}}=\epsilon$, which is Eq. (33). $\square$
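The block-diagonal choice of $N$ and the column-block decomposition of $M$ can also be checked numerically. The following snippet (an illustrative sketch we add here; the dimensions are arbitrary) verifies the identity in Eq. (36) for randomly drawn $U$, $V$ and an arbitrary binary mask $M$:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1_blk, d2 = 3, 4, 5          # d1' = d1 / d0
d1 = d0 * d1_blk

def relu(z):
    return np.maximum(z, 0.0)

U = rng.uniform(-1, 1, size=(d1, d0))
V = rng.uniform(-1, 1, size=(d2, d1))
M = rng.integers(0, 2, size=(d2, d1))   # an arbitrary binary mask on V
x = rng.uniform(-1, 1, size=d0)

# Block-diagonal mask N: the i-th column keeps only the i-th block of d1' rows.
N = np.zeros((d1, d0))
for i in range(d0):
    N[i * d1_blk:(i + 1) * d1_blk, i] = 1.0

lhs = (V * M) @ relu((U * N) @ x)       # Eq. (34), with the masks applied

rhs = np.zeros(d2)                      # right-hand side of Eq. (36)
for i in range(d0):
    u_i = U[i * d1_blk:(i + 1) * d1_blk, i]        # i-th diagonal block of U ⊙ N
    V_i = V[:, i * d1_blk:(i + 1) * d1_blk]
    M_i = M[:, i * d1_blk:(i + 1) * d1_blk]
    rhs += (V_i * M_i) @ relu(u_i * x[i])

assert np.allclose(lhs, rhs)
```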

Finally, by using Lemma A.6, we prove Theorem A.1. The outline of the proof is the same as in prior works [19, 25].

Proof of Theorem A.1:

By Lemma A.6, for each fixed $k\in\{1,\cdots,l\}$, there exist binary matrices $M_{2k-1},M_{2k}$ such that

\lVert F_{k}{\bf x}-(G_{2k}\odot M_{2k})\sigma\big((G_{2k-1}\odot M_{2k-1}){\bf x}\big)\rVert_{2}\leq\frac{\epsilon}{2l}, (38)

for all ${\bf x}\in\mathbb{R}^{d_{0}}$ with $\lVert{\bf x}\rVert_{\infty}\leq 1$, with probability at least $1-\frac{\delta}{l}$. Taking the union bound, we can get $M=(M_{1},\cdots,M_{2l})$ satisfying Eq. (38) for all $k=1,\cdots,l$ with probability at least $1-\delta$.

For the above $M$ and any ${\bf x}_{0}\in\mathbb{R}^{d_{0}}$ with $\lVert{\bf x}_{0}\rVert_{\infty}\leq 1$, we define sequences $\{{\bf x}_{k}\}_{0\leq k\leq l}$ and $\{\widetilde{\bf x}_{k}\}_{0\leq k\leq l}$ by

{\bf x}_{k}:=f_{k}({\bf x}_{k-1}),\qquad\widetilde{\bf x}_{0}:={\bf x}_{0},\quad\widetilde{\bf x}_{k}:=g_{M,k}(\widetilde{\bf x}_{k-1}),

where $f_{k}({\bf x})$ and $g_{M,k}({\bf x})$ are given by

f_{k}({\bf x}):=\begin{cases}\sigma\big(F_{k}{\bf x}\big),&(1\leq k\leq l-1)\\ F_{l}{\bf x},&(k=l)\end{cases}
g_{M,k}({\bf x}):=\begin{cases}\sigma\big((G_{2k}\odot M_{2k})\sigma\big((G_{2k-1}\odot M_{2k-1}){\bf x}\big)\big),&(1\leq k\leq l-1)\\ (G_{2l}\odot M_{2l})\sigma\big((G_{2l-1}\odot M_{2l-1}){\bf x}\big).&(k=l)\end{cases}

By induction on $k\in\{0,\cdots,l\}$, we can show that

\lVert{\bf x}_{k}-\widetilde{\bf x}_{k}\rVert_{2}\leq\frac{k\epsilon}{l},\qquad\lVert\widetilde{\bf x}_{k}\rVert_{\infty}\leq 2. (39)

For $k=0$, this holds trivially since $\widetilde{\bf x}_{0}={\bf x}_{0}$ and $\lVert{\bf x}_{0}\rVert_{\infty}\leq 1$. Consider the case $k\geq 1$. First of all, we remark that the following inequality is obtained from Eq. (38) and the $1$-Lipschitz property of the ReLU function $\sigma$:

\lVert f_{k}({\bf x})-g_{M,k}({\bf x})\rVert_{2}\leq\frac{\epsilon}{2l},\ \text{ for }{\bf x}\in\mathbb{R}^{d_{0}},\lVert{\bf x}\rVert_{\infty}\leq 1.

Since $f_{k}$ and $g_{M,k}$ are positively homogeneous (the networks have no bias terms), scaling ${\bf x}$ by a factor of $2$ rewrites the above inequality as follows:

\lVert f_{k}({\bf x})-g_{M,k}({\bf x})\rVert_{2}\leq\frac{\epsilon}{l},\ \text{ for }{\bf x}\in\mathbb{R}^{d_{0}},\lVert{\bf x}\rVert_{\infty}\leq 2.

Then, we have

\lVert{\bf x}_{k}-\widetilde{\bf x}_{k}\rVert_{2} = \lVert f_{k}({\bf x}_{k-1})-g_{M,k}(\widetilde{\bf x}_{k-1})\rVert_{2}
\leq \lVert f_{k}({\bf x}_{k-1})-f_{k}(\widetilde{\bf x}_{k-1})+f_{k}(\widetilde{\bf x}_{k-1})-g_{M,k}(\widetilde{\bf x}_{k-1})\rVert_{2}
\leq \lVert f_{k}({\bf x}_{k-1})-f_{k}(\widetilde{\bf x}_{k-1})\rVert_{2}+\lVert f_{k}(\widetilde{\bf x}_{k-1})-g_{M,k}(\widetilde{\bf x}_{k-1})\rVert_{2}
\leq \lVert F_{k}\rVert_{\rm Frob}\cdot\lVert{\bf x}_{k-1}-\widetilde{\bf x}_{k-1}\rVert_{2}+\frac{\epsilon}{l}
\leq \frac{k\epsilon}{l}\qquad\text{(by the induction hypothesis)},

and $\lVert\widetilde{\bf x}_{k}\rVert_{\infty}\leq\lVert\widetilde{\bf x}_{k}\rVert_{2}\leq\lVert{\bf x}_{k}\rVert_{2}+\lVert{\bf x}_{k}-\widetilde{\bf x}_{k}\rVert_{2}\leq 1+\frac{k\epsilon}{l}\leq 2$.

In particular, for $k=l$, Eq. (39) is nothing but Eq. (21). $\square$

Appendix B Details for our experiments

B.1 Network architectures

In our experiments on CIFAR-10 and ImageNet, we used the following network architectures: Conv6, ResNet18, ResNet34, ResNet50, and ResNet101. In Table 1 (for CIFAR-10) and Table 2 (for ImageNet), we describe their configurations with a width factor $\rho\in\mathbb{R}_{>0}$. When $\rho=1.0$, the architectures are the standard ones.

For each ResNet, the bracket $[\cdots]$ represents the basic block for ResNet18 and ResNet34, and the bottleneck block for ResNet50 and ResNet101, following the original settings of He et al. [11]. We also place a batch normalization layer right after each convolution operation. Note that, when we train and evaluate these networks with IteRand or edge-popup [26], we replace the batch normalization with a non-affine one, which fixes all learnable multipliers to $1$ and all learnable bias terms to $0$, following the design of Ramanujan et al. [26].
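As a concrete illustration of this replacement, the following PyTorch sketch shows one way the non-affine batch normalization could be instantiated (this reflects our reading of the setup, not code taken from the released implementation):

```python
import torch.nn as nn

def make_batchnorm(num_channels: int, learnable_affine: bool) -> nn.BatchNorm2d:
    """Affine BN for SGD training; non-affine BN for IteRand / edge-popup.

    With affine=False, the per-channel multipliers are fixed to 1 and the bias
    terms to 0, so batch normalization adds no trainable parameters."""
    return nn.BatchNorm2d(num_channels, affine=learnable_affine)

# The convolution + BN pattern used after each convolution (channel count is an example).
conv_bn = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    make_batchnorm(64, learnable_affine=False),   # non-affine for weight-pruning methods
    nn.ReLU(inplace=True),
)
```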

Table 1: Network architectures for CIFAR-10 ($\rho$: width factor).
Layer \ Network | Conv6 | ResNet18 | ResNet34
$3\times 3$ Convolution & Pooling Layers | $64\rho$, $64\rho$, max-pool | $64$, $[64\rho,64\rho]\times 2$ | $64$, $[64\rho,64\rho]\times 3$
 | $128\rho$, $128\rho$, max-pool | $[128\rho,128\rho]\times 2$ | $[128\rho,128\rho]\times 4$
 | $256\rho$, $256\rho$, max-pool | $[256\rho,256\rho]\times 2$ | $[256\rho,256\rho]\times 6$
 | | $[512\rho,512\rho]\times 2$ | $[512\rho,512\rho]\times 3$
 | | avg-pool | avg-pool
Linear Layers | $256\rho$, $256\rho$, $10$ | $10$ | $10$
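For concreteness, the Conv6 column of Table 1 can be read as the following PyTorch module (a sketch under our reading of the table; the paddings and the $32\times 32$ input resolution are assumptions):

```python
import torch.nn as nn

def conv6(rho: float = 1.0, num_classes: int = 10) -> nn.Sequential:
    """Conv6 for CIFAR-10 with width factor rho, following Table 1."""
    def w(c: int) -> int:
        return int(c * rho)

    def block(c_in: int, c_out: int) -> list:
        return [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]

    layers = (
        block(3, w(64)) + block(w(64), w(64)) + [nn.MaxPool2d(2)]
        + block(w(64), w(128)) + block(w(128), w(128)) + [nn.MaxPool2d(2)]
        + block(w(128), w(256)) + block(w(256), w(256)) + [nn.MaxPool2d(2)]
        + [nn.Flatten(),                       # 4x4 spatial size after three poolings
           nn.Linear(w(256) * 4 * 4, w(256)), nn.ReLU(inplace=True),
           nn.Linear(w(256), w(256)), nn.ReLU(inplace=True),
           nn.Linear(w(256), num_classes)]
    )
    return nn.Sequential(*layers)
```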
Table 2: Network architectures for ImageNet ($\rho$: width factor).
Layer \ Network | ResNet18 | ResNet34 | ResNet50 | ResNet101
Convolution | $64\ (7\times 7,\text{ stride }2)$ (all networks)
Pooling | max-pool $(3\times 3,\text{ stride }2,\text{ padding }1)$ (all networks)
Convolution Blocks | $[64\rho,64\rho]\times 2$ | $[64\rho,64\rho]\times 3$ | $[64\rho,64\rho,256\rho]\times 3$ | $[64\rho,64\rho,256\rho]\times 3$
 | $[128\rho,128\rho]\times 2$ | $[128\rho,128\rho]\times 4$ | $[128\rho,128\rho,512\rho]\times 4$ | $[128\rho,128\rho,512\rho]\times 4$
 | $[256\rho,256\rho]\times 2$ | $[256\rho,256\rho]\times 6$ | $[256\rho,256\rho,1024\rho]\times 6$ | $[256\rho,256\rho,1024\rho]\times 23$
 | $[512\rho,512\rho]\times 2$ | $[512\rho,512\rho]\times 3$ | $[512\rho,512\rho,2048\rho]\times 3$ | $[512\rho,512\rho,2048\rho]\times 3$
Pooling | avg-pool $(7\times 7)$ (all networks)
Linear | $1000$ (all networks)

B.2 Hyperparameters

In our experiments, we trained several neural networks by three methods (SGD, edge-popup [26], and IteRand) on two datasets (CIFAR-10 and ImageNet). For each dataset, we adopted different hyperparameters as follows.

CIFAR-10 experiments.

  • SGD: We used SGD with momentum for the optimization. It has the following hyperparameters: a total epoch number $E$, batch size $B$, learning rate $\eta$, weight decay $\lambda$, and momentum coefficient $\mu$. For all network architectures, we used common values except for the learning rate and weight decay: $E=100$, $B=128$, and $\mu=0.9$. For the learning rate and weight decay, we used $\eta=0.01$, $\lambda=1.0\times 10^{-4}$ for Conv6, and $\eta=0.1$, $\lambda=5.0\times 10^{-4}$ for ResNet18 and ResNet34, following Ramanujan et al. [26]. Moreover, we decayed the learning rate by cosine annealing [16].

  • edge-popup: With the same notation as in Section 2, edge-popup has the same hyperparameters as SGD plus an additional one, a sparsity rate $p$. We used the same values as SGD for each network except for the learning rate and the sparsity rate. For the learning rate, we used $\eta=0.2$ for Conv6 and $\eta=0.1$ for ResNet18 and ResNet34, decayed by cosine annealing as for SGD. For the sparsity rate, we used $p=0.5$ for Conv6 and $p=0.6$ for ResNet18 and ResNet34.

  • IteRand: With the notation in Section 3, IteRand has the same hyperparameters as edge-popup and the following additional ones: a randomization period $K_{\mathrm{per}}\in\mathbb{N}_{\geq 1}$ and a sampling rate $r\in[0,1]$ for partial randomization. We used the same values as edge-popup for the former hyperparameters, and $K_{\mathrm{per}}=300$, $r=0.1$ for the latter ones (see the sketch after this list for how these two enter the training loop).
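To make the roles of $K_{\mathrm{per}}$ and $r$ concrete, the following pseudo-PyTorch sketch shows where the partial randomization enters the training loop. The helper names (`partially_randomize`, `current_topk_mask`, `pruned_forward`, `prunable_params`) are ours for illustration and do not refer to the released code; the re-sampling distribution is written as $U[-1,1]$ only to match the theoretical analysis.

```python
import torch

def partially_randomize(weight: torch.Tensor, mask: torch.Tensor, r: float) -> None:
    """Re-sample a random fraction r of the currently pruned weights (mask == 0),
    in place, from the initialization distribution (here U[-1, 1] for illustration)."""
    chosen = (torch.rand_like(weight) < r) & (mask == 0)
    fresh = weight.new_empty(weight.shape).uniform_(-1.0, 1.0)
    weight.data[chosen] = fresh[chosen]

# Inside the edge-popup training loop (scores are the importance scores):
# for it, (x, y) in enumerate(loader):
#     loss = criterion(pruned_forward(x), y)   # forward through the top-k subnetwork
#     loss.backward()                          # gradients flow to the scores only
#     optimizer.step(); optimizer.zero_grad()
#     if (it + 1) % K_per == 0:                # every K_per iterations
#         for weight, scores in prunable_params:
#             mask = current_topk_mask(scores)           # hypothetical helper
#             partially_randomize(weight, mask, r)
```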

ImageNet experiments.

  • SGD: For all network architectures, we used the following hyperparameters (except for the learning rate): $E=105$, $B=128$, $\lambda=1.0\times 10^{-4}$, and $\mu=0.9$. For the first $5$ epochs, we gradually increased the learning rate as $\eta=0.1\times(i/5)$ for the $i$-th epoch ($i=1,\cdots,5$). For the next $95$ epochs, we decayed the learning rate by cosine annealing starting from $\eta=0.1$. For the final $5$ epochs, we set the learning rate to $\eta=1.0\times 10^{-5}$ to ensure that the optimization converges (a sketch of this schedule appears after this list).

  • edge-popup: For all network architectures, we used the same hyperparameters as SGD and the sparsity rate $p=0.7$.

  • IteRand: For all network architectures, we used the same hyperparameters as edge-popup and $K_{\mathrm{per}}=1000$, $r=0.1$.
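The ImageNet learning-rate schedule described in the SGD item above (linear warmup, cosine decay, then a small constant rate) can be written out explicitly as below; this is a sketch of our reading of the schedule (e.g., we assume the cosine decay is applied per epoch), not the released training script.

```python
import math

def imagenet_lr(epoch: int, base_lr: float = 0.1, total_epochs: int = 105,
                warmup: int = 5, final: int = 5, final_lr: float = 1.0e-5) -> float:
    """Learning rate for the given epoch (1-indexed), per the schedule in Appendix B.2."""
    if epoch <= warmup:                       # epochs 1..5: linear warmup to base_lr
        return base_lr * epoch / warmup
    if epoch > total_epochs - final:          # last 5 epochs: small constant rate
        return final_lr
    t = epoch - warmup - 1                    # epochs 6..100: cosine annealing from base_lr
    span = total_epochs - warmup - final      # 95 epochs of cosine decay
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / span))
```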

B.3 Experimental results in table forms

In Table 3, we give the means $\pm$ one standard deviation of the results plotted in Figure 2 in Section 5.

Table 3: Results for Figure 2 in Section 5.
Network | Method | $\mathcal{D}_{\mathrm{param}}$ | $\rho=0.25$ | $\rho=0.5$ | $\rho=1.0$ | $\rho=2.0$
Conv6 | IteRand | SC | $75.16\pm 0.23$ | $84.31\pm 0.13$ | $88.80\pm 0.20$ | $90.89\pm 0.17$
Conv6 | IteRand | KU | $73.90\pm 0.47$ | $83.18\pm 0.38$ | $88.20\pm 0.38$ | $90.53\pm 0.28$
Conv6 | edge-popup | SC | $70.35\pm 1.16$ | $81.54\pm 0.11$ | $87.60\pm 0.11$ | $90.25\pm 0.06$
Conv6 | edge-popup | KU | $67.02\pm 1.14$ | $79.37\pm 0.23$ | $86.14\pm 0.37$ | $89.91\pm 0.23$
Conv6 | SGD | KU | $79.51\pm 0.88$ | $84.46\pm 0.34$ | $87.69\pm 0.36$ | $89.63\pm 0.22$
ResNet18 | IteRand | SC | $86.09\pm 0.24$ | $90.50\pm 0.36$ | $92.61\pm 0.17$ | $93.82\pm 0.15$
ResNet18 | IteRand | KU | $84.71\pm 0.42$ | $89.96\pm 0.06$ | $92.47\pm 0.16$ | $93.52\pm 0.16$
ResNet18 | edge-popup | SC | $84.23\pm 0.33$ | $89.61\pm 0.06$ | $92.29\pm 0.04$ | $93.57\pm 0.13$
ResNet18 | edge-popup | KU | $81.13\pm 0.09$ | $88.45\pm 0.53$ | $91.79\pm 0.19$ | $93.25\pm 0.06$
ResNet18 | SGD | KU | $90.99\pm 0.29$ | $93.10\pm 0.14$ | $94.43\pm 0.05$ | $95.02\pm 0.23$
ResNet34 | IteRand | SC | $88.26\pm 0.35$ | $91.87\pm 0.17$ | $93.54\pm 0.27$ | $94.25\pm 0.15$
ResNet34 | IteRand | KU | $87.36\pm 0.14$ | $91.58\pm 0.16$ | $93.27\pm 0.19$ | $94.05\pm 0.10$
ResNet34 | edge-popup | SC | $87.37\pm 0.30$ | $91.41\pm 0.26$ | $93.27\pm 0.09$ | $94.25\pm 0.13$
ResNet34 | edge-popup | KU | $85.31\pm 0.66$ | $90.85\pm 0.11$ | $93.00\pm 0.11$ | $93.96\pm 0.10$
ResNet34 | SGD | KU | $92.26\pm 0.12$ | $94.20\pm 0.25$ | $94.66\pm 0.18$ | $95.33\pm 0.10$

In Table 4, we give the means $\pm$ one standard deviation of the results plotted in Figure 3 in Section 5.

Table 4: Results for Figure 3 in Section 5.
Network | Method | $\mathcal{D}_{\mathrm{param}}$ | Accuracy (Top-1) | Sparsity (%) | # of Params
ResNet18 | SGD | KU | $69.89$ | $0\%$ | $11.16$M
ResNet34 | SGD | KU | $73.82$ | $0\%$ | $21.26$M
ResNet34 | IteRand | SC | $65.86$ | $70\%$ | $6.38$M
ResNet34 | edge-popup | SC | $62.14$ | $70\%$ | $6.38$M
ResNet50 | IteRand | SC | $69.19$ | $70\%$ | $7.04$M
ResNet50 | edge-popup | SC | $67.11$ | $70\%$ | $7.04$M
ResNet101 | IteRand | SC | $72.99$ | $70\%$ | $12.72$M
ResNet101 | edge-popup | SC | $71.85$ | $70\%$ | $12.72$M

Appendix C Additional experiments

C.1 Ablation study on hyperparameters: KperK_{\mathrm{per}} and rr

IteRand has two hyperparameters: $K_{\mathrm{per}}$ (see line 5 in Algorithm 3) and $r$ (see Eq. (3) in Section 3).

$K_{\mathrm{per}}$ controls the frequency of the randomizing operations during the optimization. Note that, in our experiments in Section 5, we fixed it to $K_{\mathrm{per}}=300$ on CIFAR-10, which is nearly $1$ epoch ($=351$ iterations) when the batch size is $128$. As we discussed in Section 3, a too-small $K_{\mathrm{per}}$ may degrade the performance because it may randomize even the important weights before their scores are well-optimized. In contrast, a too-large $K_{\mathrm{per}}$ makes IteRand almost the same as edge-popup, and thus the effect of the randomization disappears. Figure 4 shows this phenomenon with $K_{\mathrm{per}}$ varying over $\{1,30,300,3000,30000\}$. In relation to our theoretical results (Theorem A.1), we note that the expected number (which we denote $R^{\prime}$) of randomizing operations for each weight is inversely proportional to $K_{\mathrm{per}}$. The theoretical results imply that a greater $R^{\prime}$ leads to a better approximation ability of IteRand. We observe that the results in Figure 4 are consistent with this implication in the region where $K_{\mathrm{per}}$ is not too small ($K_{\mathrm{per}}\geq 300$).

(a) Conv6  (b) ResNet18  (c) ResNet34
Figure 4: We train and validate Conv6, ResNet18 and ResNet34 on CIFAR-10 with various $K_{\mathrm{per}}\in\{1,30,300,3000,30000\}$. The x-axis is $K_{\mathrm{per}}$ in a log scale, and the y-axis is validation accuracy.

We also investigate the relationship between $K_{\mathrm{per}}$ and $r$. Figure 5 shows how the test accuracy changes when both $K_{\mathrm{per}}$ and $r$ vary. From this result, we find that the accuracies seem to depend on $r/K_{\mathrm{per}}$. This may be because each pruned parameter in the neural network is randomized $Nr/K_{\mathrm{per}}$ times in expectation during the optimization, where $N$ is the total number of training iterations. On the other hand, when we use a larger $r\in[0,1]$, we have to explore $K_{\mathrm{per}}$ over a longer range (e.g., around $3000$ iterations when $r=1.0$). Thus, appropriately choosing $r$ shrinks the search space of $K_{\mathrm{per}}$.
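For reference, the small calculation below (ours, for illustration) shows what the quantity $Nr/K_{\mathrm{per}}$ amounts to in the CIFAR-10 setting.

```python
def expected_randomizations(num_iterations: int, k_per: int, r: float) -> float:
    """Expected number of re-sampling events per pruned weight: (N / K_per) * r."""
    return (num_iterations // k_per) * r

# CIFAR-10: 100 epochs x ~351 iterations/epoch (batch size 128), K_per = 300, r = 0.1
# -> roughly 12 re-samplings per pruned weight on average.
print(expected_randomizations(100 * 351, 300, 0.1))
```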

Figure 5: Test accuracies on CIFAR-10 with ResNet18. The x-axis is $K_{\mathrm{per}}\in\{1,30,300,3000\}$ and the y-axis is $r\in\{0.001,0.01,0.1,1.0\}$.

C.2 Computational overhead of iterative randomization

IteRand adds computational cost to the base method, edge-popup, through the iterative randomization. However, this additional cost is negligibly small in all our experiments. We measured the average overhead of a single randomizing operation, which is the only difference from edge-popup, as follows: $97.10$ ms for ResNet18 ($11.2$M params) and $200.55$ ms for ResNet50 ($23.5$M params). Thus, the total additional cost should be about $10$ seconds over the whole training ($1.5$ hours) for ResNet18 on CIFAR-10 and $200\text{--}300$ seconds over the whole training (one week) for ResNet50 on ImageNet.
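These estimates follow from multiplying the number of randomizing operations by the measured per-operation overhead; the quick check below is ours, and the ImageNet iteration count per epoch (roughly $1.28$M images at batch size $128$) is an assumption.

```python
def total_overhead_secs(epochs: int, iters_per_epoch: int, k_per: int, per_op_ms: float) -> float:
    """Number of randomizing operations times the measured per-operation cost."""
    num_ops = (epochs * iters_per_epoch) // k_per
    return num_ops * per_op_ms / 1000.0

print(total_overhead_secs(100, 351, 300, 97.10))       # ResNet18 on CIFAR-10: ~11 seconds
print(total_overhead_secs(105, 10_000, 1000, 200.55))  # ResNet50 on ImageNet: ~211 seconds
```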

We also measured the total training times of our experiments for ResNet18 on CIFAR-10 (Table 5). The additional computational cost of IteRand over edge-popup is tens of seconds, which is quite consistent with the above estimates.

Table 5: Training times for ResNet18 on CIFAR-10.
Method | Training time (secs)
IteRand | $6253.69$
edge-popup | $6231.80$
SGD | $6388.61$

C.3 Experiments with varying sparsity

Table 6 shows a comparison of IteRand and edge-popup when varying the sparsity rate $p\in\{0.1,0.3,0.5,0.7,0.9,0.95,0.99\}$. We can see that IteRand is effective for almost all sparsities $p$.

Table 6: Test accuracies with various sparsities on CIFAR-10.
Network | Method | $p=0.1$ | $p=0.3$ | $p=0.5$ | $p=0.7$ | $p=0.9$ | $p=0.95$ | $p=0.99$
Conv6 | IteRand (SC) | $\mathbf{87.17}\pm 0.21$ | $\mathbf{89.08}\pm 0.14$ | $\mathbf{89.19}\pm 0.13$ | $\mathbf{87.67}\pm 0.07$ | $\mathbf{76.09}\pm 0.23$ | $\mathbf{59.03}\pm 1.55$ | $12.04\pm 3.40$
Conv6 | edge-popup (SC) | $80.31\pm 0.27$ | $86.55\pm 0.22$ | $87.57\pm 0.03$ | $86.40\pm 0.13$ | $73.25\pm 0.79$ | $55.23\pm 1.48$ | $\mathbf{12.13}\pm 3.46$
ResNet18 | IteRand (SC) | $\mathbf{91.79}\pm 0.20$ | $\mathbf{92.67}\pm 0.11$ | $\mathbf{92.66}\pm 0.22$ | $\mathbf{92.61}\pm 0.15$ | $\mathbf{91.82}\pm 0.15$ | $\mathbf{90.40}\pm 0.42$ | $\mathbf{76.31}\pm 2.56$
ResNet18 | edge-popup (SC) | $87.37\pm 0.18$ | $91.43\pm 0.16$ | $92.25\pm 0.18$ | $92.32\pm 0.10$ | $91.64\pm 0.18$ | $90.28\pm 0.30$ | $75.21\pm 2.71$

Moreover, we compared the pruning-only approach (IteRand and edge-popup) with the iterative magnitude pruning (IMP) approach [6] at various sparsity rates. We employed the OpenLTH framework [5], which contains an implementation of IMP, as the codebase for this experiment, and implemented both edge-popup and IteRand in this framework. The results are shown in Table 7. Overall, IMP outperforms the pruning-only methods. However, there is still room for improvement in the pruning-only approach, such as introducing scheduled sparsities or an adaptive threshold, which we leave to future work.

Table 7: Comparison of the pruning-only approach and the magnitude-based one.
Network | Method | $p=0.5$ | $p=0.7$ | $p=0.9$ | $p=0.95$ | $p=0.99$
VGG11 | IteRand (SC) | $88.46\pm 0.22$ | $88.29\pm 0.42$ | $87.05\pm 0.07$ | $84.37\pm 0.59$ | $64.93\pm 4.81$
VGG11 | edge-popup (SC) | $87.09\pm 0.31$ | $87.34\pm 0.21$ | $85.11\pm 0.55$ | $81.06\pm 0.84$ | $61.71\pm 6.05$
VGG11 | IMP with 3 retrainings [6] | $\mathbf{91.47}\pm 0.15$ | $\mathbf{91.48}\pm 0.16$ | $\mathbf{90.89}\pm 0.08$ | $\mathbf{90.39}\pm 0.27$ | $\mathbf{88.076}\pm 0.17$
ResNet20 | IteRand (SC) | $84.17\pm 0.78$ | $82.31\pm 0.52$ | $70.96\pm 0.55$ | $55.60\pm 0.70$ | $24.45\pm 0.67$
ResNet20 | edge-popup (SC) | $76.57\pm 0.91$ | $75.83\pm 2.75$ | $49.25\pm 6.33$ | $42.96\pm 3.91$ | $20.11\pm 3.43$
ResNet20 | IMP with 3 retrainings [6] | $\mathbf{90.70}\pm 0.37$ | $\mathbf{89.79}\pm 0.14$ | $\mathbf{86.87}\pm 0.26$ | $\mathbf{84.28}\pm 0.08$ | $\mathbf{71.78}\pm 1.66$

C.4 Detailed empirical analysis on the parameter efficiency

We conducted experiments to see how much more network width edge-popup requires than IteRand to achieve the same accuracy (Table 8). Here we use ResNet18 with various width factors $\rho\in\mathbb{R}_{>0}$. We first computed the test accuracies of IteRand with $\rho=0.5,1.0$ as target values. Next, we searched for the width factors at which edge-popup achieves the same accuracy as the target values. Table 8 shows that edge-popup requires networks about $1.3$ times wider than IteRand in this specific setting.

Table 8: Test accuracies for ResNet18 with various width factors.
Method | $\rho=0.5$ | $\rho=0.65$ | $\rho=1.0$ | $\rho=1.3$
IteRand (KU) | $\mathbf{89.96}\pm 0.06$ | - | $\mathbf{92.47}\pm 0.16$ | -
edge-popup (KU) | $88.45\pm 0.53$ | $\mathbf{89.99}\pm 0.08$ | $91.79\pm 0.19$ | $\mathbf{92.54}\pm 0.19$

C.5 Experiments with large-scale networks

In addition to the experiments in Section 5, we conducted experiments to see the effectiveness of IteRand on large-scale networks: WideResNet-50-2 [33] and ResNet50 with the width factor $\rho=2.0$. Table 9 shows that the iterative randomization is still effective at improving the performance of weight-pruning optimization for these networks.

Table 9: Experiments with large-scale networks.
Network | IteRand (SC) | edge-popup (SC) | # of parameters
WideResNet-50-2 | $\mathbf{73.57}\%$ | $71.59\%$ | $68.8$M
ResNet50 ($\rho=2.0$) | $\mathbf{74.05}\%$ | $72.96\%$ | $97.8$M

C.6 Experiments with a text classification task

Although our main theorem (Theorem 4.1) indicates that the effectiveness of IteRand does not depend on any specific task, we presented results only on image classification datasets in the body of this paper. In Table 10, we present experimental results on a text classification dataset, IMDB [18], with recurrent neural networks (see Table 11 for the network architectures). For this experiment, we implemented both edge-popup and IteRand on top of the Jupyter notebook originally written by Trevett [30]. All models are trained for $15$ epochs, and the learning rate $\eta$ we used is $\eta=1.0$ for SGD and $\eta=2.5$ for edge-popup and IteRand. Note that the learning rate $\eta=2.5$ does not work well for SGD, so we employed a different value from the one used for edge-popup and IteRand. We also set the hyperparameters for IteRand to $p=0.5$, $K_{\mathrm{per}}=\lfloor 270/6\rfloor$ ($\approx 1/6$ of an epoch) and $r=1.0$.

Table 10: Test accuracies on the IMDB dataset over $5$ runs.
Network | IteRand (SC) | edge-popup (SC) | SGD
LSTM | $\mathbf{88.44}\pm 0.28\%$ | $88.16\pm 0.12\%$ | $87.39\pm 0.37\%$
BiLSTM | $\mathbf{88.51}\pm 0.24\%$ | $88.34\pm 0.19\%$ | $87.62\pm 0.22\%$
Table 11: The network architectures for IMDB.
Layer | (Bi)LSTM
Embedding Layer | ${\rm dim}=100$
LSTM Layer | ${\rm hidden\_dim}=256$, ${\rm num\_layers}=1$ (${\rm bidirectional}={\rm True}$ for BiLSTM)
Dropout Layer | $p=0.2$
Linear Layer | ${\rm output\_dim}=1$
Output Layer | sigmoid