
Pruning Randomly Initialized Neural Networks
with Iterative Randomization

Daiki Chijiwa    Shin’ya Yamaguchi    Yasutoshi Ida    Kenji Umakoshi    Tomohiro Inoue

NTT Computer and Data Science Laboratories, NTT Corporation
NTT Social Informatics Laboratories, NTT Corporation
Corresponding author: [email protected]
Abstract

Pruning the weights of randomly initialized neural networks plays an important role in the context of the lottery ticket hypothesis. Ramanujan et al. [26] empirically showed that pruning the weights alone, without optimizing their values, can achieve remarkable performance. However, to reach the same level of performance as weight optimization, the pruning approach requires more parameters in the networks before pruning, and thus more memory space. To overcome this parameter inefficiency, we introduce a novel framework for pruning randomly initialized neural networks by iteratively randomizing the weight values (IteRand). Theoretically, we prove an approximation theorem in our framework, which indicates that the randomizing operations are provably effective in reducing the required number of parameters. We also empirically demonstrate the parameter efficiency in multiple experiments on CIFAR-10 and ImageNet. The code is available at https://github.com/dchiji-ntt/iterand.

1 Introduction

The lottery ticket hypothesis, which was originally proposed by Frankle and Carbin [6], has been an important topic in the research of deep neural networks (DNNs). The hypothesis claims that an over-parameterized DNN has a sparse subnetwork (called a winning ticket) that can achieve almost the same accuracy as the fully-trained entire network when trained independently. If the hypothesis holds for a given network, then we can reduce the computational cost by using the sparse subnetwork instead of the entire network while maintaining the accuracy [15, 22]. In addition to the practical benefit, the hypothesis also suggests that the over-parametrization of DNNs is no longer necessary and their subnetworks alone are sufficient to achieve full accuracy.

Ramanujan et al. [26] went one step further. They proposed and empirically demonstrated a conjecture related to the above hypothesis, called the strong lottery ticket hypothesis, which informally states that there exists a subnetwork in a randomly initialized neural network such that it already achieves almost the same accuracy as a fully trained network, without any optimization of the weights of the network. A remarkable consequence of this hypothesis is that neural networks could be trained by solving a discrete optimization problem. That is, we may train a randomly initialized neural network by finding an optimal subnetwork (which we call weight-pruning optimization), instead of optimizing the network weights continuously (which we call weight-optimization) with stochastic gradient descent (SGD).

However, the weight-pruning optimization requires a problematically large number of parameters in the random network before pruning. Pensia et al. [25] theoretically showed that the network width required for the weight-pruning optimization needs to be logarithmically larger than that for the weight-optimization, at least in the case of shallow networks. Therefore, the weight-pruning optimization requires more parameters, and thus more memory space, than the weight-optimization to achieve the same accuracy. In other words, under a given memory constraint, the weight-pruning optimization can reach lower final accuracy than the weight-optimization in practice.

In this paper, we propose a novel optimization method for neural networks called weight-pruning with iterative randomization (IteRand), which extends the weight-pruning optimization to overcome the parameter inefficiency. The key idea is to virtually increase the network width by randomizing pruned weights at each iteration of the weight-pruning optimization, without any additional memory consumption. Indeed, we theoretically show that the required network width can be reduced by the randomizing operations. More precisely, our theoretical result indicates that, if the number of randomizing operations is large enough, we can reduce the required network width for weight-pruning to the same as that for a network fully trained by the weight-optimization up to constant factors, in contrast to the logarithmic factors of the previous results [25, 24]. We also empirically demonstrate that, under a given amount of network parameters, IteRand boosts the accuracy of the weight-pruning optimization in multiple vision experiments.

2 Background

In this section, we review the prior works on pruning randomly initialized neural networks.

Notation and setup.

Let $d,N\in\mathbb{N}$. Let $f(x;\bm{\theta})$ be an $l$-layered ReLU neural network with an input $x\in\mathbb{R}^{d}$ and parameters $\bm{\theta}=(\theta_{i})_{1\leq i\leq n}\in\mathbb{R}^{n}$, where each weight $\theta_{i}$ is randomly sampled from a distribution $\mathcal{D}_{\mathrm{param}}$ over $\mathbb{R}$. A subnetwork of $f(x;\bm{\theta})$ is written as $f(x;\bm{\theta}\odot\mathbf{m})$, where $\mathbf{m}\in\{0,1\}^{n}$ is a binary mask and "$\odot$" denotes element-wise multiplication.

Ramanujan et al. [26] empirically observed that we can train the randomly initialized neural network $f(x;\bm{\theta})$ by solving the following discrete optimization problem, which we call weight-pruning optimization:

$\min_{\mathbf{m}\in\{0,1\}^{n}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}_{\mathrm{labeled}}}\Big[\mathcal{L}(f(x;\bm{\theta}\odot\mathbf{m}),y)\Big],$   (1)

where $\mathcal{D}_{\mathrm{labeled}}$ is a distribution over a set of labeled data $(x,y)$ and $\mathcal{L}$ is a loss function. To solve this optimization problem, Ramanujan et al. [26] proposed an optimization algorithm called edge-popup (Algorithm 1).
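To make the objective in Eq. (1) concrete, the following is a minimal NumPy sketch (our illustration, not the paper's implementation; the tiny network, loss, and all variable names are assumptions) that evaluates the loss term inside Eq. (1) for a two-layer ReLU network on a single labeled example.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def masked_forward(x, W1, W2, m1, m2):
    """Forward pass of the subnetwork f(x; theta ⊙ m) for a 2-layer ReLU net."""
    h = relu((W1 * m1) @ x)      # first layer with pruned weights
    return (W2 * m2) @ h         # second layer with pruned weights

def cross_entropy(logits, y):
    """Loss term L(f(x; theta ⊙ m), y) with softmax cross-entropy."""
    z = logits - logits.max()
    return -(z[y] - np.log(np.exp(z).sum()))

# Randomly initialized weights theta (never trained) and binary masks m.
d_in, d_hidden, d_out = 8, 16, 4
W1 = rng.uniform(-1.0, 1.0, size=(d_hidden, d_in))
W2 = rng.uniform(-1.0, 1.0, size=(d_out, d_hidden))
m1 = (rng.random(W1.shape) > 0.5).astype(W1.dtype)   # keep roughly half the weights
m2 = (rng.random(W2.shape) > 0.5).astype(W2.dtype)

x, y = rng.normal(size=d_in), 2
print(cross_entropy(masked_forward(x, W1, W2, m1, m2), y))
```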

1: Initialize $\bm{\theta}\sim\mathcal{D}^{n}_{\mathrm{param}}$, $\mathbf{s}\sim\mathcal{D}^{n}_{\mathrm{score}}$;   // $\mathcal{D}_{\mathrm{param}}$ and $\mathcal{D}_{\mathrm{score}}$ are distributions over $\mathbb{R}$
2: for $k=0,\cdots,N-1$ do
3:     Sample a labeled data point $(x,y)\sim\mathcal{D}_{\mathrm{labeled}}$;
4:     $\mathbf{m},\mathbf{s}\leftarrow{\texttt{TrainMask}}(\bm{\theta},\mathbf{s};(x,y))$;   // Optimize importance scores $\mathbf{s}$ and update $\mathbf{m}$
5: end for
6: return $\mathbf{m},\bm{\theta}$;
Algorithm 1 Weight-pruning optimization by edge-popup [26]

TrainMask (Algorithm 2) is the key subroutine in Algorithm 1. It maintains a latent variable $\mathbf{s}=(s_{i})_{1\leq i\leq n}\in\mathbb{R}^{n}$, where each element $s_{i}$ represents an importance score of the corresponding weight $\theta_{i}$, and optimizes $\mathbf{s}$ instead of directly optimizing the discrete variable $\mathbf{m}$. Given the scores $\mathbf{s}$, the corresponding mask $\mathbf{m}$ is computed by the function ${\texttt{CalculateMask}}(\mathbf{s})$, which returns $\mathbf{m}=(m_{i})_{1\leq i\leq n}$ defined as follows: $m_{i}=1$ if $s_{i}$ is in the top $100(1-p)\%$ of $\{s_{i}\}_{1\leq i\leq n}$, and $m_{i}=0$ otherwise, where $p\in(0,1)$ is a hyperparameter representing the sparsity rate of the pruned network. In line 3 of Algorithm 2, $\mathrm{SGD}_{\eta,\lambda,\mu}(\mathbf{s},\mathbf{g})$ returns the value of $\mathbf{s}$ updated by stochastic gradient descent with learning rate $\eta$, weight decay $\lambda$, momentum coefficient $\mu$, and gradient vector $\mathbf{g}$.

1: Input: $\bm{\theta},\mathbf{s}\in\mathbb{R}^{n}$, $(x,y)$: a labeled data point;
2: $\mathbf{m}\leftarrow{\texttt{CalculateMask}}(\mathbf{s})$;   // Calculate the mask $\mathbf{m}$ with the current scores
3: $\mathbf{s}\leftarrow\mathrm{SGD}_{\eta,\lambda,\mu}\left(\mathbf{s},\nabla_{\overline{\mathbf{s}}=\mathbf{m}}\mathcal{L}(f(x;\bm{\theta}\odot\overline{\mathbf{s}}),y)\right)$;   // Update $\mathbf{s}$ by the gradient at $\overline{\mathbf{s}}=\mathbf{m}$
4: $\mathbf{m}\leftarrow{\texttt{CalculateMask}}(\mathbf{s})$;   // Calculate the new mask $\mathbf{m}$ with the updated scores
5: return $\mathbf{m},\mathbf{s}$;
Algorithm 2 Pseudo code of TrainMask
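For illustration, here is a minimal NumPy sketch of one TrainMask step for a toy linear model $f(x;\bm{\theta}\odot\overline{\mathbf{s}})=(\bm{\theta}\odot\overline{\mathbf{s}})\cdot x$ with a squared loss. It is our own sketch under these simplifying assumptions (plain SGD without weight decay or momentum), not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def calculate_mask(s, p):
    """CalculateMask: keep the top 100*(1-p)% of the scores, prune the rest."""
    k = int(p * s.size)                 # number of weights to prune
    m = np.ones_like(s)
    m[np.argsort(s)[:k]] = 0.0          # zero out the k lowest-scored weights
    return m

def train_mask(theta, s, x, y, p=0.5, lr=0.1):
    """One TrainMask step (Algorithm 2) for a toy linear model with squared loss."""
    m = calculate_mask(s, p)            # line 2: mask from the current scores
    pred = np.dot(theta * m, x)         # f(x; theta ⊙ m)
    # Gradient of 0.5*((theta ⊙ s̄)·x - y)^2 w.r.t. the relaxed scores s̄, evaluated at s̄ = m:
    grad_s = (pred - y) * theta * x
    s = s - lr * grad_s                 # line 3: SGD step on the scores
    return calculate_mask(s, p), s      # line 4: recompute the mask

n = 10
theta = rng.uniform(-1.0, 1.0, size=n)  # random weights, never optimized
s = rng.uniform(0.0, 1.0, size=n)       # importance scores
x, y = rng.normal(size=n), 1.0
m, s = train_mask(theta, s, x, y)
print(m)
```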

On the theoretical side, Malach et al. [19] first provided a mathematical justification of the above empirical observation. They formulated it as an approximation theorem with some assumptions on the network width as follows.

Theorem 2.1 (informal statement of Theorem 2.1 in [19])

Let $f_{\mathrm{target}}(x)$ be an $l$-layered network with bounded weight matrices, and let $g(x)$ be a randomly initialized $2l$-layered neural network. If the width of $g(x)$ is larger than that of $f_{\mathrm{target}}(x)$ by a polynomial factor, then with high probability there exists a subnetwork of $g(x)$ that approximates $f_{\mathrm{target}}(x)$.

By taking a well-trained network as $f_{\mathrm{target}}$, Theorem 2.1 indicates that pruning a sufficiently wide $g(x)$ can, in principle, reveal a subnetwork whose test accuracy is as good as that of $f_{\mathrm{target}}$.

In the follow-up works [25, 24], the assumption on the network width was improved by reducing the factor of the required width to a logarithmic term. However, Pensia et al. [25] showed that the logarithmic order is unavoidable at least in the case of $l=1$. While their result implies the optimality of the logarithmic bound, it also means that we cannot further relax the assumption on the network width as long as we work in the same setting. This indicates a limitation of the weight-pruning optimization: under a given amount of memory or network parameters, it can train only less expressive models than those trained with conventional weight-optimization such as SGD.

3 Method

In this section, we present a novel method called weight-pruning with iterative randomization (IteRand) for randomly initialized neural networks.

As discussed in Section 2, although the original weight-pruning optimization (Algorithm 1) can achieve good accuracy, it still has a limitation in the expressive power under a fixed amount of memory or network parameters. Our method is designed to overcome this limitation. The main idea is to randomize pruned weights at each iteration of the weight-pruning optimization. As we prove in Section 4, this reduces the required size of an entire network to be pruned.

We use the same notation and setup as Section 2. In addition, we assume that each weight $\theta_{i}$ of the network $f(x;\bm{\theta})$ can be re-sampled from $\mathcal{D}_{\mathrm{param}}$ at each iteration of the weight-pruning optimization.

3.1 Algorithm

Algorithm 3 describes our proposed method, IteRand, which extends Algorithm 1. The differences from Algorithm 1 are lines 5-7. IteRand has a hyperparameter $K_{\mathrm{per}}\in\mathbb{N}_{\geq 1}$ (line 5). At the $k$-th iteration, whenever $k+1$ is divisible by $K_{\mathrm{per}}$, pruned weights are randomized by the ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$ function (line 6). There are multiple possible designs for ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$, which will be discussed in the next subsection.

1: Initialize $\bm{\theta}\sim\mathcal{D}^{n}_{\mathrm{param}}$, $\mathbf{s}\sim\mathcal{D}^{n}_{\mathrm{score}}$;
2: for $k=0,\cdots,N-1$ do
3:     Sample a labeled data point $(x,y)\sim\mathcal{D}_{\mathrm{labeled}}$;
4:     $\mathbf{m},\mathbf{s}\leftarrow{\texttt{TrainMask}}(\bm{\theta},\mathbf{s};(x,y))$;
5:     if $k+1$ is divisible by $K_{\mathrm{per}}$ then   // New if-block added to Algorithm 1
6:         $\bm{\theta}\leftarrow{\texttt{Randomize}}(\bm{\theta},\mathbf{m})$;   // Randomize a subset of pruned weights
7:     end if
8: end for
9: return $\mathbf{m}$, $\bm{\theta}$;
Algorithm 3 Weight-pruning optimization with iterative randomization (IteRand)
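As a sketch of how the pieces fit together, the following toy NumPy loop mirrors Algorithm 3 on a one-layer linear model. It uses the naive randomization defined later in Section 3.2 (Eq. (2)) for brevity, and the model, data, and hyperparameter values are illustrative assumptions rather than the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def calculate_mask(s, p):
    m = np.ones_like(s)
    m[np.argsort(s)[:int(p * s.size)]] = 0.0      # prune the lowest-scored weights
    return m

def iterand(data, n, p=0.5, lr=0.1, K_per=5, num_steps=100):
    """IteRand (Algorithm 3) for a toy linear model x -> theta·x."""
    theta = rng.uniform(-1.0, 1.0, size=n)        # D_param = U[-1, 1]
    s = rng.uniform(0.0, 1.0, size=n)             # importance scores
    m = calculate_mask(s, p)
    for k in range(num_steps):
        x, y = data[k % len(data)]
        # TrainMask (Algorithm 2): straight-through SGD on the scores with squared loss.
        pred = np.dot(theta * m, x)
        s = s - lr * (pred - y) * theta * x
        m = calculate_mask(s, p)
        # New if-block (lines 5-7): re-sample the pruned weights every K_per iterations.
        if (k + 1) % K_per == 0:
            theta_new = rng.uniform(-1.0, 1.0, size=n)
            theta = theta * m + theta_new * (1 - m)   # naive randomization (Eq. (2))
    return m, theta

# Toy data from a sparse target: y = w·x with most entries of w equal to zero.
n = 20
w = np.zeros(n)
w[:3] = rng.uniform(-1.0, 1.0, size=3)
data = [(x, np.dot(w, x)) for x in rng.normal(size=(200, n))]
m, theta = iterand(data, n)
print("kept weights:", int(m.sum()), "of", n)
```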

Note that $K_{\mathrm{per}}$ controls how frequently the algorithm randomizes the pruned weights. Indeed, the total number of randomizing operations is $\lfloor N/K_{\mathrm{per}}\rfloor$. If $K_{\mathrm{per}}$ is too small, the algorithm is likely to be unstable because it may randomize even important weights before their scores are well-optimized, and the overhead of the randomizing operations can no longer be ignored. In contrast, if $K_{\mathrm{per}}$ is too large, the algorithm becomes almost the same as the original weight-pruning Algorithm 1, and thus the effect of the randomization disappears. We fix $K_{\mathrm{per}}=300$ on CIFAR-10 (about 1 epoch) and $K_{\mathrm{per}}=1000$ on ImageNet (about 1/10 of an epoch) in our experiments (Section 5).

3.2 Designs of ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$

Here, we discuss how to define the ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})$ function. There are several possible ways to randomize a subset of the parameters $\bm{\theta}$.

Naive randomization.

For any distribution $\mathcal{D}_{\mathrm{param}}$, a naive definition of the randomization function (which we call naive randomization) can be given as follows.

${\texttt{Randomize}}(\bm{\theta},\mathbf{m})_{i}:=\begin{cases}\theta_{i}&(\text{if }m_{i}=1)\\ \widetilde{\theta}_{i}&(\text{otherwise})\end{cases}$

where we denote the $i$-th component of ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})\in\mathbb{R}^{n}$ as ${\texttt{Randomize}}(\bm{\theta},\mathbf{m})_{i}$, and each $\widetilde{\theta}_{i}\in\mathbb{R}$ is a random variable with the distribution $\mathcal{D}_{\mathrm{param}}$. This can also be written in another form as

${\texttt{Randomize}}(\bm{\theta},\mathbf{m}):=\bm{\theta}\odot\mathbf{m}+\widetilde{\bm{\theta}}\odot(1-\mathbf{m}),$   (2)

where $\widetilde{\bm{\theta}}=(\widetilde{\theta}_{i})_{1\leq i\leq n}\in\mathbb{R}^{n}$ is a random variable with the distribution $\mathcal{D}^{n}_{\mathrm{param}}$.

Partial randomization.

The naive randomization (Eq. (2)) is likely to be unstable because it replaces all pruned weights with random values every $K_{\mathrm{per}}$ iterations. To increase stability, we modify the naive randomization so that it replaces only a randomly chosen subset of the pruned weights (which we call partial randomization):

${\texttt{Randomize}}(\bm{\theta},\mathbf{m}):=\bm{\theta}\odot\mathbf{m}+\left(\bm{\theta}\odot(1-\mathbf{b}_{r})+\widetilde{\bm{\theta}}\odot\mathbf{b}_{r}\right)\odot(1-\mathbf{m}),$   (3)

Figure 1: Analysis on $r$. We compare partial randomization with $r\in\{0.0,0.01,0.1,1.0\}$ applied to CNNs. The y-axis is the validation accuracy on CIFAR-10. $r=0.1$ achieves better mean accuracy for every CNN.

where $\widetilde{\bm{\theta}}\in\mathbb{R}^{n}$ is the same as in Eq. (2), $r\in[0,1]$ is a hyperparameter, and $\mathbf{b}_{r}=(b_{r,i})_{1\leq i\leq n}\in\{0,1\}^{n}$ is a binary vector whose elements are sampled from the Bernoulli distribution ${\mathrm{Bernoulli}}(r)$, i.e. $b_{r,i}=1$ with probability $r$ and $b_{r,i}=0$ with probability $1-r$.

The partial randomization replaces a randomly chosen $100r\%$ of all pruned weights with random values. Note that, when $r=1$, the partial randomization is equivalent to the naive randomization (Eq. (2)). In contrast, when $r=0$, it never randomizes any weights and thus reduces to Algorithm 1.
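A minimal NumPy sketch of the partial randomization in Eq. (3) (our illustration; the sampler stands in for $\mathcal{D}_{\mathrm{param}}$ and the values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize(theta, m, r=0.1, sampler=lambda size: rng.uniform(-1.0, 1.0, size=size)):
    """Partial randomization (Eq. (3)): re-sample a random fraction r of the pruned weights."""
    theta_tilde = sampler(theta.shape)                 # fresh draw from D_param
    b_r = rng.binomial(1, r, size=theta.shape)         # Bernoulli(r) selector
    kept = theta * m                                   # surviving weights are untouched
    pruned = (theta * (1 - b_r) + theta_tilde * b_r) * (1 - m)
    return kept + pruned

theta = rng.uniform(-1.0, 1.0, size=8)
m = np.array([1, 0, 1, 0, 0, 1, 0, 1], dtype=float)
print(randomize(theta, m, r=0.5))
```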

In Figure 1, we observe that $r=0.1$ works well with various network architectures on CIFAR-10, where we use the Kaiming uniform distribution (whose definition will be given in Section 5) for $\mathcal{D}_{\mathrm{param}}$ and $K_{\mathrm{per}}=300$.

4 Theoretical justification

In this section, we present a theoretical justification for our iterative randomization approach on the weight-pruning of randomly initialized neural networks.

4.1 Setup

We consider a target neural network $f:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{l}}$ of depth $l$, which is described as follows.

$f(x)=F_{l}\sigma(F_{l-1}\sigma(\cdots F_{1}(x)\cdots)),$   (4)

where $x$ is a $d_{0}$-dimensional real vector, $\sigma$ is the ReLU activation, and $F_{i}$ is a $d_{i}\times d_{i-1}$ matrix. Our objective is to approximate the target network $f(x)$ by pruning a randomly initialized neural network $g(x)$, which tends to be larger than the target network.

Similar to the previous works [19, 25], we assume that $g(x)$ is twice as deep as the target network $f(x)$. Thus, $g(x)$ can be described as

$g(x)=G_{2l}\sigma(G_{2l-1}\sigma(\cdots G_{1}(x)\cdots)),$   (5)

where $G_{j}$ is a $\widetilde{d}_{j}\times\widetilde{d}_{j-1}$ matrix ($j=1,\cdots,2l$) with $\widetilde{d}_{2i}=d_{i}$. Each element of the matrix $G_{j}$ is assumed to be drawn from the uniform distribution $U[-1,1]$. Since there is a one-to-one correspondence between pruned networks of $g(x)$ and sequences of binary matrices $M=\{M_{j}\}_{j=1,\cdots,2l}$ with $M_{j}\in\{0,1\}^{\widetilde{d}_{j}\times\widetilde{d}_{j-1}}$, every pruned network of $g(x)$ can be described as

$g_{M}(x)=(G_{2l}\odot M_{2l})\sigma((G_{2l-1}\odot M_{2l-1})\sigma(\cdots(G_{1}\odot M_{1})(x)\cdots)).$   (6)

Under this setup, we recall that the previous works showed that, with high probability, there exists a subnetwork of $g(x)$ that approximates $f(x)$ when the width of $g(x)$ is larger than that of $f(x)$ by polynomial factors [19] or logarithmic factors [25, 24].
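To make Eq. (6) concrete, here is a short NumPy sketch (our illustration; the widths are arbitrary) that evaluates a pruned network $g_{M}(x)$ from random matrices $G_{j}$ and binary masks $M_{j}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_M(x, Gs, Ms):
    """Evaluate the pruned network g_M(x) of Eq. (6): masked affine maps with ReLU in between."""
    h = x
    for j, (G, M) in enumerate(zip(Gs, Ms)):
        h = (G * M) @ h
        if j < len(Gs) - 1:              # ReLU after every layer except the last
            h = np.maximum(h, 0.0)
    return h

# Widths for l = 2 (so 2l = 4 layers), with the even widths d~_{2i} = d_i as in the setup above.
widths = [3, 10, 4, 12, 2]               # d~_0, d~_1, d~_2, d~_3, d~_4
Gs = [rng.uniform(-1.0, 1.0, size=(widths[j + 1], widths[j])) for j in range(4)]
Ms = [rng.integers(0, 2, size=G.shape).astype(float) for G in Gs]
print(g_M(rng.normal(size=3), Gs, Ms))
```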

4.2 Formulation and main results

Now we attempt to mathematically formulate our proposed method, IteRand, as an approximation problem. As described in Algorithm 3, the method consists of two steps: optimizing binary variables $M=\{M_{j}\}_{j=1,\cdots,2l}$ and randomizing pruned weights in $g(x)$. The first step can be formulated as the approximation problem of $f(x)$ by some $g_{M}(x)$ as described above. Corresponding to the second step, we introduce an idealized assumption on $g(x)$ for a given number $R\in\mathbb{N}_{\geq 1}$: each element of the weight matrix $G_{j}$ can be re-sampled with replacement from the uniform distribution $U[-1,1]$ up to $R-1$ times, for all $j=1,\cdots,2l$ (re-sampling assumption for $R$).

Under this re-sampling assumption, we obtain the following theorem.

Theorem 4.1 (Main Theorem)

Fix $\epsilon,\delta>0$, and assume that $\lVert F_{i}\rVert_{\rm Frob}\leq 1$ for all $i$. Let $R\in\mathbb{N}$, and assume that $g(x)$ satisfies the re-sampling assumption for $R$.

If $\widetilde{d}_{2i-1}\geq 2d_{i-1}\left\lceil\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}R^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)\right\rceil$ holds for all $i=1,\cdots,l$, then with probability at least $1-\delta$, there exist binary matrices $M=\{M_{j}\}_{1\leq j\leq 2l}$ such that

$\lVert f(x)-g_{M}(x)\rVert_{2}\leq\epsilon,\text{ for }\lVert x\rVert_{\infty}\leq 1.$   (7)

In particular, if $R$ is larger than $\frac{8ld_{i-1}}{\epsilon}\sqrt{d_{i}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)}$, then $\widetilde{d}_{2i-1}=2d_{i-1}$ is enough.
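The last claim follows directly from the width condition: if $R\geq\frac{8ld_{i-1}}{\epsilon}\sqrt{d_{i}\log(\frac{2ld_{i-1}d_{i}}{\delta})}$, then

$R^{2}\geq\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right),\quad\text{and hence}\quad\left\lceil\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}R^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)\right\rceil=1,$

so the condition reduces to $\widetilde{d}_{2i-1}\geq 2d_{i-1}$.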

Theorem 4.1 shows that iterative randomization is provably helpful for approximating wider networks by the weight-pruning optimization of a random network. In fact, when the number of re-samplings is sufficiently large, the required width of $g(x)$ in Theorem 4.1 is reduced to only twice the width of $f(x)$, in contrast to the prior results without the re-sampling assumption, where the required width is logarithmically wider than $f(x)$ [25, 24]. This means that, under a fixed number of parameters of $g(x)$, we can achieve higher accuracy by weight-pruning of $g(x)$ with iterative randomization, since a wider target network has a higher model capacity.

In the rest of this section, we present the core ideas by proving the simplest case ($l=d_{0}=d_{1}=1$) of the theorem, while the full proof of Theorem 4.1 is given in Appendix A. Note that the full proof is obtained essentially by applying the argument for the simplest case inductively on the widths and depths.

4.3 Proof ideas for Theorem 4.1

Let us consider the case of $l=d_{0}=d_{1}=1$. Then the target network $f(x)$ can be written as $f(x)=wx:\mathbb{R}\to\mathbb{R}$, where $w\in\mathbb{R}$, and $g(x)$ can be written as $g(x)=\mathbf{v}^{T}\sigma(\mathbf{u}x)$ where $\mathbf{u},\mathbf{v}\in\mathbb{R}^{\widetilde{d}_{1}}$. Also, subnetworks of $g(x)$ can be written as $g_{\mathbf{m}}(x)=(\mathbf{v}\odot\mathbf{m})^{T}\sigma((\mathbf{u}\odot\mathbf{m})x)$ for some $\mathbf{m}\in\{0,1\}^{\widetilde{d}_{1}}$.

There are two technical points in our proof. The first point is the following splitting of $f(x)$:

$f(x)=w\sigma(x)-w\sigma(-x),$   (8)

for any $x\in\mathbb{R}$. This splitting is very similar to the one used in the previous works [19, 25, 24]:

$f(x)=\sigma(wx)-\sigma(-wx).$   (9)

However, if we use the latter splitting Eq. (9), it turns out that we cannot obtain the lower bound of $\widetilde{d}_{2i-1}$ in Theorem 4.1 when $d_{0}>1$. (Here we do not treat this case, but the proof for $d_{0}>1$ is given in Appendix A.) Thus we need to use our splitting Eq. (8) instead.

Using Eq. (8), we can give another proof of the following approximation result without iterative randomization, which was already shown in the previous work [19].

Lemma 4.2

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let $\mathbf{u},\mathbf{v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):=\mathbf{v}^{T}\sigma(\mathbf{u}\cdot x)$. If $d\geq 2\lceil\frac{16}{\epsilon^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\mathbf{m}}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (10)

where $g_{\mathbf{m}}(x):=(\mathbf{v}\odot\mathbf{m})^{T}\sigma(\mathbf{u}\cdot x)$ for some $\mathbf{m}\in\{0,1\}^{d}$.

Proof (sketch):

We assume that $d$ is an even number, $d=2d^{\prime}$, so that we can split the index set $\{0,\cdots,d-1\}$ of the $d$ hidden neurons of $g(x)$ into $I=\{0,\cdots,d^{\prime}-1\}$ and $J=\{d^{\prime},\cdots,d-1\}$. Then we have the corresponding subnetworks $g_{I}(x)$ and $g_{J}(x)$ given by $g_{I}(x):=\sum_{k\in I}v_{k}\sigma(u_{k}x)$, $g_{J}(x):=\sum_{k\in J}v_{k}\sigma(u_{k}x)$, which satisfy $g(x)=g_{I}(x)+g_{J}(x)$.

By the splitting Eq. (8), it is enough to consider the probabilities of approximating $w\sigma(x)$ by a subnetwork of $g_{I}(x)$ and of approximating $-w\sigma(-x)$ by a subnetwork of $g_{J}(x)$. Now we have

$\mathbb{P}\left(\not\exists i\in I\text{ s.t. }|u_{i}-1|\leq\frac{\epsilon}{2},|v_{i}-w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (11)
$\mathbb{P}\left(\not\exists j\in J\text{ s.t. }|u_{j}+1|\leq\frac{\epsilon}{2},|v_{j}+w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (12)

for $d^{\prime}\geq\frac{16}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$, by a standard argument for the uniform distribution and the inequality $1+x\leq e^{x}$. By the union bound, with probability at least $1-\delta$, we have $i\in I$ and $j\in J$ such that

$\big|w\sigma(x)-v_{i}\sigma(u_{i}x)\big|\leq\frac{\epsilon}{2},$
$\big|-w\sigma(-x)-v_{j}\sigma(u_{j}x)\big|\leq\frac{\epsilon}{2}.$

Combining these inequalities, we finish the proof. \square
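For completeness, the per-neuron probability behind Eq. (11) can be spelled out as follows. For $u_{i}\sim U[-1,1]$, the event $|u_{i}-1|\leq\frac{\epsilon}{2}$ corresponds to the interval $[1-\frac{\epsilon}{2},1]$ of length $\frac{\epsilon}{2}$, so it has probability $\frac{\epsilon}{4}$; for $v_{i}\sim U[-1,1]$ and $w\in[-1,1]$, at least half of the interval $[w-\frac{\epsilon}{2},w+\frac{\epsilon}{2}]$ lies in $[-1,1]$, so $\mathbb{P}(|v_{i}-w|\leq\frac{\epsilon}{2})\geq\frac{\epsilon}{4}$. By independence, each $i\in I$ satisfies both conditions with probability at least $\frac{\epsilon^{2}}{16}$, and therefore the failure probability over $I$ is at most $\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq e^{-\frac{d^{\prime}\epsilon^{2}}{16}}\leq\frac{\delta}{2}$ whenever $d^{\prime}\geq\frac{16}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$.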

The second point of our proof is introducing projection maps to leverage the re-sampling assumption, as follows. As in the proof of Lemma 4.2, we assume that $d=2d^{\prime}$ for some $d^{\prime}\in\mathbb{N}$ and let $I=\{0,\cdots,d^{\prime}-1\}$, $J=\{d^{\prime},\cdots,d-1\}$. Now we define a projection map

$\pi:\widetilde{I}\to I,\ \ \ k\mapsto\lfloor k/R\rfloor,$   (13)

where $\widetilde{I}:=\{0,\cdots,d^{\prime}R-1\}$ and $\lfloor\cdot\rfloor$ denotes the floor function. Similarly for $J$, we can define $\widetilde{J}:=\{d^{\prime}R,\cdots,dR-1\}$ and the corresponding projection map. Using these projection maps, we can extend Lemma 4.2 to a version with the re-sampling assumption, which is the special case of Theorem 4.1:

Proposition 4.3 (Theorem 4.1 with $l=d_{0}=d_{1}=1$)

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let $\mathbf{u},\mathbf{v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):=\mathbf{v}^{T}\sigma(\mathbf{u}\cdot x)$. Let $R\in\mathbb{N}$ and assume that each element of $\mathbf{u}$ and $\mathbf{v}$ can be re-sampled with replacement up to $R-1$ times. If $d\geq 2\lceil\frac{16}{\epsilon^{2}R^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\mathbf{m}}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (14)

where $g_{\mathbf{m}}(x):=(\mathbf{v}\odot\mathbf{m})^{T}\sigma(\mathbf{u}\cdot x)$ for some $\mathbf{m}\in\{0,1\}^{d}$.

Proof (sketch):

Similarly to the proof of Lemma 4.2, we utilize the splitting Eq. (8). We mainly argue the approximation of $w\sigma(x)$, since the argument for approximating $-w\sigma(-x)$ is parallel.

By the assumption that each element of $\mathbf{u}$ and $\mathbf{v}$ can be re-sampled up to $R-1$ times, we can replace the probability in Eq. (11) in the proof of Lemma 4.2, using the projection map $\pi:\widetilde{I}\to I$, by

$\mathbb{P}\left(\not\exists i_{1},i_{2}\in\widetilde{I}\text{ s.t. }\pi(i_{1})=\pi(i_{2}),\ \ |\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2},\ \ |\widetilde{v}_{i_{2}}-w|\leq\frac{\epsilon}{2}\right),$   (15)

where $\widetilde{u}_{1},\cdots,\widetilde{u}_{d^{\prime}R},\widetilde{v}_{1},\cdots,\widetilde{v}_{d^{\prime}R}\sim U[-1,1]$. Indeed, since we have

$\#\{(i_{1},i_{2})\in\widetilde{I}\times\widetilde{I}:\pi(i_{1})=\pi(i_{2})\}=d^{\prime}R^{2},$   (16)

we can evaluate the probability Eq. (15) as

$\text{Eq.}\ (15)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}R^{2}}\leq\frac{\delta}{2},$   (17)

for $d^{\prime}\geq\frac{16}{\epsilon^{2}R^{2}}\log\left(\frac{2}{\delta}\right)$. Eq. (17) can play the same role as Eq. (11) in the proof of Lemma 4.2.
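The counting identity Eq. (16), which is what boosts the exponent from $d^{\prime}$ to $d^{\prime}R^{2}$, is easy to check numerically. A tiny Python sketch (our illustration with arbitrary small $d^{\prime}$ and $R$):

```python
import math

d_prime, R = 4, 3
I_tilde = range(d_prime * R)            # I~ = {0, ..., d'R - 1}

def pi(k, R=R):
    """Projection map of Eq. (13): each index in I has exactly R preimages in I~."""
    return math.floor(k / R)

pairs = [(i1, i2) for i1 in I_tilde for i2 in I_tilde if pi(i1) == pi(i2)]
assert len(pairs) == d_prime * R * R    # counting identity, Eq. (16)
print(len(pairs))                        # 36
```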

A parallel argument applies to the approximation of $-w\sigma(-x)$, replacing $I$ with $J$. The rest of the proof is the same as that of Lemma 4.2. \square
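To see how the bound of Proposition 4.3 scales with $R$, here is a quick numerical check (our own illustration with arbitrary $\epsilon$ and $\delta$) of the required width $2\lceil\frac{16}{\epsilon^{2}R^{2}}\log(\frac{2}{\delta})\rceil$; it shrinks quadratically in $R$ and bottoms out at $d=2$, matching the constant-factor claim of Theorem 4.1.

```python
import math

eps, delta = 0.1, 0.01

def required_width(R, eps=eps, delta=delta):
    """Width bound of Proposition 4.3: d >= 2 * ceil(16 / (eps^2 R^2) * log(2 / delta))."""
    return 2 * math.ceil(16.0 / (eps ** 2 * R ** 2) * math.log(2.0 / delta))

for R in (1, 2, 4, 8, 16, 32, 128):
    print(R, required_width(R))
```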

5 Experiments

In this section, we perform several experiments to evaluate our proposed method, IteRand (Algorithm 3). Our main aim is to empirically verify the parameter efficiency of IteRand, compared with edge-popup [26] (Algorithm 1) on which IteRand is based. Specifically, we demonstrate that IteRand can achieve better accuracy than edge-popup under a given amount of network parameters. In all experiments, we used the partial randomization with $r=0.1$ (Eq. (3)) for Randomize in Algorithm 3.

Setup.

We used two vision datasets: CIFAR-10 [13] and ImageNet [28]. CIFAR-10 is a small-scale dataset of $32\times 32$ images with 10 class labels. It has 50k images for training and 10k for testing. We randomly split the 50k training images into 45k for actual training and 5k for validation. ImageNet is a dataset of $224\times 224$ images with 1000 class labels. It has a train set of 1.28 million images and a validation set of 50k images. We randomly split the training images 99:1, and used the former for actual training and the latter for validating models. When testing models, we used the validation set of ImageNet (which we refer to as the test set). For network architectures, we used multiple convolutional neural networks (CNNs): Conv6 [6] as a shallow network and ResNets [11] as deep networks. Conv6 is a 6-layered VGG-like CNN, which is also used in the prior work [26]. ResNets are more practical CNNs with skip connections and batch normalization layers. Following the settings in Ramanujan et al. [26], we used non-affine batch normalization layers, which are layers that only normalize their inputs and do not apply any affine transform, when training ResNets with edge-popup and IteRand. All of our experiments were performed with 1 GPU (NVIDIA GTX 1080 Ti, 11GB) for CIFAR-10 and 2 GPUs (NVIDIA V100, 16GB) for ImageNet. The details of the network architectures and hyperparameters for training are given in Appendix B.

Parameter distributions.

With the same notation as Section 2, both IteRand and edge-popup require two distributions: $\mathcal{D}_{\mathrm{param}}$ and $\mathcal{D}_{\mathrm{score}}$. In our experiments, we consider the Kaiming uniform (KU) and signed Kaiming constant (SC) distributions. The KU distribution is the uniform distribution over the interval $[-\sqrt{\frac{6}{c_{\mathrm{fanin}}}},\sqrt{\frac{6}{c_{\mathrm{fanin}}}}]$, where $c_{\mathrm{fanin}}$ is the constant defined for each layer of ReLU neural networks [1, 10]. The SC distribution is the uniform distribution over the two-valued set $\{-\sqrt{\frac{2}{c_{\mathrm{fanin}}}},\sqrt{\frac{2}{c_{\mathrm{fanin}}}}\}$, which was introduced by Ramanujan et al. [26]. We fix $\mathcal{D}_{\mathrm{score}}$ to the KU distribution, and use the KU or SC distribution for $\mathcal{D}_{\mathrm{param}}$.
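As an illustration (not code from the paper; `fan_in` stands for $c_{\mathrm{fanin}}$ and the example shape is arbitrary), sampling from the KU and SC distributions can be sketched as follows.

```python
import numpy as np

rng = np.random.default_rng(0)

def kaiming_uniform(shape, fan_in):
    """KU: uniform on [-sqrt(6/fan_in), sqrt(6/fan_in)]."""
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

def signed_kaiming_constant(shape, fan_in):
    """SC: uniform on the two values {-sqrt(2/fan_in), +sqrt(2/fan_in)}."""
    magnitude = np.sqrt(2.0 / fan_in)
    return magnitude * rng.choice([-1.0, 1.0], size=shape)

# Example: a 3x3 conv layer with 64 input channels has fan_in = 64 * 3 * 3.
w_ku = kaiming_uniform((128, 64, 3, 3), fan_in=64 * 3 * 3)
w_sc = signed_kaiming_constant((128, 64, 3, 3), fan_in=64 * 3 * 3)
print(w_ku.std(), np.unique(np.abs(w_sc)))
```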

5.1 Varying the network width

To demonstrate the parameter efficiency, we introduce a hyperparameter $\rho$, the width factor, for Conv6, ResNet18 and ResNet34. The details of this modification are given in Appendix B. We train and test these networks on CIFAR-10 using IteRand, edge-popup and SGD, varying the width factor $\rho$ in $\{0.25,0.5,1.0,2.0\}$ for each network (Figure 2). In the experiments for IteRand and edge-popup, we used the sparsity rate $p=0.5$ for Conv6 and $p=0.6$ for ResNet18 and ResNet34. Our method outperforms the baseline method for various widths. The difference in accuracy is large both when the width is small ($\rho\leq 0.5$) and when $\mathcal{D}_{\mathrm{param}}$ is the KU distribution, where edge-popup struggles.

Figure 2: We train and evaluate three CNNs (Conv6, ResNet18 and ResNet34) with various widths on CIFAR-10. The x-axis is the width factor $\rho$ and the y-axis is the accuracy on the test set. We plot the mean $\pm$ standard deviation over three runs for each experiment. KU/SC in the legend represents the distribution used for $\mathcal{D}_{\mathrm{param}}$.

5.2 ImageNet experiments

The parameter efficiency of our method is also confirmed on ImageNet (Figure 3(a)). ImageNet is more difficult than CIFAR-10, and thus more complexity is required for networks to achieve competitive performance. Since our method can increase the network complexity as shown in Section 4, the effect is significant on ImageNet, especially when the complexity is limited, as with ResNet34.

In addition to the parameter efficiency, we also observe the effect of iterative randomization on the behavior of the optimization process by plotting training curves (Figure 3(b)). Surprisingly, IteRand achieves significantly better performance than edge-popup at the early stage of the optimization, which indicates that the iterative randomization accelerates the optimization process, especially when the number of iterations is limited.

(a) Network Size vs. Test Accuracy. (b) Training Curves.
Figure 3: We compare IteRand with edge-popup on ImageNet. We used a fixed sparsity rate $p=0.7$ (Section 2), following Ramanujan et al. [26]. Figure (a) shows parameter-accuracy tradeoff curves on the test set. In Figure (b), we plot the train/val accuracy of ResNet50 during optimization. IteRand outperforms edge-popup from early epochs.

6 Related work

Lottery ticket hypothesis

Frankle and Carbin [6] originally proposed the lottery ticket hypothesis. Many properties and applications have been studied in subsequent works, such as transferability of winning subnetworks [22], sparsification before training [15, 31, 29] and further ablation studies on the hypothesis [34].

Zhou et al. [34] showed, surprisingly, that merely pruning randomly initialized networks, without any optimization of their weights, can serve as a training method. Ramanujan et al. [26] went further by proposing an effective pruning algorithm on random networks, called edge-popup, and achieved competitive accuracy compared with standard weight-optimization training algorithms [27]. Malach et al. [19] mathematically formalized this pruning-only approach as an approximation problem for a target network and proved it under a lower-bound condition on the width of the random networks to be pruned. Subsequent works [25, 24] successfully relaxed the lower bound to a logarithmic factor of the width of a target network. Our work can be seen as an extension of these works that allows re-sampling of the weights of the random networks a finite number $R$ of times, and we showed that the logarithmic factor can be reduced to a constant one when $R$ is large enough (see Section 4).

Neural network pruning and regrowth

Studies of finding sparse structures of neural networks date back to the late 1980s [9, 14]. There are many approaches to sparsify networks, such as magnitude-based pruning [8], $L_{0}$ regularization [17] and variational dropout [21]. Although these methods only focus on pruning unnecessary weights from the networks, there are several studies on re-adding new weights during sparsification [12] to maintain the model complexity of the sparse networks. Han et al. [7] proposed a dense-sparse-dense (DSD) training algorithm consisting of three stages: dense training, pruning, and recovering the pruned weights as zeros followed by dense training. Mocanu et al. [20] proposed sparse evolutionary training (SET), which repeats pruning and regrowth with random values at the end of each training epoch. Pruning algorithms proposed in other works [2, 23, 4] are designed to recover pruned weights by zero-initialization instead of random values, so that the recovered weights do not affect the outputs of the networks. While these methods are similar to our iterative randomization method in terms of the re-adding processes, all of them use weight-optimization to train networks including re-added weights, in contrast to our pruning-only approach.

7 Limitations

There are several limitations in our theoretical results. (1) Theorem 4.1 indicates only the existence of subnetworks that approximate a given neural network, not whether our method works well empirically. (2) The theorem focuses on the case where the parameter distribution $\mathcal{D}_{\mathrm{param}}$ is the uniform distribution over the interval $[-1,1]$. Thus, generalizing our theorem to other distributions, such as the uniform distribution over the binary values $\{-1,1\}$ [3], is left for future work. (3) The required width given in the theorem may not be optimal. Indeed, prior work [25] showed that the polynomial factors in the required width can be reduced to logarithmic ones when the number of re-sampling operations is $R=1$. Whether we can similarly reduce the required width for $R>1$ remains an open question.

Our algorithm (IteRand) and its empirical results also have several limitations. (1) Pruning randomly initialized networks without any randomization can reduce the storage cost by saving only a single random seed and the binary mask representing the optimal subnetwork. However, if we save a network pruned with IteRand in the same way, it requires more storage: $R$ random seeds and $R$ binary masks, where $R$ is the number of re-samplings. (2) Although our method can be applied with any score-based pruning algorithm (e.g. Zhou et al. [34] and Wang et al. [32]), we evaluated it only in combination with edge-popup [26], which is the state-of-the-art algorithm for pruning random networks. Since our theoretical results do not depend on any particular pruning algorithm, we expect that our method can be effectively combined with better pruning algorithms that emerge in the future. (3) We performed experiments mainly on image classification tasks. An intriguing question is how effectively our method works on other tasks such as language understanding, audio recognition, and deep reinforcement learning with various network architectures.

8 Conclusion

In this paper, we proposed a novel framework of iterative randomization (IteRand) for pruning randomly initialized neural networks. IteRand can virtually increase the network widths without any additional memory consumption, by randomizing pruned weights of the networks iteratively during the pruning procedure. We verified its parameter efficiency both theoretically and empirically.

Our results indicate that the weight-pruning of random networks may become a practical approach to train the networks when we apply the randomizing operations enough times. This opens up the possibility that the weight-pruning can be used instead of the standard weight-optimization within the same memory budget.

Acknowledgement

The authors thank Kyosuke Nishida and Hengjin Tang for valuable discussions in the early phase of our study. We also thank Osamu Saisho for helpful comments on the manuscript. We are grateful to the CCI team at NTT for building and maintaining the computational cluster on which most of our experiments were run.

References

  • [1] torch.nn.init – PyTorch 1.8.1 documentation. https://pytorch.org/docs/stable/nn.init.html. Accessed: 2021-10-21.
  • [2] Guillaume Bellec, David Kappel, Wolfgang Maass, and Robert Legenstein. Deep rewiring: Training very sparse deep networks. International Conference on Learning Representations, 2018.
  • [3] James Diffenderfer and Bhavya Kailkhura. Multi-prize lottery ticket hypothesis: Finding accurate binary neural networks by pruning a randomly weighted network. In International Conference on Learning Representations, 2021.
  • [4] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pages 2943–2952. PMLR, 2020.
  • [5] Facebook. facebookresearch/open_lth. https://github.com/facebookresearch/open_lth. Accessed: 2021-10-21.
  • [6] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
  • [7] Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, et al. Dsd: Dense-sparse-dense training for deep neural networks. International Conference on Learning Representations, 2017.
  • [8] Song Han, Jeff Pool, John Tran, and William J Dally. Learning both weights and connections for efficient neural networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems-Volume 1, pages 1135–1143, 2015.
  • [9] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. Advances in Neural Information Processing Systems, 1:177–185, 1988.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1026–1034. IEEE Computer Society, 2015.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778. IEEE Computer Society, 2016.
  • [12] Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, 2021.
  • [13] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  • [14] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605, 1990.
  • [15] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. Snip: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2018.
  • [16] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017.
  • [17] Christos Louizos, Max Welling, and Diederik P Kingma. Learning sparse neural networks through l_0 regularization. In International Conference on Learning Representations, 2018.
  • [18] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
  • [19] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need. In International Conference on Machine Learning, pages 6682–6691. PMLR, 2020.
  • [20] Decebal Constantin Mocanu, Elena Mocanu, Peter Stone, Phuong H Nguyen, Madeleine Gibescu, and Antonio Liotta. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1):1–12, 2018.
  • [21] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507. PMLR, 2017.
  • [22] Ari Morcos, Haonan Yu, Michela Paganini, and Yuandong Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [23] Hesham Mostafa and Xin Wang. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, pages 4646–4655. PMLR, 2019.
  • [24] Laurent Orseau, Marcus Hutter, and Omar Rivasplata. Logarithmic pruning is all you need. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2925–2934. Curran Associates, Inc., 2020.
  • [25] Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dimitris Papailiopoulos. Optimal lottery tickets via subset sum: Logarithmic over-parameterization is sufficient. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 2599–2610. Curran Associates, Inc., 2020.
  • [26] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and Mohammad Rastegari. What’s hidden in a randomly weighted neural network? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11893–11902, 2020.
  • [27] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
  • [28] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [29] Hidenori Tanaka, Daniel Kunin, Daniel L Yamins, and Surya Ganguli. Pruning neural networks without any data by iteratively conserving synaptic flow. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6377–6389. Curran Associates, Inc., 2020.
  • [30] Ben Trevett. bentrevett/pytorch-sentiment-analysis. https://github.com/bentrevett/pytorch-sentiment-analysis. Accessed: 2021-10-21.
  • [31] Chaoqi Wang, Guodong Zhang, and Roger Grosse. Picking winning tickets before training by preserving gradient flow. In International Conference on Learning Representations, 2020.
  • [32] Yulong Wang, Xiaolu Zhang, Lingxi Xie, Jun Zhou, Hang Su, Bo Zhang, and Xiaolin Hu. Pruning from scratch. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 12273–12280, 2020.
  • [33] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
  • [34] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski. Deconstructing lottery tickets: Zeros, signs, and the supermask. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.

Appendix A Proof of main theorem (Theorem 4.1)

A.1 Settings and main theorem

Let $d_{0},\cdots,d_{l}\in\mathbb{N}_{\geq 1}$. We consider a target neural network $f:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{l}}$ of depth $l$, which is described as follows.

$f(x)=F_{l}\sigma(F_{l-1}\sigma(\cdots F_{1}(x)\cdots)),$   (18)

where $x$ is a $d_{0}$-dimensional real vector, $\sigma$ is the ReLU activation, and $F_{i}$ is a $d_{i}\times d_{i-1}$ matrix. Our objective is to approximate the target network $f(x)$ by pruning a randomly initialized neural network $g(x)$, which tends to be larger than the target network.

Similar to the previous works [19, 25], we assume that $g(x)$ is twice as deep as the target network $f(x)$. Thus, $g(x)$ can be described as

$g(x)=G_{2l}\sigma(G_{2l-1}\sigma(\cdots G_{1}(x)\cdots)),$   (19)

where $G_{j}$ is a $\widetilde{d}_{j}\times\widetilde{d}_{j-1}$ matrix ($\widetilde{d}_{j}\in\mathbb{N}_{\geq 1}$ for $j=1,\cdots,2l$) with $\widetilde{d}_{2i}=d_{i}$. Each element of the matrix $G_{j}$ is assumed to be drawn from the uniform distribution $U[-1,1]$. Since there is a one-to-one correspondence between pruned networks of $g(x)$ and sequences of binary matrices $M=\{M_{j}\}_{j=1,\cdots,2l}$ with $M_{j}\in\{0,1\}^{\widetilde{d}_{j}\times\widetilde{d}_{j-1}}$, every pruned network of $g(x)$ can be described as

$g_{M}(x)=(G_{2l}\odot M_{2l})\sigma((G_{2l-1}\odot M_{2l-1})\sigma(\cdots(G_{1}\odot M_{1})(x)\cdots)).$   (20)

We introduce an idealized assumption on $g(x)$ for a given number $R\in\mathbb{N}_{\geq 1}$: each element of the weight matrix $G_{j}$ can be re-sampled with replacement from the uniform distribution $U[-1,1]$ up to $R-1$ times, for all $j=1,\cdots,2l$ (re-sampling assumption for $R$). Here, re-sampling with replacement means that we sample a new value for an element of $G_{j}$ and replace the old value of the element by the new one.

Under this re-sampling assumption, we describe our main theorem as follows.

Theorem A.1 (Main Theorem)

Fix $\epsilon,\delta>0$, and assume that $\lVert F_{i}\rVert_{\rm Frob}\leq 1$ for all $i$. Let $R\in\mathbb{N}$ and assume that each element of $G_{j}$ can be re-sampled with replacement from the uniform distribution $U[-1,1]$ up to $R-1$ times.

If $\widetilde{d}_{2i-1}\geq 2d_{i-1}\left\lceil\frac{64l^{2}d_{i-1}^{2}d_{i}}{\epsilon^{2}R^{2}}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)\right\rceil$ holds for all $i=1,\cdots,l$, then with probability at least $1-\delta$, there exist binary matrices $M=\{M_{j}\}_{1\leq j\leq 2l}$ such that

$\lVert f(x)-g_{M}(x)\rVert_{2}\leq\epsilon,\text{ for }\lVert x\rVert_{\infty}\leq 1.$   (21)

In particular, if $R$ is larger than $\frac{8ld_{i-1}}{\epsilon}\sqrt{d_{i}\log\left(\frac{2ld_{i-1}d_{i}}{\delta}\right)}$, then $\widetilde{d}_{2i-1}=2d_{i-1}$ is enough.

A.2 Proof of Theorem A.1

Our proof is based on the following simple observation, similar to the arguments in Malach et al. [19].

Lemma A.2

Fix some $n\in\mathbb{N}$, $\alpha\in[-1,1]$ and $\epsilon,\delta\in(0,1)$. Let $X_{1},\cdots,X_{n}\sim U[-1,1]$. If $n\geq\frac{2}{\epsilon}\log(\frac{1}{\delta})$ holds, then with probability at least $1-\delta$, we have

$|\alpha-X_{i}|\leq\epsilon,$   (22)

for some $i\in\{1,\cdots,n\}$.

Proof:

We can assume $\alpha\geq 0$ without loss of generality. By considering the half $\epsilon$-ball around $\alpha$, we have

$\mathbb{P}_{X\sim U[-1,1]}\Big[\big|\alpha-X\big|\leq\epsilon\Big]\geq\frac{\epsilon}{2}.$

Thus it follows that

$\mathbb{P}_{X_{1},\cdots,X_{n}\sim U[-1,1]}\Big[\big|\alpha-X_{i}\big|>\epsilon\text{ for all }i\Big]\leq\left(1-\frac{\epsilon}{2}\right)^{n}\leq e^{-\frac{n\epsilon}{2}}\leq\delta.$

\square

First, we consider approximating a single-variable linear function $f(x)=wx:\mathbb{R}\to\mathbb{R}$, $w\in\mathbb{R}$, by some subnetwork of a $2$-layered neural network $g(x)$ with $d$ hidden neurons, without the re-sampling assumption. Note that this is the same setting as in Malach et al. [19], but we give another proof so that we can later extend it to the one with the re-sampling assumption.

Lemma A.3

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let ${\bf u},{\bf v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):={\bf v}^{T}\sigma({\bf u}\cdot x)$. If $d\geq 2\lceil\frac{16}{\epsilon^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\bf m}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (23)

where $g_{\bf m}(x):=({\bf v}\odot{\bf m})^{T}\sigma({\bf u}\cdot x)$ for some ${\bf m}\in\{0,1\}^{d}$.

Proof:

The core idea is to decompose $wx$ as

$wx=w(\sigma(x)-\sigma(-x))=w\sigma(x)-w\sigma(-x).$   (24)

We assume that $d$ is an even number, $d=2d^{\prime}$, so that we can split an index set $\{1,\cdots,d\}$ of hidden neurons of $g(x)$ into $I=\{1,\cdots,d^{\prime}\}$ and $J=\{d^{\prime}+1,\cdots,d\}$. Then we have the corresponding subnetworks $g_{I}(x)$ and $g_{J}(x)$ given by $g_{I}(x):=\sum_{k\in I}v_{k}\sigma(u_{k}x)$, $g_{J}(x):=\sum_{k\in J}v_{k}\sigma(u_{k}x)$, which satisfy the equation $g(x)=g_{I}(x)+g_{J}(x)$.

From Eq. (24), it is enough to consider the probabilities for approximating $w\sigma(x)$ by a subnetwork of $g_{I}(x)$ and for approximating $-w\sigma(-x)$ by a subnetwork of $g_{J}(x)$. Now we have

$\mathbb{P}\left(\not\exists i\in I\text{ s.t. }|u_{i}-1|\leq\frac{\epsilon}{2},|v_{i}-w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (25)
$\mathbb{P}\left(\not\exists j\in J\text{ s.t. }|u_{j}+1|\leq\frac{\epsilon}{2},|v_{j}+w|\leq\frac{\epsilon}{2}\right)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}}\leq\frac{\delta}{2},$   (26)

for $d^{\prime}\geq\frac{16}{\epsilon^{2}}\log\left(\frac{2}{\delta}\right)$, as in the proof of Lemma A.2. By using the union bound, with probability at least $1-\delta$, we have $i\in I$ and $j\in J$ such that

$\big|w\sigma(x)-v_{i}\sigma(u_{i}x)\big|\leq\frac{\epsilon}{2},$
$\big|-w\sigma(-x)-v_{j}\sigma(u_{j}x)\big|\leq\frac{\epsilon}{2}.$

Combining these inequalities and Eq. (24), we finish the proof. \square

Now we extend Lemma A.3 to the one with the re-sampling assumption.

Lemma A.4

Fix $\epsilon,\delta\in(0,1)$, $w\in[-1,1]$, $d\in\mathbb{N}$. Let ${\bf u},{\bf v}\sim U[-1,1]^{d}$ be uniformly random weights of a $2$-layered neural network $g(x):={\bf v}^{T}\sigma({\bf u}\cdot x)$. Let $R\in\mathbb{N}$ and assume that each element of ${\bf u}$ and ${\bf v}$ can be re-sampled with replacement up to $R-1$ times. If $d\geq 2\lceil\frac{16}{\epsilon^{2}R^{2}}\log(\frac{2}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\big|wx-g_{\bf m}(x)\big|\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (27)

where $g_{\bf m}(x):=({\bf v}\odot{\bf m})^{T}\sigma({\bf u}\cdot x)$ for some ${\bf m}\in\{0,1\}^{d}$.

Proof:

As in the proof of Lemma A.3, we assume that $d=2d^{\prime}$ and let $I=\{1,\cdots,d^{\prime}\}$, $J=\{d^{\prime}+1,\cdots,d\}$. Now we consider $\widetilde{I}=\{1,\cdots,d^{\prime}R\}$ and a projection $\pi:\widetilde{I}\to I$ defined by $\pi(k)=\lfloor(k-1)/R\rfloor+1$. Since we assumed that each element of ${\bf u}$ and ${\bf v}$ can be re-sampled up to $R-1$ times, we can replace the probability Eq. (25) in the proof of Lemma A.3 by

$\mathbb{P}\left(\not\exists i_{1},i_{2}\in\widetilde{I}\text{ s.t. }\pi(i_{1})=\pi(i_{2}),\ \ |\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2},\ \ |\widetilde{v}_{i_{2}}-w|\leq\frac{\epsilon}{2}\right),$   (28)

where $\widetilde{u}_{1},\cdots,\widetilde{u}_{d^{\prime}R},\widetilde{v}_{1},\cdots,\widetilde{v}_{d^{\prime}R}\sim U[-1,1]$. Since we have

$\#\{(i_{1},i_{2})\in\widetilde{I}\times\widetilde{I}:\pi(i_{1})=\pi(i_{2})\}=d^{\prime}R^{2},$   (29)

we can evaluate the probability Eq. (28) as

$\text{Eq.}\ (28)\leq\left(1-\frac{\epsilon^{2}}{16}\right)^{d^{\prime}R^{2}}\leq\frac{\delta}{2},$

for $d^{\prime}\geq\frac{16}{\epsilon^{2}R^{2}}\log\left(\frac{2}{\delta}\right)$. The rest of the proof is the same as that of Lemma A.3. \square

Then, we generalize the above lemma to the case in which the target function $f(x)$ is a single-variable linear map with higher output dimension.

Lemma A.5

Fix $\epsilon,\delta\in(0,1)$, $d_{1},d_{2}\in\mathbb{N}$, ${\bf w}\in[-1,1]^{d_{2}}$. Let ${\bf u}\sim U[-1,1]^{d_{1}}$, $V\sim U[-1,1]^{d_{2}\times d_{1}}$ be uniformly random weights of a $2$-layered neural network $g(x):=V\sigma({\bf u}\cdot x)$. Let $R\in\mathbb{N}$ and assume that each element of ${\bf u}$ and $V$ can be re-sampled with replacement up to $R-1$ times.

If $d_{1}\geq 2\lceil\frac{16d_{2}}{\epsilon^{2}R^{2}}\log(\frac{2d_{2}}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

$\lVert{\bf w}\cdot x-g_{M}(x)\rVert_{2}\leq\epsilon,\text{ for all }x\in\mathbb{R},|x|\leq 1,$   (30)

where $g_{M}(x):=(V\odot M)\sigma({\bf u}\cdot x)$ for some $M\in\{0,1\}^{d_{2}\times d_{1}}$.

Proof:

We denote V=(Vki)1kd2,1id1V=(V_{ki})_{1\leq k\leq d_{2},1\leq i\leq d_{1}}. As in the proof of Lemma A.3, we assume d1=2d1d_{1}=2d^{\prime}_{1} and split the index set {1,,d1}\{1,\cdots,d_{1}\} into I={1,,d1}I=\{1,\cdots,d^{\prime}_{1}\}, J={d1+1,,d1}J=\{d^{\prime}_{1}+1,\cdots,d_{1}\}. Also we consider the corresponding subnetworks of g(x)g(x):

g_{I}(x):=\left(\sum_{i\in I}V_{ki}\sigma(u_{i}x)\right)_{1\leq k\leq d_{2}},\ \ g_{J}(x):=\left(\sum_{i\in J}V_{ki}\sigma(u_{i}x)\right)_{1\leq k\leq d_{2}}.

As in the proofs of Lemma A.3 and Lemma A.4, it suffices to show that, with high probability, there exists a subnetwork of $g_{I}(x)$ which approximates ${\bf w}\sigma(x)$, and simultaneously a subnetwork of $g_{J}(x)$ which approximates $-{\bf w}\sigma(-x)$.

For simplicity, we focus on $g_{I}(x)$ in the following argument; the same conclusion holds for $g_{J}(x)$ as well. Fix $k\in\{1,\cdots,d_{2}\}$. Then we consider the following probability,

\mathbb{P}\left(\not\exists i_{1},i_{2}\in\widetilde{I}\text{ s.t. }\pi(i_{1})=\pi(i_{2}),\ \ |\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2\sqrt{d_{2}}},\ \ |\widetilde{V}_{k,i_{2}}-w_{k}|\leq\frac{\epsilon}{2\sqrt{d_{2}}}\right), (31)

where $\widetilde{u}_{i},\widetilde{V}_{ki}\sim U[-1,1]$ for $i=1,\cdots,d^{\prime}_{1}R$. By using Eq. (29), if $d^{\prime}_{1}\geq\frac{16d_{2}}{\epsilon^{2}R^{2}}\log\left(\frac{2d_{2}}{\delta}\right)$, we have

\text{Eq. (31)}\leq\left(1-\frac{\epsilon^{2}}{16d_{2}}\right)^{d^{\prime}_{1}R^{2}}\leq\frac{\delta}{2d_{2}}.

Therefore, by the union bound over $k=1,\cdots,d_{2}$, with probability at least $1-\frac{\delta}{2}$ we have, for each $k$, indices $i_{1},i_{2}\in\widetilde{I}$ (depending on $k$) such that $i:=\pi(i_{1})=\pi(i_{2})$, $|\widetilde{u}_{i_{1}}-1|\leq\frac{\epsilon}{2\sqrt{d_{2}}}$ and $|\widetilde{V}_{k,i_{2}}-w_{k}|\leq\frac{\epsilon}{2\sqrt{d_{2}}}$, and thus

\big|w_{k}\sigma(x)-V_{ki}\sigma(u_{i}x)\big|\leq\frac{\epsilon}{2\sqrt{d_{2}}},\ \ \text{ for }x\in\mathbb{R},|x|\leq 1, (32)

if we substitute $\widetilde{u}_{i_{1}}$ for $u_{i}$ and $\widetilde{V}_{k,i_{2}}$ for $V_{ki}$. We note that the choice of $\widetilde{u}_{i_{1}}$ and $\widetilde{V}_{k,i_{2}}$ may not be unique, but Eq. (32) does not depend on this choice.

The same argument yields, with probability at least $1-\frac{\delta}{2}$, a corresponding approximation of $-{\bf w}\sigma(-x)$ by a subnetwork of $g_{J}(x)$; by the union bound, both hold simultaneously with probability at least $1-\delta$. Therefore, by taking $M$ appropriately, we have

\lVert{\bf w}x-g_{M}(x)\rVert_{2}\leq\lVert{\bf w}\sigma(x)-g_{I}(x)\rVert_{2}+\lVert-{\bf w}\sigma(-x)-g_{J}(x)\rVert_{2}\leq\frac{\epsilon}{2}+\frac{\epsilon}{2}=\epsilon

for all $x\in\mathbb{R}$ with $|x|\leq 1$. $\square$

Next, we generalize Lemma A.5 to the multi-variable version:

Lemma A.6

Fix $\epsilon,\delta\in(0,1)$, $d_{0},d_{1},d_{2}\in\mathbb{N}$, and $W\in[-1,1]^{d_{2}\times d_{0}}$. Let $U\sim U[-1,1]^{d_{1}\times d_{0}}$, $V\sim U[-1,1]^{d_{2}\times d_{1}}$ be uniformly random weights of a $2$-layered neural network $g({\bf x}):=V\sigma(U{\bf x})$. Let $R\in\mathbb{N}$, and assume that each element of $U$ and $V$ can be re-sampled with replacement up to $R-1$ times.

If $d_{1}\geq 2d_{0}\lceil\frac{16d_{0}^{2}d_{2}}{\epsilon^{2}R^{2}}\log(\frac{2d_{0}d_{2}}{\delta})\rceil$ holds, then with probability at least $1-\delta$,

\lVert W{\bf x}-g_{M,N}({\bf x})\rVert_{2}\leq\epsilon,\text{ for all }{\bf x}\in\mathbb{R}^{d_{0}},\lVert{\bf x}\rVert_{\infty}\leq 1, (33)

where $g_{M,N}({\bf x}):=(V\odot M)\sigma((U\odot N){\bf x})$ for some $M\in\{0,1\}^{d_{2}\times d_{1}}$ and $N\in\{0,1\}^{d_{1}\times d_{0}}$.

Proof:

Let $d^{\prime}_{1}=d_{1}/d_{0}$, where we assume $d^{\prime}_{1}\in\mathbb{N}$. We take $N$ as the following binary matrix:

N=\begin{pmatrix}\mathbf{1}&&\mbox{\large 0}\\ &\ddots&\\ \mbox{\large 0}&&\mathbf{1}\end{pmatrix},\text{ where }\mathbf{1}=\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}\in\mathbb{R}^{d^{\prime}_{1}\times 1}.

By the decomposition $U\odot N={\bf u}_{1}\oplus\cdots\oplus{\bf u}_{d_{0}}$, where each ${\bf u}_{i}$ is a $d^{\prime}_{1}\times 1$ matrix, we have

g_{M,N}({\bf x})=(V\odot M)\big(\sigma({\bf u}_{1}x_{1})\oplus\cdots\oplus\sigma({\bf u}_{d_{0}}x_{d_{0}})\big). (34)

Here, we decompose $M$ into column blocks:

M=\begin{pmatrix}M_{1}&\cdots&M_{d_{0}}\end{pmatrix},

where each $M_{i}$ is a $d_{2}\times d^{\prime}_{1}$ matrix with binary coefficients. Writing $V=\begin{pmatrix}V_{1}&\cdots&V_{d_{0}}\end{pmatrix}$ for the corresponding column blocks with $V_{i}\in\mathbb{R}^{d_{2}\times d^{\prime}_{1}}$, we have

V\odot M=\begin{pmatrix}V_{1}\odot M_{1}&\cdots&V_{d_{0}}\odot M_{d_{0}}\end{pmatrix}. (35)

By combining Eq. (34) and Eq. (35), we have

g_{M,N}({\bf x})=\sum_{1\leq i\leq d_{0}}(V_{i}\odot M_{i})\sigma({\bf u}_{i}x_{i}). (36)

Applying Lemma A.5 to each summand in Eq. (36) (the summands are mutually independent), for each fixed $i\in\{1,\cdots,d_{0}\}$ there exists $M_{i}$, with probability at least $1-\frac{\delta}{d_{0}}$, such that

\lVert{\bf w}_{i}x_{i}-(V_{i}\odot M_{i})\sigma({\bf u}_{i}x_{i})\rVert_{2}\leq\frac{\epsilon}{d_{0}},\text{ for }x_{i}\in\mathbb{R},|x_{i}|\leq 1, (37)

where ${\bf w}_{i}$ is the $i$-th column vector of $W$.

Using the union bound, we have $M_{1},\cdots,M_{d_{0}}$ satisfying Eq. (37) simultaneously with probability at least $1-\delta$. Therefore, by combining Eq. (36), Eq. (37) and the decomposition $W{\bf x}=\sum_{i}{\bf w}_{i}x_{i}$, the triangle inequality gives $\lVert W{\bf x}-g_{M,N}({\bf x})\rVert_{2}\leq\sum_{i}\lVert{\bf w}_{i}x_{i}-(V_{i}\odot M_{i})\sigma({\bf u}_{i}x_{i})\rVert_{2}\leq d_{0}\cdot\frac{\epsilon}{d_{0}}=\epsilon$, which is Eq. (33). $\square$
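The block-diagonal choice of $N$ and the column-block decomposition of $M$ can also be checked numerically. The following snippet (an illustrative sketch we add here; the dimensions are arbitrary) verifies the identity in Eq. (36) for randomly drawn $U$, $V$ and an arbitrary binary mask $M$:

```python
import numpy as np

rng = np.random.default_rng(0)
d0, d1_blk, d2 = 3, 4, 5          # d1' = d1 / d0
d1 = d0 * d1_blk

def relu(z):
    return np.maximum(z, 0.0)

U = rng.uniform(-1, 1, size=(d1, d0))
V = rng.uniform(-1, 1, size=(d2, d1))
M = rng.integers(0, 2, size=(d2, d1))   # an arbitrary binary mask on V
x = rng.uniform(-1, 1, size=d0)

# Block-diagonal mask N: the i-th column keeps only the i-th block of d1' rows.
N = np.zeros((d1, d0))
for i in range(d0):
    N[i * d1_blk:(i + 1) * d1_blk, i] = 1.0

lhs = (V * M) @ relu((U * N) @ x)       # Eq. (34), with the masks applied

rhs = np.zeros(d2)                      # right-hand side of Eq. (36)
for i in range(d0):
    u_i = U[i * d1_blk:(i + 1) * d1_blk, i]        # i-th diagonal block of U ⊙ N
    V_i = V[:, i * d1_blk:(i + 1) * d1_blk]
    M_i = M[:, i * d1_blk:(i + 1) * d1_blk]
    rhs += (V_i * M_i) @ relu(u_i * x[i])

assert np.allclose(lhs, rhs)
```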

Finally, by using Lemma A.6, we prove Theorem A.1. The outline of the proof is the same as in prior works [19, 25].

Proof of Theorem A.1:

By Lemma A.6, for each fixed $k\in\{1,\cdots,l\}$, there exist binary matrices $M_{2k-1},M_{2k}$ such that

\lVert F_{k}{\bf x}-(G_{2k}\odot M_{2k})\sigma\big((G_{2k-1}\odot M_{2k-1}){\bf x}\big)\rVert_{2}\leq\frac{\epsilon}{2l}, (38)

for all ${\bf x}\in\mathbb{R}^{d_{0}}$ with $\lVert{\bf x}\rVert_{\infty}\leq 1$, with probability at least $1-\frac{\delta}{l}$. Taking the union bound, we can get $M=(M_{1},\cdots,M_{2l})$ satisfying Eq. (38) for all $k=1,\cdots,l$ with probability at least $1-\delta$.

For the above $M$ and any ${\bf x}_{0}\in\mathbb{R}^{d_{0}}$ with $\lVert{\bf x}_{0}\rVert_{\infty}\leq 1$, we define sequences $\{{\bf x}_{k}\}_{0\leq k\leq l}$ and $\{\widetilde{\bf x}_{k}\}_{0\leq k\leq l}$ by

{\bf x}_{k}:=f_{k}({\bf x}_{k-1}),\qquad\widetilde{\bf x}_{0}:={\bf x}_{0},\quad\widetilde{\bf x}_{k}:=g_{M,k}(\widetilde{\bf x}_{k-1}),

where $f_{k}({\bf x})$ and $g_{M,k}({\bf x})$ are given by

f_{k}({\bf x}):=\begin{cases}\sigma\big(F_{k}{\bf x}\big),&(1\leq k\leq l-1)\\ F_{l}{\bf x},&(k=l)\end{cases}
g_{M,k}({\bf x}):=\begin{cases}\sigma\big((G_{2k}\odot M_{2k})\sigma\big((G_{2k-1}\odot M_{2k-1}){\bf x}\big)\big),&(1\leq k\leq l-1)\\ (G_{2l}\odot M_{2l})\sigma\big((G_{2l-1}\odot M_{2l-1}){\bf x}\big).&(k=l)\end{cases}

By induction on $k\in\{0,\cdots,l\}$, we can show that

\lVert{\bf x}_{k}-\widetilde{\bf x}_{k}\rVert_{2}\leq\frac{k\epsilon}{l},\qquad\lVert\widetilde{\bf x}_{k}\rVert_{\infty}\leq 2. (39)

For $k=0$, this holds trivially since $\widetilde{\bf x}_{0}={\bf x}_{0}$ and $\lVert{\bf x}_{0}\rVert_{\infty}\leq 1$. Consider the case $k\geq 1$. First of all, we remark that the following inequality is obtained from Eq. (38) and the $1$-Lipschitz property of the ReLU function $\sigma$:

\lVert f_{k}({\bf x})-g_{M,k}({\bf x})\rVert_{2}\leq\frac{\epsilon}{2l},\ \text{ for }{\bf x}\in\mathbb{R}^{d_{0}},\lVert{\bf x}\rVert_{\infty}\leq 1.

Since $f_{k}$ and $g_{M,k}$ are positively homogeneous (the networks have no bias terms), scaling ${\bf x}$ by a factor of $2$ rewrites the above inequality as follows:

\lVert f_{k}({\bf x})-g_{M,k}({\bf x})\rVert_{2}\leq\frac{\epsilon}{l},\ \text{ for }{\bf x}\in\mathbb{R}^{d_{0}},\lVert{\bf x}\rVert_{\infty}\leq 2.

Then, we have

\lVert{\bf x}_{k}-\widetilde{\bf x}_{k}\rVert_{2} = \lVert f_{k}({\bf x}_{k-1})-g_{M,k}(\widetilde{\bf x}_{k-1})\rVert_{2}
\leq \lVert f_{k}({\bf x}_{k-1})-f_{k}(\widetilde{\bf x}_{k-1})+f_{k}(\widetilde{\bf x}_{k-1})-g_{M,k}(\widetilde{\bf x}_{k-1})\rVert_{2}
\leq \lVert f_{k}({\bf x}_{k-1})-f_{k}(\widetilde{\bf x}_{k-1})\rVert_{2}+\lVert f_{k}(\widetilde{\bf x}_{k-1})-g_{M,k}(\widetilde{\bf x}_{k-1})\rVert_{2}
\leq \lVert F_{k}\rVert_{\rm Frob}\cdot\lVert{\bf x}_{k-1}-\widetilde{\bf x}_{k-1}\rVert_{2}+\frac{\epsilon}{l}
\leq \frac{k\epsilon}{l}\qquad\text{(by the induction hypothesis)},

and $\lVert\widetilde{\bf x}_{k}\rVert_{\infty}\leq\lVert\widetilde{\bf x}_{k}\rVert_{2}\leq\lVert{\bf x}_{k}\rVert_{2}+\lVert{\bf x}_{k}-\widetilde{\bf x}_{k}\rVert_{2}\leq 1+\frac{k\epsilon}{l}\leq 2$.

In particular, for $k=l$, Eq. (39) is nothing but Eq. (21). $\square$

Appendix B Details for our experiments

B.1 Network architectures

In our experiments on CIFAR-10 and ImageNet, we used the following network architectures: Conv6, ResNet18, ResNet34, ResNet50, and ResNet101. In Table 1 (for CIFAR-10) and Table 2 (for ImageNet), we describe their configurations with a width factor $\rho\in\mathbb{R}_{>0}$. When $\rho=1.0$, the architectures are the standard ones.

For each ResNet, the bracket $[\cdots]$ represents the basic block for ResNet18 and ResNet34, and the bottleneck block for ResNet50 and ResNet101, following the original settings of He et al. [11]. We also place a batch normalization layer right after each convolution operation. Note that, when we train and evaluate these networks with IteRand or edge-popup [26], we replace the batch normalization with a non-affine one, which fixes all learnable multipliers to $1$ and all learnable bias terms to $0$, following the design of Ramanujan et al. [26].
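As a concrete illustration of this replacement, the following PyTorch sketch shows one way the non-affine batch normalization could be instantiated (this reflects our reading of the setup, not code taken from the released implementation):

```python
import torch.nn as nn

def make_batchnorm(num_channels: int, learnable_affine: bool) -> nn.BatchNorm2d:
    """Affine BN for SGD training; non-affine BN for IteRand / edge-popup.

    With affine=False, the per-channel multipliers are fixed to 1 and the bias
    terms to 0, so batch normalization adds no trainable parameters."""
    return nn.BatchNorm2d(num_channels, affine=learnable_affine)

# The convolution + BN pattern used after each convolution (channel count is an example).
conv_bn = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    make_batchnorm(64, learnable_affine=False),   # non-affine for weight-pruning methods
    nn.ReLU(inplace=True),
)
```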

Table 1: Network architectures for CIFAR-10 ($\rho$: width factor).
Layer \ Network | Conv6 | ResNet18 | ResNet34
$3\times 3$ Convolution & Pooling Layers | $64\rho$, $64\rho$, max-pool | $64$, $[64\rho,64\rho]\times 2$ | $64$, $[64\rho,64\rho]\times 3$
 | $128\rho$, $128\rho$, max-pool | $[128\rho,128\rho]\times 2$ | $[128\rho,128\rho]\times 4$
 | $256\rho$, $256\rho$, max-pool | $[256\rho,256\rho]\times 2$ | $[256\rho,256\rho]\times 6$
 | | $[512\rho,512\rho]\times 2$ | $[512\rho,512\rho]\times 3$
 | | avg-pool | avg-pool
Linear Layers | $256\rho$, $256\rho$, $10$ | $10$ | $10$
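For concreteness, the Conv6 column of Table 1 can be read as the following PyTorch module (a sketch under our reading of the table; the paddings and the $32\times 32$ input resolution are assumptions):

```python
import torch.nn as nn

def conv6(rho: float = 1.0, num_classes: int = 10) -> nn.Sequential:
    """Conv6 for CIFAR-10 with width factor rho, following Table 1."""
    def w(c: int) -> int:
        return int(c * rho)

    def block(c_in: int, c_out: int) -> list:
        return [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(inplace=True)]

    layers = (
        block(3, w(64)) + block(w(64), w(64)) + [nn.MaxPool2d(2)]
        + block(w(64), w(128)) + block(w(128), w(128)) + [nn.MaxPool2d(2)]
        + block(w(128), w(256)) + block(w(256), w(256)) + [nn.MaxPool2d(2)]
        + [nn.Flatten(),                       # 4x4 spatial size after three poolings
           nn.Linear(w(256) * 4 * 4, w(256)), nn.ReLU(inplace=True),
           nn.Linear(w(256), w(256)), nn.ReLU(inplace=True),
           nn.Linear(w(256), num_classes)]
    )
    return nn.Sequential(*layers)
```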
Table 2: Network architectures for ImageNet ($\rho$: width factor).
Layer \ Network | ResNet18 | ResNet34 | ResNet50 | ResNet101
Convolution | $64\ (7\times 7,\text{ stride }2)$ (all networks)
Pooling | max-pool $(3\times 3,\text{ stride }2,\text{ padding }1)$ (all networks)
Convolution Blocks | $[64\rho,64\rho]\times 2$ | $[64\rho,64\rho]\times 3$ | $[64\rho,64\rho,256\rho]\times 3$ | $[64\rho,64\rho,256\rho]\times 3$
 | $[128\rho,128\rho]\times 2$ | $[128\rho,128\rho]\times 4$ | $[128\rho,128\rho,512\rho]\times 4$ | $[128\rho,128\rho,512\rho]\times 4$
 | $[256\rho,256\rho]\times 2$ | $[256\rho,256\rho]\times 6$ | $[256\rho,256\rho,1024\rho]\times 6$ | $[256\rho,256\rho,1024\rho]\times 23$
 | $[512\rho,512\rho]\times 2$ | $[512\rho,512\rho]\times 3$ | $[512\rho,512\rho,2048\rho]\times 3$ | $[512\rho,512\rho,2048\rho]\times 3$
Pooling | avg-pool $(7\times 7)$ (all networks)
Linear | $1000$ (all networks)

B.2 Hyperparameters

In our experiments, we trained several neural networks by three methods (SGD, edge-popup [26], and IteRand) on two datasets (CIFAR-10 and ImageNet). For each dataset, we adopted different hyperparameters as follows.

CIFAR-10 experiments.

  • SGD: We used SGD with momentum for the optimization. It has the following hyperparameters: a total epoch number $E$, batch size $B$, learning rate $\eta$, weight decay $\lambda$, and momentum coefficient $\mu$. For all network architectures, we used common values except for the learning rate and weight decay: $E=100$, $B=128$, and $\mu=0.9$. For the learning rate and weight decay, we used $\eta=0.01$, $\lambda=1.0\times 10^{-4}$ for Conv6, and $\eta=0.1$, $\lambda=5.0\times 10^{-4}$ for ResNet18 and ResNet34, following Ramanujan et al. [26]. Moreover, we decayed the learning rate by cosine annealing [16].

  • edge-popup: With the same notation as in Section 2, edge-popup has the same hyperparameters as SGD plus an additional one, a sparsity rate $p$. We used the same values as SGD for each network except for the learning rate and the sparsity rate. For the learning rate, we used $\eta=0.2$ for Conv6 and $\eta=0.1$ for ResNet18 and ResNet34, decayed by cosine annealing as for SGD. For the sparsity rate, we used $p=0.5$ for Conv6 and $p=0.6$ for ResNet18 and ResNet34.

  • IteRand: With the notation in Section 3, IteRand has the same hyperparameters as edge-popup and the following additional ones: a randomization period $K_{\mathrm{per}}\in\mathbb{N}_{\geq 1}$ and a sampling rate $r\in[0,1]$ for partial randomization. We used the same values as edge-popup for the former hyperparameters, and $K_{\mathrm{per}}=300$, $r=0.1$ for the latter ones (see the sketch after this list for how these two enter the training loop).
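To make the roles of $K_{\mathrm{per}}$ and $r$ concrete, the following pseudo-PyTorch sketch shows where the partial randomization enters the training loop. The helper names (`partially_randomize`, `current_topk_mask`, `pruned_forward`, `prunable_params`) are ours for illustration and do not refer to the released code; the re-sampling distribution is written as $U[-1,1]$ only to match the theoretical analysis.

```python
import torch

def partially_randomize(weight: torch.Tensor, mask: torch.Tensor, r: float) -> None:
    """Re-sample a random fraction r of the currently pruned weights (mask == 0),
    in place, from the initialization distribution (here U[-1, 1] for illustration)."""
    chosen = (torch.rand_like(weight) < r) & (mask == 0)
    fresh = weight.new_empty(weight.shape).uniform_(-1.0, 1.0)
    weight.data[chosen] = fresh[chosen]

# Inside the edge-popup training loop (scores are the importance scores):
# for it, (x, y) in enumerate(loader):
#     loss = criterion(pruned_forward(x), y)   # forward through the top-k subnetwork
#     loss.backward()                          # gradients flow to the scores only
#     optimizer.step(); optimizer.zero_grad()
#     if (it + 1) % K_per == 0:                # every K_per iterations
#         for weight, scores in prunable_params:
#             mask = current_topk_mask(scores)           # hypothetical helper
#             partially_randomize(weight, mask, r)
```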

ImageNet experiments.

  • SGD: For all network architectures, we used the following hyperparameters (except for the learning rate): $E=105$, $B=128$, $\lambda=1.0\times 10^{-4}$, and $\mu=0.9$. For the first $5$ epochs, we gradually increased the learning rate as $\eta=0.1\times(i/5)$ for the $i$-th epoch ($i=1,\cdots,5$). For the next $95$ epochs, we decayed the learning rate by cosine annealing starting from $\eta=0.1$. For the final $5$ epochs, we set the learning rate to $\eta=1.0\times 10^{-5}$ to ensure that the optimization converges (a sketch of this schedule appears after this list).

  • edge-popup: For all network architectures, we used the same hyperparameters as SGD and the sparsity rate $p=0.7$.

  • IteRand: For all network architectures, we used the same hyperparameters as edge-popup and $K_{\mathrm{per}}=1000$, $r=0.1$.
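The ImageNet learning-rate schedule described in the SGD item above (linear warmup, cosine decay, then a small constant rate) can be written out explicitly as below; this is a sketch of our reading of the schedule (e.g., we assume the cosine decay is applied per epoch), not the released training script.

```python
import math

def imagenet_lr(epoch: int, base_lr: float = 0.1, total_epochs: int = 105,
                warmup: int = 5, final: int = 5, final_lr: float = 1.0e-5) -> float:
    """Learning rate for the given epoch (1-indexed), per the schedule in Appendix B.2."""
    if epoch <= warmup:                       # epochs 1..5: linear warmup to base_lr
        return base_lr * epoch / warmup
    if epoch > total_epochs - final:          # last 5 epochs: small constant rate
        return final_lr
    t = epoch - warmup - 1                    # epochs 6..100: cosine annealing from base_lr
    span = total_epochs - warmup - final      # 95 epochs of cosine decay
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / span))
```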

B.3 Experimental results in table forms

In Table 3, we give the means $\pm$ one standard deviation of the results plotted in Figure 2 in Section 5.

Table 3: Results for Figure 2 in Section 5.
Network | Method | $\mathcal{D}_{\mathrm{param}}$ | $\rho=0.25$ | $\rho=0.5$ | $\rho=1.0$ | $\rho=2.0$
Conv6 | IteRand | SC | $75.16\pm 0.23$ | $84.31\pm 0.13$ | $88.80\pm 0.20$ | $90.89\pm 0.17$
Conv6 | IteRand | KU | $73.90\pm 0.47$ | $83.18\pm 0.38$ | $88.20\pm 0.38$ | $90.53\pm 0.28$
Conv6 | edge-popup | SC | $70.35\pm 1.16$ | $81.54\pm 0.11$ | $87.60\pm 0.11$ | $90.25\pm 0.06$
Conv6 | edge-popup | KU | $67.02\pm 1.14$ | $79.37\pm 0.23$ | $86.14\pm 0.37$ | $89.91\pm 0.23$
Conv6 | SGD | KU | $79.51\pm 0.88$ | $84.46\pm 0.34$ | $87.69\pm 0.36$ | $89.63\pm 0.22$
ResNet18 | IteRand | SC | $86.09\pm 0.24$ | $90.50\pm 0.36$ | $92.61\pm 0.17$ | $93.82\pm 0.15$
ResNet18 | IteRand | KU | $84.71\pm 0.42$ | $89.96\pm 0.06$ | $92.47\pm 0.16$ | $93.52\pm 0.16$
ResNet18 | edge-popup | SC | $84.23\pm 0.33$ | $89.61\pm 0.06$ | $92.29\pm 0.04$ | $93.57\pm 0.13$
ResNet18 | edge-popup | KU | $81.13\pm 0.09$ | $88.45\pm 0.53$ | $91.79\pm 0.19$ | $93.25\pm 0.06$
ResNet18 | SGD | KU | $90.99\pm 0.29$ | $93.10\pm 0.14$ | $94.43\pm 0.05$ | $95.02\pm 0.23$
ResNet34 | IteRand | SC | $88.26\pm 0.35$ | $91.87\pm 0.17$ | $93.54\pm 0.27$ | $94.25\pm 0.15$
ResNet34 | IteRand | KU | $87.36\pm 0.14$ | $91.58\pm 0.16$ | $93.27\pm 0.19$ | $94.05\pm 0.10$
ResNet34 | edge-popup | SC | $87.37\pm 0.30$ | $91.41\pm 0.26$ | $93.27\pm 0.09$ | $94.25\pm 0.13$
ResNet34 | edge-popup | KU | $85.31\pm 0.66$ | $90.85\pm 0.11$ | $93.00\pm 0.11$ | $93.96\pm 0.10$
ResNet34 | SGD | KU | $92.26\pm 0.12$ | $94.20\pm 0.25$ | $94.66\pm 0.18$ | $95.33\pm 0.10$

In Table 4, we give the means $\pm$ one standard deviation of the results plotted in Figure 3 in Section 5.

Table 4: Results for Figure 3 in Section 5.
Network | Method | $\mathcal{D}_{\mathrm{param}}$ | Accuracy (Top-1) | Sparsity (%) | # of Params
ResNet18 | SGD | KU | $69.89$ | $0\%$ | $11.16$M
ResNet34 | SGD | KU | $73.82$ | $0\%$ | $21.26$M
ResNet34 | IteRand | SC | $65.86$ | $70\%$ | $6.38$M
ResNet34 | edge-popup | SC | $62.14$ | $70\%$ | $6.38$M
ResNet50 | IteRand | SC | $69.19$ | $70\%$ | $7.04$M
ResNet50 | edge-popup | SC | $67.11$ | $70\%$ | $7.04$M
ResNet101 | IteRand | SC | $72.99$ | $70\%$ | $12.72$M
ResNet101 | edge-popup | SC | $71.85$ | $70\%$ | $12.72$M

Appendix C Additional experiments

C.1 Ablation study on hyperparameters: KperK_{\mathrm{per}} and rr

IteRand has two hyperparameters: $K_{\mathrm{per}}$ (see line 5 in Algorithm 3) and $r$ (see Eq. (3) in Section 3).

$K_{\mathrm{per}}$ controls the frequency of the randomizing operations during the optimization. Note that, in our experiments in Section 5, we fixed it to $K_{\mathrm{per}}=300$ on CIFAR-10, which is nearly $1$ epoch ($=351$ iterations) when the batch size is $128$. As we discussed in Section 3, a too-small $K_{\mathrm{per}}$ may degrade the performance because it may randomize even the important weights before their scores are well-optimized. In contrast, a too-large $K_{\mathrm{per}}$ makes IteRand almost the same as edge-popup, and thus the effect of the randomization disappears. Figure 4 shows this phenomenon with $K_{\mathrm{per}}$ varying over $\{1,30,300,3000,30000\}$. In relation to our theoretical results (Theorem A.1), we note that the expected number (which we denote $R^{\prime}$) of randomizing operations for each weight is inversely proportional to $K_{\mathrm{per}}$. The theoretical results imply that a greater $R^{\prime}$ leads to a better approximation ability of IteRand. We observe that the results in Figure 4 are consistent with this implication in the region where $K_{\mathrm{per}}$ is not too small ($K_{\mathrm{per}}\geq 300$).

(a) Conv6  (b) ResNet18  (c) ResNet34
Figure 4: We train and validate Conv6, ResNet18 and ResNet34 on CIFAR-10 with various $K_{\mathrm{per}}\in\{1,30,300,3000,30000\}$. The x-axis is $K_{\mathrm{per}}$ in a log scale, and the y-axis is validation accuracy.

We also investigate the relationship between $K_{\mathrm{per}}$ and $r$. Figure 5 shows how the test accuracy changes when both $K_{\mathrm{per}}$ and $r$ vary. From this result, we find that the accuracies seem to depend on $r/K_{\mathrm{per}}$. This may be because each pruned parameter in the neural network is randomized $Nr/K_{\mathrm{per}}$ times in expectation during the optimization, where $N$ is the total number of training iterations. On the other hand, when we use a larger $r\in[0,1]$, we have to explore $K_{\mathrm{per}}$ over a longer range (e.g., around $3000$ iterations when $r=1.0$). Thus, appropriately choosing $r$ shrinks the search space of $K_{\mathrm{per}}$.
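For reference, the small calculation below (ours, for illustration) shows what the quantity $Nr/K_{\mathrm{per}}$ amounts to in the CIFAR-10 setting.

```python
def expected_randomizations(num_iterations: int, k_per: int, r: float) -> float:
    """Expected number of re-sampling events per pruned weight: (N / K_per) * r."""
    return (num_iterations // k_per) * r

# CIFAR-10: 100 epochs x ~351 iterations/epoch (batch size 128), K_per = 300, r = 0.1
# -> roughly 12 re-samplings per pruned weight on average.
print(expected_randomizations(100 * 351, 300, 0.1))
```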

Figure 5: Test accuracies on CIFAR-10 with ResNet18. The x-axis is $K_{\mathrm{per}}\in\{1,30,300,3000\}$ and the y-axis is $r\in\{0.001,0.01,0.1,1.0\}$.

C.2 Computational overhead of iterative randomization

IteRand adds computational cost to the base method, edge-popup, through the iterative randomization. However, this additional cost is negligibly small in all our experiments. We measured the average overhead of a single randomizing operation, which is the only difference from edge-popup, as follows: $97.10$ ms for ResNet18 ($11.2$M params) and $200.55$ ms for ResNet50 ($23.5$M params). Thus, the total additional cost should be about $10$ seconds over the whole training ($1.5$ hours) for ResNet18 on CIFAR-10 and $200\text{--}300$ seconds over the whole training (one week) for ResNet50 on ImageNet.
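These estimates follow from multiplying the number of randomizing operations by the measured per-operation overhead; the quick check below is ours, and the ImageNet iteration count per epoch (roughly $1.28$M images at batch size $128$) is an assumption.

```python
def total_overhead_secs(epochs: int, iters_per_epoch: int, k_per: int, per_op_ms: float) -> float:
    """Number of randomizing operations times the measured per-operation cost."""
    num_ops = (epochs * iters_per_epoch) // k_per
    return num_ops * per_op_ms / 1000.0

print(total_overhead_secs(100, 351, 300, 97.10))       # ResNet18 on CIFAR-10: ~11 seconds
print(total_overhead_secs(105, 10_000, 1000, 200.55))  # ResNet50 on ImageNet: ~211 seconds
```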

We also measured the total training times of our experiments for ResNet18 on CIFAR-10 (Table 5). The additional computational cost of IteRand over edge-popup is tens of seconds, which is quite consistent with the above estimates.

Table 5: Training times for ResNet18 on CIFAR-10.
Method | Training time (secs)
IteRand | $6253.69$
edge-popup | $6231.80$
SGD | $6388.61$

C.3 Experiments with varying sparsity

Table 6 shows a comparison of IteRand and edge-popup when varying the sparsity rate $p\in\{0.1,0.3,0.5,0.7,0.9,0.95,0.99\}$. We can see that IteRand is effective for almost all sparsities $p$.

Table 6: Test accuracies with various sparsities on CIFAR-10.
Network | Method | $p=0.1$ | $p=0.3$ | $p=0.5$ | $p=0.7$ | $p=0.9$ | $p=0.95$ | $p=0.99$
Conv6 | IteRand (SC) | $\mathbf{87.17}\pm 0.21$ | $\mathbf{89.08}\pm 0.14$ | $\mathbf{89.19}\pm 0.13$ | $\mathbf{87.67}\pm 0.07$ | $\mathbf{76.09}\pm 0.23$ | $\mathbf{59.03}\pm 1.55$ | $12.04\pm 3.40$
Conv6 | edge-popup (SC) | $80.31\pm 0.27$ | $86.55\pm 0.22$ | $87.57\pm 0.03$ | $86.40\pm 0.13$ | $73.25\pm 0.79$ | $55.23\pm 1.48$ | $\mathbf{12.13}\pm 3.46$
ResNet18 | IteRand (SC) | $\mathbf{91.79}\pm 0.20$ | $\mathbf{92.67}\pm 0.11$ | $\mathbf{92.66}\pm 0.22$ | $\mathbf{92.61}\pm 0.15$ | $\mathbf{91.82}\pm 0.15$ | $\mathbf{90.40}\pm 0.42$ | $\mathbf{76.31}\pm 2.56$
ResNet18 | edge-popup (SC) | $87.37\pm 0.18$ | $91.43\pm 0.16$ | $92.25\pm 0.18$ | $92.32\pm 0.10$ | $91.64\pm 0.18$ | $90.28\pm 0.30$ | $75.21\pm 2.71$

Moreover, we compared the pruning-only approach (IteRand and edge-popup) with the iterative magnitude pruning (IMP) approach [6] at various sparsity rates. We employed the OpenLTH framework [5], which contains an implementation of IMP, as the codebase for this experiment, and implemented both edge-popup and IteRand in this framework. The results are shown in Table 7. Overall, IMP outperforms the pruning-only methods. However, there is still room for improvement in the pruning-only approach, such as introducing scheduled sparsities or an adaptive threshold, which we leave to future work.

Table 7: Comparison of the pruning-only approach and the magnitude-based one.
Network | Method | $p=0.5$ | $p=0.7$ | $p=0.9$ | $p=0.95$ | $p=0.99$
VGG11 | IteRand (SC) | $88.46\pm 0.22$ | $88.29\pm 0.42$ | $87.05\pm 0.07$ | $84.37\pm 0.59$ | $64.93\pm 4.81$
VGG11 | edge-popup (SC) | $87.09\pm 0.31$ | $87.34\pm 0.21$ | $85.11\pm 0.55$ | $81.06\pm 0.84$ | $61.71\pm 6.05$
VGG11 | IMP with 3 retrainings [6] | $\mathbf{91.47}\pm 0.15$ | $\mathbf{91.48}\pm 0.16$ | $\mathbf{90.89}\pm 0.08$ | $\mathbf{90.39}\pm 0.27$ | $\mathbf{88.076}\pm 0.17$
ResNet20 | IteRand (SC) | $84.17\pm 0.78$ | $82.31\pm 0.52$ | $70.96\pm 0.55$ | $55.60\pm 0.70$ | $24.45\pm 0.67$
ResNet20 | edge-popup (SC) | $76.57\pm 0.91$ | $75.83\pm 2.75$ | $49.25\pm 6.33$ | $42.96\pm 3.91$ | $20.11\pm 3.43$
ResNet20 | IMP with 3 retrainings [6] | $\mathbf{90.70}\pm 0.37$ | $\mathbf{89.79}\pm 0.14$ | $\mathbf{86.87}\pm 0.26$ | $\mathbf{84.28}\pm 0.08$ | $\mathbf{71.78}\pm 1.66$

C.4 Detailed empirical analysis on the parameter efficiency

We conducted experiments to see how much more network width edge-popup requires than IteRand to achieve the same accuracy (Table 8). Here we use ResNet18 with various width factors $\rho\in\mathbb{R}_{>0}$. We first computed the test accuracies of IteRand with $\rho=0.5,1.0$ as target values. Next, we searched for the width factors at which edge-popup achieves the same accuracy as the target values. Table 8 shows that edge-popup requires networks about $1.3$ times wider than IteRand in this specific setting.

Table 8: Test accuracies for ResNet18 with various width factors.
Method | $\rho=0.5$ | $\rho=0.65$ | $\rho=1.0$ | $\rho=1.3$
IteRand (KU) | $\mathbf{89.96}\pm 0.06$ | - | $\mathbf{92.47}\pm 0.16$ | -
edge-popup (KU) | $88.45\pm 0.53$ | $\mathbf{89.99}\pm 0.08$ | $91.79\pm 0.19$ | $\mathbf{92.54}\pm 0.19$

C.5 Experiments with large-scale networks

In addition to the experiments in Section 5, we conducted experiments to see the effectiveness of IteRand on large-scale networks: WideResNet-50-2 [33] and ResNet50 with the width factor $\rho=2.0$. Table 9 shows that the iterative randomization is still effective at improving the performance of weight-pruning optimization for these networks.

Table 9: Experiments with large-scale networks.
Network | IteRand (SC) | edge-popup (SC) | # of parameters
WideResNet-50-2 | $\mathbf{73.57}\%$ | $71.59\%$ | $68.8$M
ResNet50 ($\rho=2.0$) | $\mathbf{74.05}\%$ | $72.96\%$ | $97.8$M

C.6 Experiments with a text classification task

Although our main theorem (Theorem 4.1) indicates that the effectiveness of IteRand does not depend on any specific task, we presented results only on image classification datasets in the body of this paper. In Table 10, we present experimental results on a text classification dataset, IMDB [18], with recurrent neural networks (see Table 11 for the network architectures). For this experiment, we implemented both edge-popup and IteRand on top of the Jupyter notebook originally written by Trevett [30]. All models are trained for $15$ epochs, and the learning rate $\eta$ we used is $\eta=1.0$ for SGD and $\eta=2.5$ for edge-popup and IteRand. Note that the learning rate $\eta=2.5$ does not work well for SGD, so we employed a different value from the one used for edge-popup and IteRand. We also set the hyperparameters for IteRand to $p=0.5$, $K_{\mathrm{per}}=\lfloor 270/6\rfloor$ ($\approx 1/6$ of an epoch) and $r=1.0$.

Table 10: Test accuracies on the IMDB dataset over $5$ runs.
Network | IteRand (SC) | edge-popup (SC) | SGD
LSTM | $\mathbf{88.44}\pm 0.28\%$ | $88.16\pm 0.12\%$ | $87.39\pm 0.37\%$
BiLSTM | $\mathbf{88.51}\pm 0.24\%$ | $88.34\pm 0.19\%$ | $87.62\pm 0.22\%$
Table 11: The network architectures for IMDB.
Layer | (Bi)LSTM
Embedding Layer | ${\rm dim}=100$
LSTM Layer | ${\rm hidden\_dim}=256$, ${\rm num\_layers}=1$ (${\rm bidirectional}={\rm True}$ for BiLSTM)
Dropout Layer | $p=0.2$
Linear Layer | ${\rm output\_dim}=1$
Output Layer | sigmoid