\addauthor

Duong H. [email protected] \addauthorTrung-Nhan [email protected] \addauthorNam [email protected] \addinstitution Ho Chi Minh City University of Technology
Ho Chi Minh, Vietnam Pay Attention to Snapshots of Pruning

Paying more Attention to Snapshots of Iterative Pruning: Improving Model Compression via Ensemble Distillation

Abstract

Network pruning is one of the most dominant methods for reducing the heavy inference cost of deep neural networks. Existing methods often iteratively prune networks to attain high compression ratio without incurring significant loss in performance. However, we argue that conventional methods for retraining pruned networks (i.e., using small, fixed learning rate) are inadequate as they completely ignore the benefits from snapshots of iterative pruning. In this work, we show that strong ensembles can be constructed from snapshots of iterative pruning, which achieve competitive performance and vary in network structure. Furthermore, we present a simple, general and effective pipeline that generates strong ensembles of networks during pruning with large learning rate restarting, and utilizes knowledge distillation with those ensembles to improve the predictive power of compact models. In standard image classification benchmarks such as CIFAR and Tiny-Imagenet, we advance state-of-the-art pruning ratio of structured pruning by integrating simple $\ell_{1}$ -norm filters pruning into our pipeline. Specifically, we reduce 75-80% of total parameters and 65-70% MACs of numerous variants of ResNet architectures while having comparable or better performance than that of original networks. Code is available at https://github.com/lehduong/kesi.

1 Introduction

Motivation Researchers have extensively exploited deep and wide networks for the sake of achieving superior performance on various tasks. Most of state-of-the-art networks are extremely computationally expensive and require excessive memory. However, real-world applications usually require running deep neural networks on edge devices for various reasons: user privacy, security, real-time analysis, offline capability, reducing cost for server deployment, and so on. Adopting large and cumbersome networks to such resource-constrained environments is challenging due to restrictions of memory, computational power, energy consumption, and so on.

Background

Network pruning [LeCun et al. (1990), Reed (1993), Han et al. (2015), Li et al. (2016)] reduces a cumbersome and over-parameterized network to compact one by removing unnecessary weights and connections of networks. It is widely believed that small networks pruned from large, over-parameterized networks achieve superior performance than those trained from scratch [Frankle and Carbin (2018), Renda et al. (2020), Li et al. (2016), Luo et al. (2017)]. A plausible explanation to this phenomenon is the lottery ticket hypothesis [Frankle and Carbin (2018)] i.e. large, over-parameterized networks contain many optimal sub-networks i.e. winning tickets. In particular, network pruning could be done in two manners: one-shot pruning - prune a network with the desired compression ratio and retrain it only one time, or iterative pruning - only prune small ratio of the original network, retrain and repeat that process until the target size is reached. It has been shown that iterative pruning could lead to a greater compression ratio compare to one-shot pruning approaches [Han et al. (2015), Luo et al. (2017), Li et al. (2016), Renda et al. (2020)]. Furthermore, Frankle et al\bmvaOneDot[Frankle and Carbin (2018)] point out that iteratively-pruned-winning-tickets learn faster and reach higher test accuracy at smaller network size.

On the other hand, ensembles of neural networks are known to be much more robust and accurate than individual networks [Huang et al. (2017), Ashukha et al. (2020), Snoek et al. (2019)]. In spite of their superior performance, the tremendous cost of training and inference of ensembles makes them less attractive in practice. For the purpose of accelerating training time of ensembles, prior works proposed methods encouraging models to converge to different local minimums during training [Huang et al. (2017), Garipov et al. (2018), Yang et al. (2019b)]. To reduce inference time of ensembles, one could use a single network to mimic behavior of ensembles as pioneered by born-again tree [(6)] and knowledge distillation [Hinton et al. (2015), Balan et al. (2015), Buciluǎ et al. (2006), Malinin et al. (2019)]. In above approaches, although small networks can not achieve comparable performance with ensembles of networks, dark knowledge transferred from teachers to student network could bridge the gap between their predictive powers.

Our proposal

While existing methods of iterative pruning are more effective than one-shot pruning, the snapshots at each pruning iteration are mostly overlooked. We consider leveraging the snapshots of iterative pruning to take the performance of compact models to the next level.

In this work, we propose a simple pipeline for model compression by slightly modifying the standard approach. Specifically, we make use of large learning rate restarting at each pruning iteration to retrain pruned networks. Hence, each retraining step could be considered as a cycle of Snapshot ensemble [Huang et al. (2017)]. Utilizing both large learning rate restarting and pruning foster the diversity between snapshots, thus, constructing strong ensembles. Once achieved the desired compression ratio, we then distill the knowledge from the ensembles of snapshots of iterative pruning to the final model. Our method acquires the advantages of network pruning, ensembles learning, and knowledge distillation. To the best of our knowledge, this is the first work attempting to exploit snapshots of iterative pruning to further improve the performance of pruned networks.

Our main contributions

The contributions of our work are summarized as below:

1.

We empirically show that fine-tuning with large learning rate restarting can achieve competitive or better results than the common strategy (i.e. small, fixed learning rate) on a range of standard datasets and architectures. Surprisingly, such simple modification can create very strong baselines for both structured and unstructured pruning.
2.

We demonstrate that snapshots of iterative pruning could construct strong ensembles.
3.

We propose a simple pipeline to combine knowledge distillation from ensembles and iterative pruning. We empirically show that our approach can achieve state-of-the-art pruning ratio by reducing $75-80\%$ of parameters and $65-70\%$ MACs on numerous variants of ResNet while having comparable or better results than original networks.

2 Related Work

Knowledge Distillation

The approach of training small, efficient student network to mimic behavior of large, over-parameterized network has been proposed for a long time [Buciluǎ et al. (2006)] and was recently repopularized in [Hinton et al. (2015), Ba and Caruana (2014)]. Later, knowledge distillation was extended to various aspects, transferring knowledge from intermediate layers [Romero et al. (2014), Zagoruyko and Komodakis (2016a)], allowing teachers and students to guide each others [Zhang et al. (2018)], using teacher and student with the same architecture [Furlanello et al. (2018), Yang et al. (2019b), Yang et al. (2019a), Bagherinezhad et al. (2018)], distilling knowledge in multiple steps [Mirzadeh et al. (2019)]. To address the cost of training two networks in knowledge distillation, [Zhu et al. (2018), Zhang et al. (2018), Yang et al. (2019b)] propose online approaches to train the student and teacher networks in one generation. Furthermore, Anil et al\bmvaOneDot[Anil et al. (2018)] adopt knowledge distillation to accelerate the training of large scale neural networks. Universally Slimmable networks [Yu and Huang (2019)] provide an ensemble of sub-networks that has implicit knowledge distillation through shared weights.

Network Pruning

The idea behind network pruning is to reduce the redundant weights and connections of original network to achieve compact networks without losing much performance [Han et al. (2015), Li et al. (2016)]. In general, pruning can be divided into two categories: structured pruning and unstructured pruning. Unstructured pruning [Hanson and Pratt (1989), LeCun et al. (1990), Han et al. (2015), Srinivas and Babu (2015), Guo et al. (2016)] always results in sparse weight matrices, which can not directly accelerate the inference efficiency without specialized hardware/libraries. In contrast, structured pruning approaches [Li et al. (2016), He et al. (2017), Yu et al. (2018), Lin et al. (2019), Molchanov et al. (2016)] remove the redundant weights at the level of filters/channels/layers, thus, speeding up the inference of networks directly. There are numerous approaches to determine redundant filters/weights: [Luo et al. (2017)] use statistic information of the next filters to select unimportant filters, [Li et al. (2016)] prune the filters that have smallest norms in each layer, [Molchanov et al. (2016)] select the filters to minimize the construction loss estimated with Taylor expansion. As these criteria are rough estimations of weight’s importance, pruning a large number of filters/weights at once might break down and lead to inferior performance compare to iterative pruning [Han et al. (2015), Li et al. (2016)]. Recently, Liu et al\bmvaOneDot[Liu et al. (2018)] empirically show that training the pruned model from scratch can also achieve comparable or even better performance than fine-tuning. While the efficacy of network pruning remains an open question, in this work, we propose exploiting the benefit of having multiple networks through iterative pruning for constructing ensembles of networks.

3 Knowledge Distillation

Consider the classification problem in which we need to determine the correct category for input image $\mathbf{x}$ among $M$ classes. The probability of class $m$ for sample $\mathbf{x}_{n}$ given by neural network $f$ parameterized by $\boldsymbol{\theta}$ is computed as:

p_{m}(\mathbf{x}_{n};\boldsymbol{\theta},\tau)=\frac{\exp(\frac{f_{m}(\mathbf{x}_{n};\boldsymbol{\theta})}{\tau})}{\sum_{i=1}^{M}\exp(\frac{f_{i}(\mathbf{x}_{n};\boldsymbol{\theta})}{\tau})}

(1)

Where $\tau$ is the temperature of softmax function, higher values of $\tau$ lead to softer output distribution. Conventional approaches optimize the parameters $\boldsymbol{\theta}$ by sampling mini-batches $\mathcal{B}$ from the dataset and update the parameters to minimize cross-entropy objective:

\mathcal{L}_{NCE}(\mathcal{B};\boldsymbol{\theta})=-\frac{1}{N}\sum_{n=1}^{N}\sum_{m=1}^{M}y_{m}\log p_{m}(\mathbf{x}_{n};\boldsymbol{\theta},1)

(2)

The target distribution of a sample is usually represented by one-hot vector i.e. only the true class is $1$ and all other classes are $0$ . Since input images might differ in term of noise, complexity, and multi-modality, enforcing networks to excessively fit the delta distribution of ground truth for all samples might deteriorate their generalization. Besides that, the similarity between classes provides rich information for learning and potentially prevent overfitting [Yang et al. (2019a)]. Knowledge distillation [Buciluǎ et al. (2006), Hinton et al. (2015)] uses a trained (teacher) network, which usually has high capacity, to guide the training of other (student) network. Let $q_{m}(\mathbf{x}_{n})$ be the probability of class $m$ for image $\mathbf{x}_{n}$ given by the teacher network, which is parameterized by $\boldsymbol{\psi}$ . The objective function of knowledge distillation is defined as:

\mathcal{L}_{KD}(\mathcal{B};\boldsymbol{\theta},\tau,\boldsymbol{\psi})=-\frac{\tau^{2}}{N}\sum_{n=1}^{N}\sum_{m=1}^{M}q_{m}(\mathbf{x}_{n};\boldsymbol{\psi},\tau)\log\frac{q_{m}(\mathbf{x}_{n};\boldsymbol{\psi},\tau)}{p_{m}(\mathbf{x}_{n};\boldsymbol{\theta},\tau)}

(3)

In case the teacher is an ensemble of $K$ networks, the target distribution of knowledge distillation is the average of outputs of all networks: $\bar{q}_{m}(\mathbf{x}_{n};\boldsymbol{\psi}_{1:K},\tau)=\frac{1}{K}\sum_{k=1}^{K}q_{m}(\mathbf{x}_{n};\boldsymbol{\psi}_{k},\tau)$ .

An alternative approach is optimizing the mean of Kullback-Leibler divergence between the student and each teacher network:

\mathcal{L^{\prime}}_{KD}(\mathcal{B};\boldsymbol{\theta},\tau,\boldsymbol{\psi}_{1:K})=-\frac{\tau^{2}}{KN}\sum_{n=1}^{N}\sum_{m=1}^{M}\sum_{k=1}^{K}q_{m}(\mathbf{x}_{n};\boldsymbol{\psi}_{k},\tau)\log\frac{q_{m}(\mathbf{x}_{n};\boldsymbol{\psi}_{k},\tau)}{p_{m}(\mathbf{x}_{n};\boldsymbol{\theta},\tau)}

(4)

We experimented with two above objectives but did not observe significant difference in performance of student networks, thus, we only report results of the second approach.

4 Snapshots of Iterative Pruning

Refer to caption — Figure 1: Overview of our approach combining the advantage of knowledge distillation, ensembles of networks, and network pruning. At the start, we prune the filters/weights according to some criteria ( $\ell_{1}$ -norm, Taylor approximation,…). With KESI, we retrain the pruned networks with large learning rate and minimize the conventional supervised loss function. Once we achieve the desired pruning ratio, we use knowledge distillation to transfer the knowledge from ensembles of snapshots of iterative pruning to the final model.

In contrast to previous works, which mainly focus on the aforementioned usage of iterative pruning (i.e. alleviating the noise of weight’s importance estimation), we exploit the benefits of generating multiple models varying in structure and capacity to construct strong ensembles.

Inspired by the prior works of [Smith (2015), Loshchilov and Hutter (2016)] in which the authors show that promising local optimums could be found in a small number of epochs after restarting the learning rate. Furthermore, Huang et al\bmvaOneDot[Huang et al. (2017)] demonstrate that utilizing large learning rate restarting during training can construct strong ensembles without much additional cost.

Broadly speaking, the performance of ensembles depends on: the performance of individual network and the diversity of them. On the other hand, network pruning generates snapshots varying in structure and achieving competitive performance. Hence, if pruned networks could achieve minimal loss in predictive power relative to the original network, the ensemble of them could potentially outperforms the ensemble of networks having identical architecture (and trained with large learning rate restarting).

Prior works such as [Han et al. (2015), Liu et al. (2018), Molchanov et al. (2016)] retrain the pruned networks for $T$ more epochs with a fixed learning rate, which is usually the final learning rate of the training. However, this approach might result in multiple snapshots being stuck in similar local optimums, thus, leading to very weak ensembles as shown in our experiments. Similar to [Huang et al. (2017)], we adopt the large learning rate restarting at each pruning iteration to encourage each snapshot to converge to different optimum. For learning rate restarting, we utilize the One-cycle policy [Smith and Topin (2019)], which is proved to increase convergence speed of several models. Due to the similarity of our proposed method and Snapshot Ensembling [Huang et al. (2017)], we refer to each pruning and retraining step as a cycle. One-cycle policy adjusts learning rate at each mini-batch update and has two phases:

Increasing learning rate The learning rate and momentum of optimizer will be initialized to $\eta_{initial}$ and $\beta_{initial}$ respectively. During the first $T$ iterations of fine-tuning, learning rate and momentum gradually increase from initial values to $\eta_{max}$ , $\beta_{max}$ . The learning rate and momentum at $i$ -th step with cosine annealing strategy are given by:

\eta_{i}=\eta_{max}+\frac{\eta_{initial}-\eta_{max}}{2}(1+\cos(\frac{i}{T}\cdot\pi))

(5)

\beta_{i}=\beta_{max}+\frac{\beta_{initial}-\beta_{max}}{2}(1+\cos(\frac{i}{T}\cdot\pi))

(6)

Decreasing learning rate After $T$ iterations, learning rate and momentum will be gradually decreased from $\eta_{max}$ and $\beta_{max}$ to $\eta_{min}$ and $\beta_{min}$ in $L-T$ iterations where $L$ is total number of iterations for fine-tuning.

\eta_{i}=\eta_{min}+\frac{\eta_{max}-\eta_{min}}{2}(1+\cos(\frac{i-T}{L-T}\cdot\pi))

(7)

\beta_{i}=\beta_{initial}+\frac{\beta_{max}-\beta_{initial}}{2}(1+\cos(\frac{i-T}{L-T}\cdot\pi))

(8)

It is worth noticing that differs from previous works [Huang et al. (2017), Yang et al. (2019b)], which use cosine annealing schedule, by using One-cycle policy, we also "warm-up" learning rate at the start of each cycle. In our experiments, warming up learning rate is extremely important to achieve high accuracy with deep and large networks.

Surprisingly, retraining with One-cycle policy does not only generate significantly stronger ensembles, but also consistently outperforms the standard strategy for finetuning in terms of predictive accuracy of individual snapshots. We hypothesize that the (local) optimums of pruned networks are actually far from those of original networks, thus, large learning rate is needed to guarantee the convergence of pruned networks. We leave rigorous evaluation to investigate this phenomenon for future works.

5 Effective Pipeline for Model Compression

Since we already obtain strong ensembles during pruning, it is straightforward to distill the knowledge from them to the final pruned network. Our proposed pipeline can be summarized as follow:

1.

Train the baseline model to completion.

Prune redundant weights of the network based on some criteria.

Retrain the pruned network with large learning rate.

Repeat step 2 and 3 until desired compression ratio is reached.

Distill knowledge from ensembles of snapshots of pruning.

Algorithm 1 Knowledge Distillation from Ensemble of Snapshots of Iterative pruning

From now, we refer to our pipeline for model compression as Knowledge Distillation from Ensembles of Snapshots of Iterative Pruning (KESI). An overview of our approach is depicted in Figure 1. Our approach is extremely simple, easy to implement and can be adopted with any pruning mechanisms. We discuss the reasons why ensembles of snapshots of pruning are naturally suited for knowledge distillation.

Quality of Teacher In knowledge distillation, student can either learn to jointly optimize the supervised loss (Equation 2) and knowledge distillation loss (Equation 4) or only optimize the distillation objective. In the former case, if the teacher is poorly trained, mathematically speaking, the two objectives will conflict with each other. In the latter case, a poor teacher provides weak supervision (noisy label), making it’s harder to learn from the student’s perspective. Furthermore, ensembles provide more robust predictions on noisy labeled datasets [Lee and Chung (2019)] and out-of-distribution examples [Lakshminarayanan et al. (2017)].

Student and Teacher Gap Although ensembles of snapshots have superior performance than the original network, it is not sufficient to guarantee the improvement in the performance of the student network with Knowledge Distillation. In fact, many works such as [(33), Cho and Hariharan (2019), Yang et al. (2019a)] show that a powerful teacher might impair the performance of its student if there is a large gap between their predictive powers. However, ensembles of snapshots of pruning consist of models varying in capacity. Hence, teacher’s predictions of hard-to-learn samples (because of their complexity, multi-modality) will have softer distributions as the small networks could not "remember" those samples and would be more uncertain about them.

In this work, we only investigate knowledge distillation from ensembles of fixed-weights teachers, however, we can also jointly train all models and allow them to guide each other, which is referred to as deep mutual learning [Zhang et al. (2018)].

6 Experiments

We conduct experiments on CIFAR-10, CIFAR-100 [Krizhevsky et al. (2009)] and Tiny-Imagenet ¹¹1https://tiny-imagenet.herokuapp.com datasets.

The two CIFAR datasets [Krizhevsky et al. (2009)] consist of colored natural images sized at $32\times 32$ pixels. CIFAR-10 (C10) and CIFAR-100 (C100) images are drawn from 10 and 100 classes, respectively. For each dataset, there are 50,000 training images and 10,000 images reserved for testing.

The Tiny ImageNet dataset consists of a subset of ImageNet images [Deng et al. (2009)]. There are 200 classes, each of which has 500 training images and 50 validation images. Each image is resized to $64\times 64$ and augmented with random crops, horizontal mirroring, and RGB intensity scaling.

We run each experiment 3 times then report mean and standard deviation of each network. In our experiments, we prune all networks in 5 cycles unless otherwise stated.

6.1 Experiment setup

Training baselines

We adopt the training and pruning code from [Liu et al. (2018)] ²²2https://github.com/Eric-mingjie/rethinking-network-pruning. We train all networks with Stochastic Gradient Descent (SGD), learning rate is dropped from $0.1$ to $0.01$ at $50\%$ training and to $0.001$ at $75\%$ . The batch size is set to $128$ and weight decay is $0.0001$ similar to [He et al. (2016a), He et al. (2016b)].

CIFAR In order to create strong baseline models, we extend the training schedule of all models to 300 epochs. For WideResnet, we use same configurations as described in [Zagoruyko and Komodakis (2016b)].

Tiny-Imagenet we adopt Pytorch’s pretrained models on ImageNet and only replace the last fully-connected layer and train networks for $T=100$ more epochs. We warm up learning rate from $0.01$ to $0.1$ in $10$ epochs. Other configurations are adopted from CIFAR training recipe.

Pruning

Structured pruning we use $\ell_{1}$ -norm based filters pruning [Li et al. (2016)] for simplicity. In each layer, a fixed number of filters having smallest $\ell_{1}$ -norm will be pruned. Since the bulk of networks tend to be last layers, we increase the percentage of filters that will be pruned as the layer goes deeper to achieve higher compression ratio.
Unstructured pruning, we exploit (global) magnitude-based weight pruning [Han et al. (2015)] i.e. pooling parameters across all layers and pruning weights with lowest magnitude. Specifically, we only prune parameters of convolutional layers similar to (Liu et al., 2018).

Retraining

The budget for fine-tuning of each cycle is $T=40$ and $T=25$ epochs on CIFAR and Tiny-Imagenet datasets respectively regardless of model architectures. In standard policy, the learning rate is set to 0.001 and fixed during retraining.

For One-cycle policy, we set the initial learning rate $\eta_{initial}=0.01$ , gradually increase it to the maximum learning rate $\eta_{max}=0.1$ in $10\%$ of total (retrain) epochs, then decrease it to the minimum learning rate $\eta_{min}=0.0001$ for remaining epochs. Other configurations are identical to those of training.

Knowledge Distillation

We use Adam optimizer Kingma and Ba (2014) for ensemble distillation since it gives better results than vanilla SGD in our experiments. For knowledge distillation, we also adopt One-cycle policy where we set $\eta_{initial},\eta_{max},\eta_{min}$ to $1\mathrm{e}{-4},1\mathrm{e}-3,1\mathrm{e}{-}6$ respectively. We do not explicitly use regularization for knowledge distillation. Other configurations e.g\bmvaOneDotbatch size, number of retraining epochs,… are similar to normal finetuning.

In our experiments, we use temperature $\tau=5$ . The teachers i.e. ensembles of snapshots consist of 6 models including the original (unpruned) network and 5 snapshots of pruning.

6.2 Results

6.2.1 Effectiveness of large learning rate

We conduct experiments to empirically evaluate the performance of pruned networks trained with large learning rate compare to networks fine-tuned with small learning rate. Figure 2 and 3 demonstrate results of pruned networks with different compression ratios for both structured and unstructured pruning. Exhaustive results are reported in supplementary documents.

6.2.2 Performance of ensembles of snapshots

We compare the performance of ensembles of snapshots with different approaches: snapshots of pruned networks trained with small learning rate, snapshots of pruned networks trained with large learning rate restarting and snapshots of unpruned networks retrained with large learning rate (i.e. all snapshots have same architecture as the original network). Figure 4 presents the result of this experiment.

We can see that although the network capacity is decreased at each cycle, the ensembles of snapshots of iterative pruning achieve competitive or even better than snapshots of networks with same architecture. Detailed results of performance of ensembles are reported in supplementary documents.

6.2.3 Performance of compact networks trained with our pipeline

Model	Methods	% Params $\downarrow$	% FLOPs $\downarrow$	baseline	pruned
Resnet-56	CP (He et al., 2017)	-	50.6	92.80	91.80
	FPEC (Li et al., 2016)	14.1	27.6	93.04	93.06
	NISP (Yu et al., 2018)	42.4	35.5	93.26	93.01
	GAL-0.8 (Lin et al., 2019)	65.9	60.2	93.26	91.58
	GBN (You et al., 2019)	66.7	70.3	93.10	93.07
	HRank (Lin et al., 2020)	42.4	50.0	93.26	93.17
	PFEC+KESI (our)	67.1	61.5	93.42	$\mathbf{93.34\pm 0.05}$
Resnet-110	PFEC (Li et al., 2016)	32.6	38.7	93.53	93.30
	GAL-0.5 (Lin et al., 2019)	44.8	48.5	93.50	92.74
	HRank (Lin et al., 2020)	68.7	68.6	93.52	92.65
	PFEC+KESI (our)	77.5	65.4	94.01	$\mathbf{94.01\pm 0.22}$

Table 1: Comparing performance of pruned networks with other approaches on CIFAR-10 dataset.

Model	Structured Pruning					Unstructured Pruning
Model	Method	# Params(M)	% MACs $\downarrow$	C10	C100	Method	# Params(M)	C10	C100
Resnet-56	baseline	0.85	0.00	93.42	71.07	baseline	0.85	93.42	71.07
	PFEC Li et al. (2016)	0.28	61.5	$90.35\pm 0.36$	$64.91\pm 0.14$	MWP Han et al. (2015)	0.29	$92.69\pm 0.02$	$69.53\pm 0.07$
	PFEC+One-cycle	0.28	61.5	$92.31\pm 0.26$	$69.03\pm 0.24$	MWP+One-cycle	0.29	${\color[rgb]{0,0,1}93.41\pm 0.08}$	${\color[rgb]{0,0,1}70.46\pm 0.30}$
	PFEC+KESI(our)	0.28	61.5	${\color[rgb]{0,0,1}\mathbf{93.34\pm 0.05}}$	${\color[rgb]{0,0,1}\mathbf{70.95\pm 0.11}}$	MWP+KESI(our)	0.29	$\mathbf{93.90\pm 0.10}$	$\mathbf{72.27\pm 0.09}$
Resnet-110	baseline	1.73	0.00	94.01	72.35	baseline	1.73	94.01	72.35
	PFEC Li et al. (2016)	0.39	65.38	$91.51\pm 0.08$	$65.44\pm 0.04$	MWP Han et al. (2015)	0.36	$93.02\pm 0.04$	$68.90\pm 0.08$
	PFEC+One-cycle	0.39	65.38	$93.24\pm 0.16$	$69.54\pm 0.07$	MWP+One-cycle	0.36	$93.69\pm 0.17$	$71.59\pm 0.30$
	PFEC+KESI(our)	0.39	65.38	${\color[rgb]{0,0,1}\mathbf{94.01\pm 0.22}}$	${\color[rgb]{0,0,1}\mathbf{72.12\pm 0.11}}$	MWP+KESI(our)	0.36	${\color[rgb]{0,0,1}\mathbf{94.44\pm 0.11}}$	${\color[rgb]{0,0,1}\mathbf{73.12\pm 0.25}}$
Preresnet-164	baseline	1.70	0.00	95.06	76.35	baseline	1.70	95.06	76.35
	PFEC Li et al. (2016)	0.31	69.23	$92.05\pm 0.11$	$69.20\pm 0.04$	MWP Han et al. (2015)
	PFEC+One-cycle	0.31	69.23	${\color[rgb]{0,0,1}94.15\pm 0.06}$	${\color[rgb]{0,0,1}73.99\pm 0.06}$	MWP+One-cycle
	PFEC+KESI(our)	0.31	69.23	${\color[rgb]{0,0,1}\mathbf{94.30\pm 0.53}}$	${\color[rgb]{0,0,1}\mathbf{75.84\pm 0.32}}$	MWP+KESI(our)
WideResnet-16-8	baseline	10.96	0.00	95.62	79.57	baseline	10.96	95.62	79.57
	PFEC Li et al. (2016)	2.48	64.52	$94.61\pm 0.09$	$73.82\pm 0.10$	MWP Han et al. (2015)	2.53	$95.47\pm 0.07$	$77.92\pm 0.16$
	PFEC+One-cycle	2.48	64.52	${\color[rgb]{0,0,1}94.91\pm 0.04}$	${\color[rgb]{0,0,1}76.46\pm 0.27}$	MWP+One-cycle	2.53	${\color[rgb]{0,0,1}95.55\pm 0.05}$	${\color[rgb]{0,0,1}78.82\pm 0.11}$
	PFEC+KESI(our)	2.48	64.52	${\color[rgb]{0,0,1}\mathbf{95.68\pm 0.12}}$	${\color[rgb]{0,0,1}\mathbf{79.01\pm 0.20}}$	MWP+KESI(our)	2.53	${\color[rgb]{0,0,1}\mathbf{95.97\pm 0.05}}$	${\color[rgb]{0,0,1}\mathbf{80.08\pm 0.06}}$
VGG-16	baseline	14.99	0.00	94.23	73.24	baseline	14.99	94.23	73.24
	PFEC Li et al. (2016)	2.71	45.16	$93.88\pm 0.12$	$68.37\pm 0.09$	MWP Han et al. (2015)	1.02	$93.47\pm 0.22$	$68.39\pm 0.21$
	PFEC+One-cycle	2.71	45.16	$94.10\pm 0.09$	${\color[rgb]{0,0,1}71.95\pm 0.04}$	MWP+One-cycle	1.02	${\color[rgb]{0,0,1}93.53\pm 0.10}$	${\color[rgb]{0,0,1}71.74\pm 0.15}$
	PFEC+KESI(our)	2.71	45.16	${\color[rgb]{0,0,1}\mathbf{94.59\pm 0.09}}$	${\color[rgb]{0,0,1}\mathbf{73.52\pm 0.20}}$	MWP+KESI(our)	1.02	${\color[rgb]{0,0,1}\mathbf{94.01\pm 0.06}}$	${\color[rgb]{0,0,1}\mathbf{73.91\pm 0.09}}$

Table 2: Accuracy (%) of pruned networks on CIFAR-10 and CIFAR-100 datasets trained with different strategies. PFEC (or MWP) are models pruned with

\ell_{1}

-norm filters pruning Li et al. (2016) (or magnitude-based weights pruning Han et al. (2015)) and fine-tuned with small learning rate. PFEC/MWP+One-cycle are pruned networks retrained with large learning rate restarting. PFEC/MWP+KESI are pruned networks retrained with our pipeline

Table 3: Performance of compact models on Tiny-Imagenet Model Method #Params (M) MACs(G) Acc Resnet-18 baseline 11.01 1.82 67.22 FPEC Li et al. (2016) 2.71 0.83 $61.06\pm 0.32$ FPEC+One-cycle 2.71 0.83 ${\color[rgb]{0,0,1}64.70\pm 0.33}$ FPEC+KESI (our) 2.71 0.83 ${\color[rgb]{0,0,1}\mathbf{66.87\pm 0.26}}$ Resnet-34 baseline 21.39 3.68 68.81 FPEC Li et al. (2016) 5.40 1.57 $64.93\pm 0.15$ FPEC+One-cycle 5.40 1.57 ${\color[rgb]{0,0,1}67.26\pm 0.21}$ FPEC+KESI (our) 5.40 1.57 ${\color[rgb]{0,0,1}\mathbf{70.02\pm 0.43}}$ Table 4: Knowledge distillation with ensembles teacher and single model teacher Model Method #Params (M) C10 C100 Resnet-56 baseline 0.85 93.42 71.07 single teacher 0.28 $93.13\pm 0.04$ $70.29\pm 0.14$ ensemble teacher 0.28 $\mathbf{93.34\pm 0.05}$ $\mathbf{72.27\pm 0.09}$ Resnet-110 baseline 1.73 94.01 72.35 single teacher 0.39 $93.48\pm 0.05$ $71.50\pm 0.11$ ensemble teacher 0.39 $\mathbf{94.01\pm 0.22}$ $\mathbf{73.12\pm 0.25}$ WRN-16-8 baseline 19.96 95.62 79.57 single teacher 2.48 $95.37\pm 0.21$ $78.71\pm 0.24$ ensemble teacher 2.48 $\mathbf{95.68\pm 0.12}$ $\mathbf{79.01\pm 0.20}$

In this section, we demonstrate that the smaller models trained with our pipeline (KESI) achieve comparable or even better results than the original model. Each final model is iteratively pruned and retrained in 5 cycles with different strategies. Table 2 and 4 present the performance of compact models on CIFAR-10, CIFAR-100 and Tiny-Imagenet. Specifically, we compare the iteratively-pruned-models retrained with small learning rate, large learning rate and our pipeline (i.e. large learning rate + knowledge distillation). Our pipeline consistently outperforms the standard strategy by a large margin for both structured and unstructured pruning.

Although our approach is general and can be applied to any (iterative) pruning mechanism, we also give a comparison of model trained with our pipeline and conventional approaches in table 1. We conduct experiment to compare performance of student networks trained with single teacher (i.e. original/unpruned networks) and ensembles teacher in table 4 for ablation study. We can see that compact models learn from ensembles outperform those learn from a single teacher by a large margin.

7 Conclusion

We propose a simple pipeline by slightly modifying the standard approach to acquire the advantages of network ensembles, knowledge distillation and network pruning. Our experiments show that small and compact networks trained with our pipeline significantly outperform the standard approach and create very strong baselines for model compression. Specifically, our method reduces nearly $80\%$ of parameters and $70\%$ FLOPs of several models by structured pruning without incurring loss in performance.
Acknowledgement The authors thank anonymous reviewers and area chairs for their useful feedback. We also want to express our appreciation to Ms. Le Thi Tham Quynh for her valuable aids in the final preparation of the paper.

References

Anil et al. (2018) Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.
Ashukha et al. (2020) Arsenii Ashukha, Alexander Lyzhov, Dmitry Molchanov, and Dmitry Vetrov. Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. arXiv preprint arXiv:2002.06470, 2020.
Ba and Caruana (2014) Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in neural information processing systems, pages 2654–2662, 2014.
Bagherinezhad et al. (2018) Hessam Bagherinezhad, Maxwell Horton, Mohammad Rastegari, and Ali Farhadi. Label refinery: Improving imagenet classification through label progression. arXiv preprint arXiv:1805.02641, 2018.
Balan et al. (2015) Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems, pages 3438–3446, 2015.
(6) Leo Breiman and Nong Shang. Born again trees.
Buciluǎ et al. (2006) Cristian Buciluǎ, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 535–541, 2006.
Cho and Hariharan (2019) Jang Hyun Cho and Bharath Hariharan. On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pages 4794–4802, 2019.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
Frankle and Carbin (2018) Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
Furlanello et al. (2018) Tommaso Furlanello, Zachary C Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. arXiv preprint arXiv:1805.04770, 2018.
Garipov et al. (2018) Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of dnns. In Advances in Neural Information Processing Systems, pages 8789–8798, 2018.
Guo et al. (2016) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient dnns. In Advances in neural information processing systems, pages 1379–1387, 2016.
Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
Hanson and Pratt (1989) Stephen José Hanson and Lorien Y Pratt. Comparing biases for minimal network construction with back-propagation. In Advances in neural information processing systems, pages 177–185, 1989.
He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016b.
He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Huang et al. (2017) Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109, 2017.
Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems, pages 6402–6413, 2017.
LeCun et al. (1990) Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information processing systems, pages 598–605, 1990.
Lee and Chung (2019) Jisoo Lee and Sae-Young Chung. Robust training with ensemble consensus. arXiv preprint arXiv:1910.09792, 2019.
Li et al. (2016) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
Lin et al. (2020) Mingbao Lin, Rongrong Ji, Yan Wang, Yichen Zhang, Baochang Zhang, Yonghong Tian, and Ling Shao. Hrank: Filter pruning using high-rank feature map. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1529–1538, 2020.
Lin et al. (2019) Shaohui Lin, Rongrong Ji, Chenqian Yan, Baochang Zhang, Liujuan Cao, Qixiang Ye, Feiyue Huang, and David Doermann. Towards optimal structured cnn pruning via generative adversarial learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2799, 2019.
Liu et al. (2018) Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
Loshchilov and Hutter (2016) Ilya Loshchilov and Frank Hutter. Sgdr: stochastic gradient descent with restarts. corr abs/1608.03983 (2016). arXiv preprint arXiv:1608.03983, 2016.
Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision, pages 5058–5066, 2017.
Malinin et al. (2019) Andrey Malinin, Bruno Mlodozeniec, and Mark Gales. Ensemble distribution distillation. arXiv preprint arXiv:1905.00076, 2019.
(33) Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant.
Mirzadeh et al. (2019) Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant: Bridging the gap between student and teacher. arXiv preprint arXiv:1902.03393, 2019.
Molchanov et al. (2016) Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
Reed (1993) Russell Reed. Pruning algorithms-a survey. IEEE transactions on Neural Networks, 4(5):740–747, 1993.
Renda et al. (2020) Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
Romero et al. (2014) Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
Smith (2015) Leslie N Smith. No more pesky learning rate guessing games. CoRR, abs/1506.01186, 5, 2015.
Smith and Topin (2019) Leslie N Smith and Nicholay Topin. Super-convergence: Very fast training of neural networks using large learning rates. In Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, page 1100612. International Society for Optics and Photonics, 2019.
Snoek et al. (2019) Jasper Snoek, Yaniv Ovadia, Emily Fertig, Balaji Lakshminarayanan, Sebastian Nowozin, D Sculley, Joshua Dillon, Jie Ren, and Zachary Nado. Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, pages 13969–13980, 2019.
Srinivas and Babu (2015) Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
Yang et al. (2019a) Chenglin Yang, Lingxi Xie, Siyuan Qiao, and Alan L Yuille. Training deep neural networks in generations: A more tolerant teacher educates better students. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5628–5635, 2019a.
Yang et al. (2019b) Chenglin Yang, Lingxi Xie, Chi Su, and Alan L Yuille. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2859–2868, 2019b.
You et al. (2019) Zhonghui You, Kun Yan, Jinmian Ye, Meng Ma, and Ping Wang. Gate decorator: Global filter pruning method for accelerating deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 2133–2144, 2019.
Yu and Huang (2019) Jiahui Yu and Thomas S Huang. Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pages 1803–1811, 2019.
Yu et al. (2018) Ruichi Yu, Ang Li, Chun-Fu Chen, Jui-Hsin Lai, Vlad I Morariu, Xintong Han, Mingfei Gao, Ching-Yung Lin, and Larry S Davis. Nisp: Pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9194–9203, 2018.
Zagoruyko and Komodakis (2016a) Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016a.
Zagoruyko and Komodakis (2016b) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016b.
Zhang et al. (2018) Ying Zhang, Tao Xiang, Timothy M Hospedales, and Huchuan Lu. Deep mutual learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4320–4328, 2018.
Zhu et al. (2018) Xiatian Zhu, Shaogang Gong, et al. Knowledge distillation by on-the-fly native ensemble. In Advances in neural information processing systems, pages 7517–7527, 2018.

Appendix A Results

Table 5: Results of iterative

\ell_{1}

-norm Filters Pruning Li et al. (2016) on CIFAR-10 and CIFAR-100 datasets. The SLR column presents the result of pruned networks finetuned with small learning rate while LLR column shows the results of same networks finetuned with large learning rate.

Model	#Param	% MACs(G) $\downarrow$	C10-SLR	C10-LLR	C100-SLR	C100-LLR
Resnet-110 (baseline)	1.73M	0.00	94.01	94.01	72.35	72.35
Resnet-110 #1	1.26M	23.08	$93.51\pm 0.05$	$93.58\pm 0.05$	$69.70\pm 0.17$	$71.67\pm 0.03$
Resnet-110 #2	0.93M	38.46	$92.91\pm 0.02$	$93.49\pm 0.05$	$68.47\pm 0.20$	$70.97\pm 0.43$
Resnet-110 #3	0.70M	50.00	$92.65\pm 0.02$	$93.53\pm 0.12$	$66.89\pm 0.14$	$70.59\pm 0.19$
Resnet-110 #4	0.52M	57.69	$92.11\pm 0.01$	$93.62\pm 0.03$	$66.19\pm 0.07$	$70.26\pm 0.26$
Resnet-110 #5	0.39M	65.38	$91.51\pm 0.08$	$93.24\pm 0.16$	$65.44\pm 0.04$	$69.54\pm 0.07$
Resnet-110 Ensemble	-	-	93.82	$94.01\pm 0.22$	73.32	$75.33\pm 0.14$
Resnet-56 (baseline)	0.85M	0.00	93.42	93.42	71.07	71.07
Resnet-56 #1	0.66M	23.07	$92.73\pm 0.06$	$93.23\pm 0.18$	$69.10\pm 0.15$	$69.83\pm 0.09$
Resnet-56 #2	0.52M	30.77	$92.24\pm 0.17$	$93.21\pm 0.08$	$67.92\pm 0.05$	$69.58\pm 0.13$
Resnet-56 #3	0.42M	46.15	$91.64\pm 0.19$	$92.90\pm 0.16$	$66.76\pm 0.03$	$69.50\pm 0.25$
Resnet-56 #4	0.35M	53.85	$91.10\pm 0.27$	$92.74\pm 0.22$	$65.205\pm 0.05$	$69.49\pm 0.08$
Resnet-56 #5	0.29M	61.54	$90.35\pm 0.36$	$92.31\pm 0.26$	$64.91\pm 0.14$	$69.03\pm 0.24$
Resnet-56 Ensemble	-	-	93.25	$94.29\pm 0.02$	71.37	$74.23\pm 0.21$
VGG-16 (baseline)	14.99M	0.00	94.23	94.23	73.24	73.24
VGG-16 #1	9.46M	0.00	$94.13\pm 0.09$	$93.95\pm 0.02$	$71.39\pm 0.06$	$72.38\pm 0.11$
VGG-16 #2	6.27M	0.00	$94.09\pm 0.13$	$93.90\pm 0.03$	$70.48\pm 0.07$	$72.10\pm 0.1$
VGG-16 #3	4.43M	0.00	$94.09\pm 0.04$	$93.93\pm 0.04$	$69.73\pm 0.06$	$72.28\pm 0.11$
VGG-16 #4	3.36M	0.00	$94.03\pm 0.13$	$93.89\pm 0.10$	$69.09\pm 0.05$	$72.22\pm 0.19$
VGG-16 #5	2.71M	0.00	$93.88\pm 0.12$	$94.10\pm 0.09$	$68.37\pm 0.09$	$71.95\pm 0.04$
VGG-16 Ensemble	-	-	94.29	$95.04\pm 0.07$	$72.86\pm 0.02$	$75.93\pm 0.06$
PreResnet-164 (baseline)	1.7M	0.00	95.06	95.06	76.35	76.35
PreResnet-164 #1	1.09M	26.92	$94.43\pm 0.06$	$94.92\pm 0.05$	$74.65\pm 0.04$	$76.20\pm 0.07$
PreResnet-164 #2	0.74M	46.15	$93.71\pm 0.07$	$94.74\pm 0.14$	$73.17\pm 0.03$	$75.87\pm 0.03$
PreResnet-164 #3	0.54M	57.69	$93.50\pm 0.01$	$94.66\pm 0.19$	$71.89\pm 0.01$	$75.15\pm 0.16$
PreResnet-164 #4	0.4M	65.38	$92.61\pm 0.02$	$94.69\pm 0.17$	$70.50\pm 0.09$	$74.03\pm 0.68$
PreResnet-164 #5	0.31M	69.23	$92.06\pm 0.11$	$94.15\pm 0.06$	$69.20\pm 0.04$	$73.99\pm 0.06$
PreResnet-164 Ensemble	-	-	-	$95.60\pm 0.04$	-	$79.19\pm 0.07$
WideResnet-16-8 (baseline)	11.01	0.00	95.62	95.62	79.57	79.57
WideResnet-16-8 #1	8.01	20.00	$95.36\pm 0.01$	$95.18\pm 0.1$	$78.52\pm 0.04$	$78.19\pm 0.19$
WideResnet-16-8 #2	5.89	35.48	$95.20\pm 0.02$	$95.25\pm 0.14$	$77.46\pm 0.14$	$77.81\pm 0.2$
WideResnet-16-8 #3	4.38	47.74	$94.95\pm 0.01$	$95.08\pm 0.16$	$76.29\pm 0.02$	$77.43\pm 0.36$
WideResnet-16-8 #4	3.28	57.42	$94.97\pm 0.01$	$95.08\pm 0.08$	$74.73\pm 0.03$	$76.95\pm 0.21$
WideResnet-16-8 #5	2.48	64.52	$94.61\pm 0.09$	$94.91\pm 0.04$	$73.82\pm 0.1$	$76.46\pm 0.27$
WideResnet-16-8 Ensemble	-	-	95.63	$95.79\pm 0.08$	$79.22\pm 0.03$	$80.45\pm 0.14$

Table 6: Results of iterative Weights pruning Han et al. (2015) on CIFAR-10 and CIFAR-100 datasets. The SLR column presents the result of pruned networks finetuned with small learning rate while LLR column shows the results of same networks finetuned with large learning rate.

Model	#Active Params (M)	C10-SLR	C10-LLR	C100-SLR	C100-LLR
Resnet-56 (baseline)	0	93.42	93.42	71.07	71.07
Resnet-56 #1	0.66	$93.15\pm 0.02$	$93.36\pm 0.11$	$70.95\pm 0.05$	$70.40\pm 0.12$
Resnet-56 #2	0.52	$93.18\pm 0.05$	$93.35\pm 0.17$	$70.78\pm 0.01$	$70.58\pm 0.19$
Resnet-56 #3	0.42	$93.03\pm 0.02$	$93.36\pm 0.12$	$70.31\pm 0.06$	$70.51\pm 0.13$
Resnet-56 #4	0.35	$92.78\pm 0.01$	$93.43\pm 0.08$	$69.91\pm 0.01$	$70.75\pm 0.04$
Resnet-56 #5	0.29	$92.69\pm 0.02$	$93.41\pm 0.08$	$69.53\pm 0.07$	$70.46\pm 0.3$
Resnet-56 Ensemble	-	93.39	$94.15\pm 0.04$	$71.18\pm 0.08$	$72.69\pm 0.02$
Resnet-110 (baseline)	0	94.01	94.01	72.35	72.35
Resnet-110 #1	1.27	$93.90\pm 0.08$	$93.65\pm 0.1$	$72.30\pm 0.07$	$72.11\pm 0.24$
Resnet-110 #2	0.94	$93.73\pm 0.05$	$93.86\pm 0.1$	$71.75\pm 0.02$	$72.09\pm 0.06$
Resnet-110 #3	0.69	$93.65\pm 0.03$	$93.89\pm 0.03$	$70.96\pm 0.04$	$72.38\pm 0.25$
Resnet-110 #4	0.50	$93.32\pm 0.07$	$93.79\pm 0.01$	$70.69\pm 0.02$	$72.09\pm 0.05$
Resnet-110 #5	0.36	$93.02\pm 0.04$	$93.69\pm 0.17$	$68.90\pm 0.08$	$71.59\pm 0.30$
Resnet-110 Ensemble	-	$93.98\pm 0.04$	$94.56\pm 0.04$	$72.51\pm 0.06$	$74.19\pm 0.02$
WideResnet-110 (baseline)	11.01	95.62	95.62	79.57	79.57
WideResnet-110 #1	8.05	$95.55\pm 0.02$	$95.32\pm 0.03$	$79.32\pm 0.07$	$78.75\pm 0.15$
WideResnet-110 #2	5.97	$95.60\pm 0.03$	$95.46\pm 0.07$	$79.17\pm 0.07$	$78.74\pm 0.18$
WideResnet-110 #3	4.44	$95.65\pm 0.05$	$95.30\pm 0.02$	$77.86\pm 0.02$	$78.84\pm 0.13$
WideResnet-110 #4	3.34	$95.65\pm 0.02$	$95.37\pm 0.05$	$78.02\pm 0.19$	$78.81\pm 0.17$
WideResnet-110 #5	2.53	$95.47\pm 0.07$	$95.55\pm 0.05$	$77.92\pm 0.16$	$78.82\pm 0.11$
WideResnet-110 Ensemble	-	$95.62\pm 0.03$	$95.86\pm 0.08$	$79.62\pm 0.28$	$80.26\pm 0.11$

Table 7: Results of iterative

\ell_{1}

-norm Filters Pruning Li et al. (2016) on Tiny-Imagenet dataset. The SLR column presents the result of pruned networks finetuned with small learning rate while LLR column shows the results of same networks finetuned with large learning rate.

Model	#Param	SLR	LLR
Resnet-18 (baseline)	11.01	67.22	67.22
Resnet-18 #1	8.30	$65.66\pm 0.21$	$66.46\pm 0.23$
Resnet-18 #2	6.17	$64.04\pm 0.17$	$65.81\pm 0.14$
Resnet-18 #3	4.64	$63.59\pm 0.10$	$65.37\pm 0.07$
Resnet-18 #4	3.52	$61.91\pm 0.12$	$64.75\pm 0.13$
Resnet-18 #5	2.71	$61.06\pm 0.32$	$64.70\pm 0.33$
Resnet-18 Ensemble	-	$67.63\pm 0.21$	$69.30\pm 0.11$
Resnet-34 (baseline)	21.39	68.81	68.81
Resnet-34 #1	15.97	$68.18\pm 0.06$	$68.66\pm 0.18$
Resnet-34 #2	12.02	$67.10\pm 0.10$	$68.20\pm 0.02$
Resnet-34 #3	9.13	$66.41\pm 0.07$	$67.90\pm 0.12$
Resnet-34 #4	6.99	$66.05\pm 0.19$	$67.02\pm 0.24$
Resnet-34 #5	5.40	$64.93\pm 0.14$	$67.26\pm 0.08$
Resnet-34 Ensemble	-	$69.88\pm 0.11$	$71.31\pm 0.13$


(a) Resnet-56 on CIFAR-10	(b) Resnet-110 on CIFAR-10	(c) WideResnet-16-8 on CIFAR-10


(a) Resnet-56 on CIFAR-100	(b) Resnet-110 on CIFAR-100	(c) WideResnet-16-8 on CIFAR-100