
Accelerate CNNs from Three Dimensions: A Comprehensive
Pruning Framework

Wenxiao Wang    Minghao Chen    Shuai Zhao    Long Chen    Jinming Hu    Haifeng Liu    Deng Cai    Xiaofei He    Wei Liu
Abstract

Most neural network pruning methods, such as filter-level and layer-level pruning, prune the network along a single dimension (depth, width, or resolution) to meet a computational budget. However, such a pruning policy often leads to excessive reduction of that dimension and thus a large accuracy loss. To alleviate this issue, we argue that pruning should be conducted along all three dimensions comprehensively. For this purpose, our pruning framework formulates pruning as an optimization problem: it first casts the relationship between a model's accuracy and its depth/width/resolution into a polynomial regression, and then maximizes the polynomial to acquire the optimal values for the three dimensions. Finally, the model is pruned along the three optimized dimensions accordingly. In this framework, collecting enough data to train the regression is very time-costly, so we propose two approaches to lower the cost: 1) specializing the polynomial to ensure an accurate regression even with less training data; 2) employing iterative pruning and fine-tuning to collect the data faster. Extensive experiments show that our proposed algorithm surpasses state-of-the-art pruning algorithms and even neural architecture search-based algorithms.


1 Introduction

To deploy pre-trained Convolutional Neural Networks (CNNs) (Simonyan & Zisserman, 2015; He et al., 2016; Huang et al., 2017; Tan & Le, 2019) on resource-constrained mobile devices, plenty of methods (Ba & Caruana, 2014; Hinton et al., 2015; Liu et al., 2017; He et al., 2019; Frankle & Carbin, 2019) have been proposed for model acceleration. Among them, neural network pruning, which prunes redundant components (e.g., filters) of CNNs to cater for a computational budget, is one of the most popular and is the focus of this paper.

Currently, the dominant pruning methods fall into three categories: (1) layer-level pruning (Wang et al., 2019b; Lin et al., 2019), which prunes redundant layers and reduces the model's depth; (2) filter-level pruning (Liu et al., 2017; Li et al., 2017; Molchanov et al., 2017; He et al., 2019; Wang et al., 2019a; Luo et al., 2019; Kang & Han, 2020; Ye et al., 2020a), which prunes redundant filters and reduces the model's width; and (3) image-level pruning (Howard et al., 2017; Tan & Le, 2019; Han et al., 2020), which resizes images and reduces the model's input resolution (although images are resized rather than literally pruned, we use the term "pruning images" for simplicity in what follows). Each of these three kinds of pruning methods focuses on a single dimension (i.e., depth, width, or image resolution) that impacts a model's computational cost.

Naturally, we raise an important question overlooked by much previous work: given a pre-trained neural network model, which dimension (depth, width, or resolution) should we prune to minimize the model's accuracy loss? In practice, users empirically choose a seemingly redundant dimension, which often leads to a sub-optimal pruned model because of an inappropriate dimension choice. Even worse, excessive pruning along any single dimension causes an unacceptable loss, as shown in Figure 1. Instead, comprehensively pruning all three dimensions yields a much lower loss than pruning any single dimension, as also demonstrated in Figure 1, thereby enabling model acceleration with much better quality.

In this paper, we propose a framework that prunes three dimensions comprehensively. Instead of solely pruning one dimension to reduce the computational cost, our framework first decides how much of each dimension should be pruned. To this end, we formulate model acceleration as an optimization problem. Precisely, given a pre-trained neural network model and a target computational cost, assuming that the pruned model's depth, width, and resolution are $d\times 100\%$, $w\times 100\%$, and $r\times 100\%$ of the original model, respectively, we seek the optimal $(d,w,r)$ that maximizes the model's accuracy $a$:

$$\max_{d,w,r}\ a := \mathcal{F}(d,w,r), \quad \mathrm{s.t.}\ \mathcal{C}(d,w,r)=\tau, \qquad (1)$$

where $\mathcal{F}(d,w,r)$ is a Model Accuracy Predictor (MAP), and $\mathcal{C}(d,w,r)$ and $\tau$ represent the model's computational cost and its constraint, respectively. Tan & Le (2019) designed a reasonable expression for $\mathcal{C}(d,w,r)$. However, designing a MAP manually is unachievable because its form can be arbitrarily complicated and may even vary with the architecture (e.g., the MAPs for ResNet and MobileNet may take different forms). Hence, we propose approximating the MAP via a polynomial regression, because polynomials can approximate arbitrary continuous functions according to Taylor's theorem. Specifically, we formulate the MAP as a polynomial and collect a sufficient set of $(d,w,r,a)$ tuples as training data to estimate its parameters. Then, problem (1) can be solved with Lagrange's multiplier theorem, and the model is eventually pruned according to the optimized $(d,w,r)$.

The main challenge this framework encounters is that the polynomial regression requires a large amount of training data (i.e., $\{(d,w,r,a)\}$), while collecting the data is very costly because fetching each item, i.e., one $(d,w,r,a)$ tuple, means training a new neural network model from scratch. To reduce both the collection time and the model training cost, we improve the framework in two aspects: 1) We propose a specialized polynomial whose weight tensor is replaced with its low-rank substitute. The low-rank weight tensor prevents the polynomial from overfitting and ensures an accurate regression even with limited training data; further, as a bonus, the updated MAP has a more concise form. 2) Given a pre-trained model, we prune and fine-tune it iteratively to acquire a series of new models and their corresponding $\{(d,w,r,a)\}$, which is much faster than training such new models from scratch.

Extensive experiments are conducted to show the superiority of our proposed pruning algorithm over the state-of-the-art pruning algorithms. Further, we compare against some algorithms that balance the size of three dimensions (depth, width, and resolution) from a Neural Architecture Search (NAS) perspective. The comparative results also show our advantages over them.

It is worth highlighting that the contributions of this work are three-fold:

  • We propose to prune a model along three dimensions comprehensively and determine the optimal values for these dimensions by solving a polynomial regression and subsequently an optimization problem.

  • To complete the regression process at an acceptable cost, we apply two approaches: 1) specializing the MAP to adapt to the scenario of limited training data; 2) using iterative pruning and fine-tuning to collect data faster.

  • We do extensive experiments to validate that our proposed algorithm outperforms state-of-the-art pruning and even NAS-based model acceleration algorithms.

2 Background and Related Work

Neural Network Pruning:

In the early stage, neural network pruning was done at the weight level (Han et al., 2016; Frankle & Carbin, 2019; Sehwag et al., 2020; Ye et al., 2020b; Lee et al., 2019). However, weight-level pruning needs specific libraries for sparse matrix computation (e.g., cuSPARSE) to accelerate inference, and support for these libraries on mobile devices is limited. Nowadays, the most dominant pruning methods operate at the filter level, layer level, or image level, directly reducing the computational cost on all devices. Filter-level pruning (Liu et al., 2017; Li et al., 2017; Molchanov et al., 2017; He et al., 2018, 2019; Wang et al., 2019a; Kang & Han, 2020; Ye et al., 2020a; Li et al., 2020; Wang et al., 2020) compresses models by removing unimportant filters in CNNs, layer-level pruning (Wang et al., 2019b) does so by pruning redundant layers, and image-level pruning (Howard et al., 2017) saves computation by using smaller input images. All of them have achieved great success in pruning CNNs; however, focusing on a single dimension also limits their potential.

Figure 2: The pipeline of the proposed pruning framework. It first prunes a pre-trained model along three dimensions independently, yielding a set of $(d_{n},w_{n},r_{n},a_{n})$ tuples that is taken as training data. Then, the training data is used to fit our specialized MAP ($\mathcal{F}$) via a polynomial regression. The optimal $(d^{\star},w^{\star},r^{\star})$ is then acquired by maximizing $\mathcal{F}$ subject to a computational cost constraint. Finally, the model is pruned comprehensively in terms of $(d^{\star},w^{\star},r^{\star})$.
Multi-Dimension Pruning:

To the best of our knowledge, two methods (Wen et al., 2016; Lin et al., 2019) prune models at both the filter and layer levels. Both of them train models with extra regularization terms to induce sparsity into the models; the filters or layers with high sparsity are then pruned with only a slight loss incurred. However, the same approach cannot be used to balance the image size, because images contain no trainable parameters and there is no way to induce sparsity into them. In contrast, our proposed framework can balance all three dimensions comprehensively, yielding better model acceleration results than the above-mentioned methods.

Pruning vs. NAS:

Pruning and NAS (Pham et al., 2018; Gao et al., 2020; He et al., 2020; Tian et al., 2020; Liu et al., 2019; Tan & Le, 2019; Han et al., 2020; Howard et al., 2019; Huang et al., 2019; Zoph & Le, 2017) share the same goal, that is, maximizing a model's accuracy given a computational budget. However, their settings are very different: pruning shrinks the model from a pre-trained one, utilizing both the pre-trained model's architecture and its weights, while NAS searches for the model from scratch. Therefore, although several algorithms (Tan & Le, 2019; Han et al., 2020) also balance the three dimensions (i.e., depth, width, and resolution) of the model from a NAS perspective, they cannot be applied to pruning directly.

3 Proposed Framework

3.1 Preliminaries

For a model $M$, we define $\mathcal{D}(M)$, $\mathcal{W}(M,l)$, and $\mathcal{R}(M)$ as its depth, width, and input resolution, respectively. Specifically, $\mathcal{D}(M)$ represents the number of blocks (e.g., Conv-BN-ReLU blocks or residual blocks) that $M$ contains; $\mathcal{W}(M,l)$ denotes the number of filters of a certain layer $l$ in $M$; $\mathcal{R}(M)$ is the side length of $M$'s input image. Given a pre-trained model $M_{0}$, we also define $d_{n}$, $w_{n}$, and $r_{n}$ of a pruned model $M_{n}$ as:

$$d_{n}=\frac{\mathcal{D}(M_{n})}{\mathcal{D}(M_{0})},\quad w_{n}=\frac{\mathcal{W}(M_{n},l)}{\mathcal{W}(M_{0},l)},\quad r_{n}=\frac{\mathcal{R}(M_{n})}{\mathcal{R}(M_{0})}. \qquad (2)$$

For filter pruning, following previous work (Luo et al., 2019; He et al., 2019; Lin et al., 2020), we prune all layers with the same ratio, so $w_{n}$ of a model does not depend on the choice of layer $l$. Further, for a pruning task, it is easy to see that $d_{n},w_{n},r_{n}\in(0,1]$ and $d_{0}=w_{0}=r_{0}=1$.

The pipeline of our proposed pruning framework is introduced in Figure 2. Unlike previous work that prunes one dimension solely, we first look for a pruning policy (i.e., how much of each dimension should be pruned) which aims to maximize the model’s accuracy in Section 3.2. Then, we depict the process of pruning and fine-tuning a target model in terms of pruning policy in Section 3.3.

3.2 Model Acceleration as Optimization

3.2.1 Formulation

Given a CNN architecture, the model’s depth, width, and image resolution are three key aspects that affect both the model’s accuracy and its computational cost. Thus, model acceleration can be formulated as the following problem:

$$d^{\star},w^{\star},r^{\star} = \mathop{\arg\max}_{d,w,r}\ \mathcal{F}(d,w,r;\Theta), \quad \mathrm{s.t.}\ \mathcal{C}(d,w,r)=T\times\mathcal{C}(d_{0},w_{0},r_{0}), \qquad (3)$$

where $\mathcal{F}(d,w,r;\Theta)$ is a Model Accuracy Predictor (MAP), which predicts the model's accuracy given $(d,w,r)$, and $\Theta$ contains the parameters of the MAP. $\mathcal{C}(d,w,r)$ represents the computational cost (e.g., FLOPs) of a model, and $T\in(0,1)$ specifies that the pruned model's computational cost is a fraction $T$ of the original model's. Problem (3) can be solved using Lagrange's multiplier theorem once $\mathcal{F}(d,w,r)$ and $\mathcal{C}(d,w,r)$ are known. Following Tan & Le (2019), in which a model's computational cost is proportional to $d$, $w^{2}$, and $r^{2}$, we re-define $\mathcal{C}(d,w,r)$ as:

$$\mathcal{C}(d,w,r)=dw^{2}r^{2}. \qquad (4)$$
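As a quick arithmetic illustration of Eq. (4) (the specific triple below is ours, chosen only for convenience): to halve the computational cost ($T=0.5$), pruning depth alone forces $d=0.5$ and pruning width or resolution alone forces $w\approx 0.71$ or $r\approx 0.71$, whereas a comprehensive policy such as $(d,w,r)=(0.80,0.86,0.92)$ also satisfies $dw^{2}r^{2}=0.80\times 0.86^{2}\times 0.92^{2}\approx 0.50$ while cutting each individual dimension far less.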

However, designing a MAP manually is unachievable, since its form can be arbitrarily complicated and different architectures may have different forms. An intuitive idea is to resort to a polynomial regression, because any continuous function can be approximated with polynomials according to Taylor's theorem. Specifically, we can train $N$ models with different $(d,w,r)$, obtain their accuracies $a$, and fit the MAP with a polynomial using $\{(d_{n},w_{n},r_{n},a_{n})\}_{n=1}^{N}$ as training data. However, an accurate regression requires hundreds of data items $(d_{n},w_{n},r_{n},a_{n})$, whereas fetching each item requires training a new model from scratch, which is very resource-inefficient and time-consuming. To overcome this obstacle, on the one hand, we specialize a MAP that ensures an accurate regression even with less training data (Section 3.2.2); on the other hand, we expedite acquiring each data item $(d_{n},w_{n},r_{n},a_{n})$ by employing iterative pruning and fine-tuning (Section 3.2.3).

3.2.2 Specialized MAP

The polynomial-shaped MAP can be represented as:

$$\mathcal{F}(d,w,r;\Theta)=\sum_{i,j,k=0}^{\mathcal{K}}\theta_{ijk}\,d^{i}w^{j}r^{k}, \qquad (5)$$

where $\Theta\in\mathbb{R}^{(\mathcal{K}+1)\times(\mathcal{K}+1)\times(\mathcal{K}+1)}$ is a tensor and all $\theta_{ijk}$ are its elements (for convenience of expression, we assume the same highest degree $\mathcal{K}$ for $d$, $w$, and $r$, though the conclusions in this section still hold when they have different degrees). Without any constraint on $\Theta$, the polynomial can be highly flexible and expressive. However, high flexibility also makes it easy to overfit (Bishop, 2006), especially when the training data (i.e., $\{(d,w,r,a)\}$) is scarce. To avoid overfitting and ensure an accurate regression with limited training data, a relatively simple MAP with less flexibility and expressiveness is needed. We achieve this by restricting the rank of $\Theta$ (we adopt the same definition of tensor rank as Bourbaki (2003)) during the regression process, i.e., $\Theta$ in the MAP is replaced by its low-rank substitute. Formally, for $\Theta$ of rank $\widetilde{\mathcal{R}}$, its rank-$\mathcal{R}$ substitute ($\mathcal{R}<\widetilde{\mathcal{R}}$) and its elements are defined as (Kolda & Bader, 2009):

$$\Theta \approx \sum_{q=1}^{\mathcal{R}}\vec{s}_{q}\otimes\vec{u}_{q}\otimes\vec{v}_{q}, \qquad \theta_{ijk}\approx\sum_{q=1}^{\mathcal{R}}s_{qi}u_{qj}v_{qk}, \qquad (6)$$

in which $\otimes$ represents the outer product, and $\vec{s}_{q},\vec{u}_{q},\vec{v}_{q}\in\mathbb{R}^{\mathcal{K}+1}$ are $(\mathcal{K}+1)$-dimensional vectors, e.g., $\vec{s}_{q}=[s_{q0},s_{q1},\cdots,s_{q\mathcal{K}}]^{\top}$. Then, replacing $\theta_{ijk}$ in Eq. (5) yields:

$$\begin{aligned}
\mathcal{F}(d,w,r;\Theta) &\approx \sum_{i,j,k=0}^{\mathcal{K}}\sum_{q=1}^{\mathcal{R}} s_{qi}u_{qj}v_{qk}\,d^{i}w^{j}r^{k} \\
&= \sum_{q=1}^{\mathcal{R}}\sum_{i,j,k=0}^{\mathcal{K}} (s_{qi}d^{i})(u_{qj}w^{j})(v_{qk}r^{k}) \\
&= \sum_{q=1}^{\mathcal{R}}\Bigl(\sum_{i=0}^{\mathcal{K}} s_{qi}d^{i}\Bigr)\Bigl(\sum_{j=0}^{\mathcal{K}} u_{qj}w^{j}\Bigr)\Bigl(\sum_{k=0}^{\mathcal{K}} v_{qk}r^{k}\Bigr) \\
&= \sum_{q=1}^{\mathcal{R}}\mathcal{H}(d;\vec{s}_{q})\,\mathcal{H}(w;\vec{u}_{q})\,\mathcal{H}(r;\vec{v}_{q}),
\end{aligned} \qquad (7)$$

in which $\mathcal{H}$ represents a univariate polynomial. In practice, we take Eq. (7) as our MAP and control its flexibility by adjusting $\mathcal{R}$: a smaller $\mathcal{R}$ indicates a simpler MAP. Empirically, we find that $\mathcal{R}=1$ is enough to achieve an accurate regression in most cases, which gives our MAP a highly succinct form. We also verify through experiments that $\mathcal{R}=1$ makes sense because it accords with the prior structure of the MAP (Section 4.3).
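As a concrete illustration, the following is a minimal sketch (not the authors' released code) of fitting the rank-1 specialized MAP of Eq. (7) with $\mathcal{K}=3$ by alternating least squares; the function names and the alternating scheme are our own choices.

    import numpy as np

    def vander(x, K=3):
        # Rows [1, x, x^2, ..., x^K] for each sample x.
        return np.vander(np.asarray(x, dtype=float), K + 1, increasing=True)

    def fit_rank1_map(d, w, r, a, K=3, iters=100):
        # Fit F(d, w, r) ~ H(d; s) * H(w; u) * H(r; v) to accuracies a.
        Vd, Vw, Vr = vander(d, K), vander(w, K), vander(r, K)
        a = np.asarray(a, dtype=float)
        s, u, v = np.ones(K + 1), np.ones(K + 1), np.ones(K + 1)
        for _ in range(iters):
            # With two coefficient vectors fixed, F is linear in the third,
            # so each update is an ordinary least-squares solve.
            s = np.linalg.lstsq(Vd * ((Vw @ u) * (Vr @ v))[:, None], a, rcond=None)[0]
            u = np.linalg.lstsq(Vw * ((Vd @ s) * (Vr @ v))[:, None], a, rcond=None)[0]
            v = np.linalg.lstsq(Vr * ((Vd @ s) * (Vw @ u))[:, None], a, rcond=None)[0]
        def predict(d_, w_, r_):
            return (vander([d_], K) @ s)[0] * (vander([w_], K) @ u)[0] * (vander([r_], K) @ v)[0]
        return s, u, v, predict

Each update solves for only $\mathcal{K}+1=4$ coefficients, which matches the limited-data setting described above.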

Algorithm 1  Iterative Pruning and Fine-tuning
  input: pre-trained model $M_{0}$, rounds $rds$, pruning setting $T$
  initialize: $train\_data=\{(d_{0},w_{0},r_{0},a_{0})\}$
  function PruneAlong($dimension$, $x_{0}$, $x_{min}$)
     for $n=1$ to $rds$ do
        $x_{n}=x_{n-1}-\frac{x_{0}-x_{min}}{rds}$
        prune $M_{n-1}$ along $dimension$ to $x_{n}$, yielding $M_{n}$
        fine-tune $M_{n}$, yielding $(d_{n},w_{n},r_{n},a_{n})$
        add $(d_{n},w_{n},r_{n},a_{n})$ to $train\_data$
     end for
  end function
  $d_{min}=Td_{0}$,  $w_{min}=\sqrt{T}w_{0}$,  $r_{min}=\sqrt{T}r_{0}$
  PruneAlong("depth", $d_{0}$, $d_{min}$)
  PruneAlong("width", $w_{0}$, $w_{min}$)
  PruneAlong("resolution", $r_{0}$, $r_{min}$)
  return $train\_data$

3.2.3 Fast Data Collection

To collect the data used for the MAP's regression, instead of training many models with different $(d,w,r)$ from scratch, we apply iterative pruning and fine-tuning to acquire the data.

Iterative Pruning and Fine-tuning:

As shown in Algorithm 1, the pre-trained model $M_{0}$ is pruned along the three dimensions independently. For each dimension, we iteratively apply pruning and fine-tuning on $M_{0}$ to generate many models, and the configurations $\{(d_{n},w_{n},r_{n},a_{n})\}$ of these models are collected for the MAP's regression. $d_{min}$ in Algorithm 1 indicates that if we reduce the model's depth to $d_{min}$, the computational cost constraint $T$ can be fulfilled without pruning the model's width and input resolution. It is easy to deduce that the optimal $d^{\star}\geq d_{min}$. Likewise, $w_{min}$ and $r_{min}$ in Algorithm 1 are the minimal possible values for $w$ and $r$, respectively.

Compared with training models from scratch, our data collection strategy enjoys two advantages: 1) a pruned pre-trained model converges much faster than one trained from scratch, thus taking much less time to obtain a new model; 2) besides the finally pruned model, iterative pruning yields several intermediate models along with their configurations $\{(d_{n},w_{n},r_{n},a_{n})\}$, which can also be used for the MAP's regression.
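The loop in Algorithm 1 can be sketched as follows; prune_along, finetune, and evaluate are hypothetical placeholders for the pruning, short fine-tuning, and validation routines, not a real API.

    import copy

    def collect_training_data(model0, T, rds=4, finetune_epochs=40):
        d0 = w0 = r0 = 1.0
        data = [(d0, w0, r0, evaluate(model0))]
        # Minimal values reachable by pruning a single dimension under C(d, w, r) = T.
        mins = {"depth": T * d0, "width": (T ** 0.5) * w0, "resolution": (T ** 0.5) * r0}
        for dim, x_min in mins.items():
            model = copy.deepcopy(model0)
            cfg = {"depth": d0, "width": w0, "resolution": r0}
            step = (cfg[dim] - x_min) / rds
            for _ in range(rds):
                cfg[dim] -= step                               # x_n = x_{n-1} - (x_0 - x_min) / rds
                model = prune_along(model, dim, cfg[dim])      # prune M_{n-1} along one dimension
                model = finetune(model, epochs=finetune_epochs)
                data.append((cfg["depth"], cfg["width"], cfg["resolution"], evaluate(model)))
        return data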

3.2.4 Optimizing the MAP

With the collected data, we fit the MAP by using a regression algorithm. Then, the optimal $(d^{\star},w^{\star},r^{\star})$ satisfies Eq. (8) according to Lagrange's multiplier theorem, where $\lambda$ is the Lagrange multiplier.

$$\left\{\begin{aligned} dw^{2}r^{2}-T\times d_{0}w_{0}^{2}r_{0}^{2}&=0\\ \sum_{q=1}^{\mathcal{R}}\mathcal{H}'(d;\vec{s}_{q})\mathcal{H}(w;\vec{u}_{q})\mathcal{H}(r;\vec{v}_{q})+\lambda w^{2}r^{2}&=0\\ \sum_{q=1}^{\mathcal{R}}\mathcal{H}(d;\vec{s}_{q})\mathcal{H}'(w;\vec{u}_{q})\mathcal{H}(r;\vec{v}_{q})+2\lambda dwr^{2}&=0\\ \sum_{q=1}^{\mathcal{R}}\mathcal{H}(d;\vec{s}_{q})\mathcal{H}(w;\vec{u}_{q})\mathcal{H}'(r;\vec{v}_{q})+2\lambda dw^{2}r&=0 \end{aligned}\right. \qquad (8)$$
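Alternatively, since the fitted MAP is cheap to evaluate, $(d^{\star},w^{\star},r^{\star})$ can also be obtained by maximizing it numerically under the cost constraint instead of solving the system (8) in closed form. Below is a sketch with SciPy, where predict is assumed to be the fitted MAP from the regression step.

    import numpy as np
    from scipy.optimize import minimize

    def optimal_dwr(predict, T, lower=0.05):
        # Maximize the MAP subject to d * w^2 * r^2 = T (Eq. (4)), with d, w, r in (0, 1].
        constraint = {"type": "eq", "fun": lambda x: x[0] * x[1] ** 2 * x[2] ** 2 - T}
        x0 = np.full(3, T ** 0.2)                        # feasible start: d = w = r = T^(1/5)
        res = minimize(lambda x: -predict(x[0], x[1], x[2]), x0,
                       bounds=[(lower, 1.0)] * 3, constraints=[constraint], method="SLSQP")
        return tuple(res.x)                              # (d*, w*, r*)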

3.3 Comprehensive Pruning and Fine-tuning

Leveraging the optimal $(d^{\star},w^{\star},r^{\star})$, filter-level and layer-level pruning are applied to prune the pre-trained model $M_{0}$ to the target $d^{\star}$ and $w^{\star}$, and then the model is fine-tuned with images of size $r^{\star}$. During the entire pruning process, layer-pruning first and filter-pruning first are both viable and yield the same pruned model. Without loss of generality, we describe the pruning process assuming layer-pruning first; the concrete steps are as follows:

Pruning Layers:

Following DBP (Wang et al., 2019b), we put a linear classifier after each layer of model $M_{0}$ and test its accuracy on the evaluation dataset. The accuracy of each linear classifier indicates the discrimination of its corresponding layer's features. Further, each layer's discrimination enhancement compared with its preceding layer is seen as the importance of the layer. With this importance metric, we pick out the least important $(1-d^{\star}/d_{0})\times 100\%$ layers and remove them from $M_{0}$, yielding $M_{p_{1}}$.
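A sketch of this layer-selection step is given below; probe_acc is assumed to hold the linear-classifier accuracies indexed by block, and special handling of blocks that must be kept (e.g., down-sampling blocks) is omitted.

    def layers_to_prune(probe_acc, d_star):
        # Importance of block l = discrimination gain over its preceding block.
        gains = {l: probe_acc[l] - probe_acc[l - 1] for l in range(1, len(probe_acc))}
        n_prune = round((1.0 - d_star) * len(probe_acc))
        # Remove the blocks whose linear probes gained the least over their predecessors.
        return sorted(gains, key=gains.get)[:n_prune]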

Pruning Filters:

Filter-level pruning is then performed on $M_{p_{1}}$. In particular, we use the scaling factors of BN layers as the importance metric, just like Slimming (Liu et al., 2017). However, unlike Slimming, which compares the importance of all filters globally, we only compare the importance of filters within the same layer, and the least important $(1-w^{\star}/w_{0})\times 100\%$ of filters in each layer are pruned. Through this modification, the pruned ratios of all layers are kept the same. We denote the model after filter pruning by $M_{p_{2}}$.
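The per-layer filter selection can be sketched as follows (PyTorch); building the actual slimmed network from the kept indices is omitted.

    import torch
    import torch.nn as nn

    def filters_to_keep(model, w_star):
        # For every BN layer, keep the w* fraction of filters with the largest |gamma|.
        keep = {}
        for name, module in model.named_modules():
            if isinstance(module, nn.BatchNorm2d):
                gamma = module.weight.detach().abs()
                k = max(1, int(round(w_star * gamma.numel())))
                keep[name] = torch.topk(gamma, k).indices.sort().values
        return keep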

Fine-tuning with Smaller Images:

After pruning, the pruned model $M_{p_{2}}$ is fine-tuned with images of size $r^{\star}$. The images are resized by bilinear down-sampling, which is the most common down-sampling scheme for images. The model is fine-tuned with a small learning rate until convergence, yielding the finally pruned model $M_{p}$.
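A minimal sketch of the corresponding input pipeline is shown below (torchvision); the exact augmentations follow the training settings of Section 4.1 rather than this snippet.

    import torchvision.transforms as T

    def train_transform(base_resolution, r_star):
        size = int(round(r_star * base_resolution))        # R(M_p) = r* x R(M_0)
        return T.Compose([
            T.Resize(size, interpolation=T.InterpolationMode.BILINEAR),
            T.RandomHorizontalFlip(),
            T.ToTensor(),
        ])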

4 Experiments

4.1 Experimental Settings

Datasets:

We take three popular datasets as testbeds for our algorithm: CIFAR-10 (Krizhevsky et al., 2009), TinyImageNet (Wu et al., 2017), and ImageNet (Russakovsky et al., 2015). These three datasets differ in image resolution (32×32 to 224×224), number of classes (10 to 1,000), and dataset scale (50K to 1,000K images). For all datasets, images are augmented by symmetric padding, random cropping, and random horizontal flipping, which are common augmentation methods for these datasets (He et al., 2016; Howard et al., 2017; Wang et al., 2019a).

Architectures:

We test our algorithm on three popular network architectures: ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and EfficientNet (Tan & Le, 2019). Their basic blocks vary from residual blocks to densely connected blocks and NAS-searched blocks, representing three of the most popular designs for deep CNNs.

Evaluation Protocol:

Following the conventions of previous work (Li et al., 2020; Lin et al., 2020; Ye et al., 2020a), we take the accuracy, the parameters-reduction ratio ($Prr$), and the FLOPs-reduction ratio ($Frr$) as the evaluation protocol of our model acceleration algorithm. $Prr$ and $Frr$ are defined in Eq. (9), where $M_{0}$ and $M_{p}$ represent the base model and the pruned model, respectively.

$$Prr=1-\frac{\textit{Params}(M_{p})}{\textit{Params}(M_{0})}, \qquad Frr=1-\frac{\textit{FLOPs}(M_{p})}{\textit{FLOPs}(M_{0})}. \qquad (9)$$
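In code form, Eq. (9) amounts to the following (parameter and FLOP counts are assumed to be measured elsewhere):

    def reduction_ratios(params_base, params_pruned, flops_base, flops_pruned):
        prr = 1.0 - params_pruned / params_base    # parameters-reduction ratio
        frr = 1.0 - flops_pruned / flops_base      # FLOPs-reduction ratio
        return prr, frr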
Compared Algorithms:

The compared algorithms fall into three categories: (1) algorithms that prune the model along a single dimension (i.e., depth, width, or resolution), including $\mathcal{R}$-only (Howard et al., 2017), $\mathcal{W}$-only (Liu et al., 2017), FPGM (He et al., 2019), DBP (Wang et al., 2019b), PScratch (Wang et al., 2020), DHP (Li et al., 2020), and HRank (Lin et al., 2020); (2) algorithms that prune along multiple dimensions, such as GAL (Lin et al., 2019); (3) NAS-based algorithms, including EfficientNet (Tan & Le, 2019) and TinyNet (Han et al., 2020), which balance the size of the three dimensions from the NAS perspective.

Table 1: Pruning results on CIFAR-10 and TinyImageNet. $\mathcal{D}$, $\mathcal{W}$, and $\mathcal{R}$ indicate whether the model is pruned along the depth, width, and resolution dimension, respectively. "Acc. Drop" means the accuracy loss induced by pruning (smaller is better). Results with \dagger are drawn from the original papers, and the others are obtained by running the published code with slight modifications. Our algorithm achieves smaller accuracy losses than the others with similar $Prr$ and $Frr$.
Dataset  Architecture  Algorithm  $\mathcal{D}$  $\mathcal{W}$  $\mathcal{R}$  Baseline  Accuracy  Acc. Drop  $Prr$  $Frr$
CIFAR-10 ResNet-32 $\mathcal{R}$-only (Howard et al., 2017) 93.18% 90.19% 2.99% - 0.52
$\mathcal{W}$-only (Liu et al., 2017) 93.18% 92.16%  1.02% 0.47 0.47
$\mathcal{D}$-only DBP (Wang et al., 2019b) 93.18% 92.65%  0.53% 0.28 0.48
GAL (Lin et al., 2019) 93.18% 91.72%  1.46% 0.39 0.50
FPGM (He et al., 2019) 92.63% 92.31%  0.32% - 0.42
PScratch (Wang et al., 2020) 93.18% 92.18%  1.00% - 0.50
Ours 93.18% 93.27% -0.09% 0.38 0.49
ResNet-56 $\mathcal{R}$-only (Howard et al., 2017) 93.69% 92.00%  1.69% - 0.51
$\mathcal{W}$-only (Liu et al., 2017) 93.69% 92.97%  0.72% 0.50 0.50
$\mathcal{D}$-only DBP (Wang et al., 2019b) 93.69% 93.27%  0.42% 0.40 0.52
GAL (Lin et al., 2019) 93.26% 93.38% -0.12% 0.12 0.38
FPGM (He et al., 2019) 93.59% 93.26%  0.33% - 0.52
PScratch (Wang et al., 2020) 93.23% 93.05%  0.18% - 0.50
HRank (Lin et al., 2020) 93.26% 93.17%  0.09% 0.42 0.50
DHP (Li et al., 2020) 93.65% 93.58%  0.07% 0.42 0.49
Ours 93.69% 93.76% -0.07% 0.40 0.50
DenseNet-40 $\mathcal{R}$-only (Howard et al., 2017) 94.59% 92.88%  1.71% - 0.53
$\mathcal{W}$-only (Liu et al., 2017) 94.59% 94.26%  0.33% 0.65 0.65
$\mathcal{D}$-only DBP (Wang et al., 2019b) 94.59% 94.02%  0.57% 0.60 0.46
GAL (Lin et al., 2019) 94.81% 94.50%  0.31% 0.57 0.55
HRank (Lin et al., 2020) 94.81% 93.68%  1.13% 0.54 0.61
DHP (Li et al., 2020) 94.74% 93.94%  0.80% 0.36 0.62
Ours 94.59% 94.54%  0.05% 0.66 0.66
TinyImageNet ResNet-56 $\mathcal{R}$-only (Howard et al., 2017) 56.55% 54.64%  1.91% - 0.49
$\mathcal{W}$-only (Liu et al., 2017) 56.55% 52.45%  4.10% 0.54 0.53
$\mathcal{D}$-only DBP (Wang et al., 2019b) 56.55% 55.57%  0.98% 0.25 0.53
GAL (Lin et al., 2019) 56.55% 55.87%  0.68% 0.32 0.52
DHP (Li et al., 2020) 56.55% 55.82%  0.73% 0.46 0.55
Ours 56.55% 56.04%  0.51% 0.34 0.59
ResNet-101 $\mathcal{R}$-only (Howard et al., 2017) 64.83% 55.48%  9.35% - 0.75
$\mathcal{W}$-only (Liu et al., 2017) 64.83% 63.47%  1.36% 0.75 0.75
$\mathcal{D}$-only DBP (Wang et al., 2019b) 64.83% 61.35%  3.48% 0.76 0.77
GAL (Lin et al., 2019) 64.83% 64.33%  0.50% 0.45 0.76
DHP (Li et al., 2020) 64.83% 64.82%  0.01% 0.50 0.75
Ours 64.83% 65.27% -0.44% 0.51 0.75
DenseNet-100 $\mathcal{R}$-only (Howard et al., 2017) 61.34% 56.97%  4.37% - 0.75
$\mathcal{W}$-only (Liu et al., 2017) 61.34% 59.56%  1.78% 0.75 0.75
$\mathcal{D}$-only DBP (Wang et al., 2019b) 61.34% 58.44%  2.90% 0.65 0.78
GAL (Lin et al., 2019) 61.34% 59.03%  2.31% 0.78 0.70
DHP (Li et al., 2020) 61.34% 59.40%  1.94% 0.73 0.73
Ours 61.34% 60.22%  1.12% 0.73 0.75
Training Settings:

For base models trained on CIFAR-10, we set batch size to 64 for DenseNet and 128 for ResNet, respectively. Weight decay is set to $10^{-4}$. The models are trained for 160 epochs with the learning rate starting from 0.1 and divided by 10 at epochs 80 and 120. These are all the most common training settings (He et al., 2016; Howard et al., 2017; Wang et al., 2019a) for models trained on CIFAR-10. For ResNet and DenseNet trained on TinyImageNet and ImageNet, batch size is set to 256, and weight decay is $10^{-4}$. Models are trained for 100 epochs. The learning rate is set to 0.1 at the beginning and is multiplied by 0.1 at epochs 30, 60, and 90. For EfficientNet, we apply the same training policy as (Han et al., 2020), which is also the most common for EfficientNet implemented with PyTorch (Paszke et al., 2017).

Regression and Pruning Settings:

The MAP's hyper-parameters are set to $\mathcal{R}=1$ and $\mathcal{K}=3$ in our pruning experiments. When collecting training data (i.e., $\{(d_{n},w_{n},r_{n},a_{n})\}_{n=1}^{N}$) for the polynomial regression, the model is pruned along each dimension four times (i.e., $rds=4$ in Algorithm 1). ResNet and DenseNet trained on CIFAR-10 are fine-tuned for 40 epochs at each round, and for 80 epochs after comprehensive pruning. Therefore, the data collection process consumes as much time as training 3 models (training one model from scratch costs 160 epochs). Similarly, models trained on TinyImageNet and ImageNet are fine-tuned for 30 epochs at each round of the iterative pruning process; thus, data collection takes about the same time as training 3.6 models. The finally pruned models trained on TinyImageNet and ImageNet are fine-tuned for 60 epochs after comprehensive pruning.

4.2 Results and Analyses

Results on CIFAR-10 and TinyImageNet:

The experimental results on CIFAR-10 and TinyImageNet are shown in Table 1. As we can see, $\mathcal{W}$-only induces a greater loss than $\mathcal{D}$-only for ResNet-32 and ResNet-56, while for ResNet-101 the situation is the opposite. In other words, the relative importance of different dimensions depends on the original sizes of depth, width, and resolution, and we cannot deduce it from a simple prior, which further shows the necessity of our algorithm. We balance the sizes of the three dimensions dynamically and always achieve better results than pruning only one or two dimensions. The most competitive opponent of our algorithm is DHP, which achieves accuracy and $Frr$ similar to ours for ResNet-56 trained on both datasets. However, we show higher accuracy than DHP for DenseNet-40 on CIFAR-10 (94.54% vs. 93.94%), for ResNet-101 on TinyImageNet (65.27% vs. 64.82%), and for DenseNet-100 on TinyImageNet (60.22% vs. 59.40%) with similar $Prr$ and $Frr$, which demonstrates the robustness of our algorithm across different architectures and datasets.

Results on ImageNet:

Experiments on ImageNet are conducted with ResNet-50 and DenseNet-121. From Table 2, we can see that our algorithm achieves 0.45% higher accuracy on ResNet-50 than the state-of-the-art algorithms (DHP and PScratch) with the same $Frr$. The improvement on DenseNet-121 is marginal compared with $\mathcal{W}$-only, because our algorithm also prunes DenseNet-121 mainly along the width dimension: DenseNet-121's width is large and has much redundancy, whereas its input images do not need to be pruned. With a comprehensive consideration, our algorithm also deems that we should mainly prune filters of DenseNet-121 for acceleration, and therefore it produces pruning results similar to those of filter-level pruning. However, these results do not imply that our algorithm is powerless. On the contrary, a pruning policy based on a comprehensive consideration is always at least as good as an arbitrary one, even though the two may sometimes produce similar results.

Comparison with NAS:

Algorithms that balance the three dimensions (i.e., depth, width, and resolution) in a NAS manner are also compared, and the results are shown in Table 3. GPU-days is the most common metric for evaluating the search cost of NAS algorithms; it indicates the number of natural days spent if running with only one GPU. Both EfficientNet (Tan & Le, 2019) and TinyNet (Han et al., 2020) employ substantial resources in searching for the optimal $(d^{\star},w^{\star},r^{\star})$, and we do not have enough GPUs to reproduce their search processes. Thus, the results of EfficientNet and TinyNet are drawn from (Han et al., 2020), and their search costs are estimated from the number of models they trained. For example, training an EfficientNet for 300 epochs takes about 26 hours with 8 V100 GPUs, and TinyNet requires training 60 EfficientNet models from scratch; hence, its search cost is about 520 GPU-days. In contrast, our algorithm spends only about 1/25 as much time as TinyNet on searching but achieves similar accuracy.

Table 2: Pruning Results on ImageNet. The improvement on DenseNet-121 is marginal because of DenseNet-121’s property. Detailed reasons are described in Section 4.2.
Algorithm  $\mathcal{D}$  $\mathcal{W}$  $\mathcal{R}$  Accuracy  $Prr$  $Frr$
ResNet-50 (76.15%)
$\mathcal{R}$-only (Howard et al., 2017) 71.56% - 0.50
$\mathcal{W}$-only (Liu et al., 2017) 74.52% 0.50 0.50
DBP (Wang et al., 2019b) 73.92% 0.56 0.50
GAL (Lin et al., 2019) 71.95% 0.17 0.43
FPGM (He et al., 2019) 74.83% - 0.54
PScratch (Wang et al., 2020) 75.45% 0.64 0.50
HRank (Lin et al., 2020) 74.98% 0.37 0.44
DHP (Li et al., 2020) 75.45% 0.54 0.50
Ours 75.90% 0.53 0.50
DenseNet-121 (75.01%)
$\mathcal{R}$-only (Howard et al., 2017) 73.07% - 0.51
$\mathcal{W}$-only (Liu et al., 2017) 73.58% 0.51 0.51
DBP (Wang et al., 2019b) 68.08% 0.66 0.37
Ours 73.68% 0.48 0.51
Table 3: Comparison with NAS-based model acceleration algorithms. They all take EfficientNet-B0 as the baseline model. GPU-days is measured with NVIDIA V100. Note that training an EfficientNet from scratch costs about 8.7 GPU days.
Algorithm Params FLOPs Top-1/Top-5 Acc. Search Cost
(GPU days)
EfficientNet-B0 5.3M 387M 76.7%/93.2% -
TinyNet-A 5.1M 339M 76.8%/93.3% ~520
Ours 5.1M 314M 76.8%/93.3% 26
EfficientNet-B$^{-1}$ 3.6M 201M 74.7%/92.1% -
TinyNet-B 3.7M 202M 75.0%/92.2% ~520
Ours 3.6M 198M 75.2%/92.7% 24
EfficientNet-B$^{-2}$ 3.0M 98M 70.5%/89.5% -
TinyNet-C 2.5M 100M 71.2%/89.7% ~520
Ours 3.1M 98M 71.6%/89.9% 21

4.3 Ablation Study

Rank of $\Theta$:

In Section 3.2.2, our proposed MAP is

$$\mathcal{F}(d,w,r;\Theta)=\sum_{q=1}^{\mathcal{R}}\mathcal{H}(d;\vec{s}_{q})\,\mathcal{H}(w;\vec{u}_{q})\,\mathcal{H}(r;\vec{v}_{q}), \qquad (10)$$

where the rank of $\Theta$ is less than or equal to $\mathcal{R}$. Experimentally, we find that $\mathcal{R}=1$ works well in most cases. To further explore this interesting phenomenon, ResNets with different $(d,w,r)$ are trained on CIFAR-10. The base model (i.e., $(d,w,r)=(1.0,1.0,1.0)$) is ResNet-32 with images of size 32, and the results are plotted in Figure 3. Observations from the first three sub-figures are given in their titles. From these observations (more results and the proof are provided in the supplementary material), we can deduce:

$$\mathcal{F}(d,w,r;\Theta)\approx\mathcal{H}(d;\vec{s}_{q})\,\mathcal{H}(w;\vec{u}_{q})\,\mathcal{H}(r;\vec{v}_{q}), \qquad (11)$$

which coincides with Eq. (10) when $\mathcal{R}=1$. In other words, the three variables $(d,w,r)$ in the MAP can be approximately separated from each other. We also test the MAP with different $\mathcal{R}$, as shown in the fourth sub-figure of Figure 3. MAPs with larger $\mathcal{R}$ yield $(d^{\star},w^{\star},r^{\star})$ similar to that of $\mathcal{R}=1$, which also indicates that $\mathcal{R}=1$ is enough for obtaining a well-performing MAP.

Other Methods of Avoiding Overfitting:

Besides restricting the rank of $\Theta$, we also try two other methods of avoiding overfitting, i.e., decreasing the degree $\mathcal{K}$ of the polynomial and applying regression with regularization terms. The results are reported in Table 4. Specifically, a set of 13 items $(d_{n},w_{n},r_{n},a_{n})$ is used to fit the MAP, and a set of another 80 items is used for evaluation. All data is collected with ResNet-56 trained on CIFAR-10, and both training error and evaluation error are reported. As we can see, a normal polynomial regression suffers from severe overfitting and a high evaluation error, and $\ell_{2}$-regularization has only a limited effect on the overfitting issue. Lowering the degree of the polynomial is not a wise choice either, because it makes the polynomial unable to fit even the training data. Instead, our specialized MAP shows a lower error rate on both the training data (0.08%) and the evaluation data (0.25%).

Influence of Polynomials’ Degree:

Figure 4 shows the pruning results when adjusting the MAP's degree $\mathcal{K}$. In particular, the polynomial regression degrades to linear regression when $\mathcal{K}=1$. It turns out that for polynomials with $\mathcal{K}\leq 2$, the predicted optimal $(d^{\star},w^{\star},r^{\star})$ actually leads to a sub-optimal pruning policy, which indicates that the MAP is too simple to be useful. For polynomials with $\mathcal{K}\geq 3$, all MAPs generate similar predictions of the optimal $(d,w,r)$, i.e., $(0.78,0.82,0.98)$ for ResNet-32 trained on CIFAR-10 and $(0.65,1.0,0.63)$ for ResNet-56 trained on TinyImageNet. These results corroborate that our algorithm is relatively robust with respect to the degree $\mathcal{K}$, so practitioners do not need to choose the polynomial degree carefully.

Figure 3: Accuracies of ResNets with different $(d,w,r)$ trained on CIFAR-10 (the first three sub-figures) and the predicted optimal $(d^{\star},w^{\star},r^{\star})$ with different $\mathcal{R}$ in Eq. (10).
Table 4: The MAP's regression results with lower degrees or regularization. $\mathcal{K}$ is the highest degree for each variable in $(d,w,r)$. $\ell_{2}$ is the coefficient of the regularization term, and 0 indicates no regularization.
Type  $\mathcal{K}$  $\ell_{2}$  Train Err.  Eval Err.
Normal Polynomial  1  0  2.66%  2.97%
  2  0  1.62%  2.25%
  3  0  0.28%  1.28%
  5  0  0.02%  2.31%
  5  $10^{-3}$  0.02%  2.28%
  10  0  0.01%  2.58%
  10  $10^{-3}$  0.02%  2.32%
Ours  3  0  0.14%  0.33%
  5  0  0.08%  0.25%
Figure 4: The predicted optimal $(d^{\star},w^{\star},r^{\star})$ and the corresponding pruning results with different $\mathcal{K}$. For all $\mathcal{K}\geq 3$, the MAP's predicted optimal $(d^{\star},w^{\star},r^{\star})$ are very similar, so users do not need to choose $\mathcal{K}$ carefully.
Figure 5: Visualization of last layer’s feature maps from different models. The baseline model is a pre-trained ResNet-56. Four different pruning policies are tested, and our pruned model’s feature maps look most like those of the baseline model.

4.4 Case Study

Visualization of Feature Maps for Different Pruning Policies:

In order to further understand why pruning the three dimensions simultaneously yields better results than pruning only one, Figure 5 compares the feature maps for models with different pruning policies. Specifically, all models are pruned from the same baseline model — ResNet-56 pre-trained on CIFAR-10. Input images are randomly chosen from the Internet. For visualization, we extract the last convolutional layer’s feature maps and compute their mean absolute values across channels. Figure 5 shows that the feature maps our pruned model outputs look most similar to those of the original model. This finding reveals that our algorithm preserves most information of the original model by pruning the three dimensions comprehensively.
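A sketch of how such maps can be extracted is shown below (PyTorch forward hook); last_conv is assumed to be a handle to the model's last convolutional layer.

    import torch

    def channel_mean_abs(model, image, last_conv):
        feats = {}
        handle = last_conv.register_forward_hook(lambda m, inp, out: feats.update(out=out))
        with torch.no_grad():
            model(image.unsqueeze(0))              # add a batch dimension to the CHW image tensor
        handle.remove()
        return feats["out"].abs().mean(dim=1)[0]   # (H, W) map: mean |activation| over channels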

5 Conclusion

In this paper, we proposed a novel pruning framework that prunes a pre-trained model along three dimensions, i.e., depth, width, and resolution, comprehensively. Remarkably, our framework can determine the optimal values of these three dimensions by casting the relationship between a model's accuracy and its depth/width/resolution into a polynomial regression and subsequently solving an optimization problem. Extensive experimental results demonstrate that the proposed pruning algorithm outperforms state-of-the-art pruning algorithms under comparable computational budgets. Compared with NAS-based methods, our pruned models are superior to the NAS-searched models while requiring a much lower search cost.

Acknowledgements

This work was supported in part by The National Key Research and Development Program of China (Grant No. 2018AAA0101400), in part by The National Nature Science Foundation of China (Grant Nos: 62036009, U1909203, 61936006), and in part by Innovation Capability Support Program of Shaanxi (Program No. 2021TD-05).

References

  • Ba & Caruana (2014) Ba, J. and Caruana, R. Do deep nets really need to be deep? In NeurIPS, 2014.
  • Bishop (2006) Bishop, C. M. Pattern Recognition and Machine Learning. Springer, 2006.
  • Bourbaki (2003) Bourbaki, N. Elements of mathematics: Algebra. Springer, 2003.
  • Frankle & Carbin (2019) Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR, 2019.
  • Gao et al. (2020) Gao, Y., Bai, H., Jie, Z., Ma, J., Jia, K., and Liu, W. MTL-NAS: task-agnostic neural architecture search towards general-purpose multi-task learning. In CVPR, pp.  11540–11549. IEEE, 2020.
  • Han et al. (2020) Han, K., Wang, Y., Zhang, Q., Zhang, W., Xu, C., and Zhang, T. Model rubik’s cube: Twisting resolution, depth and width for tinynets. NeurIPS, 2020.
  • Han et al. (2016) Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. In ICLR, 2016.
  • He et al. (2020) He, C., Ye, H., Shen, L., and Zhang, T. Milenas: Efficient neural architecture search via mixed-level reformulation. In CVPR, pp.  11990–11999. IEEE, 2020.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
  • He et al. (2018) He, Y., Kang, G., Dong, X., Fu, Y., and Yang, Y. Soft filter pruning for accelerating deep convolutional neural networks. In IJCAI, 2018.
  • He et al. (2019) He, Y., Liu, P., Wang, Z., Hu, Z., and Yang, Y. Filter pruning via geometric median for deep convolutional neural networks acceleration. In CVPR, 2019.
  • Hinton et al. (2015) Hinton, G. E., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
  • Howard et al. (2019) Howard, A., Pang, R., Adam, H., Le, Q. V., Sandler, M., Chen, B., Wang, W., Chen, L., Tan, M., Chu, G., Vasudevan, V., and Zhu, Y. Searching for mobilenetv3. In ICCV, pp.  1314–1324. IEEE, 2019.
  • Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • Huang et al. (2017) Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017.
  • Huang et al. (2019) Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M. X., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), NeurIPS, 2019.
  • Kang & Han (2020) Kang, M. and Han, B. Operation-aware soft channel pruning using differentiable masks. In ICML, 2020.
  • Kolda & Bader (2009) Kolda, T. G. and Bader, B. W. Tensor decompositions and applications. Society for Industrial and Applied Mathematics Review, 2009.
  • Krizhevsky et al. (2009) Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • Lee et al. (2019) Lee, N., Ajanthan, T., and Torr, P. H. S. Snip: single-shot network pruning based on connection sensitivity. In ICLR, 2019.
  • Li et al. (2017) Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. Pruning filters for efficient convnets. In ICLR, 2017.
  • Li et al. (2020) Li, Y., Gu, S., Zhang, K., Gool, L. V., and Timofte, R. DHP: differentiable meta pruning via hypernetworks. In ECCV, 2020.
  • Lin et al. (2020) Lin, M., Ji, R., Wang, Y., Zhang, Y., Zhang, B., Tian, Y., and Shao, L. Hrank: Filter pruning using high-rank feature map. In CVPR, 2020.
  • Lin et al. (2019) Lin, S., Ji, R., Yan, C., Zhang, B., Cao, L., Ye, Q., Huang, F., and Doermann, D. S. Towards optimal structured CNN pruning via generative adversarial learning. In CVPR, 2019.
  • Liu et al. (2019) Liu, H., Simonyan, K., and Yang, Y. DARTS: differentiable architecture search. In ICLR. OpenReview.net, 2019.
  • Liu et al. (2017) Liu, Z., Li, J., Shen, Z., Huang, G., Yan, S., and Zhang, C. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
  • Luo et al. (2019) Luo, J., Zhang, H., Zhou, H., Xie, C., Wu, J., and Lin, W. Thinet: Pruning CNN filters for a thinner net. TPAMI, 2019.
  • Molchanov et al. (2017) Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
  • Pham et al. (2018) Pham, H., Guan, M. Y., Zoph, B., Le, Q. V., and Dean, J. Efficient neural architecture search via parameter sharing. In Dy, J. G. and Krause, A. (eds.), ICML, 2018.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M. S., Berg, A. C., and Li, F. Imagenet large scale visual recognition challenge. IJCV, 2015.
  • Sehwag et al. (2020) Sehwag, V., Wang, S., Mittal, P., and Jana, S. HYDRA: pruning adversarially robust neural networks. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), NeurIPS, 2020.
  • Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Bengio, Y. and LeCun, Y. (eds.), ICLR, 2015.
  • Tan & Le (2019) Tan, M. and Le, Q. V. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
  • Tian et al. (2020) Tian, Y., Shen, L., Shen, L., Su, G., Li, Z., and Liu, W. Alphagan: Fully differentiable architecture search for generative adversarial networks. CoRR, abs/2006.09134, 2020.
  • Wang et al. (2019a) Wang, W., Fu, C., Guo, J., Cai, D., and He, X. COP: customized deep model compression via regularized correlation-based filter-level pruning. In IJCAI, 2019a.
  • Wang et al. (2019b) Wang, W., Zhao, S., Chen, M., Hu, J., Cai, D., and Liu, H. DBP: discrimination based block-level pruning for deep model acceleration. CoRR, abs/1912.10178, 2019b.
  • Wang et al. (2020) Wang, Y., Zhang, X., Xie, L., Zhou, J., Su, H., Zhang, B., and Hu, X. Pruning from scratch. In AAAI, 2020.
  • Wen et al. (2016) Wen, W., Wu, C., Wang, Y., Chen, Y., and Li, H. Learning structured sparsity in deep neural networks. In NIPS, 2016.
  • Wu et al. (2017) Wu, J., Zhang, Q., and Xu, G. Tiny imagenet visual recognition challenge. Technical report, 2017.
  • Ye et al. (2020a) Ye, M., Gong, C., Nie, L., Zhou, D., Klivans, A., and Liu, Q. Good subnetworks provably exist: Pruning via greedy forward selection. In ICML, 2020a.
  • Ye et al. (2020b) Ye, X., Dai, P., Luo, J., Guo, X., Qi, Y., Yang, J., and Chen, Y. Accelerating CNN training by pruning activation gradients. In Vedaldi, A., Bischof, H., Brox, T., and Frahm, J. (eds.), ECCV, 2020b.
  • Zoph & Le (2017) Zoph, B. and Le, Q. V. Neural architecture search with reinforcement learning. In ICLR, 2017.
Table 5: Accuracies (%) of ResNets and DenseNets on CIFAR-10 with different depths ($d$), widths ($w$), and resolutions ($r$). The base models are ResNet-32 and DenseNet-40.
(a) $\frac{\mathcal{F}(d_{2},w_{1},r_{1})}{\mathcal{F}(d_{2},w_{2},r_{1})}\approx\frac{\mathcal{F}(d_{2},w_{1},r_{2})}{\mathcal{F}(d_{2},w_{2},r_{2})}$
ResNet ($d=1.0$)
$w$ \ $r$  1.00  0.87  0.75  0.62  0.50
0.60 90.59 89.43 88.51 86.59 83.19
0.70 91.39 90.53 89.26 87.38 84.62
0.80 92.19 90.95 89.88 88.38 85.15
0.90 92.54 91.55 90.77 88.60 85.93
1.00 92.84 91.87 91.10 89.38 86.56
DenseNet ($d=1.0$)
$w$ \ $r$  1.00  0.87  0.75  0.62  0.50
0.60 90.82 90.48 89.62 88.04 85.63
0.70 91.54 90.98 90.19 88.79 86.75
0.80 91.99 91.57 90.59 90.01 87.67
0.90 92.74 92.08 91.19 90.56 88.12
1.00 93.09 92.25 92.05 90.68 88.38
(b) $\frac{\mathcal{F}(d_{1},w_{1},r_{1})}{\mathcal{F}(d_{2},w_{1},r_{1})}\approx\frac{\mathcal{F}(d_{1},w_{1},r_{2})}{\mathcal{F}(d_{2},w_{1},r_{2})}$
ResNet ($w=1.0$)
$d$ \ $r$  1.00  0.87  0.75  0.62  0.50
0.11 86.88 85.91 85.15 83.64 81.68
0.33 92.30 91.12 90.08 88.79 85.87
0.55 92.84 91.87 91.10 89.38 86.56
0.77 93.43 92.40 91.77 89.87 87.17
1.00 93.63 92.63 91.79 90.14 87.48
DenseNet ($w=1.0$)
$d$ \ $r$  1.00  0.87  0.75  0.62  0.50
0.20 88.16 87.64 86.77 85.50 83.83
0.40 92.00 91.22 90.32 89.15 86.89
0.60 93.03 92.06 91.85 90.63 88.42
0.80 93.80 93.21 92.78 91.74 89.25
1.00 94.53 93.75 93.69 92.21 89.84
(c) $\frac{\mathcal{F}(d_{1},w_{1},r_{2})}{\mathcal{F}(d_{2},w_{1},r_{2})}\approx\frac{\mathcal{F}(d_{1},w_{2},r_{2})}{\mathcal{F}(d_{2},w_{2},r_{2})}$
ResNet ($r=1.0$)
$w$ \ $d$  0.11  0.33  0.55  0.77  1.00
0.60 83.36 89.39 90.91 91.41 92.01
0.70 84.03 90.17 91.43 91.85 92.36
0.80 85.45 91.21 92.16 92.65 92.7
0.90 86.14 91.74 92.60 92.56 93.12
1.00 86.22 92.12 92.88 93.17 93.64
DenseNet ($r=1.0$)
$w$ \ $d$  0.20  0.40  0.60  0.80  1.00
0.60 84.51 88.82 89.95 91.40 92.70
0.70 85.43 89.26 90.51 92.35 92.83
0.80 86.92 90.75 91.07 93.01 93.66
0.90 87.88 91.74 91.46 93.14 93.86
1.00 88.16 92.00 93.03 93.80 94.53

Appendix A Why the Rank of $\Theta$ Is 1

A.1 Experiments

In our original paper, we propose a model accuracy predictor (MAP):

$$\mathcal{F}(d,w,r;\Theta)=\sum_{q=1}^{\mathcal{R}}\mathcal{H}(d;\vec{s}_{q})\,\mathcal{H}(w;\vec{u}_{q})\,\mathcal{H}(r;\vec{v}_{q}), \qquad (12)$$

where $\mathcal{H}$ represents a univariate polynomial, and $\mathcal{R}$ means the rank of $\Theta$. The pruning experiments shown in our original paper are done with $\mathcal{R}=1$ because we find that $\mathcal{R}=1$ is enough for an accurate approximation of the MAP. In this case, Eq. (12) becomes:

$$\mathcal{F}(d,w,r;\Theta)=\mathcal{H}(d;\vec{s})\,\mathcal{H}(w;\vec{u})\,\mathcal{H}(r;\vec{v}). \qquad (13)$$

We assume that $\mathcal{R}=1$ works well because it accords with the real distribution of $\mathcal{F}(d,w,r)$. To further verify this assumption, we design experiments to show the relations between a model's accuracy and $(d,w,r)$. Specifically, we train many ResNets and DenseNets with different $(d,w,r)$, and the results are given in Table 5 (part of them have also been reported in the original paper). The observations are shown in the sub-table subtitles, from which we can draw the same conclusion as Eq. (13).
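For instance, from the ResNet part of Table 5(a) ($d=1.0$): $\mathcal{F}(1.0,0.8,1.00)/\mathcal{F}(1.0,1.0,1.00)=92.19/92.84\approx 0.993$, while $\mathcal{F}(1.0,0.8,0.75)/\mathcal{F}(1.0,1.0,0.75)=89.88/91.10\approx 0.987$; the ratio is nearly independent of $r$, which is exactly the pattern stated in subtitle (a).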

A.2 Proof

Omitting all the subscripts of $d_{1}$, $w_{1}$, and $r_{1}$ in the equations of Table 5, we have:

$$\frac{\mathcal{F}(d_{2},w,r)}{\mathcal{F}(d_{2},w_{2},r)} \approx \frac{\mathcal{F}(d_{2},w,r_{2})}{\mathcal{F}(d_{2},w_{2},r_{2})} \;\Rightarrow\; \mathcal{F}(d_{2},w,r) \approx \frac{\mathcal{F}(d_{2},w_{2},r)\,\mathcal{F}(d_{2},w,r_{2})}{\mathcal{F}(d_{2},w_{2},r_{2})}, \qquad (14)$$
$$\frac{\mathcal{F}(d,w,r)}{\mathcal{F}(d_{2},w,r)} \approx \frac{\mathcal{F}(d,w,r_{2})}{\mathcal{F}(d_{2},w,r_{2})} \;\Rightarrow\; \mathcal{F}(d,w,r) \approx \frac{\mathcal{F}(d_{2},w,r)\,\mathcal{F}(d,w,r_{2})}{\mathcal{F}(d_{2},w,r_{2})}, \qquad (15)$$
$$\frac{\mathcal{F}(d,w,r_{2})}{\mathcal{F}(d_{2},w,r_{2})} = \frac{\mathcal{F}(d,w_{2},r_{2})}{\mathcal{F}(d_{2},w_{2},r_{2})} \;\Rightarrow\; \mathcal{F}(d,w,r_{2}) = \frac{\mathcal{F}(d_{2},w,r_{2})\,\mathcal{F}(d,w_{2},r_{2})}{\mathcal{F}(d_{2},w_{2},r_{2})}. \qquad (16)$$

Substituting Eq. (14) and Eq. (16) into Eq. (15), we have:

$$\mathcal{F}(d,w,r) \approx \frac{\mathcal{F}(d,w_{2},r_{2})\,\mathcal{F}(d_{2},w,r_{2})\,\mathcal{F}(d_{2},w_{2},r)}{\mathcal{F}(d_{2},w_{2},r_{2})^{2}} = \frac{\mathcal{H}(d;\vec{s})\,\mathcal{H}(w;\vec{u})\,\mathcal{H}(r;\vec{v})}{\mathcal{F}(d_{2},w_{2},r_{2})^{2}}, \qquad (17)$$

where $\frac{1}{\mathcal{F}(d_{2},w_{2},r_{2})^{2}}$ is a constant and can be merged into any $\mathcal{H}$. Thus, Eq. (17) can be re-formulated as:

$$\mathcal{F}(d,w,r)\approx\mathcal{H}(d;\vec{s})\,\mathcal{H}(w;\vec{u})\,\mathcal{H}(r;\vec{v}), \qquad (18)$$

which complies with Eq. (13).