
ACP: Automatic Channel Pruning via Clustering and Swarm Intelligence Optimization for CNN

Jingfei Chang, Yang Lu, Ping Xue, Yiqun Xu, and Zhen Wei. Corresponding author: Yang Lu. J. Chang, Y. Lu, P. Xue, Y. Xu and Z. Wei are with the School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230009, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]). Y. Lu and Z. Wei are also with the Engineering Research Center of Safety Critical Industrial Measurement and Control Technology, Ministry of Education, Hefei University of Technology.
Abstract

As the convolutional neural network (CNN) gets deeper and wider in recent years, the requirements for the amount of data and hardware resources have gradually increased. Meanwhile, CNN also reveals salient redundancy in several tasks. The existing magnitude-based pruning methods are efficient, but the performance of the compressed network is unpredictable. While the accuracy loss after pruning based on the structure sensitivity is relatively slight, the process is time-consuming and the algorithm complexity is notable. In this article, we propose a novel automatic channel pruning method (ACP). Specifically, we first perform layer-wise channel clustering via the similarity of the feature maps to perform preliminary pruning on the network. Then a population initialization method is introduced to transform the pruned structure into a candidate population. Finally, we conduct searching and optimizing iteratively based on particle swarm optimization (PSO) to find the optimal compressed structure. The compact network is then retrained to mitigate the accuracy loss from pruning. Our method is evaluated against several state-of-the-art CNNs on three different classification datasets: CIFAR-10/100 and ILSVRC-2012. On ILSVRC-2012, when removing 64.36% parameters and 63.34% floating-point operations (FLOPs) of ResNet-50, the Top-1 and Top-5 accuracy drops are less than 0.9%. Moreover, we demonstrate that without harming overall performance it is possible to compress SSD by more than 50% on the object detection dataset PASCAL VOC. This further verifies that the proposed method can also be applied to other CNNs and application scenarios.

Index Terms:
Deep learning, convolutional neural network, channel pruning, model compressing, clustering, swarm intelligence optimization.

I Introduction

With the continuous expansion of datasets and the substantial increase in hardware computing power, deep neural network (DNN) models [1],[2],[3] have achieved great success in scientific research and engineering. As one of the noteworthy members, CNN has a better performance in several fields such as image classification [4], object detection [5], style transfer [6], and semantic segmentation [7] due to its parameter sharing and local connectivity schemes. However, as computer vision tasks become more complex, the depth and width of the network are gradually expanding. While improving performance, the scale of CNN has also become extremely large. As a result, some existing deep CNNs can only be applied on GPUs, TPUs, or in the cloud, which considerably limits the development and application of CNNs. Moreover, many studies have revealed that deep CNNs have obvious parameter redundancy in many tasks.

To disentangle this dilemma, many researchers devote themselves to compressing and accelerating CNNs. The current mainstream methods mainly include designing dedicated hardware [8], optimizing convolutional calculation [9], and designing network compression schemes. Among them, network compressing is mainly divided into four types: low-rank decomposition [10], quantification [11], knowledge distillation [12], and network pruning. The low-rank decomposition can achieve sparseness and directly compress the network. Nevertheless, additional calculations are introduced in the implementation, which is not conducive to the decrease of FLOPs. Quantification can reduce storage requirements and speed up inference, but it will cause accuracy loss. Knowledge distillation can improve the performance of small networks by learning from a pre-trained large teacher network; however, the performance depends on the similarity of the two tasks. Network pruning is based on the redundancy of CNN and evaluates the importance of the parameters to delete insignificant filters or channels. Since it is easy to implement and can effectively compress and accelerate the network while maintaining the original accuracy, network pruning has received widespread attention. The existing pruning methods based on the magnitude of filters and channels [13],[14],[15] or their similarity [16] are relatively efficient, but the accuracy drop of the compact network is unstable. The pruning schemes based on structure sensitivity can obtain a compressed network with less precision loss, but the pruning phase is time-consuming. For example, [17] considers the influence of error propagation on pruning and discards neurons layer by layer via minimizing the final response layer. [18] implements network pruning through reinforcement learning. In response to the above obstacles, we first prune the channels for preliminary compression and then design an algorithm to further optimize the pruned model. In this way, the performance of the compact network can be improved as much as possible, sometimes even beyond that of the vanilla network. The motivation of our method is two-fold. First, [16] shows the effectiveness of leveraging the similarity between feature maps to measure the parameter redundancy. Second, [19] reveals that neural architecture search can guide and optimize network pruning.

In this paper, we propose an automatic channel pruning method based on clustering and swarm intelligence optimization. First, we randomly sample and test several images from the dataset using the pre-trained baseline CNN. The feature maps at the same place are integrated to calculate the cosine similarity of channels in each layer. According to the similarity, the feature maps are analyzed and clustered, and the result of the clustering is regarded as the number of remaining channels to produce the preliminary compact network. Then we propose an approach to initialize the network structure population based on the compressed substructure. Finally, particle swarm optimization is adopted to automatically search for the optimal compact model. The two benefits of performing preliminary clustering pruning and then optimizing the substructure through swarm intelligence searching are as follows. First, the clustering pruning itself is efficient, and it shrinks the searching space for the subsequent iterative optimization, which makes the entire network compression time-saving. Second, clustering pruning can provide a more reliable initialization for optimization, thereby mitigating the accuracy drop. In this paper, we compress VGGNet [20], ResNet [21], and GoogLeNet [22] on CIFAR-10, CIFAR-100, and ILSVRC-2012 [23], and SSD [24] on PASCAL VOC [25].

Extensive experiments indicate that ACP can effectively and accurately compress CNNs. On CIFAR-10, we remove 80.92% parameters and 78.32% FLOPs of ResNet-110, and surprisingly gain 0.33% improvement of performance. While the parameters and FLOPs of GoogLeNet are discarded by 66.29% and 70.44% respectively, the accuracy of the compact network is even increased by 0.37%. For CIFAR-100 with less training images, ACP discards 47.67% parameters and 52.21% FLOPs of ResNet-56 with only 0.21% accuracy drop. On the large-scale ILSVRC-2012, when pruning 54.69% parameters and 46.82% FLOPs of ResNet-50, its Top-1 and Top-5 accuracy only reduce 0.41% and 0.21%.

The proposed ACP is applicable to all existing CNNs and different tasks. Apart from that, the iterative searching based on PSO is capable of optimizing any channel pruning method. Moreover, because no sparsity is introduced, ACP does not require the assistance of additional sparse matrix operations or acceleration libraries. The entire pruning phase can be achieved by controlling only one hyper-parameter, which notably reduces labor intervention and enables automatic compression.

The main contributions of this paper can be summarized as follows:

  • We propose a layer-wise clustering pruning method for CNN, which regards the sum of the number of clusters obtained by clustering as the number of channels in each layer.

  • We introduce a CNN structure optimizing method based on PSO. We propose an initializing scheme to extend the preliminarily pruned network as a structure population, and then apply the PSO to obtain the optimal compact network.

  • An automatic channel pruning method combined with the clustering and PSO is presented. It can achieve flexible compression for CNN by only adjusting one hyper-parameter under the condition that almost retains or even exceeds the original accuracy.

  • This paper conducts comprehensive experiments on the image classification datasets CIFAR-10/100, ILSVRC-2012, and further verification on the target detection dataset PASCAL VOC. The results sufficiently confirm the effectiveness of our ACP.

Refer to caption
Figure 1: The ACP framework. For simplicity, we take a three-layer CNN as an example. (a) Inputs: original network and different datasets. (b) Clustering Pruning: The redundant channels are clustered and pruned based on the similarity of feature maps. (c) Swarm Intelligence Structure Optimization: The pruned network structure obtained by clustering is initialized as the candidate population. We train a few epochs for each candidate to calculate its fitness and then update all substructures. After some cycles, the substructure with the highest fitness is regarded as the global best compact network. (d) Different datasets end up with various pruned network structures. We finally retrain the pruned model to recover its predictive accuracy. (This figure is best viewed in color and zoomed in.)

II Related Works

Deep CNNs have made breakthroughs in various fields of computer vision, but their enormous parameters and FLOPs exceedingly limit the application of CNNs. For some simple tasks, too many parameters can cause over-fitting, which may deteriorate performance. At present, there have been several methods devoted to network compression and acceleration. Among them, the network pruning algorithm has gained extensive attention due to its simple implementation and apparent effects, and many excellent algorithms have emerged. The existing network pruning algorithms can be roughly divided into two categories: magnitude-based pruning and sensitivity-based pruning.

II-A Magnitude-based pruning

The magnitude-based pruning focuses on the importance of parameters and connections. According to the proposed importance evaluation index, the unimportant parameters and connections in the network are directly deleted, and then the obtained compact network is fine-tuned or retrained to restore the experimental accuracy. The core of this type of algorithm is to design a reasonable and effective parameter importance evaluation index. [26] uses the scaling factor of the batch normalization layers to evaluate the importance of the channels and deletes the unimportant channels to realize channel-level network pruning. [13] uses the magnitude of the weights as the measure and deletes all insignificant connections in the pre-trained network whose weights are lower than a threshold. [15] calculates and sorts the L1 norm of the filters and prunes those with smaller values, together with their corresponding feature maps, according to the preset pruning rate. [27] utilizes the L2 norm to measure the importance of the filters, and continuously updates the filters that were pruned last time during the training phase. [28] adopts a fully connected layer as a linear classifier to extract the feature representations of the middle layers and prunes those with less improvement according to a predefined threshold. [29] judges the redundancy of a filter based on the Euclidean Distance between it and the geometric mean of all filters in the layer and deletes the redundant filters in the network according to the preset pruning rate. [30] applies the rank of the feature map to determine how much information it contains. The low-rank feature maps contain less information and can be deleted with confidence. The high-rank feature maps hold much more information, so even fixing some of their parameters will not damage the final performance. [16] proposes a new filter pruning method, which analyses the diversity and similarity of feature maps to prune redundant filters in the network. [31] prunes 3D ConvNets according to the magnitude and information score of the filters.

Clustering is the task of dividing data points into several groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to segregate groups with similar traits and assign them into clusters. Broadly speaking, clustering can be divided into three subgroups: prototype-based clustering, such as the k-means algorithm [32]; hierarchical clustering, such as the BIRCH algorithm [33]; and density-based clustering, such as the DBSCAN algorithm [34],[35] and the OPTICS algorithm [36]. According to specific magnitude characteristics such as the similarity of the parameters, the original network can be clustered and pruned.

II-B Sensitivity-based pruning

Sensitivity-based methods require analyzing the influence of pruning on the original network. Concretely, by considering the impact of pruning strategies on network performance in the compressing process, the CNN is pruned layer by layer, and the scheme is adjusted continually during the accelerating phase. The core of this type of algorithm is to design a rational and effective network pruning adjustment strategy. [17] considers the importance score propagation of the neurons and uses feature ranking techniques to measure the importance of each neuron in the "final response layer" to prune the filters in earlier layers. [37] introduces connection sensitivity to evaluate the importance of structural connections and prunes the network during the parameter initialization stage before training. [38] comprehensively considers the neuron connections before and after the pruned layer and compresses the network by minimizing the formulated Frobenius distortion. [39] introduces a variational technique to estimate channel saliency in the training process and looks for a suitable probability distribution to further delete redundant channels. [40] proposes a feature boosting and suppression method to predictively amplify salient convolutional channels and skip unimportant ones at run-time. This method retains the complete network structure. [41] develops a differentiable pruning criteria sampler that is learnable and optimized by the validation loss of the pruned network obtained from the sampled criteria, so it can adaptively select the appropriate pruning criteria for different functional layers. [42] directly optimizes a channel-wise differentiable discrete gate under a resource constraint to control the deletion of channels in the convolutional layers. [43] proposes a differentiable Markov channel pruning method to efficiently compress the network, which is differentiable and can be directly optimized by gradient descent with respect to the standard task loss and budget regularization. [44] proposes an effective channel selection layer, which automatically finds less important filters in a joint training manner. The method takes previous activation responses as an input and generates a binary index code for pruning. [45] proposes a network channel pruning scheme based on sparse learning and a genetic algorithm.

Swarm intelligence optimization algorithms are a form of nature-inspired optimization algorithms that mainly simulate the cooperative behavior of insects, birds, and animals. The individuals of the community continuously change their search direction by sharing knowledge among themselves until they achieve their goals. Famous swarm intelligence optimization algorithms include the ant colony algorithm [46], the particle swarm optimization algorithm [47],[48], and the artificial bee colony algorithm [49].

III Proposed Method

We propose the ACP to decrease the redundancy of CNN for different tasks and achieve effective and accurate channel-level compression and acceleration by reducing the number of parameters and FLOPs. In this section, we present the overall ACP framework first, and then specifically introduce the two major components: clustering pruning and swarm intelligent structure optimization. Finally, we describe the pruning strategy for different CNNs.

Refer to caption
Figure 2: Illustration of clustering pruning.

III-A The Automatic Channel Pruning Framework

Fig.1 shows the overall framework of ACP. Given an image classification dataset, we randomly sample numerous images and infer them via the pre-trained CNN. The feature maps are clustered based on the density and the number of clusters in each layer is regarded as the number of channels after pruning. Then the obtained network substructure is initialized as a structure candidate population. According to the accuracy loss of the subnetwork, we iteratively search and optimize each candidate using PSO. Finally, after a few cycles, we can obtain the optimal compact structure via automatic channel pruning. For different tasks and datasets, the compressed network structure is generally different.

III-B Clustering Pruning

Given a CNN with $L$ convolutional layers and the dataset $D$, the convolutional operation is formulated as $Conv = \mathcal{F}(D, W; C)$. $C = (c_1, c_2, \ldots, c_L)$ is the original network structure, where $c_l$ is the number of channels in the $l$-th layer. The size of the filter $W \in \mathbb{R}^{c_{out} \times c_{in} \times k \times k}$ is $k \times k$, where $c_{out}$ and $c_{in}$ denote the number of output and input channels, respectively. The feature map is a 4D tensor $M \in \mathbb{R}^{s \times c \times W \times H}$ with the size of $W \times H$, where $s$ is the number of samples from the dataset.

Here, we perform layer-wise clustering pruning based on the similarity of feature maps to achieve the preliminary network compression. The number of groups formed by clustering is regarded as the number of channels after pruning. Given an input image, the feature map generated in the $l$-th layer is $M_l \in \mathbb{R}^{c_l \times W_l \times H_l}$, and then the $c_l$ channels, each of size $W_l \times H_l$, are clustered. We draw on the idea of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. Density-based clustering can separate regions with adequate density into clusters with arbitrary shapes in a dataset containing noise. Moreover, there is no need to specify the number of clusters to be formed before clustering.

Different filters perform convolutional operations on the input to generate various feature maps, and some of them in the same layer may be similar. For a given dataset containing $m$ training images, we randomly pick $s$ $(s \leq m)$ samples and process them through the CNN. To fully consider the impact of different types of images on the output feature maps and obtain a more accurate similarity, we superimpose and average the feature maps at the same position produced by the $s$ samples, and then calculate the cosine distance to estimate the similarity between any two feature maps $M_{li}$ and $M_{lj}$ in the same layer. The specific definition is as follows:

$$dist(M_{lij}) = \left| \cos\left[ \left( \frac{1}{s}\sum_{r=1}^{s} M_{li}^{r} \right), \left( \frac{1}{s}\sum_{r=1}^{s} M_{lj}^{r} \right) \right] \right|, \qquad (1)$$

where $M_{li}^{r}$ is the $i$-th feature map produced by the $r$-th image in the $l$-th layer. The neighborhood parameter of the cluster is $(\varepsilon, MinPts)$, where $\varepsilon$ is the neighborhood radius, and $MinPts$ refers to the minimum number of feature maps in the $\varepsilon$-neighborhood of a core feature map. If the $\varepsilon$-neighborhood of the feature map $M_{li}$ contains at least $MinPts$ feature maps, that is $\left| \mathbb{N}_{\varepsilon}(M_{li}) \right| \geq MinPts$, then $M_{li}$ is assigned as a core feature map, and all of the core feature maps form a set. We randomly select a core feature map and cluster all feature maps, including non-core ones, in its $\varepsilon$-neighborhood into a group. If the group contains another core feature map, then all the feature maps in that core feature map's $\varepsilon$-neighborhood also belong to this group. Feature maps that are neither core feature maps nor belong to the $\varepsilon$-neighborhood of any core feature map are regarded as noise. According to this scheme, we cluster the channels in each layer and consider the sum of the number of groups and noise points obtained by clustering as the number of channels after clustering pruning. In this way, the primary compressed substructure $C' = (c_1', c_2', \ldots, c_L')$ is obtained. Fig.2 depicts the process of clustering and pruning in one layer. The upper right corner shows the substructure after pruning, and the feature maps in the same color belong to the same cluster. The number of feature maps enclosed in the solid line indicates the number of remaining channels.
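To make this step concrete, the sketch below implements the layer-wise procedure under stated assumptions: the feature maps of one layer for the $s$ sampled images are already collected in an array, DBSCAN from scikit-learn stands in for the density-based clustering, and the $\varepsilon$-neighborhood is taken over the distance $1-|\cos|$ derived from Eq.1 (how exactly the paper maps the absolute cosine to a neighborhood radius is our assumption). The function name and signature are illustrative, not from the authors' code.

```python
# A minimal sketch of clustering pruning for one convolutional layer.
import numpy as np
from sklearn.cluster import DBSCAN

def count_remaining_channels(feat_maps, eps=0.01, min_pts=5):
    """feat_maps: array of shape (s, c_l, H, W) collected for s sampled images.

    Returns the estimated number of channels kept after clustering pruning,
    i.e. the number of clusters plus the number of noise feature maps.
    """
    c_l = feat_maps.shape[1]
    # Average the feature maps over the s samples, as in Eq. 1.
    mean_maps = feat_maps.mean(axis=0).reshape(c_l, -1)          # (c_l, H*W)

    # Pairwise |cosine| similarity between averaged feature maps.
    norms = np.linalg.norm(mean_maps, axis=1, keepdims=True) + 1e-12
    cos_sim = np.abs(mean_maps @ mean_maps.T) / (norms @ norms.T)

    # Assumption: use 1 - |cos| as the DBSCAN distance so that a larger
    # neighborhood radius eps merges more similar channels into one cluster.
    dist = 1.0 - cos_sim
    labels = DBSCAN(eps=eps, min_samples=min_pts,
                    metric="precomputed").fit(dist).labels_

    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters + n_noise                                   # c_l'
```

Applying such a routine layer by layer yields the preliminary substructure $C'$ that the swarm intelligence optimization in the next section starts from.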

III-C Swarm Intelligence Structure Optimization

In the previous section, we have conducted preliminary clustering pruning for the original network structure $C$. In this section, the compressed substructure $C'$ is further optimized through structure searching. We use the PSO algorithm to optimize $C'$ and find the optimal compact network $C^{*}$ which satisfies:

$$C^{*} = \underset{C'}{\arg\max}\; Acc\left( D_{Test}\left( D_{Train}, W'; C' \right) \right) \qquad (2)$$

where $D_{Train}$ denotes the training set and $D_{Test}$ is the test set. $Acc(\cdot)$ is the test accuracy of the structure $C'$. The structure searching optimization can be divided into the following two steps.

III-C1 Network Structure Initialization

First of all, we initialize the particle swarm according to the compressed structure $C'$ and obtain a structure candidate population set $\{C_n^0\}$ with $N$ substructure candidates, where $n \in [1, N]$. In order to achieve a locally random initialization that makes the initialized network structures similar to $C'$ but different in several layers, we propose a structure population initialization scheme. The specific implementation is as follows:

$$c_{nl}^{0} = c_l' + n \times rand\{-1, 0, 1\} \qquad (3)$$

where the $n$-th structure is $C_n^0 = (c_{n1}^0, c_{n2}^0, \ldots, c_{nL}^0)$, and $rand\{-1, 0, 1\}$ denotes sampling a number randomly from $-1$, $0$, and $1$. Then, we initialize the searching velocity set $\{V_n^0\}$ of the channels, where $V_n^0 = (v_{n1}^0, v_{n2}^0, \ldots, v_{nL}^0)$, and the maximum velocity is set to $v_{max}$ so that $v_n \in [-v_{max}, v_{max}]$. The fitness of $C_n^0$ can be calculated via Eq.4.

$$fitness(C_n) = Acc\left( D_{Test}\left( D_{Train}, W_n; C_n \right) \right) \qquad (4)$$

Since network training is time-consuming, it is cumbersome to calculate the fitness many times during the swarm intelligence structure searching and optimization. To tackle this problem, we only train each structure for a few epochs to obtain its fitness. This not only reveals the performance of the structure candidate but also saves searching time. The $N$ network structures are initialized as the local best structures $pbest_n$ respectively, and the structure with the greatest fitness is regarded as the global best network $gbest$.
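A hedged sketch of the initialization step (Eq.3) and the short-training fitness of Eq.4 is given below; the helper callables passed to fitness (model construction, a brief training loop, and accuracy evaluation) are placeholders for ordinary PyTorch routines and are not part of the paper.

```python
# Population initialization (Eq. 3) and short-training fitness (Eq. 4), as a sketch.
import random

def init_population(pruned_structure, n_candidates, v_max=2.0):
    """pruned_structure: list [c_1', ..., c_L'] produced by clustering pruning."""
    population, velocities = [], []
    for n in range(1, n_candidates + 1):
        # Eq. 3: perturb every layer width by n * rand{-1, 0, 1};
        # the clamp to >= 1 channel is a safety guard not stated in the paper.
        cand = [max(1, c + n * random.choice([-1, 0, 1])) for c in pruned_structure]
        vel = [random.uniform(-v_max, v_max) for _ in pruned_structure]
        population.append(cand)
        velocities.append(vel)
    return population, velocities

def fitness(structure, build_fn, train_fn, eval_fn, epochs=3):
    """Eq. 4: build the sub-network, train it briefly, return test accuracy.

    build_fn/train_fn/eval_fn are user-supplied callables (model construction,
    a short SGD loop, and a top-1 accuracy evaluation on the test set).
    """
    model = build_fn(structure)
    train_fn(model, epochs)
    return eval_fn(model)
```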

III-C2 Iterative Searching

After the initialization, iterative searching is performed to find the optimal CNN structure. First, we update the searching velocity as follows:

$$v_n^t = w \times v_n^{t-1} + \alpha_1 \times rand \times \left( pbest_n - c_n^{t-1} \right) + \alpha_2 \times rand \times \left( gbest - c_n^{t-1} \right) \qquad (5)$$

where $v_n^t$ denotes the channel searching velocity of the $n$-th candidate in the $t$-th iteration. $\alpha_1$ and $\alpha_2$ are two positive learning weights. In this paper, we adopt the recommended values $\alpha_1 = \alpha_2 = 2$ from [48], since they make the average weights of the "social" and "cognition" parts equal to 1. $rand$ is a random number between 0 and 1, while $w$ is the inertia weight. A dynamic $w$ behaves better in the searching optimization; therefore, we utilize the linearly decreasing weight scheme for $w$ as follows:

$$w^t = \left( w_{max} - w_{min} \right)\left( T - t \right)/T + w_{min} \qquad (6)$$

where $w_{max}$ denotes the initial inertia weight, while $w_{min}$ is the inertia weight at the final iteration. Here, $w_{max} = 0.9$ and $w_{min} = 0.4$. $T$ is the maximum iteration cycle and $t$ denotes the current cycle. Then we update the candidate $C_n^t$ as follows:

$$c_n^t = c_n^{t-1} + r \times v_n^t \qquad (7)$$

where $r$ is the learning rate of the channels and we set $r = 2$. After that, we calculate the fitness of each candidate. If it is greater than that of the local best structure, we assign the candidate as the new $pbest_n$. Furthermore, if it is larger than that of the global best network, $gbest$ is replaced by the candidate. In the end, after iteratively searching for $T$ cycles, $gbest$ is the optimal pruned structure $C^{*}$. Algorithm 1 presents the pseudocode of our ACP. Given a pre-trained network, we retrain the compact network obtained from pruning and optimizing to mitigate the accuracy loss from compressing.
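The update rules of Eqs.5-7 with the inertia schedule of Eq.6 can be sketched as one PSO iteration over the structure population; rounding the updated widths to positive integers (and, if desired, clipping velocities to $\pm v_{max}$) are our assumptions, since channel counts must stay integral.

```python
# One PSO iteration over the structure population (Eqs. 5-7), sketch only.
import random

ALPHA1 = ALPHA2 = 2.0    # learning weights recommended in [48]
W_MAX, W_MIN = 0.9, 0.4  # initial / final inertia weight (Eq. 6)
R = 2                    # channel learning rate in Eq. 7

def pso_step(population, velocities, pbest, gbest, t, T):
    w = (W_MAX - W_MIN) * (T - t) / T + W_MIN                 # Eq. 6
    for n, (cand, vel) in enumerate(zip(population, velocities)):
        for l in range(len(cand)):
            vel[l] = (w * vel[l]
                      + ALPHA1 * random.random() * (pbest[n][l] - cand[l])  # cognition part
                      + ALPHA2 * random.random() * (gbest[l] - cand[l]))    # social part (Eq. 5)
            # Eq. 7, kept as a positive integer channel count (assumption).
            cand[l] = max(1, round(cand[l] + R * vel[l]))
    return population, velocities
```

After each such step, the fitness of every candidate is recomputed and $pbest_n$/$gbest$ are updated exactly as in lines 15-22 of Algorithm 1 below.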

Algorithm 1 Automatic Channel Pruning (ACP)

Input: Original structure: $C$, Cycles: $T$, Number of initialization structures: $N$.
Output: Optimal pruned structure: $C^{*}$.

1:Calculate the cosine similarity between feature maps in each layer via Eq.1;
2:Cluster the feature maps in each layer to obtain the pruned structure $C'$;
3:for $n=1$ to $N$ do
4:     Initialize the pruned structure set $\{C_n^0\}$;
5:     Initialize the searching velocity set $\{V_n^0\}$;
6:     Calculate $fitness(C_n^0)$ of $C_n^0$ via Eq.4;
7:     $pbest_n = C_n^0$;
8:end for
9:$gbest = \max\{pbest_n\}$;
10:for $t=1$ to $T$ do
11:     for $n=1$ to $N$ do
12:         Update the searching velocity $V_n^t$ via Eq.5;
13:         Update the candidate $C_n^t$ via Eq.7;
14:         Calculate $fitness(C_n^t)$ of candidate $C_n^t$ via Eq.4;
15:         if $fitness(C_n^t) > fitness(pbest_n)$ then
16:              $pbest_n = C_n^t$;
17:              $fitness(pbest_n) = fitness(C_n^t)$;
18:         end if
19:         if $fitness(pbest_n) > fitness(gbest)$ then
20:              $gbest = pbest_n$;
21:              $fitness(gbest) = fitness(pbest_n)$;
22:         end if
23:     end for
24:end for

$C^{*} = gbest$.

For most of the existing network pruning methods, the compressed network inherits the weights and biases from the original network and restores the performance as much as possible through fine-tuning. However, when the network pruning rate is remarkable, the accuracy recovery after fine-tuning is not obvious, and the actual performance of the compact network cannot be fully manifested. [50] makes a surprising observation in structured network pruning that fine-tuning a pruned model only gives comparable or worse performance than training that model with randomly initialized weights. The experimental results reveal that the pruned architecture itself, rather than a set of inherited "important" weights, is more crucial to the efficiency of the final model. The results of pruning VGG-16 on CIFAR-10 further verify the observation in [50]. In order to fully demonstrate the performance of the compact networks, we retrain them from scratch in our experiments. Specifically, we keep the amount of training FLOPs consistent before and after pruning: the number of training epochs of the original network multiplied by the FLOPs acceleration rate is regarded as the number of retraining epochs. Finally, we compare the accuracy of the pruned network with that of the original network to draw a conclusion.
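Under one reading of the retraining rule above (keep the total retraining compute roughly equal to that of training the unpruned network), the retraining budget can be computed as follows; the function name is illustrative.

```python
# Retraining epochs scaled by the FLOPs acceleration ratio (one reading of the rule above).
def retrain_epochs(base_epochs, flops_original, flops_pruned):
    return round(base_epochs * flops_original / flops_pruned)

# Example with Table I numbers: VGG-16 on CIFAR-10 pruned to 93.52M FLOPs
# retrain_epochs(160, 314.59, 93.52) -> 538
```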

III-D Pruning Strategy for Different CNNs

Refer to caption
Figure 3: Illustration of pruning Inception V3 module. The black font indicates the number of original channels, and the red font indicates that after pruning.

Since different CNNs have different structures, different pruning schemes need to be adopted to maintain structural stability. We perform experiments on VGGNet, ResNet, and GoogLeNet respectively. Among these CNNs, VGGNet is a common layer-by-layer CNN. During the compressing process, the output channels of each layer can be deleted independently; the number of input channels of each layer remains the same as the output of the previous layer. ResNet and GoogLeNet leverage the Residual and Inception modules to make the network deeper and wider respectively. If the channels are discarded arbitrarily, the dimensional matching of the network will be destroyed. For ResNet with the basic block, we only prune the output of the first layer, and the input of the second layer is changed accordingly. The bottleneck consists of three layers, among which the size of the filters in the first and third layers is 1×1, while that in the middle layer is 3×3. We prune the channels in the middle layer of the bottleneck, and the output channels of the first layer and the input channels of the third layer are modified accordingly. GoogLeNet is a more complex network, composed of multiple Inception V3 modules that contain four branches. To prune GoogLeNet, we compress the branches containing two and three convolutional layers. The specific structure and pruning scheme of the Inception V3 module is plotted in Fig.3.
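As an illustration of the bottleneck rule described above, the PyTorch sketch below rebuilds a pruned bottleneck in which a single pruned width `mid_planes` governs the output of the first 1×1 layer and both sides of the 3×3 layer, while the final 1×1 expansion keeps its original width so the residual addition still matches; `PrunedBottleneck` is an illustrative name, not the authors' code.

```python
# A pruned ResNet bottleneck: only the middle 3x3 width is searched/pruned.
import torch.nn as nn

class PrunedBottleneck(nn.Module):
    def __init__(self, in_planes, mid_planes, out_planes, stride=1, downsample=None):
        super().__init__()
        # 1x1 reduce: its output width follows the pruned middle width.
        self.conv1 = nn.Conv2d(in_planes, mid_planes, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid_planes)
        # 3x3 middle layer: this is where channels are actually removed.
        self.conv2 = nn.Conv2d(mid_planes, mid_planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid_planes)
        # 1x1 expand: output width is untouched so the shortcut dimensions still match.
        self.conv3 = nn.Conv2d(mid_planes, out_planes, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_planes)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample

    def forward(self, x):
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + identity)
```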

IV Experiments

IV-A Implementation Details

In order to demonstrate the effectiveness of our proposed method, we prune VGGNet, ResNet, and GoogLeNet on CIFAR-10, CIFAR-100, and ILSVRC-2012 for image classification. Moreover, we prune SSD on PASCAL VOC for object detection. We implement all experiments using PyTorch. The network pre-training and hyper-parameter settings follow the method presented in [21] in order to obtain better benchmark accuracy and to compare fairly with existing network pruning methods. The learning rate is initially set to 0.1 and then decreased by a factor of 10 at half and three-quarters of the epochs. We train 160 epochs on CIFAR-10 and CIFAR-100 and 100 epochs on ILSVRC-2012. During training, all the networks are trained using SGD, and the batch size is set to 64 on CIFAR-10/100 and 16 on ILSVRC-2012. We use a weight decay of 1e-4 and a momentum of 0.9. In the retraining stage, we adjust the learning rate as described above on CIFAR, while adopting the cosine annealing adjustment strategy on ILSVRC-2012. We only change the neighborhood radius in the clustering pruning to control the compression ratio of the networks. The other hyper-parameter settings are as follows: $MinPts$=5, $T$=5, $N$=6. We perform a specific ablation analysis on $\varepsilon$ and $MinPts$ later. We train each compact network for three epochs before calculating its fitness.
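For reference, a minimal PyTorch sketch of the optimizer and learning-rate schedules described above is given below (step decay for CIFAR, cosine annealing for ILSVRC-2012 retraining); the wrapper function and its `imagenet` flag are our own naming.

```python
import torch

def make_optimizer_and_scheduler(model, epochs, imagenet=False):
    # SGD with the settings above: lr 0.1, momentum 0.9, weight decay 1e-4.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    if imagenet:
        # Cosine annealing schedule used when retraining on ILSVRC-2012.
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    else:
        # Decay the learning rate by 10x at half and three-quarters of the epochs (CIFAR).
        scheduler = torch.optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=[epochs // 2, epochs * 3 // 4], gamma=0.1)
    return optimizer, scheduler
```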

TABLE I: Accuracy and pruning ratio of VGG-16, ResNet-56/110 and GoogLeNet on CIFAR-10. "Acc.drop" is the accuracy drop of the pruned network, so a negative number means the compressed model has better performance than the baseline. A smaller number of "Acc.drop" is better. "Parameters.drop" and "FLOPs.drop" are the pruned percentage of the parameters and FLOPs, respectively.
Model Acc/% Acc.drop/% Parameters/M Parameters.drop/% FLOPs/M FLOPs.drop/%
VGG-16 Baseline 93.60 - 14.73 - 314.59 -
ACP($\varepsilon$=0.010) 94.03 -0.43 5.02 65.72 152.64 51.48
ACP($\varepsilon$=0.020) 93.66 -0.06 2.76 81.28 93.52 70.25
ACP($\varepsilon$=0.035) 93.45 0.15 1.23 91.66 83.54 73.44
Resnet-56 Baseline 93.18 - 0.85 - 127.62 -
ACP($\varepsilon$=0.035) 93.78 -0.60 0.59 30.59 78.31 38.64
ACP($\varepsilon$=0.075) 93.39 -0.21 0.35 58.82 58.17 54.42
ACP($\varepsilon$=0.080) 92.91 0.27 0.30 64.71 53.80 57.84
Resnet-110 Baseline 93.32 - 1.73 - 256.04 -
ACP($\varepsilon$=0.085) 94.33 -1.01 0.65 62.43 92.29 63.95
ACP($\varepsilon$=0.130) 94.07 -0.75 0.49 71.68 78.11 69.49
ACP($\varepsilon$=0.300) 93.65 -0.33 0.33 80.92 55.52 78.32
GoogLeNet Baseline 94.72 - 6.17 - 1533.96 -
ACP($\varepsilon$=0.030) 95.16 -0.44 3.95 35.98 901.60 41.22
ACP($\varepsilon$=0.070) 95.12 -0.40 2.90 53.00 620.36 59.56
ACP($\varepsilon$=0.200) 95.09 -0.37 2.08 66.29 453.38 70.44
TABLE II: Accuracy and pruning ratio of VGG-19 and ResNet-56/110 on CIFAR-100.
Model Acc/% Acc.drop/% Parameters/M Parameters.drop/% FLOPs/M FLOPs.drop/%
VGG-19 Baseline 72.01 - 20.09 - 399.52 -
ACP($\varepsilon$=0.005) 73.12 -1.10 10.03 50.07 239.77 39.99
ACP($\varepsilon$=0.010) 71.95 0.06 5.22 74.02 146.87 63.24
Resnet-56 Baseline 71.36 - 0.86 - 127.09 -
ACP($\varepsilon$=0.030) 71.77 -0.41 0.58 32.56 82.21 35.31
ACP($\varepsilon$=0.060) 71.15 0.21 0.45 47.67 60.73 52.21
Resnet-110 Baseline 72.15 - 1.73 - 256.04 -
ACP($\varepsilon$=0.040) 73.00 -0.85 0.96 44.51 135.71 47.00
ACP($\varepsilon$=0.070) 72.49 -0.34 0.74 57.23 119.33 53.39
TABLE III: Accuracy and pruning ratio of ResNet-18/34/50 on ILSVRC-2012.
Model Top-1 Acc/% Top-1 Acc.drop/% Top-5 Acc/% Top-5 Acc.drop/% Parameters/M Parameters.drop/% FLOPs/M FLOPs.drop/%
Resnet-18 Baseline 70.02 - 89.23 - 11.69 - 1819.87 -
ACP($\varepsilon$=0.005) 67.82 2.20 88.41 0.82 5.06 56.72 1197.99 34.17
ACP($\varepsilon$=0.010) 66.24 3.78 87.12 2.11 4.51 61.42 959.66 47.27
Resnet-34 Baseline 73.40 - 91.44 - 21.80 - 3672.07 -
ACP($\varepsilon$=0.010) 73.03 0.37 91.32 0.12 15.38 29.45 2525.73 31.22
ACP($\varepsilon$=0.030) 72.08 1.32 90.57 0.87 9.14 58.07 1625.64 55.73
Resnet-50 Baseline 75.94 - 92.93 - 25.56 - 4112.32 -
ACP($\varepsilon$=0.020) 75.53 0.41 92.72 0.21 11.58 54.69 2186.87 46.82
ACP($\varepsilon$=0.040) 74.64 0.89 92.19 0.53 9.11 64.36 1507.46 63.34

IV-B Results on CIFAR-10

We first prune VGG-16 and ResNet-56/110 on CIFAR-10. The results are shown in TABLE I. It can be seen that for the three networks, when pruning more than half of the parameters and FLOPs, the performance of all the compact networks even has a slight improvement compared with the vanilla networks. For VGG-16, we remove 65.72% parameters and 51.48% FLOPs while still keeping the accuracy at 94.03%, which is even 0.43% higher than the baseline. When the parameters and FLOPs of ResNet-56 are discarded by 58.82% and 54.42%, respectively, the precision is improved by 0.21%. While the pruning rates of parameters and FLOPs for ResNet-110 are 62.43% and 63.95%, the performance of the compact network is surprisingly increased by 1.01%. Generally speaking, with the increase of the compression ratio, the accuracy of the pruned model drops gradually. Our method reduces up to 91.66% parameters and 73.44% FLOPs for VGG-16 on CIFAR-10 while maintaining the accuracy as high as 93.45%, which is only 0.15% lower than the original network. We remove 64.71% parameters and 57.84% FLOPs for ResNet-56 with only a 0.27% drop of accuracy. For ResNet-110, although compressing 80.92% parameters and 78.32% FLOPs, the performance of the network is still improved by 0.33% compared to the original network.

The experimental results show that deep CNNs do have certain redundancy, and moderate compression will not affect their accuracy. When the pruning ratio is small, removing some redundant parameters in the network can improve its generalization, which leads to an increase in performance. As ResNet-56 is compressed to 0.59M, its classification performance on CIFAR-10 even exceeds that of the original 14.73M VGG-16. It also indirectly supports the excellent performance of the residual structure in image feature extraction and classification tasks. When ResNet-110 is pruned to 0.33M, its parameters and FLOPs are both less than those of the pruned 0.35M ResNet-56, but its accuracy is 0.26% ahead. The result sheds light on the fact that increasing the depth of ResNet can improve its performance when the number of parameters is similar.

To further validate the effectiveness of the proposed ACP, we also perform experiments on GoogLeNet. Due to the integration of the Inception module, GoogLeNet increases the width of the network, and its baseline accuracy reaches 94.72%, which is ahead of VGGNet and ResNet. We remove 66.29% parameters and 70.44% FLOPs while still holding the accuracy at 95.09%, which is even 0.37% higher than the baseline model. It further manifests that the proposed method can achieve effective compression and acceleration for CNNs. The ACP also plays the role of regularization while reducing the redundancy of the network.

IV-C Results on CIFAR-100

We continue to experiment with the CNNs on CIFAR-100 and the results are reported in TABLE II. CIFAR-100 and CIFAR-10 have the same number of images, but the number of categories increases from 10 to 100. As the training images of each class decrease, the performance of the network also drops significantly. The initial classification accuracy of VGG-19 on CIFAR-100 is only 72.01%. When pruning 74.02% parameters and 63.24% FLOPs, the accuracy only drops by 0.06%. We remove 57.23% parameters and 53.39% FLOPs of ResNet-110 while the accuracy is even 0.34% higher than the baseline model. Under this circumstance, the parameters and FLOPs of the 0.74M ResNet-110 are both smaller than those of the 0.86M original ResNet-56, but the performance is 1.13% higher than that of ResNet-56. The result further verifies the conclusion in the previous section.

IV-D Results on ILSVRC-2012

To sufficiently demonstrate the effectiveness of our ACP, we prune ResNet-18/34/50 on the large-scale dataset ILSVRC-2012 that contains 1000 categories. Among them, ResNet-18/34 use the basic block, while ResNet-50 uses the bottleneck. We arrange two pruning ratios for each network, and the results are presented in TABLE III. We can see that the accuracy of all three networks after compressing does have a loss compared with the original networks; moreover, as the pruning ratio increases, the accuracy loss also rises. It reveals that the three ResNets have less parameter redundancy on ILSVRC-2012. In spite of this, our ACP can still obtain flexible compression and acceleration with a slight accuracy drop. When deleting 56.72% parameters and 34.17% FLOPs of ResNet-18, the Top-1 and Top-5 accuracy drop by 2.20% and 0.82%, respectively. For ResNet-34, the initial Top-1 and Top-5 accuracy are 73.40% and 91.44%. When the pruning rate of the parameters and FLOPs exceeds 50%, the Top-1 and Top-5 accuracy only drop 1.32% and 0.87%. Because of the bottleneck, the Top-1 and Top-5 accuracy of ResNet-50 increase by 2.54% and 1.49% respectively over ResNet-34, while the parameters only increase by 3.76M. We remove 54.69% parameters and 46.82% FLOPs of ResNet-50, and the performance of the compact network is hardly reduced (0.41% Top-1 loss and 0.21% Top-5 loss). When we discard 64.36% parameters and 63.34% FLOPs, the compact ResNet-50 is smaller than the baseline ResNet-18, but the accuracy is significantly improved (Top-1: 74.64% versus 70.02%, Top-5: 92.19% versus 89.23%).

IV-E Comparison with Other Methods

In this subsection, we compare the method on three datasets including CIFAR-10, CIFAR-100, and ILSVRC-2012. Among the compared methods, Li et al. [15], SFP [51], DCP [52], FPGM [29], EDP [53], CNN-FCF [54], CCP [55], Taylor-FO-BN [56] and HRank [30] are the state-of-the-art channel pruning methods. The results of these competing methods are reported according to the original article.

TABLE IV: Performance comparison of ResNet-56 on CIFAR-10. The "-" indicates that the results are not listed in the original article.
Method Baseline Acc/% Pruned Acc/% Acc.drop/% Parameters.drop/% FLOPs.drop/%
SFP [51] 93.59 93.89 -0.30 - 14.70
Li et al. [15] 93.04 93.06 -0.02 13.70 27.60
Liu et al. [50] 93.14 93.05 0.09 13.70 27.60
HRank [30] 93.26 93.52 -0.26 29.30 16.80
ACP($\varepsilon$=0.035) 93.18 93.78 -0.60 30.59 38.64
SFP [51] 93.59 93.78 -0.19 - 41.10
HRank [30] 93.26 93.17 0.09 50.00 42.40
CNN-FCF [54] 93.14 93.38 -0.24 43.09 42.78
NISP [17] - - 0.03 42.60 43.61
Y.He et al. [41] 93.59 93.72 -0.13 - 47.10
He et al. [14] 92.80 91.80 1.00 - 50.00
AMC [18] 92.80 91.90 0.90 - 50.00
DCP [52] 93.80 93.59 0.31 - 50.00
DMC [42] 93.62 93.69 -0.07 - 50.00
SFP [51] 93.59 93.35 0.24 - 52.60
FPGM [29] 93.59 92.89 0.70 - 52.60
CCP [55] 93.50 93.42 0.08 - 52.60
Y.He et al. [41] 93.59 93.34 0.25 - 52.90
ABCPruner [57] 93.26 93.23 0.03 54.20 54.13
EDP [53] 93.61 93.61 0 54.18 57.71
ACP($\varepsilon$=0.075) 93.18 93.39 -0.21 58.82 54.42
TABLE V: Performance comparison of ResNet-110 on CIFAR-10.
Method Baseline Acc/% Pruned Acc/% Acc.drop/% Parameters.drop/% FLOPs.drop/%
SFP [51] 93.68 93.83 -0.15 - 14.60
Li et al. [15] 93.53 93.55 -0.02 2.30 15.90
Liu et al. [50] 93.14 93.22 -0.08 2.30 15.90
SFP [51] 93.68 93.93 -0.25 - 28.20
Li et al. [15] 93.53 93.30 0.20 32.40 38.60
Liu et al. [50] 93.14 93.60 -0.46 32.40 38.60
HRank [30] 93.50 94.23 -0.73 41.20 39.40
SFP [51] 93.68 93.86 -0.18 - 40.80
CNN-FCF [54] 93.58 93.67 -0.09 43.19 43.08
NISP [17] - - 0.18 43.25 43.78
GAL [58] 93.50 92.74 0.76 44.80 48.50
FPGM [29] 93.68 93.73 -0.05 - 52.30
Y.He et al. [41] 93.68 93.79 -0.11 - 60.30
ACP($\varepsilon$=0.085) 93.32 94.33 -1.01 62.43 63.95
ABCPruner [57] 93.50 93.58 -0.08 67.41 65.04
CNN-FCF [54] 93.58 92.96 0.62 69.51 70.81
HRank [30] 93.50 92.65 0.85 68.60 68.70
ACP($\varepsilon$=0.300) 93.32 93.65 -0.33 80.92 78.32

IV-E1 Comparison on CIFAR-10

First of all, we compare the methods of pruning ResNet-56 on CIFAR-10. The results are listed in TABLE IV. As we can see, our proposed ACP is superior to the existing methods, including the state-of-the-art, in terms of parameter and FLOPs compressing rate and the performance after pruning. We achieve even a 0.21% accuracy rise while discarding 58.82% parameters and 54.42% FLOPs, whereas the accuracy of the compact network in ABCPruner [57] reduces by 0.03% at a similar compressing rate (54.20% versus 58.82% and 54.13% versus 54.42%). Although the performance of the compressed ResNet-56 after pruning by CNN-FCF [54] improves by 0.24%, which is 0.03% better than our ACP, the pruning rates of parameters and FLOPs are 15.73% and 11.64% lower than ours respectively (43.09% versus 58.82% and 42.78% versus 54.42%).

Secondly, we compare the pruning methods for ResNet-110 on CIFAR-10. The results are shown in TABLE V. When we remove 62.43% parameters and 63.95% FLOPs, the accuracy of the compact network is 1.01% higher than that of the vanilla network, which is far better than the other methods. Among the compared schemes, Liu et al. [50] achieve the best performance with a 0.46% accuracy rise when the parameters and FLOPs are pruned by 32.40% and 38.60% respectively. When the compressing rate increases to about 80%, the compact network obtained via ACP still has better performance than the original network and the compared approaches. The results further verify the effectiveness of the proposed method for compressing and accelerating CNNs.

TABLE VI: Performance comparison of ResNet-56 on CIFAR-100.
Method Baseline Acc/% Pruned Acc/% Acc.drop/% FLOPs.drop/%
Y.He et al. [41] 71.41 70.83 0.58 51.60
SFP [51] 71.40 68.70 2.61 52.60
FPGM [29] 71.41 69.66 1.75 52.60
ACP($\varepsilon$=0.060) 71.36 71.15 0.21 52.21
TABLE VII: Performance comparison of ResNet-34 on ILSVRC-2012.
Method Top-1 Acc.drop/% Top-5 Acc.drop/% Parameters.drop/% FLOPs.drop/%
Li et al. [15] 0.75 - 7.20 7.50
Liu et al. [50] 0.40 - 10.80 24.20
Taylor-FO-BN [56] 0.48 - - 24.20
ACP($\varepsilon$=0.010) 0.37 0.12 29.45 31.22
FPGM [29] 1.29 0.54 - 41.10
SFP [51] 2.09 1.29 - 41.10
NISP [17] 0.92 - 43.68 43.76
EDP [53] 1.13 0.51 45.50 44.90
ABCPruner [57] 2.30 1.40 53.58 41.00
CNN-FCF [54] 1.97 1.22 55.80 54.87
ACP($\varepsilon$=0.030) 1.32 0.87 58.07 55.73

IV-E2 Comparison on CIFAR-100

In the CIFAR-100 experiments, we tabulate the compared results of pruning ResNet-56 in TABLE VI. The original accuracy is only 71.36% due to the inadequate training set. In this case, the redundancy of the network is relatively scanty, which also enhances the difficulty of compressing. Although the pruned accuracy drops to an extent compared to the original network, our ACP obtains better parameters and FLOPs reduction and better performance than the other methods. SFP [51] and FPGM [29] remove 52.60% FLOPs with 2.61% and 1.75% accuracy loss, respectively. However, at a nearly identical FLOPs pruning rate (52.21% versus 52.60%), the accuracy of ACP only degrades by 0.21%.

IV-E3 Comparison on ILSVRC-2012

In the end, we compare the performance of the pruning methods for ResNet-34 and ResNet-50 on ILSVRC-2012. The results are depicted in TABLE VII and TABLE VIII. It can be found from TABLE VII that the performance of ACP is generally superior among the compared methods. Although the Top-1 and Top-5 accuracy loss is slightly higher than that of NISP [17], FPGM [29], and EDP [53], the pruning rates of parameters and FLOPs of ACP are over 10% higher than those of the three approaches. When we discard 29.45% parameters and 31.22% FLOPs of ResNet-34, the performance of the compact network is better than Li et al. [15], Liu et al. [50], and Taylor-FO-BN [56].

As we can see from TABLE VIII, the performance of ACP significantly exceeds the other schemes. Although the Top-5 accuracy loss of the pruned network using ACP is 0.02% (0.21% versus 0.19%) higher than that of CNN-FCF [54], the Top-1 accuracy loss reduces by 0.06% (0.41% versus 0.47%). Moreover, the compressing rate of the parameters increases by 12.28% compared with CNN-FCF [54] in the case of nearly the same pruning rate of FLOPs. The Top-5 accuracy loss of ThiNet [59] is 0.12% less than ACP, but its Top-1 accuracy loss is even 0.86% higher with 20.97% (33.72% versus 54.69%) and 10.03% (36.79% versus 46.82%) lower parameters and FLOPs compressing rate respectively. Additionally, as the pruning rate increases, the dominance of ACP becomes more obvious. All the results manifest that the proposed ACP still has a good performance on large-scale image classification tasks.

TABLE VIII: Performance comparison of ResNet-50 on ILSVRC-2012.
Method Top-1 Acc.drop/% Top-5 Acc.drop/% Parameters.drop/% FLOPs.drop/%
ThiNet [59] 1.27 0.09 33.72 36.79
SFP [51] 1.54 0.81 - 41.80
HRank [30] 1.17 0.54 36.67 43.77
NISP [17] 0.89 - 43.82 44.01
Taylor-FO-BN [56] 1.68 - - 45.00
CNN-FCF [54] 0.47 0.19 42.41 46.05
EDP [53] 0.56 0.34 43.90 52.60
ACP($\varepsilon$=0.020) 0.41 0.21 54.69 46.82
He et al. [14] - 1.40 - 50.00
FPGM [29] 1.32 0.55 - 53.50
CCP [55] 0.94 0.45 - 54.10
GAL [58] 4.35 2.05 - 55.00
ABCPruner [57] 2.49 1.45 56.01 56.61
CNN-FCF [54] 1.60 0.69 52.52 57.10
Y.He et al. [41] 1.97 0.95 - 60.80
HRank [30] 4.17 1.86 46.00 62.10
ACP($\varepsilon$=0.040) 0.89 0.53 64.36 63.34
Refer to caption
Figure 4: The influence of $MinPts$ on the channels. (Left) Pruning VGG-16 on CIFAR-10, where $\varepsilon$=0.010; (Right) Pruning Resnet-18 on ILSVRC-2012, where $\varepsilon$=0.010.

IV-F Ablation Analysis

Then, we conduct an ablation analysis on the proposed ACP method. This section comprises six parts: the influence of $MinPts$, the influence of $\varepsilon$, the influence of different similarity criteria, the influence of PSO, the effectiveness of ACP, and generalization ability on detection tasks.

IV-F1 The Influence of $MinPts$

$MinPts$ is a hyper-parameter for clustering in our method. We analyze the influence of $MinPts$ on the number of pruned channels in the clustering pruning stage. The results are displayed in Fig.4. Given the pre-trained model, we set $\varepsilon$=0.010 and change $MinPts$ from 1 to 10 to record the number of remaining channels in each layer after pruning. The left image shows the results of pruning VGG-16 on CIFAR-10. It can be seen that only the last two layers, Conv5_2 and Conv5_3, fluctuate significantly with $MinPts$, and the other layers are relatively stable or only have slight variances. The number of channels in more than half of the layers remains almost unchanged after compressing as $MinPts$ changes. The right image shows the results of pruning Resnet-18 on ILSVRC-2012. It manifests that the number of pruned channels in all layers varies with $MinPts$ only to a small extent. Therefore, controlling the pruning rate by changing $MinPts$ would hamper the flexibility of the network compression.

In order to select a better $MinPts$, we perform experiments on the effect of different hyper-parameter combinations ($\varepsilon$, $MinPts$) on the number of parameters, FLOPs, and channels when compressing VGG-16 on CIFAR-10 in the clustering pruning phase. The experimental results are shown in Fig.5. When $MinPts$=5 (the red curve), the pruning rates of the original network change more smoothly with variable $\varepsilon$ than with the other values of $MinPts$; meanwhile, the coverage of all pruning rates is also relatively moderate. Comprehensively considering the adjustment of hyper-parameters and the compressing rate of the network, we keep $MinPts$=5 unchanged and only change $\varepsilon$ to perform our ACP. The experimental results in the previous section entirely support the effectiveness of this hyper-parameter setting.

Refer to caption
Figure 5: The influence of ($\varepsilon$, $MinPts$) on the pruned percentage of parameters, FLOPs and channels of VGG-16 on CIFAR-10.
Refer to caption
Figure 6: The influence of $\varepsilon$ on the pruned percentage and accuracy of GoogLeNet on CIFAR-10.
Refer to caption
Figure 7: The influence of $\varepsilon$ on the number of channels in each layer before and after pruning on CIFAR-10.
TABLE IX: Performance comparison of pruned networks with and without PSO. "Acc.w.o." means we compress networks only with the clustering pruning; "Acc.w." means we optimize compact networks after clustering pruning via PSO.
Dataset CIFAR-10 CIFAR-100
Model VGG-16 ResNet-56 GoogLeNet ResNet-56
$\varepsilon$ 0.010 0.020 0.035 0.035 0.075 0.080 0.030 0.070 0.200 0.030 0.060
Baseline Acc/% 93.60 93.18 94.72 71.36
Acc.w.o./% 93.56 93.21 91.82 93.81 93.15 93.53 95.00 95.10 94.79 70.32 70.50
Acc.w./% 94.03 93.66 93.45 93.78 93.39 92.91 95.16 95.12 95.09 71.77 71.15

IV-F2 The Influence of $\varepsilon$

The neighborhood radius $\varepsilon$ is the only hyper-parameter in our method. Since $\varepsilon$ restricts the size of the clusters formed in the clustering pruning, we adjust it to control the pruning rate of the CNN. The larger the $\varepsilon$, the fewer the channel clusters in each layer, and the greater the degree of network compression. Under this circumstance, the number of channels, parameters, and FLOPs will decrease, and the accuracy of the compact network will continue to reduce. Here, we compress GoogLeNet on CIFAR-10 to verify the influence of $\varepsilon$ on network pruning. The results are shown in Fig.6. From the figure, it is clear that when the compressing rate is small, the pruning rates of the three indicators change more obviously, and as the compression amplitude increases, the changes gradually become gentle. This is because when $\varepsilon$ is small, the clusters are more sensitive, and a slight increase of $\varepsilon$ has a great impact on the clustering results and the pruning rate. With the increase of $\varepsilon$, the clustering tends to be stable, and then increasing $\varepsilon$ has a smaller effect on the network pruning rate.

Fig.7 presents the influence of $\varepsilon$ on the number of channels before and after pruning on CIFAR-10. From top to bottom are VGG-16, ResNet-56, ResNet-110, and GoogLeNet. As $\varepsilon$ increases, the network compressing rate gradually rises. As can be seen, the number of channels in each layer does not always decline as the network pruning rate increases. For example, when $\varepsilon$=0.020, the pruning rate of the tenth layer of VGG-16 is lower than that with $\varepsilon$=0.010. It reveals that when the overall compressing rate of the network rises, increasing the number of channels in several layers can even improve the performance of the compact network. These layers are more important for feature extraction in image classification tasks when the network is compressed remarkably. Meanwhile, it further demonstrates that ACP can find an optimal compact structure with superior performance according to different pruning rates.

IV-F3 The Influence of Different Similarity Criteria

We use the cosine distance to measure the similarity of the feature maps and guide clustering. In this section, we evaluate other similarity measurements for feature maps. Specifically, the Cosine Distance, Euclidean Distance, Manhattan Distance, and Chebyshev Distance are selected to cluster and prune VGG-16 on CIFAR-10. We analyze the impact of the four similarity measures on clustering pruning and the experimental results are depicted in Fig.8. When we use the Cosine Distance, the compression rate of the network gradually increases as $\varepsilon$ rises, and the number of channels in all layers constantly decreases. When the Euclidean Distance is applied, the number of channels of the first seven layers remains unchanged with pruning, and only that of the last three layers reduces. Even as the last three layers are almost reduced to one channel, the previous seven layers are still not pruned. For the Manhattan Distance, only the number of channels of the last three layers keeps decreasing, while the first nine layers are unchanged. Finally, for the Chebyshev Distance, the changes in the last six layers are more obvious, but the first seven layers are still almost unpruned. We conjecture this is because the dimension of the feature maps close to the input is higher than that close to the output. When the Cosine Distance is adopted to measure the similarity of feature maps, it is not affected by their dimension. Even for high-dimension features, the Cosine Distance is still accurate and effective. The other three measures are more affected by the dimension of the features. This experiment fully demonstrates the effectiveness of the Cosine Distance in measuring the similarity of feature maps and its flexibility for channel pruning.
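For completeness, the four measures compared above can be computed on a pair of averaged feature maps as in the illustrative snippet below; how each non-cosine distance would be thresholded by $\varepsilon$ inside DBSCAN is not specified in the text and is left out here.

```python
# Illustrative computation of the four similarity/distance measures on two
# averaged feature maps of the same layer (flattened to vectors).
import numpy as np

def similarity_measures(m_i, m_j):
    a, b = m_i.ravel(), m_j.ravel()
    cosine    = np.abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    euclidean = np.linalg.norm(a - b)
    manhattan = np.abs(a - b).sum()
    chebyshev = np.abs(a - b).max()
    return cosine, euclidean, manhattan, chebyshev
```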

Refer to caption
Figure 8: The influence of different similarity criteria for clustering pruning of VGG-16 on CIFAR-10.

IV-F4 The Influence of PSO

To verify the effect of PSO, we further compare the performance of the compact network with and without optimization. As can be seen from TABLE IX, when ResNet-56 is compressed on CIFAR-10 with $\varepsilon$=0.035, the accuracy without optimizing is 0.03% higher than that with optimizing, and when $\varepsilon$ is 0.080, the accuracy is 0.62% higher without optimization. In the other nine groups of experiments, the performance with swarm intelligence searching is better than that of direct clustering pruning. On CIFAR-10, when $\varepsilon$ is 0.035, the accuracy of the compressed VGG-16 with PSO is even 1.63% higher than that without optimization. This experiment adequately verifies the rationality of the proposed automatic channel pruning scheme based on clustering and swarm intelligence optimization. Although the performance of the subnetwork obtained by clustering pruning may be poor, it shrinks the initializing and searching space of the structure candidates and relieves the optimizing pressure of the swarm intelligence optimization. Finally, after several iterations, the performance of the compact network is significantly improved, which is close to the baseline or even better.

IV-F5 The Effectiveness of ACP

In this section, we compare the proposed ACP with the following six methods: Slimming [26], WLP [13], He et al. [14], SFP [51], HRank [30], and ABCPruner [57]. Among them, Slimming applies sparsity-inducing regularization to the scaling factors of the BN layers and then prunes the channels with smaller scaling factors according to the compression ratio. WLP trims unimportant connections with smaller weights according to a preset trimming rate. He et al. utilize a channel selection method based on LASSO regression to delete redundant channels. SFP applies the L2-norm to evaluate the importance of filters and discards the useless filters and the related channels. HRank adopts the rank of the feature maps to evaluate the significance of channels and then compresses the network. The above five methods are all structured pruning methods. ABCPruner formulates network compression as an optimization problem and finds the optimal pruned structure via the artificial bee colony algorithm.
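As a concrete illustration of two of these criteria (a sketch of the ranking step only, not the full pipelines of [26] or [51]), the snippet below scores channels by the BN scaling factor |γ| in the style of Slimming and scores filters by their weight L2-norm in the style of SFP, then selects the lowest-scoring fraction for pruning.

```python
import torch
import torch.nn as nn

# Ranking-step illustration of two criteria mentioned above (not the full pipelines):
# Network Slimming scores channels by the BN scaling factor |gamma|, while SFP scores
# filters by the L2-norm of their weights; the lowest-scoring fraction is pruned.
conv = nn.Conv2d(16, 32, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(32)
prune_ratio = 0.5
n_prune = int(32 * prune_ratio)

# Slimming-style criterion. Note: a freshly initialized BN has all gammas equal to 1,
# so meaningful scores come from a model trained with sparsity regularization on gamma.
slimming_scores = bn.weight.detach().abs()
slimming_pruned = torch.argsort(slimming_scores)[:n_prune]

# SFP-style criterion: filters with the smallest L2-norm are (softly) zeroed each epoch.
sfp_scores = conv.weight.detach().flatten(1).norm(p=2, dim=1)
sfp_pruned = torch.argsort(sfp_scores)[:n_prune]

print("Slimming-style pruned channels:", slimming_pruned.tolist())
print("SFP-style pruned filters:", sfp_pruned.tolist())
```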

Figure 9: Comparison of different pruning methods for ResNet-56 on CIFAR-10.

To achieve a fair comparison, we prune ResNet-56 on CIFAR-10 with the above methods under the same environment and parameter settings on an 8 GB GTX 1070 GPU. The accuracy loss of each pruning scheme at different pruned percentages of parameters and FLOPs is depicted in Fig.9. In general, as the network compression rate rises, the accuracy of the compact networks obtained by all seven approaches tends to decline. When the pruning rates of parameters and FLOPs reach 30% and 40%, respectively, ACP yields the best pruned performance, which is even 0.5% higher than the vanilla network. Both ABCPruner and our ACP draw on the idea of neural architecture search, and both ultimately outperform the other five structured pruning methods, which indicates that neural architecture search can guide and improve network compression. As the compression rate increases, the accuracy drop of the network pruned by ACP is consistently smaller than that of the other methods. The results validate that automatic channel pruning is competitive for compressing and accelerating CNNs.

IV-F6 Generalization Ability on Detection Tasks

As described above, ACP is effective for image classification. To further analyze the generalization of our method, we conduct experiments on object detection. Specifically, we use the VGG-16 pruned on CIFAR-10 as the backbone network of SSD on PASCAL VOC. The training images of PASCAL VOC 2007 and 2012 are combined for training, and the 4963 testing images of PASCAL VOC 2007 are used to evaluate the performance. We tabulate the pruning rates of parameters and FLOPs of SSD and the loss of mean average precision (mAP) in TABLE X. The results indicate that SSD has significant redundancy on PASCAL VOC. When discarding 35.41% of the parameters and 35.85% of the FLOPs, the mAP even increases by 0.79%. Furthermore, pruning 61.13% of the parameters and 66.44% of the FLOPs changes the performance only slightly, with a 0.40% mAP drop, while significantly decreasing the storage and computational requirements. This experiment shows that our ACP also generalizes well to object detection.

V Conclusion

In this work, we propose a novel automatic channel pruning method for compressing and accelerating CNNs. We use the cosine distance between feature maps to cluster and prune channels. Additionally, the PSO algorithm is introduced to iteratively search for the optimal compact network, starting from a candidate population initialized with the clustering-pruned substructure. The compressed model is then retrained to mitigate the accuracy loss from pruning. We also analyze the impact of different similarity measures on clustering and show that the cosine distance can measure the similarity of feature maps of different dimensions. Extensive experiments show that ACP is comparable with state-of-the-art pruning methods in both accuracy and the pruning rates of parameters and FLOPs. On CIFAR-10, when discarding more than 50% of the parameters and FLOPs of VGGNet, ResNet, and GoogLeNet, their performance even improves. For ResNet-50 on ILSVRC-2012, our method achieves more than 60% compression with only 0.89% and 0.53% drops in Top-1 and Top-5 accuracy, respectively. Moreover, the ablation analysis shows that ACP is also applicable to other real-world tasks and networks.

In the future, we will integrate the channel pruning method with other compression schemes such as quantization. Furthermore, we will consider applying the proposed approach to accelerate other real-world vision tasks and even natural language processing.

TABLE X: Pruning SSD on PASCAL VOC for object detection.
Method | mAP/% | mAP drop/% | Parameters drop/% | FLOPs drop/%
Baseline | 74.13 | - | - | -
ACP (ε=0.040) | 74.92 | -0.79 | 35.41 | 35.85
ACP (ε=0.040) | 74.80 | -0.67 | 50.55 | 49.51
ACP (ε=0.040) | 73.73 | 0.40 | 61.13 | 66.44

Acknowledgment

This work was supported in part by the National Key Research and Development Program under Grant 2018YFC0604404, in part by the National Natural Science Foundation of China under Grant 61806067, and in part by the Anhui Provincial Key Research and Development Program under Grant 202004a05020040.

References

  • [1] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, JUL 2006.
  • [2] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, JUL 28 2006.
  • [3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, MAY 28 2015.
  • [4] A. Krizhevsky, I. Sutskever, and G. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, pp. 84–90, June 2017.
  • [5] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” in NIPS 2015, vol. 28, 2015.
  • [6] X. Huang and S. Belongie, “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization,” in ICCV, 2017, pp. 1510–1519.
  • [7] H. Noh, S. Hong, and B. Han, “Learning Deconvolution Network for Semantic Segmentation,” in ICCV, 2015, pp. 1520–1528.
  • [8] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning,” ACM SIGPLAN NOTICES, vol. 49, no. 4, pp. 269–283, APR 2014.
  • [9] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv preprint arXiv:1704.04861, 2017.
  • [10] P. Wang, Q. Hu, Z. Fang, C. Zhao, and J. Cheng, “Deepsearch: A fast image search framework for mobile devices,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 14, no. 1, Dec. 2017.
  • [11] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks,” in ECCV, vol. 9908, 2016, pp. 525–542.
  • [12] G. Hinton, O. Vinyals, and J. Dean, “Distilling the Knowledge in a Neural Network,” arXiv preprint arXiv:1503.02531, 2015.
  • [13] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both Weights and Connections for Efficient Neural Networks,” in NIPS, vol. 28, 2015.
  • [14] Y. He, X. Zhang, and J. Sun, “Channel Pruning for Accelerating Very Deep Neural Networks,” in ICCV, 2017, pp. 1398–1406.
  • [15] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient convnets,” in ICLR, 2017.
  • [16] H. Li, C. Ma, W. Xu, and X. Liu, “Feature statistics guided efficient filter pruning,” in IJCAI, 2020, pp. 2619–2625.
  • [17] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis, “Nisp: Pruning networks using neuron importance score propagation,” in CVPR, 2018, pp. 9194–9203.
  • [18] Y. He, J. Lin, Z. Liu, H. Wang, L.-J. Li, and S. Han, “Amc: Automl for model compression and acceleration on mobile devices,” in ECCV, 2018, pp. 815–832.
  • [19] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” in ICLR, 2017.
  • [20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015, pp. 1–9.
  • [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, “Imagenet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
  • [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single Shot MultiBox Detector,” in ECCV, vol. 9905, 2016, pp. 21–37.
  • [25] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge: A Retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, JAN 2015.
  • [26] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang, “Learning Efficient Convolutional Networks through Network Slimming,” in ICCV, 2017, pp. 2755–2763.
  • [27] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” in IJCAI, 2018, pp. 2234–2240.
  • [28] S. Chen and Q. Zhao, “Shallowing deep networks: Layer-wise pruning based on feature representations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 12, pp. 3048–3056, 2019.
  • [29] Y. He, P. Liu, Z. Wang, Z. Hu, and Y. Yang, “Filter pruning via geometric median for deep convolutional neural networks acceleration,” in CVPR, 2019, pp. 4340–4349.
  • [30] M. Lin, R. Ji, Y. Wang, Y. Zhang, B. Zhang, Y. Tian, and L. Shao, “Hrank: Filter pruning using high-rank feature map,” in CVPR, 2020, pp. 1526–1535.
  • [31] Z. Wang, W. Hong, Y.-P. Tan, and J. Yuan, “Pruning 3D Filters For Accelerating 3D ConvNets,” IEEE Transactions on Multimedia, vol. 22, no. 8, pp. 2126–2137, AUG 2020.
  • [32] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14, 1967, pp. 281–297.
  • [33] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, 1996, pp. 103–114.
  • [34] M. Ester, H. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 1996, pp. 226–231.
  • [35] E. Schubert, J. Sander, M. Ester, H. P. Kriegel, and X. Xu, “Dbscan revisited, revisited: Why and how you should (still) use dbscan,” ACM Trans. Database Syst., vol. 42, no. 3, Jul. 2017.
  • [36] M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander, “OPTICS: ordering points to identify the clustering structure,” in Proceedings ACM SIGMOD International Conference on Management of Data, 1999, pp. 49–60.
  • [37] N. Lee, T. Ajanthan, and P. H. S. Torr, “Snip: single-shot network pruning based on connection sensitivity,” in ICLR, 2019.
  • [38] S. Park, J. Lee, S. Mo, and J. Shin, “Lookahead: A far-sighted alternative of magnitude-based pruning,” in ICLR, 2020.
  • [39] C. Zhao, B. Ni, J. Zhang, Q. Zhao, W. Zhang, and Q. Tian, “Variational convolutional neural network pruning,” in CVPR, 2019, pp. 2780–2789.
  • [40] X. Gao, Y. Zhao, L. Dudziak, R. D. Mullins, and C. Xu, “Dynamic channel pruning: Feature boosting and suppression,” in ICLR, 2019.
  • [41] Y. He, Y. Ding, P. Liu, L. Zhu, H. Zhang, and Y. Yang, “Learning filter pruning criteria for deep convolutional neural networks acceleration,” in CVPR, 2020, pp. 2006–2015.
  • [42] S. Gao, F. Huang, J. Pei, and H. Huang, “Discrete model compression with resource constraint for deep neural networks,” in CVPR, 2020, pp. 1896–1905.
  • [43] S. Guo, Y. Wang, Q. Li, and J. Yan, “DMCP: differentiable markov channel pruning for neural networks,” in CVPR, 2020, pp. 1536–1544.
  • [44] J. Luo and J. Wu, “Autopruner: An end-to-end trainable filter pruning method for efficient deep model inference,” Pattern Recognit., vol. 107, p. 107461, 2020.
  • [45] Z. Wang, F. Li, G. Shi, X. Xie, and F. Wang, “Network pruning using sparse learning and genetic algorithm,” Neurocomputing, vol. 404, pp. 247–256, 2020.
  • [46] A. Colorni, M. Dorigo, and V. Maniezzo, “Distributed optimization by ant colonies,” in Toward a Practice of Autonomous Systems. Proceedings of the First European Conference on Artificial Life, 1992, pp. 134–142.
  • [47] R. Eberhart and J. Kennedy, “A new optimizer using particle swarm theory,” in MHS’95. Proceedings of the Sixth International Symposium on Micro Machine and Human Science, 1995, pp. 39–43.
  • [48] Y. Shi and R. Eberhart, “A modified particle swarm optimizer,” in 1998 IEEE International Conference on Evolutionary Computation Proceedings, 1998, pp. 69–73.
  • [49] D. Karaboga, “An idea based on honey bee swarm for numerical optimization,” Technical report-tr06, Erciyes university, engineering faculty, computer …, Tech. Rep., 2005.
  • [50] Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell, “Rethinking the value of network pruning,” in ICLR, 2019.
  • [51] Y. He, G. Kang, X. Dong, Y. Fu, and Y. Yang, “Soft filter pruning for accelerating deep convolutional neural networks,” in IJCAI, 2018, pp. 2234–2240.
  • [52] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discrimination-aware channel pruning for deep neural networks,” in NeurIPS, 2018, pp. 883–894.
  • [53] X. Ruan, Y. Liu, C. Yuan, B. Li, W. Hu, Y. Li, and S. Maybank, “Edp: An efficient decomposition and pruning scheme for convolutional neural network compression,” IEEE Transactions on Neural Networks and Learning Systems, pp. 1–15, 2020.
  • [54] T. Li, B. Wu, Y. Yang, Y. Fan, Y. Zhang, and W. Liu, “Compressing convolutional neural networks via factorized convolutional filters,” in CVPR, 2019, pp. 3977–3986.
  • [55] H. Peng, J. Wu, S. Chen, and J. Huang, “Collaborative channel pruning for deep networks,” in ICML, vol. 97, 2019, pp. 5113–5122.
  • [56] P. Molchanov, A. Mallya, S. Tyree, I. Frosio, and J. Kautz, “Importance estimation for neural network pruning,” in CVPR, 2019, pp. 11264–11272.
  • [57] M. Lin, R. Ji, Y. Zhang, B. Zhang, Y. Wu, and Y. Tian, “Channel pruning via automatic structure search,” in IJCAI, 2020, pp. 673–679.
  • [58] S. Lin, R. Ji, C. Yan, B. Zhang, L. Cao, Q. Ye, F. Huang, and D. S. Doermann, “Towards optimal structured CNN pruning via generative adversarial learning,” in CVPR, 2019, pp. 2790–2799.
  • [59] J. Luo, H. Zhang, H. Zhou, C. Xie, J. Wu, and W. Lin, “Thinet: Pruning CNN filters for a thinner net,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 10, pp. 2525–2538, 2019.