IC Networks: Remodeling the Basic Unit for Convolutional Neural Networks

Junyi An¹ Fengshan Liu¹ Jian zhao² Furao shen¹¹¹1Contact Author
¹Department of Computer Science and Technology, Nanjing University, Nanjing, China
²School of Electronic Science and Engineering, Nanjing University, Nanjing, China
{junyian, liufengshan}@smail.nju.edu.cn
{frshen, jianzhao}@nju.edu.cn

Abstract

Convolutional neural network (CNN) is a class of artificial neural networks widely used in computer vision tasks. Most CNNs achieve excellent performance by stacking certain types of basic units. In addition to increasing the depth and width of the network, designing more effective basic units has become an important research topic. Inspired by the elastic collision model in physics, we present a general structure which can be integrated into the existing CNNs to improve their performance. We term it the “Inter-layer Collision” (IC) structure. Compared to the traditional convolution structure, the IC structure introduces nonlinearity and feature recalibration in the linear convolution operation, which can capture more fine-grained features. In addition, a new training method, namely weak logit distillation (WLD), is proposed to speed up the training of IC networks by extracting knowledge from pre-trained basic models. In the ImageNet experiment, we integrate the IC structure into ResNet-50 and reduce the top-1 error from $22.38\%$ to $21.75\%$ , which also catches up the top-1 error of ResNet-100 ( $21.75\%$ ) with nearly half of FLOPs.

1 Introduction

Convolutional neural networks (CNNs) have made great achievements in the field of computer vision. The success of AlexNet Krizhevsky et al. (2012) and VGGNet Simonyan and Zisserman (2015) shows the superiority of deep networks, leading to a trend of building larger and deeper networks. However, this method is inefficient in improving network performance. On the one hand, increasing the depth and width brings a huge computational burden and causes a series of problems, such as model degradation. On the other hand, because the relationship between different hyper-parameters is complicated, the increased number of hyper-parameters makes it more difficult to design deep networks. Therefore, the research focus in recent years has gradually shifted to improving the representation ability of basic network units in order to design more efficient CNN architectures.

The convolutional layer is a basic unit of CNN. By stacking a series of convolutional layers together with non-linear activation layers, CNNs are able to produce image representations that capture abstract features and cover global theoretical receptive fields. For each convolutional layer, the filter can capture a certain type of feature in the input. However, not all features contribute to a given task. Recent studies have shown that the network can obtain more powerful representation ability by feature recalibration, which emphasizes informative features and suppresses less useful ones Bell et al. (2016); Hu et al. (2018). Besides, there is evidence that introducing the non-linear kernel method in the convolutional layer can improve the generalization ability of the network Wang et al. (2019); Zoumpourlis et al. (2017). However, kernel methods may cause overfitting by complicating the networks and introducing a high amount of calculation.

We argue that the convolutional layer can also benefit from a simple non-linear representation, and propose a structure of introducing a non-linear operation in the convolutional layer, which also performs feature recalibration to enhance the representation ability of convolutional layers. Starting from the most basic neural network structure where a linear transformation and a non-linear activation function are applied to the input successively, we use a non-linear operation to optimize the representation of the linear part. The proposed structure divides the input space into multiple subspaces which can represent different linear transformations, providing more patterns for succedent activation function. We call this structure the inter-layer collision (IC) neuron, since it is inspired by the elastic collision model that we use to mimic the way information is transmitted between adjacent layers.

We build the IC convolutional layer by combining the IC neuron with the convolution operation. The structure of the IC layer is depicted in Figure 1, where the term $F_{conv}$ denotes the convolution operation mapping the input $\mathbf{X}$ to the feature maps $\mathbf{U}$ . An IC branch used to divide the input space can be easily combined with the $F_{conv}$ branch. When the input $\mathbf{X}$ passes though the IC branch, the $F_{ex}$ operation first extract local features by aggregating input features in local regions, which represents the local distribution of channel-wise input features. Then, local features are recalibrated by the adjustment operation $F_{ad}$ to increase flexibility of representation. Finally, the local features pass though a $F_{com}$ operation which combines two branches to generate extra linear transformation. The final output $\mathbf{U}^{\prime}$ of the IC layer with the same spatial dimensions ( $H\times W$ ) as $\mathbf{U}$ can be fed directly into subsequent layers of the network. In addition, the IC branch only introduces little computational cost, because its operations use lightweight structures.

We construct a set of IC networks by integrating the IC layers into existing models. Our experiments show that the IC networks have significant improvements compared to basic models with little additional computational burden. However, training from scratch may take a lot of time and suffer from complex hyper-parameter settings, especially when the basic models are large. From Kirkpatrick et al. (2017), we find that the basic models may have similar parameter configurations to the corresponding IC models. Therefore, we propose a training method, namely the weak logit distillation (WLD), which distills the knowledge of pre-trained basic models. By combining the optimal basic models with a loss using weak soft targets, we show that the WLD only needs a few training rounds to successfully achieve or even exceed the result of training from scratch. In summary, our contributions are:

•

We propose a novel structure called the IC layer that can be used to build CNN architectures. We prove its effectiveness on enhancing representation and integrate it into some existing models. The experiments show the universality and superiority of the IC layer.
•

We propose a method called WLD that can guide the learning of IC networks. It is a novel idea whose goal is to extract knowledge from smaller teacher models through the knowledge distillation (KD) method. The experiments show that by the WLD, the IC networks achieve higher performance with a shorter training time. Especially, the IC-ResNet-50 integrates the IC layer into ResNet-50 and reduces the top-1 error from $22.38\%$ to $21.75\%$ , which also achieves the top-1 error of ResNet-100 ( $21.75\%$ ).

Refer to caption — Figure 1: The structure of a complete IC layer.

2 Related Work

Effective computational unit. Designing effective basic CNN units was a significant topic, since it reduces the difficulty of designing architectures by directly using basic units in existing models. Hu et al. (2018) proposed the SE block that used feature recalibration to improve the performance of existing networks. However, the SE blocks were usually combined with building blocks rather than more basic structures. Wang et al. (2019) introduced the non-linear kernel method to convolutional layers to improve representation. Although that work bypassed the explicit calculation of the high dimensional features via a kernel trick, the complexity is obviously increased. In contrast to these work. Our proposed IC branch introduces the non-linear representation by a lightweight structure. Besides, it is combined with a single convolutional layer that can be applied to a wider range of CNN architectures.

Knowledge Distillation. Recent KD methods used the feature distillation Heo et al. (2019) or self-supervision Xu et al. (2020) to extract the deep knowledge from a larger teacher model. Different from them, our training method WLD distill knowledge from a smaller pre-trained model while tolerating the gap between the teacher and student models. WLD is novel because it used the KD method to solve a different task: learning the optimal representation quickly and precisely when new components are introduced into the CNNs. The loss of WLD refers to the common KD loss proposed in Hinton et al. (2015).

3 The Inter-layer Collision Network

In this section, we first show how the IC structure works and its combination with existing CNNs. Then, we introduce the WLD method to optimize IC networks. Finally, we analyze the influence of IC structure on model complexity.

3.1 Optimization of non-linear representation of the MP model

The MP neuron McCulloch and Pitts (1990) is the most commonly used neuron model, which can be formulated as $\mathrm{f}(\sum_{i=1}^{n}\ w_{i}x_{i}+b)$ , where a linear transformation and a non-linear activation function $\mathrm{f}(\cdot)$ are applied to the $n$ -D input successively. $w_{i}$ and $b$ denote the learnable weight and the bias, respectively. To facilitate the non-linear representation of neural models, we propose a new neuron model by replacing the linear transformation with a non-linear one:

	$\displaystyle\centering y$	$\displaystyle=\mathrm{f}\left(\sum_{i=1}^{n}\ w_{i}x_{i}+\sigma\left(\sum_{i=1}^{n}\ (w_{i}-1)x_{i}+b_{1}\right)+b_{2}\right)$		(1)
		$\displaystyle=\mathrm{f}\left(\sum_{i=1}^{n}\ w_{i}x_{i}+\sigma\left(\sum_{i=1}^{n}\ w_{i}x_{i}-x_{sum}+b_{1}\right)+b_{2}\right),$		(1)

where $\mathrm{f}(\cdot)$ denotes a non-linear activation function and $\sigma(\cdot)$ is a rectified linear unit (ReLU) Nair and Hinton (2010). $b_{1}$ and $b_{2}$ are two independent biases used to adjust the center of the model distribution. $x_{sum}=\sum_{i=1}^{n}\ x_{i}$ denotes the summation of all the input features. We term the structure defined in eq. (1) the inter-layer collision (IC) neuron, since this idea is inspired by the physical elastic collision model where the speeds of two objects are $\frac{2m_{1}}{m_{1}+m_{2}}v$ and $(\frac{2m_{1}}{m_{1}+m_{2}}-1)v$ after collision (Details in Appendix C). We treat $\frac{2m_{1}}{m_{1}+m_{2}}$ as learnable weight $w_{i}$ and introduce some mathematical adjustments.

Through introducing the $\sigma(\cdot)$ operation, the neuron model can definitely increase the number of non-linear patterns. We use the term $H=\sum_{i=1}^{n}\ (w_{i}-1)x_{i}$ to represent a hyperplane ( $H=0$ ) in an $n$ -dimensional Euclidean space, which divides eq. (1) as follows:

y=\begin{cases}\mathrm{f}\left(2\sum_{i=1}^{n}\ w_{i}x_{i}-\sum_{i=1}^{n}\ x_{i}\right)&\text{if }H\geq 0\\ \mathrm{f}\left(\sum_{i=1}^{n}\ w_{i}x_{i}\right)&\text{if }H<0\end{cases}.

(2)

Here we omit the bias term. Intuitively, this IC neuron has a stronger representation ability than the MP neuron, since it can produce two different linear representations before the activation operation $\mathrm{f}(\cdot)$ . To facilitate understanding, we use 2-D data as input and ReLU as the activation function to show how the two kinds of neurons generate non-linear boundaries. Figure 2(a)(b) shows that the single MP neuron and the IC neuron can divide the $2$ -D Euclidean space into multiple subspaces, each of which can represent a fixed linear transformation. We observe that a single IC neuron can divide one more subspace to represent a different linear transformation. Furthermore, we map the XOR problem which is a typical linear inseparable problem onto a $2$ -D plane to explain the difference between the two kinds of neurons. For the ReLU MP neuron, it is clear that at least two neurons are required to solve the XOR problem. However, the non-linear boundary of the IC neuron is a broken line, providing a single neuron possibility to solve the XOR problem. Figure 2(c) gives a solution with a single IC neuron, dividing the whole plane into three spaces where $(0,0),(1,1)$ are represented by zero and $(1,0),(0,1)$ are in two spaces with similar representation.

Although neuron using the hyperplane $H=0$ can increase the number of non-linear patterns, the calculation of the hyperplane is limited by weight $w_{i}$ , making it inflexible to divide different subspaces. Hence, the subspaces divided by the hyperplane are usually not optimal thus the weights easily converge to a local minimum. To add more representation flexibility to the hyperplane $H=0$ , we improve eq. (1) by introducing an adjustment weight $w^{\prime}$ :

	$\displaystyle\centering y$	$\displaystyle=\mathrm{f}\left(\sum_{i=1}^{n}\ w_{i}x_{i}+\sigma\left(\sum_{i=1}^{n}\ w_{i}x_{i}-w^{\prime}x_{sum}+b_{1}\right)+b_{2}\right)$		(3)
		$\displaystyle=\mathrm{f}\left(\mathbf{w}^{T}\mathbf{x}+\sigma\left(\mathbf{w}^{T}\mathbf{x}-w^{\prime}\times\mathbf{1}^{T}\mathbf{x}+b_{1}\right)+b_{2}\right),$		(3)

where $\mathbf{w}$ and $\mathbf{x}$ represent weight vector and input vector, respectively. $\mathbf{1}$ is an all-one vector. $w^{\prime}$ can be regarded as the intrinsic weight of one neuron, which is different from the weight $w_{i}$ connecting two neurons. Then there are two independent parameters in the calculation of the hyperplane $H=0$ : $w^{\prime}$ is used to change the direction of the hyperplane, and the bias $b_{1}$ is used to shift the hyperplane in the whole space. We give a theoretical analysis of the adjustable range of $w^{\prime}$ :

Theorem 1.

$\mathbf{w}=(w_{1},\dots,w_{n})^{T}$ and $\mathbf{1}=(1,\dots,1)^{T}$ are two n- $D$ vectors. By adjusting $w^{\prime}$ , the hyperplane $\sum_{i=1}^{n}\ (w_{i}-w^{\prime})x_{i}=0$ can be rotated $\pi$ $rad$ around the cross product of $\mathbf{w}$ and $\mathbf{1}$ when the two vectors are linearity independent.

Theorem 1 implies that $\sum_{i=1}^{n}\ (w_{i}-w^{\prime})x_{i}=0$ can almost represent all the hyperplanes parallel to the cross product of $\mathbf{w}$ and $\mathbf{1}$ , providing flexible strategies for dividing spaces. Besides, the IC neuron keeps a significant advantage over the MP neuron in that the $\sum_{i=1}^{n}\ w_{i}x_{i}$ term can flexibly represent any linear combinations in the input space. We provide the theorem as below:

Theorem 2.

In a closed $n$ -D input space, for any given MP neuron $(w_{1},\dots,w_{n},b)$ , there is always an IC neuron $(w_{1},\dots,w_{n},w^{\prime},b_{1},b_{2})$ that can completely represent this MP neuron.

The proof of Theorem 1 and 2 are provided in the Appendix A. In summary, by adjusting the relationship between $\mathbf{w}$ and $w^{\prime}$ , the IC neuron can retain the representation ability of the MP neuron and flexibly segment linear representation spaces for some complex distributions.

3.2 Application on convolutional structure

The convolutional kernel, a filter used to capture the latent features in input signals, can be regarded as a combination of the MP model and a sliding window. To simplify the notation, we omit the activation operator and bias term. The output feature $\mathbf{u}_{i}\in\mathbb{R}^{H\times W}$ of the standard convolutional layer is given by:

\displaystyle\mathbf{u}_{i}=\mathbf{w}_{i}\otimes\mathbf{X},

(4)

where $\mathbf{X}\in\mathbb{R}^{H^{\prime}\times W^{\prime}\times C^{\prime}}$ is the input feature map, $\mathbf{w}_{i}\in\mathbb{R}^{k\times k\times C^{\prime}}$ is a filter kernel that belongs to $\mathbf{W}_{i}=[\mathbf{w}_{1},\dots,\mathbf{w}_{C}]$ and $\otimes$ is used to denote the convolution operator. To apply the IC neuron model to eq. (4), we replace the kernel $\mathbf{w}_{i}$ with an IC kernel $[\mathbf{w}_{i},w^{\prime}_{i}]$ ,

	$\displaystyle\mathbf{u}_{i}$	$\displaystyle=[\mathbf{w}_{i},w^{\prime}_{i}]\otimes\mathbf{X}$		(5)
		$\displaystyle=\mathbf{w}_{i}\otimes X+\sigma(\mathbf{w}_{i}\otimes\mathbf{X}-w^{\prime}_{i}\times(\mathbf{1}\otimes\mathbf{X})),$		(5)

where $\mathbf{1}$ is an all-one tensor with the same size as $\mathbf{w}_{i}$ . The input feature map may contain hundreds of channels and $\mathbf{1}\otimes\mathbf{X}$ term will mix features in all the channels with the same proportion. Therefore, we distinguish different features by a grouped convolutional trick:

\displaystyle\mathbf{u}_{i}=\mathbf{w}_{i}\otimes\mathbf{X}+\sigma(\mathbf{w}_{i}\otimes\mathbf{X}-(\mathbf{1}\hat{\otimes}\mathbf{X})\cdot\mathbf{w}^{\prime}_{i}),

(6)

where $\hat{\otimes}$ denotes the depthwise convolution Chollet (2017) which separates $\mathbf{1}$ and $\mathbf{X}$ into $C^{\prime}$ independent channels and performs channel-wise convolution. The adjustment weight $\mathbf{w}^{\prime}_{i}$ becomes a vector with size $C^{\prime}$ , recalibrating the features of $\mathbf{1}\hat{\otimes}\mathbf{X}$ by channel-wise multiplication ( $\cdot$ ). Note that all the convolution operators ( $\otimes$ and $\hat{\otimes}$ ) in eq. (6) share the same stride and padding. The structure in eq. (6) which we term the IC layer has two main advantages compared to the traditional convolutional layer:

•

According to Section 3.1, the single kernel $[\mathbf{w}_{i},\mathbf{w}^{\prime}_{i}]$ can represent more linear patterns before activation, which enhancing the representation of convolutional layer.
•

The information $\mathbf{1}\hat{\otimes}\mathbf{X}$ contains some low-level features from the previous layer. It helps the filters to learn high-level features faster since it provides an approximate distribution of the pixels in the input feature maps.

The IC layer can be easily integrated into some existing models. Consider the ResNets He et al. (2016) which are commonly used networks as examples. The basic block used by ResNet-18 and ResNet-34 has two $3\times 3$ convolutional layers. The bottleneck block used by deeper ResNets has three convolutional layers ( $1\times 1,3\times 3,1\times 1$ ). Note that the information $\mathbf{1}\hat{\otimes}\mathbf{X}$ equals $\mathbf{X}$ when kernel size is $1\times 1$ , it will not provide extra features for the filter. Therefore, we prefer to replace all the $3\times 3$ layers in building blocks to build IC-ResNets. The combination with other popular models is introduced in Section 4, where we show the universality and superiority of the IC layer.

3.3 Learning with Knowledge Distillation

In order to further understand why the IC layer can capture more fine-grained features, we compare the Grad-CAM visualizations Selvaraju et al. (2017) of the IC models with their basic models. As shown in Figure 3, the IC networks tend to focus on more relevant regions with more object details. More importantly, the features of IC networks and basic networks have some similarities. We think that although the features captured by IC networks have finer texture information, their focus on the image is similar to features captured by basic networks.

(a) original image

(b) ResNet-18

Figure 3: The Grad-CAM visualizations for different models. The images are randomly picked from ImageNet validation set.

Motivated by the observed similarity, we propose a method to guide the learning of IC networks by using the knowledge of pre-trained basic networks. First, we have a basic network B with a pre-trained set of parameters $\mathbf{\theta}^{*}$ and our goal is to train a corresponding IC network IC-B with better performance. According to Theorem 2 and the argument that a network has many configurations of parameters with the same performance Kirkpatrick et al. (2017), we assume that IC-B can have a configuration close to $\mathbf{\theta}^{*}$ . Based on our hypothesis, we load the set $\mathbf{\theta}_{*}$ for IC-B and add a scaling factor $\alpha$ to control the influence of $\sigma(\cdot)$ . The IC layer in IC-B can be initialized as:

\displaystyle\mathbf{u}_{i}=\mathbf{w}_{i}^{*}\otimes\mathbf{X}+\alpha\times\sigma(\mathbf{w}_{i}^{*}\otimes\mathbf{X}-(\mathbf{1}\hat{\otimes}\mathbf{X})\cdot\mathbf{w}^{\prime}_{i}),

(7)

where $\mathbf{w}_{i}^{*}$ is a weight loaded from $\mathbf{\theta}_{*}$ and we use random initialization for $\mathbf{w}^{\prime}_{i}$ . Since the $\mathbf{w}_{i}^{*}\otimes\mathbf{X}$ term can capture the features independently, $\alpha$ is set to zero at the beginning to let IC-B have the same performance as B. After initialization, we fine-tune all the parameters. The $\sigma(\cdot)$ term benefits from the pre-trained information, making the adjustment weight converges quickly. We use the scaling factor $\alpha$ to gradually amplify the influence of $\sigma(\cdot)$ in the training process. There are two strategies to adjust $\alpha$ . The first one is to gradually increase its value manually, but it need to provide accurate hyperparameters. The second is to treat $\alpha$ as a parameter trained with other parameters. Our experience shows that the best value of $\alpha$ is usually in $[0.01,0.04]$ .

To further use the knowledge of B, we refer to the way in KD. We use the basic network B as a teacher model and the IC-B as a student model. The soft targets predicted by a well-optimized teacher model can provide extra information, compared to the ground truth. To obtain the soft targets of B, temperature scaling Hinton et al. (2015) is used to soften the peaky softmax distribution:

\displaystyle p^{i}(x;\tau)=\frac{e^{s_{i}(x)/\tau}}{\sum_{k}e^{s_{k}(x)/\tau}},

(8)

where $x$ is the data sample, $i$ is the category index, $s_{i}(x)$ is the score logit that $x$ obtains on category $i$ , and $\tau$ is the temperature. The knowledge distillation loss $L_{kd}$ measured by KL-divergence as:

\displaystyle L_{kd}=-\tau^{2}\sum_{x\in\mathcal{D}_{x}}\sum_{i=1}^{C}p_{t}^{i}(x;\tau)\log(p_{s}^{i}(x;\tau)),

(9)

where $t$ and $s$ denote the teacher and student models. $C$ is the total number of classes, $\mathcal{D}_{x}$ indicates the dataset. The complete loss function $L$ of training IC-B is a combination of the standard cross-entropy loss $L_{ce}$ and knowledge distillation loss $L_{kd}$ :

\displaystyle L=(1-\lambda)L_{ce}+\lambda\max(L_{ce}-e,0),

(10)

where $\lambda$ is a balancing weight. $e$ is a constant used to increase tolerance the for the gap between the soft targets of B and IC-B.

Model	Top-1 err.	Top-5 err.	GFlops	Params
ResNet-18	30.24 / 27.88	10.92 / 9.42	1.82	11.7M
IC-ResNet-18	28.56 / 26.69	9.80 / 8.56	2.01	12.9M
ResNet-34	26.70 / 24.52	8.58 / 7.46	3.68	21.8M
IC-ResNet-34	25.55 / 23.49	7.90 / 6.86	4.07	24.2M
ResNet-50	23.85 / 22.85	7.13 / 6.71	4.12	25.6M
IC-ResNet-50	23.20 / 21.90	6.72 / 6.08	4.33	26.8M

Table 1: IC-ResNets performance results on ImageNet validation set. The error rates (%) use single-crop/10-crop testing.

Different from the traditional KD method, our method does not extract the knowledge from a deeper network with better performance. Therefore, we do not need the output distributions of the teacher and student models to be exactly equal. The loss term $\max(L_{ce}-e,0)$ allows deviation between $p_{s}^{i}$ and $p_{t}^{i}$ . Besides, during training, we gradually reduce the value of $\lambda$ to make the IC-B reduce the dependence on the teacher network. Eq. (10) provides a strategy to introduce some extra information for training IC networks. Combined with loading pre-trained parameters and fine-tuning, the IC networks achieve high performance after a few learning rounds. We term this learning method the weak logit distillation (WLD) since it weakens the impact of soft targets to reduce the dependence on the teacher networks.

3.4 Parameters and Complexity Analysis

For the standard convolutional layer with $k\times k$ receptive fields, the transformation

\displaystyle\mathbf{X}\in\mathbb{R}^{H^{\prime}\times W^{\prime}\times C^{\prime}}\xrightarrow{conv}\mathbf{U}\in\mathbb{R}^{H\times W\times C}

(11)

needs $k\times k\times C^{\prime}\times C$ parameters. In the IC layer, we calculate the sum of each channel without additional parameters. The weight $\mathbf{W^{\prime}}=[\mathbf{w^{\prime}}_{1},\mathbf{w^{\prime}}_{2},\cdots,\mathbf{w^{\prime}}_{C}]$ adds $1\times 1\times C^{\prime}\times C$ parameters. Therefore, the number of parameters added by the IC layer is only $\frac{1}{k\times k}$ of the original layer.

The IC layer adds a depthwise convolution and a learnable weight $\mathbf{W}^{\prime}$ which can be regarded as a $1\times 1$ convolution. The depthwise convolution is used to calculate the element-wise sum of the input by an all-one kernel. The increased computational complexity is the same as adding a convolutional filter because we only need to do this operation once. Therefore, the increased complexity is $\frac{1}{C}$ of the original layer. The weight $\mathbf{W^{\prime}}$ is a $1\times 1$ convolution which uses less than $\frac{1}{k\times k}$ computation of the $k\times k$ convolution Sifre and Mallat (2014). Therefore, the approximate extra computation is $\frac{1}{C}+\frac{1}{k\times k}$ of the original layer.

4 Experiments

In this section, we investigate the effectiveness of different IC architectures by a series of comparative experiments. Besides, we evaluate the WLD method on training IC networks.

4.1 ImageNet Results

We use the ILSVRC 2012 classification dataset Russakovsky et al. (2015) which consists of more than 1 million color images in 1000 classes divided into 1.28M training images and 50K validation images. We use three versions of ResNets (ResNet-18, ResNet-34, ResNet-50) to build the corresponding IC networks. For a fair comparison, all our experiments on ImageNet are conducted with the same environment setting. The optimizer uses the stochastic gradient descent (SGD) LeCun et al. (1989) method with a weight decay of $10^{-4}$ and a momentum of $0.9$ . The training process is set to $120$ epochs with a batch size of $256$ . The learning rate is initially set to $0.1$ and will be reduced 10 times every $30$ epochs. Besides, all experiments are implemented with the Pytorch Paszke et al. (2019) framework on a server with NVIDIA TITAN Xp GPUs.

Model	Top-1 err.	Top-5 err.
IC-ResNet-18	28.82	9.98
IC-ResNet-34	25.78	8.02
IC-ResNet-50	23.07	6.67
EfficientNet-B0	23.7	6.8
IC-EfficientNet-B0	23.6	6.7
EfficientNet-B1	21.2	5.6
IC-EfficientNet-B1	21.0	5.5
EfficientNet-B2	20.2	5.1
IC-EfficientNet-B2	20.1	5.0

Table 2: The single-crop error rates (%) of WLD results on the ImageNet validation set.

To compare with previous research He et al. (2016); Wang et al. (2019), we apply both the single-crop and 10-crop for testing. We build the IC-ResNet-18, IC-ResNet-34 and IC-ResNet-50 by replacing the whole $3\times 3$ convolutional layers in blocks with $3\times 3$ IC layers. The validation error and FLOPs are reported in Table 1. We observe that the IC-ResNet-18 and IC-ResNet-34 can obviously reduce the 10-crop top-1 error by $1.19\%,1.03\%$ and the top-5 error by $0.86\%,0.60\%$ with a small increase in the calculation ( $10.44\%,10.60\%$ ), validating the effectiveness of the IC layer. The IC-ResNet-50 retains $1\times 1$ convolutional layers in bottleneck blocks. The 10-crop top-1 error is $21.90\%$ and the top-5 error is $6.08\%$ , exceeding ResNet-50 by $0.95\%$ and $0.63\%$ , respectively. Moreover, the extra FLOPs of the IC-ResNet-50 is only $5.10\%$ of the ResNet-50. For the deeper ResNets, we believe that the deeper IC-ResNets can get similar results, since they all use the same block as ResNet-50.

4.1.1 Training with WLD

To evaluate the WLD method, We construct a set of experiments to train the IC-ResNets and three versions of the EfficientNets Tan and Le (2019) (EfficientNet-b0, EfficientNet-b1 and EfficientNet-b2). The training process is set to $30$ epochs and the learning rate whose initial value is $0.001$ will be reduced 10 times every $15$ epochs ( $10$ epochs for EfficientNets). The other hyperparameter settings are: $e$ is $0.005$ ; $\lambda$ initially set to $0.9$ is reduced to $0.1$ . Table 2 shows that WLD achieves the accuracy of training from scratch with fewer training rounds. Remarkably, by training with WLD, the 10-crop result of the IC-ResNet-50 ( $21.75\%$ top-1 and $5.92\%$ top-5 error) achieves the error rate achieved by the deeper ResNet-101 network ( $21.75\%$ top-1 and $6.05\%$ top-5 error) with nearly half of the computational burden ( $4.33$ GFLOPs vs. $7.85$ GFLOPs).

This set of experiments prove our hypothesis mentioned in Section 3.3 and provide a understandable conclusion: although the low learning rate limits the connection weights in the vicinity of pre-trained model, the WLD method can still find optimal adjustment weights to improve the representation of IC networks. Specially, the EfficientNets which come from neural architecture search have difficuly in training from scratch. Through the pre-trained models and WLD method, the corresponding IC-EfficientNets achieve higher performance with a simple hyperparameter configuration.

4.1.2 The effectiveness of $1\times 1$ IC layers

To evaluate the $1\times 1$ IC layers, we integrate them into the IC-ResNet-50 to build the IC-ResNet-50-B by only replacing the first $1\times 1$ layers in bottleneck blocks and the IC-ResNet-50-C by replacing all $1\times 1$ layers. As shown in Table 3, although the IC-ResNet-50-B and IC-ResNet-50-C exceed the IC-ResNet-50, it obviously increases the FLOPs of the model. Besides, we observe that the overfitting in two IC models where the training accuracy is improved significantly but the test accuracy is not. Combined with analysis in Section 3.2, we argue that $1\times 1$ IC layers focus more on improving model capacity rather than introducing new feature information. When the number of channels of $1\times 1$ IC layers is relatively large, the models are more likely to cause overfitting and bring an expensive computational burden.

Model	Top-1 err.	GFlops/Params
IC-ResNet-50	23.07/6.68	4.33/26.8M
IC-ResNet-50-B	23.02/6.66	5.27/31.2M
IC-ResNet-50-C	22.96/6.65	6.10/36.2M

Table 3: Performance results of the IC networks with

1\times 1

IC layers on ImageNet validation set.

4.2 CIFAR-10 Results

We further investigate the universality of the IC layer by integrating it into some other modern architectures. These experiments are conducted on the CIFAR10 dataset, which consists of 60K $32\times 32$ colour images in 10 classes divided into 50K training images and 10K testing images. Each model is trained in $200$ epochs with a batch size of $128$ . The learning rate is initialized to $0.1$ , which will be reduced $10$ times at the $60$ th epoch and the $120$ th epoch. The optimizer settings are the same as in the ImageNet experiment.

We integrate the IC layers into VGGNets (VGG-16 and VGG-19 versions), MobileNet Howard et al. (2017), SENets (SE-ResNet-18 and SE-ResNet-50 versions) and ResNeXt Xie et al. (2017) (2x64d version). Specially, the adjacent convolution layers (a depthwise convolution layer and a pointwise convolutional layer) in MobileNet are treated as one convolutional layer when integrating with the IC layer because there is a close relationship between adjacent layers. The results are listed in Table 4, we observe that the IC layers improve representation of all the basic models. This set of experiments show the universality of the IC layers. Besides, we observe that the IC networks have faster convergence speed than basic models in both the ImageNet and CIFAR-10 experiments. The training curves are shown in Appendix B.

Model	Top-1 err.	Model	Top-1 err.
VGG-16	6.36	SE-ResNet-18	5.08
IC-VGG-16	6.04	IC-SE-ResNet-18	4.85
VGG-19	6.46	SE-ResNet-50	5.10
IC-VGG-19	6.27	IC-SE-ResNet-50	4.61
MobileNet	9.92	ResNext(2x64d)	4.62
IC-MobileNet	9.00	IC-ResNext(2x64d)	4.43

Table 4: Results on CIFAR10 with various IC models. We use our environment settings to reprodurce the baseline results.

framework	backbone	mAP
Faster R-CNN	ResNet-50	79.5
Faster R-CNN	IC-ResNet-50	80.5
Retinanet	ResNet-50	77.3
Retinanet	IC-ResNet-50	78.2

Table 5: Results on PASCAL VOC 2007+2012 test set.

4.3 Object Detection

We further assess the generalization of IC networks on the task of object detection using the PASCAL VOC 2007+2012 detection benchmark Everingham et al. (2010), This dataset consists of about 5K train/val images and 5K test images over 20 object categories. We use the IC-ResNet-50 as the backbone networks to capture the features. Weights are initialized by the parameters of the IC-ResNet-50 trained by WLD on the ImageNet dataset. The detection frameworks that we use are Faster R-CNN Ren et al. (2015) and Retinanet Lin et al. (2017). We use the same configuration for both the IC-ResNet-50 and the ResNet-50, which is described in Chen et al. (2019). We evaluate detection mean Average Precision (mAP) which is the actual metric for object detection. As shown in Table 5, IC-ResNet-50 outperforms ResNet-50 by $1.0\%$ and $0.9\%$ on the Faster R-CNN and Retinanet frameworks, respectively. The ResNet-50 result we report is the same as previous work. Our experiments demonstrate that the IC networks can be easily integrated into the object detection and achieve better performance with negligible additional cost. We believe that IC networks can show their superiority across a broad range of vision tasks and datasets.

5 Conclusion

In this paper, we propose the IC structure that brings non-linearity and feature recalibration to convolution operation. By dividing the input space, the IC structure has a stronger representation ability than the traditional convolution structure. We build the IC networks by integrating the IC structure into the state-of-the-art models. Besides, we propose the WLD method to facilitate the training of IC networks. It is shown that training with WLD can bypass requirement of the complex hyperparameter design and reach convergence quickly. A wide range of experiments show the effectiveness of IC networks across multiple datasets and tasks. Finally, we expect to integrate IC structure into more architectures and further improve the performance of computer vision tasks.

References

Bell et al. [2016] S. Bell, C. L. Zitnick, K. Bala, and R. B. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, pages 2874–2883, 2016.
Chen et al. [2019] K. Chen, J. Q. Wang, J. M. Pang, Y. H. Cao, Y. Xiong, X. X. Li, S. Y. Sun, W. S. Feng, Z. W. Liu, J. R. Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv:1906.07155, 2019.
Chollet [2017] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1251–1258, 2017.
Everingham et al. [2010] M. Everingham, L. Van Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. International journal of computer vision, 88(2):303–338, 2010.
He et al. [2016] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
Heo et al. [2019] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi. A comprehensive overhaul of feature distillation. In ICCV, pages 1921–1930, 2019.
Hinton et al. [2015] G. E. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning and Representation Learning Workshop, 2015.
Howard et al. [2017] A. G. Howard, M. l. Zhu, B. Chen, D. Kalenichenko, W. J. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In CVPR, 2017.
Hu et al. [2018] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141, 2018.
Kirkpatrick et al. [2017] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
Krizhevsky et al. [2012] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.
LeCun et al. [1989] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
Lin et al. [2017] T. Lin, P. Goyal, R. B. Girshick, K. M. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.
McCulloch and Pitts [1990] W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of mathematical biology, 52(1-2):99–115, 1990.
Nair and Hinton [2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
Paszke et al. [2019] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
Ren et al. [2015] S. Q. Ren, K. M. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
Russakovsky et al. [2015] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. H. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015.
Selvaraju et al. [2017] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017.
Sifre and Mallat [2014] L. Sifre and S. Mallat. Rigid-motion scattering for texture classification. arXiv:1403.1687, 2014.
Simonyan and Zisserman [2015] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
Tan and Le [2019] M. X. Tan and Q. V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, volume 97, pages 6105–6114, 2019.
Wang et al. [2019] C. Wang, J. F. Yang, L. H. Xie, and J. S. Yuan. Kervolutional neural networks. In CVPR, pages 31–40, 2019.
Xie et al. [2017] S. N. Xie, R. Girshick, P. Dollár, Z. W. Tu, and K. M. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 1492–1500, 2017.
Xu et al. [2020] G. D. Xu, Z. W. Liu, X. X. Li, and C. C. Loy. Knowledge distillation meets self-supervision. In ECCV, volume 12354, pages 588–604, 2020.
Zoumpourlis et al. [2017] G. Zoumpourlis, A. Doumanoglou, N. Vretos, and P. Daras. Non-linear convolution filters for cnn-based learning. In ICCV, pages 4761–4769, 2017.