
A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP

Yucheng Zhao*†1   Guangting Wang*†1   Chuanxin Tang*2   Chong Luo2   Wenjun Zeng2   Zheng-Jun Zha1
1University of Science and Technology of China   2Microsoft Research Asia
*Equal contribution.   †Interns at MSRA.
{lnc, flylight}@mail.ustc.edu.cn   {chutan, cluo, wezeng}@microsoft.com   [email protected]
Abstract

Convolutional neural networks (CNN) are the dominant deep neural network (DNN) architecture for computer vision. Recently, Transformer and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, started to lead new trends as they showed promising results on the ImageNet classification task. In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons. To ensure a fair comparison, we first develop a unified framework called SPACH, which adopts separate modules for spatial and channel processing. Our experiments under the SPACH framework reveal that all structures can achieve competitive performance at a moderate scale. However, they demonstrate distinctive behaviors when the network size scales up. Based on our findings, we propose two hybrid models using convolution and Transformer modules. The resulting Hybrid-MS-S+ model achieves 83.9% top-1 accuracy with 63M parameters and 12.3G FLOPs. It is already on par with SOTA models with sophisticated designs. The code and models are publicly available at https://github.com/microsoft/SPACH.

1 Introduction

Convolutional neural networks (CNNs) have been dominating the computer vision (CV) field since the renaissance of deep neural networks (DNNs). They have demonstrated effectiveness in numerous vision tasks, from image classification [12] and object detection [27] to pixel-based segmentation [11]. Remarkably, despite the huge success of the Transformer structure [37] in natural language processing (NLP) [8], the CV community continued to focus on CNN structures for quite some time.

The Transformer structure finally made its grand debut in CV last year. Vision Transformer (ViT) [9] showed that a pure Transformer applied directly to a sequence of image patches can perform very well on image classification tasks if the training dataset is sufficiently large. DeiT [35] further demonstrated that Transformers can be successfully trained on a typical-scale dataset, such as ImageNet-1K [7], with appropriate data augmentation and model regularization.

Interestingly, before the heat around Transformers dissipated, the structure of multi-layer perceptrons (MLPs) was revived by Tolstikhin et al. in a work called MLP-Mixer [33]. MLP-Mixer is based exclusively on MLPs applied across spatial locations and feature channels. When trained on large datasets, MLP-Mixer attains competitive scores on image classification benchmarks. Its success suggests that neither convolution nor attention is necessary for good performance, and, as the authors hoped, it sparked further research on MLP-based models [20, 26].

However, as the reported accuracy on image classification benchmarks continues to be pushed up by new network designs from various camps, no conclusion can be drawn as to which structure among CNN, Transformer, and MLP performs best or is most suitable for vision tasks. This is partly because the pursuit of high scores leads to multifarious tricks and exhaustive parameter tuning, so network structures cannot be fairly compared in a systematic way. The work presented in this paper fills this gap by conducting a series of controlled experiments on CNN, Transformer, and MLP structures in a unified framework.

We first develop a unified framework called SPACH, as shown in Fig. 1. It is largely adapted from existing Transformer and MLP frameworks, since convolution also fits naturally into this framework and is in general robust to optimization. The SPACH framework contains a plug-and-play module called the mixing block, which can be implemented with convolution layers, Transformer layers, or MLP layers. Aside from the mixing block, all other components of the framework are kept identical while we explore different structures. This is in stark contrast to previous work, which compares network structures embedded in different frameworks that vary greatly in layer cascade, normalization, and other non-trivial implementation details. In fact, we find that these structure-free components play an important role in the final performance of a model, a fact commonly neglected in the literature.

Figure 1: Illustration of the proposed experimental framework named SPACH.

With this unified framework, we design a series of controlled experiments to compare the three network structures. The results show that all three structures can perform well on the image classification task when pre-trained on ImageNet-1K. In addition, each structure has its own distinctive properties, leading to different behaviors when the network size scales up. We also identify several common design choices that contribute substantially to the performance of our SPACH framework. The detailed findings are listed in the following.

  • Multi-stage design is standard in CNN models, but its effectiveness is largely overlooked in Transformer-based or MLP-based models. We find that the multi-stage framework consistently and notably outperforms the single-stage framework no matter which of the three network structures is chosen.

  • Local modeling is efficient and crucial. With only light-weight depth-wise convolutions, the convolution model can achieve performance similar to a Transformer model in our SPACH framework. By adding a local modeling bypass in both the MLP and Transformer structures, a significant performance boost is obtained with a negligible increase in parameters and FLOPs.

  • MLP can achieve strong performance under small model sizes, but it suffers severely from over-fitting when the model size scales up. We believe that over-fitting is the main obstacle that prevents MLP from achieving SOTA performance.

  • Convolution and Transformer are complementary in the sense that convolution structure has the best generalization capability while Transformer structure has the largest model capacity among the three structures. This suggests that convolution is still the best choice in designing lightweight models but designing large models should take Transformer into account.

Based on these findings, we propose two hybrid models of different scales which are built upon convolution and Transformer layers. Experimental results show that, when a sweet point between generalization capability and model capacity is reached, the performance of these straightforward hybrid models is already on par with SOTA models with sophisticated architecture designs.

2 Background

CNN and its variants have dominated the vision domain. During the evolution of CNN models, much useful experience about architecture design has been accumulated. Recently, two other types of architectures, namely Transformer [9] and MLP [33], began to emerge in the vision domain and have shown performance comparable to well-optimized CNNs. These results kindle a spark towards building better vision models beyond CNNs.

Convolution-based vision models. Since the beginning of the deep learning era pioneered by AlexNet [18], the computer vision community has devoted enormous efforts to designing better vision backbones. In the past decade, most work focused on improving the design of CNNs, and a series of networks, including VGG [29], ResNet [12], SENet [15], Xception [2], MobileNet [14, 28], and EfficientNet [31, 32], were designed. They achieve significant accuracy improvements in various vision tasks.

A standard convolution layer learns filters in a 3D space, with two spatial dimensions and one channel dimension. Thus, the learning of spatial correlations and channel correlations is coupled inside a single convolution kernel. In contrast, a depth-wise convolution layer learns only spatial correlations, moving the learning of channel correlations to an additional 1×1 convolution. The fundamental hypothesis behind this design is that cross-channel correlations and spatial correlations are sufficiently decoupled that it is preferable not to map them jointly [2]. Recent work [31, 32] shows that depth-wise convolution can achieve both high accuracy and good efficiency, confirming this hypothesis to some extent. The idea of decoupling spatial and channel correlations is also adopted in vision Transformers. Therefore, this paper employs the spatial-channel decoupling idea in our framework design.
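As a concrete illustration, the sketch below (our own PyTorch example, not code from any of the cited models) pairs a 3×3 depth-wise convolution for spatial correlations with a 1×1 point-wise convolution for channel correlations.

```python
import torch
import torch.nn as nn

# A minimal sketch of decoupling spatial and channel correlations: a 3x3
# depth-wise convolution mixes information per channel across space, and a
# 1x1 (point-wise) convolution mixes information across channels.
class DepthwiseSeparableConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # groups=channels makes the 3x3 kernel act on each channel independently
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3,
                                 padding=1, groups=channels)
        self.channel = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.channel(self.spatial(x))

x = torch.randn(1, 64, 56, 56)
print(DepthwiseSeparableConv(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```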

Transformer-based vision models. With the success of Transformer in natural language processing (NLP) [37, 8], many researchers have started to explore the use of Transformer as a stand-alone architecture for vision tasks. They face two main challenges. First, Transformer operates over a group of tokens, but no natural tokens, analogous to the words in natural language, exist in an image. Second, images have a strong local structure, while the Transformer structure treats all tokens equally and ignores locality. The pioneering work ViT [9] addressed the first challenge by simply dividing an image into non-overlapping patches and treating each patch as a visual token. ViT also revealed that Transformer models trained on large-scale datasets can attain SOTA image recognition performance. However, when the training data is insufficient, ViT does not perform well due to its lack of inductive biases. DeiT [35] mitigates this problem by introducing a regularization and augmentation pipeline on ImageNet-1K.

Swin [21] and Twins [3] propose local ViTs to address the second challenge. They adopt locally-grouped self-attention by computing the standard self-attention within non-overlapping windows. The local mechanism not only leads to performance improvement thanks to the reintroduction of locality, but also brings substantial improvements in memory and computational efficiency. The pyramid structure thus becomes feasible again for vision Transformers.

There has been explosive development in the design of Transformer-based vision models. Since this paper is not intended to review the progress of vision Transformers, we only briefly introduce some closely related Transformer models. CPVT [4] and CvT [39] introduce convolution into Transformer blocks, bringing the desired translation-invariance properties into the ViT architecture. CaiT [36] introduces a LayerScale approach to enable effective training of deeper ViT networks; it also discovers that class-attention layers built on top of the ViT network offer more effective processing than the class embedding. LV-ViT [17] proposes a bag of training techniques to build a strong baseline for vision Transformers. LeViT [10] proposes a hybrid neural network for fast image classification inference.

MLP-based vision models. Although MLP is not a new concept for the computer vision community, recent progress on MLP-based vision models surprisingly demonstrates, both conceptually and technically, that simple architectures can achieve competitive performance with CNNs or Transformers [33]. The pioneering work MLP-Mixer proposed a Mixer architecture using channel-mixing MLPs and token-mixing MLPs to communicate between different channels and spatial locations (tokens), respectively. It achieves promising results when trained on a large-scale dataset (i.e., JFT [30]). ResMLP [34] builds a similar MLP-based model with a deeper architecture. ResMLP does not need large-scale datasets and achieves accuracy/complexity trade-offs on ImageNet-1K comparable to Transformer-based models. FF [23] shows that simply replacing the attention layer in ViT with an MLP layer applied over the patch dimension can achieve moderate performance on ImageNet classification. gMLP [20] proposes a gating mechanism on MLP and suggests that self-attention is not a necessary ingredient for scaling up machine learning models.

3 A Unified Experimental Framework

In order to fairly compare the three network structures, we need a unified framework that excludes other performance-affecting factors. Since recent MLP-based networks already share a similar framework with Transformer-based networks, we build our unified experimental framework based on them and include CNN-based networks in the same framework.

Model | SPACH-XXS | SPACH-XS | SPACH-S
Conv | C=384, R=2.0, N=12 | C=384, R=2.0, N=24 | C=512, R=3.0, N=24
Transformer | C=192, R=2.0, N=12 | C=384, R=2.0, N=12 | C=512, R=3.0, N=12
MLP | C=384, R=2.0, N=12 | C=384, R=2.0, N=24 | C=512, R=3.0, N=24
Conv-MS | C=64, R=2.0, N_s={2,2,6,2} | C=96, R=2.0, N_s={3,4,12,3} | C=128, R=3.0, N_s={3,4,12,3}
Transformer-MS | C=32, R=2.0, N_s={2,2,6,2} | C=64, R=2.0, N_s={3,4,12,3} | C=96, R=3.0, N_s={3,4,12,3}
MLP-MS | C=64, R=2.0, N_s={2,2,6,2} | C=96, R=2.0, N_s={3,4,12,3} | C=128, R=3.0, N_s={3,4,12,3}
Table 1: SPACH and SPACH-MS model variants. C: feature dimension, R: expansion ratio of the MLP in \mathcal{F}_{c}, N: number of mixing blocks of SPACH, N_s: number of mixing blocks in each stage of SPACH-MS.

3.1 Overview of the SPACH Framework

We build our experimental framework with reference to ViT [9] and MLP-Mixer [33]. Fig. 1(a) shows the single-stage version of the SPACH framework, which is used for our empirical study. The architecture is very simple and consists mainly of a cascade of mixing blocks, plus some necessary auxiliary modules, such as patch embedding, global average pooling, and a linear classifier. Fig. 1(b) shows the details of the mixing block. Note that the spatial mixing and channel mixing are performed in consecutive steps. The name SPACH for our framework is coined to emphasize the serial structure of SPAtial and CHannel processing.

We also design a multi-stage variant, referred to as SPACH-MS, as shown in Fig. 1(c). Multi-stage processing is an important mechanism in CNN-based networks to improve performance. Unlike the single-stage SPACH, which processes the image at a low resolution after down-sampling the input by a large factor, SPACH-MS keeps a high resolution in the initial stage of the framework and progressively performs down-sampling. Specifically, our SPACH-MS contains four stages with down-sampling ratios of 4, 8, 16, and 32, respectively. Each stage contains N_s mixing blocks, where s is the stage index. Due to the extremely high computational cost of Transformer and MLP on high-resolution feature maps, we implement the mixing blocks in the first stage with convolutions only. The feature dimension within a stage remains constant and is multiplied by a factor of 2 after each down-sampling.

Let I\in\mathbb{R}^{3\times h\times w} denote an input image, where 3 is the number of RGB channels and h\times w is the spatial size. Our SPACH framework first passes the input image through a patch embedding layer, identical to the one in ViT, to convert I into patch embeddings X_{p}\in\mathbb{R}^{C\times\frac{h}{p}\times\frac{w}{p}}. Here p denotes the patch size, which is 16 in the single-stage implementation and 4 in the multi-stage implementation. After the cascade of mixing blocks, a classification head implemented by a linear layer is used for supervised pre-training.
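As an illustration, the patch embedding step can be realized as a single strided convolution. The sketch below is our own minimal example (the channel width 384 and the omission of normalization are assumptions), using the single-stage patch size p=16.

```python
import torch
import torch.nn as nn

# A minimal sketch of the patch embedding step: a convolution whose kernel
# size and stride both equal the patch size p maps I (3 x h x w) to patch
# embeddings X_p (C x h/p x w/p).
patch_embed = nn.Conv2d(in_channels=3, out_channels=384, kernel_size=16, stride=16)

image = torch.randn(1, 3, 224, 224)   # I
x_p = patch_embed(image)              # X_p
print(x_p.shape)                      # torch.Size([1, 384, 14, 14])
```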

We list the hyper-parameters used in different model configurations in Table 1. Three model sizes are designed for each variant of SPACH, namely SPACH-XXS, SPACH-XS, and SPACH-S, by controlling the number of blocks, the number of channels, and the expansion ratio of the channel-mixing MLP \mathcal{F}_{c}. The model size, theoretical computational complexity (FLOPs), and empirical throughput are presented in Section 4. We measure throughput using one P100 GPU.

3.2 Mixing Block Design

Figure 2: Three implementations of the spatial mixing module using convolution, Transformer, and MLP, respectively. P.E. denotes positional encoding, implemented by convolution in SPACH.

Mixing blocks are the key components of the SPACH framework. As shown in Fig. 1(b), an input feature X\in\mathbb{R}^{C\times H\times W}, where C and H\times W denote the channel and spatial dimensions, is first processed by a spatial mixing function \mathcal{F}_{s} and then by a channel mixing function \mathcal{F}_{c}. \mathcal{F}_{s} focuses on aggregating context information from different spatial locations, while \mathcal{F}_{c} focuses on channel information fusion. Denoting the output as Y, we can formulate a mixing block as:

Y=\mathcal{F}_{c}(\mathcal{F}_{s}(X)). (1)

Following ViT [9], we use an MLP with appropriate normalization and a residual connection to implement \mathcal{F}_{c}. The MLP here can also be viewed as a 1×1 convolution (also known as point-wise convolution [2]), which is a special case of regular convolution. Note that \mathcal{F}_{c} only performs channel fusion and does not explore any spatial context.
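The sketch below illustrates one possible PyTorch realization of such a mixing block; the normalization choice and the use of 1×1 convolutions for the channel MLP are our assumptions for illustration, not the released SPACH code.

```python
import torch
import torch.nn as nn

# A minimal sketch of one mixing block, Y = F_c(F_s(X)): a pluggable spatial
# mixing module F_s followed by a point-wise channel MLP F_c with
# normalization and a residual connection.
class MixingBlock(nn.Module):
    def __init__(self, dim: int, spatial_mixing: nn.Module, expansion: float = 2.0):
        super().__init__()
        hidden = int(dim * expansion)
        self.spatial_mixing = spatial_mixing        # F_s: conv / attention / MLP
        self.norm = nn.GroupNorm(1, dim)            # stand-in for the paper's normalization
        self.channel_mlp = nn.Sequential(           # F_c: MLP as two 1x1 convolutions
            nn.Conv2d(dim, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.spatial_mixing(x)                  # F_s(X)
        return x + self.channel_mlp(self.norm(x))   # F_c with residual connection
```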

The spatial mixing function \mathcal{F}_{s} is the key to implementing different architectures. As shown in Fig. 2, we implement three structures using convolution, self-attention, and MLP. The common components are normalization and a residual connection. Specifically, the convolution structure is implemented by a 3×3 depth-wise convolution, as channel mixing is handled separately in the subsequent step. For the Transformer structure, the original design contains a positional embedding module, but recent research suggests that absolute positional embedding breaks translation invariance, which is undesirable for images. In view of this, and inspired by recent vision Transformer designs [4, 39], we introduce a convolutional positional encoding (CPE) as a bypass in each spatial mixing module. The CPE module has negligible parameters and FLOPs. For the MLP-based network, the pioneering work MLP-Mixer does not use any positional embedding, but we empirically find that adding the very lightweight CPE significantly improves model performance, so we apply the same treatment to MLP as to Transformer.
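The following sketch shows, under our own simplifying assumptions (normalization omitted, module and parameter names ours), how the three spatial mixing functions and the depth-wise-convolution CPE bypass could look in PyTorch.

```python
import torch
import torch.nn as nn

# Minimal sketches of the three spatial mixing functions F_s on a C x H x W
# feature map. A 3x3 depth-wise convolution serves as the convolutional
# positional encoding (CPE) bypass for the Transformer and MLP variants.
def cpe(dim):
    return nn.Conv2d(dim, dim, 3, padding=1, groups=dim)

class ConvMixing(nn.Module):                     # convolution: local, static weights
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
    def forward(self, x):
        return x + self.dwconv(x)

class AttnMixing(nn.Module):                     # self-attention: global, dynamic weights
    def __init__(self, dim, heads=4):
        super().__init__()
        self.pos = cpe(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    def forward(self, x):
        b, c, h, w = x.shape
        x = x + self.pos(x)                      # CPE bypass
        tokens = x.flatten(2).transpose(1, 2)    # (B, HW, C)
        out, _ = self.attn(tokens, tokens, tokens)
        return x + out.transpose(1, 2).reshape(b, c, h, w)

class MlpMixing(nn.Module):                      # MLP over spatial locations: global, static weights
    def __init__(self, dim, num_tokens):
        super().__init__()
        self.pos = cpe(dim)
        self.mlp = nn.Sequential(nn.Linear(num_tokens, num_tokens), nn.GELU(),
                                 nn.Linear(num_tokens, num_tokens))
    def forward(self, x):
        b, c, h, w = x.shape
        x = x + self.pos(x)                      # CPE bypass
        return x + self.mlp(x.flatten(2)).reshape(b, c, h, w)
```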

The three implementations of \mathcal{F}_{s} have distinctive properties, as listed in Table 2. First, the convolution structure involves only local connections, which makes it computationally efficient. Second, the self-attention structure uses dynamic weights for each input instance, which increases model capacity. Moreover, it has a global receptive field, which enables information to flow freely across different positions [37]. Third, the MLP structure has a global receptive field just as the self-attention structure does, but it does not use dynamic weights. In summary, these three properties found in different architectures are all desirable and may have a positive influence on model performance or efficiency. Convolution and self-attention have complementary properties, so there is potential to build a hybrid model that combines all desirable properties. The MLP structure, in contrast, appears inferior to self-attention in this analysis.

Properties | Convolution | Self-Attention | MLP
Sparse Connectivity | ✓ | |
Dynamic Weight | | ✓ |
Global Receptive Field | | ✓ | ✓
Table 2: Three desired properties in network design are seen in different network structures.

4 Empirical Studies on Mixing Blocks

In this section, we design a series of controlled experiments to compare the three network structures. We first introduce the experimental settings in Section 4.1, and then present our main findings in Sections 4.2, 4.3, 4.4, and 4.5.

4.1 Datasets and Training Pipelines

We conduct experiments on ImageNet-1K (IN-1K) [7] image classification, which has 1K classes. The training set contains 1.28M images and the validation set contains 50K images. Top-1 accuracy on a single crop is reported. Unless otherwise indicated, we use an input resolution of 224×224. Most of our training settings are inherited from DeiT [35]. We employ an AdamW [22] optimizer for 300 epochs with a cosine-decay learning rate scheduler and 20 epochs of linear warm-up. The weight decay is 0.05, and the initial learning rate is 0.005 × batchsize/512. 8 GPUs with a mini-batch of 128 per GPU are used in training, resulting in a total batch size of 1024. We use exactly the same data augmentation and regularization configurations as DeiT, including Rand-Augment [5], random erasing [42], Mixup [41], CutMix [40], stochastic depth [16], and repeated augmentation [1, 13]. We use the same training pipeline for all compared models. The implementation is built upon PyTorch [24] and the timm library [38].
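A minimal sketch of this optimization setup is given below; the model is a placeholder and the per-iteration training loop (data loading, DeiT-style augmentation, loss computation) is elided.

```python
import math
import torch

# A minimal sketch of the recipe described above: AdamW with weight decay
# 0.05, initial lr = 0.005 * batchsize / 512, 20 epochs of linear warm-up,
# and cosine decay over 300 epochs.
model = torch.nn.Linear(384, 1000)               # stand-in for a SPACH model
batch_size, total_epochs, warmup_epochs = 1024, 300, 20
base_lr = 0.005 * batch_size / 512

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=0.05)

def lr_lambda(epoch):
    if epoch < warmup_epochs:                    # linear warm-up
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... one training epoch over ImageNet-1K with DeiT-style augmentation ...
    scheduler.step()
```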

Scale | Model | Single-Stage: #param. / FLOPs / Throughput (image/s) / IN-1K Top-1 acc. | Multi-Stage: #param. / FLOPs / Throughput (image/s) / IN-1K Top-1 acc.
XXS | Conv | 8M / 1.4G / 1513 / 72.1 | 5M / 0.7G (-0.7G) / 1576 / 73.3 (+1.2)
XXS | Trans | 4M / 0.9G / 1202 / 68.0 | 2M / 0.5G (-0.4G) / 1558 / 65.4 (-2.6)
XXS | MLP | 9M / 1.8G / 980 / 74.1 | 6M / 0.9G (-0.9G) / 1202 / 74.9 (+0.8)
XS | Conv | 15M / 2.8G / 770 / 77.3 | 17M / 2.8G (-0.0G) / 602 / 80.1 (+2.8)
XS | Trans | 15M / 3.1G / 548 / 78.4 | 14M / 3.1G (-0.0G) / 441 / 80.1 (+1.7)
XS | MLP | 17M / 3.5G / 503 / 78.5 | 19M / 3.4G (-0.1G) / 438 / 80.7 (+2.2)
S | Conv | 39M / 7.4G / 374 / 80.1 | 44M / 7.2G (-0.2G) / 328 / 81.6 (+1.5)
S | Trans | 33M / 6.7G / 328 / 81.7 | 40M / 7.6G (+0.9G) / 246 / 82.9 (+1.2)
S | MLP | 41M / 8.7G / 272 / 78.6 | 46M / 8.2G (-0.5G) / 254 / 82.1 (+3.5)
Table 3: Model performance of SPACH and SPACH-MS at three network scales.
[Two panels plotting ImageNet Top-1 accuracy (%) against #params (M) and throughput (image/s), respectively, for Conv, Trans, MLP, Conv-MS, Trans-MS, and MLP-MS models.]
Figure 3: The multi-stage models (named with -MS suffix) always achieve a better performance than their single-stage counterparts.
Model | #param. | FLOPs | Throughput (image/s) | IN-1K Top-1 acc.
Trans-MS-S | 40M | 7.6G | 246 | 82.9
Trans-MS-S^- | 40M | 7.6G | 259 | 82.3
MLP-MS-S | 46M | 8.2G | 254 | 82.1
MLP-MS-S^- | 46M | 8.2G | 274 | 80.1
Table 4: Both the Transformer structure and the MLP structure benefit from local modeling at a very small computational cost. The superscript ^- indicates the model without local modeling.

4.2 Multi-Stage is Superior to Single-Stage

Multi-stage design is standard in CNN models, but it is largely overlooked in Transformer-based or MLP-based models. Our first finding is that multi-stage design should always be adopted in vision models no matter which of the three network structures is chosen.

Table 3 compares the image classification performance of the multi-stage and single-stage frameworks. For all three network scales and all three network structures, the multi-stage framework consistently achieves a better complexity-accuracy trade-off. For ease of comparison, the changes in FLOPs and accuracy are highlighted in Table 3. Most of the multi-stage models are designed to have slightly lower computational cost, yet they achieve higher accuracy than the corresponding single-stage models. An accuracy loss of 2.6 points is observed for the Transformer model at the XXS scale, but this is understandable, as the multi-stage model happens to have only half the parameters and FLOPs of the corresponding single-stage model.

In addition, Fig. 3 shows how the image classification accuracy changes with the size of model parameters and model throughput. Despite the different trends observed for different network structures, the multi-stage models always outperform their single-stage counterparts.

This finding is consistent with the results reported in recent work. Both Swin Transformer [21] and Twins [3] adopt a multi-stage framework and achieve stronger performance than the single-stage framework DeiT [35]. Our empirical study suggests that the use of the multi-stage framework can be an important reason.

4.3 Local Modeling is Crucial

Although it has been pointed out in many previous works [39, 4, 19, 6, 21] that local modeling is crucial for vision models, we show in this subsection just how amazingly efficient local modeling can be.

In our empirical study, the spatial mixing block of the convolution structure is implemented by a 3×3 depth-wise convolution, which is a typical local modeling operation. It is so light-weight that it contributes only 0.3% of the model parameters and 0.5% of the FLOPs. However, as Table 3 and Fig. 3 show, this structure achieves competitive performance compared with the Transformer structure in the XXS and XS configurations.
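A rough back-of-the-envelope calculation (our own arithmetic, with illustrative feature-map and channel sizes rather than the exact model configuration) shows why the 3×3 depth-wise convolution is so cheap relative to the channel-mixing MLP in the same block:

```python
# Compare the cost of a 3x3 depth-wise convolution bypass with the cost of
# the point-wise channel MLP F_c at one layer (illustrative sizes).
C, H, W, R = 384, 14, 14, 2.0

dw_params = C * 3 * 3 + C                         # depth-wise 3x3 weights + bias
mlp_params = 2 * C * int(C * R) + C + int(C * R)  # two point-wise layers of F_c

dw_flops = H * W * C * 3 * 3                      # multiply-adds of the bypass
mlp_flops = 2 * H * W * C * int(C * R)            # multiply-adds of F_c

print(f"depth-wise conv: {dw_params} params, {dw_flops / 1e6:.2f} MFLOPs")
print(f"channel MLP    : {mlp_params} params, {mlp_flops / 1e6:.2f} MFLOPs")
```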

It is due to the sheer efficiency of the 3×3 depth-wise convolution that we propose to use it as a bypass in both the MLP and Transformer structures. The increase in model parameters and inference FLOPs is almost negligible, but the locality of the models is greatly strengthened. To demonstrate how local modeling helps the performance of the Transformer and MLP structures, we carry out an ablation study which removes this convolution bypass from the two structures.

Table 4 shows the performance comparison between models with and without local modeling. The two models we pick are the top performers in Table 3 under the multi-stage framework at scale S. We can clearly see that the convolution bypass only slightly decreases throughput, but brings a notable accuracy increase to both models. Note that the convolution bypass serves as the convolutional positional embedding in Trans-MS-S, so we bring back the standard positional embedding as in ViT [9] for Trans-MS-S^-. For MLP-MS-S^-, we follow the practice in MLP-Mixer and do not use any positional embedding. This experiment confirms the importance of local modeling and suggests the use of a 3×3 depth-wise convolution bypass in any newly designed network structure.

4.4 A Detailed Analysis of MLP

Model | #param. | FLOPs | Throughput (image/s) | IN-1K Top-1 acc.
MLP-S | 41M | 8.7G | 272 | 78.6
+Shared | 39M | 8.7G | 274 | 80.2
MLP-MS-S | 46M | 8.2G | 254 | 82.1
+Shared | 45M | 8.2G | 244 | 82.5
Table 5: The performance of MLP models is greatly boosted when weight sharing is adopted to alleviate over-fitting.

Due to the excessive number of parameters, MLP models suffer severely from over-fitting. We believe that over-fitting is the main obstacle for MLP to achieve SOTA performance. In this part, we discuss two mechanisms which can potentially alleviate this problem.

One is the use of the multi-stage framework. We have already shown in Table 3 that the multi-stage framework brings gains, and such gains are even more prominent for larger MLP models. In particular, the MLP-MS-S model achieves a 3.5-point accuracy gain over the single-stage model MLP-S. We believe this owes to the strong generalization capability of the multi-stage framework. Fig. 4 shows how the test accuracy increases as the training loss decreases. Over-fitting can be observed when the test accuracy starts to flatten. These results also lead to a very promising baseline for MLP-based models. Without bells and whistles, the MLP-MS-S model achieves 82.1% ImageNet Top-1 accuracy, which is 5.7 points higher than the best result reported by MLP-Mixer [33] when ImageNet-1K is used as training data.

The other mechanism is parameter reduction through weight sharing. We apply weight sharing to the spatial mixing function \mathcal{F}_{s}. For the single-stage model, all N mixing blocks use the same \mathcal{F}_{s}, while for the multi-stage model, each stage uses the same \mathcal{F}_{s} for its N_s mixing blocks. We present the results of the S models in Table 5. The shared-weight variants, denoted by ”+Shared”, achieve higher accuracy with almost the same model size and computation cost. Although they are still inferior to Transformer models, their performance is on par with or even better than convolution models. Fig. 4 confirms that using shared weights in the MLP-MS model further delays the appearance of over-fitting signs. Therefore, we conclude that MLP-based models remain competitive if the over-fitting problem can be solved or alleviated.
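The sketch below illustrates this weight-sharing scheme; the stand-in spatial mixing module, the helper name, and the per-block structure are ours, not the released code.

```python
import torch.nn as nn

# A minimal sketch of weight sharing on F_s: one module instance is reused by
# all mixing blocks of a stage, while each block keeps its own channel MLP.
def build_shared_stage(dim: int, num_blocks: int, expansion: float = 3.0) -> nn.ModuleList:
    shared_fs = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # stand-in F_s
    blocks = nn.ModuleList()
    for _ in range(num_blocks):
        blocks.append(nn.ModuleDict({
            "spatial": shared_fs,                               # same parameters in every block
            "channel": nn.Sequential(                           # per-block F_c
                nn.Conv2d(dim, int(dim * expansion), 1),
                nn.GELU(),
                nn.Conv2d(int(dim * expansion), dim, 1),
            ),
        }))
    return blocks

stage = build_shared_stage(dim=512, num_blocks=24)
# All 24 blocks point to one F_s, so its parameters are stored (and updated) once.
assert stage[0]["spatial"] is stage[23]["spatial"]
```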

[Plot of ImageNet Top-1 accuracy (%) versus training loss for MLP, MLP-MS, and MLP-MS-Shared.]
Figure 4: Illustration of the over-fitting problem in MLP-based models. Both multi-stage framework and weight sharing alleviate the problem.

4.5 Convolution and Transformer are Complementary

We find that convolution and Transformer are complementary in the sense that convolution structure has the best generalization capability while Transformer structure has the largest model capacity among the three structures we investigated.

Fig. 5 shows that, before the performance of Conv-MS saturates, it achieves higher test accuracy than Trans-MS at the same training loss. This shows that convolution models generalize better than Transformer models. In particular, when the training loss is relatively large, the convolution models show great superiority over Transformer models. This suggests that convolution is still the best choice in designing lightweight vision models.

On the other hand, both Fig. 3 and Fig. 5 show that Transformer models achieve higher accuracy than the other two structures when we increase the model size and allow for higher computational cost. Recall that we have discussed three properties of network architectures in Section 3.2. It is now clear that the sparse connectivity helps to increase generalization capability, while dynamic weight and global receptive field help to increase model capacity.

[Plot of ImageNet Top-1 accuracy (%) versus training loss for Conv-MS and Trans-MS.]
Figure 5: Conv-MS has a better generalization capability than Trans-MS as it achieves a higher test accuracy at the same training loss before the model saturates.

5 Hybrid Models

As discussed in Sections 3.2 and 4.5, the convolution and Transformer structures have complementary characteristics and have the potential to be combined in a single model. Based on this observation, we construct hybrid models at the XS and S scales from these two structures. The procedure we use to construct hybrid models is rather simple. We take a multi-stage convolution-based model as the base model and replace selected layers with Transformer layers. Considering the local modeling capability of convolutions and the global modeling capability of Transformers, we tend to perform such replacement in the later stages of the model. The details of layer selection in the two hybrid models are listed as follows, and a construction sketch is given after the list.

  • Hybrid-MS-XS: It is based on Conv-MS-XS. The last ten layers in Stage 3 and the last two layers in Stage 4 are replaced by Transformer layers. Stage 1 and 2 remain unchanged.

  • Hybrid-MS-S: It is based on Conv-MS-S. The last two layers in Stage 2, the last ten layers in Stage 3, and the last two layers in Stage 4 are replaced by Transformer layers. Stage 1 remains unchanged.
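The sketch below illustrates this replacement procedure; attribute names such as `stages` and `dim`, and the block factories, are assumptions for illustration, not the released code.

```python
import copy
import torch.nn as nn

# A minimal sketch of building a hybrid model: start from a multi-stage
# convolution model and replace the last few mixing blocks of selected stages
# with Transformer blocks.
def make_hybrid(conv_model, make_transformer_block, replace_per_stage):
    """replace_per_stage: e.g. {2: 10, 3: 2} for Hybrid-MS-XS (0-indexed stages)."""
    hybrid = copy.deepcopy(conv_model)
    for stage_idx, num_replace in replace_per_stage.items():
        stage = hybrid.stages[stage_idx]            # assumed: an nn.Sequential of mixing blocks
        dim = stage[-1].dim                         # assumed attribute on each block
        for i in range(len(stage) - num_replace, len(stage)):
            stage[i] = make_transformer_block(dim)  # swap conv block -> Transformer block
    return hybrid

# Hybrid-MS-XS: last 10 blocks of Stage 3 and last 2 blocks of Stage 4 (1-indexed):
# hybrid_xs = make_hybrid(conv_ms_xs, transformer_block_factory, {2: 10, 3: 2})
```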

In order to unleash the full potential of hybrid models, we further adopt the deep patch embedding layer (PEL) implementation suggested in LV-ViT [17]. Different from the default PEL, which uses one large (16×16) convolution kernel, the deep PEL uses four convolution kernels with kernel sizes {7, 3, 3, 2}, strides {2, 1, 1, 2}, and channel numbers {64, 64, 64, C}. By using smaller kernels and more convolution layers, deep PEL helps a vision model explore the locality inside a single patch embedding vector. We mark models with deep PEL as ”Hybrid-MS-*+”.
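A minimal sketch of such a deep PEL is given below; the placement of normalization and activation between the convolutions is our assumption.

```python
import torch
import torch.nn as nn

# A minimal sketch of the deep patch embedding layer (PEL): four convolutions
# with kernel sizes {7, 3, 3, 2}, strides {2, 1, 1, 2}, and output channels
# {64, 64, 64, C}, giving an overall stride of 4 as in the multi-stage models.
def deep_patch_embedding(out_dim: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
        nn.Conv2d(64, out_dim, kernel_size=2, stride=2),
    )

x = torch.randn(1, 3, 224, 224)
print(deep_patch_embedding(128)(x).shape)   # torch.Size([1, 128, 56, 56])
```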

Table 6 compares our hybrid models with some of the state-of-the-art models based on CNN, Transformer, or MLP. All listed models are trained on ImageNet-1K. Within the section of our models, we can see that the hybrid models achieve a better size-performance trade-off than pure convolution or Transformer models. Hybrid-MS-XS achieves 82.4% top-1 accuracy with 28M parameters, which is higher than Conv-MS-S with 44M parameters and only slightly lower than Trans-MS-S with 40M parameters. In addition, Hybrid-MS-S achieves 83.7% top-1 accuracy with 63M parameters, a 0.8-point gain over Trans-MS-S.

The Hybrid-MS-S+ model we propose achieves 83.9% top-1 accuracy with 63M parameters. This is higher than the accuracy achieved by the SOTA models Swin-B and CaiT-S36, which have model sizes of 88M and 68.2M, respectively. The FLOPs of our model are also lower than those of these two models. We believe Hybrid-MS-S can serve as a strong yet simple baseline for future research on the architecture design of vision models.

Model | #param. | FLOPs | IN-1K Top-1 acc.
CNN
RegNetY-4G [25] | 21M | 4.1G | 80.0
RegNetY-8G [25] | 39M | 8.0G | 81.7
RegNetY-16G [25] | 84M | 16.0G | 82.9
Transformer
ViT-B/16* [9] | 86M | - | 77.9
DeiT-S [35] | 22M | 4.6G | 79.8
DeiT-B [35] | 86M | 17.5G | 81.8
Swin-T [21] | 29M | 4.5G | 81.3
Swin-S [21] | 50M | 8.7G | 83.0
Swin-B [21] | 88M | 15.4G | 83.5
CaiT-XS24 [36] | 26.6M | 5.4G | 81.8
CaiT-S36 [36] | 68.2M | 13.9G | 83.3
CvT-13 [39] | 20M | 4.5G | 81.6
CvT-21 [39] | 32M | 7.1G | 82.5
MLP
FF-Base [23] | 62M | - | 74.9
Mixer-B/16 [33] | 79M | - | 76.4
ResMLP-S24 [34] | 30M | 6.0G | 79.4
ResMLP-B24 [34] | 45M | 23.0G | 81.0
gMLP-S [20] | 20M | 4.5G | 79.4
gMLP-B [20] | 73M | 15.8G | 81.6
Ours
Conv-MS-XS | 17M | 2.8G | 80.1
Conv-MS-S | 44M | 7.2G | 81.6
Trans-MS-XS | 14M | 3.1G | 80.1
Trans-MS-S | 40M | 7.6G | 82.9
Hybrid-MS-XS | 28M | 4.5G | 82.4
Hybrid-MS-XS+ | 28M | 5.6G | 82.8
Hybrid-MS-S | 63M | 11.2G | 83.7
Hybrid-MS-S+ | 63M | 12.3G | 83.9
Table 6: Comparison of different models on ImageNet-1K classification. Compared models are grouped according to network structure, and our models are listed last. Most models are pre-trained with 224×224 images, except ViT-B/16*, which uses 384×384 images.

6 Conclusion

The objective of this work is to understand how the emerging Transformer and MLP structures compare with CNNs in the computer vision domain. We first build a simple and unified framework, called SPACH, that can use CNN, Transformer, or MLP as plug-and-play components. Under the SPACH framework, we discover, with a little surprise, that all three network structures are similarly competitive in terms of the accuracy-complexity trade-off, although they show distinctive properties when the network scales up. In addition to the analysis of specific network structures, we also investigate two important design choices, namely the multi-stage framework and local modeling, which are largely overlooked in previous work. Finally, inspired by this analysis, we propose two hybrid models that achieve SOTA performance on ImageNet-1K classification without bells and whistles.

Our work also raises several questions worth exploring. First, realizing the fact that the performance of MLP-based models is largely affected by over-fitting, is it possible to design a high-performing MLP model that is not subject to over-fitting? Second, current analyses suggest that neither convolution nor Transformer is the optimal structure across all model sizes. What is the best way to fuse these two structures? Last but not least, do better visual models exist beyond the known structures including CNN, Transformer, and MLP?

References

  • [1] Maxim Berman, Hervé Jégou, Andrea Vedaldi, Iasonas Kokkinos, and Matthijs Douze. Multigrain: a unified image embedding for classes and instances. CoRR, abs/1902.05509, 2019.
  • [2] François Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, pages 1800–1807. IEEE Computer Society, 2017.
  • [3] Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Twins: Revisiting spatial attention design in vision transformers. CoRR, abs/2104.13840, 2021.
  • [4] Xiangxiang Chu, Zhi Tian, Bo Zhang, Xinlong Wang, Xiaolin Wei, Huaxia Xia, and Chunhua Shen. Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882, 2021.
  • [5] Ekin Dogus Cubuk, Barret Zoph, Jon Shlens, and Quoc Le. Randaugment: Practical automated data augmentation with a reduced search space. In NeurIPS, 2020.
  • [6] Stéphane d’Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 2286–2296. PMLR, 2021.
  • [7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE Computer Society, 2009.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics, 2019.
  • [9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR. OpenReview.net, 2021.
  • [10] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet’s clothing for faster inference. CoRR, abs/2104.01136, 2021.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988. IEEE Computer Society, 2017.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
  • [13] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. Augment your batch: Improving generalization through instance repetition. In CVPR, pages 8126–8135. IEEE, 2020.
  • [14] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • [15] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132–7141. IEEE Computer Society, 2018.
  • [16] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q. Weinberger. Deep networks with stochastic depth. In ECCV (4), volume 9908 of Lecture Notes in Computer Science, pages 646–661. Springer, 2016.
  • [17] Zihang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. arXiv preprint arXiv:2104.10858, 2021.
  • [18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • [19] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. CoRR, abs/2104.05707, 2021.
  • [20] Hanxiao Liu, Zihang Dai, David R. So, and Quoc V. Le. Pay attention to mlps. CoRR, abs/2105.08050, 2021.
  • [21] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. CoRR, abs/2103.14030, 2021.
  • [22] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017.
  • [23] Luke Melas-Kyriazi. Do you even need attention? A stack of feed-forward layers does surprisingly well on imagenet. CoRR, abs/2105.02723, 2021.
  • [24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
  • [25] Ilija Radosavovic, Raj Prateek Kosaraju, Ross B. Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, pages 10425–10433. IEEE, 2020.
  • [26] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. CoRR, abs/2107.00645, 2021.
  • [27] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell., 39(6):1137–1149, 2017.
  • [28] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520. IEEE Computer Society, 2018.
  • [29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [30] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843–852. IEEE Computer Society, 2017.
  • [31] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 6105–6114. PMLR, 2019.
  • [32] Mingxing Tan and Quoc V. Le. Efficientnetv2: Smaller models and faster training. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 10096–10106. PMLR, 2021.
  • [33] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, and Alexey Dosovitskiy. Mlp-mixer: An all-mlp architecture for vision. CoRR, abs/2105.01601, 2021.
  • [34] Hugo Touvron, Piotr Bojanowski, Mathilde Caron, Matthieu Cord, Alaaeldin El-Nouby, Edouard Grave, Armand Joulin, Gabriel Synnaeve, Jakob Verbeek, and Hervé Jégou. Resmlp: Feedforward networks for image classification with data-efficient training. CoRR, abs/2105.03404, 2021.
  • [35] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR, 2021.
  • [36] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. CoRR, abs/2103.17239, 2021.
  • [37] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
  • [38] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [39] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. CoRR, abs/2103.15808, 2021.
  • [40] Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Seong Joon Oh, Youngjoon Yoo, and Junsuk Choe. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, pages 6022–6031. IEEE, 2019.
  • [41] Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR (Poster). OpenReview.net, 2018.
  • [42] Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. In AAAI, pages 13001–13008. AAAI Press, 2020.