
Jiangning Zhang ([email protected])
Xiangtai Li ([email protected])
Yabiao Wang ([email protected])
Chengjie Wang ([email protected])
Yibo Yang ([email protected])
✉ Yong Liu ([email protected])
Dacheng Tao ([email protected])
Equally-contributed first authors.
1 Institute of Cyber-Systems and Control, Advanced Perception on Robotics and Intelligent Learning Lab (APRIL), Zhejiang University, China.
2 School of Artificial Intelligence, Key Laboratory of Machine Perception (MOE), Peking University, China.
3 Youtu Lab, Tencent, China.
4 School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia.

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Jiangning Zhang1,∗    Xiangtai Li2,∗    Yabiao Wang3    Chengjie Wang3    Yibo Yang2    Yong Liu1    Dacheng Tao4
(Received: date / Accepted: date)
Abstract

Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical Evolutionary Algorithm (EA) and derives that both have consistent mathematical formulations. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based Transformer (EAT) block, which consists of three residual parts, i.e., Multi-Scale Region Aggregation (MSRA), Global and Local Interaction (GLI), and Feed-Forward Network (FFN) modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a Task-Related Head (TRH) docked with the transformer backbone to complete final information fusion more flexibly and propose a Modulated Deformable MSA (MD-MSA) to dynamically model irregular locations. Massive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over State-Of-The-Art (SOTA) methods. For example, our Mobile (1.8M), Tiny (6.1M), Small (24.3M), and Base (49.0M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base armed with Mask R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.

Keywords:
Computer vision · Vision transformer · Evolutionary algorithm · Image classification · Object detection · Image segmentation

1 Introduction

Since Vaswani et al. attn introduced the Transformer, which achieved outstanding success in machine translation, many improvements have been made over this structure attn_improve3 ; attn_improve4 ; attn_bert . Subsequently, Dosovitskiy et al. attn_vit first introduced the Transformer to the computer vision field and proposed the ViT model, which successfully sparked a new wave of research beyond conventional CNN-based vision models. Recently, many excellent vision transformer models attn_pvt ; attn_swin ; attn_xcit ; attn_metaformer ; attn_nat ; attn_mpvit ; attn_uniformer have been proposed and have achieved great success in many vision tasks. Many attempts have also been made to explain and improve the Transformer structure from different perspectives theory_more_inductive_bias ; theory_more1 ; theory1 ; theory2 ; theory3 ; attn_convit ; theory_more2 ; theory_more3 ; attattr , while continuing research is still needed. Most current models migrate structural designs from CNNs and rely on experiments to verify the effectiveness of their modules or improvements, which lacks an explanation of why the improved Transformer approaches work attattr ; attn_metaformer ; attn_notattention .

Figure 1: Comparisons of structural analogousness between (a)-Transformer module and (b)-evolutionary algorithm, where they have analogous concepts of 1) individual definition (Token Embedding vs. Individual), 2) global information interaction (MSA vs. Crossover), 3) individual feature enhancement (FFN vs. Mutation), 4) feature inheritance (Skip Connection vs. Population Succession), etc. (c)-Intuitive illustration of some EA variants.

Inspired by biological population evolution, we explain the rationality of Transformer by analogy with the proven effective, stable, and robust Evolutionary Algorithm (EA) in this article, which has been widely used in many practical applications. We observe that the procedure of the Transformer (abbr., TR) has similar attributes to the naive EA through analogical analysis in Figure 1 (a)-TR and (b)-EA:

1)

In terms of data format, TR processes patch embeddings while EA evolves individuals; both share the same data format and require necessary initialization.

2)

In terms of optimization objective, TR aims to obtain an optimal vector representation that fuses global information through multiple layers (denoted as $x_{N}$ in Figure 1), while EA focuses on getting the best individual globally through multiple iterations.

3)

In terms of components, Multi-head Self-Attention (MSA) in TR aims to densely enrich patch embeddings through global information communication, while the crossover operation in EA plays the role of sparsely interacting global individuals. Also, the Feed-Forward Network (FFN) in TR enhances every single embedding at all spatial positions, which is similar to Mutation in EA that evolves each individual of the whole population.

4)

Furthermore, we deduce the mathematical characterization of crossover and mutation operators in EA (c.f., Equations 5,8) and find that they have the same mathematical formulations as MSA and FFN in TR (c.f., Equations 6,9), respectively.

In addition to the above basic analogies between naive Transformer and EA, we explore further to improve the current vision transformer by leveraging other domain knowledge of EA variants. Without losing generality, we only study the widely used and effective EA methods that could inspire us to improve Transformer. They can be mainly divided into the following categories:

1)

Global and local populations inspire simultaneous global and local modeling. In contrast to the naive EA that only models global interaction, local search-based EA variants focus on finding a better individual in its neighborhood lea1 ; lea2 ; lea4 , which is more efficient since it does not involve the global search space. Furthermore, Moscato et al. mea1 first propose the Memetic Evolutionary Algorithm (MEA) that introduces a local search process for converging to high-quality solutions faster than conventional evolutionary counterparts; an intuitive illustration is given in Figure 1-(c). For a particular individual (i.e., the center sheep with red background), the naive EA only contains the Global Population concept, while the Local Population idea enables the model to focus on more relevant individuals. Inspired by these EA variants, we revisit the global MSA part and design a novel Global and Local Interaction (GLI) module, which adopts a parallel structure that employs an extra local operation beside the global operation, i.e., introducing inductive bias and locality into MSA. The former is used to mine more relevant local information, while the latter aims to model global cue interactions. Considering that the spatial relationships among real individuals are not arranged as regularly (horizontally and vertically) as image patches, we further propose a Modulated Deformable MSA (MD-MSA) to dynamically model irregular locations, which can focus on more informative, re-organized regions.

2)

Multi-population inspired multi-scale information aggregation. Some works lpea2 ; lpea4 introduce multi-population evolutionary algorithms to solve optimization problems, which adopt different searching regions to enhance the diversity of individuals more efficiently and obtain significantly better performance. As shown in Figure 1-(c), the Long-Distance Population can supplement more diverse and richer cues, while the Short-Distance Population focuses on providing general evolutionary features. Analogously, this idea inspires us to design a Multi-Scale Region Aggregation (MSRA) module that aggregates information from different receptive fields for the vision transformer, which integrates more expressive features from different resolutions before feeding them into the next module.

3)

Dynamic population inspired pyramid architecture design. The works alpha2 ; alpha3 ; alpha4 investigate the jDEdynNP-F algorithm with a dynamic population reduction scheme that significantly improves the effectiveness and accelerates the convergence of the model, which is similar to the pyramid-like designs of some current vision transformers attn_pvt2 ; attn_twins ; attn_swin2 ; attn_nat ; attn_metaformer . Analogously, we extend our previous columnar work eat to a pyramid structure like PVT attn_pvt , which significantly boosts performance on many vision tasks.

4)

Self-adapted parameters inspired weighted operation mixing. Brest et al. alpha1 propose an adaptation mechanism to control different optimization processes for better results, and some memetic EAs mea1 ; mea2 ; mea3 own a similar concept of search intensity to balance global and local computation. This encourages us to learn appropriate weights for different operations, which improves performance and interpretability.

5)

Multi-objective EA inspired task-related feature merging. Current TR-based vision models either initialize different tokens for different tasks attn_deit (e.g., classification and distillation) or use a pooling operation to obtain a global representation attn_swin . However, both manners suffer from potential incompatibility: the former treats the task token and image patches coequally, which is unreasonable, and slightly increases the computation of each layer (from $O(n^{2})$ to $O((n+1)^{2})$), while the latter uses only one pooling result for multiple tasks, which can damage model accuracy. Inspired by multi-objective EAs moea1 ; moea2 ; moea3 that find a set of solutions for different targets, we design a Task-Related Head (TRH) docked with the transformer backbone to complete final information fusion, which is elegant and flexible for learning different tasks.

Figure 2: Paradigm of the proposed basic EAT Block, which contains three $y=f(x)+x$ residuals to model: (a) multi-scale information aggregation, (b) feature interactions among tokens, and (c) individual enhancement.
Figure 3: Comparison with SOTAs in terms of Top-1 vs. GPU throughput. All models are trained only on the ImageNet-1K imagenet dataset at 224×224 resolution, and the radius represents the relative number of parameters.

Based on the above analyses, we improve our columnar EAT model eat to a pyramid EA-inspired Transformer (EATFormer) that achieves new SOTA results. Figure 3 illustrates intuitive comparisons with SOTA methods in terms of GPU throughput, Top-1 accuracy, and the number of parameters, where our smallest EATFormer-Mobile obtains 69.4 Top-1 with a throughput of 3,926 images/s on one V100 GPU, and EATFormer-Base achieves 83.9 Top-1 with only 49.0M parameters. Specifically, we make the following four contributions compared with the previous conference work:

In theory, we enrich evolutionary explanations for the rationality of the Vision Transformer and derive a mathematical formulation consistent with the evolutionary algorithm.

On framework, we propose a novel basic EA-based Transformer (EAT) block (shown in Figure 2) that consists of three residual parts to model multi-scale, interactive, and individual information, respectively, which is stacked to form our proposed pyramid EATFormer.

For method, inspired by effective EA variants, we analogously design: 1) Global and Local Interaction module, 2) Multi-Scale Region Aggregation module, 3) Task-Related Head module, and 4) Modulated Deformable MSA module to improve effectiveness and usability of our EATFormer.

Massive experiments on classification, object detection, and semantic segmentation tasks demonstrate the superiority and efficiency of our approach, while ablation and explanatory experiments further prove the efficacy of EATFormer and its components.

2 Related Work

2.1 Evolutionary Algorithms

Evolutionary algorithm (EA) is a subset of evolutionary computation in computational intelligence that belongs to modern heuristics, and it serves as an essential umbrella term to describe population-based optimization and search techniques of the last 50 years ea1 ; ea2 ; bartz2014evolutionary . Inspired by biological evolution, general EAs mainly contain reproduction, crossover, mutation, and selection steps, which have been proven effective and stable in many application scenarios mea1 ; mea2 , and a series of improved EA approaches have been advanced in succession. Differential Evolution (DE), developed in 1995 dea1 , is arguably one of the most competitive improved EAs, significantly advancing global modeling capability dea2 ; dea3 . The core idea of DE is introducing a differential concept into the conventional EA, which differentiates and scales two individuals in the same population and interacts with a third individual to generate a new individual. In contrast to the category mentioned above, local search-based EAs aim to find a solution that is as good as or better than all other solutions in its neighborhood lea1 ; lea2 ; lea4 . This thought is more efficient than global search in that a solution can quickly be verified as a local optimum without associating the global search space. However, the locality-aware operation restricts the ability of global modeling and can lead to suboptimal results in some scenarios, so some researchers attempt to fuse both modeling manners. Moscato et al. mea1 first propose the Memetic Evolutionary Algorithm (MEA) in 1989, which applies a local search process to refine solutions for hard problems and converges to high-quality solutions more efficiently than conventional evolutionary counterparts. In detail, this variant is a particular global-local search hybrid: the global character is given by the traditional EA, while the local aspect is mainly performed through constructive methods and intelligent local search heuristics mea2 . Analogously, some later works lpea2 ; lpea4 introduce a multi-population evolutionary algorithm to solve constrained function optimization problems relatively efficiently, which adopts different searching regions to enhance the diversity of individuals and improves the model ability dramatically. This strategy inspires us to design a basic feature extraction module for the vision transformer: whether a similar multi-scale manner can be adopted to enhance model expressiveness. Furthermore, Brest et al. alpha1 propose an adaptation mechanism on the control parameters $CR$ and $F$ for the crossover and mutation operations of DE, where adapted parameters are applied to different optimization processes for obtaining better results. Remarkably, the MEAs mea1 ; mea2 ; mea3 mentioned above own a similar concept of search intensity to balance global and local computation. Subsequent work alpha2 investigates the jDEdynNP-F algorithm with a dynamic population reduction scheme, where the population size of the next generation equals half the previous population size. This strategy significantly improves the effectiveness and accelerates the convergence of the model, as consistently illustrated by works alpha3 ; alpha4 . Furthermore, some studies bio1 ; bio2 ; bio3 suggest that there are hierarchical structures of V1, V2, V4, and the inferotemporal cortex in the evolutionary brain, which have ordered interconnections among them in the form of both feed-forward and feedback connections.
Moreover, researchers moea1 ; moea2 ; moea3 study multi-objective EAs to find optimal trade-offs to get a set of solutions for different targets.

Inspired by the aforementioned EA variants that introduce various and valid concepts for optimization, we explain and improve the naive transformer structure by conceptual analogy in the paper, where a novel and potent EATFormer with pyramid architecture, multi-scale region aggregation, and global-local modeling is hand-designed. Furthermore, a plug-and-play task-related head module is developed to solve different targets separately and improve model performance.

2.2 Vision Transformers

Since the Transformer structure achieved significant progress on the machine translation task attn , many improved language models attn_elmo ; attn_bert ; attn_gpt1 ; attn_gpt2 ; attn_gpt3 have been proposed and obtained great achievements, and some later works attn_improve1 ; attn_improve2 ; attn_improve3 ; attn_improve4 ; attn_improve5 ; wang2021evolving advance the basic transformer module for better efficiency. Inspired by the success of Transformer in NLP and the rapid improvement of computing power, Dosovitskiy et al. attn_vit propose ViT, which first introduces the transformer to vision classification and sparks a new wave of research beyond conventional CNN-based vision models. Subsequently, many excellent vision transformer models have been proposed, and they can mainly be divided into two categories, i.e., pure and hybrid vision transformers. The former only contains transformer modules without CNN-based layers, and early works attn_deit ; attn_deepvit ; attn_tnt ; attn_cait ; attn_t2tvit ; attn_cpvt ; attn_xcit follow the columnar structure of the original ViT. Typically, DeiT attn_deit proposes an efficient training recipe to moderate the dependence on large datasets, DeepViT attn_deepvit and CaiT attn_cait focus on the fast performance saturation when scaling ViT deeper, and TNT attn_tnt divides local patches into smaller patches for fine-grained modeling. Furthermore, researchers attn_pvt ; attn_swin ; attn_shuffle ; attn_halonets ; attn_twins ; attn_swin2 ; attn_nat ; attn_dpt ; attn_metaformer ; attn_SSA advance ViT to a pyramid structure that is more powerful and suitable for dense prediction. PVT attn_pvt leverages a non-overlapping patch partition to reduce feature size, while Swin attn_swin utilizes a shifted window scheme to alternately model in-window and cross-window connections. The latter category incorporates the idea of convolution, which owns the natural inductive bias of locality and translation invariance, and this kind of combination dramatically improves the model effect. Specifically, Srinivas et al. attn_botnet advance CNN-based models by replacing the convolution of the bottleneck block with the MSA structure. Later works attn_localvit ; attn_ceit ; attn_convit ; attn_vitae introduce convolution designs into columnar visual transformers, while works attn_lvt ; attn_volo ; attn_crossformer ; attn_container ; attn_uniformer ; attn_coat ; attn_cvt ; attn_pvt2 ; attn_vitae2 fuse convolution structures into the pyramid structure or use CNN-based backbones in early stages, which has obvious advantages over pure transformer models. Moreover, some researchers design hybrid models from the parallel perspective of convolution and transformer attn_coat ; attn_mobileformer ; attn_mixformer , while Xia et al. attn_vit introduce a deformable idea dcn1 into the MSA module and obtain a boost over Swin attn_swin . Recently, MPViT attn_mpvit explores multi-scale patch embedding and a multi-path structure that enable both fine and coarse feature representations simultaneously.
Benefiting from advances in basic vision transformer models, many task-specific models are proposed and achieve significant progress in down-stream vision tasks, e.g., object detection down_detr ; down_deformabledetr ; down_YOLOS ; down_pix2seq , semantic segmentation down_setr ; down_transunet ; down_medt ; down_segformer ; down_hrformer ; down_maskformer ; down_mask2former , generative adversarial network down_transgan ; down_fat ; down_gat , low-level vision down_ipt ; down_swinir ; down_restormer , video understanding down_vtn ; down_timesformer ; down_STRM ; down_bevt ; down_LSTR , self-supervised learning down_sit ; down_mocov3 ; down_fdsl ; down_mae ; down_beit ; down_dino ; down_peco ; down_cae ; down_simmim ; down_maskfeat ; down_data2vec ; down_convmae , neural architecture search down_hat ; down_bossnas ; down_glit ; down_autoformer ; down_s3 , etc. Inspired by practical improvements in EA variants, this work migrates them to Transformer improvement and designs a powerful visual model with higher precision and efficiency than contemporary works. Also, thanks to the elaborate analogical design, the proposed EATFormer in this paper is highly explanatory.

2.3 Explanatory of Transformers

Transformer-based models have achieved remarkable results on CV tasks, leading us to question why Transformer works so well, even better than Convolutional Neural Networks (CNN). Many efforts have been made to answer this question. Goyal et al. theory_more_inductive_bias point out that studying the kind of inductive biases that humans and animals exploit could help inspire AI research and neuroscience theories. Pan et al. theory_more1 show a strong underlying relation between convolution and self-attention operations. Cordonnier et al. theory1 prove that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, while Raghu et al. theory2 find striking differences between ViT and CNN on image classification. Li et al. attn_uniformer seamlessly integrate the merits of convolution and self-attention in a concise transformer format, while ConViT attn_convit combines the strengths of both CNN/Transformer architectures to introduce gated positional self-attention. Introducing local CNN designs into Transformer is followed by many subsequent works attn_cpvt ; attn_uniformer ; cmt ; attn_vitae2 ; mobilevit ; edgenext . Furthermore, Min et al. theory3 take a biologically inspired approach and explore modeling peripheral vision by incorporating peripheral position encoding into the multi-head self-attention layers of Transformer. Besides, some works explore the relation between Transformer and other models, e.g., Katharopoulos et al. theory_more_rnn reveal the relationship of Transformer to recurrent neural networks, and Kim et al. theory_more_gnn prove that Transformer is theoretically at least as expressive as an invariant graph network composed of equivariant linear layers. Moreover, Bhojanapalli et al. theory_more2 find that ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations when pre-trained with a sufficient amount of data. Hao et al. attattr propose a self-attention attribution method to interpret the information interactions inside Transformer, while Liu et al. theory_more3 propose an actionable diagnostic methodology to measure the consistency between explanation weights and the impact polarity for attention-based models. Dong et al. attn_notattention find that the MLP stops the output from degeneration, and removing MSA in Transformer would also significantly damage the performance. Recently, Qiang et al. theory_more4 propose a novel Transformer explanation technique via attentive class activation tokens by leveraging encoded features, gradients, and attention weights to generate a faithful and confident explanation. Xu et al. theory_more5 propose a new way to visualize the model by first computing attention scores based on attribution and then propagating these attention scores through the layers. Works attn_metaformer ; zhang2023rethinking demonstrate that the general architecture of the Transformer is more essential to performance than the specific token mixer module. The above works explore the interpretation of Transformer from a variety of perspectives; in this paper, we provide another explanation from the perspective of evolutionary algorithms and design a robust model for multiple CV tasks.

3 Preliminary Transformer

The vision transformer generally refers to the encoder part of the original transformer structure, which consists of the Multi-head Self-Attention layer (MSA), Feed-Forward Network (FFN), Layer Normalization (LN), and Residual Connection (RC). Given the input feature maps $\boldsymbol{X}_{img}\in\mathbb{R}^{C\times H\times W}$, the $\operatorname{Img2Seq}$ operation firstly flattens it to a 1D sequence $\boldsymbol{X}_{seq}\in\mathbb{R}^{C\times N}$ that complies with the standard NLP format, denoted as: $\boldsymbol{X}_{seq}=\operatorname{Img2Seq}(\boldsymbol{X}_{img})$.

MSA fuses several SA operations to process $\boldsymbol{Q}$, $\boldsymbol{K}$, and $\boldsymbol{V}$, which jointly attend to information in different representation subspaces. Specifically, the LN-normalized $\boldsymbol{X}_{seq}$ goes through linear layers to obtain the projected queries ($\boldsymbol{Q}$), keys ($\boldsymbol{K}$), and values ($\boldsymbol{V}$), formulated as:

\begin{aligned}
\operatorname{MSA}(\boldsymbol{X}_{seq})&=\left(\bigoplus_{h=1}^{H}\boldsymbol{X}_{h}\right)\boldsymbol{W}^{O},\\
\text{where}~\boldsymbol{X}_{h}&=\operatorname{Attention}\left(\boldsymbol{X}_{seq}\boldsymbol{W}_{h}^{Q},\boldsymbol{X}_{seq}\boldsymbol{W}_{h}^{K},\boldsymbol{X}_{seq}\boldsymbol{W}_{h}^{V}\right)\\
&=\operatorname{Softmax}\left(\frac{(\boldsymbol{X}_{seq}\boldsymbol{W}_{h}^{Q})(\boldsymbol{X}_{seq}\boldsymbol{W}_{h}^{K})^{T}}{\sqrt{d_{k}}}\right)\boldsymbol{X}_{seq}\boldsymbol{W}_{h}^{V}\\
&=\operatorname{Softmax}\left(\frac{\boldsymbol{Q}_{h}\boldsymbol{K}_{h}^{T}}{\sqrt{d_{k}}}\right)\boldsymbol{V}_{h}\\
&=\boldsymbol{A}_{h}\boldsymbol{V}_{h},
\end{aligned} \qquad (1)

where $d_{m}$ is the input dimension, while $d_{q}$, $d_{k}$, and $d_{v}$ are hidden dimensions of the corresponding projection subspaces, and generally $d_{q}$ equals $d_{k}$; $H$ is the head number; $\boldsymbol{W}_{h}^{Q}\in\mathbb{R}^{d_{m}\times d_{q}}$, $\boldsymbol{W}_{h}^{K}\in\mathbb{R}^{d_{m}\times d_{k}}$, and $\boldsymbol{W}_{h}^{V}\in\mathbb{R}^{d_{m}\times d_{v}}$ are parameter matrices for $\boldsymbol{Q}$, $\boldsymbol{K}$, and $\boldsymbol{V}$, respectively; $\boldsymbol{W}^{O}\in\mathbb{R}^{Hd_{v}\times d_{m}}$ maps each head feature $\boldsymbol{X}_{h}$ to the output; $\oplus$ means the concatenation operation; $\boldsymbol{A}_{h}\in\mathbb{R}^{N\times N}$ is the attention matrix of the $h$-th head.

FFN consists of two cascaded linear transformations with a ReLU activation in between:

\mathrm{FFN}(\boldsymbol{X}_{seq})=\max\left(0,\boldsymbol{X}_{seq}\boldsymbol{W}_{1}+\boldsymbol{b}_{1}\right)\boldsymbol{W}_{2}+\boldsymbol{b}_{2}, \qquad (2)

where $\boldsymbol{W}_{1}$ and $\boldsymbol{W}_{2}$ are weights of the two linear layers, while $\boldsymbol{b}_{1}$ and $\boldsymbol{b}_{2}$ are the corresponding biases.

LN is applied before each layer of MSA and FFN, and the transformed $\hat{\boldsymbol{X}}_{seq}$ is calculated by:

\hat{\boldsymbol{X}}_{seq}=\boldsymbol{X}_{seq}+[\text{MSA}~|~\text{FFN}](\text{LN}(\boldsymbol{X}_{seq})). \qquad (3)

Finally, the reversed $\operatorname{Seq2Img}$ operation reshapes the enhanced $\hat{\boldsymbol{X}}_{seq}$ back to 2D feature maps, denoted as: $\hat{\boldsymbol{X}}_{img}=\operatorname{Seq2Img}(\hat{\boldsymbol{X}}_{seq})$.
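To make the preliminaries concrete, the following is a minimal PyTorch sketch of the pre-norm encoder block described by Equations 1-3, including the Img2Seq/Seq2Img conversions; the module name and hyper-parameters are illustrative assumptions and do not come from the released EATFormer code.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: x + MSA(LN(x)) and x + FFN(LN(x))."""
    def __init__(self, dim=192, heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)   # Eq. (1)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                                        # Eq. (2)
            nn.Linear(dim, mlp_ratio * dim), nn.ReLU(inplace=True),
            nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x_img):                          # x_img: (B, C, H, W)
        B, C, H, W = x_img.shape
        x = x_img.flatten(2).transpose(1, 2)           # Img2Seq -> (B, N, C)
        y = self.norm1(x)
        x = x + self.msa(y, y, y, need_weights=False)[0]   # Eq. (3), MSA branch
        x = x + self.ffn(self.norm2(x))                    # Eq. (3), FFN branch
        return x.transpose(1, 2).reshape(B, C, H, W)       # Seq2Img

x = torch.randn(2, 192, 14, 14)
print(EncoderBlock()(x).shape)  # torch.Size([2, 192, 14, 14])
```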

4 EA-Inspired Vision Transformer

In this section, we expand on the relationship between operators in the naive EA and modules in the naive Transformer, and derive consistent mathematical formulations for each conceptual pair, revealing evolutionary explanations for the rationality of the Vision Transformer structure. Inspired by the core ideas of several effective EA variants, we then transfer them into transformer architecture design and improve the previous columnar model into a mighty pyramid EATFormer.

4.1 Evolutionary Explanation of Transformer

As aforementioned in Figure 1, the Transformer block has sub-modules conceptually analogous to those of the evolutionary algorithm. Basically, the Transformer inputs a sequence of patch tokens while EA evolves a population that consists of many individuals, both of which have a consistent vector format and necessary initialization. To facilitate the subsequent analogy and formula derivation, we symbolize the patch token (individual) as $\boldsymbol{x}_{i}=[x_{i,1},x_{i,2},\dots,x_{i,D}]$, where $i$ and $D$ indicate the data order and dimension, respectively. Defining $L$ as the sequence length, the sequence (population) can be denoted as $\boldsymbol{X}=[\boldsymbol{x}_{1},\boldsymbol{x}_{2},\dots,\boldsymbol{x}_{L}]^{\mathrm{T}}$. The specific relationship analyses of the different components are as follows:

Crossover Operator vs. MSA Module.
For the crossover operator of EA, it aims at creating new individuals by combining parts of other individuals. For an individual $\boldsymbol{x}_{i}$ specifically, the operator will randomly pick another individual $\boldsymbol{x}_{j}=[x_{j,1},x_{j,2},\dots,x_{j,D}]~(1\leq j\leq L)$ in the global population and randomly replace features of $\boldsymbol{x}_{i}$ with those of $\boldsymbol{x}_{j}$ to form the new individual $\hat{\boldsymbol{x}}_{i}$:

\hat{\boldsymbol{x}}_{i,d}=\begin{cases}\boldsymbol{x}_{j,d}, & \text{if }\operatorname{randb}(d)\leqslant CR\\ \boldsymbol{x}_{i,d}, & \text{otherwise}\end{cases} \qquad (4)
\text{s.t.}~~i\neq j,~d\in\{1,2,\ldots,D\},

where $\operatorname{randb}(d)$ is the $d$-th evaluation of a uniform random number generator with outcome in $[0,1]$, and $CR$ is the crossover constant in $[0,1]$ that is determined by the user. We re-formulate this process as:

\begin{aligned}
\hat{\boldsymbol{x}}_{i}&=[\boldsymbol{x}_{i,1}\boldsymbol{w}_{i,1},\dots,\boldsymbol{x}_{i,D}\boldsymbol{w}_{i,D}]+[\boldsymbol{x}_{j,1}\boldsymbol{w}_{j,1},\dots,\boldsymbol{x}_{j,D}\boldsymbol{w}_{j,D}]\\
&=\boldsymbol{x}_{1}\odot\boldsymbol{0}+\dots+\boldsymbol{x}_{i}\odot\boldsymbol{w}_{i}+\dots+\boldsymbol{x}_{j}\odot\boldsymbol{w}_{j}+\dots+\boldsymbol{x}_{L}\odot\boldsymbol{0}\\
&=\boldsymbol{x}_{1}\boldsymbol{0}+\dots+\boldsymbol{x}_{i}\boldsymbol{W}^{cr}_{i}+\dots+\boldsymbol{x}_{j}\boldsymbol{W}^{cr}_{j}+\dots+\boldsymbol{x}_{L}\boldsymbol{0}\\
&=\sum_{l=1}^{L}\left(\boldsymbol{x}_{l}\boldsymbol{W}^{cr}_{l}\right),\\
&~\text{s.t.}~~\boldsymbol{w}_{i}+\boldsymbol{w}_{j}=\boldsymbol{1},\\
&~~~~~~~\boldsymbol{w}_{i,d}\in[0,1],~\boldsymbol{w}_{j,d}\in[0,1],~d\in\{1,2,\ldots,D\},
\end{aligned} \qquad (5)

where $\boldsymbol{w}_{i}$ and $\boldsymbol{w}_{j}$ are vectors filled with zeros or ones, indicating the feature selections of $\boldsymbol{x}_{i}$ and $\boldsymbol{x}_{j}$, while $\boldsymbol{W}^{cr}_{i}$ and $\boldsymbol{W}^{cr}_{j}$ are the corresponding diagonal matrix representations. $\odot$ means the point-wise multiplication operation for each position. $\boldsymbol{0}$ represents that the corresponding individual has no contribution, i.e., $\boldsymbol{W}^{cr}_{l}~(l\neq i,j)$ is filled with zeros. As can be seen above, the crossover operator is actually a sparse global feature interaction process.
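As a concrete illustration of Equations 4 and 5, the NumPy sketch below performs a binomial-style crossover and then reproduces the same result as a sparse, diagonally-weighted sum over the whole population; the population size, dimension, and CR value are arbitrary example choices, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, CR = 8, 6, 0.5                      # population size, feature dim, crossover rate
X = rng.normal(size=(L, D))               # population of L individuals

i, j = 0, 3                               # current individual and a randomly picked mate
mask = rng.random(D) <= CR                # randb(d) <= CR, Eq. (4)
x_new = np.where(mask, X[j], X[i])        # sparse feature replacement

# Eq. (5): same result as a weighted sum over the population with diagonal W^cr_l
W = [np.zeros((D, D)) for _ in range(L)]  # W^cr_l is all-zero for l != i, j
W[i] = np.diag((~mask).astype(float))     # w_i keeps the untouched features
W[j] = np.diag(mask.astype(float))        # w_j selects the replaced ones (w_i + w_j = 1)
x_new_eq5 = sum(X[l] @ W[l] for l in range(L))

assert np.allclose(x_new, x_new_eq5)      # crossover == sparse global interaction
```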

For the MSA module of Transformer, each patch embedding interacts with all embeddings in dense communications. Without loss of generality, $\boldsymbol{x}_{i}$ interacts with the whole population $\boldsymbol{X}$ as follows:

\begin{aligned}
\hat{\boldsymbol{x}}_{i}&=\bigoplus_{h=1}^{H}\hat{\boldsymbol{x}}_{i,h}\\
&=\bigoplus_{h=1}^{H}\sum_{l=1}^{L}\boldsymbol{A}_{l,h}\boldsymbol{V}_{l,h}\\
&=\bigoplus_{h=1}^{H}\sum_{l=1}^{L}\boldsymbol{A}_{l,h}\boldsymbol{x}_{l}\boldsymbol{W}_{h}^{V}\\
&=\bigoplus_{h=1}^{H}\sum_{l=1}^{L}\left(\boldsymbol{A}_{l,h}\boldsymbol{x}_{l}\boldsymbol{W}_{h}^{V}\right)\\
&=\sum_{l=1}^{L}\left(\boldsymbol{x}_{l}\bigoplus_{h=1}^{H}\left(\boldsymbol{A}_{l,h}\boldsymbol{W}_{h}^{V}\right)\right)\\
&=\sum_{l=1}^{L}\left(\boldsymbol{x}_{l}\left(\boldsymbol{A}_{l}\boldsymbol{W}^{V}\right)\right),
\end{aligned} \qquad (6)

where $\boldsymbol{A}_{l,h}$ ($l\in\{1,2,\cdots,L\}$) is the attention weight of the $h$-th head from embedding token $\boldsymbol{x}_{l}$ to $\boldsymbol{x}_{i}$, which is calculated between the query value of $\boldsymbol{x}_{i,h}$ and the key value of $\boldsymbol{x}_{l,h}$ of the $h$-th head, followed by a $\operatorname{Softmax}(\cdot)$ post-processing; $\boldsymbol{V}_{l,h}$ is the projected Value feature for $\boldsymbol{x}_{l}$ with the corresponding weights $\boldsymbol{W}_{h}^{V}$; $\hat{\boldsymbol{x}}_{i,h}$ is the sum of all $\boldsymbol{V}_{l,h}$ weighted by $\boldsymbol{A}_{l,h}$, i.e., $\hat{\boldsymbol{x}}_{i,h}=\sum_{l=1}^{L}\boldsymbol{A}_{l,h}\boldsymbol{V}_{l,h}$ (c.f., Eq. 1 in Section 3). $\boldsymbol{W}^{V}$ is the parameter matrix for the value projection, and $\oplus$ means the concatenation operation. Comparing Equation 5 with Equation 6, we find that both components have the same formula representation: the crossover operation is a sparse global interaction, while the densely-modeling MSA has more complex computing and modeling capabilities.

Mutation Operator vs. FFN Module.
For the mutation operator in EA, it brings random evolutions into the population by stochastically changing specific features of individuals. Specifically, an individual $\boldsymbol{x}_{i}$ in the population goes through the Mutation operation to form the new individual $\hat{\boldsymbol{x}}_{i}$, formulated as follows:

\hat{\boldsymbol{x}}_{i,d}=\begin{cases}\operatorname{rand}(v_{d}^{L},v_{d}^{H})\cdot x_{i,d}, & \text{if }\operatorname{randb}(d)\leqslant MU\\ 1\cdot x_{i,d}, & \text{otherwise}\end{cases} \qquad (7)
\text{s.t.}~~d\in\{1,2,\ldots,D\},

where $\operatorname{randb}(d)$ is the $d$-th evaluation of a uniform random number generator with outcome in $[0,1]$, and $MU$ is the mutation constant in $[0,1]$ that the user determines. $v_{d}^{L}$ and $v_{d}^{H}$ are the lower and upper scale bounds of the $d$-th feature relative to $x_{i,d}$. Similarly, we re-formulate this process as:

\begin{aligned}
\hat{\boldsymbol{x}}_{i}&=\boldsymbol{x}_{i}\odot\boldsymbol{w}_{i}\\
&=\boldsymbol{x}_{i}\boldsymbol{W}^{mu}_{i},
\end{aligned} \qquad (8)

where $\boldsymbol{w}_{i}$ is a randomly generated vector that represents the weight of each feature value, while $\boldsymbol{W}^{mu}_{i}$ is the corresponding diagonal matrix representation; $\odot$ means the point-wise multiplication operation for each position.

For the FFN module in Transformer, each patch embedding undergoes a directional feature transformation through cascaded linear layers (c.f., Equation 2). Setting aside the nonlinear transformation, we only take one linear layer as an example:

\hat{\boldsymbol{x}}_{i}=\boldsymbol{x}_{i}\boldsymbol{W}^{FFN}, \qquad (9)

where $\boldsymbol{W}^{FFN}$ is the weight of the linear layer, and it is applied to each embedding separately and identically.

By analyzing the calculation processes of Equation 8 and Equation 9, Mutation and FFN share a unified form of matrix multiplication, so they essentially serve a consistent function. Besides, at the microcosmic level, the weights of FFN change dynamically during the training process, so the output for an individual differs among different iterations (similar to the random process of mutation). At the macroscopic level of the algorithm objective, the mutation in EA is optimized toward one potential direction under the constraint of the objective function (statistically speaking, only some mutated individuals are retained, i.e., the mutation also has a determinate meaning over the whole training process). In comparison, the trained FFN can be regarded as a directional mutation under the constraint of the loss functions. Finally, note that we only discuss the comparison between the mutation and one linear layer of FFN; in fact, $\boldsymbol{W}^{FFN}$ is more expressive than the diagonal $\boldsymbol{W}^{mu}_{i}$ because FFN contains cascaded linear layers with a non-linear ReLU activation interspersed between adjacent linear layers, as depicted in Equation 2.
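Analogously, a small NumPy sketch of Equations 7-9 (with illustrative constants only): mutation re-scales randomly chosen features of an individual, which is exactly a multiplication by a diagonal matrix $W^{mu}_{i}$, whereas one FFN linear layer multiplies by a dense, learned matrix $W^{FFN}$.

```python
import numpy as np

rng = np.random.default_rng(1)
D, MU = 6, 0.3                                  # feature dim and mutation rate (example values)
v_low, v_high = 0.5, 1.5                        # lower/upper scale bounds v^L_d, v^H_d
x = rng.normal(size=D)

# Eq. (7)/(8): mutation = element-wise scaling = diagonal matrix multiplication
scale = np.where(rng.random(D) <= MU, rng.uniform(v_low, v_high, D), 1.0)
W_mu = np.diag(scale)
assert np.allclose(x * scale, x @ W_mu)

# Eq. (9): one linear layer of FFN uses a dense (more expressive) matrix instead
W_ffn = rng.normal(size=(D, D))
x_ffn = x @ W_ffn
```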

Population Succession vs. RC Operation.
In the evolution of a biological population, individuals of the current iteration have a certain probability of being inherited by the next iteration, where a partial population of the current iteration is combined with the selected individuals. Similarly, the above pattern is expressed in the Transformer structure in the form of the Residual Connection (RC), i.e., patch embeddings of the previous layer are directly mapped to the next layer. Specifically, the partial selection can be viewed as a dropout technique in Transformer, while population succession can be formulated as a concatenation operation that has a consistent mathematical expression with the residual connection, whereas the addition operation can be regarded as a particular case of the concatenation operation that shares partial weights.

Best Individual vs. Task-Related Token.
Generally speaking, the Transformer-based model chooses an enhanced task-related token (e.g., classification token) that combines information of all patch embeddings as the output feature, while the EA-based method chooses the individual with the best fitness score among the population as the output.

Figure 4: Structure of the EA-inspired columnar EAT model eat and the improved pyramid EATFormer. (a) The top part shows the architecture of the previous EAT model, where the basic block consists of parallel global and local paths as well as an FFN module. (b) The middle part illustrates the overall architecture of EATFormer that contains four stages, with the $i$-th stage consisting of $N_i$ basic EAT blocks; the bottom part illustrates the structure of the serial modules in the EAT block, i.e., MSRA (c.f., Section 4.3.1), GLI (c.f., Section 4.3.2), and FFN (c.f., Section 3) from left to right, and an MD-MSA is proposed to effectively improve the model performance; the right part shows the designed Task-Related Head module docked with the transformer backbone for specific tasks.

Necessity of Modules in Transformer.
As described in the work s16 , the absence of the crossover operator or mutation operator will significantly damage the model’s performance. Similarly, Dong et alattn_notattention explore the effect of MLP in the Transformer and find that MLP stops the output from degeneration, and removing MSA in Transformer would also significantly damage the effectiveness of the model. Thus we can conclude that global information interaction and individual evolution are necessary for Transformer, just like the global crossover and individual mutation in EA.

4.2 Short Description of Previous Columnar EAT

We explored the relationship between operators in the naive EA and modules in the naive Transformer in the previous NeurIPS'21 conference work eat and analogically improved a columnar EAT based on the ViT model. Figure 4-(a) shows the structure of the EAT model, which is stacked with N improved Transformer blocks inspired by the local population concept in some EA works lea1 ; lea2 ; mea1 , where a local path is introduced in parallel with the global MSA operation. Also, that work designs a Task-Related Head to deal with various tasks more flexibly, i.e., classification and distillation.

However, the columnar structure is naturally inadequate for downstream dense prediction tasks, and it is inferior in terms of accuracy compared with contemporaneous works attn_pvt ; attn_swin , which limits the usefulness of the model in some scenarios. To address the above weaknesses, this paper further explores analogies between EA and Transformer and improves the previous work to a pyramid EATFormer, consisting of the newly designed EAT block inspired by the effective EA variants.

4.3 Methodology of Pyramid EATFormer Architecture

The architecture of the improved EATFormer is illustrated in Figure 4-(b), which contains four stages of different resolutions following PVT attn_pvt . Specifically, the model is made up of EAT blocks that contain three mixed-paradigm $y=f(x)+x$ residuals: (a) Multi-Scale Region Aggregation (MSRA), (b) Global and Local Interaction (GLI), and (c) Feed-Forward Network (FFN) modules, and the down-sampling procedure between two stages is realized by an MSRA with a stride greater than 1. Besides, we propose a novel Modulated Deformable MSA (MD-MSA) to advance global modeling and a Task-Related Head (TRH) to complete different tasks more elegantly and flexibly.

4.3.1 Multi-Scale Region Aggregation

Inspired by some multi-population-based EA methods lpea2 ; lpea4 that adopt different searching regions to obtain better model performance, we analogically extend this concept to multiple sets of spatial positions for the 2D image and design a novel Multi-Scale Region Aggregation (MSRA) module for the studied vision transformer. As shown in Figure 4-(a), MSRA contains $N$ local convolution operations (i.e., ${\rm Conv}_{S_{n}}$, $1\leq n\leq N$) with different strides to aggregate information from different receptive fields, which simultaneously play the role of providing inductive bias without extra position embedding procedures. Specifically, the $n$-th dilation operation $o_{n}$ that transforms the input feature map $\boldsymbol{x}$ can be formulated as:

o_{n}(\boldsymbol{x})=\text{Conv}_{S_{n}}(\text{Norm}(\boldsymbol{x})) \qquad (10)
\text{s.t.}~~n\in\{1,2,\ldots,N\},

A Weighted Operation Mixing (WOM) mechanism is further proposed to mix all operations by a softmax function over a set of learnable weights $\alpha_{1},\dots,\alpha_{N}$, and the intermediate representation $\boldsymbol{x}_{o}$ is calculated by the mixing function $\mathcal{F}$ as follows:

\boldsymbol{x}_{o}=\sum_{n=1}^{N}\frac{\exp\left(\alpha_{n}\right)}{\sum_{n^{\prime}=1}^{N}\exp\left(\alpha_{n^{\prime}}\right)}o_{n}(\boldsymbol{x}), \qquad (11)

where $\mathcal{F}$ in the above formula is the addition function; other fusion functions like concatenation are also available for a better effect at the cost of more parameters, and this paper chooses the addition function by default. Then, a convolution layer ${\rm Conv}_{S_{o}}$ maps $\boldsymbol{x}_{o}$ to the same number of channels as the input $\boldsymbol{x}$, and the final output of the module is obtained after a residual connection. Also, the MSRA module serves as the model stem and the patch embedding, which makes the EATFormer more uniform and elegant. Note that this paper does not use any form of position embedding, since the CNN-based MSRA can provide a natural inductive bias for the subsequent GLI module.
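A minimal PyTorch sketch of the MSRA idea described above: several dilated depth-wise convolution branches whose outputs are mixed by softmax-normalized learnable weights (the WOM mechanism of Equation 11), then projected and added back to the input. The choice of depth-wise convolution and BatchNorm here is an assumption for illustration and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class MSRA(nn.Module):
    """Multi-Scale Region Aggregation: mix N dilated conv branches with learnable weights."""
    def __init__(self, dim=64, dilations=(1, 2, 3), k=3):
        super().__init__()
        self.norm = nn.BatchNorm2d(dim)                          # Norm(x) in Eq. (10)
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, k, padding=d * (k - 1) // 2,     # keep spatial size
                      dilation=d, groups=dim)                    # one branch per region scale
            for d in dilations])
        self.alpha = nn.Parameter(torch.zeros(len(dilations)))   # WOM weights, Eq. (11)
        self.proj = nn.Conv2d(dim, dim, 1)                       # Conv_{S_o} projection

    def forward(self, x):
        y = self.norm(x)
        w = self.alpha.softmax(dim=0)
        x_o = sum(w[n] * branch(y) for n, branch in enumerate(self.branches))
        return x + self.proj(x_o)                                # residual connection

x = torch.randn(2, 64, 56, 56)
print(MSRA()(x).shape)  # torch.Size([2, 64, 56, 56])
```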

4.3.2 Global and Local Interaction

Motivated by EA variants mea1 ; mea2 ; mea3 that introduce local search procedures besides the conventional global search for converging to higher-quality solutions faster and more effectively (c.f., Figure 1-(c) for a better intuitive explanation), we extend the MSA-based global module to a novel Global and Local Interaction (GLI) module. As shown in Figure 4-(b), GLI contains an extra local path in parallel with the global path, where the former aims to mine more discriminative locality-relevant information like the above-mentioned local population idea, while the latter is retained to model global information. Specifically, the input features are divided into global features (marked green) and local features (marked blue) at the channel level with ratio $p$, which are then fed into the global and local paths to conduct feature interactions, respectively. Note that we also apply the Weighted Operation Mixing mechanism proposed in Section 4.3.1 to balance the two branches, i.e., a global weight $\alpha_{g}$ and a local weight $\alpha_{l}$. The outputs of the two paths recover the original data dimension through the concatenation operation $\mathcal{H}$. Thus the improved module is very flexible and can be viewed as a plug-and-play module for the current transformer structure. In detail, the local operation can be a traditional convolution layer or other improved modules, e.g., DCN dcn1 ; dcn2 , local MSA, etc., while the global operation can be MSA attn ; attn_vit , D-MSA attn_dpt , Performer attn_improve3 , etc.

In this paper, we choose a naive convolution and an MSA module as the basic compositions of GLI, which owns an $O(1)$ maximum path length between any two positions, keeping the global modeling capability besides enhancing locality, as shown in Table 1. Therefore, the proposed structure maintains the same parallelism and efficiency as the original vision transformer. Also, the selection of the feature separation ratio $p$ is crucial to the effect and efficiency of the model, because different ratios bring different parameters, FLOPs, and precision. In detail, the local path contains a group of point-wise and $k\times k$ depth-wise convolutions. Assume that the feature map is in $\mathbb{R}^{C\times H\times W}=\mathbb{R}^{C\times L}$, and the two paths have $C_{g}=p\times C$ and $C_{l}=C-C_{g}$ channels, respectively. Here we analyze the number of parameters and the computation of the improved GLI module as follows:

1) The overall Params equals $4(C_{g}+1)C_{g}+(k^{2}+1)C_{l}+(C_{l}+1)C_{l}$ according to Table 1, and it is factorized based on $C_{l}=C-C_{g}$:

Params = 5{C_{g}}^{2}+(2-2C-k^{2})C_{g}+(k^{2}+2+C)C. \qquad (12)

Applying the minimum value formula of a quadratic function, Equation 12 obtains its minimum when $C_{g}^{min_{p}}=0.2C+0.1(k^{2}-2)$. Given that channel numbers are integers and the latter term can be ignored, we obtain $C_{g}^{min_{p}}=0.2C$, i.e., $p^{min_{p}}$ equals 0.2.

2) The overall FLOPs equals $8C_{g}^{2}L+4C_{g}L^{2}+3L^{2}+2k^{2}LC_{l}+2C_{l}LC_{l}$ according to Table 1, and it is factorized based on $C_{l}=C-C_{g}$:

FLOPs = 10L{C_{g}}^{2}+(4L^{2}-2k^{2}L-4LC)C_{g}+(3L+2k^{2}C+2C^{2})L. \qquad (13)

Applying the minimum value formula of a quadratic function, Equation 13 obtains its minimum when $C_{g}^{min_{f}}=0.2C+0.1(k^{2}-2L)$. Also, ignoring the latter term, we obtain $C_{g}^{min_{f}}=0.2C$, which follows the same trend as $C_{g}^{min_{p}}$. Therefore, we can draw two conclusions: ① the parameters and calculations of GLI are much lower than those of a single-path MSA ($p$ < 1), and the minimum value is obtained when using both paths ($p$ > 0); ② according to Equation 12 and Equation 13, there is not much difference in the total parameters and calculations when $p$ lies in the range [0, 0.5], so $p$ is set to 0.5 for all layers in this paper for simplicity and efficiency. Also, experiments in Section 5.5.1 demonstrate that $p=0.5$ is the most economical and efficient option. Note that the number of parameters and the computation of the current local path are smaller than those of the global path, while a stronger local structure would shift the optimal ratio $p$; we do not elaborate on these details here.
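The two minima above can be checked numerically; the short script below (with arbitrary example values of C, L, and k) brute-forces the Params and FLOPs expressions from Table 1 over C_g and compares the integer minimizers with the closed forms 0.2C + 0.1(k^2 - 2) and 0.2C + 0.1(k^2 - 2L).

```python
import numpy as np

C, L, k = 320, 196, 3                      # example channel number, token count, kernel size
Cg = np.arange(0, C + 1, dtype=float)
Cl = C - Cg

params = 4 * (Cg + 1) * Cg + (k**2 + 1) * Cl + (Cl + 1) * Cl
flops = 8 * Cg**2 * L + 4 * Cg * L**2 + 3 * L**2 + 2 * k**2 * L * Cl + 2 * Cl * L * Cl

print(Cg[params.argmin()], 0.2 * C + 0.1 * (k**2 - 2))      # 65.0 vs. 64.7 closed form
print(Cg[flops.argmin()], 0.2 * C + 0.1 * (k**2 - 2 * L))   # 26.0 vs. 25.7 closed form
```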

Furthermore, we advance the global path by designing a Modulated Deformable MSA (MD-MSA, Section 4.3.3) module, which improves the model performance with a negligible increase in parameters and GFLOPs, and a comparison study exploring combinations of different operations is further conducted in the experimental section.
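A simplified PyTorch sketch of the GLI layout described in this subsection, assuming the local path is a point-wise plus depth-wise convolution, the global path is a plain MSA (rather than the MD-MSA introduced next), and the two branches are re-weighted by the same WOM mechanism; it illustrates the channel split with ratio p and the parallel paths rather than reproducing the released module.

```python
import torch
import torch.nn as nn

class GLI(nn.Module):
    """Global and Local Interaction: split channels with ratio p into MSA / conv paths."""
    def __init__(self, dim=192, p=0.5, heads=3, k=3):
        super().__init__()
        self.c_g = int(dim * p)                      # global channels C_g = p * C
        self.c_l = dim - self.c_g                    # local channels  C_l = C - C_g
        self.msa = nn.MultiheadAttention(self.c_g, heads, batch_first=True)
        self.local = nn.Sequential(                  # point-wise + k x k depth-wise conv
            nn.Conv2d(self.c_l, self.c_l, 1),
            nn.Conv2d(self.c_l, self.c_l, k, padding=k // 2, groups=self.c_l))
        self.alpha = nn.Parameter(torch.zeros(2))    # WOM weights for the two branches

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        x_g, x_l = x.split([self.c_g, self.c_l], dim=1)
        w = self.alpha.softmax(dim=0)
        seq = x_g.flatten(2).transpose(1, 2)                       # Img2Seq for the global path
        g = self.msa(seq, seq, seq, need_weights=False)[0]
        g = g.transpose(1, 2).reshape(B, self.c_g, H, W)           # Seq2Img
        return torch.cat([w[0] * g, w[1] * self.local(x_l)], dim=1)  # recover original dim

x = torch.randn(2, 192, 14, 14)
print(GLI()(x).shape)  # torch.Size([2, 192, 14, 14])
```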

Table 1: Properties of convolution and MSA layers in terms of Parameters (Params), floating point operations (FLOPs), and Maximum Path Length (MPL). Assume that the input and output feature maps are in $\mathbb{R}^{C\times H\times W}$, $L=H\times W$, $H=W$, and $k$ and $G$ are the kernel size and group number for convolution layers.
Type Params FLOPs MPL
MSA $4(C+1)C$ $8C^{2}L+4CL^{2}+3L^{2}$ $O(1)$
Conv $(Ck^{2}/G+1)C$ $(2Ck^{2}/G)LC$ $O(H/k)$
Figure 5: Structure of the proposed MD-MSA.

4.3.3 Modulated Deformable MSA

Inspired by the irregular spatial distribution of real individuals, which is not as strictly horizontal and vertical as an image grid, we design a novel Modulated Deformable MSA (MD-MSA) module that considers position fine-tuning and re-weighting of each spatial patch. As shown in Figure 5, the blue dotted line represents the naive MSA procedure, where the $\boldsymbol{QKV}$ features are obtained from the input feature map $\boldsymbol{X}$ by the function $f_{qkv}(\cdot)$, i.e., $\boldsymbol{QKV}=f_{qkv}(\boldsymbol{X})$ with $f_{qkv}=f_{q}\oplus f_{k}\oplus f_{v}$ ($\oplus$ denotes the concatenation operation), while the red solid line shows the procedure of MD-MSA. The main difference between the proposed MD-MSA and the original MSA lies in the query-aware re-sampling of the fine-tuned feature map $\hat{\boldsymbol{X}}$, from which the $\boldsymbol{KV}$ features are further extracted. Specifically, given the input feature map $\boldsymbol{X}$ with $L$ positions, $\boldsymbol{Q}$ is obtained by the function $f_{q}$, i.e., $\boldsymbol{Q}=f_{q}(\boldsymbol{X})$, which is then used to predict the deformable offset $\Delta l$ and modulation scalar $\Delta m$ for all positions:

\Delta l,\Delta m=f_{md}(\boldsymbol{Q}). \qquad (14)

For the $l$-th position, the re-sampled and re-weighted feature $\hat{\boldsymbol{X}}_{l}$ is calculated by:

\hat{\boldsymbol{X}}_{l}=\mathcal{S}(\boldsymbol{X}_{l},\Delta l)\cdot\Delta m, \qquad (15)

where $\Delta l$ is the relative coordinate with an unconstrained range for the $l$-th position, $\Delta m$ lies in the range [0, 1], and $\mathcal{S}$ represents the bilinear interpolation function. Then $\boldsymbol{KV}$ is obtained from the new feature map $\hat{\boldsymbol{X}}$, i.e., $\boldsymbol{KV}=f_{kv}(\hat{\boldsymbol{X}})$. It is worth mentioning that the main difference between MD-MSA and the recent similar work attn_DAT lies in the modulation operation, whereby MD-MSA can apply appropriate attention to the features of different positions to obtain better results. Also, no position embedding of any form is used, since it makes no contribution to the results; detailed comparative experiments can be viewed in Section 5.4.3.
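The following is a minimal single-head sketch of the MD-MSA procedure in Equations 14-15, assuming the offsets and modulation scalars are predicted from Q by a small linear layer and the re-sampling uses bilinear grid_sample; the actual module is multi-head and its exact prediction network is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDMSA(nn.Module):
    """Single-head Modulated Deformable MSA sketch: resample X with Q-predicted offsets."""
    def __init__(self, dim=64):
        super().__init__()
        self.f_q = nn.Linear(dim, dim)
        self.f_kv = nn.Linear(dim, 2 * dim)
        self.f_md = nn.Linear(dim, 3)          # per position: (dx, dy) offset + modulation
        self.scale = dim ** -0.5

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.f_q(x.flatten(2).transpose(1, 2))              # (B, L, C)
        off_mod = self.f_md(q)                                  # Eq. (14)
        offset, mod = off_mod[..., :2], off_mod[..., 2:].sigmoid()

        # base sampling grid in [-1, 1] plus predicted offsets, then bilinear resampling
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).reshape(1, H, W, 2).expand(B, -1, -1, -1)
        grid = grid + offset.reshape(B, H, W, 2)
        x_hat = F.grid_sample(x, grid, mode="bilinear", align_corners=True)    # S(X, dl)
        x_hat = x_hat.flatten(2).transpose(1, 2) * mod                         # Eq. (15)

        k, v = self.f_kv(x_hat).chunk(2, dim=-1)                # K, V from the re-sampled map
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)
        return attn @ v                                         # (B, L, C)

print(MDMSA()(torch.randn(2, 64, 14, 14)).shape)  # torch.Size([2, 196, 64])
```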

Figure 6: Structure of the proposed Task-Related Head.
Table 2: Overall analogical correlations between EA and EATFormer.
EA EATFormer
Basics Population Size Patch Number (Section 1)
(Discrete) Individual (Continuous) Patch Token
(Sparse) Crossover Operation (Dense) Global MSA module
(Sparse) Mutation Operation (Dense) Individual FFN module
(Partial) Population Succession (Integral) Residual Connection
(Global) Best Individual (Aggregated) Task-Related Token
Improvements Multi-Scale Population Multi-Scale Region Aggregation module (Section 4.3.1)
Global and Local Population Global and Local Interaction module (Section 4.3.2)
Self-Adapting Parameters Weighted Operation Mixing (Section 4.3.2)
Irregular Population Distribution Modulated Deformable MSA (Section 4.3.3)
Multi-Objective EA Task-Related Head (Section 4.3.4)
Dynamic Population Pyramid Architecture (Figure 4)

4.3.4 Task-Related Head

Current transformer-based vision models either initialize different tokens for different tasks attn_deit or use a pooling operation to obtain a global representation attn_swin . However, both manners are potentially incompatible: the former treats the task token and image patches coequally, which is unreasonable and clumsy because the task token and image tokens have different feature distributions, and additional computation is required (from $O(n^{2})$ to $O((n+1)^{2})$); the latter uses only one pooling result for multiple tasks, which is also inappropriate and harmful since rich information is lost. Inspired by multi-objective EAs moea1 ; moea2 ; moea3 that find a set of solutions for different targets, we design a Task-Related Head (TRH) docked with the transformer backbone to obtain the corresponding task output from the final features. As shown in Figure 6, we employ a cross-attention paradigm to implement this module: $K$ and $V$ (gray lines) are output features extracted by the transformer backbone, while $Q$ (red line) is the task-related token that integrates global information. Note that this design is more effective and flexible for learning different tasks simultaneously while consuming a negligible amount of computation compared to the backbone, and more analytical experiments can be viewed in the following Section 5.4.8. For a fairer comparison, the TRH presented in the former conference version eat is not used by default, because this plug-and-play module can easily be added to other methods; we conduct an ablation experiment in Section 5.4.8 to verify the validity of TRH.
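A sketch of the TRH idea using one cross-attention call per task: each task keeps its own learnable query token that attends to the backbone's output tokens (K and V), so tasks are decoupled without inserting extra tokens into the backbone itself. The head count and normalization placement below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskRelatedHead(nn.Module):
    """One learnable query per task cross-attends to backbone output tokens (K, V)."""
    def __init__(self, dim=576, num_tasks=2, heads=8):
        super().__init__()
        self.task_tokens = nn.Parameter(torch.zeros(num_tasks, 1, dim))  # Q, one per task
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                       # feats: (B, L, C) backbone tokens
        B = feats.shape[0]
        kv = self.norm(feats)
        outs = []
        for t in range(self.task_tokens.shape[0]):  # e.g., classification / distillation
            q = self.task_tokens[t].expand(B, -1, -1)                   # (B, 1, C)
            outs.append(self.attn(q, kv, kv, need_weights=False)[0].squeeze(1))
        return outs                                 # list of (B, C) task-specific features

feats = torch.randn(2, 49, 576)
cls_feat, dist_feat = TaskRelatedHead()(feats)
print(cls_feat.shape)  # torch.Size([2, 576])
```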

4.3.5 Overall Congruent Relationships

To more clearly show the design inspirations of different modules, we summarize the analogies between the improved EATFormer research and homologous concepts (ideas) from EA variants in Table 2.

4.4 EATFormer Variants

In the former conference version eat , we improved the columnar ViT by introducing a local path in parallel with the global MSA operation, denoted as EAT-Ti, EAT-S, and EAT-B in the top part of Table 3. In this paper, we extend the columnar structure to a pyramid architecture and carefully re-design a novel EATFormer model, which has a series of scales for different practical applications; these variants can be viewed in the bottom part of Table 3. Except for the depth and dimension of the model, other parameters remain consistent for all models: the head dimension of MSA is 32; the window size is set to 7; the kernel size of all convolutions is $3\times 3$; the dilations of the MSRA module for the four stages are [1], [1], [1,2,3], and [1,2], respectively; the low-level stages 1-2 only use the local path, while the high-level stages 3-4 employ the hybrid GLI module for efficiency. More detailed structures and implementations can be viewed in the attached source code.

Table 3: Detailed settings of different EATFormer variants. Top part shows previous columnar EAT models eat .
Network Depth Dimension Params. (M) FLOPs (G) Inf. Mem. (G) Top-1
Col. EAT-Ti 12 192 5.7 1.01 2.2 72.7
Col. EAT-S 12 384 22.1 3.83 2.9 80.4
Col. EAT-B 12 768 86.6 14.83 4.5 82.0
Pyramid EATFormer-Mobile [1, 1, 4, 1] [48, 64, 160, 256] 1.8 0.36 2.2 69.4
Pyramid EATFormer-Lite [1, 2, 6, 1] [64, 128, 192, 256] 3.5 0.91 2.7 75.4
Pyramid EATFormer-Tiny [2, 2, 6, 2] [64, 128, 192, 256] 6.1 1.41 3.1 78.4
Pyramid EATFormer-Mini [2, 3, 8, 2] [64, 128, 256, 320] 11.1 2.29 3.6 80.9
Pyramid EATFormer-Small [3, 4, 12, 3] [64, 128, 320, 448] 24.3 4.32 4.9 83.1
Pyramid EATFormer-Medium [4, 5, 14, 4] [64, 160, 384, 512] 39.9 7.07 6.2 83.6
Pyramid EATFormer-Base [5, 6, 20, 7] [96, 160, 384, 576] 63.5 10.89 8.7 83.9
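The depth/dimension recipes in Table 3 can be expressed as a compact configuration table; the sketch below is a hypothetical helper (the registry and `build_eatformer` function are not part of the released code) that collects the per-stage settings shared by all pyramid variants.

```python
# Hypothetical variant registry mirroring the pyramid rows of Table 3
# (per-stage depths and embedding dims); `build_eatformer` is assumed for illustration.
EATFORMER_VARIANTS = {
    "mobile": {"depths": [1, 1, 4, 1],  "dims": [48, 64, 160, 256]},
    "lite":   {"depths": [1, 2, 6, 1],  "dims": [64, 128, 192, 256]},
    "tiny":   {"depths": [2, 2, 6, 2],  "dims": [64, 128, 192, 256]},
    "mini":   {"depths": [2, 3, 8, 2],  "dims": [64, 128, 256, 320]},
    "small":  {"depths": [3, 4, 12, 3], "dims": [64, 128, 320, 448]},
    "medium": {"depths": [4, 5, 14, 4], "dims": [64, 160, 384, 512]},
    "base":   {"depths": [5, 6, 20, 7], "dims": [96, 160, 384, 576]},
}

# Settings shared by all variants, as stated in Section 4.4.
COMMON = {"head_dim": 32, "window_size": 7, "kernel_size": 3,
          "msra_dilations": [[1], [1], [1, 2, 3], [1, 2]]}

def build_eatformer(name):
    cfg = {**COMMON, **EATFORMER_VARIANTS[name]}
    return cfg  # a real builder would instantiate the four stages from this config

print(build_eatformer("tiny"))
```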

4.5 Further Discussion

Compared with the EAT of the former conference version, the improved EATFormer has better inspirations, finer analogical designs, and more thorough experiments. We prove the effectiveness and integrity of the proposed method through a series of experiments, including comparisons with SOTA methods, downstream task transfer, ablation studies, and explanatory experiments. It is worth noting that the backbone of EATFormer in this paper only contains one unified EAT block, which fully considers three aspects of modeling: 1) multi-scale information aggregation, 2) feature interactions among tokens, and 3) individual enhancement. Also, the architecture recipes of the EATFormer variants in this paper are mainly given by our intuition and verified by experiments, but the alterable configuration parameters can serve as a search space for NAS that is worth further exploration in future work, e.g., the embedding dimension, dilations of MSRA, kernel size of MSRA, fusion function of MSRA, down-sampling mode of MSRA, separation ratio of GLI, normalization types, window size, operation combinations of GLI, etc.

5 Experiments

In this section, to evaluate the effectiveness and superiority of our improved EATFormer architecture, we experiment on mainstream vision tasks with models of different sizes as the backbone and conduct downstream tasks in order, i.e., image-level classification (ImageNet-1K imagenet ), object-level detection and instance segmentation (COCO 2017 coco ), and pixel-level semantic segmentation (ADE20K ade20k ). Massive ablation and explanatory experiments are further conducted to prove the effectiveness of EATFormer and its components.

5.1 Image Classification

5.1.1 Experimental Setting

All of our EATFormer variants are trained for 300 epochs from scratch without extra datasets, pre-trained models, token labeling tlt -like strategies, or exponential moving average. We employ the same training recipe as DeiT attn_deit for all EATFormer variants for fair comparisons with different SOTA methods: the AdamW adamw optimizer is used for training with betas and weight decay equaling (0.9, 0.999) and 5e-2, respectively; the batch size is set to 2,048, while the base learning rate is 5e-4 by default and is scaled linearly with the batch size divided by 512; the standard cosine learning rate scheduler, data augmentation strategies, warm-up, and stochastic depth are used during the training phase attn_deit . EATFormer is built on PyTorch pytorch and relies on the TIMM interface timm .
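To make the recipe concrete, the following is a minimal sketch of the optimizer and schedule described above; the model name and the reading of the linear scaling rule are illustrative assumptions, and the augmentation/regularization details of the DeiT recipe are omitted.

```python
import torch
import timm

# Minimal sketch of the classification optimizer/schedule (augmentations omitted).
# The model name is a placeholder; EATFormer itself is built on the TIMM interface.
batch_size, base_lr, epochs = 2048, 5e-4, 300
model = timm.create_model("deit_tiny_patch16_224", pretrained=False)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=base_lr * batch_size / 512,   # assumed reading of the linear scaling rule
    betas=(0.9, 0.999),
    weight_decay=5e-2,
)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
```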

Table 4: Comparison with SOTA methods on ImageNet-1K. Reported results are from corresponding papers. Gray background shows CNN-based and columnar methods, while the proposed EATFormer variants are colored in blue.
Network  Params. (M) ↓  FLOPs (G) ↓  Images/s ↑ (GPU / CPU)  Resolution  Top-1  Pub.
MNetV3-Small 0.75× mnetv3  2.0  0.05  9872 / 589.1  224^2  65.4  ICCV’19
EATFormer-Mobile  1.8  0.36  3926 / 456.3  224^2  69.4  -
MobileNetV3 0.75× mnetv3  4.0  0.16  5585 / 315.4  224^2  73.3  ICCV’19
PVTv2-B0 attn_pvt  3.6  0.57  1711 / 104.2  224^2  70.5  ICCV’21
XCiT-N12 attn_xcit  3.1  0.56  2736 / 290.9  224^2  69.9  NeurIPS’21
VAN-Tiny attn_van  4.1  0.87  1706 / 107.9  224^2  75.4  arXiv’22
EATFormer-Lite  3.5  0.91  2168 / 246.3  224^2  75.4  -
DeiT-Ti attn_deit  5.7  1.25  2342 / 417.7  224^2  72.2  ICML’21
EAT-Ti eat  5.8  1.01  2356 / 436.8  224^2  72.7  NeurIPS’21
EfficientNet-B0 efficientnet  5.3  0.39  2835 / 225.1  224^2  77.1  ICML’19
CoaT-Lite Tiny attn_coat  5.7  1.59  1055 / 143.5  224^2  77.5  ICCV’21
ViTAE-6M attn_vitae  6.6  2.16  921 / 152.6  224^2  77.9  NeurIPS’21
XCiT-T12 attn_xcit  6.7  1.25  1750 / 259.5  224^2  77.1  NeurIPS’21
MPViT-T attn_mpvit  5.8  1.65  755 / 125.9  224^2  78.2  CVPR’22
EATFormer-Tiny  6.1  1.41  1549 / 167.5  224^2  78.4  -
EATFormer-Tiny-384  6.1  4.23  536 / 56.9  384^2  80.1  -
EfficientNet-B2 efficientnet  9.1  0.88  1440 / 143.4  256^2  80.1  ICML’19
PVTv2-B1 attn_pvt  14.0  2.12  1006 / 79.2  224^2  78.7  ICCV’21
ViTAE-13M attn_vitae  10.8  3.05  698 / 114.3  224^2  81.0  NeurIPS’21
XCiT-T24 attn_xcit  12.1  2.35  933 / 146.6  224^2  79.4  NeurIPS’21
CoaT-Lite Mini attn_coat  11.0  1.99  968 / 164.8  224^2  79.1  ICCV’21
PoolFormer-S12 attn_metaformer  11.9  1.82  1858 / 218.7  224^2  77.2  CVPR’22
MPViT-XS attn_mpvit  10.5  2.97  612 / 95.8  224^2  80.9  CVPR’22
VAN-Small attn_van  13.8  2.50  992 / 95.2  224^2  81.1  arXiv’22
EATFormer-Mini  11.1  2.29  1055 / 122.1  224^2  80.9  -
DeiT-S attn_deit  22.0  4.60  937 / 163.5  224^2  79.8  ICML’21
EAT-S eat  22.1  3.83  964 / 175.6  224^2  80.4  NeurIPS’21
ResNet-50 resnet ; timm_resnet  25.5  4.11  1192 / 123.8  224^2  80.4  CVPR’16
EfficientNet-B4 efficientnet  19.3  3.13  495 / 38.2  320^2  82.9  ICML’19
CoaT-Lite Small attn_coat  19.8  3.96  542 / 93.8  224^2  81.9  ICCV’21
PVTv2-B2 attn_pvt  25.3  4.04  585 / 36.5  224^2  82.0  ICCV’21
Swin-T attn_swin  28.2  4.50  664 / 88.0  224^2  81.3  ICCV’21
XCiT-S12 attn_xcit  26.2  18.92  187 / 30.1  224^2  82.0  NeurIPS’21
ViTAE-S attn_vitae  24.0  6.20  399 / 69.9  224^2  82.0  NeurIPS’21
UniFormer-S attn_uniformer  21.5  3.64  844 / 121.1  224^2  82.9  ICLR’22
CrossFormer-T attn_crossformer  27.7  2.86  948 / 185.4  224^2  81.5  ICLR’22
DAT-T attn_DAT  28.3  4.58  581 / 75.6  224^2  82.0  CVPR’22
PoolFormer-S36 attn_metaformer  30.8  5.00  656 / 76.4  224^2  81.4  CVPR’22
MPViT-S attn_mpvit  22.8  4.80  417 / 68.7  224^2  83.0  CVPR’22
Shunted-S attn_SSA  22.4  5.01  461 / 48.6  224^2  83.7  CVPR’22
PoolFormer-S24 attn_metaformer  21.3  3.41  971 / 111.9  224^2  80.3  CVPR’22
CSWin-T cswin  22.3  4.34  592 / 66.6  224^2  82.7  CVPR’22
iFormer-S iformer  19.9  4.85  528 / 58.8  224^2  83.4  NeurIPS’22
MaxViT-T maxvit  25.1  4.56  367 / 36.6  224^2  83.6  ECCV’22
VAN-Base attn_van  26.5  5.00  542 / 51.9  224^2  82.8  arXiv’22
ViTAEv2-S attn_vitae2  19.3  5.78  464 / 72.8  224^2  82.6  IJCV’23
NAT-Tiny attn_nat  27.9  4.11  401 / < 1  224^2  83.2  arXiv’22
EATFormer-Small  24.3  4.32  615 / 73.3  224^2  83.1  -
EATFormer-Small-384  24.3  12.92  198 / 24.1  384^2  84.3  -
ResNet-101 resnet ; timm_resnet  44.5  7.83  675 / 81.5  224^2  81.5  CVPR’16
EfficientNet-B5 efficientnet  30.3  10.46  163 / 18.6  456^2  83.6  ICML’19
PVTv2-B3 attn_pvt  45.2  6.92  392 / 45.7  224^2  83.2  ICCV’21
CoaT-Lite Medium attn_coat  44.5  9.80  275 / 42.2  224^2  83.6  ICCV’21
XCiT-S24 attn_xcit  47.6  36.04  99 / 16.0  224^2  82.6  NeurIPS’21
CSWin-S cswin  34.6  6.83  360 / 51.2  224^2  83.6  CVPR’22
EATFormer-Medium  39.9  7.05  425 / 53.4  224^2  83.6  -
ViT-B/16 attn_vit  86.5  17.58  293 / 53.0  384^2  77.9  ICLR’21
DeiT-B attn_deit  86.5  17.58  293 / 53.7  224^2  81.8  ICML’21
EAT-B eat  86.6  14.83  331 / 71.5  224^2  82.0  NeurIPS’21
ResNet-152 resnet ; timm_resnet  60.1  11.55  470 / 57.5  224^2  82.0  CVPR’16
EfficientNet-B7 efficientnet  66.3  38.32  47 / 5.6  600^2  84.3  ICML’19
PVTv2-B5 attn_pvt  81.9  11.76  256 / 33.9  224^2  83.8  ICCV’21
Swin-B attn_swin  87.7  15.46  258 / 38.6  224^2  83.5  ICCV’21
Twins-SVT-L attn_twins  99.2  15.14  271 / 43.1  224^2  83.2  NeurIPS’21
UniFormer-B attn_uniformer  49.7  8.27  378 / 58.7  224^2  83.9  ICLR’22
DAT-B attn_DAT  87.8  15.78  217 / 30.8  224^2  84.0  CVPR’22
Shunted-B attn_SSA  39.6  8.18  290 / 33.7  224^2  84.0  CVPR’22
PoolFormer-M48 attn_metaformer  73.4  11.59  301 / 37.5  224^2  82.5  CVPR’22
MPViT-B attn_mpvit  74.8  16.44  181 / 29.1  224^2  84.0  CVPR’22
CSWin-B cswin  77.4  15.00  204 / 32.5  224^2  84.2  CVPR’22
iFormer-B iformer  47.9  9.38  262 / 38.9  224^2  84.6  NeurIPS’22
MaxViT-S maxvit  55.8  9.41  231 / 32.1  224^2  84.4  ECCV’22
NAT-Small attn_nat  50.7  7.50  260 / < 1  224^2  83.7  arXiv’22
ViTAEv2-48M attn_vitae2  48.6  13.38  251 / 38.6  224^2  83.8  IJCV’23
EATFormer-Base  49.0  8.94  329 / 43.7  224^2  83.9  -
EATFormer-Base-384  49.0  26.11  112 / 14.2  384^2  84.9  -

5.1.2 Experimental Results

In this work, we design EATFormer variants at different scales to meet different application requirements, and comparison results with SOTA methods are shown in Table 4. To fully evaluate different methods, we report the number of parameters (Params.), FLOPs, and Top-1 accuracy on ImageNet-1K, as well as the throughput on GPU (with a basic batch size of 128 on a single V100 SXM2 32GB; the batch size is reduced to the maximum that fits in memory for large models) and CPU (with a batch size of 128 on a Xeon 8255C CPU @ 2.50GHz). Our smallest EATFormer-Mobile obtains 69.4, which is much higher than the MobileNetV3-Small 0.75× counterpart, i.e., 65.4, while the largest EATFormer-Base obtains a very competitive result with only 49.0M parameters and further achieves 84.9 at 384×384 resolution. Although our approach obtains only a slight improvement over the recent SOTA MPViT-T/-XS/-S by +0.2%/+0.0%/+0.1%, EATFormer has significantly fewer FLOPs by -0.21G/-0.68G/-0.48G, 2.1×/1.7×/1.5× faster GPU throughput, and 1.33×/1.27×/1.07× faster CPU throughput. At the larger 50M level, our EATFormer-Base still achieves a throughput of 329, which is 1.8× faster than MPViT-B, and this efficiency increase is also considerable. This means that EATFormer is more user-friendly than MPViT on general-purpose GPU and CPU devices, and our EATFormer better trades off parameters, computation, and precision. At the same time, our tiny, small, and base models improve by +5.7, +2.7, and +1.9 compared with the previous conference version. Interestingly, we find that the Top-1 accuracy of different methods with 50~80M parameters approximately saturates at 84.0 without external data, token labeling, larger resolution, etc., so alleviating this problem is worth future exploration.

Table 5: Object detection and instance segmentation with Mask R-CNN on the COCO coco dataset for 1× and 3× schedules. All backbones are pre-trained on ImageNet-1K imagenet .
Backbone  Mask R-CNN 1×  Mask R-CNN 3×  Params. (M) ↓  FLOPs (G) ↓  Pub.
(each schedule reports AP^b, AP^b_50, AP^b_75, AP^m, AP^m_50, AP^m_75)
PVT-Tiny attn_pvt 36.7 59.2 39.3 35.1 56.7 37.3 39.8 62.2 43.0 37.4 59.3 39.9 33 - ICCV’21
PVTv2-B0 attn_pvt2 38.2 60.5 40.7 36.2 57.8 38.6 - - - - - - 23 195 CVM’22
XCiT-T12 attn_xcit - - - - - - 44.5 66.4 48.8 40.4 63.5 43.3 26 266 NeurIPS’21
PFormer-S12 attn_metaformer 37.3 59.0 40.1 34.6 55.8 36.9 - - - - - - 31 - CVPR’22
MPViT-T attn_mpvit 42.2 64.2 45.8 39.0 61.4 41.8 44.8 66.9 49.2 41.0 64.2 44.1 28 216 CVPR’22
EATFormer-Tiny 42.3 64.7 46.2 39.0 61.5 42.0 45.4 67.5 49.5 41.4 64.8 44.6 25 198 -
ResNet-50 resnet 38.0 58.6 41.4 34.4 55.1 36.7 41.0 61.7 44.9 37.1 58.4 40.1 44 260 CVPR’16
Swin-T attn_swin 43.7 66.6 47.7 39.8 63.3 42.7 46.0 68.1 50.3 41.6 65.1 44.9 48 267 ICCV’21
Twins-S attn_twins 43.4 66.0 47.3 40.3 63.2 43.4 46.8 69.2 51.2 42.6 66.3 45.8 44 228 NeurIPS’21
PFormer-S24 attn_metaformer 40.1 62.2 43.4 37.0 59.1 39.6 - - - - - - 41 - CVPR’22
DAT-T attn_DAT 44.4 67.6 48.5 40.4 64.2 43.1 47.1 69.2 51.6 42.4 66.1 45.5 48 272 CVPR’22
MPViT-S attn_mpvit 46.4 68.6 51.2 42.4 65.6 45.7 48.4 70.5 52.6 43.9 67.6 47.5 43 268 CVPR’22
EATFormer-Small 46.1 68.4 50.4 41.9 65.3 44.8 47.4 69.3 51.9 42.9 66.4 46.3 44 258 -
ResNet-101 resnet 40.4 61.1 44.2 36.4 57.7 38.8 42.8 63.2 47.1 38.5 60.1 41.3 63 336 CVPR’16
Swin-S attn_swin 45.7 67.9 50.4 41.1 64.9 44.2 48.5 70.2 53.5 43.3 67.3 46.6 69 359 ICCV’21
Twins-B attn_twins 45.2 67.6 49.3 41.5 64.5 44.8 48.0 69.5 52.7 43.0 66.8 46.6 76 340 NeurIPS’21
PFormer-S36 attn_metaformer 41.0 63.1 44.8 37.7 60.1 40.0 - - - - - - 51 - CVPR’22
DAT-S attn_DAT 47.1 69.9 51.5 42.5 66.7 45.4 49.0 70.9 53.8 44.0 68.0 47.5 69 378 CVPR’22
MPViT-B attn_mpvit 48.2 70.0 52.9 43.5 67.1 46.8 49.5 70.9 54.0 44.5 68.3 48.3 95 503 CVPR’22
EATFormer-Base 47.2 69.4 52.1 42.8 66.4 46.5 49.0 70.3 53.6 44.2 67.7 47.6 68 349 -

5.2 Object Detection and Instance Segmentation

5.2.1 Experimental Setting

To further evaluate the effectiveness and superiority of our method, the ImageNet-1K imagenet pre-trained EATFormer is benchmarked as the feature extractor for downstream object detection and instance segmentation tasks on the COCO 2017 dataset coco ; its window size is increased from 7 to 12, without global attention or other changes. For fair comparisons, we employ the MMDetection library mmdetection for experiments and follow the same training recipe as Swin-Transformer attn_swin : a 1× schedule for 12 epochs and a 3× schedule with a multi-scale training strategy for 36 epochs. The AdamW adamw optimizer is used for training with learning rate and weight decay equaling 1e-4 and 5e-2, respectively.
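For reference, a hedged sketch of what the corresponding MMDetection-style config fragment could look like is given below; the base config path and the 'EATFormer' backbone type are assumptions (the backbone would need to be registered separately), not MMDetection built-ins.

```python
# Sketch of an MMDetection-style config matching the setting above.
# 'EATFormer' is a hypothetical registered backbone; the base config path is assumed.
_base_ = ['./mask_rcnn_r50_fpn_1x_coco.py']

model = dict(backbone=dict(type='EATFormer', window_size=12))
optimizer = dict(type='AdamW', lr=1e-4, weight_decay=5e-2)
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 1x schedule; the 3x schedule uses 36 epochs
```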

Refer to caption
Figure 7: Intuitive visualizations of two downstream tasks compared with Swin Transformer. Distinct differences are highlighted with red circles and rectangles.

5.2.2 Experimental Results

Comparison results of box mAP (AP^b) and mask mAP (AP^m) are reported in Table 5, and our improved EATFormer obtains competitive results over recent approaches. Specifically, our tiny model obtains +5.6/+5.6 AP^b and +3.9/+4.0 AP^m improvements over PVT-Tiny attn_pvt on the 1× and 3× schedules, while achieving higher results than MPViT attn_mpvit with fewer parameters and FLOPs, i.e., +0.6 AP^b and +0.4 AP^m on the 3× schedule. The larger EATFormer-Small and EATFormer-Base models consistently obtain better results than recent counterparts, surpassing Swin-T by +2.4/+2.1 and Swin-S by +1.5/+1.7 with the 1× schedule, and by +1.4/+1.3 and +0.5/+0.9 with the 3× schedule. Also, we obtain slightly higher results than DAT attn_DAT with the computation amount reduced by 29G. Although EATFormer is slightly lower than the SOTA MPViT-S/B on the downstream task metrics, our method has obvious advantages in the number of parameters and computation, e.g., 10G fewer FLOPs than MPViT-S for Mask R-CNN, and 154G (-30.6%) fewer FLOPs and 27M (-28.4%) fewer parameters than MPViT-B, effectively balancing effectiveness and efficiency. Compared with MPViT-T, our EATFormer-Tiny has obvious advantages in metrics, parameter number, and computation. Qualitative visualizations on the validation dataset compared with Swin-S attn_swin are shown in the top part of Figure 7. Results indicate that our EATFormer obtains more accurate detections, fewer false positives, and finer segmentation results than Swin Transformer.

5.3 Semantic Segmentation

5.3.1 Experimental Setting

We further conduct semantic segmentation experiments on the ADE20K ade20k dataset, and the pre-trained EATFormer with window size equaling 12 is integrated into the UperNet upernet architecture to obtain pixel-level predictions. In detail, we follow the same setting as Swin-Transformer attn_swin to train the model for 160k iterations. The AdamW adamw optimizer is also used with learning rate and weight decay equaling 1e-4 and 5e-2, respectively.

5.3.2 Experimental Results

Segmentation results compared with contemporary SOTA works under three main model scales are reported in Table 6. Our EATFormer-Tiny obtains a significant +3.4 improvement over the recent VAN-Tiny attn_van , while EATFormer-Small achieves a higher mIoU with fewer FLOPs than SOTA methods. The larger EATFormer-Base consistently obtains competitive results, i.e., +1.7 and +1.0 over Swin-S attn_swin and DAT-S attn_DAT , respectively. Compared with the SOTA MPViT, we obtain a better trade-off among parameters, computation, and precision; e.g., our EATFormer-Base has 26M fewer parameters and 156G fewer FLOPs than MPViT-B. Our approach generally offers a better overall balance of precision and computation than its counterparts. Also, intuitive visualizations on the validation dataset compared with Swin-S attn_swin are shown in the bottom part of Figure 7. Qualitative results consistently demonstrate the robustness and effectiveness of the proposed approach, where our EATFormer produces more accurate segmentation results.

Table 6: Semantic segmentation results compared with SOTAs on ADE20K ade20k by UperNet upernet .
Backbone  Params. (M)  GFLOPs  mIoU  Pub.
XCiT-T12 attn_xcit  34  -  43.5  NeurIPS’21
VAN-Tiny attn_van  32  858  41.1  CVPR’22
EATFormer-Tiny  34  870  44.5  -
Swin-T attn_swin  60  945  44.5  ICCV’21
XCiT-S12 attn_xcit  52  -  46.6  NeurIPS’21
DAT-T attn_DAT  60  957  45.5  CVPR’22
ViTAEv2-S attn_vitae2  49  -  45.0  IJCV’23
MPViT-S attn_mpvit  52  943  48.3  CVPR’22
UniFormer-S uniformer_arxiv  52  955  47.0  arXiv’22
EATFormer-Small  53  934  47.3  -
Swin-S attn_swin  81  1038  47.6  ICCV’21
XCiT-M24 attn_xcit  109  -  48.4  NeurIPS’21
DAT-S attn_DAT  81  1079  48.3  CVPR’22
MPViT-B attn_mpvit  105  1186  50.3  CVPR’22
UniFormer-B uniformer_arxiv  80  1106  49.5  arXiv’22
EATFormer-Base  79  1030  49.3  -

5.4 Ablation Study

To fully evaluate the effectiveness of each designed module, we conduct a series of ablation studies in the following sections. By default, EATFormer-Tiny is used for all experiments, and we follow the same training recipe as mentioned in Section 5.1.1.

5.4.1 Component of EAT Block

As mentioned in Section 4.3, our proposed EAT block contains: 1) MSRA, 2) GLI, and 3) FFN modules that are responsible for aggregating multi-scale information, interacting global and local features, and enhancing the features of each location, respectively. To verify the validity of each module in the EAT block, we conduct an ablation experiment in Table 7 that contains different component combinations. Results indicate that each component contributes to the model performance, and our EATFormer obtains the best result when using all three parts. Since the FFN takes up most of the parameters and calculations, further research on optimizing this module could yield a better overall model performance.

MSRA GLI FFN Params FLOPs Top-1
2.4 0.45 62.9
2.6 0.51 64.4
5.2 1.17 71.4
2.9 0.60 67.7
5.5 1.26 76.0
5.8 1.32 77.4
6.1 1.41 78.4
Table 7: Ablation study for different component combinations in EAT block.

5.4.2 Separation Ratio of GLI

We deduce from Equation 12 and Equation 13 in Section 4.3.2 that EATFormer has the lowest number of parameters and calculation amount when the separation ratio p of GLI equals 0.2, and there is not much difference in the total parameters and calculations when p lies in the range [0, 0.5]. To further prove the above analysis and verify the validity of GLI, we conduct a set of experiments with equal-interval sampling of p in the range [0, 1] for the classification task. As shown in Figure 8, the x-coordinate represents different separation ratios, and the left y-ordinate represents the Top-1 accuracy of the modified EATFormer-Tiny with embedding dims equaling [64, 128, 230, 320] for divisible channels. The right y-ordinate shows the model's running speed and relative computation amount. Results in the figure are consistent with the foregoing derivation, and p equaling 0.5 is the most economical and efficient choice, where the model has relatively high precision, fast speed, and low computational cost. All GLI layers in this paper use the same ratio, and exploring different ratios for different layers could lead to further improvements based on the above analysis.
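A minimal sketch of how a separation ratio could split channels between the two GLI paths is shown below; the split convention (which fraction goes to the local path) and the module layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class GLISplit(nn.Module):
    """Toy channel split between a local CNN path and a global MSA path."""
    def __init__(self, dim=256, p=0.5, num_heads=4):
        super().__init__()
        self.c_local = int(dim * p)          # assumed: p is the local-path fraction
        self.c_global = dim - self.c_local
        self.local = nn.Conv2d(self.c_local, self.c_local, 3, padding=1)
        self.global_attn = nn.MultiheadAttention(self.c_global, num_heads, batch_first=True)

    def forward(self, x):                     # x: (B, C, H, W)
        xl, xg = torch.split(x, [self.c_local, self.c_global], dim=1)
        xl = self.local(xl)                   # local convolutional modeling
        B, C, H, W = xg.shape
        t = xg.flatten(2).transpose(1, 2)     # (B, HW, C) tokens
        t, _ = self.global_attn(t, t, t)      # global self-attention
        xg = t.transpose(1, 2).reshape(B, C, H, W)
        return torch.cat([xl, xg], dim=1)
```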

Refer to caption
Figure 8: Separation ratio analysis of GLI. Blue circle represents the Top-1 accuracy of EATFormer at different separation ratios (c.f., left axis), and the radius represents the relative number of parameters. Orange and pink lines represent the running speed and relative FLOPs of the model (c.f., right axis).

5.4.3 Component Ablation of EATFormer

Following the core idea of paralleling global and local modelings, this paper extends the previous columnar EAT model eat to a pyramid architecture. Specifically, the EAT block-based EATFormer can be seen as evolving from a naive baseline, which employs: 1) patch embedding for down-sampling; 2) MSRA with only one scale; 3) naive MSA; 4) a simple addition operation with α_i, i = 1, …, N, g, l equaling 1, instead of: 1) MSRA for down-sampling; 2) MSRA with multiple scales; 3) the improved MD-MSA; 4) weighted operation mixing (WOM) with learnable α_i, i = 1, …, N, g, l. A detailed ablation experiment based on EATFormer-Tiny can be viewed in Table 8, and the results indicate that each individual component plays a role and that different component combinations complement each other to help the model achieve higher results. Note that WOM can only be applied if the multi-path-based MSRA is used.
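The weighted operation mixing used above can be sketched as learnable per-path weights normalized by a softmax before summation; the softmax normalization is an assumed detail for illustration.

```python
import torch
import torch.nn as nn

class WeightedOperationMixing(nn.Module):
    """Learnable alphas that adaptively mix parallel paths (a WOM sketch)."""
    def __init__(self, num_paths):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_paths))   # one learnable weight per path

    def forward(self, paths):                               # paths: list of same-shaped tensors
        w = torch.softmax(self.alpha, dim=0)                # assumed normalization
        return sum(wi * pi for wi, pi in zip(w, paths))

# e.g., mixing three MSRA dilation branches: WeightedOperationMixing(3)([y1, y2, y3])
```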

MSRA Down MSRA MD- MSA WOM Param. FLOPs Top-1
4.792 1.232 77.4
5.202 1.300 77.8
5.208 1.283 77.9
4.804 1.236 77.5
4.805 1.236 77.7
6.109 1.412 78.2
5.214 1.304 78.0
5.220 1.288 78.1
6.122 1.416 78.2
6.109 1.412 78.1
5.221 1.288 78.0
6.122 1.416 78.4
Table 8: Ablation study for different component combinations in EATFormer. ✔ means the component is chosen while ✘ means the opposite, and ✚ represents D-MSA that abandons the modulation operation.

5.4.4 Composition of GLI

By default, the global path in GLI employs the designed MD-MSA module inspired by the dynamic population concept, while the local branch uses a conventional CNN for static local modeling. To further assess the potential of the GLI module, different combinations of global (i.e., MSA and MD-MSA) and local (i.e., CNN and DCNv2 dcn2 ) operators are used for experiments. As shown in Table 9, MD-MSA improves the model by +0.3 with only negligible extra parameters and computation, while DCNv2 can further boost performance by a large margin at the cost of higher storage and computation. Theoretically, MD-MSA has no significant impact on the speed, but the naive PyTorch implementation without CUDA acceleration leads to an obvious decrease in GPU speed. Therefore, the running speed of our model could be improved after further optimization of MD-MSA.
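To make the MD-MSA idea concrete, the rough sketch below re-samples the feature map with predicted per-position offsets (align_corners=True, as noted in Section 5.5.4) and re-weights it with a sigmoid modulation before attention; this plain-PyTorch illustration is an assumption about the mechanism, not the paper's exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedDeformableSampling(nn.Module):
    """Sketch of modulated deformable re-sampling for the K/V feature map."""
    def __init__(self, dim):
        super().__init__()
        self.offset = nn.Conv2d(dim, 2, 1)      # predict (dx, dy) per position
        self.modulation = nn.Conv2d(dim, 1, 1)  # predict a scalar weight per position

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        # base sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).to(x).unsqueeze(0).expand(B, -1, -1, -1)
        offset = self.offset(x).permute(0, 2, 3, 1)            # (B, H, W, 2)
        sampled = F.grid_sample(x, grid + offset, align_corners=True)
        return sampled * torch.sigmoid(self.modulation(x))     # modulated features
```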

Global Local Param. FLOPs GPU Top-1
MSA CNN 6.1 1.412 1896 78.1
MSA DCNv2 9.0 1.522 1567 79.0
MD-MSA CNN 6.1 1.416 1549 78.4
MD-MSA DCNv2 9.0 1.526 1333 79.2
Table 9: Ablation study for compositions of GLI.

5.4.5 Normalization Type

Transformer-based vision models generally use Layer Normalization (LN) rather than Batch Normalization (BN) to achieve better results. Nevertheless, considering that LN requires slightly more computation than BN and that the proposed hybrid EATFormer contains many convolutions that are usually combined with BN layers, we conduct an ablation study to evaluate which normalization is better. Table 10 shows the results on three EATFormer variants, and the BN-normalized EATFormer achieves slightly better results while having a significantly faster GPU inference speed. Note that merging convolution and BN layers is not used here, and this technique can further improve the inference speed.
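The conv-BN folding mentioned above is a standard inference-time transform; a minimal sketch:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm layer into the preceding convolution for faster inference."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```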

Network Params FLOPs GPU Top-1
Tiny-LN 6.1 1.425 963 78.2
Tiny-BN 6.1 1.416 1549 78.4
Small-LN 24.3 4.337 448 82.8
Small-BN 24.3 4.320 615 83.1
Base-LN 49.0 8.775 240 83.7
Base-BN 49.0 8.744 345 83.9
Table 10: Effects of different normalization types.

5.4.6 MSRA at Different Stages

Different network depths may have different requirements for the MSRA module, so we explore the introduction of MSRA at different stages. As shown in Table 11, our model obtains the best result when MSRA is used in [2, 3, 4] stages, and the model effect decreases sharply when only used in the fourth stage. Considering the model accuracy and efficiency, using this module in [3, 4] stages is a better choice.

Stages Params FLOPs GPU Top-1
[1, 2, 3, 4] 6.3 1.541 1291 78.2
[2, 3, 4] 6.3 1.533 1434 78.5
[3, 4] 6.1 1.416 1549 78.4
[4] 5.6 1.326 1695 77.9
Table 11: Ablation study of MSRA on different stages.

5.4.7 Kernel Size of MSRA

The MSRA module for multi-scale modeling adopts CNN as its primary component, so the convolution kernel size may influence the model results. As shown in Table 12, a larger kernel size only slightly increases the model accuracy, while the number of parameters and the amount of computation increase dramatically. Therefore, we employ the efficient 3×3 kernel size in MSRA for EATFormer at all scales.

Size Params FLOPs GPU Top-1
3×3 6.1 1.416 1549 78.4
5×5 9.0 1.845 1342 78.5
7×7 13.4 2.487 1087 78.5
Table 12: Ablation study on kernel size of MSRA.

5.4.8 Layer Number of TRH

The plug-and-play TRH module can easily be docked with the transformer backbone to obtain a task-related feature representation, and we take the classification task as an example to explore the effect of this module. As shown in Table 13, the Top-1 accuracy of the EATFormer-Tiny model is significantly improved by gradually increasing the number of TRH layers, and the performance tends to saturate after two layers. Therefore, a two-layer TRH is the recommended choice to balance model effectiveness and efficiency. However, there is no noticeable improvement for the larger models, so for them the multi-task flexibility of TRH matters more than the accuracy improvement.

Network Params FLOPs GPU Top-1
Tiny 6.1 1.416 1549 78.4
Tiny +1 6.9 (+0.8) 1.423 1495 78.7
Tiny +2 7.7 (+1.6) 1.430 1461 79.1
Tiny +3 8.4 (+2.3) 1.438 1423 79.2
Small +2 29.1 (+4.8) 4.363 589 83.2
Base +2 55.3 (+6.3) 9.001 316 83.9
Table 13: Quantitative ablation study for the layer number of TRH.

5.5 EATFormer Explanation

5.5.1 Alpha Distribution of Different Depths

The weighted operation mixing mechanism can improve the model performance and objectively reflects the model's attention to different branches at different depths. Based on EATFormer-Tiny, we use a 3-path MSRA along with a 2-path GLI for each EAT block, and the alpha-indicated weight distribution after training is shown in Figure 9. 1) For the MSRA module, the proportion of α_1 (i.e., dilation equals 1) within the same stage shows an increasing trend, while the larger-dilation α_3 shows the opposite, indicating that local feature extraction with stronger correlation (i.e., smaller scale) is more critical for the network. The weight jump between adjacent stages is caused by the down-sampling operation that changes the feature distribution. In the last stage 4, large-scale paths have more weight because they need to model as much global information as possible to obtain proper classification results. In general, however, the proportion of each branch is balanced, meaning that feature learning at all scales contributes to the network. Considering the amount of computation and the number of parameters, this also supports the choice of only using MSRA in stages 3-4 described in Section 5.4.6 above. 2) For the GLI module, the global branch gains more and more weight relative to the local branch as the network deepens, indicating that both branches are effective and complement each other: the local CNN is more suitable for low-level feature extraction, while the global transformer is better at high-level information fusion.

Refer to caption
Figure 9: Alpha distribution of the trained EATFormer for different depths. Top and bottom parts represent α_i, i = 1, …, N in MSRA and α_i, i = g, l in GLI, respectively.

5.5.2 Attention Visualization

To better illustrate which parts of the image the model focuses on, Grad-CAM gradcam is applied to highlight the regions of concern. As shown in Figure 10, we visualize different images by column for ResNet-50 resnet , Swin-B attn_swin , and our EATFormer-Base models, respectively. Results indicate that: 1) CNN-based ResNet tends to focus on as many regions as possible but ignores edges; 2) Transformer-based Swin pays more attention to sparse local areas; 3) Thanks to the design of the MSRA and GLI modules, our EATFormer has more discriminative attention to subject targets with very sharp edges.
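A minimal Grad-CAM sketch of the kind used for Figure 10 is given below, assuming a hooked layer that outputs a (B, C, H, W) feature map; this hook-based implementation is illustrative rather than the exact tooling used here.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Return a Grad-CAM heat map upsampled to the input resolution."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    model.zero_grad()
    score = model(image)[0, class_idx]          # logit of the target class
    score.backward()
    h1.remove()
    h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)            # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(dim=1, keepdim=True))  # weighted feature sum
    return F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
```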

Refer to caption
Figure 10: Attention visualizations by Grad-CAM of our EATFormer compared to CNN-based ResNet-50 resnet and transformer-based Swin-B attn_swin .

5.5.3 Attention Distance of Global Path in GLI

We design the GLI module to explicitly model global and local information separately, so the local branch can take over part of the short-distance modeling from the global branch. To verify this, we visualize the modeling distance of the global branch for our previous columnar EAT model eat and the currently studied EATFormer in Figure 11. 1-Top) Compared with DeiT without local modeling, our EAT pays more attention to global information fusion (layers 4/6 are chosen as examples), where more significant values appear at off-diagonal locations. 2-Bottom) Attention maps in the last stage are visualized because the window size equals the feature size and thus covers the overall information. When global modeling is used alone (w/o GLI), the model only focuses on sparse regions, but it attends to more regions when GLI is used. Results indicate that the designed parallel local path takes over some of the local modeling that would otherwise fall to the global path. We can also observe differences in feature modeling between columnar-like and pyramid architectures.

Relationship with EA. Motivated by EA variants mea1 ; mea2 ; mea3 that introduce local search procedures besides the conventional global search to converge to higher-quality solutions, we analogically design the novel GLI module. When GLI is not used (only global modeling), the model tends to correlate local regional features, which is consistent with the concept of a local population in biological evolution caused by geographical constraints, i.e., akin to local search in EA. With GLI, explicit local modeling unlocks the potential of global modeling, forcing the global branch to associate more distant features for better results, just as combining global and local search improves performance in EA mea1 ; mea2 ; mea3 .

Refer to caption
Figure 11: Attention scope for the transformer-based global branch in GLI. The top part shows the attention maps for columnar DeiT attn_deit (w/o local modeling) and our previous EAT model eat (w/ local modeling) in different depths; the bottom part shows results of EATFormer w/ and w/o local modelings in the last stage.

5.5.4 Δl and Δm Distribution of MD-MSA

Figure 12 visualizes the learned offsets (the longer the arrow, the farther the deformable distance, and the arrow direction indicates the sampling direction) and modulation (the brighter the color, the greater the weight) of MD-MSA in stage 4. The offsets and modulations of each location differ across depths, and the model unexpectedly tends to give more weight to locations on the main object, which describe its principal parts. Since we set align_corners to true when resampling, there is a gradually increasing bias from 0 to 0.5 going from the center to the edge. Therefore, the visualization appears to spread outwards as a whole, which may visually weaken the changes at each learned position. Please zoom in for better visualization.

Relationship with EA. Inspired by the irregular spatial distribution of real individuals, which is not arranged horizontally and vertically like image pixels, we improve the novel MD-MSA module that considers the offset of each spatial position. As shown in Figure 12, different positions (individuals) prefer different offsets and modulations (i.e., direction and scale), just as individuals have different preferences in different regions of the biological world. This modeling approach has also been verified in EA, e.g., the improved works dae_supp3 ; dae_supp1 ; dae_supp2 adopt similar parameter adaptation and feature scaling ideas to conduct global feature interaction.

Refer to caption
Figure 12: Visualization of deformable offsets (denoted as arrows) and modulation scalar (denoted as color) for MD-MSA of our small model in the last stage. Please zoom in on the red circle for more details.

5.5.5 Visualization of Attention Map in TRH

Taking the classification task as an example, we visualize the attention map in the two-layer TRH that contains multiple heads in the inner cross-attention layer. As shown in Figure 13, we normalize values of attention maps to [0, 1] and draw them on the right side of the image. Results indicate that different heads focus on different regions, and the deeper TRH2 focuses on a broader area than TRH1 to form the final feature.

Refer to caption
Figure 13: Attention map visualization in TRH for the classification task. The model contains two TRHs, and four head attentions are displayed in each TRH.

5.5.6 Parameters and FLOPs Distribution

Taking the designed EATFormer-Tiny as an example, we analyze the distribution of parameters and FLOPs over different layers, where the model contains a stem for resolution reduction, four stages for feature extraction, and a head for target output. As shown in Figure 14, the parameters are mainly distributed in the deep stages 3-4, while FLOPs concentrate in the early stages, and the FFN occupies the majority of both parameters and computation. Therefore, future work can focus on optimizing the FFN structure to better balance the overall model efficiency.
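The per-module breakdown can be reproduced with a simple parameter counter; the stage attribute names and the fvcore-based FLOPs profiler below are assumed choices for illustration.

```python
import torch
from fvcore.nn import FlopCountAnalysis  # assumed profiling library

def param_count_m(module: torch.nn.Module) -> float:
    """Parameter count of a sub-module in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# Per-stage breakdown, assuming the model exposes .stem, .stage1..stage4, .head:
# print({name: param_count_m(getattr(model, name))
#        for name in ["stem", "stage1", "stage2", "stage3", "stage4", "head"]})
# Input-dependent FLOPs for a 224x224 image:
# total_flops = FlopCountAnalysis(model, torch.randn(1, 3, 224, 224)).total()
```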

Refer to caption
Figure 14: Analyses of Params. and FLOPs distribution.

5.5.7 Comparison with Works Using Local/Global Concepts

In this paper, the locality in ViT refers to introducing CNN with inductive bias into the Transformer structure, and we design the GLI block as a parallel structure that places a distinct local branch beside the global branch. This idea is motivated by some EA variants mea1 ; mea2 ; mea3 that employ local search procedures besides the conventional global search to converge to higher-quality solutions. Also, the global/local concept is only a macro-level idea, and its specific instantiation varies from method to method. E.g., the global/local concept in MPViT attn_mpvit is expressed as parallelism between blocks rather than within each block, while CMT cmt cascades local information into the FFN module rather than into the MSA as in previous works attn_botnet ; attn_ceit ; efficientformerv2 . Comparatively, our GLI block consists of parallel local convolution and global MD-MSA operations, and the Weighted Operation Mixing (WOM) mechanism is further proposed to mix all operations adaptively. Therefore, we argue that GLI obviously differs from the compared methods. Besides, we compare our method with some contemporary/recent works attn_mpvit ; cmt ; attn_uniformer ; attn_vitae2 ; mobilevit ; edgenext ; efficientformerv2 that incorporate the global/local concept into their model designs. To further illustrate the differences from these methods, we make a comprehensive comparison in terms of local/global concepts using several criteria in Table 14. The results illustrate the uniqueness of GLI at the technical level.

Table 14: Global/local idea comparisons among recent methods. Criterion 1: whether the local/global concept is instantiated as a parallel block; Criterion 2: whether the local operation is combined with MSA; Criterion 3: whether only one kind of block is used for the model; Criterion 4: whether feature split is employed for local/global paths; Criterion 5: whether feature importance is considered for local/global paths. ✔: Satisfied; ✘: Unsatisfied.
Method vs. Criterion
MPViT attn_mpvit
CMT cmt
UniFormer attn_uniformer
ViTAEv2 attn_vitae2
MobileViT mobilevit
EdgeNeXt edgenext
EfficientFormerv2 efficientformerv2
EATFormer (Ours)

6 Conclusion

This paper explains the rationality of vision transformer by analogy with EA and improves our previous columnar EAT to a novel pyramid EATFormer architecture inspired by effective EA variants. Specifically, the designed backbone consists only of the proposed EAT block that contains three residual parts, i.e., MSRA, GLI, and FFN modules, to model multi-scale, interactive, and individual information separately. Moreover, we propose a TRH module and improve an MD-MSA module to boost the effectiveness and usability of our EATFormer further. Abundant experiments on classification and downstream tasks demonstrate the superiority of our approach over SOTA methods in terms of accuracy and efficiency, while ablation and explanatory experiments further illustrate the effectiveness of EATFormer and each analogically designed component.

Nevertheless, we do not use larger models (e.g., >100M parameters), larger datasets (e.g., ImageNet-21K imagenet ), or stronger training strategies (e.g., token labeling tlt ) in our experiments due to the limited amount of computation. Also, the architecture recipes are mainly given by our intuition, and the hyper-parameters could be further tuned to optimize the model structure. We will explore the above aspects and the combination with self-supervised learning techniques in future works.

Data Availability Statement. All the datasets used in this paper are available online. ImageNet-1K (http://image-net.org), COCO 2017 (https://cocodataset.org), and ADE20K (http://sceneparsing.csail.mit.edu) can be downloaded from their official websites accordingly.

Acknowledgement. This work was supported by a grant from the National Natural Science Foundation of China (No. 62103363).

References

  • (1) Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al.: Xcit: Cross-covariance image transformers. In: NeurIPS (2021)
  • (2) Atito, S., Awais, M., Kittler, J.: Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602 (2021)
  • (3) Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: Data2vec: A general framework for self-supervised learning in speech, vision and language. In: ICML (2022)
  • (4) Bao, H., Dong, L., Piao, S., Wei, F.: BEit: BERT pre-training of image transformers. In: ICLR (2022)
  • (5) Bartz-Beielstein, T., Branke, J., Mehnen, J., Mersmann, O.: Evolutionary algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2014)
  • (6) Bello, I.: Lambdanetworks: Modeling long-range interactions without attention. In: ICLR (2021)
  • (7) Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML (2021)
  • (8) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV (2021)
  • (9) Bhowmik, P., Pantho, M.J.H., Bobda, C.: Bio-inspired smart vision sensor: toward a reconfigurable hardware modeling of the hierarchical processing in the brain. J REAL-TIME IMAGE PR (2021)
  • (10) Brest, J., Greiner, S., Boskovic, B., Mernik, M., Zumer, V.: Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. TEC (2006)
  • (11) Brest, J., Zamuda, A., Boskovic, B., Maucec, M.S., Zumer, V.: High-dimensional real-parameter optimization using self-adaptive differential evolution algorithm with population size reduction. In: CEC (2008)
  • (12) Brest, J., Zamuda, A., Fister, I., Maučec, M.S.: Large scale global optimization using self-adaptive differential evolution algorithm. In: CEC (2010)
  • (13) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: NeurIPS (2020)
  • (14) Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
  • (15) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
  • (16) Chen, B., Li, P., Li, C., Li, B., Bai, L., Lin, C., Sun, M., Yan, J., Ouyang, W.: Glit: Neural architecture search for global and local image transformer. In: ICCV (2021)
  • (17) Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: CVPR (2021)
  • (18) Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
  • (19) Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
  • (20) Chen, M., Peng, H., Fu, J., Ling, H.: Autoformer: Searching transformers for visual recognition. In: ICCV (2021)
  • (21) Chen, M., Wu, K., Ni, B., Peng, H., Liu, B., Fu, J., Chao, H., Ling, H.: Searching the search space of vision transformer. In: NeurIPS (2021)
  • (22) Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J., Wang, J.: Mixformer: Mixing features across windows and dimensions. In: CVPR (2022)
  • (23) Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: A language modeling framework for object detection. In: ICLR (2022)
  • (24) Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., Wang, J.: Context autoencoder for self-supervised representation learning. IJCV (2023)
  • (25) Chen, X., Xie, S., He, K.: An empirical study of training self-supervised visual transformers. In: ICCV (2021)
  • (26) Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-former: Bridging mobilenet and transformer. In: CVPR (2022)
  • (27) Chen, Z., Kang, L.: Multi-population evolutionary algorithm for solving constrained optimization problems. In: AIAI (2005)
  • (28) Chen, Z., Zhu, Y., Zhao, C., Hu, G., Zeng, W., Wang, J., Tang, M.: Dpt: Deformable patch-based transformer for visual recognition. In: ACM MM (2021)
  • (29) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
  • (30) Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS (2021)
  • (31) Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., Belanger, D.B., Colwell, L.J., Weller, A.: Rethinking attention with performers. In: ICLR (2021)
  • (32) Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C.: Twins: Revisiting the design of spatial attention in vision transformers. In: NeurIPS (2021)
  • (33) Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia, H., Shen, C.: Conditional positional encodings for vision transformers. In: ICLR (2023)
  • (34) Coello, C.A.C., Lamont, G.B.: Applications of multi-objective evolutionary algorithms, vol. 1. World Scientific (2004)
  • (35) Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-attention and convolutional layers. In: ICLR (2020)
  • (36) Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV (2017)
  • (37) Das, S., Suganthan, P.N.: Differential evolution: A survey of the state-of-the-art. TEC (2010)
  • (38) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
  • (39) Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
  • (40) Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Cswin transformer: A general vision transformer backbone with cross-shaped windows. In: CVPR (2022)
  • (41) Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N., Guo, B.: Peco: Perceptual codebook for bert pre-training of vision transformers. In: AAAI (2023)
  • (42) Dong, Y., Cordonnier, J.B., Loukas, A.: Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In: ICML (2021)
  • (43) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  • (44) d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: Convit: Improving vision transformers with soft convolutional inductive biases. In: ICML (2021)
  • (45) Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., Liu, W.: You only look at one sequence: Rethinking transformer in vision through object detection. In: NeurIPS (2021)
  • (46) Felleman, D.J., Van Essen, D.C.: Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991) (1991)
  • (47) Gao, P., Ma, T., Li, H., Lin, Z., Dai, J., Qiao, Y.: Mcmae: Masked convolution meets masked autoencoders. In: NeurIPS (2022)
  • (48) García-Martínez, C., Lozano, M.: Local search based on genetic algorithms. In: Advances in metaheuristics for hard optimization. Springer (2008)
  • (49) Goyal, A., Bengio, Y.: Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A (2022)
  • (50) Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C.: Cmt: Convolutional neural networks meet vision transformers. In: CVPR (2022)
  • (51) Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M.: Visual attention network. CVM (2023)
  • (52) Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: NeurIPS (2021)
  • (53) Hao, Y., Dong, L., Wei, F., Xu, K.: Self-attention attribution: Interpreting information interactions inside transformer. In: AAAI (2021)
  • (54) Hart, W.E., Krasnogor, N., Smith, J.E.: Memetic evolutionary algorithms. In: Recent advances in memetic algorithms, pp. 3–27. Springer (2005)
  • (55) Hassanat, A., Almohammadi, K., Alkafaween, E., Abunawas, E., Hammouri, A., Prasath, V.: Choosing mutation and crossover ratios for genetic algorithms—a review with a new dynamic approach. Information (2019)
  • (56) Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention transformer. In: CVPR (2023)
  • (57) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR (2022)
  • (58) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • (59) He, R., Ravula, A., Kanagal, B., Ainslie, J.: Realformer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747 (2020)
  • (60) Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: ICCV (2019)
  • (61) Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B.: Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650 (2021)
  • (62) Hudson, D.A., Zitnick, L.: Generative adversarial transformers. In: ICML (2021)
  • (63) Jiang, Y., Chang, S., Wang, Z.: Transgan: Two pure transformers can make one strong gan, and that can scale up. In: NeurIPS (2021)
  • (64) Jiang, Z.H., Hou, Q., Yuan, L., Zhou, D., Shi, Y., Jin, X., Wang, A., Feng, J.: All tokens matter: Token labeling for training better vision transformers. In: NeurIPS (2021)
  • (65) Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are rnns: Fast autoregressive transformers with linear attention. In: ICML (2020)
  • (66) Khare, V., Yao, X., Deb, K.: Performance scaling of multi-objective evolutionary algorithms. In: EMO (2003)
  • (67) Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., Hong, S.: Pure transformers are powerful graph learners. In: NeurIPS (2022)
  • (68) Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: The efficient transformer. In: ICLR (2020)
  • (69) Kolen, A., Pesch, E.: Genetic local search in combinatorial optimization. Discrete Applied Mathematics (1994)
  • (70) Kumar, S., Sharma, V.K., Kumari, R.: Memetic search in differential evolution algorithm. arXiv preprint arXiv:1408.0101 (2014)
  • (71) Land, M.W.S.: Evolutionary algorithms with local search for combinatorial optimization. University of California, San Diego (1998)
  • (72) Lee, Y., Kim, J., Willette, J., Hwang, S.J.: Mpvit: Multi-path vision transformer for dense prediction. In: CVPR (2022)
  • (73) Li, C., Tang, T., Wang, G., Peng, J., Wang, B., Liang, X., Chang, X.: Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In: ICCV (2021)
  • (74) Li, K., Wang, Y., Peng, G., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unified transformer for efficient spatial-temporal representation learning. In: ICLR (2022)
  • (75) Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unifying convolution and self-attention for visual recognition. TPAMI (2023)
  • (76) Li, X., Wang, L., Jiang, Q., Li, N.: Differential evolution algorithm with multi-population cooperation and multi-strategy integration. Neurocomputing (2021)
  • (77) Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S., Ren, J.: Rethinking vision transformers for mobilenet size and speed. In: ICCV (2023)
  • (78) Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
  • (79) Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: ICCV (2021)
  • (80) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
  • (81) Liu, J., Lampinen, J.: A fuzzy adaptive differential evolution algorithm. Soft Computing (2005)
  • (82) Liu, Y., Li, H., Guo, Y., Kong, C., Li, J., Wang, S.: Rethinking attention-model explainability through faithfulness violation test. In: ICML (2022)
  • (83) Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al.: Swin transformer v2: Scaling up capacity and resolution. In: CVPR (2022)
  • (84) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
  • (85) Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
  • (86) Lu, J., Mottaghi, R., Kembhavi, A., et al.: Container: Context aggregation networks. In: NeurIPS (2021)
  • (87) Maaz, M., Shaker, A., Cholakkal, H., Khan, S., Zamir, S.W., Anwer, R.M., Shahbaz Khan, F.: Edgenext: efficiently amalgamated cnn-transformer architecture for mobile vision applications. In: ECCVW (2023)
  • (88) Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: ICLR (2022)
  • (89) Min, J., Zhao, Y., Luo, C., Cho, M.: Peripheral vision transformer. In: NeurIPS (2022)
  • (90) Moscato, P., et al.: On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech concurrent computation program, C3P Report 826, 1989 (1989)
  • (91) Motter, B.C.: Focal attention produces spatially selective processing in visual cortical areas v1, v2, and v4 in the presence of competing stimuli. Journal of neurophysiology (1993)
  • (92) Nakashima, K., Kataoka, H., Matsumoto, A., Iwata, K., Inoue, N., Satoh, Y.: Can vision transformers learn without natural images? In: AAAI (2022)
  • (93) Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: ICCV (2021)
  • (94) Opara, K.R., Arabas, J.: Differential evolution: A survey of theoretical analyses. Swarm and evolutionary computation (2019)
  • (95) Padhye, N., Mittal, P., Deb, K.: Differential evolution: Performances and analyses. In: CEC (2013)
  • (96) Pan, X., Ge, C., Lu, R., Song, S., Chen, G., Huang, Z., Huang, G.: On the integration of self-attention and convolution. In: CVPR (2022)
  • (97) Pant, M., Zaheer, H., Garcia-Hernandez, L., Abraham, A., et al.: Differential evolution: A review of more than two decades of research. Engineering Applications of Artificial Intelligence (2020)
  • (98) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019)
  • (99) Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: ICLR (2018)
  • (100) Qiang, Y., Pan, D., Li, C., Li, X., Jang, R., Zhu, D.: Attcat: Explaining transformers via attentive class activation tokens. In: NeurIPS (2022)
  • (101) Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. OpenAI (2018)
  • (102) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog (2019)
  • (103) Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? In: NeurIPS (2021)
  • (104) Ren, S., Zhou, D., He, S., Feng, J., Wang, X.: Shunted self-attention via multi-scale token aggregation. In: CVPR (2022)
  • (105) Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV (2017)
  • (106) Shi, E.C., Leung, F.H., Law, B.N.: Differential evolution with adaptive population size. In: ICDSP (2014)
  • (107) Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., YAN, S.: Inception transformer. In: NeurIPS (2022)
  • (108) Sloss, A.N., Gustafson, S.: 2019 evolutionary algorithms review. Genetic programming theory and practice XVII (2020)
  • (109) Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: CVPR (2021)
  • (110) Storn, R., Price, K.: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. pp. 341–359 (1997)
  • (111) Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: ICML (2019)
  • (112) Thatipelli, A., Narayan, S., Khan, S., Anwer, R.M., Khan, F.S., Ghanem, B.: Spatio-temporal relation modeling for few-shot action recognition. In: CVPR (2022)
  • (113) Toffolo, A., Benini, E.: Genetic diversity as an objective in multi-objective evolutionary algorithms. Evolutionary computation (2003)
  • (114) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
  • (115) Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV (2021)
  • (116) Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxvit: Multi-axis vision transformer. In: ECCV (2022)
  • (117) Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M.: Medical transformer: Gated axial-attention for medical image segmentation. In: MICCAI (2021)
  • (118) Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: CVPR (2021)
  • (119) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
  • (120) Vikhar, P.A.: Evolutionary algorithms: A critical review and its future prospects. In: ICGTSPICC (2016)
  • (121) Wan, Z., Chen, H., An, J., Jiang, W., Yao, C., Luo, J.: Facial attribute transformers for precise and robust makeup transfer. In: CACV (2022)
  • (122) Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C., Han, S.: Hat: Hardware-aware transformers for efficient natural language processing. In: ACL (2020)
  • (123) Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Jiang, Y.G., Zhou, L., Yuan, L.: Bevt: Bert pretraining of video transformers. In: CVPR (2022)
  • (124) Wang, S., Li, B., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020)
  • (125) Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: ICCV (2021)
  • (126) Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pvt v2: Improved baselines with pyramid vision transformer. CVM (2022)
  • (127) Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., Liu, W.: Crossformer: A versatile vision transformer hinging on cross-scale attention. In: ICLR (2022)
  • (128) Wang, Y., Yang, Y., Bai, J., Zhang, M., Bai, J., Yu, J., Zhang, C., Huang, G., Tong, Y.: Evolving attention with residual convolutions. In: ICML (2021)
  • (129) Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. In: CVPR (2022)
  • (130) Wightman, R.: Pytorch image models. https://github.com/rwightman/pytorch-image-models (2019)
  • (131) Wightman, R., Touvron, H., Jegou, H.: Resnet strikes back: An improved training procedure in timm. In: NeurIPSW (2021)
  • (132) Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: Introducing convolutions to vision transformers. In: ICCV (2021)
  • (133) Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: Vision transformer with deformable attention. In: CVPR (2022)
  • (134) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV (2018)
  • (135) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS (2021)
  • (136) Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: Simmim: A simple framework for masked image modeling. In: CVPR (2022)
  • (137) Xu, L., Yan, X., Ding, W., Liu, Z.: Attribution rollout: a new way to interpret visual transformer. JAIHC (2023)
  • (138) Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., Soatto, S.: Long short-term transformer for online action detection. In: NeurIPS (2021)
  • (139) Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers. In: ICCV (2021)
  • (140) Xu, Y., Zhang, Q., Zhang, J., Tao, D.: Vitae: Vision transformer advanced by exploring intrinsic inductive bias. In: NeurIPS (2021)
  • (141) Yang, C., Wang, Y., Zhang, J., Zhang, H., Wei, Z., Lin, Z., Yuille, A.: Lite vision transformer with enhanced self-attention. In: CVPR (2022)
  • (142) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR (2022)
  • (143) Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., Wu, W.: Incorporating convolution designs into visual transformers. In: ICCV (2021)
  • (144) Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet. In: ICCV (2021)
  • (145) Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: Volo: Vision outlooker for visual recognition. TPAMI (2022)
  • (146) Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., Wang, J.: Hrformer: High-resolution vision transformer for dense predict. In: NeurIPS (2021)
  • (147) Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: CVPR (2022)
  • (148) Zhang, J., Li, X., Li, J., Liu, L., Xue, Z., Zhang, B., Jiang, Z., Huang, T., Wang, Y., Wang, C.: Rethinking mobile block for efficient attention-based models. In: ICCV (2023)
  • (149) Zhang, J., Xu, C., Li, J., Chen, W., Wang, Y., Tai, Y., Chen, S., Wang, C., Huang, F., Liu, Y.: Analogous to evolutionary algorithm: Designing a unified sequence model. In: NeurIPS (2021)
  • (150) Zhang, Q., Xu, Y., Zhang, J., Tao, D.: Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. IJCV (2023)
  • (151) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)
  • (152) Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. IJCV (2019)
  • (153) Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Hou, Q., Feng, J.: Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
  • (154) Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. In: CVPR (2019)
  • (155) Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable {detr}: Deformable transformers for end-to-end object detection. In: ICLR (2021)