Xiangtai Li ([email protected])
Yabiao Wang ([email protected])
Chengjie Wang ([email protected])
Yibo Yang ([email protected])
✉ Yong Liu ([email protected])
Dacheng Tao ([email protected])
∗ Equally-contributed first authors.
1 Institute of Cyber-Systems and Control, Advanced Perception on Robotics and Intelligent Learning Lab (APRIL), Zhejiang University, China.
2 School of Artificial Intelligence, Key Laboratory of Machine Perception (MOE), Peking University, China.
3 Youtu Lab, Tencent, China.
4 School of Computer Science, Faculty of Engineering, The University of Sydney, Darlington, NSW 2008, Australia.
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Abstract
Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical Evolutionary Algorithm (EA) and derives that both have consistent mathematical formulations. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based Transformer (EAT) block, which consists of three residual parts, i.e., Multi-Scale Region Aggregation (MSRA), Global and Local Interaction (GLI), and Feed-Forward Network (FFN) modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a Task-Related Head (TRH) docked with the transformer backbone to complete the final information fusion more flexibly, and improve a Modulated Deformable MSA (MD-MSA) to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory studies demonstrate the effectiveness and superiority of our approach over State-Of-The-Art (SOTA) methods. E.g., our Mobile (1.8M), Tiny (6.1M), Small (24.3M), and Base (49.0M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN armed with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieves 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
Keywords:
Computer vision, Vision transformer, Evolutionary algorithm, Image classification, Object detection, Image segmentation

1 Introduction
Since Vaswani et al. attn introduced the Transformer, which achieves outstanding success in machine translation, many improvements have been made to this structure attn_improve3 ; attn_improve4 ; attn_bert . Subsequently, Dosovitskiy et al. attn_vit first introduced the Transformer to the computer vision field and proposed the ViT model, which sparked a new wave of research beyond conventional CNN-based vision models. Recently, many excellent vision transformer models attn_pvt ; attn_swin ; attn_xcit ; attn_metaformer ; attn_nat ; attn_mpvit ; attn_uniformer have been proposed and achieved great success on many vision tasks. Many attempts have been made to explain and improve the Transformer structure from different perspectives theory_more_inductive_bias ; theory_more1 ; theory1 ; theory2 ; theory3 ; attn_convit ; theory_more2 ; theory_more3 ; attattr , while continuing research is still needed. Most current models simply migrate the structural design of CNNs, and their modules or improvements are only verified experimentally, which lacks an explanation of why the improved Transformer approaches work attattr ; attn_metaformer ; attn_notattention .

Inspired by biological population evolution, we explain the rationality of the Transformer in this article by analogy with the proven effective, stable, and robust Evolutionary Algorithm (EA), which has been widely used in many practical applications. Through the analogical analysis in Figure 1 (a)-TR and (b)-EA, we observe that the procedure of the Transformer (abbr., TR) has similar attributes to the naive EA:
1) In terms of data format, TR processes patch embeddings while EA evolves individuals; both share the same data format and require the necessary initialization.
2) In terms of optimization objective, TR aims to obtain an optimal vector representation that fuses global information through multiple layers (denoted as ×N in Figure 1), while EA focuses on obtaining the globally best individual through multiple iterations.
3) In terms of components, Multi-head Self-Attention (MSA) in TR densely enriches patch embeddings through global information communication, while the crossover operator in EA sparsely interacts individuals across the global population. Also, the Feed-Forward Network (FFN) in TR enhances every single embedding at all spatial positions, which is similar to mutation in EA that evolves each individual of the whole population.
4) In terms of other components, the population succession and the selection of the globally best individual in EA correspond to the residual connection and the aggregated task-related token in TR, respectively, as further detailed in Section 4.1.
In addition to the above basic analogies between the naive Transformer and EA, we explore further to improve the current vision transformer by leveraging other domain knowledge from EA variants. Without loss of generality, we only study the widely used and effective EA methods that can inspire us to improve the Transformer. They can mainly be divided into the following categories:
1) Global and local populations inspire simultaneous global and local modeling. In contrast to the naive EA that only models global interaction, local search-based EA variants focus on finding a better individual in its neighborhood lea1 ; lea2 ; lea4 , which is more efficient since it does not involve the global search space. Furthermore, Moscato et al. mea1 first proposed the Memetic Evolutionary Algorithm (MEA), which introduces a local search process to converge to high-quality solutions faster than conventional evolutionary counterparts; an intuitive illustration is given in Figure 1-(c). For a particular individual (i.e., the center sheep with red background), the naive EA only contains the ⚫ Global Population concept, while the ⚫ Local Population idea enables the model to focus on more relevant individuals. Inspired by these EA variants, we revisit the global MSA part and design a novel Global and Local Interaction (GLI) module: a parallel structure that employs an extra local operation beside the global operation, i.e., introducing inductive bias and locality into MSA. The former mines more relevant local information, while the latter models global cue interactions. Considering that the spatial relationship among real individuals is not as strictly horizontal and vertical as in an image, we further propose a Modulated Deformable MSA (MD-MSA) to dynamically model irregular locations, which can focus on more informative, re-organized regions.
2) Multi-population inspires multi-scale information aggregation. Some works lpea2 ; lpea4 introduce multi-population evolutionary algorithms to solve optimization problems, adopting different searching regions to enhance the diversity of individuals more efficiently and significantly improve model performance. As shown in Figure 1-(c), the ◆ Long-Distance Population supplies more diverse and richer cues, while the ◆ Short-Distance Population provides general evolutionary features. Analogously, this idea inspires us to design a Multi-Scale Region Aggregation (MSRA) module that aggregates information from different receptive fields for the vision transformer, which integrates more expressive features from different resolutions before feeding them into the next module.
3) Dynamic population inspires pyramid architecture design. The works alpha2 ; alpha3 ; alpha4 investigate the jDEdynNP-F algorithm with a dynamic population reduction scheme that significantly improves the effectiveness and accelerates the convergence of the model, which is similar to the pyramid-like improvements of some current vision transformers attn_pvt2 ; attn_twins ; attn_swin2 ; attn_nat ; attn_metaformer . Analogously, we extend our previous columnar work eat to a pyramid structure like PVT attn_pvt , which significantly boosts performance on many vision tasks.
4) Self-adapted parameters inspire weighted operation mixing. Brest et al. alpha1 propose an adaptation mechanism that controls different optimization processes for better results, and some memetic EAs mea1 ; mea2 ; mea3 share a similar concept of search intensity to balance the global and local computation. This encourages us to learn appropriate weights for different operations, which increases performance and interpretability.
5) Multi-objective EA inspires task-related feature merging. Current TR-based vision models either initialize different tokens for different tasks attn_deit (e.g., classification and distillation) or use a pooling operation to obtain the global representation attn_swin . However, both manners suffer from potential incompatibility: the former treats the task token and image patches coequally, which is unreasonable, and slightly increases the computation of each layer since the sequence grows from $N$ to $N{+}1$ tokens, while the latter uses only one pooling result for multiple tasks, which could potentially damage the model accuracy. Inspired by multi-objective EAs moea1 ; moea2 ; moea3 that find a set of solutions for different targets, we design a Task-Related Head (TRH) docked with the transformer backbone to complete the final information fusion, which is elegant and flexible for learning different tasks.


Based on the above analyses, we improve our columnar EAT model eat to a pyramid EA-inspired Transformer (EATFormer) that achieves new SOTA results. Figure 3 illustrates intuitive comparisons with SOTA methods in terms of GPU throughput, Top-1 accuracy, and the number of parameters, where our smallest EATFormer-Mobile obtains 69.4 Top-1 with a throughput of 3,926 images/s on one V100 GPU, and EATFormer-Base achieves 83.9 Top-1 with only 49.0M parameters. Specifically, we make the following four contributions compared with the previous conference work:
➤ In theory, we enrich the evolutionary explanation of the rationality of the Vision Transformer and derive mathematical formulations consistent with the evolutionary algorithm.
➤ On framework, we propose a novel basic EA-based Transformer (EAT) block (shown in Figure 2) that consists of three residual parts to model multi-scale, interactive, and individual information, respectively, and is stacked to form the proposed pyramid EATFormer.
➤ For method, inspired by effective EA variants, we analogously design: 1) a Global and Local Interaction module, 2) a Multi-Scale Region Aggregation module, 3) a Task-Related Head module, and 4) a Modulated Deformable MSA module to improve the effectiveness and usability of EATFormer.
➤ Extensive experiments on classification, object detection, and semantic segmentation tasks demonstrate the superiority and efficiency of our approach, while ablation and explanatory experiments further prove the efficacy of EATFormer and its components.
2 Related Work
2.1 Evolutionary Algorithms
The Evolutionary Algorithm (EA) is a subset of evolutionary computation in computational intelligence that belongs to modern heuristics, and it has served as an essential umbrella term for population-based optimization and search techniques over the last 50 years ea1 ; ea2 ; bartz2014evolutionary . Inspired by biological evolution, general EAs mainly contain reproduction, crossover, mutation, and selection steps, which have been proven effective and stable in many application scenarios mea1 ; mea2 , and a series of improved EA approaches have been advanced in succession. Differential Evolution (DE), developed in 1995 dea1 , is arguably one of the most competitive improved EAs and significantly advances the global modeling capability dea2 ; dea3 . The core idea of DE is to introduce a differential concept into the conventional EA: two individuals in the same population are differenced and scaled, and the result interacts with a third individual to generate a new individual. In contrast to the category mentioned above, local search-based EAs aim to find a solution that is as good as or better than all other solutions in its neighborhood lea1 ; lea2 ; lea4 . This idea is more efficient than global search in that a solution can quickly be verified as a local optimum without searching the global space. However, the locality-aware operation restricts the ability of global modeling and could lead to suboptimal results in some scenarios, so some researchers attempt to fuse both modeling manners. Moscato et al. mea1 first proposed the Memetic Evolutionary Algorithm (MEA) in 1989, which applies a local search process to refine solutions for hard problems and can converge to high-quality solutions more efficiently than conventional evolutionary counterparts. In detail, this variant is a particular global-local search hybrid: the global character is given by the traditional EA, while the local aspect is mainly performed through constructive methods and intelligent local search heuristics mea2 . Analogously, some later works lpea2 ; lpea4 introduce a multi-population evolutionary algorithm to solve constrained function optimization problems relatively efficiently, which adopts different searching regions to enhance the diversity of individuals and dramatically improves the model ability. This strategy inspires us to design a basic feature extraction module for the vision transformer: a similar multi-scale manner may be adopted to enhance model expressiveness. Furthermore, Brest et al. alpha1 propose an adaptation mechanism on the control parameters of the crossover and mutation operations associated with DE, where the adapted parameters are applied to different optimization processes to obtain better results. Remarkably, the MEAs mea1 ; mea2 ; mea3 mentioned above share a similar concept of search intensity to balance the global and local calculation. A subsequent work alpha2 investigates the jDEdynNP-F algorithm with a dynamic population reduction scheme, where the population size of the next generation is equal to half of the previous population size. This strategy significantly improves the effectiveness and accelerates the convergence of the model, as consistently illustrated by later works alpha3 ; alpha4 . Furthermore, some literature bio1 ; bio2 ; bio3 suggests that there are hierarchical structures of V1, V2, V4, and the inferotemporal cortex in the evolutionary brain, which are orderly interconnected in the form of both feed-forward and feedback connections.
Moreover, researchers moea1 ; moea2 ; moea3 study multi-objective EAs that find optimal trade-offs, i.e., a set of solutions for different targets.
Inspired by the aforementioned EA variants that introduce various valid concepts for optimization, we explain and improve the naive transformer structure by conceptual analogy in this paper, hand-designing a novel and potent EATFormer with a pyramid architecture, multi-scale region aggregation, and global-local modeling. Furthermore, a plug-and-play task-related head module is developed to address different targets separately and improve model performance.
2.2 Vision Transformers
Since the Transformer structure achieved significant progress for the machine translation task attn , many improved language models attn_elmo ; attn_bert ; attn_gpt1 ; attn_gpt2 ; attn_gpt3 have been proposed and obtained great achievements, and some later works attn_improve1 ; attn_improve2 ; attn_improve3 ; attn_improve4 ; attn_improve5 ; wang2021evolving advance the basic transformer module for better efficiency. Inspired by the success of the Transformer in NLP and the rapid improvement of computing power, Dosovitskiy et al. attn_vit proposed ViT, which first introduced the transformer to vision classification and sparked a new wave of research beyond conventional CNN-based vision models. Subsequently, many excellent vision transformer models have been proposed, and they can mainly be divided into two categories, i.e., pure and hybrid vision transformers. The former only contains transformer modules without CNN-based layers, and early works attn_deit ; attn_deepvit ; attn_tnt ; attn_cait ; attn_t2tvit ; attn_cpvt ; attn_xcit follow the columnar structure of the original ViT. Typically, DeiT attn_deit proposes an efficient training recipe to moderate the dependence on large datasets, DeepViT attn_deepvit and CaiT attn_cait focus on the fast performance saturation when scaling ViT deeper, and TNT attn_tnt divides local patches into smaller patches for fine-grained modeling. Furthermore, researchers attn_pvt ; attn_swin ; attn_shuffle ; attn_halonets ; attn_twins ; attn_swin2 ; attn_nat ; attn_dpt ; attn_metaformer ; attn_SSA advance ViT to pyramid structures that are more powerful and suitable for dense prediction. PVT attn_pvt leverages a non-overlapping patch partition to reduce feature size, while Swin attn_swin utilizes a shifted window scheme to alternately model in-window and cross-window connections. The latter category incorporates the idea of convolution, which owns a natural inductive bias of locality and translation invariance, and this kind of combination dramatically improves the model effect. Specifically, Srinivas et al. attn_botnet advance CNN-based models by replacing the convolution of the bottleneck block with the MSA structure. Later studies attn_localvit ; attn_ceit ; attn_convit ; attn_vitae introduce convolution designs into columnar visual transformers, while other works attn_lvt ; attn_volo ; attn_crossformer ; attn_container ; attn_uniformer ; attn_coat ; attn_cvt ; attn_pvt2 ; attn_vitae2 fuse convolution structures into pyramid structures or use CNN-based backbones in early stages, which has obvious advantages over pure transformer models. Moreover, some researchers design hybrid models from the parallel perspective of convolution and transformer attn_coat ; attn_mobileformer ; attn_mixformer , while Xia et al. attn_DAT introduce the deformable idea dcn1 into the MSA module and obtain a boost over Swin attn_swin . Recently, MPViT attn_mpvit explores multi-scale patch embedding and a multi-path structure that enable both fine and coarse feature representations simultaneously.
Benefiting from advances in basic vision transformer models, many task-specific models have been proposed and achieve significant progress in downstream vision tasks, e.g., object detection down_detr ; down_deformabledetr ; down_YOLOS ; down_pix2seq , semantic segmentation down_setr ; down_transunet ; down_medt ; down_segformer ; down_hrformer ; down_maskformer ; down_mask2former , generative adversarial networks down_transgan ; down_fat ; down_gat , low-level vision down_ipt ; down_swinir ; down_restormer , video understanding down_vtn ; down_timesformer ; down_STRM ; down_bevt ; down_LSTR , self-supervised learning down_sit ; down_mocov3 ; down_fdsl ; down_mae ; down_beit ; down_dino ; down_peco ; down_cae ; down_simmim ; down_maskfeat ; down_data2vec ; down_convmae , neural architecture search down_hat ; down_bossnas ; down_glit ; down_autoformer ; down_s3 , etc. Inspired by practical improvements in EA variants, this work migrates them to Transformer improvement and designs a powerful visual model with higher precision and efficiency than contemporary works. Also, thanks to the elaborate analogical design, the proposed EATFormer is highly explanatory.
2.3 Explanatory of Transformers
Transformer-based models have achieved remarkable results on CV tasks, leading us to ask why the Transformer works so well, even better than Convolutional Neural Networks (CNNs). Many efforts have been made to answer this question. Goyal et al. theory_more_inductive_bias point out that studying the kinds of inductive biases that humans and animals exploit could help inspire AI research and neuroscience theories. Pan et al. theory_more1 show a strong underlying relation between convolution and self-attention operations. Cordonnier et al. theory1 prove that a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer, while Raghu et al. theory2 find striking differences between ViT and CNN on image classification. Li et al. attn_uniformer seamlessly integrate the merits of convolution and self-attention in a concise transformer format, while ConViT attn_convit combines the strengths of both CNN and Transformer architectures by introducing gated positional self-attention. Introducing local CNN designs into the Transformer is followed by many subsequent works attn_cpvt ; attn_uniformer ; cmt ; attn_vitae2 ; mobilevit ; edgenext . Furthermore, Min et al. theory3 take a biologically inspired approach and explore modeling peripheral vision by incorporating peripheral position encoding into the multi-head self-attention layers of the Transformer. Besides, some works explore the relation between the Transformer and other models, e.g., Katharopoulos et al. theory_more_rnn reveal the relationship of the Transformer to recurrent neural networks, and Kim et al. theory_more_gnn prove that the Transformer is theoretically at least as expressive as an invariant graph network composed of equivariant linear layers. Moreover, Bhojanapalli et al. theory_more2 find that ViT models are at least as robust as their ResNet counterparts on a broad range of perturbations when pre-trained with a sufficient amount of data. Hao et al. attattr propose a self-attention attribution method to interpret the information interactions inside the Transformer, while Liu et al. theory_more3 propose an actionable diagnostic methodology to measure the consistency between explanation weights and the impact polarity for attention-based models. Dong et al. attn_notattention find that the MLP stops the output from degeneration, and that removing MSA from the Transformer also significantly damages performance. Recently, Qiang et al. theory_more4 propose a novel Transformer explanation technique via attentive class activation tokens, leveraging encoded features, gradients, and attention weights to generate faithful and confident explanations. Xu et al. theory_more5 propose a new way to visualize the model by first computing attention scores based on attribution and then propagating these attention scores through the layers. Works attn_metaformer ; zhang2023rethinking demonstrate that the general architecture of the Transformer is more essential to performance than the specific token mixer module. The above works explore the interpretation of the Transformer from a variety of perspectives, while we provide another explanation from the perspective of evolutionary algorithms and design a robust model for multiple CV tasks.
3 Preliminary Transformer
The vision transformer generally refers to the encoder part of the original transformer structure, which consists of the Multi-head Self-Attention (MSA) layer, Feed-Forward Network (FFN), Layer Normalization (LN), and Residual Connection (RC). Given an input feature map $X\in\mathbb{R}^{H\times W\times C}$, a flattening operation first reshapes it into a 1D sequence that complies with the standard NLP format, denoted as $X\in\mathbb{R}^{N\times C}$ with $N=H\times W$.
❐ MSA fuses several SA operations to process $X$, jointly attending to information in different representation subspaces. Specifically, the LN-normalized $X$ goes through linear layers to obtain the projected queries ($Q$), keys ($K$), and values ($V$), formulated as:

$\mathrm{MSA}(X)=\mathrm{Concat}\big(A_{1}XW^{V}_{1},\ \dots,\ A_{h}XW^{V}_{h}\big)\,W^{O},\qquad A_{i}=\mathrm{Softmax}\!\Big(\tfrac{(XW^{Q}_{i})(XW^{K}_{i})^{\top}}{\sqrt{d_{k}}}\Big),$ (1)

where $C$ is the input dimension, while $d_q$, $d_k$, and $d_v$ are the hidden dimensions of the corresponding projection subspaces, and $d_q$ generally equals $d_k$; $h$ is the head number; $W^{Q}_{i}\in\mathbb{R}^{C\times d_q}$, $W^{K}_{i}\in\mathbb{R}^{C\times d_k}$, and $W^{V}_{i}\in\mathbb{R}^{C\times d_v}$ are parameter matrices for the $i$-th head; $W^{O}\in\mathbb{R}^{hd_v\times C}$ maps the concatenated head features to the output; $\mathrm{Concat}$ denotes the concatenation operation; $A_{i}$ is the attention matrix of the $i$-th head.
❐ FFN consists of two cascaded linear transformations with a ReLU activation in between:

$\mathrm{FFN}(X)=\mathrm{ReLU}(XW_{1}+b_{1})\,W_{2}+b_{2},$ (2)

where $W_{1}$ and $W_{2}$ are the weights of the two linear layers, while $b_{1}$ and $b_{2}$ are the corresponding biases.
❐ LN is applied before each MSA and FFN layer, and the transformed features are calculated with residual connections by:

$X^{\prime}=X+\mathrm{MSA}(\mathrm{LN}(X)),\qquad X^{\prime\prime}=X^{\prime}+\mathrm{FFN}(\mathrm{LN}(X^{\prime})).$ (3)

Finally, a reverse operation reshapes the enhanced sequence $X\in\mathbb{R}^{N\times C}$ back to the 2D feature map $X\in\mathbb{R}^{H\times W\times C}$.
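For concreteness, the block described above can be summarized by the following minimal PyTorch sketch of a pre-norm encoder layer (Equations 1-3); it is an illustrative reference rather than the exact implementation used later for EATFormer.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: X + MSA(LN(X)), then X + FFN(LN(X))."""
    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(                       # two linear layers with ReLU in between (Eq. 2)
            nn.Linear(dim, dim * mlp_ratio),
            nn.ReLU(inplace=True),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                               # x: (B, N, C) flattened patch tokens
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # residual MSA
        x = x + self.ffn(self.norm2(x))                      # residual FFN
        return x

# toy usage: a 14x14 feature map flattened to a sequence of 196 tokens
tokens = torch.randn(2, 196, 192)
print(EncoderBlock(dim=192, num_heads=3)(tokens).shape)      # torch.Size([2, 196, 192])
```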
4 EA-Inspired Vision Transformer
In this section, we expand on the relationship between operators in the naive EA and modules in the naive Transformer, and derive consistent mathematical formulations for each conceptual pair, revealing an evolutionary explanation for the rationality of the Vision Transformer structure. Inspired by the core ideas of several effective EA variants, we then transfer them to transformer architecture design and improve our previous columnar model into a mighty pyramid EATFormer.
4.1 Evolutionary Explanation of Transformer
As aforementioned in Figure 1, the Transformer block has sub-modules that are conceptually analogous to those of the evolutionary algorithm. Basically, the Transformer inputs a sequence of patch tokens while EA evolves a population that consists of many individuals; both have a consistent vector format and necessary initialization. To facilitate the subsequent analogy and formula derivation, we symbolize the patch token (individual) as $x_i\in\mathbb{R}^{D}$, where $i$ and $D$ indicate the data index and dimension, respectively. Defining $N$ as the sequence length, the sequence (population) can be denoted as $X=\{x_1,x_2,\dots,x_N\}$. The specific relationship analyses of different components are as follows:
✰ Crossover Operator vs. MSA Module. The crossover operator of EA aims at creating new individuals by combining parts of other individuals. Specifically, for an individual $x_i$, the operator randomly picks another individual $x_r$ from the global population and randomly replaces features of $x_i$ with those of $x_r$ to form the new individual $\hat{x}_i$:

$\hat{x}_{i,j}=\begin{cases}x_{r,j}, & \text{if }\ \mathrm{rand}_j\le CR,\\ x_{i,j}, & \text{otherwise},\end{cases}$ (4)

where $\mathrm{rand}_j$ is the $j$-th evaluation of a uniform random number generator with outcome in $[0,1]$, and $CR$ is the crossover constant in $[0,1]$ that is determined by the user. We re-formulate this process as:

$\hat{x}_i=\lambda_i\odot x_i+\lambda_r\odot x_r=\Lambda_i\,x_i+\Lambda_r\,x_r,$ (5)

where $\lambda_i$ and $\lambda_r$ are vectors filled with zeros or ones, indicating the feature selections of $x_i$ and $x_r$, while $\Lambda_i$ and $\Lambda_r$ are the corresponding diagonal matrix representations; $\odot$ denotes the point-wise multiplication at each position, and $\Lambda=\mathbf{0}$ represents that the corresponding individual has no contribution, i.e., it is full of zeros. As can be seen above, the crossover operator is actually a sparse global feature interaction process.

For the MSA module of the Transformer, each patch embedding interacts with all embeddings through dense communication. Without loss of generality, $x_i$ interacts with the whole population as follows:

$\hat{x}_i=\mathrm{Concat}\Big(\textstyle\sum_{j=1}^{N}a^{1}_{i,j}v^{1}_{j},\ \dots,\ \textstyle\sum_{j=1}^{N}a^{L}_{i,j}v^{L}_{j}\Big),$ (6)

where $a^{l}_{i,j}$ ($l\in\{1,2,\dots,L\}$) is the attention weight of the $l$-th head from embedding token $x_j$ to $x_i$, which is calculated between the query value of $x_i$ and the key value of $x_j$ in the $l$-th head followed by a softmax post-processing; $v^{l}_{j}=x_jW^{V}_{l}$ is the projected Value feature of $x_j$ with the corresponding weights $W^{V}_{l}$; $\hat{x}^{l}_{i}=\sum_{j=1}^{N}a^{l}_{i,j}v^{l}_{j}$ is the sum of all $v^{l}_{j}$ weighted by $a^{l}_{i,j}$ (c.f., Eq. 1 in Section 3 for more details); $W^{V}_{l}$ is the parameter matrix for the value projection and $\mathrm{Concat}$ means the concatenation operation. By comparing Equation 5 with Equation 6, we find that both components have the same formula representation: the crossover operation is a sparse global interaction, while the densely-modeling MSA has more complex computing and modeling capabilities. A toy numerical sketch contrasting the two is given at the end of this subsection.
✰ Mutation Operator vs. FFN Module. The mutation operator in EA brings random evolutions into the population by stochastically changing specific features of individuals. Specifically, an individual $x_i$ in the population goes through the mutation operation to form the new individual $\hat{x}_i$, formulated as follows:

$\hat{x}_{i,j}=\begin{cases}x^{low}_{i,j}+\mathrm{rand}_j\,\big(x^{up}_{i,j}-x^{low}_{i,j}\big), & \text{if }\ \mathrm{rand}_j\le MU,\\ x_{i,j}, & \text{otherwise},\end{cases}$ (7)

where $\mathrm{rand}_j$ is the $j$-th evaluation of a uniform random number generator with outcome in $[0,1]$, and $MU$ is the mutation constant in $[0,1]$ that the user determines; $x^{low}_{i,j}$ and $x^{up}_{i,j}$ are the lower and upper scale bounds of the $j$-th feature relative to $x_{i,j}$. Similarly, we re-formulate this process as:

$\hat{x}_i=\lambda\odot x_i=\Lambda\,x_i,$ (8)

where $\lambda$ is a randomly generated vector that represents the weight of each feature value, while $\Lambda$ is the corresponding diagonal matrix representation; $\odot$ denotes the point-wise multiplication at each position.

For the FFN module in the Transformer, each patch embedding carries out a directional feature transformation through cascaded linear layers (c.f., Equation 2). Setting aside the non-linear transformation, we take only one linear layer as an example:

$\hat{x}_i=x_i\,W,$ (9)

where $W$ is the weight of the linear layer, which is applied to each embedding separately and identically.

By analyzing the calculation processes of Equation 8 and Equation 9, mutation and FFN share a unified form of matrix multiplication, so they essentially own a consistent function. Besides, at the microcosmic level, the weights of FFN change dynamically during the training process, so the output for an individual differs among different iterations (similar to the random process of mutation). At the macroscopic level, the mutation in EA is optimized toward one potential direction under the constraint of the objective function (statistically speaking, only partial mutated individuals are retained, i.e., the mutation also has a determinate meaning over the whole training process); in comparison, the trained FFN can be regarded as a directional mutation under the constraint of loss functions. Finally, note that we only discuss the comparison between the mutation and one linear layer of FFN; the FFN weight is in fact more expressive than the diagonal $\Lambda$ because FFN contains cascaded linear layers with a non-linear ReLU activation interspersed between adjacent layers, as depicted in Equation 2.
✰ Population Succession vs. RC Operation. In the evolution of a biological population, individuals of the current iteration have a certain probability of being inherited by the next iteration, where a partial population of the current iteration is combined with the selected individuals. Similarly, this pattern is expressed in the Transformer structure in the form of the Residual Connection (RC), i.e., patch embeddings of the previous layer are directly mapped to the next layer. Specifically, the partial selection can be viewed as the dropout technique in the Transformer, while the population succession can be formulated as a concatenation operation that has a consistent mathematical expression with the residual connection, whereas the addition operation can be regarded as a particular case of the concatenation operation that shares partial weights.
✰ Best Individual vs. Task-Related Token. Generally speaking, the Transformer-based model chooses an enhanced task-related token (e.g., the classification token) that aggregates the information of all patch embeddings as the output feature, while the EA-based method chooses the individual with the best fitness score among the population as the output.

Figure 4: Structure of the EA-inspired columnar EAT model eat and the improved pyramid EATFormer. (a) The top part shows the architecture of the previous EAT model, where the basic block consists of parallel global and local paths as well as an FFN module. (b) The middle part illustrates the overall architecture of EATFormer, which contains four stages with the i-th stage consisting of Ni basic EAT blocks; the bottom part illustrates the structure of the serial modules in the EAT block, i.e., MSRA (c.f., Section 4.3.1), GLI (c.f., Section 4.3.2), and FFN (c.f., Section 3) from left to right, and an MD-MSA is proposed to effectively improve the model performance; the right part shows the designed Task-Related Head module docked with the transformer backbone for specific tasks.
✰ Necessity of Modules in Transformer. As described in the work s16 , the absence of the crossover or mutation operator significantly damages the model's performance. Similarly, Dong et al. attn_notattention explore the effect of the MLP in the Transformer and find that the MLP stops the output from degeneration, and that removing MSA from the Transformer also significantly damages the effectiveness of the model. Thus, we can conclude that global information interaction and individual evolution are necessary for the Transformer, just like the global crossover and individual mutation in EA.
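As a toy illustration of the crossover-MSA analogy discussed above (Equations 4-6), the following sketch (with arbitrary toy dimensions) contrasts the sparse, binary feature selection of DE-style crossover with the dense, softmax-weighted interaction of single-head self-attention:

```python
import torch

torch.manual_seed(0)
N, D = 6, 4                      # population size / feature dimension
X = torch.randn(N, D)            # population == token sequence

# --- EA crossover (Eq. 4/5): sparse, binary global interaction ---
i, r, CR = 0, 3, 0.5
mask = (torch.rand(D) <= CR).float()          # lambda_r in {0, 1}
x_cross = (1 - mask) * X[i] + mask * X[r]     # mixes x_i with ONE random individual

# --- Self-attention for token i (Eq. 6, single head): dense, soft global interaction ---
Wq, Wk, Wv = (torch.randn(D, D) * D ** -0.5 for _ in range(3))
q, K, V = X[i] @ Wq, X @ Wk, X @ Wv
attn = torch.softmax(q @ K.T / D ** 0.5, dim=-1)   # weights over ALL N tokens, summing to 1
x_attn = attn @ V

print(mask)        # hard 0/1 feature selection against a single partner
print(attn)        # soft weights distributed over the whole sequence
```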
4.2 Short Description of Previous Columnar EAT
We explored the relationship between operators in the naive EA and modules in the naive Transformer in the previous NeurIPS'21 conference work eat and analogically improved a columnar EAT based on the ViT model. Figure 4-(a) shows the structure of the EAT model, which is stacked with N improved Transformer blocks inspired by the local population concept in some EA works lea1 ; lea2 ; mea1 , where a local path is introduced in parallel with the global MSA operation. This work also designs a Task-Related Head to deal with various tasks more flexibly, i.e., classification and distillation.
However, the columnar structure is naturally inadequate for downstream dense prediction tasks, and it is inferior in terms of accuracy compared with contemporaneous works attn_pvt ; attn_swin , which limits the usefulness of the model in some scenarios. To address the above weaknesses, this paper further explores analogies between EA and Transformer and improves the previous work to a pyramid EATFormer, consisting of the newly designed EAT block inspired by the effective EA variants.
4.3 Methodology of Pyramid EATFormer Architecture
The architecture of the improved EATFormer is illustrated in Figure 4-(b), which contains four stages of different resolutions following PVT attn_pvt . Specifically, the model is made up of EAT blocks that contain three mixed-paradigm residual parts: (a) Multi-Scale Region Aggregation (MSRA), (b) Global and Local Interaction (GLI), and (c) Feed-Forward Network (FFN) modules, and the down-sampling procedure between two stages is realized by an MSRA with a stride greater than 1. Besides, we propose a novel Modulated Deformable MSA (MD-MSA) to advance global modeling and a Task-Related Head (TRH) to complete different tasks more elegantly and flexibly.
4.3.1 Multi-Scale Region Aggregation
Inspired by multi-population-based EA methods lpea2 ; lpea4 that adopt different searching regions to obtain better model performance, we analogically extend this concept to multiple sets of spatial positions for the 2D image and design a novel Multi-Scale Region Aggregation (MSRA) module for the studied vision transformer. As shown in Figure 4.(a), MSRA contains $n$ local convolution operations with different dilations to aggregate information from different receptive fields, which simultaneously plays the role of providing inductive bias without extra position embedding procedures. Specifically, the $i$-th dilated operation that transforms the input feature map $X$ can be formulated as:

$X_i=\mathrm{Conv}^{d_i}_{k\times k}(X),\quad i=1,\dots,n,$ (10)

where $\mathrm{Conv}^{d_i}_{k\times k}$ denotes a $k\times k$ convolution with dilation $d_i$.
A Weighted Operation Mixing (WOM) mechanism is further proposed to mix all operations by a softmax function over a set of learnable weights $\{\alpha_1,\dots,\alpha_n\}$, and the intermediate representation $X_m$ is calculated by the mixing function as follows:

$X_m=\sum_{i=1}^{n}\frac{e^{\alpha_i}}{\sum_{j=1}^{n}e^{\alpha_j}}\,X_i,$ (11)

where the mixing function in the above formula is the addition function; other fusion functions such as concatenation are also available for a better effect at the cost of more parameters, and this paper chooses the addition function by default. Then, a convolution layer maps $X_m$ to the same number of channels as the input $X$, and the final output of the module is obtained after a residual connection. Also, the MSRA module serves as the model stem and the patch embedding, which makes EATFormer more uniform and elegant. Note that this paper does not use any form of position embedding, since the CNN-based MSRA provides a natural inductive bias for the subsequent GLI module.
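To make the design concrete, below is a minimal PyTorch sketch of the MSRA idea: parallel dilated convolutions fused by softmax-normalized learnable weights, a 1×1 projection, and a residual connection. The choice of depth-wise convolutions, kernel sizes, normalization layers, and down-sampling handling here are illustrative assumptions and may differ from the released implementation.

```python
import torch
import torch.nn as nn

class MSRA(nn.Module):
    """Sketch of Multi-Scale Region Aggregation: parallel dilated convolutions fused by
    softmax-normalized learnable weights (Weighted Operation Mixing), followed by a 1x1
    projection and a residual connection."""
    def __init__(self, dim: int, dilations=(1, 2, 3), kernel_size: int = 3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size, padding=d * (kernel_size // 2),
                      dilation=d, groups=dim)           # depth-wise conv per dilation (assumed)
            for d in dilations
        ])
        self.alpha = nn.Parameter(torch.zeros(len(dilations)))   # learnable mixing weights
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                                # x: (B, C, H, W)
        w = torch.softmax(self.alpha, dim=0)             # Eq. 11: softmax over operations
        mixed = sum(w[i] * branch(x) for i, branch in enumerate(self.branches))
        return x + self.proj(mixed)                      # residual connection

x = torch.randn(1, 64, 56, 56)
print(MSRA(64)(x).shape)                                 # torch.Size([1, 64, 56, 56])
```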
4.3.2 Global and Local Interaction
Motivated by EA variants mea1 ; mea2 ; mea3 that introduce local search procedures besides the conventional global search to converge to higher-quality solutions faster and more effectively (c.f., Figure 1-(c) for an intuitive explanation), we improve the MSA-based global module into a novel Global and Local Interaction (GLI) module. As shown in Figure 4.(b), GLI contains an extra local path in parallel with the global path, where the former aims to mine more discriminative locality-relevant information like the above-mentioned local population idea, while the latter is retained to model global information. Specifically, the input features are divided into global features (marked green) and local features (marked blue) at the channel level with a ratio $p$, which are then fed into the global and local paths to conduct feature interaction, respectively. Note that we also apply the proposed Weighted Operation Mixing mechanism of Section 4.3.1 to balance the two branches, i.e., a global weight and a local weight. The outputs of the two paths recover the original data dimension through a concatenation operation. Thus, the improved module is very flexible and can be viewed as a plug-and-play module for current transformer structures. In detail, the local operation can be a traditional convolution layer or other improved modules, e.g., DCN dcn1 ; dcn2 , local MSA, etc., while the global operation can be MSA attn ; attn_vit , D-MSA attn_dpt , Performer attn_improve3 , etc.
In this paper, we choose a naive convolution and the MSA module as the basic compositions of GLI, which keeps the maximum path length between any two positions at O(1) to preserve the global modeling capability besides enhancing locality, as shown in Table 1. Therefore, the proposed structure maintains the same parallelism and efficiency as the original vision transformer. Also, the selection of the feature separation ratio $p$ is crucial to the effect and efficiency of the model, because different ratios bring different parameters, FLOPs, and precision. In detail, the local path contains a group of point-wise and depth-wise convolutions. Assume that the feature map has $N$ positions and $C$ channels, so that the global and local paths have $pC$ and $(1-p)C$ channels, respectively. We present an analysis of the number of parameters and the computation of the improved GLI module as follows:
1) The overall Params of GLI equal $4(pC)^2+((1-p)C)^2+k^2(1-p)C$ according to Table 1, which can be factorized with respect to $p$:

$\mathrm{Params}=C^2\big(5p^2-2p+1\big)+k^2C(1-p).$ (12)

Applying the minimum value formula of a quadratic function, Equation 12 obtains its minimum value when $p=1/5$. Given that the channel numbers are integers and the latter term can be ignored, we obtain the minimum when $p$ equals 0.2.
2) The overall FLOPs of GLI equal $4N(pC)^2+2N^2pC+N((1-p)C)^2+Nk^2(1-p)C$ according to Table 1, which can be factorized with respect to $p$:

$\mathrm{FLOPs}=NC^2\big(5p^2-2p+1\big)+2N^2Cp+Nk^2C(1-p).$ (13)

Applying the minimum value formula of a quadratic function and ignoring the latter terms, Equation 13 obtains its minimum near $p=0.2$, which follows the same trend as Equation 12. Therefore, we can draw two conclusions: ① the parameters and calculations of GLI are much lower than those of the single-path MSA ($p$ < 1), and the minimum value is obtained when using both paths ($p$ > 0); ② according to Equation 12 and Equation 13, there is not much difference in the total parameters and calculations when $p$ lies in the range [0, 0.5], so $p$ is set to 0.5 for all layers in this paper for simplicity and efficiency. Also, experiments in Section 5.5.1 demonstrate that $p=0.5$ is the most economical and efficient option. Note that the parameters and computation of the current local path are smaller than those of the global path, while a stronger local structure would shift the optimal ratio further, and this paper does not elaborate on the details.
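The ratio analysis above can be checked numerically with the rough sketch below, which assumes the standard cost estimates of roughly 4C_g^2 parameters for MSA over C_g channels and C_l^2 parameters for the point-wise convolution over C_l channels; these constants are assumptions for illustration, not values copied from Table 1.

```python
# Rough numerical check of the GLI ratio analysis under assumed standard cost estimates:
# an MSA block over C_g channels costs about 4*C_g^2 parameters and 4*N*C_g^2 + 2*N^2*C_g
# FLOPs, while the local point-wise convolution over C_l channels costs about C_l^2
# parameters and N*C_l^2 FLOPs (depth-wise terms are omitted as small).
def gli_cost(p, C=320, N=196):
    cg, cl = p * C, (1 - p) * C            # global / local channel split
    params = 4 * cg ** 2 + cl ** 2
    flops = 4 * N * cg ** 2 + 2 * N ** 2 * cg + N * cl ** 2
    return params, flops

for p in (0.0, 0.2, 0.5, 1.0):
    params, flops = gli_cost(p)
    print(f"p={p:.1f}  params={params / 1e3:.1f}K  flops={flops / 1e6:.1f}M")
# The parameter term 4*p^2 + (1-p)^2 is minimized at p = 0.2, and both costs stay well
# below the single-path MSA case (p = 1.0), matching conclusions (1) and (2) above.
```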
Furthermore, we advance the global path by designing a Modulated Deformable MSA (MD-MSA; Section 4.3.3) module, which improves the model performance with a negligible increase in parameters and GFLOPs; a comparison study exploring combinations of different operations is further conducted in the experimental section.
Type | Params | FLOPs | MPL |
MSA | |||
Conv |
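A minimal PyTorch sketch of the GLI structure is given below, assuming a plain MSA for the global path and a point-wise plus depth-wise convolution for the local path; window partitioning, MD-MSA, and other details of the released model are omitted.

```python
import torch
import torch.nn as nn

class GLI(nn.Module):
    """Sketch of Global and Local Interaction: channels are split with ratio p into a
    global MSA path and a local (point-wise + depth-wise conv) path, re-weighted by
    softmax mixing weights, and concatenated back."""
    def __init__(self, dim: int, p: float = 0.5, num_heads: int = 4):
        super().__init__()
        self.c_g = int(dim * p)                     # channels of the global path
        self.c_l = dim - self.c_g                   # channels of the local path
        self.attn = nn.MultiheadAttention(self.c_g, num_heads, batch_first=True)
        self.local = nn.Sequential(
            nn.Conv2d(self.c_l, self.c_l, kernel_size=1),                              # point-wise
            nn.Conv2d(self.c_l, self.c_l, kernel_size=3, padding=1, groups=self.c_l),  # depth-wise
        )
        self.alpha = nn.Parameter(torch.zeros(2))   # weighted operation mixing: (global, local)

    def forward(self, x):                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        xg, xl = x.split([self.c_g, self.c_l], dim=1)
        seq = xg.flatten(2).transpose(1, 2)                         # (B, H*W, C_g)
        glb = self.attn(seq, seq, seq, need_weights=False)[0]
        glb = glb.transpose(1, 2).reshape(B, self.c_g, H, W)
        loc = self.local(xl)
        w = torch.softmax(self.alpha, dim=0)
        return torch.cat([w[0] * glb, w[1] * loc], dim=1)           # recover original channels

x = torch.randn(1, 64, 14, 14)
print(GLI(64, p=0.5)(x).shape)                      # torch.Size([1, 64, 14, 14])
```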

4.3.3 Modulated Deformable MSA
Inspired by the irregular spatial distribution of real individuals, which is not as strictly horizontal and vertical as an image grid, we improve a novel Modulated Deformable MSA (MD-MSA) module that considers position fine-tuning and re-weighting of each spatial patch. As shown in Figure 5, the blue dotted line represents the naive MSA procedure, where the $Q$, $K$, $V$ features are obtained from the input feature map $X$ by the function $f_{qkv}$, i.e., $[Q,K,V]=f_{qkv}(X)$ ($[\cdot]$ denotes the concatenation operation), while the red solid line shows the procedure of MD-MSA. The main difference between the proposed MD-MSA and the original MSA lies in the query-aware access of the fine-tuned feature map for further feature extraction. Specifically, given the input feature map $X$ with $N$ positions, the query feature $Q$ is first obtained by the function $f_{q}$, i.e., $Q=f_{q}(X)$, which is then used to predict the deformable offset $\Delta p$ and modulation scalar $\Delta m$ for all positions:

$\Delta p,\ \Delta m=f_{\Delta}(Q).$ (14)

For the $i$-th position, the re-sampled and re-weighted feature $\tilde{x}_i$ is calculated by:

$\tilde{x}_i=\mathcal{B}\big(X,\ p_i+\Delta p_i\big)\cdot\Delta m_i,$ (15)

where $\Delta p_i$ is the relative coordinate with an unconstrained range for the $i$-th position, while $\Delta m_i$ lies in the range [0, 1], and $\mathcal{B}$ represents the bilinear interpolation function. Then the key and value features are obtained from the new feature map $\tilde{X}$, i.e., $[K,V]=f_{kv}(\tilde{X})$. It is worth mentioning that the main difference between MD-MSA and the recent similar work attn_DAT lies in the modulation operation, with which MD-MSA can apply appropriate attention to different position features to obtain better results. Also, no form of position embedding is used, since it makes no contribution to the results; detailed comparative experiments can be viewed in Section 5.4.3.
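The modulated deformable sampling step can be sketched as follows in PyTorch, where the single offset/modulation head, the use of grid_sample for bilinear interpolation, and the offset channel ordering are illustrative assumptions; the subsequent attention computation follows the standard MSA and is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDSample(nn.Module):
    """Sketch of the modulated deformable sampling in MD-MSA: the query feature predicts
    a per-position offset and a [0, 1] modulation scalar; the feature map is re-sampled
    at the offset locations via bilinear interpolation and re-weighted, and keys/values
    would then be computed from this new map."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.offset_mod = nn.Conv2d(dim, 3, 1)       # 2 offset channels + 1 modulation channel

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.to_q(x)
        off_mod = self.offset_mod(q)
        offset = off_mod[:, :2].permute(0, 2, 3, 1)  # (B, H, W, 2) pixel offsets, unconstrained
        mod = torch.sigmoid(off_mod[:, 2:3])         # (B, 1, H, W), in [0, 1]
        # reference grid in the normalized [-1, 1] coordinates expected by grid_sample
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).to(x).expand(B, H, W, 2)
        # scale pixel offsets (assumed order: dx, dy) to the normalized grid range
        norm = torch.tensor([2.0 / max(W - 1, 1), 2.0 / max(H - 1, 1)]).to(x)
        x_tilde = F.grid_sample(x, grid + offset * norm, align_corners=True)
        return q, x_tilde * mod                      # keys/values are derived from x_tilde * mod

q, kv_src = MDSample(64)(torch.randn(2, 64, 14, 14))
print(kv_src.shape)                                  # torch.Size([2, 64, 14, 14])
```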

Category | EA | EATFormer | Reference
Basics | Population Size | Patch Number | (Section 1)
Basics | (Discrete) Individual | (Continuous) Patch Token |
Basics | (Sparse) Crossover Operation | (Dense) Global MSA module |
Basics | (Sparse) Mutation Operation | (Dense) Individual FFN module |
Basics | (Partial) Population Succession | (Integral) Residual Connection |
Basics | (Global) Best Individual | (Aggregated) Task-Related Token |
Improvements | Multi-Scale Population | Multi-Scale Region Aggregation module | (Section 4.3.1)
Improvements | Global and Local Population | Global and Local Interaction module | (Section 4.3.2)
Improvements | Self-Adapting Parameters | Weighted Operation Mixing | (Section 4.3.2)
Improvements | Irregular Population Distribution | Modulated Deformable MSA | (Section 4.3.3)
Improvements | Multi-Objective EA | Task-Related Head | (Section 4.3.4)
Improvements | Dynamic Population | Pyramid Architecture | (Figure 4)
4.3.4 Task-Related Head
Current transformer-based vision models either initialize different tokens for different tasks attn_deit or use a pooling operation to obtain the global representation attn_swin . However, both manners are potentially incompatible: the former treats the task token and image patches coequally, which is unreasonable and clumsy because the task token and image tokens follow different feature distributions, and additional computation is required as the sequence grows from $N$ to $N{+}1$ tokens; the latter uses only one pooling result for multiple tasks, which is also inappropriate and harmful since rich information is lost. Inspired by multi-objective EAs moea1 ; moea2 ; moea3 that find a set of solutions for different targets, we design a Task-Related Head (TRH) docked with the transformer backbone to obtain the corresponding task output from the final features. As shown in Figure 6, we employ a cross-attention paradigm to implement this module: the key and value features (gray lines) are output features extracted by the transformer backbone, while the query (red line) is the task-related token that integrates global information. Note that this design is more effective and flexible for learning different tasks simultaneously while consuming a negligible amount of computation compared to the backbone, and more analytical experiments can be viewed in Section 5.4.8. For a fairer comparison, the TRH presented in the former conference version eat is not used by default, because this plug-and-play module can easily be added to other methods; we conduct an ablation experiment in Section 5.4.8 to verify the validity of TRH.
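A minimal sketch of the TRH idea is shown below, assuming a single classification task: a learnable task token serves as the query of a cross-attention layer while the backbone features supply keys and values; one such token can be instantiated per task. The head and layer choices here are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class TaskRelatedHead(nn.Module):
    """Sketch of the Task-Related Head: a learnable task token queries the final backbone
    features via cross-attention (features supply keys/values), so each task aggregates
    its own global representation instead of sharing one pooled vector."""
    def __init__(self, dim: int, num_heads: int = 8, num_classes: int = 1000):
        super().__init__()
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, feats):                        # feats: (B, N, C) backbone output tokens
        q = self.task_token.expand(feats.shape[0], -1, -1)
        out = self.cross_attn(q, feats, feats, need_weights=False)[0]   # (B, 1, C)
        return self.fc(self.norm(out.squeeze(1)))

head = TaskRelatedHead(dim=576)
logits = head(torch.randn(2, 49, 576))
print(logits.shape)                                  # torch.Size([2, 1000])
```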
4.3.5 Overall Congruent Relationships
To more clearly show the design inspirations of different modules, we summarize the analogies between the improved EATFormer research and homologous concepts (ideas) from EA variants in Table 2.
4.4 EATFormer Variants
In the former conference version eat , we improved the columnar ViT by introducing a local path in parallel with the global MSA operation, denoted as EAT-Ti, EAT-S, and EAT-B in the top part of Table 3. In this paper, we extend the columnar structure to a pyramid architecture and carefully re-design the novel EATFormer model, which provides a series of scales for different practical applications; these variants are listed in the bottom part of Table 3. Except for the depth and dimension of the model, other parameters remain consistent for all models: the head dimension of MSA is 32; the window size is set to 7; the kernel size of all convolutions is kept the same; the dilations of the MSRA module for the four stages are [1], [1], [1,2,3], and [1,2], respectively; the low-level stages 1-2 only use the local path, while the high-level stages 3-4 employ the hybrid GLI module for efficiency. More detailed structures and implementations can be viewed in the attached source code.
Type | Network | Depth | Dimension | Params. (M) | FLOPs (G) | Inf. Mem. (G) | Top-1
Col. | EAT-Ti | 12 | 192 | 5.7 | 1.01 | 2.2 | 72.7 |
EAT-S | 12 | 384 | 22.1 | 3.83 | 2.9 | 80.4 | |
EAT-B | 12 | 768 | 86.6 | 14.83 | 4.5 | 82.0 | |
Pyramid | EATFormer-Mobile | [ 1, 1, 4, 1 ] | [ 48, 64, 160, 256 ] | 1.8 | 0.36 | 2.2 | 69.4 |
EATFormer-Lite | [ 1, 2, 6, 1 ] | [ 64, 128, 192, 256 ] | 3.5 | 0.91 | 2.7 | 75.4 | |
EATFormer-Tiny | [ 2, 2, 6, 2 ] | [ 64, 128, 192, 256 ] | 6.1 | 1.41 | 3.1 | 78.4 | |
EATFormer-Mini | [ 2, 3, 8, 2 ] | [ 64, 128, 256, 320 ] | 11.1 | 2.29 | 3.6 | 80.9 | |
EATFormer-Small | [ 3, 4, 12, 3 ] | [ 64, 128, 320, 448 ] | 24.3 | 4.32 | 4.9 | 83.1 | |
EATFormer-Medium | [ 4, 5, 14, 4 ] | [ 64, 160, 384, 512 ] | 39.9 | 7.07 | 6.2 | 83.6 | |
EATFormer-Base | [ 5, 6, 20, 7 ] | [ 96, 160, 384, 576 ] | 63.5 | 10.89 | 8.7 | 83.9 |
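For reference, the variants in Table 3 can be summarized as a plain configuration mapping; the dictionary below is purely illustrative and does not reflect the official model-building API.

```python
# Illustrative summary of the EATFormer variants in Table 3 (names are hypothetical):
EATFORMER_VARIANTS = {
    #                    depths per stage,  embed dims per stage
    "eatformer_mobile": ([1, 1, 4, 1],      [48, 64, 160, 256]),
    "eatformer_lite":   ([1, 2, 6, 1],      [64, 128, 192, 256]),
    "eatformer_tiny":   ([2, 2, 6, 2],      [64, 128, 192, 256]),
    "eatformer_mini":   ([2, 3, 8, 2],      [64, 128, 256, 320]),
    "eatformer_small":  ([3, 4, 12, 3],     [64, 128, 320, 448]),
    "eatformer_medium": ([4, 5, 14, 4],     [64, 160, 384, 512]),
    "eatformer_base":   ([5, 6, 20, 7],     [96, 160, 384, 576]),
}
```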
4.5 Further Discussion
Compared with the EAT of the former conference version, the improved EATFormer has better inspirations, finer analogical designs, and more sufficient experiments. We prove the effectiveness and integrity of the proposed method through a series of experiments, such as comparisons with SOTA methods, downstream task transfer, ablation studies, and explanatory experiments. It is worth noting that the backbone of EATFormer in this paper only contains one unified EAT block, which fully considers three aspects of modeling: 1) multi-scale information aggregation, 2) feature interaction among tokens, and 3) individual enhancement. Also, the architecture recipes of the EATFormer variants in this paper are mainly given by our intuition and verified by experiments, but the alterable configuration parameters can serve as the search space for NAS, which is worth further exploration in future work, e.g., the embedding dimension, dilations of MSRA, kernel size of MSRA, fusion function of MSRA, down-sampling mode of MSRA, separation ratio of GLI, normalization types, window size, operation combinations of GLI, etc.
5 Experiments
In this section, to evaluate the effectiveness and superiority of the improved EATFormer architecture, we experiment on mainstream vision tasks with models of different volumes as the backbone and conduct downstream tasks in order, i.e., image-level classification (ImageNet-1K imagenet ), object-level detection and instance segmentation (COCO 2017 coco ), and pixel-level semantic segmentation (ADE20K ade20k ). Extensive ablation and explanatory experiments are further conducted to prove the effectiveness of EATFormer and its components.
5.1 Image Classification
5.1.1 Experimental Setting
All EATFormer variants are trained for 300 epochs from scratch without pre-training, extra datasets, pre-trained models, token-labeling-like strategies tlt , or exponential moving average. We employ the same training recipe as DeiT attn_deit for all EATFormer variants for fair comparisons with different SOTA methods: the AdamW adamw optimizer is used for training with betas and weight decay equal to (0.9, 0.999) and 5e-2, respectively; the batch size is set to 2,048, while the base learning rate is 5e-4 by default and scaled linearly with the batch size divided by 512; the standard cosine learning rate scheduler, data augmentation strategies, warm-up, and stochastic depth are used during the training phase attn_deit . EATFormer is built on PyTorch pytorch and relies on the TIMM interface timm .
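The recipe above can be approximated with the following PyTorch sketch (AdamW with the DeiT hyper-parameters, linear learning-rate scaling, warm-up, and a cosine schedule); the actual training relies on the TIMM scripts, so this is an approximation of the schedule only.

```python
import torch

def build_optimizer_and_scheduler(model, batch_size=2048, epochs=300, warmup_epochs=5):
    """Sketch of the DeiT-style recipe described above (epoch-level scheduling)."""
    lr = 5e-4 * batch_size / 512                       # linear lr scaling rule
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  betas=(0.9, 0.999), weight_decay=5e-2)
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-3,
                                               total_iters=warmup_epochs)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=epochs - warmup_epochs)
    scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                      milestones=[warmup_epochs])
    return optimizer, scheduler

opt, sched = build_optimizer_and_scheduler(torch.nn.Linear(8, 8))
```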
Network | Params. (M) | FLOPs (G) | Images/s (GPU) | Images/s (CPU) | Top-1 | Pub.
MNetV3-Small 0.75x mnetv3 | 2.0 | 0.05 | 9872 | 589.1 | 65.4 | ICCV’19 | |
EATFormer-Mobile | 1.8 | 0.36 | 3926 | 456.3 | 69.4 | - | |
MobileNetV3 0.75× mnetv3 | 4.0 | 0.16 | 5585 | 315.4 | 73.3 | ICCV’19 | |
PVTv2-B0 attn_pvt | 3.6 | 0.57 | 1711 | 104.2 | 70.5 | ICCV’21 | |
XCiT-N12 attn_xcit | 3.1 | 0.56 | 2736 | 290.9 | 69.9 | NeurIPS’21 | |
VAN-Tiny attn_van | 4.1 | 0.87 | 1706 | 107.9 | 75.4 | arXiv’22 | |
EATFormer-Lite | 3.5 | 0.91 | 2168 | 246.3 | 75.4 | - | |
DeiT-Ti attn_deit | 5.7 | 1.25 | 2342 | 417.7 | 72.2 | ICML’21 | |
EAT-Ti eat | 5.8 | 1.01 | 2356 | 436.8 | 72.7 | NeurIPS’21 | |
EfficientNet-B0 efficientnet | 5.3 | 0.39 | 2835 | 225.1 | 77.1 | ICML’19 | |
CoaT-Lite Tiny attn_coat | 5.7 | 1.59 | 1055 | 143.5 | 77.5 | ICCV’21 | |
ViTAE-6M attn_vitae | 6.6 | 2.16 | 921 | 152.6 | 77.9 | NeurIPS’21 | |
XCiT-T12 attn_xcit | 6.7 | 1.25 | 1750 | 259.5 | 77.1 | NeurIPS’21 | |
MPViT-T attn_mpvit | 5.8 | 1.65 | 755 | 125.9 | 78.2 | CVPR’22 | |
EATFormer-Tiny | 6.1 | 1.41 | 1549 | 167.5 | 78.4 | - | |
EATFormer-Tiny-384 | 6.1 | 4.23 | 536 | 56.9 | 80.1 | - | |
EfficientNet-B2 efficientnet | 9.1 | 0.88 | 1440 | 143.4 | 80.1 | ICML’19 | |
PVTv2-B1 attn_pvt | 14.0 | 2.12 | 1006 | 79.2 | 78.7 | ICCV’21 | |
ViTAE-13M attn_vitae | 10.8 | 3.05 | 698 | 114.3 | 81.0 | NeurIPS’21 | |
XCiT-T24 attn_xcit | 12.1 | 2.35 | 933 | 146.6 | 79.4 | NeurIPS’21 | |
CoaT-Lite Mini attn_coat | 11.0 | 1.99 | 968 | 164.8 | 79.1 | ICCV’21 | |
PoolFormer-S12 attn_metaformer | 11.9 | 1.82 | 1858 | 218.7 | 77.2 | CVPR’22 | |
MPViT-XS attn_mpvit | 10.5 | 2.97 | 612 | 95.8 | 80.9 | CVPR’22 | |
VAN-Small attn_van | 13.8 | 2.50 | 992 | 95.2 | 81.1 | arXiv’22 | |
EATFormer-Mini | 11.1 | 2.29 | 1055 | 122.1 | 80.9 | - | |
DeiT-S attn_deit | 22.0 | 4.60 | 937 | 163.5 | 79.8 | ICML’21 | |
EAT-S eat | 22.1 | 3.83 | 964 | 175.6 | 80.4 | NeurIPS’21 | |
ResNet-50 resnet ; timm_resnet | 25.5 | 4.11 | 1192 | 123.8 | 80.4 | CVPR’16 | |
EfficientNet-B4 efficientnet | 19.3 | 3.13 | 495 | 38.2 | 82.9 | ICML’19 | |
CoaT-Lite Small attn_coat | 19.8 | 3.96 | 542 | 93.8 | 81.9 | ICCV’21 | |
PVTv2-B2 attn_pvt | 25.3 | 4.04 | 585 | 36.5 | 82.0 | ICCV’21 | |
Swin-T attn_swin | 28.2 | 4.50 | 664 | 88.0 | 81.3 | ICCV’21 | |
XCiT-S12 attn_xcit | 26.2 | 18.92 | 187 | 30.1 | 82.0 | NeurIPS’21 | |
ViTAE-S attn_vitae | 24.0 | 6.20 | 399 | 69.9 | 82.0 | NeurIPS’21 | |
UniFormer-S attn_uniformer | 21.5 | 3.64 | 844 | 121.1 | 82.9 | ICLR’22 | |
CrossFormer-T attn_crossformer | 27.7 | 2.86 | 948 | 185.4 | 81.5 | ICLR’22 | |
DAT-T attn_DAT | 28.3 | 4.58 | 581 | 75.6 | 82.0 | CVPR’22 | |
PoolFormer-S36 attn_metaformer | 30.8 | 5.00 | 656 | 76.4 | 81.4 | CVPR’22 | |
MPViT-S attn_mpvit | 22.8 | 4.80 | 417 | 68.7 | 83.0 | CVPR’22 | |
Shunted-S attn_SSA | 22.4 | 5.01 | 461 | 48.6 | 83.7 | CVPR’22 | |
PoolFormer-S24 attn_metaformer | 21.3 | 3.41 | 971 | 111.9 | 80.3 | CVPR’22 | |
CSWin-T cswin | 22.3 | 4.34 | 592 | 66.6 | 82.7 | CVPR’22 | |
iFormer-S iformer | 19.9 | 4.85 | 528 | 58.8 | 83.4 | NeurIPS’22 | |
MaxViT-T maxvit | 25.1 | 4.56 | 367 | 36.6 | 83.6 | ECCV’22 | |
VAN-Base attn_van | 26.5 | 5.00 | 542 | 51.9 | 82.8 | arXiv’22 | |
ViTAEv2-S attn_vitae2 | 19.3 | 5.78 | 464 | 72.8 | 82.6 | IJCV’23 | |
NAT-Tiny attn_nat | 27.9 | 4.11 | 401 | < 1 | 83.2 | arXiv’22 | |
EATFormer-Small | 24.3 | 4.32 | 615 | 73.3 | 83.1 | - | |
EATFormer-Small-384 | 24.3 | 12.92 | 198 | 24.1 | 84.3 | - | |
ResNet-101 resnet ; timm_resnet | 44.5 | 7.83 | 675 | 81.5 | 81.5 | CVPR’16 | |
EfficientNet-B5 efficientnet | 30.3 | 10.46 | 163 | 18.6 | 83.6 | ICML’19 | |
PVTv2-B3 attn_pvt | 45.2 | 6.92 | 392 | 45.7 | 83.2 | ICCV’21 | |
CoaT-Lite Medium attn_coat | 44.5 | 9.80 | 275 | 42.2 | 83.6 | ICCV’21 | |
XCiT-S24 attn_xcit | 47.6 | 36.04 | 99 | 16.0 | 82.6 | NeurIPS’21 | |
CSWin-S cswin | 34.6 | 6.83 | 360 | 51.2 | 83.6 | CVPR’22 | |
EATFormer-Medium | 39.9 | 7.05 | 425 | 53.4 | 83.6 | - | |
ViT-B/16 attn_vit | 86.5 | 17.58 | 293 | 53.0 | 77.9 | ICLR’21 | |
DeiT-B attn_deit | 86.5 | 17.58 | 293 | 53.7 | 81.8 | ICML’21 | |
EAT-B eat | 86.6 | 14.83 | 331 | 71.5 | 82.0 | NeurIPS’21 | |
ResNet-152 resnet ; timm_resnet | 60.1 | 11.55 | 470 | 57.5 | 82.0 | CVPR’16 | |
EfficientNet-B7 efficientnet | 66.3 | 38.32 | 47 | 5.6 | 84.3 | ICML’19 | |
PVTv2-B5 attn_pvt | 81.9 | 11.76 | 256 | 33.9 | 83.8 | ICCV’21 | |
Swin-B attn_swin | 87.7 | 15.46 | 258 | 38.6 | 83.5 | ICCV’21 | |
Twins-SVT-L attn_twins | 99.2 | 15.14 | 271 | 43.1 | 83.2 | NeurIPS’21 | |
UniFormer-B attn_uniformer | 49.7 | 8.27 | 378 | 58.7 | 83.9 | ICLR’22 | |
DAT-B attn_DAT | 87.8 | 15.78 | 217 | 30.8 | 84.0 | CVPR’22 | |
Shunted-B attn_SSA | 39.6 | 8.18 | 290 | 33.7 | 84.0 | CVPR’22 | |
PoolFormer-M48 attn_metaformer | 73.4 | 11.59 | 301 | 37.5 | 82.5 | CVPR’22 | |
MPViT-B attn_mpvit | 74.8 | 16.44 | 181 | 29.1 | 84.0 | CVPR’22 | |
CSWin-B cswin | 77.4 | 15.00 | 204 | 32.5 | 84.2 | CVPR’22 | |
iFormer-B iformer | 47.9 | 9.38 | 262 | 38.9 | 84.6 | NeurIPS’22 | |
MaxViT-S maxvit | 55.8 | 9.41 | 231 | 32.1 | 84.4 | ECCV’22 | |
NAT-Small attn_nat | 50.7 | 7.50 | 260 | < 1 | 83.7 | arXiv’22 | |
ViTAEv2-48M attn_vitae2 | 48.6 | 13.38 | 251 | 38.6 | 83.8 | IJCV’23 | |
EATFormer-Base | 49.0 | 8.94 | 329 | 43.7 | 83.9 | - | |
EATFormer-Base-384 | 49.0 | 26.11 | 112 | 14.2 | 84.9 | - |
5.1.2 Experimental Results
In this work, we design EATFormer variants at different scales to meet different application requirements, and comparison results with SOTA methods are shown in Table 4. To fully evaluate different methods, we choose the number of parameters (Params.), FLOPs, Top-1 accuracy on ImageNet-1K, as well as the throughput on GPU (with a basic batch size of 128 on a single V100 SXM2 32GB; the batch size is reduced to what memory allows for large models) and CPU (with a batch size of 128 on a Xeon 8255C CPU @ 2.50GHz) as evaluation indexes. Our smallest EATFormer-Mobile obtains 69.4 Top-1, which is much higher than the MobileNetV3-Small 0.75× counterpart, i.e., 65.4, while the largest EATFormer-Base obtains a very competitive result with only 49.0M parameters and further achieves 84.9 at 384×384 resolution. Comparatively, although our approach obtains only a slight improvement over the recent SOTA MPViT-T/-XS/-S by +0.2%/+0.0%/+0.1%, EATFormer has significantly fewer FLOPs by -0.21G/-0.68G/-0.48G, faster GPU speed by 2.1×/1.7×/1.5×, and faster CPU speed by 1.33×/1.27×/1.07×. At the highest 50M-level, our EATFormer-Base still achieves a throughput of 329 images/s, which is 1.8× faster than MPViT-B, and this efficiency increase is also considerable. This means that EATFormer is more user-friendly than MPViT on general-purpose GPU and CPU devices, and that EATFormer can better trade off parameters, computation, and precision. At the same time, our tiny, small, and base models improve by +5.7, +2.7, and +1.9 compared with the previous conference version. Interestingly, we find that the Top-1 accuracy of different methods with 50∼80M parameters approximately saturates around 84.0 without external data, token labeling, larger resolution, etc., so alleviating this problem is worth future exploration.
Backbone | Mask R-CNN 1×: AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Mask R-CNN 3×: AP^b | AP^b_50 | AP^b_75 | AP^m | AP^m_50 | AP^m_75 | Params. (M) | FLOPs (G) | Pub.
PVT-Tiny attn_pvt | 36.7 | 59.2 | 39.3 | 35.1 | 56.7 | 37.3 | 39.8 | 62.2 | 43.0 | 37.4 | 59.3 | 39.9 | 33 | - | ICCV’21 |
PVTv2-B0 attn_pvt2 | 38.2 | 60.5 | 40.7 | 36.2 | 57.8 | 38.6 | - | - | - | - | - | - | 23 | 195 | CVM’22 |
XCiT-T12 attn_xcit | - | - | - | - | - | - | 44.5 | 66.4 | 48.8 | 40.4 | 63.5 | 43.3 | 26 | 266 | NeurIPS’21 |
PFormer-S12 attn_metaformer | 37.3 | 59.0 | 40.1 | 34.6 | 55.8 | 36.9 | - | - | - | - | - | - | 31 | - | CVPR’22 |
MPViT-T attn_mpvit | 42.2 | 64.2 | 45.8 | 39.0 | 61.4 | 41.8 | 44.8 | 66.9 | 49.2 | 41.0 | 64.2 | 44.1 | 28 | 216 | CVPR’22 |
EATFormer-Tiny | 42.3 | 64.7 | 46.2 | 39.0 | 61.5 | 42.0 | 45.4 | 67.5 | 49.5 | 41.4 | 64.8 | 44.6 | 25 | 198 | - |
ResNet-50 resnet | 38.0 | 58.6 | 41.4 | 34.4 | 55.1 | 36.7 | 41.0 | 61.7 | 44.9 | 37.1 | 58.4 | 40.1 | 44 | 260 | CVPR’16 |
Swin-T attn_swin | 43.7 | 66.6 | 47.7 | 39.8 | 63.3 | 42.7 | 46.0 | 68.1 | 50.3 | 41.6 | 65.1 | 44.9 | 48 | 267 | ICCV’21 |
Twins-S attn_twins | 43.4 | 66.0 | 47.3 | 40.3 | 63.2 | 43.4 | 46.8 | 69.2 | 51.2 | 42.6 | 66.3 | 45.8 | 44 | 228 | NeurIPS’21 |
PFormer-S24 attn_metaformer | 40.1 | 62.2 | 43.4 | 37.0 | 59.1 | 39.6 | - | - | - | - | - | - | 41 | - | CVPR’22 |
DAT-T attn_DAT | 44.4 | 67.6 | 48.5 | 40.4 | 64.2 | 43.1 | 47.1 | 69.2 | 51.6 | 42.4 | 66.1 | 45.5 | 48 | 272 | CVPR’22 |
MPViT-S attn_mpvit | 46.4 | 68.6 | 51.2 | 42.4 | 65.6 | 45.7 | 48.4 | 70.5 | 52.6 | 43.9 | 67.6 | 47.5 | 43 | 268 | CVPR’22 |
EATFormer-Small | 46.1 | 68.4 | 50.4 | 41.9 | 65.3 | 44.8 | 47.4 | 69.3 | 51.9 | 42.9 | 66.4 | 46.3 | 44 | 258 | - |
ResNet-101 resnet | 40.4 | 61.1 | 44.2 | 36.4 | 57.7 | 38.8 | 42.8 | 63.2 | 47.1 | 38.5 | 60.1 | 41.3 | 63 | 336 | CVPR’16 |
Swin-S attn_swin | 45.7 | 67.9 | 50.4 | 41.1 | 64.9 | 44.2 | 48.5 | 70.2 | 53.5 | 43.3 | 67.3 | 46.6 | 69 | 359 | ICCV’21 |
Twins-B attn_twins | 45.2 | 67.6 | 49.3 | 41.5 | 64.5 | 44.8 | 48.0 | 69.5 | 52.7 | 43.0 | 66.8 | 46.6 | 76 | 340 | NeurIPS’21 |
PFormer-S36 attn_metaformer | 41.0 | 63.1 | 44.8 | 37.7 | 60.1 | 40.0 | - | - | - | - | - | - | 51 | - | CVPR’22 |
DAT-S attn_DAT | 47.1 | 69.9 | 51.5 | 42.5 | 66.7 | 45.4 | 49.0 | 70.9 | 53.8 | 44.0 | 68.0 | 47.5 | 69 | 378 | CVPR’22 |
MPViT-B attn_mpvit | 48.2 | 70.0 | 52.9 | 43.5 | 67.1 | 46.8 | 49.5 | 70.9 | 54.0 | 44.5 | 68.3 | 48.3 | 95 | 503 | CVPR’22 |
EATFormer-Base | 47.2 | 69.4 | 52.1 | 42.8 | 66.4 | 46.5 | 49.0 | 70.3 | 53.6 | 44.2 | 67.7 | 47.6 | 68 | 349 | - |
5.2 Object Detection and Instance Segmentation
5.2.1 Experimental Setting
To further evaluate the effectiveness and superiority of our method, the ImageNet-1K imagenet pre-trained EATFormer is benchmarked as the feature extractor for downstream object detection and instance segmentation on the COCO 2017 dataset coco; its window size is increased from 7 to 12, with no global attention or other changes. For fair comparisons, we employ the MMDetection library mmdetection and follow the same training recipe as Swin-Transformer attn_swin: a 1× schedule for 12 epochs and a 3× schedule with a multi-scale training strategy for 36 epochs. The AdamW adamw optimizer is used with learning rate and weight decay equaling 1e-4 and 5e-2, respectively.
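As a reference for this recipe, the fragment below sketches how it could be expressed as an MMDetection (2.x-style) config; the base-config path, the backbone name `EATFormer`, and its keyword arguments are assumptions for illustration, while the optimizer and schedule fields follow commonly used Swin-based Mask R-CNN configs.

```python
# Hedged MMDetection 2.x-style config fragment for the 1x schedule described above.
_base_ = ['./mask_rcnn_r50_fpn_1x_coco.py']           # assumed base config path

model = dict(
    backbone=dict(
        _delete_=True,
        type='EATFormer',                             # assumed custom backbone registration name
        window_size=12,                               # enlarged from 7 as stated above
        out_indices=(0, 1, 2, 3)))

optimizer = dict(_delete_=True, type='AdamW', lr=1e-4,
                 betas=(0.9, 0.999), weight_decay=5e-2)
lr_config = dict(step=[8, 11])                        # 1x schedule: decay at epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
```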

5.2.2 Experimental Results
Comparison results of box mAP and mask mAP are reported in Table 5, and our improved EATFormer obtains competitive results over recent approaches. Specifically, our tiny model obtains +5.6/+5.6 and +3.9/+4.0 improvements over PVT-Tiny attn_pvt on the 1× and 3× schedules, respectively, while achieving higher results than MPViT attn_mpvit with fewer parameters and FLOPs, i.e., +0.6 and +0.4 on the 3× schedule. The larger EATFormer-Small and EATFormer-Base models consistently outperform recent counterparts, surpassing Swin-T by +2.4/+2.1 and Swin-S by +1.5/+1.7 with the 1× schedule, and by +1.4/+1.3 and +0.5/+0.9 with the 3× schedule. We also obtain slightly higher results than DAT attn_DAT while reducing computation by 29G FLOPs. Although EATFormer is slightly below the SOTA MPViT-S/B on downstream metrics, our method has clear advantages in parameters and computation, e.g., -10G FLOPs compared with MPViT-S for Mask R-CNN, and -154G (-30.6%) FLOPs and -27M (-28.4%) parameters compared with MPViT-B, effectively balancing effectiveness and efficiency. Compared with MPViT-T, our EATFormer-Tiny has obvious advantages in metrics, parameter count, and computation. Qualitative visualizations on the validation set compared with Swin-S attn_swin are shown in the top part of Figure 7: our EATFormer yields more accurate detections, fewer false positives, and finer segmentation results than Swin Transformer.
5.3 Semantic Segmentation
5.3.1 Experimental Setting
We further conduct semantic segmentation experiments on the ADE20K ade20k dataset: the pre-trained EATFormer with window size 12 is integrated into the UperNet upernet architecture to obtain pixel-level predictions. In detail, we follow the same setting as Swin-Transformer attn_swin and train the model for 160k iterations. The AdamW adamw optimizer is also used, with learning rate and weight decay equaling 1e-4 and 5e-2, respectively.
5.3.2 Experimental Results
Segmentation results compared with contemporary SOTA works at three main model scales are reported in Table 6. Our EATFormer-Tiny obtains a significant +3.4 mIoU improvement over the recent VAN-Tiny attn_van, while EATFormer-Small achieves a higher mIoU with fewer FLOPs than most SOTA counterparts. The larger EATFormer-Base also obtains competitive results, i.e., +1.7 and +1.0 mIoU over Swin-S attn_swin and DAT-S attn_DAT, respectively. Compared with the SOTA MPViT, we obtain a better trade-off among parameters, computation, and precision; e.g., our EATFormer-Base has 26M fewer parameters and 156G fewer FLOPs than MPViT-B. Overall, our approach offers a better balance of precision and computation than its counterparts. Intuitive visualizations on the validation set compared with Swin-S attn_swin are shown in the bottom part of Figure 7; the qualitative results consistently demonstrate the robustness and effectiveness of the proposed approach, with EATFormer producing more accurate segmentation.
Backbone | Params. (M) | GFLOPs | mIoU (%) | Pub. |
XCiT-T12 attn_xcit | 34 | - | 43.5 | NeurIPS’21 |
VAN-Tiny attn_van | 32 | 858 | 41.1 | CVPR’22 |
EATFormer-Tiny | 34 | 870 | 44.5 | - |
Swin-T attn_swin | 60 | 945 | 44.5 | ICCV’21 |
XCiT-S12 attn_xcit | 52 | - | 46.6 | NeurIPS’21 |
DAT-T attn_DAT | 60 | 957 | 45.5 | CVPR’22 |
ViTAEv2-S attn_vitae2 | 49 | - | 45.0 | IJCV’23 |
MPViT-S attn_mpvit | 52 | 943 | 48.3 | CVPR’22 |
UniFormer-S uniformer_arxiv | 52 | 955 | 47.0 | arXiv’22 |
EATFormer-Small | 53 | 934 | 47.3 | - |
Swin-S attn_swin | 81 | 1038 | 47.6 | ICCV’21 |
XCiT-M24 attn_xcit | 109 | - | 48.4 | NeurIPS’21 |
DAT-S attn_DAT | 81 | 1079 | 48.3 | CVPR’22 |
MPViT-B attn_mpvit | 105 | 1186 | 50.3 | CVPR’22 |
UniFormer-B uniformer_arxiv | 80 | 1106 | 49.5 | arXiv’22 |
EATFormer-Base | 79 | 1030 | 49.3 | - |
5.4 Ablation Study
To fully evaluate the effectiveness of each designed module, we conduct a series of ablation studies in the following sections. By default, EATFormer-Tiny is used for all experiments, and we follow the same training recipe as mentioned in Section 5.1.1.
5.4.1 Component of EAT Block
As mentioned in Section 4.3, our proposed EAT block contains: 1) MSRA, 2) GLI, and 3) FFN modules, which are responsible for aggregating multi-scale information, interacting global and local features, and enhancing the features of each location, respectively; a minimal structural sketch of the block is given after Table 7 below. To verify the validity of each module, we conduct an ablation experiment with different component combinations in Table 7. The results indicate that each component contributes to the model performance, and our EATFormer obtains the best result when all three parts are used. Since the FFN takes up most of the parameters and computation, optimizing this module is a promising direction for further improving the overall efficiency.
MSRA | GLI | FFN | Params (M) | FLOPs (G) | Top-1 (%) |
✔ | ✘ | ✘ | 2.4 | 0.45 | 62.9 |
✘ | ✔ | ✘ | 2.6 | 0.51 | 64.4 |
✘ | ✘ | ✔ | 5.2 | 1.17 | 71.4 |
✔ | ✔ | ✘ | 2.9 | 0.60 | 67.7 |
✔ | ✘ | ✔ | 5.5 | 1.26 | 76.0 |
✘ | ✔ | ✔ | 5.8 | 1.32 | 77.4 |
✔ | ✔ | ✔ | 6.1 | 1.41 | 78.4 |
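To make the three-part residual structure concrete, the following minimal PyTorch sketch shows only the block layout; the sub-module bodies are simplified stand-ins assumed for illustration, not the paper's actual MSRA, GLI, and FFN implementations.

```python
import torch
import torch.nn as nn

class EATBlock(nn.Module):
    """Structural sketch of the EAT block: three residual sub-modules."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        # placeholder bodies; the real MSRA/GLI are multi-scale and global/local modules
        self.msra = nn.Sequential(nn.BatchNorm2d(dim),
                                  nn.Conv2d(dim, dim, 3, padding=1, groups=dim))
        self.gli = nn.Sequential(nn.BatchNorm2d(dim), nn.Conv2d(dim, dim, 1))
        self.ffn = nn.Sequential(nn.BatchNorm2d(dim),
                                 nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
                                 nn.Conv2d(dim * mlp_ratio, dim, 1))

    def forward(self, x):              # x: (B, C, H, W)
        x = x + self.msra(x)           # multi-scale information
        x = x + self.gli(x)            # global/local interaction
        x = x + self.ffn(x)            # per-location enhancement
        return x
```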
5.4.2 Separation Ratio of GLI
We deduce from Equation 12 and Equation 13 in Section 4.3.2 that EATFormer has the lowest number of parameters and computation when the separation ratio of GLI equals 0.2, and that the total parameters and computation differ little when the ratio lies in the range [0, 0.5]. To further verify the above analysis and the validity of GLI, we conduct a set of classification experiments with equal-interval sampling of the ratio in the range [0, 1]. As shown in Figure 8, the x-coordinate represents different proportions, and the left y-ordinate reports the Top-1 accuracy of the modified EATFormer-Tiny with embedding dims [64, 128, 230, 320] for divisible channels, while the right y-ordinate shows the model's running speed and relative computation. The results are consistent with the foregoing derivation, and a ratio of 0.5 is the most economical and efficient choice, where the model has relatively high precision, fast speed, and low computational cost. All GLI layers in this paper use the same ratio; exploring different ratios for different layers could lead to further improvements based on the above analysis.
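A minimal sketch of the channel-separation idea is given below: the input channels are split by a ratio p between a global attention path and a local convolution path and then fused. Plain multi-head attention and a depth-wise convolution stand in for the paper's MD-MSA and local operator, and we assume the split keeps the global channels divisible by the head number.

```python
import torch
import torch.nn as nn

class GLISketch(nn.Module):
    """Hedged sketch of GLI channel separation with ratio p (not the exact implementation)."""
    def __init__(self, dim, p=0.5, num_heads=4):
        super().__init__()
        self.c_g = int(dim * p)                    # channels sent to the global path
        self.c_l = dim - self.c_g                  # channels sent to the local path
        self.global_attn = nn.MultiheadAttention(self.c_g, num_heads, batch_first=True)
        self.local_conv = nn.Conv2d(self.c_l, self.c_l, 3, padding=1, groups=self.c_l)
        self.proj = nn.Conv2d(dim, dim, 1)         # fuse the two paths back together

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, h, w = x.shape
        xg, xl = x.split([self.c_g, self.c_l], dim=1)
        tokens = xg.flatten(2).transpose(1, 2)     # (B, H*W, c_g) tokens for attention
        xg = self.global_attn(tokens, tokens, tokens)[0].transpose(1, 2).reshape(b, self.c_g, h, w)
        xl = self.local_conv(xl)
        return self.proj(torch.cat([xg, xl], dim=1))

# usage sketch: GLISketch(64, p=0.5)(torch.randn(2, 64, 14, 14))
```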

5.4.3 Component Ablation of EATFormer
Following the core idea of paralleling global and local modelings, this paper extends the previous columnar EAT model eat into a pyramid architecture. Specifically, the EAT block-based EATFormer can be seen as evolving from a naive baseline that employs: 1) patch embedding for down-sampling; 2) MSRA with only one scale; 3) naive MSA; and 4) a simple addition with the mixing weight fixed to 1, towards the full model that uses: 1) MSRA for down-sampling; 2) MSRA with multiple scales; 3) the improved MD-MSA; and 4) weighted operation mixing (WOM) with learnable weights (a minimal sketch of WOM is given after Table 8). The detailed ablation based on EATFormer-Tiny is reported in Table 8: each individual component plays a role, and different component combinations complement each other to help the model achieve higher results. Note that WOM can only be applied when the multi-path MSRA is used.
MSRA Down | MSRA | MD-MSA | WOM | Param. (M) | FLOPs (G) | Top-1 (%) |
✘ | ✘ | ✘ | ✘ | 4.792 | 1.232 | 77.4 |
✔ | ✘ | ✘ | ✘ | 5.202 | 1.300 | 77.8 |
✘ | ✔ | ✘ | ✘ | 5.208 | 1.283 | 77.9 |
✘ | ✘ | ✚ | ✘ | 4.804 | 1.236 | 77.5 |
✘ | ✘ | ✔ | ✘ | 4.805 | 1.236 | 77.7 |
✔ | ✔ | ✘ | ✘ | 6.109 | 1.412 | 78.2 |
✔ | ✘ | ✔ | ✘ | 5.214 | 1.304 | 78.0 |
✘ | ✔ | ✔ | ✘ | 5.220 | 1.288 | 78.1 |
✔ | ✔ | ✔ | ✘ | 6.122 | 1.416 | 78.2 |
✔ | ✔ | ✘ | ✔ | 6.109 | 1.412 | 78.1 |
✘ | ✔ | ✔ | ✔ | 5.221 | 1.288 | 78.0 |
✔ | ✔ | ✔ | ✔ | 6.122 | 1.416 | 78.4 |
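The WOM mechanism can be summarized by the short sketch below, which mixes the outputs of parallel paths with learnable weights; the softmax normalization of the weights is our assumption for illustration.

```python
import torch
import torch.nn as nn

class WeightedOperationMixing(nn.Module):
    """Sketch of WOM: outputs of N parallel paths are combined with learnable weights."""
    def __init__(self, num_paths):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(num_paths))   # one learnable weight per path

    def forward(self, path_outputs):                       # list of same-shaped tensors
        weights = torch.softmax(self.alpha, dim=0)         # normalization assumed here
        return sum(w * y for w, y in zip(weights, path_outputs))
```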
5.4.4 Composition of GLI
By default, the global path in GLI employs the designed MD-MSA module inspired by the dynamic population concept, while the local branch uses a conventional CNN for static feature extraction. To further assess the potential of the GLI module, different combinations of global (i.e., MSA and MD-MSA) and local (i.e., CNN and DCNv2 dcn2) operators are evaluated. As shown in Table 9, MD-MSA improves the model by +0.3 with only negligible extra parameters and computation, while DCNv2 can further boost performance by a large margin at the cost of higher storage and computation. Theoretically, MD-MSA should have little impact on speed, but our naive PyTorch implementation without CUDA acceleration leads to an obvious decrease in GPU throughput; the running speed of our model could therefore be improved with further optimization of MD-MSA (a rough sketch of the MD-MSA idea is given after Table 9).
Global | Local | Param. (M) | FLOPs (G) | GPU (images/s) | Top-1 (%) |
MSA | CNN | 6.1 | 1.412 | 1896 | 78.1 |
MSA | DCNv2 | 9.0 | 1.522 | 1567 | 79.0 |
MD-MSA | CNN | 6.1 | 1.416 | 1549 | 78.4 |
MD-MSA | DCNv2 | 9.0 | 1.526 | 1333 | 79.2 |
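To illustrate what MD-MSA adds over plain MSA, the rough sketch below predicts a per-position offset and modulation, resamples the feature map with grid_sample, and then applies ordinary multi-head self-attention; all layer choices and shapes here are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDMSASketch(nn.Module):
    """Very rough sketch of the MD-MSA idea (offset + modulation before attention)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.offset = nn.Conv2d(dim, 2, 1)                 # (dx, dy) per position
        self.modulation = nn.Conv2d(dim, 1, 1)             # scalar gate per position
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        # base sampling grid in [-1, 1], shape (B, H, W, 2) with x-coordinate first
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=x.device),
                                torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        grid = torch.stack((xs, ys), dim=-1).expand(b, -1, -1, -1)
        offset = self.offset(x).permute(0, 2, 3, 1)        # learned displacement per position
        sampled = F.grid_sample(x, grid + offset, align_corners=True)
        sampled = sampled * torch.sigmoid(self.modulation(x))
        tokens = sampled.flatten(2).transpose(1, 2)        # (B, H*W, C)
        out = self.attn(tokens, tokens, tokens)[0]
        return out.transpose(1, 2).reshape(b, c, h, w)
```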
5.4.5 Normalization Type
Transformer-based vision models generally use Layer Normalization (LN) rather than Batch Normalization (BN) to achieve better results. Nevertheless, considering that LN requires slightly more computation than BN at inference and that the proposed hybrid EATFormer contains many convolutions usually paired with BN layers, we conduct an ablation study to evaluate which normalization is better. Table 10 shows the results on three EATFormer variants: the BN-normalized EATFormer achieves slightly better accuracy while owning a significantly faster GPU inference speed. Note that merging convolution and BN layers is not used here; this technique can further improve the inference speed (a sketch of the standard conv-BN folding is given after Table 10).
Network | Params (M) | FLOPs (G) | GPU (images/s) | Top-1 (%) |
Tiny-LN | 6.1 | 1.425 | 963 | 78.2 |
Tiny-BN | 6.1 | 1.416 | 1549 | 78.4 |
Small-LN | 24.3 | 4.337 | 448 | 82.8 |
Small-BN | 24.3 | 4.320 | 615 | 83.1 |
Base-LN | 49.0 | 8.775 | 240 | 83.7 |
Base-BN | 49.0 | 8.744 | 345 | 83.9 |
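The conv-BN merging mentioned above is the standard inference-time folding of a BatchNorm layer into the preceding convolution; a generic sketch is given below.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold a BatchNorm that follows a convolution into the conv weights (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)       # per-channel rescaling
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```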
5.4.6 MSRA at Different Stages
Different network depths may have different requirements for the MSRA module, so we explore introducing MSRA at different stages. As shown in Table 11, our model obtains the best result when MSRA is used in stages [2, 3, 4], and the performance drops sharply when it is only used in the fourth stage. Considering both accuracy and efficiency, using this module in stages [3, 4] is the better choice.
Stages | Params (M) | FLOPs (G) | GPU (images/s) | Top-1 (%) |
1, 2, 3, 4 | 6.3 | 1.541 | 1291 | 78.2 |
2, 3, 4 | 6.3 | 1.533 | 1434 | 78.5 |
3, 4 | 6.1 | 1.416 | 1549 | 78.4 |
4 | 5.6 | 1.326 | 1695 | 77.9 |
5.4.7 Kernel Size of MSRA
The MSRA module for multi-scale modeling adopts CNN as its primary component, so the convolution kernel size may influence the results. As shown in Table 12, a larger kernel size only slightly improves the model, while the number of parameters and the amount of computation increase dramatically. Therefore, we employ the efficient 3×3 kernel size in MSRA for EATFormer at all scales.
Size | Params (M) | FLOPs (G) | GPU (images/s) | Top-1 (%) |
3×3 | 6.1 | 1.416 | 1549 | 78.4 |
5×5 | 9.0 | 1.845 | 1342 | 78.5 |
7×7 | 13.4 | 2.487 | 1087 | 78.5 |
5.4.8 Layer Number of TRH
The plug-and-play TRH module can be easily docked with the transformer backbone to obtain task-related feature representations; we take the classification task as an example to explore its effect (a sketch of a single TRH layer is given after Table 13). As shown in Table 13, Top-1 accuracy improves significantly as the number of TRH layers in EATFormer-Tiny increases, and the performance tends to saturate after two layers. Therefore, a two-layer TRH is the recommended choice to balance effectiveness and efficiency. However, there is no noticeable improvement for the larger models, so for them the multi-task flexibility of TRH matters more than the accuracy gain.
Network | Params (M) | FLOPs (G) | GPU (images/s) | Top-1 (%) |
Tiny | 6.1 | 1.416 | 1549 | 78.4 |
Tiny +1 | 6.9+0.8 | 1.423 | 1495 | 78.7 |
Tiny +2 | 7.7+1.6 | 1.430 | 1461 | 79.1 |
Tiny +3 | 8.4+2.3 | 1.438 | 1423 | 79.2 |
Small +2 | 29.1+4.8 | 4.363 | 589 | 83.2 |
Base +2 | 55.3+6.3 | 9.001 | 316 | 83.9 |
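A single TRH layer can be sketched as a learnable task token that queries the backbone tokens through cross-attention, as shown below; the exact normalization and FFN arrangement are assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class TRHLayerSketch(nn.Module):
    """Sketch of one Task-Related Head layer: a task token cross-attends to backbone tokens."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.task_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, task_token=None):     # feats: (B, N, C) backbone tokens
        b = feats.shape[0]
        q = self.task_token.expand(b, -1, -1) if task_token is None else task_token
        out, _ = self.cross_attn(q, feats, feats)  # task token attends to all feature tokens
        return self.norm(q + out)                  # (B, 1, C) task-related representation
```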
5.5 EATFormer Explanation
5.5.1 Alpha Distribution of Different Depths
The weighted operation mixing mechanism improves model performance and also objectively reflects the model's attention to different branches at different depths. Based on EATFormer-Tiny, we use a 3-path MSRA along with a 2-path GLI for each EAT block, and the alpha-indicated weight distribution after training is shown in Figure 9. 1) For the MSRA module, the proportion of the dilation-1 path within the same stage shows an increasing trend while the larger-dilation paths show the opposite, indicating that local feature extraction with stronger correlation (i.e., smaller scale) is more critical for the network. The weight mutation between adjacent stages is caused by the down-sampling operation that changes the feature distribution. In the last stage 4, large-scale paths have more weight because they need to model as much global information as possible to obtain proper classification results. In general, though, the proportions of the branches are balanced, meaning that feature learning at all scales contributes to the network. Considering the amount of computation and the number of parameters, this also supports the choice of only using MSRA in stages 3/4 described in Section 5.4.6 above. 2) For the GLI module, the global branch gains more and more weight relative to the local branch as the network deepens, indicating that both branches are effective and complementary: the local CNN is more suitable for low-level feature extraction, while the global transformer is better at high-level information fusion.

5.5.2 Attention Visualization
To better illustrate which parts of the image the model focuses on, Grad-CAM gradcam is applied to highlight the regions the model attends to; a minimal hook-based sketch of Grad-CAM is given below. As shown in Figure 10, each column visualizes a different image for ResNet-50 resnet, Swin-B attn_swin, and our EATFormer-Base, respectively. The results indicate that: 1) the CNN-based ResNet tends to focus on as many regions as possible but ignores edges; 2) the Transformer-based Swin pays more attention to sparse local areas; 3) thanks to the design of the MSRA and GLI modules, our EATFormer has more discriminative attention to the subject targets, with very sharp edges.
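The Grad-CAM maps discussed above can be reproduced with a small hook-based routine like the sketch below; a torchvision ResNet-50 stands in for the backbones compared in Figure 10, and the single-image usage is an illustrative assumption.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

def grad_cam(model, image, target_layer, class_idx=None):
    """Minimal hook-based Grad-CAM sketch for a single image (not the exact tooling used)."""
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(image)
    idx = logits.argmax(1) if class_idx is None else torch.tensor([class_idx])
    logits.gather(1, idx.view(-1, 1)).sum().backward()
    h1.remove()
    h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)          # GAP over gradients
    cam = F.relu((weights * feats[0]).sum(1, keepdim=True))    # weighted activation map
    return F.interpolate(cam, image.shape[-2:], mode="bilinear", align_corners=False)

# usage sketch with a torchvision ResNet-50 and its last residual stage
model = resnet50(weights=None).eval()
cam = grad_cam(model, torch.randn(1, 3, 224, 224), model.layer4)
```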

5.5.3 Attention Distance of Global Path in GLI
We design the GLI module to explicitly model global and local information separately, so the local branch could take over part of the short-distance modeling of the global branch. To verify this, we visualize the modeling distance of the global branch for our previous columnar EAT model eat and the currently studied EATFormer in Figure 11: 1-Top) Compared with DeiT, which has no local modeling, our EAT pays more attention to global information fusion (layers 4/6 are chosen as examples), with more significant values at off-diagonal locations. 2-Bottom) Attention maps in the last stage are visualized because the window size there equals the feature size and thus covers the overall information. When global modeling is used alone (w/o GLI), the model only focuses on sparse regions, but it attends to more regions when GLI is used. The results indicate that the designed parallel local path takes over some local modeling that would otherwise fall to the global path. We can also observe differences in feature modeling between the columnar and pyramid architectures.
Relationship with EA. Motivated by EA variants mea1 ; mea2 ; mea3 that introduce local search procedures besides the conventional global search to converge to higher-quality solutions, we analogically design the novel GLI module. When GLI is not used (only global modeling), the model tends to correlate features within local regions, which is consistent with the concept of a local population in biological evolution arising from geographical constraints, i.e., akin to local search in EA. With GLI, the explicit local modeling unlocks the potential of global modeling, forcing the global branch to associate more distant features for better results, just as combining global and local search in EA mea1 ; mea2 ; mea3 improves performance.

5.5.4 Offset and Modulation Distribution of MD-MSA
Figure 12 visualizes the learned offset (the longer the arrow, the farther the deformable distance; the arrow direction indicates the sampling direction) and modulation (the brighter the color, the greater the weight) of MD-MSA in stage 4. The offset and modulation of each location differ across depths, and the model, perhaps unexpectedly, tends to give more weight to the positions that describe the main parts of the object. Since we set align_corners to true when resampling, there is a gradually increasing bias from 0 to 0.5 from the center to the edge, so the visualization behaves as a whole spreading outwards, which may visually weaken the changes at each learned position. Please zoom in for better visualization.
Relationship with EA. Inspired by the irregular spatial distribution of real individuals, which is not as regular as the horizontal-vertical image grid, we design the novel MD-MSA module that considers the offset of each spatial position. As shown in Figure 12, different positions (individuals) prefer different offsets and modulations (i.e., directions and scales), just as individuals have different preferences in different regions of the biological world. This modeling method has also been verified in EA, e.g., the improved works dae_supp3 ; dae_supp1 ; dae_supp2 adopt a similar parameter adaptation and feature scaling idea to conduct global feature interaction.

5.5.5 Visualization of Attention Map in TRH
Taking the classification task as an example, we visualize the attention maps in the two-layer TRH, which contains multiple heads in its inner cross-attention layers. As shown in Figure 13, we normalize the attention map values to [0, 1] and draw them on the right side of the image. The results indicate that different heads focus on different regions, and the deeper TRH2 focuses on a broader area than TRH1 to form the final feature.

5.5.6 Parameters and FLOPs Distribution
Taking the designed EATFormer-Tiny as an example, we analyze the distribution of parameters and FLOPs across layers, where the model contains a stem for resolution reduction, four stages for feature extraction, and a head for the target output. As shown in Figure 14, the parameters are mainly distributed in the deeper stages 3/4, while the FLOPs concentrate in the early stages, and the FFN occupies the majority of both parameters and computation. Therefore, optimizing the FFN structure is a promising way to further balance the overall model efficiency in future work.
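Such a per-stage parameter breakdown can be tallied directly from the model; the sketch below assumes sub-module names containing 'stem', 'stage1'..'stage4', or 'head', which may differ from the actual implementation.

```python
from collections import defaultdict

def params_per_stage(model):
    """Tally parameters per named stage of a backbone (naming scheme is assumed)."""
    counts = defaultdict(int)
    for name, p in model.named_parameters():
        for key in ("stem", "stage1", "stage2", "stage3", "stage4", "head"):
            if key in name:
                counts[key] += p.numel()
                break
    return {k: f"{v / 1e6:.2f}M" for k, v in counts.items()}
```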

5.5.7 Comparison with Works Using Local/Global Concepts
In this paper, the locality in ViT refers to introducing CNNs with inductive bias into the Transformer structure, and we design the GLI block as a parallel structure that introduces a distinct local branch beside the global branch. This idea is motivated by EA variants mea1 ; mea2 ; mea3 that employ local search procedures besides the conventional global search to converge to higher-quality solutions. Also, the global/local concept is only an idea in the macro sense, and its specific realization varies from method to method. E.g., the global/local concept in MPViT attn_mpvit is expressed as parallelism between blocks rather than within each block, while CMT cmt cascades local information into the FFN module rather than the MSA as in previous works attn_botnet ; attn_ceit ; efficientformerv2. Comparatively, our GLI block consists of a local convolution and a global MD-MSA operation, and the Weighted Operation Mixing (WOM) mechanism is further proposed to mix all operations adaptively. Therefore, we argue that GLI clearly differs from the compared methods. Besides, we compare our method with some contemporary/recent works attn_mpvit ; cmt ; attn_uniformer ; attn_vitae2 ; mobilevit ; edgenext ; efficientformerv2 that incorporate the global/local concept into their model designs. To further illustrate the differences, we make a comprehensive comparison in terms of local/global concepts by several criteria in Table 14. The results illustrate the uniqueness of GLI at the technical level.
Method vs. Criterion | ➀ | ➁ | ➂ | ➃ | ➄ |
MPViT attn_mpvit | ✘ | ✘ | ✔ | ✘ | ✘ |
CMT cmt | ✘ | ✘ | ✔ | ✘ | ✘ |
UniFormer attn_uniformer | ✘ | ✔ | ✔ | ✘ | ✘ |
ViTAEv2 attn_vitae2 | ✔ | ✔ | ✘ | ✘ | ✘ |
MobileViT mobilevit | ✘ | ✔ | ✘ | ✘ | ✘ |
EdgeNeXt edgenext | ✘ | ✔ | ✘ | ✘ | ✘ |
EfficientFormerv2 efficientformerv2 | ✘ | ✘ | ✘ | ✘ | ✘ |
EATFormer (Ours) | ✔ | ✔ | ✔ | ✔ | ✔ |
6 Conclusion
This paper explains the rationality of vision transformer by analogy with EA and improves our previous columnar EAT to a novel pyramid EATFormer architecture inspired by effective EA variants. Specifically, the designed backbone consists only of the proposed EAT block that contains three residual parts, i.e., MSRA, GLI, and FFN modules, to model multi-scale, interactive, and individual information separately. Moreover, we propose a TRH module and improve an MD-MSA module to boost the effectiveness and usability of our EATFormer further. Abundant experiments on classification and downstream tasks demonstrate the superiority of our approach over SOTA methods in terms of accuracy and efficiency, while ablation and explanatory experiments further illustrate the effectiveness of EATFormer and each analogically designed component.
Nevertheless, we do not experiment with larger models (e.g., >100M parameters), larger datasets (e.g., ImageNet-21K imagenet), or stronger training strategies (e.g., token labeling tlt) due to limited computation resources. Also, the architecture recipes are mainly given by our intuition, and hyper-parameter search could be used to optimize the model structure further. We will explore the above aspects and the combination with self-supervised learning techniques in future work.
Data Availability Statement. All the datasets used in this paper are available online. ImageNet-1K (http://image-net.org), COCO 2017 (https://cocodataset.org), and ADE20K (http://sceneparsing.csail.mit.edu) can be downloaded from their official websites accordingly.
Acknowledgement. This work was supported by a grant from the National Natural Science Foundation of China (No. 62103363).
References
- (1) Ali, A., Touvron, H., Caron, M., Bojanowski, P., Douze, M., Joulin, A., Laptev, I., Neverova, N., Synnaeve, G., Verbeek, J., et al.: Xcit: Cross-covariance image transformers. In: NeurIPS (2021)
- (2) Atito, S., Awais, M., Kittler, J.: Sit: Self-supervised vision transformer. arXiv preprint arXiv:2104.03602 (2021)
- (3) Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J., Auli, M.: Data2vec: A general framework for self-supervised learning in speech, vision and language. In: ICML (2022)
- (4) Bao, H., Dong, L., Piao, S., Wei, F.: BEit: BERT pre-training of image transformers. In: ICLR (2022)
- (5) Bartz-Beielstein, T., Branke, J., Mehnen, J., Mersmann, O.: Evolutionary algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (2014)
- (6) Bello, I.: Lambdanetworks: Modeling long-range interactions without attention. In: ICLR (2021)
- (7) Bertasius, G., Wang, H., Torresani, L.: Is space-time attention all you need for video understanding? In: ICML (2021)
- (8) Bhojanapalli, S., Chakrabarti, A., Glasner, D., Li, D., Unterthiner, T., Veit, A.: Understanding robustness of transformers for image classification. In: ICCV (2021)
- (9) Bhowmik, P., Pantho, M.J.H., Bobda, C.: Bio-inspired smart vision sensor: toward a reconfigurable hardware modeling of the hierarchical processing in the brain. J REAL-TIME IMAGE PR (2021)
- (10) Brest, J., Greiner, S., Boskovic, B., Mernik, M., Zumer, V.: Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. TEC (2006)
- (11) Brest, J., Zamuda, A., Boskovic, B., Maucec, M.S., Zumer, V.: High-dimensional real-parameter optimization using self-adaptive differential evolution algorithm with population size reduction. In: CEC (2008)
- (12) Brest, J., Zamuda, A., Fister, I., Maučec, M.S.: Large scale global optimization using self-adaptive differential evolution algorithm. In: CEC (2010)
- (13) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. In: NeurIPS (2020)
- (14) Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
- (15) Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
- (16) Chen, B., Li, P., Li, C., Li, B., Bai, L., Lin, C., Sun, M., Yan, J., Ouyang, W.: Glit: Neural architecture search for global and local image transformer. In: ICCV (2021)
- (17) Chen, H., Wang, Y., Guo, T., Xu, C., Deng, Y., Liu, Z., Ma, S., Xu, C., Xu, C., Gao, W.: Pre-trained image processing transformer. In: CVPR (2021)
- (18) Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
- (19) Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
- (20) Chen, M., Peng, H., Fu, J., Ling, H.: Autoformer: Searching transformers for visual recognition. In: ICCV (2021)
- (21) Chen, M., Wu, K., Ni, B., Peng, H., Liu, B., Fu, J., Chao, H., Ling, H.: Searching the search space of vision transformer. In: NeurIPS (2021)
- (22) Chen, Q., Wu, Q., Wang, J., Hu, Q., Hu, T., Ding, E., Cheng, J., Wang, J.: Mixformer: Mixing features across windows and dimensions. In: CVPR (2022)
- (23) Chen, T., Saxena, S., Li, L., Fleet, D.J., Hinton, G.: Pix2seq: A language modeling framework for object detection. In: ICLR (2022)
- (24) Chen, X., Ding, M., Wang, X., Xin, Y., Mo, S., Wang, Y., Han, S., Luo, P., Zeng, G., Wang, J.: Context autoencoder for self-supervised representation learning. IJCV (2023)
- (25) Chen, X., Xie, S., He, K.: An empirical study of training self-supervised visual transformers. In: ICCV (2021)
- (26) Chen, Y., Dai, X., Chen, D., Liu, M., Dong, X., Yuan, L., Liu, Z.: Mobile-former: Bridging mobilenet and transformer. In: CVPR (2022)
- (27) Chen, Z., Kang, L.: Multi-population evolutionary algorithm for solving constrained optimization problems. In: AIAI (2005)
- (28) Chen, Z., Zhu, Y., Zhao, C., Hu, G., Zeng, W., Wang, J., Tang, M.: Dpt: Deformable patch-based transformer for visual recognition. In: ACM MM (2021)
- (29) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
- (30) Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for semantic segmentation. In: NeurIPS (2021)
- (31) Choromanski, K.M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J.Q., Mohiuddin, A., Kaiser, L., Belanger, D.B., Colwell, L.J., Weller, A.: Rethinking attention with performers. In: ICLR (2021)
- (32) Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C.: Twins: Revisiting the design of spatial attention in vision transformers. In: NeurIPS (2021)
- (33) Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia, H., Shen, C.: Conditional positional encodings for vision transformers. In: ICLR (2023)
- (34) Coello, C.A.C., Lamont, G.B.: Applications of multi-objective evolutionary algorithms, vol. 1. World Scientific (2004)
- (35) Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-attention and convolutional layers. In: ICLR (2020)
- (36) Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV (2017)
- (37) Das, S., Suganthan, P.N.: Differential evolution: A survey of the state-of-the-art. TEC (2010)
- (38) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR (2009)
- (39) Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. In: NAACL (2019)
- (40) Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Cswin transformer: A general vision transformer backbone with cross-shaped windows. In: CVPR (2022)
- (41) Dong, X., Bao, J., Zhang, T., Chen, D., Zhang, W., Yuan, L., Chen, D., Wen, F., Yu, N., Guo, B.: Peco: Perceptual codebook for bert pre-training of vision transformers. In: AAAI (2023)
- (42) Dong, Y., Cordonnier, J.B., Loukas, A.: Attention is not all you need: Pure attention loses rank doubly exponentially with depth. In: ICML (2021)
- (43) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
- (44) d’Ascoli, S., Touvron, H., Leavitt, M.L., Morcos, A.S., Biroli, G., Sagun, L.: Convit: Improving vision transformers with soft convolutional inductive biases. In: ICML (2021)
- (45) Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., Liu, W.: You only look at one sequence: Rethinking transformer in vision through object detection. In: NeurIPS (2021)
- (46) Felleman, D.J., Van Essen, D.C.: Distributed hierarchical processing in the primate cerebral cortex. Cerebral cortex (New York, NY: 1991) (1991)
- (47) Gao, P., Ma, T., Li, H., Lin, Z., Dai, J., Qiao, Y.: Mcmae: Masked convolution meets masked autoencoders. In: NeurIPS (2022)
- (48) García-Martínez, C., Lozano, M.: Local search based on genetic algorithms. In: Advances in metaheuristics for hard optimization. Springer (2008)
- (49) Goyal, A., Bengio, Y.: Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A (2022)
- (50) Guo, J., Han, K., Wu, H., Tang, Y., Chen, X., Wang, Y., Xu, C.: Cmt: Convolutional neural networks meet vision transformers. In: CVPR (2022)
- (51) Guo, M.H., Lu, C.Z., Liu, Z.N., Cheng, M.M., Hu, S.M.: Visual attention network. CVM (2023)
- (52) Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. In: NeurIPS (2021)
- (53) Hao, Y., Dong, L., Wei, F., Xu, K.: Self-attention attribution: Interpreting information interactions inside transformer. In: AAAI (2021)
- (54) Hart, W.E., Krasnogor, N., Smith, J.E.: Memetic evolutionary algorithms. In: Recent advances in memetic algorithms, pp. 3–27. Springer (2005)
- (55) Hassanat, A., Almohammadi, K., Alkafaween, E., Abunawas, E., Hammouri, A., Prasath, V.: Choosing mutation and crossover ratios for genetic algorithms—a review with a new dynamic approach. Information (2019)
- (56) Hassani, A., Walton, S., Li, J., Li, S., Shi, H.: Neighborhood attention transformer. In: CVPR (2023)
- (57) He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR (2022)
- (58) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
- (59) He, R., Ravula, A., Kanagal, B., Ainslie, J.: Realformer: Transformer likes residual attention. arXiv preprint arXiv:2012.11747 (2020)
- (60) Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: ICCV (2019)
- (61) Huang, Z., Ben, Y., Luo, G., Cheng, P., Yu, G., Fu, B.: Shuffle transformer: Rethinking spatial shuffle for vision transformer. arXiv preprint arXiv:2106.03650 (2021)
- (62) Hudson, D.A., Zitnick, L.: Generative adversarial transformers. In: ICML (2021)
- (63) Jiang, Y., Chang, S., Wang, Z.: Transgan: Two pure transformers can make one strong gan, and that can scale up. In: NeurIPS (2021)
- (64) Jiang, Z.H., Hou, Q., Yuan, L., Zhou, D., Shi, Y., Jin, X., Wang, A., Feng, J.: All tokens matter: Token labeling for training better vision transformers. In: NeurIPS (2021)
- (65) Katharopoulos, A., Vyas, A., Pappas, N., Fleuret, F.: Transformers are rnns: Fast autoregressive transformers with linear attention. In: ICML (2020)
- (66) Khare, V., Yao, X., Deb, K.: Performance scaling of multi-objective evolutionary algorithms. In: EMO (2003)
- (67) Kim, J., Nguyen, D., Min, S., Cho, S., Lee, M., Lee, H., Hong, S.: Pure transformers are powerful graph learners. In: NeurIPS (2022)
- (68) Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: The efficient transformer. In: ICLR (2020)
- (69) Kolen, A., Pesch, E.: Genetic local search in combinatorial optimization. Discrete Applied Mathematics (1994)
- (70) Kumar, S., Sharma, V.K., Kumari, R.: Memetic search in differential evolution algorithm. arXiv preprint arXiv:1408.0101 (2014)
- (71) Land, M.W.S.: Evolutionary algorithms with local search for combinatorial optimization. University of California, San Diego (1998)
- (72) Lee, Y., Kim, J., Willette, J., Hwang, S.J.: Mpvit: Multi-path vision transformer for dense prediction. In: CVPR (2022)
- (73) Li, C., Tang, T., Wang, G., Peng, J., Wang, B., Liang, X., Chang, X.: Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In: ICCV (2021)
- (74) Li, K., Wang, Y., Peng, G., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unified transformer for efficient spatial-temporal representation learning. In: ICLR (2022)
- (75) Li, K., Wang, Y., Zhang, J., Gao, P., Song, G., Liu, Y., Li, H., Qiao, Y.: Uniformer: Unifying convolution and self-attention for visual recognition. TPAMI (2023)
- (76) Li, X., Wang, L., Jiang, Q., Li, N.: Differential evolution algorithm with multi-population cooperation and multi-strategy integration. Neurocomputing (2021)
- (77) Li, Y., Hu, J., Wen, Y., Evangelidis, G., Salahi, K., Wang, Y., Tulyakov, S., Ren, J.: Rethinking vision transformers for mobilenet size and speed. In: ICCV (2023)
- (78) Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
- (79) Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., Timofte, R.: Swinir: Image restoration using swin transformer. In: ICCV (2021)
- (80) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV (2014)
- (81) Liu, J., Lampinen, J.: A fuzzy adaptive differential evolution algorithm. Soft Computing (2005)
- (82) Liu, Y., Li, H., Guo, Y., Kong, C., Li, J., Wang, S.: Rethinking attention-model explainability through faithfulness violation test. In: ICML (2022)
- (83) Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., et al.: Swin transformer v2: Scaling up capacity and resolution. In: CVPR (2022)
- (84) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
- (85) Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: ICLR (2019)
- (86) Lu, J., Mottaghi, R., Kembhavi, A., et al.: Container: Context aggregation networks. In: NeurIPS (2021)
- (87) Maaz, M., Shaker, A., Cholakkal, H., Khan, S., Zamir, S.W., Anwer, R.M., Shahbaz Khan, F.: Edgenext: efficiently amalgamated cnn-transformer architecture for mobile vision applications. In: ECCVW (2023)
- (88) Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: ICLR (2022)
- (89) Min, J., Zhao, Y., Luo, C., Cho, M.: Peripheral vision transformer. In: NeurIPS (2022)
- (90) Moscato, P., et al.: On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Caltech concurrent computation program, C3P Report 826, 1989 (1989)
- (91) Motter, B.C.: Focal attention produces spatially selective processing in visual cortical areas v1, v2, and v4 in the presence of competing stimuli. Journal of neurophysiology (1993)
- (92) Nakashima, K., Kataoka, H., Matsumoto, A., Iwata, K., Inoue, N., Satoh, Y.: Can vision transformers learn without natural images? In: AAAI (2022)
- (93) Neimark, D., Bar, O., Zohar, M., Asselmann, D.: Video transformer network. In: ICCV (2021)
- (94) Opara, K.R., Arabas, J.: Differential evolution: A survey of theoretical analyses. Swarm and evolutionary computation (2019)
- (95) Padhye, N., Mittal, P., Deb, K.: Differential evolution: Performances and analyses. In: CEC (2013)
- (96) Pan, X., Ge, C., Lu, R., Song, S., Chen, G., Huang, Z., Huang, G.: On the integration of self-attention and convolution. In: CVPR (2022)
- (97) Pant, M., Zaheer, H., Garcia-Hernandez, L., Abraham, A., et al.: Differential evolution: A review of more than two decades of research. Engineering Applications of Artificial Intelligence (2020)
- (98) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. In: NeurIPS (2019)
- (99) Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: ICLR (2018)
- (100) Qiang, Y., Pan, D., Li, C., Li, X., Jang, R., Zhu, D.: Attcat: Explaining transformers via attentive class activation tokens. In: NeurIPS (2022)
- (101) Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al.: Improving language understanding by generative pre-training. OpenAI (2018)
- (102) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog (2019)
- (103) Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? In: NeurIPS (2021)
- (104) Ren, S., Zhou, D., He, S., Feng, J., Wang, X.: Shunted self-attention via multi-scale token aggregation. In: CVPR (2022)
- (105) Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: ICCV (2017)
- (106) Shi, E.C., Leung, F.H., Law, B.N.: Differential evolution with adaptive population size. In: ICDSP (2014)
- (107) Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., YAN, S.: Inception transformer. In: NeurIPS (2022)
- (108) Sloss, A.N., Gustafson, S.: 2019 evolutionary algorithms review. Genetic programming theory and practice XVII (2020)
- (109) Srinivas, A., Lin, T.Y., Parmar, N., Shlens, J., Abbeel, P., Vaswani, A.: Bottleneck transformers for visual recognition. In: CVPR (2021)
- (110) Storn, R., Price, K.: Differential evolution–a simple and efficient heuristic for global optimization over continuous spaces. J. Glob. Optim. pp. 341–359 (1997)
- (111) Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: ICML (2019)
- (112) Thatipelli, A., Narayan, S., Khan, S., Anwer, R.M., Khan, F.S., Ghanem, B.: Spatio-temporal relation modeling for few-shot action recognition. In: CVPR (2022)
- (113) Toffolo, A., Benini, E.: Genetic diversity as an objective in multi-objective evolutionary algorithms. Evolutionary computation (2003)
- (114) Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: ICML (2021)
- (115) Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., Jégou, H.: Going deeper with image transformers. In: ICCV (2021)
- (116) Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxvit: Multi-axis vision transformer. In: ECCV (2022)
- (117) Valanarasu, J.M.J., Oza, P., Hacihaliloglu, I., Patel, V.M.: Medical transformer: Gated axial-attention for medical image segmentation. In: MICCAI (2021)
- (118) Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: CVPR (2021)
- (119) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
- (120) Vikhar, P.A.: Evolutionary algorithms: A critical review and its future prospects. In: ICGTSPICC (2016)
- (121) Wan, Z., Chen, H., An, J., Jiang, W., Yao, C., Luo, J.: Facial attribute transformers for precise and robust makeup transfer. In: CACV (2022)
- (122) Wang, H., Wu, Z., Liu, Z., Cai, H., Zhu, L., Gan, C., Han, S.: Hat: Hardware-aware transformers for efficient natural language processing. In: ACL (2020)
- (123) Wang, R., Chen, D., Wu, Z., Chen, Y., Dai, X., Liu, M., Jiang, Y.G., Zhou, L., Yuan, L.: Bevt: Bert pretraining of video transformers. In: CVPR (2022)
- (124) Wang, S., Li, B., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768 (2020)
- (125) Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: ICCV (2021)
- (126) Wang, W., Xie, E., Li, X., Fan, D.P., Song, K., Liang, D., Lu, T., Luo, P., Shao, L.: Pvt v2: Improved baselines with pyramid vision transformer. CVM (2022)
- (127) Wang, W., Yao, L., Chen, L., Lin, B., Cai, D., He, X., Liu, W.: Crossformer: A versatile vision transformer hinging on cross-scale attention. In: ICLR (2022)
- (128) Wang, Y., Yang, Y., Bai, J., Zhang, M., Bai, J., Yu, J., Zhang, C., Huang, G., Tong, Y.: Evolving attention with residual convolutions. In: ICML (2021)
- (129) Wei, C., Fan, H., Xie, S., Wu, C.Y., Yuille, A., Feichtenhofer, C.: Masked feature prediction for self-supervised visual pre-training. In: CVPR (2022)
- (130) Wightman, R.: Pytorch image models. https://github.com/rwightman/pytorch-image-models (2019)
- (131) Wightman, R., Touvron, H., Jegou, H.: Resnet strikes back: An improved training procedure in timm. In: NeurIPSW (2021)
- (132) Wu, H., Xiao, B., Codella, N., Liu, M., Dai, X., Yuan, L., Zhang, L.: Cvt: Introducing convolutions to vision transformers. In: ICCV (2021)
- (133) Xia, Z., Pan, X., Song, S., Li, L.E., Huang, G.: Vision transformer with deformable attention. In: CVPR (2022)
- (134) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV (2018)
- (135) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS (2021)
- (136) Xie, Z., Zhang, Z., Cao, Y., Lin, Y., Bao, J., Yao, Z., Dai, Q., Hu, H.: Simmim: A simple framework for masked image modeling. In: CVPR (2022)
- (137) Xu, L., Yan, X., Ding, W., Liu, Z.: Attribution rollout: a new way to interpret visual transformer. JAIHC (2023)
- (138) Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., Soatto, S.: Long short-term transformer for online action detection. In: NeurIPS (2021)
- (139) Xu, W., Xu, Y., Chang, T., Tu, Z.: Co-scale conv-attentional image transformers. In: ICCV (2021)
- (140) Xu, Y., Zhang, Q., Zhang, J., Tao, D.: Vitae: Vision transformer advanced by exploring intrinsic inductive bias. In: NeurIPS (2021)
- (141) Yang, C., Wang, Y., Zhang, J., Zhang, H., Wei, Z., Lin, Z., Yuille, A.: Lite vision transformer with enhanced self-attention. In: CVPR (2022)
- (142) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR (2022)
- (143) Yuan, K., Guo, S., Liu, Z., Zhou, A., Yu, F., Wu, W.: Incorporating convolution designs into visual transformers. In: ICCV (2021)
- (144) Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.H., Tay, F.E., Feng, J., Yan, S.: Tokens-to-token vit: Training vision transformers from scratch on imagenet. In: ICCV (2021)
- (145) Yuan, L., Hou, Q., Jiang, Z., Feng, J., Yan, S.: Volo: Vision outlooker for visual recognition. TPAMI (2022)
- (146) Yuan, Y., Fu, R., Huang, L., Lin, W., Zhang, C., Chen, X., Wang, J.: Hrformer: High-resolution vision transformer for dense predict. In: NeurIPS (2021)
- (147) Zamir, S.W., Arora, A., Khan, S., Hayat, M., Khan, F.S., Yang, M.H.: Restormer: Efficient transformer for high-resolution image restoration. In: CVPR (2022)
- (148) Zhang, J., Li, X., Li, J., Liu, L., Xue, Z., Zhang, B., Jiang, Z., Huang, T., Wang, Y., Wang, C.: Rethinking mobile block for efficient attention-based models. In: ICCV (2023)
- (149) Zhang, J., Xu, C., Li, J., Chen, W., Wang, Y., Tai, Y., Chen, S., Wang, C., Huang, F., Liu, Y.: Analogous to evolutionary algorithm: Designing a unified sequence model. In: NeurIPS (2021)
- (150) Zhang, Q., Xu, Y., Zhang, J., Tao, D.: Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond. IJCV (2023)
- (151) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)
- (152) Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. IJCV (2019)
- (153) Zhou, D., Kang, B., Jin, X., Yang, L., Lian, X., Hou, Q., Feng, J.: Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886 (2021)
- (154) Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. In: CVPR (2019)
- (155) Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable {detr}: Deformable transformers for end-to-end object detection. In: ICLR (2021)