
RepNeXt: A Fast Multi-Scale CNN using Structural Reparameterization

Mingshu Zhao1  Yi Luo1  Yong Ouyang1,2
1Sichuan Energy Internet Research Institute, Tsinghua University
2Chengdu Qingrong Shentong Technology
[email protected]  [email protected]  [email protected]
Abstract

In the realm of resource-constrained mobile vision tasks, the pursuit of efficiency and performance consistently drives innovation in lightweight Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). While ViTs excel at capturing global context through self-attention mechanisms, their deployment in resource-limited environments is hindered by computational complexity and latency. Conversely, lightweight CNNs are favored for their parameter efficiency and low latency. This study investigates the complementary advantages of CNNs and ViTs to develop a versatile vision backbone tailored for resource-constrained applications. We introduce RepNeXt, a novel model series that integrates multi-scale feature representations and incorporates both serial and parallel structural reparameterization (SRP) to enhance network depth and width without compromising inference speed. Extensive experiments demonstrate RepNeXt’s superiority over current leading lightweight CNNs and ViTs, providing advantageous latency across various vision benchmarks. RepNeXt-M4 matches RepViT-M1.5’s 82.3% accuracy on ImageNet within 1.5ms on an iPhone 12, surpasses it by 1.3 APbox on MS-COCO, and reduces parameters by 0.7M. Codes and models are available at https://github.com/suous/RepNeXt.

1 Introduction

Over the past decade, Convolutional Neural Networks (CNNs) [35, 25, 56] have been predominant in computer vision applications, leveraging their inherent locality and translation equivariance [15]. To facilitate their deployment on resource-constrained devices, various efficient design principles have emerged, including spatial or depth separable convolutions [62, 63, 29], channel shuffling [84], partial channel operations [21, 4, 82], neural architecture search [66, 64], network pruning [22, 76, 75], and structural reparameterization (SRP) [10, 9].

Figure 1: Latency vs Accuracy Comparison. The top-1 accuracy is tested on ImageNet-1K and the latency is measured on an iPhone 12 with iOS 16 across 20 experimental sets. RepNeXt consistently achieves the best trade-off between performance and latency.

Vision Transformers (ViTs) [15, 2] have emerged as a competitive alternative to CNNs, with several innovations aimed at improving their efficiency, such as hierarchical designs or hybrid architectures [74, 40, 7, 78, 24], as well as local processing operations or linear attention mechanisms [23, 53, 13, 20, 1]. However, many optimizations require special operations that may not be feasible on devices with limited resources. Meanwhile, efficient designs often prioritize optimizing inference speed based on metrics like floating point operations or model sizes, which may not consistently correlate with actual latency experienced in mobile applications. Consequently, convolution operations are still preferred for balancing latency and accuracy [73].

Inspired by the sophisticated architectures [80] of ViTs and their ability to model long-range spatial dependencies [15], large-kernel CNNs [47, 11, 45] have gained widespread research attention for enlarging the effective receptive field (ERF). However, expanding kernel sizes may substantially inflate parameter counts, resulting in considerable memory requirements and optimization challenges.

To balance performance and speed while preserving both local and global representations, we present RepNeXt, a multi-scale CNN inspired by MixConv [65] and InceptionNeXt [81]. RepNeXt combines the hierarchical design of CNNs [35, 25] with the general architecture of ViTs [71, 80, 47, 81] at a macro level, and integrates the efficiency of small-kernel convolutions with the broad perspective of large-kernel convolutions at a micro level. Extensive experiments demonstrate its effectiveness across various vision benchmarks, including ImageNet-1K [8] for image classification, MS-COCO [44] for object detection and instance segmentation, and ADE20K [85] for semantic segmentation. Our contributions can be concluded as follows.

  • We introduce RepNeXt, a simple yet effective vision backbone with a consistent design across inner-stage blocks and downsampling layers, achieving competitive or superior performance considering the trade-off between accuracy and latency with only fundamental operation units, facilitating subsequent optimizations.

  • We leverage both serial and parallel SRP mechanisms to increase network depth and width during training, effectively improving representational capacity without sacrificing inference speed.

  • Following [73], we further demonstrate that a simple multi-scale CNN (without channel attention blocks [30]) can outperform sophisticated architectures or complicated operators through intricate design or neural architecture search (NAS) across various vision tasks.

2 Related Work

Efficient CNNs: Crafting efficient CNNs for edge vision applications has received a lot of attention in recent years. MobileNets [29, 57, 28] proposed depthwise separable convolutions as well as inverted residual blocks for better efficiency-accuracy trade-off. SqueezeNet [32] used squeeze and expand operations to maintain representational capacity while reducing computational cost. ShuffleNet [84] implemented channel shuffle after pointwise group convolutions for improved information exchange. GhostNet [21] and FasterNet [4] introduced partial channel operations to generate feature maps more efficiently. MicroNet [39] aggressively reduced FLOPs through further network decomposition and sparsification. MnasNet [66] and EfficientNet [64] leveraged neural architecture search (NAS) to automatically discover efficient architectures. ParC-Net [83] proposed position-aware circular convolution (ParC) to provide a global receptive field while producing location-sensitive features. StarNet [48] demonstrated the efficacy of star operation in extracting substantial representation power from implicitly high-dimensional spaces. RepViT [73] integrated architectural designs from efficient ViTs into mobile CNNs, leveraging SRP [9, 10] techniques and SE [30] modules to boost performance. Furthermore, network pruning [22, 76, 75] and low-bit quantization [77] mechanisms are often employed to further reduce model size and memory usage.

Efficient ViTs: Recent advancements in efficient ViTs concentrate on incorporating spatial inductive biases within ViT blocks. MobileViTs [49, 50, 72] integrated the efficiency of MobileNets with the global modeling capabilities of ViTs. Mobile-Former [5] utilized a bidirectional parallel structure to facilitate interaction between local and global features. EfficientFormers [41, 43] featured a dimension-consistent design using hardware-friendly 4D modules and powerful 3D Multi-Head Self-Attention (MHSA) blocks. FastViT [69] combined 7×7 depthwise convolutions and SRP to improve model capacity and efficiency. EdgeViTs [52] innovated with Local-Global-Local blocks to better integrate MHSA and convolution. SwiftFormer [59] introduced an efficient additive attention mechanism that replaces the quadratic matrix multiplications with linear element-wise operations. LightViT [31] incorporated a global aggregation scheme into both token and channel mixers to achieve a superior performance-efficiency trade-off. SHViT [82] addressed computational redundancies with a Single-Head Self-Attention (SHSA) on a subset of channels.

Figure 2: (left) The macro architecture of RepNeXt. RepNeXt adopts a four-stage hierarchical design, starting with two 3×3 convolutions with a stride of 2, where C_i represents the channel dimension at stage i, and H and W denote the image height and width, respectively. (right) The micro design of the MetaNeXt and Downsampling blocks. The MetaNeXt block [47, 81] includes a token mixer for spatial feature extraction, a normalization layer for training stability, and a channel mixer for channel information interaction. The token mixer employs a multi-scale reparameterized depthwise convolution, where the medium-kernel branch consists of five different kernel patterns to mimic the central vision enhancement feature of human eyes. The normalization layer is a Batch Normalization [33] layer, and the channel mixer comprises an MLP module consisting of two 1×1 pointwise convolution layers with a GELU [27] activation function in between. Additionally, the Downsampling layer is a specialized version of the MetaNeXt block with a simplified token mixer.

Large Kernel CNNs: Traditional CNNs such as AlexNet [35] and GoogLeNet [61] favored large kernels in their early layers, but the trend shifted towards stacking 3×3 kernels after VGG [60]. InceptionNets [62, 63] decomposed n×n convolutions into sequential 1×n and n×1 convolutions for efficiency. GCN [54] and SegNeXt [19] increased the kernel size through a combination of 1×k + k×1 and k×1 + 1×k convolutions for semantic segmentation. ConvMixer [68] achieved a substantial performance improvement through 9×9 depthwise convolutions inspired by the global perspective of ViTs [15] and MLP-Mixers [67]. MogaNet [37] crafted multi-scale spatial aggregation blocks with dilated convolutions to gather discriminative features. ConvNeXt [47] explored modern CNN architecture with 7×7 depthwise convolutions, reflecting the design philosophy of Swin Transformer [46]. InceptionNeXt [81] enhanced throughput and performance by decomposing large-kernel depthwise convolutions into four parallel branches. SKNet [38] and LSKNet [42] combined multi-branch convolutions along the channel or spatial dimension. RepLKNet [11] expanded kernel size to 31×31 with SRP, achieving performance comparable to Swin Transformers. Furthermore, UniRepLKNet [12] introduced four design principles for large-kernel CNNs, demonstrating universal applicability across various modalities. SLaK [45] incorporated stripe convolutions with dynamic sparsity to scale up kernels to 51×51. PeLK [3] investigated convolution operations with kernels expanding up to 101×101 in a human-like pattern. Additionally, LargeKernel3D [6] and ModernTCN [14] introduced large kernel design into 3D networks and time series analysis.

There are three major differences between prior efforts and our proposed method: (1) We adopt a simple and consistent design across inner-stage blocks and downsampling layers, facilitating easier hardware acceleration and further algorithm optimization. (2) We introduce multi-scale depthwise convolution, where the large-kernel convolution is decomposed into strip convolutions for efficiency, and the reparameterized medium-kernel convolution is meticulously crafted to imitate the central focusing characteristic of human eyes. (3) We eliminate normalization layers from SRP branches to reduce memory usage during training, enabling greater feature diversity within limited resources.

3 Method

3.1 Overall Architecture

The architecture of RepNeXt is based on RepViT [73], as illustrated in Figure 2. The macro structure follows the four-stage framework of conventional CNNs [25] and hierarchical ViTs [46]. It begins with a stem module consisting of two 3×3 convolutions with a stride of 2 [73, 43, 79]. Each subsequent stage progressively enhances the semantic representation while reducing spatial dimensions. The micro blocks adhere to the MetaNeXt design [47, 81], incorporating a token mixer for spatial feature extraction, a channel mixer for visual semantic interaction, a normalization layer [33] to stabilize and accelerate training, and a shortcut connection [25] to smooth the loss landscape [36].

Y = X + ChannelMixer(Norm(TokenMixer(X))),    (1)

where X, Y ∈ ℝ^{B×C×H×W}, with B, C, H, and W denoting the batch size, channel number, image height, and image width, respectively. Norm(·) denotes the Batch Normalization (BN) layer [33]. TokenMixer(·) operates as a chunk convolution when maintaining the feature scale or as a copy convolution during downsampling. Meanwhile, ChannelMixer(X) = Conv_{1×1,↓}(σ(Conv_{1×1,↑}(X))) is a channel MLP module comprising two fully-connected layers with an activation function in between, resembling the feed-forward network in a Transformer [71]. Here, σ represents the GELU [27] activation function, and Conv_{1×1,↑} and Conv_{1×1,↓} stand for 1×1 pointwise convolutions that expand and squeeze the feature maps, respectively.
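A minimal PyTorch sketch of the block in Eq. (1) is given below; the module and argument names are illustrative rather than the authors' released code, and the token mixer is left as a pluggable component (the chunk convolution of Sec. 3.2 in inner-stage blocks).

import torch.nn as nn

class MetaNeXtBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=2, token_mixer=nn.Identity):
        super().__init__()
        self.token_mixer = token_mixer(dim)      # spatial mixing, e.g. ChunkConv (Sec. 3.2)
        self.norm = nn.BatchNorm2d(dim)          # Norm(.) in Eq. (1)
        hidden = int(dim * mlp_ratio)
        self.channel_mixer = nn.Sequential(      # ChannelMixer(.): 1x1 expand -> GELU -> 1x1 squeeze
            nn.Conv2d(dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        # Eq. (1): Y = X + ChannelMixer(Norm(TokenMixer(X)))
        return x + self.channel_mixer(self.norm(self.token_mixer(x)))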

The downsampling layer between each stage is a modified version of the MetaNeXt block [47, 81], where the shortcut connection bypasses the channel mixer.

X̂ = Norm(TokenMixer(X)),
Y = X̂ + ChannelMixer(X̂),    (2)

where X̂, Y ∈ ℝ^{B×2C×H/2×W/2}. Additionally, an optional 1×1 pointwise convolution layer can be implemented to achieve customized output channels.
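Analogously, the following is a sketch of the downsampling block in Eq. (2), where the shortcut bypasses only the channel mixer; the token mixer here would be the copy convolution of Sec. 3.3, which doubles the channels while halving the resolution. The names and the mlp_ratio default are illustrative assumptions.

import torch.nn as nn

class DownsampleBlock(nn.Module):
    def __init__(self, dim, token_mixer, out_dim=None, mlp_ratio=2):
        super().__init__()
        self.token_mixer = token_mixer           # e.g. CopyConv(dim): C -> 2C, H x W -> H/2 x W/2
        self.norm = nn.BatchNorm2d(2 * dim)
        hidden = int(2 * dim * mlp_ratio)
        self.channel_mixer = nn.Sequential(
            nn.Conv2d(2 * dim, hidden, 1),
            nn.GELU(),
            nn.Conv2d(hidden, 2 * dim, 1),
        )
        # Optional 1x1 projection to a customized number of output channels.
        self.proj = nn.Identity() if out_dim in (None, 2 * dim) else nn.Conv2d(2 * dim, out_dim, 1)

    def forward(self, x):
        x_hat = self.norm(self.token_mixer(x))                 # first line of Eq. (2)
        return self.proj(x_hat + self.channel_mixer(x_hat))    # shortcut bypasses the channel mixer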

3.2 Chunk convolution

Algorithm 1 Chunk Convolution in a PyTorch-like style
from torch import cat, chunk
from torch.nn import Conv2d, Module, Sequential

class ChunkConv(Module):
    def __init__(self, in_channels):
        super().__init__()
        assert in_channels % 4 == 0
        hidden_channels = in_channels // 4
        # Reparameterized small- and medium-kernel branches (Eq. 6).
        self.s = RepDWConvS(hidden_channels)
        self.m = RepDWConvM(hidden_channels)
        # Large-kernel branch: two depthwise strip convolutions (1x11 then 11x1).
        self.l = Sequential(
            Conv2d(
                in_channels=hidden_channels,
                out_channels=hidden_channels,
                kernel_size=(1, 11),
                padding=(0, 5),
                groups=hidden_channels
            ),
            Conv2d(
                in_channels=hidden_channels,
                out_channels=hidden_channels,
                kernel_size=(11, 1),
                padding=(5, 0),
                groups=hidden_channels
            )
        )

    def forward(self, x):
        # Split into identity, small, medium, and large groups along channels.
        i, s, m, l = chunk(x, chunks=4, dim=1)
        bs = (i, self.s(s), self.m(m), self.l(l))
        return cat(bs, dim=1)
Table 1: Complexity of different types of convolution. The measurement is simplified by assuming consistent input and output channels and omitting the bias term. k, C, H and W denote kernel size, channel number, image height and width, respectively.
Convolution Parameters FLOPs
Standard k^2C^2 k^2C^2HW
Depthwise k^2C k^2CHW
Chunk (9+k^2+22)C/4 (9+k^2+22)CHW/4

Chunk convolution, as illustrated in Algorithm 1, represents a specialized form of the inception depthwise convolution [81] where each group possesses an equal number of channels for simplicity: 1. Identity mapping, preserving original information while reducing computation; 2. Reparameterized small-kernel depthwise convolution, capturing local features and accelerating processing; 3. Reparameterized medium-kernel depthwise convolution, expanding the ERF and leveraging the flexibility of SRP to emulate the central focusing feature of human eyes; 4. Equivalent large-kernel depthwise convolution, comprising two layers of strip convolutions, effectively capturing the global perspective while conserving computational resources. By incorporating this multi-scale strategy, our model aims to replicate the long-range modeling capabilities observed in ViTs while maintaining the locality and efficiency of CNNs. Specifically, the input X is evenly partitioned into four groups along the channel dimension,

X_i, X_s, X_m, X_l = Chunk(X),    (3)

where Chunk(·) splits the input X evenly along the channel dimension (X_i, X_s, X_m, X_l ∈ ℝ^{B×C/4×H×W}). Next, each input is fed into a different parallel branch,

Y_i = X_i,
Y_s = RepDWConvS_{k_s×k_s}(X_s),
Y_m = RepDWConvM_{k_m×k_m}(X_m),
Y_l = DWConv_{k_l×1}(DWConv_{1×k_l}(X_l)),    (4)

where k_s denotes the small square kernel size, which defaults to 3; k_m represents the medium square kernel size, with a default value of 7; and k_l refers to the strip kernel size, set to 11 by default. RepDWConvS and RepDWConvM stand for the reparameterized small- and medium-kernel depthwise convolutions, respectively. Ultimately, the outputs from each branch are concatenated along the channel dimension,

Y = Concat(Y_i, Y_s, Y_m, Y_l)    (5)

Specifically, RepDWConvS and RepDWConvM consist of multiple branches during training, as illustrated in Figure 2, which are consolidated into a single branch during inference. Additionally, inspired by the Peripheral Convolution [3] and Decomposed Manhattan Self-Attention [16], we have meticulously designed RepDWConvM with five different kernel patterns to emulate the central focusing property of human eyes.

Y_s = DWConv_{3×3}(X_s) + DWConv_{1×3}(X_s) + DWConv_{3×1}(X_s) + DWConv_{2×2,d=2}(X_s),
Y_m = DWConv_{7×7}(X_m) + DWConv_{3×5}(X_m) + DWConv_{5×3}(X_m) + DWConv_{7×1}(DWConv_{1×7}(X_m)) + DWConv_{5×1}(DWConv_{1×5}(X_m)).    (6)
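The parallel branches of Eq. (6) exist only during training. Below is a minimal sketch, not the authors' released implementation, of how such parallel depthwise branches can be folded into one kernel for inference; it covers a 3×3/1×3/3×1 subset of the small-kernel branch, while the dilated and sequentially stacked strip branches require analogous padding and composition steps.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RepDWConvS(nn.Module):
    # Training-time parallel branches (a 3x3 / 1x3 / 3x1 subset of Eq. 6).
    def __init__(self, channels, stride=1):
        super().__init__()
        self.dw3x3 = nn.Conv2d(channels, channels, 3, stride, padding=1, groups=channels)
        self.dw1x3 = nn.Conv2d(channels, channels, (1, 3), stride, padding=(0, 1), groups=channels)
        self.dw3x1 = nn.Conv2d(channels, channels, (3, 1), stride, padding=(1, 0), groups=channels)

    def forward(self, x):
        return self.dw3x3(x) + self.dw1x3(x) + self.dw3x1(x)

    @torch.no_grad()
    def fuse(self):
        # Zero-pad the strip kernels to 3x3, then sum weights and biases
        # into a single depthwise convolution for inference.
        w = self.dw3x3.weight + F.pad(self.dw1x3.weight, [0, 0, 1, 1]) \
                              + F.pad(self.dw3x1.weight, [1, 1, 0, 0])
        b = self.dw3x3.bias + self.dw1x3.bias + self.dw3x1.bias
        fused = nn.Conv2d(w.size(0), w.size(0), 3, self.dw3x3.stride,
                          padding=1, groups=w.size(0))
        fused.weight.copy_(w)
        fused.bias.copy_(b)
        return fused

Because convolution is linear, the fused kernel reproduces the three-branch output up to floating-point error, so the extra branches add representational capacity during training without any inference-time cost.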
Table 2: Classification performance on ImageNet-1K. Following [41, 73], latency is measured on an iPhone 12 with models compiled by Core ML Tools, reporting both the mean and standard deviation across 20 experimental trials. Similar to [18], throughput is tested on a Nvidia RTX3090 GPU with the maximum power-of-two batch size that fits in memory. “†” denotes the evaluation image size is 256.
Model Type Params (M) GMACs Latency ↓ (ms) Throughput ↑ (im/s) Top-1 (%)
MobileViG-Ti [51] CNN-GNN 5.2 0.7 1.27±0.02 4337 75.7
SwiftFormer-XS [59] Hybrid 3.5 0.6 1.00±0.04 4304 75.7
EfficientFormerV2-S0 [43] Hybrid 3.5 0.4 0.91±0.01 1274 75.7
FastViT-T8† [69] Hybrid 3.6 0.7 0.89±0.01 3909 76.7
RepViT-M0.9 [73] CONV 5.1 0.8 0.89±0.01 4817 78.7
EfficientFormerV2-S1 [43] Hybrid 6.1 0.7 1.06±0.01 1153 79.0
RepViT-M1.0 [73] CONV 6.8 1.1 1.02±0.01 3910 80.0
RepNeXt-M1 CONV 4.8 0.8 0.86±0.03 3885 78.8
RepNeXt-M2 CONV 6.5 1.1 1.00±0.04 3198 80.1
MobileViG-S [51] CNN-GNN 7.2 1.0 1.50±0.01 2985 78.2
SwiftFormer-S [59] Hybrid 6.1 1.0 1.16±0.04 3376 78.5
EfficientFormer-L1 [41] Hybrid 12.3 1.3 1.42±0.02 3360 79.2
FastViT-T12† [69] Hybrid 6.8 1.4 1.33±0.03 3182 80.3
RepViT-M1.1 [73] CONV 8.2 1.3 1.13±0.01 3604 80.7
RepNeXt-M3 CONV 7.8 1.3 1.11±0.04 2903 80.7
MobileViG-M [51] CNN-GNN 14.0 1.5 1.86±0.02 2491 80.6
FastViT-S12† [69] Hybrid 8.8 1.8 1.51±0.03 2313 80.9
SwiftFormer-L1 [59] Hybrid 12.1 1.6 1.62±0.02 2576 80.9
EfficientFormerV2-S2 [43] Hybrid 12.6 1.3 1.63±0.01 611 81.6
FastViT-SA12† [69] Hybrid 10.9 1.9 1.66±0.01 2181 81.9
RepViT-M1.5 [73] CONV 14.0 2.3 1.51±0.02 2151 82.3
RepNeXt-M4 CONV 13.3 2.3 1.48±0.04 1745 82.3
EfficientFormer-L3 [41] Hybrid 31.3 3.9 2.79±0.02 1422 82.4
MobileViG-B [51] CNN-GNN 26.7 2.8 2.87±0.04 1446 82.6
SwiftFormer-L3 [59] Hybrid 28.5 4.0 2.99±0.08 1474 83.0
EfficientFormer-L7 [41] Hybrid 82.1 10.2 6.80±0.02 619 83.3
EfficientFormerV2-L [43] Hybrid 26.1 2.6 2.75±0.01 399 83.3
RepViT-M2.3 [73] CONV 22.9 4.5 2.24±0.01 1184 83.3
FastViT-SA24† [69] Hybrid 20.6 3.8 2.78±0.01 1128 83.4
RepNeXt-M5 CONV 21.7 4.5 2.20±0.02 978 83.3

The inference complexity of three types of convolution is shown in Table 1. The computational cost of chunk convolution reflects the mixed nature of the operations performed within each branch. By distributing the operations, chunk convolution strikes a balance between computational complexity and representational capability.
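As a concrete check of the Table 1 formulas, the snippet below plugs in one illustrative configuration; the channel count and resolution are our own example values rather than settings from the paper.

# Branch costs for the default kernels: 0 (identity) + 9 (3x3) + 49 (7x7) + 22 (1x11 and 11x1).
C, H, W, k = 64, 56, 56, 7   # example channels/resolution; k compares a plain 7x7 depthwise conv

standard_params  = k**2 * C**2                # 200,704
depthwise_params = k**2 * C                   # 3,136
chunk_params     = (9 + k**2 + 22) * C // 4   # 1,280

depthwise_flops = k**2 * C * H * W                  # ~9.8M
chunk_flops     = (9 + k**2 + 22) * C * H * W // 4  # ~4.0M
print(standard_params, depthwise_params, chunk_params, depthwise_flops, chunk_flops)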

3.3 Copy convolution

Copy convolution, as shown in Algorithm 2, is a variation of the chunk convolution in which each group operates on the same input with a stride of 2 to reduce spatial dimensions. The distinction lies in the sequential stacking of strip convolutions, DWConv_{3×1}(DWConv_{1×3}(X_s)), rather than their parallel execution, DWConv_{3×1}(X_s) + DWConv_{1×3}(X_s).

Y_s = RepDWConvS_{k_s×k_s, s=2}(X_s),
Y_m = RepDWConvM_{k_m×k_m, s=2}(X_m),    (7)

similarly, the outputs from each branch are concatenated,

Y = Concat(Y_s, Y_m)    (8)

Additionally, a pointwise convolution can be utilized to adjust the channel dimension, providing greater flexibility.
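As a sanity check of the sequential strip stacking above, the following snippet (our own illustrative code, not part of the released model) verifies numerically that a depthwise 1×3 convolution followed by a 3×1 convolution matches a single 3×3 depthwise convolution whose per-channel kernel is the outer product of the two strips; biases are omitted for clarity. This equivalence is also why stacked strips can be folded into a single square kernel at inference time.

import torch
import torch.nn.functional as F

C = 8
x = torch.randn(1, C, 14, 14)
w_h = torch.randn(C, 1, 1, 3)   # depthwise 1x3 horizontal strips
w_v = torch.randn(C, 1, 3, 1)   # depthwise 3x1 vertical strips

# Sequential strip convolutions, as stacked in the copy convolution.
y_seq = F.conv2d(F.conv2d(x, w_h, padding=(0, 1), groups=C), w_v, padding=(1, 0), groups=C)

# Equivalent single 3x3 depthwise kernel: per-channel outer product of the strips.
w_fused = w_v * w_h             # (C, 1, 3, 1) * (C, 1, 1, 3) -> (C, 1, 3, 3)
y_fused = F.conv2d(x, w_fused, padding=1, groups=C)

print(torch.allclose(y_seq, y_fused, atol=1e-5))  # True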

Algorithm 2 Copy Convolution in a PyTorch-like style
from torch import cat
from torch.nn import Module

class CopyConv(Module):
    def __init__(self, in_channels):
        super().__init__()
        # Both reparameterized branches see the full input with stride 2,
        # so concatenation doubles the channel dimension.
        self.s = RepDWConvS(in_channels, stride=2)
        self.m = RepDWConvM(in_channels, stride=2)

    def forward(self, x):
        return cat((self.s(x), self.m(x)), dim=1)

4 Experiments

We demonstrate RepNeXt’s applicability and effectiveness by conducting experiments across different vision tasks: classification on ImageNet-1K [8], object detection and instance segmentation on MS-COCO 2017 [44], and semantic segmentation on ADE20K [85]. Following [41, 43, 70, 50, 73], we export the model using Core ML Tools and evaluate its latency on an iPhone 12 running iOS 16 utilizing the Xcode performance tool. Furthermore, we provide throughput analysis on a Nvidia RTX3090 GPU, adhering to the procedure in [73], where we measure the throughput using the maximum power-of-two batch size that fits in memory.
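For reference, the following is a hedged sketch of such a GPU throughput measurement; the batch size, iteration counts, and function name are illustrative choices rather than the exact benchmarking script.

import time
import torch

@torch.no_grad()
def throughput(model, batch_size=1024, image_size=224, iters=50, warmup=10):
    model = model.cuda().eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device="cuda")
    for _ in range(warmup):        # warm up kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * iters / (time.time() - start)   # images per second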

4.1 Image Classification

Implementation details.

We perform image classification experiments on ImageNet-1K, employing a standard image size of 224×224 for both training and testing. This dataset comprises approximately 1.3M training, 50k validation and 100k test images, distributed across 1000 categories. We train all models from scratch for 300 epochs using the same training recipe as in [69, 43, 73], except for the RepNeXt-M5 model, which used a weight decay of 0.03 instead of 0.025. To ensure fair comparisons, we utilize the RegNetY-16GF [55] model with a top-1 accuracy of 82.9% as the teacher model for distillation. Latency measurements are conducted on an iPhone 12 with models compiled by Core ML Tools under a batch size of 1 across 20 experimental trials. Following [69, 43], we report the performance with and without distillation in Tables 2 and 3, respectively.

Table 3: Results without distillation on ImageNet-1K, where “†” denotes the evaluation image size is 256.
Model Latency (ms) Params (M) Top-1 (%)
EfficientFormerV2-S0 [43] 0.91±0.01 3.5 73.7
FastViT-T8† [69] 0.89±0.01 3.6 75.6
MobileOne-S1 [70] 0.89±0.01 4.8 75.9
StarNet-S3 [48] 0.98±0.01 3.7 77.4
RepViT-M0.9 [73] 0.89±0.01 5.1 77.4
EfficientFormerV2-S1 [43] 1.06±0.01 6.1 77.9
RepViT-M1.0 [73] 1.02±0.01 6.8 78.6
RepNeXt-M1 0.86±0.03 4.8 77.5
RepNeXt-M2 1.00±0.04 6.5 78.9
MobileOne-S2 [70] 1.14±0.01 7.8 77.4
MobileOne-S3 [70] 1.31±0.01 10.1 78.1
StarNet-S4 [48] 1.11±0.01 7.5 78.4
FastViT-T12† [69] 1.33±0.03 6.8 79.1
RepViT-M1.1 [73] 1.13±0.01 8.2 79.4
RepNeXt-M3 1.11±0.04 7.8 79.4
MobileOne-S4 [70] 1.73±0.01 14.8 79.4
FastViT-S12† [69] 1.51±0.03 8.8 79.8
PoolFormer-S24 [80] 2.45±0.01 21.0 80.3
EfficientFormerV2-S2 [43] 1.63±0.01 12.6 80.4
FastViT-SA12† [69] 1.66±0.01 10.9 80.6
RepViT-M1.5 [73] 1.51±0.01 14.0 81.2
RepNeXt-M4 1.48±0.04 13.3 81.2
PoolFormer-S36 [80] 3.48±0.05 31.0 81.4
RepViT-M2.3 [73] 2.24±0.01 22.9 82.5
FastViT-SA24† [69] 2.78±0.01 20.6 82.6
RepNeXt-M5 2.20±0.02 21.7 82.4

Results with knowledge distillation.

As demonstrated in Table 2, RepNeXt achieves an optimal balance between accuracy and latency across various model sizes. With similar model sizes and latency, RepNeXt-M2 outperforms EfficientFormerV2-S1 by 1.1% top-1 accuracy and exhibits higher throughput. RepNeXt-M1 and RepNeXt-M2 consistently surpass RepViT-M0.9 and RepViT-M1.0 by 0.1% in top-1 accuracy while maintaining lower latency and fewer parameters. Larger models match the top-1 accuracy of their counterparts while benefiting from further parameter reduction. These results highlight the effectiveness and efficiency of our design, showing that a simple multi-scale CNN can outperform sophisticated architectures or complicated operators on mobile devices.

Results without knowledge distillation.

As depicted in Table 3, RepNeXt achieves Top-1 accuracy comparable or superior to RepViT without the use of knowledge distillation, demonstrating its strong standalone performance. Furthermore, RepNeXt strikes an optimal balance among accuracy, latency, and model size. For instance, RepNeXt-M1 achieves a Top-1 accuracy of 77.5%, with a latency of 0.86ms and a compact size of 4.8M parameters. Additionally, with a latency of 1.0ms, RepNeXt-M2 surpasses RepViT-M1.0 by 0.3% in accuracy while having 0.3M fewer parameters. In the case of larger models, RepNeXt-M3 delivers a 1.0% performance improvement over StarNet-S4 with an identical latency of 1.11ms. Meanwhile, RepNeXt-M4 matches the 81.2% accuracy of RepViT-M1.5, but with a 0.03ms speed advantage and a reduction of 0.7M parameters.

4.2 Downstream Tasks

Object Detection and Instance Segmentation.

We evaluate RepNeXt’s transfer ability on object detection and instance segmentation tasks. Following [43], we integrate RepNeXt into the Mask R-CNN framework [26] and conduct experiments on the MS-COCO 2017 dataset [44]. As shown in Table 4, RepNeXt consistently outperforms the competitors in terms of APbox and APmask while maintaining similar latency and model sizes. For instance, RepNeXt-M4 outperforms RepViT-M1.5 by 1.3 APbox and 0.5 APmask with a similar latency, and matches the APbox and APmask of SwiftFormer-L3 while operating twice as fast. RepNeXt-M5 achieves competitive APbox and APmask compared to RepViT-M2.3 and EfficientFormerV2-L, which are both initialized with weights pretrained for 450 epochs on ImageNet-1K. These results further demonstrate the advantages of large-kernel convolution in downstream tasks, as noted in [11], and highlight the efficacy of our multi-scale kernel design, which is equivalent to a grouped large-kernel depthwise convolution with additional inductive bias and efficiency trade-offs.

Table 4: Object detection and instance segmentation were evaluated using Mask R-CNN on MS-COCO 2017, while semantic segmentation results were obtained on ADE20K. Backbone latencies were measured on an iPhone 12 with 512×512 image crops using Core ML Tools. Models marked with “†” were initialized with weights pretrained for 450 epochs on ImageNet-1K.
Backbone Latency ↓ (ms) Object Detection Instance Segmentation Semantic
AP^box AP^box_50 AP^box_75 AP^mask AP^mask_50 AP^mask_75 mIoU
ResNet18 [25] 4.4 34.0 54.0 36.7 31.2 51.0 32.7 32.9
PoolFormer-S12 [80] 7.5 37.3 59.0 40.1 34.6 55.8 36.9 37.2
EfficientFormer-L1 [41] 5.4 37.9 60.3 41.0 35.4 57.3 37.3 38.9
FastViT-SA12 [69] 5.6 38.9 60.5 42.2 35.9 57.6 38.1 38.0
RepViT-M1.1 [73] 4.9 39.8 61.9 43.5 37.2 58.8 40.1 40.6
RepNeXt-M3 5.1 40.8 62.4 44.7 37.8 59.5 40.6 40.6
PoolFormer-S24 [80] 12.3 40.1 62.2 43.4 37.0 59.1 39.6 40.3
PVT-Small [74] 53.7 40.4 62.9 43.8 37.8 60.1 40.3 39.8
SwiftFormer-L1 [59] 8.4 41.2 63.2 44.8 38.1 60.2 40.7 41.4
EfficientFormer-L3 [41] 12.4 41.4 63.9 44.7 38.1 61.0 40.4 43.5
RepViT-M1.5 [73] 6.4 41.6 63.2 45.3 38.6 60.5 41.5 43.6
FastViT-SA24 [69] 9.3 42.0 63.5 45.8 38.0 60.5 40.5 41.0
RepNeXt-M4 6.6 42.9 64.4 47.2 39.1 61.7 41.7 43.3
SwiftFormer-L3 [59] 12.5 42.7 64.4 46.7 39.1 61.7 41.8 43.9
EfficientFormerV2-S2† [43] 12.0 43.4 65.4 47.5 39.5 62.4 42.2 42.4
FastViT-SA36 [69] 12.9 43.8 65.1 47.9 39.4 62.0 42.3 42.9
EfficientFormerV2-L† [43] 18.2 44.7 66.3 48.8 40.4 63.5 43.2 45.2
RepViT-M2.3† [73] 9.9 44.6 66.1 48.8 40.8 63.6 43.9 46.1
RepNeXt-M5 10.4 44.7 66.0 49.2 40.7 63.5 43.6 45.0

Semantic Segmentation.

We perform semantic segmentation experiments on the ADE20K dataset [85], which consists of approximately 20K training images and 2K validation images across 150 categories. We adhered to the training protocol from the previous works [41, 43] using the Semantic FPN framework [34]. As illustrated in Table 4, RepNeXt demonstrates favorable mIoU-latency trade-offs across various model sizes.

Table 5: Ablation study conducted under 120 epochs on the ImageNet-1K classification benchmark, using RepNeXt-M1 as the baseline. Metrics reported include Top-1 accuracy on the validation set, latency on an iPhone 12, and throughput on a RTX3090 GPU.
Ablation Variant Params (M) Latency (ms) Throughput (im/s) Top-1 (%)
Baseline None (RepNeXt-M1) 4.82 0.86±0.03 3885 75.34
Downsample simple 3×3 convolution 4.89 0.87±0.04 4078 74.45
Branch remove small kernel 4.81 0.85±0.03 4017 75.22
remove medium kernel 4.77 0.85±0.03 4341 75.25
remove large kernel 4.79 0.84±0.03 4332 75.22
Medium kernel add 5×5 and 3×3 kernels 4.82 0.86±0.03 3885 75.62
5×5 and 3×3 kernels → 5×3 and 3×5 kernels 4.82 0.86±0.03 3885 75.69
add sequential 1×7, 7×1 and 1×5, 5×1 kernels 4.82 0.86±0.03 3885 75.71
Small kernel add sequential 1×3, 3×1 and dilated 2×2 kernels 4.82 0.86±0.03 3885 75.84
sequential 1×3, 3×1 → parallel operations 4.82 0.86±0.03 3885 75.97
RepViT None (RepViT-M0.9) 5.07 0.89±0.01 4817 75.19
Downsample → RepNeXt’s Downsample 4.99 0.89±0.01 4731 75.32
ConvNeXt None (ConvNeXt-femto) 5.22 - 3636 72.37
Downsample → RepNeXt’s design 5.25 - 3544 74.28

4.3 Ablation Studies

We conduct ablation studies for 120 epochs on ImageNet-1K [8], using RepNeXt-M1 without SRP as the baseline, from the following aspects.

Downsampling layer. As illustrated in Table 5, the baseline model serves as the reference point with a Top-1 accuracy of 75.34%; it includes the full architecture with all kernel branches and the well-designed downsampling layers. Replacing the downsampling layers with simple 3×3 convolutions results in a noticeable drop in Top-1 accuracy to 74.45%, a decrease of 0.9% compared to the baseline. This change implies that the well-designed downsampling layers in the baseline architecture are crucial for maintaining higher accuracy. Additionally, substituting RepViT’s downsampling layer with our proposed modification slightly increases accuracy from 75.19% to 75.32% without affecting latency. Our change to ConvNeXt also improves accuracy by 1.9%.

Kernel branches. Table 5 shows that each kernel branch contributes incrementally to the overall Top-1 accuracy of the model. The removal of any single branch leads to a slight decrease in accuracy. For example: removing the small kernel branch leads to a minor reduction in Top-1 accuracy to 75.22%, excluding the medium kernel branch results in a Top-1 accuracy of 75.25%, and the Top-1 accuracy drops to 75.22% when the large kernel branch is eliminated. This highlights the collective contribution of multi-scale kernels in improving the model’s performance.

Structural reparameterization. Table 5 illustrates the substantial enhancement in Top-1 accuracy from incorporating the SRP mechanism. For the medium-kernel branch, adding 5×5 and 3×3 convolution kernels increases the accuracy from 75.34% to 75.62%. Subsequently, substituting these kernels with 5×3 and 3×5 kernels further boosts the performance to 75.69%. Moreover, integrating two sequentially stacked strip convolutions into the branch slightly increases the accuracy to 75.71%. These incremental advancements collectively highlight the effectiveness of our design, which is specifically engineered to emulate the human foveal vision system. Refining the small-kernel branch by introducing a dilated 2×2 convolution and a series of concatenated strip convolutions substantially lifts the accuracy to 75.84%. Additionally, the transition from serial to parallel branches for the strip convolutions further elevates accuracy to 75.97%, surpassing previous records [43, 59] achieved with knowledge distillation under 300 epochs, as detailed in Table 2. Overall, owing to its versatility and efficacy, SRP is becoming the default option for designing lightweight network architectures [69, 73].

4.4 CAM Analysis

We visualize class activation maps (CAM) using Grad-CAM [58] with the TorchCAM Toolbox [17]. As illustrated in Figure 3, RepNeXt can capture local features like RepViT while also enjoying a global view akin to FastViT.
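For readers reproducing these maps, the sketch below implements the core Grad-CAM computation with plain PyTorch hooks (the TorchCAM Toolbox wraps the same idea); the function name, layer choice, and arguments are illustrative assumptions rather than the exact visualization script.

import torch
import torch.nn.functional as F

def grad_cam(model, layer, image, class_idx):
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    logits = model(image)                                 # image: (1, 3, H, W)
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)     # global-average-pooled gradients
    cam = F.relu((weights * feats[0]).sum(dim=1))         # weighted sum over channels
    return cam / cam.max()                                # normalized (1, h, w) saliency map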

Figure 3: Grad-CAM on the MS-COCO validation dataset for RepViT-M2.3, SwiftFormer-L3, FastViT-SA24 and RepNeXt-M5. RepNeXt captures local details similar to RepViT while providing a global perspective comparable to FastViT.

5 Conclusions

In this paper, we introduced a multi-scale depthwise convolution integrated with both serial and parallel SRP mechanisms, enhancing feature diversity and expanding the network’s expressive capacity without compromising inference speed. Specifically, we designed a reparameterized medium-kernel convolution to imitate the human foveal vision system. Additionally, we proposed our light-weight, general-purpose RepNeXts, which employ the distribute-transform-aggregate design philosophy across inner-stage blocks as well as downsampling layers, achieving a comparable or superior accuracy-efficiency trade-off across various vision benchmarks, especially on downstream tasks. Moreover, our flexible multi-branch design functions as a grouped depthwise convolution with additional inductive bias and efficiency trade-offs. It can also be reparameterized into a single-branch large-kernel depthwise convolution, enabling potential optimization towards different accelerators.

For future enhancements, we plan to delve into optimizations for large-kernel designs, investigate SRP on channel mixers, extend RepNeXt to more vision tasks and other modalities, and scale up our models further. We hope our simple yet effective design will inspire further research towards light-weight models.

Limitations. One limitation of RepNeXt is its marginal improvements in accuracy, speed, and model size compared to the previous state-of-the-art (SOTA) model [73]. Additionally, it experiences a substantial increase in inference time when dealing with larger images due to large-kernel convolutions. We aim to address these shortcomings in future iterations.

References

  • Cai et al. [2022] Han Cai, Chuang Gan, and Song Han. Efficientvit: Enhanced linear attention for high-resolution low-computation visual recognition. arXiv preprint arXiv:2205.14756, 2022.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • Chen et al. [2024] Honghao Chen, Xiangxiang Chu, Yongjian Ren, Xin Zhao, and Kaiqi Huang. Pelk: Parameter-efficient large kernel convnets with peripheral convolution. arXiv preprint arXiv:2403.07589, 2024.
  • Chen et al. [2023a] Jierun Chen, Shiu-hong Kao, Hao He, Weipeng Zhuo, Song Wen, Chul-Ho Lee, and S-H Gary Chan. Run, don’t walk: Chasing higher flops for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12021–12031, 2023a.
  • Chen et al. [2022] Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5270–5279, 2022.
  • Chen et al. [2023b] Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Largekernel3d: Scaling up kernels in 3d sparse cnns. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13488–13498, 2023b.
  • Dai et al. [2021] Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. Advances in neural information processing systems, 34:3965–3977, 2021.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Ding et al. [2019] Xiaohan Ding, Yuchen Guo, Guiguang Ding, and Jungong Han. Acnet: Strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1911–1920, 2019.
  • Ding et al. [2021] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021.
  • Ding et al. [2022] Xiaohan Ding, Xiangyu Zhang, Jungong Han, and Guiguang Ding. Scaling up your kernels to 31x31: Revisiting large kernel design in cnns. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11963–11975, 2022.
  • Ding et al. [2023] Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, and Ying Shan. Unireplknet: A universal perception large-kernel convnet for audio, video, point cloud, time-series and image recognition. arXiv preprint arXiv:2311.15599, 2023.
  • Dong et al. [2022] Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, and Baining Guo. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12124–12134, 2022.
  • donghao and wang xue [2024] Luo donghao and wang xue. ModernTCN: A modern pure convolution structure for general time series analysis. In The Twelfth International Conference on Learning Representations, 2024.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fan et al. [2024] Qihang Fan, Huaibo Huang, Mingrui Chen, Hongmin Liu, and Ran He. Rmt: Retentive networks meet vision transformers. In CVPR, 2024.
  • Fernandez [2020] François-Guillaume Fernandez. Torchcam: class activation explorer. https://github.com/frgfm/torch-cam, 2020.
  • Graham et al. [2021] Benjamin Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Herve Jegou, and Matthijs Douze. Levit: A vision transformer in convnet’s clothing for faster inference. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12259–12269, 2021.
  • Guo et al. [2022] Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, and Shi-Min Hu. Segnext: Rethinking convolutional attention design for semantic segmentation. arXiv preprint arXiv:2209.08575, 2022.
  • Han et al. [2023] Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, and Gao Huang. Flatten transformer: Vision transformer using focused linear attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • Han et al. [2020] Kai Han, Yunhe Wang, Qi Tian, Jianyuan Guo, Chunjing Xu, and Chang Xu. Ghostnet: More features from cheap operations. In CVPR, 2020.
  • Han et al. [2016] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR), 2016.
  • Hassani et al. [2023] Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6185–6194, 2023.
  • Hatamizadeh et al. [2023] Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M Alvarez, Jan Kautz, and Pavlo Molchanov. Fastervit: Fast vision transformers with hierarchical attention. arXiv preprint arXiv:2306.06189, 2023.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • Hendrycks and Gimpel [2016] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
  • Howard et al. [2019] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.
  • Howard et al. [2017] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Hu et al. [2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
  • Huang et al. [2022] Tao Huang, Lang Huang, Shan You, Fei Wang, Chen Qian, and Chang Xu. Lightvit: Towards light-weight convolution-free vision transformers. arXiv preprint arXiv:2207.05557, 2022.
  • Iandola et al. [2016] Forrest N. Iandola, Song Han, Matthew W. Moskewicz, Khalid Ashraf, William J. Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pages 448–456. pmlr, 2015.
  • Kirillov et al. [2019] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6399–6408, 2019.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • Li et al. [2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Neural Information Processing Systems, 2018.
  • Li et al. [2024] Siyuan Li, Zedong Wang, Zicheng Liu, Cheng Tan, Haitao Lin, Di Wu, Zhiyuan Chen, Jiangbin Zheng, and Stan Z. Li. Moganet: Multi-order gated aggregation network. In International Conference on Learning Representations, 2024.
  • Li et al. [2019] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang. Selective kernel networks. In CVPR, 2019.
  • Li et al. [2021] Yunsheng Li, Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Lu Yuan, Zicheng Liu, Lei Zhang, and Nuno Vasconcelos. Micronet: Improving image recognition with extremely low flops. In International Conference on Computer Vision, 2021.
  • Li et al. [2022a] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In CVPR, 2022a.
  • Li et al. [2022b] Yanyu Li, Geng Yuan, Yang Wen, Ju Hu, Georgios Evangelidis, Sergey Tulyakov, Yanzhi Wang, and Jian Ren. Efficientformer: Vision transformers at mobilenet speed. Advances in Neural Information Processing Systems, 35:12934–12949, 2022b.
  • Li et al. [2023a] Yuxuan Li, Qibin Hou, Zhaohui Zheng, Ming-Ming Cheng, Jian Yang, and Xiang Li. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16794–16805, 2023a.
  • Li et al. [2023b] Yanyu Li, Ju Hu, Yang Wen, Georgios Evangelidis, Kamyar Salahi, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Rethinking vision transformers for mobilenet size and speed. In Proceedings of the IEEE international conference on computer vision, 2023b.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Liu et al. [2022a] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Xuxi Chen, Qiao Xiao, Boqian Wu, Tommi Kärkkäinen, Mykola Pechenizkiy, Decebal Mocanu, and Zhangyang Wang. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv preprint arXiv:2207.03620, 2022a.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Liu et al. [2022b] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022b.
  • Ma et al. [2024] Xu Ma, Xiyang Dai, Yue Bai, Yizhou Wang, and Yun Fu. Rewrite the stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  • Mehta and Rastegari [2021] Sachin Mehta and Mohammad Rastegari. Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. arXiv preprint arXiv:2110.02178, 2021.
  • Mehta and Rastegari [2022] Sachin Mehta and Mohammad Rastegari. Separable self-attention for mobile vision transformers. arXiv preprint arXiv:2206.02680, 2022.
  • Munir et al. [2023] Mustafa Munir, William Avery, and Radu Marculescu. Mobilevig: Graph-based sparse attention for mobile vision applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2210–2218, 2023.
  • Pan et al. [2022] Junting Pan, Adrian Bulat, Fuwen Tan, Xiatian Zhu, Lukasz Dudziak, Hongsheng Li, Georgios Tzimiropoulos, and Brais Martinez. Edgevits: Competing light-weight cnns on mobile devices with vision transformers. In European Conference on Computer Vision, pages 294–311. Springer, 2022.
  • Pan et al. [2023] Xuran Pan, Tianzhu Ye, Zhuofan Xia, Shiji Song, and Gao Huang. Slide-transformer: Hierarchical vision transformer with local self-attention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2082–2091, 2023.
  • Peng et al. [2017] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
  • Radosavovic et al. [2020] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 10428–10436, 2020.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Selvaraju et al. [2017] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In CVPR, pages 618–626, 2017.
  • Shaker et al. [2023] Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Swiftformer: Efficient additive attention for transformer-based real-time mobile vision applications. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
  • Szegedy et al. [2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence, 2017.
  • Tan and Le [2019a] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019a.
  • Tan and Le [2019b] Mingxing Tan and Quoc V Le. Mixconv: Mixed depthwise convolutional kernels. arXiv preprint arXiv:1907.09595, 2019b.
  • Tan et al. [2019] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2820–2828, 2019.
  • Tolstikhin et al. [2021] Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, et al. Mlp-mixer: An all-mlp architecture for vision. arXiv preprint arXiv:2105.01601, 2021.
  • Trockman and Kolter [2022] Asher Trockman and J Zico Kolter. Patches are all you need? arXiv preprint arXiv:2201.09792, 2022.
  • Vasu et al. [2023a] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. Fastvit: A fast hybrid vision transformer using structural reparameterization. arXiv preprint arXiv:2303.14189, 2023a.
  • Vasu et al. [2023b] Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel, and Anurag Ranjan. Mobileone: An improved one millisecond mobile backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7907–7917, 2023b.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wadekar and Chaurasia [2022] Shakti N Wadekar and Abhishek Chaurasia. Mobilevitv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features. arXiv preprint arXiv:2209.15159, 2022.
  • Wang et al. [2023] Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, and Guiguang Ding. Repvit: Revisiting mobile cnn from vit perspective. arXiv preprint arXiv:2307.09283, 2023.
  • Wang et al. [2021a] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021a.
  • Wang et al. [2021b] Zi Wang, Chengcheng Li, and Xiangyang Wang. Convolutional neural network pruning with structural redundancy reduction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14908–14917, 2021b.
  • Wen et al. [2016] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, 2016.
  • Wu et al. [2016] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4820–4828, 2016.
  • Xia et al. [2024] Chunlong Xia, Xinliang Wang, Feng Lv, Xin Hao, and Yifeng Shi. Vit-comer: Vision transformer with convolutional multi-scale feature interaction for dense predictions. arXiv preprint arXiv:2403.07392, 2024.
  • Xiao et al. [2021] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. Advances in neural information processing systems, 34:30392–30400, 2021.
  • Yu et al. [2022] Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10819–10829, 2022.
  • Yu et al. [2023] Weihao Yu, Pan Zhou, Shuicheng Yan, and Xinchao Wang. Inceptionnext: when inception meets convnext. arXiv preprint arXiv:2303.16900, 2023.
  • Yun and Ro [2024] Seokju Yun and Youngmin Ro. Shvit: Single-head vision transformer with memory efficient macro design. arXiv preprint arXiv:2401.16456, 2024.
  • Zhang et al. [2022] Haokui Zhang, Wenze Hu, and Xiaoyu Wang. Parc-net: Position aware circular convolution with merits from convnets and transformer. In European Conference on Computer Vision, 2022.
  • Zhang et al. [2018] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6848–6856, 2018.
  • Zhou et al. [2017] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 633–641, 2017.