M2Former: Multi-Scale Patch Selection for Fine-Grained Visual Recognition

Jiyong Moon¹ Junseok Lee¹ Yunju Lee¹ Seongsik Park¹
¹Dongguk University
{asdwldyd, 2021126776, 2022126715, s.park}@dgu.ac.kr corresponding author

Abstract

Recently, vision Transformers (ViTs) have been actively applied to fine-grained visual recognition (FGVR). ViT can effectively model the interdependencies between patch-divided object regions through an inherent self-attention mechanism. In addition, patch selection is used with ViT to remove redundant patch information and highlight the most discriminative object patches. However, existing ViT-based FGVR models are limited to single-scale processing, and their fixed receptive fields hinder representational richness and exacerbate vulnerability to scale variability. Therefore, we propose multi-scale patch selection (MSPS) to improve the multi-scale capabilities of existing ViT-based models. Specifically, MSPS selects salient patches of different scales at different stages of a multi-scale vision Transformer (MS-ViT). In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to model cross-scale interactions between selected multi-scale patches and fully reflect them in model decisions. Compared to previous single-scale patch selection (SSPS), our proposed MSPS encourages richer object representations based on feature hierarchy and consistently improves performance from small-sized to large-sized objects. As a result, we propose M2Former, which outperforms CNN-/ViT-based models on several widely used FGVR benchmarks.

1 Introduction

Refer to caption — Figure 1: Comparison between previous single-scale patch selection (SSPS) and our multi-scale patch selection (MSPS). (a) SSPS extracts salient image patches of the same size, and the limited receptive field causes suboptimal object representation and vulnerability to scale variance. (b) On the other hand, our MSPS extracts salient patches in multi-scale. This encourages rich representations of objects, from deep semantic information in large-sized patches to fine-grained details in small-sized patches. In addition, the flexibility of multi-scale patches is useful for handling extremely large or small objects through multiple receptive fields.

Despite recent rapid advances, fine-grained visual recognition (FGVR) is still one of the non-trivial tasks in computer vision community. Unlike conventional recognition tasks, FGVR aims to predict subordinate categories of a given object, e.g., subcategories of birds [65, 59], flowers [1, 46] and cars [43, 30]. It is a highly challenging task due to inherently subtle inter-class differences caused by similar subordinate categories and large intra-class variations caused by object pose, scale, or deformation.

The most common solution for FGVR is to decompose the target object into multiple local parts [3, 70, 82, 74, 87, 28]. Due to subtle differences between fine-grained categories mostly resides in the unique properties of object parts [26], decomposed local parts provide more discriminative clues of the target object. For example, a given bird object can be decomposed into its beak, wing, and head parts. At this time, Glaucous Winged Gull and California Gull can be distinguished by comparing their corresponding object parts. Early approaches of these part-based methods finds discriminative local parts using manual part annotations [3, 70, 82, 26]. However, curating manual annotations for all possible object parts is labor-intensive and carries the risk of human-error [10]. Therefore, research focus has consequently shifted to a weakly-supervised manner [74, 87, 28, 35, 85, 17]. They use additional tricks such as attention mechanisms [28, 85, 54] or region proposal networks (RPN) [35, 17, 29] to estimate local parts with only category-level labels. However, the part proposal process greatly increases the overall computational cost. Additionally, they tend not to deeply consider the interactions between estimated local parts that are essential for accurate recognition [20].

Recently, vision Transformers (ViTs) [9] are being actively applied to FGVR [20, 88, 25, 61, 73, 31, 72, 42, 78]. Relying exclusively on the Transformer [58] architecture, ViT has shown competitive image classification performance at large-scale. Similar to token sequences in NLP, ViT embeds the input images into fixed-size image patches, and the patches pass through multiple Transformer encoder blocks. ViT’s patch-by-patch processing is highly suitable for FGVR because each image patch can be considered as a local part. This means that the cumbersome part proposal process is no longer necessary. Additionally, the self-attention mechanism [58] inherent in each encoder block facilitates modeling of global interactions between patch-divided local parts. ViT-based FGVR models use patch selection to further boost the performance [20, 88, 25, 61]. Because ViT deals with all patch-divided image regions equally, many irrelevant patches may lead to inaccurate recognition. Similar to part proposals, patch selection selects the most salient patches from a set of generated image patches based on the computed importance ranking, i.e., accumulated attention weights [51, 25]. As a result, redundant patch information is filtered out and only selected salient patches are considered for final decision.

However, the existing ViT-based FGVR methods suffer from their single-scale limitations. ViT uses fixed-size image patches throughout the entire network, which ensures that the receptive field remains the same across all layers, and it prevents the ViT from obtaining multi-scale feature representations [14, 63, 38]. On the other hand, convolutional neural networks (CNNs) are suitable for multi-scale feature representations thanks to their staged architecture, where feature resolution decreases as layer depth increases [21, 49, 33, 39, 76]. In early stages, spatial details of an object are encoded on high-resolution feature maps, and as the stages deepens, the receptive field expands with decreasing feature resolution, and higher-order semantic patterns are encoded into low-resolution feature maps. Multi-scale features are important for most vision tasks, especially pixel-level dense predictions tasks, e.g., object detection [48, 84, 83], and segmentation [50, 6, 79]. In the same context, single-scale processing can cause two failure cases in FGVR, which lead to suboptimal recognition performance. (i) First, it is vulnerable to scale changes of fine-grained objects [38, 33, 48]. Fixed patch size may be insufficient to capture very subtle features of small-scale objects due to too coarse patches, and conversely, discriminative features may be over-decomposed for large-scale objects due to too finely split patches. (ii) Second, single-scale processing limits representational richness for objects [14, 84]. Compared to CNN that explores rich feature hierarchies from multi-scale features, ViT considers only monotonic single-scale features due to its fixed receptive field.

In this paper, we improve existing ViT-based FGVR methods by enhancing multi-scale capabilities. One simple solution is to use the recent multi-scale vision Transformers (MS-ViT) [68, 14, 32, 63, 38, 37, 89, 56, 19]. In fact, we can achieve satisfactory results simply by using MS-ViT. However, we further boost the performance by adapting patch selection to MS-ViT. Specifically, we propose a multi-scale patch selection (MSPS) that extends the previous single-scale patch selection (SSPS) [20, 88, 25, 61] to multi-scale. MSPS select salient patches of different scales from different stages of the MS-ViT backbone. As shown in Fig. 1, multi-scale salient patches selected through MSPS include both large-scale patches that capture object semantics and small-scale patches that capture fine-grained details. Compared to single-scale patches in SSPS, feature hierarchies in multi-scale patches provide richer representations of objects, which leads to better recognition performance. In addition, the flexibility of multi-scale patches is useful for handling extremely large/small objects through multiple receptive fields.

However, we argue that patch selection alone cannot fully explain the object, and consideration is required of how to model interactions between selected patches and effectively reflect them in the final decision. It is more complicated than the case considering only single-scale patches. Therefore, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to effectively deal with selected multi-scale patches. First, CTT aggregates the multi-scale patch information by transferring the global CLS token to each stage. Each stage-specific patch information is shared through transferred global CLS tokens, which generate richer network-level representations. In addition, we propose MSCA to model direct interactions between selected multi-scale patches. In the MSCA block, cross-scale interactions in both spatial and channel dimensions are computed for selected patches of all stages. Finally, our multi-scale vision Transformer with multi-scale patch selection (M2Former) obtains improved FGVR performance over other ViT-based SSPS models, as well as CNN-based models.

Our main contributions can be summarized as follows:

•

We propose MSPS that further boosts the multi-scale capabilities of MS-ViT. Compared to SSPS, MSPS generates richer representations of fine-grained objects with feature hierarchies, and obtains flexibility for scale changes with multiple receptive fields.
•

We propose CTT that effectively shares the selected multi-scale patch information. Stage-specific patch information is shared through transferred global CLS tokens to generate enhanced network-level representations.
•

We design multi-scale cross-attention (MSCA) block to capture the direct interactions of selected multi-scale patches. In the MSCA block, the spatial-/channel-wise cross-scale interdependencies can be captured.
•

Extensive experimental results on widely used FGVR benchmarks show the superiority of our M2Former over conventional methods. In short, our M2Former achieves an accuracy of 92.4% on Caltech-UCSD Birds (CUB) [65], and 91.1% on NABirds [57].

2 Related Work

2.1 Part-based FGVR

A number of methods have been proposed to classify subordinate object categories [3, 70, 82, 26, 35, 85, 17, 15, 10, 5, 4, 90, 86, 66, 44]. Among them, part-based methods decompose target objects into multiple local parts to capture more discriminative features [3, 70, 82, 26, 35, 85, 15]. Encoded local representations can be used either in conjunction with image-level representations or by themselves for recognition. This entails the use of a detection branch to generate discriminative part proposals from the input image either before or in parallel with the classification layer. Early works leverage manual part annotations for a fully-supervised train of detection branches [3, 70, 82, 26]. However, curating large-scale part annotations is labor-intensive and highly susceptible to human error. Thus, recent part-based methods localize informative regions in a weakly-supervised way using only category-level labels [28, 53, 7, 2, 27, 35, 85, 17, 15, 29, 54]. TASN [87] introduces trilinear attention sampling to generate detail-preserved views based on feature channels. P2P-Net [74] and DF-GMM [64] use a Feature Pyramid Network (FPN) [33] to generate local part proposals from convolutional feature maps. RA-CNN [15] iteratively zoom in local discriminative regions and reinforce multi-scale feature learning with inter-scale ranking loss. InSP [80] extracts potential part regions from the attention maps of two different images and applies content swapping to learn fine-grained local structures. Those part-based methods are good at locating discriminative parts, but they are ambiguous at capturing interactions between estimated local parts. Additionally, explicitly generating region proposals incurs a large computational burden, and performance can degrade significantly if the generated proposals are inaccurate. Thus, we focus on ViT-based FGVR methods that can avoid cumbersome region proposals.

2.2 ViT-based FGVR

Transformer [58] is originally introduced in the natural language processing (NLP) community, and recently shown great potential in the field of computer vision. Specifically, vision Transformer (ViT) [9] has recorded remarkable performance in image classification on large-scale datasets relying entirely on a pure Transformer encoder architecture. ViT splits images into non-overlapping patches and then applies multiple Transformer encoder blocks, consisting of multi-head self-attention module (MSA) and feed-forward networks (FFN). Recently, ViTs are being actively applied to FGVR [20, 88, 25, 61, 73, 31, 72, 42, 78]. ViT is highly suitable for FGVR in that (i) patch-by-patch processing can effectively replace part proposal generation, and (ii) it is easy to model global interactions between patch-divided local parts through the inherent self-attention mechanism. In most cases, patch selection, which selects discriminative patches from a initial patch sequence, is used with ViT-based models. TransFG [20] selects the most discriminative patches based on the attention score at the last encoder block to consider only the important image regions. Similarly, DCAL [88] conducts patch selection based on attention scores to highlight the interaction of high-response image regions. FFVT [61] extends this selection process to all layers of ViT to utilize layer-wise patch information. RAMS-Trans [25] enhances local representations by recurrently zooming-in salient object regions obtained by patch selection. For the same purpose, we use patch selection to strengthen the decisiveness of the network. However, we extend the existing patch selection from single-scale to multi-scale to improve representational richness and obtain flexibility for scale changes.

2.3 Multi-Scale Processing

Multi-scale features are important for most vision tasks, e.g., object detection [49, 33, 48, 84, 83], semantic segmentation [50, 6, 79], edge detection [36, 71], and image classification [16, 77, 21], because visual patterns occur at multi-scales in natural scenes. CNNs naturally learn coarse-to-fine multi-scale features through a stack of convolutional operators [52, 22], so most research has been proposed to enhance the multi-scale capabilities of CNNs [21, 49, 33, 39, 76]. FPN [33] introduces a feature pyramid to extract features with different scales from a single image, and fuses them in a top-down way to generate a semantically strong multi-scale feature representation. SPPNet [21] proposes spatial pyramid pooling to improve backbone networks to model multi-scale features of arbitrary size. Faster R-CNN [48] proposes region proposal networks that generate object bounding boxes of different scales. FCN [39] proposes a fully convolutional architecture that generates dense prediction maps for semantic segmentation from hierarchical representations of CNNs.

On the other hand, ViT suffers from being unsuitable to handle multi-scale features due to fixed scale image patches. After the initial patch embedding layer, ViT maintains image patches of the same size, and these single-scale feature maps are not suitable for many vision tasks, especially those requiring pixel-level dense prediction. To alleviate this issue, model architectures that adapt multi-scale feature hierarchies to ViT have recently been proposed. CvT [68] adapts convolutional operations to patch embedding layers and attention projection layers to enable feature hierarchies with improved efficiency. MViT [14] introduces attention pooling that controls feature resolution by adjusting the pooling stride of queries to implement multi-scale feature hierarchies. PVT [63] proposes a progressive shrinking strategy that conducts patch embedding with different patch sizes at the beginning of each stage. SwinT [38] generates multi-resolution feature maps by merging adjacent local patches using a patch merging layer. We also focus on feature hierarchies of fine-grained objects using multi-scale vision Transformers. However, we propose additional methods to further boost the multi-scale capability, i.e., MSPS, CTT, and MSCA.

3 Our Method

The overall framework of our method is presented in Fig 2. First, we use multi-scale vision Transformers (MS-ViT) as our backbone network (Section 3.1). After that, multi-scale patch selection (MSPS) is equipped on different stages of MS-ViT to extract multi-scale salient patches (Section 3.2). Class token transfer (CTT) aggregates multi-scale patch information by transferring the global CLS token to each stage. Multi-scale cross-attention (MSCA) blocks are used to model spatial-/channel-wise interactions of selected multi-scale patches (Section 3.4). Finally, we use additional training strategies for better optimization (Section 3.5). More details are described as follows.

3.1 Multi-Scale Vision Transformer

To enhance the multi-scale capability, we use MS-ViT as our backbone network, specifically the recent Multiscale Vision Transformer (MViT) [14, 32]. MViT constructs four-stage pyramid structure for low-level to high-level visual modeling instead of single-scale processing. To produce a hierarchical representation, MViT introduces Pooling Attention (PA), which pools query tensors to control the downsampling factor. We refer the interested reader to the original work [14, 32] for details.

Let $X_{0}\in{\mathbb{R}^{h_{0}\times{w_{0}}\times{c_{0}}}}$ denote the input image where $h_{0}$ , $w_{0}$ , and $c_{0}$ refer to the height, width, and the number of channels, respectively. $X_{0}$ first goes through a patch embedding layer to produce initial feature maps with a patch size of $4\times{4}$ . As the stage deepens, the resolution of the feature maps decreases and the channel dimension increases proportionally. As a result, at each stage $i\in{\{1,2,3,4\}}$ , we can extract the feature maps $X_{i}\in\{X_{1},X_{2},X_{3},X_{4}\}$ with resolutions ${h_{i}}\times{w_{i}}\in\{{\frac{h_{0}}{4}\times\frac{w_{0}}{4}},{\frac{h_{0}}{8}\times\frac{w_{0}}{8}},{\frac{h_{0}}{16}\times\frac{w_{0}}{16}},{\frac{h_{0}}{32}\times\frac{w_{0}}{32}}\}$ and channel dimensions $c_{i}\in\{96,192,384,768\}$ . We can also flatten $X_{i}$ into 1D patch sequence as $X_{i}\in{\mathbb{R}^{l_{i}\times{c_{i}}}}$ , where $l_{i}=h_{i}\times{w_{i}}$ . In fact, after patch embedding, we attach a trainable class token (CLS token) to patch sequence, and all patches $\widetilde{X}_{i}\in{\mathbb{R}^{\widetilde{l}_{i}\times{c_{i}}}}$ are fed into consecutive encoder blocks, where $\widetilde{l}_{i}=l_{i}+1$ . After the last block, the CLS token is detached from the patch sequence and used for class prediction through a linear classifier.

3.2 Multi-Scale Patch Selection

Single-scale patch selection (SSPS) has limited representations due to its fixed receptive field. Therefore, We propose multi-scale patch selection (MSPS) that extends SSPS to multi-scale. With multiple receptive fields, our proposed MSPS encourages rich representations of objects from deep semantic information to fine-grained details when compared to SSPS. We design MSPS based on the MViT backbone. Specifically, we select salient patches from the intermediate feature maps produced at each stage of MViT.

Given patch sequence $\widetilde{X}_{i}\in{\mathbb{R}^{\widetilde{l}_{i}\times{c_{i}}}}$ , we start by detaching the CLS token and reshaping it into 2D feature maps to $X_{i}\in{\mathbb{R}^{h_{i}\times{w_{i}}\times{c_{i}}}}$ . And then, we group $r\times{r}$ neighboring patches, reshaping $X_{i}$ into $\hat{X}_{i}\in{\mathbb{R}^{r^{2}\times\frac{{h_{i}w_{i}}}{r^{2}}\times{c_{i}}}}$ . This means $\frac{{h_{i}w_{i}}}{r^{2}}$ neighboring patch groups are generated. Afterwards, we apply a per-group average to merge patch groups, producing $\hat{X}_{i}\in{\mathbb{R}^{\hat{l}_{i}\times{c_{i}}}}$ , where $\hat{l}_{i}=\frac{{h_{i}w_{i}}}{r^{2}}$ . We set $r=2$ to merge patches within a $2\times 2$ local region. This merging process removes the redundancies of neighboring patches, which forces MSPS to search for salient patches in wider areas of the image.

Now, we produce a score map $S_{i}\in{\mathbb{R}^{\hat{l}_{i}}}$ using a pre-defined scoring function $\mathcal{S}\left(\cdot\right)$ . Then, patches with top- $k$ scores are selected from $\hat{X}_{i}$ ,

P_{i}=\text{MSPS}\left(\hat{X}_{i};S_{i},k_{i}\right),

(1)

where $P_{i}\in{\mathbb{R}^{{k_{i}}\times{c_{i}}}}$ . We set $k$ differently for each stage to consider hierarchical representations. Since the high-resolution feature maps of the lower stage capture the detailed shape of the object with a small patch size, we set $k$ to be large so that enough patches are selected to sufficiently represent the details of the object. On the other hand, low-resolution feature maps of the higher stage capture the semantic information of objects with a large patch size, so small $k$ is sufficient to represent the overall semantics.

For patch selection, we have to decide how to define the scoring function $\mathcal{S}$ . Attention roll-out [51] has been mainly used as a scoring function for SSPS [20, 88]. Attention roll-out aggregates the attention weights of the Transformer blocks through successive matrix multiplications, and the patch selection module selects the most salient patches based on the aggregated attention weights. However, since we use MS-ViT as the backbone, we cannot use attention roll-out because the size of attention weights is different for each stage, even each block. Instead, we propose a simple scoring function based on mean activation, where the score for the $j$ -th patch of $\hat{X}_{i}$ is calculated by:

S_{i}^{j}=\mathcal{S}(\hat{X}_{i})=\frac{1}{c_{i}}\sum_{c=1}^{c_{i}}\hat{X}_{i}^{j}\left(c\right),

(2)

where $c$ is the channel index $c\in\{1,2,\dots,c_{i}\}$ . Mean activation measures how strongly the channels in each patch are activated on average. After computing the score map, our MSPS conducts patch selection based on it. This is implemented through top- $k$ and gather operations. We extract $k_{i}$ patch indices with the highest scores from the $S_{i}$ through the top- $k$ operation, and patches corresponding to the patch indices $I_{i}$ are selected from $\hat{X}_{i}$ ,

		$\displaystyle I_{i}=\text{topkIndex}(S_{i};k_{i}),$		(3)
		$\displaystyle P_{i}=\text{gather}(\hat{X}_{i};I_{i}),$		(3)

where $I_{i}\in{\mathbb{N}^{{k_{i}}}}$ , and $P_{i}\in{\mathbb{R}^{{k_{i}}\times{c_{i}}}}$ .

3.3 Class Token Transfer

Through MSPS, we can extract salient patches from each stage, $\mathbf{P}=\{P_{1},P_{2},P_{3},P_{4}\}$ . In Section 3.2, the CLS token is detached from the patch sequence before MSPS at each stage. The simplest way to reflect the selected multi-scale patches in the model decisions is to concatenate the detached CLS token $\texttt{CLS}_{i}$ with the $P_{i}$ again and feed it into a few additional ViT blocks, consisting of multi-head self-attention (MSA) and feed-forward networks (FFN):

		$\displaystyle\widetilde{P}_{i}=\text{concat}(P_{i},\texttt{CLS}_{i}),$		(4)
		$\displaystyle\widetilde{O}_{i}=\text{FFN}(\text{MSA}(\widetilde{P}_{i})),$		(4)

where $\widetilde{P}_{i}$ , $\widetilde{O}_{i}\in{\mathbb{R}^{\widetilde{k}_{i}\times{c_{i}}}}$ , and $\widetilde{k}_{i}=k_{i}+1$ . Finally, predictions for each stage are computed by extracting the $\texttt{CLS}_{i}$ from $\widetilde{O}_{i}$ and connecting the linear classifier. It should be noted that the $\texttt{CLS}_{i}$ is shared by all stages: the set of $\texttt{CLS}_{i}$ is derived from the global CLS token and it is detached with different dimensions at each stage. This means that the stage-specific multi-scale information is shared to some extent through $\texttt{CLS}_{i}$ . However, the current sharing method may cause inconsistency between stage features because the detached $\texttt{CLS}_{i}$ does not equally utilize the representational power of the network. For example, $\texttt{CLS}_{1}$ is detached right after stage-1 and it will always lag behind $\texttt{CLS}_{4}$ , which shares the same root as $\texttt{CLS}_{1}$ but utilizes representations of all stages.

To this end, we introduce a class token transfer (CTT) strategy that aggregates multi-scale information more effectively. The core idea is to use the CLS token transferred from the global CLS token $\texttt{CLS}_{g}$ rather than using the detached $\texttt{CLS}_{i}$ at each stage. It should be noted that $\texttt{CLS}_{g}$ is equal to $\texttt{CLS}_{4}$ , so $\texttt{CLS}_{g}\in{\mathbb{R}^{c_{4}}}$ . We transfer the $\texttt{CLS}_{g}$ according to the dimension of each stage through a projection layer consisting of two linear layers along with Batch Normalization (BN) and ReLU activation:

\overline{\texttt{CLS}}_{i}=W_{i}^{1}\left(\text{ReLU}\left(\text{BN}\left(W_{i}^{0}\texttt{CLS}_{g}\right)\right)\right),

(5)

where $W_{i}^{0}\in{\mathbb{R}^{2c_{i}\times{c_{4}}}}$ , $W_{i}^{1}\in{\mathbb{R}^{c_{i}\times{2c_{i}}}}$ are the weight matrices, and $\overline{\texttt{CLS}}_{i}$ is the transferred $\texttt{CLS}_{g}$ in stage $i\in\{1,2,3\}$ . Now (4) is reformulated as:

		$\displaystyle\widetilde{P}_{i}=\text{concat}(P_{i},\overline{\texttt{CLS}}_{i}),$		(6)
		$\displaystyle\widetilde{O}_{i}=\text{FFN}(\text{MSA}(\widetilde{P}_{i})).$		(6)

Compared to conventional approaches, CTT guarantees consistency between stage features as it uses CLS tokens utilized with the same representational power. Each stage encodes stage-specific patch information into a globally updated CLS token. CTT is similar to the top-down pathway [33, 84]: it combines high-level representations of objects with multi-scale representations of lower layers to generate richer network-level representations.

3.4 Multi-Scale Cross-Attention

Although CTT can aggregate multi-scale patch information from all stages, it cannot model direct interactions between multi-scale patches, which indicates how interrelated they are. Therefore, we propose multi-scale cross-attention (MSCA) to model the interactions between multi-scale patches.

MSCA takes $\widetilde{\mathbf{P}}=\{\widetilde{P}_{1},\widetilde{P}_{2},\widetilde{P}_{3},\widetilde{P}_{4}\}$ as input and models the interactions between selected multi-scale salient patches. Specifically, MSCA consists of channel cross-attention (CCA) and spatial cross-attention (SCA), so (6) is reformulated as:

\widetilde{\mathbf{O}}=\text{MSCA}(\widetilde{\mathbf{P}})=\text{SCA}(\text{CCA}(\widetilde{\mathbf{P}})),

(7)

where $\widetilde{\mathbf{O}}=\{\widetilde{O}_{1},\widetilde{O}_{2},\widetilde{O}_{3},\widetilde{O}_{4}\}$ .

3.4.1 Channel Cross-Attention

Exploring feature channels has been very important in many vision tasks because feature channels encode visual patterns that are strongly related to foreground objects [87, 7, 4, 47, 69]. Many studies have been proposed to enhance the representational power of a network by explicitly modeling the interdependencies between the feature channels [24, 67, 23, 62, 60]. In the same vein, we propose CCA to further enhance the representational richness of multi-scale patches by explicitly modeling their cross-scale channel interactions.

We illustrate CCA in Fig. 3 (a). First, we apply global average pooling (GAP) to $\widetilde{P}_{i}$ to obtain a global channel descriptor $D_{i}\in{\mathbb{R}^{c_{i}}}$ for each stage. The $c$ -th element of $D_{i}$ is calculated by:

D_{i}^{c}=\frac{1}{\widetilde{k}_{i}}\sum_{j=1}^{\widetilde{k}_{i}}{\widetilde{P}_{i}^{c}\left(j\right)},

(8)

where $j$ is the patch index $j\in\{1,2,\dots,\widetilde{k}_{i}\}$ . From the stage-specific channel descriptors, we compute the channel attention score as follows:

		$\displaystyle\mathbf{D}=\text{concat}\left(D_{1},D_{2},D_{3},D_{4}\right)\in{\mathbb{R}^{c}},$		(9)
		$\displaystyle\mathbf{C}=\text{sigmoid}\left(W^{c,1}\text{ReLU}\left(\text{BN}\left(W^{c,0}\mathbf{D}\right)\right)\right)\in{\mathbb{R}^{c}},$		(9)

where $c=\sum{c_{i}}$ , $W^{c,0}\in{\mathbb{R}^{{\frac{c}{2}}\times{c}}}$ , and $W^{c,1}\in{\mathbb{R}^{c\times\frac{c}{2}}}$ . We then split $\mathbf{C}$ back into $C_{i}\in{\mathbb{R}^{c_{i}}}$ and recalibrate the channels of $\widetilde{P}_{i}$ as follows:

\widetilde{Y}_{i}=\widetilde{P}_{i}\otimes{C_{i}}+\widetilde{P}_{i},

(10)

where $\otimes$ indicates element-wise multiplication. In (9), we compute the channel attention score by aggregating the channel descriptors of all multi-scale patches. It captures channel-dependencies in a cross-scale way and reflects them back to each stage-specific channel information.

3.4.2 Spatial Cross-Attention

In addition to channel-wise interactions, we can compute the spatial-wise interdependencies of selected multi-scale patches. To this end, we propose SCA, which is a multi-scale extension of MSA [9, 58].

We illustrate SCA in Fig. 3 (b). First, We compute query, key, value tensors $Q_{i}$ , $K_{i}$ , $V_{i}$ for every $\widetilde{Y}_{i}$ ,

Q_{i}=\widetilde{Y}_{i}W_{i}^{Q},{\quad}K_{i}=\widetilde{Y}_{i}W_{i}^{K},{\quad}V_{i}=\widetilde{Y}_{i}W_{i}^{V},

(11)

where $W_{i}^{Q}$ , $W_{i}^{K}$ , $W_{i}^{V}\in{\mathbb{R}^{c_{i}\times{d}}}$ , and $Q_{i}$ , $K_{i}$ , $V_{i}\in{\mathbb{R}^{\widetilde{k}_{i}\times{d}}}$ . After that, we concatenate the $K_{i}$ and $V_{i}$ of all stages to generate global key and value tensors $\mathbf{K}$ , $\mathbf{V}$ ,

		$\displaystyle\mathbf{K}=\text{concat}\left(K_{1},K_{2},K_{3},K_{4}\right)\in{\mathbb{R}^{\widetilde{k}\times{d}}},$		(12)
		$\displaystyle\mathbf{V}=\text{concat}\left(V_{1},V_{2},V_{3},V_{4}\right)\in{\mathbb{R}^{\widetilde{k}\times{d}}},$		(12)

where $\widetilde{k}=\sum{\widetilde{k}_{i}}$ . Now, we can compute self-attention for $Q_{i}$ , $\mathbf{K}$ , $\mathbf{V}$ , and single linear layer is used to restore the dimension,

		$\displaystyle\widetilde{A}_{i}=\text{Softmax}\left(Q_{i}\mathbf{K}^{T}/\sqrt{d}\right)\mathbf{V},$		(13)
		$\displaystyle\widetilde{O}_{i}=\widetilde{A}_{i}W_{i}^{s},$		(13)

where $\widetilde{A}_{i}\in{\mathbb{R}^{\widetilde{k}_{i}\times{d}}}$ , $W_{i}^{s}\in{\mathbb{R}^{d\times{c_{i}}}}$ , and $\widetilde{O}_{i}\in{\mathbb{R}^{\widetilde{k}_{i}\times{c_{i}}}}$ . SCA is also implemented in a multi-head manner [58]. For global key and value, SCA captures how strongly multi-scale patches interact spatially with each other. Specifically, SCA models how large-scale semantic patches decompose into more fine-grained views, and conversely, how small-scale fine-grained patches can be identified in more global views.

3.5 Training

After the MSCA block, we can extract $\overline{\texttt{CLS}}_{i}$ from $\widetilde{O}_{i}$ and compute the class prediction $y_{i}$ using a linear classifier. In addition, we can compute $y_{con}$ by concatenating all $\overline{\texttt{CLS}}_{i}$ tokens. For model training, we compare every $\mathbf{y}=\{y_{1},y_{2},y_{3},y_{4},y_{con}\}$ for the ground-truth label $\hat{y}$ ,

\mathcal{L}=\sum_{y\in\mathbf{y}}{\sum_{t=1}^{n}{-\hat{y}^{t}\log{y^{{t}}}}},

(14)

where $n$ is the total number of classes, and $t$ denotes the element index of the label. To improve model generalization and encourage diversity of representations from specific stages, we employ soft supervision using label smoothing [45, 74]. We modify the one-hot vector $\hat{y}$ as follows:

\hat{y}_{\alpha}[t]=\begin{cases}\alpha&t=\hat{t}\\ \frac{1-\alpha}{n}&t\neq\hat{t}\end{cases},

(15)

where $\hat{t}$ denotes index of the ground-truth class, and $\alpha$ denotes a smoothing factor $\alpha\in[0,1]$ . $\alpha$ controls the magnitude of the ground-truth class. As a result, the different predictions are supervised with different labels during training. We set $\alpha$ to increase in equal intervals by $0.1$ from $0.6$ to $1$ , so $y_{1}$ has the smallest $\alpha=0.6$ .

For inference, we conduct a final prediction considering all of $\mathbf{y}$ ,

y_{all}=\sum_{y\in\mathbf{y}}{y},

(16)

where the maximum entry in the $y_{all}$ corresponds to the class prediction.

4 Experiments

Table 1: Comparison with the state-of-the-art methods on CUB.

method	backbone	accuracy (%)
RA-CNN [15]	VGG19	85.3
NTS-Net [75]	ResNet50	87.5
Cross-X [41]	ResNet50	87.7
DBTNet [86]	ResNet101	88.1
DF-GMM [64]	ResNet50	88.8
PMG [10]	ResNet50	89.6
API-Net [90]	DenseNet161	90.0
P2P-Net [74]	ResNet50	90.2
TransFG [20]	ViT-B	91.7
RAMS-Trans [25]	ViT-B	91.3
FFVT [61]	ViT-B	91.6
DCAL [88]	R50-ViT-B	92.0
M2Former (ours)	MViTv2-B	92.4

Table 2: Comparison with the state-of-the-art methods on NABirds.

method	backbone	accuracy (%)
MaxEnt [13]	DenseNet161	83.0
Cross-X [41]	ResNet50	86.4
PAIRS [18]	ResNet50	87.9
API-Net [90]	DenseNet161	88.1
PMGv2 [11]	ResNet101	88.4
CS-Parts [12]	ResNet50	88.5
MGE-CNN [81]	ResNet101	88.6
FixSENet-154 [55]	SENet154	89.2
ViT [9]	ViT-B	89.9
TransFG [20]	ViT-B	90.8
M2Former (ours)	MViTv2-B	91.1

In this section, we evaluate our proposed M2Former on widely used fine-grained benchmarks. In addition, we conduct ablation studies and further analyses to validate the effectiveness of our method. More details are described as follows.

4.1 Datasets

We use two FGVR datasets to evaluate the proposed method: Caltech-UCSD Birds (CUB) [65], and NABirds [57]. The CUB dataset consists of 11,788 images and 200 bird species. All images are split into 5,994 for training and 5,794 for testing. The NABirds is a larger dataset, consisting of 48,562 images and 555 classes. All images are split into 23,929 for training and 24,633 for testing. Most of our experiments are conducted on CUB.

4.2 Implementation Details

We use MViTv2-B [32] pre-trained on ImageNet21K [8] as our backbone network. We add MSPS to every stage of MViTv2-B. After patch selection, selected patches pass through one MSCA block. We empirically set $\mathbf{k}=\{k_{1},k_{2},k_{3},k_{4}\}$ , the number of patches selected for each stage, to $\{6,18,54,162\}$ . Our training recipe follows the same as recent work [20]. The model is trained for a total of 10,000 iterations, and an SGD optimizer with momentum of 0.9 and weight decay of 0 is used. The batch size is set to 16, and the initial learning rate is set to 0.03. The learning rate has a cosine decay schedule [40]. For augmentations, raw images are first resized into $600\times 600$ followed by cropping into $448\times 448$ . We use random cropping for training, and center cropping for testing. Random horizontal flipping is adapted to training images. We implement the whole model with the PyTorch framework on three NVIDIA A5000 GPUs.

4.3 Main Results

We compare our M2Former with state-of-the-art FGVR methods including ViT-based and CNN-based models on each dataset.

4.3.1 Results on CUB

First, the evaluation results on CUB are presented in Table 1. As shown in Table 1, our M2Former obtained a top-1 accuracy of 92.4%, which significantly outperforms CNN-based methods. Especially, M2Former improves recent P2P-Net [74] by 2.2% higher. In addition, M2Former achieved higher accuracy compared to ViT-based models using SSPS. Specifically, M2Former outperforms TransFG [20], RAMS-Trans [25], FFVT [61], and DCAL [88] by 0.7%, 1.1%, 0.8%, and 0.4% higher respectively. These results indicate that our proposed MSPS encourages enhanced representations compared to SSPS.

4.3.2 Results on NABirds

The evaluation results on NABirds are presented in Table 2. Our proposed M2Former obtains the top-1 accuracy of 91.1% on NABirds. Compared to CNN-based models, our M2Former shows significantly improved performance. For example, M2Former improves PMGv2 [11] by 2.7% higher. In addition, our method outperforms other ViT-based models. For example, M2Former improves ViT [9] by 1.2% higher and TransFG [20] by 0.3% higher.

4.4 Ablation Study

We analyzed each component of the proposed M2Former through the ablation study. All experiments are conducted on CUB dataset.

Table 3: Effect of the proposed modules on CUB.

index	MSPS	CTT	CCA	SCA	accuracy (%)
(a)					91.6
(b)	✓				91.6
(c)	✓	✓			92.1
(d)	✓	✓	✓		92.2
(e)	✓	✓		✓	92.1
(f)	✓	✓	✓	✓	92.4

Table 4: Effect of the number of MSCA blocks on CUB.

num. blocks	accuracy (%)
1	92.4
2	92.1
3	92.2
4	92.2

Table 5: Effect of the number of selected patches on CUB.

$\mathbf{k}$	accuracy (%)
{6, 8, 10, 12}	92.2
{6, 12, 24, 48}	92.2
{6, 18, 54, 162}	92.4
{7, 28, 112, 448}	92.1

Table 6: Comparison of different backbone networks on CUB.

method	accuracy (%)
CvT-21 [68]	89.3
CvT-21 ${}_{\text{MSPS}}$	90.0
SwinT-B [38]	90.6
SwinT-B ${}_{\text{MSPS}}$	91.3

4.4.1 Ablation on Proposed Modules

We investigate the influence of the proposed modules (i.e., MSPS, CTT, CCA, SCA) included in the M2Former architecture. The results are presented in Table 3. The pure MViTv2-B baseline obtained top-1 accuracy of 91.6% on CUB (Table 3 (a)). When MSPS was added, we did not find any noticeable improvement (Table 3 (b)). This indicates that effectively incorporating selected patches into model decisions is more important than patch selection itself. For this purpose, CTT allows multi-scale patch information to be shared across the entire network through transferred global CLS tokens. Indeed, with CTT, we improve the baseline by 0.5% higher (Table 3 (c)). In addition, CCA and SCA capture more direct interactions between multi-scale patches, but using CCA or SCA alone was not effective for improvement (Table 3 (d) and (e)). On the other hand, when using both CCA and SCA (full MSCA block) we obtain the highest top-1 accuracy of 92.4%, which improves baseline by 0.8% higher (Table 3 (f)). This means that spatial-/channel-level interactions of multi-scale patches are strongly correlated and need to be considered simultaneously.

4.4.2 Number of MSCA Blocks

Table 4 shows the results of the ablation experiment on the number of our MSCA blocks. The results show that using a single MSCA block performs best and increasing the number of MSCA blocks no longer yields a meaningful improvement. This means that a single MSCA block is sufficient to model the interactions of multi-scale patches. Moreover, using one MSCA block is efficient as it only introduces small extra parameters.

4.4.3 Number of Selected Patches

Table 7: Comparison of different CTT methods on CUB.

method	accuracy (%)
global pool	91.5
simple attach	91.9
CTT w/ 1-MLP	92.1
CTT w/ 2-MLP	92.4

Table 5 shows the influences from the number of selected patches. We define it as $\mathbf{k}=\{k_{1},k_{2},k_{3},k_{4}\}$ , where $k_{i}$ is the number of patches selected through patch selection at stage $i$ . $k_{i}$ is set differently for each stage. As shown in Table 5, the larger the overall number of selected patches, the better the performance in general. However, after it grows to some extent, increasing selected patches is insignificant and caused unnecessary computations. We empirically found the optimal $\mathbf{k}=\{6,8,54,162\}$ .

4.4.4 Different Backbone Networks

In Table 6, we analyze whether our method provides consistent effects on different backbone networks. We use CvT-21 [68] and SwinT-B [38] as backbone architectures and compare them to their baseline. Both are initialized with pre-trained weights from ImageNet21K [8]. The proposed modules are added to all three stages for CvT-21 and all four stages for SwinT-B. We follow the training recipe from recent work [20]. It should be noted that CTT could not be applied to the SwinT-B backbone as it does not use a CLS token. In addition, SwinT-B uses an input resolution of $384\times 384$ : raw images are first resized into $510\times 510$ followed by cropping into $384\times 384$ . As shown in Table 6, Cvt-21 and SwinT-B baselines obtained top-1 accuracies of 89.3% and 90.6%, respectively. When we added the proposed modules, we could obtain top-1 accuracies of 90.0% and 91.3%, which are both 0.7% higher than their pure counterparts. This suggests that selecting multi-scale salient patches and modeling their interactions is important for fine-grained recognition, and our method is beneficial for that purpose.

4.5 Further Analysis

Table 8: Comparison of different MSPS stages on CUB.

method	stages for patch selection				accuracy (%)
method	stage1	stage2	stage3	stage4	ob_l	ob_m	ob_s	total
TransFG [20]	SSPS at last block				91.7	91.7	90.1	91.3
M2Former ${}_{\text{none}}$					92.3(+0.6)	91.8(+0.1)	90.5(+0.4)	91.6(+0.3)
M2Former₄				✓	91.9(+0.2)	91.7(+0.0)	90.6(+0.5)	91.5(+0.2)
M2Former_3,4			✓	✓	92.9(+1.2)	91.9(+0.2)	90.8(+0.7)	91.9(+0.6)
M2Former_2,3,4		✓	✓	✓	92.9(+1.2)	92.4(+0.7)	91.4(+1.3)	92.3(+1.0)
M2Former ${}_{\text{full}}$	✓	✓	✓	✓	92.9(+1.2)	92.6(+0.9)	91.4(+1.3)	92.4(+1.1)

In this section, we conduct additional experiments to further validate the effectiveness of our proposed methods. All experiments are conducted on CUB dataset.

4.5.1 Contributions from CTT

In Table 7, we examine the contributions from CTT. First, we start by not using the CLS token. In this setup, the patch sequence selected at each stage does not contain any CLS tokens. Instead, selected patches that pass through the MSCA block are projected as feature vectors with global average pooling (GAP), and the features are used for final prediction through a linear layer. This is noted as the ‘global pool’ in Table 7. As a result, the global pool obtains a top-1 accuracy of 91.5%, which lags behind the MViTv2-B baseline (91.6% in Table 3). We conjecture that this low accuracy is due to the lack of a shared intermediary (i.e., CLS token) to aggregate the information of the selected multi-scale patches, even when using the MSCA block.

Then, we now initialize the CLS token of the backbone, and the patch sequence selected in each stage is concatenated with the CLS token of the same dimension that was detached before patch selection (as in Section 3.3 (4)). This is simply re-attaching the CLS token that was detached before MSPS, so it is noted ‘simple attach’ in Table 7. As a result, the simple attach obtains an accuracy of 91.9%, which improve the global pool by 0.4% higher. We argue that this improvement comes from that simple attach can share information between multi-scale patches through the CLS tokens. Since the global CLS token is detached with different channel dimensions at each stage, stage-specific patch information can be shared through the global CLS token to generate richer representations.

As discussed in Section 3.3, we propose CTT to enhance multi-scale feature sharing. In this setup, each stage uses a ’transferred’ global CLS token rather than a simply ’detached’ one. This can be implemented using a single projection layer to match its dimensions (noted ‘CTT w/ 1-MLP’ in Table 7). As a result, it obtains an accuracy of 92.1%, which improve the simple attach by 0.2% higher. Compared to using a detached CLS token at an intermediate stage, transferring the globally updated CLS token is more effective in that it can inject local information of each stage to the object’s deep semantic information, utilizing the same representational power of the network. Finally, CTT performs best when using 2-layer MLP with non-linearity (noted ‘CTT w/ 2-MLP’ in Table 7).

4.5.2 Contributions from MSPS

In the earlier section, we pointed out that SSPS is suboptimal because it is difficult to deal with scale changes. We now show that our proposed MSPS can consistently improve performance on different object scales.

First, we need to classify a given set of objects according to their scale. Following COCO [34], we categorize all objects into large, medium, and small-sized objects according to their bounding box size. We compute the size of the bounding box using the given box coordinates and sort them in ascending order. Then, we compute the quartiles for sorted bounding box sizes. Finally, we classify an object as a small-sized (ob_s) object if its bounding box size is less than the first quartile, as a large-sized object (ob_l) if its bounding box size is greater than the third quartile, and as a medium-sized (ob_m) object otherwise. Specifically, 25% of objects are small, 50% are medium, and 25% are large. Example images for each category are shown in Fig. 4.

We train five model variants with different MSPS stages: M2Former ${}_{\text{none}}$ , M2Former₄, M2Former_3,4, M2Former_2,3,4, M2Former_full. It should be noted that M2Former ${}_{\text{none}}$ is exactly the same as the MViTv2-B baseline. For comparison, we trained TransFG [20], which conducts SSPS at the last encoder block.

The results are presented in Table 8. First, TransFG obtained a total top-1 accuracy of 91.3%, which obtained accuracies of 91.7% for ob_l, 91.7% for ob_m, and 90.1% for ob_s. The performance of ob_s seems to lag behind ob_l and ob_m. M2Former ${}_{\text{none}}$ obtains a total top-1 accuracy of 91.6%, which outperforms TransFG by 0.3% higher. This means that we can achieve satisfactory results simply by using MS-ViT. Additionally, the improvement was prominent in ob_l and ob_s (0.6% and 0.4% higher, respectively), indicating that multi-scale features are important in mitigating scale variability.

When we add MSPS to stage-4 (M2Former₄), we obtain a top-1 accuracy of 91.5%, which is 0.2% higher than the baseline. M2Former₄ is almost identical to SSPS as it selects only single-scale patches in a single stage. However, M2Former₄ obtained a lower total top-1 accuracy than M2Former ${}_{\text{none}}$ , and it generally results in lower accuracy at all scales. This is also in contrast to previous findings [20] where performance was improved when patch selection was conducted at the last block.

However, when adding the MSPS in stage-3 (M2Former_3,4), we obtained a top-1 accuracy of 91.9%, which outperforms the SSPS baseline by 0.6% higher. The improvements were found at all object scales, but were noticeable at ob_l (1.2% higher compared to the SSPS baseline). This means that salient patches from stage-3 (along with selected patches from stage-4) provide the definitive cue for large objects.

We can see this improvement even when adding MSPS in stage-2 (M2Former_2,3,4). Especially, adding MSPS in stage-2 leads to improvements for ob_m and ob_s. Compared to the SSPS baseline, it is 0.7% higher for ob_m, 1.3% higher for ob_s, and 1.0% higher for total accuracy. This indicates that the finer-grained object features extracted at stage-2 enhance representations for small/medium-sized objects.

Finally, when adding MSPS to stage-1 (M2Former_full), we obtained the highest total top-1 accuracy of 92.4%, with a slight further improvement in ob_m. In summary, MSPS from high to low stages models a feature hierarchy from deep semantic features to subtle fine-grained features, which consistently improves recognition accuracy for large to small-sized objects. As a result, MSPS encourages networks to generate richer representations of fine-grained objects and to be more flexible to scale changes.

4.6 Visualization

To further investigate the proposed method, We present the visualization results in Fig. 5 and Fig. 6.

Fig. 5 shows the selected patches at each stage by conducting MSPS for several images sampled from CUB. In each subfigure, the first column is the original image, and selected patches from stage-4 to stage-1 are marked with red rectangles. At higher stages, large-sized patches that capture several parts of the object are selected. Sufficiently large patches are appropriate for modeling the intra-image structure and overall semantics of a given object. On the other hand, at the lower stage, smaller patches are chosen to model subtle details. Especially, the smallest size patches selected in stage-1 capture the coarsest features such as object edges. As a result, MSPS enhances object representations at different levels for each stage.

Fig. 6 shows the cross-attention maps of MSCA for selected patches. For visualization, we sample some patches selected from stage-4. And then, we extract cross-attention maps between sampled patches and patches selected at different stages from SCA. In Fig. 6, the first column shows the sampled stage-4 patch as a red rectangle. The second to fourth columns show the attention maps between sampled patches and selected patches from stage-3 to stage-1 as orange rectangles. The brightness of a color indicates the strength of attention: brighter means stronger interactions. In addition to modeling channel interactions in CCA, SCA models spatial interactions between selected multi-scale patches. As a result, each selected patch can be correlated with other patches that exist in different locations at different scales.

5 Conclusion

In this paper, we propose a multi-scale patch selection (MSPS) for fine-grained visual recognition (FGVR). MSPS selects salient patches in multi-scale at each stage of a multi-scale vision Transformer (MS-ViT) based on mean activation. In addition, we introduce class token transfer (CTT) and multi-scale cross-attention (MSCA) to effectively deal with selected multi-scale patch information. CTT transfers the globally updated CLS token to each stage so that stage-specific patch information can be shared throughout the entire network. MSCA directly models the spatial-/channel-wise correlation between selected multi-scale patches. Compared to single-scale patch selection (SSPS), MSPS provides richer representations for fine-grained objects and flexibility for scale changes. As a result, our proposed M2Former obtains accuracies of 92.4%, and 91.1% on CUB and NABirds, respectively, which outperform CNN-based models and ViT-based SSPS models. Our ablation experiments and further analyses validate the effectiveness of our proposed methods.

References

[1] Anelia Angelova, Shenghuo Zhu, and Yuanqing Lin. Image segmentation for large-scale subcategory flower recognition. In 2013 IEEE Workshop on Applications of Computer Vision (WACV), pages 39–45. IEEE, 2013.
[2] Ardhendu Behera, Zachary Wharton, Pradeep RPG Hewage, and Asish Bera. Context-aware attentional pooling (cap) for fine-grained visual classification. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 929–937, 2021.
[3] Thomas Berg and Peter N Belhumeur. Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 955–962, 2013.
[4] Dongliang Chang, Yifeng Ding, Jiyang Xie, Ayan Kumar Bhunia, Xiaoxu Li, Zhanyu Ma, Ming Wu, Jun Guo, and Yi-Zhe Song. The devil is in the channels: Mutual-channel loss for fine-grained image classification. IEEE Transactions on Image Processing, 29:4683–4695, 2020.
[5] Dongliang Chang, Kaiyue Pang, Yixiao Zheng, Zhanyu Ma, Yi-Zhe Song, and Jun Guo. Your” flamingo” is my” bird”: fine-grained, or not. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11476–11485, 2021.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv preprint arXiv:1412.7062, 2014.
[7] Yue Chen, Yalong Bai, Wei Zhang, and Tao Mei. Destruction and construction learning for fine-grained image recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5157–5166, 2019.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] Ruoyi Du, Dongliang Chang, Ayan Kumar Bhunia, Jiyang Xie, Zhanyu Ma, Yi-Zhe Song, and Jun Guo. Fine-grained visual classification via progressive multi-granularity training of jigsaw patches. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX, pages 153–168. Springer, 2020.
[11] Ruoyi Du, Jiyang Xie, Zhanyu Ma, Dongliang Chang, Yi-Zhe Song, and Jun Guo. Progressive learning of category-consistent multi-granularity features for fine-grained visual classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):9521–9535, 2021.
[12] Abhimanyu Dubey, Otkrist Gupta, Pei Guo, Ramesh Raskar, Ryan Farrell, and Nikhil Naik. Pairwise confusion for fine-grained visual classification. In Proceedings of the European conference on computer vision (ECCV), pages 70–86, 2018.
[13] Abhimanyu Dubey, Otkrist Gupta, Ramesh Raskar, and Nikhil Naik. Maximum-entropy fine grained classification. Advances in neural information processing systems, 31, 2018.
[14] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6824–6835, 2021.
[15] Jianlong Fu, Heliang Zheng, and Tao Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4438–4446, 2017.
[16] Shang-Hua Gao, Ming-Ming Cheng, Kai Zhao, Xin-Yu Zhang, Ming-Hsuan Yang, and Philip Torr. Res2net: A new multi-scale backbone architecture. IEEE transactions on pattern analysis and machine intelligence, 43(2):652–662, 2019.
[17] Weifeng Ge, Xiangru Lin, and Yizhou Yu. Weakly supervised complementary parts models for fine-grained image classification from the bottom up. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3034–3043, 2019.
[18] Pei Guo and Ryan Farrell. Aligned to the object, not to the image: A unified pose-aligned representation for fine-grained recognition. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1876–1885. IEEE, 2019.
[19] Ali Hassani, Steven Walton, Jiachen Li, Shen Li, and Humphrey Shi. Neighborhood attention transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6185–6194, 2023.
[20] Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, and Changhu Wang. Transfg: A transformer architecture for fine-grained recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 852–860, 2022.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence, 37(9):1904–1916, 2015.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[23] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Andrea Vedaldi. Gather-excite: Exploiting feature context in convolutional neural networks. Advances in neural information processing systems, 31, 2018.
[24] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
[25] Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Yuan He, and Hui Xue. Rams-trans: Recurrent attention multi-scale transformer for fine-grained image recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pages 4239–4248, 2021.
[26] Shaoli Huang, Zhe Xu, Dacheng Tao, and Ya Zhang. Part-stacked cnn for fine-grained visual categorization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1173–1182, 2016.
[27] Zixuan Huang and Yin Li. Interpretable and accurate fine-grained recognition via region grouping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8662–8672, 2020.
[28] Ruyi Ji, Longyin Wen, Libo Zhang, Dawei Du, Yanjun Wu, Chen Zhao, Xianglong Liu, and Feiyue Huang. Attention convolutional binary neural tree for fine-grained visual categorization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10468–10477, 2020.
[29] Xiao Ke, Yuhang Cai, Baitao Chen, Hao Liu, and Wenzhong Guo. Granularity-aware distillation and structure modeling region proposal network for fine-grained image classification. Pattern Recognition, 137:109305, 2023.
[30] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[31] Hanting Li, Mingzhe Sui, Feng Zhao, Zhengjun Zha, and Feng Wu. Mvt: mask vision transformer for facial expression recognition in the wild. arXiv preprint arXiv:2106.04520, 2021.
[32] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Mvitv2: Improved multiscale vision transformers for classification and detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4804–4814, 2022.
[33] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
[34] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[35] Chuanbin Liu, Hongtao Xie, Zheng-Jun Zha, Lingfeng Ma, Lingyun Yu, and Yongdong Zhang. Filtration and distillation: Enhancing region attention for fine-grained visual categorization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11555–11562, 2020.
[36] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3000–3009, 2017.
[37] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
[38] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
[39] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
[40] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[41] Wei Luo, Xitong Yang, Xianjie Mo, Yuheng Lu, Larry S Davis, Jun Li, Jian Yang, and Ser-Nam Lim. Cross-x learning for fine-grained visual categorization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8242–8251, 2019.
[42] Fuyan Ma, Bin Sun, and Shutao Li. Facial expression recognition with visual transformers and attentional selective fusion. IEEE Transactions on Affective Computing, 2021.
[43] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[44] Shunan Mao, Yaowei Wang, Xiaoyu Wang, and Shiliang Zhang. Multi-proxy feature learning for robust fine-grained visual recognition. Pattern Recognition, page 109779, 2023.
[45] Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.
[46] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[47] Yuxin Peng, Xiangteng He, and Junjie Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, 2017.
[48] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[49] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015.
[50] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
[51] Sofia Serrano and Noah A Smith. Is attention interpretable? arXiv preprint arXiv:1906.03731, 2019.
[52] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[53] Guolei Sun, Hisham Cholakkal, Salman Khan, Fahad Khan, and Ling Shao. Fine-grained recognition: Accounting for subtle differences between similar classes. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 12047–12054, 2020.
[54] Hao Tang, Chengcheng Yuan, Zechao Li, and Jinhui Tang. Learning attention-guided pyramidal features for few-shot fine-grained recognition. Pattern Recognition, 130:108792, 2022.
[55] Hugo Touvron, Andrea Vedaldi, Matthijs Douze, and Hervé Jégou. Fixing the train-test resolution discrepancy. Advances in neural information processing systems, 32, 2019.
[56] Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. In European conference on computer vision, pages 459–479. Springer, 2022.
[57] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 595–604, 2015.
[58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[59] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[60] Haonan Wang, Peng Cao, Jiaqi Wang, and Osmar R Zaiane. Uctransnet: rethinking the skip connections in u-net from a channel-wise perspective with transformer. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 2441–2449, 2022.
[61] Jun Wang, Xiaohan Yu, and Yongsheng Gao. Feature fusion vision transformer for fine-grained visual categorization. arXiv preprint arXiv:2107.02341, 2021.
[62] Qilong Wang, Banggu Wu, Pengfei Zhu, Peihua Li, Wangmeng Zuo, and Qinghua Hu. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11534–11542, 2020.
[63] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF international conference on computer vision, pages 568–578, 2021.
[64] Zhihui Wang, Shijie Wang, Shuhui Yang, Haojie Li, Jianjun Li, and Zezhou Li. Weakly supervised fine-grained image classification via guassian mixture model oriented discriminative learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9749–9758, 2020.
[65] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-ucsd birds 200. 2010.
[66] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 499–515. Springer, 2016.
[67] Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
[68] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 22–31, 2021.
[69] Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 842–850, 2015.
[70] Lingxi Xie, Qi Tian, Richang Hong, Shuicheng Yan, and Bo Zhang. Hierarchical part matching for fine-grained visual categorization. In Proceedings of the IEEE international conference on computer vision, pages 1641–1648, 2013.
[71] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
[72] Fanglei Xue, Qiangchang Wang, and Guodong Guo. Transfer: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3601–3610, 2021.
[73] Fanglei Xue, Qiangchang Wang, Zichang Tan, Zhongsong Ma, and Guodong Guo. Vision transformer with attentive pooling for robust facial expression recognition. IEEE Transactions on Affective Computing, 2022.
[74] Xuhui Yang, Yaowei Wang, Ke Chen, Yong Xu, and Yonghong Tian. Fine-grained object classification via self-supervised pose alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7399–7408, June 2022.
[75] Ze Yang, Tiange Luo, Dong Wang, Zhiqiang Hu, Jun Gao, and Liwei Wang. Learning to navigate for fine-grained classification. In Proceedings of the European conference on computer vision (ECCV), pages 420–435, 2018.
[76] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[77] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 472–480, 2017.
[78] Xiaohan Yu, Jun Wang, Yang Zhao, and Yongsheng Gao. Mix-vit: Mixing attentive vision transformer for ultra-fine-grained visual categorization. Pattern Recognition, 135:109131, 2023.
[79] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 7151–7160, 2018.
[80] Lianbo Zhang, Shaoli Huang, and Wei Liu. Intra-class part swapping for fine-grained image classification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3209–3218, 2021.
[81] Lianbo Zhang, Shaoli Huang, Wei Liu, and Dacheng Tao. Learning a mixture of granularity-specific experts for fine-grained categorization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8331–8340, 2019.
[82] Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based r-cnns for fine-grained category detection. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13, pages 834–849. Springer, 2014.
[83] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4203–4212, 2018.
[84] Qijie Zhao, Tao Sheng, Yongtao Wang, Zhi Tang, Ying Chen, Ling Cai, and Haibin Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 9259–9266, 2019.
[85] Heliang Zheng, Jianlong Fu, Tao Mei, and Jiebo Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE international conference on computer vision, pages 5209–5217, 2017.
[86] Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, and Jiebo Luo. Learning deep bilinear transformation for fine-grained image representation. Advances in Neural Information Processing Systems, 32, 2019.
[87] Heliang Zheng, Jianlong Fu, Zheng-Jun Zha, and Jiebo Luo. Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[88] Haowei Zhu, Wenjing Ke, Dong Li, Ji Liu, Lu Tian, and Yi Shan. Dual cross-attention learning for fine-grained visual categorization and object re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4692–4702, 2022.
[89] Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, and Rynson WH Lau. Biformer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10323–10333, 2023.
[90] Peiqin Zhuang, Yali Wang, and Yu Qiao. Learning attentive pairwise interaction for fine-grained classification. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 13130–13137, 2020.