Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers
Abstract
Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computation costs. Nevertheless, errors caused by pruning strategies can lead to significant information loss. Our quantitative experiments reveal that the impact of pruned tokens on performance should be noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency. Firstly, TPS adopts pruning to get the reserved and pruned subsets. Secondly, TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-based fusing steps. Compared to state-of-the-art methods, our approach outperforms them under all token pruning intensities. Especially while shrinking DeiT-tiny&small computational budgets to 35%, it improves the accuracy by 1%-6% compared with baselines on ImageNet classification. The proposed method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, while analysis experiments prove our higher robustness to the errors of the token pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023.
1 Introduction
The transformer architecture has become popular for various natural language processing (NLP) tasks, and its improved variants have been adopted for many vision tasks. Vision transformers (ViTs) [5] leverage the long-range dependencies of self-attention mechanisms to achieve excellent performance, often surpassing that of CNNs. In addition to the vanilla ViT architecture, recent studies [17, 31, 33] have explored hybrid ViT designs incorporating convolution layers and multi-scale architectures. Despite their excellent performance, transformers still require relatively high computational budgets. This is due to the quadratic computation and memory costs associated with token length. To address this issue, contemporary approaches [25, 27, 36, 8, 16, 35, 14, 21] propose pruning redundant tokens. They trade acceptable performance degradation for a more cost-effective model. Knowledge distillation [11] and other techniques can further mitigate the resulting performance drop.

However, a steep performance drop is inevitable as token pruning becomes more aggressive, because both essential subject information and auxiliary context information are discarded, especially when the number of reserved tokens drops close to or below 10. Aggressive token pruning can lead to an incomplete subject and loss of background context, causing wrong predictions, as shown in Fig. 1. Specifically, the background tokens containing sod help recognize the input image as a lawn mower rather than a folding chair, while missing subject tokens make the baseball indistinguishable from a rugby ball. To regain adequate information from pruned tokens, EViT [16] and Evo-ViT [35] propose aggregating the pruned tokens into a single token, as shown in Fig. 2 (b). Still, they neglect the discrepancy among these tokens, which leads to feature collapse and hinders more aggressive token pruning.
Towards more aggressive pruning, we argue that the information in pruned tokens deserves better treatment. We conducted a toy experiment to see what accuracy token pruning could achieve if the pruning policy in the first pruned transformer block were reversed, as Fig. 3 shows. Taking DynamicViT [25] as a case study, the reversed policy still yields extra accuracy complementary to the original one (denoted as bonus accuracy). Moreover, this phenomenon becomes more significant as pruning grows more aggressive (red line in Fig. 3).
To conserve the information of the pruned tokens, we propose a joint Token Pruning & Squeezing (TPS) module to accommodate more aggressive compression of ViTs. The TPS module uses a feature dispatch mechanism that squeezes essential features from pruned tokens into reserved ones, as shown in Fig. 2 (c). Firstly, based on the scoring result, the TPS module divides the input tokens into two complementary subsets: the reserved set and the pruned set. Secondly, instead of discarding the tokens of the pruned set or collapsing them into a single token, we employ a unidirectional nearest-neighbor matching algorithm to dispatch each of them independently to its most similar reserved token, dubbed the host token. Subsequently, we apply a similarity-based fusing step to squeeze the features of the matched pruned tokens into the corresponding host tokens, while the non-selected reserved tokens remain identical. This design reduces context information loss while retaining a reasonable computation budget. We can easily achieve hardware-friendly constant-shape inference by fixing the cardinality of the reserved token set. Furthermore, we introduce two flexible variants, the inter-block dTPS and the intra-block eTPS, which are plug-and-play blocks for both vanilla ViTs and hybrid ViTs.
We conduct extensive experiments on two datasets, ImageNet1K [4] and the large fine-grained dataset iNaturalist 2019 [29], to demonstrate our efficiency, flexibility, and robustness. Firstly, experiments under different token pruning settings show the superior performance of our TPS under more aggressive compression compared with token pruning [25] and token reorganization [16]; further comparisons with state-of-the-art transformers [28, 13, 39, 31, 40, 20, 8, 36] show our promising efficiency. Secondly, we demonstrate the flexibility of our TPS by integrating it into popular ViTs, including both vanilla ViTs and hybrid ViTs. Finally, evaluations under a random token selection policy confirm the higher robustness of our TPS.
Overall, our contributions are summarized as follows:
• We propose the joint Token Pruning & Squeezing (TPS) module and its two variants, dTPS and eTPS, to conserve the information of discarded tokens and facilitate more aggressive compression of vision transformers.
• Extensive experiments demonstrate higher performance than prior approaches. Especially while compressing the GFLOPs of DeiT-small&tiny to 35%, our TPS outperforms the baselines with accuracy improvements of 1%-6%.
• Broad experiments applying our method to vanilla ViTs and hybrid ViTs show our flexibility, while analysis experiments prove that our TPS is more robust to token pruning policy errors than token pruning and token reorganization.

2 Related Work
Since the transformer [30] proved effective for NLP tasks, numerous studies have explored how to adapt the transformer architecture to computer vision tasks [5, 6, 7, 1, 3, 17, 10, 19, 22, 26, 37, 24, 32], including vanilla ViTs and hybrid ViTs.
Vanilla ViTs. Following the original ViT [5], a series of vision transformer variants inherit its central architecture and evolve from diverse perspectives; we call these the vanilla ViTs in this paper. DeiT [28] surpasses standard CNNs and ViT by introducing a distillation token to learn from a teacher network. LV-ViT [13] presents a new training objective called token labeling. T2T-ViT [38] recursively aggregates neighboring tokens into one token, while PS-ViT [39] introduces a progressive sampling module that selects informative tokens.
Hybrid ViTs. Besides, recent studies [31, 33, 15] incorporate convolutional layers and employ multi-scale architectures to lower the computation and memory costs; we call these the hybrid ViTs. Swin Transformer [17] modifies ViT with a multi-stage architecture and shifted-window self-attention. CvT [33] presents a hierarchical architecture facilitated by a convolutional token embedding layer. PVT [31] introduces a pyramid transformer architecture and develops spatial-reduction attention (SRA) to further reduce the cost.
Token Pruning. Considering the spatial redundancy of input images, many researchers aim to discard nonessential tokens with an acceptable performance drop. Tang et al. [27] propose to approximate the impact of patches and discard inattentive patches in a top-down paradigm. DynamicViT [25] and AdaViT [20] employ learnable heads to score tokens and discard less informative ones at a fixed pruning ratio. A-ViT [36] and ATS [8] go further by sampling an input-dependent number of tokens. However, mainstream deep learning frameworks do not provide strong support for inference with dynamic token lengths. The main disadvantage of token pruning models is the loss of pruned information, which causes an accuracy drop and limits more aggressive token pruning. To tackle this, Evo-ViT [35], EViT [16], and SPViT [14] preserve the background context by collapsing the pruned tokens into a single token, a scheme known as token reorganization. Token reorganization mitigates the information loss of pruned tokens, but a noticeable performance drop can still be observed, especially at higher token pruning ratios. Furthermore, auxiliary strategies have been proposed to facilitate token pruning: SPViT [14] employs a layer-to-phase progressive training strategy, while IA-RED2 [21] performs a hierarchical training scheme. Such complicated training schemes help improve performance but also introduce more hyper-parameters and optimization difficulties.
We investigate the drawbacks of current token pruning methods and propose a novel token reduction approach, joint Token Pruning & Squeezing, with higher efficiency, robustness, and flexibility, which only requires simple fine-tuning of pre-trained models.
3 Method



3.1 Motivation
To quantitatively verify the discarded information of pruned tokens, we conduct a toy experiment on DynamicViT [25], as Fig. 3 shows. Unsurprisingly, the performance of the pruned model declines as pruning becomes more aggressive. Nevertheless, by exchanging the reserved and pruned tokens (dubbed the reversed policy in Fig. 3), we find that the pruned tokens can still handle some cases correctly. Furthermore, the bonus accuracy, contributed by the cases that only the reversed policy predicts correctly, rises along with the token pruning intensity. This implies that the exclusive information in pruned tokens matters more as the token pruning intensity grows.
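For clarity, the bonus accuracy can be computed from the per-image predictions of the two policies. The following is a minimal sketch under assumed inputs (`pred_orig`, `pred_rev` are predicted labels of the original and reversed policies, `labels` the ground truth); it is an illustration of the metric, not the paper's evaluation script.

```python
import torch

def bonus_accuracy(pred_orig, pred_rev, labels):
    """Fraction of images that only the reversed pruning policy gets right.

    pred_orig, pred_rev, labels: (N,) integer class tensors.
    """
    correct_orig = pred_orig == labels
    correct_rev = pred_rev == labels
    # Cases solved exclusively by the reversed policy (the "bonus").
    bonus = correct_rev & ~correct_orig
    return bonus.float().mean().item()
```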
3.2 Token Pruning
In this section, we briefly review the basic procedure of token pruning. Note that our TPS is compatible with any token pruning technique. Here, we introduce two variants of TPS, dTPS and eTPS, to cover inter-block and intra-block token compression respectively, as shown in Fig. 4. For a fair comparison, they follow the pruning parts of the two typical baselines [25, 16].
As shown in Fig. 4, dTPS adopts the learnable token score prediction head from DynamicViT [25] and samples the binary decision mask via Straight-Through Gumbel-Softmax [12] for differentiability, whereas eTPS uses the class-token attention values to measure token importance, as in EViT [16]. In the inference stage of both variants, we devise the token selection policy with a Top-k operation on the token scores under a fixed token reduction ratio. Both variants keep a constant shape so as to benefit from inference optimization on the computation graph. The tokens are then separated into two subsets: the reserved tokens are placed in $\mathcal{A}^r$ and the pruned ones in $\mathcal{A}^p$. More implementation details can be found in our code.
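As an illustration of the selection step shared by both variants, the sketch below splits tokens into the two subsets with a Top-k operation over precomputed scores. The tensor names and the helper itself are ours for exposition and do not mirror the released code.

```python
import torch

def split_tokens(x, scores, keep_ratio):
    """Split tokens into reserved and pruned subsets by Top-k scores.

    x:      (B, N, C) token features (class token excluded)
    scores: (B, N)    per-token importance scores
    """
    B, N, C = x.shape
    num_keep = max(1, int(N * keep_ratio))
    # Indices of the highest-scoring (reserved) tokens, then the rest (pruned).
    order = scores.argsort(dim=1, descending=True)
    idx_r, idx_p = order[:, :num_keep], order[:, num_keep:]
    gather = lambda idx: torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    return gather(idx_r), gather(idx_p)   # (B, N_r, C), (B, N_p, C)
```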
3.3 Token Squeezing
After the reserved and pruned tokens are split, we introduce our token squeezing part. Considering that the reserved tokens contribute the majority of correct predictions, we aim to design a procedure that retains most of the attentive tokens while compressing the information of the rest, preserving the model's overall performance. To avoid generating extra tokens as in [16, 14], we inject the pruned tokens into similar reserved tokens. Specifically, we apply a unidirectional nearest-neighbor matching algorithm from $\mathcal{A}^p$ to $\mathcal{A}^r$ in a many-to-one manner, and then employ a similarity-based fusing method to assimilate the information of pruned tokens into a subset of the reserved tokens. We summarize the above process as two steps: matching and fusing.
Matching. Given the two subsets $\mathcal{A}^r$ and $\mathcal{A}^p$, let $\mathcal{I}^r$ and $\mathcal{I}^p$ be the corresponding token indices of $\mathcal{A}^r$ and $\mathcal{A}^p$. A similarity matrix $S$ with entries $S_{i,j}$ for all $i \in \mathcal{I}^p$ and $j \in \mathcal{I}^r$ represents the interactions between the tokens for matching. For each pruned token $x^p_i$, we find its nearest token in the reserved set as its host token:

$h(i) = \arg\max_{j \in \mathcal{I}^r} S_{i,j}, \quad \forall i \in \mathcal{I}^p.$    (1)

Note that since the token matching step is unidirectional from $\mathcal{A}^p$ to $\mathcal{A}^r$, multiple pruned tokens can share the same host token, and not every reserved token serves as a host token. We then record the matching results in a mask matrix $M \in \{0,1\}^{N^p \times N^r}$, whose values are decided by:

$M_{i,j} = 1$ if $j = h(i)$, and $M_{i,j} = 0$ otherwise,    (2)

where $N^p$ and $N^r$ denote the token numbers of the two subsets. The mask ensures that the following fusing step can be conducted with regular matrix operations on $\mathcal{A}^r$ and $\mathcal{A}^p$ while excluding the influence of non-matched pairs.
Although the attention map is a natural and free choice to measure the interactions among tokens, we obtain higher performance with the cosine similarity between $x^p_i$ and $x^r_j$, as the ablation experiment in Section 4.2 discusses. Therefore, in all of our experiments, the similarity matrix is defined as:

$S_{i,j} = \dfrac{x^p_i \cdot x^r_j}{\lVert x^p_i \rVert \, \lVert x^r_j \rVert}.$    (3)
Since the similarity matrix is generated directly from input features, no extra parameters are introduced in the matching step.
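A minimal PyTorch sketch of the matching step, assuming the reserved/pruned tensors produced by the selection step above; it computes the cosine-similarity matrix of Eq. (3), the host indices of Eq. (1), and the one-hot mask of Eq. (2). Shapes and names are ours for exposition.

```python
import torch
import torch.nn.functional as F

def match_pruned_to_hosts(x_r, x_p):
    """Unidirectional nearest-neighbor matching from pruned to reserved tokens.

    x_r: (B, N_r, C) reserved tokens
    x_p: (B, N_p, C) pruned tokens
    Returns the cosine-similarity matrix S (B, N_p, N_r) and the one-hot
    mask M (B, N_p, N_r) marking each pruned token's host token.
    """
    # Cosine similarity between every pruned token and every reserved token.
    sim = torch.einsum('bpc,brc->bpr', F.normalize(x_p, dim=-1),
                                       F.normalize(x_r, dim=-1))
    host = sim.argmax(dim=-1)                                       # (B, N_p)
    mask = F.one_hot(host, num_classes=x_r.shape[1]).type_as(sim)   # (B, N_p, N_r)
    return sim, mask
```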
Fusing. Simply averaging tokens can lead to feature dispersion because of the discrepancies among different tokens. EViT [16] utilizes the token importance scores to re-weight the aggregated tokens. Instead, we use a similarity-based weighting scheme: it strengthens the influence of tokens closer to the host token while avoiding potential flaws from imperfect token scoring. As previously mentioned, the fusing step involves all tokens from both subsets and is controlled by the mask $M$ so that only host tokens and their matched pruned tokens are mixed. This introduces a few redundant computations but increases practical training & inference throughput thanks to the efficiency of regular matrix operations.
Specifically, the reserved token $x^r_j$ is updated by fusing its original feature with the features of its matched pruned tokens as follows:

$\hat{x}^r_j = w_j\, x^r_j + \sum_{i=1}^{N^p} w_{i,j}\, x^p_i,$    (4)

where $w_{i,j}$ is the weight of each pruned token $x^p_i$, $w_j$ is the weight of the reserved token itself, and $\hat{x}^r_j$ is the updated token. The fusing weight depends on the mask value $M_{i,j}$ and the similarity $S_{i,j}$:

$w_{i,j} = \dfrac{M_{i,j}\exp(S_{i,j})}{\exp(1) + \sum_{k=1}^{N^p} M_{k,j}\exp(S_{k,j})}.$    (5)

The reserved token always has the largest fusing weight $w_j$, as the cosine similarity between $x^r_j$ and itself equals 1:

$w_j = \dfrac{\exp(1)}{\exp(1) + \sum_{k=1}^{N^p} M_{k,j}\exp(S_{k,j})} \ \ge\ w_{i,j}.$    (6)
According to the above equations, the reserved tokens that have not been chosen as host tokens remain unchanged, while the pruned tokens are squeezed into host tokens and replace the original ones.
As can be seen, our matching and fusing steps ensure that the number of processed tokens equals the number of reserved tokens, thereby maintaining a constant shape for efficient inference.
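Continuing the sketch, the fusing step of Eqs. (4)-(6) can be written with the same masked matrix operations. This is an illustrative re-implementation under our assumed shapes rather than the released code.

```python
import torch

def fuse_tokens(x_r, x_p, sim, mask):
    """Squeeze matched pruned tokens into their host tokens (Eqs. 4-6).

    x_r:  (B, N_r, C) reserved tokens
    x_p:  (B, N_p, C) pruned tokens
    sim:  (B, N_p, N_r) cosine similarities from the matching step
    mask: (B, N_p, N_r) one-hot host mask from the matching step
    """
    # exp(similarity) weights; non-matched pairs are zeroed by the mask.
    w_p = mask * sim.exp()                        # (B, N_p, N_r)
    # The host token's own weight uses similarity 1 (cosine of a token with itself).
    w_self = x_r.new_ones(x_r.shape[:2]).exp()    # (B, N_r), all equal to exp(1)
    denom = w_self + w_p.sum(dim=1)               # (B, N_r)
    # Weighted fusion; reserved tokens that host nothing stay identical.
    fused = (w_self.unsqueeze(-1) * x_r +
             torch.einsum('bpr,bpc->brc', w_p, x_p)) / denom.unsqueeze(-1)
    return fused                                   # (B, N_r, C)
```

With this formulation, a reserved token that hosts no pruned token has denominator $\exp(1)$ and is returned unchanged, so the output keeps the constant shape (B, N_r, C) noted above.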
3.4 TPS on Hybrid ViTs
To demonstrate our flexibility and generalization across different transformers, we also conduct experiments on hybrid ViTs [31, 33]. For plain transformer blocks, our TPS modules can be easily inserted to reduce the token number and achieve a significant speedup. If a layer contains operations that require a complete spatially structured input, e.g., convolution or pooling, the operation of our TPS is adjusted slightly. For example, in PVT [31] models, the TPS module is inserted before the first block of each pruned stage and generates the decision policy. For the attention layer, we reduce the token number of the input and, consequently, of the query. If a spatial-reduction layer is employed, the dropped token features are filled with zeros to maintain the structured spatial input. More details can be found in the supplementary materials.
4 Experiment



Datasets and evaluation metrics. We conduct contrast experiments against two typical baselines, DynamicViT [25] and EViT [16], and compare our performance with state-of-the-art transformers. For quantitative comparisons, we report the Top-1 accuracy, the number of giga floating-point operations (GFLOPs), and throughput. The input size is set to 224×224 for all experiments. The evaluated datasets include ImageNet1K [4] and the large fine-grained image classification dataset iNaturalist 2019 [29].
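Throughput here means images processed per second at inference. A simple way to measure it is sketched below; the batch size, warm-up schedule, and helper name are our assumptions and may differ from the paper's benchmarking protocol.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=128, image_size=224, n_iters=50, device='cuda'):
    """Rough images/s benchmark; warm-up iterations are excluded from timing."""
    model.eval().to(device)
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)
    for _ in range(10):            # warm-up
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        model(x)
    torch.cuda.synchronize()
    return batch_size * n_iters / (time.time() - start)
```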
Experiment Details. We follow the same data augmentations used in DeiT [28]. The models are initialized with pre-trained weights and fine-tuned with different token pruning locations and keeping ratios. We adopt AdamW [18] as the optimizer with a cosine learning rate scheduler. We compare our dTPS and eTPS with DynamicViT [25] and EViT [16] under multiple pruning settings (the settings combine three sets of pruning locations, {4th, 7th, 10th}, {3rd, 5th, 7th, 9th}, and {4th, 6th, 8th, 10th}, with different token keeping ratios). We follow the same training settings and loss functions as [16, 25], except for the basic learning rate and the removal of the stage of freezing backbone weights in DynamicViT & dTPS. The settings change slightly because training under the original settings appears unstable, especially with aggressive pruning.

4.1 Main Results
Comparison to baselines. As Fig. 5 shows, we compare our method with the token pruning baseline, DynamicViT, and the token reorganization baseline, EViT, by replacing their original pruning modules with our dTPS & eTPS modules. The contrast experiments involve DeiT-small&tiny. All the models in this part are fine-tuned for 30 epochs under multiple pruning settings. Under all settings, our method outperforms DynamicViT and EViT. Both DynamicViT and EViT suffer a larger accuracy drop as pruning becomes progressively more aggressive. While shrinking the computational budgets of DeiT to 35%, our method avoids a 1%-6% accuracy decline compared with the baselines. Equipped with our TPS, DeiT-small reaches a throughput of 1745 images/s, beyond that of DeiT-tiny (1686 images/s), while surpassing the accuracy of DeiT-tiny by 4.78%.
Visual Comparisons. We show cases from ImageNet1K [4] that DeiT originally predicts correctly but misclassifies after token pruning is applied. As Fig. 7 shows, the imperfect pruning policy causes context information loss, which leads to close but incorrect predictions. Our TPS remedies these cases by preserving the pruned tokens' information.
Comparison to the state of the art. In Fig. 6, we compare our TPS with other state-of-the-art transformers, including token pruning methods [20, 8, 25, 16, 35], vanilla ViTs [28, 13, 39, 38, 2, 23], and hybrid ViTs [17, 31, 40, 34]. By integrating DeiT-small&tiny and LV-ViT-small&tiny with TPS and fine-tuning them for only 30 epochs, we achieve a highly competitive accuracy-computation trade-off among numerous vision transformers.
Extension to more backbones. As shown in Tab. 1 and Tab. 2, we incorporate TPS into different vanilla ViTs [28, 39, 38] and hybrid ViTs [31, 40] to prove its flexibility and generalization. For vanilla ViTs, our TPS outperforms EViT [16], Evo-ViT [35], A-ViT [36], IA-RED2 [21], and SPViT [14] with equal or slightly higher computation when using DeiT [28] and LV-ViT [13] as backbones. DeiT-small&tiny with TPS applied surpass the pre-trained models by 0.3% and 0.7% in accuracy under 100 fine-tuning epochs. For hybrid ViTs, we compress the GFLOPs of PVT-tiny by 13% while improving its accuracy by 0.1%.
Method | Param(M) | GFLOPs | Top-1 Acc.(%) |
---|---|---|---|
DeiT-S | 22.05 | 4.6 | 79.8 |
DynamicViT[25] | 22.77 | 2.9 | 79.3 |
EViT[16] | 22.05 | 3.0 | 79.5 |
ATS†[8] | 22.05 | 2.9 | 79.7 |
A-ViT†[36] (100 epochs) | 22.05 | 3.6 | 78.6 |
Evo-ViT[35] (300 epochs) | 22.05 | 3.0 | 79.4 |
SPViT[14] (75 epochs) | 22.13 | 2.7 | 79.3 |
IA-RED2[21] (90 epochs) | - | - | 79.1 |
eTPS (ours) | 22.05 | 3.0 | 79.7 |
dTPS* (ours) | 22.77 | 3.0 | 80.1 |
DeiT-T | 5.72 | 1.3 | 72.2 |
DynamicViT(re-impl)[25] | 5.90 | 0.8 | 71.4 |
EViT(re-impl)[16] | 5.72 | 0.8 | 71.9 |
A-ViT†[36] (100 epochs) | 5.00 | 0.8 | 71.0 |
Evo-ViT[35] (300 epochs) | 5.72 | 0.8 | 72.0 |
SPViT[14] (75 epochs) | - | 0.9 | 72.1 |
eTPS (ours) | 5.72 | 0.8 | 72.3 |
dTPS* (ours) | 5.90 | 0.8 | 72.9 |
LV-ViT-S | 26.17 | 6.6 | 83.3 |
DynamicViT[25] | 26.89 | 3.8 | 82.0 |
EViT[16] | 26.17 | 3.9 | 82.5 |
eTPS (ours) | 26.17 | 3.8 | 82.5 |
dTPS* (ours) | 26.89 | 3.8 | 82.6 |
LV-ViT-T | 8.53 | 2.9 | 79.1 |
DynamicViT(re-impl)[25] | 8.82 | 2.0 | 77.1 |
eTPS (ours) | 8.53 | 2.0 | 78.0 |
dTPS* (ours) | 8.82 | 2.0 | 78.7 |
PS-ViT-B/14[39] | 21.34 | 5.4 | 81.7 |
ATS†[8] | 21.34 | 3.7 | 81.5 |
dTPS* (ours) | 22.07 | 3.7 | 81.5 |
Method | Param (M) | GFLOPs | Top-1 Acc. (%) |
---|---|---|---|
PVT-T[31] | 13.23 | 1.94 | 75.1 |
dTPS* (ours) | 13.85 | 1.69 (-13%) | 75.2 (+0.1) |
PVT-S | 24.49 | 3.83 | 79.8 |
dTPS* (ours) | 25.11 | 3.14 (-18%) | 79.2 (-0.6) |
CvT-13[33] | 20.00 | 4.58 | 81.6 |
dTPS* (ours) | 20.72 | 3.04 (-34%) | 80.8 (-0.8) |
CvT-21 | 31.62 | 7.21 | 82.5 |
dTPS* (ours) | 32.35 | 4.10 (-43%) | 80.9 (-1.6) |
Fine-Grained Visual Categorization. We compare our dTPS with DynamicViT by fine-tuning DeiT on iNaturalist 2019 [29], as shown in Tab. 3 (see the supplementary materials for the training details on iNaturalist 2019 [29]). Compared with DynamicViT, our dTPS obtains 0.3% and 0.2% accuracy improvements on DeiT-tiny and DeiT-small, respectively, when fine-tuning for 30 epochs. Fine-tuning dTPS for 100 epochs brings a further significant improvement on both backbones. Notably, dTPS fine-tuned for 100 epochs is only 0.1% lower than DeiT-small while shrinking the computational budget of DeiT-small to 65%.
Method | Param(M) | GFLOPs | Top-1 Acc.(%) |
---|---|---|---|
DeiT-S[28] | 22.05 | 4.6 | 74.8 |
DynamicViT(re-impl)[25] | 22.77 | 2.9 | 74.0 |
dTPS (ours) | 22.77 | 3.0 | 74.2 |
dTPS* (ours) | 22.77 | 3.0 | 74.7 |
DeiT-T | 5.72 | 1.26 | 72.8 |
DynamicViT(re-impl)[25] | 5.90 | 0.8 | 71.4 |
dTPS (ours) | 5.90 | 0.8 | 71.7 |
dTPS* (ours) | 5.90 | 0.8 | 72.4 |
4.2 Ablation Study

Feature Type | Top-1 Acc. (%) |
---|---|
Full | 71.90 |
Content | 71.73 |
Position | 70.92 |
TPS Variant | Similarity Matrix | GFLOPs | Top-1 Acc.(%)
---|---|---|---
dTPS | Cosine similarity | 0.810 | 71.90
dTPS | Previous attention | 0.807 | 71.35
eTPS | Cosine similarity | 0.821 | 72.26
eTPS | Previous attention | 0.818 | 71.67
Epochs of training. Fig. 8 shows that both variants benefit from longer training and surpass DeiT-small&tiny with only 65% of the GFLOPs. However, the benefit of longer training differs slightly between the two variants. Because the class-token attention scoring requires no extra optimization target, eTPS performs better than dTPS under 30 epochs. On the other hand, dTPS benefits more from longer training on DeiT-small, as its learnable scoring brings a higher performance upper bound.
Feature Type. We study the effect of the feature type used to establish the matching relationships. Supposing $x$ is the full embedding of a token and the position feature $e$ is the corresponding positional embedding, we define the content feature as $x - e$. As Tab. 4 illustrates, the full feature is more favorable, for it contains both the content and position information.
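To make the three feature types concrete, the sketch below derives them from a token embedding and its positional embedding (the names and helper are ours); any of the three can then be fed to the matching step in place of the raw tokens.

```python
import torch

def matching_features(x_full, pos_embed, feature_type='full'):
    """Select the features used to build the similarity matrix.

    x_full:    (B, N, C) token embeddings (content + position)
    pos_embed: (1, N, C) positional embeddings broadcast over the batch
    """
    if feature_type == 'full':
        return x_full                       # content + position (best in Tab. 4)
    if feature_type == 'content':
        return x_full - pos_embed           # position removed
    if feature_type == 'position':
        return pos_embed.expand_as(x_full)  # position only
    raise ValueError(feature_type)
```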
Similarity Matrix. Considering that the dot-product attention between queries and keys naturally measures token relationships, we also tried reusing the previous attention map in place of the cosine-similarity computation in the matching step. We argue that the previous attention is outdated for measuring the relations of the current tokens, and Tab. 5 shows that computing the cosine similarity of current features outperforms reusing the attention, with only a minor computational increase.
4.3 Robustness Experiments
We generate random token selection policies to construct artificial policy errors that simulate the mistakes of sub-optimal token pruning strategies. All models are based on DeiT-small and fine-tuned for 30 epochs with identical pruning setups. We run the experiments under random policies 100 times and report the average results. Comparing our method with DynamicViT [25] and EViT [16], the accuracy drop from the original to the random policies indicates the robustness to incorrect policies. As shown in Tab. 6, our inter-block variant dTPS and intra-block variant eTPS suffer smaller accuracy drops than DynamicViT [25] and EViT [16].
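The random policies used in this analysis can be generated as below; this is our own sketch of the protocol (uniformly sampling which tokens to keep, ignoring the learned scores), not the authors' exact script.

```python
import torch

def random_keep_policy(batch_size, num_tokens, keep_ratio, generator=None):
    """Randomly choose which tokens to keep, ignoring the learned scores.

    Returns a boolean mask of shape (batch_size, num_tokens) with exactly
    int(num_tokens * keep_ratio) True entries per sample.
    """
    num_keep = int(num_tokens * keep_ratio)
    # Random scores -> Top-k indices gives a uniform random subset.
    rand_scores = torch.rand(batch_size, num_tokens, generator=generator)
    keep_idx = rand_scores.topk(num_keep, dim=1).indices
    mask = torch.zeros(batch_size, num_tokens, dtype=torch.bool)
    mask.scatter_(1, keep_idx, True)
    return mask
```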
Methods | Policy | Top-1 Acc. (%)
---|---|---
DynamicViT | Original | 79.42
DynamicViT | Random | 76.51 (-3.7)
dTPS | Original | 79.68
dTPS | Random | 78.19 (-1.9)
EViT | Original | 79.51
EViT | Random | 77.47 (-2.6)
eTPS | Original | 79.66
eTPS | Random | 78.06 (-2.0)
5 Conclusions and Limitations
In this paper, we presented a novel joint Token Pruning & Squeezing (TPS) module to compress vision transformers more aggressively. With its capability of conserving information, TPS avoids the significant performance drop of token pruning and token reorganization. Our method achieves better efficiency than prior token pruning methods and state-of-the-art vision transformers. Extensive experiments on various backbones and quantitative analyses demonstrate our flexibility and robustness.
However, there are still some limitations to our method. Firstly, structured spatial operations of hybrid ViTs restrict the straightforward integration of token pruning. Secondly, the procedure of fine-tuning pre-trained models might be replaced by more advanced pruning-aware training-from-scratch schemes to shorten the total training time. In the future, we will evolve our method to be more adaptive to hybrid ViTs and apply it to more dense prediction tasks.
Supplementary Material




Matching Method | Acc. (%) |
---|---|
N:1 | 71.90 |
1:1 | 69.02 |
Fusing Method | Policy | Acc. (%)
---|---|---
Weighting | Original | 70.58
Weighting | Random | 65.56 (-5.02)
Average | Original | 70.47
Average | Random | 65.17 (-5.30)
1 Overview
In the supplemental materials, we show the following details of our joint Token Pruning & Squeezing (TPS):
• Visualizations.
• Details of two variants.
• TPS on hybrid ViTs.
• Detailed experiment settings.
• TPS on larger models and with larger input size.
• TPS under different keep ratios.
• More ablations about TPS design.
2 Visualizations
We show additional cases from ImageNet1K [4] that our TPS-DeiT and DeiT predict correctly but DynamicViT-DeiT predicts wrongly. As Fig. 10 shows, the imperfect pruning policy causes loss of background context and an incomplete subject, which confuses the model and leads to close but incorrect predictions. Our TPS handles these cases by squeezing the information of pruned tokens into similar reserved tokens.
3 Details of Two Variants
We design two variants of TPS: dTPS and eTPS, to show our flexibility and compare fairly with dynamicViT [25] and EViT [16]. Theoretically, our TPS can be incorporated with any token pruning method. In this paper, we choose dynamicViT and EViT as baselines for their strong performance and concise forms. The major disparities between the two variants are as follows:
Forward procedure. The TPS module practically drops tokens from the input, except during the training stage of dTPS. In each pruning stage, the dTPS module employs Gumbel-Softmax [12] to randomly sample the binary decision mask during training and maintains the current reserved and pruned masks to prevent previously pruned tokens from participating in the matching and fusing steps. In the subsequent attention layers, the attention masking strategy from DynamicViT [25] is employed to erase the effects of dropped tokens. The implementation details can be found in the code.
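A minimal sketch of how the dTPS training-time decision mask could be sampled and propagated. The score head and the mask bookkeeping below are simplified stand-ins for the DynamicViT-style components, not the exact released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenScoreHead(nn.Module):
    """Tiny learnable head predicting keep/drop logits per token (dTPS-style)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim // 4),
                                 nn.GELU(), nn.Linear(dim // 4, 2))

    def forward(self, x, prev_mask):
        # x: (B, N, C); prev_mask: (B, N) keep mask from earlier pruning stages.
        logits = self.mlp(x)                                   # (B, N, 2)
        # Straight-Through Gumbel-Softmax gives a differentiable binary decision.
        keep = F.gumbel_softmax(logits, hard=True)[..., 0]     # (B, N), 0/1
        # Previously pruned tokens stay pruned in later stages.
        return keep * prev_mask
```

During training the resulting mask is then used to mask the attention as in DynamicViT, while at inference it is replaced by a hard Top-k selection that actually drops tokens.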
Position to insert. As mentioned in the paper, dTPS and eTPS cut down tokens in inter-block and intra-block manners, respectively. The dTPS module is inserted before the transformer block, while the eTPS module is inserted after the multi-head attention layer. The distinction derives from the different token scoring methods: the learnable score prediction employed by DynamicViT and dTPS does not rely on any internal operation of the transformer block, while scoring based on the class-token attention requires the results of the multi-head attention layer.
Parameters. The eTPS module is parameter-free, while the dTPS module increases the total number of parameters by a small amount due to its learnable token score prediction head.
Performance. The performances of the dTPS and eTPS modules are close but can differ slightly as the number of training epochs changes. In our experiments, the eTPS module outperforms the dTPS module under 30 epochs, while the opposite holds under 100 epochs. This difference demonstrates that the extra parameters of the dTPS module endow the model with a higher upper limit.
4 TPS on Hybrid ViTs
4.1 PVT
Generally, we insert dTPS modules between the patch embedding layer and the subsequent basic block of each pruned stage. Unlike TPS on vanilla ViTs, we select the attentive tokens from the whole token set at each pruned stage and use masking (training) or padding (inference) operations to maintain the complete spatial structure.
Training. The spatial reduction layer of the basic block in PVT requires input with a complete spatial structure. For a basic block with a spatial reduction layer, given the policy $\hat{D}$, we keep the complete spatial structure and mask the dropped tokens in the key $K$ and the value $V$ with zeros before the spatial reduction layer:

$\mathrm{SRA}'(Q, K, V) = \mathrm{MHA}\big(Q,\ \mathrm{SR}(K \odot \hat{D}),\ \mathrm{SR}(V \odot \hat{D})\big),$    (7)

where $\mathrm{SRA}'$ is the modified spatial-reduction attention, $\mathrm{MHA}$ is the multi-head attention operation, and $\mathrm{SR}$ is the spatial reduction layer. Moreover, we perform the same masking operation on the dropped tokens before the patch embedding layer of the next stage.
For a basic block without a spatial reduction layer, no masking operation is needed, and we apply the attention masking strategy from DynamicViT [25] to erase the effects of the dropped tokens.
Inference. The inference procedure of dTPS is adjusted slightly to practically accelerate the spatial reduction layer. The input tokens are pruned by a Top-k selection operation based on the scoring results. For a block with a spatial reduction layer, we pad the previously dropped positions of the key and value with zeros to recover the complete spatial structure:

$\tilde{K} = \mathrm{Pad}(K^r),$    (8)

$\tilde{V} = \mathrm{Pad}(V^r),$    (9)

$\mathrm{SRA}'(Q^r, \tilde{K}, \tilde{V}) = \mathrm{MHA}\big(Q^r,\ \mathrm{SR}(\tilde{K}),\ \mathrm{SR}(\tilde{V})\big),$    (10)

where $Q^r$, $K^r$, and $V^r$ are the query, key, and value of the reserved tokens, and $\mathrm{Pad}(\cdot)$ places the reserved tokens back at their original spatial positions while filling the dropped positions with zeros.
The same padding operation is also applied before the patch embedding layer of the next stage. For a block without a spatial reduction layer, no padding operation is needed. The requirement of a complete spatial structure leads to less shrinkage of computations.
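To illustrate the two operations described above, the sketch below masks dropped tokens with zeros during training and scatters the reserved tokens back onto a zero canvas at inference. Tensor layouts and helper names are our assumptions rather than the exact PVT integration.

```python
import torch

def mask_dropped(x, keep_mask):
    """Training: zero out dropped tokens while keeping the full spatial length.

    x: (B, N, C) tokens; keep_mask: (B, N) with 1 for reserved, 0 for dropped.
    """
    return x * keep_mask.unsqueeze(-1)

def pad_reserved(x_r, keep_idx, num_tokens):
    """Inference: place reserved tokens back at their spatial positions,
    filling previously dropped positions with zeros.

    x_r: (B, N_r, C) reserved tokens; keep_idx: (B, N_r) their original indices.
    """
    B, _, C = x_r.shape
    canvas = x_r.new_zeros(B, num_tokens, C)
    canvas.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, C), x_r)
    return canvas
```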
4.2 CvT
The last stage of CvT [33] contains most of its blocks; therefore, we only modify the last stage with our dTPS. The other stages remain the same as in the original CvT [33].
Training. The convolutional projection in CvT requires input with a complete spatial structure. Given the policy $\hat{D}$, we mask the dropped tokens with zeros before the convolutional projection:

$X' = \mathrm{ConvProj}(X \odot \hat{D}),$    (11)

where $X$ denotes the input tokens and $\mathrm{ConvProj}$ the convolutional projection layer.
Inference. The input tokens are pruned by a Top-k selection operation based on the scoring results. To maintain the complete spatial structure, we pad the previously dropped positions with zeros before the convolutional projection layer:

$\tilde{X} = \mathrm{Pad}(X^r),$    (12)

$X' = \mathrm{ConvProj}(\tilde{X}).$    (13)
ATS [8] also conducts experiments on CvT [33]. It takes a variant of CvT [33] without convolutional projection in stage 3 as the pre-trained model and only performs token pruning in stage 3 to avoid the extra operations needed to maintain the structured spatial input. In contrast, our method utilizes masking and padding during training and inference to keep the spatial structure.
5 Detailed Experiment Settings
5.1 ImageNet-1K Classification
All experiments follow the same data augmentations used in DeiT [28] (https://github.com/facebookresearch/DeiT). All models are initialized with pre-trained weights and fine-tuned with different token pruning locations and token keeping ratios. We adopt AdamW [18] as the optimizer with a cosine learning rate scheduler.
TPS on DeiT [28]. The experiment settings of dTPS-DeiT follow DynamicViT (https://github.com/raoyongming/DynamicViT), except for the basic learning rate and the removal of the stage of freezing backbone weights. The experiment settings of eTPS-DeiT follow EViT (https://github.com/youweiliang/evit). The pruning settings combine three sets of multi-layer pruning locations, {4th, 7th, 10th}, {3rd, 5th, 7th, 9th}, and {4th, 6th, 8th, 10th}, with two token keeping ratios. The token keeping ratio remains the same in all pruning stages.
TPS on LV-ViT [13]. The experiments of dTPS and eTPS on LV-ViT [13] follow the same training settings as on DeiT, except for the basic learning rate, which is adjusted for stable convergence. For LV-ViT-T and LV-ViT-S, the pruning settings likewise each combine three sets of multi-layer pruning locations with two token keeping ratios.
TPS on PS-ViT [39]. The experiments of dTPS on PS-ViT [39] follow the same training settings as ATS [8] on PS-ViT [39]. The pruning locations are set to [3, 6, 9], and the token keeping ratio is 0.5.
TPS on PVT [31]. The pruning stages include stage 2, stage 3, and stage 4, and the token keeping ratio for all dTPS modules is set to 0.7.
TPS on CvT [33]. The dTPS modules are only inserted into stage 3, and the pruning locations are [3, 6, 9] for CvT-13 and [5, 9, 13] for CvT-21. The token keeping ratio for all TPS modules is set to 0.5.
5.2 iNaturalist 2019 Classification
TPS on DeiT [28]. For the experiment on iNaturalist 2019 Classification [29], we re-train DeiT and fine-tune the model with dynamicViT or dTPS applied.
In the training step, we initialize DeiT [28] with the weights of the ImageNet1K pre-trained model and re-train it on iNaturalist 2019; the other settings follow DeiT [28].
In the fine-tuning step, we initialize DynamicViT-DeiT and dTPS-DeiT with the weights from the previous step and fine-tune them for 30 epochs with the same pruning setup. The token pruning locations are set to [4, 7, 10], and the token keeping ratio is 0.5. We follow the same fine-tuning settings as the experiments on ImageNet1K, except that no distillation loss is used.
6 TPS on Larger Models and with Larger Input Size
7 TPS under Different Keep Ratios

Experiments of TPS under different keep ratios are shown in Fig. 11.
8 More Ablations About TPS Design
More matching & fusing methods are shown in Tab. 7 as ablations on our TPS design. Tab. 7(a) indicates that the performance improvement of TPS benefits from compressing the pruned tokens' information while keeping unmatched reserved tokens unchanged. Tab. 7(b) and the robustness analysis in the main paper (Sec. 4.3) show that squeezing under a random token division suffers a significant drop, which proves that token scoring is necessary.
References
- [1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- [2] Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF international conference on computer vision, pages 357–366, 2021.
- [3] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, 34:17864–17875, 2021.
- [4] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- [5] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [6] Alaaeldin El-Nouby, Natalia Neverova, Ivan Laptev, and Hervé Jégou. Training vision transformers for image retrieval. arXiv preprint arXiv:2102.05644, 2021.
- [7] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
- [8] Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Sanjari, Eric Sommerlade, Hamed Pirsiavash, and Juergen Gall. Adaptive token sampling for efficient vision transformers. In European Conference on Computer Vision, 2022.
- [9] Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. Advances in Neural Information Processing Systems, 34:15908–15919, 2021.
- [10] Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based object re-identification. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15013–15022, 2021.
- [11] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.
- [12] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.
- [13] Zi-Hang Jiang, Qibin Hou, Li Yuan, Daquan Zhou, Yujun Shi, Xiaojie Jin, Anran Wang, and Jiashi Feng. All tokens matter: Token labeling for training better vision transformers. Advances in Neural Information Processing Systems, 34:18590–18602, 2021.
- [14] Zhenglun Kong, Peiyan Dong, Xiaolong Ma, Xin Meng, Wei Niu, Mengshu Sun, Xuan Shen, Geng Yuan, Bin Ren, Hao Tang, et al. Spvit: Enabling faster vision transformers via latency-aware soft token pruning. In European Conference on Computer Vision, pages 620–640. Springer, 2022.
- [15] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707, 2021.
- [16] Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transformers via token reorganizations. arXiv preprint arXiv:2202.07800, 2022.
- [17] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
- [18] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- [19] Jiageng Mao, Yujing Xue, Minzhe Niu, Haoyue Bai, Jiashi Feng, Xiaodan Liang, Hang Xu, and Chunjing Xu. Voxel transformer for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3164–3173, 2021.
- [20] Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu, Yu-Gang Jiang, and Ser-Nam Lim. Adavit: Adaptive vision transformers for efficient image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12309–12318, 2022.
- [21] Bowen Pan, Rameswar Panda, Yifan Jiang, Zhangyang Wang, Rogerio Feris, and Aude Oliva. Ia-red2: Interpretability-aware redundancy reduction for vision transformers, 2021.
- [22] Xuran Pan, Zhuofan Xia, Shiji Song, Li Erran Li, and Gao Huang. 3d object detection with pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7463–7472, 2021.
- [23] Zizheng Pan, Bohan Zhuang, Jing Liu, Haoyu He, and Jianfei Cai. Scalable vision transformers with hierarchical pooling. In Proceedings of the IEEE/cvf international conference on computer vision, pages 377–386, 2021.
- [24] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
- [25] Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021.
- [26] Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, and Jie Zhou. Global filter networks for image classification. Advances in Neural Information Processing Systems, 34:980–993, 2021.
- [27] Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu, and Dacheng Tao. Patch slimming for efficient vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12165–12174, 2022.
- [28] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.
- [29] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist challenge 2019 dataset. arXiv preprint arXiv:1707.06642, 2019.
- [30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- [31] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578, 2021.
- [32] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8741–8750, 2021.
- [33] Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22–31, 2021.
- [34] Weijian Xu, Yifan Xu, Tyler Chang, and Zhuowen Tu. Co-scale conv-attentional image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9981–9990, 2021.
- [35] Yifan Xu, Zhijie Zhang, Mengdan Zhang, Kekai Sheng, Ke Li, Weiming Dong, Liqing Zhang, Changsheng Xu, and Xing Sun. Evo-vit: Slow-fast token evolution for dynamic vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2964–2972, 2022.
- [36] Hongxu Yin, Arash Vahdat, Jose M Alvarez, Arun Mallya, Jan Kautz, and Pavlo Molchanov. A-vit: Adaptive tokens for efficient vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10809–10818, 2022.
- [37] Xumin Yu, Yongming Rao, Ziyi Wang, Zuyan Liu, Jiwen Lu, and Jie Zhou. Pointr: Diverse point cloud completion with geometry-aware transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12498–12507, 2021.
- [38] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zi-Hang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 558–567, 2021.
- [39] Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip HS Torr, Wayne Zhang, and Dahua Lin. Vision transformer with progressive sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 387–396, 2021.
- [40] Wang Zeng, Sheng Jin, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang, and Xiaogang Wang. Not all tokens are equal: Human-centric visual analysis via token clustering transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11101–11111, 2022.