PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers
Abstract
Data-free quantization can potentially address data privacy and security concerns in model compression and has thus been widely investigated. Recently, PSAQ-ViT (the paper entitled “Patch Similarity Aware Data-Free Quantization for Vision Transformers”) designed a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs), making the first attempt at data-free quantization for ViTs. In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs, built on top of PSAQ-ViT. More specifically, following the patch similarity metric in PSAQ-ViT, we introduce an adaptive teacher-student strategy, which facilitates the constant cyclic evolution of the generated samples and the quantized model (student) in a competitive and interactive fashion under the supervision of the full-precision model (teacher), thus significantly improving the accuracy of the quantized model. Moreover, without the auxiliary category guidance, we employ task- and model-independent prior information, making the general-purpose scheme compatible with a broad range of vision tasks and models. Extensive experiments are conducted on various models for image classification, object detection, and semantic segmentation tasks, and PSAQ-ViT V2, with a naive quantization strategy and without access to real-world data, consistently achieves competitive results, showing potential as a powerful baseline for data-free quantization of ViTs. For instance, with Swin-S as the (backbone) model, 8-bit quantization reaches 82.13% top-1 accuracy on ImageNet, 50.9 box AP and 44.1 mask AP on COCO, and 47.2 mIoU on ADE20K. We hope that the accurate and general PSAQ-ViT V2 can serve as a practical solution in real-world applications involving sensitive data. Code is released at: https://github.com/zkkli/PSAQ-ViT.
Index Terms:
Model compression, quantized vision transformers, data-free quantization, patch similarity.
I Introduction
Thanks to their great success in a variety of vision applications, vision transformers (ViTs) have recently drawn widespread attention in both research and practice [1, 2, 3, 4]. Nevertheless, their large memory footprints, computational overheads, and power consumption are challenging for real-world applications [5, 6, 7], especially for deployment and efficient inference on resource-constrained edge devices [8, 9]. Model quantization [10, 11], which reduces the representation precision of parameters, is an effective way to reduce model complexity in a hardware-friendly manner [12, 13, 14, 15, 16, 17, 18, 19]. Unfortunately, existing quantization methods require access to the original training datasets to calibrate/fine-tune the quantization parameters [20], which raises widely-held data privacy and security concerns and thus prevents their application in data-sensitive scenarios [21, 22, 23, 24].
Consequently, data-free quantization, which works to generate fake data from the prior information of the pre-trained model itself to replace the real data used in the quantization process, has been widely investigated and explored [21, 25, 22, 26, 27, 23]. Existing approaches have focused on convolutional neural networks (CNNs), which utilize batch normalization (BN) regularization to facilitate the distribution of the generated samples to match the real-data statistics embedded in the BN layers of the pre-trained full-precision (FP) model [21, 25, 22]. However, this is not applicable to ViTs with layer normalization (LN), because LN is dynamically computed and does not store the statistical prior of the original data [20]. Thus, designing specific data-free quantization schemes for ViTs is highly desired.
To this end, PSAQ-ViT (the paper entitled “Patch Similarity Aware Data-Free Quantization for Vision Transformers”) [20] makes a first attempt to quantize ViTs without any real-world data, filling this gap in the community. Since ViTs offer no absolute value metric comparable to BN statistics, PSAQ-ViT discovers a general difference in the self-attention module’s processing of Gaussian noise and real images, i.e., patch similarity, and accordingly develops a relative value metric to reduce this difference and thus optimize Gaussian noise to approximate real images. Despite achieving good results, the accuracy and generality of this scheme are still less than expected. First, sample generation and model quantization are treated as two independent phases, which hinders the informativeness and diversity of the generated samples and the ongoing learning of the quantized model. Second, it utilizes the auxiliary category prior to enhance the semantic features of the class-conditional samples, which unfortunately restricts the scheme to the image classification task and makes it inapplicable to other high-level vision applications such as object detection and semantic segmentation.
To address the above issues, we propose PSAQ-ViT V2, an enhanced version that enables more accurate and general data-free quantization for ViTs. This is achieved by well-designed refinements built on top of the patch similarity in PSAQ-ViT, as shown in Fig. 1. Specifically, we introduce a teacher-student strategy that facilitates the evolution of the quantized model using the generated samples under the supervision of the FP model. It is worth noting that instead of a one-time synthesis of all samples, the above procedure is adaptive, i.e., the generated samples and the quantized model cyclically play a minimax game with respect to the model discrepancy. This competitive and interactive style forces the progressive emergence of new and diverse features in the generated samples, thus promoting constant and effective learning of the quantized model. In this process, image augmentation is used to cost-effectively expand the samples in each mini-batch. Moreover, thanks to the removal of the category prior, the objective functions in this work are independent of the tasks and models; this general-purpose nature makes the scheme compatible with a broad range of vision tasks, including image classification, object detection, and semantic segmentation, and also allows for different teacher and student model structures, e.g., the learning of quantized DeiT-T can be performed under the supervision of DeiT-B.
To sum up, our contributions are as follows:
• On top of the relative value metric (patch similarity) for sample generation in PSAQ-ViT, we propose an enhanced version, which is a more accurate and general data-free quantization framework for ViTs.
• We introduce an adaptive teacher-student strategy, whose competitive and interactive properties progressively force new sample features to emerge, thus facilitating the constant evolution of the quantized model under the supervision of the FP model.
• The prior information used in this scheme is task- and model-independent, in contrast to the category prior, making the scheme general-purpose and compatible with various vision tasks and models.
• Extensive experiments are performed with various models on benchmark datasets for multiple vision tasks, and the results demonstrate notable advantages in accuracy and generality, with the potential to serve as a strong baseline.

II Related Works
II-A Vision Transformers
With their success in natural language processing applications, transformers, which enjoy the global receptive field offered by the self-attention mechanism, have recently achieved significant performance gains on a range of computer vision applications [28, 29, 30, 31, 32]. By reshaping the image into a sequence of flattened 2D patches as the input, ViT [33] is the first work to apply transformer-based models to the computer vision community, boosting the performance baseline on the image classification task. DeiT [34] employs a distillation token to allow the student to learn from the teacher through attention, achieving competitive results on ImageNet with low training data cost. Swin [35] constructs hierarchical feature maps by shifting the window partition between consecutive self-attention layers and has computational complexity linear in image size, further demonstrating the potential of transformer-based models as vision backbones. In addition to image classification, ViTs are also emerging as popular solutions for other high-level vision tasks, such as object detection [36, 37], semantic segmentation [38], and video recognition [39].
Unfortunately, despite their excellent performance, ViTs typically have complex model architectures, and their large number of parameters and heavy computation are prohibitive for real-world applications [5, 8]. To this end, several recent works focus on lightweight ViTs, such as MobileViT [40] and TinyViT [41]; however, their model complexity is still far from satisfactory, especially for deployment on resource-constrained edge devices. As a result, model compression techniques for ViTs are regarded as necessary and promising solutions.
II-B Data-Driven Quantization
Model quantization, which replaces 32-bit floating-point parameters with low-bit values, is a popular and promising way to reduce the complexity of neural networks [10, 11], thus allowing their deployment and real-time inference on edge devices. To date, there has been considerable research on quantizing CNNs. A selection of notable works performs quantization-aware training (QAT) to improve the accuracy of the quantized model [12, 42, 43, 44], where the straight-through estimator (STE) [45] is used to approximate the gradient back-propagation of discrete parameters. To reduce the computational cost of QAT, other efforts propose post-training quantization (PTQ) methods [46, 47, 48, 49, 50, 51], which focus on the calibration of quantization parameters.
In addition, several quantization schemes designed for ViTs have been recently presented. Ranking-Aware [8] introduces a ranking loss into the conventional quantization objective to keep the relative order of the self-attention results after quantization. PTQ4ViT [52] proposes the twin uniform quantization method to reduce the quantization error on the activation values after Softmax and GELU functions. FQ-ViT [7] designs specific quantization strategies for Softmax and LN components to obtain the fully quantized ViTs. I-ViT [9] utilizes integer bit-shifting to approximate Softmax and GELU operations to achieve efficient integer-only inference for ViTs.
However, all the above methods rely heavily on the original training data, making them inapplicable in data-sensitive scenarios where the original data is not available.
II-C Data-Free Quantization
With the ability to compress models without access to any real data, data-free quantization can potentially address data privacy and security issues and is thus receiving increasing attention [23, 27]. ZeroQ [21] proposes BN regularization to facilitate the distribution matching between the generated samples and the real data, and then uses the generated samples to calibrate the quantization parameters. Based on BN regularization, DSG [22] utilizes slack distribution alignment and layerwise sample enhancement to improve the diversity of the generated samples. GDFQ [25] and IntraQ [26] combine the category prior with BN statistics to generate class-conditional samples while matching the real-data distribution, powering the performance of data-free quantization. These methods, however, are only applicable to CNNs and not to ViTs, because LN in ViTs, unlike BN, does not store any prior information of the original training data.
To address the above challenge, PSAQ-ViT [20], the previous version of this work, provides an insight into the general difference in the self-attention module’s processing of Gaussian noise and real images, and thus proposes a relative value metric, patch similarity, to optimize Gaussian noise to approximate real images. In this work, building on patch similarity, we are interested in making the scheme more accurate and general by introducing advanced quantization learning strategies, further advancing data-free quantization for ViTs.
III Methodology
Framework Overview: Fig. 1 illustrates an overview of the proposed framework. The whole scheme requires only a pre-trained FP model, without any other information, especially real-world data, to obtain a quantized model with superior performance. In this scheme, following PSAQ-ViT, patch similarity remains an essential component in driving Gaussian noise to approximate real images, as detailed in Section III-B. On this basis, we introduce an adaptive teacher-student strategy, which facilitates consistent and effective learning of the quantized model using the generated samples with constantly new features under the supervision of the FP model. This is achieved by the competition and interaction between the generated samples and the quantized model; specifically, they cyclically play a two-player minimax game with respect to the model discrepancy, as detailed in Section III-C. The overall pipeline described above is summarized and presented in Section III-D.
III-A Preliminaries
Standard ViTs consist of an embedding layer and several stacked transformer blocks. First, the embedding layer reshapes the input image into a sequence of flattened 2D patches, which are then linearly projected into a $D$-dimensional space to obtain the token embeddings $X_0 \in \mathbb{R}^{N \times D}$. Here, $N$ is the number of patches.
The token embeddings are then fed into a series of transformer blocks, where each block is composed of a multi-head self-attention (MSA) module and a multi-layer perceptron (MLP) module as follows:
$X_l' = X_{l-1} + \mathrm{MSA}\big(\mathrm{LN}(X_{l-1})\big), \qquad X_l = X_l' + \mathrm{MLP}\big(\mathrm{LN}(X_l')\big)$   (1)
MSA calculates the attention between different patches to learn inter-patch representations from a global view as follows:
$\mathrm{Attn}_i(X) = \mathrm{Softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d}}\right) V_i, \qquad \mathrm{MSA}(X) = \mathrm{Concat}\big(\mathrm{Attn}_1, \dots, \mathrm{Attn}_H\big)\, W^{O}$   (2)
where $H$ is the number of attention heads. Here, the query $Q_i$, key $K_i$, and value $V_i$ are computed by linear projections using matrix multiplication, i.e., $Q_i = X W_i^{Q}$, $K_i = X W_i^{K}$, and $V_i = X W_i^{V}$. Then, MLP employs two dense layers to process the obtained attention for high-dimensional feature mapping and information fusion as follows:
$\mathrm{MLP}(X) = \mathrm{GELU}\big(X W_1 + b_1\big)\, W_2 + b_2$   (3)
Finally, the blocks are followed by the classification/detection/segmentation heads depending on the vision task to get the final result.
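For concreteness, the following is a minimal PyTorch sketch of one such transformer block matching Eqs. (1)-(3); the dimensions shown (192-dim embeddings and 3 heads, as in DeiT-T) and the use of `nn.MultiheadAttention` are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Minimal pre-LayerNorm transformer block: MSA + MLP (Eqs. 1-3)."""
    def __init__(self, dim=192, num_heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                    # x: (B, N, dim) token embeddings
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)     # multi-head self-attention (Eq. 2)
        x = x + attn_out                     # residual connection (Eq. 1)
        x = x + self.mlp(self.norm2(x))      # MLP module (Eq. 3)
        return x

tokens = torch.randn(1, 197, 192)            # e.g., 196 patches + one class token
out = Block()(tokens)
```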
In this paper, we work on quantizing all the parameters (weights and activations) of the operations (e.g., large matrix multiplications) in the embedding layer, transformer blocks, and subsequent vision heads, including the input and output. For the quantization strategy, we employ simple uniform quantization, which is the most popular and hardware-friendly method and is defined as follows:
$\hat{\theta} = \mathrm{round}\!\left(\frac{\theta - l}{u - l}\,\big(2^{b}-1\big)\right)\cdot \frac{u - l}{2^{b}-1} + l$   (4)
where $\theta$ and $\hat{\theta}$ denote the parameters of the FP model and the quantized model, respectively. Here, $\mathrm{round}(\cdot)$ is the round operator, $l$ and $u$ are the quantization clipping values, and $b$ is the quantization bit-precision.
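To make Eq. (4) concrete, here is a minimal PyTorch sketch of the fake-quantization step, assuming per-tensor MinMax clipping values; it is an illustration of the uniform quantizer described above, not the released code.

```python
import torch

def uniform_quantize(theta: torch.Tensor, bits: int = 8, symmetric: bool = False):
    """Fake-quantize a tensor with b-bit uniform quantization (Eq. 4)."""
    if symmetric:                               # symmetric clipping range (e.g., for weights)
        u = theta.abs().max()
        l = -u
    else:                                       # asymmetric MinMax (e.g., for activations)
        l, u = theta.min(), theta.max()
    scale = (u - l) / (2 ** bits - 1)           # step size (u - l) / (2^b - 1)
    q = torch.round((theta - l) / scale).clamp(0, 2 ** bits - 1)
    return q * scale + l                        # de-quantize back to the FP domain

w = torch.randn(64, 64)
w_q = uniform_quantize(w, bits=4, symmetric=True)
```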
III-B PSAQ-ViT: Patch Similarity Metric
As stated before, ViTs do not have BN layers that store information about the original training data, leaving no absolute value prior available to generate samples for performing quantization, which is the major challenge of data-free quantization for ViTs. Therefore, the previous version of this work, PSAQ-ViT, mines deeper into the prior information of the pre-trained ViTs and explores a reliable relative value metric that can well describe the general difference between Gaussian noise and real images, so that this difference can be reduced to make the Gaussian noise approximate real images.

As the unique structure in ViTs, the self-attention module learns inter-patch feature representations, which are believed to contain a certain amount of original data information. Consequently, we provide an in-depth analysis of the inference process of the self-attention module, and we observe that the reason the model can make good decisions is that the self-attention module can distinguish the foreground from the background of the training data, thus allocating more attention to the foreground that is more important for the decision. Since the inputs of ViTs are independent vectors mapped from 2D patches, the responses of the self-attention module to different patches are significantly different, i.e., the foreground patches receive more attention. More specifically, when the pre-trained model executes inference, real images consistently produce the above features, while Gaussian noise, whose foreground is not easily extracted, does not have a similar capability and inevitably leads to homogeneous responses, as shown in Fig. 2. Note that the real images here are only used to verify the general difference (i.e., a certain metric of the real images is always larger than that of Gaussian noise), and they are not involved in any subsequent process. Therefore, this general difference can indirectly represent the prior information of ViTs and thus can be used to design the relative value metric to guide the sample generation.
Based on the above insights, we work on designing a reliable metric that can measure the diversity of the self-attention module’s responses. For the $l$-th block in ViTs, the output of the MSA module is defined as $A_l \in \mathbb{R}^{B \times H \times N \times d}$, where each dimension denotes the batch size, number of heads, number of patches, and hidden size, respectively. To simplify the expression, we ignore the batch dimension, i.e., $A_l \in \mathbb{R}^{H \times N \times d}$.
Since we seek a relative value metric, it is necessary to first normalize $A_l$ to ensure the fairness of the comparison. We accomplish this by calculating the cosine similarity between each pair of vectors in the patch dimension, which restricts the data range to [-1, 1], as follows:
$\Gamma(u_i, u_j) = \frac{u_i \cdot u_j}{\|u_i\|_2\, \|u_j\|_2}$   (5)
where the numerator represents the inner product of the vectors, and $\|\cdot\|_2$ denotes the $\ell_2$ norm. Here, $u_i$ ($u_j$) is the $i$-th ($j$-th) vector in the patch dimension of $A_l$, and $\Gamma(u_i, u_j)$ represents the cosine similarity between $u_i$ and $u_j$. After pairwise calculations, we obtain the $l$-th block’s cosine similarity matrix $\Gamma_l \in \mathbb{R}^{N \times N}$, which is a symmetric matrix and is termed patch similarity. The diversity of patch similarity can potentially represent the diversity of the original data, which not only elegantly achieves data normalization, but also has the additional advantage of a reasonable dimensionality reduction (from $H \times N \times d$ to $N \times N$). For instance, for the DeiT-B model, the amount of data is reduced by a factor of 3.92, which can greatly improve the subsequent computational efficiency.
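A minimal sketch of how the patch-similarity matrix of Eq. (5) could be computed from an MSA output; folding the head dimension into the feature dimension (so that each patch corresponds to a single vector) is our assumption here, chosen to be consistent with the $N \times N$ shape stated above.

```python
import torch
import torch.nn.functional as F

def patch_similarity(attn_out: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matrix over the patch dimension (Eq. 5).

    attn_out: MSA output of one block with shape (H, N, d), batch dimension dropped.
    Returns the N x N patch-similarity matrix Gamma_l.
    """
    H, N, d = attn_out.shape
    u = attn_out.permute(1, 0, 2).reshape(N, H * d)   # one (H*d)-dim vector per patch
    u = F.normalize(u, dim=-1)                        # unit-norm rows
    return u @ u.t()                                  # entries lie in [-1, 1]
```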
Then, the diversity of patch similarity is measured by information entropy, which can represent the amount of information expressed by the input data. To ensure gradient back-propagation, we employ the differential entropy, which has a continuous nature. More intuitively, the $l$-th block’s differential entropy of patch similarity for the generated input data $G$ is calculated as follows:
$\mathcal{H}(\Gamma_l) = -\int \hat{f}(x)\, \log \hat{f}(x)\, dx$   (6)
where $\hat{f}(x)$ is the continuous probability density function of $\Gamma_l$, which is obtained using kernel density estimation as follows:
$\hat{f}(x) = \frac{1}{m h}\sum_{i=1}^{m} K\!\left(\frac{x - x_i}{h}\right)$   (7)
where $K$ is the kernel (e.g., a normal kernel), $h$ is the bandwidth, $x_i$ ($i = 1, \dots, m$) is a training point drawn from $\Gamma_l$ that serves as the center of a kernel, and $x$ is the given test point.
Finally, we sum the differential entropy of each block to account for the diversity of patch similarity across all blocks, and the Patch Similarity Entropy Loss $\mathcal{L}_{PSE}$ is defined as follows:
$\mathcal{L}_{PSE} = \sum_{l=1}^{L} \mathcal{H}(\Gamma_l)$   (8)
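The entropy of Eqs. (6)-(8) can be estimated numerically in a differentiable way; the sketch below assumes a Gaussian kernel, a fixed bandwidth, and rectangle-rule integration over a grid on [-1, 1], all of which are illustrative choices rather than the authors' exact settings.

```python
import math
import torch

def differential_entropy(gamma: torch.Tensor, bandwidth: float = 0.01,
                         grid_size: int = 256) -> torch.Tensor:
    """Differentiable estimate of H(Gamma_l) via Gaussian KDE (Eqs. 6-7)."""
    x = gamma.flatten()                                   # similarity values in [-1, 1]
    grid = torch.linspace(-1.0, 1.0, grid_size, device=x.device)
    # kernel density estimate at each grid point: mean_i K((g - x_i) / h) / h
    diff = (grid.unsqueeze(1) - x.unsqueeze(0)) / bandwidth
    f = torch.exp(-0.5 * diff ** 2).mean(dim=1) / (bandwidth * math.sqrt(2 * math.pi))
    f = f.clamp_min(1e-12)
    dx = grid[1] - grid[0]
    return -(f * f.log()).sum() * dx                      # numerical -∫ f log f dx

def pse_loss(gammas) -> torch.Tensor:
    """Patch Similarity Entropy loss: sum of per-block entropies (Eq. 8)."""
    return sum(differential_entropy(g) for g in gammas)
```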
III-C Adaptive Teacher-Student Strategy
To enhance the performance of data-free quantization in PSAQ-ViT, we introduce an adaptive teacher-student strategy, which drives the quantized model (student) to imitate the FP model (teacher) with the help of the generated samples. Here, adaptivity is reflected in the fact that instead of a one-time synthesis, the generated samples evolve alternately with the quantized model in a cycle. This competitive and interactive nature forces the gradual emergence of new features in the generated samples, and thus facilitates the quantized model to constantly learn new knowledge. Specifically, they cyclically play a two-player minimax game with respect to the model discrepancy as follows:
$\min_{Q}\; \max_{G}\;\; \mathcal{D}\big(O_Q(G),\, O_{FP}(G)\big)$   (9)
where $O_Q(G)$ and $O_{FP}(G)$ are the output logits of the quantized model and the FP model, respectively, when the input is the generated samples $G$.
In typical real-data-driven knowledge transfer or distillation applications, Kullback-Leibler (KL) divergence is a popular metric for estimating the model discrepancy. However, our minimax game differs from these cases, in which the student can greedily perform one-time learning from the soft targets produced by the teacher, and thus KL divergence contributes little here. Specifically, the distance described by KL divergence is loose, which shrinks the optimization space after the quantized model converges to the currently generated samples, i.e., the KL divergence results are similar when inputting samples with new features, which causes gradient decay and thus hinders the competition and constant evolution of the generated samples and the quantized model. Thus, we employ a more robust function, the Mean Absolute Error (MAE), to construct a better search space, and the Discrepancy Loss is defined as follows:
$\mathcal{L}_{D} = \frac{1}{n}\sum_{i=1}^{n} \big| O_Q(G)_i - O_{FP}(G)_i \big|$   (10)
where $n$ is the length of the output logits, i.e., the category number for classification and the prediction map size for detection and segmentation.
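In code, the Discrepancy Loss of Eq. (10) reduces to an L1 distance between the two output tensors; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def discrepancy_loss(logits_q: torch.Tensor, logits_fp: torch.Tensor) -> torch.Tensor:
    """MAE between quantized-model and FP-model outputs (Eq. 10).

    Both tensors have shape (batch, n), where n is the category number for
    classification or the flattened prediction-map size for detection/segmentation.
    """
    return F.l1_loss(logits_q, logits_fp)
```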
In practice, each cycle of the above minimax game can be divided into two stages, each of which is discussed in detail below. In stage 1, we fix the FP model and the quantized model and only update the generated samples by maximizing the model discrepancy $\mathcal{D}$. Intuitively, the purpose of this action is to encourage generating more confusing samples. For this purpose, new features are forced to emerge gradually in the generated samples $G$, which can drive the samples to discover missing representations and potentially increase their informativeness and diversity. In stage 2, the FP model and the generated samples are fixed, and the model discrepancy $\mathcal{D}$ is minimized to update the quantized model $Q$, which drives the quantized model to imitate the FP model and obtain promising performance. In this manner, the two stages are executed alternately, allowing for the cyclic evolution of the generated samples and the quantized model.
III-D The Overall Pipeline
We fuse the patch similarity metric and the adaptive teacher-student strategy to obtain the overall scheme, which is summarized in Algorithm 1. Following the teacher-student strategy, the overall pipeline is also performed in a two-stage cycle consisting of sample generation and quantization learning. The procedure and loss functions of each stage are described individually below.
Sample generation In this stage, both the diversity of patch similarity and the model discrepancy are considered to search for a satisfactory sample space. Specifically, we combine the Patch Similarity Entropy Loss $\mathcal{L}_{PSE}$ and the Discrepancy Loss $\mathcal{L}_{D}$, and since they are to be maximized, the loss for sample generation is defined as follows:
$\mathcal{L}_{G} = -\big(\mathcal{L}_{PSE} + \alpha\, \mathcal{L}_{D}\big)$   (11)
where $\alpha$ is a balance coefficient. Note that in contrast to PSAQ-ViT, there is no auxiliary category prior in the above equation, which ensures the generality of the scheme.
Quantization learning In this stage, we utilize the samples generated in the former stage to facilitate the learning of the quantized model under the supervision of the FP model. However, since the batch size of the generated samples is small (e.g., 32), overfitting typically appears in quantization learning, which poses a challenge to the constant and effective learning of the quantized model. A naive approach is to simply increase the batch size in the former stage, but this brings unexpected computational overhead. Therefore, we employ image augmentation to cost-effectively expand the sample capacity, following the method in SimCLR [53], whose augmentation policy includes random crop (with flip and resize), color distortion, and Gaussian blur.
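A possible torchvision composition of the SimCLR-style policy mentioned above (random crop with flip and resize, color distortion, Gaussian blur); the specific magnitudes and kernel size below are illustrative defaults, not the paper's exact settings.

```python
import torchvision.transforms as T

# SimCLR-style augmentation applied to each generated mini-batch.
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),                  # random crop + resize
    T.RandomHorizontalFlip(),                                    # flip
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),   # color distortion
    T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),            # Gaussian blur
])

# usage: expanded = torch.stack([augment(img) for img in generated_batch])
```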
Then, we input the augmented samples $\tilde{G}$ into the FP model and the quantized model, calibrate the quantization clipping values $l$ and $u$, and update the quantized model by minimizing the model discrepancy via the following loss:
$\mathcal{L}_{Q} = \frac{1}{n}\sum_{i=1}^{n} \big| O_Q(\tilde{G})_i - O_{FP}(\tilde{G})_i \big|$   (12)
After several iterations of learning in the above two stages, the generated samples and the quantized model reach an ideal balance point, where the model discrepancy no longer changes with further iterations, i.e., the generated samples cannot pull apart the FP model and the quantized model, and the quantized model has learned all the transferred knowledge. In this case, the quantized model is considered to be functionally identical to the FP model.
It is also worth noting that all the objective functions are task-independent, thus the proposed scheme is general-purpose and compatible with a broad range of vision tasks; they are also model-independent, which allows for different teacher and student model structures, e.g., the learning of quantized DeiT-T can be performed under the supervision of DeiT-B.
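Since Algorithm 1 is not reproduced here, the following is a minimal PyTorch-style sketch of one cycle of the overall pipeline, under several assumptions: both models return (logits, patch-similarity matrices), `opt_g` optimizes only the sample tensor, the helper functions are the illustrative ones sketched earlier, and the step count is hypothetical.

```python
import torch

def one_cycle(fp_model, q_model, samples, opt_g, opt_q,
              pse_loss, discrepancy_loss, augment,
              alpha=1.0, g_steps=50):
    """One cycle of the adaptive two-stage minimax game (sketch)."""
    fp_model.eval()                                    # the teacher is always frozen

    # Stage 1: fix both models, update the generated samples G by maximizing
    # the discrepancy together with the patch-similarity entropy (Eq. 11).
    for _ in range(g_steps):
        opt_g.zero_grad()
        out_fp, gammas = fp_model(samples)             # assumed to also expose patch similarities
        out_q, _ = q_model(samples)
        loss_g = -(pse_loss(gammas) + alpha * discrepancy_loss(out_q, out_fp))
        loss_g.backward()
        opt_g.step()                                   # opt_g updates only `samples`

    # Stage 2: fix the teacher and the samples, update the quantized model Q
    # on augmented samples by minimizing the discrepancy (Eq. 12).
    aug = torch.stack([augment(s) for s in samples.detach()])
    opt_q.zero_grad()
    with torch.no_grad():
        out_fp, _ = fp_model(aug)
    out_q, _ = q_model(aug)
    loss_q = discrepancy_loss(out_q, out_fp)
    loss_q.backward()
    opt_q.step()
```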
IV Experiments
IV-A Implementation Details
Models and datasets We evaluate PSAQ-ViT V2 on various popular models for image classification, object detection, and semantic segmentation, using the following models and datasets.
• Image classification: We perform classification experiments on the ImageNet [54] dataset, which has 1.28M training images and 50K validation images from 1,000 classes. DeiT-T/S/B [34] and Swin-T/S [35] are adopted as the baseline models. Since ViT [33] and DeiT share the same model structure, we do not evaluate ViT in this work. The pre-trained models are obtained from the timm [55] framework.
• Object detection: Object detection and instance segmentation experiments are conducted on the COCO 2017 [56] dataset, which contains 118K training, 5K validation, and 20K test-dev images. The baseline is the Cascade Mask R-CNN [57] framework in mmdetection [58] with DeiT-S and Swin-S as the backbones.
• Semantic segmentation: We adopt the ADE20K [59] dataset to evaluate the performance for semantic segmentation. It has 25K images in total, with 20K for training, 2K for validation, and another 3K for testing, covering a broad range of 150 semantic categories. With DeiT-S and Swin-S as the backbones, we take the UperNet [60] framework in mmsegmentation [61] as the baseline.
Comparison methods It is naturally necessary to compare with the previous version, PSAQ-ViT; however, comparing with PSAQ-ViT alone is unfair and inadequate, because PSAQ-ViT only performs quantization parameter calibration without a learning process, and it contains the category prior, which is not applicable to detection and segmentation tasks. Therefore, we build a reasonable comparison method, called Standard V2, on our own. Standard V2 is distinguished from Standard in the previous version: it introduces the model interaction and directly uses real training data with the Discrepancy Loss $\mathcal{L}_{D}$ for teacher-student learning, and it traverses the training set once to reach convergence. For a fair comparison, we make sure to use the same number of samples; more intuitively, in PSAQ-ViT V2, the product of the number of iterations, q_step, and the batch size is equal to the size of the training set. All other settings for Standard V2 and PSAQ-ViT V2 are the same.
Experimental settings All experiments in this work are implemented in PyTorch. For a fair demonstration of effectiveness, we employ the most basic quantization method, i.e., for weights, symmetric uniform quantization with the vanilla MinMax strategy is applied; for activations, asymmetric uniform quantization is applied, and the default strategy is vanilla MinMax unless specifically declared otherwise. We adopt the Adam [62] optimizer in both the sample generation and quantization learning stages, where the search space of the learning rate is {0.2, 0.25} for sample generation and {5e-7, 1e-6, 2e-6} for quantization learning. Although further hyperparameter tuning may achieve better accuracy, for uniformity, all our experiments are conducted with batch size 32 and weight decay 1e-4. The hyperparameter $\alpha$ is set to 1.0 after a simple grid search, and its selection has little effect on the final performance. All experiments are done on a single NVIDIA GeForce RTX 3090 GPU.
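For reference, the settings above can be gathered into a small configuration sketch; the values are those reported in the text, while the dictionary layout itself is only illustrative.

```python
# Hyperparameters reported in the text (search spaces shown as lists).
config = {
    "weight_quant":     {"scheme": "symmetric uniform",  "strategy": "MinMax"},
    "act_quant":        {"scheme": "asymmetric uniform", "strategy": "MinMax"},
    "optimizer":        "Adam",
    "lr_generation":    [0.2, 0.25],             # sample-generation stage
    "lr_quantization":  [5e-7, 1e-6, 2e-6],      # quantization-learning stage
    "batch_size":       32,
    "weight_decay":     1e-4,
    "alpha":            1.0,                     # balance coefficient in Eq. (11)
}
```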
IV-B Quantization Results for Image Classification
We start by discussing the quantization results for ImageNet image classification, compared with the previous version PSAQ-ViT and the real-data-driven Standard V2. Interestingly, our proposed PSAQ-ViT V2, without access to the original dataset, can even slightly outperform Standard V2. This is partly thanks to the Patch Similarity Entropy Loss $\mathcal{L}_{PSE}$. It forces the distinction between foreground and background in the generated samples according to the response of the self-attention module, and when these samples are then utilized to calibrate the quantization parameters, they in turn reinforce the functionality of the self-attention module, acting as positive feedback that can reduce activation outliers to some extent and therefore improve the tolerance to parameter clipping. The Discrepancy Loss $\mathcal{L}_{D}$ also contributes to the surprising performance. It encourages increasing the discrepancy between the FP model and the quantized model to force the emergence of new hard-to-discriminate features in the samples, and these features can be regarded as support vectors (aka hard instances) that contain more typical and instructive knowledge, thus promoting more effective learning of the quantized model.
| Model | Method | No Data | Prec. | Size (MB) | BOPS (G) | Top-1 (%) |
|---|---|---|---|---|---|---|
| DeiT-T | Baseline | – | FP | 20 | 1106 | 72.21 |
| | PSAQ-ViT | ✓ | W4/A8 | 2.5 | 34.6 | 65.57 |
| | Standard V2 | ✗ | W4/A8 | 2.5 | 34.6 | 68.43 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 2.5 | 34.6 | 68.61 |
| | PSAQ-ViT | ✓ | W8/A8 | 5 | 69.1 | 71.56 |
| | Standard V2 | ✗ | W8/A8 | 5 | 69.1 | 72.06 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 5 | 69.1 | 72.17 |
| DeiT-S | Baseline | – | FP | 88 | 4710 | 79.85 |
| | PSAQ-ViT | ✓ | W4/A8 | 11 | 147 | 73.23 |
| | Standard V2 | ✗ | W4/A8 | 11 | 147 | 75.98 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 11 | 147 | 76.36 |
| | PSAQ-ViT | ✓ | W8/A8 | 22 | 294 | 76.92 |
| | Standard V2 | ✗ | W8/A8 | 22 | 294 | 79.24 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 22 | 294 | 79.56 |
| DeiT-B | Baseline | – | FP | 344 | 17920 | 81.85 |
| | PSAQ-ViT | ✓ | W4/A8 | 43 | 560 | 77.05 |
| | Standard V2 | ✗ | W4/A8 | 43 | 560 | 79.17 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 43 | 560 | 79.49 |
| | PSAQ-ViT | ✓ | W8/A8 | 86 | 1120 | 79.10 |
| | Standard V2 | ✗ | W8/A8 | 86 | 1120 | 81.26 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 86 | 1120 | 81.52 |
| Swin-T | Baseline | – | FP | 116 | 4608 | 81.35 |
| | PSAQ-ViT | ✓ | W4/A8 | 14.5 | 144 | 71.79 |
| | Standard V2 | ✗ | W4/A8 | 14.5 | 144 | 75.51 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 14.5 | 144 | 76.28 |
| | PSAQ-ViT | ✓ | W8/A8 | 29 | 288 | 75.35 |
| | Standard V2 | ✗ | W8/A8 | 29 | 288 | 79.62 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 29 | 288 | 80.21 |
| Swin-S | Baseline | – | FP | 200 | 8909 | 83.20 |
| | PSAQ-ViT | ✓ | W4/A8 | 25 | 278 | 75.14 |
| | Standard V2 | ✗ | W4/A8 | 25 | 278 | 78.22 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 25 | 278 | 78.86 |
| | PSAQ-ViT | ✓ | W8/A8 | 50 | 557 | 76.64 |
| | Standard V2 | ✗ | W8/A8 | 50 | 557 | 81.42 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 50 | 557 | 82.13 |
Table I shows the quantization results for various models at different quantization bit-precisions. For the quantization of DeiT-T, our proposed PSAQ-ViT V2 improves by 3.04% over the previous version at W4/A8 precision, and achieves 72.17% accuracy at W8/A8 precision with only a 5 MB memory footprint. DeiT-S quantized with our method improves by 0.38% and 0.32% over the real-data-driven Standard V2 at W4/A8 and W8/A8 precisions, respectively. Similar to the previous models, the quantization results of DeiT-B also show that PSAQ-ViT V2 achieves the best performance, with 2.44% and 0.32% improvements over PSAQ-ViT and Standard V2 at W4/A8 precision, respectively, and only a 0.33% accuracy reduction compared to the FP model at W8/A8 precision with 4-fold compression. When using the vanilla MinMax quantization strategy, the parameter distribution of Swin is highly sensitive to discretization; for instance, PSAQ-ViT produces 9.56% and 6.00% accuracy degradation in the W4/A8 and W8/A8 quantization of Swin-T, respectively. Fortunately, the proposed teacher-student strategy updates the parameter distribution in the model to make it quantization-friendly, and thus achieves significant accuracy compensation of 4.49% and 4.86% at W4/A8 and W8/A8 precisions, respectively. The adaptive nature of our method is also more evident on Swin; for instance, for the W4/A8 and W8/A8 quantization of Swin-S, our method achieves performance improvements of 0.64% and 0.71% over Standard V2, respectively.
Results of combining with advanced strategies Since our method is orthogonal to the quantization strategies, we combine PSAQ-ViT V2 with advanced quantization strategies to further improve the performance, as shown in Table II. First, we employ exponential moving average (EMA), which smooths the quantization clipping values, to replace vanilla MinMax, achieving an accuracy gain of 0.97% and 1.05% in the W4/A8 quantization of DeiT-T and Swin-T, respectively. Moreover, as mentioned before, the model prior and loss functions in PSAQ-ViT V2 are model-independent, thus allowing for different teacher and student model structures. Here, we employ the base model to supervise the quantization of the corresponding tiny model. For instance, thanks to the guidance of the powerful FP DeiT-B with 81.85%, the quantization performance of DeiT-T at W4/A8 precision achieves an inspiring 1.25% improvement.
| Model | Method | Advanced | No Data | Prec. | Size (MB) | BOPS (G) | Top-1 (%) |
|---|---|---|---|---|---|---|---|
| DeiT-T | Baseline | – | – | FP | 20 | 1106 | 72.21 |
| | PSAQ-ViT V2 | – | ✓ | W4/A8 | 2.5 | 34.6 | 68.61 |
| | PSAQ-ViT V2 | EMA | ✓ | W4/A8 | 2.5 | 34.6 | 69.58 |
| | PSAQ-ViT V2 | Teacher: DeiT-B | ✓ | W4/A8 | 2.5 | 34.6 | 69.86 |
| Swin-T | Baseline | – | – | FP | 116 | 4608 | 81.35 |
| | PSAQ-ViT V2 | – | ✓ | W4/A8 | 14.5 | 144 | 76.28 |
| | PSAQ-ViT V2 | EMA | ✓ | W4/A8 | 14.5 | 144 | 77.33 |
| | PSAQ-ViT V2 | Teacher: Swin-B | ✓ | W4/A8 | 14.5 | 144 | 77.19 |
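A minimal sketch of the EMA smoothing of clipping values described above, used in place of vanilla MinMax when calibrating activations; the decay factor is an assumed value, not the paper's setting.

```python
import torch

class EMAClip:
    """Exponential-moving-average smoothing of activation clipping values."""
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.l = None
        self.u = None

    def update(self, x: torch.Tensor):
        l, u = x.min().item(), x.max().item()
        if self.l is None:                        # first calibration batch: plain MinMax
            self.l, self.u = l, u
        else:                                     # later batches: smoothed update
            self.l = self.decay * self.l + (1 - self.decay) * l
            self.u = self.decay * self.u + (1 - self.decay) * u
        return self.l, self.u
```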
Analysis of generated samples Fig. 3 shows the visualization results of the generated images (224×224 pixels) using PSAQ-ViT and PSAQ-ViT V2 (including mid and final iterations). Note that these images require only a pre-trained FP model, and not any additional information, especially the original data or any absolute value metrics. Thanks to the joint guidance of the patch similarity metric and the category prior, the generated images in PSAQ-ViT clearly distinguish the foreground from the background, and the foreground is rich in semantic information for a specified category, e.g., the top-left pseudo-label is “teddy bear”. However, the background information in these samples is singular and homogeneous. In contrast, there is no category guidance in this work, resulting in weakly intuitive semantic features in the images, but this in turn helps the samples to improve informativeness and diversity in both foreground and background without limitations, thus more closely matching the real data in overall statistics. In addition, during the cyclic iterations, the samples progressively evolve in competition and interaction, with continuously emerging new features and increasingly evident semantic information, thus facilitating constant and effective learning of the quantized model.
To demonstrate the validity of the Patch Similarity Entropy Loss $\mathcal{L}_{PSE}$ more clearly, we also conduct comparison experiments on the kernel density curves of the patch similarity of each block in the DeiT-B model when the input is a real image, Gaussian noise, and a generated image, as shown in Fig. 4. For the responses to Gaussian noise, the kernel density curves all show a concentrated unimodal shape with a high central value, indicating a high degree of similarity between the patches of Gaussian noise, so that the whole image tends to be treated entirely as background or foreground. Fortunately, the kernel density curves corresponding to our generated images closely approximate those corresponding to the real images. They all show a dispersed bimodal shape, indicating a high diversity of responses, and the left and right peaks of the curves describe inter- and intra-category similarity, respectively, which is in line with the expectation that the images can easily be distinguished into foreground and background.
IV-C Quantization Results for Object Detection
Unlike PSAQ-ViT which is only applicable to the image classification task, PSAQ-ViT V2 is general and compatible with a broad range of vision tasks. Here, the Cascade Mask R-CNN framework with Swin-S/DeiT-S as the backbone, which achieves state-of-the-art (SOTA) performance on object detection and instance segmentation tasks on COCO dataset, is used to perform quantization experiments and the results are reported in Table III. To the best of our knowledge, this is the first attempt to perform data-free quantization of ViTs for object detection and instance segmentation.
When using DeiT-S as the backbone, PSAQ-ViT V2 achieves consistently better performance than Standard V2, with 0.3 higher box Average Precision (AP) and 0.4 higher mask AP at W4/A8 precision, and it achieves 47.3 box AP and 40.8 mask AP at W8/A8 precision, enabling a 4-fold almost lossless compression. In the case of applying Swin-S as the backbone, our method also shows promising performance, outperforming Standard V2 by 0.6 box AP and 0.6 mask AP at W8/A8 precision, which is only 0.9 box AP and 0.6 mask AP lower than the FP model.
| Model | Method | No Data | Prec. | Size (MB) | BOPS (T) | AP (box) | AP (mask) |
|---|---|---|---|---|---|---|---|
| DeiT-S / Cascade Mask R-CNN | Baseline | – | FP | 320 | 910 | 48.0 | 41.4 |
| | Standard V2 | ✗ | W4/A8 | 40 | 28.4 | 44.5 | 38.4 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 40 | 28.4 | 44.8 | 38.8 |
| | Standard V2 | ✗ | W8/A8 | 80 | 56.9 | 46.8 | 40.2 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 80 | 56.9 | 47.3 | 40.8 |
| Swin-S / Cascade Mask R-CNN | Baseline | – | FP | 428 | 858 | 51.8 | 44.7 |
| | Standard V2 | ✗ | W4/A8 | 50.4 | 26.8 | 47.2 | 40.8 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 50.4 | 26.8 | 47.9 | 41.4 |
| | Standard V2 | ✗ | W8/A8 | 107 | 53.6 | 50.3 | 43.5 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 107 | 53.6 | 50.9 | 44.1 |
For a more intuitive view of the performance of the quantized model, we visualize the detection results and compare them with those of the FP model, as illustrated in the first row of Fig. 5. The backbone we employ is Swin-S. It can be seen that the quantized model at W8/A8 precision achieves similar performance to the FP model, with slight differences in some classification scores, e.g., the score for “baseball glove” changes from 0.97 to 0.92. When performing W4/A8 quantization, the hard small object “baseball bat” is misclassified as “tennis racket”, while the results for easy objects such as “person” are unaffected.

IV-D Quantization Results for Semantic Segmentation
Our proposed PSAQ-ViT V2 can also be applied to the semantic segmentation task. We employ the UperNet framework with Swin-S/DeiT-S as the backbone, which is the SOTA scheme on the semantic segmentation task on ADE20K dataset, and the quantization results are shown in Table IV. To the best of our knowledge, this is the first attempt to quantize ViTs for semantic segmentation without any real data.
The discretization of the model parameters has a large impact on the pixel-level semantic segmentation, making the accuracy of the quantized model not as encouraging as the results on previous tasks. For instance, W8/A8 quantization cannot achieve lossless compression, but rather produces an accuracy degradation greater than 1 mean Intersection over Union (mIoU). This effect also acts on Standard V2 using real data, thus the performance advantage of our method over Standard V2 does not change. For the quantization of DeiT-S/UperNet, our proposed PSAQ-ViT V2 achieves 39.9 mIoU and 42.7 mIoU at W4/A8 and W8/A8 precisions, respectively, which are 0.3 mIoU and 0.2 mIoU higher than Standard V2. Similarly, in the W4/A8 and W8/A8 quantization of Swin-S/UperNet, our method also consistently has performance benefits over Standard V2.
In addition, we also perform visualizations of the segmentation results, as illustrated in the second row in Fig. 5. The backbone we employ is Swin-S. As we can see, the quantized model at W8/A8 precision is comparable to the FP model for the overall segmentation performance, while there is a slight difference in the recognition of a few small regions. In the case of compressing the model to W4/A8 precision, the segmentation results are not good for semantically weak regions, for instance, the object in the lower left corner of the image is not well recognized.
| Model | Method | No Data | Prec. | Size (MB) | BOPS (T) | mIoU |
|---|---|---|---|---|---|---|
| DeiT-S / UperNet | Baseline | – | FP | 208 | 1125 | 44.0 |
| | Standard V2 | ✗ | W4/A8 | 26 | 35.2 | 39.6 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 26 | 35.2 | 39.9 |
| | Standard V2 | ✗ | W8/A8 | 52 | 70.3 | 42.5 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 52 | 70.3 | 42.7 |
| Swin-S / UperNet | Baseline | – | FP | 324 | 1063 | 49.3 |
| | Standard V2 | ✗ | W4/A8 | 40.5 | 33.2 | 44.2 |
| | PSAQ-ViT V2 | ✓ | W4/A8 | 40.5 | 33.2 | 44.6 |
| | Standard V2 | ✗ | W8/A8 | 81 | 66.4 | 46.7 |
| | PSAQ-ViT V2 | ✓ | W8/A8 | 81 | 66.4 | 47.2 |
IV-E Ablation Studies
Here, we perform two ablation studies to verify the effectiveness of each component of the proposed method. First, we explore the contributions of the Patch Similarity Entropy Loss $\mathcal{L}_{PSE}$ and the Discrepancy Loss $\mathcal{L}_{D}$ in sample generation, as shown in Table V. Without any optimization, inputting Gaussian noise directly to guide the learning of the quantized model produces severe performance degradation. When only $\mathcal{L}_{D}$ is utilized on Gaussian noise to play the minimax game, there is a slight improvement in accuracy, but it is still far from satisfactory. In summary, in the absence of real data, the quantization performance obtained directly with Gaussian noise is unacceptable. Thus, $\mathcal{L}_{PSE}$-guided optimization to make the Gaussian noise approximate real images is highly necessary. For instance, when performing W4/A8 quantization of DeiT-S, $\mathcal{L}_{PSE}$-guided optimization alone achieves 74.69% accuracy, which is 59.97% higher than Gaussian noise. Further, we add $\mathcal{L}_{D}$ to introduce the model interaction so that the quantized model enjoys a more adequate knowledge transfer, resulting in a better performance with a 1.67% improvement. Similar comparative results are also obtained in the W4/A8 quantization of Swin-S, where using $\mathcal{L}_{PSE}$ improves the accuracy substantially to 77.03% and combining it with $\mathcal{L}_{D}$ further improves it to a satisfactory 78.86%. The above results fully demonstrate the critical roles of $\mathcal{L}_{PSE}$ and $\mathcal{L}_{D}$ in improving the quantization performance.
| Model | $\mathcal{L}_{PSE}$ | $\mathcal{L}_{D}$ | Prec. | Size (MB) | BOPS (G) | Top-1 (%) |
|---|---|---|---|---|---|---|
| DeiT-S | – | – | FP | 88 | 4710 | 79.85 |
| | ✗ | ✗ | W4/A8 | 11 | 147 | 14.72 |
| | ✗ | ✓ | W4/A8 | 11 | 147 | 25.81 |
| | ✓ | ✗ | W4/A8 | 11 | 147 | 74.69 |
| | ✓ | ✓ | W4/A8 | 11 | 147 | 76.36 |
| Swin-S | – | – | FP | 200 | 8909 | 83.20 |
| | ✗ | ✗ | W4/A8 | 25 | 278 | 8.96 |
| | ✗ | ✓ | W4/A8 | 25 | 278 | 20.24 |
| | ✓ | ✗ | W4/A8 | 25 | 278 | 77.03 |
| | ✓ | ✓ | W4/A8 | 25 | 278 | 78.86 |
Second, we compare the performance of different estimation functions for the model discrepancy $\mathcal{D}$, and the results are reported in Table VI. The experimental results verify the previous analysis that KL divergence leads to a dying minimax game, since it produces decaying gradients after converging to the currently generated samples. The reason is that in our minimax game, the student greedily performs one-time learning from the teacher, which leaves little room for optimizing KL divergence, so that the KL divergence results are similar when inputting samples with new features, thus hindering the competition and constant evolution of the generated samples and the quantized model. More specifically, we evaluate the performance of KL divergence for multiple temperature settings {20, 10, 5, 1} and find that as the temperature initially decreases (e.g., from 20 to 10), the constraint on the soft label is tightened and the above issue is slightly alleviated. Intuitively, in the W4/A8 quantization of DeiT-S, the accuracy is improved from 74.56% to 75.23%. However, once the temperature is small, further reduction does not improve the performance any further; for example, similar performance (75.57% and 75.51%) is achieved at temperatures of 5 and 1. This indicates that the above issues are not radically ameliorated. In addition, we introduce an advanced dynamic temperature strategy [64], which offers some improvement over the fixed-temperature scheme; however, it is still weaker than the scheme with the MAE function, with the latter being 0.46% and 0.31% higher in the W4/A8 quantization of DeiT-S and Swin-S, respectively. The above results fully demonstrate the validity of the MAE function, which provides a better search space for the minimax game and thus can effectively solve the gradient decay issue.
| Model | $\mathcal{D}$ | Prec. | Size (MB) | BOPS (G) | Top-1 (%) |
|---|---|---|---|---|---|
| DeiT-S | – | FP | 88 | 4710 | 79.85 |
| | KLD (T=20) | W4/A8 | 11 | 147 | 74.56 |
| | KLD (T=10) | W4/A8 | 11 | 147 | 75.23 |
| | KLD (T=5) | W4/A8 | 11 | 147 | 75.57 |
| | KLD (T=1) | W4/A8 | 11 | 147 | 75.51 |
| | KLD (T: dynamic [64]) | W4/A8 | 11 | 147 | 75.90 |
| | MAE | W4/A8 | 11 | 147 | 76.36 |
| Swin-S | – | FP | 200 | 8909 | 83.20 |
| | KLD (T=20) | W4/A8 | 25 | 278 | 77.02 |
| | KLD (T=10) | W4/A8 | 25 | 278 | 77.67 |
| | KLD (T=5) | W4/A8 | 25 | 278 | 78.04 |
| | KLD (T=1) | W4/A8 | 25 | 278 | 78.11 |
| | KLD (T: dynamic [64]) | W4/A8 | 25 | 278 | 78.55 |
| | MAE | W4/A8 | 25 | 278 | 78.86 |
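For reference, the two discrepancy estimators compared in Table VI can be written as follows; this is a generic sketch of temperature-scaled KL divergence and MAE, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def kld_discrepancy(logits_q, logits_fp, temperature=5.0):
    """Temperature-scaled KL divergence between student and teacher logits."""
    log_p_q = F.log_softmax(logits_q / temperature, dim=-1)
    p_fp = F.softmax(logits_fp / temperature, dim=-1)
    return F.kl_div(log_p_q, p_fp, reduction="batchmean") * temperature ** 2

def mae_discrepancy(logits_q, logits_fp):
    """MAE discrepancy (Eq. 10), which avoids the gradient decay observed with KLD."""
    return F.l1_loss(logits_q, logits_fp)
```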
V Conclusions
In this paper, we propose an enhanced version on top of PSAQ-ViT, called PSAQ-ViT V2, which is a more accurate and general data-free quantization framework for ViTs. Specifically, we follow the relative value metric proposed by PSAQ-ViT and maximize the entropy of patch similarity to facilitate the approximation of Gaussian noise to the real images, which is also the core component of sample generation in PSAQ-ViT V2. On this basis, we introduce an adaptive teacher-student strategy, which facilitates the constant cyclic evolution of the generated samples and the quantized model under the supervision of the FP model. Its competitive and interactive nature can improve the informativeness and diversity of the generated samples and encourage more effective learning of the quantized model, thus significantly improving the quantization accuracy. It is also worth noting that the model prior and objective functions we employ are task- and model-independent, making PSAQ-ViT V2 general-purpose and compatible with various vision tasks, including image classification, object detection, and semantic segmentation. We perform extensive experiments on various models for multiple tasks and consistently obtain SOTA results, which can even outperform the real-data-driven method under the same settings. In addition, for a more intuitive representation, we also visualize the generated samples on the image classification task and the results of the quantized model on the object detection and semantic segmentation tasks.
As part of future work, one could combine more advanced quantization methods (e.g., non-uniform quantization) to achieve lower bit-precision (e.g., 4-bit) compression. Furthermore, one could extend the sample generation method to a wider range of application scenarios, including data-free knowledge distillation and data-free black-box attacks.
References
- [1] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM Computing Surveys (CSUR), 2021.
- [2] K. Han, Y. Wang, H. Chen, X. Chen, J. Guo, Z. Liu, Y. Tang, A. Xiao, C. Xu, Y. Xu, et al., “A survey on vision transformer,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 1, pp. 87–110, 2022.
- [3] M. Kaselimi, A. Voulodimos, I. Daskalopoulos, N. Doulamis, and A. Doulamis, “A vision transformer model for convolution-free multilabel classification of satellite imagery in deforestation monitoring,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [4] S. Alfasly, C. K. Chui, Q. Jiang, J. Lu, and C. Xu, “An effective video transformer with synchronized spatiotemporal and spatial self-attention for action recognition,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [5] Y. Tang, K. Han, Y. Wang, C. Xu, J. Guo, C. Xu, and D. Tao, “Patch slimming for efficient vision transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12165–12174, 2022.
- [6] Z. Hao, J. Guo, D. Jia, K. Han, Y. Tang, C. Zhang, H. Hu, and Y. Wang, “Learning efficient vision transformers via fine-grained manifold distillation,” in Advances in Neural Information Processing Systems, 2021.
- [7] Y. Lin, T. Zhang, P. Sun, Z. Li, and S. Zhou, “Fq-vit: Post-training quantization for fully quantized vision transformer,” in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pp. 1173–1179, 2022.
- [8] Z. Liu, Y. Wang, K. Han, W. Zhang, S. Ma, and W. Gao, “Post-training quantization for vision transformer,” Advances in Neural Information Processing Systems, vol. 34, 2021.
- [9] Z. Li and Q. Gu, “I-vit: Integer-only quantization for efficient vision transformer inference,” arXiv preprint arXiv:2207.01405, 2022.
- [10] R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
- [11] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” in Low-Power Computer Vision, pp. 291–326, Chapman and Hall/CRC.
- [12] D. Zhang, J. Yang, D. Ye, and G. Hua, “Lq-nets: Learned quantization for highly accurate and compact deep neural networks,” in Proceedings of the European conference on computer vision (ECCV), pp. 365–382, 2018.
- [13] A. T. Elthakeb, P. Pilligundla, F. Mireshghallah, T. Elgindi, C.-A. Deledalle, and H. Esmaeilzadeh, “Gradient-based deep quantization of neural networks through sinusoidal adaptive regularization,” arXiv preprint arXiv:2003.00146, 2020.
- [14] T.-W. Chin, I. Pierce, J. Chuang, V. Chandra, and D. Marculescu, “One weight bitwidth to rule them all,” in European Conference on Computer Vision, pp. 85–103, Springer, 2020.
- [15] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” Advances in neural information processing systems, vol. 29, 2016.
- [16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European conference on computer vision, pp. 525–542, Springer, 2016.
- [17] N. Kim, D. Shin, W. Choi, G. Kim, and J. Park, “Exploiting retraining-based mixed-precision quantization for low-cost dnn accelerator design,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 7, pp. 2925–2938, 2020.
- [18] W. Fei, W. Dai, C. Li, J. Zou, and H. Xiong, “General bitwidth assignment for efficient deep convolutional neural network quantization,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- [19] P. Wang, X. He, Q. Chen, A. Cheng, Q. Liu, and J. Cheng, “Unsupervised network quantization via fixed-point factorization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 6, pp. 2706–2720, 2020.
- [20] Z. Li, L. Ma, M. Chen, J. Xiao, and Q. Gu, “Patch similarity aware data-free quantization for vision transformers,” in European conference on computer vision, pp. 154–170, 2022.
- [21] Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer, “Zeroq: A novel zero shot quantization framework,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13169–13178, 2020.
- [22] X. Zhang, H. Qin, Y. Ding, R. Gong, Q. Yan, R. Tao, Y. Li, F. Yu, and X. Liu, “Diversifying sample generation for accurate data-free quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15658–15667, 2021.
- [23] C. Guo, Y. Qiu, J. Leng, X. Gao, C. Zhang, Y. Liu, F. Yang, Y. Zhu, and M. Guo, “Squant: On-the-fly data-free quantization via diagonal hessian approximation,” in International Conference on Learning Representations, 2022.
- [24] H. Yin, P. Molchanov, J. M. Alvarez, Z. Li, A. Mallya, D. Hoiem, N. K. Jha, and J. Kautz, “Dreaming to distill: Data-free knowledge transfer via deepinversion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8715–8724, 2020.
- [25] S. Xu, H. Li, B. Zhuang, J. Liu, J. Cao, C. Liang, and M. Tan, “Generative low-bitwidth data free quantization,” in European Conference on Computer Vision, pp. 1–17, Springer, 2020.
- [26] Y. Zhong, M. Lin, G. Nan, J. Liu, B. Zhang, Y. Tian, and R. Ji, “Intraq: Learning synthetic images with intra-class heterogeneity for zero-shot network quantization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12339–12348, 2022.
- [27] K. Choi, D. Hong, N. Park, Y. Kim, and J. Lee, “Qimera: Data-free quantization with synthetic boundary supporting samples,” Advances in Neural Information Processing Systems, vol. 34, pp. 14835–14847, 2021.
- [28] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “Vivit: A video vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6836–6846, 2021.
- [29] D. Zhou, B. Kang, X. Jin, L. Yang, X. Lian, Z. Jiang, Q. Hou, and J. Feng, “Deepvit: Towards deeper vision transformer,” arXiv preprint arXiv:2103.11886, 2021.
- [30] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8126–8135, 2021.
- [31] K. Wu, H. Peng, M. Chen, J. Fu, and H. Chao, “Rethinking and improving relative position encoding for vision transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10033–10041, 2021.
- [32] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, “Transformer in transformer,” Advances in Neural Information Processing Systems, vol. 34, 2021.
- [33] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- [34] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International Conference on Machine Learning, pp. 10347–10357, PMLR, 2021.
- [35] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022, 2021.
- [36] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in European conference on computer vision, pp. 213–229, Springer, Cham, 2020.
- [37] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” in International Conference on Learning Representations, 2021.
- [38] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, S. Ma, C. Xu, C. Xu, and W. Gao, “Pre-trained image processing transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12299–12310, 2021.
- [39] D. Neimark, O. Bar, M. Zohar, and D. Asselmann, “Video transformer network,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3163–3172, 2021.
- [40] S. Mehta and M. Rastegari, “Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer,” in International Conference on Learning Representations, 2022.
- [41] K. Wu, J. Zhang, H. Peng, M. Liu, B. Xiao, J. Fu, and L. Yuan, “Tinyvit: Fast pretraining distillation for small vision transformers,” in European conference on computer vision, pp. 68–85, 2022.
- [42] Y. Li, X. Dong, and W. Wang, “Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks,” in International Conference on Learning Representations, 2020.
- [43] J. Choi, Z. Wang, S. Venkataramani, P. I.-J. Chuang, V. Srinivasan, and K. Gopalakrishnan, “Pact: Parameterized clipping activation for quantized neural networks,” arXiv preprint arXiv:1805.06085, 2018.
- [44] S. K. Esser, J. L. McKinstry, D. Bablani, R. Appuswamy, and D. S. Modha, “Learned step size quantization,” in International Conference on Learning Representations, 2020.
- [45] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
- [46] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2704–2713, 2018.
- [47] R. Li, Y. Wang, F. Liang, H. Qin, J. Yan, and R. Fan, “Fully quantized network for object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2810–2819, 2019.
- [48] Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev, “Low-bit quantization of neural networks for efficient inference,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pp. 3009–3018, IEEE, 2019.
- [49] D. Wu, Q. Tang, Y. Zhao, M. Zhang, Y. Fu, and D. Zhang, “Easyquant: Post-training quantization via scale optimization,” arXiv preprint arXiv:2006.16669, 2020.
- [50] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, “Up or down? adaptive rounding for post-training quantization,” in International Conference on Machine Learning, pp. 7197–7206, PMLR, 2020.
- [51] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, “Brecq: Pushing the limit of post-training quantization by block reconstruction,” in International Conference on Learning Representations, 2021.
- [52] Z. Yuan, C. Xue, Y. Chen, Q. Wu, and G. Sun, “Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization,” in European conference on computer vision, pp. 191–207, 2022.
- [53] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning, pp. 1597–1607, PMLR, 2020.
- [54] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, Ieee, 2009.
- [55] R. Wightman, “Pytorch image models.” https://github.com/rwightman/pytorch-image-models, 2019.
- [56] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision, pp. 740–755, Springer, 2014.
- [57] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162, 2018.
- [58] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, “MMDetection: Open mmlab detection toolbox and benchmark,” arXiv preprint arXiv:1906.07155, 2019.
- [59] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, “Semantic understanding of scenes through the ade20k dataset,” International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019.
- [60] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European conference on computer vision (ECCV), pp. 418–434, 2018.
- [61] M. Contributors, “MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark.” https://github.com/open-mmlab/mmsegmentation, 2020.
- [62] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [63] M. Van Baalen, C. Louizos, M. Nagel, R. A. Amjad, Y. Wang, T. Blankevoort, and M. Welling, “Bayesian bits: Unifying quantization and pruning,” Advances in neural information processing systems, vol. 33, pp. 5741–5752, 2020.
- [64] A. Jafari, M. Rezagholizadeh, P. Sharma, and A. Ghodsi, “Annealing knowledge distillation,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 2493–2504, 2021.