
GranQ: Granular Zero-Shot Quantization with Unified Layer-Channel Awareness

Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, Sanghyun Park
Yonsei University
{hip9863, jyy1551, hyojoy, skd, sanghyun}@yonsei.ac.kr
Abstract

Zero-shot quantization (ZSQ) enables neural network compression without training data, which is crucial in environments with restricted data access. However, existing ZSQ methods suffer from significant activation loss in low-bit environments owing to their coarse-grained scaling strategy. To address this issue, we propose GranQ, a novel ZSQ approach that leverages layer-channel awareness to minimize quantization error. Unlike conventional layer- or channel-wise quantization, GranQ dynamically adjusts quantization granularity by considering both layer- and channel-level activation distributions. This enables fine-grained quantization while minimizing activation distortion. Additionally, we introduce vectorized activation quantization, which enables efficient parallel computation and reduces computational overhead while preserving accuracy. GranQ achieves superior performance compared with state-of-the-art ZSQ methods that employ quantization-aware training. With these findings, we anticipate that GranQ will inspire novel research directions beyond conventional ZSQ approaches focused on data generation and model training. The official code will be released upon completion of the review.

1 Introduction

Figure 1: Comparison of quantization strategies. Existing methods apply either per-tensor scaling with layer-wise quantization or per-channel scaling. GranQ combines per-channel scaling with per-layer iteration for more granular quantization.

Neural network compression techniques have been extensively studied to make large-scale deep learning (DL) models practically deployable. In particular, reducing model size while minimizing performance degradation is crucial for utilizing DL on edge devices (e.g., mobile phones, embedded systems, and drones). Major approaches to model compression [7, 12] include quantization [24, 30, 17], pruning [28, 16, 33, 6], knowledge distillation [21, 20, 40], and neural architecture search [45, 13]. Among these, quantization has emerged as the most actively studied technique. It serves as an effective compression method by reducing unnecessary representational ranges in the model while maintaining its performance. However, a major limitation of quantization is the need for fine-tuning or calibration to maintain the performance of the full-precision (FP) model [17]. To address this limitation, zero-shot quantization (ZSQ) (i.e., data-free quantization) has been introduced [34, 41], which allows compressed models to maintain their performance without requiring access to the original training data. Since the introduction of ZeroQ [4], studies on ZSQ have advanced in two main directions. The first direction focuses on data generation, where synthetic data are created from the FP model. The second direction focuses on effectively applying the activation distributions of the synthetic data to the quantized (Q) model. This second direction is further divided into post-training quantization (PTQ), which calibrates the activation distributions, and quantization-aware training (QAT), which fine-tunes the Q model directly. We categorize existing ZSQ methods based on these directions, as summarized in Table 1.

Data Generation (PTQ, QAT): ZeroQ (CVPR 20) [4], GDFQ (ECCV 20) [41], DSG (CVPR 21) [43], MixMix (CVPR 21) [31], Qimera (NeurIPS 21) [9], IntraQ (CVPR 22) [44], GENIE (CVPR 23) [25], AdaDFQ (CVPR 23) [36], Causal-DFQ (ICCV 23) [39], TexQ (NeurIPS 23) [5], RIS (AAAI 24) [1], GenQ (ECCV 24) [32]

Calibration (PTQ): SQuant (ICLR 22) [18], UDFC (ICCV 23) [2]

Fine-tuning (QAT): AIT (CVPR 22) [10], PLF (CVPR 24) [14], SynQ (ICLR 25) [26], GranQ (Ours)
Table 1: Categorizing zero-shot quantization algorithms. PTQ denotes post-training quantization, and QAT denotes quantization-aware training. Data generation methods typically support both PTQ and QAT, whereas violet-marked methods utilize fine-tuning for data generation and support only QAT.

First, data generation studies focus on generating high-quality data to effectively train the Q model. A common approach uses batch normalization statistics from the FP model [41]. Recent studies have focused on generating data adaptable to various bit-widths [36] and on using diverse generative models, such as Stable Diffusion guided by text prompts [32]. Meanwhile, calibration (PTQ) studies aim to minimize the quantization error by calibrating the Q model without additional training [18, 2]. Finally, fine-tuning (QAT) studies focus on transferring key information from the FP model to the Q model through knowledge distillation [10, 14, 23, 26].

However, despite extensive studies, severe performance degradation in low-bit quantization remains unresolved. To address this issue, we performed an in-depth analysis of the ZSQ process, focusing on why low-bitwidth settings still suffer from performance loss even after QAT fine-tuning. Our findings reveal that quantization errors mainly stem from the loss of activation values instead of data quality or training methods. Notably, we discovered that layer-wise (per-tensor) quantization fails to accurately preserve activations in ZSQ, leading to coarse and inaccurate representations.

Based on this analysis, we introduce GranQ, a novel ZSQ approach. GranQ independently handles large activations in specific layers and channels by narrowing the range between the maximum and minimum values within each channel (Figure 1). This dynamic adjustment minimizes activation loss and preserves the original activation values by reducing quantization errors, as shown in Figure 2. The proposed method effectively handles activation loss in low-bit quantization and achieves state-of-the-art (SOTA) performance in QAT settings on the CIFAR and ImageNet datasets. Furthermore, we implement vectorized operations for granular quantization with minimal computational overhead, enabling efficient quantization on GPUs. Our contributions can be summarized as follows:

  • We perform a comprehensive analysis of quantization errors in low-bit ZSQ settings. Our findings reveal that conventional activation quantization methods (e.g., layer- and channel-wise) lead to significant activation loss. This issue stems from coarse-grained activation quantization, which fails to preserve fine-grained details.

  • We propose GranQ, a novel method that supports granular quantization and maintains computational efficiency. We introduce a fine-grained quantization mechanism, which considers every channel in each layer, to reduce activation loss. Additionally, we mitigate computational overhead using vectorized activation quantization. To the best of our knowledge, our approach is the first to address the ZSQ problem by refining the quantization operation itself rather than the data generation or training procedure.

  • We achieve SOTA performance over existing ZSQ methods through extensive evaluation. Specifically, on the CIFAR-100 dataset, in the 3-bit quantization setting, GranQ achieves an accuracy of 62.73%, improving by 5.45% over the latest method on ResNet-20. Furthermore, on the CIFAR-10 dataset in the 5-bit quantization setting, GranQ achieves an accuracy of 94.06%, slightly exceeding the FP model performance by 0.17%.

(a) Layer Quantization
(b) GranQ (Ours)
Figure 2: Comparison between (a) layer-wise quantization and (b) the proposed GranQ method with AdaDFQ [36] on the CIFAR-10 dataset. We visualize the activation of the first layer in ResNet-20 for the first image of the first batch. w and a denote weight and activation, respectively. Each sub-figure presents the activation graph for 32-bit FP (left) and 3-bit quantization (right). Whereas layer-wise quantization significantly distorts the original activation, GranQ preserves it with minimal distortion.

2 Related Work

2.1 Quantization

Quantization can reduce the representational range of pre-trained deep neural networks (DNNs) and minimize memory usage. Quantization methods are categorized into PTQ and QAT based on the timing of the range reduction.

PTQ quantizes fully trained DNNs without any further training [17, 3], offering low computational cost and simple implementation. However, because PTQ directly applies quantization to the FP model, it struggles to properly adjust the range during quantization. To overcome this, calibration is used to select optimal quantization ranges by determining scaling factors. In contrast, QAT fine-tunes the Q model with quantization operations integrated into the training process, enabling direct optimization of activations [17]. In particular, it employs the straight-through estimator [42] to estimate gradients.
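For concreteness, the following is a minimal PyTorch sketch of how QAT commonly simulates quantization in the forward pass while letting gradients bypass the rounding via the straight-through estimator; the function name and bit-width are illustrative assumptions, not the implementation used in this paper.

```python
import torch

def fake_quantize_ste(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Asymmetric per-tensor quantize-dequantize used to simulate quantization in QAT.
    qmax = 2 ** bits - 1
    s = (x.max() - x.min()) / qmax        # step size from the activation range
    z = torch.round(-x.min() / s)         # zero-point
    x_q = torch.clamp(torch.round(x / s + z), 0, qmax)
    x_deq = s * (x_q - z)
    # Straight-through estimator: forward returns x_deq, backward treats rounding as identity.
    return x + (x_deq - x).detach()

a = torch.randn(4, 8, requires_grad=True)
fake_quantize_ste(a).sum().backward()
print(a.grad.shape)  # gradients flow despite the non-differentiable rounding
```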

However, both PTQ and QAT have the limitation of requiring access to data. PTQ needs data for calibration, whereas QAT requires it for fine-tuning. In practice, obtaining training data is often challenging. To address this problem, ZSQ has been proposed, which performs quantization without using original data [41].

2.2 Zero-shot Quantization

As summarized in Table 1, ZSQ was initially introduced by ZeroQ [4], leading to the development of various ZSQ algorithms. These studies have explored diverse approaches to improve quantization performance under data-free settings. Recent studies have focused on generating high-quality synthetic data [25, 32] and developing more effective methods to utilize them [10, 23, 26]. However, these approaches are still limited in preserving activations in low-bit settings. Therefore, a new quantization method is required to effectively minimize activation loss. In this study, we propose a novel approach to address this challenge.

3 Preliminaries and Problem Definition

In this section, we provide a new definition of the existing ZSQ problem and introduce the preliminaries. The detailed mathematical notations are listed in Table LABEL:tab6 in the supplementary material.

3.1 Activation Quantization

Activation quantization reduces the precision of intermediate activation values in neural networks by converting them into low-bitwidth representations [8]. This technique is often used alongside weight quantization to further compress the model. Among various activation quantization methods, linear quantization is the most widely used. It uses a normalized scaling factor to adjust the activation values, and subsequently converts them into integers within a fixed bit range [24, 15]. The overall process typically involves three stages: forward quantization, scaling parameter computation, and dequantization [17].

3.1.1 Forward Quantization

The forward quantization process involves mapping continuous floating-point values to an integer space. This process utilizes the quantization operator 𝒬, and the equation is defined as follows:

x_{q} = \mathcal{Q}(x, s, z) = \left\lfloor \frac{x}{s} + z \right\rceil \qquad (1)

Here, 𝒬(x, s, z) denotes a quantization operator that converts a real value x into an integer x_q. The value x is first scaled by a factor s and thereafter shifted using a zero-point z. This zero-point accounts for offsets in asymmetric quantization, particularly when activation values are positive. ⌊·⌉ denotes the rounding function. Consequently, the forward quantization process involves scaling, zero-point adjustment, and rounding to the nearest integer within the target quantized range.
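As a simple illustration of Equation 1, the sketch below applies the quantization operator with an assumed scaling factor and zero-point (their computation follows in Section 3.1.2); the values are illustrative only.

```python
import torch

def quantize(x: torch.Tensor, s: float, z: float, bits: int) -> torch.Tensor:
    # Eq. 1: scale by s, shift by the zero-point z, round to the nearest integer,
    # and keep the result inside the b-bit unsigned range.
    return torch.clamp(torch.round(x / s + z), 0, 2 ** bits - 1)

x = torch.tensor([0.0, 0.7, 1.3, 2.0])
print(quantize(x, s=2.0 / 7, z=0.0, bits=3))  # tensor([0., 2., 5., 7.])
```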

3.1.2 Scaling Parameter Computation

In activation quantization, setting the scaling factor s is crucial for mapping various activation distributions to a fixed integer range. In this study, we employ the asymmetric quantization method, which is the most commonly used for activation quantization [29]. We calculate the scaling factor s and the zero-point z as follows:

s = \frac{x_{\max} - x_{\min}}{2^{b} - 1}, \quad z = \left\lfloor \frac{-x_{\min}}{s} \right\rceil \qquad (2)

Here, x_max, x_min, and b represent the maximum and minimum activation values and the number of quantization bits, respectively. The zero-point z is the offset used to shift activation values. x_max and x_min determine the range of the activation distribution, which directly affects the scaling factor s. Specifically, when x_max − x_min is large, s increases, which enlarges the quantization step size and makes it harder to preserve fine-grained activations. Conversely, when x_max − x_min is small, s decreases, allowing finer value distinctions and a higher-quality representation.
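A small sketch of Equation 2 for the asymmetric setting; this is an illustration, not the official code.

```python
import torch

def scale_and_zero_point(x: torch.Tensor, bits: int):
    # Eq. 2: the activation range (x_max - x_min) determines the step size s,
    # and the zero-point z maps x_min to the integer 0.
    x_min, x_max = x.min(), x.max()
    s = (x_max - x_min) / (2 ** bits - 1)
    z = torch.round(-x_min / s)
    return s, z

x = torch.tensor([-1.0, 0.0, 3.0])
s, z = scale_and_zero_point(x, bits=3)
print(s.item(), z.item())  # s = 4/7 ≈ 0.571, z = 2.0
```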

3.1.3 Dequantization

The dequantization process converts quantized integer values back into their real-valued representation, as shown in the following equation:

x_{\text{deq}} = s\,(x_{q} - z) \qquad (3)

In this process, the zero-point z is subtracted to reverse the forward quantization, and the result is multiplied by the scaling factor s to obtain the final value x_deq. Dequantization restores the quantized activation values to their original real-valued range and is essential during model inference for converting the final predictions back into real values.
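The following sketch completes the quantize-dequantize round trip of Equations 1-3 on a toy tensor; the values are illustrative, and the small quantization error introduced by rounding is visible in the output.

```python
import torch

def dequantize(x_q: torch.Tensor, s: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    # Eq. 3: undo the zero-point shift and rescale back to real values.
    return s * (x_q - z)

bits = 3
x = torch.tensor([-1.0, -0.2, 0.0, 1.4, 3.0])
s = (x.max() - x.min()) / (2 ** bits - 1)                     # Eq. 2
z = torch.round(-x.min() / s)
x_q = torch.clamp(torch.round(x / s + z), 0, 2 ** bits - 1)   # Eq. 1
print(dequantize(x_q, s, z))  # ≈ [-1.14, 0.00, 0.00, 1.14, 2.86]; note -0.2 collapses to 0.0
```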

Figure 3: Visualization of activation magnitudes for the first image in the first batch using ResNet-20 on CIFAR-10. The left column illustrates activations from the FP model, whereas the right column presents activations after layer-wise quantization. Significant distortions are observed in certain layers, particularly in the final layer.

3.2 Problem Definition

Existing activation quantization methods are designed based on layer-wise quantization to minimize computational cost. However, fixed scaling factors struggle to handle varying activations. This issue becomes more severe under low-bitwidth settings in ZSQ environments. The following outlines the primary challenges faced in ZSQ under low-bitwidth settings.
(Problem 1) Coarse-grained activation representation
In conventional activation quantization approaches, the scaling factor ss is fixed per layer or channel, mapping all activations to the same range. However, in low-bitwidth settings, this approach struggles to capture fine-grained activation variations from generated data.
(Problem 2) Severe quantization error in certain layers
During quantization, certain layers induce severe quantization errors, leading to significant activation distortions. These distortions substantially differ from the original distribution and can directly impact the performance of the model.

4 Observation and Methodology

We observe the causes of activation distortion during quantization and analyze how to mitigate them. Based on this, we introduce the proposed method known as GranQ.

4.1 Observation

ResNet-20 | Layer-Q | Channel-Q | GranQ
Avg. cosine similarity X·Q / (‖X‖₂‖Q‖₂) (↑) | 0.5111 | 0.5602 | 0.6835
Avg. relative error ‖X − Q‖₂ / ‖X‖₂ (↓) | 0.3129 | 0.2528 | 0.1063
Table 2: Comparative analysis of averaged cosine similarity and relative error over all layers and batches. Higher cosine similarity (\uparrow) indicates better retention of activation information, whereas lower relative error (\downarrow) signifies reduced quantization distortion. Layer-Q and Channel-Q denote layer- and channel-wise quantization, respectively, whereas GranQ integrates both for improved quantization.

Layer-wise quantization effectively captures the activation distribution of each layer. However, it becomes inefficient when activation distributions within a layer vary significantly along the channel axis. In ZSQ in particular, as shown in the FP graph of Figure 2(a), activation distributions can differ substantially across channels even within the same layer. As a result, the limited representation bits are forced to cover a wide range of activation values, making fine-grained quantization difficult. This phenomenon appears consistently across all layers; we analyze it and summarize the results in Table 2, which reports the average cosine similarity X·Q / (‖X‖₂‖Q‖₂) and the relative error ‖X − Q‖₂ / ‖X‖₂. Table 2 shows that both layer- and channel-wise quantization fail to preserve the original activation effectively, indicating that methods which process activations independently at only the layer or only the channel level are not suitable for ZSQ. In contrast, GranQ, which considers layers and channels simultaneously, achieves 1.34× higher similarity and 2.94× lower relative error than layer-wise quantization, demonstrating its effectiveness in preserving activations.
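The two metrics in Table 2 can be computed as follows from a full-precision activation and its dequantized counterpart; the tensors below are placeholders rather than activations from the actual models.

```python
import torch

def activation_fidelity(x_fp: torch.Tensor, x_q: torch.Tensor):
    # Cosine similarity (higher is better) and relative L2 error (lower is better),
    # as reported in Table 2, computed on flattened activations.
    x, q = x_fp.flatten(), x_q.flatten()
    cos = torch.dot(x, q) / (x.norm(p=2) * q.norm(p=2))
    rel_err = (x - q).norm(p=2) / x.norm(p=2)
    return cos.item(), rel_err.item()

x_fp = torch.randn(16, 64, 8, 8)            # placeholder FP activation
x_q = x_fp + 0.1 * torch.randn_like(x_fp)   # placeholder dequantized activation
print(activation_fidelity(x_fp, x_q))
```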

Furthermore, Figure 3 provides a visual representation of the effect of quantization on activation values across layers. Under layer-wise quantization, although the overall activations generally follow the FP activations to some extent, certain layers map fine activation values to coarse discrete values, resulting in coarse representations. These coarse representations are a major factor in the decline of the model's overall expressiveness and inference performance. Further analysis of Figure 3 is provided in the supplementary material, in Figures LABEL:sup_fig3 and LABEL:sup_fig1.

4.2 Granular Quantization

Based on our analysis, we propose GranQ (granular layer-channel quantization), a fine-grained ZSQ approach designed to minimize computational overhead, as shown in Figure 4. GranQ combines layer and channel information using layer-channel scaling and improves efficiency through vectorized quantization.

4.2.1 Layer-Channel Scaling

We propose layer-channel scaling, which enables GranQ to assign individual scaling factors to each channel, enhancing quantization precision. This approach reduces the impact of outlier activations in specific channels on the quantization process. The layer-channel scaling is defined as follows:

\vec{x}_{\min} = \min_{h,w}(x_{l}), \quad \vec{x}_{\max} = \max_{h,w}(x_{l}), \quad \text{where } x_{l} \in \mathbb{R}^{C \times H \times W} \qquad (4)
Figure 4: Overview of the GranQ algorithm. ① Input x_l of each layer is separated along the channel axis, with width and height flattened to reduce dimensionality (see Equation 4). ② Scaling is performed by integrating and vectorizing the previously separated per-channel x_min and x_max values (see Equation 5). ③ Quantization and dequantization are performed using the scaled parameters (see Equation 6).
Dataset | Model (FP 32) | Bits | GDFQ (ECCV 20) | ARC+AIT (CVPR 22) | AdaDFQ (CVPR 23) | TexQ (NeurIPS 24) | AIT+RIS (AAAI 24) | PLF (CVPR 24) | GenQ (ECCV 24) | AKT (SAC 25) | SynQ (ICLR 25) | GranQ (Ours)
CIFAR-10 | ResNet-20 (93.89) | 3w3a | 75.11 | - | 84.89 | 86.47 | - | 88.04 | - | 86.76 | 88.11 | 91.37
CIFAR-10 | ResNet-20 (93.89) | 4w4a | 90.25 | 90.49 | 92.31 | 92.68 | 92.59 | 92.47 | - | 92.64 | 92.76 | 93.52
CIFAR-10 | ResNet-20 (93.89) | 5w5a | 93.38 | 92.98 | 93.81 | - | 93.59 | - | - | 93.83 | - | 94.06
CIFAR-100 | ResNet-20 (70.33) | 3w3a | 47.61 | - | 52.74 | 55.87 | - | 57.03 | - | 54.68 | 57.28 | 62.73
CIFAR-100 | ResNet-20 (70.33) | 4w4a | 63.39 | 61.05 | 66.81 | 67.18 | 65.99 | 66.94 | - | 66.94 | 67.34 | 68.79
CIFAR-100 | ResNet-20 (70.33) | 5w5a | 66.12 | 68.40 | 69.93 | - | 69.55 | - | - | 69.75 | - | 70.05
ImageNet | ResNet-18 (71.47) | 3w3a | 20.23 | - | 38.10 | 50.28 | - | - | 68.18 | 49.88 | 52.02 | 64.41
ImageNet | ResNet-18 (71.47) | 4w4a | 60.60 | 65.73 | 66.53 | 67.73 | 67.55 | 67.02 | 70.03 | 65.89 | 67.90 | 70.39
ImageNet | ResNet-18 (71.47) | 5w5a | 68.49 | 70.28 | 70.29 | - | 70.59 | - | - | 69.40 | - | 71.31
ImageNet | MobileNetV2 (73.03) | 3w3a | 1.46 | - | 28.99 | 32.80 | - | - | 59.15 | 30.56 | 34.21 | 62.42
ImageNet | MobileNetV2 (73.03) | 4w4a | 59.43 | 66.47 | 65.41 | 67.07 | - | - | 69.65 | 64.85 | 67.27 | 70.62
ImageNet | MobileNetV2 (73.03) | 5w5a | 68.11 | 71.96 | 71.61 | - | - | - | - | 71.71 | - | 72.49
ImageNet | ResNet-50 (77.73) | 3w3a | 0.31 | - | 17.63 | 25.27 | - | - | 73.99 | 24.50 | 26.89 | 70.76
ImageNet | ResNet-50 (77.73) | 4w4a | 54.16 | 68.27 | 68.38 | 70.72 | 71.54 | 68.97 | 76.10 | 68.75 | 71.05 | 76.63
ImageNet | ResNet-50 (77.73) | 5w5a | 71.63 | 76.00 | 76.03 | - | 76.36 | - | - | 75.90 | - | 77.58
Table 3: Accuracy Evaluation of QAT Methods for ZSQ. w and a represent the weight and activation bit-widths, respectively. Bold values indicate the best accuracy, and underlined values denote the second-best accuracy. † indicates our re-implementation with the official code.

Here, the vectors \vec{x}_{\min} and \vec{x}_{\max} collect the per-channel minimum and maximum values of the activation input x_l of each layer, which has shape \mathbb{R}^{C \times H \times W}. This approach enables independent normalization for each channel, instead of the single scalar used in traditional layer-wise quantization. In the layer-channel scaling step of GranQ, vectorized scaling factors are applied by individually considering the activation distribution of each channel. The main step in layer-channel scaling involves isolating the activations of each channel to compute individual scaling factors, as shown in step ① of Figure 4. Further details of the quantization process will be provided in the next section.
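A sketch of Equation 4: per-channel minima and maxima over the spatial dimensions of a layer's activation. The shapes are illustrative, and this is not claimed to match the official implementation.

```python
import torch

def channel_min_max(x_l: torch.Tensor):
    # x_l has shape (C, H, W); flatten H and W and reduce per channel (Eq. 4),
    # yielding one (min, max) pair, and hence one scaling factor, per channel.
    flat = x_l.reshape(x_l.shape[0], -1)     # (C, H*W)
    return flat.min(dim=1).values, flat.max(dim=1).values

x_l = torch.randn(64, 8, 8)                  # one layer's activation
x_min, x_max = channel_min_max(x_l)
print(x_min.shape, x_max.shape)              # torch.Size([64]) torch.Size([64])
```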

4.2.2 Vectorized Quantization

In the layer-channel scaling stage, the activation tensor of each layer is vectorized along the channel dimension. In the subsequent vectorized quantization stage, the scaling factor \vec{s} and zero-point \vec{z} are also vectorized and computed as shown in Equation 5. The activation values then undergo quantization and dequantization (Equation 6), with all operations running in parallel across channels. During computation, the scaling factor and zero-point of each channel are broadcast over the activation values and applied element-wise.

\vec{s} = \frac{\vec{x}_{\max} - \vec{x}_{\min}}{2^{b} - 1}, \quad \vec{z} = \left\lfloor -\frac{\vec{x}_{\min}}{\vec{s}} \right\rceil \qquad (5)
x_{q,l} = \left\lfloor \frac{x}{\vec{s}} + \vec{z} \right\rceil, \quad x_{\text{deq},l} = \vec{s}\,(x_{q,l} - \vec{z}) \qquad (6)

This process can be applied to every layer in the network, enabling the vectorized quantization of all activation tensors. A detailed computational process is provided in Algorithm LABEL:alg1, available in the supplementary material.
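Putting Equations 4-6 together, the sketch below performs the vectorized per-channel quantize-dequantize in a single broadcasted pass; it is a minimal illustration under the shapes stated above, not the official implementation.

```python
import torch

def granular_quantize(x_l: torch.Tensor, bits: int) -> torch.Tensor:
    # x_l: (C, H, W) activation of one layer.
    flat = x_l.reshape(x_l.shape[0], -1)                   # flatten H and W
    x_min = flat.min(dim=1, keepdim=True).values           # (C, 1)
    x_max = flat.max(dim=1, keepdim=True).values           # (C, 1)
    s = (x_max - x_min) / (2 ** bits - 1)                  # Eq. 5: one scale per channel
    z = torch.round(-x_min / s)                            # Eq. 5: one zero-point per channel
    x_q = torch.clamp(torch.round(flat / s + z), 0, 2 ** bits - 1)  # Eq. 6, broadcast over H*W
    return (s * (x_q - z)).reshape_as(x_l)                 # Eq. 6: dequantize

x_l = torch.randn(64, 8, 8)
print(granular_quantize(x_l, bits=3).shape)                # torch.Size([64, 8, 8])
```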

5 Experiments

In this section, we thoroughly evaluate the effectiveness of GranQ. Experiments are conducted on diverse benchmark datasets, and the results are compared with those of existing ZSQ methods.

5.1 Experimental Setup and Details

The experiments were conducted using widely adopted ZSQ evaluation datasets, including CIFAR-10, CIFAR-100 [27], and ImageNet (ILSVRC 2012) [11] validation datasets. For the CIFAR datasets, ResNet-20 [19] was used as the quantization model, whereas ResNet-18 [19], ResNet-50 [19], and MobileNetV2 [38] were employed for ImageNet. All experiments were conducted using the SGD optimizer [37] with a momentum of 0.9 and weight decay of 1e-4. The CIFAR-10 and CIFAR-100 experiments were each conducted for 200 epochs, with batch sizes of 16 and 200, respectively. For ImageNet, we trained for 400 epochs with a batch size of 16. The initial learning rate was set to 1e-4 for CIFAR-10 and CIFAR-100, and 1e-5 for ImageNet, with multi-step learning rate decay applied. The decay steps were set to 100, 200, and 300 epochs for CIFAR, and at 350 and 400 epochs for ImageNet, with a decay rate of 0.1. We compared our method with existing ZSQ methods [41, 10, 36, 5, 1, 14, 32, 23, 26]. For data generation, we followed the AdaDFQ [36] approach based on ACGAN [35]. Further implementation details are provided in the supplementary material, Section LABEL:sec:Details_of_Experimental_Setup. In our ablation study, layer-wise quantization was applied to all layers containing activation functions, while channel-wise quantization was performed per channel at the batch level.
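For reference, the optimizer and learning-rate schedule described above for the CIFAR experiments can be expressed with standard PyTorch components; `quantized_model` is a placeholder, and this snippet is only a sketch of the stated hyperparameters.

```python
import torch

quantized_model = torch.nn.Linear(10, 10)   # placeholder for the quantized network

# CIFAR settings from Section 5.1: SGD with momentum 0.9, weight decay 1e-4,
# initial learning rate 1e-4, decayed by 0.1 at epochs 100, 200, and 300.
optimizer = torch.optim.SGD(quantized_model.parameters(),
                            lr=1e-4, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 200, 300], gamma=0.1)
```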

CIFAR-100, ResNet-20 (FP: 70.33%)
Bits | GDFQ (Baseline → +GranQ) | Qimera+AIT (Baseline → +GranQ) | AdaDFQ (Baseline → +GranQ) | AdaDFQ+AKT (Baseline → +GranQ)
3w3a | 47.61 → 59.04 (+11.43) | 45.70 → 60.42 (+14.72) | 52.74 → 62.73 (+9.99) | 54.68 → 62.01 (+7.33)
4w4a | 63.39 → 66.97 (+3.58) | 65.80 → 68.08 (+2.28) | 66.81 → 68.79 (+1.98) | 66.94 → 68.77 (+1.83)
5w5a | 66.12 → 68.96 (+2.84) | 69.26 → 70.14 (+0.88) | 69.93 → 70.05 (+0.12) | 69.75 → 70.21 (+0.46)
Table 4: Accuracy of existing SOTA methods with the integration of GranQ. GDFQ and AdaDFQ focus on data generation, whereas Qimera+AIT and AdaDFQ+AKT primarily enhance quantized model training. In the nwma notation, n and m represent the bit-widths of weights and activations, respectively. † denotes our re-implementation.
Method | Layer-Q | Channel-Q | Layer-Channel (Iteration) | Layer-Channel (Vectorization, GranQ)
Accuracy (%) | 49.98 (±0.24) | 51.26 (±0.28) | 62.68 (±0.11) | 62.73 (±0.17)
Table 5: Comparative analysis of four quantization approaches on ResNet-20 accuracy. The experiments were conducted using the AdaDFQ method on CIFAR-100 with 3-bit quantization. Results are averaged over three repetitions with standard deviation.

5.2 Performance Evaluation

We evaluated the performance of GranQ against SOTA ZSQ methods, with results summarized in Table 3. All comparison experiments were conducted under 3, 4, and 5-bit quantization settings.

CIFAR-10/100. As indicated in Table 3, GranQ consistently achieved the highest accuracy across all bitwidths. For CIFAR-10, it attained accuracies of 94.06% (5-bit), 93.52% (4-bit), and 91.37% (3-bit). For CIFAR-100, the results were 70.05% (5-bit), 68.79% (4-bit), and 62.73% (3-bit). Notably, GranQ outperformed SynQ [26], the previous SOTA, by +5.45% in the CIFAR-100 3-bit setting. This result demonstrates its ability to effectively overcome the limitations of conventional low-bitwidth quantization techniques. Overall, GranQ consistently outperformed existing methods across all bitwidths in both CIFAR-10 and CIFAR-100, with particularly strong improvements in the 3-bit quantization setting. Remarkably, in the CIFAR-10 5-bit setting, GranQ even surpassed the performance of the FP model. These results suggest that we can effectively apply GranQ to small- and medium-scale datasets.

ImageNet. In the ImageNet experiments, GranQ achieved competitive performance across various bitwidths. Specifically, for ResNet-18, it attained top accuracies of 70.39% (4-bit) and 71.31% (5-bit). For ResNet-50, it achieved the highest accuracies of 76.63% (4-bit) and 77.58% (5-bit). Additionally, in the MobileNetV2 setting, GranQ achieved SOTA performance across all bitwidths (3, 4, and 5-bit). In the 3-bit setting, GranQ achieved the second-best accuracies, with 64.41% on ResNet-18 and 70.76% on ResNet-50, which are 3.77% and 3.23% lower than those of GenQ, respectively. However, GenQ uses a pre-trained diffusion model [22] and text prompts during data generation, making it considerably slower than typical GAN-based methods [41]. In contrast, GranQ maintains the speed advantage of GAN-based generation while achieving high performance through fine-grained adjustments.

5.3 Ablation Study

5.3.1 Effectiveness Evaluation

GranQ consistently demonstrates performance improvements when applied to various ZSQ methods. As summarized in Table 4, GranQ achieves steady performance improvements when integrated with existing SOTA ZSQ techniques.

First, we analyzed the impact of GranQ on data synthesis-based quantization methods, specifically GDFQ [41] and AdaDFQ [36]. The integration of GranQ into GDFQ [41] led to a notable improvement, with accuracy rising from 47.61% to 59.04% (+11.43%) in the 3-bit setting and from 63.39% to 66.97% (+3.58%) in the 4-bit setting. Similarly, AdaDFQ [36] exhibited improvements of +9.99% in the 3-bit setting (from 52.74% to 62.73%) and +1.98% in the 4-bit setting (from 66.81% to 68.79%). These results suggest that GranQ effectively reduces quantization errors when combined with data synthesis-based quantization methods, leading to enhanced model performance.

Furthermore, GranQ also exhibits significant improvements in methods focused on training quantized models, such as Qimera+AIT [10] and AdaDFQ+AKT [23]. For Qimera+AIT [10], the 3-bit accuracy increased from 45.70% to 60.42% (+14.72%). Similarly, for AdaDFQ+AKT [23], the accuracy improved from 54.68% to 62.01% (+7.33%). These findings demonstrate that GranQ is not only effective in data synthesis-based methods but also enhances performance during the model training process.

Figure 5: Ablation study on the latency of the quantization process in ResNet-20 across different batch sizes. The experiments were conducted using the AdaDFQ method on CIFAR-100 with a 3-bit quantization setting.

5.3.2 Efficiency Evaluation

In general, performing precise quantization that considers both layers and channels leads to a substantial increase in computational cost. However, GranQ is designed to minimize this overhead while maintaining high accuracy by leveraging vectorized operations. To validate this, we measured the latency across various batch sizes (16, 32, 64, 128, and 200). Figure 5 compares four methods.

The first two methods are conventional quantization techniques: layer- and channel-wise. The third is scalar-based iterative, which considers all channels in each layer but performs multiple iterations. The fourth method is GranQ, which integrates both layer and channel information while employing vectorized quantization. Although scalar-based iterative quantization performs the same operations as GranQ, it relies on repeated scalar computations, as illustrated in Equations 1, 2, and 3. In contrast, GranQ enhances efficiency through parallelized computations (Algorithm LABEL:alg1 in supplementary material).
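The difference between the scalar-based iteration and the vectorized computation can be sketched as follows; the shapes and any measured timings are illustrative, and this is not the benchmarked implementation.

```python
import time
import torch

def quant_dequant_loop(x_l: torch.Tensor, bits: int) -> torch.Tensor:
    # Scalar-based iteration: one quantize-dequantize pass per channel.
    out = torch.empty_like(x_l)
    for c in range(x_l.shape[0]):
        xc = x_l[c]
        s = (xc.max() - xc.min()) / (2 ** bits - 1)
        z = torch.round(-xc.min() / s)
        out[c] = s * (torch.clamp(torch.round(xc / s + z), 0, 2 ** bits - 1) - z)
    return out

def quant_dequant_vec(x_l: torch.Tensor, bits: int) -> torch.Tensor:
    # Vectorized version: all channels processed in one broadcasted pass.
    flat = x_l.reshape(x_l.shape[0], -1)
    x_min = flat.min(dim=1, keepdim=True).values
    x_max = flat.max(dim=1, keepdim=True).values
    s = (x_max - x_min) / (2 ** bits - 1)
    z = torch.round(-x_min / s)
    return (s * (torch.clamp(torch.round(flat / s + z), 0, 2 ** bits - 1) - z)).reshape_as(x_l)

x_l = torch.randn(256, 32, 32)
for fn in (quant_dequant_loop, quant_dequant_vec):
    start = time.perf_counter()
    fn(x_l, bits=3)
    print(fn.__name__, f"{time.perf_counter() - start:.4f} s")
```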

A key observation from Figure 5 is that GranQ induces only a slight increase in latency compared with layer- and channel-wise quantization. Moreover, as summarized in Table 5, GranQ achieves significantly higher accuracy than traditional quantization methods. Even when compared with the less efficient iteration-based layer-channel quantization, GranQ maintains nearly identical accuracy. Additionally, as summarized in Table 2, GranQ substantially reduces quantization error and effectively preserves activations. These findings confirm that GranQ achieves fine-grained quantization while benefiting from parallel computation.

6 Discussion

6.1 Why is GranQ Effective?

The most common activation quantization approaches are known for their low computational cost and fast processing speeds. However, they often fail to accurately capture activation distributions, resulting in substantial quantization errors (Figure 3). In layer-wise quantization, representing the entire activation range with a single scaling factor makes it difficult to reflect fine-grained distribution changes. Similarly, channel-wise quantization encounters challenges owing to significant range differences in activations across layers for the same channel, leading to increased information loss (Section LABEL:sec:extraanalysis in the supplementary material). These limitations become more pronounced in low-bit settings, where higher quantization errors result in performance degradation.

To overcome these issues, GranQ introduces the layer-channel scaling method, which combines layer- and channel-wise quantization. Unlike conventional methods, GranQ considers the activation distribution of each channel individually while also incorporating layer information, enabling more precise quantization. Furthermore, although conventional quantization relies on scalar operations with a single scaling factor, GranQ utilizes multiple scaling factors and performs vectorized computations. This approach leverages GPU resources efficiently, minimizing computational overhead while achieving effective quantization.

Experimental results demonstrate that GranQ outperforms traditional layer- and channel-wise quantization methods, achieving higher cosine similarity and lower relative error (Table 2). This shows that GranQ accurately preserves the original activation distribution. Additionally, GranQ maintains computational efficiency in terms of latency and effectively balances the trade-off between computational cost and performance (Figure 5 and Table 5). Most notably, GranQ achieves SOTA performance compared with existing layer-wise quantization methods (Table 3), highlighting its efficiency and superiority over scalar-iteration approaches.

6.2 Limitations and Future Work

As GranQ aims to precisely preserve activations during quantization, it partially depends on generated data quality. Consequently, studies on data generation remain highly important. The proposed method has the potential to achieve even better performance when applied to high-quality data generation strategies. In future work, we aim to improve the quantization performance by focusing on both the accurate generation of original activations and their effective preservation.

7 Conclusion

We present a comprehensive analysis of activation quantization errors in ZSQ and propose GranQ, a novel quantization method to effectively reduce these errors. Conventional layer- and channel-wise quantization methods rely on a single scaling factor, which often results in substantial activation loss. In contrast, GranQ employs a fine-grained quantization approach that simultaneously considers both layer and channel information, thereby reducing activation loss and enabling more precise quantization. Moreover, GranQ successfully maintains computational efficiency through vectorized quantization while performing detailed and fine-grained quantization.

Through extensive experiments, we demonstrate that GranQ achieves superior performance while preserving computational efficiency. Compared with various SOTA methods, GranQ demonstrates competitive performance and achieves notable improvements. Our study highlights the importance of activation preservation in ZSQ and proposes GranQ as an insightful direction to address this challenge.

Acknowledgement This research was supported by the National Research Foundation (NRF) funded by the Korean government (MSIT) (No. RS-2023-00229822).

References

  • Bai et al. [2024] Jianhong Bai, Yuchen Yang, Huanpeng Chu, Hualiang Wang, Zuozhu Liu, Ruizhe Chen, Xiaoxuan He, Lianrui Mu, Chengfei Cai, and Haoji Hu. Robustness-guided image synthesis for data-free quantization. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 10971–10979, 2024.
  • Bai et al. [2023] Shipeng Bai, Jun Chen, Xintian Shen, Yixuan Qian, and Yong Liu. Unified data-free compression: Pruning and quantization without fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5876–5885, 2023.
  • Banner et al. [2019] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in Neural Information Processing Systems, 32, 2019.
  • Cai et al. [2020] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13169–13178, 2020.
  • Chen et al. [2024] Xinrui Chen, Yizhi Wang, Renao Yan, Yiqing Liu, Tian Guan, and Yonghong He. Texq: zero-shot network quantization with texture feature distribution calibration. Advances in Neural Information Processing Systems, 36, 2024.
  • Cheng et al. [2024] Hongrong Cheng, Miao Zhang, and Javen Qinfeng Shi. A survey on deep neural network pruning: Taxonomy, comparison, analysis, and recommendations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • Cheng et al. [2018] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136, 2018.
  • Choi et al. [2018] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
  • Choi et al. [2021] Kanghyun Choi, Deokki Hong, Noseong Park, Youngsok Kim, and Jinho Lee. Qimera: Data-free quantization with synthetic boundary supporting samples. Advances in Neural Information Processing Systems, 34:14835–14847, 2021.
  • Choi et al. [2022] Kanghyun Choi, Hye Yoon Lee, Deokki Hong, Joonsang Yu, Noseong Park, Youngsok Kim, and Jinho Lee. It’s all in the teacher: Zero-shot quantization brought closer to the teacher. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8311–8321, 2022.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • Deng et al. [2020] Lei Deng, Guoqi Li, Song Han, Luping Shi, and Yuan Xie. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proceedings of the IEEE, 108(4):485–532, 2020.
  • Elsken et al. [2019] Thomas Elsken, Jan Hendrik Metzen, and Frank Hutter. Neural architecture search: A survey. Journal of Machine Learning Research, 20(55):1–21, 2019.
  • Fan et al. [2024] Chunxiao Fan, Ziqi Wang, Dan Guo, and Meng Wang. Data-free quantization via pseudo-label filtering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5589–5598, 2024.
  • Fang et al. [2020] Jun Fang, Ali Shafiee, Hamzah Abdel-Aziz, David Thorsley, Georgios Georgiadis, and Joseph H Hassoun. Post-training piecewise linear quantization for deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 69–86. Springer, 2020.
  • Frankle and Carbin [2018] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018.
  • Gholami et al. [2022] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. In Low-Power Computer Vision, pages 291–326. Chapman and Hall/CRC, 2022.
  • Guo et al. [2022] Cong Guo, Yuxian Qiu, Jingwen Leng, Xiaotian Gao, Chen Zhang, Yunxin Liu, Fan Yang, Yuhao Zhu, and Minyi Guo. Squant: On-the-fly data-free quantization via diagonal hessian approximation. arXiv preprint arXiv:2202.07471, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Heo et al. [2019] Byeongho Heo, Jeesoo Kim, Sangdoo Yun, Hyojin Park, Nojun Kwak, and Jin Young Choi. A comprehensive overhaul of feature distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1921–1930, 2019.
  • Hinton [2015] Geoffrey Hinton. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Hong et al. [2024] Inpyo Hong, Youngwan Jo, Hyojeong Lee, Sunghyun Ahn, and Sanghyun Park. Advanced knowledge transfer: Refined feature distillation for zero-shot quantization in edge computing. arXiv preprint arXiv:2412.19125, 2024.
  • Hubara et al. [2018] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. Journal of Machine Learning Research, 18(187):1–30, 2018.
  • Jeon et al. [2023] Yongkweon Jeon, Chungman Lee, and Ho-young Kim. Genie: show me the data for quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12064–12073, 2023.
  • Kim et al. [2025] Minjun Kim, Jongjin Kim, and U Kang. Synq: Accurate zero-shot quantization by synthesis-aware fine-tuning. In The Thirteenth International Conference on Learning Representations, 2025.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • LeCun et al. [1989] Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. Advances in neural information processing systems, 2, 1989.
  • Li et al. [2024a] Min Li, Zihao Huang, Lin Chen, Junxing Ren, Miao Jiang, Fengfa Li, Jitao Fu, and Chenghua Gao. Contemporary advances in neural network quantization: A survey. In 2024 International Joint Conference on Neural Networks (IJCNN), pages 1–10, 2024a.
  • Li et al. [2021a] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021a.
  • Li et al. [2021b] Yuhang Li, Feng Zhu, Ruihao Gong, Mingzhu Shen, Xin Dong, Fengwei Yu, Shaoqing Lu, and Shi Gu. Mixmix: All you need for data-free compression are feature and data mixing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4410–4419, 2021b.
  • Li et al. [2024b] Yuhang Li, Youngeun Kim, Donghyun Lee, Souvik Kundu, and Priyadarshini Panda. Genq: Quantization in low data regimes with generative synthetic data. In European Conference on Computer Vision, pages 216–235. Springer, 2024b.
  • Liu et al. [2018] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
  • Nagel et al. [2019] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1325–1334, 2019.
  • Odena et al. [2017] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier gans. In International conference on machine learning, pages 2642–2651. PMLR, 2017.
  • Qian et al. [2023] Biao Qian, Yang Wang, Richang Hong, and Meng Wang. Adaptive data-free quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7960–7968, 2023.
  • Ruder [2016] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Shang et al. [2023] Yuzhang Shang, Bingxin Xu, Gaowen Liu, Ramana Rao Kompella, and Yan Yan. Causal-dfq: Causality guided data-free network quantization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17437–17446, 2023.
  • Wang and Yoon [2021] Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE transactions on pattern analysis and machine intelligence, 44(6):3048–3068, 2021.
  • Xu et al. [2020] Shoukai Xu, Haokun Li, Bohan Zhuang, Jing Liu, Jiezhang Cao, Chuangrun Liang, and Mingkui Tan. Generative low-bitwidth data free quantization. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 1–17. Springer, 2020.
  • Yin et al. [2019] Penghang Yin, Jiancheng Lyu, Shuai Zhang, Stanley Osher, Yingyong Qi, and Jack Xin. Understanding straight-through estimator in training activation quantized neural nets. arXiv preprint arXiv:1903.05662, 2019.
  • Zhang et al. [2021] Xiangguo Zhang, Haotong Qin, Yifu Ding, Ruihao Gong, Qinghua Yan, Renshuai Tao, Yuhang Li, Fengwei Yu, and Xianglong Liu. Diversifying sample generation for accurate data-free quantization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15658–15667, 2021.
  • Zhong et al. [2022] Yunshan Zhong, Mingbao Lin, Gongrui Nan, Jianzhuang Liu, Baochang Zhang, Yonghong Tian, and Rongrong Ji. Intraq: Learning synthetic images with intra-class heterogeneity for zero-shot network quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12339–12348, 2022.
  • Zoph [2016] B Zoph. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.