
Augmenting Hessians with Inter-Layer Dependencies for Mixed-Precision Post-Training Quantization

Clemens JS Schaefer†, Navid Lambert-Shirzad, Xiaofan Zhang, Chiachen Chou,
Tom Jablin, Jian Li, Elfie Guo, Caitlin Stanton, Siddharth Joshi, and Yu Emma Wang
University of Notre Dame, Notre Dame, IN, USA
Google LLC, Mountain View, CA, USA
[email protected], [email protected]
Work conducted while interning at Google LLC
Abstract

Efficiently serving neural network models with low latency is becoming more challenging due to increasing model complexity and parameter count. Model quantization offers a solution which simultaneously reduces memory footprint and compute requirements. However, aggressive quantization may lead to an unacceptable loss in model accuracy owing to differences in sensitivity to numerical imperfection across different layers in the model. To address this challenge, we propose a mixed-precision post-training quantization (PTQ) approach that assigns different numerical precisions to tensors in a network based on their specific needs, for a reduced memory footprint and improved latency while preserving model accuracy. Previous works rely on layer-wise Hessian information to determine numerical precision, but as we demonstrate, Hessian estimation is typically insufficient in determining an effective ordering of layer sensitivities. We address this by augmenting the estimated Hessian with additional information to capture inter-layer dependencies. We demonstrate that this consistently improves PTQ performance along the accuracy-latency Pareto frontier across multiple models. Our method combines second-order information and inter-layer dependencies to guide a bisection search, finding quantization configurations within a user-configurable model accuracy degradation range. We evaluate the effectiveness of our method on the ResNet50, MobileNetV2, and BERT models. Our experiments demonstrate latency reductions compared to a 16-bit baseline of 25.48%, 21.69%, and 33.28% respectively, while maintaining model accuracy to within 99.99% of the baseline model.

1 Introduction

Neural networks (NNs) underpin many machine learning (ML) systems, achieving state-of-the-art (SOTA) performance across a wide range of tasks, including computer vision [52], natural language processing [4], and generative models for text [30, 42] and images [34]. However, these remarkable capabilities incur substantial compute and memory costs, making these models challenging to deploy at scale while guaranteeing quality of service. These challenges are further exacerbated by the increasing proliferation of ML across tasks [19]. Overcoming these challenges requires resource-efficient models that balance deployment costs against quality of service (QoS) metrics (e.g., latency and accuracy). Researchers have addressed this need using a variety of techniques including hardware-efficient NN designs [41], pruning [24], and quantization [10]. Among these, quantization offers the simultaneous benefits of reducing the model footprint, enabling cheaper compute primitives, and reducing NN inference latency with a corresponding reduction in compute energy.

By perturbing model parameters from their trained values, quantization can degrade model accuracy. Most often, this is mitigated by incorporating quantization into the initial training or by additional training of the NN with quantized parameters, collectively referred to in the literature as quantization aware training (QAT) [8, 44]. However, by updating model parameters, requiring subsequent revalidation, and needing training/finetuning data, QAT incurs significant overheads during model deployment. Post-training quantization (PTQ) aims to avoid this by determining quantization scales and rounding schemes either on a small calibration dataset or in a data-free manner, minimizing changes to model parameters. This trades off quantization complexity and model revalidation efforts against the accuracy of the quantized model [46].

Recognizing the benefits of model quantization, commercially available NN accelerators such as NVIDIA GPUs [29, 28] or Google TPUs [17, 18] support quantized operations at various bit-widths, e.g., int4, int8, fp8, fp16, fp32, or fp64, to facilitate efficient NN inference. Maximally exploiting these hardware capabilities is challenging in practice because different NN layers and operations need to be configured to different bit-widths to best balance model accuracy against serving efficiency. Since the search space of all possible bit-width choices is exponential in the number of layers (or tensor slices for finer-grained approaches), this presents a significant challenge for rapidly deploying quantized NNs while guaranteeing QoS. QAT tackles that challenge by: (i) training bit-widths alongside other model parameters, given model size constraints [3, 43], (ii) using black-box reinforcement learning solutions to determine bit-widths [44], or (iii) using auxiliary metrics to reduce the search space [8]. The increased complexity associated with PTQ has typically resulted in research either: (i) ignoring mixed precision PTQ entirely and quantizing the model with a single bit-width [11, 9], (ii) using Pareto frontier methods based on Hessian sensitivity and model size [5], or (iii) using integer programming [15, 9] to determine mixed precision configurations.

While the model fine-tuning involved in QAT may introduce computational overhead compared to the simpler PTQ techniques, PTQ often results in a prohibitive quality gap for the same level of model compression [17]. This quality compromise prevents the broader adoption and deployment of PTQ-based quantization methods. Existing mixed precision PTQ approaches assume that the numerical precisions of different layers are independent, i.e., that the decision to quantize one layer is made regardless of other layers. As we will show, this assumption results in sub-optimal quantization configurations. Prior efforts are unable to effectively select insensitive layers to quantize, with some quantizing more sensitive layers, resulting in a poorer quality than what is acceptable in production. The situation is further exacerbated when models are trained and deployed by different entities, where training accuracy is chosen without considering any subsequent quantization. In this case, any quantization-induced drop in accuracy might be intolerable.


Figure 1: Summary of our approach. On the left we illustrate the fully-parallelizable computation of our sensitivity metric: the mean of the Hessian traces per layer and the loss after quantizing all pairs of layers. The Hessian trace for each layer and the inter-layer excess degradation are combined to guide a bisection search for the ideal bit allocation in the network given an accuracy target. On the right we illustrate our quantization method: for weights we follow SQuant [11] and determine the ideal quantization rounding scheme data-free, while for the activations we use a percentile calibration scheme to determine the quantization scales.

To tackle the challenges associated with mixed precision PTQ, this paper develops a unified method applicable to multiple model types (large scale, small scale, convolutional, and transformers) and data modalities (vision and text). Our approach enables deploying floating-point ML models to commercially available hardware with no manual intervention (see Figure 1). As our primary contributions: (i) we demonstrate the ineffectiveness of the layer-wise sensitivity metric and introduce a novel metric that combines second-order information with inter-layer dependencies, (ii) we propose a guided bisection search to identify optimal quantization configurations while maintaining a production-level accuracy, and (iii) we evaluate our technique experimentally on convolutional vision models and a transformer-based language model and show reductions of model footprints and inference latency given tight accuracy constraints. We demonstrate latency reductions of 25.48% (ResNet50), 21.69% (MobileNetV2), and 33.28% (BERT) while maintaining model accuracy within 99.99% of the baseline model on a calibration dataset.

2 Related Work

To find mixed precision quantization policies for QAT, Wang et al. [44] use reinforcement learning with feedback from a hardware accelerator, reporting a 1.4-1.95× improvement in latency and a 1.9× improvement in power over a baseline eight-bit integer model with comparable accuracy. Gradient-based QAT methods that learn the precision of weights and activations have shown considerable success on models trained on the ImageNet task, with multiple competitive results on sub-5 MB models [39, 32]. When the numerical precision cannot be learned directly, alternative approaches typically employ a surrogate metric to determine layer importance or sensitivity and allocate precision accordingly. One example of such an approach is presented by Yao et al. [48], who propose using the mean of the Hessian trace to determine layer sensitivity and develop a Pareto frontier of all model quantization configurations for use in QAT. They reduce the size of a ResNet50 to 7.99 MB while achieving 75.76% accuracy.

Due to the improvement in model compression observed during mixed-precision QAT, recent work has also studied the feasibility of applying mixed-precision quantization to PTQ. Nahshan et al. [27] investigate how quantization impacts the model loss landscape, observing flat separable structures for mild quantization and highly non-separable structures with steep curvature for low bit-width quantization. Building on this they devise a three-step method to improve PTQ: (i) determine the quantization step that minimizes a norm of the quantization error of the individual layers, (ii) use quadratic interpolation to approximate an optimum quantization scale, and (iii) jointly optimize the parameters of all layers acquired in the previous step by applying a gradient-free optimization method. In a similar way, Nagel et al. [26] theoretically analyze the impact of the rounding decision during quantization and formulate it as a binary optimization problem (round up vs. round down). Their proposed solution uses a layer-wise local loss, which can be optimized using a relaxation method for improved PTQ performance. Yao et al. [49] demonstrate int4 and int8 PTQ on large Transformer-based models by using fine-grained quantization and layer-wise data-independent knowledge distillation.

Cai et al. [5] introduce a mixed precision PTQ scheme that employs Hessian estimations, similar to previous QAT methods [48]. To estimate the Hessian, the authors extract a distilled dataset from the unquantized model using batchnorm matching, which makes this inapplicable to transformer-based models. Hubara et al. [15] quantize a model by updating its parameters to minimize the error between the quantized layer output and the full precision output, fine-tuning batchnorm parameters. They formulate the allocation of precision on a per-layer basis as an integer linear programming problem, with the cost being a function of the estimated model footprint and the accuracy. This method assumes strong inter-layer independence, and changes the model weights as well as batchnorm parameters, blurring the distinction between QAT and PTQ. This integer programming approach has also been adopted by other works such as [9]. Alternative approaches to determining layer sensitivity have also been studied, such as signal-to-quantization noise [31] and Fisher information [51]. These metrics are used in a similar fashion, to construct an ordered list of layer sensitivities to facilitate a search for optimized mixed-precision quantization configurations.

To the best of our knowledge, Zheng et al. [53] are among the few prior works examining inter-layer dependencies for PTQ. They phrase the quantization process as a network-wide, large-scale combinatorial optimization problem over discrete variables and enable an efficient solution through various regularization techniques. However, they do not consider mixed precision, and their accuracy degrades notably for the four-bit configuration with eight-bit first and last layers.

3 Method

Figure 1 provides an overview of our PTQ methodology. Initially, we determine per-layer sensitivity by approximating each layer's Hessian trace and inter-layer dependencies by assessing the impact of pairwise quantization across the model layers. We consolidate these into a single metric to establish an ordered sensitivity list. Next, we employ a bisection search to determine the sensitivity thresholds that facilitate a bit-width assignment to each layer which still meets the QoS requirements. The right side of the figure illustrates the model quantization process, optimizing the rounding scheme for weights [11] and employing a percentile-based calibration scheme for the activations [46].

3.1 Quantization

Fixed-point quantization, often termed integer quantization, reduces the precision of numerical values in a model for a corresponding decrease in storage and compute requirements. This is typically achieved by applying clipping and rounding operations to the original floating-point values, often formulated as:

Q(\mathbf{x})=\text{round}(\text{clip}(\alpha\cdot\mathbf{x})\cdot 2^{b-1})\cdot 2^{-(b-1)}\cdot\alpha^{-1}.

Here, $Q$ is the quantization function and $\mathbf{x}$ is the floating point value. The clipping function saturates values exceeding the thresholds to their corresponding extrema (minimum -1 and maximum 1), $b$ is the bit-width, and $\alpha$ is the quantization scale.

To ensure compatibility with most commercially available hardware, we enforce that all operands (activations/weights) in a matrix multiply (matmul) have the same bit precision. For weights, we employ fine-grained quantization, where the rounding function and scale parameters are determined for each tensor dimension, e.g., per-channel, per-filter, or per-embedding. Building on previous work on PTQ [11, 22, 26], we leave parameters unchanged and instead adapt the rounding function. Our work minimizes the Constrained Absolute Sum of Error (CASE) by modifying the rounding direction for the values contributing the most to the CASE for that matmul [11]. The scales ($\alpha$) for the weights are set based on the minimum and maximum observed along the tensor dimension. We determine a single scale for activations using a single forward pass with a calibration set (a subset of elements from the training data). We employ a percentile-based method to determine the quantization scale for the activations [46], on a per-layer basis.
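To make the formulation above concrete, the snippet below is a minimal sketch of the quantization operator together with a percentile-derived activation scale. It uses plain nearest rounding for simplicity rather than the CASE-optimized rounding of SQuant, and the helper names (`quantize`, `percentile_scale`, `per_channel_weight_scales`) are illustrative rather than part of our pipeline.

```python
import numpy as np

def quantize(x, scale, bits):
    """Symmetric fixed-point quantization: round(clip(a*x) * 2^(b-1)) * 2^-(b-1) / a."""
    x_scaled = np.clip(x * scale, -1.0, 1.0)        # clip(alpha * x) to [-1, 1]
    q = np.round(x_scaled * 2.0 ** (bits - 1))      # integer grid with b-1 fractional bits
    return q * 2.0 ** (-(bits - 1)) / scale         # map back to the original range

def percentile_scale(calibration_activations, percentile=99.999):
    """One activation scale per layer, derived from a high percentile of calibration data."""
    threshold = np.percentile(np.abs(calibration_activations), percentile)
    return 1.0 / threshold

def per_channel_weight_scales(w):
    """Per-output-channel weight scales from the observed absolute maximum."""
    return 1.0 / np.max(np.abs(w), axis=tuple(range(1, w.ndim)), keepdims=True)
```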

3.2 Sensitivity Measures

Figure 2: The effectiveness of the Hessian trace as a sensitivity metric to guide quantization configuration search, comparing quantization outcomes for ResNet50 and BERT models: Hessian-guided bisection and progressive searches, and a random sensitivity-guided progressive search. The performance gap between Hessian-ordered bisection and progressive searches suggests that uncaptured interactions between multiple quantized layers are highly impactful. Though the progressive search can recover from misordered layer sensitivities, its runtime makes it impractical for production.

The space of possible configurations for a quantized model is exponential in the number of tensors. Consider a ResNet50 with just three different configuration options for the parameters (e.g., bit-widths); this results in $3^{50}$ possible quantization configurations. Exhaustively evaluating these configurations is not practical for modern workloads. Consequently, searching through this space efficiently is critical to deploying quantized models. The use of an informative sensitivity metric can reduce this vast space, making it practical to search for performant configurations.

One of the most commonly used sensitivity metrics employs estimations of the Hessian, which pertains to the local curvature of a function. This choice is informed by the theory that model accuracy is robust to perturbations in values that occupy flat regions of the loss function (low local curvature). However, for those values that occupy regions of high local curvature (sharp), small perturbations can have an exaggerated impact on model accuracy [7, 36, 13]. One way of estimating the local curvature uses the Hessian of the loss function, which comprises the second-order partial derivatives of the loss.

Rather than directly evaluating the Hessian, which is computationally prohibitive, we approximate its trace using Hutchinson's algorithm, as seen in related work [8, 21]. We define a Hessian-based metric for the $i$-th layer of a network as:

\mathcal{E}_{i}^{\text{Hessian}}=\mathbf{E}\left[\mathrm{tr}\left(\frac{\partial^{2}L(\mathbf{x},\mathbb{W})}{\partial\mathbf{w}_{i}^{2}}\right)\right].

Here, $\mathrm{tr}$ is the trace operator, $L$ the model's loss function, $\mathbb{W}$ the set of all considered tensors (e.g., weights/activations), and $\mathbf{x}$ the calibration data. Higher $\mathcal{E}^{\text{Hessian}}$ values signify increased local curvature of the loss function, implying greater model sensitivity to parameter changes. Sorting by $\mathcal{E}^{\text{Hessian}}$ gives an ordering of the ease of layer quantization.
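As an illustration of the estimator above, the following sketch computes a Hutchinson-style trace estimate for one layer's weights using Rademacher probes and Hessian-vector products via PyTorch autograd. The function name and the number of probes are our own illustrative choices, not the exact implementation used in our pipeline.

```python
import torch

def hessian_trace_estimate(loss, weight, n_probes=32):
    """Hutchinson estimator: E[v^T H v] over Rademacher v approximates tr(H) for one layer."""
    # First-order gradient with the graph retained so we can differentiate a second time.
    grad = torch.autograd.grad(loss, weight, create_graph=True)[0]
    trace = 0.0
    for _ in range(n_probes):
        # Rademacher probe vector with entries in {-1, +1}.
        v = (torch.rand_like(weight) > 0.5).to(weight.dtype) * 2 - 1
        # Hessian-vector product: d(grad . v)/dw = H v.
        hv = torch.autograd.grad(torch.sum(grad * v), weight, retain_graph=True)[0]
        trace += torch.sum(v * hv).item()  # accumulate v^T H v
    return trace / n_probes
```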

Given an accurately ordered layer sensitivity list, a bisection search-like method can efficiently determine layer quantization configurations. However, as shown in Figure 2, a bisection search yields subpar results with layers ordered by the Hessian compared to a sequential (progressive) search algorithm. The progressive search sequentially evaluates the suitability of assigning each layer a bit-width, using cumulative model degradation as the assignment criterion (see Supplementary Materials Algorithm 2 for pseudo-code). We test two orders in which layers are evaluated: prioritized by the sensitivity metric $\mathcal{E}^{\text{Hessian}}$ or randomly ordered. We attribute the performance discrepancy between progressive and bisection searches to incorrect ordering of high-sensitivity layers. This misordering is recoverable by the progressive search but catastrophic for the bisection method. As Figure 2 demonstrates, the performance of the progressive search is competitive with the Hessian-guided search even with a randomly ordered sensitivity list, significantly outperforming the Hessian-guided bisection search. However, with a correctly ordered sensitivity list, both search methods should yield identical configurations.

Figure 3: Sensitivity metrics for ResNet50, MobileNetV2, and BERT. The top row shows the excess degradation of layer quantization combinations at eight bits. The convolutional networks exhibit high excess degradation near the early and late layers, whereas the transformer model shows higher excess degradation towards the middle of the network. The bottom row plots the difference between the layer sensitivities obtained from the Hessian, the excess degradation, and the augmented Hessian.

Since assuming layer-wise independence does not produce an accurate ordering of the final per-layer sensitivity, we augment our sensitivity metric by estimating pairwise layer sensitivities. Second-order methods become cost-prohibitive for this, and Hutchinson's algorithm only captures the impact of the diagonal elements of the Hessian, making it unsuitable for our needs. Instead, we estimate multi-layer dependencies by directly quantizing the layers in a pairwise fashion:

\mathcal{E}_{i}^{\text{InterLayer}} = \sum^{l}_{j} L(\mathbf{x},\mathbb{W}^{i,j}) - \max\left(L(\mathbf{x},\mathbb{W}^{i}),\, L(\mathbf{x},\mathbb{W}^{j})\right),
\mathbb{W}^{i,j} = \left\{\mathbb{W}\setminus\left\{\mathbf{w}_{i},\mathbf{w}_{j}\right\},\, Q(\mathbf{w}_{i}),\, Q(\mathbf{w}_{j})\right\}.

Here, we sum the excess degradation incurred given the interaction between two layers. We define excess degradation as the difference in the loss ($L$) between the jointly quantized model and the larger of the two single-layer quantized losses. We clip the minimum $\mathcal{E}_{i}^{\text{InterLayer}}$ to 0, disregarding any negative values. We then normalize and scale $\mathcal{E}_{i}^{\text{InterLayer}}$ to combine it with $\mathcal{E}_{i}^{\text{Hessian}}$ as:

\mathcal{E}_{i}^{\text{AugHessian}}=\mathcal{E}_{i}^{\text{Hessian}}+\beta\,\mathcal{E}_{i}^{\text{InterLayer}},\qquad\beta=\frac{\mathbf{E}[\mathcal{E}_{i}^{\text{Hessian}}]}{\mathbf{E}[\mathcal{E}_{i}^{\text{InterLayer}}]}.

Figure 3 visualizes these metrics as well as their combination. Figure 3 (top) shows the excess degradation from quantizing NN layers, pairwise, for three models. For both vision models, the early and late layers show a higher sensitivity, while the transformer-based BERT exhibits higher sensitivity towards the center. As seen, the Hessian-based sensitivity measure does not result in the same ordering; for example, the impact of quantizing the last layer is underestimated.
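A minimal sketch of how the pairwise term and the combined metric could be computed is shown below. It assumes a hypothetical helper `loss_with_quantized(layers)` that returns the calibration loss with exactly the given layers quantized; all single-layer and pairwise evaluations are independent and can be batched or run in parallel.

```python
import itertools
import numpy as np

def inter_layer_sensitivity(layer_ids, loss_with_quantized):
    """Per-layer sum of pairwise excess degradation, clipped at zero."""
    single = {i: loss_with_quantized([i]) for i in layer_ids}
    excess = {i: 0.0 for i in layer_ids}
    for i, j in itertools.combinations(layer_ids, 2):
        pair_loss = loss_with_quantized([i, j])
        e = max(pair_loss - max(single[i], single[j]), 0.0)  # excess degradation of the pair
        excess[i] += e
        excess[j] += e
    return excess

def augmented_hessian(hessian_traces, inter_layer):
    """Combine Hessian traces with the inter-layer term, scaled so both have equal means."""
    keys = sorted(hessian_traces)
    h = np.array([hessian_traces[i] for i in keys])
    il = np.array([inter_layer[i] for i in keys])
    beta = h.mean() / il.mean()
    return {i: h_i + beta * il_i for i, h_i, il_i in zip(keys, h, il)}
```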

3.3 Search for Quantization Configurations

We use a bisection method to determine the quantization configuration with $\mathcal{O}(b\log{N})$ model evaluations. Here, $N$ is the total number of layers and $b$ the number of available quantization bit-widths. We implement this search on a sensitivity list sorted by the augmented sensitivity measure ($\mathcal{E}^{\text{AugHessian}}$), to determine the threshold sensitivity value corresponding to different quantization levels. We evaluate the quantized configuration using the same calibration set used to determine scale parameters. The bisection search iteratively updates the threshold value, and thereby the quantization configuration, by expanding or shrinking the number of quantized layers depending on whether the accuracy target is achieved. We progressively determine the sensitivity threshold for each available precision setting, starting from the highest (e.g., 8 bits) to the lowest (e.g., 4 bits). Pseudocode for the bisection algorithm is provided in Supplementary Materials Algorithm 1.
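The sketch below illustrates the bisection step for a single bit-width over the sensitivity-sorted layer list; the complete multi-precision procedure, which repeats this from the highest to the lowest bit-width, is given as Algorithm 1 in the Supplementary Materials. The `evaluate` callback standing in for a calibration-set accuracy evaluation is an assumption of this sketch.

```python
def bisect_threshold(layers_by_sensitivity, bits, accuracy_target, evaluate):
    """Largest prefix of the sensitivity-sorted layer list that can be quantized to `bits`
    while the calibration accuracy stays at or above `accuracy_target`."""
    low, high = 0, len(layers_by_sensitivity)
    while low < high:
        mid = (low + high + 1) // 2
        # Quantize the `mid` least sensitive layers and check the accuracy target.
        config = {layer: bits for layer in layers_by_sensitivity[:mid]}
        if evaluate(config) >= accuracy_target:
            low = mid       # target met: try quantizing more layers
        else:
            high = mid - 1  # target missed: quantize fewer layers
    return low              # number of layers quantizable to `bits`
```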

4 Experiments



Figure 4: Evaluation of our method on ResNet50, MobileNetV2, and BERT, highlighting the performance gain of our proposed augmented Hessian sensitivity metric over the Hessian. Note in particular the large latency gains at high accuracy targets for MobileNetV2 and BERT, suggesting our metric captures high-quality information for quantization.

We evaluate our proposed method on the ImageNet [37] and SQuAD [33] datasets, using ResNet50 [12], MobileNetV2 [38], and BERT [6]. ResNet50 (ImageNet) and BERT (SQuAD) are commonly accepted dataset-model combinations from the MLPerf inference suite [35] (https://mlcommons.org/). We show results on MobileNetV2 to demonstrate the versatility of our method and its performance on small edge models. For calibration, determining the sensitivity, and guiding the search, we randomly sample 4096 examples from the original training data, which we use for all steps. For activation calibration we set quantization scales based on the 99.999th percentile value observed during a forward pass of the calibration set through a model with quantized weights. We also observe improvements to the quantization performance for MobileNetV2 when adding a layer-size penalty.

We estimate latency by benchmarking key kernels such as gemm and conv2d at various numerical precisions on A100 GPUs, using an inference batch size of one, directly capturing the interplay between memory hierarchy, bus speeds, compute utilization, and compiler optimizations. We identified the top-performing kernels for specific tensor shapes and precisions using the CUTLASS [20] profiler and optimizer. This data was then used to estimate deployment latencies for different multi-precision models. Our results (Tables 1, 2, and 3) show linear model size reduction with bit-width and reflect the complex deployment interactions arising from latency reductions.
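The sketch below shows how such a benchmark table could be turned into a latency estimate: per-layer kernel latencies are looked up by (operator, shape, precision) and summed. The data structures and names are placeholders for illustration, not measured values from our profiling runs.

```python
def estimate_model_latency_ms(layers, bit_config, kernel_latency_ms):
    """Sum benchmarked kernel latencies for a given per-layer bit-width configuration.

    layers: list of (name, op, shape) tuples, one per matmul/conv in the model.
    bit_config: mapping from layer name to its assigned bit-width.
    kernel_latency_ms: mapping from (op, shape, bits) to the best measured kernel latency in ms.
    """
    return sum(kernel_latency_ms[(op, shape, bit_config[name])]
               for name, op, shape in layers)
```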

Table 1: Comparison of our results to other work for ResNet50. Our method (augmented Hessian with 99.9%) achieves the highest accuracy while enabling a 25.88% latency reduction through quantization over the 16-bit model. See text for explanation of annotation marks in the table.
ResNet50 Accuracy Size Latency Precision
No FT Absolute Relative MB Relative ms Relative W A
Baseline (ours) 77.60 100.00% 51.00 100.00% 5.20 100.00% 16 16
MrBiQ [16] 75.17 97.62% 13.78 27.02% 2.70 51.87% 4 4
ZeroQ [5] 76.08 97.89% 12.73 24.97% 3.82 73.46% MP 8
QDrop [45] 75.45 97.99% 13.78 27.02% 2.70 51.87% 4 4
LAPQ [27] 74.80 98.29% 25.50 50.00% 3.82 73.46% 8 4
AdaQuant [15] x 75.90 98.32% 24.32 47.69% 3.79 72.88% MP MP
AdaRound [26] 75.01 98.61% 25.50 50.00% 3.82 73.46% 4 8
HAWQV3 [48] x 76.73 98.73% 18.70 36.67% 3.28 63.16% MP MP
BSQ [47] x 75.29 98.90% - - 2.70 51.87% MP 4
OBQ [9] x 75.72 99.17% 12.75 25.00% 2.68 51.54% 4 4
PTQMP [31] 75.95 99.76% 25.50 50.00% 3.82 73.46% MP MP
SQuant [11] 77.66 99.91% 25.50 50.00% 3.82 73.46% 8 8
Hessian 99% 77.13 99.39% 21.54 42.24% 3.75 72.17% MP MP
AugHessian 99% 77.19 99.46% 23.68 46.43% 3.58 68.93% MP MP
Hessian 99.9% 77.26 99.55% 25.52 50.03% 3.85 74.12% MP MP
AugHessian 99.9% 77.57 99.95% 25.52 50.03% 3.85 74.12% MP MP
Table 2: MobileNetV2 results and comparison to other work. Previous works do not reach accuracy within 1% of the baseline model. Our method offers multiple configurations within 1% degradation which offer up to 22.10% latency reductions compared to unquantized models. See text for explanation of annotation marks in the table.
MobileNetV2 Accuracy Size Latency Precision
No FT Absolute Relative MB Relative ms Relative W A
Baseline (ours) 71.52 100.00% 6.94 100.00% 3.97 100.00% 16 16
LAPQ [27] 65.10 90.67% 1.73 25.00% 3.97 100.00% 4 32
BRECQ [22] 66.57 91.83% 2.38 34.23% 2.55 64.18% 4 4
QDrop [45] 68.84 94.96% 2.38 34.23% 2.55 64.18% 4 4
ZeroQ [5] 69.44 95.08% 0.87 12.49% 3.02 76.05% MP 8
MrBiQ [16] 68.97 95.14% 2.38 34.23% 2.55 64.18% 4 4
NWQ [53] 69.60 96.01% 2.38 34.23% 2.55 64.18% 4 4
AdaQuant [15] x 70.22 96.14% 5.09 73.29% 3.18 80.22% MP MP
AdaRound [26] 69.25 96.56% 1.73 25.00% 3.02 76.05% 4 8
DFQ [25] 71.20 97.49% 3.47 50.00% 3.02 76.05% 8 8
PTQMP [31] 70.68 98.34% 3.47 50.00% 3.02 76.05% MP MP
PTQ-MP [23] 70.70 98.50% 3.47 50.00% 3.02 76.05% 8 8
Hessian 99% 71.01 99.28% 3.47 50.02% 3.12 78.48% MP MP
AugHessian 99% 71.25 99.61% 4.75 68.44% 3.09 77.90% MP MP
Hessian 99.9% 71.20 99.56% 3.49 50.26% 3.32 83.55% MP MP
AugHessian 99.9% 71.34 99.75% 4.75 68.47% 3.11 78.31% MP MP
Table 3: Our results on BERT and comparison to other work. Comparable other works do not leverage mixed precision; by doing so, we deliver the best absolute and relative accuracy while also beating the latency of other works operating at eight bits. See text for explanation of annotation marks in the table.
BERT Accuracy Size Latency Precision
No FT Absolute Relative MB Relative ms Relative W A E
Baseline (ours) 90.25 100.00% 603.98 100.00% 4.28 100.00% 8 8 8
QDrop [45] 77.26 87.38% 151.00 25.00% 2.31 53.93% 4 8 4
QBert [40] x 86.95 98.04% 151.00 25.00% 2.79 65.19% 8 8 8
MREM-P [1] 87.30 98.42% 151.00 25.00% 2.79 65.19% 8 8 8
BRECQ [22] 87.41 99.08% 301.99 50.00% 2.79 65.19% 8 8 8
Q8Bert [50] x 87.74 99.19% 301.99 50.00% 2.79 65.19% 8 8 8
MrBiQ [16] 87.69 99.40% 113.25 18.75% 2.79 65.19% 8 8 8
AdaQuant [15] x 88.70 99.88% 301.99 50.00% 2.79 65.19% 4 8 8
Hessian 99% 88.94 98.54% 295.18 48.87% 2.79 65.29% MP MP MP
AugHessian 99% 89.56 99.23% 285.74 47.31% 2.74 64.13% MP MP MP
Hessian 99.9% 90.24 99.98% 435.16 72.05% 3.34 78.00% MP MP MP
AugHessian 99.9% 90.22 99.96% 389.55 64.50% 3.21 75.03% MP MP MP
Hessian 99.99% 90.29 100.04% 579.87 96.01% 4.11 96.00% MP MP MP
AugHessian 99.99% 90.24 99.98% 434.64 71.96% 3.47 81.06% MP MP MP

We present our experimental results and contextualize them with respect to other work in Tables 1, 2, and 3, summarizing absolute accuracy, model size, latency, as well as relative measures. The tables provide results for two search settings: a 99% and 99.9% accuracy target (and 99.99% for BERT). Our method delivers competitive model latency and model compression while exceeding the accuracy delivered by other techniques that result in similarly compressed models. Some entries in the table have additional annotations, for fair comparison across techniques. We distinguish between QAT and PTQ by indicating the presence or absence of finetuning (No FT). We always use the reported baseline indicating relative drop in accuracy using . We use to indicate that models quantized first and last layers to 8 bit, to indicate a manual rerun of light pipeline, and to indicate that the method implements batch-norm finetuning. Across all search targets, the augmented Hessian sensitivity metric consistently improves upon model latency compared to the pure Hessian metric. For ResNet50, targeting a 99.9% accuracy, our quantized model attained the highest accuracy with 99.9% on the calibration set and 99.95% on the complete ImageNet validation set, while delivering a 25.88% reduction in latency. Similar outcomes were observed for MobileNetV2, where no other model exceeded the 99% accuracy threshold compared to the unquantized baseline. Our method delivered up to 99.75% accuracy while reducing latency by 21.69%. When quantizing BERT models, we observed a 16.93% latency difference between the accuracy targets of 99% and 99.99%, achieving the highest accuracy among quantized models while still improving model serving latency.


Figure 5: Per-layer bit-width configurations for ResNet50, MobileNetV2, and BERT. The blue bars are bit-widths obtained using the Hessian sensitivity metric and the green bars using augmented Hessians, with stippling distinguishing the 98% accuracy target from 99.99%. The convolutional networks can be quantized to 8 bits, with a few sensitive layers kept at 16 bits distinguishing the two sensitivity metrics. Relaxing the accuracy target to 98% results in more four-bit layers. For BERT, the augmented Hessian metric can more aggressively quantize the early and later layers in the network, with most layers quantized to 8 bits for the 98% accuracy target.

Figure 4 shows the performance difference arising from the use of the two sensitivity metrics. For all configurations, the model derived from the augmented Hessian occupies a superior position on the accuracy-latency frontier. Ablation studies show the impact of using only the inter-layer dependency to guide the search (Supplementary Materials section A.3). While for relaxed accuracy targets the two metrics result in similar model serving latency, for more stringent targets the augmented search metric improves upon the Hessian by 5-15% across all models. We provide more insight on the difference between the two sensitivity metrics in Figure 7 in the Supplementary Materials. Figure 5 shows a detailed bit allocation breakdown for all three models. The major difference between using only the Hessian for search guidance vs. the augmented Hessian is that more layers are quantized to eight bits, especially visible for early layers. Additionally, we show the difference between a 98% and a 99.99% accuracy target, which manifests itself as more layers quantized to four bits for the augmented Hessian sensitivity. Table 6 in the Supplementary Materials shows how many evaluations our bisection search took for the 99% and 99.9% accuracy targets in Figure 4; the values are aligned with the theoretical expectation of $\mathcal{O}(\log{N})$ where $N$ is the number of layers. With an average of only six evaluations, our bisection search is significantly faster than a sequential search.

Limitations

We have not evaluated the impact of mixed-precision kernels, e.g., 4W8A, but we do not foresee complications arising from this for our approach. Our latency estimates are currently pessimistic since they do not capture the impact of kernel/operator fusion. Computing the excess degradation requires $\frac{l\cdot(l-1)}{2}+l$ model evaluations, where $l$ is the number of layers. While these can be parallelized and batched, these evaluations might still be costly for larger models. Additionally, estimating the trace of the Hessian remains a computationally intensive task. Although this has not limited us for the models we evaluated, it might impact applicability to significantly larger models. Because we use a calibration dataset at multiple steps, quantization performance will strongly depend on the alignment between the calibration data and the evaluation/real-world data. Additional research is needed to implement this in a completely data-free fashion [54].

5 Conclusions

We introduce a practical mixed precision PTQ pipeline for efficiently quantizing floating point NN models while maintaining a target accuracy on a calibration dataset. Our technique calibrates the quantizer scales and adapts the weight rounding scheme but does not adapt any of the original model parameters (including batch-norm parameters). We demonstrate the limitations of assuming layer independence in estimating layer sensitivity and address this using a new sensitivity metric that also captures the pairwise interaction between multiple quantized layers. This improved metric enables us to use a bisection search to determine quantization configurations that outperform the unaugmented sensitivity metric. Our method is demonstrated across small (MobileNetV2), medium (ResNet50), and large-scale (BERT) models, applicable to both vision and text data modalities. It achieves latency improvements ranging from 25-33% with minimal impact on accuracy. On average, we require six model evaluations to find these quantization configurations across the tested models.

Broader Impacts

Our mixed-precision PTQ can deploy models with low latency, creating positive impacts such as improved accessibility and reduced energy consumption (contributing to sustainability and cost-effectiveness). However, we also consider risks of negative impacts, such as limited interpretability, mostly due to a lack of research on the interpretability of quantized models (reducing transparency and trust), and potential bias and fairness issues, which are actively researched [14].

References

  • Bai et al. [2022] Haoli Bai, Lu Hou, Lifeng Shang, Xin Jiang, Irwin King, and Michael R Lyu. Towards efficient post-training quantization of pre-trained language models. Advances in Neural Information Processing Systems, 35:1405–1418, 2022.
  • Banner et al. [2018] Ron Banner, Yury Nahshan, Elad Hoffer, and Daniel Soudry. Aciq: Analytical clipping for integer quantization of neural networks. 2018.
  • Bhalgat et al. [2020] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 696–697, 2020.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Cai et al. [2020] Yaohui Cai, Zhewei Yao, Zhen Dong, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Zeroq: A novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13169–13178, 2020.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dong et al. [2019] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293–302, 2019.
  • Dong et al. [2020] Zhen Dong, Zhewei Yao, Daiyaan Arfeen, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. Advances in neural information processing systems, 33:18518–18529, 2020.
  • Frantar and Alistarh [2022] Elias Frantar and Dan Alistarh. Optimal brain compression: A framework for accurate post-training quantization and pruning. arXiv preprint arXiv:2208.11580, 2022.
  • Gholami et al. [2021] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630, 2021.
  • Guo et al. [2022] Cong Guo, Yuxian Qiu, Jingwen Leng, Xiaotian Gao, Chen Zhang, Yunxin Liu, Fan Yang, Yuhao Zhu, and Minyi Guo. Squant: On-the-fly data-free quantization via diagonal hessian approximation. arXiv preprint arXiv:2202.07471, 2022.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural computation, 9(1):1–42, 1997.
  • Hooker et al. [2020] Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, and Emily Denton. Characterising bias in compressed models. arXiv preprint arXiv:2010.03058, 2020.
  • Hubara et al. [2021] Itay Hubara, Yury Nahshan, Yair Hanani, Ron Banner, and Daniel Soudry. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning, pages 4466–4475. PMLR, 2021.
  • Jeon et al. [2022] Yongkweon Jeon, Chungman Lee, Eulrang Cho, and Yeonju Ro. Mr. biq: Post-training non-uniform quantization based on minimizing the reconstruction error. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12329–12338, 2022.
  • Jouppi et al. [2021] Norman P Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. Ten lessons from three generations shaped google’s tpuv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE, 2021.
  • Jouppi et al. [2023] Norman P Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil, Suvinay Subramanian, Andy Swing, Brian Towles, et al. Tpu v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. arXiv preprint arXiv:2304.01433, 2023.
  • Jumper et al. [2021] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583–589, 2021.
  • Kerr et al. [2022] Andrew Kerr, Haicheng Wu, Manish Gupta, Dustyn Blasig, Pradeep Ramini, Duane Merrill, Aniket Shivam, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Matt Nicely. CUTLASS, 11 2022. URL https://github.com/NVIDIA/cutlass.
  • Lee et al. [2021] Junghyup Lee, Dohyung Kim, and Bumsub Ham. Network quantization with element-wise gradient scaling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6448–6457, 2021.
  • Li et al. [2021] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. arXiv preprint arXiv:2102.05426, 2021.
  • Liu et al. [2021] Xingchao Liu, Mao Ye, Dengyong Zhou, and Qiang Liu. Post-training quantization with multiple points: Mixed precision without mixed precision. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8697–8705, 2021.
  • Mishra et al. [2021] Asit Mishra, Jorge Albericio Latorre, Jeff Pool, Darko Stosic, Dusan Stosic, Ganesh Venkatesh, Chong Yu, and Paulius Micikevicius. Accelerating sparse deep neural networks. arXiv preprint arXiv:2104.08378, 2021.
  • Nagel et al. [2019] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1325–1334, 2019.
  • Nagel et al. [2020] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pages 7197–7206. PMLR, 2020.
  • Nahshan et al. [2021] Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Alex M Bronstein, and Avi Mendelson. Loss aware post-training quantization. Machine Learning, 110(11):3245–3262, 2021.
  • NVIDIA [2023] NVIDIA. Nvidia h100 tensor core gpu, 2023. URL https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet.
  • NVIDIA [2020] NVIDIA. A100 tensor core gpu architecture, 2020.
  • OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
  • Pandey et al. [2023] Nilesh Prasad Pandey, Markus Nagel, Mart van Baalen, Yin Huang, Chirag Patel, and Tijmen Blankevoort. A practical mixed precision algorithm for post-training quantization. arXiv preprint arXiv:2302.05397, 2023.
  • Park and Yoo [2020] Eunhyeok Park and Sungjoo Yoo. Profit: A novel training method for sub-4-bit mobilenet models. In European Conference on Computer Vision, pages 430–446. Springer, 2020.
  • Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, art. arXiv:1606.05250, 2016.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  • Reddi et al. [2020] Vijay Janapa Reddi, Christine Cheng, David Kanter, Peter Mattson, Guenther Schmuelling, Carole-Jean Wu, Brian Anderson, Maximilien Breughe, Mark Charlebois, William Chou, et al. Mlperf inference benchmark. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 446–459. IEEE, 2020.
  • Rissanen [1978] Jorma Rissanen. Modeling by shortest data description. Automatica, 14(5):465–471, 1978.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Schaefer et al. [2022] Clemens JS Schaefer, Siddharth Joshi, Shan Li, and Raul Blazquez. Edge inference with fully differentiable quantized mixed precision neural networks. arXiv preprint arXiv:2206.07741, 2022.
  • Shen et al. [2020] Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-bert: Hessian based ultra low precision quantization of bert. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8815–8821, 2020.
  • Tan and Le [2019] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019.
  • Thoppilan et al. [2022] Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  • Uhlich et al. [2019] Stefan Uhlich, Lukas Mauch, Fabien Cardinaux, Kazuki Yoshiyama, Javier Alonso Garcia, Stephen Tiedemann, Thomas Kemp, and Akira Nakamura. Mixed precision dnns: All you need is a good parametrization. arXiv preprint arXiv:1905.11452, 2019.
  • Wang et al. [2019] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8612–8620, 2019.
  • Wei et al. [2022] Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. Qdrop: randomly dropping quantization for extremely low-bit post-training quantization. arXiv preprint arXiv:2203.05740, 2022.
  • Wu et al. [2020] Hao Wu, Patrick Judd, Xiaojie Zhang, Mikhail Isaev, and Paulius Micikevicius. Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602, 2020.
  • Yang et al. [2021] Huanrui Yang, Lin Duan, Yiran Chen, and Hai Li. Bsq: Exploring bit-level sparsity for mixed-precision neural network quantization. arXiv preprint arXiv:2102.10462, 2021.
  • Yao et al. [2021] Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael Mahoney, et al. Hawq-v3: Dyadic neural network quantization. In International Conference on Machine Learning, pages 11875–11886. PMLR, 2021.
  • Yao et al. [2022] Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022.
  • Zafrir et al. [2019] Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat. Q8bert: Quantized 8bit bert. In 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), pages 36–39. IEEE, 2019.
  • Zandonati et al. [2022] Ben Zandonati, Adrian Alan Pol, Maurizio Pierini, Olya Sirkin, and Tal Kopetz. Fit: A metric for model sensitivity. arXiv preprint arXiv:2210.08502, 2022.
  • Zhai et al. [2022] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022.
  • Zheng et al. [2022] DanDan Zheng, Yuanliu Liu, Liang Li, et al. Leveraging inter-layer dependency for post-training quantization. Advances in Neural Information Processing Systems, 35:6666–6679, 2022.
  • Zhong et al. [2022] Yunshan Zhong, Mingbao Lin, Gongrui Nan, Jianzhuang Liu, Baochang Zhang, Yonghong Tian, and Rongrong Ji. Intraq: Learning synthetic images with intra-class heterogeneity for zero-shot network quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12339–12348, 2022.

Appendix A Supplementary Materials

A.1 Error Bars

In Figure 6 we show error bars over five trials in which we changed the calibration and evaluation data of our bisection search to analyze the variance of our results. For ResNet50 we show that augmented Hessians are consistently better than pure Hessian information, with the standard deviation increasing for higher accuracy targets. For MobileNetV2 this roughly holds true as well; however, the mean latency using Hessian information for the highest accuracy target is lower. We attribute this to the higher variance, whereby small uninformed random perturbations can result in better models.

Figure 6: Error bars over five trials for ResNet50 and MobileNetV2

A.2 Greedy Search

We also use the Hessian and our augmented Hessian in combination with a greedy search method (outlined in Algorithm 2) to analyze a potential upper bound for PTQ quantization, since the greedy method is progressive and can potentially correct for errors in the sensitivity ordering. Table 4 shows the resulting latency percentages compared to a 16-bit floating point baseline for ResNet50 and MobileNetV2 given Hessian and augmented Hessian layer orderings. Generally the performance is close to the bisection search (see Tables 1 and 2) for both the Hessian and augmented Hessian, underlining the contribution of the augmented Hessians even in the greedy setting. We note that ResNet50 with a 99.9% target performed worse in a single greedy run than with our bisection search, which we attribute to the progressive nature of the greedy search: during the search, layers are individually added and discarded, whereas the bisection search quantizes multiple layers at a time so that cross-layer interactions have a chance to compensate for accuracy degradations. Our main takeaway is that a bisection search with the augmented Hessian can perform a quick search (see Table 6 for the number of evaluations) that results in the best possible accuracy. Note that the greedy search took at least $N$ evaluations, where $N$ is the number of layers, which is about $8\times$ more evaluations than the bisection search.

Table 4: Results for a greedy search using Hessians and augmented Hessians on ResNet50 and MobileNetV2. Percentage numbers in the table are relative latencies compared to a 16 bit floating point baseline.
Accuracy Target 97.5% 99% 99.9% 99.99%
ResNet50 Hessian 69.49% 73.38% 81.81% 85.73%
ResNet50 AugHessian 70.97% 71.97% 83.21% 84.63%
MobileNetV2 Hessian 75.72% 78.37% 81.05% 83.20%
MobileNetV2 AugHessian 75.72% 77.95% 79.91% 86.80%

A.3 Search with Inter-Layer Dependencies

Given that the inter-layer dependencies augment the Hessian in a meaningful way and improve PTQ, we also ran a bisection search using only inter-layer dependencies. The resulting latencies for ResNet50, MobileNetV2, and BERT are in Table 5. Compared to the results from Tables 1, 2, and 3, PTQ without any Hessian information results in roughly 10% slower latencies, highlighting the importance of combining Hessian and inter-layer information.

Table 5: Relative latency numbers of a bisection search using only inter-layer dependencies for ResNet50, MobileNetV2, and BERT.
Accuracy Target 95% 97.5% 98% 99% 99.9 % 99.99%
ResNet50 69.92% 71.90% 83.33% 80.67% 82.53% 83.33%
MobileNetV2 75.85% 78.62% 78.62% 78.36% 85.68% 87.56%
BERT 73.48% 72.27% 72.27% 72.27% 79.16% 85.35%

A.4 Sensitivity Threshold

Figure 7: ResNet50, MobileNetV2, and BERT layer sensitivities for the Hessian and augmented Hessian. The dashed line shows the final sensitivity threshold for eight-bit quantization (e.g., points above the line remain at 16 bits and points below are quantized to eight bits or lower), demonstrating a significant gap between the traditional Hessian-based metric and our augmented Hessian.

A.5 Observed Search Length

Table 6: Search iterations required to determine a mixed precision configuration. Our search has two steps: (i) determining all layers quantizable down to eight bits and (ii) finding all layers which can be quantized down to four bits (from the subset of layers which can already be quantized down to eight). Most searches take six configuration evaluations, which is in line with the theoretical time complexity of bisection search $\mathcal{O}(\log{N})$, e.g., $\log{54}=5.7549$ (ResNet50), $\log{53}=5.7280$ (MobileNetV2), and $\log{193}=7.5925$ (BERT).
ResNet50 99% ResNet50 99.9% MBV2 99% MBV2 99.9% BERT 99% BERT 99.9%
8 bit 4 bit 8 bit 4 bit 8 bit 4 bit 8 bit 4 bit 8 bit 4 bit 8 bit 4 bit
Hessian 6 5 6 4 6 6 6 6 6 6 6 7
AugHessian 6 6 6 5 6 6 6 6 6 8 6 7

A.6 Additional Comparison to Other Work

Table 7: Additional comparison to other work for ResNet50. We excluded these models from the main text since all of them focus on four-bit quantization and their relative performance degradation is larger than that of the works mentioned in Table 1.
ResNet50 Accuracy Size Latency Precision
No FT Absolute Relative MB Relative ms Relative W A
Baseline (ours) 77.60 100.00% 51.00 100.00% 5.20 100.00% 16 16
DFQ [25] 64.50 83.55% 13.78 27.02% 2.70 51.87% 4 4
ACIQ [2] 68.10 88.21% 13.78 27.02% 2.70 51.87% 4 4
PTQ-MP [23] 72.67 95.43% 13.78 27.02% 2.70 51.87% 4 4
BRECQ [22] 75.05 97.22% 13.78 27.02% 2.70 51.87% 4 4
MrBiQ [16] 75.17 97.62% 13.78 27.02% 2.70 51.87% 4 4

A.7 Search Algorithm Pseudo Code

Algorithm 1 Bisection search for ideal quantization configuration. Worst and average time complexity is $\mathcal{O}(b\log{N})$ with $b$ as the number of bit-width choices and $N$ the number of layers.
1: Input: data $x$, sensitivity metric $s$, accuracy target $t$, available bit-widths $bs$, model $f$.
2: Initialize working configuration $w$ with $\max(bs)$.
3: Initialize layer list $ll$ with all layers of $f$.
4: Sort $ll$ by $s$ in ascending order.
5: for $b$ in $bs$ do
6:     Initialize threshold $thr = \text{length}(ll)/2$.
7:     Initialize upper limit $upl$ to $\text{length}(ll)$.
8:     Initialize lower limit $lowl$ to 0.
9:     repeat
10:         Initialize local working config $lw$ with $w$.
11:         $lw[ll[0{:}thr]] \leftarrow b$.
12:         Evaluate $f(x, lw)$ and save accuracy $a$.
13:         if $a \geq t$ then
14:             $lowl \leftarrow thr$.
15:             $thr \leftarrow thr + (upl - thr)/2$.
16:         else
17:             $upl \leftarrow thr$.
18:             $thr \leftarrow thr - (thr - lowl)/2$.
19:         end if
20:     until $thr$ is not changing.
21:     $w[ll[0{:}thr]] \leftarrow b$.
22:     $ll \leftarrow ll[0{:}thr]$.
23: end for
24: Return: optimal working configuration $w$.
Algorithm 2 Progressive approach for ideal quantization configuration. Average time complexity is $\mathcal{O}((2-2^{-(b-1)})N)$ and worst case $\mathcal{O}(bN)$ where $b$ is the number of bit-width choices and $N$ the number of layers.
1: Input: data $x$, sensitivity metric $s$, accuracy target $t$, available bit-widths $bs$, model $f$.
2: Initialize working configuration $w$ with $\max(bs)$.
3: Initialize layer list $ll$ with all layers of the model.
4: Sort $ll$ by $s$ in ascending order.
5: for $b$ in $bs$ do
6:     Initialize quantizable layers $ql \leftarrow \emptyset$.
7:     for $l$ in $ll$ do
8:         $w[l] \leftarrow b$.
9:         Evaluate $f(x, w)$ and save accuracy $a$.
10:         if $a \geq t$ then
11:             Append $l$ to $ql$.
12:         else
13:             Set $w[l]$ back to last working value.
14:         end if
15:     end for
16:     $ll \leftarrow ql$.
17: end for
18: Return: optimal working configuration $w$.