
Sharpness-Aware Quantization for Deep Neural Networks

Jing Liu  Jianfei Cai  Bohan Zhuang
ZIP Lab, Monash University
Corresponding author. Email: bohan.zhuang@gmail.com
Abstract

Network quantization is a dominant paradigm of model compression. However, the abrupt changes in quantized weights during training often lead to severe loss fluctuations and result in a sharp loss landscape, making the gradients unstable and thus degrading the performance. Recently, Sharpness-Aware Minimization (SAM) has been proposed to smooth the loss landscape and improve the generalization performance of the models. Nevertheless, directly applying SAM to the quantized models can lead to perturbation mismatch or diminishment issues, resulting in suboptimal performance. In this paper, we propose a novel method, dubbed Sharpness-Aware Quantization (SAQ), to explore the effect of SAM in model compression, particularly quantization for the first time. Specifically, we first provide a unified view of quantization and SAM by treating them as introducing quantization noises and adversarial perturbations to the model weights, respectively. According to whether the noise and perturbation terms depend on each other, SAQ can be formulated into three cases, which are analyzed and compared comprehensively. Furthermore, by introducing an efficient training strategy, SAQ only incurs a little additional training overhead compared with the default optimizer (e.g., SGD or AdamW). Extensive experiments on both convolutional neural networks and Transformers across various datasets (i.e., ImageNet, CIFAR-10/100, Oxford Flowers-102, Oxford-IIIT Pets) show that SAQ improves the generalization performance of the quantized models, yielding the SOTA results in uniform quantization. For example, on ImageNet, SAQ outperforms AdamW by 1.2% on the Top-1 accuracy for 4-bit ViT-B/16. Our 4-bit ResNet-50 surpasses the previous SOTA method by 0.9% on the Top-1 accuracy.

1 Introduction

With powerful high-performance computing and massive labeled data, convolutional neural networks (CNNs) and Transformers have dramatically improved the accuracy of many computer vision (CV) and natural language processing (NLP) tasks, such as image classification [24, 16], dense prediction [63, 6], sentence classification [68, 13], and machine translation [52, 67], to the level of being ready for real-world applications. Despite the remarkable breakthroughs that deep learning has achieved, the considerable computational overhead and model size greatly hamper the development and deployment of deep learning techniques at scale, especially on resource-constrained devices such as mobile phones. To obtain compact models, many network quantization methods [27, 77] have been proposed to tackle the efficiency bottlenecks.

Refer to caption
(a) Full-precision ResNet-18
Refer to caption
(b) 2-bit ResNet-18
Figure 1: The loss landscapes of the full-precision and 2-bit ResNet-18 models on ImageNet. We plot the loss landscapes using the visualization method in [38]. More visualizations can be found in the supplementary.

Despite the high compression ratio, training a low-precision model is very challenging due to the discrete and non-differentiable nature of network quantization. In contrast to the full-precision ones, the low-precision models represent weights, activations, and even gradients with only a small set of values, which limits the representation power of the quantized models. As shown in Figure 1, a slight change in the full-precision weights coming from the gradient update or quantization noises might incur a large change in the quantized weights due to discretization, which leads to drastic loss fluctuations and results in a much sharper loss landscape [47]. As a result, the enormous loss fluctuations make gradients unreliable during optimization, which misleads the weight updates and thus incurs a performance drop.

There have been some studies showing that flat minima of the loss function found by stochastic gradient-based methods result in good generalization [25, 31, 19, 29]. Recently, Sharpness-Aware Minimization (SAM) [21] and its variants [36, 80, 33] have been proposed to smooth the loss landscape and significantly improve model generalization ability. Specifically, SAM first introduces perturbations to model weights and then minimizes a perturbed loss to seek parameters that lie in neighborhoods with uniformly low training loss. However, all the existing methods are based on full-precision over-parameterized models. How to perform SAM on the compressed models, especially on the quantized ones, has rarely been explored, which is a new and important problem. A simple solution is directly applying SAM to train the quantized models. Nevertheless, as we will discuss in Section 4.2, the introduced perturbations can be either mismatched with the quantized weights or diminished by clipping and discretization operations, which may lead to suboptimal performance.

In this paper, we propose a novel method, called Sharpness-Aware Quantization (SAQ), to find minima with both low loss value and low loss curvature and thus improve the generalization performance of the quantized models. To our knowledge, this is a pioneering work to study the effect of SAM in model compression, especially in network quantization. To this end, we first provide a unified view for quantization and SAM, where we treat them as introducing quantization noises $\bm{\epsilon}_{q}$ and adversarial perturbations $\hat{\bm{\epsilon}}_{s}$ to the model weights, respectively. According to whether $\bm{\epsilon}_{q}$ and $\hat{\bm{\epsilon}}_{s}$ are dependent on each other, we can formulate our SAQ into three cases. We then study and compare these cases comprehensively. Considering that SAQ requires additional training overhead to compute $\hat{\bm{\epsilon}}_{s}$, we further introduce an efficient training strategy, enabling SAQ to achieve comparable training efficiency to the default optimization counterpart such as AdamW or SGD, which makes it scalable to large models. Extensive experiments on both CNNs and Transformers across various datasets show the promising performance of SAQ.

Our main contributions are summarized as follows:

  • We propose SAQ to seek flatter minima for the quantized models in order to materially improve the generalization performance. To our knowledge, this is a pioneering work that jointly performs the model compression (i.e., quantization) and the loss landscape smoothing.

  • We provide a unified view for the landscape smoothing of the quantized models, where we consider quantization and SAM as introducing quantization noises and adversarial perturbations to the model weights, respectively. Relying on this, we present three cases of SAQ and make comprehensive comparisons among them. We further introduce an efficient training strategy to largely reduce the computational overhead brought by SAQ while keeping its performance gain.

  • Experiments on both CNNs and Transformers across a variety of datasets show that SAQ improves quantized models’ generalization performance and performs favorably against SOTA uniform quantization methods. For example, on ImageNet, our 4-bit ViT-B/16 surpasses AdamW by 1.2% on the Top-1 accuracy. Moreover, our 4-bit ResNet-50 exceeds the SOTA method by 0.9% on the Top-1 accuracy.

2 Related Work

Network quantization. Network quantization seeks to reduce the model size and computational cost by mapping weights, activations, and even gradients of a CNN or ViT to low-precision ones. Existing quantization methods can be roughly divided into two categories according to the quantization bitwidth, namely, fixed-point quantization [77, 5, 26, 10, 79, 74, 30, 20, 8, 34, 48, 23] and binary quantization [27, 62, 49, 43, 47, 2, 61]. To reduce the quantization error, existing methods [10, 74, 30, 20, 4, 72] explicitly parameterize the quantizer and train it jointly with the network parameters. To reduce the optimization difficulty incurred by the non-differentiable discretization, extensive methods [14, 73, 22, 37, 32] have been proposed to approximate the gradients. To encourage more information to be maintained by the quantized weights, several weight regularization methods [23, 46] have been proposed to alleviate the discrepancy between the full-precision and low-precision weights. To improve robustness against different quantization bitwidths, [1] introduces the $\ell_{1}$-norm of the loss gradients as a regularization term. Compared with these methods, SAQ focuses on improving the generalization performance of the quantized models from a new perspective by smoothing the loss landscape. Compared with recent studies [56, 11] that mitigate oscillations incurred by the implicit stochasticity of the straight-through estimator (STE) [3] to stabilize optimization, SAQ instead reduces the effect of adversarial perturbations and quantization noises by directly minimizing the perturbed quantization loss and the vanilla quantization loss.

Loss geometry and generalization. Hochreiter et al. [25] pioneered the proposition that flat local minima may generalize better in neural networks. Following that, several studies have investigated the relation between the geometry of the loss landscape and the generalization performance of the models [31, 65, 19, 7, 29, 54, 44]. Recently, Sharpness-Aware Minimization (SAM) [21] seeks to find parameters that lie in a region with uniformly low loss value and shows promising performance across various architectures and benchmark datasets. Concurrent works have also been proposed to introduce adversarial weight perturbations to improve the robustness against adversarial examples [71] or the generalization performance [76]. However, the computational overhead of these methods is roughly doubled compared with those using conventional optimizers (e.g., SGD and AdamW). To address this issue, ESAM [17], LookSAM [45], SAF [18], and AE-SAM [28] have been proposed to accelerate the SAM optimization without a performance drop. Apart from the efficiency issues, several methods including ASAM [36], GSAM [80] and Fisher SAM [33] have been proposed to improve the performance of SAM. More recently, SAM has been applied to improve the performance of pruned models [55]. While these methods target full-precision models, our proposed SAQ focuses on improving the generalization performance of the quantized models and is, to our knowledge, the first to jointly perform model compression (i.e., quantization) and loss landscape smoothing.

3 Preliminary

3.1 Network Quantization

In this paper, we use uniform quantization, which is hardware-friendly [77]. Given an $L$-layer deep model, let $w^{l}$ and $x^{l}$ be the weight and input activation of the $l$-th layer. For simplicity, we omit the layer index $l$ hereafter. Before performing quantization, we first normalize $w$ and $x$ into the range $[0,1]$ by applying clipping as

\hat{w}=\begin{cases}\frac{1}{2}\left(\frac{w}{\alpha_{w}}+1\right), & \text{if } -1<\frac{w}{\alpha_{w}}<1\\ 0, & \text{if } \frac{w}{\alpha_{w}}\leq-1\\ 1, & \text{if } \frac{w}{\alpha_{w}}\geq 1\end{cases},\qquad
\hat{x}=\begin{cases}\frac{x}{\alpha_{x}}, & \text{if } 0<\frac{x}{\alpha_{x}}<1\\ 0, & \text{if } \frac{x}{\alpha_{x}}\leq 0\\ 1, & \text{if } \frac{x}{\alpha_{x}}\geq 1\end{cases}, \quad (1)

where $\alpha_{w}$ and $\alpha_{x}$ are layer-wise trainable clipping levels that limit the range of weights and activations, respectively. We then quantize $\hat{z}\in\{\hat{w},\hat{x}\}$ to the discrete value $\bar{z}\in\{\bar{w},\bar{x}\}$ by $\bar{z}=D(\hat{z},s)=s\cdot\lfloor\hat{z}/s\rceil$, where $\lfloor\cdot\rceil$ is a rounding operator that returns the nearest integer of a given value and $s=1/(2^{b}-1)$ is the normalized step size for $b$-bit quantization. Lastly, we obtain the quantized $w$ and $x$ by

Q_{w}(w)=\alpha_{w}(2\bar{w}-1),\quad Q_{x}(x)=\alpha_{x}\bar{x}. \quad (2)

During training, the rounding operation $\lfloor\cdot\rceil$ is non-differentiable. To overcome this issue, following [77, 27], we apply the straight-through estimator (STE) [3] to approximate the gradient of the rounding operator by the identity mapping for backpropagation, namely, $\partial\bar{z}/\partial\hat{z}\approx 1$.
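For concreteness, a minimal PyTorch-style sketch of this quantizer (Eqs. (1)-(2) with the STE) is given below; the function and variable names are our own and this is an illustrative sketch rather than the released implementation.

import torch

def quantize_weight(w, alpha_w, bits=4):
    # Eq. (1): clip and normalize w to [0, 1].
    w_hat = torch.clamp(0.5 * (w / alpha_w + 1.0), 0.0, 1.0)
    # Discretize with the normalized step size s = 1 / (2^b - 1).
    s = 1.0 / (2 ** bits - 1)
    w_bar = torch.round(w_hat / s) * s
    # STE: the forward pass uses w_bar, the backward pass treats rounding as identity.
    w_bar = w_hat + (w_bar - w_hat).detach()
    # Eq. (2): map back to [-alpha_w, alpha_w].
    return alpha_w * (2.0 * w_bar - 1.0)

def quantize_activation(x, alpha_x, bits=4):
    # Eq. (1): clip and normalize x to [0, 1], then discretize with an STE.
    s = 1.0 / (2 ** bits - 1)
    x_hat = torch.clamp(x / alpha_x, 0.0, 1.0)
    x_bar = x_hat + (torch.round(x_hat / s) * s - x_hat).detach()
    # Eq. (2): rescale by the clipping level.
    return alpha_x * x_bar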

3.2 Sharpness-Aware Minimization

Without loss of generality, let ${\mathcal{S}}=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ be the training data. The goal of model training is to minimize the empirical risk ${\mathcal{L}}({\bf w})=\frac{1}{n}\sum_{i=1}^{n}\ell({\bf w},{\bf x}_{i},y_{i})$, where $\ell({\bf w},{\bf x}_{i},y_{i})$ is a loss function for the sample $({\bf x}_{i},y_{i})$ with model weights ${\bf w}$. Instead of seeking a single point with a locally minimal loss, Sharpness-Aware Minimization (SAM) [21] seeks a region that has uniformly low training loss (both low loss and low curvature), which can be formulated as the min-max optimization problem

\min_{{\bf w}}\max_{\|\bm{\epsilon}\|_{2}\leq\rho}{\mathcal{L}}({\bf w}+\bm{\epsilon}), \quad (3)

where the inner optimization problem attempts to find perturbations $\bm{\epsilon}$ within an $\ell_{2}$ Euclidean ball of pre-defined radius $\rho$ that maximize the perturbed loss ${\mathcal{L}}({\bf w}+\bm{\epsilon})$. To solve the inner problem, SAM approximates the optimal $\bm{\epsilon}$ that maximizes ${\mathcal{L}}({\bf w}+\bm{\epsilon})$ using a first-order Taylor expansion as

\begin{split}\hat{\bm{\epsilon}}&=\arg\max_{\|\bm{\epsilon}\|_{2}\leq\rho}{\mathcal{L}}({\bf w}+\bm{\epsilon})\\ &\approx\arg\max_{\|\bm{\epsilon}\|_{2}\leq\rho}{\mathcal{L}}({\bf w})+\bm{\epsilon}^{\top}\nabla_{{\bf w}}{\mathcal{L}}({\bf w})\\ &\approx\rho\frac{\nabla_{{\bf w}}{\mathcal{L}}({\bf w})}{\|\nabla_{{\bf w}}{\mathcal{L}}({\bf w})\|_{2}}.\end{split} \quad (4)

By substituting Eq. (4) back into Eq. (3), we then have the following optimization problem:

\min_{{\bf w}}{\mathcal{L}}({\bf w}+\hat{\bm{\epsilon}}). \quad (5)

Lastly, SAM updates the model weights based on the gradient $\nabla_{{\bf w}}{\mathcal{L}}({\bf w})|_{{\bf w}+\hat{\bm{\epsilon}}}$.
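For reference, one SAM update step (Eqs. (4)-(5)) can be sketched on top of a standard PyTorch optimizer as follows; loss_fn(model, batch) is an assumed closure returning the mini-batch loss, and the code is a simplified illustration rather than the official SAM implementation.

import torch

def sam_step(model, loss_fn, batch, optimizer, rho=0.05):
    # First forward-backward pass: gradient of L(w) at the current weights.
    loss_fn(model, batch).backward()

    # Build epsilon_hat = rho * grad / ||grad||_2 (Eq. (4)) and add it to the weights.
    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(p=2) for p in params]), p=2)
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    optimizer.zero_grad()

    # Second forward-backward pass: gradient of the perturbed loss L(w + epsilon_hat) (Eq. (5)).
    loss_fn(model, batch).backward()

    # Undo the perturbation, then update w with the perturbed gradient.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()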

4 Sharpness-Aware Quantization

As shown in Figure 1, the low-precision model shows a much sharper loss landscape compared with the full-precision one. Therefore, small perturbations on the full-precision weights may incur large changes in the quantized weights, leading to severe loss oscillation. As a result, the gradients are unstable during training, which might mislead the weight updates, and the resulting quantized model might converge to poor local minima. To overcome this, one may directly apply SAM to train the quantized models, which, however, can suffer from perturbation mismatch or diminishment problems due to the clipping and discretization operations in quantization (see Section 4.2), resulting in suboptimal performance.

In the following, we describe our proposed Sharpness-Aware Quantization (SAQ) to smooth the loss landscape and improve the generalization performance of the quantized models. We begin with a unified view on the loss landscape smoothing of quantized models in Section 4.1, and then provide an analysis of three different cases of SAQ in Section 4.2. Last, we introduce a fast optimization method for SAQ in Section 4.3.

4.1 Unifying Quantization and SAM

We consider quantization and SAM as introducing quantization noises $\bm{\epsilon}_{q}$ and adversarial perturbations $\bm{\epsilon}_{s}$ to the model weights ${\bf w}$, respectively, which provides a unified view for the loss landscape smoothing of the quantized models. In this way, the optimization problem is

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})~\text{where}~\hat{\bm{\epsilon}}_{s}=\arg\max_{\|\bm{\epsilon}_{s}\|_{2}\leq\rho}{\mathcal{L}}_{p}, \quad (6)

where ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ is a perturbed quantization loss and ${\mathcal{L}}_{p}$ is a perturbed loss depending on the full-precision weights ${\bf w}$ or the quantized weights $Q_{w}({\bf w})$.

4.2 Case Analysis for SAQ

To solve the optimization problem in Eq. (6), we need to obtain $\bm{\epsilon}_{q}$ as well as $\hat{\bm{\epsilon}}_{s}$. According to whether $\bm{\epsilon}_{q}$ and $\hat{\bm{\epsilon}}_{s}$ are dependent on each other, we can transform the loss function in Eq. (6) into different objectives, as shown in Table 1. To simplify the notation, we define the quantization error function $\bm{\epsilon}_{q}({\bf w})$ and the perturbation function $\hat{\bm{\epsilon}}_{s}({\bf w})$ as

\bm{\epsilon}_{q}({\bf w})=Q_{w}({\bf w})-{\bf w},\quad \hat{\bm{\epsilon}}_{s}({\bf w})=\rho\frac{\nabla_{{\bf w}}{\mathcal{L}}({\bf w})}{\|\nabla_{{\bf w}}{\mathcal{L}}({\bf w})\|_{2}}. \quad (7)
Table 1: Objectives for different cases of SAQ.
Name Objective function
Unified ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$
Case 1 ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}({\bf w})+\hat{\bm{\epsilon}}_{s}({\bf w}))$
Case 2 ${\mathcal{L}}(({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w}))+\bm{\epsilon}_{q}({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w})))$
Case 3 ${\mathcal{L}}(({\bf w}+\bm{\epsilon}_{q}({\bf w}))+\hat{\bm{\epsilon}}_{s}({\bf w}+\bm{\epsilon}_{q}({\bf w})))$

Case 1: We calculate the quantization noises $\bm{\epsilon}_{q}$ and the optimal perturbations $\hat{\bm{\epsilon}}_{s}$ independently. In this case, the perturbed loss is defined as ${\mathcal{L}}_{p}={\mathcal{L}}({\bf w}+\bm{\epsilon}_{s})$. By maximizing the perturbed loss under the $\ell_{2}$-norm constraint, the optimal perturbations can be approximated by $\hat{\bm{\epsilon}}_{s}({\bf w})$. In this way, the optimization problem can be transformed to

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}({\bf w})+\hat{\bm{\epsilon}}_{s}({\bf w})). \quad (8)

With Eq. (7), we have ${\bf w}+\bm{\epsilon}_{q}({\bf w})=Q_{w}({\bf w})$. Then, the above problem can be rewritten as

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}(Q_{w}({\bf w})+\hat{\bm{\epsilon}}_{s}({\bf w})). \quad (9)

In this case, the optimal perturbations introduced to the quantized weights $Q_{w}({\bf w})$ depend on the gradient of the full-precision weights ${\bf w}$. Using the chain rule, the gradient of the full-precision weights can be computed by

\begin{split}\frac{\partial{\mathcal{L}}_{p}({\bf w})}{\partial w_{i}}&=\frac{\partial{\mathcal{L}}_{p}({\bf w})}{\partial Q_{w}(w_{i})}\frac{\partial Q_{w}(w_{i})}{\partial w_{i}}\\ &=\begin{cases}\frac{\partial{\mathcal{L}}_{p}({\bf w})}{\partial Q_{w}(w_{i})} & \text{if } -1\leq\frac{w_{i}}{\alpha_{w}^{l}}\leq 1\\ 0 & \text{otherwise}\end{cases},\end{split} \quad (10)

where $w_{i}$ is the $i$-th element of ${\bf w}$ for layer $l$ and $\alpha_{w}^{l}$ is the corresponding clipping level. Due to the clipping operation, the difference between the full-precision weights' gradient $\partial{\mathcal{L}}_{p}({\bf w})/\partial w_{i}$ and the quantized weights' gradient $\partial{\mathcal{L}}_{p}({\bf w})/\partial Q_{w}(w_{i})$ results in a perturbation mismatch problem, which makes the training process noisy and might degrade the quantization performance.

Besides, Case 1 assumes that $\bm{\epsilon}_{q}$ and $\hat{\bm{\epsilon}}_{s}$ are computed independently, which ignores the dependency between them. To address this issue, we introduce another two cases of SAQ in the following.

Case 2: We first combine the model weights with the optimal perturbations $\hat{\bm{\epsilon}}_{s}$ and then introduce the quantization noises $\bm{\epsilon}_{q}$ to the perturbed model weights. In this way, the optimization problem is transformed to

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}(({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w}))+\bm{\epsilon}_{q}({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w}))). \quad (11)

Same as Case 1, the perturbed loss is ${\mathcal{L}}_{p}={\mathcal{L}}({\bf w}+\bm{\epsilon}_{s})$ and the optimal perturbations can be obtained by $\hat{\bm{\epsilon}}_{s}({\bf w})$. In this case, the quantization noises $\bm{\epsilon}_{q}({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w}))$ are represented as a function of the optimal perturbations $\hat{\bm{\epsilon}}_{s}({\bf w})$. Using Eq. (7), we reformulate the problem as

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}(Q_{w}({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w}))). \quad (12)

Nevertheless, the introduced small perturbations may not change the resulting quantized weights due to the discretization process, leading to the perturbation diminishment issue, i.e., $Q_{w}({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w}))=Q_{w}({\bf w})$. As a result, ${\mathcal{L}}(Q_{w}({\bf w}+\hat{\bm{\epsilon}}_{s}({\bf w})))$ might reduce to ${\mathcal{L}}(Q_{w}({\bf w}))$, which degenerates to regular quantization.
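As a toy numerical illustration of this diminishment (our own example, not from the paper): with $b=2$ bits the step size is $s=1/3$, so any perturbation smaller than $s/2$ on the normalized weight leaves the rounded value unchanged.

import torch

s = 1.0 / (2 ** 2 - 1)                      # 2-bit normalized step size, s = 1/3
w_hat = torch.tensor(0.40)                   # normalized weight
eps = torch.tensor(0.05)                     # small SAM perturbation, |eps| < s / 2
print(torch.round(w_hat / s) * s)            # tensor(0.3333)
print(torch.round((w_hat + eps) / s) * s)    # tensor(0.3333): the perturbation is diminished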

Case 3: We first combine the model weights with the quantization noises $\bm{\epsilon}_{q}$ and then introduce the optimal perturbations $\hat{\bm{\epsilon}}_{s}$. In this way, the optimization problem becomes

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}(({\bf w}+\bm{\epsilon}_{q}({\bf w}))+\hat{\bm{\epsilon}}_{s}({\bf w}+\bm{\epsilon}_{q}({\bf w}))). \quad (13)

In this case, we define the perturbed loss as ${\mathcal{L}}_{p}={\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}({\bf w})+\bm{\epsilon}_{s})$ and obtain the optimal perturbations by $\hat{\bm{\epsilon}}_{s}({\bf w}+\bm{\epsilon}_{q}({\bf w}))$, which is expressed as a function of the quantization noises $\bm{\epsilon}_{q}({\bf w})$. With Eq. (7), the optimization problem can be rewritten as

\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}(Q_{w}({\bf w})+\hat{\bm{\epsilon}}_{s}(Q_{w}({\bf w}))), \quad (14)

where we introduce perturbations to the quantized weights $Q_{w}({\bf w})$ rather than the full-precision weights ${\bf w}$ as in Case 2. In this way, the introduced perturbations will not be diminished by the quantization operation. Moreover, compared with Case 1, Case 3 does not suffer from the perturbation mismatch issue since the optimal perturbations depend on the gradient of the quantized weights instead of the full-precision ones. In summary, Case 3 is the best suited to smooth the loss landscape of the quantized models.
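To make Eq. (14) concrete, below is a minimal, self-contained sketch of the Case 3 objective for a single linear layer; it is our own illustrative code (assuming the STE quantizer of Section 3.1 and an MSE loss), not the authors' implementation.

import torch
import torch.nn.functional as F

def quantize_ste(w, alpha, bits=2):
    # Uniform weight quantizer of Section 3.1 with an STE through the rounding.
    s = 1.0 / (2 ** bits - 1)
    w_hat = torch.clamp(0.5 * (w / alpha + 1.0), 0.0, 1.0)
    w_bar = w_hat + (torch.round(w_hat / s) * s - w_hat).detach()
    return alpha * (2.0 * w_bar - 1.0)

def case3_loss(w, alpha, x, y, rho=0.1):
    # Q_w(w) and the vanilla quantization loss L(Q_w(w)).
    qw = quantize_ste(w, alpha)
    loss = F.mse_loss(x @ qw, y)
    # Gradient w.r.t. the *quantized* weights, used to build epsilon_hat_s(Q_w(w)) (Eq. (7)).
    (g_qw,) = torch.autograd.grad(loss, qw, retain_graph=True)
    eps_s = rho * g_qw / (g_qw.norm() + 1e-12)
    # Case 3 objective (Eq. (14)): the perturbation is added after quantization, so it is
    # neither mismatched (Case 1) nor rounded away (Case 2).
    return F.mse_loss(x @ (qw + eps_s.detach()), y)

# Toy usage: gradients flow back to w and alpha through the STE.
w = torch.randn(8, 4, requires_grad=True)
alpha = torch.tensor(1.0, requires_grad=True)
x, y = torch.randn(16, 8), torch.randn(16, 4)
case3_loss(w, alpha, x, y).backward()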

4.3 Fast Optimization for SAQ

Final objective. In SAQ, we seek model parameters ${\bf u}\in\{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}\}$ that are located in a neighborhood with a uniformly low value of ${\mathcal{L}}(Q_{w}({\bf w}))$ by minimizing a surrogate loss ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$. However, the sharp loss landscape of the quantized model leads to a large angle $\theta$ between the gradient of the perturbed quantization loss ${\bf g}_{s}=\nabla_{{\bf u}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ and the gradient of the vanilla quantization loss ${\bf g}=\nabla_{{\bf u}}{\mathcal{L}}(Q_{w}({\bf w}))$ (see Figure 4), which forms a gap between ${\mathcal{L}}(Q_{w}({\bf w}))$ and ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$, particularly in extremely low-bit cases. Consequently, the optimization becomes challenging and leads to suboptimal performance.

To overcome this challenge, we introduce an additional vanilla quantization loss ${\mathcal{L}}(Q_{w}({\bf w}))$ into the objective and reformulate the optimization problem as

\begin{split}&\min_{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})+{\mathcal{L}}(Q_{w}({\bf w}))\\ &\text{where}~\hat{\bm{\epsilon}}_{s}=\arg\max_{\|\bm{\epsilon}\|_{2}\leq\rho}{\mathcal{L}}_{p},\end{split} \quad (15)

where we enforce the quantized models to find minima with both low ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ and low ${\mathcal{L}}(Q_{w}({\bf w}))$. As the gradient of ${\mathcal{L}}(Q_{w}({\bf w}))$ has already been computed during the backpropagation used to solve the inner problem, as shown in Eq. (4), we can reuse it when solving the outer problem.

Efficient training. Note that solving the problem in Eq. (15) requires additional forward and backward propagation, which roughly doubles the training cost compared with regular optimizers such as SGD. To address this issue, we customize an efficient training strategy following LookSAM [45]. As shown in Figure 2, we decompose the perturbed quantization loss gradient ${\bf g}_{s}=\nabla_{{\bf u}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ into two components, ${\bf g}_{p}$ and ${\bf g}_{v}$, which are parallel and vertical to the gradient of the vanilla quantization loss ${\bf g}=\nabla_{{\bf u}}{\mathcal{L}}(Q_{w}({\bf w}))$, respectively. We have a similar observation as in LookSAM that ${\bf g}_{v}$ changes much more slowly than ${\bf g}_{p}$ and ${\bf g}_{s}$ during training. Therefore, we only calculate ${\bf g}_{s}$ every $\tau$ iterations and obtain its vertical component ${\bf g}_{v}$. For the intermediate iterations, we reuse the direction of ${\bf g}_{v}$ to approximate ${\bf g}_{s}={\bf g}+\beta\|{\bf g}\|\frac{{\bf g}_{v}}{\|{\bf g}_{v}\|}$, where $\beta$ is a hyperparameter scaling the gradient. As a result, SAQ is only slightly slower than regular optimizers (see Table 6). We then update the model parameters using the combined gradient ${\bf g}_{s}+{\bf g}$. The training process of SAQ is summarized in Algorithm 1.

Refer to caption
Figure 2: An illustration of the gradient decomposition in SAQ, where we decompose ${\bf g}_{s}=\nabla_{{\bf u}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ into components ${\bf g}_{p}$ and ${\bf g}_{v}$ that are parallel and vertical to ${\bf g}=\nabla_{{\bf u}}{\mathcal{L}}(Q_{w}({\bf w}))$. $\theta$ is the angle between ${\bf g}_{s}$ and ${\bf g}$.
Algorithm 1 Training algorithm for SAQ
0:  Training set ${\mathcal{D}}$, model parameters ${\bf u}\in\{{\bf w},\bm{\alpha}_{w},\bm{\alpha}_{x}\}$, learning rate $\eta$, update frequency $\tau$, total training iterations $T$, hyperparameter $\beta$.
1:  for $t\in\{1,2,\dots,T\}$ do
2:     Sample a batch of data ${\mathcal{B}}_{t}$ from ${\mathcal{D}}$
3:     Compute ${\bf g}=\nabla_{{\bf u}}{\mathcal{L}}(Q_{w}({\bf w}))$ on ${\mathcal{B}}_{t}$
4:     if $t=\tau k,~k\in\mathbb{N}^{+}$ then
5:        Compute $\bm{\epsilon}_{q}$ and $\hat{\bm{\epsilon}}_{s}$
6:        Compute ${\bf g}_{s}=\nabla_{{\bf u}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ on ${\mathcal{B}}_{t}$
7:        Project ${\bf g}_{s}$ onto ${\bf g}$ to obtain ${\bf g}_{p}=\frac{{\bf g}_{s}^{\top}{\bf g}}{{\bf g}^{\top}{\bf g}}{\bf g}$
8:        Compute ${\bf g}_{v}={\bf g}_{s}-{\bf g}_{p}$
9:     else
10:       Compute ${\bf g}_{s}={\bf g}+\beta\|{\bf g}\|\frac{{\bf g}_{v}}{\|{\bf g}_{v}\|}$
11:    end if
12:    Update ${\bf u}$ by ${\bf u}={\bf u}-\eta({\bf g}_{s}+{\bf g})$
13:  end for
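A minimal sketch of the gradient decomposition and reuse in Algorithm 1 (lines 7, 8, and 10), assuming the gradients have been flattened into 1-D tensors; the helper names are our own.

import torch

def decompose(g_s, g):
    # Alg. 1, lines 7-8: split g_s into components parallel and vertical to g.
    g_p = (torch.dot(g_s, g) / torch.dot(g, g)) * g
    g_v = g_s - g_p
    return g_p, g_v

def approx_gs(g, g_v, beta=0.7):
    # Alg. 1, line 10: reuse the direction of g_v on intermediate iterations.
    return g + beta * g.norm() * g_v / (g_v.norm() + 1e-12)

# Alg. 1, line 12: the parameters are then updated with the combined gradient g_s + g.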

5 Experiments

Table 2: Performance comparisons of different methods with ResNet-18, ResNet-34 and ResNet-50 on ImageNet. We obtain DoReFa-Net results from [10]. “W/A” refers to the bitwidth of weights and activations, respectively. “FP” represents the Top-1 accuracy of the full-precision models. “-” denotes that the results are not reported. We do not apply the advanced performance-boosting techniques used by the compared methods, as discussed in Section 5.
Method Bitwidth Accuracy (%) Bitwidth Accuracy (%)
(W/A) Top-1 Top-5 (W/A) Top-1 Top-5
ResNet-18 (FP: 70.7)
DoReFa-Net [77] 4/4 68.1 88.1 2/2 62.6 84.4
PACT [10] 4/4 69.2 89.0 2/2 64.4 85.6
LQ-Nets [74] 4/4 69.3 88.8 2/2 64.9 85.9
DSQ [22] 4/4 69.6 - 2/2 65.2 -
BRECQ [40] 4/4 69.6 - 2/2 - -
FAQ [53] 4/4 69.8 89.1 2/2 - -
QIL [30] 4/4 70.1 - 2/2 65.7 -
LLT [70] 4/4 70.4 89.6 2/2 66.0 86.2
Auxi [78] 4/4 - - 2/2 66.7 87.0
DAQ [32] 4/4 70.5 - 2/2 66.9 -
EWGS [37] 4/4 70.6 - 2/2 67.0 -
BR [23] 4/4 70.8 89.6 2/2 67.2 87.3
APOT [39] 4/4 70.7 89.6 2/2 67.3 87.5
LSQ [20] 4/4 71.1 90.0 2/2 67.6 87.6
SAQ (Ours) 4/4 71.6 90.0 2/2 67.8 87.6
ResNet-34 (FP: 74.1)
LQ-Nets [74] 4/4 - - 2/2 69.8 89.1
DSQ [22] 4/4 72.8 - 2/2 70.0 -
FAQ [53] 4/4 73.3 91.3 2/2 - -
QIL [30] 4/4 73.7 - 2/2 70.6 -
APOT [39] 4/4 73.8 91.6 2/2 70.9 89.7
DAQ [32] 4/4 73.7 - 2/2 71.0 -
Auxi [78] 4/4 - - 2/2 71.2 89.8
EWGS [37] 4/4 73.9 - 2/2 71.4 -
LSQ [20] 4/4 74.1 91.7 2/2 71.6 90.3
SAQ (Ours) 4/4 75.0 92.3 2/2 71.8 90.4
ResNet-50 (FP: 76.8)
DoReFa-Net [77] 4/4 71.4 89.8 2/2 67.1 87.3
LQ-Net [74] 4/4 75.1 92.4 2/2 71.5 90.3
FAQ [53] 4/4 76.3 93.0 2/2 - -
PACT [10] 4/4 76.5 93.2 2/2 72.2 90.5
APOT [39] 4/4 76.6 93.1 2/2 73.4 91.4
LSQ [20] 4/4 76.7 93.2 2/2 73.7 91.5
Auxi [78] 4/4 - - 2/2 73.8 91.4
SAQ (Ours) 4/4 77.6 93.6 2/2 74.5 91.9
  • denotes that the first and last layers are not quantized.

Datasets and evaluation metrics. We evaluate our method on ImageNet [12] which is a large-scale dataset containing 1.28 million training images and 50k validation samples with 1k classes. We measure the performance of different methods using the Top-1 and Top-5 accuracy.

Implementation details. Our implementations are based on PyTorch [60]. We apply SAQ to CNNs and vision Transformers, including ResNet-18 [24], ResNet-34, ResNet-50, MobileNetV2 [64] and ViT [16]. We first train the full-precision models and use them to initialize the low-precision ones. Following LSQ [20], we quantize both weights and activations for all matrix multiplication layers, including convolutional layers, fully-connected layers, and self-attention layers. For the first and last layers, we quantize both weights and activations to 8-bit to preserve the performance. We do not apply quantization to the input images since they have been quantized to 8-bit during image preprocessing.

For CNNs, we use the uniform quantization method described in Section 3.1. Relying on SGD with a momentum of 0.9, we apply SAQ with Case 3 to train the quantized models with a mini-batch size of 512 unless otherwise specified. Following APOT [39], we use weight normalization before quantization. We initialize the clipping levels to 1. We fine-tune for 90 epochs for ResNet-18, ResNet-34, and ResNet-50. We set the weight decay to $1\times 10^{-4}$ by default, except for 2-bit ResNet-18, for which we set it to $2.5\times 10^{-5}$ following [20]. For MobileNetV2, we fine-tune for 140 epochs following [58] and set the weight decay to $4\times 10^{-5}$. The learning rate is initialized to 0.02 and decreased to 0 following the cosine annealing schedule [50]. For ViTs, we use LSQ+ [4] uniform quantization following Q-ViT [42]. We initialize the clipping levels by minimizing the quantization error following [41]. Relying on AdamW [51], we apply SAQ with Case 3 to train ViTs. The learning rate is initialized to $2\times 10^{-4}$ and decreased to 0 using cosine annealing. We train the quantized models for 150 epochs with a mini-batch size of 1,024. We do not apply automatic mixed-precision training, following [42]. For the hyperparameter $\rho$, we conduct a grid search over $\{0.02, 0.05, 0.1, 0.15, 0.2, \dots, 1.0\}$ to find appropriate values, following the common practice in SAM [21] and GSAM [80]. More details regarding $\rho$ and its sensitivity analysis can be found in the supplementary material. Following LookSAM [45], we set the hyperparameter $\beta$ and the update frequency $\tau$ to 0.7 and 4, respectively. Due to limited space, we put more implementation details in the supplementary material.

Compared methods. We compare with enormous fixed-point quantization methods, including DoReFa-Net [77], PACT [10], LQ-Nets [74], DSQ [22], FAQ [53], QIL [30], Auxi [78], PROFIT [58], LSQ [20], APOT [39], LSQ+ [4], LLSQ [75], DAQ [32], BRECQ [40], EWGS [37], BR [23], OOQ [56], and LLT [70].

Table 3: Performance comparisons in terms of ViT-S/32, ViT-S/16, ViT-B/32, ViT-B/16, and MobileNetV2 on ImageNet. We obtain the results of PACT from [69]. We do not apply iterative training with weight freezing and progressive quantization in PROFIT [58] to improve the performance of the quantized models.
Network Method Bitwidth (W/A) Top-1 Acc. (%) Top-5 Acc. (%)
ViT-S/32 (FP: 68.5) LSQ+ [4] 4/4 68.0 88.1
SAQ (Ours) 4/4 68.6 88.4
ViT-S/16 (FP: 75.9) LSQ+ [4] 4/4 76.1 93.0
SAQ (Ours) 4/4 76.9 93.5
ViT-B/32 (FP: 70.7) LSQ+ [4] 4/4 72.1 90.4
SAQ (Ours) 4/4 72.7 90.7
ViT-B/16 (FP: 77.2) LSQ+ [4] 4/4 78.0 93.4
SAQ (Ours) 4/4 79.2 94.2
MobileNetV2 (FP: 71.9) PACT [10] 4/4 61.4 83.7
DSQ [22] 4/4 64.8 -
BRECQ [40] 4/4 66.6 -
LLSQ [75] 4/4 67.4 88.0
EWGS [37] 4/4 70.3 -
BR [23] 4/4 70.4 89.4
OOQ [56] 4/4 70.6 -
PROFIT [58] 4/4 71.6 90.4
SAQ (Ours) 4/4 72.0 90.4
  • denotes that the first and last layers are not quantized.

Unless otherwise specified, we do not apply advanced techniques to boost performance such as pre-activation in LSQ, non-uniform quantization in APOT, weight regularization in BR and OOQ, knowledge distillation in Auxi and PROFIT, iterative training with weight freezing in OOQ and PROFIT, gradient scaling in EWGS, progressive quantization in PROFIT, batch normalization re-estimation in PROFIT and OOQ, asymmetric quantization in PROFIT and LSQ+. More advanced techniques used in other quantization methods are discussed in the supplementary.

5.1 Main Results

We apply SAQ to quantize ResNet-18, ResNet-34, and ResNet-50 on ImageNet. From Table 2, SAQ outperforms existing SOTA uniform quantization methods by a large margin. The improvement is more obvious with the increase of bitwidth. For example, for 2-bit ResNet-34, the Top-1 accuracy improvement of SAQ over LSQ is 0.2% while for the 4-bit one is 0.9%. We speculate that the loss landscape of the quantized models becomes sharper with the decrease of bitwidths due to the discretization operation in quantization as shown in the supplementary. As a result, smoothing the loss landscapes of the 2-bit quantized models is harder than the 4-bit counterparts. Moreover, for 2-bit quantization, deeper models show more obvious accuracy improvement over the SOTA methods. For instance, SAQ surpasses Auxi by 0.7% on 2-bit ResNet-50 while only bringing 0.2% Top-1 accuracy improvement over LSQ on 2-bit ResNet-18.

Table 4: Performance comparisons of different cases. We report the results of ResNet-50 on ImageNet. $\lambda_{\mathrm{max}}$ denotes the largest eigenvalue of the Hessian of the converged quantized model. Lower $\lambda_{\mathrm{max}}$ indicates a flatter loss landscape.
Method Bitwidth Acc. (%) $\lambda_{\mathrm{max}}$ Bitwidth Acc. (%) $\lambda_{\mathrm{max}}$
(W/A) Top-1 Top-5 (W/A) Top-1 Top-5
ResNet-50
SGD 4/4 76.5 93.1 71.8 2/2 73.9 91.6 60.1
Case 1 4/4 77.3 93.5 6.6 2/2 74.3 91.8 12.6
Case 2 4/4 77.0 93.3 14.0 2/2 74.2 91.8 24.4
Case 3 4/4 77.6 93.6 6.3 2/2 74.5 91.9 9.5

Remarkably, our 4-bit ResNet-34 surpasses the full-precision model by 0.9% on the Top-1 accuracy. One possible reason is that performing quantization with SAQ helps to remove redundancy and regularize the networks. Similar phenomena are also observed in LSQ.

To show the effectiveness of our method on lightweight models, we apply SAQ to quantize MobileNetV2. From Table 3, our SAQ yields better performance than the SOTA uniform quantization methods. For example, SAQ exceeds PROFIT by 0.4% on the Top-1 accuracy. We also apply SAQ to ViT [16]. We implement LSQ+ following [42] and compare our method with it. From Table 3, our SAQ shows consistently superior performance over the baseline LSQ+ (e.g., 1.2% Top-1 accuracy improvement on ViT-B/16).

Refer to caption
Figure 3: The training (dashed line) and validation (solid line) losses for 4-bit ViT-B/16 on ImageNet. The $\lambda_{\mathrm{max}}$ of the quantized models obtained by SGD and SAQ are 14,564 and 676, respectively.
Refer to caption
Figure 4: Effect of introducing the vanilla quantization loss (VQL). We visualize the cosine similarity between $\nabla_{{\bf u}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ and $\nabla_{{\bf u}}{\mathcal{L}}(Q_{w}({\bf w}))$ for 2-bit and 4-bit ResNet-18 on ImageNet.

5.2 Ablation Studies

Performance comparisons of different cases. To investigate the effectiveness of the different cases introduced in Section 4.2, we apply different methods to quantize ResNet-50 on ImageNet. We use “SGD” to represent training the quantized models with vanilla SGD. To measure the loss curvature, we report the largest eigenvalue $\lambda_{\mathrm{max}}$ of the Hessian of the converged quantized models following [9, 21]. From Table 4, Case 1, Case 2, and Case 3 all yield significantly higher accuracy and lower $\lambda_{\mathrm{max}}$ than the SGD counterpart. This strongly shows that our method is able to smooth the loss landscape and improve the generalization performance of the quantized models. Among the three cases, Case 2 performs the worst with the lowest accuracy and the highest $\lambda_{\mathrm{max}}$, which suggests that the perturbations introduced by SAM might be diminished due to the discretization, leading to suboptimal performance. Moreover, Case 3 consistently performs better than Case 1. For example, on 4-bit ResNet-50, Case 3 exceeds Case 1 by 0.3% on the Top-1 accuracy while also achieving a lower $\lambda_{\mathrm{max}}$. These results indicate that the perturbation mismatch issue in Case 1 might degrade the quantization performance.
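For reference, $\lambda_{\mathrm{max}}$ can be estimated by power iteration with Hessian-vector products obtained via double backpropagation; the sketch below is our own illustrative code following this common practice, not the evaluation script used in the paper.

import torch

def largest_hessian_eigenvalue(loss, params, iters=20):
    # Power iteration with Hessian-vector products (double backprop) to estimate lambda_max.
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(iters):
        norm = torch.sqrt(sum((x * x).sum() for x in v))
        v = [x / (norm + 1e-12) for x in v]
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()  # Rayleigh quotient v^T H v
        v = [h.detach() for h in hv]
    return eig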

Table 5: Effect of different losses in the objective function in Eq. (15) on ImageNet. “VQL” represents the vanilla quantization loss ${\mathcal{L}}(Q_{w}({\bf w}))$ and “PQL” denotes the perturbed quantization loss ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$.
VQL PQL Bitwidth Accuracy (%) Bitwidth Accuracy (%)
(W/A) Top-1 Top-5 (W/A) Top-1 Top-5
ResNet-18
✓ ✗ 2/2 67.3 87.4 4/4 71.1 89.8
✗ ✓ 2/2 67.5 87.5 4/4 71.3 90.0
✓ ✓ 2/2 67.8 87.6 4/4 71.6 90.1

Besides, to investigate the generalization capability of SGD and our SAQ, we show the training and validation losses of 4-bit ViT-B/16 on ImageNet in Figure 3. Compared with SGD, SAQ achieves lower training and validation losses at the beginning of training. At the end of training, SAQ shows a slightly higher training loss but a much lower validation loss and $\lambda_{\mathrm{max}}$ than SGD. These results justify that SAQ converges to a flatter local minimum and thus achieves better generalization performance.

Effect of different losses in the objective function. To investigate the effect of the different components in the objective in Eq. (15), we apply different methods to quantize ResNet-18. From Table 5, using the perturbed quantization loss ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ surpasses the one equipped with the vanilla quantization loss ${\mathcal{L}}(Q_{w}({\bf w}))$ by 0.2% on the Top-1 accuracy for both 2-bit and 4-bit quantization, supporting that smoothing the loss landscape improves the generalization performance of the quantized models. By combining ${\mathcal{L}}(Q_{w}({\bf w}))$ and ${\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$, we observe a further Top-1 accuracy improvement of 0.3% for both 2-bit and 4-bit quantization.

To further show the effect of ${\mathcal{L}}(Q_{w}({\bf w}))$, we visualize the cosine similarity between $\nabla_{{\bf u}}{\mathcal{L}}({\bf w}+\bm{\epsilon}_{q}+\hat{\bm{\epsilon}}_{s})$ and $\nabla_{{\bf u}}{\mathcal{L}}(Q_{w}({\bf w}))$ of the quantized ResNet-18 in Figure 4. We observe that the cosine similarities of the 2-bit models are lower than those of the 4-bit ones during training. By introducing ${\mathcal{L}}(Q_{w}({\bf w}))$, all quantized models achieve higher similarities. These results justify that introducing ${\mathcal{L}}(Q_{w}({\bf w}))$ helps to reduce the gap between the two losses and improve the performance of the quantized models.
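For reference, the plotted cosine similarity can be computed by flattening and concatenating the per-parameter gradients; a minimal helper (our own, assuming lists of gradient tensors) might look like:

import torch

def grad_cosine_similarity(grads_a, grads_b):
    # Cosine similarity between two sets of per-parameter gradients.
    a = torch.cat([g.reshape(-1) for g in grads_a])
    b = torch.cat([g.reshape(-1) for g in grads_b])
    return torch.nn.functional.cosine_similarity(a, b, dim=0)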

Table 6: Effect of the efficient training (ET) strategy on ImageNet. We measure the training throughput on 4 NVIDIA V100 GPUs with a mini-batch size of 512.
Network Method Accuracy (%) Train Throughput
Top-1 Top-5 (images/s)
4-bit ResNet-34 SGD 74.4 91.9 1632
SAQ w/o ET 75.1 92.2 841
SAQ w/ ET 75.0 92.3 1454

Effect of the efficient training strategy. To investigate the effect of the efficient training strategy mentioned in Section 4.3, we apply SAQ to train 4-bit ResNet-34 with and without the efficient training strategy on ImageNet. The training throughput is quantified by the number of processed images per second on 4 NVIDIA V100 GPUs with a mini-batch size of 512. From Table 6, we observe that our SAQ with the efficient training strategy is only about 11% slower than SGD while achieving nearly the same performance as SAQ without the efficient training strategy.

Table 7: Effect of jointly performing quantization and loss landscape smoothing on ImageNet. The Top-1 accuracy of the full-precision ResNet-18 obtained by SAM is 70.9%.
Method Bitwidth Accuracy (%) Bitwidth Accuracy (%)
(W/A) Top-1 Top-5 (W/A) Top-1 Top-5
ResNet-18 (FP: 70.7)
SGD 2/2 67.3 87.4 4/4 71.1 89.8
SAM → SGD 2/2 67.4 87.5 4/4 71.1 89.9
SAQ (Ours) 2/2 67.8 87.6 4/4 71.6 90.1

SAQ vs. train flat and then quantize. To further investigate the effectiveness of SAQ, we compare our method with “SAM → SGD”, which first obtains a full-precision model with SAM and then trains a quantized model with SGD using the full-precision model weights as initialization. We also include “SGD”, which trains the quantized models with SGD, for comparison. We report the results of different methods with ResNet-18 on ImageNet. As seen from Table 7, for the full-precision model, SAM outperforms SGD by 0.2% on the Top-1 accuracy. However, for 2-bit quantization, SAM → SGD only yields a 0.1% Top-1 accuracy improvement compared with SGD. We speculate that smoothing the loss landscape of the pre-trained models provides a better weight initialization for the quantized models. Nevertheless, due to the large distribution gap between the quantized weights and the full-precision weights, the performance gain over SGD is limited. In contrast, SAQ performs better than SAM → SGD, which shows the superiority of jointly performing quantization and loss landscape smoothing. For example, on ResNet-18, SAQ exceeds SAM → SGD by 0.4% on the Top-1 accuracy. These results suggest that the improvement comes not only from SAM but also from our customization of SAM for network quantization.

Table 8: Transfer performance comparisons on downstream tasks. We measure the performance of different methods on 4-bit ResNet-50 using the Top-1 accuracy (%).
Method SGD SAQ (Ours)
CIFAR-10 [35] 97.0±0.0 97.1±0.1
CIFAR-100 [35] 82.4±0.2 83.1±0.2
Oxford Flowers-102 [57] 96.1±0.2 96.4±0.4
Oxford-IIIT Pets [59] 94.9±0.2 95.9±0.2

More results on transfer learning. To evaluate the transfer power of different quantized models, we conduct transfer learning experiments on new datasets, including CIFAR-10 [35], CIFAR-100, Oxford-IIIT Pets [59], and Oxford Flowers-102 [57]. We use the quantized models trained on ImageNet to initialize the model weights and then fine-tune all layers using SGD. We repeat the experiments 5 times and report the mean as well as the standard deviation of the Top-1 accuracy. More implementation details can be found in the supplementary. From Table 8, SAQ leads to much better transfer performance. For example, on Oxford-IIIT Pets, SAQ quantized 4-bit ResNet-50 brings 1.0% Top-1 accuracy improvement over the SGD counterpart. These results justify that SAQ improves the generalization performance by smoothing the loss landscape of the quantized models.

6 Conclusion and Future Work

In this paper, we have devised a new training approach, called Sharpness-Aware Quantization (SAQ), to improve the generalization capability of the quantized models, which jointly performs compression (i.e., quantization) and loss landscape smoothing for the first time. To this end, we have provided a unified view for the loss landscape smoothing of the quantized models by formulating quantization and SAM as introducing quantization noises and adversarial perturbations to the model weights. According to whether the quantization noises and adversarial perturbations are dependent on each other, we have formulated SAQ into three cases, which have been fully studied and compared. We have further introduced an efficient training strategy that substantially reduces the training overhead, allowing SAQ to achieve comparable training speed to that of the default optimization method. Extensive experiments on various datasets with different architectures including CNNs and Transformers have demonstrated that SAQ consistently improves the performance of the quantized models and yields the SOTA uniform quantization results. In the future, our method can be extended to jointly perform pruning, quantization, and loss landscape smoothing to obtain more compact models with better performance.

References

  • [1] Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient $\ell_{1}$ regularization for quantization robustness. In ICLR, 2020.
  • [2] Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael R Lyu, and Irwin King. Binarybert: Pushing the limit of bert quantization. In ACL/IJCNLP, 2021.
  • [3] Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • [4] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In CVPRW, pages 696–697, 2020.
  • [5] Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave gaussian quantization. In CVPR, pages 5918–5926, 2017.
  • [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
  • [7] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In ICLR, 2017.
  • [8] Peng Chen, Jing Liu, Bohan Zhuang, Mingkui Tan, and Chunhua Shen. Aqd: Towards accurate quantized object detection. In CVPR, pages 104–113, 2021.
  • [9] Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. In ICLR, 2022.
  • [10] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. Pact: Parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085, 2018.
  • [11] Alexandre Défossez, Yossi Adi, and Gabriel Synnaeve. Differentiable model compression via pseudo quantization noise. TMLR, 2022.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
  • [13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, pages 4171–4186, 2019.
  • [14] Ruizhou Ding, Ting-Wu Chin, Zeye Liu, and Diana Marculescu. Regularizing activation distribution for training binarized deep networks. In CVPR, pages 11408–11417, 2019.
  • [15] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Hawq: Hessian aware quantization of neural networks with mixed-precision. In ICCV, pages 293–302, 2019.
  • [16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [17] Jiawei Du, Hanshu Yan, Jiashi Feng, Joey Tianyi Zhou, Liangli Zhen, Rick Siow Mong Goh, and Vincent Tan. Efficient sharpness-aware minimization for improved training of neural networks. In ICLR, 2022.
  • [18] Jiawei Du, Daquan Zhou, Jiashi Feng, Vincent YF Tan, and Joey Tianyi Zhou. Sharpness-aware training for free. In NeurIPS, 2022.
  • [19] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI, 2017.
  • [20] Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S. Modha. Learned step size quantization. In ICLR, 2020.
  • [21] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021.
  • [22] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In ICCV, pages 4852–4861, 2019.
  • [23] Tiantian Han, Dong Li, Ji Liu, Lu Tian, and Yi Shan. Improving low-precision network quantization via bin regularization. In ICCV, pages 5261–5270, 2021.
  • [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [25] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In NeurIPS, pages 529–536, 1995.
  • [26] Lu Hou and James T. Kwok. Loss-aware weight quantization of deep networks. In ICLR, 2018.
  • [27] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. NeurIPS, 29, 2016.
  • [28] Weisen Jiang, Hansi Yang, Yu Zhang, and James Kwok. An adaptive policy to employ sharpness-aware minimization. In ICLR, 2023.
  • [29] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In ICLR, 2020.
  • [30] Sangil Jung, Changyong Son, Seohyung Lee, Jinwoo Son, Jae-Joon Han, Youngjun Kwak, Sung Ju Hwang, and Changkyu Choi. Learning to quantize deep networks by optimizing quantization intervals with task loss. In CVPR, pages 4350–4359, 2019.
  • [31] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • [32] Dohyung Kim, Junghyup Lee, and Bumsub Ham. Distance-aware quantization. In ICCV, pages 5271–5280, 2021.
  • [33] Minyoung Kim, Da Li, Shell X Hu, and Timothy Hospedales. Fisher sam: Information geometry and sharpness aware minimisation. In ICML, pages 11148–11161, 2022.
  • [34] Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. I-bert: Integer-only bert quantization. In ICML, pages 5506–5518. PMLR, 2021.
  • [35] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [36] Jungmin Kwon, Jeongseop Kim, Hyun-Seok Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In ICML, 2021.
  • [37] Junghyup Lee, Dohyung Kim, and Bumsub Ham. Network quantization with element-wise gradient scaling. In CVPR, pages 6448–6457, 2021.
  • [38] Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. In NeurIPS, 2018.
  • [39] Yuhang Li, Xin Dong, and Wei Wang. Additive powers-of-two quantization: An efficient non-uniform discretization for neural networks. In ICLR, 2020.
  • [40] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. In ICLR, 2021.
  • [41] Zhexin Li, Peisong Wang, Zhiyuan Wang, and Jian Cheng. Fixed-point quantization for vision transformer. In CAC, pages 7282–7287. IEEE, 2021.
  • [42] Zhexin Li, Tong Yang, Peisong Wang, and Jian Cheng. Q-vit: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022.
  • [43] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In NeurIPS, pages 345–353, 2017.
  • [44] Chen Liu, Mathieu Salzmann, Tao Lin, Ryota Tomioka, and Sabine Süsstrunk. On the loss landscape of adversarial training: Identifying challenges and how to overcome them. In NeurIPS, 2020.
  • [45] Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, and Yang You. Towards efficient and scalable sharpness-aware minimization. In CVPR, pages 12360–12370, 2022.
  • [46] Zechun Liu, Kwang-Ting Cheng, Dong Huang, Eric P Xing, and Zhiqiang Shen. Nonuniform-to-uniform quantization: Towards accurate quantization via generalized straight-through estimation. In CVPR, pages 4942–4952, 2022.
  • [47] Zechun Liu, Zhiqiang Shen, Shichao Li, Koen Helwegen, Dong Huang, and Kwang-Ting Cheng. How do adam and training strategies help bnns optimization. In ICML, volume 139, pages 6936–6946, 2021.
  • [48] Zhenhua Liu, Yunhe Wang, Kai Han, Wei Zhang, Siwei Ma, and Wen Gao. Post-training quantization for vision transformer. NeurIPS, 34:28092–28103, 2021.
  • [49] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm. In ECCV, pages 722–737, 2018.
  • [50] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In ICLR, 2017.
  • [51] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
  • [52] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In NeurIPS, 2017.
  • [53] Jeffrey L McKinstry, Steven K Esser, Rathinakumar Appuswamy, Deepika Bablani, John V Arthur, Izzet B Yildiz, and Dharmendra S Modha. Discovering low-precision networks close to full-precision networks for efficient inference. In NIPSW, pages 6–9. IEEE, 2019.
  • [54] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Jonathan Uesato, and Pascal Frossard. Robustness via curvature regularization, and vice versa. In CVPR, pages 9078–9086, 2019.
  • [55] Clara Na, Sanket Vaibhav Mehta, and Emma Strubell. Train flat, then compress: Sharpness-aware minimization learns more compressible models. arXiv preprint arXiv:2205.12694, 2022.
  • [56] Markus Nagel, Marios Fournarakis, Yelysei Bondarenko, and Tijmen Blankevoort. Overcoming oscillations in quantization-aware training. In ICML, volume 162, pages 16318–16330, 2022.
  • [57] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, pages 722–729. IEEE, 2008.
  • [58] Eunhyeok Park and Sungjoo Yoo. Profit: A novel training method for sub-4-bit mobilenet models. In ECCV, pages 430–446. Springer, 2020.
  • [59] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012.
  • [60] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 32, 2019.
  • [61] Haotong Qin, Yifu Ding, Mingyuan Zhang, Qinghua Yan, Aishan Liu, Qingqing Dang, Ziwei Liu, and Xianglong Liu. Bibert: Accurate fully binarized bert. In ICLR, 2022.
  • [62] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV, pages 525–542, 2016.
  • [63] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. NeurIPS, 28, 2015.
  • [64] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
  • [65] Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. In ICLR, 2018.
  • [66] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
  • [68] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In ICLR, 2019.
  • [69] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. Haq: Hardware-aware automated quantization with mixed precision. In CVPR, pages 8612–8620, 2019.
  • [70] Longguang Wang, Xiaoyu Dong, Yingqian Wang, Li Liu, Wei An, and Yulan Guo. Learnable lookup table for neural network quantization. In CVPR, pages 12423–12433, 2022.
  • [71] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. NeurIPS, 33:2958–2969, 2020.
  • [72] Kohei Yamamoto. Learnable companding quantization for accurate low-bit neural networks. In CVPR, pages 5029–5038, 2021.
  • [73] Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-sheng Hua. Quantization networks. In CVPR, 2019.
  • [74] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In ECCV, pages 365–382, 2018.
  • [75] Xiandong Zhao, Ying Wang, Xuyi Cai, Cheng Liu, and Lei Zhang. Linear symmetric quantization of neural networks for low-precision integer hardware. In ICLR, 2020.
  • [76] Yaowei Zheng, Richong Zhang, and Yongyi Mao. Regularizing neural networks via adversarial model perturbation. In CVPR, pages 8156–8165, 2021.
  • [77] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160, 2016.
  • [78] Bohan Zhuang, Lingqiao Liu, Mingkui Tan, Chunhua Shen, and Ian Reid. Training quantized neural networks with a full-precision auxiliary module. In CVPR, pages 1488–1497, 2020.
  • [79] Bohan Zhuang, Chunhua Shen, Mingkui Tan, Lingqiao Liu, and Ian Reid. Towards effective low-bitwidth convolutional neural networks. In CVPR, pages 7920–7928, 2018.
  • [80] Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C Dvornek, James S. Duncan, Ting Liu, et al. Surrogate gap minimization improves sharpness-aware training. In ICLR, 2022.

Appendix

A More Implementation Details

In this section, we provide more implementation details of SAQ. As suggested by SAM [21] and GSAM [80], we apply the m-sharpness strategy with m = 128. For both CNNs and ViTs, we use inception-style pre-processing [66] without strong data augmentation. Specifically, we randomly crop 224×224 patches from an image or its horizontal flip for training. At test time, a 224×224 center crop is used. For the hyper-parameter ρ, we conduct a grid search over {0.02, 0.05, 0.1, 0.15, 0.2, …, 1.0} to find an appropriate value, following the common practice in SAM [21] and GSAM [80]. We list the detailed settings of ρ in Table A. For MobileNetV2, we fine-tune the quantized model with additional learnable layer-wise offsets for activations and knowledge distillation following [58, 46]. As suggested by [58, 56], we apply BN re-estimation for 10 iterations after training. To compute the largest eigenvalue λ_max of the Hessian of different quantized models on ImageNet, we use the power iteration algorithm following [15]. To reduce the computational cost, we randomly sample 10k training images for this computation.
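To make the last step concrete, the snippet below sketches how λ_max can be estimated with power iteration using Hessian-vector products in PyTorch. It is a minimal illustration under our own naming (e.g., hessian_max_eigenvalue) and batching choices, not the released code; in practice the data loader would iterate over the sampled 10k-image subset and the model would be kept in eval mode.

```python
import torch

def hessian_max_eigenvalue(model, loss_fn, data_loader, num_iters=20, device="cuda"):
    """Estimate the largest Hessian eigenvalue of the loss w.r.t. the weights
    via power iteration with Hessian-vector products (double backward)."""
    params = [p for p in model.parameters() if p.requires_grad]
    # Start from a random, normalized direction with the same shapes as the weights.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / v_norm for x in v]

    eigenvalue = 0.0
    for _ in range(num_iters):
        hv_sum = [torch.zeros_like(p) for p in params]
        n_batches = 0
        for images, targets in data_loader:  # e.g., the sampled 10k-image subset
            images, targets = images.to(device), targets.to(device)
            loss = loss_fn(model(images), targets)
            grads = torch.autograd.grad(loss, params, create_graph=True)
            # Hessian-vector product: gradient of (g^T v) w.r.t. the weights.
            hv = torch.autograd.grad(
                sum((g * x).sum() for g, x in zip(grads, v)), params
            )
            hv_sum = [a + h.detach() for a, h in zip(hv_sum, hv)]
            n_batches += 1
        hv_mean = [h / n_batches for h in hv_sum]
        # Rayleigh quotient v^T H v gives the current eigenvalue estimate.
        eigenvalue = sum((h * x).sum() for h, x in zip(hv_mean, v)).item()
        hv_norm = torch.sqrt(sum((h * h).sum() for h in hv_mean))
        v = [h / (hv_norm + 1e-12) for h in hv_mean]
    return eigenvalue
```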

Unless otherwise specified, we do not apply advanced techniques such as knowledge distillation [79, 58, 78, 46], non-uniform quantization [46, 39, 72], asymmetric quantization [4, 58, 46], pre-activation [20, 72, 46], weight regularization [46, 23, 56], gradient scaling [37], progressive quantization [79, 58], batch normalization re-estimation [58, 56] and iterative training with weight freezing [58, 56], which can further improve the performance of the quantized models.

For the transfer learning experiments in Section 5.2, we train all models for 100 epochs. We use SGD with a momentum of 0.9 for optimization. The learning rate is initialized to 0.01 and decayed to 0 using cosine annealing. The mini-batch size and the weight decay are set to 64 and 0, respectively.
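As a concrete reference for this recipe, a minimal PyTorch sketch of the optimizer and learning-rate schedule is given below; the helper name build_transfer_optimizer is our own, and the snippet simply mirrors the hyper-parameters stated above.

```python
import torch

def build_transfer_optimizer(model, epochs=100):
    # SGD with momentum 0.9, initial learning rate 0.01, no weight decay,
    # decayed to 0 with cosine annealing over the training epochs.
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0
    )
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=epochs, eta_min=0.0
    )
    return optimizer, scheduler
```

The scheduler would be stepped once per epoch so that the learning rate reaches 0 at the end of the 100 epochs.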

Table A: Hyper-parameter ρ for different quantized models on ImageNet.
Network       Bitwidth   ρ
ResNet-18     2          0.30
ResNet-18     4          0.65
ResNet-34     2          0.65
ResNet-34     4          0.90
ResNet-50     2          0.65
ResNet-50     4          0.95
MobileNetV2   4          0.40
ViT-S/32      4          0.01
ViT-S/16      4          0.01
ViT-B/32      4          0.01
ViT-B/16      4          0.01

B Sensitivity of the Hyper-parameter ρ

To investigate the sensitivity of the hyper-parameter ρ, we apply SAQ to train 4-bit ResNet-18 with different values of ρ and report the results in Figure A. SAQ is relatively insensitive to ρ, as it outperforms SGD over a wide range of values. As ρ increases, the performance of the quantized model first improves and then deteriorates: a larger perturbation strength helps to improve the generalization performance of the quantized model, but an excessively large ρ makes optimization difficult and thus leads to sub-optimal performance.

Refer to caption
Figure A: Performance comparisons with different values of the hyper-parameter ρ. We apply SAQ to obtain 4-bit ResNet-18 on ImageNet. The black dashed line denotes the result of the quantized model trained with SGD.

C Visualization of the Loss Landscapes

In this section, we show the loss landscapes of different quantized models on ImageNet using the visualization method in [38]. The results are shown in Figures B and C. The x- and y-axes of the figures represent two randomly sampled orthogonal directions. From the results, the loss landscapes of the quantized models become smoother and flatter as the bitwidth increases, suggesting that smoothing the loss landscape of a 4-bit quantized model is easier than that of its 2-bit counterpart. Moreover, the loss landscapes of the quantized models obtained by SAQ are less chaotic and show larger contour intervals than their SGD counterparts, indicating that SAQ is able to find flatter and smoother minima than SGD.
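For completeness, the sketch below outlines the kind of filter-normalized random-direction visualization proposed in [38]; it is a simplified re-implementation under our own naming (random_filter_normalized_direction, loss_surface), not the authors' released code, and the two random directions are only approximately orthogonal in high dimensions.

```python
import torch

def random_filter_normalized_direction(model):
    """One random direction, rescaled filter-wise to match the weight norms,
    following the filter normalization of Li et al. [38]."""
    direction = []
    for p in model.parameters():
        d = torch.randn_like(p)
        if p.dim() > 1:  # conv / linear weights: normalize per output filter
            for d_f, p_f in zip(d, p):
                d_f.mul_(p_f.norm() / (d_f.norm() + 1e-10))
        else:            # biases and BN parameters: keep fixed (zero direction)
            d.zero_()
        direction.append(d)
    return direction

@torch.no_grad()
def loss_surface(model, loss_fn, loader, steps, device="cuda"):
    """Evaluate the loss on a 2D grid spanned by two random directions."""
    base = [p.detach().clone() for p in model.parameters()]
    dx = random_filter_normalized_direction(model)
    dy = random_filter_normalized_direction(model)
    surface = torch.zeros(len(steps), len(steps))
    for i, a in enumerate(steps):
        for j, b in enumerate(steps):
            # Perturb the weights to w0 + a * dx + b * dy.
            for p, w0, x, y in zip(model.parameters(), base, dx, dy):
                p.copy_(w0 + a * x + b * y)
            total, n = 0.0, 0
            for images, targets in loader:
                images, targets = images.to(device), targets.to(device)
                total += loss_fn(model(images), targets).item() * images.size(0)
                n += images.size(0)
            surface[i, j] = total / n
    # Restore the original weights before returning.
    for p, w0 in zip(model.parameters(), base):
        p.copy_(w0)
    return surface
```

In practice, steps would be something like torch.linspace(-1, 1, 41), and the model should be in eval mode so that BN statistics are not updated while the grid is evaluated.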

Refer to caption
(a) 2-bit ResNet-18 obtained by SGD
Refer to caption
(b) 2-bit ResNet-18 obtained by SAQ
Refer to caption
(c) 4-bit ResNet-18 obtained by SGD
Refer to caption
(d) 4-bit ResNet-18 obtained by SAQ
Figure B: The loss landscapes of the 2/4-bit ResNet-18 obtained by different methods on ImageNet.
Refer to caption
(a) 4-bit ViT-B/32 obtained by SGD
Refer to caption
(b) 4-bit ViT-B/32 obtained by SAQ
Figure C: The loss landscapes of the 4-bit ViT-B/32 obtained by different methods on ImageNet.