
Position-based Scaled Gradient for Model Quantization and Pruning

Jangho Kim
Seoul National University
Seoul, Korea
[email protected]

KiYoon Yoo
Seoul National University
Seoul, Korea
[email protected]

Nojun Kwak (Corresponding Author)
Seoul National University
Seoul, Korea
[email protected]
Abstract

We propose the position-based scaled gradient (PSG), which scales the gradient depending on the position of a weight vector to make it more compression-friendly. First, we theoretically show that applying PSG to standard gradient descent (GD), which we call PSGD, is equivalent to GD in the warped weight space, a space made by warping the original weight space via an appropriately designed invertible function. Second, we empirically show that PSG, acting as a regularizer on the weight vectors, is favorable for model compression domains such as quantization and pruning. PSG reduces the gap between the weight distributions of a full-precision model and its compressed counterpart. This enables the versatile deployment of a model either in an uncompressed mode or in a compressed mode depending on the availability of resources. The experimental results on the CIFAR-10/100 and ImageNet datasets show the effectiveness of the proposed PSG in both the pruning and quantization domains, even at extremely low bits. The code is released on GitHub: https://github.com/Jangho-Kim/PSG-pytorch.

1 Introduction

Many regularization strategies have been proposed to induce a prior on neural networks [21, 41, 20, 24]. Inspired by such regularization methods, which induce a prior for a specific purpose, in this paper we propose a novel regularization method that non-uniformly scales the gradient for model compression problems. The scaled gradient, whose scale depends on the position of the weight, constrains the weights to a set of compression-friendly grid points. We replace the conventional gradient in stochastic gradient descent (SGD) with the proposed position-based scaled gradient (PSG) and call the resulting optimizer PSGD. We show that applying PSGD in the original weight space is equivalent to optimizing the weights by standard SGD in a warped space, to which the weights of the original space are mapped by an invertible function. The invertible warping function is designed such that the weights of the original space are forced to merge to the desired target positions by the scaling of the gradients.

(a) Classification and Quantization Error
(b) Weight Distribution of Full and 4-bit Precision
Figure 1: Results of ResNet-34 on CIFAR-100. (a) Mean squared quantization error (line) and classification error (bar) across different bits. Blue: SGD, Red: PSGD. (b) Example of a weight distribution (Conv2_1 layer [19]) trained with standard SGD and with our PSGD. For PSGD, the distribution of the full-precision weights closely resembles the low-precision distribution, yet the model maintains its accuracy.

We are not the first to scale the gradient elements. The scaled gradient method, also known as the variable metric method [9], multiplies the gradient vector by a positive definite matrix to scale the gradient. It includes a wide variety of methods such as the Newton method, Quasi-Newton methods and the natural gradient method [11, 36, 4]. Generally, these methods rely on Hessian estimation or the Fisher information matrix for their scaling. Our method is different in that its scaling does not depend on the loss function but solely on the current position of the weight.

We apply the proposed PSG method to model compression problems such as quantization and pruning. In recent years, deploying deep neural networks (DNNs) on resource-restricted edge devices such as smartphones and IoT devices has become an important issue. For this reason, reducing the bit-width of model weights (quantization) and removing unimportant model weights (pruning) have been widely studied and used in applications. The majority of the quantization literature, dubbed Quantization-Aware Training (QAT) methods, fine-tunes a pre-trained model in the low-precision domain using the entire training dataset, without considering the full-precision domain. This scenario is restrictive in real-world applications because additional training is needed: in the additional training phase, a full-size dataset and high computational resources are required, which prohibits easy and fast deployment of DNNs on edge devices for customers in need.

To resolve this problem, many works have focused on post-training quantization (PTQ) methods that do not require full-scale training [26, 34, 2, 44]. For example, [34] starts with a pre-trained model and makes only minor modifications to the weights by equalizing the scales across channels and correcting biases. However, the inherent discrepancy between the weight distribution of the pre-trained model and that of the quantized model is too large for the aforementioned methods to offset. As shown in Fig. 1, due to the differences between the two distributions, the classification error and the quantization error, measured as the mean squared error, increase as lower bit-widths are used. Accordingly, when it comes to layer-wise quantization, existing post-training methods suffer significant accuracy degradation below 6-bit.

Meanwhile, another line of research in quantization has recently emerged that approaches the task from the initial training phase [1]. Our method follows this scheme of training from scratch like standard SGD, but we attain a competent full-precision model that can also be effortlessly quantized to a low-precision model with no additional post-processing. In essence, our main goal is to train a compression-friendly model that can be easily compressed when resources are limited, without the need for re-training, fine-tuning, or even accessing the data. To achieve this, we constrain the original weights to merge to a set of quantized grid points (Appendix A and Fig. 1(b)) by scaling their gradients proportionally to the error between each original weight and its quantized version. For pruning, the weights are regularized to merge to zero. More details are described in Sec. 3.

Our contributions can be summarized as follows:

• We propose a novel regularization method for model compression by introducing the position-based scaled gradient (PSG), which can be considered a variant of the variable metric method.

• We prove theoretically that PSG descent (PSGD) is equivalent to applying standard gradient descent in the warped weight space. This leads the weights to converge to a well-performing local minimum in both the compressed and uncompressed weight spaces (see Appendix A and Fig. 1).

• We apply PSG to quantization and pruning and verify its effectiveness on the CIFAR and ImageNet datasets. We also show that PSGD is very effective for extremely low-bit quantization. Furthermore, when a PSGD-pretrained model is used with a concurrent PTQ method, it outperforms its SGD-pretrained counterpart.

2 Related work

Quantization  QAT methods have shown increasingly strong performance in the low-precision domain, even down to 2- and 3-bit widths [13, 15, 3]. Post-training quantization, on the other hand, aims to quantize weights and activations without additional training or use of the training data. The majority of recent works start from a pre-trained network trained with a standard training scheme [44, 34, 2]. Many channel-wise quantization methods, which require storing quantization parameters per channel, have shown notable improvement in performance even at 4-bit [2, 8]. However, layer-wise quantization methods, which are more hardware-friendly as they store quantization parameters per layer (as opposed to per channel), still suffer at lower bit-widths [34, 26, 44]. [34] achieves near full-precision accuracy at 8-bit by bias correction and range equalization of channels, while [44] splits channels with outliers to reduce the clipping error. However, both suffer from severe accuracy degradation under 6-bit. Our method improves on, but is not limited to, uniform layer-wise quantization. Concurrent to ours, [35] and [33] propose to directly minimize the quantization error using a calibration dataset to achieve higher performance under 6-bit. In Section 5, we show that using a PSGD-pretrained model outperforms using an SGD-pretrained model.

Meanwhile, another line of work in quantization has focused on quantization robustness by regularizing the weight distribution from the initial training phase. [31] focuses on minimizing the Lipschitz constant to regularize the gradients for robustness against adversarial attacks. Similarly, [1] proposes a new regularization term on the norm of the gradients for quantization robustness across different bit-widths. This enables "on-the-fly" quantization to various bit-widths. Our method does not have an explicit regularization term but scales the gradients to implicitly regularize the weights in the full-precision domain and make them quantization-friendly. Additionally, we do not introduce significant training overhead because gradient-norm regularization is not necessary, whereas [1] requires double-backpropagation, which increases the training complexity. Some other related works in quantization aim to quantize the gradient vectors for efficient training [10], propose more representative encoding formats [40], or learn the optimal mixed-precision bit-width [42].

Pruning  Another relevant line of research in model compression is pruning, in which unimportant units such as weights, filters, or entire blocks are removed [23, 29]. Recent works have focused on pruning methods that include the pruning process in the training phase [37, 45, 32, 28]. Among them, a substantial amount of work utilizes sparsity-inducing regularization. [32] proposes training with an L0-norm regularizer on individual weights to train a sparse network, using the expected L0 objective to relax the otherwise non-differentiable regularization term. Meanwhile, other works focus on saliency criteria. [28] utilizes gradients of masks as a proxy for importance to prune networks in a single shot. Similar to [28] and [32], our method needs neither a heuristic pruning schedule during training nor additional fine-tuning after pruning. In our method, pruning is formulated as a special case of quantization, because PSG can be used for sparse training by setting the target value to zero instead of the quantized grid points.

3 Proposed method

In this section, we describe the proposed position-based scaled gradient descent (PSGD) method. In PSGD, a scaling function regularizes each original weight to merge to one of the desired target points that perform well in both the uncompressed and compressed domains. This is equivalent to optimizing via SGD in a warped weight space. With a specially designed invertible function that warps the original weight space, the loss function in this warped space converges to different local minima that are more compression-friendly than the solutions obtained in the original weight space.

We first prove that optimizing in the original space with PSGD is equivalent to optimizing in the warped space with gradient descent. Then, we demonstrate how PSGD is used to constrain the weights to a set of desired target points. Lastly, we explain how this method is able to yield performance comparable to that of vanilla SGD in the original uncompressed domain, despite its strong regularization effect.

3.1 Optimization in warped space

Theorem: Let $\mathcal{F}:\mathcal{X}\rightarrow\mathcal{Y}$, $\mathcal{X},\mathcal{Y}\subset\mathbb{R}^{n}$, be an arbitrary invertible multivariate function that warps the original weight space $\mathcal{X}$ into $\mathcal{Y}$, and consider the loss function $\mathcal{L}:\mathcal{X}\rightarrow\mathbb{R}$ and the equivalent loss function $\mathcal{L}^{\prime}=\mathcal{L}\circ\mathcal{F}^{-1}:\mathcal{Y}\rightarrow\mathbb{R}$. Then, the gradient descent (GD) method in the warped space $\mathcal{Y}$ is equivalent to applying a scaled gradient descent in the original space $\mathcal{X}$ such that

$$GD(\boldsymbol{y},\nabla_{\boldsymbol{y}}^{\mathcal{L}^{\prime}})\equiv GD(\boldsymbol{x},(\mathcal{J}_{\boldsymbol{x}}^{\mathcal{F}})^{-2}\nabla_{\boldsymbol{x}}^{\mathcal{L}}), \qquad (1)$$

where $\boldsymbol{y}=\mathcal{F}(\boldsymbol{x})$, and $\nabla_{a}^{b}$ and $\mathcal{J}_{a}^{b}$ respectively denote the gradient and the Jacobian of the function $b$ with respect to the variable $a$.

Proof: Consider the point $\boldsymbol{x}_{t}\in\mathcal{X}$ at time $t$ and its warped version $\boldsymbol{y}_{t}\in\mathcal{Y}$. To find a local minimum of $\mathcal{L}^{\prime}(\boldsymbol{y})$, the standard gradient descent method at time step $t$ in the warped space can be applied as follows:

$$\boldsymbol{y}_{t+1}=\boldsymbol{y}_{t}-\eta\nabla_{\boldsymbol{y}}^{\mathcal{L}^{\prime}}(\boldsymbol{y}_{t}). \qquad (2)$$

Here, $\nabla_{\boldsymbol{y}}^{\mathcal{L}^{\prime}}(\boldsymbol{y}_{t})=\frac{\partial\mathcal{L}^{\prime}}{\partial\boldsymbol{y}}\big|_{\boldsymbol{y}_{t}}$ is the gradient and $\eta$ is the learning rate. Applying the inverse function $\mathcal{F}^{-1}$ to $\boldsymbol{y}_{t+1}$, we obtain the updated point $\boldsymbol{x}_{t+1}$:

$$\boldsymbol{x}_{t+1}=\mathcal{F}^{-1}(\boldsymbol{y}_{t+1})=\mathcal{F}^{-1}(\boldsymbol{y}_{t}-\eta\nabla_{\boldsymbol{y}}^{\mathcal{L}^{\prime}}(\boldsymbol{y}_{t}))=\mathcal{F}^{-1}(\boldsymbol{y}_{t})-\eta\mathcal{J}_{\boldsymbol{y}}^{\boldsymbol{x}}(\boldsymbol{y}_{t})\nabla_{\boldsymbol{y}}^{\mathcal{L}^{\prime}}(\boldsymbol{y}_{t}), \qquad (3)$$

where the last equality is from the first-order Taylor approximation around $\boldsymbol{y}_{t}$ and $\mathcal{J}_{\boldsymbol{y}}^{\boldsymbol{x}}=\mathcal{J}_{\boldsymbol{y}}^{\mathcal{F}^{-1}}=\frac{\partial\boldsymbol{x}}{\partial\boldsymbol{y}}\in\mathbb{R}^{n\times n}$ is the Jacobian of $\boldsymbol{x}=\mathcal{F}^{-1}(\boldsymbol{y})$ with respect to $\boldsymbol{y}$. By the chain rule, $\nabla_{\boldsymbol{y}}^{\mathcal{L}^{\prime}}=\frac{\partial\boldsymbol{x}}{\partial\boldsymbol{y}}\frac{\partial\mathcal{L}}{\partial\boldsymbol{x}}=\mathcal{J}_{\boldsymbol{y}}^{\boldsymbol{x}}\nabla_{\boldsymbol{x}}^{\mathcal{L}}$. Because $\mathcal{J}_{\boldsymbol{y}}^{\boldsymbol{x}}=(\mathcal{J}_{\boldsymbol{x}}^{\boldsymbol{y}})^{-1}=(\mathcal{J}_{\boldsymbol{x}}^{\mathcal{F}})^{-1}$, we can rewrite Eq. (3) as

$$\boldsymbol{x}_{t+1}=\boldsymbol{x}_{t}-\eta(\mathcal{J}_{\boldsymbol{x}}^{\mathcal{F}}(\boldsymbol{x}_{t}))^{-2}\nabla_{\boldsymbol{x}}^{\mathcal{L}}(\boldsymbol{x}_{t}). \qquad (4)$$

Now Eq. (2) and Eq. (4) are equivalent and Eq. (1) is proved. In other words, the scaled gradient descent (PSGD) in the original space $\mathcal{X}$, whose scaling is determined by the matrix $(\mathcal{J}_{\boldsymbol{x}}^{\mathcal{F}})^{-2}$, is equivalent to gradient descent in the warped space $\mathcal{Y}$.
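The equivalence can be checked numerically. The following is a minimal sketch (not from the paper; the warp $f(x)=x^{3}$ and the quadratic toy loss are assumptions chosen only for illustration) showing that one gradient descent step on the warped loss, mapped back through $f^{-1}$, agrees with the scaled update of Eq. (4) up to the first-order Taylor term.

```python
import numpy as np

L  = lambda x: (x - 0.7) ** 2          # toy loss in the original space
dL = lambda x: 2.0 * (x - 0.7)

f     = lambda x: x ** 3               # invertible warp on x > 0 (illustrative choice)
f_inv = lambda y: np.cbrt(y)
df    = lambda x: 3.0 * x ** 2         # Jacobian (a scalar in 1-D)

eta, x = 1e-3, 2.0
y = f(x)

# One GD step in the warped space, then map back with f^{-1} (Eqs. 2 and 3)
dLp_dy = dL(f_inv(y)) / df(f_inv(y))   # chain rule: dL'/dy = (dx/dy) dL/dx
x_from_warped = f_inv(y - eta * dLp_dy)

# Scaled GD step directly in the original space (Eq. 4)
x_scaled = x - eta * df(x) ** (-2) * dL(x)

print(x_from_warped, x_scaled)         # the two updates agree up to O(eta^2)
```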

3.2 Position-based scaled gradient

In this part, we introduce one example of designing the invertible function $\mathcal{F}(\boldsymbol{x})$ for scaling the gradients. This invertible function should cause the original weight vector $\boldsymbol{x}$ to merge to a set of desired target points $\{\bar{\boldsymbol{x}}\}$. These desired target weights act as a prior in the optimization process, constraining the original weights to merge at specific positions. The details of how to set the target points are deferred to the next subsection.

The gist of weight-dependent gradient scaling is simple. For a given weight vector, if a specific weight element is far from its desired target point, a higher scaling value is applied so that it escapes this position faster. On the other hand, if the distance is small, a lower scaling value is applied to prevent the weight from deviating from this position. From now on, we focus on the design of the scaling function for the quantization problem. For pruning, the procedure is analogous and we omit the details.

Scaling function: We use the same warping function $f$ for each coordinate $x_{i}$, $i\in\{1,\cdots,n\}$, independently, i.e. $\boldsymbol{y}=\mathcal{F}(\boldsymbol{x})=[f(x_{1}),f(x_{2}),\cdots,f(x_{n})]^{T}$. Thus the Jacobian matrix becomes diagonal, $\mathcal{J}_{\boldsymbol{x}}^{\mathcal{F}}=\text{diag}(f^{\prime}(x_{1}),\cdots,f^{\prime}(x_{n}))$, and our method belongs to the diagonally scaled gradient methods.

Consider the following warping function

$$f(x)=2\,\text{sign}(x-\bar{x})\left(\sqrt{|x-\bar{x}|+\epsilon}-\sqrt{\epsilon}\right)+c(\bar{x}), \qquad (5)$$

where the target $\bar{x}$ is determined as the grid point closest to $x$, $\text{sign}(x)\in\{\pm 1,0\}$ is the sign function, and $c(\bar{x})$ is a constant, dependent on the specific grid point $\bar{x}$, that makes the function continuous (details on $c(\bar{x})$, as well as another example of a warping function and its experimental results, can be found in Appendix B). $\epsilon$ is an arbitrarily small constant to avoid an infinite gradient. Then, from Eq. (4), the elementwise scaling function becomes $s(x)=\frac{1}{[f^{\prime}(x)]^{2}}$ and consequently

$$s(x)=|x-\bar{x}(x)|+\epsilon. \qquad (6)$$

Using the elementwise scaling function Eq.(6), the elementwise weight update rule for the PSG descent (PSGD) becomes

$$x_{i}^{t+1}=x_{i}^{t}-\eta\, s(x_{i})\left.\frac{\partial\mathcal{L}}{\partial x_{i}}\right|_{\boldsymbol{x}^{t}}, \qquad (7)$$

where $\eta$ is the learning rate. (We set $\eta=\eta_{0}\lambda_{s}$, where $\eta_{0}$ is the conventional learning rate and $\lambda_{s}$ is a hyper-parameter that can be set differently for various scaling functions depending on their range.)
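A minimal PyTorch sketch of this update is given below; the helper names (`nearest_target`, `psg_step`) are ours, not from the released code, and the target grid is assumed to be given as a 1-D tensor.

```python
import torch

def nearest_target(w: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """For every weight element, return the closest target grid point x_bar(x)."""
    dists = (w.unsqueeze(-1) - targets).abs()   # shape (..., num_targets)
    return targets[dists.argmin(dim=-1)]

@torch.no_grad()
def psg_step(params, targets, lr=0.001, eps=1e-8):
    """One PSGD step: x <- x - lr * (|x - x_bar| + eps) * dL/dx  (Eqs. 6 and 7)."""
    for p in params:
        if p.grad is None:
            continue
        x_bar = nearest_target(p, targets)
        scale = (p - x_bar).abs() + eps         # elementwise s(x) of Eq. (6)
        p.add_(-lr * scale * p.grad)
```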

Figure 2: Toy example of warping a loss function $\mathcal{L}(x)=\cos((x-3.07)^{2})$. Left: the original loss function. Right: the result of warping the original function by Eq. (5) with the target $\bar{x}=0$.

3.3 Target points

Quantization: In this paper, we use the uniform symmetric quantization method [26] and the per-layer quantization scheme for hardware friendliness. Consider a floating-point range $[min_{x}, max_{x}]$ of the model weights. A weight $x$ is quantized to an integer in the range $[-2^{n-1}+1,\,2^{n-1}-1]$ for $n$-bit precision. Quantization-dequantization of the weights of a network is defined with a step size ($\Delta$) and clipping values. The overall quantization process is as follows:

$$x_{Q}=Clip\!\left(\Bigl\lfloor\frac{x}{\Delta}\Bigr\rceil,\,-2^{n-1}+1,\,2^{n-1}-1\right),\quad \Delta=\frac{\max(-min_{x},\,max_{x})}{2^{n-1}-1}, \qquad (8)$$

where $\lfloor\cdot\rceil$ is the round-to-nearest-integer operation and $Clip(x,a,b)=\begin{cases}b&\text{if } x>b\\ a&\text{if } x<a\\ x&\text{otherwise.}\end{cases}$

We obtain the quantized weights with the de-quantization process as $\bar{x}=x_{Q}\times\Delta$ and use these quantized weights as the target positions for quantization.
Pruning: For magnitude-based pruning methods, weights near zero are removed. Therefore, we choose zero as the target value (i.e. $\bar{x}=0$).
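A short sketch of how the target points can be generated is shown below; it follows the uniform symmetric quantization of Eq. (8) (the function name and the per-tensor granularity are our assumptions for illustration).

```python
import torch

def quantized_targets(w: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Eq. (8): uniform symmetric quantization, then de-quantization."""
    q_max = 2 ** (n_bits - 1) - 1
    delta = torch.max(-w.min(), w.max()) / q_max           # step size
    w_q = torch.clamp(torch.round(w / delta), -q_max, q_max)
    return w_q * delta                                      # target grid points x_bar

# For pruning, the single target point is simply zero:
pruning_target = lambda w: torch.zeros_like(w)
```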

3.4 PSGD for deep networks

Much of the literature on the optimization of DNNs with stochastic gradient descent (SGD) reports that multiple runs give consistently similar performance although DNNs have many local minima (e.g. see Sec. 2 of [6]). [7] analyzed the loss surface of DNNs and showed that large networks have many local minima with similar performance on the test set, and that the lowest critical values of the random loss function are located in a specific band lower-bounded by the global minimum. From this perspective, we explain informally how PSGD works for deep networks. As illustrated in Fig. 2, we posit that there exist many local minima ($A$, $B$) in the original weight space $\mathcal{X}$ with similar performance, only some of which ($A$) are close to one of the target points ($0$) and thus also perform well in the compressed domain. As in Fig. 2 (left), assume that the region of convergence for $B$ is much wider than that of $A$, meaning that a random initialization is more likely to yield solution $B$ than $A$. The warping function $\mathcal{F}$ designed as described above (Eq. 5) warps the original space $\mathcal{X}$ to $\mathcal{Y}$ such that the areas near the target points are expanded while those far from the targets are contracted. If we apply gradient descent in this warped space, the optimization has a better chance of converging to $A^{\prime}$. Correspondingly, PSGD in the original space is more likely to output $A$ rather than $B$, which is favorable for compression. Note that $\mathcal{F}$ transforms the original weight space to the warped space $\mathcal{Y}$, not to the compressed domain.
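The warped landscape of Fig. 2 can be reproduced with a few lines; the sketch below is only illustrative and assumes a single target $\bar{x}=0$, for which the offset $c(0)=0$ (Appendix B).

```python
import numpy as np

eps = 1e-3
L = lambda x: np.cos((x - 3.07) ** 2)                 # toy loss of Fig. 2

def f(x):                                             # Eq. (5) with x_bar = 0, c(0) = 0
    return 2.0 * np.sign(x) * (np.sqrt(np.abs(x) + eps) - np.sqrt(eps))

def f_inv(y):                                         # closed-form inverse of f
    return np.sign(y) * ((np.abs(y) / 2.0 + np.sqrt(eps)) ** 2 - eps)

xs = np.linspace(-4.0, 4.0, 1001)
ys = np.linspace(f(-4.0), f(4.0), 1001)
original_loss = L(xs)                                 # left panel of Fig. 2
warped_loss = L(f_inv(ys))                            # right panel: L'(y) = L(f^{-1}(y))
```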

4 Experiments

In this section, we experimentally show the effectiveness of PSGD. To verify our PSGD method, we first conduct experiments for sparse training by setting the target point to 0, and then we further extend our method to quantization on the CIFAR [27] and ImageNet ILSVRC 2015 [38] datasets. We first demonstrate the effectiveness in sparse training with magnitude-based pruning by comparing with L0 regularization [32] and SNIP [28]. [32] penalizes the non-zero model parameters and shares our scheme of regularizing the model while training. Like ours, [28] is a single-shot pruning method, which requires neither pruning schedules nor additional fine-tuning.

For quantization, we compare our method with (1) methods that employ regularization at the initial training phase [1, 17, 31]. We choose the gradient L1 norm regularization method [1] and the Lipschitz regularization methods [31, 17] from the original paper [1] as baselines, because they propose new regularization techniques used at the training phase, similar to ours. Note that [17] adds an L2 penalty term on the gradient of the weights instead of the L1 penalty of [1]. We also compare with (2) existing state-of-the-art layer-wise post-training quantization methods that start from pre-trained models [34, 44] to show the improvement at lower bits (4-bit). Refer to Section 2 for details on the compared methods. To validate the effectiveness of our method, we also train our model for extremely low-bit (2-, 3-bit) weights. Lastly, we show experimental results on various network architectures and on applying PSG to the Adam optimizer [25], which are detailed in Appendix D.

Implementation details  We used the PyTorch framework for all experiments. For the pruning experiment of Table 1, we used ResNet-32 [19] on CIFAR-100, following the training hyperparameters of [43]. We used the released official implementation of [32] and re-implemented [28] for the PyTorch framework. For the quantization experiments of Tables 2 and 4, we used ResNet-18 and followed the settings of [1] for CIFAR-10 and ImageNet. For [44], the released official implementation was used. All other numbers are either from the original papers or re-implemented. For fair comparison, all quantization experiments followed the layer-wise uniform symmetric quantization [26], and when quantizing the activations, we clipped the activation range using batch normalization parameters as described in [34], same as [1]. PSGD is applied for the last 15 epochs for the ImageNet experiments and from the first learning-rate decay epoch for the CIFAR experiments. We use an additional 30 epochs for PSGD in the extremely low-bit experiments (Table 4). Also, we tuned the hyper-parameter $\lambda_{s}$ for each bit-width and sparsity ratio. Our search criterion is to ensure that the performance of the uncompressed model is not degraded, similar to [1]. More details are in Appendix C. A sketch of the overall training schedule is given below.
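The following is a hedged, self-contained toy version of that schedule (a small random model and random data stand in for ResNet-18/ImageNet, and the exact hyper-parameters are not the paper's): vanilla SGD is used for most epochs, and the PSG-scaled update with 4-bit targets is applied only in the final epochs.

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()
data = [(torch.randn(8, 16), torch.randint(0, 10, (8,))) for _ in range(4)]

total_epochs, psg_start, n_bits, lr = 90, 75, 4, 0.001
for epoch in range(total_epochs):
    for inputs, labels in data:
        optimizer.zero_grad()
        criterion(model(inputs), labels).backward()
        if epoch < psg_start:
            optimizer.step()                           # standard SGD phase
        else:                                          # PSG phase (last epochs only)
            with torch.no_grad():
                for p in model.parameters():
                    q_max = 2 ** (n_bits - 1) - 1
                    delta = torch.max(-p.min(), p.max()) / q_max
                    x_bar = torch.clamp(torch.round(p / delta), -q_max, q_max) * delta
                    scale = (p - x_bar).abs() + 1e-8   # s(x) of Eq. (6)
                    p.add_(-lr * scale * p.grad)
```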

Table 1: Test accuracy of ResNet-32 across different sparsity ratios (percentage of zeros) on CIFAR-100 after magnitude-based pruning [18] without any fine-tuning.

Method \ Sparsity (%) | 20.0  | 50.0  | 70.0  | 80.0  | 90.0
SGD                   | 69.43 | 60.59 | 15.95 | 4.70  | 1.00
L0 Reg. [32]          | 67.56 | 64.49 | 49.73 | 23.95 | 2.85
SNIP [28]             | 69.68 | 68.73 | 66.76 | 65.67 | 60.14
PSGD (Ours)           | 69.63 | 69.25 | 68.62 | 67.27 | 64.33


Figure 3: The weight distribution of SGD and PSGD models.

4.1 Pruning

As a preliminary experiment, we first demonstrate that PSG-based optimization is possible with a single target point set at zero. Then, we apply magnitude-based pruning following [18] across different sparsity ratios. As the purpose of the experiment is to verify that the weights are centered on zero, the weights are pruned once after training has completed, and the models of [32] and ours are evaluated without fine-tuning. Results for [28], which prunes the weights in a single shot at initialization, are included as a single-shot pruning baseline. A sketch of the pruning step is given below.
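This is a minimal sketch of the standard magnitude criterion (assuming layer-wise thresholds over Conv/Linear weights, which is our simplification of [18]); the pruned model is then evaluated as-is.

```python
import torch

@torch.no_grad()
def magnitude_prune(model: torch.nn.Module, sparsity: float):
    """Zero out the smallest-magnitude `sparsity` fraction of weights in each layer."""
    for module in model.modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            w = module.weight
            k = int(sparsity * w.numel())
            if k == 0:
                continue
            threshold = w.abs().flatten().kthvalue(k).values
            w.mul_((w.abs() > threshold).float())      # keep only weights above the threshold
```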

Table 1 indicates that our method outperforms the two baselines across various high sparsity ratios. While all three methods are able to maintain accuracy at low sparsity (~10%), [32] shows some accuracy degradation at 20% and suffers severely at high sparsity. This is in line with the observation in [14] that the method was unable to produce sparse residual models without significant damage to model quality. Compared with [28], our method maintains higher accuracy even at high sparsity, displaying its strength in single-shot pruning, in which neither pruning schedules nor additional training are necessary. Fig. 3 shows the distribution of weights in the SGD- and PSGD-trained models.

Table 2: Test accuracy of regularization methods that do not have post-training process for ResNet-18 on the ImageNet and CIFAR dataset. PSGD@W# indicates the target number of bits for weights in PSGD is #. All numbers except ours are from [1]. At #-bit, PSGD@W# performs the best in most cases.

Method                    | ImageNet: FP / W8A8 / W6A6 / W4A4 | CIFAR-10: FP / W8A4 / W6A4 / W4A4
SGD                       | 69.70 / 69.20 / 63.80 / 0.30      | 93.54 / 85.51 / 85.35 / 83.98
DQ Regularization [31]    | 68.28 / 67.76 / 62.31 / 0.24      | 92.46 / 83.31 / 83.34 / 82.47
Gradient L2 [17]          | 68.34 / 68.02 / 64.52 / 0.19      | 93.31 / 84.50 / 84.99 / 83.82
Gradient L1 [1]           | 70.07 / 69.92 / 66.39 / 0.22      | 93.36 / 88.70 / 88.45 / 87.62
Gradient L1 (λ=0.05) [1]  | 64.02 / 63.76 / 61.19 / 55.32     | - / - / - / -
PSGD@W8 (Ours)            | 70.22 / 70.13 / 66.02 / 0.60      | 93.67 / 93.10 / 93.03 / 90.65
PSGD@W6 (Ours)            | 70.07 / 69.83 / 69.51 / 0.29      | 93.54 / 92.76 / 92.88 / 90.55
PSGD@W4 (Ours)            | 68.18 / 67.63 / 62.73 / 63.45     | 93.63 / 93.04 / 93.12 / 91.03

Table 3: Comparison with Post-training Quantization methods using ResNet-18 on the ImageNet dataset. Results of DFQ are from [34].

Method                | W8A8  | W6A6  | W4A4
DFQ [34]              | 69.7  | 66.3  | -
OCS + Best Clip [44]  | 69.37 | 66.76 | 44.3
PSGD (Ours)           | 70.13 | 69.51 | 63.45

Table 4: Extremely low bits accuracy of ResNet-18 on the ImageNet dataset. The first convolutional layer and the last linear layer are quantized at 8-bit. Activation is fixed to 8-bit.

Method       | FP / W3A8     | FP / W2A8
SGD          | 69.76 / 0.10  | 69.76 / 0.10
PSGD (Ours)  | 66.75 / 66.36 | 64.60 / 62.65

4.2 Quantization

In the quantization domain, we first compare PSGD with regularization methods on the on-the-fly bit-width problem, in which a single model is evaluated across various bit-widths. Then, we compare with existing state-of-the-art layer-wise symmetric post-training methods to verify that PSGD addresses the accuracy drop at low bits caused by the differences in weight distributions (see Fig. 1).

Regularization methods  Table 2 shows the results of regularization methods on the CIFAR-10 and ImageNet datasets. In the CIFAR-10 experiments of Table 2, we fix the activation bit-width to 4-bit and vary the weight bit-width from 8 to 4 bits. For the ImageNet experiments of Table 2, we use equal bit-widths for weights and activations, following [1]. In the CIFAR-10 experiment, all methods seem to maintain the performance of the quantized model down to 4-bit quantization. Regardless of the target bit-width, PSGD outperforms all other regularization methods. On the other hand, the ImageNet experiment generally shows reasonable results down to 6-bit, but the accuracy drops drastically at 4-bit. PSGD targeting 8-bit and 6-bit marginally improves on all bits, yet also experiences a drastic accuracy drop at 4-bit. In contrast, Gradient L1 ($\lambda=0.05$) and PSGD@W4 maintain the performance of the quantized models even at 4-bit. Comparing with the second-best method, Gradient L1 ($\lambda=0.05$) [1], PSGD outperforms it at all bit-widths. At full precision (FP), 8-, 6- and 4-bit, the performance gaps between [1] and ours are about 4.2%, 3.9%, 1.5% and 8.1%, respectively. From Table 2, while the quantization noise may slightly degrade the accuracy in some cases, the general trend that using more bits leads to higher accuracy is preserved. Compared to other regularization methods, PSGD is able to maintain reasonable performance across all bits by constraining the distribution of the full-precision weights to resemble that of the quantized weights. This quantization-friendliness is achieved by the appropriately designed scaling function. In addition, unlike [1], PSGD does not incur the additional overhead of computing double-backpropagation.

Post-training methods  Table 3 shows that OCS, a state-of-the-art post-training method, has a drastic accuracy drop at 4-bit. For OCS, following the original paper, we chose the best clipping method for both weights and activations. DFQ shows a similar tendency, with a drastic accuracy drop below 6-bit, as depicted in Fig. 1 of the original DFQ paper [34]. This is due to the fundamental discrepancy between the FP and quantized weight distributions, as stated in Sec. 1 and Fig. 1. On the other hand, models trained with PSGD have similar full-precision and quantized weight distributions and hence a low quantization error, due to the scaling function. Our method outperforms OCS at 4-bit by around 19% without any post-training or weight clipping to treat the outliers. Applying a PTQ method to our PSG pre-trained model is shown in Sec. 5.

Extremely low bits quantization  As shown in Fig. 1, SGD suffers a drastic accuracy drop at extremely low bits such as 3-bit and 2-bit. To confirm that PSGD can handle extremely low bits, we conduct experiments with PSGD targeting 3-bit and 2-bit, except for the first and last layers, which are quantized at 8-bit. Table 4 shows the results of applying PSGD. Although the full-precision accuracy does drop due to the strong constraints, PSGD is able to maintain reasonable accuracy. This demonstrates the potential of PSGD as a key solution to post-training quantization at extremely low bits.

(a) Weight distribution (1st layer)
(b) Histogram of eigenvalues
Figure 4: Weight distribution and histogram of eigenvalues on the MNIST dataset. The two-layer fully connected network consists of 50 and 20 hidden nodes. The target bit-width of PSGD is 2. Note that both solutions yield only relatively small negative eigenvalues ($\lambda>-1$).

5 Discussion

In this section, we first focus on the local minima found by PSG with a toy example to gain a deeper understanding. In this toy example, we train with SGD and with 2-bit-targeted PSGD on the MNIST dataset using a fully-connected network with two hidden layers (50 and 20 neurons). In the subsequent sections, we clarify the difference between the purpose of PSGD and that of QAT and analyze why PSGD does not achieve near-full-precision (FP) performance in low precision (LP). Lastly, we demonstrate a potential application of PSG by applying it to a concurrent PTQ method.

Quantized and sparse model  SGD generally yields a bell-shaped distribution of weights, which is not amenable to low-bit quantization [44]. On the other hand, PSGD always provides a multi-modal distribution peaked at the quantized values. For this example, three target points are used (2-bit), so the weights are merged into three modes as depicted in Fig. 4a. A large proportion of the weights are near zero. Similarly, we note that the sparsity of ResNet-18@W4 shown in Table 2 is 72.4% at LP. This is because symmetric quantization also contains zero as a target point. PSGD has nearly the same accuracy as FP (~96%) at W2A32. However, the accuracy of SGD at W2A32 is about 9%, although its FP accuracy is 97%. This tendency is also shown in Fig. 1b, which demonstrates that PSGD reduces the quantization error.

Curvature of PSGD solution  In Sec. 3.4 and Fig. 2, we claimed that PSG finds a minimum that lies in a sharp valley, is more compression-friendly, and is less likely to be found by vanilla SGD. As the curvature in the direction of a Hessian eigenvector is determined by the corresponding eigenvalue [16], we compare the curvature of the solutions yielded by SGD and PSGD by assessing the magnitudes of the eigenvalues, similar to [5]. SGD provides minima with relatively wide valleys because it has many near-zero eigenvalues, and a similar tendency is observed in [5]. However, the weights trained by PSGD have many more large positive eigenvalues, which means the solution lies in a relatively sharp valley compared to SGD. Specifically, the number of large eigenvalues ($\lambda>10^{-3}$) in PSGD is 9 times more than that of SGD. From this toy example, we confirm that PSG helps to find minima which are more compression-friendly (Fig. 4a) and lie in sharp valleys (Fig. 4b) that are hard to reach with vanilla SGD.
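The eigenvalue comparison can be reproduced in miniature; the sketch below (a toy linear model on random data, standing in for the MNIST network) computes the Hessian spectrum of the loss at a given set of weights and counts the large eigenvalues.

```python
import torch

torch.manual_seed(0)
X, y = torch.randn(64, 10), torch.randint(0, 3, (64,))
w = torch.randn(10 * 3)                                 # flattened weights of a linear classifier

def loss_fn(w_flat):
    return torch.nn.functional.cross_entropy(X @ w_flat.view(10, 3), y)

H = torch.autograd.functional.hessian(loss_fn, w)       # (30, 30) Hessian at w
eigvals = torch.linalg.eigvalsh(H)
print(int((eigvals > 1e-3).sum()))                      # number of "large" eigenvalues
```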

Quantization-aware training vs PSGD  Conventional QAT methods [13, 15] start with a pre-trained model initially trained with SGD and further update the weights considering only the low-precision weights. In contrast, regularization methods such as ours and [1] start from scratch and update the full-precision weights, analogous to SGD. In our work, the sole purpose of PSGD is to find a set of full-precision weights that are quantization-friendly, so that versatile deployment in low precision (LP) is possible without further operations. Therefore, regularization methods start from the initial training phase, analogous to SGD, whereas QAT methods start from a pre-trained model obtained after an initial training phase such as SGD or PSGD. The purpose of QAT methods is solely focused on the LP weights. In general, a coarse gradient, obtained by forwarding the LP weights instead of the FP weights and back-propagating through the straight-through estimator (STE) [26], is used to update the weights. Additionally, the quantization scheme is modified to include trainable parameters dependent on the low-precision weights and activations. Thus, QAT cannot maintain the performance of the full-precision model, as it only focuses on that of the low-precision one, such as at 4-bit.
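For reference, the coarse gradient of QAT methods is typically obtained with the straight-through estimator; a generic sketch (not part of PSGD) is shown below: the forward pass rounds, while the backward pass treats the rounding as the identity.

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return torch.round(x)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output              # straight-through: d round(x)/dx treated as 1

# usage: forward with quantized weights, backprop as if no rounding happened
w = torch.randn(8, requires_grad=True)
delta = 0.1
w_q = RoundSTE.apply(w / delta) * delta
w_q.sum().backward()                    # w.grad is all ones
```

Note that PSGD never uses such a coarse gradient: it always differentiates through the full-precision weights and only rescales the result.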

Post-training with PSGD-trained model  Our model attains full-precision performance similar to SGD and reasonable performance at low precision even with naive quantization. Thus, a PSGD-trained model can potentially be used as the pre-trained model for QAT or PTQ methods. We performed additional experiments using the PSGD@W4 model of Table 2 by applying a concurrent PTQ work, LAPQ [35], using the official code. This attains 66.5% accuracy for W4A4, which is more than 3.1 and 6.2 percentage points higher than PSGD-only and LAPQ-only, respectively. This shows that PTQ methods can benefit from using our pretrained model.

6 Conclusion

In this work, we introduce the position-based scaled gradient (PSG), which scales the gradient proportionally to the distance between the current weight and the corresponding target point. We prove that PSG descent (PSGD) is equivalent to applying SGD in the warped weight space. Based on the hypothesis that DNNs have many local minima with similar performance on the test set, PSGD is able to find a compression-friendly minimum that is hard to reach with other optimizers. PSGD can be a key solution to low-bit post-training quantization, because it reduces the quantization error, bridging the discrepancy between the distributions of the compressed and uncompressed weights. Because the target points act as a prior that constrains the original weights to merge at specific positions, PSGD can also be used for sparse training by simply setting the target point to 0. In our experiments, we verify PSGD in the domains of pruning and quantization by showing its effectiveness on various image classification datasets such as CIFAR-10/100 and ImageNet. We also empirically show that PSGD finds minima located in sharper valleys than SGD. We believe that PSGD will facilitate further research in model quantization and pruning.

Broader Impact

PSG is a fundamental method of scaling each gradient component differently depending on the position of the weight vector. This technique can replace the conventional gradient in any application that requires different treatment of specific locations in the parameter space. As shown in the paper, the most immediate applications are quantization and pruning, where a definite preference for specific weight configurations exists. These model compression techniques are at the heart of the fast and lightweight deployment of deep learning algorithms, and thus PSG can make a large impact on the related industry. As another potentially related research topic, PSG may be utilized in optimization areas such as integer programming and combinatorial optimization, acting as a tool for optimizing a continuous surrogate of an objective function over a discrete space.

Acknowledgments

This work was supported by IITP grant funded by the Korea government (MSIT) (No.2019-0-01367, Babymind) and Promising-Pioneering Researcher Program through Seoul National University (SNU).

References

  • [1] Milad Alizadeh, Arash Behboodi, Mart van Baalen, Christos Louizos, Tijmen Blankevoort, and Max Welling. Gradient $\ell_{1}$ regularization for quantization robustness. In International Conference on Learning Representations, 2020.
  • [2] Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. In Advances in Neural Information Processing Systems, pages 7948–7956, 2019.
  • [3] Yash Bhalgat, Jinwon Lee, Markus Nagel, Tijmen Blankevoort, and Nojun Kwak. Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 696–697, 2020.
  • [4] Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
  • [5] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124018, 2019.
  • [6] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. In International Conference on Learning Representations, 2017.
  • [7] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pages 192–204, 2015.
  • [8] Yoni Choukroun, Eli Kravchik, Fan Yang, and Pavel Kisilev. Low-bit quantization of neural networks for efficient inference. 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Oct 2019.
  • [9] William C Davidon. Variable metric method for minimization. SIAM Journal on Optimization, 1(1):1–17, 1991.
  • [10] Christopher De Sa, Megan Leszczynski, Jian Zhang, Alana Marzoev, Christopher R Aberger, Kunle Olukotun, and Christopher Ré. High-accuracy low-precision training. arXiv preprint arXiv:1803.03383, 2018.
  • [11] John E Dennis, Jr and Jorge J Moré. Quasi-newton methods, motivation and theory. SIAM review, 19(1):46–89, 1977.
  • [12] L. Vandenberghe. EE236C lecture notes: 1. Gradient method.
  • [13] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. arXiv preprint arXiv:1902.08153, 2019.
  • [14] Trevor Gale, Erich Elsen, and Sara Hooker. The state of sparsity in deep neural networks. ArXiv, abs/1902.09574, 2019.
  • [15] Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4852–4861, 2019.
  • [16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, pages 5767–5777, 2017.
  • [18] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pages 1135–1143, 2015.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [20] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [21] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
  • [22] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [23] Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 304–320, 2018.
  • [24] Jangho Kim, SeongUk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer. In Advances in Neural Information Processing Systems, pages 2760–2769, 2018.
  • [25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [26] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper, 2018.
  • [27] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [28] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-shot network pruning based on connection sensitivity. In International Conference on Learning Representations, 2019.
  • [29] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
  • [30] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pages 6389–6399, 2018.
  • [31] Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. In International Conference on Learning Representations, 2019.
  • [32] Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through $l_{0}$ regularization. In International Conference on Learning Representations, 2018.
  • [33] Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. arXiv preprint arXiv:2004.10568, 2020.
  • [34] Markus Nagel, Mart van Baalen, Tijmen Blankevoort, and Max Welling. Data-free quantization through weight equalization and bias correction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1325–1334, 2019.
  • [35] Yury Nahshan, Brian Chmiel, Chaim Baskin, Evgenii Zheltonozhskii, Ron Banner, Alex M Bronstein, and Avi Mendelson. Loss aware post-training quantization. arXiv preprint arXiv:1911.07190, 2019.
  • [36] Jorge Nocedal and Stephen Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • [37] Alex Renda, Jonathan Frankle, and Michael Carbin. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389, 2020.
  • [38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • [39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [40] Thierry Tambe, En-Yu Yang, Zishen Wan, Yuntian Deng, Vijay Janapa Reddi, Alexander Rush, David Brooks, and Gu-Yeon Wei. Adaptivfloat: A floating-point based data type for resilient deep learning inference. arXiv preprint arXiv:1909.13271, 2019.
  • [41] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [42] Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. Bayesian bits: Unifying quantization and pruning. arXiv preprint arXiv:2005.07093, 2020.
  • [43] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European conference on computer vision (ECCV), pages 365–382, 2018.
  • [44] Ritchie Zhao, Yuwei Hu, Jordan Dotzel, Chris De Sa, and Zhiru Zhang. Improving Neural Network Quantization without Retraining using Outlier Channel Splitting. International Conference on Machine Learning (ICML), pages 7543–7552, June 2019.
  • [45] Michael Zhu and Suyog Gupta. To prune, or not to prune: exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, 2017.

Appendix A Detailed Results on CIFAR-100 with ResNet-32 (Fig. 1)

In Table 5, we show the classification accuracies and the corresponding mean squared error (MSE) of PSGD depicted in Fig. 1 of the original paper. Also, the weight distributions of the model at various layers are shown in Fig. 5. In this experiment, we only quantize the weights, not the activations, to compare the performance degradation as the weight bit-width decreases. The mean squared errors (MSE) of the weights across different bit-widths are also reported. The MSE is computed as the mean of the squared differences between the full-precision weights and the low-precision weights, averaged across layers. As some variance in performance was observed for lower bit-widths, we report the mean $\pm$ standard deviation for the 2-bit and the 3-bit experiments. As stated, PSGD successfully merges the weights to the target points and obtains quite low MSE down to 3-bit, and the 2-bit MSE of PSGD is more than 2 times smaller than that of conventional SGD.
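A small sketch of this MSE computation (our reading of the description above; the per-layer tensors and their quantized versions are assumed to be available) is shown below.

```python
import torch

def quantization_mse(fp_weights, lp_weights):
    """Mean over layers of the per-layer mean squared error between FP and LP weights."""
    per_layer = [(w - wq).pow(2).mean() for w, wq in zip(fp_weights, lp_weights)]
    return torch.stack(per_layer).mean()
```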

In Fig. 5, we display the full-precision weight distributions of the PSGD models and compare them against the distributions of a vanilla SGD-trained model. Four random layers of each model are shown column-wise. The first row displays the model trained with SGD and L2 weight decay. Below are distributions trained with PSGD targeting the sparse-training case, 2-bit, 3-bit, and 4-bit, respectively. Note that all the histograms are plotted in the full-precision domain, rather than in the low-precision domain. For 2-bit, all three target bins ($2^{2}-1$) are visible. For 3-bit, only five target bins are visible, as the two peripheral bins contain relatively few weight components.

Figure 5: Weight distributions in the full-precision domain of four random layers for sparse training, 2-bit, 3-bit, and 4-bit. The name of the layer and the number of parameters (in parentheses) are shown in each column. The y-axis and x-axis of the PSGD distributions are clipped appropriately for visualization purposes, and the number of bins is set to 100 for both PSGD and SGD.
Figure 6: Warping function $f(x)$ for different step sizes $\Delta$. The red graph depicts $f(x)$ without $c(\bar{x})$ and the green graph depicts $f(x)$ with $c(\bar{x})$ (Eq. 13). Without $c(\bar{x})$, there are points of discontinuity at every $\{(n+0.5)\Delta\,|\,n\in\mathbb{Z}\}$. After adding $c(\bar{x})$ to the warping function $f(x)$, it becomes continuous (green).
Table 5: 8-, 6-, 4-, 3- and 2-bit weight quantization results of ResNet-32 learned by PSGD on the CIFAR-100 dataset. This is also reported in Figure 1 of the original paper. For lower bit-widths, the mean and standard deviation over five runs are reported. MSE between the learned weight and the target weight is reported. The numbers in the parentheses are the MSEs of SGD.

Bit-width       | 8-bit       | 6-bit       | 4-bit    | 3-bit      | 2-bit
Full precision  | 70.14       | 70.59       | 70.08    | 68.96±0.34 | 67.16±0.40
Low precision   | 70.05       | 70.33       | 69.57    | 68.56±0.19 | 60.76±2.18
MSE (SGD MSE)   | 0.03 (0.03) | 0.41 (0.58) | 0.5 (11) | 0.84 (48)  | 67 (157)

Appendix B Methods

B.1 Offset $c(\bar{x})$

In our warping function below,

$$f(x)=2\,\text{sign}(x-\bar{x})\left(\sqrt{|x-\bar{x}|+\epsilon}-\sqrt{\epsilon}\right)+c(\bar{x}), \qquad (9)$$

we introduced $c(\bar{x})$ to make $f(x)$ continuous. If we do not add the constant $c(\bar{x})$, $f(x)$ has points of discontinuity at every $\{(n+0.5)\Delta\,|\,n\in\mathbb{Z}\}$ as depicted in Fig. 6, where $\Delta$ represents the step size and $n\Delta$ is the $n$-th quantized value, identical to the $\bar{x}$ corresponding to $x$. We can calculate the left-sided and right-sided limits at $n\Delta+0.5\Delta$ using Eq. 9:

$$f(n\Delta+0.5\Delta^{-}) = 2\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon})+c(n\Delta) \qquad (10)$$
$$f(n\Delta+0.5\Delta^{+}) = -2\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon})+c((n+1)\Delta) \qquad (11)$$

Based on the condition that the left-sided and right-sided limits must be equal (Eq. 10 = Eq. 11), we get the following recurrence relation:

$$c((n+1)\Delta)-c(n\Delta)=4\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon}). \qquad (12)$$

Using successive substitution to calculate $c(\bar{x})$, we have

$$c(\Delta)-c(0)=4\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon})$$
$$c(2\Delta)-c(\Delta)=4\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon})$$
$$\vdots$$
$$c(n\Delta)-c((n-1)\Delta)=4\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon})$$

Summing these equations, the intermediate terms cancel and we are left with

$$c(n\Delta)-c(0)=4n\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon}).$$

Setting $c(0)=0$ and using $n\Delta=\bar{x}$, $c(\bar{x})$ can be calculated as

$$c(\bar{x})=\frac{4\bar{x}}{\Delta}\,(\sqrt{0.5\Delta+\epsilon}-\sqrt{\epsilon}). \qquad (13)$$
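A small sketch of Eqs. (9) and (13) is given below (function names are ours); it can be used to check the continuity of $f$ at the bin boundaries $(n+0.5)\Delta$.

```python
import math

def offset(x_bar: float, delta: float, eps: float = 1e-8) -> float:
    """Eq. (13): offset that makes the warping function continuous."""
    return 4.0 * x_bar / delta * (math.sqrt(0.5 * delta + eps) - math.sqrt(eps))

def warp(x: float, x_bar: float, delta: float, eps: float = 1e-8) -> float:
    """Eq. (9): elementwise warping function with the continuity offset c(x_bar)."""
    s = math.copysign(1.0, x - x_bar) if x != x_bar else 0.0
    return 2.0 * s * (math.sqrt(abs(x - x_bar) + eps) - math.sqrt(eps)) + offset(x_bar, delta, eps)
```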

B.2 Non-separable directional scaling

Here, we introduce another example of a warping function and the corresponding scaling function. In this case, we define the warping function as a multivariate function $\mathcal{F}(\boldsymbol{x})=[f_{1}(\boldsymbol{x}),\cdots,f_{n}(\boldsymbol{x})]^{T}$ and set

$$f_{i}(\boldsymbol{x})=2\,\text{sign}(x_{i}-\bar{x}_{i})\sqrt{(\|\boldsymbol{x}-\bar{\boldsymbol{x}}(\boldsymbol{x})\|_{\infty}+\epsilon)\cdot(|x_{i}-\bar{x_{i}}|+\epsilon)}+c(\bar{x_{i}}). \qquad (14)$$

Here, $\|\boldsymbol{x}\|_{\infty}$ is the infinity norm (max norm), which can be replaced with $|x_{max}|$, where $max$ is the index of the element with the maximum absolute value. $c$ is a constant as in Eq. 9. Writing $\delta_{i}=x_{i}-\bar{x_{i}}$, the partial derivatives of Eq. 14 become

$$\frac{\partial f_{i}}{\partial x_{j}}=\begin{cases}\sqrt{|\delta_{max}|+\epsilon}\,/\sqrt{|\delta_{i}|+\epsilon}&\text{if } j=i \text{ and } j\neq max\\ 2&\text{if } j=i \text{ and } j=max\\ \text{sign}(\delta_{i})\,\text{sign}(\delta_{j})\sqrt{|\delta_{i}|+\epsilon}\,/\sqrt{|\delta_{j}|+\epsilon}&\text{if } j\neq i \text{ and } j=max\\ 0&\text{otherwise.}\end{cases} \qquad (15)$$

By reordering the variable indices, we can put the max element last, and then the Jacobian matrix $J_{\boldsymbol{x}}^{\mathcal{F}}$ becomes upper triangular with all-positive diagonal elements, whose only non-zero off-diagonal elements are in the last column. Comparing the magnitude of the non-zero off-diagonal elements, which lie in the range $[-1,1]$, with that of the diagonal elements, which lie in the range $[1,\sqrt{q/2\epsilon+1}]$ where $q$ is the size of a quantization grid, the off-diagonal elements do not dominate the diagonal elements. Furthermore, considering that deep networks have a huge number of weight parameters, we can neglect the effect of the off-diagonal elements and use only the diagonal elements of the Jacobian matrix for scaling. In this case, the elementwise scaling function becomes

$$s(x)=\frac{|x-\bar{x}|+\epsilon}{\|\boldsymbol{x}-\bar{\boldsymbol{x}}\|_{\infty}+\epsilon}. \qquad (16)$$

Using the elementwise scaling function Eq.(16), the elementwise weight update rule for the PSG descent (PSGD) becomes

$$x_{i}^{t+1}=x_{i}^{t}-\eta\, s(x_{i})\left.\frac{\partial\mathcal{L}}{\partial x_{i}}\right|_{\boldsymbol{x}^{t}}. \qquad (17)$$
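A sketch of the resulting elementwise scaling is shown below (tensor-level helper, our naming): each component's distance to its target is normalized by the largest distance in the tensor, so at least one component always keeps a scale close to 1.

```python
import torch

def directional_scale(w: torch.Tensor, w_bar: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Eq. (16): s(x) = (|x - x_bar| + eps) / (||x - x_bar||_inf + eps), elementwise."""
    err = (w - w_bar).abs()
    return (err + eps) / (err.max() + eps)
```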

Independent vs. Directional scaling: The independent scaling function, such as the one presented in the main paper (Eq. 6), only considers the element-wise distance between each weight and its target. This means that when the weight vector is very close to one of the target points, the gradient magnitude could be very small, leading to slow convergence since the scaling function for all elements will be nearly 0. To avoid this, we added a small $\epsilon$ in Eq. 6. Note, however, that the weights still need to be updated according to the task loss (e.g. cross-entropy loss) to find an optimal solution. To address this degradation of the gradient magnitude, the directional scaling function (Eq. (16)) keeps a dominant direction by normalizing the scaled gradient, as depicted in Figure 7. The directional scaling performs slightly better than the independent scaling as shown in Table 6, but the difference is not large. Note also that the vanishing of the scaling function at the target in the independent scaling can in any case be mitigated by increasing the offset $\epsilon$.

Table 6: 2-bit results on ResNet-18 with 90 epochs. Weight decay is not applied.

Method               | FP    | W2A8
Independent scaling  | 63.35 | 54.96
Directional scaling  | 63.18 | 56.60

Figure 7: The scaling function $s(x_{i})$ for each component in a 2-D weight space ($\boldsymbol{x}=(x,y)$ and $(0,0)$ is the target point). The independent scaling function in Eq. 6 of the main paper (1st row) makes all the scaling factors $s(x_{i})$ for the gradient components very small when the weight vector $\boldsymbol{x}$ is very close to the target vector $\bar{\boldsymbol{x}}$. In the case of directional scaling (2nd row, Eq. 14), even when the error between the target point and the weight vector is small, at least one direction, corresponding to the element with the maximum error, has a scaling of 1, possibly resulting in more stable learning.

Appendix C Implementation details

We use CIFAR-10/100 and the ImageNet dataset for our experiments. CIFAR-10 consists of 50,000 training images and 10,000 test images in 10 classes with 6,000 images per class. CIFAR-100 consists of 100 classes with 600 images per class. The ImageNet dataset consists of 1.2 million training images; we use the 50,000 validation images, which are not included in the training samples, for testing. We use the conventional data pre-processing steps (https://github.com/kuangliu/pytorch-cifar and https://github.com/pytorch/examples/blob/master/imagenet/main.py).

ImageNet / CIFAR-10  For ResNet-18, we started training with an L2 weight decay of $10^{-4}$ and a learning rate of 0.1, then decayed the learning rate by a factor of 0.1 every 30 epochs. Training was terminated at 90 epochs. We only used the last 15 epochs for training the model with PSGD, similar to [1]. This means we applied the PSG method after 75 epochs with a learning rate of 0.001. For the extremely low-bit experiments, we did not use any weight decay after 75 epochs (see below). We tuned the hyper-parameter $\lambda_{s}$ for the target bit-widths. All numbers are the results of the last epoch. We used the official code of [44] (https://github.com/cornell-zhang/dnn-quant-ocs) for comparison, with 0.02 for the Expand Ratio.

CIFAR-100  For ResNet-32, the same weight decay and initial learning rate as above were used, and the learning rate was decayed at epochs 82 and 123 following [43]. Training was terminated at epoch 150. For VGG16 with batch normalization (VGG16-bn), we instead decayed the learning rate at epoch 145. We applied PSG after the first learning rate decay. The first convolutional layer and the last linear layer are quantized at 8 bits for the 2-bit and 3-bit experiments. For sparse training, training was terminated at epoch 200 and weight decay was not used at higher sparsity ratios, while all the other training hyperparameters were the same. For [32], we used the official implementation (https://github.com/AMLab-Amsterdam/L0_regularization) for the results.

Extremely low-bit experiments  For ImageNet, we did not use weight decay for 2 and 3 bits as it hinders convergence. For CIFAR-100, weight decay was not used only for 2 bits. See Sec. D.3 for details on how weight decay affects training with PSGD. In addition, we experimented with training for more epochs than the original schedule: we ran an additional 30 epochs with PSGD, so the total number of epochs is 120 and the PSG method is applied for the last 45 epochs.

Appendix D Additional experiments

D.1 Adam optimizer with PSG

To show the applicability of PSG to other types of optimizers, we applied it to the Adam optimizer using the same scaling function, with ResNet-32 at 4 bits on the CIFAR-100 dataset. Following convention, an initial learning rate of $10^{-3}$ was used, and the first and last layers of the model were fixed to 8 bits. All the other training hyperparameters remained the same. Table 7 compares the quantization results of the model trained with vanilla Adam and with PSG applied to Adam; a code sketch follows the table.

Table 7: ResNet-32 trained with Adam on the CIFAR-100 dataset. Vanilla Adam also suffers accuracy degradation at 4 bits, while applying PSG to Adam recovers the accuracy by more than 5%. Weight-only quantization is denoted by W4A32.

Method          FP      W4A32   W4A4
Adam            66.66   55.27   43.5
Adam with PSG   66.80   60.35   51.55
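As a sketch of how PSG can be combined with Adam, the gradients can be scaled in place before the usual Adam update; the uniform-grid targets, the placeholder linear model, and the exact placement of the scaling are illustrative assumptions rather than the procedure of the released code.

import torch
import torch.nn as nn

def psg_scale_(p, grid=0.25, eps=1e-6):
    # Directional scaling (Eq. 16) toward an illustrative uniform grid of targets.
    target = torch.round(p.data / grid) * grid
    err = (p.data - target).abs()
    p.grad.mul_((err + eps) / (err.max() + eps))

model = nn.Linear(16, 4)                                   # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial lr 1e-3, as above

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
for p in model.parameters():
    if p.grad is not None:
        psg_scale_(p)          # scale gradients element-wise before the Adam step
optimizer.step()
optimizer.zero_grad()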

D.2 Various architectures with PSGD

In this section, we show the results of applying PSGD to various architectures. Table 8 shows the quantization results of VGG16 [39] with batch normalization on the CIFAR-100 dataset and of DenseNet-121 [22] on the ImageNet dataset.

For DenseNet, we ran an additional 15 epochs from the pre-trained model (https://download.pytorch.org/models/densenet121-a639ec97.pth) to reduce the training time. For a fair comparison in terms of the number of epochs, we also trained SGD for an additional 15 epochs with the same final learning rate (0.001); however, we only observed oscillation in the performance during these additional epochs. As in the extremely low-bit experiments, we fixed the activation bit-width to 8 bits.

For VGG16 on the CIFAR-100 dataset, a tendency similar to ResNet-32 was observed: the model targeting 4 bits was able to maintain its full-precision accuracy, while the models targeting lower bit-widths showed some accuracy degradation.

Table 8: The performances of various architectures with PSGD.

CIFAR-100 & VGG16-bn
Method   FP / W4A4       FP / W3A8       FP / W2A8
SGD      73.12 / 63.08   73.12 / 3.44    73.12 / 1.00
PSGD     73.21 / 70.92   71.85 / 68.28   69.36 / 53.25

ImageNet & DenseNet-121
Method   FP / W8A8       FP / W6A8       FP / W4A8
SGD      74.43 / 73.85   74.43 / 70.57   74.43 / 0.36
PSGD     75.16 / 75.03   75.12 / 74.84   72.60 / 72.26

(a) Conv2_2 layer
(b) Conv4_1 layer
Figure 8: Weight distributions of models trained by PSGD with and without weight decay.

D.3 Weight decay at extremely low bits

To show the effect of weight decay at extremely low bits with PSG, we trained models with and without weight decay for 90 epochs, consisting of 75 epochs with SGD and the last 15 epochs with PSGD. The results are shown in Table 9. We found that weight decay has a detrimental effect in the extremely low-bit cases (2 and 3 bits). Figure 8 shows the weight distributions of the models trained with and without weight decay. The range of the weight distribution with weight decay (blue) is smaller than that without weight decay (red) due to the weight shrinkage effect. This regularization effect does not matter at higher bit-widths such as 6 and 4 bits, but it hurts the performance at extremely low bits, so we do not use weight decay for the extremely low-bit experiments.

Table 9: 6-, 4-, 3-, and 2-bit results of ResNet-18 on the ImageNet dataset. All results are from training for 90 epochs.

Method                      FP / W6A6       FP / W4A4       FP / W3A8       FP / W2A8
PSGD with weight decay      70.07 / 69.51   68.18 / 63.45   66.52 / 63.76   63.17 / 51.53
PSGD without weight decay   70.26 / 69.69   68.03 / 63.38   66.81 / 64.71   63.35 / 54.96

D.4 Longer training for extremely low bits

Although removing weight decay increases the performance significantly compared to the baseline, the quantized accuracy at extremely low bits still falls relatively short of that at higher bit-widths (Table 9). We therefore train for an additional 30 epochs with PSGD and report the numbers in Table 10. The table shows a significant performance gain from the longer training with PSGD. Note that the additional training is not useful for bit-widths above 3 bits.

Table 10: ResNet-18 on the ImageNet dataset. The numbers in the second row are the results of longer training (120 epochs), which uses an additional 30 epochs with PSGD.

                              Weight decay                    No weight decay
Method                        FP / W3A8       FP / W2A8       FP / W3A8       FP / W2A8
PSGD                          66.52 / 63.76   63.17 / 51.53   66.81 / 64.71   63.35 / 54.96
PSGD with additional epochs   66.64 / 65.23   63.90 / 54.45   66.75 / 66.36   64.60 / 62.65

Figure 9: Visualizing the loss spaces of Fig. 5 using [30]; Left: Loss space of SGD solution; Right: Loss space of PSGD solution.

Appendix E Visualization of loss space

We also used the official code of [30] (https://github.com/tomgoldstein/loss-landscape) to qualitatively assess the curvature of Fig. 5 in the main paper, using the same experimental setting as in Sec. 5. The result, depicted in Fig. 9, shows a similar tendency: the solution of PSGD lies in a sharper valley than that of SGD.

Appendix F Convergence analysis

Our algorithm is a variant of GD, so the standard convergence analysis applies under the condition that the step size $t$ satisfies $0<t\leq\frac{1}{L}$, where $L$ is the Lipschitz constant. The detailed definitions and proofs can be found in [12].

Theorem: Given the scaling vector $s(\cdot)\in\mathbb{R}^{n}$ and a convex, $L$-smooth function $f:\mathbb{R}^{n}\rightarrow\mathbb{R}$, the iterates satisfy $f(x_{i+1})-f(x_{i})\leq\left(\frac{s(x_{i})^{\circ 2}-2s(x_{i})}{2L}\right)^{\mathsf{T}}\nabla f(x_{i})^{\circ 2}$, so the sequence $f(x_{i})$ is monotonically non-increasing because the r.h.s. is always non-positive. As $i\rightarrow\infty$, $f(x_{i+1})$ converges to the optimum. ($\circ$ denotes the Hadamard operation.)
Proof: Substituting $x_{i+1}=x_{i}-\frac{s(x_{i})}{L}\circ\nabla f(x_{i})$ into the r.h.s. of $f(x_{i+1})-f(x_{i})\leq\nabla f(x_{i})^{\mathsf{T}}(x_{i+1}-x_{i})+\frac{L}{2}\|x_{i+1}-x_{i}\|_{2}^{2}$ (which follows from the property of $L$-smoothness) yields the inequality. Since $s(x_{i})=\frac{\mathrm{abs}(x_{i}-\bar{x}_{i})+\epsilon}{\|x_{i}-\bar{x}_{i}\|_{\infty}+\epsilon}$ (Appendix B) satisfies $0<s(x_{i})_{j}\leq 1$ for all $j\in[1,n]$, the r.h.s. is always non-positive. ($\mathrm{abs}(\cdot)$ denotes the element-wise absolute value.)
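For completeness, the substitution expands as follows (our own intermediate algebra for the step above):

\begin{aligned}
f(x_{i+1})-f(x_{i})
&\leq \nabla f(x_{i})^{\mathsf{T}}\Bigl(-\tfrac{1}{L}\,s(x_{i})\circ\nabla f(x_{i})\Bigr)
      +\tfrac{L}{2}\Bigl\|\tfrac{1}{L}\,s(x_{i})\circ\nabla f(x_{i})\Bigr\|_{2}^{2} \\
&= -\tfrac{1}{L}\,s(x_{i})^{\mathsf{T}}\nabla f(x_{i})^{\circ 2}
   +\tfrac{1}{2L}\,\bigl(s(x_{i})^{\circ 2}\bigr)^{\mathsf{T}}\nabla f(x_{i})^{\circ 2} \\
&= \Bigl(\tfrac{s(x_{i})^{\circ 2}-2\,s(x_{i})}{2L}\Bigr)^{\mathsf{T}}\nabla f(x_{i})^{\circ 2}\;\leq\;0,
\end{aligned}

where the last inequality uses $0<s(x_{i})_{j}\leq 1$, which implies $s(x_{i})_{j}^{2}-2\,s(x_{i})_{j}\leq-s(x_{i})_{j}<0$ for every $j$.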

Appendix G Hyper-parameter $\lambda_{s}$

We searched for an appropriate $\lambda_{s}$ with the criterion that the performance of the uncompressed model is not degraded, similar to [1]. For hyper-parameter tuning, we used two disjoint subsets of the training dataset for training and validation, and then retrained on the whole training dataset with the chosen $\lambda_{s}$. Tables 11 and 12 show the values of $\lambda_{s}$ used in the experiments of the main paper. $\lambda_{s}$ tended to increase for lower target bit-widths or higher sparsity ratios. On CIFAR-10, we observed that the same $\lambda_{s}$ value yields fair performance across all bit-widths. A sketch of this tuning protocol is given after Table 12.

Table 11: $\lambda_{s}$ used in the sparse training experiment.

CIFAR-100 & ResNet-32
Sparsity (%)    20.0   50.0   70.0   80.0   90.0
$\lambda_{s}$   100    100    200    600    1200

Table 12: $\lambda_{s}$ used in the quantization experiments.

ResNet-18       ImageNet                CIFAR-10
                8-bit   6-bit   4-bit   8-bit   6-bit   4-bit
$\lambda_{s}$   500     500     1000    10      10      10
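Below is a minimal sketch of the tuning protocol described above, with hypothetical helper names and illustrative split sizes and candidate values; the selection rule (keep the largest $\lambda_{s}$ whose full-precision validation accuracy stays within a small tolerance of the reference accuracy) is a simplified reading of the criterion in the text.

from torch.utils.data import DataLoader, random_split

def tune_lambda_s(train_set, candidates, build_model, train_fn, eval_fn,
                  reference_acc, tol=0.5):
    # Hold out 10% of the training set as a validation split for the search.
    n_val = len(train_set) // 10
    sub_train, sub_val = random_split(train_set, [len(train_set) - n_val, n_val])
    train_loader = DataLoader(sub_train, batch_size=128, shuffle=True)
    val_loader = DataLoader(sub_val, batch_size=128)
    best = None
    for lam in sorted(candidates):                 # e.g. [10, 100, 500, 1000]
        model = build_model()
        train_fn(model, train_loader, lambda_s=lam)
        fp_acc = eval_fn(model, val_loader)        # full-precision (uncompressed) accuracy
        if fp_acc >= reference_acc - tol:          # uncompressed accuracy preserved
            best = lam                             # keep the largest passing value
    return best  # then retrain on the whole training set with this value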