
QNCD: Quantization Noise Correction for Diffusion Models

Huanpeng Chu (Kuaishou Technology, Beijing, China) [email protected]; Wei Wu (Meituan, Shanghai, China) [email protected]; Chengjie Zang (Kuaishou Technology, Beijing, China) zcjjj_[email protected]; and Kun Yuan (Kuaishou Technology, Beijing, China) [email protected]
(2024)
Abstract.

Diffusion models have revolutionized image synthesis, setting new benchmarks in quality and creativity. However, their widespread adoption is hindered by the intensive computation required during the iterative denoising process. Post-training quantization (PTQ) presents a solution to accelerate sampling, albeit at the expense of sample quality, especially in low-bit settings. Addressing this, our study introduces a unified Quantization Noise Correction Scheme (QNCD), aimed at diminishing quantization noise throughout the sampling process. We identify two primary quantization challenges: intra and inter quantization noise. Intra quantization noise, mainly exacerbated by embeddings in the resblock module, extends activation quantization ranges, increasing disturbances in each single denoising step. In addition, inter quantization noise stems from cumulative quantization deviations across the entire denoising process, altering the data distribution step by step. QNCD combats these through embedding-derived feature smoothing, which eliminates intra quantization noise, and an effective runtime noise estimation module, which dynamically filters inter quantization noise. Extensive experiments demonstrate that our method outperforms previous quantization methods for diffusion models, achieving lossless results in the W4A8 and W8A8 quantization settings on ImageNet (LDM-4). Code is available at: https://github.com/huanpengchu/QNCD.

Diffusion Models, Post Training Quantization, Model Compression
journalyear: 2024; copyright: acmlicensed; conference: Proceedings of the 32nd ACM International Conference on Multimedia, October 28–November 1, 2024, Melbourne, VIC, Australia; booktitle: Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), October 28–November 1, 2024, Melbourne, VIC, Australia; doi: 10.1145/3664647.3681451; isbn: 979-8-4007-0686-8/24/10; ccs: Computing methodologies → Computer vision

1. Introduction

Figure 1. Comparison of metrics for the denoising process w.r.t. timestep $t$: LPIPS distance between outputs of the quantized Stable Diffusion model (W8A8) and its floating-point counterpart on MS-COCO, along with their respective CLIP Scores and FID (Fréchet Inception Distance) scores.

Recently, diffusion models have achieved remarkable progress in various synthesis tasks, such as image generation (ho2020denoising, ), super-resolution (saharia2021image, ), image editing and inpainting (song2020score, ), and image translation (sasaki2021unitddpm, ). Compared to traditional SOTA generative adversarial networks (GANs (goodfellow2020generative, )), diffusion models do not suffer from mode collapse or posterior collapse, and they exhibit higher training stability.

However, this progress comes at the cost of high computational resources and large parameter counts, which are typically available only on cloud devices. For example, Stable Diffusion (rombach2021high, ) requires 16 GB of running memory and a GPU with more than 10 GB of VRAM, which is not feasible for most consumer-grade PCs, let alone resource-limited edge devices.

Our work employs post-training quantization (PTQ) to speed up the sampling process at all time steps while avoiding the high cost of retraining diffusion models. PTQ, well-studied in traditional deep learning tasks like classification and segmentation (bhalgat2020lsq+, ; cai2020zeroq, ; guo2022squant, ), stands out as a preferred compression method due to its minimal requirements on training data and the convenience of direct deployment on hardware devices. Despite these attractive benefits, applying PTQ to diffusion models remains challenging. The main reason is that the framework of diffusion models differs substantially from previous PTQ targets (e.g., CNNs and ViTs (dosovitskiy2020image, ; liu2021post, ) for image recognition). Specifically, diffusion models commonly use UNet structures, which incorporate embeddings. In addition, diffusion models iteratively invoke the same UNet model during sampling. In recent work, PTQ4DM (shang2023post, ) and Q-Diffusion (li2023q, ) first applied PTQ to diffusion models and attributed the challenge to the fact that the activation distribution constantly changes with time steps. PTQD (PTQD, ) integrates partial quantization noise into the diffusion perturbation noise and proposes a mixed-precision scheme.

In contrast, we analyze in detail the sources of quantization noise and its negative impact on sampling direction and image quality. Specifically, we propose QNCD, a novel post-training quantization noise correction scheme dedicated to diffusion models. First, we identify embeddings in resblock modules as the primary source of intra quantization noise: embeddings amplify the outliers of the original features, making quantization challenging. We compute smoothing factors from the embeddings, making features easier to quantize and thus reducing intra quantization noise. Second, for the inter quantization noise accumulated across sampling steps, we propose a run-time noise estimation module based on the diffusion and denoising theory of diffusion models. By filtering out the estimated quantization noise, our QNCD dynamically corrects deviations in the output distribution throughout the sampling steps.

As shown in Fig. 1, with a smaller LPIPS distance (dosovitskiy2016generating, ) and a higher CLIP Score (radford2021learning, ), the sampling direction of our QNCD aligns more closely with that of the full-precision (FP) model. In addition, our method's FID (heusel2017gans, ) consistently outperforms Q-Diffusion's, with a final FID reduction of 2.23, reaching 27.33. Overall, our contributions are as follows:

  • We propose QNCD, a novel post-training quantization scheme for diffusion models to filter out quantization noise.

  • We find that a new challenge in quantizing diffusion models is the ongoing emergence and accumulation of quantization noise, which alters sampling direction and image quality.

  • We introduce a feature smoothing approach to reduce intra quantization noise when combining features with embeddings. Simultaneously, we utilize a run-time noise estimation module to correct inter quantization noise.

  • Our extensive experiments show that our method achieves new state-of-the-art performance for post-training quantization of diffusion models, especially in low-bit cases. Additionally, our methodology aligns more closely with the full-precision models in both objective metrics and subjective evaluations.

2. Related Work

Model quantization is a method that transitions from floating-point computations to low-precision fixed-point operations. This shift can effectively diminish the model's computational burden, reduce parameter size and memory usage, and expedite computation. It can be divided into two main categories: quantization-aware training (QAT) (jacob2018quantization, ) and post-training quantization (PTQ). QAT often yields superior outcomes with significantly fewer bits, but comes with the cost of substantial training overhead and a need for the raw dataset. In contrast, PTQ bypasses the need for extensive retraining, leveraging just a fraction of unlabeled data for calibration, making it a more cost-effective and deployment-friendly alternative. Given that retraining diffusion models has an unaffordable cost (e.g., training Stable Diffusion (rombach2021high, ) requires a cluster of over 4000 NVIDIA A100 GPUs), current works have pivoted towards PTQ to obtain low-bit diffusion models.

Until now, only a handful of studies have focused on post-training quantization of diffusion models. Among them, PTQ4DM (shang2023post, ) devised a time-step-aware sampling strategy for the calibration dataset, but its experiments are limited to small datasets and low resolutions. Q-Diffusion (li2023q, ) employs a state-of-the-art PTQ method (BRECQ (li2021brecq, )) to obtain its performance, which imposes an additional training burden. PTQD (PTQD, ) integrates partial quantization noise into the diffusion perturbation noise and proposes a mixed-precision scheme. TDQ (TDQ, ) dynamically adjusts the quantization interval based on time-step information.

Our method analyzes in detail the sources of quantization noise and proposes corresponding correction modules. In addition, we employ the most primitive PTQ methods, so the performance improvement is inherently attributable to our approach.

3. Method

In Section 3.1, we introduce the sampling process of diffusion models and present a unified formulation of quantization noise. In Sections 3.2 and 3.3, we analyze the sources of quantization noise and its impact on sampling direction and image quality. In Sections 3.4 and 3.5, we present the complete workflow of QNCD.

Figure 2. (a) shows the similarity of individual layer features across the entire UNet model during a single sampling step, illuminating that quantization noise primarily arises from the incorporation of embeddings. (b) illustrates the distribution of activations before and after the incorporation of the embedding (within the last Resblock). When combined with embeddings, outliers in features are amplified, which can be efficiently mitigated using our smoothing factor.

3.1. Preliminaries

Diffusion models are a family of probabilistic generative models that progressively destruct real data by injecting noise and then learn to reverse this process for generation, represented notably by denoising diffusion probabilistic models (DDPMs (ho2020denoising, )). A DDPM is composed of two chains: a forward chain that perturbs data to noise, and a reverse chain that converts noise back to data. The former is usually designed by hand, and its goal is to convert any data distribution into a simple prior distribution (e.g., a standard Gaussian distribution). Given a data distribution $x_0$, the forward process generates a sequence of random variables with transition kernel $q(x_t|x_{t-1})$:

(1) $q(x_{1:T}|x_0)=\prod_{t=1}^{T}q(x_t|x_{t-1}),\qquad q(x_t|x_{t-1})=\mathcal{N}(x_t;\sqrt{\alpha_t}\,x_{t-1},\beta_t I)$

where $\alpha_t$ and $\beta_t$ are hyperparameters and $\beta_t=1-\alpha_t$.

In the denoising process, starting from Gaussian noise $x_T$, the diffusion model generates samples by iteratively sampling $x_{t-1}$ from $p_\theta(x_{t-1}|x_t)$ until obtaining $x_0$, where the Gaussian distribution $p_\theta(x_{t-1}|x_t)$ approximates the unavailable true posterior $q(x_{t-1}|x_t)$. The mean $\mu_\theta(x_t,t)$ of $p_\theta(x_{t-1}|x_t)$ is calculated from the noise prediction network $\epsilon_\theta$ (usually a UNet):

(2) $\mu_\theta(x_t,t)=\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_\theta(x_t,t)\Big)$

where $\overline{\alpha}_t=\prod_{i=1}^{t}\alpha_i$. Therefore, the sampling of $x_{t-1}$ proceeds as follows:

(3) $x_{t-1}=\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_\theta(x_t,t)\Big)+\sigma_t z,\qquad z\sim\mathcal{N}(0,I)$

The diffusion model repeatedly invokes the noise prediction network to obtain the noise $\epsilon_\theta(x_t,t)$ and filter it out. The large number of iterative time steps (sometimes $T=4000$) and the complexity of the noise prediction network $\epsilon_\theta$ make sampling from diffusion models expensive. Post-training quantization of diffusion models is performed on the noise prediction network $\epsilon_\theta$, which inevitably introduces quantization noise:

(4) $\widetilde{x}_{t-1}=\frac{1}{\sqrt{\alpha_t}}\Big(\widetilde{x}_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\widetilde{\epsilon}_\theta(\widetilde{x}_t,t)\Big)+\sigma_t z=\frac{1}{\sqrt{\alpha_t}}\Big(\widetilde{x}_t-\frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\big(\epsilon_\theta(\widetilde{x}_t,t)+q_\theta(\widetilde{x}_t,t)\big)\Big)+\sigma_t z,\qquad z\sim\mathcal{N}(0,I)$

The noise prediction network (UNet) $\epsilon_\theta$ is constructed from multiple Resblocks, whose parameterized operations (convolutions, fully connected layers, etc.) generate intra quantization noise within a single sampling step. Over the complete sampling process, these intra quantization noises accumulate to form the inter quantization noise $q_\theta(\widetilde{x}_t,t)$, which further accumulates in the current output $\widetilde{x}_{t-1}$ and thus affects subsequent sampling steps.
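To make the two noise sources concrete, the following minimal PyTorch sketch walks through one reverse step of Eq. 3; it is an illustrative sketch, where `unet`, `alphas`, `alphas_bar`, and `sigmas` are assumed stand-ins for the real noise prediction network and its schedule tensors, not our released implementation.

```python
import torch

def ddpm_step(x_t, t, unet, alphas, alphas_bar, sigmas):
    """One reverse sampling step (Eq. 3). If `unet` is quantized, its output
    equals eps_theta + q_theta (Eq. 4), so q_theta leaks into x_{t-1}."""
    eps = unet(x_t, t)
    beta_t = 1.0 - alphas[t]
    mean = (x_t - beta_t / torch.sqrt(1.0 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])
    z = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + sigmas[t] * z  # quantization noise accumulates in the sample here
```

Iterating this step from $t=T$ down to $1$ yields $x_0$; with a quantized network, every iteration folds a fresh $q_\theta$ into the trajectory, which is exactly the inter quantization noise analyzed below.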

3.2. Quantization noise analysis

3.2.1. Intra Quantization Noise Introduced by Embedding

We consider the quantization noise within the denoising UNet in a single sampling step as intra quantization noise, which is strongly affected by the embedding. As shown in Fig. 2(a), the intra quantization noise of the diffusion model exhibits periodic changes within a single step. It escalates when the embedding is incorporated into features but then decreases during the fusion of UNet's low-level and high-level features. This cyclical behavior indicates that the primary culprit of quantization noise is the embedding integration phase. Take the embedding fusion stage of DDIM as an example:

(5) $scale_t,\,shift_t=layer_{emb}(emb_t).split(),\qquad h_t=norm(h_t)\cdot(1+scale_t)+shift_t$

where $h_t$ denotes the activation and $emb_t$ is the corresponding embedding. $emb_t$ imparts an utterly different distribution to the activation $h_t$. In Fig. 2(b), the distribution of the feature $h_t$ generally stabilizes within a quantization-friendly range after passing through a normalization layer. However, outliers emerging in certain activation channels pose a challenge.

These outliers are several magnitudes larger than the majority of the data, skewing the maximum-magnitude measurement during quantization. This dominance by outliers can result in reduced precision for the majority of non-outlier values. Further complications arise with the incorporation of the embedding, as shown in the middle part of Fig. 2(b). The embedding is split into a $1\times c$-dimensional scale vector $scale_t$ that scales the feature $h_t$ channel by channel. This channel-specific scaling alters the distribution of the activation $h_t$ such that certain channels, especially those with problematic outliers, are amplified. Since activations are typically quantized per-tensor, the combined effect of embedding magnification and existing outliers makes activation quantization less efficient.

In summary, after normalization the feature $h_t$ is easy to quantize, but with the introduction of $emb_t$ it becomes challenging to quantize, indicating an increase in intra quantization noise.
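This effect is easy to reproduce. The sketch below is purely illustrative, using a synthetic activation and a hand-picked amplified channel (our assumptions, not measured UNet statistics) to show how a single outlier channel inflates the per-tensor quantization step and the overall error:

```python
import torch

def fake_quant(x, n_bits=8):
    """Symmetric per-tensor uniform quantization (round-to-nearest)."""
    qmax = 2 ** (n_bits - 1) - 1
    step = x.abs().max() / qmax          # range is set by the largest magnitude
    return torch.round(x / step).clamp(-qmax - 1, qmax) * step

torch.manual_seed(0)
h = torch.randn(64, 128)                 # post-norm activations: 64 channels
scale = torch.ones(64, 1)
scale[5] = 20.0                          # embedding amplifies one channel (Eq. 5)
h_emb = h * (1 + scale)

for name, x in (("before embedding", h), ("after embedding", h_emb)):
    mse = (fake_quant(x) - x).pow(2).mean().item()
    print(f"{name}: per-tensor 8-bit MSE = {mse:.6f}")
```

Dividing `h_emb` by a per-channel factor such as `1 + scale` before quantization restores a balanced range, which is the intuition behind the smoothing factor introduced in Section 3.4.1.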

Figure 3. (a) demonstrates the mean and std of outputs across all time steps, while (b) visualizes the output distribution at a specific step, revealing a substantial discrepancy between the output of the quantized diffusion model (orange) and that of the full-precision model (gray). The gray dashed lines in (a) mark when our noise estimation module runs.

3.2.2. Inter Quantization Noise

Diffusion models attain their final outputs through iterative denoising network calls. During this procedure, the inter quantization noise, expressed as $q_\theta(\widetilde{x}_t,t)$ in Eq. 4, assimilates into $x_{t-1}$ and advances to the subsequent denoising step, exerting an influence over the entire sampling process.

As shown in Fig. 3, we plot the variation curves of the output mean and standard deviation over the sampling steps. The accumulated inter quantization noise changes the distribution of the synthesized data. As sampling continues, the data distribution of the quantized model deviates further and further from that of the full-precision model.

3.3. Effect of Quantization Noise

Quantization noise reduces the sampling efficiency of the diffusion model, changes the sampling direction, and ultimately reduces the quality of the synthesized image.

First, the introduction of quantization noise gives rise to new noise sources that themselves need denoising, substantially impairing the sampling efficiency of diffusion models. As depicted in Fig. 1, we compare the LPIPS distance and FID at each step between the full-precision model and the quantized model. The quantized diffusion model has an FID of 194.11 after 30 steps, which is still much larger than the result of the full-precision (FP) model after only 10 steps (FID = 140.76).

Besides, quantization noise also alters the iteration direction of diffusion models. In Fig. 1, we visualize the changing trend of the CLIP Score for different methods to provide a more intuitive representation of the iteration direction. At the beginning of the iteration, all methods start with relatively low CLIP Scores, while the full-precision model's score continues to rise, indicating a correct iteration direction. By step 25, the quantized diffusion model shows a 3.16% difference in CLIP Score compared to the full-precision model (26.09% vs. 29.25%).

In conclusion, quantization noise presents challenges in maintaining performance following model quantization. This not only calls for minimizing intra quantization noise as much as possible at all steps but also necessitates estimating and filtering out the remaining accumulated inter quantization noise.

3.4. Quantization Noise Correction for Diffusion Models

We propose two techniques, intra quantization correction and inter quantization correction, to address the challenges identified in the previous section.

Figure 4. Visualization of $scale_t$ and the smoothing factor $S$ as heatmaps. For ease of visualization, we select only 12 values from the 512-dimensional data.

3.4.1. Intra Quantization Correction

As shown in Fig. 2, during the single-step sampling process, the embedding amplifies activation outliers, leading to an imbalance among channels and ultimately inducing a periodic increase in intra quantization noise. To reduce intra quantization noise, we propose a channel-specific smoothing factor $S$. By dividing activations by their respective $S$ values, channels are balanced and become more amenable to quantization. We then fold the factor into the weights, maintaining the mathematical equivalence of the convolution, as follows:

(6) $Y=Q(h_t)\ast Q(W)=Q\big(\frac{h_t}{S}\big)\ast Q(SW)$

Ultimately, we transfer the quantization challenges presented by the embedding from activations to weights, which are more robust to quantization.
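The fold rests on a simple per-channel rescaling identity: in full precision, dividing the activation by $S$ and multiplying the corresponding weight rows by $S$ leaves the output unchanged, and only the subsequent insertion of the quantizers $Q(\cdot)$ makes the two sides differ. Below is a minimal sketch of the float-side identity, using an arbitrary positive $S$ (the embedding-derived $S$ of Eq. 7 is a drop-in replacement):

```python
import torch

torch.manual_seed(0)
c_in, c_out = 16, 32
h = torch.randn(8, c_in)                 # a batch of activations
W = torch.randn(c_in, c_out)             # linear / 1x1-conv weight
S = torch.rand(c_in) + 0.5               # any positive per-channel factor

y_ref = h @ W                            # original computation
y_fold = (h / S) @ (S.unsqueeze(1) * W)  # Eq. 6: smooth activation, fold S into W
print(torch.allclose(y_ref, y_fold, atol=1e-5))  # True: mathematically equivalent
```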

Figure 5. The pipeline of our proposed method. We begin by saving the exact embeddings and deducing the smoothing factor $S$ in the calibration stage. During the inference stage, the pre-computed $S$ is applied to smooth the features $h_t$, thereby diminishing the intra quantization noise. Besides, at periodic intervals, the inter quantization noise $q_\theta(\widetilde{x}_t,t)$ is estimated through our noise estimation module and filtered out of the output distribution.

Since the embedding operates channel by channel, our goal is to derive a per-channel factor $S$ from the embedding, making $h_t/S$ easier to quantize. As evident from Eq. 5, the term $1+scale_t$, obtained by splitting the embedding, accentuates the discrepancies among activation channels. However, $1+scale_t$ is dynamic and fluctuates with the time step $t$. Consequently, we examine the embedding across all time steps to ascertain the mean value of $|1+scale_t|$ and employ it as a static factor $S$:

(7) $S=\frac{1}{T}\sum_{t=1}^{T}\mid 1+scale_t\mid,\qquad scale_t,\,shift_t=emb_t.split()$

As shown in Fig. 2(b), the static factor $S$ we obtain calibrates these unbalanced channels, harmonizing the distribution across each one and rendering the eventual activations more conducive to per-tensor quantization. The different $scale_t$ and $S$ are highly similar, with only minor amplitude differences on some channels.

In addition, in LDM-type diffusion models, the embedding is incorporated differently than in Eq. 5:

(8) $h_t=\frac{(h_t+emb_t)-\mu_{(h_t+emb_t)}}{\sigma_{(h_t+emb_t)}}\cdot\alpha+\beta$

where $\alpha$ and $\beta$ are pretrained affine transform parameters of the GroupNorm operation. In this case, the distribution of the final activation $h_t$ is jointly determined by $emb_t$ and the GroupNorm coefficient $\alpha$. Thus, our smoothing factor $S$ is calculated as follows:

(9) $S=\frac{1}{T}\sum_{t=1}^{T}\frac{emb_t-\mu_{emb_t}}{\sigma_{emb_t}}\cdot\alpha$

In Fig. 4, we present a visualization of $scale_t$ within $emb_t$ across various modules, time steps, and categories. The key observation is that the distribution of $scale_t$ is imbalanced, unevenly scaling the input activations by channel, which renders the activations less suitable for quantization. Despite this, the aggregate distribution of the different $scale_t$ remains relatively stable, with only minor amplitude discrepancies across certain channels.

Our smoothing factor $S$ is thus capable of effectively representing the $scale_t$ encountered during the sampling phase. By smoothing with $S$, our QNCD mitigates the amplifying effect of $scale_t$ on activation outliers, leading to a reduction in intra quantization noise. Crucially, $S$ is pre-computed for each Resblock prior to the formal inference stage, ensuring no additional computational load during actual inference.
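As a hedged sketch (not the released code) of how the calibration-stage computation of Eq. 7 might look, assuming `emb_layer` is the resblock's embedding projection and `timestep_embs` holds one embedding per time step:

```python
import torch

@torch.no_grad()
def smoothing_factor(emb_layer, timestep_embs):
    """Static per-channel factor S (Eq. 7): mean of |1 + scale_t| over steps.

    Assumes `emb_layer(emb_t)` returns a (2c,)-vector splitting into
    (scale_t, shift_t), as in the DDIM resblock of Eq. 5."""
    acc = None
    for emb_t in timestep_embs:
        scale_t, _shift_t = emb_layer(emb_t).chunk(2)
        term = (1.0 + scale_t).abs()
        acc = term if acc is None else acc + term
    return acc / len(timestep_embs)      # pre-computed once, reused at inference
```

For the LDM variant, the same loop would instead accumulate the normalized embedding scaled by the GroupNorm weight $\alpha$, per Eq. 9.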

3.4.2. Inter Quantization Noise Correction

Eq. 4 shows a single sampling step of a quantized diffusion model, where the UNet outputs the predicted noise $\widetilde{\epsilon}_\theta(\widetilde{x}_t,t)$. It consists of two parts: the denoising noise $\epsilon_\theta(\widetilde{x}_t,t)$ and the inter quantization noise $q_\theta(\widetilde{x}_t,t)$, which keeps accumulating in $\widetilde{x}_t$. Ideally, we would filter out $q_\theta(\widetilde{x}_t,t)$ and thus avoid the accumulation of quantization noise, but it is impractical to separate $q_\theta(\widetilde{x}_t,t)$ from $\widetilde{\epsilon}_\theta(\widetilde{x}_t,t)$ directly.

The training stage of the diffusion model offers a way to separate $q_\theta(\widetilde{x}_t,t)$:

(10) $x_t=\sqrt{\alpha_t}\,x_{t-1}+\sqrt{1-\alpha_t}\,z_1,\qquad z_1\sim\mathcal{N}(0,I)$

Eq. 10 performs a single-step diffusion operation, adding a Gaussian noise $z_1$ to $x_{t-1}$ to obtain $x_t$.

(11) $L_t^{simple}=E_{t\sim[1,T],\,x_t,\,\epsilon_\theta}\big[\|z_1-\epsilon_\theta(x_t,t)\|\big]$

The training objective of the diffusion model (Eq. 11) drives the denoising network $\epsilon_\theta$ to learn the distribution of the Gaussian noise $z_1$, which means $\epsilon_\theta(x_t,t)\approx z_1$ holds for a pre-trained diffusion model.

This property still holds for the quantized pre-trained diffusion model:

(12) $\hat{x}_t=\sqrt{\alpha_t}\,\widetilde{x}_{t-1}+\sqrt{1-\alpha_t}\,z_1,\qquad \widetilde{\epsilon}_\theta(\hat{x}_t,t)=\epsilon_\theta(\hat{x}_t,t)+q_\theta(\hat{x}_t,t)\approx z_1+q_\theta(\hat{x}_t,t)$

As shown in Fig. 5 and Eq. 12, we add the standard Gaussian noise $z_1$ to $\widetilde{x}_{t-1}$ to obtain $\hat{x}_t$ and feed it into the quantized model $\widetilde{\epsilon}_\theta$. The output of the quantized model contains both the Gaussian noise $z_1$ to be filtered out and the newly introduced quantization noise $q_\theta(\hat{x}_t,t)$. Since $\hat{x}_t$ is obtained from $\widetilde{x}_t$ by a single-step denoising followed by a single-step diffusion, their distributions remain highly similar, as do the corresponding quantization noises:

(13) $q_\theta(\widetilde{x}_t,t)\approx q_\theta(\hat{x}_t,t)\approx\widetilde{\epsilon}_\theta(\hat{x}_t,t)-z_1$

Finally, the quantization noise $q_\theta(\widetilde{x}_t,t)$ can be determined, since the Gaussian noise $z_1$ is manually designed and $\widetilde{\epsilon}_\theta(\hat{x}_t,t)$ is the output of the noise prediction network, both of which are known.

Based on the above analysis, we can estimate the quantization noise by simulating the training process of the diffusion model. We first add the deterministic noise $z_1$ to $\widetilde{x}_{t-1}$; since $z_1$ is known, it can be filtered out, and the remaining indeterminate noise is the quantization noise $q_\theta(\widetilde{x}_t,t)$. Note that the quantization noise is obtained through estimation and does not align perfectly with the actual noise at the pixel level, but it matches at the level of the overall distribution. Thus we obtain the distribution of the quantization noise at stage $t-1$ and correct the output sample in Eq. 4.

The above procedure describes single-step quantization noise estimation. In a diffusion model, the sample distributions of neighboring steps are very similar, as are the corresponding quantization noise distributions ($q_\theta(\widetilde{x}_{t+1},t+1)\approx q_\theta(\widetilde{x}_t,t)$). Therefore, in the actual sampling process, we divide the sampling steps into multiple stages and estimate the distribution of the quantization noise once per stage. For example, our method runs only 4 times to estimate the quantization noise during a sampling process of 100 time steps, bringing a negligible increase in sampling duration. As shown in Fig. 3, our QNCD periodically estimates the distribution of inter quantization noise and corrects the distribution deviations caused by this noise, enabling the distribution of sample outputs to closely align with that of the full-precision model.
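A minimal sketch of one invocation of the run-time estimation module (Eqs. 10–13); `unet_q` and `alphas` are assumed handles to the quantized noise prediction network and its schedule, not the paper's exact implementation:

```python
import torch

@torch.no_grad()
def estimate_inter_noise(x_prev, t, unet_q, alphas):
    """Estimate q_theta at step t by simulating one training-style diffusion:
    re-noise the current sample with a known z1 (Eq. 10), run the quantized
    UNet (Eq. 12), and subtract z1 (Eq. 13)."""
    z1 = torch.randn_like(x_prev)
    x_hat = torch.sqrt(alphas[t]) * x_prev + torch.sqrt(1.0 - alphas[t]) * z1
    eps_q = unet_q(x_hat, t)             # approximately z1 + q_theta
    return eps_q - z1                    # approximately q_theta
```

In the sampler, this routine would run once per stage (e.g., 4 times over 100 steps), and the returned estimate is used at the distribution level, correcting the statistics of subsequent outputs rather than being subtracted pixel by pixel.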

3.5. Summary of methods

As shown in Fig. 5, our method contains two major blocks: the intra quantization noise correction module and the inter quantization noise correction module. Firstly, we determine the smoothing factor $S$ on a channel-by-channel basis, thereby transferring the distribution disparities induced by the embedding to the weights, which makes the activations easier to quantize. Secondly, we discern the distribution of quantization noise via our run-time noise estimation module, enabling its removal in subsequent sampling steps.

4. Experiments

4.1. Implementation Details

Datasets and quantization settings: Consistent with the experimental details of PTQ4DM (shang2023post, ) and Q-Diffusion (li2023q, ), we conduct image synthesis experiments using pre-trained diffusion models (DDIM (song2020denoising, )), latent diffusion models (LDM (rombach2021high, )), and Stable Diffusion on four standard benchmarks: CIFAR (32×32) (krizhevsky2009learning, ), ImageNet (256×256) (deng2009imagenet, ), LSUN-Bedrooms (256×256) (yu2015lsun, ), and MS-COCO (512×512) (cocodataset, ). All experimental configurations, including the number of steps and the variance schedule, follow the official implementations. To facilitate method validation and comparison, we use the naive PTQ method (MSE-based range setting) for 8-bit quantization due to its simplicity and speed. For 4-bit weight quantization, we adopt BRECQ (li2021brecq, ) and AdaRound (nagel2020adaptive, ) to ensure model performance. We uniformly sample from all time steps to create the PTQ calibration dataset, using 5120 samples for all datasets.
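As a hypothetical illustration of this calibration-set construction (uniform coverage of all time steps), assuming a `sample_run` iterable that yields (t, x_t) pairs recorded while the full-precision sampler runs:

```python
import torch

@torch.no_grad()
def collect_calibration(sample_run, num_keep=5120):
    """Keep an even spread of (t, x_t) pairs across time steps for PTQ
    calibration; `sample_run` yields (t, x_t) during float sampling."""
    recorded = [(t, x_t.detach().cpu()) for t, x_t in sample_run]
    stride = max(1, len(recorded) // num_keep)
    return recorded[::stride][:num_keep]
```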

Evaluation Details: For each experiment, we report the widely adopted Fréchet Inception Distance (FID) (heusel2017gans, ) and sFID (salimans2016improved, ) to evaluate performance. For the ImageNet and CIFAR experiments, we additionally report the Inception Score (IS) (barratt2018note, ) for reference, to ensure consistency of reported results. For MS-COCO, we introduce the CLIP Score to measure the correspondence between the synthesized images and prompts.

In line with Q-Diffusion, we generate 50,000 samples for evaluation. Due to the time-consuming nature of sampling high-resolution images like those of MS-COCO, we produce only 10,000 samples with Stable Diffusion to speed up the comparison.

4.2. Unconditional Generation

Table 1. Quantization results on CIFAR (32×32) with DDIM. (50,000 samples)

Method      | Bitwidth (W/A) | DDIM (Steps=100): IS↑ / FID↓ / sFID↓ | DDIM (Steps=250): IS↑ / FID↓ / sFID↓
FP          | 32/32          | 9.04 / 4.19 / 4.41                   | 9.06 / 4.00 / 4.35
TDQ         | 8/8            | 8.85 / 5.99 / –                      | – / – / –
Q-Diffusion | 8/8            | 9.17 / 3.93 / 4.34                   | 9.38 / 3.84 / 4.27
Ours        | 8/8            | 9.24 / 3.36 / 4.24                   | 9.41 / 3.46 / 4.21
Q-Diffusion | 4/8            | 9.41 / 4.92 / 5.13                   | 9.64 / 4.37 / 4.59
Ours        | 4/8            | 9.53 / 4.85 / 5.06                   | 9.78 / 4.43 / 4.51
Q-Diffusion | 4/6            | 7.53 / 39.07 / 43.36                 | 7.81 / 34.65 / 37.29
Ours        | 4/6            | 8.86 / 12.26 / 14.83                 | 9.01 / 11.09 / 13.46
Table 2. Comparisons with extra SOTA methods on ImageNet (LDM-4, Steps=20) and LSUN-Bed (LDM-4, Steps=200). "*" marks results reported in the corresponding paper. (50,000 samples)

Method      | ImageNet (FID↓ / IS↑, FP: 11.42/245.39)      | LSUN-Bed (FID↓ / sFID↓, FP: 3.16/7.84)
            | W8A8          | W4A8         | W4A6          | W8A8
PTQD*       | 11.94/153.92  | 10.40/214.73 | –             | 3.75/9.89
TDQ*        | –             | –            | 41.23/–       | –
Q-Diffusion | 10.92/229.31  | 9.56/219.64  | 41.25/89.82   | 4.03/10.15
QNCD        | 10.57/231.85  | 9.48/221.62  | 20.14/136.49  | 3.82/9.65

Results on CIFAR: The results are displayed in Tab. 1. Note that W$n$A$m$ denotes $n$-bit quantization for weights and $m$-bit quantization for activations. At the W8A8 bitwidth, our method achieves FIDs and sFIDs close to the full-precision model, with FID reductions of 0.57 (steps=100) and 0.38 (steps=250) compared to Q-Diffusion. Previous methods struggle with low-bit quantization noise, leading to high FIDs and sFIDs, such as 39.07 and 43.36 for W4A6 at 100 steps with Q-Diffusion. Our method effectively eliminates this noise, achieving a much lower FID of 12.26, proving its effectiveness. Our method enables 6-bit activation quantization of diffusion models without performance collapse, unlike previous methods limited to 8-bit.

Results on LSUN-Bedrooms: At the W8A8 bitwidth, our method reduces the FID by 0.21 compared to Q-Diffusion as shown in Tab. 2, proving the effectiveness of our method on the task of high-resolution image synthesis.

4.3. Class-conditional Generation

Figure 6. Stable Diffusion 512×512 text-guided image synthesis results using (a) Q-Diffusion, (b) our QNCD, and (c) full precision, under W8A8. All text prompts are sourced exclusively from the MS-COCO dataset.

Results on ImageNet: We carry out experiments on conditional ImageNet generation to demonstrate the effectiveness of our method. To facilitate validation, we adopt the LDM-4 model with 20 steps. As shown in Tab. 2, our method consistently narrows the performance gap between quantized and full-precision diffusion models. Specifically, under the W8A8 and W4A8 settings, our QNCD approaches lossless performance. It is worth noting that at the W4A6 bitwidth, the IS of Q-Diffusion drops to 89.82, which is 155.57 lower than that of the full-precision model (245.39). Our method handles low-bit activation quantization well: compared to the FID of Q-Diffusion, which is as high as 41.25, ours is 20.14, indicating the effectiveness of our method. Visualizations are available in the Appendix.

Figure 7. Quantization results for Stable Diffusion (steps=50) on MS-COCO (10,000 samples). The dashed lines represent the results of the full-precision model.

4.4. Text-guided Image Generation

We assess the performance of QNCD on Stable Diffusion for text-guided image generation, using text prompts derived from the MS-COCO dataset. As demonstrated in Fig. 7, our method surpasses Q-Diffusion in both FID and CLIP Score.

In addition, we visualize the final generated images in Fig. 6. For all three methods (FP, Q-Diffusion, and ours), we provide the same content conditions as input to facilitate comparison. It can be noticed that the accumulated quantization noise changes the content space of the image, causing the final synthesized image to drift. As shown in the red boxes in Fig. 6, the synthesized images contain abnormal content, such as abnormal faces, incomplete bowls, and floating cows. Compared with Q-Diffusion, our method produces higher-quality images that are closer to those of the full-precision model, with more realistic details and colors and richer semantic information. In conclusion, our method effectively mitigates quantization noise and is closer to the full-precision model not only in statistical metrics but also in visual quality. More visualization results are shown in the Appendix.

Table 3. The effect of different modules of QNCD with Stable Diffusion on MS-COCO (512×512).

Method      | Bitwidth (W/A) | FID↓  | CLIP Score↑
FP          | 32/32          | 23.80 | 30.54
Q-Diffusion | 8/8            | 27.84 | 30.23
Intra-QNCD  | 8/8            | 27.41 | 30.25
Inter-QNCD  | 8/8            | 27.60 | 30.29
QNCD        | 8/8            | 27.33 | 30.32
Figure 8. Comparison of the signal-to-noise ratio (SNR) and cosine similarity at each step of DDIM (100 steps) on CIFAR (W8A8).

4.5. Ablation Study

4.5.1. Comparison of SNR and Cosine Similarity.

Given the output of the floating-point noise prediction network and the corresponding quantization noise, we plot the curve of the signal-to-noise ratio (SNR) of the quantized diffusion model, as shown in Fig. 8, as well as the cosine similarity between the corresponding output and the FP output. We observe that: (1) as denoising steps increase, the cosine similarity between the quantized model output and the FP output continuously decreases, meaning that the overall quantization noise keeps accumulating and the corresponding SNR keeps decreasing; (2) our method estimates and filters out the quantization noise at a global scale, resulting in a better SNR and a sampling process closer to that of the floating-point model.

4.5.2. Effects of each module

As shown in Tab. 3, we conduct ablation experiments on Stable Diffusion (steps=50) with the MS-COCO 512×512 dataset to demonstrate the effectiveness of our QNCD, which includes intra quantization noise correction (Intra-QNCD) and inter quantization noise correction (Inter-QNCD). Intra-QNCD reduces FID by 0.43 compared to Q-Diffusion, while Inter-QNCD reduces FID by 0.24 and improves the CLIP Score by 0.06. Combining both blocks, QNCD achieves a 0.51 reduction in FID, showing their collaborative effectiveness in improving performance. These results validate our noise correction techniques for post-training quantization of diffusion models.

Table 4. Inference time and Image Quality Assessment (IQA) for MS-COCO via Stable Diffusion (512×512, 50 steps).

Metric           | FP16     | Original PTQ (W8A8) | Q-Diffusion (W8A8) | QNCD (W8A8)
Inference Time   | 959.5 ms | 601.8 ms            | 628.3 ms           | 631.2 ms
IQA Score↑ (0–1) | 0.847    | 0.728               | 0.775              | 0.793

4.5.3. Comparison of real inference efficiency

For a fair comparison, we provide end-to-end inference times in Tab. 4. The times are measured on the UNet of Stable Diffusion v1.4 and cover the whole denoising process. The experimental setup is an A100 GPU with TensorRT-8.6 and CUDA-11.7. Similarly to our QNCD, Q-Diffusion introduces the shortcut-splitting operation in pursuit of better model performance, which also imposes an additional inference burden (26.5 ms over original PTQ). Our method runs at a similar speed to Q-Diffusion but with higher image quality.

4.5.4. Comparison through Image Quality Assessment

As shown in Tab. 2, the FID metrics of PTQD and Q-Diffusion on the ImageNet dataset are 11.94 and 10.92, superior to the FP score of 12.45 under the W8A8 setting. The same pattern extends to the LSUN-Bedrooms and CIFAR datasets, which is unexpected and implies that FID may not be an optimal indicator of image quality: FID focuses more on overall distribution similarity than on the specific quality of each image. For a more comprehensive comparison, we further adopt the objective Image Quality Assessment (IQA) metric proposed in CLIP-IQA (clipiqa, ) to evaluate 5,000 synthesized images. Our method achieves an IQA score of 0.793, better than Q-Diffusion (0.775) but still short of FP (0.847).

5. Conclusion

In this paper, we propose QNCD, a unified quantization noise correction scheme for diffusion models. To start with, we conduct a detailed analysis of the sources and effects of quantization noise in terms of both visualization and actual metrics, and find that the periodic increase in intra quantization noise stems from the embedding's alteration of feature distributions. Accordingly, we calculate a smoothing factor for features to reduce quantization noise. Besides, a run-time noise estimation module is proposed to estimate the distribution of inter quantization noise, which is then filtered out during the sampling process of the diffusion model. Leveraging these techniques, our QNCD surpasses existing state-of-the-art post-training quantized diffusion models, especially at low-bit activation quantization (W4A6). Our approach achieves the current SOTA on multiple diffusion model frameworks (DDIM, LDM, and Stable Diffusion) and multiple datasets, demonstrating the broad applicability of QNCD.

Acknowledgements

We want to extend our heartfelt thanks to everyone who has provided their suggestions regarding our manuscript. Their insightful comments and constructive feedback have significantly contributed to the improvement of this paper.

References

  • (1) Barratt, S., Sharma, R.: A note on the inception score. arXiv preprint arXiv:1801.01973 (2018)
  • (2) Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., Kwak, N.: Lsq+: Improving low-bit quantization through learnable offsets and better initialization. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp. 2978–2985 (2020)
  • (3) Cai, Y., Yao, Z., Dong, Z., Gholami, A., Mahoney, M.W., Keutzer, K.: Zeroq: A novel zero shot quantization framework. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13166–13175 (2020)
  • (4) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
  • (5) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • (6) Dosovitskiy, A., Brox, T.: Generating images with perceptual similarity metrics based on deep networks. Advances in neural information processing systems 29 (2016)
  • (7) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial networks. Communications of the ACM 63(11), 139–144 (2020)
  • (8) Guo, C., Qiu, Y., Leng, J., Gao, X., Zhang, C., Liu, Y., Yang, F., Zhu, Y., Guo, M.: Squant: On-the-fly data-free quantization via diagonal hessian approximation. ArXiv (abs/2202.07471), 1–14 (2022)
  • (9) He, Y., Liu, L., Liu, J., Wu, W., Zhou, H., Zhuang, B.: Ptqd: Accurate post-training quantization for diffusion models. Advances in Neural Information Processing Systems 36 (2024)
  • (10) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. In: Advances in Neural Information Processing Systems. vol. 30, pp. 12–21 (2017)
  • (11) Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33(6), 6840–6851 (2020)
  • (12) Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., Adam, H., Kalenichenko, D.: Quantization and training of neural networks for efficient integer-arithmetic-only inference. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2704–2713 (2018)
  • (13) Krizhevsky, A., et al.: Learning multiple layers of features from tiny images (2009)
  • (14) Li, X., Lian, L., Liu, Y., Yang, H., Dong, Z., Kang, D., Zhang, S., Keutzer, K.: Q-diffusion: Quantizing diffusion models. arXiv preprint arXiv:2302.04304 (2023)
  • (15) Li, Y., Gong, R., Tan, X., Yang, Y., Hu, P., Zhang, Q., Yu, F., Wang, W., Gu, S.: Brecq: Pushing the limit of post-training quantization by block reconstruction. In: International Conference on Learning Representations (2021)
  • (16) Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. arXiv preprint arXiv:1405.0312 (2014)
  • (17) Liu, Z., Wang, Y., Han, K., Zhang, W., Ma, S., Gao, W.: Post-training quantization for vision transformer. Advances in Neural Information Processing Systems 34, 28092–28103 (2021)
  • (18) Nagel, M., Amjad, R.A., Baalen, M.v., Louizos, C., Blankevoort, T.: Up or down? adaptive rounding for post-training quantization. arXiv preprint arXiv:2004.10568 (2020)
  • (19) Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • (20) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10674–10685 (2021)
  • (21) Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image superresolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (2021)
  • (22) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training gans. In: Advances in neural information processing systems. vol. 29 (2016)
  • (23) Sasaki, H., Willcocks, C.G., Breckon, T.: Unitddpm: Unpaired image translation with denoising diffusion probabilistic models. arXiv preprint arXiv:2104.05358 (2021)
  • (24) Shang, Y., Yuan, Z., Xie, B., Wu, B., Yan, Y.: Post-training quantization on diffusion models. In: CVPR (2023)
  • (25) So, J., Lee, J., Ahn, D., Kim, H., Park, E.: Temporal dynamic quantization for diffusion models. Advances in Neural Information Processing Systems 36 (2024)
  • (26) Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • (27) Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  • (28) Wang, J., Chan, K.C., Loy, C.C.: Exploring clip for assessing the look and feel of images. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 2555–2563 (2023)
  • (29) Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., Xiao, J.: Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)