
PTQ4DiT: Post-training Quantization for
Diffusion Transformers

Junyi Wu1,3, Haoxuan Wang1,∗, Yuzhang Shang2, Mubarak Shah3, Yan Yan1,†
1University of Illinois Chicago 2Illinois Institute of Technology 3University of Central Florida
https://github.com/adreamwu/PTQ4DiT
∗Equal Contribution. †Corresponding Author. Work done during Junyi Wu’s visit to CRCV, UCF.
Abstract

The recent introduction of Diffusion Transformers (DiTs) has demonstrated exceptional capabilities in image generation by using a different backbone architecture, departing from traditional U-Nets and embracing the scalable nature of transformers. Despite their advanced capabilities, the wide deployment of DiTs, particularly for real-time applications, is currently hampered by considerable computational demands at the inference stage. Post-training Quantization (PTQ) has emerged as a fast and data-efficient solution that can significantly reduce computation and memory footprint by using low-bit weights and activations. However, its applicability to DiTs has not yet been explored and faces non-trivial difficulties due to the unique design of DiTs. In this paper, we propose PTQ4DiT, a specifically designed PTQ method for DiTs. We discover two primary quantization challenges inherent in DiTs, notably the presence of salient channels with extreme magnitudes and the temporal variability in distributions of salient activation over multiple timesteps. To tackle these challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman’s ρ-guided Salience Calibration (SSC). CSB leverages the complementarity property of channel magnitudes to redistribute the extremes, alleviating quantization errors for both activations and weights. SSC extends this approach by dynamically adjusting the balanced salience to capture the temporal variations in activation. Additionally, to eliminate extra computational costs caused by PTQ4DiT during inference, we design an offline re-parameterization strategy for DiTs. Experiments demonstrate that our PTQ4DiT successfully quantizes DiTs to 8-bit precision (W8A8) while preserving comparable generation ability and further enables effective quantization to 4-bit weight precision (W4A8) for the first time.

1 Introduction

Diffusion models have spearheaded recent breakthroughs in generation tasks [59, 7]. In the past, these models were based on convolutional U-Nets [40] as their backbone architectures [46, 17, 9, 39]. However, recent work [2, 60, 30] has revealed that the U-Net inductive bias is not essential for the success of diffusion models and even limits their scalability. As part of this trend, Diffusion Transformers (DiTs) [37] have demonstrated exceptional capabilities in image generation by using a different backbone architecture. Different from U-Nets that carefully design downsampling and upsampling blocks with skip-connections, DiTs are constructed by repeatedly and sequentially stacking transformer blocks [49]. This architectural choice inherits the scaling property of transformers [5, 48, 58, 31], facilitating more flexible parameter expansion for enhanced performance. With their versatility and scalability, DiTs have been successfully integrated into advanced frameworks like Sora [4], demonstrating their potential as a leading architecture for future generative models [14, 6, 30, 65].

Nonetheless, the widespread adoption of Diffusion Transformers is currently constrained by their massive amount of parameters and computational complexity. DiTs consist of a large number of repeated transformer blocks and employ a lengthy iterative image sampling process, demanding high computational costs during inference. For instance, generating a 512×512 resolution image using DiTs can take more than 20 seconds and 10⁵ Gflops on an NVIDIA RTX A6000 GPU. This substantial requirement makes them impractical for real-time applications, especially considering the potential for increased model sizes and feature resolutions.

Model quantization [33, 32, 28] is a prominent technique for accelerating deep learning models because of its high compression rate and significant reduction in inference time. This technique transforms model weights and activations into low-bit formats, which directly reduces the computational burden and memory usage. Among various methods, Post-training Quantization (PTQ) stands out as a leading approach since it circumvents the need to re-train the original model [62, 44, 18, 63, 22]. Practically, PTQ requires only a small dataset for fast calibration and is thus highly suitable for quantizing DiTs, whose re-training process involves extensive data and computational resources [14, 6].

However, quantizing DiTs in a post-training manner is non-trivial due to the complex distribution patterns in weights and activations. We discover two major challenges that impede the effective quantization of DiTs: (1) the emergence of salient channels, i.e., channels with extreme magnitudes, in both weights and activations of linear layers within DiT blocks. When low-bit representations are used for these salient channels, pronounced errors compared to the full-precision (FP) counterparts are observed, incurring fundamental difficulty for quantization. (2) The extreme magnitudes within salient activation channels significantly vary as the inference proceeds across multiple timesteps. This dynamic behavior further complicates the quantization of salient channels, as quantization strategies optimized for one timestep may fail to generalize to other timesteps. Such inconsistency, especially in salient channels that dominate the activation signals, can result in significant deviations from the full-precision distribution, leading to degradation in the generation ability of quantized models.

Targeting these two challenges, we propose a novel Post-training Quantization method specifically for Diffusion Transformers, termed PTQ4DiT. To address the quantization difficulty associated with salient channels, we propose Channel-wise Salience Balancing (CSB). CSB capitalizes on an interesting observation about salient channels: extreme values do not coincide in the same channel of activation and weight within the same layer, as shown in Figure 1 (Left). Leveraging this complementarity property, CSB facilitates the redistribution of extreme magnitudes between activations and weights to minimize the overall channel salience. Concretely, we introduce Salience Balancing Matrices, derived from the statistical properties of activation and weight distributions, to channel-wise transform both activations and weights. This transformation achieves equilibrium in their salient channels, effectively mitigating the quantization difficulty of the balanced distributions.

Figure 1: (Left) Illustration of salient channels in activation and weight. Note that salient activation channels exhibit variations over different timesteps (e.g., $t=t_{1},t_{2},t_{3}$), posing non-trivial quantization challenges. To mitigate the overall quantization difficulty, our method leverages the complementarity (activation and weight channels do not have extreme magnitude simultaneously) to redistribute channel salience between weights and activations across various timesteps. (Right) Quantization performance on W8A8 and W4A8, employing FID (lower is better) and IS (higher is better) metrics on ImageNet 256×256 [41]. The circle size indicates the model size.

Recognizing the variability in activations over different timesteps, we further extend the concept of channel salience along the temporal dimension and propose Spearman’s ρ-guided Salience Calibration (SSC). This method refines the Salience Balancing Matrices to comprehensively evaluate activation salience over timesteps, with more emphasis on timesteps where the complementarity between salient activation and weight channels is more significant. Furthermore, we design a re-parameterization scheme that can offline absorb these Salience Balancing Matrices into adjacent layers, thus avoiding additional computation overhead at the inference stage.

While the performance of mainstream PTQ methods degrades on DiTs, our PTQ4DiT achieves comparable performance to the FP counterpart with 8-bit weight and activation (W8A8). In addition, PTQ4DiT can generate high-quality images with further reduced weight precision at 4-bit (W4A8). To the best of our knowledge, PTQ4DiT is the first method for effective DiT quantization.

Figure 2: (Left) Overview of the Diffusion Transformer (DiT) Block [37]. (Middle) Illustration of the linear layer in Multi-Head Self-Attention (MHSA) and Pointwise Feedforward (PF) modules, which incorporates our proposed Channel-wise Salience Balancing (CSB) and Spearman’s ρ-guided Salience Calibration (SSC) to address quantization difficulties for both activation $\mathbf{X}$ and weight $\mathbf{W}$. Appendix A depicts detailed structures of the MHSA and PF modules with adjusted linear layers. (Right) Illustration of CSB and SSC in PTQ4DiT. CSB redistributes salient channels between weights and activations from various timesteps to reduce overall quantization errors. SSC calibrates the activation salience across multiple timesteps via selective aggregation, with more focus on timesteps where quantization errors can be significantly reduced by CSB.

2 Backgrounds and Related Works

2.1 Diffusion Transformers

Although generative models built upon U-Nets have made great advancements in the last few years, transformer-like architectures are increasingly attracting attention [39, 7, 59]. The recently explored Diffusion Transformers (DiTs) [37] have achieved state-of-the-art performance in image generation. Encouragingly, DiTs exhibit remarkable scalability in model size and data representation, positioning them as a promising backbone for a wide range of generative applications [4, 30, 65].

DiTs consist of $n_{B}$ blocks, each containing a Multi-Head Self-Attention (MHSA) and a Pointwise Feedforward (PF) module [49, 11, 37], both preceded by their respective adaptive Layer Norm (adaLN) [38]. We illustrate the DiT Block structure in Figure 2 (Left). These blocks sequentially process the noised latent and conditional information, which are both represented as tokens in a lower-dimensional latent space [39]. In each block, the conditional input $\mathbf{c}\in\mathbb{R}^{d_{in}}$ is converted to scale and shift parameters ($\bm{\gamma},\bm{\beta}\in\mathbb{R}^{d_{in}}$), which are regressed through MLPs and then injected into the noised latent $\mathbf{Z}\in\mathbb{R}^{n\times d_{in}}$ via adaLN:

$(\bm{\gamma},\bm{\beta})=\text{MLPs}(\mathbf{c}),\quad\text{adaLN}(\mathbf{Z})=\text{LN}(\mathbf{Z})\odot(\bm{1}+\bm{\gamma})+\bm{\beta},$ (1)

where LN$(\cdot)$ is the standard Layer Norm [1]. These adaLN modules dynamically adjust the layer normalization before each MHSA and PF module, enhancing DiTs’ adaptability to varying conditions and improving the generation quality.
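
To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of the adaLN modulation in Eq. (1). The module name and the single-linear regressor for (γ, β) are illustrative simplifications (the actual DiT implementation additionally regresses gating parameters); this is not the authors’ code.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Minimal adaptive Layer Norm sketch (Eq. (1)): adaLN(Z) = LN(Z) * (1 + gamma) + beta."""
    def __init__(self, dim: int):
        super().__init__()
        # Scale/shift come from the condition, so LN itself carries no affine parameters.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Illustrative single-layer regressor for (gamma, beta) from the condition c.
        self.mlp = nn.Linear(dim, 2 * dim)

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # z: (batch, n_tokens, dim) noised latent tokens; c: (batch, dim) condition embedding
        gamma, beta = self.mlp(c).chunk(2, dim=-1)
        return self.norm(z) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

# Toy usage
ada = AdaLN(dim=8)
x = ada(torch.randn(2, 4, 8), torch.randn(2, 8))  # becomes the input of the subsequent linear layer
print(x.shape)  # torch.Size([2, 4, 8])
```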

Despite their effectiveness, DiTs require extensive computational resources to generate high-quality images, which impedes their real-world deployment. In this paper, we devise a model quantization method for DiTs that reduces both time and memory consumption without necessitating re-training the original models, offering a robust and practical solution for enhancing the efficiency of DiTs.

2.2 Model Quantization

Model quantization is a compression technique that improves the inference efficiency of deep learning models by transforming full-precision tensors into $b$-bit integer approximations, leading to direct computational acceleration and memory saving [33, 62, 8, 28, 19, 64]. Formally, the quantization process can be defined as:

$Q(\mathbf{x})=\text{clamp}\left(\left\lfloor\frac{\mathbf{x}}{\bm{\delta}}\right\rceil+\bm{\lambda},\;0,\;2^{b}-1\right),$ (2)

where $\mathbf{x}$ denotes the full-precision tensor, $\lfloor\cdot\rceil$ is the round-to-nearest operator [32], and the clamp function restricts the quantized value within the range of $[0,2^{b}-1]$. Here, $\bm{\delta}$ and $\bm{\lambda}$ are quantization parameters subject to optimization. Among various quantization methods, Post-training Quantization (PTQ) is a dominant approach for large quantized models, as it circumvents the substantial resources required for model re-training [20, 52, 25, 44, 15]. PTQ employs a small calibration dataset to optimize quantization parameters, which aims to reduce the performance gap between the quantized models and their full-precision counterparts with minimal data and computational expenses.
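
As a reference point, here is a minimal sketch of the uniform quantizer of Eq. (2) together with the corresponding de-quantization, using a simple min-max initialization of (δ, λ); in PTQ these parameters would subsequently be optimized on the calibration set. The function names are illustrative assumptions.

```python
import torch

def quantize(x: torch.Tensor, delta: torch.Tensor, lam: torch.Tensor, b: int = 8) -> torch.Tensor:
    """Uniform quantizer of Eq. (2): scale, round to nearest, shift by zero-point, clamp to [0, 2^b - 1]."""
    return torch.clamp(torch.round(x / delta) + lam, 0, 2 ** b - 1)

def dequantize(q: torch.Tensor, delta: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Map integer levels back onto the real-valued grid."""
    return (q - lam) * delta

# Per-tensor min/max initialization of (delta, lambda); a PTQ method would further refine
# these parameters on the calibration set to minimize the output discrepancy.
b = 8
x = torch.randn(4, 16)
delta = (x.max() - x.min()) / (2 ** b - 1)
lam = torch.round(-x.min() / delta)
x_hat = dequantize(quantize(x, delta, lam, b), delta, lam)
print((x - x_hat).abs().max())  # worst-case quantization error
```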

PTQ has been effectively applied to a wide range of neural networks, including CNNs [20, 52, 25], Language Transformers [8, 57, 24, 23, 27], Vision Transformers [62, 13, 21, 29], and U-Net-based Diffusion models [44, 18, 51, 50]. Despite its demonstrated success, PTQ’s applicability to Diffusion Transformers (DiTs) remains unexplored, presenting a significant open challenge within the research community. To bridge this gap, our work delves into the unique challenges of quantizing DiTs and introduces the first PTQ method for DiTs that can faithfully preserve their generation performance.

Figure 3: Illustration of maximal absolute magnitudes of activation (left) and weight (right) channels in a DiT linear layer, alongside their corresponding quantization error (MSE). Channels with greater maximal absolute values tend to incur larger errors, presenting a fundamental quantization difficulty.

3 Diffusion Transformer Quantization Challenges

Diffusion Transformers (DiTs) diverge from conventional generative or discriminative models [39, 11] through their unique design. Specifically, DiTs are constructed with a series of large transformer blocks and operate under a multi-timestep paradigm to progressively transform pure noise into images. Our analysis reveals complex distribution patterns and temporal dynamics in the inference process of DiTs, identifying two primary challenges that prevent effective DiT quantization.

Pronounced Quantization Error in Salient Channels. The first challenge lies in systematic quantization errors in DiT’s linear layers. As shown in Figure 3, activation and weight channels with significantly high absolute values are prone to substantial errors after quantization. We term these salient channels, characterized by extreme values that greatly exceed the typical range of magnitudes. Upon uniform quantization (Eq. (2)), it is often necessary to truncate these extreme values in order to maintain the precision of the broader set of standard channels. This compromise can result in notable deviations from the original full-precision distribution as the sampling process proceeds, especially given DiT’s layered architecture and repetitive inference paradigm.

Figure 4: Boxplot of maximal absolute magnitudes of activation channels in a linear layer within DiT over different timesteps, which exhibit significant temporal variations.

Temporal Variation in Salient Activation. Another challenge of DiT quantization arises from temporal variations in the magnitudes of salient activation channels. Rather than static inputs, DiTs operate across a sequence of timesteps to generate high-quality images from random noise. Consequently, activation distributions can vary drastically within the inference process, which is particularly evident in salient channels that dominate the signal. Figure 4 demonstrates that the distribution of maximal absolute values in activation channels exhibits significant variations over different timesteps. This temporal variability introduces a non-trivial difficulty to quantization optimization: Quantization parameters effective for salient activation channels at one timestep may not be suitable at other timesteps. Such discrepancies can exacerbate quantization errors, cumulatively impairing the generation quality. Therefore, for accurate quantization, it is imperative to capture the evolving trait of salient channels throughout the entire denoising procedure.
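
As an illustration of how such statistics can be gathered, the sketch below records the per-input-channel maximal absolute activation magnitude at each sampled timestep via a forward pre-hook on a linear layer. The layer, the calibration inputs, and the function name are placeholders, not the authors’ implementation.

```python
import torch
import torch.nn as nn

def collect_channel_salience(layer: nn.Linear, inputs_per_timestep):
    """Record the max |X_j| of every input channel for each timestep's activations."""
    salience = []  # one (d_in,) tensor per timestep
    def hook(module, args):
        x = args[0].detach()
        salience.append(x.abs().amax(dim=tuple(range(x.dim() - 1))))
    handle = layer.register_forward_pre_hook(hook)
    with torch.no_grad():
        for x_t in inputs_per_timestep:  # placeholder for activations gathered at sampled timesteps
            layer(x_t)
    handle.remove()
    return torch.stack(salience)         # shape (T, d_in)

# Toy usage with random stand-ins whose scale grows with the timestep index
layer = nn.Linear(16, 32)
fake_inputs = [torch.randn(4, 10, 16) * (1 + t) for t in range(5)]
print(collect_channel_salience(layer, fake_inputs).shape)  # torch.Size([5, 16])
```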

4 PTQ4DiT

To overcome the identified challenges, we propose Channel-wise Salience Balancing (CSB) and Spearman’s ρ-guided Salience Calibration (SSC) in our PTQ4DiT in Sections 4.1 and 4.2, respectively. Subsequently, we devise a re-parameterization scheme in Section 4.3, eliminating extra computational demands of PTQ4DiT during inference while maintaining the mathematical equivalence.

4.1 Channel-wise Salience Balancing

A linear layer $f(\cdot;\mathbf{W})$ within MHSA and PF modules typically takes a token sequence $\mathbf{X}\in\mathbb{R}^{n\times d_{in}}$ as input and performs a linear transformation with its weight matrix $\mathbf{W}\in\mathbb{R}^{d_{in}\times d_{out}}$, formulated as $f(\mathbf{X};\mathbf{W})=\mathbf{X}\cdot\mathbf{W}$, where $n$ is the sequence length, and $d_{in}$ and $d_{out}$ denote the input and output dimensions, respectively. As discussed in Section 3, both the activation $\mathbf{X}$ and the weight matrix $\mathbf{W}$ exhibit salient channels that possess elements with significantly greater absolute magnitudes, which lead to large post-quantization errors.

Fortunately, large values do not coincide in the same channels of activation and weight, so these extremes do not amplify each other, as observed in Figure 3. This property suggests the feasibility of complementarily redistributing the large magnitudes in salient channels between activation and weight, thereby alleviating quantization difficulties for both. Inspired by previous works on large model compression [54, 45, 61, 23], we propose Channel-wise Salience Balancing (CSB), which employs diagonal Salience Balancing Matrices $\mathbf{B}^{\mathbf{X}}$ and $\mathbf{B}^{\mathbf{W}}$ to adjust the channel-wise distribution of activation and weight, as expressed by:

$\mathbf{\widetilde{X}}=\mathbf{X}\mathbf{B}^{\mathbf{X}},\quad\mathbf{\widetilde{W}}=\mathbf{B}^{\mathbf{W}}\mathbf{W}.$ (3)

To address the quantization difficulties, we need to achieve balanced distributions in $\mathbf{\widetilde{X}}$ and $\mathbf{\widetilde{W}}$, which requires $\mathbf{B}^{\mathbf{X}}$ and $\mathbf{B}^{\mathbf{W}}$ to capture the characteristics of salient channels. Considering that the quantization error is significantly influenced by the range of distributions [33, 57, 26], we measure the salience $s$ of an activation or weight channel as the maximal absolute value among its elements:

$s(\mathbf{X}_{j})=\max(\lvert\mathbf{X}_{j}\rvert),\quad s(\mathbf{W}_{j})=\max(\lvert\mathbf{W}_{j}\rvert),\quad\text{where}\quad j=1,2,\ldots,d_{in}.$ (4)

Here, $j$ is the channel index. Consequently, the balanced salience $\widetilde{s}$, representing the equilibrium between activation and weight channels, can be quantified using the geometric mean. Specifically, for the $j$-th channel, the balanced salience is calculated as follows:

$\widetilde{s}(\mathbf{X}_{j},\mathbf{W}_{j})=\left(s(\mathbf{X}_{j})\cdot s(\mathbf{W}_{j})\right)^{\frac{1}{2}}.$ (5)

Building on these concepts, we proceed to construct the Salience Balancing Matrices, which modulate the salience of activations and weights with the guidance of $\widetilde{s}$:

$\mathbf{B}^{\mathbf{X}}=\text{diag}\left(\frac{\widetilde{s}(\mathbf{X}_{1},\mathbf{W}_{1})}{s(\mathbf{X}_{1})},\frac{\widetilde{s}(\mathbf{X}_{2},\mathbf{W}_{2})}{s(\mathbf{X}_{2})},\ldots,\frac{\widetilde{s}(\mathbf{X}_{d_{in}},\mathbf{W}_{d_{in}})}{s(\mathbf{X}_{d_{in}})}\right),$ (6)
$\mathbf{B}^{\mathbf{W}}=\text{diag}\left(\frac{\widetilde{s}(\mathbf{X}_{1},\mathbf{W}_{1})}{s(\mathbf{W}_{1})},\frac{\widetilde{s}(\mathbf{X}_{2},\mathbf{W}_{2})}{s(\mathbf{W}_{2})},\ldots,\frac{\widetilde{s}(\mathbf{X}_{d_{in}},\mathbf{W}_{d_{in}})}{s(\mathbf{W}_{d_{in}})}\right).$ (7)

Following these, the balancing transformation defined by Eq. (3) will result in a complementary redistribution of channel salience between activations and weights. Specifically, for each channel $j$, we have $s(\mathbf{\widetilde{X}}_{j})=s(\mathbf{\widetilde{W}}_{j})=\widetilde{s}(\mathbf{X}_{j},\mathbf{W}_{j})$, thereby alleviating the quantization difficulties, as demonstrated by the reduction in overall channel salience:

$\max(s_{o}(\mathbf{\widetilde{X}}),s_{o}(\mathbf{\widetilde{W}}))\leq\max(s_{o}(\mathbf{X}),s_{o}(\mathbf{W})).$ (8)

Here, we characterize the overall salience $s_{o}$ of activations or weights using the maximum salience across channels, e.g., $s_{o}(\mathbf{X})=\max(s(\mathbf{X}_{1}),s(\mathbf{X}_{2}),\ldots,s(\mathbf{X}_{d_{in}}))$, which reflects the distribution range of elements that are quantized collectively under certain granularity.
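
For concreteness, a minimal sketch of Eqs. (4)–(7) is given below for a single batch of activations, together with a toy check of the salience reduction in Eq. (8). Tensor shapes and the small epsilon guard are illustrative assumptions, not the authors’ implementation.

```python
import torch

def salience(t: torch.Tensor) -> torch.Tensor:
    """Per-input-channel salience of Eq. (4): max absolute value over all but the last axis."""
    return t.abs().amax(dim=tuple(range(t.dim() - 1)))

def csb_matrices(x: torch.Tensor, w: torch.Tensor, eps: float = 1e-8):
    """Diagonals of the Salience Balancing Matrices B^X and B^W (Eqs. (5)-(7))."""
    s_x = salience(x)              # activations X: (..., d_in) -> (d_in,)
    s_w = w.abs().amax(dim=1)      # weights W: (d_in, d_out) -> per input-channel max, (d_in,)
    s_bal = (s_x * s_w).sqrt()     # geometric mean, Eq. (5)
    return s_bal / (s_x + eps), s_bal / (s_w + eps)

# Toy check of Eq. (8): balancing does not increase the larger of the two overall saliences
x = torch.randn(8, 16) * torch.linspace(0.1, 20.0, 16)  # a few "salient" activation channels
w = torch.randn(16, 32) * 0.05
b_x, b_w = csb_matrices(x, w)
x_bal, w_bal = x * b_x, w * b_w.unsqueeze(1)            # X~ = X B^X,  W~ = B^W W
before = torch.max(salience(x).max(), w.abs().amax(dim=1).max())
after = torch.max(salience(x_bal).max(), w_bal.abs().amax(dim=1).max())
assert after <= before + 1e-5
print(before.item(), after.item())
```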

4.2 Spearman’s ρ-guided Salience Calibration

Diffusion Transformers (DiTs) utilize an iterative denoising process for image sampling [37]. Under this sequential paradigm, the linear layer $f$ receives inputs from an activation sequence $\mathbf{X}^{(1:T)}=(\mathbf{X}^{(1)},\mathbf{X}^{(2)},\ldots,\mathbf{X}^{(T)})$, which encompasses $T$ timesteps. Targeting a certain timestep $t$, the salience of all activation and weight channels can be evaluated using Eq. (4):

$\mathbf{s}(\mathbf{X}^{(t)})=(s(\mathbf{X}^{(t)}_{1}),s(\mathbf{X}^{(t)}_{2}),\ldots,s(\mathbf{X}^{(t)}_{d_{in}})),\quad\mathbf{s}(\mathbf{W})=(s(\mathbf{W}_{1}),s(\mathbf{W}_{2}),\ldots,s(\mathbf{W}_{d_{in}})).$ (9)

While $\mathbf{s}(\mathbf{W})$ remains consistent, we find that $\{\mathbf{s}(\mathbf{X}^{(t)})\}_{t=1}^{T}$ exhibits significant temporal variations during the process of transforming purely random noise into high-quality images, as demonstrated in Figure 4. These fluctuations diminish the effectiveness of our CSB since quantization errors can be exacerbated by the biased estimation of activation salience among timesteps, resulting in degraded generation quality of the quantized models.

To accurately gauge the activation channel salience under multi-timestep scenarios, we propose Spearman’s ρ-guided Salience Calibration (SSC). This offers a comprehensive evaluation of activation salience, with enhanced focus allocated to the timesteps where the complementarity property is more significant, facilitating effective salience balancing between activation and weight channels. Essentially, the lower the correlation between activation salience $\mathbf{s}(\mathbf{X}^{(t)})$ and weight salience $\mathbf{s}(\mathbf{W})$, the greater the reduction effect in overall channel salience (Eq. (8)). The intuition of SSC is visualized in Figure 2 (Right). Mathematically, we formulate the Spearman’s ρ-calibrated Temporal Salience $\mathbf{s}_{\rho}$ by selectively aggregating the activation salience along timesteps:

$\mathbf{s}_{\rho}(\mathbf{X}^{(1:T)})=(\eta_{1},\eta_{2},\ldots,\eta_{T})\cdot(\mathbf{s}(\mathbf{X}^{(1)}),\mathbf{s}(\mathbf{X}^{(2)}),\ldots,\mathbf{s}(\mathbf{X}^{(T)}))^{\text{T}}\in\mathbb{R}^{d_{in}},$ (10)

where the weighting factors $\{\eta_{t}\}_{t=1}^{T}$ are derived from a normalized exponential of the negated Spearman’s ρ statistic [47, 55, 56]:

$\eta_{t}=\frac{\exp[-\rho(\mathbf{s}(\mathbf{X}^{(t)}),\mathbf{s}(\mathbf{W}))]}{\sum_{\tau=1}^{T}\exp[-\rho(\mathbf{s}(\mathbf{X}^{(\tau)}),\mathbf{s}(\mathbf{W}))]}.$ (11)

Here, $\rho(\cdot,\cdot)$ computes the correlation between two sequences, and $\eta_{t}$ serves as the weighting factor for activation salience at timestep $t$. In this method, $\eta_{t}$ inversely reflects the correlation coefficient $\rho(\mathbf{s}(\mathbf{X}^{(t)}),\mathbf{s}(\mathbf{W}))$, thereby prioritizing timesteps where there is a higher degree of complementarity in salience between activations and weights. Subsequently, we utilize $\mathbf{s}_{\rho}$ for activation salience in Eqs. (5), (6), and (7), yielding refined Salience Balancing Matrices, denoted as $\mathbf{B}^{\mathbf{X}}_{\rho}$ and $\mathbf{B}^{\mathbf{W}}_{\rho}$. By applying SSC, we calibrate the activation salience within CSB to strategically account for the temporal variations during the denoising process. Appendix B presents the full Algorithm for PTQ4DiT.
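
A minimal sketch of Eqs. (9)–(11) is shown below. Spearman’s ρ is computed from ranks directly in PyTorch (ties are ignored for brevity), and the per-timestep salience vectors are aggregated with the softmax weights of Eq. (11); all names and shapes are illustrative assumptions, not the authors’ implementation.

```python
import torch

def spearman_rho(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Spearman's rank correlation: Pearson correlation computed on ranks (ties ignored)."""
    ra = a.argsort().argsort().float()
    rb = b.argsort().argsort().float()
    ra, rb = ra - ra.mean(), rb - rb.mean()
    return (ra * rb).sum() / (ra.norm() * rb.norm())

def temporal_salience(s_x_per_t: torch.Tensor, s_w: torch.Tensor) -> torch.Tensor:
    """Spearman's rho-calibrated Temporal Salience of Eqs. (10)-(11).

    s_x_per_t: (T, d_in) activation salience per timestep; s_w: (d_in,) weight salience.
    """
    rho = torch.stack([spearman_rho(s_t, s_w) for s_t in s_x_per_t])
    eta = torch.softmax(-rho, dim=0)   # larger weight for low-correlation (complementary) timesteps
    return eta @ s_x_per_t             # (d_in,) aggregated salience s_rho

# Toy usage with salience statistics as they would be gathered during calibration
T, d_in = 5, 16
s_rho = temporal_salience(torch.rand(T, d_in) * 10, torch.rand(d_in))
print(s_rho.shape)  # torch.Size([16])
```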

4.3 Re-Parameterization

Before quantization, we estimate $\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{B}_{\rho}^{\mathbf{W}}$ on a small calibration dataset generated from multiple timesteps. Then, we incorporate these matrices into the linear layers within MHSA and PF modules [37] to alleviate the quantization difficulty. Given that $\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{B}_{\rho}^{\mathbf{W}}$ are mutual inverses, this incorporation maintains mathematical equivalence to the original linear layer $f$:

$\mathbf{\widetilde{X}}\cdot\mathbf{\widetilde{W}}=(\mathbf{X}\mathbf{B}_{\rho}^{\mathbf{X}})\cdot(\mathbf{B}_{\rho}^{\mathbf{W}}\mathbf{W})=\mathbf{X}\cdot\mathbf{W}.$ (12)

The proof is provided in Appendix C. Furthermore, we design a re-parameterization scheme for DiTs, allowing for obtaining $\mathbf{\widetilde{X}}$ and $\mathbf{\widetilde{W}}$ without extra computational burden during inference. Specifically, we update the weight matrix of the linear layer $f$ to $\mathbf{\widetilde{W}}$ offline and seamlessly integrate $\mathbf{B}_{\rho}^{\mathbf{X}}$ into the preceding linear transformation operations. This integration includes adaptations to adaLN [38, 37] and matrix multiplications within attention mechanisms [49]. Appendix A discusses these adaptations.

Post-adaLN. For linear layers following the adaLN module, we integrate $\mathbf{B}_{\rho}^{\mathbf{X}}$ by adjusting the scale and shift parameters ($\bm{\gamma},\bm{\beta}\in\mathbb{R}^{d_{in}}$) within adaLN:

$\mathbf{\widetilde{X}}=\widetilde{\text{adaLN}}(\mathbf{Z})=\text{LN}(\mathbf{Z})\odot(\mathbf{B}_{\rho}^{\mathbf{X}}+\widetilde{\bm{\gamma}})+\widetilde{\bm{\beta}},\quad\text{where}\quad\widetilde{\bm{\gamma}}=\bm{\gamma}\mathbf{B}_{\rho}^{\mathbf{X}},\quad\widetilde{\bm{\beta}}=\bm{\beta}\mathbf{B}_{\rho}^{\mathbf{X}}.$ (13)

Equivalently, we fuse $\mathbf{B}_{\rho}^{\mathbf{X}}$ into the MLPs responsible for regressing these parameters, thus avoiding additional computation overhead at inference time. Detailed derivations are provided in Appendix D.
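
A minimal numerical sketch of this fusion is given below, under simplifying assumptions (separate single-linear regressors for γ and β, no gating branch, a bias-free subsequent linear layer). It folds the diagonal of B_ρ^W into the layer’s weight and that of B_ρ^X into the regressors, then checks that the fused path matches the original one, in the spirit of Eqs. (12), (13), and (20); it is not the authors’ implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, n = 16, 32, 10

# Illustrative diagonal balancing factors: diag of B_rho^X, with B_rho^W its inverse
b_x = torch.rand(d_in) + 0.5
b_w = 1.0 / b_x

mlp_gamma, mlp_beta = nn.Linear(d_in, d_in), nn.Linear(d_in, d_in)  # simplified adaLN regressors
fc = nn.Linear(d_in, d_out, bias=False)                             # linear layer f after adaLN
z, c = torch.randn(1, n, d_in), torch.randn(1, d_in)

with torch.no_grad():
    # Reference path: adaLN of Eq. (1) followed by f
    gamma, beta = mlp_gamma(c).unsqueeze(1), mlp_beta(c).unsqueeze(1)
    ref = fc(nn.functional.layer_norm(z, (d_in,)) * (1 + gamma) + beta)

    # Offline re-parameterization:
    fc.weight.mul_(b_w)                        # W~ = B_rho^W W (scales fc's input channels)
    for head in (mlp_gamma, mlp_beta):         # Eq. (20): fold B_rho^X into the regressors
        head.weight.mul_(b_x.unsqueeze(1))     # scales the regressors' output channels
        head.bias.mul_(b_x)

    # Modified adaLN of Eq. (13): LN(Z) * (B_rho^X + gamma~) + beta~, then the updated f
    gamma_t, beta_t = mlp_gamma(c).unsqueeze(1), mlp_beta(c).unsqueeze(1)
    fused = fc(nn.functional.layer_norm(z, (d_in,)) * (b_x + gamma_t) + beta_t)

print(torch.allclose(ref, fused, atol=1e-5))   # True: mathematically equivalent (Eq. (12))
```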

Post-Matrix-Multiplication. For linear layers after matrix multiplication, the effect of PTQ4DiT can be realized by directly absorbing the Salience Balancing Matrices into the preceding de-quantization functions associated with the matrix multiplication [12, 53, 61].

Figure 5: Random samples generated by PTQ4DiT and two strong baselines: RepQ* [21] and Q-Diffusion [18], with W4A8 quantization on ImageNet 512×512 and 256×256. Our method can produce high-quality images with finer details. Appendix E presents more visualization results.

5 Experiments

5.1 Experimental Settings

Our experimental setup is similar to the original study of Diffusion Transformers (DiTs) [37]. We evaluate PTQ4DiT on the ImageNet dataset [41], using pre-trained class-conditional DiT-XL/2 models [37] at image resolutions of 256×256 and 512×512. The DDPM solver [17] with 250 sampling steps is employed for the generation process. To further assess the robustness of our method, we conduct additional experiments with reduced sampling steps of 100 and 50.

For fair benchmarking, all methods utilize uniform quantizers for all activations and weights, with channel-wise quantization for weights and tensor-wise for activations, unless specified otherwise. To construct the calibration set, we uniformly select 25 timesteps for 256-resolution experiments and 10 timesteps for 512-resolution experiments, generating 32 samples at each selected timestep. The optimization of quantization parameters follows the implementation from Q-Diffusion [18]. Our code is based on PyTorch [36], and all experiments are conducted on NVIDIA RTX A6000 GPUs.
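
As an illustration, the snippet below builds the list of uniformly spaced calibration timesteps described above; the exact indexing convention of the sampler is an assumption.

```python
import torch

# Uniformly spaced calibration timesteps: 25 of the 250 sampling steps for 256x256
# (10 for 512x512), with 32 calibration samples generated at each selected step.
T, n_calib = 250, 25
calib_timesteps = torch.linspace(0, T - 1, n_calib).round().long()
print(calib_timesteps.tolist())
```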

To comprehensively assess generated image quality, we employ four metrics: Fréchet Inception Distance (FID) [16], spatial FID (sFID) [42, 34], Inception Score (IS) [42, 3], and Precision, all computed using the ADM toolkit [10]. For all methods under evaluation, including the full-precision (FP) models, we sample 10,000 images for ImageNet 256×256, and 5,000 for ImageNet 512×512, consistent with conventions from prior studies [35, 44].

5.2 Quantization Performance

We present a comprehensive assessment of our PTQ4DiT against prevalent baseline methods in various settings. Our evaluation focuses on mainstream Post-training Quantization (PTQ) methods that are widely used and adaptable to DiTs, including PTQ4DM [44], Q-Diffusion [18], and PTQD [15].

Table 1: Performance comparison on ImageNet 256×256. ‘(W/A)’ indicates that the precision of weights and activations are W and A bits, respectively.

| Timesteps | Bit-width (W/A) | Method | Size (MB) | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ |
|---|---|---|---|---|---|---|---|
| 250 | 32/32 | FP | 2575.42 | 4.53 | 17.93 | 278.50 | 0.8231 |
| 250 | 8/8 | PTQ4DM | 645.72 | 21.65 | 100.14 | 134.22 | 0.6342 |
| 250 | 8/8 | Q-Diffusion | 645.72 | 5.57 | 18.22 | 227.50 | 0.7612 |
| 250 | 8/8 | PTQD | 645.72 | 5.69 | 18.42 | 224.26 | 0.7594 |
| 250 | 8/8 | RepQ* | 645.72 | 4.51 | 18.01 | 264.68 | 0.8076 |
| 250 | 8/8 | Ours | 645.72 | 4.63 | 17.72 | 274.86 | 0.8299 |
| 250 | 4/8 | PTQ4DM | 323.79 | 72.58 | 52.39 | 35.79 | 0.2642 |
| 250 | 4/8 | Q-Diffusion | 323.79 | 15.31 | 26.04 | 134.71 | 0.6194 |
| 250 | 4/8 | PTQD | 323.79 | 16.45 | 22.29 | 130.45 | 0.6111 |
| 250 | 4/8 | RepQ* | 323.79 | 23.21 | 28.58 | 104.28 | 0.4640 |
| 250 | 4/8 | Ours | 323.79 | 7.09 | 23.23 | 201.91 | 0.7217 |
| 100 | 32/32 | FP | 2575.42 | 5.00 | 19.02 | 274.78 | 0.8149 |
| 100 | 8/8 | PTQ4DM | 645.72 | 15.36 | 79.31 | 172.37 | 0.6926 |
| 100 | 8/8 | Q-Diffusion | 645.72 | 7.93 | 19.46 | 202.84 | 0.7299 |
| 100 | 8/8 | PTQD | 645.72 | 8.12 | 19.64 | 199.00 | 0.7295 |
| 100 | 8/8 | RepQ* | 645.72 | 5.20 | 19.87 | 254.70 | 0.7929 |
| 100 | 8/8 | Ours | 645.72 | 4.73 | 17.83 | 277.27 | 0.8270 |
| 100 | 4/8 | PTQ4DM | 323.79 | 89.78 | 57.20 | 26.02 | 0.2146 |
| 100 | 4/8 | Q-Diffusion | 323.79 | 54.95 | 36.13 | 42.80 | 0.3846 |
| 100 | 4/8 | PTQD | 323.79 | 55.96 | 37.24 | 42.87 | 0.3948 |
| 100 | 4/8 | RepQ* | 323.79 | 26.64 | 29.42 | 91.39 | 0.4347 |
| 100 | 4/8 | Ours | 323.79 | 7.75 | 22.01 | 190.38 | 0.7292 |
| 50 | 32/32 | FP | 2575.42 | 6.02 | 21.77 | 246.24 | 0.7812 |
| 50 | 8/8 | PTQ4DM | 645.72 | 17.52 | 84.28 | 154.08 | 0.6574 |
| 50 | 8/8 | Q-Diffusion | 645.72 | 14.61 | 27.57 | 153.01 | 0.6601 |
| 50 | 8/8 | PTQD | 645.72 | 15.21 | 27.52 | 151.60 | 0.6578 |
| 50 | 8/8 | RepQ* | 645.72 | 7.17 | 23.67 | 224.83 | 0.7496 |
| 50 | 8/8 | Ours | 645.72 | 5.45 | 19.50 | 250.68 | 0.7882 |
| 50 | 4/8 | PTQ4DM | 323.79 | 102.52 | 58.66 | 19.29 | 0.1710 |
| 50 | 4/8 | Q-Diffusion | 323.79 | 22.89 | 29.49 | 109.22 | 0.5752 |
| 50 | 4/8 | PTQD | 323.79 | 25.62 | 29.77 | 104.28 | 0.5667 |
| 50 | 4/8 | RepQ* | 323.79 | 31.39 | 30.77 | 80.64 | 0.4091 |
| 50 | 4/8 | Ours | 323.79 | 9.17 | 24.29 | 179.95 | 0.7052 |

Table 2: Performance on ImageNet 512×512 with W4A8.

| Timesteps | Method | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ |
|---|---|---|---|---|---|
| 250 | FP | 8.39 | 36.25 | 257.06 | 0.8426 |
| 250 | PTQ4DM | 68.43 | 57.76 | 35.16 | 0.4712 |
| 250 | Q-Diffusion | 58.81 | 56.75 | 31.29 | 0.4878 |
| 250 | PTQD | 87.53 | 74.55 | 34.40 | 0.5144 |
| 250 | RepQ* | 59.65 | 73.71 | 33.19 | 0.3676 |
| 250 | Ours | 17.55 | 46.92 | 123.49 | 0.7592 |
| 100 | FP | 9.06 | 37.58 | 239.03 | 0.8300 |
| 100 | PTQ4DM | 70.63 | 57.73 | 33.82 | 0.4574 |
| 100 | Q-Diffusion | 62.05 | 57.02 | 29.52 | 0.4786 |
| 100 | PTQD | 81.17 | 66.58 | 35.67 | 0.5166 |
| 100 | RepQ* | 62.70 | 73.29 | 31.44 | 0.3606 |
| 100 | Ours | 19.00 | 50.71 | 121.35 | 0.7514 |
| 50 | FP | 11.28 | 41.70 | 213.86 | 0.8100 |
| 50 | PTQ4DM | 71.69 | 59.10 | 33.77 | 0.4604 |
| 50 | Q-Diffusion | 53.49 | 50.27 | 38.99 | 0.5430 |
| 50 | PTQD | 73.45 | 59.14 | 39.63 | 0.5508 |
| 50 | RepQ* | 65.92 | 74.19 | 30.92 | 0.3542 |
| 50 | Ours | 19.71 | 52.27 | 118.32 | 0.7336 |

We reimplement these methods to suit the unique structure of DiTs. Considering the architectural similarity between DiTs and ViTs [11], our analysis also includes RepQ-ViT [21], the state-of-the-art PTQ method initially designed for ViTs. We enhance RepQ-ViT (denoted as RepQ*) by extending the calibration set to integrate temporal dynamics and customizing its advanced channel-wise and log√2 quantizers specifically for DiTs.

Tables 1 and 2 report the outcomes on large-scale class-conditional image generation for ImageNet 256×256 and 512×512, respectively. Table 1 demonstrates the effectiveness of PTQ4DiT across various quantization settings and timesteps. Notably, our findings reveal that at 8-bit precision (W8A8), PTQ4DiT closely matches the generative capabilities of the FP models, whereas most baseline methods experience significant performance losses. At the more stringent 4-bit weight precision (W4A8), all baseline methods exhibit more considerable degradation. For instance, under 250 timesteps, PTQ4DM [44] sees a drastic FID increase of 68.05. In contrast, our PTQ4DiT only incurs a slight increase of 2.56. This resilience remains evident as the number of timesteps decreases, underscoring the robustness of PTQ4DiT in resource-limited environments. Moreover, PTQ4DiT markedly outperforms mainstream methods at the higher 512×512 resolution, further validating its superiority. For example, using 250 timesteps, PTQ4DiT substantially lowers FID by 41.26 and sFID by 9.83 over the second-best method, Q-Diffusion. Figure 6 depicts the efficiency-vs-efficacy trade-off on W8A8 across various timestep configurations. Our PTQ4DiT achieves comparable performance levels to FP models but with considerably reduced computational costs, offering a viable alternative for high-quality image generation. Figures 5, 8, and 9 also present randomly generated images for visual comparisons, highlighting PTQ4DiT’s ability to produce images of superior quality.

Figure 6: Quantization performance on W8A8. The circle size represents the computational load (in Gflops).

5.3 Ablation Study

To verify the efficacy of CSB and SSC, we conduct an ablative study on the challenging W4A8 quantization. Experiments are performed on ImageNet 256×256 using 250 sampling timesteps. Three method variants are considered in our ablation: (i) Baseline, which applies basic linear quantization on DiTs, (ii) Baseline + CSB, which integrates CSB in the linear layers within MHSA and PF modules, where the Salience Balancing Matrices $\mathbf{B}^{\mathbf{X}}$ and $\mathbf{B}^{\mathbf{W}}$ are estimated based on distributions at the midpoint timestep $\frac{T}{2}$, and (iii) Baseline + CSB + SSC, which is the complete PTQ4DiT. Results detailed in Table 3 indicate that each proposed component improves the performance, validating their effectiveness. Particularly, CSB improves upon the Baseline by a large margin, decreasing FID by 14.37 and sFID by 2.35, suggesting its critical role in alleviating the severe quantization difficulties inherent in DiTs. Note that with the addition of CSB, our method surpasses Q-Diffusion [18], a leading PTQ method for diffusion models. Moreover, integrating SSC further boosts our PTQ4DiT towards state-of-the-art performance, facilitating high-quality image generation at W4A8 precision, as shown in Figure 5.

Table 3: Ablation study on ImageNet 256×256 with W4A8.

| Method | Size (MB) | FID ↓ | sFID ↓ | IS ↑ | Precision ↑ |
|---|---|---|---|---|---|
| FP | 2575.42 | 4.53 | 17.93 | 278.50 | 0.8231 |
| Q-Diffusion | 323.79 | 15.31 | 26.04 | 134.71 | 0.6194 |
| Baseline | 323.79 | 22.54 | 27.31 | 105.55 | 0.4791 |
| + CSB | 323.79 | 8.17 | 24.96 | 187.94 | 0.7183 |
| + CSB + SSC (Ours) | 323.79 | 7.09 | 23.23 | 201.91 | 0.7217 |

6 Conclusion

This paper proposes PTQ4DiT, a novel Post-training Quantization (PTQ) method for Diffusion Transformers (DiTs). Our analysis identifies the primary challenges in effective DiT quantization: the pronounced quantization errors incurred by salient channels with extreme magnitudes and the temporal variability in salient activation. To address these challenges, we design Channel-wise Salience Balancing (CSB) and Spearman’s ρ-guided Salience Calibration (SSC). Specifically, CSB utilizes the complementary nature of salient channels to redistribute the extremes within activations and weights toward the balanced salience. SSC dynamically adjusts salience evaluations across different timesteps, prioritizing timesteps where salient activation and weight channels exhibit significant complementarity, thereby mitigating overall quantization difficulties. To avoid extra computational costs of PTQ4DiT, we also devise a re-parameterization strategy for efficient inference. Experiments show that our PTQ4DiT can effectively quantize DiTs to 8-bit precision (W8A8) and further advance to 4-bit weight (W4A8) while maintaining high-quality image generation capabilities.

Acknowledgements. This research is supported by NSF IIS-2309073 and ECCS-2123521. This article solely reflects the opinions and conclusions of the authors, not the funding agencies.

References

  • [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [2] Fan Bao, Chongxuan Li, Yue Cao, and Jun Zhu. All are worth words: a vit backbone for score-based diffusion models. In NeurIPSW, 2022.
  • [3] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
  • [4] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024.
  • [5] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [6] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In ICLR, 2024.
  • [7] Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE TPAMI, 2023.
  • [8] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In NeurIPS, 2022.
  • [9] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, volume 34, pages 8780–8794, 2021.
  • [10] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, volume 34, pages 8780–8794, 2021.
  • [11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [12] Steven K Esser, Jeffrey L McKinstry, Deepika Bablani, Rathinakumar Appuswamy, and Dharmendra S Modha. Learned step size quantization. In ICLR, 2020.
  • [13] Natalia Frumkin, Dibakar Gope, and Diana Marculescu. Jumping through local minima: Quantization in the loss landscape of vision transformers. In ICCV, pages 16978–16988, 2023.
  • [14] Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. In ICCV, pages 23164–23173, 2023.
  • [15] Yefei He, Luping Liu, Jing Liu, Weijia Wu, Hong Zhou, and Bohan Zhuang. Ptqd: Accurate post-training quantization for diffusion models. In NeurIPS, volume 36, 2023.
  • [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, volume 30, 2017.
  • [17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, volume 33, pages 6840–6851, 2020.
  • [18] Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q-diffusion: Quantizing diffusion models. In ICCV, pages 17535–17545, 2023.
  • [19] Yanjing Li, Sheng Xu, Xianbin Cao, Xiao Sun, and Baochang Zhang. Q-dm: An efficient low-bit quantized diffusion model. In NeurIPS, volume 36, 2023.
  • [20] Yuhang Li, Ruihao Gong, Xu Tan, Yang Yang, Peng Hu, Qi Zhang, Fengwei Yu, Wei Wang, and Shi Gu. Brecq: Pushing the limit of post-training quantization by block reconstruction. In ICLR, 2021.
  • [21] Zhikai Li, Junrui Xiao, Lianwei Yang, and Qingyi Gu. Repq-vit: Scale reparameterization for post-training quantization of vision transformers. In ICCV, pages 17227–17236, 2023.
  • [22] Haokun Lin, Haoli Bai, Zhili Liu, Lu Hou, Muyi Sun, Linqi Song, Ying Wei, and Zhenan Sun. Mope-clip: Structured pruning for efficient vision-language models with module-wise pruning error metric. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27370–27380, 2024.
  • [23] Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. arXiv preprint arXiv:2406.01721, 2024.
  • [24] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. In MLSys, 2024.
  • [25] Jiawei Liu, Lin Niu, Zhihang Yuan, Dawei Yang, Xinggang Wang, and Wenyu Liu. Pd-quant: Post-training quantization based on prediction difference metric. In CVPR, pages 24427–24437, 2023.
  • [26] Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. Qllm: Accurate and efficient low-bitwidth quantization for large language models. In ICLR, 2024.
  • [27] Ruikang Liu, Haoli Bai, Haokun Lin, Yuening Li, Han Gao, Zhengzhuo Xu, Lu Hou, Jun Yao, and Chun Yuan. Intactkv: Improving large language model quantization by keeping pivot tokens intact. arXiv preprint arXiv:2403.01241, 2024.
  • [28] Shih-Yang Liu, Zechun Liu, and Kwang-Ting Cheng. Oscillation-free quantization for low-bit vision transformers. In ICML, 2023.
  • [29] Yijiang Liu, Huanrui Yang, Zhen Dong, Kurt Keutzer, Li Du, and Shanghang Zhang. Noisyquant: Noisy bias-enhanced post-training activation quantization for vision transformers. In CVPR, pages 20321–20330, 2023.
  • [30] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
  • [31] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • [32] Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In ICML, pages 7197–7206, 2020.
  • [33] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021.
  • [34] Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations. arXiv preprint arXiv:2103.03841, 2021.
  • [35] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In ICML, pages 8162–8171. PMLR, 2021.
  • [36] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, volume 32, 2019.
  • [37] William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, pages 4195–4205, 2023.
  • [38] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.
  • [39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bjorn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • [40] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • [41] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115:211–252, 2015.
  • [42] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In NeurIPS, volume 29, 2016.
  • [43] Yuzhang Shang, Zhihang Yuan, Qiang Wu, and Zhen Dong. Pb-llm: Partially binarized large language models. In ICLR, 2024.
  • [44] Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In CVPR, pages 1972–1981, 2023.
  • [45] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. In ICLR, 2024.
  • [46] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265. PMLR, 2015.
  • [47] Charles Spearman. The proof and measurement of association between two things. 1961.
  • [48] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. Training data-efficient image transformers and distillation through attention. In ICML, 2021.
  • [49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
  • [50] Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, and Jiwen Lu. Towards accurate post-training quantization for diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16026–16035, 2024.
  • [51] Haoxuan Wang, Yuzhang Shang, Zhihang Yuan, Junyi Wu, and Yan Yan. Quest: Low-bit diffusion model quantization via efficient selective finetuning. arXiv preprint arXiv:2402.03666, 2024.
  • [52] Xiuying Wei, Ruihao Gong, Yuhang Li, Xianglong Liu, and Fengwei Yu. Qdrop: Randomly dropping quantization for extremely low-bit post-training quantization. In ICLR, 2022.
  • [53] Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling. In EMNLP, 2023.
  • [54] Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. In NeurIPS, 2022.
  • [55] Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, and Yan Yan. Token transformation matters: Towards faithful post-hoc explanation for vision transformer. In CVPR, 2024.
  • [56] Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, and Yan Yan. On the faithfulness of vision transformer explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10936–10945, 2024.
  • [57] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In ICML, pages 38087–38099. PMLR, 2023.
  • [58] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In NeurIPS, volume 34, pages 12077–12090, 2021.
  • [59] Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Wentao Zhang, Bin Cui, and Ming-Hsuan Yang. Diffusion models: A comprehensive survey of methods and applications. ACM Computing Surveys, 56(4):1–39, 2023.
  • [60] Xiulong Yang, Sheng-Min Shih, Yinlin Fu, Xiaoting Zhao, and Shihao Ji. Your vit is secretly a hybrid discriminative-generative diffusion model. arXiv preprint arXiv:2208.07791, 2022.
  • [61] Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. Asvd: Activation-aware singular value decomposition for compressing large language models. arXiv preprint arXiv:2312.05821, 2023.
  • [62] Zhihang Yuan, Chenhao Xue, Yiqi Chen, Qiang Wu, and Guangyu Sun. Ptq4vit: Post-training quantization for vision transformers with twin uniform quantization. In ECCV, pages 191–207, 2022.
  • [63] Aozhong Zhang, Naigang Wang, Yanxia Deng, Xin Li, Zi Yang, and Penghang Yin. Magr: Weight magnitude reduction for enhancing post-training quantization. In NeurIPS, 2024.
  • [64] Aozhong Zhang, Zi Yang, Naigang Wang, Yingyong Qin, Jack Xin, Xin Li, and Penghang Yin. Comq: A backpropagation-free algorithm for post-training quantization. arXiv preprint arXiv:2403.07134, 2024.
  • [65] Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, et al. Is sora a world simulator? a comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.

Appendix A Structures of MHSA and PF with Adjusted Linear Layers

Figure 7: Illustration of structures of the MHSA and PF modules within DiT Blocks [37]. Our proposed CSB and SSC are embedded in their linear layers, including Projection1, Projection2, and FC1. CSB and SSC collectively mitigate the quantization difficulties by transforming both activations and weights using Salience Balancing Matrices, $\mathbf{B}_{\rho}^{\mathbf{W}}$ and $\mathbf{B}_{\rho}^{\mathbf{X}}$. To prevent extra computational burdens at inference time, $\mathbf{B}_{\rho}^{\mathbf{W}}$ is absorbed into the weight matrix of the linear layer $f$. Meanwhile, $\mathbf{B}_{\rho}^{\mathbf{X}}$ is integrated offline into the MLPs prior to adaLN modules for Projection1 and FC1, and into the preceding matrix multiplication operation for Projection2.

The Multi-Head Self-Attention (MHSA) and Pointwise Feedforward (PF) modules are essential for processing input tokens and conditional information in DiT Blocks [37]. As depicted in Figure 7, we incorporate our Channel-wise Salience Balancing (CSB) and Spearman’s ρ-guided Salience Calibration (SSC) techniques into the linear layers within both modules. These techniques are designed to mitigate the quantization difficulties by dynamically adjusting the salience of activations and weights via Salience Balancing Matrices. Through these adjustments, CSB and SSC allow for more uniform distributions of activation and weight magnitudes across salient channels, reducing the impact of extreme values and enhancing the overall stability of the quantization process.

To eliminate additional computational demands during inference, the Salience Balancing Matrices, $\mathbf{B}_{\rho}^{\mathbf{W}}$ and $\mathbf{B}_{\rho}^{\mathbf{X}}$, are pre-integrated into the DiT Blocks. Specifically, we replace the weight matrix of the linear layer $f$ with $\mathbf{\widetilde{W}}=\mathbf{B}_{\rho}^{\mathbf{W}}\mathbf{W}$ and integrate $\mathbf{B}_{\rho}^{\mathbf{X}}$ into the preceding linear transformations. For Projection1 and FC1, $\mathbf{B}_{\rho}^{\mathbf{X}}$ is absorbed into the MLPs before the adaLN modules, while for Projection2, it can be absorbed within the matrix multiplication [12, 53, 43]. Derivations of the integration are provided in Appendix D.

Appendix B PTQ4DiT Pipeline

Algorithm 1 Post-Training Quantization for Diffusion Transformers (PTQ4DiT)
1:Input: Pre-trained DiT model, activation sequence $\mathbf{X}^{(1:T)}$ from the calibration dataset
2:Output: Quantized DiT model with low-bit activations and weights
3: Preparation:
4:Estimate activation salience $\mathbf{s}(\mathbf{X}^{(t)})$ at each timestep $t$ ▷ Using Eq. (9)
5:Estimate weight salience $\mathbf{s}(\mathbf{W})$ ▷ Using Eq. (9)
6: Spearman’s ρ-guided Salience Calibration:
7:Compute correlation coefficients $\{\rho(\mathbf{s}(\mathbf{X}^{(t)}),\mathbf{s}(\mathbf{W}))\}_{t=1}^{T}$
8:Compute weighting factors $\{\eta_{t}\}_{t=1}^{T}$ ▷ Using Eq. (11)
9:Compute temporal salience $\mathbf{s}_{\rho}(\mathbf{X}^{(1:T)})$ ▷ Using Eq. (10)
10: Channel-wise Salience Balancing:
11:Compute balanced salience $\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{j},\mathbf{W}_{j})$ for each channel $j$ ▷ Using Eqs. (5), (14)
12:Construct refined Salience Balancing Matrices $\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{B}_{\rho}^{\mathbf{W}}$ ▷ Using Eqs. (6), (7)
13: Re-Parameterization:
14:Integrate $\mathbf{B}_{\rho}^{\mathbf{W}}$ into the weight matrix of linear layers offline ▷ By $\mathbf{\widetilde{W}}=\mathbf{B}_{\rho}^{\mathbf{W}}\mathbf{W}$
15:Integrate $\mathbf{B}_{\rho}^{\mathbf{X}}$ into the MLPs before adaLN modules offline ▷ Using Eqs. (13), (20)
16: Quantization:
17:Obtain $\mathbf{\widetilde{X}}=\mathbf{X}\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{\widetilde{W}}=\mathbf{B}_{\rho}^{\mathbf{W}}\mathbf{W}$ without extra computational demand during inference
18:Perform quantization on the balanced activation $\mathbf{\widetilde{X}}$ and weight $\mathbf{\widetilde{W}}$

This section provides a comprehensive description of the PTQ4DiT pipeline, detailed in Algorithm 1. PTQ4DiT is designed to enhance the performance of quantized DiTs by addressing quantization challenges through sophisticated salience estimation and balancing strategies.

The full algorithm consists of five main steps. The pipeline begins with estimating activation and weight salience for the pre-trained model using a calibration dataset. Following the estimation, we employ Spearman’s ρ-guided Salience Calibration to compute correlation coefficients between activation salience and weight salience, which helps determine the weighting factors for each timestep. These factors are crucial for computing a temporally adjusted salience, which aims to minimize quantization errors that typically occur due to misalignment in salience peaks across the timesteps. The Channel-wise Salience Balancing step follows, wherein Salience Balancing Matrices are constructed to redistribute the activation and weight values channel-wise. Specifically, for each channel $j$, the balanced salience $\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{j},\mathbf{W}_{j})$ is given by:

$\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{j},\mathbf{W}_{j})=\left(s_{\rho}(\mathbf{X}^{(1:T)}_{j})\cdot s(\mathbf{W}_{j})\right)^{\frac{1}{2}},$ (14)

where $s_{\rho}(\mathbf{X}^{(1:T)}_{j})$ is the $j$-th element of $\mathbf{s}_{\rho}(\mathbf{X}^{(1:T)})$. Then, we formulate the refined Salience Balancing Matrices $\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{B}_{\rho}^{\mathbf{W}}$ based on $\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{j},\mathbf{W}_{j})$ and $s_{\rho}(\mathbf{X}^{(1:T)}_{j})$, as detailed in Eq. (15). This step is pivotal in aligning the activation and weight distributions, thereby minimizing the overall quantization difficulty. In the Re-Parameterization phase, these balancing matrices are integrated into the pre-trained model, ensuring that no additional computational cost is required during inference. This integration maintains computational efficiency while retaining the benefits of our salience balancing technique. Finally, we perform quantization on the model with balanced activations and weights, setting the stage for the deployment of efficient and effective quantized DiTs in resource-constrained environments.
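
To tie the five steps together, below is a compact, self-contained sketch of Algorithm 1 for a single linear layer, using a per-tensor uniform weight quantizer for brevity (the actual method quantizes activations as well and further optimizes the quantization parameters following Q-Diffusion [18]); all function and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

def rank(v: torch.Tensor) -> torch.Tensor:
    """Centered ranks for Spearman's rho (ties ignored in this sketch)."""
    r = v.argsort().argsort().float()
    return r - r.mean()

def calibrate_linear(layer: nn.Linear, acts_per_t, n_bits: int = 8):
    """Sketch of Algorithm 1 for one linear layer: salience estimation, SSC, CSB,
    offline re-parameterization of the weight, and uniform weight quantization."""
    # Step 1 (Preparation): salience estimation, Eqs. (4) and (9)
    s_x = torch.stack([x.abs().amax(dim=tuple(range(x.dim() - 1))) for x in acts_per_t])  # (T, d_in)
    s_w = layer.weight.detach().abs().amax(dim=0)                                          # (d_in,)

    # Step 2 (SSC): Spearman's rho-guided Salience Calibration, Eqs. (10)-(11)
    rw = rank(s_w)
    rho = torch.stack([(rank(s) * rw).sum() / (rank(s).norm() * rw.norm()) for s in s_x])
    s_rho = torch.softmax(-rho, dim=0) @ s_x                                               # (d_in,)

    # Step 3 (CSB): Channel-wise Salience Balancing, Eqs. (5)-(7) and (14)
    s_bal = (s_rho * s_w).sqrt()
    b_x, b_w = s_bal / s_rho, s_bal / s_w

    # Step 4 (Re-Parameterization): fold B_rho^W into the stored weight offline (W~ = B_rho^W W)
    with torch.no_grad():
        layer.weight.mul_(b_w)        # weight is (d_out, d_in), so this scales input channels

    # Step 5 (Quantization): per-tensor uniform quantizer of Eq. (2) applied to the balanced weight
    w = layer.weight
    delta = (w.max() - w.min()) / (2 ** n_bits - 1)
    lam = torch.round(-w.min() / delta)
    w_q = torch.clamp(torch.round(w / delta) + lam, 0, 2 ** n_bits - 1)
    return b_x, (w_q, delta, lam)     # b_x must still be folded into the preceding adaLN / matmul

# Toy usage
layer = nn.Linear(16, 32, bias=False)
acts = [torch.randn(4, 10, 16) * (t + 1) for t in range(5)]
b_x, quantized = calibrate_linear(layer, acts)
print(b_x.shape, quantized[0].shape)  # torch.Size([16]) torch.Size([32, 16])
```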

Appendix C Proof of Mathematical Equivalence

In this section, we provide detailed proof demonstrating that our PTQ4DiT maintains mathematical equivalence to the original linear layers. This proof ensures that the balancing operation does not alter the original computational outcomes of the full-precision models.

In PTQ4DiT, we introduce the Salience Balancing Matrices $\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{B}_{\rho}^{\mathbf{W}}$, which are diagonal matrices intended to balance the salience across activation and weight channels. We verify the inverse relationship of $\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\mathbf{B}_{\rho}^{\mathbf{W}}$ mentioned in Section 4.3:

$$\begin{aligned}
\mathbf{B}_{\rho}^{\mathbf{X}}\cdot\mathbf{B}_{\rho}^{\mathbf{W}}
&=\text{diag}\Big(\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{1},\mathbf{W}_{1})}{s_{\rho}(\mathbf{X}^{(1:T)}_{1})},\ldots,\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{d_{in}},\mathbf{W}_{d_{in}})}{s_{\rho}(\mathbf{X}^{(1:T)}_{d_{in}})}\Big)\cdot\text{diag}\Big(\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{1},\mathbf{W}_{1})}{s(\mathbf{W}_{1})},\ldots,\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{d_{in}},\mathbf{W}_{d_{in}})}{s(\mathbf{W}_{d_{in}})}\Big)\\
&=\text{diag}\Big(\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{1},\mathbf{W}_{1})}{s_{\rho}(\mathbf{X}^{(1:T)}_{1})}\cdot\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{1},\mathbf{W}_{1})}{s(\mathbf{W}_{1})},\ldots,\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{d_{in}},\mathbf{W}_{d_{in}})}{s_{\rho}(\mathbf{X}^{(1:T)}_{d_{in}})}\cdot\frac{\widetilde{s_{\rho}}(\mathbf{X}^{(1:T)}_{d_{in}},\mathbf{W}_{d_{in}})}{s(\mathbf{W}_{d_{in}})}\Big)\\
&=\text{diag}\Big(\frac{(s_{\rho}(\mathbf{X}^{(1:T)}_{1})\cdot s(\mathbf{W}_{1}))^{\frac{1}{2}\cdot 2}}{s_{\rho}(\mathbf{X}^{(1:T)}_{1})\cdot s(\mathbf{W}_{1})},\ldots,\frac{(s_{\rho}(\mathbf{X}^{(1:T)}_{d_{in}})\cdot s(\mathbf{W}_{d_{in}}))^{\frac{1}{2}\cdot 2}}{s_{\rho}(\mathbf{X}^{(1:T)}_{d_{in}})\cdot s(\mathbf{W}_{d_{in}})}\Big)=\mathbf{I},
\end{aligned}$$ (15)

where $\mathbf{I}$ denotes the identity matrix. Therefore, we can derive the mathematical equivalence:

$\mathbf{\widetilde{X}}\cdot\mathbf{\widetilde{W}}=(\mathbf{X}\mathbf{B}_{\rho}^{\mathbf{X}})\cdot(\mathbf{B}_{\rho}^{\mathbf{W}}\mathbf{W})=\mathbf{X}\cdot(\mathbf{B}_{\rho}^{\mathbf{X}}\mathbf{B}_{\rho}^{\mathbf{W}})\cdot\mathbf{W}=\mathbf{X}\cdot\mathbf{W}.$ (16)
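
The identity can also be checked numerically; the short sketch below constructs the two diagonal matrices from random tensors (for a single timestep, for simplicity) and verifies Eqs. (15) and (16) up to floating-point error. The names are illustrative.

```python
import torch

torch.manual_seed(0)
d_in, d_out = 16, 32
x, w = torch.randn(8, d_in), torch.randn(d_in, d_out)

# Salience and the two diagonal matrices (Eqs. (5)-(7); a single timestep for simplicity)
s_x, s_w = x.abs().amax(dim=0), w.abs().amax(dim=1)
s_bal = (s_x * s_w).sqrt()
B_x, B_w = torch.diag(s_bal / s_x), torch.diag(s_bal / s_w)

print(torch.allclose(B_x @ B_w, torch.eye(d_in), atol=1e-5))    # Eq. (15): B^X B^W = I
print(torch.allclose((x @ B_x) @ (B_w @ w), x @ w, atol=1e-5))  # Eq. (16): X~ W~ = X W
```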

Appendix D Derivations of Post-adaLN Integration

This section details the integration of the Salience Balancing Matrix $\mathbf{B}_{\rho}^{\mathbf{X}}$ into the MLPs before the adaptive Layer Norm (adaLN) modules [37], aimed at eliminating extra computational overhead at the inference stage. Recall that the initial formulation of adaLN on the input latent noise $\mathbf{Z}\in\mathbb{R}^{n\times d_{in}}$ is given by:

$\mathbf{X}=\text{adaLN}(\mathbf{Z})=\text{LN}(\mathbf{Z})\odot(\bm{1}+\bm{\gamma})+\bm{\beta},$ (17)

where $\bm{\gamma},\bm{\beta}\in\mathbb{R}^{d_{in}}$ are scale and shift parameters, respectively, regressed by MLPs based on the conditional input $\mathbf{c}\in\mathbb{R}^{d_{in}}$:

$(\bm{\gamma},\bm{\beta})=\text{MLPs}(\mathbf{c})=\mathbf{c}\cdot(\mathbf{W}_{\bm{\gamma}},\mathbf{W}_{\bm{\beta}})+(\mathbf{b}_{\bm{\gamma}},\mathbf{b}_{\bm{\beta}}).$ (18)

Here, $\mathbf{W}_{\bm{\gamma}},\mathbf{W}_{\bm{\beta}}$ are weight matrices, and $\mathbf{b}_{\bm{\gamma}},\mathbf{b}_{\bm{\beta}}$ are bias terms. In PTQ4DiT, $\mathbf{\widetilde{X}}$ is obtained by applying $\mathbf{B}_{\rho}^{\mathbf{X}}$ to the output of adaLN as follows:

$\mathbf{\widetilde{X}}=\mathbf{X}\mathbf{B}_{\rho}^{\mathbf{X}}=\text{LN}(\mathbf{Z})\odot(\mathbf{B}_{\rho}^{\mathbf{X}}+\bm{\gamma}\mathbf{B}_{\rho}^{\mathbf{X}})+\bm{\beta}\mathbf{B}_{\rho}^{\mathbf{X}},$ (19)

which echoes Eq. (13). To avoid additional matrix multiplications in $\bm{\gamma}\mathbf{B}_{\rho}^{\mathbf{X}}$ and $\bm{\beta}\mathbf{B}_{\rho}^{\mathbf{X}}$, we can pre-absorb $\mathbf{B}_{\rho}^{\mathbf{X}}$ into the MLPs’ weights and biases offline, expressed as:

$(\widetilde{\mathbf{W}}_{\bm{\gamma}},\widetilde{\mathbf{W}}_{\bm{\beta}})=(\mathbf{W}_{\bm{\gamma}}\mathbf{B}_{\rho}^{\mathbf{X}},\mathbf{W}_{\bm{\beta}}\mathbf{B}_{\rho}^{\mathbf{X}}),\quad(\widetilde{\mathbf{b}}_{\bm{\gamma}},\widetilde{\mathbf{b}}_{\bm{\beta}})=(\mathbf{b}_{\bm{\gamma}}\mathbf{B}_{\rho}^{\mathbf{X}},\mathbf{b}_{\bm{\beta}}\mathbf{B}_{\rho}^{\mathbf{X}}).$ (20)

Thus, the re-parameterized MLPs can directly produce the adjusted scale and shift parameters:

$$\begin{aligned}
\widetilde{\text{MLPs}}(\mathbf{c})&=\mathbf{c}\cdot(\widetilde{\mathbf{W}}_{\bm{\gamma}},\widetilde{\mathbf{W}}_{\bm{\beta}})+(\widetilde{\mathbf{b}}_{\bm{\gamma}},\widetilde{\mathbf{b}}_{\bm{\beta}})\\
&=\mathbf{c}\cdot(\mathbf{W}_{\bm{\gamma}}\mathbf{B}_{\rho}^{\mathbf{X}},\mathbf{W}_{\bm{\beta}}\mathbf{B}_{\rho}^{\mathbf{X}})+(\mathbf{b}_{\bm{\gamma}}\mathbf{B}_{\rho}^{\mathbf{X}},\mathbf{b}_{\bm{\beta}}\mathbf{B}_{\rho}^{\mathbf{X}})\\
&=(\mathbf{c}\cdot(\mathbf{W}_{\bm{\gamma}},\mathbf{W}_{\bm{\beta}})+(\mathbf{b}_{\bm{\gamma}},\mathbf{b}_{\bm{\beta}}))\cdot\mathbf{B}_{\rho}^{\mathbf{X}}\\
&=(\bm{\gamma},\bm{\beta})\cdot\mathbf{B}_{\rho}^{\mathbf{X}}\\
&=(\bm{\gamma}\mathbf{B}_{\rho}^{\mathbf{X}},\bm{\beta}\mathbf{B}_{\rho}^{\mathbf{X}}).
\end{aligned}$$ (21)

This allows for obtaining $\mathbf{\widetilde{X}}$ without extra computational burden at the inference stage.
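
This fusion can be sanity-checked numerically; the short snippet below verifies that an MLP head whose weight and bias have been right-multiplied by a diagonal matrix (Eq. (20)) produces exactly the scaled parameters of Eq. (21). The matrix sizes and names are illustrative.

```python
import torch

torch.manual_seed(0)
d = 16
c = torch.randn(1, d)
W_gamma, b_gamma = torch.randn(d, d), torch.randn(d)
B = torch.diag(torch.rand(d) + 0.5)              # stand-in for B_rho^X

fused = c @ (W_gamma @ B) + b_gamma @ B          # re-parameterized MLP of Eq. (20)
scaled = (c @ W_gamma + b_gamma) @ B             # gamma regressed first, then scaled (Eq. (21))
print(torch.allclose(fused, scaled, atol=1e-5))  # True
```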

Appendix E Additional Visualization Results

Figures 8 and 9 supplement visualization results of our PTQ4DiT on W8A8 quantization, compared with baseline PTQ methods and the full-precision (FP) counterpart, on ImageNet 512×512 and 256×256. Our method generates results that closely mirror those of the FP models, presenting finer details and richer semantic content than the baseline approaches.

Figure 8: Random samples generated by different PTQ methods with W8A8 quantization, alongside the full-precision DiTs [37], on ImageNet 512×512.

Figure 9: Random samples generated by different PTQ methods with W8A8 quantization, alongside the full-precision DiTs [37], on ImageNet 256×256.

Appendix F Limitations and Broader Impacts

This work introduces a pioneering solution facilitating the broad deployment of Diffusion Transformers (DiTs) through Post-training Quantization. Our method substantially lowers computational and memory demands, thereby improving the accessibility of DiTs. Currently, our research concentrates on visual generation. For future work, we plan to extend our methodology to other generative models across various modalities, such as audio and 3D. However, there remains an inherent risk that these generative models could be utilized to produce disinformation. While our study contributes to the widespread application of DiTs, it does not address such ethical risks. We recognize the importance of developing safeguards and encourage further research into strategies that can prevent the misuse of these powerful generative technologies.