Understanding Diffusion-based Representation Learning via Low-Dimensional Modeling
Abstract
This work addresses the critical question of why and when diffusion models, despite their generative design, can learn high-quality representations in a self-supervised manner. We hypothesize that diffusion models excel at representation learning because they learn the low-dimensional distributions of image datasets by optimizing a noise-controlled denoising objective. Our empirical results support this hypothesis, showing that variations in the representation learning performance of diffusion models across noise levels are closely linked to the quality of the corresponding posterior estimation. Grounded in this observation, we offer theoretical insights into the unimodal representation dynamics of diffusion models as noise scales vary, demonstrating how they learn meaningful representations through the denoising process. We also highlight the impact of the inherent parameter-sharing mechanism in diffusion models, which accounts for their advantages over traditional denoising autoencoders in representation learning.
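To make the empirical setup concrete, the sketch below shows the standard way representations are read out from a diffusion model: perturb a clean input at a chosen noise level, run the denoising network, and take an intermediate activation as the feature. This is a minimal PyTorch illustration under our own assumptions; `ToyDenoiser`, the variance-exploding-style perturbation, and all shapes are placeholders rather than the paper's actual architecture.

```python
# Minimal sketch of diffusion-based representation extraction.
# All names and hyperparameters here are illustrative placeholders.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pretrained diffusion network; the encoder
    activation serves as the learned representation."""
    def __init__(self, dim=784, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim + 1, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, dim)

    def forward(self, x_t, t):
        # Condition on the noise level by concatenating t to the input.
        h = self.encoder(torch.cat([x_t, t[:, None]], dim=1))
        return self.decoder(h), h  # denoised estimate and feature

@torch.no_grad()
def extract_features(model, x0, t, sigma):
    """Perturb clean data x0 at noise scale sigma, then read the
    denoiser's intermediate activation as the representation."""
    x_t = x0 + sigma * torch.randn_like(x0)  # forward perturbation
    _, feats = model(x_t, t)
    return feats

model = ToyDenoiser()
x0 = torch.randn(32, 784)        # a batch of flattened images
t = torch.full((32,), 0.3)       # probe at a single noise level
feats = extract_features(model, x0, t, sigma=0.3)
print(feats.shape)               # torch.Size([32, 256])
```

Sweeping the noise level and evaluating `feats` (e.g., with a linear probe) traces out the noise-dependent performance curve whose unimodal shape is studied here.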
1 Introduction
2 Representation Learning via Diffusion Models
3 Theoretical Understanding Through Low-Dimensional Models
4 Additional Experiments
5 Conclusion
In this work, we establish a link between distribution recovery, posterior estimation, and representation learning, providing the first theoretical study of diffusion-based representation learning dynamics across varying noise scales. Using a low-dimensional mixture of low-rank Gaussians as the data model, we show that the unimodal representation dynamics arise from the interplay between data denoising and class specification. Additionally, our analysis highlights the inherent weight-sharing mechanism in diffusion models, demonstrating its benefits for peak representation performance as well as its limitations at high noise levels, where the increased complexity of the denoising task hampers optimization. Experiments on both synthetic and real datasets validate our findings.
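To illustrate the analysis setup, the sketch below computes the closed-form posterior mean E[x | x_t] under a mixture of low-rank Gaussians, i.e., the optimal denoiser against which a learned model can be compared. It is a NumPy sketch under our own simplifying assumptions: equal mixing weights, orthonormal bases U_k of a shared rank, and noise x_t = x + sigma * eps; with shared ranks, the determinant terms in the component likelihoods cancel, leaving the simple softmax weights used below.

```python
# Hedged sketch: posterior-mean (MMSE) denoiser for a mixture of
# low-rank Gaussians (MoLRG): x = U_k a with a ~ N(0, I_r), observed
# as x_t = x + sigma * eps.  Shapes and constants are illustrative.
import numpy as np

def molrg_posterior_mean(x_t, Us, sigma):
    """E[x | x_t] for an equal-weight MoLRG with orthonormal bases Us.

    Us    : list of (d, r) matrices with orthonormal columns
    x_t   : (d,) noisy observation
    sigma : noise standard deviation
    """
    # Component responsibility is proportional to
    # exp(||U_k^T x_t||^2 / (2 sigma^2 (1 + sigma^2)));
    # terms common to all components have been dropped.
    proj_norms = np.array([np.sum((U.T @ x_t) ** 2) for U in Us])
    logits = proj_norms / (2 * sigma**2 * (1 + sigma**2))
    w = np.exp(logits - logits.max())
    w /= w.sum()
    # Per-component posterior mean: project onto col(U_k) and shrink.
    means = [U @ (U.T @ x_t) / (1 + sigma**2) for U in Us]
    return sum(wk * mk for wk, mk in zip(w, means))

rng = np.random.default_rng(0)
d, r, K = 50, 3, 2
Us = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(K)]
x0 = Us[0] @ rng.standard_normal(r)          # clean sample, class 0
x_t = x0 + 0.5 * rng.standard_normal(d)      # noisy observation
print(np.linalg.norm(molrg_posterior_mean(x_t, Us, 0.5) - x0))
```

The two ingredients of this formula mirror the interplay described above: the softmax weights perform class specification, while the per-component shrinkage performs data denoising. As sigma grows the weights flatten and class information is lost; as sigma shrinks the denoising step becomes trivial.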