Learning Diffusion Model from Noisy Measurement using Principled Expectation-Maximization Method
Abstract
Diffusion models have demonstrated exceptional ability in modeling complex image distributions, making them versatile plug-and-play priors for solving imaging inverse problems. However, their reliance on large-scale clean datasets for training limits their applicability in scenarios where acquiring clean data is costly or impractical. Recent approaches have attempted to learn diffusion models directly from corrupted measurements, but these methods either lack theoretical convergence guarantees or are restricted to specific types of data corruption. In this paper, we propose a principled expectation-maximization (EM) framework that iteratively learns diffusion models from noisy data with arbitrary corruption types. Our framework employs a plug-and-play Monte Carlo method to accurately estimate clean images from noisy measurements, followed by training the diffusion model using the reconstructed images. This process alternates between estimation and training until convergence. We evaluate the performance of our method across various imaging tasks, including inpainting, denoising, and deblurring. Experimental results demonstrate that our approach enables the learning of high-fidelity diffusion priors from noisy data, significantly enhancing reconstruction quality in imaging inverse problems.
Index Terms:
Computational imaging, inverse problems, image priors, diffusion models, Bayesian inference
I Introduction
Diffusion models (DMs) [1, 2, 3] are emerging as versatile tools for capturing high-dimensional distributions [4, 5, 6, 7, 8] and serving as powerful image priors for solving inverse problems [9, 10, 11, 12, 13, 14]. However, training these models typically requires large datasets of clean, high-quality images, which are often expensive or impractical to obtain [15]. For example, in fluorescence microscopy, low signal-to-noise ratio (SNR) images are prevalent due to observational corruptions and measurement noise (e.g., motion blur, camera readout noise), while high-SNR images are scarce because of the long exposure times required and the risk of phototoxicity.
Recent works have sought to address this challenge by learning clean DMs directly from noisy measurements. For example, RenderDiffusion [16] and DWFM [17] incorporate the imaging forward model into the generative process to train 3D DMs from 2D images. However, these methods assume multiple measurements per object, limiting their applicability in many real-world scenarios. SURE-Score [18] uses Stein’s unbiased risk estimate (SURE) as a regularizer for learning DMs from noisy data. However, this approach heavily constrains the model’s parameter space, leading to suboptimal performance when the data are severely corrupted, such as in inpainting tasks. Ambient Diffusion [15] introduces a further corruption strategy that masks additional pixels to help the DM learn to reconstruct missing information. While effective for inpainting, it struggles with measurement noise and other types of image corruption.
To address these limitations, EMDiffusion [19] proposes a general expectation-maximization (EM) framework, achieving state-of-the-art performance in learning clean DMs from various types of corrupted images. This framework alternates between an expectation step (E-step), which uses a known diffusion prior to sample the posterior of clean images given noisy measurements, and a maximization step (M-step), where the recovered clean images are used to refine the diffusion model. However, its E-step relies on an approximate diffusion posterior sampling (DPS) algorithm [11], which lacks theoretical convergence guarantees and sometimes produces low-quality reconstructions, limiting its ability to capture accurate clean image distributions.
In this work, we propose a principled EM framework for learning clean DMs from noisy measurements with arbitrary corruption types. We enhance the E-step of EMDiffusion by upgrading the DPS algorithm to a plug-and-play Monte Carlo (PMC) method, which offers provable convergence guarantees and improved posterior sampling accuracy. This, in turn, leads to more precise learning of clean generative models. Extensive experiments on noisy CIFAR-10 and CelebA datasets with various corruption types demonstrate the effectiveness of our approach.
II Methodology
II-A Preliminaries
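Diffusion models (DMs) [2, 1] capture data distributions by approximating the score function, i.e., the gradient of the log-likelihood $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)$, using a deep neural network $\mathbf{s}_\theta$ parameterized by $\theta$. To achieve this, DMs define a forward stochastic differential equation (SDE) that progressively injects noise, together with the corresponding reverse-time SDE that progressively removes the noise:
$$\mathrm{d}\mathbf{x}_t = -\frac{\beta_t}{2}\mathbf{x}_t\,\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\mathbf{w}, \qquad \mathrm{d}\mathbf{x}_t = \left[-\frac{\beta_t}{2}\mathbf{x}_t - \beta_t\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)\right]\mathrm{d}t + \sqrt{\beta_t}\,\mathrm{d}\bar{\mathbf{w}}, \tag{1}$$
where $\beta_t$ is the noise schedule, $t \in [0, T]$ represents the time index, and $\mathbf{w}$ and $\bar{\mathbf{w}}$ are the forward and backward Wiener processes, respectively. Once a DM is trained on large-scale clean data, diverse samples can be generated by replacing $\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t)$ in Eq. 1 with the learned score function $\mathbf{s}_\theta(\mathbf{x}_t, \sigma_t)$, where $\sigma_t$ is the pre-defined noise strength corresponding to $t$.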
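As a rough illustration of the forward noising process, the sketch below evaluates the closed-form marginal of a variance-preserving SDE, $\mathbf{x}_t \mid \mathbf{x}_0 \sim \mathcal{N}(a_t\mathbf{x}_0, (1-a_t^2)\mathbf{I})$ with $a_t = \exp(-\tfrac{1}{2}\int_0^t \beta_s\,\mathrm{d}s)$. The linear schedule, the values `beta_min=0.1` and `beta_max=20.0`, the time range $[0, 1]$, and the function names are assumptions made for this example, not settings reported here.

```python
import math
import random

def vp_marginal(t, beta_min=0.1, beta_max=20.0):
    """Return (mean_coeff, std) of x_t | x_0 for a VP SDE with a linear beta(t), t in [0, 1]."""
    # Integral of beta(s) from 0 to t for beta(s) = beta_min + s * (beta_max - beta_min).
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    mean_coeff = math.exp(-0.5 * integral)
    std = math.sqrt(1.0 - mean_coeff ** 2)
    return mean_coeff, std

def noise_sample(x0, t, rng):
    """Draw x_t | x_0 for a signal given as a list of floats."""
    mean_coeff, std = vp_marginal(t)
    return [mean_coeff * v + std * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
x0 = [0.2, -0.5, 1.0]
for t in (0.0, 0.5, 1.0):
    mean_coeff, std = vp_marginal(t)
    print(f"t={t:.1f}  mean_coeff={mean_coeff:.4f}  std={std:.4f}  x_t={noise_sample(x0, t, rng)}")
```

At $t = 0$ the sample equals the clean signal, and as $t$ grows the signal coefficient decays toward zero while the noise standard deviation approaches one.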
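By Bayes’ rule, $p(\mathbf{x} \mid \mathbf{y}) \propto p(\mathbf{x})\,p(\mathbf{y} \mid \mathbf{x})$, the reverse-time SDE can also be adapted to sample from the posterior distribution of images $\mathbf{x}$ conditioned on measurements $\mathbf{y}$. In this case, the likelihood $p(\mathbf{y} \mid \mathbf{x})$ acts as a data-fidelity term to ensure the reconstructed images match the measurements. The score function from Eq. 1 is then replaced by its conditional version:
$$\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t \mid \mathbf{y}) = \nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) + \nabla_{\mathbf{x}_t}\log p_t(\mathbf{y} \mid \mathbf{x}_t), \tag{2}$$
where $p_t(\mathbf{y} \mid \mathbf{x}_t)$ is a time-dependent data-fidelity term distinct from $p(\mathbf{y} \mid \mathbf{x}_0)$ and is generally intractable. The challenge in posterior sampling via diffusion models is bridging $p_t(\mathbf{y} \mid \mathbf{x}_t)$ with $p(\mathbf{y} \mid \mathbf{x}_0)$. The widely used diffusion posterior sampling (DPS) algorithm [11] empirically assumes that:
$$p_t(\mathbf{y} \mid \mathbf{x}_t) \approx p(\mathbf{y} \mid \hat{\mathbf{x}}_0(\mathbf{x}_t)), \qquad \hat{\mathbf{x}}_0(\mathbf{x}_t) = \mathbb{E}[\mathbf{x}_0 \mid \mathbf{x}_t], \tag{3}$$
where $\hat{\mathbf{x}}_0(\mathbf{x}_t)$ is the posterior-mean estimate of the clean image. However, this approximation lacks theoretical guarantees for convergence to the true posterior distribution, which can result in inaccurate sampling, even in simple cases.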
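One simple case where the gap can be computed exactly is a scalar Gaussian model. The sketch below assumes a prior $x_0 \sim \mathcal{N}(0, \tau^2)$, a variance-preserving noising $x_t = a\,x_0 + \sqrt{1-a^2}\,\epsilon$, and a measurement $y = x_0 + n$ with $n \sim \mathcal{N}(0, s^2)$. All of these choices, and the function names, are illustrative assumptions. In this model the exact likelihood $p_t(y \mid x_t)$ has variance $\mathrm{Var}[x_0 \mid x_t] + s^2$, whereas plugging in only the posterior mean gives variance $s^2$.

```python
import math

def posterior_moments(x_t, alpha, prior_var=1.0):
    """Mean and variance of x0 | x_t for x0 ~ N(0, prior_var) and
    x_t = alpha * x0 + sqrt(1 - alpha**2) * eps (variance-preserving noising)."""
    noise_var = 1.0 - alpha ** 2
    marginal_var = alpha ** 2 * prior_var + noise_var
    mean = alpha * prior_var * x_t / marginal_var
    var = prior_var * noise_var / marginal_var
    return mean, var

def likelihood_variances(alpha, meas_noise_var, prior_var=1.0):
    """Variance of y | x_t: exact value versus the point-estimate approximation."""
    _, var_x0 = posterior_moments(0.0, alpha, prior_var)
    exact = var_x0 + meas_noise_var   # integrates over p(x0 | x_t)
    approx = meas_noise_var           # plugs in E[x0 | x_t] only
    return exact, approx

for alpha in (0.99, 0.7, 0.1):
    exact, approx = likelihood_variances(alpha, meas_noise_var=0.04)
    print(f"alpha={alpha:4.2f}  exact var={exact:.4f}  point-estimate var={approx:.4f}")
```

The two variances agree only when `alpha` is close to one (little added noise). At high noise levels the point-estimate likelihood is much narrower than the exact one, which is the kind of mismatch the paragraph above refers to.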
II-B Principled Expectation-Maximization Framework
In this paper, we propose a principled expectation-maximization (EM) framework for learning clean diffusion models from noisy measurements. This approach inherits the basic procedure of the standard EM algorithm [20, 21] and EMDiffusion [19], alternating between two steps: 1) the E-step, where we sample the posterior distribution of clean images from noisy measurements using the current diffusion model as a known prior, and 2) the M-step, where we refine the DM using these recovered samples. This iterative process guides the DM towards a local optimum, progressively improving both the model and the quality of the recovered images.
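The alternation can be sketched on a toy problem where both steps have closed forms. The example below is a simplification, not the proposed method: it replaces the diffusion prior with a scalar Gaussian prior $\mathcal{N}(\mu, \tau^2)$, assumes measurements $y = x + n$ with known noise variance, samples the exact Gaussian posterior in the E-step, and refits $(\mu, \tau^2)$ in the M-step. The function names and all numeric values are hypothetical.

```python
import random
import statistics

def e_step(ys, mu, tau2, noise_var, rng):
    """Draw one posterior sample x ~ p(x | y) per measurement under the current prior N(mu, tau2)."""
    post_var = 1.0 / (1.0 / tau2 + 1.0 / noise_var)
    post_std = post_var ** 0.5
    return [rng.gauss(post_var * (mu / tau2 + y / noise_var), post_std) for y in ys]

def m_step(xs):
    """Refit the prior to the posterior samples (maximum-likelihood Gaussian fit)."""
    return statistics.fmean(xs), statistics.pvariance(xs)

def run_em(ys, noise_var, mu=0.0, tau2=4.0, n_iters=30, seed=0):
    """Alternate E-step and M-step, starting from a deliberately wrong prior."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        xs = e_step(ys, mu, tau2, noise_var, rng)
        mu, tau2 = m_step(xs)
    return mu, tau2

# Synthetic data: clean x ~ N(2, 1), measurements y = x + n with n ~ N(0, 0.25).
data_rng = random.Random(1)
ys = [data_rng.gauss(2.0, 1.0) + data_rng.gauss(0.0, 0.5) for _ in range(2000)]
mu_hat, tau2_hat = run_em(ys, noise_var=0.25)
print(f"learned prior: mean={mu_hat:.2f}, variance={tau2_hat:.2f}")
```

Although no clean sample is ever observed, the learned prior moves from the initial guess toward the clean distribution, which is the behavior the EM framework relies on at image scale.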
Table I: Reconstruction quality (PSNR, LPIPS) and generation quality (FID) for each method and task.
| Method | CIFAR10-Denoising PSNR | CIFAR10-Denoising LPIPS | CIFAR10-Denoising FID | CIFAR10-Inpainting PSNR | CIFAR10-Inpainting LPIPS | CIFAR10-Inpainting FID | CelebA-Deblurring PSNR | CelebA-Deblurring LPIPS | CelebA-Deblurring FID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| measurements | 18.05 | 0.047 | 132.59 | 13.49 | 0.295 | 234.47 | 22.47 | 0.365 | 72.83 |
| DPS w/ clean prior | 25.91 | 0.010 | 7.08 | 25.44 | 0.008 | 7.08 | 29.05 | 0.013 | 10.24 |
| Noise2Self [23] | 21.32 | 0.227 | 92.06 | - | - | - | - | - | - |
| SURE-Score [18] | 22.42 | 0.138 | 132.61 | 15.75 | 0.182 | 220.01 | 22.07 | 0.383 | 191.96 |
| Ambient Diffusion [15] | - | - | - | 20.57 | 0.027 | 28.88 | - | - | - |
| EMDiffusion [19] | 23.16 | 0.022 | 86.47 | 24.70 | 0.009 | 21.08 | 23.74 | 0.103 | 91.89 |
| Ours | 24.77 | 0.016 | 72.84 | 22.74 | 0.023 | 25.46 | 26.58 | 0.081 | 79.43 |
II-B1 E-step, principled DM-based posterior sampling
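The E-step employs the plug-and-play Monte Carlo (PMC) algorithm [25] for principled diffusion posterior sampling. Unlike conventional DPS, PMC is grounded in the theories of plug-and-play (PnP) optimization and Langevin Monte Carlo methods, providing non-asymptotic stationarity guarantees in terms of Fisher information. Assuming a minimum mean square error (MMSE) denoiser, $\mathsf{D}_\sigma$, as an implicit image prior, the PnP framework is expressed as:
$$\mathbf{x}_{k+1} = \mathsf{D}_\sigma(\mathbf{z}_k), \qquad \mathbf{z}_k = \mathbf{x}_k + \gamma\nabla_{\mathbf{x}}\log p(\mathbf{y} \mid \mathbf{x}_k), \tag{4}$$
where $\gamma$ is the step size, and $\sigma$ controls the denoising strength. According to Tweedie’s formula [24], the MMSE denoiser can be formulated as:
$$\mathsf{D}_\sigma(\mathbf{x}) = \mathbf{x} + \sigma^2\nabla_{\mathbf{x}}\log p_\sigma(\mathbf{x}). \tag{5}$$
Here, $p_\sigma$ represents the prior distribution convolved with Gaussian noise of standard deviation $\sigma$. Though intractable, its score can be approximated by the score-based diffusion model $\mathbf{s}_\theta(\mathbf{x}, \sigma)$. Consequently, the PnP optimization in Eq. 4 can be rewritten as:
$$\mathbf{x}_{k+1} = \mathbf{z}_k + \sigma^2\,\mathbf{s}_\theta(\mathbf{z}_k, \sigma). \tag{6}$$
Introducing additional Brownian motion and a scaling factor $\lambda_i$ before the score function leads to the PMC formulation, as originally described in [25]:
$$\mathbf{x}_{k+1} = \mathbf{z}_k + \lambda_i\,\sigma^2\,\mathbf{s}_\theta(\mathbf{z}_k, \sigma) + \sqrt{2\gamma}\,\boldsymbol{\epsilon}_k, \qquad \boldsymbol{\epsilon}_k \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{7}$$
where $i$ indexes the EM iterations. As the diffusion model’s accuracy improves, $\lambda_i$ increases from a small value to 1. Since PMC does not rely on approximations for the time-dependent data fidelity $p_t(\mathbf{y} \mid \mathbf{x}_t)$, it ensures principled convergence to the true posterior distribution.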
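A Langevin-type sampler of this kind can be checked on a scalar problem whose posterior is known in closed form. The sketch below is an illustration under stated assumptions, not the authors' implementation: the learned score network is replaced by the analytic score of a Gaussian prior $\mathcal{N}(0, \tau^2)$ smoothed with noise level $\sigma$, the measurement model is $y = x + n$, the denoising strength is tied to the step size through $\sigma^2 = \gamma$, and the scaling factor is fixed to 1. All function names and numeric values are hypothetical.

```python
import math
import random

def grad_log_likelihood(x, y, noise_var):
    """Gradient of log p(y | x) for y = x + n, n ~ N(0, noise_var)."""
    return (y - x) / noise_var

def gaussian_score(x, sigma, prior_var):
    """Score of the prior N(0, prior_var) smoothed by Gaussian noise of std sigma.
    Stands in for a learned score network s_theta(x, sigma)."""
    return -x / (prior_var + sigma ** 2)

def pmc_step(x, y, gamma, lam, noise_var, prior_var, rng):
    """One update: data-fidelity step, scaled score step, then injected Gaussian noise."""
    sigma = math.sqrt(gamma)  # assumption: sigma^2 = gamma
    z = x + gamma * grad_log_likelihood(x, y, noise_var)
    drift = sigma ** 2 * lam * gaussian_score(z, sigma, prior_var)
    return z + drift + math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)

def sample_posterior(y, n_chains=2000, n_steps=200, gamma=0.01, lam=1.0,
                     noise_var=0.25, prior_var=1.0, seed=0):
    """Run independent chains from x = 0 and return their final states."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_chains):
        x = 0.0
        for _ in range(n_steps):
            x = pmc_step(x, y, gamma, lam, noise_var, prior_var, rng)
        samples.append(x)
    return samples

samples = sample_posterior(y=1.5)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
# Closed-form posterior under the smoothed prior, used only as a reference.
precision = 1.0 / 0.25 + 1.0 / (1.0 + 0.01)
print(f"empirical mean={mean:.3f}, var={var:.3f}")
print(f"reference mean={1.5 / 0.25 / precision:.3f}, var={1.0 / precision:.3f}")
```

The empirical moments match the reference posterior up to Monte Carlo error and a small discretization bias from the finite step size. Lowering `lam` weakens the prior term, so the samples rely more on the measurement.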
II-B2 M-step, refining diffusion priors
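After obtaining the posterior samples of clean images (Sec. II-B1), we refine the diffusion model using these reconstructed images. The M-step is similar to standard diffusion model training with clean data and uses the denoising score matching (DSM) loss [22] to update the model parameters:
$$\theta^{*} = \arg\min_{\theta}\,\mathbb{E}_{t,\,\mathbf{x}_0,\,\mathbf{x}_t \mid \mathbf{x}_0}\left[\left\|\mathbf{s}_\theta(\mathbf{x}_t, \sigma_t) - \nabla_{\mathbf{x}_t}\log p(\mathbf{x}_t \mid \mathbf{x}_0)\right\|_2^2\right], \tag{8}$$
where $\mathbf{x}_0$ represents the posterior samples, and $\mathbf{x}_t$ are the noisy versions of $\mathbf{x}_0$ generated by the forward SDE in Eq. 1.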
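To see what the DSM objective recovers, the sketch below fits a one-parameter linear score model $s(x) = -a\,x$ at a single noise level. It is a toy stand-in for network training and assumes scalar samples $x_0 \sim \mathcal{N}(0, \tau^2)$ and variance-preserving noising $x_t = \alpha x_0 + \sqrt{1-\alpha^2}\,\epsilon$. The function name and values are hypothetical. Because the model is linear, the DSM minimizer has a closed-form least-squares solution, and it should approach the true marginal score coefficient $1/(\alpha^2\tau^2 + 1 - \alpha^2)$.

```python
import random

def dsm_fit_linear_score(x0_samples, alpha, seed=0):
    """Fit s(x) = -a * x by minimizing the DSM loss at one noise level; returns a."""
    rng = random.Random(seed)
    std = (1.0 - alpha ** 2) ** 0.5
    num = 0.0
    den = 0.0
    for x0 in x0_samples:
        eps = rng.gauss(0.0, 1.0)
        x_t = alpha * x0 + std * eps   # noisy version of x0
        target = -eps / std            # gradient of log p(x_t | x0) w.r.t. x_t
        # Least squares for -a * x_t ~ target gives a = -sum(x_t * target) / sum(x_t^2).
        num += -x_t * target
        den += x_t * x_t
    return num / den

data_rng = random.Random(3)
x0_samples = [data_rng.gauss(0.0, 2.0) for _ in range(20000)]  # tau^2 = 4
alpha = 0.8
a_hat = dsm_fit_linear_score(x0_samples, alpha)
a_true = 1.0 / (alpha ** 2 * 4.0 + 1.0 - alpha ** 2)
print(f"fitted a={a_hat:.4f}, marginal-score a={a_true:.4f}")
```

The regression target only involves the conditional score of $x_t$ given $x_0$, yet the fitted coefficient matches the score of the noisy marginal. This is why training on posterior samples in place of clean images fits into the usual DSM pipeline.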
To speed up training, we do not retrain the diffusion model from scratch in each EM iteration. Instead, we fine-tune the model parameters from the previous iteration during the early stages. In the final 1-3 EM iterations, we reset the model to eliminate any influence of the low-quality posterior samples from earlier iterations, then retrain it from scratch.
II-B3 Initialization of the EM iterations
A good initialization of the DM is crucial for steering the EM algorithm toward a good local optimum. In our method, we pre-train the DM using 50 randomly selected clean images before the first EM iteration.
III Experiments
III-A Experimental Settings
Datasets. We evaluate our method on two datasets with distinct characteristics: CIFAR10 [26], at its native resolution of $32 \times 32$, and CelebA [27]. Noisy measurements are generated by corrupting clean images from the training sets. In the first six EM iterations, 5,000 measurements are randomly selected for posterior sampling and diffusion model refinement. In the last three iterations, all measurements are used for posterior sampling and for training the diffusion model from scratch. For evaluation, we randomly choose 250 clean images from the test sets to assess image reconstruction and use the original training sets to assess image generation.
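The iteration schedule described above can be written down as a small configuration helper. The sketch below is one possible encoding: the function name, the dictionary keys, and the use of `None` to mean "all measurements" are assumptions, while the counts (nine iterations, the last three from scratch, 5,000 measurements in the earlier ones) follow the text.

```python
def em_schedule(n_iters=9, n_scratch=3, subset_size=5000):
    """Per-iteration plan: early iterations fine-tune on a random subset,
    the final ones retrain from scratch on all measurements."""
    plan = []
    for i in range(n_iters):
        final = i >= n_iters - n_scratch
        plan.append({
            "iteration": i + 1,
            "num_measurements": None if final else subset_size,  # None means use all
            "from_scratch": final,
        })
    return plan

for step in em_schedule():
    mode = "retrain from scratch" if step["from_scratch"] else "fine-tune"
    size = "all" if step["num_measurements"] is None else step["num_measurements"]
    print(f"EM iteration {step['iteration']}: {mode} on {size} measurements")
```

Keeping the schedule in one place makes it easy to vary the number of from-scratch iterations, which the method description leaves as a choice between one and three.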
Baselines. We compare our method with four related baselines: Ambient Diffusion [15], SURE-Score [18], EMDiffusion [19], and DPS with a clean prior [11]. Ambient Diffusion, SURE-Score, and EMDiffusion have similar settings to our method, as they do not rely on large-scale clean data to train DMs, but instead train DMs directly from noisy measurements. In contrast, DPS [11] leverages a pre-trained clean diffusion prior for posterior sampling, representing the performance upper bound for our approach. Table I additionally reports Noise2Self [23] for the denoising task.
Metrics. We evaluate image reconstruction quality using two metrics: peak signal-to-noise ratio (PSNR) and learned perceptual image patch similarity (LPIPS). Additionally, we compute the Fréchet Inception Distance (FID) between samples generated by the trained DMs and the reserved clean training set to assess the quality of image generation.
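For reference, PSNR follows directly from the mean squared error as $10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The sketch below is a minimal version that assumes images are flattened to sequences of pixel values in $[0, 1]$; the function name and that value range are choices made for this example. LPIPS and FID depend on pretrained networks and are not reproduced here.

```python
import math

def psnr(reference, estimate, max_val=1.0):
    """PSNR in dB between two equally sized images given as flat sequences of pixel values."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

clean = [0.0, 0.25, 0.5, 0.75]
noisy = [v + 0.1 for v in clean]
print(f"PSNR = {psnr(clean, noisy):.2f} dB")
```

A uniform error of 0.1 on a unit range gives an MSE of 0.01, so each tenfold reduction in MSE adds 10 dB.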
III-B Image Reconstruction
We evaluate our principled expectation-maximization (EM) framework on two datasets across three representative imaging inverse problems: image denoising, random inpainting, and Gaussian deblurring.
Image denoising. For image denoising, we corrupt the images with additive Gaussian noise. As shown in Table I and Fig. 1, our method outperforms all baselines trained without clean data, reaching a PSNR of 24.77 dB compared with 23.16 dB for EMDiffusion. The reconstructed images exhibit notably fewer artifacts than those of the current state-of-the-art, EMDiffusion, demonstrating the advantage of moving beyond empirical approximations for time-dependent data fidelity.
Image inpainting. For image inpainting, we randomly mask 60% of the pixels in the images. As shown in Table I and Fig. 1, the posterior samples closely match the ground truths, highlighting the robustness of our method. Our method achieves the second-best performance in the random inpainting task, behind EMDiffusion. We attribute this to EMDiffusion’s simple approximation being particularly effective for this problem [28]: in EMDiffusion, the measurement guides the updates of all pixels, not just the unmasked ones.
Image deblurring. For image deblurring, we apply a Gaussian blur kernel with a standard deviation of 2. As shown in Table I and Fig. 2, our method consistently outperforms all baselines trained without clean data. The posterior samples progressively approximate the ground truths throughout the iterations, verifying the effectiveness of the EM framework.
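The three corruption models can be sketched on a 1D signal to make the forward operators concrete. This is a simplified stand-in for the 2D image operators: the helper names, the edge-replicating boundary handling, the kernel truncation at three standard deviations, and the noise level `0.1` are all assumptions for this example. The 60% masking ratio and the blur standard deviation of 2 follow the text.

```python
import math
import random

def gaussian_kernel(std, radius=None):
    """Normalized 1D Gaussian blur kernel, truncated at 3 standard deviations by default."""
    radius = int(3 * std) if radius is None else radius
    weights = [math.exp(-0.5 * (k / std) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(signal, kernel):
    """Convolve with edge replication so the output keeps the input length."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def random_mask(n, drop_fraction, rng):
    """Binary mask that zeroes exactly round(drop_fraction * n) entries."""
    dropped = set(rng.sample(range(n), round(drop_fraction * n)))
    return [0.0 if i in dropped else 1.0 for i in range(n)]

def add_noise(signal, sigma, rng):
    """Additive white Gaussian noise with standard deviation sigma."""
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(0)
x = [math.sin(2.0 * math.pi * i / 25.0) for i in range(100)]
mask = random_mask(len(x), 0.6, rng)
y_denoise = add_noise(x, 0.1, rng)                       # noise level is a placeholder
y_inpaint = [m * v for m, v in zip(mask, x)]             # 60% of entries removed
y_deblur = add_noise(blur(x, gaussian_kernel(2.0)), 0.1, rng)
print(sum(mask), len(gaussian_kernel(2.0)))
```

Each operator is linear in the clean signal, so the gradient of the Gaussian log-likelihood needed by a posterior sampler has a simple closed form in all three cases.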
III-C Learned Diffusion Priors
In Table I and Fig. 3, we compare the quantitative and qualitative results of the diffusion priors learned by our method against the baselines. The FID scores in Table I show that, among methods trained without clean data, our method learns the best prior for denoising on CIFAR10 (72.84 versus 86.47 for EMDiffusion) and for deblurring on CelebA (79.43 versus 91.89), and ranks second to EMDiffusion for inpainting on CIFAR10 (25.46 versus 21.08). The progressively improving quality of the posterior samples in Fig. 2 suggests that the diffusion model is effectively approximating the true prior distribution, supporting the effectiveness of our iterative learning framework. As shown in the uncurated samples in Fig. 3, our method significantly reduces sharp artifacts compared to EMDiffusion.
IV Conclusion
In this paper, we present a principled expectation-maximization framework for learning clean diffusion models from noisy measurements. Each iteration of the framework employs a plug-and-play Monte Carlo algorithm to reconstruct posterior image samples from the measurements, followed by refining the diffusion model with these recovered images. Our method outperforms existing baselines on denoising and deblurring and remains competitive on inpainting, in both image reconstruction and diffusion model training. For future work, we aim to extend this approach to more challenging scenarios, such as extremely sparse or noisy measurements in scientific applications.
References
- [1] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
- [2] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
- [3] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
- [4] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
- [5] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning. PMLR, 2021, pp. 8821–8831.
- [6] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
- [7] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10684–10695.
- [8] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
- [9] B. T. Feng, J. Smith, M. Rubinstein, H. Chang, K. L. Bouman, and W. T. Freeman, “Score-based diffusion models as principled priors for inverse imaging,” arXiv preprint arXiv:2304.11751, 2023.
- [10] G. Zhang, J. Ji, Y. Zhang, M. Yu, T. Jaakkola, and S. Chang, “Towards coherent image inpainting using denoising diffusion implicit models,” in International Conference on Machine Learning. PMLR, 2023, pp. 41164–41193.
- [11] H. Chung, J. Kim, M. T. McCann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” arXiv preprint arXiv:2209.14687, 2022.
- [12] H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,” Advances in Neural Information Processing Systems, vol. 35, pp. 25683–25696, 2022.
- [13] Y. Song, L. Shen, L. Xing, and S. Ermon, “Solving inverse problems in medical imaging with score-based generative models,” arXiv preprint arXiv:2111.08005, 2021.
- [14] B. Kawar, N. Elata, T. Michaeli, and M. Elad, “Gsure-based diffusion model training with corrupted data,” arXiv preprint arXiv:2305.13128, 2023.
- [15] G. Daras, K. Shah, Y. Dagan, A. Gollakota, A. G. Dimakis, and A. Klivans, “Ambient diffusion: Learning clean distributions from corrupted data,” 2023.
- [16] T. Anciukevičius, Z. Xu, M. Fisher, P. Henderson, H. Bilen, N. J. Mitra, and P. Guerrero, “Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12608–12618.
- [17] A. Tewari, T. Yin, G. Cazenavette, S. Rezchikov, J. B. Tenenbaum, F. Durand, W. T. Freeman, and V. Sitzmann, “Diffusion with forward models: Solving stochastic inverse problems without direct supervision,” arXiv preprint arXiv:2306.11719, 2023.
- [18] A. Aali, M. Arvinte, S. Kumar, and J. I. Tamir, “Solving inverse problems with score-based generative priors learned from noisy data,” arXiv preprint arXiv:2305.01166, 2023.
- [19] W. Bai, Y. Wang, W. Chen, and H. Sun, “An expectation-maximization algorithm for training clean diffusion models from corrupted observations,” arXiv preprint arXiv:2407.01014, 2024.
- [20] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the royal statistical society: series B (methodological), vol. 39, no. 1, pp. 1–22, 1977.
- [21] A. Gao, J. Castellanos, Y. Yue, Z. Ross, and K. Bouman, “Deepgem: Generalized expectation-maximization for blind inversion,” Advances in Neural Information Processing Systems, vol. 34, pp. 11592–11603, 2021.
- [22] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011.
- [23] J. Batson and L. Royer, “Noise2self: Blind denoising by self-supervision,” in International Conference on Machine Learning. PMLR, 2019, pp. 524–533.
- [24] B. Efron, “Tweedie’s formula and selection bias,” Journal of the American Statistical Association, vol. 106, no. 496, pp. 1602–1614, 2011.
- [25] Y. Sun, Z. Wu, Y. Chen, B. T. Feng, and K. L. Bouman, “Provable probabilistic imaging using score-based generative priors,” arXiv preprint arXiv:2310.10835, 2023.
- [26] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
- [27] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- [28] Z. Dou and Y. Song, “Diffusion posterior sampling for linear inverse problem solving: A filtering perspective,” in The Twelfth International Conference on Learning Representations, 2024.