Learning Diffusion Model from Noisy Measurement using Principled Expectation-Maximization Method

Weimin Bai, Academy for Advanced Studies, Peking University
Weiheng Tang, School of Physics, Peking University
Enze Ye, College of Future Technology, Peking University
Siyi Chen, School of Physics, Peking University
Wenzheng Chen, Wangxuan Institute of Computer Technology, Peking University
He Sun, College of Future Technology, Peking University

Abstract

Diffusion models have demonstrated exceptional ability in modeling complex image distributions, making them versatile plug-and-play priors for solving imaging inverse problems. However, their reliance on large-scale clean datasets for training limits their applicability in scenarios where acquiring clean data is costly or impractical. Recent approaches have attempted to learn diffusion models directly from corrupted measurements, but these methods either lack theoretical convergence guarantees or are restricted to specific types of data corruption. In this paper, we propose a principled expectation-maximization (EM) framework that iteratively learns diffusion models from noisy data with arbitrary corruption types. Our framework employs a plug-and-play Monte Carlo method to accurately estimate clean images from noisy measurements, followed by training the diffusion model using the reconstructed images. This process alternates between estimation and training until convergence. We evaluate the performance of our method across various imaging tasks, including inpainting, denoising, and deblurring. Experimental results demonstrate that our approach enables the learning of high-fidelity diffusion priors from noisy data, significantly enhancing reconstruction quality in imaging inverse problems.

Index Terms:

Computational imaging, inverse problems, image priors, diffusion models, Bayesian inference

I Introduction

Diffusion models (DMs) [1, 2, 3] are emerging as versatile tools for capturing high-dimensional distributions [4, 5, 6, 7, 8] and serving as powerful image priors for solving inverse problems [9, 10, 11, 12, 13, 14]. However, training these models typically requires large datasets of clean, high-quality images, which are often expensive or impractical to obtain [15]. For example, in fluorescence microscopy, low signal-to-noise ratio (SNR) images $\boldsymbol{y}$ are prevalent due to observational corruptions and measurement noise (e.g., motion blur, camera readout noise), while high-SNR images $\boldsymbol{x}$ are scarce because of the long exposure times required and the risk of phototoxicity.

Recent works have sought to address this challenge by learning clean DMs directly from noisy measurements. For example, RenderDiffusion [16] and DWFM [17] incorporate the imaging forward model, $f(\cdot):\boldsymbol{x}\to\boldsymbol{y}$ , into the generative process to train 3D DMs from 2D images. However, these methods assume multiple measurements per object, limiting their applicability in many real-world scenarios. SURE-Score [18] uses Stein’s unbiased risk estimate (SURE) as a regularizer for learning DMs from noisy data. However, this approach heavily constrains the model’s parameter space, leading to suboptimal performance when the data are severely corrupted, such as in inpainting tasks. Ambient Diffusion [15] introduces a further corruption strategy that masks additional pixels to help the DM learn to reconstruct missing information. While effective for inpainting, it struggles with measurement noise and other types of image corruption.

To address these limitations, EMDiffusion [19] proposes a general expectation-maximization (EM) framework, achieving state-of-the-art performance in learning clean DMs from various types of corrupted images. This framework alternates between an expectation step (E-step), which uses a known diffusion prior to sample the posterior of clean images $\boldsymbol{x}$ given noisy measurements $\boldsymbol{y}$ , and a maximization step (M-step), where the recovered clean images are used to refine the diffusion model. However, its E-step relies on an approximate diffusion posterior sampling (DPS) algorithm [11], which lacks theoretical convergence guarantees and sometimes produces low-quality reconstructions, limiting its ability to capture accurate clean image distributions.

In this work, we propose a principled EM framework for learning clean DMs from noisy measurements with arbitrary corruption types. We enhance the E-step of EMDiffusion by upgrading the DPS algorithm to a plug-and-play Monte Carlo (PMC) method, which offers provable convergence guarantees and improved posterior sampling accuracy. This, in turn, leads to more precise learning of clean generative models. Extensive experiments on noisy CIFAR-10 and CelebA datasets with various corruption types demonstrate the effectiveness of our approach.

II Methodology

II-A Preliminaries

Diffusion models (DMs) [2, 1] capture data distributions by approximating the score function, i.e., the gradient of the log-density $\nabla_{\boldsymbol{x}}\log p(\boldsymbol{x})$ , using a deep neural network $\boldsymbol{s}_{\theta}$ parameterized by $\theta$ . To achieve this, DMs define a forward stochastic differential equation (SDE) that progressively injects noise, together with its time reversal, a reverse-time SDE that progressively removes the noise:

\begin{split}&\text{Forward: }\mathrm{d}\boldsymbol{x}=-\frac{\beta_{t}}{2}\boldsymbol{x}\,\mathrm{d}t+\sqrt{\beta_{t}}\,\mathrm{d}\boldsymbol{w},\\ &\text{Reverse: }\mathrm{d}\boldsymbol{x}=\left[-\frac{\beta_{t}}{2}\boldsymbol{x}-\beta_{t}\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{x}_{t}\right)\right]\mathrm{d}t+\sqrt{\beta_{t}}\,\mathrm{d}\overline{\boldsymbol{w}},\end{split}

(1)

where $\beta_{t}\in(0,1)$ is the noise schedule, $t\in[0,T]$ represents the time index, and $\boldsymbol{w}$ and $\overline{\boldsymbol{w}}$ are the forward and backward Wiener processes, respectively. Once a DM is trained on large-scale clean data, diverse samples can be generated by replacing $\nabla_{\boldsymbol{x}_{t}}\log p_{t}(\boldsymbol{x}_{t})$ in Eq. 1 with the learned score function $\boldsymbol{s}_{\theta}(\boldsymbol{x}_{t},\sigma(t))$ , where $\sigma(t)$ is the pre-defined noise strength corresponding to $\beta_{t}$ .
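As a small illustration of the forward SDE in Eq. 1, the sketch below applies an Euler-Maruyama discretization with unit step size to a toy one-dimensional bimodal dataset. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps, the toy data, and the function name `forward_sde` are assumptions made for this example only; they are not settings reported in this paper.

```python
import numpy as np

def forward_sde(x0, betas, rng):
    """Euler-Maruyama steps of dx = -(beta_t / 2) x dt + sqrt(beta_t) dw with dt = 1."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * beta * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)

# Toy "clean data": two narrow modes at -2 and +2 (assumed for illustration).
x0 = rng.choice([-2.0, 2.0], size=20_000) + 0.1 * rng.standard_normal(20_000)

# Assumed linear noise schedule with T = 1000 steps.
betas = np.linspace(1e-4, 0.02, 1000)

xT = forward_sde(x0, betas, rng)
print(f"mean = {xT.mean():.3f}, variance = {xT.var():.3f}")
```

Under these assumptions the two modes are washed out and the terminal samples have mean close to 0 and variance close to 1, which is the Gaussian reference distribution from which the reverse-time SDE starts.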

By Bayes’ rule, $p(\boldsymbol{x}|\boldsymbol{y})\propto p(\boldsymbol{y}|\boldsymbol{x})p(\boldsymbol{x})$ , the reverse-time SDE can also be adapted to sample from the posterior distribution of images $\boldsymbol{x}$ conditioned on measurements $\boldsymbol{y}$ . In this case, $p(\boldsymbol{y}|\boldsymbol{x})$ acts as a data-fidelity term to ensure the reconstructed images match the measurements. The score function in Eq. 1 is then replaced by its conditional version:

\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{x}_{t}|\boldsymbol{y}\right)=\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{x}_{t}\right)+\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{y}|\boldsymbol{x}_{t}\right),

(2)

where $p_{t}\left(\boldsymbol{y}|\boldsymbol{x}_{t}\right)$ is a time-dependent data-fidelity term distinct from $p(\boldsymbol{y}|\boldsymbol{x})$ and is generally intractable. The challenge in posterior sampling via diffusion models is bridging $p_{t}\left(\boldsymbol{y}|\boldsymbol{x}_{t}\right)$ with $p(\boldsymbol{y}|\boldsymbol{x})$ . The widely used diffusion posterior sampling (DPS) algorithm [11] makes the empirical assumption that:

\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{y}|\boldsymbol{x}_{t}\right)\approx\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{y}|\hat{\boldsymbol{x}}_{0}\right),\quad\text{where}\ \hat{\boldsymbol{x}}_{0}=\mathbb{E}[\boldsymbol{x}_{0}|\boldsymbol{x}_{t}].

(3)

However, this approximation lacks theoretical guarantees for convergence to the true posterior distribution, which can result in inaccurate sampling, even in simple cases.
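One simple case where the gap can be computed in closed form is a scalar Gaussian model. The sketch below assumes a prior $x_{0}\sim\mathcal{N}(0,v)$, a variance-exploding perturbation $x_{t}=x_{0}+\sigma_{t}\epsilon$ (chosen for simplicity instead of the SDE in Eq. 1), and a measurement $y=x_{0}+n$ with $n\sim\mathcal{N}(0,s^{2})$. In this setting $p_{t}(y|x_{t})$ is Gaussian with variance $s^{2}+\mathrm{Var}[x_{0}|x_{t}]$, whereas the approximation in Eq. 3 keeps only $s^{2}$. All function and variable names are hypothetical.

```python
import numpy as np

def exact_guidance(x_t, y, sigma_t, s, v=1.0):
    """Exact d/dx_t log p_t(y | x_t) for x0 ~ N(0, v), x_t = x0 + sigma_t*eps, y = x0 + s*n."""
    shrink = v / (v + sigma_t**2)                   # d x0_hat / d x_t
    x0_hat = shrink * x_t                           # posterior mean E[x0 | x_t]
    post_var = v * sigma_t**2 / (v + sigma_t**2)    # posterior variance Var[x0 | x_t]
    return shrink * (y - x0_hat) / (s**2 + post_var)

def dps_guidance(x_t, y, sigma_t, s, v=1.0):
    """Gradient of log p(y | x0_hat) w.r.t. x_t, i.e. the approximation in Eq. 3."""
    shrink = v / (v + sigma_t**2)
    x0_hat = shrink * x_t
    return shrink * (y - x0_hat) / s**2             # posterior variance is ignored

x_t, y, s = 0.3, 1.0, 0.2
for sigma_t in (1.0, 0.1, 0.01):
    g_exact = exact_guidance(x_t, y, sigma_t, s)
    g_dps = dps_guidance(x_t, y, sigma_t, s)
    print(f"sigma_t={sigma_t}: exact={g_exact:.3f}, DPS={g_dps:.3f}, "
          f"ratio={g_dps / g_exact:.2f}")
```

In this toy model the two gradients differ by the factor $(s^{2}+\mathrm{Var}[x_{0}|x_{t}])/s^{2}$, which is large at high noise levels and approaches 1 only as $\sigma_{t}\to 0$.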

II-B Principled Expectation-Maximization Framework

In this paper, we propose a principled expectation-maximization (EM) framework for learning clean diffusion models from noisy measurements. This approach inherits the basic procedure of the standard EM algorithm [20, 21] and EMDiffusion [19], alternating between two steps: 1) the E-step, where we sample the posterior distribution of clean images from noisy measurements using a known diffusion model, and 2) the M-step, where we refine the DM using these recovered samples. This iterative process guides the DM towards a local optimum, progressively improving both the model and the quality of the recovered images.

Require: DM $\boldsymbol{s}_{\theta}$ , measurements $(\boldsymbol{y},f)$ , few high-quality data $\boldsymbol{x}$ , iterations $N$ , timesteps $T$ , schedule $\left\{\beta_{t}\right\}_{t=1}^{T}$
1: Initialize $\boldsymbol{s}_{\theta}$ with $\boldsymbol{x}$ through DSM [22]
2: for $i=1$ to $N$ do
3:   $\hat{\boldsymbol{x}}_{0}\leftarrow\text{SamplingStep}(\boldsymbol{s}_{\theta},\boldsymbol{y},f,\left\{\beta_{t}\right\}_{t=1}^{T})$ {Sec. II-B1}
4:   $\boldsymbol{s}_{\theta}\leftarrow\text{RefiningStep}(\boldsymbol{s}_{\theta},\hat{\boldsymbol{x}}_{0},\left\{\beta_{t}\right\}_{t=1}^{T})$ {Sec. II-B2}
5: end for
6: return Learned $\boldsymbol{s}_{\theta}$

Algorithm 1 Principled Expectation-Maximization Framework
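The control flow of Algorithm 1 can be illustrated on a toy problem in which the "diffusion prior" is replaced by a one-dimensional Gaussian with unknown mean and variance, so that both steps have closed forms. This is only a sketch of the alternation under that strong simplification: `sampling_step` draws exact Gaussian posterior samples rather than running a diffusion sampler, `refining_step` refits two scalars rather than a score network, and all numerical values are assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy ground truth and noisy measurements y = x + n (values assumed for illustration).
true_mean, true_var, noise_std = 2.0, 1.0, 0.5
x_clean = true_mean + np.sqrt(true_var) * rng.standard_normal(20_000)
y = x_clean + noise_std * rng.standard_normal(x_clean.shape)

def sampling_step(mean, var, y, noise_std, rng):
    """E-step: one posterior sample of x per measurement under the current Gaussian prior."""
    gain = var / (var + noise_std**2)
    post_mean = mean + gain * (y - mean)
    post_var = gain * noise_std**2
    return post_mean + np.sqrt(post_var) * rng.standard_normal(y.shape)

def refining_step(x_hat):
    """M-step: refit the prior parameters to the posterior samples."""
    return x_hat.mean(), x_hat.var()

# Line 1 of Algorithm 1: initialize the prior from a few clean samples.
mean, var = refining_step(x_clean[:50])

# Lines 2-5: alternate posterior sampling and prior refinement.
for i in range(20):
    x_hat = sampling_step(mean, var, y, noise_std, rng)
    mean, var = refining_step(x_hat)

print(f"learned prior: mean = {mean:.3f}, variance = {var:.3f}")
```

In this simplified setting the learned mean and variance approach those of the clean data even though only noisy measurements are used after initialization. Because the M-step is fit to posterior samples rather than posterior means, the fitted variance is not systematically shrunk.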

TABLE I: Quantitative comparisons of inverse imaging and learned priors. Best and second-best results are highlighted in bold and underline. DPS w/ clean prior (shown in gray) serves as the upper bound, since a clean prior is not accessible to the other methods.

Method	CIFAR10-Denoising			CIFAR10-Inpainting			CelebA-Deblurring
Method	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$	PSNR $\uparrow$	LPIPS $\downarrow$	FID $\downarrow$
measurements	18.05	0.047	132.59	13.49	0.295	234.47	22.47	0.365	72.83
DPS w/ clean prior	25.91	0.010	7.08	25.44	0.008	7.08	29.05	0.013	10.24
Noise2Self [23]	21.32	0.227	92.06	-	-	-	-	-	-
SURE-Score [18]	22.42	0.138	132.61	15.75	0.182	220.01	22.07	0.383	191.96
Ambient Diffusion [15]	-	-	-	20.57	0.027	28.88	-	-	-
EMDiffusion [19]	23.16	0.022	86.47	24.70	0.009	21.08	23.74	0.103	91.89
Ours	24.77	0.016	72.84	22.74	0.023	25.46	26.58	0.081	79.43

Figure 1: Qualitative results of image reconstruction on (a) CIFAR-10 noisy measurements and (b) CIFAR-10 masked measurements. Columns in (a) CIFAR10, Denoising: Noisy Measurement, SURE-Score [18], EMDiffusion [19], Ours, DPS [11] w/ Clean Prior, Ground Truth. Columns in (b) CIFAR10, Inpainting: Masked Measurement, Ambient Diffusion [15], EMDiffusion [19], Ours, DPS [11] w/ Clean Prior, Ground Truth.

II-B1 E-step, principled DM-based posterior sampling

The E-step employs the plug-and-play Monte Carlo (PMC) algorithm for principled diffusion posterior sampling. Unlike conventional DPS, PMC is grounded in the theories of plug-and-play (PnP) optimization and Langevin Monte Carlo methods, providing non-asymptotic stationarity guarantees in terms of Fisher information. Assuming a minimum mean square error (MMSE) denoiser, $D_{\sigma}$ , as an implicit image prior, the PnP framework is expressed as:

\boldsymbol{x}_{t+1}=D_{\sigma}(\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t})),

(4)

where $\gamma>0$ is the step size, and $\sigma$ controls the denoising strength. According to Tweedie’s formula [24], the MMSE denoiser can be formulated as:

\begin{split}D_{\sigma}(\boldsymbol{x})&=\boldsymbol{x}+\sigma^{2}\nabla_{\boldsymbol{x}}\log p_{\sigma}(\boldsymbol{x}),\\ \text{where}\ p_{\sigma}(\boldsymbol{x})&=\int p(\boldsymbol{\mu})\mathcal{N}(\boldsymbol{x}-\boldsymbol{\mu};0,\sigma^{2}\mathbf{I})\,\mathrm{d}\boldsymbol{\mu}.\end{split}

(5)

Here, $p_{\sigma}(\boldsymbol{x})$ represents the prior distribution convolved with Gaussian noise of standard deviation $\sigma$ . Though intractable, its score $\nabla_{\boldsymbol{x}}\log p_{\sigma}(\boldsymbol{x})$ can be approximated by the score-based diffusion model $\boldsymbol{s}_{\theta}(\boldsymbol{x},\sigma)$ . Consequently, the PnP optimization in Eq. 4 can be rewritten as:

\begin{split}\boldsymbol{x}_{t+1}=&\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t})\\ &+\sigma^{2}\nabla_{\boldsymbol{x}}\log p_{\sigma}(\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t}))\\ \approx&\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t})\\ &+\sigma^{2}\boldsymbol{s}_{\theta}(\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t}),\sigma(t)).\end{split}

(6)

Introducing additional Brownian motion $\boldsymbol{\omega}_{t}$ and replacing $\sigma^{2}$ with a scaling factor $\alpha_{k}$ in front of the score function leads to the PMC formulation, as originally described in [25]:

\begin{split}\boldsymbol{x}_{t+1}=&\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t})\\ &+\alpha_{k}\boldsymbol{s}_{\theta}(\boldsymbol{x}_{t}+\gamma\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{y}|\boldsymbol{x}_{t}),\sigma(t))+\boldsymbol{\omega}_{t},\end{split}

(7)

where $k$ indexes the EM iterations. As the diffusion model’s accuracy improves, $\alpha_{k}$ increases from a small value (e.g., $10^{-3}$ ) to 1. Since PMC does not rely on approximations of the time-dependent data fidelity $p_{t}\left(\boldsymbol{y}|\boldsymbol{x}_{t}\right)$ , it avoids the approximation error of Eq. 3 and retains the stationarity guarantees described above.
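A PMC-style iteration can be checked on a scalar Gaussian problem where the posterior is known. The sketch below makes several assumptions that are not taken from this paper: the score network is replaced by the exact score $\nabla_{x}\log p_{\sigma}(x)$ of a smoothed Gaussian prior, so it enters with the sign given by Tweedie’s formula in Eq. 5; the score scaling is set equal to the step size $\gamma$ ; and the injected noise has variance $2\gamma$ , as in a standard Langevin discretization. The function name `pmc_sample` and all numerical values are hypothetical.

```python
import numpy as np

def pmc_sample(y, score_fn, noise_std, gamma, n_steps, n_chains, rng):
    """Run independent chains of a PMC-style update for the model y = x + n."""
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        grad_lik = (y - x) / noise_std**2      # grad_x log p(y | x) for Gaussian noise
        z = x + gamma * grad_lik               # data-fidelity step
        # Prior step with the score evaluated at z, plus Langevin noise (assumed scaling).
        x = z + gamma * score_fn(z) + np.sqrt(2.0 * gamma) * rng.standard_normal(n_chains)
    return x

prior_var, sigma, noise_std, y = 1.0, 0.05, 0.5, 1.0

def score_fn(z):
    """Exact score of the prior N(0, prior_var) smoothed with noise level sigma."""
    return -z / (prior_var + sigma**2)

rng = np.random.default_rng(0)
samples = pmc_sample(y, score_fn, noise_std, gamma=0.005,
                     n_steps=1000, n_chains=5000, rng=rng)

# Closed-form Gaussian posterior for comparison.
post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_std**2)
post_mean = post_var * y / noise_std**2
print(f"sampled mean = {samples.mean():.3f} (posterior mean {post_mean:.3f})")
print(f"sampled var  = {samples.var():.3f} (posterior var  {post_var:.3f})")
```

Under these assumptions the empirical mean and variance of the chains agree with the analytic posterior up to a small discretization and smoothing bias, without any approximation of $p_{t}(\boldsymbol{y}|\boldsymbol{x}_{t})$ .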

II-B2 M-step, refining diffusion priors

After obtaining the posterior samples of clean images (Sec. II-B1), we refine the diffusion model using these reconstructed images. The M-step is similar to standard diffusion model training with clean data and uses the denoising score matching (DSM) loss to update the model parameters:

\theta^{*}=\underset{\theta}{\arg\min}\mathbb{E}_{t,\boldsymbol{x}_{t},\hat{\boldsymbol{x}}}\left[\left\|\boldsymbol{s}_{\theta}(\boldsymbol{x}_{t},t)-\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{x}_{t}|\hat{\boldsymbol{x}})\right\|_{2}^{2}\right],

(8)

where $\hat{\boldsymbol{x}}$ represents the posterior samples, and $\boldsymbol{x}_{t}$ are the noisy versions of $\hat{\boldsymbol{x}}$ generated by the forward SDE in Eq. 1.
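The DSM objective in Eq. 8 can be sketched for a single noise level with a linear score model, for which the minimizer is an ordinary least-squares fit. The example assumes scalar posterior samples $\hat{x}\sim\mathcal{N}(m,v)$ and the Gaussian perturbation kernel of the forward SDE, $x_{t}=\sqrt{\bar{\alpha}}\,\hat{x}+\sqrt{1-\bar{\alpha}}\,\epsilon$ , where $\bar{\alpha}$ is the accumulated signal scaling. The linear model and all values are assumptions for illustration and do not reflect the network used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, v, alpha_bar = 2.0, 4.0, 0.5    # assumed toy values
n = 200_000

# Stand-in for posterior samples from the E-step, and their noisy versions.
x_hat = m + np.sqrt(v) * rng.standard_normal(n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_bar) * x_hat + np.sqrt(1.0 - alpha_bar) * eps

# DSM regression target: grad_{x_t} log p(x_t | x_hat) for the Gaussian kernel.
target = -(x_t - np.sqrt(alpha_bar) * x_hat) / (1.0 - alpha_bar)

# Linear score model s(x) = a * x + b; minimizing the DSM loss is least squares.
design = np.stack([x_t, np.ones(n)], axis=1)
(a, b), *_ = np.linalg.lstsq(design, target, rcond=None)

# Exact score of the marginal N(sqrt(alpha_bar) * m, alpha_bar * v + 1 - alpha_bar).
marg_var = alpha_bar * v + (1.0 - alpha_bar)
a_true = -1.0 / marg_var
b_true = np.sqrt(alpha_bar) * m / marg_var
print(f"fitted: a = {a:.3f}, b = {b:.3f}; exact: a = {a_true:.3f}, b = {b_true:.3f}")
```

Although the regression target is the conditional score $\nabla_{\boldsymbol{x}_{t}}\log p(\boldsymbol{x}_{t}|\hat{\boldsymbol{x}})$ , the fitted model recovers the score of the marginal distribution of $\boldsymbol{x}_{t}$ , which is the property that makes DSM usable for training [22].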

To speed up training, we do not retrain the diffusion model from scratch in each EM iteration. Instead, we fine-tune the model parameters $\theta$ from the previous iteration during the early stages. In the final 1-3 EM iterations, we reset the model to eliminate any influence of the low-quality posterior samples from earlier iterations, then retrain it from scratch.

II-B3 Initialization of the EM iterations

A good initialization of DMs is crucial for ensuring the EM algorithm converges to a good local optimum. In our method, we pre-train the DMs using 50 randomly selected clean images before the first EM iteration.

III Experiments

III-A Experimental Settings

Datasets. We evaluate our method on two datasets with distinct characteristics: CIFAR10 [26] at a resolution of $32\times 32$ and CelebA [27] at $64\times 64$ . Noisy measurements $\boldsymbol{y}$ are generated by corrupting clean images $\boldsymbol{x}$ from the training sets. In the first six iterations, 5,000 measurements are randomly selected for posterior sampling and diffusion model refinement. In the last three iterations, all measurements are used for posterior sampling and for training the diffusion model from scratch. To evaluate image reconstruction, we randomly choose 250 clean images from the test sets; to evaluate image generation, we use the original training sets.

Baselines. We compare our method with four related baselines: Ambient Diffusion [15], SURE-Score [18], EMDiffusion [19], and DPS with a clean prior [11]. Ambient Diffusion, SURE-Score, and EMDiffusion have similar settings to our method, as they do not rely on large-scale clean data to train DMs, but instead train DMs directly from noisy measurements. In contrast, DPS [11] leverages a pre-trained clean diffusion prior for posterior sampling, representing the performance upper bound for our approach.

Figure 2: Qualitative results on CelebA deblurring. For each image, the blur kernel is a $9\times 9$ Gaussian blur kernel. Within the principled iterative learning framework, the diffusion model learns cleaner score-based priors, improving the quality of reconstructed images. Columns: Masked Measurement, SURE-Score [18], EMDiffusion [19], Ours (1st Iteration through Final Iteration), DPS [11] w/ Clean Prior, Ground Truth.

Figure 3: Comparison of uncurated samples generated from diffusion models trained by EMDiffusion and the proposed method. Panels: Training Data, EMDiffusion, Ours.

Metrics. We evaluate image reconstruction quality using two metrics: peak signal-to-noise ratio (PSNR) and learned perceptual image patch similarity (LPIPS). Additionally, we compute the Fréchet Inception Distance (FID) between samples generated by the trained DMs and the reserved clean training set to assess the quality of image generation.
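For reference, PSNR follows directly from the mean squared error. The sketch below uses the standard definition $10\log_{10}(R^{2}/\mathrm{MSE})$ , where the dynamic range $R$ is an assumption (set to 1 here) because the pixel range used in the experiments is not stated. LPIPS and FID require pretrained networks and are not sketched.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; data_range is the assumed dynamic range R."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")    # identical images
    return 10.0 * np.log10(data_range**2 / mse)

# A constant error of 0.1 on a [0, 1] image gives an MSE of 0.01.
reference = np.full((32, 32, 3), 0.5)
estimate = reference + 0.1
value = psnr(reference, estimate)
print(f"PSNR = {value:.2f} dB")
```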

III-B Image Reconstruction

We evaluate our principled expectation-maximization (EM) framework on two datasets across three representative imaging inverse problems: image denoising, random inpainting, and Gaussian deblurring.

Image denoising. For image denoising, we add Gaussian noise $\boldsymbol{n}\sim\mathcal{N}(0,\sigma^{2}\boldsymbol{I})$ with $\sigma=0.2$ . As shown in Table I and Fig. 1, our method outperforms all baselines trained without clean data, improving PSNR from 23.16 to 24.77 over EMDiffusion. The reconstructed images exhibit notably fewer artifacts than those of the current state of the art, EMDiffusion, demonstrating the advantage of moving beyond empirical approximations for time-dependent data fidelity.

Image inpainting. For image inpainting, we randomly mask 60% of the pixels in the images. As shown in Table I and Fig. 1, the posterior samples closely match the ground truths, highlighting the robustness of our method. Although our method achieves the second-best performance in the random inpainting task, this is due to EMDiffusion’s simple approximation being particularly effective for this problem [28]. In EMDiffusion, the measurement $\boldsymbol{y}$ guides the updates of all pixels, through $\nabla_{\boldsymbol{x}_{t}}\log p_{t}\left(\boldsymbol{y}|\hat{\boldsymbol{x}}_{0}\right)$ , not just the unmasked ones.

Image deblurring. For image deblurring, we apply a $9\times 9$ Gaussian blur kernel with a standard deviation of 2. As shown in Table I and Fig. 2, our method consistently outperforms all baselines. The posterior samples progressively approximate the ground truths throughout the iterations, verifying the effectiveness of the EM framework.
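The three corruption models above can be sketched as simple forward operators. The noise level, masking ratio, and blur kernel follow the settings stated in this section. The remaining details are assumptions for illustration: the mask is shared across color channels, the blur uses zero padding at the image borders, and no measurement noise is added after masking or blurring. The function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(x, sigma, rng):
    """Denoising measurement: y = x + n with n ~ N(0, sigma^2 I)."""
    return x + sigma * rng.standard_normal(x.shape)

def random_mask(x, drop_fraction, rng):
    """Inpainting measurement: zero out a random fraction of pixels (mask shared by channels)."""
    keep = rng.random(x.shape[:2]) >= drop_fraction
    return x * keep[..., None], keep

def gaussian_kernel_1d(size, std):
    """Normalized 1D Gaussian kernel; the 2D kernel is its outer product with itself."""
    coords = np.arange(size) - (size - 1) / 2.0
    kernel = np.exp(-0.5 * (coords / std) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(x, size, std):
    """Deblurring measurement: separable Gaussian blur with zero padding (assumed)."""
    kernel = gaussian_kernel_1d(size, std)

    def blur_axis(arr, axis):
        return np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), axis, arr)

    return blur_axis(blur_axis(x, 0), 1)

rng = np.random.default_rng(0)
x = rng.random((64, 64, 3))                  # stand-in for a clean image in [0, 1]

y_noisy = add_gaussian_noise(x, 0.2, rng)    # sigma = 0.2
y_masked, keep = random_mask(x, 0.6, rng)    # 60% of pixels masked
y_blur = gaussian_blur(x, 9, 2.0)            # 9x9 kernel, standard deviation 2
print(f"masked fraction = {1.0 - keep.mean():.3f}")
```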

III-C Learned Diffusion Priors

In Table I and Fig. 3, we compare the quantitative and qualitative results of the diffusion priors learned by our method against the baselines. The FID scores in Table I show that our method learns the best prior among the baselines for denoising on CIFAR10 (72.84 vs. 86.47 for EMDiffusion) and deblurring on CelebA (79.43 vs. 91.89), and the second-best prior for inpainting on CIFAR10 (25.46 vs. 21.08 for EMDiffusion). The progressively improving quality of the posterior samples in Fig. 2 suggests that the diffusion model is effectively approximating the true prior distribution, validating the effectiveness of our iterative learning framework. As shown in the uncurated samples in Fig. 3, our method significantly reduces sharp artifacts compared to EMDiffusion.

IV Conclusion

In this paper, we present a principled expectation-maximization framework for learning clean diffusion models from partial measurements. Each iteration of the framework employs a plug-and-play Monte Carlo algorithm to reconstruct posterior image samples from the measurements, followed by refining the diffusion model with these recovered images. Our method outperforms existing baselines on denoising and deblurring, in both image reconstruction and learned prior quality, and remains competitive on inpainting. For future work, we aim to extend this approach to more challenging scenarios, such as extremely sparse or noisy measurements in scientific applications.

References

[1] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
[2] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” arXiv preprint arXiv:2011.13456, 2020.
[3] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in International conference on machine learning. PMLR, 2015, pp. 2256–2265.
[4] A. Q. Nichol and P. Dhariwal, “Improved denoising diffusion probabilistic models,” in International Conference on Machine Learning. PMLR, 2021, pp. 8162–8171.
[5] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning. PMLR, 2021, pp. 8821–8831.
[6] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, “Hierarchical text-conditional image generation with clip latents,” arXiv preprint arXiv:2204.06125, 2022.
[7] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
[8] C. Saharia, W. Chan, H. Chang, C. Lee, J. Ho, T. Salimans, D. Fleet, and M. Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
[9] B. T. Feng, J. Smith, M. Rubinstein, H. Chang, K. L. Bouman, and W. T. Freeman, “Score-based diffusion models as principled priors for inverse imaging,” arXiv preprint arXiv:2304.11751, 2023.
[10] G. Zhang, J. Ji, Y. Zhang, M. Yu, T. Jaakkola, and S. Chang, “Towards coherent image inpainting using denoising diffusion implicit models,” in International Conference on Machine Learning. PMLR, 2023, pp. 41 164–41 193.
[11] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, “Diffusion posterior sampling for general noisy inverse problems,” arXiv preprint arXiv:2209.14687, 2022.
[12] H. Chung, B. Sim, D. Ryu, and J. C. Ye, “Improving diffusion models for inverse problems using manifold constraints,” Advances in Neural Information Processing Systems, vol. 35, pp. 25 683–25 696, 2022.
[13] Y. Song, L. Shen, L. Xing, and S. Ermon, “Solving inverse problems in medical imaging with score-based generative models,” arXiv preprint arXiv:2111.08005, 2021.
[14] B. Kawar, N. Elata, T. Michaeli, and M. Elad, “Gsure-based diffusion model training with corrupted data,” arXiv preprint arXiv:2305.13128, 2023.
[15] G. Daras, K. Shah, Y. Dagan, A. Gollakota, A. G. Dimakis, and A. Klivans, “Ambient diffusion: Learning clean distributions from corrupted data,” 2023.
[16] T. Anciukevičius, Z. Xu, M. Fisher, P. Henderson, H. Bilen, N. J. Mitra, and P. Guerrero, “Renderdiffusion: Image diffusion for 3d reconstruction, inpainting and generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 608–12 618.
[17] A. Tewari, T. Yin, G. Cazenavette, S. Rezchikov, J. B. Tenenbaum, F. Durand, W. T. Freeman, and V. Sitzmann, “Diffusion with forward models: Solving stochastic inverse problems without direct supervision,” arXiv preprint arXiv:2306.11719, 2023.
[18] A. Aali, M. Arvinte, S. Kumar, and J. I. Tamir, “Solving inverse problems with score-based generative priors learned from noisy data,” arXiv preprint arXiv:2305.01166, 2023.
[19] W. Bai, Y. Wang, W. Chen, and H. Sun, “An expectation-maximization algorithm for training clean diffusion models from corrupted observations,” arXiv preprint arXiv:2407.01014, 2024.
[20] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the em algorithm,” Journal of the royal statistical society: series B (methodological), vol. 39, no. 1, pp. 1–22, 1977.
[21] A. Gao, J. Castellanos, Y. Yue, Z. Ross, and K. Bouman, “Deepgem: Generalized expectation-maximization for blind inversion,” Advances in Neural Information Processing Systems, vol. 34, pp. 11 592–11 603, 2021.
[22] P. Vincent, “A connection between score matching and denoising autoencoders,” Neural computation, vol. 23, no. 7, pp. 1661–1674, 2011.
[23] J. Batson and L. Royer, “Noise2self: Blind denoising by self-supervision,” in International Conference on Machine Learning. PMLR, 2019, pp. 524–533.
[24] B. Efron, “Tweedie’s formula and selection bias,” Journal of the American Statistical Association, vol. 106, no. 496, pp. 1602–1614, 2011.
[25] Y. Sun, Z. Wu, Y. Chen, B. T. Feng, and K. L. Bouman, “Provable probabilistic imaging using score-based generative priors,” arXiv preprint arXiv:2310.10835, 2023.
[26] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.
[27] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proceedings of International Conference on Computer Vision (ICCV), December 2015.
[28] Z. Dou and Y. Song, “Diffusion posterior sampling for linear inverse problem solving: A filtering perspective,” in The Twelfth International Conference on Learning Representations, 2024.