Poisson flow consistency models for low-dose CT image denoising

Dennis Hein, Adam Wang, and Ge Wang This study was financially supported by MedTechLabs, GE HealthCare, and the Göran Gustafsson foundation (grant no. 2114).D. Hein ([email protected]) is with the Department of Physics, KTH Royal Institute of Technology, Stockholm, Sweden and MedTechLabs, BioClinicum, Karolinska University Hospital, Stockholm, Sweden. A. Wang is with the Department of Radiology and the Department of Electrical Engineering, Stanford University, Stanford, CA, USA. G. Wang is with the Department of Biomedical Engineering, School of Engineering, Biomedical Imaging Center, Center for Biotechnology and Interdisciplinary Studies, Rensselaer Polytechnic Institute, Troy, NY, USA.

Abstract

Diffusion and Poisson flow models have demonstrated remarkable success for a wide range of generative tasks. Nevertheless, their iterative nature results in computationally expensive sampling and the number of function evaluations (NFE) required can be orders of magnitude larger than for single-step methods. Consistency models are a recent class of deep generative models which enable single-step sampling of high quality data without the need for adversarial training. In this paper, we introduce a novel image denoising technique which combines the flexibility afforded in Poisson flow generative models (PFGM)++ with the, high quality, single-step sampling of consistency models. The proposed method first learns a trajectory between a noise distribution and the posterior distribution of interest by training PFGM++ in a supervised fashion. These pre-trained PFGM++ are subsequently “distilled” into Poisson flow consistency models (PFCM) via an updated version of consistency distillation. We call this approach posterior sampling Poisson flow consistency models (PS-PFCM). Our results indicate that the added flexibility of tuning the hyperparameter $D$ , the dimensionality of the augmentation variables in PFGM++, allows us to outperform consistency models, a current state-of-the-art diffusion-style model with $\text{NFE}=1$ on clinical low-dose CT images. Notably, PFCM is in itself a novel family of deep generative models and we provide initial results on the CIFAR-10 dataset.

I Introduction

X-ray computed tomography (CT) is widely used for both diagnosis and treatment planning of a wide range of disease, including stroke, cancer, and cardiovascular disease. Nevertheless, due to the potential risk associated with even low doses of ionizing radiation[1, 2], there is an ongoing effort to reduce the dose to be as low as reasonably achievable whilst maintaining high diagnostic quality[3, 4]. The event of photon-counting CT (PCCT), enables improved low-dose imaging by rejecting the impact of electronic noise, in addition to major improvements in resolution, both in space and energy [5, 6, 7, 8, 9, 10]. However, higher resolution, for a given dose, means fewer photons in each voxel or energy bin, which results in an inevitable increase in image noise. Thus, image reconstruction and post-processing techniques with excellent noise performance are required to fully capitalize on the benefits offered by PCCT.

There are a range of successful approaches to noise suppression in CT including iterative reconstruction, pre- and post-processing, and dual domain methods. Iterative reconstruction [11, 12, 13, 14, 15, 16, 17] has proved to be exceptionally capable at suppressing noise whilst keeping salient features intact. Nevertheless, as with most iterative methods, this approach is usually computationally expensive. Common alternatives include pre-processing[18], post-processing[19, 21, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], or dual-domain methods[36]. Pre-processing, that is prior to image reconstruction, can be a very efficient solution as it is fully agnostic to parameters used in reconstruction such as kernel, matrix size, and field of view (FOV), different combinations of which, routinely used in the clinic, lead to widely different noise characteristics and visually distinct images. However, pre-processing suffers from being computationally expensive since the sinograms are usually significantly larger than the reconstructed images. Post-processing techniques, operating directly on the reconstructed images, do not suffer from this drawback. There are many successful conventional, such as non-local means (NLM) [19, 21] and block-matching 3D (BM3D) [28] filtering, as well as deep learning-based post-processing techniques[20, 22, 23, 25, 24, 26, 27, 29, 30, 31, 32, 33, 34, 35]. There are also methods which take advantage of information from both the sinogram and image domain simultaneously[36], usually referred to as dual-domain.

Diffusion and Poisson flow models have demonstrated remarkable success for a range of generative and image processing tasks, both in the unconditional[37, 38, 39, 40, 41, 42, 43, 44] and conditional[40, 45, 46, 47, 48, 49, 50, 32, 33, 51, 34, 35] setting. Inspired by electrostatics, instead of non-equilibrium thermodynamics, Poisson flow generative models (PFGM)++[44] treats the $N:=n\times n$ dimensional data as electric charges in a $N+D$ -dimensional augmented space. The resulting electric field lines define a trajectory, via an ODE, between the data distribution and an easy-to-sample prior noise distribution. Remarkably, PFGM++ takes a step towards unification of physics-inspired generative models as the authors prove that the training and sampling processes of PFGM++ converges to that of diffusion models (EDM[42]) in the $D\rightarrow\infty,r=\sigma\sqrt{D},$ limit. Hence, the PFGM++ allows additional flexibility, trading-off robustness for rigidity by tuning the hyperparameter $D$ , whilst including diffusion models as a special case. In addition, reference [44] shows that the training and sampling algorithms developed for EDM [42] can directly be re-purposed for training and sampling in PFGM++ via a simple change-of-variables and an updated noise distribution.

Consistency models[52, 53] are a recent class of deep generative models which enables high quality single-step sampling without adversarial training. Iterative sampling is the double-edge sword of diffusion models. On the one hand, it allows for zero-shot editing of data and trading-off image quality for compute. On the other hand, the computational burden of this iterative process can be orders of magnitude higher than for single-step methods and thus limit their use in real-time applications. Consistency models, building upon the continuous-time diffusion models framework, learns a function which directly maps noise to data in a single-step. This is achieved by enforcing self-consistency: any two points on the same solution trajectory map to the same initial point. Reference [52] provides two ways of obtaining a consistency model. First, there is consistency distillation, where a pre-trained diffusion model is “distilled” into a consistency model, using the pre-trained score network to obtain pairs of adjacent points on the solution trajectory. Self-consistency is then enforced by minimizing the distance between the consistency model for these points. Second, there is consistency training, which is a stand-alone model, the Gaussian perturbation kernel is used to generate a pair of adjacent points on the same solution trajectory. Notably, although consistency models offer single-step sampling they do not sacrifice zero-shot data editing nor the ability to trade-off compute for image quality. As expected, when moving from the diffusion model, which is a multi-step sampler ( $\text{NFE}>1$ ), to the consistency model, which is a single-step sampler ( $\text{NFE}=1$ ), there is a quite significant drop in performance–larger for consistency training than for consistency distillation. However, this drop in performance is significantly mitigate by the improved consistency training introduced in [53], outperforming both consistency training and distillation from [52], and achieving close to state-of-the-art results despite maintaining $\text{NFE}=1.$

In this paper, we propose a novel post-processing denoising technique for CT by combining the flexibility afforded by tuning $D$ in PFGM++[44] with the high quality single-step sampling of consistency models [52]. The main contributions of this paper are as follows: 1) We present posterior sampling Poisson flow consistency models (PS-PFCM), a novel framework for image denoising using an updated version of consistency distillation to obtain a single-step sampler from a pre-trained PFGM++. In particular, we first train PFGM++, in a supervised fashion, to learn a trajectory between a noise distribution and the posterior distribution of interest. These PFGM++ are subsequently “distilled” into single-step samplers via an updated version of consistency distillation. 2) We demonstrate that we can tune $D$ to obtain superior performance to consistency models, a current state-of-the-art diffusion-style model with $\text{NFE}=1$ , on clinical low-dose CT images. 3) We note that Poisson flow consistency models (PFCM) are in fact a novel family of deep generative models. Conventionally, deep generative models are first developed for unconditional image generation and then in the same, or subsequent papers, the model is re-purposed for conditional generation via zero-shot editing or supervised learning, solving various inverse problems such as denoising, inpainting, colorization, and super-resolution. Here, we directly apply this novel family of deep generative models to the inverse problem of CT image denoising, using supervised learning to enable sampling from the desired posterior. Nevertheless, to substantiate that PFCM is a valid generative model, we include initial results for the task of unconditional image generation on CIFAR-10 $32\times 32$ dataset[63] in the appendix.

II Methods

II-A Problem formulation

Our objective is to obtain a high quality reconstruction $\bm{\hat{x}}\in\mathbb{R}^{N}$ of $\bm{x}\in\mathbb{R}^{N}$ based on noisy observations $\bm{y}=\mathcal{F}(\bm{x})\in\mathbb{R}^{N}$ , where $\mathcal{F}:\mathbb{R}^{N}\rightarrow\mathbb{R}^{N}$ denotes the noise degradation operator, including factors such as quantum noise [20], and $N:=n\times n.$ We treat this as a statistical inverse problem and, given the prior $\bm{x}\sim p(\bm{x})$ , the solution is given by a sample $\hat{\bm{x}}$ from the posterior $p(\bm{x}|\bm{y})$ . This approach to inverse problem solving is commonly referred to as posterior sampling.

II-B Consistency models

Consistency models [52, 53] build on the continuous-time formulation of diffusion models [40, 42], which realize generative trajectories by progressively perturbing the data distribution, $p_{\text{data}}$ , into a noise distribution via a stochastic differential equation (SDE) and then learning to reverse the process, generating data from noise. Remarkably, this SDE has a corresponding ordinary differential equation (ODE) called the probability flow (PF) ODE [40]. Using the settings in [42] the PF ODE takes the form

d\bm{x}=-\sigma\nabla_{\bm{x}}\log p_{\sigma}(\bm{x})dt,

(1)

for $\sigma\in[\sigma_{\text{min}},\sigma_{\text{{max}}}]$ , where $\nabla_{\bm{{x}}}\log p_{\sigma}(\bm{x})$ is the score function of the perturbed data distribution $p_{\sigma}(\bm{x})$ , and $\sigma(t)$ is the predefined noise scales. $\sigma_{\text{min}}$ is chosen sufficiently small, but not zero for numerical stability, such that $p_{\sigma_{\text{min}}}(\bm{x})\approx p_{\text{data}}(\bm{x})$ and $\sigma_{\text{max}}$ sufficiently large such that $p_{\sigma_{\text{max}}}(\bm{x})\approx\pi(\bm{x})$ , some unstructured, easy-to-sample, noise distribution. In this case $\pi(\bm{x}):=\mathcal{N}(\bm{0},\sigma_{\text{max}}\bm{I}).$

In particular, for the solution trajectory $\{\bm{x}_{\sigma}\}_{\sigma\in[\sigma_{\text{min}},\sigma_{\text{max}}]}$ consistency models set out to learn a consistency function

f(\bm{x}_{\sigma},\sigma)\mapsto\bm{x}_{\sigma_{\text{min}}}

(2)

by enforcing self-consistency

f(\bm{x}_{\sigma},\sigma)=f(\bm{x}_{\sigma^{\prime}},\sigma^{\prime}),\forall\sigma,\sigma^{\prime}\in[\sigma_{\text{min}},\sigma_{\text{max}}].

(3)

Let $F_{\bm{\theta}}(\bm{x},t)$ denote a free-form deep neural network, the desired solution will then be enforced via the boundary condition

f_{\bm{\theta}}(\bm{x},\sigma)=c_{\text{skip}}(\sigma)\bm{x}+c_{\text{out}}(\sigma)F_{\bm{\theta}}(\bm{x},\sigma)

(4)

where $c_{\text{skip}}(\sigma)$ and $c_{\text{out}}(\sigma)$ are differentiable functions such that $c_{\text{skip}}(\epsilon)=1$ and $c_{\text{out}}(\epsilon)=0$ and $f_{\bm{\theta}}(\bm{x},\sigma)$ denotes the deep neural network used to approximate the consistency function. To learn the consistency function, the domain $[\sigma_{\text{min}},\sigma_{\text{max}}]$ is discretized using a sequence of noise scales $\sigma_{\text{min}}=\sigma_{1}<\sigma_{2}<...<\sigma_{N}=\sigma_{\text{max}}.$ Following [42], reference [52] sets $\sigma_{i}=(\sigma_{\text{min}}^{1/\rho}+\frac{i-1}{N-1}(\sigma_{\text{max}}^{1/\rho}-\sigma_{\text{min}}^{1/\rho}))^{\rho}$ with $\rho=7$ for $i=1,...,N.$ Consistency models are then trained using the consistency matching loss

\mathbb{E}\left[\lambda(\sigma_{i})d(f_{\bm{\theta}}(\bm{x}_{\sigma_{i+1}},\sigma_{i+1}),f_{\bm{\theta}^{-}}(\breve{\bm{x}}_{\sigma_{i}},\sigma_{i}))\right],

(5)

where

\breve{\bm{x}}_{\sigma_{i}}=\bm{x}_{\sigma_{i+1}}-(\sigma_{i}-\sigma_{i+1})\sigma_{i+1}\nabla_{\bm{x}}\log p_{\sigma_{i+1}}(\bm{x})|_{\bm{x}=\bm{x}_{\sigma_{i+1}}},

(6)

is a single-step in the reverse direction solving the PF ODE in Eq. (1), $\lambda(\cdot)\in\mathbb{R}_{+}$ is a positive weighting function, and $d(\cdot,\cdot)$ is a metric function such that $d(\bm{x},\bm{y})\geq 0$ and $d(\bm{x},\bm{y})=0\iff\bm{x}=\bm{y}.$ Reference [52] proposes to use the Learned Perceptual Image Patch Similarity (LPIPS) [58] as metric function. The expectation in Eq. (5) is taken with respect to $i\sim\mathcal{U}[1,N-1]$ , a uniform distribution over integers $i=1,...,N-1$ , and $\bm{x}_{\sigma_{i+1}}\sim p_{\sigma_{i+1}}(\bm{x}).$ The objective in Eq. (5) is minimized via stochastic gradient descent on parameters $\bm{\theta}$ whilst $\bm{\theta}^{-}$ are updated via the exponential moving average (EMA)

\bm{\theta^{-}}\leftarrow\text{stopgrad}(\mu\bm{\theta^{-}}+(1-\mu)\bm{\theta}),\;

(7)

where $0\leq\mu\leq 1$ is the decay rate. $f_{\bm{\theta}}$ and $f_{\bm{\theta}^{-}}$ are called the student and teacher network, respectively. Directly minimzing the objective in Eq. (6) is not feasible in practice since $\nabla_{\bm{x}}\log p_{\sigma_{i+1}}(\bm{x})$ is unknown. Reference [52] presents two algorithms to circumvent this issue: consistency distillation and consistency training. Consistency distillation depends on a pre-trained score network, $s_{\bm{\phi}}(\bm{x},\sigma),$ to estimate the score function, $\nabla_{\bm{x}}\log p_{\sigma_{i+1}}(\bm{x}).$ The pre-trained score network is used instead of $\nabla_{\bm{x}}\log p_{\sigma_{i+1}}(\bm{x})$ in Eq. (6) to generate a pair of adjacent points $(\breve{\bm{x}}_{\sigma_{i}},\bm{x}_{\sigma_{i+1}})$ on the solution trajectory. Consistency training, on the other hand, is a stand-alone model. A perturbed sample $\bm{x}_{\sigma}$ from ground truth $\bm{x}$ can be obtained via the Gaussian perturbation kernel $p_{\sigma}(\bm{x}_{\sigma}|\bm{x})=\mathcal{N}(\bm{x},\sigma\bm{I})$ as $\bm{x}+\sigma\bm{z}$ , where $\bm{x}\sim p(\bm{x})$ and $\bm{z}\sim\mathcal{N}(\bm{0},\bm{I}).$ Hence, consistency training obtains the pair of adjacent points on the same solution trajectory $(\bm{x}_{\sigma_{i+1}},\breve{\bm{x}}_{\sigma_{i}})$ via $\bm{x}_{\sigma_{i+1}}=\bm{x}+\sigma_{i+1}\bm{z}$ and $\breve{\bm{x}}_{\sigma_{i}}=\bm{x}+\sigma_{i}\bm{z}$ using the same $\bm{x}$ and $\bm{z}.$ Consistency training then employs the loss

\mathbb{E}\left[\lambda(\sigma_{i})d(f_{\bm{\theta}}(\bm{x}+\sigma_{i+1}\bm{z},\sigma_{i+1}),f_{\bm{\theta}^{-}}(\bm{x}+\sigma_{i}\bm{z},\sigma_{i}))\right],

(8)

which is equivalent to the consistency matching loss in Eq. (5) in the $N\rightarrow\infty$ limit [52]. Once we have obtained an approximation of the consistency function, a sample can simply be generated by drawing an initial noise vector from the noise distribution and passing through the consistency model $\hat{\bm{x}}=f_{\bm{\theta}}(\bm{z},\sigma_{\text{max}}).$ Importantly, consistency models do not sacrifice the ability to trade-off image quality for compute and zero-shot editing of data as they still support multi-step sampling [52].

II-C PFGM++

PFGM++[44] is closely related to continuous-time diffusion models [40, 42] in theory and in practice. The training and sampling algorithms from [42] can be applied directly via a simple change-of-variables and an updated noise distribution [44]. It is also possible to show that the training and sampling processes of PFGM++ will converge to that of diffusion models [42] is the $D\rightarrow\infty,r=\sigma\sqrt{D},$ limit. Inspired by electrostatics, PFGM++ treats the $N$ -dimensional data as an electric charge in a $N+D$ -dimensional augmented space. Let $\tilde{\bm{x}}:=(\bm{x},\bm{0})\in\mathbb{R}^{N+D}$ and $\tilde{\bm{x}}_{\sigma}:=(\bm{x}_{\sigma},\bm{z})\in\mathbb{R}^{N+D}$ denote the augmented ground truth and perturbed data, respectively. The objective of interest is then the high dimensional electric field

\bm{E}(\bm{\tilde{x}}_{\sigma})=\frac{1}{S_{N+D-1}(1)}\int\frac{\bm{\tilde{x}}_{\sigma}-\bm{\tilde{x}}}{||\bm{\tilde{x}}_{\sigma}-\bm{\tilde{x}}||^{N+D}}p(\bm{x})d\bm{x},

(9)

where $S_{N+D-1}(1)$ is the surface area of the unit $(N+D-1)$ -sphere, $p(\bm{x})$ is the distribution of the ground truth data. Notably, this electric field is rotationally symmetric on the $D$ -dimensional cylinder $\sum_{i=1}^{D}z_{i}^{2}=r^{2},\forall r>0$ and thus it suffices to track the norm of the augmented variables $r=r(\bm{\tilde{x}}_{\sigma}):=||\bm{z}||_{2}.$ For notational brevity, redefine $\tilde{\bm{x}}:=(\bm{x},0)\in\mathbb{R}^{N+1}$ and $\tilde{\bm{x}}_{\sigma}:=(\bm{x}_{\sigma},r)\in\mathbb{R}^{N+1}.$ This dimensionality reduction, afforded by symmetry, yields the following ODE

d\bm{x}_{\sigma}=\bm{E}(\bm{\tilde{x}}_{\sigma})_{\bm{x}_{\sigma}}\cdot E(\bm{\tilde{x}}_{\sigma})_{r}^{-1}dr,

(10)

where $\bm{E}(\bm{\tilde{x}}_{\sigma})_{\bm{x}_{\sigma}}=\frac{1}{S_{N+D-1}(1)}\int\frac{\bm{x}_{\sigma}-\bm{x}}{||\bm{\tilde{x}}_{\sigma}-\bm{\tilde{x}}||^{N+D}}p(\bm{x})d\bm{x}$ and $E(\bm{\tilde{x}}_{\sigma})_{r}=\frac{1}{S_{N+D-1}(1)}\int\frac{r}{||\bm{\tilde{x}}_{\sigma}-\bm{\tilde{x}}||^{N+D}}p(\bm{x})d\bm{x}$ are the $\bm{x}_{\sigma}$ and $r$ components of Eq. (9), respectively. Eq. (10) defines a bijective mapping between the ground truth data on the $r=0$ $(\bm{z}=\bm{0})$ hyperplane and an easy-to-sample prior noise distribution on the $r=r_{\max}$ hyper-cylinder [44]. As is the case for diffusion models [40, 42], PFGM++ uses a perturbation based objective function. Let $p_{r}(\bm{x}_{\sigma}|\bm{x})$ denote the perturbation kernel, which samples perturbed $\bm{x}_{\sigma}$ from ground truth $\bm{x}$ , and $p(r)$ the training distribution over $r.$ Then the objective of interest is

\mathbb{E}_{r\sim p(r)}\mathbb{E}_{\bm{x}\sim p(\bm{x})}\mathbb{E}_{\bm{x}_{\sigma}\sim p_{r}(\bm{x}_{\sigma}|\bm{x})}\left[||s_{\phi}(\bm{\tilde{x}}_{\sigma})-\frac{\bm{x}_{\sigma}-\bm{x}}{r/\sqrt{D}}||_{2}^{2}\right]

(11)

We can ensure that the minimizer of Eq. (11) is $s^{*}_{\phi}(\bm{\tilde{x}}_{\sigma})=\sqrt{D}\bm{E}(\bm{\tilde{x}}_{\sigma})_{\bm{x}_{\sigma}}\cdot E(\bm{\tilde{x}}_{\sigma})_{r}^{-1}$ by choosing the perturbation kernel to be $p_{r}(\bm{x}_{\sigma}|\bm{x})\propto 1/(||\bm{x}_{\sigma}-\bm{x}||_{2}^{2}+r^{2})^{\frac{N+D}{2}}.$ As with diffusion models, a sample can then be generated by solving $d\bm{x}_{\sigma}/dr=\bm{E}(\bm{\tilde{x}}_{\sigma})_{\bm{x}_{\sigma}}/E(\bm{\tilde{x}}_{\sigma})_{r}=s^{*}_{\phi}(\bm{\tilde{x}}_{\sigma})/\sqrt{D}.$

II-D Poisson flow consistency models

We extend the idea of consistency models to where the underlying model is a PFGM++ instead of a diffusion model. PFGM++ accepts diffusion models as a special case but at the same time allows for trading-off robustness for rigidity by tuning $D$ , the dimensionality of the augmentation variables. This yields a novel family of deep generative models which we call Poisson flow consistency models (PFCM). In this paper, we only treat the case of consistency distillation and leave consistency training, and improved consistency training [53], to future work. For consistency distillation, the PF ODE solver from the pre-trained diffusion model is applied as is, including for the second-order Heun solver proposed in [42]. Using the noise distribution from PFGM++ we can sample $\bm{x}_{\sigma_{i+1}}$ from the PFGM++ solution trajectory and, as with consistency models, we can subsequently obtain the correspond $\breve{\bm{x}}_{\sigma_{i}}$ by taking a single reverse step of the PF ODE solver using the pre-trained network from PFGM++, which estimates $\sqrt{D}\bm{E}(\bm{\tilde{x}}_{\sigma})_{\bm{x}_{\sigma}}\cdot E(\bm{\tilde{x}}_{\sigma})_{r}^{-1}.$ Hence, exactly as in PFGM++[44], we can re-purpose the consistency distillation training algorithm to train a PFCM via updating the noise distribution and applying a change-of-variables.

In particular, for our discretized domain $[\sigma_{\text{min}},\sigma_{\text{{max}}}]$ , let $\Phi(\cdot,\cdot,\bm{\phi})$ denote the update function of a one-step ODE solver applied to the PF ODE. We can then write Eq. (6) as

\breve{\bm{x}}_{\sigma_{i}}=\bm{x}_{\sigma_{i+1}}+(\sigma_{i}-\sigma_{i+1})\Phi(\bm{x}_{\sigma_{i+1}},\sigma_{i+1},\bm{\phi}),

(12)

where $\Phi(\bm{x}_{\sigma_{i+1}},\sigma_{i+1},\bm{\phi}):=-\sigma_{i+1}s_{\bm{\phi}}(\bm{x}_{\sigma_{i+1}},\sigma_{i+1})$ , using the first order Euler step for simplicity. More generally, Eq. (12) is taking a single-step in the reverse direction solving a PF ODE on the form

d\bm{x}=\Phi(\bm{x},\sigma,\bm{\phi})dt.

(13)

Using the fact that $\sigma(t)=t,$ $\bm{\tilde{x}}_{\sigma}:=(\bm{x}_{\sigma},r),$ and the alignment formula $r=\sigma\sqrt{D}$ [44], we can do the change-of-variables $dr=d\sigma\sqrt{D}=dt\sqrt{D}$ and rewrite $d\bm{x}/dr$ in Eq. (10) as

d\bm{x}=s_{\bm{\phi}}(\bm{\tilde{x}})/\sqrt{D}dr=s_{\bm{\phi}}(\bm{\tilde{x}})dt.

(14)

Hence, we can write the PF ODE in Eq. (10) on the form in Eq. (13) with $\Phi_{\text{PFGM++}}(\bm{x},\sigma,\bm{\phi}):=s_{\bm{\phi}}(\tilde{\bm{x}})=\sqrt{D}\bm{E}(\bm{\tilde{x}}_{\sigma})_{\bm{x}_{\sigma}}\cdot E(\bm{\tilde{x}}_{\sigma})_{r}^{-1}.$ A PFCM can then be trained via consistency distillation by using $\Phi_{\text{PFGM++}}(\cdot,\cdot,\bm{\phi})$ , from a pre-trained PFGM++, and an updated noise distribution.

As for diffusion models, $p_{r}(\cdot|\bm{x})$ enables sampling of perturbed $\bm{x}_{\sigma}$ from ground truth $\bm{x}.$ However, sampling from $p_{r}(\cdot|\bm{x})$ is slightly more involved than from a simple Gaussian. In particular, $p_{r}(\cdot|\bm{x})$ is first split into hyperspherical coordinates to $\mathcal{U}_{\psi}(\psi)p_{r}(R)$ , where $\mathcal{U}_{\psi}$ denotes the uniform distribution of the angle component and the distribution of $R=||\bm{x}_{\sigma}-\bm{x}||$ , the perturbed radius, is

p_{r}(R)\propto\frac{R^{N-1}}{(R^{2}+r^{2})^{\frac{N+D}{2}}}.

(15)

In practice, reference [44] proposes to sample from $p_{r}(\cdot|\bm{x})$ through the following steps: 1) get $r_{i}$ from $\sigma_{i}$ via the alignment formula $r_{i}=\sigma_{i}\sqrt{D}$ . 2) sample radii $R_{i}$ from $p_{r_{i}}(R)$ , 3) sample uniform angles $\bm{v}_{i}\leftarrow\bm{u}_{i}/||\bm{u}_{i}||_{2},$ where $\bm{u}_{i}\sim\mathcal{N}(\bm{0};\bm{I})$ , 4) get perturbed data $\bm{x}_{\sigma_{i}}=\bm{x}+R_{i}\bm{v}_{i}\sim p_{r}(\bm{x}_{\sigma_{i}}|\bm{x}).$

Finally, we are interested in inverse problem solving, in particular image denoising. As with diffusion models and PFGM++, one can move from an unconditional generator to a conditional one in two main ways: via zero-shot data editing using a network trained in an unsupervised fashion and training the network to directly learn a trajectory from the prior noise distribution to the posterior distribution of interest. The main advantages for the former is that it imposes a much laxer data requirement and allows for using one single pre-trained network for a range of downstream tasks (e.g., denoising, inpainting, colorization, super-resolution). The main advantage for the latter is that much more information is available at training time. Hence, we expect that, all else equal, this will yield superior performance. Since we have paired data available in the Mayo low-dose CT data, we opt for the supervised approach in this paper and leave the unsupervised approach for future work. Hence, we will do one more update to the framework in [52]: the pre-trained PFGM++ and PFCM are both trained in a supervised fashion to directly learn a mapping between the noise distribution and the posterior distribution of interest. Feeding the condition image, $\bm{y}$ , as additional input to the network to directly learn a given inverse problem has been demonstrated to work well empirically for diffusion models [45, 48, 49], PFGM++ [34], and consistency models [34]. Our results indicate that this approach generalizes well to PS-PFCM. Our proposed training of PS-PFCM is available in Algorithm 1 with updates to consistency distillation in [52] highlighted in blue. The proposed sampling is available in Algorithm 2.

1Require: dataset

\mathcal{D},

initial model parameter

\bm{\theta},

learning rate

\eta

, ODE solver

\Phi(\cdot,\cdot,{\color[rgb]{0,0,1}\cdot};\bm{\phi}),

distance metric

d(\cdot,\cdot)

, weighting function

\lambda(\cdot)

and decay parameter

\mu

\bm{\theta^{-}}\leftarrow\bm{\theta}

3 repeat

4 Sample

{\color[rgb]{0,0,1}(\bm{x},\bm{y})}\sim\mathcal{D}

and

i\sim\mathcal{U}[1,N-1]

5 Set

r_{i+1}\leftarrow\sigma_{i+1}\sqrt{D}

6 Sample radii

R_{i+1}\sim p_{r_{i+1}}(R)

7 Sample uniform angles

\bm{v}_{i+1}\leftarrow\bm{u}_{i+1}/||\bm{u}_{i+1}||_{2},

with

\bm{u}_{i+1}\sim\mathcal{N}(\bm{0};\bm{I})

\bm{x}_{\sigma_{i+1}}\leftarrow\bm{x}+R_{i+1}\bm{v}_{i+1}

\breve{\bm{x}}_{\sigma_{i}}\leftarrow\bm{x}_{\sigma_{i+1}}+(\sigma_{i}-\sigma_{i+1})\Phi(\bm{x}_{\sigma_{i+1}},\sigma_{i+1},{\color[rgb]{0,0,1}\bm{y}};\bm{\phi})

\mathcal{L}(\bm{\theta},\bm{\theta^{-}};\bm{\phi})\leftarrow

\lambda(\sigma_{i})\left(f_{\bm{\theta}}(\bm{x}_{\sigma_{i+1}},\sigma_{i+1},{\color[rgb]{0,0,1}\bm{y}}),f_{\bm{{\theta^{-}}}}(\breve{\bm{x}}_{\sigma_{i}},\sigma_{i},{\color[rgb]{0,0,1}\bm{y}})\right)

\bm{\theta}\leftarrow\bm{\theta}-\eta\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta},\bm{\theta^{-}};\bm{\phi})

\bm{\theta^{-}}\leftarrow\text{stopgrad}(\mu\bm{\theta^{-}}+(1-\mu)\bm{\theta})

until convergence

Algorithm 1 Training PS-PFCM via distillation. Extended from [52] and [44].

1Require: Consistency model

f_{\bm{\theta}}(\cdot,\cdot,{\color[rgb]{0,0,1}\cdot})

and condition data

\bm{y}

2 Set

r\leftarrow\sigma_{\text{max}}\sqrt{D}

3 Sample radii

R\sim p_{r}(R)

4 Sample uniform angles

\bm{v}\leftarrow\bm{u}/||\bm{u}||_{2},

with

\bm{u}\sim\mathcal{N}(\bm{0};\bm{I})

\bm{x}\leftarrow R\bm{v}

\hat{\bm{x}}\leftarrow f_{\bm{\theta}}(\bm{x},\sigma_{\text{max}},{\color[rgb]{0,0,1}\bm{y}})

Algorithm 2 Sampling from PS-PFCM. Extended from [52] and [44].

III Experiments

III-A Datasets

III-A1 Mayo low-dose CT data

For training and validation we use data from the Mayo Clinic used in the AAPM low-dose CT grand challenge [55]. We use the data reconstructed on a $512\times 512$ pixel grid with 1 mm slice thickness and a medium (D30) reconstruction kernel. The first eight patients are used as training data, yielding a total of 4800 slices, and the remaining two patients as validation data, with a total of 1136 slices.

III-B Implementation details

All networks are trained on randomly extracted $256\times 256$ patches using rectified Adam [56] with learning rate $1\times 10^{-4}$ for PFGM++ and $1\times 10^{-5}$ for PS-PFCM. In addition to yielding more efficient training in terms of lower graphics memory requirements, training on randomly extracted patches also serves as data augmentation. To further reduce graphics memory requirements the networks are trained using mixed precision. Hyperparameters for training and sampling are set as in [52] for the LSUN $256\times 256$ experiments¹¹1As specified in https://github.com/openai/consistency_models/blob/main/scripts/launch.sh. with the exception of batch size which we set to 4 and $\sigma_{\text{max}}$ which we set to $\sigma_{\text{max}}=380$ as in PFGM++[44]. The network architecture is as in [57] and we use channel multiplier 256, channels per resolution [1, 1, 2, 2, 4, 4], attention layers at resolutions [32,16,8], and two res-blocks per resolution. We use a dropout probability of $10\%$ when training PFGM++. As for CD in [52], we set $\lambda(\sigma):=1$ and using LPIPS[58] as metric function. To get conditional models, trained in a supervised fashion, the network is updated to accept one additional input channel and the training and sampling scripts are updated accordingly. The networks are trained using $4\times$ NVIDIA A6000 48 GPUs. We train PS-PFCM for $D\in\{4096,8192,131072\}$ , where we have followed the recommendation to re-scale $D$ with $N.$ ²²2https://github.com/Newbeeer/pfgmpp/issues/10. The PFGM++ are first trained for 300k iterations. We subsequently train the PS-PFCM for another 300k iterations.

In addition to PFGM++[44], which is a state-of-the-art diffusion-style model, we also compare PS-PFCM to the corresponding method using consistency models [52, 53], the current state-of-the-art diffusion-style models with NFE=1, as well EDM [42]. All models are trained in a supervised fashion, using the “trick” of feeding the condition image, $\bm{y}$ , as additional input to the network to directly learn a trajectory between the prior noise distribution and the posterior distribution of interest. EDM and CD are trained with identical hyperparameters as for PFGM++ and PS-PFCM. The improved version of consistency training, iCT [53], demonstrates impressive performance, outperforming even consistency distillation [53]. However, we still choose to compare to consistency distillation for two reasons. First, to the best of our knowledge, consistency training or iCT has never be trained in a supervised fashion to directly learn a trajectory to the desired posterior. Reference [34], on the other hand, trained a consistency model using consistency distillation in a supervised fashion and obtained good results. Second, since the proposed method is based on distillation, not consistency training, comparing to consistency distillation, trained for the same task with identical hyperparameters, to PS-PFCM is tantamount to an ablation study of the impact of choosing $D$ finite versus $D\rightarrow\infty.$ As in reference [52], we refer to the consistency model trained using consistency distillation as CD.

The results are evaluated qualitatively, via visual inspection, and quantitatively using average peak signal-to-noise ratio (PSNR), structural similarity index (SSIM [59]), and LPIPS over the low-dose CT validation dataset. Extensive experiments conducted in [58] illustrate that LPIPS corresponds much better to human perception than more traditional metrics such as SSIM and PSNR. However, one important caveat is that we now apply it to gray-scale CT images whereas the experiments, and training of the underlying networks, were on natural RGB images. The results correlate well with human perception; however, the interpretation of LPIPS is a bit unclear since we are essentially applying it to an out-of-distribution case. We employ the same version of LPIPS as used in the score-matching loss.

III-C Results

Quantitative results are available in Table I. We include these metrics for the low-dose CT (LDCT) images as a baseline. The overall top performer is EDM. However, this top performance incurs the computational cost of $\text{NFE}=79.$ For the $D$ considered here, the best performance for PFGM++ is obtained for $D=4096$ , which is marginally worse than for EDM. CD achieves $\text{NFE}=1$ but it comes with a quite significant drop in performance compared to EDM. Moving from PFGM++ to single-step PS-PFCM also results in the expected drop in performance. However, PS-PFCM $D=4096$ outperforms the corresponding CD according to each metric considered, despite distilling from a marginally worse performing underlying model. Hence, we have demonstrated that there exists a finite $D\in(0,\infty)$ for which the penalty incurred for moving to a single-step sampler is lower for PS-PFCM than for consistency models. This is likely due to the additional robustness afforded by setting $D$ relatively low in PFGM++. In this setup, the single-step sampler yields $79\times$ faster sampling. To put this in perspective, processing the 1136 slices in the low-dose CT validation set with NFE=1 takes about 3 minutes whereas it takes around 4 hours for NFE=79 on a single NVIDIA A6000 48GB GPU.

Qualitative results for a representative sample from the low-dose CT validation set is available in Fig. 1 and 2. This patient has a metastasis in the liver and we show a magnified version of the region-of-interest (ROI) in Fig. 1, containing this lesion, in Fig. 2. SSIM and PSNR for this particular slice, in addition to NFE for easy comparison, are also included in Fig. 1. The normal-dose CT (NDCT) image, shown in a), is treated as ground truth. We can see that the metastasis is clearly visible in all images, including for the NDCT image. EDM, PFGM++, CD, and PS-PFCM all produce realistic noise characteristics that closely matches that of the NDCT image. How they differ is the accuracy in which certain details are reproduced. Since all models are 2D denoising networks, it is feasible that there is simply insufficient information in the LDCT image to fully reproduce all features visible in the NDCT image. We have marked a feature that seem particularly difficult to reproduce with a yellow arrow. There is a large variation in how this feature is reproduced both when comparing EDM to PFGM++, with $D\in\{4096,8192,131072\}$ , but also when comparing EDM or PFGM++ to the corresponding distilled consistency model or PS-PFCM. Qualitatively, it is difficult to state which model does the best job reproducing this feature.

	LPIPS ( $\downarrow$ )	SSIM ( $\uparrow$ )	PSNR ( $\uparrow$ )	NFE ( $\downarrow$ )
LDCT	$0.152\pm 0.02$	$0.94\pm 0.02$	$41.5\pm 1.6$
EDM [42]	$0.053\pm 0.01$	$0.97\pm 0.01$	$43.8\pm 1.2$	79
CD [52]	$0.065\pm 0.01$	$0.94\pm 0.01$	$42.0\pm 0.8$	1
PFGM++ [44]
D=4096	$0.055\pm 0.01$	$0.96\pm 0.01$	$43.6\pm 1.2$	79
D=8192	$0.056\pm 0.01$	$0.96\pm 0.01$	$43.6\pm 1.3$	79
D=131072	$0.058\pm 0.01$	$0.96\pm 0.01$	$43.2\pm 1.1$	79
PS-PFCM
D=4096	$\bm{0.061}\pm 0.01$	$\bm{0.96}\pm 0.01$	$\bm{43.0}\pm 1.0$	1
D=8192	$0.066\pm 0.01$	$0.95\pm 0.01$	$42.0\pm 1.1$	1
D=131072	$0.065\pm 0.01$	$0.95\pm 0.01$	$42.0\pm 1.0$	1

TABLE I: Mean and standard deviation of LPIPS, SSIM, and PSNR in the low-dose CT validation set.

\downarrow

means lower is better.

\uparrow

means higher is better. Best performance in the

\text{NFE}=1

category in bold.

Refer to caption — Figure 1: Representative slice from the low-dose CT validation set. Patient with metastasis in the liver. A magnified version of this lesion is available in Fig. 2. a) NDCT, b) LDCT, c) EDM, d) CD, e) PFGM++ (D=4096), f) PS-PFCM (D=4096), g) PFGM++ (D=8192), h) PS-PFCM (D=8192), i) PFGM++ (D=131072), j) PS-PFCM (D=131072). 1 mm-slices. Window setting [-160,240] HU.

IV Discussion and conclusion

The dimensionality of the augmentation variables, $D$ , is a key hyperparameter in the PFGM++ framework. Although one can quite easily find a $D$ for which PFGM++ outperforms EDM and PFCM outperforms consistency models, it is unlikely that we have found the optimal $D.$ As overall performance is very sensitive to this hyperparameter, there are tentatively significant performance gains to be made by simply choosing a better $D.$ Hence, an interesting avenue of future research is to develop superior methods of finding the optimal $D$ . One tentative solution to this is to employ Bayesian optimization (e.g., [60]). This would allow us to find an good approximation of the global optima via whilst imposing a reasonable computational burden. This approach to hyperparameter tuning has proved to be very successful, perhaps most notably in the development of AlphaGo [61, 62]. Another interesting approach is to update the PFGM++ framework to include $D$ as a learnable parameter. Such an approach would circumvent the need of tuning $D$ altogether.

CT data should not really be thought of as 2D slices, but rather as 3D volumes. Given this data structure, there is an abundance of relevant information in adjacent slices and only considering 2D slices leaves all that information on the table. In particular, using 3D information can greatly facilitate the task of discriminating signal from noise. We intend to expand PS-PFCM to the 3D case in future work.

In contrast to convention, we have proposed a novel family of deep generative models and directly applied to the case of conditional generation: the trajectory is from the noise distribution to a posterior distribution, not the prior, ground truth, distribution. However, as a part of this project, we did train a PFCM, via consistency distillation, for the task of unconditional image generation on the CIFAR-10 dataset³³3https://www.cs.toronto.edu/~kriz/cifar.html. and initial results, shown in the appendix, indicate that this is a very promising direction. We surmise that with adequate compute resources, and further tuning of hyperparameters (in particular $D$ ), PFCM can perform competitively as a novel deep generative network for the task of unconditional image generation. This very interesting direction is left to future research.

In conclusion, we have presented PS-PFCM, a novel post-processing denoising technique for CT. The suggested approach combines the flexibility of tuning $D$ in PFGM++[44] with the high quality single-step sampling of consistency models [52]. A PFGM++, trained in a supervised fashion to learn the trajectory from the prior noise distribution to the posterior distribution of interest, is “distilled” into a single-step sampler using an updated version of consistency distillation [52]. Our results indicate that tuning $D$ allows us to outperform the corresponds consistency model on clinical low-dose CT images. Importantly, we note that PFCM is in itself in fact a novel family of deep generative models and thus this paper merely presents a special case of a much more general deep learning framework.

Acknowledgment

D. Hein discloses research collaboration with, and consultancy for, GE HealthCare.

References

[1] A. B. de González and S. Darby, “Risk of cancer from diagnostic X-rays: estimates for the UK and 14 other countries,” The Lancet (British edition), vol. 363, no. 9406, pp. 345–351, 2004
[2] D. J. Brenner and E. J. Hall, “Computed Tomography — An Increasing Source of Radiation Exposure,” The New England journal of medicine, vol. 357, no. 22, pp. 2277–2284, 2007.
[3] G. Wang, J. C. Ye, and B. De Man, “Deep learning for tomographic image reconstruction,” Nature Mach. Intell., vol. 2, no. 12, pp. 737–748, 2020.
[4] L.R Koetzier et al. “Deep Learning Image Reconstruction for CT: Technical Principles and Clinical Prospects”, in Radiology, vol. 306, no. 3, pp. e221257–e221257, 2023.
[5] M. J. Willemink, M. Persson, A. Pourmorteza, N. J. Pelc, and D. Fleischmann, “Photon-counting CT: Technical principles and clinical prospects,” Radiology, vol. 289, no. 2, pp. 293–312, 2018.
[6] T. Flohr, M. Petersilka, A. Henning, S. Ulzheimer, J. Ferda, and B. Schmidt, “Photon-counting CT review,” Phys. Med. 79, P126-136, 2020.
[7] M. Danielsson, M. Persson, and M. Sjölin, “Photon-counting x-ray detectors for CT,” Phys. Med. Biol., vol. 66, no. 3, pp. 03TR01–03TR01, 2021.
[8] S. S. Hsieh, S. Leng, K. Rajendran, S. Tao, and C. H. McCollough, “Photon Counting CT: Clinical Applications and Future Developments,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 5, no. 4, pp. 441–452, 2021.
[9] K. Higashigaito et al., “Contrast-Enhanced Abdominal CT with Clinical Photon-Counting Detector CT: Assessment of Image Quality and Comparison with Energy-Integrating Detector CT,” Acad. Radiol., vol. 29, no. 5, pp. 689–697, 2022.
[10] K. Rajendran et al., “First Clinical Photon-counting Detector CT System: Technical Evaluation,” Radiology, 303(1), 130–138, 2022.
[11] J. Wang, T. Li, H. Lu, and Z. Liang, “Penalized weighted least-squares approach to sinogram noise reduction and image reconstruction for low-dose X-ray computed tomography,” IEEE Trans. Med. Imag., vol. 25, no. 10, pp. 1272–1283, Oct. 2006.
[12] J. B. Thibault, K. D. Sauer, C. A. Bouman, and J. Hsieh, “A three dimensional statistical approach to improved image quality for multislice helical CT,” Med. Phys., vol. 34, no. 11, pp. 4526–4544, Nov. 2007.
[13] E. Y. Sidky and X. Pan, “Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization,” Phys. Med. Biol., vol. 53, no. 17, pp. 4777–4807, Sep. 2008.
[14] H. Yu, S. Zhao, E. A. Hoffman, and G. Wang, “Ultra-low dose lung CT perfusion regularized by a previous scan,” Acad. Radiol., vol. 16, no. 3, pp. 363–373, Mar. 2009.
[15] Z. Tian, X. Jia, K. Yuan, T. Pan, and S. B. Jiang, “Low-dose CT reconstruction via edge-preserving total variation regularization,” Phys. Med. Biol., vol. 56, no. 18, p. 5949, 2011.
[16] J. W. Stayman, H. Dang, Y. Ding, and J. H. Siewerdsen, “PIRPLE: a penalized-likelihood framework for incorporation of prior images in CT reconstruction,” Phys. Med. Biol., vol. 58, no. 21, pp. 7563–7582, 2013.
[17] G. L. Zeng and W. Wang, “Does noise weighting matter in CT iterative reconstruction?” IEEE Trans. Radiat. Plasma Med. Sci., vol. 1, no. 1, pp. 68–75, Jan. 2017.
[18] P. J. La Riviere, “Penalized-likelihood sinogram smoothing for low-dose CT,” Med. Phys., vol. 32, no. 6, pp. 1676–1683, Jun. 2005.
[19] J. Ma et al., “Low-dose computed tomography image restoration using previous normal-dose scan,” Med. Phys., vol. 38, no. 10, pp. 5713–5731, 2011.
[20] H. Chen et al., “Low-Dose CT With a Residual Encoder-Decoder Convolutional Neural Network,” in IEEE Trans. Med. Imag., vol. 36, no. 12, pp. 2524-2535, Dec. 2017
[21] Y. Zhang, Y. Xi, Q. Yang, W. Cong, J. Zhou, and G. Wang, “Spectral CT reconstruction with image sparsity and spectral mean,” IEEE Trans. Comput. Imag., vol. 2, no. 4, pp. 510–523, Dec. 2016.
[22] J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Isgum, “Generative Adversarial Networks for Noise Reduction in Low-Dose CT,” IEEE Trans. Med. Imag., vol. 36, no. 12, pp. 2536–2545, 2017.
[23] Q. Yang et al., “Low-Dose CT Image Denoising Using a Generative Adversarial Network With Wasserstein Distance and Perceptual Loss,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1348–1357, 2018.
[24] H. Shan et al., “Competitive performance of a modularized deep neural network compared to commercial algorithms for low-dose CT image reconstruction,” Nature Mach. Intell., vol. 1, no. 6, pp. 269–276, 2019.
[25] B. Kim, M. Han, H. Shim, and J. Baek, “A performance comparison of convolutional neural network‐based image denoising methods: The effect of loss functions on low‐dose CT images,” Med. Phys., vol. 46, no. 9, pp. 3906–3923, 2019.
[26] K. Kim, S. Soltanayev and S. Y. Chun, “Unsupervised Training of Denoisers for Low-Dose CT Reconstruction Without Full-Dose Ground Truth,” in IEEE J. Sel. Top., vol. 14(6):1112-1125, 2020,
[27] N. Yuan, J. Zhou, and J. Qi, “Half2Half: deep neural network based CT image denoising without independent reference data,” Phys. Med. Biol., vol. 65, no. 21, pp. 215020–215020, 2020.
[28] Y. Mäkinen, L. Azzari, A. Foi, 2020, “Collaborative Filtering of Correlated Noise: Exact Transform-Domain Variance for Improved Shrinkage and Patch Matching”, IEEE Trans. Image Process., vol. 29, pp. 8339-8354.
[29] Z. Li, S. Zhou, J. Huang, L. Yu, and M. Jin, “Investigation of Low-Dose CT Image Denoising Using Unpaired Deep Learning Methods,” IEEE Trans. Radiat. Plasma Med. Sci., vol. 5, no. 2, pp. 224–234, 2021.
[30] S. Wang, Y. Yang, Z. Yin, A.S Wang, “Noise2Noise for denoising photon counting CT images: generating training data from existing scans”, Proc. SPIE 12463, Medical Imaging 2023: Physics of Medical Imaging, 1246304, 2023;
[31] C. Niu et al., “Noise Suppression With Similarity-Based Self-Supervised Deep Learning,” IEEE Trans. Med. Imag., 42(6):1590-1602, 2023.
[32] X. Liu, Y. Xie, J. Cheng, S. Diao, S. Tan, and X. Liang, “Diffusion Probabilistic Priors for Zero-Shot Low-Dose CT Image Denoising,” 2023, [Online]. Available: https://arxiv.org/abs/2305.15887.
[33] M. Tivnan et al., “Fourier Diffusion Models: A Method to Control MTF and NPS in Score-Based Stochastic Image Generation,” [Online]. Available: https://arxiv.org/abs/2303.13285
[34] D. Hein et al., “PPFM: Image denoising in photon-counting CT using single-step posterior sampling Poisson flow generative models,” 2023, [Online], Available: https://arxiv.org/abs/2312.09754.
[35] D. Hein et al., “Noise suppression in photon-counting CT using unsupervised Poisson flow generative models,” 2023, [Online], Available: https://arxiv.org/abs/2309.01553.
[36] Niu, C., Li, M., Guo, X., and Wang, G., “Self-supervised dual-domain network for low-dose CT denoising,” in Developments in X-Ray Tomography XIV, vol. 12242. SPIE, 2022.
[37] J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics,” in PMLR, 2015.
[38] J. Ho, A. Jain, and P. Abbeel. “Denoising diffusion probabilistic models,” in Proc. NeurIPS, 2020.
[39] A. Q. Nichol and P. Dhariwal. “Improved denoising diffusion probabilistic models,” in Proc. ICML, volume 139, pages 8162–8171, 2021.
[40] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” in Proc. ICLR, 2021.
[41] J. Song, C. Meng, and S. Ermon. “Denoising diffusion implicit models,” in Proc. ICLR, 2021.
[42] T. Karras, M. Aittala, T. Aila, and S. Laine, “Elucidating the Design Space of Diffusion-Based Generative Models,” in Proc. NeurIPS, 2022.
[43] Y. Xu, Z. Liu, M. Tegmark, and T. Jaakkola, “Poisson flow generative models,” in Proc. NeurIPS, 2022.
[44] Y. Xu, Z. Liu, Y. Tian, S. Tong, M. Tegmark, and T. Jaakkola, “PFGM++:Unlocking the potential of physics-inspired generative models”, in Proc. ICML, 2023.
[45] G. Batzolis, J. Stanczuk, C.-B. Schönlieb, C. Etmann, “Conditional Image Generation with Score-Based Diffusion Models,” 2021, [Online]. Available: https://arxiv.org/abs/2111.13606.
[46] H. Chung, B. Sim, and J. C. Ye, “Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction,” 2021, [Online]. Available: https://arxiv.org/abs/2112.05146.
[47] Y. Song, L. Shen, L. Xing, and S. Ermon, “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models,” 2021, [Online]. Available: https://arxiv.org/abs/2111.08005.
[48] C. Saharia, W. Chan, H. Chang, C. A. Lee, J. Ho, T. Salimans, D. J. Fleet, and M. Norouzi. “Palette: Image-to-image diffusion models.” In Proc. SIGGRAPH, 2022.
[49] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image Super-Resolution via Iterative Refinement,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 4, pp. 4713–14, 2023
[50] H. Chung, E. S. Lee, and J. C. Ye, “MR Image Denoising and Super-Resolution Using Regularized Reverse Diffusion,” IEEE Trans. Med. Imag., vol. 42, no. 4, pp. 922–934, 2023.
[51] R. Ge, Y. He, C. Xia, Y. Chen, D. Zhang, and G. Wang, “JCCS-PFGM: A Novel Circle-Supervision based Poisson Flow Generative Model for Multiphase CECT Progressive Low-Dose Reconstruction with Joint Condition,” [Online]. Available: https://arxiv.org/abs/2306.07824.
[52] Y. Song, P. Dhariwal, M. Chen, and I. Sutskever, “Consistency Models,” in Proc. ICML, 2023.
[53] Y. Song, and P. Dhariwal, “Improved Techniques for Training Consistency Models,” 2023. [Online]. Available: https://arxiv.org/abs/2310.14189.
[54] H. Almqvist, et al., “Initial Clinical Images From a Second-Generation Prototype Silicon-Based Photon-Counting Computed Tomography System. Acad. Radiol., Epub ahead of print, 2023.
[55] AAPM, “Low dose CT grand challenge,” 2017. [Online]. Available: https://www.aapm.org/grandchallenge/lowdosect/.
[56] L. Liu et al., “On the Variance of the Adaptive Learning Rate and Beyond,” 2019, [Online], Available: https://arxiv.org/abs/1908.03265.
[57] P. Dhariwal, and A. Nichol, “Diffusion models beat GANs on image synthesis neurips,” in Proc. NeurIPS, 2021.
[58] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric,” in in Proc. CVPR, 2018, pp. 586–595.
[59] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, 2004.
[60] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. de Freitas, “Taking the Human Out of the Loop: A Review of Bayesian Optimization,” Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[61] D. Silver et al., “Mastering the game of Go with deep neural networks and tree search,” Nature (London), vol. 529, no. 7587, pp. 484–489, 2016.
[62] Y. Chen et al., “Bayesian Optimization in AlphaGo,” 2018, [Online]. Available: https://arxiv.org/abs/1812.06855.
[63] A. Krizhevsky et al., ”Learning multiple layers of features from tiny images,” 2009.
[64] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium,” in Proc. NeurIPS, 2017.

Appendix

To substantiate the claim that PFCM are indeed valid deep generative models, we evaluate it on the most basic benchmark in the literature: unconditional image generation on the CIFAR-10 $32\times 32$ dataset[63]. For completeness, we restate the updated training and sampling algorithms, where anything pertaining to the supervised learning setup has been removed, in Algorithm 3 and 4. The purpose of this study is not to demonstrate that PFCM outperforms consistency models, but only to demonstrate that we get reasonable output, both quantitatively and qualitatively, for a standard generative task. Nonetheless, we surmise that, with adequate compute resources, one can find a $D\in(0,\infty)$ for which PFCM outperforms the corresponding consistency model. We used the official implementation of consistency models⁴⁴4https://github.com/openai/consistency_models_cifar10. for the CIFAR-10 experiments and subsequently update the noise distributions to move from EDM to PFGM++ and from consistency models to PFCM. Setups for training, sampling, and evaluation are set as in [52] for the CIFAR-10 experiments⁵⁵5As specified in https://github.com/openai/consistency_models_cifar10/blob/main/launch.sh and corresponding configuration files. with the exception of batch size for the distillation which had to be reduced to 256 from 512 to fit on 2 $\times$ NVIDIA A6000 48GB GPUs. To counteract the fact that this means lower throughput during the course of training, we increased the number of training iterations from 800k to 1200k.⁶⁶6We note that FID is still on average decaying and hence better results are likely obtained by continuing training further. Checkpoints are saved each 10k iterations, and we report the top performing defined as achieving the best (lowest) Fréchet inception distance (FID) [64]. Uncurated samples from the CIFAR-10 dataset are available in Fig. 3 along with FID based on 50k samples. We trained an identical version of EDM as in [52] to check that we are able to reproduce their results. Although not exactly the same, our reproduction yields $\text{FID}=2.10$ which is within 3% of $\text{FID}=2.04$ reported in [52]. Due to limited compute resources, we only run the experiment for $D=128$ .

1Require: dataset

\mathcal{D},

initial model parameter

\bm{\theta},

learning rate

\eta

, ODE solver

\Phi(\cdot,\cdot;\bm{\phi}),

distance metric

d(\cdot,\cdot)

, weighting function

\lambda(\cdot)

and decay parameter

\mu

\bm{\theta^{-}}\leftarrow\bm{\theta}

3 repeat

4 Sample

\bm{x}\sim\mathcal{D}

and

i\sim\mathcal{U}[1,N-1]

5 Set

r_{i+1}\leftarrow\sigma_{i+1}\sqrt{D}

6 Sample radii

R_{i+1}\sim p_{r_{i+1}}(R)

7 Sample uniform angles

\bm{v}_{i+1}\leftarrow\bm{u}_{i+1}/||\bm{u}_{i+1}||_{2},

with

\bm{u}_{i+1}\sim\mathcal{N}(\bm{0};\bm{I})

\bm{x}_{\sigma_{i+1}}\leftarrow\bm{x}+R_{i+1}\bm{v}_{i+1}

\breve{\bm{x}}_{\sigma_{i}}\leftarrow\bm{x}_{\sigma_{i+1}}+(\sigma_{i}-\sigma_{i+1})\Phi(\bm{x}_{\sigma_{i+1}},\sigma_{i+1};\bm{\phi})

\mathcal{L}(\bm{\theta},\bm{\theta^{-}};\bm{\phi})\leftarrow

\lambda(\sigma_{i})\left(f_{\bm{\theta}}(\bm{x}_{\sigma_{i+1}},\sigma_{i+1}),f_{\bm{{\theta^{-}}}}(\breve{\bm{x}}_{\sigma_{i}},\sigma_{i})\right)

\bm{\theta}\leftarrow\bm{\theta}-\eta\nabla_{\bm{\theta}}\mathcal{L}(\bm{\theta},\bm{\theta^{-}};\bm{\phi})

\bm{\theta^{-}}\leftarrow\text{stopgrad}(\mu\bm{\theta^{-}}+(1-\mu)\bm{\theta})

until convergence

Algorithm 3 Training PFCM via distillation. Extended from [52] and [44].

1Require: Consistency model

f_{\bm{\theta}}(\cdot,\cdot)

2 Set

r\leftarrow\sigma_{\text{max}}\sqrt{D}

3 Sample radii

R\sim p_{r}(R)

4 Sample uniform angles

\bm{v}\leftarrow\bm{u}/||\bm{u}||_{2},

with

\bm{u}\sim\mathcal{N}(\bm{0};\bm{I})

\bm{x}\leftarrow R\bm{v}

\hat{\bm{x}}\leftarrow f_{\bm{\theta}}(\bm{x},\sigma_{\text{max}})

Algorithm 4 Sampling from PFCM. Extended from [52] and [44].