¹ Johns Hopkins University, MD, USA
² Bilkent University, Ankara, Turkey
³ National Magnetic Resonance Research Center (UMRAM), Ankara, Turkey
Email: {ykorkma1,vpatel36}@jhu.edu, [email protected]

Self-Supervised MRI Reconstruction with Unrolled Diffusion Models

Yilmaz Korkmaz¹, Tolga Cukur²,³, Vishal M. Patel¹

Abstract

Magnetic Resonance Imaging (MRI) produces excellent soft tissue contrast, but it is an inherently slow imaging modality. Promising deep learning methods have recently been proposed to reconstruct accelerated MRI scans. However, existing methods still suffer from various limitations regarding image fidelity, contextual sensitivity, and reliance on fully-sampled acquisitions for model training. To comprehensively address these limitations, we propose a novel self-supervised deep reconstruction model, named Self-Supervised Diffusion Reconstruction (SSDiffRecon). SSDiffRecon expresses a conditional diffusion process as an unrolled architecture that interleaves cross-attention transformers for reverse diffusion steps with data-consistency blocks for physics-driven processing. Unlike recent diffusion methods for MRI reconstruction, a self-supervision strategy is adopted to train SSDiffRecon using only undersampled k-space data. Comprehensive experiments on public brain MR datasets demonstrate the superiority of SSDiffRecon against state-of-the-art supervised and self-supervised baselines in terms of reconstruction speed and quality. Implementation will be available at https://github.com/yilmazkorkmaz1/SSDiffRecon.

Keywords:

Magnetic Resonance Imaging · Self-Supervised Learning · Cross-Attention Transformers · Accelerated MRI

1 Introduction

Magnetic Resonance Imaging (MRI) is one of the most widely used imaging modalities due to its excellent soft tissue contrast, but it has prolonged and costly scan sessions. Therefore, accelerated MRI methods are needed to improve its clinical utilization. Acceleration through undersampled acquisitions of a subset of k-space samples (i.e., Fourier domain coefficients) results in aliasing artifacts [20, 11, 23]. Many promising deep-learning methods have been proposed to reconstruct images by suppressing aliasing artifacts [28, 18, 26, 19, 33, 1, 7, 31, 21, 12, 2, 14, 8, 9, 10]. However, many existing methods are limited by suboptimal capture of the data distribution, poor contextual sensitivity, and reliance on fully-sampled acquisitions for model training [16, 27, 7].

A recently emergent framework for learning data distributions in computer vision is based on diffusion models [13, 22]. Several recent studies have considered diffusion-based MRI reconstructions, where either an unconditional or a conditional diffusion model is trained to generate images and reconstruction is achieved by later injecting data-consistency projections in between diffusion steps during inference [29, 3, 4, 24, 6]. While promising results have been reported, these diffusion methods can show limited reliability due to omission of physical constraints during training, and undesirable reliance on fully-sampled images. A more recent work by Cui et al. [5] attempted to mitigate the need for fully-sampled data. The authors proposed a two-stage training strategy in which a Bayesian network is used to learn the fully-sampled data distribution to train a score model, which is then used for conditional sampling. Our model differs from this approach in that it is trained end-to-end, which avoids error propagation between separate training sessions.

To overcome the aforementioned limitations, we propose a novel self-supervised accelerated MRI reconstruction method, called SSDiffRecon. SSDiffRecon leverages a conditional diffusion model that interleaves linear-complexity cross-attention transformer blocks for denoising with data-consistency projections for fidelity to physical constraints. It further adopts self-supervised learning by prediction of masked-out k-space samples in undersampled acquisitions. SSDiffRecon achieves performance on par with supervised baselines while outperforming self-supervised baselines in terms of inference speed and image fidelity.

2 Background

2.1 Accelerated MRI Reconstruction

Acceleration in MRI is achieved via undersampling the acquisitions in the Fourier domain as follows

F_{p}CI=y,

(1)

where $F_{p}$ is the partial Fourier operator, $C$ denotes coil sensitivity maps, $I$ is the MR image and $y$ is partially acquired k-space data. Reconstruction of the fully sampled target MR image $I$ from $y$ is an ill-posed problem since the number of unknowns is higher than the number of equations. Supervised deep learning methods try to solve this ill-posed problem using prior knowledge gathered in offline training sessions as follows

\widehat{I}=\underset{I}{\operatorname{argmin}}\frac{1}{2}\|y-F_{p}CI\|^{2}+\lambda(I),

(2)

where $\widehat{I}$ is the reconstruction, and $\lambda(I)$ is the prior knowledge-guided regularization term. In supervised reconstruction frameworks, prior knowledge is induced from underlying mapping between under- and fully sampled acquisitions.
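The forward model in Eq. (1) can be illustrated with a minimal NumPy sketch. The function names `forward_op` and `zero_filled`, the toy image, the uniformly random mask, and the single-coil setting (all-ones sensitivity map) are assumptions made for illustration only and are not taken from the SSDiffRecon implementation.

```python
import numpy as np

def forward_op(image, mask, coil_maps):
    """Apply F_p C: weight by coil maps, take the 2D FFT, keep sampled points."""
    coil_images = coil_maps * image[None, ...]        # C I, shape (coils, h, w)
    kspace = np.fft.fft2(coil_images, norm="ortho")   # F (C I)
    return kspace * mask[None, ...]                   # F_p C I

def zero_filled(y, coil_maps):
    """Adjoint (zero-filled) reconstruction: C^H F^{-1} y."""
    coil_images = np.fft.ifft2(y, norm="ortho")
    return np.sum(np.conj(coil_maps) * coil_images, axis=0)

rng = np.random.default_rng(0)
h, w = 32, 32
image = rng.standard_normal((h, w)) + 1j * rng.standard_normal((h, w))
coil_maps = np.ones((1, h, w), dtype=complex)   # single coil: C acts as identity
mask = rng.random((h, w)) < 0.25                # toy mask, roughly R = 4

y = forward_op(image, mask, coil_maps)          # partially acquired k-space
recon = zero_filled(y, coil_maps)               # aliased zero-filled image

num_equations = int(mask.sum())                 # acquired k-space samples
num_unknowns = h * w                            # image pixels to recover
```

Because `num_equations` is smaller than `num_unknowns`, many images are consistent with the same `y`, which is why the regularization term in Eq. (2) is needed.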

2.2 Denoising Diffusion Models

In diffusion models [13], Gaussian noise is progressively added to the data via a forward noising process

q\left(\mathbf{x}_{t}\mid\mathbf{x}_{t-1}\right)=\mathcal{N}\left(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\mathbf{x}_{t-1},\beta_{t}\mathbf{I}\right),

(3)

where $\beta_{t}$ refers to the fixed variance schedule. After a sufficient number of forward diffusion steps $(T)$ , $\mathbf{x}_{T}$ approximately follows an isotropic Gaussian distribution. Then, the backward diffusion process is deployed to gradually denoise $\mathbf{x}_{T}$ to get $\mathbf{x}_{0}$ using a deep neural network as a denoiser as follows

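p_{\theta}\left(\mathbf{x}_{t-1}\mid\mathbf{x}_{t}\right)=\mathcal{N}\left(\mathbf{x}_{t-1};\mu_{\theta}\left(\mathbf{x}_{t},t\right),\sigma_{t}^{2}\mathbf{I}\right),

(4)

where $\sigma_{t}^{2}=\tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$ , the mean $\mu_{\theta}(\mathbf{x}_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{x}_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(\mathbf{x}_{t},t)\right)$ follows the parametrization in [13], and $\epsilon_{\theta}$ represents the denoising neural network, which is trained using the following loss [13]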

L(\theta)=\mathbb{E}_{t,\mathbf{x}_{0},\epsilon}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon},t\right)\right\|^{2}\right],

(5)

where $\bar{\alpha}_{t}=\prod_{m=1}^{t}\alpha_{m}$ , $\alpha_{t}=1-\beta_{t}$ and $\epsilon\sim\mathcal{N}(0,I)$ .
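As a worked example of Eqs. (3)-(5), the sketch below computes $\bar{\alpha}_{t}$, draws $x_{t}$ from $x_{0}$ in closed form, and evaluates the posterior variance $\sigma_{t}^{2}$. The linear schedule from $10^{-4}$ to $0.02$ over $T=1000$ steps is the default reported in [13]; the function names and the random toy data are illustrative assumptions.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # linear variance schedule (default in [13])
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)        # \bar{alpha}_t for t = 1..T

def q_sample(x0, t, eps):
    """Closed-form sample of x_t given x_0 (t is 1-indexed), as used in Eq. (5)."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def posterior_variance(t):
    """sigma_t^2 = (1 - abar_{t-1}) / (1 - abar_t) * beta_t, with abar_0 = 1."""
    a_bar_prev = alpha_bars[t - 2] if t > 1 else 1.0
    return (1.0 - a_bar_prev) / (1.0 - alpha_bars[t - 1]) * betas[t - 1]

def noise_prediction_loss(eps, eps_pred):
    """Squared-error loss of Eq. (5) for one sample (averaged over pixels)."""
    return np.mean((eps - eps_pred) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))     # toy stand-in for a clean image
eps = rng.standard_normal(x0.shape)
x_T = q_sample(x0, T, eps)             # almost pure noise at the last step
```

With this schedule $\bar{\alpha}_{T}$ is close to zero, so `x_T` is nearly indistinguishable from `eps`, which matches the statement that the data become approximately Gaussian after $T$ steps.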

Figure 1: Overall training scheme and network architecture. SSDiffRecon utilizes an unrolled physics-guided network as a denoiser in the diffusion process while allowing time index guidance through the Mapper Network via cross-attention transformer layers (shown in green). After two transformer layers, it performs data-consistency (shown in orange). Corresponding noisy input undersampled ( $x_{t}^{u,p}$ ) and denoised reconstructed images ( $\hat{x_{0}}^{u}$ ) are shown during training. $L_{1}$ difference between k-space points in pre-allocated locations ( $M_{r}$ ) has been utilized as the loss function.

3 SSDiffRecon

In SSDiffRecon, we utilize a conditional diffusion probabilistic model to reconstruct fully-sampled MR images given undersampled acquisitions as input. The reverse diffusion steps are parametrized using an unrolled transformer architecture as shown in Fig. 1. To improve adaptation across time steps in the diffusion process, we inject the time-index $t$ via cross-attention transformers as opposed to the original DDPM models that add time embeddings as a bias term. In what follows we describe the training and inference procedures of SSDiffRecon.

Self-Supervised Training:

For self-supervised learning, we adopt a k-space masking strategy for diffusion models [30] as follows

L(\theta)=\|\mathcal{M}_{r}\odot\mathcal{F}(C{x}^{u})-\mathcal{M}_{r}\odot\mathcal{F}(C\hat{x_{0}}^{u})\|_{1},

(6)

where $\|\cdot\|_{1}$ denotes the $L_{1}$ -norm, $\mathcal{F}$ denotes 2D Fourier Transform, $C$ are coil sensitivities, ${x}^{u}$ is the image derived from undersampled acquisitions, and $\mathcal{M}_{r}$ is the random sub-mask within the main undersampling mask $\mathcal{M}$ . Here $\hat{x_{0}}^{u}$ is the output of the unrolled denoiser network ( $R_{\theta}$ ) corresponding to the estimate of fully-sampled target image.

\hat{x_{0}}^{u}=R_{\theta}(x_{t}^{u,p},y_{p},\mathcal{M}_{p},C,t),\quad x_{t}^{u,p}=\sqrt{\bar{\alpha}_{t}}x^{u,p}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,

(7)

where $\mathcal{M}_{p}$ is the sub-mask of the remaining points in $\mathcal{M}$ after excluding $\mathcal{M}_{r}$ , $y_{p}$ denotes the further undersampled k-space points obtained using the mask $\mathcal{M}_{p}$ , and $x^{u,p}$ is the zero-filled reconstruction of $y_{p}$ . The training scheme is illustrated in Fig. 1.
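The mask split and the loss in Eq. (6) can be sketched as follows for a single coil ( $C$ taken as identity). The helper names `split_mask` and `masked_kspace_l1` and the toy data are hypothetical; the 5% hold-out fraction is the value quoted in the implementation details.

```python
import numpy as np

def split_mask(mask, holdout_fraction, rng):
    """Split acquired locations M into a held-out loss mask M_r and an input mask M_p."""
    acquired = np.flatnonzero(mask)
    n_holdout = max(1, int(round(holdout_fraction * acquired.size)))
    held_out = rng.choice(acquired, size=n_holdout, replace=False)  # uniform sampling
    mask_r = np.zeros(mask.size, dtype=bool)
    mask_r[held_out] = True
    mask_r = mask_r.reshape(mask.shape)
    mask_p = mask & ~mask_r                                         # remaining points
    return mask_r, mask_p

def masked_kspace_l1(x_ref, x_hat, mask_r):
    """Eq. (6): L1 distance between k-space values at the held-out locations."""
    diff = np.fft.fft2(x_ref, norm="ortho") - np.fft.fft2(x_hat, norm="ortho")
    return np.sum(np.abs(mask_r * diff))

rng = np.random.default_rng(0)
mask = rng.random((32, 32)) < 0.25                 # toy undersampling mask M
mask_r, mask_p = split_mask(mask, holdout_fraction=0.05, rng=rng)

image = rng.standard_normal((32, 32))
y = mask * np.fft.fft2(image, norm="ortho")        # acquired k-space
x_u = np.fft.ifft2(y, norm="ortho")                # image from all acquired points
x_up = np.fft.ifft2(mask_p * y, norm="ortho")      # zero-filled input x^{u,p}

loss_untrained = masked_kspace_l1(x_u, x_up, mask_r)   # held-out points are missing
loss_perfect = masked_kspace_l1(x_u, x_u, mask_r)      # zero when they are recovered
```

The network never sees the k-space points in `mask_r` at its input, so the loss can only decrease if it learns to predict unseen k-space samples.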

Inference:

To speed up image sampling, inference starts with the zero-filled Fourier reconstruction of the undersampled acquisitions as opposed to a pure noise sample. Conditional diffusion sampling is then performed with the trained diffusion model that iterates through cross-attention transformers for denoising and data-consistency projections. For gradual denoising, we add random noise of decreasing magnitude to the undersampled data within the data-consistency layers. Accordingly, the reverse diffusion step at time-index $ts$ is given as

x_{ts-1}=R_{\theta}(x_{ts},y_{ts}^{\epsilon},\mathcal{M},C,ts)+\sigma_{ts}z,\quad y_{ts}^{\epsilon}=\mathcal{F}(\sqrt{\bar{\alpha}_{ts}}x^{u}+\sqrt{1-\bar{\alpha}_{ts}}\epsilon_{low}),

(8)

where $\epsilon_{low}\sim\mathcal{N}(0,0.1I)$ and $z\sim\mathcal{N}(0,I)$ . The inference procedure is illustrated in Fig. 2.
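The sampling rule in Eq. (8) can be sketched as a short loop. This is only a rough illustration under several assumptions that the text does not specify: the linear noise schedule of [13], five evenly spaced time indices, $0.1$ interpreted as the variance of $\epsilon_{low}$, real-valued noise, a single coil, and a hypothetical `placeholder_denoiser` that stands in for the trained network $R_{\theta}$ and only overwrites the sampled k-space points.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def sigma(t):
    """Square root of the posterior variance in Eq. (4); t is 1-indexed."""
    a_prev = alpha_bars[t - 2] if t > 1 else 1.0
    return np.sqrt((1.0 - a_prev) / (1.0 - alpha_bars[t - 1]) * betas[t - 1])

def placeholder_denoiser(x_t, y_eps, mask, t):
    """Hypothetical stand-in for R_theta: only replaces sampled k-space points."""
    k = np.fft.fft2(x_t, norm="ortho")
    return np.fft.ifft2(np.where(mask, y_eps, k), norm="ortho")

def reverse_diffusion(x_u, mask, denoiser, timesteps, rng):
    x = x_u.copy()                                 # start from the zero-filled image
    for t in timesteps:                            # descending time indices
        a_bar = alpha_bars[t - 1]
        eps_low = np.sqrt(0.1) * rng.standard_normal(x_u.shape)
        # Noisy version of the undersampled data used inside data consistency
        y_eps = np.fft.fft2(np.sqrt(a_bar) * x_u + np.sqrt(1.0 - a_bar) * eps_low,
                            norm="ortho")
        z = rng.standard_normal(x_u.shape)
        x = denoiser(x, y_eps, mask, t) + sigma(t) * z   # Eq. (8)
    return x

rng = np.random.default_rng(0)
mask = rng.random((32, 32)) < 0.25
image = rng.standard_normal((32, 32))
x_u = np.fft.ifft2(mask * np.fft.fft2(image, norm="ortho"), norm="ortho")

timesteps = np.linspace(T, 1, 5).astype(int)       # assumed spacing of the 5 steps
x_rec = reverse_diffusion(x_u, mask, placeholder_denoiser, timesteps, rng)
```

Since $\sigma_{ts}$ evaluates to zero at the final index under this schedule, no fresh noise is added to the last output.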

3.0.1 Unrolled Denoising Network $R_{\theta}(.)$

SSDiffRecon deploys an unrolled physics-guided denoiser in the diffusion process instead of the UNET used in [13]. Our denoiser network consists of the following two fundamental structures as shown in Fig. 1. The entire network is trained end-to-end.

1. Mapper Network
2. Unrolled Denoising Blocks

Mapper network

The mapper network is trained to generate local and global latent variables ( $w_{l}$ and $w_{g}$ , respectively) that control the fine and global features in the generated images via cross-attention and instance modulation, as detailed below. It takes as input the time index of the diffusion process and the extracted label of the undersampled image (i.e., undersampling rate and target contrast in a multi-contrast dataset), and it is built with 12 fully-connected layers, each with 32 neurons.

Unrolled Denoising Blocks

Each denoising block consists of cross-attention and data-consistency layers applied sequentially. Let the input of the $j$th denoising block at time instant $t$ be $x_{in,j}^{t}\in\mathbb{R}^{(h\times w)\times n}$ , where $h$ and $w$ denote the height and width of the image, and $n$ denotes the number of feature channels. First, the input is modulated with the affine-transformed global latent variable ( $w_{g}\in\mathbb{R}^{32}$ ) via modulated convolution adopted from [15]. Assuming that the modulated-convolution kernel is given as $\beta_{j}$ , this operation is expressed as follows

\displaystyle x_{output,j}^{t}=\begin{bmatrix}\sum_{m}x_{in,j}^{t,m}\circledast\beta_{j}^{m,1}\\ \vdots\\ \sum_{m}x_{in,j}^{t,m}\circledast\beta_{j}^{m,v}\\ \end{bmatrix},

(9)

where $\beta_{j}^{m,v}\in\mathbb{R}^{3\times 3}$ is the convolution kernel for the $m^{th}$ input channel and the $v^{th}$ output channel, with the sum running over the input channel index $m$ . Then, the output of the modulated convolution goes into the cross-attention transformer, where the attention map $att_{j}^{t}$ is calculated using local latent variables $w_{l}^{t}$ at time index $t$ as follows

\displaystyle att_{j}^{t}=softmax\left(\frac{Q_{j}(x_{output,j}^{t}+P.E.)K_{j}(w_{l}^{t}+P.E.)^{T}}{\sqrt{n}}\right)V_{j}(w_{l}^{t}),

(10)

where $Q_{j}(.)$ , $K_{j}(.)$ , and $V_{j}(.)$ produce the queries, keys and values, respectively, each implemented as a dense layer applied to the input inside the parentheses, and $P.E.$ is the positional encoding. Then, $x_{output,j}^{t}$ is normalized to zero mean and unit variance and scaled with a learned projection of the attention maps $att_{j}^{t}$ as follows

\displaystyle x_{output,j}^{t}=\alpha_{j}(att_{j}^{t})\odot\left(\frac{x_{output,j}^{t}-\mu(x_{output,j}^{t})}{\sigma(x_{output,j}^{t})}\right),

(11)

where $\alpha_{j}(.)$ is the learned scale projection. After this cross-attention sequence is repeated twice, data consistency is performed. To this end, the number of channels in $x_{output,j}^{t}$ is first decreased to 2 with an additional convolution layer. The resulting 2-channel images, whose channels represent the real and imaginary components, are then converted to complex images, and data consistency is applied as follows

\displaystyle x_{output,j}^{t}=\mathcal{F}^{-1}\{\mathcal{F}(Cx_{output,j}^{t})\odot(1-\mathcal{M}_{p})+\mathcal{F}(Cx^{u})\odot\mathcal{M}_{p}\},

(12)

where $\mathcal{F}^{-1}$ represents the inverse 2D Fourier transform. Then, using another convolution, the number of feature maps is increased to $n$ again for the next denoising block.
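The cross-attention modulation of Eqs. (10)-(11) and the data-consistency step of Eq. (12) can be sketched in NumPy as below. This is a simplified illustration, not the authors' implementation: positional encodings are omitted, dense layers and convolutions are replaced by random matrices, normalization is assumed to be per channel over spatial positions, the number and dimension of local latents are arbitrary, and a single coil is assumed. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)          # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention_modulate(x, w_l, Wq, Wk, Wv, Wa, eps=1e-8):
    """x: (h*w, n) image features, w_l: (k, d) local latent variables."""
    n = x.shape[1]
    # Eq. (10): queries from image features, keys and values from the latents
    att = softmax((x @ Wq) @ (w_l @ Wk).T / np.sqrt(n)) @ (w_l @ Wv)
    # Eq. (11): normalize features, then scale by a projection of the attention map
    x_norm = (x - x.mean(axis=0)) / (x.std(axis=0) + eps)
    return (att @ Wa) * x_norm

def to_complex(x2):
    """Convert a 2-channel (real, imaginary) image into a complex image."""
    return x2[..., 0] + 1j * x2[..., 1]

def data_consistency(x, y, mask):
    """Eq. (12), single coil: keep acquired k-space on the mask, network output elsewhere."""
    k = np.fft.fft2(x, norm="ortho")
    return np.fft.ifft2(np.where(mask, y, k), norm="ortho")

rng = np.random.default_rng(0)
h, w, n, k_lat, d = 16, 16, 8, 4, 32
x = rng.standard_normal((h * w, n))
w_l = rng.standard_normal((k_lat, d))
Wq, Wa = rng.standard_normal((n, n)), rng.standard_normal((n, n))
Wk, Wv = rng.standard_normal((d, n)), rng.standard_normal((d, n))
W_out = rng.standard_normal((n, 2))     # stands in for the channel-reducing convolution

feat = cross_attention_modulate(x, w_l, Wq, Wk, Wv, Wa)
feat = cross_attention_modulate(feat, w_l, Wq, Wk, Wv, Wa)   # weights reused for brevity
x_c = to_complex((feat @ W_out).reshape(h, w, 2))

mask = rng.random((h, w)) < 0.25
y = mask * np.fft.fft2(rng.standard_normal((h, w)), norm="ortho")
x_dc = data_consistency(x_c, y, mask)
```

The attention matrix here has shape $(h\times w)\times k$ for $k$ latent variables, so its cost grows linearly with the number of pixels.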

Implementation Details

The Adam optimizer is used for self-supervised training with $\beta=(0.9,0.999)$ and a learning rate of 0.002. Default noise schedule parameters are taken from [13]. 1000 forward diffusion steps are used for training and 5 reverse diffusion steps for inference, with a batch size of 1. $\mathcal{M}_{r}$ is sampled uniformly from $\mathcal{M}$ by collecting 5% of the acquired points. We used network snapshots at 445K and 654K steps, which correspond to the 28th and 109th epochs for the IXI and fastMRI datasets, respectively. A single NVIDIA RTX A5000 GPU is used for training and inference.

4 Experimental Results

4.1 Datasets

Experiments are performed using the following multi-coil and single-coil brain MRI datasets:

1. fastMRI: Reconstruction performance is demonstrated on the multi-coil brain MRI dataset [17]. 100 subjects are used for training, 10 for validation and 40 for testing. Data from multiple sites are included with no common protocol. T₁-, T₂- and Flair-weighted acquisitions are considered. GCC [32] is used to decrease the number of coils to 5 to reduce the computational complexity.
2. IXI: Reconstruction performance is demonstrated on single-coil brain MRI data from IXI (http://brain-development.org/ixi-dataset/). T₁-, T₂- and PD-weighted acquisitions are considered. In IXI, 25 subjects are used for training, 5 for validation and 10 for testing.

Acquisitions are retrospectively undersampled using variable-density masks. Undersampling masks are generated based on a 2D Gaussian distribution with variance adjusted to obtain acceleration rates of $R=[4,8]$ .
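One way to realize such a mask is sketched below: each k-space location is kept with a probability given by a centered 2D Gaussian, and the Gaussian width is tuned by bisection until the expected sampling rate equals $1/R$. The function names, the normalized k-space coordinates, the bisection bounds, and the Bernoulli sampling are assumptions for illustration; the exact mask generator used in the experiments is not described in the text.

```python
import numpy as np

def gaussian_density(shape, sigma):
    """Sampling probability per k-space location (k-space center at the array center)."""
    h, w = shape
    ky = (np.arange(h) - h // 2) / h
    kx = (np.arange(w) - w // 2) / w
    r2 = ky[:, None] ** 2 + kx[None, :] ** 2
    return np.exp(-r2 / (2.0 * sigma ** 2))

def variable_density_mask(shape, acceleration, rng, iters=50):
    """Tune the Gaussian width so the expected sampling rate is 1 / acceleration."""
    target = 1.0 / acceleration
    lo, hi = 1e-3, 10.0
    for _ in range(iters):                      # bisection on sigma
        mid = 0.5 * (lo + hi)
        if gaussian_density(shape, mid).mean() < target:
            lo = mid
        else:
            hi = mid
    density = gaussian_density(shape, 0.5 * (lo + hi))
    return rng.random(shape) < density          # Bernoulli draw per location

rng = np.random.default_rng(0)
mask_4x = variable_density_mask((128, 128), acceleration=4, rng=rng)
mask_8x = variable_density_mask((128, 128), acceleration=8, rng=rng)
achieved_4x = 1.0 / mask_4x.mean()
achieved_8x = 1.0 / mask_8x.mean()
```

The mask here is centered in the array, so it would need an `np.fft.ifftshift` before being applied to k-space computed with `np.fft.fft2`, which places the zero frequency at index 0.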

4.2 Competing Methods

We compare the performance of SSDiffRecon with the following supervised and self-supervised baselines:

1. DDPM: Supervised diffusion-based reconstruction baseline. DDPM is trained with fully sampled MR images and follows a novel k-space sampling approach during inference introduced by Peng et al. [24]. 1000 forward and backward diffusion steps are used in training and inference respectively.
2. self-DDPM: Self-supervised diffusion-based reconstruction baseline. Self-DDPM is trained using only undersampled MRI acquisitions. Other than training, the inference procedure is identical to the DDPM.
3. D5C5: Supervised model-based reconstruction baseline. D5C5 is trained using under- and fully sampled paired MR images. Network architecture and training loss are adopted from [25].
4. self-D5C5: Self-supervised model-based reconstruction baseline. Self-D5C5 is trained using the self-supervision approach introduced in [30] using undersampled acquisitions. The hyperparameters and network architecture are the same as in D5C5.
5. RGAN: CNN-based reconstruction baseline. RGAN is trained using paired under- and fully sampled MR images. Network architecture and hyperparameters are adapted from [7].
6. self-RGAN: Self-supervised CNN-based reconstruction baseline. Self-RGAN is trained using the self-supervision loss in [30] using only undersampled images. Network architecture and other hyperparameters are identical to RGAN.

4.3 Experiments

We compared the reconstruction performance using Peak-Signal-to-Noise-Ratio (PSNR, dB) and Structural-Similarity-Index (SSIM, %) between reconstructions and the ground truth images. Hyperparameter selection for each method is performed via cross-validation to maximize PSNR.

Ablation Experiments

We perform the following four ablation experiments to show the relative effect of each component in the model on the reconstruction quality as well as the effect of self-supervision in Table 1.

1. Supervised: Supervised training of SSDiffRecon using paired under- and fully sampled MR images and pixel-wise loss is performed. Other than training, inference sampling procedures are the same as the SSDiffRecon.
2. UNET: Original UNET architecture in DDPM [13] is trained with the same self-supervised loss as in SSDiffRecon. Other than the denoising network architecture, the training and inference procedures are not changed.
3. Without TR: SSDiffRecon without cross-attention transformer layers is trained and tested. This model only consists of data-consistency and CNN layers. Other than the network, training and inference procedures are not changed.
4. Without DC: SSDiffRecon without the data-consistency layers is trained and tested. This model does not utilize data-consistency but the other training and inference details are the same as the SSDiffRecon.

Table 1: Ablation results averaged across the whole fastMRI test set.

	SSDiffRecon		Supervised		UNET		Without TR		Without DC
	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
fastMRI	35.9	94.1	36.3	93.2	26.9	84.7	35.1	93.8	26.4	66.4

Table 2: Reconstruction performance on the IXI dataset for R = 4 and 8.

	DDPM		D5C5		RGAN		self-DDPM		self-D5C5		self-RGAN		SSDiffRecon
	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
T₁-4x	39.4	98.8	37.1	97.1	36.8	97.6	33.6	92.7	39.3	98.4	36.9	97.9	42.3	99.3
T₁-8x	33.0	96.8	30.9	94.0	32.3	95.4	30.1	90.2	32.7	96.0	31.9	95.9	34.6	97.9
T₂-4x	41.8	98.5	38.9	95.0	38.5	96.6	35.4	88.1	40.2	97.4	39.0	96.9	45.9	99.1
T₂-8x	36.2	96.2	34.2	91.6	34.6	94.3	32.9	84.9	34.6	94.5	34.9	94.6	39.3	98.0
PD-4x	37.1	98.5	36.5	94.7	35.6	96.5	32.5	88.4	39.4	98.5	36.3	96.8	38.4	99.0
PD-8x	32.2	96.1	30.1	90.5	31.9	93.9	29.8	85.0	32.3	94.5	31.8	94.4	33.3	97.8

5 Results

The results are presented in two clusters using a single figure and table for each dataset: fastMRI results are presented in Fig. 4 and Table 3, and IXI results are presented in Fig. 3 and Table 2. The best-performing method in each test case is marked in bold in the tables. SSDiffRecon yields 2.55 dB higher average PSNR and 1.96% higher SSIM than the second-best self-supervised baseline on IXI, and 0.4 dB higher PSNR and 0.25% higher SSIM on fastMRI. Visually, it captures most of the high-frequency details, while other self-supervised reconstructions suffer from either high noise or blurring artifacts. Moreover, the visual quality of its reconstructions is either very close to or better than that of supervised methods, as can be seen in the figures. It is also important to note that SSDiffRecon performs only five backward diffusion steps, whereas regular DDPM performs one thousand diffusion steps for an equivalent reconstruction performance.
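The average gains quoted above can be reproduced approximately from the rounded entries of Tables 2 and 3. The short script below copies the rows of SSDiffRecon and of the strongest self-supervised baseline on each dataset (self-D5C5 on IXI, self-DDPM on fastMRI, judged by average PSNR) and averages over the six test cases. Small deviations from the reported numbers are expected because the tables list values rounded to one decimal.

```python
# PSNR (dB) and SSIM (%) values copied from Tables 2 and 3, in table row order
# (4x and 8x for each of the three contrasts).
ixi = {
    "SSDiffRecon": {"psnr": [42.3, 34.6, 45.9, 39.3, 38.4, 33.3],
                    "ssim": [99.3, 97.9, 99.1, 98.0, 99.0, 97.8]},
    "self-D5C5":   {"psnr": [39.3, 32.7, 40.2, 34.6, 39.4, 32.3],
                    "ssim": [98.4, 96.0, 97.4, 94.5, 98.5, 94.5]},
}
fastmri = {
    "SSDiffRecon": {"psnr": [40.1, 35.1, 37.7, 33.7, 36.9, 32.1],
                    "ssim": [96.5, 93.3, 96.6, 93.8, 94.6, 89.7]},
    "self-DDPM":   {"psnr": [38.4, 35.4, 36.8, 34.3, 35.7, 32.5],
                    "ssim": [95.8, 93.3, 96.0, 94.2, 94.1, 89.6]},
}

def mean(values):
    return sum(values) / len(values)

def average_gain(table, baseline, metric):
    """Mean SSDiffRecon score minus mean baseline score over all test cases."""
    return mean(table["SSDiffRecon"][metric]) - mean(table[baseline][metric])

ixi_psnr_gain = average_gain(ixi, "self-D5C5", "psnr")
ixi_ssim_gain = average_gain(ixi, "self-D5C5", "ssim")
fastmri_psnr_gain = average_gain(fastmri, "self-DDPM", "psnr")
fastmri_ssim_gain = average_gain(fastmri, "self-DDPM", "ssim")

print(f"IXI:     +{ixi_psnr_gain:.2f} dB PSNR, +{ixi_ssim_gain:.2f} % SSIM")
print(f"fastMRI: +{fastmri_psnr_gain:.2f} dB PSNR, +{fastmri_ssim_gain:.2f} % SSIM")
```

From the rounded table entries, the IXI gains come out at about 2.55 dB and 1.97%, and the fastMRI gains at about 0.42 dB and 0.25%, in line with the figures stated in the text.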

Table 3: Reconstruction performance on the fastMRI dataset for R = 4 and 8.

	DDPM		D5C5		RGAN		self-DDPM		self-D5C5		self-RGAN		SSDiffRecon
	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM	PSNR	SSIM
T₁-4x	40.2	95.3	39.3	94.8	39.6	95.5	38.4	95.8	38.0	93.1	38.3	95.0	40.1	96.5
T₁-8x	36.2	91.7	35.6	92.6	36.0	92.7	35.4	93.3	34.8	90.5	35.0	92.4	35.1	93.3
T₂-4x	38.2	96.0	37.5	96.2	36.8	95.8	36.8	96.0	37.1	95.6	36.8	95.6	37.7	96.6
T₂-8x	34.5	93.0	34.3	94.0	33.8	93.1	34.3	94.2	33.8	93.0	34.0	93.4	33.7	93.8
Flair-4x	36.8	93.5	36.2	93.3	35.1	92.8	35.7	94.1	35.4	91.1	35.3	92.1	36.9	94.6
Flair-8x	33.1	87.8	32.7	89.1	32.0	88.1	32.5	89.6	32.2	86.3	32.1	87.8	32.1	89.7

6 Conclusion

We proposed a novel diffusion-based unrolled architecture for accelerated MRI reconstruction. Our model outperforms self-supervised baselines with a relatively short inference time while performing on par with supervised reconstruction methods. Inference time and model complexity analyses are presented in the supplementary materials.

6.0.1 Acknowledgement.

This work was supported by NIH R01 grant R01CA276221 and TUBITAK 1001 grant 121E488.

References

[1] Aggarwal, H.K., Mani, M.P., Jacob, M.: MoDL: Model-Based deep learning architecture for inverse problems. IEEE Transactions on Medical Imaging 38(2), 394–405 (2019)
[2] Bakker, T., Muckley, M., Romero-Soriano, A., Drozdzal, M., Pineda, L.: On learning adaptive acquisition policies for undersampled multi-coil mri reconstruction. arXiv preprint arXiv:2203.16392 (2022)
[3] Cao, C., Cui, Z.X., Liu, S., Liang, D., Zhu, Y.: High-frequency space diffusion models for accelerated mri. arXiv preprint arXiv:2208.05481 (2022)
[4] Cao, Y., Wang, L., Zhang, J., Xia, H., Yang, F., Zhu, Y.: Accelerating multi-echo mri in k-space with complex-valued diffusion probabilistic model. In: 2022 16th IEEE International Conference on Signal Processing (ICSP). vol. 1, pp. 479–484. IEEE (2022)
[5] Cui, Z.X., Cao, C., Liu, S., Zhu, Q., Cheng, J., Wang, H., Zhu, Y., Liang, D.: Self-score: Self-supervised learning on score-based models for mri reconstruction. arXiv preprint arXiv:2209.00835 (2022)
[6] Dar, S.U., Öztürk, Ş., Korkmaz, Y., Elmas, G., Özbey, M., Güngör, A., Çukur, T.: Adaptive diffusion priors for accelerated mri reconstruction. arXiv preprint arXiv:2207.05876 (2022)
[7] Dar, S.U., Yurt, M., Shahdloo, M., Ildız, M.E., Tınaz, B., Çukur, T.: Prior-guided image reconstruction for accelerated multi-contrast MRI via generative adversarial networks. IEEE Journal of Selected Topics in Signal Processing 14(6), 1072–1087 (2020)
[8] Guo, P., Patel, V.M.: Reference-based mri reconstruction using texture transformer. In: Medical Imaging with Deep Learning (2023)
[9] Guo, P., Valanarasu, J.M.J., Wang, P., Zhou, J., Jiang, S., Patel, V.M.: Over-and-under complete convolutional rnn for mri reconstruction. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part VI 24. pp. 13–23. Springer (2021)
[10] Guo, P., Wang, P., Zhou, J., Jiang, S., Patel, V.M.: Multi-institutional collaborations for improving deep learning-based magnetic resonance image reconstruction using federated learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2423–2432 (2021)
[11] Haldar, J.P., Hernando, D., Liang, Z.P.: Compressed-sensing mri with random encoding. IEEE transactions on Medical Imaging 30(4), 893–903 (2010)
[12] Hammernik, K., Pan, J., Rueckert, D., Küstner, T.: Motion-guided physics-based learning for cardiac mri reconstruction. In: 2021 55th Asilomar Conference on Signals, Systems, and Computers. pp. 900–907. IEEE (2021)
[13] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
[14] Huang, W., Li, C., Fan, W., Zhou, Y., Liu, Q., Zheng, H., Wang, S.: Rethinking the optimization process for self-supervised model-driven mri reconstruction. arXiv preprint arXiv:2203.09724 (2022)
[15] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of StyleGAN. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8107–8116 (2020)
[16] Knoll, F., Hammernik, K., Kobler, E., Pock, T., Recht, M.P., Sodickson, D.K.: Assessment of the generalization of learned image reconstruction and the potential for transfer learning. Magnetic Resonance in Medicine 81(1), 116–128 (2019)
[17] Knoll, F., Zbontar, J., Sriram, A., Muckley, M.J., Bruno, M., Defazio, A., Parente, M., Geras, K.J., Katsnelson, J., Chandarana, H., Zhang, Z., Drozdzalv, M., Romero, A., Rabbat, M., Vincent, P., Pinkerton, J., Wang, D., Yakubova, N., Owens, E., Zitnick, C.L., Recht, M.P., Sodickson, D.K., Lui, Y.W.: fastMRI: A publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning. Radiology: Artificial Intelligence 2(1), e190007 (2020)
[18] Kwon, K., Kim, D., Park, H.: A parallel MR imaging method using multilayer perceptron. Medical Physics 44(12), 6209–6224 (2017). https://doi.org/10.1002/mp.12600
[19] Lee, D., Yoo, J., Tak, S., Ye, J.C.: Deep residual learning for accelerated mri using magnitude and phase networks. IEEE Transactions on Biomedical Engineering 65(9), 1985–1995 (2018)
[20] Lustig, M., Donoho, D., Pauly, J.M.: Sparse mri: The application of compressed sensing for rapid mr imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 58(6), 1182–1195 (2007)
[21] Mardani, M., Gong, E., Cheng, J.Y., Vasanawala, S., Zaharchuk, G., Xing, L., Pauly, J.M.: Deep generative adversarial neural networks for compressive sensing MRI. IEEE Transactions on Medical Imaging 38(1), 167–179 (2019)
[22] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: International Conference on Machine Learning. pp. 8162–8171. PMLR (2021)
[23] Patel, V.M., Maleh, R., Gilbert, A.C., Chellappa, R.: Gradient-based image recovery methods from incomplete fourier measurements. IEEE Transactions on Image Processing 21(1), 94–105 (2011)
[24] Peng, C., Guo, P., Zhou, S.K., Patel, V.M., Chellappa, R.: Towards performant and reliable undersampled mr reconstruction via diffusion model sampling. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VI. pp. 623–633. Springer (2022)
[25] Qin, C., Schlemper, J., Caballero, J., Price, A.N., Hajnal, J.V., Rueckert, D.: Convolutional recurrent neural networks for dynamic mr image reconstruction. IEEE transactions on medical imaging 38(1), 280–290 (2018)
[26] Schlemper, J., Caballero, J., Hajnal, J.V., Price, A., Rueckert, D.: A Deep Cascade of Convolutional Neural Networks for MR Image Reconstruction. In: International Conference on Information Processing in Medical Imaging. pp. 647–658 (2017)
[27] Sriram, A., Zbontar, J., Murrell, T., Zitnick, C.L., Defazio, A., Sodickson, D.K.: GrappaNet: Combining parallel imaging with deep learning for multi-coil MRI reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14303–14310 (June 2020)
[28] Wang, S., Su, Z., Ying, L., Peng, X., Zhu, S., Liang, F., Feng, D., Liang, D.: Accelerating magnetic resonance imaging via deep learning. In: IEEE 13th International Symposium on Biomedical Imaging (ISBI). pp. 514–517 (2016). https://doi.org/10.1109/ISBI.2016.7493320
[29] Xie, Y., Li, Q.: Measurement-conditioned denoising diffusion probabilistic model for under-sampled medical image reconstruction. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VI. pp. 655–664. Springer (2022)
[30] Yaman, B., Hosseini, S.A.H., Moeller, S., Ellermann, J., Uğurbil, K., Akçakaya, M.: Self-supervised learning of physics-guided reconstruction neural networks without fully sampled reference data. Magnetic resonance in medicine 84(6), 3172–3191 (2020)
[31] Yu, S., Dong, H., Yang, G., Slabaugh, G., Dragotti, P.L., Ye, X., Liu, F., Arridge, S., Keegan, J., Firmin, D., Guo, Y.: DAGAN: Deep de-aliasing generative adversarial networks for fast compressed sensing MRI reconstruction. IEEE Transactions on Medical Imaging 37(6), 1310–1321 (2018)
[32] Zhang, T., Pauly, J.M., Vasanawala, S.S., Lustig, M.: Coil compression for accelerated imaging with cartesian sampling. Magnetic resonance in medicine 69(2), 571–582 (2013)
[33] Zhu, B., Liu, J.Z., Rosen, B.R., Rosen, M.S.: Image reconstruction by domain transform manifold learning. Nature 555(7697), 487–492 (2018)

Supplementary Materials

Table 4: The number of trainable parameters is shown for each competing method. Self-supervised baselines (self-DDPM, self-D5C5 and self-RGAN) have the same number of parameters as their supervised versions since they use the same network structures. Therefore, only a single value is included for each competing network type. As can be seen from the numbers, DDPM is the most complex model among all competing methods due to its large denoising UNET.

	DDPM	D5C5	RGAN	SSDiffRecon
#Trainable Parameters	164,303,618	297,794	11,377,154	3,292,570

Table 5: Inference times needed to reconstruct a single slice are presented in seconds. A single NVIDIA RTX A5000 GPU is used in the demonstration. Self-supervised baselines (self-DDPM, self-D5C5 and self-RGAN) have the same inference time as their supervised versions since they use the same network structures. Therefore, only a single value is included for each network type. RGAN and D5C5 are feed-forward networks, so the lowest inference times are expected for those methods, along with a relatively poor reconstruction performance. On the other hand, DDPM uses a thousand backward diffusion steps and thus suffers from prolonged inference time.

	DDPM	D5C5	RGAN	SSDiffRecon
Inference Time	52.270	0.045	0.004	0.219