©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Memory-Efficient Deep End-to-End Posterior Network (DEEPEN) for Inverse Problems
Abstract
End-to-End (E2E) unrolled optimization frameworks show promise for Magnetic Resonance (MR) image recovery, but suffer from high memory usage during training. In addition, these deterministic approaches do not offer opportunities for sampling from the posterior distribution. In this paper, we introduce a memory-efficient approach for E2E learning of the posterior distribution. We represent this distribution as the combination of a data-consistency-induced likelihood term and an energy model for the prior, parameterized by a Convolutional Neural Network (CNN). The CNN weights are learned from training data in an E2E fashion using maximum likelihood optimization. The learned model enables the recovery of images from undersampled measurements using Maximum A Posteriori (MAP) optimization. In addition, the posterior model can be sampled to derive uncertainty maps about the reconstruction. Experiments on parallel MR image reconstruction show that our approach performs comparably to the memory-intensive E2E unrolled algorithm, performs better than its memory-efficient counterpart, and can provide uncertainty maps. Our framework paves the way towards MR image reconstruction in 3D and higher dimensions.
Index Terms:
Energy model, MAP estimate, Parallel MRI reconstruction, Uncertainty estimate.
I Introduction
Compressed Sensing (CS) algorithms have been widely used for the recovery of Magnetic Resonance (MR) images from highly undersampled measurements. These methods rely on an energy minimization formulation, where the cost function is the combination of data-consistency and regularization terms. The algorithms typically alternate between a data-consistency update and a proximal mapping step that can be viewed as a denoiser. Inspired by this, Plug-and-Play (PnP) methods [1] replace the proximal operator with a Convolutional Neural Network (CNN), which is pre-trained as a denoiser for Gaussian-noise-corrupted images. End-to-End (E2E) training methods [2] instead rely on algorithm unrolling to optimize the CNN for a specific forward model, thereby offering improved performance. However, the unrolling strategy requires substantial memory during backpropagation, which limits the number of unrolled iterations. Deep Equilibrium Models (DEQ) [3, 4] were introduced to reduce memory demand by iterating a single layer until convergence to the fixed point. Both the DEQ and PnP methods require the CNN to be a contraction for theoretical convergence guarantees, which affects their practical performance.
All of the above deep learning methods can be viewed as Maximum A Posteriori (MAP) methods, differing mainly in how the prior is learned. By contrast, we propose a memory-efficient Deep E2E learning Network (DEEPEN) to learn the posterior distribution, tailored for a specific forward model. This learned posterior model can be used to derive the MAP estimate, and samples can be generated to compute uncertainty maps. Unlike fixed-point PnP and DEQ methods that lack an energy-based formulation, we propose to represent the posterior distribution as the combination of a likelihood term determined by data consistency and a prior. We use an Energy-Based Model (EBM) for the prior [5], which is parameterized by a CNN. Traditional EBM approaches [5] pre-learn the prior from training data and then combine it with a likelihood term for guidance. In contrast, we directly learn the posterior for a specific forward model in an E2E fashion. We learn the CNN parameters using maximum likelihood optimization, which can be viewed as a contrastive learning strategy. In particular, training amounts to minimizing the energy of the true reference samples, while the energy is maximized for the fake samples drawn from the posterior. We use Langevin dynamics to generate the fake samples. Note that our approach does not involve algorithm unrolling, and therefore its memory demand during training is minimal. Unlike DEQ training, this approach does not require fixed-point iterations for backpropagation, which also reduces the computational complexity of the algorithm. Unlike existing models such as PnP and DEQ, our MAP estimation algorithm does not require Lipschitz constraints on the CNN for convergence, potentially enhancing performance. In contrast to DEQ and unrolled methods, the proposed posterior energy model can also be sampled to generate representative images and calculate uncertainty estimates. This approach is thus an alternative to diffusion models that use pre-trained CNN denoisers at various noise levels, which often require numerous iterations and specialized approaches to guide diffusion at different noise scales [6]. In comparison, our method uses a single-scale CNN, which leads to faster convergence.
II Posterior learning
We consider the recovery of an MR image $\mathbf{x}$ from its corrupted undersampled measurements $\mathbf{y}$:

$$\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{n} \qquad (1)$$

where $\mathbf{A}$ is known and $\mathbf{n} \sim \mathcal{N}(\mathbf{0}, \sigma^{2}\mathbf{I})$. The likelihood in this case is given by:

$$p(\mathbf{y}|\mathbf{x}) = \frac{1}{Z}\exp\left(-\frac{\|\mathbf{A}\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}\right) \qquad (2)$$

where $Z$ is a normalizing constant. The negative log posterior of the distribution is specified by:

$$-\log p(\mathbf{x}|\mathbf{y}) = \frac{\|\mathbf{A}\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}} - \log p(\mathbf{x}) + \mathrm{const.} \qquad (3)$$

where $p(\mathbf{x})$ is the prior. We model the prior in (3) as in [5]:

$$p_{\boldsymbol{\theta}}(\mathbf{x}) = \frac{1}{Z_{\boldsymbol{\theta}}}\exp\left(-E_{\boldsymbol{\theta}}(\mathbf{x})\right) \qquad (4)$$

where $E_{\boldsymbol{\theta}}(\mathbf{x})$ is a neural network with $\boldsymbol{\theta}$ denoting its parameters. Any CNN model that takes an image and outputs a positive scalar value is sufficient for the above approach. We illustrate the energy model in Fig. 1 using a two-layer network. The gradient $\nabla_{\mathbf{x}}E_{\boldsymbol{\theta}}(\mathbf{x})$ of the network can be evaluated using the chain rule. In the general case, the gradient operator can be derived using the built-in autograd function. Note from Fig. 1 that the gradient resembles an auto-encoder with a one-dimensional latent space.
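For concreteness, the following is a minimal PyTorch-style sketch (our illustration, not the authors' released code) of evaluating $\nabla_{\mathbf{x}}E_{\boldsymbol{\theta}}(\mathbf{x})$ with autograd; `energy_net` stands for any CNN that maps an image to a positive scalar.

```python
import torch

def energy_grad(energy_net, x):
    """Evaluate grad_x E_theta(x) with autograd; the backward pass mirrors Fig. 1."""
    x = x.detach().requires_grad_(True)
    e = energy_net(x).sum()                                  # scalar energy, summed over the batch
    return torch.autograd.grad(e, x, create_graph=True)[0]   # keep the graph for training-time use
```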

II-A Maximum Likelihood training of the posterior

Current EBM methods pre-learn $E_{\boldsymbol{\theta}}$ from the training data and then use it in (3), similar to PnP methods. Motivated by the improved performance of E2E methods, we learn the posterior distribution from the training data and for a specific forward model in an E2E fashion using maximum likelihood optimization. In particular, we determine the optimal weights $\boldsymbol{\theta}$ of the energy by minimizing the negative log-likelihood of the training data set:

$$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} \; \mathbb{E}_{\mathbf{x}\sim p_{\mathrm{data}}}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})\right] \qquad (5)$$

where $p_{\mathrm{data}}$ is the probability distribution of the data and the posterior $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})$ is defined as in (3), with the prior defined in (4). We thus have:

$$p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y}) = \frac{1}{Z_{\boldsymbol{\theta}}(\mathbf{y})}\exp\Big(-\underbrace{\Big(\frac{\|\mathbf{A}\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}} + E_{\boldsymbol{\theta}}(\mathbf{x})\Big)}_{\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x})}\Big) \qquad (6)$$

where $Z_{\boldsymbol{\theta}}(\mathbf{y})$ is a normalizing constant. For simplicity, we absorb the parameter $\sigma^{2}$ into the definition of the energy. Consequently, we have:

$$\nabla_{\boldsymbol{\theta}}\left[-\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})\right] = \nabla_{\boldsymbol{\theta}}E_{\boldsymbol{\theta}}(\mathbf{x}) + \nabla_{\boldsymbol{\theta}}\log Z_{\boldsymbol{\theta}}(\mathbf{y}) \qquad (7)$$

The second term can be computed using the chain rule [5]:

$$\nabla_{\boldsymbol{\theta}}\log Z_{\boldsymbol{\theta}}(\mathbf{y}) = -\,\mathbb{E}_{\mathbf{x}'\sim p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})}\left[\nabla_{\boldsymbol{\theta}}E_{\boldsymbol{\theta}}(\mathbf{x}')\right] \qquad (8)$$

where $\mathbf{x}'$ are samples from the learned posterior distribution $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})$. In the EBM literature, these are the generated or fake samples, denoted by $\mathbf{x}_{\mathrm{fake}}$, while the training samples are referred to as true samples, denoted by $\mathbf{x}_{\mathrm{true}}$. Thus, the ML estimation of $\boldsymbol{\theta}$ amounts to the minimization of the loss:

$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}_{\mathrm{true}}}\left[E_{\boldsymbol{\theta}}(\mathbf{x}_{\mathrm{true}})\right] - \mathbb{E}_{\mathbf{x}_{\mathrm{fake}}}\left[E_{\boldsymbol{\theta}}(\mathbf{x}_{\mathrm{fake}})\right] \qquad (9)$$

Intuitively, the training strategy (9) will seek to decrease the energy of the true samples ($\mathbf{x}_{\mathrm{true}}$) and increase the energy of the fake samples ($\mathbf{x}_{\mathrm{fake}}$). This may be seen as an adversarial training scheme similar to [7], where the classifier involves the energy itself, as shown in Fig. 2. The algorithm converges when the fake samples are identical in distribution to the training samples; i.e., $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y}) = p_{\mathrm{data}}$.
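A minimal sketch of one such contrastive update is shown below, assuming an `energy_net` implementing $E_{\boldsymbol{\theta}}$, a `sample_posterior` routine that produces the fake samples (e.g., the Langevin sampler of Sec. II-B), and a standard PyTorch optimizer; the names and structure are illustrative, not the paper's implementation.

```python
def train_step(energy_net, optimizer, x_true, y, sample_posterior):
    """One contrastive maximum-likelihood update of theta, following Eq. (9)."""
    x_fake = sample_posterior(energy_net, y).detach()             # MCMC 'fake' samples; no gradient through the chain
    loss = energy_net(x_true).mean() - energy_net(x_fake).mean()  # lower the energy on true samples, raise it on fake ones
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```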
II-B Generation of fake samples using Markov Chain Monte Carlo (MCMC)
We generate the fake samples using the Langevin MCMC method, which only requires the gradient of $\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x})$ w.r.t. $\mathbf{x}$:

$$\mathbf{x}_{k+1} = \mathbf{x}_{k} - \frac{\eta}{2}\,\nabla_{\mathbf{x}}\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k}) + \sqrt{\eta}\,\mathbf{z}_{k} \qquad (10)$$

where $\eta$ is the step-size and $\mathbf{z}_{k}$ is drawn from a standard normal distribution. The entire training process is summarized in Fig. 2.
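A possible PyTorch realization of this sampler is sketched below, assuming a callable forward operator `A` and the noise variance absorbed into the energy as above; the step size and iteration count are placeholders, not the values used in our experiments.

```python
import torch

def langevin_sample(energy_net, A, y, x_init, n_iter=200, eta=1e-4):
    """Draw a posterior sample by iterating the Langevin update of Eq. (10)."""
    x = x_init
    for _ in range(n_iter):
        x = x.detach().requires_grad_(True)
        post_energy = 0.5 * (A(x) - y).abs().pow(2).sum() + energy_net(x).sum()  # data consistency + prior energy
        grad = torch.autograd.grad(post_energy, x)[0]
        x = x - 0.5 * eta * grad + (eta ** 0.5) * torch.randn_like(x)            # gradient step plus injected noise
    return x.detach()
```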
II-C Maximum a posteriori image recovery
Once the posterior is learned, we obtain the MAP estimate by minimizing the energy $\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x})$ in (6) w.r.t. $\mathbf{x}$ using the classical gradient descent algorithm:

$$\mathbf{x}_{k+1} = \mathbf{x}_{k} - \gamma_{k}\,\nabla_{\mathbf{x}}\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k}) \qquad (11)$$

where $\gamma_{k}$ is the step-size, found using the backtracking line search method [8]. This line search starts with a large step-size value $\gamma_{0}$ and keeps decreasing it by a factor $\beta < 1$ until the chosen step-size sufficiently decreases the cost function. The decrease in the cost function is typically measured using the Armijo–Goldstein rule, $\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k+1}) \le \mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k}) - c\,\gamma_{k}\,\|\nabla_{\mathbf{x}}\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k})\|^{2}$ with $c \in (0,1)$, thereby ensuring that the chosen step-size monotonically decreases the cost function at every iteration. The following result shows that the proposed algorithm converges monotonically to a local minimum of the cost function (6).
Theorem II.1 ([8])
Consider the cost function $\mathcal{E}_{\boldsymbol{\theta}}$ in (6), which is bounded below by zero (the CNN implementation has a Rectified Linear Unit (ReLU) in the output layer, which makes the lower bound zero). Then, the steepest descent optimization scheme in (11) with backtracking line search to determine the optimal step-size will monotonically decrease the cost function (i.e., $\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k+1}) \le \mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k})$). In addition, the sequence $\{\mathbf{x}_{k}\}$ will converge to a stationary point of (6), provided the gradient $\nabla_{\mathbf{x}}\mathcal{E}_{\boldsymbol{\theta}}$ is Lipschitz-continuous.
Note that the above theorem is applicable irrespective of the Lipschitz constant of the CNN, and when $\nabla_{\mathbf{x}}E_{\boldsymbol{\theta}}$ is a conservative vector field, i.e., it is the gradient of the scalar energy $E_{\boldsymbol{\theta}}$.
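The sketch below illustrates this MAP recovery under the stated assumptions; `cost` and `grad` are assumed callables evaluating $\mathcal{E}_{\boldsymbol{\theta}}$ and its gradient, and the stopping rule shown is a hypothetical relative-change criterion rather than the one used in our experiments.

```python
def map_recover(cost, grad, x0, n_iter=100, gamma0=1.0, beta=0.5, c=1e-4, tol=1e-6):
    """Gradient descent (Eq. 11) with backtracking (Armijo) line search on the posterior energy."""
    x = x0
    for _ in range(n_iter):
        g = grad(x)
        fx, gnorm2 = cost(x), (g.abs() ** 2).sum()
        gamma = gamma0
        # shrink the step until the Armijo-Goldstein sufficient-decrease condition holds
        while gamma > 1e-12 and cost(x - gamma * g) > fx - c * gamma * gnorm2:
            gamma *= beta
        x_new = x - gamma * g
        if (x_new - x).norm() / x.norm() < tol:   # hypothetical stopping criterion
            return x_new
        x = x_new
    return x
```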


III Experiments
III-A Data set
We evaluated the performance of DEEPEN on the publicly available parallel fastMRI brain data set, which contains FLAIR and T2-weighted images [9]. It is a twelve-channel brain data set consisting of complex images. The matrix $\mathbf{A}$ in (1) in this case is $\mathbf{A} = \mathbf{S}\,\mathbf{F}\,\mathbf{C}$, where $\mathbf{S}$ is the sampling matrix, $\mathbf{F}$ is the Fourier transform, and $\mathbf{C}$ denotes the coil sensitivity maps, which are estimated using the ESPIRiT algorithm [10]. The data set for each contrast was split into training, validation, and test subjects. DEEPEN was trained and evaluated on both contrasts separately for four-fold retrospectively undersampled measurements, which were obtained by sampling the data along the phase-encoding direction using a 1D non-uniform variable-density mask.
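For illustration, a minimal sketch of this multi-coil forward model and its adjoint is given below (single slice, coil maps and sampling mask assumed to be given tensors); it is not the code used to generate the reported results.

```python
import torch

def sense_forward(x, coil_maps, mask):
    """A x = S F C x: coil weighting, 2D FFT, and k-space undersampling."""
    coil_images = coil_maps * x                          # C: multiply the image by each coil sensitivity
    kspace = torch.fft.fft2(coil_images, norm="ortho")   # F: per-coil Fourier transform
    return mask * kspace                                 # S: retain only the sampled k-space locations

def sense_adjoint(y, coil_maps, mask):
    """A^H y, useful e.g. for an initial SENSE-type reconstruction."""
    coil_images = torch.fft.ifft2(mask * y, norm="ortho")
    return (coil_maps.conj() * coil_images).sum(dim=0)
```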
III-B Architecture and implementation
The network $E_{\boldsymbol{\theta}}$ defined in (4) was built using five 3×3 convolutional layers with 64 channels each, followed by a linear layer. A ReLU was used between each convolutional layer and at the end of the linear layer. We evaluated the gradient $\nabla_{\mathbf{x}}E_{\boldsymbol{\theta}}$ using the chain rule.
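The following PyTorch sketch shows one way to realize this architecture; the choice of two input channels (real and imaginary parts) and the global average pooling before the linear layer are our assumptions and are not specified in the text.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    def __init__(self, in_ch=2, features=64):
        super().__init__()
        layers = [nn.Conv2d(in_ch, features, 3, padding=1)]
        for _ in range(4):
            layers += [nn.ReLU(), nn.Conv2d(features, features, 3, padding=1)]
        self.cnn = nn.Sequential(*layers)       # five 3x3 convolutional layers with ReLUs in between
        self.linear = nn.Linear(features, 1)    # final linear layer

    def forward(self, x):
        h = self.cnn(x).mean(dim=(2, 3))        # global average pooling over space (an assumption)
        return torch.relu(self.linear(h)).squeeze(1)  # ReLU at the output, so E_theta(x) >= 0
```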
The MCMC sampling was performed for a fixed number of iterations and was initialized with the estimate obtained by minimizing the negative log-posterior w.r.t. $\mathbf{x}$. Next, similar to [11], for stable training we found it beneficial to: a) add Gaussian noise to the training data, which effectively smooths the training distribution, and b) scale the energy $\mathcal{E}_{\boldsymbol{\theta}}$ by a constant $\lambda$, which is equivalent to using the following Langevin MCMC update:

$$\mathbf{x}_{k+1} = \mathbf{x}_{k} - \frac{\eta\,\lambda}{2}\,\nabla_{\mathbf{x}}\mathcal{E}_{\boldsymbol{\theta}}(\mathbf{x}_{k}) + \sqrt{\eta}\,\mathbf{z}_{k} \qquad (12)$$
DEEPEN was trained in this fashion, and the optimal $\boldsymbol{\theta}$ was found using the Adam optimizer. The MAP gradient descent algorithm (11) was run until convergence.
We compared the performance of the proposed algorithm with SENSE [12], MoDL [2], and MoL [4]. The unrolled algorithms were trained in an E2E fashion with a fixed number of unrolled iterations. A five-layer CNN was used for both unrolled algorithms. The Lipschitz constraint in the case of MoL was implemented using the log-barrier approach.
IV Results

IV-A Maximum a posteriori estimates
Table I compares the performance of the reconstruction algorithms on the test data. Fig. 3(a) and Fig. 3(b) show the reconstructed T2-weighted and FLAIR images, respectively. PSNR and SSIM are used as evaluation metrics. Table I shows that DEEPEN and MoDL perform better than SENSE and MoL, and perform comparably with each other. While SENSE is a CS-based algorithm without a neural network module, the lower performance of the unrolled MoL algorithm is due to the Lipschitz constraint on its CNN. This constraint is required to ensure convergence of MoL to the fixed point. Note that, according to Theorem II.1, the DEEPEN algorithm can ensure convergence to a stationary point without any constraint on the CNN, which results in improved performance. We would like to remind the reader that although DEEPEN and MoDL have similar performance, DEEPEN is memory-efficient and hence paves the way to reconstructing images of higher dimensions (for example, 3D brain volumes), which is not possible with MoDL.
TABLE I: Performance of the reconstruction algorithms on the test data.

| Method | FLAIR Avg. PSNR (dB) | FLAIR Avg. SSIM | T2 Avg. PSNR (dB) | T2 Avg. SSIM |
| --- | --- | --- | --- | --- |
| SENSE | 34.24 | 0.931 | 36.14 | 0.95 |
| DEEPEN | 37.14 | 0.96 | 39.22 | 0.98 |
| MoDL | 37.97 | 0.97 | 40.13 | 0.98 |
| MoL | 36.31 | 0.96 | 38.12 | 0.97 |
IV-B Bayes estimation
Given the posterior distribution $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{y})$, we can now draw samples from it. This aids in estimating the Minimum Mean Square Error (MMSE) estimate and the uncertainty map of the reconstructed image. Fig. 4 shows the MMSE estimate and the uncertainty map provided by DEEPEN on a four-fold undersampled FLAIR-contrast MR brain image. The MMSE estimate and the uncertainty map are computed as the mean and variance, respectively, of samples from the posterior distribution. The samples are obtained using the Langevin MCMC method initialized randomly from a Gaussian distribution.
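A short sketch of this Monte-Carlo estimation is given below, assuming a `sample_posterior` routine that runs the Langevin chain from a given initialization; the number of samples and the complex random initialization shown are illustrative choices.

```python
import torch

def mmse_and_uncertainty(sample_posterior, image_shape, n_samples=50, device="cpu"):
    """MMSE estimate (sample mean) and uncertainty map (sample variance) from posterior samples."""
    samples = []
    for _ in range(n_samples):
        x0 = torch.randn(image_shape, dtype=torch.complex64, device=device)  # random Gaussian initialization
        samples.append(sample_posterior(x0))             # e.g., the Langevin sampler of Sec. II-B
    samples = torch.stack(samples)                       # (n_samples, H, W)
    mmse = samples.mean(dim=0)                           # posterior mean approximates the MMSE estimate
    uncertainty = samples.abs().var(dim=0)               # pixel-wise variance of the magnitudes
    return mmse, uncertainty
```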
V Conclusion
In this paper, we proposed a memory-efficient E2E training framework, DEEPEN, which does not require constraining the Lipschitz constant of the CNN. Consequently, DEEPEN achieved a better PSNR than the memory-efficient unrolled algorithm MoL. Moreover, we have a well-defined cost function that allows guaranteed convergence to a stationary point and enables the use of faster sampling algorithms such as Metropolis-Hastings to obtain the uncertainty map of the estimate. This work can be extended to MR image reconstruction in higher dimensions.
VI Compliance with ethical standards
This study was conducted on a publicly available human subject data set. Ethical approval was not required, as confirmed by the license attached with the open-access data.
VII Acknowledgments
This work is supported by NIH grants R01-AG067078, R01-EB031169, and R01-EB019961.
References
- [1] U. S. Kamilov, C. A. Bouman, G. T. Buzzard, and B. Wohlberg, "Plug-and-play methods for integrating physical and learned models in computational imaging," arXiv preprint arXiv:2203.17061, 2022.
- [2] H. K. Aggarwal, M. P. Mani, and M. Jacob, "MoDL: Model-based deep learning architecture for inverse problems," IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. 394–405, 2018.
- [3] S. Bai, J. Z. Kolter, and V. Koltun, "Deep equilibrium models," Advances in Neural Information Processing Systems, vol. 32, 2019.
- [4] A. Pramanik, M. B. Zimmerman, and M. Jacob, "Memory-efficient model-based deep learning with convergence and robustness guarantees," IEEE Transactions on Computational Imaging, vol. 9, pp. 260–275, 2023.
- [5] Y. Song and D. P. Kingma, "How to train your energy-based models," arXiv preprint arXiv:2101.03288, 2021.
- [6] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, "Diffusion posterior sampling for general noisy inverse problems," in The Eleventh International Conference on Learning Representations, 2022.
- [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
- [8] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 1999.
- [9] J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno et al., "fastMRI: An open dataset and benchmarks for accelerated MRI," arXiv preprint arXiv:1811.08839, 2018.
- [10] M. Uecker, P. Lai, M. J. Murphy, P. Virtue, M. Elad, J. M. Pauly, S. S. Vasanawala, and M. Lustig, "ESPIRiT - an eigenvalue approach to autocalibrating parallel MRI: where SENSE meets GRAPPA," Magnetic Resonance in Medicine, vol. 71, no. 3, pp. 990–1001, 2014.
- [11] E. Nijkamp, M. Hill, T. Han, S.-C. Zhu, and Y. N. Wu, "On the anatomy of MCMC-based maximum likelihood learning of energy-based models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5272–5280.
- [12] K. P. Pruessmann, M. Weiger, M. B. Scheidegger, and P. Boesiger, "SENSE: sensitivity encoding for fast MRI," Magnetic Resonance in Medicine, vol. 42, no. 5, pp. 952–962, 1999.