
Β©2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Memory-Efficient Deep End-to-End Posterior Network (DEEPEN) for Inverse Problems

Abstract

End-to-End (E2E) unrolled optimization frameworks show promise for Magnetic Resonance (MR) image recovery, but suffer from high memory usage during training. In addition, these deterministic approaches do not allow sampling from the posterior distribution. In this paper, we introduce a memory-efficient approach for E2E learning of the posterior distribution. We represent this distribution as the combination of a data-consistency-induced likelihood term and an energy model for the prior, parameterized by a Convolutional Neural Network (CNN). The CNN weights are learned from training data in an E2E fashion using maximum likelihood optimization. The learned model enables the recovery of images from undersampled measurements using Maximum A Posteriori (MAP) optimization. In addition, the posterior model can be sampled to derive uncertainty maps for the reconstruction. Experiments on parallel MR image reconstruction show that our approach performs comparably to the memory-intensive E2E unrolled algorithm, outperforms its memory-efficient counterpart, and provides uncertainty maps. Our framework paves the way towards MR image reconstruction in 3D and higher dimensions.

Index Terms:
Energy model, MAP estimate, Parallel MRI reconstruction, Uncertainty estimate.

I Introduction

Compressed Sensing (CS) algorithms have been widely used for the recovery of Magnetic Resonance (MR) images from highly undersampled measurements. These methods rely on an energy minimization formulation, where the cost function is the combination of data-consistency and regularization terms. The algorithms typically alternate between a data-consistency update and a proximal mapping step that can be viewed as a denoiser. Inspired by this, Plug-and-Play (PnP) methods [1] replace the proximal operator with a Convolutional Neural Network (CNN), which is pre-trained as a denoiser for Gaussian-noise-corrupted images. End-to-End (E2E) training methods [2] instead rely on algorithm unrolling to optimize the CNN for a specific forward model, thereby offering improved performance. However, the unrolling strategy requires substantial memory during backpropagation, which limits the number of unrolled iterations. Deep Equilibrium Models (DEQ) [3, 4] were introduced to reduce the memory demand by iterating a single layer until convergence to a fixed point. Both the DEQ and PnP methods require the CNN to be a contraction for theoretical convergence guarantees, which affects their practical performance.

All of the above deep learning methods can be viewed as Maximum A Posteriori (MAP) methods, differing mainly in how the prior is learned. In contrast, we propose a memory-efficient Deep E2E learning Network (DEEPEN) to learn the posterior distribution, tailored for a specific forward model. This learned posterior model can be used to derive the MAP estimate, and samples can be generated to compute uncertainty maps. Unlike fixed-point PnP and DEQ methods that lack an energy-based formulation, we represent the posterior distribution as the combination of a likelihood term determined by data consistency and a prior. We use an Energy-Based Model (EBM) for the prior [5], which is parameterized by a CNN. Traditional EBM approaches [5] pre-learn the prior from training data and then combine it with a likelihood term for guidance. Instead, we directly learn the posterior for a specific forward model in an E2E fashion. We learn the CNN parameters using maximum likelihood optimization, which can be viewed as a contrastive learning strategy. In particular, training amounts to minimizing the energy for the true reference samples, while the energy is maximized for the fake samples drawn from the posterior. We use Langevin dynamics to generate the fake samples. Note that our approach does not involve algorithm unrolling, and therefore its memory demand during training is minimal. Unlike DEQ training, this approach does not require fixed-point iterations for backpropagation, which also reduces the computational complexity of the algorithm. Unlike existing models such as PnP and DEQ, our MAP estimation algorithm does not require Lipschitz constraints on the CNN for convergence, potentially enhancing performance. In contrast to DEQ and unrolled methods, the proposed posterior energy model can also be sampled to generate representative images and calculate uncertainty estimates. This approach is thus an alternative to diffusion models that use pre-trained CNN denoisers at various noise levels, which often require numerous iterations and specialized approaches to guide diffusion at different noise scales [6]. In comparison, our method uses a single-scale CNN, which leads to faster convergence.

II Posterior learning

We consider the recovery of an MR image $\boldsymbol{x}\in\mathbb{C}^{m}$ from its corrupted undersampled measurements $\boldsymbol{b}\in\mathbb{C}^{n}$:

$$\boldsymbol{b}=\mathbf{A}\boldsymbol{x}+\boldsymbol{n} \qquad (1)$$

where $\mathbf{A}\in\mathbb{C}^{n\times m}$ is known and $\boldsymbol{n}\sim\mathcal{N}(0,\eta^{2}\mathbf{I})$ is Gaussian noise. The likelihood in this case is given by:

$$p(\boldsymbol{b}|\boldsymbol{x})=\dfrac{1}{P}\exp\left(-\dfrac{\|\mathbf{A}\boldsymbol{x}-\boldsymbol{b}\|^{2}}{2\eta^{2}}\right) \qquad (2)$$

where $P$ is a normalizing constant. The negative log-posterior is then given by:

$$-\log p(\boldsymbol{x}|\boldsymbol{b})=\dfrac{1}{2\eta^{2}}\|\mathbf{A}\boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}-\log p(\boldsymbol{x})+c \qquad (3)$$

where $p(\boldsymbol{x})$ is the prior and $c$ is a constant that does not depend on $\boldsymbol{x}$. We model the prior in (3) as in [5]:

$$p_{\boldsymbol{\theta}}(\boldsymbol{x})=\dfrac{1}{Z_{\boldsymbol{\theta}}}\exp\left(-E_{\boldsymbol{\theta}}(\boldsymbol{x})\right) \qquad (4)$$

where $E_{\boldsymbol{\theta}}(\boldsymbol{x}):\mathbb{C}^{m}\rightarrow\mathbb{R}^{+}$ is a neural network with parameters $\boldsymbol{\theta}$. Any CNN that maps an image to a positive scalar is sufficient for this approach. We illustrate the energy model in Fig. 1 using a two-layer network. The gradient $\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x}):\mathbb{C}^{m}\rightarrow\mathbb{C}^{m}$ of the network can be evaluated using the chain rule; in the general case, it can be obtained with the built-in autograd functionality. Note from Fig. 1 that the gradient resembles an auto-encoder with a one-dimensional latent space.
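To make this concrete, the following minimal PyTorch sketch (our illustration, not the authors' implementation) shows how $\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x})$ can be obtained with autograd; `energy_net` is a hypothetical module that maps an image batch to one non-negative scalar per sample, and images are assumed to be stored as real-valued tensors (e.g., two channels for the real and imaginary parts).

```python
import torch

def energy_gradient(energy_net, x):
    """Evaluate grad_x E_theta(x) with autograd.

    energy_net: hypothetical module mapping a batch of images to one
    non-negative scalar per sample; x: image batch (real-valued tensor).
    """
    x = x.detach().requires_grad_(True)   # track gradients w.r.t. the input image
    energy = energy_net(x).sum()          # sum over the batch to get a scalar
    grad_x, = torch.autograd.grad(energy, x)
    return grad_x                         # same shape as x
```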

Fig. 1: Illustration of the computation of $E_{\boldsymbol{\theta}}(\boldsymbol{x})$ using a two-layer network. Its gradient $\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x})$ is computed using the chain rule. $W_{1}$ and $W_{1}^{T}$ represent a convolutional layer and a transposed convolutional layer of appropriate size with shared weights, respectively; $L_{2}$ and $L_{2}^{T}$ are linear layers with shared weights; $\tau$ is the activation function and $\tau'$ represents its derivative.

II-A Maximum Likelihood training of the posterior

Fig. 2: Illustration of the training procedure of the proposed algorithm. The fake samples $\boldsymbol{x}^{-}$ are generated using Langevin sampling, indicated by the yellow box. The intermediate results are not stored for backpropagation, so a single physical layer is sufficient for forward propagation, which keeps the memory demand low. The samples $\boldsymbol{x}^{+}$ are obtained from the training data. The training loss involves the energy difference between true and fake samples; the training algorithm therefore modifies the energy $E_{\boldsymbol{\theta}}(\cdot)$ so that the generated samples $\boldsymbol{x}^{-}$ match the true samples $\boldsymbol{x}^{+}$.

Current EBM methods pre-learn $p(\boldsymbol{x})$ from the training data and then use it in (3), similar to PnP methods. Motivated by the improved performance of E2E methods, we instead learn the posterior distribution $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$ from the training data for a specific forward model $\mathbf{A}$ in an E2E fashion using maximum likelihood optimization. In particular, we determine the optimal weights of the energy $E_{\boldsymbol{\theta}}(\boldsymbol{x})$ by minimizing the negative log-likelihood of the training data set:

$$\boldsymbol{\theta}^{*}=\arg\min_{\boldsymbol{\theta}}\;\mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}\Big[\underbrace{-\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})}_{\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})}\Big] \qquad (5)$$

where $q(\boldsymbol{x})$ is the probability distribution of the data and the posterior $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$ is defined as in (3), with the prior given by (4). We thus have:

$$\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})=\dfrac{1}{2}\|\mathbf{A}\boldsymbol{x}-\boldsymbol{b}\|^{2}+E_{\boldsymbol{\theta}}(\boldsymbol{x})+\log\tilde{Z}_{\boldsymbol{\theta}}, \qquad (6)$$

where $\tilde{Z}_{\boldsymbol{\theta}}=\int\exp\left(-\frac{1}{2}\|\mathbf{A}\boldsymbol{x}-\boldsymbol{b}\|^{2}-E_{\boldsymbol{\theta}}(\boldsymbol{x})\right)d\boldsymbol{x}$ is a normalizing constant. For simplicity, we absorb the $\eta^{2}$ parameter into the definition of the energy. Consequently, we have:

$$\nabla_{\boldsymbol{\theta}}\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})=\mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}\left[\nabla_{\boldsymbol{\theta}}E_{\boldsymbol{\theta}}(\boldsymbol{x})\right]+\nabla_{\boldsymbol{\theta}}\log\tilde{Z}_{\boldsymbol{\theta}} \qquad (7)$$

The second term can be computed using the chain rule [5]:

$$\nabla_{\boldsymbol{\theta}}\log\tilde{Z}_{\boldsymbol{\theta}}=-\mathbb{E}_{\boldsymbol{x}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})}\left[\nabla_{\boldsymbol{\theta}}E_{\boldsymbol{\theta}}(\boldsymbol{x})\right] \qquad (8)$$

where $\boldsymbol{x}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$ are samples from the learned posterior distribution. In the EBM literature, these are the generated or fake samples, denoted by $\boldsymbol{x}^{-}$, while the training samples are referred to as true samples, denoted by $\boldsymbol{x}^{+}\sim q(\boldsymbol{x})$. Thus, the ML estimation of $\boldsymbol{\theta}$ amounts to the minimization of the loss:

$$\mathcal{L}^{\prime}(\boldsymbol{\theta})=\mathbb{E}_{\boldsymbol{x}\sim q(\boldsymbol{x})}E_{\boldsymbol{\theta}}(\boldsymbol{x})-\mathbb{E}_{\boldsymbol{x}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})}E_{\boldsymbol{\theta}}(\boldsymbol{x})\approx\dfrac{1}{n}\sum_{i=1}^{n}E_{\boldsymbol{\theta}}(\boldsymbol{x}^{+}_{i})-\dfrac{1}{m}\sum_{j=1}^{m}E_{\boldsymbol{\theta}}(\boldsymbol{x}^{-}_{j}) \qquad (9)$$

Intuitively, the training strategy (9) seeks to decrease the energy of the true samples, $E_{\boldsymbol{\theta}}(\boldsymbol{x}^{+})$, and increase the energy of the fake samples, $E_{\boldsymbol{\theta}}(\boldsymbol{x}^{-})$. This may be seen as an adversarial training scheme similar to [7], where the classifier involves the energy itself, as shown in Fig. 2. The algorithm converges when the fake samples are identical in distribution to the training samples; i.e., $E_{\boldsymbol{\theta}}(\boldsymbol{x}^{+})\approx E_{\boldsymbol{\theta}}(\boldsymbol{x}^{-})$.
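A rough sketch of one training step implementing the contrastive loss (9) is given below (a PyTorch illustration under our own naming conventions; the fake samples are produced by the Langevin sampler of Sec. II-B and treated as constants during backpropagation).

```python
import torch

def deepen_training_step(energy_net, optimizer, x_true, x_fake):
    """One maximum-likelihood update of theta using the contrastive loss (9).

    x_true: reference images from the training set (x^+);
    x_fake: Langevin samples from the current posterior (x^-), detached so
    that gradients flow only through E_theta, as in (7)-(8).
    """
    optimizer.zero_grad()
    loss = energy_net(x_true).mean() - energy_net(x_fake.detach()).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```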

II-B Generation of fake samples using Markov Chain Monte Carlo (MCMC)

We generate the fake samples $\boldsymbol{x}^{-}\sim p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$ using the Langevin MCMC method, which only requires the gradient of $\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$ w.r.t. $\boldsymbol{x}$:

$$\begin{aligned}\boldsymbol{x}_{n+1}&=\boldsymbol{x}_{n}+\dfrac{\epsilon^{2}}{2}\nabla_{\boldsymbol{x}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}_{n}|\boldsymbol{b})+\epsilon\boldsymbol{z}_{n}\\ &=\boldsymbol{x}_{n}-\dfrac{\epsilon^{2}}{2}\left(\mathbf{A}^{H}(\mathbf{A}\boldsymbol{x}_{n}-\boldsymbol{b})+\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x}_{n})\right)+\epsilon\boldsymbol{z}_{n}\end{aligned} \qquad (10)$$

where $\epsilon>0$ is the step-size, $\boldsymbol{z}_{n}\sim\mathcal{N}(0,\mathbf{I})$, and $\boldsymbol{x}_{0}$ is drawn from a simple posterior distribution. The entire training process is summarized in Fig. 2.
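A minimal sketch of the Langevin update (10) is shown below; it assumes hypothetical callables `A` and `AH` for the forward operator and its adjoint, and reuses the `energy_gradient` helper from the earlier sketch (illustrative, not the authors' code).

```python
import torch

def langevin_sample(energy_net, A, AH, b, x0, eps=1e-3, n_iter=30):
    """Draw a fake sample x^- from p_theta(x|b) with the Langevin update (10)."""
    x = x0.clone()
    for _ in range(n_iter):
        # gradient of log p_theta(x|b): data-consistency term plus energy gradient
        grad_log_post = -(AH(A(x) - b) + energy_gradient(energy_net, x))
        x = x + 0.5 * eps**2 * grad_log_post + eps * torch.randn_like(x)
    return x
```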

II-C Maximum a posteriori image recovery

Once the posterior is learned, we obtain the MAP estimate by minimizing (6) w.r.t. $\boldsymbol{x}$ using the classical gradient descent algorithm:

$$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}-\alpha_{k}\left(\mathbf{A}^{H}(\mathbf{A}\boldsymbol{x}_{k}-\boldsymbol{b})+\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})\right) \qquad (11)$$

where $\alpha_{k}$ is the step-size, determined using the backtracking line search method [8]. The line search starts with a large step-size, $\alpha_{k}=1$, and repeatedly shrinks it by a factor $\beta\in(0,1)$ until the step sufficiently decreases the cost function. The sufficient decrease is measured using the Armijo–Goldstein rule: the step-size is reduced as long as $\mathcal{L}_{\boldsymbol{\theta}}\left(\boldsymbol{x}_{k}-\alpha_{k}\nabla_{\boldsymbol{x}}\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})\right)>\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})-\beta\alpha_{k}\|\nabla_{\boldsymbol{x}}\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})\|_{2}^{2}$, which ensures that the chosen step-size monotonically decreases the cost function at every iteration. The following result shows that the proposed algorithm monotonically decreases the cost and converges to a stationary point of (6).

Theorem II.1 ([8])

Consider the cost function $\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})$ in (6), which is bounded below by zero (the CNN implementation of $E_{\boldsymbol{\theta}}(\boldsymbol{x})$ has a Rectified Linear Unit (ReLU) in the output layer, which makes the lower bound zero). Then, the steepest descent scheme in (11) with backtracking line search to determine the step-size $\alpha$ monotonically decreases the cost function (i.e., $\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k+1})\leq\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})$). In addition, the sequence $\{\boldsymbol{x}_{k}\}$ converges to a stationary point of (6), provided the gradient $\nabla_{\boldsymbol{x}}\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})$ is Lipschitz continuous.

Note that the above theorem is applicable irrespective of the Lipschitz constant of the CNN, as long as $\nabla_{\boldsymbol{x}}\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})$ is a conservative vector field, i.e., it is the gradient of $\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x})$.
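For concreteness, a sketch of the MAP iteration (11) with backtracking line search is given below. It reuses the `energy_gradient` helper and the hypothetical `A`/`AH` operator interface from the earlier sketches; the tolerance, iteration cap, and the choice of a single constant `beta` for both the shrink factor and the Armijo condition are illustrative assumptions.

```python
import torch

def map_estimate(energy_net, A, AH, b, x0, beta=0.5, tol=1e-6, max_iter=200):
    """Gradient descent (11) on the MAP cost (6) with backtracking line search."""
    def cost(x):
        return 0.5 * (A(x) - b).abs().pow(2).sum() + energy_net(x).sum()

    x = x0.clone()
    for _ in range(max_iter):
        grad = AH(A(x) - b) + energy_gradient(energy_net, x)
        cost_old, alpha = cost(x), 1.0
        # shrink the step while the Armijo-Goldstein condition is violated
        while cost(x - alpha * grad) > cost_old - beta * alpha * grad.abs().pow(2).sum():
            alpha *= beta
            if alpha < 1e-8:
                break
        x_new = x - alpha * grad
        # relative-change stopping criterion (matches Sec. III-B)
        if torch.abs(cost(x_new) - cost_old) / torch.abs(cost_old) <= tol:
            return x_new
        x = x_new
    return x
```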

Fig. 3: Comparison of DEEPEN with MoDL and MoL for two different contrasts: (a) T2 and (b) FLAIR. The top row shows the reconstructed images, the second row shows enlarged views, and the third row shows the error images.

III Experiments

III-A Data set

We evaluated the performance of DEEPEN on the publicly available parallel fastMRI brain data set, which contains FLAIR and T2-weighted images [9]. It is a twelve-channel brain data set consisting of complex images of size $320\times 320$. The matrix $\mathbf{A}$ in (1) is $\mathbf{A}=\boldsymbol{S}\mathbf{F}\boldsymbol{C}$, where $\boldsymbol{S}$ is the sampling matrix, $\mathbf{F}$ is the Fourier transform, and $\boldsymbol{C}$ denotes the coil sensitivity maps estimated using the ESPIRiT algorithm [10]. The data set for each contrast was split into 45 training, 5 validation, and 50 test subjects. DEEPEN was trained and evaluated on each contrast separately for four-fold retrospectively undersampled measurements, obtained by sampling the data along the phase-encoding direction using a 1D non-uniform variable-density mask.
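The multi-coil forward operator $\mathbf{A}=\boldsymbol{S}\mathbf{F}\boldsymbol{C}$ and its adjoint can be applied matrix-free; a minimal sketch is given below, assuming precomputed ESPIRiT coil maps and a Cartesian sampling mask (the function names, shapes, and interface are our assumptions, not the authors' code).

```python
import torch

def make_forward_operator(coil_maps, mask):
    """Matrix-free A = S F C and its adjoint A^H for Cartesian parallel MRI.

    coil_maps: (ncoil, H, W) complex coil sensitivities (e.g., from ESPIRiT);
    mask: (H, W) binary k-space sampling pattern.
    """
    def A(x):                                         # x: (H, W) complex image
        coil_imgs = coil_maps * x                     # C x
        ksp = torch.fft.fft2(coil_imgs, norm="ortho") # F C x
        return mask * ksp                             # S F C x

    def AH(y):                                        # y: (ncoil, H, W) k-space
        coil_imgs = torch.fft.ifft2(mask * y, norm="ortho")
        return (coil_maps.conj() * coil_imgs).sum(dim=0)  # C^H F^H S^H y

    return A, AH
```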

III-B Architecture and implementation

The network $E_{\boldsymbol{\theta}}(\boldsymbol{x})$ defined in (4) was built using five $3\times 3$ convolutional layers with 64 channels each, followed by a linear layer. A ReLU activation was used between the convolutional layers and at the end of the linear layer. We evaluated the gradient $\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x})$ using the chain rule.
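A sketch of such a network is given below; the two-channel (real/imaginary) input and the global average pooling before the linear layer are our assumptions, since the paper does not specify these details.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Sketch of E_theta: five 3x3 convolutions (64 channels) with ReLUs,
    followed by a linear layer and a final ReLU, so the output is a
    non-negative scalar per image."""

    def __init__(self, in_ch=2, feat=64):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(5):
            layers += [nn.Conv2d(ch, feat, kernel_size=3, padding=1), nn.ReLU()]
            ch = feat
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(feat, 1)

    def forward(self, x):                           # x: (B, 2, H, W)
        f = self.features(x).mean(dim=(2, 3))       # global average pool -> (B, 64)
        return torch.relu(self.head(f)).squeeze(1)  # (B,) non-negative energies
```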

The MCMC sampling was performed for 30 iterations and was initialized with $\boldsymbol{x}_{0}=(\mathbf{A}^{H}\mathbf{A}+\tilde{\lambda}\mathbf{I})^{-1}\mathbf{A}^{H}\boldsymbol{b}$, which is obtained by minimizing the negative log-posterior $-\log p_{0}(\boldsymbol{x}|\boldsymbol{b})=\|\mathbf{A}\boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}+\tilde{\lambda}\|\boldsymbol{x}\|_{2}^{2}$ w.r.t. $\boldsymbol{x}$. Next, similar to [11], we found it beneficial for stable training to: a) add Gaussian noise of standard deviation $2\epsilon$ to the training data $\{\boldsymbol{x}_{i}^{+}\}$, which effectively smooths the training data, and b) scale $\nabla_{\boldsymbol{x}}\log p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$ by $\epsilon^{2}/2$, which is equivalent to using the following Langevin MCMC update:

$$\boldsymbol{x}_{n+1}=\boldsymbol{x}_{n}-\left(\mathbf{A}^{H}(\mathbf{A}\boldsymbol{x}_{n}-\boldsymbol{b})+\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x}_{n})\right)+\epsilon\boldsymbol{z}_{n} \qquad (12)$$

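The initialization $\boldsymbol{x}_{0}=(\mathbf{A}^{H}\mathbf{A}+\tilde{\lambda}\mathbf{I})^{-1}\mathbf{A}^{H}\boldsymbol{b}$ can be computed matrix-free with a few conjugate-gradient iterations; a sketch is shown below, assuming the same hypothetical `A`/`AH` callables as in the earlier sketches (the regularization value and iteration count are illustrative).

```python
import torch

def tikhonov_init(A, AH, b, lam=1e-2, n_iter=20):
    """Solve (A^H A + lam I) x = A^H b with conjugate gradients (matrix-free)."""
    rhs = AH(b)
    x = torch.zeros_like(rhs)
    r = rhs.clone()                  # residual of the normal equations
    p = r.clone()
    rs_old = (r.conj() * r).sum().real
    for _ in range(n_iter):
        Ap = AH(A(p)) + lam * p
        alpha = rs_old / (p.conj() * Ap).sum().real
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = (r.conj() * r).sum().real
        if rs_new.sqrt() < 1e-8:     # residual small enough
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```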
DEEPEN was trained with $\epsilon=0.001$, and the optimal $\boldsymbol{\theta}$ was found using the Adam optimizer. The gradient descent algorithm was run until $\frac{|\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k+1})-\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})|}{|\mathcal{L}_{\boldsymbol{\theta}}(\boldsymbol{x}_{k})|}\leq 10^{-6}$.
We compared the performance of the proposed algorithm with SENSE [12], MoDL [2], and MoL [4]. The unrolled algorithms were trained in an E2E fashion for 10 iterations, using a five-layer CNN in each case. The Lipschitz constraint in MoL was implemented using the log-barrier approach.

IV Results

Fig. 4: MMSE estimate, uncertainty map, and MAP estimate given by the DEEPEN algorithm on a four-fold undersampled FLAIR image. The MMSE estimate and the uncertainty map were obtained by taking the mean and variance over 100 samples.

IV-A Maximum a posteriori estimates

Table I compares the performance of the reconstruction algorithms on the test data, and Fig. 3(a) and Fig. 3(b) show the reconstructed T2-weighted and FLAIR images, respectively. PSNR and SSIM are used as evaluation metrics. Table I shows that DEEPEN and MoDL perform better than SENSE and MoL, and perform comparably with each other. While SENSE is a CS-based algorithm without a neural network module, the lower performance of the unrolled MoL algorithm is due to the Lipschitz constraint on its CNN, which is required to ensure convergence of MoL to its fixed point. By contrast, according to Theorem II.1, the DEEPEN algorithm converges to a stationary point without any constraint on $\nabla_{\boldsymbol{x}}E_{\boldsymbol{\theta}}(\boldsymbol{x})$, which results in improved performance. We remind the reader that although DEEPEN and MoDL have similar performance, DEEPEN is memory-efficient and hence paves the way to reconstructing images of higher dimensions (for example, 3D brain volumes), which is not possible with MoDL.

Table I: Comparison of DEEPEN MAP estimates against SENSE, MoDL, and MoL for two different 4-fold undersampled contrast types.

Method   |           FLAIR            |             T2
         | Avg. PSNR (dB) | Avg. SSIM | Avg. PSNR (dB) | Avg. SSIM
SENSE    |     34.24      |   0.931   |     36.14      |   0.95
DEEPEN   |     37.14      |   0.96    |     39.22      |   0.98
MoDL     |     37.97      |   0.97    |     40.13      |   0.98
MoL      |     36.31      |   0.96    |     38.12      |   0.97

IV-B Bayes estimation

Given the learned posterior distribution $p_{\boldsymbol{\theta}}(\boldsymbol{x}|\boldsymbol{b})$, we can draw samples from it, which enables the estimation of the Minimum Mean Square Error (MMSE) reconstruction and the uncertainty map of the reconstructed image. Fig. 4 shows the MMSE estimate and the uncertainty map provided by DEEPEN on a four-fold undersampled FLAIR-contrast MR brain image. They are estimated by computing the mean and variance of 100 samples from the posterior distribution, obtained using the Langevin MCMC method initialized randomly from a Gaussian distribution.
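A sketch of this Bayes estimation step, reusing the `langevin_sample` helper from the earlier sketch, is shown below; the scale of the Gaussian initialization and the use of magnitude images for the variance map are our assumptions.

```python
import torch

def posterior_statistics(energy_net, A, AH, b, x_ref, n_samples=100, eps=1e-3, n_iter=30):
    """Estimate the MMSE image and a pixelwise uncertainty map from posterior samples."""
    samples = []
    for _ in range(n_samples):
        x0 = torch.randn_like(x_ref)          # random Gaussian initialization
        samples.append(langevin_sample(energy_net, A, AH, b, x0, eps=eps, n_iter=n_iter))
    stack = torch.stack(samples)              # (n_samples, ...)
    mmse = stack.mean(dim=0)                  # MMSE estimate (sample mean)
    uncertainty = stack.abs().var(dim=0)      # pixelwise variance of magnitude images
    return mmse, uncertainty
```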

V Conclusion

In this paper, we proposed DEEPEN, a memory-efficient E2E training framework that does not require constraining the Lipschitz constant of the CNN. Consequently, DEEPEN achieved a better PSNR than the memory-efficient unrolled algorithm MoL. Moreover, the well-defined cost function guarantees convergence to a stationary point and enables the use of faster sampling algorithms, such as Metropolis-Hastings, to obtain the uncertainty map of the estimate. This work can be extended to MR image reconstruction in higher dimensions.

VI Compliance with ethical standards

This study was conducted on a publicly available human subject data set. Ethical approval was not required, as confirmed by the license attached to the open-access data.

VII Acknowledgments

This work is supported by NIH grants R01-AG067078, R01-EB031169, and R01-EB019961.

References

  • [1] U. S. Kamilov, C. A. Bouman, G. T. Buzzard, and B. Wohlberg, "Plug-and-play methods for integrating physical and learned models in computational imaging," arXiv preprint arXiv:2203.17061, 2022.
  • [2] H. K. Aggarwal, M. P. Mani, and M. Jacob, "MoDL: Model-based deep learning architecture for inverse problems," IEEE Transactions on Medical Imaging, vol. 38, no. 2, pp. 394–405, 2018.
  • [3] S. Bai, J. Z. Kolter, and V. Koltun, "Deep equilibrium models," Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [4] A. Pramanik, M. B. Zimmerman, and M. Jacob, "Memory-efficient model-based deep learning with convergence and robustness guarantees," IEEE Transactions on Computational Imaging, vol. 9, pp. 260–275, 2023.
  • [5] Y. Song and D. P. Kingma, "How to train your energy-based models," arXiv preprint arXiv:2101.03288, 2021.
  • [6] H. Chung, J. Kim, M. T. Mccann, M. L. Klasky, and J. C. Ye, "Diffusion posterior sampling for general noisy inverse problems," in The Eleventh International Conference on Learning Representations, 2022.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," Advances in Neural Information Processing Systems, vol. 27, 2014.
  • [8] J. Nocedal and S. J. Wright, Numerical Optimization. Springer, 1999.
  • [9] J. Zbontar, F. Knoll, A. Sriram, T. Murrell, Z. Huang, M. J. Muckley, A. Defazio, R. Stern, P. Johnson, M. Bruno et al., "fastMRI: An open dataset and benchmarks for accelerated MRI," arXiv preprint arXiv:1811.08839, 2018.
  • [10] M. Uecker, P. Lai, M. J. Murphy, P. Virtue, M. Elad, J. M. Pauly, S. S. Vasanawala, and M. Lustig, "ESPIRiT—an eigenvalue approach to autocalibrating parallel MRI: Where SENSE meets GRAPPA," Magnetic Resonance in Medicine, vol. 71, no. 3, pp. 990–1001, 2014.
  • [11] E. Nijkamp, M. Hill, T. Han, S.-C. Zhu, and Y. N. Wu, "On the anatomy of MCMC-based maximum likelihood learning of energy-based models," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 04, 2020, pp. 5272–5280.
  • [12] K. P. Pruessmann, M. Weiger, M. B. Scheidegger, and P. Boesiger, "SENSE: Sensitivity encoding for fast MRI," Magnetic Resonance in Medicine, vol. 42, no. 5, pp. 952–962, 1999.