
Provably Convergent Learned Inexact Descent Algorithm for Low-Dose CT Reconstruction

Qingchao Zhang, Mehrdad Alvandipour, Wenjun Xia, Yi Zhang, Xiaojing Ye and Yunmei Chen. Q. Zhang and M. Alvandipour contributed equally. Corresponding author: Y. Chen. Q. Zhang, M. Alvandipour, and Y. Chen are with the Department of Mathematics, University of Florida, Gainesville, FL 32611 USA (e-mail: {qingchaozhang, m.alvandipour, yun}@ufl.edu). W. Xia and Y. Zhang are with the College of Computer Science, Sichuan University, Chengdu 610065 China (e-mail: [email protected]; [email protected]). X. Ye is with the Department of Mathematics and Statistics, Georgia State University, Atlanta, GA 30303 USA (e-mail: [email protected]).
Abstract

We propose a provably convergent method, called the Efficient Learned Descent Algorithm (ELDA), for low-dose CT (LDCT) reconstruction. ELDA is a highly interpretable neural network architecture with learned parameters that retains the convergence guarantees of classical optimization algorithms. To improve reconstruction quality, ELDA also employs a new non-local feature mapping and an associated regularizer. We compare ELDA with several state-of-the-art deep image reconstruction methods, such as RED-CNN and Learned Primal-Dual, on a set of LDCT reconstruction problems. Numerical experiments demonstrate the improvement in reconstruction quality achieved by ELDA with merely 19 layers, suggesting promising performance of ELDA in solution accuracy and parameter efficiency.

Key words: Low-dose CT, deep learning, inverse problems, learned descent algorithm, optimization.

1 Introduction

Computed Tomography (CT) is one of the most widely used imaging technologies for medical diagnosis. CT employs X-ray measurements from different angles to generate cross-sectional images of the human body [1, 2]. As high-dose X-rays can be harmful to the human body [3, 4, 5], substantial efforts have been devoted to image reconstruction from low-dose CT measurements [6, 7, 8]. There are two main strategies for dose reduction in CT scans: one is to reduce the number of views, and the other is to reduce the exposure time and the X-ray tube current [9]; both introduce various degrees of noise and artifacts and thus compromise the subsequent diagnosis. Here we focus on the second strategy; however, our method is not specific to a particular scanning mode. We formulate the task as an optimization problem, which we solve with our Efficient Learned Descent Algorithm (ELDA).

The classic analytical method for reconstructing CT images from projection data, Filtered Back-Projection (FBP), leads to heavy noise and artifacts in the low-dose scenario. The remedy for this problem has been sought from three different perspectives: pre-processing the sinograms [10, 11, 12], post-processing the images [13], or the hybrid approach of iterative reconstruction that encodes prior information into the process [14, 15, 16].

The advent of machine learning methods and their success in various image processing applications have naturally led to the incorporation of deep models into all of the above approaches, producing better performance than analytical methods [17]. For instance, CNN-based methods [18, 19, 20, 21] have been applied to sparse-view [22, 23, 24, 21, 25] and low-dose [18, 19, 20, 26, 27] data. They have also been applied to projection-domain synthesis [25, 28], post-processing [29, 24, 20, 30, 21], and prior learning in iterative methods [31, 32, 33].

Recently, a number of learned optimization methods have been proposed and proven very effective for CT reconstruction, as they are able to learn adaptive regularizers that lead to more accurate image reconstruction in a variety of medical imaging applications. However, existing works model regularizers using convolutional neural networks (CNNs), which only exploit local image features. This limits the representation power of deep neural networks and is not suitable for medical imaging applications that demand high image quality. Moreover, most existing deep networks for image reconstruction are cast as black boxes and can be difficult to interpret. Last but not least, deep neural networks for image reconstruction are also criticized for lacking mathematical justification and convergence guarantees.

In this work, we leverage the framework developed in [34] and propose an improved learned descent algorithm, ELDA. It further boosts image reconstruction quality using an adaptive non-local feature regularizer. More importantly, compared to [34], ELDA is more computationally efficient since the safeguard iterate is only computed when a descent condition fails to hold, which happens rarely owing to the allowance of inexact gradient computation in our algorithmic design. As a result, our model retains the convergence guarantee while also improving reconstruction quality over existing methods. The main contributions of this work are summarized as follows.

  • We propose an efficient learned descent algorithm with inexact gradients, called ELDA, to solve the non-smooth non-convex optimization problem in low-dose CT reconstruction. We provide comprehensive convergence and iteration complexity analysis of ELDA.

  • ELDA adopts an efficient update scheme that only computes the safeguard iterate when the desired descent condition fails to hold, and hence is more computationally economical than the LDA developed in [34]. Moreover, ELDA employs sparsity-enhancing and non-local smoothing regularization, which further improves imaging quality.

  • We conduct comprehensive experimental analysis of ELDA and compare it to several state-of-the-art deep-learning based methods for LDCT reconstruction.

In Section 2, we review related work in the literature. In Section 3, we present our method by first defining our model and each of its components, and then stating the algorithm and the details of network training. In Section 4, we state our lemmas and theorem regarding the output of the network. Section 5 presents the numerical results, including a parameter study, an ablation study, and comparisons with competing algorithms.

2 Related Works

A natural application of neural networks in CT reconstruction has been noise removal in either the projection domain [25, 28] or the image domain [29, 24, 20, 30, 21]. In particular, the Residual Encoder-Decoder Convolutional Neural Network (RED-CNN) proposed by Chen et al. [29] is an end-to-end mapping from low-dose CT images to normal-dose images; it uses FBP to obtain low-dose CT images from projections and restricts the problem to denoising in the image domain. Another attempt is FBPConvNet by Jin et al. [24], which is inspired by U-Net [35] and further explores CNN architectures while noting the parallels with the general form of an iterative proximal update.

Model-Based Image Reconstruction (MBIR) methods attempt to model CT physics, measurement noise, and image priors in order to achieve higher reconstruction quality in LDCT. Such methods learn the regularizer and are able to improve LDCT reconstruction significantly [14, 16, 36, 31]; however, their convergence speed is not optimal [37]. Later, researchers adopted NNs in other aspects of the algorithm and formed a new class of methods called Iterative Neural Networks (INNs). INNs seek to enjoy the best of both worlds of MBIR and denoising deep neural networks by employing moderate-complexity denoisers for image refinement and learning better regularizers [38, 39, 40, 41].

INNs have network architectures inspired by the optimization model and algorithm, and this learning capacity enables them to outperform classical iterative solutions by learning better regularizers while also being more time efficient. For example, BCD-Net [39] recently improved reconstruction accuracy compared to MBIR methods and NN denoisers. It was shown to generalize better than denoisers such as FBPConvNet that lack MBIR modules, and its learned transforms help it outperform state-of-the-art MBIR methods. Further research in this area has been devoted to improving the time efficiency of such algorithms along with image quality. Recently, [40] proposed Momentum-Net, the first fast and convergent INN architecture, inspired by the block proximal extrapolated gradient method using a majorizer. It guarantees convergence to a fixed point while improving MBIR speed, accuracy, and reconstruction quality. Momentum-Net is a general framework for inverse problems, and its recent application to LDCT [41] showed that it improves image reconstruction accuracy compared to a state-of-the-art non-iterative image denoising NN. Convergence guarantee is one of the main challenges in the design of INNs, and besides its theoretical value, it is highly desirable in medical applications. LEARN [32] is another model that unrolls an iterative reconstruction scheme while modeling the regularization by a field of experts. Yet another similar attempt is Learned Primal-Dual [42], which unfolds a proximal primal-dual optimization method in which the proximal operator is replaced with a CNN. Its iterative scheme of choice is the primal-dual hybrid gradient (PDHG) method, which is further modified to benefit from the learning capacity of NNs and then used to solve the TV-regularized CT reconstruction problem.

In all of the previous works, the architecture is only inspired by the optimization model, and in order to improve performance they introduce network components that do not correspond to steps of the algorithm. In addition, the choice of regularization limits the network to learning only local features, which, as we demonstrate empirically, limits the performance of these networks. One model that attempts to learn non-local features is MAGIC [43]. It is also a deep neural network inspired by a simple iterative reconstruction method, namely gradient descent. However, MAGIC breaks the correspondence between architecture and algorithm in order to extract non-local features: it manually adds a non-local corrector to the iteration steps, which is only intuitively justified and does not directly correspond to a modified regularizer in the optimization model.

In [34], a Learned Descent Algorithm (LDA) is developed. The LDA architecture is fully determined by the algorithm, and thus the network is fully interpretable. As interpretability and convergence guarantees are highly desirable in medical imaging, this framework is a promising approach for inverse problems such as LDCT reconstruction. Compared to [34], the present work proposes a more efficient numerical scheme for LDA, leading to comparable network parameters, lower computational cost, and more stable convergence behavior. We achieve this by developing an efficient learned inexact descent algorithm which only computes the safeguard iterate when a prescribed descent condition fails to hold, and thus substantially reduces computational cost in practice. Additionally, we propose a novel non-local smoothing regularizer that confirms the heuristics in optimization-inspired networks such as MAGIC [43] but leads to a fully interpretable network and allows us to provide a convergence guarantee for the network.

3 Method

In this section, we introduce the proposed inexact learned descent algorithm for solving the following low-dose CT reconstruction model:

\mathbf{x}^{(s)}_{\theta}=\operatorname*{arg\,min}_{\mathbf{x}}\,\{\phi(\mathbf{x};\mathbf{b}^{(s)},\theta):=f(\mathbf{x};\mathbf{b}^{(s)})+r(\mathbf{x};\theta)\}, \qquad (1)

where $f$ is the data fidelity term that measures the consistency between the reconstructed image $\mathbf{x}$ and the sinogram measurements $\mathbf{b}$, and $r$ is the regularization that may incorporate prior information about $\mathbf{x}$. The regularization $r(\cdot;\theta)$ is realized as a highly structured DNN with parameter $\theta$ to be learned. The optimal parameter $\theta$ of $r$ is obtained by minimizing the loss function $\mathcal{L}$, where $\mathcal{L}$ measures the (averaged) difference between $\mathbf{x}_{\theta}^{(s)}$, the minimizer of $\phi(\cdot;\mathbf{b}^{(s)},\theta)$, and the given ground truth $\hat{\mathbf{x}}^{(s)}$ for every $s\in[N]$, where $N$ is the number of training data pairs. For notational simplicity, we hereafter write $f(\mathbf{x})$ and $r(\mathbf{x})$ instead of $f(\mathbf{x};\mathbf{b}^{(s)})$ and $r(\mathbf{x};\theta)$, respectively. We choose $f(\mathbf{x})=\frac{1}{2}\|A\mathbf{x}-\mathbf{b}\|^{2}$ as the data fidelity term, where $A$ is the system matrix of the CT scanner. However, our proposed method can be readily extended to any smooth but (possibly) nonconvex $f$.
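For this choice of $f$, the data fidelity gradient used throughout the algorithm is simply $\nabla f(\mathbf{x})=A^{\top}(A\mathbf{x}-\mathbf{b})$. Below is a minimal sketch of these two quantities, where $A$ is treated as a dense matrix purely for illustration; in practice it would be a sparse CT system matrix or an on-the-fly projector.

```python
import torch

def data_fidelity(x, A, b):
    # f(x) = 0.5 * ||A x - b||^2 for a generic linear forward operator A
    residual = A @ x - b
    return 0.5 * residual.dot(residual)

def data_fidelity_grad(x, A, b):
    # grad f(x) = A^T (A x - b)
    return A.T @ (A @ x - b)
```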

3.1 Regularization Term in Model (1)

The regularization term rr in (1) consists of two parts. One of them enhances the sparsity of the solution under a learned transform and the other one smooths the feature maps non-locally:

r(\mathbf{x}):=\hat{r}(\mathbf{x})+\lambda\overline{r}(\mathbf{x}), \qquad (2)

where $\lambda$ is a learnable coefficient that balances the two terms.

3.1.1 The sparsity-enhancing regularizer

To enhance the sparsity of $\mathbf{x}$ under a learned transform $\mathbf{g}$, we propose to minimize the $l_{2,1}$ norm of $\mathbf{g}(\mathbf{x})$. If $\mathbf{g}$ is a differential operator, then the $l_{2,1}$ norm of $\mathbf{g}(\mathbf{x})$ reduces to the total variation of $\mathbf{x}$. That is,

\hat{r}(\mathbf{x})=\|\mathbf{g}(\mathbf{x})\|_{2,1}=\sum_{i=1}^{m}\|\mathbf{g}_{i}(\mathbf{x})\|, \qquad (3)

where each $\mathbf{g}_{i}(\mathbf{x})\in\mathbb{R}^{d}$ can be viewed as a feature descriptor vector at position $i$, as depicted in Fig. 1 (top). In our experiments, we simply set the feature extraction operator $\mathbf{g}$ to a vanilla $l$-layer CNN with nonlinear activation function $\sigma$ but no bias, as follows:

\mathbf{g}(\mathbf{x})=\mathbf{w}_{l}*\sigma(\cdots\sigma(\mathbf{w}_{3}*\sigma(\mathbf{w}_{2}*\sigma(\mathbf{w}_{1}*\mathbf{x})))\cdots), \qquad (4)

where $\{\mathbf{w}_{q}\}_{q=1}^{l}$ denote the convolution weights consisting of $d$ kernels with identical spatial kernel size ($3\times 3$), and $*$ denotes the convolution operation. Here, the componentwise activation function $\sigma$ is constructed to be the smoothed rectified linear unit defined below:

\sigma(x)=\begin{cases}0,&\text{if }x\leq-\delta,\\ \frac{1}{4\delta}x^{2}+\frac{1}{2}x+\frac{\delta}{4},&\text{if }-\delta<x<\delta,\\ x,&\text{if }x\geq\delta,\end{cases} \qquad (5)

where the prefixed parameter $\delta$ is set to $0.001$ in our experiments. Besides the smoothness of $\sigma$, each convolution operation of $\mathbf{g}$ in (4) can be viewed as a matrix multiplication, which makes $\mathbf{g}$ differentiable; $\nabla\mathbf{g}$ can be easily obtained by the chain rule, where each $\mathbf{w}_{q}^{\top}$ can be implemented as a transposed convolution [44].
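As a concrete illustration, below is a minimal PyTorch sketch of the smoothed activation (5) and the bias-free feature extractor (4). The single input channel, layer count, and channel width follow the defaults reported in Section 5, but the class and argument names are ours, not from the paper's code.

```python
import torch
import torch.nn as nn

class SmoothedReLU(nn.Module):
    """Smoothed ReLU from (5): quadratic on (-delta, delta), linear elsewhere."""
    def __init__(self, delta=1e-3):
        super().__init__()
        self.delta = delta

    def forward(self, x):
        d = self.delta
        quad = x ** 2 / (4 * d) + 0.5 * x + d / 4
        return torch.where(x <= -d, torch.zeros_like(x),
                           torch.where(x >= d, x, quad))

class FeatureExtractor(nn.Module):
    """Vanilla l-layer, bias-free CNN g(x) from (4): 3x3 convolutions with the
    smoothed activation between layers and none after the last convolution."""
    def __init__(self, layers=4, channels=48, delta=1e-3):
        super().__init__()
        self.act = SmoothedReLU(delta)
        self.convs = nn.ModuleList(
            [nn.Conv2d(1 if q == 0 else channels, channels, 3,
                       padding=1, bias=False) for q in range(layers)])

    def forward(self, x):
        for q, conv in enumerate(self.convs):
            x = conv(x)
            if q < len(self.convs) - 1:
                x = self.act(x)
        return x
```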

As $\hat{r}(\mathbf{x})$ defined in (3) is nonsmooth and nonconvex, we apply Nesterov's smoothing technique [45] to obtain the following smooth approximation (the details are given in [34]):

\hat{r}_{\varepsilon}(\mathbf{x})=\sum_{i\in I_{0}}\frac{1}{2\varepsilon}\|\mathbf{g}_{i}(\mathbf{x})\|^{2}+\sum_{i\in I_{1}}\Bigl(\|\mathbf{g}_{i}(\mathbf{x})\|-\frac{\varepsilon}{2}\Bigr), \qquad (6)

where $I_{0}=\{i\in[m]\ |\ \|\mathbf{g}_{i}(\mathbf{x})\|\leq\varepsilon\}$ and $I_{1}=[m]\setminus I_{0}$. Here the parameter $\varepsilon$ controls how close the smoothed $\hat{r}_{\varepsilon}(\mathbf{x})$ is to the original function $\hat{r}(\mathbf{x})$, and one can readily show that $\hat{r}_{\varepsilon}(\mathbf{x})\leq\hat{r}(\mathbf{x})\leq\hat{r}_{\varepsilon}(\mathbf{x})+\frac{m\varepsilon}{2}$ for all $\mathbf{x}\in\mathbb{R}^{n}$. From (6) we can also derive $\nabla\hat{r}_{\varepsilon}(\mathbf{x})$ to be

\nabla\hat{r}_{\varepsilon}(\mathbf{x})=\sum_{i\in I_{0}}\nabla\mathbf{g}_{i}(\mathbf{x})^{\top}\frac{\mathbf{g}_{i}(\mathbf{x})}{\varepsilon}+\sum_{i\in I_{1}}\nabla\mathbf{g}_{i}(\mathbf{x})^{\top}\frac{\mathbf{g}_{i}(\mathbf{x})}{\|\mathbf{g}_{i}(\mathbf{x})\|},

where $\nabla\mathbf{g}_{i}(\mathbf{x})\in\mathbb{R}^{d\times n}$ is the Jacobian of $\mathbf{g}_{i}$ at $\mathbf{x}$.
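The sketch below evaluates the smoothed regularizer (6) given a precomputed feature matrix; it is an illustration of the formula rather than the paper's implementation. In a learned setting, $\nabla\hat{r}_{\varepsilon}$ could be obtained from this value by automatic differentiation or computed explicitly from the expression above.

```python
import torch

def smoothed_l21(g, eps):
    """Smoothed l_{2,1} regularizer (6).

    g:   feature matrix of shape (d, m), one d-dimensional descriptor g_i per column
    eps: smoothing parameter epsilon
    """
    norms = g.norm(dim=0)                 # ||g_i(x)|| for every position i
    small = norms <= eps                  # index set I_0; I_1 is its complement
    return ((norms[small] ** 2) / (2 * eps)).sum() + (norms[~small] - eps / 2).sum()
```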

Figure 1: The feature matrix $\mathbf{g}\in\mathbb{R}^{d\times m}$ (top) and the folded feature matrix $\hat{\mathbf{g}}\in\mathbb{R}^{\kappa d\times\frac{m}{\kappa}}$ (bottom), reshaped from the feature maps obtained from the last convolution of the CNN defined in (4). The folding rate $\kappa$ is 4 for $\hat{\mathbf{g}}$ in this illustration.

3.1.2 The nonlocal smoothing regularizer

Since convolution operations only extract local information, each feature descriptor vector $\mathbf{g}_{i}$ can only encode the local features of a small patch of the input $\mathbf{x}$ (i.e., its receptive field) [46]. We therefore incorporate an additional non-local smoothing regularizer $\overline{r}$ that captures the underlying long-range dependencies between patches of the feature descriptor vectors. To this end, we form the folded feature descriptor vectors $\{\hat{\mathbf{g}}_{i}\}$ as described in Fig. 1 (bottom) by folding every $\kappa$ adjacent feature descriptors together, and define $\overline{r}$ by

\overline{r}(\mathbf{x})=\sum_{(i,j)}\mathcal{W}_{ij}\|\hat{\mathbf{g}}_{i}(\mathbf{x})-\hat{\mathbf{g}}_{j}(\mathbf{x})\|^{2}, \qquad (7)

where the similarity matrix $\mathcal{W}$ is defined by $\mathcal{W}_{ij}=\exp\bigl(-\frac{\|\hat{\mathbf{g}}_{i}(\mathbf{x})-\hat{\mathbf{g}}_{j}(\mathbf{x})\|^{2}}{\delta^{2}}\bigr)$, and $\delta$ is the standard deviation, estimated as the median of the Euclidean distances between the folded feature descriptor vectors.

Additionally, $\overline{r}$ can be written in quadratic form [47] as $\overline{r}(\mathbf{x})=\mathrm{tr}(\hat{\mathbf{g}}(\mathbf{x})\mathcal{L}\,\hat{\mathbf{g}}(\mathbf{x})^{\top})$, where $\mathrm{tr}(\cdot)$ is the trace operator, $\mathcal{L}=\mathcal{D}-\mathcal{W}$, and $\mathcal{D}$ is the diagonal matrix with $\mathcal{D}_{ii}=\sum_{j=1}^{m}\mathcal{W}_{ij}$; the matrix $\mathcal{L}$ is positive semidefinite. Its gradient is computed by

\nabla\overline{r}(\mathbf{x})=2\sum_{(i,j)}\tilde{\mathcal{W}}_{ij}(\nabla\hat{\mathbf{g}}_{i}(\mathbf{x})-\nabla\hat{\mathbf{g}}_{j}(\mathbf{x}))^{\top}(\hat{\mathbf{g}}_{i}(\mathbf{x})-\hat{\mathbf{g}}_{j}(\mathbf{x}))=2\sum_{q=1}^{\kappa d}\nabla\hat{\mathbf{g}}^{q}(\mathbf{x})^{\top}\tilde{\mathcal{L}}\,\hat{\mathbf{g}}^{q}(\mathbf{x}),

where $\nabla\hat{\mathbf{g}}^{q}(\mathbf{x})\in\mathbb{R}^{\frac{m}{\kappa}\times n}$ is the Jacobian of $\hat{\mathbf{g}}^{q}(\mathbf{x})$, $\tilde{\mathcal{L}}=\tilde{\mathcal{D}}-\tilde{\mathcal{W}}$, $\tilde{\mathcal{D}}_{ii}=\sum_{j=1}^{m}\tilde{\mathcal{W}}_{ij}$, and $\tilde{\mathcal{W}}_{ij}=\mathcal{W}_{ij}(\mathbf{x})\bigl(1-\frac{\|\hat{\mathbf{g}}_{i}(\mathbf{x})-\hat{\mathbf{g}}_{j}(\mathbf{x})\|^{2}}{\delta^{2}}\bigr)$. Each $\hat{\mathbf{g}}^{q}(\mathbf{x})$ represents the $q$-th row of the folded feature matrix $\hat{\mathbf{g}}(\mathbf{x})$, as illustrated in Fig. 1.
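The following is a minimal sketch of how the quadratic (trace) form of the non-local term could be evaluated for a given folded feature matrix. It uses dense pairwise distances for clarity, which is exactly the $\mathcal{O}(m^{2})$ cost that the approximation discussed in Section 5 avoids recomputing at every phase; the function name and tensor layout are ours.

```python
import torch

def nonlocal_regularizer(g_hat):
    """Non-local smoothing term (7) in its trace form tr(g_hat L g_hat^T).

    g_hat: folded feature matrix of shape (kappa * d, m // kappa);
           each column is one folded descriptor g_hat_i.
    """
    desc = g_hat.T                                   # descriptors as rows
    dist2 = torch.cdist(desc, desc) ** 2             # pairwise squared distances
    delta = dist2.sqrt().median()                    # bandwidth: median distance
    W = torch.exp(-dist2 / delta ** 2)               # similarity matrix
    L = torch.diag(W.sum(dim=1)) - W                 # graph Laplacian D - W (PSD)
    return torch.trace(g_hat @ L @ g_hat.T)
```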

3.2 Inexact Learned Descent Algorithm

Now we present an inexact smoothing gradient-descent-type algorithm to solve the nonconvex and nonsmooth problem (1) with the smooth approximation of $r(\mathbf{x})$ defined by $r_{\varepsilon}(\mathbf{x}):=\hat{r}_{\varepsilon}(\mathbf{x})+\lambda\overline{r}(\mathbf{x})$. The proposed algorithm is shown in Algorithm 1. In each iteration $k$, we solve the following smoothed problem (9) with fixed $\varepsilon=\varepsilon_{k}$ in Lines 3–14, while Line 15 checks and updates $\varepsilon_{k}$ by a reduction principle:

\min_{\mathbf{x}}\,\{\phi_{\varepsilon}(\mathbf{x};\mathbf{b}^{(s)},\theta):=f(\mathbf{x};\mathbf{b}^{(s)})+r_{\varepsilon}(\mathbf{x};\theta)\}. \qquad (9)

As the regularization term $r_{\varepsilon}$ is learned via a deep neural network (DNN), some common issues of DNNs have to be taken into consideration when designing the algorithm, such as the exploding and vanishing gradient problems during training [48]. Substantial improvement in performance has been achieved by ResNet [49], which introduces residual connections to alleviate these issues. As only the second term $r_{\varepsilon}(\mathbf{x};\theta)$ in (9) is learned, we desire individual residual updates for this term in our algorithm. To this end, we use the first-order proximal method to solve the smoothed problem (9) by iterating the following steps:

\mathbf{z}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla f(\mathbf{x}_{k}), \qquad (10a)
\mathbf{x}_{k+1}=\operatorname{prox}_{\alpha_{k}r_{\varepsilon_{k}}}(\mathbf{z}_{k+1}), \qquad (10b)

where $\operatorname{prox}_{\alpha r}(\mathbf{z}):=\operatorname*{arg\,min}_{\mathbf{x}}\ \frac{1}{2\alpha}\|\mathbf{x}-\mathbf{z}\|^{2}+r(\mathbf{x})$.

Given our construction of $r_{\varepsilon_{k}}$, it is hard to obtain a closed-form solution to the subproblem in (10b). We therefore propose to linearize the "nonsimple" term $r_{\varepsilon_{k}}$ by

\tilde{r}_{\varepsilon_{k}}(\mathbf{x})=r_{\varepsilon_{k}}(\mathbf{z}_{k+1})+\langle\nabla r_{\varepsilon_{k}}(\mathbf{z}_{k+1}),\mathbf{x}-\mathbf{z}_{k+1}\rangle+\frac{1}{2\beta_{k}}\|\mathbf{x}-\mathbf{z}_{k+1}\|^{2}. \qquad (11)

With this approximation, instead of solving (10b) directly, we update by the following step

\mathbf{u}_{k+1}=\operatorname{prox}_{\alpha_{k}\tilde{r}_{\varepsilon_{k}}}(\mathbf{z}_{k+1}), \qquad (12)

which has a closed-form solution giving the residual update

\mathbf{u}_{k+1}=\mathbf{z}_{k+1}-\tau_{k}\nabla r_{\varepsilon_{k}}(\mathbf{z}_{k+1}), \qquad (13)

where $\nabla r_{\varepsilon_{k}}=\nabla\hat{r}_{\varepsilon_{k}}+\lambda\nabla\overline{r}$ and $\tau_{k}=\frac{\alpha_{k}\beta_{k}}{\alpha_{k}+\beta_{k}}$.

From the optimization perspective, with the approximation in (11), the algorithm cannot be guaranteed to converge if we simply set $\mathbf{x}_{k+1}=\mathbf{u}_{k+1}$. In order to ensure convergence, we check whether $\mathbf{u}_{k+1}$ satisfies

\|\nabla\phi_{\varepsilon_{k}}(\mathbf{x}_{k})\|\leq c\|\mathbf{u}_{k+1}-\mathbf{x}_{k}\|\quad\text{and}\quad\phi_{\varepsilon_{k}}(\mathbf{u}_{k+1})-\phi_{\varepsilon_{k}}(\mathbf{x}_{k})\leq-\frac{\iota}{2}\|\mathbf{u}_{k+1}-\mathbf{x}_{k}\|^{2}, \qquad (14)

where $c$ and $\iota$ are prefixed constants. If condition (14) holds, we take $\mathbf{x}_{k+1}=\mathbf{u}_{k+1}$; otherwise, we take the standard gradient descent iterate $\mathbf{v}_{k+1}$ obtained from

\mathbf{v}_{k+1}=\operatorname*{arg\,min}_{\mathbf{x}}\ \langle\nabla f(\mathbf{x}_{k}),\mathbf{x}-\mathbf{x}_{k}\rangle+\langle\nabla r_{\varepsilon_{k}}(\mathbf{x}_{k}),\mathbf{x}-\mathbf{x}_{k}\rangle+\frac{1}{2\alpha_{k}}\|\mathbf{x}-\mathbf{x}_{k}\|^{2}, \qquad (15)

which has the exact solution

\mathbf{v}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla f(\mathbf{x}_{k})-\alpha_{k}\nabla r_{\varepsilon_{k}}(\mathbf{x}_{k}). \qquad (16)

To ensure convergence, we find $\alpha_{k}$ through a line search such that $\mathbf{v}_{k+1}$ satisfies

\phi_{\varepsilon_{k}}(\mathbf{v}_{k+1})-\phi_{\varepsilon_{k}}(\mathbf{x}_{k})\leq-\tau\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|^{2}, \qquad (17)

where $\tau$ is a prefixed constant. Lemma 4.2 proves the convergence of Lines 3–14 of Algorithm 1, including the termination of its line search (Lines 9–13) in finitely many steps.

Our algorithm is inspired by [34] but modified to be more efficient and better suited to deep neural networks. The first difference lies in the number of times the two candidates $\mathbf{u}_{k+1}$ and $\mathbf{v}_{k+1}$ are computed. While [34] computes both candidates at every iteration and then chooses the one that achieves the smaller function value, we propose the criterion (14) above for updating $\mathbf{x}_{k+1}$, which potentially saves the extra computational time spent on the candidate $\mathbf{v}_{k+1}$. In addition, the descent condition in Line 5 of Algorithm 1 mitigates frequent alternation between the candidates $\mathbf{u}_{k+1}$ and $\mathbf{v}_{k+1}$ as the algorithm proceeds, as detailed in Section 5.2.

The learned inexact gradient

To further increase the capacity of the network, we employ a learned transposed convolution operator, i.e., we replace $\mathbf{w}_{q}^{\top}$ by a transposed convolution $\widetilde{\mathbf{w}}_{q}$ with relearned weights, where $q$ denotes the index of the convolution in (4). To approximately achieve $\widetilde{\mathbf{w}}_{q}\approx\mathbf{w}_{q}^{\top}$, we add the constraint term $\mathcal{L}_{constraint}=\frac{1}{N_{w}}\sum_{q=1}^{4}\|\widetilde{\mathbf{w}}_{q}-\mathbf{w}^{\top}_{q}\|^{2}_{F}$ to the loss function during training to produce the data-driven transposed convolutions. Here $N_{w}$ is the number of parameters in the learned transposed convolutions and $\|\cdot\|_{F}$ is the Frobenius norm. In effect, the consequence of this modification is only to substitute $\nabla r_{\varepsilon_{k}}$ with the inexact gradient $\widetilde{\nabla r}_{\varepsilon_{k}}$ equipped with the learned transpose at Line 4 of Algorithm 1. This further increases the capacity of the unrolled network while maintaining the convergence property.
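To make the role of the learned transpose concrete: for stride-1 convolutions with matching padding, applying torch.nn.functional.conv_transpose2d with the same weight tensor computes the exact adjoint of conv2d (it is how autograd backpropagates through a convolution). The sketch below contrasts this exact adjoint with the learned variant that uses a freely trained weight $\widetilde{\mathbf{w}}_{q}$; the function names are ours.

```python
import torch
import torch.nn.functional as F

def exact_adjoint(y, w):
    # adjoint of conv2d(x, w, padding=1): same weight, transposed convolution
    return F.conv_transpose2d(y, w, padding=1)

def learned_adjoint(y, w_tilde):
    # ELDA replaces the exact adjoint by a transposed convolution with its own
    # trained weight w_tilde, kept close to w by the constraint term in (18)
    return F.conv_transpose2d(y, w_tilde, padding=1)
```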

Algorithm 1 The Efficient Learned Descent Algorithm (ELDA) for the Nonsmooth Nonconvex Problem
1:  Input: initial $\mathbf{x}_0$, $0<\rho,\gamma<1$, and $\varepsilon_0,\sigma,c,\iota,\tau>0$; maximum iteration $K$ or tolerance $\epsilon_{\mathrm{tol}}>0$.
2:  for $k=0,1,2,\dots,K$ do
3:     $\mathbf{z}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla f(\mathbf{x}_{k})$,
4:     $\mathbf{u}_{k+1}=\mathbf{z}_{k+1}-\tau_{k}\nabla r_{\varepsilon_{k}}(\mathbf{z}_{k+1})$, (possibly inexact)
5:     if condition (14) holds then
6:        set $\mathbf{x}_{k+1}=\mathbf{u}_{k+1}$,
7:     else
8:        $\mathbf{v}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla f(\mathbf{x}_{k})-\alpha_{k}\nabla r_{\varepsilon_{k}}(\mathbf{x}_{k})$,
9:        if condition (17) holds then
10:           set $\mathbf{x}_{k+1}=\mathbf{v}_{k+1}$,
11:        else
12:           update $\alpha_{k}\leftarrow\rho\alpha_{k}$, then go to Line 8,
13:        end if
14:     end if
15:     if $\|\nabla\phi_{\varepsilon_{k}}(\mathbf{x}_{k+1})\|<\sigma\gamma\varepsilon_{k}$, set $\varepsilon_{k+1}=\gamma\varepsilon_{k}$; otherwise, set $\varepsilon_{k+1}=\varepsilon_{k}$.
16:     if $\sigma\varepsilon_{k}<\epsilon_{\mathrm{tol}}$, terminate.
17:  end for
18:  Output: $\mathbf{x}_{k+1}$.
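For readers who prefer code, the sketch below mirrors Algorithm 1 at reconstruction (inference) time. The objective and gradient oracles, the learned step sizes, and every default constant are placeholders standing in for the learned quantities described above; they are not values from the paper.

```python
import torch

def elda_reconstruct(x0, grad_f, grad_r_eps, phi_eps, grad_phi_eps,
                     alphas, taus, eps0, K,
                     rho=0.5, gamma=0.5, sigma=1.0, c=1.0, iota=1.0,
                     tau_ls=1e-3, eps_tol=1e-6):
    """grad_f(x), grad_r_eps(x, eps), phi_eps(x, eps), grad_phi_eps(x, eps)
    are callables built from the data fidelity and the (learned) smoothed
    regularizer; alphas/taus hold the per-phase step sizes; tau_ls is the
    line-search constant tau in (17)."""
    x, eps = x0, eps0
    for k in range(K):
        alpha = alphas[k]
        z = x - alpha * grad_f(x)                        # Line 3
        u = z - taus[k] * grad_r_eps(z, eps)             # Line 4 (possibly inexact)
        du = (u - x).norm()
        if (grad_phi_eps(x, eps).norm() <= c * du and    # condition (14)
                phi_eps(u, eps) - phi_eps(x, eps) <= -0.5 * iota * du ** 2):
            x_next = u                                   # Line 6
        else:
            while True:                                  # Lines 8-13: backtracking
                v = x - alpha * (grad_f(x) + grad_r_eps(x, eps))
                if phi_eps(v, eps) - phi_eps(x, eps) <= -tau_ls * ((v - x).norm() ** 2):
                    x_next = v                           # condition (17) holds
                    break
                alpha = rho * alpha
        # Line 15: reduce the smoothing parameter when the gradient is small
        eps_next = gamma * eps if grad_phi_eps(x_next, eps).norm() < sigma * gamma * eps else eps
        # Line 16: stopping criterion
        if sigma * eps < eps_tol:
            return x_next
        x, eps = x_next, eps_next
    return x
```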

3.3 Network Training

We allow the step sizes $\alpha_{k}$ and $\tau_{k}$ to vary across phases. Moreover, all $\{\alpha_{k},\tau_{k}\}_{k=1}^{K}$ and the initial threshold $\varepsilon_{0}$ are designed to be learnable parameters fitted to data. Let $\theta$ denote the set of all learned parameters of ELDA, which consists of the weights of the convolutions and approximated transposed convolutions, the step sizes $\{\alpha_{k},\tau_{k}\}_{k=1}^{K}$, the threshold $\varepsilon_{0}$, and the parameter $\lambda$. Given $N$ training data pairs $\{(\mathbf{b}^{(s)},\hat{\mathbf{x}}^{(s)})\}_{s=1}^{N}$ of ground truth data $\hat{\mathbf{x}}^{(s)}$ and its corresponding measurement $\mathbf{b}^{(s)}$, the loss function $\mathcal{L}(\theta)$ is defined as the sum of the discrepancy loss $\mathcal{L}_{discrepancy}$ and the constraint loss $\mathcal{L}_{constraint}$:

\mathcal{L}(\theta)=\underbrace{\frac{1}{N}\sum_{s=1}^{N}\|\mathbf{x}^{K}_{\theta}-\hat{\mathbf{x}}^{(s)}\|^{2}}_{\mathcal{L}_{discrepancy}}+\underbrace{\frac{\vartheta}{N_{w}}\sum_{q=1}^{4}\|\widetilde{\mathbf{w}}_{q}-\mathbf{w}^{\top}_{q}\|^{2}_{F}}_{\mathcal{L}_{constraint}}, \qquad (18)

where $\mathcal{L}_{discrepancy}$ measures the discrepancy between the ground truth $\hat{\mathbf{x}}^{(s)}$ and $\mathbf{x}^{K}_{\theta}$, the output of the $K$-phase network. Here, the constraint coefficient $\vartheta$ is set to $10^{-2}$ in our experiments.
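A minimal sketch of how the training objective (18) could be assembled is given below. The attribute names conv_weights and transposed_weights on the network object are illustrative assumptions, not identifiers from the paper's code, and the transposed kernels are assumed to be stored in a layout directly comparable to the forward kernels.

```python
import torch

def elda_training_loss(network, batch, vartheta=1e-2):
    """batch: list of (b, x_gt) pairs; network(b) runs the K-phase reconstruction."""
    discrepancy = torch.stack(
        [((network(b) - x_gt) ** 2).sum() for b, x_gt in batch]).mean()
    n_w = sum(w.numel() for w in network.transposed_weights)
    constraint = sum(((w_t - w) ** 2).sum()
                     for w, w_t in zip(network.conv_weights,
                                       network.transposed_weights)) / n_w
    return discrepancy + vartheta * constraint
```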

4 Convergence Analysis

For the problem we are solving, we make a few assumptions on $f$ and $\mathbf{g}$ throughout this section.
(A1): $f$ is differentiable and (possibly) nonconvex, and $\nabla f$ is $L_{f}$-Lipschitz continuous.
(A2): Every component of $\mathbf{g}$ is differentiable and (possibly) nonconvex, $\nabla\mathbf{g}$ is $L_{g}$-Lipschitz continuous, and $\sup_{\mathbf{x}\in\mathcal{X}}\|\nabla\mathbf{g}(\mathbf{x})\|\leq M$ for some constant $M>0$.
(A3): $\phi$ is coercive, and $\phi^{*}=\min_{\mathbf{x}\in\mathcal{X}}\phi(\mathbf{x})>-\infty$.

With the smoothly differentiable activation $\sigma$ defined in (5), the boundedness of $\sigma^{\prime}$, and the convolution weights in (4) being fixed after training, we can immediately verify that the first two assumptions hold; moreover, in image reconstruction $\phi$ is typically assumed to be coercive [34].

As the objective function in (1) is nonsmooth and nonconvex, we utilize the Clarke subdifferential [50] to characterize the optimality of solutions. We denote the generalized directional derivative $D(\mathbf{x};\mathbf{v}):=\limsup_{\mathbf{z}\to\mathbf{x},\,t\downarrow 0}[(f(\mathbf{z}+t\mathbf{v})-f(\mathbf{z}))/t]$.

Definition 4.1 (Clarke subdifferential).

Suppose that $f:\mathbb{R}^{n}\to(-\infty,+\infty]$ is locally Lipschitz. The Clarke subdifferential $\partial f(\mathbf{x})$ of $f$ at $\mathbf{x}$ is defined as

\partial f(\mathbf{x}):=\bigl\{\mathbf{w}\in\mathbb{R}^{n}\ |\ \langle\mathbf{w},\mathbf{v}\rangle\leq D(\mathbf{x};\mathbf{v}),\ \forall\,\mathbf{v}\in\mathbb{R}^{n}\bigr\}.
Definition 4.2 (Clarke stationary point).

For a locally Lipschitz function $f$, a point $\mathbf{x}\in\mathbb{R}^{n}$ is called a Clarke stationary point of $f$ if $0\in\partial f(\mathbf{x})$.

Lemma 4.1.

The gradient of r¯\overline{r} is Lipschitz continuous.

Proof.

From the definition of $\overline{r}$ in (7), it follows that

\nabla_{\mathbf{x}}\overline{r}=\sum_{(i,j)}2\exp\Bigl(-\frac{\|\mathbf{g}_{i}(\mathbf{x})-\mathbf{g}_{j}(\mathbf{x})\|^{2}}{\sigma^{2}}\Bigr)\Bigl(1-\frac{\|\mathbf{g}_{i}(\mathbf{x})-\mathbf{g}_{j}(\mathbf{x})\|^{2}}{\sigma^{2}}\Bigr)(\nabla\mathbf{g}_{i}(\mathbf{x})-\nabla\mathbf{g}_{j}(\mathbf{x}))^{\top}(\mathbf{g}_{i}(\mathbf{x})-\mathbf{g}_{j}(\mathbf{x})),

where $\sigma$ denotes the bandwidth of the similarity weights. We only need to show that each term under the sum is Lipschitz, i.e., that $h(\mathbf{u})=\exp\bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\bigr)\bigl(1-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\bigr)(\nabla\mathbf{u})^{\top}\mathbf{u}$ is Lipschitz, where $\mathbf{u}=\mathbf{g}_{i}(\mathbf{x})-\mathbf{g}_{j}(\mathbf{x})\in\mathbb{R}^{d}$. We show $h(\mathbf{u})$ is Lipschitz by showing its derivative is bounded:

\|\nabla h(\mathbf{u})\|\leq\Bigl\|\exp\Bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\Bigl(-\frac{2}{\sigma^{2}}(\nabla\mathbf{u})^{\top}\mathbf{u}\Bigr)\Bigr\|\,\Bigl(1-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\,\|(\nabla\mathbf{u})^{\top}\|\,\|\mathbf{u}\|+\exp\Bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\,\Bigl\|\frac{2}{\sigma^{2}}(\nabla\mathbf{u})^{\top}\mathbf{u}\Bigr\|\,\|(\nabla\mathbf{u})^{\top}\|\,\|\mathbf{u}\|+\Bigl|\exp\Bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\Bigl(1-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\Bigr|\,\|\nabla^{2}\mathbf{u}\|\,\|\mathbf{u}\|+\Bigl|\exp\Bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\Bigl(1-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\Bigr)\Bigr|\,\|\nabla\mathbf{u}\|^{2}.

Note that $\mathbf{u}=\mathbf{g}_{i}(\mathbf{x})-\mathbf{g}_{j}(\mathbf{x})$ is Lipschitz in $\mathbf{x}$ since $\sup_{\mathbf{x}\in\mathcal{X}}\|\nabla\mathbf{g}(\mathbf{x})\|\leq M$, so $\|\nabla\mathbf{u}\|\leq M$. Also, since $\nabla\mathbf{g}$ is $L_{g}$-Lipschitz, we get $\|\nabla^{2}\mathbf{u}\|\leq 2L_{g}$. Moreover, a polynomial in $\|\mathbf{u}\|$ times $\exp\bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\bigr)$ is bounded, so we get

\|\nabla h(\mathbf{u})\|\leq\Bigl(\frac{4M^{2}}{\sigma^{2}}+L_{g}+M^{2}\Bigr)C, \qquad (19)

where $C$ is a constant that bounds all the instances of a polynomial in $\|\mathbf{u}\|$ times $\exp\bigl(-\frac{\|\mathbf{u}\|^{2}}{\sigma^{2}}\bigr)$ occurring above. Therefore $\nabla_{\mathbf{x}}\overline{r}$ is Lipschitz with constant $L_{r}=2m^{2}\bigl(\frac{4M^{2}}{\sigma^{2}}+L_{g}+M^{2}\bigr)C$. ∎

The following Lemma 4.2 considers the case where $\varepsilon$ is a positive constant, which corresponds to an iterative scheme that only executes Lines 3–14 of Algorithm 1.

Lemma 4.2.

Let $\varepsilon,c,\iota,\tau>0$, $0<\rho<1$, and the initial $\mathbf{x}_{0}\in\mathbb{R}^{n}$ be arbitrary. Suppose $\{\mathbf{x}_{k}\}$ is the sequence generated by repeating Lines 3–14 of Algorithm 1 with fixed $\varepsilon_{k}=\varepsilon$, and $\phi^{*}:=\min_{\mathbf{x}\in\mathbb{R}^{n}}\phi(\mathbf{x})$. Then $\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|\to 0$ as $k\to\infty$.

Proof.

In each iteration we compute $\mathbf{u}_{k+1}=\mathbf{z}_{k+1}-\tau_{k}\nabla r_{\varepsilon}(\mathbf{z}_{k+1})$ and set $\mathbf{x}_{k+1}=\mathbf{u}_{k+1}$ only if the condition $\{\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|\leq c\|\mathbf{u}_{k+1}-\mathbf{x}_{k}\|$ and $\phi_{\varepsilon}(\mathbf{u}_{k+1})-\phi_{\varepsilon}(\mathbf{x}_{k})\leq-\frac{\iota}{2}\|\mathbf{u}_{k+1}-\mathbf{x}_{k}\|^{2}\}$ is satisfied. In this case, we have $\phi_{\varepsilon}(\mathbf{x}_{k+1})\leq\phi_{\varepsilon}(\mathbf{x}_{k})$.

Otherwise we compute $\mathbf{v}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla f(\mathbf{x}_{k})-\alpha_{k}\nabla r_{\varepsilon}(\mathbf{x}_{k})=\mathbf{x}_{k}-\alpha_{k}\nabla\phi_{\varepsilon}(\mathbf{x}_{k})$, where $\alpha_{k}$ is found through a line search until the criterion

\phi_{\varepsilon}(\mathbf{v}_{k+1})-\phi_{\varepsilon}(\mathbf{x}_{k})\leq-\tau\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|^{2} \qquad (20)

is satisfied, and we then set $\mathbf{x}_{k+1}=\mathbf{v}_{k+1}$. In this case, from Lemma 3.2 in [34] we know that the gradient $\nabla\hat{r}_{\varepsilon}$ is Lipschitz continuous with constant $\sqrt{m}L_{g}+\frac{M^{2}}{\varepsilon}$, and by Lemma 4.1 $\nabla\overline{r}$ is also Lipschitz continuous with constant $L_{r}$. Furthermore, in (A1) we assumed $\nabla f$ is $L_{f}$-Lipschitz continuous. Hence, setting $L_{\varepsilon}=L_{f}+\sqrt{m}L_{g}+\frac{M^{2}}{\varepsilon}+L_{r}$, we get that $\nabla\phi_{\varepsilon}$ is $L_{\varepsilon}$-Lipschitz continuous, which implies

\phi_{\varepsilon}(\mathbf{v}_{k+1})\leq\phi_{\varepsilon}(\mathbf{x}_{k})+\langle\nabla\phi_{\varepsilon}(\mathbf{x}_{k}),\mathbf{v}_{k+1}-\mathbf{x}_{k}\rangle+\frac{L_{\varepsilon}}{2}\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|^{2}. \qquad (21)

Also, by the optimality condition of $\mathbf{v}_{k+1}=\operatorname*{arg\,min}_{\mathbf{x}}\ \langle\nabla f(\mathbf{x}_{k}),\mathbf{x}-\mathbf{x}_{k}\rangle+\langle\nabla r_{\varepsilon}(\mathbf{x}_{k}),\mathbf{x}-\mathbf{x}_{k}\rangle+\frac{1}{2\alpha_{k}}\|\mathbf{x}-\mathbf{x}_{k}\|^{2}$, we have

\langle\nabla\phi_{\varepsilon}(\mathbf{x}_{k}),\mathbf{v}_{k+1}-\mathbf{x}_{k}\rangle+\frac{1}{2\alpha_{k}}\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|^{2}=\frac{-1}{2\alpha_{k}}\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|^{2}\leq 0. \qquad (22)

Combining (21) and (22), we get

\phi_{\varepsilon}(\mathbf{v}_{k+1})-\phi_{\varepsilon}(\mathbf{x}_{k})\leq-\Bigl(\frac{1}{\alpha_{k}}-\frac{L_{\varepsilon}}{2}\Bigr)\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|^{2}. \qquad (23)

Therefore it suffices to have $\alpha_{k}\leq\frac{1}{\tau+L_{\varepsilon}/2}$ for the criterion (20) to be satisfied. This implies that after finitely many backtracking steps we reach $\rho^{j}\alpha_{k}\leq\frac{1}{\tau+L_{\varepsilon}/2}$, so the line search terminates successfully and we get $\phi_{\varepsilon}(\mathbf{x}_{k+1})=\phi_{\varepsilon}(\mathbf{v}_{k+1})\leq\phi_{\varepsilon}(\mathbf{x}_{k})$. Hence, in both cases we conclude that $\phi_{\varepsilon}(\mathbf{x}_{k+1})\leq\phi_{\varepsilon}(\mathbf{x}_{k})$.

Now, from $\mathbf{v}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla\phi_{\varepsilon}(\mathbf{x}_{k})$, i.e., $\|\mathbf{v}_{k+1}-\mathbf{x}_{k}\|=\alpha_{k}\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|$, together with (20) we have

\phi_{\varepsilon}(\mathbf{v}_{k+1})-\phi_{\varepsilon}(\mathbf{x}_{k})\leq-\tau\alpha_{k}^{2}\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|^{2}, \qquad (24)

which by rearranging gives

\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|^{2}\leq\frac{1}{\tau\alpha_{k}^{2}}(\phi_{\varepsilon}(\mathbf{x}_{k})-\phi_{\varepsilon}(\mathbf{v}_{k+1})). \qquad (25)

Similarly, from $\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|\leq c\|\mathbf{u}_{k+1}-\mathbf{x}_{k}\|$ and $\phi_{\varepsilon}(\mathbf{u}_{k+1})-\phi_{\varepsilon}(\mathbf{x}_{k})\leq-\frac{\iota}{2}\|\mathbf{u}_{k+1}-\mathbf{x}_{k}\|^{2}$, we get

\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|^{2}\leq\frac{2c^{2}}{\iota}(\phi_{\varepsilon}(\mathbf{x}_{k})-\phi_{\varepsilon}(\mathbf{u}_{k+1})). \qquad (26)

Because $\mathbf{x}_{k+1}$ equals either $\mathbf{u}_{k+1}$ or $\mathbf{v}_{k+1}$ depending on the conditions in Lines 5 and 9 of Algorithm 1, from (25) and (26) we have

\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|^{2}\leq C(\phi_{\varepsilon}(\mathbf{x}_{k})-\phi_{\varepsilon}(\mathbf{x}_{k+1})), \qquad (27)

where $C:=\sup_{k}\max\{\frac{1}{\tau\alpha_{k}^{2}},\frac{2c^{2}}{\iota}\}$, which is finite since the backtracked step sizes $\alpha_{k}$ remain bounded away from zero.

Summing up (27) for $k=0,\dots,K$, we have

\sum_{k=0}^{K}\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|^{2}\leq C(\phi_{\varepsilon}(\mathbf{x}_{0})-\phi_{\varepsilon}(\mathbf{x}_{K+1})). \qquad (28)

Combining this with the fact that $\phi_{\varepsilon}(\mathbf{x})\geq\phi(\mathbf{x})-\frac{m\varepsilon}{2}\geq\phi^{*}-\frac{m\varepsilon}{2}$, we have

\sum_{k=0}^{K}\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|^{2}\leq C\Bigl(\phi_{\varepsilon}(\mathbf{x}_{0})-\phi^{*}+\frac{m\varepsilon}{2}\Bigr). \qquad (29)

The right-hand side is finite, and thus by letting $K\to\infty$ we conclude that $\|\nabla\phi_{\varepsilon}(\mathbf{x}_{k})\|\to 0$, which proves the lemma. ∎

Note that Lemma 4.2 does not rely on the exact form of the term $\nabla r_{\varepsilon_{k}}(\mathbf{z}_{k+1})$ in (13), since the update in Line 5 of Algorithm 1 only requires condition (14) to hold. Thus the lemma still holds if we replace $\nabla r_{\varepsilon_{k}}(\mathbf{z}_{k+1})$ with the generalized learned inexact gradient $\widetilde{\nabla r}_{\varepsilon_{k}}(\mathbf{z}_{k+1})$ when updating $\mathbf{u}_{k+1}$, as described in Section 3.2. Next, we consider the case where $\varepsilon$ varies, in Theorem 4.4. More precisely, we focus on the subsequence $\{\mathbf{x}_{k_{l}+1}\}$ that selects the iterates at which the reduction criterion in Line 15 is satisfied for $k=k_{l}$ and $\varepsilon_{k}$ is reduced.

Lemma 4.3.

Suppose that the sequence $\{\mathbf{x}_{k}\}$ is generated by Algorithm 1 from any initial $\mathbf{x}_{0}$. Then for any $k\geq 0$ we have

\phi_{\varepsilon_{k+1}}(\mathbf{x}_{k+1})+\frac{m\varepsilon_{k+1}}{2}\leq\phi_{\varepsilon_{k}}(\mathbf{x}_{k+1})+\frac{m\varepsilon_{k}}{2}\leq\phi_{\varepsilon_{k}}(\mathbf{x}_{k})+\frac{m\varepsilon_{k}}{2}. \qquad (30)
Proof.

The proof of this lemma is similar to that of Lemma 3.4 in [34]. The second inequality is immediate from (27), so we focus on the first inequality. For any $\varepsilon>0$ and $\mathbf{x}$, denote

r_{\varepsilon,i}(\mathbf{x}):=\begin{cases}\frac{1}{2\varepsilon}\|\mathbf{g}_{i}(\mathbf{x})\|^{2},&\text{if }\|\mathbf{g}_{i}(\mathbf{x})\|\leq\varepsilon,\\ \|\mathbf{g}_{i}(\mathbf{x})\|-\frac{\varepsilon}{2},&\text{if }\|\mathbf{g}_{i}(\mathbf{x})\|>\varepsilon.\end{cases} \qquad (31)

Since $\phi_{\varepsilon}(\mathbf{x})=\sum_{i=1}^{m}r_{\varepsilon,i}(\mathbf{x})+\lambda\overline{r}(\mathbf{x})+f(\mathbf{x})$, to prove the first inequality it suffices to show that

r_{\varepsilon_{k+1},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k+1}}{2}\leq r_{\varepsilon_{k},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k}}{2}. \qquad (32)

If $\varepsilon_{k+1}=\varepsilon_{k}$, the two quantities above are identical and the first inequality holds. Now suppose $\varepsilon_{k+1}=\gamma\varepsilon_{k}<\varepsilon_{k}$. We consider the relation between $\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|$, $\varepsilon_{k+1}$, and $\varepsilon_{k}$ in three cases:

  1. If $\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|>\varepsilon_{k}>\varepsilon_{k+1}$, then by the definition in (31), there is

     r_{\varepsilon_{k+1},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k+1}}{2}=\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|=r_{\varepsilon_{k},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k}}{2}.

  2. If $\varepsilon_{k}\geq\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|>\varepsilon_{k+1}$, then (31) implies

     r_{\varepsilon_{k+1},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k+1}}{2}=\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|=\frac{\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|^{2}}{2\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|}+\frac{\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|}{2}\leq\frac{\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|^{2}}{2\varepsilon_{k}}+\frac{\varepsilon_{k}}{2}=r_{\varepsilon_{k},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k}}{2},

     where the inequality follows from the fact that $\frac{\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|^{2}}{2\varepsilon}+\frac{\varepsilon}{2}$, as a function of $\varepsilon$, is non-decreasing for all $\varepsilon\geq\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|$.

  3. If $\varepsilon_{k}>\varepsilon_{k+1}\geq\|\mathbf{g}_{i}(\mathbf{x}_{k+1})\|$, then again the same fact and (31) imply (32).

Therefore, in each of the three cases, (32) holds and hence

r_{\varepsilon_{k+1}}(\mathbf{x}_{k+1})+\frac{m\varepsilon_{k+1}}{2}=\sum_{i=1}^{m}\Bigl(r_{\varepsilon_{k+1},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k+1}}{2}\Bigr)\leq\sum_{i=1}^{m}\Bigl(r_{\varepsilon_{k},i}(\mathbf{x}_{k+1})+\frac{\varepsilon_{k}}{2}\Bigr)=r_{\varepsilon_{k}}(\mathbf{x}_{k+1})+\frac{m\varepsilon_{k}}{2},

which implies the first inequality of (30). ∎

Theorem 4.4.

Suppose that $\{\mathbf{x}_{k}\}$ is the sequence generated by Algorithm 1 with any initial $\mathbf{x}_{0}$. Let $\{\mathbf{x}_{k_{l}+1}\}$ be the subsequence for which the reduction criterion $\varepsilon_{k+1}=\gamma\varepsilon_{k}$ in Line 15 is met at $k=k_{l}$, $l=1,2,\dots$. Then $\{\mathbf{x}_{k_{l}+1}\}$ has at least one accumulation point, and every accumulation point of $\{\mathbf{x}_{k_{l}+1}\}$ is a Clarke stationary point of (1).

Proof.

By Lemma 4.2 and the definition of the Clarke subdifferential, this theorem can be proved in a way similar to Theorem 3.6 in [34].

Due to Lemma 4.3 and the fact that $\phi(\mathbf{x})\leq\phi_{\varepsilon}(\mathbf{x})+\frac{m\varepsilon}{2}$ for all $\varepsilon>0$ and $\mathbf{x}\in\mathcal{X}$, we know that

\phi(\mathbf{x}_{k})\leq\phi_{\varepsilon_{k}}(\mathbf{x}_{k})+\frac{m\varepsilon_{k}}{2}\leq\cdots\leq\phi_{\varepsilon_{0}}(\mathbf{x}_{0})+\frac{m\varepsilon_{0}}{2}<\infty.

Since $\phi$ is coercive, $\{\mathbf{x}_{k}\}$ is bounded. Hence $\{\mathbf{x}_{k_{l}+1}\}$ is also bounded and has at least one accumulation point.

Note that $\mathbf{x}_{k_{l}+1}$ satisfies the reduction criterion in Line 15 of Algorithm 1, i.e., $\|\nabla\phi_{\varepsilon_{k_{l}}}(\mathbf{x}_{k_{l}+1})\|\leq\sigma\gamma\varepsilon_{k_{l}}=\sigma\varepsilon_{0}\gamma^{l+1}\to 0$ as $l\to\infty$. For notational simplicity, let $\{\mathbf{x}_{j+1}\}$ denote any convergent subsequence of $\{\mathbf{x}_{k_{l}+1}\}$ and $\varepsilon_{j}$ the corresponding $\varepsilon_{k}$ used in the iteration that generates $\mathbf{x}_{j+1}$. Then there exists $\hat{\mathbf{x}}\in\mathcal{X}$ such that $\mathbf{x}_{j+1}\to\hat{\mathbf{x}}$, $\varepsilon_{j}\to 0$, and $\nabla\phi_{\varepsilon_{j}}(\mathbf{x}_{j+1})\to 0$ as $j\to\infty$.

Note that the Clarke subdifferential of $\phi$ at $\hat{\mathbf{x}}$ is given by $\partial\phi(\mathbf{x})=\partial\hat{r}(\mathbf{x})+\lambda\nabla\overline{r}(\mathbf{x})+\nabla f(\mathbf{x})$:

\partial\phi(\hat{\mathbf{x}})=\Bigl\{\sum_{i\in I_{0}}\nabla\mathbf{g}_{i}(\hat{\mathbf{x}})^{\top}\mathbf{w}_{i}+\sum_{i\in I_{1}}\nabla\mathbf{g}_{i}(\hat{\mathbf{x}})^{\top}\frac{\mathbf{g}_{i}(\hat{\mathbf{x}})}{\|\mathbf{g}_{i}(\hat{\mathbf{x}})\|}+\lambda\nabla\overline{r}(\hat{\mathbf{x}})+\nabla f(\hat{\mathbf{x}})\ \Big|\ \|\Pi(\mathbf{w}_{i};\mathcal{C}(\nabla\mathbf{g}_{i}(\hat{\mathbf{x}})))\|\leq 1,\ \forall\,i\in I_{0}\Bigr\}, \qquad (33)

where $I_{0}=\{i\in[m]\ |\ \|\mathbf{g}_{i}(\hat{\mathbf{x}})\|=0\}$ and $I_{1}=[m]\setminus I_{0}$. Then we know that there exists $J$ sufficiently large such that

\varepsilon_{j}<\frac{1}{2}\min\{\|\mathbf{g}_{i}(\hat{\mathbf{x}})\|\ |\ i\in I_{1}\}\leq\frac{1}{2}\|\mathbf{g}_{i}(\hat{\mathbf{x}})\|\leq\|\mathbf{g}_{i}(\mathbf{x}_{j+1})\|,\quad\forall\,j\geq J,\ \forall\,i\in I_{1},

where we used the facts that $\min\{\|\mathbf{g}_{i}(\hat{\mathbf{x}})\|\ |\ i\in I_{1}\}>0$ and $\varepsilon_{j}\to 0$ in the first inequality, and $\mathbf{x}_{j+1}\to\hat{\mathbf{x}}$ together with the continuity of $\mathbf{g}_{i}$ for all $i$ in the last inequality. Furthermore, we denote

\mathbf{s}_{j,i}:=\begin{cases}\frac{\mathbf{g}_{i}(\mathbf{x}_{j+1})}{\varepsilon_{j}},&\text{if }\|\mathbf{g}_{i}(\mathbf{x}_{j+1})\|\leq\varepsilon_{j},\\ \frac{\mathbf{g}_{i}(\mathbf{x}_{j+1})}{\|\mathbf{g}_{i}(\mathbf{x}_{j+1})\|},&\text{if }\|\mathbf{g}_{i}(\mathbf{x}_{j+1})\|>\varepsilon_{j}.\end{cases}

Then we have

\nabla\phi_{\varepsilon_{j}}(\mathbf{x}_{j+1})=\sum_{i\in I_{0}}\nabla\mathbf{g}_{i}(\mathbf{x}_{j+1})^{\top}\mathbf{s}_{j,i}+\sum_{i\in I_{1}}\nabla\mathbf{g}_{i}(\mathbf{x}_{j+1})^{\top}\frac{\mathbf{g}_{i}(\mathbf{x}_{j+1})}{\|\mathbf{g}_{i}(\mathbf{x}_{j+1})\|}+\lambda\nabla\overline{r}(\mathbf{x}_{j+1})+\nabla f(\mathbf{x}_{j+1}). \qquad (34)

Comparing (33) and (34), we see that the last two terms on the right-hand side of (34) converge to those of (33), respectively, due to the facts that $\mathbf{x}_{j+1}\to\hat{\mathbf{x}}$ and the continuity of $\mathbf{g}_{i}$, $\nabla\mathbf{g}_{i}$, $\nabla\overline{r}$, and $\nabla f$. Moreover, noting that $\|\Pi(\mathbf{s}_{j,i};\mathcal{C}(\nabla\mathbf{g}_{i}(\hat{\mathbf{x}})))\|\leq\|\mathbf{s}_{j,i}\|\leq 1$, we see that the first term on the right-hand side of (34) also converges to the set formed by the first term of (33), due to the continuity of $\mathbf{g}_{i}$ and $\nabla\mathbf{g}_{i}$. Hence we know that

\operatorname{dist}(\nabla\phi_{\varepsilon_{j}}(\mathbf{x}_{j+1}),\partial\phi(\hat{\mathbf{x}}))\to 0

as $j\to\infty$. Since $\nabla\phi_{\varepsilon_{j}}(\mathbf{x}_{j+1})\to 0$ and $\partial\phi(\hat{\mathbf{x}})$ is closed, we conclude that $0\in\partial\phi(\hat{\mathbf{x}})$. ∎

From Theorem 4.4, we conclude that the output of our network converges to a Clarke stationary point of the original regularized problem (1).

5 Experiments and Results

Here we present our experiments on LDCT image reconstruction problems with various dose levels and compare with existing state-of-the-art algorithms in terms of image quality, run time, and number of parameters. We adopt a warm-start training strategy that imitates the iterations of the optimization algorithm (a sketch of this schedule is given below). More precisely, we first train the network with $K=3$ phases, where each phase of the network corresponds to an iteration of the optimization algorithm. After it converges, we add 2 more phases and continue training the 5-phase network until it converges. We keep adding 2 more phases at a time until there is no noticeable improvement.
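The following sketch illustrates the warm-start schedule just described, assuming a hypothetical factory make_network(K) that builds a K-phase ELDA model and can be initialized from the parameters of a smaller one; for brevity, the training loop shows only the discrepancy part of the loss.

```python
import torch

def warm_start_training(make_network, train_loader, max_phases=19,
                        lr=1e-4, first_epochs=200, grow_epochs=100):
    K, epochs = 3, first_epochs
    model = make_network(K)
    while True:
        opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.9, 0.999))
        for _ in range(epochs):
            for b, x_gt in train_loader:
                opt.zero_grad()
                loss = ((model(b) - x_gt) ** 2).mean()   # discrepancy term only
                loss.backward()
                opt.step()
        if K >= max_phases:
            return model
        K, epochs = K + 2, grow_epochs
        bigger = make_network(K)
        bigger.load_state_dict(model.state_dict(), strict=False)  # warm start
        model = bigger
```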

As computing the similarity weight matrix $\mathcal{W}$ is costly in time and memory, we also experiment with an approximation of $\mathcal{W}$ computed on the initial reconstruction $\mathbf{x}_{0}$ and not updated in each iteration, i.e., $\mathcal{W}_{ij}\approx\exp\bigl(-\frac{\|\hat{\mathbf{g}}_{i}(\mathbf{x}_{0})-\hat{\mathbf{g}}_{j}(\mathbf{x}_{0})\|^{2}}{\delta^{2}}\bigr)$. Thus $\overline{r}(\mathbf{x})$ can be differentiated as $\nabla\overline{r}(\mathbf{x})=2\sum_{q=1}^{\kappa d}\nabla\hat{\mathbf{g}}^{q}(\mathbf{x})^{\top}\mathcal{L}\,\hat{\mathbf{g}}^{q}(\mathbf{x})$. In the approximation scenario we compute $\mathcal{L}$ once and use it for all phases, whereas otherwise we incur an extra $\mathcal{O}(m^{2})$ computation in each phase, every time we compute $\nabla\phi$. As shown in Section 5.2, this approximation barely degrades network performance while increasing the running speed.

All experiments are performed on a computer with an Intel i7-6700K CPU at 3.40 GHz, 16 GB of memory, and an Nvidia GTX-1080Ti GPU with 11 GB of graphics memory, and are implemented with the PyTorch toolbox [51] in Python. The initial $\mathbf{x}_{0}$ is obtained by the FBP algorithm. The spatial kernel size of the convolutions and transposed convolutions is set to $3\times 3$, and the channel number is set to 48 with layer number $l=4$ by default. The learned weights of the convolutions and transposed convolutions are initialized by the Xavier initializer [52], and the starting $\varepsilon_{0}$ is initialized to $0.001$. All learnable parameters are trained by the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$. The network is trained with learning rate 1e-4 for 200 epochs when the phase number is $K=3$, followed by 100 epochs each time more phases are added.

We test the performance of ELDA on the 2016 NIH-AAPM-Mayo Clinic Low-Dose CT Grand Challenge dataset [53], which contains 5936 full-dose CT (FDCT) images from 10 patients, from which we randomly select 500 images and resize them to $256\times 256$. We then randomly divide the dataset into 400 images for training and 100 for testing. The distance-driven algorithm [54, 55] is applied to simulate the projections in fan-beam geometry. The source-to-rotation-center and detector-to-rotation-center distances are both set to 250 mm. The physical image region covers 170 mm $\times$ 170 mm. The detector has 512 elements, each with width 0.72 mm. There are 1024 projection views in total, with the projection angles evenly distributed over a full scan range. Similar to [56], the simulated noisy transmission measurement $I$ is generated by adding Poisson and electronic noise as

I=\mathrm{Poisson}(I_{0}\exp(-\hat{b}))+\mathrm{Normal}(0,\sigma_{e}^{2}), \qquad (35)

where $I_{0}$ is the incident X-ray intensity, $\sigma_{e}^{2}$ is the variance of the background electronic noise, and $\hat{b}$ represents the noise-free projection. In this simulation, $I_{0}$ is set to $1.0\times 10^{6}$ for the normal dose and $\sigma_{e}^{2}$ is prefixed to 10 for all dose cases. The noisy projection $\mathbf{b}$ is then calculated by taking the logarithm of $\frac{I_{0}}{I}$. In the low-dose case, in order to investigate the robustness of all compared algorithms, we generated three sets of low-dose projections with $I_{0}=1.0\times 10^{5}$, $5.0\times 10^{4}$, and $2.5\times 10^{4}$, which correspond to 10%, 5%, and 2.5% of the full-dose incident intensity, respectively. We use the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) to evaluate the quality of the reconstructed images.
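A minimal sketch of the noise model (35) is shown below; the clamp that guards against non-positive photon counts before the logarithm is a practical safeguard of ours, not part of the paper's description.

```python
import torch

def simulate_low_dose_projection(b_clean, I0, sigma_e2=10.0):
    """Apply Poisson photon noise and Gaussian electronic noise to a
    noise-free projection b_clean, then recover the noisy projection via
    the log transform b = log(I0 / I)."""
    I = torch.poisson(I0 * torch.exp(-b_clean)) \
        + torch.randn_like(b_clean) * sigma_e2 ** 0.5
    I = I.clamp(min=1.0)            # avoid log of non-positive counts
    return torch.log(I0 / I)
```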

5.1 Parameter Study

The regularization term of our model is learned from training samples, yet there are still a few key network hyperparameters that need to be set manually. Specifically, we investigate the impact of several architectural parameters: the number of convolutions ($l$), the depth of the convolution kernels ($d$), and the phase number ($K$). The influence of each hyperparameter is assessed by perturbing it while fixing the others at $d=48$, $l=4$, and $K=19$. The setting includes all factors/components listed in Table 2, and all of the following results are trained and tested at dose level 10%.

Depth of the convolution kernels: We evaluate $d=16,32,48$, and $64$. The results are listed in Table 1. The PSNR rises with growing kernel depth, but the gain gradually diminishes, while the number of parameters and the running time grow considerably.

Table 1: The reconstruction results associated with different depths of convolution kernels and different number of convolutions in each phase with dose level 10%.
Depth of conv. kernels 16 32 48 64
PSNR (dB) 47.11 47.51 47.73 47.75
Number of parameters 14,152 55,912 125,320 222,376
Average testing time (s) 1.139 1.231 1.411 1.539
Number of convolutions 2 3 4 5
PSNR (dB) 47.27 47.60 47.73 47.77
Number of parameters 42,376 83,848 125,320 167,656
Average testing time (s) 1.117 1.258 1.411 1.530
Figure 2: The reconstruction PSNR of ELDA across various phase numbers on dose level 10%10\%.
Table 2: Comparison of different algorithms and the influence of various design factors or components on the LDCT reconstruction performance with dose level 10%.
Algorithms Plain-GD AGD LDA ELDA
PSNR (dB) 45.47 45.95 46.29 46.35
Average testing time (s) 0.602 0.605 1.338 1.239
ELDA Factors / Components
Inexact Transpose?
Non-local regularizer?
Approximated Matrix 𝒲\mathcal{W}?
PSNR (dB) 46.35 46.74 47.75 47.73
Average testing time (s) 1.239 1.247 3.703 1.411

Number of convolutions: We evaluate l=2, 3, 4 and 5 convolutions. The corresponding results are reported in Table 1. More convolutions lead to better reconstruction quality, but the PSNR gain from 4 to 5 convolutions is marginal while the number of parameters and the testing time increase significantly.

Phase number K: As shown in Fig. 2, the PSNR increases with the phase number K, and the curve becomes nearly flat after 19 phases.

To balance reconstruction performance against network complexity, we set d=48, l=4 and K=19 when comparing with other methods.
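For reference, these defaults can be gathered in a small configuration object; the class and attribute names below are hypothetical and only summarize the chosen setting.

from dataclasses import dataclass

@dataclass
class ELDAConfig:
    kernel_depth: int = 48   # d: depth (channels) of the convolution kernels
    num_convs: int = 4       # l: number of convolutions per phase
    num_phases: int = 19     # K: number of unrolled phases

cfg = ELDAConfig()  # defaults used when comparing with other methods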

Table 3: Quantitative results (mean ± standard deviation) of the LDCT reconstructions of AAPM-Mayo data obtained by various algorithms at different dose levels. The average run time (per image) is measured in seconds.
Dose Level 1.0×10^5 5.0×10^4 2.5×10^4 Run Time Parameters
PSNR (dB) SSIM PSNR (dB) SSIM PSNR (dB) SSIM
FBP [57] 38.03±0.68 0.9084±0.0128 35.25±0.74 0.8459±0.0209 32.35±0.77 0.7556±0.0295 1.0 N/A
TGV [58] 43.60±0.50 0.9845±0.0020 42.80±0.52 0.9812±0.0027 41.10±0.62 0.9698±0.0055 9.4·10^2 N/A
FBPConvNet [24] 42.49±0.47 0.9764±0.0029 41.14±0.49 0.9698±0.0041 39.62±0.48 0.9589±0.0052 7.4·10^-2 1.0·10^7
RED-CNN [29] 44.58±0.58 0.9848±0.0026 43.25±0.60 0.9808±0.0033 41.61±0.63 0.9737±0.0048 2.6·10^-3 1.8·10^6
Learned PD [42] 44.81±0.59 0.9863±0.0025 43.28±0.61 0.9813±0.0034 41.74±0.62 0.9745±0.0048 1.8 2.5·10^5
LEARN [32] 45.19±0.60 0.9868±0.0023 43.86±0.61 0.9840±0.0027 42.06±0.61 0.9776±0.0037 4.8 1.9·10^6
LDA [34] 46.29±0.60 0.9896±0.0019 44.64±0.61 0.9858±0.0025 42.97±0.64 0.9802±0.0035 1.3 6.2·10^4
MAGIC [43] 47.54±0.66 0.9919±0.0017 45.83±0.64 0.9888±0.0022 44.37±0.63 0.9853±0.0027 5.3 2.1·10^6
ELDA (Ours) 47.73±0.69 0.9923±0.0017 46.40±0.64 0.9899±0.0021 44.86±0.65 0.9859±0.0031 1.4 1.2·10^5
[Figure 3 panels: (a) Reference; (b) FBP (36.20); (c) TGV (43.40); (d) FBPConvNet (41.65); (e) RED-CNN (43.84); (f) Learned PD (43.90); (g) LEARN (44.31); (h) LDA (45.01); (i) MAGIC (46.42); (j) ELDA (47.14).]
Figure 3: Representative CT images of AAPM-Mayo data reconstructed by various methods with dose level 5%. The display window is [-160, 240] HU. PSNRs (dB) are shown in the parentheses. The regions of interest are magnified in red boxes for better visualization.

5.2 Ablation Study

In this section, we first investigate the effectiveness of the proposed algorithm in ELDA. To this end, we compare ELDA with an unrolled version of the standard gradient descent iteration for (1), and with an accelerated inertial version given by \mathbf{x}_{k+1}=\mathbf{x}_{k}-\alpha_{k}\nabla\phi(\mathbf{x}_{k})+\theta_{k}(\mathbf{x}_{k}-\mathbf{x}_{k-1}), where \theta_{k} is also learned. These two algorithms are referred to as Plain-GD and AGD, respectively. In addition, to show the advantage of the new descending condition (14) and (17) over the competition strategy in LDA [34], we also compare with LDA. The comparison of the different algorithms is shown in Table 2 (top), where the experiments follow the default parameter configuration of Section 5.1 without the additional components listed in Table 2 (bottom). AGD, LDA and ELDA all achieve higher PSNRs than Plain-GD, with ELDA performing best: with the new descending condition it attains a 0.06 dB higher average PSNR than LDA and runs about 0.1 seconds faster per image. Furthermore, we computed the ratio of iterations in which the candidate \mathbf{u}_{k+1} is taken instead of \mathbf{v}_{k+1}, which is 86.2% for LDA and 99.6% for ELDA. This indicates that the proposed descending condition (14) effectively avoids the frequent alternation between candidates caused by the competition strategy used in LDA.
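For reference, the PyTorch-style sketch below unrolls the gradient iteration above with learned step sizes alpha_k and momentum weights theta_k; grad_phi stands for the gradient of the objective in (1) and is assumed to be supplied externally. It is a minimal illustration of the Plain-GD and AGD baselines under our own naming, not the authors' implementation of ELDA.

import torch
import torch.nn as nn

class UnrolledGD(nn.Module):
    """Unrolled gradient descent with learned step sizes alpha_k; setting
    momentum=True adds the inertial term theta_k * (x_k - x_{k-1}), i.e. AGD."""
    def __init__(self, grad_phi, num_phases=19, momentum=False):
        super().__init__()
        self.grad_phi = grad_phi   # callable: x -> gradient of phi at x
        self.momentum = momentum
        self.alpha = nn.Parameter(1e-3 * torch.ones(num_phases))
        self.theta = nn.Parameter(0.5 * torch.ones(num_phases)) if momentum else None

    def forward(self, x0):
        x_prev, x = x0, x0
        for k in range(len(self.alpha)):
            x_next = x - self.alpha[k] * self.grad_phi(x)       # plain descent step
            if self.momentum:
                x_next = x_next + self.theta[k] * (x - x_prev)  # inertial correction
            x_prev, x = x, x_next
        return x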

Moreover, we examine the influence of several essential factors/components of our ELDA model, i.e., the inexact transpose, the non-local smoothing regularizer and the approximated weight matrix \mathcal{W}. The results are summarized in Table 2 (bottom). The inexact transpose and the non-local smoothing regularizer each improve the network performance by a large margin, while using the approximated weight matrix \mathcal{W} causes no significant decrease in PSNR. Since the initial \mathbf{x}_{0} obtained by FBP is not far from \mathbf{x}_{k} at each iteration, the \mathcal{W} approximated with \mathbf{x}_{0} provides a good estimate of the true one. Thus, in the following sections we keep all the features listed in Table 2 (bottom) when comparing ELDA with other methods.

[Figure 4 panels: (a) Reference; (b) FBP (31.09); (c) TGV (39.65); (d) FBPConvNet (38.27); (e) RED-CNN (39.49); (f) Learned PD (39.60); (g) LEARN (40.11); (h) LDA (40.79); (i) MAGIC (41.27); (j) ELDA (42.41).]
Figure 4: Representative CT images of NBIA data reconstructed by various methods with dose level 2.5%. The display window is [-160, 240] HU. PSNRs (dB) are shown in the parentheses. The regions of interest are magnified in red boxes for better visualization.

5.3 Comparison with the State-of-the-art

In this section, we compare our reconstruction results on the 100 AAPM-Mayo testing images with several existing algorithms: two classic reconstruction methods, i.e., FBP [57] and TGV [58], and six deep-learning-based approaches, i.e., FBPConvNet [24], RED-CNN [29], Learned Primal-Dual [42], LDA [34], LEARN [32] and MAGIC [43]. For a fair comparison, all deep learning algorithms are trained and evaluated with the same dataset, dose levels and evaluation metrics. The experimental results at various dose levels are summarized in Table 3, and representative qualitative results at dose level 5% are shown in Fig. 3. These results show that ELDA reconstructs more accurate images with far fewer network parameters and competitive running time.

5.4 Validation with NBIA Data

To demonstrate the generalizability of the proposed method, we validate our model on another dataset from the National Biomedical Imaging Archive (NBIA). We randomly sample 80 NBIA images covering various parts of the human body for diversity. For a fair comparison, all deep-learning-based reconstruction models compared here are trained on the same dataset as in Section 5.3. Fig. 4 visualizes the reconstructed images obtained by different methods at dose level 2.5%. ELDA preserves details well, avoids over-smoothing and reduces artifacts, yielding the promising reconstruction quality shown in Fig. 4. The quantitative results are provided in Table 4.

Table 4: Quantitative results (mean ± standard deviation) of the LDCT reconstructions obtained by various algorithms at different dose levels on NBIA data.
Dose Level 1.0×10^5 5.0×10^4 2.5×10^4
PSNR (dB) SSIM PSNR (dB) SSIM PSNR (dB) SSIM
FBP 37.36±0.43 0.9125±0.0145 34.61±0.51 0.8533±0.0240 31.74±0.55 0.7682±0.0339
FBPConvNet 40.86±0.25 0.9693±0.0026 39.27±0.29 0.9587±0.0033 37.69±0.38 0.9449±0.0036
RED-CNN 41.74±0.32 0.9751±0.0022 40.32±0.38 0.9683±0.0026 38.76±0.46 0.9578±0.0029
Learned PD 41.86±0.31 0.9760±0.0025 40.39±0.39 0.9687±0.0031 38.89±0.43 0.9594±0.0033
LEARN 42.44±0.35 0.9780±0.0021 41.06±0.42 0.9738±0.0021 39.33±0.48 0.9638±0.0024
TGV 42.53±0.52 0.9818±0.0017 41.44±0.32 0.9762±0.0032 39.60±0.22 0.9601±0.0080
LDA 43.37±0.38 0.9829±0.0017 41.64±0.47 0.9763±0.0023 39.99±0.55 0.9678±0.0021
MAGIC 44.58±0.37 0.9866±0.0017 43.10±0.28 0.9821±0.0024 41.12±0.20 0.9707±0.0058
ELDA (Proposed) 44.65±0.53 0.9867±0.0016 43.33±0.45 0.9826±0.0017 41.61±0.55 0.9763±0.0017

6 Conclusion

In brief, we have proposed an efficient inexact learned descent algorithm for low-dose CT reconstruction. By incorporating sparsity-enhancing and non-local smoothing modules into the regularizer, the proposed ELDA outperforms several state-of-the-art reconstruction methods in accuracy and efficiency on two widely used datasets while retaining the convergence guarantee.

References

  • [1] G. N. Hounsfield, “Computerized transverse axial scanning (tomography): Part 1. description of system,” The British Journal of Radiology, vol. 46, no. 552, pp. 1016–1022, 1973.
  • [2] A. M. Cormack, “Representation of a function by its line integrals, with some radiological applications. ii,” Journal of Applied Physics, vol. 35, no. 10, pp. 2908–2913, 1964.
  • [3] D. J. Brenner, C. D. Elliston, E. J. Hall, and W. E. Berdon, “Estimated risks of radiation-induced fatal cancer from pediatric ct,” American journal of roentgenology, vol. 176, no. 2, pp. 289–296, 2001.
  • [4] A. S. Brody, D. P. Frush, W. Huda, R. L. Brent et al., “Radiation risk to children from computed tomography,” Pediatrics, vol. 120, no. 3, pp. 677–682, 2007.
  • [5] N. Saltybaeva, K. Martini, T. Frauenfelder, and H. Alkadhi, “Organ dose and attributable cancer risk in lung cancer screening with low-dose computed tomography,” PloS one, vol. 11, no. 5, p. e0155722, 2016.
  • [6] A. M. den Harder et al., “Radiation dose reduction in pediatric great vessel stent computed tomography using iterative reconstruction: A phantom study,” PloS one, vol. 12, no. 4, p. e0175714, 2017.
  • [7] A. Sauter et al., “Ultra low dose ct pulmonary angiography with iterative reconstruction,” PLoS One, vol. 11, no. 9, p. e0162716, 2016.
  • [8] E. J. Keller et al., “Reinforcing the importance and feasibility of implementing a low-dose protocol for ct-guided biopsies,” Academic radiology, vol. 25, no. 9, pp. 1146–1151, 2018.
  • [9] C.-J. Hsieh, S.-C. Jin, J.-C. Chen, C.-W. Kuo, R.-T. Wang, and W.-C. Chu, “Performance of sparse-view ct reconstruction with multi-directional gradient operators,” PloS one, vol. 14, no. 1, p. e0209674, 2019.
  • [10] H. Lu, T. Li, and Z. Liang, “Sinogram noise reduction for low-dose ct by statistics-based nonlinear filters,” Proceedings of SPIE - The International Society for Optical Engineering, vol. 5747, 2005.
  • [11] A. Manduca et al., “Projection space denoising with bilateral filtering and ct noise modeling for dose reduction in ct,” Medical Physics, vol. 36, no. 11, pp. 4911–4919, 2009.
  • [12] J. Wang, T. Li, H. Lu, and Z. Liang, “Penalized weighted least-squares approach to sinogram noise reduction and image reconstruction for low-dose x-ray computed tomography,” IEEE Transactions on Medical Imaging, vol. 25, no. 10, pp. 1272–1283, 2006.
  • [13] S. Tipnis et al., “Iterative reconstruction in image space (iris) and lesion detection in abdominal ct,” in Medical Imaging 2010: Physics of Medical Imaging, vol. 7622.   International Society for Optics and Photonics, 2010, p. 76222K.
  • [14] L. L. Geyer et al., “State of the art: iterative ct reconstruction techniques,” Radiology, vol. 276, no. 2, pp. 339–357, 2015.
  • [15] M. J. Willemink et al., “Iterative reconstruction techniques for computed tomography part 2: initial results in dose reduction and image quality,” European radiology, vol. 23, no. 6, pp. 1632–1642, 2013.
  • [16] X. Zheng, S. Ravishankar, Y. Long, and J. A. Fessler, “Pwls-ultra: An efficient clustering and learning-based approach for low-dose 3d ct image reconstruction,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1498–1510, 2018.
  • [17] H. Shan et al., “Competitive performance of a modularized deep neural network compared to commercial algorithms for low-dose ct image reconstruction,” Nature Machine Intelligence, vol. 1, no. 6, pp. 269–276, 2019.
  • [18] K. Liang, H. Yang, K. Kang, and Y. Xing, “Improve angular resolution for sparse-view ct with residual convolutional neural network,” in Medical Imaging 2018: Physics of Medical Imaging, vol. 10573.   International Society for Optics and Photonics, 2018, p. 105731K.
  • [19] Y. Han and J. C. Ye, “Framing u-net via deep convolutional framelets: Application to sparse-view ct,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1418–1429, 2018.
  • [20] H. Chen et al., “Low-dose ct via convolutional neural network,” Biomedical optics express, vol. 8, no. 2, pp. 679–694, 2017.
  • [21] S. Xie and T. Yang, “Artifact removal in sparse-angle ct based on feature fusion residual network,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 5, no. 2, pp. 261–271, 2021.
  • [22] Z. Zhang, X. Liang, X. Dong, Y. Xie, and G. Cao, “A sparse-view ct reconstruction method based on combination of densenet and deconvolution,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1407–1417, 2018.
  • [23] D. Hu et al., “Hybrid-domain neural network processing for sparse-view ct reconstruction,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 5, no. 1, pp. 88–98, 2021.
  • [24] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
  • [25] H. Lee, J. Lee, H. Kim, B. Cho, and S. Cho, “Deep-neural-network-based sinogram synthesis for sparse-view ct image reconstruction,” IEEE Transactions on Radiation and Plasma Medical Sciences, vol. 3, no. 2, pp. 109–119, 2018.
  • [26] E. Kang, W. Chang, J. Yoo, and J. C. Ye, “Deep convolutional framelet denosing for low-dose ct via wavelet residual network,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1358–1369, 2018.
  • [27] Q. Yang et al., “Low-dose ct image denoising using a generative adversarial network with wasserstein distance and perceptual loss,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1348–1357, 2018.
  • [28] H. Lee, J. Lee, and S. Cho, “View-interpolation of sparsely sampled sinogram using convolutional neural network,” in Medical Imaging 2017: Image Processing, vol. 10133.   International Society for Optics and Photonics, 2017, p. 1013328.
  • [29] H. Chen et al., “Low-dose ct with a residual encoder-decoder convolutional neural network,” IEEE transactions on medical imaging, vol. 36, no. 12, pp. 2524–2535, 2017.
  • [30] E. Kang, J. Min, and J. C. Ye, “A deep convolutional neural network using directional wavelets for low-dose x-ray ct reconstruction,” Medical physics, vol. 44, no. 10, pp. e360–e375, 2017.
  • [31] S. Ye, S. Ravishankar, Y. Long, and J. A. Fessler, “Spultra: Low-dose ct image reconstruction with joint statistical and learned image models,” IEEE Transactions on Medical Imaging, vol. 39, no. 3, pp. 729–741, 2019.
  • [32] H. Chen et al., “Learn: Learned experts’ assessment-based reconstruction network for sparse-data ct,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1333–1347, 2018.
  • [33] D. Wu, K. Kim, G. El Fakhri, and Q. Li, “Iterative low-dose ct reconstruction with priors trained by artificial neural network,” IEEE transactions on medical imaging, vol. 36, no. 12, pp. 2479–2486, 2017.
  • [34] Y. Chen, H. Liu, X. Ye, and Q. Zhang, “Learnable descent algorithm for nonsmooth nonconvex image reconstruction,” arXiv preprint arXiv:2007.11245, 2020.
  • [35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [36] I. Y. Chun and J. A. Fessler, “Convolutional dictionary learning: Acceleration and convergence,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1697–1712, 2017.
  • [37] ——, “Convolutional analysis operator learning: Acceleration and convergence,” IEEE Transactions on Image Processing, vol. 29, no. 1, pp. 2108–2122, 2019.
  • [38] J. Sun, H. Li, Z. Xu et al., “Deep admm-net for compressive sensing mri,” in Advances in neural information processing systems, 2016, pp. 10–18.
  • [39] I. Y. Chun, X. Zheng, Y. Long, and J. A. Fessler, “Bcd-net for low-dose ct reconstruction: Acceleration, convergence, and generalization,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 31–40.
  • [40] I. Y. Chun, Z. Huang, H. Lim, and J. Fessler, “Momentum-net: Fast and convergent iterative neural network for inverse problems,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, doi: 10.1109/TPAMI.2020.3012955.
  • [41] S. Ye, Y. Long, and I. Y. Chun, “Momentum-net for low-dose ct image reconstruction,” arXiv preprint arXiv:2002.12018, 2020.
  • [42] J. Adler and O. Öktem, “Learned primal-dual reconstruction,” IEEE transactions on medical imaging, vol. 37, no. 6, pp. 1322–1332, 2018.
  • [43] W. Xia et al., “Magic: Manifold and graph integrative convolutional network for low-dose ct reconstruction,” arXiv preprint arXiv:2008.00406, 2020.
  • [44] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
  • [45] Y. Nesterov, “Smooth minimization of non-smooth functions,” Mathematical programming, vol. 103, no. 1, pp. 127–152, 2005.
  • [46] H. Le and A. Borji, “What are the receptive, effective receptive, and projective fields of neurons in convolutional neural networks?” CoRR, vol. abs/1705.07049, 2017.
  • [47] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [48] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision.   Springer International Publishing, 2016, pp. 630–645.
  • [49] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [50] F. H. Clarke, Optimization and nonsmooth analysis.   Siam, 1990, vol. 5.
  • [51] A. Paszke et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32.   Curran Associates, Inc., 2019, pp. 8024–8035.
  • [52] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in In Proceedings of the International Conference on Artificial Intelligence and Statistics. Society for Artificial Intelligence and Statistics, 2010.
  • [53] C. McCollough, “Tu-fg-207a-04: Overview of the low dose ct grand challenge,” Medical Physics, vol. 43, pp. 3759–3760, 06 2016.
  • [54] B. De Man and S. Basu, “Distance-driven projection and backprojection,” in 2002 IEEE Nuclear Science Symposium Conference Record, vol. 3.   IEEE, 2002, pp. 1477–1480.
  • [55] ——, “Distance-driven projection and backprojection in three dimensions,” Physics in Medicine & Biology, vol. 49, no. 11, p. 2463, 2004.
  • [56] T. Li, H. Lu, and Z. Liang, “Penalized weighted least-squares approach to sinogram noise reduction and image reconstruction for low-dose x-ray computed tomography,” IEEE Transactions on Medical Imaging, vol. 25, pp. 1272–1283, 2006.
  • [57] A. C. Kak, M. Slaney, and G. Wang, “Principles of computerized tomographic imaging,” Medical Physics, vol. 29, no. 1, p. 107, 2002.
  • [58] S. Niu et al., “Sparse-view x-ray ct reconstruction via total generalized variation regularization,” Physics in Medicine & Biology, vol. 59, no. 12, p. 2997, 2014.