
Yunmei Chen, Lezhi Liu and Lei Zhang
Department of Mathematics, University of Florida, Gainesville, FL 32611, USA
Email: yun, lliu1, [email protected]

A Learned Proximal Alternating Minimization Algorithm and Its Induced Network for a Class of Two-block Nonconvex and Nonsmooth Optimization

Yunmei Chen    Lezhi Liu    Lei Zhang
(Received: date / Accepted: date)
Abstract

This work proposes a general learned proximal alternating minimization algorithm, LPAM, for solving learnable two-block nonsmooth and nonconvex optimization problems. We tackle the nonsmoothness by an appropriate smoothing technique with an automatically diminishing smoothing effect. For the smoothed nonconvex problems we modify the proximal alternating linearized minimization (PALM) scheme by incorporating the residual learning architecture, which has proven highly effective in deep network training, and by employing block coordinate descent (BCD) iterates as a safeguard for the convergence of the algorithm. We prove that there is a subsequence of the iterates generated by LPAM that has at least one accumulation point, and that each accumulation point is a Clarke stationary point. Our method is widely applicable, as one can employ various learning problems formulated as two-block optimizations, and it is also easy to extend to multi-block nonsmooth and nonconvex optimization problems. The network whose architecture follows LPAM exactly, namely LPAM-net, inherits the convergence properties of the algorithm, which makes the network interpretable. As an example application, we present numerical and theoretical results on applying LPAM-net to joint multi-modal MRI reconstruction with significantly under-sampled $k$-space data. The experimental results indicate that the proposed LPAM-net is parameter-efficient and compares favourably with some state-of-the-art methods.

Keywords:
Learned alternating minimization algorithm · Nonconvex nonsmooth optimization · Deep learning · MRI image reconstruction
MSC:
90C26 · 65K05 · 65K10 · 68U10

1 Acknowledgement

Funding This research was partially supported by NSF grant DMS-2152961 and Simons Foundation Collaboration Grant (Award Number: 584918).

Data Availability The original dataset used in all experiments is from the Multi-modal Brain Tumor Segmentation Challenge 2018 menze2014multimodal .

2 Introduction

Recent years have witnessed remarkable success of deep learning across various real-world applications. However, a purely data-driven approach may fail to approximate the desired functions, especially when training data are scarce. It is well known that the scarcity of data leads to overfitting and challenges in interpretation. To mitigate these issues, unrolling/unfolding neural networks (UNNs) have been developed and have shown promising results in solving inverse problems arising from computer vision and medical imaging. UNNs are multi-phase neural networks, in which each phase mimics one iteration of an optimization algorithm for solving the inverse problem. However, despite their promising performance in practice, most existing UNNs only superficially resemble the steps of optimization algorithms. Consequently, their outputs do not really yield solutions to any interpretable variational models. This results in a lack of theoretical justification and convergence guarantees for their outputs.

Recently, a novel class of unrolling methods known as learned optimization algorithms (LOAs) LDA ; ldct ; wanyuMiccai has been developed. The goal of an LOA is to tackle the challenges associated with solving a class of inverse problems arising from various image reconstruction and synthesis tasks. These LOAs strategically combine the learned variational model with prior domain knowledge of the underlying physical processes to enhance interpretability, and they utilize optimization theory to guarantee convergence.

The optimization scheme induces a highly structured deep network, aligning its architecture precisely with the iterative algorithm. Hence, the network inherits the convergence property of the algorithm and outputs an approximation of the solution to the variational model. As a result, it is interpretable and parameter-efficient. However, the existing LOAs only consider a single block of variables.

Motivated by a wide range of applications in machine learning that involve multi-modality or multi-domain problems, such as multi-task learning, transfer learning, multi-modal learning and fusion, dual-domain image reconstruction, and image synthesis, in this work we extend the idea of LDA to learned multi-block nonconvex and nonsmooth optimization problems. More specifically, we develop a novel convergent learned proximal alternating minimization (LPAM) algorithm to solve the following class of learnable optimization problems:

(M_{1}):\quad\mbox{minimize}\quad\Phi(\mathbf{x}_{1},\mathbf{x}_{2};\Theta):=H_{1}(\mathbf{x}_{1};\theta_{1})+H_{2}(\mathbf{x}_{2};\theta_{2})+H(\mathbf{x}_{1},\mathbf{x}_{2};\theta),\quad(\mathbf{x}_{1},\mathbf{x}_{2})\in\mathbb{R}^{n}\times\mathbb{R}^{m},    (1)

where each term is a parameterized learnable function, $\Theta=(\theta_{1},\theta_{2},\theta)$ is the set of learned parameters, and each function in (1) is possibly nonconvex and nonsmooth. We shall also study the convergence and iteration complexity of the algorithm, and provide experimental results on the application to joint multi-contrast MRI reconstruction with significantly under-sampled data.

3 Related work

3.1 Related work on alternating minimization algorithms:

The alternating minimization (AM), or two-block coordinate minimization (BCM), algorithm plays an indispensable role in solving a rich class of problems formulated as follows:

(M):\quad\mbox{minimize}\quad\Phi(\mathbf{x}_{1},\mathbf{x}_{2}):=H_{1}(\mathbf{x}_{1})+H_{2}(\mathbf{x}_{2})+H(\mathbf{x}_{1},\mathbf{x}_{2}),\quad(\mathbf{x}_{1},\mathbf{x}_{2})\in\mathbb{R}^{n}\times\mathbb{R}^{m}.    (2)

The standard approach to solve this problem is the Gauss-Seidel iteration scheme, also known as the block coordinate descent (BCD) method. This scheme alternately keeps one block of variables fixed and optimizes over the other block palomar2010convex ; bertsekas2015parallel . Under convexity and smoothness assumptions, several convergence results have been established beck2013convergence ; charbonnier1997deterministic .

Inspired by many practical problems arising from machine learning and image processing, such as non-negative matrix factorization, blind image deconvolution, low-rank matrix completion jain2013low ; hardt2014understanding , phase retrieval netrapalli2013phase , interference alignment peters2009interference , compressed sensing using synthesis sparsity models abolghasemi2012gradient , and joint multi-modal image reconstruction sroubek2011robust , AM algorithms for nonconvex (including biconvex) and nonsmooth minimization have attracted great interest. One effective approach to solving the nonconvex and nonsmooth problem (2) is the proximal alternating minimization (PAM) algorithm. The PAM algorithm developed in attouch2010proximal can be viewed as a proximal regularization of a two-block Gauss-Seidel method for solving (2). Under the assumption that the objective function satisfies the Kurdyka-Łojasiewicz (KL) property, the convergence to a critical point of $\Phi$ of each bounded sequence generated by the algorithm was obtained in attouch2013convergence . However, the PAM algorithm requires the exact minimization of a nonconvex and nonsmooth problem in each step. To overcome this difficulty, the pioneering work bolte2014proximal introduced a proximal alternating linearized minimization (PALM) algorithm, which can be viewed as alternating the steps of the proximal forward-backward scheme. Building on the KL property, it was proved that each bounded sequence generated by PALM globally converges to a critical point. Later, the work pock2016inertial proposed an inertial version of PALM (iPALM), motivated by the heavy-ball method of Polyak polyak1964some ; it is also closely related to the multi-step algorithms of Nesterov's accelerated gradient method nesterov2018lectures . The global convergence to a critical point is also obtained for iPALM under the assumption that the objective function $\Phi$ satisfies the KL property and the same smoothness conditions as for PALM.

3.2 Related work on unrolling neural networks inspired by AM algorithms:

Unrolling neural networks inspired by AM algorithms unfold the iteration scheme of an AM algorithm into a multi-phase deep neural network. Several unrolling convolutional neural networks (CNNs) based on AM algorithms have been developed for solving inverse problems in computer vision, including networks resembling the steps of the alternating direction method of multipliers (ADMM) and of the block coordinate descent (BCD) algorithm. The ADMM-net sun2016deep , with architecture derived from ADMM, has been applied to compressive sensing MRI reconstruction. The deeply aggregated alternating minimization network (DeepAM) kim2017deeply for image restoration uses a data-driven approach in the energy minimization framework. At each iteration, DeepAM advances the following two steps of the conventional AM algorithm. One step uses a data-driven approach to update the auxiliary variable representing the gradient of the image by a CNN that plays the role of regularization. The other step recovers the image by minimizing the data fidelity and the disparity between the gradient of the image and the updated auxiliary variable. Moreover, several versions of BCD-nets that resemble the steps of the BCD algorithm have been successfully applied to solve inverse problems for signal/image recovery and reconstruction aggarwal2018modl ; chun2019bcd ; chun2018deep ; chun2020convolutional ; chun2020momentum .

The common idea of these BCD-nets is to combine a denoising CNN that defines the regularization as one module, namely a denoising module, and a model-based image reconstruction (MBIR) as the other module within the BCD framework. The MBIR module enhances the data fidelity for the image produced by the denoising module. For instance, at each iteration of the BCD-net proposed in chun2018deep , one module trains an image mapping CNN using identical convolutional kernels in both encoders and decoders within a BCD framework for image recovery. The other module minimizes an objective function consisting of the data fidelity and a penalty between the image to be updated and the denoised image from the image mapping CNN. This BCD-Net chun2018deep is modified from chun2019bcd , where a sparsity-promoting denoising network is trained in the denoising modules and the accelerated proximal gradient method using a majorizer is applied to the MBIR modules with a statistical CT data-fidelity term for low-dose CT reconstruction. Furthermore, to achieve fast and convergent solutions of inverse problems in image processing, the work in chun2020momentum proposes Momentum-Net, which is motivated by applying the block proximal extrapolated gradient method using a majorizer and convolutional autoencoders to MBIR. Each iteration of Momentum-Net consists of three core modules: image refining, extrapolation and MBIR. The image refining module is a denoising module that trains an image refining CNN to remove iteration-wise artifacts. The extrapolation module uses momentum from previous updates to amplify the changes in subsequent iterations and accelerate convergence. The MBIR module is non-iterative, since the MBIR problem considered in that work has a closed-form solution. Momentum-Net guarantees convergence to a fixed point for general differentiable nonconvex data-fit terms and convex feasible sets, under the conditions that the sequence of paired refining CNNs is asymptotically nonexpansive and that the refined image from the image refining module gives an asymptotically block-coordinate minimizer.

Very recently, ding2023learned presented a learned alternating algorithm aimed at solving a specific variational model for dual-domain sparse-view CT reconstruction. In that model, the joint term is convex and smooth, and the regularization terms are learnable in the image and sinogram domains, respectively. In this study, we extend their work by considering a broader scenario in which each term in the objective function can be nonconvex and nonsmooth. We provide sufficient conditions for smoothing approximations of the nonsmooth objective functions, and we provide an analysis of the iteration complexity in addition to the convergence analysis.

4 Contributions and Organization

In this work we propose a novel learned proximal alternating minimization algorithm for nonsmooth and nonconvex two-block variational models with provable convergence and iteration complexity. The proposed LPAM algorithm for solving ($M_{1}$) determines the architecture of the deep neural network; hence the design of LPAM considers not only convergence and efficiency, but also the ability to assist the training of the network parameters $\Theta$, i.e., reducing the error in minimizing the loss function.

Our main idea for designing this algorithm is as follows: (i) we tackle the nonsmoothness by an appropriate smoothing technique with an automatically diminishing smoothing effect; (ii) we modify the PALM scheme by incorporating the residual learning architecture, which has proven highly effective in deep network training he2016deep ; and (iii) when some modified PALM iterates fail to satisfy certain conditions, we employ block coordinate descent (BCD) iterates as a safeguard to ensure convergence. Moreover, we prove that a subsequence generated by the proposed LPAM has accumulation points, all of which are Clarke stationary points of the nonsmooth nonconvex problem. This algorithm can be readily extended to multi-block nonsmooth and nonconvex optimization.

This significantly broadens the applicability of the algorithm, as it relaxes the strict convergence conditions of existing algorithms. The PALM and iPALM algorithms assume the joint term $H(\mathbf{x}_{1},\mathbf{x}_{2})$ to be smooth and possibly nonconvex, and the proximal points associated with $H_{i}(\mathbf{x}_{i})$ ($i=1,2$) to be easy to obtain. Their global convergence results are built on the KL property of the objective function. These assumptions in general cannot be met by the learnable parameterized functions representing neural networks. The proposed LPAM algorithm does not require these conditions: each function in (1) can be nonconvex and nonsmooth. Without those restrictions, the LPAM algorithm is flexible and applicable to deep learning problems in various applications. Moreover, the convergence of LPAM can still be established in the sense of subsequence convergence to Clarke stationary points (see details in Section 6), although global convergence to a critical point, as for the PALM or iPALM algorithm, cannot be expected.

The convergence result of BCD-net is based on the assumptions that the paired image refining neural networks are asymptotically nonexpansive, the data fidelity is differentiable, and the MBIR problem is not difficult to solve. The proposed LPAM algorithm does not require these assumptions. The convergence of LPAM is ensured by the convergence of the BCD algorithm for smooth optimization and by the behaviour of the gradient of the smoothed objective functions as the smoothing factor tends to zero. By removing these constraints on the model, LPAM significantly broadens its range of applications.

The remainder of the paper is organized as follows. In Section 5 we construct the LPAM algorithm, motivated by the PALM and BCD algorithms, and introduce the three stages that constitute it. In Section 6, we prove the convergence of the LPAM algorithm and derive its iteration complexity. In Section 7, we apply the LPAM algorithm to the joint reconstruction of T1 and T2 MRI images with significantly under-sampled data to show the promising performance of the proposed method. Furthermore, we provide a convergence analysis and comparisons with several existing methods. All figures are listed at the end of the article.


Throughout the paper, we use the following notations without further notification.

  1. $H(\mathbf{x}_{1},\mathbf{x}_{2};\theta)$ denotes the function $H$ of $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ given a collection of parameters $\theta$. $H_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2};\theta)$ denotes the smooth approximation of $H(\mathbf{x}_{1},\mathbf{x}_{2};\theta)$ with a smoothing parameter $\epsilon$.

  2. For a differentiable function $f(\mathbf{x}_{1},\mathbf{x}_{2})$, we denote the gradient of $f$ with respect to $\mathbf{x}_{i}$ by $\nabla_{i}f(\mathbf{x}_{1},\mathbf{x}_{2})$ ($i=1,2$). The gradient of $f$ on the domain $(\mathbf{x}_{1},\mathbf{x}_{2})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ is denoted by $\nabla_{1,2}f(\mathbf{x}_{1},\mathbf{x}_{2})$.

  3. We use $\|\cdot\|$ to denote the standard $L^{2}$ norm in Euclidean space, and write $\langle\cdot,\cdot\rangle_{\mathbb{R}^{n}}$ for the standard inner product on $\mathbb{R}^{n}$.

  4. $\chi:=\mathbb{R}^{n}\times\mathbb{R}^{m}$.

5 Learned Proximal Alternating Minimization (LPAM) Algorithm

In this section we propose a learned proximal alternating minimization (LPAM) algorithm for solving problem (1). Since the parameters $\Theta$ are learned, we will omit them and simply write (1) as (2).

For $H_{1}$, $H_{2}$, $H$ and $\Phi$ we assume:

(A1): $H_{1}$, $H_{2}$ and $H$ are proper and locally Lipschitz on $\mathbb{R}^{n}$, $\mathbb{R}^{m}$ and $\mathbb{R}^{n}\times\mathbb{R}^{m}$, respectively. Each of them is possibly nonconvex and nonsmooth.

(A2): $\Phi$ is coercive, and $\min_{\mathbf{X}\in\chi}\Phi(\mathbf{X})>-\infty$.

The proposed LPAM algorithm consists of three stages. In the first stage we convert the nonconvex and nonsmooth problem (2) to a nonconvex smooth optimization problem by an appropriate smoothing procedure. The smoothing method is not unique, but we require the smooth approximation $\Phi_{\epsilon}$ of $\Phi:=H_{1}(\mathbf{x}_{1})+H_{2}(\mathbf{x}_{2})+H(\mathbf{x}_{1},\mathbf{x}_{2})$, defined as

\Phi_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2}):=H_{1,\epsilon}(\mathbf{x}_{1})+H_{2,\epsilon}(\mathbf{x}_{2})+H_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2}),    (3)

to satisfy the following conditions:

  • (C1): $\nabla H_{1,\epsilon}$, $\nabla H_{2,\epsilon}$ and $\nabla_{1,2}H_{\epsilon}$ are Lipschitz continuous on $\mathbb{R}^{n}$, $\mathbb{R}^{m}$ and $\mathbb{R}^{n}\times\mathbb{R}^{m}$, respectively.

  • (C2): $\lim_{\epsilon\to 0^{+},\,\mathbf{z}_{i}\to\mathbf{x}_{i}}H_{i,\epsilon}(\mathbf{z}_{i})=H_{i}(\mathbf{x}_{i})$, $i=1,2$, for any given $\mathbf{x}_{i}$ in the corresponding domain. Also, for any $\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ and $\mathbf{Z}=(\mathbf{z}_{1},\mathbf{z}_{2})$, we assume $\lim_{\epsilon\to 0^{+},\,\mathbf{Z}\to\mathbf{X}}H_{\epsilon}(\mathbf{Z})=H(\mathbf{X})$.

  • (C3): There is a continuous and non-negative function $\mathfrak{m}:[0,\infty)\to[0,\infty)$ with $\mathfrak{m}(0)=0$ such that

    \Phi_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2})+\mathfrak{m}(\epsilon)\leq\Phi_{\delta}(\mathbf{x}_{1},\mathbf{x}_{2})+\mathfrak{m}(\delta)\quad\mbox{for all}\ (\mathbf{x}_{1},\mathbf{x}_{2})\in\chi\ \mbox{and}\ 0<\epsilon\leq\delta.

  • (C4): For any $\mathbf{X}^{*}=(\mathbf{x}_{1},\mathbf{x}_{2})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ and any $\mathbf{X}^{k}\to\mathbf{X}^{*}$, $\lim_{\mathbf{X}^{k}\to\mathbf{X}^{*},\,\epsilon\to 0^{+}}\nabla_{1,2}\Phi_{\epsilon}(\mathbf{X}^{k})\in\partial^{c}\Phi(\mathbf{X}^{*})$, where $\partial^{c}\Phi(\mathbf{X}^{*})$ denotes the Clarke subdifferential of $\Phi$ (see Definition 1).

Remark 1

The Lipschitz constants of $\nabla H_{1,\epsilon}$, $\nabla H_{2,\epsilon}$ and $\nabla_{1,2}H_{\epsilon}$ may tend to infinity as $\epsilon$ tends to $0$. The condition (C3) is much weaker than monotonicity of $\Phi_{\epsilon}$ in $\epsilon$: roughly, it says that as long as $\Phi_{\epsilon}$ does not differ too much from a function that is monotone in $\epsilon$, our method still converges. This brings great convenience in applications.
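As a concrete illustration (our own example; it is not the specific model used in the experiments), the nonsmooth term $H_{1}(\mathbf{x})=\|\mathbf{x}\|_{1}$ can be smoothed by $H_{1,\epsilon}(\mathbf{x})=\sum_{i}\sqrt{x_{i}^{2}+\epsilon^{2}}$. Its gradient is Lipschitz with constant of order $1/\epsilon$ (consistent with Remark 1), $H_{1,\epsilon}\to H_{1}$ as $\epsilon\to 0^{+}$ as required by (C2), and $H_{1,\epsilon}$ is nondecreasing in $\epsilon$, so (C3) holds with $\mathfrak{m}\equiv 0$. A minimal numerical sketch:

```python
import numpy as np

def h1(x):
    """Nonsmooth term: the l1 norm (illustrative choice, not the paper's trained model)."""
    return np.sum(np.abs(x))

def h1_eps(x, eps):
    """Smooth approximation of the l1 norm with smoothing parameter eps."""
    return np.sum(np.sqrt(x**2 + eps**2))

def grad_h1_eps(x, eps):
    """Gradient of the smoothed term; its Lipschitz constant scales like 1/eps."""
    return x / np.sqrt(x**2 + eps**2)

x = np.array([0.5, -0.1, 0.0])
for eps in [1e-1, 1e-2, 1e-3]:
    # (C2): the smoothed value approaches the nonsmooth value as eps -> 0+
    print(eps, h1_eps(x, eps) - h1(x))
# (C3) with m == 0: h1_eps(x, eps) <= h1_eps(x, delta) whenever 0 < eps <= delta
assert h1_eps(x, 1e-3) <= h1_eps(x, 1e-2) <= h1_eps(x, 1e-1)
```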

In the second stage we solve the smoothed nonconvex problem with a fixed smoothing factor $\epsilon$, i.e.,

\min_{\mathbf{x}_{1},\mathbf{x}_{2}}\{\Phi_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2}):=H_{1,\epsilon}(\mathbf{x}_{1})+H_{2,\epsilon}(\mathbf{x}_{2})+H_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2})\}.    (4)

In light of the substantial improvement in practical performance achieved by ResNet he2016deep , we propose a modified PALM that incorporates the architecture of ResNet for solving (4). With $\epsilon=\epsilon_{k}>0$, PALM bolte2014proximal generates a sequence of iterates $\{\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1}\}$ by

\mathbf{z}_{1}^{k+1}=\mathbf{x}_{1}^{k}-\alpha_{k}\nabla H_{1,\epsilon_{k}}(\mathbf{x}_{1}^{k}),\qquad\mathbf{u}_{1}^{k+1}=\arg\min_{\mathbf{u}}\frac{1}{2\alpha_{k}}\|\mathbf{u}-\mathbf{z}_{1}^{k+1}\|^{2}+H_{\epsilon_{k}}(\mathbf{u},\mathbf{x}_{2}^{k}),

and

\mathbf{z}_{2}^{k+1}=\mathbf{x}_{2}^{k}-\beta_{k}\nabla H_{2,\epsilon_{k}}(\mathbf{x}_{2}^{k}),\qquad\mathbf{u}_{2}^{k+1}=\arg\min_{\mathbf{u}}\frac{1}{2\beta_{k}}\|\mathbf{u}-\mathbf{z}_{2}^{k+1}\|^{2}+H_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{u}),

where $\alpha_{k}$ and $\beta_{k}$ are step sizes. In some situations, the proximal points $\mathbf{u}_{1}^{k+1}$ and $\mathbf{u}_{2}^{k+1}$ are not easy to compute, so we replace $H_{\epsilon_{k}}(\mathbf{u},\mathbf{x}_{2}^{k})$ by its linear approximation at $\mathbf{z}_{1}^{k+1}$ together with a proximal term $\frac{1}{2p_{k}}\|\mathbf{u}-\mathbf{z}_{1}^{k+1}\|^{2}$, and similarly we replace $H_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{u})$ by its linear approximation at $\mathbf{z}_{2}^{k+1}$ together with a proximal term $\frac{1}{2q_{k}}\|\mathbf{u}-\mathbf{z}_{2}^{k+1}\|^{2}$. Thus we have

\mathbf{u}_{1}^{k+1}=\arg\min_{\mathbf{u}}\frac{1}{2\alpha_{k}}\|\mathbf{u}-\mathbf{z}_{1}^{k+1}\|^{2}+\langle\nabla_{1}H_{\epsilon_{k}}(\mathbf{z}_{1}^{k+1},\mathbf{x}_{2}^{k}),\mathbf{u}-\mathbf{z}_{1}^{k+1}\rangle+\frac{1}{2p_{k}}\|\mathbf{u}-\mathbf{z}_{1}^{k+1}\|^{2}=\mathbf{z}_{1}^{k+1}-\tau_{k}\nabla_{1}H_{\epsilon_{k}}(\mathbf{z}_{1}^{k+1},\mathbf{x}_{2}^{k}),\quad\mbox{with}\ \tau_{k}=\frac{\alpha_{k}p_{k}}{\alpha_{k}+p_{k}},    (5)

where we recall that $\nabla_{1}H_{\epsilon_{k}}$ stands for the gradient with respect to the first $n$ variables, and

\mathbf{u}_{2}^{k+1}=\arg\min_{\mathbf{u}}\frac{1}{2\beta_{k}}\|\mathbf{u}-\mathbf{z}_{2}^{k+1}\|^{2}+\langle\nabla_{2}H_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{z}_{2}^{k+1}),\mathbf{u}-\mathbf{z}_{2}^{k+1}\rangle+\frac{1}{2q_{k}}\|\mathbf{u}-\mathbf{z}_{2}^{k+1}\|^{2}=\mathbf{z}_{2}^{k+1}-\gamma_{k}\nabla_{2}H_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{z}_{2}^{k+1}),\quad\mbox{with}\ \gamma_{k}=\frac{\beta_{k}q_{k}}{\beta_{k}+q_{k}}.    (6)

In the deep learning approach, the step sizes $\alpha_{k}$, $\tau_{k}$, $\beta_{k}$ and $\gamma_{k}$ can be learnable hyper-parameters. The advantage of using the scheme (5)-(6) to update $\mathbf{u}_{1}$ and $\mathbf{u}_{2}$ is that it is consistent with the architecture of residual learning he-1 ; he-2 , which only learns the correction needed for $\mathbf{z}_{i}^{k+1}$ ($i=1,2$) and avoids gradient vanishing in network training, thus improving solution quality in practice. However, the convergence of the sequence $\{(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1})\}$ is not guaranteed. Inspired by the proof of convergence in bolte2014proximal , we propose the following safeguard.
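For concreteness, the following is a minimal sketch of one pass of the updates (5)-(6) (Lines 3-6 of Algorithm 1 below). The callables grad_H1, grad_H2, grad_H_x1 and grad_H_x2 are placeholders of our own naming for the gradients of the smoothed separable terms and the partial gradients of the smoothed joint term; in LPAM-net they would be realized by (learned) network components.

```python
def modified_palm_step(x1, x2, eps, alpha, tau, beta, gamma,
                       grad_H1, grad_H2, grad_H_x1, grad_H_x2):
    """One residual-style update of both blocks, following (5)-(6).

    grad_H1(x1, eps), grad_H2(x2, eps): gradients of H_{1,eps} and H_{2,eps}.
    grad_H_x1(x1, x2, eps), grad_H_x2(x1, x2, eps): partial gradients of H_eps.
    (Placeholder callables for this sketch.)
    """
    z1 = x1 - alpha * grad_H1(x1, eps)        # gradient step on H_{1,eps}
    u1 = z1 - tau * grad_H_x1(z1, x2, eps)    # residual-style correction of z1
    z2 = x2 - beta * grad_H2(x2, eps)         # gradient step on H_{2,eps}
    u2 = z2 - gamma * grad_H_x2(u1, z2, eps)  # correction using the updated block u1
    return u1, u2
```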

If $\mathbf{u}_{1}^{k+1}$ and $\mathbf{u}_{2}^{k+1}$ satisfy

\Phi_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1})-\Phi_{\epsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-a(\|\mathbf{u}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|^{2}+\|\mathbf{u}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|^{2})    (7)

and

\|\nabla_{1,2}\Phi_{\epsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|\leq a^{-1}(\|\mathbf{u}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|+\|\mathbf{u}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|)    (8)

for some $a>0$, we take $\mathbf{x}_{1}^{k+1}=\mathbf{u}_{1}^{k+1}$, $\mathbf{x}_{2}^{k+1}=\mathbf{u}_{2}^{k+1}$.

If one of (7) and (8) is violated, we compute $(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})$ by the BCD algorithm with a simple line-search strategy for better step sizes as follows. Let $\bar{\alpha},\bar{\beta}\in(0,1)$; we compute

\mathbf{v}_{1}^{k+1}=\arg\min_{\mathbf{v}}\langle\nabla_{1}\Phi_{\epsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k}),\mathbf{v}-\mathbf{x}_{1}^{k}\rangle+\frac{1}{2\bar{\alpha}}\|\mathbf{v}-\mathbf{x}_{1}^{k}\|^{2},    (9)
\mathbf{v}_{2}^{k+1}=\arg\min_{\mathbf{v}}\langle\nabla_{2}\Phi_{\epsilon_{k}}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k}),\mathbf{v}-\mathbf{x}_{2}^{k}\rangle+\frac{1}{2\bar{\beta}}\|\mathbf{v}-\mathbf{x}_{2}^{k}\|^{2}.    (10)

The first-order optimality conditions lead to

\mathbf{v}_{1}^{k+1}=\mathbf{x}_{1}^{k}-\bar{\alpha}(\nabla H_{1,\epsilon_{k}}(\mathbf{x}_{1}^{k})+\nabla_{1}H_{\epsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})),    (11)
\mathbf{v}_{2}^{k+1}=\mathbf{x}_{2}^{k}-\bar{\beta}(\nabla H_{2,\epsilon_{k}}(\mathbf{x}_{2}^{k})+\nabla_{2}H_{\epsilon_{k}}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})).

If, for some $\delta\in(0,1)$,

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-\delta(\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|^{2}+\|\mathbf{v}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|^{2}),    (12)

we set

\mathbf{x}_{1}^{k+1}=\mathbf{v}_{1}^{k+1},\quad\mathbf{x}_{2}^{k+1}=\mathbf{v}_{2}^{k+1};

otherwise we reduce the step sizes by $(\bar{\alpha},\bar{\beta})\leftarrow\rho(\bar{\alpha},\bar{\beta})$, where $0<\rho<1$, and recompute $\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1}$ until condition (12) holds; then we take $\mathbf{x}_{1}^{k+1}=\mathbf{v}_{1}^{k+1}$, $\mathbf{x}_{2}^{k+1}=\mathbf{v}_{2}^{k+1}$.
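A sketch of this safeguard logic is given below: it checks (7)-(8) for the candidate pair and otherwise falls back to the BCD step (11) with the backtracking test (12). Phi_eps, grad_Phi_x1 and grad_Phi_x2 are placeholder callables (our naming) for the smoothed objective $\Phi_{\epsilon}$ and its partial gradients.

```python
import numpy as np

def safeguarded_update(x1, x2, u1, u2, eps, a, delta, alpha_bar, beta_bar, rho,
                       Phi_eps, grad_Phi_x1, grad_Phi_x2):
    """Accept (u1, u2) if the descent/gradient conditions (7)-(8) hold;
    otherwise run the BCD step (11) with backtracking until (12) holds."""
    phi_old = Phi_eps(x1, x2, eps)
    sq_move = np.linalg.norm(u1 - x1) ** 2 + np.linalg.norm(u2 - x2) ** 2
    grad_norm = np.sqrt(np.linalg.norm(grad_Phi_x1(x1, x2, eps)) ** 2
                        + np.linalg.norm(grad_Phi_x2(x1, x2, eps)) ** 2)
    cond7 = Phi_eps(u1, u2, eps) - phi_old <= -a * sq_move
    cond8 = grad_norm <= (np.linalg.norm(u1 - x1) + np.linalg.norm(u2 - x2)) / a
    if cond7 and cond8:
        return u1, u2
    # Safeguard: BCD step (11) with a line search on (alpha_bar, beta_bar).
    while True:
        v1 = x1 - alpha_bar * grad_Phi_x1(x1, x2, eps)
        v2 = x2 - beta_bar * grad_Phi_x2(v1, x2, eps)
        decrease = np.linalg.norm(v1 - x1) ** 2 + np.linalg.norm(v2 - x2) ** 2
        if Phi_eps(v1, v2, eps) - phi_old <= -delta * decrease:  # condition (12)
            return v1, v2
        alpha_bar, beta_bar = rho * alpha_bar, rho * beta_bar    # shrink and retry
```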

In the third stage, we check whether the $\ell_{2}$-norm of the gradient of the smoothed objective function has been reduced sufficiently, so that we can perform the second stage with a reduced smoothing factor. By gradually reducing the smoothing factor, we obtain a subsequence of the iterates that converges to a Clarke stationary point of the original nonconvex and nonsmooth problem.

The scheme of LPAM is given below (for notational simplicity, we omit all learned network parameters).

Algorithm 1 Learned Proximal Alternating Minimization (LPAM) Algorithm
1:  Input: initial $\mathbf{x}_{1}^{0}$, $\mathbf{x}_{2}^{0}$, $0<\rho,\gamma,\delta<1$, and $\bar{\alpha},\bar{\beta},\sigma,a,\epsilon_{0}>0$; maximum iteration $K$ or tolerance $\varepsilon_{tol}>0$.
2:  for $k=0,1,2,\dots,K$ do
3:     $\mathbf{z}_{1}^{k+1}=\mathbf{x}_{1}^{k}-\alpha_{k}\nabla H_{1,\epsilon_{k}}(\mathbf{x}_{1}^{k})$,
4:     $\mathbf{u}_{1}^{k+1}=\mathbf{z}_{1}^{k+1}-\tau_{k}\nabla_{1}H_{\epsilon_{k}}(\mathbf{z}_{1}^{k+1},\mathbf{x}_{2}^{k})$,
5:     $\mathbf{z}_{2}^{k+1}=\mathbf{x}_{2}^{k}-\beta_{k}\nabla H_{2,\epsilon_{k}}(\mathbf{x}_{2}^{k})$,
6:     $\mathbf{u}_{2}^{k+1}=\mathbf{z}_{2}^{k+1}-\gamma_{k}\nabla_{2}H_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{z}_{2}^{k+1})$,
7:     if conditions (7) and (8) hold, then
8:        set $\mathbf{x}_{1}^{k+1}=\mathbf{u}_{1}^{k+1}$, $\mathbf{x}_{2}^{k+1}=\mathbf{u}_{2}^{k+1}$,
9:     else
10:        for given $\bar{\alpha},\bar{\beta}\in(0,1)$,
11:        $\mathbf{v}_{1}^{k+1}=\mathbf{x}_{1}^{k}-\bar{\alpha}(\nabla H_{1,\epsilon_{k}}(\mathbf{x}_{1}^{k})+\nabla_{1}H_{\epsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k}))$,
12:        $\mathbf{v}_{2}^{k+1}=\mathbf{x}_{2}^{k}-\bar{\beta}(\nabla H_{2,\epsilon_{k}}(\mathbf{x}_{2}^{k})+\nabla_{2}H_{\epsilon_{k}}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k}))$.
13:        if condition (12) holds, then
14:           set $\mathbf{x}_{1}^{k+1}=\mathbf{v}_{1}^{k+1}$, $\mathbf{x}_{2}^{k+1}=\mathbf{v}_{2}^{k+1}$,
15:        else
16:           update $(\bar{\alpha},\bar{\beta})\leftarrow\rho(\bar{\alpha},\bar{\beta})$, then go to 11,
17:        end if
18:     end if
19:     if $\|\nabla_{1,2}\Phi_{\varepsilon_{k}}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})\|<\sigma\gamma\varepsilon_{k}$, set $\varepsilon_{k+1}=\gamma\varepsilon_{k}$; otherwise, set $\varepsilon_{k+1}=\varepsilon_{k}$.
20:     if $\sigma\varepsilon_{k}<\varepsilon_{tol}$, terminate.
21:  end for
22:  Output: $\mathbf{x}_{1}^{k+1}$, $\mathbf{x}_{2}^{k+1}$

We comment here on the LPAM algorithm. Lines 2 to 18 can be viewed as the steps of the inner iteration, which, for a fixed $\varepsilon$, generate a sequence $\{\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k}\}$ such that $\|\nabla_{1,2}\Phi_{\varepsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|\to 0$ as $k\to\infty$ (see the first statement of Lemma 1 in the next section). This guarantees the reduction of $\|\nabla_{1,2}\Phi_{\varepsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|$ for a fixed $\varepsilon$. When this value drops below a prescribed level, Line 19 decreases $\varepsilon$, and we then repeat the procedure from Line 2 to Line 18 to generate a new sequence $\{\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k}\}$ corresponding to the reduced $\varepsilon$. We continue these steps until the termination condition in Line 20 is met. The sequence formed by the iterates at which the reduction criterion for $\varepsilon$ is met has at least one accumulation point, and each accumulation point is a Clarke stationary point of the original problem. This will be proved in Theorem 6.1 in Section 6.
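The control flow of Lines 19-20 can be summarized as follows (a sketch with our own naming; grad_norm stands for $\|\nabla_{1,2}\Phi_{\varepsilon_{k}}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})\|$, and gamma_red is the reduction factor $\gamma$ of Algorithm 1, renamed to avoid confusion with the step size $\gamma_{k}$):

```python
def update_smoothing(eps_k, grad_norm, sigma, gamma_red, eps_tol):
    """Lines 19-20 of Algorithm 1: shrink the smoothing parameter when the
    smoothed gradient is small enough, and test the termination criterion."""
    eps_next = gamma_red * eps_k if grad_norm < sigma * gamma_red * eps_k else eps_k
    stop = sigma * eps_k < eps_tol   # Line 20: terminate when sigma*eps_k < eps_tol
    return eps_next, stop
```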

Remark: In order to make the LPAM algorithm match the ResNet architecture, if the function $H_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2})$ is learnable and $H_{1,\epsilon}(\mathbf{x}_{1})$ and $H_{2,\epsilon}(\mathbf{x}_{2})$ are given, it is better to use the scheme above. If the functions $H_{1,\epsilon}(\mathbf{x}_{1})$ and $H_{2,\epsilon}(\mathbf{x}_{2})$ are learnable and $H_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2})$ is given, it is better to change Lines 3-6 to the following:

\mathbf{z}_{1}^{k+1}=\mathbf{x}_{1}^{k}-\alpha_{k}\nabla_{1}H_{\epsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k}),    (13)
\mathbf{u}_{1}^{k+1}=\mathbf{z}_{1}^{k+1}-\tau_{k}\nabla H_{1,\epsilon_{k}}(\mathbf{z}_{1}^{k+1});    (14)
\mathbf{z}_{2}^{k+1}=\mathbf{x}_{2}^{k}-\beta_{k}\nabla_{2}H_{\epsilon_{k}}(\mathbf{u}_{1}^{k+1},\mathbf{x}_{2}^{k}),    (15)
\mathbf{u}_{2}^{k+1}=\mathbf{z}_{2}^{k+1}-\gamma_{k}\nabla H_{2,\epsilon_{k}}(\mathbf{z}_{2}^{k+1});    (16)

and keep the other steps in Algorithm 1 unchanged.

6 Convergence Analysis

Since we deal with a nonconvex and nonsmooth optimization problem, the Clarke subdifferential, which is based on the concept of generalized directional derivatives, is employed here to characterize the optimality of the solutions to (2).

Definition 1

(Clarke subdifferential) Suppose that $f:\mathbb{R}^{n}\times\mathbb{R}^{m}\to(-\infty,\infty]$ is locally Lipschitz. The Clarke subdifferential of $f$ at $\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2})$ is defined as

\partial^{c}f(\mathbf{X}):=\Big\{\mathbf{W}\in\mathbb{R}^{n}\times\mathbb{R}^{m}\;\Big|\;\langle\mathbf{W},\mathbf{V}\rangle\leq\limsup_{\mathbf{Z}\to\mathbf{X},\,t\downarrow 0}\frac{f(\mathbf{Z}+t\mathbf{V})-f(\mathbf{Z})}{t},\quad\forall\;\mathbf{V}\in\mathbb{R}^{n}\times\mathbb{R}^{m}\Big\},

where $\mathbf{X},\mathbf{Z}\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ and $\langle\mathbf{W},\mathbf{V}\rangle=\langle\mathbf{w}_{1},\mathbf{v}_{1}\rangle_{\mathbb{R}^{n}}+\langle\mathbf{w}_{2},\mathbf{v}_{2}\rangle_{\mathbb{R}^{m}}$ for $\mathbf{V}=(\mathbf{v}_{1},\mathbf{v}_{2})$, $\mathbf{W}=(\mathbf{w}_{1},\mathbf{w}_{2})$.

Definition 2

(Clarke stationary point) For a locally Lipschitz function $f$ as in Definition 1, a point $\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ is called a Clarke stationary point of $f$ if $0\in\partial^{c}f(\mathbf{X})$.
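For intuition, a standard one-dimensional illustration (not taken from the paper): for $f(x)=|x|$ the generalized directional derivative at $0$ is $f^{\circ}(0;v)=|v|$, so

\partial^{c}f(0)=\{w\in\mathbb{R}\;|\;wv\leq|v|\ \ \forall v\in\mathbb{R}\}=[-1,1]\ni 0,

and $x=0$ is a Clarke stationary point of $|x|$ (its global minimizer). The same computation for $f(x)=-|x|$ also gives $\partial^{c}f(0)=[-1,1]$, so $x=0$ is Clarke stationary for $-|x|$ as well, even though it is a local maximizer; this shows that Clarke stationarity is a necessary but rather weak optimality notion.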

Lemma 1

Assume that $\Phi_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2})$ is a smooth approximation of $\Phi(\mathbf{x}_{1},\mathbf{x}_{2})$ defined in (3) satisfying conditions (C1)-(C4). Let $\varepsilon,\eta,a>0$, $\rho,\bar{\alpha},\bar{\beta}\in(0,1)$, and let $\mathbf{X}^{0}=(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})$ be an arbitrary initial point. Suppose $\{\mathbf{X}^{k}=(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\}$ is the sequence generated by repeating Lines 2-18 of Algorithm 1 with $\varepsilon_{k}=\varepsilon$. Then we have:

  1. $\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|\to 0$ as $k\to\infty$.

  2. Let $(\bar{\alpha},\bar{\beta})$ with $\bar{\alpha},\bar{\beta}\in(0,1)$ be the initial step sizes for $(\mathbf{v}_{1}^{k},\mathbf{v}_{2}^{k})$. The maximum number of required line search steps $\ell_{max}$ is

    \ell_{max}=\Big[\frac{\log\big((\frac{L_{\epsilon}}{2}+\delta)\max\{\bar{\alpha},\bar{\beta}\}\big)}{\log 1/\rho}\Big]+1,

    where $[A]$ denotes the largest integer less than or equal to $A$, and $L_{\epsilon}$ is the sum of the Lipschitz constants of $\nabla_{1}H_{1,\epsilon}$, $\nabla_{2}H_{2,\epsilon}$ and $\nabla_{1,2}H_{\epsilon}$, i.e.

    L_{\epsilon}:=L_{\nabla_{1}H_{1,\epsilon}}+L_{\nabla_{2}H_{2,\epsilon}}+L_{\nabla_{1,2}H_{\epsilon}}

    (a small numerical sketch of this bound is given after the lemma statement).

  3. For any $\eta>0$,

    \min\{k\in\mathbb{N};\ \|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})\|<\eta\}\leq\Big(2a^{-3}+\frac{4\max(\bar{\alpha},\bar{\beta})^{2}}{\delta\min(\bar{\alpha},\bar{\beta})^{2}\rho^{2}}L_{\epsilon}^{2}\Big)\frac{\Phi(\mathbf{X}^{0})-\Phi_{*}+1}{\eta^{2}}.    (17)
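As a quick numerical illustration of the line-search bound $\ell_{max}$ in statement 2 (the constants below are arbitrary sample values, not taken from the experiments):

```python
import math

def max_line_search_steps(L_eps, delta, alpha_bar, beta_bar, rho):
    """Bound on the number of backtracking steps needed to satisfy (12), cf. (23)."""
    return math.floor(math.log((L_eps / 2 + delta) * max(alpha_bar, beta_bar))
                      / math.log(1 / rho)) + 1

# Illustrative values only: a fairly large Lipschitz constant and rho = 1/2.
print(max_line_search_steps(L_eps=100.0, delta=0.5, alpha_bar=0.9, beta_bar=0.9, rho=0.5))  # -> 6
```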

Proof of Lemma 1:

Given $(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})$, in the case that $(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1})$ generated by Algorithm 1 satisfies conditions (7)-(8), we take $(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})=(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1})$. From (7)-(8) it is easy to get

\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq\frac{2(\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})-\Phi_{\epsilon}(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1}))}{a^{3}}.    (18)

If $(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1})$ fails to satisfy condition (7) or (8), the algorithm computes $(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})$ by (11) with a line search strategy. Let $\ell_{k}$ be the number of line search steps required to meet condition (12), and then take $(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})=(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})$. This gives

\mathbf{v}_{1}^{k+1}=\mathbf{x}_{1}^{k}-\bar{\alpha}\rho^{\ell_{k}}\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k}),\qquad\mathbf{v}_{2}^{k+1}=\mathbf{x}_{2}^{k}-\bar{\beta}\rho^{\ell_{k}}\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k}).    (19)

Since $\nabla_{1,2}\Phi_{\epsilon}$ is $L_{\epsilon}$-Lipschitz continuous, we have

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})\leq\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})+\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\cdot(\mathbf{v}_{2}^{k+1}-\mathbf{x}_{2}^{k})+\frac{L_{\epsilon}}{2}\|\mathbf{v}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|^{2}\leq\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})+\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\cdot(\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k})+\frac{L_{\epsilon}}{2}\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|^{2}+\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\cdot(\mathbf{v}_{2}^{k+1}-\mathbf{x}_{2}^{k})+\frac{L_{\epsilon}}{2}\|\mathbf{v}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|^{2}.    (20)

The combination of (19) and (20) yields

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})\leq\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})+\Big(-\frac{1}{\bar{\alpha}\rho^{\ell_{k}}}+\frac{L_{\epsilon}}{2}\Big)\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|^{2}+\Big(-\frac{1}{\bar{\beta}\rho^{\ell_{k}}}+\frac{L_{\epsilon}}{2}\Big)\|\mathbf{v}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|^{2}.    (21)

Hence condition (12) is met if

-\frac{1}{\bar{\alpha}\rho^{\ell_{k}}}+\frac{L_{\epsilon}}{2}\leq-\delta\quad\mbox{and}\quad-\frac{1}{\bar{\beta}\rho^{\ell_{k}}}+\frac{L_{\epsilon}}{2}\leq-\delta.    (22)

Hence, for any $k=1,2,\ldots$, the maximum number of line search steps $\ell_{max}$ required for (12) satisfies

\rho^{\ell_{max}}=(\delta+L_{\varepsilon}/2)^{-1}(\max\{\bar{\alpha},\bar{\beta}\})^{-1},

so, with a slight abuse of notation, we define $\ell_{max}$ as

\ell_{max}=\Big[\frac{\log\big((\frac{L_{\epsilon}}{2}+\delta)\max\{\bar{\alpha},\bar{\beta}\}\big)}{\log 1/\rho}\Big]+1.    (23)

This proves the second statement of the lemma. Moreover, the fact that $\ell_{k}\leq\ell_{max}$ for any $k=0,1,2,\ldots$ implies

\rho^{\ell_{k}}\geq\rho^{\ell_{max}}=\frac{\rho}{(\delta+L_{\varepsilon}/2)\max\{\bar{\alpha},\bar{\beta}\}}.    (24)

Next we prove the other two statements. Note that, from (19), condition (12) can be rewritten as

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})\leq\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})-\delta\bar{\alpha}^{2}\rho^{2\ell_{k}}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}-\delta\bar{\beta}^{2}\rho^{2\ell_{k}}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\|^{2}.    (25)

Now we estimate the last term in (25) in terms of $\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})$. First, by the Lipschitz continuity of $\nabla_{2}\Phi_{\epsilon}$ we have

\|\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})-\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|\leq L_{\epsilon}\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|.

Clearly this implies

\|\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\|\geq\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|-L_{\epsilon}\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|.    (26)

If the right hand side of (26) is positive, we choose $\sigma\in(0,1)$, to be determined later, and use the inequality $(a-b)^{2}\geq\frac{\sigma}{2}a^{2}-\sigma b^{2}$ together with (19) to obtain

\|\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\|^{2}\geq\frac{\sigma}{2}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}-\sigma L_{\epsilon}^{2}\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|^{2}=\frac{\sigma}{2}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}-\sigma L_{\epsilon}^{2}\bar{\alpha}^{2}\rho^{2\ell_{k}}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}.    (27)

Inserting (27) into (25), we have

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-\big(\delta\bar{\alpha}^{2}\rho^{2\ell_{k}}-\delta\bar{\beta}^{2}\rho^{2\ell_{k}}L_{\epsilon}^{2}\sigma\bar{\alpha}^{2}\rho^{2\ell_{k}}\big)\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}-\delta\bar{\beta}^{2}\rho^{2\ell_{k}}\frac{\sigma}{2}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}=-\delta\bar{\alpha}^{2}\rho^{2\ell_{k}}(1-\bar{\beta}^{2}\rho^{2\ell_{k}}L_{\epsilon}^{2}\sigma)\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}-\delta\bar{\beta}^{2}\rho^{2\ell_{k}}\frac{\sigma}{2}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}.    (28)

If $\bar{\beta}^{2}\rho^{2\ell_{k}}L_{\epsilon}^{2}<\frac{2}{3}$, we choose $\sigma=\frac{1}{2}$; otherwise we set $\sigma=1/(2\bar{\beta}^{2}\rho^{2\ell_{k}}L_{\epsilon}^{2})$. In either case the coefficient of $\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}$ is less than $-\frac{\delta}{2}\bar{\alpha}^{2}\rho^{2\ell_{k}}$. For the coefficient of $\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}$, in the first case it is less than $-\frac{\delta}{4}\bar{\beta}^{2}\rho^{2\ell_{k}}$, and in the second case it is less than $-\frac{\delta}{4L_{\epsilon}^{2}}$. It is then easy to verify that in both cases, by (24), both coefficients are less than $-\frac{\delta\min(\bar{\alpha},\bar{\beta})^{2}\rho^{2}}{4\max(\bar{\alpha},\bar{\beta})^{2}L_{\epsilon}^{2}}$. Thus we have

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-D_{\epsilon}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2},    (29)

where

D_{\epsilon}=\frac{\delta\min(\bar{\alpha},\bar{\beta})^{2}\rho^{2}}{4\max(\bar{\alpha},\bar{\beta})^{2}L_{\epsilon}^{2}}.    (30)

Consequently,

\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq\frac{4\max(\bar{\alpha},\bar{\beta})^{2}L_{\epsilon}^{2}}{\delta\min(\bar{\alpha},\bar{\beta})^{2}\rho^{2}}\{\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})-\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})\}.    (31)

If the right hand side of (26) is negative, we obtain from (19) that

\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|\leq L_{\epsilon}\|\mathbf{v}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|=L_{\epsilon}\bar{\alpha}\rho^{\ell_{k}}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|.

Consequently,

\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}=\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}+\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq(1+L_{\epsilon}^{2}\bar{\alpha}^{2}\rho^{2\ell_{k}})\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}.    (32)

Then we use (32) to estimate (25) differently:

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-\delta\bar{\alpha}^{2}\rho^{2\ell_{k}}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq-\frac{\delta\bar{\alpha}^{2}\rho^{2\ell_{k}}}{1+L_{\epsilon}^{2}\bar{\alpha}^{2}\rho^{2\ell_{k}}}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}.    (33)

If $L_{\epsilon}^{2}\bar{\alpha}^{2}\rho^{2\ell_{k}}\geq 1$, we have

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-\frac{\delta}{2L_{\epsilon}^{2}}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2};

otherwise,

\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-\frac{\delta}{2}\bar{\alpha}^{2}\rho^{2\ell_{k}}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}.

In either case we see that (29) still holds, where we used (24) for a lower bound of $\rho^{2\ell_{k}}$. Thus we have established (31).

Observing (7), (18), (12) and (31), we see that in either case, $(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})=(\mathbf{u}_{1}^{k+1},\mathbf{u}_{2}^{k+1})$ or $(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})=(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})$, there are two constants

b_{1}=\min\{a,\delta\}\quad\mbox{and}\quad b_{2}=\max\Big\{2a^{-3},\frac{4\max(\bar{\alpha},\bar{\beta})^{2}L_{\epsilon}^{2}}{\delta\min(\bar{\alpha},\bar{\beta})^{2}\rho^{2}}\Big\},    (34)

such that

\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\leq-b_{1}(\|\mathbf{x}_{1}^{k+1}-\mathbf{x}_{1}^{k}\|^{2}+\|\mathbf{x}_{2}^{k+1}-\mathbf{x}_{2}^{k}\|^{2}),    (35)

and

\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq b_{2}(\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})).    (36)

From (35) and (C2) we get

\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})<\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})<\ldots<\Phi_{\epsilon}(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})\leq\Phi(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})+1.    (37)

Furthermore, summing up (36) for $k=0,\ldots,K$ for any integer $K>0$, we have

\sum_{k=0}^{K}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq b_{2}(\Phi_{\epsilon}(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})-\Phi_{\epsilon}(\mathbf{x}_{1}^{K+1},\mathbf{x}_{2}^{K+1}))\leq b_{2}(\Phi_{\epsilon}(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})-\Phi_{*}+1),    (38)

where $\Phi_{*}$ denotes a uniform lower bound for all $\Phi_{\epsilon}$, i.e. $\min_{\mathbf{x}}\Phi_{\epsilon}(\mathbf{x})\geq\Phi_{*}$ for all $\epsilon>0$. Since $b_{2}$ is independent of $K$, we see that for any fixed $\epsilon>0$, $\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\to 0$ as $k\to\infty$. This proves the first statement of Lemma 1. For the third statement, set $\kappa:=\min\{k\in\mathbb{N};\ \|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})\|\leq\eta\}$; then $\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})\|\geq\eta$ for all $k\leq\kappa-1$. Thus from (38) we have

\kappa\eta^{2}\leq\sum_{k=0}^{\kappa-1}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1})\|^{2}=\sum_{k=1}^{\kappa}\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}\leq b_{2}(\Phi_{\epsilon}(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})-\Phi_{*}+1).

The third statement follows immediately. Lemma 1 is established. \Box

From the first statement of Lemma 1, the reduction criterion for $\epsilon_{k}$ in Line 19 of Algorithm 1 must be satisfied within finitely many iterations. Hence $\epsilon_{k}$ will eventually be small enough to terminate the algorithm. Let $k_{l}$ be the iteration counter at which the criterion in Line 19 of Algorithm 1 is met for the $l$-th time (we set $k_{0}=-1$); then we can partition the iteration counters $k=0,1,2,\dots$ into segments accordingly, so that in the $l$-th segment $k=k_{l}+1,\dots,k_{l+1}$ and $\epsilon_{k}=\epsilon_{k_{l}+1}=\epsilon_{0}\gamma^{l}$. In the next theorem we give the length of each segment, which leads to the iteration complexity of Algorithm 1 for any $\epsilon_{tol}>0$.

Theorem 6.1

Let $\Phi_{\epsilon}$ and $\Phi$ be as in Lemma 1 and let (C1)-(C4) hold for $\Phi_{\epsilon}$. Suppose $\epsilon_{tol}=0$ and $\{\mathbf{X}^{k}=(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\}$ is the sequence generated by Algorithm 1 with arbitrary initial point $\mathbf{X}^{0}=(\mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0})$. Let $\{\tilde{\mathbf{X}}^{l}\}=\{(\tilde{\mathbf{x}}_{1}^{l},\tilde{\mathbf{x}}_{2}^{l}):=(\mathbf{x}_{1}^{k_{l}+1},\mathbf{x}_{2}^{k_{l}+1})\}$ be the subsequence of $\{\mathbf{X}^{k}\}$ at which the reduction criterion for $\epsilon$ in the algorithm is met for $k=k_{l}$, $l=1,2,\ldots$. Then $\{\tilde{\mathbf{X}}^{l}\}$ has at least one accumulation point, and each accumulation point is a Clarke stationary point of the original problem. Moreover, the number of iterations $k_{l+1}-k_{l}$ of the $l$-th segment is bounded by

k_{l+1}-k_{l}\leq\Big(2a^{-3}+\frac{4\min\{\bar{\alpha},\bar{\beta}\}^{2}}{\delta\max\{\bar{\alpha},\bar{\beta}\}^{2}}L_{\epsilon_{0}\gamma^{l}}^{2}\Big)\frac{\Phi(\mathbf{X}^{0})-\Phi_{*}+1}{\sigma^{2}\epsilon_{0}^{2}\gamma^{2l+2}}.    (39)

For any $\epsilon_{tol}>0$, the total number of iterations for Algorithm 1 to terminate with $\epsilon_{tol}$ is bounded by

\sum_{l=1}^{l_{0}-1}\Big(2a^{-3}+\frac{4\min\{\bar{\alpha},\bar{\beta}\}^{2}}{\delta\max\{\bar{\alpha},\bar{\beta}\}^{2}}L_{\epsilon_{0}\gamma^{l}}^{2}\Big)\frac{\Phi(\mathbf{X}^{0})-\Phi_{*}+1}{\sigma^{2}\epsilon_{0}^{2}\gamma^{2l+2}}=O(L_{\epsilon_{tol}}^{2}\epsilon_{tol}^{-2}),    (40)

where $l_{0}$ is the number of reductions of $\epsilon$, $l_{0}-1=\frac{\log(\sigma\epsilon_{0}/\epsilon_{tol})}{\log(1/\gamma)}$, and the constant $C_{1}>0$ implicit in the $O(\cdot)$ bound depends only on $\rho$ and $\delta$.

Remark 2

In applications we usually have $L_{\epsilon}\sim\epsilon^{-1}$, which makes the bound in (40) $O(\epsilon_{tol}^{-4})$.
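For example, the smoothing $\sqrt{x_{i}^{2}+\epsilon^{2}}$ of the $\ell_{1}$ norm sketched in Section 5 has a gradient Lipschitz constant of order $\epsilon^{-1}$. Substituting $L_{\epsilon_{tol}}=O(\epsilon_{tol}^{-1})$ into (40) gives

O(L_{\epsilon_{tol}}^{2}\epsilon_{tol}^{-2})=O(\epsilon_{tol}^{-2}\cdot\epsilon_{tol}^{-2})=O(\epsilon_{tol}^{-4}).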

Remark 3

The figure below illustrates the generation of the sequence $\{\tilde{\mathbf{X}}^{l}\}$ in Theorem 6.1. The iterates highlighted in red indicate the selected sequence $\{\tilde{\mathbf{X}}^{l}\}$, where the reduction criterion for $\varepsilon_{k}$ is met at $k=k_{l}$ for $l=0,1,2,\ldots$.

[Figure: schematic of the iterates of Algorithm 1, with the selected subsequence $\{\tilde{\mathbf{X}}^{l}\}$ highlighted in red.]

Proof of Theorem 6.1: First we claim that the sequence $\tilde{\mathbf{X}}^{l}=(\tilde{\mathbf{x}}_{1}^{l},\tilde{\mathbf{x}}_{2}^{l})$ is bounded; this is largely based on (C3). In the first step, the initial point $\mathbf{X}^{0}$ corresponds to $\epsilon_{0}$, and we have $\Phi_{\epsilon_{0}}(\mathbf{X}^{1})<\Phi_{\epsilon_{0}}(\mathbf{X}^{0})$. After this we may have $\mathbf{X}^{2},\dots,\mathbf{X}^{l}$, and then $\epsilon_{1}=\gamma\epsilon_{0}$ and $\Phi_{\epsilon_{1}}(\mathbf{X}^{l+1})<\Phi_{\epsilon_{1}}(\mathbf{X}^{l})$. We do not have $\Phi_{\epsilon_{1}}(\mathbf{X}^{l+1})\leq\Phi_{\epsilon_{0}}(\mathbf{X}^{l})$ in general, but by the requirement (C3) on $\Phi_{\epsilon}$ we have

\Phi_{\epsilon_{1}}(\mathbf{X}^{l+1})+\mathfrak{m}(\epsilon_{1})\leq\Phi_{\epsilon_{0}}(\mathbf{X}^{l+1})+\mathfrak{m}(\epsilon_{0})<\Phi_{\epsilon_{0}}(\mathbf{X}^{0})+\mathfrak{m}(\epsilon_{0}).

In a similar manner we prove that $\Phi_{\epsilon_{l}}(\tilde{\mathbf{X}}^{l})+\mathfrak{m}(\epsilon_{l})$ is uniformly bounded for all $l$. Since $\mathfrak{m}$ is a non-negative continuous function and all $\epsilon_{l}\leq\epsilon_{0}$, we obtain a uniform bound for all $\Phi_{\epsilon_{l}}(\tilde{\mathbf{X}}^{l})$; moreover, by (C2) and (C3), $\Phi(\tilde{\mathbf{X}}^{l})\leq\Phi_{\epsilon_{l}}(\tilde{\mathbf{X}}^{l})+\mathfrak{m}(\epsilon_{l})$, so the coercivity of $\Phi$ in (A2) implies that $\{\tilde{\mathbf{X}}^{l}\}$ is bounded and hence has at least one accumulation point.

Next we estimate the iteration counts based on the previous lemma. If $\epsilon_{tol}\neq 0$, to estimate $k_{l+1}-k_{l}$, the number of iterations of the inner loop, note that $\epsilon=\epsilon_{k_{l}}=\epsilon_{0}\gamma^{l}$, so $\eta$ in Lemma 1 is $\eta=\sigma\gamma\epsilon_{k_{l}+1}=\sigma\epsilon_{0}\gamma^{l+1}$ and the initial point is $\mathbf{X}^{k_{l}+1}$. Then

k_{l+1}-k_{l}\leq\Big(2a^{-3}+\frac{4\min\{\bar{\alpha},\bar{\beta}\}^{2}}{\delta\max\{\bar{\alpha},\bar{\beta}\}^{2}}L_{\epsilon_{0}\gamma^{l}}^{2}\Big)\frac{\Phi(\mathbf{X}^{0})-\Phi_{*}+1}{\sigma^{2}\epsilon_{0}^{2}\gamma^{2l+2}},

and (39) is justified. Let $l_{0}$ be the number of reductions. Then $\sigma\epsilon_{0}\gamma^{l_{0}-1}\geq\epsilon_{tol}$, and thus

l_{0}-1\leq\frac{\log(\sigma\epsilon_{0}/\epsilon_{tol})}{\log(1/\gamma)}.    (41)

Based on this we have

\sum_{l=0}^{l_{0}-1}(k_{l+1}-k_{l})\leq\sum_{l=1}^{l_{0}-1}\Big(2a^{-3}+\frac{4\min\{\bar{\alpha},\bar{\beta}\}^{2}}{\delta\max\{\bar{\alpha},\bar{\beta}\}^{2}}L_{\epsilon_{0}\gamma^{l}}^{2}\Big)\frac{\Phi(\mathbf{X}^{0})-\Phi_{*}+1}{\sigma^{2}\epsilon_{0}^{2}\gamma^{2l+2}},

which yields the left hand side of (40). If we use $L_{\epsilon_{tol}}$ as an upper bound of $L_{\epsilon_{0}\gamma^{l}}$ for all $l$, the summation above is $O(L_{\epsilon_{tol}}^{2}\gamma^{-2l_{0}})$, and by the estimate of $l_{0}$ in (41) this bound is $O(L_{\epsilon_{tol}}^{2}\epsilon_{tol}^{-2})$. Finally, since $\|\nabla_{1,2}\Phi_{\epsilon_{k_{l}}}(\mathbf{X}^{k_{l}+1})\|\leq\sigma\gamma\epsilon_{k_{l}}=\sigma\epsilon_{0}\gamma^{l+1}\to 0$ as $l\to\infty$, if we choose a subsequence of $\{(\mathbf{x}_{1}^{k_{l}+1},\mathbf{x}_{2}^{k_{l}+1})\}$ converging to a point $\bar{\mathbf{x}}=(\bar{\mathbf{x}}_{1},\bar{\mathbf{x}}_{2})$, then by (C4) we have $0\in\partial^{c}\Phi(\bar{\mathbf{x}})$, and Definition 2 says $\bar{\mathbf{x}}$ is a Clarke stationary point of $\Phi$. Theorem 6.1 is established. $\Box$

6.1 Small initial steps

Finally, in this subsection we show that if the initial step sizes \bar{\alpha} and \bar{\beta} satisfy

δ<α¯Lϵ<1,δ<β¯Lϵ<1,\delta<\bar{\alpha}L_{\epsilon}<1,\quad\delta<\bar{\beta}L_{\epsilon}<1, (42)

for some δ>0\delta>0, we can improve the estimate of DϵD_{\epsilon} from (30) to DϵcLϵ1D_{\epsilon}\leq cL_{\epsilon}^{-1} for some universal constant c>0c>0.

Since under this assumption we have 1+Lϵ2α¯<0-1+\frac{L_{\epsilon}}{2}\bar{\alpha}<0 and 1+Lϵ2β¯<0-1+\frac{L_{\epsilon}}{2}\bar{\beta}<0, we can estimate the decrease of Φϵ(𝐱1k+1,𝐱2k+1)\Phi_{\epsilon}(\mathbf{x}_{1}^{k+1},\mathbf{x}_{2}^{k+1}) in terms of 1,2Φϵ(𝐱1k,𝐱2k)2\|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2} using (19),(21) and k=0\ell_{k}=0:

Φϵ(𝐯1k+1,𝐯2k+1)Φϵ(𝐱1k,𝐱2k)+(1α¯+Lϵ2)(α¯)21Φϵ(𝐱1k,𝐱2k)2\displaystyle\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})\leq\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})+(-\frac{1}{\bar{\alpha}}+\frac{L_{\epsilon}}{2})(\bar{\alpha})^{2}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}
+(1β¯+Lϵ2)(β¯)22Φϵ(𝐯1k+1,𝐱2k)2.\displaystyle+(-\frac{1}{\bar{\beta}}+\frac{L_{\epsilon}}{2})(\bar{\beta})^{2}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\|^{2}. (43)

To estimate \|\nabla_{1,2}\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{x}_{2}^{k})\| we use Lipschitz continuity and an elementary inequality to obtain (6) as before. Inserting (6) into (43) we have

Φϵ(𝐯1k+1,𝐯2k+1)Φϵ(𝐱1k,𝐱2k)\displaystyle\Phi_{\epsilon}(\mathbf{v}_{1}^{k+1},\mathbf{v}_{2}^{k+1})-\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})
\displaystyle\leq ((1α¯+Lϵ2)(α¯)2(1β¯+Lϵ2)(β¯)2σLϵ2(α¯)2)1Φϵ(𝐱1k,𝐱2k)2\displaystyle\bigg{(}(-\frac{1}{\bar{\alpha}}+\frac{L_{\epsilon}}{2})(\bar{\alpha})^{2}-(-\frac{1}{\bar{\beta}}+\frac{L_{\epsilon}}{2})(\bar{\beta})^{2}\sigma L_{\epsilon}^{2}(\bar{\alpha})^{2}\bigg{)}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}
+σ2(1β¯+Lϵ2)(β¯)22Φϵ(𝐱1k,𝐱2k)2\displaystyle\quad+\frac{\sigma}{2}(-\frac{1}{\bar{\beta}}+\frac{L_{\epsilon}}{2})(\bar{\beta})^{2}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}
=\displaystyle= c1,ϵ1Φϵ(𝐱1k,𝐱2k)2+c2,ϵ2Φϵ(𝐱1k,𝐱2k)2.\displaystyle c_{1,\epsilon}\|\nabla_{1}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}+c_{2,\epsilon}\|\nabla_{2}\Phi_{\epsilon}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|^{2}. (44)

where

c1,ϵ=(1α¯+Lϵ2)(α¯)2(1β¯+Lϵ2)(β¯)2σLϵ2(α¯)2\displaystyle c_{1,\epsilon}=(-\frac{1}{\bar{\alpha}}+\frac{L_{\epsilon}}{2})(\bar{\alpha})^{2}-(-\frac{1}{\bar{\beta}}+\frac{L_{\epsilon}}{2})(\bar{\beta})^{2}\sigma L_{\epsilon}^{2}(\bar{\alpha})^{2}
=α¯((1+Lϵ2α¯)(1+Lϵ2β¯)σ(Lϵβ¯)(Lϵα¯))\displaystyle=\bar{\alpha}\bigg{(}(-1+\frac{L_{\epsilon}}{2}\bar{\alpha})-(-1+\frac{L_{\epsilon}}{2}\bar{\beta})\sigma(L_{\epsilon}\bar{\beta})(L_{\epsilon}\bar{\alpha})\bigg{)} (45)
c2,ϵ=σ2(1β¯+Lϵ2)(β¯)2=σ2(1+β¯Lϵ2)(β¯).c_{2,\epsilon}=\frac{\sigma}{2}(-\frac{1}{\bar{\beta}}+\frac{L_{\epsilon}}{2})(\bar{\beta})^{2}=\frac{\sigma}{2}(-1+\frac{\bar{\beta}L_{\epsilon}}{2})(\bar{\beta}).

We choose \sigma to be

σ=1Lϵ2α¯2(Lϵβ¯)(Lϵα¯)(1Lϵ2β¯).\sigma=\frac{1-\frac{L_{\epsilon}}{2}\bar{\alpha}}{2(L_{\epsilon}\bar{\beta})(L_{\epsilon}\bar{\alpha})(1-\frac{L_{\epsilon}}{2}\bar{\beta})}.

Then

c_{1,\epsilon}=\frac{\bar{\alpha}}{2}\Big{(}-1+\frac{L_{\epsilon}\bar{\alpha}}{2}\Big{)},\quad c_{2,\epsilon}=\frac{-1+\frac{L_{\epsilon}\bar{\alpha}}{2}}{4L_{\epsilon}(L_{\epsilon}\bar{\alpha})}.
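For completeness, these constants follow directly from (45) and the choice of \sigma above; a short verification of the expression for c_{1,\epsilon} reads

\Big{(}-1+\frac{L_{\epsilon}}{2}\bar{\beta}\Big{)}\sigma(L_{\epsilon}\bar{\beta})(L_{\epsilon}\bar{\alpha})=\Big{(}-1+\frac{L_{\epsilon}}{2}\bar{\beta}\Big{)}\frac{1-\frac{L_{\epsilon}}{2}\bar{\alpha}}{2(1-\frac{L_{\epsilon}}{2}\bar{\beta})}=\frac{1}{2}\Big{(}-1+\frac{L_{\epsilon}}{2}\bar{\alpha}\Big{)},

so that

c_{1,\epsilon}=\bar{\alpha}\Big{(}(-1+\frac{L_{\epsilon}}{2}\bar{\alpha})-\frac{1}{2}(-1+\frac{L_{\epsilon}}{2}\bar{\alpha})\Big{)}=\frac{\bar{\alpha}}{2}\Big{(}-1+\frac{L_{\epsilon}}{2}\bar{\alpha}\Big{)};

the expression for c_{2,\epsilon} follows in the same way.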

Observing (42) we have

|c_{1,\epsilon}|+|c_{2,\epsilon}|\leq CL_{\epsilon}^{-1}

for some constant C>0 independent of \epsilon. This implies that D_{\epsilon} is bounded by CL_{\epsilon}^{-1} for some universal C>0.

7 Experiments

In this section, we examine the performance of the proposed method for joint MRI T1 and T2 image reconstruction using significantly under-sampled data. We outline the proposed variational model, explore the convergence property of the LPAM algorithm used to solve the model, and demonstrate the efficiency of the LPAM-net.

The content is organized into six subsections. The first subsection introduces the variational model, in particular the learned joint feature extractor. The second subsection verifies the assumptions required for the convergence of the algorithm. Subsection 3 describes the dataset used in the experiments and the metrics used to evaluate the reconstruction results. Subsection 4 elucidates the complete reconstruction procedure, including the initialization network and the LPAM-net. Subsection 5 examines the convergence behavior of the LPAM-net. Finally, the last subsection presents the numerical results of the experiments via comparisons with several existing methods.

7.1 Proposed model and the feature extractor

Given the under-sampled k-space data f1,f2{\textbf{f}_{1},\textbf{f}_{2}} of the modalities T1 and T2, our goal is to jointly reconstruct the corresponding images (𝐱1(\mathbf{x}_{1} and 𝐱2)\mathbf{x}_{2}). We formulate the T1-T2 joint reconstruction as the following nonconvex and nonsmooth minimization problem:

min𝐗=(𝐱1,𝐱2)T12PF𝐱1f12+12PF𝐱2f22+gθ(𝐱1,𝐱2)2,1.\min_{\mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2})^{T}}\frac{1}{2}\|\textbf{P}F\mathbf{x}_{1}-\textbf{f}_{1}\|^{2}+\frac{1}{2}\|\textbf{P}F\mathbf{x}_{2}-\textbf{f}_{2}\|^{2}+\|g_{\theta}(\mathbf{x}_{1},\mathbf{x}_{2})\|_{2,1}. (46)

where \mathbf{x}_{1}\in\mathbb{R}^{n}, \mathbf{x}_{2}\in\mathbb{R}^{n}, \textbf{f}_{1}\in\mathbb{C}^{n} and \textbf{f}_{2}\in\mathbb{C}^{n}. \textbf{f}_{1} and \textbf{f}_{2} are the k-space data, and \mathbf{g}_{\theta} is a learned joint feature extractor mapping from \mathbb{C}^{n\times 2} to \mathbb{C}^{n\times d}, where n is the spatial dimension and d is the number of channels of the output feature tensor. Here F stands for the discrete Fourier transform and \textbf{P} is the binary matrix representing the k-space mask used when acquiring the data for \mathbf{x}_{1} and \mathbf{x}_{2}. (\mathbf{x}_{1},\mathbf{x}_{2}) denotes the concatenation of the images \mathbf{x}_{1} and \mathbf{x}_{2}. In this model, the first two terms are data fidelity terms and the last term is the regularization term, in which \mathbf{g}_{\theta} is a learnable common feature extractor for \mathbf{x}_{1} and \mathbf{x}_{2} that promotes the sparsity of the common features.

More specifically, let \mathbf{X}=(\mathbf{x}_{1},\mathbf{x}_{2})^{T} denote the vector formed by concatenating the two images, of dimension 2n. Then, the regularization term is written as \|\mathbf{g}_{\theta}(\mathbf{X})\|_{2,1}. Here \mathbf{g}_{\theta,i}(\mathbf{X})\in\mathbb{C}^{d} is the i-th feature vector of \mathbf{X} for i=1,\dots,n, where n is the number of pixels in the image, and \nabla_{1,2}\mathbf{g}_{\theta}(\mathbf{X}) denotes the gradient of \mathbf{g}_{\theta} viewed as a function of 2n variables. Then,

𝐠θ(𝐗)2,1=i=1n𝐠θ,i(𝐗),\|\mathbf{g}_{\theta}(\mathbf{X})\|_{2,1}=\sum_{i=1}^{n}\|\mathbf{g}_{\theta,i}(\mathbf{X})\|,

where 𝐠\mathbf{g} is a vanilla l-layer CNN with component-wise activation function σ\sigma.

\mathbf{g}_{\theta}(\mathbf{X})=\mathbf{w}_{l}*\sigma(\cdots\sigma(\mathbf{w}_{3}*\sigma(\mathbf{w}_{2}*\sigma(\mathbf{w}_{1}*\mathbf{X})))) (47)

Here \sigma is a smoothed ReLU function defined as:

σ(x)={0,ifxδ14δx2+12x+δ4,ifδ<x<δx,ifxδ.\sigma(x)=\left\{\begin{array}[]{ll}0,&\mbox{if}\quad x\leq-\delta\\ \frac{1}{4\delta}x^{2}+\frac{1}{2}x+\frac{\delta}{4},&\mbox{if}\quad-\delta<x<\delta\\ x,&\mbox{if}\quad x\geq\delta.\end{array}\right. (48)
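For illustration, a minimal PyTorch sketch of this smoothed ReLU is given below; the class name SmoothReLU and the default value \delta=0.01 (the value used later in Section 7.4.2) are choices made for this example only.

```python
import torch
import torch.nn as nn

class SmoothReLU(nn.Module):
    """Smoothed ReLU from (48): quadratic on (-delta, delta), linear elsewhere."""
    def __init__(self, delta: float = 0.01):
        super().__init__()
        self.delta = delta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        d = self.delta
        quad = x**2 / (4 * d) + x / 2 + d / 4   # smooth transition near zero
        return torch.where(x <= -d, torch.zeros_like(x),
                           torch.where(x >= d, x, quad))
```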

We denote 𝐠θ,i(𝐗)\mathbf{g}_{\theta,i}(\mathbf{X}) as 𝐠i(𝐗)\mathbf{g}_{i}(\mathbf{X}) for simplicity in the following text and in the convergence analysis.

Let rϵ(𝐗)r_{\epsilon}(\mathbf{X}) be the smoothed approximation of the regularization term with the parameter ϵ\epsilon as follows:

rϵ(𝐗)=iI012ϵ𝐠i(𝐗)2+iI1(𝐠i(𝐗)ϵ2).r_{\epsilon}(\mathbf{X})=\sum_{i\in I_{0}}\frac{1}{2\epsilon}\|\mathbf{g}_{i}(\mathbf{X})\|^{2}+\sum_{i\in I_{1}}(\|\mathbf{g}_{i}(\mathbf{X})\|-\frac{\epsilon}{2}).

where I0={i[n]|𝐠i(𝐗)ϵ}I_{0}=\{i\in[n]|\|\mathbf{g}_{i}(\mathbf{X})\|\leq\epsilon\}, I1=[n]I0I_{1}=[n]\setminus I_{0}. Then, the gradient of the smoothed term is:

1,2rϵ(𝐗)=iI01,2𝐠i(𝐗)𝐠i(𝐗)ϵ+iI11,2𝐠i(𝐗)𝐠i(𝐗)𝐠i(𝐗).\nabla_{1,2}r_{\epsilon}(\mathbf{X})=\sum_{i\in I_{0}}\nabla_{1,2}\mathbf{g}_{i}(\mathbf{X})^{\top}\frac{\mathbf{g}_{i}(\mathbf{X})}{\epsilon}+\sum_{i\in I_{1}}\nabla_{1,2}\mathbf{g}_{i}(\mathbf{X})^{\top}\frac{\mathbf{g}_{i}(\mathbf{X})}{\|\mathbf{g}_{i}(\mathbf{X})\|}. (49)

Now the smoothed objective function Φϵ(𝐗)\Phi_{\epsilon}(\mathbf{X}) of Φ(𝐗)\Phi(\mathbf{X}) is written as

Φϵ(𝐗)=12PF𝐱1f12+12PF𝐱2f22+rϵ(𝐗).\Phi_{\epsilon}(\mathbf{X})=\frac{1}{2}\|\textbf{P}F\mathbf{x}_{1}-\textbf{f}_{1}\|^{2}+\frac{1}{2}\|\textbf{P}F\mathbf{x}_{2}-\textbf{f}_{2}\|^{2}+r_{\epsilon}(\mathbf{X}).
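To make the smoothed model concrete, the following sketch evaluates \Phi_{\epsilon}(\mathbf{X}) for a given feature extractor. It is a minimal illustration, assuming the callable g maps the two images to an n\times d feature tensor and that the mask \textbf{P} and the k-space data \textbf{f}_{1},\textbf{f}_{2} are supplied as tensors; the function names are ours, not those of the released code. The gradient (49) can then be obtained by automatic differentiation.

```python
import torch

def smoothed_regularizer(features: torch.Tensor, eps: float) -> torch.Tensor:
    """r_eps(X): Huber-type smoothing of the l_{2,1}-norm of the features.

    features: tensor of shape (n, d), one feature vector per pixel.
    """
    norms = torch.linalg.vector_norm(features, dim=-1)   # ||g_i(X)||, shape (n,)
    quad = norms**2 / (2 * eps)                           # branch for ||g_i|| <= eps
    lin = norms - eps / 2                                 # branch for ||g_i|| >  eps
    return torch.where(norms <= eps, quad, lin).sum()

def smoothed_objective(x1, x2, f1, f2, mask, g, eps):
    """Phi_eps(X) = data fidelity for both modalities + r_eps(g(x1, x2))."""
    def fidelity(x, f):
        kspace = mask * torch.fft.fft2(x)                 # P F x
        return 0.5 * torch.sum(torch.abs(kspace - f) ** 2)
    features = g(x1, x2)                                  # shape (n, d)
    return fidelity(x1, f1) + fidelity(x2, f2) + smoothed_regularizer(features, eps)
```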

7.2 Convergence Analysis

7.2.1 Theoretical proof

In this subsection, we verify that the proposed variational model (46) and its smoothing approximation satisfy the assumptions required in Lemma 1 and Theorem 1. This verification provides strong theoretical support for the convergence of the T1-T2 joint reconstruction.

Step one: Verification of 𝐂𝟏{\bf C1} and 𝐂𝟐{\bf C2}:

From the definition of rϵr_{\epsilon}, it is easy to see that both 𝐂𝟏{\bf C1} and 𝐂𝟐{\bf C2} hold.

Step two: Verification of 𝐂𝟑{\bf C3}:

In order to prove C3, we choose the continuous function \mathfrak{m}(\epsilon)=\dfrac{n}{2}\epsilon. Then, we need to prove the following inequality:

Φϵ(𝐗)+n2ϵΦδ(𝐗)+n2δfor all𝐗χ,and  0<ϵδ.\Phi_{\epsilon}(\mathbf{X})+\dfrac{n}{2}\epsilon\leq\Phi_{\delta}(\mathbf{X})+\dfrac{n}{2}\delta\,\,\mbox{for all}\,\mathbf{X}\in\chi,\,\mbox{and}\,\,0<\epsilon\leq\delta. (50)

Note that the data fidelity terms are independent of the smoothing parameter; hence it suffices to prove the following inequality:

rϵ,i(𝐗)+ϵ2rδ,i(𝐗)+δ2,for all𝐗χ,and0<ϵ<δ.r_{\epsilon,i}(\mathbf{X})+\frac{\epsilon}{2}\leq r_{\delta,i}(\mathbf{X})+\frac{\delta}{2}\quad,\mbox{for all}\quad\mathbf{X}\in\chi,\quad\mbox{and}\quad 0<\epsilon<\delta. (51)

Recall that rϵ,i(𝐗)r_{\epsilon,i}(\mathbf{X}) is defined as:

r_{\epsilon,i}(\mathbf{X}):=\left\{\begin{array}[]{ll}\frac{1}{2\epsilon}\|\mathbf{g}_{i}(\mathbf{X})\|^{2},\quad\mbox{if}\quad\|\mathbf{g}_{i}(\mathbf{X})\|\leq\epsilon,\\ \\ \|\mathbf{g}_{i}(\mathbf{X})\|-\frac{\epsilon}{2},\quad\mbox{if}\quad\|\mathbf{g}_{i}(\mathbf{X})\|>\epsilon.\end{array}\right. (52)

The proof of (51) considers three cases. In the first case, if \|\mathbf{g}_{i}(\mathbf{X})\|\geq\delta, the definition of r_{\epsilon,i}(\mathbf{X}) immediately shows that both sides of (51) are equal to \|\mathbf{g}_{i}(\mathbf{X})\|. In the second case, we suppose \epsilon<\|\mathbf{g}_{i}(\mathbf{X})\|<\delta. Then

rϵ,i(𝐗)+ϵ2=𝐠i(𝐗)=𝐠i(𝐗)22𝐠i(𝐗)+𝐠i(𝐗)2𝐠i(𝐗)22δ+δ2=rδ,i(𝐗)+δ2.r_{\epsilon,i}(\mathbf{X})+\frac{\epsilon}{2}=\|\mathbf{g}_{i}(\mathbf{X})\|=\frac{\|\mathbf{g}_{i}(\mathbf{X})\|^{2}}{2\|\mathbf{g}_{i}(\mathbf{X})\|}+\frac{\|\mathbf{g}_{i}(\mathbf{X})\|}{2}\leq\frac{\|\mathbf{g}_{i}(\mathbf{X})\|^{2}}{2\delta}+\frac{\delta}{2}=r_{\delta,i}(\mathbf{X})+\frac{\delta}{2}.

Note that the last inequality holds due to the simple fact that the function

h(t):=\frac{\|\mathbf{g}_{i}(\mathbf{X})\|^{2}}{2t}+\frac{t}{2},

is nondecreasing for t𝐠i(𝐗)t\geq\|\mathbf{g}_{i}(\mathbf{X})\|.

In the third case, if \delta>\epsilon\geq\|\mathbf{g}_{i}(\mathbf{X})\|, then

rϵ,i(𝐗)+ϵ2=12ϵ𝐠i(𝐗)2+ϵ212δ𝐠i(𝐗)2+δ2=rδ,i(𝐗)+δ2r_{\epsilon,i}(\mathbf{X})+\frac{\epsilon}{2}=\frac{1}{2\epsilon}\|\mathbf{g}_{i}(\mathbf{X})\|^{2}+\frac{\epsilon}{2}\leq\frac{1}{2\delta}\|\mathbf{g}_{i}(\mathbf{X})\|^{2}+\frac{\delta}{2}=r_{\delta,i}(\mathbf{X})+\frac{\delta}{2}

where the inequality again holds because the function h(t) defined above is nondecreasing for t\geq\|\mathbf{g}_{i}(\mathbf{X})\| and \delta>\epsilon\geq\|\mathbf{g}_{i}(\mathbf{X})\|. In each case, {\bf C3} is satisfied.
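The case analysis above can also be sanity-checked numerically. The short script below evaluates r_{\epsilon,i}+\epsilon/2 as a function of \|\mathbf{g}_{i}(\mathbf{X})\| for a decreasing sequence of smoothing parameters; it is only an illustrative check of the monotonicity required by C3, not part of the proof.

```python
import numpy as np

def huber_plus_half(norm_g: np.ndarray, eps: float) -> np.ndarray:
    """r_{eps,i}(X) + eps/2 as a function of ||g_i(X)|| (see (52))."""
    return np.where(norm_g <= eps, norm_g**2 / (2 * eps) + eps / 2, norm_g)

rng = np.random.default_rng(0)
norms = rng.uniform(0.0, 2.0, size=1000)           # random values of ||g_i(X)||
eps_values = [1.0, 0.5, 0.25, 0.1, 0.01]           # decreasing smoothing parameters

# C3 requires eps -> r_{eps,i} + eps/2 to be nondecreasing in eps.
for eps_small, eps_big in zip(eps_values[1:], eps_values[:-1]):
    assert np.all(huber_plus_half(norms, eps_small) <= huber_plus_half(norms, eps_big) + 1e-12)
print("C3 monotonicity holds on all sampled values.")
```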

Step three: Verification of 𝐂𝟒\bf C4:

Note that the iterates \mathbf{X}^{k_{l}+1} satisfy \|\nabla_{1,2}\Phi_{\epsilon_{k_{l}}}(\mathbf{X}^{k_{l}+1})\|\to 0 as l\to\infty. For simplicity, we write \{\mathbf{X}_{j+1}\} instead of \mathbf{X}^{k_{l}+1}, and let \epsilon_{j} denote the smoothing parameter used in the iteration that produces \mathbf{X}_{j+1}. Then, along a subsequence, we have \mathbf{X}_{j+1}\to\tilde{\mathbf{X}}, \epsilon_{j}\to 0 and \nabla_{1,2}\Phi_{\epsilon_{j}}(\mathbf{X}_{j+1})\to 0.

We claim that \tilde{\mathbf{X}} is a Clarke stationary point of \Phi. A similar argument to that in LDA shows that

Φ(𝐗~)={iI01,2𝐠i(𝐗~)𝐰i+iI11,2𝐠i(𝐗~)𝐠i(𝐗~)𝐠i(𝐗~)+1,2f(𝐗~)|\displaystyle\partial\Phi(\tilde{\mathbf{X}})=\{\sum_{i\in I_{0}}\nabla_{1,2}\mathbf{g}_{i}(\tilde{\mathbf{X}})^{\top}\mathbf{w}_{i}+\sum_{i\in I_{1}}\nabla_{1,2}\mathbf{g}_{i}(\tilde{\mathbf{X}})^{\top}\frac{\mathbf{g}_{i}(\tilde{\mathbf{X}})}{\|\mathbf{g}_{i}(\tilde{\mathbf{X}})\|}+\nabla_{1,2}f(\tilde{\mathbf{X}})\bigg{|} (53)
𝐰id,Π(𝐰i;𝒞(1,2𝐠i(𝐗~)))1,iI0}\displaystyle\mathbf{w}_{i}\in\mathbb{C}^{d},\|\Pi(\mathbf{w}_{i};\mathcal{C}(\nabla_{1,2}\mathbf{g}_{i}(\tilde{\mathbf{X}})))\|\leq 1,\forall i\in I_{0}\}

where I_{0}=\{i\in[n]\,|\,\|\mathbf{g}_{i}(\tilde{\mathbf{X}})\|=0\} and I_{1}=[n]\setminus I_{0}. \Pi(\mathbf{w};\mathcal{C}(A)) is the projection of \mathbf{w} onto \mathcal{C}(A), which stands for the column space of A. Then for J sufficiently large

ϵj<12min{𝐠i(𝐗~);iI1}12𝐠i(𝐗~)𝐠i(𝐗j+1),jJ,iI1\epsilon_{j}<\frac{1}{2}\min\{\|\mathbf{g}_{i}(\tilde{\mathbf{X}})\|;i\in I_{1}\}\leq\frac{1}{2}\|\mathbf{g}_{i}(\tilde{\mathbf{X}})\|\leq\|\mathbf{g}_{i}(\mathbf{X}_{j+1})\|,\quad\forall j\geq J,\quad\forall i\in I_{1}

We use the fact that min{𝐠i(𝐗~);iI1}>0\min\{\|\mathbf{g}_{i}(\tilde{\mathbf{X}})\|;i\in I_{1}\}>0 and ϵj0\epsilon_{j}\to 0 to prove the first inequality and we use 𝐗j+1𝐗~\mathbf{X}_{j+1}\to\tilde{\mathbf{X}} and the continuity of 𝐠i\mathbf{g}_{i} for all ii to prove the last inequality. Furthermore we denote

sj,i={𝐠i(𝐗j+1)ϵj,if𝐠i(𝐗j+1)ϵj,𝐠i(𝐗j+1)𝐠i(𝐗j+1),if𝐠i(𝐗j+1)>ϵj.\textbf{s}_{j,i}=\left\{\begin{array}[]{ll}\frac{\mathbf{g}_{i}(\mathbf{X}_{j+1})}{\epsilon_{j}},\quad\mbox{if}\quad\|\mathbf{g}_{i}(\mathbf{X}_{j+1})\|\leq\epsilon_{j},\\ \\ \frac{\mathbf{g}_{i}(\mathbf{X}_{j+1})}{\|\mathbf{g}_{i}(\mathbf{X}_{j+1})\|},\quad\mbox{if}\quad\|\mathbf{g}_{i}(\mathbf{X}_{j+1})\|>\epsilon_{j}.\end{array}\right.

Then, we have

1,2Φϵj(𝐗j+1)=iI01,2𝐠i(𝐗j+1)sj,i+iI11,2𝐠i(𝐗j+1)𝐠i(𝐗j+1)𝐠i(𝐗j+1)+1,2f(𝐗j+1).\nabla_{1,2}\Phi_{\epsilon_{j}}(\mathbf{X}_{j+1})=\sum_{i\in I_{0}}\nabla_{1,2}\mathbf{g}_{i}(\mathbf{X}_{j+1})^{\top}\textbf{s}_{j,i}+\sum_{i\in I_{1}}\nabla_{1,2}\mathbf{g}_{i}(\mathbf{X}_{j+1})^{\top}\frac{\mathbf{g}_{i}(\mathbf{X}_{j+1})}{\|\mathbf{g}_{i}(\mathbf{X}_{j+1})\|}+\nabla_{1,2}f(\mathbf{X}_{j+1}).

As j\to\infty, \nabla_{1,2}\Phi_{\epsilon_{j}}(\mathbf{X}_{j+1}) converges, along a further subsequence, to an element of the set in (53). Since \nabla_{1,2}\Phi_{\epsilon_{j}}(\mathbf{X}_{j+1})\to 0 and \partial\Phi(\tilde{\mathbf{X}}) is closed, we obtain 0\in\partial\Phi(\tilde{\mathbf{X}}), i.e., \tilde{\mathbf{X}} is a Clarke stationary point. We have thus validated the four conditions required for the convergence of the proposed algorithm in the application to joint reconstruction of T1 and T2 MRI images. Therefore, by Lemma 1 and Theorem 1, this joint reconstruction provides a convergent solution.

7.3 Experiment settings

The dataset used in all experiments is from the Multi-modal Brain Tumor Segmentation Challenge 2018 menze2014multimodal . Both the training set and the validation set of the challenge contain four modalities (T1, T2, FLAIR and T1c). The training set consists of scans from 285 patients and the validation set of scans from 66 patients. Each image volume has a size of 240\times 240\times 155. Each modality contains two types of gliomas: 75 volumes of low-grade gliomas (LGG) and 210 volumes of high-grade gliomas (HGG).

In our experiments, we used HGG MRI images from two modalities (T1 and T2). For each modality, we chose images scanned from 50 patients as our training set and randomly chose images scanned from 6 patients as our testing set. We cropped each 2D image to 160\times 180 in the center region, resulting in 500 training images and 60 testing images in total.

To obtain the k-space data, we used Matlab to apply the 2D fast Fourier transform to every ground truth image in the training set and the testing set. We then shifted the zero-frequency components to the central area of each image and applied radial masks with 10% and 20% under-sampling ratios. After these operations in Matlab, we obtained the under-sampled k-space data, denoted as \textbf{f}_{1} and \textbf{f}_{2}.
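The under-sampling step can be summarized by the sketch below. Our experiments used Matlab; this NumPy version only illustrates the same operations, and load_radial_mask is a hypothetical placeholder for however the 10% or 20% radial mask is generated.

```python
import numpy as np

def undersample_kspace(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Map a ground-truth image to its under-sampled k-space data.

    image: real 2D array (e.g., a 160 x 180 cropped slice).
    mask:  binary radial mask of the same shape (1 = sampled frequency).
    """
    kspace = np.fft.fft2(image)          # 2D fast Fourier transform
    kspace = np.fft.fftshift(kspace)     # move zero frequency to the center
    return mask * kspace                 # keep only the sampled locations

# Hypothetical usage for one T1/T2 slice pair:
# mask = load_radial_mask(ratio=0.10, shape=(160, 180))   # placeholder
# f1 = undersample_kspace(x1_truth, mask)
# f2 = undersample_kspace(x2_truth, mask)
```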

The Initialization network was implemented in Python using the TensorFlow framework and the LPAM-net was implemented in Python using the PyTorch framework. Our experiments were run on a Linux server with an NVIDIA A100 Tensor Core GPU available on HiPerGator. The batch size was set to 1 in all experiments in consideration of GPU memory, data volume and processing speed.

Our reconstruction results are assessed using four evaluation metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM), normalized mean squared error (NMSE) and root mean square error (RMSE). They can be computed using the following equations.

Suppose we have a reconstructed image \mathbf{X} and a ground truth image \textbf{Y}, each of dimension m\times n. We define the mean square error and peak signal-to-noise ratio as

MSE(𝐗,Y)=1mni=0m1j=0n1[𝐗(i,j)Y(i,j)]2=1mn𝐗Y22MSE\hskip 1.42262pt(\mathbf{X},\textbf{Y})=\dfrac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}[\mathbf{X}(i,j)-\textbf{Y}(i,j)]^{2}=\dfrac{1}{mn}||\mathbf{X}-\textbf{Y}||_{2}^{2} (54)
PSNR\hskip 1.42262pt(\mathbf{X},\textbf{Y})=10\cdot\log_{10}\left(\dfrac{MAX(\textbf{Y})^{2}}{MSE(\mathbf{X},\textbf{Y})}\right)

where MAX(\textbf{Y}) is the maximum possible pixel value of the ground truth image. PSNR measures the quality of a reconstructed image in comparison to its ground truth image; a higher PSNR indicates a better reconstruction.

SSIM(𝐗,Y)=(2μXμY+c1)(2σXY+c2)(μX2+μY2+c1)(σX2+σY2+c2)SSIM\hskip 1.42262pt(\mathbf{X},\textbf{Y})=\dfrac{(2\mu_{X}\mu_{Y}+c_{1})(2\sigma_{XY}+c_{2})}{(\mu_{X}^{2}+\mu_{Y}^{2}+c_{1})(\sigma_{X}^{2}+\sigma_{Y}^{2}+c_{2})}

where \mu_{X},\mu_{Y} are the pixel means of X and Y, respectively; \sigma_{X}^{2},\sigma_{Y}^{2} are the variances of X and Y, respectively, and \sigma_{XY} is their covariance. c_{1}=(k_{1}L)^{2} and c_{2}=(k_{2}L)^{2} are two constants introduced to avoid a zero denominator, and L is the dynamic range of the pixel values. The SSIM, ranging from 0 to 1, measures the similarity between two images; a higher SSIM value indicates greater similarity.

NMSE(𝐗,Y)=𝐗Y22Y22NMSE\hskip 1.42262pt(\mathbf{X},\textbf{Y})=\dfrac{||\mathbf{X}-\textbf{Y}||_{2}^{2}}{||\textbf{Y}||_{2}^{2}}

NMSE measures the mean relative error.

RMSE(𝐗,Y)=MSE(𝐗,Y)=1mn𝐗Y22RMSE\hskip 1.99168pt(\mathbf{X},\textbf{Y})=\sqrt{MSE(\mathbf{X},\textbf{Y})}=\sqrt{\dfrac{1}{mn}||\mathbf{X}-\textbf{Y}||_{2}^{2}}

RMSE is utilized to measure the difference between the reconstructed image and the ground truth.
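A minimal NumPy sketch of these metrics is given below. SSIM is omitted since, in practice, it is typically computed with an off-the-shelf implementation; the squared maximum in the PSNR follows the definition above.

```python
import numpy as np

def mse(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.mean((x - y) ** 2))

def psnr(x: np.ndarray, y: np.ndarray) -> float:
    # y is the ground truth; MAX(y) is its maximum pixel value.
    return float(10.0 * np.log10(np.max(y) ** 2 / mse(x, y)))

def nmse(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sum((x - y) ** 2) / np.sum(y ** 2))

def rmse(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(mse(x, y)))
```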

7.4 The architecture of the reconstruction process

The complete reconstruction procedure consists of the Initialization network and the LPAM-net. The architecture of the multi-phase LPAM-net follows the LPAM algorithm exactly, in the sense that each phase of the network corresponds to one iteration of the algorithm. The details are given in the following subsections.

7.4.1 Initialization network

To provide a better input for the LPAM-net, we construct an initialization network whose major component is the initialization block. It contains a 4-layer convolutional neural network (CNN) with the residual structure outlined in he2016deep . Subsequently, an inverse Fourier transform is applied to the output of the initialization block. The initialization block performs interpolation on the inputs \textbf{f}_{1} and \textbf{f}_{2}, which helps to fill in the missing components of the under-sampled data in the k-space domain. We learned the weights of the kernels of the Initialization network by minimizing the loss function L=Loss1+Loss2 below using the Adam optimizer implemented in TensorFlow:

Loss1=1mn𝐱10𝐱1truth22Loss1=\dfrac{1}{mn}||\mathbf{x}_{1}^{0}-\mathbf{x}_{1}^{truth}||_{2}^{2}
Loss2=1mn𝐱20𝐱2truth22,Loss2=\dfrac{1}{mn}||\mathbf{x}_{2}^{0}-\mathbf{x}_{2}^{truth}||_{2}^{2},

where m×nm\times n is the dimension of the image.
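A minimal sketch of one initialization block is shown below. It is written in PyTorch for consistency with the other examples in this section, although the Initialization network itself was implemented in TensorFlow; the layer widths and kernel sizes are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn

class InitBlock(nn.Module):
    """4-layer CNN with a residual connection that interpolates under-sampled
    k-space data, followed by an inverse FFT back to the image domain."""
    def __init__(self, channels: int = 32):
        super().__init__()
        # k-space data are complex; real and imaginary parts are stacked as 2 channels.
        self.cnn = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 2, 3, padding=1),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: complex under-sampled k-space data of shape (H, W)
        stacked = torch.stack([f.real, f.imag], dim=0).unsqueeze(0)  # (1, 2, H, W)
        filled = stacked + self.cnn(stacked)                         # residual interpolation
        kspace = torch.complex(filled[0, 0], filled[0, 1])
        return torch.fft.ifft2(kspace).abs()                         # initial image x^0
```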

A learning rate of 10^{-3} was used, complemented by a decay rate of 0.95 every 100 steps. Xavier initialization was utilized to initialize the kernels. Table 1 displays the quantitative assessment of the Initialization network on the 60 images in the testing set. It reports the average and the standard deviation for the outputs \mathbf{x}_{1}^{0},\mathbf{x}_{2}^{0} under the noiseless condition, corresponding to radial masks with under-sampling ratios of 10% and 20%.

Table 1: The evaluation results (mean ±\pm standard deviation) for noiseless 𝐱10\mathbf{x}_{1}^{0} and 𝐱20\mathbf{x}_{2}^{0} after the initialization network
Image PSNR SSIM NMSE
10% T1 23.146 ±\pm 0.66 0.617 ±\pm 0.034 0.0196 ±\pm 0.0035
10% T2 23.65 ±\pm 1.64 0.619 ±\pm 0.04 0.063 ±\pm 0.026
20% T1 26.517 ±\pm 0.518 0.735 ±\pm 0.031 0.009 ±\pm 0.0014
20% T2 27.032 ±\pm 1.633 0.735 ±\pm 0.031 0.029 ±\pm 0.012

7.4.2 LPAM-net for Joint Reconstruction

The architecture of the proposed LPAM-net is shown in the flowchart in Figure 1, and it follows the LPAM algorithm exactly. The inputs to the LPAM-net are the outputs of the Initialization network, denoted as \mathbf{x}_{1}^{0} and \mathbf{x}_{2}^{0}. The outputs of the final phase of the LPAM-net are the resulting reconstructed images.

Refer to caption
Figure 1: The architecture of the proposed Initialization network and the LPAM-net. Top: The proposed Initialization network; Middle: The architecture of the LPAM-net for joint reconstruction of T1 and T2 images; Bottom: The detailed illustration of the k^{th} phase of the LPAM-net.

Throughout our experiments, the regularization we used is the \ell_{2,1}-norm of the joint feature extractor described in Section 7.1. The formula for the regularization term is r(\mathbf{x}_{1},\mathbf{x}_{2})=\|\mathbf{g}_{\theta}(\mathbf{x}_{1},\mathbf{x}_{2})\|_{2,1}, and the expressions of \nabla_{1,2}r_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2}) and r_{\epsilon}(\mathbf{x}_{1},\mathbf{x}_{2}) are given in (49) and (52), respectively. We denote the feature extractor \mathbf{g}_{\theta}(\mathbf{x}_{1},\mathbf{x}_{2}) by \mathbf{g}(\mathbf{x}_{1},\mathbf{x}_{2}) or \mathbf{g} for simplicity in the following text. The joint feature extractor \mathbf{g} is parameterized as in (47) with the component-wise activation \sigma in (48). The prefixed parameter \delta in the activation function \sigma was set to 0.01. The default configuration of the feature extraction operator \mathbf{g} was set as follows: it consists of l=4 complex-kernel convolution layers. Every kernel has a dimension of 3\times 3\times 32, where 32 is the depth of the kernel, except in the first layer, where the kernel size is 3\times 3\times 2, with a depth of 2. Each layer has 32 kernels. We set the stride to 1 and used zero-padding to preserve the image size.
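Under the same caveats as the earlier sketches, the default configuration of \mathbf{g} described above can be written roughly as follows, reusing the SmoothReLU sketch given after (48); the complex-kernel convolutions are approximated here by real convolutions on the stacked images, which is a simplification of the actual implementation.

```python
import torch
import torch.nn as nn

class JointFeatureExtractor(nn.Module):
    """Sketch of the 4-layer joint feature extractor g_theta: (x1, x2) -> n x d features."""
    def __init__(self, channels: int = 32, delta: float = 0.01):
        super().__init__()
        act = SmoothReLU(delta)                             # smoothed ReLU from (48)
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, 3, padding=1), act,      # first layer: 3x3 kernels, depth 2
            nn.Conv2d(channels, channels, 3, padding=1), act,
            nn.Conv2d(channels, channels, 3, padding=1), act,
            nn.Conv2d(channels, channels, 3, padding=1),    # outermost convolution, no activation
        )

    def forward(self, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
        # x1, x2: images of shape (H, W); output: feature vectors of shape (H*W, channels)
        x = torch.stack([x1, x2], dim=0).unsqueeze(0)       # (1, 2, H, W)
        feats = self.net(x)                                 # (1, channels, H, W)
        return feats.squeeze(0).flatten(1).transpose(0, 1)  # (n, d) with n = H*W
```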

For joint reconstruction using the LPAM-net, we need to learn the step sizes \alpha_{k},\tau_{k},\beta_{k} and \gamma_{k}, the threshold \epsilon_{k}, and the weights of the kernels of the feature extractor \mathbf{g}. We consolidated all the learned parameters into a unified symbol denoted as \theta. The total number of phases K is set to 15. To enhance the stability and precision of the parameters before increasing the number of phases, we initially trained the LPAM-net with phase number K=3 for 100 epochs. After completing the training with K=3, an incremental strategy of adding 2 more phases at a time was employed, and for each K>3 we trained for 30 epochs. Throughout the training, the batch size was set to 1 and the parameters were updated after every batch.

We updated the parameters \theta by minimizing the loss function L^{\prime} using the Adam optimizer implemented in PyTorch with \beta_{1}=0.9,\beta_{2}=0.999. The learning rate was set to 10^{-4}, and the initial step sizes were set to \alpha_{0}=\beta_{0}=0.5. For \tau_{0} and \mu_{0}, we used three initial step sizes for different phases: for phases 1 to 3, \tau_{0}=\mu_{0}=2; for phases 4 to 12, \tau_{0}=\mu_{0}=1; and for phases 13 to 15, \tau_{0}=\mu_{0}=0.1. Xavier initialization was applied to initialize the kernels. The outputs of the algorithm are denoted by \mathbf{x}_{1}^{k} and \mathbf{x}_{2}^{k} and the ground truths by \mathbf{x}_{1}^{truth} and \mathbf{x}_{2}^{truth}. To calculate the loss, we converted each m\times n image into a vector of length mn. With mn the total number of pixels of an image, the loss function we minimized is L^{\prime}=Loss3+Loss4, where Loss3 and Loss4 are defined as follows:

Loss3=1mn𝐱1k𝐱1truth22+0.1×(1SSIM(𝐱1k,𝐱1truth))Loss3=\dfrac{1}{mn}||\mathbf{x}_{1}^{k}-\mathbf{x}_{1}^{truth}||_{2}^{2}+0.1\times(1-SSIM(\mathbf{x}_{1}^{k},\mathbf{x}_{1}^{truth}))
Loss4=1mn𝐱2k𝐱2truth22+0.1×(1SSIM(𝐱2k,𝐱2truth))Loss4=\dfrac{1}{mn}||\mathbf{x}_{2}^{k}-\mathbf{x}_{2}^{truth}||_{2}^{2}+0.1\times(1-SSIM(\mathbf{x}_{2}^{k},\mathbf{x}_{2}^{truth}))
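For reference, the per-modality loss above can be written compactly as in the sketch below; the SSIM term is delegated to a user-supplied callable ssim_fn, which is an assumption about tooling rather than a statement about the original code.

```python
import torch.nn.functional as F

def reconstruction_loss(x_rec, x_truth, ssim_fn, ssim_weight: float = 0.1):
    """MSE + 0.1 * (1 - SSIM), matching Loss3/Loss4 above.

    ssim_fn is any callable returning SSIM(x_rec, x_truth) in [0, 1]
    (e.g., an off-the-shelf SSIM implementation).
    """
    mse = F.mse_loss(x_rec, x_truth)        # (1/mn) * ||x_rec - x_truth||_2^2
    return mse + ssim_weight * (1.0 - ssim_fn(x_rec, x_truth))

def total_loss(x1_rec, x2_rec, x1_truth, x2_truth, ssim_fn):
    """L' = Loss3 + Loss4 for the two modalities."""
    return (reconstruction_loss(x1_rec, x1_truth, ssim_fn)
            + reconstruction_loss(x2_rec, x2_truth, ssim_fn))
```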

7.5 Convergence behavior of the LPAM-net

Refer to caption
Refer to caption
Figure 2: From left to right: The PSNR values for the reconstruction results of the T1 and T2 images with a 20% under-sampling ratio at various phases. In both figures, the LPAM-net learns the network parameters in the first 15 phases. Starting from \mathbf{X}_{16}, the reconstruction results are obtained from the same model (46) with the fixed \mathbf{g} learned by the 15-phase LPAM-net. The PSNR values for the reconstruction results of the first 15 phases are plotted in blue, and those from the 16^{th} phase onward are plotted in orange.

In the previous subsection, we proved that a subsequence of the iterates generated by the LPAM algorithm converges to a Clarke stationary point. It is therefore expected that LPAM performs stably even beyond the trained phases. To demonstrate this stability empirically, we set the parameters as follows: \epsilon_{0}=0.01, \gamma=0.9, \sigma=60000 and the number of phases K=15. We used the k-space data with a 20% under-sampling ratio for the joint reconstruction of MR T1 and T2 images. The \ell_{2}-norm of the gradient of the objective function (\|\nabla_{1,2}\Phi_{\varepsilon_{k}}(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})\|) decreases as the number of phases increases. According to the \epsilon reduction criterion specified in line 19 of the algorithm, this leads to a reduction in the smoothing factor from 0.01 to 0.0043 by phase 15. During the 15 phases, the smoothing factor steadily decreases as the number of phases increases, indicating that the smoothed objective function progressively approaches the original problem.

Furthermore, we extend the algorithm beyond 15 phases and observe the resulting values of the objective function and the PSNR of the reconstructions. To this end, after 15 phases, we set \epsilon_{tol}=0 and ran the algorithm for up to 5010 iterations with the fixed \mathbf{g} learned at the 15^{th} phase. Figures 2 and 3 illustrate the changes in the PSNR and the objective function value, respectively, as the number of iterations increases. Figure 3 is drawn based on the objective function with 0.0093 as the coefficient of the regularization part; this coefficient does not affect the convergence according to Lemma 1. As the figures show, the LPAM-net reduces the objective function value steadily in the first 15 phases. After that, the value of the objective function continues to decrease, but not as fast as in the first 15 phases. The PSNR for both T1 and T2 images after the 15^{th} phase remains stable, with a slight decrease in the first few iterations. Figure 4 visually shows the reconstructed images obtained after 15, 150, 1005 and 5010 iterations. We can see that the reconstructed images are almost identical and free of visual artifacts, with very similar PSNR for both T1 and T2 contrasts. This evidence shows that the LPAM-net maintains stable performance over extended iterations.

Refer to caption
Figure 3: Objective function value (\Phi(\mathbf{x}_{1}^{k},\mathbf{x}_{2}^{k})) of the images reconstructed with a 20% under-sampling ratio across various phase numbers K. The results of the first 15 phases are plotted in blue and those from the 16^{th} phase onward are plotted in orange.
Refer to caption Refer to caption Refer to caption Refer to caption Refer to caption
Reference Iteration 15 (41.683) Iteration 150 (41.579) Iteration 1005 (41.500) Iteration 5010 (41.499)
Refer to caption Refer to caption Refer to caption Refer to caption Refer to caption
Reference Iteration 15 (43.332) Iteration 150 (43.368) Iteration 1005 (43.373) Iteration 5010 (43.363)
Figure 4: Top row: Representative brain T1 MRI images reconstructed by the LPAM-net with a 20% under-sampling ratio in the radial mask after 15, 150, 1005 and 5010 iterations; Bottom row: Representative brain T2 MRI images reconstructed by the LPAM-net with a 20% under-sampling ratio in the radial mask after 15, 150, 1005 and 5010 iterations. PSNRs (dB) are shown in parentheses.

7.6 Results

In this subsection, we provide numerical results to showcase the enhanced image quality achieved by the proposed LPAM-net and the variational model (46). First, to show the advantage of using the joint features as the regularization term in the model, we compare the LPAM-net with the Individual-modality Reconstruction Network. Next, we compare the LPAM-net with the network induced by the BCD algorithm, which is equivalent to running only lines 10-17 and skipping lines 3-9 of the LPAM algorithm. Finally, we present the comparison of the LPAM-net with four state-of-the-art methods for joint reconstruction of MRI T1 and T2 images.

7.6.1 Comparison with Individual-modality Reconstruction Network

This experiment is designed to verify that using the joint features in the regularization term enhances the accuracy and efficiency of the reconstructions of the MRI T1 and T2 images. Since the LPAM-net takes advantage of the common features from both T1 and T2 contrasts, the proposed reconstructed images should preserve more details. To validate this advantage, we designed an Individual-modality Reconstruction Network which extracts the individual features of the T1 and T2 contrasts separately. The Individual-modality Reconstruction Network precisely adheres to the algorithm in LDA for minimizing the following problem:

min𝐱iPF𝐱ifi2+𝐠θi(𝐱i)2,1\min_{\mathbf{x}_{i}}\|\textbf{P}\textbf{F}\mathbf{x}_{i}-\textbf{f}_{i}\|^{2}+\|\mathbf{g}_{\theta_{i}}(\mathbf{x}_{i})\|_{2,1}

where \mathbf{x}_{i}\in\mathbb{R}^{n}, \textbf{f}_{i} is the under-sampled k-space data, and i=1,2. \mathbf{g}_{\theta_{i}} is the individual feature extractor from \mathbb{C}^{n} to \mathbb{C}^{n\times d} and it is designed as a 4-layer CNN. Unlike the joint reconstruction, in which we learn a joint feature extraction operator \mathbf{g}_{\theta}, in this individual reconstruction we learn two feature extraction operators \mathbf{g}_{\theta_{1}} and \mathbf{g}_{\theta_{2}} separately. We learned the weights of the kernels in the CNNs by minimizing the loss function below using the Adam optimizer implemented in PyTorch. With mn the total number of pixels of an image, the loss function we minimized is expressed as follows:

Loss=1mn𝐱ik𝐱itruth22+0.1×(1SSIM(𝐱ik,𝐱itruth))Loss=\dfrac{1}{mn}||\mathbf{x}_{i}^{k}-\mathbf{x}_{i}^{truth}||_{2}^{2}+0.1\times(1-SSIM(\mathbf{x}_{i}^{k},\mathbf{x}_{i}^{truth}))

where 𝐱ik\mathbf{x}_{i}^{k} is the output and 𝐱itruth\mathbf{x}_{i}^{truth} is the ground truth image.

We compare the T1 and T2 image reconstruction results obtained by the LPAM-net and by the Individual-modality Reconstruction Network from noiseless k-space data with a 10% under-sampling ratio. Figure 5 shows the PSNR values at phase numbers K=3,5,\dots,15 for both deep neural networks. Both networks were trained using the Adam optimizer implemented in PyTorch with \beta_{1}=0.9,\beta_{2}=0.999. The learning rate was set to 10^{-4} and the number of phases was set to 15. The average PSNR and its standard deviation are depicted in Figure 5 for each phase. At phase 3 the Individual-modality Reconstruction Network performs better, but as the phase number increases, the LPAM-net produces better reconstructions. At phase 15, the average PSNR of the LPAM-net is 0.40 dB higher for the T1 images and 1.49 dB higher for the T2 images compared to the Individual-modality Reconstruction Network. In addition, the standard deviation decreases by 0.10 dB for the T1 images and by 0.21 dB for the T2 images.

Refer to caption
Refer to caption
Figure 5: PSNR for the reconstruction result of images with a 10% under-sampling ratio across various phase number KK. Left: T1 images. Right: T2 images. In both figures, the blue line represents results for joint reconstruction via LPAM-net and the orange line represents the results from the Individual-modality Reconstruction Network. The vertical line represents the standard deviation of PSNR.

Although PSNR is a crucial measure for comparing the two networks, it is hard to visually discern the impact that the LPAM-net has on the reconstruction process. In order to visualize the differences, we added Gaussian white noise to the complex-valued under-sampled k-space data. The real and imaginary noise components are both normally distributed with the same mean and the same standard deviation, but are independent of each other. We then acquired three groups of T1 and T2 images: one set is noiseless; one set has complex-valued noise with a mean of zero and a standard deviation of 3 on both the real and imaginary parts; and the third set has complex-valued noise with a mean of zero and a standard deviation of 7 on both the real and imaginary parts. Next, we trained and tested these three groups of T1 and T2 images using both the LPAM-net and the Individual-modality Reconstruction Network. The resulting reconstructed T1 images from the three groups are presented in Figure 6 together with the ground truth reference. Figure 7 illustrates the pointwise absolute error for each group of images compared to the ground truth reference. Likewise, the reconstructed T2 images and the associated absolute errors are shown in Figure 8 and Figure 9.

Figure 6 and Figure 8 reveal that, for under-sampled data at any level of Gaussian white noise, the edges of the tissues in the images reconstructed by the LPAM-net are sharper than those in the images reconstructed by the Individual-modality Reconstruction Network. Furthermore, the absolute errors are smaller for the images reconstructed by the LPAM-net in Figure 7 and Figure 9. Table 2 and Table 3 provide a comprehensive summary of the quantitative comparison between the LPAM-net and the Individual-modality Reconstruction Network across various noise levels. The PSNR, SSIM, NMSE and RMSE are provided for images at each noise level. We can conclude that the LPAM-net generates more accurate images than the Individual-modality Reconstruction Network regardless of the noise level in the under-sampled data. The results show the advantage of using the joint feature operator as the regularization term in the reconstruction process. Additionally, the joint feature operator is more parameter-efficient.

Table 2: The results (mean ±\pm standard deviation) of T1 image reconstruction using LPAM-net and Individual-modality Reconstruction Network, with the radial mask at a 10% under-sampling ratio, are presented for different variances of Gaussian white noises added to the k-space data.
Method SD of noise PSNR SSIM NMSE RMSE #Par
Individual-modality 7 29.1356 ±\pm 0.9087 0.8688 ±\pm 0.0108 0.0051 ±\pm 0.0016 0.0351 ±\pm 0.0036 55903
LPAM-net 7 29.9429 ±\pm 0.7945 0.8852 ±\pm 0.0097 0.0042 ±\pm 0.0011 0.032 ±\pm 0.0028 56510
Individual-modality 3 31.2754 ±\pm 1.1333 0.9044 ±\pm 0.0098 0.0031 ±\pm 0.0011 0.0275 ±\pm 0.0034 55903
LPAM-net 3 31.994 ±\pm 1.01 0.9175 ±\pm 0.0091 0.0026 ±\pm 0.0008 0.0253 ±\pm 0.0028 56510
Individual-modality none 33.0252 ±\pm 1.3921 0.9251 ±\pm 0.0094 0.0021 ±\pm 0.0008 0.0226 ±\pm 0.0034 55903
LPAM-net none 33.3269 ±\pm 1.2853 0.927 ±\pm 0.0091 0.002 ±\pm 0.0007 0.0218 ±\pm 0.003 56510
Table 3: The results (mean ±\pm standard deviation) of T2 image reconstruction using LPAM-net and Individual-modality Reconstruction Network, respectively, with the radial mask at a 10% under-sampling ratio, are presented for different variances of Gaussian white noises added to the k-space data.
Method SD of noise PSNR SSIM NMSE RMSE #Par
Individual-modality 7 28.6758 ±\pm 1.0697 0.8632 ±\pm 0.016 0.0193 ±\pm 0.0067 0.0371 ±\pm 0.0046 55903
LPAM-net 7 29.8219 ±\pm 1.107 0.88 ±\pm 0.0178 0.015 ±\pm 0.0055 0.0325 ±\pm 0.0041 56510
Individual-modality 3 30.972 ±\pm 1.2514 0.9049 ±\pm 0.0149 0.0114 ±\pm 0.0040 0.0286 ±\pm 0.0041 55903
LPAM-net 3 32.1759 ±\pm 1.0952 0.9209 ±\pm 0.0147 0.0087 ±\pm 0.0031 0.0248 ±\pm 0.0031 56510
Individual-modality none 32.621 ±\pm 1.6011 0.9209 ±\pm 0.0151 0.0078 ±\pm 0.0028 0.0238 ±\pm 0.0044 55903
LPAM-net none 34.011 ±\pm 1.3859 0.9392 ±\pm 0.0132 0.0057 ±\pm 0.0021 0.0202 ±\pm 0.0032 56510
Refer to caption Refer to caption Refer to caption Refer to caption
Individual_7 (29.233) Individual_3 (31.373) Individual_non (32.933) Reference
Refer to caption Refer to caption Refer to caption Refer to caption
LPAM-net_7 (29.943) LPAM-net_3 (31.994) LPAM-net_non (33.327) Reference
Figure 6: Reconstruction results of T1 brain images by the LPAM-net and the Individual-modality Reconstruction Network employing the radial mask with a 10% under-sampling ratio. Top to bottom rows: Reconstruction results obtained by the Individual-modality Reconstruction Network and the LPAM-net, respectively. The first to the third columns: Images reconstructed from the k-spaces, where a complex noise with a standard deviation of 7, 3, and 0 (noiseless) is added, respectively. The last column: Ground truth reference. PSNRs (dB) are shown in the parentheses. The regions of interest are magnified in red boxes for better visualization.
Refer to caption Refer to caption Refer to caption
Individual_7 (29.233) Individual_3 (31.373) Individual_non (32.933)
Refer to caption Refer to caption Refer to caption
LPAM-net_7 (29.943) LPAM-net_3 (31.994) LPAM-net_non (33.327)
Figure 7: The corresponding pointwise absolute errors between the reconstructed brain T1 images in Figure 6 and the ground truth reference. The errors are scaled up by a factor of 4 for better visualization and the bright parts indicate large value.
Refer to caption Refer to caption Refer to caption Refer to caption
Individual_7 (28.989) Individual_3 (31.052) Individual_non (32.521) Reference
Refer to caption Refer to caption Refer to caption Refer to caption
LPAM-net_7 (29.822) LPAM-net_3 (32.176) LPAM-net_non (34.011) Reference
Figure 8: Reconstruction results of T2 brain images by the LPAM-net and the Individual-modality Reconstruction Network employing the radial mask with a 10% under-sampling ratio. Top to bottom rows: Reconstruction results obtained by the Individual-modality Reconstruction Network and the LPAM-net, respectively. The first to the third columns: Images reconstructed from the k-spaces, where a complex noise with a standard deviation of 7, 3, and 0 (noiseless) is added, respectively. The last column: Ground truth reference. PSNRs (dB) are shown in the parentheses. The regions of interest are magnified in red boxes for better visualization.
Refer to caption Refer to caption Refer to caption
Individual_7 (28.989) Individual_3 (31.052) Individual_non (32.521)
Refer to caption Refer to caption Refer to caption
LPAM-net_7 (29.822) LPAM-net_3 (32.176) LPAM-net_non (34.011)
Figure 9: The corresponding pointwise absolute errors between the reconstructed brain T2 images in Figure 8 and the ground truth reference. The errors are scaled up by a factor of 4 for better visualization and the bright parts indicate large value.

7.6.2 Comparison with the BCD algorithm

This experiment aims to demonstrate the effectiveness of the proposed LPAM algorithm by comparing it to the standard BCD algorithm. The LPAM algorithm uses \mathbf{u}_{k+1} in the update scheme to match the ResNet structure for better training, and \mathbf{v}_{k+1} as a safeguard to ensure convergence. The standard BCD algorithm is equivalent to removing the updates of \mathbf{u}_{k+1} and only performing the updates of \mathbf{v}_{k+1} in LPAM. Although it converges for nonconvex smooth optimization, the networks induced by the BCD algorithm do not have the ResNet structure. We conducted experiments comparing these two methods on under-sampled T1 and T2 data using the same variational model (46). Figures 10 and 11 illustrate the average PSNR and SSIM versus phase number for the radial masks with under-sampling ratios of 10% and 20%, respectively. Both the average PSNR and the average SSIM of the two algorithms increase, indicating an improvement in image quality as the phase number rises. At each phase, the PSNR and SSIM values of the LPAM algorithm exceed those of the BCD algorithm in every graph, highlighting the efficiency and accuracy of the LPAM algorithm. The compared networks were trained using the Adam optimizer implemented in PyTorch with \beta_{1}=0.9,\beta_{2}=0.999 and a learning rate of 10^{-4}.

Refer to caption
Refer to caption
Figure 10: Average PSNR and SSIM of the reconstruction results for images with a 10% under-sampling ratio obtained by the LPAM algorithm and the BCD algorithm across various phase number K. Left: T1 images. Right: T2 images.
Refer to caption
Refer to caption
Figure 11: Average PSNR and SSIM of the reconstruction results for images with a 20% under-sampling ratio obtained by the LPAM algorithm and the BCD algorithm across various phase number K. Left: T1 images. Right: T2 images.

7.6.3 Comparison with the State-of-the-Art

We compare the proposed LPAM-net with four representative methods: the X-net do2020reconstruction , the Joint Group Sparsity-based Network (JGSN) guo2023joint , ReconFormer guo2023reconformer and the Joint Cross-Attention Network (jCAN) sun2023joint . X-net, derived from the U-net, accommodates two inputs and generates two outputs; it is not a model-based method. JGSN, jCAN and ReconFormer are three model-based methods. JGSN unrolls the iterative process of the joint sparsity algorithm. jCAN deploys a Vision Transformer and a CNN in the image and k-space domains, respectively. ReconFormer designs a Recurrent Pyramid Transformer at each architecture unit. Since the original codes for X-net and JGSN are not available, we implemented X-net and JGSN according to their original papers.

Similar to the LPAM-net, X-net and JGSN update the two under-sampled modalities jointly. ReconFormer is a single-modality reconstruction network, so we reconstructed T1 and T2 images separately for comparison. jCAN restores the under-sampled target modality with guidance from a fully sampled auxiliary modality; thus, we used fully sampled T1 images to restore under-sampled T2 images and vice versa for comparison. Since some methods incorporate transformers, we adjusted the patch size and the window size to fit our image sizes. To ensure a fair evaluation, we trained each of the state-of-the-art methods for 100 epochs, and the quantitative analysis of the reconstructed images is presented in Table 4 and Table 5. Compared with the other methods, the proposed LPAM-net achieves competitive performance on both T1 and T2 images with an under-sampling ratio of 20%.


Table 4: The comparison (mean ±\pm standard deviation) between LPAM-net and the state-of-the-art methods for T1 images with an under-sampling ratio of 20%
Method PSNR SSIM NMSE RMSE #Par
X-net 32.713±\pm 0.93 0.93±\pm 0.006 0.002±\pm 0.0006 0.024±\pm 0.024 42.816M
JGSN 38.617±\pm 1.22 0.972±\pm 0.0048 0.0006±\pm 0.0002 0.012±\pm 0.0016 21192
jCAN 35.367±\pm 1.006 0.927±\pm 0.0087 0.0012±\pm 0.0004 0.017±\pm 0.0019 45.1M
ReconFormer 39.124±\pm 1.47 0.977±\pm 0.0047 0.0005±\pm 0.0002 0.011±\pm 0.0019 1.1M
LPAM-net 40.66±\pm 1.508 0.983±\pm 0.004 0.0004±\pm 0.0002 0.0094±\pm 0.0016 56510
Table 5: The comparison (mean ±\pm standard deviation) between LPAM-net and the state-of-the-art methods for T2 images with an under-sampling ratio of 20%
Method PSNR SSIM NMSE RMSE #Par
X-net 32.65±\pm 1.634 0.923±\pm 0.0106 0.0081±\pm 0.0033 0.024±\pm 0.005 42.816M
JGSN 39.3±\pm 1.413 0.975±\pm 0.0046 0.0017±\pm 0.0006 0.011±\pm 0.0017 21192
jCAN 37.583±\pm 1.513 0.964±\pm 0.0054 0.0025±\pm 0.001 0.013±\pm 0.0023 45.1M
ReconFormer 40.58±\pm 1.706 0.982±\pm 0.0051 0.0012±\pm 0.0004 0.0095±\pm 0.0018 1.1M
LPAM-net 42.536±\pm 1.527 0.987±\pm 0.0043 0.0008±\pm 0.0003 0.0076±\pm 0.0013 56510

8 Declarations

The authors declare that they have no conflict of interest.

References

  • (1) Abolghasemi, V., Ferdowsi, S., Sanei, S.: A gradient-based alternating minimization approach for optimization of the measurement matrix in compressive sensing. Signal Processing 92(4), 999–1009 (2012)
  • (2) Aggarwal, H.K., Mani, M.P., Jacob, M.: Modl: Model-based deep learning architecture for inverse problems. IEEE transactions on medical imaging 38(2), 394–405 (2018)
  • (3) Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality. Mathematics of operations research 35(2), 438–457 (2010)
  • (4) Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized gauss–seidel methods. Mathematical Programming 137(1-2), 91–129 (2013)
  • (5) Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM journal on Optimization 23(4), 2037–2060 (2013)
  • (6) Bertsekas, D., Tsitsiklis, J.: Parallel and distributed computation: numerical methods. Athena Scientific (2015)
  • (7) Bian, W., Zhang, Q., Ye, X., Chen, Y.: A learnable variational model for joint multimodal mri reconstruction and synthesis. Lecture Notes in Computer Science 13436, 354–364 (2022). DOI https://doi.org/10.1007/978-3-031-16446-0˙34
  • (8) Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming 146(1-2), 459–494 (2014)
  • (9) Charbonnier, P., Blanc-Féraud, L., Aubert, G., Barlaud, M.: Deterministic edge-preserving regularization in computed imaging. IEEE Transactions on image processing 6(2), 298–311 (1997)
  • (10) Chen, Y., Liu, H., Ye, X., Zhang, Q.: Learnable descent algorithm for nonsmooth nonconvex image reconstruction. SIAM Journal on Imaging Sciences 14(4), 1532–1564 (2021). DOI 10.1137/20M1353368. URL https://doi.org/10.1137/20M1353368
  • (11) Chun, I.Y., Fessler, J.A.: Convolutional analysis operator learning: Acceleration and convergence. IEEE Transactions on Image Processing 29, 2108–2122 (2020)
  • (12) Chun, I.Y., Huang, Z., Lim, H., Fessler, J.: Momentum-net: Fast and convergent iterative neural network for inverse problems. IEEE transactions on pattern analysis and machine intelligence (2020)
  • (13) Chun, I.Y., Zheng, X., Long, Y., Fessler, J.A.: Bcd-net for low-dose ct reconstruction: Acceleration, convergence, and generalization. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22, pp. 31–40. Springer (2019)
  • (14) Chun, Y., Fessler, J.A.: Deep bcd-net using identical encoding-decoding cnn structures for iterative image recovery. In: 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pp. 1–5. IEEE (2018)
  • (15) Ding, C., Zhang, Q., Wang, G., Ye, X., Chen, Y.: Learned alternating minimization algorithm for dual-domain sparse-view ct reconstruction. arXiv preprint arXiv:2306.02644 (2023)
  • (16) Do, W.J., Seo, S., Han, Y., Ye, J.C., Choi, S.H., Park, S.H.: Reconstruction of multicontrast mr images through deep learning. Medical physics 47(3), 983–997 (2020)
  • (17) Guo, D., Zeng, G., Fu, H., Wang, Z., Yang, Y., Qu, X.: A joint group sparsity-based deep learning for multi-contrast mri reconstruction. Journal of Magnetic Resonance 346, 107354 (2023)
  • (18) Guo, P., Mei, Y., Zhou, J., Jiang, S., Patel, V.M.: Reconformer: Accelerated mri reconstruction using recurrent transformer. IEEE transactions on medical imaging (2023)
  • (19) Hardt, M.: Understanding alternating minimization for matrix completion. In: 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pp. 651–660. IEEE (2014)
  • (20) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)
  • (21) Jain, P., Netrapalli, P., Sanghavi, S.: Low-rank matrix completion using alternating minimization. In: Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pp. 665–674 (2013)
  • (22) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  • (23) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. European conference on computer vision, pp. 630–645 (2016). DOI 10.1007/978-3-319-46493-0_38
  • (24) Kim, Y., Jung, H., Min, D., Sohn, K.: Deeply aggregated alternating minimization for image restoration. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6419–6427 (2017)
  • (25) Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., et al.: The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging 34(10), 1993–2024 (2014)
  • (26) Nesterov, Y., et al.: Lectures on convex optimization, vol. 137. Springer (2018)
  • (27) Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. Advances in Neural Information Processing Systems 26 (2013)
  • (28) Palomar, D.P., Eldar, Y.C.: Convex optimization in signal processing and communications. Cambridge university press (2010)
  • (29) Peters, S.W., Heath, R.W.: Interference alignment via alternating minimization. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 2445–2448. IEEE (2009)
  • (30) Pock, T., Sabach, S.: Inertial proximal alternating linearized minimization (ipalm) for nonconvex and nonsmooth problems. SIAM Journal on Imaging Sciences 9(4), 1756–1787 (2016). DOI 10.1137/16M1064064
  • (31) Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. Ussr computational mathematics and mathematical physics 4(5), 1–17 (1964)
  • (32) Sroubek, F., Milanfar, P.: Robust multichannel blind deconvolution via fast alternating minimization. IEEE Transactions on Image processing 21(4), 1687–1700 (2011)
  • (33) Sun, J., Li, H., Xu, Z., et al.: Deep admm-net for compressive sensing mri. Advances in neural information processing systems 29 (2016)
  • (34) Sun, K., Wang, Q., Shen, D.: Joint cross-attention network with deep modality prior for fast mri reconstruction. IEEE Transactions on Medical Imaging (2023)
  • (35) Zhang, Q., Alvandipour, M., Xia, W., Zhang, Y., Ye, X., Chen, Y.: Provably convergent learned inexact descent algorithm for low-dose ct reconstruction (2021). DOI 10.48550/ARXIV.2104.12939. URL https://arxiv.org/abs/2104.12939