A New Inexact Proximal Linear Algorithm with Adaptive Stopping Criteria for Robust Phase Retrieval

Zhong Zheng, Shiqian Ma, and Lingzhou Xue¹

¹ Z. Zheng and L. Xue are with the Department of Statistics at Penn State University, University Park, PA 16802. Research is partly supported by NSF grants DMS-1953189, CCF-2007823, and DMS-2210775. Email: {zvz5337, lzxue}@psu.edu. S. Ma is with the Department of Computational Applied Mathematics and Operations Research at Rice University, Houston, TX 77005. Research is supported in part by NSF grants DMS-2243650, CCF-2308597, CCF-2311275, and ECCS-2326591, and a startup fund from Rice University. Email: [email protected]. S. Ma and L. Xue are co-corresponding authors.
Abstract

This paper considers the robust phase retrieval problem, which can be cast as a nonsmooth and nonconvex optimization problem. We propose a new inexact proximal linear algorithm in which the subproblem is solved inexactly. Our contributions are two adaptive stopping criteria for the subproblem. The convergence behavior of the proposed methods is analyzed. Through experiments on both synthetic and real datasets, we demonstrate that our methods are much more efficient than existing methods, such as the original proximal linear algorithm and the subgradient method.

Keywords— Robust Phase Retrieval, Nonconvex and Nonsmooth Optimization, Proximal Linear Algorithm, Complexity, Sharpness.

1 Introduction

Phase retrieval aims to recover a signal from intensity-based or magnitude-based measurements. It finds various applications in different fields, including X-ray crystallography [1], optics [2], array and high-power coherent diffractive imaging [3], astronomy [4], and microscopy [5]. Mathematically, phase retrieval tries to find the true signal vectors $x_{\star}$ or $-x_{\star}$ in $\mathbb{R}^{n}$ from a set of magnitude measurements:

b_{i}=(a_{i}^{\top}x_{\star})^{2},\quad\text{for }i=1,2,\ldots,m, \qquad (1)

where $a_{i}\in\mathbb{R}^{n}$ and $b_{i}\geq 0$ for $i=1,2,\ldots,m$. Directly solving the equations leads to an NP-hard problem [6], and nonconvex algorithms based on different designs of objective functions have been well studied in the literature, including Wirtinger flow [7], truncated Wirtinger flow [8], truncated amplitude flow [9], reshaped Wirtinger flow [10], etc.

In this paper, we focus on the robust phase retrieval (RPR) problem [11], which considers the case where $b_{i}$ contains noise due to measurement errors of the equipment. That is,

b_{i}=\begin{cases}(a_{i}^{\top}x_{\star})^{2},&i\in\mathbb{I}_{1},\\ \xi_{i},&i\in\mathbb{I}_{2},\end{cases} \qquad (2)

in which $\mathbb{I}_{1}\cup\mathbb{I}_{2}=\{1,2,\ldots,m\}$, $\mathbb{I}_{1}\cap\mathbb{I}_{2}=\emptyset$, and $\xi_{i}$ denotes random noise. [11] proposed to formulate RPR as the following optimization problem:

\min_{x\in\mathbb{R}^{n}}F(x):=\frac{1}{m}\sum_{i=1}^{m}\left|(a_{i}^{\top}x)^{2}-b_{i}\right|. \qquad (3)

It is demonstrated in [11] that using (3) for RPR possesses better recoverability than the median truncated Wirtinger flow algorithm [12] based on the $\ell_{2}$-loss.
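As a concrete illustration of the objective in (3), here is a minimal NumPy sketch (ours, not the authors' implementation); the array names A, b, and x are illustrative.

```python
import numpy as np

def robust_phase_retrieval_loss(A, b, x):
    """F(x) = (1/m) * sum_i | (a_i^T x)^2 - b_i |, the ell_1 loss in (3)."""
    m = A.shape[0]
    return np.abs((A @ x) ** 2 - b).sum() / m

# Example: a noiseless instance, where F(x_star) = 0.
rng = np.random.default_rng(0)
m, n = 200, 20
A = rng.standard_normal((m, n))
x_star = rng.choice([-1.0, 1.0], size=n)
b = (A @ x_star) ** 2
print(robust_phase_retrieval_loss(A, b, x_star))  # 0.0 up to rounding
```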

Solving (3) is challenging because it is a nonconvex and nonsmooth optimization problem. In [13], the authors proposed a subgradient method to solve it. This method requires a geometrically decaying step size, and it is unclear how to schedule such a step size in practice. [11] proposed to use the proximal linear (PL) algorithm to solve (3). For ease of presentation, we rewrite (3) as

\min_{x\in\mathbb{R}^{n}}F(x)=h(c(x))=\frac{1}{m}\left\||Ax|^{2}-b\right\|_{1}, \qquad (4)

where $Ax=[\langle a_{1},x\rangle,\ldots,\langle a_{m},x\rangle]^{\top}$, $b=[b_{1},\ldots,b_{m}]^{\top}$, $h(z):=\frac{1}{m}\|z\|_{1}$, and $c(x):=|Ax|^{2}-b$ is a smooth map in which $|\cdot|^{2}$ denotes the element-wise square. One typical iteration of the PL algorithm is

x^{k+1}\approx\operatorname*{arg\,min}_{x\in\mathbb{R}^{n}}\ F_{t}(x;x^{k}), \qquad (5)

where $t>0$ is the step size,

F(z;y):=h(c(y)+\nabla c(y)(z-y)), \qquad (6)
F_{t}(z;y):=F(z;y)+\frac{1}{2t}\|z-y\|_{2}^{2}, \qquad (7)

$\nabla c$ denotes the Jacobian of $c$, and "$\approx$" means that the subproblem is solved inexactly. The subproblem (5) is convex and can be solved by various methods, such as the proximal operator graph splitting (POGS) algorithm used in [11]. The PL method has drawn a lot of attention recently. It has been studied in [14, 15, 16] and applied to many important applications, such as RPR [11], robust matrix recovery [17, 18], and sparse spectral clustering [19]. In practice, the subproblem (5) is usually solved inexactly. As pointed out in [11], the PL algorithm implemented in [11] is much slower than the median truncated Wirtinger flow method. We found that this is mainly due to their stopping criterion for solving the subproblem (5), which unnecessarily solves (5) to very high accuracy in the early stage of the algorithm. Moreover, we found that the POGS algorithm used in [11] is ineffective for solving the subproblem (5). In this paper, we propose adaptive stopping criteria for inexactly solving (5) with the fast iterative shrinkage-thresholding algorithm (FISTA) [20, 21, 22]. We found that our new inexact PL (IPL) algorithm with the adaptive stopping criteria greatly outperforms existing implementations of PL methods [11] for solving RPR (4).
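To make the linearized model explicit, here is a small NumPy sketch (ours) of (6) and (7); it assumes a dense measurement matrix A and uses the Jacobian $\nabla c(y)=2\,\mathrm{diag}(Ay)A$.

```python
import numpy as np

def model_F(A, b, y, z):
    """F(z; y) = h(c(y) + grad_c(y)(z - y)) with h(u) = ||u||_1 / m, as in (6)."""
    m = A.shape[0]
    Ay = A @ y
    linearized_c = (Ay ** 2 - b) + 2.0 * Ay * (A @ (z - y))  # c(y) + grad_c(y)(z - y)
    return np.abs(linearized_c).sum() / m

def model_Ft(A, b, y, z, t):
    """F_t(z; y) = F(z; y) + ||z - y||_2^2 / (2 t), the subproblem objective in (5)."""
    return model_F(A, b, y, z) + np.dot(z - y, z - y) / (2.0 * t)
```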

Our Contributions. In this paper, we propose two new adaptive stopping criteria for inexactly solving (5). The first one ensures that (5) is solved to a relatively low accuracy:

(LACC)\quad F_{t}(x^{k+1};x^{k})-\min_{x\in\mathbb{R}^{n}}F_{t}(x;x^{k})\leq\rho_{l}\left(F_{t}(x^{k};x^{k})-F_{t}(x^{k+1};x^{k})\right),\quad\rho_{l}>0, \qquad (8)

and the second one ensures that (5) is solved to a relatively high accuracy:

(HACC)\quad F_{t}(x^{k+1};x^{k})-\min_{x\in\mathbb{R}^{n}}F_{t}(x;x^{k})\leq\frac{\rho_{h}}{2t}\|x^{k+1}-x^{k}\|_{2}^{2},\quad 0<\rho_{h}<1/4. \qquad (9)

Here, $\rho_{l}$ and $\rho_{h}$ are given constants. Similar to the proximal bundle method [23] for nonsmooth convex problems, (LACC) and (HACC) are designed to ensure a sufficient decrease of the objective function for the nonsmooth and nonconvex RPR problem. Note that both (LACC) and (HACC) are only used theoretically, because $\min_{x\in\mathbb{R}^{n}}F_{t}(x;x^{k})$ is not available. Later we will propose more practical stopping criteria that can guarantee (LACC) and (HACC). The connections of our approach to existing work are listed below.

  • (a)

    Our (LACC) condition coincides with the inexact stopping criterion proposed in [24, 25, 26] for the proximal gradient method. In these papers, the authors focus on a different optimization problem

    \min_{x\in\mathbb{R}^{n}}f_{0}(x):=f_{1}(x)+f_{2}(x),

    in which $f_{1}$ is a smooth function and $f_{2}$ is a proper, convex, and lower semi-continuous function. One typical iteration of their algorithms can be written as

    y^{k+1}\approx\operatorname*{arg\,min}_{x\in\mathbb{R}^{n}}f_{0k}(x):=f_{1}(x^{k})+(x-x^{k})^{\top}\nabla f_{1}(x^{k})+f_{2}(x)+\frac{1}{2}(x-x^{k})^{\top}H_{k}(x-x^{k}), \qquad (10a)
    x^{k+1}=x^{k}+\lambda_{k}(y^{k+1}-x^{k}), \qquad (10b)

    where $H_{k}\in\mathbb{R}^{n\times n}$ is a positive semi-definite matrix and $\lambda_{k}\in[0,1]$ is a step size. The stopping criterion for inexactly solving (10a) proposed in [24, 25, 26] is

    f_{0k}(y^{k+1})-\tilde{f}_{0k}\leq\eta\left(f_{0k}(x^{k})-\tilde{f}_{0k}\right), \qquad (11)

    where $\tilde{f}_{0k}=\min_{x\in\mathbb{R}^{n}}f_{0k}(x)$ and $\eta\in(0,1)$. We note that this is the same as our (LACC). Therefore, our (LACC) is essentially an extension of (11) from the proximal gradient method to the proximal linear method.

  • (b)

    To the best of our knowledge, our (HACC) criterion is new and serves as a good alternative to (LACC). From our numerical experiments, we found that (HACC) works comparably with (LACC), and we believe that it can be useful for other applications.

  • (c)

    We analyze the overall complexity and the local convergence of our IPL algorithm for solving RPR under the sharpness condition. To the best of our knowledge, this is the first time such results have been obtained under the sharpness condition.

  • (d)

    We propose to solve (5) inexactly using FISTA [20, 21, 22], which uses easily verifiable stopping conditions that can guarantee (LACC) and (HACC). Through extensive numerical experiments, we demonstrate that our IPL with the new stopping criteria significantly outperforms existing algorithms for solving RPR.

Organization. The rest of this paper is organized as follows. In Section 2, we propose the main framework of our inexact proximal linear algorithm with two new adaptive stopping criteria for the subproblem. We establish its iteration complexity for obtaining an $\epsilon$-stationary point and its local convergence under the sharpness condition. Connections with some existing methods are also discussed. In Section 3, we discuss how to adapt FISTA to solve the subproblem inexactly. We also establish the overall complexity of FISTA – the total number of FISTA iterations – required to obtain an $\epsilon$-optimal solution under the sharpness condition. In Section 4, we show the numerical results on both synthetic and real datasets to demonstrate the advantage of the proposed methods over some existing methods. The proofs of all the theorems and lemmas are given in Section 5. Finally, we include some concluding remarks in Section 6.

2 IPL and Its Convergence Analysis

In this section, we introduce our IPL algorithm for solving the RPR (4) with the inexact stopping criteria (LACC) and (HACC) for the subproblem (5) and analyze its convergence. We will discuss the FISTA for solving (5) that guarantees (LACC) and (HACC) in the next section.

We first follow [15] to introduce some notation. Let

S_{t}(y):=\operatorname*{arg\,min}_{x\in\mathbb{R}^{n}}\ F_{t}(x;y),
\varepsilon(x;y):=F_{t}(x;y)-F_{t}(S_{t}(y);y),

where $F_{t}(z;y)$ is defined in (7). We will also use the notation

L=\frac{2}{m}\|A\|_{2}^{2}=\frac{2}{m}\left\|\sum_{i=1}^{m}a_{i}a_{i}^{\top}\right\|_{2}.

A meta algorithm of our IPL is summarized in Algorithm 1. We again emphasize that Algorithm 1 cannot be implemented because $\min_{x\in\mathbb{R}^{n}}F_{t}(x;x^{k})$ is not available, and we will discuss practical versions of it in the next section.

Algorithm 1 IPL – A Meta Algorithm
  Input: Initial point $x^{0}$, step size $t=1/L$, parameters $\rho_{l}>0$ and $\rho_{h}\in(0,1/4)$
  for $k=0,1,\ldots$ do
     Obtain $x^{k+1}$ by inexactly solving (5) with one of the following stopping criteria:
      Option 1: (LACC), i.e., (8)
      Option 2: (HACC), i.e., (9)
  end for
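The control flow of Algorithm 1 is simple; the schematic below (ours) makes it explicit, with solve_subproblem standing in for any inexact solver that certifies (LACC) or (HACC), such as the FISTA-based routine of Section 3.

```python
def ipl_meta(x0, t, solve_subproblem, num_iters):
    """Inexact proximal linear meta-loop (Algorithm 1).

    solve_subproblem(xk, t) must return an approximate minimizer of
    F_t(.; xk) whose accuracy satisfies (LACC) or (HACC)."""
    x = x0
    for _ in range(num_iters):
        x = solve_subproblem(x, t)
    return x
```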

2.1 Convergence under General Settings

In this subsection, we analyze the convergence rate of IPL (Algorithm 1) for obtaining an $\epsilon$-stationary point of (4) under general settings in which the sharpness condition may not hold. We use the definition of an $\epsilon$-stationary point introduced in [15].

Definition 1.

We call $\tilde{x}$ an $\epsilon$-stationary point of (4) if the following inequality holds:

\|\mathcal{G}_{t}(\tilde{x})\|_{2}\leq\epsilon, \qquad (12)

where $\mathcal{G}_{t}(x)$ is the proximal gradient, defined as

\mathcal{G}_{t}(x)=t^{-1}\left(x-S_{t}(x)\right). \qquad (13)

Our convergence rate result of Algorithm 1 is given in Theorem 1, and the proof is given in Section 5.

Theorem 1.

Denote $F^{\star}=\inf_{x\in\mathbb{R}^{n}}F(x)$. For Algorithm 1 with $t=1/L$, the following conclusions hold.

  • (a)

    When (LACC) holds with $\rho_{l}>0$ for every $k\in\mathbb{N}$, we can find an $\epsilon$-stationary point in

    \left\lfloor\frac{2(1+\rho_{l})(F(x^{0})-F^{\star})}{t\epsilon^{2}}\right\rfloor

    iterations for any $\epsilon>0$.

  • (b)

    When (HACC) holds with $0<\rho_{h}<1/4$ for every $k\in\mathbb{N}$, we can find an $\epsilon$-stationary point in

    \left\lfloor\frac{2(1-\sqrt{\rho_{h}})^{2}(F(x^{0})-F^{\star})}{(1-2\sqrt{\rho_{h}})t\epsilon^{2}}\right\rfloor

    iterations for any $\epsilon>0$.

Theorem 1 shows that IPL finds an $\epsilon$-stationary point in $O(1/\epsilon^{2})$ main iterations under the adaptive stopping conditions. Moreover, Theorem 1 matches the best known convergence rate for PL in [15]. We should point out that we use two adaptive stopping criteria for the subproblem, whereas [15] requires solving the subproblem (5) exactly (see their Proposition 3) or using pre-determined subproblem accuracy conditions (see their Theorem 5.2).

2.2 Local Convergence under Sharpness Assumption

In this subsection, we analyze the local convergence of IPL (Algorithm 1) to the global optimal solution under the sharpness condition.

Assumption 1 (Sharpness).

There exists a constant $\lambda_{s}>0$ such that the following inequality holds for any $x\in\mathbb{R}^{n}$:

F(x)-F(x_{\star})\geq\lambda_{s}\Delta(x), \qquad (14)

where $\Delta(x):=\min\{\|x-x_{\star}\|_{2},\|x+x_{\star}\|_{2}\}$.
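Since the signal is only identifiable up to sign, distances are always measured through $\Delta(x)$; a one-line helper (ours) is:

```python
import numpy as np

def delta(x, x_star):
    """Delta(x) = min(||x - x_star||_2, ||x + x_star||_2) from Assumption 1."""
    return min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))
```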

[11] proved that the sharpness condition (Assumption 1) is satisfied by the RPR (4) with high probability under certain mild conditions.

Another assumption is about the closeness between the initial point and the optimal solution, which can be guaranteed by the modified spectral initialization (see Algorithm 3 in [11]) with high probability under some mild conditions.

Assumption 2.

Under Assumption 1, we assume that the initial point $x^{0}$ in Algorithm 1 satisfies the following inequalities.

  1. (a)

    If (LACC) is chosen in Algorithm 1, then we assume $x^{0}$ satisfies

    F(x^{0})-F(x_{\star})\leq\lambda_{s}^{2}/(2L). \qquad (15)
  2. (b)

    If (HACC) is chosen in Algorithm 1, then we assume $x^{0}$ satisfies

    \Delta(x^{0})\leq\frac{\lambda_{s}(1-4\rho_{h})}{2(1-3\rho_{h})L}. \qquad (16)

We now define an $\epsilon$-optimal solution to RPR (4).

Definition 2.

We call $\bar{x}$ an $\epsilon$-optimal solution to RPR (4) if $\Delta(\bar{x})\leq\epsilon$.

We are now ready to show in Theorem 2 that, in terms of the number of main iterations, (LACC) leads to local linear convergence and (HACC) leads to local quadratic convergence.

Theorem 2.

Let $t=\frac{1}{L}$ and suppose that Assumption 1 holds. For the sequence $\{x^{k}\}_{k=0}^{\infty}$ generated by Algorithm 1, we have the following conclusions.

  1. (a)

    (Low Accuracy) When (15) holds and (8) holds with $\rho_{l}>0$ for every $k\in\mathbb{N}$, we have

    \Delta(x^{k})\leq\frac{F(x^{0})-F(x_{\star})}{\lambda_{s}}\left(\frac{1+4\rho_{l}}{2+4\rho_{l}}\right)^{k},\qquad\forall k\in\mathbb{N}.
  2. (b)

    (High Accuracy) When (16) holds and (9) holds with $0<\rho_{h}<1/4$ for every $k\in\mathbb{N}$, we have

    \Delta(x^{k})\leq\frac{\lambda_{s}(1-4\rho_{h})}{L(1-3\rho_{h})}\zeta^{2^{k}},\qquad\forall k\in\mathbb{N},

    where $\zeta:=\frac{L\Delta(x^{0})(1-3\rho_{h})}{\lambda_{s}(1-4\rho_{h})}$.

Theorem 2 shows that, with a good initialization, using (LACC) finds an $\epsilon$-optimal solution to (4) within $O(\log\frac{1}{\epsilon})$ iterations, which is a linear rate, and using (HACC) finds an $\epsilon$-optimal solution to (4) within $O(\log\log\frac{1}{\epsilon})$ iterations, which is a quadratic rate.

2.3 Related Work

There are two closely related works that need to be discussed here. [11] studied the PL algorithm for solving RPR (4) and established its local quadratic convergence under the sharpness condition. However, their theoretical analysis requires the subproblem (5) to be solved exactly. In practice, [11] proposed to use POGS [27], a variant of the alternating direction method of multipliers (ADMM), to solve (5) inexactly. However, they did not provide any convergence analysis for the algorithm when the subproblem (5) is solved inexactly by POGS. [15] also considered solving (4) for obtaining an $\epsilon$-stationary point as defined in Definition 1. (The authors of [15] actually considered solving a more general problem $\min_{x}g(x)+h(c(x))$; here, for simplicity, we assume that $g=0$, which does not affect the discussion.) Indeed, several algorithms were proposed and analyzed in [15]. In particular, a practical algorithm proposed by [15] uses FISTA [20, 21, 22] to inexactly solve

\min_{\lambda\in\mathbb{R}^{m}}\phi_{k,\nu}(\lambda)=\frac{t}{2}\|x^{k}/t-\nabla c(x^{k})^{\top}\lambda\|_{2}^{2}-\lambda^{\top}(c(x^{k})-\nabla c(x^{k})x^{k})+(h_{\nu})^{\star}(\lambda),

which is the dual problem of a smoothed version of (5). Here $(h_{\nu})^{\star}(\cdot)$ is the Fenchel conjugate of $h_{\nu}$, and $h_{\nu}$ is the Moreau envelope of $h$, defined as

h_{\nu}(\lambda)=\inf_{\lambda^{\prime}\in\mathbb{R}^{m}}h(\lambda^{\prime})+\frac{1}{2\nu}\|\lambda^{\prime}-\lambda\|_{2}^{2}.

[15] proposed to terminate FISTA when $\mathrm{dist}(0;\partial\phi_{k,\nu}(\lambda^{k+1}))\leq\frac{1}{L_{h}(k+1)^{2}}$ and then update $x^{k}$ by $x^{k+1}=x^{k}-t\nabla c(x^{k})^{\top}\lambda^{k+1}$. The authors established the overall complexity of this algorithm for suitably chosen parameters $t$ and $\nu$. Compared to [15], we use adaptive stopping criteria and provide a better convergence rate based on Assumption 1.

3 FISTA for Solving the Subproblem Inexactly

In this section, we propose to use the FISTA to inexactly solve (5) with more practical stopping criteria that guarantee (LACC) and (HACC). Therefore, the convergence results (Theorems 1 and 2) in Section 2 still apply here.

For simplicity, we let $t=1/L$ throughout this section and rewrite (5) as follows:

\min_{z\in\mathbb{R}^{n}}\ H_{k}(z)=\frac{1}{2t}\|z\|_{2}^{2}+\|B_{k}z-d_{k}\|_{1}, \qquad (17)

where $z=x-x^{k}$, $B_{k}=\frac{2}{m}\mathrm{diag}(Ax^{k})A$, and $d_{k}=\frac{1}{m}\left(b-(Ax^{k})^{2}\right)$. As a result, (LACC) and (HACC) can be rewritten respectively as

H_{k}(z_{k})-\min_{z\in\mathbb{R}^{n}}H_{k}(z)\leq\rho_{l}\left(H_{k}(0)-H_{k}(z_{k})\right),\quad\rho_{l}\geq 0, \qquad (18)

and

H_{k}(z_{k})-\min_{z\in\mathbb{R}^{n}}H_{k}(z)\leq\frac{\rho_{h}}{2t}\|z_{k}\|_{2}^{2},\quad 0\leq\rho_{h}<1/4. \qquad (19)

In IPL, we set $x^{k+1}=x^{k}+z_{k}$, where $z_{k}$ satisfies either (18) or (19). The dual problem of (17) is

\max_{\lambda\in\mathbb{R}^{m},\|\lambda\|_{\infty}\leq 1}D_{k}(\lambda)=-\frac{t}{2}\left\|B_{k}^{\top}\lambda\right\|_{2}^{2}-\lambda^{\top}d_{k}. \qquad (20)

From weak duality, we know that

D_{k}(\lambda)\leq H_{k}(z),\quad\forall z\in\mathbb{R}^{n}\ \text{and}\ \|\lambda\|_{\infty}\leq 1. \qquad (21)

Therefore, $D_{k}(\lambda)$ can serve as a lower bound for $\min_{z}H_{k}(z)$, and we can obtain verifiable stopping criteria that are sufficient conditions for (18) and (19). This leads to our inexact FISTA for solving (17), which is summarized in Algorithm 2. Here we define $z_{k}(\lambda)=-tB_{k}^{\top}\lambda$.
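For concreteness, the following sketch (ours; dense A assumed) builds the subproblem data $(B_{k},d_{k})$, evaluates $H_{k}$, $D_{k}$, and $z_{k}(\lambda)$, and checks the duality-gap tests (23) and (24) that Algorithm 2 below monitors.

```python
import numpy as np

def subproblem_data(A, b, xk):
    """B_k = (2/m) diag(A xk) A and d_k = (b - (A xk)^2) / m from (17)."""
    m = A.shape[0]
    Axk = A @ xk
    return (2.0 / m) * Axk[:, None] * A, (b - Axk ** 2) / m

def H(Bk, dk, t, z):
    """Primal subproblem objective H_k(z) in (17)."""
    return np.dot(z, z) / (2.0 * t) + np.abs(Bk @ z - dk).sum()

def D(Bk, dk, t, lam):
    """Dual objective D_k(lambda) in (20); feasible iff ||lambda||_inf <= 1."""
    v = Bk.T @ lam
    return -0.5 * t * np.dot(v, v) - lam @ dk

def gap_stopping(Bk, dk, t, lam, rho_l, rho_h):
    """Return (lacc_ok, hacc_ok) for the checks (23) and (24) at z_k(lambda)."""
    z = -t * (Bk.T @ lam)                      # z_k(lambda)
    gap = H(Bk, dk, t, z) - D(Bk, dk, t, lam)  # >= H_k(z) - min_z H_k(z) by (21)
    lacc_ok = gap <= rho_l * (H(Bk, dk, t, np.zeros_like(z)) - H(Bk, dk, t, z))
    hacc_ok = gap <= rho_h * np.dot(z, z) / (2.0 * t)
    return lacc_ok, hacc_ok
```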

Algorithm 2 FISTA for Solving (20)
  Input: $\lambda^{0}\in\mathbb{R}^{m}$ satisfying $\|\lambda^{0}\|_{\infty}\leq 1$; $\lambda_{a}^{0}=\lambda_{b}^{0}=\lambda_{c}^{0}=\lambda^{0}$, $\gamma_{0}=1$, $\rho_{l}>0$, and $\rho_{h}\in(0,1/4)$.
  for $j=0,1,2,\ldots$ do

\lambda_{c}^{j+1}=(1-\gamma_{j})\lambda_{a}^{j}+\gamma_{j}\lambda_{b}^{j},
\lambda_{b}^{j+1}=\operatorname*{arg\,min}_{\|\lambda\|_{\infty}\leq 1}\ \frac{\gamma_{j}}{2t_{kj}}\|\lambda-\lambda_{c}^{j+1}\|_{2}^{2}+(\lambda-\lambda_{c}^{j+1})^{\top}\left(tB_{k}B_{k}^{\top}\lambda_{c}^{j+1}+d_{k}\right),\quad t_{kj}>0, \qquad (22)
\lambda_{a}^{j+1}=(1-\gamma_{j})\lambda_{a}^{j}+\gamma_{j}\lambda_{b}^{j+1},
\gamma_{j+1}=2/\left(1+\sqrt{1+4/\gamma_{j}^{2}}\right),
     Terminate if one of the following stopping criteria is satisfied:
(LACC-FISTA)\quad H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1})\leq\rho_{l}\left(H_{k}(0)-H_{k}(z_{k}(\lambda_{a}^{j+1}))\right), \qquad (23)
(HACC-FISTA)\quad H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1})\leq\frac{\rho_{h}}{2t}\|z_{k}(\lambda_{a}^{j+1})\|_{2}^{2}. \qquad (24)
  end for
  Output: $\lambda_{k}=\lambda_{a}^{j+1}$, $z_{k}=-tB_{k}^{\top}\lambda_{k}$, $x^{k+1}=x^{k}+z_{k}$.
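A compact NumPy sketch (ours) of Algorithm 2: it reuses the helpers H, D, and subproblem_data from the sketch above, uses the fixed step $t_{kj}=(t\|B_{k}\|_{2}^{2})^{-1}$ from Theorem 3 instead of backtracking, and caps the number of iterations.

```python
import numpy as np

def fista_subproblem(Bk, dk, t, rho_l, rho_h, use_hacc=False, max_iter=5000):
    """Solve the dual (20) by FISTA and return z_k once (23) or (24) holds."""
    m, n = Bk.shape
    t_kj = 1.0 / (t * np.linalg.norm(Bk, 2) ** 2)   # fixed step from Theorem 3
    lam_a = np.zeros(m)                             # lambda^0 = 0 is feasible
    lam_b = np.zeros(m)
    gamma = 1.0
    z = np.zeros(n)
    for _ in range(max_iter):
        lam_c = (1.0 - gamma) * lam_a + gamma * lam_b
        grad = t * (Bk @ (Bk.T @ lam_c)) + dk       # gradient of -D_k at lam_c
        lam_b = np.clip(lam_c - (t_kj / gamma) * grad, -1.0, 1.0)  # step (22)
        lam_a = (1.0 - gamma) * lam_a + gamma * lam_b
        gamma = 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / gamma ** 2))
        z = -t * (Bk.T @ lam_a)                     # z_k(lambda_a)
        gap = H(Bk, dk, t, z) - D(Bk, dk, t, lam_a)
        if use_hacc:
            if gap <= rho_h * np.dot(z, z) / (2.0 * t):
                break
        elif gap <= rho_l * (H(Bk, dk, t, np.zeros(n)) - H(Bk, dk, t, z)):
            break
    return z
```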
Remark 1.

Here we remark on the step size $t_{kj}$ in (22). It can be chosen as $\left(tL_{k}^{2}\right)^{-1}$ for some $L_{k}\geq\|B_{k}\|_{2}$, or it can be chosen by the Armijo backtracking line search. More specifically, suppose that we have an initial step size $t_{k(-1)}>0$. Given the step size $t_{k(j-1)}$, $j\geq 0$, denote

\lambda_{b,t_{s}}^{j+1}=\operatorname*{arg\,min}_{\lambda\in\mathbb{R}^{m},\|\lambda\|_{\infty}\leq 1}\ \frac{\gamma_{j}}{2t_{s}}\|\lambda-\lambda_{c}^{j+1}\|_{2}^{2}+(\lambda-\lambda_{c}^{j+1})^{\top}\left(tB_{k}B_{k}^{\top}\lambda_{c}^{j+1}+d_{k}\right),

and

\lambda_{a,t_{s}}^{j+1}=(1-\gamma_{j})\lambda_{a}^{j}+\gamma_{j}\lambda_{b,t_{s}}^{j+1}.

Then $t_{kj}$ can be selected as

t_{kj}=\max\left\{t_{s}\ \middle|\ t_{s}=2^{-s}t_{k(j-1)},\ s\in\mathbb{N},\ \frac{1}{2t_{s}}\|\lambda_{a,t_{s}}^{j+1}-\lambda_{c}^{j+1}\|_{2}^{2}\geq\frac{t}{2}\|B_{k}^{\top}\lambda_{a,t_{s}}^{j+1}-B_{k}^{\top}\lambda_{c}^{j+1}\|_{2}^{2}\right\}.
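A sketch (ours) of this backtracking rule; the function name and the halving loop are illustrative, and the projected step coincides with (22) evaluated at the trial step $t_{s}$.

```python
import numpy as np

def backtracking_step(Bk, dk, t, lam_a, lam_c, gamma, t_prev):
    """Pick t_kj by halving t_s = 2^{-s} t_prev until the Remark 1 test
    ||lam_a(t_s) - lam_c||^2 / (2 t_s) >= (t/2) ||B_k^T (lam_a(t_s) - lam_c)||^2
    holds; return (t_kj, lam_b^{j+1}, lam_a^{j+1})."""
    grad = t * (Bk @ (Bk.T @ lam_c)) + dk           # gradient of -D_k at lam_c
    t_s = t_prev
    while True:
        lam_b_new = np.clip(lam_c - (t_s / gamma) * grad, -1.0, 1.0)
        lam_a_new = (1.0 - gamma) * lam_a + gamma * lam_b_new
        diff = lam_a_new - lam_c
        if np.dot(diff, diff) / (2.0 * t_s) >= 0.5 * t * np.sum((Bk.T @ diff) ** 2):
            return t_s, lam_b_new, lam_a_new
        t_s *= 0.5
```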

We now discuss the overall complexity of IPL (Algorithm 1) with the subproblem (5) solved inexactly by FISTA (Algorithm 2). For ease of presentation, we denote this algorithm as IPL+FISTA. We assume that IPL is terminated after $K_{\epsilon}$ iterations, when an $\epsilon$-optimal solution is found, i.e., $\Delta(x^{K_{\epsilon}})\leq\epsilon$ and $\Delta(x^{K_{\epsilon}-1})>\epsilon$. We use $J_{k}$, $k\geq 0$, to denote the number of iterations of FISTA when it is called in the $k$-th iteration of IPL (i.e., when computing $x^{k+1}$ from $x^{k}$). The overall complexity of IPL+FISTA for obtaining an $\epsilon$-optimal solution is thus given by $J(\epsilon)=\sum_{k=0}^{K_{\epsilon}-1}J_{k}$, which equals the number of times we call (22). We are now ready to give the overall complexity of IPL+FISTA in Theorem 3.

Theorem 3.

Let $t=1/L$, $t_{kj}=\left(t\|B_{k}\|_{2}^{2}\right)^{-1}$, and suppose that Assumption 1 holds.

  • (a)

    (Low Accuracy) Consider the overall complexity of Algorithm 1 when using Algorithm 2 with option (23) and $\rho_{l}>0$. For any $x^{0}\in\mathbb{R}^{n}$ that satisfies $\Delta(x^{0})\leq E_{1}$, we have

    J(\epsilon)\leq E_{2}/\epsilon,\quad\forall\epsilon>0. \qquad (25)

    Here $E_{1},E_{2}$ are positive constants that only depend on $\{a_{i}\}_{i=1}^{m}$, $\{b_{i}\}_{i=1}^{m}$, $x_{\star}$, $\lambda_{s}$, $L$, and $\rho_{l}$; they will be specified later in the proof.

  • (b)

    (High Accuracy) Consider the overall complexity of Algorithm 1 when using Algorithm 2 with option (24) and $\rho_{h}\in(0,1/4)$. There exist positive constants $E_{3},E_{4}$ and a sequence $\{\epsilon_{i}\}_{i=1}^{\infty}$ with $\epsilon_{i}>\epsilon_{i+1}$ for all $i\in\mathbb{N}_{+}$ and $\lim_{i\rightarrow\infty}\epsilon_{i}=0$ such that, if $\Delta(x^{0})\leq E_{3}$, then

    J(\epsilon_{i})\leq E_{4}/\epsilon_{i},\quad\forall i\in\mathbb{N}_{+}. \qquad (26)

    Here $E_{3},E_{4}$ only depend on $\{a_{i}\}_{i=1}^{m}$, $\{b_{i}\}_{i=1}^{m}$, $x_{\star}$, $\lambda_{s}$, $L$, and $\rho_{h}$, while $\{\epsilon_{i}\}_{i=1}^{\infty}$ depends on $\{\Delta(x^{i})\}_{i=1}^{\infty}$; they will be specified later in the proof. Note that (26) implies that the worst-case overall complexity might be higher than $O(1/\epsilon)$ – see the explanation below for more details.

Theorem 3 shows that under the (LACC-FISTA) stopping criterion, we need $O(1/\epsilon)$ iterations to find an $\epsilon$-optimal solution, and under the (HACC-FISTA) stopping criterion, we obtain the same rate along a countable positive sequence that decreases to zero. Theorem 3 provides better theoretical rates than the $O(1/\epsilon^{3})$ rate in [15]. Moreover, the results in Theorem 3 concern convergence to an $\epsilon$-optimal solution, while the results in [15] concern convergence to an $\epsilon$-stationary point. Our results require the sharpness condition, which was not assumed in [15].

For part (b) of Theorem 3, we can only find a countable sequence of diminishing $\epsilon_{i}$'s for which the $O(1/\epsilon_{i})$ rate holds; we cannot show the $O(1/\epsilon)$ rate for every fixed $\epsilon>0$. This is because of the local quadratic convergence under (HACC) shown in part (b) of Theorem 2. For instance, if our initial point $x^{0}$ satisfies $\Delta(x^{0})=2\epsilon>\epsilon$, then under (HACC) the quadratic convergence result implies that $\Delta(x^{1})\leq C\epsilon^{2}<\epsilon$ for some constant $C>0$ when $\epsilon$ is sufficiently small. Therefore, IPL finds an $\epsilon$-optimal solution after only one main iteration. However, Lemma 15 (b) indicates that $J(\epsilon)=J_{0}=O(1/\epsilon^{2})$. Therefore, it is possible that the overall complexity becomes $O(1/\epsilon^{2})$, which is higher than $O(1/\epsilon)$.

4 Numerical Experiments

In this section, we conduct numerical experiments to compare our IPL method with existing methods for solving the RPR problem (4). Readers can find the code and datasets to replicate the experiments in this section via https://github.com/zhengzhongpku/IPL-code-share. The algorithms that we test include the following ones.

  • (i)

    PL: The original proximal linear algorithm proposed by [11], where the subproblem (5) is solved by POGS [27]. POGS terminates when both the primal residual and the dual residual are small enough. In their code, the authors of [11] implemented a two-stage trick that uses a relatively larger tolerance in early iterations and a smaller tolerance in later iterations to terminate POGS. In our comparison, we use all the default parameters set by the authors in their code (the code of [11] can be downloaded from https://web.stanford.edu/~jduchi/projects/phase-retrieval-code.tgz).

  • (ii)

    Subgradient method. The subgradient method with geometrically decaying step sizes was proposed by [13], and they used this algorithm to solve the RPR (4). One typical iteration of this algorithm is

    x^{k+1}=x^{k}-\lambda_{0}q^{k}\xi_{k}/\|\xi_{k}\|_{2},\quad k\geq 0, \qquad (27)

    in which $\lambda_{0}>0$ and $q\in(0,1)$ are hyper-parameters and $\xi_{k}=\frac{1}{m}\sum_{i=1}^{m}2(a_{i}^{\top}x^{k})\,\mathrm{sign}\left((a_{i}^{\top}x^{k})^{2}-b_{i}\right)a_{i}$ is a subgradient of $F$ at $x^{k}$. (A short sketch of this update is given after this list.)

  • (iii)

    IPL-FISTA-Low, IPL-FISTA-High: our IPL+FISTA algorithm with stopping criteria (LACC-FISTA) and (HACC-FISTA) in Algorithm 2, respectively; we also use the Armijo backtracking line search discussed in Remark 1.
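The sketch below (ours) implements the subgradient update (27) with a geometrically decaying step size; the hyper-parameters $\lambda_{0}$ and $q$ are passed in, and the default for $q$ is illustrative.

```python
import numpy as np

def subgradient_method(A, b, x0, lam0, q=0.998, num_iters=1000):
    """Normalized subgradient method (27) for the RPR loss (4)."""
    m = A.shape[0]
    x = x0.copy()
    for k in range(num_iters):
        Ax = A @ x
        xi = (2.0 / m) * (A.T @ (Ax * np.sign(Ax ** 2 - b)))  # subgradient of F at x
        norm_xi = np.linalg.norm(xi)
        if norm_xi == 0.0:
            break
        x = x - lam0 * (q ** k) * xi / norm_xi
    return x
```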

The initial point for all the tested algorithms is generated by the spectral initialization given in Algorithm 3 in [11]. All the code is run on a server with Intel Xeon E5-2650v4 (2.2GHz). Each task is limited to a single core – no multi-threading is used.

4.1 Synthetic Data

We generate synthetic data in the same manner as [11]. Specifically, the $a_{i}$'s are drawn randomly from the normal distribution $\mathcal{N}(0,I_{n})$. The entries of $x_{\star}\in\{-1,1\}^{n}$ are drawn randomly from a discrete Bernoulli distribution. We denote $p_{\textrm{fail}}=|\mathbb{I}_{2}|/m$, where $\mathbb{I}_{2}$ is generated by random sampling without replacement from $\{1,2,\ldots,m\}$. The $\xi_{i}$'s for these samples in (2) are independently drawn from a Cauchy distribution, which means that

b_{i}=\xi_{i}=\tilde{M}\tan\left(\frac{\pi}{2}U_{i}\right),\quad U_{i}\sim U(0,1),\ \forall i\in\mathbb{I}_{2},

where $\tilde{M}$ is the sample median of $\{(a_{i}^{\top}x_{\star})^{2}\}_{i=1}^{m}$. For a given threshold $\epsilon>0$, we call an algorithm successful if it returns an $x$ such that the relative error

\Delta(x)/\|x_{\star}\|_{2}\leq\epsilon. \qquad (28)

For each combination of $n$, $k=m/n$, and $p_{\textrm{fail}}$, we randomly generate 50 instances according to the above procedure, and we report the success rate of the algorithm among the 50 instances. For IPL-FISTA-Low and IPL-FISTA-High, we set $\rho_{l}=\rho_{h}=0.24$. For the subgradient method, we set $q=0.998$, which is one of the suggested choices of $q$ in [13]. Moreover, [13] did not specify how to choose $\lambda_{0}$, and we set $\lambda_{0}=0.1\|x^{0}\|_{2}$ because we found that this choice gave good performance. Since we found that in most cases the relative error given by PL is on the level of $[10^{-5},10^{-3}]$, we set $\epsilon=10^{-3}$ in (28) for PL. In our comparison, we first run PL using the default settings of the code provided by the authors of [11]. If the returned $x$ satisfies (28) with $\epsilon=10^{-3}$, then we declare that PL is successful, and we terminate IPL and the subgradient method once they find an iterate with a smaller objective value than the one given by PL. If the iterate returned by PL does not satisfy (28) with $\epsilon=10^{-3}$, then we declare that PL failed, and we then terminate IPL and the subgradient method when $F(x^{k})-F(x_{\star})\leq 10^{-7}$. The CPU time is only reported for the cases in which PL succeeds. The cost of computing the spectral norm $\|A\|_{2}$ to obtain $L$ is included in the total CPU time of PL and IPL.
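A sketch (ours) of the synthetic-data generator and the success criterion (28) described above; the function names are illustrative.

```python
import numpy as np

def generate_instance(n, k, p_fail, rng):
    """Gaussian measurements, +/-1 signal, and a p_fail fraction of Cauchy outliers."""
    m = k * n
    A = rng.standard_normal((m, n))
    x_star = rng.choice([-1.0, 1.0], size=n)
    b = (A @ x_star) ** 2
    M_tilde = np.median(b)                          # sample median of (a_i^T x_star)^2
    I2 = rng.choice(m, size=int(p_fail * m), replace=False)
    b[I2] = M_tilde * np.tan(0.5 * np.pi * rng.uniform(size=I2.size))
    return A, b, x_star

def relative_error(x, x_star):
    """Success criterion (28): Delta(x) / ||x_star||_2."""
    d = min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star))
    return d / np.linalg.norm(x_star)
```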

The simulation results corresponding to $p_{\textrm{fail}}=0.05$ and $p_{\textrm{fail}}=0.15$ are shown in Figures 1 and 2, where the $x$-axis corresponds to different values of $m$ since $n=500$ is fixed. From both Figures 1 and 2, we can see that the four algorithms have similar success rates, but the total CPU time of IPL-FISTA-Low and IPL-FISTA-High, which includes the cost of computing the spectral norm $\|A\|_{2}$, is significantly less than that of the others.

Figure 1: The comparison of success rates and CPU time on synthetic datasets with $p_{\textrm{fail}}=0.05$ and $n=500$.
Figure 2: The comparison of success rates and CPU time on synthetic datasets with $p_{\textrm{fail}}=0.15$ and $n=500$.

4.2 Image Recovery

In this section, we compare the four candidate algorithms on images in a similar manner to [11]. In particular, suppose we have an RGB image array $X_{\star}\in\mathbb{R}^{n_{1}\times n_{2}\times 3}$. We construct the signal as $x_{\star}=[\mathrm{vec}(X_{\star});0]\in\mathbb{R}^{n}$, in which $n=\min\{2^{s}\mid s\in\mathbb{N},2^{s}\geq 3n_{1}n_{2}\}$. Let $H_{n}\in\frac{1}{\sqrt{n}}\{-1,1\}^{n\times n}$ be the normalized Hadamard matrix, and let $S_{j}\in\mathrm{diag}(\{-1,1\}^{n})$, $j=1,2,\ldots,k$, be random diagonal matrices whose diagonal elements are independently drawn from the discrete uniform distribution on $\{-1,1\}$. We then let $A=\frac{\sqrt{m}}{\sqrt{k}}[H_{n}S_{1};H_{n}S_{2};\ldots;H_{n}S_{k}]$, and we know that $L=2$ in this case. The advantage of such a mapping is that it mimics the fast Fourier transform, and calculating $Ay$ only has time complexity $O(m\log m)$. We first examine the numerical comparisons on a real RNA nanoparticles image (https://visualsonline.cancer.gov/details.cfm?imageid=11167) as shown in Figure 3, and we follow the code of [11] for the experiments on a sub-image with $n=2^{18}$. We also take $p_{\textrm{fail}}\in\{0.05,0.1,0.15,0.2\}$ and $k\in\{3,6\}$, and we set the noise in the same way as for the synthetic datasets. For each combination of dataset parameters, we run 50 replicates by generating 50 different matrices $A$ and test all the candidate algorithms. We define success in the same way as for the synthetic datasets. For IPL, $\rho_{l}=\rho_{h}=0.24$. For the subgradient method, $\lambda_{0}=0.1\|x^{0}\|_{2}$ and $q=0.998$. For a replicate in which PL succeeds, the CPU time is the time needed to reach (28).
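The fast measurement operator can be sketched as follows (ours): a radix-2 fast Walsh–Hadamard transform applied blockwise, so computing $Ay$ costs $O(m\log m)$ without forming $A$ explicitly.

```python
import numpy as np

def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    v = v.copy()
    h = 1
    while h < v.size:
        for i in range(0, v.size, 2 * h):
            a, c = v[i:i + h].copy(), v[i + h:i + 2 * h].copy()
            v[i:i + h], v[i + h:i + 2 * h] = a + c, a - c
        h *= 2
    return v

def hadamard_measurements(y, signs):
    """A y with A = sqrt(m/k) [H_n S_1; ...; H_n S_k], where H_n is the normalized
    Hadamard matrix and S_j = diag(signs[j]); signs has shape (k, n) with +/-1 entries."""
    k, n = signs.shape
    m = k * n
    blocks = [fwht(sj * y) / np.sqrt(n) for sj in signs]   # H_n S_j y
    return np.sqrt(m / k) * np.concatenate(blocks)
```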

Figure 3: A real RNA nanoparticles image.

Table 1 reports the median CPU time (in seconds) of the candidate algorithms for $p_{\textrm{fail}}=0.1$ and $m/n=6$ based on two tolerances, $\epsilon=10^{-1}$ and $\epsilon=10^{-7}$. We only show the results for this combination of $p_{\textrm{fail}}$ and $m/n$ because other choices give similar results. It is noted from Table 1 that PL can only reach a relative error in $[10^{-2},10^{-1}]$, and IPL-FISTA is much more efficient than the PL and subgradient methods.

Table 1: The comparison of the median CPU time in seconds for RNA image recovery.

Method            $\epsilon=10^{-1}$    $\epsilon=10^{-7}$
PL                3166.93               NA
Subgradient       88.14                 659.64
IPL-FISTA-Low     6.01                  218.14
IPL-FISTA-High    48.56                 175.60

Additional experimental results comparing the CPU time of the four candidate algorithms are reported in Table 2. We consider four images with $n$ being at most $2^{22}$. The experiments use the same $m/n$ and $p_{\textrm{fail}}$, and the CPU time over ten replications is reported in the form "median (interquartile range)". The subgradient method, IPL-FISTA-Low, and IPL-FISTA-High are terminated with tolerance $\epsilon=10^{-7}$. We can see that PL cannot terminate within 10 hours, and IPL-FISTA still enjoys the best efficiency.

Table 2: The comparison of the median CPU time in hours, with the interquartile range (IQR) in parentheses, for multiple image recovery tasks.

Method            RNA ($n=2^{22}$)    Hubble ($n=2^{22}$)    James Webb ($n=2^{21}$)    Penn State ($n=2^{22}$)
PL                $>10$               $>10$                  $>10$                      $>10$
Subgradient       4.39 (0.24)         4.72 (0.27)            1.86 (0.14)                4.84 (0.66)
IPL-FISTA-Low     1.65 (0.41)         2.07 (0.14)            0.90 (0.21)                2.23 (0.16)
IPL-FISTA-High    1.30 (0.08)         1.26 (0.06)            0.56 (0.09)                1.41 (0.21)

Image sources: Hubble: https://www.nasa.gov/image-feature/goddard/2017/hubble-hones-in-on-a-hypergiants-home; James Webb: https://www.nasa.gov/webbfirstimages; Penn State: https://www.britannica.com/topic/Pennsylvania-State-University.

5 Proofs

5.1 Proof of Theorem 1

Before proceeding, we first present some lemmas.

Lemma 1 (Weak Convexity).

(Discussion of Condition C2 in [11]) The following inequalities hold for any $x,y\in\mathbb{R}^{n}$, with $L=\frac{2}{m}\|A\|_{2}^{2}$:

\left|F(x)-F(x;y)\right|\leq\frac{L}{2}\|x-y\|_{2}^{2}, \qquad (29)
F(x)\leq F_{t}(x;y),\quad\forall\,t\leq\frac{1}{L}. \qquad (30)

Lemma 2 (see, e.g., equation (5.2) in [15]).

The following inequality holds for any $0<t\leq 1/L$ and $x,y\in\mathbb{R}^{n}$:

F(x)-F(y)+\varepsilon(y;x)\geq\frac{1}{2t}\|x-S_{t}(x)\|_{2}^{2}. \qquad (31)

The following lemma studies one iteration of our IPL algorithm (as summarized in Algorithm 1).

Lemma 3.

If $0<t\leq\frac{1}{L}$, we have the following inequalities.

  1. (a)

    When the low accuracy condition (8) is satisfied, we have

    F(x^{k})-F(x^{k+1})\geq\frac{1}{2(1+\rho_{l})t}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}. \qquad (32)
  2. (b)

    When the high accuracy condition (9) is satisfied, we have

    F(x^{k})-F(x^{k+1})\geq\frac{1-2\sqrt{\rho_{h}}}{2t(1-\sqrt{\rho_{h}})^{2}}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}. \qquad (33)
Proof of Lemma 3.

We first prove part (a) of Lemma 3 and then prove part (b) of Lemma 3.

(a). Letting $x=x^{k}$ and $y=x^{k+1}$ in (31), we have

F(x^{k})-F(x^{k+1})\geq\frac{1}{2t}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}-\varepsilon(x^{k+1};x^{k})\geq\frac{1}{2t}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}-\rho_{l}\left(F(x^{k})-F(x^{k+1})\right),

where the second inequality follows from (8) and (30). This proves (32) in Lemma 3(a).

(b). When (9) holds, since $F_{t}(\cdot;x^{k})$ is $\frac{1}{t}$-strongly convex, we have

\frac{\rho_{h}}{2t}\|x^{k}-x^{k+1}\|_{2}^{2}\geq\varepsilon(x^{k+1};x^{k})\geq\frac{1}{2t}\|x^{k+1}-S_{t}(x^{k})\|_{2}^{2}. \qquad (34)

Let $u=x^{k}-x^{k+1}$ and $v=x^{k+1}-S_{t}(x^{k})$. From (34) we have

\rho_{h}\|u\|_{2}^{2}\geq\|v\|_{2}^{2}, \qquad (35)

and

\|u+v\|_{2}^{2}-(1-\sqrt{\rho_{h}})^{2}\|u\|_{2}^{2}
= (2\sqrt{\rho_{h}}-\rho_{h})\|u\|_{2}^{2}+2u^{\top}v+\|v\|_{2}^{2}
\geq \sqrt{\rho_{h}}(2-\sqrt{\rho_{h}})\|u\|_{2}^{2}-2\|u\|_{2}\|v\|_{2}+\|v\|_{2}^{2}
= \left(\sqrt{\rho_{h}}\|u\|_{2}-\|v\|_{2}\right)\left((2-\sqrt{\rho_{h}})\|u\|_{2}-\|v\|_{2}\right)
\geq 0, \qquad (36)

where the first inequality is from the Cauchy-Schwarz inequality, and the second inequality follows from (35) and the fact that $\rho_{h}\in(0,1/4)$. Therefore, from (34) and (36) we have

\varepsilon(x^{k+1};x^{k})\leq\frac{\rho_{h}}{2t}\|x^{k}-x^{k+1}\|_{2}^{2}\leq\frac{\rho_{h}}{2t(1-\sqrt{\rho_{h}})^{2}}\|x^{k}-S_{t}(x^{k})\|_{2}^{2},

which, together with (31), yields

F(x^{k})-F(x^{k+1})
\geq \frac{1}{2t}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}-\varepsilon(x^{k+1};x^{k})
\geq \frac{1}{2t}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}-\frac{\rho_{h}}{2t(1-\sqrt{\rho_{h}})^{2}}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}
= \frac{1-2\sqrt{\rho_{h}}}{2t(1-\sqrt{\rho_{h}})^{2}}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}.

This proves (33) in Lemma 3(b).

Therefore, the proof of Lemma 3 is complete. ∎

Now we are ready to prove Theorem 1.

Proof of Theorem 1.

We will prove (a) and (b) together. Both (32) and (33) indicate that the sequence $\{x^{k}\}$ generated by Algorithm 1 satisfies

F(x^{k})-F(x^{k+1})\geq\beta t\|\mathcal{G}_{t}(x^{k})\|_{2}^{2},

in which $\beta>0$ is a constant: $\beta=1/(2(1+\rho_{l}))$ for (LACC) and $\beta=(1-2\sqrt{\rho_{h}})/(2(1-\sqrt{\rho_{h}})^{2})$ for (HACC). Letting $F^{\star}=\inf_{x\in\mathbb{R}^{n}}F(x)$, we have, for any $K_{0}\in\mathbb{N}_{+}$,

F(x^{0})-F^{\star}\geq\sum_{k=0}^{K_{0}-1}\beta t\|\mathcal{G}_{t}(x^{k})\|_{2}^{2}\geq\beta tK_{0}\min_{0\leq k\leq K_{0}-1}\|\mathcal{G}_{t}(x^{k})\|_{2}^{2}.

Therefore, our IPL (Algorithm 1) finds an $\epsilon$-stationary point of RPR (4) in

\left\lfloor\frac{F(x^{0})-F^{\star}}{\beta t\epsilon^{2}}\right\rfloor

iterations. Therefore, the proof of both (a) and (b) in Theorem 1 is complete. ∎

5.2 Proof of Theorem 2

To prove Theorem 2, we need the following lemmas.

Lemma 4.

(Part of the proof of Theorem 1 in [11]) Let $t=1/L$. For any $x^{k},x^{k+1}\in\mathbb{R}^{n}$, we have

F(x^{k+1})-F(x_{\star})+\frac{L}{4}\|x^{k+1}-x_{\star}\|_{2}^{2}\leq L\|x^{k}-x_{\star}\|_{2}^{2}+2\varepsilon(x^{k+1};x^{k}), \qquad (37)

which also holds if we replace $x_{\star}$ by $-x_{\star}$.

Proof of Lemma 4.

We have

\frac{L}{2}\|x_{\star}-S_{t}(x^{k})\|_{2}^{2}
\geq \frac{L}{4}\|x_{\star}-x^{k+1}\|_{2}^{2}-\frac{L}{2}\|S_{t}(x^{k})-x^{k+1}\|_{2}^{2}
\geq \frac{L}{4}\|x_{\star}-x^{k+1}\|_{2}^{2}-\varepsilon(x^{k+1};x^{k}), \qquad (38)

where the first inequality follows from the Cauchy-Schwarz inequality and the second one follows from the $\frac{1}{t}$-strong convexity of $F_{t}(\cdot;x^{k})$. We then have

F(x^{k+1})
\leq F(x^{k+1};x^{k})+\frac{L}{2}\|x^{k}-x^{k+1}\|_{2}^{2}
= F_{t}(S_{t}(x^{k});x^{k})+\varepsilon(x^{k+1};x^{k})
\leq F(x_{\star};x^{k})+\frac{L}{2}\|x^{k}-x_{\star}\|_{2}^{2}-\frac{L}{2}\|x_{\star}-S_{t}(x^{k})\|_{2}^{2}+\varepsilon(x^{k+1};x^{k})
\leq F(x_{\star})+L\|x^{k}-x_{\star}\|_{2}^{2}-\frac{L}{2}\|x_{\star}-S_{t}(x^{k})\|_{2}^{2}+\varepsilon(x^{k+1};x^{k}),

where the first and the last inequalities are from (29), and the second inequality is from the strong convexity of $F_{t}(\cdot;x^{k})$. Combining this inequality with (38) yields (37). It is easy to see that all of the arguments still hold if we replace $x_{\star}$ with $-x_{\star}$. Therefore, the proof of Lemma 4 is complete. ∎

Lemma 5 (One-step progress).

Let $t=1/L$ and suppose that Assumption 1 holds.

  1. (a)

    (Low accuracy condition) When (8) is satisfied for some $\rho_{l}\geq 0$, we have

    (1+2\rho_{l})\left(F(x^{k+1})-F(x_{\star})\right)\leq 2\rho_{l}\left(F(x^{k})-F(x_{\star})\right)+L(\Delta(x^{k}))^{2}. \qquad (39)

    If we also have $F(x^{k})-F(x_{\star})\leq\lambda_{s}^{2}/(2L)$, then we have

    (1+2\rho_{l})\left(F(x^{k+1})-F(x_{\star})\right)\leq\left(\frac{1}{2}+2\rho_{l}\right)\left(F(x^{k})-F(x_{\star})\right). \qquad (40)
  2. (b)

    (High accuracy condition) When (9) is satisfied, we have

    \lambda_{s}\Delta(x^{k+1})\leq\frac{L(1-3\rho_{h})}{(1-4\rho_{h})}(\Delta(x^{k}))^{2}. \qquad (41)
Proof of Lemma 5.

We first prove part (a) of Lemma 5 and then prove part (b) of Lemma 5.

(a). When the low accuracy condition (8) holds, from (30) we have

\varepsilon(x^{k+1};x^{k})\leq\rho_{l}\left(F_{t}(x^{k};x^{k})-F_{t}(x^{k+1};x^{k})\right)\leq\rho_{l}\left(F(x^{k})-F(x^{k+1})\right),

which, combined with (37), yields

(1+2\rho_{l})\left(F(x^{k+1})-F(x_{\star})\right)+\frac{L}{4}\|x^{k+1}-x_{\star}\|_{2}^{2}\leq 2\rho_{l}\left(F(x^{k})-F(x_{\star})\right)+L\|x^{k}-x_{\star}\|_{2}^{2}.

Discarding the term $\frac{L}{4}\|x^{k+1}-x_{\star}\|_{2}^{2}$, we get

(1+2\rho_{l})\left(F(x^{k+1})-F(x_{\star})\right)\leq 2\rho_{l}\left(F(x^{k})-F(x_{\star})\right)+L\|x^{k}-x_{\star}\|_{2}^{2}.

Since (37) also holds when $x_{\star}$ is replaced by $-x_{\star}$, we also have

(1+2\rho_{l})\left(F(x^{k+1})-F(x_{\star})\right)\leq 2\rho_{l}\left(F(x^{k})-F(x_{\star})\right)+L\|x^{k}+x_{\star}\|_{2}^{2}.

Combining the two inequalities proves (39). Inequality (40) then follows from Assumption 1 and the condition $F(x^{k})-F(x_{\star})\leq\lambda_{s}^{2}/(2L)$, which together give $L(\Delta(x^{k}))^{2}\leq\frac{L}{\lambda_{s}^{2}}\left(F(x^{k})-F(x_{\star})\right)^{2}\leq\frac{1}{2}\left(F(x^{k})-F(x_{\star})\right)$. This proves part (a) of Lemma 5.

(b). When the high accuracy condition (9) holds, we have

2\varepsilon(x^{k+1};x^{k})\leq\rho_{h}L\|x^{k}-x^{k+1}\|_{2}^{2}\leq\rho_{h}L\left(\frac{\|x^{k+1}-x_{\star}\|_{2}^{2}}{4\rho_{h}}+\frac{\|x^{k}-x_{\star}\|_{2}^{2}}{1-4\rho_{h}}\right),

where the second inequality is from the Cauchy-Schwarz inequality. Combining with (37), we have

F(x^{k+1})-F(x_{\star})\leq\frac{(1-3\rho_{h})L}{1-4\rho_{h}}\|x^{k}-x_{\star}\|_{2}^{2}.

Similarly, replacing $x_{\star}$ by $-x_{\star}$, we have

F(x^{k+1})-F(x_{\star})\leq\frac{(1-3\rho_{h})L}{1-4\rho_{h}}\|x^{k}+x_{\star}\|_{2}^{2}.

Combining the above two inequalities with Assumption 1 yields (41), which proves part (b) of Lemma 5.

Therefore, the proof of Lemma 5 is complete. ∎

Now we are ready to give the proof of Theorem 2.

Proof of Theorem 2.

(a). We will prove that for any $k\in\mathbb{N}$,

F(x^{k})-F(x_{\star})\leq\left(F(x^{0})-F(x_{\star})\right)\left(\frac{1+4\rho_{l}}{2+4\rho_{l}}\right)^{k} \qquad (42)

by induction, which immediately leads to the conclusion of part (a) by Assumption 1. First, (42) clearly holds for $k=0$. Now assume that it holds for $k$. For $k+1$, since (15) holds, we have

F(x^{k})-F(x_{\star})\leq\left(F(x^{0})-F(x_{\star})\right)\left(\frac{1+4\rho_{l}}{2+4\rho_{l}}\right)^{k}\leq F(x^{0})-F(x_{\star})\leq\lambda_{s}^{2}/(2L).

Using (40) directly proves that (42) holds when $k$ is replaced by $k+1$. This proves part (a) of Theorem 2.

(b). Note that (41) is equivalent to

\frac{L\Delta(x^{k+1})(1-3\rho_{h})}{\lambda_{s}(1-4\rho_{h})}\leq\left(\frac{L\Delta(x^{k})(1-3\rho_{h})}{\lambda_{s}(1-4\rho_{h})}\right)^{2}. \qquad (43)

Since (16) holds, from (43) we know that

\frac{L\Delta(x^{1})(1-3\rho_{h})}{\lambda_{s}(1-4\rho_{h})}\leq\left(\frac{1}{2}\right)^{2}.

Plugging this back into (43) and working recursively, we get

\frac{L\Delta(x^{k})(1-3\rho_{h})}{\lambda_{s}(1-4\rho_{h})}\leq\left(\frac{1}{2}\right)^{2^{k}}.

Then $\Delta(x^{k})\rightarrow 0$, which together with (43) proves the quadratic convergence in part (b) of Theorem 2.

Therefore, the proof of Theorem 2 is complete. ∎

5.3 Proof of Theorem 3

Throughout this subsection, we assume that the assumptions in Theorem 3 hold. That is, we assume $t=1/L$ and $t_{kj}=\left(t\|B_{k}\|_{2}^{2}\right)^{-1}$ in Algorithm 2, and that Assumption 1 holds. To prove Theorem 3, we first present some lemmas.

Lemma 6 (Local Lipschitz Constant of $F(x)$).

It holds that

\sup_{x,y\in\mathbb{R}^{n},\,\Delta(x)\leq r,\,\Delta(y)\leq r,\,x\neq y}\frac{|F(x)-F(y)|}{\|x-y\|_{2}}\leq L(\|x_{\star}\|_{2}+r).
Proof of Lemma 6.

Denote $u=(x-y)/\|x-y\|_{2}$, and let $v=(x+y)/\|x+y\|_{2}$ when $x+y\neq 0$ and $v=0$ when $x+y=0$. Then

|F(x)-F(y)|/\|x-y\|_{2}
= \frac{1}{m\|x-y\|_{2}}\left|\sum_{i=1}^{m}\left|(a_{i}^{\top}x)^{2}-b_{i}\right|-\sum_{i=1}^{m}\left|(a_{i}^{\top}y)^{2}-b_{i}\right|\right|
\leq \frac{1}{m\|x-y\|_{2}}\sum_{i=1}^{m}\left|(a_{i}^{\top}x)^{2}-(a_{i}^{\top}y)^{2}\right|
= \frac{1}{m\|x-y\|_{2}}\sum_{i=1}^{m}\left|(a_{i}^{\top}(x-y))(a_{i}^{\top}(x+y))\right|
= \|x+y\|_{2}\,\frac{1}{m}\sum_{i=1}^{m}|a_{i}^{\top}u|\cdot|a_{i}^{\top}v|.

Recall that $L=\frac{2}{m}\|A\|_{2}^{2}=\frac{2}{m}\|\sum_{i=1}^{m}a_{i}a_{i}^{\top}\|_{2}$ as defined in Section 2, and note that when $\Delta(x)\leq r$ and $\Delta(y)\leq r$ we have $\|x+y\|_{2}\leq 2(\|x_{\star}\|_{2}+r)$. Hence we can claim that

\|x+y\|_{2}\,\frac{1}{m}\sum_{i=1}^{m}|a_{i}^{\top}u|\cdot|a_{i}^{\top}v|
\leq \|x+y\|_{2}\left(u^{\top}\left(\frac{1}{2m}\sum_{i=1}^{m}a_{i}a_{i}^{\top}\right)u+v^{\top}\left(\frac{1}{2m}\sum_{i=1}^{m}a_{i}a_{i}^{\top}\right)v\right)
\leq L(\|x_{\star}\|_{2}+r),

which proves the desired result. Therefore, the proof of Lemma 6 is complete. ∎

Lemma 7.

If Assumption 1 holds, then for any $r\geq 0$, we have

\{x\in\mathbb{R}^{n}:\Delta(x)\leq E(r)\}\subseteq\{x\in\mathbb{R}^{n}:F(x)-F(x_{\star})\leq r\}\subseteq\{x\in\mathbb{R}^{n}:\Delta(x)\leq r/\lambda_{s}\},

in which

E(r)=\left(\sqrt{L^{2}\|x_{\star}\|_{2}^{2}+4rL}-L\|x_{\star}\|_{2}\right)/(2L). \qquad (44)

This relationship also indicates that $E(r)\leq r/\lambda_{s}$.

Proof of Lemma 7.

For $x\in\mathbb{R}^{n}$ with $\Delta(x)\leq E(r)$ for some $r\geq 0$, we assume without loss of generality that $\Delta(x)=\|x-x_{\star}\|_{2}$. From Lemma 6, we have

F(x)-F(x_{\star})\leq LE(r)(\|x_{\star}\|_{2}+E(r))=r,

which proves the first inclusion. The second inclusion follows immediately from Assumption 1. Therefore, the proof of Lemma 7 is complete. ∎

Lemma 8 (Bound of $\|B_{k}\|_{2}$).

For any $r\geq 0$, if $\sup_{k\in\mathbb{N}}\Delta(x^{k})\leq r$, then

\sup_{k\in\mathbb{N}}\|B_{k}\|_{2}\leq B(r):=\frac{2}{m}\|A\|_{2}(\|x_{\star}\|_{2}+r)\max_{i=1,2,\ldots,m}\|a_{i}\|_{2}.
Proof of Lemma 8.

Since $B_{k}=\frac{2}{m}\mathrm{diag}(Ax^{k})A$, we have $\|B_{k}\|_{2}\leq\frac{2}{m}\|Ax^{k}\|_{\infty}\|A\|_{2}$ and $\|x^{k}\|_{2}\leq\|x_{\star}\|_{2}+r$. The desired result follows from $\|Ax^{k}\|_{\infty}\leq(\|x_{\star}\|_{2}+r)\max_{i=1,2,\ldots,m}\|a_{i}\|_{2}$. Therefore, the proof of Lemma 8 is complete. ∎

Next, we provide Lemmas 9 and 10, which give the convergence rate for solving the subproblem with Algorithm 2. These results apply under the conditions of both parts (a) and (b) of Theorem 3.

Lemma 9 (see [20]).

In the $j$-th iteration of Algorithm 2, we have

\max_{\|\lambda\|_{\infty}\leq 1}D_{k}(\lambda)-D_{k}(\lambda_{a}^{j})\leq\frac{tC\|B_{k}\|_{2}^{2}}{(j+1)^{2}}\|\lambda^{0}-\lambda_{k+1}^{\star}\|_{2}^{2}\leq\frac{tCm\|B_{k}\|_{2}^{2}}{(j+1)^{2}},\quad\forall j\in\mathbb{N},

where $\lambda_{k+1}^{\star}\in\operatorname*{arg\,max}_{\|\lambda\|_{\infty}\leq 1}D_{k}(\lambda)$ and $C>0$ is a universal constant.

Lemma 10 (Theorem 4 in [28]).

For Algorithm 2, there exist universal positive constants $C^{\prime},C^{\prime\prime}$ such that, for all $j\in\mathbb{N}$ with $j\geq C^{\prime}$, it holds that

H_{k}(z_{k}(\lambda_{a}^{j}))-D_{k}(\lambda_{a}^{j})\leq\frac{C^{\prime\prime}tm\|B_{k}\|_{2}^{2}}{j+1}. \qquad (45)
Proof of Lemma 10.

Theorem 4 in [28] indicates that if

(j+1)^{2}\geq\max\left\{\frac{2Ctm\|B_{k}\|_{2}^{2}/t}{m\|B_{k}\|_{2}^{2}},\ \frac{2Ctm\|B_{k}\|_{2}^{2}\cdot\|B_{k}\|_{2}^{2}m}{(1/t)\epsilon^{2}}\right\}=\max\left\{2C,\ \frac{2Ct^{2}m^{2}\|B_{k}\|_{2}^{4}}{\epsilon^{2}}\right\},

then we have

H_{k}(z_{k}(\lambda_{a}^{j}))-D_{k}(\lambda_{a}^{j})\leq\epsilon.

Here $C$ is the constant used in Lemma 9. Specifically, by choosing $\epsilon=\frac{tm\|B_{k}\|_{2}^{2}\sqrt{2C}}{j+1}$, we know that if $j\geq C^{\prime}:=\sqrt{2C}-1$, then (45) holds with $C^{\prime\prime}=\sqrt{2C}$. Therefore, the proof of Lemma 10 is complete. ∎

We now define some constants.

E_{1}=\min\left\{E\left(\lambda_{s}^{2}/(2L)\right),\ E\left(\lambda_{s}M_{1}B^{2}(\lambda_{s}/(2L))/C^{\prime}\right)\right\},
E_{2}=\frac{(2+4\rho_{l})M_{1}B^{2}(\lambda_{s}/(2L))L(\|x_{\star}\|_{2}+\lambda_{s}/(2L))}{\lambda_{s}},
E_{3}=\min\left\{E_{0}/2,\ B(E_{0}/2)\sqrt{M_{2}/C^{\prime}},\ E(\lambda_{s}^{2}/(2L))\right\},
E_{4}=4M_{2}B^{2}(E_{0}/2)/(3E_{0}),

and $\{\epsilon_{i}\}_{i=1}^{\infty}=\{\Delta(x^{i})\}_{i=1}^{\infty}$, where $M=\frac{\lambda_{s}}{2L}\left(\frac{\lambda_{s}}{2L}+\|x_{\star}\|_{2}\right)^{-1}$, $M_{1}=2C^{\prime\prime}tm(\rho_{l}+1)/(\lambda_{s}\rho_{l})$, $M_{2}=4C^{\prime\prime}t^{2}m(\rho_{h}+1)/(\rho_{h}M^{2})$, $E_{0}=\frac{\lambda_{s}(1-4\rho_{h})}{(1-3\rho_{h})L}$, and $C^{\prime},C^{\prime\prime}$ are the universal positive constants from Lemma 10.

We now formally state the sufficiency of (23) and (24) for (8) and (9) in Lemma 11 so that we can use the linear and quadratic convergence rate for main iterations.

Lemma 11.

For Algorithm 2, (23) implies (8) and (24) implies (9).

Proof of Lemma 11.

Note that $z_{k}(\lambda_{a}^{j+1})=-tB_{k}^{\top}\lambda_{a}^{j+1}$ and $x^{k+1}=x^{k}+z_{k}(\lambda_{a}^{j+1})$. The conclusion immediately holds because $H_{k}(z_{k}(\lambda_{a}^{j+1}))=F_{t}(x^{k+1};x^{k})$ and

H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1})\geq H_{k}(z_{k}(\lambda_{a}^{j+1}))-\min_{z\in\mathbb{R}^{n}}H_{k}(z),

which follows from weak duality (21). Therefore, the proof of Lemma 11 is complete. ∎

Based on Lemma 11, we prove some properties induced by conditions of Theorem 3. In particular, Lemma 12 gives the ones induced by part (a) of Theorem 3, and Lemma 13 gives the ones induced by part (b) of Theorem 3.

Lemma 12.

Assume that Assumption 1 holds and that (8) holds for every $k\in\mathbb{N}$ with some $\rho_{l}\geq 0$. If $\Delta(x^{0})\leq E_{1}$, then for any $k\in\mathbb{N}$, we have

F(x^{k})-F(x_{\star})\leq\lambda_{s}^{2}/(2L) \qquad (46)

and

\Delta(x^{k})\leq\min\{\lambda_{s}/(2L),\ M_{1}B^{2}(\lambda_{s}/(2L))/C^{\prime}\}. \qquad (47)
Proof of Lemma 12.

Note that $E(\cdot)$ defined in (44) is monotonically increasing. Since $\Delta(x^{0})\leq E_{1}$, from Lemma 7 we have

F(x^{0})-F(x_{\star})\leq\lambda_{s}\min\{\lambda_{s}/(2L),\ M_{1}B^{2}(\lambda_{s}/(2L))/C^{\prime}\}.

Therefore, we can apply (40) and it implies that

F(x^{k})-F(x_{\star})\leq F(x^{0})-F(x_{\star})\leq\lambda_{s}\min\{\lambda_{s}/(2L),\ M_{1}B^{2}(\lambda_{s}/(2L))/C^{\prime}\}, \qquad (48)

which proves (46). From (14), we have

\lambda_{s}\Delta(x^{k})\leq F(x^{k})-F(x_{\star}),

which leads to (47). Therefore, the proof of Lemma 12 is complete. ∎

Lemma 13.

Assume that Assumption 1 holds and that (9) holds for every $k\in\mathbb{N}$ with some $0\leq\rho_{h}<1/4$. If $\Delta(x^{0})\leq E_{3}$, then for any $k\in\mathbb{N}$, (46) holds, and the following two inequalities hold:

\Delta(x^{k+1})\leq\frac{1}{2}\Delta(x^{k}), \qquad (49)
\Delta(x^{k})\leq E_{3}. \qquad (50)
Proof of Lemma 13.

Based on (41), we have

E_{0}\Delta(x^{k+1})\leq\Delta^{2}(x^{k}),\quad\forall k\geq 0. \qquad (51)

We prove (49) by induction. When $k=0$, noticing that $\Delta(x^{0})\leq E_{0}/2$, by (51) we have $E_{0}\Delta(x^{1})\leq\Delta(x^{0})(E_{0}/2)$, and therefore (49) holds for $k=0$. Now assume that (49) holds for all $k<k_{0}$ for some $k_{0}\in\mathbb{N}$. We then have $\Delta(x^{k_{0}})\leq\Delta(x^{0})\leq E_{0}/2$, and using (51), we have $E_{0}\Delta(x^{k_{0}+1})\leq\Delta(x^{k_{0}})(E_{0}/2)$, which implies that (49) holds for $k=k_{0}$. This completes the proof of (49), which immediately implies (50) for any $k\geq 0$. This shows that $\Delta(x^{k})\leq\Delta(x^{0})\leq E_{3}\leq E(\lambda_{s}^{2}/(2L))$. By Lemma 7, $F(x^{k})-F(x_{\star})\leq\lambda_{s}^{2}/(2L)$, which shows that (46) holds for any $k\geq 0$. Therefore, the proof of Lemma 13 is complete. ∎

Notice that a property common to the conditions of parts (a) and (b) is that (46) holds for any k\geq 0. Based on this observation, Lemma 14 establishes another common property.

Lemma 14.

If (46) and Assumption 1 hold, then we have

λs2L(λs2L+x2)1Δ(xk)xkSt(xk)2,\displaystyle\frac{\lambda_{s}}{2L}\left(\frac{\lambda_{s}}{2L}+\|x_{\star}\|_{2}\right)^{-1}\Delta(x^{k})\leq\|x^{k}-S_{t}(x^{k})\|_{2}, (52)
12λsΔ(xk)Ft(xk;xk)Ft(St(xk);xk).\displaystyle{\frac{1}{2}}\lambda_{s}\Delta(x^{k})\leq F_{t}(x^{k};x^{k})-F_{t}\left(S_{t}(x^{k});x^{k}\right). (53)
Proof of Lemma 14.

For fixed k0k\geq 0, we define x~0=xk\tilde{x}^{0}=x^{k} and x~1=St(xk)\tilde{x}^{1}=S_{t}(x^{k}). Therefore x~0\tilde{x}^{0} and x~1\tilde{x}^{1} satisfy (8) (with kk replaced by 0 and xx replaced by x~\tilde{x}) with ρl=0\rho_{l}=0. Hence we have λsΔ(x~0)F(x~0)F(x)λs2/(2L)\lambda_{s}\Delta(\tilde{x}^{0})\leq F(\tilde{x}^{0})-F(x_{\star})\leq\lambda_{s}^{2}/(2L) and

F(x~0)F(x~1)\displaystyle F(\tilde{x}^{0})-F(\tilde{x}^{1})
=\displaystyle= F(x~0)F(x)(F(x~1)F(x))\displaystyle F(\tilde{x}^{0})-F(x_{\star})-\left(F(\tilde{x}^{1})-F(x_{\star})\right)
\displaystyle\geq F(x~0)F(x)12(F(x~0)F(x))\displaystyle F(\tilde{x}^{0})-F(x_{\star})-\frac{1}{2}\left(F(\tilde{x}^{0})-F(x_{\star})\right)
=\displaystyle= 12(F(x~0)F(x))\displaystyle\frac{1}{2}(F(\tilde{x}^{0})-F(x_{\star})) (54)

where the inequality is from (40) with ρl=0\rho_{l}=0. Moreover, from Assumption 1 we have

λsΔ(x~1)\displaystyle\lambda_{s}\Delta(\tilde{x}^{1}) F(x~1)F(x)(1/2)(F(x~0)F(x))\displaystyle\leq F(\tilde{x}^{1})-F(x_{\star})\leq(1/2)\left(F(\tilde{x}^{0})-F(x_{\star})\right)
λs2/(2L),\displaystyle\leq\lambda_{s}^{2}/(2L),

where the second inequality can be obtained by applying (40). Thus, we have max{Δ(x~0),Δ(x~1)}λs/(2L)\max\{\Delta(\tilde{x}^{0}),\Delta(\tilde{x}^{1})\}\leq\lambda_{s}/(2L). So, based on Lemma 6 and Assumption 1,

λsΔ(x~0)\displaystyle\lambda_{s}\Delta(\tilde{x}^{0}) F(x~0)F(x)2(F(x~0)F(x~1))\displaystyle\leq F(\tilde{x}^{0})-F(x_{\star})\leq 2\left(F(\tilde{x}^{0})-F(\tilde{x}^{1})\right)
2L(λs2L+x2)x~0x~12,\displaystyle\leq 2L\left(\frac{\lambda_{s}}{2L}+\|x_{\star}\|_{2}\right)\|\tilde{x}^{0}-\tilde{x}^{1}\|_{2}, (55)

where the second inequality is from (54), and the last inequality is from Lemma 6. (55) implies that

x~0x~12λs2L(λs2L+x2)1Δ(x~0).\|\tilde{x}^{0}-\tilde{x}^{1}\|_{2}\geq\frac{\lambda_{s}}{2L}\left(\frac{\lambda_{s}}{2L}+\|x_{\star}\|_{2}\right)^{-1}\Delta(\tilde{x}^{0}).

This proves (52). To prove (53), without loss of generality, we assume that Δ(x~0)=x~0x2\Delta(\tilde{x}^{0})=\|\tilde{x}^{0}-x_{\star}\|_{2}. Using (29) and the fact that Ft(x~0;x~0)=F(x~0)F_{t}(\tilde{x}^{0};\tilde{x}^{0})=F(\tilde{x}^{0}) and noticing that x~1=St(x~0)\tilde{x}^{1}=S_{t}(\tilde{x}^{0}) is the minimizer of Ft(;x~0)F_{t}(\cdot;\tilde{x}^{0}), we have

Ft(x~0;x~0)Ft(x~1;x~0)F(x~0)Ft(x;x~0)\displaystyle F_{t}(\tilde{x}^{0};\tilde{x}^{0})-F_{t}(\tilde{x}^{1};\tilde{x}^{0})\geq F(\tilde{x}^{0})-F_{t}(x_{\star};\tilde{x}^{0})
\displaystyle=F(\tilde{x}^{0})-F(x_{\star};\tilde{x}^{0})-\frac{L}{2}\Delta^{2}(\tilde{x}^{0})
F(x~0)F(x)LΔ2(x~0)\displaystyle{\geq F(\tilde{x}^{0})-F(x_{\star})-L\Delta^{2}(\tilde{x}^{0})}
λsΔ(x~0)LΔ2(x~0),\displaystyle\geq\lambda_{s}\Delta(\tilde{x}^{0})-L\Delta^{2}(\tilde{x}^{0}),

where the second inequality is due to (29), and the last inequality is due to Assumption 1. Using the fact that Δ(x~0)(F(x~0)F(x))/λsλs/(2L)\Delta(\tilde{x}^{0})\leq(F(\tilde{x}^{0})-F(x_{\star}))/\lambda_{s}\leq\lambda_{s}/(2L) based on Assumption 1, we have

Ft(x~0;x~0)Ft(x~1;x~0)(λsLλs2L)Δ(x~0)=λs2Δ(x~0),F_{t}(\tilde{x}^{0};\tilde{x}^{0})-F_{t}(\tilde{x}^{1};\tilde{x}^{0})\geq(\lambda_{s}-{L}\frac{\lambda_{s}}{2L})\Delta(\tilde{x}^{0})={\frac{\lambda_{s}}{2}}\Delta(\tilde{x}^{0}),

which proves (53). Therefore, the proof of Lemma 14 is complete. ∎

Now, we are ready to give an upper bound for JkJ_{k}.

Lemma 15.

Assume that Assumption 1 holds. For Algorithm 2 under either option (23) with \rho_{l}>0 or option (24) with \rho_{h}\in(0,1/4), for any k\in\mathbb{N}, if (46) holds, then the following statements hold.

  • (a)

    (Low Accuracy) When using option (23), we have

    Jkmax{M1Bk22Δ(xk),C}1.\displaystyle J_{k}\leq\left\lceil\max\left\{\frac{M_{1}\|B_{k}\|_{2}^{2}}{\Delta(x^{k})},C^{\prime}\right\}\right\rceil-1. (56)
  • (b)

    (High Accuracy) When using option (24), we have

    Jkmax{M2Bk22Δ2(xk),C}1.\displaystyle J_{k}\leq\left\lceil\max\left\{\frac{M_{2}\|B_{k}\|_{2}^{2}}{\Delta^{2}(x^{k})},C^{\prime}\right\}\right\rceil-1. (57)

Here, CC^{\prime} and C′′C^{\prime\prime} are the constants in Lemma 10 and M=λs2L(λs2L+x2)1M=\frac{\lambda_{s}}{2L}\left(\frac{\lambda_{s}}{2L}+\|x_{\star}\|_{2}\right)^{-1}, M1=2C′′tm(ρl+1)/(λsρl)M_{1}={2}C^{\prime\prime}tm(\rho_{l}+1)/(\lambda_{s}\rho_{l}) and M2=4C′′t2m(ρh+1)/(ρhM2).M_{2}=4C^{\prime\prime}t^{2}m(\rho_{h}+1)/(\rho_{h}M^{2}).

Proof of Lemma 15.

(a). By choosing

j=max{C,(2C′′tmBk22(ρl+1)ρlλsΔ(xk))}1,j=\left\lceil\max\left\{C^{\prime},\left(\frac{{2}C^{\prime\prime}tm\|B_{k}\|_{2}^{2}(\rho_{l}+1)}{\rho_{l}\lambda_{s}\Delta(x^{k})}\right)\right\}\right\rceil-1, (58)

we have

C′′tmBk22j+2λsρlΔ(xk)/(2+2ρl),\frac{C^{\prime\prime}tm\|B_{k}\|_{2}^{2}}{j+2}\leq\lambda_{s}\rho_{l}\Delta(x^{k})/({2}+{2}\rho_{l}), (59)

which, together with (45), yields that

Hk(zk(λaj+1))Dk(λaj+1)\displaystyle H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1})
\displaystyle\leq λsρlΔ(xk)/(2+2ρl)\displaystyle\lambda_{s}\rho_{l}\Delta(x^{k})/({2}+{2}\rho_{l})
\displaystyle\leq ρl1+ρl(Hk(0)minznHk(z)),\displaystyle\frac{\rho_{l}}{1+\rho_{l}}\left(H_{k}(0)-\min_{z\in\mathbb{R}^{n}}H_{k}(z)\right), (60)

where the last inequality is from (53). From (60) we have

Hk(zk(λaj+1))Dk(λaj+1)\displaystyle H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1})
\displaystyle\leq ρl(Hk(zk(λaj+1))+Dk(λaj+1)+Hk(0)minznHk(z))\displaystyle\rho_{l}\left(-H_{k}(z_{k}(\lambda_{a}^{j+1}))+D_{k}(\lambda_{a}^{j+1})+H_{k}(0)-\min_{z\in\mathbb{R}^{n}}H_{k}(z)\right)
\displaystyle\leq ρl(Hk(0)Hk(zk(λaj+1))),\displaystyle\rho_{l}\left(H_{k}(0)-H_{k}(z_{k}(\lambda_{a}^{j+1}))\right),

where the last inequality is due to (21). The resulting inequality is exactly (23). Therefore, (23) is satisfied no later than the inner iteration index j given in (58), so J_{k} cannot exceed this value, which proves (56) in Lemma 15(a).

(b). By choosing

j=max{C,(4C′′t2mBk22(ρh+1)M2ρhΔ2(xk))}1,j=\left\lceil\max\left\{C^{\prime},\left(\frac{4C^{\prime\prime}t^{2}m\|B_{k}\|_{2}^{2}(\rho_{h}+1)}{M^{2}\rho_{h}\Delta^{2}(x^{k})}\right)\right\}\right\rceil-1, (61)

we have

C′′tmBk22j+2M2ρhΔ2(xk)/(4t(1+ρh)),\frac{C^{\prime\prime}tm\|B_{k}\|_{2}^{2}}{j+2}\leq M^{2}\rho_{h}\Delta^{2}(x^{k})/\left(4t(1+\rho_{h})\right), (62)

which, together with (45), yields that

Hk(zk(λaj+1))Dk(λaj+1)\displaystyle H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1})
\displaystyle\leq M2ρhΔ2(xk)/(4t(1+ρh))\displaystyle M^{2}\rho_{h}\Delta^{2}(x^{k})/\left(4t(1+\rho_{h})\right)
\displaystyle\leq ρhxkSt(xk)22/(4t(1+ρh)),\displaystyle\rho_{h}\|x^{k}-S_{t}(x^{k})\|_{2}^{2}/\left(4t(1+\rho_{h})\right), (63)

where the last inequality is from (52). Denote zk=argminznHk(z)z_{k\star}=\operatorname*{arg\,min}_{z\in\mathbb{R}^{n}}H_{k}(z), which indicates that zk2=xkSt(xk)2\|z_{k\star}\|_{2}=\|x^{k}-S_{t}(x^{k})\|_{2}. We have

12tzk(λaj+1)zk22Hk(zk(λaj+1))Dk(λaj+1),\frac{1}{2t}\|z_{k}(\lambda_{a}^{j+1})-z_{k\star}\|_{2}^{2}\leq H_{k}(z_{k}(\lambda_{a}^{j+1}))-D_{k}(\lambda_{a}^{j+1}),

which follows from the fact that Hk()H_{k}(\cdot) is 1t\frac{1}{t}-strongly convex and (21). From (63), we have zk(λaj+1)zk22ρh/(2(1+ρh))zk22.\|z_{k}(\lambda_{a}^{j+1})-z_{k\star}\|_{2}^{2}\leq\rho_{h}/\left(2(1+\rho_{h})\right)\|z_{k\star}\|_{2}^{2}. So, by the Cauchy-Schwarz inequality, we have

ρh2tzk(λaj+1)22\displaystyle\frac{\rho_{h}}{2t}\|z_{k}(\lambda_{a}^{j+1})\|_{2}^{2} ρh4tzk22ρh2tzk(λaj+1)zk22\displaystyle\geq\frac{\rho_{h}}{4t}\|z_{k\star}\|_{2}^{2}-\frac{\rho_{h}}{2t}\|z_{k}(\lambda_{a}^{j+1})-z_{k\star}\|_{2}^{2}
ρhzk22/(4t(1+ρh)).\displaystyle\geq\rho_{h}\|z_{k\star}\|_{2}^{2}/\left(4t(1+\rho_{h})\right).

Together with (63), this shows that (24) is satisfied no later than the inner iteration index j given in (61), so J_{k} cannot exceed this value, which proves (57) in Lemma 15(b).

Therefore, the proof of Lemma 15 is complete. ∎
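Since (56) and (57) are explicit in the constants and in the current error \Delta(x^{k}), they can be evaluated directly once those constants are available. The following minimal sketch codes the two ceilings; all numerical values are hypothetical placeholders, as the constants C', C'' from Lemma 10 (and hence M_{1}, M_{2}) are not specified numerically.

```python
import math

# Hypothetical placeholders for the constants; the actual values come from
# Lemma 10 and the problem data.
def inner_bound_low(M1, Bk_norm, delta_k, C_prime):
    """Bound (56) on J_k under the low-accuracy option (23)."""
    return math.ceil(max(M1 * Bk_norm ** 2 / delta_k, C_prime)) - 1

def inner_bound_high(M2, Bk_norm, delta_k, C_prime):
    """Bound (57) on J_k under the high-accuracy option (24)."""
    return math.ceil(max(M2 * Bk_norm ** 2 / delta_k ** 2, C_prime)) - 1

if __name__ == "__main__":
    # The low-accuracy bound grows like 1/Delta(x^k); the high-accuracy bound
    # grows like 1/Delta(x^k)^2.
    for delta_k in (1e-1, 1e-2, 1e-3):
        print(delta_k,
              inner_bound_low(M1=2.0, Bk_norm=1.0, delta_k=delta_k, C_prime=10.0),
              inner_bound_high(M2=2.0, Bk_norm=1.0, delta_k=delta_k, C_prime=10.0))
```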

Next, we are ready to present the proof of Theorem 3.

Proof of Theorem 3.

We first prove part (a) of Theorem 3 and then part (b).

(a). Lemma 8 and (47) indicate that \sup_{k\in\mathbb{N}}\|B_{k}\|_{2}\leq B(\lambda_{s}/(2L)). Together with (46), (47), and (56), we obtain

JkM1B2(λs/(2L))/Δ(xk).J_{k}\leq M_{1}B^{2}(\lambda_{s}/(2L))/\Delta(x^{k}). (64)

We now prove the desired result. If \epsilon\geq\Delta(x^{0}), then J(\epsilon)=0 and (25) holds. If \epsilon<\Delta(x^{0}), we let K_{\epsilon} be the index such that \Delta(x^{k})>\epsilon for all 0\leq k\leq K_{\epsilon} and \Delta(x^{K_{\epsilon}+1})\leq\epsilon; that is, K_{\epsilon}+1 is the first iteration at which the error drops to \epsilon or below. From Lemma 6 we have

L(x2+λs/(2L))Δ(xk)\displaystyle L(\|x_{\star}\|_{2}+\lambda_{s}/(2L))\Delta(x^{k}) (65)
\displaystyle\geq F(xk)F(x)\displaystyle F(x^{k})-F(x_{\star})
\displaystyle\geq (F(xKϵ)F(x))(2+4ρl1+4ρl)Kϵk\displaystyle\left(F(x^{K_{\epsilon}})-F(x_{\star})\right)\left(\frac{2+4\rho_{l}}{1+4\rho_{l}}\right)^{K_{\epsilon}-k}
\displaystyle\geq λsΔ(xKϵ)(2+4ρl1+4ρl)Kϵk\displaystyle\lambda_{s}\Delta(x^{K_{\epsilon}})\left(\frac{2+4\rho_{l}}{1+4\rho_{l}}\right)^{K_{\epsilon}-k}
\displaystyle\geq ϵλs(2+4ρl1+4ρl)Kϵk, 0kKϵ.\displaystyle\epsilon\lambda_{s}\left(\frac{2+4\rho_{l}}{1+4\rho_{l}}\right)^{K_{\epsilon}-k},\ \forall\ 0\leq k\leq K_{\epsilon}.

Here we explain how these inequalities were obtained. The first inequality follows from Lemma 6 and (47). Since (46) holds, (40) holds for any kk\in\mathbb{N} and it implies the second inequality in (65). The third inequality in (65) follows from Assumption 1 and the last inequality in (65) follows from the fact that Δ(xKϵ)>ϵ\Delta(x^{K_{\epsilon}})>\epsilon.

From (65) we know that 0kKϵ\forall\ 0\leq k\leq K_{\epsilon}, it holds that

Δ(xk)ϵλsL(x2+λs/(2L))(2+4ρl1+4ρl)Kϵk.\Delta(x^{k})\geq\frac{\epsilon\lambda_{s}}{L(\|x_{\star}\|_{2}+\lambda_{s}/(2L))}\left(\frac{2+4\rho_{l}}{1+4\rho_{l}}\right)^{K_{\epsilon}-k}. (66)

Therefore, we have

J(ϵ)\displaystyle J(\epsilon) =k=0KϵJk\displaystyle=\sum_{k=0}^{K_{\epsilon}}J_{k}
k=0KϵM1B2(λs/(2L))/Δ(xk)\displaystyle\leq\sum_{k=0}^{K_{\epsilon}}M_{1}B^{2}(\lambda_{s}/(2L))/\Delta(x^{k})
k=0KϵM1B2(λs2L)L(x2+λs/(2L))λsϵ(2+4ρl1+4ρl)kKϵ\displaystyle\leq\sum_{k=0}^{K_{\epsilon}}\frac{M_{1}B^{2}(\frac{\lambda_{s}}{2L})L(\|x_{\star}\|_{2}+\lambda_{s}/(2L))}{\lambda_{s}\epsilon}\left(\frac{2+4\rho_{l}}{1+4\rho_{l}}\right)^{k-K_{\epsilon}}
\displaystyle=\frac{E_{2}}{\epsilon(2+4\rho_{l})}\sum_{k=0}^{K_{\epsilon}}\left(\frac{2+4\rho_{l}}{1+4\rho_{l}}\right)^{k-K_{\epsilon}}\leq\frac{E_{2}}{\epsilon},

where the first inequality follows from (64) and the second inequality follows from (66). This completes the proof of Theorem 3(a).
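The last two steps in the display above rest on an elementary geometric-series bound: for r=(2+4\rho_{l})/(1+4\rho_{l})>1, we have \sum_{k=0}^{K_{\epsilon}}r^{k-K_{\epsilon}}<r/(r-1)=2+4\rho_{l} for every K_{\epsilon}. The short numerical check below, with a hypothetical \rho_{l}, illustrates this fact.

```python
# Numerical check of the geometric-series step in the proof of Theorem 3(a),
# with a hypothetical rho_l: sum_{k=0}^{K} r^(k-K) < 2 + 4*rho_l.
rho_l = 0.25
r = (2 + 4 * rho_l) / (1 + 4 * rho_l)      # r = 1.5 for rho_l = 0.25
for K in (1, 5, 10, 20):
    s = sum(r ** (k - K) for k in range(K + 1))
    assert s < 2 + 4 * rho_l               # bound used to conclude J(eps) <= E_2 / eps
    print(K, s)
```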

(b). Lemma 8 and (50) indicate that \sup_{k\in\mathbb{N}}\|B_{k}\|_{2}\leq B(E_{0}/2). Together with (46), (49), and (57), we obtain

JkM2B2(E0/2)/Δ2(xk),k0,J_{k}\leq M_{2}B^{2}(E_{0}/2)/\Delta^{2}(x^{k}),\forall k\geq 0, (67)

which follows from our choice of E3B(E0/2)M2/CE_{3}\leq B(E_{0}/2)\sqrt{M_{2}/C^{\prime}} such that M2B2(E0/2)/Δ2(xk)M2B2(E0/2)/Δ2(x0)CM_{2}B^{2}(E_{0}/2)/\Delta^{2}(x^{k})\geq M_{2}B^{2}(E_{0}/2)/\Delta^{2}(x^{0})\geq C^{\prime}. Next, we prove the conclusion. We adopt the same definition of KϵK_{\epsilon} as in (a) and pick {ϵi}i=1={Δ(xi)}i=1\{\epsilon_{i}\}_{i=1}^{\infty}=\{\Delta(x^{i})\}_{i=1}^{\infty} so that Kϵi=i1K_{\epsilon_{i}}=i-1. From (51) we have

Δ(xi1)E0ϵi,i.\Delta(x^{i-1})\geq\sqrt{E_{0}\epsilon_{i}},\forall i\in\mathbb{N}. (68)

From (49) and (68) we have

Δ(xk)2i1kE0ϵi, 0ki1.\Delta(x^{k})\geq 2^{i-1-k}\sqrt{E_{0}\epsilon_{i}},\forall\ 0\leq k\leq i-1. (69)

Finally, for any ii\in\mathbb{N}, we have

J(ϵi)=k=0i1Jkk=0i1M2B2(E0/2)4i1kE0ϵi\displaystyle J(\epsilon_{i})=\sum_{k=0}^{i-1}J_{k}\leq\sum_{k=0}^{i-1}\frac{M_{2}B^{2}(E_{0}/2)}{4^{i-1-k}E_{0}\epsilon_{i}}
4M2B2(E0/2)3E0ϵi=E4ϵi.\displaystyle\leq\frac{4M_{2}B^{2}(E_{0}/2)}{3E_{0}\epsilon_{i}}=\frac{E_{4}}{\epsilon_{i}}.

Here, the first inequality follows from (67) and (69), and the second inequality follows from the fact that

k=0i114i1kk=i114i1k=43.\sum_{k=0}^{i-1}\frac{1}{4^{i-1-k}}\leq\sum_{k=-\infty}^{i-1}\frac{1}{4^{i-1-k}}=\frac{4}{3}.

This completes the proof of Theorem 3(b).

Therefore, the proof of Theorem 3 is complete. ∎

6 Conclusion

In this paper, we proposed a new inexact proximal linear algorithm for solving the robust phase retrieval problem. Our contribution lies in two adaptive stopping criteria for the subproblem of the proximal linear algorithm. We showed that the iteration complexity of our inexact proximal linear algorithm is of the same order as that of the exact proximal linear algorithm. Under the sharpness condition, we proved the local convergence of the proposed method. Moreover, we discussed how to use FISTA to solve the subproblem in our inexact proximal linear algorithm and analyzed the total oracle complexity for obtaining an \epsilon-optimal solution under the sharpness condition. Numerical results demonstrated the superior performance of the proposed methods over existing methods.

References

  • [1] J. Miao, P. Charalambous, J. Kirz, and D. Sayre, “Extending the methodology of x-ray crystallography to allow imaging of micrometre-sized non-crystalline specimens,” Nature, vol. 400, no. 6742, pp. 342–344, 1999.
  • [2] R. P. Millane, “Phase retrieval in crystallography and optics,” Journal of the Optical Society of America A, vol. 7, no. 3, pp. 394–411, 1990.
  • [3] A. Chai, M. Moscoso, and G. Papanicolaou, “Array imaging using intensity-only measurements,” Inverse Problems, vol. 27, no. 1, p. 015005, 2010.
  • [4] C. Fienup and J. Dainty, “Phase retrieval and image reconstruction for astronomy,” Image Recovery: Theory and Application, vol. 231, p. 275, 1987.
  • [5] J. Miao, T. Ishikawa, Q. Shen, and T. Earnest, “Extending x-ray crystallography to allow the imaging of noncrystalline materials, cells, and single protein complexes,” Annual Review of Physical Chemistry, vol. 59, pp. 387–410, 2008.
  • [6] M. Fickus, D. G. Mixon, A. A. Nelson, and Y. Wang, “Phase retrieval from very few measurements,” Linear Algebra and its Applications, vol. 449, pp. 475–499, 2014.
  • [7] E. J. Candes, X. Li, and M. Soltanolkotabi, “Phase retrieval via wirtinger flow: Theory and algorithms,” IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1985–2007, 2015.
  • [8] Y. Chen and E. J. Candès, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822–883, 2017.
  • [9] G. Wang, G. B. Giannakis, and Y. C. Eldar, “Solving systems of random quadratic equations via truncated amplitude flow,” IEEE Transactions on Information Theory, vol. 64, no. 2, pp. 773–794, 2017.
  • [10] H. Zhang, Y. Zhou, Y. Liang, and Y. Chi, “A nonconvex approach for phase retrieval: Reshaped wirtinger flow and incremental algorithms,” Journal of Machine Learning Research, vol. 18, 2017.
  • [11] J. C. Duchi and F. Ruan, “Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval,” Information and Inference: A Journal of the IMA, vol. 8, no. 3, pp. 471–529, 2019.
  • [12] H. Zhang, Y. Chi, and Y. Liang, “Provable non-convex phase retrieval with outliers: Median truncated wirtinger flow,” in International Conference on Machine Learning.   PMLR, 2016, pp. 1022–1031.
  • [13] D. Davis, D. Drusvyatskiy, K. J. MacPhee, and C. Paquette, “Subgradient methods for sharp weakly convex functions,” Journal of Optimization Theory and Applications, vol. 179, no. 3, pp. 962–982, 2018.
  • [14] A. S. Lewis and S. J. Wright, “A proximal method for composite minimization,” Mathematical Programming, vol. 158, no. 1, pp. 501–546, 2016.
  • [15] D. Drusvyatskiy and C. Paquette, “Efficiency of minimizing compositions of convex functions and smooth maps,” Mathematical Programming, vol. 178, no. 1, pp. 503–558, 2019.
  • [16] J. C. Duchi and F. Ruan, “Stochastic methods for composite and weakly convex optimization problems,” SIAM Journal on Optimization, vol. 28, no. 4, pp. 3229–3259, 2018.
  • [17] V. Charisopoulos, Y. Chen, D. Davis, M. Díaz, L. Ding, and D. Drusvyatskiy, “Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence,” Foundations of Computational Mathematics, pp. 1–89, 2021.
  • [18] V. Charisopoulos, D. Davis, M. Díaz, and D. Drusvyatskiy, “Composite optimization for robust blind deconvolution,” arXiv preprint arXiv:1901.01624, 2019.
  • [19] Z. Wang, B. Liu, S. Chen, S. Ma, L. Xue, and H. Zhao, “A manifold proximal linear method for sparse spectral clustering with application to single-cell RNA sequencing data analysis,” INFORMS Journal on Optimization, vol. 4, no. 2, pp. 200–214, 2022.
  • [20] Y. Nesterov, “On an approach to the construction of optimal methods of minimization of smooth convex functions,” Ekonomika i Matematicheskie Metody, vol. 24, no. 3, pp. 509–517, 1988.
  • [21] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM Journal on Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
  • [22] Y. Nesterov, “Gradient methods for minimizing composite functions,” Mathematical Programming, vol. 140, no. 1, pp. 125–161, 2013.
  • [23] M. Díaz and B. Grimmer, “Optimal convergence rates for the proximal bundle method,” SIAM Journal on Optimization, vol. 33, no. 2, pp. 424–454, 2023.
  • [24] S. Bonettini, I. Loris, F. Porta, and M. Prato, “Variable metric inexact line-search-based methods for nonsmooth optimization,” SIAM Journal on Optimization, vol. 26, no. 2, pp. 891–921, 2016.
  • [25] C.-p. Lee and S. J. Wright, “Inexact successive quadratic approximation for regularized optimization,” Computational Optimization and Applications, vol. 72, no. 3, pp. 641–674, 2019.
  • [26] ——, “Inexact variable metric stochastic block-coordinate descent for regularized optimization,” Journal of Optimization Theory and Applications, vol. 185, no. 1, pp. 151–187, 2020.
  • [27] N. Parikh and S. Boyd, “Block splitting for distributed optimization,” Mathematical Programming Computation, vol. 6, no. 1, pp. 77–102, 2014.
  • [28] C. Dünner, S. Forte, M. Takác, and M. Jaggi, “Primal-dual rates and certificates,” in International Conference on Machine Learning.   PMLR, 2016, pp. 783–792.