This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

11institutetext: Yao Li 22institutetext: Michigan State University
East Lansing, MI 48824, USA
[email protected]
33institutetext: Ming Yan 44institutetext: Michigan State University
East Lansing, MI 48824, USA
[email protected]

On the improved conditions for some primal-dual algorithms

Yao Li    Ming Yan
(Received: date / Accepted: date)
Abstract

The convex minimization of f(𝐱)+g(𝐱)+h(𝐀𝐱)f({\mathbf{x}})+g({\mathbf{x}})+h({\mathbf{A}}{\mathbf{x}}) over n\mathbb{R}^{n} with differentiable ff and linear operator 𝐀:nm{\mathbf{A}}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}, has been well-studied in the literature. By considering the primal-dual optimality of the problem, many algorithms are proposed from different perspectives such as monotone operator scheme and fixed point theory. In this paper, we start with a base algorithm to reveal the connection between several algorithms such as AFBA, PD3O and Chambolle-Pock. Then, we prove its convergence under a relaxed assumption associated with the linear operator and characterize the general constraint on primal and dual stepsizes. The result improves the upper bound of stepsizes of AFBA and indicates that Chambolle-Pock, as the special case of the base algorithm when f=0f=0, can take the stepsize of the dual iteration up to 4/34/3 of the previously proven one.

Keywords:
Primal-dual Asymmetric Forward–Backward-Adjoint splitting Primal–Dual Three-Operator splitting Chambolle-Pock
journal: Noname

1 Introduction

We consider the following minimization problem in the form of the sum of three functions

minimize𝐱nf(𝐱)+g(𝐱)+h(𝐀𝐱),\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{n}}\ f({\mathbf{x}})+g({\mathbf{x}})+h({\mathbf{A}}{\mathbf{x}}), (1)

where ff is a differentiable convex function with LL-Lipschitz continuous gradient and gg, hh are proper, closed and convex functions taking values in (,].(-\infty,\infty]. The third function hh is composited with a linear operator 𝐀n×m{\mathbf{A}}\in\mathbb{R}^{n\times m}, whose largest singular value is σ\sigma, i.e., σ=𝐀𝐀\sigma=\sqrt{\|{\mathbf{A}}{\mathbf{A}}^{\top}\|}.

We use the asterisk superscript to denote the Legendre-Fenchel conjugate function, e.g., h:mh^{*}:\mathbb{R}^{m}\rightarrow\mathbb{R} is defined as

h(𝐬)=sup𝐱m𝐬,𝐱h(𝐱).h^{*}({\mathbf{s}})=\sup_{{\mathbf{x}}\in\mathbb{R}^{m}}\ \langle{\mathbf{s}},{\mathbf{x}}\rangle-h({\mathbf{x}}).

Then, the saddle-point form of problem (1) is

min𝐱nmax𝐬mf(𝐱)+g(𝐱)+𝐀𝐱,𝐬h(𝐬)\min_{{\mathbf{x}}\in\mathbb{R}^{n}}\max_{{\mathbf{s}}\in\mathbb{R}^{m}}\ f({\mathbf{x}})+g({\mathbf{x}})+\langle{\mathbf{A}}{\mathbf{x}},{\mathbf{s}}\rangle-h^{*}({\mathbf{s}}) (2)

where 𝐱{\mathbf{x}} and 𝐬{\mathbf{s}} are primal and dual variables, respectively. With the existence of a solution pair (𝐱,𝐬)({\mathbf{x}}^{*},{\mathbf{s}}^{*}), the first-order optimal condition is characterized as

{0f(𝐱)+g(𝐱)+𝐀𝐬,0h(𝐬)𝐀𝐱,\left\{\begin{aligned} &0\in\nabla f({\mathbf{x}}^{*})+\partial g({\mathbf{x}}^{*})+{\mathbf{A}}^{\top}{\mathbf{s}}^{*},\\ &0\in\partial h^{*}({\mathbf{s}}^{*})-{\mathbf{A}}{\mathbf{x}}^{*},\end{aligned}\right.

where g(𝐱)={𝐯n|g(𝐲)g(𝐱)𝐯,𝐲𝐱,𝐲n}\partial g({\mathbf{x}})=\{{\mathbf{v}}\in\mathbb{R}^{n}\ |\ g({\mathbf{y}})-g({\mathbf{x}})\geq\langle{\mathbf{v}},{\mathbf{y}}-{\mathbf{x}}\rangle,\forall{\mathbf{y}}\in\mathbb{R}^{n}\} is the subdifferential of gg at 𝐱{\mathbf{x}}. The duality theorem (rockafellar1974conjugate, , Theorem 15) asserts that 𝐱{\mathbf{x}}^{*} is the solution of the problem (1) and 𝐬{\mathbf{s}}^{*} is the solution of the corresponding dual problem

minimize𝐬m(fg)(𝐀𝐬)+h(𝐬),\operatorname*{minimize}_{{\mathbf{s}}\in\mathbb{R}^{m}}\ (f^{*}\mathrel{\raisebox{-1.0pt}{\framebox(1.0,1.0)[]{$\scriptstyle$}}}g^{*})(-{\mathbf{A}}^{\top}{\mathbf{s}})+h^{*}({\mathbf{s}}), (3)

where fgf^{*}\mathrel{\raisebox{-1.0pt}{\framebox(1.0,1.0)[]{$\scriptstyle$}}}g^{*} is the infimal convolution of ff^{*} and gg^{*} that is defined as fg(𝐱)=inf𝐲nf(𝐲)+g(𝐱𝐲)f^{*}\mathrel{\raisebox{-1.0pt}{\framebox(1.0,1.0)[]{$\scriptstyle$}}}g^{*}({\mathbf{x}})=\inf_{{\mathbf{y}}\in\mathbb{R}^{n}}f^{*}({\mathbf{y}})+g^{*}({\mathbf{x}}-{\mathbf{y}}) with the property that (fg)=f+g.(f^{*}\mathrel{\raisebox{-1.0pt}{\framebox(1.0,1.0)[]{$\scriptstyle$}}}g^{*})^{*}=f+g. Note that when f=0f=0 (or g=0g=0), the infimal convolution fgf^{*}\mathrel{\raisebox{-1.0pt}{\framebox(1.0,1.0)[]{$\scriptstyle$}}}g^{*} boils down to gg^{*} (or ff^{*}).

Existing primal-dual algorithms to solve the above saddle-point problem (2) include Condat-Vu condat2013primal ; vu2013splitting , Primal–Dual Fixed-Point algorithm(PDFP) chen2016primal , Asymmetric Forward–Backward-Adjoint splitting(AFBA) latafat2017asymmetric and Primal–Dual Three-Operator splitting(PD3O) yan2018new . When f=0f=0, Condat-Vu and PD3O are reduced to Chambolle-Pock chambolle2011first . When g=0,g=0, all algorithms except Condat-Vu are reduced to Proximal Alternating Predictor–Corrector(PAPC) or Primal–Dual Fixed-Point algorithm based on the Proximity Operator(PDFP2O) loris2011generalization ; chen2013primal ; drori2015simple . The conditions on their stepsizes to guarantee the convergence is associated with the singular value of the linear operator σ\sigma and the Lipschitz constant LL. With notations ηp\eta_{\text{p}} for the primal stepsize and ηd\eta_{\text{d}} for the dual stepsize, Table 1 lists the conditions on stepsizes of the aforementioned algorithms and their connections when either ff or gg vanishes.

ηp,ηd\eta_{\text{p}},\eta_{\text{d}} f=0f=0 g=0g=0
Condat-Vu ηpηdσ2+ηpL/21\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}+\eta_{\text{p}}L/2\leq 1 C-P(primal)
PDFP ηpL/2<1,ηpηdσ21\eta_{\text{p}}L/2<1,\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1 PAPC
AFBA ηpηdσ2+ηpηdσ+ηpL2\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}+\sqrt{\eta_{\text{p}}\eta_{\text{d}}}\sigma+\eta_{\text{p}}L\leq 2 C-P(dual) PAPC
PD3O ηpL/2<1,ηpηdσ21\eta_{\text{p}}L/2<1,\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1 C-P(primal) PAPC
Chambolle-Pock ηpηdσ21\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1 f=0f=0
PAPC(PDFP2O) ηpL/2<1\eta_{\text{p}}L/2<1, ηpηdσ21\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1 g=0g=0
Table 1: The conditions on stepsizes for the primal-dual algorithms. C-P(primal) and C-P(dual) stand for Chambolle-Pock applied to the primal problem and the dual problem, respectively. In li2021new , PAPC can take a larger upper bound of ηpηdσ2\eta_{\text{p}}\eta_{\text{d}}\sigma^{2} up to 4/34/3.

The listed conditions in Table 1 are all sufficient to guarantee the convergence of the associated algorithms, and some cannot be relaxed any further. For PAPC, the first constraint, ηpL/2<1\eta_{\text{p}}L/2<1, cannot be relaxed since it reduces to gradient descent method with stepsize ηp\eta_{\text{p}} when there is only one ff, i.e., h=0h=0. However, it turns out that the upper bound of ηpηdσ2\eta_{\text{p}}\eta_{\text{d}}\sigma^{2} can be relaxed to 4/34/3. In  he2020optimal ; li2021linear , a special case of PAPC, linearized augumented Lagrangian method, and an application of PAPC on the decentralized optimization are shown to converge under the relaxed condition. The general convergence result of PAPC with relaxed condition is shown in li2021new . The 4/34/3 bound is also shown to be tight in he2020optimal ; li2021new . This result raises an open question, i.e., can the similar relaxation be applied to Chambolle-Pock? The follow-up question is can this result be generalized to algorithms for the general problem (1)?.

The first question is implicitly answered by some work. A generalized Chambolle-Pock is proposed in he2021generalized to achieve the relaxation and its equivalence to the canonical Chambolle-Pock is proved for the following primal-dual problem

min𝐱nmax𝐬mg(𝐱)+𝐀𝐱,𝐬𝐛,𝐬\min_{{\mathbf{x}}\in\mathbb{R}^{n}}\max_{{\mathbf{s}}\in\mathbb{R}^{m}}\ g({\mathbf{x}})+\langle{\mathbf{A}}{\mathbf{x}},{\mathbf{s}}\rangle-\langle{\mathbf{b}},{\mathbf{s}}\rangle

which covers the general form (2) with linear hh^{*}. The necessity of the condition is also shown by a simple case.

In he2020optimally , a relaxed linearization parameter is considered in the linearized Alternating Direction Method of Multipliers(L-ADMM) so that a larger stepsize can be used in the linearized subproblem to converge faster. L-ADMM considers the following linearly constrained problem,

minimize𝐱n1,𝐲n2\displaystyle\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{n_{1}},{\mathbf{y}}\in\mathbb{R}^{n_{2}}} f~(𝐱)+g~(𝐲)\displaystyle\ \ \widetilde{f}({\mathbf{x}})+\widetilde{g}({\mathbf{y}}) (4)
s.t. 𝐀~𝐱+𝐁~𝐲=𝐛\displaystyle\ \ \widetilde{{\mathbf{A}}}{\mathbf{x}}+\widetilde{{\mathbf{B}}}{\mathbf{y}}={\mathbf{b}}

where f~:n1\widetilde{f}:\mathbb{R}^{n_{1}}\rightarrow\mathbb{R}, g~:n2\widetilde{g}:\mathbb{R}^{n_{2}}\rightarrow\mathbb{R} are proper, closed and convex, 𝐀~m×n1\widetilde{{\mathbf{A}}}\in\mathbb{R}^{m\times n_{1}}, 𝐁~m×n2\widetilde{{\mathbf{B}}}\in\mathbb{R}^{m\times n_{2}} are two linear operators and 𝐛m{\mathbf{b}}\in\mathbb{R}^{m}. When f~=g\widetilde{f}=g^{*} and g~=h\widetilde{g}=h^{*}, the dual problem can be formulated via Lagrangian multiplier as

minimize𝐬m𝐛,𝐬+g(𝐀~𝐬)+h(𝐁~𝐬).\operatorname*{minimize}_{{\mathbf{s}}\in\mathbb{R}^{m}}\ -\langle{\mathbf{b}},{\mathbf{s}}\rangle+g(\widetilde{{\mathbf{A}}}^{\top}{\mathbf{s}})+h(\widetilde{{\mathbf{B}}}^{\top}{\mathbf{s}}). (5)

A special case of the above dual problem (5) is 𝐛=𝟎{\mathbf{b}}=\mathbf{0}, 𝐀~=𝐈\widetilde{{\mathbf{A}}}={\mathbf{I}}, and 𝐁~=𝐀\widetilde{{\mathbf{B}}}^{\top}={\mathbf{A}}. In this case, it reduces to the problem (1) with f=0.f=0. It can be easily shown that L-ADMM applied to this special case of the problem (4) is equivalent to Chambolle-Pock applied to the corresponeding dual problem. Therefore, the result in he2020optimally suggests the improved condition on stepsizes of Chambolle-Pock, ηpηdσ2<43.\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}<\frac{4}{3}. When 𝐛0{\mathbf{b}}\not=0 in the special case, the result in he2020optimally also indicates that the relaxed conditon is extensible to the problem (1) with a linear function ff.

However, there is no result for general Lipschitz smooth ff in the literature. This work will address the second question for AFBA and PD3O by characterizing a more general upper bound for ηp\eta_{\text{p}} and ηd\eta_{\text{d}}, and directly give an affirmative answer to the first question.

Throughout the rest of the paper, we assume that there exists a solution pair (𝐱,𝐬)({\mathbf{x}}^{*},{\mathbf{s}}^{*}) of the problem (2). We use (𝐈+λg)1(𝐱)({\mathbf{I}}+\lambda\partial g)^{-1}({\mathbf{x}}) to represent the proximal operator of gg at 𝐱{\mathbf{x}}, which is defined as

𝐩𝐫𝐨𝐱λg(𝐱)argmin𝐲ng(𝐲)+12λ𝐲𝐱2.\mathbf{prox}_{\lambda g}({\mathbf{x}})\coloneqq\mathop{\arg\min}_{{\mathbf{y}}\in\mathbb{R}^{n}}\ g({\mathbf{y}})+\frac{1}{2\lambda}\|{\mathbf{y}}-{\mathbf{x}}\|^{2}.

We use ,\langle\cdot,\cdot\rangle and \|\cdot\| as the standard inner product and norm defined over n\mathbb{R}^{n}, and denote 𝐌=,𝐌()\|\cdot\|_{\mathbf{M}}=\sqrt{\langle\cdot,{\mathbf{M}}(\cdot)\rangle} as the (semi)norm associated with the given positive (semi)definite matrix 𝐌n×n{\mathbf{M}}\in\mathbb{R}^{n\times n}.

The rest of this paper is organized as follows. We first derive a base algorithm that is later shown to be an equivalent form of AFBA and illustrate the connection between AFBA, PD3O and Chambolle-Pock in Section 2. We then prove the convergence of the base algorithm under the relaxed condition associated with the matrix 𝐀{\mathbf{A}} in Section 3. In the next section, we numerically show the performance of the algorithms under the proved weaker condition.

2 The Base Algorithm

For simplicity, we set ηp=r\eta_{\text{p}}=r and ηd=λ/r\eta_{\text{d}}=\lambda/r. That is, the product of the primal and dual stepsizes is λ\lambda. The optimality condition of problem (2) can be reformulated as the following monotone inclusion problem

[00][f(𝐱)0]+[g𝐀𝐀h][𝐱𝐬].\begin{bmatrix}0\\ 0\end{bmatrix}\in\begin{bmatrix}\nabla f({\mathbf{x}})\\ 0\end{bmatrix}+\begin{bmatrix}\partial g&{\mathbf{A}}^{\top}\\ -{\mathbf{A}}&\partial h^{*}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}. (6)

This problem is also equivalent to

[𝐈r2λ𝐈r2𝐀𝐀][𝐱𝐬][rf(𝐱)0]+[𝐈+rgr𝐀r𝐀r2λ𝐈r2𝐀𝐀+rh][𝐱𝐬].\begin{bmatrix}{\mathbf{I}}&\\ &\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}\in\begin{bmatrix}r\nabla f({\mathbf{x}})\\ 0\end{bmatrix}+\begin{bmatrix}{\mathbf{I}}+r\partial g&r{\mathbf{A}}^{\top}\\ -r{\mathbf{A}}&\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}+r\partial h^{*}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}.

Let ζ𝐱rf(𝐱)rg(𝐱)\zeta\in{\mathbf{x}}-r\nabla f({\mathbf{x}})-r\partial g({\mathbf{x}}), then the above form can be decomposed as

[𝐈r2λ𝐈r2𝐀𝐀][ζ𝐬][𝐈r𝐀r𝐀r2λ𝐈r2𝐀𝐀+rh]\displaystyle\begin{bmatrix}{\mathbf{I}}&\\ &\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}\end{bmatrix}\begin{bmatrix}\zeta\\ {\mathbf{s}}\end{bmatrix}\in\begin{bmatrix}{\mathbf{I}}&r{\mathbf{A}}^{\top}\\ -r{\mathbf{A}}&\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}+r\partial h^{*}\end{bmatrix} [𝐱𝐬],\displaystyle\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}, (7a)
(𝐈+rg)1(2𝐱ζrf(𝐱))+ζ𝐱=\displaystyle({\mathbf{I}}+r\partial g)^{-1}(2{\mathbf{x}}-\zeta-r\nabla f({\mathbf{x}}))+\zeta-{\mathbf{x}}= ζ.\displaystyle\ \zeta. (7b)

Plug (𝐱k+1,𝐬k+1,ζk+1)({\mathbf{x}}^{k+1},{\mathbf{s}}^{k+1},\zeta^{k+1}) into the right-hand side of (7) and let the rest ζ,𝐱,𝐬\zeta,{\mathbf{x}},{\mathbf{s}} on the left-hand side of (7) be the variables in kkth iteration (𝐱k,𝐬k,ζk)({\mathbf{x}}^{k},{\mathbf{s}}^{k},\zeta^{k}). Then, we can solve for the next-step variables of the system (7) in the 𝐬𝐱ζ{\mathbf{s}}\mathchar 45\relax{\mathbf{x}}\mathchar 45\relax\zeta order by Gaussian elimination. The base algorithm follows

𝐬k+1\displaystyle{\mathbf{s}}^{k+1} =(𝐈+λrh)1(λr𝐀ζk+(𝐈λ𝐀𝐀)𝐬k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}), (8a)
𝐱k+1\displaystyle{\mathbf{x}}^{k+1} =ζkr𝐀𝐬k+1,\displaystyle=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}, (8b)
ζk+1\displaystyle\zeta^{k+1} =(𝐈+rg)1(𝐱k+1r𝐀𝐬k+1rf(𝐱k+1))𝐱k+1+ζk,\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}-r\nabla f({\mathbf{x}}^{k+1}))-{\mathbf{x}}^{k+1}+\zeta^{k}, (8c)

where rr and λ/r\lambda/r are primal and dual stepsizes, respectively.

In Section 3, we will show the convergence of the algorithm when r<4θ32θ12Lr<\frac{4\theta-3}{2\theta-1}\frac{2}{L} and λ1θσ2\lambda\leq\frac{1}{\theta\sigma^{2}} with any θ(3/4,1].\theta\in(3/4,1]. Notice that, when θ=1,\theta=1, the conditions on rr and λ\lambda is reduced to r<2Lr<\frac{2}{L} and λ1σ2\lambda\leq\frac{1}{\sigma^{2}}. As listed in Table 1, this condition is the sufficient condition on stepizes for PDFP shown in chen2016primal and PD3O shown in yan2018new , which is apparently weaker than that for Condat-Vu shown in condat2013primal ; vu2013splitting and AFBA shown in latafat2017asymmetric . This general form of the upper bound gives a larger range of λ\lambda and accordingly narrows the range of values of rr.

However, the compromise on rr will disappear when ff is linear (L=0L=0). In this case, rr can take any value in (0,)(0,\infty) and consequently, λ\lambda can take value as large as 43σ2\frac{4}{3\sigma^{2}}. We will later show in Section 2.3 that the base algorithm recovers Chambolle-Pock applied to the dual problem (3) with f=0f=0. The range on λ\lambda is significantly better than the previously used for Chambolle-Pock, which is λ1σ2.\lambda\leq\frac{1}{\sigma^{2}}. Therefore, the first question is answered.

Next, we will discuss the connection of the base algorithm with AFBA, PD3O ,Chambolle-Pock and PAPC(PDFP2O). We say an algorithm is equivalent to another one if and only if we can find an one-one map between the sequences generated by two algorithms with appropriate initialization.

2.1 Connection with AFBA

AFBA is a scheme for general monotone inclusion problem of three-operator splitting. The special form (latafat2017asymmetric, , Algorithm 55) used to solve the problem (2) is conducted as

𝐬1k+1\displaystyle{\mathbf{s}}_{1}^{k+1} =(𝐈+λrh)1(𝐬1k+λr𝐀𝐱¯1k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{1}^{k}+\frac{\lambda}{r}{\mathbf{A}}\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}), (9a)
𝐱1k+1\displaystyle{\mathbf{x}}_{1}^{k+1} =𝐱¯1kr𝐀(𝐬1k+1𝐬1k),\displaystyle=\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{1}^{k+1}-{\mathbf{s}}_{1}^{k}), (9b)
𝐱¯1k+1\displaystyle\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k+1} =(𝐈+rg)1(𝐱1k+1r𝐀𝐬1k+1rf(𝐱1k+1)).\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{1}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}-r\nabla f({\mathbf{x}}_{1}^{k+1})). (9c)

We will show that AFBA is equivalent to the base algorithm if the relation

[𝐬1k𝐱1k𝐱¯1k]=[𝐬k𝐱kζkr𝐀𝐬k],[𝐬k𝐱kζk]=[𝐬1k𝐱1k𝐱¯1k+r𝐀𝐬1k]\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ {\mathbf{x}}_{1}^{k}\\ \mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k}\end{bmatrix},\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ {\mathbf{x}}_{1}^{k}\\ \mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k}\end{bmatrix} (10)

holds.

Define ζk+1=𝐱¯1k+1+r𝐀𝐬1k+1\zeta^{k+1}=\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k+1}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}, then (9a) and (9b) are reformulated by canceling 𝐱¯\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu as

𝐬1k+1\displaystyle{\mathbf{s}}_{1}^{k+1} =(𝐈+λrh)1(𝐬1k+λr𝐱¯1k)=(𝐈+λrh)1(λr𝐀ζk+(𝐈λ𝐀𝐀)𝐬1k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{1}^{k}+\frac{\lambda}{r}\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k})=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}_{1}^{k}),
𝐱1k+1\displaystyle{\mathbf{x}}_{1}^{k+1} =ζkr𝐀𝐬1kr𝐀(𝐬1k+1𝐬1k)=ζkr𝐀𝐬1k+1.\displaystyle=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{1}^{k+1}-{\mathbf{s}}_{1}^{k})=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}.

The update in (9c) gives the update of ζ\zeta as

ζk+1\displaystyle\zeta^{k+1} =𝐱¯1k+1+r𝐀𝐬1k+1\displaystyle=\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k+1}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}
=𝐱¯1k+1+ζk𝐱1k+1\displaystyle=\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k+1}+\zeta^{k}-{\mathbf{x}}_{1}^{k+1}
=(𝐈+rg)1(𝐱1k+1r𝐀𝐬1k+1rf(𝐱1k+1))+ζk𝐱1k+1.\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{1}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}-r\nabla f({\mathbf{x}}_{1}^{k+1}))+\zeta^{k}-{\mathbf{x}}_{1}^{k+1}.

Hence, the sequence {(𝐬1k,𝐱1k)}\{({\mathbf{s}}_{1}^{k},{\mathbf{x}}_{1}^{k})\} generated by AFBA coincides with the sequence {(𝐬k,𝐱k)}\{({\mathbf{s}}^{k},{\mathbf{x}}^{k})\} generated by the base algorithm with the same initialization. The reverse direction can be verified by (10) in the same way, hence the base algorithm is equivalent to AFBA applied to the problem (2).

From (latafat2017asymmetric, , Proposition 5.3.5.3.), (𝐱k,𝐬k)({\mathbf{x}}^{k},{\mathbf{s}}^{k}) generated by AFBA will converge to a saddle point solution (𝐱,𝐬)({\mathbf{x}}^{*},{\mathbf{s}}^{*}) when λσ22+λσ2+rL21.\frac{\lambda\sigma^{2}}{2}+\frac{\sqrt{\lambda}\sigma}{2}+\frac{rL}{2}\leq 1. The theoretical result of the base algorithm improves the condition to r4θ32θ12Lr\leq\frac{4\theta-3}{2\theta-1}\frac{2}{L} and λ1θσ2\lambda\leq\frac{1}{\theta\sigma^{2}} for any θ(34,1].\theta\in(\frac{3}{4},1]. Furthermore, when θ=1,\theta=1, the relaxed condition coincides with the one required for PDFP and PD3O.

2.2 Connection with PD3O

Applying PD3O to the dual problem (3), we get the following iteration,

𝐬2k+1\displaystyle{\mathbf{s}}_{2}^{k+1} =(𝐈+λrh)1(𝐳2k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{z}}_{2}^{k}), (11a)
𝐱2k+1\displaystyle{\mathbf{x}}_{2}^{k+1} =(𝐈+rg)1((𝐈λ𝐀𝐀)𝐱2krf(𝐱2k)r𝐀(2𝐬2k+1𝐳2k)),\displaystyle=({\mathbf{I}}+r\partial g)^{-1}(({\mathbf{I}}-\lambda{\mathbf{A}}^{\top}{\mathbf{A}}){\mathbf{x}}_{2}^{k}-r\nabla f({\mathbf{x}}_{2}^{k})-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{2}^{k+1}-{\mathbf{z}}_{2}^{k})), (11b)
𝐳2k+1\displaystyle{\mathbf{z}}_{2}^{k+1} =𝐬2k+1+λr𝐀𝐱2k+1.\displaystyle={\mathbf{s}}_{2}^{k+1}+\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{2}^{k+1}. (11c)

We can easily verify that the above iteration is equivalent to AFBA if

[𝐬2k𝐱2k𝐳2k]=[𝐬1k𝐱¯1k𝐬1k+λr𝐀𝐱¯1k],[𝐬1k𝐱1k𝐱¯1k]=[𝐬2k𝐱2k1r𝐀(𝐬2k𝐬2k1)𝐱2k].\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}_{2}^{k}\\ {\mathbf{z}}_{2}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ \mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}\\ {\mathbf{s}}_{1}^{k}+\frac{\lambda}{r}{\mathbf{A}}\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}\end{bmatrix},\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ {\mathbf{x}}_{1}^{k}\\ \mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}^{k-1}_{2}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{2}^{k}-{\mathbf{s}}_{2}^{k-1})\\ {\mathbf{x}}_{2}^{k}\end{bmatrix}. (12)

Combining (11a) and (11c), we have

𝐬2k+1=(𝐈+λrh)1(𝐬2k+λr𝐀𝐱2k).{\mathbf{s}}_{2}^{k+1}=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{2}^{k}+\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{2}^{k}).

The equation  (11b) is reformulated by cancelling 𝐳2{\mathbf{z}}_{2} as

𝐱2k+1\displaystyle{\mathbf{x}}_{2}^{k+1} =(𝐈+rg)1((𝐈λ𝐀𝐀)𝐱2krf(𝐱2k)r𝐀(2𝐬2k+1𝐬2kλr𝐀𝐱2k))\displaystyle=({\mathbf{I}}+r\partial g)^{-1}(({\mathbf{I}}-\lambda{\mathbf{A}}^{\top}{\mathbf{A}}){\mathbf{x}}_{2}^{k}-r\nabla f({\mathbf{x}}_{2}^{k})-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{2}^{k+1}-{\mathbf{s}}_{2}^{k}-\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{2}^{k}))
=(𝐈+rg)1(𝐱2krf(𝐱2k)r𝐀(2𝐬2k+1𝐬2k)).\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{2}^{k}-r\nabla f({\mathbf{x}}_{2}^{k})-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{2}^{k+1}-{\mathbf{s}}_{2}^{k})).

By defining 𝐱1k+1=𝐱2kr𝐀(𝐬2k+1𝐬2k){\mathbf{x}}_{1}^{k+1}={\mathbf{x}}_{2}^{k}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}_{2}-{\mathbf{s}}_{2}^{k}), we observe that the sequence {(𝐬2k,𝐱2k)}\{({\mathbf{s}}_{2}^{k},{\mathbf{x}}_{2}^{k})\} generated by PD3O coincides with the sequence {(𝐬1k,𝐱¯1k}\{({\mathbf{s}}_{1}^{k},\mkern 1.5mu\overline{\mkern-1.5mu{\mathbf{x}}\mkern-1.5mu}\mkern 1.5mu_{1}^{k}\} with the same initialization. The reverse direction also holds from (12).

Due to the equivalence between AFBA and the base algorithm, PD3O is also equivalent to the base algorithm following the sequence relation

[𝐬2k𝐱2k𝐳2k]=[𝐬kζkr𝐀𝐬k(𝐈λ𝐀𝐀)𝐬k+λr𝐀ζk],[𝐬k𝐱kζk]=[𝐬2k𝐱2k1r𝐀(𝐬2k𝐬2k1)𝐱2k+r𝐀𝐬2k].\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}_{2}^{k}\\ {\mathbf{z}}_{2}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}^{k}\\ \zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k}\\ ({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}+\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}\end{bmatrix},\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}^{k-1}_{2}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{2}^{k}-{\mathbf{s}}_{2}^{k-1})\\ {\mathbf{x}}_{2}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{2}^{k}\end{bmatrix}. (13)

2.3 Connection with Chambolle-Pock

When the smooth function ff vanishes, the dual problem (3) boils down to

minimize𝐬mg(𝐀𝐬)+h(𝐬).\operatorname*{minimize}_{{\mathbf{s}}\in\mathbb{R}^{m}}\ g^{*}(-{\mathbf{A}}^{\top}{\mathbf{s}})+h^{*}({\mathbf{s}}). (14)

Applying Chambolle-Pock to it, we get

𝐬3k+1\displaystyle{\mathbf{s}}_{3}^{k+1} =(𝐈+λrh)1(𝐬3k+λr𝐀𝐱3k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{3}^{k}+\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{3}^{k}), (15a)
𝐱3k+1\displaystyle{\mathbf{x}}_{3}^{k+1} =(𝐈+rg)1(𝐱3kr𝐀(2𝐬3k+1𝐬3k)).\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{3}^{k}-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{3}^{k+1}-{\mathbf{s}}_{3}^{k})). (15b)

The relation between Chambolle-Pock and the base algorithm can be observed either from the relation between PD3O and Chambolle-Pock or from that the sequence {(𝐬3k,𝐱3k)}\{({\mathbf{s}}_{3}^{k},{\mathbf{x}}_{3}^{k})\} generated by Chambolle-Pock coincides with the sequence {(𝐬k,ζkr𝐀𝐬k)}\{({\mathbf{s}}^{k},\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k})\} generated by the base algorithm with appropriate initialization.

The sequence relation is

[𝐬3k𝐱3k]=[𝐬kζkr𝐀𝐬k],[𝐬k𝐱kζk]=[𝐬3k𝐱3k1r𝐀(𝐬3k𝐬3k1)𝐱3k+r𝐀𝐬3k].\begin{bmatrix}{\mathbf{s}}_{3}^{k}\\ {\mathbf{x}}_{3}^{k}\\ \end{bmatrix}=\begin{bmatrix}{\mathbf{s}}^{k}\\ \zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k}\\ \end{bmatrix},\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{3}^{k}\\ {\mathbf{x}}^{k-1}_{3}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{3}^{k}-{\mathbf{s}}_{3}^{k-1})\\ {\mathbf{x}}_{3}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{3}^{k}\end{bmatrix}. (16)

In this case, the relaxed condition on stepsizes of the base algorithm is r>0r>0 and λ<43σ2,\lambda<\frac{4}{3\sigma^{2}}, which extends the choice of λ\lambda to a larger range for Chambolle-Pock. This bound is shown to be tight in Section 3.2.

2.4 Connection with PAPC(PDFP2O)

Although in Table 1, AFBA and PD3O are reduced to PAPC when g=0g=0, and we have shown the equivalence between the base algorithm and them, we now directly derive it for completeness.

With g=0g=0, the iteration of the base algorithm reduces to

𝐬k+1\displaystyle{\mathbf{s}}^{k+1} =(𝐈+λrh)1(λr𝐀ζk+(𝐈λ𝐀𝐀)𝐬k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}), (17a)
𝐱k+1\displaystyle{\mathbf{x}}^{k+1} =ζkr𝐀𝐬k+1,\displaystyle=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}, (17b)
ζk+1\displaystyle\zeta^{k+1} =𝐱k+1rf(𝐱k+1).\displaystyle={\mathbf{x}}^{k+1}-r\nabla f({\mathbf{x}}^{k+1}). (17c)

Let 𝐬4k=𝐬k{\mathbf{s}}_{4}^{k}={\mathbf{s}}^{k} and 𝐱4k=𝐱k{\mathbf{x}}_{4}^{k}={\mathbf{x}}^{k}. After canceling ζ\zeta, we get

𝐬4k+1\displaystyle{\mathbf{s}}_{4}^{k+1} =(𝐈+λrh)1(λr𝐀(𝐱4krf(𝐱4k))+(𝐈λ𝐀𝐀)𝐬4k),\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}({\mathbf{x}}_{4}^{k}-r\nabla f({\mathbf{x}}_{4}^{k}))+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}_{4}^{k}), (18a)
𝐱4k+1\displaystyle{\mathbf{x}}_{4}^{k+1} =𝐱4krf(𝐱4k)r𝐀𝐬4k+1.\displaystyle={\mathbf{x}}_{4}^{k}-r\nabla f({\mathbf{x}}_{4}^{k})-r{\mathbf{A}}^{\top}{\mathbf{s}}_{4}^{k+1}. (18b)

With appropriation initialization, i.e., ζ0=𝐱0rf(𝐱0),\zeta^{0}={\mathbf{x}}^{0}-r\nabla f({\mathbf{x}}^{0}), the above iteration is exactly PAPC applied to the problem (1). However, the relaxed condition, r<4θ32θ12Lr<\frac{4\theta-3}{2\theta-1}\frac{2}{L} and λ<1θσ2(𝐀)\lambda<\frac{1}{\theta\sigma^{2}({\mathbf{A}})} for θ(3/4,1]\theta\in(3/4,1] cannot achieve the tight bound shown in li2021new . One of the possible reason is that we cannot express 𝐀𝐬{\mathbf{A}}^{\top}{\mathbf{s}} as a combination of 𝐱{\mathbf{x}} and f(𝐱)\nabla f({\mathbf{x}}) like (18b) when nontrivial gg exists. This will be left for future research.

2.5 Relation Diagram

The relation between aforementioned algorithms is visualized via the following diagram where the algorithms in boxed are applied to the dual problem (3) with ηp=λ/r\eta_{\text{p}}=\lambda/r and ηd=r\eta_{\text{d}}=r.

PAPC(PDFP2O)(18)The Base Algorithm(8)AFBA(9)PD3O(11){\boxed{\textbf{PD3O\eqref{pd3o}}}}Chambolle-Pock(15){\boxed{\textbf{Chambolle-Pock\eqref{cp}}}}g=0\scriptstyle{g=0}f=0\scriptstyle{f=0}(16)\scriptstyle{\eqref{base_cp}}(10)\scriptstyle{\eqref{base_afba}}(13)\scriptstyle{\eqref{base_pd3o}}(12)\scriptstyle{\eqref{afba_pd3o}}

The theoretical analysis of the base algorithm in the next section provides a unified proof of the convergence for the above algorithms under the relaxed conditions on rr and λ.\lambda. An example is given in Section 3.2 to show the necessary condition of rr and λ\lambda.

3 Convergence Analysis

3.1 Convergence under a relaxed condition

We first define the following auxiliary variables associated with the proximal mapping of hh^{*} and gg

𝐲k+1\displaystyle{\mathbf{y}}^{k+1} =(𝐈+rg)1(𝐱k+1r𝐀𝐬k+1rf(𝐱k+1)),\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}-r\nabla f({\mathbf{x}}^{k+1})), (19a)
𝐪gk+1\displaystyle{\mathbf{q}}_{g}^{k+1} =1r𝐱k+1𝐀𝐬k+1f(𝐱k+1)1r𝐲k+1,\displaystyle=\frac{1}{r}{\mathbf{x}}^{k+1}-{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}-\nabla f({\mathbf{x}}^{k+1})-\frac{1}{r}{\mathbf{y}}^{k+1}, (19b)
𝐪hk+1\displaystyle{\mathbf{q}}_{h}^{k+1} =𝐀ζk+rλ(𝐈λ𝐀𝐀)𝐬krλ𝐬k+1.\displaystyle={\mathbf{A}}\zeta^{k}+\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}-\frac{r}{\lambda}{\mathbf{s}}^{k+1}. (19c)

With the existence of the saddle-point solution (𝐱,𝐬),({\mathbf{x}}^{*},{\mathbf{s}}^{*}), we will see the quadruple (ζk+1,𝐲k+1,𝐪gk+1,𝐪hk+1)(\zeta^{k+1},{\mathbf{y}}^{k+1},{\mathbf{q}}_{g}^{k+1},{\mathbf{q}}_{h}^{k+1}) converges to

ζ\displaystyle\zeta^{*} =𝐱+r𝐀𝐬,\displaystyle={\mathbf{x}}^{*}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{*}, (20a)
𝐲\displaystyle{\mathbf{y}}^{*} =𝐱,\displaystyle={\mathbf{x}}^{*}, (20b)
𝐪g\displaystyle{\mathbf{q}}_{g}^{*} =𝐀𝐬f(𝐱),\displaystyle=-{\mathbf{A}}^{\top}{\mathbf{s}}^{*}-\nabla f({\mathbf{x}}^{*}), (20c)
𝐪h\displaystyle{\mathbf{q}}_{h}^{*} =𝐀ζr𝐀𝐀𝐬=𝐀𝐱.\displaystyle={\mathbf{A}}\zeta^{*}-r{\mathbf{A}}{\mathbf{A}}^{\top}{\mathbf{s}}^{*}={\mathbf{A}}{\mathbf{x}}^{*}. (20d)

Though the relation between the base algorithm, AFBA and PD3O asserts the fixed point of iteration (8) solves the problem (2), we restate it in the following lemma for completeness.

Lemma 1 (Optimality)

Let (𝐬,𝐱,ζ)({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}) be the fixed point of the iteration (8), then (𝐬,𝐱)({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}) is the solution of the problem (2) with ζ=𝐱+r𝐀𝐬.\zeta^{\star}={\mathbf{x}}^{\star}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{\star}.

Proof

Iteration (8b) gives the relation ζ=𝐱+r𝐀𝐬.\zeta^{\star}={\mathbf{x}}^{\star}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{\star}. From (8a) and (8c), we have

𝐬+λrh(𝐬)\displaystyle{\mathbf{s}}^{\star}+\frac{\lambda}{r}\partial h^{*}({\mathbf{s}}^{\star}) =λr𝐀ζ+(𝐈λ𝐀𝐀)𝐬,\displaystyle=\frac{\lambda}{r}{\mathbf{A}}\zeta^{\star}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{\star},
𝐱+rg(𝐱)\displaystyle{\mathbf{x}}^{\star}+r\partial g({\mathbf{x}}^{\star}) =𝐱r𝐀𝐬rf(𝐱).\displaystyle={\mathbf{x}}^{\star}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{\star}-r\nabla f({\mathbf{x}}^{\star}).

Canceling ζ\zeta^{*}, we have

[h𝐀𝐀g+f][𝐬𝐱]=[00],\begin{bmatrix}\partial h^{*}&-{\mathbf{A}}\\ {\mathbf{A}}^{\top}&\partial g+\nabla f\end{bmatrix}\begin{bmatrix}{\mathbf{s}}^{\star}\\ {\mathbf{x}}^{\star}\end{bmatrix}=\begin{bmatrix}0\\ 0\end{bmatrix},

which is the optimal condition of the problem (2).∎

We now use (𝐬,𝐱,ζ)({\mathbf{s}}^{*},{\mathbf{x}}^{*},\zeta^{*}) to denote an arbitrary fixed point of the base algorithm.

Lemma 2 (Fundamental equality)

Let the sequence {(𝐬k,𝐱k,ζk)}\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\} be generated by the iteration (8), we have the following two equalities hold:

𝐲k𝐲,𝐪gk𝐪g+𝐬k+1𝐬k,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1r𝐱k+1𝐱,𝐱k𝐱k+1+T1+rλ𝐬k+1𝐬,𝐬k𝐬k+1+T2\displaystyle\ \frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+T_{1}+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{2} (21)

and

𝐲k+1𝐲,𝐪gk+1𝐪g+𝐬k+1𝐬k,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1rζk+1ζ,ζkζk+1+rλ𝐬k+1𝐬,𝐬k𝐬k+1\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle
+r𝐀(𝐬k+1𝐬),𝐀(𝐬k+1𝐬k)+T3,\displaystyle\ +r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle+T_{3}, (22)

where

T1𝐀(𝐬k+1𝐬k),𝐱k𝐱k+1,\displaystyle T_{1}\coloneqq\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle,
T2𝐲𝐲k,f(𝐱k)f(𝐱),\displaystyle T_{2}\coloneqq\langle{\mathbf{y}}^{*}-{\mathbf{y}}^{k},\nabla f({\mathbf{x}}^{k})-\nabla f({\mathbf{x}}^{*})\rangle,
T3𝐲𝐲k+1,f(𝐱k+1)f(𝐱).\displaystyle T_{3}\coloneqq\langle{\mathbf{y}}^{*}-{\mathbf{y}}^{k+1},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle.
Proof

Plugging the iteration (8b) into 𝐪hk+1{\mathbf{q}}_{h}^{k+1} to cancel ζk,\zeta^{k}, we have

𝐪hk+1=\displaystyle{\mathbf{q}}_{h}^{k+1}= 𝐀𝐱k+1+r𝐀𝐀𝐬k+1+rλ(𝐈λ𝐀𝐀)𝐬krλ𝐬k+1\displaystyle\ {\mathbf{A}}{\mathbf{x}}^{k+1}+r{\mathbf{A}}{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}+\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}-\frac{r}{\lambda}{\mathbf{s}}^{k+1}
=\displaystyle= 𝐀𝐱k+1+rλ(𝐈λ𝐀𝐀)(𝐬k𝐬k+1).\displaystyle\ {\mathbf{A}}{\mathbf{x}}^{k+1}+\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}). (23)

For the equality (21), we have

𝐲k𝐲,𝐪gk𝐪g+𝐬k+1𝐬,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1r𝐲k𝐲,𝐱k𝐱1r𝐲k𝐲2=1r𝐲k𝐲,𝐱k𝐲k𝐲k𝐲,𝐀(𝐬k𝐬)+T2\displaystyle\ \underbrace{\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\rangle-\frac{1}{r}\|{\mathbf{y}}^{k}-{\mathbf{y}}^{*}\|^{2}}_{=\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{y}}^{k}\rangle}-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\rangle+T_{2}
+𝐬k+1𝐬,𝐀(𝐱k+1𝐱)+rλ𝐬k+1𝐬,(𝐈λ𝐀𝐀)(𝐬k𝐬k+1)\displaystyle\ +\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle
=\displaystyle= 1r𝐲k𝐲,𝐱k𝐱k+1𝐲k𝐲,𝐀(𝐬k+1𝐬k)𝐲k𝐲,𝐀(𝐬k𝐬)=𝐲k𝐲,𝐀(𝐬k+1𝐬)\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle\underbrace{-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\rangle}_{=-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle}
+T2+𝐬k+1𝐬,𝐀(𝐱k+1𝐱)+rλ𝐬k+1𝐬,(𝐈λ𝐀𝐀)(𝐬k𝐬k+1)\displaystyle\ +T_{2}+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle
=\displaystyle= 1r𝐲k𝐲,𝐱k𝐱k+1+𝐱k+1𝐲k,𝐀(𝐬k+1𝐬)+T2\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{2}
+rλ𝐬k+1𝐬,(𝐈λ𝐀𝐀)(𝐬k𝐬k+1),\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle, (24)

where the second equality uses

𝐱k𝐱k+1=\displaystyle{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}= 𝐱kζk+r𝐀𝐬k+1\displaystyle\ {\mathbf{x}}^{k}-\zeta^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}
=\displaystyle= 𝐱k𝐲k+r𝐀(𝐬k+1𝐬k),\displaystyle\ {\mathbf{x}}^{k}-{\mathbf{y}}^{k}+r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}), (25)

which is derived from the iteration (8b) and (8c).

Note that from the equality (25), we also have

𝐱k+1𝐲k=r𝐀(𝐬k+1𝐬k).\displaystyle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k}=-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}). (26)

Hence

𝐲k𝐲,𝐪gk𝐪g+𝐬k+1𝐬,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1r𝐲k𝐲,𝐱k𝐱k+1r𝐬k+1𝐬k,𝐀𝐀(𝐬k+1𝐬)+T2\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle-r\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{2}
+rλ𝐬k+1𝐬,(𝐈λ𝐀𝐀)(𝐬k𝐬k+1)\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle
=\displaystyle= 1r𝐲k𝐲,𝐱k𝐱k+1+rλ𝐬k+1𝐬,𝐬k𝐬k+1+T2\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{2}
=\displaystyle= 1r𝐱k+1𝐱,𝐱k𝐱k+1+T1+rλ𝐬k+1𝐬,𝐬k𝐬k+1+T2\displaystyle\frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+T_{1}+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{2} (27)

and the equality (21) is derived.

For the equality (22), we have

𝐲k+1𝐲,𝐪gk+1𝐪g+𝐬k+1𝐬,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1r𝐲k+1𝐲,𝐱k+1𝐲k+1𝐲k+1𝐲,𝐀(𝐬k+1𝐬)+T3\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1}\rangle-\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{3}
+𝐬k+1𝐬,𝐀(ζkζ)+rλ𝐬k+1𝐬,(𝐈λ𝐀𝐀)𝐬k𝐬k+1+λ𝐀𝐀𝐬\displaystyle\ +\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}(\zeta^{k}-\zeta^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}+\lambda{\mathbf{A}}{\mathbf{A}}^{\top}{\mathbf{s}}^{*}\rangle
=\displaystyle= 1r𝐲k+1𝐲,ζkζk+1𝐲k+1𝐲,𝐀(𝐬k+1𝐬)+T3\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},\zeta^{k}-\zeta^{k+1}\rangle-\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{3}
+𝐬k+1𝐬,𝐀(ζkζ)+r𝐬k+1𝐬,𝐀𝐀(𝐬𝐬k)=𝐀(𝐬k+1𝐬),𝐲k𝐲\displaystyle\ +\underbrace{\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}(\zeta^{k}-\zeta^{*})\rangle+r\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{s}}^{*}-{\mathbf{s}}^{k})\rangle}_{=\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{y}}^{k}-{\mathbf{y}}^{*}\rangle}
+rλ𝐬k+1𝐬,𝐬k𝐬k+1\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle
=\displaystyle= 1r𝐲k+1𝐲,ζkζk+1+𝐲k𝐲k+1,𝐀(𝐬k+1𝐬)+T3\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},\zeta^{k}-\zeta^{k+1}\rangle+\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{k+1},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{3}
+rλ𝐬k+1𝐬,𝐬k𝐬k+1.\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle. (28)

Note that from the equality (25), we also have

𝐲k+1=ζk+1r𝐀𝐬k+1.\displaystyle{\mathbf{y}}^{k+1}=\zeta^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}. (29)

Combining it with (20a) and (20b), we have

𝐲k+1𝐲=ζk+1ζr𝐀(𝐬k+1𝐬),\displaystyle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*}=\zeta^{k+1}-\zeta^{*}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}), (30)

then

𝐲k+1𝐲,𝐪gk+1𝐪g+𝐬k+1𝐬,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1rζk+1ζ,ζkζk+1+rλ𝐬k+1𝐬,𝐬k𝐬k+1+T3\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{3}
+𝐲k𝐲k+1,𝐀(𝐬k+1𝐬)ζkζk+1,𝐀(𝐬k+1𝐬)\displaystyle\ +\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{k+1},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle-\langle\zeta^{k}-\zeta^{k+1},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle
=\displaystyle= 1rζk+1ζ,ζkζk+1+rλ𝐬k+1𝐬,𝐬k𝐬k+1+T3\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{3}
+r𝐀(𝐬k+1𝐬k),𝐀(𝐬k+1𝐬),\displaystyle\ +r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle, (31)

where the last equality uses (29) and the equality (22) is derived. ∎

The next lemma characterizes the upper bound for T1,T2T_{1},T_{2} and T3T_{3}.

Lemma 3

For any θ(3/4,1]\theta\in(3/4,1] and θλσ21\theta\lambda\sigma^{2}\leq 1, we have the following inequalities hold.

T1\displaystyle T_{1}\leq r4λ𝐬k𝐬k1𝐌2r4λ𝐬k+1𝐬k𝐌2+14(1θ)r𝐀(𝐬k𝐬k1)2\displaystyle\ \frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}+\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}
+(1214θ)r𝐀(𝐬k+1𝐬k)2+(54θ)1r𝐱k𝐱k+12,\displaystyle\ +(\frac{1}{2}-\frac{1}{4}\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+(\frac{5}{4}-\theta)\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}, (32)
T2\displaystyle T_{2}\leq L4ζk1ζk2\displaystyle\ \frac{L}{4}\|\zeta^{k-1}-\zeta^{k}\|^{2}
=\displaystyle= L4𝐱k𝐱k+12+r2L4𝐀(𝐬k𝐬k+1)2rL2T1,\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}+\frac{r^{2}L}{4}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}-\frac{rL}{2}T_{1}, (33)
T3\displaystyle T_{3}\leq L4ζkζk+12,\displaystyle\ \frac{L}{4}\|\zeta^{k}-\zeta^{k+1}\|^{2}, (34)

where T1,T2,T3T_{1},T_{2},T_{3} are defined in Lemma 2 and 𝐌𝐈θλ𝐀𝐀𝟎.{\mathbf{M}}\coloneqq{\mathbf{I}}-\theta\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\succcurlyeq\mathbf{0}.

Proof

We start from the upper bound of T3.T_{3}. Since 𝐲=𝐱{\mathbf{y}}^{*}={\mathbf{x}}^{*},

T3=\displaystyle T_{3}= 𝐱k+1𝐲k+1,f(𝐱k+1)f(𝐱)𝐱k+1𝐱,f(𝐱k+1)f(𝐱)\displaystyle\ \langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle-\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle
\displaystyle\leq 𝐱k+1𝐲k+1,f(𝐱k+1)f(𝐱)1Lf(𝐱k+1)f(𝐱)2\displaystyle\ \langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle-\frac{1}{L}\|\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\|^{2}
\displaystyle\leq L4𝐱k+1𝐲k+12,\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1}\|^{2}, (35)

where the second inequality uses the equivalent form of Lipschitz continuous f\nabla f from (nesterov2018lectures, , Theorem 2.1.5) and the last one uses Cauchy’s inequality.

To upper bound T2,T_{2}, by the same argument as above, we have

T2\displaystyle T_{2}\leq L4𝐱k𝐲k2\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k}-{\mathbf{y}}^{k}\|^{2}
=\displaystyle= L4𝐱k𝐱k+1r𝐀(𝐬k+1𝐬k)2,\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}, (36)

where the equality is from (25) and (3) is derived by expanding it.

For T1,T_{1}, firstly we have

T1=\displaystyle T_{1}= 𝐬k+1𝐬k,rλ(𝐈λ𝐀𝐀)(𝐬k𝐬k1)rλ(𝐈λ𝐀𝐀)(𝐬k+1𝐬k)\displaystyle\ \langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})-\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle
𝐬k+1𝐬k,𝐪hk+1𝐪hk\displaystyle\ -\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{k}\rangle
\displaystyle\leq 𝐬k+1𝐬k,rλ𝐌(𝐬k𝐬k1(𝐬k+1𝐬k))\displaystyle\ \langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},\frac{r}{\lambda}{\mathbf{M}}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}-({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}))\rangle
(1θ)r𝐬k+1𝐬k,𝐀𝐀(𝐬k𝐬k1(𝐬k+1𝐬k))\displaystyle-(1-\theta)r\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}-({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}))\rangle
=\displaystyle= r2λ𝐬k𝐬k1𝐌2r2λ𝐬k+1𝐬k𝐌2r2λ𝐬k+1+𝐬k12𝐬k𝐌2\displaystyle\ \frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}+{\mathbf{s}}^{k-1}-2{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}
(1θ)r𝐀(𝐬k+1𝐬k),𝐀(𝐬k𝐬k1(𝐬k+1𝐬k))\displaystyle\ -(1-\theta)r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}-({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}))\rangle
\displaystyle\leq r2λ𝐬k𝐬k1𝐌2r2λ𝐬k+1𝐬k𝐌2+12(1θ)r𝐀(𝐬k𝐬k1)2\displaystyle\ \frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}+\frac{1}{2}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}
+32(1θ)r𝐀(𝐬k+1𝐬k)2,\displaystyle\ +\frac{3}{2}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}, (37)

where the first inequality uses the firm nonexpansiveness of 𝐩𝐫𝐨𝐱λrh\mathbf{prox}_{\frac{\lambda}{r}h^{*}} from (bauschke2011convex, , Proposition 4.2), the second equality uses

ab,cd=12(ad2ac2+bc2bd2)\langle a-b,c-d\rangle=\frac{1}{2}(\|a-d\|^{2}-\|a-c\|^{2}+\|b-c\|^{2}-\|b-d\|^{2})

and the last step uses Chauchy’s inequality.

On the other hand, we also have

T1=\displaystyle T_{1}= 212𝐀(𝐬k+1𝐬k),12(𝐱k𝐱k+1)\displaystyle\ 2\langle\frac{1}{\sqrt{2}}{\mathbf{A}}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),\frac{1}{\sqrt{2}}({\mathbf{x}}^{k}-{\mathbf{x}}^{k+1})\rangle
\displaystyle\leq 2(θ12)𝐀(𝐬k+1𝐬k),(522θ)(𝐱k𝐱k+1)\displaystyle\ 2\langle\sqrt{(\theta-\frac{1}{2})}{\mathbf{A}}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),\sqrt{(\frac{5}{2}-2\theta)}({\mathbf{x}}^{k}-{\mathbf{x}}^{k+1})\rangle
\displaystyle\leq (θ12)𝐀(𝐬k+1𝐬k)2+(522θ)𝐱k𝐱k+12,\displaystyle\ (\theta-\frac{1}{2})\|{\mathbf{A}}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+(\frac{5}{2}-2\theta)\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}, (38)

where the first inequality holds since for any θ(34,1]\theta\in(\frac{3}{4},1]

(θ12)(522θ)14=2θ2+72θ32=2(θ34)(1θ)0.(\theta-\frac{1}{2})(\frac{5}{2}-2\theta)-\frac{1}{4}=-2\theta^{2}+\frac{7}{2}\theta-\frac{3}{2}=2(\theta-\frac{3}{4})(1-\theta)\geq 0.

Therefore,

T112×(37)+12×(38),T_{1}\leq\frac{1}{2}\times\eqref{lem2:ineq4}+\frac{1}{2}\times\eqref{lem2:ineq5},

which gives the inequality (3). ∎

Theorem 3.1

Let the sequence {(𝐬k,𝐱k,ζk)}\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\} be generated by the base algorithm, then {(𝐬k,𝐱k)}\{({\mathbf{s}}^{k},{\mathbf{x}}^{k})\} converges to a solution of the problem (2), if

r<4θ32θ12L,λ1θσ2r<\frac{4\theta-3}{2\theta-1}\frac{2}{L},\ \ \lambda\leq\frac{1}{\theta\sigma^{2}}

for any θ(3/4,1].\theta\in(3/4,1].

Proof

Descent Inequality. From the assumption on λ,\lambda, there exists θ~(0,θ)\widetilde{\theta}\in(0,\theta) such that

𝐈θ~λ𝐀𝐀𝟎.{\mathbf{I}}-\widetilde{\theta}\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\succ\mathbf{0}.

Define a parameter α\alpha depending on θ\theta and θ~\widetilde{\theta} as

α={θ~1θ,34<θ<1,0,θ=1,\alpha=\left\{\begin{aligned} \frac{\widetilde{\theta}}{1-\theta},&\ \ \frac{3}{4}<\theta<1,\\ 0,&\ \ \theta=1,\end{aligned}\right. (39)

and let 𝐌~=𝐈α(1θ)𝐀𝐀𝟎\widetilde{{\mathbf{M}}}={\mathbf{I}}-\alpha(1-\theta){\mathbf{A}}{\mathbf{A}}^{\top}\succ\mathbf{0}.

Combining Lemma 2 and Lemma 3, for any r(0,2/L)r\in(0,2/L) we have

𝐲k𝐲,𝐪gk𝐪g+𝐬k+1𝐬k,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1r𝐱k+1𝐱,𝐱k𝐱k+1+rλ𝐬k+1𝐬,𝐬k𝐬k+1+T1+T2\displaystyle\ \frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{1}+T_{2}
\displaystyle\leq 12r𝐱k𝐱212r𝐱k+1𝐱212r𝐱k𝐱k+12\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}
+r2λ𝐬k𝐬2r2λ𝐬k+1𝐬2r2λ𝐬k+1𝐬k2\displaystyle\ +\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}
+α(12rL4)ζk1ζk2+(L4α(12rL4))ζk1ζk2\displaystyle\ +\alpha(\frac{1}{2r}-\frac{L}{4})\|\zeta^{k-1}-\zeta^{k}\|^{2}+(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))\|\zeta^{k-1}-\zeta^{k}\|^{2}
+T1\displaystyle\ +T_{1}
\displaystyle\leq 12r𝐱k𝐱212r𝐱k+1𝐱212r𝐱k𝐱k+12\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}
+r2λ𝐬k𝐬2r2λ𝐬k+1𝐬2r2λ𝐬k+1𝐬k2\displaystyle\ +\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}
+α(12rL4)ζk1ζk2+(L4α(12rL4))𝐱k𝐱k+12\displaystyle\ +\alpha(\frac{1}{2r}-\frac{L}{4})\|\zeta^{k-1}-\zeta^{k}\|^{2}+(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}
+(L4α(12rL4))r2𝐀(𝐬k𝐬k+1)2+(1+α)(1rL2)T1.\displaystyle\ +(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))r^{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}+(1+\alpha)(1-\frac{rL}{2})T_{1}. (40)

Note that

(1+α)(1rL2)T1\displaystyle(1+\alpha)(1-\frac{rL}{2})T_{1}\leq (1+α)(1rL2)[r4λ𝐬k𝐬k1𝐌2r4λ𝐬k+1𝐬k𝐌2\displaystyle\ (1+\alpha)(1-\frac{rL}{2})\Big{[}\frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}
+14(1θ)r𝐀(𝐬k𝐬k1)2\displaystyle\ +\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}
+(1214θ)r𝐀(𝐬k+1𝐬k)2(θ34)1r𝐱k𝐱k+12]\displaystyle\ +(\frac{1}{2}-\frac{1}{4}\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}-(\theta-\frac{3}{4})\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\Big{]}
+(1+α)(1rL2)12r𝐱k𝐱k+12.\displaystyle\ +(1+\alpha)(1-\frac{rL}{2})\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}. (41)

Combine (40), (41), and

12r+L4α(12rL4)+(1+α)(12rL4)=0,-\frac{1}{2r}+\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4})+(1+\alpha)(\frac{1}{2r}-\frac{L}{4})=0,

we get

𝐲k𝐲,𝐪gk𝐪g+𝐬k+1𝐬k,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
\displaystyle\leq 12r𝐱k𝐱212r𝐱k+1𝐱2+r2λ𝐬k𝐬2r2λ𝐬k+1𝐬2\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}+\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}
+α(12rL4)ζk1ζk2r2λ𝐬k+1𝐬k2\displaystyle\ +\alpha(\frac{1}{2r}-\frac{L}{4})\|\zeta^{k-1}-\zeta^{k}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}
+(L4α(12rL4))r2𝐀(𝐬k𝐬k+1)2\displaystyle\ +(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))r^{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}
(1+α)(1rL2)[r4λ𝐬k𝐬k1𝐌2r4λ𝐬k+1𝐬k𝐌2\displaystyle\ (1+\alpha)(1-\frac{rL}{2})\Big{[}\frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}
+14(1θ)r𝐀(𝐬k𝐬k1)214(1θ)r𝐀(𝐬k+1𝐬k)2\displaystyle\ +\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}-\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}
+(3412θ)r𝐀(𝐬k+1𝐬k)2(θ34)𝐱k𝐱k+12].\displaystyle\ +(\frac{3}{4}-\frac{1}{2}\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}-(\theta-\frac{3}{4})\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\Big{]}. (42)

The other inequality in Lemma 2 becomes

𝐲k+1𝐲,𝐪gk+1𝐪g+𝐬k+1𝐬k,𝐪hk+1𝐪h\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle
=\displaystyle= 1rζk+1ζ,ζkζk+1+rλ𝐬k+1𝐬,𝐌(𝐬k𝐬k+1)\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{M}}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle
(1θ)rrλ𝐀(𝐬k+1𝐬),𝐀(𝐬k𝐬k+1)+T3\displaystyle\ -(1-\theta)r\frac{r}{\lambda}\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle+T_{3}
\displaystyle\leq 12rζkζ212rζk+1ζ212rζk+1ζk2\displaystyle\ \frac{1}{2r}\|\zeta^{k}-\zeta^{*}\|^{2}-\frac{1}{2r}\|\zeta^{k+1}-\zeta^{*}\|^{2}-\frac{1}{2r}\|\zeta^{k+1}-\zeta^{k}\|^{2}
+r2λ𝐬k𝐬𝐌2r2λ𝐬k+1𝐬𝐌2r2λ𝐬k+1𝐬k𝐌2\displaystyle\ +\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}
(1θ)r2𝐀(𝐬k𝐬)2+(1θ)r2𝐀(𝐬k+1𝐬)2\displaystyle\ -(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\|^{2}+(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\|^{2}
+(1θ)r2𝐀(𝐬k+1𝐬k)2+L4ζkζk+12.\displaystyle\ +(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+\frac{L}{4}\|\zeta^{k}-\zeta^{k+1}\|^{2}. (43)

Let β=(1+α)(1rL2)\beta=(1+\alpha)(1-\frac{rL}{2}) for simplicity. We have

Φk+1=\displaystyle\Phi^{k+1}= 12r𝐱k+1𝐱2+r2λ𝐬k+1𝐬𝐌+𝐌~2+12rζk+1ζ2\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}+\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}+\widetilde{{\mathbf{M}}}}+\frac{1}{2r}\|\zeta^{k+1}-\zeta^{*}\|^{2}
+βr4λ𝐬k+1𝐬k𝐌2+βr(1θ)4𝐬k+1𝐬k𝐀𝐀2\displaystyle\ +\frac{\beta r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}+\frac{\beta r(1-\theta)}{4}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{A}}{\mathbf{A}}^{\top}}
+α2r(1rL2)ζkζk+12.\displaystyle\ +\frac{\alpha}{2r}(1-\frac{rL}{2})\|\zeta^{k}-\zeta^{k+1}\|^{2}.

Considering (3.1)+α×(3.1)\eqref{thm:ineq3}+\alpha\times\eqref{thm:ineq4}, we have

𝐲k𝐲,𝐪gk𝐪g+(1+α)𝐬k+1𝐬k,𝐪hk+1𝐪h+α𝐲k+1𝐲,𝐪gk+1𝐪g\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+(1+\alpha)\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle+\alpha\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle
\displaystyle\leq ΦkΦk+1β(θ34)𝐱k𝐱k+12r2λ𝐬k+1𝐬k𝐌2\displaystyle\ \Phi^{k}-\Phi^{k+1}-\beta(\theta-\frac{3}{4})\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}
r2λ𝐬k+1𝐬k𝐈α(1θ)λ𝐀𝐀2+(3412θ)βr𝐬k+1𝐬k𝐀𝐀2\displaystyle\ -\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{I}}-\alpha(1-\theta)\lambda{\mathbf{A}}{\mathbf{A}}^{\top}}+(\frac{3}{4}-\frac{1}{2}\theta)\beta r\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{A}}{\mathbf{A}}^{\top}}
+(rL2α(1rL2))r2𝐬k𝐬k+1𝐀𝐀2\displaystyle\ +(\frac{rL}{2}-\alpha(1-\frac{rL}{2}))\frac{r}{2}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\|_{{\mathbf{A}}{\mathbf{A}}^{\top}}^{2}
=\displaystyle= ΦkΦk+1β(θ34)𝐱k𝐱k+12r2λ𝐬k+1𝐬k𝐌2\displaystyle\ \Phi^{k}-\Phi^{k+1}-\beta(\theta-\frac{3}{4})\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}
r2λ𝐬k+1𝐬k𝐍2,\displaystyle\ -\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{N}}}, (44)

where

𝐍=𝐈[α(1θ)+(32θ)β+rL2α(1rL2)]λ𝐀𝐀.{\mathbf{N}}={\mathbf{I}}-\Big{[}\alpha(1-\theta)+(\frac{3}{2}-\theta)\beta+\frac{rL}{2}-\alpha(1-\frac{rL}{2})\Big{]}\lambda{\mathbf{A}}{\mathbf{A}}^{\top}.

Note that if we let

α(1θ)+(32θ)(1+α)(1rL2)+rL2α(1rL2)<θ,\alpha(1-\theta)+(\frac{3}{2}-\theta)(1+\alpha)(1-\frac{rL}{2})+\frac{rL}{2}-\alpha(1-\frac{rL}{2})<\theta,

which is equivalent to

rL2<4θ32θ1,\frac{rL}{2}<\frac{4\theta-3}{2\theta-1},

we have 𝐍𝐌𝟎.{\mathbf{N}}\succ{\mathbf{M}}\succcurlyeq\mathbf{0}.

Due to 𝐌~𝟎\widetilde{{\mathbf{M}}}\succ\mathbf{0} and the firm nonexpansiveness of 𝐩𝐫𝐨𝐱rg\mathbf{prox}_{rg} and 𝐩𝐫𝐨𝐱λrh\mathbf{prox}_{\frac{\lambda}{r}h^{*}}, from (bauschke2011convex, , Proposition 4.2), we get the descent inequality

Φk+1Φkβ(θ34)𝐱k+1𝐱k2r2λ𝐬k+1𝐬k𝐌+𝐍2.\Phi^{k+1}\leq\Phi^{k}-\beta(\theta-\frac{3}{4})\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}+{\mathbf{N}}}. (45)

Convergence.

Taking the telescopic sum from k=0k=0 to \infty, we get

limk𝐱k+1𝐱k=0,\displaystyle\lim_{k\rightarrow\infty}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|=0, (46)
limk𝐬k+1𝐬k=0,\displaystyle\lim_{k\rightarrow\infty}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|=0, (47)

due to θ>34\theta>\frac{3}{4} and 𝐌+𝐍𝟎.{\mathbf{M}}+{\mathbf{N}}\succ\mathbf{0}. Moreover, from (8b), we get

limkζk+1ζk=0.\lim_{k\rightarrow\infty}\|\zeta^{k+1}-\zeta^{k}\|=0. (48)

From the descent inequality, the nonnegative sequence {Φk}\{\Phi^{k}\} is nonincreasing, so it converges to a nonnegative constant, which implies (𝐬k,𝐱k,ζk)({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k}) is bounded in m×n×n\mathbb{R}^{m}\times\mathbb{R}^{n}\times\mathbb{R}^{n}. Due to the compactness, there is a subsequence, {(𝐬kj,𝐱kj,ζkj)}\{({\mathbf{s}}^{k_{j}},{\mathbf{x}}^{k_{j}},\zeta^{k_{j}})\} converging to (𝐬,𝐱,ζ)({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}).

The nonexpansiveness of 𝐩𝐫𝐨𝐱rg\mathbf{prox}_{rg} and 𝐩𝐫𝐨𝐱λrh\mathbf{prox}_{\frac{\lambda}{r}h^{*}} implies they are continuous and from (46), (47) and (48),

limj(𝐬kj+1,𝐱kj+1,ζkj+1)(𝐬,𝐱,ζ)\displaystyle\ \lim_{j\rightarrow\infty}\|({\mathbf{s}}^{k_{j}+1},{\mathbf{x}}^{k_{j}+1},\zeta^{k_{j}+1})-({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})\|
\displaystyle\leq limj(𝐬kj+1,𝐱kj+1,ζkj+1)(𝐬kj,𝐱kj,ζkj)+(𝐬kj,𝐱kj,ζkj)(𝐬,𝐱,ζ)\displaystyle\ \lim_{j\rightarrow\infty}\|({\mathbf{s}}^{k_{j}+1},{\mathbf{x}}^{k_{j}+1},\zeta^{k_{j}+1})-({\mathbf{s}}^{k_{j}},{\mathbf{x}}^{k_{j}},\zeta^{k_{j}})\|+\|({\mathbf{s}}^{k_{j}},{\mathbf{x}}^{k_{j}},\zeta^{k_{j}})-({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})\|
=\displaystyle= 0.\displaystyle\ 0.

So, (𝐬,𝐱,ζ)({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}) is a fixed point of the iteration (8).

By choosing (𝐬,𝐱,ζ)=(𝐬,𝐱,ζ),({\mathbf{s}}^{*},{\mathbf{x}}^{*},\zeta^{*})=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}), we have

limjΦkj=0.\lim_{j\rightarrow\infty}\Phi^{k_{j}}=0.

Hence

limj(𝐬k,𝐱k,ζk)=(𝐬,𝐱,ζ).\lim_{j\rightarrow\infty}({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}).

Lastly, from Lemma 1, (𝐬,𝐱)({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}) is a solution of the problem (2). ∎

Due to the relation discussed in Section 2.1 and 2.2, we also prove the convergence of AFBA and PD3O under the relaxed condition. When f=0f=0, the relation in Section 2.3 implies the improved convergence result of Chambolle-Pock.

Corollary 1 (Chambolle-Pock)

Suppose r>0r>0 and λ<43σ2\lambda<\frac{4}{3\sigma^{2}}, we have the following results hold,

  1. 1.

    Let {(𝐬k,𝐱k,ζk)}\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\} be generated by the iteration (8), then (𝐬k,𝐱k)({\mathbf{s}}^{k},{\mathbf{x}}^{k}) converges to a solution of the problem (2) with linear ff.

  2. 2.

    Let {(𝐬3k,𝐱3k)}\{({\mathbf{s}}_{3}^{k},{\mathbf{x}}_{3}^{k})\} be generated by Chambolle-Pock (15), then it converges to a solution of the problem (2) with f=0f=0.

Proof
  1. 1.

    When ff is linear, the Lipschitz constant of gradient is 0. We can set L=ϵ>0L=\epsilon>0, then by Theorem 3.1 if r<4θ32θ12ϵr<\frac{4\theta-3}{2\theta-1}\frac{2}{\epsilon} and λ1θσ2,\lambda\leq\frac{1}{\theta\sigma^{2}},

    limk(𝐬k,𝐱k)=(𝐬,𝐱),\lim_{k\rightarrow\infty}({\mathbf{s}}^{k},{\mathbf{x}}^{k})=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}),

    where (𝐬,𝐱)({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}) is a saddle-point solution.

    Let ϵ0\epsilon\rightarrow 0 and θ3/4\theta\rightarrow 3/4, then the proof is complete.

  2. 2.

    From the sequence relation (16) and the above argument, we have

    limk(𝐬3k,𝐱3k)=limk(𝐬k,𝐱k1r𝐀(𝐬k𝐬k1))=(𝐬,𝐱).\lim_{k\rightarrow\infty}({\mathbf{s}}^{k}_{3},{\mathbf{x}}^{k}_{3})=\lim_{k\rightarrow\infty}({\mathbf{s}}^{k},{\mathbf{x}}^{k-1}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}))=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}).

The comparison of aforementioned algorithms with improved conditions on stepsizes is summarized in Table 2, where the results of the algorithm in bold are proved in this paper.

ff gg r,λr,\lambda
Chambolle-Pock 0 convex r>0r>0, λ<4/(3σ2)\lambda<4/(3\sigma^{2})
PAPC(PDFP2O) LL-smooth 0 rL/2<1rL/2<1, λ<4/(3σ2)\lambda<4/(3\sigma^{2})
Condat-Vu LL-smooth convex λσ2+rL/21\lambda\sigma^{2}+rL/2\leq 1
PDFP LL-smooth convex rL/2<1rL/2<1, λσ2(𝐀)1\lambda\sigma^{2}({\mathbf{A}})\leq 1
AFBA LL-smooth convex rL/2<ΓrL/2<\Gamma, θλσ21\theta\lambda\sigma^{2}\leq 1
PD3O LL-smooth convex rL/2<ΓrL/2<\Gamma, θλσ21\theta\lambda\sigma^{2}\leq 1
The Base Algorithm LL-smooth convex rL/2<ΓrL/2<\Gamma, θλσ21\theta\lambda\sigma^{2}\leq 1
Table 2: The comparison of the requirement of stepsizes to guarantee the convergence of some primal-dual algorithms. LL-smooth means the function is convex and has a LL-Lipschitz continuous gradient. Γ(4θ3)/(2θ1)\Gamma\coloneqq(4\theta-3)/(2\theta-1), where θ\theta is an arbitrary number in (3/4,1].(3/4,1].

3.2 Tightness of Upper Bound for Stepsizes

In this section, we provide a simple example to show that the upper bound for λ\lambda can not be relaxed further. This example is more general than the one provide in he2020optimally . We consider the following saddle point problem

min𝐱nmax𝐬m𝐀𝐱,𝐬.\min_{{\mathbf{x}}\in\mathbb{R}^{n}}\max_{{\mathbf{s}}\in\mathbb{R}^{m}}~{}\langle{\mathbf{A}}{\mathbf{x}},{\mathbf{s}}\rangle.

Any (𝐱,𝐬)({\mathbf{x}},{\mathbf{s}}) such that 𝐀𝐱=𝟎{\mathbf{A}}{\mathbf{x}}=\mathbf{0} and 𝐀𝐬=𝟎{\mathbf{A}}^{\top}{\mathbf{s}}=\mathbf{0} is a solution. For this special problem, one iteration becomes

𝐬k+1=\displaystyle{\mathbf{s}}^{k+1}= λr𝐀𝐱k+(𝐈λ𝐀𝐀)𝐱k,\displaystyle~{}{\lambda\over r}{\mathbf{A}}{\mathbf{x}}^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{x}}^{k}, (49)
𝐱k+1=\displaystyle{\mathbf{x}}^{k+1}= 𝐱kr𝐀𝐬k+1=(𝐈λ𝐀𝐀)𝐱kr𝐀(𝐈λ𝐀𝐀)𝐬k,\displaystyle~{}{\mathbf{x}}^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}=({\mathbf{I}}-\lambda{\mathbf{A}}^{\top}{\mathbf{A}}){\mathbf{x}}^{k}-r{\mathbf{A}}^{\top}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}, (50)

We can rewrite it as

[𝐬k+1𝐀𝐱k+1]=[𝐈λ𝐀𝐀λr𝐈r𝐀𝐀(𝐈λ𝐀𝐀)𝐈λ𝐀𝐀][𝐬k𝐀𝐱k].\displaystyle\begin{bmatrix}{\mathbf{s}}^{k+1}\\ {\mathbf{A}}{\mathbf{x}}^{k+1}\end{bmatrix}=\begin{bmatrix}{\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}&{\lambda\over r}{\mathbf{I}}\\ -r{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})&{\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\end{bmatrix}\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{A}}{\mathbf{x}}^{k}\end{bmatrix}. (51)

Therefore, to make the algorithm converge, the eigenvalues of the matrix can not have magnitude larger than 1. Because 𝐀𝐀{\mathbf{A}}{\mathbf{A}}^{\top} is symmetric, we only need to consider the 2×22\times 2 matrix

[1λθλrrθ(1λθ)1λθ],\displaystyle\begin{bmatrix}1-\lambda\theta&{\lambda\over r}\\ -r\theta(1-\lambda\theta)&1-\lambda\theta\end{bmatrix}, (52)

where θ\theta is the eigenvalues of 𝐀𝐀{\mathbf{A}}{\mathbf{A}}^{\top}. Its two eigenvalues are

1λθ±λθ(1λθ).1-\lambda\theta\pm\sqrt{-\lambda\theta(1-\lambda\theta)}.

We consider different cases for λθ\lambda\theta. If λθ<1\lambda\theta<1, both eigenvalues are complex numbers, and their magnitude is (1λθ)2+λθ(1λθ)1\sqrt{(1-\lambda\theta)^{2}+\lambda\theta(1-\lambda\theta)}\leq 1. If λθ=1\lambda\theta=1, both eigenvalues are zero. When λθ>1\lambda\theta>1, both eigenvalues are real number. The eigenvalue 1λθ+λθ(1λθ)<1λθ+λθλθ=1.1-\lambda\theta+\sqrt{-\lambda\theta(1-\lambda\theta)}<1-\lambda\theta+\sqrt{\lambda\theta\lambda\theta}=1. The other eigenvalue is 1λθλθ(λθ1)1-\lambda\theta-\sqrt{\lambda\theta(\lambda\theta-1)}. To make sure that its magnitude is less than one, we need 1λθλθ(λθ1)>11-\lambda\theta-\sqrt{\lambda\theta(\lambda\theta-1)}>-1, that is λθ<4/3.\lambda\theta<4/3. The condition for the convergence with any initial value is λθ<4/3\lambda\theta<4/3 for all eigenvalues of 𝐀𝐀{\mathbf{A}}{\mathbf{A}}^{\top}, that is λσ2<4/3\lambda\sigma^{2}<4/3.

This example shows that the condition λσ2<4/3\lambda\sigma^{2}<4/3 can not be relaxed further for Chambolle-Pock.

4 Numerical Experiments

In this section, we demonstrate the performance of several primal-dual algorithms under the relaxed condition and compare their results with existing ones. More specifically, we use PD3O and AFBA to solve the fused LASSO (least absolute shrinkage and selection operator) and Chambolle-Pock to solve LASSO to show their convergence with different combinations of rr and λ\lambda.

4.1 The fused LASSO

The fused LASSO problem (see, e.g., tibshirani2005sparsity ) is formulated as

minimize𝐱1000012𝐊𝐱𝐛2+μ1𝐁𝐱1+μ2𝐱1\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{10000}}\ \frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}+\mu_{1}\|{\mathbf{B}}{\mathbf{x}}\|_{1}+\mu_{2}\|{\mathbf{x}}\|_{1} (53)

where 𝐊500×10000{\mathbf{K}}\in\mathbb{R}^{500\times 10000}. The two penalty parameters μ1\mu_{1} and μ2\mu_{2} are set to 200200 and 2020, respectively. The ithith row of 𝐁9999×10000{\mathbf{B}}\in\mathbb{R}^{9999\times 10000} has 1-1 on the iith column, 11 on the i+1i+1th column, and 0 on other columns.

We let f(𝐱)=12𝐊𝐱𝐛2,f({\mathbf{x}})=\frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}, g(𝐱)=μ2𝐱1g({\mathbf{x}})=\mu_{2}\|{\mathbf{x}}\|_{1} and h(𝐁𝐱)=μ1𝐁𝐱1h({\mathbf{B}}{\mathbf{x}})=\mu_{1}\|{\mathbf{B}}{\mathbf{x}}\|_{1}. Then the primal-dual form is

min𝐱10000max𝐬999912𝐊𝐱𝐛2+μ2𝐱1+𝐁𝐱,𝐬μ1χB(𝐬μ1),\min_{{\mathbf{x}}\in\mathbb{R}^{10000}}\max_{{\mathbf{s}}\in\mathbb{R}^{9999}}\ \frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}+\mu_{2}\|{\mathbf{x}}\|_{1}+\langle{\mathbf{B}}{\mathbf{x}},{\mathbf{s}}\rangle-\mu_{1}\chi_{B_{\infty}}\left(\frac{{\mathbf{s}}}{\mu_{1}}\right), (54)

where BB_{\infty} is the closed unit ball in \ell_{\infty} norm and χB\chi_{B_{\infty}} is the indicator function over BB_{\infty}.

We generate a sparse vector 𝐱True{\mathbf{x}}_{\text{True}} with 5050 nonzero entries and a dense matrix 𝐀{\mathbf{A}} whose each entry is independently sampled from the standard normal distribution. The response vector 𝐛{\mathbf{b}} is obtained by adding Gaussian noise to 𝐀𝐱True{\mathbf{A}}{\mathbf{x}}_{\text{True}}. We calculate the estimated optimal solution by running 10,00010,000 steps PD3O for the problem and get the estimated optimal function value ff^{*}.

We set the default parameters r=1/σ2(𝐊)r=1/\sigma^{2}({\mathbf{K}}), λ=1/σ2(𝐁)\lambda=1/\sigma^{2}({\mathbf{B}}) and consider several choices of stepsizes in the two scenarios: (1) rL2<4θ32θ1,λ1θσ2;\frac{rL}{2}<\frac{4\theta-3}{2\theta-1},\lambda\leq\frac{1}{\theta\sigma^{2}}; (2) rL2<1,λ<43σ2.\frac{rL}{2}<1,\lambda<\frac{4}{3\sigma^{2}}. The first scenario obeys the relaxed condition shown in this paper, while the other one may violate the condition. In Fig. 1, the left figure shows the result with θ=1/1.19\theta=1/1.19 and 4/54/5 for the first scenario. In this figure, we compare four choices of the stepsizes: (1) the default parameter (r,λ)(r,\lambda); (2) we fix the primal stepsize rr and choose a small θ=1/1.19\theta=1/1.19 to obtain a large λ\lambda. The new parameter is (r,1.19λ)(r,1.19\lambda); (3) Choose a smaller θ=0.8\theta=0.8 and decrease the primal stepsize rr only; (4) Choose the same θ=0.8\theta=0.8 and increase the parameter λ\lambda to its upper bound. The right figure in Fig. 1 compare the convergence of algorithms under the second scenario with a larger λ\lambda. Note that, the settings with 1.3λ1.3\lambda do not satisfy the condition in this paper. We observe that in either scenario, the primal stepsize dominates the convergence of algorithms and a slightly larger λ\lambda has little effect on the algorithm.

Refer to caption
Refer to caption
Figure 1: The comparison of the performance of PD3O and AFBA with different parameters. In both figures, the fixed parameters rr and λ\lambda are set to 1/σ2(𝐊)1/\sigma^{2}({\mathbf{K}}) and 1/σ2(𝐁)1/\sigma^{2}({\mathbf{B}}), respectively.

However, the benefit of larger λ\lambda appears when 𝐊{\mathbf{K}} has a full column rank. We consider the same fused LASSO problem with 𝐊2500×2500{\mathbf{K}}\in\mathbb{R}^{2500\times 2500}. We conduct the experiments for two cases, randomly generated 𝐊{\mathbf{K}} and 𝐊=𝐈.{\mathbf{K}}={\mathbf{I}}. The problem setting is changed to μ1=5\mu_{1}=5 and μ2=1/5\mu_{2}=1/5. The true solution is generated in the same way with 2525 nonzero entries. As shown in Fig. 2, both top and bottom figures indicate 1020%10-20\% acceleration of convergence when λ\lambda is increased. The numerical result on the second scenario also suggests that the general constraint on rr and λ\lambda shown in this paper may not be tight.

Refer to caption
Refer to caption
(a) Random 𝐊2500×2500{\mathbf{K}}\in\mathbb{R}^{2500\times 2500}
Refer to caption
Refer to caption
(b) 𝐊=𝐈2500×2500{\mathbf{K}}={\mathbf{I}}\in\mathbb{R}^{2500\times 2500}
Figure 2: The comparison of the performance of PD3O and AFBA with different parameters. In all figures, the fixed parameters rr and λ\lambda are set to 1/σ2(𝐊)1/\sigma^{2}({\mathbf{K}}) and 1/σ2(𝐁)1/\sigma^{2}({\mathbf{B}}), respectively.

4.2 LASSO

We consider the following LASSO problem (see, e.g., tibshirani1996regression )

minimize𝐱500012𝐊𝐱𝐛2+μ𝐱1\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{5000}}\ \frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}+\mu\|{\mathbf{x}}\|_{1} (55)

where 𝐊500×5000{\mathbf{K}}\in\mathbb{R}^{500\times 5000} and μ=200\mu=200 is a penalty parameter. We let g(𝐱)=μ𝐱1g({\mathbf{x}})=\mu\|{\mathbf{x}}\|_{1} and h(𝐊𝐱)=12𝐊𝐱𝐛2h({\mathbf{K}}{\mathbf{x}})=\frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2} and consider the primal-dual form as

min𝐱5000max𝐬500μ𝐱1+𝐊𝐱,𝐬12𝐬2𝐛,𝐬.\min_{{\mathbf{x}}\in\mathbb{R}^{5000}}\max_{{\mathbf{s}}\in\mathbb{R}^{500}}\ \mu\|{\mathbf{x}}\|_{1}+\langle{\mathbf{K}}{\mathbf{x}},{\mathbf{s}}\rangle-\frac{1}{2}\|{\mathbf{s}}\|^{2}-\langle{\mathbf{b}},{\mathbf{s}}\rangle. (56)

The data are generated in a similar way to the previous experiment. The optimal solution 𝐱{\mathbf{x}}^{*} is calculated by performing 10,000 iterations of Chambolle-Pock. Fig. 3 shows the effect of the relaxed conditions on the convergence of Chambolle-Pock in terms of two different residual measures. The default dual parameter λ\lambda is set to 1/σ2(𝐊)1/\sigma^{2}({\mathbf{K}}) and the relaxed one is 1.32λ1.32\lambda. The primal stepsize rr is chosen from {0.001,0.005,0.01,0.05}\{0.001,0.005,0.01,0.05\}.

Refer to caption
Refer to caption
Figure 3: The comparison of the performance of Chambolle-Pock with different parameters on LASSO in terms of the function values and distance to the optimal solution. In both figures, the fixed parameter λ\lambda is set to 1/σ2(𝐀)1/\sigma^{2}({\mathbf{A}}).

From Fig. 3, we observe that by choosing the acceptable range of rr, the larger value of λ\lambda makes the algorithm converge faster up to 2030%20-30\% acceleration. We also note that the relaxed λ\lambda fails to speed up the algorithm and the residual curves overlap when rr is set to a very small number. This is possibly due to the small progression of the primal variable at each step. The experiment on extreme values of rr verifies the explanation as shown in Fig. 4. We use different colors to differentiate the result of the extreme large values from the result of the extreme small values. This figure also suggests that balanced primal and dual stepsizes is needed. Though how to choose good primal and dual stepsizes is out of the focus of this paper, the relaxed condition in this paper provided theoretic guarantee to increase one or both stepsizes.

Refer to caption
Refer to caption
Figure 4: The comparison of performance of Chambolle-Pock with different parameters chosen from extreme values. In both figures, the fixed parameter λ\lambda is set to 1/σ2(𝐀)1/\sigma^{2}({\mathbf{A}}).

5 Conclusions

In this paper, we use a base algorithm to build the connection between some primal-dual algorithms. Then we prove the convergence of the base algorithm under a relaxed condition for the primal and dual stepsizes. It implies a possible choice of larger dual stepsize with a corresponding conservative primal stepsize. Chambolle-Pock, as a special case of the base algorithm, can take the dual stepsize up to 4/34/3 of the original one without a compromise on the primal stepsize. The numerical experiments indicates the acceleration for tested algorithms under the relaxed condition. The benefit for Chambolle-Pock is more prominent. The condition for Chambolle-Pock is also proved to be optimal and can not be relaxed. However, how to relax the condition further for the base algorithm is still an open problem.

Acknowledgements.
This work is partially supported by the NSF grant DMS-2012439.

References

  • [1] R Tyrrell Rockafellar. Conjugate duality and optimization. SIAM, 1974.
  • [2] Laurent Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of optimization theory and applications, 158(2):460–479, 2013.
  • [3] Bang Cong Vu. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013.
  • [4] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal-dual fixed point algorithm for minimization of the sum of three convex separable functions. Fixed Point Theory and Applications, 2016(1):1–18, 2016.
  • [5] Puya Latafat and Panagiotis Patrinos. Asymmetric forward–backward–adjoint splitting for solving monotone inclusions involving three operators. Computational Optimization and Applications, 68(1):57–93, 2017.
  • [6] Ming Yan. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76(3):1698–1717, 2018.
  • [7] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145, 2011.
  • [8] Ignace Loris and Caroline Verhoeven. On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems, 27(12):125007, 2011.
  • [9] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2):025011, 2013.
  • [10] Yoel Drori, Shoham Sabach, and Marc Teboulle. A simple algorithm for a class of nonsmooth convex–concave saddle-point problems. Operations Research Letters, 43(2):209–214, 2015.
  • [11] Zhi Li and Ming Yan. New convergence analysis of a primal-dual algorithm with large stepsizes. Advances in Computational Mathematics, 47(1):1–20, 2021.
  • [12] Bingsheng He, Feng Ma, and Xiaoming Yuan. Optimal proximal augmented lagrangian method and its application to full jacobian splitting for multi-block separable convex minimization problems. IMA Journal of Numerical Analysis, 40(2):1188–1216, 2020.
  • [13] Yao Li and Ming Yan. On the linear convergence of two decentralized algorithms. Journal of Optimization Theory and Applications, 189(1):271–290, 2021.
  • [14] Bingsheng He, Feng Ma, Shengjie Xu, and Xiaoming Yuan. A generalized primal-dual algorithm with improved convergence condition for saddle point problems. arXiv preprint arXiv:2112.00254, 2021.
  • [15] Bingsheng He, Feng Ma, and Xiaoming Yuan. Optimally linearizing the alternating direction method of multipliers for convex programming. Computational Optimization and Applications, 75(2):361–388, 2020.
  • [16] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
  • [17] Heinz H Bauschke, Patrick L Combettes, et al. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
  • [18] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
  • [19] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.