
Yao Li, Michigan State University, East Lansing, MI 48824, USA
Ming Yan, Michigan State University, East Lansing, MI 48824, USA

On the improved conditions for some primal-dual algorithms

Yao Li Ming Yan


Abstract

The convex minimization of $f({\mathbf{x}})+g({\mathbf{x}})+h({\mathbf{A}}{\mathbf{x}})$ over $\mathbb{R}^{n}$, with differentiable $f$ and linear operator ${\mathbf{A}}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}$, has been well studied in the literature. By considering the primal-dual optimality of the problem, many algorithms have been proposed from different perspectives, such as monotone operator schemes and fixed-point theory. In this paper, we start with a base algorithm to reveal the connection between several algorithms such as AFBA, PD3O and Chambolle-Pock. Then, we prove its convergence under a relaxed assumption associated with the linear operator and characterize the general constraint on primal and dual stepsizes. The result improves the upper bound of stepsizes of AFBA and indicates that Chambolle-Pock, as the special case of the base algorithm when $f=0$, can take the stepsize of the dual iteration up to $4/3$ of the previously proven one.

Keywords: Primal-dual · Asymmetric Forward–Backward-Adjoint splitting · Primal–Dual Three-Operator splitting · Chambolle-Pock


1 Introduction

We consider the following minimization problem in the form of the sum of three functions

\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{n}}\ f({\mathbf{x}})+g({\mathbf{x}})+h({\mathbf{A}}{\mathbf{x}}),

(1)

where $f$ is a differentiable convex function with an $L$-Lipschitz continuous gradient and $g$, $h$ are proper, closed and convex functions taking values in $(-\infty,\infty]$. The third function $h$ is composed with a linear operator ${\mathbf{A}}\in\mathbb{R}^{m\times n}$, whose largest singular value is $\sigma$, i.e., $\sigma=\sqrt{\|{\mathbf{A}}{\mathbf{A}}^{\top}\|}$.

We use the asterisk superscript to denote the Legendre-Fenchel conjugate function, e.g., $h^{*}:\mathbb{R}^{m}\rightarrow(-\infty,\infty]$ is defined as

h^{*}({\mathbf{s}})=\sup_{{\mathbf{x}}\in\mathbb{R}^{m}}\ \langle{\mathbf{s}},{\mathbf{x}}\rangle-h({\mathbf{x}}).

Then, the saddle-point form of problem (1) is

\min_{{\mathbf{x}}\in\mathbb{R}^{n}}\max_{{\mathbf{s}}\in\mathbb{R}^{m}}\ f({\mathbf{x}})+g({\mathbf{x}})+\langle{\mathbf{A}}{\mathbf{x}},{\mathbf{s}}\rangle-h^{*}({\mathbf{s}})

(2)

where ${\mathbf{x}}$ and ${\mathbf{s}}$ are the primal and dual variables, respectively. Given the existence of a solution pair $({\mathbf{x}}^{*},{\mathbf{s}}^{*})$, the first-order optimality condition is characterized as

\left\{\begin{aligned} &0\in\nabla f({\mathbf{x}}^{*})+\partial g({\mathbf{x}}^{*})+{\mathbf{A}}^{\top}{\mathbf{s}}^{*},\\ &0\in\partial h^{*}({\mathbf{s}}^{*})-{\mathbf{A}}{\mathbf{x}}^{*},\end{aligned}\right.

where $\partial g({\mathbf{x}})=\{{\mathbf{v}}\in\mathbb{R}^{n}\ |\ g({\mathbf{y}})-g({\mathbf{x}})\geq\langle{\mathbf{v}},{\mathbf{y}}-{\mathbf{x}}\rangle,\forall{\mathbf{y}}\in\mathbb{R}^{n}\}$ is the subdifferential of $g$ at ${\mathbf{x}}$. The duality theorem (rockafellar1974conjugate, Theorem 15) asserts that ${\mathbf{x}}^{*}$ is a solution of the problem (1) and ${\mathbf{s}}^{*}$ is a solution of the corresponding dual problem

\operatorname*{minimize}_{{\mathbf{s}}\in\mathbb{R}^{m}}\ (f^{*}\mathbin{\square}g^{*})(-{\mathbf{A}}^{\top}{\mathbf{s}})+h^{*}({\mathbf{s}}),

(3)

where $f^{*}\mathbin{\square}g^{*}$ is the infimal convolution of $f^{*}$ and $g^{*}$, defined as $(f^{*}\mathbin{\square}g^{*})({\mathbf{x}})=\inf_{{\mathbf{y}}\in\mathbb{R}^{n}}f^{*}({\mathbf{y}})+g^{*}({\mathbf{x}}-{\mathbf{y}})$, with the property that $(f^{*}\mathbin{\square}g^{*})^{*}=f+g$. Note that when $f=0$ (or $g=0$), the infimal convolution $f^{*}\mathbin{\square}g^{*}$ boils down to $g^{*}$ (or $f^{*}$).

Existing primal-dual algorithms to solve the above saddle-point problem (2) include Condat-Vu condat2013primal ; vu2013splitting , the Primal–Dual Fixed-Point algorithm (PDFP) chen2016primal , Asymmetric Forward–Backward-Adjoint splitting (AFBA) latafat2017asymmetric and Primal–Dual Three-Operator splitting (PD3O) yan2018new . When $f=0$, Condat-Vu and PD3O reduce to Chambolle-Pock chambolle2011first . When $g=0$, all algorithms except Condat-Vu reduce to the Proximal Alternating Predictor–Corrector (PAPC), also known as the Primal–Dual Fixed-Point algorithm based on the Proximity Operator (PDFP²O) loris2011generalization ; chen2013primal ; drori2015simple . The conditions on their stepsizes that guarantee convergence are associated with the largest singular value $\sigma$ of the linear operator and the Lipschitz constant $L$. With the notation $\eta_{\text{p}}$ for the primal stepsize and $\eta_{\text{d}}$ for the dual stepsize, Table 1 lists the conditions on the stepsizes of the aforementioned algorithms and their connections when either $f$ or $g$ vanishes.

Algorithm	Condition on $\eta_{\text{p}},\eta_{\text{d}}$	Reduces to when $f=0$	Reduces to when $g=0$
Condat-Vu	$\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}+\eta_{\text{p}}L/2\leq 1$	C-P(primal)	-
PDFP	$\eta_{\text{p}}L/2<1,\ \eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1$	-	PAPC
AFBA	$\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}+\sqrt{\eta_{\text{p}}\eta_{\text{d}}}\sigma+\eta_{\text{p}}L\leq 2$	C-P(dual)	PAPC
PD3O	$\eta_{\text{p}}L/2<1,\ \eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1$	C-P(primal)	PAPC

Special case	Condition on $\eta_{\text{p}},\eta_{\text{d}}$	Setting
Chambolle-Pock	$\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1$	$f=0$
PAPC(PDFP²O)	$\eta_{\text{p}}L/2<1,\ \eta_{\text{p}}\eta_{\text{d}}\sigma^{2}\leq 1$	$g=0$

Table 1: The conditions on stepsizes for the primal-dual algorithms. C-P(primal) and C-P(dual) stand for Chambolle-Pock applied to the primal problem and the dual problem, respectively. In li2021new , PAPC can take a larger upper bound of $\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}$, up to $4/3$.

The conditions listed in Table 1 are all sufficient to guarantee the convergence of the associated algorithms, and some cannot be relaxed any further. For PAPC, the first constraint, $\eta_{\text{p}}L/2<1$, cannot be relaxed, since PAPC reduces to the gradient descent method with stepsize $\eta_{\text{p}}$ when only $f$ is present, i.e., $h=0$. However, it turns out that the upper bound of $\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}$ can be relaxed to $4/3$. In he2020optimal ; li2021linear , a special case of PAPC, the linearized augmented Lagrangian method, and an application of PAPC to decentralized optimization are shown to converge under the relaxed condition. The general convergence result of PAPC with the relaxed condition is shown in li2021new . The $4/3$ bound is also shown to be tight in he2020optimal ; li2021new . This result raises an open question: can a similar relaxation be applied to Chambolle-Pock? The follow-up question is: can this result be generalized to algorithms for the general problem (1)?
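To make the comparison in Table 1 concrete, the following Python sketch evaluates the four sufficient conditions for a given choice of stepsizes. It is only an illustration added here: the function name `table1_conditions` and the sample values are hypothetical and do not come from the cited works.

```python
import math

def table1_conditions(eta_p, eta_d, sigma, L):
    """Evaluate the sufficient stepsize conditions of Table 1 (illustrative helper)."""
    prod = eta_p * eta_d * sigma**2  # eta_p * eta_d * sigma^2
    return {
        "Condat-Vu": prod + eta_p * L / 2 <= 1,
        "PDFP": eta_p * L / 2 < 1 and prod <= 1,
        "AFBA": prod + math.sqrt(eta_p * eta_d) * sigma + eta_p * L <= 2,
        "PD3O": eta_p * L / 2 < 1 and prod <= 1,
    }

# Sample values (assumed): sigma = 1, L = 1.
large = table1_conditions(eta_p=1.0, eta_d=1.0, sigma=1.0, L=1.0)
small = table1_conditions(eta_p=0.5, eta_d=0.5, sigma=1.0, L=1.0)
```

With the larger stepsizes only the PDFP and PD3O conditions hold, while the smaller stepsizes satisfy all four conditions.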

The first question is implicitly answered by some work. A generalized Chambolle-Pock is proposed in he2021generalized to achieve the relaxation and its equivalence to the canonical Chambolle-Pock is proved for the following primal-dual problem

\min_{{\mathbf{x}}\in\mathbb{R}^{n}}\max_{{\mathbf{s}}\in\mathbb{R}^{m}}\ g({\mathbf{x}})+\langle{\mathbf{A}}{\mathbf{x}},{\mathbf{s}}\rangle-\langle{\mathbf{b}},{\mathbf{s}}\rangle

which is the form (2) with $f=0$ and a linear $h^{*}$. The necessity of the condition is also shown by a simple example.

In he2020optimally , a relaxed linearization parameter is considered in the linearized Alternating Direction Method of Multipliers (L-ADMM) so that a larger stepsize can be used in the linearized subproblem, leading to faster convergence. L-ADMM considers the following linearly constrained problem,

\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{n_{1}},{\mathbf{y}}\in\mathbb{R}^{n_{2}}}\ \widetilde{f}({\mathbf{x}})+\widetilde{g}({\mathbf{y}})\quad\text{s.t.}\quad\widetilde{{\mathbf{A}}}{\mathbf{x}}+\widetilde{{\mathbf{B}}}{\mathbf{y}}={\mathbf{b}},

(4)

where $\widetilde{f}:\mathbb{R}^{n_{1}}\rightarrow\mathbb{R}$ , $\widetilde{g}:\mathbb{R}^{n_{2}}\rightarrow\mathbb{R}$ are proper, closed and convex, $\widetilde{{\mathbf{A}}}\in\mathbb{R}^{m\times n_{1}}$ , $\widetilde{{\mathbf{B}}}\in\mathbb{R}^{m\times n_{2}}$ are two linear operators and ${\mathbf{b}}\in\mathbb{R}^{m}$ . When $\widetilde{f}=g^{*}$ and $\widetilde{g}=h^{*}$ , the dual problem can be formulated via Lagrangian multiplier as

\operatorname*{minimize}_{{\mathbf{s}}\in\mathbb{R}^{m}}\ -\langle{\mathbf{b}},{\mathbf{s}}\rangle+g(\widetilde{{\mathbf{A}}}^{\top}{\mathbf{s}})+h(\widetilde{{\mathbf{B}}}^{\top}{\mathbf{s}}).

(5)

A special case of the above dual problem (5) is ${\mathbf{b}}=\mathbf{0}$, $\widetilde{{\mathbf{A}}}={\mathbf{I}}$, and $\widetilde{{\mathbf{B}}}^{\top}={\mathbf{A}}$. In this case, it reduces to the problem (1) with $f=0$. It can be easily shown that L-ADMM applied to this special case of the problem (4) is equivalent to Chambolle-Pock applied to the corresponding dual problem. Therefore, the result in he2020optimally suggests the improved condition on the stepsizes of Chambolle-Pock, $\eta_{\text{p}}\eta_{\text{d}}\sigma^{2}<\frac{4}{3}$. When ${\mathbf{b}}\neq\mathbf{0}$ in the special case, the result in he2020optimally also indicates that the relaxed condition extends to the problem (1) with a linear function $f$.

However, there is no result for general Lipschitz smooth $f$ in the literature. This work will address the second question for AFBA and PD3O by characterizing a more general upper bound for $\eta_{\text{p}}$ and $\eta_{\text{d}}$ , and directly give an affirmative answer to the first question.

Throughout the rest of the paper, we assume that there exists a solution pair $({\mathbf{x}}^{*},{\mathbf{s}}^{*})$ of the problem (2). We use $({\mathbf{I}}+\lambda\partial g)^{-1}({\mathbf{x}})$ to represent the proximal operator of $g$ at ${\mathbf{x}}$ , which is defined as

\mathbf{prox}_{\lambda g}({\mathbf{x}})\coloneqq\mathop{\arg\min}_{{\mathbf{y}}\in\mathbb{R}^{n}}\ g({\mathbf{y}})+\frac{1}{2\lambda}\|{\mathbf{y}}-{\mathbf{x}}\|^{2}.

We use $\langle\cdot,\cdot\rangle$ and $\|\cdot\|$ as the standard inner product and norm defined over $\mathbb{R}^{n}$ , and denote $\|\cdot\|_{\mathbf{M}}=\sqrt{\langle\cdot,{\mathbf{M}}(\cdot)\rangle}$ as the (semi)norm associated with the given positive (semi)definite matrix ${\mathbf{M}}\in\mathbb{R}^{n\times n}$ .
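As a small illustration of the proximal operator, the sketch below implements it for the $\ell_{1}$ norm (soft-thresholding) and obtains the proximal operator of a conjugate function through the Moreau decomposition $\mathbf{prox}_{\tau h^{*}}({\mathbf{s}})={\mathbf{s}}-\tau\,\mathbf{prox}_{h/\tau}({\mathbf{s}}/\tau)$. The function names `prox_l1` and `prox_conj` are hypothetical, and the choice $h=\|\cdot\|_{1}$ is an assumption made only for this example.

```python
import numpy as np

def prox_l1(x, lam):
    """prox of lam * ||.||_1 at x, i.e., componentwise soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def prox_conj(prox_h, s, tau):
    """prox of tau * h^* at s via the Moreau decomposition.

    prox_h(v, t) is assumed to return prox_{t h}(v).
    """
    return s - tau * prox_h(s / tau, 1.0 / tau)

s = np.array([3.0, 0.5, -2.0])
p = prox_conj(prox_l1, s, tau=0.7)
# For h = ||.||_1, h^* is the indicator of the unit infinity-norm ball,
# so its proximal operator is the projection onto that ball.
projection = np.clip(s, -1.0, 1.0)
```

Here `p` agrees with `projection` up to rounding, independently of the value of `tau`.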

The rest of this paper is organized as follows. We first derive a base algorithm that is later shown to be an equivalent form of AFBA and illustrate the connection between AFBA, PD3O and Chambolle-Pock in Section 2. We then prove the convergence of the base algorithm under the relaxed condition associated with the matrix ${\mathbf{A}}$ in Section 3. In the next section, we numerically show the performance of the algorithms under the proved weaker condition.

2 The Base Algorithm

For simplicity, we set $\eta_{\text{p}}=r$ and $\eta_{\text{d}}=\lambda/r$ . That is, the product of the primal and dual stepsizes is $\lambda$ . The optimality condition of problem (2) can be reformulated as the following monotone inclusion problem

\begin{bmatrix}0\\ 0\end{bmatrix}\in\begin{bmatrix}\nabla f({\mathbf{x}})\\ 0\end{bmatrix}+\begin{bmatrix}\partial g&{\mathbf{A}}^{\top}\\ -{\mathbf{A}}&\partial h^{*}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}.

(6)

This problem is also equivalent to

\begin{bmatrix}{\mathbf{I}}&\\ &\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}\in\begin{bmatrix}r\nabla f({\mathbf{x}})\\ 0\end{bmatrix}+\begin{bmatrix}{\mathbf{I}}+r\partial g&r{\mathbf{A}}^{\top}\\ -r{\mathbf{A}}&\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}+r\partial h^{*}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix}.

Let $\zeta\in{\mathbf{x}}-r\nabla f({\mathbf{x}})-r\partial g({\mathbf{x}})$ , then the above form can be decomposed as


$\displaystyle\begin{bmatrix}{\mathbf{I}}&\\ &\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}\end{bmatrix}\begin{bmatrix}\zeta\\ {\mathbf{s}}\end{bmatrix}\in\begin{bmatrix}{\mathbf{I}}&r{\mathbf{A}}^{\top}\\ -r{\mathbf{A}}&\frac{r^{2}}{\lambda}{\mathbf{I}}-r^{2}{\mathbf{A}}{\mathbf{A}}^{\top}+r\partial h^{*}\end{bmatrix}\begin{bmatrix}{\mathbf{x}}\\ {\mathbf{s}}\end{bmatrix},$	(7a)
$\displaystyle({\mathbf{I}}+r\partial g)^{-1}(2{\mathbf{x}}-\zeta-r\nabla f({\mathbf{x}}))+\zeta-{\mathbf{x}}=\zeta.$	(7b)

Plug $({\mathbf{x}}^{k+1},{\mathbf{s}}^{k+1},\zeta^{k+1})$ into the right-hand side of (7) and let the remaining $\zeta,{\mathbf{x}},{\mathbf{s}}$ on the left-hand side of (7) be the variables of the $k$th iteration, $({\mathbf{x}}^{k},{\mathbf{s}}^{k},\zeta^{k})$. Then, we can solve for the next-step variables of the system (7) in the order ${\mathbf{s}}$, ${\mathbf{x}}$, $\zeta$ by Gaussian elimination. The base algorithm reads


$\displaystyle{\mathbf{s}}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}),$	(8a)
$\displaystyle{\mathbf{x}}^{k+1}$	$\displaystyle=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1},$	(8b)
$\displaystyle\zeta^{k+1}$	$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}-r\nabla f({\mathbf{x}}^{k+1}))-{\mathbf{x}}^{k+1}+\zeta^{k},$	(8c)

where $r$ and $\lambda/r$ are primal and dual stepsizes, respectively.
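The iteration (8) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `base_algorithm`, the callable signatures `prox_g(v, t)` and `prox_h_conj(v, t)`, the initialization of $\zeta^{0}$, and the toy problem are all choices made here for the example.

```python
import numpy as np

def base_algorithm(grad_f, prox_g, prox_h_conj, A, r, lam, x0, s0, iters=200):
    """Run `iters` steps of iteration (8) and return (x, s, zeta).

    prox_g(v, t) is assumed to return prox_{t g}(v), and
    prox_h_conj(v, t) is assumed to return prox_{t h^*}(v).
    """
    x, s = x0.copy(), s0.copy()
    zeta = x0 + r * (A.T @ s0)  # assumed initialization, mimicking zeta = x + r A^T s
    for _ in range(iters):
        # (8a): dual proximal step with stepsize lam / r
        s = prox_h_conj(lam / r * (A @ zeta) + s - lam * (A @ (A.T @ s)), lam / r)
        # (8b): primal variable from zeta and the new dual variable
        x = zeta - r * (A.T @ s)
        # (8c): forward step on f, proximal step on g, then update zeta
        zeta = prox_g(x - r * (A.T @ s) - r * grad_f(x), r) - x + zeta
    return x, s, zeta

# Toy instance (assumed): f(x) = 0.5*||x - b||^2 (L = 1), g = 0, h = ||.||_1, A = I (sigma = 1).
b = np.array([3.0, 0.5, -2.0])
x, s, zeta = base_algorithm(
    grad_f=lambda v: v - b,
    prox_g=lambda v, t: v,                           # g = 0, so the prox is the identity
    prox_h_conj=lambda v, t: np.clip(v, -1.0, 1.0),  # h^* = indicator of the unit inf-norm ball
    A=np.eye(3), r=1.0, lam=0.9,                     # r < 2/L and lam <= 1/sigma^2
    x0=np.zeros(3), s0=np.zeros(3),
)
x_expected = np.sign(b) * np.maximum(np.abs(b) - 1.0, 0.0)  # soft-thresholding of b
```

For this toy instance the minimizer of (1) is the soft-thresholding of ${\mathbf{b}}$, which the iterate `x` approaches.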

In Section 3, we will show the convergence of the algorithm when $r<\frac{4\theta-3}{2\theta-1}\frac{2}{L}$ and $\lambda\leq\frac{1}{\theta\sigma^{2}}$ with any $\theta\in(3/4,1]$. Notice that, when $\theta=1$, the conditions on $r$ and $\lambda$ reduce to $r<\frac{2}{L}$ and $\lambda\leq\frac{1}{\sigma^{2}}$. As listed in Table 1, this is the sufficient condition on stepsizes for PDFP shown in chen2016primal and for PD3O shown in yan2018new , which is weaker than those for Condat-Vu shown in condat2013primal ; vu2013splitting and AFBA shown in latafat2017asymmetric . For $\theta<1$, this general form of the upper bound gives a larger range of $\lambda$ and accordingly narrows the range of values of $r$.

However, the compromise on $r$ disappears when $f$ is linear ($L=0$). In this case, $r$ can take any value in $(0,\infty)$ and, consequently, $\lambda$ can take any value below $\frac{4}{3\sigma^{2}}$. We will later show in Section 2.3 that the base algorithm recovers Chambolle-Pock applied to the dual problem (3) with $f=0$. This range of $\lambda$ is strictly larger than the one previously used for Chambolle-Pock, which is $\lambda\leq\frac{1}{\sigma^{2}}$. Therefore, the first question is answered.
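The trade-off between $r$ and $\lambda$ described above can be tabulated with a short helper. The sketch below is only illustrative; the function name `stepsize_bounds` is hypothetical, and it simply evaluates the two bounds $\frac{4\theta-3}{2\theta-1}\frac{2}{L}$ and $\frac{1}{\theta\sigma^{2}}$ stated in the text.

```python
import math

def stepsize_bounds(theta, L, sigma):
    """Return (r_sup, lam_max) such that r < r_sup and lam <= lam_max for a given theta."""
    if not 0.75 < theta <= 1.0:
        raise ValueError("theta must lie in (3/4, 1]")
    lam_max = 1.0 / (theta * sigma**2)
    # When f is linear (L = 0), there is no restriction on r.
    r_sup = math.inf if L == 0 else (4 * theta - 3) / (2 * theta - 1) * 2.0 / L
    return r_sup, lam_max

classical = stepsize_bounds(theta=1.0, L=2.0, sigma=1.0)  # recovers r < 2/L, lam <= 1/sigma^2
relaxed = stepsize_bounds(theta=0.8, L=2.0, sigma=1.0)    # larger lam range, smaller r range
linear_f = stepsize_bounds(theta=0.76, L=0.0, sigma=1.0)  # L = 0: r unrestricted
```

Decreasing `theta` toward $3/4$ pushes the bound on $\lambda$ toward $\frac{4}{3\sigma^{2}}$ while the bound on $r$ shrinks toward zero, unless $L=0$.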

Next, we will discuss the connection of the base algorithm with AFBA, PD3O, Chambolle-Pock and PAPC(PDFP²O). We say an algorithm is equivalent to another one if and only if we can find a one-to-one map between the sequences generated by the two algorithms with appropriate initialization.

2.1 Connection with AFBA

AFBA is a scheme for the general monotone inclusion problem of three-operator splitting. The special form (latafat2017asymmetric, Algorithm $5$) used to solve the problem (2) reads


$\displaystyle{\mathbf{s}}_{1}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{1}^{k}+\frac{\lambda}{r}{\mathbf{A}}\overline{{\mathbf{x}}}_{1}^{k}),$	(9a)
$\displaystyle{\mathbf{x}}_{1}^{k+1}$	$\displaystyle=\overline{{\mathbf{x}}}_{1}^{k}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{1}^{k+1}-{\mathbf{s}}_{1}^{k}),$	(9b)
$\displaystyle\overline{{\mathbf{x}}}_{1}^{k+1}$	$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{1}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}-r\nabla f({\mathbf{x}}_{1}^{k+1})).$	(9c)

We will show that AFBA is equivalent to the base algorithm if the relation

\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ {\mathbf{x}}_{1}^{k}\\ \overline{{\mathbf{x}}}_{1}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k}\end{bmatrix},\quad\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ {\mathbf{x}}_{1}^{k}\\ \overline{{\mathbf{x}}}_{1}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k}\end{bmatrix}

(10)

holds.

Define $\zeta^{k}=\overline{{\mathbf{x}}}_{1}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k}$ for all $k$. Then (9a) and (9b) are reformulated by canceling $\overline{{\mathbf{x}}}_{1}^{k}$ as

	$\displaystyle{\mathbf{s}}_{1}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{1}^{k}+\frac{\lambda}{r}{\mathbf{A}}\overline{{\mathbf{x}}}_{1}^{k})=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}_{1}^{k}),$
	$\displaystyle{\mathbf{x}}_{1}^{k+1}$	$\displaystyle=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{1}^{k+1}-{\mathbf{s}}_{1}^{k})=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}.$

The update in (9c) gives the update of $\zeta$ as

	$\displaystyle\zeta^{k+1}$	$\displaystyle=\overline{{\mathbf{x}}}_{1}^{k+1}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}$
		$\displaystyle=\overline{{\mathbf{x}}}_{1}^{k+1}+\zeta^{k}-{\mathbf{x}}_{1}^{k+1}$
		$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{1}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}_{1}^{k+1}-r\nabla f({\mathbf{x}}_{1}^{k+1}))+\zeta^{k}-{\mathbf{x}}_{1}^{k+1}.$

Hence, the sequence $\{({\mathbf{s}}_{1}^{k},{\mathbf{x}}_{1}^{k})\}$ generated by AFBA coincides with the sequence $\{({\mathbf{s}}^{k},{\mathbf{x}}^{k})\}$ generated by the base algorithm with the same initialization. The reverse direction can be verified by (10) in the same way, hence the base algorithm is equivalent to AFBA applied to the problem (2).
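The equivalence can also be checked numerically. The sketch below runs the base algorithm (8) and AFBA (9) side by side from initial points linked by relation (10) and records the largest gap between the two $({\mathbf{s}},{\mathbf{x}})$ sequences. The random test problem ($f=\frac{1}{2}\|{\mathbf{x}}-{\mathbf{b}}\|^{2}$, $g=0.1\|\cdot\|_{1}$, $h=\|\cdot\|_{1}$), the dimensions, and all function names are assumptions made for this illustration only.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding, the prox of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
m, n = 4, 6
A = rng.standard_normal((m, n))
b = rng.standard_normal(n)
grad_f = lambda x: x - b                      # f(x) = 0.5*||x - b||^2, so L = 1
prox_g = lambda v, t: soft(v, 0.1 * t)        # g = 0.1*||.||_1
prox_hc = lambda v, t: np.clip(v, -1.0, 1.0)  # h = ||.||_1, h^* = indicator of unit inf-ball

sigma = np.linalg.norm(A, 2)                  # largest singular value of A
r, lam = 1.0, 0.9 / sigma**2

def run_base(s, zeta, iters):
    out = []
    for _ in range(iters):
        s = prox_hc(lam / r * (A @ zeta) + s - lam * (A @ (A.T @ s)), lam / r)  # (8a)
        x = zeta - r * (A.T @ s)                                                # (8b)
        zeta = prox_g(x - r * (A.T @ s) - r * grad_f(x), r) - x + zeta          # (8c)
        out.append((s.copy(), x.copy()))
    return out

def run_afba(s, xbar, iters):
    out = []
    for _ in range(iters):
        s_new = prox_hc(s + lam / r * (A @ xbar), lam / r)       # (9a)
        x = xbar - r * (A.T @ (s_new - s))                       # (9b)
        s = s_new
        xbar = prox_g(x - r * (A.T @ s) - r * grad_f(x), r)      # (9c)
        out.append((s.copy(), x.copy()))
    return out

s0 = rng.standard_normal(m)
zeta0 = rng.standard_normal(n)
base = run_base(s0, zeta0, 50)
afba = run_afba(s0, zeta0 - r * (A.T @ s0), 50)   # xbar^0 = zeta^0 - r A^T s^0, as in (10)
max_gap = max(max(np.abs(sb - sa).max(), np.abs(xb - xa).max())
              for (sb, xb), (sa, xa) in zip(base, afba))
```

In exact arithmetic the two sequences coincide, so `max_gap` is expected to be at the level of rounding error.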

From (latafat2017asymmetric, Proposition $5.3$), $({\mathbf{x}}^{k},{\mathbf{s}}^{k})$ generated by AFBA will converge to a saddle-point solution $({\mathbf{x}}^{*},{\mathbf{s}}^{*})$ when $\frac{\lambda\sigma^{2}}{2}+\frac{\sqrt{\lambda}\sigma}{2}+\frac{rL}{2}\leq 1$. The theoretical result of the base algorithm improves the condition to $r<\frac{4\theta-3}{2\theta-1}\frac{2}{L}$ and $\lambda\leq\frac{1}{\theta\sigma^{2}}$ for any $\theta\in(\frac{3}{4},1]$. Furthermore, when $\theta=1$, the relaxed condition coincides with the one required for PDFP and PD3O.

2.2 Connection with PD3O

Applying PD3O to the dual problem (3), we get the following iteration,


$\displaystyle{\mathbf{s}}_{2}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{z}}_{2}^{k}),$	(11a)
$\displaystyle{\mathbf{x}}_{2}^{k+1}$	$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}(({\mathbf{I}}-\lambda{\mathbf{A}}^{\top}{\mathbf{A}}){\mathbf{x}}_{2}^{k}-r\nabla f({\mathbf{x}}_{2}^{k})-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{2}^{k+1}-{\mathbf{z}}_{2}^{k})),$	(11b)
$\displaystyle{\mathbf{z}}_{2}^{k+1}$	$\displaystyle={\mathbf{s}}_{2}^{k+1}+\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{2}^{k+1}.$	(11c)

We can easily verify that the above iteration is equivalent to AFBA if

\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}_{2}^{k}\\ {\mathbf{z}}_{2}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ \overline{{\mathbf{x}}}_{1}^{k}\\ {\mathbf{s}}_{1}^{k}+\frac{\lambda}{r}{\mathbf{A}}\overline{{\mathbf{x}}}_{1}^{k}\end{bmatrix},\quad\begin{bmatrix}{\mathbf{s}}_{1}^{k}\\ {\mathbf{x}}_{1}^{k}\\ \overline{{\mathbf{x}}}_{1}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}^{k-1}_{2}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{2}^{k}-{\mathbf{s}}_{2}^{k-1})\\ {\mathbf{x}}_{2}^{k}\end{bmatrix}.

(12)

Combining (11a) and (11c), we have

{\mathbf{s}}_{2}^{k+1}=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{2}^{k}+\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{2}^{k}).

The equation (11b) is reformulated by cancelling ${\mathbf{z}}_{2}$ as

	$\displaystyle{\mathbf{x}}_{2}^{k+1}$	$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}(({\mathbf{I}}-\lambda{\mathbf{A}}^{\top}{\mathbf{A}}){\mathbf{x}}_{2}^{k}-r\nabla f({\mathbf{x}}_{2}^{k})-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{2}^{k+1}-{\mathbf{s}}_{2}^{k}-\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{2}^{k}))$
		$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{2}^{k}-r\nabla f({\mathbf{x}}_{2}^{k})-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{2}^{k+1}-{\mathbf{s}}_{2}^{k})).$

By defining ${\mathbf{x}}_{1}^{k+1}={\mathbf{x}}_{2}^{k}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}_{2}-{\mathbf{s}}_{2}^{k})$, we observe that the sequence $\{({\mathbf{s}}_{2}^{k},{\mathbf{x}}_{2}^{k})\}$ generated by PD3O coincides with the sequence $\{({\mathbf{s}}_{1}^{k},\overline{{\mathbf{x}}}_{1}^{k})\}$ generated by AFBA with the same initialization. The reverse direction also holds from (12).

Due to the equivalence between AFBA and the base algorithm, PD3O is also equivalent to the base algorithm following the sequence relation

\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}_{2}^{k}\\ {\mathbf{z}}_{2}^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}^{k}\\ \zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k}\\ ({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}+\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}\end{bmatrix},\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{2}^{k}\\ {\mathbf{x}}^{k-1}_{2}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{2}^{k}-{\mathbf{s}}_{2}^{k-1})\\ {\mathbf{x}}_{2}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{2}^{k}\end{bmatrix}.

(13)

2.3 Connection with Chambolle-Pock

When the smooth function $f$ vanishes, the dual problem (3) boils down to

\operatorname*{minimize}_{{\mathbf{s}}\in\mathbb{R}^{m}}\ g^{*}(-{\mathbf{A}}^{\top}{\mathbf{s}})+h^{*}({\mathbf{s}}).

(14)

Applying Chambolle-Pock to it, we get


$\displaystyle{\mathbf{s}}_{3}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}({\mathbf{s}}_{3}^{k}+\frac{\lambda}{r}{\mathbf{A}}{\mathbf{x}}_{3}^{k}),$	(15a)
$\displaystyle{\mathbf{x}}_{3}^{k+1}$	$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}_{3}^{k}-r{\mathbf{A}}^{\top}(2{\mathbf{s}}_{3}^{k+1}-{\mathbf{s}}_{3}^{k})).$	(15b)

The relation between Chambolle-Pock and the base algorithm can be observed either from the relation between PD3O and Chambolle-Pock, or from the fact that the sequence $\{({\mathbf{s}}_{3}^{k},{\mathbf{x}}_{3}^{k})\}$ generated by Chambolle-Pock coincides with the sequence $\{({\mathbf{s}}^{k},\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k})\}$ generated by the base algorithm with appropriate initialization.

The sequence relation is

\begin{bmatrix}{\mathbf{s}}_{3}^{k}\\ {\mathbf{x}}_{3}^{k}\\ \end{bmatrix}=\begin{bmatrix}{\mathbf{s}}^{k}\\ \zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k}\\ \end{bmatrix},\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{x}}^{k}\\ \zeta^{k}\end{bmatrix}=\begin{bmatrix}{\mathbf{s}}_{3}^{k}\\ {\mathbf{x}}^{k-1}_{3}-r{\mathbf{A}}^{\top}({\mathbf{s}}_{3}^{k}-{\mathbf{s}}_{3}^{k-1})\\ {\mathbf{x}}_{3}^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}_{3}^{k}\end{bmatrix}.

(16)

In this case, the relaxed condition on stepsizes of the base algorithm is $r>0$ and $\lambda<\frac{4}{3\sigma^{2}},$ which extends the choice of $\lambda$ to a larger range for Chambolle-Pock. This bound is shown to be tight in Section 3.2.
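The sequence relation (16) can be checked numerically with a stepsize product inside the relaxed range. The sketch below uses $\lambda\sigma^{2}=1.3$, which lies between the classical bound $1$ and the relaxed bound $4/3$, and compares Chambolle-Pock (15) with the base algorithm (8) for $f=0$. The test problem ($g=\frac{1}{2}\|{\mathbf{x}}-{\mathbf{b}}\|^{2}$, $h=\|\cdot\|_{1}$), the dimensions, and the function names are assumptions for illustration; the code checks only that the two sequences coincide and makes no claim about convergence speed.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(n)
sigma = np.linalg.norm(A, 2)                  # largest singular value of A
r, lam = 1.0, 1.3 / sigma**2                  # lam * sigma^2 = 1.3, inside (1, 4/3)

prox_g = lambda v, t: (v + t * b) / (1.0 + t)   # g(x) = 0.5*||x - b||^2
prox_hc = lambda v, t: np.clip(v, -1.0, 1.0)    # h = ||.||_1, h^* = indicator of unit inf-ball

def chambolle_pock(s, x, iters):
    out = []
    for _ in range(iters):
        s_new = prox_hc(s + lam / r * (A @ x), lam / r)      # (15a)
        x = prox_g(x - r * (A.T @ (2.0 * s_new - s)), r)     # (15b)
        s = s_new
        out.append((s.copy(), x.copy()))
    return out

def base_f0(s, zeta, iters):
    out = []
    for _ in range(iters):
        s = prox_hc(lam / r * (A @ zeta) + s - lam * (A @ (A.T @ s)), lam / r)  # (8a)
        x = zeta - r * (A.T @ s)                                                # (8b)
        zeta = prox_g(x - r * (A.T @ s), r) - x + zeta                          # (8c) with grad f = 0
        out.append((s.copy(), zeta - r * (A.T @ s)))  # map to (s_3, x_3) through (16)
    return out

s0, x0 = rng.standard_normal(m), rng.standard_normal(n)
cp = chambolle_pock(s0, x0, 30)
base = base_f0(s0, x0 + r * (A.T @ s0), 30)           # zeta^0 = x_3^0 + r A^T s_3^0
max_gap = max(max(np.abs(sc - sb).max(), np.abs(xc - xb).max())
              for (sc, xc), (sb, xb) in zip(cp, base))
```

As with AFBA, the gap `max_gap` is expected to stay at the level of rounding error.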

2.4 Connection with PAPC(PDFP²O)

Although Table 1 indicates that AFBA and PD3O reduce to PAPC when $g=0$, and we have shown the equivalence between the base algorithm and them, we now derive this reduction directly for completeness.

With $g=0$ , the iteration of the base algorithm reduces to


$\displaystyle{\mathbf{s}}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}\zeta^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}),$	(17a)
$\displaystyle{\mathbf{x}}^{k+1}$	$\displaystyle=\zeta^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1},$	(17b)
$\displaystyle\zeta^{k+1}$	$\displaystyle={\mathbf{x}}^{k+1}-r\nabla f({\mathbf{x}}^{k+1}).$	(17c)

Let ${\mathbf{s}}_{4}^{k}={\mathbf{s}}^{k}$ and ${\mathbf{x}}_{4}^{k}={\mathbf{x}}^{k}$ . After canceling $\zeta$ , we get


$\displaystyle{\mathbf{s}}_{4}^{k+1}$	$\displaystyle=({\mathbf{I}}+\frac{\lambda}{r}\partial h^{*})^{-1}(\frac{\lambda}{r}{\mathbf{A}}({\mathbf{x}}_{4}^{k}-r\nabla f({\mathbf{x}}_{4}^{k}))+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}_{4}^{k}),$	(18a)
$\displaystyle{\mathbf{x}}_{4}^{k+1}$	$\displaystyle={\mathbf{x}}_{4}^{k}-r\nabla f({\mathbf{x}}_{4}^{k})-r{\mathbf{A}}^{\top}{\mathbf{s}}_{4}^{k+1}.$	(18b)

With appropriate initialization, i.e., $\zeta^{0}={\mathbf{x}}^{0}-r\nabla f({\mathbf{x}}^{0})$, the above iteration is exactly PAPC applied to the problem (1). However, the relaxed condition, $r<\frac{4\theta-3}{2\theta-1}\frac{2}{L}$ and $\lambda<\frac{1}{\theta\sigma^{2}}$ for $\theta\in(3/4,1]$, cannot achieve the tight bound shown in li2021new . One possible reason is that we cannot express ${\mathbf{A}}^{\top}{\mathbf{s}}$ as a combination of ${\mathbf{x}}$ and $\nabla f({\mathbf{x}})$ as in (18b) when a nontrivial $g$ exists. This is left for future research.

2.5 Relation Diagram

The relation between the aforementioned algorithms is visualized in the following diagram, where the algorithms in boxes are applied to the dual problem (3) with $\eta_{\text{p}}=\lambda/r$ and $\eta_{\text{d}}=r$.

The theoretical analysis of the base algorithm in the next section provides a unified proof of the convergence for the above algorithms under the relaxed conditions on $r$ and $\lambda$. An example is given in Section 3.2 to show the necessity of the condition on $r$ and $\lambda$.

3 Convergence Analysis

3.1 Convergence under a relaxed condition

We first define the following auxiliary variables associated with the proximal mapping of $h^{*}$ and $g$


$\displaystyle{\mathbf{y}}^{k+1}$	$\displaystyle=({\mathbf{I}}+r\partial g)^{-1}({\mathbf{x}}^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}-r\nabla f({\mathbf{x}}^{k+1})),$	(19a)
$\displaystyle{\mathbf{q}}_{g}^{k+1}$	$\displaystyle=\frac{1}{r}{\mathbf{x}}^{k+1}-{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}-\nabla f({\mathbf{x}}^{k+1})-\frac{1}{r}{\mathbf{y}}^{k+1},$	(19b)
$\displaystyle{\mathbf{q}}_{h}^{k+1}$	$\displaystyle={\mathbf{A}}\zeta^{k}+\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}-\frac{r}{\lambda}{\mathbf{s}}^{k+1}.$	(19c)

With the existence of the saddle-point solution $({\mathbf{x}}^{*},{\mathbf{s}}^{*}),$ we will see the quadruple $(\zeta^{k+1},{\mathbf{y}}^{k+1},{\mathbf{q}}_{g}^{k+1},{\mathbf{q}}_{h}^{k+1})$ converges to


$\displaystyle\zeta^{*}$	$\displaystyle={\mathbf{x}}^{*}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{*},$	(20a)
$\displaystyle{\mathbf{y}}^{*}$	$\displaystyle={\mathbf{x}}^{*},$	(20b)
$\displaystyle{\mathbf{q}}_{g}^{*}$	$\displaystyle=-{\mathbf{A}}^{\top}{\mathbf{s}}^{*}-\nabla f({\mathbf{x}}^{*}),$	(20c)
$\displaystyle{\mathbf{q}}_{h}^{*}$	$\displaystyle={\mathbf{A}}\zeta^{*}-r{\mathbf{A}}{\mathbf{A}}^{\top}{\mathbf{s}}^{*}={\mathbf{A}}{\mathbf{x}}^{*}.$	(20d)

Though the relation between the base algorithm, AFBA and PD3O asserts the fixed point of iteration (8) solves the problem (2), we restate it in the following lemma for completeness.

Lemma 1 (Optimality)

Let $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})$ be a fixed point of the iteration (8). Then $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star})$ is a solution of the problem (2) and $\zeta^{\star}={\mathbf{x}}^{\star}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{\star}.$

Proof

Iteration (8b) gives the relation $\zeta^{\star}={\mathbf{x}}^{\star}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{\star}.$ From (8a) and (8c), we have

	$\displaystyle{\mathbf{s}}^{\star}+\frac{\lambda}{r}\partial h^{*}({\mathbf{s}}^{\star})$	$\displaystyle\ni\frac{\lambda}{r}{\mathbf{A}}\zeta^{\star}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{\star},$
	$\displaystyle{\mathbf{x}}^{\star}+r\partial g({\mathbf{x}}^{\star})$	$\displaystyle\ni{\mathbf{x}}^{\star}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{\star}-r\nabla f({\mathbf{x}}^{\star}).$

Canceling $\zeta^{\star}$, we have

\begin{bmatrix}\partial h^{*}&-{\mathbf{A}}\\ {\mathbf{A}}^{\top}&\partial g+\nabla f\end{bmatrix}\begin{bmatrix}{\mathbf{s}}^{\star}\\ {\mathbf{x}}^{\star}\end{bmatrix}\ni\begin{bmatrix}0\\ 0\end{bmatrix},

which is the optimality condition of the problem (2).∎
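Lemma 1 can be illustrated on a toy problem whose solution is known in closed form. The sketch below takes $f=\frac{1}{2}\|{\mathbf{x}}-{\mathbf{b}}\|^{2}$, $g=0$, $h=\|\cdot\|_{1}$ and ${\mathbf{A}}={\mathbf{I}}$ (all assumptions made for this example), builds $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})$ from the known minimizer, and applies one pass of iteration (8) to confirm that the triple does not move.

```python
import numpy as np

b = np.array([3.0, 0.5, -2.0])
A, r, lam = np.eye(3), 1.0, 0.9

# Known saddle point of the toy problem: soft-thresholding and the matching dual variable.
x_star = np.sign(b) * np.maximum(np.abs(b) - 1.0, 0.0)  # minimizer of 0.5*||x-b||^2 + ||x||_1
s_star = b - x_star                                      # from 0 = grad f(x*) + A^T s*
zeta_star = x_star + r * (A.T @ s_star)                  # relation stated in Lemma 1

# One pass of iteration (8) started at the candidate fixed point.
s_next = np.clip(lam / r * (A @ zeta_star) + s_star - lam * (A @ (A.T @ s_star)), -1.0, 1.0)
x_next = zeta_star - r * (A.T @ s_next)
# prox of g = 0 is the identity, and grad f(x) = x - b.
zeta_next = (x_next - r * (A.T @ s_next) - r * (x_next - b)) - x_next + zeta_star

residual = max(np.abs(s_next - s_star).max(),
               np.abs(x_next - x_star).max(),
               np.abs(zeta_next - zeta_star).max())
```

The quantity `residual` measures how far the triple moves in one step and is expected to vanish up to rounding.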

We now use $({\mathbf{s}}^{*},{\mathbf{x}}^{*},\zeta^{*})$ to denote an arbitrary fixed point of the base algorithm.

Lemma 2 (Fundamental equality)

Let the sequence $\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\}$ be generated by the iteration (8). Then the following two equalities hold:

		$\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
	$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+T_{1}+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{2}$		(21)

and

	$\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle$
	$\displaystyle\ +r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle+T_{3},$	(22)

where

		$\displaystyle T_{1}\coloneqq\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle,$
		$\displaystyle T_{2}\coloneqq\langle{\mathbf{y}}^{*}-{\mathbf{y}}^{k},\nabla f({\mathbf{x}}^{k})-\nabla f({\mathbf{x}}^{*})\rangle,$
		$\displaystyle T_{3}\coloneqq\langle{\mathbf{y}}^{*}-{\mathbf{y}}^{k+1},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle.$

Proof

Plugging the iteration (8b) into ${\mathbf{q}}_{h}^{k+1}$ to cancel $\zeta^{k},$ we have

	$\displaystyle{\mathbf{q}}_{h}^{k+1}=$	$\displaystyle\ {\mathbf{A}}{\mathbf{x}}^{k+1}+r{\mathbf{A}}{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}+\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}-\frac{r}{\lambda}{\mathbf{s}}^{k+1}$
	$\displaystyle=$	$\displaystyle\ {\mathbf{A}}{\mathbf{x}}^{k+1}+\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}).$		(23)

For the equality (21), we have

	$\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \underbrace{\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\rangle-\frac{1}{r}\|{\mathbf{y}}^{k}-{\mathbf{y}}^{*}\|^{2}}_{=\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{y}}^{k}\rangle}-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\rangle+T_{2}$
	$\displaystyle\ +\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle\underbrace{-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\rangle}_{=-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle}$
	$\displaystyle\ +T_{2}+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{2}$
	$\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle,$	(24)

where the second equality uses

	$\displaystyle{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}=$	$\displaystyle\ {\mathbf{x}}^{k}-\zeta^{k}+r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}$
	$\displaystyle=$	$\displaystyle\ {\mathbf{x}}^{k}-{\mathbf{y}}^{k}+r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),$		(25)

which is derived from the iterations (8b) and (8c).

Note that from the equality (25), we also have

\displaystyle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k}=-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}).

(26)

Hence

	$\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle-r\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{2}$
	$\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{2}$
$\displaystyle=$	$\displaystyle\frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+T_{1}+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{2}$	(27)

and the equality (21) is derived.

For the equality (22), we have

	$\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1}\rangle-\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{3}$
	$\displaystyle\ +\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}(\zeta^{k}-\zeta^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}+\lambda{\mathbf{A}}{\mathbf{A}}^{\top}{\mathbf{s}}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},\zeta^{k}-\zeta^{k+1}\rangle-\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{3}$
	$\displaystyle\ +\underbrace{\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}(\zeta^{k}-\zeta^{*})\rangle+r\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{s}}^{*}-{\mathbf{s}}^{k})\rangle}_{=\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{y}}^{k}-{\mathbf{y}}^{*}\rangle}$
	$\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},\zeta^{k}-\zeta^{k+1}\rangle+\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{k+1},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{3}$
	$\displaystyle\ +\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle.$	(28)

Note that from the equality (25), we also have

\displaystyle{\mathbf{y}}^{k+1}=\zeta^{k+1}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}.

(29)

Combining it with (20a) and (20b), we have

\displaystyle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*}=\zeta^{k+1}-\zeta^{*}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),

(30)

then

	$\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{3}$
	$\displaystyle\ +\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{k+1},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle-\langle\zeta^{k}-\zeta^{k+1},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{3}$
	$\displaystyle\ +r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle,$	(31)

where the last equality uses (29) and the equality (22) is derived. ∎

The next lemma characterizes the upper bound for $T_{1},T_{2}$ and $T_{3}$ .

Lemma 3

For any $\theta\in(3/4,1]$ and $\lambda$ satisfying $\theta\lambda\sigma^{2}\leq 1$ , the following inequalities hold.

	$\displaystyle T_{1}\leq$	$\displaystyle\ \frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}+\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}$
		$\displaystyle\ +(\frac{1}{2}-\frac{1}{4}\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+(\frac{5}{4}-\theta)\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2},$		(32)

	$\displaystyle T_{2}\leq$	$\displaystyle\ \frac{L}{4}\|\zeta^{k-1}-\zeta^{k}\|^{2}$
	$\displaystyle=$	$\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}+\frac{r^{2}L}{4}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}-\frac{rL}{2}T_{1},$		(33)

	$\displaystyle T_{3}\leq$	$\displaystyle\ \frac{L}{4}\|\zeta^{k}-\zeta^{k+1}\|^{2},$		(34)

where $T_{1},T_{2},T_{3}$ are defined in Lemma 2 and ${\mathbf{M}}\coloneqq{\mathbf{I}}-\theta\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\succcurlyeq\mathbf{0}.$

Proof

We start from the upper bound of $T_{3}.$ Since ${\mathbf{y}}^{*}={\mathbf{x}}^{*}$ ,

$\displaystyle T_{3}=$	$\displaystyle\ \langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle-\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle$
$\displaystyle\leq$	$\displaystyle\ \langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1},\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\rangle-\frac{1}{L}\|\nabla f({\mathbf{x}}^{k+1})-\nabla f({\mathbf{x}}^{*})\|^{2}$
$\displaystyle\leq$	$\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k+1}\|^{2},$	(35)

where the first inequality uses the equivalent form of Lipschitz continuous $\nabla f$ from (nesterov2018lectures, Theorem 2.1.5) and the last one uses the Cauchy-Schwarz inequality together with $ab\leq\frac{L}{4}a^{2}+\frac{1}{L}b^{2}$ .

To upper bound $T_{2},$ by the same argument as above, we have

	$\displaystyle T_{2}\leq$	$\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k}-{\mathbf{y}}^{k}\|^{2}$
	$\displaystyle=$	$\displaystyle\ \frac{L}{4}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2},$		(36)

where the equality is from (25), and (33) is derived by expanding the square and noting that ${\mathbf{x}}^{k}-{\mathbf{y}}^{k}=\zeta^{k-1}-\zeta^{k}$ .

For $T_{1},$ firstly we have

$\displaystyle T_{1}=$	$\displaystyle\ \langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})-\frac{r}{\lambda}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle$
	$\displaystyle\ -\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{k}\rangle$
$\displaystyle\leq$	$\displaystyle\ \langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},\frac{r}{\lambda}{\mathbf{M}}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}-({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}))\rangle$
	$\displaystyle-(1-\theta)r\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k},{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}-({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}))\rangle$
$\displaystyle=$	$\displaystyle\ \frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}+{\mathbf{s}}^{k-1}-2{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}$
	$\displaystyle\ -(1-\theta)r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}-({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}))\rangle$
$\displaystyle\leq$	$\displaystyle\ \frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}+\frac{1}{2}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}$
	$\displaystyle\ +\frac{3}{2}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2},$	(37)

where the first inequality uses the firm nonexpansiveness of $\mathbf{prox}_{\frac{\lambda}{r}h^{*}}$ from (bauschke2011convex, Proposition 4.2), the second equality uses

\langle a-b,c-d\rangle=\frac{1}{2}(\|a-d\|^{2}-\|a-c\|^{2}+\|b-c\|^{2}-\|b-d\|^{2})

and the last step drops the nonpositive term in the ${\mathbf{M}}$ -norm and uses Cauchy’s inequality.
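The four-point identity used for the second equality is easy to sanity-check numerically. The following is a minimal sketch with random vectors; the function name `four_point_gap`, the dimension and the seed are illustrative choices, not part of the paper.

```python
import numpy as np

def four_point_gap(a, b, c, d):
    """Return lhs - rhs of <a-b, c-d> = (|a-d|^2 - |a-c|^2 + |b-c|^2 - |b-d|^2) / 2."""
    lhs = np.dot(a - b, c - d)
    rhs = 0.5 * (np.sum((a - d) ** 2) - np.sum((a - c) ** 2)
                 + np.sum((b - c) ** 2) - np.sum((b - d) ** 2))
    return lhs - rhs

rng = np.random.default_rng(0)            # arbitrary seed, illustration only
a, b, c, d = rng.standard_normal((4, 7))  # four random vectors in R^7
gap = four_point_gap(a, b, c, d)          # should vanish up to rounding error
```

In the proof the identity is applied in the ${\mathbf{M}}$ -weighted inner product, which works the same way because ${\mathbf{M}}$ is symmetric positive semidefinite.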

On the other hand, we also have

$\displaystyle T_{1}=$	$\displaystyle\ 2\langle\sqrt{\frac{r}{2}}{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}),\frac{1}{\sqrt{2r}}({\mathbf{x}}^{k}-{\mathbf{x}}^{k+1})\rangle$
$\displaystyle\leq$	$\displaystyle\ 2\sqrt{(\theta-\frac{1}{2})r}\,\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|\sqrt{(\frac{5}{2}-2\theta)\frac{1}{r}}\,\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|$
$\displaystyle\leq$	$\displaystyle\ (\theta-\frac{1}{2})r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+(\frac{5}{2}-2\theta)\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2},$	(38)

where the first inequality uses the Cauchy-Schwarz inequality and the fact that, for any $\theta\in(\frac{3}{4},1]$ ,

(\theta-\frac{1}{2})(\frac{5}{2}-2\theta)-\frac{1}{4}=-2\theta^{2}+\frac{7}{2}\theta-\frac{3}{2}=2(\theta-\frac{3}{4})(1-\theta)\geq 0.

Therefore,

T_{1}\leq\frac{1}{2}\times\text{(37)}+\frac{1}{2}\times\text{(38)},

which gives the inequality (32). ∎

Theorem 3.1

Let the sequence $\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\}$ be generated by the base algorithm. Then $\{({\mathbf{s}}^{k},{\mathbf{x}}^{k})\}$ converges to a solution of the problem (2) if

r<\frac{4\theta-3}{2\theta-1}\frac{2}{L},\ \ \lambda\leq\frac{1}{\theta\sigma^{2}}

for some $\theta\in(3/4,1].$

Proof

Descent Inequality. From the assumption on $\lambda,$ there exists $\widetilde{\theta}\in(0,\theta)$ such that

{\mathbf{I}}-\widetilde{\theta}\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\succ\mathbf{0}.

Define a parameter $\alpha$ depending on $\theta$ and $\widetilde{\theta}$ as

\alpha=\left\{\begin{aligned} \frac{\widetilde{\theta}}{1-\theta},&\ \ \frac{3}{4}<\theta<1,\\ 0,&\ \ \theta=1,\end{aligned}\right.

(39)

and let $\widetilde{{\mathbf{M}}}={\mathbf{I}}-\alpha(1-\theta)\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\succ\mathbf{0}$ .

Combining Lemma 2 and Lemma 3, for any $r\in(0,2/L)$ we have

	$\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{1}+T_{2}$
$\displaystyle\leq$	$\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}$
	$\displaystyle\ +\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}$
	$\displaystyle\ +\alpha(\frac{1}{2r}-\frac{L}{4})\|\zeta^{k-1}-\zeta^{k}\|^{2}+(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))\|\zeta^{k-1}-\zeta^{k}\|^{2}$
	$\displaystyle\ +T_{1}$
$\displaystyle\leq$	$\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}$
	$\displaystyle\ +\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}$
	$\displaystyle\ +\alpha(\frac{1}{2r}-\frac{L}{4})\|\zeta^{k-1}-\zeta^{k}\|^{2}+(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}$
	$\displaystyle\ +(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))r^{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}+(1+\alpha)(1-\frac{rL}{2})T_{1}.$	(40)

Note that

$\displaystyle(1+\alpha)(1-\frac{rL}{2})T_{1}\leq$	$\displaystyle\ (1+\alpha)(1-\frac{rL}{2})\Big{[}\frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}$
	$\displaystyle\ +\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}$
	$\displaystyle\ +(\frac{1}{2}-\frac{1}{4}\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}-(\theta-\frac{3}{4})\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\Big{]}$
	$\displaystyle\ +(1+\alpha)(1-\frac{rL}{2})\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}.$	(41)

Combine (40), (41), and

-\frac{1}{2r}+\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4})+(1+\alpha)(\frac{1}{2r}-\frac{L}{4})=0,

we get

	$\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle\leq$	$\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}+\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}$
	$\displaystyle\ +\alpha(\frac{1}{2r}-\frac{L}{4})\|\zeta^{k-1}-\zeta^{k}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}$
	$\displaystyle\ +(\frac{L}{4}-\alpha(\frac{1}{2r}-\frac{L}{4}))r^{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}$
	$\displaystyle\ +(1+\alpha)(1-\frac{rL}{2})\Big{[}\frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}$
	$\displaystyle\ +\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}-\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}$
	$\displaystyle\ +(\frac{3}{4}-\frac{1}{2}\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}-(\theta-\frac{3}{4})\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\Big{]}.$	(42)

The other equality in Lemma 2 becomes

	$\displaystyle\ \langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle$
$\displaystyle=$	$\displaystyle\ \frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{M}}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle$
	$\displaystyle\ -(1-\theta)r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle+T_{3}$
$\displaystyle\leq$	$\displaystyle\ \frac{1}{2r}\|\zeta^{k}-\zeta^{*}\|^{2}-\frac{1}{2r}\|\zeta^{k+1}-\zeta^{*}\|^{2}-\frac{1}{2r}\|\zeta^{k+1}-\zeta^{k}\|^{2}$
	$\displaystyle\ +\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}$
	$\displaystyle\ -(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\|^{2}+(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\|^{2}$
	$\displaystyle\ +(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+\frac{L}{4}\|\zeta^{k}-\zeta^{k+1}\|^{2}.$	(43)

Let $\beta=(1+\alpha)(1-\frac{rL}{2})$ for simplicity, and define

	$\displaystyle\Phi^{k+1}=$	$\displaystyle\ \frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}+\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}+\widetilde{{\mathbf{M}}}}+\frac{1}{2r}\|\zeta^{k+1}-\zeta^{*}\|^{2}$
		$\displaystyle\ +\frac{\beta r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}+\frac{\beta r(1-\theta)}{4}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{A}}{\mathbf{A}}^{\top}}$
		$\displaystyle\ +\frac{\alpha}{2r}(1-\frac{rL}{2})\|\zeta^{k}-\zeta^{k+1}\|^{2}.$

Considering $(42)+\alpha\times(43)$ , we have

	$\displaystyle\ \langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+(1+\alpha)\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle+\alpha\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle$
$\displaystyle\leq$	$\displaystyle\ \Phi^{k}-\Phi^{k+1}-\beta(\theta-\frac{3}{4})\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}$
	$\displaystyle\ -\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{I}}-\alpha(1-\theta)\lambda{\mathbf{A}}{\mathbf{A}}^{\top}}+(\frac{3}{4}-\frac{1}{2}\theta)\beta r\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{A}}{\mathbf{A}}^{\top}}$
	$\displaystyle\ +(\frac{rL}{2}-\alpha(1-\frac{rL}{2}))\frac{r}{2}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\|_{{\mathbf{A}}{\mathbf{A}}^{\top}}^{2}$
$\displaystyle=$	$\displaystyle\ \Phi^{k}-\Phi^{k+1}-\beta(\theta-\frac{3}{4})\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}$
	$\displaystyle\ -\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{N}}},$	(44)

where

{\mathbf{N}}={\mathbf{I}}-\Big{[}\alpha(1-\theta)+(\frac{3}{2}-\theta)\beta+\frac{rL}{2}-\alpha(1-\frac{rL}{2})\Big{]}\lambda{\mathbf{A}}{\mathbf{A}}^{\top}.

Note that if we let

\alpha(1-\theta)+(\frac{3}{2}-\theta)(1+\alpha)(1-\frac{rL}{2})+\frac{rL}{2}-\alpha(1-\frac{rL}{2})<\theta,

which is equivalent to

\frac{rL}{2}<\frac{4\theta-3}{2\theta-1},

we have ${\mathbf{N}}\succ{\mathbf{M}}\succcurlyeq\mathbf{0}.$
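The equivalence between the two conditions above can be checked numerically. The sketch below evaluates the bracketed coefficient of $\lambda{\mathbf{A}}{\mathbf{A}}^{\top}$ in ${\mathbf{N}}$ just below, at, and just above the bound on $rL/2$ ; the helper names `n_coefficient` and `gamma` and the sample values of $\theta$ and $\alpha$ are illustrative assumptions.

```python
def n_coefficient(theta, alpha, rL_half):
    """Bracketed coefficient of lambda*A*A^T in N; rL_half stands for r*L/2."""
    beta = (1 + alpha) * (1 - rL_half)
    return (alpha * (1 - theta) + (1.5 - theta) * beta
            + rL_half - alpha * (1 - rL_half))

def gamma(theta):
    """Bound (4*theta - 3) / (2*theta - 1) on r*L/2."""
    return (4 * theta - 3) / (2 * theta - 1)

# Sample values: alpha = theta_tilde / (1 - theta) with theta_tilde = 0.2 < theta.
theta, alpha = 0.9, 2.0
at_bound = n_coefficient(theta, alpha, gamma(theta))     # equals theta
below = n_coefficient(theta, alpha, gamma(theta) - 0.01) # smaller than theta
above = n_coefficient(theta, alpha, gamma(theta) + 0.01) # larger than theta
```

The coefficient minus $\theta$ factors as $(1+\alpha)\big[(1-\theta)-(1-\frac{rL}{2})(\theta-\frac{1}{2})\big]$ , so its sign does not depend on the value of $\alpha\geq 0$ .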

Since $\widetilde{{\mathbf{M}}}\succ\mathbf{0}$ and the left-hand side of (44) is nonnegative by the firm nonexpansiveness of $\mathbf{prox}_{rg}$ and $\mathbf{prox}_{\frac{\lambda}{r}h^{*}}$ (bauschke2011convex, Proposition 4.2), we get the descent inequality

\Phi^{k+1}\leq\Phi^{k}-\beta(\theta-\frac{3}{4})\frac{1}{r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}+{\mathbf{N}}}.

(45)

Convergence.

Summing the descent inequality (45) over $k$ from $0$ to $\infty$ and telescoping, we get

	$\displaystyle\lim_{k\rightarrow\infty}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{k}\|=0,$		(46)
	$\displaystyle\lim_{k\rightarrow\infty}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|=0,$		(47)

due to $\theta>\frac{3}{4}$ and ${\mathbf{M}}+{\mathbf{N}}\succ\mathbf{0}.$ Moreover, from (8b), we get

\lim_{k\rightarrow\infty}\|\zeta^{k+1}-\zeta^{k}\|=0.

(48)

From the descent inequality, the nonnegative sequence $\{\Phi^{k}\}$ is nonincreasing, so it converges to a nonnegative constant, which implies that $\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\}$ is bounded in $\mathbb{R}^{m}\times\mathbb{R}^{n}\times\mathbb{R}^{n}$ . By compactness, there is a subsequence $\{({\mathbf{s}}^{k_{j}},{\mathbf{x}}^{k_{j}},\zeta^{k_{j}})\}$ converging to some $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})$ .

The nonexpansiveness of $\mathbf{prox}_{rg}$ and $\mathbf{prox}_{\frac{\lambda}{r}h^{*}}$ implies they are continuous and from (46), (47) and (48),

		$\displaystyle\ \lim_{j\rightarrow\infty}\|({\mathbf{s}}^{k_{j}+1},{\mathbf{x}}^{k_{j}+1},\zeta^{k_{j}+1})-({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})\|$
	$\displaystyle\leq$	$\displaystyle\ \lim_{j\rightarrow\infty}\|({\mathbf{s}}^{k_{j}+1},{\mathbf{x}}^{k_{j}+1},\zeta^{k_{j}+1})-({\mathbf{s}}^{k_{j}},{\mathbf{x}}^{k_{j}},\zeta^{k_{j}})\|+\|({\mathbf{s}}^{k_{j}},{\mathbf{x}}^{k_{j}},\zeta^{k_{j}})-({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})\|$
	$\displaystyle=$	$\displaystyle\ 0.$

So, $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star})$ is a fixed point of the iteration (8).

By choosing $({\mathbf{s}}^{*},{\mathbf{x}}^{*},\zeta^{*})=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}),$ we have

\lim_{j\rightarrow\infty}\Phi^{k_{j}}=0.

Since $\{\Phi^{k}\}$ is nonincreasing, the whole sequence $\{\Phi^{k}\}$ converges to $0$ . Hence

\lim_{k\rightarrow\infty}({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star},\zeta^{\star}).

Lastly, from Lemma 1, $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star})$ is a solution of the problem (2). ∎

Due to the relations discussed in Sections 2.1 and 2.2, this also proves the convergence of AFBA and PD3O under the relaxed condition. When $f=0$ , the relation in Section 2.3 implies the improved convergence result of Chambolle-Pock.

Corollary 1 (Chambolle-Pock)

Suppose $r>0$ and $\lambda<\frac{4}{3\sigma^{2}}$ . Then the following results hold.

1. Let $\{({\mathbf{s}}^{k},{\mathbf{x}}^{k},\zeta^{k})\}$ be generated by the iteration (8). Then $({\mathbf{s}}^{k},{\mathbf{x}}^{k})$ converges to a solution of the problem (2) with linear $f$ .

2. Let $\{({\mathbf{s}}_{3}^{k},{\mathbf{x}}_{3}^{k})\}$ be generated by Chambolle-Pock (15). Then it converges to a solution of the problem (2) with $f=0$ .

Proof

1. When $f$ is linear, the Lipschitz constant of its gradient is $0$ , so we can set $L=\epsilon$ for any $\epsilon>0$ . Then, by Theorem 3.1, if $r<\frac{4\theta-3}{2\theta-1}\frac{2}{\epsilon}$ and $\lambda\leq\frac{1}{\theta\sigma^{2}},$

$\lim_{k\rightarrow\infty}({\mathbf{s}}^{k},{\mathbf{x}}^{k})=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}),$

where $({\mathbf{s}}^{\star},{\mathbf{x}}^{\star})$ is a saddle-point solution.

Since $\epsilon$ can be taken arbitrarily small and $\theta$ arbitrarily close to $3/4$ , any $r>0$ and $\lambda<\frac{4}{3\sigma^{2}}$ satisfy these conditions, and the proof is complete.

2. From the sequence relation (16) and the above argument, we have

\lim_{k\rightarrow\infty}({\mathbf{s}}^{k}_{3},{\mathbf{x}}^{k}_{3})=\lim_{k\rightarrow\infty}({\mathbf{s}}^{k},{\mathbf{x}}^{k-1}-r{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}))=({\mathbf{s}}^{\star},{\mathbf{x}}^{\star}).

∎

The comparison of aforementioned algorithms with improved conditions on stepsizes is summarized in Table 2, where the results of the algorithm in bold are proved in this paper.

Algorithm	$f$	$g$	$r,\lambda$
Chambolle-Pock	$0$	convex	$r>0$ , $\lambda<4/(3\sigma^{2})$
PAPC(PDFP²O)	$L$ -smooth	$0$	$rL/2<1$ , $\lambda<4/(3\sigma^{2})$
Condat-Vu	$L$ -smooth	convex	$\lambda\sigma^{2}+rL/2\leq 1$
PDFP	$L$ -smooth	convex	$rL/2<1$ , $\lambda\sigma^{2}({\mathbf{A}})\leq 1$
AFBA	$L$ -smooth	convex	$rL/2<\Gamma$ , $\theta\lambda\sigma^{2}\leq 1$
PD3O	$L$ -smooth	convex	$rL/2<\Gamma$ , $\theta\lambda\sigma^{2}\leq 1$
The Base Algorithm	$L$ -smooth	convex	$rL/2<\Gamma$ , $\theta\lambda\sigma^{2}\leq 1$

Table 2: The comparison of the requirement of stepsizes to guarantee the convergence of some primal-dual algorithms. $L$ -smooth means the function is convex and has an $L$ -Lipschitz continuous gradient. $\Gamma\coloneqq(4\theta-3)/(2\theta-1)$ , where $\theta$ is an arbitrary number in $(3/4,1].$

3.2 Tightness of Upper Bound for Stepsizes

In this section, we provide a simple example to show that the upper bound for $\lambda$ cannot be relaxed further. This example is more general than the one provided in he2020optimally . We consider the following saddle point problem

\min_{{\mathbf{x}}\in\mathbb{R}^{n}}\max_{{\mathbf{s}}\in\mathbb{R}^{m}}~{}\langle{\mathbf{A}}{\mathbf{x}},{\mathbf{s}}\rangle.

Any $({\mathbf{x}},{\mathbf{s}})$ such that ${\mathbf{A}}{\mathbf{x}}=\mathbf{0}$ and ${\mathbf{A}}^{\top}{\mathbf{s}}=\mathbf{0}$ is a solution. For this special problem, one iteration becomes

	$\displaystyle{\mathbf{s}}^{k+1}=$	$\displaystyle~{}{\lambda\over r}{\mathbf{A}}{\mathbf{x}}^{k}+({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k},$		(49)
	$\displaystyle{\mathbf{x}}^{k+1}=$	$\displaystyle~{}{\mathbf{x}}^{k}-r{\mathbf{A}}^{\top}{\mathbf{s}}^{k+1}=({\mathbf{I}}-\lambda{\mathbf{A}}^{\top}{\mathbf{A}}){\mathbf{x}}^{k}-r{\mathbf{A}}^{\top}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}){\mathbf{s}}^{k}.$		(50)

We can rewrite it as

\displaystyle\begin{bmatrix}{\mathbf{s}}^{k+1}\\ {\mathbf{A}}{\mathbf{x}}^{k+1}\end{bmatrix}=\begin{bmatrix}{\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}&{\lambda\over r}{\mathbf{I}}\\ -r{\mathbf{A}}{\mathbf{A}}^{\top}({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})&{\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top}\end{bmatrix}\begin{bmatrix}{\mathbf{s}}^{k}\\ {\mathbf{A}}{\mathbf{x}}^{k}\end{bmatrix}.

(51)

Therefore, to make the algorithm converge, the eigenvalues of the matrix can not have magnitude larger than 1. Because ${\mathbf{A}}{\mathbf{A}}^{\top}$ is symmetric, we only need to consider the $2\times 2$ matrix

\displaystyle\begin{bmatrix}1-\lambda\theta&{\lambda\over r}\\ -r\theta(1-\lambda\theta)&1-\lambda\theta\end{bmatrix},

(52)

where $\theta$ denotes, in this section only, an eigenvalue of ${\mathbf{A}}{\mathbf{A}}^{\top}$ (not the parameter $\theta$ of Theorem 3.1). Its two eigenvalues are

1-\lambda\theta\pm\sqrt{-\lambda\theta(1-\lambda\theta)}.

We consider different cases for $\lambda\theta$ . If $\lambda\theta<1$ , both eigenvalues are complex numbers, and their magnitude is $\sqrt{(1-\lambda\theta)^{2}+\lambda\theta(1-\lambda\theta)}=\sqrt{1-\lambda\theta}\leq 1$ . If $\lambda\theta=1$ , both eigenvalues are zero. When $\lambda\theta>1$ , both eigenvalues are real numbers. The larger one satisfies $1-\lambda\theta+\sqrt{\lambda\theta(\lambda\theta-1)}<1-\lambda\theta+\sqrt{\lambda\theta\lambda\theta}=1.$ The other eigenvalue is $1-\lambda\theta-\sqrt{\lambda\theta(\lambda\theta-1)}$ . To make sure that its magnitude is less than one, we need $1-\lambda\theta-\sqrt{\lambda\theta(\lambda\theta-1)}>-1$ , that is, $\lambda\theta<4/3.$ Therefore, the condition for convergence with any initial value is $\lambda\theta<4/3$ for all eigenvalues $\theta$ of ${\mathbf{A}}{\mathbf{A}}^{\top}$ , that is, $\lambda\sigma^{2}<4/3$ .

This example shows that the condition $\lambda\sigma^{2}<4/3$ can not be relaxed further for Chambolle-Pock.
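The threshold $4/3$ can be observed numerically from the $2\times 2$ matrix (52). The sketch below computes its spectral radius for a few values of $\lambda\theta$ ; the function name `spectral_radius` and the sample values (including $r=1$ ) are illustrative assumptions.

```python
import numpy as np

def spectral_radius(lam, theta, r=1.0):
    """Largest eigenvalue magnitude of the 2x2 matrix (52) for one eigenvalue theta of A A^T."""
    t = lam * theta
    mat = np.array([[1.0 - t, lam / r],
                    [-r * theta * (1.0 - t), 1.0 - t]])
    return float(np.max(np.abs(np.linalg.eigvals(mat))))

rho_inside = spectral_radius(lam=1.30, theta=1.0)   # lambda*theta = 1.30 < 4/3
rho_outside = spectral_radius(lam=1.40, theta=1.0)  # lambda*theta = 1.40 > 4/3
rho_complex = spectral_radius(lam=0.50, theta=1.0)  # complex pair with modulus sqrt(1 - 0.5)
```

The spectral radius does not depend on $r$ , since $r$ only rescales the two off-diagonal entries in opposite directions.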

4 Numerical Experiments

In this section, we demonstrate the performance of several primal-dual algorithms under the relaxed condition and compare their results with existing ones. More specifically, we use PD3O and AFBA to solve the fused LASSO (least absolute shrinkage and selection operator) and Chambolle-Pock to solve LASSO to show their convergence with different combinations of $r$ and $\lambda$ .

4.1 The fused LASSO

The fused LASSO problem (see, e.g., tibshirani2005sparsity ) is formulated as

\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{10000}}\ \frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}+\mu_{1}\|{\mathbf{B}}{\mathbf{x}}\|_{1}+\mu_{2}\|{\mathbf{x}}\|_{1}

(53)

where ${\mathbf{K}}\in\mathbb{R}^{500\times 10000}$ . The two penalty parameters $\mu_{1}$ and $\mu_{2}$ are set to $200$ and $20$ , respectively. The $i$ th row of ${\mathbf{B}}\in\mathbb{R}^{9999\times 10000}$ has $-1$ in the $i$ th column, $1$ in the $(i+1)$ th column, and $0$ in all other columns.

We let $f({\mathbf{x}})=\frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2},$ $g({\mathbf{x}})=\mu_{2}\|{\mathbf{x}}\|_{1}$ and $h({\mathbf{B}}{\mathbf{x}})=\mu_{1}\|{\mathbf{B}}{\mathbf{x}}\|_{1}$ . Then the primal-dual form is

\min_{{\mathbf{x}}\in\mathbb{R}^{10000}}\max_{{\mathbf{s}}\in\mathbb{R}^{9999}}\ \frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}+\mu_{2}\|{\mathbf{x}}\|_{1}+\langle{\mathbf{B}}{\mathbf{x}},{\mathbf{s}}\rangle-\mu_{1}\chi_{B_{\infty}}\left(\frac{{\mathbf{s}}}{\mu_{1}}\right),

(54)

where $B_{\infty}$ is the closed unit ball in $\ell_{\infty}$ norm and $\chi_{B_{\infty}}$ is the indicator function over $B_{\infty}$ .
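With this splitting, both proximal maps needed by the algorithms have closed forms: soft-thresholding for $g$ and a projection onto the $\ell_{\infty}$ ball of radius $\mu_{1}$ for the conjugate term in (54). A minimal sketch follows; the function names are illustrative and not taken from the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau*||.||_1: shrink each entry toward zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_linf_ball(s, radius):
    """Projection onto {s : ||s||_inf <= radius}; this is the prox of the indicator term in (54)."""
    return np.clip(s, -radius, radius)

v = np.array([-3.0, -0.5, 0.0, 0.2, 4.0])
shrunk = soft_threshold(v, 1.0)       # entries become -2, 0, 0, 0, 3
clipped = project_linf_ball(v, 1.0)   # entries become -1, -0.5, 0, 0.2, 1
```

For the fused LASSO one would call `soft_threshold` with `tau` equal to $r\mu_{2}$ and `project_linf_ball` with `radius` equal to $\mu_{1}$ ; the projection is independent of the dual stepsize because the prox of an indicator function is always the projection.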

We generate a sparse vector ${\mathbf{x}}_{\text{True}}$ with $50$ nonzero entries and a dense matrix ${\mathbf{K}}$ whose entries are independently sampled from the standard normal distribution. The response vector ${\mathbf{b}}$ is obtained by adding Gaussian noise to ${\mathbf{K}}{\mathbf{x}}_{\text{True}}$ . We estimate the optimal solution by running PD3O for $10,000$ iterations on the problem and take the resulting objective value as the estimated optimal function value $f^{*}$ .
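The data generation described above can be sketched as follows. The noise level, the distribution of the nonzero entries, the seed, and the reduced problem sizes are assumptions made for illustration; the paper does not specify them here.

```python
import numpy as np

def difference_matrix(n):
    """(n-1) x n matrix B: row i has -1 in column i and +1 in column i+1 (0-based)."""
    B = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    B[idx, idx] = -1.0
    B[idx, idx + 1] = 1.0
    return B

def make_fused_lasso_data(m, n, nnz, noise_std=0.01, seed=0):
    """Gaussian K, sparse x_true with nnz nonzeros, b = K x_true + noise (noise_std is assumed)."""
    rng = np.random.default_rng(seed)
    K = rng.standard_normal((m, n))
    x_true = np.zeros(n)
    support = rng.choice(n, size=nnz, replace=False)
    x_true[support] = rng.standard_normal(nnz)
    b = K @ x_true + noise_std * rng.standard_normal(m)
    return K, x_true, b

# Reduced sizes (the experiment uses m=500, n=10000, nnz=50) to keep the sketch light.
K, x_true, b = make_fused_lasso_data(m=50, n=200, nnz=5)
B = difference_matrix(200)
sigma2_K = np.linalg.norm(K, 2) ** 2   # squared largest singular value of K
sigma2_B = np.linalg.norm(B, 2) ** 2   # squared largest singular value of B
```

At the full problem size a sparse representation of ${\mathbf{B}}$ would be preferable to the dense array used here.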

We set the default parameters $r=1/\sigma^{2}({\mathbf{K}})$ , $\lambda=1/\sigma^{2}({\mathbf{B}})$ and consider several choices of stepsizes in two scenarios: (1) $\frac{rL}{2}<\frac{4\theta-3}{2\theta-1},\lambda\leq\frac{1}{\theta\sigma^{2}};$ (2) $\frac{rL}{2}<1,\lambda<\frac{4}{3\sigma^{2}}.$ The first scenario obeys the relaxed condition shown in this paper, while the second one may violate it. In Fig. 1, the left figure shows the results with $\theta=1/1.19$ and $\theta=4/5$ for the first scenario. In this figure, we compare four choices of the stepsizes: (1) the default parameters $(r,\lambda)$ ; (2) the primal stepsize $r$ is fixed and $\theta=1/1.19$ is chosen to obtain a larger $\lambda$ , which gives $(r,1.19\lambda)$ ; (3) a smaller $\theta=0.8$ is chosen and only the primal stepsize $r$ is decreased; (4) the same $\theta=0.8$ is chosen and $\lambda$ is increased to its upper bound. The right figure in Fig. 1 compares the convergence of the algorithms under the second scenario with a larger $\lambda$ . Note that the settings with $1.3\lambda$ do not satisfy the condition in this paper. We observe that, in either scenario, the primal stepsize dominates the convergence of the algorithms and a slightly larger $\lambda$ has little effect.
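The stepsize limits implied by the first scenario can be tabulated for the two values of $\theta$ used above. The sketch below normalises $L=\sigma^{2}=1$ (an assumption made only for illustration), so that the default $r=1/L$ corresponds to $rL/2=1/2$ ; the helper names are hypothetical.

```python
def gamma(theta):
    """Gamma = (4*theta - 3)/(2*theta - 1), the bound on r*L/2 for theta in (3/4, 1]."""
    if not 0.75 < theta <= 1.0:
        raise ValueError("theta must lie in (3/4, 1]")
    return (4.0 * theta - 3.0) / (2.0 * theta - 1.0)

def stepsize_bounds(theta, L, sigma2):
    """Return (strict upper bound on r, largest admissible lambda) under the relaxed condition."""
    return gamma(theta) * 2.0 / L, 1.0 / (theta * sigma2)

# Normalised constants L = sigma2 = 1, so the default stepsizes are r = 1 and lambda = 1.
r_max_a, lam_max_a = stepsize_bounds(1 / 1.19, L=1.0, sigma2=1.0)
r_max_b, lam_max_b = stepsize_bounds(0.8, L=1.0, sigma2=1.0)
```

With $\theta=1/1.19$ the bound on $r$ stays above the default value, so only $\lambda$ changes (to $1.19\lambda$ ), whereas with $\theta=0.8$ the bound drops to $2/3$ of the default $r$ and $\lambda$ may grow to $1.25\lambda$ .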

Figure 1: The comparison of the performance of PD3O and AFBA with different parameters. In both figures, the fixed parameters $r$ and $\lambda$ are set to $1/\sigma^{2}({\mathbf{K}})$ and $1/\sigma^{2}({\mathbf{B}})$ , respectively.

However, the benefit of a larger $\lambda$ appears when ${\mathbf{K}}$ has full column rank. We consider the same fused LASSO problem with ${\mathbf{K}}\in\mathbb{R}^{2500\times 2500}$ and conduct experiments for two cases: a randomly generated ${\mathbf{K}}$ and ${\mathbf{K}}={\mathbf{I}}$ . The problem parameters are changed to $\mu_{1}=5$ and $\mu_{2}=1/5$ , and the true solution is generated in the same way with $25$ nonzero entries. As shown in Fig. 2, both the top and bottom figures indicate a $10$ to $20\%$ acceleration of convergence when $\lambda$ is increased. The numerical result for the second scenario also suggests that the general constraint on $r$ and $\lambda$ shown in this paper may not be tight.

4.2 LASSO

We consider the following LASSO problem (see, e.g., [19])

\operatorname*{minimize}_{{\mathbf{x}}\in\mathbb{R}^{5000}}\ \frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}+\mu\|{\mathbf{x}}\|_{1}

(55)

where ${\mathbf{K}}\in\mathbb{R}^{500\times 5000}$ and $\mu=200$ is a penalty parameter. We let $g({\mathbf{x}})=\mu\|{\mathbf{x}}\|_{1}$ and $h({\mathbf{K}}{\mathbf{x}})=\frac{1}{2}\|{\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}\|^{2}$ , and consider the primal-dual form

\min_{{\mathbf{x}}\in\mathbb{R}^{5000}}\max_{{\mathbf{s}}\in\mathbb{R}^{500}}\ \mu\|{\mathbf{x}}\|_{1}+\langle{\mathbf{K}}{\mathbf{x}},{\mathbf{s}}\rangle-\frac{1}{2}\|{\mathbf{s}}\|^{2}-\langle{\mathbf{b}},{\mathbf{s}}\rangle.

(56)
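For completeness, the saddle-point form (56) can be recovered from the convex conjugate of $h$ . The short derivation below is a sketch in which ${\mathbf{z}}$ is a dummy variable introduced only for this purpose. With $h({\mathbf{z}})=\frac{1}{2}\|{\mathbf{z}}-{\mathbf{b}}\|^{2}$ , the supremum defining $h^{*}$ is attained at ${\mathbf{z}}={\mathbf{s}}+{\mathbf{b}}$ , so

h^{*}({\mathbf{s}})=\sup_{{\mathbf{z}}}\ \langle{\mathbf{z}},{\mathbf{s}}\rangle-\frac{1}{2}\|{\mathbf{z}}-{\mathbf{b}}\|^{2}=\frac{1}{2}\|{\mathbf{s}}\|^{2}+\langle{\mathbf{b}},{\mathbf{s}}\rangle,

and substituting $h({\mathbf{K}}{\mathbf{x}})=\max_{{\mathbf{s}}}\ \langle{\mathbf{K}}{\mathbf{x}},{\mathbf{s}}\rangle-h^{*}({\mathbf{s}})$ into (55) gives (56). The inner maximum is attained at ${\mathbf{s}}={\mathbf{K}}{\mathbf{x}}-{\mathbf{b}}$ , so at a saddle point the dual variable equals the residual, ${\mathbf{s}}^{*}={\mathbf{K}}{\mathbf{x}}^{*}-{\mathbf{b}}$ .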

The data are generated in a similar way to the previous experiment. The optimal solution ${\mathbf{x}}^{*}$ is calculated by performing 10,000 iterations of Chambolle-Pock. Fig. 3 shows the effect of the relaxed conditions on the convergence of Chambolle-Pock in terms of two different residual measures. The default dual parameter $\lambda$ is set to $1/\sigma^{2}({\mathbf{K}})$ and the relaxed one is $1.32\lambda$ . The primal stepsize $r$ is chosen from $\{0.001,0.005,0.01,0.05\}$ .
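As an illustration of the iteration being tested, the following is a minimal Python sketch of a standard form of Chambolle-Pock applied to (56). It assumes that the dual stepsize equals $\lambda/r$ , so that the product of the primal and dual stepsizes is $\lambda$ . The function names are hypothetical and the code is not the authors' implementation.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def chambolle_pock_lasso(K, b, mu, r, lam, iters=5000):
    """Chambolle-Pock sketch for min_x mu*||x||_1 + 0.5*||K x - b||^2.

    r is the primal stepsize; the dual stepsize is ASSUMED to be lam / r.
    """
    m, n = K.shape
    x = np.zeros(n)
    s = np.zeros(m)
    delta = lam / r  # assumed dual stepsize
    for _ in range(iters):
        # Primal step: prox of r * mu * ||.||_1.
        x_new = soft_threshold(x - r * (K.T @ s), r * mu)
        # Dual step at the extrapolated point 2*x_new - x, using
        # prox_{delta h*}(v) = (v - delta*b) / (1 + delta)
        # for h*(s) = 0.5*||s||^2 + <b, s>.
        v = s + delta * (K @ (2.0 * x_new - x))
        s = (v - delta * b) / (1.0 + delta)
        x = x_new
    return x, s


def lasso_objective(K, b, mu, x):
    """Primal LASSO objective 0.5*||K x - b||^2 + mu*||x||_1."""
    return 0.5 * np.sum((K @ x - b) ** 2) + mu * np.sum(np.abs(x))
```

With `lam0 = 1 / np.linalg.norm(K, 2) ** 2`, the default and relaxed settings correspond to calling `chambolle_pock_lasso(K, b, mu, r, lam0)` and `chambolle_pock_lasso(K, b, mu, r, 1.32 * lam0)` with the same `r` and comparing the objective values along the iterations.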

From Fig. 3, we observe that for $r$ in an acceptable range, the larger value of $\lambda$ makes the algorithm converge faster, with an acceleration of up to $20$ to $30\%$ . We also note that the relaxed $\lambda$ fails to speed up the algorithm, and the residual curves overlap, when $r$ is set to a very small number. This is possibly due to the small progress of the primal variable at each step. The experiment on extreme values of $r$ , shown in Fig. 4, supports this explanation. We use different colors to distinguish the results for the extremely large values from those for the extremely small values. This figure also suggests that balanced primal and dual stepsizes are needed. Although how to choose good primal and dual stepsizes is beyond the scope of this paper, the relaxed condition provides a theoretical guarantee for increasing one or both stepsizes.

5 Conclusions

In this paper, we use a base algorithm to build the connection between several primal-dual algorithms. We then prove the convergence of the base algorithm under a relaxed condition on the primal and dual stepsizes. This condition allows a larger dual stepsize together with a correspondingly conservative primal stepsize. Chambolle-Pock, as a special case of the base algorithm, can take the dual stepsize up to $4/3$ of the original one without a compromise on the primal stepsize. The numerical experiments indicate an acceleration of the tested algorithms under the relaxed condition, and the benefit for Chambolle-Pock is more prominent. The condition for Chambolle-Pock is also proved to be optimal and cannot be relaxed. However, how to relax the condition further for the base algorithm is still an open problem.

Acknowledgements.

This work is partially supported by the NSF grant DMS-2012439.

References

[1] R Tyrrell Rockafellar. Conjugate duality and optimization. SIAM, 1974.
[2] Laurent Condat. A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.
[3] Bang Cong Vu. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013.
[4] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal-dual fixed point algorithm for minimization of the sum of three convex separable functions. Fixed Point Theory and Applications, 2016(1):1–18, 2016.
[5] Puya Latafat and Panagiotis Patrinos. Asymmetric forward–backward–adjoint splitting for solving monotone inclusions involving three operators. Computational Optimization and Applications, 68(1):57–93, 2017.
[6] Ming Yan. A new primal–dual algorithm for minimizing the sum of three functions with a linear operator. Journal of Scientific Computing, 76(3):1698–1717, 2018.
[7] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
[8] Ignace Loris and Caroline Verhoeven. On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Problems, 27(12):125007, 2011.
[9] Peijun Chen, Jianguo Huang, and Xiaoqun Zhang. A primal–dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Problems, 29(2):025011, 2013.
[10] Yoel Drori, Shoham Sabach, and Marc Teboulle. A simple algorithm for a class of nonsmooth convex–concave saddle-point problems. Operations Research Letters, 43(2):209–214, 2015.
[11] Zhi Li and Ming Yan. New convergence analysis of a primal-dual algorithm with large stepsizes. Advances in Computational Mathematics, 47(1):1–20, 2021.
[12] Bingsheng He, Feng Ma, and Xiaoming Yuan. Optimal proximal augmented Lagrangian method and its application to full Jacobian splitting for multi-block separable convex minimization problems. IMA Journal of Numerical Analysis, 40(2):1188–1216, 2020.
[13] Yao Li and Ming Yan. On the linear convergence of two decentralized algorithms. Journal of Optimization Theory and Applications, 189(1):271–290, 2021.
[14] Bingsheng He, Feng Ma, Shengjie Xu, and Xiaoming Yuan. A generalized primal-dual algorithm with improved convergence condition for saddle point problems. arXiv preprint arXiv:2112.00254, 2021.
[15] Bingsheng He, Feng Ma, and Xiaoming Yuan. Optimally linearizing the alternating direction method of multipliers for convex programming. Computational Optimization and Applications, 75(2):361–388, 2020.
[16] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
[17] Heinz H Bauschke and Patrick L Combettes. Convex analysis and monotone operator theory in Hilbert spaces, volume 408. Springer, 2011.
[18] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
[19] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

\begin{aligned}
&\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle\\
={}&\underbrace{\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\rangle-\frac{1}{r}\|{\mathbf{y}}^{k}-{\mathbf{y}}^{*}\|^{2}}_{=\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{y}}^{k}\rangle}-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\rangle+T_{2}\\
&+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle\\
={}&\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle\underbrace{-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\rangle-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\rangle}_{=-\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle}\\
&+T_{2}+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{A}}({\mathbf{x}}^{k+1}-{\mathbf{x}}^{*})\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle\\
={}&\frac{1}{r}\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\langle{\mathbf{x}}^{k+1}-{\mathbf{y}}^{k},{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\rangle+T_{2}\\
&+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},({\mathbf{I}}-\lambda{\mathbf{A}}{\mathbf{A}}^{\top})({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle,
\end{aligned}

(24)

\begin{aligned}
&\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle\\
={}&\frac{1}{r}\langle{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*},{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{s}}^{k}-{\mathbf{s}}^{k+1}\rangle+T_{1}+T_{2}\\
\leq{}&\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\\
&+\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}\\
&+\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\|\zeta^{k-1}-\zeta^{k}\|^{2}+\left(\frac{L}{4}-\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\right)\|\zeta^{k-1}-\zeta^{k}\|^{2}\\
&+T_{1}\\
\leq{}&\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\\
&+\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}\\
&+\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\|\zeta^{k-1}-\zeta^{k}\|^{2}+\left(\frac{L}{4}-\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\right)\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\\
&+\left(\frac{L}{4}-\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\right)r^{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}+(1+\alpha)\left(1-\frac{rL}{2}\right)T_{1}.
\end{aligned}

(40)

\begin{aligned}
(1+\alpha)\left(1-\frac{rL}{2}\right)T_{1}\leq{}&(1+\alpha)\left(1-\frac{rL}{2}\right)\Big[\frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}\\
&+\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}\\
&+\left(\frac{1}{2}-\frac{1}{4}\theta\right)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}-\left(\theta-\frac{3}{4}\right)\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\Big]\\
&+(1+\alpha)\left(1-\frac{rL}{2}\right)\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}.
\end{aligned}

(41)

\begin{aligned}
&\langle{\mathbf{y}}^{k}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle\\
\leq{}&\frac{1}{2r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{*}\|^{2}-\frac{1}{2r}\|{\mathbf{x}}^{k+1}-{\mathbf{x}}^{*}\|^{2}+\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}\\
&+\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\|\zeta^{k-1}-\zeta^{k}\|^{2}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}\\
&+\left(\frac{L}{4}-\alpha\left(\frac{1}{2r}-\frac{L}{4}\right)\right)r^{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\|^{2}\\
&+(1+\alpha)\left(1-\frac{rL}{2}\right)\Big[\frac{r}{4\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{k-1}\|^{2}_{{\mathbf{M}}}-\frac{r}{4\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}\\
&+\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k-1})\|^{2}-\frac{1}{4}(1-\theta)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}\\
&+\left(\frac{3}{4}-\frac{1}{2}\theta\right)r\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}-\left(\theta-\frac{3}{4}\right)\frac{1}{r}\|{\mathbf{x}}^{k}-{\mathbf{x}}^{k+1}\|^{2}\Big].
\end{aligned}

(42)

\begin{aligned}
&\langle{\mathbf{y}}^{k+1}-{\mathbf{y}}^{*},{\mathbf{q}}_{g}^{k+1}-{\mathbf{q}}_{g}^{*}\rangle+\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{q}}_{h}^{k+1}-{\mathbf{q}}_{h}^{*}\rangle\\
={}&\frac{1}{r}\langle\zeta^{k+1}-\zeta^{*},\zeta^{k}-\zeta^{k+1}\rangle+\frac{r}{\lambda}\langle{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*},{\mathbf{M}}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle\\
&-(1-\theta)r\langle{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}),{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{k+1})\rangle+T_{3}\\
\leq{}&\frac{1}{2r}\|\zeta^{k}-\zeta^{*}\|^{2}-\frac{1}{2r}\|\zeta^{k+1}-\zeta^{*}\|^{2}-\frac{1}{2r}\|\zeta^{k+1}-\zeta^{k}\|^{2}\\
&+\frac{r}{2\lambda}\|{\mathbf{s}}^{k}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{*}\|^{2}_{{\mathbf{M}}}-\frac{r}{2\lambda}\|{\mathbf{s}}^{k+1}-{\mathbf{s}}^{k}\|^{2}_{{\mathbf{M}}}\\
&-(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k}-{\mathbf{s}}^{*})\|^{2}+(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{*})\|^{2}\\
&+(1-\theta)\frac{r}{2}\|{\mathbf{A}}^{\top}({\mathbf{s}}^{k+1}-{\mathbf{s}}^{k})\|^{2}+\frac{L}{4}\|\zeta^{k}-\zeta^{k+1}\|^{2}.
\end{aligned}

(43)
