
Stochastic forward-backward-half forward splitting algorithm with variance reduction

Liqian Qin,   Yaxuan Zhang  and  Qiao-Li Dong
College of Science, Civil Aviation University of China, Tianjin 300300, China,
email: [email protected]; email: [email protected]; Corresponding author, email: [email protected]
Abstract

In this paper, we present a stochastic forward-backward-half forward splitting algorithm with variance reduction for solving the structured monotone inclusion problem composed of a maximally monotone operator, a maximally monotone and Lipschitz continuous operator and a cocoercive operator. By defining a Lyapunov function, we establish the almost sure convergence of the proposed algorithm, and obtain the linear convergence when one of the maximally monotone operators is strongly monotone. Numerical examples are provided to show the performance of the proposed algorithm.

Key words: Variance reduction; Forward-backward-half forward splitting algorithm; Monotone inclusion problem; Almost sure convergence; Strong monotonicity; Linear convergence.

1 Introduction and preliminaries

In this paper, we consider the structured monotone inclusion problem which is to find xdx\in\mathbb{R}^{d} such that

0(A+B+C)(x),\displaystyle\ 0\in(A+B+C)(x), (1)

where A:d2dA:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} is a maximally monotone operator, B:ddB:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a maximally monotone and LBL_{B}-Lipschitz operator, and C:ddC:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a β\beta-cocoercive operator. Problem (1) arises in various applications such as optimization problems [4, 7], variational inequalities [1], deep learning [3] and image deblurring [22].

Numerous iterative algorithms for solving (1) have been presented and analyzed, see, for instance, [6, 7, 11, 12, 13, 14, 15, 19, 21, 22] and references therein. In particular, Briceño-Arias et al. [4] first proposed a forward-backward-half forward (FBHF) splitting algorithm as follows

\left\{\begin{array}[]{lr}p^{k}=J_{\gamma^{k}A}\left(x^{k}-\gamma^{k}(B+C)x^{k}\right),&\\ x^{k+1}=P_{X}\left(p^{k}+\gamma^{k}(Bx^{k}-Bp^{k})\right),&\\ \end{array}\right. (2)

where \gamma^{k} is the step-size with \gamma^{k}\in[\eta,\chi-\eta], \eta\in(0,\frac{\chi}{2}), \chi=\frac{4\beta}{1+\sqrt{1+16\beta^{2}L_{B}^{2}}}, J_{\gamma^{k}A}=({\rm Id}+\gamma^{k}A)^{-1} is the resolvent of A, and X is a nonempty closed convex subset of \mathbb{R}^{d} containing a solution of problem (1). They obtained the weak convergence of the method (2) in real Hilbert spaces.
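To make the structure of (2) concrete, the following is a minimal Python/NumPy sketch of one FBHF iteration; the names fbhf_step, resolvent_A, proj_X and the toy operators in the usage part are illustrative placeholders under stated assumptions, not part of the original presentation.

```python
import numpy as np

def fbhf_step(x, gamma, resolvent_A, B, C, proj_X):
    """One iteration of the FBHF scheme (2).

    resolvent_A(z, gamma) should return J_{gamma A}(z) = (Id + gamma*A)^{-1}(z),
    B and C are the single-valued operators, and proj_X projects onto X.
    """
    # forward step with B + C, followed by the backward (resolvent) step with A
    p = resolvent_A(x - gamma * (B(x) + C(x)), gamma)
    # half forward correction with B only, then projection onto X
    return proj_X(p + gamma * (B(x) - B(p)))

# toy instance: A = normal cone of the nonnegative orthant (resolvent = projection),
# B a skew-symmetric linear map (monotone, Lipschitz), C the gradient of a quadratic
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 5
    M = rng.standard_normal((d, d))
    S = M - M.T                                   # skew-symmetric, hence monotone
    G = rng.standard_normal((d, d))
    b = rng.standard_normal(d)
    B = lambda z: S @ z
    C = lambda z: G.T @ (G @ z - b)               # beta-cocoercive with beta = 1/||G||^2
    resolvent_A = lambda z, g: np.maximum(z, 0.0)
    proj_X = lambda z: z                          # X = R^d in this toy example
    beta = 1.0 / np.linalg.norm(G, 2) ** 2
    LB = np.linalg.norm(S, 2)
    chi = 4 * beta / (1 + np.sqrt(1 + 16 * beta ** 2 * LB ** 2))
    gamma = 0.9 * chi                             # constant step-size inside (0, chi)
    x = rng.standard_normal(d)
    for _ in range(2000):
        x = fbhf_step(x, gamma, resolvent_A, B, C, proj_X)
```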

In many cases, monotone inclusion problems have a finite sum structure. For example, finite sum minimization is ubiquitous in machine learning where we minimize the empirical risk [10], and nonlinear constrained optimization problems [4]. Finite sum saddle-point problems and finite sum variational inequalities can also be transformed into the monotone inclusion problems [20]. Given the effectiveness of variance-reduced algorithms for finite sum function minimization, a natural idea is to use similar algorithms to solve the more general finite sum monotone inclusion problems.

Now, we detail our problem setting. Suppose that the maximally monotone operator BB in (1) has a finite sum representation B=i=1NBiB=\sum_{i=1}^{N}B_{i}, where each BiB_{i} is LiL_{i}-Lipschitz, BB is LBL_{B}-Lipschitz and it is LL-Lipschitz in mean. Then the problem (1) can be written in the following form

Findxdsuch that 0(A+i=1NBi+C)(x).\mbox{Find}\ x\in\mathbb{R}^{d}\ \ \mbox{such that}\ 0\in(A+\sum_{i=1}^{N}B_{i}+C)(x).

It may be the case that the L_{i} are easy to compute, but L_{B} is not. In this case, \sum_{i=1}^{N}L_{i}\geq L_{B} provides the most natural upper bound on L_{B}. On the other hand, the cost of computing Bx is rather expensive when N is very large.

Throughout this paper, we assume access to a stochastic oracle B_{\xi} that is unbiased, i.e., B(x)=\mathbb{E}[B_{\xi}(x)], and consider using the stochastic oracle B_{\xi} in place of B in the half forward step of (2), which yields a lower cost per iteration. The two simplest stochastic oracles can be defined as follows (a small sampling sketch is given after the list).

  • (i)

    Uniform sampling: Bξ(x)=NBi(x)B_{\xi}(x)=NB_{i}(x), Pξ(i)=Prob{ξ=i}=1NP_{\xi}(i)=\operatorname{Prob}\{\xi=i\}=\frac{1}{N}. In this case, L=Ni=1NLi2L=\sqrt{N\sum_{i=1}^{N}L_{i}^{2}}.

  • (ii)

    Importance sampling: Bξ(x)=1PiBi(x)B_{\xi}(x)=\frac{1}{P_{i}}B_{i}(x), Pξ(i)=Prob{ξ=i}=Lij=1NLjP_{\xi}(i)=\operatorname{Prob}\{\xi=i\}=\frac{L_{i}}{\sum_{j=1}^{N}L_{j}}. In this case, L=i=1NLiL=\sum_{i=1}^{N}L_{i}.
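As a small illustration (not taken from the original text), the two oracles can be implemented as follows for a finite sum of linear maps B_i(x)=M_i x; the matrices M_i and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 10
Ms = [rng.standard_normal((d, d)) for _ in range(N)]      # B_i(x) = M_i x
Ls = np.array([np.linalg.norm(M, 2) for M in Ms])          # L_i = ||M_i||

def B_full(x):
    # B(x) = sum_i B_i(x); one evaluation touches all N components
    return sum(M @ x for M in Ms)

def B_uniform(x):
    # uniform sampling: B_xi(x) = N * B_i(x), Prob{xi = i} = 1/N
    i = rng.integers(N)
    return N * (Ms[i] @ x)

def B_importance(x):
    # importance sampling: B_xi(x) = B_i(x)/P_i, P_i = L_i / sum_j L_j
    P = Ls / Ls.sum()
    i = rng.choice(N, p=P)
    return (Ms[i] @ x) / P[i]

# both oracles are unbiased: the empirical mean of many draws approaches B(x)
x = rng.standard_normal(d)
estimate = np.mean([B_uniform(x) for _ in range(20000)], axis=0)
print(np.linalg.norm(estimate - B_full(x)))
```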

Recently, Kovalev et al. [9] proposed a loopless variant of SVRG [8] which removes the outer loop present in SVRG and uses a probabilistic update of the full gradient instead. Later, Alacaoglu et al. [1] proposed a loopless version of the extragradient method with variance reduction for solving variational inequalities. They also applied the same idea to the forward-backward-forward (FBF) splitting algorithm, introduced by Tseng [18], to solve the monotone inclusion problem with two operators,

findxdsuch that 0(A+B)(x),\displaystyle\mbox{find}\ x\in\mathbb{R}^{d}\ \ \mbox{such that}\ 0\in(A+B)(x),

where A:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} and B:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} are maximally monotone operators, and B has a stochastic oracle B_{\xi} that is unbiased and \bar{L}-Lipschitz in mean. They proved the almost sure convergence of the forward-backward-forward splitting algorithm with variance reduction when B_{\xi} is continuous for every \xi. However, if one extends the forward-backward-forward splitting algorithm with variance reduction to solve problem (1), the cocoercive operator C is required to admit a finite-sum structure as well.

In this paper, we propose a stochastic forward-backward-half forward splitting algorithm with variance reduction (shortly, VRFBHF). Under some mild assumptions, we establish the almost sure convergence of the sequence {xk}k\{x^{k}\}_{k\in\mathbb{N}} generated by our algorithm. Lyapunov analysis of the proposed algorithm is based on the monotonicity inequalities of AA and BB, and the cocoercivity inequality of CC. Furthermore, we obtain the linear convergence when AA or BB is strongly monotone. Numerical experiments are conducted to demonstrate the efficacy of the proposed algorithm.

In the following, we recall some definitions and known results that will be useful in the subsequent analysis.

Throughout this paper, \mathbb{R}^{d} is the d-dimensional Euclidean space with inner product \langle\cdot,\cdot\rangle and induced norm \|\cdot\|. The set of nonnegative integers is denoted by \mathbb{N}. The probability mass function P_{\xi}(\cdot) is supported on \{1,\ldots,N\}.

Definition 1.1.

([2, Definition 20.1 and Definition 20.20]) A set-valued mapping A:d2dA:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} is characterized by its graph gra(A)={(x,u)d×d:uAx}.\operatorname*{gra}(A)=\{(x,u)\in\mathbb{R}^{d}\times\mathbb{R}^{d}:u\in Ax\}. A set-valued mapping A:d2dA:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} is said to be
(i) monotone if \langle u-v,x-y\rangle\geq 0 for all (x,u),(y,v)\in\operatorname{gra}(A);
(ii) maximally monotone if there exists no monotone operator B:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} such that \operatorname{gra}(B) properly contains \operatorname{gra}(A), i.e., for every (x,u)\in\mathbb{R}^{d}\times\mathbb{R}^{d},

(x,u)gra(A)uv,xy0,(y,v)gra(A).(x,u)\in\operatorname*{gra}(A)\ \ \Leftrightarrow\ \ \langle u-v,x-y\rangle\geq 0,\ \ \forall(y,v)\in\operatorname*{gra}(A).
Definition 1.2.

An operator T:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is said to be
(i) L-Lipschitz continuous, if there exists a constant L>0 such that

TxTyLxy,x,yd;\|Tx-Ty\|\leq L\|x-y\|,\quad\forall x,y\in\mathbb{R}^{d};

(ii) β{\beta}-cocoercive, if there exists a constant β>0\beta>0, such that

TxTy,xyβTxTy2,x,yd.\langle Tx-Ty,x-y\rangle\geq{\beta}\|Tx-Ty\|^{2},\quad\forall x,y\in\mathbb{R}^{d}.

By the Cauchy–Schwarz inequality, a β{\beta}-cocoercive operator is 1β\frac{1}{\beta}-Lipschitz continuous.
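In detail, combining the cocoercivity inequality with the Cauchy–Schwarz inequality gives

\beta\|Tx-Ty\|^{2}\leq\langle Tx-Ty,x-y\rangle\leq\|Tx-Ty\|\,\|x-y\|,\quad\forall x,y\in\mathbb{R}^{d},

and dividing by \|Tx-Ty\| (when it is nonzero) yields \|Tx-Ty\|\leq\frac{1}{\beta}\|x-y\|.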

Lemma 1.3.

([2, Proposition 20.38]) Let A:𝒵2𝒵A:\mathcal{Z}\rightarrow 2^{\mathcal{Z}} be maximally monotone, where 𝒵\mathcal{Z} is a finite dimensional Euclidean space. Then gra(A)\operatorname*{gra}(A) is closed in 𝒵strong×𝒵strong{\mathcal{Z}}^{\rm strong}\times{\mathcal{Z}}^{\rm strong}, i.e., for every sequence (xk,uk)k(x^{k},u^{k})_{k\in\mathbb{N}} in gra(A)\operatorname*{gra}(A) and (x,u)𝒵×𝒵(x,u)\in\mathcal{Z}\times\mathcal{Z}, if xkxx^{k}\rightarrow x and ukuu^{k}\rightarrow u, then (x,u)gra(A)(x,u)\in\operatorname*{gra}(A).

Lemma 1.4.

([17, Theorem 1]) Let (Ω,,P)(\Omega,\mathcal{F},P) be a probability space and 12\mathcal{F}_{1}\subset\mathcal{F}_{2}\subset... be a sequence of sub-σ\sigma-algebras of \mathcal{F}. For each k=1,2,,k=1,2,..., let zkz^{k}, βk\beta^{k}, ξk\xi^{k}, and ζk\zeta^{k} be non-negative k\mathcal{F}_{k}-measurable random variables such that

𝔼(zk+1|k)(1+βk)zk+ξkζk.\mathbb{E}(z^{k+1}|\mathcal{F}_{k})\leq(1+\beta^{k})z^{k}+\xi^{k}-\zeta^{k}.

Then \lim_{k\rightarrow\infty}z^{k} exists and \sum_{k=1}^{\infty}\zeta^{k}<\infty almost surely on the event \{\sum_{k=1}^{\infty}\beta^{k}<\infty\ \mbox{and}\ \sum_{k=1}^{\infty}\xi^{k}<\infty\}.

The paper is organized as follows. In Section 2, we introduce the stochastic forward-backward-half forward splitting algorithm with variance reduction to solve the problem (1), and show the almost sure and linear convergence of the proposed algorithm. Finally, we present the numerical experiments in Section 3.

2 Main Results

In the sequel, we assume that the following conditions are satisfied:

Assumption 2.1.
  • (i)

The operator A:\mathbb{R}^{d}\to 2^{\mathbb{R}^{d}} is maximally monotone;

  • (ii)

    The operator B:ddB:\mathbb{R}^{d}\to\mathbb{R}^{d} is single-valued, monotone and LBL_{B}-Lipschitz;

  • (iii)

    The operator BB has a stochastic oracle BξB_{\xi} that is unbiased, B(x)=𝔼[Bξ(x)]B(x)=\mathbb{E}[B_{\xi}(x)], and LL-Lipschitz in mean:

    𝔼[Bξ(u)Bξ(v)2]L2uv2,u,vd;\mathbb{E}[\|B_{\xi}(u)-B_{\xi}(v)\|^{2}]\leq L^{2}\|u-v\|^{2},\quad\forall u,v\in\mathbb{R}^{d}; (3)
  • (iv)

The operator C:\mathbb{R}^{d}\to\mathbb{R}^{d} is \beta-cocoercive;

  • (v)

    The solution set of the problem (1), denoted by zer(A+B+C){\rm zer}(A+B+C), is nonempty.

We now present the stochastic forward-backward-half forward splitting algorithm with variance reduction to solve the problem (1).

Algorithm 2.2.

  VRFBHF   

1. Input: probability p\in(0,1], probability distribution Q, step-size \gamma>0, and \lambda\in(0,1).

Let x0=w0x^{0}=w^{0}.

2. for k=0,1,k=0,1,\ldots do

3. x¯k=λxk+(1λ)wk\bar{x}^{k}=\lambda x^{k}+(1-\lambda)w^{k}

4. yk=JγA(x¯kγ(B+C)wk)y^{k}=J_{\gamma A}\left(\bar{x}^{k}-\gamma(B+C)w^{k}\right)

5. Draw an index ξk\xi_{k} according to QQ

6. xk+1=yk+γ(BξkwkBξkyk)x^{k+1}=y^{k}+\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k})

7. wk+1={xk+1,with probabilitypwk,with probability  1pw^{k+1}=\begin{cases}x^{k+1},&\hbox{with probability}\,\,p\\ w^{k},&\hbox{with probability}\,\,1-p\\ \end{cases}

8. end for
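A minimal Python/NumPy sketch of Algorithm 2.2 is given below, assuming the user supplies the resolvent of A, the full operators B and C, and a sampled oracle; the function vrfbhf and its argument names are illustrative only, not the authors' implementation.

```python
import numpy as np

def vrfbhf(x0, gamma, lam, p, resolvent_A, B, C, B_oracle, n_iters, seed=0):
    """Sketch of Algorithm 2.2 (VRFBHF).

    resolvent_A(z, gamma) returns J_{gamma A}(z); B and C are the full operators;
    B_oracle(u, v, rng) returns the pair (B_xi(u), B_xi(v)) for a single index xi
    sampled from the distribution Q.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    w = x.copy()                                            # reference point, w^0 = x^0
    Bw, Cw = B(w), C(w)                                     # full evaluations cached at w^k
    for _ in range(n_iters):
        x_bar = lam * x + (1 - lam) * w                     # step 3
        y = resolvent_A(x_bar - gamma * (Bw + Cw), gamma)   # step 4
        Bxi_w, Bxi_y = B_oracle(w, y, rng)                  # steps 5-6: one sampled index
        x = y + gamma * (Bxi_w - Bxi_y)
        if rng.random() < p:                                # step 7: refresh w with prob. p
            w = x.copy()
            Bw, Cw = B(w), C(w)
    return x
```

Note that in such an implementation the full operators B and C only need to be re-evaluated when the reference point w^k is refreshed, which happens with probability p, so keeping p small keeps the expected per-iteration cost close to that of a single component evaluation.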

Remark 2.3.

Algorithm 2.2 is quite general and, to the best of our knowledge, new to the literature. We now review how it relates to previous work. Algorithm 2.2 becomes the forward-backward-forward algorithm with variance reduction in [1] if C=0. It reduces to loopless SVRG in [9] if \lambda=1, B=\nabla f, A=0 and C=0, where f(x)=\sum_{i=1}^{N}f_{i}(x) and f_{i}(x) is the loss of model x on data point i.

Remark 2.4.

We have two sources of randomness at each iteration: the index \xi_{k}, which is used to update x^{k+1}, and the reference point w^{k}, which is replaced by the iterate x^{k+1} with probability p or left unchanged with probability 1-p. Intuitively, we wish to keep p small to lower the cost per iteration. Moreover, unlike the FBHF splitting algorithm (2), we use the parameter \lambda to add the averaging step \bar{x}^{k}=\lambda x^{k}+(1-\lambda)w^{k}, which assigns some weight to past iteration points.

2.1 The almost sure convergence

In this subsection, we establish the almost sure convergence of Algorithm 2.2. We use the following notations for conditional expectations: 𝔼k[]=𝔼[|σ(ξ0,,ξk1,wk)]\mathbb{E}_{k}[\cdot]=\mathbb{E}[\cdot|\sigma(\xi_{0},...,\xi_{k-1},w^{k})] and 𝔼k+12[]=𝔼[|σ(ξ0,,ξk,wk)]\mathbb{E}_{k+\frac{1}{2}}[\cdot]=\mathbb{E}[\cdot|\sigma(\xi_{0},...,\xi_{k},w^{k})].

For the iterates {xk}k\{x^{k}\}_{k\in\mathbb{N}} and {wk}k\{w^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2, we define the Lyapunov function

Φk(x):=λxkx2+1λpwkx2,xd,\Phi_{k}(x):=\lambda\|x^{k}-x\|^{2}+\frac{1-\lambda}{p}\|w^{k}-x\|^{2},\ \forall x\in\mathbb{R}^{d},

which helps to establish the almost sure convergence of the proposed algorithm.

Theorem 2.1.

Let Assumption 2.1 hold, \lambda\in(0,1), p\in(0,1], and \gamma\in(0,\frac{4\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}). Then for \{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2 and any x^{*}\in\operatorname{zer}(A+B+C), it holds that

𝔼k[Φk+1(x)]Φk(x).\mathbb{E}_{k}[\Phi_{k+1}(x^{*})]\leq\Phi_{k}(x^{*}). (4)

Moreover, the sequence \{x^{k}\}_{k\in\mathbb{N}} converges almost surely to some point in \operatorname{zer}(A+B+C).

Proof.

Since xzer(A+B+C),x^{*}\in\operatorname*{zer}(A+B+C), we have

γ(B+C)xγAx.-\gamma(B+C)x^{*}\in\gamma Ax^{*}. (5)

Step 4 in Algorithm 2.2 is equivalent to the inclusion

x¯kykγ(B+C)wkγAyk.\bar{x}^{k}-y^{k}-\gamma(B+C)w^{k}\in\gamma Ay^{k}. (6)

Combining (5), (6) and the monotonicity of AA, we have

ykx¯k+γ(B+C)wk,xykγ(B+C)x,xyk0.\langle y^{k}-\bar{x}^{k}+\gamma(B+C)w^{k},x^{*}-y^{k}\rangle-\gamma\langle(B+C)x^{*},x^{*}-y^{k}\rangle\geq 0.

Then from step 6 in Algorithm 2.2, we obtain

xk+1x¯k+γ(BwkBξkwk+Bξkyk)+γCwk,xykγ(B+C)x,xyk0.\langle x^{k+1}-\bar{x}^{k}+\gamma(Bw^{k}-B_{\xi_{k}}w^{k}+B_{\xi_{k}}y^{k})+\gamma Cw^{k},x^{*}-y^{k}\rangle-\gamma\langle(B+C)x^{*},x^{*}-y^{k}\rangle\geq 0. (7)

By the definition of x¯k\bar{x}^{k} and identities 2a,b=a+b2a2b2=a2+b2ab22\langle a,b\rangle=\|a+b\|^{2}-\|a\|^{2}-\|b\|^{2}=\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}, we have

2xk+1x¯k,xyk\displaystyle 2\langle x^{k+1}-\bar{x}^{k},x^{*}-y^{k}\rangle (8)
=\displaystyle= 2xk+1yk,xyk+2ykx¯k,xyk\displaystyle 2\langle x^{k+1}-y^{k},x^{*}-y^{k}\rangle+2\langle y^{k}-\bar{x}^{k},x^{*}-y^{k}\rangle
=\displaystyle= xk+1yk2+xyk2xk+1x2+2λykxk,xyk\displaystyle\|x^{k+1}-y^{k}\|^{2}+\|x^{*}-y^{k}\|^{2}-\|x^{k+1}-x^{*}\|^{2}+2\lambda\langle y^{k}-x^{k},x^{*}-y^{k}\rangle
+2(1λ)ykwk,xyk\displaystyle+2(1-\lambda)\langle y^{k}-w^{k},x^{*}-y^{k}\rangle
=\displaystyle= xk+1yk2+xyk2xk+1x2\displaystyle\|x^{k+1}-y^{k}\|^{2}+\|x^{*}-y^{k}\|^{2}-\|x^{k+1}-x^{*}\|^{2}
+λ(xkx2ykxk2ykx2)\displaystyle+\lambda(\|x^{k}-x^{*}\|^{2}-\|y^{k}-x^{k}\|^{2}-\|y^{k}-x^{*}\|^{2})
+(1λ)(wkx2ykwk2ykx2)\displaystyle+(1-\lambda)(\|w^{k}-x^{*}\|^{2}-\|y^{k}-w^{k}\|^{2}-\|y^{k}-x^{*}\|^{2})
=\displaystyle= xk+1yk2xk+1x2+λxkx2λykxk2\displaystyle\|x^{k+1}-y^{k}\|^{2}-\|x^{k+1}-x^{*}\|^{2}+\lambda\|x^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2}
+(1λ)wkx2(1λ)ykwk2.\displaystyle+(1-\lambda)\|w^{k}-x^{*}\|^{2}-(1-\lambda)\|y^{k}-w^{k}\|^{2}.

By the β\beta-cocoercivity of CC and Young’s inequality a,bβa2+14βb2\langle a,b\rangle\leq\beta\|a\|^{2}+\frac{1}{4\beta}\|b\|^{2} for all a,bda,b\in\mathbb{R}^{d}, we get

2γCwkCx,xyk\displaystyle 2\gamma\langle Cw^{k}-Cx^{*},x^{*}-y^{k}\rangle (9)
=\displaystyle= 2γCwkCx,xwk+2γCwkCx,wkyk\displaystyle 2\gamma\langle Cw^{k}-Cx^{*},x^{*}-w^{k}\rangle+2\gamma\langle Cw^{k}-Cx^{*},w^{k}-y^{k}\rangle
\displaystyle\leq 2γβCwkCx2+2γβCwkCx2+γ2βwkyk2\displaystyle-2\gamma\beta\|Cw^{k}-Cx^{*}\|^{2}+2\gamma\beta\|Cw^{k}-Cx^{*}\|^{2}+\frac{\gamma}{2\beta}\|w^{k}-y^{k}\|^{2}
=\displaystyle= γ2βwkyk2.\displaystyle\frac{\gamma}{2\beta}\|w^{k}-y^{k}\|^{2}.

We use (8) and (9) in (7) to obtain

2γBx(BwkBξkwk+Bξkyk),xyk+xk+1x2\displaystyle\ \ \ \ 2\gamma\langle Bx^{*}-(Bw^{k}-B_{\xi_{k}}w^{k}+B_{\xi_{k}}y^{k}),x^{*}-y^{k}\rangle+\|x^{k+1}-x^{*}\|^{2} (10)
λxkx2+(1λ)wkx2+xk+1yk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+(1-\lambda)\|w^{k}-x^{*}\|^{2}+\|x^{k+1}-y^{k}\|^{2}
λykxk2(1λγ2β)ykwk2.\displaystyle\ \ \ \ -\lambda\|y^{k}-x^{k}\|^{2}-(1-\lambda-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

Taking the conditional expectation \mathbb{E}_{k} in (10) and using the identity (which follows from the unbiasedness of B_{\xi} and the fact that w^{k} and y^{k} are measurable with respect to the conditioning \sigma-algebra)

𝔼k[BwkBξkwk+Bξkyk,xyk]=Byk,xyk,\mathbb{E}_{k}[\langle Bw^{k}-B_{\xi_{k}}w^{k}+B_{\xi_{k}}y^{k},x^{*}-y^{k}\rangle]=\langle By^{k},x^{*}-y^{k}\rangle,

we obtain

2γBxByk,xyk+𝔼kxk+1x2\displaystyle\ \ \ \ 2\gamma\langle Bx^{*}-By^{k},x^{*}-y^{k}\rangle+\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}
λxkx2+(1λ)wkx2+𝔼kxk+1yk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+(1-\lambda)\|w^{k}-x^{*}\|^{2}+\mathbb{E}_{k}\|x^{k+1}-y^{k}\|^{2}
λykxk2(1λγ2β)ykwk2.\displaystyle\ \ \ \ -\lambda\|y^{k}-x^{k}\|^{2}-(1-\lambda-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

By the monotonicity of B,B, we have

BxByk,xyk0.\langle Bx^{*}-By^{k},x^{*}-y^{k}\rangle\geq 0. (11)

Combining the definition of x^{k+1}, i.e., x^{k+1}-y^{k}=\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k}), with (3), we have

𝔼kxk+1yk2γ2L2ykwk2.\mathbb{E}_{k}\|x^{k+1}-y^{k}\|^{2}\leq\gamma^{2}L^{2}\|y^{k}-w^{k}\|^{2}.

Therefore,

𝔼kxk+1x2\displaystyle\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2} λxkx2+(1λ)wkx2λykxk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+(1-\lambda)\|w^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2} (12)
(1λγ2L2γ2β)ykwk2.\displaystyle\ \ \ \ -(1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

On the other hand, the definition of wk+1w^{k+1} and 𝔼k+12\mathbb{E}_{k+\frac{1}{2}} yield that

1λp𝔼k+12[wk+1x2]=(1λ)xk+1x2+(1λ)1ppwkx2.\frac{1-\lambda}{p}\mathbb{E}_{k+\frac{1}{2}}[\|w^{k+1}-x^{*}\|^{2}]=(1-\lambda)\|x^{k+1}-x^{*}\|^{2}+(1-\lambda)\frac{1-p}{p}\|w^{k}-x^{*}\|^{2}. (13)

Then, applying the tower property \mathbb{E}_{k}[\mathbb{E}_{k+\frac{1}{2}}[\cdot]]=\mathbb{E}_{k}[\cdot] to (13), we have

1λp𝔼k[wk+1x2]=(1λ)𝔼kxk+1x2+(1λ)1ppwkx2.\frac{1-\lambda}{p}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]=(1-\lambda)\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+(1-\lambda)\frac{1-p}{p}\|w^{k}-x^{*}\|^{2}. (14)

We add (14) to (12) to obtain

λ𝔼kxk+1x2+1λp𝔼kwk+1x2\displaystyle\ \ \ \ \lambda\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+\frac{1-\lambda}{p}\mathbb{E}_{k}\|w^{k+1}-x^{*}\|^{2} (15)
λxkx2+1λpwkx2λykxk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+\frac{1-\lambda}{p}\|w^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2}
(1λγ2L2γ2β)ykwk2.\displaystyle\ \ \ \ -(1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

Thus, inequality (4) holds, since \gamma\in(0,\frac{4\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}) and 0<\lambda<1 guarantee that 1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta}>0.
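To see why this choice of \gamma suffices, note that 1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta}>0 exactly when \gamma lies below the positive root of the quadratic L^{2}\gamma^{2}+\frac{\gamma}{2\beta}-(1-\lambda), and a short computation identifies this root with the stated upper bound:

\gamma_{+}=\frac{-\frac{1}{2\beta}+\sqrt{\frac{1}{4\beta^{2}}+4L^{2}(1-\lambda)}}{2L^{2}}=\frac{-1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}{4\beta L^{2}}=\frac{4\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}.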

Next, we show the almost sure convergence of the sequence \{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2. By Lemma 1.4, the sequence \{\Phi_{k}(x^{*})\}_{k\in\mathbb{N}} converges almost surely, and \{\|y^{k}-x^{k}\|^{2}\}_{k\in\mathbb{N}} and \{\|y^{k}-w^{k}\|^{2}\}_{k\in\mathbb{N}} converge to 0 almost surely. Then, applying Proposition 2.3 in [5], we can construct a set \Xi with \mathbb{P}(\Xi)=1 such that for all \theta\in\Xi and all x^{*}\in\operatorname{zer}(A+B+C),

{λxk(θ)x2+1λpwk(θ)x2}kconverges,\{\lambda\|x^{k}(\theta)-x^{*}\|^{2}+\frac{1-\lambda}{p}\|w^{k}(\theta)-x^{*}\|^{2}\}_{k\in\mathbb{N}}\ \hbox{converges,} (16)

which implies that the sequence {xk(θ)}k\{x^{k}(\theta)\}_{k\in\mathbb{N}} is bounded. Let Ξ\Xi^{{}^{\prime}} be the probability 1 set such that θΞ\forall\theta\in\Xi^{{}^{\prime}}, yk(θ)xk(θ)0y^{k}(\theta)-x^{k}(\theta)\rightarrow 0, yk(θ)wk(θ)0y^{k}(\theta)-w^{k}(\theta)\rightarrow 0, which implies yk(θ)x¯k(θ)0y^{k}(\theta)-\bar{x}^{k}(\theta)\rightarrow 0. Pick θΞΞ\theta\in\Xi\bigcap\Xi^{{}^{\prime}} and let {xkj(θ)}j\{x^{k_{j}}(\theta)\}_{j\in\mathbb{N}} be the convergent subsequence of the bounded sequence {xk(θ)}k\{x^{k}(\theta)\}_{k\in\mathbb{N}}, say without loss of generality that xkj(θ)x¯(θ)x^{k_{j}}(\theta)\rightarrow\bar{x}(\theta) as jj\rightarrow\infty. From ykj(θ)xkj(θ)0y^{k_{j}}(\theta)-x^{k_{j}}(\theta)\rightarrow 0, it follows that ykj(θ)x¯(θ)y^{k_{j}}(\theta)\rightarrow\bar{x}(\theta) as jj\rightarrow\infty. Then according to (6), we can get

x¯kj(θ)ykj(θ)γ((B+C)wkj(θ)(B+C)ykj(θ))γ(A+B+C)ykj(θ).\bar{x}^{k_{j}}(\theta)-y^{k_{j}}(\theta)-\gamma((B+C)w^{k_{j}}(\theta)-(B+C)y^{k_{j}}(\theta))\in\gamma(A+B+C)y^{k_{j}}(\theta).

We know that B+CB+C is (LB+1β)(L_{B}+\frac{1}{\beta})-Lipschitz. Therefore,

x¯kj(θ)ykj(θ)γ((B+C)wkj(θ)(B+C)ykj(θ))0,j.\bar{x}^{k_{j}}(\theta)-y^{k_{j}}(\theta)-\gamma((B+C)w^{k_{j}}(\theta)-(B+C)y^{k_{j}}(\theta))\rightarrow 0,\ j\rightarrow\infty.

Furthermore, since the operator B has full domain, A+B is maximally monotone. Since C is cocoercive with full domain, A+B+C is also maximally monotone. By Lemma 1.3, (\bar{x}(\theta),0)\in\operatorname{gra}(A+B+C), i.e., \bar{x}(\theta)\in\operatorname{zer}(A+B+C).

Hence, all cluster points of \{x^{k}(\theta)\}_{k\in\mathbb{N}} and \{w^{k}(\theta)\}_{k\in\mathbb{N}} belong to \operatorname{zer}(A+B+C). In particular, at least one subsequence of \{\lambda\|x^{k}(\theta)-\bar{x}(\theta)\|^{2}+\frac{1-\lambda}{p}\|w^{k}(\theta)-\bar{x}(\theta)\|^{2}\}_{k\in\mathbb{N}} converges to 0. Combining this with (16) applied to x^{*}=\bar{x}(\theta), we deduce that \lambda\|x^{k}(\theta)-\bar{x}(\theta)\|^{2}+\frac{1-\lambda}{p}\|w^{k}(\theta)-\bar{x}(\theta)\|^{2}\rightarrow 0 and consequently \|x^{k}(\theta)-\bar{x}(\theta)\|^{2}\rightarrow 0. This shows that \{x^{k}\}_{k\in\mathbb{N}} converges almost surely to a point in \operatorname{zer}(A+B+C). ∎

2.2 Linear convergence

In this subsection, we show the linear convergence of Algorithm 2.2 for solving the structured monotone inclusion problem (1) when BB is μ\mu-strongly monotone. Indeed, assuming that the operator AA is strongly monotone also leads to a linear convergence result, and the proof procedure is similar.

Theorem 2.2.

Let Assumption 2.1 hold, BB be μ\mu-strongly monotone. If we set λ=1p\lambda=1-p, and γ=min{p2L,βp}\gamma=\min\{\frac{\sqrt{p}}{2L},\beta p\} in Algorithm 2.2, then for the sequence {xk}k\{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2 and any xzer(A+B+C)x^{*}\in\operatorname*{zer}(A+B+C), it holds that

𝔼xkx2(11+c/4)k21px0x2,\mathbb{E}\|x^{k}-x^{*}\|^{2}\leq(\frac{1}{1+c/4})^{k}\frac{2}{1-p}\|x^{0}-x^{*}\|^{2}, (17)

with c=min{γμ,p(1+p)(4+p)}c=\min\{\gamma\mu,\frac{p}{(1+\sqrt{p})(4+p)}\}.

Proof.

If BB is μ\mu-strongly monotone, then (11) becomes

BxByk,xykμxyk2.\langle Bx^{*}-By^{k},x^{*}-y^{k}\rangle\geq\mu\|x^{*}-y^{k}\|^{2}.

We continue as in the proof of Theorem 2.1 to obtain, instead of (15),

2γμykx2+λ𝔼kxk+1x2+1λp𝔼kwk+1x2\displaystyle\ \ \ \ 2\gamma\mu\|y^{k}-x^{*}\|^{2}+\lambda\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+\frac{1-\lambda}{p}\mathbb{E}_{k}\|w^{k+1}-x^{*}\|^{2} (18)
λxkx2+1λpwkx2λykxk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+\frac{1-\lambda}{p}\|w^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2}
(1λγ2L2γ2β)ykwk2.\displaystyle\ \ \ \ -(1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

By a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, the step 6 and (3), we have

2γμykx2\displaystyle 2\gamma\mu\|y^{k}-x^{*}\|^{2} γμ𝔼k[xk+1x2]2γμ𝔼k[γ(BξkwkBξkyk)2]\displaystyle\geq\gamma\mu\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]-2\gamma\mu\mathbb{E}_{k}[\|\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k})\|^{2}] (19)
γμ𝔼k[xk+1x2]2γ3L2μykwk2.\displaystyle\geq\gamma\mu\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]-2\gamma^{3}L^{2}\mu\|y^{k}-w^{k}\|^{2}.

Combining (18), (19) and λ=1p\lambda=1-p, we get

(1p+γμ)𝔼k[xk+1x2]+𝔼k[wk+1x2]\displaystyle\ \ \ \ (1-p+\gamma\mu)\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]+\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}] (20)
(1p)xkx2+wkx2(1p)ykxk2\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}-(1-p)\|y^{k}-x^{k}\|^{2}
(pγ2L2γ2β2γ3L2μ)ykwk2\displaystyle\ \ \ \ -(p-\gamma^{2}L^{2}-\frac{\gamma}{2\beta}-2\gamma^{3}L^{2}\mu)\|y^{k}-w^{k}\|^{2}
(1p)xkx2+wkx2(1p)ykxk2\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}-(1-p)\|y^{k}-x^{k}\|^{2}
p(1p)4ykwk2,\displaystyle\ \ \ \ -\frac{p(1-\sqrt{p})}{4}\|y^{k}-w^{k}\|^{2},

where the last inequality is obtained by γ=min{p2L,βp}\gamma=\min\{\frac{\sqrt{p}}{2L},\beta p\} and μL\mu\leq L. Similar to (19), we have

c2𝔼k[xk+1x2]\displaystyle\frac{c}{2}\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}] c4𝔼k[wk+1x2]c2𝔼k[𝔼k+12xk+1wk+12]\displaystyle\geq\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c}{2}\mathbb{E}_{k}[\mathbb{E}_{k+\frac{1}{2}}\|x^{k+1}-w^{k+1}\|^{2}] (21)
=c4𝔼k[wk+1x2]c(1p)2𝔼k[xk+1wk2]\displaystyle=\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c(1-p)}{2}\mathbb{E}_{k}[\|x^{k+1}-w^{k}\|^{2}]
=c4𝔼k[wk+1x2]c(1p)2𝔼k[ykwk+γ(BξkwkBξkyk)2]\displaystyle=\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c(1-p)}{2}\mathbb{E}_{k}[\|y^{k}-w^{k}+\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k})\|^{2}]
c4𝔼k[wk+1x2]c(1p)(1+γ2L2)ykwk2\displaystyle\geq\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-c(1-p)(1+\gamma^{2}L^{2})\|y^{k}-w^{k}\|^{2}
c4𝔼k[wk+1x2]c(1p)(4+p)4ykwk2.\displaystyle\geq\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c(1-p)(4+p)}{4}\|y^{k}-w^{k}\|^{2}.

Putting (21) into (20) and recalling that cγμc\leq\gamma\mu, we have

(1p+c2)𝔼k[xk+1x2]+(1+c4)𝔼k[wk+1x2]\displaystyle\ \ \ \ (1-p+\frac{c}{2})\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]+(1+\frac{c}{4})\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}] (22)
(1p)xkx2+wkx2(1p)ykxk2\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}-(1-p)\|y^{k}-x^{k}\|^{2}
[p(1p)4c(1p)(4+p)4]ykwk2\displaystyle\ \ \ \ -\left[\frac{p(1-\sqrt{p})}{4}-\frac{c(1-p)(4+p)}{4}\right]\|y^{k}-w^{k}\|^{2}
(1p)xkx2+wkx2,\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2},

where the last inequality comes from cp(1+p)(4+p).c\leq\frac{p}{(1+\sqrt{p})(4+p)}. Then, using 1p+c2(1p)(1+c4)1-p+\frac{c}{2}\geq(1-p)(1+\frac{c}{4}) and taking the full expectation on (22), we have

(1+c4)𝔼[(1p)xk+1x2+wk+1x2]𝔼[(1p)xkx2+wkx2].(1+\frac{c}{4})\mathbb{E}[(1-p)\|x^{k+1}-x^{*}\|^{2}+\|w^{k+1}-x^{*}\|^{2}]\leq\mathbb{E}[(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}].

Iterating this inequality and using w^{0}=x^{0}, we obtain

(1p)𝔼xkx2(11+c/4)k(2p)x0x2,(1-p)\mathbb{E}\|x^{k}-x^{*}\|^{2}\leq(\frac{1}{1+c/4})^{k}(2-p)\|x^{0}-x^{*}\|^{2},

which, together with 2-p\leq 2, shows (17). ∎

3 Numerical Simulations

In this section, we compare Algorithm 2.2 (VRFBHF) with the FBHF splitting algorithm (2). All codes were written in MATLAB R2018b and run on a desktop PC with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz and 8.00 GB of RAM. Consider the nonlinear constrained optimization problem of the form

minxCf(x)+h(x),\min_{x\in C}f(x)+h(x), (23)

where C=\{x\in\mathbb{R}^{d}\,|\,(\forall i\in\{1,\ldots,q\})\ g_{i}(x)\leq 0\}, f:\mathbb{R}^{d}\rightarrow(-\infty,+\infty] is a proper, convex and lower semi-continuous function, for every i\in\{1,\ldots,q\}, g_{i}:\operatorname{dom}(g_{i})\subset\mathbb{R}^{d}\rightarrow\mathbb{R} and h:\mathbb{R}^{d}\rightarrow\mathbb{R} are C^{1} convex functions on \operatorname{int}\operatorname{dom}g_{i} and \mathbb{R}^{d}, respectively, and \nabla h is \frac{1}{\beta}-Lipschitz continuous. The solution to the optimization problem (23) can be found via the saddle points of the Lagrangian

L(x,u)=f(x)+h(x)+ug(x)ι+q(u),L(x,u)=f(x)+h(x)+u^{\top}g(x)-\iota_{\mathbb{R}_{+}^{q}}(u),

where \iota_{\mathbb{R}_{+}^{q}} is the indicator function of \mathbb{R}_{+}^{q}. Under some standard qualification conditions, the solution to the optimization problem (23) can be found by solving the monotone inclusion [4, 16]: find x\in Y such that there exists u\in\mathbb{R}_{+}^{q} with

(0,0)(A+B+C)(x,u),(0,0)\in(A+B+C)(x,u), (24)

where YdY\subset\mathbb{R}^{d} is a nonempty closed convex set modeling the prior information of the solution, A:(x,u)f(x)×N+quA:(x,u)\mapsto\partial f(x)\times{N_{\mathbb{R}_{+}^{q}}u} is maximally monotone, C:(x,u)(h(x),0)C:(x,u)\mapsto(\nabla h(x),0) is β\beta-cocoercive, and B:(x,u)(i=1quigi(x),g1(x),,gq(x))B:(x,u)\mapsto(\sum_{i=1}^{q}u_{i}\nabla g_{i}(x),-g_{1}(x),...,-g_{q}(x)) is nonlinear, monotone and continuous. In the light of the structure of BB, we can rewrite BB as B=i=1qBiB=\sum_{i=1}^{q}B_{i} where Bi:(x,u)(uigi(x),0,,gi(x),,0)B_{i}:(x,u)\mapsto(u_{i}\nabla g_{i}(x),0,...,-g_{i}(x),...,0) for every i{1,,q}i\in\{1,...,q\}. In the numerical results listed in the following table, “Iter” denotes the number of iterations.

Example 3.1.

Let f=\iota_{[0,1]^{d}}, g_{i}(x)=d_{i}^{\top}x for all i\in\{1,\ldots,q\}, with d_{1},\ldots,d_{q}\in\mathbb{R}^{d}, and h=\frac{1}{2}\|Gx-b\|^{2} with G a t\times d real matrix, d=2t, and b\in\mathbb{R}^{t}. Then the operators in (24) become

A:(x,u)ι[0,1]d(x)×N+qu,\displaystyle A:(x,u)\mapsto\partial{\iota_{[0,1]^{d}}(x)}\times{N_{\mathbb{R}_{+}^{q}}u}, (25)
B:(x,u)(Du,Dx),\displaystyle B:(x,u)\mapsto(D^{\top}u,-Dx),
C:(x,u)(G(Gxb),0),\displaystyle C:(x,u)\mapsto(G^{\top}(Gx-b),0),

where x\in\mathbb{R}^{d}, u\in\mathbb{R}_{+}^{q}, and D=[d_{1},\ldots,d_{q}]^{\top}. It is easy to see that the operator A is maximally monotone, C is a \beta-cocoercive operator with \beta=\|G\|^{-2}, and B is an L-Lipschitz operator with L=\|D\|. According to the structure of the operator B, we rewrite B as B=\sum_{i=1}^{q}B_{i}, where B_{i}:(x,u)\mapsto(d_{i}u_{i},0,\ldots,-d_{i}^{\top}x,\ldots,0) for every i\in\{1,\ldots,q\}. For uniform sampling, the stochastic oracle is B_{\xi}(x,u)=qB_{i}(x,u) with P_{\xi}(i)=\operatorname{Prob}\{\xi=i\}=\frac{1}{q}, i\in\{1,\ldots,q\}.

Now, we use Algorithm 2.2 to solve the problem (1) with (25), then Algorithm 2.2 reduces to

\left\lfloor\begin{aligned} &\bar{x}^{k}=\lambda x^{k}+(1-\lambda)w^{k},\\ &\bar{u}^{k}=\lambda u^{k}+(1-\lambda)v^{k},\\ &y^{k}={\rm Prox}_{\gamma\iota_{[0,1]^{d}}}(\bar{x}^{k}-\gamma D^{T}v^{k}-\gamma(G^{T}(Gw^{k}-b))),\\ &\hbox{for every}\,\,j=1,\ldots,q,\\ &\left\lfloor\begin{aligned} &\eta^{k}_{j}=\max\{0,\bar{u}^{k}_{j}+\gamma d_{j}^{T}w^{k}\},\\ \end{aligned}\right.\\ &\hbox{Sample}\ \xi_{k}\ \hbox{uniformly at random from}\,\ \{1,\ldots,q\},\\ &x^{k+1}=y^{k}+\gamma qd_{\xi_{k}}(v_{\xi_{k}}^{k}-\eta_{\xi_{k}}^{k}),\\ &u^{k+1}=\eta^{k}+\gamma S_{\xi_{k}}(w^{k}-y^{k}),\\ &w^{k+1}=\begin{cases}x^{k+1},&\hbox{with probability}\,\,p\\ w^{k},&\hbox{with probability}\,\,1-p\\ \end{cases}\\ &v^{k+1}=\begin{cases}u^{k+1},&\hbox{with probability}\,\,p\\ v^{k},&\hbox{with probability}\,\,1-p\\ \end{cases}\\ \end{aligned}\right.

where S_{\xi_{k}} denotes the map z\mapsto(0,\ldots,-qd_{\xi_{k}}^{\top}z,\ldots,0), whose only nonzero entry is in the \xi_{k}-th component.

In the numerical test, G,D,bG,D,b and initial value (x0,u0)(x_{0},u_{0}) are all randomly generated. In VRFBHF, set (w0,v0)=(x0,u0)(w_{0},v_{0})=(x_{0},u_{0}), take λ=0.1\lambda=0.1 and γ=3.999β(1λ)1+1+16β2L2(1λ)\gamma=\frac{3.999\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}. In FBHF, take γ=3.999β1+1+16β2LB2\gamma=\frac{3.999\beta}{1+\sqrt{1+16\beta^{2}L_{B}^{2}}}. We use

Ek=(xk+1xk,uk+1uk)(xk,uk)<106,E_{k}=\frac{\|(x^{k+1}-x^{k},u^{k+1}-u^{k})\|}{\|(x^{k},u^{k})\|}<10^{-6},

as the stopping criterion.
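For illustration, the following Python/NumPy sketch builds a random instance of Example 3.1 and evaluates the operators in (25) together with the uniform-sampling oracle; it is not the MATLAB code used for the reported experiments, and the problem sizes and function names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 100, 50
t = d // 2
G = rng.standard_normal((t, d))
b = rng.standard_normal(t)
D = rng.standard_normal((q, d))                    # rows are d_1^T, ..., d_q^T

beta = 1.0 / np.linalg.norm(G, 2) ** 2             # cocoercivity constant of C
L = np.linalg.norm(D, 2)                           # Lipschitz constant used for B

def resolvent_A(zx, zu, gamma):
    # J_{gamma A}: projection onto [0,1]^d in x and onto R^q_+ in u
    return np.clip(zx, 0.0, 1.0), np.maximum(zu, 0.0)

def B(x, u):
    return D.T @ u, -D @ x

def C(x, u):
    return G.T @ (G @ x - b), np.zeros(q)

def B_oracle(x, u, i):
    # uniform sampling: B_xi = q * B_i, touching only the i-th row of D
    bx = q * D[i] * u[i]
    bu = np.zeros(q)
    bu[i] = -q * (D[i] @ x)
    return bx, bu

# unbiasedness check: averaging the oracle over all indices recovers B
x, u = rng.random(d), rng.random(q)
bx_avg = np.mean([B_oracle(x, u, i)[0] for i in range(q)], axis=0)
bu_avg = np.mean([B_oracle(x, u, i)[1] for i in range(q)], axis=0)
Bx, Bu = B(x, u)
print(np.linalg.norm(bx_avg - Bx), np.linalg.norm(bu_avg - Bu))
```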

Figure 1: Decay of EkE_{k} with the number of iterations of different pp for Example 3.1 with q=1000,d=500q=1000,d=500.

Figure 1 illustrates the behavior of VRFBHF for different values of p, from which it can be observed that E_{k} oscillates with k and reaches the stopping criterion fastest when p=0.2. Next, we test eight problem sizes and randomly generate 10 instances for each size. The average number of iterations and CPU time over the 10 instances are listed in Table 1. It can be observed from Table 1 that VRFBHF requires remarkably less CPU time and fewer iterations than the FBHF splitting algorithm (2).

Table 1: Computational results of FBHF and VRFBHF with p=0.2

q      d      Iter (VRFBHF)   Iter (FBHF)   CPU time (VRFBHF)   CPU time (FBHF)
1000   500    75.8            3147          1.0531              15.4125
1000   750    92.3            1363          1.1047              8.8984
1000   1000   45.8            1933          0.8344              24.825
1000   2000   16.4            1731          1.5344              77.2984
2000   1000   96.2            1058          2.4609              28.775
2000   1500   73.8            1020          8.8984              55.3953
2000   2000   16.9            2022          14.6687             151.3359
2000   2500   26.6            1268          22.8078             136.9688

References

  • [1] Alacaoglu, A., Malitsky, Y.: Stochastic variance reduction for variational inequality methods. Mach. Learn. 178, 1–39 (2022)
  • [2] Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed. Springer, New York (2017)
  • [3] Banert, S., Rudzusika, J., Öktem, O., Adler, J.: Accelerated forward-backward optimization using deep learning. https://arxiv.org/abs/2105.05210
  • [4] Briceño-Arias, L.M., Davis, D.: Forward-backward-half forward algorithm for solving monotone inclusions. SIAM J. Optim. 28(4), 2839–2871 (2017)
  • [5] Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
  • [6] Combettes, P.L., Pesquet, J.-C.: Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued Var. Anal. 20, 307–330 (2012)
  • [7] Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Valued Var. Anal. 25, 829–858 (2017)
  • [8] Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 315–323 (2013)
  • [9] Kovalev, D., Horvath, S., Richtárik, P.: Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. Mach. Learn. 117, 1–17 (2020)
  • [10] Liu, J.C., Xu, L.L., Shen, S.H., Ling, Q.: An accelerated variance reducing stochastic method with Douglas–Rachford splitting. Mach. Learn. 108, 859–878 (2019)
  • [11] Latafat, P., Patrinos, P.: Asymmetric forward-backward-adjoint splitting for solving monotone inclusions involving three operators. Comput. Optim. Appl. 68, 57–93 (2017)
  • [12] Malitsky, Y., Tam, M.K.: A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30(2), 1451–1472 (2020)
  • [13] Rieger, J., Tam, M.K.: Backward-forward-reflected-backward splitting for three operator monotone inclusions. Appl. Math. Comput. 381 (2020)
  • [14] Ryu, E.K.: Uniqueness of DRS as the 2 operator resolvent-splitting and impossibility of 3 operator resolvent-splitting. Math. Program. 182, 233–273 (2020)
  • [15] Ryu, E.K., Vũ, B.C.: Finding the forward-Douglas–Rachford-forward method. J. Optim. Theory Appl. 184, 858–876 (2020)
  • [16] Rockafellar, R.T.: Monotone operators associated with saddle-functions and minimax problems. In: Browder, F.E. (ed.) Nonlinear Functional Analysis, Part I, Proc. Sympos. Pure Math. 18, 241–250 (1970)
  • [17] Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, 233–257 (1971)
  • [18] Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim. 38(2) (2000)
  • [19] Yu, H., Zong, C., Tang, Y.: An outer reflected forward-backward splitting algorithm for solving monotone inclusions. https://arxiv.org/abs/2009.12493 (2020)
  • [20] Zhang, X., Haskell, W.B., Ye, Z.S.: A unifying framework for variance-reduced algorithms for finding zeroes of monotone operators. J. Mach. Learn. Res. 23(60), 1–44 (2022)
  • [21] Zong, C., Tang, Y., Cho, Y.J.: Convergence analysis of an inexact three-operator splitting algorithm. Symmetry 10(11), 563 (2018)
  • [22] Zong, C., Tang, Y., Zhang, G.: An accelerated forward-backward-half forward splitting algorithm for monotone inclusion with applications to image restoration. Optimization (2022). DOI: 10.1080/02331934.2022.2107926