
Stochastic forward-backward-half forward splitting algorithm with variance reduction

Liqian Qin,   Yaxuan Zhang  and  Qiao-Li Dong
College of Science, Civil Aviation University of China, Tianjin 300300, China,
email: [email protected]; email: [email protected]; Corresponding author, email: [email protected]
Abstract

In this paper, we present a stochastic forward-backward-half forward splitting algorithm with variance reduction for solving the structured monotone inclusion problem composed of a maximally monotone operator, a maximally monotone and Lipschitz continuous operator and a cocoercive operator. By defining a Lyapunov function, we establish the almost sure convergence of the proposed algorithm, and obtain the linear convergence when one of the maximally monotone operators is strongly monotone. Numerical examples are provided to show the performance of the proposed algorithm.

Key words: Variance reduction; Forward-backward-half forward splitting algorithm; Monotone inclusion problem; Almost sure convergence; Strong monotonicity; Linear convergence.

1 Introduction and preliminaries

In this paper, we consider the structured monotone inclusion problem which is to find xdx\in\mathbb{R}^{d} such that

0(A+B+C)(x),\displaystyle\ 0\in(A+B+C)(x), (1)

where A:d2dA:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} is a maximally monotone operator, B:ddB:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a maximally monotone and LBL_{B}-Lipschitz operator, and C:ddC:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is a β\beta-cocoercive operator. Problem (1) arises in various applications such as optimization problems [4, 7], variational inequalities [1], deep learning [3] and image deblurring [22].

Numerous iterative algorithms for solving (1) have been presented and analyzed, see, for instance, [6, 7, 11, 12, 13, 14, 15, 19, 21, 22] and references therein. In particular, Briceño-Arias et al. [4] first proposed a forward-backward-half forward (FBHF) splitting algorithm as follows

\left\{\begin{array}[]{lr}p^{k}=J_{\gamma^{k}A}\left(x^{k}-\gamma^{k}(B+C)x^{k}\right),&\\ x^{k+1}=P_{X}\left(p^{k}+\gamma^{k}(Bx^{k}-Bp^{k})\right),&\\ \end{array}\right. (2)

where \gamma^{k} is the step-size with \gamma^{k}\in[\eta,\chi-\eta], \eta\in(0,\frac{\chi}{2}), \chi=\frac{4\beta}{1+\sqrt{1+16\beta^{2}L_{B}^{2}}}, J_{\gamma^{k}A}=({\rm Id}+\gamma^{k}A)^{-1} is the resolvent of A, and X is a nonempty closed convex subset of \mathbb{R}^{d} containing a solution of problem (1). They obtained the weak convergence of the method (2) in real Hilbert spaces.
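To make the structure of (2) concrete, the following is a minimal Python/NumPy sketch of one FBHF iteration; the names fbhf_step, resolvent_A, proj_X and the toy operators in the usage part are illustrative placeholders under stated assumptions, not part of the original presentation.

```python
import numpy as np

def fbhf_step(x, gamma, resolvent_A, B, C, proj_X):
    """One iteration of the FBHF scheme (2).

    resolvent_A(z, gamma) should return J_{gamma A}(z) = (Id + gamma*A)^{-1}(z),
    B and C are the single-valued operators, and proj_X projects onto X.
    """
    # forward step with B + C, followed by the backward (resolvent) step with A
    p = resolvent_A(x - gamma * (B(x) + C(x)), gamma)
    # half forward correction with B only, then projection onto X
    return proj_X(p + gamma * (B(x) - B(p)))

# toy instance: A = normal cone of the nonnegative orthant (resolvent = projection),
# B a skew-symmetric linear map (monotone, Lipschitz), C the gradient of a quadratic
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 5
    M = rng.standard_normal((d, d))
    S = M - M.T                                   # skew-symmetric, hence monotone
    G = rng.standard_normal((d, d))
    b = rng.standard_normal(d)
    B = lambda z: S @ z
    C = lambda z: G.T @ (G @ z - b)               # beta-cocoercive with beta = 1/||G||^2
    resolvent_A = lambda z, g: np.maximum(z, 0.0)
    proj_X = lambda z: z                          # X = R^d in this toy example
    beta = 1.0 / np.linalg.norm(G, 2) ** 2
    LB = np.linalg.norm(S, 2)
    chi = 4 * beta / (1 + np.sqrt(1 + 16 * beta ** 2 * LB ** 2))
    gamma = 0.9 * chi                             # constant step-size inside (0, chi)
    x = rng.standard_normal(d)
    for _ in range(2000):
        x = fbhf_step(x, gamma, resolvent_A, B, C, proj_X)
```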

In many cases, monotone inclusion problems have a finite sum structure. For example, finite sum minimization is ubiquitous in machine learning where we minimize the empirical risk [10], and nonlinear constrained optimization problems [4]. Finite sum saddle-point problems and finite sum variational inequalities can also be transformed into the monotone inclusion problems [20]. Given the effectiveness of variance-reduced algorithms for finite sum function minimization, a natural idea is to use similar algorithms to solve the more general finite sum monotone inclusion problems.

Now, we detail our problem setting. Suppose that the maximally monotone operator BB in (1) has a finite sum representation B=i=1NBiB=\sum_{i=1}^{N}B_{i}, where each BiB_{i} is LiL_{i}-Lipschitz, BB is LBL_{B}-Lipschitz and it is LL-Lipschitz in mean. Then the problem (1) can be written in the following form

Findxdsuch that 0(A+i=1NBi+C)(x).\mbox{Find}\ x\in\mathbb{R}^{d}\ \ \mbox{such that}\ 0\in(A+\sum_{i=1}^{N}B_{i}+C)(x).

It may be the case that the L_{i} are easy to compute, but L_{B} is not. In this case, \sum_{i=1}^{N}L_{i}\geq L_{B} provides the most natural upper bound on L_{B}. On the other hand, the cost of computing Bx is rather expensive when N is very large.

Throughout this paper, we assume access to a stochastic oracle B_{\xi} that is unbiased, i.e., B(x)=\mathbb{E}[B_{\xi}(x)], and consider using the stochastic oracle B_{\xi} in place of B in the half forward step of (2), which yields a lower cost per iteration. The two simplest stochastic oracles can be defined as follows (a small sampling sketch is given after the list).

  • (i)

    Uniform sampling: Bξ(x)=NBi(x)B_{\xi}(x)=NB_{i}(x), Pξ(i)=Prob{ξ=i}=1NP_{\xi}(i)=\operatorname{Prob}\{\xi=i\}=\frac{1}{N}. In this case, L=Ni=1NLi2L=\sqrt{N\sum_{i=1}^{N}L_{i}^{2}}.

  • (ii)

    Importance sampling: Bξ(x)=1PiBi(x)B_{\xi}(x)=\frac{1}{P_{i}}B_{i}(x), Pξ(i)=Prob{ξ=i}=Lij=1NLjP_{\xi}(i)=\operatorname{Prob}\{\xi=i\}=\frac{L_{i}}{\sum_{j=1}^{N}L_{j}}. In this case, L=i=1NLiL=\sum_{i=1}^{N}L_{i}.
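As a small illustration (not taken from the original text), the two oracles can be implemented as follows for a finite sum of linear maps B_i(x)=M_i x; the matrices M_i and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
d, N = 4, 10
Ms = [rng.standard_normal((d, d)) for _ in range(N)]      # B_i(x) = M_i x
Ls = np.array([np.linalg.norm(M, 2) for M in Ms])          # L_i = ||M_i||

def B_full(x):
    # B(x) = sum_i B_i(x); one evaluation touches all N components
    return sum(M @ x for M in Ms)

def B_uniform(x):
    # uniform sampling: B_xi(x) = N * B_i(x), Prob{xi = i} = 1/N
    i = rng.integers(N)
    return N * (Ms[i] @ x)

def B_importance(x):
    # importance sampling: B_xi(x) = B_i(x)/P_i, P_i = L_i / sum_j L_j
    P = Ls / Ls.sum()
    i = rng.choice(N, p=P)
    return (Ms[i] @ x) / P[i]

# both oracles are unbiased: the empirical mean of many draws approaches B(x)
x = rng.standard_normal(d)
estimate = np.mean([B_uniform(x) for _ in range(20000)], axis=0)
print(np.linalg.norm(estimate - B_full(x)))
```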

Recently, Kovalev et al. [9] proposed a loopless variant of SVRG [8] which removes the outer loop present in SVRG and uses a probabilistic update of the full gradient instead. Later, Alacaoglu et al. [1] proposed a loopless version of the extragradient method with variance reduction for solving variational inequalities. They also applied the same idea to the forward-backward-forward (FBF) splitting algorithm, introduced by Tseng [18], to solve the monotone inclusion problem with two operators,

findxdsuch that 0(A+B)(x),\displaystyle\mbox{find}\ x\in\mathbb{R}^{d}\ \ \mbox{such that}\ 0\in(A+B)(x),

where A:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} and B:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} are maximally monotone operators, and B has a stochastic oracle B_{\xi} that is unbiased and \bar{L}-Lipschitz in mean. They proved the almost sure convergence of the forward-backward-forward splitting algorithm with variance reduction when B_{\xi} is continuous for every \xi. However, if one extends the forward-backward-forward splitting algorithm with variance reduction to solve problem (1), the cocoercive operator C is required to admit a finite-sum structure as well.

In this paper, we propose a stochastic forward-backward-half forward splitting algorithm with variance reduction (shortly, VRFBHF). Under some mild assumptions, we establish the almost sure convergence of the sequence {xk}k\{x^{k}\}_{k\in\mathbb{N}} generated by our algorithm. Lyapunov analysis of the proposed algorithm is based on the monotonicity inequalities of AA and BB, and the cocoercivity inequality of CC. Furthermore, we obtain the linear convergence when AA or BB is strongly monotone. Numerical experiments are conducted to demonstrate the efficacy of the proposed algorithm.

In the following, we recall some definitions and known results that will be useful in the subsequent analysis.

Throughout this paper, \mathbb{R}^{d} is the d-dimensional Euclidean space with inner product \langle\cdot,\cdot\rangle and induced norm \|\cdot\|. The set of nonnegative integers is denoted by \mathbb{N}. The probability mass function P_{\xi}(\cdot) is supported on \{1,\ldots,N\}.

Definition 1.1.

([2, Definition 20.1 and Definition 20.20]) A set-valued mapping A:d2dA:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} is characterized by its graph gra(A)={(x,u)d×d:uAx}.\operatorname*{gra}(A)=\{(x,u)\in\mathbb{R}^{d}\times\mathbb{R}^{d}:u\in Ax\}. A set-valued mapping A:d2dA:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} is said to be
(i) monotone if \langle u-v,x-y\rangle\geq 0 for all (x,u),(y,v)\in\operatorname{gra}(A);
(ii) maximally monotone if there exists no monotone operator B:\mathbb{R}^{d}\rightarrow 2^{\mathbb{R}^{d}} such that \operatorname{gra}(B) properly contains \operatorname{gra}(A), i.e., for every (x,u)\in\mathbb{R}^{d}\times\mathbb{R}^{d},

(x,u)gra(A)uv,xy0,(y,v)gra(A).(x,u)\in\operatorname*{gra}(A)\ \ \Leftrightarrow\ \ \langle u-v,x-y\rangle\geq 0,\ \ \forall(y,v)\in\operatorname*{gra}(A).
Definition 1.2.

An operator T:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is said to be
(i) L-Lipschitz continuous, if there exists a constant L>0 such that

TxTyLxy,x,yd;\|Tx-Ty\|\leq L\|x-y\|,\quad\forall x,y\in\mathbb{R}^{d};

(ii) β{\beta}-cocoercive, if there exists a constant β>0\beta>0, such that

TxTy,xyβTxTy2,x,yd.\langle Tx-Ty,x-y\rangle\geq{\beta}\|Tx-Ty\|^{2},\quad\forall x,y\in\mathbb{R}^{d}.

By the Cauchy–Schwarz inequality, a β{\beta}-cocoercive operator is 1β\frac{1}{\beta}-Lipschitz continuous.
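In detail, combining the cocoercivity inequality with the Cauchy–Schwarz inequality gives

\beta\|Tx-Ty\|^{2}\leq\langle Tx-Ty,x-y\rangle\leq\|Tx-Ty\|\,\|x-y\|,\quad\forall x,y\in\mathbb{R}^{d},

and dividing by \|Tx-Ty\| (when it is nonzero) yields \|Tx-Ty\|\leq\frac{1}{\beta}\|x-y\|.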

Lemma 1.3.

([2, Proposition 20.38]) Let A:𝒵2𝒵A:\mathcal{Z}\rightarrow 2^{\mathcal{Z}} be maximally monotone, where 𝒵\mathcal{Z} is a finite dimensional Euclidean space. Then gra(A)\operatorname*{gra}(A) is closed in 𝒵strong×𝒵strong{\mathcal{Z}}^{\rm strong}\times{\mathcal{Z}}^{\rm strong}, i.e., for every sequence (xk,uk)k(x^{k},u^{k})_{k\in\mathbb{N}} in gra(A)\operatorname*{gra}(A) and (x,u)𝒵×𝒵(x,u)\in\mathcal{Z}\times\mathcal{Z}, if xkxx^{k}\rightarrow x and ukuu^{k}\rightarrow u, then (x,u)gra(A)(x,u)\in\operatorname*{gra}(A).

Lemma 1.4.

([17, Theorem 1]) Let (Ω,,P)(\Omega,\mathcal{F},P) be a probability space and 12\mathcal{F}_{1}\subset\mathcal{F}_{2}\subset... be a sequence of sub-σ\sigma-algebras of \mathcal{F}. For each k=1,2,,k=1,2,..., let zkz^{k}, βk\beta^{k}, ξk\xi^{k}, and ζk\zeta^{k} be non-negative k\mathcal{F}_{k}-measurable random variables such that

𝔼(zk+1|k)(1+βk)zk+ξkζk.\mathbb{E}(z^{k+1}|\mathcal{F}_{k})\leq(1+\beta^{k})z^{k}+\xi^{k}-\zeta^{k}.

Then \lim_{k\rightarrow\infty}z^{k} exists and \sum_{k=1}^{\infty}\zeta^{k}<\infty almost surely on the event \{\sum_{k=1}^{\infty}\beta^{k}<\infty\ \mbox{and}\ \sum_{k=1}^{\infty}\xi^{k}<\infty\}.

The paper is organized as follows. In Section 2, we introduce the stochastic forward-backward-half forward splitting algorithm with variance reduction to solve the problem (1), and show the almost sure and linear convergence of the proposed algorithm. Finally, we present the numerical experiments in Section 3.

2 Main Results

In the sequel, we assume that the following conditions are satisfied:

Assumption 2.1.
  • (i)

The operator A:\mathbb{R}^{d}\to 2^{\mathbb{R}^{d}} is maximally monotone;

  • (ii)

    The operator B:ddB:\mathbb{R}^{d}\to\mathbb{R}^{d} is single-valued, monotone and LBL_{B}-Lipschitz;

  • (iii)

    The operator BB has a stochastic oracle BξB_{\xi} that is unbiased, B(x)=𝔼[Bξ(x)]B(x)=\mathbb{E}[B_{\xi}(x)], and LL-Lipschitz in mean:

    𝔼[Bξ(u)Bξ(v)2]L2uv2,u,vd;\mathbb{E}[\|B_{\xi}(u)-B_{\xi}(v)\|^{2}]\leq L^{2}\|u-v\|^{2},\quad\forall u,v\in\mathbb{R}^{d}; (3)
  • (iv)

The operator C:\mathbb{R}^{d}\to\mathbb{R}^{d} is \beta-cocoercive;

  • (v)

    The solution set of the problem (1), denoted by zer(A+B+C){\rm zer}(A+B+C), is nonempty.

We now present the stochastic forward-backward-half forward splitting algorithm with variance reduction to solve the problem (1).

Algorithm 2.2.

  VRFBHF   

1. Input: probability p\in(0,1], probability distribution Q, step-size \gamma>0, and \lambda\in(0,1).

Let x0=w0x^{0}=w^{0}.

2. for k=0,1,k=0,1,\ldots do

3. x¯k=λxk+(1λ)wk\bar{x}^{k}=\lambda x^{k}+(1-\lambda)w^{k}

4. yk=JγA(x¯kγ(B+C)wk)y^{k}=J_{\gamma A}\left(\bar{x}^{k}-\gamma(B+C)w^{k}\right)

5. Draw an index ξk\xi_{k} according to QQ

6. xk+1=yk+γ(BξkwkBξkyk)x^{k+1}=y^{k}+\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k})

7. wk+1={xk+1,with probabilitypwk,with probability  1pw^{k+1}=\begin{cases}x^{k+1},&\hbox{with probability}\,\,p\\ w^{k},&\hbox{with probability}\,\,1-p\\ \end{cases}

8. end for
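A minimal Python/NumPy sketch of Algorithm 2.2 is given below, assuming the user supplies the resolvent of A, the full operators B and C, and a sampled oracle; the function vrfbhf and its argument names are illustrative only, not the authors' implementation.

```python
import numpy as np

def vrfbhf(x0, gamma, lam, p, resolvent_A, B, C, B_oracle, n_iters, seed=0):
    """Sketch of Algorithm 2.2 (VRFBHF).

    resolvent_A(z, gamma) returns J_{gamma A}(z); B and C are the full operators;
    B_oracle(u, v, rng) returns the pair (B_xi(u), B_xi(v)) for a single index xi
    sampled from the distribution Q.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    w = x.copy()                                            # reference point, w^0 = x^0
    Bw, Cw = B(w), C(w)                                     # full evaluations cached at w^k
    for _ in range(n_iters):
        x_bar = lam * x + (1 - lam) * w                     # step 3
        y = resolvent_A(x_bar - gamma * (Bw + Cw), gamma)   # step 4
        Bxi_w, Bxi_y = B_oracle(w, y, rng)                  # steps 5-6: one sampled index
        x = y + gamma * (Bxi_w - Bxi_y)
        if rng.random() < p:                                # step 7: refresh w with prob. p
            w = x.copy()
            Bw, Cw = B(w), C(w)
    return x
```

Note that in such an implementation the full operators B and C only need to be re-evaluated when the reference point w^k is refreshed, which happens with probability p, so keeping p small keeps the expected per-iteration cost close to that of a single component evaluation.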

Remark 2.3.

Algorithm 2.2 is quite general and, to the best of our knowledge, new to the literature. We now review how it relates to previous work. Algorithm 2.2 becomes the forward-backward-forward algorithm with variance reduction in [1] if C=0. It reduces to loopless SVRG in [9] if \lambda=1, B=\nabla f, A=0 and C=0, where f(x)=\sum_{i=1}^{N}f_{i}(x) and f_{i}(x) is the loss of model x on data point i.

Remark 2.4.

We have two sources of randomness at each iteration: the index \xi_{k}, which is used to update x^{k+1}, and the reference point w^{k}, which is replaced by the iterate x^{k+1} with probability p or left unchanged with probability 1-p. Intuitively, we wish to keep p small to lower the cost per iteration. Moreover, unlike the FBHF splitting algorithm (2), we use the parameter \lambda to add the averaging step \bar{x}^{k}=\lambda x^{k}+(1-\lambda)w^{k}, which assigns some weight to past iteration points.

2.1 The almost sure convergence

In this subsection, we establish the almost sure convergence of Algorithm 2.2. We use the following notations for conditional expectations: 𝔼k[]=𝔼[|σ(ξ0,,ξk1,wk)]\mathbb{E}_{k}[\cdot]=\mathbb{E}[\cdot|\sigma(\xi_{0},...,\xi_{k-1},w^{k})] and 𝔼k+12[]=𝔼[|σ(ξ0,,ξk,wk)]\mathbb{E}_{k+\frac{1}{2}}[\cdot]=\mathbb{E}[\cdot|\sigma(\xi_{0},...,\xi_{k},w^{k})].

For the iterates {xk}k\{x^{k}\}_{k\in\mathbb{N}} and {wk}k\{w^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2, we define the Lyapunov function

Φk(x):=λxkx2+1λpwkx2,xd,\Phi_{k}(x):=\lambda\|x^{k}-x\|^{2}+\frac{1-\lambda}{p}\|w^{k}-x\|^{2},\ \forall x\in\mathbb{R}^{d},

which helps to establish the almost sure convergence of the proposed algorithm.

Theorem 2.1.

Let Assumption 2.1 hold, \lambda\in(0,1), p\in(0,1], and \gamma\in(0,\frac{4\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}). Then for \{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2 and any x^{*}\in\operatorname{zer}(A+B+C), it holds that

𝔼k[Φk+1(x)]Φk(x).\mathbb{E}_{k}[\Phi_{k+1}(x^{*})]\leq\Phi_{k}(x^{*}). (4)

Moreover, the sequence \{x^{k}\}_{k\in\mathbb{N}} converges almost surely to some point in \operatorname{zer}(A+B+C).

Proof.

Since xzer(A+B+C),x^{*}\in\operatorname*{zer}(A+B+C), we have

γ(B+C)xγAx.-\gamma(B+C)x^{*}\in\gamma Ax^{*}. (5)

Step 4 in Algorithm 2.2 is equivalent to the inclusion

x¯kykγ(B+C)wkγAyk.\bar{x}^{k}-y^{k}-\gamma(B+C)w^{k}\in\gamma Ay^{k}. (6)

Combining (5), (6) and the monotonicity of AA, we have

ykx¯k+γ(B+C)wk,xykγ(B+C)x,xyk0.\langle y^{k}-\bar{x}^{k}+\gamma(B+C)w^{k},x^{*}-y^{k}\rangle-\gamma\langle(B+C)x^{*},x^{*}-y^{k}\rangle\geq 0.

Then from step 6 in Algorithm 2.2, we obtain

xk+1x¯k+γ(BwkBξkwk+Bξkyk)+γCwk,xykγ(B+C)x,xyk0.\langle x^{k+1}-\bar{x}^{k}+\gamma(Bw^{k}-B_{\xi_{k}}w^{k}+B_{\xi_{k}}y^{k})+\gamma Cw^{k},x^{*}-y^{k}\rangle-\gamma\langle(B+C)x^{*},x^{*}-y^{k}\rangle\geq 0. (7)

By the definition of x¯k\bar{x}^{k} and identities 2a,b=a+b2a2b2=a2+b2ab22\langle a,b\rangle=\|a+b\|^{2}-\|a\|^{2}-\|b\|^{2}=\|a\|^{2}+\|b\|^{2}-\|a-b\|^{2}, we have

2xk+1x¯k,xyk\displaystyle 2\langle x^{k+1}-\bar{x}^{k},x^{*}-y^{k}\rangle (8)
=\displaystyle= 2xk+1yk,xyk+2ykx¯k,xyk\displaystyle 2\langle x^{k+1}-y^{k},x^{*}-y^{k}\rangle+2\langle y^{k}-\bar{x}^{k},x^{*}-y^{k}\rangle
=\displaystyle= xk+1yk2+xyk2xk+1x2+2λykxk,xyk\displaystyle\|x^{k+1}-y^{k}\|^{2}+\|x^{*}-y^{k}\|^{2}-\|x^{k+1}-x^{*}\|^{2}+2\lambda\langle y^{k}-x^{k},x^{*}-y^{k}\rangle
+2(1λ)ykwk,xyk\displaystyle+2(1-\lambda)\langle y^{k}-w^{k},x^{*}-y^{k}\rangle
=\displaystyle= xk+1yk2+xyk2xk+1x2\displaystyle\|x^{k+1}-y^{k}\|^{2}+\|x^{*}-y^{k}\|^{2}-\|x^{k+1}-x^{*}\|^{2}
+λ(xkx2ykxk2ykx2)\displaystyle+\lambda(\|x^{k}-x^{*}\|^{2}-\|y^{k}-x^{k}\|^{2}-\|y^{k}-x^{*}\|^{2})
+(1λ)(wkx2ykwk2ykx2)\displaystyle+(1-\lambda)(\|w^{k}-x^{*}\|^{2}-\|y^{k}-w^{k}\|^{2}-\|y^{k}-x^{*}\|^{2})
=\displaystyle= xk+1yk2xk+1x2+λxkx2λykxk2\displaystyle\|x^{k+1}-y^{k}\|^{2}-\|x^{k+1}-x^{*}\|^{2}+\lambda\|x^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2}
+(1λ)wkx2(1λ)ykwk2.\displaystyle+(1-\lambda)\|w^{k}-x^{*}\|^{2}-(1-\lambda)\|y^{k}-w^{k}\|^{2}.

By the β\beta-cocoercivity of CC and Young’s inequality a,bβa2+14βb2\langle a,b\rangle\leq\beta\|a\|^{2}+\frac{1}{4\beta}\|b\|^{2} for all a,bda,b\in\mathbb{R}^{d}, we get

2γCwkCx,xyk\displaystyle 2\gamma\langle Cw^{k}-Cx^{*},x^{*}-y^{k}\rangle (9)
=\displaystyle= 2γCwkCx,xwk+2γCwkCx,wkyk\displaystyle 2\gamma\langle Cw^{k}-Cx^{*},x^{*}-w^{k}\rangle+2\gamma\langle Cw^{k}-Cx^{*},w^{k}-y^{k}\rangle
\displaystyle\leq 2γβCwkCx2+2γβCwkCx2+γ2βwkyk2\displaystyle-2\gamma\beta\|Cw^{k}-Cx^{*}\|^{2}+2\gamma\beta\|Cw^{k}-Cx^{*}\|^{2}+\frac{\gamma}{2\beta}\|w^{k}-y^{k}\|^{2}
=\displaystyle= γ2βwkyk2.\displaystyle\frac{\gamma}{2\beta}\|w^{k}-y^{k}\|^{2}.

We use (8) and (9) in (7) to obtain

2γBx(BwkBξkwk+Bξkyk),xyk+xk+1x2\displaystyle\ \ \ \ 2\gamma\langle Bx^{*}-(Bw^{k}-B_{\xi_{k}}w^{k}+B_{\xi_{k}}y^{k}),x^{*}-y^{k}\rangle+\|x^{k+1}-x^{*}\|^{2} (10)
λxkx2+(1λ)wkx2+xk+1yk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+(1-\lambda)\|w^{k}-x^{*}\|^{2}+\|x^{k+1}-y^{k}\|^{2}
λykxk2(1λγ2β)ykwk2.\displaystyle\ \ \ \ -\lambda\|y^{k}-x^{k}\|^{2}-(1-\lambda-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

Taking the conditional expectation \mathbb{E}_{k} in (10) and using the identity (which follows from the unbiasedness of B_{\xi} and the fact that w^{k} and y^{k} are measurable with respect to the conditioning \sigma-algebra)

𝔼k[BwkBξkwk+Bξkyk,xyk]=Byk,xyk,\mathbb{E}_{k}[\langle Bw^{k}-B_{\xi_{k}}w^{k}+B_{\xi_{k}}y^{k},x^{*}-y^{k}\rangle]=\langle By^{k},x^{*}-y^{k}\rangle,

we obtain

2γBxByk,xyk+𝔼kxk+1x2\displaystyle\ \ \ \ 2\gamma\langle Bx^{*}-By^{k},x^{*}-y^{k}\rangle+\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}
λxkx2+(1λ)wkx2+𝔼kxk+1yk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+(1-\lambda)\|w^{k}-x^{*}\|^{2}+\mathbb{E}_{k}\|x^{k+1}-y^{k}\|^{2}
λykxk2(1λγ2β)ykwk2.\displaystyle\ \ \ \ -\lambda\|y^{k}-x^{k}\|^{2}-(1-\lambda-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

By the monotonicity of B,B, we have

BxByk,xyk0.\langle Bx^{*}-By^{k},x^{*}-y^{k}\rangle\geq 0. (11)

Combining the definition of x^{k+1}, i.e., x^{k+1}-y^{k}=\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k}), with (3), we have

𝔼kxk+1yk2γ2L2ykwk2.\mathbb{E}_{k}\|x^{k+1}-y^{k}\|^{2}\leq\gamma^{2}L^{2}\|y^{k}-w^{k}\|^{2}.

Therefore,

𝔼kxk+1x2\displaystyle\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2} λxkx2+(1λ)wkx2λykxk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+(1-\lambda)\|w^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2} (12)
(1λγ2L2γ2β)ykwk2.\displaystyle\ \ \ \ -(1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

On the other hand, the definition of wk+1w^{k+1} and 𝔼k+12\mathbb{E}_{k+\frac{1}{2}} yield that

1λp𝔼k+12[wk+1x2]=(1λ)xk+1x2+(1λ)1ppwkx2.\frac{1-\lambda}{p}\mathbb{E}_{k+\frac{1}{2}}[\|w^{k+1}-x^{*}\|^{2}]=(1-\lambda)\|x^{k+1}-x^{*}\|^{2}+(1-\lambda)\frac{1-p}{p}\|w^{k}-x^{*}\|^{2}. (13)

Then, applying the tower property \mathbb{E}_{k}[\mathbb{E}_{k+\frac{1}{2}}[\cdot]]=\mathbb{E}_{k}[\cdot] to (13), we have

1λp𝔼k[wk+1x2]=(1λ)𝔼kxk+1x2+(1λ)1ppwkx2.\frac{1-\lambda}{p}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]=(1-\lambda)\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+(1-\lambda)\frac{1-p}{p}\|w^{k}-x^{*}\|^{2}. (14)

We add (14) to (12) to obtain

λ𝔼kxk+1x2+1λp𝔼kwk+1x2\displaystyle\ \ \ \ \lambda\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+\frac{1-\lambda}{p}\mathbb{E}_{k}\|w^{k+1}-x^{*}\|^{2} (15)
λxkx2+1λpwkx2λykxk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+\frac{1-\lambda}{p}\|w^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2}
(1λγ2L2γ2β)ykwk2.\displaystyle\ \ \ \ -(1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

Thus, inequality (4) holds, since \gamma\in(0,\frac{4\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}) and 0<\lambda<1 guarantee that 1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta}>0.
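To see why this choice of \gamma suffices, note that 1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta}>0 exactly when \gamma lies below the positive root of the quadratic L^{2}\gamma^{2}+\frac{\gamma}{2\beta}-(1-\lambda), and a short computation identifies this root with the stated upper bound:

\gamma_{+}=\frac{-\frac{1}{2\beta}+\sqrt{\frac{1}{4\beta^{2}}+4L^{2}(1-\lambda)}}{2L^{2}}=\frac{-1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}{4\beta L^{2}}=\frac{4\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}.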

Next, we show the almost sure convergence of the sequence \{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2. By Lemma 1.4, the sequence \{\Phi_{k}(x^{*})\}_{k\in\mathbb{N}} converges almost surely, and \{\|y^{k}-x^{k}\|^{2}\}_{k\in\mathbb{N}} and \{\|y^{k}-w^{k}\|^{2}\}_{k\in\mathbb{N}} converge to 0 almost surely. Then, applying Proposition 2.3 in [5], we can construct a set \Xi with \mathbb{P}(\Xi)=1 such that for all \theta\in\Xi and all x^{*}\in\operatorname{zer}(A+B+C),

{λxk(θ)x2+1λpwk(θ)x2}kconverges,\{\lambda\|x^{k}(\theta)-x^{*}\|^{2}+\frac{1-\lambda}{p}\|w^{k}(\theta)-x^{*}\|^{2}\}_{k\in\mathbb{N}}\ \hbox{converges,} (16)

which implies that the sequence {xk(θ)}k\{x^{k}(\theta)\}_{k\in\mathbb{N}} is bounded. Let Ξ\Xi^{{}^{\prime}} be the probability 1 set such that θΞ\forall\theta\in\Xi^{{}^{\prime}}, yk(θ)xk(θ)0y^{k}(\theta)-x^{k}(\theta)\rightarrow 0, yk(θ)wk(θ)0y^{k}(\theta)-w^{k}(\theta)\rightarrow 0, which implies yk(θ)x¯k(θ)0y^{k}(\theta)-\bar{x}^{k}(\theta)\rightarrow 0. Pick θΞΞ\theta\in\Xi\bigcap\Xi^{{}^{\prime}} and let {xkj(θ)}j\{x^{k_{j}}(\theta)\}_{j\in\mathbb{N}} be the convergent subsequence of the bounded sequence {xk(θ)}k\{x^{k}(\theta)\}_{k\in\mathbb{N}}, say without loss of generality that xkj(θ)x¯(θ)x^{k_{j}}(\theta)\rightarrow\bar{x}(\theta) as jj\rightarrow\infty. From ykj(θ)xkj(θ)0y^{k_{j}}(\theta)-x^{k_{j}}(\theta)\rightarrow 0, it follows that ykj(θ)x¯(θ)y^{k_{j}}(\theta)\rightarrow\bar{x}(\theta) as jj\rightarrow\infty. Then according to (6), we can get

x¯kj(θ)ykj(θ)γ((B+C)wkj(θ)(B+C)ykj(θ))γ(A+B+C)ykj(θ).\bar{x}^{k_{j}}(\theta)-y^{k_{j}}(\theta)-\gamma((B+C)w^{k_{j}}(\theta)-(B+C)y^{k_{j}}(\theta))\in\gamma(A+B+C)y^{k_{j}}(\theta).

We know that B+CB+C is (LB+1β)(L_{B}+\frac{1}{\beta})-Lipschitz. Therefore,

x¯kj(θ)ykj(θ)γ((B+C)wkj(θ)(B+C)ykj(θ))0,j.\bar{x}^{k_{j}}(\theta)-y^{k_{j}}(\theta)-\gamma((B+C)w^{k_{j}}(\theta)-(B+C)y^{k_{j}}(\theta))\rightarrow 0,\ j\rightarrow\infty.

Furthermore, since the operator B has full domain, A+B is maximally monotone. Since C is cocoercive with full domain, A+B+C is also maximally monotone. By Lemma 1.3, (\bar{x}(\theta),0)\in\operatorname{gra}(A+B+C), i.e., \bar{x}(\theta)\in\operatorname{zer}(A+B+C).

Hence, all cluster points of \{x^{k}(\theta)\}_{k\in\mathbb{N}} and \{w^{k}(\theta)\}_{k\in\mathbb{N}} belong to \operatorname{zer}(A+B+C). In particular, at least one subsequence of \{\lambda\|x^{k}(\theta)-\bar{x}(\theta)\|^{2}+\frac{1-\lambda}{p}\|w^{k}(\theta)-\bar{x}(\theta)\|^{2}\}_{k\in\mathbb{N}} converges to 0. Combining this with (16) applied to x^{*}=\bar{x}(\theta), we deduce that \lambda\|x^{k}(\theta)-\bar{x}(\theta)\|^{2}+\frac{1-\lambda}{p}\|w^{k}(\theta)-\bar{x}(\theta)\|^{2}\rightarrow 0 and consequently \|x^{k}(\theta)-\bar{x}(\theta)\|^{2}\rightarrow 0. This shows that \{x^{k}\}_{k\in\mathbb{N}} converges almost surely to a point in \operatorname{zer}(A+B+C). ∎

2.2 Linear convergence

In this subsection, we show the linear convergence of Algorithm 2.2 for solving the structured monotone inclusion problem (1) when BB is μ\mu-strongly monotone. Indeed, assuming that the operator AA is strongly monotone also leads to a linear convergence result, and the proof procedure is similar.

Theorem 2.2.

Let Assumption 2.1 hold, BB be μ\mu-strongly monotone. If we set λ=1p\lambda=1-p, and γ=min{p2L,βp}\gamma=\min\{\frac{\sqrt{p}}{2L},\beta p\} in Algorithm 2.2, then for the sequence {xk}k\{x^{k}\}_{k\in\mathbb{N}} generated by Algorithm 2.2 and any xzer(A+B+C)x^{*}\in\operatorname*{zer}(A+B+C), it holds that

𝔼xkx2(11+c/4)k21px0x2,\mathbb{E}\|x^{k}-x^{*}\|^{2}\leq(\frac{1}{1+c/4})^{k}\frac{2}{1-p}\|x^{0}-x^{*}\|^{2}, (17)

with c=min{γμ,p(1+p)(4+p)}c=\min\{\gamma\mu,\frac{p}{(1+\sqrt{p})(4+p)}\}.

Proof.

If BB is μ\mu-strongly monotone, then (11) becomes

BxByk,xykμxyk2.\langle Bx^{*}-By^{k},x^{*}-y^{k}\rangle\geq\mu\|x^{*}-y^{k}\|^{2}.

We continue as in the proof of Theorem 2.1 to obtain, instead of (15),

2γμykx2+λ𝔼kxk+1x2+1λp𝔼kwk+1x2\displaystyle\ \ \ \ 2\gamma\mu\|y^{k}-x^{*}\|^{2}+\lambda\mathbb{E}_{k}\|x^{k+1}-x^{*}\|^{2}+\frac{1-\lambda}{p}\mathbb{E}_{k}\|w^{k+1}-x^{*}\|^{2} (18)
λxkx2+1λpwkx2λykxk2\displaystyle\leq\lambda\|x^{k}-x^{*}\|^{2}+\frac{1-\lambda}{p}\|w^{k}-x^{*}\|^{2}-\lambda\|y^{k}-x^{k}\|^{2}
(1λγ2L2γ2β)ykwk2.\displaystyle\ \ \ \ -(1-\lambda-\gamma^{2}L^{2}-\frac{\gamma}{2\beta})\|y^{k}-w^{k}\|^{2}.

By a+b22a2+2b2\|a+b\|^{2}\leq 2\|a\|^{2}+2\|b\|^{2}, the step 6 and (3), we have

2γμykx2\displaystyle 2\gamma\mu\|y^{k}-x^{*}\|^{2} γμ𝔼k[xk+1x2]2γμ𝔼k[γ(BξkwkBξkyk)2]\displaystyle\geq\gamma\mu\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]-2\gamma\mu\mathbb{E}_{k}[\|\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k})\|^{2}] (19)
γμ𝔼k[xk+1x2]2γ3L2μykwk2.\displaystyle\geq\gamma\mu\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]-2\gamma^{3}L^{2}\mu\|y^{k}-w^{k}\|^{2}.

Combining (18), (19) and λ=1p\lambda=1-p, we get

(1p+γμ)𝔼k[xk+1x2]+𝔼k[wk+1x2]\displaystyle\ \ \ \ (1-p+\gamma\mu)\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]+\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}] (20)
(1p)xkx2+wkx2(1p)ykxk2\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}-(1-p)\|y^{k}-x^{k}\|^{2}
(pγ2L2γ2β2γ3L2μ)ykwk2\displaystyle\ \ \ \ -(p-\gamma^{2}L^{2}-\frac{\gamma}{2\beta}-2\gamma^{3}L^{2}\mu)\|y^{k}-w^{k}\|^{2}
(1p)xkx2+wkx2(1p)ykxk2\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}-(1-p)\|y^{k}-x^{k}\|^{2}
p(1p)4ykwk2,\displaystyle\ \ \ \ -\frac{p(1-\sqrt{p})}{4}\|y^{k}-w^{k}\|^{2},

where the last inequality is obtained by γ=min{p2L,βp}\gamma=\min\{\frac{\sqrt{p}}{2L},\beta p\} and μL\mu\leq L. Similar to (19), we have

c2𝔼k[xk+1x2]\displaystyle\frac{c}{2}\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}] c4𝔼k[wk+1x2]c2𝔼k[𝔼k+12xk+1wk+12]\displaystyle\geq\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c}{2}\mathbb{E}_{k}[\mathbb{E}_{k+\frac{1}{2}}\|x^{k+1}-w^{k+1}\|^{2}] (21)
=c4𝔼k[wk+1x2]c(1p)2𝔼k[xk+1wk2]\displaystyle=\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c(1-p)}{2}\mathbb{E}_{k}[\|x^{k+1}-w^{k}\|^{2}]
=c4𝔼k[wk+1x2]c(1p)2𝔼k[ykwk+γ(BξkwkBξkyk)2]\displaystyle=\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c(1-p)}{2}\mathbb{E}_{k}[\|y^{k}-w^{k}+\gamma(B_{\xi_{k}}w^{k}-B_{\xi_{k}}y^{k})\|^{2}]
c4𝔼k[wk+1x2]c(1p)(1+γ2L2)ykwk2\displaystyle\geq\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-c(1-p)(1+\gamma^{2}L^{2})\|y^{k}-w^{k}\|^{2}
c4𝔼k[wk+1x2]c(1p)(4+p)4ykwk2.\displaystyle\geq\frac{c}{4}\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}]-\frac{c(1-p)(4+p)}{4}\|y^{k}-w^{k}\|^{2}.

Putting (21) into (20) and recalling that cγμc\leq\gamma\mu, we have

(1p+c2)𝔼k[xk+1x2]+(1+c4)𝔼k[wk+1x2]\displaystyle\ \ \ \ (1-p+\frac{c}{2})\mathbb{E}_{k}[\|x^{k+1}-x^{*}\|^{2}]+(1+\frac{c}{4})\mathbb{E}_{k}[\|w^{k+1}-x^{*}\|^{2}] (22)
(1p)xkx2+wkx2(1p)ykxk2\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}-(1-p)\|y^{k}-x^{k}\|^{2}
[p(1p)4c(1p)(4+p)4]ykwk2\displaystyle\ \ \ \ -\left[\frac{p(1-\sqrt{p})}{4}-\frac{c(1-p)(4+p)}{4}\right]\|y^{k}-w^{k}\|^{2}
(1p)xkx2+wkx2,\displaystyle\leq(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2},

where the last inequality comes from cp(1+p)(4+p).c\leq\frac{p}{(1+\sqrt{p})(4+p)}. Then, using 1p+c2(1p)(1+c4)1-p+\frac{c}{2}\geq(1-p)(1+\frac{c}{4}) and taking the full expectation on (22), we have

(1+c4)𝔼[(1p)xk+1x2+wk+1x2]𝔼[(1p)xkx2+wkx2].(1+\frac{c}{4})\mathbb{E}[(1-p)\|x^{k+1}-x^{*}\|^{2}+\|w^{k+1}-x^{*}\|^{2}]\leq\mathbb{E}[(1-p)\|x^{k}-x^{*}\|^{2}+\|w^{k}-x^{*}\|^{2}].

Iterating this inequality and using w^{0}=x^{0}, we obtain

(1p)𝔼xkx2(11+c/4)k(2p)x0x2,(1-p)\mathbb{E}\|x^{k}-x^{*}\|^{2}\leq(\frac{1}{1+c/4})^{k}(2-p)\|x^{0}-x^{*}\|^{2},

which, together with 2-p\leq 2, shows (17). ∎

3 Numerical Simulations

In this section, we compare Algorithm 2.2 (VRFBHF) with the FBHF splitting algorithm (2). All codes were written in MATLAB R2018b and run on a desktop PC with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz and 8.00 GB of RAM. Consider the nonlinear constrained optimization problem of the form

minxCf(x)+h(x),\min_{x\in C}f(x)+h(x), (23)

where C=\{x\in\mathbb{R}^{d}\,|\,(\forall i\in\{1,\ldots,q\})\ g_{i}(x)\leq 0\}, f:\mathbb{R}^{d}\rightarrow(-\infty,+\infty] is a proper, convex and lower semi-continuous function, for every i\in\{1,\ldots,q\}, g_{i}:\operatorname{dom}(g_{i})\subset\mathbb{R}^{d}\rightarrow\mathbb{R} and h:\mathbb{R}^{d}\rightarrow\mathbb{R} are C^{1} convex functions on \operatorname{int}\operatorname{dom}g_{i} and \mathbb{R}^{d}, respectively, and \nabla h is \frac{1}{\beta}-Lipschitz continuous. The solution to the optimization problem (23) can be found via the saddle points of the Lagrangian

L(x,u)=f(x)+h(x)+ug(x)ι+q(u),L(x,u)=f(x)+h(x)+u^{\top}g(x)-\iota_{\mathbb{R}_{+}^{q}}(u),

where \iota_{\mathbb{R}_{+}^{q}} is the indicator function of \mathbb{R}_{+}^{q}. Under some standard qualification conditions, the solution to the optimization problem (23) can be found by solving the monotone inclusion [4, 16]: find x\in Y such that there exists u\in\mathbb{R}_{+}^{q} with

(0,0)(A+B+C)(x,u),(0,0)\in(A+B+C)(x,u), (24)

where YdY\subset\mathbb{R}^{d} is a nonempty closed convex set modeling the prior information of the solution, A:(x,u)f(x)×N+quA:(x,u)\mapsto\partial f(x)\times{N_{\mathbb{R}_{+}^{q}}u} is maximally monotone, C:(x,u)(h(x),0)C:(x,u)\mapsto(\nabla h(x),0) is β\beta-cocoercive, and B:(x,u)(i=1quigi(x),g1(x),,gq(x))B:(x,u)\mapsto(\sum_{i=1}^{q}u_{i}\nabla g_{i}(x),-g_{1}(x),...,-g_{q}(x)) is nonlinear, monotone and continuous. In the light of the structure of BB, we can rewrite BB as B=i=1qBiB=\sum_{i=1}^{q}B_{i} where Bi:(x,u)(uigi(x),0,,gi(x),,0)B_{i}:(x,u)\mapsto(u_{i}\nabla g_{i}(x),0,...,-g_{i}(x),...,0) for every i{1,,q}i\in\{1,...,q\}. In the numerical results listed in the following table, “Iter” denotes the number of iterations.

Example 3.1.

Let f=\iota_{[0,1]^{d}}, g_{i}(x)=d_{i}^{\top}x for all i\in\{1,\ldots,q\}, with d_{1},\ldots,d_{q}\in\mathbb{R}^{d}, and h=\frac{1}{2}\|Gx-b\|^{2} with G a t\times d real matrix, d=2t, and b\in\mathbb{R}^{t}. Then the operators in (24) become

A:(x,u)ι[0,1]d(x)×N+qu,\displaystyle A:(x,u)\mapsto\partial{\iota_{[0,1]^{d}}(x)}\times{N_{\mathbb{R}_{+}^{q}}u}, (25)
B:(x,u)(Du,Dx),\displaystyle B:(x,u)\mapsto(D^{\top}u,-Dx),
C:(x,u)(G(Gxb),0),\displaystyle C:(x,u)\mapsto(G^{\top}(Gx-b),0),

where x\in\mathbb{R}^{d}, u\in\mathbb{R}_{+}^{q}, and D=[d_{1},\ldots,d_{q}]^{\top}. It is easy to see that the operator A is maximally monotone, C is a \beta-cocoercive operator with \beta=\|G\|^{-2}, and B is an L-Lipschitz operator with L=\|D\|. According to the structure of the operator B, we rewrite B as B=\sum_{i=1}^{q}B_{i}, where B_{i}:(x,u)\mapsto(d_{i}u_{i},0,\ldots,-d_{i}^{\top}x,\ldots,0) for every i\in\{1,\ldots,q\}. For uniform sampling, the stochastic oracle is B_{\xi}(x,u)=qB_{i}(x,u) with P_{\xi}(i)=\operatorname{Prob}\{\xi=i\}=\frac{1}{q}, i\in\{1,\ldots,q\}.

Now, we use Algorithm 2.2 to solve the problem (1) with (25), then Algorithm 2.2 reduces to

\left\lfloor\begin{aligned} &\bar{x}^{k}=\lambda x^{k}+(1-\lambda)w^{k},\\ &\bar{u}^{k}=\lambda u^{k}+(1-\lambda)v^{k},\\ &y^{k}={\rm Prox}_{\gamma\iota_{[0,1]^{d}}}(\bar{x}^{k}-\gamma D^{T}v^{k}-\gamma(G^{T}(Gw^{k}-b))),\\ &\hbox{for every}\,\,j=1,\ldots,q,\\ &\left\lfloor\begin{aligned} &\eta^{k}_{j}=\max\{0,\bar{u}^{k}_{j}+\gamma d_{j}^{T}w^{k}\},\\ \end{aligned}\right.\\ &\hbox{Sample}\ \xi_{k}\ \hbox{uniformly at random from}\,\ \{1,\ldots,q\},\\ &x^{k+1}=y^{k}+\gamma qd_{\xi_{k}}(v_{\xi_{k}}^{k}-\eta_{\xi_{k}}^{k}),\\ &u^{k+1}=\eta^{k}+\gamma S_{\xi_{k}}(w^{k}-y^{k}),\\ &w^{k+1}=\begin{cases}x^{k+1},&\hbox{with probability}\,\,p\\ w^{k},&\hbox{with probability}\,\,1-p\\ \end{cases}\\ &v^{k+1}=\begin{cases}u^{k+1},&\hbox{with probability}\,\,p\\ v^{k},&\hbox{with probability}\,\,1-p\\ \end{cases}\\ \end{aligned}\right.

where S_{\xi_{k}} denotes the map z\mapsto(0,\ldots,-qd_{\xi_{k}}^{\top}z,\ldots,0), whose only nonzero entry is in the \xi_{k}-th component.

In the numerical test, G,D,bG,D,b and initial value (x0,u0)(x_{0},u_{0}) are all randomly generated. In VRFBHF, set (w0,v0)=(x0,u0)(w_{0},v_{0})=(x_{0},u_{0}), take λ=0.1\lambda=0.1 and γ=3.999β(1λ)1+1+16β2L2(1λ)\gamma=\frac{3.999\beta(1-\lambda)}{1+\sqrt{1+16\beta^{2}L^{2}(1-\lambda)}}. In FBHF, take γ=3.999β1+1+16β2LB2\gamma=\frac{3.999\beta}{1+\sqrt{1+16\beta^{2}L_{B}^{2}}}. We use

Ek=(xk+1xk,uk+1uk)(xk,uk)<106,E_{k}=\frac{\|(x^{k+1}-x^{k},u^{k+1}-u^{k})\|}{\|(x^{k},u^{k})\|}<10^{-6},

as the stopping criterion.
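For illustration, the following Python/NumPy sketch builds a random instance of Example 3.1 and evaluates the operators in (25) together with the uniform-sampling oracle; it is not the MATLAB code used for the reported experiments, and the problem sizes and function names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
q, d = 100, 50
t = d // 2
G = rng.standard_normal((t, d))
b = rng.standard_normal(t)
D = rng.standard_normal((q, d))                    # rows are d_1^T, ..., d_q^T

beta = 1.0 / np.linalg.norm(G, 2) ** 2             # cocoercivity constant of C
L = np.linalg.norm(D, 2)                           # Lipschitz constant used for B

def resolvent_A(zx, zu, gamma):
    # J_{gamma A}: projection onto [0,1]^d in x and onto R^q_+ in u
    return np.clip(zx, 0.0, 1.0), np.maximum(zu, 0.0)

def B(x, u):
    return D.T @ u, -D @ x

def C(x, u):
    return G.T @ (G @ x - b), np.zeros(q)

def B_oracle(x, u, i):
    # uniform sampling: B_xi = q * B_i, touching only the i-th row of D
    bx = q * D[i] * u[i]
    bu = np.zeros(q)
    bu[i] = -q * (D[i] @ x)
    return bx, bu

# unbiasedness check: averaging the oracle over all indices recovers B
x, u = rng.random(d), rng.random(q)
bx_avg = np.mean([B_oracle(x, u, i)[0] for i in range(q)], axis=0)
bu_avg = np.mean([B_oracle(x, u, i)[1] for i in range(q)], axis=0)
Bx, Bu = B(x, u)
print(np.linalg.norm(bx_avg - Bx), np.linalg.norm(bu_avg - Bu))
```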

Figure 1: Decay of EkE_{k} with the number of iterations of different pp for Example 3.1 with q=1000,d=500q=1000,d=500.

Figure 1 illustrates the behavior of VRFBHF for different values of p, from which it can be observed that E_{k} oscillates with k and reaches the stopping criterion fastest when p=0.2. Next, we test eight problem sizes and randomly generate 10 instances for each size. The average number of iterations and CPU time over the 10 instances are listed in Table 1. It can be observed from Table 1 that VRFBHF requires remarkably less CPU time and fewer iterations than the FBHF splitting algorithm (2).

Table 1: Computational results of FBHF and VRFBHF with p=0.2

q      d      Iter (VRFBHF)   Iter (FBHF)   CPU time (VRFBHF)   CPU time (FBHF)
1000   500    75.8            3147          1.0531              15.4125
1000   750    92.3            1363          1.1047              8.8984
1000   1000   45.8            1933          0.8344              24.825
1000   2000   16.4            1731          1.5344              77.2984
2000   1000   96.2            1058          2.4609              28.775
2000   1500   73.8            1020          8.8984              55.3953
2000   2000   16.9            2022          14.6687             151.3359
2000   2500   26.6            1268          22.8078             136.9688

References

  • [1] Alacaoglu, A., Malitsky, Y.: Stochastic variance reduction for variational inequality methods. Mach. Learn. 178, 1–39 (2022)
  • [2] Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd ed. Springer, New York (2017)
  • [3] Banert, S., Rudzusika, J., Öktem, O., Adler, J.: Accelerated forward-backward optimization using deep learning. https://arxiv.org/abs/2105.05210
  • [4] Briceño-Arias, L.M., Davis, D.: Forward-backward-half forward algorithm for solving monotone inclusions. SIAM J. Optim. 28(4), 2839–2871 (2017)
  • [5] Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25(2), 1221–1248 (2015)
  • [6] Combettes, P.L., Pesquet, J.-C.: Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Valued Var. Anal. 20, 307–330 (2012)
  • [7] Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Valued Var. Anal. 25, 829–858 (2017)
  • [8] Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 315–323 (2013)
  • [9] Kovalev, D., Horvath, S., Richtárik, P.: Don't jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. Mach. Learn. 117, 1–17 (2020)
  • [10] Liu, J.C., Xu, L.L., Shen, S.H., Ling, Q.: An accelerated variance reducing stochastic method with Douglas–Rachford splitting. Mach. Learn. 108, 859–878 (2019)
  • [11] Latafat, P., Patrinos, P.: Asymmetric forward-backward-adjoint splitting for solving monotone inclusions involving three operators. Comput. Optim. Appl. 68, 57–93 (2017)
  • [12] Malitsky, Y., Tam, M.K.: A forward-backward splitting method for monotone inclusions without cocoercivity. SIAM J. Optim. 30(2), 1451–1472 (2020)
  • [13] Rieger, J., Tam, M.K.: Backward-forward-reflected-backward splitting for three operator monotone inclusions. Appl. Math. Comput. 381 (2020)
  • [14] Ryu, E.K.: Uniqueness of DRS as the 2 operator resolvent-splitting and impossibility of 3 operator resolvent-splitting. Math. Program. 182, 233–273 (2020)
  • [15] Ryu, E.K., Vũ, B.C.: Finding the forward-Douglas–Rachford-forward method. J. Optim. Theory Appl. 184, 858–876 (2020)
  • [16] Rockafellar, R.T.: Monotone operators associated with saddle-functions and minimax problems. In: Browder, F.E. (ed.) Nonlinear Functional Analysis, Part I, Proc. Sympos. Pure Math. 18, 241–250 (1970)
  • [17] Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, 233–257 (1971)
  • [18] Tseng, P.: A modified forward-backward splitting method for maximal monotone mappings. SIAM J. Control Optim. 38(2) (2000)
  • [19] Yu, H., Zong, C., Tang, Y.: An outer reflected forward-backward splitting algorithm for solving monotone inclusions. https://arxiv.org/abs/2009.12493 (2020)
  • [20] Zhang, X., Haskell, W.B., Ye, Z.S.: A unifying framework for variance-reduced algorithms for finding zeroes of monotone operators. J. Mach. Learn. Res. 23(60), 1–44 (2022)
  • [21] Zong, C., Tang, Y., Cho, Y.J.: Convergence analysis of an inexact three-operator splitting algorithm. Symmetry 10(11), 563 (2018)
  • [22] Zong, C., Tang, Y., Zhang, G.: An accelerated forward-backward-half forward splitting algorithm for monotone inclusion with applications to image restoration. Optimization (2022). DOI: 10.1080/02331934.2022.2107926