
Stochastic Three-Operator Splitting Algorithms for Nonconvex and Nonsmooth Optimization Arising from FLASH Radiotherapy

Fengmiao Bian Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China. E-mail: [email protected]    Jiulong Liu Institute of Computational Mathematics and Scientific/Engineering Computing, Academy of Mathematics and System Sciences, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected]    Xiaoqun Zhang School of Mathematical Sciences, MOE-LSC and Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China. E-mail: [email protected]    Hao Gao Department of Radiation Oncology, University of Kansas Medical Center, Kansas City, Kansas, USA. E-mail: [email protected]    Jianfeng Cai Department of Mathematics, The Hong Kong University of Science and Technology, Hong Kong, China and HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, Futian, Shenzhen, China. E-mail: [email protected].
Abstract

Radiation therapy (RT) aims to deliver tumoricidal doses with minimal radiation-induced normal-tissue toxicity. Compared to conventional RT (of conventional dose rate), FLASH-RT (of ultra-high dose rate) can provide additional normal tissue sparing, which however has created a new nonconvex and nonsmooth optimization problem that is highly challenging to solve. In this paper, we propose a stochastic three-operator splitting (STOS) algorithm to address the FLASH optimization problem. We establish the convergence and convergence rates of the STOS algorithm under the nonconvex framework for both unbiased gradient estimators and variance-reduced gradient estimators. These stochastic gradient estimators include the most popular ones, such as SGD, SAGA, SARAH, and SVRG, among others. The effectiveness of the STOS algorithm is validated using FLASH radiotherapy planning for patients.

Key words. Nonconvex optimization \cdot stochastic three-operator splitting \cdot unbiased stochastic gradient \cdot variance-reduced stochastic gradient \cdot FLASH radiotherapy

AMS subject classifications. 90C06 \cdot 90C15 \cdot 90C26 \cdot 90C90

1 Introduction

1.1 Background

Radiation therapy (RT) has long been a cornerstone of cancer treatment. With the continuous advancement of medical technology, various radiation therapy techniques are now available, such as intensity-modulated proton therapy (IMPT) and pencil beam scanning (PBS) [54, 18, 27]. Although these techniques have been widely adopted in clinical practice, complete eradication of cancerous tissue still depends on the delivered dose, which is limited by the risk of severe radiation-induced side effects. Recently, a new technique called FLASH radiotherapy (FLASH-RT) has emerged. FLASH-RT delivers ultra-high dose rates, several orders of magnitude higher than conventional therapy [34]. Clinical data [34, 52] indicate that FLASH-RT can achieve better tumor control with fewer side effects, primarily because irradiating tissue at ultra-high dose rates (\geq 40 Gy/s) can reduce normal-tissue toxicity while retaining the tumor-killing effect. Proton irradiation can reach such dose rates and, owing to the unique depth-dose characteristics of proton beams, can deliver ultra-high dose rates to deep-seated tissues [37, 8]. Additionally, clinical data [48, 34] have demonstrated that FLASH proton radiotherapy, owing in part to its favorable immune response, has become an important treatment modality for certain acute and advanced cancers.

While FLASH proton radiotherapy may become the primary radiotherapy method for certain tumors, current research on FLASH proton radiotherapy planning has mainly focused on optimizing the dose distribution. However, each spot carries constraints not only on the dose distribution but also on the dose rate distribution, which highlights the need for a new treatment optimization approach: FLASH proton radiotherapy with simultaneous dose and dose rate optimization [19, 55, 14]. Mathematically, it can be formulated as the following constrained nonconvex optimization problem:

\min_{x\in\mathbb{R}^{n}}H(x):=\left\|Ax-b\right\|^{2},\qquad\text{s.t.}\;\begin{cases}x_{i}\geq 0,\\ x_{i}\geq\alpha\ \text{ if }x_{i}>0,\\ Qx\geq u,\end{cases}\qquad (1.1)

where x is the spot weight vector to be optimized, b is the dose objective, A is the forward system matrix, Q is a block matrix built from the forward matrix that maps x to the dose in the region of interest (ROI), u is a vector associated with the DVH (dose-volume histogram) constraints, and α is the planning minimum monitor-unit (MMU) threshold. Here H exhibits a large-scale summation structure, representing the sum of planning objectives; see Section 4 for more details. Currently, most research on solving (1.1) employs a nonstandard alternating direction method of multipliers (ADMM) algorithm applied to a convex relaxation of the problem. A classical technique for constrained optimization is to move the constraints into the objective via indicator functions. Here, we employ this technique to transform (1.1) equivalently into the following large-scale nonconvex optimization problem:

\min_{x\in\mathbb{R}^{n}}H(x)+\mathcal{I}_{C}(x)+\mathcal{I}_{D}(x),\qquad (1.2)

where C = {x ∈ ℝⁿ | x_i ∈ {0} ∪ [α, +∞), ∀ i = 1, 2, …, n} = ({0} ∪ [α, +∞))ⁿ and D = {x ∈ ℝⁿ | Qx ≥ u}, and ℐ_C, ℐ_D denote their indicator functions. This transformation allows us to apply classical splitting algorithms to (1.2). The three-operator splitting algorithm [6] is an efficient method for this kind of composite optimization. In our case, however, H has a large-scale summation structure, and computing ∇H can be computationally expensive, which reduces the efficiency of the method. This motivates the stochastic three-operator splitting algorithm proposed below.

1.2 Model and Algorithm

Driven by the FLASH proton radiotherapy problem (1.2) and to broaden the applicability of our research, we consider the following general optimization problem:

\min_{x\in\mathbb{R}^{n}}F(x)+G(x)+H(x),\qquad (1.3)

where F, G, and H are proper lower semi-continuous functions. We do not assume convexity of F, G, or H throughout this paper. For the FLASH proton radiotherapy problem (1.2), H = ‖Ax − b‖² has a large-scale summation structure, while F and G correspond to the indicator functions of the two constraint sets. Problem (1.3) not only covers the FLASH proton radiotherapy problem but also captures many other applications, particularly in signal/image processing and computer vision, that involve multiple regularization terms. For instance, deep learning-based models in image processing often include both sparsity regularization terms and neural network-based regularization terms [31]. Non-negative low-rank matrix decompositions carry both low-rank and non-negativity constraints [29, 40, 33]. In compressive sensing, the widely used L1-L2 sparse regularization allows the recovery problem to be formulated as a three-term composite nonconvex optimization problem [57, 43]. In robust principal component analysis (RPCA), the optimization problem includes constraints not only on the data term but also on the low-rank and sparse factors, which again leads to a three-term composite minimization [17]. Operator splitting is an important tool for solving composite optimization problems of type (1.3). Classic algorithms such as Forward-Backward Splitting (FBS) and Douglas-Rachford Splitting (DRS) can be employed when H = 0 in (1.3). In [15], Attouch, Bolte, and Svaiter established the convergence of FBS for nonconvex and nonsmooth optimization under the assumption that ∇F is Lipschitz continuous. In [12], Li and Pong applied the DRS algorithm to nonconvex feasibility problems and established its convergence under similar assumptions.

The above splitting algorithms are not suitable for (1.3) because they require computing the proximal operator of G + H in each iteration, which is much more expensive, or even infeasible, compared to computing the proximal operators of G and H separately. To address this limitation, Davis and Yin [6] proposed a three-operator splitting algorithm for solving (1.3), in which each sub-problem involves the proximal operator of a single function, either F or G. When the proximal operators of F and G are easy to evaluate, three-operator splitting becomes a simple and effective method. Davis and Yin [6] established its convergence for convex optimization. Liu and Yin [56] proved convergence of the three-operator splitting method when F is weakly convex and G and H are twice continuously differentiable with bounded Hessians. In [9], Bian and Zhang established the global convergence of a three-operator splitting algorithm for nonconvex optimization problems and provided local convergence rates.

However, for large-scale problems, the three-operator splitting (TOS) algorithm may become computationally inefficient because evaluating ∇H is time-consuming. To overcome this challenge, we propose a stochastic version of TOS that replaces ∇H with a stochastic approximation. This leads to a new stochastic three-operator splitting (STOS) algorithm that allows more efficient computations. We present STOS in Algorithm 1 below.

Algorithm 1: A Stochastic Three-Operator Splitting (STOS) algorithm

  Step 0. Choose a step-size γ > 0 and an initial point x_0.
  Step 1. Set

      y_{t+1}\in\arg\min_{y}\Big\{F(y)+\frac{1}{2\gamma}\|y-x_{t}\|^{2}\Big\},\qquad (1.4a)
      z_{t+1}\in\arg\min_{z}\Big\{G(z)+\frac{1}{2\gamma}\|z-(2y_{t+1}-\gamma{\widetilde{\nabla}}H(y_{t+1})-x_{t})\|^{2}\Big\},\qquad (1.4b)
      x_{t+1}=x_{t}+(z_{t+1}-y_{t+1}).\qquad (1.4c)

  Step 2. If a termination criterion is not met, go to Step 1.

Note that the STOS algorithm extends several classical algorithms. When F = 0, Algorithm 1 becomes the stochastic Forward-Backward Splitting (FBS) algorithm. When H = 0, Algorithm 1 corresponds to the classical Douglas-Rachford Splitting (DRS) algorithm. When F = 0 and G = 0, it reduces to the stochastic gradient descent algorithm. Here, ∇̃H can be either an unbiased gradient estimator (such as SGD) or a variance-reduced gradient estimator (such as SAGA, SVRG, or SARAH).
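
To make the iteration concrete, the following is a minimal Python sketch of (1.4); prox_F, prox_G, and the stochastic gradient oracle grad_H_tilde are user-supplied placeholders, and all names are purely illustrative.

    import numpy as np

    def stos(x0, prox_F, prox_G, grad_H_tilde, gamma, num_iters=1000):
        """Run the STOS iteration (1.4a)-(1.4c) starting from x0."""
        x = x0.copy()
        y = x0.copy()
        z = x0.copy()
        for _ in range(num_iters):
            y = prox_F(x, gamma)                                       # (1.4a)
            z = prox_G(2.0 * y - gamma * grad_H_tilde(y) - x, gamma)   # (1.4b)
            x = x + (z - y)                                            # (1.4c)
        return x, y, z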

1.3 Related work

Stochastic splitting algorithms for solving large-scale nonconvex optimization problems have received extensive interest in the last two decades.

When F = G = 0 in (1.3), the classical stochastic gradient descent algorithm [22] applies. In [46], Ghadimi and Lan proposed the randomized stochastic gradient (RSG) method for (1.3) with F = G = 0 and presented the first non-asymptotic convergence analysis of SGD. They showed that, under the assumptions that H is smooth and ∇̃H is an unbiased, bounded-variance estimator, the RSG method reaches a solution x_* satisfying 𝔼[‖∇H(x_*)‖²] ≤ ε with iteration complexity 𝒪(1/ε²). The stochastic gradient descent algorithm has also been combined with variance-reduced gradient estimators. For instance, Shamir [42] proposed a nonconvex SVRG method and considered the problem of computing several leading eigenvectors. Allen-Zhu and Hazan [59] also proposed SVRG and showed its linear convergence rate. Later, Reddi et al. [49] introduced the nonconvex SAGA method and demonstrated its linear convergence to the global optimal solution for a class of nonconvex problems.

When F = 0 in (1.3), the problem can be solved by the classical proximal gradient descent (PGD) algorithm. Ghadimi, Lan, and Zhang [47] introduced a randomized stochastic projected gradient (RSPG) algorithm for (1.3) with a constraint set X when F = 0; when X = ℝⁿ, RSPG reduces to the stochastic PGD algorithm. PGD combined with variance-reduced gradient estimators has also received much attention. Reddi et al. [50] integrated proximal gradient descent with the variance-reduced estimators SAGA and SVRG, referred to as PROXSAGA and PROXSVRG, respectively. They showed that, under the assumptions that G is nonsmooth convex and H is smooth nonconvex, reaching an ε-accurate solution requires 𝒪(1/ε) calls to the proximal oracle. Pham et al. [41] combined proximal gradient descent with the SARAH stochastic estimator, referred to as ProxSARAH. Assuming H is smooth nonconvex and G is nonconvex and nonsmooth, they showed that ProxSARAH finds an ε-stationary point with complexity 𝒪(N + N^{1/2} ε^{−2}) when H has a finite-sum structure.

The study of (1.3) with a three-term composite structure is limited in the nonconvex stochastic setting. Metel and Takeda [38] proposed a stochastic proximal gradient method for (1.3) and showed an iteration complexity of 𝒪(1/ε³) for driving the distance to the subdifferential mapping of the loss below ε. Yurtsever, Mangalick, and Sra [3] proposed a stochastic version of the three-operator splitting algorithm for (1.3) using unbiased, bounded-variance stochastic gradient estimators, and showed that finding an ε-stationary point requires an iteration complexity of 𝒪(1/ε³) based on a variational inequality. Driggs et al. [7] proposed a stochastic proximal alternating linearized algorithm for (1.3) when H is a cross-term, combining a class of variance-reduced stochastic gradient estimators; their algorithm finds an ε-stationary point with complexity 𝒪(1/ε). Further, for a class of large-scale three-block composite optimization problems in deep learning, Bian, Liu, and Zhang [11] proposed a new stochastic three-block splitting algorithm for (1.3) with a cross-term.

1.4 Contributions

In this paper, we propose a stochastic three-operator splitting (STOS) algorithm to solve the nonconvex and nonsmooth problem (1.3). The main contributions of our paper can be summarized as follows:

  • 1.

    We propose a stochastic three-operator splitting (STOS) algorithm for solving (1.3) that accommodates two types of stochastic gradient estimators: unbiased estimators and variance-reduced estimators. Compared to the stochastic TOS in [3], which only considers unbiased gradient estimators, our STOS (Algorithm 1) handles both unbiased and variance-reduced gradient estimators, including the most commonly used ones. Hence, our algorithm can select the appropriate gradient estimator for different applications.

  • 2.

    When an unbiased gradient estimator is used, we establish convergence and obtain an 𝒪(1/ε) convergence rate for the STOS algorithm under the bounded-variance assumption. When a variance-reduced gradient estimator is used, we construct an energy function for the STOS algorithm; furthermore, if the objective function is semi-algebraic, we obtain the convergence and convergence rate of STOS. Compared to the stochastic TOS in [3], our theoretical analysis does not require convexity of F and G.

  • 3.

    In the numerical experiments, we apply the STOS algorithm to solve the FLASH proton radiotherapy with dose and dose rate optimization in radiation therapy. In all current research work, nonstandard ADMM algorithms are used to solve the convex relaxation of this problem. In this paper, we directly solve this nonconvex problem using the stochastic three-operator splitting algorithm. To the best of our knowledge, this is the first time a stochastic algorithm has been adopted to solve the FLASH proton radiotherapy optimization. All test results demonstrate that STOS significantly outperforms the nonstandard ADMM algorithm.

Notations and paper organization. Before presenting the main content of the paper, we list some notation used throughout. We denote by ℝⁿ the n-dimensional Euclidean space and by ⟨·,·⟩ the inner product; the induced norm is ‖·‖ = √⟨·,·⟩. An extended-real-valued function f: ℝⁿ → (−∞, ∞] is proper if it is never −∞ and its domain, dom f := {x ∈ ℝⁿ : f(x) < +∞}, is nonempty. The function is closed if it is proper and lower semicontinuous. A point x_* is a stationary point of a function f if 0 ∈ ∂f(x_*); x_* is a critical point of f if f is differentiable at x_* and ∇f(x_*) = 0. A function f is coercive if lim_{‖x‖→∞} f(x) = +∞. We call f a σ-convex function if f − (σ/2)‖·‖² is convex. When σ > 0, f is strongly convex; when σ < 0, f is weakly convex; when σ = 0, f is convex.

The rest of this paper is organized as follows. We first present some notation and preliminaries in Section 2. The principal theoretical analysis of our proposed algorithm can be found in Section 3, followed by numerical results in Section 4. In Section 5, we give some concluding remarks.

2 Notation and preliminaries

This section provides the necessary notations and definitions for our paper. For STOS (Algorithm 1), two types of gradient estimators can be used: unbiased gradient estimators and variance-reduced gradient estimators. The definitions for these two types of gradient estimators are presented below.

2.1 Unbiased Gradient Estimator

A class of gradient estimators often used in stochastic algorithms are unbiased gradient estimators, such as the classical stochastic gradient descent (SGD) estimator [21]. In this paper, we also analyze STOS (Algorithm 1) combined with unbiased gradient estimators, which we now define.

Definition 2.1 (Unbiased Gradient Estimator).

The stochastic gradient estimator ~H(x){\widetilde{\nabla}}H(x) is unbiased if

\mathbb{E}[{\widetilde{\nabla}}H(x)]=\nabla H(x).
Remark 2.2.

In many applications, the function HH has a finite-sum structure, such as in empirical risk minimization problems encountered in deep learning. In this case, randomness mainly arises from the way samples are drawn from the sum, and the unbiased gradient estimator can be expressed in the following unified form:

{\widetilde{\nabla}}H(x)=\frac{1}{N}\sum_{i=1}^{N}v_{i}\nabla H_{i}(x),\qquad (2.1)

where {vi}i=1N\{v_{i}\}_{i=1}^{N} is a sampling vector drawn from a certain distribution 𝒟\mathcal{D}. The random variable viv_{i} can take different forms for different sampling methods. A representative sampling method is bb-nice sampling without replacement, which forms the stochastic gradient descent (SGD) estimator commonly used in deep learning.

We give the specific form of SGD below; other sampling methods can also produce an unbiased stochastic gradient, such as sampling with replacement and independent sampling without replacement. For more details, we refer readers to [2].

Definition 2.3 (SGD [21]).

The SGD estimator ~SGDH(x){\widetilde{\nabla}}^{SGD}H(x) is defined as follows:

{\widetilde{\nabla}}^{SGD}H(x_{t})=\frac{1}{b}\sum_{j\in J_{t}}\nabla H_{j}(x_{t}),

where JtJ_{t} is a random subset uniformly drawn from {1,,N}\{1,\cdots,N\} of fixed batch size bb.

Remark 2.4.

For b-nice sampling without replacement, set v_i = 1_{i∈J_t}/p_i in (2.1), where 1_{i∈J_t} = 1 if i ∈ J_t and 0 otherwise, and p_i = b/N; this yields the SGD estimator.
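
As an illustration, a minimal numpy sketch of this b-nice mini-batch estimator is given below; grad_H_i(x, i) is an assumed oracle returning the gradient of H_i at x, and all names are illustrative.

    import numpy as np

    def sgd_estimator(x, grad_H_i, N, b, rng=None):
        """SGD estimator of Definition 2.3: average of b gradients sampled without replacement."""
        rng = rng or np.random.default_rng()
        J = rng.choice(N, size=b, replace=False)   # b-nice sampling without replacement
        return sum(grad_H_i(x, j) for j in J) / b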

2.2 Variance-Reduced Gradient Estimator.

Variance-reduced gradient estimators have been proposed and widely adopted in stochastic algorithms, such as SVRG (stochastic variance reduced gradient) [44], SAG (stochastic average gradient) [39], and SDCA (stochastic dual coordinate ascent) [51]. In [7], a unified definition is provided for a class of variance-reduced gradient estimators. In this paper, we analyze the combination of STOS (Algorithm 1) with the variance-reduced gradient estimators of [7]. For convenience, we recall their definition below.

Definition 2.5 (Variance-reduced Gradient Estimator [7, Definition 2.1]).

A gradient estimator ~{\widetilde{\nabla}} is called variance-reduced if there exist constants V1,V2,VΥ0V_{1},\,V_{2},\,V_{\Upsilon}\geq 0 and ρ(0,1]\rho\in(0,1] such that

  • 1.

    (MSE Bound). There exists a sequence of random variables {Υ_t}_{t≥1} of the form Υ_t = Σ_{i=1}^{s}‖v_t^i‖² for some random vectors v_t^i such that

    \mathbb{E}_{t}\|{\widetilde{\nabla}}H(x_{t})-\nabla H(x_{t})\|^{2}\leq\Upsilon_{t}+V_{1}\big(\mathbb{E}_{t}\|x_{t+1}-x_{t}\|^{2}+\|x_{t}-x_{t-1}\|^{2}\big),

    and, with \Gamma_{t}=\sum_{i=1}^{s}\|v_{t}^{i}\|,

    \mathbb{E}_{t}\|{\widetilde{\nabla}}H(x_{t})-\nabla H(x_{t})\|\leq\Gamma_{t}+V_{2}\big(\mathbb{E}_{t}\|x_{t+1}-x_{t}\|+\|x_{t}-x_{t-1}\|\big).\qquad (2.2)
  • 2.

    (Geometric Decay). The sequence {Υt}t1\{\Upsilon_{t}\}_{t\geq 1} decays geometrically:

    \mathbb{E}_{t}\Upsilon_{t+1}\leq(1-\rho)\Upsilon_{t}+V_{\Upsilon}\big(\mathbb{E}_{t}\|x_{t+1}-x_{t}\|^{2}+\|x_{t}-x_{t-1}\|^{2}\big).\qquad (2.3)
  • 3.

    (Convergence of Estimator). For all sequences {x_t}_{t=0}^∞ satisfying 𝔼‖x_t − x_{t−1}‖² → 0 as t → ∞, it follows that 𝔼Υ_t → 0 and 𝔼Γ_t → 0.

The variance-reduced gradient estimator defined in Definition 2.5 is widely used in many algorithms, such as SAGA [1], SARAH [35], SAG [39], and SVRG [30, 44]. It has been verified in [7, 10] that these four estimators are variance-reduced gradient estimators of this type. Below, we provide the definitions for these four estimators.

Definition 2.6 (SAGA [1]).

The SAGA gradient approximation ~SAGAH(x){\widetilde{\nabla}}^{SAGA}H(x) is defined as follows:

{\widetilde{\nabla}}^{SAGA}H(x_{t})=\frac{1}{b}\Big(\sum_{j\in J_{t}}\nabla H_{j}(x_{t})-\nabla H_{j}(\varphi_{t}^{j})\Big)+\frac{1}{N}\sum_{i=1}^{N}\nabla H_{i}(\varphi_{t}^{i}),

where J_t is a mini-batch containing b indices. The variables φ_t^i follow the update rules φ_{t+1}^i = x_t if i ∈ J_t and φ_{t+1}^i = φ_t^i otherwise.
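
For illustration, a minimal sketch of this estimator and its gradient-table update is shown below; grad_H_i is an assumed per-component gradient oracle, and all names are illustrative.

    import numpy as np

    class SAGAEstimator:
        """SAGA estimator of Definition 2.6 with a stored gradient table g_tab[i] = grad H_i(phi_t^i)."""
        def __init__(self, grad_H_i, x0, N):
            self.grad_H_i = grad_H_i
            self.N = N
            self.g_tab = np.stack([grad_H_i(x0, i) for i in range(N)])   # grad H_i(phi_0^i)
            self.g_avg = self.g_tab.mean(axis=0)                         # (1/N) sum_i grad H_i(phi_t^i)

        def __call__(self, x, batch):
            new = np.stack([self.grad_H_i(x, j) for j in batch])
            est = (new - self.g_tab[batch]).sum(axis=0) / len(batch) + self.g_avg
            # update the stored gradients and their running average for indices in the batch
            self.g_avg += (new - self.g_tab[batch]).sum(axis=0) / self.N
            self.g_tab[batch] = new
            return est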

Definition 2.7 (SARAH [35]).

The SARAH estimator reads for t=0t=0 as

{\widetilde{\nabla}}^{SARAH}H(x_{0})=\nabla H(x_{0}).

For t=1,2,t=1,2,\dots, define random variables pt{0,1}p_{t}\in\{0,1\} with P(pt=0)=1pP(p_{t}=0)=\frac{1}{p} and P(pt=1)=11pP(p_{t}=1)=1-\frac{1}{p}, where p(1,)p\in(1,\infty) is a fixed chosen parameter. Let JtJ_{t} be a random subset uniformly drawn from {1,,N}\{1,\dots,N\} of fixed batch size bb. Then for t=1,2,t=1,2,\dots the SARAH gradient approximation reads as

{\widetilde{\nabla}}^{SARAH}H(x_{t})=\begin{cases}\nabla H(x_{t})&\text{if }p_{t}=0,\\ \frac{1}{b}\sum_{j\in J_{t}}\big(\nabla H_{j}(x_{t})-\nabla H_{j}(x_{t-1})\big)+{\widetilde{\nabla}}^{SARAH}H(x_{t-1})&\text{if }p_{t}=1.\end{cases}
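
A minimal sketch of this recursive estimator is given below; grad_H and grad_H_i are assumed full and per-component gradient oracles, and all names are illustrative.

    import numpy as np

    class SARAHEstimator:
        """SARAH estimator of Definition 2.7: full-gradient restart w.p. 1/p, recursive update otherwise."""
        def __init__(self, grad_H, grad_H_i, x0, N, b, p, rng=None):
            self.grad_H, self.grad_H_i = grad_H, grad_H_i
            self.N, self.b, self.p = N, b, p
            self.rng = rng or np.random.default_rng()
            self.v = grad_H(x0)                    # tilde-grad H(x_0) = grad H(x_0)
            self.x_prev = np.array(x0, copy=True)

        def __call__(self, x):
            if self.rng.random() < 1.0 / self.p:   # p_t = 0: restart with the full gradient
                self.v = self.grad_H(x)
            else:                                  # p_t = 1: recursive mini-batch correction
                J = self.rng.choice(self.N, size=self.b, replace=False)
                self.v = self.v + sum(self.grad_H_i(x, j) - self.grad_H_i(self.x_prev, j) for j in J) / self.b
            self.x_prev = np.array(x, copy=True)
            return self.v
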
Definition 2.8 (SAG [39]).

The SAG gradient approximation ~SAGH(x){\widetilde{\nabla}}^{SAG}H(x) is defined as follows:

{\widetilde{\nabla}}^{SAG}H(x_{t})=\frac{1}{N}\sum_{j\in J_{t}}\big(\nabla H_{j}(x_{t})-\nabla H_{j}(\varphi_{t}^{j})\big)+\frac{1}{N}\sum_{i=1}^{N}\nabla H_{i}(\varphi_{t}^{i}),

where J_t is a mini-batch containing b indices. The gradient history is updated as ∇H_i(φ_{t+1}^i) = ∇H_i(x_t) if i ∈ J_t and ∇H_i(φ_{t+1}^i) = ∇H_i(φ_t^i) otherwise.

Definition 2.9 (SVRG [44, 30]).

The SVRG gradient approximation ~SVRGH(x){\widetilde{\nabla}}^{SVRG}H(x) is defined as follows:

{\widetilde{\nabla}}^{SVRG}H(x_{t})=\frac{1}{b}\sum_{j\in J_{t}}\big(\nabla H_{j}(x_{t})-\nabla H_{j}(\varphi_{s})\big)+\frac{1}{N}\sum_{i=1}^{N}\nabla H_{i}(\varphi_{s}),

where J_t is a random subset uniformly drawn from {1, …, N} of fixed batch size b, and φ_s is a snapshot point updated every m steps, where m is the number of steps in the inner loop.
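
A minimal sketch of this snapshot-based estimator is given below; grad_H and grad_H_i are assumed full and per-component gradient oracles, the refresh schedule is the inner-loop length m, and all names are illustrative.

    import numpy as np

    class SVRGEstimator:
        """SVRG estimator of Definition 2.9: snapshot phi_s and its full gradient refreshed every m steps."""
        def __init__(self, grad_H, grad_H_i, x0, N, b, m, rng=None):
            self.grad_H, self.grad_H_i = grad_H, grad_H_i
            self.N, self.b, self.m = N, b, m
            self.rng = rng or np.random.default_rng()
            self.phi = np.array(x0, copy=True)
            self.full_grad = grad_H(x0)              # (1/N) sum_i grad H_i(phi_s)
            self.t = 0

        def __call__(self, x):
            if self.t > 0 and self.t % self.m == 0:  # refresh the snapshot every m inner steps
                self.phi = np.array(x, copy=True)
                self.full_grad = self.grad_H(x)
            self.t += 1
            J = self.rng.choice(self.N, size=self.b, replace=False)
            return sum(self.grad_H_i(x, j) - self.grad_H_i(self.phi, j) for j in J) / self.b + self.full_grad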

Remark 2.10.

Both types of estimators can be combined with our algorithm, and each has strengths and weaknesses depending on the application. SGD is the simplest stochastic gradient estimator: it relies only on the gradient computed on each batch, with no additional gradient storage or computation. However, SGD may suffer from relatively large variance, which requires a smaller step size to ensure convergence and leads to slow convergence. To address this shortcoming, several variance-reduced gradient estimators have been proposed, such as SVRG, SAGA, SAG, and SARAH. These estimators reduce the variance, allowing larger step sizes and faster convergence. However, they often require additional gradient (or dual variable) storage and occasional full gradient computations, increasing the storage and per-iteration cost paid for faster convergence. This can make variance-reduced gradient estimators difficult to use for more complex problems, such as structured prediction and neural network learning. Therefore, for different problems, one must weigh the trade-off between SGD, with cheap iterations but slow convergence, and variance-reduced estimators, with more expensive iterations but faster convergence. Regarding the variance-reduced gradient estimators of [7], SAG, SVRG, SAGA, and SARAH all belong to this class, but they are not all biased gradient estimators: for example, SAGA is unbiased, while SARAH is biased.

2.3 Kurdyka-Łojasiewicz property

The Kurdyka-Łojasiewicz (KL) property plays a crucial role in the convergence analysis of nonconvex optimization problems. We first recall the definition of the KL property, which in particular holds for semi-algebraic functions. For a more detailed account, we refer readers to [16, 15, 24, 23, 25] and the references therein. Let ε₁ and ε₂ be two extended real numbers satisfying −∞ < ε₁ < ε₂ < +∞. We define the set [ε₁ < F < ε₂] := {x ∈ ℝⁿ : ε₁ < F(x) < ε₂}.

Definition 2.11 (KL property).

A function F:n{+}F:\mathbb{R}^{n}\to\mathbb{R}\cup\{+\infty\} has the Kurdyka-Łojasiewicz property at xx^{*}\in dom F\partial F if there exist η(0,]\eta\in(0,\infty], a neighborhood UU of xx^{*}, and a continuous concave function φ:[0,η)+\varphi:[0,\eta)\to\mathbb{R}_{+} such that:

  • (i)

    φ(0)=0,φC1((0,η))\varphi(0)=0,\,\varphi\in C^{1}((0,\eta)), and φ(s)>0\varphi^{{}^{\prime}}(s)>0 for all s(0,η)s\in(0,\eta);

  • (ii)

    for all xU[F(x)<F<F(x)+η]x\in U\cap[F(x^{*})<F<F(x^{*})+\eta] the Kurdyka-Łojasiewicz inequality holds, i.e.,

    \varphi^{\prime}(F(x)-F(x^{*}))\,\textmd{dist}(0,\partial F(x))\geq 1.

If FF satisfies the Kurdyka-Łojasiewicz property at each point of dom F\partial F, then it is called a KL function.

Roughly speaking, the KL property allows for sharp optimization up to reparameterization via φ\varphi, which is known as a desingularizing function for FF. A typical class of KL functions are semi-algebraic functions, such as the 0\ell_{0} pseudo-norm and the rank function. More specifically, we can find a detailed result on the KL property in [16, Section 4.3]. Further arguments can be found in [24, Section 2] and [23, Corollary 16]. Semi-algebraic functions are easier to identify and encompass a wide range of possibly nonconvex functions that arise in applications, as shown in [16, 15, 24, 23, 25]. To clarify, we revisit the definition of semi-algebraic functions.

Definition 2.12 (Semi-algebraic set).

A semi-algebraic set SnS\subseteq\mathbb{R}^{n} is a finite union of sets of the form

\Big\{x\in\mathbb{R}^{n}:h_{1}(x)=\cdots=h_{k}(x)=0,\,g_{1}(x)<0,\dots,g_{l}(x)<0\Big\},

where g1,,glg_{1},\dots,g_{l} and h1,,hkh_{1},\dots,h_{k} are real polynomials.

Definition 2.13 (Semi-algebraic function).

A function f:nf:\mathbb{R}^{n}\to\mathbb{R} is semi-algebraic if its graph {(x,f(x))n+1:xn}\big{\{}(x,\,f(x))\in\mathbb{R}^{n+1}:x\in\mathbb{R}^{n}\big{\}} is semi-algebraic.

Lemma 2.14 (KL inequality in the semi-algebraic cases).

Let hh be a proper closed semi-algebraic function on n\mathbb{R}^{n}. Then, hh satisfies the Kurdyka-Łojasiewicz property at all points in dom h\partial h with φ(s)=cs1θ\varphi(s)=cs^{1-\theta} for some θ[0,1)\theta\in[0,1) and c>0.c>0.

3 Convergence analysis

This section focuses on the convergence analysis of STOS (Algorithm 1) combined with two different types of stochastic gradient estimators: unbiased estimator and variance-reduced estimator. Our analysis is based on the following assumptions concerning the objective functions F,GF,G, and HH.

Assumption 3.1.

Functions FF, GG, and HH satisfy the following:

  • (a1)

    FF has a Lipschitz continuous gradient, i.e., there exists a constant L>0L>0 such that

    \|\nabla F(y_{1})-\nabla F(y_{2})\|\leq L\|y_{1}-y_{2}\|\quad\forall\,y_{1},y_{2}\in\mathbb{R}^{n}.
  • (a2)

    G is a proper closed function whose proximal mapping 𝒫_{γG}(x) is nonempty for any x and any γ > 0.

  • (a3)

    For each i=1,2,,Ni=1,2,\cdots,N, HiH_{i} has a Lipschitz continuous gradient, i.e., there exists a constant β>0\beta>0 such that

    \|\nabla H_{i}(y_{1})-\nabla H_{i}(y_{2})\|\leq\beta\|y_{1}-y_{2}\|\quad\forall\,y_{1},y_{2}\in\mathbb{R}^{n}.

If F has a Lipschitz continuous gradient, then we can always find l ∈ ℝ such that F + (l/2)‖·‖² is convex; in particular, l can be taken to be L. We also record the optimality conditions of Algorithm 1:

0=\nabla F(y_{t+1})+\frac{1}{\gamma}(y_{t+1}-x_{t}),\qquad (3.1a)
0\in\partial G(z_{t+1})+\frac{1}{\gamma}(z_{t+1}+\gamma{\widetilde{\nabla}}H(y_{t+1})-2y_{t+1}+x_{t}).\qquad (3.1b)

These conditions will be frequently used in the convergence analysis.

3.1 Convergence rates for unbiased gradient estimator

In this subsection, we consider the stochastic nonconvex problem (1.3) with HH being the expectation of a function of a random variable, i.e., H(x)=𝔼ξH(x,ξ)H(x)=\mathbb{E}_{\xi}H(x,\xi), where ξ\xi is a random variable with distribution 𝒫\mathcal{P}:

\min_{x\in\mathbb{R}^{n}}F(x)+G(x)+\mathbb{E}_{\xi}H(x,\xi).\qquad (3.2)

As a special case of (3.2), if ξ\xi is a uniformly random vector defined on a finite support set Ω:={ξ1,ξ2,,ξN}\Omega:=\{\xi_{1},\xi_{2},\cdots,\xi_{N}\}, then (3.2) simplifies to the following finite-sum minimization problem:

\min_{x\in\mathbb{R}^{n}}F(x)+G(x)+\frac{1}{N}\sum_{i=1}^{N}H_{i}(x),

where Hi(x):=H(x,ξi)H_{i}(x):=H(x,\xi_{i}) for i=1,,Ni=1,\cdots,N. Here, the stochastic gradient estimator ~H{\widetilde{\nabla}}H in Algorithm 1 adopts the following SGD estimator:

{\widetilde{\nabla}}H(x)=\frac{1}{b}\sum_{\xi\in J_{t}}\nabla H(x,\xi),

where J_t is a set of b i.i.d. samples from the distribution 𝒫. Our convergence analysis relies on the following assumptions.

Assumption 3.2.

We assume that

  • (i)

    H(x,ξ)\nabla H(x,\xi) is an unbiased estimator of H(x)\nabla H(x), i.e.,

    \mathbb{E}_{\xi}[\nabla H(x,\xi)]=\nabla H(x)\qquad\forall\,x\in\mathbb{R}^{n}.
  • (ii)

    H(x,ξ)\nabla H(x,\xi) has bounded variance, i.e.,

    \mathbb{E}_{\xi}[\|\nabla H(x,\xi)-\nabla H(x)\|^{2}]\leq\sigma^{2}\qquad\forall\,x\in\mathbb{R}^{n}.
Remark 3.3.

Unbiased gradient estimators with bounded variance are often used in convergence analysis of general stochastic algorithms [4, 47, 46, 20, 38]. In practical applications, many unbiased gradient estimators satisfy this assumption, and one can refer to the references [45, 32, 36] for more details.

Definition 3.4.

For ε>0\varepsilon>0, we call xx_{*} an ε\varepsilon-stationary point if it satisfies

\textmd{dist}(0,\nabla F(x_{*})+\partial G(x_{*})+\nabla H(x_{*}))\leq\varepsilon,

where dist(0,A)\textmd{dist}(0,A) denotes the distance between 0 and the set AA.

In the following, we establish the convergence analysis of STOS (Algorithm 1) with unbiased estimators. We show that STOS converges to an ε-stationary point and obtain its convergence rate. The proof of Theorem 3.5 is given in Appendix A.2.

Theorem 3.5.

Suppose that Assumptions 3.1 and 3.2 hold. Let {(xt,yt,zt)}t0\{(x_{t},y_{t},z_{t})\}_{t\geq 0} be a sequence generated by STOS (Algorithm 1). Choose γ>0\gamma>0 such that

\Lambda(\gamma):=\frac{1-\gamma(1+l+2\beta)}{2\gamma}-\frac{(3\gamma+2)(\gamma L^{2}+2l+2L)}{2}>0.

Then zτz_{\tau}, randomly returned by the algorithm for τ{1,2,,T}\tau\in\{1,2,\cdots,T\}, satisfies

\mathbb{E}_{\tau}\mathbb{E}\,\textmd{dist}^{2}(0,\partial G(z_{\tau})+\nabla F(z_{\tau})+\nabla H(z_{\tau}))\leq\frac{1+\gamma^{2}(L^{2}+\beta^{2})}{2\gamma^{2}\Lambda(\gamma)}\cdot\frac{(1+\gamma L)^{2}}{T}\Big(\delta_{0}+\frac{3\sigma^{2}}{2}\Big)+\frac{\sigma^{2}}{2T}=\mathcal{O}(T^{-1}),

where L and β are the Lipschitz constants in Assumption 3.1.

Remark 3.6.

From the expression for Λ(γ), we see that Λ(γ) → +∞ as γ → 0. Thus, given l, L, and β, Λ(γ) > 0 holds whenever γ is sufficiently small. Indeed, we can give a computable threshold for γ:

0<\gamma<\min\Big\{\frac{1}{L},\,\frac{6L^{2}+(2\beta+5l+10)L+6l}{2L}\Big\}.

3.2 Convergence rates for variance-reduced gradient estimator

We have provided the convergence guarantee for STOS (Algorithm 1) when combined with unbiased gradient estimators. However, some variance-reduced gradient estimators, such as SARAH, can make the algorithm more effective for certain large-scale image processing problems [10]. To allow STOS to use a wider variety of stochastic gradient estimators, we further provide the convergence guarantee for STOS combined with the variance-reduced gradient estimator (Definition 2.5) and obtain its convergence rate. The convergence analysis is based on the Kurdyka-Łojasiewicz inequality. For this purpose, it is crucial to construct an energy function that decreases along the sequence generated by Algorithm 1. Thus, we define

\Phi(x,y,z,y^{\prime})=F(y)+G(z)+H(y)+\frac{1}{2\gamma}\|y-x\|^{2}-\frac{1}{2\gamma}\|z-x\|^{2}+\langle\nabla H(y),z-y\rangle-\frac{1}{2}\|z-y\|^{2}+C_{1}\|y-y^{\prime}\|^{2}\qquad (3.3)

for C₁ > 0. Denote Φ_t = Φ(x_t, y_t, z_t, y_{t−1}). Next, we define the energy function Ψ_t associated with STOS (Algorithm 1) as follows:

\Psi_{t}=\Phi_{t}-\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{5}{2\rho}\Upsilon_{t}+\frac{5V_{1}\rho+5V_{\Upsilon}}{2\rho}\mathbb{E}_{t}\|y_{t}-y_{t-1}\|^{2}.\qquad (3.4)

The decreasing property of Ψt\Psi_{t} in expectation is given in the following theorem.

Theorem 3.7.

Suppose that Assumption 3.1 holds. Let {(xt,yt,zt)}t0\{(x_{t},y_{t},z_{t})\}_{t\geq 0} be a sequence generated by STOS (Algorithm 1) using variance-reduced gradient estimators. Then for any t1t\geq 1, there holds

\mathbb{E}_{t}\big[\Psi_{t+1}+C_{1}\|y_{t}-y_{t-1}\|^{2}+\Lambda_{1}(\gamma)\|y_{t+1}-y_{t}\|^{2}\big]\leq\Psi_{t},

where C1>0C_{1}>0 is a given constant and

\Lambda_{1}(\gamma):=\frac{1-\gamma(1+l+2\beta)-2\gamma C_{1}}{2\gamma}-\frac{5V_{1}\rho+5V_{\Upsilon}}{\rho}-\frac{3\gamma+2}{2\gamma}\big((-1+2\gamma l)+(1+\gamma L)^{2}\big)-(1+\gamma L)^{2}.

Moreover, if Λ1(γ)>0\Lambda_{1}(\gamma)>0, then Ψt\Psi_{t} is a decreasing function and

\sum_{t=0}^{\infty}\|y_{t+1}-y_{t}\|^{2}<\infty,\qquad\sum_{t=0}^{\infty}\|x_{t+1}-x_{t}\|^{2}<\infty,\qquad\sum_{t=0}^{\infty}\|z_{t+1}-y_{t+1}\|^{2}<\infty\qquad a.s.
Remark 3.8.

We now show that choosing a suitable γ\gamma always makes Λ1(γ)>0\Lambda_{1}(\gamma)>0. From the expression for Λ1(γ)\Lambda_{1}(\gamma) we have that Λ1(γ)+\Lambda_{1}(\gamma)\to+\infty as γ0\gamma\to 0. Thus it is always true that Λ1(γ)>0\Lambda_{1}(\gamma)>0 when γ\gamma is sufficiently small. Here we also provide a computable threshold for γ\gamma. Denote

\mathcal{T}_{\gamma}=\frac{-\big(2K+(6+4L)(\frac{l}{L}+1)+8\big)+\sqrt{\big(2K+(6+4L)(\frac{l}{L}+1)+8\big)^{2}+(12L+8L^{2})}}{6L+4L^{2}},

where K=1+l+2β2+5V1ρ+5VΥρ+C1K=\frac{1+l+2\beta}{2}+\frac{5V_{1}\rho+5V_{\Upsilon}}{\rho}+C_{1}. Then when 0<γ<min{1L,𝒯γ}0<\gamma<\min\{\frac{1}{L},\mathcal{T}_{\gamma}\}, we have Λ1(γ)>0\Lambda_{1}(\gamma)>0.

The proof of Theorem 3.7 is provided in Appendix A.3. Next, we will establish the global convergence in expectation for the whole sequence generated by Algorithm 1. See Appendix A.3 for details of the proof.

Theorem 3.9.

Suppose that FF, GG, and HH are semialgebraic functions with KL exponent θ[0,1)\theta\in[0,1), and the sequence {(xt,yt,zt)}\{(x_{t},y_{t},z_{t})\} is generated by Algorithm 1. Then, either {(xt,yt,zt)}\{(x_{t},y_{t},z_{t})\} is a critical point after a finite number of iterations, or {(xt,yt,zt)}t=0\{(x_{t},y_{t},z_{t})\}_{t=0}^{\infty} almost surely satisfies the finite length property in expectation:

\sum_{t=0}^{\infty}\mathbb{E}\|x_{t+1}-x_{t}\|<\infty,\quad\sum_{t=0}^{\infty}\mathbb{E}\|y_{t+1}-y_{t}\|<\infty\quad\textmd{and}\quad\sum_{t=0}^{\infty}\mathbb{E}\|z_{t+1}-z_{t}\|<\infty.

Moreover, there exists an integer mm such that for all i>mi>m,

\sum_{t=m}^{i}\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+\mathbb{E}\|y_{t}-y_{t-1}\|^{2}\leq\frac{(2\sqrt{2}+1)\sqrt{s}}{3\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\rho}\sqrt{\mathbb{E}\Upsilon_{m}}+\frac{(4\sqrt{2}+2)\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)}{3\max\{\Lambda(\gamma),C_{1}\}}\Delta_{m,i-1},

where CC and C1C_{1} are both positive constants, and Δm,n=φ(𝔼[ΦmΦm])φ(𝔼[ΦnΦn])\Delta_{m,n}=\varphi(\mathbb{E}[\Phi_{m}-\Phi_{m}^{*}])-\varphi(\mathbb{E}[\Phi_{n}-\Phi_{n}^{*}]).

Furthermore, we derive the convergence rates for the sequence {(xt,yt,zt)}t0\{(x_{t},y_{t},z_{t})\}_{t\geq 0} generated by Algorithm 1 based on the KL exponent. The calculation of the KL exponent is detailed in [13]. We omit the proof of convergence rate, which is similar to the proof of Theorem 3.7 in [10].

Theorem 3.10.

Suppose that 2τβA22\tau\geq\beta\|A\|^{2}, 0<σ10<\sigma\leq 1 and Assumption 3.1 is satisfied. Moreover, suppose that FF, HH and GG are semialgebraic functions with KL exponent θ[0,1)\theta\in[0,1). Let {(xt,yt,zt)}t=0\{(x_{t},y_{t},z_{t})\}_{t=0}^{\infty} be a bounded sequence generated by STOS (Algorithm 1) using a variance-reduced gradient estimator. Then, the following convergence rates hold almost surely:

  • 1.

    If θ=0,\theta=0, then there exists an mm\in\mathbb{N} such that 𝔼Φ(Xt)=𝔼Φ(X)\mathbb{E}\Phi(X_{t})=\mathbb{E}\Phi(X_{*}) for all tm;t\geq m;

  • 2.

    If θ(0,12]\theta\in(0,\frac{1}{2}], then there exists d1>0d_{1}>0 and τ[1ρ,1)\tau\in[1-\rho,1) such that 𝔼XtXd1τt\mathbb{E}\|X_{t}-X_{*}\|\leq d_{1}\tau^{t};

  • 3.

    If θ(12,1)\theta\in(\frac{1}{2},1), then there exists a constant d2>0d_{2}>0 such that 𝔼XtXd2t1θ2θ1\mathbb{E}\|X_{t}-X_{*}\|\leq d_{2}t^{-\frac{1-\theta}{2\theta-1}}.

4 Numerical experiments

In this section, we implement the STOS algorithm on FLASH proton radiotherapy with dose and dose rate constraints. All experiments are run in MATLAB R2021a on a desktop equipped with a 3.9GHz 16-core AMD processor and 64GB memory.

4.1 FLASH proton radiotherapy with dose and dose rate optimization

4.1.1 Problem background

FLASH proton radiotherapy is an emerging cutting-edge technology that can deliver ultra-high doses of radiation in an extremely short period, within a fraction of a second. Although this new treatment modality is still in the experimental stage, it has demonstrated the potential to better protect normal tissues compared to conventional radiation. In clinical practice, proton beams can be utilized to deliver this high-dose radiation, making FLASH proton radiotherapy an intriguing prospect for future cancer treatment [19, 55].

However, the majority of research on FLASH proton radiotherapy focuses on radiation therapy problems with a minimum threshold on the number of protons delivered per spot, such as the well-known minimum monitor-unit (MMU) problem [58, 28]. In recent developments, modeling the FLASH effect while satisfying both dose and dose rate constraints has demonstrated greater effectiveness [14]. It therefore becomes imperative to handle the simultaneous dose and dose rate constraints together with the MMU constraint in FLASH proton radiotherapy, which can be formulated as the following constrained nonconvex optimization problem:

\min_{x\in\mathbb{R}^{n}}\,\,\frac{1}{2}\big\|Ax-y\big\|^{2}=\sum_{i=1}^{N}\frac{1}{2}\big\|\langle A_{i},x\rangle-y_{i}\big\|_{2}^{2}\qquad (4.1)
\textmd{s.t.}\quad\begin{cases}x\geq 0,\\ x\geq\alpha\ \text{ if }x>0,\\ Qx\geq u.\end{cases}

Here, x represents the proton spot weights to be optimized, and y is the weighted vector of objective constraint values. Let u = [0_{n×1}; μ_dr 1_{n×1}], where 0_{n×1} ∈ ℝ^{n×1} has all elements equal to 0. A represents the forward system matrix, a linear operator mapping x to y, and A_i denotes the i-th row of A. α is the planning minimum spot weight threshold, known as the MMU constraint. The FLASH constraints are enforced in a region of interest (ROI); we use B ∈ ℝ^{p×n} to denote the partial forward system matrix that maps x to the dose in the ROI (y_roi). The ROI, as defined in [14], is much smaller than the combined regions of CTV, body, and OARs. Consequently, B has significantly smaller dimensions than A. Here, Q can be written explicitly as the following block matrix:

\begin{bmatrix}\frac{\alpha}{t}(B\circ B)&B\\ B&{\bf 0}_{p\times n}\end{bmatrix}\begin{bmatrix}I_{n\times n}\\ I_{n\times n}\end{bmatrix}.

This shows that the constraint Qx ≥ u is equivalent to two constraints: one on the dose rate, (B∘B)(α/t · x) − μ_dr Bx ≥ 0, with μ_dr the lower bound of the dose rate in the ROI; and one on the dose, Bx − μ_d ≥ 0, with μ_d the lower bound of the dose in the ROI. Here "∘" denotes the element-wise (Hadamard) product, "≥" is applied element-wise, and t is the minimum spot duration. Additional physical interpretation of these two constraints with respect to the FLASH effect can be found in [14].
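
For concreteness, the following is a minimal sketch, under the assumptions above, of checking the two FLASH constraints encoded by Qx ≥ u; B, alpha, t, mu_dr, and mu_d are problem data, and the function name is illustrative.

    import numpy as np

    def flash_constraints_satisfied(x, B, alpha, t, mu_dr, mu_d, tol=1e-8):
        """Check the ROI dose-rate bound (B*B)@(alpha/t * x) - mu_dr*Bx >= 0 and dose bound Bx - mu_d >= 0."""
        Bx = B @ x
        dose_rate_ok = (B * B) @ ((alpha / t) * x) - mu_dr * Bx >= -tol   # B*B is the element-wise product
        dose_ok = Bx - mu_d >= -tol
        return bool(np.all(dose_rate_ok) and np.all(dose_ok))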

4.1.2 Model and algorithm

To the best of our knowledge, the existing literature uses a nonstandard ADMM algorithm to solve a convex relaxation of problem (4.1). Transforming constrained optimization problems into unconstrained ones via indicator functions is a standard technique in optimization. Here, we apply this technique to transform problem (4.1) into the following equivalent unconstrained optimization problem:

\min_{x\in\mathbb{R}^{n}}H(x)+\mathcal{I}_{C}(x)+\mathcal{I}_{D}(x),\qquad (4.2)

where C = {x ∈ ℝⁿ | x_i ∈ {0} ∪ [α, +∞), ∀ i = 1, 2, …, n} = ({0} ∪ [α, +∞))ⁿ and D = {x ∈ ℝⁿ | Qx ≥ u}. Here ℐ denotes the indicator function of a set, defined as

\mathcal{I}_{D}(x)=\begin{cases}0,&\text{if }x\in D,\\ +\infty,&\text{if }x\notin D.\end{cases}

We note that problem (4.2) can be solved by the classical three-operator splitting algorithm. Furthermore, since NN is typically on the order of millions in FLASH proton therapy optimization, our proposed STOS algorithm can effectively solve (4.2).

We apply the STOS method to (4.2) with H(x) = Σ_{i=1}^{N} ½‖⟨A_i, x⟩ − y_i‖₂², G(x) = ℐ_C(x), and F(x) = ℐ_D(x). However, our convergence analysis requires F to be smooth. We therefore approximate the indicator function ℐ_D(x) by the smoothed distance function dist_D(x) = min_{z∈D} (λ/2)‖x − z‖². With this choice, the first subproblem in Algorithm 1 has the closed form:

\arg\min_{y}\Big\{\textmd{dist}_{D}(y)+\frac{1}{2\gamma}\|y-x_{t}\|^{2}\Big\}=\frac{1}{1+\gamma\lambda}\Big(x_{t}+\gamma\lambda\mathcal{P}_{D}(x_{t})\Big).\qquad (4.3)

Here, 𝒫_D is the projection operator onto the set D. Note that in (4.3), as γλ → ∞, the right-hand side approaches 𝒫_D(x_t), which is the solution of the first subproblem in Algorithm 1 when F = ℐ_D. Applying STOS to (4.2) with H(x) = Σ_{i=1}^{N} ½‖⟨A_i, x⟩ − y_i‖₂², G(x) = ℐ_C(x), and F(x) = dist_D(x) yields the following iteration:

\begin{cases}y_{t+1}=\arg\min_{y}\textmd{dist}_{D}(y)+\frac{1}{2\gamma}\|y-x_{t}\|^{2},\\ z_{t+1}=\mathcal{P}_{C}(2y_{t+1}-\gamma{\widetilde{\nabla}}H(y_{t+1})-x_{t}),\\ x_{t+1}=x_{t}+(z_{t+1}-y_{t+1}).\end{cases}

For the first subproblem, due to the smoothness of distD(x)\mathrm{dist}_{D}(x), we can directly compute the gradient and further utilize a quasi-Newton method in MATLAB to solve it. The second subproblem is easy to solve, and it has an explicit solution:

\big[\mathcal{P}_{C}(x)\big]_{i}=\begin{cases}x_{i},&\text{if }x_{i}\geq\frac{\alpha}{2},\\ 0,&\text{otherwise}.\end{cases}
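
Putting the pieces together, a minimal sketch of one iteration of this specialized STOS update is given below; proj_D is an assumed projection onto D (computed, e.g., by an inner solver), grad_H_tilde is a stochastic gradient of H, and the thresholding rule for P_C follows the explicit formula stated above. All names are illustrative.

    import numpy as np

    def prox_dist_D(x, proj_D, gamma, lam):
        """Closed form (4.3) for the prox of the smoothed indicator dist_D."""
        return (x + gamma * lam * proj_D(x)) / (1.0 + gamma * lam)

    def proj_C(x, alpha):
        """Thresholding rule for P_C given above: keep x_i if x_i >= alpha/2, else set it to 0."""
        return np.where(x >= alpha / 2.0, x, 0.0)

    def stos_flash_step(x, proj_D, grad_H_tilde, gamma, lam, alpha):
        """One iteration of the specialized STOS update for the FLASH problem (4.2)."""
        y = prox_dist_D(x, proj_D, gamma, lam)
        z = proj_C(2.0 * y - gamma * grad_H_tilde(y) - x, alpha)
        return x + (z - y), y, z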

Further, we verify that FF, GG, and HH satisfy the assumptions of a semi-algebraic function. According to [16, Section 4.3], H(x)H(x) and F(x)F(x) are semi-algebraic functions. From Appendix of [26], the set CC is a semi-algebraic set, therefore the indicator function of CC is also semi-algebraic.

Next, we evaluate the STOS algorithm in two different cases, FLASH proton RT for the lung and for the brain, and compare with the nonstandard ADMM in [19], in which the nonlinearity of the constraint is linearized. In all experiments, the minimum spot weight threshold α is fixed as suggested in [19] (specifically 160 in our experiments). Before presenting the results, we introduce several quality measures for assessing the effectiveness and accuracy of treatment planning. One key metric is the Conformal Index (CI), defined as V₁₀₀²/(V × V′₁₀₀), where V₁₀₀ represents the PTV volume receiving at least 100% of the prescription dose, V the volume of the PTV, and V′₁₀₀ the total volume receiving at least 100% of the prescription dose. Therefore, CI ranges from 0 to 1, and ideally CI = 1. D_max, D_max^cord, and D_max^lung denote the maximum dose received by the CTV (clinical target volume), cord, and lung, respectively. D_mean^roi, D_mean^esophagus, and D_mean^body denote the average dose received by the ROI, esophagus, and body, respectively. The dose coverage and dose rate coverage, in percent (reported as P_d and P_dr), are computed as #{w_d > 0}/#{w_d} × 100 and #{w_dr > 0}/#{w_dr} × 100, providing an assessment of the degree to which the constraints related to w_d and w_dr are met.
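
As an illustration, a minimal sketch of computing these plan-quality metrics is given below; it assumes a voxel-wise dose vector, a boolean PTV mask, and the prescription dose, and all names are illustrative.

    import numpy as np

    def conformal_index(dose, ptv_mask, prescription):
        """CI = V100^2 / (V * V100'), with V100 the PTV volume at >= 100% of the prescription dose."""
        at_rx = dose >= prescription
        V100 = np.count_nonzero(at_rx & ptv_mask)    # PTV voxels receiving at least the prescription dose
        V = np.count_nonzero(ptv_mask)               # PTV volume
        V100_total = np.count_nonzero(at_rx)         # total volume receiving at least the prescription dose
        return V100 ** 2 / (V * V100_total) if V and V100_total else 0.0

    def coverage_percent(w):
        """Percentage of constraint slacks w that are strictly positive, e.g. #{w_d > 0}/#{w_d} * 100."""
        return 100.0 * np.count_nonzero(w > 0) / w.size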

4.1.3 FLASH proton radiotherapy for the lung

Radiation therapy is a common and effective treatment for lung cancer. The first FLASH proton radiotherapy experiment is conducted for the lung. The dose influence matrix A is generated via MatRad [53], with 5 mm spot width and 3 mm lateral spacing on a 3 mm³ dose grid. The beam angles are empirically chosen as (0°, 120°, 240°). In this experiment, the regions of the CTV, the body, and the organs at risk (spinal cord, lung, esophagus) are chosen for dose planning. The size of the dose influence matrix A is 1323239 × 1505. The matrix B corresponds to the portion of A on the ROI and has size 24353 × 1505. The data y, of length N = 1323239, is shuffled and divided into 8 parts and 16 parts for testing our STOS algorithm. The parameter μ_dr is set to 40 Gy/s, while μ_d is set to 8 Gy. The step size γ for STOS is set to 2.5 × 10⁴.

Figure 1: Test loss for different methods with the same number of epochs for the lung. (a) Varying batch sizes; (b) varying gradient methods.
Method           | D_max  | CI   | D_mean^roi | D_max^cord | D_max^lung | D_mean^esophagus | D_mean^body | P_d   | P_dr
ADMM             | 109.62 | 0.56 | 1.29       | 9.85       | 18.54      | 2.57             | 1.14        | 58.80 | 90.01
STOS SGD N       | 118.51 | 0.67 | 1.19       | 7.28       | 17.49      | 1.95             | 1.00        | 67.76 | 92.94
STOS SGD N/8     | 118.83 | 0.66 | 1.19       | 7.13       | 17.85      | 1.81             | 1.02        | 71.14 | 94.16
STOS SGD N/16    | 118.82 | 0.65 | 1.19       | 7.14       | 17.86      | 1.81             | 1.02        | 71.63 | 94.21
STOS SAGA N/8    | 109.50 | 0.66 | 1.18       | 8.09       | 16.93      | 1.86             | 1.01        | 71.54 | 94.17
STOS SARAH N/8   | 109.53 | 0.66 | 1.18       | 8.11       | 16.96      | 1.86             | 1.01        | 71.09 | 94.31
Table 1: Physical dose parameters for the lung (dose unit: Gy).
Figure 2: Lung. Left: DVH (x-axis: dose; y-axis: percentage); right: DRVH (x-axis: dose rate; y-axis: percentage). Panels (a)-(b) vary the batch size; panels (c)-(d) vary the gradient method. The DVH and DRVH show that the proposed method, with all gradient estimators, improves the dose rate for the ROI and the dose at the organs at risk (cord, lung, esophagus).
Figure 3: FLASH-RT for the lung. The dose distribution is plotted using the window [0%, 115%]; the 100% isodose line, the 90% isodose line, and the CTV are highlighted. Panels: (a) ADMM; (b) STOS n_bs = N; (c) STOS SGD n_bs = N/8; (d) STOS SGD n_bs = N/16; (e) STOS SAGA n_bs = N/8; (f) STOS SARAH n_bs = N/8.

Here, we tested three different stochastic gradient estimators: SGD, SAGA, and SARAH. In Figure 1, we first present the objective values of all algorithms run for the same number of epochs. From Figure 1 (a), it can be observed that both STOS and TOS exhibit faster energy decay than ADMM, and STOS achieves a faster descent of the energy function than TOS. In Figure 1 (b), we show the results of STOS combined with SGD, SAGA, and SARAH. The three stochastic gradient estimators behave similarly, with SGD performing slightly better than the other two. In summary, as shown in Figure 1, the energy results indicate that the stochastic methods are superior to the non-stochastic methods for FLASH proton RT. This observation highlights the effectiveness of incorporating stochastic sampling into the optimization process. Moreover, decomposing a large-scale problem into smaller sub-problems makes each sub-problem easier to solve, further broadening the applicability of the algorithm.

To further demonstrate the effectiveness of STOS, the DVH and DRVH curves are shown in Figure 2, and values of several important indices are presented in Table 1. All results demonstrate that the proposed STOS can spare more high-dose OAR regions (e.g., ROI = PTV10mm) near treatment targets than ADMM, and achieves better dose conformality to the treatment target (e.g., CI). In Figure 2 (a) and (b), we compare STOS combined with SGD against ADMM. In Figure 2 (c) and (d), we compare STOS combined with SGD, SAGA, and SARAH. From Figure 2 (a), the treatment plans obtained by STOS combined with SGD, SAGA, and SARAH exhibit lower doses on all OARs than those obtained by the ADMM algorithm. Although Figure 2 (c) suggests little difference between STOS combined with SGD and with SAGA or SARAH, the numerical results in Table 1 indicate that STOS combined with SGD performs notably better than SAGA and SARAH on several indices. For example, the mean ROI dose obtained by SGD is reduced from 1.29 Gy (ADMM) to 1.19 Gy for the lung. In terms of the maximum doses, STOS combined with SGD outperforms the other algorithms; for example, its D_max^cord is significantly lower than that of the other methods. SAGA and SARAH obtain similar results, and they achieve a lower D_max^lung than SGD and the other algorithms. We also present the dose distributions in Figure 3, from which we can observe that STOS combined with SGD provides a closer fit to the target region than SAGA or SARAH.

4.1.4 FLASH proton radiotherapy for the brain

Brain tumors are also a challenging problem in the medical field. Compared to tumors in the lung, brain tumors present additional difficulties due to the intricate network of structures and pathways within the brain, making treatment more challenging. In this experiment, we conduct FLASH proton radiation therapy for the brain. The beam angles are empirically chosen as (45°, 135°, 225°, 315°). The regions of the CTV, the body, and the organs at risk (carotid, cranial nerve, oral cavity) are chosen for dose planning, leading to a dose influence matrix A of size 1323239 × 1505, and the ROI is chosen for FLASH effect optimization, leading to a matrix B of size 3089 × 353. The brain data, of length 547397, is shuffled and divided into 8 parts and 16 parts in the same way as the lung data. The parameter μ_dr is set to 40 Gy/s and μ_d to 8 Gy. The step size γ for STOS is set to 1.0 × 10³.

Figure 4: Test loss for different methods with the same number of epochs for the brain. (a) Varying batch sizes; (b) varying gradient methods.
Method | $D_{max}$ | CI | $D_{mean}^{roi}$ | $D_{max}^{carotid}$ | $D_{max}^{craner}$ | $D_{mean}^{oralcav}$ | $D_{mean}^{body}$ | $P_{d}$ | $P_{dr}$
ADMM | 106.02 | 0.68 | 12.11 | 59.75 | 59.07 | 11.02 | 0.49 | 65.88 | 96.21
STOS SGD ($n_{bs}=N$) | 108.77 | 0.76 | 11.29 | 60.13 | 56.41 | 10.19 | 0.45 | 70.21 | 96.83
STOS SGD ($n_{bs}=N/8$) | 107.89 | 0.79 | 11.34 | 60.00 | 57.29 | 10.17 | 0.45 | 72.21 | 97.17
STOS SGD ($n_{bs}=N/16$) | 114.85 | 0.74 | 11.67 | 60.02 | 59.61 | 10.61 | 0.46 | 72.32 | 97.66
STOS SAGA ($n_{bs}=N/8$) | 108.76 | 0.76 | 11.29 | 60.13 | 56.40 | 10.18 | 0.45 | 73.41 | 97.55
STOS SARAH ($n_{bs}=N/8$) | 110.63 | 0.76 | 12.49 | 58.35 | 53.77 | 11.94 | 0.49 | 72.93 | 97.83
Table 2: Physical dose parameters for the brain (dose unit: Gy).

The objective function values obtained when running the same number of epochs are shown in Figure 4. Here, we compare STOS combined with SGD, SAGA, and SARAH against ADMM. All the stochastic algorithms decrease the objective function value much faster than ADMM. Figure 4 (b) compares SGD, SAGA, and SARAH, indicating that for the brain case SAGA decreases the energy function faster than SGD and SARAH. Further, the DVH and DRVH curves are provided in Figure 5, and several important indices are presented in Table 2. All the results demonstrate that STOS combined with SGD, SAGA, or SARAH improves the dose rate in the ROI and reduces the dose to the critical organs (carotid, cranial nerves, oral cavity). For instance, from Figure 5 (b), it is evident that for the dose rate volume histogram (DRVH), STOS combined with SGD is noticeably higher than ADMM. In Figure 5 (d), the results of STOS combined with SGD, SAGA, and SARAH are very similar. The numerical results in Table 2 confirm these observations. For example, the maximum dose $D_{max}$ for SGD reaches 114.85 Gy, while ADMM achieves only 106.02 Gy, so SGD significantly outperforms ADMM in this respect. SARAH, on the other hand, reduces $D_{max}^{craner}$ (cranial nerve) to 53.77 Gy, substantially lower than the 59.07 Gy of ADMM, and it also yields a lower $D_{max}^{carotid}$ than ADMM. Furthermore, Figure 6 presents the dose distributions, which show that STOS combined with the SARAH gradient estimator achieves the best fit to the tumor region.

Figure 5: Brain. Left, DVH (X-axis: dose; Y-axis: percentage); right, DRVH (X-axis: dose rate; Y-axis: percentage). Panels (a) and (b) vary the batch size; panels (c) and (d) vary the gradient method. The DVH and DRVH show that the proposed methods, with all gradient estimators, improve the dose rate in the ROI and reduce the dose at the organs at risk (carotid, cranial nerve, oral cavity).
Figure 6: FLASH-RT for the brain. The dose distribution is plotted using the window [0%, 115%]; the 100% isodose line, 90% isodose line, and CTV are highlighted in the plots. Panels: (a) ADMM; (b) STOS with $n_{bs}=N$; (c) STOS SGD with $n_{bs}=N/8$; (d) STOS SGD with $n_{bs}=N/16$; (e) STOS SAGA with $n_{bs}=N/8$; (f) STOS SARAH with $n_{bs}=N/8$.

5 Conclusion

In this paper, we propose a stochastic three-operator splitting (STOS) algorithm that can be combined with two types of stochastic gradient estimators: unbiased and variance-reduced estimators. We establish convergence and derive convergence rates for each type of estimator. Furthermore, we validate the effectiveness of the STOS algorithm on the FLASH proton radiotherapy optimization problem. Extensive experiments demonstrate that STOS attains state-of-the-art performance for simultaneous dose and dose rate optimization in FLASH proton radiotherapy. The STOS algorithm can partition a large-scale problem into smaller subproblems, and the use of stochastic gradients leads to improved convergence.

Acknowledgments

The work of Jian-Feng Cai was supported in part by the Hong Kong Research Grants Council GRF Grants 16310620 and 16306821, in part by the Hong Kong ITF MHP/009/20, and in part by the Project of Hetao Shenzhen-Hong Kong Science and Technology Innovation Cooperation Zone under Grant HZQB-KCZYB-2020083. Xiaoqun Zhang was supported by Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102) and NSFC (No. 12090024). Jiulong Liu was partially supported by the Key grant of Chinese Ministry of Education (2022YFC2504302) and the Fund of the Youth Innovation Promotion Association, CAS (2022002). Fengmiao Bian was also partially supported by an outstanding Ph.D. graduate development scholarship from Shanghai Jiao Tong University.

References

  • [1] A. Defazio, F. Bach and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
  • [2] A. Khaled and P. Richtarik. Better theory for SGD in the nonconvex world. ArXiv preprint arXiv:2002.03329, 2020.
  • [3] A. Yurtsever, V. Mangalick, and S. Sra. Three Operator Splitting with a Nonconvex Loss Function. Proceedings of the 38-th International Conference on Machine Learning, (PMLR), 139, 2021.
  • [4] C. Fang, J. Li, Z. Lin and T. Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path integrated differential estimator. 32nd Conference on Neural Information Processing Systems (NIPS), 2018.
  • [5] D. Davis. The asynchronous palm algorithm for nonsmooth nonconvex problems. ArXiv preprint arXiv:1604.00526, 2016.
  • [6] D. Davis and W. Yin. Three-Operator Splitting Scheme and its Optimization Applications. Set-Valued Var. Anal., 25:829–858, 2017.
  • [7] D. Driggs, J. Tang, J. Liang, M. Davies and C. B. Schonlieb. SPRING: A fast stochastic proximal alternating method for non-smooth non-convex optimization. SIAM J. Imaging Sci., 14(04):1932–1970, 2021.
  • [8] E. Beyreuther, M. Brand, S. Hans, et al. Feasibility of proton FLASH effect tested by zebrafish embryo irradiation. Radiother Oncol., 139:46–50, 2019.
  • [9] F. Bian and X. Zhang. A three-operator splitting algorithm for nonconvex sparsity regularization. SIAM J. Sci. Comput, 43(4):A2809–A2839, 2021.
  • [10] F. Bian, J. Liang and X. Zhang. A stochastic alternating direction method of multipliers for non-smooth and non-convex optimization. Inverse Problems, 37(7), 2021.
  • [11] F. Bian, R. Liu and X. Zhang. A Stochastic Three-block Splitting Algorithm and Its Application to Quantized Deep Neural Networks. ArXiv preprint arXiv:2204.11065, 2022.
  • [12] G. Li and T.K. Pong. Douglas–Rachford splitting for nonconvex optimization with application to nonconvex feasibility problems. Math. Program., Ser. A, 159:371–401, 2016.
  • [13] G. Li and T.K. Pong. Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math., 18(5):1199–1232, 2018.
  • [14] Hao Gao, Jiulong Liu, Yuting Lin, Gregory N Gan, Guillem Pratx, Fen Wang, Katja Langen, Jeffrey D Bradley, Ronny L Rotondo, Harold H Li, et al. Simultaneous dose and dose rate optimization (sddro) of the flash effect for pencil-beam-scanning proton therapy. Medical Physics, 49(3):2014–2025, 2022.
  • [15] H. Attouch, J. Bolte and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program., 137(1-2):91–129, 2013.
  • [16] H. Attouch, J. Bolte, P. Redont and A. Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems : An approach based on the Kurdyka–Łojasiewicz inequality. Math. Oper. Res., 35:438–457, 2010.
  • [17] H. Cai, J.-F. Cai and K. Wei. Accelerated Alternating Projections for Robust Principal Component Analysis. J. Mach. Learn. Res., 20:1–33, 2019.
  • [18] H. Gao, B. Clasie, M. McDonald, K. M. Langen, T. Liu, and Y. Lin. Plan-delivery-time constrained inverse optimization method with minimum-MU-per-energy-layer(MMPEL) for efficient pencil beam scanning proton therapy. Med. Phys., 47(9):3892–3897, 2020.
  • [19] H. Gao, B. Lin, Yuting Lin, Shujun Fu, Katja Langen, Tian Liu and Jeffery Bradley. Simultaneous dose and dose rate optimization (sddro) for flash proton therapy. Medical Physics, 47(12):6388–6395, 2020.
  • [20] H. Ouyang, N. He, L. Q. Tran and A. Gray. Stochastic alternating direction method of multipliers. Proceedings of the 30-th International conference on machine learning, (JMLR), 2013.
  • [21] H. Robbins and D. Siegmund. A convergence theorem for non-negative almost supermartingales and some applications. Optimizing Methods in Statistics, pages 233–257, 1971.
  • [22] H. Robbins and S. Monro. A stochastic approximation method. Ann. Math. Statist., 22(3):400–407, 1951.
  • [23] J. Bolte, A. Daniilidis, A. Lewis and M. Shiota. Clarke subgradients of stratifiable functions. SIAM J. Optim., 18:556–572, 2007.
  • [24] J. Bolte, A. Daniilidis and A. Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim., 17(4):1205–1223, 2006.
  • [25] J. Bolte, A. Daniilidis, O. Ley and L. Mazet. Characterizations of Łojasiewicz inequalities subgradient flows, talweg, convexity. Trans. Amer. Math. Soc., 362(6):3319–3363, 2010.
  • [26] J. Bolte, S. Sabach and M. Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program., Ser. A, 146:459–494, 2014.
  • [27] J. D. Wilson, E. M. Hammond, G. S. Higgins and K. Petersson. Ultra-High Dose Rate (FLASH) Radiotherapy: Silver Bullet or Fool’s Gold? Front. Oncol., 9, 2020.
  • [28] J.-F. Cai, R. Chen, J. Fan, and H. Gao. Minimum-monitor-unit optimization via a stochastic coordinate descent method. Physics in Medicine & Biology, 67(1):015009, 2022.
  • [29] J. Flamant, S. Miron, and D. Brie. Quaternion non-negative matrix factorization: Definition, uniqueness, and algorithm. IEEE Trans. Image Process., 68:1870–1883, 2020.
  • [30] J. Konecny, J. Liu, P. Richtarik and M. Takac. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10:242–255, 2016.
  • [31] J. Li, C. Huang, R. Chan, H. Feng, M. K. Ng, and T. Zeng. Spherical Image Inpainting with Frame Transformation and Data-Driven Prior Deep Networks. SIAM J. Imag. Sci., 16(3):1177–1194, 2023.
  • [32] J. Mairal, F. Bach, J. Ponce and G. Sapiro. Online dictionary learning for sparse coding. Proceedings of the 26th Annual International Conference on Machine Learning, (ICML), pages 689–696, 2009.
  • [33] J. Pan and M. K. Ng. Separable Quaternion Matrix Factorization for Polarization Images. SIAM J. Imag. Sci., 16(3):1281–1307, 2023.
  • [34] J. R. Hughes and J. L. Parsons. FLASH Radiotherapy: current knowledge and future insights using proton-beam therapy. Int. J. Mol. Sci., 21(18), 2020.
  • [35] L. M. Nguyen, J. Liu, K. Scheinberg and M. Takac. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning, 70:2613–2621, 2017.
  • [36] L. Mason, J. Baxter, P. Bartlett and M. Frean. Boosting algorithms as gradient descent. Conference on Neural Information Processing Systems, (NIPS), 12:512–518, 1999.
  • [37] M. Buonanno, V. Grilj and D. J. Brenner. Biological effects in normal cells exposed to FLASH dose rate protons. Radiother. Oncol., 139:51–55, 2019.
  • [38] M. R. Metel and A. Takeda. Stochastic proximal methods for non-smooth non-convex constrained sparse optimization. J. Mach. Learn. Res., 22:1–36, 2021.
  • [39] M. Schmidt, N. L. Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Math. Program., pages 1–30, 2016.
  • [40] N. Gillis. The why and how of nonnegative matrix factorization. Regulariz. Optim. Kernels Supp. Vector Mach., 12:257–291, 2014.
  • [41] N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. T. Dinh. ProxSARAH: An Efficient Algorithmic Framework for Stochastic Composite Nonconvex Optimization. J. Mach. Learn. Res., 21:1–48, 2020.
  • [42] O. Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. Proceedings of the 32-th International Conference on Machine Learning, (PMLR), 2015.
  • [43] P. Yin, Y. Lou, Q. He and J. Xin. Minimization of l12l_{1-2} for compressed sensing. SIAM J. Sci. Comput., 37(1):A536–A563, 2015.
  • [44] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, (NIPS), pages 315–323, 2013.
  • [45] S. Ghadimi and G. Lan. Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Optimization. Technical Report, 2013.
  • [46] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim., 23(4):2341–2368, 2013.
  • [47] S. Ghadimi, G. Lan and H. Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program., Ser. A, 155:267–305, 2016.
  • [48] S. Girdhani, E. Abel, A. Katsis, et al. FLASH: A novel paradigm changing tumor irradiation platform that enhances therapeutic ratio by reducing normal tissue toxicity and activating immune pathways. Cancer Res., 79, 2019.
  • [49] S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic Variance Reduction for Nonconvex Optimization. Proceedings of the 33-th International Conference on Machine Learning, (PMLR), 2016.
  • [50] S. J. Reddi, S. Sra, B. Poczos and A. J. Smola. Proximal Stochastic Methods for Nonsmooth Nonconvex Finite-Sum Optimization. 30th Conference on Neural Information Processing Systems (NIPS), 2016.
  • [51] S. S. Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res., 14(1):567–599, 2013.
  • [52] V. Favaudon, L. Caplier, V. Monceau, et al. Ultrahigh dose-rate FLASH irradiation increases the differential response between normal and tumor tissue in mice. Sci. Transl. Med., 6(245), 2014.
  • [53] Hans-Peter Wieser, Eduardo Cisternas, Niklas Wahl, Silke Ulrich, Alexander Stadler, Henning Mescher, Lucas-Raphael Müller, Thomas Klinge, Hubert Gabrys, Lucas Burigo, et al. Development of the open-source dose calculation and optimization toolkit matrad. Medical physics, 44(6):2556–2568, 2017.
  • [54] X. Zhang, D. Robertson, H. Li, S. Choi, A. K. Lee, X. R. Zhu, N. Sahoo, and M. T. Gillin. Intensity-modulated proton therapy treatment planning using single-field optimization: the impact of monitor unit constraints on plan quality. Med. Phys., 37(3):1210–1219, 2010.
  • [55] Y. Lin, B. Lin, S. Fu, M. Folkerts, E. Abel, J. Bradley and H. Gao. Sddro-joint: simultaneous dose and dose rate optimization with the joint use of transmission beams and bragg peaks for flash proton therapy. Physics in Medicine and Biology, 66(12):125011, 2021.
  • [56] Y. Liu and W. Yin. An Envelope for Davis–Yin Splitting and Strict Saddle-Point Avoidance. J. Optim. Theory Appl., 181:567–587, 2019.
  • [57] Y. Lou and M. Yan. Fast L1–L2 Minimization via a Proximal Operator. J. Sci. Comput., 74:767–785, 2018.
  • [58] Y.-N. Zhu, X. Zhang, Y. Lin, C. Lominska, and H. Gao. An orthogonal matching pursuit optimization method for solving minimum-monitor-unit problems: Applications to proton impt, arc and flash. Medical Physics, 2023.
  • [59] Z. Allen-Zhu and E. Hazan. Variance Reduction for Faster NonConvex Optimization. Proceedings of the 33-th International Conference on Machine Learning, (PMLR), 2016.

Appendix A Proofs of main theorems

A.1 Basic Lemmas

We introduce some basic lemmas that are required in the convergence analysis. The proofs of Lemma A.1 and Lemma A.2 can be found in reference [9] and are not detailed here.

Lemma A.1.

Let $a,b,c,d\in\mathbb{R}^{n}$. Then we have

\|2a-b-c-d\|^{2}-\|a-c-d\|^{2}=\|a-c\|^{2}-\|b-c\|^{2}+2\|a-b\|^{2}+2\langle d,b-a\rangle.
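For the reader's convenience, here is a direct check of this identity (our own sketch, not taken from [9]): writing $u:=a-c-d$, so that $2a-b-c-d=u+(a-b)$, we have

\|u+(a-b)\|^{2}-\|u\|^{2}=\|a-b\|^{2}+2\langle u,a-b\rangle=\|a-b\|^{2}+2\langle a-c,a-b\rangle+2\langle d,b-a\rangle,

and $2\langle a-c,a-b\rangle=\|a-c\|^{2}+\|a-b\|^{2}-\|b-c\|^{2}$; combining the two displays gives the stated identity.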
Lemma A.2.

Suppose that the function $F$ satisfies Assumption 3.1 $(a1)$. Then the sequence $\{(x_{t},y_{t},z_{t})\}_{t\geq 1}$ generated by Algorithm 1 satisfies

\|x_{t}-x_{t-1}\|\leq(1+\gamma L)\|y_{t+1}-y_{t}\|

and

\|y_{t+1}-z_{t}\|^{2}\leq\big[(-1+2\gamma l)+(1+\gamma L)^{2}\big]\|y_{t+1}-y_{t}\|^{2}.

Next, we present a crucial lemma in the convergence analysis of the STOS algorithm with both the unbiased estimator and the variance-reduced estimator.

Lemma A.3.

Suppose that the functions $F$, $G$, and $H$ satisfy Assumption 3.1. Let $\{(x_{t},y_{t},z_{t})\}_{t\geq 1}$ be a sequence generated by Algorithm 1. Then for all $t\geq 1$, we have

F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t+1}-y_{t+1}\rangle \quad (A.1)
\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.
  • Proof.

    From the proof of [9, Lemma 3.3], we have

    F(yt+1)+G(zt+1)+12γ2yt+1zt+1xt+1γ~H(yt+1)21γyt+1zt+12\displaystyle F(y_{t+1})+G(z_{t+1})+\frac{1}{2\gamma}\|2y_{t+1}-z_{t+1}-x_{t+1}-\gamma{\widetilde{\nabla}}H(y_{t+1})\|^{2}-\frac{1}{\gamma}\|y_{t+1}-z_{t+1}\|^{2} (A.2)
    12γxt+1yt+1+γ~H(yt+1)2\displaystyle\quad-\frac{1}{2\gamma}\|x_{t+1}-y_{t+1}+\gamma{\widetilde{\nabla}}H(y_{t+1})\|^{2}
    F(yt)+G(zt)+12γ2ytztxtγ~H(yt)212γxtyt+γ~H(yt)21γytzt2\displaystyle\leq F(y_{t})+G(z_{t})+\frac{1}{2\gamma}\|2y_{t}-z_{t}-x_{t}-\gamma{\widetilde{\nabla}}H(y_{t})\|^{2}-\frac{1}{2\gamma}\|x_{t}-y_{t}+\gamma{\widetilde{\nabla}}H(y_{t})\|^{2}-\frac{1}{\gamma}\|y_{t}-z_{t}\|^{2}
    +~H(yt+1),ztyt+1~H(yt),ztyt+1γyt+1zt212(1γl)yt+1yt2.\displaystyle\quad+\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle+\frac{1}{\gamma}\|y_{t+1}-z_{t}\|^{2}-\frac{1}{2}\left(\frac{1}{\gamma}-l\right)\|y_{t+1}-y_{t}\|^{2}.

Although the gradient ${\widetilde{\nabla}}H$ here is stochastic, the proof of the above inequality in [9] still holds. Next, we further estimate $\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle$. Notice that

    ~H(yt+1),ztyt+1~H(yt),ztyt\displaystyle\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle (A.3)
    =~H(yt+1)H(yt+1),ztyt+1~H(yt)H(yt),ztyt+H(yt),ytyt+1\displaystyle=\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle+\langle\nabla H(y_{t}),y_{t}-y_{t+1}\rangle
    +H(yt+1)H(yt),ztyt+1\displaystyle\quad+\langle\nabla H(y_{t+1})-\nabla H(y_{t}),z_{t}-y_{t+1}\rangle
    ~H(yt+1)H(yt+1),ztyt+1~H(yt)H(yt),ztyt\displaystyle\leq\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle
    +H(yt)H(yt+1)+β2yt+1yt2+H(yt+1)H(yt),ztyt+1\displaystyle\quad+H(y_{t})-H(y_{t+1})+\frac{\beta}{2}\|y_{t+1}-y_{t}\|^{2}+\langle\nabla H(y_{t+1})-\nabla H(y_{t}),z_{t}-y_{t+1}\rangle
    ~H(yt+1)H(yt+1),ztyt+1~H(yt)H(yt),ztyt\displaystyle\leq\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle
    +H(yt)H(yt+1)+βyt+1yt2+12yt+1zt2\displaystyle\quad+H(y_{t})-H(y_{t+1})+\beta\|y_{t+1}-y_{t}\|^{2}+\frac{1}{2}\|y_{t+1}-z_{t}\|^{2}

where we have used $(a3)$ in Assumption 3.1. Furthermore, using some basic inequalities, we get

    ~H(yt+1)H(yt+1),ztyt+1~H(yt)H(yt),ztyt\displaystyle\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle (A.4)
    =~H(yt+1)H(yt+1),ztyt+1~H(yt)H(yt),ztyt+1\displaystyle=\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t+1}\rangle
    ~H(yt)H(yt),yt+1yt\displaystyle\quad-\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),y_{t+1}-y_{t}\rangle
    12~H(yt+1)H(yt+1)2+12yt+1zt2+12~H(yt)H(yt)2+12yt+1zt2\displaystyle\leq\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\frac{1}{2}\|y_{t+1}-z_{t}\|^{2}+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{1}{2}\|y_{t+1}-z_{t}\|^{2}
    +12~H(yt)H(yt)2+12yt+1yt2\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{1}{2}\|y_{t+1}-y_{t}\|^{2}
    12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2+yt+1zt2+12yt+1yt2.\displaystyle\leq\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\|y_{t+1}-z_{t}\|^{2}+\frac{1}{2}\|y_{t+1}-y_{t}\|^{2}.

    Substituting (A.4) into (A.3), we get

    ~H(yt+1),ztyt+1~H(yt),ztyt\displaystyle\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t}-y_{t+1}\rangle-\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle (A.5)
    12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2+H(yt)H(yt+1)\displaystyle\leq\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+H(y_{t})-H(y_{t+1})
    +2β+12yt+1yt2+32yt+1zt2.\displaystyle\quad+\frac{2\beta+1}{2}\|y_{t+1}-y_{t}\|^{2}+\frac{3}{2}\|y_{t+1}-z_{t}\|^{2}.

    Combining (A.5) with (A.2), we have

    F(yt+1)+G(zt+1)+H(yt+1)+12γ2yt+1zt+1xt+1γ~H(yt+1)2\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|2y_{t+1}-z_{t+1}-x_{t+1}-\gamma{\widetilde{\nabla}}H(y_{t+1})\|^{2}
    12γyt+1xt+1γ~H(yt+1)21γyt+1zt+12\displaystyle\quad-\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}-\gamma{\widetilde{\nabla}}H(y_{t+1})\|^{2}-\frac{1}{\gamma}\|y_{t+1}-z_{t+1}\|^{2}
\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|2y_{t}-z_{t}-x_{t}-\gamma{\widetilde{\nabla}}H(y_{t})\|^{2}-\frac{1}{2\gamma}\|y_{t}-x_{t}-\gamma{\widetilde{\nabla}}H(y_{t})\|^{2}
    1γytzt2+12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2+3γ+22γyt+1zt2\displaystyle\quad-\frac{1}{\gamma}\|y_{t}-z_{t}\|^{2}+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

    Using Lemma A.1, we obtain

    F(yt+1)+G(zt+1)+H(yt+1)+12γyt+1xt+1212γzt+1xt+12+~H(yt+1),zt+1yt+1\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t+1}-y_{t+1}\rangle
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+~H(yt),ztyt+3γ+22γyt+1zt2\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2+γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

    This completes the proof. ∎

A.2 Convergence rates for unbiased gradient estimator

  • Proof of Theorem 3.5.

    From Lemma A.3, we have

    F(yt+1)+G(zt+1)+H(yt+1)+12γyt+1xt+1212γzt+1xt+12+~H(yt+1),zt+1yt+1\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t+1}),z_{t+1}-y_{t+1}\rangle
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+~H(yt),ztyt+3γ+22γyt+1zt2\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2+γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

    Since

    ~H(yt),ztyt=~H(yt)H(yt),ztyt+H(yt),ztyt,\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle=\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle,

    we have

    F(yt+1)+G(zt+1)+H(yt+1)+12γyt+1xt+1212γzt+1xt+12+H(yt+1),zt+1yt+1\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle (A.6)
    +~H(yt+1)H(yt+1),zt+1yt+1\displaystyle\quad+\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+H(yt),ztyt\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle
    +~H(yt)H(yt),ztyt+12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2\displaystyle\quad+\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}
    +3γ+22γyt+1zt2+γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

    Denote

\Phi^{inf}=\inf_{t\geq 1}\{F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle\}

    and

\delta_{t}=F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle-\Phi^{inf}.

    Then (A.6) can be rewritten as

    δt+1\displaystyle\delta_{t+1} δt+~H(yt)H(yt),ztyt~H(yt+1)H(yt+1),zt+1yt+1\displaystyle\leq\delta_{t}+\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle-\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle (A.7)
    +12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2+3γ+22γyt+1zt2\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

    Combining (A.7) with the estimate of yt+1zt2\|y_{t+1}-z_{t}\|^{2} in Lemma A.2, we have

    δt+1\displaystyle\delta_{t+1} δt+~H(yt)H(yt),ztyt~H(yt+1)H(yt+1),zt+1yt+1\displaystyle\leq\delta_{t}+\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle-\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle
    +12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}
    +((3γ+2)(γL2+2l+2L)2+γ(1+l+2β)12γ)yt+1yt2.\displaystyle\quad+\Big{(}\frac{(3\gamma+2)(\gamma L^{2}+2l+2L)}{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\Big{)}\|y_{t+1}-y_{t}\|^{2}.

Denoting $\Lambda(\gamma)=-\Big(\frac{(3\gamma+2)(\gamma L^{2}+2l+2L)}{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\Big)$, we have

    δt+1+Λ(γ)yt+1yt2\displaystyle\delta_{t+1}+\Lambda(\gamma)\|y_{t+1}-y_{t}\|^{2} δt+~H(yt)H(yt),ztyt~H(yt+1)H(yt+1),zt+1yt+1\displaystyle\leq\delta_{t}+\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle-\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle
    +12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2.\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}.

    Further, we have

    yt+1yt2\displaystyle\|y_{t+1}-y_{t}\|^{2} 1Λ(γ)[δtδt+1+~H(yt)H(yt),ztyt~H(yt+1)H(yt+1),zt+1yt+1\displaystyle\leq\frac{1}{\Lambda(\gamma)}\big{[}\delta_{t}-\delta_{t+1}+\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle-\langle{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle (A.8)
    +12~H(yt+1)H(yt+1)2+~H(yt)H(yt)2].\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}\big{]}.

    By the optimality conditions (3.1), we know

\frac{1}{\gamma}(y_{t+1}-z_{t+1})\in\partial G(z_{t+1})+\nabla F(y_{t+1})+{\widetilde{\nabla}}H(y_{t+1}).

    This implies that

    1γ(yt+1zt+1)+F(zt+1)F(yt+1)+H(zt+1)H(yt+1)\displaystyle\frac{1}{\gamma}(y_{t+1}-z_{t+1})+\nabla F(z_{t+1})-\nabla F(y_{t+1})+\nabla H(z_{t+1})-\nabla H(y_{t+1})
    +H(yt+1)~H(yt+1)G(zt+1)+F(zt+1)+H(zt+1).\displaystyle+\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1})\in\partial G(z_{t+1})+\nabla F(z_{t+1})+\nabla H(z_{t+1}).

    By Cauchy inequality, the update (1.4c) and Lemma A.2, we have

    dist2(0,G(zt+1)+F(zt+1)+H(zt+1))\displaystyle\mathrm{dist}^{2}\big{(}0,\partial G(z_{t+1})+\nabla F(z_{t+1})+\nabla H(z_{t+1})\big{)}
    12γ2yt+1zt+12+12F(zt+1)F(yt+1)2+12H(zt+1)H(yt+1)2\displaystyle\leq\frac{1}{2\gamma^{2}}\big{\|}y_{t+1}-z_{t+1}\big{\|}^{2}+\frac{1}{2}\big{\|}\nabla F(z_{t+1})-\nabla F(y_{t+1})\big{\|}^{2}+\frac{1}{2}\big{\|}\nabla H(z_{t+1})-\nabla H(y_{t+1})\big{\|}^{2}
    +12H(yt+1)~H(yt+1)2\displaystyle\quad+\frac{1}{2}\big{\|}\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1})\big{\|}^{2}
    1+γ2(L2+β2)2γ2zt+1yt+12+12H(yt+1)~H(yt+1)2\displaystyle\leq\frac{1+\gamma^{2}(L^{2}+\beta^{2})}{2\gamma^{2}}\big{\|}z_{t+1}-y_{t+1}\big{\|}^{2}+\frac{1}{2}\big{\|}\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1})\big{\|}^{2}
    [1+γ2(L2+β2)](1+γL)22γ2yt+1yt2+12H(yt+1)~H(yt+1)2.\displaystyle\leq\frac{\big{[}1+\gamma^{2}(L^{2}+\beta^{2})\big{]}(1+\gamma L)^{2}}{2\gamma^{2}}\big{\|}y_{t+1}-y_{t}\big{\|}^{2}+\frac{1}{2}\big{\|}\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1})\big{\|}^{2}.

Taking expectations on both sides (over $J_{t}$) and using (A.8), we have

    𝔼dist2(0,G(zt+1))+F(zt+1)+H(zt+1))\displaystyle\mathbb{E}\mathrm{dist}^{2}\big{(}0,\partial G(z_{t+1}))+\nabla F(z_{t+1})+\nabla H(z_{t+1})\big{)}
    [1+γ2(L2+β2)](1+γL)22γ2𝔼yt+1yt2+12𝔼H(yt+1)~H(yt+1)2\displaystyle\leq\frac{\big{[}1+\gamma^{2}(L^{2}+\beta^{2})\big{]}(1+\gamma L)^{2}}{2\gamma^{2}}\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+\frac{1}{2}\mathbb{E}\|\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1})\|^{2}
    [1+γ2(L2+β2)](1+γL)22γ2Λ(γ)(𝔼δt𝔼δt+1+12𝔼~H(yt+1)H(yt+1)2\displaystyle\leq\frac{\big{[}1+\gamma^{2}(L^{2}+\beta^{2})\big{]}(1+\gamma L)^{2}}{2\gamma^{2}\Lambda(\gamma)}\big{(}\mathbb{E}\delta_{t}-\mathbb{E}\delta_{t+1}+\frac{1}{2}\mathbb{E}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}
    +𝔼~H(yt)H(yt)2)+12𝔼H(yt+1)~H(yt+1)2.\displaystyle\quad+\mathbb{E}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}\big{)}+\frac{1}{2}\mathbb{E}\|\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1})\|^{2}\|.

    From [3, Lemma 3] and Assumption 3.2, we know

\mathbb{E}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}\leq\frac{\sigma^{2}}{b}.
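This is the standard mini-batch variance bound. Assuming, as in Assumption 3.2, that ${\widetilde{\nabla}}H(y_{t})$ averages $b$ independently sampled, unbiased component gradients (denoted $\nabla H_{i_{j}}$ here purely for illustration), each with variance at most $\sigma^{2}$, the cross terms vanish and

\mathbb{E}\Big\|\frac{1}{b}\sum_{j=1}^{b}\big(\nabla H_{i_{j}}(y_{t})-\nabla H(y_{t})\big)\Big\|^{2}=\frac{1}{b^{2}}\sum_{j=1}^{b}\mathbb{E}\big\|\nabla H_{i_{j}}(y_{t})-\nabla H(y_{t})\big\|^{2}\leq\frac{\sigma^{2}}{b}.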

    Thus, we have

    𝔼dist2(0,G(zt+1))+F(zt+1)+H(zt+1))\displaystyle\mathbb{E}\mathrm{dist}^{2}\big{(}0,\partial G(z_{t+1}))+\nabla F(z_{t+1})+\nabla H(z_{t+1})\big{)}
    [1+γ2(L2+β2)](1+γL)22γ2Λ(γ)(𝔼δt𝔼δt+1+32σ2b)+σ22b.\displaystyle\leq\frac{\big{[}1+\gamma^{2}(L^{2}+\beta^{2})\big{]}(1+\gamma L)^{2}}{2\gamma^{2}\Lambda(\gamma)}\big{(}\mathbb{E}\delta_{t}-\mathbb{E}\delta_{t+1}+\frac{3}{2}\frac{\sigma^{2}}{b}\big{)}+\frac{\sigma^{2}}{2b}.

Summing from $t=0$ to $T$ and choosing $b=T$, we get

    1Tt=0T𝔼dist2(0,G(zt+1)+F(zt+1)+H(zt+1))\displaystyle\frac{1}{T}\sum_{t=0}^{T}\mathbb{E}\,\mathrm{dist}^{2}(0,\partial G(z_{t+1})+\nabla F(z_{t+1})+\nabla H(z_{t+1}))
    1+γ2(L2+β2)2γ2Λ(γ)(1+γL)2T(δ0+3σ22)+σ22T.\displaystyle\leq\frac{1+\gamma^{2}(L^{2}+\beta^{2})}{2\gamma^{2}\Lambda(\gamma)}\frac{(1+\gamma L)^{2}}{T}\big{(}\delta_{0}+\frac{3\sigma^{2}}{2}\big{)}+\frac{\sigma^{2}}{2T}.

Taking $\tau$ uniformly at random from $\{1,2,\cdots,T\}$, we obtain

    𝔼τ𝔼dist2(0,G(zτ)+F(zτ)+H(zτ))\displaystyle\mathbb{E}_{\tau}\mathbb{E}\,\mathrm{dist}^{2}(0,\partial G(z_{\tau})+\nabla F(z_{\tau})+\nabla H(z_{\tau}))
    1+γ2(L2+β2)2γ2Λ(γ)(1+γL)2T(δ0+3σ22)+σ22T\displaystyle\leq\frac{1+\gamma^{2}(L^{2}+\beta^{2})}{2\gamma^{2}\Lambda(\gamma)}\frac{(1+\gamma L)^{2}}{T}\big{(}\delta_{0}+\frac{3\sigma^{2}}{2}\big{)}+\frac{\sigma^{2}}{2T}
    =𝒪(T1).\displaystyle=\mathcal{O}(T^{-1}).

    The proof of Theorem 3.5 is completed. ∎

A.3 Convergence rates for variance-reduced gradient estimator

We first recall a supermartingale convergence theorem; see [5] and [21] for details.

Lemma A.4 (Supermartingale Convergence Theorem).

Let $\mathbb{E}_{t}$ denote the expectation conditional on the first $t$ iterations of Algorithm 1. Let $\{U_{t}\}_{t=0}^{\infty}$ and $\{V_{t}\}_{t=0}^{\infty}$ be sequences of bounded non-negative random variables such that $U_{t}$ and $V_{t}$ depend only on the first $t$ iterations of Algorithm 1. If

\mathbb{E}_{t}U_{t+1}+V_{t}\leq U_{t},

then $\sum_{t=0}^{\infty}V_{t}<\infty$ a.s. and $U_{t}$ converges a.s.

  • Proof of Theorem 3.7.

    We will complete the proof of this theorem using Lemma A.3. According to Cauchy inequality, we get the following estimate

    ~H(yt),ztyt\displaystyle\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle =~H(yt)H(yt),ztyt+H(yt),ztyt\displaystyle=\langle{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t}),z_{t}-y_{t}\rangle+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle (A.9)
    12~H(yt)H(yt)2+12ztyt2+H(yt),ztyt.\displaystyle\leq\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{1}{2}\|z_{t}-y_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle.

    Combining (A.9) with (A.1) in Lemma A.3, we have

    F(yt+1)+G(zt+1)+H(yt+1)+12γyt+1xt+1212γzt+1xt+12+H(yt+1),zt+1yt+1\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle (A.10)
    12~H(yt+1)H(yt+1)2\displaystyle\quad-\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+~H(yt),ztyt\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle{\widetilde{\nabla}}H(y_{t}),z_{t}-y_{t}\rangle
    12~H(yt)H(yt)2+H(yt+1)~H(yt+1),zt+1yt+1+32~H(yt)H(yt)2\displaystyle\quad-\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\langle\nabla H(y_{t+1})-{\widetilde{\nabla}}H(y_{t+1}),z_{t+1}-y_{t+1}\rangle+\frac{3}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}
    +3γ+22γyt+1zt2+γ(1+l+2β)12γyt+1yt2\displaystyle\quad+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+12ztyt2+H(yt),ztyt\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\frac{1}{2}\|z_{t}-y_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle
    +12~H(yt+1)H(yt+1)2+12zt+1yt+12+32~H(yt)H(yt)2+3γ+22γyt+1zt2\displaystyle\quad+\frac{1}{2}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\frac{1}{2}\|z_{t+1}-y_{t+1}\|^{2}+\frac{3}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

    Simplifying the equation (A.10), we get

    F(yt+1)+G(zt+1)+H(yt+1)+12γyt+1xt+1212γzt+1xt+12+H(yt+1),zt+1yt+1\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+12ztyt2+H(yt),ztyt\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\frac{1}{2}\|z_{t}-y_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle
    +~H(yt+1)H(yt+1)2+12zt+1yt+12+32~H(yt)H(yt)2+3γ+22γyt+1zt2\displaystyle\quad+\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\frac{1}{2}\|z_{t+1}-y_{t+1}\|^{2}+\frac{3}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +γ(1+l+2β)12γyt+1yt2.\displaystyle\quad+\frac{\gamma(1+l+2\beta)-1}{2\gamma}\|y_{t+1}-y_{t}\|^{2}.

For any $C_{1}>0$, adding $C_{1}\|y_{t+1}-y_{t}\|^{2}$ to both sides yields

    F(yt+1)+G(zt+1)+H(yt+1)+12γyt+1xt+1212γzt+1xt+12+H(yt+1),zt+1yt+1\displaystyle F(y_{t+1})+G(z_{t+1})+H(y_{t+1})+\frac{1}{2\gamma}\|y_{t+1}-x_{t+1}\|^{2}-\frac{1}{2\gamma}\|z_{t+1}-x_{t+1}\|^{2}+\langle\nabla H(y_{t+1}),z_{t+1}-y_{t+1}\rangle (A.11)
    12zt+1yt+12+C1yt+1yt2\displaystyle\quad-\frac{1}{2}\|z_{t+1}-y_{t+1}\|^{2}+C_{1}\|y_{t+1}-y_{t}\|^{2}
    F(yt)+G(zt)+H(yt)+12γytxt212γztxt2+H(yt),ztyt+C1ytyt12\displaystyle\leq F(y_{t})+G(z_{t})+H(y_{t})+\frac{1}{2\gamma}\|y_{t}-x_{t}\|^{2}-\frac{1}{2\gamma}\|z_{t}-x_{t}\|^{2}+\langle\nabla H(y_{t}),z_{t}-y_{t}\rangle+C_{1}\|y_{t}-y_{t-1}\|^{2}
    12ztyt2+ztyt2+32~H(yt)H(yt)2+3γ+22γyt+1zt2\displaystyle\quad-\frac{1}{2}\|z_{t}-y_{t}\|^{2}+\|z_{t}-y_{t}\|^{2}+\frac{3}{2}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\|y_{t+1}-z_{t}\|^{2}
    +(γ(1+l+2β)12γ+C1)yt+1yt2+~H(yt+1)H(yt+1)2C1ytyt12.\displaystyle\quad+\big{(}\frac{\gamma(1+l+2\beta)-1}{2\gamma}+C_{1}\big{)}\|y_{t+1}-y_{t}\|^{2}+\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}-C_{1}\|y_{t}-y_{t-1}\|^{2}.

Recalling the definition of $\Phi_{t}$ in (3.3) and taking the conditional expectation with respect to $x_{t}$ on both sides of (A.11) yields

    𝔼tΦt+1\displaystyle\mathbb{E}_{t}\Phi_{t+1} Φt+ztyt2+32𝔼t~H(yt)H(yt)2+3γ+22γ𝔼tyt+1zt2\displaystyle\leq\Phi_{t}+\|z_{t}-y_{t}\|^{2}+\frac{3}{2}\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-z_{t}\|^{2}
    +γ(1+l+2β)1+2γC12γ𝔼tyt+1yt2+𝔼t~H(yt+1)H(yt+1)2C1ytyt12.\displaystyle+\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}-C_{1}\|y_{t}-y_{t-1}\|^{2}.

    This is equivalent to

    𝔼t[Φt+1~H(yt+1)H(yt+1)2]\displaystyle\mathbb{E}_{t}[\Phi_{t+1}-\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}] (A.12)
    Φt𝔼t~H(yt)H(yt)2+ztyt2+52𝔼t~H(yt)H(yt)2+3γ+22γ𝔼tyt+1zt2\displaystyle\leq\Phi_{t}-\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\|z_{t}-y_{t}\|^{2}+\frac{5}{2}\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-z_{t}\|^{2}
    +γ(1+l+2β)1+2γC12γ𝔼tyt+1yt2C1ytyt12.\displaystyle\quad+\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}-C_{1}\|y_{t}-y_{t-1}\|^{2}.

    According to (2.3), we have

\Upsilon_{t}\leq\frac{1}{\rho}\Upsilon_{t}-\frac{1}{\rho}\mathbb{E}_{t}\Upsilon_{t+1}+\frac{V_{\Upsilon}}{\rho}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+\frac{V_{\Upsilon}}{\rho}\|y_{t}-y_{t-1}\|^{2}. \quad (A.13)
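For clarity, (A.13) is a rearrangement of the geometric decay property: assuming (2.3) takes the form $\mathbb{E}_{t}\Upsilon_{t+1}\leq(1-\rho)\Upsilon_{t}+V_{\Upsilon}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+V_{\Upsilon}\|y_{t}-y_{t-1}\|^{2}$ (the same form is used again in the proof of Theorem 3.9 below), moving $\mathbb{E}_{t}\Upsilon_{t+1}$ to the left-hand side gives

\rho\,\Upsilon_{t}\leq\Upsilon_{t}-\mathbb{E}_{t}\Upsilon_{t+1}+V_{\Upsilon}\,\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+V_{\Upsilon}\,\|y_{t}-y_{t-1}\|^{2},

and dividing by $\rho$ yields (A.13).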

    Combining (A.13) with (A.12), we get

    𝔼t[Φt+1~H(yt+1)H(yt+1)2]\displaystyle\mathbb{E}_{t}[\Phi_{t+1}-\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}] (A.14)
    Φt𝔼t~H(yt)H(yt)2+3γ+22γ𝔼tyt+1zt2+γ(1+l+2β)1+2γC12γ𝔼tyt+1yt2\displaystyle\leq\Phi_{t}-\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-z_{t}\|^{2}+\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}
    +52Υt+5V12𝔼tyt+1yt2+5V12ytyt12+ztyt2C1ytyt12\displaystyle\quad+\frac{5}{2}\Upsilon_{t}+\frac{5V_{1}}{2}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+\frac{5V_{1}}{2}\|y_{t}-y_{t-1}\|^{2}+\|z_{t}-y_{t}\|^{2}-C_{1}\|y_{t}-y_{t-1}\|^{2}
    Φt𝔼t~H(yt)H(yt)2+3γ+22γ𝔼tyt+1zt2+γ(1+l+2β)1+2γC12γ𝔼tyt+1yt2\displaystyle\leq\Phi_{t}-\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{3\gamma+2}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-z_{t}\|^{2}+\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}
    +52ρΥt52ρ𝔼tΥt+1+5VΥ2ρ𝔼tyt+1yt2+5VΥ2ρytyt12+5V12𝔼tyt+1yt2\displaystyle\quad+\frac{5}{2\rho}\Upsilon_{t}-\frac{5}{2\rho}\mathbb{E}_{t}\Upsilon_{t+1}+\frac{5V_{\Upsilon}}{2\rho}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+\frac{5V_{\Upsilon}}{2\rho}\|y_{t}-y_{t-1}\|^{2}+\frac{5V_{1}}{2}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}
    +5V12ytyt12+ztyt2C1ytyt12.\displaystyle\quad+\frac{5V_{1}}{2}\|y_{t}-y_{t-1}\|^{2}+\|z_{t}-y_{t}\|^{2}-C_{1}\|y_{t}-y_{t-1}\|^{2}.

    We rearrange (A.14) as

    𝔼t[Φt+1~H(yt+1)H(yt+1)2+52ρΥt+1+5V1ρ+5VΥ2ρyt+1yt2]\displaystyle\mathbb{E}_{t}[\Phi_{t+1}-\|{\widetilde{\nabla}}H(y_{t+1})-\nabla H(y_{t+1})\|^{2}+\frac{5}{2\rho}\Upsilon_{t+1}+\frac{5V_{1}\rho+5V_{\Upsilon}}{2\rho}\|y_{t+1}-y_{t}\|^{2}]
    Φt𝔼t~H(yt)H(yt)2+52ρΥt+5V1ρ+5VΥ2ρytyt12C1ytyt12\displaystyle\leq\Phi_{t}-\mathbb{E}_{t}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{5}{2\rho}\Upsilon_{t}+\frac{5V_{1}\rho+5V_{\Upsilon}}{2\rho}\|y_{t}-y_{t-1}\|^{2}-C_{1}\|y_{t}-y_{t-1}\|^{2}
    +3γ+22γ𝔼tyt+1zt2+(γ(1+l+2β)1+2γC12γ+5V1ρ+5VΥρ)𝔼tyt+1yt2+ztyt2.\displaystyle\quad+\frac{3\gamma+2}{2\gamma}\mathbb{E}_{t}\|y_{t+1}-z_{t}\|^{2}+\big{(}\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}+\frac{5V_{1}\rho+5V_{\Upsilon}}{\rho}\big{)}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|^{2}+\|z_{t}-y_{t}\|^{2}.

Recalling the definition of $\Psi_{t}$ in (3.4) and taking the expectation on both sides of the preceding inequality yields

    𝔼Ψt+1\displaystyle\mathbb{E}\Psi_{t+1} 𝔼Ψt+(γ(1+l+2β)1+2γC12γ+5V1ρ+5VΥρ)𝔼yt+1yt2+𝔼ztyt2\displaystyle\leq\mathbb{E}\Psi_{t}+\big{(}\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}+\frac{5V_{1}\rho+5V_{\Upsilon}}{\rho}\big{)}\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+\mathbb{E}\|z_{t}-y_{t}\|^{2} (A.15)
    +3γ+22γ𝔼yt+1zt2C1𝔼ytyt12.\displaystyle\quad+\frac{3\gamma+2}{2\gamma}\mathbb{E}\|y_{t+1}-z_{t}\|^{2}-C_{1}\mathbb{E}\|y_{t}-y_{t-1}\|^{2}.

    According to Lemma A.2, inequality (A.15) can be rewritten as

    𝔼Ψt+1\displaystyle\mathbb{E}\Psi_{t+1} 𝔼Ψt+(γ(1+l+2β)1+2γC12γ+5V1ρ+5VΥρ\displaystyle\leq\mathbb{E}\Psi_{t}+\Big{(}\frac{\gamma(1+l+2\beta)-1+2\gamma C_{1}}{2\gamma}+\frac{5V_{1}\rho+5V_{\Upsilon}}{\rho}
    +3γ+22γ((1+2γl)+(1+γL)2)+(1+γL)2)𝔼yt+1yt2C1𝔼ytyt12.\displaystyle\quad+\frac{3\gamma+2}{2\gamma}((-1+2\gamma l)+(1+\gamma L)^{2})+(1+\gamma L)^{2}\Big{)}\mathbb{E}\|y_{t+1}-y_{t}\|^{2}-C_{1}\mathbb{E}\|y_{t}-y_{t-1}\|^{2}.

    Denote

\Lambda_{1}(\gamma):=\frac{1-\gamma(1+l+2\beta)-2\gamma C_{1}}{2\gamma}-\frac{5V_{1}\rho+5V_{\Upsilon}}{\rho}-\frac{3\gamma+2}{2\gamma}\big((-1+2\gamma l)+(1+\gamma L)^{2}\big)-(1+\gamma L)^{2}.

    Then we have

\mathbb{E}\Psi_{t+1}\leq\mathbb{E}\Psi_{t}-\Lambda_{1}(\gamma)\mathbb{E}\|y_{t+1}-y_{t}\|^{2}-C_{1}\mathbb{E}\|y_{t}-y_{t-1}\|^{2}. \quad (A.16)

When $\Lambda_{1}(\gamma)>0$, (A.16) shows that the sequence $\{\mathbb{E}\Psi_{t}\}$ is decreasing. Further, according to Lemma A.4, we know

\sum_{t=0}^{\infty}\|y_{t+1}-y_{t}\|^{2}<\infty\quad a.s.

Again, according to Lemma A.2 and (1.4c), we have

\sum_{t=0}^{\infty}\|x_{t+1}-x_{t}\|^{2}<\infty\quad a.s.\quad\text{and}\quad\sum_{t=0}^{\infty}\|z_{t+1}-y_{t+1}\|^{2}<\infty\quad a.s.

We have thus shown that the squared norms of the successive differences are almost surely summable. However, we further need to show that the norms themselves are summable. To do so, we first provide a bound on the norm of the subgradient $\partial\Phi(X_{t})$.

Lemma A.5.

Suppose that Assumption 3.1 holds and $H$ is twice continuously differentiable with a bounded Hessian, i.e., there exists $M>0$ such that $\|\nabla^{2}H(y)\|<M$ for any $y$. Let $\{(x_{t},y_{t},z_{t})\}_{t\geq 0}$ be a sequence generated by Algorithm 1 using variance-reduced gradient estimators. Denote $X_{t}:=(x_{t},y_{t},z_{t},y_{t-1})$. Then, for any $t\geq 1$, we have

\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))\leq C\big(\mathbb{E}\|y_{t+1}-y_{t}\|+\mathbb{E}\|y_{t}-y_{t-1}\|\big)+\mathbb{E}\Gamma_{t},

where $C=\max\{(M+1+\frac{\beta}{\gamma})(1+\gamma L)+V_{2},\,2C_{1}+V_{2}\}$. Moreover, for any critical point $X_{*}=(x_{*},y_{*},z_{*},y_{*})$, we have $\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{*}))=0$.

  • Proof.

    First, we denote

\xi_{x}^{t}=\frac{1}{\gamma}(z_{t}-y_{t}),
\xi_{y}^{t}=\frac{1}{\gamma}(x_{t-1}-x_{t})+\frac{1}{\gamma}(z_{t}-y_{t})+\langle\nabla^{2}H(y_{t}),z_{t}-y_{t}\rangle,
\xi_{z}^{t}=(1+\frac{2}{\gamma})(y_{t}-z_{t})+\frac{1}{\gamma}(x_{t}-x_{t-1})+\nabla H(y_{t})-{\widetilde{\nabla}}H(y_{t}),
\xi_{y^{\prime}}^{t}=-2C_{1}(y_{t}-y_{t-1}).

According to the definition of $\Phi$ in (3.3), we can compute

    xΦ(Xt)=1γ(ytxt)+1γ(ztxt)=1γ(ztyt),\displaystyle\partial_{x}\Phi(X_{t})=-\frac{1}{\gamma}(y_{t}-x_{t})+\frac{1}{\gamma}(z_{t}-x_{t})=\frac{1}{\gamma}(z_{t}-y_{t}), (A.17a)
    zΦ(Xt)=G(zt)1γ(ztxt)+~H(yt)(ztyt),\displaystyle\partial_{z}\Phi(X_{t})=\partial G(z_{t})-\frac{1}{\gamma}(z_{t}-x_{t})+{\widetilde{\nabla}}H(y_{t})-(z_{t}-y_{t}), (A.17b)
    yΦ(Xt)=2C1(ytyt1),\displaystyle\partial_{y^{\prime}}\Phi(X_{t})=-2C_{1}(y_{t}-y_{t-1}), (A.17c)

    and

    yΦ(Xt)\displaystyle\partial_{y}\Phi(X_{t}) =F(yt)+H(yt)+1γ(ytxt)+(ztyt)+2H(yt),ztytH(yt)\displaystyle=\nabla F(y_{t})+\nabla H(y_{t})+\frac{1}{\gamma}(y_{t}-x_{t})+(z_{t}-y_{t})+\langle\nabla^{2}H(y_{t}),z_{t}-y_{t}\rangle-\nabla H(y_{t})
    =F(yt)+1γ(ztxt)+2H(yt),ztyt.\displaystyle=\nabla F(y_{t})+\frac{1}{\gamma}(z_{t}-x_{t})+\langle\nabla^{2}H(y_{t}),z_{t}-y_{t}\rangle.

Since $\frac{1}{\gamma}(2y_{t}-z_{t}-\gamma{\widetilde{\nabla}}H(y_{t})-x_{t-1})\in\partial G(z_{t})$, we have

(1+\frac{2}{\gamma})(y_{t}-z_{t})+\frac{1}{\gamma}(x_{t}-x_{t-1})+\nabla H(y_{t})-{\widetilde{\nabla}}H(y_{t})\in\partial_{z}\Phi(X_{t}).

By the optimality condition $0=\nabla F(y_{t})+\frac{1}{\gamma}(y_{t}-x_{t-1})$, we know

\partial_{y}\Phi(X_{t})=\frac{1}{\gamma}(x_{t-1}-x_{t})+\frac{1}{\gamma}(z_{t}-y_{t})+\langle\nabla^{2}H(y_{t}),z_{t}-y_{t}\rangle.

Then, we have $(\xi_{x}^{t},\xi_{y}^{t},\xi_{z}^{t},\xi_{y^{\prime}}^{t})\in\partial\Phi(X_{t})$. By further estimates, we get

    ξxt\displaystyle\|\xi_{x}^{t}\| =1γztyt=1γxtxt11+γLγyt+1yt,\displaystyle=\frac{1}{\gamma}\|z_{t}-y_{t}\|=\frac{1}{\gamma}\|x_{t}-x_{t-1}\|\leq\frac{1+\gamma L}{\gamma}\|y_{t+1}-y_{t}\|, (A.18a)
    ξyt\displaystyle\|\xi_{y}^{t}\| 1γxtxt1+1γztyt+2H(yt)ztyt,\displaystyle\leq\frac{1}{\gamma}\|x_{t}-x_{t-1}\|+\frac{1}{\gamma}\|z_{t}-y_{t}\|+\|\nabla^{2}H(y_{t})\|\|z_{t}-y_{t}\|, (A.18b)
    (2γ+M)xtxt1(2γ+M)(1+γL)yt+1yt,\displaystyle\leq(\frac{2}{\gamma}+M)\|x_{t}-x_{t-1}\|\leq(\frac{2}{\gamma}+M)(1+\gamma L)\|y_{t+1}-y_{t}\|, (A.18c)
    ξzt\displaystyle\|\xi_{z}^{t}\| (1+2γ)ztyt+1γxtxt1+H(yt)~H(yt),\displaystyle\leq(1+\frac{2}{\gamma})\|z_{t}-y_{t}\|+\frac{1}{\gamma}\|x_{t}-x_{t-1}\|+\|\nabla H(y_{t})-{\widetilde{\nabla}}H(y_{t})\|, (A.18d)
    (1+3γ)xtxt1+H(yt)~H(yt),\displaystyle\leq(1+\frac{3}{\gamma})\|x_{t}-x_{t-1}\|+\|\nabla H(y_{t})-{\widetilde{\nabla}}H(y_{t})\|, (A.18e)
    (1+3γ)(1+γL)yt+1yt+H(yt)~H(yt),\displaystyle\leq(1+\frac{3}{\gamma})(1+\gamma L)\|y_{t+1}-y_{t}\|+\|\nabla H(y_{t})-{\widetilde{\nabla}}H(y_{t})\|, (A.18f)
    ξyt\displaystyle\|\xi_{y^{\prime}}^{t}\| =2C1ytyt1.\displaystyle=2C_{1}\|y_{t}-y_{t-1}\|. (A.18g)

    Then we have the following estimate

    𝔼t(ξxt,ξyt,ξzt,ξyt)\displaystyle\mathbb{E}_{t}\|(\xi_{x}^{t},\xi_{y}^{t},\xi_{z}^{t},\xi_{y^{\prime}}^{t})\|
    (M+1+βγ)(1+γL)𝔼tyt+1yt+2C1ytyt1+𝔼tH(yt)~H(yt)\displaystyle\leq(M+1+\frac{\beta}{\gamma})(1+\gamma L)\mathbb{E}_{t}\|y_{t+1}-y_{t}\|+2C_{1}\|y_{t}-y_{t-1}\|+\mathbb{E}_{t}\|\nabla H(y_{t})-{\widetilde{\nabla}}H(y_{t})\|
    ((M+1+βγ)(1+γL)+V2)𝔼tyt+1yt+(2C1+V2)ytyt1+Γt\displaystyle\leq\Big{(}(M+1+\frac{\beta}{\gamma})(1+\gamma L)+V_{2}\Big{)}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|+(2C_{1}+V_{2})\|y_{t}-y_{t-1}\|+\Gamma_{t}
    C[𝔼tyt+1yt+ytyt1]+Γt,\displaystyle\leq C\big{[}\mathbb{E}_{t}\|y_{t+1}-y_{t}\|+\|y_{t}-y_{t-1}\|\big{]}+\Gamma_{t},

where $C=\max\{(M+1+\frac{\beta}{\gamma})(1+\gamma L)+V_{2},\,2C_{1}+V_{2}\}$. Because $\mathrm{dist}(0,\partial\Phi(X_{t}))\leq\|(\xi_{x}^{t},\xi_{y}^{t},\xi_{z}^{t},\xi_{y^{\prime}}^{t})\|$, we get

\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))\leq C\big(\mathbb{E}\|y_{t+1}-y_{t}\|+\mathbb{E}\|y_{t}-y_{t-1}\|\big)+\mathbb{E}\Gamma_{t}.

Furthermore, by the convergence property of the gradient estimator, we have $\lim_{t\to\infty}\mathbb{E}\|(\xi_{x}^{t},\xi_{y}^{t},\xi_{z}^{t},\xi_{y^{\prime}}^{t})\|=0$. Then

\lim_{t\to\infty}\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))=0. \quad (A.19)

Suppose that $X_{*}=(x_{*},y_{*},z_{*},y_{*})$ is any cluster point of $X_{t}:=\{x_{t},y_{t},z_{t},y_{t-1}\}_{t\geq 1}$. Then there exists a subsequence $\{X_{t_{k}}\}$ satisfying $\lim_{k\to\infty}X_{t_{k}}=X_{*}$. By the subproblem (1.4b) in Algorithm 1, for any $k\geq 1$, we have

    G(ztk)+12γztk(2ytkγ~H(ytk)xtk1)2\displaystyle G(z_{t_{k}})+\frac{1}{2\gamma}\|z_{t_{k}}-(2y_{t_{k}}-\gamma{\widetilde{\nabla}}H(y_{t_{k}})-x_{t_{k}-1})\|^{2}
    G(z)+12γz(2ytkγ~H(ytk)xtk1)2.\displaystyle\leq G(z_{*})+\frac{1}{2\gamma}\|z_{*}-(2y_{t_{k}}-\gamma{\widetilde{\nabla}}H(y_{t_{k}})-x_{t_{k}-1})\|^{2}.

    This is equivalent to

G(z_{t_{k}})+\frac{1}{2\gamma}\Big(\|z_{t_{k}}\|^{2}-\|z_{*}\|^{2}-2\langle z_{t_{k}}-z_{*},2y_{t_{k}}-x_{t_{k}-1}-\gamma{\widetilde{\nabla}}H(y_{t_{k}})\rangle\Big)\leq G(z_{*}),

which means $\limsup_{k\to\infty}G(z_{t_{k}})\leq G(z_{*})$. Because $G$ is a lower semi-continuous function, we know $\liminf_{k\to\infty}G(z_{t_{k}})\geq G(z_{*})$, and thus $\lim_{k\to\infty}G(z_{t_{k}})=G(z_{*})$. Further, by the continuity of $F$ and $H$, we get

\lim_{k\to\infty}\Phi(x_{t_{k}},y_{t_{k}},z_{t_{k}},y_{t_{k}-1})=\Phi(x_{*},y_{*},z_{*},y_{*})=\Phi(X_{*}).

    Combined with (A.19), we have

\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{*}))=0.

Next, we present the random version of the KL property, which is taken from Lemma 4.5 in [7].

Theorem A.6.

Let $\{X_{t}\}_{t=0}^{\infty}$ be a bounded sequence of iterates of Algorithm 1 using a variance-reduced gradient estimator, and suppose that $X_{t}$ is not a critical point after a finite number of iterations. Let $\Phi$ be a semialgebraic function satisfying the Kurdyka–Łojasiewicz property (see Definition 2.11) with exponent $\theta$. Then there exist an index $m$ and a desingularizing function $\varphi=ar^{1-\theta}$ such that the following holds almost surely:

\varphi^{\prime}(\mathbb{E}[\Phi(X_{t})-\Phi_{t}^{*}])\,\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))\geq 1,\quad\forall t>m,

where $\Phi_{t}^{*}$ is a non-decreasing sequence converging to $\mathbb{E}\Phi(X_{*})$ for some $X_{*}\in\Omega$, where $\Omega$ is the set of cluster points of $\{X_{t}\}_{t\geq 0}$.

In the following, we establish the convergence of STOS (Algorithm 1) under the assumption that the objective functions $F$, $G$, and $H$ are semi-algebraic.

  • Proof of Theorem 3.9.

If $\theta\in(0,\frac{1}{2})$, then $\Phi$ also satisfies the KL property with exponent $\frac{1}{2}$, so we only consider the case $\theta\in[\frac{1}{2},1)$. By Theorem A.6 there exists a function $\varphi_{0}(r)=ar^{1-\theta}$ such that, almost surely,

\varphi_{0}^{\prime}(\mathbb{E}[\Phi(X_{t})-\Phi_{t}^{*}])\,\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))\geq 1,\quad\forall t>m.

Using the bound on $\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))$ in Lemma A.5 and Jensen's inequality, we have

\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))\leq C\left(\mathbb{E}\|y_{t+1}-y_{t}\|+\mathbb{E}\|y_{t}-y_{t-1}\|\right)+\mathbb{E}\Gamma_{t} \quad (A.20)
\leq C\left(\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}\right)+\sqrt{s\,\mathbb{E}\Upsilon_{t}}.

Because $\Gamma_{t}=\sum_{i=1}^{s}\|v_{t}^{i}\|$ for some vectors $v_{t}^{i}$, we have $\mathbb{E}\Gamma_{t}=\mathbb{E}\sum_{i=1}^{s}\|v_{t}^{i}\|\leq\mathbb{E}\sqrt{s\sum_{i=1}^{s}\|v_{t}^{i}\|^{2}}\leq\sqrt{s\,\mathbb{E}\Upsilon_{t}}$. By the geometric decay property in Definition 2.5, we can bound the term $\sqrt{\mathbb{E}\Upsilon_{t}}$:

    𝔼Υt+1\displaystyle\sqrt{\mathbb{E}\Upsilon_{t+1}} (1ρ)𝔼Υt+VΥ𝔼yt+1yt2+VΥ𝔼ytyt12\displaystyle\leq\sqrt{(1-\rho)\mathbb{E}\Upsilon_{t}+V_{\Upsilon}\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+V_{\Upsilon}\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
    1ρ𝔼Υt+VΥ𝔼yt+1yt2+VΥ𝔼ytyt12\displaystyle\leq\sqrt{1-\rho}\sqrt{\mathbb{E}\Upsilon_{t}}+\sqrt{V_{\Upsilon}}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{V_{\Upsilon}}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
    (1ρ2)𝔼Υt+VΥ𝔼yt+1yt2+VΥ𝔼ytyt12.\displaystyle\leq(1-\frac{\rho}{2})\sqrt{\mathbb{E}\Upsilon_{t}}+\sqrt{V_{\Upsilon}}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{V_{\Upsilon}}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}.

Rearranging and multiplying by $\sqrt{s}$ yields

    s𝔼Υt\displaystyle\sqrt{s\mathbb{E}\Upsilon_{t}} 2sρ(𝔼Υt𝔼Υt+1)+2sVΥρ𝔼yt+1yt2+2sVΥρ𝔼ytyt12.\displaystyle\leq\frac{2\sqrt{s}}{\rho}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})+\frac{2\sqrt{s}\sqrt{V_{\Upsilon}}}{\rho}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\frac{2\sqrt{s}\sqrt{V_{\Upsilon}}}{\rho}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}. (A.21)

    Substituting (A.21) into (A.20), we get

    𝔼dist(0,Φ(Xt))\displaystyle\mathbb{E}\mathrm{dist}(0,\partial\Phi(X_{t})) (A.22)
    C𝔼yt+1yt2+C𝔼ytyt12+2sρ(𝔼Υt𝔼Υt+1)+2sVΥρ𝔼yt+1yt2\displaystyle\leq C\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+C\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}+\frac{2\sqrt{s}}{\rho}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}
    +2sVΥρ𝔼ytyt12\displaystyle\quad+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
    (C+2sVΥρ)𝔼yt+1yt2+(C+2sVΥρ)𝔼ytyt12+2sρ(𝔼Υt𝔼Υt+1)\displaystyle\leq(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho})\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho})\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}+\frac{2\sqrt{s}}{\rho}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})
    (C+2sVΥρ)(𝔼yt+1yt2+𝔼ytyt12)+2sρ(𝔼Υt𝔼Υt+1).\displaystyle\leq(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho})(\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}})+\frac{2\sqrt{s}}{\rho}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}}).

Define $C_{t}=(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho})(\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}})+\frac{2\sqrt{s}}{\rho}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})$. Then we have $\mathbb{E}\,\mathrm{dist}(0,\partial\Phi(X_{t}))\leq C_{t}$. Therefore, $\varphi_{0}^{\prime}(\mathbb{E}[\Phi(X_{t})-\Phi_{t}^{*}])\,C_{t}\geq 1$ for $t>m$. Using the definition of $\varphi_{0}$, we can rewrite this inequality as

    a(1θ)𝐂𝐭(𝔼[Φ(Xt)Φt])θ1,t>m.\frac{a(1-\theta){\bf C_{t}}}{(\mathbb{E}[\Phi(X_{t})-\Phi_{t}^{*}])^{\theta}}\geq 1,\,\,\,\,\,\forall t>m.
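
This uses the explicit form $\varphi_{0}(r)=ar^{1-\theta}$ of the desingularizing function, whose derivative is

\varphi_{0}^{\prime}(r)=a(1-\theta)r^{-\theta},\qquad r>0.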

Next we prove that the above inequality holds for $\Psi_{t}$, for which an additional term $\mathcal{O}\big((\mathbb{E}[\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\Upsilon_{t}+\|y_{t}-y_{t-1}\|^{2}])^{\theta}\big)$ needs to be added to the denominator. Since ${\bf C_{t}}\geq\mathcal{O}\big(\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}+\sqrt{\mathbb{E}\Upsilon_{t-1}}\big)$, $\mathbb{E}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}\to 0$, $\mathbb{E}\Upsilon_{t}\to 0$, $\mathbb{E}\|y_{t}-y_{t-1}\|\to 0$, and $\theta>\frac{1}{2}$, there exist an index $m$ and a constant $c>0$ such that

    𝔼[~H(yt)H(yt)2+52ρΥt+5VΥ+5V1ρ2ρytyt12]\displaystyle\mathbb{E}\big{[}-\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{5}{2\rho}\Upsilon_{t}+\frac{5V_{\Upsilon}+5V_{1}\rho}{2\rho}\|y_{t}-y_{t-1}\|^{2}\big{]}
    𝔼[~H(yt)H(yt)2+52ρΥt+5VΥ+5V1ρ2ρytyt12]\displaystyle\leq\mathbb{E}\big{[}\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{5}{2\rho}\Upsilon_{t}+\frac{5V_{\Upsilon}+5V_{1}\rho}{2\rho}\|y_{t}-y_{t-1}\|^{2}\big{]}
    𝔼[Υt+V1yt+1yt2+V1ytyt12+52ρΥt+5VΥ+5V1ρ2ρytyt12]\displaystyle\leq\mathbb{E}\big{[}\Upsilon_{t}+V_{1}\|y_{t+1}-y_{t}\|^{2}+V_{1}\|y_{t}-y_{t-1}\|^{2}+\frac{5}{2\rho}\Upsilon_{t}+\frac{5V_{\Upsilon}+5V_{1}\rho}{2\rho}\|y_{t}-y_{t-1}\|^{2}\big{]}
    𝔼[(1+52ρ)Υt+V1yt+1yt2+(V1+5VΥ+5V1ρ2ρ)ytyt12]\displaystyle\leq\mathbb{E}\big{[}(1+\frac{5}{2\rho})\Upsilon_{t}+V_{1}\|y_{t+1}-y_{t}\|^{2}+\left(V_{1}+\frac{5V_{\Upsilon}+5V_{1}\rho}{2\rho}\right)\|y_{t}-y_{t-1}\|^{2}\big{]}
    𝒪(𝔼[Υt+yt+1yt2+ytyt12])\displaystyle\leq\mathcal{O}\big{(}\mathbb{E}[\Upsilon_{t}+\|y_{t+1}-y_{t}\|^{2}+\|y_{t}-y_{t-1}\|^{2}]\big{)}
    c𝐂𝐭1θ,t>m.\displaystyle\leq c{\bf C_{t}}^{\frac{1}{\theta}},\,\,\,\forall t>m.
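
The existence of such $m$ and $c$ relies on the elementary fact that

u\leq u^{\frac{1}{2\theta}}\quad\text{for all }u\in[0,1]\text{ and }\theta\geq\tfrac{1}{2},

applied to the vanishing quantities $\mathbb{E}\|y_{t+1}-y_{t}\|^{2}$, $\mathbb{E}\|y_{t}-y_{t-1}\|^{2}$ and $\mathbb{E}\Upsilon_{t}$, together with the lower bound on ${\bf C_{t}}$ recalled above.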

    Denote

    Λt=def~H(yt)H(yt)2+52ρΥt+5VΥ+5V1ρ2ρytyt12.\Lambda_{t}\overset{\mathrm{def}}{=}-\|{\widetilde{\nabla}}H(y_{t})-\nabla H(y_{t})\|^{2}+\frac{5}{2\rho}\Upsilon_{t}+\frac{5V_{\Upsilon}+5V_{1}\rho}{2\rho}\|y_{t}-y_{t-1}\|^{2}.

Since $\mathbb{E}\Lambda_{t}\leq c\,{\bf C_{t}}^{1/\theta}$ for $t>m$ by the inequality above, we can find a sufficiently large constant $d\in(c,+\infty)$ such that

    ad(1θ)𝐂𝐭(𝔼[Φ(Xt)Φt])θ+(𝔼Λt)θ1\displaystyle\frac{ad(1-\theta){\bf C_{t}}}{(\mathbb{E}[\Phi(X_{t})-\Phi_{t}^{*}])^{\theta}+\left(\mathbb{E}\Lambda_{t}\right)^{\theta}}\geq 1

for all $t>m$. Using the fact that $(u+v)^{\theta}\leq u^{\theta}+v^{\theta}$ for all $u,v\geq 0$ and $\theta\in(0,1]$, we get

    ad(1θ)𝐂𝐭(𝔼[ΨtΦt])θ=ad(1θ)𝐂𝐭(𝔼[ΦtΦt+Λt])θad(1θ)𝐂𝐭(𝔼[ΦtΦt])θ+(𝔼Λt)θ1,t>m.\displaystyle\frac{ad(1-\theta){\bf C_{t}}}{(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}])^{\theta}}=\frac{ad(1-\theta){\bf C_{t}}}{(\mathbb{E}[\Phi_{t}-\Phi_{t}^{*}+\Lambda_{t}])^{\theta}}\geq\frac{ad(1-\theta){\bf C_{t}}}{(\mathbb{E}[\Phi_{t}-\Phi_{t}^{*}])^{\theta}+(\mathbb{E}\Lambda_{t})^{\theta}}\geq 1,\,\,\forall t>m.

    Hence, for φ(r)=adr1θ\varphi(r)=adr^{1-\theta},

    φ(𝔼[ΨtΦt])𝐂𝐭1,t>m.\varphi^{\prime}(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}]){\bf C_{t}}\geq 1,\,\,\,\forall t>m.

    By the concavity of φ\varphi,

    φ(𝔼[ΨtΦt])φ(𝔼[Ψt+1Φt+1])\displaystyle\varphi(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}])-\varphi(\mathbb{E}[\Psi_{t+1}-\Phi_{t+1}^{*}]) φ(𝔼[ΨtΦt])(𝔼[ΨtΦt+Φt+1Ψt+1])\displaystyle\geq\varphi^{\prime}(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}])(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}+\Phi_{t+1}^{*}-\Psi_{t+1}])
    φ(𝔼[ΨtΦt])(𝔼[ΨtΨt+1]),\displaystyle\geq\varphi^{\prime}(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}])(\mathbb{E}[\Psi_{t}-\Psi_{t+1}]),

where the last inequality follows from the non-decreasing property of $\Phi_{t}^{*}$ together with $\varphi^{\prime}\geq 0$. Denote $\Delta_{p,q}\overset{\mathrm{def}}{=}\varphi(\mathbb{E}[\Psi_{p}-\Phi_{p}^{*}])-\varphi(\mathbb{E}[\Psi_{q}-\Phi_{q}^{*}])$; then

    Δt,t+1𝐂𝐭𝔼[ΨtΨt+1].\Delta_{t,t+1}{\bf C_{t}}\geq\mathbb{E}[\Psi_{t}-\Psi_{t+1}].
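
Indeed, combining $\varphi^{\prime}(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}])\,{\bf C_{t}}\geq 1$ with the concavity estimate above and the nonnegativity of $\mathbb{E}[\Psi_{t}-\Psi_{t+1}]$ (which follows from the bound invoked below), we obtain

\Delta_{t,t+1}\geq\varphi^{\prime}(\mathbb{E}[\Psi_{t}-\Phi_{t}^{*}])\,\mathbb{E}[\Psi_{t}-\Psi_{t+1}]\geq{\bf C_{t}}^{-1}\,\mathbb{E}[\Psi_{t}-\Psi_{t+1}].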

    Since 𝔼[ΨtΨt+1]Λ(γ)𝔼yt+1yt2+C1𝔼ytyt12\mathbb{E}[\Psi_{t}-\Psi_{t+1}]\geq\Lambda(\gamma)\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+C_{1}\mathbb{E}\|y_{t}-y_{t-1}\|^{2}, we have

\Delta_{t,t+1}{\bf C_{t}} \geq\Lambda(\gamma)\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+C_{1}\mathbb{E}\|y_{t}-y_{t-1}\|^{2} (A.23)
\geq\min\{\Lambda(\gamma),C_{1}\}\,\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+\min\{\Lambda(\gamma),C_{1}\}\,\mathbb{E}\|y_{t}-y_{t-1}\|^{2}.

    Applying Young’s inequality to (A.23) yields

    2𝔼yt+1yt2+𝔼ytyt12\displaystyle 2\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
\leq 2\sqrt{\big(\min\{\Lambda(\gamma),C_{1}\}\big)^{-1}{\bf C_{t}}\Delta_{t,t+1}}\leq\frac{{\bf C_{t}}}{2\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)}+\frac{2\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\Delta_{t,t+1}}{\min\{\Lambda(\gamma),C_{1}\}}
    𝔼yt+1yt22+𝔼ytyt122+s(𝔼Υt𝔼Υt+1)(C+2sVΥρ)ρ+2(C+2sVΥρ)Δt,t+1max{Λ(γ),C1}.\displaystyle\leq\frac{\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}}{2}+\frac{\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}}{2}+\frac{\sqrt{s}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})}{\big{(}C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big{)}\rho}+\frac{2\big{(}C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big{)}\Delta_{t,t+1}}{\max\{\Lambda(\gamma),C_{1}\}}.
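
Here Young's inequality is applied in the form $2\sqrt{uv}\leq u+v$ with

u=\frac{{\bf C_{t}}}{2\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)},\qquad v=\frac{2\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\Delta_{t,t+1}}{\min\{\Lambda(\gamma),C_{1}\}},

so that $uv=\big(\min\{\Lambda(\gamma),C_{1}\}\big)^{-1}{\bf C_{t}}\Delta_{t,t+1}$; the last line then follows from the definition of ${\bf C_{t}}$.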

    Then we have

    𝔼yt+1yt2+𝔼ytyt12\displaystyle\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}} (A.24)
    2𝔼yt+1yt2+𝔼ytyt12\displaystyle\leq\sqrt{2}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}+\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
\leq\frac{\sqrt{2}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}}{4}+\frac{\sqrt{2}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}}{4}+\frac{\sqrt{2s}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})}{2\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\rho}+\frac{\sqrt{2}\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\Delta_{t,t+1}}{\min\{\Lambda(\gamma),C_{1}\}}.
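
The first inequality in (A.24) is the elementary estimate $\sqrt{u}+\sqrt{v}\leq\sqrt{2(u+v)}$ for $u,v\geq 0$, since $(\sqrt{u}+\sqrt{v})^{2}=u+v+2\sqrt{uv}\leq 2(u+v)$.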

    Summing inequality (A.24) from t=mt=m to ii,

    t=mi𝔼yt+1yt2+t=mi𝔼ytyt12\displaystyle\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
    24t=mi𝔼yt+1yt2+24t=mi𝔼ytyt12+2s2(C+2sVΥρ)ρt=mi(𝔼Υt𝔼Υt+1)\displaystyle\leq\frac{\sqrt{2}}{4}\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\frac{\sqrt{2}}{4}\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}+\frac{\sqrt{2s}}{2\big{(}C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big{)}\rho}\sum_{t=m}^{i}(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}})
+\frac{\sqrt{2}\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)}{\min\{\Lambda(\gamma),C_{1}\}}\sum_{t=m}^{i}\Delta_{t,t+1}
\leq\frac{\sqrt{2}}{4}\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\frac{\sqrt{2}}{4}\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}+\frac{\sqrt{2s}}{2\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\rho}(\sqrt{\mathbb{E}\Upsilon_{m}}-\sqrt{\mathbb{E}\Upsilon_{i+1}})
+\frac{\sqrt{2}\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)}{\min\{\Lambda(\gamma),C_{1}\}}\Delta_{m,i+1}.
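
Here the last inequality uses the telescoping identities

\sum_{t=m}^{i}\big(\sqrt{\mathbb{E}\Upsilon_{t}}-\sqrt{\mathbb{E}\Upsilon_{t+1}}\big)=\sqrt{\mathbb{E}\Upsilon_{m}}-\sqrt{\mathbb{E}\Upsilon_{i+1}}\quad\text{and}\quad\sum_{t=m}^{i}\Delta_{t,t+1}=\Delta_{m,i+1},

the latter following directly from the definition of $\Delta_{p,q}$.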

Rearranging, dropping the nonpositive term $-\sqrt{\mathbb{E}\Upsilon_{i+1}}$, and using $\frac{4}{4-\sqrt{2}}\leq\frac{4+\sqrt{2}}{3}$, this implies that

\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
\leq\frac{(2\sqrt{2}+1)\sqrt{s}}{3\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\rho}\sqrt{\mathbb{E}\Upsilon_{m}}+\frac{(4\sqrt{2}+2)\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)}{3\min\{\Lambda(\gamma),C_{1}\}}\Delta_{m,i+1}
    C2(𝔼Υm+Δm,i+1),\displaystyle\leq C_{2}(\sqrt{\mathbb{E}\Upsilon_{m}}+\Delta_{m,i+1}),

where $C_{2}=\max\Big\{\frac{(2\sqrt{2}+1)\sqrt{s}}{3\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)\rho},\,\frac{(4\sqrt{2}+2)\big(C+\frac{2\sqrt{sV_{\Upsilon}}}{\rho}\big)}{3\min\{\Lambda(\gamma),C_{1}\}}\Big\}$. Using Jensen's inequality yields

    t=mi𝔼yt+1yt+t=mi𝔼ytyt1\displaystyle\sum_{t=m}^{i}\mathbb{E}\|y_{t+1}-y_{t}\|+\sum_{t=m}^{i}\mathbb{E}\|y_{t}-y_{t-1}\|
    t=mi𝔼yt+1yt2+t=mi𝔼ytyt12\displaystyle\leq\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t+1}-y_{t}\|^{2}}+\sum_{t=m}^{i}\sqrt{\mathbb{E}\|y_{t}-y_{t-1}\|^{2}}
\leq C_{2}(\sqrt{\mathbb{E}\Upsilon_{m}}+\Delta_{m,i+1}).

Since $\Delta_{m,i+1}=\varphi(\mathbb{E}[\Psi_{m}-\Phi_{m}^{*}])-\varphi(\mathbb{E}[\Psi_{i+1}-\Phi_{i+1}^{*}])\leq\varphi(\mathbb{E}[\Psi_{m}-\Phi_{m}^{*}])$ is bounded uniformly in $i$ (recall $\varphi\geq 0$), letting $i\to\infty$ we obtain

    t=m𝔼yt+1yt<andt=m𝔼xt+1xt<.\sum_{t=m}^{\infty}\mathbb{E}\|y_{t+1}-y_{t}\|<\infty\,\,\textmd{and}\,\,\sum_{t=m}^{\infty}\mathbb{E}\|x_{t+1}-x_{t}\|<\infty.

    Together with (1.4c), we get

    t=m𝔼zt+1zt<.\sum_{t=m}^{\infty}\mathbb{E}\|z_{t+1}-z_{t}\|<\infty.