
A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization

Tesi Xiao Department of Statistics, University of California, Davis. [email protected].    Krishnakumar Balasubramanian Department of Statistics, University of California, Davis [email protected].    Saeed Ghadimi Department of Management Sciences, University of Waterloo. [email protected].
Abstract

We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of $T$ functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second-moment assumptions. We show that the number of calls to the stochastic first-order oracle and the linear-minimization oracle required by the proposed algorithm, to obtain an $\epsilon$-stationary solution, are of order $\mathcal{O}_{T}(\epsilon^{-2})$ and $\mathcal{O}_{T}(\epsilon^{-3})$ respectively, where $\mathcal{O}_{T}$ hides constants in $T$. Notably, the dependence of these complexity bounds on $\epsilon$ and $T$ is separate, in the sense that changing one does not impact the dependence of the bounds on the other. For the case of $T=1$, we also provide a high-probability convergence result that depends poly-logarithmically on the inverse confidence level. Moreover, our algorithm is parameter-free and does not require any (increasing) order of mini-batches to converge, unlike the common practice in the analysis of stochastic conditional gradient-type algorithms.

1 Introduction

We study projection-free algorithms for solving the following stochastic multi-level composition optimization problem

\min_{x\in\mathcal{X}}\quad F(x):=f_{1}\circ\cdots\circ f_{T}(x),  (1)

where $f_{i}:\mathbb{R}^{d_{i}}\rightarrow\mathbb{R}^{d_{i-1}}$, $i=1,\dots,T$ (with $d_{0}=1$), are continuously differentiable functions, the composite function $F$ is bounded below by $F^{\star}>-\infty$, and $\mathcal{X}\subset\mathbb{R}^{d}$ is a closed convex set. We assume that the exact function values and derivatives of the $f_{i}$'s are not available. In particular, we assume that $f_{i}(y)=\mathbb{E}_{\xi_{i}}[G_{i}(y,\xi_{i})]$ for some random variable $\xi_{i}$. Our goal is to solve the above optimization problem, given access to noisy evaluations of the $\nabla f_{i}$'s and $f_{i}$'s.

There are two main challenges in developing efficient projection-free algorithms for solving (1). First, denoting the transpose of the Jacobian of $f_{i}$ by $\nabla f_{i}$, the gradient of the objective function $F$ in (1) is given by $\nabla F(x)=\nabla f_{T}(y_{T})\nabla f_{T-1}(y_{T-1})\cdots\nabla f_{1}(y_{1})$, where $y_{i}=f_{i+1}\circ\cdots\circ f_{T}(x)$ for $1\leq i<T$ and $y_{T}=x$. Because of this nested structure of $\nabla F(x)$, obtaining an unbiased gradient estimator with controlled moments in the stochastic first-order setting becomes non-trivial, and naive stochastic gradient estimators lead to oracle complexities that depend exponentially on $T$ (in terms of the accuracy parameter). Second, even when $T=1$, projection-free algorithms in the stochastic setting, such as the conditional gradient method and its sliding variants, invariably require increasing orders of mini-batches [LZ16, RSPS16, HL16, QLX18, YSC19], which makes their practical implementation infeasible. (We discuss in detail some recent works that avoid increasing mini-batches, albeit under stronger assumptions, in Section 1.2.)

In this work, we propose a projection-free conditional gradient-type algorithm that achieves level-independent oracle complexities (i.e., the dependence of the complexities on the target accuracy is independent of $T$) using only one sample of $(\xi_{i})_{1\leq i\leq T}$ in each iteration, thereby addressing both of the above challenges. Our algorithm uses moving-average based stochastic estimators of the gradient and function values, also used recently by [GRW20] and [BGN22], along with an inexact conditional gradient method used by [BG22] (which in turn is inspired by the sliding method of [LZ16]). In order to establish our oracle complexity results, we use a novel merit-function based convergence analysis. To the best of our knowledge, such an analysis technique is used for the first time in the context of analyzing stochastic conditional gradient-type algorithms.

1.1 Preliminaries and Main Contributions

We now introduce the technical preliminaries required to state and highlight the main contributions of this work. Throughout, $\|\cdot\|$ denotes the Euclidean norm for vectors and the Frobenius norm for matrices. We first describe the set of assumptions on the objective functions and the constraint set.

Assumption 1 (Constraint set).

The set $\mathcal{X}\subset\mathbb{R}^{d}$ is convex and closed with $\max_{x,y\in\mathcal{X}}\|x-y\|\leq D_{\mathcal{X}}$.

Assumption 2 (Smoothness).

All functions $f_{1},\dots,f_{T}$ and their derivatives are Lipschitz continuous with Lipschitz constants $L_{f_{i}}$ and $L_{\nabla f_{i}}$, respectively.

The above assumptions on the constraint set and the objective function are standard in the literature on stochastic optimization, and in particular in the analysis of conditional gradient algorithms and multi-level optimization; see, for example, [LZ16], [YWF19] and [BGN22]. We emphasize here that the above smoothness assumptions are made only on the functions $(f_{i})_{1\leq i\leq T}$ and not on the stochastic functions $(G_{i})_{1\leq i\leq T}$ (which would be a stronger assumption). Moreover, the Lipschitz continuity of the $f_{i}$'s is implied by Assumption 1 together with the assumption that the functions are continuously differentiable. However, for the sake of illustration, we state both assumptions separately. In addition to these assumptions, we also make unbiasedness and bounded-variance assumptions on the stochastic first-order oracle. Due to its technical nature, we state the precise details later in Section 3 (see Assumption 3).

We next turn our attention to the convergence criterion that we use in this work to evaluate the performance of the proposed algorithm. Gradient-based algorithms iteratively solve sub-problems in the form of

\operatorname*{arg\,min}_{u\in\mathcal{X}}\left\{\langle g,u\rangle+\frac{\beta}{2}\|u-x\|^{2}\right\},  (2)

for some $\beta>0$, $g\in\mathbb{R}^{d}$ and $x\in\mathcal{X}$. Denoting the optimal solution of the above problem by $P_{\mathcal{X}}(x,g,\beta)$ and noting its optimality condition, one can easily show that

-\nabla F(\bar{x})\in\mathcal{N}_{\mathcal{X}}(\bar{x})+{\cal B}\Big(0,\|g-\nabla F(\bar{x})\|D_{\mathcal{X}}+\beta\|x-P_{\mathcal{X}}(x,g,\beta)\|(D_{\mathcal{X}}+\|\nabla F(\bar{x})\|/\beta)\Big),

where $\mathcal{N}_{\mathcal{X}}(\bar{x})$ is the normal cone to $\mathcal{X}$ at $\bar{x}$ and ${\cal B}(0,r)$ denotes a ball centered at the origin with radius $r$. Thus, reducing the radius of the ball in the above relation results in finding an approximate first-order stationary point of the non-convex constrained minimization problem. Motivated by this fact, we define the gradient mapping at a point $\bar{x}\in\mathcal{X}$ as

\mathcal{G}_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\coloneqq\beta\left(\bar{x}-P_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\right)=\beta\left(\bar{x}-\Pi_{\mathcal{X}}\left(\bar{x}-\frac{1}{\beta}\nabla F(\bar{x})\right)\right),  (3)

where $\Pi_{\mathcal{X}}(y)$ denotes the Euclidean projection of the vector $y$ onto the set $\mathcal{X}$. The gradient mapping is a classical measure that has been widely used in the literature as a convergence criterion when solving nonconvex constrained problems [Nes18]. It plays a role analogous to that of the gradient in unconstrained optimization; in fact, when $\mathcal{X}\equiv\mathbb{R}^{d}$ the gradient mapping reduces to $\nabla F(\bar{x})$. It should be emphasized that while the gradient mapping cannot be computed in the stochastic setting, it still serves as a measure of convergence. Our main goal in this work is to find an $\epsilon$-stationary solution to (1), in the sense described below.
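For intuition, the following Python sketch computes the gradient mapping (3) at a point for the simple case where $\mathcal{X}$ is a Euclidean ball; the ball constraint, the toy objective, and the function names are illustrative choices of ours and are not part of the setup above.

  import numpy as np

  def project_l2_ball(y, r=1.0):
      # Euclidean projection of y onto {x : ||x|| <= r}
      norm = np.linalg.norm(y)
      return y if norm <= r else (r / norm) * y

  def gradient_mapping(x, grad, beta, project):
      # G_X(x, grad, beta) = beta * (x - Pi_X(x - grad / beta)), cf. (3)
      return beta * (x - project(x - grad / beta))

  # Toy example: F(x) = 0.5 * ||x - c||^2 over the unit ball.
  c = np.array([2.0, 0.0])
  x = np.array([0.5, 0.5])
  g = x - c                                   # exact gradient of F at x
  print(np.linalg.norm(gradient_mapping(x, g, beta=1.0, project=project_l2_ball)))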

Definition 1.

A point $\bar{x}\in\mathcal{X}$ generated by an algorithm for solving (1) is called an $\epsilon$-stationary point if we have $\mathbb{E}[\|\mathcal{G}_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\|^{2}]\leq\epsilon$, where the expectation is taken over all the randomness involved in the problem.

In the literature on conditional gradient methods for the nonconvex setting, the so-called Frank-Wolfe gap is also widely used as a convergence criterion. It is defined formally as

g_{\mathcal{X}}(\bar{x},\nabla F(\bar{x})):=\max_{y\in\mathcal{X}}\langle\nabla F(\bar{x}),\bar{x}-y\rangle.  (4)

As pointed out by [BG22], the gradient mapping criterion and the Frank-Wolfe gap are related to each other in the following sense.

Proposition 1.

[BG22] Let $g_{\mathcal{X}}(\cdot)$ be the Frank-Wolfe gap defined in (4) and $\mathcal{G}_{\mathcal{X}}(\cdot)$ be the gradient mapping defined in (3). Then, we have $\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|^{2}\leq g_{\mathcal{X}}(x,\nabla F(x))$ for all $x\in\mathcal{X}$. Moreover, under Assumptions 1 and 2, we have $g_{\mathcal{X}}(x,\nabla F(x))\leq\left[(1/\beta)\prod_{i=1}^{T}L_{f_{i}}+D_{\mathcal{X}}\right]\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|$.

For stochastic conditional gradient-type algorithms, the oracle complexity is measured in terms of the number of calls to the Stochastic First-order Oracle (SFO) and the Linear Minimization Oracle (LMO) used to solve the sub-problems (which amount to minimizing a linear function over the convex feasible set) arising in the algorithm. Hence, in this work we measure the number of calls to the SFO and LMO required by the proposed algorithm to obtain an $\epsilon$-stationary solution in the sense of Definition 1. We now highlight our main contributions:

  • We propose a projection-free conditional gradient-type method (Algorithm 1) for solving (1). In Theorem 2, we show that the SFO and LMO complexities of this algorithm, in order to achieve an $\epsilon$-stationary solution in the sense of Definition 1, are of order $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-3})$, respectively.

  • The above SFO and LMO complexities are, in particular, level-independent (i.e., the dependence of the complexities on the target accuracy is independent of $T$). The proposed algorithm is parameter-free and does not require any mini-batches, making it applicable in the online setting.

  • When considering the case of $T\leq 2$, we present a simplified method (Algorithms 3 and 4) that obtains the same oracle complexities. Intriguingly, while this simplified method is still parameter-free for $T=1$, it is not for $T=2$ (see Theorem 3 and the subsequent remark). Furthermore, for the case of $T=1$, we also establish high-probability bounds (see Theorem 5).

A summary of oracle complexities for stochastic conditional gradient-type algorithms is in Table 1.

Algorithm Criterion # of levels Batch size SFO LMO
SPIDER-SFW [YSC19] FW-gap 1 $\mathcal{O}(\epsilon^{-1})$ $\mathcal{O}(\epsilon^{-3})$ $\mathcal{O}(\epsilon^{-2})$
1-SFW [ZSM+20] FW-gap 1 1 $\mathcal{O}(\epsilon^{-3})$ $\mathcal{O}(\epsilon^{-3})$
SCFW [ABTR21] FW-gap 2 1 $\mathcal{O}(\epsilon^{-3})$ $\mathcal{O}(\epsilon^{-3})$
SCGS [QLX18] GM 1 $\mathcal{O}(\epsilon^{-1})$ $\mathcal{O}(\epsilon^{-2})$ $\mathcal{O}(\epsilon^{-2})$
SGD+ICG [BG22] GM 1 $\mathcal{O}(\epsilon^{-1})$ $\mathcal{O}(\epsilon^{-2})$ $\mathcal{O}(\epsilon^{-2})$
LiNASA+ICG (Algorithm 1) GM $T$ 1 $\mathcal{O}_{T}(\epsilon^{-2})$ $\mathcal{O}_{T}(\epsilon^{-3})$
Table 1: Complexity results for stochastic conditional gradient-type algorithms to find an $\epsilon$-stationary solution in the nonconvex setting. FW-gap and GM stand for the Frank-Wolfe gap (see (4)) and the gradient mapping (see (3)), respectively. $\mathcal{O}_{T}$ hides constants in $T$. Existing one-sample based stochastic conditional gradient algorithms are either (i) not applicable to the case of general $T>1$, or (ii) require strong assumptions [ZSM+20], or (iii) are not truly online [ABTR21]; see Section 1.2 for a detailed discussion. The results in [BG22] are actually presented for the zeroth-order setting; however, the first-order complexities stated above follow immediately.

1.2 Related Work

Conditional Gradient-Type Methods. The conditional gradient algorithm [FW56, LP66] has seen renewed interest in the machine learning and optimization communities in the past decade; see [Mig94, Jag13, HJN15, LJJ15, BS17, GKS21] for a partial list of recent works. In the stochastic convex setup, [HK12, HL16] provided expected oracle complexity results for the stochastic conditional gradient algorithm. These complexities were further improved by a sliding procedure in [LZ16], based on Nesterov's acceleration method. [RSPS16, YSC19, HL16] considered variance-reduced stochastic conditional gradient algorithms and provided expected oracle complexities in the non-convex setting. [QLX18] analyzed the sliding algorithm in the non-convex setting and provided results for the gradient mapping criterion. All of the above works require increasing orders of mini-batches to obtain their oracle complexity results.

[MHK20] and [ZSM+20] proposed stochastic conditional gradient-type algorithms with moving-average gradient estimators, for the convex and non-convex settings respectively, that use only one sample in each iteration. However, [MHK20] and [ZSM+20] require several restrictive assumptions, which we explain next (focusing on [ZSM+20], which considers the nonconvex case). Specifically, [ZSM+20] requires that the stochastic function $G_{1}(x,\xi_{1})$ has uniformly bounded function value, gradient norm, and Hessian spectral norm, and that the distribution of the random vector $\xi_{1}$ has an absolutely continuous density $p$ such that the norm of the gradient of $\log p$ and the spectral norm of the Hessian of $\log p$ have finite fourth and second moments, respectively. In contrast, we do not require such stringent assumptions. Next, all of the above works focus only on the case of $T=1$. [ABTR21] considered a stochastic conditional gradient algorithm for solving (1) with $T=2$. However, [ABTR21] also makes stringent assumptions: (i) the stochastic objective functions $G_{1}(x,\xi_{1})$ and $G_{2}(x,\xi_{2})$ themselves have Lipschitz gradients almost surely, and (ii) for a given instance of the random vectors $\xi_{1}$ and $\xi_{2}$, one can query the oracle at the current and previous iterates, which makes the algorithm not truly online. See Table 1 for a summary.

Stochastic Multi-level Composition Optimization. Compositional optimization problems of the form in (1) have been considered as early as the 1970s by [Erm76]. Recently, there has been renewed interest in this problem. [EN13] and [DPR17] considered a sample-average approximation approach for solving (1) and established several asymptotic results. For the case of $T=2$, [WFL17], [WLF16] and [BGI+17] proposed and analyzed stochastic gradient descent-type algorithms in the smooth setting. [DD19] and [DR18] considered the non-smooth setting and established oracle complexity results. Furthermore, [HZCH20] proposed algorithms for the case when the randomness between the two levels is not necessarily independent. For the general case of $T\geq 1$, [YWF19] proposed stochastic gradient descent-type algorithms and established oracle complexities that depend exponentially on $T$ and are hence sub-optimal. Indeed, large-deviation and central limit theorem results established in [EN13] and [DPR17], respectively, show that in the sample-average approximation setting, the $\operatorname*{arg\,min}$ of the problem in (1) based on $n$ samples converges at a level-independent rate (i.e., the dependence of the convergence rate on the target accuracy is independent of $T$) to the true minimizer, under suitable regularity conditions.

[GRW20] proposed a single time-scale Nested Averaged Stochastic Approximation (NASA) algorithm and established optimal rates for the cases $T=1,2$. For the general case of $T\geq 1$, [BGN22] proposed a linearized NASA algorithm and established level-independent and optimal convergence rates. Concurrently, [Rus21] considered the case when the functions $f_{i}$ are non-smooth and established asymptotic convergence results. [ZX21] also established non-asymptotic level-independent oracle complexities, however, under stronger assumptions than those in [BGN22]. First, they assumed that for a fixed batch of samples, one can query the oracle at different points, which is not suited to the general online stochastic optimization setup. Next, they assume a much stronger mean-square Lipschitz smoothness assumption on the individual functions $f_{i}$ and their gradients. Finally, they require mini-batch sizes that depend exponentially on $T$, which makes their method impractical. Concurrently to [BGN22], level-independent rates were also obtained for unconstrained problems by [CSY21], albeit under the stronger assumption that the stochastic functions $G_{i}(x,\xi_{i})$ are Lipschitz almost surely. It is also worth mentioning that while some of the above papers considered constrained problems, the algorithms proposed and analyzed in those works are not projection-free, which is the main focus of this work.

2 Methodology

In this section, we present our projection-free algorithm for solving problem (1). The method generates three random sequences, namely, approximate solutions $\{x^{k}\}$, average gradients $\{z^{k}\}$, and average function values $\{u^{k}\}$, defined on a certain probability space $(\Omega,\mathscr{F},P)$. We let $\mathscr{F}_{k}$ be the $\sigma$-algebra generated by $\{x^{0},\dots,x^{k},z^{0},\dots,z^{k},u_{1}^{0},\dots,u_{1}^{k},\dots,u_{T}^{0},\dots,u_{T}^{k}\}$. The overall method is given in Algorithm 1. In (7), the stochastic Jacobians $J_{i}^{k+1}\in\mathbb{R}^{d_{i}\times d_{i-1}}$, and the product $\prod_{i=1}^{T}J_{T+1-i}^{k+1}$ is calculated as $J_{T}^{k+1}J_{T-1}^{k+1}\cdots J_{1}^{k+1}\in\mathbb{R}^{d_{T}\times d_{0}}\equiv\mathbb{R}^{d_{T}\times 1}$. In (8), we use the notation $\langle\cdot,\cdot\rangle$ to represent both matrix-vector multiplication and the vector-vector inner product. There are two aspects of the algorithm that we highlight specifically: (i) In addition to estimating the gradient of $F$, we also estimate a stochastic linear approximation of the inner functions $f_{i}$ by a moving-average technique. In the multi-level setting we consider, this helps us avoid the accumulation of bias that arises when estimating the $f_{i}$ directly. Linearization techniques have been used in stochastic optimization since the work of [Rus87]; a similar approach was used in [BGN22] in the context of projection-based methods for solving (1). It is also worth mentioning that other linearization techniques have been used in [DD19] and [DR18] for estimating the stochastic inner function values for weakly convex two-level composition problems. (ii) The ICG method given in Algorithm 2 essentially applies the deterministic conditional gradient method with exact line search to the quadratic minimization subproblem in (2), with the estimated gradient $z^{k}$ from (7). It was also used in [BG22] as a sub-routine and is motivated by the sliding approach of [LZ16].

Algorithm 1 Linearized NASA with Inexact Conditional Gradient Method (LiNASA+ICG)
  Input: $x^{0}\in\mathcal{X}$, $z^{0}=0\in\mathbb{R}^{d}$, $u_{i}^{0}\in\mathbb{R}^{d_{i}}$ for $i=1,\dots,T$, $\beta_{k}>0$, $t_{k}>0$, $\tau_{k}\in(0,1]$, $\delta\geq 0$.
  for $k=0,1,2,\dots,N$ do
     1. Update the solution:
\tilde{y}^{k}=\texttt{ICG}(x^{k},z^{k},\beta_{k},t_{k},\delta),  (5)
x^{k+1}=x^{k}+\tau_{k}(\tilde{y}^{k}-x^{k}),  (6)
and compute stochastic Jacobians $J_{i}^{k+1}$ and function values $G_{i}^{k+1}$ at $u_{i+1}^{k}$ for $i=1,\dots,T$.
     2. Update the average gradient $z$ and the function value estimates $u_{i}$ for each level $i=1,\dots,T$:
z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}\prod_{i=1}^{T}J_{T+1-i}^{k+1},  (7)
u_{i}^{k+1}=(1-\tau_{k})u_{i}^{k}+\tau_{k}G_{i}^{k+1}+\langle J_{i}^{k+1},u_{i+1}^{k+1}-u_{i+1}^{k}\rangle.  (8)
  end for
  Output: $(x^{R},z^{R},u_{1}^{R},\dots,u_{T}^{R})$, where $R$ is uniformly distributed over $\{1,2,\dots,N\}$.
Algorithm 2 Inexact Conditional Gradient Method (ICG)
  Input: $(x,z,\beta,M,\delta)$. Set $w^{0}=x$.
  for $t=0,1,2,\dots,M$ do
     1. Find $v^{t}\in\mathcal{X}$ such that, for the given $\delta\geq 0$,
\langle z+\beta(w^{t}-x),v^{t}\rangle\leq\min_{v\in\mathcal{X}}\langle z+\beta(w^{t}-x),v\rangle+\frac{\beta D_{\mathcal{X}}^{2}\delta}{t+2}.
     2. Set $w^{t+1}=(1-\mu_{t})w^{t}+\mu_{t}v^{t}$ with $\mu_{t}=\min\left\{1,\frac{\langle\beta(x-w^{t})-z,v^{t}-w^{t}\rangle}{\beta\|v^{t}-w^{t}\|^{2}}\right\}$.
  end for
  Output: $w^{M}$
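To make the ICG sub-routine concrete, the following Python sketch implements Algorithm 2 with an exact linear minimization oracle (i.e., $\delta=0$) for the illustrative case where $\mathcal{X}$ is an $\ell_{1}$ ball, for which the LMO has a closed form; the choice of constraint set and the function names are ours and serve only as an example.

  import numpy as np

  def lmo_l1_ball(g, r=1.0):
      # Linear minimization oracle: argmin_{||v||_1 <= r} <g, v>
      i = np.argmax(np.abs(g))
      v = np.zeros_like(g)
      v[i] = -r * np.sign(g[i])
      return v

  def icg(x, z, beta, M, lmo):
      # Algorithm 2: conditional gradient with exact line search on
      # H(w) = <z, w - x> + (beta / 2) * ||w - x||^2
      w = x.copy()
      for t in range(M + 1):
          grad_H = z + beta * (w - x)      # gradient of H at w
          v = lmo(grad_H)                  # exact LMO, so delta = 0
          d = v - w
          denom = beta * np.dot(d, d)
          if denom <= 1e-16:               # v = w: w is already optimal
              break
          mu = min(1.0, np.dot(beta * (x - w) - z, d) / denom)
          w = w + mu * d
      return w

  # One inexact projection-type step with an estimated gradient z:
  x = np.array([0.3, -0.2, 0.0])
  z = np.array([1.0, -2.0, 0.5])
  y_tilde = icg(x, z, beta=2.0, M=10, lmo=lmo_l1_ball)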

3 Main Results

In this section, we present our main result on the oracle complexity of Algorithm 1. Before we proceed, we present our assumptions on the stochastic first-order oracle.

Assumption 3 (Stochastic First-Order Oracle).

Denote $u_{T+1}^{k}\equiv x^{k}$. For each $k$, with $u_{i+1}^{k}$ as the input, the stochastic oracle outputs $G_{i}^{k+1}\in\mathbb{R}^{d_{i}}$ and $J_{i}^{k+1}$ such that, given $\mathscr{F}_{k}$ and for any $i\in\{1,\dots,T\}$:

  • (a)

    $\mathbb{E}[J_{i}^{k+1}|\mathscr{F}_{k}]=\nabla f_{i}(u_{i+1}^{k})$, $\mathbb{E}[G_{i}^{k+1}|\mathscr{F}_{k}]=f_{i}(u_{i+1}^{k})$,

  • (b)

    $\mathbb{E}[\|G_{i}^{k+1}-f_{i}(u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{i}}^{2}$, $\mathbb{E}[\|J_{i}^{k+1}-\nabla f_{i}(u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]\leq\sigma_{J_{i}}^{2}$,

  • (c)

    The outputs of the stochastic oracle at level $i$, $G_{i}^{k+1}$ and $J_{i}^{k+1}$, are independent. The outputs of the stochastic oracle are also independent between levels, i.e., $\{G_{i}^{k+1}\}_{i=1,\dots,T}$ are independent and so are $\{J_{i}^{k+1}\}_{i=1,\dots,T}$.

Parts (a) and (b) of Assumption 3 are standard unbiasedness and bounded-variance assumptions on the stochastic oracle, common in the literature. Part (c) is essential to establish the convergence results in the multi-level case. Similar assumptions have been made, for example, in [YWF19] and [BGN22]. We also emphasize that, unlike some prior works (see, e.g., [ZSM+20]), Assumption 3 allows the case of endogenous uncertainty, and we do not require the distribution of the random variables $(\xi_{i})_{1\leq i\leq T}$ to be independent of the distribution of the decision variables $(u_{i})_{1\leq i\leq T}$.

Remark.

Under Assumptions 2 and 3, we can immediately conclude that $\mathbb{E}[\|J_{i}^{k+1}\|^{2}|\mathscr{F}_{k}]=\mathbb{E}[\|J_{i}^{k+1}-\nabla f_{i}(u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]+\|\nabla f_{i}(u_{i+1}^{k})\|^{2}\leq\sigma_{J_{i}}^{2}+L_{f_{i}}^{2}:=\hat{\sigma}_{J_{i}}^{2}$. In the sequel, $\hat{\sigma}_{J_{i}}^{2}$ will be used to simplify the presentation.

We start with the merit function used in this work and its connection to the gradient mapping criterion. Our proof leverages the following merit function:

W_{\alpha,\gamma}(x,z,u)=F(x)-F^{\star}-\eta(x,z)+\alpha\|\nabla F(x)-z\|^{2}+\sum_{i=1}^{T}\gamma_{i}\|f_{i}(u_{i+1})-u_{i}\|^{2},  (9)

where $\alpha,\{\gamma_{i}\}_{1\leq i\leq T}$ are positive constants and

\eta(x,z)=\min_{y\in\mathcal{X}}\left\{H(y;x,z,\beta):=\langle z,y-x\rangle+\frac{\beta}{2}\|y-x\|^{2}\right\}.  (10)

Compared to [BGN22], we require the additional term $\|\nabla F(x)-z\|^{2}$, which turns out to be essential in our proof due to the ICG routine. The following proposition relates the merit function above to the gradient mapping.

Proposition 2.

Let $\mathcal{G}_{\mathcal{X}}(\cdot)$ be the gradient mapping defined in (3) and $\eta(\cdot,\cdot)$ be defined in (10). For any pair $(x,z)$ and $\beta>0$, we have $\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|^{2}\leq-4\beta\eta(x,z)+2\|\nabla F(x)-z\|^{2}$.

Proof.

By expanding the square and using the properties of the projection operator, we have

\left\|\Pi_{\mathcal{X}}\left(x-\tfrac{1}{\beta}z\right)-x\right\|^{2}+\left\|\Pi_{\mathcal{X}}\left(x-\tfrac{1}{\beta}z\right)-\left(x-\tfrac{1}{\beta}z\right)\right\|^{2}\leq\left\|x-\left(x-\tfrac{1}{\beta}z\right)\right\|^{2}=\left\|\tfrac{1}{\beta}z\right\|^{2}.

Since completing the square in (10) gives $\eta(x,z)=\frac{\beta}{2}\|\Pi_{\mathcal{X}}(x-\frac{1}{\beta}z)-(x-\frac{1}{\beta}z)\|^{2}-\frac{1}{2\beta}\|z\|^{2}$, the above inequality yields $\eta(x,z)\leq-\frac{\beta}{2}\|\Pi_{\mathcal{X}}(x-\frac{1}{\beta}z)-x\|^{2}$. The proof is completed by noting that $\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|^{2}\leq 2\beta^{2}\|\Pi_{\mathcal{X}}(x-\frac{1}{\beta}z)-x\|^{2}+2\|\nabla F(x)-z\|^{2}$, which follows from the triangle inequality and the non-expansiveness of $\Pi_{\mathcal{X}}$. ∎

We now present our main result on the oracle complexity of Algorithm 1.

Theorem 2.

Under Assumptions 1, 2, and 3, let $\{x^{k},z^{k},\{u_{i}^{k}\}_{1\leq i\leq T}\}_{k\geq 0}$ be the sequence generated by Algorithm 1 with $N\geq 1$ and

\beta_{k}\equiv\beta>0,\qquad\tau_{0}=1,\ t_{0}=0,\qquad\tau_{k}=\frac{1}{\sqrt{N}},\ t_{k}=\lceil\sqrt{k}\rceil,\quad\forall k\geq 1,  (11)

where $\beta$ is an arbitrary positive constant. Provided that the merit function $W_{\alpha,\gamma}(x,z,u)$ is defined as in (9) with

\alpha=\frac{\beta}{20L_{\nabla F}^{2}},\quad\gamma_{1}=\frac{\beta}{2},\quad\gamma_{j}=\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}+\frac{\beta}{2},\quad 2\leq j\leq T,  (12)

we have

\mathbb{E}\left[\|\mathcal{G}_{\mathcal{X}}(x^{R},\nabla F(x^{R}),\beta)\|^{2}\right]\leq\frac{2\left(\beta+\frac{20L_{\nabla F}^{2}}{\beta}\right)\left[2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)\right]}{\sqrt{N}},  (13)
\mathbb{E}\left[\|f_{i}(u_{i+1}^{R})-u_{i}^{R}\|^{2}\right]\leq\frac{2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)}{\beta\sqrt{N}},\quad 1\leq i\leq T,  (14)

where $u_{T+1}=x$, $\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)=4\hat{\sigma}^{2}+32\beta D_{\mathcal{X}}^{2}(1+\delta)\left(\frac{3}{5}+\frac{5L_{\nabla F}^{2}}{\beta^{2}}\right)$, and $\hat{\sigma}^{2}$ is a constant depending on the parameters $(\beta,\sigma^{2},L,D_{\mathcal{X}},T)$ given in (42). The expectation is taken with respect to all random sequences generated by the method and an independent random integer $R$ uniformly distributed over $\{1,\dots,N\}$. That is to say, the number of calls to the SFO and LMO needed to obtain an $\epsilon$-stationary point is upper bounded by $\mathcal{O}_{T}(\epsilon^{-2})$ and $\mathcal{O}_{T}(\epsilon^{-3})$, respectively.

Remark.

The constant $\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)$ is $\mathcal{O}(T)$ given the definition of $\hat{\sigma}^{2}$ and the values of $\gamma_{j}$ in (12), which further implies that the total number of calls to the SFO and LMO made by Algorithm 1 to find an $\epsilon$-stationary point of (1) is bounded by $\mathcal{O}(T^{2}\epsilon^{-2})=\mathcal{O}_{T}(\epsilon^{-2})$ and $\mathcal{O}(T^{3}\epsilon^{-3})=\mathcal{O}_{T}(\epsilon^{-3})$, respectively. Furthermore, it is worth noting that this complexity bound for Algorithm 1 is obtained without any dependence of the parameter $\beta_{k}$ on the Lipschitz constants, since $\beta$ in (11) is an arbitrary positive constant, and $\tau_{k},t_{k}$ depend only on the total number of iterations $N$ and the iteration counter $k$, respectively. This makes Algorithm 1 parameter-free and easy to implement.

Remark.

As discussed in Section 2, the ICG routine given in Algorithm 2 is a deterministic method applied with the estimated gradient $z^{k}$ from (7). The number of inner iterations at step $k$ is $t_{k}=\lceil\sqrt{k}\rceil$; that is, more precise solutions of the ICG sub-problem are required only in later outer iterations. Furthermore, due to the deterministic nature of the ICG routine, further advances in the analysis of deterministic conditional gradient methods under additional assumptions on the constraint set $\mathcal{X}$ (see, for example, [GH15, GW21]) could be leveraged to improve the overall LMO complexity.
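For concreteness, the schedule in (11) depends only on the iteration counters, as the following small Python helper (an illustration of (11), nothing more) makes explicit.

  import math

  def schedule(N):
      # tau_0 = 1, t_0 = 0; tau_k = 1/sqrt(N) and t_k = ceil(sqrt(k)) for k >= 1, cf. (11)
      taus = [1.0] + [1.0 / math.sqrt(N)] * N
      ts = [0] + [math.ceil(math.sqrt(k)) for k in range(1, N + 1)]
      return taus, ts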

3.1 The special cases of $T=1$ and $T=2$

We now discuss several intriguing points regarding the choice of the tuning parameter $\beta$ for the case of $T=2$, and for the more standard case of $T=1$. Specifically, the linearization technique used in Algorithm 1 turns out to be unnecessary for obtaining similar rates when $T=1$ or $T=2$. However, without linearization, the choice of $\beta$ depends on the problem parameters for $T=2$, whereas it remains independent of the problem parameters (as in Algorithm 1 and Theorem 2, which hold for all $T\geq 1$) for $T=1$. As the outer function value estimates (i.e., the $u_{1}^{k+1}$ sequence) are not required for the convergence analysis, we remove them in Algorithms 3 and 4.

Algorithm 3 NASA with Inexact Conditional Gradient Method (NASA+ICG) for $T=2$
  Replace Step 2 of Algorithm 1 with the following:
  2'. Update the average gradient $z$ and the function value estimate $u_{2}$ respectively as:
z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}J_{2}^{k+1}J_{1}^{k+1}\quad\text{and}\quad u_{2}^{k+1}=(1-\tau_{k})u_{2}^{k}+\tau_{k}G_{2}^{k+1}
Algorithm 4 ASA with Inexact Conditional Gradient Method (ASA+ICG) for $T=1$
  Replace Step 2 of Algorithm 1 with the following:
  2''. Update the average gradient $z$ as: $z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}J_{1}^{k+1}$.
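To illustrate how the pieces fit together in the simplest case, the following Python sketch runs Algorithm 4 (ASA+ICG, $T=1$) on a toy stochastic least-squares problem over an $\ell_{1}$ ball. The toy objective, the constraint set, and the function names are our own illustrative choices; the icg helper is a compact restatement of the sketch given after Algorithm 2, and for simplicity the warm-start choice $\tau_{0}=1$ from (11) is omitted.

  import numpy as np

  rng = np.random.default_rng(0)

  def lmo_l1(g, r=1.0):
      # argmin_{||v||_1 <= r} <g, v>
      i = np.argmax(np.abs(g))
      v = np.zeros_like(g)
      v[i] = -r * np.sign(g[i])
      return v

  def icg(x, z, beta, M):
      # Algorithm 2 with an exact LMO (delta = 0)
      w = x.copy()
      for _ in range(M + 1):
          d = lmo_l1(z + beta * (w - x)) - w
          denom = beta * np.dot(d, d)
          if denom <= 1e-16:
              break
          w = w + min(1.0, np.dot(beta * (x - w) - z, d) / denom) * d
      return w

  def asa_icg(x0, sfo, beta=1.0, N=500):
      # Algorithm 4: moving-average gradient estimate + one ICG call per iteration
      x, z = x0.copy(), np.zeros_like(x0)
      tau = 1.0 / np.sqrt(N)
      for k in range(1, N + 1):
          y_tilde = icg(x, z, beta, int(np.ceil(np.sqrt(k))))   # (5)
          g = sfo(x)                    # stochastic gradient J_1^{k+1}
          x = x + tau * (y_tilde - x)   # (6)
          z = (1 - tau) * z + tau * g   # step 2'' of Algorithm 4
      return x

  # Toy problem: F(x) = 0.5 * E||x - xi||^2, xi ~ N(c, 0.01 I); the SFO returns x - xi.
  c = np.array([0.8, -0.3, 0.0, 0.0])
  x_hat = asa_icg(np.zeros(4), lambda x: x - (c + 0.1 * rng.standard_normal(c.shape)), beta=1.0, N=500)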
Theorem 3.

Let Assumptions 1, 2, and 3 be satisfied by the optimization problem (1). Let $\mathcal{C}_{1},\mathcal{C}_{2}$ and $\mathcal{C}_{3}$ be constants depending on the parameters $(\beta,\sigma^{2},L,D_{\mathcal{X}},\delta)$, as defined in (54) and (62). Let $\tau_{0}=1$, $t_{0}=0$, $\tau_{k}=\frac{1}{\sqrt{N}}$, $t_{k}=\lceil\sqrt{k}\rceil$ for all $k\geq 1$, where $N$ is the total number of iterations.

  • (a)

    Let $T=2$, and let $\{x^{k},z^{k},u_{2}^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 3 with

    \beta_{k}\equiv\beta\geq 6\rho L_{\nabla F}+\left(2\rho+\frac{2}{3\rho}\right)L_{\nabla f_{1}}L_{f_{2}}^{2},\quad\rho>0.  (15)

    Then, we have, for all $N\geq 1$,

    \mathbb{E}\left[\|\mathcal{G}_{\mathcal{X}}(x^{R},\nabla F(x^{R}),\beta)\|^{2}\right]\leq\frac{\mathcal{C}_{1}}{\sqrt{N}},\quad\mathbb{E}\left[\|f_{2}(x^{R})-u_{2}^{R}\|^{2}\right]\leq\frac{\mathcal{C}_{2}}{\sqrt{N}}.
  • (b)

    Let $T=1$ and let $\{x^{k},z^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 4 with $\beta_{k}\equiv\beta>0$. Then, we have, for all $N\geq 1$,

    \mathbb{E}\left[\|\mathcal{G}_{\mathcal{X}}(x^{R},\nabla F(x^{R}),\beta)\|^{2}\right]\leq\frac{\mathcal{C}_{3}}{\sqrt{N}}.

All expectations are taken with respect to all random sequences generated by the respective algorithms and an independent random integer $R$ uniformly distributed over $\{1,\dots,N\}$. In both cases, the number of calls to the SFO and LMO needed to get an $\epsilon$-stationary point is upper bounded by $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-3})$, respectively.

Remark.

While we can obtain the same complexities without using the linear approximation of the inner function for $T=2$, linearization appears necessary for a parameter-free algorithm, since the choice of $\beta$ in (15) requires knowledge of the problem parameters. Indeed, the linearization term in (8) helps us better exploit the Lipschitz smoothness of the gradients and obtain an error bound of order $\tau_{k}^{2}\|d^{k}\|^{2}$ for estimating the inner function values. Without this term, we can only use the Lipschitz continuity of the inner functions, and so the error estimate grows to the order of $\tau_{k}\|d^{k}\|$. Hence, we need to choose a larger $\beta$ (as in (15)) to reduce $\|d^{k}\|$ and handle the error term without compromising the complexities. However, this is not the case for $T=1$, which can be seen as a two-level problem whose inner function is known exactly (the identity map). In this case, the choice of $\beta$ is independent of the problem parameters with or without the linearization term.

3.2 High-Probability Convergence for $T=1$

In this subsection, we establish an oracle complexity result that holds with high probability for the case of $T=1$. We first provide the notion of an $(\epsilon,\delta)$-stationary point and a related tail assumption on the stochastic first-order oracle.

Definition 4.

A point $\bar{x}\in\mathcal{X}$ generated by an algorithm for solving (1) is called an $(\epsilon,\delta)$-stationary point if $\|\mathcal{G}_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\|^{2}\leq\epsilon$ holds with probability at least $1-\delta$.

Assumption 4.

Let $\Delta^{k+1}=\nabla F(x^{k})-J_{1}^{k+1}$ for $k\geq 0$. For each $k$, given $\mathscr{F}_{k}$, we have $\mathbb{E}[\Delta^{k+1}|\mathscr{F}_{k}]=0$ and $\|\Delta^{k+1}\|\,\big|\,\mathscr{F}_{k}$ is $K$-sub-Gaussian.

The above assumption is commonly used in the literature; see [HK14, HLPR19, LO20, ZCC+18]. We also refer to [Ver18] and Appendix E for additional details. The high-probability bound for solving non-convex constrained problems by Algorithm 4 is given below.

Theorem 5.

Let Assumptions 1, 2, and 4 be satisfied by the optimization problem (1) with $T=1$. Let $\tau_{0}=1$, $t_{0}=0$, $\tau_{k}=\frac{1}{\sqrt{N}}$, $t_{k}=\lceil\sqrt{k}\rceil$ for all $k\geq 1$, where $N$ is the total number of iterations, and let $\{x^{k},z^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 4 with $\beta_{k}\equiv\beta>0$. Then, for all $N\geq 1$ and $\delta>0$, with probability at least $1-\delta$,

\min_{k=1,\dots,N}\left\|\mathcal{G}_{\mathcal{X}}(x^{k},\nabla F(x^{k}),\beta)\right\|^{2}\leq\mathcal{O}\left(\frac{K^{2}\log(1/\delta)}{\sqrt{N}}\right).

Therefore, the number of calls to the SFO and LMO needed to get an $(\epsilon,\delta)$-stationary point is upper bounded by $\mathcal{O}(\epsilon^{-2}\log^{2}(1/\delta))$ and $\mathcal{O}(\epsilon^{-3}\log^{3}(1/\delta))$, respectively.

Remark.

To the best of our knowledge, the above result is (i) the first high-probability bound for a one-sample stochastic conditional gradient-type algorithm for the case of $T=1$, and (ii) the first high-probability bound for constrained stochastic optimization algorithms in the non-convex setting; see Appendix J of [MDB21].

4 Proof Sketch of Main Results

In this section, we present only a proof sketch; the complete proofs are provided in the appendix. For convenience, let $u_{T+1}=x$, and denote by $H_{k}$ the objective of the sub-problem at step $k$ and by $y^{k}$ its optimal solution, i.e.,

H_{k}(y):=H(y;x^{k},z^{k},\beta_{k}),\qquad y^{k}=\operatorname*{arg\,min}_{y\in\mathcal{X}}H_{k}(y).  (16)

Then, the proof of Theorem 2 proceeds via the following steps:

  1.

    We first leverage the merit function $W_{k}:=W_{\alpha,\gamma}(x^{k},z^{k},u^{k})$ defined in (9), with appropriate choices of $\alpha,\gamma$ for any $\beta>0$, to obtain

    \begin{split}W_{k+1}-W_{k}\leq&-\frac{\tau_{k}}{2}\left(\beta\left[\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right]+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right)\\ &+\mathbf{R}_{k}+\tau_{k}\left(\frac{12}{5}+\frac{20L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\quad\forall k\geq 0,\end{split}

    where $\mathbf{R}_{k}$ is the residual term (see (31)) and $\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]\leq\hat{\sigma}^{2}\tau_{k}^{2}$, as shown in Proposition 3.

  2.

    Telescoping the above inequality, we obtain in Lemma 11 the following:

    \begin{split}&\sum_{k=1}^{N}\tau_{k}\left[\beta\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right]\\ &\leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\quad\forall N\geq 1.\end{split}
  3.

    To further control the error term $H_{k}(\tilde{y}^{k})-H_{k}(y^{k})$ introduced by the ICG method, we set $t_{k}$, the number of iterations of the ICG method at step $k$, to $\lceil\sqrt{k}\rceil$. By Lemma 8, we therefore have

    H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{t_{k}+2}\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{k}},\quad\forall k\geq 1.

    Also, with the choice of $\tau_{k}=\frac{1}{\sqrt{N}}$ and $z^{0}=0$, and since $\sum_{k=1}^{N}1/\sqrt{k}\leq 2\sqrt{N}$, we can conclude that

    \sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{N}}\sum_{k=1}^{N}\frac{1}{\sqrt{k}}\leq 4\beta D_{\mathcal{X}}^{2}(1+\delta).
  4.

    Then, taking expectations on both sides and using the definition of the random integer $R$, we have

    \begin{split}\mathbb{E}\left[\beta\left(\|d^{R}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{R})-u_{i}^{R}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{R})-z^{R}\|^{2}\right]\leq\frac{2W_{0}+\mathcal{B}}{\sqrt{N}},\end{split}

    $\forall N\geq 1$, where $\mathcal{B}$ is a constant depending on the problem parameters $(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)$.

  5.

    As a result, we can obtain (13) and (14) by noting that, $\forall k\geq 1$,

    \begin{split}\|\mathcal{G}_{\mathcal{X}}(x^{k},\nabla F(x^{k}),\beta)\|^{2}&\leq 2\beta^{2}\|d^{k}\|^{2}+2\beta^{2}\left\|\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}\nabla F(x^{k})\right)-\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}z^{k}\right)\right\|^{2}\\ &\leq 2\beta^{2}\|d^{k}\|^{2}+2\|\nabla F(x^{k})-z^{k}\|^{2},\end{split}

    where the second inequality follows from the non-expansiveness of the projection operator.

The proofs of Theorems 3 and 5 follow the same argument with appropriate modifications. The high-probability convergence proof of Theorem 5 mainly consists of controlling the tail probability of the residual term $\mathbf{R}_{k}$ being large.

5 Discussion

In this work, we propose and analyze projection-free conditional gradient-type algorithms for constrained stochastic multi-level composition optimization of the form in (1). We show that the oracle complexity of the proposed algorithms is level-independent in terms of the target accuracy. Furthermore, our algorithm does not require any increasing order of mini-batches under standard unbiasedness and bounded second-moment assumptions on the stochastic first-order oracle, and is parameter-free. Some open questions for future research: (i) in the one-sample setting, either improving the LMO complexity from $\mathcal{O}(\epsilon^{-3})$ to $\mathcal{O}(\epsilon^{-2})$ for general closed convex constraint sets, or establishing lower bounds showing that $\mathcal{O}(\epsilon^{-3})$ is necessary while keeping the SFO complexity of order $\mathcal{O}(\epsilon^{-2})$, is extremely interesting; and (ii) providing high-probability bounds for stochastic multi-level composition problems ($T>1$) and under sub-Gaussian or heavy-tail assumptions (as in [MDB21, LZW22]) is interesting to explore.

Acknowledgment

TX was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, UC Davis and National Science Foundation (NSF) grant CCF-1934568. KB was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, UC Davis and NSF grant DMS-2053918. SG was partially supported by an NSERC Discovery Grant.

References

  • [ABTR21] Zeeshan Akhtar, Amrit Singh Bedi, Srujan Teja Thomdapu, and Ketan Rajawat. Projection-Free Algorithm for Stochastic Bi-level Optimization. arXiv preprint arXiv:2110.11721, 2021.
  • [AF21] Raul Astudillo and Peter Frazier. Bayesian optimization of function networks. Advances in Neural Information Processing Systems, 34, 2021.
  • [BG22] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, 22(1):35–76, 2022.
  • [BGI+17] Jose Blanchet, Donald Goldfarb, Garud Iyengar, Fengpei Li, and Chaoxu Zhou. Unbiased simulation for optimizing stochastic function compositions. arXiv preprint arXiv:1711.07564, 2017.
  • [BGN22] Krishnakumar Balasubramanian, Saeed Ghadimi, and Anthony Nguyen. Stochastic multilevel composition optimization algorithms with level-independent convergence rates. SIAM Journal on Optimization, 32(2):519–544, 2022.
  • [BS17] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164(1-2):1–27, 2017.
  • [CFKM20] Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1393–1403, 2020.
  • [CSY21] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Transactions on Signal Processing, 69:4937–4948, 2021.
  • [DD19] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
  • [DPR17] Darinka Dentcheva, Spiridon Penev, and Andrzej Ruszczyński. Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760, 2017.
  • [DR18] John Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.
  • [EN13] Yuri Ermoliev and Vladimir Norkin. Sample average approximation method for compound stochastic optimization problems. SIAM Journal on Optimization, 23(4):2231–2263, 2013.
  • [Erm76] Yuri Ermoliev. Methods of stochastic programming. Nauka, Moscow, 1976.
  • [FMO21] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34, 2021.
  • [FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
  • [GH15] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In International Conference on Machine Learning, pages 541–549. PMLR, 2015.
  • [GKS21] Dan Garber, Atara Kaplan, and Shoham Sabach. Improved complexities of conditional gradient-type methods with applications to robust matrix recovery problems. Mathematical Programming, 186(1):185–208, 2021.
  • [GRW20] Saeed Ghadimi, Andrzej Ruszczynski, and Mengdi Wang. A single timescale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 30(1):960–979, 2020.
  • [GW21] Dan Garber and Noam Wolf. Frank-Wolfe with a nearest extreme point oracle. In Conference on Learning Theory, pages 2103–2132. PMLR, 2021.
  • [HJN15] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.
  • [HK12] Elad Hazan and Satyen Kale. Projection-free online learning. In 29th International Conference on Machine Learning, ICML 2012, pages 521–528, 2012.
  • [HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
  • [HL16] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
  • [HLPR19] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
  • [HZCH20] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. Advances in Neural Information Processing Systems, 33, 2020.
  • [Jag13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435. PMLR, 2013.
  • [LJJ15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.
  • [LO20] Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive sgd with momentum. arXiv preprint arXiv:2007.14294, 2020.
  • [LP66] Evgeny Levitin and Boris Polyak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1–50, 1966.
  • [LZ16] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
  • [LZW22] Zhipeng Lou, Wanrong Zhu, and Wei Biao Wu. Beyond sub-gaussian noises: Sharp concentration analysis for stochastic gradient descent. Journal of Machine Learning Research, 23:1–22, 2022.
  • [MDB21] Liam Madden, Emiliano Dall’Anese, and Stephen Becker. High-probability convergence bounds for non-convex stochastic gradient descent. arXiv preprint arXiv:2006.05610, 2021.
  • [MHK20] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. Journal of machine learning research, 2020.
  • [Mig94] Athanasios Migdalas. A regularization of the frank—wolfe method and unification of certain nonlinear programming methods. Mathematical Programming, 65(1):331–345, 1994.
  • [Nes18] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
  • [QGX+21] Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An online method for a class of distributionally robust optimization with non-convex objectives. Advances in Neural Information Processing Systems, 34, 2021.
  • [QHZ+22] Zi-Hao Qiu, Quanqi Hu, Yongjian Zhong, Lijun Zhang, and Tianbao Yang. Large-scale stochastic optimization of ndcg surrogates for deep learning with provable convergence. arXiv preprint arXiv:2202.12183, 2022.
  • [QLX18] Chao Qu, Yan Li, and Huan Xu. Non-convex conditional gradient sliding. In International Conference on Machine Learning, pages 4208–4217. PMLR, 2018.
  • [QLX+21] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of areas under precision-recall curves with provable convergence. Advances in Neural Information Processing Systems, 34, 2021.
  • [RS06] Andrzej Ruszczyński and Alexander Shapiro. Optimization of convex risk functions. Mathematics of operations research, 31(3):433–452, 2006.
  • [RSPS16] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic frank-wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251. IEEE, 2016.
  • [Rus87] Andrzej Ruszczyński. A linearization method for nonsmooth stochastic programming problems. Mathematics of Operations Research, 12(1):32–49, 1987.
  • [Rus21] Andrzej Ruszczynski. A stochastic subgradient method for nonsmooth nonconvex multilevel composition optimization. SIAM Journal on Control and Optimization, 59(3):2301–2320, 2021.
  • [RWC03] David Ruppert, Matt P Wand, and Raymond J Carroll. Semiparametric regression. Number 12. Cambridge university press, 2003.
  • [Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [WFL17] Mengdi Wang, Ethan Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
  • [WGE17] Gang Wang, Georgios B Giannakis, and Yonina C Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 64(2):773–794, 2017.
  • [WLF16] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, 2016.
  • [WYZY22] Guanghui Wang, Ming Yang, Lijun Zhang, and Tianbao Yang. Momentum accelerates the convergence of stochastic AUPRC maximization. In International Conference on Artificial Intelligence and Statistics, pages 3753–3771. PMLR, 2022.
  • [YSC19] Alp Yurtsever, Suvrit Sra, and Volkan Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In International Conference on Machine Learning, pages 7282–7291. PMLR, 2019.
  • [YWF19] Shuoguang Yang, Mengdi Wang, and Ethan Fang. Multilevel stochastic gradient methods for nested composition optimization. SIAM Journal on Optimization, 29(1):616–659, 2019.
  • [ZCC+18] Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
  • [ZSM+20] Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One-sample Stochastic Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, pages 4012–4023. PMLR, 2020.
  • [ZX21] Junyu Zhang and Lin Xiao. Multilevel composite stochastic optimization via nested variance reduction. SIAM Journal on Optimization, 31(2):1131–1157, 2021.

 
Supplementary Materials
 

The supplementary materials are organized as follows. Appendix A provides motivating examples for stochastic multi-level optimization. Appendix B introduces the technical lemmas needed to complete the proofs. We present the complete proofs of Theorems 2 and 3 in Appendices C and D. Finally, we present the high-probability convergence analysis for the case $T=1$ in Appendix E.

Appendix A Motivating Examples

Problems of the form in (1) are generalizations of the standard constrained stochastic optimization problem, which is obtained when $T=1$, and arise in several machine learning applications. Some examples include sparse additive modeling in non-parametric statistics [WFL17, Section 4.1], Bayesian optimization [AF21], model-agnostic meta-learning [CSY21, FMO21], distributionally robust optimization [QGX+21], training graph neural networks [CFKM20], reinforcement learning [WLF16, Section 1.1], and AUPRC maximization [QLX+21, WYZY22, QHZ+22]. Below, we provide a concrete motivating example from the field of risk-averse stochastic optimization [RS06].

The mean-deviation risk-averse optimization problem is given by

\max_{x\in\mathcal{X}}\left\{\mathbb{E}[U(x,\xi)]-\rho\,\mathbb{E}\left[\{\mathbb{E}[U(x,\xi)]-U(x,\xi)\}^{2}\right]^{1/2}\right\}.

As noted by [YWF19], [Rus21] and [BGN22], the above problem is a stochastic 3-level composition optimization problem with

f_{3}:=\mathbb{E}[U(x,\xi)],\qquad f_{2}(z,x):=\mathbb{E}[\{z-U(x,\xi)\}^{2}],\qquad f_{1}((y_{1},y_{2})):=y_{1}-\sqrt{y_{2}+\delta}.

Here, $\delta>0$ is added to make the square-root function smooth. In particular, we consider a semi-parametric data-generating process given by a sparse single-index model of the form $b=g(\langle a,x^{*}\rangle)+\zeta$, where $g:\mathbb{R}\to\mathbb{R}$ is called the link function, $x^{*}\in\mathbb{R}^{d}$ is assumed to be a sparse vector, and $\langle\cdot,\cdot\rangle$ represents the Euclidean inner product between two vectors. Such single-index models are widely used in statistics, machine learning and economics [RWC03]. A standard choice of the link function $g$ is the square function, in which case the model is also called the sparse phase retrieval model [WGE17]. Here, $a$ is the input data, which is assumed to be independent of the noise $\zeta$. In this case, $\xi:=(a,b)$, and if we consider the squared loss, then $U(x,\xi):=(b-(\langle a,x\rangle)^{2})^{2}$, which is non-convex in $x$. The goal is to estimate the sparse index vector $x^{*}$ in a risk-averse manner, as risk-averse solutions are well known to be stable [YWF19]. To encourage sparsity, the set $\mathcal{X}$ is taken to be the $\ell_{1}$ ball [Jag13].
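As an illustration of how a stochastic first-order oracle for the innermost level of this example might be implemented, the following Python sketch samples $\xi=(a,b)$ from the sparse phase retrieval model and returns a one-sample estimate of $\mathbb{E}[U(x,\xi)]$ and of its gradient (roughly playing the role of the $G$ and $J$ outputs at the innermost level in Assumption 3); the particular distributions of $a$ and $\zeta$ and the function names are our own illustrative choices.

  import numpy as np

  rng = np.random.default_rng(0)

  def sample_xi(x_star, noise_std=0.1):
      # One sample xi = (a, b) with b = <a, x*>^2 + noise (square link function)
      a = rng.standard_normal(x_star.shape)
      b = np.dot(a, x_star) ** 2 + noise_std * rng.standard_normal()
      return a, b

  def oracle_U(x, x_star):
      # Returns U(x, xi) = (b - <a, x>^2)^2 and its gradient in x for one sample xi
      a, b = sample_xi(x_star)
      s = np.dot(a, x)
      residual = b - s ** 2
      value = residual ** 2
      grad = -4.0 * s * residual * a
      return value, grad

  x_star = np.array([1.0, 0.0, 0.0, -1.0])   # sparse index vector (illustrative)
  val, grad = oracle_U(np.zeros(4), x_star)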

Appendix B Technical Lemmas

Lemma 6.

(Smoothness of Composite Functions [BGN22]) Assume that Assumption 2 holds.

  • a)

    Define $F_{i}(x)=f_{i}\circ f_{i+1}\circ\cdots\circ f_{T}(x)$. Then the gradient of $F_{i}$ is Lipschitz continuous with the constant

    L_{\nabla F_{i}}=\sum_{j=i}^{T}\left[L_{\nabla f_{j}}\prod_{l=i}^{j-1}L_{f_{l}}\prod_{l=j+1}^{T}L_{f_{l}}^{2}\right].
  • b)

    Define

    \begin{split}&R_{1}=L_{\nabla f_{1}}L_{f_{2}}\cdots L_{f_{T}},\qquad R_{j}=L_{f_{1}}\cdots L_{f_{j-1}}L_{\nabla f_{j}}L_{f_{j+1}}\cdots L_{f_{T}}/L_{f_{j}},\quad 2\leq j\leq T-1,\\ &C_{2}=R_{1},\qquad C_{j}=\sum_{i=1}^{j-2}R_{i}\left(\prod_{l=i+1}^{j-1}L_{f_{l}}\right),\quad 3\leq j\leq T,\end{split}  (17)

    and let $u_{T+1}=x$. Then, for $T\geq 2$, we have

    \left\|\nabla F(x)-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i})\right\|\leq\sum_{j=2}^{T}C_{j}\|f_{j}(u_{j+1})-u_{j}\|.  (18)
Lemma 7.

(Smoothness of $\eta(\cdot,\cdot)$ [GRW20]) For fixed $\beta>0$ and $\eta(x,z)$ defined in (10), the gradient of $\eta(x,z)$ with respect to $(x,z)$ is Lipschitz continuous with the constant $L_{\nabla\eta}=2\sqrt{(1+\beta)^{2}+\left(1+\frac{1}{2\beta}\right)^{2}}$.

Lemma 8.

(Convergnece of ICG [Jag13]) Let y~k\tilde{y}^{k} be the vector output by Algorithm 2 at step kk, and yky^{k} be the optimal solution of the subproblem 16, then under Assumption 1

β2y~kyk2Hk(y~k)Hk(yk)2βD𝒳2(1+δ)tk+2\frac{\beta}{2}\|\tilde{y}^{k}-y^{k}\|^{2}\leq H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{t_{k}+2}\\

where δ\delta, defined in Algorithm 2, quantifies the accuracy of the approximate linear minimization procedure.

Proof of Lemma 8.

The result is obtained by applying Theorem 1 in [Jag13] to HkH_{k} and noting that the curvature constant CHk=βD𝒳2,k0C_{H_{k}}=\beta D_{\mathcal{X}}^{2},\forall k\geq 0. ∎
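For intuition, the following sketch (in Python/NumPy; the names icg and lmo are ours) shows the inner conditional gradient loop that Lemma 8 analyzes. It assumes that H_{k} takes the quadratic form H_{k}(y)=\langle z^{k},y-x^{k}\rangle+(\beta/2)\|y-x^{k}\|^{2}, which is consistent with the identity \langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2} used in the later proofs, and it assumes an exact linear minimization oracle (i.e., δ=0 in Algorithm 2).

import numpy as np

def icg(x, z, beta, lmo, num_steps):
    # Frank-Wolfe on the quadratic subproblem
    #   H(y) = <z, y - x> + (beta / 2) * ||y - x||^2
    # over the constraint set, which is accessed only through the
    # linear minimization oracle lmo(g) ~ argmin_{v in X} <g, v>.
    y = np.array(x, dtype=float)
    for t in range(num_steps):
        grad = z + beta * (y - x)           # gradient of H at the current iterate
        s = lmo(grad)                       # Frank-Wolfe vertex
        gamma = 2.0 / (t + 2)               # standard step size from [Jag13]
        y = (1.0 - gamma) * y + gamma * s   # convex combination stays feasible
    return y

For the 1\ell_{1}-ball constraint of the motivating example, the oracle lmo_l1_ball from the earlier sketch can be passed as lmo.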

Appendix C Proof of Theorem 2

To establish the rate of convergence for Algorithm 1 in Theorem 2, we first present Lemma 9 and Lemma 10 regarding the basic recursion on the errors in estimating the inner function values and the order of 𝔼[uik+1uik2|k]\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]. The proofs follow [BGN22] with minor modifications. We present the complete proofs below for the reader’s convenience.

Lemma 9.

Let {xk}k0\{x^{k}\}_{k\geq 0} and {uik}k0\{u_{i}^{k}\}_{k\geq 0} be generated by Algorithm 1 and let uT+1=xu_{T+1}=x. Define, for 1iT1\leq i\leq T,

ΔGik+1:=fi(ui+1k)Gik+1,ΔJik+1:=fi(ui+1k)Jik+1,eik:=fi(ui+1k+1)fi(ui+1k)fi(ui+1k),ui+1k+1ui+1k.\begin{split}\Delta_{G_{i}}^{k+1}&:=f_{i}(u_{i+1}^{k})-G_{i}^{k+1},\quad\Delta_{J_{i}}^{k+1}:=\nabla f_{i}(u_{i+1}^{k})-J_{i}^{k+1},\\ e_{i}^{k}&:=f_{i}(u_{i+1}^{k+1})-f_{i}(u_{i+1}^{k})-\langle\nabla f_{i}(u_{i+1}^{k}),u_{i+1}^{k+1}-u_{i+1}^{k}\rangle.\end{split} (19)

Under Assumption 2, we have, for 1iT1\leq i\leq T,

fi(ui+1k+1)uik+12(1τk)fi(ui+1k)uik2+τk2ΔGik+12+r˙ik+1+[4Lfi2+Lfifi(ui+1k)uik+ΔJik+12]ui+1k+1ui+1k2,\begin{split}&\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}\leq(1-\tau_{k})\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\dot{r}_{i}^{k+1}\\ &\quad+\left[4L_{f_{i}}^{2}+L_{\nabla f_{i}}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\|\Delta_{J_{i}}^{k+1}\|^{2}\right]\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},\end{split} (20)

and

uik+1uik2τk2[2fi(ui+1k)uik2+ΔGik+12]+2Jik+12ui+1k+1ui+1k2+r¨ik+1\begin{split}\|u_{i}^{k+1}-u_{i}^{k}\|^{2}\leq\tau_{k}^{2}\left[2\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\|\Delta_{G_{i}}^{k+1}\|^{2}\right]+2\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\ddot{r}_{i}^{k+1}\end{split} (21)

where

r˙ik+1:=2τkΔGik+1,eik+(1τk)(fi(ui+1k)uik)+ΔJik+1(ui+1k+1ui+1k)+2ΔJik+1(ui+1k+1ui+1k),eik+(1τk)(fi(ui+1k)uik),r¨ik+1:=τkΔGik+1,τk(fi(ui+1k)uik)+Jik+1(ui+1k+1ui+1k).\begin{split}&\dot{r}_{i}^{k+1}:=2\tau_{k}\langle\Delta_{G_{i}}^{k+1},e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})+{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ &\qquad\qquad+2\langle{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k}),e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})\rangle,\\ &\ddot{r}_{i}^{k+1}:=\tau_{k}\langle-\Delta_{G_{i}}^{k+1},\tau_{k}(f_{i}(u_{i+1}^{k})-u_{i}^{k})+{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle.\end{split} (22)
Proof.

We first prove (20). By the definitions in (19) and (22), for any 1iT1\leq i\leq T, we have

fi(ui+1k+1)uik+12=eik+fi(ui+1k)+fi(ui+1k)(ui+1k+1ui+1k)(1τk)uikτkGik+1Jik+1(ui+1k+1ui+1k)2=eik+ΔJik+1(ui+1k+1ui+1k)+(1τk)(fi(ui+1k)uik)+τkΔGik+12=ΔJik+1(ui+1k+1ui+1k)2+eik+(1τk)(fi(ui+1k)uik)2+τk2ΔGik+12+r˙ik+1eik+(1τk)(fi(ui+1k)uik)2+τk2ΔGik+12+ΔJik+12ui+1k+1ui+1k2+r˙ik+1(1τk)fi(ui+1k)uik2+eik2+2(1τk)eikfi(ui+1k)uik+τk2ΔGik+12+ΔJik+12ui+1k+1ui+1k2+r˙ik+1.\begin{split}&\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}\\ =&\|e_{i}^{k}+f_{i}(u_{i+1}^{k})+\nabla f_{i}(u_{i+1}^{k})^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})-(1-\tau_{k})u_{i}^{k}-\tau_{k}G_{i}^{k+1}-{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}\\ =&\|e_{i}^{k}+{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})+\tau_{k}\Delta_{G_{i}}^{k+1}\|^{2}\\ =&\|{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+\|e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\dot{r}_{i}^{k+1}\\ \leq&\|e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\|\Delta_{J_{i}}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\dot{r}_{i}^{k+1}\\ \leq&(1-\tau_{k})\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\|e_{i}^{k}\|^{2}+2(1-\tau_{k})\|e_{i}^{k}\|\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}\\ &\qquad+\|\Delta_{J_{i}}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\dot{r}_{i}^{k+1}.\end{split}

Furthermore, with Assumption 2, the standard quadratic bound for maps with L_{\nabla f_{i}}-Lipschitz Jacobians and the triangle inequality (using that f_{i} is L_{f_{i}}-Lipschitz, so that \|\nabla f_{i}(u_{i+1}^{k})\|\leq L_{f_{i}} as well) give

eikLfi2ui+1k+1ui+1k2,eik24Lfi2ui+1k+1ui+1k2,\|e_{i}^{k}\|\leq\frac{L_{\nabla f_{i}}}{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},\qquad\|e_{i}^{k}\|^{2}\leq 4L_{f_{i}}^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}, (23)

which leads to (20). To show (21), with the update rule given by (8) and the definitions in (19), we have, for 1iT1\leq i\leq T,

uik+1uik2=τk(Gik+1uik)+Jik+1,ui+1k+1ui+1k2=τk2Gik+1uik2+Jik+1(ui+1k+1ui+1k)2+2τkGik+1uik,Jik+1(ui+1k+1ui+1k)=τk2Gik+1uik2+Jik+1(ui+1k+1ui+1k)2+2τkfi(ui+1k)uik,Jik+1(ui+1k+1ui+1k)+2τkΔGik+1,Jik+1(ui+1k+1ui+1k)τk2Gik+1uik2+2Jik+12ui+1k+1ui+1k2+τk2fi(ui+1k)uik2+2τkΔGik+1,Jik+1(ui+1k+1ui+1k)=2τk2fi(ui+1k)uik2+τk2ΔGik+12+2Jik+12ui+1k+1ui+1k2+2τkΔGik+1,τk(fi(ui+1k)uik)+Jik+1(ui+1k+1ui+1k).\begin{split}&\|u_{i}^{k+1}-u_{i}^{k}\|^{2}\\ =&\|\tau_{k}(G_{i}^{k+1}-u_{i}^{k})+\langle J_{i}^{k+1},u_{i+1}^{k+1}-u_{i+1}^{k}\rangle\|^{2}\\ =&\tau_{k}^{2}\|G_{i}^{k+1}-u_{i}^{k}\|^{2}+\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+2\tau_{k}\langle G_{i}^{k+1}-u_{i}^{k},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ =&\tau_{k}^{2}\|G_{i}^{k+1}-u_{i}^{k}\|^{2}+\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+2\tau_{k}\langle f_{i}(u_{i+1}^{k})-u_{i}^{k},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ &\qquad+2\tau_{k}\langle-\Delta_{G_{i}}^{k+1},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ \leq&\tau_{k}^{2}\|G_{i}^{k+1}-u_{i}^{k}\|^{2}+2\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\tau_{k}^{2}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\\ &\qquad+2\tau_{k}\langle-\Delta_{G_{i}}^{k+1},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ =&2\tau_{k}^{2}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+2\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}\\ &\qquad+2\tau_{k}\langle-\Delta_{G_{i}}^{k+1},\tau_{k}(f_{i}(u_{i+1}^{k})-u_{i}^{k})+{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle.\end{split}

where the inequality comes from the fact that Jik+1(ui+1k+1ui+1k)2Jik+12ui+1k+1ui+1k2\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}\leq\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2} and 2τkfi(ui+1k)uik,Jik+1(ui+1k+1ui+1k)Jik+1(ui+1k+1ui+1k)2+τk2fi(ui+1k)uik22\tau_{k}\langle f_{i}(u_{i+1}^{k})-u_{i}^{k},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\leq\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+\tau_{k}^{2}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}. ∎

Lemma 10.

Let uT+1=xu_{T+1}=x. Under Assumptions 2 and 3, and with the choice of τ0=1\tau_{0}=1, we have, for 1iT1\leq i\leq T and k0k\geq 0,

𝔼[fi(ui+1k+1)uik+12|k]σGi2+(4Lfi2+σJi2)ci+1,\displaystyle\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{i}}^{2}+(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2})c_{i+1}, (24)
𝔼[uik+1uik2|k]ciτk2,\displaystyle\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{i}\tau_{k}^{2}, (25)

where

ci:=3σGi2+2(4Lfi2+σJi2+σ^Ji2)ci+1,cT+1=D𝒳2.c_{i}:=3\sigma_{G_{i}}^{2}+2(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2}+\hat{\sigma}_{J_{i}}^{2})c_{i+1},\qquad c_{T+1}=D_{\mathcal{X}}^{2}. (26)
Proof.

By the update rule given in (8) and the definitions in (19), for 1iT1\leq i\leq T and k0k\geq 0, we have

fi(ui+1k+1)uik+1=(1τk)(fi(ui+1k)uik)+𝐃k,i,f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}=(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})+\mathbf{D}_{k,i},

where 𝐃k,i:=eik+τkΔGik+1+ΔJik+1(ui+1k+1ui+1k)\mathbf{D}_{k,i}:=e_{i}^{k}+\tau_{k}\Delta_{G_{i}}^{k+1}+{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k}). With the convexity of 2\|\cdot\|^{2}, we can further obtain

fi(ui+1k+1)uik+12(1τk)fi(ui+1k)uik2+1τk𝐃k,i2,k0.\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}\leq(1-\tau_{k})\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\frac{1}{\tau_{k}}\|\mathbf{D}_{k,i}\|^{2},\quad\forall k\geq 0. (27)

Moreover, under Assumption 3, we have, for 1iT1\leq i\leq T and k0k\geq 0,

𝔼[𝐃k,i2|k]=𝔼[eik2|k]+τk2𝔼[ΔGik+12|k]+𝔼[ΔJik+1(ui+1k+1ui+1k)2|k]τk2𝔼[ΔGik+12|k]+(4Lfi2+𝔼[ΔJik+12|k])𝔼[ui+1k+1ui+1k2|k]τk2σGi2+(4Lfi2+σJi2)𝔼[ui+1k+1ui+1k2|k].\begin{split}\mathbb{E}[\|\mathbf{D}_{k,i}\|^{2}|\mathscr{F}_{k}]&=\mathbb{E}[\|e_{i}^{k}\|^{2}|\mathscr{F}_{k}]+\tau_{k}^{2}\mathbb{E}[\|\Delta_{G_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]+\mathbb{E}[\|{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]\\ &\leq\tau_{k}^{2}\mathbb{E}[\|\Delta_{G_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]+\left(4L_{f_{i}}^{2}+\mathbb{E}[\|\Delta_{J_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]\right)\mathbb{E}[\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq\tau_{k}^{2}\sigma_{G_{i}}^{2}+\left(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2}\right)\mathbb{E}[\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}|\mathscr{F}_{k}].\end{split} (28)

where the first inequality follows from (23) and the second from Assumption 3. Setting i=Ti=T in the inequality above and noting that uT+1k=xku_{T+1}^{k}=x^{k}, we have

𝔼[𝐃k,T2|k]τk2[σGT2+(4LfT2+σJT2)D𝒳2],k0.\mathbb{E}[\|\mathbf{D}_{k,T}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\left[\sigma_{G_{T}}^{2}+(4L_{f_{T}}^{2}+\sigma_{J_{T}}^{2})D_{\mathcal{X}}^{2}\right],\quad\forall k\geq 0.

Thus, with the choice of τ0=1\tau_{0}=1, we obtain

𝔼[fT(xk)uTk2|k]σGT2+(4LfT2+σJT2)D𝒳2,k1.\mathbb{E}[\|f_{T}(x^{k})-u_{T}^{k}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{T}}^{2}+(4L_{f_{T}}^{2}+\sigma_{J_{T}}^{2})D_{\mathcal{X}}^{2},\quad\forall k\geq 1.

Taking the expectation of both sides of (21) conditioned on k\mathscr{F}_{k}, and using Assumption 3, we obtain

𝔼[uik+1uik2|k]τk2𝔼[2fi(ui+1k)uik2+ΔGik+12+2τk2Jik+12ui+1k+1ui+1k2|k].\begin{split}\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\mathbb{E}\left[2\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\|\Delta_{G_{i}}^{k+1}\|^{2}+\frac{2}{\tau_{k}^{2}}\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}\middle|\mathscr{F}_{k}\right].\end{split} (29)

Setting i=Ti=T in the inequality above, we have

𝔼[uTk+1uTk2|k]τk2[3σGT2+2(4LfT2+σJT2+σ^JT2)D𝒳2],k1.\begin{split}\mathbb{E}[\|u_{T}^{k+1}-u_{T}^{k}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\left[3\sigma_{G_{T}}^{2}+2(4L_{f_{T}}^{2}+\sigma_{J_{T}}^{2}+\hat{\sigma}_{J_{T}}^{2})D_{\mathcal{X}}^{2}\right],\quad\forall k\geq 1.\end{split}

This completes the proof of (24) and (25) when i=Ti=T. We now use backward induction to complete the proof. By the above result, the base case of i=Ti=T holds. Assume that (25) holds when i=ji=j for some 1<jT1<j\leq T, i.e., 𝔼[ujk+1ujk2|k]cjτk2,k0\mathbb{E}[\|u_{j}^{k+1}-u_{j}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{j}\tau_{k}^{2},\forall k\geq 0. Then, setting i=j1i=j-1 in (28), we obtain

𝔼[𝐃k,j12|k]τk2[σGj12+(4Lfj12+σJj12)cj],k0.\mathbb{E}[\|\mathbf{D}_{k,j-1}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\left[\sigma_{G_{j-1}}^{2}+(4L_{f_{j-1}}^{2}+\sigma_{J_{j-1}}^{2})c_{j}\right],\quad\forall k\geq 0.

Furthermore, with (27) and the choice of τ0=1\tau_{0}=1, we have

𝔼[fj1(ujk+1)uj1k+12|k]σGj12+(4Lfj12+σJj12)cj,k0.\mathbb{E}[\|f_{j-1}(u_{j}^{k+1})-u_{j-1}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{j-1}}^{2}+(4L_{f_{j-1}}^{2}+\sigma_{J_{j-1}}^{2})c_{j},\quad\forall k\geq 0.

which, together with (29), implies that

𝔼[uj1k+1uj1k2|k]cj1τk2,k0.\mathbb{E}[\|u_{j-1}^{k+1}-u_{j-1}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{j-1}\tau_{k}^{2},\quad\forall k\geq 0.

We now leverage the merit function defined in (9) and, in Lemma 11, provide a basic inequality for the convergence analysis of Algorithm 1. In Proposition 3, we show that the term 𝐑k\mathbf{R}_{k} appearing on the right-hand side of (30) is bounded in expectation. These two results are the crucial steps in the convergence analysis of Algorithm 1.

Lemma 11.

Let {xk,zk,uk}k0\{x^{k},z^{k},u^{k}\}_{k\geq 0} be the sequence generated by Algorithm 1, the merit function Wα,γ(,,)W_{\alpha,\gamma}(\cdot,\cdot,\cdot) be defined in (9) with positive constants {α,{γi}1iT}\{\alpha,\{\gamma_{i}\}_{1\leq i\leq T}\}, and uT+1=xu_{T+1}=x. Under Assumption 2, for any β>0\beta>0, let

βkβ,α=β20LF2,γ1=β2,γj=(2α+14αLF2)(T1)Cj2+β2,2jT,\beta_{k}\equiv\beta,\quad\alpha=\frac{\beta}{20L_{\nabla F}^{2}},\quad\gamma_{1}=\frac{\beta}{2},\quad\gamma_{j}=\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}+\frac{\beta}{2},\quad 2\leq j\leq T,

where CjC_{j}’s are defined in (17). Then, N0\forall N\geq 0

k=0Nτk(β[dk2+i=1Tfi(ui+1k)uik2]+β20LF2F(xk)zk2)2W0+2k=0N𝐑k+(245+40LF2β2)k=0Nτk(Hk(y~k)Hk(yk)),\begin{split}\sum_{k=0}^{N}\tau_{k}\left(\beta\left[\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right]+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right)\\ \leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\end{split} (30)

where dk:=ykxkd^{k}:=y^{k}-x^{k}, Hk(),ykH_{k}(\cdot),y^{k} are defined in (16), and

𝐑k:=i=1Tγi[4Lfi2+Lfifi(ui+1k)uik+ΔJik+12]ui+1k+1ui+1k2+τk2[LF+Lη2D𝒳2+i=1TγiΔGik+12+αΔk+12]+τk[dk,Δk+1+i=1Tγir˙ik+1+2αr˙˙˙k+1]+Lη2zk+1zk2,Δk+1:=i=1TfT+1i(uT+2ik)i=1TJTi+1k+1,r˙˙˙k+1:=Δk+1,(1τk)[F(xk)zk]+τk[F(xk)i=1TfT+1i(uT+2ik)]+F(xk+1)F(xk),\begin{split}\mathbf{R}_{k}:=&\sum_{i=1}^{T}\gamma_{i}\left[4L_{f_{i}}^{2}+L_{\nabla f_{i}}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\|\Delta_{J_{i}}^{k+1}\|^{2}\right]\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}\\ &+\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\sum_{i=1}^{T}\gamma_{i}\|\Delta_{G_{i}}^{k+1}\|^{2}+\alpha\|\Delta^{k+1}\|^{2}\right]\\ &+\tau_{k}\left[\langle d^{k},\Delta^{k+1}\rangle+\sum_{i=1}^{T}\gamma_{i}\dot{r}_{i}^{k+1}+2\alpha\dddot{r}^{k+1}\right]+\frac{L_{\nabla\eta}}{2}\|z^{k+1}-z^{k}\|^{2},\\ \Delta^{k+1}:=&\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})-\prod_{i=1}^{T}J^{k+1}_{T-i+1},\\ \dddot{r}^{k+1}:=&\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})]\\ &+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle,\end{split} (31)

ΔGik+1\Delta_{G_{i}}^{k+1}, ΔJik+1\Delta_{J_{i}}^{k+1} are defined in (19), and r˙ik+1\dot{r}_{i}^{k+1} is defined in (22).

Proof.

We first bound F(xk+1)F(xk)F(x^{k+1})-F(x^{k}). By the Lipschitzness of F\nabla F (Lemma 6), we have

F(xk+1)F(xk)F(xk),xk+1xk+LFτk22d~k2=τkF(xk),dk+τkF(xk)zk,y~kykτkβdk,y~kyk+τkzk+βdk,y~kyk+LFτk22d~k2τkF(xk),dk+τkF(xk)zky~kyk+τkzk+βdk,y~kyk+τkβdky~kyk+LFτk22d~k2.\begin{split}&F(x^{k+1})-F(x^{k})\leq\langle\nabla F(x^{k}),x^{k+1}-x^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}}{2}\|\tilde{d}^{k}\|^{2}\\ &=\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\langle\nabla F(x^{k})-z^{k},\tilde{y}^{k}-y^{k}\rangle-\tau_{k}\langle\beta d^{k},\tilde{y}^{k}-y^{k}\rangle\\ &\qquad+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}}{2}\|\tilde{d}^{k}\|^{2}\\ &\leq\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle\\ &\qquad+\tau_{k}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|+\frac{L_{\nabla F}\tau_{k}^{2}}{2}\|\tilde{d}^{k}\|^{2}.\end{split} (32)

We then provide a bound for η(xk,zk)η(xk+1,zk+1)\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1}). By the Lipschitzness of η\nabla\eta (Lemma 7), with the partial gradients of η\eta given by

xη(xk,zk)=zkβdk,zη(xk,zk)=dk,\nabla_{x}\eta(x^{k},z^{k})=-z^{k}-\beta d^{k},\quad\nabla_{z}\eta(x^{k},z^{k})=d^{k},

we have

η(xk,zk)η(xk+1,zk+1)zk+βdk,xk+1xkdk,zk+1zk+Lη2[xk+1xk2+zk+1zk2]=τk2zk+βdk,dk+τkzk+βdk,d~kdkτkdk,i=1TJTi+1k+1+Lη2[τk2d~k2+zk+1zk2],\begin{split}&\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\\ &\leq\big{\langle}z^{k}+\beta d^{k},x^{k+1}-x^{k}\big{\rangle}-\big{\langle}d^{k},z^{k+1}-z^{k}\big{\rangle}+\frac{L_{\nabla\eta}}{2}\bigg{[}\|x^{k+1}-x^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}\\ &=\tau_{k}\big{\langle}2z^{k}+\beta d^{k},d^{k}\big{\rangle}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{d}^{k}-d^{k}\big{\rangle}-\tau_{k}\big{\langle}d^{k},\prod_{i=1}^{T}J_{T-i+1}^{k+1}\big{\rangle}\\ &\qquad+\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]},\end{split} (33)

where the second equality comes from (6) and (7). Due to the optimality condition in the definition of yky^{k}, we have zk+βdk,xyk0\big{\langle}z^{k}+\beta d^{k},x-y^{k}\big{\rangle}\geq 0 for all x𝒳x\in\mathcal{X}, which together with the choice of x=xkx=x^{k} implies that

zk,dk+βdk20.\langle z^{k},d^{k}\rangle+\beta\|d^{k}\|^{2}\leq 0. (34)

Thus, combining (33) with (34), we obtain

η(xk,zk)η(xk+1,zk+1)βτkdk2+τkzk+βdk,y~kykτkdk,i=1TJTi+1k+1+Lη2[τk2d~k2+zk+1zk2].\begin{split}\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\leq-\beta\tau_{k}\|d^{k}\|^{2}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\big{\rangle}-\tau_{k}\big{\langle}d^{k},\prod_{i=1}^{T}J_{T-i+1}^{k+1}\big{\rangle}\\ +\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}.\end{split} (35)

In addition, by (18) in Lemma 6, we have

dk,F(xk)i=1TfT+1i(uT+2ik)j=2TCjdkfj(uj+1k)ujk.\langle d^{k},\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})\rangle\leq\sum_{j=2}^{T}C_{j}\|d^{k}\|\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|. (36)

Then, combining (32), (35), and (36), we have

[F(xk+1)η(xk+1,zk+1)][F(xk)η(xk,zk)]τk{βdk2+j=2TCjdkfj(uj+1k)ujk+dk,Δk+1+2zk+βdk,y~kyk+[βdk+F(xk)zk]y~kyk}+LF+Lη2τk2d~k2+Lη2zk+1zk2.\begin{split}&[F(x^{k+1})-\eta(x^{k+1},z^{k+1})]-[F(x^{k})-\eta(x^{k},z^{k})]\\ &\leq\tau_{k}\bigg{\{}-\beta\|d^{k}\|^{2}+\sum_{j=2}^{T}C_{j}\|d^{k}\|\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|+\langle d^{k},\Delta^{k+1}\rangle+2\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle\\ &~{}+\left[\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right]\|\tilde{y}^{k}-y^{k}\|\bigg{\}}+\frac{L_{\nabla F}+L_{\nabla\eta}}{2}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\frac{L_{\nabla\eta}}{2}\|z^{k+1}-z^{k}\|^{2}.\end{split} (37)

Furthermore, defining

ϰk:=F(xk)i=1TfT+1i(uT+2ik),ϰ¯k:=F(xk+1)F(xk)τk,\varkappa^{k}:=\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k}),\qquad\bar{\varkappa}^{k}:=\frac{\nabla F(x^{k+1})-\nabla F(x^{k})}{\tau_{k}},

and by the update rule given by (7), we have

F(xk+1)zk+12=(1τk)[F(xk)zk]+τk[ϰk+ϰ¯k+Δk+1]2=(1τk)[F(xk)zk]+τk[ϰk+ϰ¯k]2+τk2Δk+12+2τkr˙˙˙k+1(1τk)F(xk)zk2+2τk[ϰk2+ϰ¯k2]+τk2Δk+12+2τkr˙˙˙k+1(1τk)F(xk)zk2+τk2Δk+12+2τk[(T1)j=2TCj2fj(uj+1)uj2+2LF2(dk2+y~kyk2)+r˙˙˙k+1].\begin{split}&\|\nabla F(x^{k+1})-z^{k+1}\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\varkappa^{k}+\bar{\varkappa}^{k}+\Delta^{k+1}]\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\varkappa^{k}+\bar{\varkappa}^{k}]\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+2\tau_{k}\dddot{r}^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+2\tau_{k}\left[\|\varkappa^{k}\|^{2}+\|\bar{\varkappa}^{k}\|^{2}\right]+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+2\tau_{k}\dddot{r}^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}\\ &+2\tau_{k}\left[(T-1)\sum_{j=2}^{T}C_{j}^{2}\|f_{j}(u_{j+1})-u_{j}\|^{2}+2L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+\dddot{r}^{k+1}\right].\\ \end{split} (38)

where r˙˙˙k+1:=Δk+1,(1τk)[F(xk)zk]+τk[ϰk+ϰ¯k]\dddot{r}^{k+1}:=\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\varkappa^{k}+\bar{\varkappa}^{k}]\rangle and the last inequality comes from the two facts that ϰ¯k22LF2(dk2+y~kyk2)\|\bar{\varkappa}^{k}\|^{2}\leq 2L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2}) and

ϰk2=F(xk)i=1TfT+1i(uT+2ik)2(T1)j=2TCj2fj(uj+1)uj2.\begin{split}\|\varkappa^{k}\|^{2}=\left\|\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})\right\|^{2}\leq(T-1)\sum_{j=2}^{T}C_{j}^{2}\|f_{j}(u_{j+1})-u_{j}\|^{2}.\end{split}

The above upper bound for the term ϰk2\|\varkappa^{k}\|^{2} is obtained by leveraging (18) in Lemma 6 and the fact that (i=1nai)2ni=1nai2(\sum_{i=1}^{n}a_{i})^{2}\leq n\sum_{i=1}^{n}a_{i}^{2} for any non-negative sequence (ai)1in(a_{i})_{1\leq i\leq n}.

Moreover, by Lemma 9, we have, for 1iT1\leq i\leq T,

fi(ui+1k+1)uik+12fi(ui+1k)uik2τkfi(ui+1k)uik2+τk2ΔGik+12+r˙ik+1+[4Lfi2+Lfifi(ui+1k)uik+ΔJik+12]ui+1k+1ui+1k2,\begin{split}&\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}-\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\leq-\tau_{k}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\dot{r}_{i}^{k+1}\\ &\qquad\qquad+\left[4L_{f_{i}}^{2}+L_{\nabla f_{i}}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\|\Delta_{J_{i}}^{k+1}\|^{2}\right]\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},\end{split} (39)

Finally, multiplying both sides of (39) by γi\gamma_{i} for i=1,,Ti=1,\dots,T and both sides of (38) by α\alpha, adding them to (37), rearranging the terms, and noting that zk+βdk,y~kyk=Hk(y~k)Hk(yk)(β/2)y~kyk2\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2} due to the quadratic structure of HkH_{k} and d~k2D𝒳2\|\tilde{d}^{k}\|^{2}\leq D_{\mathcal{X}}^{2}, we obtain

Wk+1Wkτk𝐀k+𝐑kW_{k+1}-W_{k}\leq\tau_{k}\mathbf{A}_{k}+\mathbf{R}_{k} (40)

where 𝐑k\mathbf{R}_{k} is defined in (31) and

𝐀k:=(β+4αLF2)dk2+j=2T(γj+2α(T1)Cj2)fj(uj+1k)ujk2γ1f1(u2k)u1k2αF(xk)zk2+j=2TCjdkfj(uj+1)uj+(βdk+F(xk)zk)y~kyk+(4αLF2β)y~kyk2+2(Hk(y~k)Hk(yk)).\begin{split}\mathbf{A}_{k}:=&\left(-\beta+4\alpha L_{\nabla F}^{2}\right)\|d^{k}\|^{2}+\sum_{j=2}^{T}\left(-\gamma_{j}+2\alpha(T-1)C_{j}^{2}\right)\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|^{2}\\ &-\gamma_{1}\|f_{1}(u_{2}^{k})-u_{1}^{k}\|^{2}-\alpha\|\nabla F(x^{k})-z^{k}\|^{2}+\sum_{j=2}^{T}C_{j}\|d^{k}\|\|f_{j}(u_{j+1})-u_{j}\|\\ &+\left(\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right)\|\tilde{y}^{k}-y^{k}\|+\left(4\alpha L_{\nabla F}^{2}-\beta\right)\|\tilde{y}^{k}-y^{k}\|^{2}\\ &+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

We can further provide a simplified upper bound for 𝐀k\mathbf{A}_{k}. By Young’s inequality, we have

βdky~kykβ4dk2+βy~kyk2,F(xk)zky~kykα2F(xk)zk2+12αy~kyk2Cjdkfj(uj+1)ujαLF2T1dk2+(T1)Cj24αLF2fj(uj+1)uj2.\begin{split}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\beta}{4}\|d^{k}\|^{2}+\beta\|\tilde{y}^{k}-y^{k}\|^{2},\\ \|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{2\alpha}\|\tilde{y}^{k}-y^{k}\|^{2}\\ C_{j}\|d^{k}\|\|f_{j}(u_{j+1})-u_{j}\|&\leq\frac{\alpha L_{\nabla F}^{2}}{T-1}\|d^{k}\|^{2}+\frac{(T-1)C_{j}^{2}}{4\alpha L_{\nabla F}^{2}}\|f_{j}(u_{j+1})-u_{j}\|^{2}.\end{split}

Thus,

𝐀k(3β4+5αLF2)dk2γ1f1(u2k)u1k2α2F(xk)zk2+j=2T(γj+(2α+14αLF2)(T1)Cj2)fj(uj+1k)ujk2+(4αLF2+12α)y~kyk2+2(Hk(y~k)Hk(yk))\begin{split}\mathbf{A}_{k}\leq&\left(-\frac{3\beta}{4}+5\alpha L_{\nabla F}^{2}\right)\|d^{k}\|^{2}-\gamma_{1}\|f_{1}(u_{2}^{k})-u_{1}^{k}\|^{2}-\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\sum_{j=2}^{T}\left(-\gamma_{j}+\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}\right)\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|^{2}\\ &+\left(4\alpha L_{\nabla F}^{2}+\frac{1}{2\alpha}\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\\ \end{split}

For any β>0\beta>0, let

α=β20LF2,γ1=β2,γj=(2α+14αLF2)(T1)Cj2+β2,2jT\alpha=\frac{\beta}{20L_{\nabla F}^{2}},\qquad\gamma_{1}=\frac{\beta}{2},\qquad\gamma_{j}=\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}+\frac{\beta}{2},\quad 2\leq j\leq T

Then, we have

𝐀kβ2(dk2+i=1Tfi(ui+1k)uik2)β40LF2F(xk)zk2+(125+20LF2β2)(Hk(y~k)Hk(yk)).\begin{split}\mathbf{A}_{k}\leq-\frac{\beta}{2}\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)-\frac{\beta}{40L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\\ +\left(\frac{12}{5}+\frac{20L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\\ \end{split} (41)

As a result of (40) and (41), we can further obtain

τk(β[dk2+i=1Tfi(ui+1k)uik2]+β20LF2F(xk)zk2)2Wk2Wk+1+2𝐑k+τk(245+40LF2β2)(Hk(y~k)Hk(yk)),\begin{split}\tau_{k}\left(\beta\left[\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right]+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right)\\ \leq 2W_{k}-2W_{k+1}+2\mathbf{R}_{k}+\tau_{k}\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\end{split}

which immediately implies (30) by telescoping. ∎

Proposition 3.

Let 𝐑k\mathbf{R}_{k} be defined in (31) and τ0=1\tau_{0}=1. Then, under Assumption 3, we have

𝔼[𝐑k|k]σ^2τk2,k1,\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]\leq\hat{\sigma}^{2}\tau_{k}^{2},\quad\forall k\geq 1,

where

σ^2:=i=1Tγi([4Lfi2+LfiσGi2+(4Lfi2+σJi2)ci+1+σJi2]ci+1+σGi2)+(α+2Lη)i=1Tσ^Ji2+LF+Lη2D𝒳2.\begin{split}\hat{\sigma}^{2}:=&\sum_{i=1}^{T}\gamma_{i}\left(\left[4L_{\nabla f_{i}}^{2}+L_{\nabla f_{i}}\sqrt{\sigma_{G_{i}}^{2}+(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2})c_{i+1}}+\sigma_{J_{i}}^{2}\right]c_{i+1}+\sigma_{G_{i}}^{2}\right)\\ &+(\alpha+2L_{\eta})\prod_{i=1}^{T}\hat{\sigma}_{J_{i}}^{2}+\frac{L_{\nabla F}+L_{\eta}}{2}D_{\mathcal{X}}^{2}.\end{split} (42)
Proof.

Note that under Assumption 3, we have, for 1iT1\leq i\leq T,

𝔼[Δk+1|k]=0,𝔼[r˙ik+1|k]=0,𝔼[r˙˙˙k+1|k]=0,𝔼[ΔGik+12|k]σGi2,𝔼[ΔJik+12|k]σJi2,\begin{split}&\mathbb{E}[\Delta^{k+1}|\mathscr{F}_{k}]=0,\quad\mathbb{E}[\dot{r}_{i}^{k+1}|\mathscr{F}_{k}]=0,\quad\mathbb{E}[\dddot{r}^{k+1}|\mathscr{F}_{k}]=0,\\ &\mathbb{E}[\|\Delta_{G_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{i}}^{2},\quad\mathbb{E}[\|\Delta_{J_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{J_{i}}^{2},\end{split}

and

𝔼[Δk+12|k]𝔼[i=1TJTi+1k+12|k]i=1T𝔼[JTi+1k+12|k]i=1Tσ^Ji2.\mathbb{E}[\|\Delta^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\mathbb{E}\left[\left\|\prod_{i=1}^{T}J^{k+1}_{T-i+1}\right\|^{2}\middle|\mathscr{F}_{k}\right]\leq\prod_{i=1}^{T}\mathbb{E}\left[\left\|J^{k+1}_{T-i+1}\right\|^{2}\middle|\mathscr{F}_{k}\right]\leq\prod_{i=1}^{T}\hat{\sigma}_{J_{i}}^{2}.

In addition, by Lemma 10 and Hölder’s inequality, we have 𝔼[uik+1uik2|k]ciτk2\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{i}\tau_{k}^{2} and

𝔼[fi(ui+1k+1)uik+1uik+1uik2|k]𝔼[fi(ui+1k+1)uik+1|k]𝔼[uik+1uik2|k](𝔼[fi(ui+1k+1)uik+12|k])12𝔼[uik+1uik2|k]ciσGi2+(4Lfi2+σJi2)ci+1τk2.\begin{split}&\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\||\mathscr{F}_{k}]\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq\left(\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}|\mathscr{F}_{k}]\right)^{\frac{1}{2}}\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq c_{i}\sqrt{\sigma_{G_{i}}^{2}+(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2})c_{i+1}}~{}\tau_{k}^{2}.\end{split}

Lastly, from eq.(28) of Proposition 2.1 in [BGN22], we have for any k1k\geq 1,

𝔼[zk+1zk2|k]4τk2i=1Tσ^Ji2.\mathbb{E}[\|z^{k+1}-z^{k}\|^{2}|\mathscr{F}_{k}]\leq 4\tau_{k}^{2}\prod_{i=1}^{T}\hat{\sigma}_{J_{i}}^{2}.

The proof is completed by combining all of the above observations with the expression of 𝐑k\mathbf{R}_{k} in (31). ∎

Proof of Theorem 2.

We now present the proof of Theorem 2. Note that, by Lemma 11 and the given values of α,γ\alpha,\gamma in (12), we obtain

k=1Nτk[β(dk2+i=1Tfi(ui+1k)uik2)+β20LF2F(xk)zk2]2Wα,γ(x0,z0,u0)+2k=0N𝐑k+(245+40LF2β2)k=0Nτk(Hk(y~k)Hk(yk)),\begin{split}&\sum_{k=1}^{N}\tau_{k}\left[\beta\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right]\\ &\leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+2\sum_{k=0}^{N}\mathbf{R}_{k}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\end{split}

Taking expectation of both sides and noting that 𝔼[𝐑k|k]σ^2τk2\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]\leq\hat{\sigma}^{2}\tau_{k}^{2} by Proposition 3, we have

k=1Nτk𝔼[ρ(dk2+i=1Tfi(ui+1k)uik2)+αF(xk)zk2|k1]2Wα,γ(x0,z0,u0)+2σ^2k=0Nτk2+(245+40LF2β2)k=0Nτk(Hk(y~k)Hk(yk)).\begin{split}&\sum_{k=1}^{N}\tau_{k}\mathbb{E}\left[\rho\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)+\alpha\|\nabla F(x^{k})-z^{k}\|^{2}\middle|\mathscr{F}_{k-1}\right]\\ &\leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+2\hat{\sigma}^{2}\sum_{k=0}^{N}\tau_{k}^{2}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split} (43)

Then, setting τk,tk\tau_{k},t_{k} to be the values in (11) and noting that, by Lemma 8, we have

Hk(y~k)Hk(yk)2βD𝒳2(1+δ)tk+22βD𝒳2(1+δ)k,k1.H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{t_{k}+2}\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{k}},\quad\forall k\geq 1.

Also, with the choice of z0=0z^{0}=0, we have y0=y~0=x0y^{0}=\tilde{y}^{0}=x^{0}. Thus, using \sum_{k=1}^{N}1/\sqrt{k}\leq\int_{0}^{N}t^{-1/2}\,dt=2\sqrt{N}, we can conclude that

k=0Nτk(Hk(y~k)Hk(yk))2βD𝒳2(1+δ)Nk=1N1k4βD𝒳2(1+δ).\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{N}}\sum_{k=1}^{N}\frac{1}{\sqrt{k}}\leq 4\beta D_{\mathcal{X}}^{2}(1+\delta).

which, together with (43), immediately implies that N1\forall N\geq 1,

1Nk=1N𝔼[β(dk2+j=1Tfj(uj+1k)ujk2)+β20LF2F(xk)zk2|k1]2Wα,γ(x0,z0,u0)+(β,σ2,L,D𝒳,T,δ).\begin{split}\frac{1}{\sqrt{N}}\sum_{k=1}^{N}\mathbb{E}\left[\beta\left(\|d^{k}\|^{2}+\sum_{j=1}^{T}\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\middle|\mathscr{F}_{k-1}\right]\\ \leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta).\end{split}

where

(β,σ2,L,D𝒳,T,δ)=4σ^2+32βD𝒳2(1+δ)(35+5LF2β2),\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)=4\hat{\sigma}^{2}+32\beta D_{\mathcal{X}}^{2}(1+\delta)\left(\frac{3}{5}+\frac{5L_{\nabla F}^{2}}{\beta^{2}}\right),

and σ^2\hat{\sigma}^{2} is given in (42). As a result, we can obtain (13) and (14) by the definition of the random integer RR and the bound

𝒢(xk,F(xk),β)22β2dk2+2β2Π𝒳(xk1βF(xk))Π𝒳(xk1βzk)22β2dk2+2F(xk)zk2.\begin{split}\|\mathcal{G}(x^{k},\nabla F(x^{k}),\beta)\|^{2}&\leq 2\beta^{2}\|d^{k}\|^{2}+2\beta^{2}\left\|\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}\nabla F(x^{k})\right)-\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}z^{k}\right)\right\|^{2}\\ &\leq 2\beta^{2}\|d^{k}\|^{2}+2\|\nabla F(x^{k})-z^{k}\|^{2}.\end{split}
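This completes the proof of Theorem 2. As an aside, the stationarity measure above is expressed through the Euclidean projection Π𝒳\Pi_{\mathcal{X}} even though the algorithm itself only queries a linear minimization oracle. For the 1\ell_{1}-ball constraint of the motivating example, the projection needed to evaluate the measure numerically can be computed by the standard sort-and-threshold routine, as in the following sketch (Python/NumPy; the function names are ours, and we assume the usual gradient-mapping definition \mathcal{G}(x,g,\beta)=\beta(x-\Pi_{\mathcal{X}}(x-g/\beta)), which is consistent with the bound displayed above).

import numpy as np

def project_l1_ball(v, radius=1.0):
    # Euclidean projection onto {w : ||w||_1 <= radius} via sorting and thresholding.
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def gradient_mapping(x, grad, beta, project):
    # beta * (x - Pi_X(x - grad / beta)): the quantity whose norm is bounded above.
    return beta * (x - project(x - grad / beta))

Evaluating np.linalg.norm(gradient_mapping(x, grad, beta, project_l1_ball)) thus reports the stationarity measure at an iterate, while the iterations themselves never form a projection.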

Appendix D Proofs for Section 3.1

D.1 Proof of Theorem 3 for T=2T=2

To show the rate of convergence for Algorithm 3, we simplify the merit function used in the analysis of the multi-level problem and leverage the following function:

Wα,γ(xk,zk,uk)=F(xk)Fη(xk,zk)+αF(xk)zk2+γf2(xk)u2k2,W_{\alpha,\gamma}(x^{k},z^{k},u^{k})=F(x^{k})-F^{\star}-\eta(x^{k},z^{k})+\alpha\|\nabla F(x^{k})-z^{k}\|^{2}+\gamma\|f_{2}(x^{k})-u_{2}^{k}\|^{2}, (44)

where α,γ\alpha,\gamma are positive constants and η(,)\eta(\cdot,\cdot) is defined in (10). We now present the analogue of Lemma 11 for Algorithm 3. The proof follows steps similar to those in the proof of Lemma 11, with slight modifications, and hence we skip some arguments already presented before.

Lemma 12.

Let {xk,zk,u2k}k0\{x^{k},z^{k},u_{2}^{k}\}_{k\geq 0} be the sequence generated by Algorithm 3 and the merit function Wα,γ(,,)W_{\alpha,\gamma}(\cdot,\cdot,\cdot) be defined in (44) with

α=ρLF,γ=3ρLf1,ρ>0.\alpha=\frac{\rho}{L_{\nabla F}},\qquad\gamma=3\rho L_{\nabla f_{1}},\qquad\rho>0.

Under Assumption 2 with T=2T=2, setting βkβ6ρLF+(2ρ+23ρ)Lf1Lf22\beta_{k}\equiv\beta\geq 6\rho L_{\nabla F}+(2\rho+\frac{2}{3\rho})L_{\nabla f_{1}}L_{f_{2}}^{2}, we have N0\forall N\geq 0

ρk=0Nτk(LFdk2+Lf1f2(xk)u2k2+1LFF(xk)zk2)2W0+2k=0N𝐑k+(4+2(8ρ+1/ρ)LF+24ρLf1Lf22β)k=0Nτk(Hk(y~k)Hk(yk))\begin{split}&\rho\sum_{k=0}^{N}\tau_{k}\bigg{(}L_{\nabla F}\|d^{k}\|^{2}+L_{\nabla f_{1}}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+\frac{1}{L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}4+\frac{2(8\rho+1/\rho)L_{\nabla F}+24\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

where dk=ykxkd^{k}=y^{k}-x^{k}, Hk(),ykH_{k}(\cdot),y^{k} are defined in (16), and

𝐑k:=τk2[LF+Lη2D𝒳2+γΔG2k+12+αΔk+12]+Lη2zk+1zk2+τkdk,Δk+1+γr˙k+1+αr¨k+1,Δk+1:=f2(xk)f1(u2k)J2k+1J1k+1,ΔG2k+1:=f2(xk)G2k+1r˙k+1:=2τkΔG2k+1,f2(xk+1)f2(xk)+(1τk)(f2(xk)u2k),r¨k+1:=2τkΔk+1,(1τk)[F(xk)zk]+τk[F(xk)f2(xk)f1(uk)]+F(xk+1)F(xk).\begin{split}\mathbf{R}_{k}:=&\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\gamma\|\Delta_{G_{2}}^{k+1}\|^{2}+\alpha\|\Delta^{k+1}\|^{2}\right]+\frac{L_{\eta}}{2}\|z^{k+1}-z^{k}\|^{2}\\ &\qquad+\tau_{k}\langle d^{k},\Delta^{k+1}\rangle+\gamma\dot{r}^{k+1}+\alpha\ddot{r}^{k+1},\\ \Delta^{k+1}:=&\nabla f_{2}(x^{k})\nabla f_{1}(u_{2}^{k})-J_{2}^{k+1}J_{1}^{k+1},\quad\Delta_{G_{2}}^{k+1}:=f_{2}(x^{k})-G_{2}^{k+1}\\ \dot{r}^{k+1}:=&2\tau_{k}\langle\Delta_{G_{2}}^{k+1},f_{2}(x^{k+1})-f_{2}(x^{k})+(1-\tau_{k})(f_{2}(x^{k})-u_{2}^{k})\rangle,\\ \ddot{r}^{k+1}:=&2\tau_{k}\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\nabla F(x^{k})-\nabla f_{2}(x^{k})\nabla f_{1}(u^{k})]\\ &\qquad+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle.\end{split} (45)
Proof of Lemma 12.

1. By the Lipschitzness of F\nabla F (Lemma 6), we have

F(xk+1)F(xk)τkF(xk),dk+τkF(xk)zky~kyk+τkβdky~kyk+τkzk+βdk,y~kyk+LFτk2d~k22.\begin{split}&F(x^{k+1})-F(x^{k})\leq\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|\\ &\qquad+\tau_{k}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}}{2}.\\ \end{split} (46)

2. Also, by the Lipschitzness of η\nabla\eta (Lemma 7) and the optimality condition in the definition of yky^{k}, we have

η(xk,zk)η(xk+1,zk+1)βτkdk2+τkzk+βdk,y~kykτkdk,f2(xk)f1(u2k)+τkdk,Δk+1+Lη2[τk2d~k2+zk+1zk2].\begin{split}&\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\leq-\beta\tau_{k}\|d^{k}\|^{2}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\big{\rangle}\\ &\qquad-\tau_{k}\big{\langle}d^{k},\nabla f_{2}(x^{k})\nabla f_{1}(u_{2}^{k})\big{\rangle}+\tau_{k}\big{\langle}d^{k},\Delta^{k+1}\big{\rangle}+\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}.\end{split} (47)

3. In addition, by the Lipschitzness of f2f_{2} and f1\nabla f_{1}, we have

dk,F(xk)f2(xk)f1(u2k)=dk,f2(xk)[f1(f2(xk))f1(u2k)]Lf1Lf2dkf2(xk)u2k.\begin{split}\langle d^{k},\nabla F(x^{k})-\nabla f_{2}(x^{k})\nabla f_{1}(u_{2}^{k})\rangle&=\big{\langle}d^{k},\nabla f_{2}(x^{k})^{\top}\big{[}\nabla f_{1}(f_{2}(x^{k}))-\nabla f_{1}(u_{2}^{k})\big{]}\big{\rangle}\\ &\leq L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|.\end{split} (48)

4. Moreover, by the update rule, we have

f2(xk+1)u2k+12=f2(xk+1)f2(xk)+(1τk)[f2(xk)u2k]+τkΔG2k+12=(1τk)[f2(xk)u2k]+f2(xk+1)f2(xk)2+τk2ΔG2k+12+r˙k+1(1τk)f2(xk)u2k2+2τkLf22(dk2+y~kyk2)+τk2ΔG2k+12+r˙k+1\begin{split}&\|f_{2}(x^{k+1})-u_{2}^{k+1}\|^{2}=\|f_{2}(x^{k+1})-f_{2}(x^{k})+(1-\tau_{k})[f_{2}(x^{k})-u_{2}^{k}]+\tau_{k}\Delta_{G_{2}}^{k+1}\|^{2}\\ &\qquad=\|(1-\tau_{k})[f_{2}(x^{k})-u_{2}^{k}]+f_{2}(x^{k+1})-f_{2}(x^{k})\|^{2}+\tau_{k}^{2}\|\Delta_{G_{2}}^{k+1}\|^{2}+\dot{r}^{k+1}\\ &\qquad\leq(1-\tau_{k})\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+2\tau_{k}L_{f_{2}}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+\tau_{k}^{2}\|\Delta_{G_{2}}^{k+1}\|^{2}+\dot{r}^{k+1}\end{split} (49)

where r˙k+1:=2τkΔG2k+1,f2(xk+1)f2(xk)+(1τk)(f2(xk)u2k)\dot{r}^{k+1}:=2\tau_{k}\langle\Delta_{G_{2}}^{k+1},f_{2}(x^{k+1})-f_{2}(x^{k})+(1-\tau_{k})(f_{2}(x^{k})-u_{2}^{k})\rangle and the last inequality follows from Jensen’s inequality for the convex function 2\|\cdot\|^{2} as well as

1τk[f2(xk+1)f2(xk)]2Lf22d~k22Lf22(dk2+y~kyk2).\left\|\frac{1}{\tau_{k}}\left[f_{2}(x^{k+1})-f_{2}(x^{k})\right]\right\|^{2}\leq L_{f_{2}}^{2}\|\tilde{d}^{k}\|^{2}\leq 2L_{f_{2}}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2}).

5. Defining

ek:=1τk[F(xk+1)F(xk)]+F(xk)f2(xk)f1(uk),e^{k}:=\frac{1}{\tau_{k}}\bigg{[}\nabla F(x^{k+1})-\nabla F(x^{k})\bigg{]}+\nabla F(x^{k})-\nabla f_{2}(x^{k})\nabla f_{1}(u^{k}),

and by the update rule, we have

F(xk+1)zk+12=(1τk)[F(xk)zk]+τk[ek+Δk+1]2=(1τk)[F(xk)zk]+τkek2+τk2Δk+12+r¨k+1(1τk)F(xk)zk2+τkek2+τk2Δk+12+r¨k+1\begin{split}&\|\nabla F(x^{k+1})-z^{k+1}\|^{2}=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[e^{k}+\Delta^{k+1}]\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}e^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+\ddot{r}^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\tau_{k}\|e^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+\ddot{r}^{k+1}\\ \end{split} (50)

where r¨k+1:=2τkΔk+1,(1τk)[F(xk)zk]+τkek\ddot{r}^{k+1}:=2\tau_{k}\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}e^{k}\rangle. We can further upper bound the term ek2\|e^{k}\|^{2} by

ek22LF2d~k2+2Lf12Lf22f2(xk)uk24LF2(dk2+y~kyk2)+2Lf12Lf22f2(xk)uk2\begin{split}\|e^{k}\|^{2}&\leq 2L_{\nabla F}^{2}\|\tilde{d}^{k}\|^{2}+2L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}\|f_{2}(x^{k})-u^{k}\|^{2}\\ &\leq 4L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+2L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}\|f_{2}(x^{k})-u^{k}\|^{2}\end{split} (51)

6. By combining (46), (47), (48), (49), (50), and (51), rearranging the terms, and noting that zk+βdk,y~kyk=Hk(y~k)Hk(yk)(β/2)y~kyk2\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2} and d~kD𝒳\|\tilde{d}^{k}\|\leq D_{\mathcal{X}}, we obtain

Wk+1Wkτk𝐀k+𝐑kW_{k+1}-W_{k}\leq\tau_{k}\mathbf{A}_{k}+\mathbf{R}_{k} (52)

where 𝐑k\mathbf{R}_{k} is defined in (45) and

𝐀k:=(β+4αLF2+2γLf22)dk2+(γ+2αLf12Lf22)f2(xk)u2k2+Lf1Lf2dkf2(xk)u2kαF(xk)zk2+(βdk+F(xk)zk)y~kyk+(4αLF2+2γLf22β)y~kyk2+2(Hk(y~k)Hk(yk)).\begin{split}\mathbf{A}_{k}:=&\left(-\beta+4\alpha L_{\nabla F}^{2}+2\gamma L_{f_{2}}^{2}\right)\|d^{k}\|^{2}+\left(-\gamma+2\alpha L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}\right)\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\\ &+L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|-\alpha\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\left(\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right)\|\tilde{y}^{k}-y^{k}\|\\ &+\left(4\alpha L_{\nabla F}^{2}+2\gamma L_{f_{2}}^{2}-\beta\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

We then provide a simplified upper bound for 𝐀k\mathbf{A}_{k}. By Young’s inequality, we have

βdky~kykβ4dk2+βy~kyk2,F(xk)zky~kykα2F(xk)zk2+12αy~kyk2.\begin{split}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\beta}{4}\|d^{k}\|^{2}+\beta\|\tilde{y}^{k}-y^{k}\|^{2},\\ \|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{2\alpha}\|\tilde{y}^{k}-y^{k}\|^{2}.\end{split}

In addition, we reparametrize α=ρLF\alpha=\frac{\rho}{L_{\nabla F}}. Noting that by Lemma 6 with T=2T=2

Lf12Lf22LF=Lf12Lf22Lf1Lf22+Lf1Lf2Lf1,\frac{L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}}{L_{\nabla F}}=\frac{L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}}{L_{\nabla f_{1}}L_{f_{2}}^{2}+L_{f_{1}}L_{\nabla f_{2}}}\leq L_{\nabla f_{1}},

we therefore have

𝐀k(3β4+4ρLF+2γLf22)dk2+(γ+2ρLf1)f2(xk)u2k2+Lf1Lf2dkf2(xk)u2kρ2LFF(xk)zk2+(4ρLF+2γLf22+LF2ρ)y~kyk2+2(Hk(y~k)Hk(yk))\begin{split}\mathbf{A}_{k}\leq&\left(-\frac{3\beta}{4}+4\rho L_{\nabla F}+2\gamma L_{f_{2}}^{2}\right)\|d^{k}\|^{2}+\left(-\gamma+2\rho L_{\nabla f_{1}}\right)\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\\ &+L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|-\frac{\rho}{2L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\left(4\rho L_{\nabla F}+2\gamma L_{f_{2}}^{2}+\frac{L_{\nabla F}}{2\rho}\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

Then, setting γ=3ρLf1\gamma=3\rho L_{\nabla f_{1}} and β6ρLF+(2ρ+23ρ)Lf1Lf22\beta\geq 6\rho L_{\nabla F}+(2\rho+\frac{2}{3\rho})L_{\nabla f_{1}}L_{f_{2}}^{2}, we can obtain

(3β4+4ρLF+2γLf22)dk2+(γ+2ρLf1)f2(xk)u2k2+Lf1Lf2dkf2(xk)u2kρLF2dk2ρLf12f2(xk)u2k2\begin{split}&\left(-\frac{3\beta}{4}+4\rho L_{\nabla F}+2\gamma L_{f_{2}}^{2}\right)\|d^{k}\|^{2}+\left(-\gamma+2\rho L_{\nabla f_{1}}\right)\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\\ &\qquad+L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|\leq-\frac{\rho L_{\nabla F}}{2}\|d^{k}\|^{2}-\frac{\rho L_{\nabla f_{1}}}{2}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\end{split}

Also, we have (β/2)y~kyk2Hk(y~k)Hk(yk)(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}\leq H_{k}(\tilde{y}^{k})-H_{k}(y^{k}). Therefore, we can further bound 𝐀k\mathbf{A}_{k} by

𝐀kρLF2dk2ρLf12f2(xk)u2k2ρ2LFF(xk)zk2+(2+(8ρ+1/ρ)LF+12ρLf1Lf22β)(Hk(y~k)Hk(yk)).\begin{split}\mathbf{A}_{k}\leq&-\frac{\rho L_{\nabla F}}{2}\|d^{k}\|^{2}-\frac{\rho L_{\nabla f_{1}}}{2}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}-\frac{\rho}{2L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\bigg{(}2+\frac{(8\rho+1/\rho)L_{\nabla F}+12\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split} (53)

Telescoping (52) together with (53), we get

ρk=0Nτk(LFdk2+Lf1f2(xk)u2k2+1LFF(xk)zk2)2W0+2k=0N𝐑k+(4+2(8ρ+1/ρ)LF+24ρLf1Lf22β)k=0Nτk(Hk(y~k)Hk(yk))\begin{split}&\rho\sum_{k=0}^{N}\tau_{k}\bigg{(}L_{\nabla F}\|d^{k}\|^{2}+L_{\nabla f_{1}}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+\frac{1}{L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}4+\frac{2(8\rho+1/\rho)L_{\nabla F}+24\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

Proof of Theorem 3, part (a).

The proof follows the same arguments as in the proof of Theorem 2. Note that, by Lemma 12 and the given values of α,γ\alpha,\gamma in (12), we obtain

ρk=1Nτk[LFdk2+Lf1f2(xk)u2k2+1LFF(xk)zk2]2Wα,γ(x0,z0,u0)+2k=0N𝐑k+(4+2(8ρ+1/ρ)LF+24ρLf1Lf22β)k=0Nτk(Hk(y~k)Hk(yk)).\begin{split}&\rho\sum_{k=1}^{N}\tau_{k}\left[L_{\nabla F}\|d^{k}\|^{2}+L_{\nabla f_{1}}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+\frac{1}{L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\right]\leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})\\ &\quad+2\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}4+\frac{2(8\rho+1/\rho)L_{\nabla F}+24\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

Noting that

𝔼[𝐑k|k]=τk2[LF+Lη2D𝒳2+γσG22+(α+2Lη)σ^J12σ^J22]:=τk2σ^2,\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]=\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\gamma\sigma_{G_{2}}^{2}+(\alpha+2L_{\eta})\hat{\sigma}_{J_{1}}^{2}\hat{\sigma}_{J_{2}}^{2}\right]:=\tau_{k}^{2}\hat{\sigma}^{2},

and taking the expectation of both sides, we can complete the proof with the same arguments as in the proof of Theorem 2. The constants 𝒞1\mathcal{C}_{1} and 𝒞2\mathcal{C}_{2} turn out to be

𝒞1=4(β2ρLF+LFρ){Wα,γ(x0,z0,u0)+σ^2+4D𝒳2(1+δ)[2β+(8ρ+1ρ)LF+12ρLf1Lf22)]},𝒞2=2ρLf1{Wα,γ(x0,z0,u0)+σ^2+4D𝒳2(1+δ)[2β+(8ρ+1ρ)LF+12ρLf1Lf22)]}.\begin{split}&\mathcal{C}_{1}=4\left(\frac{\beta^{2}}{\rho L_{\nabla F}}+\frac{L_{\nabla F}}{\rho}\right)\bigg{\{}W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\hat{\sigma}^{2}\\ &\qquad\qquad\qquad\qquad+4D_{\mathcal{X}}^{2}(1+\delta)\left[2\beta+(8\rho+\frac{1}{\rho})L_{\nabla F}+12\rho L_{\nabla f_{1}}L_{f_{2}}^{2})\right]\bigg{\}},\\ &\mathcal{C}_{2}=\frac{2}{\rho L_{\nabla f_{1}}}\bigg{\{}W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\hat{\sigma}^{2}\\ &\qquad\qquad\qquad\qquad+4D_{\mathcal{X}}^{2}(1+\delta)\left[2\beta+(8\rho+\frac{1}{\rho})L_{\nabla F}+12\rho L_{\nabla f_{1}}L_{f_{2}}^{2})\right]\bigg{\}}.\end{split} (54)

D.2 Proof of Theorem 3 for T=1T=1

To show the rate of convergence for Algorithm 4, we leverage the following merit function:

Wα(xk,zk,uk)=F(xk)Fη(xk,zk)+αF(xk)zk2,W_{\alpha}(x^{k},z^{k},u^{k})=F(x^{k})-F^{\star}-\eta(x^{k},z^{k})+\alpha\|\nabla F(x^{k})-z^{k}\|^{2}, (55)

where α>0\alpha>0, η(,)\eta(\cdot,\cdot) is defined in (10).

Lemma 13.

Let {xk,zk}k0\{x^{k},z^{k}\}_{k\geq 0} be the sequence generated by Algorithm 4 with βkβ>0\beta_{k}\equiv\beta>0 and the merit function Wα(,)W_{\alpha}(\cdot,\cdot) be defined in (55) with α=β4LF2\alpha=\frac{\beta}{4L_{\nabla F}^{2}}. Under Assumption 2 with T=1T=1, we have N0\forall N\geq 0

βk=0Nτk(dk2+12LF2F(xk)zk2)4Wα(x0,u0)+4k=0N𝐑k+(12+16LF2β2)k=0Nτk(Hk(y~k)Hk(yk))\begin{split}&\beta\sum_{k=0}^{N}\tau_{k}\bigg{(}\|d^{k}\|^{2}+\frac{1}{2L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 4W_{\alpha}(x^{0},u^{0})+4\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}12+\frac{16L_{\nabla F}^{2}}{\beta^{2}}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

where dk:=ykxkd^{k}:=y^{k}-x^{k}, Hk(),ykH_{k}(\cdot),y^{k} are defined in (16), Δk+1:=F(xk)J1k+1\Delta^{k+1}:=\nabla F(x^{k})-J_{1}^{k+1}, and

𝐑k:=τk2[LF+Lη2D𝒳2+αΔk+12]+Lη2zk+1zk2+τkdk,Δk+1+αrk+1,rk+1:=2τkΔk+1,(1τk)[F(xk)zk]+F(xk+1)F(xk).\begin{split}\mathbf{R}_{k}:=&\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\alpha\|\Delta^{k+1}\|^{2}\right]+\frac{L_{\eta}}{2}\|z^{k+1}-z^{k}\|^{2}\\ &\qquad+\tau_{k}\langle d^{k},\Delta^{k+1}\rangle+\alpha r^{k+1},\\ r^{k+1}:=&2\tau_{k}\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle.\end{split} (56)
Proof.

The proof is essentially a simplified version of the proof of Lemma 12. Hence, we skip some arguments already presented earlier.

1. By the Lipschitzness of F\nabla F, we have

F(xk+1)F(xk)τkF(xk),dk+τkF(xk)zky~kyk+τkβdky~kyk+τkzk+βdk,y~kyk+LFτk2d~k22.\begin{split}&F(x^{k+1})-F(x^{k})\leq\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|\\ &\qquad+\tau_{k}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}}{2}.\\ \end{split} (57)

2. Also, by the Lipschitzness of η\nabla\eta (Lemma 7) and the optimality condition in the definition of yky^{k}, we have

η(xk,zk)η(xk+1,zk+1)βτkdk2+τkzk+βdk,y~kykτkdk,F(xk)+τkdk,Δk+1+Lη2[τk2d~k2+zk+1zk2].\begin{split}&\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\leq-\beta\tau_{k}\|d^{k}\|^{2}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\big{\rangle}\\ &\qquad-\tau_{k}\big{\langle}d^{k},\nabla F(x^{k})\big{\rangle}+\tau_{k}\big{\langle}d^{k},\Delta^{k+1}\big{\rangle}+\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}.\end{split} (58)

3. By the update rule, we have

F(xk+1)zk+12=(1τk)[F(xk)zk]+F(xk+1)F(xk)+τkΔk+12=(1τk)[F(xk)zk]+F(xk+1)F(xk)2+τk2Δk+12+rk+1(1τk)F(xk)zk2+1τkF(xk+1)F(xk)2+τk2Δk+12+rk+1(1τk)F(xk)zk2+τkLF2d~k2+τk2Δk+12+rk+1(1τk)F(xk)zk2+2τkLF2(dk2+y~kyk2)+τk2Δk+12+rk+1\begin{split}&\|\nabla F(x^{k+1})-z^{k+1}\|^{2}=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})+\tau_{k}\Delta^{k+1}\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{\tau_{k}}\|\nabla F(x^{k+1})-\nabla F(x^{k})\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\tau_{k}L_{\nabla F}^{2}\|\tilde{d}^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+2\tau_{k}L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ \end{split} (59)

where rk+1:=2τkΔk+1,(1τk)[F(xk)zk]+F(xk+1)F(xk)r^{k+1}:=2\tau_{k}\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle.

4. By combining (57), (58), and (59), rearranging the terms, and noting that zk+βdk,y~kyk=Hk(y~k)Hk(yk)(β/2)y~kyk2\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2} and d~kD𝒳\|\tilde{d}^{k}\|\leq D_{\mathcal{X}}, we obtain

Wk+1Wkτk𝐀k+𝐑kW_{k+1}-W_{k}\leq\tau_{k}\mathbf{A}_{k}+\mathbf{R}_{k} (60)

where 𝐑k\mathbf{R}_{k} is defined in (56) and

𝐀k:=(β+2αLF2)dk2αF(xk)zk2+(βdk+F(xk)zk)y~kyk+(2αLF2β)y~kyk2+2(Hk(y~k)Hk(yk)).\begin{split}\mathbf{A}_{k}:=&\left(-\beta+2\alpha L_{\nabla F}^{2}\right)\|d^{k}\|^{2}-\alpha\|\nabla F(x^{k})-z^{k}\|^{2}+\left(\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right)\|\tilde{y}^{k}-y^{k}\|\\ &+\left(2\alpha L_{\nabla F}^{2}-\beta\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

We then provide a simplified upper bound for 𝐀k\mathbf{A}_{k}. By Young’s inequality, we have

βdky~kykβ4dk2+βy~kyk2,F(xk)zky~kykα2F(xk)zk2+12αy~kyk2.\begin{split}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\beta}{4}\|d^{k}\|^{2}+\beta\|\tilde{y}^{k}-y^{k}\|^{2},\\ \|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{2\alpha}\|\tilde{y}^{k}-y^{k}\|^{2}.\end{split}

In addition, setting α=β4LF2\alpha=\frac{\beta}{4L_{\nabla F}^{2}} and noting (β/2)y~kyk2Hk(y~k)Hk(yk)(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}\leq H_{k}(\tilde{y}^{k})-H_{k}(y^{k}), we have

𝐀kβ4dk2β8LF2F(xk)zk2+(3+4LF2β2)(Hk(y~k)Hk(yk))\begin{split}\mathbf{A}_{k}\leq&-\frac{\beta}{4}\|d^{k}\|^{2}-\frac{\beta}{8L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}+\left(3+\frac{4L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\\ \end{split} (61)

Telescoping (60) together with (61), we get

βk=0Nτk(dk2+12LF2F(xk)zk2)4Wα(x0,u0)+4k=0N𝐑k+(12+16LF2β2)k=0Nτk(Hk(y~k)Hk(yk))\begin{split}&\beta\sum_{k=0}^{N}\tau_{k}\bigg{(}\|d^{k}\|^{2}+\frac{1}{2L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 4W_{\alpha}(x^{0},u^{0})+4\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}12+\frac{16L_{\nabla F}^{2}}{\beta^{2}}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

Proof of Theorem 3, part (b).

Given Lemma 13, the proof follows the same arguments as in the proof of Theorem 2. The constant 𝒞3\mathcal{C}_{3} turns out to be

𝒞3=8(β+2LF2β){Wα(x0,u0)+D𝒳2[(1+δ)(12β+16LF2β)+LF+Lη2]+ασJ12+2Lησ^J12}.\begin{split}\mathcal{C}_{3}=8\left(\beta+\frac{2L_{\nabla F}^{2}}{\beta}\right)\bigg{\{}W_{\alpha}(x^{0},u^{0})+D_{\mathcal{X}}^{2}\left[(1+\delta)\left(12\beta+\frac{16L_{\nabla F}^{2}}{\beta}\right)+\frac{L_{\nabla F}+L_{\nabla\eta}}{2}\right]\\ +\alpha\sigma_{J_{1}}^{2}+2L_{\eta}\hat{\sigma}_{J_{1}}^{2}\bigg{\}}\end{split}. (62)
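Before turning to the high-probability analysis, we record a schematic reconstruction of a single iteration of the one-level (T=1) variant analyzed above, pieced together from the relations used in the proofs: the inexact minimization of H_{k} by conditional gradient steps, the averaged primal update x^{k+1}=x^{k}+\tau_{k}(\tilde{y}^{k}-x^{k}), and the gradient-tracking update z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}J_{1}^{k+1} with J_{1}^{k+1} a stochastic gradient at x^{k} (cf. the definition \Delta^{k+1}:=\nabla F(x^{k})-J_{1}^{k+1}). The sketch is in Python/NumPy, reuses the icg helper given after Lemma 8, and should be read as an illustration rather than a verbatim transcription of Algorithm 4, whose exact statement and parameter choices (11) appear in the main text.

def one_iteration_T1(x, z, tau, beta, stochastic_grad, lmo, inner_steps):
    # (i) approximately minimize H(y) = <z, y - x> + (beta/2)||y - x||^2 over X
    y_tilde = icg(x, z, beta, lmo, inner_steps)
    # (ii) averaged primal step
    x_next = x + tau * (y_tilde - x)
    # (iii) gradient tracking with a fresh stochastic gradient evaluated at x
    z_next = (1.0 - tau) * z + tau * stochastic_grad(x)
    return x_next, z_next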

Appendix E High-Probability Convergence for T=1T=1

E.1 Preliminaries

We provide a short review of sub-gaussian and sub-exponential random variables for completeness.

Definition 14.

(Sub-gaussian and Sub-exponential)

  • (a)

    A random variable XX is KK-sub-gaussian, for a given K>0K>0, if 𝔼[exp(X2/K2)]2\mathbb{E}[\exp(X^{2}/K^{2})]\leq 2. The sub-gaussian norm of XX, denoted Xψ2\|X\|_{\psi_{2}}, is defined to be the smallest such KK. That is to say,

    Xψ2=inf{t>0:𝔼[exp(X2/t2)]2}.\|X\|_{\psi_{2}}=\inf\left\{t>0:\mathbb{E}[\exp(X^{2}/t^{2})]\leq 2\right\}.
  • (b)

    A random variable XX is KK-sub-exponential, for a given K>0K>0, if 𝔼[exp(|X|/K)]2\mathbb{E}[\exp(|X|/K)]\leq 2. The sub-exponential norm of XX, denoted Xψ1\|X\|_{\psi_{1}}, is defined to be the smallest such KK. That is to say,

    Xψ1=inf{t>0:𝔼[exp(|X|/t)]2}.\|X\|_{\psi_{1}}=\inf\left\{t>0:\mathbb{E}[\exp(|X|/t)]\leq 2\right\}.

The above characterization is based on the so-called Orlicz norm of a random variable. There are equivalent definitions of sub-gaussian and sub-exponential random variables; we refer the reader to Proposition 2.5.2 and Proposition 2.7.1 in [Ver18]. In particular, we will also use another characterization of sub-gaussian random variables, based on the moment generating function, given below.
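As a concrete instance of Definition 14(a), for a centered Gaussian X\sim N(0,\sigma^{2}) the Orlicz norm can be computed in closed form:

\mathbb{E}\left[\exp(X^{2}/t^{2})\right]=\left(1-\frac{2\sigma^{2}}{t^{2}}\right)^{-1/2}\quad\text{for }t^{2}>2\sigma^{2},\qquad\text{so that}\qquad\|X\|_{\psi_{2}}=\inf\left\{t>0:\left(1-\frac{2\sigma^{2}}{t^{2}}\right)^{-1/2}\leq 2\right\}=\sqrt{8/3}\,\sigma.

In particular, the sub-gaussian norm scales with the standard deviation, which is the scaling to keep in mind when interpreting the constant KK appearing in Lemma 18 below.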

Lemma 15.

(Sub-gaussian M.G.F. [Ver18]) If a random variable XX is KK-sub-gaussian with 𝔼[X]=0\mathbb{E}[X]=0, then 𝔼[exp(λX)]exp(cλ2K2)λ\mathbb{E}[\exp(\lambda X)]\leq\exp(c\lambda^{2}K^{2})~{}\forall\lambda\in\mathbb{R}, where cc is an absolute constant.

In the high-probability results we show for the special case T=1T=1, we need to handle the tail probabilities of two types of terms involving the mean-zero noise with bounded sub-gaussian norm, namely Δk+12\|\Delta^{k+1}\|^{2} and Δk+1,Λk\langle\Delta^{k+1},\Lambda^{k}\rangle, where (Δk)(\Delta^{k}) and (Λk)(\Lambda^{k}) are adapted to (k)(\mathscr{F}_{k}). Our proof leverages the following two lemmas to control the probability of these terms being too large.

Lemma 16.

(Sub-exponential is sub-gaussian squared [Ver18]) A random variable XX is sub-gaussian if and only if X2X^{2} is sub-exponential. Moreover, X2ψ1=Xψ22\|X^{2}\|_{\psi_{1}}=\|X\|_{\psi_{2}}^{2}.

Lemma 17.

(Generalized Freedman-type Inequality [HLPR19]) Let (Ω,,(i),P)(\Omega,\mathscr{F},(\mathscr{F}_{i}),P) be a filtered probability space, (Xi)(X_{i}) and (Ki)(K_{i}) be adapted to (i)(\mathscr{F}_{i}), and nn\in\mathbb{N}. Suppose for all i[n]i\in[n], Ki10K_{i-1}\geq 0, 𝔼[Xi|i1]=0\mathbb{E}[X_{i}|\mathscr{F}_{i-1}]=0, and 𝔼[exp(λXi)|i1]exp(λ2Ki2)\mathbb{E}\left[\exp(\lambda X_{i})|\mathscr{F}_{i-1}\right]\leq\exp(\lambda^{2}K_{i}^{2}). Then for any t,b0,a>0t,b\geq 0,a>0,

Pr(k[n]{i=1kXit and 2i=1kKi12ai=1kXi+b})exp(t4a+8b/t).\mathrm{Pr}\left(\underset{k\in[n]}{\bigcup}\left\{\sum_{i=1}^{k}X_{i}\geq t\text{ and }2\sum_{i=1}^{k}K_{i-1}^{2}\leq a\sum_{i=1}^{k}X_{i}+b\right\}\right)\leq\exp\left(-\frac{t}{4a+8b/t}\right). (63)

E.2 Proof of Theorem 5

We start by presenting the lemma below, which leverages the inequalities in Appendix E.1 to show high-probability upper bounds for terms involved in the previous analysis.

Lemma 18.

Under the conditions of Lemma 13 and Assumption 4, for any δ1,δ2,δ3,a>0\delta_{1},\delta_{2},\delta_{3},a>0, we have

  • (a)

    with probability at least 1δ11-\delta_{1}, k=0Nτk2Δk+12K2log(2/δ1)k=0Nτk2;\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}\leq K^{2}\log(2/\delta_{1})\sum_{k=0}^{N}\tau_{k}^{2};

  • (b)

    with probability at least 1δ21-\delta_{2},

    k=0Nτk2i=0k1αi,kΔi+12K2log(2/δ2)k=0Nτk2,\sum_{k=0}^{N}\tau_{k}^{2}\sum_{i=0}^{k-1}\alpha_{i,k}\|\Delta^{i+1}\|^{2}\leq K^{2}\log(2/\delta_{2})\sum_{k=0}^{N}\tau_{k}^{2},

    where αi,k>0\alpha_{i,k}>0 and i=0k1αi,k=1\sum_{i=0}^{k-1}{\alpha_{i,k}}=1;

  • (c)

    with probability at least 1δ31-\delta_{3},

    k=0NΔk+1,2ατk(1τk)[F(xk)zk]+2ατk(F(xk+1)F(xk))+τkdk4alog(1/δ3)+β2K2aLF4k=0Nτk2(1τk)F(xk)zk2+K2ak=0Nτk2(4+β2τkLF2)dk2.\begin{split}&\sum_{k=0}^{N}\langle\Delta^{k+1},2\alpha\tau_{k}(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+2\alpha\tau_{k}(\nabla F(x^{k+1})-\nabla F(x^{k}))+\tau_{k}d^{k}\rangle\\ &\leq 4a\log(1/\delta_{3})+\frac{\beta^{2}K^{2}}{aL_{\nabla F}^{4}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\|\nabla F(x^{k})-z^{k}\right\|^{2}+\frac{K^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}(4+\frac{\beta^{2}\tau_{k}}{L_{\nabla F}^{2}})\left\|d^{k}\right\|^{2}.\end{split}
Proof of Lemma 18.

We first show (a). Using the law of total expectation, we have 𝔼[exp(τkΔk+12τk2K2)]2\mathbb{E}\left[\exp\left(\frac{\|\tau_{k}\Delta^{k+1}\|^{2}}{\tau_{k}^{2}K^{2}}\right)\right]\leq 2, which implies that τkΔk+12\|\tau_{k}\Delta^{k+1}\|^{2} is τk2K2\tau_{k}^{2}K^{2}-sub-exponential. Thus, we have with probability at least 1δ11-\delta_{1},

k=0Nτk2Δk+12K2log(2/δ1)k=0Nτk2.\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}\leq K^{2}\log(2/\delta_{1})\sum_{k=0}^{N}\tau_{k}^{2}. (64)

We then show (b). Let Zk=τk2{i=0k1αi,kΔi+12}k0Z_{k}=\tau_{k}^{2}\left\{\sum_{i=0}^{k-1}\alpha_{i,k}\left\|\Delta^{i+1}\right\|^{2}\right\}\forall k\geq 0. Note that for all k0k\geq 0, Δk+12\|\Delta^{k+1}\|^{2} is K2K^{2}-sub-exponential, which further implies that the sub-exponential norm of Zk(k>0)Z_{k}~{}(k>0) satisfies Zkψ1τk2K2\|Z_{k}\|_{\psi_{1}}\leq\tau_{k}^{2}K^{2}. Therefore, we have for any δ2>0\delta_{2}>0, with probability at least 1δ21-\delta_{2},

k=0NZkK2log(2/δ2)k=0Nτk2.\sum_{k=0}^{N}Z_{k}\leq K^{2}\log(2/\delta_{2})\sum_{k=0}^{N}\tau_{k}^{2}. (65)

To prove (c), we apply Lemma 15 and Lemma 17 with

Xi=Δk+1,2ατk{(1τk)[F(xk)zk]+F(xk+1)F(xk)}+τkdk,Ki=cK2ατk{(1τk)[F(xk)zk]+F(xk+1)F(xk)}+τkdk,b=0,t=4alog(1/δ3).\begin{split}X_{i}&=\left\langle\Delta^{k+1},2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\rangle,\\ K_{i}&=\sqrt{c}K\left\|2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\|,\\ b&=0,t=4a\log(1/\delta_{3}).\end{split}

Noting that α=β4LF2\alpha=\frac{\beta}{4L_{\nabla F}^{2}}, we obtain that, for all a>0a>0, with probability at least 1δ31-\delta_{3}, either i=0NXi4alog(1/δ3)\sum_{i=0}^{N}X_{i}\leq 4a\log(1/\delta_{3}) or

i=0NXi\displaystyle\sum_{i=0}^{N}X_{i} 2cK2ak=0N2ατk{(1τk)[F(xk)zk]+F(xk+1)F(xk)}+τkdk2\displaystyle\leq\frac{2cK^{2}}{a}\sum_{k=0}^{N}\left\|2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\|^{2}
4cK2ak=0Nτk2{4α2(1τk)[F(xk)zk]+F(xk+1)F(xk)2+dk2}\displaystyle\leq\frac{4cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}\left\{4\alpha^{2}\left\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\|^{2}+\left\|d^{k}\right\|^{2}\right\}
4cK2ak=0Nτk2{4α2(1τk)F(xk)zk2+(1+4α2LF2τk)dk2}\displaystyle\leq\frac{4cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}\left\{4\alpha^{2}(1-\tau_{k})\left\|\nabla F(x^{k})-z^{k}\right\|^{2}+(1+4\alpha^{2}L_{\nabla F}^{2}\tau_{k})\left\|d^{k}\right\|^{2}\right\}
=cβ2K2aLF4k=0Nτk2(1τk)F(xk)zk2+cK2ak=0Nτk2(4+β2τkLF2)dk2,\displaystyle=\frac{c\beta^{2}K^{2}}{aL_{\nabla F}^{4}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\|\nabla F(x^{k})-z^{k}\right\|^{2}+\frac{cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}(4+\frac{\beta^{2}\tau_{k}}{L_{\nabla F}^{2}})\left\|d^{k}\right\|^{2},

where the second inequality uses $\|u+v\|^{2}\leq 2\|u\|^{2}+2\|v\|^{2}$, and the third inequality follows from the convexity of $\|\cdot\|^{2}$ and the Lipschitz continuity of $\nabla F$. ∎
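As a quick numerical sanity check of the two elementary facts behind the third inequality, namely $\|(1-\tau)u+v\|^{2}\leq(1-\tau)\|u\|^{2}+\|v\|^{2}/\tau$ (convexity of $\|\cdot\|^{2}$) and $\|\nabla F(x^{k+1})-\nabla F(x^{k})\|\leq L_{\nabla F}\tau_{k}\|d^{k}\|$ (smoothness, with the update $x^{k+1}=x^{k}+\tau_{k}d^{k}$ used in the proof), the sketch below tests both on a synthetic quadratic. The function, its Lipschitz constant, and all sampled points are illustrative assumptions unrelated to the paper's setting.

```python
import numpy as np

# Check (i) ||(1 - t) u + v||^2 <= (1 - t) ||u||^2 + ||v||^2 / t  (convexity of ||.||^2)
# and  (ii) ||grad F(x + t d) - grad F(x)|| <= L * t * ||d||       (L-smooth gradient)
# for the synthetic quadratic F(x) = 0.5 x^T H x; everything here is illustrative.
rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))
H = A.T @ A                       # Hessian of F
L = np.linalg.eigvalsh(H).max()   # Lipschitz constant of grad F

def grad_F(x):
    return H @ x

for _ in range(1000):
    x, dk, u = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
    t = rng.uniform(0.01, 1.0)                         # plays the role of tau_k
    v = grad_F(x + t * dk) - grad_F(x)
    lhs = np.linalg.norm((1 - t) * u + v) ** 2
    rhs = (1 - t) * np.linalg.norm(u) ** 2 + np.linalg.norm(v) ** 2 / t
    assert lhs <= rhs + 1e-9                           # (i)
    assert np.linalg.norm(v) <= L * t * np.linalg.norm(dk) + 1e-9  # (ii)
print("both inequalities hold on all sampled points")
```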

Equipped with the above lemma and Lemma 13, we now present the complete proof of Theorem 5.

Proof of Theorem 5.

Given the update rule of $\{z^{k}\}$ and the fact that $\tau_{0}=1$, we obtain

z^{k}=\sum_{i=0}^{k-1}\alpha_{i,k}J_{1}^{i+1},\qquad\alpha_{i,k}=\frac{\tau_{i}}{\Gamma_{i+1}}\Gamma_{k}~~\text{for }0\leq i\leq k-1,\qquad\sum_{i=0}^{k-1}\alpha_{i,k}=1~~\text{for }k\geq 1,

where $\Gamma_{k}=\Gamma_{1}\prod_{i=1}^{k-1}(1-\tau_{i})$ and $\Gamma_{1}=1$. Thus,

\begin{split}\left\|z^{k+1}-z^{k}\right\|^{2}&=\tau_{k}^{2}\left\|J_{1}^{k+1}-z^{k}\right\|^{2}\leq 2\tau_{k}^{2}\left\{\left\|J_{1}^{k+1}\right\|^{2}+\left\|\sum_{i=0}^{k-1}\alpha_{i,k}J_{1}^{i+1}\right\|^{2}\right\}\\
&\leq 2\tau_{k}^{2}\left\{\left\|J_{1}^{k+1}\right\|^{2}+\sum_{i=0}^{k-1}\alpha_{i,k}\left\|J_{1}^{i+1}\right\|^{2}\right\}\\
&\leq 4\tau_{k}^{2}\left\{\left\|\Delta^{k+1}\right\|^{2}+\left\|\nabla F(x^{k})\right\|^{2}+\sum_{i=0}^{k-1}\alpha_{i,k}\left[\left\|\Delta^{i+1}\right\|^{2}+\|\nabla F(x^{i})\|^{2}\right]\right\}\\
&\leq 4\tau_{k}^{2}\left\{\left\|\Delta^{k+1}\right\|^{2}+\sum_{i=0}^{k-1}\alpha_{i,k}\left\|\Delta^{i+1}\right\|^{2}+2L_{F}^{2}\right\},\end{split}

where the second inequality comes from the convexity of $\|\cdot\|^{2}$, and the last two inequalities use $\|u+v\|^{2}\leq 2\|u\|^{2}+2\|v\|^{2}$ and the bound $\|\nabla F(x^{i})\|\leq L_{F}$, respectively. Therefore, we have

\sum_{k=0}^{N}\|z^{k+1}-z^{k}\|^{2}\leq 4\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+4\sum_{k=0}^{N}\tau_{k}^{2}\sum_{i=0}^{k-1}\alpha_{i,k}\|\Delta^{i+1}\|^{2}+8L_{F}^{2}\sum_{k=0}^{N}\tau_{k}^{2}.
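As an aside, the closed form $z^{k}=\sum_{i=0}^{k-1}\alpha_{i,k}J_{1}^{i+1}$ used above can be checked mechanically against the averaging update $z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}J_{1}^{k+1}$ that is implicit in the first equality of the display for $\|z^{k+1}-z^{k}\|^{2}$. The sketch below does so with synthetic step sizes and vectors, which are placeholders rather than quantities produced by the algorithm.

```python
import numpy as np

# Verify z^{k} = sum_{i<k} alpha_{i,k} J^{i+1} with alpha_{i,k} = (tau_i / Gamma_{i+1}) Gamma_k,
# Gamma_1 = 1 and Gamma_k = prod_{i=1}^{k-1} (1 - tau_i), starting from the recursion
# z^{k+1} = (1 - tau_k) z^k + tau_k J^{k+1} with tau_0 = 1.  All inputs are synthetic.
rng = np.random.default_rng(2)
d, N = 3, 50
tau = np.concatenate(([1.0], rng.uniform(0.05, 0.9, size=N)))  # tau_0 = 1
J = rng.normal(size=(N + 1, d))                                # J^{1}, ..., J^{N+1}

Gamma = np.ones(N + 2)                                         # Gamma[1] = 1
for k in range(2, N + 2):
    Gamma[k] = Gamma[k - 1] * (1.0 - tau[k - 1])

z = np.zeros(d)
for k in range(N + 1):
    z = (1.0 - tau[k]) * z + tau[k] * J[k]                     # this is z^{k+1}
    alpha = np.array([tau[i] / Gamma[i + 1] * Gamma[k + 1] for i in range(k + 1)])
    assert abs(alpha.sum() - 1.0) < 1e-10                      # weights form a convex combination
    assert np.allclose(z, alpha @ J[: k + 1])                  # matches the closed form
print("closed form matches the recursion for every k")
```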

Applying Lemma 18 with $\delta_{1}=\delta_{2}=\delta_{3}=\delta/3$ and $a=\frac{16c\beta K^{2}}{L_{\nabla F}^{2}}$, together with Lemma 13 and a union bound over the three events, we have with probability at least $1-\delta$,

\begin{split}\sum_{k=0}^{N}\mathbf{R}_{k}=&\ \sum_{k=0}^{N}\langle\Delta^{k+1},2\alpha\tau_{k}(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+2\alpha\tau_{k}(\nabla F(x^{k+1})-\nabla F(x^{k}))+\tau_{k}d^{k}\rangle\\
&+(\alpha+2L_{\eta})\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+2L_{\eta}\sum_{k=0}^{N}\tau_{k}^{2}\sum_{i=0}^{k-1}\alpha_{i,k}\|\Delta^{i+1}\|^{2}\\
&+\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+4L_{\eta}L_{F}^{2}\right]\sum_{k=0}^{N}\tau_{k}^{2}\\
\leq&\ \frac{64\beta K^{2}}{L_{\nabla F}^{2}}\log(3/\delta)+\frac{\beta}{16L_{\nabla F}^{2}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\|\nabla F(x^{k})-z^{k}\right\|^{2}+\left(\frac{L_{\nabla F}^{2}}{4\beta}+\frac{\beta}{16}\right)\sum_{k=0}^{N}\tau_{k}^{2}\left\|d^{k}\right\|^{2}\\
&+\left[\left(\frac{\beta}{4L_{\nabla F}^{2}}+4L_{\eta}\right)K^{2}\log(6/\delta)+\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+4L_{\eta}L_{F}^{2}\right]\sum_{k=0}^{N}\tau_{k}^{2}.\end{split}

Thus, noting that $\|d^{k}\|^{2}\leq D_{\mathcal{X}}^{2}$ for all $k\geq 0$, we have with probability at least $1-\delta$,

\begin{split}&\beta\sum_{k=0}^{N}\tau_{k}\bigg{(}\|d^{k}\|^{2}+\frac{1}{4L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\
&\leq 4W_{\alpha}(x^{0},u^{0})+\bigg{(}12+\frac{16L_{\nabla F}^{2}}{\beta^{2}}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)+\frac{256\beta K^{2}}{L_{\nabla F}^{2}}\log(3/\delta)\\
&+\left[\left(\frac{\beta}{L_{\nabla F}^{2}}+16L_{\eta}\right)K^{2}\log(6/\delta)+\left(\frac{L_{\nabla F}^{2}}{\beta}+\frac{\beta}{4}+2L_{\nabla F}+2L_{\nabla\eta}\right)D_{\mathcal{X}}^{2}+16L_{\eta}L_{F}^{2}\right]\sum_{k=0}^{N}\tau_{k}^{2}.\end{split}

Following the same arguments as in the proof of Theorem 2, we have with probability at least $1-\delta$,

\min_{k=1,\dots,N}V(x^{k},z^{k})\leq\mathcal{O}\left(\frac{K^{2}\log(1/\delta)}{\sqrt{N}}\right).
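For intuition on how this $1/\sqrt{N}$ rate emerges from the penultimate display (the mechanism mirrors the proof of Theorem 2), consider the illustrative choice $\tau_{0}=1$ and $\tau_{k}=1/\sqrt{N+1}$ for $1\leq k\leq N$; this choice is an assumption made here only to exhibit the scaling. Then

\sum_{k=1}^{N}\tau_{k}=\frac{N}{\sqrt{N+1}}\geq\frac{\sqrt{N}}{2},\qquad\sum_{k=0}^{N}\tau_{k}^{2}=1+\frac{N}{N+1}\leq 2,

so, once the remaining terms on the right-hand side are controlled as in the proof of Theorem 2, the right-hand side is of order $K^{2}\log(1/\delta)$, while the left-hand side is at least $\frac{\beta\sqrt{N}}{2}\min_{1\leq k\leq N}\big(\|d^{k}\|^{2}+\frac{1}{4L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\big)$; dividing through yields a bound of order $K^{2}\log(1/\delta)/\sqrt{N}$ for the stationarity measure, which is the displayed result.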