A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization

Tesi Xiao Department of Statistics, University of California, Davis. [email protected]. Krishnakumar Balasubramanian Department of Statistics, University of California, Davis [email protected]. Saeed Ghadimi Department of Management Sciences, University of Waterloo. [email protected].

Abstract

We propose a projection-free conditional gradient-type algorithm for smooth stochastic multi-level composition optimization, where the objective function is a nested composition of $T$ functions and the constraint set is a closed convex set. Our algorithm assumes access to noisy evaluations of the functions and their gradients, through a stochastic first-order oracle satisfying certain standard unbiasedness and second-moment assumptions. We show that the number of calls to the stochastic first-order oracle and the linear-minimization oracle required by the proposed algorithm, to obtain an $\epsilon$ -stationary solution, are of order $\mathcal{O}_{T}(\epsilon^{-2})$ and $\mathcal{O}_{T}(\epsilon^{-3})$ respectively, where $\mathcal{O}_{T}$ hides constants in $T$ . Notably, the dependence of these complexity bounds on $\epsilon$ and $T$ are separate in the sense that changing one does not impact the dependence of the bounds on the other. For the case of $T=1$ , we also provide a high-probability convergence result that depends poly-logarithmically on the inverse confidence level. Moreover, our algorithm is parameter-free and does not require any (increasing) order of mini-batches to converge unlike the common practice in the analysis of stochastic conditional gradient-type algorithms.

1 Introduction

We study projection-free algorithms for solving the following stochastic multi-level composition optimization problem

\underset{x\in\mathcal{X}}{\min}\quad F(x):=f_{1}\circ\cdots\circ f_{T}(x),

(1)

where $f_{i}:\mathbb{R}^{d_{i}}\rightarrow\mathbb{R}^{d_{i-1}},i=1,\cdots,T~{}(d_{0}=1)$ are continuously differentiable functions, the composite function $F$ is bounded below by $F^{\star}>-\infty$ and $\mathcal{X}\subset\mathbb{R}^{d}$ is a closed convex set. We assume that the exact function values and derivatives of $f_{i}$ ’s are not available. In particular, we assume that $f_{i}(y)=\mathbb{E}_{\xi_{i}}[G_{i}(y,\xi_{i})]$ for some random variable $\xi_{i}$ . Our goal is to solve the above optimization problem, given access to noisy evaluations of $\nabla f_{i}$ ’s and $f_{i}$ ’s.

There are two main challenges to address in developing efficient projection-free algorithms for solving (1). First, note that denoting the transpose of the Jacobian of $f_{i}$ by $\nabla f_{i}$ , the gradient of the objective function $F(x)$ in (1), is given by $\nabla F(x)=\nabla f_{T}(y_{T})\nabla f_{T-1}(y_{T-1})\cdots\nabla f_{1}(y_{1})$ , where $y_{i}=f_{i+1}\circ\cdots\circ f_{T}(x)$ for $1\leq i<T$ , and $y_{T}=x$ . Because of the nested nature of the gradient $\nabla F(x)$ , obtaining an unbiased gradient estimator in the stochastic first-order setting, with controlled moments, becomes non-trivial. Using naive stochastic gradient estimators lead to oracle complexities that depend exponentially on $T$ (in terms of the accuracy parameter). Next, even when $T=1$ in the stochastic setting, projection-free algorithms like the conditional gradient method or its sliding variants invariably require increasing order of mini-batches¹¹1We discuss in detail about some recent works that avoid requiring increasing mini-batches, albeit under stronger assumptions, in Section 1.2. [LZ16, RSPS16, HL16, QLX18, YSC19], which make their practical implementation infeasible.

In this work, we propose a projection-free conditional gradient-type algorithm that achieves level-independent oracle complexities (i.e., the dependence of the complexities on the target accuracy is independent of $T$ ) using only one sample of $(\xi_{i})_{1\leq i\leq T}$ in each iteration, thereby addressing both of the above challenges. Our algorithm uses moving-average based stochastic estimators of the gradient and function values, also used recently by [GRW20] and [BGN22], along with an inexact conditional gradient method used by [BG22] (which in turn is inspired by the sliding method by [LZ16]). In order to establish our oracle complexity results, we use a novel merit function based convergence analysis. To the best of our knowledge, such an analysis technique is used for the first time in the context of analyzing stochastic conditional gradient-type algorithms.

1.1 Preliminaries and Main Contributions

We now introduce the technical preliminaries required to state and highlight the main contributions of this work. Throughout this work, $\|\cdot\|$ denotes the Euclidean norm for vectors and the Frobenius norm for matrices. We first describe the set of assumptions on the objective functions and the constraint set.

Assumption 1 (Constraint set).

The set $\mathcal{X}\subset\mathbb{R}^{d}$ is convex and closed with $\underset{x,y\in\mathcal{X}}{\max}~{}\|x-y\|\leq D_{\mathcal{X}}$ .

Assumption 2 (Smoothness).

All functions $f_{1},\dots,f_{T}$ and their derivatives are Lipschitz continuous with Lipschitz constants $L_{f_{i}}$ and $L_{\nabla f_{i}}$ , respectively.

The above assumptions on the constraint set and the objective function are standard in the literature on stochastic optimization, and in particular in the analysis of conditional gradient algorithms and multi-level optimization; see, for example, [LZ16], [YWF19] and [BGN22]. We emphasize here that the above smoothness assumptions are made only on the functions $(f_{i})_{1\leq i\leq T}$ and not on the stochastic functions $(G_{i})_{1\leq i\leq T}$ (which would be a stronger assumption). Moreover, the Lipschitz continuity of $f_{i}$ ’s are implied by the Assumption 1 and assuming the functions are continuously differentiable. However, for sake of illustration, we state both assumptions separately. In addition to these assumptions, we also make unbiasedness and bounded-variance assumptions on the stochastic first-order oracle. Due to its technical nature, we state the precise details later in Section 3 (see Assumption 3).

We next turn our attention to the convergence criterion that we use in this work to evaluate the performance of the proposed algorithm. Gradient-based algorithms iteratively solve sub-problems in the form of

\underset{u\in\mathcal{X}}{\operatorname*{arg\,min}}\left\{\langle g,u\rangle+\frac{\beta}{2}\|u-x\|^{2}\right\},

(2)

for some $\beta>0$ , $g\in\mathbb{R}^{d}$ and $x\in\mathcal{X}$ . Denoting the optimal solution of the above problem by $P_{\mathcal{X}}(x,g,\beta)$ and noting its optimality condition, one can easily show that

-\nabla F(\bar{x})\in\mathcal{N}_{\mathcal{X}}(\bar{x})+{\cal B}\Big{(}0,\|g-\nabla F(\bar{x})\|D_{\mathcal{X}}+\beta\|x-P_{\mathcal{X}}(x,g,\beta)\|(D_{\mathcal{X}}+\|\nabla F(\bar{x})\|/\beta)\Big{)},

where $\mathcal{N}_{\mathcal{X}}(\bar{x})$ is the normal cone to $\mathcal{X}$ at $\bar{x}$ and ${\cal B}(0,r)$ denotes a ball centered at the origin with radius $r$ . Thus, reducing the radius of the ball in the above relation will result in finding an approximate first-order stationary point of the problem for non-convex constrained minimization problems. Motivated by this fact, we can define the gradient mapping at point $\bar{x}\in\mathcal{X}$ as

\mathcal{G}_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\coloneqq\beta\left(\bar{x}-P_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\right)=\beta\left(\bar{x}-\Pi_{\mathcal{X}}\left(\bar{x}-\frac{1}{\beta}\nabla F(\bar{x})\right)\right),

(3)

where $\Pi_{\mathcal{X}}(y)$ denotes the Euclidean projection of the vector $y$ onto the set $\mathcal{X}$ . The gradient mapping is a classical measure has been widely used in the literature as a convergence criterion when solving nonconvex constrained problems [Nes18]. It plays an analogous role of the gradient for constrained optimization problems; in fact when the set $\mathcal{X}\equiv\mathbb{R}^{d}$ the gradient mapping just reduces to $\nabla F(\bar{x})$ . It should be emphasized that while the gradient mapping cannot be computed in the stochastic setting, it still serves as a measure of convergence. Our main goal in this work is to find an $\epsilon$ -stationary solution to (1), in the sense described below.

Definition 1.

A point $\bar{x}\in\mathcal{X}$ generated by an algorithm for solving (1) is called an $\epsilon$ -stationary point, if we have $\mathbb{E}[\|\mathcal{G}_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\|^{2}]\leq\epsilon$ , where the expectation is taken over all the randomness involved in the problem.

In the literature on conditional gradient methods for the nonconvex setting, the so-called Frank-Wolfe gap is also widely used to provide convergence analysis. The Frank-Wolfe Gap is defined formally as

g_{\mathcal{X}}(\bar{x},\nabla F(\bar{x})):=\underset{y\in\mathcal{X}}{\max}~{}\langle\nabla F(\bar{x}),\bar{x}-y\rangle.

(4)

As pointed out by [BG22], the gradient mapping criterion and the Frank-Wolfe gap are related to each other in the following sense.

Proposition 1.

[BG22] Let $g_{\mathcal{X}}(\cdot)$ be the Frank-Wolfe gap defined in (4) and $\mathcal{G}_{\mathcal{X}}(\cdot)$ be the gradient mapping defined in (3). Then, we have $\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|^{2}\leq g_{\mathcal{X}}(x,\nabla F(x)),\forall x\in\mathcal{X}$ . Moreover, under Assumption 1, 2, we have $g_{\mathcal{X}}(x,\nabla F(x))\leq\left[(1/\beta)\prod_{i=1}^{T}L_{f_{i}}+D_{\mathcal{X}}\right]\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|$ .

For stochastic conditional gradient-type algorithms, the oracle complexity is measured in terms of number of calls to the Stochastic First-order Oracle (SFO) and the Linear Minimization Oracle (LMO) used to the solve the sub-problems (that are of the form of minimizing a linear function over the convex feasible set) arising in the algorithm. In this work, we hence measure the number of calls to SFO and LMO required by the proposed algorithm to obtain an $\epsilon$ -stationary solution in the sense of Definition 1. We now highlight our main contributions:

•

We propose a projection-free conditional gradient-type method (Algorithm 1) for solving (1). In Theorem 2, we show that the SFO and LMO complexities of this algorithm, in order to achieve an $\epsilon$ -stationary solution in the sense of Definition 1, are of order $\mathcal{O}(\epsilon^{-2})$ and $\mathcal{O}(\epsilon^{-3})$ , respectively.
•

The above SFO and LMO complexities are in particular level-independent (i.e., the dependence of the complexities on the target accuracy is independent of $T$ ). The proposed algorithm is parameter-free and does not require any mini-batches, making it applicable for the online setting.
•

When considering the case of $T\leq 2$ , we present a simplified method (Algorithm 3 and 4) to obtain the same oracle complexities. Intriguingly, while this simplified method is still parameter-free for $T=1$ , it is not when $T=2$ (see Theorem 3 and Remark Remark). Furthermore, for the case of $T=1$ , we also establish high-probability bounds (see Theorem 5).

A summary of oracle complexities for stochastic conditional gradient-type algorithms is in Table 1.

Algorithm	Criterion	# of levels	Batch size	SFO	LMO
SPIFER-SFW [YSC19]	FW-gap	1	$\mathcal{O}(\epsilon^{-1})$	$\mathcal{O}(\epsilon^{-3})$	$\mathcal{O}(\epsilon^{-2})$
1-SFW [ZSM⁺20]	FW-gap	1	1	$\mathcal{O}(\epsilon^{-3})$	$\mathcal{O}(\epsilon^{-3})$
SCFW [ABTR21]	FW-gap	2	1	$\mathcal{O}(\epsilon^{-3})$	$\mathcal{O}(\epsilon^{-3})$
SCGS [QLX18]	GM	1	$\mathcal{O}(\epsilon^{-1})$	$\mathcal{O}(\epsilon^{-2})$	$\mathcal{O}(\epsilon^{-2})$
SGD+ICG [BG22]	GM	1	$\mathcal{O}(\epsilon^{-1})$	$\mathcal{O}(\epsilon^{-2})$	$\mathcal{O}(\epsilon^{-2})$
LiNASA+ICG (Algorithm 1)	GM	$T$	1	$\mathcal{O}_{T}(\epsilon^{-2})$	$\mathcal{O}_{T}(\epsilon^{-3})$

Table 1: Complexity results for stochastic conditional gradient type algorithms to find an

\epsilon

-stationary solution in the nonconvex setting. FW-Gap and GM stands for Frank-Wolfe Gap (see (4)) and Gradient Mapping (see (3)) respectively.

\mathcal{O}_{T}

hides constants in

T

. Existing one-sample based stochastic conditional gradient algorithms are either (i) not applicable to the case of general

T>1

, or (ii) require strong assumptions [ZSM⁺20], or (iii) are not truly online [ABTR21]; see Section 1.2 for detailed discussion. The results in [BG22] are actually presented for the zeroth-order setting; however the above stated first-order complexities follow immediately.

1.2 Related Work

Conditional Gradient-Type Method. The conditional gradient algorithm [FW56, LP66], has had a renewed interest in the machine learning and optimization communities in the past decade; see [Mig94, Jag13, HJN15, LJJ15, BS17, GKS21] for a partial list of recent works. Considering the stochastic convex setup, [HK12, HL16] provided expected oracle complexity results for the stochastic conditional gradient algorithm. The complexities were further improved by a sliding procedure in [LZ16], based on Nesterov’s acceleration method. [RSPS16, YSC19, HL16] considered variance reduced stochastic conditional gradient algorithms, and provided expected oracle complexities in the non-convex setting. [QLX18] analyzed the sliding algorithm in the non-convex setting and provided results for the gradient mapping criterion. All of the above works require increasing orders of mini-batches to obtain their oracle complexity results.

[MHK20] and [ZSM⁺20] proposed a stochastic conditional gradient-type algorithm with moving-average gradient estimator for the convex and non-convex setting that uses only one-sample in each iteration. However, [MHK20] and [ZSM⁺20] require several restrictive assumptions, which we explain next (focusing on [ZSM⁺20] which considers the nonconvex case). Specifically, [ZSM⁺20] requires that the stochastic function $G_{1}(x,\xi_{1})$ has uniformly bounded function value, gradient-norm, and Hessian spectral-norm, and the distribution of the random vector $\xi_{1}$ has an absolutely continuous density $p$ such that the norm of the gradient of $\log p$ and spectral norm of the Hessian of $\log p$ has finite fourth and second-moments respectively. In contrasts, we do not require such stringent assumptions. Next, all of the above works focus only on the case of $T=1$ . [ABTR21] considered stochastic conditional gradient algorithm for solving (1), with $T=2$ . However, [ABTR21] also makes stringent assumptions: (i) the stochastic objective functions $G_{1}(x,\xi_{1})$ and $G_{2}(x,\xi_{1})$ themselves have Lipschitz gradients almost surely and (ii) for a given instance of random vectors $\xi_{1}$ and $\xi_{2}$ , one could query the oracle at the current and previous iterations, which makes the algorithm not to be truly online. See Table 1 for a summary.

Stochastic Multi-level Composition Optimization. Compositional optimization problems of the form in (1) have been considered as early as 1970s by [Erm76]. Recently, there has been a renewed interest on this problem. [EN13] and [DPR17] considered a sample-average approximation approach for solving (1) and established several asymptotic results. For the case of $T=2$ , [WFL17], [WLF16] and [BGI⁺17] proposed and analyzed stochastic gradient descent-type algorithms in the smooth setting. [DD19] and [DR18] considered the non-smooth setting and established oracle complexity results. Furthermore, [HZCH20] proposed algorithms when the randomness between the two levels are not necessarily independent. For the general case of $T\geq 1$ , [YWF19] proposed stochastic gradient descent-type algorithms and established oracle complexities established that depend exponentially on $T$ and are hence sub-optimal. Indeed, large deviation and Central Limit Theorem results established in [EN13, DPR17], respectively, show that in the sample-average approximation setting, the $\operatorname*{arg\,min}$ of the problem in (1) based on $n$ samples, converges at a level-independent rate (i.e., dependence of the convergence rate on the target accuracy is independent of $T$ ) to the true minimizer, under suitable regularity conditions.

[GRW20] proposed a single time-scale Nested Averaged Stochastic Approximation (NASA) algorithm and established optimal rates for the cases of $T=1,2$ . For the general case of $T\geq 1$ , [BGN22] proposed a linearized NASA algorithm and established level-independent and optimal convergence rates. Concurrently, [Rus21] considered the case when the function $f_{i}$ are non-smooth and established asymptotic convergence results. [ZX21] also established non-asymptotic level-independent oracle complexities, however, under stronger assumptions than that in [BGN22]. Firstly, they assumed that for a fixed batch of samples, one could query the oracle on different points, which is not suited for the general online stochastic optimization setup. Next, they assume a much stronger mean-square Lipschitz smoothness assumption on the individual functions $f_{i}$ and their gradients. Finally, they required mini-batches sizes that depend exponentially on $T$ , which makes their method impractical. Concurrent to [BGN22], level-independent rates were also obtained for unconstrained problems by [CSY21], albeit, under the stronger assumption that the stochastic functions $G_{i}(x,\xi_{i})$ are Lipschitz, almost surely. It is also worth mentioning that while some of the above papers considered constrained problems, the algorithms proposed and analyzed in the above works are not projection-free, which is the main focus of this work.

2 Methodology

In this section, we present our projection-free algorithm for solving problem (1). The method generates three random sequences, namely, approximate solutions $\{x^{k}\}$ , average gradients $\{z^{k}\}$ , and average function values $\{u^{k}\}$ , defined on a certain probability space $(\Omega,\mathscr{F},P)$ . We let $\mathscr{F}_{k}$ to be the $\sigma$ -algebra generated by $\{x^{0},\dots,x^{k},z^{0},\dots,z^{k},u_{1}^{0},\dots,u_{1}^{k},\dots,u_{T}^{0},\dots,u_{T}^{k}\}$ . The overall method is given in Algorithm 1. In (7), the stochastic Jacobians $J_{i}^{k+1}\in\mathbb{R}^{d_{i}\times d_{i-1}}$ , and the product $\prod_{i=1}^{T}J_{T+1-i}^{k+1}$ is calculated as $J_{T}^{k+1}J_{T-1}^{k+1}\cdots J_{1}^{k+1}\in\mathbb{R}^{d_{T}\times d_{1}}\equiv\mathbb{R}^{d_{T}\times 1}$ . In (8), we use the notation $\langle\cdot,\cdot\rangle$ to represent both matrix-vector multiplication and vector-vector inner product. There are two aspects of the algorithm that we highlight specifically: (i) In addition to estimating the gradient of $F$ , we also estimate a stochastic linear approximation of the inner functions $f_{i}$ by a moving-average technique. In the multi-level setting we consider, it helps us to avoid the accumulation of bias, when estimating the $f_{i}$ directly. Linearization techniques were used in the stochastic optimization since the work of [Rus87]. A similar approach was used in [BGN22] in the context of projected-based methods for solving (1). It is also worth mentioning that other linearization techniques have been used in [DD19] and [DR18] for estimating the stochastic inner function values for weakly convex two-level composition problems, and (ii) The ICG method given in Algorithm 2 is essentially applying deterministic conditional gradient method with the exact line search for solving the quadratic minimization subproblem in (2) with the estimated gradient $z_{k}$ in (7). It was also used in [BG22] as a sub-routine and is motivated by the sliding approach of [LZ16].

Algorithm 1 Linearized NASA with Inexact Conditional Gradient Method (LiNASA+ICG)

Input:

x^{0}\in\mathcal{X}

z^{0}=0\in\mathbb{R}^{d}

u_{i}^{0}\in\mathbb{R}^{d_{i}},i=1,\dots,T

\beta_{k}>0

t_{k}>0

\tau_{k}\in(0,1]

\delta\geq 0

for

k=0,1,2,\dots,N

1. Update the solution:

	$\displaystyle\tilde{y}^{k}$	$\displaystyle=\texttt{ICG}(x^{k},z^{k},\beta_{k},t_{k},\delta),$		(5)
	$\displaystyle x^{k+1}$	$\displaystyle=x^{k}+\tau_{k}(\tilde{y}^{k}-x^{k}),$		(6)

and compute stochastic Jacobians

J_{i}^{k+1}

, and function values

G_{i}^{k+1}

u_{i+1}^{k}

for

i=1,\dots,T

2. Update average gradients

z

and function value estimates

u_{i}

for each level

i=1,\dots,T

	$\displaystyle z^{k+1}$	$\displaystyle=(1-\tau_{k})z^{k}+\tau_{k}\prod_{i=1}^{T}J_{T+1-i}^{k+1},$		(7)
	$\displaystyle u_{i}^{k+1}$	$\displaystyle=(1-\tau_{k})u_{i}^{k}+\tau_{k}G_{i}^{k+1}+\langle J_{i}^{k+1},u_{i+1}^{k+1}-u_{i+1}^{k}\rangle.$		(8)

end for

Output:

(x^{R},z^{R},u_{1}^{R},\cdots,u_{T}^{R})

, where

R

is uniformly distributed over

\{1,2,\dots,N\}

Algorithm 2 Inexact Conditional Gradient Method (ICG)

Input: (

x,z,\beta,M,\delta)

Set

w^{0}=x

for

t=0,1,2,\dots,M

1. Find

v^{t}\in\mathcal{X}

with a quantity

\delta\geq 0

such that

\langle z+\beta(w^{t}-x),v^{t}\rangle\leq\min_{v\in\mathcal{X}}~{}\langle z+\beta(w^{t}-x),v\rangle+\frac{\beta D_{\mathcal{X}}^{2}\delta}{t+2}.

2. Set

w^{t+1}=(1-\mu_{t})w^{t}+\mu_{t}v^{t}

with

\mu_{t}=\min\bigg{\{}1,\frac{\langle\beta(x-w^{t})-z,v^{t}-w^{t}\rangle}{\beta\|v^{t}-w^{t}\|^{2}}\bigg{\}}.

end for

Output:

w^{M}

3 Main Results

In this section, we present our main result on the oracle complexity of Algorithm 1. Before we proceed, we present our assumptions on the stochastic first-order oracle.

Assumption 3 (Stochastic First-Order Oracle).

Denote $u_{T+1}^{k}\equiv x^{k}$ . For each $k$ , $u_{i+1}^{k}$ being the input, the stochastic oracle outputs $G_{i}^{k+1}\in\mathbb{R}^{d_{i}}$ and $J_{i}^{k+1}$ such that given $\mathscr{F}_{k}$ and for any $i\in\{1,\dots,T\}$

(a)

$\mathbb{E}[J_{i}^{k+1}|\mathscr{F}_{k}]=\nabla f_{i}(u_{i+1}^{k}),~{}\mathbb{E}[G_{i}^{k+1}|\mathscr{F}_{k}]=f_{i}(u_{i+1}^{k})$ ,
(b)

$\mathbb{E}[\|G_{i}^{k+1}-f_{i}(u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{i}}^{2},~{}\mathbb{E}[\|J_{i}^{k+1}-\nabla f_{i}(u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]\leq\sigma_{J_{i}}^{2}$ ,
(c)

The outputs of the stochastic oracle at level $i$ , $G_{i}^{k+1}$ and $J_{i}^{k+1}$ , are independent. The outputs of the stochastic oracle are independent between levels, i.e., $\{G_{i}^{k+1}\}_{i=1,\dots,T}$ are independent and so are $\{J_{i}^{k+1}\}_{i=1,\dots,T}$ .

Parts (a) and (b) in Assumption 3 are standard unbiasedness and bounded variance assumptions on the stochastic gradient, common in the literature. Part (c) is essential to establish the convergence results in the multi-level case. Similar assumptions have been made, for example, in [YWF19] and [BGN22]. We also emphasize that unlike some prior works (see e.g., [ZSM⁺20]), Assumption 3 allows the case of endogenous uncertainty, and we do not require the distribution of the random variables $(\xi_{i})_{1\leq i\leq T}$ to be independent of the distribution of the decision variables $(u_{i})_{1\leq i\leq T}$ .

Remark.

Under Assumption 2, and 3, we can immediately conclude that $\mathbb{E}[\|J_{i}^{k+1}\|^{2}|\mathscr{F}_{k}]=\mathbb{E}[\|J_{i}^{k+1}-\nabla f_{i}(u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]+\|\nabla f_{i}(u_{i+1}^{k})\|^{2}\leq\sigma_{J_{i}}^{2}+L_{f_{i}}^{2}:=\hat{\sigma}_{J_{i}}^{2}$ . In the sequel, $\hat{\sigma}_{J_{i}}^{2}$ will be used to simplify the presentation.

We start with the merit function used in this work and its connection to the gradient mapping criterion. Our proof leverages the following merit function:

W_{\alpha,\gamma}(x,z,u)=F(x)-F^{\star}-\eta(x,z)+\alpha\|\nabla F(x)-z\|^{2}+\sum_{i=1}^{T}\gamma_{i}\|f_{i}(u_{i+1})-u_{i}\|^{2},

(9)

where $\alpha,\{\gamma_{i}\}_{1\leq i\leq T}$ are positive constants and

\eta(x,z)=\min_{y\in\mathcal{X}}\left\{H(y;x,z,\beta):=\langle z,y-x\rangle+\frac{\beta}{2}\|y-x\|^{2}\right\}.

(10)

Compared to [BGN22], we require the additional term $\|\nabla F(x)-z\|^{2}$ , which turns out to be essential in our proof due to the ICG routine. The following proposition relates the merit function above to the gradient mapping.

Proposition 2.

Let $\mathcal{G}_{\mathcal{X}}(\cdot)$ be the gradient mapping defined in (3) and $\eta(\cdot,\cdot)$ be defined in (10). For any pair of $(x,z)$ and $\beta>0$ , we have $\|\mathcal{G}_{\mathcal{X}}(x,\nabla F(x),\beta)\|^{2}\leq-4\beta\eta(x,z)+2\|\nabla F(x)-z\|^{2}$ .

Proof.

By expanding the square, and using the properties of projection operation, we have

\|\Pi_{\mathcal{X}}({x}-\frac{1}{\beta}{z})-{x}\|^{2}+\|\Pi_{\mathcal{X}}({x}-\frac{1}{\beta}{z})-(x-\frac{1}{\beta}{z})\|^{2}\leq\|\bar{x}-({x}-\frac{1}{\beta}{z})\|^{2}=\|\frac{1}{\beta}{z}\|^{2}.

Thus, we have $\eta({x},{z})\leq-\frac{\beta}{2}\|\Pi_{\mathcal{X}}({x}-\frac{1}{\beta}{z})-{x}\|^{2}$ . The proof is completed immediately by noting that $\|\mathcal{G}({x},\nabla F({x}),\beta)\|^{2}\leq 2\beta^{2}\|\Pi_{\mathcal{X}}({x}-\frac{1}{\beta}{z})-{x}\|^{2}+2\left\|\nabla F({x})-{z}\right\|^{2}.$ ∎

We now present out main result on the oracle complexity of Algorithm1.

Theorem 2.

Under Assumption 1, 2, 3, let $\{x^{k},z^{k},\{u_{i}^{k}\}_{1\leq i\leq T}\}_{k\geq 0}$ be the sequence generated by Algorithm 1 with $N\geq 1$ and

\begin{split}\beta_{k}\equiv\beta>0,\qquad\tau_{0}=1,~{}t_{0}=0,\quad\tau_{k}=\frac{1}{\sqrt{N}},~{}t_{k}=\lceil\sqrt{k}\rceil,\quad\forall k\geq 1,\end{split}

(11)

where $\beta$ is an arbitrary positive constant. Provided that the merit function $W_{\alpha,\gamma}(x,z,u)$ is defined as (9) with

\alpha=\frac{\beta}{20L_{\nabla F}^{2}},\quad\gamma_{1}=\frac{\beta}{2},\quad\gamma_{j}=\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}+\frac{\beta}{2},\quad 2\leq j\leq T,

(12)

we have,

\mathbb{E}\left[\|\mathcal{G}_{\mathcal{X}}(x^{R},\nabla F(x^{R}),\beta)\|^{2}\right]\leq\frac{2(\beta+\frac{20L_{\nabla F}^{2}}{\beta})\left[2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)\right]}{\sqrt{N}},

(13)

\mathbb{E}\left[\|f_{i}(u_{i+1}^{R})-u_{i}^{R}\|^{2}\right]\leq\frac{2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)}{\beta\sqrt{N}},\quad 1\leq i\leq T.

(14)

where $u_{T+1}=x,\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)=4\hat{\sigma}^{2}+32\beta D_{\mathcal{X}}^{2}(1+\delta)\left(\frac{3}{5}+\frac{5L_{\nabla F}^{2}}{\beta^{2}}\right)$ , and $\hat{\sigma}^{2}$ is a constant depending on the parameters $(\beta,\sigma^{2},L,D_{\mathcal{X}},T)$ given in (42). The expectation is taken with respect to all random sequences generated by the method and an independent random integer number $R$ uniformly distributed over $\{1,\dots,N\}$ . That is to say, the number of calls to SFO and LMO to get an $\epsilon$ -stationary point is upper bounded by $\mathcal{O}_{T}(\epsilon^{-2}),\mathcal{O}_{T}(\epsilon^{-3})$ respectively.

Remark.

The constant $\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)$ is $\mathcal{O}(T)$ given the definition of $\hat{\sigma}^{2}$ and the value of $\gamma_{j}$ in (12), which further implies that the total number of calls to SFO and LMO of Algorithm 1 for finding an $\epsilon$ -stationary point of (1), are bounded by $\mathcal{O}(T^{2}\epsilon^{-2})=\mathcal{O}_{T}(\epsilon^{-2})$ and $\mathcal{O}(T^{3}\epsilon^{-3})=\mathcal{O}_{T}(\epsilon^{-3})$ respectively. Furthermore, it is worth noting that this complexity bound for Algorithm 1 is obtained without any dependence of the parameter $\beta_{k}$ on Lipschitz constants due to the choice of arbitrary positive constant $\beta$ in (11), and $\tau_{k},t_{k}$ depend only on the number of iterations $N$ and $k$ respectively. This makes Algorithm 1 parameter-free and easy to implement.

Remark.

As discussed in Section 2, the ICG routine given in Algorithm 2 is a deterministic method with the estimated gradient $z_{k}$ in (7). The number of iterations, $t_{k}$ , required to run Algorithm 2 is given by $t_{k}=\lceil\sqrt{k}\rceil$ . That is, we require more precise solutions for the ICG routine, only for later outer iterations. Furthermore, due to the deterministic nature of the ICG routine, further advances in the analysis of deterministic conditional gradient methods under additional assumptions on the constraint set $\mathcal{X}$ (see, for example, [GH15, GW21]) could be leveraged to improve the overall LMO complexity.

3.1 The special cases of $T=1$ and $T=2$

We now discuss several intriguing points regarding the choice of tuning parameter $\beta$ , for the case of $T=2$ , and the more standard case of $T=1$ . Specifically, the linearization technique used in Algorithm 1 turns out to be not necessary for the case of $T=2$ and $T=1$ to obtain similar rates. However, without linearization, the choice of $\beta$ is dependent on the problem parameters for $T=2$ . Whereas it turns out to be independent of the problem parameters (similar to Algorithm 1 and Theorem 2 which holds for all $T\geq 1$ ) for $T=1$ . As the outer function value estimates (i.e., $u^{k+1}_{1}$ sequence) are not required for the convergence analysis, we remove them in Algorithms 3 and 4.

Algorithm 3 NASA with Inexact Conditional Gradient Method (NASA+ICG) for

T=2

Replace Step 2 of Algorithm 1 with the following:

2’. Update the average gradient

z

and the function value estimate

u_{2}

respectively as:

\displaystyle z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}J_{2}^{k+1}J_{1}^{k+1}~{}\quad\text{and}\quad u_{2}^{k+1}=(1-\tau_{k})u^{k}+\tau_{k}G_{2}^{k+1}

Algorithm 4 ASA with Inexact Conditional Gradient Method (ASA+ICG) for

T=1

Replace Step 2 of Algorithm 1 with the following:

2”. Update the average gradient

z

as:

z^{k+1}=(1-\tau_{k})z^{k}+\tau_{k}J_{1}^{k+1}.

Theorem 3.

Let Assumptions 1, 2, 3 be satisfied by the optimization problem (1). Let $\mathcal{C}_{1},\mathcal{C}_{2}$ and $\mathcal{C}_{3}$ be some constants depending on the parameters $(\beta,\sigma^{2},L,D_{\mathcal{X}},\delta)$ , as defined in (54) and (62). Let $\tau_{0}=1,t_{0}=0$ , $\tau_{k}=\frac{1}{\sqrt{N}},t_{k}=\lceil\sqrt{k}\rceil,\forall k\geq 1$ , where $N$ is the total number of iterations.

(a)

Let $T=2$ , and let $\{x^{k},z^{k},u_{2}^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 3 with

\beta_{k}\equiv\beta\geq 6\rho L_{\nabla F}+(2\rho+\frac{2}{3\rho})L_{\nabla f_{1}}L_{f_{2}}^{2},\quad\rho>0.

(15)

Then, we have $\forall N\geq 1$ ,

\mathbb{E}\left[\|\mathcal{G}_{\mathcal{X}}(x^{R},\nabla F(x^{R}),\beta)\|^{2}\right]\leq\frac{\mathcal{C}_{1}}{\sqrt{N}},\quad\mathbb{E}\left[\|f_{2}(x^{R})-u_{2}^{R}\|^{2}\right]\leq\frac{\mathcal{C}_{2}}{\sqrt{N}}.

(b)

Let $T=1$ and let $\{x^{k},z^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 4 with $\beta_{k}\equiv\beta>0$ . Then, we have $\forall N\geq 1$ ,

$\mathbb{E}\left[\|\mathcal{G}_{\mathcal{X}}(x^{R},\nabla F(x^{R}),\beta)\|^{2}\right]\leq\frac{\mathcal{C}_{3}}{\sqrt{N}}.$

All expectations are taken with respect to all random sequences generated by the respective algorithms and an independent random integer number $R$ uniformly distributed over $\{1,\dots,N\}$ . In both cases, the number of calls to SFO and LMO to get an $\epsilon$ -stationary point is upper bounded by $\mathcal{O}(\epsilon^{-2}),\mathcal{O}(\epsilon^{-3})$ respectively.

Remark.

While we can obtain the same complexities without using the linear approximation of the inner function for $T=2$ , it seems necessary to have a parameter-free algorithm as the choice of $\beta$ in (15) depends on the knowledge of the problem parameters. Indeed, the linearization term in (8) helps use to better exploit the Lipschitz smoothness of the gradients get an error bound in the order of $\tau_{k}^{2}\|d^{k}\|^{2}$ for estimating the inner function values. Without this term, we are only able to use the Lipschitz continuity of the inner functions and so the error estimate will increase to the order of $\tau_{k}\|d^{k}\|$ . Hence, we need to choose a larger $beta$ (as in (15)) to reduce $\|d^{k}\|$ and handle the error term without compromising the complexities. However, this is not the case for $T=1$ as it can be seen as a two-level problem whose inner function is exactly known (the identity map). In this case, the choice of $\beta$ is independent of the problem parameters with or without the linearization term.

3.2 High-Probability Convergence for $T=1$

In this subsection, we establish an oracle complexity result with high-probability for the case of $T=1$ . We first provide a notion of $(\epsilon,\delta)$ -stationary point and a related tail assumption on the stochastic first-order oracle below.

Definition 4.

A point $\bar{x}\in\mathcal{X}$ generated by an algorithm for solving (1) is called an $(\epsilon,\delta)$ -stationary point, if we have $\|\mathcal{G}_{\mathcal{X}}(\bar{x},\nabla F(\bar{x}),\beta)\|^{2}\leq\epsilon$ with probability $1-\delta$ .

Assumption 4.

Let $\Delta^{k+1}=\nabla F(x^{k})-J_{1}^{k+1}$ for $k\geq 0$ . For each $k$ , given $\mathscr{F}_{k}$ we have $\mathbb{E}[\Delta^{k+1}|\mathscr{F}_{k}]=0$ and $\|\Delta^{k+1}\|\big{|}\mathscr{F}_{k}$ is $K$ -sub-Gaussian.

The above assumption is commonly used in the literature; see [HK14, HLPR19, LO20, ZCC⁺18]. We also refer to [Ver18] and Appendix E for additional details. The high-probability bound for solving non-convex constrained problems by Algorithm 4 is given below.

Theorem 5.

Let Assumptions 1, 2, 4 be satisfied by the optimization problem (1) with $T=1$ . Let $\tau_{0}=1,t_{0}=0$ , $\tau_{k}=\frac{1}{\sqrt{N}},t_{k}=\lceil\sqrt{k}\rceil,\forall k\geq 1$ , where $N$ is the total number of iterations. Let $T=1$ and let $\{x^{k},z^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 4 with $\beta_{k}\equiv\beta>0$ . Then, we have $\forall N\geq 1,\delta>0$ , with probability at least $1-\delta$ ,

\min_{k=1,\dots,N}\left\|\mathcal{G}_{\mathcal{X}}(x^{k},\nabla F(x^{k}),\beta)\right\|^{2}\leq\mathcal{O}\left(\frac{K^{2}\log(1/\delta)}{\sqrt{N}}\right)

Therefore, the number of calls to SFO and LMO to get an $(\epsilon,\delta)$ -stationary point is upper bounded by $\mathcal{O}(\epsilon^{-2}\log^{2}(1/\delta)),\mathcal{O}(\epsilon^{-3}\log^{3}(1/\delta))$ respectively.

Remark.

To the best of our knowledge, the above result is (i) the first high-probability bound for one-sample stochastic conditional gradient-type algorithm for the case of $T=1$ , and (ii) the first high-probability bound for constrained stochastic optimization algorithms in the non-convex setting; see Appendix J of [MDB21].

4 Proof Sketch of Main Results

In this section, we only present the proof sketch. The complete proofs are provided in the appendix. For convenience, let $u_{T+1}=x$ , and we denote $H_{k}$ as the function value of the subproblem at step $k$ , $y^{k}$ as the optimal solution of the subproblem i.e.,

H_{k}(y):=H(y;x^{k},z^{k},\beta_{k}),\quad y^{k}=\underset{y\in\mathcal{X}}{\operatorname*{arg\,min}}~{}H_{k}(y).

(16)

Then, the proof of Theorem 2 proceeds via the following steps:

We first leverage the merit function $W_{k}:=W_{\alpha,\gamma}(x^{k},z^{k},u^{k})$ defined in (9) with appropriate choices of $\alpha,\gamma$ for any $\beta>0$ to obtain

\begin{split}W_{k+1}-W_{k}\leq&-\frac{\tau_{k}}{2}\left(\beta\left[\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right]+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right)\\ &+\mathbf{R}_{k}+\tau_{k}\left(\frac{12}{5}+\frac{20L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\quad\forall k\geq 0\end{split}

where $\mathbf{R}_{k}$ is the residual term (see (31)) and $\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]\leq\hat{\sigma}^{2}\tau_{k}^{2}$ , as shown in Proposition 3.

Telescoping the above inequality, in Lemma 11 we obtain the following:

\begin{split}&\sum_{k=1}^{N}\tau_{k}\left[\beta\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right]\\ &\leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\quad\forall N\geq 1.\end{split}

To further control the error term $H_{k}(\tilde{y}^{k})-H_{k}(y^{k})$ introduced by the ICG method, we set $t_{k}$ , the number of iterations in ICG method at step $k$ , to $\lceil\sqrt{k}\rceil$ . By Lemma 8, we therefore have

H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{t_{k}+2}\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{k}},\quad\forall k\geq 1.

Also, with the choice of $\tau_{k}=\frac{1}{\sqrt{N}}$ and $z^{0}=0$ , we can conclude that

\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{N}}\sum_{k=1}^{N}\frac{1}{\sqrt{k}}\leq 4\beta D_{\mathcal{X}}^{2}(1+\delta).

Then, taking expectation of both sides and by the definition of random integer $R$ , we have

\begin{split}\mathbb{E}\left[\beta\left(\|d^{R}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{R})-u_{i}^{R}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{R})-z^{R}\|^{2}\right]\leq 2W_{0}+\mathcal{B},\end{split}

$\forall N\geq 1$ , where $\mathcal{B}$ is a constant depending on the problem parameters $(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)$ .

As a result, we can obtain (13) and (14) by noting that $\forall k\geq 1$

\begin{split}\|\mathcal{G}(x^{k},\nabla F(x^{k}),\beta)\|^{2}&\leq 2\beta^{2}\|d^{k}\|^{2}+2\beta^{2}\left\|\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}\nabla F(x^{k})\right)-\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}z^{k}\right)\right\|^{2}\\ &\leq 2\beta^{2}\|d^{k}\|^{2}+2\|\nabla F(x^{k})-z^{k}\|^{2}.\end{split}

where the second inequality follows the non-expansiveness of the projection operator.

The proofs of Theorems 3 and 5 follow the same argument with appropriate modifications. The high-probability convergence proof of Theorem 5 mainly consists of controlling the tail probability of the residual term $\mathbf{R}_{k}$ being large.

5 Discussion

In this work, we propose and analyze projection-free conditional gradient-type algorithms for constrained stochastic multi-level composition optimization of the form in (1). We show that the oracle complexity of the proposed algorithms is level-independent in terms of the target accuracy. Furthermore, our algorithm does not require any increasing order of mini-batches under standard unbiasedness and bounded second-moment assumptions on the stochastic first-order oracle, and is parameter-free. Some open questions for future research: (i) Considering the one-sample setting, either improving the LMO complexity from $\mathcal{O}(\epsilon^{-3})$ to $\mathcal{O}(\epsilon^{-2})$ for general closed convex constraint sets or establishing lower bounds showing that $\mathcal{O}(\epsilon^{-3})$ is necessary while keeping the SFO in the order of $\mathcal{O}(\epsilon^{-2})$ , is extremely interesting; and (ii) Providing high-probability bounds for stochastic multi-level composition problems ( $T>1$ ) and under sub-Gaussian or heavy-tail assumptions (as in [MDB21, LZW22]) is interesting to explore.

Acknowledgment

TX was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, UC Davis and National Science Foundation (NSF) grant CCF-1934568. KB was partially supported by a seed grant from the Center for Data Science and Artificial Intelligence Research, UC Davis and NSF grant DMS-2053918. SG was partially supported by an NSERC Discovery Grant.

References

[ABTR21] Zeeshan Akhtar, Amrit Singh Bedi, Srujan Teja Thomdapu, and Ketan Rajawat. Projection-Free Algorithm for Stochastic Bi-level Optimization. arXiv preprint arXiv:2110.11721, 2021.
[AF21] Raul Astudillo and Peter Frazier. Bayesian optimization of function networks. Advances in Neural Information Processing Systems, 34, 2021.
[BG22] Krishnakumar Balasubramanian and Saeed Ghadimi. Zeroth-order nonconvex stochastic optimization: Handling constraints, high dimensionality, and saddle points. Foundations of Computational Mathematics, 22(1):35–76, 2022.
[BGI⁺17] Jose Blanchet, Donald Goldfarb, Garud Iyengar, Fengpei Li, and Chaoxu Zhou. Unbiased simulation for optimizing stochastic function compositions. arXiv preprint arXiv:1711.07564, 2017.
[BGN22] Krishnakumar Balasubramanian, Saeed Ghadimi, and Anthony Nguyen. Stochastic multilevel composition optimization algorithms with level-independent convergence rates. SIAM Journal on Optimization, 32(2):519–544, 2022.
[BS17] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164(1-2):1–27, 2017.
[CFKM20] Weilin Cong, Rana Forsati, Mahmut Kandemir, and Mehrdad Mahdavi. Minimal variance sampling with provable guarantees for fast training of graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1393–1403, 2020.
[CSY21] Tianyi Chen, Yuejiao Sun, and Wotao Yin. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. IEEE Transactions on Signal Processing, 69:4937–4948, 2021.
[DD19] Damek Davis and Dmitriy Drusvyatskiy. Stochastic model-based minimization of weakly convex functions. SIAM Journal on Optimization, 29(1):207–239, 2019.
[DPR17] Darinka Dentcheva, Spiridon Penev, and Andrzej Ruszczyński. Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760, 2017.
[DR18] John Duchi and Feng Ruan. Stochastic methods for composite and weakly convex optimization problems. SIAM Journal on Optimization, 28(4):3229–3259, 2018.
[EN13] Yuri Ermoliev and Vladimir Norkin. Sample average approximation method for compound stochastic optimization problems. SIAM Journal on Optimization, 23(4):2231–2263, 2013.
[Erm76] Yuri Ermoliev. Methods of stochastic programming. Nauka, Moscow, 1976.
[FMO21] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Generalization of model-agnostic meta-learning algorithms: Recurring and unseen tasks. Advances in Neural Information Processing Systems, 34, 2021.
[FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
[GH15] Dan Garber and Elad Hazan. Faster rates for the frank-wolfe method over strongly-convex sets. In International Conference on Machine Learning, pages 541–549. PMLR, 2015.
[GKS21] Dan Garber, Atara Kaplan, and Shoham Sabach. Improved complexities of conditional gradient-type methods with applications to robust matrix recovery problems. Mathematical Programming, 186(1):185–208, 2021.
[GRW20] Saeed Ghadimi, Andrzej Ruszczynski, and Mengdi Wang. A single timescale stochastic approximation method for nested stochastic optimization. SIAM Journal on Optimization, 30(1):960–979, 2020.
[GW21] Dan Garber and Noam Wolf. Frank-Wolfe with a nearest extreme point oracle. In Conference on Learning Theory, pages 2103–2132. PMLR, 2021.
[HJN15] Zaid Harchaoui, Anatoli Juditsky, and Arkadi Nemirovski. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75–112, 2015.
[HK12] Elad Hazan and Satyen Kale. Projection-free online learning. In 29th International Conference on Machine Learning, ICML 2012, pages 521–528, 2012.
[HK14] Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research, 15(1):2489–2512, 2014.
[HL16] Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
[HLPR19] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019.
[HZCH20] Yifan Hu, Siqi Zhang, Xin Chen, and Niao He. Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. Advances in Neural Information Processing Systems, 33, 2020.
[Jag13] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435. PMLR, 2013.
[LJJ15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems, pages 496–504, 2015.
[LO20] Xiaoyu Li and Francesco Orabona. A high probability analysis of adaptive sgd with momentum. arXiv preprint arXiv:2007.14294, 2020.
[LP66] Evgeny Levitin and Boris Polyak. Constrained minimization methods. USSR Computational mathematics and mathematical physics, 6(5):1–50, 1966.
[LZ16] Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
[LZW22] Zhipeng Lou, Wanrong Zhu, and Wei Biao Wu. Beyond sub-gaussian noises: Sharp concentration analysis for stochastic gradient descent. Journal of Machine Learning Research, 23:1–22, 2022.
[MDB21] Liam Madden, Emiliano Dall’Anese, and Stephen Becker. High-probability convergence bounds for non-convex stochastic gradient descent. arXiv preprint arXiv:2006.05610, 2021.
[MHK20] Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. Stochastic conditional gradient methods: From convex minimization to submodular maximization. Journal of machine learning research, 2020.
[Mig94] Athanasios Migdalas. A regularization of the frank—wolfe method and unification of certain nonlinear programming methods. Mathematical Programming, 65(1):331–345, 1994.
[Nes18] Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
[QGX⁺21] Qi Qi, Zhishuai Guo, Yi Xu, Rong Jin, and Tianbao Yang. An online method for a class of distributionally robust optimization with non-convex objectives. Advances in Neural Information Processing Systems, 34, 2021.
[QHZ⁺22] Zi-Hao Qiu, Quanqi Hu, Yongjian Zhong, Lijun Zhang, and Tianbao Yang. Large-scale stochastic optimization of ndcg surrogates for deep learning with provable convergence. arXiv preprint arXiv:2202.12183, 2022.
[QLX18] Chao Qu, Yan Li, and Huan Xu. Non-convex conditional gradient sliding. In International Conference on Machine Learning, pages 4208–4217. PMLR, 2018.
[QLX⁺21] Qi Qi, Youzhi Luo, Zhao Xu, Shuiwang Ji, and Tianbao Yang. Stochastic optimization of areas under precision-recall curves with provable convergence. Advances in Neural Information Processing Systems, 34, 2021.
[RS06] Andrzej Ruszczyński and Alexander Shapiro. Optimization of convex risk functions. Mathematics of operations research, 31(3):433–452, 2006.
[RSPS16] Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic frank-wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251. IEEE, 2016.
[Rus87] Andrzej Ruszczyński. A linearization method for nonsmooth stochastic programming problems. Mathematics of Operations Research, 12(1):32–49, 1987.
[Rus21] Andrzej Ruszczynski. A stochastic subgradient method for nonsmooth nonconvex multilevel composition optimization. SIAM Journal on Control and Optimization, 59(3):2301–2320, 2021.
[RWC03] David Ruppert, Matt P Wand, and Raymond J Carroll. Semiparametric regression. Number 12. Cambridge university press, 2003.
[Ver18] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
[WFL17] Mengdi Wang, Ethan Fang, and Han Liu. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
[WGE17] Gang Wang, Georgios B Giannakis, and Yonina C Eldar. Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 64(2):773–794, 2017.
[WLF16] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. In Advances in Neural Information Processing Systems, 2016.
[WYZY22] Guanghui Wang, Ming Yang, Lijun Zhang, and Tianbao Yang. Momentum accelerates the convergence of stochastic AUPRC maximization. In International Conference on Artificial Intelligence and Statistics, pages 3753–3771. PMLR, 2022.
[YSC19] Alp Yurtsever, Suvrit Sra, and Volkan Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In International Conference on Machine Learning, pages 7282–7291. PMLR, 2019.
[YWF19] Shuoguang Yang, Mengdi Wang, and Ethan Fang. Multilevel stochastic gradient methods for nested composition optimization. SIAM Journal on Optimization, 29(1):616–659, 2019.
[ZCC⁺18] Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
[ZSM⁺20] Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One-sample Stochastic Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, pages 4012–4023. PMLR, 2020.
[ZX21] Junyu Zhang and Lin Xiao. Multilevel composite stochastic optimization via nested variance reduction. SIAM Journal on Optimization, 31(2):1131–1157, 2021.

Supplementary Materials

The supplementary materials are organized as follows. Appendix A provides motivating examples for stochastic multilevel optimization. Appendix B introduces the essential technical lemmas to complete the proof. We present the whole proofs of Theorem 2 and Theorem 3 in Appendix C and D. Finally, we present the high-probability convergence analysis particularly for the case when $T=1$ in Appendix E.

Appendix A Motivating Examples

Problems of the form in (1) are generalizations of the standard constrained stochastic optimization problem which is obtained when $T=1$ , and arise in several machine learning applications. Some examples include sparse additive modeling in non-parametric statistics [WFL17, Section 4.1], Bayesian optimization [AF21], model-agnostic meta-learning [CSY21, FMO21], distributionally robust optimization [QGX⁺21], training graph neural networks [CFKM20], reinforcement learning [WLF16, Setion 1.1] and AUPRC maximization [QLX⁺21, WYZY22, QHZ⁺22]. Below, we provide a concrete motivating example from the field of risk-averse stochastic optimization [RS06].

The mean-deviation risk-averse optimization is given by the following optimization problem

\displaystyle\max_{x\in\mathcal{X}}\left\{\mathbb{E}[U(x,\xi)]-\rho\mathbb{E}\left[\{\mathbb{E}[U(x,\xi)]-U(x,\xi)\}^{2}\right]^{1/2}\right\}.

As noted by [YWF19], [Rus21] and [BGN22], the above problem is a stochastic 3-level composition optimization problem with

\displaystyle f_{3}:=\mathbb{E}[U(x,\xi)]\qquad f_{2}(z,x):=\mathbb{E}[\{z-U(x,\xi)\}^{2}]\qquad f_{1}((y_{1},y_{2})):=y_{1}-\sqrt{y_{2}+\delta}.

Here, $\delta>0$ is added to make the square root function smooth. In particular, we consider a semi-parametric data generating process given by a sparse single-index model of the form $b=g(\langle a,x^{*}\rangle)+\zeta$ , where $g:\mathbb{R}\to\mathbb{R}$ is called the link function, $x^{*}\in\mathbb{R}^{d}$ is assumed to be a sparse vector and $\langle\cdot,\cdot\rangle$ represents the Euclidean inner-product between two vectors. Such single-index models are widely used in statistics, machine learning and economics [RWC03]. A standard choices of the link function $g$ is the square function, in which case, the model is also called as the sparse phase retrieval model [WGE17]. Here, $a$ is the input data which is assumed to be independent of the noise $\zeta$ . In this case, $\xi:=(a,b)$ and the if we consider the squared-loss, then $U(x,\xi):=(b-(\langle a,x\rangle)^{2})^{2}$ and is non-convex in $x$ . The goal is to estimate the sparse index vector $x^{*}$ in a risk-averse manner, as they are well-known to provide stable solutions [YWF19]. To encourage sparsity, the set $\mathcal{X}$ is the $\ell_{1}$ ball [Jag13].

Appendix B Technical Lemmas

Lemma 6.

(Smoothness of Composite Functions [BGN22]) Assume that Assumption 2 holds.

a)

Define $F_{i}(x)=f_{i}\circ f_{i+1}\circ\cdots\circ f_{T}(x)$ . Under , the gradient of $F_{i}$ is Lipschitz continuous with the constant

$L_{\nabla F_{i}}=\sum_{j=i}^{T}\left[L_{\nabla f_{j}}\prod_{l=i}^{j-1}L_{f_{l}}\prod_{l=j+1}^{T}L_{f_{l}}^{2}\right].$

Define

\begin{split}&R_{1}=L_{\nabla f_{1}}L_{f_{2}}\cdots L_{f_{T}},\qquad R_{j}=L_{f_{1}}\cdots L_{f_{j-1}}L_{\nabla f_{j}}L_{f_{j+1}}\cdots L_{f_{T}}/L_{f_{j}},\quad 2\leq j\leq T-1,\\ &C_{2}=R_{1},\qquad C_{j}=\sum_{i=1}^{j-2}R_{i}\left(\prod_{l=i+1}^{j-1}L_{f_{l}}\right),\quad 3\leq j\leq T\end{split}

(17)

and let $u_{T+1}=x$ . Then, for $T\geq 2$ , we have

\left\|\nabla F(x)-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i})\right\|\leq\sum_{j=2}^{T}C_{j}\|f_{j}(u_{j+1})-u_{j}\|.

(18)

Lemma 7.

(Smoothness of $\eta(\cdot,\cdot)$ [GRW20]) For fixed $\beta>0$ and, $\eta(x,z)$ defined in (10), the gradient of $\eta(x,z)$ w.r.t. $(x,z)$ is Lipschitz continuous with the constant $L_{\nabla\eta}=2\sqrt{(1+\beta)^{2}+\left(1+\frac{1}{2\beta}\right)^{2}}$ .

Lemma 8.

(Convergnece of ICG [Jag13]) Let $\tilde{y}^{k}$ be the vector output by Algorithm 2 at step $k$ , and $y^{k}$ be the optimal solution of the subproblem 16, then under Assumption 1

\frac{\beta}{2}\|\tilde{y}^{k}-y^{k}\|^{2}\leq H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{t_{k}+2}\\

where $\delta$ defined in Algorithm 2 is the quality of the linear minimization procedure.

Proof of Lemma 8.

The result is obtained by applying Theorem 1 in [Jag13] to $H_{k}$ and noting that the curvature constant $C_{H_{k}}=\beta D_{\mathcal{X}}^{2},\forall k\geq 0$ . ∎

Appendix C Proof of Theorem 2

To establish the rate of convergence for Algorithm 1 in Theorem 2, we first present Lemma 9 and Lemma 10 regarding the basic recursion on the errors in estimating the inner function values and the order of $\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]$ . The proofs follow [BGN22] with minor modifications. We present the complete proofs below for the reader’s convenience.

Lemma 9.

Let $\{x^{k}\}_{k\geq 0}$ and $\{u_{i}^{k}\}_{k\geq 0}$ be generated by Algorithm 1 and $u_{T+1}=x$ . Define, $1\leq i\leq T$ ,

\begin{split}\Delta_{G_{i}}^{k+1}&:=f_{i}(u_{i+1}^{k})-G_{i}^{k+1},\quad\Delta_{J_{i}}^{k+1}:=\nabla f_{i}(u_{i+1}^{k})-J_{i}^{k+1},\\ e_{i}^{k}&:=f_{i}(u_{i+1}^{k+1})-f_{i}(u_{i+1}^{k})-\langle\nabla f_{i}(u_{i+1}^{k}),u_{i+1}^{k+1}-u_{i+1}^{k}\rangle.\end{split}

(19)

Under Assumption 2, we have, for $1\leq i\leq T$ ,

\begin{split}&\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}\leq(1-\tau_{k})\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\dot{r}_{i}^{k+1}\\ &\quad+\left[4L_{f_{i}}^{2}+L_{\nabla f_{i}}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\|\Delta_{J_{i}}^{k+1}\|^{2}\right]\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},\end{split}

(20)

and

\begin{split}\|u_{i}^{k+1}-u_{i}^{k}\|^{2}\leq\tau_{k}^{2}\left[2\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\|\Delta_{G_{i}}^{k+1}\|^{2}\right]+2\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\ddot{r}_{i}^{k+1}\end{split}

(21)

where

\begin{split}&\dot{r}_{i}^{k+1}:=2\tau_{k}\langle\Delta_{G_{i}}^{k+1},e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})+{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ &\qquad\qquad+2\langle{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k}),e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})\rangle,\\ &\ddot{r}_{i}^{k+1}:=\tau_{k}\langle-\Delta_{G_{i}}^{k+1},\tau_{k}(f_{i}(u_{i+1}^{k})-u_{i}^{k})+{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle.\end{split}

(22)

Proof.

We first prove part (20). By the definitions in (19), (22), for any $1\leq i\leq T$ , we have

\begin{split}&\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}\\ =&\|e_{i}^{k}+f_{i}(u_{i+1}^{k})+\nabla f_{i}(u_{i+1}^{k})^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})-(1-\tau_{k})u_{i}^{k}-\tau_{k}G_{i}^{k+1}-{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}\\ =&\|e_{i}^{k}+{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})+\tau_{k}\Delta_{G_{i}}^{k+1}\|^{2}\\ =&\|{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+\|e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\dot{r}_{i}^{k+1}\\ \leq&\|e_{i}^{k}+(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\|\Delta_{J_{i}}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\dot{r}_{i}^{k+1}\\ \leq&(1-\tau_{k})\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\|e_{i}^{k}\|^{2}+2(1-\tau_{k})\|e_{i}^{k}\|\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}\\ &\qquad+\|\Delta_{J_{i}}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\dot{r}_{i}^{k+1}.\end{split}

Furthermore, with Assumption 2, we have

\|e_{i}^{k}\|\leq\frac{L_{\nabla f_{i}}}{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},\qquad\|e_{i}^{k}\|^{2}\leq 4L_{f_{i}}^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},

(23)

which leads to (20). To show (21), with the update rule given by (8) and the definitions in (19), we have, for $1\leq i\leq T$ ,

\begin{split}&\|u_{i}^{k+1}-u_{i}^{k}\|^{2}\\ =&\|\tau_{k}(G_{i}^{k+1}-u_{i}^{k})+\langle J_{i}^{k+1},u_{i+1}^{k+1}-u_{i+1}^{k}\rangle\|^{2}\\ =&\tau_{k}^{2}\|G_{i}^{k+1}-u_{i}^{k}\|^{2}+\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+2\tau_{k}\langle G_{i}^{k+1}-u_{i}^{k},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ =&\tau_{k}^{2}\|G_{i}^{k+1}-u_{i}^{k}\|^{2}+\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+2\tau_{k}\langle f_{i}(u_{i+1}^{k})-u_{i}^{k},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ &\qquad+2\tau_{k}\langle-\Delta_{G_{i}}^{k+1},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ \leq&\tau_{k}^{2}\|G_{i}^{k+1}-u_{i}^{k}\|^{2}+2\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}+\tau_{k}^{2}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\\ &\qquad+2\tau_{k}\langle-\Delta_{G_{i}}^{k+1},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\\ =&2\tau_{k}^{2}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+2\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}\\ &\qquad+2\tau_{k}\langle-\Delta_{G_{i}}^{k+1},\tau_{k}(f_{i}(u_{i+1}^{k})-u_{i}^{k})+{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle.\end{split}

where the inequality comes from the fact that $\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}\leq\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}$ and $2\tau_{k}\langle f_{i}(u_{i+1}^{k})-u_{i}^{k},{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\rangle\leq\|{J_{i}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}+\tau_{k}^{2}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}$ . ∎

Lemma 10.

Let $u_{T+1}=x$ . Under Assumption 2, 3, and with the choice of $\tau_{0}=1$ , we have, for $1\leq i\leq T$ and $k\geq 0$ ,

	$\displaystyle\mathbb{E}[\\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\\|^{2}\|\mathscr{F}_{k}]\leq\sigma_{G_{i}}^{2}+(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2})c_{i+1},$		(24)
	$\displaystyle\mathbb{E}[\\|u_{i}^{k+1}-u_{i}^{k}\\|^{2}\|\mathscr{F}_{k}]\leq c_{i}\tau_{k}^{2},$		(25)

where

c_{i}:=3\sigma_{G_{i}}^{2}+2(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2}+\hat{\sigma}_{J_{i}}^{2})c_{i+1},\qquad c_{T+1}=D_{\mathcal{X}}^{2}.

(26)

Proof.

By the update rule given in (8) and the definitions in (19), for $1\leq i\leq T$ and $k\geq 0$ , we have

f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}=(1-\tau_{k})(f_{i}(u_{i+1}^{k})-u_{i}^{k})+\mathbf{D}_{k,i},

where $\mathbf{D}_{k,i}:=e_{i}^{k}+\tau_{k}\Delta_{G_{i}}^{k+1}+{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})$ . With the convexity of $\|\cdot\|^{2}$ , we can further obtain

\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}\leq(1-\tau_{k})\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\frac{1}{\tau_{k}}\|\mathbf{D}_{k,i}\|^{2},\quad\forall k\geq 0.

(27)

Moreover, under Assumption 3, we have, for $1\leq i\leq T$ and $k\geq 0$ ,

\begin{split}\mathbb{E}[\|\mathbf{D}_{k,i}\|^{2}|\mathscr{F}_{k}]&=\mathbb{E}[\|e_{i}^{k}\|^{2}|\mathscr{F}_{k}]+\tau_{k}^{2}\mathbb{E}[\|\Delta_{G_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]+\mathbb{E}[\|{\Delta_{J_{i}}^{k+1}}^{\top}(u_{i+1}^{k+1}-u_{i+1}^{k})\|^{2}|\mathscr{F}_{k}]\\ &\leq\tau_{k}^{2}\mathbb{E}[\|\Delta_{G_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]+\left(4L_{f_{i}}^{2}+\mathbb{E}[\|\Delta_{J_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]\right)\mathbb{E}[\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq\tau_{k}^{2}\sigma_{G_{i}}^{2}+\left(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2}\right)\mathbb{E}[\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}|\mathscr{F}_{k}].\end{split}

(28)

where the second inequality follows from (23). Setting $i=T$ in the inequality above and noting that $u_{T+1}^{k}=x^{k}$ , we have

\mathbb{E}[\|\mathbf{D}_{k,T}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\left[\sigma_{G_{T}}^{2}+(4L_{f_{T}}^{2}+\sigma_{J_{T}}^{2})D_{\mathcal{X}}^{2}\right],\quad\forall k\geq 0.

Thus, with the choice of $\tau_{0}=1$ , we obtain

\mathbb{E}[\|f_{T}(x^{k})-u_{T}^{k}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{T}}^{2}+(4L_{f_{T}}^{2}+\sigma_{J_{T}}^{2})D_{\mathcal{X}}^{2},\quad\forall k\geq 1.

Taking expectation of both sides of (21) conditioning on $\mathscr{F}_{k}$ , and under Assumption 3, we obtain

\begin{split}\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\mathbb{E}\left[2\|f_{i}(x^{k})-u_{i}^{k}\|^{2}+\|\Delta_{G_{i}}^{k+1}\|^{2}+\frac{2}{\tau_{k}^{2}}\|J_{i}^{k+1}\|^{2}\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}\middle|\mathscr{F}_{k}\right].\end{split}

(29)

Setting $i=T$ in the inequality above, we have

\begin{split}\mathbb{E}[\|u_{T}^{k+1}-u_{T}^{k}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\left[3\sigma_{G_{T}}^{2}+2(4L_{f_{T}}^{2}+\sigma_{J_{T}}^{2}+\hat{\sigma}_{J_{T}}^{2})D_{\mathcal{X}}^{2}\right],\quad\forall k\geq 1.\end{split}

This completes the proof of (24) and (25) when $i=T$ . We now use backward induction to complete the proof. By the above result, the base case of $i=T$ holds. Assume that (25) hold when $i=j$ for some $1<j\leq T$ , i.e., $\mathbb{E}[\|u_{j}^{k+1}-u_{j}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{j}\tau_{k}^{2},\forall k\geq 0$ . Then, setting $i=j-1$ in (28), we obtain

\mathbb{E}[\|\mathbf{D}_{k,j-1}\|^{2}|\mathscr{F}_{k}]\leq\tau_{k}^{2}\left[\sigma_{G_{j-1}}^{2}+(4L_{f_{j-1}}^{2}+\sigma_{J_{j-1}}^{2})c_{j}\right],\quad\forall k\geq 0.

Furthermore, with (27) and the choice of $\tau_{0}=1$ , we have

\mathbb{E}[\|f_{j-1}(u_{j}^{k+1})-u_{j-1}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{j-1}}^{2}+(4L_{f_{j-1}}^{2}+\sigma_{J_{j-1}}^{2})c_{j},\quad\forall k\geq 0.

which together with (29), imply that

\mathbb{E}[\|u_{j-1}^{k+1}-u_{j-1}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{j-1}\tau_{k}^{2},\quad\forall k\geq 0.

∎

We now leverage the merit function defined in (9) and provide a basic inequality for establishing convergence analysis of Algorithm 1 in Lemma 11. In Proposition 3, we show the boundedness of the term $\mathbf{R}_{k}$ appearing on the right hand side of (30) in expectation. These two results form the crucial steps in establishing the convergence analysis of Algorithm 1.

Lemma 11.

Let $\{x^{k},z^{k},u^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 1, the merit function $W_{\alpha,\gamma}(\cdot,\cdot,\cdot)$ be defined in (9) with positive constants $\{\alpha,\{\gamma_{i}\}_{1\leq i\leq T}\}$ , and $u_{T+1}=x$ . Under Assumption 2, for any $\beta>0$ , let

\beta_{k}\equiv\beta,\quad\alpha=\frac{\beta}{20L_{\nabla F}^{2}},\quad\gamma_{1}=\frac{\beta}{2},\quad\gamma_{j}=\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}+\frac{\beta}{2},\quad 2\leq j\leq T,

where $C_{j}$ ’s are defined in (17). Then, $\forall N\geq 0$

\begin{split}\sum_{k=0}^{N}\tau_{k}\left(\beta\left[\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right]+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right)\\ \leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\end{split}

(30)

where $d^{k}:=y^{k}-x^{k}$ , $H_{k}(\cdot),y^{k}$ are defined in (16), and

\begin{split}\mathbf{R}_{k}:=&\sum_{i=1}^{T}\gamma_{i}\left[4L_{f_{i}}^{2}+L_{\nabla f_{i}}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\|\Delta_{J_{i}}^{k+1}\|^{2}\right]\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2}\\ &+\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\sum_{i=1}^{T}\gamma_{i}\|\Delta_{G_{i}}^{k+1}\|^{2}+\alpha\|\Delta^{k+1}\|^{2}\right]\\ &+\tau_{k}\left[\langle d^{k},\Delta^{k+1}\rangle+\sum_{i=1}^{T}\gamma_{i}\dot{r}_{i}^{k+1}+2\alpha\dddot{r}^{k+1}\right]+\frac{L_{\nabla\eta}}{2}\|z^{k+1}-z^{k}\|^{2},\\ \Delta^{k+1}:=&\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})-\prod_{i=1}^{T}J^{k+1}_{T-i+1},\\ \dddot{r}^{k+1}:=&\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})]\\ &+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle,\end{split}

(31)

$\Delta_{G_{i}}^{k+1}$ , $\Delta_{J_{i}}^{k+1}$ are defined in (19), and $\dot{r}_{i}^{k+1}$ is defined in (22).

Proof.

We first bound $F(x^{k+1})-F(x^{k})$ . By the Lipschitzness of $\nabla F$ (Lemma 6), we have

\begin{split}&F(x^{k+1})-F(x^{k})\leq\langle\nabla F(x^{k}),x^{k+1}-x^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}}{2}\|\tilde{d}^{k}\|^{2}\\ &=\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\langle\nabla F(x^{k})-z^{k},\tilde{y}^{k}-y^{k}\rangle-\tau_{k}\langle\beta d^{k},\tilde{y}^{k}-y^{k}\rangle\\ &\qquad+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}}{2}\|\tilde{d}^{k}\|^{2}\\ &\leq\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle\\ &\qquad+\tau_{k}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|+\frac{L_{\nabla F}\tau_{k}^{2}}{2}\|\tilde{d}^{k}\|^{2}.\end{split}

(32)

We then provide a bound for $\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})$ . By the lipschitzness of $\nabla\eta$ (Lemma 7) with the partial gradients of $\nabla\eta$ given by

\nabla_{x}\eta(x^{k},z^{k})=-z^{k}-\beta d^{k},\quad\nabla_{z}\eta(x^{k},z^{k})=d^{k},

we have

\begin{split}&\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\\ &\leq\big{\langle}z^{k}+\beta d^{k},x^{k+1}-x^{k}\big{\rangle}-\big{\langle}d^{k},z^{k+1}-z^{k}\big{\rangle}+\frac{L_{\nabla\eta}}{2}\bigg{[}\|x^{k+1}-x^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}\\ &=\tau_{k}\big{\langle}2z^{k}+\beta d^{k},d^{k}\big{\rangle}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{d}^{k}-d^{k}\big{\rangle}-\tau_{k}\big{\langle}d^{k},\prod_{i=1}^{T}J_{T-i+1}^{k+1}\big{\rangle}\\ &\qquad+\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]},\end{split}

(33)

where the second equality comes from (6) and (7). Due to the optimality condition of in the definition of $y^{k}$ , we have $\big{\langle}z^{k}+\beta d^{k},x-y^{k}\big{\rangle}\geq 0$ for all $x\in\mathcal{X}$ , which together with the choice of $x=x^{k}$ implies that

\langle z^{k},d^{k}\rangle+\beta\|d^{k}\|^{2}\leq 0.

(34)

Thus, combining (33) with (34), we obtain

\begin{split}\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\leq-\beta\tau_{k}\|d^{k}\|^{2}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\big{\rangle}-\tau_{k}\big{\langle}d^{k},\prod_{i=1}^{T}J_{T-i+1}^{k+1}\big{\rangle}\\ +\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}.\end{split}

(35)

In addition, by Lemma 18, we have

\langle d^{k},\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})\rangle\leq\sum_{j=2}^{T}C_{j}\|d^{k}\|\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|.

(36)

Then combing (32), (35), (36), we have

\begin{split}&[F(x^{k+1})-\eta(x^{k+1},z^{k+1})]-[F(x^{k})-\eta(x^{k},z^{k})]\\ &\leq\tau_{k}\bigg{\{}-\beta\|d^{k}\|^{2}+\sum_{j=2}^{T}C_{j}\|d^{k}\|\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|+\langle d^{k},\Delta^{k+1}\rangle+2\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle\\ &~{}+\left[\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right]\|\tilde{y}^{k}-y^{k}\|\bigg{\}}+\frac{L_{\nabla F}+L_{\nabla\eta}}{2}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\frac{L_{\nabla\eta}}{2}\|z^{k+1}-z^{k}\|^{2}.\end{split}

(37)

Furthermore, defining

\varkappa^{k}:=\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k}),\qquad\bar{\varkappa}^{k}:=\frac{\nabla F(x^{k+1})-\nabla F(x^{k})}{\tau_{k}},

and by the update rule given by (7), we have

\begin{split}&\|\nabla F(x^{k+1})-z^{k+1}\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\varkappa^{k}+\bar{\varkappa}^{k}+\Delta^{k+1}]\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\varkappa^{k}+\bar{\varkappa}^{k}]\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+2\tau_{k}\dddot{r}^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+2\tau_{k}\left[\|\varkappa^{k}\|^{2}+\|\bar{\varkappa}^{k}\|^{2}\right]+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+2\tau_{k}\dddot{r}^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}\\ &+2\tau_{k}\left[(T-1)\sum_{j=2}^{T}C_{j}^{2}\|f_{j}(u_{j+1})-u_{j}\|^{2}+2L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+\dddot{r}^{k+1}\right].\\ \end{split}

(38)

where $\dddot{r}^{k+1}:=\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\varkappa^{k}+\bar{\varkappa}^{k}]\rangle$ and the last inequality comes from two fact that $\|\bar{\varkappa}^{k}\|^{2}\leq 2L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})$ and

\begin{split}\|\varkappa^{k}\|^{2}=\left\|\nabla F(x^{k})-\prod_{i=1}^{T}\nabla f_{T+1-i}(u_{T+2-i}^{k})\right\|^{2}\leq(T-1)\sum_{j=2}^{T}C_{j}^{2}\|f_{j}(u_{j+1})-u_{j}\|^{2}.\end{split}

The above upper bound for the term $\|\varkappa^{k}\|^{2}$ is obtained by leveraging Lemma 18 and the fact that $(\sum_{i=1}^{n}a_{i})\leq n\sum_{i=1}^{n}a_{i}^{2}$ for non-negative sequence $(a_{i})_{1\leq i\leq n}$ .

Moreover, by Lemma 9, we have, for $1\leq i\leq T$ ,

\begin{split}&\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|^{2}-\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\leq-\tau_{k}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}+\tau_{k}^{2}\|\Delta_{G_{i}}^{k+1}\|^{2}+\dot{r}_{i}^{k+1}\\ &\qquad\qquad+\left[4L_{f_{i}}^{2}+L_{\nabla f_{i}}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|+\|\Delta_{J_{i}}^{k+1}\|^{2}\right]\|u_{i+1}^{k+1}-u_{i+1}^{k}\|^{2},\end{split}

(39)

Finally, multiplying both sides of (39) by $\gamma_{i}$ for $i=1,\dots,T$ and both sides of (38) by $\alpha$ , adding them to (37), rearranging the terms, and noting that $\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}$ due to the quadratic structure of $H_{k}$ and $\|\tilde{d}^{k}\|^{2}\leq D_{\mathcal{X}}^{2}$ , we obtain

W_{k+1}-W_{k}\leq\tau_{k}\mathbf{A}_{k}+\mathbf{R}_{k}

(40)

where $\mathbf{R}_{k}$ is defined in (31) and

\begin{split}\mathbf{A}_{k}:=&\left(-\beta+4\alpha L_{\nabla F}^{2}\right)\|d^{k}\|^{2}+\sum_{j=2}^{T}\left(-\gamma_{j}+2\alpha(T-1)C_{j}^{2}\right)\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|^{2}\\ &-\gamma_{1}\|f_{1}(u_{2}^{k})-u_{1}^{k}\|^{2}-\alpha\|\nabla F(x^{k})-z^{k}\|^{2}+\sum_{j=2}^{T}C_{j}\|d^{k}\|\|f_{j}(u_{j+1})-u_{j}\|\\ &+\left(\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right)\|\tilde{y}^{k}-y^{k}\|+\left(4\alpha L_{\nabla F}^{2}-\beta\right)\|\tilde{y}^{k}-y^{k}\|^{2}\\ &+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

We can further provide a simplified upper bound for $\mathbf{A}_{k}$ . By Young’s inequality, we have

\begin{split}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\beta}{4}\|d^{k}\|^{2}+\beta\|\tilde{y}^{k}-y^{k}\|^{2},\\ \|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{2\alpha}\|\tilde{y}^{k}-y^{k}\|^{2}\\ C_{j}\|d^{k}\|\|f_{j}(u_{j+1})-u_{j}\|&\leq\frac{\alpha L_{\nabla F}^{2}}{T-1}\|d^{k}\|^{2}+\frac{(T-1)C_{j}^{2}}{4\alpha L_{\nabla F}^{2}}\|f_{j}(u_{j+1})-u_{j}\|^{2}.\end{split}

Thus,

\begin{split}\mathbf{A}_{k}\leq&\left(-\frac{3\beta}{4}+5\alpha L_{\nabla F}^{2}\right)\|d^{k}\|^{2}-\gamma_{1}\|f_{1}(u_{2}^{k})-u_{1}^{k}\|^{2}-\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\sum_{j=2}^{T}\left(-\gamma_{j}+\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}\right)\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|^{2}\\ &+\left(4\alpha L_{\nabla F}^{2}+\frac{1}{2\alpha}\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\\ \end{split}

For any $\beta>0$ , let

\alpha=\frac{\beta}{20L_{\nabla F}^{2}},\qquad\gamma_{1}=\frac{\beta}{2},\qquad\gamma_{j}=\left(2\alpha+\frac{1}{4\alpha L_{\nabla F}^{2}}\right)(T-1)C_{j}^{2}+\frac{\beta}{2},\quad 2\leq j\leq T

Then, we have

\begin{split}\mathbf{A}_{k}\leq-\frac{\beta}{2}\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)-\frac{\beta}{40L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\\ +\left(\frac{12}{5}+\frac{20L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\\ \end{split}

(41)

As a result of (40) and (41), we can further obtain

\begin{split}\tau_{k}\left(\beta\left[\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right]+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right)\\ \leq 2W_{k}-2W_{k+1}+2\mathbf{R}_{k}+\tau_{k}\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\end{split}

which immediately implies (30) by telescoping. ∎

Proposition 3.

Let $\mathbf{R}_{k}$ be defined in (31) and $\tau_{0}=1$ . Then, under Assumption 3, we have

\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]\leq\hat{\sigma}^{2}\tau_{k}^{2},\quad\forall k\geq 1,

where

\begin{split}\hat{\sigma}^{2}:=&\sum_{i=1}^{T}\gamma_{i}\left(\left[4L_{\nabla f_{i}}^{2}+L_{\nabla f_{i}}\sqrt{\sigma_{G_{i}}^{2}+(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2})c_{i+1}}+\sigma_{J_{i}}^{2}\right]c_{i+1}+\sigma_{G_{i}}^{2}\right)\\ &+(\alpha+2L_{\eta})\prod_{i=1}^{T}\hat{\sigma}_{J_{i}}^{2}+\frac{L_{\nabla F}+L_{\eta}}{2}D_{\mathcal{X}}^{2}.\end{split}

(42)

Proof.

Note that under Assumption 3, we have, for $1\leq i\leq T$ ,

\begin{split}&\mathbb{E}[\Delta^{k+1}|\mathscr{F}_{k}]=0,\quad\mathbb{E}[\dot{r}_{i}^{k+1}|\mathscr{F}_{k}]=0,\quad\mathbb{E}[\dddot{r}^{k+1}|\mathscr{F}_{k}]=0,\\ &\mathbb{E}[\|\Delta_{G_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{G_{i}}^{2},\quad\mathbb{E}[\|\Delta_{J_{i}}^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\sigma_{J_{i}}^{2},\end{split}

and

\mathbb{E}[\|\Delta^{k+1}\|^{2}|\mathscr{F}_{k}]\leq\mathbb{E}\left[\left\|\prod_{i=1}^{T}J^{k+1}_{T-i+1}\right\|^{2}\middle|\mathscr{F}_{k}\right]\leq\prod_{i=1}^{T}\mathbb{E}\left[\left\|J^{k+1}_{T-i+1}\right\|^{2}\middle|\mathscr{F}_{k}\right]\leq\prod_{i=1}^{T}\hat{\sigma}_{J_{i}}^{2}.

In addition, by Lemma 9 and Hölder’s inequality. we have $\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\leq c_{i}\tau_{k}^{2}$ and

\begin{split}&\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\|\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\||\mathscr{F}_{k}]\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq\left(\mathbb{E}[\|f_{i}(u_{i+1}^{k+1})-u_{i}^{k+1}\||\mathscr{F}_{k}]\right)^{\frac{1}{2}}\mathbb{E}[\|u_{i}^{k+1}-u_{i}^{k}\|^{2}|\mathscr{F}_{k}]\\ &\leq c_{i}\sqrt{\sigma_{G_{i}}^{2}+(4L_{f_{i}}^{2}+\sigma_{J_{i}}^{2})c_{i+1}}~{}\tau_{k}^{2}.\end{split}

Lastly, from eq.(28) of Proposition 2.1 in [BGN22], we have for any $k\geq 1$ ,

\mathbb{E}[\|z^{k+1}-z^{k}\|^{2}|\mathscr{F}_{k}]\leq 4\tau_{k}^{2}\prod_{i=1}^{T}\hat{\sigma}_{J_{i}}^{2}.

The proof is completed by combing all above observations with the expression of $\mathbf{R}_{k}$ in (31). ∎

Proof of Theorem 2.

We now present the proof of Theorem 2. Note that by Lemma 11 and given values of $\alpha,\gamma$ in (12), we obtain

\begin{split}&\sum_{k=1}^{N}\tau_{k}\left[\beta\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\right]\\ &\leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+2\sum_{k=0}^{N}\mathbf{R}_{k}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right),\end{split}

Taking expectation of both sides and noting that $\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]\leq\hat{\sigma}^{2}\tau_{k}^{2}$ by Proposition 3, we have

\begin{split}&\sum_{k=1}^{N}\tau_{k}\mathbb{E}\left[\rho\left(\|d^{k}\|^{2}+\sum_{i=1}^{T}\|f_{i}(u_{i+1}^{k})-u_{i}^{k}\|^{2}\right)+\alpha\|\nabla F(x^{k})-z^{k}\|^{2}\middle|\mathscr{F}_{k-1}\right]\\ &\leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+2\hat{\sigma}^{2}\sum_{k=0}^{N}\tau_{k}^{2}+\left(\frac{24}{5}+\frac{40L_{\nabla F}^{2}}{\beta^{2}}\right)\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

(43)

Then, setting $\tau_{k},t_{k}$ to be values in (11) and noting that by Lemma 8, we have

H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{t_{k}+2}\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{k}},\quad\forall k\geq 1.

Also, with the choice of $z^{0}=0$ , we have $y^{0}=\tilde{y}^{0}=x^{0}$ . Thus, we can conclude that

\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\leq\frac{2\beta D_{\mathcal{X}}^{2}(1+\delta)}{\sqrt{N}}\sum_{k=1}^{N}\frac{1}{\sqrt{k}}\leq 4\beta D_{\mathcal{X}}^{2}(1+\delta).

which together with (43) immediately imply that $\forall N\geq 1$ ,

\begin{split}\frac{1}{\sqrt{N}}\sum_{k=1}^{N}\mathbb{E}\left[\beta\left(\|d^{k}\|^{2}+\sum_{j=1}^{T}\|f_{j}(u_{j+1}^{k})-u_{j}^{k}\|^{2}\right)+\frac{\beta}{20L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\middle|\mathscr{F}_{k-1}\right]\\ \leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta).\end{split}

where

\mathcal{B}(\beta,\sigma^{2},L,D_{\mathcal{X}},T,\delta)=4\hat{\sigma}^{2}+32\beta D_{\mathcal{X}}^{2}(1+\delta)\left(\frac{3}{5}+\frac{5L_{\nabla F}^{2}}{\beta^{2}}\right),

and $\hat{\sigma}^{2}$ is given in (42). As a result, we can obtain (13) and (14) by the definition of random integer $R$ and

\begin{split}\|\mathcal{G}(x^{k},\nabla F(x^{k}),\beta)\|^{2}&\leq 2\beta^{2}\|d^{k}\|^{2}+2\beta^{2}\left\|\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}\nabla F(x^{k})\right)-\Pi_{\mathcal{X}}\left(x^{k}-\frac{1}{\beta}z^{k}\right)\right\|^{2}\\ &\leq 2\beta^{2}\|d^{k}\|^{2}+2\|\nabla F(x^{k})-z^{k}\|^{2}.\end{split}

∎

Appendix D Proofs for Section 3.1

D.1 Proof of Theorem 3 for $T=2$

To show the rate of convergence for Algorithm 3, we simplify the merit function in the analysis of the multi-level problems and leverage the following function:

W_{\alpha,\gamma}(x^{k},z^{k},u^{k})=F(x^{k})-F^{\star}-\eta(x^{k},z^{k})+\alpha\|\nabla F(x^{k})-z^{k}\|^{2}+\gamma\|f_{2}(x^{k})-u_{2}^{k}\|^{2},

(44)

where $\alpha,\gamma$ are positive constants, $\eta(\cdot,\cdot)$ is defined in (10). We now present the analogue of Lemma 11 for Algorithm 3. The proof follows similar steps as that proof of Lemma 11 with slight modifications, and hence we will skip some arguments already presented before.

Lemma 12.

Let $\{x^{k},z^{k},u_{2}^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 3 and the merit function $W_{\alpha,\gamma}(\cdot,\cdot,\cdot)$ be defined in (44) with

\alpha=\frac{\rho}{L_{\nabla F}},\qquad\gamma=3\rho L_{\nabla f_{1}},\qquad\rho>0.

Under Assumptions 2 with $T=2$ , setting $\beta_{k}\equiv\beta\geq 6\rho L_{\nabla F}+(2\rho+\frac{2}{3\rho})L_{\nabla f_{1}}L_{f_{2}}^{2}$ , we have $\forall N\geq 0$

\begin{split}&\rho\sum_{k=0}^{N}\tau_{k}\bigg{(}L_{\nabla F}\|d^{k}\|^{2}+L_{\nabla f_{1}}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+\frac{1}{L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}4+\frac{2(8\rho+1/\rho)L_{\nabla F}+24\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

where $d^{k}=y^{k}-x^{k}$ , $H_{k}(\cdot),y^{k}$ are defined in (16), and

\begin{split}\mathbf{R}_{k}:=&\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\gamma\|\Delta_{G_{2}}^{k+1}\|^{2}+\alpha\|\Delta^{k+1}\|^{2}\right]+\frac{L_{\eta}}{2}\|z^{k+1}-z^{k}\|^{2}\\ &\qquad+\tau_{k}\langle d^{k},\Delta^{k+1}\rangle+\gamma\dot{r}^{k+1}+\alpha\ddot{r}^{k+1},\\ \Delta^{k+1}:=&\nabla f_{2}(x^{k})\nabla f_{1}(u_{2}^{k})-J_{2}^{k+1}J_{1}^{k+1},\quad\Delta_{G_{2}}^{k+1}:=f_{2}(x^{k})-G_{2}^{k+1}\\ \dot{r}^{k+1}:=&2\tau_{k}\langle\Delta_{G_{2}}^{k+1},f_{2}(x^{k+1})-f_{2}(x^{k})+(1-\tau_{k})(f_{2}(x^{k})-u_{2}^{k})\rangle,\\ \ddot{r}^{k+1}:=&2\tau_{k}\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[\nabla F(x^{k})-\nabla f_{2}(x^{k})\nabla f_{1}(u^{k})]\\ &\qquad+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle.\end{split}

(45)

Proof of Lemma 12.

1. By the Lipschitzness of $\nabla F$ (Lemma 6), we have

\begin{split}&F(x^{k+1})-F(x^{k})\leq\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|\\ &\qquad+\tau_{k}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}}{2}.\\ \end{split}

(46)

2. Also, by the Lipschitzness of $\nabla\eta$ (Lemma 7) and the optimality condition of in the definition of $y^{k}$ , we have

\begin{split}&\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\leq-\beta\tau_{k}\|d^{k}\|^{2}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\big{\rangle}\\ &\qquad-\tau_{k}\big{\langle}d^{k},\nabla f_{2}(x^{k})\nabla f_{1}(u_{2}^{k})\big{\rangle}+\tau_{k}\big{\langle}d^{k},\Delta^{k+1}\big{\rangle}+\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}.\end{split}

(47)

3. In addition, by the Lipschitzness of $f_{2}$ and $\nabla f_{1}$ , we have

\begin{split}\langle d^{k},\nabla F(x^{k})-\nabla f_{2}(x^{k})\nabla f_{1}(u_{2}^{k})\rangle&=\big{\langle}d^{k},\nabla f_{2}(x^{k})^{\top}\big{[}\nabla f_{1}(f_{2}(x^{k}))-\nabla f_{1}(u_{2}^{k})\big{]}\big{\rangle}\\ &\leq L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|.\end{split}

(48)

4. Moreover, by the update rule, we have

\begin{split}&\|f_{2}(x^{k+1})-u_{2}^{k+1}\|^{2}=\|f_{2}(x^{k+1})-f_{2}(x^{k})+(1-\tau_{k})[f_{2}(x^{k})-u_{2}^{k}]+\tau_{k}\Delta_{G_{2}}^{k+1}\|^{2}\\ &\qquad=\|(1-\tau_{k})[f_{2}(x^{k})-u_{2}^{k}]+f_{2}(x^{k+1})-f_{2}(x^{k})\|^{2}+\tau_{k}^{2}\|\Delta_{G_{2}}^{k+1}\|^{2}+\dot{r}^{k+1}\\ &\qquad\leq(1-\tau_{k})\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+2\tau_{k}L_{f_{2}}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+\tau_{k}^{2}\|\Delta_{G_{2}}^{k+1}\|^{2}+\dot{r}^{k+1}\end{split}

(49)

where $\dot{r}^{k+1}:=2\tau_{k}\langle\Delta_{G_{2}}^{k+1},f_{2}(x^{k+1})-f_{2}(x^{k})+(1-\tau_{k})(f_{2}(x^{k})-u_{2}^{k})\rangle$ and the last inequality follows Jensen’s inequality for the convex function $\|\cdot\|^{2}$ as well as

\left\|\frac{1}{\tau_{k}}\left[f_{2}(x^{k+1})-f_{2}(x^{k})\right]\right\|^{2}\leq L_{f_{2}}^{2}\|\tilde{d}^{k}\|^{2}\leq 2L_{f_{2}}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2}).

5. Defining

e^{k}:=\frac{1}{\tau_{k}}\bigg{[}\nabla F(x^{k+1})-\nabla F(x^{k})\bigg{]}+\nabla F(x^{k})-\nabla f_{2}(x^{k})\nabla f_{1}(u^{k}),

and by the update rule, we have

\begin{split}&\|\nabla F(x^{k+1})-z^{k+1}\|^{2}=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}[e^{k}+\Delta^{k+1}]\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}e^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+\ddot{r}^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\tau_{k}\|e^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+\ddot{r}^{k+1}\\ \end{split}

(50)

where $\ddot{r}^{k+1}:=2\tau_{k}\langle\Delta^{k},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\tau_{k}e^{k}\rangle$ . We can further upper bound the term $\|e^{k}\|^{2}$ by

\begin{split}\|e^{k}\|^{2}&\leq 2L_{\nabla F}^{2}\|\tilde{d}^{k}\|^{2}+2L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}\|f_{2}(x^{k})-u^{k}\|^{2}\\ &\leq 4L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+2L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}\|f_{2}(x^{k})-u^{k}\|^{2}\end{split}

(51)

6. By combing (46), (47), (48), (49), (50), (51), rearranging the terms, and noting that $\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}$ and $\|\tilde{d}^{k}\|\leq D_{\mathcal{X}}$ , we obtain

W_{k+1}-W_{k}\leq\tau_{k}\mathbf{A}_{k}+\mathbf{R}_{k}

(52)

where $\mathbf{R}_{k}$ is defined in (45) and

\begin{split}\mathbf{A}_{k}:=&\left(-\beta+4\alpha L_{\nabla F}^{2}+2\gamma L_{f_{2}}^{2}\right)\|d^{k}\|^{2}+\left(-\gamma+2\alpha L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}\right)\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\\ &+L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|-\alpha\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\left(\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right)\|\tilde{y}^{k}-y^{k}\|\\ &+\left(4\alpha L_{\nabla F}^{2}+2\gamma L_{f_{2}}^{2}-\beta\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

We then provide a simplified upper bound for $\mathbf{A}_{k}$ . By the Young’s inequality, we have

\begin{split}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\beta}{4}\|d^{k}\|^{2}+\beta\|\tilde{y}^{k}-y^{k}\|^{2},\\ \|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{2\alpha}\|\tilde{y}^{k}-y^{k}\|^{2}.\end{split}

In addition, we reparametrize $\alpha=\frac{\rho}{L_{\nabla F}}$ . Noting that by Lemma 6 with $T=2$

\frac{L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}}{L_{\nabla F}}=\frac{L_{\nabla f_{1}}^{2}L_{f_{2}}^{2}}{L_{\nabla f_{1}}L_{f_{2}}^{2}+L_{f_{1}}L_{\nabla f_{2}}}\leq L_{\nabla f_{1}},

we therefore have

\begin{split}\mathbf{A}_{k}\leq&\left(-\frac{3\beta}{4}+4\rho L_{\nabla F}+2\gamma L_{f_{2}}^{2}\right)\|d^{k}\|^{2}+\left(-\gamma+2\rho L_{\nabla f_{1}}\right)\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\\ &+L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|-\frac{\rho}{2L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\left(4\rho L_{\nabla F}+2\gamma L_{f_{2}}^{2}+\frac{L_{\nabla F}}{2\rho}\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

Then, setting $\gamma=3\rho L_{\nabla f_{1}}$ and $\beta\geq 6\rho L_{\nabla F}+(2\rho+\frac{2}{3\rho})L_{\nabla f_{1}}L_{f_{2}}^{2}$ , we can obtain

\begin{split}&\left(-\frac{3\beta}{4}+4\rho L_{\nabla F}+2\gamma L_{f_{2}}^{2}\right)\|d^{k}\|^{2}+\left(-\gamma+2\rho L_{\nabla f_{1}}\right)\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\\ &\qquad+L_{\nabla f_{1}}L_{f_{2}}\|d^{k}\|\|f_{2}(x^{k})-u_{2}^{k}\|\leq-\frac{\rho L_{\nabla F}}{2}\|d^{k}\|^{2}-\frac{\rho L_{\nabla f_{1}}}{2}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}\end{split}

Also, we have $(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}\leq H_{k}(\tilde{y}^{k})-H_{k}(y^{k})$ . Therefore, we can further bound $\mathbf{A}_{k}$ by

\begin{split}\mathbf{A}_{k}\leq&-\frac{\rho L_{\nabla F}}{2}\|d^{k}\|^{2}-\frac{\rho L_{\nabla f_{1}}}{2}\|g(x^{k})-u^{k}\|^{2}-\frac{\rho}{2L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\\ &+\bigg{(}2+\frac{(8\rho+1/\rho)L_{\nabla F}+12\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

(53)

Telescoping (52) together with (53), we get

\begin{split}&\rho\sum_{k=0}^{N}\tau_{k}\bigg{(}L_{\nabla F}\|d^{k}\|^{2}+L_{\nabla f_{1}}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+\frac{1}{L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 2W_{0}+2\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}4+\frac{2(8\rho+1/\rho)L_{\nabla F}+24\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

∎

Proof of Theorem 3, part (a).

The proof follows the same arguments in the proof of Theorem 2. Note that by Lemma 12 and given values of $\alpha,\gamma$ in (12), we obtain

\begin{split}&\rho\sum_{k=1}^{N}\tau_{k}\left[L_{\nabla F}\|d^{k}\|^{2}+L_{\nabla f_{1}}\|f_{2}(x^{k})-u_{2}^{k}\|^{2}+\frac{1}{L_{\nabla F}}\|\nabla F(x^{k})-z^{k}\|^{2}\right]\leq 2W_{\alpha,\gamma}(x^{0},z^{0},u^{0})\\ &\quad+2\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}4+\frac{2(8\rho+1/\rho)L_{\nabla F}+24\rho L_{\nabla f_{1}}L_{f_{2}}^{2}}{\beta}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

Noting that

\mathbb{E}[\mathbf{R}_{k}|\mathscr{F}_{k}]=\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\gamma\sigma_{G_{2}}^{2}+(\alpha+2L_{\eta})\hat{\sigma}_{J_{1}}^{2}\hat{\sigma}_{J_{2}}^{2}\right]:=\tau_{k}^{2}\hat{\sigma}^{2},

and taking expectation of both sides, we can complete the proof with the same arguments in the proof of Theorem 2. The constants $\mathcal{C}_{1}$ and $\mathcal{C}_{2}$ turn out to be

\begin{split}&\mathcal{C}_{1}=4\left(\frac{\beta^{2}}{\rho L_{\nabla F}}+\frac{L_{\nabla F}}{\rho}\right)\bigg{\{}W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\hat{\sigma}^{2}\\ &\qquad\qquad\qquad\qquad+4D_{\mathcal{X}}^{2}(1+\delta)\left[2\beta+(8\rho+\frac{1}{\rho})L_{\nabla F}+12\rho L_{\nabla f_{1}}L_{f_{2}}^{2})\right]\bigg{\}},\\ &\mathcal{C}_{2}=\frac{2}{\rho L_{\nabla f_{1}}}\bigg{\{}W_{\alpha,\gamma}(x^{0},z^{0},u^{0})+\hat{\sigma}^{2}\\ &\qquad\qquad\qquad\qquad+4D_{\mathcal{X}}^{2}(1+\delta)\left[2\beta+(8\rho+\frac{1}{\rho})L_{\nabla F}+12\rho L_{\nabla f_{1}}L_{f_{2}}^{2})\right]\bigg{\}}.\end{split}

(54)

∎

D.2 Proof of Theorem 3 for $T=1$

To show the rate of convergence for Algorithm 4, we leverage the following merit function:

W_{\alpha}(x^{k},z^{k},u^{k})=F(x^{k})-F^{\star}-\eta(x^{k},z^{k})+\alpha\|\nabla F(x^{k})-z^{k}\|^{2},

(55)

where $\alpha>0$ , $\eta(\cdot,\cdot)$ is defined in (10).

Lemma 13.

Let $\{x^{k},z^{k}\}_{k\geq 0}$ be the sequence generated by Algorithm 4 with $\beta_{k}\equiv\beta>0$ and the merit function $W_{\alpha}(\cdot,\cdot)$ be defined in (55) with $\alpha=\frac{\beta}{4L_{\nabla F}^{2}}$ . Under Assumptions 2 with $T=1$ , we have $\forall N\geq 0$

\begin{split}&\beta\sum_{k=0}^{N}\tau_{k}\bigg{(}\|d^{k}\|^{2}+\frac{1}{2L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 4W_{\alpha}(x^{0},u^{0})+4\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}12+\frac{16L_{\nabla F}^{2}}{\beta^{2}}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

where $d^{k}:=y^{k}-x^{k}$ , $H_{k}(\cdot),y^{k}$ are defined in (16), $\Delta^{k+1}:=\nabla F(x^{k})-J_{1}^{k+1}$ , and

\begin{split}\mathbf{R}_{k}:=&\tau_{k}^{2}\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+\alpha\|\Delta^{k+1}\|^{2}\right]+\frac{L_{\eta}}{2}\|z^{k+1}-z^{k}\|^{2}\\ &\qquad+\tau_{k}\langle d^{k},\Delta^{k+1}\rangle+\alpha r^{k+1},\\ r^{k+1}:=&2\tau_{k}\langle\Delta^{k+1},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle.\end{split}

(56)

Proof.

The proof is a essentially a simplified version of the proof of Lemma 12. Hence, we skip some arguments already presented earlier.

1. By the Lipschitzness of $\nabla F$ , we have

\begin{split}&F(x^{k+1})-F(x^{k})\leq\tau_{k}\langle\nabla F(x^{k}),d^{k}\rangle+\tau_{k}\|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|\\ &\qquad+\tau_{k}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|+\tau_{k}\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle+\frac{L_{\nabla F}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}}{2}.\\ \end{split}

(57)

2. Also, by the lipschitzness of $\nabla\eta$ (Lemma 7) and the optimality condition of in the definition of $y^{k}$ , we have

\begin{split}&\eta(x^{k},z^{k})-\eta(x^{k+1},z^{k+1})\leq-\beta\tau_{k}\|d^{k}\|^{2}+\tau_{k}\big{\langle}z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\big{\rangle}\\ &\qquad-\tau_{k}\big{\langle}d^{k},\nabla F(x^{k})\big{\rangle}+\tau_{k}\big{\langle}d^{k},\Delta^{k+1}\big{\rangle}+\frac{L_{\nabla\eta}}{2}\bigg{[}\tau_{k}^{2}\|\tilde{d}^{k}\|^{2}+\|z^{k+1}-z^{k}\|^{2}\bigg{]}.\end{split}

(58)

3. By the update rule, we have

\begin{split}&\|\nabla F(x^{k+1})-z^{k+1}\|^{2}=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})+\tau_{k}\Delta^{k+1}\|^{2}\\ &=\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{\tau_{k}}\|\nabla F(x^{k+1})-\nabla F(x^{k})\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+\tau_{k}L_{\nabla F}^{2}\|\tilde{d}^{k}\|^{2}+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ &\leq(1-\tau_{k})\|\nabla F(x^{k})-z^{k}\|^{2}+2\tau_{k}L_{\nabla F}^{2}(\|d^{k}\|^{2}+\|\tilde{y}^{k}-y^{k}\|^{2})+\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+r^{k+1}\\ \end{split}

(59)

where $r^{k+1}:=2\tau_{k}\langle\Delta^{k},(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\rangle$ .

4. By combing (57), (58) (59), rearranging the terms, and noting that $\langle z^{k}+\beta d^{k},\tilde{y}^{k}-y^{k}\rangle=H_{k}(\tilde{y}^{k})-H_{k}(y^{k})-(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}$ and $\|\tilde{d}^{k}\|\leq D_{\mathcal{X}}$ , we obtain

W_{k+1}-W_{k}\leq\tau_{k}\mathbf{A}_{k}+\mathbf{R}_{k}

(60)

where $\mathbf{R}_{k}$ is defined in (56) and

\begin{split}\mathbf{A}_{k}:=&\left(-\beta+2\alpha L_{\nabla F}^{2}\right)\|d^{k}\|^{2}-\alpha\|\nabla F(x^{k})-z^{k}\|^{2}+\left(\beta\|d^{k}\|+\|\nabla F(x^{k})-z^{k}\|\right)\|\tilde{y}^{k}-y^{k}\|\\ &+\left(2\alpha L_{\nabla F}^{2}-\beta\right)\|\tilde{y}^{k}-y^{k}\|^{2}+2\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right).\end{split}

We then provide a simplified upper bound for $\mathbf{A}_{k}$ . By the Young’s inequality, we have

\begin{split}\beta\|d^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\beta}{4}\|d^{k}\|^{2}+\beta\|\tilde{y}^{k}-y^{k}\|^{2},\\ \|\nabla F(x^{k})-z^{k}\|\|\tilde{y}^{k}-y^{k}\|&\leq\frac{\alpha}{2}\|\nabla F(x^{k})-z^{k}\|^{2}+\frac{1}{2\alpha}\|\tilde{y}^{k}-y^{k}\|^{2}.\end{split}

In addition, setting $\alpha=\frac{\beta}{4L_{\nabla F}^{2}}$ and noting $(\beta/2)\|\tilde{y}^{k}-y^{k}\|^{2}\leq H_{k}(\tilde{y}^{k})-H_{k}(y^{k})$ , we have

\begin{split}\mathbf{A}_{k}\leq&-\frac{\beta}{4}\|d^{k}\|^{2}-\frac{\beta}{8L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}+\left(3+\frac{4L_{\nabla F}^{2}}{\beta^{2}}\right)\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\\ \end{split}

(61)

Telescoping (60) together with (61), we get

\begin{split}&\beta\sum_{k=0}^{N}\tau_{k}\bigg{(}\|d^{k}\|^{2}+\frac{1}{2L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 4W_{\alpha}(x^{0},u^{0})+4\sum_{k=0}^{N}\mathbf{R}_{k}+\bigg{(}12+\frac{16L_{\nabla F}^{2}}{\beta^{2}}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)\end{split}

∎

Proof of Theorem 3, part (b).

Given Lemma 13, the proof follows the same arguments as in the proof of Theorem 2. The constant $\mathcal{C}_{3}$ turns out to be

\begin{split}\mathcal{C}_{3}=8\left(\beta+\frac{2L_{\nabla F}^{2}}{\beta}\right)\bigg{\{}W_{\alpha}(x^{0},u^{0})+D_{\mathcal{X}}^{2}\left[(1+\delta)\left(12\beta+\frac{16L_{\nabla F}^{2}}{\beta}\right)+\frac{L_{\nabla F}+L_{\nabla\eta}}{2}\right]\\ +\alpha\sigma_{J_{1}}^{2}+2L_{\eta}\hat{\sigma}_{J_{1}}^{2}\bigg{\}}\end{split}.

(62)

∎

Appendix E High-Probability Convergence for $T=1$

E.1 Preliminaries

We provide a short review of sub-gaussian and sub-exponential random variables for completeness.

Definition 14.

(Sub-gaussian and Sub-exponential)

(a)

A random variable $X$ is $K$ -sub-gaussian if there exists $K>0$ such that $\mathbb{E}[\exp(X^{2}/K^{2})]\leq 2$ . The sub-gaussian norm of $X$ , denoted $\|X\|_{\psi_{2}}$ , is defined to be the smallest $K$ . That is to say,

$\|X\|_{\psi_{2}}=\inf\left\{t>0:\mathbb{E}[\exp(X^{2}/t^{2})]\leq 2\right\}.$
(b)

A random variable $X$ is $K$ -sub-exponential if there exists $K>0$ such that $\mathbb{E}[\exp(|X|/K)]\leq 2$ . The sub-exponential norm of $X$ , denoted $\|X\|_{\psi_{1}}$ , is defined to be the smallest $K$ . That is to say,

$\|X\|_{\psi_{1}}=\inf\left\{t>0:\mathbb{E}[\exp(|X|/t)]\leq 2\right\}.$

The above characterization is based on the so-called orlicz norm of a random variable. There are equivalent definitions of sub-gaussian and sub-exponential random variables. We refer readers to Proposition 2.5.2 and Proposition 2.7.1 in [Ver18]. In particular, we will also use another definition of sub-gaussian random variables based on the moment generating function given below.

Lemma 15.

(Sub-gaussian M.G.F. [Ver18]) If a random variable $X$ is $K$ -sub-gaussian with $\mathbb{E}[X]=0$ , then $\mathbb{E}[\exp(\lambda X)]\leq\exp(c\lambda^{2}K^{2})~{}\forall\lambda\in\mathbb{R}$ , where $cx$ is a absolute constant.

In the high probability results we show for the special case with $T=1$ , we handle the tail probability for two terms involving the mean-zero noise with sub-gaussian norm, $\|\Delta^{k+1}\|^{2}$ and $\langle\Delta^{k+1},\Lambda^{k}\rangle$ , where $(\Delta^{k})$ and $(\Lambda^{k})$ are adapted to $(\mathscr{F}_{k})$ . Our proof leverages the following two lemmas to control the probability of these two terms being too large.

Lemma 16.

(Sub-exponential is sub-gaussian squared [Ver18]) A random variable $X$ is sub-gaussian if and only if $X^{2}$ is sub-exponential. Moreover, $\|X^{2}\|_{\psi_{1}}=\|X\|_{\psi_{2}}^{2}$ .

Lemma 17.

(Generalized Freedman-type Inequality [HLPR19]) Let $(\Omega,\mathscr{F},(\mathscr{F}_{i}),P)$ be a filtered probability space, $(X_{i})$ and $(K_{i})$ be adapted to $(\mathscr{F}_{i})$ , and $n\in\mathbb{N}$ . Suppose for all $i\in[n]$ , $K_{i-1}\geq 0$ , $\mathbb{E}[X_{i}|\mathscr{F}_{i-1}]=0$ , and $\mathbb{E}\left[\exp(\lambda X_{i})|\mathscr{F}_{i-1}\right]\leq\exp(\lambda^{2}K_{i}^{2})$ . Then for any $t,b\geq 0,a>0$ ,

\mathrm{Pr}\left(\underset{k\in[n]}{\bigcup}\left\{\sum_{i=1}^{k}X_{i}\geq t\text{ and }2\sum_{i=1}^{k}K_{i-1}^{2}\leq a\sum_{i=1}^{k}X_{i}+b\right\}\right)\leq\exp\left(-\frac{t}{4a+8b/t}\right).

(63)

E.2 Proof of Theorem 5

We start with presenting the lemma below which leverages inequalities in Appendix E to show a high-probability upper bound for terms involving in the previous analysis.

Lemma 18.

Under the conditions of Lemma 13 and Assumption 4, for any $\delta_{1},\delta_{2},\delta_{3},a>0$ , we have

(a)

with probability at least $1-\delta_{1}$ , $\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}\leq K^{2}\log(2/\delta_{1})\sum_{k=0}^{N}\tau_{k}^{2};$

(b)

with probability at least $1-\delta_{2}$ ,

\sum_{k=0}^{N}\tau_{k}^{2}\sum_{i=0}^{k-1}\alpha_{i,k}\|\Delta^{i+1}\|^{2}\leq K^{2}\log(2/\delta_{2})\sum_{k=0}^{N}\tau_{k}^{2},

where $\alpha_{i,k}>0$ and $\sum_{i=0}^{k-1}{\alpha_{i,k}}=1$ ;

(c)

with probability at least $1-\delta_{3}$ ,

\begin{split}&\sum_{k=0}^{N}\langle\Delta^{k+1},2\alpha\tau_{k}(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+2\alpha\tau_{k}(\nabla F(x^{k+1})-\nabla F(x^{k}))+\tau_{k}d^{k}\rangle\\ &\leq 4a\log(1/\delta_{3})+\frac{\beta^{2}K^{2}}{aL_{\nabla F}^{4}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\|\nabla F(x^{k})-z^{k}\right\|^{2}+\frac{K^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}(4+\frac{\beta^{2}\tau_{k}}{L_{\nabla F}^{2}})\left\|d^{k}\right\|^{2}.\end{split}

Proof of Lemma 18.

We first show (a). Using the law of total expectation, we have $\mathbb{E}\left[\exp\left(\frac{\|\tau_{k}\Delta^{k+1}\|^{2}}{\tau_{k}^{2}K^{2}}\right)\right]\leq 2$ , which implies that $\|\tau_{k}\Delta^{k+1}\|^{2}$ is $\tau_{k}^{2}K^{2}$ -sub-exponential. Thus, we have with probability at least $1-\delta_{1}$ ,

\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}\leq K^{2}\log(2/\delta_{1})\sum_{k=0}^{N}\tau_{k}^{2}.

(64)

We then show (b). Let $Z_{k}=\tau_{k}^{2}\left\{\sum_{i=0}^{k-1}\alpha_{i,k}\left\|\Delta^{i+1}\right\|^{2}\right\}\forall k\geq 0$ . Note that for all $k\geq 0$ , $\|\Delta^{k+1}\|^{2}$ is $K^{2}$ -sub-exponential, which further implies that the sub-exponential norm of $Z_{k}~{}(k>0)$ satisfies $\|Z_{k}\|_{\psi_{1}}\leq\tau_{k}^{2}K^{2}$ . Therefore, we have for any $\delta_{2}>0$ , with probability at least $1-\delta_{2}$ ,

\sum_{k=0}^{N}Z_{k}\leq K^{2}\log(2/\delta_{2})\sum_{k=0}^{N}\tau_{k}^{2}.

(65)

To prove (c), we apply Lemma 15 and Lemma 17 with

\begin{split}X_{i}&=\left\langle\Delta^{k+1},2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\rangle,\\ K_{i}&=\sqrt{c}K\left\|2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\|,\\ b&=0,t=4a\log(1/\delta_{3}).\end{split}

Noting that $\alpha=\frac{\beta}{4L_{\nabla F}^{2}}$ , we obtain that for all $a>0$ with probability at least $1-\delta_{3}$ , $\sum_{i=0}^{N}X_{i}\leq 4a\log(1/\delta_{3})$ and

	$\displaystyle\sum_{i=0}^{N}X_{i}$	$\displaystyle\leq\frac{2cK^{2}}{a}\sum_{k=0}^{N}\left\\|2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\\|^{2}$
		$\displaystyle\leq\frac{4cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}\left\{4\alpha^{2}\left\\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\\|^{2}+\left\\|d^{k}\right\\|^{2}\right\}$
		$\displaystyle\leq\frac{4cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}\left\{4\alpha^{2}(1-\tau_{k})\left\\|\nabla F(x^{k})-z^{k}\right\\|^{2}+(1+4\alpha^{2}L_{\nabla F}^{2}\tau_{k})\left\\|d^{k}\right\\|^{2}\right\}$
		$\displaystyle=\frac{c\beta^{2}K^{2}}{aL_{\nabla F}^{4}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\\|\nabla F(x^{k})-z^{k}\right\\|^{2}+\frac{cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}(4+\frac{\beta^{2}\tau_{k}}{L_{\nabla F}^{2}})\left\\|d^{k}\right\\|^{2},$

where the third inequality comes from the convexity of $\|\cdot\|^{2}$ and the Lipschitzness of $\nabla F$ . ∎

Provided with the above lemma and Lemma 13, we now present the complete proof of Theorem 5.

Proof of Theorem 5.

Given the update rule of $\{z^{k}\}$ and the fact that $\tau_{0}=1$ , we can obtain

z^{k}=\sum_{i=0}^{k-1}\alpha_{i,k}J_{1}^{i+1},\quad\alpha_{i,k}=\frac{\tau_{i}}{\Gamma_{i+1}}\Gamma_{k}~{}~{}1\leq i\leq k,\quad\sum_{i=0}^{k-1}\alpha_{i,k}=1~{}~{}k\geq 1

where $\Gamma_{k}=\Gamma_{1}\prod_{i=1}^{k-1}(1-\tau_{i})$ and $\Gamma_{1}=1$ . Thus,

\begin{split}\left\|z^{k+1}-z^{k}\right\|^{2}&=\tau_{k}^{2}\left\|J_{1}^{k+1}-z^{k}\right\|^{2}\leq 2\tau_{k}^{2}\left\{\left\|J_{1}^{k+1}\right\|^{2}+\left\|\sum_{i=0}^{k-1}\alpha_{i,k}J_{1}^{i+1}\right\|^{2}\right\}\\ &\leq 2\tau_{k}^{2}\left\{\left\|J_{1}^{k+1}\right\|^{2}+\sum_{i=0}^{k-1}\alpha_{i,k}\left\|J_{1}^{i+1}\right\|^{2}\right\}\\ &\leq 4\tau_{k}^{2}\left\{\left\|\Delta^{k+1}\right\|^{2}+\left\|\nabla F(x^{k})\right\|^{2}+\sum_{i=0}^{k-1}\alpha_{i,k}\left[\left\|\Delta^{i+1}\right\|^{2}+\|\nabla F(x^{i})\|^{2}\right]\right\}\\ &\leq 4\tau_{k}^{2}\left\{\left\|\Delta^{k+1}\right\|^{2}+\sum_{i=0}^{k-1}\alpha_{i,k}\left\|\Delta^{i+1}\right\|^{2}+2L_{F}^{2}\right\}\\ \end{split}

where the second inequality comes from the convexity of $\|\cdot\|^{2}$ . Therefore, we have

\sum_{k=0}^{N}\|z^{k+1}-z^{k}\|^{2}\leq 4\sum_{k=0}^{N}\tau_{k}^{2}\|\Delta^{k+1}\|^{2}+4\sum_{k=0}^{N}\tau_{k}^{2}\sum_{i=0}^{k-1}\alpha_{i,k}\|\Delta^{i+1}\|^{2}+8L_{F}^{2}\sum_{k=0}^{N}\tau_{k}^{2}

Applying Lemma 18 with $\delta_{1}=\delta_{2}=\delta_{3}=\delta/3$ and $a=\frac{16c\beta K^{2}}{L_{\nabla F}^{2}}$ together with Lemma 13, we have with probability at least $1-\delta$ ,

	$\displaystyle\sum_{k=0}^{N}\mathbf{R}_{k}=$	$\displaystyle\sum_{k=0}^{N}\langle\Delta^{k+1},2\alpha\tau_{k}(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+2\alpha\tau_{k}(\nabla F(x^{k+1})-\nabla F(x^{k}))+\tau_{k}d^{k}\rangle$
		$\displaystyle+(\alpha+2L_{\eta})\sum_{k=0}^{N}\tau_{k}^{2}\\|\Delta^{k+1}\\|^{2}+2L_{\eta}\sum_{k=0}^{N}\tau_{k}^{2}\sum_{i=0}^{k-1}\alpha_{i,k}\\|\Delta^{i+1}\\|^{2}$
		$\displaystyle+\left[\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+4L_{\eta}L_{F}^{2}\right]\sum_{k=0}^{N}\tau_{k}^{2}$
	$\displaystyle\leq$	$\displaystyle\frac{64\beta K^{2}}{L_{\nabla F}^{2}}\log(3/\delta)+\frac{\beta}{16L_{\nabla F}^{2}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\\|\nabla F(x^{k})-z^{k}\right\\|^{2}+(\frac{L_{\nabla F}^{2}}{4\beta}+\frac{\beta}{16})\sum_{k=0}^{N}\tau_{k}^{2}\left\\|d^{k}\right\\|^{2}$
		$\displaystyle+\left[(\frac{\beta}{4L_{\nabla F}^{2}}+4L_{\eta})K^{2}\log(6/\delta)+\frac{L_{\nabla F}+L_{\nabla\eta}}{2}D_{\mathcal{X}}^{2}+4L_{\eta}L_{F}^{2}\right]\sum_{k=0}^{N}\tau_{k}^{2}$

Thus, noting that $\|d^{k}\|^{2}\leq D_{\mathcal{X}}^{2}~{}\forall k\geq 0$ , we have with probability at least $1-\delta$ ,

\begin{split}&\beta\sum_{k=0}^{N}\tau_{k}\bigg{(}\|d^{k}\|^{2}+\frac{1}{4L_{\nabla F}^{2}}\|\nabla F(x^{k})-z^{k}\|^{2}\bigg{)}\\ &\leq 4W_{\alpha}(x^{0},u^{0})+\bigg{(}12+\frac{16L_{\nabla F}^{2}}{\beta^{2}}\bigg{)}\sum_{k=0}^{N}\tau_{k}\left(H_{k}(\tilde{y}^{k})-H_{k}(y^{k})\right)+\frac{256\beta K^{2}}{L_{\nabla F}^{2}}\log(3/\delta)\\ &+\left[\left(\frac{\beta}{L_{\nabla F}^{2}}+16L_{\eta}\right)K^{2}\log(6/\delta)+\left(\frac{L_{\nabla F}^{2}}{\beta}+\frac{\beta}{4}+2L_{\nabla F}+2L_{\nabla\eta}\right)D_{\mathcal{X}}^{2}+16L_{\eta}L_{F}^{2}\right]\sum_{k=0}^{N}\tau_{k}^{2}\end{split}

Following the same arguments as in the proof of Theorem 2, we have with probability at least $1-\delta$ ,

\min_{k=1,\dots,N}V(x^{k},z^{k})\leq\mathcal{O}\left(\frac{K^{2}\log(1/\delta)}{\sqrt{N}}\right)

∎

	$\displaystyle\sum_{i=0}^{N}X_{i}$	$\displaystyle\leq\frac{2cK^{2}}{a}\sum_{k=0}^{N}\left\\|2\alpha\tau_{k}\left\{(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\}+\tau_{k}d^{k}\right\\|^{2}$
		$\displaystyle\leq\frac{4cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}\left\{4\alpha^{2}\left\\|(1-\tau_{k})[\nabla F(x^{k})-z^{k}]+\nabla F(x^{k+1})-\nabla F(x^{k})\right\\|^{2}+\left\\|d^{k}\right\\|^{2}\right\}$
		$\displaystyle\leq\frac{4cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}\left\{4\alpha^{2}(1-\tau_{k})\left\\|\nabla F(x^{k})-z^{k}\right\\|^{2}+(1+4\alpha^{2}L_{\nabla F}^{2}\tau_{k})\left\\|d^{k}\right\\|^{2}\right\}$
		$\displaystyle=\frac{c\beta^{2}K^{2}}{aL_{\nabla F}^{4}}\sum_{k=0}^{N}\tau_{k}^{2}(1-\tau_{k})\left\\|\nabla F(x^{k})-z^{k}\right\\|^{2}+\frac{cK^{2}}{a}\sum_{k=0}^{N}\tau_{k}^{2}(4+\frac{\beta^{2}\tau_{k}}{L_{\nabla F}^{2}})\left\\|d^{k}\right\\|^{2},$

A Projection-free Algorithm for Constrained Stochastic Multi-level Composition Optimization

Abstract

1 Introduction

1.1 Preliminaries and Main Contributions

Assumption 1 (Constraint set).

Assumption 2 (Smoothness).

Definition 1.

Proposition 1.

1.2 Related Work

2 Methodology

3 Main Results

Assumption 3 (Stochastic First-Order Oracle).

Remark.

Proposition 2.

Proof.

Theorem 2.

Remark.

Remark.

3.1 The special cases of T=1T=1 and T=2T=2

Theorem 3.

Remark.

3.2 High-Probability Convergence for T=1T=1

Definition 4.

Assumption 4.

Theorem 5.

Remark.

4 Proof Sketch of Main Results

5 Discussion

Acknowledgment

References

Appendix A Motivating Examples

Appendix B Technical Lemmas

Lemma 6.

Lemma 7.

Lemma 8.

Proof of Lemma 8.

Appendix C Proof of Theorem 2

Lemma 9.

Proof.

Lemma 10.

Proof.

Lemma 11.

Proof.

Proposition 3.

Proof.

Proof of Theorem 2.

Appendix D Proofs for Section 3.1

D.1 Proof of Theorem 3 for T=2T=2

Lemma 12.

Proof of Lemma 12.

Proof of Theorem 3, part (a).

D.2 Proof of Theorem 3 for T=1T=1

Lemma 13.

Proof.

Proof of Theorem 3, part (b).

Appendix E High-Probability Convergence for T=1T=1

E.1 Preliminaries

Definition 14.

Lemma 15.

Lemma 16.

Lemma 17.

E.2 Proof of Theorem 5

Lemma 18.

Proof of Lemma 18.

Proof of Theorem 5.

3.1 The special cases of $T=1$ and $T=2$

3.2 High-Probability Convergence for $T=1$

D.1 Proof of Theorem 3 for $T=2$

D.2 Proof of Theorem 3 for $T=1$

Appendix E High-Probability Convergence for $T=1$