
Trust-Region Stochastic Optimization with Variance Reduction Technique

Xinshou Zheng Boston University

1 Introduction

We propose a novel algorithm, TR-SVR, for solving unconstrained stochastic optimization problems. This method builds on the trust-region framework, which effectively balances local and global exploration in optimization tasks. TR-SVR incorporates variance reduction techniques to improve both computational efficiency and stability when addressing stochastic objective functions. The algorithm applies a sequential quadratic programming (SQP) approach within the trust-region framework, solving each subproblem approximately using variance-reduced gradient estimators. This integration ensures a robust convergence mechanism while maintaining efficiency, making TR-SVR particularly suitable for large-scale stochastic optimization challenges.

Unlike traditional SQP methods typically designed for deterministic or constrained optimization problems, TR-SVR is specifically tailored to address unconstrained stochastic settings. This makes it highly applicable to large-scale machine learning and data-driven tasks, where efficiency and scalability are crucial.

The trust-region mechanism in TR-SVR dynamically adjusts the step size by defining a region where the quadratic approximation of the objective function is reliable. This ensures that the algorithm progresses steadily while avoiding excessively large or overly cautious updates. Simultaneously, variance reduction techniques inspired by methods like Stochastic Variance Reduced Gradient (SVRG) significantly reduce the noise in stochastic gradient estimates, thereby improving both the stability and accuracy of the optimization process.

By iteratively refining the solution and adaptively modifying the trust-region radius based on the quality of gradient estimates and the current solution, TR-SVR achieves faster convergence rates and enhanced robustness, even in noisy environments typical of stochastic optimization problems.

2 Literature Review

Stochastic optimization is a rapidly growing field due to its pivotal role in applications like machine learning, signal processing, and control systems. Among its foundational techniques is Stochastic Gradient Descent (SGD), introduced by Robbins and Monro (1951). SGD has gained popularity for its simplicity and scalability, particularly in large-scale optimization tasks. However, its high variance in gradient estimates often leads to slow convergence and instability. To mitigate this issue, advanced variance reduction techniques such as Stochastic Variance Reduced Gradient (SVRG) Johnson and Zhang (2013), Stochastic Average Gradient (SAGA) Defazio et al. (2014), and the Stochastic Recursive Gradient Algorithm (SARAH) Nguyen et al. (2017) have been developed. These methods refine the gradient estimates through mechanisms like reference points or control variates, significantly improving performance.

In parallel, trust-region methods have emerged as robust tools for handling non-convex optimization problems, ensuring global convergence through adaptive step-size control Conn et al. (2000). These methods dynamically adjust a region around the iterate within which a quadratic model of the objective is trusted, making them highly effective in deterministic settings, as detailed in foundational texts such as Nocedal and Wright's "Numerical Optimization" Nocedal and Wright (2006). Efforts to extend trust-region methods to stochastic domains have gained momentum. Curtis and Shi (2019) introduced a fully stochastic second-order trust-region method leveraging stochastic Hessian approximations. More recently, Fang et al. (2024) proposed a stochastic trust-region sequential quadratic programming (TR-SQP) method tailored for equality-constrained problems, offering robust theoretical guarantees and practical implementations. These developments highlight the growing synergy between stochastic optimization and trust-region frameworks, paving the way for tackling increasingly complex optimization challenges.

Sequential Quadratic Programming (SQP) methods are widely recognized for their effectiveness in solving constrained optimization problems Boggs and Tolle (1995). These methods operate by solving a sequence of quadratic subproblems that locally approximate the original nonlinear problem, gradually refining the solution at each iteration. While initially designed for deterministic optimization, recent advancements have extended SQP to stochastic settings. Notably, Berahas et al. (2022) proposed a stochastic SQP framework that integrates variance reduction techniques, significantly enhancing convergence rates for equality-constrained problems compared to traditional first-order approaches.

Despite these promising developments, existing stochastic SQP methods are predominantly tailored for constrained problems or require intricate adjustments to address challenges like non-convexity and noisy gradients. Our proposed TR-SVR algorithm addresses this gap by offering a novel framework for unconstrained stochastic optimization. Integrating trust-region principles with advanced variance reduction strategies, TR-SVR extends the applicability of stochastic SQP methods while preserving computational efficiency and delivering robust theoretical guarantees.

3 Algorithm

3.1 Problem Description

We consider the unconstrained stochastic optimization problem of the form:

\min_{\bm{x}\in\mathbb{R}^{d}}f(\bm{x})=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\bm{x}),

where $f_{i}(\bm{x})$ represents the individual stochastic component functions, and $f(\bm{x})$ is the overall objective. This formulation is common in large-scale machine learning applications, where the objective function is typically a sum of loss functions over a dataset. The main challenge in solving such problems arises from the stochastic nature of the gradients, which can introduce high variance and slow down convergence.
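As a concrete illustration of this finite-sum structure (with toy least-squares data that is not from the paper), the following sketch contrasts the exact finite-sum gradient with a noisy mini-batch estimate; averaging many mini-batch estimates recovers the full gradient, which is the unbiasedness property the algorithm relies on:

```python
import numpy as np

# Illustrative finite-sum objective (hypothetical data, not from the paper):
# f(x) = (1/N) sum_i f_i(x) with f_i(x) = 0.5*(a_i^T x - y_i)^2.
rng = np.random.default_rng(0)
N, d, b = 100, 3, 10
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def grad_component(i, x):
    # Gradient of the i-th component: a_i * (a_i^T x - y_i).
    return A[i] * (A[i] @ x - y[i])

def grad_full(x):
    # Exact gradient: average of all N component gradients.
    return np.mean([grad_component(i, x) for i in range(N)], axis=0)

def grad_minibatch(x):
    # Unbiased but noisy estimate from b << N randomly chosen components.
    I = rng.choice(N, size=b, replace=False)
    return np.mean([grad_component(i, x) for i in I], axis=0)

x = np.ones(d)
g = grad_full(x)
# Averaging many mini-batch estimates approaches the full gradient (unbiasedness).
est = np.mean([grad_minibatch(x) for _ in range(5000)], axis=0)
assert np.linalg.norm(est - g) < 0.1 * (1 + np.linalg.norm(g))
```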

Our goal is to develop an efficient algorithm that can handle large-scale problems by reducing the variance in gradient estimates while maintaining computational efficiency. To this end, we propose a novel algorithm, TR-SVR, which integrates variance reduction techniques into a trust-region-based sequential quadratic programming (SQP) framework. Unlike traditional methods that rely on full gradients or line-search techniques, our approach uses mini-batch gradient estimates and dynamically adjusts the trust-region radius to ensure robust performance in noisy environments.

The TR-SVR algorithm operates in two loops: an outer loop indexed by $k$, and an inner loop indexed by $s$. In each iteration, a quadratic approximation of the objective function is constructed using variance-reduced gradient estimates. A trust-region subproblem is then solved to update the current iterate. The trust-region radius is adjusted dynamically based on the quality of the solution at each step, ensuring that the algorithm takes appropriately sized steps to balance exploration and exploitation.

3.2 Algorithm Description

The TR-SVR algorithm is presented below. It combines trust-region principles with variance reduction techniques to solve unconstrained stochastic optimization problems efficiently.

Algorithm 1 TR-SVR Algorithm
1:Input: Initial iterate $x_{0}\in\mathbb{R}^{d}$, initial trust-region radius $\Delta_{0}>0$, batch size $b\in[1,N]$, maximum inner iterations $S>0$, parameters $\eta_{1},\eta_{2}>0$, and scaling factor $\alpha>0$.
2:for $k=0,1,2,\dots$ do
3:     Set $x_{k,0}=x_{k-1,S}$ (if $k>0$) or initialize $x_{k,0}$.
4:     Compute the full gradient at $x_{k,0}$:
$$g_{k,0}=\nabla f(x_{k,0})=\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(x_{k,0}).$$
5:     for $s=0$ to $S-1$ do
6:         Select a mini-batch $I_{k,s}\subset[N]$ of size $b$.
7:         Compute the mini-batch gradient estimate:
$$\widetilde{g}_{k,s}=\frac{1}{b}\sum_{i\in I_{k,s}}\nabla f_{i}(x_{k,s}).$$
8:         Compute the variance-reduced gradient:
$$\bar{g}_{k,s}=\widetilde{g}_{k,s}-(\widetilde{g}_{k,0}-g_{k,0}),$$
where $\widetilde{g}_{k,0}=\frac{1}{b}\sum_{i\in I_{k,s}}\nabla f_{i}(x_{k,0})$ is the mini-batch gradient at the reference point $x_{k,0}$, computed over the same mini-batch $I_{k,s}$, and $g_{k,0}$ is the full gradient at $x_{k,0}$.
9:         Solve the trust-region subproblem:
$$\min_{\Delta x}\;m(\Delta x)=\bar{g}_{k,s}^{T}\Delta x+\frac{1}{2}\Delta x^{T}H_{k,s}\Delta x\quad\text{s.t. }\|\Delta x\|_{2}\leq\Delta_{k,s},$$
where $H_{k,s}$ is an approximation of the Hessian matrix.
10:         Update the iterate:
$$x_{k,s+1}=x_{k,s}+(\Delta x)_{k,s}.$$
11:         Adjust the trust-region radius according to the magnitude of the variance-reduced gradient:
$$\Delta_{k,s+1}=\begin{cases}\eta_{1}\alpha\|\bar{g}_{k,s}\|,&\|\bar{g}_{k,s}\|<1/\eta_{1},\\ \alpha,&1/\eta_{1}\leq\|\bar{g}_{k,s}\|\leq 1/\eta_{2},\\ \eta_{2}\alpha\|\bar{g}_{k,s}\|,&\|\bar{g}_{k,s}\|>1/\eta_{2}.\end{cases}$$
12:     end for
13:     Update the outer-loop iterate: Set $x_{k+1,0}=x_{k,S}$.
14:end for

The TR-SVR algorithm iteratively refines both the solution and the trust-region radius using variance-reduced gradient estimates. The use of mini-batches ensures computational efficiency, particularly in large-scale settings, while the dynamic adjustment of the trust-region radius maintains stability, even in noisy environments. To further improve efficiency, the quadratic subproblems are solved approximately using Hessian approximations, thereby avoiding the need for expensive second-order computations.
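The loop structure above can be sketched in a few lines. The following is a minimal illustrative implementation on a toy least-squares problem; the data, the implicit identity Hessian approximation, and the boundary Cauchy step used to solve the subproblem approximately are all assumptions made for this sketch, not prescriptions from the paper:

```python
import numpy as np

# Minimal sketch of the TR-SVR loop on a hypothetical least-squares problem:
# f(x) = (1/N) sum_i 0.5*(a_i^T x - y_i)^2. Illustrative only.
rng = np.random.default_rng(1)
N, d = 50, 4
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def f(x):
    return 0.5 * np.mean((A @ x - y) ** 2)

def grad_i(i, x):
    return A[i] * (A[i] @ x - y[i])

def full_grad(x):
    return A.T @ (A @ x - y) / N

def radius(g_norm, alpha, eta1, eta2):
    # Trust-region radius rule from step 11 of Algorithm 1.
    if g_norm < 1.0 / eta1:
        return eta1 * alpha * g_norm
    if g_norm > 1.0 / eta2:
        return eta2 * alpha * g_norm
    return alpha

def tr_svr(x, K=30, S=10, b=8, alpha=0.1, eta1=2.0, eta2=0.5):
    for k in range(K):
        x0, g0 = x.copy(), full_grad(x)      # outer reference point, full gradient
        for s in range(S):
            I = rng.choice(N, size=b, replace=False)
            g_tilde = np.mean([grad_i(i, x) for i in I], axis=0)
            g_ref = np.mean([grad_i(i, x0) for i in I], axis=0)
            g_bar = g_tilde - (g_ref - g0)   # variance-reduced gradient
            gn = np.linalg.norm(g_bar)
            if gn == 0:
                continue
            delta = radius(gn, alpha, eta1, eta2)
            x = x - delta * g_bar / gn       # Cauchy step to the trust-region boundary
    return x

x_opt = tr_svr(np.zeros(d))
assert f(x_opt) < f(np.zeros(d))             # the objective decreased
```

Here the subproblem is "solved" by the cheapest admissible step (the boundary Cauchy point along $-\bar{g}_{k,s}$), which already satisfies the Cauchy decrease condition used in the analysis of Section 5.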

4 Assumptions

Assumption 4.1.

The objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ is twice continuously differentiable and bounded below by a scalar $f_{\inf}:=\inf_{x\in\mathbb{R}^{n}}f(x)>-\infty$. The gradient $\nabla f(x)$ is Lipschitz continuous with constant $L_{\nabla f}>0$, and the Hessian matrix $\nabla^{2}f(x)$ is uniformly bounded, i.e., $\|\nabla^{2}f(x)\|_{2}\leq L_{H}$ for all $x\in\mathbb{R}^{n}$. Additionally, the Hessian approximation $H_{k,s}$ satisfies $\|H_{k,s}\|\leq K_{H}$ for some constant $K_{H}>0$ and for all iterations $k,s$.

Assumption 4.2.

The gradient approximation $\widetilde{g}_{k,s}$ is an unbiased estimator of the true gradient of the objective function, i.e., $\mathbb{E}_{k,s}[\widetilde{g}_{k,s}]=g_{k,s}=\nabla f(x_{k,s})$, where the expectation is conditioned on the event that the algorithm has reached $x_{k,s}$. The variance of the stochastic gradient is bounded, i.e., $\mathbb{E}_{k,s}[\|\widetilde{g}_{k,s}-g_{k,s}\|^{2}]\leq\sigma_{g}^{2}$.

Assumption 4.3.

The variance-reduced gradient estimate satisfies a variance bound such that:

\mathbb{E}_{k,s}\left[\|\bar{g}_{k,s}-g_{k,s}\|^{2}\right]\leq\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2},

where bb is the mini-batch size and LL is a Lipschitz constant.

Assumption 4.4.

The iterates $x_{k,s}$ are contained within a compact convex set $\mathcal{X}\subseteq\mathbb{R}^{n}$. The objective function $f(x)$, its gradient $\nabla f(x)$, and its Hessian matrix are bounded over this set. Each component function $f_{i}(x)$ is continuously differentiable, and its gradient is Lipschitz continuous with constant $L>0$.

5 Convergence Analysis

We begin by analyzing the trust-region subproblem and establishing key properties of the variance-reduced gradient estimates.

5.1 Trust-Region Subproblem Properties

At each iterate xk,sx_{k,s}, the trust-region subproblem is formulated as:

\min_{\Delta x_{k,s}}m_{1}(\Delta x_{k,s})=\bar{g}_{k,s}^{T}\Delta x_{k,s}+\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}\quad\text{s.t. }\|\Delta x_{k,s}\|\leq\Delta_{k,s}

For this subproblem, we only require the Cauchy decrease condition rather than an exact solution:

m_{1}(\Delta x_{k,s})-m_{1}(0)\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2}
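This condition is attainable (a verification not spelled out in the text): evaluating the model at the boundary step along the negative variance-reduced gradient direction gives

```latex
\Delta x^{C}=-\Delta_{k,s}\,\frac{\bar{g}_{k,s}}{\|\bar{g}_{k,s}\|},\qquad
m_{1}(\Delta x^{C})-m_{1}(0)
=-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{\Delta_{k,s}^{2}}{2}\,\frac{\bar{g}_{k,s}^{T}H_{k,s}\bar{g}_{k,s}}{\|\bar{g}_{k,s}\|^{2}}
\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2},
```

so any approximate subproblem solution that is at least as good as this Cauchy-type step satisfies the condition.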

5.2 Variance Reduction Properties

Lemma 5.1 (Variance Bound).

Let $\bar{g}_{k,s}$ be computed as in Algorithm 1. Then for all $k\geq 0$ and $s\in\{0,\dots,S-1\}$, we have:

\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]\leq\mu_{k,s}

where $\mu_{k,s}=\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}$.

Proof.

Let us denote:

J_{k,s}=\frac{1}{b}\sum_{i\in I_{k,s}}\left(\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})\right).

By Assumption 4.2, we have that $\mathbb{E}_{k,s}[J_{k,s}]=g(x_{k,s})-g(x_{k,0})$.

Using the elementary inequality $\mathbb{E}[\|z-\mathbb{E}[z]\|^{2}]\leq\mathbb{E}[\|z\|^{2}]$ and the property that for independent mean-zero random vectors $z_{1},z_{2},\dots,z_{n}$:

\mathbb{E}\left[\|z_{1}+z_{2}+\dots+z_{n}\|^{2}\right]=\mathbb{E}\left[\|z_{1}\|^{2}\right]+\mathbb{E}\left[\|z_{2}\|^{2}\right]+\dots+\mathbb{E}\left[\|z_{n}\|^{2}\right],

we can derive:

\begin{aligned}
\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]&=\mathbb{E}_{k,s}\left[\|J_{k,s}+g(x_{k,0})-g(x_{k,s})\|^{2}\right]\\
&=\mathbb{E}_{k,s}\left[\|J_{k,s}-\mathbb{E}_{k,s}[J_{k,s}]\|^{2}\right]\\
&=\frac{1}{b^{2}}\,\mathbb{E}_{k,s}\Big[\Big\|\sum_{i\in I_{k,s}}\big(\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})-\mathbb{E}_{k,s}[J_{k,s}]\big)\Big\|^{2}\Big]\\
&=\frac{1}{b^{2}}\sum_{i\in I_{k,s}}\mathbb{E}_{k,s}\left[\|\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})-\mathbb{E}_{k,s}[J_{k,s}]\|^{2}\right]\\
&\leq\frac{1}{b^{2}}\sum_{i\in I_{k,s}}\mathbb{E}_{k,s}\left[\|\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})\|^{2}\right]\\
&\leq\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}.
\end{aligned}

The last inequality follows from Assumption 4.4, which ensures the Lipschitz continuity of the component gradients with constant LL. ∎
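As a quick numerical sanity check (with hypothetical quadratic components, not data from the paper), the variance-reduced estimator is exactly unbiased: averaging $\bar{g}$ over all possible single-component mini-batches reproduces the full gradient.

```python
import numpy as np

# Sketch: unbiasedness of the SVRG-style estimator in Algorithm 1.
# Hypothetical components f_i(x) = 0.5*||A_i x - c_i||^2, so
# grad f_i(x) = A_i^T (A_i x - c_i).
rng = np.random.default_rng(0)
N, d = 20, 5
A = rng.standard_normal((N, d, d))
c = rng.standard_normal((N, d))

def grad_i(i, x):
    return A[i].T @ (A[i] @ x - c[i])

def full_grad(x):
    return np.mean([grad_i(i, x) for i in range(N)], axis=0)

x0 = rng.standard_normal(d)              # reference point (start of outer loop)
x = x0 + 0.1 * rng.standard_normal(d)    # current inner iterate
g0 = full_grad(x0)                       # full gradient at the reference point

# For batch size b = 1, enumerate all N choices of the sampled component:
# g_bar = grad_i(x) - (grad_i(x0) - g0). The exact average over i is E[g_bar].
expected = np.mean([grad_i(i, x) - (grad_i(i, x0) - g0) for i in range(N)], axis=0)

assert np.allclose(expected, full_grad(x))  # E[g_bar] equals the true gradient
```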

5.3 One-Step Decrease Properties

Lemma 5.2 (One-Step Decrease).

For any iteration $(k,s)$, we have:

f(x_{k,s+1})-f(x_{k,s})\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2}+\|g(x_{k,s})-\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\Delta_{k,s}^{2}
Proof.

By Assumption 4.1, which ensures twice continuous differentiability and Lipschitz continuity of the gradient, we can write:

f(x_{k,s+1})\leq f(x_{k,s})+g(x_{k,s})^{T}\Delta x_{k,s}+\frac{1}{2}L_{\nabla f}\|\Delta x_{k,s}\|^{2}

Using the trust-region subproblem formulation and the Cauchy decrease condition:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&-\bar{g}_{k,s}^{T}\Delta x_{k,s}-\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}\\
&\leq g(x_{k,s})^{T}\Delta x_{k,s}+\frac{1}{2}L_{\nabla f}\|\Delta x_{k,s}\|^{2}-\bar{g}_{k,s}^{T}\Delta x_{k,s}-\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}\\
&=(g(x_{k,s})-\bar{g}_{k,s})^{T}\Delta x_{k,s}+\frac{1}{2}L_{\nabla f}\|\Delta x_{k,s}\|^{2}-\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}
\end{aligned}

Using the Cauchy-Schwarz inequality and the fact that Δxk,sΔk,s\|\Delta x_{k,s}\|\leq\Delta_{k,s}:

\begin{aligned}
&\leq\|g(x_{k,s})-\bar{g}_{k,s}\|\|\Delta x_{k,s}\|+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\|\Delta x_{k,s}\|^{2}\\
&\leq\|g(x_{k,s})-\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\Delta_{k,s}^{2}
\end{aligned}

Therefore, combining with the trust-region subproblem solution property:

f(x_{k,s+1})-f(x_{k,s})\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2}+\|g(x_{k,s})-\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\Delta_{k,s}^{2}

This result relies on Assumptions 4.1 (Lipschitz continuity), and 4.4 (boundedness of iterates). ∎

5.4 Expected Decrease Properties

Lemma 5.3 (Expected Decrease).

When $\alpha\leq\frac{1}{2(L_{\nabla f}+2K_{H})}$, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}
Proof.

Since $\Delta_{k,s}=\alpha\|\bar{g}_{k,s}\|$ (by the trust-region radius update rule), Lemma 5.2 gives:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&\leq-\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}\|H_{k,s}\|\alpha^{2}\|\bar{g}_{k,s}\|^{2}\\
&\quad+\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|\|\bar{g}_{k,s}\|+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\alpha^{2}\|\bar{g}_{k,s}\|^{2}
\end{aligned}

Using Assumption 4.1, which ensures $\|H_{k,s}\|\leq K_{H}$, we have:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&\leq-\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}K_{H}\alpha^{2}\|\bar{g}_{k,s}\|^{2}+\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|\|\bar{g}_{k,s}\|+\frac{1}{2}(L_{\nabla f}+K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}\\
&=-\alpha\|\bar{g}_{k,s}\|^{2}+\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|\|\bar{g}_{k,s}\|+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}
\end{aligned}

Using the inequality $ab\leq\frac{1}{2}a^{2}+\frac{1}{2}b^{2}$ with $a=\|g(x_{k,s})-\bar{g}_{k,s}\|$ and $b=\|\bar{g}_{k,s}\|$, we get:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&\leq-\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}+\frac{1}{2}\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}\\
&=-\frac{1}{2}\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}
\end{aligned}

Taking the expectation conditioned on $\mathcal{F}_{k,s}$, and since $\bar{g}_{k,s}$ is an unbiased estimator of $g(x_{k,s})$ (by Assumption 4.2), we have:

\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]=\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]+\|g(x_{k,s})\|^{2}
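For completeness, this identity is the standard bias-variance decomposition: expanding the square,

```latex
\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]
=\mathbb{E}_{k,s}\big[\|(\bar{g}_{k,s}-g(x_{k,s}))+g(x_{k,s})\|^{2}\big]
=\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]
+2\,\mathbb{E}_{k,s}[\bar{g}_{k,s}-g(x_{k,s})]^{T}g(x_{k,s})
+\|g(x_{k,s})\|^{2},
```

where the cross term vanishes because $\mathbb{E}_{k,s}[\bar{g}_{k,s}]=g(x_{k,s})$.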

Therefore:

\begin{aligned}
\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})&\leq-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]\\
&\quad+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}
\end{aligned}

This proof relies on Assumptions 4.1 (Lipschitz continuity), and 4.2 (unbiased gradient estimates). ∎

Lemma 5.4 (Expected Decrease Bound).

When $\alpha\leq\frac{1}{2(L_{\nabla f}+2K_{H})}$, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]
Proof.

Starting from Lemma 5.3, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}

Since $\alpha\leq\frac{1}{2(L_{\nabla f}+2K_{H})}$, we have:

\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}\leq\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}

Therefore:

-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}

Thus:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]

This proof relies on Assumptions 4.1 (Lipschitz continuity of gradient), and 4.2 (unbiased gradient estimates). The Lipschitz constant LfL_{\nabla f} comes from Assumption 4.1. ∎

Theorem 5.5 (Global Convergence).

Let $\{x_{k,s}\}$ be the sequence generated by Algorithm 1. Under Assumptions 4.1-4.4, for any $K\geq 0$, we have:

\mathbb{E}\left[\frac{1}{(K+1)S}\sum_{k=0}^{K}\sum_{s=0}^{S-1}\|g(x_{k,s})\|^{2}\right]\leq\frac{\mathbb{E}[f(x_{0,0})]-f_{\inf}}{(K+1)S\cdot\Lambda_{\min}}

where $\Lambda_{\min}=\min_{s\in[S]}\Lambda_{s}$ and the coefficients $\Lambda_{s}$, defined in the proof, are assumed to be positive (which holds for sufficiently small $\alpha$).

Proof.

From Lemma 5.4, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]

And from Lemma 5.1:

\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]\leq\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}

Therefore:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}

Notice that, for any $z>0$ (applying Young's inequality to the cross term and using $\|x_{k,s+1}-x_{k,s}\|\leq\Delta_{k,s}=\alpha\|\bar{g}_{k,s}\|$):

\begin{aligned}
\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,0}\|^{2}]&=\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,s}+x_{k,s}-x_{k,0}\|^{2}]\\
&=\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,s}\|^{2}]+2\,\mathbb{E}_{k,s}[(x_{k,s+1}-x_{k,s})^{T}(x_{k,s}-x_{k,0})]+\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&\leq\mathbb{E}_{k,s}[\Delta_{k,s}^{2}]+\frac{1}{\alpha z}\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,s}\|^{2}]+\alpha z\,\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]+\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&\leq\Big(1+\frac{1}{\alpha z}\Big)\mathbb{E}_{k,s}[\Delta_{k,s}^{2}]+(1+\alpha z)\,\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&=\Big(1+\frac{1}{\alpha z}\Big)\alpha^{2}\,\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]+(1+\alpha z)\,\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&\leq\Big(1+\frac{1}{\alpha z}\Big)\alpha^{2}\|g(x_{k,s})\|^{2}+\Big(1+\alpha z+\Big(\alpha^{2}+\frac{\alpha}{z}\Big)\frac{L^{2}}{b}\Big)\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]
\end{aligned}

where the last step uses $\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]=\|g(x_{k,s})\|^{2}+\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]$ together with Lemma 5.1.

Define $R_{k,s}=f(x_{k,s})+\lambda_{s}\|x_{k,s}-x_{k,0}\|^{2}$, where the coefficients $\lambda_{s}$ are defined by the backward recursion (with $\lambda_{S}=0$):

\lambda_{s}=\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\frac{L^{2}}{b}+\lambda_{s+1}\Big(1+\alpha z+\Big(\alpha^{2}+\frac{\alpha}{z}\Big)\frac{L^{2}}{b}\Big),

and the associated decrease coefficients are:

\Lambda_{s}=\frac{1}{4}\alpha-\lambda_{s+1}\Big(1+\frac{1}{\alpha z}\Big)\alpha^{2},

with $\Lambda_{\min}=\min_{s\in[S]}\Lambda_{s}$.

Then:

\mathbb{E}[R_{k,s+1}-R_{k,s}]\leq-\Lambda_{\min}\mathbb{E}[\|g(x_{k,s})\|^{2}]

Therefore:

\mathbb{E}[\|g(x_{k,s})\|^{2}]\leq\frac{\mathbb{E}[R_{k,s}]-\mathbb{E}[R_{k,s+1}]}{\Lambda_{\min}}

Summing over s=0,,S1s=0,\ldots,S-1:

\sum_{s=0}^{S-1}\mathbb{E}[\|g(x_{k,s})\|^{2}]\leq\frac{\mathbb{E}[R_{k,0}]-\mathbb{E}[R_{k,S}]}{\Lambda_{\min}}=\frac{\mathbb{E}[f(x_{k,0})-f(x_{k+1,0})]}{\Lambda_{\min}}

Summing over k=0,1,2,,Kk=0,1,2,\ldots,K:

\sum_{k=0}^{K}\sum_{s=0}^{S-1}\mathbb{E}[\|g(x_{k,s})\|^{2}]\leq\frac{\mathbb{E}[f(x_{0,0})]-f_{\inf}}{\Lambda_{\min}}

Finally, dividing both sides by (K+1)S(K+1)S:

\mathbb{E}\left[\frac{1}{(K+1)S}\sum_{k=0}^{K}\sum_{s=0}^{S-1}\|g(x_{k,s})\|^{2}\right]\leq\frac{\mathbb{E}[f(x_{0,0})]-f_{\inf}}{(K+1)S\cdot\Lambda_{\min}}

This proof relies on all Assumptions 4.1-4.4, particularly the boundedness of the objective function (4.1), the unbiased gradient estimates (4.2), the variance bound (4.3), and the bounded iterates (4.4). ∎

References

  • Berahas et al. (2022) A. S. Berahas, J. Shi, Z. Yi, and B. Zhou. Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction. arXiv preprint arXiv:2204.04161, 2022.
  • Boggs and Tolle (1995) P. T. Boggs and J. W. Tolle. Sequential quadratic programming. Acta Numerica, 4:1–51, 1995.
  • Conn et al. (2000) A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. SIAM, 2000.
  • Curtis and Shi (2019) F. E. Curtis and R. Shi. A fully stochastic second-order trust region method. INFORMS Journal on Optimization, 1(3):200–220, 2019.
  • Defazio et al. (2014) A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.
  • Fang et al. (2024) Y. Fang, S. Na, M. W. Mahoney, and M. Kolar. Fully stochastic trust-region sequential quadratic programming for equality-constrained optimization problems. arXiv preprint arXiv:2211.15943, 2024.
  • Johnson and Zhang (2013) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
  • Nguyen et al. (2017) L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. International Conference on Machine Learning, pages 2613–2621, 2017.
  • Nocedal and Wright (2006) J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.