
Trust-Region Stochastic Optimization with Variance Reduction Technique

Xinshou Zheng Boston University

1 Introduction

We propose a novel algorithm, TR-SVR, for solving unconstrained stochastic optimization problems. This method builds on the trust-region framework, which effectively balances local and global exploration in optimization tasks. TR-SVR incorporates variance reduction techniques to improve both computational efficiency and stability when addressing stochastic objective functions. The algorithm applies a sequential quadratic programming (SQP) approach within the trust-region framework, solving each subproblem approximately using variance-reduced gradient estimators. This integration ensures a robust convergence mechanism while maintaining efficiency, making TR-SVR particularly suitable for large-scale stochastic optimization challenges.

Unlike traditional SQP methods typically designed for deterministic or constrained optimization problems, TR-SVR is specifically tailored to address unconstrained stochastic settings. This makes it highly applicable to large-scale machine learning and data-driven tasks, where efficiency and scalability are crucial.

The trust-region mechanism in TR-SVR dynamically adjusts the step size by defining a region where the quadratic approximation of the objective function is reliable. This ensures that the algorithm progresses steadily while avoiding excessively large or overly cautious updates. Simultaneously, variance reduction techniques inspired by methods like Stochastic Variance Reduced Gradient (SVRG) significantly reduce the noise in stochastic gradient estimates, thereby improving both the stability and accuracy of the optimization process.

By iteratively refining the solution and adaptively modifying the trust-region radius based on the quality of gradient estimates and the current solution, TR-SVR achieves faster convergence rates and enhanced robustness, even in noisy environments typical of stochastic optimization problems.

2 Literature Review

Stochastic optimization is a rapidly growing field due to its pivotal role in applications like machine learning, signal processing, and control systems. Among its foundational techniques is Stochastic Gradient Descent (SGD), introduced by Robbins and Monro (1951). SGD has gained popularity for its simplicity and scalability, particularly in large-scale optimization tasks. However, its high variance in gradient estimates often leads to slow convergence and instability. To mitigate this issue, advanced variance reduction techniques such as Stochastic Variance Reduced Gradient (SVRG) Johnson and Zhang (2013), Stochastic Average Gradient (SAGA) Defazio et al. (2014), and the Stochastic Recursive Gradient Algorithm (SARAH) Nguyen et al. (2017) have been developed. These methods refine the gradient estimates through mechanisms like reference points or control variates, significantly improving performance.

In parallel, trust-region methods have emerged as robust tools for handling non-convex optimization problems, ensuring global convergence through adaptive step-size control Conn et al. (2000). These methods dynamically adjust a region around the iterate within which a quadratic model of the objective is trusted, making them highly effective in deterministic settings, as detailed in foundational texts such as Nocedal and Wright's "Numerical Optimization" Nocedal and Wright (2006). Efforts to extend trust-region methods to stochastic domains have gained momentum. Curtis and Shi (2019) introduced a fully stochastic second-order trust-region method leveraging stochastic Hessian approximations. More recently, Fang et al. (2024) proposed a stochastic trust-region sequential quadratic programming (TR-SQP) method tailored for equality-constrained problems, offering robust theoretical guarantees and practical implementations. These developments highlight the growing synergy between stochastic optimization and trust-region frameworks, paving the way for tackling increasingly complex optimization challenges.

Sequential Quadratic Programming (SQP) methods are widely recognized for their effectiveness in solving constrained optimization problems Boggs and Tolle (1995). These methods operate by solving a sequence of quadratic subproblems that locally approximate the original nonlinear problem, gradually refining the solution at each iteration. While initially designed for deterministic optimization, recent advancements have extended SQP to stochastic settings. Notably, Berahas et al. (2022) proposed a stochastic SQP framework that integrates variance reduction techniques, significantly enhancing convergence rates for equality-constrained problems compared to traditional first-order approaches.

Despite these promising developments, existing stochastic SQP methods are predominantly tailored for constrained problems or require intricate adjustments to address challenges like non-convexity and noisy gradients. Our proposed TR-SVR algorithm addresses this gap by offering a novel framework for unconstrained stochastic optimization. Integrating trust-region principles with advanced variance reduction strategies, TR-SVR extends the applicability of stochastic SQP methods while preserving computational efficiency and delivering robust theoretical guarantees.

3 Algorithm

3.1 Problem Description

We consider the unconstrained stochastic optimization problem of the form:

\min_{\bm{x}\in\mathbb{R}^{d}}f(\bm{x})=\frac{1}{N}\sum_{i=1}^{N}f_{i}(\bm{x}),

where $f_{i}(\bm{x})$ represents the individual stochastic component functions, and $f(\bm{x})$ is the overall objective. This formulation is common in large-scale machine learning applications, where the objective function is typically a sum of loss functions over a dataset. The main challenge in solving such problems arises from the stochastic nature of the gradients, which can introduce high variance and slow down convergence.
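As a concrete illustration of this finite-sum structure (with toy least-squares data that is not from the paper), the following sketch contrasts the exact finite-sum gradient with a noisy mini-batch estimate; averaging many mini-batch estimates recovers the full gradient, which is the unbiasedness property the algorithm relies on:

```python
import numpy as np

# Illustrative finite-sum objective (hypothetical data, not from the paper):
# f(x) = (1/N) sum_i f_i(x) with f_i(x) = 0.5*(a_i^T x - y_i)^2.
rng = np.random.default_rng(0)
N, d, b = 100, 3, 10
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def grad_component(i, x):
    # Gradient of the i-th component: a_i * (a_i^T x - y_i).
    return A[i] * (A[i] @ x - y[i])

def grad_full(x):
    # Exact gradient: average of all N component gradients.
    return np.mean([grad_component(i, x) for i in range(N)], axis=0)

def grad_minibatch(x):
    # Unbiased but noisy estimate from b << N randomly chosen components.
    I = rng.choice(N, size=b, replace=False)
    return np.mean([grad_component(i, x) for i in I], axis=0)

x = np.ones(d)
g = grad_full(x)
# Averaging many mini-batch estimates approaches the full gradient (unbiasedness).
est = np.mean([grad_minibatch(x) for _ in range(5000)], axis=0)
assert np.linalg.norm(est - g) < 0.1 * (1 + np.linalg.norm(g))
```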

Our goal is to develop an efficient algorithm that can handle large-scale problems by reducing the variance in gradient estimates while maintaining computational efficiency. To this end, we propose a novel algorithm, TR-SVR, which integrates variance reduction techniques into a trust-region-based sequential quadratic programming (SQP) framework. Unlike traditional methods that rely on full gradients or line-search techniques, our approach uses mini-batch gradient estimates and dynamically adjusts the trust-region radius to ensure robust performance in noisy environments.

The TR-SVR algorithm operates in two loops: an outer loop indexed by $k$, and an inner loop indexed by $s$. In each iteration, a quadratic approximation of the objective function is constructed using variance-reduced gradient estimates. A trust-region subproblem is then solved to update the current iterate. The trust-region radius is adjusted dynamically based on the quality of the solution at each step, ensuring that the algorithm takes appropriately sized steps to balance exploration and exploitation.

3.2 Algorithm Description

The TR-SVR algorithm is presented below. It combines trust-region principles with variance reduction techniques to solve unconstrained stochastic optimization problems efficiently.

Algorithm 1 TR-SVR Algorithm
1:Input: Initial iterate $x_{0}\in\mathbb{R}^{d}$, initial trust-region radius $\Delta_{0}>0$, batch size $b\in[1,N]$, maximum inner iterations $S>0$, parameters $\eta_{1},\eta_{2}>0$, and scaling factor $\alpha>0$.
2:for $k=0,1,2,\dots$ do
3:     Set $x_{k,0}=x_{k-1,S}$ (if $k>0$) or initialize $x_{k,0}$.
4:     Compute the full gradient at $x_{k,0}$:
$$g_{k,0}=\nabla f(x_{k,0})=\frac{1}{N}\sum_{i=1}^{N}\nabla f_{i}(x_{k,0}).$$
5:     for $s=0$ to $S-1$ do
6:         Select a mini-batch $I_{k,s}\subset[N]$ of size $b$.
7:         Compute the mini-batch gradient estimate:
$$\widetilde{g}_{k,s}=\frac{1}{b}\sum_{i\in I_{k,s}}\nabla f_{i}(x_{k,s}).$$
8:         Compute the variance-reduced gradient:
$$\bar{g}_{k,s}=\widetilde{g}_{k,s}-(\widetilde{g}_{k,0}-g_{k,0}),$$
where $\widetilde{g}_{k,0}=\frac{1}{b}\sum_{i\in I_{k,s}}\nabla f_{i}(x_{k,0})$ is the mini-batch gradient at the reference point $x_{k,0}$, computed over the same mini-batch $I_{k,s}$, and $g_{k,0}$ is the full gradient at $x_{k,0}$.
9:         Solve the trust-region subproblem:
$$\min_{\Delta x}\;m(\Delta x)=\bar{g}_{k,s}^{T}\Delta x+\frac{1}{2}\Delta x^{T}H_{k,s}\Delta x\quad\text{s.t. }\|\Delta x\|_{2}\leq\Delta_{k,s},$$
where $H_{k,s}$ is an approximation of the Hessian matrix.
10:         Update the iterate:
$$x_{k,s+1}=x_{k,s}+(\Delta x)_{k,s}.$$
11:         Adjust the trust-region radius according to the magnitude of the variance-reduced gradient:
$$\Delta_{k,s+1}=\begin{cases}\eta_{1}\alpha\|\bar{g}_{k,s}\|,&\|\bar{g}_{k,s}\|<1/\eta_{1},\\ \alpha,&1/\eta_{1}\leq\|\bar{g}_{k,s}\|\leq 1/\eta_{2},\\ \eta_{2}\alpha\|\bar{g}_{k,s}\|,&\|\bar{g}_{k,s}\|>1/\eta_{2}.\end{cases}$$
12:     end for
13:     Update the outer-loop iterate: Set $x_{k+1,0}=x_{k,S}$.
14:end for

The TR-SVR algorithm iteratively refines both the solution and the trust-region radius using variance-reduced gradient estimates. The use of mini-batches ensures computational efficiency, particularly in large-scale settings, while the dynamic adjustment of the trust-region radius maintains stability, even in noisy environments. To further improve efficiency, the quadratic subproblems are solved approximately using Hessian approximations, thereby avoiding the need for expensive second-order computations.
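The loop structure above can be sketched in a few lines. The following is a minimal illustrative implementation on a toy least-squares problem; the data, the implicit identity Hessian approximation, and the boundary Cauchy step used to solve the subproblem approximately are all assumptions made for this sketch, not prescriptions from the paper:

```python
import numpy as np

# Minimal sketch of the TR-SVR loop on a hypothetical least-squares problem:
# f(x) = (1/N) sum_i 0.5*(a_i^T x - y_i)^2. Illustrative only.
rng = np.random.default_rng(1)
N, d = 50, 4
A = rng.standard_normal((N, d))
y = rng.standard_normal(N)

def f(x):
    return 0.5 * np.mean((A @ x - y) ** 2)

def grad_i(i, x):
    return A[i] * (A[i] @ x - y[i])

def full_grad(x):
    return A.T @ (A @ x - y) / N

def radius(g_norm, alpha, eta1, eta2):
    # Trust-region radius rule from step 11 of Algorithm 1.
    if g_norm < 1.0 / eta1:
        return eta1 * alpha * g_norm
    if g_norm > 1.0 / eta2:
        return eta2 * alpha * g_norm
    return alpha

def tr_svr(x, K=30, S=10, b=8, alpha=0.1, eta1=2.0, eta2=0.5):
    for k in range(K):
        x0, g0 = x.copy(), full_grad(x)      # outer reference point, full gradient
        for s in range(S):
            I = rng.choice(N, size=b, replace=False)
            g_tilde = np.mean([grad_i(i, x) for i in I], axis=0)
            g_ref = np.mean([grad_i(i, x0) for i in I], axis=0)
            g_bar = g_tilde - (g_ref - g0)   # variance-reduced gradient
            gn = np.linalg.norm(g_bar)
            if gn == 0:
                continue
            delta = radius(gn, alpha, eta1, eta2)
            x = x - delta * g_bar / gn       # Cauchy step to the trust-region boundary
    return x

x_opt = tr_svr(np.zeros(d))
assert f(x_opt) < f(np.zeros(d))             # the objective decreased
```

Here the subproblem is "solved" by the cheapest admissible step (the boundary Cauchy point along $-\bar{g}_{k,s}$), which already satisfies the Cauchy decrease condition used in the analysis of Section 5.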

4 Assumptions

Assumption 4.1.

The objective function $f:\mathbb{R}^{n}\to\mathbb{R}$ is twice continuously differentiable and bounded below by a scalar $f_{\inf}:=\inf_{x\in\mathbb{R}^{n}}f(x)>-\infty$. The gradient $\nabla f(x)$ is Lipschitz continuous with constant $L_{\nabla f}>0$, and the Hessian matrix $\nabla^{2}f(x)$ is uniformly bounded, i.e., $\|\nabla^{2}f(x)\|_{2}\leq L_{H}$ for all $x\in\mathbb{R}^{n}$. Additionally, the Hessian approximation $H_{k,s}$ satisfies $\|H_{k,s}\|\leq K_{H}$ for some constant $K_{H}>0$ and for all iterations $k,s$.

Assumption 4.2.

The gradient approximation $\widetilde{g}_{k,s}$ is an unbiased estimator of the true gradient of the objective function, i.e., $\mathbb{E}_{k,s}[\widetilde{g}_{k,s}]=g_{k,s}=\nabla f(x_{k,s})$, where the expectation is conditioned on the event that the algorithm has reached $x_{k,s}$. The variance of the stochastic gradient is bounded, i.e., $\mathbb{E}_{k,s}[\|\widetilde{g}_{k,s}-g_{k,s}\|^{2}]\leq\sigma_{g}^{2}$.

Assumption 4.3.

The variance-reduced gradient estimate satisfies a variance bound such that:

\mathbb{E}_{k,s}\left[\|\bar{g}_{k,s}-g_{k,s}\|^{2}\right]\leq\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2},

where bb is the mini-batch size and LL is a Lipschitz constant.

Assumption 4.4.

The iterates $x_{k,s}$ are contained within a compact convex set $\mathcal{X}\subseteq\mathbb{R}^{n}$. The objective function $f(x)$, its gradient $\nabla f(x)$, and its Hessian matrix are bounded over this set. Each component function $f_{i}(x)$ is continuously differentiable, and its gradient is Lipschitz continuous with constant $L>0$.

5 Convergence Analysis

We begin by analyzing the trust-region subproblem and establishing key properties of the variance-reduced gradient estimates.

5.1 Trust-Region Subproblem Properties

At each iterate xk,sx_{k,s}, the trust-region subproblem is formulated as:

\min_{\Delta x_{k,s}}m_{1}(\Delta x_{k,s})=\bar{g}_{k,s}^{T}\Delta x_{k,s}+\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}\quad\text{s.t. }\|\Delta x_{k,s}\|\leq\Delta_{k,s}

For this subproblem, we only require the Cauchy decrease condition rather than an exact solution:

m_{1}(\Delta x_{k,s})-m_{1}(0)\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2}
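This condition is attainable (a verification not spelled out in the text): evaluating the model at the boundary step along the negative variance-reduced gradient direction gives

```latex
\Delta x^{C}=-\Delta_{k,s}\,\frac{\bar{g}_{k,s}}{\|\bar{g}_{k,s}\|},\qquad
m_{1}(\Delta x^{C})-m_{1}(0)
=-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{\Delta_{k,s}^{2}}{2}\,\frac{\bar{g}_{k,s}^{T}H_{k,s}\bar{g}_{k,s}}{\|\bar{g}_{k,s}\|^{2}}
\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2},
```

so any approximate subproblem solution that is at least as good as this Cauchy-type step satisfies the condition.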

5.2 Variance Reduction Properties

Lemma 5.1 (Variance Bound).

Let $\bar{g}_{k,s}$ be computed as in Algorithm 1. Then for all $k\geq 0$ and $s\in\{0,\dots,S-1\}$, we have:

\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]\leq\mu_{k,s}

where $\mu_{k,s}=\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}$.

Proof.

Let us denote:

J_{k,s}=\frac{1}{b}\sum_{i\in I_{k,s}}\left(\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})\right).

By Assumption 4.2, we have that $\mathbb{E}_{k,s}[J_{k,s}]=g(x_{k,s})-g(x_{k,0})$.

Using the elementary inequality $\mathbb{E}[\|z-\mathbb{E}[z]\|^{2}]\leq\mathbb{E}[\|z\|^{2}]$ and the property that for independent mean-zero random vectors $z_{1},z_{2},\dots,z_{n}$:

\mathbb{E}\left[\|z_{1}+z_{2}+\dots+z_{n}\|^{2}\right]=\mathbb{E}\left[\|z_{1}\|^{2}\right]+\mathbb{E}\left[\|z_{2}\|^{2}\right]+\dots+\mathbb{E}\left[\|z_{n}\|^{2}\right],

we can derive:

\begin{aligned}
\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]&=\mathbb{E}_{k,s}\left[\|J_{k,s}+g(x_{k,0})-g(x_{k,s})\|^{2}\right]\\
&=\mathbb{E}_{k,s}\left[\|J_{k,s}-\mathbb{E}_{k,s}[J_{k,s}]\|^{2}\right]\\
&=\frac{1}{b^{2}}\,\mathbb{E}_{k,s}\Big[\Big\|\sum_{i\in I_{k,s}}\big(\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})-\mathbb{E}_{k,s}[J_{k,s}]\big)\Big\|^{2}\Big]\\
&=\frac{1}{b^{2}}\sum_{i\in I_{k,s}}\mathbb{E}_{k,s}\left[\|\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})-\mathbb{E}_{k,s}[J_{k,s}]\|^{2}\right]\\
&\leq\frac{1}{b^{2}}\sum_{i\in I_{k,s}}\mathbb{E}_{k,s}\left[\|\nabla f_{i}(x_{k,s})-\nabla f_{i}(x_{k,0})\|^{2}\right]\\
&\leq\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}.
\end{aligned}

The last inequality follows from Assumption 4.4, which ensures the Lipschitz continuity of the component gradients with constant LL. ∎
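As a quick numerical sanity check (with hypothetical quadratic components, not data from the paper), the variance-reduced estimator is exactly unbiased: averaging $\bar{g}$ over all possible single-component mini-batches reproduces the full gradient.

```python
import numpy as np

# Sketch: unbiasedness of the SVRG-style estimator in Algorithm 1.
# Hypothetical components f_i(x) = 0.5*||A_i x - c_i||^2, so
# grad f_i(x) = A_i^T (A_i x - c_i).
rng = np.random.default_rng(0)
N, d = 20, 5
A = rng.standard_normal((N, d, d))
c = rng.standard_normal((N, d))

def grad_i(i, x):
    return A[i].T @ (A[i] @ x - c[i])

def full_grad(x):
    return np.mean([grad_i(i, x) for i in range(N)], axis=0)

x0 = rng.standard_normal(d)              # reference point (start of outer loop)
x = x0 + 0.1 * rng.standard_normal(d)    # current inner iterate
g0 = full_grad(x0)                       # full gradient at the reference point

# For batch size b = 1, enumerate all N choices of the sampled component:
# g_bar = grad_i(x) - (grad_i(x0) - g0). The exact average over i is E[g_bar].
expected = np.mean([grad_i(i, x) - (grad_i(i, x0) - g0) for i in range(N)], axis=0)

assert np.allclose(expected, full_grad(x))  # E[g_bar] equals the true gradient
```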

5.3 One-Step Decrease Properties

Lemma 5.2 (One-Step Decrease).

For any iteration $(k,s)$, we have:

f(x_{k,s+1})-f(x_{k,s})\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2}+\|g(x_{k,s})-\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\Delta_{k,s}^{2}
Proof.

By Assumption 4.1, which ensures twice continuous differentiability and Lipschitz continuity of the gradient, we can write:

f(x_{k,s+1})\leq f(x_{k,s})+g(x_{k,s})^{T}\Delta x_{k,s}+\frac{1}{2}L_{\nabla f}\|\Delta x_{k,s}\|^{2}

Using the trust-region subproblem formulation and the Cauchy decrease condition:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&-\bar{g}_{k,s}^{T}\Delta x_{k,s}-\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}\\
&\leq g(x_{k,s})^{T}\Delta x_{k,s}+\frac{1}{2}L_{\nabla f}\|\Delta x_{k,s}\|^{2}-\bar{g}_{k,s}^{T}\Delta x_{k,s}-\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}\\
&=(g(x_{k,s})-\bar{g}_{k,s})^{T}\Delta x_{k,s}+\frac{1}{2}L_{\nabla f}\|\Delta x_{k,s}\|^{2}-\frac{1}{2}\Delta x_{k,s}^{T}H_{k,s}\Delta x_{k,s}
\end{aligned}

Using the Cauchy-Schwarz inequality and the fact that Δxk,sΔk,s\|\Delta x_{k,s}\|\leq\Delta_{k,s}:

\begin{aligned}
&\leq\|g(x_{k,s})-\bar{g}_{k,s}\|\|\Delta x_{k,s}\|+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\|\Delta x_{k,s}\|^{2}\\
&\leq\|g(x_{k,s})-\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\Delta_{k,s}^{2}
\end{aligned}

Therefore, combining with the trust-region subproblem solution property:

f(x_{k,s+1})-f(x_{k,s})\leq-\|\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}\|H_{k,s}\|\Delta_{k,s}^{2}+\|g(x_{k,s})-\bar{g}_{k,s}\|\Delta_{k,s}+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\Delta_{k,s}^{2}

This result relies on Assumptions 4.1 (Lipschitz continuity), and 4.4 (boundedness of iterates). ∎

5.4 Expected Decrease Properties

Lemma 5.3 (Expected Decrease).

When $\alpha\leq\frac{1}{2(L_{\nabla f}+2K_{H})}$, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}
Proof.

Since $\Delta_{k,s}=\alpha\|\bar{g}_{k,s}\|$ (by the trust-region radius update rule), Lemma 5.2 gives:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&\leq-\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}\|H_{k,s}\|\alpha^{2}\|\bar{g}_{k,s}\|^{2}\\
&\quad+\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|\|\bar{g}_{k,s}\|+\frac{1}{2}(L_{\nabla f}+\|H_{k,s}\|)\alpha^{2}\|\bar{g}_{k,s}\|^{2}
\end{aligned}

Using Assumption 4.1, which ensures $\|H_{k,s}\|\leq K_{H}$, we have:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&\leq-\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}K_{H}\alpha^{2}\|\bar{g}_{k,s}\|^{2}+\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|\|\bar{g}_{k,s}\|+\frac{1}{2}(L_{\nabla f}+K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}\\
&=-\alpha\|\bar{g}_{k,s}\|^{2}+\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|\|\bar{g}_{k,s}\|+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}
\end{aligned}

Using the inequality $ab\leq\frac{1}{2}a^{2}+\frac{1}{2}b^{2}$ with $a=\|g(x_{k,s})-\bar{g}_{k,s}\|$ and $b=\|\bar{g}_{k,s}\|$, we get:

\begin{aligned}
f(x_{k,s+1})-f(x_{k,s})&\leq-\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}+\frac{1}{2}\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}\\
&=-\frac{1}{2}\alpha\|\bar{g}_{k,s}\|^{2}+\frac{1}{2}\alpha\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|\bar{g}_{k,s}\|^{2}
\end{aligned}

Taking the expectation conditioned on $\mathcal{F}_{k,s}$, and since $\bar{g}_{k,s}$ is an unbiased estimator of $g(x_{k,s})$ (by Assumption 4.2), we have:

\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]=\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]+\|g(x_{k,s})\|^{2}
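For completeness, this identity is the standard bias-variance decomposition: expanding the square,

```latex
\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]
=\mathbb{E}_{k,s}\big[\|(\bar{g}_{k,s}-g(x_{k,s}))+g(x_{k,s})\|^{2}\big]
=\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]
+2\,\mathbb{E}_{k,s}[\bar{g}_{k,s}-g(x_{k,s})]^{T}g(x_{k,s})
+\|g(x_{k,s})\|^{2},
```

where the cross term vanishes because $\mathbb{E}_{k,s}[\bar{g}_{k,s}]=g(x_{k,s})$.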

Therefore:

\begin{aligned}
\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})&\leq-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]\\
&\quad+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}
\end{aligned}

This proof relies on Assumptions 4.1 (Lipschitz continuity), and 4.2 (unbiased gradient estimates). ∎

Lemma 5.4 (Expected Decrease Bound).

When $\alpha\leq\frac{1}{2(L_{\nabla f}+2K_{H})}$, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]
Proof.

Starting from Lemma 5.3, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}

Since $\alpha\leq\frac{1}{2(L_{\nabla f}+2K_{H})}$, we have:

\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}\leq\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}

Therefore:

-\frac{1}{2}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\|g(x_{k,s})\|^{2}\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}

Thus:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]

This proof relies on Assumptions 4.1 (Lipschitz continuity of gradient), and 4.2 (unbiased gradient estimates). The Lipschitz constant LfL_{\nabla f} comes from Assumption 4.1. ∎

Theorem 5.5 (Global Convergence).

Let $\{x_{k,s}\}$ be the sequence generated by Algorithm 1. Under Assumptions 4.1-4.4, for any $K\geq 0$, we have:

\mathbb{E}\left[\frac{1}{(K+1)S}\sum_{k=0}^{K}\sum_{s=0}^{S-1}\|g(x_{k,s})\|^{2}\right]\leq\frac{\mathbb{E}[f(x_{0,0})]-f_{\inf}}{(K+1)S\cdot\Lambda_{\min}}

where $\Lambda_{\min}=\min_{s\in[S]}\Lambda_{s}$ and the coefficients $\Lambda_{s}$, defined in the proof, are assumed to be positive (which holds for sufficiently small $\alpha$).

Proof.

From Lemma 5.4, we have:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\mathbb{E}_{k,s}[\|g(x_{k,s})-\bar{g}_{k,s}\|^{2}]

And from Lemma 5.1:

\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]\leq\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}

Therefore:

\mathbb{E}_{k,s}[f(x_{k,s+1})]-f(x_{k,s})\leq-\frac{1}{4}\alpha\|g(x_{k,s})\|^{2}+\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\frac{L^{2}}{b}\|x_{k,s}-x_{k,0}\|^{2}

Notice that, for any $z>0$ (applying Young's inequality to the cross term and using $\|x_{k,s+1}-x_{k,s}\|\leq\Delta_{k,s}=\alpha\|\bar{g}_{k,s}\|$):

\begin{aligned}
\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,0}\|^{2}]&=\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,s}+x_{k,s}-x_{k,0}\|^{2}]\\
&=\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,s}\|^{2}]+2\,\mathbb{E}_{k,s}[(x_{k,s+1}-x_{k,s})^{T}(x_{k,s}-x_{k,0})]+\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&\leq\mathbb{E}_{k,s}[\Delta_{k,s}^{2}]+\frac{1}{\alpha z}\mathbb{E}_{k,s}[\|x_{k,s+1}-x_{k,s}\|^{2}]+\alpha z\,\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]+\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&\leq\Big(1+\frac{1}{\alpha z}\Big)\mathbb{E}_{k,s}[\Delta_{k,s}^{2}]+(1+\alpha z)\,\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&=\Big(1+\frac{1}{\alpha z}\Big)\alpha^{2}\,\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]+(1+\alpha z)\,\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]\\
&\leq\Big(1+\frac{1}{\alpha z}\Big)\alpha^{2}\|g(x_{k,s})\|^{2}+\Big(1+\alpha z+\Big(\alpha^{2}+\frac{\alpha}{z}\Big)\frac{L^{2}}{b}\Big)\mathbb{E}_{k,s}[\|x_{k,s}-x_{k,0}\|^{2}]
\end{aligned}

where the last step uses $\mathbb{E}_{k,s}[\|\bar{g}_{k,s}\|^{2}]=\|g(x_{k,s})\|^{2}+\mathbb{E}_{k,s}[\|\bar{g}_{k,s}-g(x_{k,s})\|^{2}]$ together with Lemma 5.1.

Define $R_{k,s}=f(x_{k,s})+\lambda_{s}\|x_{k,s}-x_{k,0}\|^{2}$, where the coefficients $\lambda_{s}$ are defined by the backward recursion (with $\lambda_{S}=0$):

\lambda_{s}=\frac{1}{2}(L_{\nabla f}+2K_{H})\alpha^{2}\frac{L^{2}}{b}+\lambda_{s+1}\Big(1+\alpha z+\Big(\alpha^{2}+\frac{\alpha}{z}\Big)\frac{L^{2}}{b}\Big),

and the associated decrease coefficients are:

\Lambda_{s}=\frac{1}{4}\alpha-\lambda_{s+1}\Big(1+\frac{1}{\alpha z}\Big)\alpha^{2},

with $\Lambda_{\min}=\min_{s\in[S]}\Lambda_{s}$.

Then:

\mathbb{E}[R_{k,s+1}-R_{k,s}]\leq-\Lambda_{\min}\mathbb{E}[\|g(x_{k,s})\|^{2}]

Therefore:

\mathbb{E}[\|g(x_{k,s})\|^{2}]\leq\frac{\mathbb{E}[R_{k,s}]-\mathbb{E}[R_{k,s+1}]}{\Lambda_{\min}}

Summing over s=0,,S1s=0,\ldots,S-1:

\sum_{s=0}^{S-1}\mathbb{E}[\|g(x_{k,s})\|^{2}]\leq\frac{\mathbb{E}[R_{k,0}]-\mathbb{E}[R_{k,S}]}{\Lambda_{\min}}=\frac{\mathbb{E}[f(x_{k,0})-f(x_{k+1,0})]}{\Lambda_{\min}}

Summing over k=0,1,2,,Kk=0,1,2,\ldots,K:

\sum_{k=0}^{K}\sum_{s=0}^{S-1}\mathbb{E}[\|g(x_{k,s})\|^{2}]\leq\frac{\mathbb{E}[f(x_{0,0})]-f_{\inf}}{\Lambda_{\min}}

Finally, dividing both sides by (K+1)S(K+1)S:

\mathbb{E}\left[\frac{1}{(K+1)S}\sum_{k=0}^{K}\sum_{s=0}^{S-1}\|g(x_{k,s})\|^{2}\right]\leq\frac{\mathbb{E}[f(x_{0,0})]-f_{\inf}}{(K+1)S\cdot\Lambda_{\min}}

This proof relies on all Assumptions 4.1-4.4, particularly the boundedness of the objective function (4.1), the unbiased gradient estimates (4.2), the variance bound (4.3), and the bounded iterates (4.4). ∎

References

  • Berahas et al. (2022) A. S. Berahas, J. Shi, Z. Yi, and B. Zhou. Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction. arXiv preprint arXiv:2204.04161, 2022.
  • Boggs and Tolle (1995) P. T. Boggs and J. W. Tolle. Sequential quadratic programming. Acta Numerica, 4:1–51, 1995.
  • Conn et al. (2000) A. R. Conn, N. I. M. Gould, and P. L. Toint. Trust Region Methods. SIAM, 2000.
  • Curtis and Shi (2019) F. E. Curtis and R. Shi. A fully stochastic second-order trust region method. INFORMS Journal on Optimization, 1(3):200–220, 2019.
  • Defazio et al. (2014) A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.
  • Fang et al. (2024) Y. Fang, S. Na, M. W. Mahoney, and M. Kolar. Fully stochastic trust-region sequential quadratic programming for equality-constrained optimization problems. arXiv preprint arXiv:2211.15943, 2024.
  • Johnson and Zhang (2013) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
  • Nguyen et al. (2017) L. M. Nguyen, J. Liu, K. Scheinberg, and M. Takáč. SARAH: A novel method for machine learning problems using stochastic recursive gradient. International Conference on Machine Learning, pages 2613–2621, 2017.
  • Nocedal and Wright (2006) J. Nocedal and S. Wright. Numerical optimization. Springer Science & Business Media, 2006.
  • Robbins and Monro (1951) H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.