This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

11affiliationtext: Department of Industrial and Systems Engineering, Lehigh University

Single-Loop Deterministic and Stochastic Interior-Point Algorithms for Nonlinearly Constrained Optimization

Frank E. Curtis E-mail: [email protected] Xin Jiang E-mail: [email protected] Qi Wang E-mail: [email protected]

An interior-point algorithm framework is proposed, analyzed, and tested for solving nonlinearly constrained continuous optimization problems. The main setting of interest is when the objective and constraint functions may be nonlinear and/or nonconvex, and when constraint values and derivatives are tractable to compute, but objective function values and derivatives can only be estimated. The algorithm is intended primarily for a setting that is similar for stochastic-gradient methods for unconstrained optimization, namely, the setting when stochastic-gradient estimates are available and employed in place of gradients of the objective, and when no objective function values (nor estimates of them) are employed. This is achieved by the interior-point framework having a single-loop structure rather than the nested-loop structure that is typical of contemporary interior-point methods. For completeness, convergence guarantees for the framework are provided both for deterministic and stochastic settings. Numerical experiments show that the algorithm yields good performance on a large set of test problems.

1 Introduction

The interior-point methodology is one of the most popular and effective derivative-based algorithm methodologies for solving inequality-constrained continuous optimization problems. Indeed, while there has been a long history of development of other types of algorithms—e.g., of the augmented-Lagrangian and sequential-quadratic-optimization varieties [16]—interior-point methods are arguably the most effective class of approaches for solving many constrained continuous optimization problems. In the setting of nonconvex optimization, which is the setting of interest in this paper, the effectiveness of interior-point methods is evidenced by the fact that the most popular software packages for solving such problems are based on interior-point methods. These include the packages Ipopt [19], Knitro [5], and LOQO [18]. This popularity withstands despite the fact that interior-point methods for solving nonconvex problems do not necessarily achieve the same optimal worst-case complexity properties that can be achieved in convex settings.

Despite its widespread success, little work has been done on extending the interior-point methodology to the stochastic regime. Here, we are focused on the setting in which constraint values and derivatives are tractable to compute, but objective values and derivatives are not. Furthermore, like in the setting of stochastic-gradient-based methods for solving unconstrained problems, our focus is to have an interior-point method that employs stochastic-gradient estimates (in place of exact objective gradients) and no objective function evaluations. Given the success of interior-point methods in the deterministic regime, there is ample motivation to explore their potential theoretical and practical behavior in such a stochastic regime. This is the main motivation for the work presented in this paper.

We contend that there are at least a few significant obstacles to the design and analysis of a stochastic interior-point method, which may be at least part of the reason that there has been little prior work on this topic. Firstly, interior-point methods often have a nested-loop structure wherein an inner loop is designed to solve (approximately) a so-called barrier subproblem for a fixed value of a barrier parameter and the outer loop involves decreasing the barrier parameter. Determining when to terminate the inner loop typically involves evaluating a measure of stationarity. However, in the aforementioned stochastic setting, evaluating such a measure is not tractable since objective-gradient evaluations are not tractable. Secondly, the barrier functions introduced in interior-point methods have corresponding gradient functions that are not Lipschitz continuous nor bounded. This presents major difficulties since for many stochastic-gradient-based methods, Lipschitz continuity and boundedness of the gradients are necessary features for ensuring convergence guarantees. Thirdly, interior-point methods regularly involve globalization mechanisms such as line searches or trust-region step-acceptance mechanisms that require exact function evaluations, but these are not tractable to perform in the aforementioned stochastic setting that is of interest here.

1.1 Contributions and Limitations

To provide a significant advancement in the design and analysis of stochastic-gradient-based interior-point methods, in this paper we propose and analyze a single-loop interior-point algorithm framework that offers rigorous convergence guarantees in both deterministic and stochastic settings. Let us emphasize upfront that we do not expect our method to be competitive with state-of-the-art interior-point methods in the deterministic regime, and especially not with those that employ second-order derivatives of the objective. Nevertheless, we present an algorithm and analyze its convergence properties for that setting for completeness and since it helps to elucidate the essential features of an algorithm that are necessary for achieving convergence guarantees in the stochastic regime.

Our method is an extension of the approach proposed in [7], which was proposed and analyzed for the bound-constrained setting. We go beyond the bound-constrained setting and propose a framework that is applicable for problems with nonlinear inequality constraints, at least as long as a strictly feasible initial point for the algorithm can be provided. For the sake of generality, we present the framework in a manner such that linear equality constraints can also be present, although we emphasize that our convergence guarantees in neither the deterministic nor stochastic regime allow for the presence of nonlinear equality constraints. Along with our concluding remarks, we discuss the challenges that would be faced when trying to extend our approach to settings with nonlinear equality constraints.

Most of the assumptions that are required for our convergence guarantees are standard in the literature on interior-point algorithms, specifically for that on so-called feasible interior-point methods that maintain feasibility (with respect to all of the constraints) at all algorithm iterates. However, we require one additional assumption that is not standard. This assumption essentially requires that, when an algorithm iterate is close to a constraint boundary, a direction of sufficient descent can be computed that moves sufficiently away from the boundary. It is, in essence, an assumption that combines two aspects: (a) a type of nondegeneracy assumption and (b) an assumption that the barrier parameter is initialized to be sufficiently large compared to a neighborhood parameter that is introduced in our algorithm. We demonstrate these aspects through some illustrative examples. Since, to the best of our knowledge, there are no other interior-point methods in the literature that possess convergence guarantees in a stochastic setting that is comparable to ours, we contend that our algorithm and analysis together constitute a notable contribution to the literature despite our method being a feasible interior-point method and our need to introduce a nonstandard assumption. Future research may reveal techniques that can offer convergence guarantees for a stochastic-gradient-based interior-point method in other settings.

We also provide the results of numerical experiments showing that our algorithms performs well when employed to solve challenging test problems.

1.2 Notation

We use \mathbb{R} to denote the set of real numbers and, given aa\in\mathbb{R}, we use a\mathbb{R}_{\geq a} (resp., >a\mathbb{R}_{>a}) to denote the set of real numbers that are greater than or equal to aa (resp., strictly greater than aa). Superscripts are used to indicate the dimension of a vector or matrix whose elements are restricted to such a set; e.g., the set of nn-dimensional vectors is denoted as n\mathbb{R}^{n}. The set of positive integers is denoted as :={1,2,}\mathbb{N}:=\{1,2,\dots\} and, given nn\in\mathbb{N}, the set of positive integers up to nn is denoted as [n]:={1,,n}[n]:=\{1,\dots,n\}. The set of nn-by-nn-dimensional symmetric positive semidefinite (resp., definite) matrices is denoted as 𝕊0nn×n\mathbb{S}_{\succeq 0}^{n}\subset\mathbb{R}^{n\times n} (resp., 𝕊0nn×n\mathbb{S}_{\succ 0}^{n}\subset\mathbb{R}^{n\times n}).

A real-valued sequence is always introduced as {vk}\{v_{k}\}\subset{\cal R}, where {\cal R} is a real vector (sub)space and the subset notation indicates that vkv_{k}\in{\cal R} for each index kk\in\mathbb{N} (although elements in the sequence may repeat). Here, subscripts are used to indicate index number in a sequence, although in some situations a subscript is used to indicate an element number of a vector or vector-valued function. When indicating the iith element of a vector vkv_{k}, the notation [vk]i[v_{k}]_{i} is used. In all cases, the meaning of a subscript is clear from the context. The notation {vk}v\{v_{k}\}\to v indicates that the sequence {vk}\{v_{k}\} converges to vv. The notation {vk}0\{v_{k}\}\searrow 0 indicates that {vk}\{v_{k}\} is a sequence of positive real numbers that vanishes monotonically.

Given a vector vnv\in\mathbb{R}^{n}, we use diag(v)\operatorname{diag}(v) to denote the diagonal matrix for which, for all i[n]i\in[n], the (i,i)(i,i) element is viv_{i}\in\mathbb{R}. Given two vectors zmz\in\mathbb{R}^{m} and cmc\in\mathbb{R}^{m}, their Hadamard (i.e., component-wise) product is denoted as zcmz\circ c\in\mathbb{R}^{m}. We use 𝟙\mathds{1} to denote a vector whose elements are all equal to one; the length of such a vector is determined by the context in which it appears. Given Al×nA\in\mathbb{R}^{l\times n}, the null space of AA is denoted as Null(A)\operatorname{Null}(A). Given (A,B)𝕊0n×𝕊0n(A,B)\in\mathbb{S}_{\succeq 0}^{n}\times\mathbb{S}_{\succeq 0}^{n}, we use ABA\succeq B (resp., ABA\preceq B) to indicate that ABA-B (resp., BAB-A) is positive semidefinite. Given vpv\in\mathbb{R}^{p} and M𝕊0pM\in\mathbb{S}_{\succ 0}^{p}, the MM-norm of vv is written as vM=vTMv\|v\|_{M}=\sqrt{v^{T}Mv}. Finally, given a set 𝒫{\cal P}, its interior is denoted as int(𝒫)\operatorname{int}({\cal P}) and its polar cone is denoted as

𝒫:={yn:pTy0for allp𝒫}.{\cal P}^{\circ}:=\{y\in\mathbb{R}^{n}:p^{T}y\leq 0\ \text{for all}\ p\in{\cal P}\}.

1.3 Outline

In §2, we present our problem of interest and preliminary commentary. In §3, we present our proposed framework and prove basic facts about its behavior. Section 4 contains our assumptions and convergence analyses for the deterministic and stochastic settings. As previously mentioned, our analyses rely on an assumption that is not standard; we discuss this assumption in detail in §5, highlighting certain settings of interest in which it indeed holds. Details about a practical implementation of our algorithm and resulting numerical experiments are presented in §6. Concluding remarks are provided in §7.

2 Problem Formulation and Preliminaries

Our proposed algorithm framework (Algorithm 1 on page 1) is designed to solve nonlinearly constrained optimization problems of the form

minxnf(x)subjecttoAx=bandc(x)0,\min_{x\in\mathbb{R}^{n}}\ f(x)\ \operatorname{subject~{}to}\ Ax=b\ \text{and}\ c(x)\leq 0, (1)

where Al×nA\in\mathbb{R}^{l\times n} (with l<nl<n), blb\in\mathbb{R}^{l}, and the functions ff and cc satisfy the following assumption with respect to the feasible region of problem (1), namely,

:=𝒞0,where{:={xn:Ax=b};𝒞0:={xn:c(x)0}.{\cal F}:={\cal E}\cap{\cal C}_{\leq 0},\ \text{where}\ \left\{\begin{aligned} {\cal E}&:=\{x\in\mathbb{R}^{n}:Ax=b\};\\ {\cal C}_{\leq 0}&:=\{x\in\mathbb{R}^{n}:c(x)\leq 0\}.\end{aligned}\right.
Assumption 2.1.

There exists an open set ¯\overline{\cal F}\supset{\cal F} over which the objective function f:¯f:\overline{\cal F}\to\mathbb{R} is continuously differentiable and bounded below by finff_{\inf}\in\mathbb{R} and the constraint function c:¯mc:\overline{\cal F}\to\mathbb{R}^{m} is continuously differentiable. In addition, the matrix Al×nA\in\mathbb{R}^{l\times n} has full row rank. Finally, there exists xx\in{\cal F} such that c(x)<0c(x)<0 ((i.e., where the inequality holds strictly, component-wise)).

The elements of Assumption 2.1 pertaining to ff and cc are mostly common for the literature on smooth nonlinearly constrained optimization. The only exception is the assumption that there exists xx\in{\cal F} such that c(x)<0c(x)<0. This aspect of the assumption precludes the presence of nonlinear equality constraints (through two-sided inequalities). It is needed in our setting since our algorithm requires a feasible initial point that is strictly feasible with respect to the nonaffine constraints.

Assumption 2.1 does not require that the feasible region {\cal F} is bounded, although our convergence analysis in §4 requires that the iterates remain within a set such that cc remains component-wise bounded below, and that the functions ff, cc, f:¯n\nabla f:\overline{\cal F}\to\mathbb{R}^{n}, and ci:¯n\nabla c_{i}:\overline{\cal F}\to\mathbb{R}^{n} for all i[m]i\in[m] are Lipschitz continuous; see Assumption 4.1 on page 4.1. Our analysis could be extended to the setting in which the matrix AA does not have full row rank as long as the linear system Ax=bAx=b is consistent. Our algorithm framework requires that the initial iterate is contained in {\cal E} and that each step lies in the null space of AA, which inductively implies that each iterate lies in {\cal E}. These requirements remain reasonable when AA has linearly dependent rows as long as a projection operator onto the null space of AA is available. Merely for ease of notation, we assume AA has full row rank.

In fact, our algorithm framework ensures that all iterates remain in the set

<0:=𝒞<0,where𝒞<0:={xn:c(x)<0}.{\cal F}_{<0}:={\cal E}\cap{\cal C}_{<0},\ \text{where}\ {\cal C}_{<0}:=\{x\in\mathbb{R}^{n}:c(x)<0\}. (2)

Observe that this set is guaranteed to be nonempty under Assumption 2.1. In each iteration, our framework involves the computation of a search direction that, for a given barrier parameter μ>0\mu\in\mathbb{R}_{>0}, aims to reduce the value of the objective augmented with the logarithmic barrier function defined with respect to the inequality-constraint functions of (1), namely, the function ϕ(,μ):𝒞<0\phi(\cdot,\mu):{\cal C}_{<0}\to\mathbb{R} defined by

ϕ(x,μ)=f(x)μi[m]log(ci(x)).\phi(x,\mu)=f(x)-\mu\sum_{i\in[m]}\log(-c_{i}(x)).

Going forward, we refer to this as the barrier-augmented objective function. Observe that if xμx^{\mu} is a local solution of the problem to minimize ϕ(,μ)\phi(\cdot,\mu) over <0=𝒞<0{\cal F}_{<0}={\cal E}\cap{\cal C}_{<0}, then there exists a Lagrange multiplier vector yμly^{\mu}\in\mathbb{R}^{l} such that, along with ziμ:=μ(ci(xμ))1z_{i}^{\mu}:=-\mu(c_{i}(x^{\mu}))^{-1} for all i[m]i\in[m], one finds that

f(xμ)+ATyμ+c(xμ)zμ=0,Axμ=b,c(xμ)<0,zμ>0,andzμc(xμ)=μ𝟙.\nabla f(x^{\mu})+A^{T}y^{\mu}+\nabla c(x^{\mu})z^{\mu}=0,\ Ax^{\mu}=b,\ c(x^{\mu})<0,\ z^{\mu}>0,\ \text{and}\ -z^{\mu}\circ c(x^{\mu})=\mu\mathds{1}.

Any tuple (xμ,yμ,zμ)(x^{\mu},y^{\mu},z^{\mu}) satisfying this system is referred to as a stationary point for the problem to minimize ϕ(,μ)\phi(\cdot,\mu) over <0{\cal F}_{<0}. This system should be compared to the Karush-Kuhn-Tucker (KKT) conditions for problem (1); in particular, under a constraint qualification—such as the Mangasarian-Fromovitz constraint qualification (MFCQ) [13]—if xx is a local solution of problem (1), then there exists yly\in\mathbb{R}^{l} and zmz\in\mathbb{R}^{m} such that the following system of KKT conditions is satisfied:

f(x)+ATy+c(x)z=0,Ax=b,c(x)0,z0,andzc(x)=0.\nabla f(x)+A^{T}y+\nabla c(x)z=0,\ Ax=b,\ c(x)\leq 0,\ z\geq 0,\ \text{and}\ -z\circ c(x)=0. (3)

Any tuple (x,y,z)(x,y,z) satisfying this system is referred to as a stationary point (or KKT point) for problem (1). As is generally the case for an interior-point method, our algorithm framework aims to find such a KKT point by taking steps to reduce the barrier-augmented objective function for a diminishing sequence of barrier parameters—a procedure that can be motivated by the fact that the stationarity conditions for the minimization of ϕ(,μ)\phi(\cdot,\mu) over <0{\cal F}_{<0} tend toward those of (1).

3 Algorithm

Our algorithm framework is stated as Algorithm 1 on page 1. The framework presumes that it is given an initial neighborhood parameter θ0>0\theta_{0}\in\mathbb{R}_{>0} and an initial iterate x1𝒩(θ0)x_{1}\in{\cal E}\cap{\cal N}(\theta_{0}), where for all θ>0\theta\in\mathbb{R}_{>0} we define

𝒩(θ):={xn:c(x)θ𝟙}𝒞<0.{\cal N}(\theta):=\{x\in\mathbb{R}^{n}:c(x)\leq-\theta\mathds{1}\}\subseteq{\cal C}_{<0}.

Observe that the existence of such θ0\theta_{0} and x1x_{1} is guaranteed under Assumption 2.1. Moreover, the steps of Algorithm 1 ensure that, for a prescribed, monotonically decreasing sequence {θk}>0\{\theta_{k}\}\subset\mathbb{R}_{>0} with θ1<θ0\theta_{1}<\theta_{0}, one finds xk𝒩(θk1)x_{k}\in{\cal E}\cap{\cal N}(\theta_{k-1}) for all kk\in\mathbb{N}. Hence, to describe the main steps of the algorithm for each kk\in\mathbb{N}, we may presume that for all kk\in\mathbb{N} the algorithm has an iterate xk𝒩(θk1)x_{k}\in{\cal E}\cap{\cal N}(\theta_{k-1}).

Following [7], Algorithm 1 also employs a prescribed, monotonically decreasing sequence of barrier parameters {μk}>0\{\mu_{k}\}\subset\mathbb{R}_{>0} such that {μk/θk1}\{\mu_{k}/\theta_{k-1}\} is a constant sequence. Other inputs required by Algorithm 1 in the deterministic setting are η(θ0/μ1,1)\eta\in(\theta_{0}/\mu_{1},1), η¯(θ0,)\underline{\eta}\in(\theta_{0},\infty), ζ¯>0\underline{\zeta}\in\mathbb{R}_{>0}, ζ¯>0\overline{\zeta}\in\mathbb{R}_{>0} with ζ¯ζ¯\underline{\zeta}\leq\overline{\zeta}, and ζ(0,1)\zeta\in(0,1), each of which influences the search direction computation. Given these inputs along with kk\in\mathbb{N} and xk𝒩(θk1)x_{k}\in{\cal E}\cap{\cal N}(\theta_{k-1}), the computation proceeds as follows. Firstly, a set of indices of constraints that are nearly active with respect to xkx_{k} is identified as

𝒜k:={i[m]:ημk<ci(xk)}.{\cal A}_{k}:=\{i\in[m]:-\eta\mu_{k}<c_{i}(x_{k})\}.

Observe that since xk𝒩(θk1)x_{k}\in{\cal N}(\theta_{k-1}) it follows that ci(xk)θk1c_{i}(x_{k})\leq-\theta_{k-1} for all i[m]i\in[m], and, since the facts that η(θ0/μ1,1)\eta\in(\theta_{0}/\mu_{1},1) and {μk/θk1}\{\mu_{k}/\theta_{k-1}\} is a constant sequence together ensure ημk<θk1-\eta\mu_{k}<-\theta_{k-1}, it follows that ημk<ci(xk)θk1-\eta\mu_{k}<c_{i}(x_{k})\leq-\theta_{k-1} for all i𝒜ki\in{\cal A}_{k}. Secondly, a vector qknq_{k}\in\mathbb{R}^{n} is computed, the value of which is distinct between the deterministic and stochastic settings. In particular, the framework employs

qk:={f(xk)μkc(xk)diag(c(xk))1𝟙(deterministic)gkμkc(xk)diag(c(xk))1𝟙(stochastic)q_{k}:=\begin{cases}\nabla f(x_{k})-\mu_{k}\nabla c(x_{k})\operatorname{diag}(c(x_{k}))^{-1}\mathds{1}&\text{(deterministic)}\\ g_{k}-\mu_{k}\nabla c(x_{k})\operatorname{diag}(c(x_{k}))^{-1}\mathds{1}&\text{(stochastic)}\end{cases} (4)

where in the stochastic setting the vector gkg_{k} is a stochastic estimate of f(xk)\nabla f(x_{k}). Third, with P:=IAT(AAT)1AP:=I-A^{T}(AA^{T})^{-1}A denoting the orthogonal projection matrix onto the null space of AA, a search direction dknd_{k}\in\mathbb{R}^{n} is computed such that

Adk\displaystyle Ad_{k} =0,\displaystyle=0, (5a)
ζ¯Pqk2\displaystyle\underline{\zeta}\|Pq_{k}\|_{2} dk2,\displaystyle\leq\|d_{k}\|_{2}, (5b)
ζ¯Pqk2\displaystyle\overline{\zeta}\|Pq_{k}\|_{2} dk2,\displaystyle\geq\|d_{k}\|_{2}, (5c)
(Pqk)Tdk\displaystyle-(Pq_{k})^{T}d_{k} ζPqk2dk2,and\displaystyle\geq\zeta\|Pq_{k}\|_{2}\|d_{k}\|_{2},\ \ \text{and} (5d)
ci(xk)Tdk\displaystyle\nabla c_{i}(x_{k})^{T}d_{k} 12η¯dk2for alli𝒜k.\displaystyle\leq-\tfrac{1}{2}\underline{\eta}\|d_{k}\|_{2}\ \ \text{for all}\ \ i\in{\cal A}_{k}. (5e)

In the deterministic setting, all that is required of the search direction dkd_{k} is that it satisfies (5). For the stochastic setting, we introduce in our analysis a particular strategy for computing the search direction that implies that (5) holds along with other useful properties for our analysis for that setting.

Let us now discuss (5) in detail. Firstly, condition (5a) requires that the search direction satisfies dkNull(A)d_{k}\in\operatorname{Null}(A), i.e., it lies in the null space of AA, which, since xkx_{k}\in{\cal E}, ensures that the subsequent iterate will also lie in {\cal E}. Secondly, conditions (5b) and (5c) require that the norm of the search direction dkd_{k} is proportional to the norm of Pqk=(IAT(AAT)1A)qkPq_{k}=(I-A^{T}(AA^{T})^{-1}A)q_{k}, i.e., the norm of the projection of qkq_{k} onto Null(A)\operatorname{Null}(A). Thus, conditions (5b) and (5c) require that the norm of the search direction is proportional to that of a projected gradient (estimate), which is a common type of requirement for a search direction in an algorithm for minimizing a smooth objective function over an affine set. Thirdly, condition (5d) requires that dkd_{k} makes an angle with PqkPq_{k} that is sufficiently obtuse. Again, this is a common type of requirement; e.g., in the deterministic setting, it ensures that any nonzero dkd_{k} is a direction of sufficient descent with respect to the barrier-augmented objective function. Together, conditions (5a)–(5d) define a nonempty set and computing a search direction to satisfy these conditions is straightforward; e.g., one can compute the projection of qk-q_{k} onto Null(A)\operatorname{Null}(A) and scale the resulting direction, if needed, to satisfy (5b)–(5c). However, it remains finally to consider condition (5e). Unfortunately, this is a requirement that is not always satisfiable along with the other conditions, namely, (5a)–(5d). That said, we contend that all of (5) can be satisfied as long as (a) the constraints are not degenerate in some sense and (b) the initial barrier parameter μ1\mu_{1} is sufficiently large relative to θ0\theta_{0}. Denoting the conical hull of nearly active constraint gradients as

𝒫k:={i𝒜kci(xk)σi:σ0|𝒜k|},{\cal P}_{k}:=\left\{\sum_{i\in{\cal A}_{k}}\nabla c_{i}(x_{k})\sigma_{i}:\sigma\in\mathbb{R}^{|{\cal A}_{k}|}_{\geq 0}\right\},

condition (5e) requires that dkd_{k} lies sufficiently within the interior of the polar cone of 𝒫k{\cal P}_{k}, i.e., int(𝒫k)\operatorname{int}({\cal P}_{k}^{\circ}). Hence, to satisfy (5), there needs to exist a direction sufficiently within this interior that also satisfies (5a)–(5d). For our purposes now, let us simply introduce Assumption 3.1 below, which implicitly means that Algorithm 1 is being applied to a problem such that a search direction satisfying (5) always exists. We defer until §5 a more detailed discussion Assumption 3.1, where in particular we provide (a) example problems for which the assumption can be shown to hold or cannot be shown to hold, (b) further explanation about why the assumption can be viewed as a type of nondegeneracy assumption and an assumption about μ1/θ0\mu_{1}/\theta_{0} being sufficient large (which is reminiscient of a requirement in Corollary 3.8 and Lemma 3.15 in [7] for the bound-constrained setting), and (c) a procedure for computing dkd_{k} to satisfy (5) when such a direction exists.

Assumption 3.1.

For all kk\in\mathbb{N} generated in a run of Algorithm 1, there exists a search direction dknd_{k}\in\mathbb{R}^{n} satisfying the conditions (5).

Upon computation of the search direction, Algorithm 1 chooses a step size αk>0\alpha_{k}\in\mathbb{R}_{>0}. For simplicity, our analyses for the deterministic and stochastic settings consider conservative rules for selecting the step sizes that each depend on the barrier- and neighborhood-parameter sequences, {μk}\{\mu_{k}\} and {θk}\{\theta_{k}\}. We conjecture that it would be possible to generalize our algorithm and analyses to allow more flexibility in the step-size selection rules, as is done in [7]. However, for our purposes here, we present conservative strategies that are sufficient for proving convergence.

The algorithm’s last major step in the kkth iteration is to compute a positive real number γk>0\gamma_{k}\in\mathbb{R}_{>0} such that the line segment [xk,xk+γkαkdk][x_{k},x_{k}+\gamma_{k}\alpha_{k}d_{k}] is contained in the neighborhood 𝒩(θk){\cal N}(\theta_{k}). The subsequent iterate is then set as xk+1xk+γkαkdkx_{k+1}\leftarrow x_{k}+\gamma_{k}\alpha_{k}d_{k}. As mentioned at the beginning of this section, from the facts that the kkth iterate has xk𝒩(θk1)x_{k}\in{\cal E}\cap{\cal N}(\theta_{k-1}), the kkth search direction satisfies (5a), and the choice of γk\gamma_{k} ensures that xk+γkαkdk𝒩(θk)x_{k}+\gamma_{k}\alpha_{k}d_{k}\in{\cal N}(\theta_{k}), it follows that xk+1𝒩(θk)x_{k+1}\in{\cal E}\cap{\cal N}(\theta_{k}). For the sake of formality, we state the following lemma that has been proved.

Lemma 3.1.

For all kk\in\mathbb{N} generated in any run of Algorithm 1, it follows that the iterate satisfies xk𝒩(θk1)<0x_{k}\in{\cal E}\cap{\cal N}(\theta_{k-1})\subseteq{\cal F}_{<0} and the search direction satifies dkNull(A)d_{k}\in\operatorname{Null}(A).

Algorithm 1 Single-Loop Interior-Point (SLIP) Method
1:initial neighborhood parameter θ0>0\theta_{0}\in\mathbb{R}_{>0}, initial iterate x1𝒩(θ0)x_{1}\in{\cal E}\cap{\cal N}(\theta_{0}), barrier-parameter sequence {μk}0\{\mu_{k}\}\searrow 0, neighborhood-parameter sequence {θk}0\{\theta_{k}\}\searrow 0 with θ1<θ0\theta_{1}<\theta_{0} such that {μk/θk1}\{\mu_{k}/\theta_{k-1}\} is a constant sequence, maximum neighborhood-parameter sequence {γk,max}(0,1]\{\gamma_{k,\max}\}\subset(0,1], η(θ0/μ1,1)\eta\in(\theta_{0}/\mu_{1},1), η¯(θ0,)\underline{\eta}\in(\theta_{0},\infty), ζ¯>0\underline{\zeta}\in\mathbb{R}_{>0}, ζ¯>0\overline{\zeta}\in\mathbb{R}_{>0} with ζ¯ζ¯\underline{\zeta}\leq\overline{\zeta}, and ζ(0,1)\zeta\in(0,1)
2:for all kk\in\mathbb{N} do
3:     Compute qknq_{k}\in\mathbb{R}^{n} by (4)
4:     Compute dknd_{k}\in\mathbb{R}^{n} to satisfy (5)
5:     Choose αk>0\alpha_{k}\in\mathbb{R}_{>0}
6:     Compute γk(0,γk,max]\gamma_{k}\in(0,\gamma_{k,\max}] such that [xk,xk+γkαkdk]𝒩(θk)[x_{k},x_{k}+\gamma_{k}\alpha_{k}d_{k}]\subseteq{\cal N}(\theta_{k})
7:     Set xk+1xk+γkαkdkx_{k+1}\leftarrow x_{k}+\gamma_{k}\alpha_{k}d_{k}
8:end for

4 Analysis

We prove theoretical convergence guarantees for Algorithm 1 in both deterministic and stochastic settings. For both settings, we make the following additional, standard type of assumption beyond Assumptions 2.1 and 3.1.

Assumption 4.1.

Let 𝒳{\cal X} be an open convex set containing the sequence of line segments between iterates, namely, {[xk,xk+αkdk]}\{[x_{k},x_{k}+\alpha_{k}d_{k}]\}, generated in any run of Algorithm 1. There exist constants κf>0\kappa_{\nabla f}\in\mathbb{R}_{>0}, Lf>0L_{\nabla f}\in\mathbb{R}_{>0}, {κci}i[m]>0\{\kappa_{c_{i}}\}_{i\in[m]}\subset\mathbb{R}_{>0}, {Lci}i[m]>0\{L_{c_{i}}\}_{i\in[m]}\subset\mathbb{R}_{>0}, {κci}i[m]>0\{\kappa_{\nabla c_{i}}\}_{i\in[m]}\subset\mathbb{R}_{>0}, and {Lci}i[m]>0\{L_{\nabla c_{i}}\}_{i\in[m]}\subset\mathbb{R}_{>0} such that:

  • \bullet

    ff is Lipschitz with constant κf\kappa_{\nabla f} over 𝒳{\cal X}, so (under Assumption 2.1) it follows that f\nabla f is bounded in 2\ell_{2} (Euclidean) norm by κf\kappa_{\nabla f} over 𝒳{\cal X}.

  • \bullet

    f\nabla f is Lipschitz with constant LfL_{\nabla f} with respect to 2\|\cdot\|_{2} over 𝒳{\cal X}.

  • \bullet

    For all i[m]i\in[m], cic_{i} is bounded in absolute value by κci\kappa_{c_{i}} over 𝒳{\cal X}.

  • \bullet

    For all i[m]i\in[m], cic_{i} is Lipschitz with constant LciL_{c_{i}} over 𝒳{\cal X}, so (under Assumption 2.1) it follows that ci\nabla c_{i} is bounded in 2\ell_{2} (Euclidean) norm by κci:=Lci\kappa_{\nabla c_{i}}:=L_{c_{i}} over 𝒳{\cal X}.

  • \bullet

    For all i[m]i\in[m], ci\nabla c_{i} is Lipschitz with constant LciL_{\nabla c_{i}} with respect to 2\|\cdot\|_{2} over 𝒳{\cal X}.

Under Assumption 4.1, with {κci}i[m]>0\{\kappa_{c_{i}}\}_{i\in[m]}\subset\mathbb{R}_{>0} defined in the assumption, we introduce 𝒞[κc,0):={xn:ci(x)[κci,0)for alli[m]}{\cal C}_{[-\kappa_{c},0)}:=\{x\in\mathbb{R}^{n}:c_{i}(x)\in[-\kappa_{c_{i}},0)\ \text{for all}\ i\in[m]\} along with the shifted barrier-augmented objective function ϕ~:𝒞[κc,0)×>0\tilde{\phi}:{\cal C}_{[-\kappa_{c},0)}\times\mathbb{R}_{>0}\to\mathbb{R} defined by

ϕ~(x,μ)=f(x)μi[m]log(ci(x)κci).\tilde{\phi}(x,\mu)=f(x)-\mu\sum_{i\in[m]}\log\left(-\frac{c_{i}(x)}{\kappa_{c_{i}}}\right).

Important properties of the shifted barrier-augmented objective function are that (a) it is bounded below by finf:=infx<0f(x)f_{\inf}:=\inf_{x\in{\cal F}_{<0}}f(x)\in\mathbb{R} whose existence follows under Assumption 2.1 and (b) it shares the same gradient function with respect to its first argument as that of the (unshifted) barrier-augmented objective function; i.e., for all (x,μ)𝒞[κc,0)×>0(x,\mu)\in{\cal C}_{[-\kappa_{c},0)}\times\mathbb{R}_{>0}, one finds that

xϕ(x,μ)=xϕ~(x,μ),\nabla_{x}\phi(x,\mu)=\nabla_{x}\tilde{\phi}(x,\mu), (6)

where x\nabla_{x} denotes the gradient operator with respect to a function’s first argument. These follow since for any i[m]i\in[m] and δ(0,κci]\delta\in(0,\kappa_{c_{i}}] one has log(δ/κci)0-\log(\delta/\kappa_{c_{i}})\geq 0 and log(δ/κci)=log(δ)+log(κci)-\log(\delta/\kappa_{c_{i}})=-\log(\delta)+\log(\kappa_{c_{i}}), so (x,μ)𝒞[κci,0)×>0(x,\mu)\in{\cal C}_{[-\kappa_{c_{i}},0)}\times\mathbb{R}_{>0} implies

ϕ(x,μ)+μi[m]log(κci)=ϕ~(x,μ)finf.\phi(x,\mu)+\mu\sum_{i\in[m]}\log(\kappa_{c_{i}})=\tilde{\phi}(x,\mu)\geq f_{\inf}. (7)

In addition, the shifted barrier-augmented objective function is monotonically increasing in its second argument in the sense that for any x𝒞[κc,0)x\in{\cal C}_{[-\kappa_{c},0)} one finds that ϕ~(x,μ)<ϕ~(x,μ¯)\tilde{\phi}(x,\mu)<\tilde{\phi}(x,\bar{\mu}) for any (μ,μ¯)>0×>0(\mu,\bar{\mu})\in\mathbb{R}_{>0}\times\mathbb{R}_{>0} with μ<μ¯\mu<\bar{\mu}. These properties make the function ϕ~\tilde{\phi} useful for our analysis. Indeed, even though the algorithm is not aware of the constants {κci}i[m]\{\kappa_{c_{i}}\}_{i\in[m]} in Assumption 4.1, we shall show that it follows from the former property above that it computes steps that lead to (expected) decrease in the shifted barrier-augmented objective function ϕ~\tilde{\phi}. Also, with the latter property above, decreases in the barrier parameter also lead to decreases in ϕ~\tilde{\phi}.

Our first result is useful for both the deterministic and stochastic settings. It shows that, for any θ>0\theta\in\mathbb{R}_{>0}, the gradient of the shifted barrier-augmented objective function with respect to its first argument is Lipschitz over line segments in 𝒩(θ){\cal N}(\theta) with a Lipschitz constant that depends on θ\theta. This is a critical property, especially since this function is not globally Lipschitz over <0𝒩(θ){\cal F}_{<0}\supset{\cal N}(\theta). The proof follows standard techniques for such a result; we provide it for completeness.

Lemma 4.1.

For any (μ,θ,θ¯)>0×>0×>0(\mu,\theta,\bar{\theta})\in\mathbb{R}_{>0}\times\mathbb{R}_{>0}\times\mathbb{R}_{>0} with θ¯(0,θ]\bar{\theta}\in(0,\theta], x𝒩(θ)x\in{\cal N}(\theta), x¯𝒩(θ¯)\mkern 1.5mu\overline{\mkern-1.5mux}\in{\cal N}(\bar{\theta}), and γ[0,1]\gamma\in[0,1], one finds with

Lμ,θ,θ¯:=Lf+μθθ¯i=1m(Lciκci+κciLci)>0L_{\mu,\theta,\bar{\theta}}:=L_{\nabla f}+\frac{\mu}{\theta\bar{\theta}}\sum_{i=1}^{m}(L_{c_{i}}\kappa_{\nabla c_{i}}+\kappa_{c_{i}}L_{\nabla c_{i}})\in\mathbb{R}_{>0}

that one has both

xϕ~(x+γ(x¯x),μ)xϕ~(x,μ)2γLμ,θ,θ¯x¯x2\displaystyle\|\nabla_{x}\tilde{\phi}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x),\mu)-\nabla_{x}\tilde{\phi}(x,\mu)\|_{2}\leq\gamma L_{\mu,\theta,\bar{\theta}}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}
andϕ~(x¯,μ)ϕ~(x,μ)+xϕ~(x,μ)T(x¯x)+12Lμ,θ,θ¯x¯x22\displaystyle\text{and}\ \ \tilde{\phi}(\mkern 1.5mu\overline{\mkern-1.5mux},\mu)\leq\tilde{\phi}(x,\mu)+\nabla_{x}\tilde{\phi}(x,\mu)^{T}(\mkern 1.5mu\overline{\mkern-1.5mux}-x)+\tfrac{1}{2}L_{\mu,\theta,\bar{\theta}}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}^{2}
Proof.

For arbitrary (μ,θ,θ¯,x,x¯,γ)(\mu,\theta,\bar{\theta},x,\mkern 1.5mu\overline{\mkern-1.5mux},\gamma) satisfying the conditions of the lemma, it follows from Assumption 2.1, Assumption 4.1, (6), and the triangle inequality that

xϕ~(x+γ(x¯x),μ)xϕ~(x,μ)2\displaystyle\ \|\nabla_{x}\tilde{\phi}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x),\mu)-\nabla_{x}\tilde{\phi}(x,\mu)\|_{2}
=\displaystyle= f(x+γ(x¯x))f(x)μ(i=1mci(x+γ(x¯x))ci(x+γ(x¯x))i=1mci(x)ci(x))2\displaystyle\ \left\|\nabla f(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))-\nabla f(x)-\mu\left(\sum_{i=1}^{m}\frac{\nabla c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))}{c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))}-\sum_{i=1}^{m}\frac{\nabla c_{i}(x)}{c_{i}(x)}\right)\right\|_{2}
\displaystyle\leq γLfx¯x2+μi=1mci(x+γ(x¯x))ci(x)ci(x)ci(x+γ(x¯x))ci(x)ci(x+γ(x¯x))2\displaystyle\ \gamma L_{\nabla f}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}+\mu\sum_{i=1}^{m}\left\|\frac{c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))\nabla c_{i}(x)-c_{i}(x)\nabla c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))}{c_{i}(x)c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))}\right\|_{2}
\displaystyle\leq γLfx¯x2+μθθ¯i=1mci(x+γ(x¯x))ci(x)ci(x)ci(x)\displaystyle\ \gamma L_{\nabla f}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}+\frac{\mu}{\theta\bar{\theta}}\sum_{i=1}^{m}\|c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))\nabla c_{i}(x)-c_{i}(x)\nabla c_{i}(x)
+ci(x)ci(x)ci(x)ci(x+γ(x¯x))2\displaystyle\hskip 100.0pt+c_{i}(x)\nabla c_{i}(x)-c_{i}(x)\nabla c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))\|_{2}
\displaystyle\leq γLfx¯x2+μθθ¯i=1m(|ci(x+γ(x¯x))ci(x)|ci(x)2\displaystyle\ \gamma L_{\nabla f}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}+\frac{\mu}{\theta\bar{\theta}}\sum_{i=1}^{m}(|c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))-c_{i}(x)|\|\nabla c_{i}(x)\|_{2}
+|ci(x)|ci(x)ci(x+γ(x¯x))2)\displaystyle\hskip 100.0pt+|c_{i}(x)|\|\nabla c_{i}(x)-\nabla c_{i}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x))\|_{2})
\displaystyle\leq γLfx¯x2+μθθ¯i=1m(γLciκcix¯x2+γκciLcix¯x2),\displaystyle\ \gamma L_{\nabla f}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}+\frac{\mu}{\theta\bar{\theta}}\sum_{i=1}^{m}(\gamma L_{c_{i}}\kappa_{\nabla c_{i}}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}+\gamma\kappa_{c_{i}}L_{\nabla c_{i}}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}),

from which the first desired conclusion follows. Hence, for arbitrary (μ,θ,θ¯,x,x¯)(\mu,\theta,\bar{\theta},x,\mkern 1.5mu\overline{\mkern-1.5mux}) satisfying the conditions of the lemma, it follows from the Fundamental Theorem of Calculus and the Cauchy–Schwarz inequality that

ϕ~(x¯,μ)ϕ~(x,μ)\displaystyle\ \tilde{\phi}(\mkern 1.5mu\overline{\mkern-1.5mux},\mu)-\tilde{\phi}(x,\mu)
=\displaystyle= 01ϕ~(x+γ(x¯x),μ)γdγ=01xϕ~(x+γ(x¯x),μ)T(x¯x)dγ\displaystyle\ \int_{0}^{1}\frac{\partial\tilde{\phi}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x),\mu)}{\partial\gamma}\text{d}\gamma=\int_{0}^{1}\nabla_{x}\tilde{\phi}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x),\mu)^{T}(\mkern 1.5mu\overline{\mkern-1.5mux}-x)\text{d}\gamma
=\displaystyle= xϕ~(x,μ)T(x¯x)+01(xϕ~(x+γ(x¯x),μ)xϕ~(x,μ))T(x¯x)dγ\displaystyle\ \nabla_{x}\tilde{\phi}(x,\mu)^{T}(\mkern 1.5mu\overline{\mkern-1.5mux}-x)+\int_{0}^{1}(\nabla_{x}\tilde{\phi}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x),\mu)-\nabla_{x}\tilde{\phi}(x,\mu))^{T}(\mkern 1.5mu\overline{\mkern-1.5mux}-x)\text{d}\gamma
\displaystyle\leq xϕ~(x,μ)T(x¯x)+x¯x201xϕ~(x+γ(x¯x),μ)xϕ~(x,μ))2dγ\displaystyle\ \nabla_{x}\tilde{\phi}(x,\mu)^{T}(\mkern 1.5mu\overline{\mkern-1.5mux}-x)+\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}\int_{0}^{1}\|\nabla_{x}\tilde{\phi}(x+\gamma(\mkern 1.5mu\overline{\mkern-1.5mux}-x),\mu)-\nabla_{x}\tilde{\phi}(x,\mu))\|_{2}\text{d}\gamma
\displaystyle\leq xϕ~(x,μ)T(x¯x)+Lμ,θ,θ¯x¯x201γdγ,\displaystyle\ \nabla_{x}\tilde{\phi}(x,\mu)^{T}(\mkern 1.5mu\overline{\mkern-1.5mux}-x)+L_{\mu,\theta,\bar{\theta}}\|\mkern 1.5mu\overline{\mkern-1.5mux}-x\|_{2}\int_{0}^{1}\gamma\text{d}\gamma,

so the desired conclusion follows from the fact that 01γdγ=12\int_{0}^{1}\gamma\text{d}\gamma=\tfrac{1}{2}. ∎∎

4.1 Deterministic Setting

In this subsection, we prove convergence guarantees for Algorithm 1 in the deterministic setting, i.e., when qk=f(xk)μkc(xk)diag(c(xk))1𝟙=xϕ~(xk,μk)q_{k}=\nabla f(x_{k})-\mu_{k}\nabla c(x_{k})\operatorname{diag}(c(x_{k}))^{-1}\mathds{1}=\nabla_{x}\tilde{\phi}(x_{k},\mu_{k}) (see (4) and (6)) for all kk\in\mathbb{N}. For ease of exposition, we refer to this instance of the method as Algorithm 1(D). In particular, we prove that under a certain rule for choosing the step-size parameters, the algorithm is well defined, generates a sequence of iterates over which a measure of stationarity vanishes, and under a constraint qualification can produce a sequence of Lagrange multipliers such that convergence to a KKT point is guaranteed. This algorithm for the deterministic setting is not expected to be competitive with state-of-the-art (second-order) interior-point methods for solving problems in many real-world settings. That said, our analysis for the deterministic setting provides a useful precursor for our subsequent analysis for the stochastic setting for which state-of-the-art methods are not applicable (since they do not possess known convergence guarantees in the stochastic setting that we consider in our analysis in §4.2).

The class of step-size parameters that we consider for our analysis in this subsection is that satisfying the following parameter rule. We state the rule, then summarize the reasons that the rule has been designed in this manner.

Parameter 1.

Given tα(,0]t_{\alpha}\in(-\infty,0], for all kk\in\mathbb{N}, Algorithm 1(D) employs

αkktα(ζ¯ζζ¯2Lk),\alpha_{k}\leftarrow k^{t_{\alpha}}\left(\frac{\underline{\zeta}\zeta}{\overline{\zeta}^{2}L_{k}}\right),

where Lk:=Lf+μkθkθk1i=1m(Lciκci+κciLci)>0\displaystyle L_{k}:=L_{\nabla f}+\frac{\mu_{k}}{\theta_{k}\theta_{k-1}}\sum_{i=1}^{m}\left(L_{c_{i}}\kappa_{\nabla c_{i}}+\kappa_{c_{i}}L_{\nabla c_{i}}\right)\in\mathbb{R}_{>0}. In addition, for all kk\in\mathbb{N}, Algorithm 1(D) employs γk,max1\gamma_{k,\max}\leftarrow 1 and γkmini[m]γk,i\gamma_{k}\leftarrow\min_{i\in[m]}\gamma_{k,i}, where, for all i[m]i\in[m],

γk,imin{1,ci(xk)Tdk+(ci(xk)Tdk)2+2Lcidk22(ci(xk)θk)Lciαkdk22};\gamma_{k,i}\leftarrow\min\left\{1,\frac{-\nabla c_{i}(x_{k})^{T}d_{k}+\sqrt{(\nabla c_{i}(x_{k})^{T}d_{k})^{2}+2L_{\nabla c_{i}}\|d_{k}\|_{2}^{2}(-c_{i}(x_{k})-\theta_{k})}}{L_{\nabla c_{i}}\alpha_{k}\|d_{k}\|_{2}^{2}}\right\};

hence, γk,i(0,1]\gamma_{k,i}\in(0,1] for all kk\in\mathbb{N} and i[m]i\in[m], so γk(0,1]\gamma_{k}\in(0,1] for all kk\in\mathbb{N}.

The fundamental aspects of Parameter Rule 1 can be understood as follows. First, the choice of αk\alpha_{k} ensures that this step size is chosen sufficiently small such that it ensures sufficient decrease in the barrier-augmented objective function with each step. Intuitively, this can be seen in the fact that the step size is proportional to the reciprocal of a Lipschitz-type constant for the gradient of the barrier function (see Lemma 4.1), as is common in basic descent methods for gradient-based optimization algorithms. For additional flexibility and since it is required in the stochastic setting, we introduce the factor ktαk^{t_{\alpha}} such that the step size may diminish faster, namely, when a choice tα<0t_{\alpha}<0 is made. (That said, our analysis shows that tα=0t_{\alpha}=0 is a valid choice in the deterministic setting.) Finally, the rule for the choice of γk\gamma_{k} is shown in our analysis to guarantee that [xk,xk+γkαkdk]𝒩(θk)[x_{k},x_{k}+\gamma_{k}\alpha_{k}d_{k}]\subseteq{\cal N}(\theta_{k}) for all kk\in\mathbb{N}, which is necessary for our analysis to employ Lemma 4.1.

For our next lemma, we prove this critical property of γk\gamma_{k} for all kk\in\mathbb{N}.

Lemma 4.2.

Suppose that Assumptions 2.1, 3.1, and 4.1 hold and that Algorithm 1(D) is run with Parameter Rule 1. Then, for all kk\in\mathbb{N}. the choice of γk(0,1]\gamma_{k}\in(0,1] in Parameter Rule 1 ensures that [xk,xk+γkαkdk]𝒩(θk)[x_{k},x_{k}+\gamma_{k}\alpha_{k}d_{k}]\subseteq{\cal N}(\theta_{k}), i.e., the condition in Step 6 of Algorithm 1(D) holds. Consequently, under Parameter Rule 1, Algorithm 1(D) is well defined in the sense that it will generate an infinite sequence of iterates.

Proof.

Consider arbitrary kk\in\mathbb{N} and i[m]i\in[m]. We prove that the choice of γk,i(0,1]\gamma_{k,i}\in(0,1] in Parameter Rule 1 ensures that ci(xk+γαkdk)θkc_{i}(x_{k}+\gamma\alpha_{k}d_{k})\leq-\theta_{k} for all γ[0,γk,i]\gamma\in[0,\gamma_{k,i}], after which the desired conclusion follows directly from the fact that γkmini[m]γk,i\gamma_{k}\leftarrow\min_{i\in[m]}\gamma_{k,i} for all kk\in\mathbb{N}. Under Assumptions 2.1 and 4.1, one has for all γ[0,1]\gamma\in[0,1] that

ci(xk+γαkdk)ci(xk)+γαkci(xk)Tdk+12Lciγ2αk2dk22.c_{i}(x_{k}+\gamma\alpha_{k}d_{k})\leq c_{i}(x_{k})+\gamma\alpha_{k}\nabla c_{i}(x_{k})^{T}d_{k}+\tfrac{1}{2}L_{\nabla c_{i}}\gamma^{2}\alpha_{k}^{2}\|d_{k}\|_{2}^{2}.

Therefore, ci(xk+γαkdk)θkc_{i}(x_{k}+\gamma\alpha_{k}d_{k})\leq-\theta_{k} is ensured as long as γ[0,1]\gamma\in[0,1] yields

ci(xk)+γ¯αkci(xk)Tdk+12Lciγ¯2αk2dk22θkfor allγ¯[0,γ].c_{i}(x_{k})+\bar{\gamma}\alpha_{k}\nabla c_{i}(x_{k})^{T}d_{k}+\tfrac{1}{2}L_{\nabla c_{i}}\bar{\gamma}^{2}\alpha_{k}^{2}\|d_{k}\|_{2}^{2}\leq-\theta_{k}\ \ \text{for all}\ \ \bar{\gamma}\in[0,\gamma]. (8)

Recalling that Lemma 3.1 ensures xk𝒩(θk1)x_{k}\in{\cal N}(\theta_{k-1}), which means that ci(xk)θk1<θkc_{i}(x_{k})\leq-\theta_{k-1}<-\theta_{k}, it follows that (8) holds for all γ[0,1]\gamma\in[0,1] when dk=0d_{k}=0, and otherwise (when dk0d_{k}\neq 0) one finds that the left-hand side of (8) is a strongly convex quadratic function of γ¯\bar{\gamma}. The second term in the minimum in the definition of γk,i\gamma_{k,i} in Parameter Rule 1 is the unique positive root of this quadratic function, where we remark that ci(xk)θk>0-c_{i}(x_{k})-\theta_{k}>0 since, as previously mentioned, ci(xk)<θkc_{i}(x_{k})<-\theta_{k}. ∎∎

We now go beyond Lemma 4.2 and prove a lower bound for γk\gamma_{k} that will be critical for our ultimate convergence guarantee in this subsection.

Lemma 4.3.

Suppose that Assumptions 2.1, 3.1, and 4.1 hold and that Algorithm 1(D) is run with Parameter Rule 1. In addition, define

β:=ζ¯(κf+μ1θ0j[m]κcj)>0,\beta:=\overline{\zeta}\left(\kappa_{\nabla f}+\frac{\mu_{1}}{\theta_{0}}\sum_{j\in[m]}\kappa_{\nabla c_{j}}\right)\in\mathbb{R}_{>0},

and for all i[m]i\in[m] define

γk,i,min:=min{1,κci+κci2+2Lci(ημkθk)αkβLci,η¯θkαkβLci}(0,1].\gamma_{k,i,\min}:=\min\left\{1,\frac{-\kappa_{\nabla c_{i}}+\sqrt{\kappa_{\nabla c_{i}}^{2}+2L_{\nabla c_{i}}(\eta\mu_{k}-\theta_{k})}}{\alpha_{k}\beta L_{\nabla c_{i}}},\frac{\underline{\eta}-\theta_{k}}{\alpha_{k}\beta L_{\nabla c_{i}}}\right\}\in(0,1].

Then, γkγk,min:=mini[m]γk,i,min\gamma_{k}\geq\gamma_{k,\min}:=\min_{i\in[m]}\gamma_{k,i,\min} for all kk\in\mathbb{N}.

Proof.

Consider arbitrary kk\in\mathbb{N} and i[m]i\in[m]. Our aim is to prove that γk,iγk,i,min\gamma_{k,i}\geq\gamma_{k,i,\min}, from which the desired conclusion follows due to the fact that γkmini[m]γk,i\gamma_{k}\leftarrow\min_{i\in[m]}\gamma_{k,i} under Parameter Rule 1. Toward this end, observe that if Parameter Rule 1 yields γk,i=1\gamma_{k,i}=1, then, clearly, γk,iγk,i,min(0,1]\gamma_{k,i}\geq\gamma_{k,i,\min}\in(0,1] and there is nothing left to prove. Therefore, we may proceed under the assumption that Parameter Rule 1 yields γk,i(0,1)\gamma_{k,i}\in(0,1). It follows under this condition that

γk,i=ci(xk)Tdkdk2+(ci(xk)Tdkdk2)2+2Lci(ci(xk)θk)αkdk2Lci.\gamma_{k,i}=\frac{\frac{-\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}+\sqrt{\left(\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}\right)^{2}+2L_{\nabla c_{i}}(-c_{i}(x_{k})-\theta_{k})}}{\alpha_{k}\|d_{k}\|_{2}L_{\nabla c_{i}}}. (9)

On the other hand, by Assumptions 2.1, 3.1, and 4.1, the fact that the matrix norm 2\|\cdot\|_{2} is submultiplicative, the triangle inequality, (5c), the fact that xk𝒩(θk1)x_{k}\in{\cal N}(\theta_{k-1}), the fact that {μk/θk1}\{\mu_{k}/\theta_{k-1}\} is a constant sequence, and the fact that for the orthogonal projection matrix PP one has P21\|P\|_{2}\leq 1, one finds that

dk2\displaystyle\|d_{k}\|_{2} ζ¯P(f(xk)μkc(xk)diag(c(xk))1𝟙)2\displaystyle\leq\overline{\zeta}\|P(\nabla f(x_{k})-\mu_{k}\nabla c(x_{k})\operatorname{diag}(c(x_{k}))^{-1}\mathds{1})\|_{2} (10)
ζ¯(f(xk)2+μkj[m]cj(xk)1cj(xk)2)β.\displaystyle\leq\overline{\zeta}\left(\|\nabla f(x_{k})\|_{2}+\mu_{k}\sum_{j\in[m]}\left\|c_{j}(x_{k})^{-1}\nabla c_{j}(x_{k})\right\|_{2}\right)\leq\beta.

Let us now proceed by considering two cases.

  1. 1.

    Suppose ci(xk)ημkc_{i}(x_{k})\leq-\eta\mu_{k}. (The iith constraint is not nearly active.) Observe that Assumption 4.1 and the Cauchy-Schwarz inequality together imply that ci(xk)Tdkκcidk2\nabla c_{i}(x_{k})^{T}d_{k}\leq\kappa_{\nabla c_{i}}\|d_{k}\|_{2}. At the same time, observe that the numerator of the right-hand side of (9) can be viewed as the value of the function h:(,κci]h:(-\infty,\kappa_{\nabla c_{i}}]\to\mathbb{R} defined by h(a)=a+a2+bh(a)=-a+\sqrt{a^{2}+b} at a=ci(xk)Tdkdk2a=\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}} for b>0b\in\mathbb{R}_{>0}, where positivity of bb follows since xk𝒩(θk1)x_{k}\in{\cal N}(\theta_{k-1}) implies ci(xk)θkθk1θk>0-c_{i}(x_{k})-\theta_{k}\geq\theta_{k-1}-\theta_{k}>0. One finds that hh is a monotonically decreasing function of aa over (,κci](-\infty,\kappa_{\nabla c_{i}}]. Hence, from (9), (10), and the condition of this case, one has

    γk,i\displaystyle\gamma_{k,i} κci+κci2+2Lci(ci(xk)θk)αkβLci\displaystyle\geq\frac{-\kappa_{\nabla c_{i}}+\sqrt{\kappa_{\nabla c_{i}}^{2}+2L_{\nabla c_{i}}(-c_{i}(x_{k})-\theta_{k})}}{\alpha_{k}\beta L_{\nabla c_{i}}} (11)
    κci+κci2+2Lci(ημkθk)αkβLci.\displaystyle\geq\frac{-\kappa_{\nabla c_{i}}+\sqrt{\kappa_{\nabla c_{i}}^{2}+2L_{\nabla c_{i}}(\eta\mu_{k}-\theta_{k})}}{\alpha_{k}\beta L_{\nabla c_{i}}}.
  2. 2.

    Suppose ci(xk)>ημkc_{i}(x_{k})>-\eta\mu_{k}. (The iith constraint is nearly active.) Under Assumption 3.1, it follows from (5e) that ci(xk)Tdk12η¯dk2\nabla c_{i}(x_{k})^{T}d_{k}\leq-\tfrac{1}{2}\underline{\eta}\|d_{k}\|_{2}. Consequently, by θk>0\theta_{k}>0, the fact that xk𝒩(θk1)x_{k}\in{\cal N}(\theta_{k-1}) implies ci(xk)θk>0-c_{i}(x_{k})-\theta_{k}>0, and the fact that the conditions of the lemma ensure η¯θk>0\underline{\eta}-\theta_{k}>0, it follows that

    η¯2\displaystyle\!\!\!\!\!\!-\frac{\underline{\eta}}{2} ci(xk)Tdkdk2\displaystyle\geq\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}
    \displaystyle\implies Lci(ci(xk)θkη¯θk)η¯θk2\displaystyle\!\!\!\!\!\!L_{\nabla c_{i}}\left(\frac{-c_{i}(x_{k})-\theta_{k}}{\underline{\eta}-\theta_{k}}\right)-\frac{\underline{\eta}-\theta_{k}}{2} ci(xk)Tdkdk2\displaystyle\geq\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}
    \displaystyle\implies 2Lci(ci(xk)θk)(η¯θk)2\displaystyle\!\!\!\!\!\!2L_{\nabla c_{i}}(-c_{i}(x_{k})-\theta_{k})-(\underline{\eta}-\theta_{k})^{2} 2(η¯θk)ci(xk)Tdkdk2\displaystyle\geq 2(\underline{\eta}-\theta_{k})\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}
    \displaystyle\implies (ci(xk)Tdkdk2)2+2Lci(ci(xk)θk)\displaystyle\!\!\!\!\!\!\left(\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}\right)^{2}+2L_{\nabla c_{i}}(-c_{i}(x_{k})-\theta_{k}) (ci(xk)Tdkdk2+(η¯θk))2\displaystyle\geq\left(\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}+(\underline{\eta}-\theta_{k})\right)^{2}
    \displaystyle\implies (ci(xk)Tdkdk2)2+2Lci(ci(xk)θk)\displaystyle\!\!\!\!\!\!\sqrt{\left(\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}\right)^{2}+2L_{\nabla c_{i}}(-c_{i}(x_{k})-\theta_{k})} ci(xk)Tdkdk2+(η¯θk),\displaystyle\geq\frac{\nabla c_{i}(x_{k})^{T}d_{k}}{\|d_{k}\|_{2}}+(\underline{\eta}-\theta_{k}),

    which with (9) and (10) implies

    γk,iη¯θkαkβLci.\gamma_{k,i}\geq\frac{\underline{\eta}-\theta_{k}}{\alpha_{k}\beta L_{\nabla c_{i}}}.

Combining the results from the two cases, one can conclude that γk,iγk,i,min\gamma_{k,i}\geq\gamma_{k,i,\min} for all i[m]i\in[m], which, as previously mentioned, completes the proof. ∎∎

For our next lemma, we prove the aforementioned sufficient decrease condition that is guaranteed by the choices in Parameter Rule 1. In particular, one finds in the lemma that the amount of decrease is at least proportional to a squared norm of the projected gradient of the barrier-augmented objective function, which makes sense since, under Lemma 3.1, xkx_{k}\in{\cal E} and dkNull(A)d_{k}\in\operatorname{Null}(A) for all kk\in\mathbb{N}.

Lemma 4.4.

Suppose that Assumptions 2.1, 3.1, and 4.1 hold and that Algorithm 1(D) is run with Parameter Rule 1. Then, for all kk\in\mathbb{N}, one finds

ϕ~(xk+1,μk+1)ϕ~(xk,μk)12ζ¯ζγkαkPxϕ(xk,μk)22.\tilde{\phi}(x_{k+1},\mu_{k+1})-\tilde{\phi}(x_{k},\mu_{k})\leq-\tfrac{1}{2}\underline{\zeta}\zeta\gamma_{k}\alpha_{k}\|P\nabla_{x}\phi(x_{k},\mu_{k})\|_{2}^{2}.
Proof.

Consider arbitrary kk\in\mathbb{N}. By Lemmas 4.14.2, (6), (5), dk=Pdkd_{k}=Pd_{k}, and P=PTP=P^{T}, it follows with Lk>0L_{k}\in\mathbb{R}_{>0} defined as in Parameter Rule 1 that

ϕ~(xk+1,μk)ϕ~(xk,μk)\displaystyle\tilde{\phi}(x_{k+1},\mu_{k})-\tilde{\phi}(x_{k},\mu_{k})\leq xϕ~(xk,μk)T(xk+1xk)+12Lkxk+1xk22\displaystyle\ \nabla_{x}\tilde{\phi}(x_{k},\mu_{k})^{T}(x_{k+1}-x_{k})+\tfrac{1}{2}L_{k}\|x_{k+1}-x_{k}\|_{2}^{2}
=\displaystyle= γkαkqkTdk+12γk2αk2Lkdk22\displaystyle\ \gamma_{k}\alpha_{k}q_{k}^{T}d_{k}+\tfrac{1}{2}\gamma_{k}^{2}\alpha_{k}^{2}L_{k}\|d_{k}\|_{2}^{2}
=\displaystyle= γkαk(Pqk)Tdk+12γk2αk2Lkdk22\displaystyle\ \gamma_{k}\alpha_{k}(Pq_{k})^{T}d_{k}+\tfrac{1}{2}\gamma_{k}^{2}\alpha_{k}^{2}L_{k}\|d_{k}\|_{2}^{2}
\displaystyle\leq γkαkζPqk2dk2+12γk2αk2Lkdk22\displaystyle\ -\gamma_{k}\alpha_{k}\zeta\|Pq_{k}\|_{2}\|d_{k}\|_{2}+\tfrac{1}{2}\gamma_{k}^{2}\alpha_{k}^{2}L_{k}\|d_{k}\|_{2}^{2}
\displaystyle\leq γkαk(ζ¯ζ12ζ¯2γkαkLk)Pqk22.\displaystyle\ -\gamma_{k}\alpha_{k}(\underline{\zeta}\zeta-\tfrac{1}{2}\overline{\zeta}^{2}\gamma_{k}\alpha_{k}L_{k})\|Pq_{k}\|_{2}^{2}. (12)

Due to Parameter Rule 1, one finds that γk(0,1]\gamma_{k}\in(0,1], ktα(0,1]k^{t_{\alpha}}\in(0,1], and

ζ¯ζ12ζ¯2γkαkLkζ¯ζ12ζ¯2αkLkζ¯ζ12ktαζ¯ζ12ζ¯ζ.\underline{\zeta}\zeta-\tfrac{1}{2}\overline{\zeta}^{2}\gamma_{k}\alpha_{k}L_{k}\geq\underline{\zeta}\zeta-\tfrac{1}{2}\overline{\zeta}^{2}\alpha_{k}L_{k}\geq\underline{\zeta}\zeta-\tfrac{1}{2}k^{t_{\alpha}}\underline{\zeta}\zeta\geq\tfrac{1}{2}\underline{\zeta}\zeta. (13)

Combining (12) and (13), and since μk+1μk\mu_{k+1}\leq\mu_{k} implies ϕ~(xk+1,μk+1)ϕ~(xk+1,μk)\tilde{\phi}(x_{k+1},\mu_{k+1})\leq\tilde{\phi}(x_{k+1},\mu_{k}) (see the discussion following (7)), the desired conclusion follows. ∎∎

We are now prepared to prove our guarantee for the deterministic setting. For the sake of notational clarity, the following theorem introduces the parameters tμt_{\mu} and tθt_{\theta} that dictate the rates of decrease of {μk}\{\mu_{k}\} and {θk}\{\theta_{k}\}, respectively. However, the theorem requires tμ=tθt_{\mu}=t_{\theta}, and indeed one can see in the details of the proof that both tμtθt_{\mu}\leq t_{\theta} and tμtθt_{\mu}\geq t_{\theta} are required to prove the theorem.

Theorem 4.1.

Suppose that Assumptions 2.1, 3.1, and 4.1 hold and that Algorithm 1(D) is run with Parameter Rule 1. Suppose, in addition, that for some (tμ,tθ,tα)(,0)×(,0)×(,0](t_{\mu},t_{\theta},t_{\alpha})\in(-\infty,0)\times(-\infty,0)\times(-\infty,0] with tμ=tθt_{\mu}=t_{\theta} and tμ+tα[1,0)t_{\mu}+t_{\alpha}\in[-1,0) and for some μ1>0\mu_{1}\in\mathbb{R}_{>0} and θ0>0\theta_{0}\in\mathbb{R}_{>0} the algorithm employs μk=μ1ktμ\mu_{k}=\mu_{1}k^{t_{\mu}} and θk1=θ0ktθ\theta_{k-1}=\theta_{0}k^{t_{\theta}} for all kk\in\mathbb{N}. Then,

k=1γkαk=andlim infkPxϕ(xk,μk)22=0.\sum_{k=1}^{\infty}\gamma_{k}\alpha_{k}=\infty\ \ \text{and}\ \ \liminf_{k\to\infty}\|P\nabla_{x}\phi(x_{k},\mu_{k})\|_{2}^{2}=0. (14)

If, in addition, there exists 𝒦{\cal K}\subseteq\mathbb{N} with |𝒦|=|{\cal K}|=\infty such that

  • \bullet

    {Pxϕ(xk,μk)}k𝒦0\{P\nabla_{x}\phi(x_{k},\mu_{k})\}_{k\in{\cal K}}\to 0,

  • \bullet

    {xk}k𝒦x¯\{x_{k}\}_{k\in{\cal K}}\to\mkern 1.5mu\overline{\mkern-1.5mux} for some x¯\mkern 1.5mu\overline{\mkern-1.5mux}\in{\cal F}, and

  • \bullet

    at x¯\mkern 1.5mu\overline{\mkern-1.5mux} the linear independence constraint qualification (LICQ) holds with respect to problem (1) in the sense that with 𝒜¯:={i[m]:ci(x¯)=0}\bar{\cal A}:=\{i\in[m]:c_{i}(\mkern 1.5mu\overline{\mkern-1.5mux})=0\} the columns of ATA^{T} combined with the vectors in {ci(x¯)}i𝒜¯\{\nabla c_{i}(\mkern 1.5mu\overline{\mkern-1.5mux})\}_{i\in\bar{\cal A}} form a linearly independent set,

then x¯\mkern 1.5mu\overline{\mkern-1.5mux} is a KKT point for problem (1) in the sense that there exists a pair of Lagrange multipliers (y¯,z¯)l×m(\mkern 1.5mu\overline{\mkern-1.5muy},\mkern 1.5mu\overline{\mkern-1.5muz})\in\mathbb{R}^{l}\times\mathbb{R}^{m} such that (x¯,y¯,z¯)(\mkern 1.5mu\overline{\mkern-1.5mux},\mkern 1.5mu\overline{\mkern-1.5muy},\mkern 1.5mu\overline{\mkern-1.5muz}) satisfies (3).

Proof.

It follows from (7) that ϕ~\tilde{\phi} is bounded below by finff_{\inf} over 𝒳×0{\cal X}\times\mathbb{R}_{\geq 0}. Then, by summing the expression in Lemma 4.4 over kk\in\mathbb{N} and (6), one finds that

>ϕ~(x1,μ1)finf\displaystyle\infty>\tilde{\phi}(x_{1},\mu_{1})-f_{\inf} k=1(ϕ~(xk,μk)ϕ~(xk+1,μk+1))\displaystyle\geq\sum_{k=1}^{\infty}(\tilde{\phi}(x_{k},\mu_{k})-\tilde{\phi}(x_{k+1},\mu_{k+1}))
k=112ζ¯ζγkαkPxϕ(xk,μk)22.\displaystyle\geq\sum_{k=1}^{\infty}\tfrac{1}{2}\underline{\zeta}\zeta\gamma_{k}\alpha_{k}\|P\nabla_{x}\phi(x_{k},\mu_{k})\|_{2}^{2}.

To complete the proof of (14), let us now show that {γkαk}\{\gamma_{k}\alpha_{k}\} is unsummable. To begin, first observe that the lower bound on {γk}\{\gamma_{k}\} stated in Lemma 4.3 holds, i.e., there exist κ>0\kappa\in\mathbb{R}_{>0}, L¯>0\underline{L}\in\mathbb{R}_{>0}, and L>0L\in\mathbb{R}_{>0} such that, for all kk\in\mathbb{N}, one has

γk,minmin{1,κ+κ2+2L¯(ημkθk)αkβL,η¯θkαkβL}.\gamma_{k,\min}\geq\min\left\{1,\frac{-\kappa+\sqrt{\kappa^{2}+2\underline{L}(\eta\mu_{k}-\theta_{k})}}{\alpha_{k}\beta L},\frac{\underline{\eta}-\theta_{k}}{\alpha_{k}\beta L}\right\}.

Consequently, it holds for all kk\in\mathbb{N} that

γkαkγk,minαk\displaystyle\gamma_{k}\alpha_{k}\geq\gamma_{k,\min}\alpha_{k} min{αk,κ+κ2+2L¯(ημkθk)βL,η¯θkβL}\displaystyle\geq\min\left\{\alpha_{k},\frac{-\kappa+\sqrt{\kappa^{2}+2\underline{L}(\eta\mu_{k}-\theta_{k})}}{\beta L},\frac{\underline{\eta}-\theta_{k}}{\beta L}\right\}
=:min{αk,ρk,υk}.\displaystyle=:\min\{\alpha_{k},\rho_{k},\upsilon_{k}\}. (15)

Let us now consider the sequences {αk}\{\alpha_{k}\}, {ρk}\{\rho_{k}\}, and {υk}\{\upsilon_{k}\} in turn. Our aim is to show that each of these sequences is unsummable, which will complete the proof of (14). With respect to the step-size sequence {αk}\{\alpha_{k}\}, one finds by Parameter Rule 1, υ:=i=1m(Lciκci+κciLci)\upsilon:=\sum_{i=1}^{m}\left(L_{c_{i}}\kappa_{\nabla c_{i}}+\kappa_{c_{i}}L_{\nabla c_{i}}\right), and the conditions of the theorem that

αk=ktαζ¯ζζ¯2Lf+μ1θ02ktμktθ(k+1)kθυ.\alpha_{k}=\frac{k^{t_{\alpha}}\underline{\zeta}\zeta\overline{\zeta}^{-2}}{L_{\nabla f}+\mu_{1}\theta_{0}^{-2}k^{t_{\mu}}k^{-t_{\theta}}(k+1)^{-k_{\theta}}\upsilon}. (16)

Observe that tμ(,0)t_{\mu}\in(-\infty,0), tθ(,0)t_{\theta}\in(-\infty,0), and tμ=tθt_{\mu}=t_{\theta} show that ktμktθ(k+1)tθktμtθ(2k)tθ=2tθktμ2tθ=2tμktμk^{t_{\mu}}k^{-t_{\theta}}(k+1)^{-t_{\theta}}\leq k^{t_{\mu}-t_{\theta}}(2k)^{-t_{\theta}}=2^{-t_{\theta}}k^{t_{\mu}-2t_{\theta}}=2^{-t_{\mu}}k^{-t_{\mu}}. With (16) and tμ(,0)t_{\mu}\in(-\infty,0), this shows

αkktαζ¯ζζ¯2Lf+μ1θ022tμυktμ=ktμ+tαζ¯ζζ¯2Lfktμ+μ1θ022tμυktμ+tαζ¯ζζ¯2Lf+μ1θ022tμυ,\alpha_{k}\geq\frac{k^{t_{\alpha}}\underline{\zeta}\zeta\overline{\zeta}^{-2}}{L_{\nabla f}+\mu_{1}\theta_{0}^{-2}2^{-t_{\mu}}\upsilon k^{-t_{\mu}}}=\frac{k^{t_{\mu}+t_{\alpha}}\underline{\zeta}\zeta\overline{\zeta}^{-2}}{L_{\nabla f}k^{t_{\mu}}+\mu_{1}\theta_{0}^{-2}2^{-t_{\mu}}\upsilon}\geq\frac{k^{t_{\mu}+t_{\alpha}}\underline{\zeta}\zeta\overline{\zeta}^{-2}}{L_{\nabla f}+\mu_{1}\theta_{0}^{-2}2^{-t_{\mu}}\upsilon},

which since tμ+tα[1,0)t_{\mu}+t_{\alpha}\in[-1,0) implies that {αk}\{\alpha_{k}\} is unsummable, as desired. Now, with respect to {ρk}\{\rho_{k}\}, one finds under the conditions of the theorem that

ρk=κ+κ2+2L¯(ημ1ktμθ0(k+1)tθ)βL.\rho_{k}=\frac{-\kappa+\sqrt{\kappa^{2}+2\underline{L}(\eta\mu_{1}k^{t_{\mu}}-\theta_{0}(k+1)^{t_{\theta}})}}{\beta L}. (17)

Observe that tμ(,0)t_{\mu}\in(-\infty,0) and tμ=tθt_{\mu}=t_{\theta} show that

ημ1ktμθ0(k+1)tθημ1ktμθ0ktθ=(ημ1θ0)ktμ.\eta\mu_{1}k^{t_{\mu}}-\theta_{0}(k+1)^{t_{\theta}}\geq\eta\mu_{1}k^{t_{\mu}}-\theta_{0}k^{t_{\theta}}=(\eta\mu_{1}-\theta_{0})k^{t_{\mu}}.

In fact, tμ[1,0)t_{\mu}\in[-1,0), so the bound above along with (17) shows that {ρk}\{\rho_{k}\} is unsummable, as desired. Finally, {υk}\{\upsilon_{k}\} can be seen to be unsummable since υk=η¯θk>η¯θ0\upsilon_{k}=\underline{\eta}-\theta_{k}>\underline{\eta}-\theta_{0} for all kk\in\mathbb{N}. Having reached the desired conclusion that {αk}\{\alpha_{k}\}, {ρk}\{\rho_{k}\}, and {υk}\{\upsilon_{k}\} are unsummable, it follows through (15) that {γkαk}\{\gamma_{k}\alpha_{k}\} is unsummable, which, as previously mentioned, completes the proof of (14).

Now suppose that there exists an infinite-cardinality set 𝒦{\cal K}\subseteq\mathbb{N} as described in the theorem. By construction of the algorithm, it follows that Axk=bAx_{k}=b and c(xk)<0c(x_{k})<0 for all kk\in\mathbb{N}. Now, for all k𝒦k\in{\cal K}, define the Lagrange multiplier estimates

zk:=μkdiag(c(xk))1𝟙andyk:=(AAT)1A(f(xk)+c(xk)zk).z_{k}:=-\mu_{k}\operatorname{diag}(c(x_{k}))^{-1}\mathds{1}\ \ \text{and}\ \ y_{k}:=-(AA^{T})^{-1}A(\nabla f(x_{k})+\nabla c(x_{k})z_{k}). (18)

Since μk>0\mu_{k}>0 and c(xk)<0c(x_{k})<0 for all kk\in\mathbb{N}, it follows that zk0z_{k}\geq 0 for all kk\in\mathbb{N}. In addition, the fact that {Pϕ(xk,μk)}k𝒦0\{P\nabla\phi(x_{k},\mu_{k})\}_{k\in{\cal K}}\to 0 shows that

{f(xk)+ATyk+c(xk)zk2}k𝒦\displaystyle\ \{\|\nabla f(x_{k})+A^{T}y_{k}+\nabla c(x_{k})z_{k}\|_{2}\}_{k\in{\cal K}}
=\displaystyle= {(f(xk)+c(xk)zk)AT(AAT)1A(f(xk)+c(xk)zk)2}k𝒦\displaystyle\ \{\|(\nabla f(x_{k})+\nabla c(x_{k})z_{k})-A^{T}(AA^{T})^{-1}A(\nabla f(x_{k})+\nabla c(x_{k})z_{k})\|_{2}\}_{k\in{\cal K}}
=\displaystyle= {Pϕ(xk,μk)2}k𝒦0.\displaystyle\ \{\|P\nabla\phi(x_{k},\mu_{k})\|_{2}\}_{k\in{\cal K}}\to 0. (19)

Consider 𝒜¯={i[m]:ci(x¯)=0}\bar{\cal A}=\{i\in[m]:c_{i}(\mkern 1.5mu\overline{\mkern-1.5mux})=0\}. Since c(xk)zk=μk𝟙c(x_{k})\circ z_{k}=-\mu_{k}\mathds{1} for all kk\in\mathbb{N} and {μk}0\{\mu_{k}\}\searrow 0, it follows that {[zk]i}k𝒦0\{[z_{k}]_{i}\}_{k\in{\cal K}}\to 0 for all i𝒜¯i\not\in\bar{\cal A}. Combining this with (19) and using the fact that {ci}k𝒦\{\nabla c_{i}\}_{k\in{\cal K}} is bounded for all i[m]i\in[m] under Assumption 4.1,

{f(xk)+ATyk+i𝒜¯ci(xk)[zk]i2}k𝒦0.\bigg{\{}\bigg{\|}\nabla f(x_{k})+A^{T}y_{k}+\sum_{i\in\bar{\cal A}}\nabla c_{i}(x_{k})[z_{k}]_{i}\bigg{\|}_{2}\bigg{\}}_{k\in{\cal K}}\to 0.

Under the LICQ, it follows from this limit that {(yk,zk)}k𝒦\{(y_{k},z_{k})\}_{k\in{\cal K}} converges to some pair (y¯,z¯)(\mkern 1.5mu\overline{\mkern-1.5muy},\mkern 1.5mu\overline{\mkern-1.5muz}) such that (x¯,y¯,z¯)(\mkern 1.5mu\overline{\mkern-1.5mux},\mkern 1.5mu\overline{\mkern-1.5muy},\mkern 1.5mu\overline{\mkern-1.5muz}) satisfies (3), as desired. ∎∎

4.2 Stochastic Setting

We now prove convergence guarantees for Algorithm 1 in a stochastic setting, i.e., when qk=gkμkc(xk)diag(c(xk))1𝟙q_{k}=g_{k}-\mu_{k}\nabla c(x_{k})\operatorname{diag}(c(x_{k}))^{-1}\mathds{1} (see (4)) for all kk\in\mathbb{N} where gkg_{k} is a stochastic estimate of f(xk)\nabla f(x_{k}). Formally, let us consider the probability space (Ω,𝒢,)(\Omega,{\cal G},\mathbb{P}), where Ω\Omega captures all outcomes of a run of Algorithm 1. As mentioned shortly, we make assumptions that guarantee that each iteration is well defined, which implies that each such outcome corresponds to an infinite sequence of iterates. In this manner, each realization of a run of the algorithm can be associated with ωΩ\omega\in\Omega, an infinite-dimensional tuple whose kkth element determines the stochastic gradient-of-the-objective estimate in iteration kk\in\mathbb{N}. The stochastic process defined by the algorithm can thus be expressed as

{(Xk(ω),Gk(ω),Qk(ω),Dk(ω),𝒜k(ω),Γk(ω))},\{(X_{k}(\omega),G_{k}(\omega),Q_{k}(\omega),D_{k}(\omega),{\cal A}_{k}(\omega),\Gamma_{k}(\omega))\},

where, for all kk\in\mathbb{N}, the random variables are the iterate Xk(ω)X_{k}(\omega), stochastic objective-gradient estimator Gk(ω)G_{k}(\omega), stochastic gradient estimator

Qk(ω):=Gk(ω)μkc(Xk(ω))diag(c(Xk(ω)))1𝟙,Q_{k}(\omega):=G_{k}(\omega)-\mu_{k}\nabla c(X_{k}(\omega))\operatorname{diag}(c(X_{k}(\omega)))^{-1}\mathds{1}, (20)

direction Dk(ω)D_{k}(\omega), step size 𝒜k(ω){\cal A}_{k}(\omega), and neighborhood-enforcement parameter Γk(ω)\Gamma_{k}(\omega). Observe that the statement of Algorithm 1 on page 1 instantiates a particular realization of a run, expressed as {(xk,gk,qk,dk,αk,γk)}\{(x_{k},g_{k},q_{k},d_{k},\alpha_{k},\gamma_{k})\}.

Given the initial conditions of the algorithm (including for simplicity that X1(ω)=x1X_{1}(\omega)=x_{1} for all ωΩ\omega\in\Omega), let 𝒢1{\cal G}_{1} denote the sub-σ\sigma-algebra of 𝒢{\cal G} corresponding to the initial conditions and, for all k{1}k\in\mathbb{N}\setminus\{1\}, let 𝒢k{\cal G}_{k} denote the sub-σ\sigma-algebra of 𝒢{\cal G} corresponding to the initial conditions and {G1(ω),,Gk1(ω)}\{G_{1}(\omega),\dots,G_{k-1}(\omega)\}. In this manner, the sequence {𝒢k}\{{\cal G}_{k}\} is a filtration. For our analysis, with respect to this filtration, we make the following assumption. For the sake of brevity, here and for the remainder of our analysis, we drop from our notation the dependence of the stochastic process on an element ω\omega of Ω\Omega, since this dependence remains obvious.

Assumption 4.2.

For all kk\in\mathbb{N}, it holds that 𝔼[Gk|𝒢k]=f(Xk)\mathbb{E}[G_{k}|{\cal G}_{k}]=\nabla f(X_{k}). In addition, there exists σ0\sigma\in\mathbb{R}_{\geq 0} such that, for all kk\in\mathbb{N}, one has P(Gkf(Xk))2σ\|P(G_{k}-\nabla f(X_{k}))\|_{2}\leq\sigma.

Assumption 4.2 amounts to the stochastic-gradient estimators being unbiased with bounded error. This is consistent with the stochastic setting in [7]. Looser assumptions are possible for stochastic-gradient-based methods in the unconstrained setting, but for the context of stochastic-gradient-based interior-point methods we know of no analysis that can avoid a bounded-error assumption.

We also carry forward our prior assumptions, namely, Assumptions 2.1, 3.1, and 4.1. However, we strengthen Assumption 3.1 somewhat to impose additional structure on the manner in which the search direction is defined. In particular, for all kk\in\mathbb{N}, we define the search direction DkD_{k} through the linear system

[HkATA0][DkYk]=[Qk0],\begin{bmatrix}H_{k}&A^{T}\\ A&0\end{bmatrix}\begin{bmatrix}D_{k}\\ Y_{k}\end{bmatrix}=-\begin{bmatrix}Q_{k}\\ 0\end{bmatrix}, (21)

where the sequence {Hk}\{H_{k}\} satisfies the following assumption.

Assumption 4.3.

For all kk\in\mathbb{N}, the matrix Hk𝕊0nH_{k}\in\mathbb{S}_{\succeq 0}^{n} is 𝒢k{\cal G}_{k}-measurable and DkD_{k} satisfies (21). In addition, there exist constants (λ¯,λ¯)(0,1]×>0(\underline{\lambda},\overline{\lambda})\in(0,1]\times\mathbb{R}_{>0} with λ¯λ¯\underline{\lambda}\leq\overline{\lambda} such that, over any run of Algorithm 1, one finds uTHku/u2[λ¯,λ¯]u^{T}H_{k}u/\|u\|^{2}\in[\underline{\lambda},\overline{\lambda}] for all uNull(A)u\in\operatorname{Null}(A).

All combined, Assumptions 3.1 and 4.3 amount to the assumption that DkD_{k} is computed by (21) with a choice of HkH_{k} such that (5e) holds. Here, we are remarking on the fact that, under Assumption 4.3, the search direction satisfies the null-space and angle conditions imposed by the combination of (5a)–(5d). Indeed, one finds

Dk=Z(ZTHkZ)1ZTQk,D_{k}=-Z(Z^{T}H_{k}Z)^{-1}Z^{T}Q_{k}, (22)

where Zn×(nl)Z\in\mathbb{R}^{n\times(n-l)} is an orthogonal matrix whose columns form an orthonormal basis for Null(A)\operatorname{Null}(A), through which one can in turn show that ADk=0AD_{k}=0 along with

Dk2[λ¯1PQk2,λ¯1PQk2]and(PQk)TDkλ¯λ¯1PQk2Dk2.\|D_{k}\|_{2}\in[\overline{\lambda}^{-1}\|PQ_{k}\|_{2},\underline{\lambda}^{-1}\|PQ_{k}\|_{2}]\ \ \text{and}\ \ -(PQ_{k})^{T}D_{k}\geq\underline{\lambda}\overline{\lambda}^{-1}\|PQ_{k}\|_{2}\|D_{k}\|_{2}. (23)

Hence, we employ (λ¯,λ¯)(\underline{\lambda},\overline{\lambda}) and (23) in place of (ζ¯,ζ¯,ζ)(\underline{\zeta},\overline{\zeta},\zeta) and (5b)–(5d) in this section. For our analysis, it is worthwhile to emphasize that the tuple of parameters (η,η¯,λ¯,λ¯)(\eta,\bar{\eta},\underline{\lambda},\overline{\lambda}) is presumed to be determined uniquely for any given instance of problem (1). Therefore, the parameters in Assumption 3.1 (i.e., in (5)) and (as explicitly stated) in Assumption 4.3 are presumed to be uniform over all possible runs of the algorithm. Similarly, for Assumption 4.1 in this section, the set 𝒳{\cal X} and stated constants are assumed to be uniform over all possible realizations of {Xk}\{X_{k}\}.

Our convergence guarantees for this stochastic setting rely on Parameter Rule 2, stated below, that we employ for choosing step-size- and neighborhood-enforcement parameter sequences, namely, {αk}\{\alpha_{k}\} and {(γk,min,γk,γk,max)}\{(\gamma_{k,\min},\gamma_{k},\gamma_{k,\max})\}. Like in the deterministic setting, in this stochastic setting, the step sizes depend on a parameter tα(,0)t_{\alpha}\in(-\infty,0), with one difference being that the choice tα=0t_{\alpha}=0 is not allowed for the stochastic setting. However, unlike in the deterministic setting, in this stochastic setting we prescribe a particular dependence of the neighborhood-enforcement values {(γk,min,γk,γk,max)}\{(\gamma_{k,\min},\gamma_{k},\gamma_{k,\max})\} on the barrier-parameter sequence {μk}\{\mu_{k}\} and neighborhood-parameter sequence {θk}\{\theta_{k}\}, which in turn are determined by parameters (tμ,tθ)(,0)×(,0)(t_{\mu},t_{\theta})\in(-\infty,0)\times(-\infty,0). Whereas in the deterministic setting our ultimate requirements for the tuple (tμ,tθ,tα)(t_{\mu},t_{\theta},t_{\alpha}) (specified between Parameter Rule 1 and Theorem 4.1) were that tμ=tθ<0t_{\mu}=t_{\theta}<0, tα0t_{\alpha}\leq 0, and tμ+tα[1,0)t_{\mu}+t_{\alpha}\in[-1,0), the requirements for these parameters is stricter in this stochastic setting. For the sake of simplicity, rather than introduce separate parameters tμt_{\mu} and tθt_{\theta} for {μk}\{\mu_{k}\} and {θk}\{\theta_{k}\}, respectively, in Parameter Rule 2 we introduce a single parameter tt. This is reasonable since, even in the deterministic setting, our analysis requires tμ=tθt_{\mu}=t_{\theta}. It is worthwhile to emphasize upfront that the restrictions of the rule are satisfiable; e.g., one may consider t=0.99t=-0.99 and tα=0.01t_{\alpha}=-0.01 as one acceptable setting.

Parameter 2.

Suppose the following are given:

  • \bullet

    (t,tα)(,0)2(t,t_{\alpha})\in(-\infty,0)^{2} such that t+tα[1,0)t+t_{\alpha}\in[-1,0), 2t+tα<12t+t_{\alpha}<-1, and t+2tα<1t+2t_{\alpha}<-1;

  • \bullet

    μ1>0\mu_{1}\in\mathbb{R}_{>0} and θ0>0\theta_{0}\in\mathbb{R}_{>0}; and

  • \bullet

    γbuff0\gamma_{\mathrm{buff}}\in\mathbb{R}_{\geq 0} and {γk,buff}0\{\gamma_{k,\mathrm{buff}}\}\subset\mathbb{R}_{\geq 0} such that γk,buffγbuffkt\gamma_{k,\mathrm{buff}}\leq\gamma_{\mathrm{buff}}k^{t} for all kk\in\mathbb{N}.

Algorithm 1 employs, for all kk\in\mathbb{N}, the following:

  • \bullet

    μk=μ1kt\mu_{k}=\mu_{1}k^{t} and θk1=θ0kt\theta_{k-1}=\theta_{0}k^{t};

  • \bullet

    αkktα(λ¯2λ¯Lk)\displaystyle\alpha_{k}\leftarrow k^{t_{\alpha}}\left(\frac{\underline{\lambda}^{2}}{\overline{\lambda}L_{k}}\right), where Lk>0L_{k}\in\mathbb{R}_{>0} is defined as in Parameter Rule 1; and

  • \bullet

    γkmini[m]γk,i\gamma_{k}\leftarrow\min_{i\in[m]}\gamma_{k,i} where

    γk,i\displaystyle\gamma_{k,i} min{γ~k,i,γk,max}for all i[m],\displaystyle\leftarrow\min\{\tilde{\gamma}_{k,i},\gamma_{k,\max}\}\ \ \text{for all $i\in[m]$},
    γ~k,i\displaystyle\tilde{\gamma}_{k,i} ci(xk)Tdk+(ci(xk)Tdk)2+2Lcidk22(ci(xk)θk)Lciαkdk22\displaystyle\leftarrow\frac{-\nabla c_{i}(x_{k})^{T}d_{k}+\sqrt{(\nabla c_{i}(x_{k})^{T}d_{k})^{2}+2L_{\nabla c_{i}}\|d_{k}\|_{2}^{2}(-c_{i}(x_{k})-\theta_{k})}}{L_{\nabla c_{i}}\alpha_{k}\|d_{k}\|_{2}^{2}}
    for all i[m],\displaystyle\qquad\text{for all $i\in[m]$},
    βσ\displaystyle\beta^{\sigma} λ¯1(κf+σ+μ1θ0j[m]κcj),\displaystyle\leftarrow\underline{\lambda}^{-1}\left(\kappa_{\nabla f}+\sigma+\frac{\mu_{1}}{\theta_{0}}\sum_{j\in[m]}\kappa_{\nabla c_{j}}\right),
    γk,i,min\displaystyle\gamma_{k,i,\min} min{1,κci+κci2+2Lci(ημkθk)αkβσLci,η¯θkαkβσLci}\displaystyle\leftarrow\min\left\{1,\frac{-\kappa_{\nabla c_{i}}+\sqrt{\kappa_{\nabla c_{i}}^{2}+2L_{\nabla c_{i}}(\eta\mu_{k}-\theta_{k})}}{\alpha_{k}\beta^{\sigma}L_{\nabla c_{i}}},\frac{\underline{\eta}-\theta_{k}}{\alpha_{k}\beta^{\sigma}L_{\nabla c_{i}}}\right\}
    for all i[m],\displaystyle\qquad\text{for all $i\in[m]$},
    γk,min\displaystyle\gamma_{k,\min} mini[m]γk,i,min,and\displaystyle\leftarrow\min_{i\in[m]}\gamma_{k,i,\min},\ \ \text{and}
    γk,max\displaystyle\gamma_{k,\max} min{1,γk,min+γk,buff}.\displaystyle\leftarrow\min\{1,\gamma_{k,\min}+\gamma_{k,\mathrm{buff}}\}.

The strategy for selecting {αk}\{\alpha_{k}\} in Parameter Rule 2 is consistent with that in Parameter Rule 1. Observe that since {μk}\{\mu_{k}\} and {θk}\{\theta_{k}\} are set in advance of any run of the algorithm, it follows under Parameter Rule 2 that {Lk}\{L_{k}\} and so {αk}\{\alpha_{k}\} are determined in advance of a run of the algorithm as well. Thus, our analysis can consider {𝒜k}={αk}\{{\cal A}_{k}\}=\{\alpha_{k}\}, where {αk}\{\alpha_{k}\} is a prescribed sequence that is uniform over all runs of Algorithm 1 employed for any given instance of problem (1). The choice of {(γk,min,γk,γk,max)}\{(\gamma_{k,\min},\gamma_{k},\gamma_{k,\max})\} in Parameter Rule 2, on the other hand, is more stringent than that in Parameter Rule 1. Observe that Parameter Rule 1 amounts to choosing γkmini[m]γk,i\gamma_{k}\leftarrow\min_{i\in[m]}\gamma_{k,i} for all kk\in\mathbb{N}, where γk,imin{γ~k,i,1}\gamma_{k,i}\leftarrow\min\{\tilde{\gamma}_{k,i},1\} for all (k,i)×[m](k,i)\in\mathbb{N}\times[m]. The choice in Parameter Rule 2 is similar, but with γk,imin{γ~k,i,γk,max}\gamma_{k,i}\leftarrow\min\{\tilde{\gamma}_{k,i},\gamma_{k,\max}\} for all (k,i)×[m](k,i)\in\mathbb{N}\times[m], where for any given kk\in\mathbb{N} the upper limit γk,max(0,1]\gamma_{k,\max}\in(0,1] might be less than 1. The particular strategy for setting γk,max\gamma_{k,\max} for all kk\in\mathbb{N} in Parameter Rule 2 involves first setting a lower value γk,min>0\gamma_{k,\min}\in\mathbb{R}_{>0} for each kk\in\mathbb{N} that is defined only by values that are set in advance of a run of the algorithm. This fact is critical, since it ensures that {(γk,min,γk,max)}\{(\gamma_{k,\min},\gamma_{k,\max})\} is prescribed and uniform over all runs of Algorithm 1 employed for any given instance of problem (1). For each (k,i)×[m](k,i)\in\mathbb{N}\times[m], the values γ~k,i\tilde{\gamma}_{k,i} and γk\gamma_{k} may differ between runs, but importantly the sequence {(γk,min,γk,max)}\{(\gamma_{k,\min},\gamma_{k,\max})\} does not.

Many of our prior results from the deterministic setting carry over here to our present analysis for the stochastic setting. Firstly, Lemma 3.1 holds, as does Lemma 4.1 since it has been proved irrespective of any particular algorithm. Secondly, following the arguments in Lemmas 4.2 and 4.3, it follows that Parameter Rule 2 ensures that γk,minγkγk,max\gamma_{k,\min}\leq\gamma_{k}\leq\gamma_{k,\max} for all kk\in\mathbb{N} in any given run of the algorithm. We formalize this in Lemma 4.5 below, the proof of which is omitted since it would follow the same lines of arguments as in the proofs of Lemmas 4.2 and 4.3. The only significant difference that would be in the proof is that the bound Pf(xk)2κf\|P\nabla f(x_{k})\|_{2}\leq\kappa_{\nabla f} needs to be replaced by PGk2κf+σ\|PG_{k}\|_{2}\leq\kappa_{\nabla f}+\sigma for all kk\in\mathbb{N}, which follows under Assumption 4.2. This is the reason that βσ\beta^{\sigma} in Parameter Rule 2 involves an additional σ\sigma term, as opposed to β\beta in Lemma 4.3, which does not. This affects the denominators in the expression for γk,i,min\gamma_{k,i,\min} for all (k,i)×[m](k,i)\in\mathbb{N}\times[m]. Otherwise, γk,i,min\gamma_{k,i,\min} is consistent between Lemma 4.3 and Parameter Rule 2.

Lemma 4.5.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then, for all kk\in\mathbb{N}, the choice of Γk(0,γk,max]\Gamma_{k}\in(0,\gamma_{k,\max}] in Parameter Rule 2 ensures that [Xk,Xk+ΓkαkDk]𝒩(θk)[X_{k},X_{k}+\Gamma_{k}\alpha_{k}D_{k}]\subseteq{\cal N}(\theta_{k}), i.e., the condition in Step 6 of Algorithm 1 holds. Consequently, under Parameter Rule 2, Algorithm 1 is well defined in the sense that it will generate an infinite sequence of iterates. Moreover, Parameter Rule 2 ensures that Dk2βσ\|D_{k}\|_{2}\leq\beta^{\sigma} and Γk[γk,min,γk,max]\Gamma_{k}\in[\gamma_{k,\min},\gamma_{k,\max}] for all kk\in\mathbb{N}.

Now let us turn to results for the stochastic setting that are distinct from those for the deterministic setting. Our first main goal is to prove an upper bound on the expected decrease in the shifted barrier-augmented function that occurs with each iteration. Toward this end, let us begin with the following preliminary bound.

Lemma 4.6.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then, for all kk\in\mathbb{N}, one finds that

ϕ~(Xk+1,μk+1)ϕ~(Xk,μk)\displaystyle\ \tilde{\phi}(X_{k+1},\mu_{k+1})-\tilde{\phi}(X_{k},\mu_{k})
\displaystyle\leq Γkαkλ¯1Pxϕ~(Xk,μk)22+12Γk2αk2Lkλ¯2PQk22\displaystyle\ -\Gamma_{k}\alpha_{k}\overline{\lambda}^{-1}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{1}{2}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\|PQ_{k}\|_{2}^{2}
+Γkαkxϕ~(Xk,μk)TZ(ZTHkZ)1ZT(xϕ~(Xk,μk)Qk).\displaystyle\ +\Gamma_{k}\alpha_{k}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k}).
Proof.

Consider arbitrary kk\in\mathbb{N}. By Lemmas 4.1 and 4.5, (6), (5), (22), and the fact that ZTxϕ~(xk,μk)2=Pxϕ~(xk,μk)2\|Z^{T}\nabla_{x}\tilde{\phi}(x_{k},\mu_{k})\|_{2}=\|P\nabla_{x}\tilde{\phi}(x_{k},\mu_{k})\|_{2}, it follows with Lk>0L_{k}\in\mathbb{R}_{>0} defined as in Parameter Rule 2 (and 1) that

ϕ~(Xk+1,μk)ϕ~(Xk,μk)\displaystyle\ \tilde{\phi}(X_{k+1},\mu_{k})-\tilde{\phi}(X_{k},\mu_{k})
\displaystyle\leq xϕ~(Xk,μk)T(Xk+1Xk)+12LkXk+1Xk22\displaystyle\ \nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}(X_{k+1}-X_{k})+\tfrac{1}{2}L_{k}\|X_{k+1}-X_{k}\|_{2}^{2}
=\displaystyle= Γkαkxϕ~(Xk,μk)TDk+12Γk2αk2LkDk22\displaystyle\ \Gamma_{k}\alpha_{k}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}D_{k}+\tfrac{1}{2}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\|D_{k}\|_{2}^{2}
=\displaystyle= Γkαkxϕ~(Xk,μk)TZ(ZTHkZ)1ZTQk+12Γk2αk2LkZ(ZTHkZ)1ZTQk22\displaystyle\ -\Gamma_{k}\alpha_{k}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}Q_{k}+\tfrac{1}{2}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\|Z(Z^{T}H_{k}Z)^{-1}Z^{T}Q_{k}\|_{2}^{2}
=\displaystyle= ΓkαkZTxϕ~(Xk,μk)(ZHkZ)12+12Γk2αk2Lk(ZTHkZ)1ZTQk22\displaystyle\ -\Gamma_{k}\alpha_{k}\|Z^{T}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{(ZH_{k}Z)^{-1}}^{2}+\tfrac{1}{2}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\|(Z^{T}H_{k}Z)^{-1}Z^{T}Q_{k}\|_{2}^{2}
+Γkαkxϕ~(Xk,μk)TZ(ZTHkZ)1ZT(xϕ~(Xk,μk)Qk)\displaystyle\ +\Gamma_{k}\alpha_{k}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k})
\displaystyle\leq Γkαkλ¯1Pxϕ~(Xk,μk)22+12Γk2αk2Lkλ¯2PQk22\displaystyle\ -\Gamma_{k}\alpha_{k}\overline{\lambda}^{-1}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{1}{2}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\|PQ_{k}\|_{2}^{2}
+Γkαkxϕ~(Xk,μk)TZ(ZTHkZ)1ZT(xϕ~(Xk,μk)Qk).\displaystyle\ +\Gamma_{k}\alpha_{k}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k}).

Finally, since μk+1μk\mu_{k+1}\leq\mu_{k} implies ϕ~(Xk+1,μk+1)ϕ~(Xk+1,μk)\tilde{\phi}(X_{k+1},\mu_{k+1})\leq\tilde{\phi}(X_{k+1},\mu_{k}) (see the discussion following (7)), the desired conclusion follows. ∎∎

Lemma 4.6 shows that the change in the shifted barrier-augmented function with each iteration is bounded above by the sum of a negative term and two terms that may be considered noise terms. The goal of our next two lemmas is to bound these noise terms. We start with the second term on the right-hand side in Lemma 4.6.

Lemma 4.7.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then, for all kk\in\mathbb{N}, one finds that

12Γk2αk2Lkλ¯2PQk2234Γk2αk2Lkλ¯2Pxϕ~(Xk,μk)22+32αk2Lkλ¯2σ2.\tfrac{1}{2}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\|PQ_{k}\|_{2}^{2}\leq\tfrac{3}{4}\Gamma_{k}^{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{3}{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\sigma^{2}.
Proof.

Consider arbitrary kk\in\mathbb{N}. Observe that, for any (a,b)n×n(a,b)\in\mathbb{R}^{n}\times\mathbb{R}^{n},

0P(12ab)22=14Pa22aTPTPb+Pb22,0\leq\|P(\tfrac{1}{2}a-b)\|_{2}^{2}=\tfrac{1}{4}\|Pa\|_{2}^{2}-a^{T}P^{T}Pb+\|Pb\|_{2}^{2},

from which it follows that

P(a+b)22=Pa22+2aTPTPb+Pb2232Pa22+3Pb22.\|P(a+b)\|_{2}^{2}=\|Pa\|_{2}^{2}+2a^{T}P^{T}Pb+\|Pb\|_{2}^{2}\leq\tfrac{3}{2}\|Pa\|_{2}^{2}+3\|Pb\|_{2}^{2}.

Consequently, with (6) it holds that

12PQk22=\displaystyle\tfrac{1}{2}\|PQ_{k}\|_{2}^{2}= 12P(xϕ~(Xk,μk)+Qkxϕ~(Xk,μk))22\displaystyle\ \tfrac{1}{2}\|P(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})+Q_{k}-\nabla_{x}\tilde{\phi}(X_{k},\mu_{k}))\|_{2}^{2}
\displaystyle\leq 34Pxϕ~(Xk,μk)22+32P(Qkxϕ~(Xk,μk))22\displaystyle\ \tfrac{3}{4}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{3}{2}\|P(Q_{k}-\nabla_{x}\tilde{\phi}(X_{k},\mu_{k}))\|_{2}^{2}
\displaystyle\leq 34Pxϕ~(Xk,μk)22+32σ2.\displaystyle\ \tfrac{3}{4}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{3}{2}\sigma^{2}.

Therefore, since Γk1\Gamma_{k}\leq 1 under Parameter Rule 2, the proof is complete. ∎∎

The next lemma provides an upper bound on a conditional expectation of the last term on the right-hand side of the inequality in Lemma 4.6.

Lemma 4.8.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then, for all kk\in\mathbb{N}, one finds that

𝔼[Γkαkxϕ~(Xk,μk)TZ(ZTHkZ)1ZT(xϕ~(Xk,μk)Qk)|𝒢k]γk,buffαkBσ,\mathbb{E}[\Gamma_{k}\alpha_{k}\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k})|{\cal G}_{k}]\leq\gamma_{k,\mathrm{buff}}\alpha_{k}B^{\sigma},

where

Bσ:=λ¯1(κf+μ1θ0j[m]κcj)σ.B^{\sigma}:=\underline{\lambda}^{-1}\left(\kappa_{\nabla f}+\frac{\mu_{1}}{\theta_{0}}\sum_{j\in[m]}\kappa_{\nabla c_{j}}\right)\sigma.
Proof.

Consider arbitrary kk\in\mathbb{N}. The inner product in the expected value of interest may be nonnegative or nonpositive. Hence, let us invoke the Law of Total Expectation in order to handle each case separately. Let k{\cal I}_{k} be the event that Ik0I_{k}\geq 0, where Ik:=xϕ~(Xk,μk)TZ(ZTHkZ)1ZT(xϕ~(Xk,μk)Qk)I_{k}:=\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k}), and let kc{\cal I}_{k}^{c} be the complementary event that Ik<0I_{k}<0. By Parameter Rule 2, the fact that 𝔼[Ik|𝒢k]=0\mathbb{E}[I_{k}|{\cal G}_{k}]=0 under Assumption 4.2, and Lemma 4.5, it follows that

𝔼[ΓkαkIk|𝒢k]\displaystyle\ \mathbb{E}[\Gamma_{k}\alpha_{k}I_{k}|{\cal G}_{k}]
=\displaystyle= 𝔼[ΓkαkIk|𝒢kk][k|𝒢k]+𝔼[ΓkαkIk|𝒢kkc][kc|𝒢k]\displaystyle\ \mathbb{E}[\Gamma_{k}\alpha_{k}I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]+\mathbb{E}[\Gamma_{k}\alpha_{k}I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}^{c}]\mathbb{P}[{\cal I}_{k}^{c}|{\cal G}_{k}]
\displaystyle\leq γk,maxαk𝔼[Ik|𝒢kk][k|𝒢k]+γk,minαk𝔼[Ik|𝒢kkc][kc|𝒢k]\displaystyle\ \gamma_{k,\max}\alpha_{k}\mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]+\gamma_{k,\min}\alpha_{k}\mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}^{c}]\mathbb{P}[{\cal I}_{k}^{c}|{\cal G}_{k}]
\displaystyle\leq γk,minαk(𝔼[Ik|𝒢kk][k|𝒢k]+𝔼[Ik|𝒢kkc][kc|𝒢k])\displaystyle\ \gamma_{k,\min}\alpha_{k}(\mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]+\mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}^{c}]\mathbb{P}[{\cal I}_{k}^{c}|{\cal G}_{k}])
+γk,buffαk𝔼[Ik|𝒢kk][k|𝒢k]\displaystyle\ +\gamma_{k,\mathrm{buff}}\alpha_{k}\mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]
=\displaystyle= γk,buffαk𝔼[Ik|𝒢kk][k|𝒢k].\displaystyle\ \gamma_{k,\mathrm{buff}}\alpha_{k}\mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]. (24)

Furthermore, by Assumptions 2.14.3, the Cauchy–Schwarz inequality, the triangle inequality, P21\|P\|_{2}\leq 1, and the fact that ZTq2=Pq2\|Z^{T}q\|_{2}=\|Pq\|_{2} for any qnq\in\mathbb{R}^{n} one has

𝔼[Ik|𝒢kk][k|𝒢k]\displaystyle\ \mathbb{E}[I_{k}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]
=\displaystyle= 𝔼[xϕ~(Xk,μk)TZ(ZTHkZ)1ZT(xϕ~(Xk,μk)Qk)|𝒢kk][k|𝒢k]\displaystyle\ \mathbb{E}[\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})^{T}Z(Z^{T}H_{k}Z)^{-1}Z^{T}(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k})|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]
\displaystyle\leq λ¯1𝔼[Pxϕ~(Xk,μk)2P(xϕ~(Xk,μk)Qk)2|𝒢kk][k|𝒢k]\displaystyle\ \underline{\lambda}^{-1}\mathbb{E}[\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}\|P(\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})-Q_{k})\|_{2}|{\cal G}_{k}\wedge{\cal I}_{k}]\mathbb{P}[{\cal I}_{k}|{\cal G}_{k}]
\displaystyle\leq λ¯1(κf+μ1θ0j[m]κcj)σ.\displaystyle\ \underline{\lambda}^{-1}\left(\kappa_{\nabla f}+\frac{\mu_{1}}{\theta_{0}}\sum_{j\in[m]}\kappa_{\nabla c_{j}}\right)\sigma.

Combining this inequality with (24) yields the desired conclusion. ∎∎

Combining the preliminary bound in Lemma 4.6 with the previous two lemmas, we are now prepared to prove an upper bound on the expected decrease in the shifted barrier-augmented function that occurs with each iteration.

Lemma 4.9.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then, for all kk\in\mathbb{N}, one finds that

𝔼[ϕ~(Xk+1,μk+1)|𝒢k]ϕ~(Xk,μk)\displaystyle\ \mathbb{E}[\tilde{\phi}(X_{k+1},\mu_{k+1})|{\cal G}_{k}]-\tilde{\phi}(X_{k},\mu_{k})
\displaystyle\leq 14γk,minαkλ¯1Pxϕ~(Xk,μk)22+32αk2Lkλ¯2σ2+γk,buffαkBσ,\displaystyle\ -\tfrac{1}{4}\gamma_{k,\min}\alpha_{k}\overline{\lambda}^{-1}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{3}{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\sigma^{2}+\gamma_{k,\mathrm{buff}}\alpha_{k}B^{\sigma},

where BσB^{\sigma} is defined as in Lemma 4.8.

Proof.

Consider arbitrary kk\in\mathbb{N}. One finds from Lemmas 4.6, 4.7, and 4.8 that

𝔼[ϕ~(Xk+1,μk+1)|𝒢k]ϕ~(Xk,μk)\displaystyle\ \mathbb{E}[\tilde{\phi}(X_{k+1},\mu_{k+1})|{\cal G}_{k}]-\tilde{\phi}(X_{k},\mu_{k})
\displaystyle\leq Γkαk(λ¯34ΓkαkLkλ¯2)Pxϕ~(Xk,μk)22+32αk2Lkλ¯2σ2+γk,buffαkBσ.\displaystyle\ -\Gamma_{k}\alpha_{k}(\overline{\lambda}-\tfrac{3}{4}\Gamma_{k}\alpha_{k}L_{k}\underline{\lambda}^{-2})\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\tfrac{3}{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\sigma^{2}+\gamma_{k,\mathrm{buff}}\alpha_{k}B^{\sigma}.

Then, under Parameter Rule 2, one has Γk1\Gamma_{k}\leq 1 and ktα1k^{t_{\alpha}}\leq 1, so

αk=ktαλ¯2λ¯1Lk1λ¯34ΓkαkLkλ¯2=λ¯134λ¯1Γkktα14λ¯1,\alpha_{k}=k^{t_{\alpha}}\underline{\lambda}^{2}\overline{\lambda}^{-1}L_{k}^{-1}\implies\overline{\lambda}-\tfrac{3}{4}\Gamma_{k}\alpha_{k}L_{k}\underline{\lambda}^{-2}=\overline{\lambda}^{-1}-\tfrac{3}{4}\overline{\lambda}^{-1}\Gamma_{k}k^{t_{\alpha}}\geq\tfrac{1}{4}\overline{\lambda}^{-1},

which when combined with the prior conclusion completes the proof. ∎∎

Using this sufficient decrease result and looking further into the selections in Parameter Rule 2, we obtain the following result.

Lemma 4.10.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then, (ν,ς)>0×>0(\nu,\varsigma)\in\mathbb{R}_{>0}\times\mathbb{R}_{>0} exists such that, for all kk\in\mathbb{N},

𝔼[ϕ~(Xk+1,μk+1)|𝒢k]ϕ~(Xk,μk)νkt+tαPxϕ~(Xk,μk)22+ςkmax{2t+tα,t+2tα}.\mathbb{E}[\tilde{\phi}(X_{k+1},\mu_{k+1})|{\cal G}_{k}]-\tilde{\phi}(X_{k},\mu_{k})\\ \leq-\nu k^{t+t_{\alpha}}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}+\varsigma k^{\max\{2t+t_{\alpha},t+2t_{\alpha}\}}.
Proof.

We prove the result by building on Lemma 4.9. By Parameter Rule 2, there exist κ>0\kappa\in\mathbb{R}_{>0}, L¯>0\underline{L}\in\mathbb{R}_{>0}, and L>0L\in\mathbb{R}_{>0} such that for all kk\in\mathbb{N} one finds

γk,minαk=min{αk,κ+κ2+2L¯(ημkθk)βσL,η¯θkβσL}.\gamma_{k,\min}\alpha_{k}=\min\left\{\alpha_{k},\frac{-\kappa+\sqrt{\kappa^{2}+2\underline{L}(\eta\mu_{k}-\theta_{k})}}{\beta^{\sigma}L},\frac{\underline{\eta}-\theta_{k}}{\beta^{\sigma}L}\right\}.

Since this is the same type of lower bound as in (15), the same argument in the proof of Theorem 4.1 shows that there exists ν>0\nu\in\mathbb{R}_{>0} such that 14γk,minαkλ¯1νkt+tα\tfrac{1}{4}\gamma_{k,\min}\alpha_{k}\overline{\lambda}^{-1}\geq\nu k^{t+t_{\alpha}} for all kk\in\mathbb{N}. Hence, by Lemma 4.9, all that remains is to prove that there exists ς>0\varsigma\in\mathbb{R}_{>0} such that for all kk\in\mathbb{N} one finds that

32αk2Lkλ¯2σ2+γk,buffαkBσςkmax{2t+tα,t+2tα}.\tfrac{3}{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\sigma^{2}+\gamma_{k,\mathrm{buff}}\alpha_{k}B^{\sigma}\leq\varsigma k^{\max\{2t+t_{\alpha},t+2t_{\alpha}\}}. (25)

Starting with the first term in (25), one finds that

32αk2Lkλ¯2σ2=32αk(αkLkλ¯2)σ2=32λ¯2λ¯2σ2k2tαLk,\tfrac{3}{2}\alpha_{k}^{2}L_{k}\underline{\lambda}^{-2}\sigma^{2}=\tfrac{3}{2}\alpha_{k}(\alpha_{k}L_{k}\underline{\lambda}^{-2})\sigma^{2}=\tfrac{3}{2}\underline{\lambda}^{2}\overline{\lambda}^{-2}\sigma^{2}\frac{k^{2t_{\alpha}}}{L_{k}},

where with υ:=i=1m(Lciκci+κciLci)\upsilon:=\sum_{i=1}^{m}\left(L_{c_{i}}\kappa_{\nabla c_{i}}+\kappa_{c_{i}}L_{\nabla c_{i}}\right) one finds that

k2tαLk=k2tαLf+μ1ktθ02kt(k+1)tυk2tαLf+μ1θ02υktkt+2tαμ1θ02υ.\frac{k^{2t_{\alpha}}}{L_{k}}=\frac{k^{2t_{\alpha}}}{L_{\nabla f}+\mu_{1}k^{t}\theta_{0}^{-2}k^{-t}(k+1)^{-t}\upsilon}\leq\frac{k^{2t_{\alpha}}}{L_{\nabla f}+\mu_{1}\theta_{0}^{-2}\upsilon k^{-t}}\leq\frac{k^{t+2t_{\alpha}}}{\mu_{1}\theta_{0}^{-2}\upsilon}.

As for the second term in (25), one similarly finds that

γk,buffαkBσγbuffktktα(λ¯2Bσλ¯2Lk)(γbuffλ¯2Bσμ1θ02υλ¯2)k2t+tα.\gamma_{k,\mathrm{buff}}\alpha_{k}B^{\sigma}\leq\gamma_{\mathrm{buff}}k^{t}k^{t_{\alpha}}\left(\frac{\underline{\lambda}^{2}B^{\sigma}}{\overline{\lambda}^{2}L_{k}}\right)\leq\left(\frac{\gamma_{\mathrm{buff}}\underline{\lambda}^{2}B^{\sigma}}{\mu_{1}\theta_{0}^{-2}\upsilon\overline{\lambda}^{2}}\right)k^{2t+t_{\alpha}}.

Combining these bounds yields (25) for some ς>0\varsigma\in\mathbb{R}_{>0}. ∎∎

Our final theorem, presented next, now follows similarly to Theorem 3.16 in [7]. We provide a complete proof for the sake of completeness.

Theorem 4.2.

Suppose that Assumptions 2.1, 3.1, 4.1, 4.2, and 4.3 hold and that Algorithm 1 is run with Parameter Rule 2. Then,

lim infkPxϕ(Xk,μk)22=0almost surely.\liminf_{k\to\infty}\|P\nabla_{x}\phi(X_{k},\mu_{k})\|_{2}^{2}=0\ \ \text{almost surely}.

Consequently, if in a given run there exists 𝒦{\cal K}\subseteq\mathbb{N} with |𝒦|=|{\cal K}|=\infty such that

  • \bullet

    {Pxϕ(xk,μk)}k𝒦0\{P\nabla_{x}\phi(x_{k},\mu_{k})\}_{k\in{\cal K}}\to 0,

  • \bullet

    {xk}k𝒦x¯\{x_{k}\}_{k\in{\cal K}}\to\mkern 1.5mu\overline{\mkern-1.5mux} for some x¯\mkern 1.5mu\overline{\mkern-1.5mux}\in{\cal F}, and

  • \bullet

    at x¯\mkern 1.5mu\overline{\mkern-1.5mux} the linear independence constraint qualification (LICQ) holds with respect to problem (1) in the sense that with 𝒜¯:={i[m]:ci(x¯)=0}\bar{\cal A}:=\{i\in[m]:c_{i}(\mkern 1.5mu\overline{\mkern-1.5mux})=0\} the columns of ATA^{T} combined with the vectors in {ci(x¯)}i𝒜¯\{\nabla c_{i}(\mkern 1.5mu\overline{\mkern-1.5mux})\}_{i\in\bar{\cal A}} form a linearly independent set,

then x¯\mkern 1.5mu\overline{\mkern-1.5mux} is a KKT point for problem (1) in the sense that there exists a pair of Lagrange multipliers (y¯,z¯)l×m(\mkern 1.5mu\overline{\mkern-1.5muy},\mkern 1.5mu\overline{\mkern-1.5muz})\in\mathbb{R}^{l}\times\mathbb{R}^{m} such that (x¯,y¯,z¯)(\mkern 1.5mu\overline{\mkern-1.5mux},\mkern 1.5mu\overline{\mkern-1.5muy},\mkern 1.5mu\overline{\mkern-1.5muz}) satisfies (3).

Proof.

It follows from Lemma 4.10 and the Law of Total Expectation that there exists (ν,ς)>0×>0(\nu,\varsigma)\in\mathbb{R}_{>0}\times\mathbb{R}_{>0} such that, for all kk\in\mathbb{N}, one finds that

𝔼[ϕ~(Xk+1,μk+1)]𝔼[ϕ~(Xk,μk)]νkt+tα𝔼[Pxϕ~(Xk,μk)22]+ςkmax{2t+tα,t+2tα}.\mathbb{E}[\tilde{\phi}(X_{k+1},\mu_{k+1})]-\mathbb{E}[\tilde{\phi}(X_{k},\mu_{k})]\\ \leq-\nu k^{t+t_{\alpha}}\mathbb{E}[\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}]+\varsigma k^{\max\{2t+t_{\alpha},t+2t_{\alpha}\}}.

Consider arbitrary KK\in\mathbb{N}. Summing over k[K]k\in[K] yields

finf𝔼[ϕ~(x1,μ1)]\displaystyle f_{\inf}-\mathbb{E}[\tilde{\phi}(x_{1},\mu_{1})] 𝔼[ϕ~(XK+1,μK+1)]𝔼[ϕ~(x1,μ1)]\displaystyle\leq\mathbb{E}[\tilde{\phi}(X_{K+1},\mu_{K+1})]-\mathbb{E}[\tilde{\phi}(x_{1},\mu_{1})]
νk=1Kkt+tα𝔼[Pxϕ~(Xk,μk)22]+ςk=1Kkmax{2t+tα,t+2tα}.\displaystyle\leq-\nu\sum_{k=1}^{K}k^{t+t_{\alpha}}\mathbb{E}[\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}]+\varsigma\sum_{k=1}^{K}k^{\max\{2t+t_{\alpha},t+2t_{\alpha}\}}.

After rearrangement, this yields

k=1Kkt+tα𝔼[Pxϕ~(Xk,μk)22]1ν(𝔼[ϕ~(x1,μ1)]finf)+ςνk=1Kkmax{2t+tα,t+2tα}.\sum_{k=1}^{K}k^{t+t_{\alpha}}\mathbb{E}[\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}]\leq\tfrac{1}{\nu}(\mathbb{E}[\tilde{\phi}(x_{1},\mu_{1})]-f_{\inf})+\tfrac{\varsigma}{\nu}\sum_{k=1}^{K}k^{\max\{2t+t_{\alpha},t+2t_{\alpha}\}}.

Since 2t+tα(,1)2t+t_{\alpha}\in(-\infty,-1) and t+2tα(,1)t+2t_{\alpha}\in(-\infty,-1), the right-hand side of this inequality converges to a finite limit as KK\to\infty. Furthermore, by k=1kt+tα=\sum_{k=1}^{\infty}k^{t+t_{\alpha}}=\infty, nonnegativity of Pxϕ~(Xk,μk)22\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}, and Fatou’s lemma, one almost surely has

0=lim infk𝔼[Pxϕ~(Xk,μk)22]𝔼[lim infkPxϕ~(Xk,μk)22]=0.0=\liminf_{k\to\infty}\mathbb{E}[\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}]\geq\mathbb{E}\left[\liminf_{k\to\infty}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}\right]=0.

Now consider Φ:=lim infkPxϕ~(Xk,μk)22\Phi:=\liminf_{k\to\infty}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}, the expectation of which has been shown above to be zero. Nonnegativity of Pxϕ~(Xk,μk)22\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2} for all kk\in\mathbb{N} and the Law of Total Expectation imply that 0=𝔼[Φ][Φ>0]𝔼[Φ|Φ>0]0=\mathbb{E}[\Phi]\geq\mathbb{P}[\Phi>0]\mathbb{E}[\Phi|\Phi>0], so 0=[Φ>0]=[lim infkPxϕ~(Xk,μk)22>0]0=\mathbb{P}[\Phi>0]=\mathbb{P}[\liminf_{k\to\infty}\|P\nabla_{x}\tilde{\phi}(X_{k},\mu_{k})\|_{2}^{2}>0]. This is the first desired conclusion of the theorem. The second desired conclusion, on the other hand, follows by using the same argument as in the proof of Theorem 4.1. ∎∎

5 Discussion of Assumptions 3.1 and 4.3

Our aim in this section is to discuss Assumptions 3.1 and 4.3. In particular, first for the deterministic setting, we provide (a) example problems for which Assumption 3.1 can be shown to hold or cannot be shown to hold, (b) further explanation about why Assumption 3.1 can be viewed as a combination of a nondegeneracy assumption and an assumption about μ1/θ0\mu_{1}/\theta_{0} being sufficiently large, and (c) a procedure for computing dkd_{k} to satisfy (5) when such a direction exists. We follow this with a discussion about the combination of Assumptions 3.1 and 4.3 for the stochastic setting. We emphasize upfront that, generally speaking, a straightforward procedure for computing dkd_{k} to satisfy the conditions of our convergence analysis ((for either the deterministic or stochastic settings)) may not be available in practice. For example, for one thing, determining a sufficiently large value for the ratio μ1/θ0\mu_{1}/\theta_{0} may rely on problem-specific constants that are not known in advance of a run of the algorithm. That being said, in this section we provide guidance for how to compute dkd_{k} in practice such that, along with parameter tuning (as is common for stochastic-gradient-based methods in practice), one obtains good practical performance.

Throughout this section, for the sake of clarity, we drop the subscript notation that indicates the iteration index of an algorithm—rather, in this section, a subscript only refers to an index of a component of a vector. In particular, in this section, we denote xxkx\equiv x_{k} (which in turn means, e.g., that x1x_{1} refers to the first component of the vector xx), μμk\mu\equiv\mu_{k}, θθk1\theta\equiv\theta_{k-1}, qqkq\equiv q_{k}, and ddkd\equiv d_{k}.

Let us begin with some example problems for the deterministic setting. Firstly, recall that under Assumption 2.1 a strictly feasible point must exist. This means that one can rule out problems with, e.g., the affine constraint x1=0x_{1}=0 and the inequality constraint x10x_{1}\leq 0, or any problems with such a combination of constraints that implies the lack of a strictly feasible point. Secondly, observe that the existence or not of a direction satisfying (5) depends on the feasible region reduced to Null(A)\operatorname{Null}(A). This means that one can distinguish cases of whether (5) holds or not by presuming that one has already restricted attention to this reduced space and consider only the geometry of the feasible region with respect to inequality constraints where the ambient space of the variables is this reduced space.

Let us now present an example for which Assumption 3.1 can be shown to hold. In particular, the following example shows that there exists a value of the tuple (μ,θ,η,η¯,ζ¯,ζ¯,ζ)(\mu,\theta,\eta,\underline{\eta},\underline{\zeta},\overline{\zeta},\zeta) under the ranges specified by Algorithm 1 such that at all x<0x\in{\cal F}_{<0} there exists dd satisfying (5). The main aspect of the example is that μ\mu must be sufficiently large relative to θ\theta. We remark that the case a>0a\in\mathbb{R}_{>0} that is considered in the example is the more challenging case to consider. Indeed, if a0a\in\mathbb{R}_{\leq 0}, then, using similar arguments, the parameter requirements are less restrictive than in the following example. We expound on this claim further after the example.

Example 1.

To start, let (μ,θ,η,η¯,ζ¯,ζ¯,ζ)(\mu,\theta,\eta,\underline{\eta},\underline{\zeta},\overline{\zeta},\zeta) be arbitrary positive constants under the ranges specified by Algorithm 1. (A further restriction on the relationship between μ\mu and θ\theta is developed during the discussion of this example, when it is needed.) Consider a two-dimensional problem where f(x)=vTxf(x)=v^{T}x for some v2v\in\mathbb{R}^{2}, c1(x)=x1c_{1}(x)=-x_{1}, c2(x)=ax1x2c_{2}(x)=ax_{1}-x_{2} for some a>0a\in\mathbb{R}_{>0}, and g=f(x)=vg=\nabla f(x)=v (see (4)), so

q=v+μ(1c1(x))c1(x)+μ(1c2(x))c2(x).q=v+\mu\left(\frac{1}{-c_{1}(x)}\right)\nabla c_{1}(x)+\mu\left(\frac{1}{-c_{2}(x)}\right)\nabla c_{2}(x). (26)

Defining 𝒜(x):={i{1,2}:ημ<ci(x)θ}{\cal A}(x):=\{i\in\{1,2\}:-\eta\mu<c_{i}(x)\leq-\theta\}, the conditions (5) reduce to

ζ¯q2d2\displaystyle\underline{\zeta}\|q\|_{2}\leq\|d\|_{2} ζ¯q2,\displaystyle\leq\overline{\zeta}\|q\|_{2}, (27a)
qTd\displaystyle-q^{T}d ζq2d2,\displaystyle\geq\zeta\|q\|_{2}\|d\|_{2}, (27b)
andci(x)Td\displaystyle\text{and}\ \ \nabla c_{i}(x)^{T}d 12η¯d2for alli𝒜(x).\displaystyle\leq-\tfrac{1}{2}\underline{\eta}\|d\|_{2}\ \ \text{for all}\ \ i\in{\cal A}(x). (27c)

Since the gradient vectors f(x)\nabla f(x), c1(x)\nabla c_{1}(x), and c2(x)\nabla c_{2}(x) are constant over x<0x\in{\cal F}_{<0} and the latter two gradient vectors are nonzero, it follows from (26) that for any x<0x\in{\cal F}_{<0} one finds that q/q2qlim/qlim2q/\|q\|_{2}\to q_{\lim}/\|q_{\lim}\|_{2} as μ\mu\to\infty, where qlim=c1(x)/(c1(x))+c2(x)/(c2(x))q_{\lim}=\nabla c_{1}(x)/(-c_{1}(x))+\nabla c_{2}(x)/(-c_{2}(x)). At the same time, for any pair of constants (χ1,χ2)>0×>0(\chi_{1},\chi_{2})\in\mathbb{R}_{>0}\times\mathbb{R}_{>0} and with the search direction defined by

d=vμ(χ1c1(x))c1(x)μ(χ2c2(x))c2(x),d=-v-\mu\left(\frac{\chi_{1}}{-c_{1}(x)}\right)\nabla c_{1}(x)-\mu\left(\frac{\chi_{2}}{-c_{2}(x)}\right)\nabla c_{2}(x), (28)

one has for any x<0x\in{\cal F}_{<0} that d/d2dlim/dlim2d/\|d\|_{2}\to d_{\lim}/\|d_{\lim}\|_{2} as μ\mu\to\infty, where the limiting direction is dlim=(χ1/c1(x))c1(x)+(χ2/c2(x))c2(x)d_{\lim}=(\chi_{1}/c_{1}(x))\nabla c_{1}(x)+(\chi_{2}/c_{2}(x))\nabla c_{2}(x).

Now suppose that the constants (μ,θ,η)(\mu,\theta,\eta) and point x<0x\in{\cal F}_{<0} are such that both constraints are nearly active at xx, i.e., 𝒜(x)={1,2}{\cal A}(x)=\{1,2\}. Since one has χ3:=(c1(x)Tc2(x))/(c1(x)2c2(x)2)=a/1+a2(0,1)\chi_{3}:=(-\nabla c_{1}(x)^{T}\nabla c_{2}(x))/(\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2})=a/\sqrt{1+a^{2}}\in(0,1), it follows that

c1(x)Tdlim\displaystyle\nabla c_{1}(x)^{T}d_{\lim} =χ1c1(x)22c1(x)<0+χ2χ3c1(x)2c2(x)2c2(x)>0\displaystyle=\chi_{1}\underbrace{\frac{\|\nabla c_{1}(x)\|_{2}^{2}}{c_{1}(x)}}_{<0}+\chi_{2}\chi_{3}\underbrace{\frac{\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2}}{-c_{2}(x)}}_{>0}
andc2(x)Tdlim\displaystyle\text{and}\ \ \nabla c_{2}(x)^{T}d_{\lim} =χ1χ3c1(x)2c2(x)2c1(x)>0+χ2c2(x)22c2(x)<0.\displaystyle=\chi_{1}\chi_{3}\underbrace{\frac{\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2}}{-c_{1}(x)}}_{>0}+\chi_{2}\underbrace{\frac{\|\nabla c_{2}(x)\|_{2}^{2}}{c_{2}(x)}}_{<0}.

Now suppose χ1=c1(x)c2(x)2/(c2(x)c1(x)2)\chi_{1}=c_{1}(x)\|\nabla c_{2}(x)\|_{2}/(c_{2}(x)\|\nabla c_{1}(x)\|_{2}) and χ2=1\chi_{2}=1. (The effect of these choices is to normalize the contributions of the two terms in the definition of dlimd_{\lim}.) These choices and the fact that η=ψθ/μ\eta=\psi\theta/\mu for some ψ>1\psi\in\mathbb{R}_{>1} implies that

c1(x)Tdlim\displaystyle\nabla c_{1}(x)^{T}d_{\lim} <(1χ3ψθ)c1(x)2c2(x)2\displaystyle<-\left(\frac{1-\chi_{3}}{\psi\theta}\right)\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2}
andc2(x)Tdlim\displaystyle\text{and}\ \ \nabla c_{2}(x)^{T}d_{\lim} <(1χ3ψθ)c2(x)22.\displaystyle<-\left(\frac{1-\chi_{3}}{\psi\theta}\right)\|\nabla c_{2}(x)\|_{2}^{2}.

On the other hand, one has that dlim22c2(x)2/θ\|d_{\lim}\|_{2}\leq 2\|\nabla c_{2}(x)\|_{2}/\theta. Thus, one may conclude that there exists χ4>0\chi_{4}\in\mathbb{R}_{>0} such that c1(x)Tdlimχ4dlim2\nabla c_{1}(x)^{T}d_{\lim}\leq-\chi_{4}\|d_{\lim}\|_{2} and c2(x)Tdlimχ4dlim2\nabla c_{2}(x)^{T}d_{\lim}\leq-\chi_{4}\|d_{\lim}\|_{2} for all x<0x\in{\cal F}_{<0} such that both constraints are nearly active.

On the other hand, with dlimd_{\lim} defined as in the previous paragraph (i.e., with χ1=c1(x)c2(x)2/(c2(x)c1(x)2)\chi_{1}=c_{1}(x)\|\nabla c_{2}(x)\|_{2}/(c_{2}(x)\|\nabla c_{1}(x)\|_{2}) and χ2=1\chi_{2}=1), it follows that

qlimTdlim\displaystyle\ -q_{\lim}^{T}d_{\lim}
=\displaystyle= (1c1(x)c1(x)+1c2(x)c2(x))T(χ1c1(x)c1(x)+χ2c2(x)c2(x))\displaystyle\left(\frac{1}{c_{1}(x)}\nabla c_{1}(x)+\frac{1}{c_{2}(x)}\nabla c_{2}(x)\right)^{T}\left(\frac{\chi_{1}}{c_{1}(x)}\nabla c_{1}(x)+\frac{\chi_{2}}{c_{2}(x)}\nabla c_{2}(x)\right)
=\displaystyle= 1c1(x)c2(x)c1(x)2c2(x)2+1c2(x)2c2(x)22\displaystyle\ \frac{1}{c_{1}(x)c_{2}(x)}\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2}+\frac{1}{c_{2}(x)^{2}}\|\nabla c_{2}(x)\|_{2}^{2}
+(1c1(x)c2(x)+c2(x)2c2(x)2c1(x)2)c1(x)Tc2(x)\displaystyle\ +\left(\frac{1}{c_{1}(x)c_{2}(x)}+\frac{\|\nabla c_{2}(x)\|_{2}}{c_{2}(x)^{2}\|\nabla c_{1}(x)\|_{2}}\right)\nabla c_{1}(x)^{T}\nabla c_{2}(x)
=\displaystyle= 1χ3c1(x)c2(x)c1(x)2c2(x)2+1χ3c2(x)2c2(x)22\displaystyle\ \frac{1-\chi_{3}}{c_{1}(x)c_{2}(x)}\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2}+\frac{1-\chi_{3}}{c_{2}(x)^{2}}\|\nabla c_{2}(x)\|_{2}^{2}
\displaystyle\geq 1χ3θ2(c1(x)2c2(x)2+c2(x)22).\displaystyle\ \frac{1-\chi_{3}}{\theta^{2}}(\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2}+\|\nabla c_{2}(x)\|_{2}^{2}).

On the other hand, dlim22c2(x)2/θ\|d_{\lim}\|_{2}\leq 2\|\nabla c_{2}(x)\|_{2}/\theta and qlim22c2(x)2/θ\|q_{\lim}\|_{2}\leq 2\|\nabla c_{2}(x)\|_{2}/\theta. Thus, one may conclude that there exists χ5>0\chi_{5}\in\mathbb{R}_{>0} such that qlimTdlimχ5qlim2dlim2-q_{\lim}^{T}d_{\lim}\geq\chi_{5}\|q_{\lim}\|_{2}\|d_{\lim}\|_{2} for all x<0x\in{\cal F}_{<0} such that both constraints are nearly active.

The arguments in the preceding two paragraphs involve the limiting directions dlimd_{\lim} and qlimq_{\lim}, where it was shown that there exist positive constants χ4\chi_{4} and χ5\chi_{5}—that are uniform with respect to x<0x\in{\cal F}_{<0}—such that conditions of the form (27b)–(27c) hold for these limit directions. All that remains is to observe that, since these are the limiting directions for any given x<0x\in{\cal F}_{<0} as μ\mu\to\infty, such conditions and (27a) hold for (d,q)(d,q) for some positive constants as long as μ\mu is sufficiently large relative to θ\theta. All together from our observations, it follows that with μ\mu sufficiently large relative to θ\theta, η(θ/μ,1)\eta\in(\theta/\mu,1), x<0x\in{\cal F}_{<0} such that both constraints are nearly active, η¯\underline{\eta} sufficiently small, ζ¯\underline{\zeta} sufficiently small, ζ¯\overline{\zeta} sufficiently large, and ζ\zeta sufficiently small, one finds dd in (28) with the aforementioned choices of χ1\chi_{1} and χ2\chi_{2} satisfies (5).

Finally, observe that for any x<0x\in{\cal F}_{<0} such that both constraints are not nearly active, the situation is simpler than above. In such a setting, one does not need condition (27c) to hold for both constraints; it only needs to hold for whichever constraint, if any, is nearly active. This offers much more flexibility in the choice of dd. Overall, since the constraint gradients are constant over x<0x\in{\cal F}_{<0}, one can again derive the existence of uniform parameter choices, including that μ\mu is sufficiently large relative to θ\theta, to ensure that dd satisfying (5) exists at all such xx. ∎

The analysis in Example 1 can be extended to any problem with polyhedral constraints (for arbitrary nn\in\mathbb{N} and mm\in\mathbb{N}) as long as at any strictly feasible point in <0{\cal F}_{<0} the interior of the polar cone of the nearly active constraint gradients is nonempty. Intuitively, this can be understood as follows. With μ\mu sufficiently large relative to θ\theta, the gradient of the barrier term corresponding to nearly active constraints pushes the vector q-q (through more emphasis on the barrier term) to point with the interior of the polar cone of the nearly active constraint gradients. This means that, as long as a direction dd is chosen appropriately to compensate for the different magnitudes of the norms of the constraint gradients, the desired conditions in (5) hold. Admittedly, the parameter choices that ensure that such a direction dd always exists are not straightforward to determine in advance of a run of the algorithm. This means that, in practice, the algorithm parameters require tuning and/or adaptive choices to be made, as we discuss further shortly. To illustrate the situation that we have described in Example 1, please see Figure 1, on the left.

c1(x)\nabla c_{1}(x)c2(x)\nabla c_{2}(x)xxxqx-qx+dx+dc1(x)\nabla c_{1}(x)c2(0)\nabla c_{2}(0)xxxqx-qx+dx+d
Figure 1: On the left, an illustration of Example 1 for which Assumption 3.1 holds as long as the barrier parameter μ\mu is sufficiently large relative to θ\theta (recall that μk/θk1\mu_{k}/\theta_{k-1} is constant in the algorithm) amongst other parameter choices. Since μ\mu is sufficiently large relative to θ\theta, it follows that a direction dd pointing into the interior of the polar cone of nearly active constraint gradients is also one of sufficient decrease for the barrier-augmented objective function. On the right, an illustration of Example 2 for which Assumption 3.1 fails to hold since, as xx approaches the origin from within the feasible region, there is no minimum value for the ratio μ/θ\mu/\theta such that a direction into the interior of the polar cone of nearly active constraint gradients satisfies (5).

Now let us briefly describe an example for which we are unable to show that Assumption 3.1 holds with uniform parameter choices; see Figure 1, on the right.

Example 2.

Consider a two-dimensional problem where f(x)=vTxf(x)=v^{T}x for some v2v\in\mathbb{R}^{2}, c1(x)=x1c_{1}(x)=-x_{1}, c2(x)=x1x22c_{2}(x)=x_{1}-x_{2}^{2}, and g=f(x)=vg=\nabla f(x)=v (see (4)). Defining, as before, 𝒜(x):={i{1,2}:ημ<ci(x)θ}{\cal A}(x):=\{i\in\{1,2\}:-\eta\mu<c_{i}(x)\leq-\theta\}, (5) reduces to (27). The situation is similar to that in Example 1, except that c2(x)=[12x2]T\nabla c_{2}(x)=[1\ -2x_{2}]^{T} is not constant over xx. Attempting to follow a similar analysis as in Example 1, one finds that the arguments break down since χ3=c1(x)Tc2(x)/(c1(x)2c2(x)2)1\chi_{3}=-\nabla c_{1}(x)^{T}\nabla c_{2}(x)/(\|\nabla c_{1}(x)\|_{2}\|\nabla c_{2}(x)\|_{2})\to 1 as x0x\to 0 from within the interior of the feasible region. ∎

Intuitively, Example 2 has the property that the constraint gradients point in opposite directions as x0x\to 0 from within the interior of the feasible region. This means that, if the algorithm were to compute x0x\to 0 (say, since 0 is the optimal solution), the barrier terms for the two constraints would push against each other, in the limit. Consequently, we are not able to show that there exists a uniform ratio μ/θ\mu/\theta such that a direction dd satisfying (5) always exists over the feasible region.

One sees in Figure 1 our claim that Assumption 3.1 is a type of nondegeneracy assumption. When, at any strictly feasible point, the polar cone of the nearly active constraint gradients is sufficiently wide, the situation is favorable, like on the left in Figure 1. On the other hand, if the polar cone collapses as xx¯x\to\mkern 1.5mu\overline{\mkern-1.5mux} for some x¯\mkern 1.5mu\overline{\mkern-1.5mux}, like in the situation on the right in Figure 1, then the situation is not favorable since the barrier terms for different constraints push against each other.

We close this section by observing that a procedure for computing a direction to satisfy (5), if one exists, has been suggested in Example 1. In particular, at any iterate xx one can compute dd by (a) determining the set of nearly active constraints, (b) computing weights for the nearly active constraints, as in (28), such that the contribution from each nearly active constraint gradient is normalized in some sense in order to satisfy (5). Concretely, one can consider the feasibility problem

find{χi}i[m]>0msuch thatd=gμi[m]χici(x)ci(x)satisfies(5).\text{find}\ \ \{\chi_{i}\}_{i\in[m]}\in\mathbb{R}^{m}_{>0}\ \ \text{such that}\ \ d=g-\mu\sum_{i\in[m]}\frac{\chi_{i}}{c_{i}(x)}\nabla c_{i}(x)\ \ \text{satisfies}\ \ \eqref{eq.d_conditions}.

However, in practice, we suspect such a procedure not to be worth the required computational effort. Instead, we suggest that, in practice, a reasonable choice is to initialize μ\mu and θ\theta to positive values, and in each iteration compute dd by solving a linear system of the form (21), as we specify in Assumption 4.3 for the stochastic setting. For example, HkH_{k} can simply be an identity matrix, or it might be chosen as an identity matrix plus (an approximation of) the Hessian of the barrier function, which may involve second-order derivatives of {ci}i[m]\{c_{i}\}_{i\in[m]}, but does not require derivatives (of any order) of the objective function ff. In any case, if dd fails to satisfy (5) (specifically, (5e)) for some choices of the algorithm parameters, then the algorithm might “reset” the barrier parameter to a larger value. This may be allowed to occur iteratively, at least up to some upper limit for practical purposes. In the worst case, the algorithm may need to compute smaller values of γ\gamma than as stated in our theoretical results. However, at least this is a practical approach that worked well in our experiments.

6 Numerical Experiments

As previously stated, we do not claim that the deterministic version of our algorithm would be competitive with state-of-the-art derivative-based interior-point methods, especially not with such methods that allow infeasible iterates and/or employ second-order derivatives. On the other hand, for the stochastic setting, we know of no other stochastic interior-point method like ours that can solve generally constrained, smooth optimization problems, meaning that there is no such method in the literature against which our algorithm can be compared on a level playing field. All of this being said, it is illustrative to demonstrate the practical performance of our method through numerical experiments, both on a well-established and diverse set of test problems and on a couple of relatively straightforward test problems that represent the settings for which the algorithm has been designed. In this section, we present the results of such numerical experiments. For our experiments, each run of the algorithm was conducted on a compute node with an 16 AMD Opteron Processor 6128 with 32GB of memory.

First, we conducted an experiment with a large test set of problems in order to demonstrate the broad applicability of our algorithm. For such a demonstration, in these experiments we considered only the deterministic version of our algorithm. (Our second and third sets of experiments consider the stochastic version.) We generated our test set from the CUTEst collection [12], specifically using the PyCUTEst interface [11]. This collection includes 1315 unconstrained or constrained smooth optimization problems. To generate the subset of problems that we employed for our experiments, we proceeded as follows. Firstly, we removed all problems with nonlinear equality constraints, with no inequality constraints, or that were too large to be loaded into memory. This resulted in a set of 396 problems of the form (1). Second, since our algorithm requires a strictly feasible initial point, we ran a so-called Phase I algorithm (implemented in Python) starting from the initial point provided by PyCUTEst; see Algorithm 2 in Appendix A. Of the 396 problems on which the Phase I algorithm was run, 10 encountered function evaluation errors through the interface and 59 resulted in a failure to find a strictly feasible point; these were all removed. Our test set is the remainder of these problems—327 in total—for which a strictly feasible initial point could be found. These points were stored as the initial points for this first set of our experiments.

We implemented Algorithm 1 in Python. We refer to the software as SLIP. For each test problem, for which x1x_{1} was determined by Algorithm 2, the input parameters for the algorithm were set as follows. The initial neighborhood parameter was set as θ00.9max{c(x1)}>0\theta_{0}\leftarrow-0.9\max\{c(x_{1})\}>0 and the initial barrier parameter was set as μ1max{101,2θ0}\mu_{1}\leftarrow\max\{10^{-1},2\theta_{0}\}. The corresponding sequences of neighborhood and barrier parameters were then prescribed such that θk1θ0kt\theta_{k-1}\leftarrow\theta_{0}k^{-t} and μkμ1kt\mu_{k}\leftarrow\mu_{1}k^{-t} with t0.7t\leftarrow 0.7 for all kk\in\mathbb{N}. Estimates of κf\kappa_{\nabla f}, LfL_{\nabla f}, {κci}i[m]\{\kappa_{c_{i}}\}_{i\in[m]}, {Lci}i[m]\{L_{c_{i}}\}_{i\in[m]}, {κci}i[m]\{\kappa_{\nabla c_{i}}\}_{i\in[m]}, and {Lci}i[m]\{L_{\nabla c_{i}}\}_{i\in[m]} were determined by randomly generating nn points from a normal distribution centered at x1x_{1}, then computing constraint function and derivative values at these points to compute bound and Lipschitz-constant estimates. These values were employed to set the step sizes {αk}\{\alpha_{k}\} as stated in Parameter Rule 1 with tα0t_{\alpha}\leftarrow 0. (This ensures t+tα[1,0)t+t_{\alpha}\in[-1,0), which means that the condition on these exponents in Theorems 4.1 was satisfied.) Finally, SLIP employed the parameter values η(θ0/μ1+1)/2\eta\leftarrow(\theta_{0}/\mu_{1}+1)/2, η¯θ0+108\underline{\eta}\leftarrow\theta_{0}+10^{-8}, ζ¯1\underline{\zeta}\leftarrow 1, ζ¯1\overline{\zeta}\leftarrow 1, and ζ1\zeta\leftarrow 1. Rather than attempt to ensure that (5) holds for all kk\in\mathbb{N}, SLIP follows the strategy described at the end of §5 to reset μ1\mu_{1} whenever a search direction is computed that does not satisfy (5). In particular, μ1\mu_{1} was multiplied by a factor of 2 in such cases, although we imposed an overall limit of 10410^{4} beyond which μ1\mu_{1} was not increased. One additional feature that we included in the implementation is that we allowed the algorithm to explore values of γk\gamma_{k} greater than 1, since we believe this would be a useful feature in any practical implementation of the method. Specifically, if xk+αkdk𝒩(θk)x_{k}+\alpha_{k}d_{k}\not\in{\cal N}(\theta_{k}), then the algorithm would iteratively reduce γk\gamma_{k} from 1 until xk+γkαkdk𝒩(θk)x_{k}+\gamma_{k}\alpha_{k}d_{k}\in{\cal N}(\theta_{k}) was found. On the other hand, if xk+αkdk𝒩(θk)x_{k}+\alpha_{k}d_{k}\in{\cal N}(\theta_{k}), then the algorithm would iteratively increase γk\gamma_{k} from 1 as long as xk+γkαkdk𝒩(θk)x_{k}+\gamma_{k}\alpha_{k}d_{k}\in{\cal N}(\theta_{k}) and the barrier term in the barrier-augmented objective function did not increase.

We ran SLIP with an iteration budget of K:=2×104K:=2\times 10^{4} in order to determine the progress that the algorithm could make within a budget. (To resemble the stochastic setting, our implementation of SLIP does not employ second-order derivatives of the objective function, so one should not expect the algorithm to converge at a fast rate like other interior-point methods that employ second-order derivatives.) Let xKnx_{K}\in\mathbb{R}^{n} be the final iterate generated by a run of the algorithm on a test problem. Since our algorithm is a feasible method, one can compare f(x1)f(x_{1}) and f(xK)f(x_{K}) for each run in order to determine if the algorithm has made progress. We confirmed that, indeed, in all of our runs of SLIP over our entire test set, we found that f(xK)<f(x1)f(x_{K})<f(x_{1}). However, without knowledge of finf:=infxf(x)f_{\inf}:=\inf_{x\in{\cal F}}f(x) for each problem in our test set, it is not possible to compare f(xK)f(x_{K}) with finff_{\inf}. Instead, we considered the relative stationarity measure (recall Theorem 4.1):

relative stationarity:=Pxϕ(xK,μK)2min{Pxϕ(x1,μ1)2,Pxϕ(x1,μK)2}.\text{relative stationarity}:=\frac{\|P\nabla_{x}\phi(x_{K},\mu_{K})\|_{2}}{\min\{\|P\nabla_{x}\phi(x_{1},\mu_{1})\|_{2},\|P\nabla_{x}\phi(x_{1},\mu_{K})\|_{2}\}}. (29)

Here, the min\min in the denominator respects the fact that different choices of the barrier parameter correspond to different multiplier estimates at the initial point (see (18)), which in turn give different stationarity measures at the initial point. Our use of the min\min in the denominator in this definition of relative stationarity means that we are determining the better of the stationarity measures at the initial point to compare against our stationarity measure at the final iterate.

Figure 2 shows a histogram of relative stationarity values over our test set of problems from the CUTEst collection. The results show that, generally speaking, the relative stationarity values were quite small, meaning that the final iterates in most runs appeared to be much closer to stationarity than the initial iterates, thus indicating that within the iteration budget the algorithm made substantial progress toward stationarity. We conjecture that the problems on which the relative stationarity value was higher may be ones that are very nonlinear and/or have constraints that are degenerate near the iterates generated by the algorithm.

Refer to caption
Figure 2: Histogram of relative stationarity values for experiments with a deterministic version of SLIP over problems from the CUTEst collection with a strictly feasible initial point.

As a second experiment, we tested stochastic runs of SLIP on a randomly generated instance of a second-order conic optimization problem, namely,

minxncTxs.t.Ax=bandx1:n12xn.\min_{x\in\mathbb{R}^{n}}\ c^{T}x\ \ \operatorname{s.\!t.}\ \ Ax=b\ \ \text{and}\ \ \|x_{1:n-1}\|_{2}\leq x_{n}. (30)

Specifically, we randomly generated AA and bb such that a strictly feasible point existed, and randomly generated cc such that the problem had a finite optimal value. We chose a problem of this type since it is nonlinear, but convex, so for our first demonstration of stochastic runs of our algorithm we could avoid challenging situations that arise due to nonconvexity, such as the algorithm potentially converging to different minimizers. (By contrast, our third experiment considers a nonconvex problem.) First, as a benchmark, we ran deterministic SLIP with the same parameter choices as our experiments with the CUTEst collection, except for two differences: (a) we set tα=0.151t_{\alpha}=-0.151 so that t+tα[1,0)t+t_{\alpha}\in[-1,0) and t+2tα<1t+2t_{\alpha}<-1, as required in Parameter Rule 2, and (b) our procedure for computing γk\gamma_{k} for all kk\in\mathbb{N} did not observe barrier-augmented objective function values, since this would require objective function values that are not tractable to compute in a stochastic setting. Instead, we allowed γk\gamma_{k} to increase above 1, but only up to a limit of 10, when doing so maintained all constraint values less than or equal to θk-\theta_{k}.

Second, we ran stochastic SLIP with 10 different seeds for the random number generator. These runs used the same values for x1x_{1}, θ0\theta_{0}, initial μ1\mu_{1}, η\eta, and η¯\underline{\eta} as the deterministic algorithm along with γk,max=10\gamma_{k,\max}=10 for all kk\in\mathbb{N}. It also employed the same Lipschitz constant estimates and employed the same procedured for “resetting” μ1\mu_{1} (up to a limit of 10410^{4}) as the deterministic algorithm. Thus, the main difference between the deterministic and stochastic runs of the algorithm were that the latter employed stochastic objective gradient estimates of the form c+ΔCc+\Delta C, where for all kk\in\mathbb{N} in each run the vector ΔC\Delta C was drawn randomly from a standard normal distribution. By design of the algorithm, the final iterates generated by the deterministic and all stochastic runs of the algorithm were feasible with respect to all of the constraints. Table 1 shows the final objective values for each of the 10 stochastic runs of SLIP. For reference, f(x1)=f(x_{1})= 1.80561e+03 for all runs, deterministic SLIP achieved f(xK)=f(x_{K})= 6.31100e+02, and the relative stationarity measure achieved by deterministic SLIP (see (29)) was 1.65377e-02. Thus, overall, one can observe that the deterministic and stochastic runs of SLIP yielded final iterates that were relatively very close to stationarity.

Table 1: Final objective function values over 10 runs of stochastic SLIP to solve (30).
run 1 2 3 4 5
f(xK)f(x_{K}) 6.31103e+02 6.31102e+02 6.31100e+02 6.31101e+02 6.31100e+02
run 6 7 8 9 10
f(xK)f(x_{K}) 6.31101e+02 6.31102e+02 6.31101e+02 6.31101e+02 6.31103e+02

As a third experiment, we tested stochastic runs of SLIP to train a binary classifier subject to a norm constraint for the mushrooms data set from the LIBSVM collection [6]. The model that we trained was an artificial neural network with a single hidden layer with 512 nodes, tanh activation, and cross-entropy loss. This objective function is nonconvex. We trained the model subject to the constraint that the neural network weights had squared 2\ell_{2}-norm less than or equal to 100. This feasible region is convex. Like for our second experiment, we chose a convex feasible region to ensure that it would be reasonable to expect multiple stochastic runs of the algorithm to converge to a point of similar solution quality in terms of the final objective function values (our measure of comparison), despite the nonconvexity of the objective function. The experimental setup was the same as for our second experiment, except that, rather than artificial noise in the stochastic-gradient estimates, we employed mini-batch gradient estimates with a mini-batch size of 256. (The mushrooms dataset has 8124 data points, each with 112 features.)

Table 6 shows the final objective values for each of the 10 stochastic runs of SLIP. For reference, f(x1)=f(x_{1})= 6.93137e-01 for all runs, deterministic SLIP achieved f(xK)=f(x_{K})= 7.77697e-03, and the relative stationarity measure achieved by deterministic SLIP (see (29)) was 7.06457610e-02. Thus, one can observe that the deterministic and stochastic runs of SLIP yielded final iterates of comparable quality.

run 1 2 3 4 5
f(xK)f(x_{K}) 7.77711e-03 7.76221e-03 7.79686e-03 7.82982e-03 7.78385e-03
run 6 7 8 9 10
f(xK)f(x_{K}) 7.79379e-03 7.80981e-03 7.80194e-03 7.75375e-03 7.77848e-03

7 Conclusion

We have proposed, analyzed, and tested a single-loop interior-point framework for solving constrained optimization problems. Of particular interest is a stochastic-gradient-based version of the framework, which can be employed for solving problems with affine equality constraints and (potentially nonconvex) nonlinear inequality constraints, at least when a strictly feasible initial point can be provided (or at least computed through a so-called Phase I algorithm, such as the method that we present in Appendix A). We have shown that the algorithm possesses convergence guarantees in both the deterministic and stochastic settings. We have also shown that a deterministic version of the algorithm performs reliably on a large test set of problems, and that a stochastic version of the algorithm yields good results both in the setting of artificial noise and in the context of a mini-batch stochastic-gradient-based algorithm for the training an artificial neural network for binary classification subject to a norm constraint.

A few significant open questions remain in the study of stochastic-gradient-based interior-point methods for solving nonlinearly constrained optimization problems. For example, our theoretical convergence guarantees requires that the ratio μ1/θ0\mu_{1}/\theta_{0} is sufficiently large and that each computed search direction satisfies the conditions in Assumptions 3.1 and/or 4.3, but, as we have explained throughout the paper, a computationally effective strategy for computing search directions that adhere to our theoretical guarantees is not straightforward to design. A related open question is whether it is possible to design a single-loop infeasible interior-point framework that possesses theoretical convergence guarantees that are at least on par with those that we offer for the algorithm proposed in this paper. The main challenge in the design of such an approach is that our theoretical guarantees, which employ a prescribed sequence {μk}0\{\mu_{k}\}\searrow 0, rely heavily on the fact that the iterates generated by our algorithm are feasible in every iteration. There exist stochastic-gradient-based infeasible Newton-type methods for solving equality-constrained optimization problems; see, e.g., [3, 1, 2, 4, 9, 8, 10, 14, 15, 17]. Hence, one might expect it to be possible to employ such methods for solving the equality-constrained subproblems that arise in an infeasible interior-point method. However, even if one were to introduce slack variables for which the barrier functions are introduced, it remains unclear how to merge such an approach with our strategy for ensuring feasibility at all iterates. Concretely, suppose that inequality constraints c(x)0c(x)\leq 0 are reformulated with slack variables as c(x)+s=0c(x)+s=0 along with s0s\geq 0, where the latter inequalities are handled with a barrier function. Attempting to follow the strategy for bound-constrained optimization in [7], one is confronted with the challenge that the search direction in the slack variables, call it dksd_{k}^{s}, depends on the search direction in the original variables, say dkxd_{k}^{x}, where it may be possible that c(xk)0c(x_{k})\not\leq 0. It is difficult to follow a strategy such as ours to ensure, e.g., that there exists γ\gamma uniformly bounded below such that sk+γαkdkss_{k}+\gamma\alpha_{k}d_{k}^{s} is sufficiently positive, say to stay within a prescribed neighborhood. The difficulties of designing such an approach led us to propose the feasible method that has been presented in this paper, although it is possible that such an approach, or one based on alternative strategies, could be designed with convergence guarantees.

Appendix A Appendix: Algorithm for Finding a Strictly Feasible Point

Our algorithm for finding a strictly feasible initial point for our numerical experiments with problems from the CUTEst collection [12] is presented as Algorithm 2 in this appendix. For simplicity, for our discussion here we state user-defined parameters in terms of the specific values that we used in our implementation rather than introduce a generic parameter range.

The algorithm can be understood as an infeasible interior-point method, where line searches on a merit function are performed as a step-acceptance mechanism. Rather than attempt to solve an optimization problem to high accuracy, the aim of the algorithm is merely to find a strictly feasible point, i.e., a point in <0{\cal F}_{<0} (recall (2)) satisfying affine equality constraints and strictly satisfying (potentially nonlinear) inequality constraints. Hence, the barrier parameter is fixed at 1 and the algorithm terminates as soon as a strictly feasible point is found. A strictly feasible point is defined as follows. As opposed to problem (1), which is stated in terms of only one-sided inequality constraints, our definition here of a strictly feasible point distinguishes between one- and two-sided inequalities. With respect to any one-sided inequality, say, φ(x)φu\varphi(x)\leq\varphi_{u}, strict feasibility of a point xx is defined as φ(x)φu104\varphi(x)\leq\varphi_{u}-10^{-4}. With respect to any two-sided inequality, say, φlφ(x)φu\varphi_{l}\leq\varphi(x)\leq\varphi_{u}, strict feasibility of a point xx is defined as

φl+104min{φuφl,1}φ(x)φu104min{φuφl,1}\varphi_{l}+10^{-4}\min\{\varphi_{u}-\varphi_{l},1\}\leq\varphi(x)\leq\varphi_{u}-10^{-4}\min\{\varphi_{u}-\varphi_{l},1\}

Supposing again that the inequality constraints are stated as in problem (1), the algorithm can be understood as an iterative method toward solving

min(x,s)n×mi[m]log(si)s.t.{Ax=bc(x)+s=0\min_{(x,s)\in\mathbb{R}^{n}\times\mathbb{R}^{m}}\ -\sum_{i\in[m]}\log(s_{i})\ \ \operatorname{s.\!t.}\left\{\begin{aligned} Ax&=b\\ c(x)+s&=0\end{aligned}\right. (31)

the first-order optimality conditions for which are

S1e+z=0,ATy+c(x)z=0,c(x)+s=0,andAx=b.-S^{-1}e+z=0,\ \ A^{T}y+\nabla c(x)z=0,\ \ c(x)+s=0,\ \ \text{and}\ \ Ax=b.

Given an initial point x0nx_{0}\in\mathbb{R}^{n}, the algorithm commences by attempting to compute x1nx_{1}\in\mathbb{R}^{n} with Ax1=bAx_{1}=b by solving a least-squares problem. (If this yields Ax1b2106\|Ax_{1}-b\|_{2}\leq 10^{-6}, then the algorithm continues; otherwise, the run is a failure.) Then, up to an iteration limit, the algorithm follows a standard infeasible interior-point approach with line searches on a merit function; in particular, the merit function ϕ:n×m×>0\phi:\mathbb{R}^{n}\times\mathbb{R}^{m}\times\mathbb{R}_{>0}\to\mathbb{R} is defined by

ϕ(x,s,τ)=τi=1mlog(si)+c(x)+s1,\phi(x,s,\tau)=-\tau\sum_{i=1}^{m}\log(s_{i})+\|c(x)+s\|_{1},

where τ>0\tau\in\mathbb{R}_{>0} is a merit parameter that is updated adaptively. (If the iteration limit is exceeded, then the run is a failure.) Firstly, a Hessian approximation is determined with a modification, if necessary, to ensure that it is sufficiently positive definite. To represent potential use in practice with our proposed Algorithm 1, the Hessian approximation does not employ second-order derivatives of the objective function, although it does employ second-order derivatives of the constraint functions. Secondly, a search direction is determined by solving a linear system. Thirdly, the merit parameter is updated using a standard technique from the literature; see, e.g., [5]. Specifically, defining the predicted reduction in the merit function corresponding to the tuple (xk,τk)(x_{k},\tau_{k}) and search direction components (dkx,dks)(d_{k}^{x},d_{k}^{s}) as

Δq(xk,sk,dkx,dks,τk)=τk(dksTSk1𝟙+12dkxTHkdkx+12dksTSk2dks)+c(xk)+sk1,\Delta q(x_{k},s_{k},d_{k}^{x},d_{k}^{s},\tau_{k})=-\tau_{k}(-{d_{k}^{s}}^{T}S_{k}^{-1}\mathds{1}+\tfrac{1}{2}{d_{k}^{x}}^{T}H_{k}d_{k}^{x}+\tfrac{1}{2}{d_{k}^{s}}^{T}S_{k}^{-2}d_{k}^{s})+\|c(x_{k})+s_{k}\|_{1},

the update ensures that τk\tau_{k} is sufficiently small such that Δq(xk,sk,dkx,dks,τk)\Delta q(x_{k},s_{k},d_{k}^{x},d_{k}^{s},\tau_{k}) is sufficiently large relative to c(xk)+sk1\|c(x_{k})+s_{k}\|_{1}. Fourthly, the largest step size in (0,1](0,1] satisfying a fraction-to-the-boundary rule is determined. Finally, a line search is conducted to compute a step size that satisfies the fraction-to-the-boundary rule and yields sufficient decrease in the merit function.

Algorithm 2 Finding a Strictly Feasible Point
1:x0nx_{0}\in\mathbb{R}^{n}, τ0=1\tau_{0}=1
2:Set x1argminxn12xx022s.t.Ax=bx_{1}\leftarrow\text{argmin}_{x\in\mathbb{R}^{n}}\tfrac{1}{2}\|x-x_{0}\|_{2}^{2}\ \operatorname{s.\!t.}\ Ax=b
3:Set s1max{c(x1),1}s_{1}\leftarrow\max\{-c(x_{1}),1\} and z1s11z_{1}\leftarrow s_{1}^{-1} (component-wise)
4:for k[103]k\in[10^{3}] do
5:     if c(xk)c(x_{k}) is sufficiently interior, component-wise then
6:         terminate and return xkx_{k}
7:     end if
8:     Set HkI+i[m][yk]i2ci(xk)+λkIH_{k}\leftarrow I+\sum_{i\in[m]}[y_{k}]_{i}\nabla^{2}c_{i}(x_{k})+\lambda_{k}I for λk0\lambda_{k}\geq 0 such that Hk104IH_{k}\succeq 10^{-4}I
9:     Solve
[Hk0ATc(xk)0Sk20IA000c(xk)TI00][dkxdksykdkz]=[c(xk)zkSk1𝟙+zk0c(xk)+sk]\displaystyle\begin{bmatrix}H_{k}&0&A^{T}&\nabla c(x_{k})\\ 0&S_{k}^{-2}&0&I\\ A&0&0&0\\ \nabla c(x_{k})^{T}&I&0&0\end{bmatrix}\begin{bmatrix}d_{k}^{x}\\ d_{k}^{s}\\ y_{k}\\ d_{k}^{z}\end{bmatrix}=-\begin{bmatrix}\nabla c(x_{k})z_{k}\\ -S_{k}^{-1}\mathds{1}+z_{k}\\ 0\\ c(x_{k})+s_{k}\end{bmatrix}
10:     Set, with Θk:=dksTSk1𝟙+dkxTHkdkx+dksTSk2dks\Theta_{k}:=-{d_{k}^{s}}^{T}S_{k}^{-1}\mathds{1}+{d_{k}^{x}}^{T}H_{k}d_{k}^{x}+{d_{k}^{s}}^{T}S_{k}^{-2}d_{k}^{s}, first
τktrial\displaystyle\tau_{k}^{\text{trial}} {if Θk00.5c(xk)+sk1Θkotherwise\displaystyle\leftarrow\begin{cases}\infty&\text{if $\Theta_{k}\leq 0$}\\ \tfrac{0.5\|c(x_{k})+s_{k}\|_{1}}{\Theta_{k}}&\text{otherwise}\end{cases}
thenτk\displaystyle\text{then}\ \tau_{k} {τk1if τk1τktrial(1106)τktrialotherwise\displaystyle\leftarrow\begin{cases}\tau_{k-1}&\text{if $\tau_{k-1}\leq\tau_{k}^{\text{trial}}$}\\ (1-10^{-6})\tau_{k}^{\text{trial}}&\text{otherwise}\end{cases}
11:     Set αkftbmin{α(0,1]:sk+αdks0.1sk}\alpha_{k}^{\text{ftb}}\leftarrow\min\{\alpha\in(0,1]:s_{k}+\alpha d_{k}^{s}\geq 0.1s_{k}\}
12:     Set αkαjαkftb\alpha_{k}\leftarrow\alpha^{j}\alpha_{k}^{\text{ftb}}, where jj is the minimum value in {0}\{0\}\cup\mathbb{N} such that
ϕ(xk+αjαkftbdkx,sk+αjαkftbdks,τk)ϕ(xk,sk,τk)104αjαkftbΔq(xk,sk,dkx,dks,τk)\phi(x_{k}+\alpha^{j}\alpha_{k}^{\text{ftb}}d_{k}^{x},s_{k}+\alpha^{j}\alpha_{k}^{\text{ftb}}d_{k}^{s},\tau_{k})\leq\phi(x_{k},s_{k},\tau_{k})-10^{-4}\alpha^{j}\alpha_{k}^{\text{ftb}}\Delta q(x_{k},s_{k},d_{k}^{x},d_{k}^{s},\tau_{k})
13:end for

References

  • [1] Albert S. Berahas, Raghu Bollapragada, and Baoyu Zhou. An adaptive sampling sequential quadratic programming method for equality constrained stochastic optimization. arXiv 2206.00712, 2022.
  • [2] Albert S. Berahas, Frank E. Curtis, Michael J. O’Neill, and Daniel P. Robinson. A Stochastic Sequential Quadratic Optimization Algorithm for Nonlinear Equality Constrained Optimization with Rank-Deficient Jacobians. Mathematics of Operations Research, https://doi.org/10.1287/moor.2021.0154, 2023.
  • [3] Albert S. Berahas, Frank E. Curtis, Daniel P. Robinson, and Baoyu Zhou. Sequential quadratic optimization for nonlinear equality constrained stochastic optimization. SIAM Journal on Optimization, 31(2):1352–1379, 2021.
  • [4] Albert S. Berahas, Jiahao Shi, Zihong Yi, and Baoyu Zhou. Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction. Computational Optimization and Applications, 86:79–116, 2022.
  • [5] Richard H. Byrd, Mary E. Hribar, and Jorge Nocedal. An interior point algorithm for large-scale nonlinear programming. SIAM Journal on Optimization, 9(4):877–900, 1999.
  • [6] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • [7] Frank E. Curtis, Vyacheslav Kungurtsev, Daniel P. Robinson, and Qi Wang. A Stochastic-Gradient-based Interior-Point Algorithm for Solving Smooth Bound-Constrained Optimization Problems. arXiv e-prints, arXiv:2304.14907, April 2023.
  • [8] Frank E. Curtis, Michael J. O’Neill, and Daniel P. Robinson. Worst-Case Complexity of an SQP Method for Nonlinear Equality Constrained Stochastic Optimization. Mathematical Programming, https://doi.org/10.1007/s10107-023-01981-1, 2023.
  • [9] Frank E. Curtis, Daniel P. Robinson, and Baoyu Zhou. Inexact sequential quadratic optimization for minimizing a stochastic objective function subject to deterministic nonlinear equality constraints. arXiv 2107.03512 (to appear in INFORMS Journal on Optimization), 2021.
  • [10] Yuchen Fang, Sen Na, Michael W. Mahoney, and Mladen Kolar. Fully stochastic trust-region sequential quadratic programming for equality-constrained optimization problems. SIAM Journal on Optimization, 34(2):2007–2037, 2024.
  • [11] Jaroslav Fowkes, Lindon Roberts, and Árpád Brmen. Pycutest: an open source python package of optimization test problems. Journal of Open Source Software, 7(78):4377, 2022.
  • [12] Nicholas I. M. Gould, Dominique Orban, and Philippe L. Toint. CUTEst: a constrained and unconstrained testing environment with safe threads for mathematical optimization. Comp. Opt. Appl., 60(3):545–557, 2015.
  • [13] Olvi L. Mangasarian and Stan Fromovitz. The Fritz John necessary optimality conditions in the presence of equality and inequality constraints. Journal of Mathematical Analysis and Applications, 17:37–47, 1967.
  • [14] Sen Na, Mihai Anitescu, and Mladen Kolar. An adaptive stochastic sequential quadratic programming with differentiable exact augmented Lagrangians. Mathematical Programming, 199:721–791, 2023.
  • [15] Sen Na and Michael W. Mahoney. Asymptotic convergence rate and statistical inference for stochastic sequential quadratic programming. arXiv 2205.13687, 2022.
  • [16] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer-Verlag New York, 2006.
  • [17] Songqiang Qiu and Vyacheslav Kungurtsev. A sequential quadratic programming method for optimization with stochastic objective functions, deterministic inequality constraints and robust subproblems. arXiv 2302.07947, 2023.
  • [18] Robert J. Vanderbei and David F. Shanno. An Interior-Point Algorithm for Nonconvex Nonlinear Programming. Computational Optimization and Applications, 13:231–252, 1999.
  • [19] Andreas Wächter and Lorenz T. Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.