This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

An Inexact Frank-Wolfe Algorithm for Composite Convex Optimization Involving a Self-Concordant Function

Nimita Shinde Department of Industrial and Systems Engineering, Lehigh University, Bethlehem, PA 18015 ([email protected])    Vishnu Narayanan Industrial Engineering and Operations Research, Indian Institute of Technology Bombay, Powai, Mumbai, 400076 India ([email protected])    James Saunderson Department of Electrical and Computer Systems Engineering, Monash University, Clayton, VIC 3800, Australia ([email protected])
Abstract

In this paper, we consider Frank-Wolfe-based algorithms for composite convex optimization problems with objective involving a logarithmically-homogeneous, self-concordant functions. Recent Frank-Wolfe-based methods for this class of problems assume an oracle that returns exact solutions of a linearized subproblem. We relax this assumption and propose a variant of the Frank-Wolfe method with inexact oracle for this class of problems. We show that our inexact variant enjoys similar convergence guarantees to the exact case, while allowing considerably more flexibility in approximately solving the linearized subproblem. In particular, our approach can be applied if the subproblem can be solved prespecified additive error or to prespecified relative error (even though the optimal value of the subproblem may not be uniformly bounded). Furthermore, our approach can also handle the situation where the subproblem is solved via a randomized algorithm that fails with positive probability. Our inexact oracle model is motivated by certain large-scale semidefinite programs where the subproblem reduces to computing an extreme eigenvalue-eigenvector pair, and we demonstrate the practical performance of our algorithm with numerical experiments on problems of this form.

1 Introduction

Consider a composite optimization problem

minxnF(x)minxnf((x))+g(x),\min_{x\in\mathbb{R}^{n}}F(x)\equiv\min_{x\in\mathbb{R}^{n}}f(\mathcal{B}(x))+g(x), (SC-OPT)

where :nd\mathcal{B}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{d} is a linear map, ff is a convex, θ\theta-logarithmically-homogeneous, MM-self-concordant barrier for a proper cone 𝒦\mathcal{K}, and gg is a proper, closed, convex and possibly nonsmooth function with compact domain. We further assume that int(𝒦)(dom(g))\textup{int}(\mathcal{K})\cap\mathcal{B}(\textup{dom}(g))\neq\emptyset, ensuring that there exists at least one feasible point for (SC-OPT). Self-concordant functions are thrice differentiable convex functions whose third directional derivatives are bounded by their second directional derivatives (see Section 2 for the exact definition and key properties of these functions). Self-concordant functions appear in several applications such as machine learning, and image processing. Several second-order [28, 26, 27, 15] as well as first-order [13, 6, 7, 36] methods have been used to solve (SC-OPT) (see Section 1.4 for more details).

For large-scale applications of (SC-OPT), one natural algorithmic approach is to use the Frank-Wolfe algorithm [8], which proceeds by partially linearizing the objective at the current iterate xtx^{t}, solving the linearized problem

minhnFxtlin(h)minhn(f((xt))),h+g(h),\min_{h\in\mathbb{R}^{n}}F^{\textup{lin}}_{x^{t}}(h)\equiv\min_{h\in\mathbb{R}^{n}}\langle\mathcal{B}^{*}(\nabla f(\mathcal{B}(x^{t}))),h\rangle+g(h), (Lin-OPT)

and taking the next iterate to be a convex combination of xtx^{t} and an optimal point for (Lin-OPT). It is challenging to analyse the convergence of such first-order methods for self-concordant functions because their gradients are generally unbounded on their domains, and so are not globally Lipschitz continuous. Despite this, when ff is θ\theta-logarithmically homogeneous and self-concordant, Zhao and Freund [36] established convergence guarantees for the Frank-Wolfe method when the step direction and step size are carefully chosen in a way that depends on the optimal solution of (Lin-OPT) (see Algorithm 1). This limits the applicability of the method in situations where it may not be possible to solve (Lin-OPT) to high accuracy under reasonable time and storage constraints.

In this paper, we extend the generalized Frank-Wolfe method of [36] to allow for inexact solutions to the linearized subproblem (Lin-OPT) at each iteration. Our inexact oracle model is flexible enough to allow for situations (a) where the subproblem is only solved to a specified relative error, and (b) where the method for solving the subproblem is randomized and fails with positive probability. Despite this flexible oracle model, we establish convergence guarantees that are comparable to those given by Zhao and Freund [36] for the method with exact oracle. While the high-level strategy of the analysis is broadly inspired by that of [36], our arguments are somewhat more involved.

Algorithm 1 Generalized Frank-Wolfe Algorithm

Input: Problem (SC-OPT) where f()f(\cdot) is a 2-self-concordant, θ\theta-logarithmically-homogeneous barrier function

1:  Initialize x0nx^{0}\in\mathbb{R}^{n}.
2:  for t=0,1,t=0,1,\dotsc do
3:     Compute update direction hth^{t}. Compute f((xt))\nabla f(\mathcal{B}(x^{t})) and htargminhnFxtlin(h)h^{t}\in\operatorname*{arg\,min}_{h\in\mathbb{R}^{n}}F^{\textup{lin}}_{x^{t}}(h).
4:     Compute duality gap GtG_{t}, and distance between hth^{t} and xtx^{t}. Compute Gt=Fxtlin(xt)Fxtlin(ht)G_{t}=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{t}) and Dt=(htxt)(xt)D_{t}=\|\mathcal{B}(h^{t}-x^{t})\|_{\mathcal{B}(x^{t})}.
5:     Compute step-size γt\gamma_{t}. Set γt=min{GtDt(Dt+Gt),1}\gamma_{t}=\min\left\{\frac{G_{t}}{D_{t}(D_{t}+G_{t})},1\right\}.
6:     Update xt+1=(1γt)xt+γthtx^{t+1}=(1-\gamma_{t})x^{t}+\gamma_{t}h^{t}.
7:  end for

1.1 Questions addressed by this work

In this section, we discuss three ideas to further generalize the generalized Frank-Wolfe method [36]. These modifications help in addressing the following questions and we discuss them in detail in this section.

  1. 1.

    What if we can only generate an approximate minmizer to the linear minimizer problem (Lin-OPT), i.e., we can only solve the problem in Step 3 of Algorithm 1 to some given additive error?

  2. 2.

    What if we only have access to a method to solve (Lin-OPT), that fails, with some nonzero probability, to return a point with specified approximation error?

  3. 3.

    What if we only have access to a method that can solve (Lin-OPT) with some specified relative (rather than additive) error?

Motivating example.

A key motivating example for the use of an inexact oracle in the probabilistic setting is the case where g()g(\cdot) is the indicator function for the set

𝒮={X𝕊n:X0,Tr(X)=1},\mathcal{S}=\{X\in\mathbb{S}^{n}:\textup{X}\succeq 0,\textup{Tr}(X)=1\}, (1.1)

whose extreme points are the rank-1 PSD matrices. The problem of optimizing convex functions over the set 𝒮\mathcal{S} prominently features in the study of memory-efficient techniques for semidefinite programming (see, e.g., [25, 34]). For such problems, generating an optimal solution to (Lin-OPT) is equivalent to determining the smallest eigenvalue-eigenvector pair of (f((Xt)))\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t}))). For large-scale problems, it is typically not practical to compute such an eigenvalue-eigenvector pair to a very high accuracy at each iteration. Furthermore, scalable extreme eigenvalue-eigenvector algorithms (such as the Lanczos method with random initialization [14]) usually only give relative error guarantees, and are only guaranteed to succeed with probability at least 1p1-p for some p(0,1)p\in(0,1).

Furthermore, for large-scale instances, it is essential to have a memory-efficient way to represent the decision variable. Recent memory-efficient techniques such as sketching [33, 34, 29, 30] or sampling [25] involve alternate low-memory representation of the decision variable and provide provable convergence guarantees of first-order methods that employ these low-memory representations. In Section 6, we briefly discuss how such memory-efficient techniques can be combined with our proposed method specifically when applied to a semidefinite relaxation of the problem of optimizing quadratic form over the sphere (see Section 1.2.1).

We now describe the setting for each question listed in this section.

Inexact Linear Minimization Oracle

An inexact minimization oracle that we consider here does not need to generate an optimal solution to (Lin-OPT), i.e., it does not need to solve the problem exactly in order to provide provable convergence guarantees for the algorithm. Before defining the oracle precisely, we look at how an inexact oracle changes the definition of quantities in Algorithm 1. We first define the approximate duality gap, GtaG_{t}^{a}, as

Gta(xt,ht)=Fxtlin(xt)Fxtlin(ht).G_{t}^{a}(x^{t},h^{t})=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{t}).

Here, xtx^{t} is the value of the decision variable at iterate tt of our method, and hth^{t} is the approximate minimizer of (Lin-OPT). Note that, if hth^{t} is replaced by hh^{\star}, an optimal solution to (Lin-OPT), then GtaG_{t}^{a} becomes equal to the Frank-Wolfe duality gap GtG_{t} [13]. In Algorithm 1, GtG_{t} is used to define the step length γt\gamma_{t}. Thus, in Algorithm 1, an exact oracle is needed to compute the update direction as well as the step length.

We consider a setting where we compute the update direction by computing an approximate minimizer, hth^{t}, to (Lin-OPT), and correspondingly defining GtaG_{t}^{a} and the step length γt\gamma_{t} which are functions of hth^{t}. For our iterative algorithm to be a descent method, we need Gta0G_{t}^{a}\geq 0. Thus, this gives us the criteria that the output of inexact oracle must satisfy. Furthermore, our analysis shows that if, at iterate tt, GtaG_{t}^{a} is large enough, we can ensure the algorithm makes progress, even if the approximation error in (Lin-OPT) is not small. Thus, we define an inexact oracle ILMO as follows:

Definition 1 (ILMO).

ILMO is an oracle that takes xtx^{t}, (f((xt)))\nabla\mathcal{B}^{*}(f(\mathcal{B}(x^{t}))) (gradient of the objective function at xtx^{t}), and a nonnegative quantity δt>0\delta_{t}>0 as an input, and outputs a pair (ht,Gta)(h^{t},G_{t}^{a}), where GtaG_{t}^{a} is the approximate duality gap, such that Gta0G_{t}^{a}\geq 0, and at least one of the following two conditions are satisfied:

  1. C.1

    (large gap condition) Gta>θ+RgG_{t}^{a}>\theta+R_{g}, where RgR_{g} is the maximum variation in g()g(\cdot) over its domain,

  2. C.2

    (δt\delta_{t}-suboptimality condition) Fxtlin(ht)Fxtlin(h)+δtF^{\textup{lin}}_{x^{t}}(h^{t})\leq F^{\textup{lin}}_{x^{t}}(h^{\star})+\delta_{t}, where hh^{\star} is an optimal solution to (Lin-OPT).

Note that, condition C.2 (δt\delta_{t}-suboptimality) is equivalent to stating that the output of ILMO, hth^{t}, is an δt\delta_{t}-optimal solution to (Lin-OPT), and thus, we do not need to solve (Lin-OPT) exactly. Furthermore, from our analysis, we observed that when GtaG_{t}^{a} is lower bounded by θ+Rh\theta+R_{h}, hth^{t} need not be a δt\delta_{t}-optimal solution. Thus, the definition of ILMO enforces that hth^{t} needs to be δt\delta_{t}-optimal only when Gtaθ+RhG_{t}^{a}\leq\theta+R_{h}, which can lead to reduced computational complexity of solving (Lin-OPT). In Section 3, we propose approximate Frank-Wolfe algorithm (see Algorithm 2), that only requires access to ILMO, and furthermore, we establish convergence guarantees comparable to that of Algorithm 1 that requires exact oracle.

Inexact oracle with a nonzero probability of failure.

We consider an inexact oracle PLMO that has a nonzero probability pp of failure as defined below.

Definition 2 (PLMO).

PLMO is an oracle that takes xtx^{t}, (f((xt)))\mathcal{B}^{*}(\nabla f(\mathcal{B}(x^{t}))) (the gradient of the objective function at xtx^{t}), and δt\delta_{t} as input, and returns a pair (ht,Gta)(h^{t},G_{t}^{a}) such that Gta0G_{t}^{a}\geq 0 and with probability at least 1p1-p, at least one of the two conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) is satisfied.

The key difference between PLMO and ILMO (Definition 1) is the nonzero probability of failure of the oracle PLMO. Indeed, ILMO is a special case of PLMO when p=0p=0. In Section 5, we consider this probabilistic setting for the oracle. We provide convergence analysis of our proposed approximate generalized Frank-Wolfe algorithm, when the linear minimization oracle fails to satisfy both conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) with probability at most pp.

Implementing the inexact oracles

Our inexact oracle model is flexible enough that it can be implemented using an algorithm for solving (Lin-OPT) to either prespecified additive error or prespecified multiplicative error. The multiplicative error case is of interest when g()g(\cdot) is the indicator function of a compact convex set so that g(x)=0g(x)=0 for all feasible xx. In this case, it turns out (see Lemma 8) that the optimal value of (Lin-OPT), Fxtlin(h)F^{\textup{lin}}_{x^{t}}(h^{\star}), is negative. By an algorithm that solves (Lin-OPT) to multiplicative error τt[0,1)\tau_{t}\in[0,1), we mean an algorithm that returns a feasible point hth^{t} that satisfies

Fxtlin(h)Fxtlin(ht)(1τt)Fxtlin(h),F^{\textup{lin}}_{x^{t}}(h^{\star})\leq F^{\textup{lin}}_{x^{t}}(h^{t})\leq(1-\tau_{t})F^{\textup{lin}}_{x^{t}}(h^{\star}),

This corresponds to an additive error of τt(Fxtlin(h))\tau_{t}(-F^{\textup{lin}}_{x^{t}}(h^{\star})). The multiplicative error case is particularly challenging because the gradient of a self-concordant barrier, f((xt))\nabla f(\mathcal{B}(x^{t})), grows without bound near the boundary of the domain of f()f(\cdot), meaning that Fxtlin(h)F^{\textup{lin}}_{x^{t}}(h^{\star}) is not uniformly bounded. Therefore, a small relative error for (Lin-OPT) (i.e., a small value of τt\tau_{t}) may still lead to large absolute errors. In Section 4 we show how the large gap condition (Condition C.1) in our oracle model allows us to overcome this issue.

1.2 Applications

Consider the MM-self-concordant, θ\theta-logarithmically-homogeneous barrier function

f((x))=i=1mcilog(aiTx),f(\mathcal{B}(x))=-\sum_{i=1}^{m}c_{i}\log(a_{i}^{T}x), (1.2)

with M=maxi2ciM=\max_{i}\frac{2}{\sqrt{c_{i}}} and θ=i=1mci\theta=\sum_{i=1}^{m}c_{i}, where the linear map ():nm\mathcal{B}(\cdot):\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}, maps the decision variable xnx\in\mathbb{R}^{n} to an mm-dimensional vector [aiTx]i=1m[a_{i}^{T}x]_{i=1}^{m}. In this section, we review two applications (see Subsections 1.2.1 and 1.2.2) whose convex formulations involve minimizing functions of the form (1.2) over 𝒮\mathcal{S} (1.1). The problems of this type are also often studied in the context of memory-efficient algorithms. One of the contributions of this paper is to show that, for these applications, we can combine memory-efficient representations [25] of the decision variable with the first-order algorithm proposed in this paper.

Historically, self-concordant barrier functions were often studied because they allow for an affine-invariant analysis of Newton’s method [19]. Recently, problems with self-concordant barriers in the objective are also studied in the context of machine learning, often as a loss function [2, 22, 16]. The function (1.2) appears in several applications such as maximum-likelihood quantum state tomography [11], portfolio optimization [31, 4, 35], and maximum likelihood Poisson inverse problems [10, 20, 21] in a variety of imaging applications such as night vision, low-light imaging. Furthermore, convex formulations of problems such as DD-optimal design [5, 1] and learning sparse Gaussian Markov Random Fields [26, 23] also involve optimizing a 2-self-concordant, nn-logarithmically-homogeneous barrier function (f(X)=log(det(X))f(X)=-\log(\det(X)), with X𝕊++nX\in\mathbb{S}^{n}_{++}). The functions logdetX-\log\det X and logx-\log x are also commonly used self-concordant barriers for the PSD cone and +\mathbb{R}_{+}, respectively.

1.2.1 Optimizing quadratic form over the sphere

Consider the problem

maxx=1,xni=1dxAi,x1/d,\max_{\|x\|=1,x\in\mathbb{R}^{n}}\prod_{i=1}^{d}\langle xA_{i},x\rangle^{1/d}, (GMean)

where Ai𝕊+nA_{i}\in\mathbb{S}^{n}_{+} for i=1,,di=1,\dotsc,d. Many applications such as approximating the permanent of PSD matrices or portfolio optimization take the form of maximizing quadratic forms i=1dxAi,x\prod_{i=1}^{d}\langle xA_{i},x\rangle over {xn:x=1}\{x\in\mathbb{R}^{n}:\|x\|=1\}. By normalizing the objective of this problem, we get (GMean), whose objective is the geometric mean of the quadratic forms. Yuan and Parrilo [32, Theorem 7.1] shows that when d=Ω(n)d=\Omega(n), (GMean) is NP-hard. A natural semidefinite relaxation of (GMean) is

maxXi=1dAi,X1/dsubject to{Tr(X)=1,X0.\max_{X}\prod_{i=1}^{d}\langle A_{i},X\rangle^{1/d}\quad\textup{subject to}\ \begin{cases}\textup{Tr}(X)=1,\\ X\succeq 0.\end{cases} (1.3)

Yuan and Parrilo [32] show that solving the SDP relaxation (1.3) and appropriately rounding the solution to have rank one, gives a constant factor approximation algorithm for (GMean). For 𝒮\mathcal{S} given in (1.1), we have an equivalent problem

minX𝒮i=1dlog(Ai,X),\min_{X\in\mathcal{S}}-\sum_{i=1}^{d}\log(\langle A_{i},X\rangle), (SC-GMean)

where the objective is a 2-self-concordant, dd-logarithmically-homogeneous barrier function, and (SC-GMean) is equivalent to (1.3) up to the scaling of the objective. So, an optimal solution of (SC-GMean) is also an optimal solution of (1.3). Yuan and Parrilo [32, Theorem 1.3] show that if XX^{\star} is optimal for (1.3) and y𝒩(0,X)y\sim\mathcal{N}(0,X^{\star}) then y/yy/\|y\| feasible for (GMean) and within a factor of eLre^{-L_{r}} of the optimal value, where Lr=γ+log2+ψ(r2)log(r2)<1.271L_{r}=\gamma+\log 2+\psi\left(\frac{r}{2}\right)-\log\left(\frac{r}{2}\right)<1.271, γ\gamma is the Euler-Mascheroni constant, ψ(r)\psi(r) is the digamma function.

1.2.2 Maximum-likelihood Quantum State Tomography (ML-QST)

The problem with structure similar to (SC-GMean) also appears in Quantum State Tomography (QST). QST is the process of estimating an unknown quantum state from given measurement data. We consider that a measurement is a random variable zz which takes the values in the set [N]={1,,N}[N]=\{1,\dotsc,N\} with probability distribution P(z=i)=Bi,X,i[N]P(z=i)=\langle B_{i},X\rangle,\forall i\in[N], where each BiB_{i} is PSD complex matrix such that i=1NBi=I\sum_{i=1}^{N}B_{i}=I, X𝒮X\in\mathcal{S}, where 𝒮\mathcal{S} is defined in (1.1). Note that, if there are qq qubits, then n=2qn=2^{q}. Maximum-likelihood estimation of quantum states [11, 12] involves finding the state (represented by X𝒮X\in\mathcal{S}) that maximizes the probability of observing the measurement. Assume that we have mm measurements z1,,zmz_{1},\ldots,z_{m} where zj[N]z_{j}\in[N] for j=1,,mj=1,\ldots,m. The maximum-likelihood QST problem then takes the form maxX𝒮j=1mAj,X\max_{X\in\mathcal{S}}\prod_{j=1}^{m}\langle A_{j},X\rangle, where Aj=BzjA_{j}=B_{z_{j}} for j=1,,mj=1,\dotsc,m. Several results for maximum likelihood estimation of QST are provided in literature ([9, 3, 24]). The problem can be equivalently stated as

minX𝒮j=1mlog(Aj,X).\min_{X\in\mathcal{S}}-\sum_{j=1}^{m}\log(\langle A_{j},X\rangle). (ML-QST)

Note that (ML-QST) is identical to (SC-GMean) and its objective is a 2-self-concordant, mm-logarithmically-homogeneous barrier.

1.3 Contributions

The focus of this paper is to provide a framework to solve a minimization problem of the form (SC-OPT), where the objective function is a self-concordant, logarithmically-homogeneous barrier function. We provide an algorithmic framework that addresses the three questions stated in Section 1.1. The main contributions of the paper are summarized below.

  • We provide an algorithmic framework, that generates an ϵ\epsilon-optimal solution to (SC-OPT). The proposed algorithm (Algorithm 2) only requires access to an inexact oracle ILMO (Definition 1). We provide provable convergence guarantees (Lemmas 5 and 6) that are comparable to guarantees given by Zhao and Freund [36] whose method requires access to an exact oracle.

  • We consider the setting where the inexact oracle can provide an approximate solution to (Lin-OPT) with either additive or multiplicative error bound. We show how to implement the oracle that satisfies the output criteria of ILMO. In this scenario, the convergence guarantees given by Lemmas 5 and 6 for Algorithm 2 still hold true.

  • We next consider a setting where the linear minimization oracle fails to satisfy both conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) with probability at most p>0p>0. In this setting, we show that we can bound, with probability at least 1p1-p, the number of iterations of Algorithm 2 required to converge to an ϵ\epsilon-optimal solution of (SC-OPT) (see Lemmas 12 and 13).

  • We apply our algorithmic framework specifically to (SC-GMean) and provide the computational complexity of our proposed method. We also show that our method can be combined with a memory-efficient technique given by Shinde et al. [25], so that the method uses at most 𝒪(n+d)\mathcal{O}(n+d) memory (apart from the memory used to store the input parameters). We also implement our method as well as Algorithm 1, and use these algorithms to generate an ϵ\epsilon-optimal solution to randomly generated instances of (SC-GMean). We show that, for these instances, the behavior of our method (in terms of change in optimality gap with each iteration) is similar to that of Algorithm 1. We also show that the theoretical upper bound on the number of iterations to converge to an ϵ\epsilon-optimal solution is quite conservative, and in practice, the algorithms generate the desired solution much quicker.

1.4 Brief Review of First- and Second-Order Methods for (SC-OPT)

It is easy to see that interior-point methods [18, 17] can be used to solve composite problems of type (SC-OPT). In fact, a standard technique while using IPMs for such problems first eliminates the nonsmooth term h(x)h(x) and additionally uses (self-concordant) barrier functions for inequality constraints. The method then solves a sequence of problems to generate an approximate solution to (SC-OPT). Such methods often result in the loss of underlying structures (for example, sparsity) of the input problem. To overcome this issue, proximal gradient-based methods [28, 26, 27] directly solve the nonsmooth problem (SC-OPT) and provide local convergence guarantees (xt+1xtxtϵ\|x^{t+1}-x^{t}\|_{x^{t}}\leq\epsilon, where uxt=uT2f(xt)u\|u\|_{x^{t}}=\sqrt{u^{T}\nabla^{2}f(x^{t})u}) for (SC-OPT). Liu et al. [15] use projected Newton method to solve minx𝒞f(x)\min_{x\in\mathcal{C}}f(x), where f()f(\cdot) is self-concordant, such that the projected Newton direction is computed in each iteration of the method using an adaptive Frank-Wolfe method to solve the subproblem. They show that the outer loop (projected Newton) and the inner loop (Frank-Wolfe) have the complexity of 𝒪(log(1/ϵ))\mathcal{O}(\log(1/\epsilon)) and 𝒪(1/ϵν)\mathcal{O}(1/\epsilon^{\nu}) respectively, where ν=1+log(12β)log(σ)\nu=1+\frac{\log(1-2\beta)}{\log(\sigma)}, for some constants σ,β>0\sigma,\beta>0, where (β,σ)(\beta,\sigma) satisfy the conditions outlined in [15, Theorem 4.2].

For large-scale applications of (SC-OPT), Frank-Wolfe and other first-order methods are often preferred since, at each iteration either an exact or approximate solution is generated to the linear minimization problem (Lin-OPT) which can lead to low-computational complexity per iteration. However, the analysis of such methods usually depends on the assumption that ff has a Lipschitz continuous gradient [13], which a self-concordant function does not satisfy. There are algorithms (for example [6]) that provide new policies to compute step length at each iteration of Frank-Wolfe and provide convergence guarantees for the modified algorithm. However, these guarantees depend on the Lipschitz constant of the objective on a sublevel set, which might not be easy to compute and in some cases may not even be bounded.

1.5 Notations and Outline

Notations.

We use small alphabets (for instance, xx) to denote a vector variables, and big alphabets (for example, XX) to denote matrix variables. We use superscripts (x(t))(x^{(t)}) to denote values of parameters and variables at tt-th iteration of the algorithm. The term A,B=Tr(ATB)\left\langle A,B\right\rangle=\textup{Tr}\left(A^{T}B\right) denotes the matrix inner product. The notations g()\nabla g(\cdot) and 2g()\nabla^{2}g(\cdot) are used to denote the gradient and the Hessian respectively of a twice differentiable function gg. The linear map ():𝕊nd\mathcal{B}(\cdot):\mathbb{S}^{n}\rightarrow\mathbb{R}^{d} maps a symmetric n×nn\times n matrix to a dd-dimensional space and ()\mathcal{B}^{*}(\cdot) is its adjoint. The notation 𝒪\mathcal{O} has the usual complexity interpretation.

Outline.

In Section 2, we review the generalized Frank-Wolfe algorithm and key properties of a 2-self-concordant, θ\theta-logarithmically-homogeneous function. We also define the terminology used in the rest of the paper. In Section 3, we propose an approximate generalized Frank-Wolfe algorithm (Algorithm 2) that generates an ϵ\epsilon-optimal solution to (SC-OPT), and we provide convergence analysis of the algorithm. In Section 4, we show how an oracle that generates an approximate solution with relative error can be used with our proposed method (Algorithm 2). In Section 5, we propose a method (Algorithm 2) that, with high probability, generates an ϵ\epsilon-optimal solution to (SC-OPT) when the oracle that solves the linear subproblem has a nonzero probability of failure. In Section 6, we apply our method specifically to (SC-GMean) and provide the convergence analysis of the method. We also show that our method can be combined with memory-efficient techniques from literature so that it can implemented in 𝒪(n+d)\mathcal{O}(n+d) memory. Finally, in Section 7, we provide computational results of applying Algorithm 2 to random instances of (SC-GMean).

2 Preliminaries

In this section, we review key properties of self-concordant, θ\theta-logarithmically-homogeneous barrier functions as well as other quantities that are useful in our analysis. For a detailed review of self-concordant functions, refer to Nesterov and Nemirovskii [19].

Definition 3.

[19] Let 𝒦\mathcal{K} be a regular cone, and let ff be a thrice-continuously differentiable, strictly convex function defined on int(𝒦)\textup{int}(\mathcal{K}). Define the third-order directional derivative 3f(w)[u1,u2,u3]=3t1t2t3f(w+t1u1+t2u2+t3u3)|t1=t2=t3=0\nabla^{3}f(w)[u_{1},u_{2},u_{3}]=\frac{\partial^{3}}{\partial t_{1}\partial t_{2}\partial t_{3}}f(w+t_{1}u_{1}+t_{2}u_{2}+t_{3}u_{3})|_{t_{1}=t_{2}=t_{3}=0}. Then f(w)f(w) is a MM-self-concordant, θ\theta-logarithmically-homogeneous barrier on int(𝒦)\textup{int}(\mathcal{K}) if

  1. 1.

    f(wi)f(w_{i})\rightarrow\infty along every sequence {wi}int(𝒦)\{w_{i}\}\in\textup{int}(\mathcal{K}) that converges to a boundary point of 𝒦\mathcal{K},

  2. 2.

    |3f(w)[u,u,u]|M(uT2f(w)u)3/2\left|\nabla^{3}f(w)[u,u,u]\right|\leq M\left(u^{T}\nabla^{2}f(w)u\right)^{3/2} for all wint(𝒦)w\in\textup{int}(\mathcal{K}) and udu\in\mathbb{R}^{d}, and

  3. 3.

    f(cw)=f(w)θlog(c)f(cw)=f(w)-\theta\log(c) for all wint(𝒦)w\in\textup{int}(\mathcal{K}) and c>0c>0.

Properties of a 22-self concordant, θ\theta-logarithmically homogeneous barrier function ff.

Let 𝒦\mathcal{K} be a proper cone and let f()f(\cdot) be a 22-self-concordant, θ\theta-logarithmically-homogeneous barrier function on int(𝒦)\textup{int}(\mathcal{K}). For any uint(𝒦)u\in\textup{int}(\mathcal{K}), let (u)\mathcal{H}(u) denote the Hessian of f()f(\cdot) at uu. Define the (semi)norm wu=w,(u)w\|w\|_{u}=\sqrt{\langle w,\mathcal{H}(u)w\rangle} for any wdw\in\mathbb{R}^{d}. Also, define 𝒟(u,1)={w𝒦:wuu<1}\mathcal{D}(u,1)=\{w\in\mathcal{K}:\|w-u\|_{u}<1\} which is often referred to as the unit Dikin ball. Define a univariate function ω\omega and its Fenchel conjugate as follows:

ω(a)\displaystyle\omega(a) ={alog(1a)if a<1otherwise.\displaystyle=\begin{cases}-a-\log(1-a)&\textup{if $a<1$}\\ \infty&\textup{otherwise}.\end{cases} (2.1)
ω(a)\displaystyle\omega^{\star}(a) ={alog(1+a)if a>1otherwise.\displaystyle=\begin{cases}a-\log(1+a)&\textup{if $a>-1$}\\ \infty&\textup{otherwise.}\end{cases} (2.2)
Proposition 1 (Proposition 2.3.4, Corollary 2.3.1,2.3.3 [19]).

Let 𝒦\mathcal{K} be a proper cone and let f()f(\cdot) be a 22-self-concordant, θ\theta-logarithmically-homogeneous barrier function on int(𝒦)\textup{int}(\mathcal{K}). If uint(𝒦)u\in\textup{int}(\mathcal{K}) then

f(w)f(u)f(u),wu\displaystyle f(w)-f(u)-\langle\nabla f(u),w-u\rangle ω(wuu)w𝒟(u,1)\displaystyle\leq\omega(\|w-u\|_{u})\quad\forall w\in\mathcal{D}(u,1) (2.3)
|f(u),w|\displaystyle|\langle\nabla f(u),w\rangle| θwu,wd\displaystyle\leq\sqrt{\theta}\|w\|_{u},\quad\quad\;\,\forall w\in\mathbb{R}^{d} (2.4)
wu\displaystyle\|w\|_{u} f(u),ww𝒦\displaystyle\leq-\langle\nabla f(u),w\rangle\quad\forall w\in\mathcal{K} (2.5)
f(u),w\displaystyle\langle\nabla f(u),w\rangle =(u)u,wwd\displaystyle=-\langle\mathcal{H}(u)u,w\rangle\quad\forall w\in\mathbb{R}^{d} (2.6)
f(u),u\displaystyle\langle\nabla f(u),u\rangle =θand\displaystyle=-\theta\quad\textup{and} (2.7)
θ\displaystyle\theta 1.\displaystyle\geq 1. (2.8)

These properties will be used later in the proofs.

2.1 Terminology

We now define a few quantities that are used in the rest of the paper.

  • Frank-Wolfe duality gap, Gt(xt)G_{t}(x^{t}) [13]: For any feasible point xtx^{t} to (SC-OPT), let hh^{\star} be an optimal solution to (Lin-OPT). Then,

    Gt(xt)=f((xt)),(xth)+g(xt)g(h)=Fxtlin(xt)Fxtlin(h).G_{t}(x^{t})=\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t}-h^{\star})\rangle+g(x^{t})-g(h^{\star})=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star}). (2.9)
  • Approximate Frank-Wolfe duality gap, Gta(xt,ht)G_{t}^{a}(x^{t},h^{t}): For any two feasible points xtx^{t}, hth^{t} to (SC-OPT), we have

    Gta(xt,ht)=f((xt)),(xtht)+g(xt)g(ht)=Fxtlin(xt)Fxtlin(ht).G_{t}^{a}(x^{t},h^{t})=\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t}-h^{t})\rangle+g(x^{t})-g(h^{t})=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{t}). (2.10)

    Note: The duality gap, Gt(xt)G_{t}(x^{t}), is a function of the current iterate xtx^{t}, whereas the approximate duality gap, Gta(xt,ht)G_{t}^{a}(x^{t},h^{t}), is a function of the current iterate xtx^{t}, and any feasible point hth^{t} to (SC-OPT). However, for the ease of notation, we denote Gt(xt)G_{t}(x^{t}) by GtG_{t}, and Gta(xt,ht)G_{t}^{a}(x^{t},h^{t}) by GtaG_{t}^{a} in the rest of the paper when xtx^{t} and hth^{t} are clear from the context.

  • Optimality gap Δt\Delta_{t}: If x=argminxnF(x)x^{\star}=\operatorname*{arg\,min}_{x\in\mathbb{R}^{n}}F(x), i.e., xx^{\star} is an optimal solution to (SC-OPT), and xtx^{t} is a feasible point to (SC-OPT) then

    Δt(xt)=F(xt)F(x).\Delta_{t}(x^{t})=F(x^{t})-F(x^{\star}). (2.11)

    When xtx^{t} is clear from the context we denote Δt(xt)\Delta_{t}(x^{t}) by Δt\Delta_{t}.

  • Maximum variation of g()g(\cdot) over its domain, RgR_{g}: We define as RgR_{g} as

    Rg=maxx1,x2dom(g)|g(x1)g(x2)|.R_{g}=\max_{x_{1},x_{2}\in\textup{dom}(g)}|g(x_{1})-g(x_{2})|. (2.12)
  • Distance, DtD_{t}: For any two feasible points xtx^{t} and hth^{t} to (SC-OPT), we define

    Dt(xt,ht)=(htxt)(xt),D_{t}(x^{t},h^{t})=\|\mathcal{B}(h^{t}-x^{t})\|_{\mathcal{B}(x^{t})}, (2.13)

    where (xt)\|\cdot\|_{\mathcal{B}(x^{t})} is the (semi)norm defined by the Hessian of f()f(\cdot) at (xt)\mathcal{B}(x^{t}). When xtx^{t} and hth^{t} are clear from the context, we denote Dt(xt,ht)D_{t}(x^{t},h^{t}) by DtD_{t}.

The following standard result shows that the optimality gap is bounded by the duality gap.

Fact 1.

For any feasible point xtx^{t} to (SC-OPT), we have ΔtGt\Delta_{t}\leq G_{t}.

Proof.

Since ff is convex, for any xtx^{t} and ww that are feasible for (SC-OPT),

0f((w))[f((xt))f((xt)),(wxt)].0\leq f(\mathcal{B}(w))-\left[f(\mathcal{B}(x^{t}))-\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(w-x^{t})\rangle\right].

Setting ww to be xx^{\star}, an optimal point for (SC-OPT), and rearranging gives

Δt=f((xt))f((x))+g(xt)g(x)\displaystyle\Delta_{t}=f(\mathcal{B}(x^{t}))-f(\mathcal{B}(x^{\star}))+g(x^{t})-g(x^{\star}) f((xt)),(xtx)+g(xt)g(x).\displaystyle\leq\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t}-x^{\star})\rangle+g(x^{t})-g(x^{\star}).

Finally, if hh^{\star} is optimal for (Lin-OPT) then

f((xt)),(xtx)+g(xt)g(x)f((xt)),(xth)+g(xt)g(h)=Gt,\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t}-x^{\star})\rangle+g(x^{t})-g(x^{\star})\leq\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t}-h^{\star})\rangle+g(x^{t})-g(h^{\star})=G_{t},

completing the argument. ∎

3 Approximate Generalized Frank-Wolfe algorithm

Consider the composite optimization problem (SC-OPT), where f()f(\cdot) is a θ\theta-logarithmically-homogeneous, MM-self-concordant barrier function on int(𝒦)\textup{int}(\mathcal{K}), 𝒦\mathcal{K} is a proper cone, and g()g(\cdot) is a proper, closed, convex and possibly nonsmooth function with dom(g)\textup{dom}(g) a closed, compact, nonempty set. Zhao and Freund [36] propose a Generalized Frank-Wolfe algorithm (Algorithm 1) that requires an exact solution to the subproblem (Lin-OPT) at each iteration. In their algorithm, the knowledge of the exact minimizer is necessary to compute the update step, hh^{\star}, as well as the step length (since the step length depends on the duality gap Gt(xt)G_{t}(x^{t})). Furthermore, the analysis of the algorithm also depends on the exact minimizer of the subproblem (Lin-OPT). We, however, propose an ‘approximate’ generalized Frank-Wolfe algorithm. The term approximate in the name of our algorithm comes from the fact that we do not require the linear minimization subproblem (Lin-OPT) to be solved exactly at each iteration. The framework of our method is given in Algorithm 2. We provide a detailed explanation of Steps 5 and 6 (setting the value of δt\delta_{t} and ILMO respectively) below.

Algorithm 2 Approximate Generalized Frank-Wolfe

Input: Problem (SC-OPT), where f()f(\cdot) is a 2-self-concordant, θ\theta-logarithmically-homogeneous barrier function, ϵθ+Rg\epsilon\leq\theta+R_{g}: suboptimality parameter, scheduled strategy S.1 or adaptive strategy S.2 for setting the value of δt\delta_{t}
Output: x^ϵ\widehat{x}_{\epsilon}, such that x^ϵ\widehat{x}_{\epsilon} an ϵ\epsilon-optimal solution to (SC-OPT)

1:  Initialize x0nx^{0}\in\mathbb{R}^{n}.
2:  Set t0t\leftarrow 0.
3:  while true do
4:     Compute (f((xt)))\mathcal{B}^{*}(\nabla f(\mathcal{B}(x^{t}))).
5:     Set δt0\delta_{t}\geq 0 according to the input strategy.
6:     (ht,Gta)=ILMO((f((xt))),δt,xt)(h^{t},G_{t}^{a})=\texttt{ILMO}(\mathcal{B}^{*}(\nabla f(\mathcal{B}(x^{t}))),\delta_{t},x^{t}).
7:     if GtaϵG_{t}^{a}\leq\epsilon and δt3ϵ/2\delta_{t}\leq 3\epsilon/2 then
8:        return (xt,δt)(x_{t},\delta_{t}).
9:     end if
10:     Compute Dt=(htxt)(xt)D_{t}=\|\mathcal{B}(h^{t}-x^{t})\|_{\mathcal{B}(x^{t})}.
11:     Set γt=min{GtaDt(Dt+Gta),1}\gamma_{t}=\min\left\{\frac{G_{t}^{a}}{D_{t}(D_{t}+G_{t}^{a})},1\right\}.
12:     Update xt+1=(1γt)xt+γthtx^{t+1}=(1-\gamma_{t})x^{t}+\gamma_{t}h^{t}.
13:     t=t+1t=t+1.
14:  end while

Note that any MM-self concordant function can be converted to a 22-self concordant function f(u)f(u) by scaling the objective. We define this property more formally in the following proposition.

Proposition 2.

Let f¯(u)\overline{f}(u) be an MM-self concordant, θ¯\overline{\theta}-logarithmically homogeneous function. Define a function f(u)f(u) and parameter θ\theta such that

  1. 1.

    f(u)=M24f¯(u)f(u)=\frac{M^{2}}{4}\overline{f}(u),

  2. 2.

    θ=M24θ¯\theta=\frac{M^{2}}{4}\overline{\theta}.

Then f(u)f(u) is a 22-self concordant, θ\theta-logarithmically homogeneous function.

Proposition 2 tells us that, by rescaling, we can always reduce to the regime of functions that are 2-self-concordant (often called standard self-concordant). However, if we are minimizing such a function, rescaling affects the optimal value, and hence affects approximation error guarantees. Thus, even though the input to Algorithm 2 is specifically a problem with 2-self-concordant barrier function in the objective, by replacing ϵ\epsilon by M24ϵ\frac{M^{2}}{4}\epsilon in Step 7 of the algorithm, we can generate an ϵ\epsilon-optimal solution to (SC-OPT), when f()f(\cdot) is an MM-self-concordant, θ\theta-logarithmically-homogeneous barrier function.

Choosing δt\delta_{t}.

We propose two different strategies to choose the value of δt\delta_{t} in Step 5 of the algorithm.

  1. S.1

    (Scheduled strategy) (δt)t0(\delta_{t})_{t\geq 0} is any positive sequence converging to zero. It follows that, given any ϵ>0\epsilon>0, there exists some positive integer KK such that δtϵ/2\delta_{t}\leq\epsilon/2 for all tKt\geq K.

  2. S.2

    (Adaptive strategy) δtϵ/2+minτ<tGτat\delta_{t}\leq\epsilon/2+\min_{\tau<t}G_{\tau}^{a}\quad\forall t.

Remark 1.

The main idea behind adaptive strategy S.2 is to allow the target accuracy of ILMO to adapt to the progress of the algorithm (in terms of the smallest approximate duality gap we have encountered so far). This allows the ILMO to solve to low accuracy until significant progress has been made in terms of approximate duality gap. Since we are working with the approximate duality gap, and not the actual duality gap, it is possible that Gta=0G_{t}^{a}=0 before the algorithm has reached the desired optimality gap. We use the additive quantity ϵ/2\epsilon/2 to ensure that we can choose δt>0\delta_{t}>0 even when Gτa=0G_{\tau}^{a}=0 for some τ<t\tau<t.

ILMO in Algorithm 2.

The main difference between generalized Frank-Wolfe [36] and Algorithm 2, is the definition of the oracle. The generalized Frank-Wolfe algorithm (Algorithm 1) requires LMO to generate an optimal solution to the subproblem (Lin-OPT) in order to generate update direction and step length, as well as to prove its convergence guarantees. Whereas in Algorithm 2, ILMO is an oracle whose output only needs to satisfy Gta0G_{t}^{a}\geq 0 and at least one the two conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) (see Definition 1).

When condition C.2 (δt\delta_{t}-suboptimality) is satisfied by the output of ILMO, we have the following relationship between the duality gap GtG_{t} and the approximate duality gap GtaG_{t}^{a}.

Lemma 1 (Relationship between GtG_{t} and GtaG_{t}^{a}).

Let GtG_{t}, GtaG_{t}^{a} be as defined in Section 2.1. If condition C.2 (δt\delta_{t}-suboptimality) is satisfied, then

GtδtGtaGt.G_{t}-\delta_{t}\leq G_{t}^{a}\leq G_{t}. (3.1)
Proof.

If condition C.2 (δt\delta_{t}-suboptimality) holds, then Algorithm 2 generates hth^{t} such that

Fxtlin(h)Fxtlin(ht)=Fxtlin(h)+δt.F^{\textup{lin}}_{x^{t}}(h^{\star})\leq F^{\textup{lin}}_{x^{t}}(h^{t})=F^{\textup{lin}}_{x^{t}}(h^{\star})+\delta_{t}. (3.2)

From the definition of GtaG_{t}^{a} (see (2.10)), we have

Gta=Fxtlin(xt)Fxtlin(ht)=Fxtlin(xt)Fxtlin(h)δt=Gtδt,\begin{split}G_{t}^{a}&=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{t})\\ &=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star})-\delta_{t}\\ &=G_{t}-\delta_{t},\end{split}

where the second equality follows from the second inequality in (3.2). We also have

GtaFxtlin(xt)Fxtlin(h)=Gt,G_{t}^{a}\leq F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star})=G_{t},

where the inequality follows from the first inequality in (3.2). Combining the two results completes the proof. ∎

Computing γt\gamma_{t}.

We now explain how the step-size, γt\gamma_{t}, is selected in Algorithm 2. Substituting xtx^{t} and xt+1x^{t+1} (as given by Algorithm 2) in (2.3), we have

f((xt+γt(htxt)))f((xt))+γtf((xt)),(htxt)+ω(γtDt).f(\mathcal{B}(x^{t}+\gamma_{t}(h^{t}-x^{t})))\leq f(\mathcal{B}(x^{t}))+\gamma_{t}\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t}-x^{t})\rangle+\omega(\gamma_{t}D_{t}). (3.3)

Furthermore, since g()g(\cdot) is convex,

g(xt+γt(htxt))=g((1γt)xt+γtht)(1γt)g(xt)+γtg(ht)=g(xt)γt(g(xt)g(ht)).g(x^{t}+\gamma_{t}(h^{t}-x^{t}))=g((1-\gamma_{t})x^{t}+\gamma_{t}h^{t})\leq(1-\gamma^{t})g(x^{t})+\gamma^{t}g(h^{t})=g(x^{t})-\gamma_{t}(g(x^{t})-g(h^{t})). (3.4)

Now, adding (3.3) and (3.4), we have

F(xt+γt(htxt))=f((xt+γt(htxt)))+g(xt+γt(htxt))F(xt)γtGta+ω(γtDt),\begin{split}F(x^{t}+\gamma_{t}(h^{t}-x^{t}))&=f(\mathcal{B}(x^{t}+\gamma_{t}(h^{t}-x^{t})))+g(x^{t}+\gamma_{t}(h^{t}-x^{t}))\\ &\leq F(x^{t})-\gamma_{t}G_{t}^{a}+\omega(\gamma_{t}D_{t}),\end{split} (3.5)

where the last inequality follows from (2.10). We can now compute an optimal step length γt\gamma_{t} at every iteration tt by optimizing the right hand side of (3.5) for γt\gamma_{t} to get

γt=min{GtaDt(Dt+Gta),1}.\gamma_{t}=\min\left\{\frac{G_{t}^{a}}{D_{t}(D_{t}+G_{t}^{a})},1\right\}. (3.6)

Zhao and Freund [36] use a similar technique to compute γt\gamma_{t} based on GtG_{t}. The gap GtG_{t} requires the knowledge of an optimal solution to the subproblem (Lin-OPT), whereas (3.6) uses an approximate solution to (Lin-OPT).

Behavior of Δt\Delta_{t}.

In the following proposition, we give lower bounds on the amount by which Δt\Delta_{t} decreases at each iteration of Algorithm 2. Note that this result is independent of the sequence (δt)t0(\delta_{t})_{t\geq 0}.

Proposition 3.

Let (SC-OPT) (with f((x))f(\mathcal{B}(x)) a 2-self-concordant and θ\theta-logarithmically-homogeneous barrier function) be the input to Algorithm 2. Let the quantities RgR_{g} and GtaG_{t}^{a} be as defined in (2.12) and (2.10) respectively. For any t0t\geq 0 in Algorithm 2,

  1. 1.

    if Gta>θ+RgG_{t}^{a}>\theta+R_{g}, we have

    ΔtΔt+1110.6,\Delta_{t}-\Delta_{t+1}\geq\frac{1}{10.6}, (3.7)
  2. 2.

    if Gtaθ+RgG_{t}^{a}\leq\theta+R_{g}, we have

    ΔtΔt+1(Gta)212(θ+Rg)2.\Delta_{t}-\Delta_{t+1}\geq\frac{(G_{t}^{a})^{2}}{12(\theta+R_{g})^{2}}. (3.8)

Consequently, (Δt)t0(\Delta_{t})_{t\geq 0} is a monotonically non-increasing sequence.

Proof.

The proof of this result is similar to the proof given in [36]. The difference arises due to the slightly different value of the step length γt\gamma_{t} that we have used. We give the proof here for completeness.

Proving (3.7).

Rewriting (2.10), we have

Gta=Fxtlin(ht)Fxtlin(xt)0,-G_{t}^{a}=F^{\textup{lin}}_{x^{t}}(h^{t})-F^{\textup{lin}}_{x^{t}}(x^{t})\leq 0, (3.9)

since Gta0G_{t}^{a}\geq 0 for all tt. Thus,

Gta=|f((xt)),(htxt)+g(ht)g(xt)||f((xt)),(htxt)|+|g(ht)g(xt)|.\begin{split}G_{t}^{a}&=|\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t}-x^{t})\rangle+g(h^{t})-g(x^{t})|\\ &\leq|\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t}-x^{t})\rangle|+|g(h^{t})-g(x^{t})|.\end{split} (3.10)

Now, using the definition of DtD_{t} and (2.4), we can write

Dt=(htxt)(xt)|f((xt)),(htxt)|/θ.D_{t}=\|\mathcal{B}(h^{t}-x^{t})\|_{\mathcal{B}(x^{t})}\geq|\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t}-x^{t})\rangle|/\sqrt{\theta}. (3.11)

Combining (3.10) and (3.11), we bound DtD_{t} as

DtGta|g(ht)g(xt)|θ(i)GtaRgθ>(ii)θ(iii)1,D_{t}\geq\frac{G_{t}^{a}-|g(h^{t})-g(x^{t})|}{\sqrt{\theta}}\underset{(i)}{\geq}\frac{G_{t}^{a}-R_{g}}{\sqrt{\theta}}\underset{(ii)}{>}\sqrt{\theta}\underset{(iii)}{\geq}1, (3.12)

where (i) uses the definition of RgR_{g} (2.12), (ii) uses the fact that Gta>θ+RgG_{t}^{a}>\theta+R_{g} and (iii) uses (2.8). Now, since Dt1D_{t}\geq 1, it follows that γt<1\gamma_{t}<1, and so, substituting γt=GtaDt(Gta+Dt)\gamma_{t}=\frac{G_{t}^{a}}{D_{t}(G_{t}^{a}+D_{t})} in (3.5) and rearranging gives

F(xt+1)F(xt)ω(GtaDt).F(x^{t+1})\leq F(x^{t})-\omega^{\star}\left(\frac{G_{t}^{a}}{D_{t}}\right). (3.13)

The relationship between the optimality gap at consecutive iterations is given as

ΔtΔt+1=F(xt)F(xt+1)ω(GtaDt)(i)ω(GtaGta+θ+Rg)0,\begin{split}\Delta_{t}-\Delta_{t+1}&=F(x^{t})-F(x^{t+1})\\ &\geq\omega^{\star}\left(\frac{G_{t}^{a}}{D_{t}}\right)\underset{(i)}{\geq}\omega^{\star}\left(\frac{G_{t}^{a}}{G_{t}^{a}+\theta+R_{g}}\right)\geq 0,\end{split} (3.14)

where (i) follows from (A.1) and the monotonicity of ω()\omega^{\star}(\cdot). Since Gta>θ+RgG_{t}^{a}>\theta+R_{g}, we have GtaGta+θ+Rg>12\frac{G_{t}^{a}}{G_{t}^{a}+\theta+R_{g}}>\frac{1}{2}. Now, [36, Proposition 2.1] states that ω(s)s/5.3\omega^{\star}(s)\geq s/5.3 for all s1/2s\geq 1/2, and so,

ΔtΔt+1Gta5.3(Gta+θ+Rg)>Gta5.3×2Gta=110.6,\Delta_{t}-\Delta_{t+1}\geq\frac{G_{t}^{a}}{5.3(G_{t}^{a}+\theta+R_{g})}>\frac{G_{t}^{a}}{5.3\times 2G_{t}^{a}}=\frac{1}{10.6}, (3.15)

where the second inequality follows since Gta>θ+RgG_{t}^{a}>\theta+R_{g}, thus proving (3.7).

Proving (3.8).

First, consider the case where γt=1\gamma_{t}=1, which indicates that Dt(Dt+Gta)GtaD_{t}(D_{t}+G_{t}^{a})\leq G_{t}^{a}, which can be rearranged as

GtaDt21Dt.G_{t}^{a}\geq\frac{D_{t}^{2}}{1-D_{t}}. (3.16)

From (3.5), we see that

ΔtΔt+1=F(xt)F(xt+1)Gtaω(Dt)=Gta+Dt+log(1Dt).\begin{split}\Delta_{t}-\Delta_{t+1}&=F(x^{t})-F(x^{t+1})\\ &\geq G_{t}^{a}-\omega(D_{t})\\ &=G_{t}^{a}+D_{t}+\log(1-D_{t}).\end{split} (3.17)

Combining (3.16), (3.17), and the fact that log(1+s)ss22(1|s|)s(1,1)\log(1+s)\geq s-\frac{s^{2}}{2(1-|s|)}\quad\forall s\in(-1,1) [36, Proposition 2.2], we have

ΔtΔt+1GtaDt22(1Dt)Gta2.\Delta_{t}-\Delta_{t+1}\geq G_{t}^{a}-\frac{D_{t}^{2}}{2(1-D_{t})}\geq\frac{G_{t}^{a}}{2}. (3.18)

Furthermore,

ΔtΔt+1Gta2(i)(Gta)22(θ+Rg)(ii)(Gta)22(θ+Rg)2(Gta)212(θ+Rg)2,\Delta_{t}-\Delta_{t+1}\geq\frac{G_{t}^{a}}{2}\underset{(i)}{\geq}\frac{(G_{t}^{a})^{2}}{2(\theta+R_{g})}\underset{(ii)}{\geq}\frac{(G_{t}^{a})^{2}}{2(\theta+R_{g})^{2}}\geq\frac{(G_{t}^{a})^{2}}{12(\theta+R_{g})^{2}}, (3.19)

where (i) uses the fact that Gtaθ+RgG_{t}^{a}\leq\theta+R_{g}, and (ii) uses the inequality θ+Rg1\theta+R_{g}\geq 1. This completes the proof of (3.8) for the case γt=1\gamma_{t}=1. Now, if γt<1\gamma_{t}<1, then γt=GtaDt(Gta+Dt)\gamma_{t}=\frac{G_{t}^{a}}{D_{t}(G_{t}^{a}+D_{t})}, and from (3.13), we have

ΔtΔt+1=F(xt)F(xt+1)ω(GtaDt)(i)ω(GtaGta+θ+Rg)(ii)(Gta)23(Gta+θ+Rg)2(iii)(Gta)212(θ+Rg)2,\begin{split}\Delta_{t}-\Delta_{t+1}&=F(x^{t})-F(x^{t+1})\\ &\geq\omega^{\star}\left(\frac{G_{t}^{a}}{D_{t}}\right)\\ &\underset{(i)}{\geq}\omega^{\star}\left(\frac{G_{t}^{a}}{G_{t}^{a}+\theta+R_{g}}\right)\\ &\underset{(ii)}{\geq}\frac{(G_{t}^{a})^{2}}{3(G_{t}^{a}+\theta+R_{g})^{2}}\\ &\underset{(iii)}{\geq}\frac{(G_{t}^{a})^{2}}{12(\theta+R_{g})^{2}},\end{split} (3.20)

where (i) uses (A.1) and the monotonicity of ω()\omega^{\star}(\cdot), (ii) uses the fact that GtaGta+θ+Rg12\frac{G_{t}^{a}}{G_{t}^{a}+\theta+R_{g}}\leq\frac{1}{2} and ω(s)s2/3s[0,1/2]\omega^{\star}(s)\geq s^{2}/3\quad\forall s\in[0,1/2] [36, Proposition 2.1], and (iii) uses the fact that Gtaθ+RgG_{t}^{a}\leq\theta+R_{g}. This completes the proof of (3.8).

Finally, from (3.7) and (3.8), we see that (Δt)t0(\Delta_{t})_{t\geq 0} is a monotonically non-increasing sequence. ∎

Proposition 4.

If 0<ϵθ+Rg0<\epsilon\leq\theta+R_{g} and Algorithm 2 stops (i.e., it reaches Step 8) at iteration tt, we have ΔtGta+δt5ϵ/2\Delta_{t}\leq G_{t}^{a}+\delta_{t}\leq 5\epsilon/2.

Proof.

If Algorithm 2 stops, we have that 0Gtaϵθ+Rg0\leq G_{t}^{a}\leq\epsilon\leq\theta+R_{g} and δt3ϵ/2\delta_{t}\leq 3\epsilon/2. From Lemma 1 and (2.11), we see that ΔtGtGta+δt5ϵ/2\Delta_{t}\leq G_{t}\leq G_{t}^{a}+\delta_{t}\leq 5\epsilon/2. ∎

Note that, as stated in Remark 1, we can replace ϵ/2\epsilon/2 by ϵ/κ\epsilon/\kappa in the definition of adaptive strategy S.2. In this regime, Algorithm 2 stops when δt(1+1/κ)ϵ\delta_{t}\leq(1+1/\kappa)\epsilon, Thus, we have that, if Algorithm 2 stops at iteration tt, then Δt(2+1/κ)ϵ\Delta_{t}\leq(2+1/\kappa)\epsilon and XtX^{t} is an (2+1/κ)ϵ(2+1/\kappa)\epsilon-optimal solution to the input problem. For simplicity, we assume κ=2\kappa=2 in the rest of the paper.

3.1 Convergence Analysis of Algorithm 2

Before providing the convergence analysis of Algorithm 2, we introduce a few key concepts and relationships that are used in the analysis.

3.1.1 Partitioning the iterates of Algorithm 2

Our analysis depends on decomposing the sequence of iterates generated by Algorithm 2 into three subsequences depending on the value of GtaG_{t}^{a}. This partitioning depends on a parameter C0C\geq 0, which controls which iterates fall into which subsequence, as well as the logarithmic homogeneity parameter θ\theta and the maximum variation parameter RgR_{g}. Note that the definition makes sense for any pair of sequences (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} of points feasible for (SC-OPT), irrespective of how they are generated.

Definition 4.

Let (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} be any sequence of feasible points for (SC-OPT) such that Gta0G_{t}^{a}\geq 0. Given C0C\geq 0, define three subsequences of \mathbb{N} as follows:

  • The rr-subsequence r1<r2<r_{1}<r_{2}<\cdots consists of those tt\in\mathbb{N} for which Gta>θ+RgG_{t}^{a}>\theta+R_{g}.

  • The ss-subsequence s1<s2<s_{1}<s_{2}<\cdots consists of those tt\in\mathbb{N} for which θ+RgGta\theta+R_{g}\geq G_{t}^{a} and GtaΔt/(2C+1)G_{t}^{a}\geq\Delta_{t}/(2C+1).

  • The qq-subsequence q1<q2<q_{1}<q_{2}<\cdots consists of those tt\in\mathbb{N} for which θ+RgGta\theta+R_{g}\geq G_{t}^{a} and Gta<Δt/(2C+1)G_{t}^{a}<\Delta_{t}/(2C+1).

Remark 2.

The partitioning of the iterates as in Definition 4 refines the partitioning used in [36] which considers the situation where the linearized subproblems are solved exactly at each iteration. In that situation, Gta=GtΔtG_{t}^{a}=G_{t}\geq\Delta_{t} for all tt and the partitioning in Definition 4 (with C=0C=0) only yields the rr- and ss-subsequences. One key idea in our analysis is to distinguish the iterates where GtaG_{t}^{a} behaves like GtG_{t} by providing an upper bound on Δt\Delta_{t} (the ss-subsequence), and the iterates where GtaG_{t}^{a} does not give such an upper bound on the optimality gap Δt\Delta_{t} (the qq-subsequence). For the ss-subsequence, our analysis broadly follows the approach in [36], showing progress in both the optimality gap and the approximate duality gap. For the qq-subsequence, we can still ensure progress in terms of the approximate duality gap. If the algorithm is run for sufficiently many iterations, either the ss-subsequence, or the qq-subsequence, will be long enough to ensure the stopping criterion is reached.

Given a positive integer NN, define

Nr\displaystyle N_{r} =max{j:rjN}\displaystyle=\max\{j\;:\;r_{j}\leq N\} (3.21)
Ns\displaystyle N_{s} =max{j:sjN}\displaystyle=\max\{j\;:\;s_{j}\leq N\} (3.22)
Nq\displaystyle N_{q} =max{j:qjN}\displaystyle=\max\{j\;:\;q_{j}\leq N\} (3.23)

Note that Nr+Ns+Nq=NN_{r}+N_{s}+N_{q}=N, since every iterate falls into exactly one of these three sequences.

When the iterates (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} are generated by Algorithm 2, then the optimality gap and approximate duality gap for different subsequences satisfy certain inequalities. These allow us to analyse the progress of the algorithm in different ways along the different subsequences. The following result summarizes the key inequalities for each subsequence that will be used in our later analysis.

Lemma 2.

Let (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} be sequences of feasible points for (SC-OPT) generated by Algorithm 2. Then, for all i1i\geq 1,

r-subsequence:ΔriΔri+1\displaystyle\textup{$r$-subsequence:}\quad\Delta_{r_{i}}-\Delta_{r_{i+1}} 110.6\displaystyle\geq\frac{1}{10.6} (3.24)
s-subsequence:ΔsiΔsi+1\displaystyle\textup{$s$-subsequence:}\quad\Delta_{s_{i}}-\Delta_{s_{i+1}} (Gsia)212(θ+Rg)2\displaystyle\geq\frac{(G_{s_{i}}^{a})^{2}}{12(\theta+R_{g})^{2}} (3.25)
q-subsequence:Gqia\displaystyle\textup{$q$-subsequence:}\quad\quad\quad\;\;\quad G_{q_{i}}^{a} 12Cδqi\displaystyle\leq\frac{1}{2C}\delta_{q_{i}} (3.26)
Proof.

First, recall from Proposition 3 that the sequence (xt)(x^{t}) generated by Algorithm 2 satisfies Δt+1Δt\Delta_{t+1}\leq\Delta_{t} for all t0t\geq 0. Since ri+1ri+1r_{i+1}\geq r_{i}+1 we have that

ΔriΔri+1ΔriΔri+1110.6,\Delta_{r_{i}}-\Delta_{r_{i+1}}\geq\Delta_{r_{i}}-\Delta_{r_{i}+1}\geq\frac{1}{10.6},

where the last inequality follows from Proposition 3. Similarly, since si+1si+1s_{i+1}\geq s_{i}+1 we have that

ΔsiΔsi+1ΔsiΔsi+1(Gsia)212(θ+Rg)2,\Delta_{s_{i}}-\Delta_{s_{i+1}}\geq\Delta_{s_{i}}-\Delta_{s_{i}+1}\geq\frac{(G_{s_{i}}^{a})^{2}}{12(\theta+R_{g})^{2}},

where the last inequality follows from Proposition 3.

Since θ+RgGqia\theta+R_{g}\geq G_{q_{i}}^{a}, it follows from the definition of ILMO that condition C.2 (δt\delta_{t}-suboptimality) holds, and so Fact 1 and Lemma 1 tell us that ΔqiGqiGqia+δqi\Delta_{q_{i}}\leq G_{q_{i}}\leq G_{q_{i}}^{a}+\delta_{q_{i}}. Since Δqi/(2C+1)Gqia0\Delta_{q_{i}}/(2C+1)\geq G_{q_{i}}^{a}\geq 0 (by the definition of the qq-subsequence), we have that

GqiaΔqi2C+1Gqi2C+1Gqia+δqi2C+1.G_{q_{i}}^{a}\leq\frac{\Delta_{q_{i}}}{2C+1}\leq\frac{G_{q_{i}}}{2C+1}\leq\frac{G_{q_{i}}^{a}+\delta_{q_{i}}}{2C+1}.

Rearranging gives the inequality Gqiaδqi/(2C)G_{q_{i}}^{a}\leq\delta_{q_{i}}/(2C). ∎

3.1.2 Bounding the length of the rr-subsequence

We now provide an upper bound on the length of the rr-subsequence in terms of the initial optimality gap Δ0\Delta_{0}.

Lemma 3 (Bound on NrN_{r}).

Let (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} be sequences of feasible points for (SC-OPT) generated by Algorithm 2. For any positive integer NN, let NrN_{r} be defined as in (3.21). Then Nr10.6Δ0N_{r}\leq\lfloor 10.6\Delta_{0}\rfloor.

Proof.

From Lemma 2 we have that

Δri+1Δri110.6for all r1.\Delta_{r_{i+1}}\leq\Delta_{r_{i}}-\frac{1}{10.6}\quad\textup{for all $r\geq 1$}.

Since r10r_{1}\geq 0 we have that Δr1Δ0\Delta_{r_{1}}\leq\Delta_{0}. Substituting i=Nr1i=N_{r}-1 and recursively applying the bound gives

ΔrNrΔr1Nr110.6Δ0Nr110.6.\Delta_{r_{N_{r}}}\leq\Delta_{r_{1}}-\frac{N_{r}-1}{10.6}\leq\Delta_{0}-\frac{N_{r}-1}{10.6}.

We also have that 0ΔrNr+1ΔrNr110.60\leq\Delta_{r_{N_{r}}+1}\leq\Delta_{r_{N_{r}}}-\frac{1}{10.6}, from Proposition 3. Combining these gives 0Δ0Nr10.60\leq\Delta_{0}-\frac{N_{r}}{10.6}. It follows that Nr10.6Δ0N_{r}\leq\lfloor 10.6\Delta_{0}\rfloor, since NrN_{r} is an integer. ∎

3.1.3 Controlling the optimality gap and approximate duality gap along the ss-subsequence

In this section, we will show that in any sufficiently large interval of the ss-subsequence, there must be an iterate at which the approximate duality gap is small. To do this, we use the following result, a slight modification of [36, Proposition 2.4]. While [36, Proposition 2.4] bounds the minimum over the first j+1j+1 elements in the sequence (gj)(g_{j}), i.e., min{g0,,gj}\min\{g_{0},\dotsc,g_{j}\}, Proposition 5 provides a bound on the minimum over any subsequence of consecutive elements gl,,gjg_{l},\dotsc,g_{j} with j>l0j>l\geq 0.

Proposition 5.

Let β>0\beta>0. Consider two nonnegative sequences (dj)j0(d_{j})_{j\geq 0} and (gj)j0(g_{j})_{j\geq 0} that satisfy

  • dj+1djgj2/βd_{j+1}\leq d_{j}-g_{j}^{2}/\beta for all j0j\geq 0 and

  • gjdjg_{j}\geq d_{j} for all j0j\geq 0.

If j1j\geq 1, then dj<β/jd_{j}<\beta/j and

min{gl,,gj}<{2βjif j+12>l0β(l+1)(jl)if j+12l<j\min\{g_{l},\dotsc,g_{j}\}<\begin{cases}\frac{2\beta}{j}&\textup{if $\lfloor\frac{j+1}{2}\rfloor>l\geq 0$}\\ \frac{\beta}{\sqrt{(l+1)(j-l)}}&\textup{if $\lfloor\frac{j+1}{2}\rfloor\leq l<j$}\end{cases} (3.27)

If, in addition, d0βd_{0}\leq\beta, then djβ/(j+1)d_{j}\leq\beta/(j+1) and

min{gl,,gj}{2βj+1if j2+1>l0β(l+2)(jl)if j2+1l<j\min\{g_{l},\dotsc,g_{j}\}\leq\begin{cases}\frac{2\beta}{j+1}&\textup{if $\lfloor\frac{j}{2}+1\rfloor>l\geq 0$}\\ \frac{\beta}{\sqrt{(l+2)(j-l)}}&\textup{if $\lfloor\frac{j}{2}+1\rfloor\leq l<j$}\end{cases} (3.28)
Proof.

First, suppose that dj=0d_{j}=0 for some jj. In this case gk=0g_{k}=0 for all kjk\geq j and dk=0d_{k}=0 for all kjk\geq j, so the bounds hold trivially. As such, assume that j>l0j>l\geq 0 is chosen so that dj>0d_{j}>0, and hence dk>0d_{k}>0 for 0kj0\leq k\leq j.

The first recurrence implies that djdj1d_{j}\leq d_{j-1} for all j1j\geq 1. Dividing both sides of the first recurrence by the positive quantity djdj1d_{j}d_{j-1} gives

dj1dj11+β1gj1dj1gj1djdj11+β1gj1djdj11+β1d_{j}^{-1}\geq d_{j-1}^{-1}+\beta^{-1}\frac{g_{j-1}}{d_{j-1}}\frac{g_{j-1}}{d_{j}}\geq d_{j-1}^{-1}+\beta^{-1}\frac{g_{j-1}}{d_{j}}\geq d_{j-1}^{-1}+\beta^{-1}

where the inequalities follow from the assumption that gj1dj1g_{j-1}\geq d_{j-1} and the fact that djdj1d_{j}\leq d_{j-1} for all j0j\geq 0.

It follows that dj1dl1+(jl)β1d_{j}^{-1}\geq d_{l}^{-1}+(j-l)\beta^{-1} and so djβjl+β/dld_{j}\leq\frac{\beta}{j-l+\beta/d_{l}}. Since this inequality is independent of ll, we can minimize over ll to obtain the best bound. By setting l=0l=0 and using d0>0d_{0}>0, we obtain dj<β/jd_{j}<\beta/j for all j1j\geq 1. Furthermore, if d0βd_{0}\leq\beta, we have that djβ/(j+β/d0)β/(j+1)d_{j}\leq\beta/(j+\beta/d_{0})\leq\beta/(j+1).

Now we establish the bound on min{gl,,gj}\min\{g_{l},\ldots,g_{j}\}. Let 1l+1ij1\leq l+1\leq i\leq j. Since the minimum is bounded above by the average, we have that

min{gl2,,gj2}min{gi2,,gj2}1ji+1k=ijgk2βji+1k=ij(dkdk+1)=βji+1(didj+1).\min\{g_{l}^{2},\ldots,g_{j}^{2}\}\leq\min\{g_{i}^{2},\dots,g_{j}^{2}\}\leq\frac{1}{j-i+1}\sum_{k=i}^{j}g_{k}^{2}\leq\frac{\beta}{j-i+1}\sum_{k=i}^{j}(d_{k}-d_{k+1})=\frac{\beta}{j-i+1}(d_{i}-d_{j+1}).

Using the bound di<β/id_{i}<\beta/i (valid since i1i\geq 1), we have that

min{gl2,,gj2}<βji+1βi.\min\{g_{l}^{2},\ldots,g_{j}^{2}\}<\frac{\beta}{j-i+1}\cdot\frac{\beta}{i}. (3.29)

We are free to choose l+1ijl+1\leq i\leq j to minimize the bound on the right hand side of (3.29). Let i=j+12i^{*}=\lfloor\frac{j+1}{2}\rfloor. A minimum occurs at ii^{*} if il+1i^{*}\geq l+1, otherwise it occurs at i=l+1i=l+1. In the first case, we obtain

min{gl2,,gj2}4β2j2.\min\{g_{l}^{2},\ldots,g_{j}^{2}\}\leq\frac{4\beta^{2}}{j^{2}}.

In the second case we obtain

min{gl2,,gj2}β2(l+1)(jl).\min\{g_{l}^{2},\ldots,g_{j}^{2}\}\leq\frac{\beta^{2}}{(l+1)(j-l)}.

Taking the square root of both sides gives the desired bound.

If, in addition d0βd_{0}\leq\beta, we can use the bound diβ/(i+1)d_{i}\leq\beta/(i+1) which gives

min{gl2,,gj2}βji+1βi+1.\min\{g_{l}^{2},\ldots,g_{j}^{2}\}\leq\frac{\beta}{j-i+1}\cdot\frac{\beta}{i+1}. (3.30)

Again we want to minimize the right hand size of (3.30) over l+1ijl+1\leq i\leq j. Let i=j/2+1i^{*}=\lfloor j/2+1\rfloor. A minimum occurs at i=ii=i^{*} if il+1i^{*}\geq l+1, otherwise a minimum occurs at i=l+1i=l+1. Substituting these into (3.29) and taking the square roots of both sides completes the argument. ∎

The following lemma shows how the optimality gap and the approximate duality gap reduce for iterates in the ss-subsequence. The implication of this result is that if the ss-subsequence is long enough, both of these quantities will eventually become small.

Lemma 4.

Let (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} be sequences of feasible points for (SC-OPT) generated by Algorithm 2.

  • If j1j\geq 1 then Δsj+112(2C+1)2(θ+Rg)2j+1\Delta_{s_{j+1}}\leq\frac{12(2C+1)^{2}(\theta+R_{g})^{2}}{j+1}.

  • If 0l<j/2+10\leq l<\lfloor j/2+1\rfloor then mink{l+1,,j+1}Gska24(2C+1)(θ+Rg)2j+1\displaystyle{\min_{k\in\{l+1,\dotsc,j+1\}}}G_{s_{k}}^{a}\leq\frac{24(2C+1)(\theta+R_{g})^{2}}{j+1}.

  • If lj/2+1l\geq\lfloor j/2+1\rfloor then mink{l+1,,j+1}Gska12(2C+1)(θ+Rg)2(l+2)(jl)\displaystyle{\min_{k\in\{l+1,\dotsc,j+1\}}}G_{s_{k}}^{a}\leq\frac{12(2C+1)(\theta+R_{g})^{2}}{\sqrt{(l+2)(j-l)}}.

Proof.

Consider the sequences dj1=Δsj/(2C+1)d_{j-1}=\Delta_{s_{j}}/(2C+1) and gj1=Gsjag_{j-1}=G_{s_{j}}^{a} for j=1,2,j=1,2,\ldots. Note that dj0d_{j}\geq 0 and gj0g_{j}\geq 0 for all jj. From Lemma 2 we have that

djdj1gj1212(2C+1)(θ+Rg)2for all j1.d_{j}\leq d_{j-1}-\frac{g_{j-1}^{2}}{12(2C+1)(\theta+R_{g})^{2}}\quad\textup{for all $j\geq 1$}. (3.31)

Combining this with the inequality gj1=GsjaΔsj/(2C+1)=dj1g_{j-1}=G_{s_{j}}^{a}\geq\Delta_{s_{j}}/(2C+1)=d_{j-1} for all j=1,2,j=1,2,\ldots (from the definition of the ss-subsequence), we have that

dj+1djgj2βandgjdjfor all j0,d_{j+1}\leq d_{j}-\frac{g_{j}^{2}}{\beta}\quad\textup{and}\quad g_{j}\geq d_{j}\quad\textup{for all $j\geq 0$},

where β=12(2C+1)(θ+Rg)2\beta=12(2C+1)(\theta+R_{g})^{2}. Furthermore, d0g0=Gs1aθ+Rgβd_{0}\leq g_{0}=G_{s_{1}}^{a}\leq\theta+R_{g}\leq\beta (since 12(2C+1)(θ+Rg)2112(2C+1)(\theta+R_{g})^{2}\geq 1). The result then follows from Proposition 5. ∎

3.2 Analysis of Algorithm 2

In this section, we provide a bound on the number of iterations of Algorithm 2 required to achieve the stopping criterion and return an ϵ\epsilon-optimal solution (i) when δt\delta_{t} is chosen according to scheduled strategy S.1 (see Lemma 5), and (ii) when δt\delta_{t} is chosen according to adaptive strategy S.2 (see Lemma 6).

Lemma 5.

Let 0<ϵθ+Rg0<\epsilon\leq\theta+R_{g} and let δt\delta_{t} be a non-negative sequence that converges to 0. Let KqK_{q} be the smallest positive integer such that δtϵ/2\delta_{t}\leq\epsilon/2 for all tKqt\geq K_{q}. Let

Kr=10.6Δ0andKs=48(θ+Rg)2ϵ.K_{r}=\lfloor 10.6\Delta_{0}\rfloor\quad\textup{and}\quad K_{s}=\left\lceil\frac{48(\theta+R_{g})^{2}}{\epsilon}\right\rceil.

Then, for Algorithm 2 (with (SC-OPT) as input), there exists some KKr+2Ks+2KqK\leq K_{r}+2K_{s}+2K_{q} such that ΔKϵ\Delta_{K}\leq\epsilon, GKaϵG_{K}^{a}\leq\epsilon and δKϵ\delta_{K}\leq\epsilon.

Proof.

Let N=Kr+2Ks+2KqN=K_{r}+2K_{s}+2K_{q} and let the first NN iterations of Algorithm 2 be partitioned into the rr-, ss-, and qq-subsequences according to Definition 4 with C=1/2C=1/2. Then N=Nr+Ns+NqN=N_{r}+N_{s}+N_{q}.

From Lemma 3, we know that NrKrN_{r}\leq K_{r}, and so it follows that Ns+Nq(2Ks+Kq)+KqN_{s}+N_{q}\geq(2K_{s}+K_{q})+K_{q}. We now consider two cases: (i) Ns2Ks+Kq+1N_{s}\geq 2K_{s}+K_{q}+1, and (ii) Ns2Ks+KqN_{s}\leq 2K_{s}+K_{q}. In the first case, the ss-subseqence will be long enough to establish convergence. In the second case, the qq-subsequence will be long enough to ensure convergence.

Case 1: Ns2Ks+Kq+1N_{s}\geq 2K_{s}+K_{q}+1. In this case we will show that it suffices to choose K=skK=s_{k^{*}} where

k=argmini{Ks+Kq+1,,2Ks+Kq+1}Gsia.k^{*}=\operatorname*{arg\,min}_{i\in\{K_{s}+K_{q}+1,\ldots,2K_{s}+K_{q}+1\}}G_{s_{i}}^{a}.

Since Ns2Ks+Kq+1N_{s}\geq 2K_{s}+K_{q}+1 it follows that K=sks2Ks+Kq+1sNsNK=s_{k^{*}}\leq s_{2K_{s}+K_{q}+1}\leq s_{N_{s}}\leq N, as required. We now establish bounds on δK,GKa\delta_{K},G_{K}^{a}, and ΔK\Delta_{K}.

  • Since kKqk^{*}\geq K_{q}, it follows that sksKqKqs_{k^{*}}\geq s_{K_{q}}\geq K_{q}, and so that δskϵ/2\delta_{s_{k}^{*}}\leq\epsilon/2 (by the definition of KqK_{q}).

  • By the definition of kk^{*}, from Lemma 4 (with l=Ks+Kql=K_{s}+K_{q} and j=2Ks+Kqj=2K_{s}+K_{q}) we have that lj/2+1l\geq\lfloor j/2+1\rfloor and so

    Gska24(θ+Rg)2(Ks+Kq+2)Ksϵ/2.G_{s_{k^{*}}}^{a}\leq\frac{24(\theta+R_{g})^{2}}{\sqrt{(K_{s}+K_{q}+2)K_{s}}}\leq\epsilon/2.
  • From Proposition 4 we have that ΔskGska+δskϵ\Delta_{s_{k^{*}}}\leq G_{s_{k^{*}}}^{a}+\delta_{s_{k^{*}}}\leq\epsilon.

Case 2: Ns2Ks+KqN_{s}\leq 2K_{s}+K_{q}. In this case we will show that it suffices to choose K=qNqNK=q_{N_{q}}\leq N.

Since Ns2Ks+KqN_{s}\leq 2K_{s}+K_{q}, we have that Nq2Ks+2KqNsKqN_{q}\geq 2K_{s}+2K_{q}-N_{s}\geq K_{q}. Consider the sequence qkq_{k} for k=1,2,,Nqk=1,2,\ldots,N_{q}. Since C=1/2C=1/2, from Lemma 2 we have that GqkaδqkG_{q_{k}}^{a}\leq\delta_{q_{k}} for each kk.

Since K=qNqNqKqK=q_{N_{q}}\geq N_{q}\geq K_{q}, we have δK=δqNqϵ/2\delta_{K}=\delta_{q_{N_{q}}}\leq\epsilon/2 (from the definition of KqK_{q}). We also have GKa=GqNqaδqNqϵ/2G_{K}^{a}=G_{q_{N_{q}}}^{a}\leq\delta_{q_{N_{q}}}\leq\epsilon/2. Finally, from Proposition 4 we have that ΔKGKa+δKϵ\Delta_{K}\leq G_{K}^{a}+\delta_{K}\leq\epsilon, completing the argument. ∎

The following result (Lemma 6) provides an upper bound on the number of iterations of Algorithm 2 until it stops when the value of δt\delta_{t} is set according to adaptive strategy S.2. The main difference between the proofs of Lemma 5 and Lemma 6 is in the analysis of the qq-subsequence. In Lemma 6, since δt\delta_{t} is chosen in terms of the smallest value of the approximate duality gap we have encountered so far, we can ensure that the approximate duality gap reduces geometrically along the qq-subsequence.

Lemma 6.

Let 0<ϵθ+Rg0<\epsilon\leq\theta+R_{g}, and let δt\delta_{t} be a non-negative sequence such that δtϵ/2+minτ<tGτa\delta_{t}\leq\epsilon/2+\min_{\tau<t}G_{\tau}^{a} for all tt. Let

Kr=10.6Δ0andKs=72(θ+Rg)2ϵandKq=2+log2(θ+Rgϵ).K_{r}=\left\lfloor 10.6\Delta_{0}\right\rfloor\quad\textup{and}\quad K_{s}=\left\lceil\frac{72(\theta+R_{g})^{2}}{\epsilon}\right\rceil\quad\textup{and}\quad K_{q}=2+\left\lceil\log_{2}\left(\frac{\theta+R_{g}}{\epsilon}\right)\right\rceil.

Then, for Algorithm 2 (with (SC-OPT) as input), there exists some KKr+2Ks+KqK\leq K_{r}+2K_{s}+K_{q} such that ΔK5ϵ/2\Delta_{K}\leq 5\epsilon/2, GKaϵG_{K}^{a}\leq\epsilon and δK3ϵ/2\delta_{K}\leq 3\epsilon/2.

Proof.

Let N=Kr+2Ks+2KqN=K_{r}+2K_{s}+2K_{q} and let the first NN iterations of Algorithm 2 be partitioned into the rr-, ss-, and qq-subsequences according to Definition 4 with C=1C=1. Then N=Nr+Ns+NqN=N_{r}+N_{s}+N_{q}. We know that NrKrN_{r}\leq K_{r} (from Lemma 3). It follows that Ns+Nq2Ks+2KqN_{s}+N_{q}\geq 2K_{s}+2K_{q}.

We now consider two cases: (i) NqKqN_{q}\leq K_{q}, (ii) NqKq+1N_{q}\geq K_{q}+1.

Case 1: NqKqN_{q}\leq K_{q}. In this case we will show that it suffices to choose K=skK=s_{k^{*}} where

k=argmini{Ks+1,,2Ks}Gsia.k^{*}=\operatorname*{arg\,min}_{i\in\{K_{s}+1,\ldots,2K_{s}\}}G_{s_{i}}^{a}.

Since NqKqN_{q}\leq K_{q}, we have Ns2Ks+KqNq2KskN_{s}\geq 2K_{s}+K_{q}-N_{q}\geq 2K_{s}\geq k^{*}. It follows that K=sksNsNK=s_{k^{*}}\leq s_{N_{s}}\leq N, as required. We now establish bounds on δK,GKa\delta_{K},G_{K}^{a} and ΔK\Delta_{K}.

  • Since kKs+1k^{*}\geq K_{s}+1 we have that

    δskϵ2+minτ<skGτaϵ2+min1jk1Gsjaϵ2+min1jKsGsja(i)ϵ2+72(θ+Rg)2Ks(ii)3ϵ2,\delta_{s_{k^{*}}}\leq\frac{\epsilon}{2}+\min_{\tau<s_{k^{*}}}G_{\tau}^{a}\leq\frac{\epsilon}{2}+\min_{1\leq j\leq k^{*}-1}G_{s_{j}}^{a}\leq\frac{\epsilon}{2}+\min_{1\leq j\leq K_{s}}G_{s_{j}}^{a}\underset{(i)}{\leq}\frac{\epsilon}{2}+\frac{72(\theta+R_{g})^{2}}{K_{s}}\underset{(ii)}{\leq}\frac{3\epsilon}{2},

    where (i) follows from Lemma 4 with l=0l=0 and j=Ks1j=K_{s}-1 and (ii) follows from the definition of KsK_{s}.

  • From Lemma 4 with l=Ksl=K_{s} and j=2Ks1j=2K_{s}-1, noting that l=j/2+1l=\lfloor j/2+1\rfloor, we have that

    Gska36(θ+Rg)2(Ks+2)(Ks1)36(θ+Rg)2Ksϵ2.G_{s_{k^{*}}}^{a}\leq\frac{36(\theta+R_{g})^{2}}{\sqrt{(K_{s}+2)(K_{s}-1)}}\leq\frac{36(\theta+R_{g})^{2}}{K_{s}}\leq\frac{\epsilon}{2}.
  • From Proposition 4 we have that ΔskGska+δsk5ϵ/2\Delta_{s_{k^{*}}}\leq G_{s_{k^{*}}}^{a}+\delta_{s_{k^{*}}}\leq 5\epsilon/2.

Case 2: NqKq+1N_{q}\geq K_{q}+1. In this case, we will show that it suffices to choose K=qNqNK=q_{N_{q}}\leq N.

Consider the sequence qkq_{k} for k=1,2,,Nqk=1,2,\ldots,N_{q}. Since C=1C=1, from Lemma 2 we have that Gqkaδqk/2G_{q_{k}}^{a}\leq\delta_{q_{k}}/2 for k=1,2,,Nqk=1,2,\ldots,N_{q}. From our choice of δt\delta_{t}, we know that for k2k\geq 2,

δqkϵ2+minτ<qkGτaϵ2+min<kGqaϵ2+Gqk1a.\delta_{q_{k}}\leq\frac{\epsilon}{2}+\min_{\tau<q_{k}}G_{\tau}^{a}\leq\frac{\epsilon}{2}+\min_{\ell<k}G_{q_{\ell}}^{a}\leq\frac{\epsilon}{2}+G_{q_{k-1}}^{a}.

Since Gq1aθ+RgG_{q_{1}}^{a}\leq\theta+R_{g} and, for each k2k\geq 2 we have Gqkaϵ/4+(1/2)Gqk1aG_{q_{k}}^{a}\leq\epsilon/4+(1/2)G_{q_{k-1}}^{a}, it follows that

Gqkaϵ4(i=0k212i)+2k+1(θ+Rg)ϵ2+2k+1(θ+Rg)for k2,G^{a}_{q_{k}}\leq\frac{\epsilon}{4}\left(\sum_{i=0}^{k-2}\frac{1}{2^{i}}\right)+2^{-k+1}(\theta+R_{g})\leq\frac{\epsilon}{2}+2^{-k+1}(\theta+R_{g})\quad\textup{for $k\geq 2$},

where the second inequality holds because i=02i2\sum_{i=0}^{\ell}2^{-i}\leq 2 for all 0\ell\geq 0. Recall that K=qNqNK=q_{N_{q}}\leq N and that Kq+1NqK_{q}+1\leq N_{q} by assumption. Then

GKa=GqNqaϵ2+2Nq+1(θ+Rg)ϵ2+2Kq(θ+Rg)ϵ2+ϵ4=3ϵ4G_{K}^{a}=G_{q_{N_{q}}}^{a}\leq\frac{\epsilon}{2}+2^{-N_{q}+1}(\theta+R_{g})\leq\frac{\epsilon}{2}+2^{-K_{q}}(\theta+R_{g})\leq\frac{\epsilon}{2}+\frac{\epsilon}{4}=\frac{3\epsilon}{4}

by the definition of KqK_{q}. Moreover

δK=δqNqϵ2+Gq(Nq1)aϵ2+ϵ2+2Nq+2(θ+Rg)ϵ+2Kq+1(θ+Rg)3ϵ/2.\delta_{K}=\delta_{q_{N_{q}}}\leq\frac{\epsilon}{2}+G_{q_{(N_{q}-1)}}^{a}\leq\frac{\epsilon}{2}+\frac{\epsilon}{2}+2^{-N_{q}+2}(\theta+R_{g})\leq\epsilon+2^{-K_{q}+1}(\theta+R_{g})\leq 3\epsilon/2.

Finally, from Proposition 4 we have that ΔKGKa+δK9ϵ/45ϵ/2\Delta_{K}\leq G_{K}^{a}+\delta_{K}\leq 9\epsilon/4\leq 5\epsilon/2, as required. ∎

4 Implementing ILMO

In this section we discuss how to implement the oracle ILMO given that we have access to a method to approximately solve (Lin-OPT) with either prescribed additive error or (under additional assumptions) prescribed relative error.

If we have a method to solve (Lin-OPT) to a prescribed additive error, then we can use it to simply guarantee that condition C.2 (δt\delta_{t}-suboptimality) is satisfied at each iteration. However, this does not automatically guarantee that the associated near-minimizer for (Lin-OPT) satisfies the additional condition Gta0G_{t}^{a}\geq 0. In Section 4.1, we briefly discuss how to ensure that this condition is also satisfied.

The oracle ILMO, with its two possible exit conditions ( C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality)), allows more implementation flexibility than just solving (Lin-OPT) to a prescribed additive error. In Section 4.2, we restrict to problems for which gg is the indicator function of a compact convex set. In this case, we will show that if we only have a method to solve (Lin-OPT) to prescribed relative error, it can still be used to implement ILMO. This is less straightforward since, in general, the optimal value of (Lin-OPT) is not uniformly bounded. Moreover, this is exactly the scenario of interest in the motivating example discussed in Section 1.1.

4.1 Ensuring the output of ILMO satisfies Gta0G_{t}^{a}\geq 0

The definition of ILMO requires that the point hth^{t} produced by ILMO is such that Gta(xt,ht)G_{t}^{a}(x^{t},h^{t}) is nonnegative. This is clearly satisfied whenever IMLO returns a point hth^{t} such that Gtaθ+RgG_{t}^{a}\geq\theta+R_{g}. However, when the prescribed additive error δt\delta_{t} for (Lin-OPT) is small, it could be the case that condition C.2 (δt\delta_{t}-suboptimality) is satisfied but Gta<0G_{t}^{a}<0. In this case, it turns out that we can just return ht=xth^{t}=x^{t} as a valid output of ILMO.

Lemma 7.

Suppose that xtx^{t} and hth^{t} are feasible for (SC-OPT) and satisfy Fxtlin(ht)Fxtlin(h)δtF^{\textup{lin}}_{x^{t}}(h^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star})\leq\delta_{t}, (i.e., condition C.2 (δt\delta_{t}-suboptimality)). If Gta(xt,ht)<0G_{t}^{a}(x^{t},h^{t})<0 then Fxtlin(xt)Fxtlin(h)δtF^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star})\leq\delta_{t} and Gta(xt,xt)=0G_{t}^{a}(x^{t},x^{t})=0.

Proof.

Recall from the definition of FxtlinF^{\textup{lin}}_{x^{t}} in (Lin-OPT) and the definition of Gta(xt,ht)G_{t}^{a}(x^{t},h^{t}) in (2.10) that

Gta(xt,ht)=Fxtlin(xt)Fxtlin(ht).G_{t}^{a}(x^{t},h^{t})=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{t}).

If Gta<0G_{t}^{a}<0 then Fxtlin(xt)<Fxtlin(ht)F^{\textup{lin}}_{x^{t}}(x^{t})<F^{\textup{lin}}_{x^{t}}(h^{t}) and so xtx^{t} is a feasible point for (Lin-OPT) with smaller cost than hth^{t}. Since hth^{t} satisfies condition C.2 (δt\delta_{t}-suboptimality), the same holds for xtx^{t}, since

Fxtlin(xt)Fxtlin(h)Fxtlin(ht)Fxtlin(h)δt.F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star})\leq F^{\textup{lin}}_{x^{t}}(h^{t})-F^{\textup{lin}}_{x^{t}}(h^{\star})\leq\delta_{t}.

In addition Gta(xt,xt)=0G_{t}^{a}(x^{t},x^{t})=0 (from the definition of GtaG_{t}^{a}), completing the argument. ∎

We note that if ILMO returns (ht,Gta)=(xt,0)(h^{t},G_{t}^{a})=(x^{t},0) then either Algorithm 2 terminates at line 7 or we will have Dt=0D_{t}=0 and γt=0\gamma_{t}=0, so that xt+1=xtx^{t+1}=x^{t}.

4.2 ILMO using an approximate minimizer of FxtlinF^{\textup{lin}}_{x^{t}} with prescribed relative error

In this section, we assume that we have access to a method that can compute an approximate minimizer of FxtlinF^{\textup{lin}}_{x^{t}} with prescribed relative error. We further assume that g()g(\cdot) is an indicator function, so that Rg=0R_{g}=0. This assumption ensures that the optimal value of (Lin-OPT) is related to θ\theta, the parameter of logarithmic homogeneity of ff.

Lemma 8.

Consider an instance of (SC-OPT) with Rg=0R_{g}=0. Let xtx^{t} and hth^{t} be feasible for (SC-OPT). Then

Gta(xt,ht)\displaystyle G_{t}^{a}(x^{t},h^{t}) =θFxtlin(ht)and\displaystyle=-\theta-F^{\textup{lin}}_{x^{t}}(h^{t})\quad\textup{and}
Gt(xt)\displaystyle G_{t}(x^{t}) =θFxtlin(h).\displaystyle=-\theta-F^{\textup{lin}}_{x^{t}}(h^{\star}).

It follows that Fxtlin(h)θF^{\textup{lin}}_{x^{t}}(h^{\star})\leq-\theta.

Proof.

Since xtx^{t} is feasible for (SC-OPT) and Rg=0R_{g}=0, we have that g(xt)=0g(x^{t})=0. We then have Fxtlin(xt)=f((xt)),(xt)=θF^{\textup{lin}}_{x^{t}}(x^{t})=\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t})\rangle=-\theta since ff is θ\theta-logarithmically-homogeneous (see Proposition 1). Using the fact that Gta(xt,ht)=Fxtlin(xt)Fxtlin(ht)G_{t}^{a}(x^{t},h^{t})=F^{\textup{lin}}_{x^{t}}(x^{t})-F^{\textup{lin}}_{x^{t}}(h^{t}), we obtain the stated expression for Gta(xt,ht)G_{t}^{a}(x^{t},h^{t}). The expression for Gt(xt)G_{t}(x^{t}) follows by substituting hh^{\star} for hth^{t}. The observation that Fxtlin(h)θF^{\textup{lin}}_{x^{t}}(h^{\star})\leq-\theta follows since Gt0G_{t}\geq 0 (Fact 1). ∎

The following result shows that a method to solve (Lin-OPT) with a multiplicative error guarantee is enough to implement ILMO in Algorithm 2. Crucially, the prescribed multiplicative error required is uniformly bounded, independent of the current iterate xtx^{t}.

Lemma 9.

Consider an instance of (SC-OPT) with Rg=0R_{g}=0. Let δt0\delta_{t}\geq 0, let c>2c>2 and let τt:=min{δt,(c2)θ}cθ<1\tau_{t}:=\frac{\textup{min}\{\delta_{t},(c-2)\theta\}}{c\theta}<1. If hth^{t} is feasible for (Lin-OPT) and

Fxtlin(h)Fxtlin(ht)(1τt)Fxtlin(h),F^{\textup{lin}}_{x^{t}}(h^{\star})\leq F^{\textup{lin}}_{x^{t}}(h^{t})\leq(1-\tau_{t})F^{\textup{lin}}_{x^{t}}(h^{\star}), (4.1)

then at least one of the two conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) is satisfied.

Proof.

We prove the result by considering two cases for the value of OPTt:=Fxtlin(h)OPT_{t}:=F^{\textup{lin}}_{x^{t}}(h^{\star}). From Lemma 8 we know that OPTtθ-OPT_{t}\geq\theta. The argument is based on two cases, depending on whether OPTt-OPT_{t} is large or small.

Case I: OPTtcθ-OPT_{t}\leq c\theta.

In this case, τtmin{δt,(c2)θ}cθδt(OPTt)\tau_{t}\leq\frac{\textup{min}\{\delta_{t},(c-2)\theta\}}{c\theta}\leq\frac{\delta_{t}}{(-OPT_{t})}. Thus, τt(OPTt)δt\tau_{t}(-OPT_{t})\leq\delta_{t}. Substituting this inequality into (4.1), we see that

Fxtlin(ht)OPTt+δt,F^{\textup{lin}}_{x^{t}}(h^{t})\leq OPT_{t}+\delta_{t}, (4.2)

thus, satisfying the condition C.2 (δt\delta_{t}-suboptimality).

Case II: OPTt>cθ.-OPT_{t}>c\theta.

From Lemma 8 and (4.1), we have

Gta=Fxtlin(ht)θ(1τt)(OPTt)θ>(i)(1τt)cθθ(ii)(1min{δt,(c2)θ}cθ)cθθ=(c1)θmin{δt,(c2)θ}(c1)θ(c2)θθ,\begin{split}G_{t}^{a}&=-F^{\textup{lin}}_{x^{t}}(h^{t})-\theta\\ &\geq(1-\tau_{t})(-OPT_{t})-\theta\\ &\underset{(i)}{>}(1-\tau_{t})c\theta-\theta\\ &\underset{(ii)}{\geq}\left(1-\frac{\textup{min}\{\delta_{t},(c-2)\theta\}}{c\theta}\right)c\theta-\theta\\ &=(c-1)\theta-\textup{min}\{\delta_{t},(c-2)\theta\}\\ &\geq(c-1)\theta-(c-2)\theta\\ &\geq\theta,\end{split} (4.3)

where (i) follows since OPTt>cθ-OPT_{t}>c\theta and (ii) follows from the definition of τt\tau_{t}. Since Rg=0R_{g}=0, we have that Gtaθ+RgG_{t}^{a}\geq\theta+R_{g}, showing that condition C.1 (δt\delta_{t}-suboptimality) is satisfied. ∎

5 Probabilistic LMO

Algorithm 2 uses an oracle ILMO that generates a pair (ht,Gat)(h^{t},G^{t}_{a}) such that the output satisfies at least one of the two conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality). In this section, we consider a setting where we only have access to a probabilistic oracle PLMO, that fails with probability at most pp to generate the output as required by the algorithm (see Definition 2). We first provide an algorithm that uses PLMO to generate an ϵ\epsilon-optimal solution to (SC-OPT) with any desired probability of success. We then analyze the convergence of the algorithm for both strategies S.1 (scheduled) and S.2 (adaptive) of choosing the sequence δt\delta_{t}.

Framework of the algorithm.

Algorithm 3 provides the framework of the algorithm that uses PLMO. There are two main differences between Algorithm 2 and Algorithm 3. The first is that we use the probablistic oracle PLMO (see Definition 2) in Step 6 instead of ILMO, i.e., with probability at most pp, the pair (ht,Gta)(h^{t},G_{t}^{a}) generated in the Step 6 of Algorithm 3 satisfies neither condition C.1 (large gap) nor condition C.2 (δt\delta_{t}-suboptimality), but does satisfy Gta0G_{t}^{a}\geq 0. The second difference is in the stopping criteria of the algorithm. While Algorithm 2 stops when GtaϵG_{t}^{a}\leq\epsilon and δt3ϵ/2\delta_{t}\leq 3\epsilon/2, Algorithm 3 stops when the bounds GtaϵG_{t}^{a}\leq\epsilon and δt3ϵ/2\delta_{t}\leq 3\epsilon/2 are satisfied ll times. We will see (in Proposition 6) that the parameter ll controls the probability that the optimality gap is small when the algorithm stops.

Algorithm 3 Approximate Generalized Frank-Wolfe with PLMO

Input: Problem (SC-OPT), where f()f(\cdot) is a 2-self-concordant, θ\theta-logarithmically-homogeneous barrier function, ϵθ+Rg\epsilon\leq\theta+R_{g}: suboptimality parameter, scheduled strategy S.1 or adaptive strategy S.2 for setting the value of δt\delta_{t}
Output: x^ϵ\widehat{x}_{\epsilon}: x^ϵ\widehat{x}_{\epsilon} an ϵ\epsilon-optimal solution to (SC-OPT) with probability at least 1pl1-p^{l}

1:  Initialize x0nx^{0}\in\mathbb{R}^{n}.
2:  Set t0t\leftarrow 0, count=0\textup{count}=0.
3:  while count<l\textup{count}<l do
4:     Compute (f((xt)))\mathcal{B}^{\star}(\nabla f(\mathcal{B}(x^{t}))).
5:     Set δt0\delta_{t}\geq 0 according to the input strategy.
6:     (ht,Gta)=PLMO((f((xt))),δt,xt)(h^{t},G_{t}^{a})=\texttt{PLMO}(\mathcal{B}^{\star}(\nabla f(\mathcal{B}(x^{t}))),\delta_{t},x^{t}).
7:     if GtaϵG_{t}^{a}\leq\epsilon and δt3ϵ/2\delta_{t}\leq 3\epsilon/2 then
8:        count=count+1\textup{count}=\textup{count}+1.
9:     end if
10:     Compute Dt=(htxt)(xt)D_{t}=\|\mathcal{B}(h^{t}-x^{t})\|_{\mathcal{B}(x^{t})}.
11:     Set γt=min{GtaDt(Dt+Gta),1}\gamma_{t}=\min\left\{\frac{G_{t}^{a}}{D_{t}(D_{t}+G_{t}^{a})},1\right\}.
12:     Update xt+1=(1γt)xt+γthtx^{t+1}=(1-\gamma_{t})x^{t}+\gamma_{t}h^{t}.
13:     t=t+1t=t+1.
14:  end while
15:  return (xt,δt)(x_{t},\delta_{t})

It is sometimes fruitful to think of the iterates of Algorithm 3 as being iterates of Algorithm 2 with respect to an alternative (random) sequence (δ^t)t0(\hat{\delta}_{t})_{t\geq 0}. To see how this is the case, let (δt)t0(\delta_{t})_{t\geq 0} be a non-negative sequence and let (bt)t0(b_{t})_{t\geq 0} be a sequence of independent random variables where bt=1b_{t}=1 with probability 1pt1-p_{t} and bt=b_{t}=\infty with probability ptp_{t} and ptpp_{t}\leq p for all tt. Then we can think of the iterates of Algorithm 3 with respect to (δt)t0(\delta_{t})_{t\geq 0} as the iterates of Algorithm 2 with respect to the random sequence (δ^t)t0(\hat{\delta}_{t})_{t\geq 0} where δ^t=btδt\hat{\delta}_{t}=b_{t}\delta_{t} for all tt. The multiplier btb_{t} allows ILMO to model the behaviour of PLMO. Indeed when the multiplier bt=b_{t}=\infty, condition C.2 (δt\delta_{t}-suboptimality) holds trivially with respect to δ^t=btδt=\hat{\delta}_{t}=b_{t}\delta_{t}=\infty, effectively enforcing no additional constraint on the output of the oracle, apart from Gta0G_{t}^{a}\geq 0.

This viewpoint allows us to leverage existing results for the iterates of Algorithm 2 in the context of Algorithm 3. In particular, any results that are independent of the sequence (δt)t0(\delta_{t})_{t\geq 0}, such as Lemmas 3 and 4, continue to hold for Algorithm 3. Other results that depend on δt\delta_{t} (such as the bound (3.26) in Lemma 2) remain valid as long as we replace δt\delta_{t} with the random variable δ^t=btδt\hat{\delta}_{t}=b_{t}\delta_{t}. In light of this observation, the following result summarizes the key inequalities that hold for the iterates of Algorithm 3.

Lemma 10.

Let (xt)t0(x^{t})_{t\geq 0} and (ht)t0(h^{t})_{t\geq 0} be sequences of feasible points for (SC-OPT) generated by Algorithm 3. Partition the iterates into the qq-, rr-, and ss- subsequences with respect to some C>0C>0 as in Definition 4. Then

  • Nr10.6Δ0N_{r}\leq\lfloor 10.6\Delta_{0}\rfloor

  • If 0l<j/2+10\leq l<\lfloor j/2+1\rfloor then mink{l+1,,j+1}Gska24(2C+1)(θ+Rg)2j+1\displaystyle{\min_{k\in\{l+1,\ldots,j+1\}}G_{s_{k}}^{a}\leq\frac{24(2C+1)(\theta+R_{g})^{2}}{j+1}}.

  • If lj/2+1l\geq\lfloor j/2+1\rfloor then mink{l+1,,j+1}Gska12(2C+1)(θ+Rg)2(l+2)(jl)\displaystyle{\min_{k\in\{l+1,\ldots,j+1\}}G_{s_{k}}^{a}\leq\frac{12(2C+1)(\theta+R_{g})^{2}}{\sqrt{(l+2)(j-l)}}}.

  • The random variables 12CδqkGqka\frac{1}{2C}\delta_{q_{k}}-G_{q_{k}}^{a} (indexed by k1k\geq 1) are independent, and are each non-negative with probability at least 1p1-p.

The following result shows that the parameter ll can be used to control the probability that Algorithm 3 successfully returns a near-optimal point.

Proposition 6.

If 0ϵθ+Rg0\leq\epsilon\leq\theta+R_{g} and Algorithm 3 (with parameter ll) stops after TlT_{l} iterations, we have ΔTlGTla+δTl5ϵ/2\Delta_{T_{l}}\leq G_{T_{l}}^{a}+\delta_{T_{l}}\leq 5\epsilon/2 with probability at least 1pl1-p^{l}.

Proof.

Let (Ti)i1(T_{i})_{i\geq 1} be the iterations of Algorithm 3 at which GTiaϵG_{T_{i}}^{a}\leq\epsilon and δTi3ϵ/2\delta_{T_{i}}\leq 3\epsilon/2. If Algorithm 3 stops, it does so after TlT_{l} iterations. Since ϵθ+Rg\epsilon\leq\theta+R_{g}, we have that Gtaθ+RgG_{t}^{a}\leq\theta+R_{g} at each (Ti)i=1l(T_{i})_{i=1}^{l}. By the definition of PLMO, Gta0G_{t}^{a}\geq 0 for all tt. Therefore, (Δt)t0(\Delta_{t})_{t\geq 0} is monotonically non-increasing. As such

ΔTlΔTiGTifor all 1il.\Delta_{T_{l}}\leq\Delta_{T_{i}}\leq G_{T_{i}}\quad\textup{for all $1\leq i\leq l$.}

Suppose that there is some 1il1\leq i\leq l such that condition C.2 (δt\delta_{t}-suboptimality) holds at iterate TiT_{i}. Then

ΔTiGTi(i)GTia+δTi5ϵ/2,\Delta_{T_{i}}\leq G_{T_{i}}\underset{(i)}{\leq}G_{T_{i}}^{a}+\delta_{T_{i}}\leq 5\epsilon/2,

where (i) follows from Lemma 1. Combining these inequalities gives that ΔTlΔTi5ϵ/2\Delta_{T_{l}}\leq\Delta_{T_{i}}\leq 5\epsilon/2.

Finally, note that the probability that condition C.2 (δt\delta_{t}-suboptimality) is satisfied for at least one out of ll iterates is at least 1pl1-p^{l}, so it follows that ΔTl5ϵ/2\Delta_{T_{l}}\leq 5\epsilon/2 with probability at least 1pl1-p^{l}. ∎

5.1 Convergence Analysis of Algorithm 3

The convergence analysis of Algorithm 3, like that of Algorithm 2, also makes use of the partition of the iterates into the qq-, rr-, and ss-subsequences, as in Definition 4. If the first NN iterates of Algorithm 3 are divided into sequences the qq-, rr-, and ss-subsequences, let NrN_{r}, NsN_{s}, and NqN_{q} (defined in (3.21)) respectively denote the length of each subsequence.

Our analysis of Algorithm 3 follows a broadly similar approach to the analysis of Algorithm 2. The key difference is that we need to bound the time it takes to achieve the stopping criterion for the llth time, and that the inequality that allows us to control the approximate duality gap along the qq-subsequence only holds with probability at least 1p1-p independently on each iteration.

The main technical tool needed to deal with these issues is a standard tail bound on the number of successes in a sequence of independent (but not necessarily identically distributed) Bernoulli trials.

Lemma 11.

Let ll be a positive integer, let 0<p<10<p<1 and let 0<p¯<10<\overline{p}<1. Let MM be a positive integer such that Ml1+log(1/p¯)1p+2log(1/p¯)M\geq\frac{l-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p}). Let ξ1,ξ2,,ξM\xi_{1},\xi_{2},\ldots,\xi_{M} be a collection of independent random variables such that ξi\xi_{i} takes value 11 with probability 1pi1-p_{i} and ξi\xi_{i} takes value 0 with probability pip_{i}, and assume that 0pip0\leq p_{i}\leq p for all ii. Then the probability that at least ll of the variables ξ1,ξ2,,ξM\xi_{1},\xi_{2},\ldots,\xi_{M} takes value 11 is at least 1p¯1-\overline{p}.

Proof.

The number of variables that take value 0 is i=1M(1ξi)\sum_{i=1}^{M}(1-\xi_{i}), which is a sum of independent random variables, each taking values in {0,1}\{0,1\}. The mean is μ=i=1MpipM\mu=\sum_{i=1}^{M}p_{i}\leq pM. By Hoeffding’s inequality

Pr[i=1Mξi(1p)Mτ]=Pr[i=1M(1ξi)pM+τ]Pr[i=1M(1ξi)μ+τ]exp(2τ2/M).\textup{Pr}\left[\sum_{i=1}^{M}\xi_{i}\leq(1-p)M-\tau\right]=\textup{Pr}\left[\sum_{i=1}^{M}(1-\xi_{i})\geq pM+\tau\right]\leq\textup{Pr}\left[\sum_{i=1}^{M}(1-\xi_{i})\geq\mu+\tau\right]\leq\exp\left(-2\tau^{2}/M\right). (5.1)

Let τ=log(1/p¯)\tau=\log(1/\overline{p}). From our assumption that M2log(1/p¯)+(l1+log(1/p¯))/(1p)M\geq 2\log(1/\overline{p})+(l-1+\log(1/\overline{p}))/(1-p) we know that M2τM\geq 2\tau and that Ml1+τ1pM\geq\frac{l-1+\tau}{1-p}. It follows that 2τ2/Mlog(1/p¯)2\tau^{2}/M\leq\log(1/\overline{p}) and that (1p)Mτl1(1-p)M-\tau\geq l-1. Therefore

Pr[i=1Mξil1]Pr[i=1Mξi(1p)Mτ]exp(2τ2/M)p¯.\textup{Pr}\left[\sum_{i=1}^{M}\xi_{i}\leq l-1\right]\leq\textup{Pr}\left[\sum_{i=1}^{M}\xi_{i}\leq(1-p)M-\tau\right]\leq\exp(-2\tau^{2}/M)\leq\overline{p}. (5.2)

We are now ready to provide bound on number of iterations performed by Algorithm 3 until the stopping criteria (defined in Step 7 of the algorithm) is satisfied. Lemmas 12 and 13 provide a bound on number of iterations performed by Algorithm 3 when δt\delta_{t} is set according to scheduled strategy S.1 and adaptive strategy S.2 respectively.

Lemma 12.

Let 0<ϵθ+Rg0<\epsilon\leq\theta+R_{g} and let δt\delta_{t} be a non-negative sequence that converges to 0. Let KqK_{q} be the smallest positive integer such that δtϵ/2\delta_{t}\leq\epsilon/2 for all tKqt\geq K_{q}. Let 0<p¯<10<\overline{p}<1. Let

Kr=10.6Δ0andKs=48(θ+Rg)2ϵ.K_{r}=\lfloor 10.6\Delta_{0}\rfloor\quad\textup{and}\quad K_{s}=\left\lceil\frac{48(\theta+R_{g})^{2}}{\epsilon}\right\rceil.

Then, with probability at least 1p¯1-\overline{p}, Algorithm 3 (with (SC-OPT) as input) stops after at most Kr+(l+1)Ks+2Kq+l1+log(1/p¯)1p+2log(1/p¯)K_{r}+(l+1)K_{s}+2K_{q}+\frac{l-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p}) iterations.

Proof.

For brevity of notation, define M(l,p¯,p):=l1log(1/p¯)1p+2log(1/p¯)M(l,\overline{p},p):=\frac{l-1\log(1/\overline{p})}{1-p}+2\log(1/\overline{p}). Let N=Kr+(l+1)Ks+2Kq+M(l,p¯,p)N=K_{r}+(l+1)K_{s}+2K_{q}+M(l,\overline{p},p) and let the first NN iterations of Algorithm 2 be partitioned into the rr- ss- and qq- subsequences according to Definition 4 with C=1/2C=1/2. Then, N=Nr+Ns+NqN=N_{r}+N_{s}+N_{q}.

From Lemma 10, we know that NrKrN_{r}\leq K_{r}, and so it follows that Ns+Nq((l+1)Ks+Kq)+(Kq+M(l,p¯,p))N_{s}+N_{q}\geq((l+1)K_{s}+K_{q})+\left(K_{q}+M(l,\overline{p},p)\right). We now consider two cases: (i) Ns(l+1)Ks+Kq+1N_{s}\geq(l+1)K_{s}+K_{q}+1, and (ii) Ns(l+1)Ks+KqN_{s}\leq(l+1)K_{s}+K_{q}. In the first case, the ss-subseqence will be long enough to establish that the algorithm reaches the stopping criterion. In the second case, the qq-subsequence will be long enough to ensure that the algorithm reaches the stopping criterion with probability at least 1p¯1-\overline{p}.

Case 1: Ns(l+1)Ks+Kq+1N_{s}\geq(l+1)K_{s}+K_{q}+1. For u=1,2,,lu=1,2,\ldots,l, let

ku=argmini{uKs+Kq+1,,(u+1)Ks+Kq+1}Gsia.k_{u}^{*}=\operatorname*{arg\,min}_{i\in\{uK_{s}+K_{q}+1,\ldots,(u+1)K_{s}+K_{q}+1\}}G_{s_{i}}^{a}.

Since Ns(l+1)Ks+Kq+1N_{s}\geq(l+1)K_{s}+K_{q}+1 it follows that skus(l+1)Ks+Kq+1sNsNs_{k_{u}^{*}}\leq s_{(l+1)K_{s}+K_{q}+1}\leq s_{N_{s}}\leq N, for all u=1,2,,lu=1,2,\ldots,l, as required. We now establish bounds on δsku,Gskua\delta_{s_{k_{u}^{*}}},G_{s_{k_{u}^{*}}}^{a}, for u=1,2,,lu=1,2,\ldots,l, showing that at each of these iterations, the stopping criterion is satisfied.

  • Since kuKqk_{u}^{*}\geq K_{q}, it follows that skusKqKqs_{k_{u}^{*}}\geq s_{K_{q}}\geq K_{q}, and so that δskuϵ/2\delta_{s_{k_{u}^{*}}}\leq\epsilon/2 (by the definition of KqK_{q}).

  • By the definition of kuk_{u}^{*}, from Lemma 10 (with l=uKs+Kql=uK_{s}+K_{q} and j=(u+1)Ks+Kqj=(u+1)K_{s}+K_{q}) we have that lj/2+1l\geq\lfloor j/2+1\rfloor and so

    Gskua24(θ+Rg)2(uKs+Kq+2)Ksϵ/2.G_{s_{k_{u}^{*}}}^{a}\leq\frac{24(\theta+R_{g})^{2}}{\sqrt{(uK_{s}+K_{q}+2)K_{s}}}\leq\epsilon/2.

It follows that Algorithm 3 terminates at iteration sklNs_{k_{l}^{*}}\leq N.

Case 2: Ns(l+1)Ks+KqN_{s}\leq(l+1)K_{s}+K_{q}. Since Ns(l+1)Ks+KqN_{s}\leq(l+1)K_{s}+K_{q}, we have that Nq(l+1)Ks+2Kq+M(l,p¯,p)NsKq+M(l,p¯,p)N_{q}\geq(l+1)K_{s}+2K_{q}+M(l,\overline{p},p)-N_{s}\geq K_{q}+M(l,\overline{p},p). We will show that among the iterates t{qKq,qKq+1,,qNq}t\in\{q_{K_{q}},q_{K_{q}+1},\ldots,q_{N_{q}}\}, the conditions δtϵ/2\delta_{t}\leq\epsilon/2 and Gtaϵ/2G_{t}^{a}\leq\epsilon/2 hold at least ll times with probability at least 1p¯1-\overline{p}. This ensures that Algorithm 3 stops with probability at least 1p¯1-\overline{p} by iterate qNqNq_{N_{q}}\leq N.

Since qKqKqq_{K_{q}}\geq K_{q}, it follows (from the definition of KqK_{q}) that δqkϵ/2\delta_{q_{k}}\leq\epsilon/2 for k=Kq,Kq+1,,Nqk=K_{q},K_{q}+1,\ldots,N_{q}. The subsequence qkq_{k} for k{Kq,Kq+1,,Nq}k\in\{K_{q},K_{q}+1,\ldots,N_{q}\}, has length at least M(l,p¯,p)M(l,\overline{p},p). From Lemma 10, we know that Gqkaδqkϵ/2G_{q_{k}}^{a}\leq\delta_{q_{k}}\leq\epsilon/2 holds with probability at least 1p1-p, independently for each k{Kq,Kq+1,,Nq}k\in\{K_{q},K_{q}+1,\ldots,N_{q}\}. From Lemma 11, we can conclude that, with probability at least 1p¯1-\overline{p}, we have that Gqkaϵ/2G_{q_{k}}^{a}\leq\epsilon/2 holds at least ll times over the iterates k{Kq,Kq+1,,Nq}k\in\{K_{q},K_{q}+1,\ldots,N_{q}\}.

We now consider the case when δt\delta_{t} is set via the adaptive strategy S.2. Recall that in the deterministic case, the main idea was that the approximate duality gap decreases geometrically along the qq-subsequence. In the random setting, this does not quite hold, because the bound Gqka12CδqkG_{q_{k}}^{a}\leq\frac{1}{2C}\delta_{q_{k}} fails to hold with probability at most pp. However, if the qq-subsequence is long enough, then this relationship does hold along a sufficiently long subsequence of the qq-subsequence, so that the stopping criterion is reached (with any desired probability).

Lemma 13.

Let 0<ϵθ+Rg0<\epsilon\leq\theta+R_{g}, and let δt\delta_{t} be a non-negative sequence such that δtϵ/2+minτ<tGτa\delta_{t}\leq\epsilon/2+\min_{\tau<t}G_{\tau}^{a} for all tt. Let 0<p¯<10<\overline{p}<1. Let

Kr=10.6Δ0andKs=72(θ+Rg)2ϵandKq=2+log2(θ+Rgϵ).K_{r}=\lfloor 10.6\Delta_{0}\rfloor\quad\textup{and}\quad K_{s}=\left\lceil\frac{72(\theta+R_{g})^{2}}{\epsilon}\right\rceil\quad\textup{and}\quad K_{q}=2+\left\lceil\log_{2}\left(\frac{\theta+R_{g}}{\epsilon}\right)\right\rceil.

Then, with probability at least 1p¯1-\overline{p}, Algorithm 3 (with (SC-OPT) as input) stops after at most Kr+(l+1)Ks+Kq+l1+log(1/p¯)1p+2log(1/p¯)K_{r}+(l+1)K_{s}+\frac{K_{q}+l-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p}) iterations.

Proof.

Let N=Kr+(l+1)Ks+M(Kq,l,p¯,p)N=K_{r}+(l+1)K_{s}+M(K_{q},l,\overline{p},p) where M(Kq,l,p¯,p)=Kq+l1+log(1/p¯)1p+2log(1/p¯)M(K_{q},l,\overline{p},p)=\frac{K_{q}+l-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p}). Let the first NN iterations of Algorithm 2 be partitioned into the rr-, ss-, and qq-subsequences according to Definition 4 with C=1C=1. Then N=Nr+Ns+NqN=N_{r}+N_{s}+N_{q}. We know that NrKrN_{r}\leq K_{r} (from 10). It follows that Ns+Nq(l+1)Ks+M(Kq,l,p¯,p)N_{s}+N_{q}\geq(l+1)K_{s}+M(K_{q},l,\overline{p},p). We now consider two cases: (i) NqM(Kq,l,p¯,p)1N_{q}\leq M(K_{q},l,\overline{p},p)-1, (ii) NqM(Kq,l,p¯,p)N_{q}\geq M(K_{q},l,\overline{p},p).

Case 1: NqM(Kq,l,p¯,p)1N_{q}\leq M(K_{q},l,\overline{p},p)-1. If NqM(Kq,l,p¯,p)1N_{q}\leq M(K_{q},l,\overline{p},p)-1 then Ns(l+1)Ks+1N_{s}\geq(l+1)K_{s}+1. For u=1,2,,lu=1,2,\ldots,l, let

ku=argmini{uKs+1,,(u+1)Ks+1}Gsia.k_{u}^{*}=\operatorname*{arg\,min}_{i\in\{uK_{s}+1,\ldots,(u+1)K_{s}+1\}}G_{s_{i}}^{a}.

Since Ns(l+1)Ks+1N_{s}\geq(l+1)K_{s}+1, it follows that skus(l+1)Ks+1sNsNs_{k_{u}^{*}}\leq s_{(l+1)K_{s}+1}\leq s_{N_{s}}\leq N, for all i=1,2,,li=1,2,\ldots,l, as required. We now establish bounds on δsku\delta_{s_{k_{u}^{*}}} and GskuaG^{a}_{s_{k_{u}^{*}}}, for u=1,2,,lu=1,2,\ldots,l, showing that at each of these ll iterations, the stopping criterion is satisfied.

  • Since kuuKs+1k^{*}_{u}\geq uK_{s}+1, from Lemma 10 (with l=0l=0 and j=uKs+1j=uK_{s}+1) we have that

    δskuϵ/2+minτ<skuGτaϵ/2+mink{1,2,,uKs}Gskaϵ/2+72(θ+Rg)2uKs3ϵ/2,\delta_{s_{k_{u}^{*}}}\leq\epsilon/2+\min_{\tau<s_{k_{u}^{*}}}G_{\tau}^{a}\leq\epsilon/2+\min_{k\in\{1,2,\ldots,uK_{s}\}}G_{s_{k}}^{a}\leq\epsilon/2+\frac{72(\theta+R_{g})^{2}}{uK_{s}}\leq 3\epsilon/2,

    for u=1,2,,lu=1,2,\ldots,l.

  • Applying Lemma 10 (with l=Ksl=K_{s} and j=2Ksj=2K_{s}), we have that l=j/2<j/2+1l=j/2<\lfloor j/2+1\rfloor, and so

    Gsk1a72(θ+Rg)22Ks+1ϵ.G_{s_{k_{1}^{*}}}^{a}\leq\frac{72(\theta+R_{g})^{2}}{2K_{s}+1}\leq\epsilon.

    For u=2,3,,lu=2,3,\ldots,l, we again apply Lemma 10 (with l=uKsl=uK_{s} and j=(u+1)Ksj=(u+1)K_{s}). In these cases, lj/2+1l\geq\lfloor j/2+1\rfloor and so

    Gskua36(θ+Rg)2(uKs+2)KsϵG_{s_{k_{u}^{*}}}^{a}\leq\frac{36(\theta+R_{g})^{2}}{\sqrt{(uK_{s}+2)K_{s}}}\leq\epsilon

    for u=2,3,,lu=2,3,\ldots,l.

Case 2: NqM(Kq,l,p¯,p)N_{q}\geq M(K_{q},l,\overline{p},p). It is convenient to define a further subsequence of the qq-subsequence. Let the vv-subsequence v1<v2<v_{1}<v_{2}<\cdots consist of those tt\in\mathbb{N} for which t=qkt=q_{k} for some kk and Gtaδt/2G_{t}^{a}\leq\delta_{t}/2. The vv-subsequence consists of those indices from the qq-subseqence at which the PLMO satisfies condition C.2 (δt\delta_{t}-suboptimality). From Lemma 10, we know that the vv-subsequence can be obtained from the qq-subsequence by (independently) keeping each element of the qq-subsequence with some probability that is at least 1p1-p. Let NvN_{v} denote the (random) number of elements of {1,2,,N}\{1,2,\ldots,N\} that are in the vv-subsequence. From Lemma 11, we know that if Nql+Kq1+log(1/p¯)1p+2log(1/p¯)N_{q}\geq\frac{l+K_{q}-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p}) then Nvl+KqN_{v}\geq l+K_{q} with probability at least 1p¯1-\overline{p}.

By the same argument as in Case 2 of Lemma 6 (but replacing q()q_{(\cdot)} with v()v_{(\cdot)}) we have that

δvkϵ/2+Gvk1a\delta_{v_{k}}\leq\epsilon/2+G_{v_{k-1}}^{a}

for k2k\geq 2 and that

Gvkaϵ/2+2k+1(θ+Rg)G_{v_{k}}^{a}\leq\epsilon/2+2^{-k+1}(\theta+R_{g})

for k2k\geq 2. Then, if kKq+1k\geq K_{q}+1, we have that Gvka3ϵ/4G_{v_{k}}^{a}\leq 3\epsilon/4 and that δvk3ϵ/2\delta_{v_{k}}\leq 3\epsilon/2. In particular, the stopping criterion of Algorithm 3 holds at iterates vKq+1,,vKq+lv_{K_{q}+1},\ldots,v_{K_{q}+l}. As long as vKq+lNv_{K_{q}+l}\leq N, the stopping criterion of Algorithm 3 is achieved ll times. We have seen that, with probability at least 1p¯1-\overline{p}, we have that l+KqNvl+K_{q}\leq N_{v} and so vl+KqvNvNv_{l+K_{q}}\leq v_{N_{v}}\leq N, as required. ∎

6 Application of Algorithm 3 to (SC-GMean)

In this section, we consider the application of Algorithm 3 to generate a near-optimal solution to (SC-GMean). The aim of this section is to work through this example as it covers the study of motivating example introduced in Section 1.1. We show how to implement PLMO using Lanczos method in Section 6.1, followed by stating the computational complexity of Algorithm 3 in Section 6.2, where we explicitly define all the parameters of the problem. Finally, in Section 6.3, we show that Algorithm 3 can be made memory-efficient (requiring 𝒪(n+d)\mathcal{O}(n+d) memory) using memory-efficient techniques proposed by Shinde et al. [25]. We first restate this problem as the composite optimization problem

minX𝕊ni=1dlog(Ai,X)+g(X),\min_{X\in\mathbb{S}^{n}}-\sum_{i=1}^{d}\log(\langle A_{i},X\rangle)+g(X), (SC-GMean-Cmp)

where g(X)g(X) is the indicator function of 𝒮={X0:Tr(X)=1}\mathcal{S}=\{X\succeq 0:\textup{Tr}(X)=1\} and f((X))=i=1dlog(Ai,X)f(\mathcal{B}(X))=-\sum_{i=1}^{d}\log(\langle A_{i},X\rangle) is a dd-logarithmically-homogeneous, 2-self-concordant function defined on int(𝒮)\textup{int}(\mathcal{S}). In the case of (SC-GMean-Cmp), the linear subproblem (Lin-OPT) takes the form

minH𝒮(f((Xt))),HminH𝒮i=1dAiAi,Xt,H\min_{H\in\mathcal{S}}\langle\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t}))),H\rangle\equiv\min_{H\in\mathcal{S}}\left\langle-\sum_{i=1}^{d}\frac{A_{i}}{\langle A_{i},X^{t}\rangle},H\right\rangle (6.1)

which is equivalent to solving an eigenvalue problem. Indeed, if uu^{\star} is a unit norm eigenvector corresponding to the smallest eigenvalue of (f((X)))\mathcal{B}^{*}(\nabla f(\mathcal{B}(X))), then H=uuTH^{\star}=u^{\star}u^{\star T} is an optimal solution to the linear subproblem (6.1). Now, to generate the update direction HtH^{t} in Step 6 of Algorithm 3, we need to find an approximate solution to the linear subproblem or equivalently an approximate solution to the corresponding extreme eigenvalue problem. In this paper, we consider using the Lanczos method (with random initialization) to generate such an approximate solution. In the next subsection, we review an existing convergence result for the Lanczos method and show how it can ultimately be used to implement PLMO. Furthermore, in Lemma 15, we provide a bound on number of iterations of the Lanczos method needed to implement PLMO. Finally, in Subsection 6.2, we provide the ‘inner’ iteration complexity, i.e., the computational complexity of each iteration of Algorithm 3 applied to (SC-GMean-Cmp), as well as a bound on the total computational complexity of the algorithm.

6.1 Implementing PLMO in Algorithm 3 when applied to (SC-GMean-Cmp)

The Lanczos method (see, e.g., [14]) is an iterative method that can be used to determine the extreme eigenvalues and eigenvectors of a Hermitian matrix. A typical convergence result for the method, when initialized with a uniform random point on the unit sphere, is given in Lemma 14.

Lemma 14 (Convergence of Lanczos method [14]).

Let ρ(0,1]\rho\in(0,1] and p(0,1]p\in(0,1]. For J𝕊nJ\in\mathbb{S}^{n}, let J\|J\| denote the largest absolute eigenvalue of JJ. The Lanczos method, after at most N=12+1ρlog(4n/p2)N=\left\lceil\frac{1}{2}+\frac{1}{\sqrt{\rho}}\log(4n/p^{2})\right\rceil iterations, generates a unit vector unu\in\mathbb{R}^{n}, that satisfies

uTJuλmax(J)ρ8Ju^{T}Ju\geq\lambda_{\textup{max}}(J)-\frac{\rho}{8}\|J\| (6.2)

with probability at least 1p1-p. The computational complexity of the method is 𝒪(N(n+mvc(J)))\mathcal{O}(N(n+\textup{mvc}(J))), where mvc(J)\textup{mvc}(J) is the matrix-vector multiplication complexity for the matrix JJ.

Algorithm 4 shows how to use the Lanczos method as the basis for an implementation of PLMO, to generate the update direction in Algorithm 3. The following result establishes that Algorithm 4 is indeed a valid implementation of PLMO.

Algorithm 4 PLMO for (SC-GMean-Cmp)

Input: (f((Xt)))\mathcal{B}^{\star}(\nabla f(\mathcal{B}(X^{t}))), δt\delta_{t}, XtX^{t} Output: (HtH^{t}, GtaG_{t}^{a})

1:  Set J=(f((Xt)))0J=-\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t})))\succeq 0.
2:  Run Lanczos method for N=12+cθ8min{δt,(c2)θ}log(4n/p2)N=\frac{1}{2}+\sqrt{\frac{c\theta}{8\min\{\delta_{t},(c-2)\theta\}}}\log(4n/p^{2}) iterations to generate a unit vector uu.
3:  Set Ht=uuTH^{t}=uu^{T}.
4:  Compute Gta(Xt,Ht)G_{t}^{a}(X^{t},H^{t}) according to (2.10).
5:  if Gta<0G_{t}^{a}<0 then
6:     Set Ht=XtH^{t}=X^{t} and Gta=0G_{t}^{a}=0.
7:  end if
8:  return (HtH^{t}, GtaG_{t}^{a})
Lemma 15.

Algorithm 4 implements PLMO correctly for the problem (SC-GMean-Cmp).

Proof.

For J=(f((Xt)))0J=-\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t})))\succeq 0, we can rewrite (6.2) as uTJuλmax(J)(1ρ/8)=J(1ρ/8)u^{T}Ju\geq\lambda_{\textup{max}}(J)(1-\rho/8)=\|J\|(1-\rho/8). Let ρ=8τt=8min{δt,(c2)θ}cθ\rho=8\tau_{t}=8\frac{\textup{min}\{\delta_{t},(c-2)\theta\}}{c\theta} (set according to Lemma 9). From Lemma 14, we see that after N=12+cθ8min{δt,(c2)θ}log(4n/p2)N=\frac{1}{2}+\sqrt{\frac{c\theta}{8\min\{\delta_{t},(c-2)\theta\}}}\log(4n/p^{2}) iterations, the Lanczos method generates a unit vector uu that satisfies, with probability at least 1p1-p,

uTJuJ(1min{δt,(c2)θ}cθ)=J(1τt),u^{T}Ju\geq\|J\|\left(1-\frac{\min\{\delta_{t},(c-2)\theta\}}{c\theta}\right)=\|J\|(1-\tau_{t}), (6.3)

where the last equality follows from the definition of τt\tau_{t}.

Furthermore, since Ht=uutH^{t}=uu^{t}, and J=(f((Xt)))J=-\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t}))), we can rewrite (6.3) as

J,Ht=FXtlin(Ht)J(1τt)=(1τt)FXtlin(H),-\langle J,H^{t}\rangle=F^{\textup{lin}}_{X^{t}}(H^{t})\leq-\|J\|(1-\tau_{t})=(1-\tau_{t})F^{\textup{lin}}_{X^{t}}(H^{\star}), (6.4)

where the last equality follows since J=λmin((ft))=FXtlin(H)\|J\|=-\lambda_{\textup{min}}(\mathcal{B}^{*}(\nabla f_{t}))=-F^{\textup{lin}}_{X^{t}}(H^{\star}). So, from Lemma 9, we see that, if (6.4) holds, then at least one of the two conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) is satisfied. And since (6.4) holds with probability at least 1p1-p, we have that after at most NN iterations of Lanczos method, the generated pair (Ht,Gta)(H^{t},G_{t}^{a}) satisfy at least one of the two conditions C.1 (large gap) and C.2 (δt\delta_{t}-suboptimality) with probability at least 1p1-p.

Finally, if condition C.2 (δt\delta_{t}-suboptimality) is satisfied and yet Gta<0G_{t}^{a}<0 then, from Lemma 7, we know that Ht=XtH^{t}=X^{t} also satisfies condition C.2 (δt\delta_{t}-suboptimality) with corresponding value of GtaG_{t}^{a} equal to 0. ∎

We are now ready to provide the convergence analysis and the total computational complexity of Algorithm 3 when applied to (SC-GMean-Cmp) in the next subsection.

6.2 Computational complexity of Algorithm 3 for (SC-GMean)

In this section, we provide the computational complexity of Algorithm 3 when applied to (SC-GMean). The total computational complexity of Algorithm 3 consists of the ‘inner’ iteration complexity, i.e., the computational complexity of each iteration of Algorithm 3 multiplied by the number of iterations performed by the algorithm. The two most computationally expensive steps in Algorithm 3 are (i) Step 10 (computing DtD_{t}), and (ii) Step 6, implementing PLMO, whose computational complexity can be determined using Lemmas 14 and 15. We elaborate on this complexity result in the proof of Lemma 16.

Lemmas 12 and 13 (from Section 5) provide a bound on the number of iterations performed by Algorithm 3 when the scheduled strategy S.1 and the adaptive strategy S.2 are used, respectively. Note that the bounds given in these lemmas depend on the parameters θ\theta, RgR_{g} and Δ0\Delta_{0}. We first compute a bound on these quantities and finally provide the total computational complexity of Algorithm 3 in Lemma 16.

Values of parameters θ\theta, RgR_{g} and Δ0\Delta_{0}.

Since f((X))f(\mathcal{B}(X)) in (SC-GMean-Cmp) is a dd-logarithmically-homogeneous function, we have θ=d\theta=d. Also, since g()g(\cdot) is an indicator function for the set 𝒮\mathcal{S} and Algorithm 3 generates feasible updates to the decision variable, we have Rg=0R_{g}=0.

Finally, we determine an upper bound on Δ0\Delta_{0}. The initial optimality gap, Δ0\Delta_{0} is defined as

Δ0=f((X0))f((X))+g(X0)g(X)=f((X0))f((X)),\Delta_{0}=f(\mathcal{B}(X^{0}))-f(\mathcal{B}(X^{\star}))+g(X^{0})-g(X^{\star})=f(\mathcal{B}(X^{0}))-f(\mathcal{B}(X^{\star})), (6.5)

where the last equality holds since X𝒮X^{\star}\in\mathcal{S} and X0𝒮X^{0}\in\mathcal{S}. Now let X0=I/nX^{0}=I/n, so that f((I/n))=i=1dlog(Tr(Ai))+dlog(n)f(\mathcal{B}(I/n))=-\sum_{i=1}^{d}\log(\textup{Tr}(A_{i}))+d\log(n). To generate a lower bound on f((X))f(\mathcal{B}(X^{\star})), we use the knowledge of the dual function. The dual of (SC-GMean) is

maxy,λλ+i=1d(log(yi)+1)s.t.{i=1dAiyiλIyi0,i=1,,d.\max_{y,\lambda}\ -\lambda+\sum_{i=1}^{d}(\log(y_{i})+1)\ \ \textup{s.t.}\ \begin{cases}&\sum_{i=1}^{d}A_{i}y_{i}\preceq\lambda I\\ &y_{i}\geq 0,\quad i=1,\dotsc,d.\end{cases} (6.6)

Let λ¯=d\overline{\lambda}=d and y¯=1/Tr(Ai)\overline{y}=1/\textup{Tr}(A_{i}), for i=1,,di=1,\dotsc,d. Note that, (λ¯(\overline{\lambda}, y¯)\overline{y}) is a feasible solution to the dual problem (6.6). Therefore,

f((X))λ¯+i=1d(log(y¯i)+1)=i=1dlog(Tr(Ai)).f(\mathcal{B}(X^{\star}))\geq-\overline{\lambda}+\sum_{i=1}^{d}(\log(\overline{y}_{i})+1)=-\sum_{i=1}^{d}\log(\textup{Tr}(A_{i})). (6.7)

The optimality gap Δ0\Delta_{0} can now be bounded by

Δ0=f((X0))f((X))dlog(n).\Delta_{0}=f(\mathcal{B}(X^{0}))-f(\mathcal{B}(X^{\star}))\leq d\log(n). (6.8)

We now provide the convergence analysis of Algorithm 3 in Lemma 16.

Lemma 16.

Let (SC-GMean-Cmp) be solved using Algorithm 3 with parameters ll, ϵ\epsilon and pp, such that PLMO fails with probability at most pp.

  1. 1.

    Outer iteration complexity when scheduled strategy S.1 with Kq=1K_{q}=1 is used. In this case, with probability at least 1p¯1-\overline{p}, Algorithm 3 stops after at most

    K¯=10.6dlogn+(l+1)48d2ϵ+2+l1+log(1/p¯)1p+2log(1/p¯)\overline{K}=\lceil 10.6d\log n\rceil+(l+1)\left\lceil\frac{48d^{2}}{\epsilon}\right\rceil+2+\frac{l-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p})

    iterations.

  2. 2.

    Inner iteration complexity when scheduled strategy S.1 is used. When scheduled strategy S.1 with Kq=1K_{q}=1 and δt=ϵ/2\delta_{t}=\epsilon/2 for all t1t\geq 1 is used in Algorithm 3, and Lanczos method is used to implement PLMO in each iteration of the algorithm, then the total computational complexity of each iteration of Algorithm 3 is bounded by 𝒪(dϵlog(4n/p2)(n+i=1dmvc(Ai)))\mathcal{O}\left(\sqrt{\frac{d}{\epsilon}}\log(4n/p^{2})(n+\sum_{i=1}^{d}\textup{mvc}(A_{i}))\right), where mvc(Ai)\textup{mvc}(A_{i}) denotes the complexity of matrix-vector multiplication with AiA_{i}.

  3. 3.

    Total computational complexity when scheduled strategy S.1 is used. When scheduled strategy S.1 with Kq=1K_{q}=1 and δt=ϵ/2\delta_{t}=\epsilon/2 for all t1t\geq 1 is used in Algorithm 3, the total computational complexity of the algorithm is bounded by 𝒪(dϵlogn(dlogn+ld2ϵ)(n+i=1dmvc(Ai)))\mathcal{O}\left(\sqrt{\frac{d}{\epsilon}}\log n\left(d\log n+\frac{ld^{2}}{\epsilon}\right)\left(n+\sum_{i=1}^{d}\textup{mvc}(A_{i})\right)\right).

Proof.

Outer iteration complexity. When scheduled strategy S.1 with Kq=1K_{q}=1 is used, we have that δtϵ/2\delta_{t}\leq\epsilon/2 for all t1t\geq 1. Substituting Kq=1K_{q}=1, and values of parameters θ\theta, RgR_{g} and Δ0\Delta_{0} in Lemma 12, we see that, with probability at most 1p¯1-\overline{p}, Algorithm 3 stops after at most

K¯=10.6dlogn+(l+1)48d2ϵ+2+l1+log(1/p¯)1p+2log(1/p¯)\overline{K}=\lceil 10.6d\log n\rceil+(l+1)\left\lceil\frac{48d^{2}}{\epsilon}\right\rceil+2+\frac{l-1+\log(1/\overline{p})}{1-p}+2\log(1/\overline{p})

iterations. Thus, we have that K¯\overline{K} is bounded by 𝒪(dlogn+ld2ϵ)\mathcal{O}\left(d\log n+\frac{ld^{2}}{\epsilon}\right).

Inner iteration complexity when scheduled strategy S.1 is used. The two most expensive steps in Algorithm 3 are computing DtD_{t}, and implementing PLMO. PLMO is implemented using Lanczos method whose computational complexity is given in Lemma 14. Assume that scheduled strategy S.1 with Kq=1K_{q}=1, and δt=ϵ/2\delta_{t}=\epsilon/2 is used for all t1t\geq 1. From Lemma 14, we see that the computational complexity of the Lanczos method depends on mvc((f((Xt)))\textup{mvc}(\mathcal{B}^{\star}(\nabla f(\mathcal{B}(X^{t}))). Note that, (f((Xt))=i=1dAiAi,Xt\mathcal{B}^{\star}(\nabla f(\mathcal{B}(X^{t}))=-\sum_{i=1}^{d}\frac{A_{i}}{\langle A_{i},X^{t}\rangle}. If X0=1nIX^{0}=\frac{1}{n}I, then Ai,X0=Tr(Ai)\langle A_{i},X^{0}\rangle=\textup{Tr}(A_{i}) for all i=1,,di=1,\dotsc,d. Moreover, we have that Xt=(1γt1)X(t1)+γt1H(t1)X^{t}=(1-\gamma_{t-1})X^{(t-1)}+\gamma_{t-1}H^{(t-1)} and rank(H(t1))1\textup{rank}(H^{(t-1)})\leq 1 for all t2t\geq 2. Thus, Ai,Xt=Ai,X(t1)+uTAiu\langle A_{i},X^{t}\rangle=\langle A_{i},X^{(t-1)}\rangle+u^{T}A_{i}u, where Ht=uuTH^{t}=uu^{T}. And so, if we store Ai,X(t1)\langle A_{i},X^{(t-1)}\rangle for all ii, then we see that the complexity of computing Ai,Xt\langle A_{i},X^{t}\rangle for all i1,,di\in 1,\dotsc,d is bounded by i=1dmvc(Ai)\sum_{i=1}^{d}\textup{mvc}(A_{i}), i.e., the complexity is bounded by the sum of complexities of matrix-vector multiplication with each AiA_{i}. Finally, from Lemmas 14 and 15 (with c=4c=4), we have that computational complexity of PLMO at iteration tt of the Algorithm 3 is bounded by 𝒪(dϵlog(4n/p2)(n+i=1dmvc(Ai)))\mathcal{O}\left(\sqrt{\frac{d}{\epsilon}}\log(4n/p^{2})(n+\sum_{i=1}^{d}\textup{mvc}(A_{i}))\right).

Furthermore, recall that Dt=HtXt(Xt)=i=1dAi,HtXt2Ai,Xt2D_{t}=\|H^{t}-X^{t}\|_{\mathcal{B}(X_{t})}=\sqrt{\sum_{i=1}^{d}\frac{\langle A_{i},H^{t}-X^{t}\rangle^{2}}{\langle A_{i},X^{t}\rangle^{2}}}, and so, the complexity of computing DtD_{t} is also bounded by 𝒪(i=1dmvc(Ai))\mathcal{O}(\sum_{i=1}^{d}\textup{mvc}(A_{i})). Thus, the total computational complexity of each iteration of Algorithm 3 is bounded by 𝒪(dϵlog(4n/p2)(n+i=1dmvc(Ai)))\mathcal{O}\left(\sqrt{\frac{d}{\epsilon}}\log(4n/p^{2})(n+\sum_{i=1}^{d}\textup{mvc}(A_{i}))\right).

Total computational complexity when scheduled strategy S.1 is used. Note that, Algorithm 3 stops after at most K¯\overline{K} iterations with probability at least 1p¯1-\overline{p}. Thus, the total computational complexity of Algorithm 3 is 𝒪((dϵlog(4n/p2)(n+i=1dmvc(Ai)))K¯)\mathcal{O}\left(\left(\sqrt{\frac{d}{\epsilon}}\log(4n/p^{2})(n+\sum_{i=1}^{d}\textup{mvc}(A_{i}))\right)\overline{K}\right) or equivalently,

𝒪(dϵlogn(dlogn+ld2ϵ)(n+i=1dmvc(Ai))).\mathcal{O}\left(\sqrt{\frac{d}{\epsilon}}\log n\left(d\log n+\frac{ld^{2}}{\epsilon}\right)\left(n+\sum_{i=1}^{d}\textup{mvc}(A_{i})\right)\right).

6.3 Making Algorithm 3 memory-efficient

One reason to use methods based on the Frank-Wolfe algorithm for semidefinite optimization problems with constraint 𝒮\mathcal{S}, is that they typically can be modified to lead to approaches that do not require explicit storage of a PSD decision variable, and hence can be made memory-efficient. One approach to this is to represent the PSD decision variable XX by a Gaussian random vector z𝒩(0,X)z\sim\mathcal{N}(0,X), as suggested by Shinde et al. [25]. This is particularly useful in situations where the semidefinite program is used as the input to a rounding scheme that only requires such Gaussian samples. An example of this situation is discussed in Section 1.2.1, where an approximation algorithm for maximizing the product of quadratic forms on the unit sphere is obtained by solving (SC-GMean-Cmp) to get a matrix X𝒮X\in\mathcal{S}, and then sampling a Gaussian vector zz with covariance equal to XX and normalizing it, to get a feasible point for the original optimization problem on the sphere.

The Gaussian sampling-based approach takes advantage of two properties of a problem with the form

minX𝒮f((X)).\min_{X\in\mathcal{S}}f(\mathcal{B}(X)).

The first is the fact that the objective function only depends on XtX^{t} via the lower dimensional vector vt:=(Xt)v^{t}:=\mathcal{B}(X^{t}) with coordinates vit=Ai,Xtv_{i}^{t}=\langle A_{i},X^{t}\rangle for i[d]i\in[d]. The second key property is the fact that the linearized subproblem is an extreme eigenvalue problem. Therefore it always has a rank one optimal point, and we can implement PLMO so that the output HtH^{t} is always rank one. The Gaussian-sampling based approach then involves making the following modifications to the classical Frank-Wolfe algorithm [25, Algorithm 2]:

  • Initializing the samples with z0𝒩(0,X0)z^{0}\sim\mathcal{N}(0,X^{0}), where X0X^{0} is a covariance matrix for which the samples z0z^{0} can be generated without explicitly storing X0X_{0}, such as X0=I/nX^{0}=I/n.

  • Modifying PLMO so that it returns the vector utu^{t} rather than the rank one matrix Ht=(ut)(ut)TH^{t}=(u^{t})(u^{t})^{T}.

  • Updating the samples via zt+1=1γtzt+γtζtutz^{t+1}=\sqrt{1-\gamma_{t}}z^{t}+\sqrt{\gamma_{t}}\zeta_{t}u^{t} where the ζt\zeta_{t} are a sequence of i.i.d. 𝒩(0,1)\mathcal{N}(0,1) random variables.

  • Updating vt:=(Xt)v^{t}:=\mathcal{B}(X^{t}) via vit+1=(1γt)vit+γt(ut)TAiutv^{t+1}_{i}=(1-\gamma_{t})v^{t}_{i}+\gamma_{t}(u^{t})^{T}A_{i}u^{t} for i[d]i\in[d]

  • Expressing the linearized objective function in terms of vtv^{t} via (f((Xt)))=i=1dAiif(vt)\mathcal{B}^{\star}(\nabla f(\mathcal{B}(X^{t})))=\sum_{i=1}^{d}A_{i}\partial_{i}f(v^{t}),where if\partial_{i}f is the partial derivative of ff with respect to the iith coordinate.

The updates for the modified algorithm only depend on the parameters AiA_{i} of the linear map \mathcal{B} via matrix-vector products with the AiA_{i}. To run the algorithm, only zz (a vector of length nn) and vv (a vector of length dd) need to be stored. If the PLMO is implemented via an algorithm that only requires matrix-vector multiplications with (f((Xt))\mathcal{B}^{\star}(\nabla f(\mathcal{B}(X^{t})) (such as the power method), then this only requires vtv^{t} and access to the AiA_{i} via matrix-vector products, and only requires 𝒪(n)\mathcal{O}(n) storage (in addition to the storage associated with the AiA_{i}).

When extending this approach to the setting of generalized Frank-Wolfe, the main additional issue is computing Dt=HtXt(Xt)D_{t}=\|H^{t}-X^{t}\|_{\mathcal{B}(X^{t})} and GtaG_{t}^{a} (and consequently γt\gamma_{t}). These computations can be done in a memory-efficient way because

Dt=HtXt(Xt)=(wtvt)T2f(vt)(wtvt)andGta=vtwt,f(vt),D_{t}=\|H^{t}-X^{t}\|_{\mathcal{B}(X^{t})}=\sqrt{(w^{t}-v^{t})^{T}\nabla^{2}f(v^{t})(w^{t}-v^{t})}\quad\textup{and}\quad G_{t}^{a}=\langle v^{t}-w^{t},\nabla f(v^{t})\rangle,

where wt=(Ht)w^{t}=\mathcal{B}(H^{t}) is the vector with coordinates wit=(ut)TAi(ut)w^{t}_{i}=(u^{t})^{T}A_{i}(u^{t}) for i[d]i\in[d]. Again, computing GtaG_{t}^{a} and DtD_{t} only requires the vector vtv^{t} as well as access to the AiA^{i} via matrix-vector multiplications to compute wtw^{t}.

In the specific case of (SC-GMean-Cmp), we have that ff is separable and so

(f(vt))=i=1dAi/vitandDt=i=1d(wit/vit1)2andGta=i=1d(wit/vit1).\mathcal{B}^{\star}(\nabla f(v^{t}))=-\sum_{i=1}^{d}A_{i}/v^{t}_{i}\quad\textup{and}\quad D_{t}=\sqrt{\sum_{i=1}^{d}(w_{i}^{t}/v_{i}^{t}-1)^{2}}\quad\text{and}\quad G_{t}^{a}=\sum_{i=1}^{d}(w^{t}_{i}/v^{t}_{i}-1).
Proposition 7.

Suppose that the power method [14] is used to implement PLMO in Algorithm 3. Then, Algorithm 3, combined with the Gaussian sampling-based representation of the decision variable, uses 𝒪(n+d)\mathcal{O}(n+d) memory (in addition to the memory required to store the input parameters) to generate a zero-mean Gaussian random vector z^ϵ𝒩(0,X^ϵ)\widehat{z}_{\epsilon}\sim\mathcal{N}(0,\widehat{X}_{\epsilon}) such that X^ϵ\widehat{X}_{\epsilon} is an ϵ\epsilon-optimal solution to the input problem (SC-GMean-Cmp).

7 Numerical Experiments

For our experiments, we implemented Algorithm 3 with two different strategies S.1 (scheduled) and S.2 (adaptive) for δt\delta_{t}, as well as [36, Algorithm 1]. We generated two types of random instances of (SC-GMean), and reported the output of the three algorithms described below. The aim of our experiments was to check the performance of Algorithm 3, and investigate how conservative the theoretical results are against the practical observations.

Generating random instances of (SC-GMean-Cmp).

We generated two types of problem instances of (SC-GMean-Cmp), namely Diag, Rnd.

  • Diag problem instances. In Diag problem instances, Ai𝕊nA_{i}\in\mathbb{S}^{n} were generated as follows: [Ai]ii=i[A_{i}]_{ii}=i for i=1,,di=1,\dotsc,d, with other entries in the matrix set to 0. For Diag instances, the optimal solution XX^{\star} has a d×dd\times d scaled identity matrix I/dI/d on its principal diagonal and 0 everywhere else. Accordingly, we computed the optimality gap Δ0=F(X0)F(X)\Delta_{0}=F(X^{0})-F(X^{\star}).

  • Rnd problem instances. Rnd problem instances were generated by generating parameter matrices AiA_{i}’s for i=1,,di=1,\dotsc,d such that Ai=j=1nujujTA_{i}=\sum_{j=1}^{n}u_{j}u_{j}^{T} where uj𝒩(0,I)u_{j}\sim\mathcal{N}(0,I) for each j[n]j\in[n]. For these instances, we use dlognd\log n as an upper bound on Δ0\Delta_{0} as derived in Section 6.2.

Experimental Setup.

For our experiments, we implemented the following three algorithms: (a) Algorithm 3 with Kq=1K_{q}=1, i.e., δt=ϵ/2\delta_{t}=\epsilon/2 for all t1t\geq 1 (GFW-ApproxI), (b) Algorithm 2 with δt=ϵ/2+mink<tGka\delta_{t}=\epsilon/2+\min_{k<t}G_{k}^{a} (GFW-ApproxII), and (c) [36, Algorithm 1] (GFW-Exact). Note that, for (SC-GMean-Cmp), θ=d\theta=d, and Rg=0R_{g}=0 since g()g(\cdot) is an indicator function for 𝒮\mathcal{S} and all three algorithms generate iterates feasible for the set 𝒮\mathcal{S}. The values of other parameters were set as: ϵ=0.05\epsilon=0.05, l=1l=1. Note that, Algorithm 3, and thus, GFW-ApproxI and GFW-ApproxII, stop after the bounds, GtaϵG_{t}^{a}\leq\epsilon and δt3ϵ/2\delta_{t}\leq 3\epsilon/2, are satisfied at least ll times, whereas GFW-Exact stops if GtaϵG_{t}^{a}\leq\epsilon [36]. Let KuK^{u} be defined as

Ku={10.6Δ0+248d2ϵ+2+log(1/p¯)1p+2log(1/p¯)for GFW-ApproxI10.6Δ0+272d2ϵ+2(2+log2(d/ϵ))1p+2log(1/p¯)for GFW-ApproxII5.3(Δ0+d)log(10.6Δ0)+24d2ϵfor GFW-Exact,K^{u}=\begin{cases}\lceil 10.6\Delta_{0}\rceil+2\left\lceil\frac{48d^{2}}{\epsilon}\right\rceil+2+\frac{\log(1/\overline{p})}{1-p}+2\log(1/\overline{p})\quad\textup{for {GFW-ApproxI}}\\ \lceil 10.6\Delta_{0}\rceil+2\left\lceil\frac{72d^{2}}{\epsilon}\right\rceil+\frac{2(2+\lceil\log_{2}(d/\epsilon)\rceil)}{1-p}+2\log(1/\overline{p})\quad\textup{for {GFW-ApproxII}}\\ \left\lceil 5.3(\Delta_{0}+d)\log(10.6\Delta_{0})\right\rceil+\left\lceil\frac{24d^{2}}{\epsilon}\right\rceil\quad\textup{for {GFW-Exact}},\end{cases} (7.1)

which follows from Lemma 13, Lemma 16 and [36, Theorem 2.1] respectively. Note that, in (7.1), pp is an upper bound on the probability of failure of PLMO, and determines its computational complexity (see Lemma 16), while p¯\overline{p} denotes an upper bound on the probability that the algorithm stops after at most KuK^{u} iterations. Thus, if KuK^{u} is as defined in (7.1), we have that with probability at least 1p¯1-\overline{p}, GFW-ApproxI and GFW-ApproxII stops after at most KuK^{u} iterations, and with probability 1, GFW-Exact stops after at most KuK^{u} iterations. For our experiments, we set p=p¯=0.1p=\overline{p}=0.1.

The computations were performed using MATLAB R2023a on a machine with 8GB RAM. For GFW-Exact, we used eig in MATLAB to implement LMO. Whereas to implement PLMO in Algorithm 3, we used eigs in MATLAB instead of Lanczos method as stated in Section 6.

Using eigs in MATLAB.

For (SC-GMean-Cmp), the update direction HtH^{t} generated at each iteration of Algorithm 3 is a rank-1 matrix Ht=hhTH^{t}=hh^{T} such that hh is an approximate minimum eigenvector of (f((Xt)))\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t}))). The command eigs in MATLAB generates an approximate minimum eigenvalue-eigenvector pair (λ,h)(\lambda,h) for the matrix C=(f((Xt)))C=\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t}))) such that hh is a unit vector and Chλhtol×C\|Ch-\lambda h\|\leq\textup{tol}\times\|C\|, where tol is a user-defined value. In case of GFW-ApproxI, we set tol=2.22×1016\textup{tol}=2.22\times 10^{-16}, which is the default value, at each iteration of the algorithm. While at each iteration tt of GFW-ApproxII, we set tol=δtC\textup{tol}=\frac{\delta_{t}}{\|C\|}. Note that, for (SC-GMean-Cmp), the optimal objective function value of the subproblem (Lin-OPT) at iteration tt is (f((Xt)))\|\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t})))\|. We verify that by setting the value of tol as given above for both algorithms, eigs returns a unit vector hh such that hT(f((Xt)))h=(f((Xt)))+ηt(f((Xt)))+δth^{T}\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t})))h=\|\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t})))\|+\eta_{t}\leq\|\mathcal{B}^{*}(\nabla f(\mathcal{B}(X^{t})))\|+\delta_{t} at each iteration tt. Thus, eigs returns an update direction Ht=hhTH^{t}=hh^{T} such that ηtδt\eta_{t}\leq\delta_{t} for all t0t\geq 0.

Results for Diag problem instances.

The random instances generated were of two sizes (n=500n=500, d=50,100d=50,100). For Diag problem instances, we randomly initialized all three algorithms to X0=i=1nujujTX^{0}=\sum_{i=1}^{n}u_{j}u_{j}^{T} such that uj𝒩(0,I)u_{j}\sim\mathcal{N}(0,I) for each j[n]j\in[n]. For each problem instance, we performed 10 runs of each algorithm with different random initialization. Table 1 shows the computational results for GFW-ApproxI, GFW-ApproxII and GFW-Exact. The quantity KK denotes the number of iterations until the algorithms stop and KuK^{u}, defined in (7.1), denotes an upper bound on KK. Note that the average is taken over 10 independent runs with different random initializations for each problem instance.

Table 1: Computational results of using GFW-ApproxI, GFW-ApproxII and GFW-Exact to generate an ϵ\epsilon-optimal solution to Diag instances of (SC-GMean). Here, average is taken over 10 independent runs with different random initialization.
nn dd Algorithm Avg time (secs) Avg KuK^{u} (×106\times 10^{6}) Avg KK (×104\times 10^{4}) σ[K]\sigma[K]
500 50 GFW-ApproxI 652.33 4.8 2.948 356.79
500 50 GFW-ApproxII 682.33 7.2 3.03 337.07
500 50 GFW-Exact 1038.1 1.21 5.054 0.95
500 100 GFW-ApproxI 2374.9 19.2 5.717 398.36
500 100 GFW-ApproxII 2548 28.8 5.965 748.17
500 100 GFW-Exact 8407.4 4.81 20.12 2.02

From average KuK^{u} and average KK in Table 1, we see that all three algorithms perform approximately two orders of magnitude better than the theoretical convergence bounds. Also, the standard deviation of KK shows small variation over random initialization of the algorithms. We also see that for both random instances of varying size, GFW-ApproxI and GFW-ApproxII perform lesser number of iterations than GFW-Exact before stopping. Figure 1 shows the change in the optimality gap with number of iterations for the three algorithms when they were used to generate an ϵ\epsilon-optimal solution to a Diag instance of (SC-GMean-Cmp) with n=500n=500, d=50d=50. From the figure, we see that all three algorithms show similar trend in the change of optimality gap at each iteration, and after a few initial iterations, the algorithms converge linearly wrt tt to an ϵ\epsilon-optimal solution. We also observe that GFW-ApproxI and GFW-ApproxII require slightly fewer iterations than GFW-Exact to converge to an ϵ\epsilon-optimal solution.

Refer to caption
Figure 1: Optimality gap vs number of iterations for Diag instance (n=500n=500, d=50d=50) of (SC-GMean-Cmp).
Results for Rnd problem instances.

The three algorithms were randomly initialized to X0X^{0} such that X0=j=1nujujTX^{0}=\sum_{j=1}^{n}u_{j}u_{j}^{T} with uj𝒩(0,I)u_{j}\sim\mathcal{N}(0,I) for j[n]j\in[n]. For each problem instance, we performed 10 runs of each algorithm with different random initialization. The results for Rnd problem instances are given in Table 2. In the table, the quantity KK denotes the number of iterations until the algorithms stop and KuK^{u}, defined in (7.1), denotes an upper bound on KK. The average values in the table arise from the different random initial points during each run of the algorithm.

Table 2: Computational results of using GFW-ApproxI, GFW-ApproxII and GFW-Exact to generate an ϵ\epsilon-optimal solution to Rnd instances of (SC-GMean).
nn dd Algorithm Avg time (secs) KuK^{u} (×108\times 10^{8}) Avg KK σ[K]\sigma[K]
200 250 GFW-ApproxI 1.84 1.20 88.7 1.34
200 250 GFW-ApproxII 2.46 1.80 89.7 1.34
200 250 GFW-Exact 1.94 0.30 88.7 1.34
300 350 GFW-ApproxI 6.93 2.35 111.7 1.83
300 350 GFW-ApproxII 6.89 3.52 112.7 1.83
300 350 GFW-Exact 6.23 0.58 111.7 1.83
400 450 GFW-ApproxI 14.54 3.88 131.1 1.37
400 450 GFW-ApproxII 16.96 5.82 132.1 1.37
400 450 GFW-Exact 16.38 0.97 131.1 1.37
400 600 GFW-ApproxI 21.66 6.91 149.7 1.49
400 600 GFW-ApproxII 24.92 10.36 150.7 1.49
400 600 GFW-Exact 23.79 1.73 149.7 1.49
500 750 GFW-ApproxI 63.7 10.8 172.4 1.5
500 750 GFW-ApproxII 52.72 16.2 173.4 1.62
500 750 GFW-Exact 71.71 2.70 172.4 1.5
600 900 GFW-ApproxI 152 15.55 193.6 1.17
600 900 GFW-ApproxII 142.74 23.32 194.7 1.25
600 900 GFW-Exact 122.43 3.89 193.6 1.17
700 750 GFW-ApproxI 170.08 10.8 181.8 1.32
700 750 GFW-ApproxII 167.85 16.2 182.8 1.32
700 750 GFW-Exact 176.51 2.70 181.8 1.32

From the values of average KuK^{u} and average KK in Table 2, we see that for each problem instance, all three algorithms satisfiy the stopping criterion after <500<500 iterations. This value is several orders of magnitude smaller than the theoretical upper bounds given in (7.1). Furthermore, the standard deviation of KK shows a very small variation over the random initialization of the algorithms. Finally, we also observed that for varying values of nn and dd, all algorithms take the same number of iterations to converge to an ϵ\epsilon-optimal solution. Thus, for the specific generated random instances, the approximation error in solving the linear subproblem does not affect the convergence of the algorithm. Furthermore, Figure 2 shows the change Gta+δtG_{t}^{a}+\delta_{t} with the number of iterations for the three algorithms (with Rnd instance of size n=700n=700, d=750d=750 as input) until the stopping criterion is satisfied. The y-axis shows the average of Gta+δtG_{t}^{a}+\delta_{t} taken over 10 runs of the algorithms each with a different random initialization. Since ΔtGta+δt\Delta_{t}\leq G_{t}^{a}+\delta_{t}, the y-axis denotes the upper bound on the optimality gap. Note that, for GFW-ApproxII, we have δt=ϵ/2+minτ<tGτa\delta_{t}=\epsilon/2+\min_{\tau<t}G_{\tau}^{a}, and so, for GFW-ApproxII, Gta+δtG_{t}^{a}+\delta_{t} has a value larger than that for GFW-ApproxI and GFW-Exact. Furthermore, all three algorithms show similar trend in the decrease in Gta+δtG_{t}^{a}+\delta_{t}, and consequently, GtaG_{t}^{a}. We also observe that the quantity Gta+δtG_{t}^{a}+\delta_{t} decreases exponentially wrt tt. Note that, this is a much faster convergence as compared to Diag instances (see Figure 1).

From these experiments (for Diag as well as Rnd instances), we see that our proposed method (Algorithm 3) has similar performance as Algorithm 1, i.e., the algorithm with exact linear minimization oracle.

Refer to caption
Figure 2: Average Gta+δtG_{t}^{a}+\delta_{t} vs number of iterations (tt) for Rnd instance (n=700n=700, d=750d=750) of (SC-GMean-Cmp). Note, since ΔtGta+δt\Delta_{t}\leq G_{t}^{a}+\delta_{t} with probability at least 1pl1-p^{l}, y-axis denotes the upper bound on optimality gap Δt\Delta_{t}.

8 Discussion

In this paper, we proposed a first-order algorithm, based on the Frank-Wolfe method, with provable approximation guarantees for composite optimization problems with θ\theta-logarithmically-homogeneous, MM-self-concordant barrier function in the objective and a convex, compact feasible set. We showed that it is enough to solve the linear subproblem at each iteration approximately making Algorithm 2 more practical while also reducing the complexity of each iteration of the algorithm. We also showed that Algorithm 2 can generate an ϵ\epsilon-optimal solution to (SC-OPT) when the oracle generates an approximate solution with either additive or relative approximation error. We analyzed Algorithm 2 for two different settings of the value of δt\delta_{t}. However, it is possible that alternate techniques to set the value of δt\delta_{t} could result in potentially faster convergence. Since the convergence of the algorithm depends on the initial optimality gap, we can potentially use warm starting techniques to speed up the method. We further proposed Algorithm 3 which generates an ϵ\epsilon-optimal solution to (SC-OPT) with high probability when the linear minimization oracle has a nonzero probability of failure. Finally, we performed numerical experiments to show that the convergence analysis of our method holds true. The results of numerical experiments confirmed that the convergence bounds are satisfied for random instances of (SC-GMean).

Acknowledgements.

This material is based upon work supported by the National Science Foundation under Grant No. DMS-1929284 while the author, Nimita Shinde, was in residence at the Institute for Computational and Experimental Research in Mathematics in Providence, RI, during the Discrete Optimization: Mathematics, Algorithms, and Computation semester program.

Appendix A Preliminary results

Lemma 17.

Let the quantities RgR_{g}, GtaG_{t}^{a} and DtD_{t} be defined in (2.12), (2.10) and (2.13) respectively. For any t0t\geq 0, if Gta0G_{t}^{a}\geq 0, it holds that

DtGta+θ+Rg.D_{t}\leq G_{t}^{a}+\theta+R_{g}. (A.1)
Proof.

Zhao and Freund [36] state the relationship between DtD_{t} and the duality gap GtG_{t}, which is given as DtGt+θ+RgD_{t}\leq G_{t}+\theta+R_{g} [36, Proposition 2.3]. The proof of (A.1) is adapted from the proof of [36, Proposition 2.3].

From the definition of DtD_{t}, we have

Dt2=(htxt)(xt)2=((xt))(ht),(ht)2((xt))(xt),(ht)+((xt))(xt),(xt).\begin{split}D_{t}^{2}&=\|\mathcal{B}(h^{t}-x^{t})\|_{\mathcal{B}(x^{t})}^{2}\\ &=\langle\mathcal{H}(\mathcal{B}(x^{t}))\mathcal{B}(h^{t}),\mathcal{B}(h^{t})\rangle-2\langle\mathcal{H}(\mathcal{B}(x^{t}))\mathcal{B}(x^{t}),\mathcal{B}(h^{t})\rangle+\langle\mathcal{H}(\mathcal{B}(x^{t}))\mathcal{B}(x^{t}),\mathcal{B}(x^{t})\rangle.\end{split} (A.2)

We now bound Ut=((xt))(ht),(ht)U^{t}=\langle\mathcal{H}(\mathcal{B}(x^{t}))\mathcal{B}(h^{t}),\mathcal{B}(h^{t})\rangle as

Ut(i)f((xt)),(ht)2=[f((xt)),(htxt)+f((xt)),(xt)]2=(ii)[Gtag(ht)+g(Xt)+f((xt)),(xt)]2=(iii)[Gtag(ht)+g(xt)θ]2,\begin{split}U^{t}&\underset{(i)}{\leq}\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t})\rangle^{2}\\ &=[\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t}-x^{t})\rangle+\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t})\rangle]^{2}\\ &\underset{(ii)}{=}[-G_{t}^{a}-g(h^{t})+g(X^{t})+\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t})\rangle]^{2}\\ &\underset{(iii)}{=}[-G_{t}^{a}-g(h^{t})+g(x^{t})-\theta]^{2},\end{split} (A.3)

where (i) follows from (2.5), (ii) follows from the definition of GtaG_{t}^{a} (given in (2.10)), and (iii) follows from (2.7). For simplicity, define Vt=2((xt))(xt),(ht)+((xt))(xt),(xt)V^{t}=-2\langle\mathcal{H}(\mathcal{B}(x^{t}))\mathcal{B}(x^{t}),\mathcal{B}(h^{t})\rangle+\langle\mathcal{H}(\mathcal{B}(x^{t}))\mathcal{B}(x^{t}),\mathcal{B}(x^{t})\rangle. Then, from (2.6), (2.7) and (2.10), it follows that

Vt=(i)2f((xt)),(htxt)+f((xt)),(xt)=(ii)2Gta2g(ht)+2g(xt)+f((xt)),(xt)=(iii)2Gta2g(ht)+2g(xt)θ,\begin{split}V^{t}&\underset{(i)}{=}2\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(h^{t}-x^{t})\rangle+\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t})\rangle\\ &\underset{(ii)}{=}-2G_{t}^{a}-2g(h^{t})+2g(x^{t})+\langle\nabla f(\mathcal{B}(x^{t})),\mathcal{B}(x^{t})\rangle\\ &\underset{(iii)}{=}-2G_{t}^{a}-2g(h^{t})+2g(x^{t})-\theta,\end{split} (A.4)

where (i) follows from (2.6), (ii) follows from (2.10), and (iii) follows from (2.7). Combining (A.2), (A.3) and (A.4),

Dt2[(Gta+θ)+(g(ht)g(xt))]22Gta2g(ht)+2g(xt)θ=(Gta+θ)2+2(Gta+θ)(g(ht)g(xt))+(g(ht)g(xt))22Gta2g(ht)+2g(xt)θ(i)(Gta+θ)2+2(Gta+θ)(g(ht)g(xt))+(g(ht)g(xt))22(g(ht)g(xt))=(Gta+θ)2+2(Gta+θ1)(g(ht)g(xt))+(g(ht)g(xt))2(ii)(Gta+θ)2+2Rg(Gta+θ1)+Rg2(Gta+θ)2+2Rg(Gta+θ)+Rg2=(Gta+θ+Rg)2,\begin{split}D_{t}^{2}&\leq[(G_{t}^{a}+\theta)+(g(h^{t})-g(x^{t}))]^{2}-2G_{t}^{a}-2g(h^{t})+2g(x^{t})-\theta\\ &=(G_{t}^{a}+\theta)^{2}+2(G_{t}^{a}+\theta)(g(h^{t})-g(x^{t}))+(g(h^{t})-g(x^{t}))^{2}-2G_{t}^{a}-2g(h^{t})+2g(x^{t})-\theta\\ &\underset{(i)}{\leq}(G_{t}^{a}+\theta)^{2}+2(G_{t}^{a}+\theta)(g(h^{t})-g(x^{t}))+(g(h^{t})-g(x^{t}))^{2}-2(g(h^{t})-g(x^{t}))\\ &=(G_{t}^{a}+\theta)^{2}+2(G_{t}^{a}+\theta-1)(g(h^{t})-g(x^{t}))+(g(h^{t})-g(x^{t}))^{2}\\ &\underset{(ii)}{\leq}(G_{t}^{a}+\theta)^{2}+2R_{g}(G_{t}^{a}+\theta-1)+R_{g}^{2}\\ &\leq(G_{t}^{a}+\theta)^{2}+2R_{g}(G_{t}^{a}+\theta)+R_{g}^{2}\\ &=(G_{t}^{a}+\theta+R_{g})^{2},\end{split}

where (i) uses that the fact that Gta0G_{t}^{a}\geq 0 and θ0\theta\geq 0, and (ii) uses the definition of RgR_{g} (2.12). ∎

References

  • Atwood [1969] Corwin L Atwood. Optimal and efficient designs of experiments. The Annals of Mathematical Statistics, pages 1570–1602, 1969.
  • Bach [2010] Francis Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statistics, 4:384–414, 2010.
  • Baumgratz et al. [2013] Tillmann Baumgratz, Alexander Nüßeler, Marcus Cramer, and Martin B Plenio. A scalable maximum likelihood method for quantum state tomography. New Journal of Physics, 15(12):125004, 2013.
  • Cover [1984] Thomas Cover. An algorithm for maximizing expected log investment return. IEEE Transactions on Information Theory, 30(2):369–373, 1984.
  • de Aguiar et al. [1995] P Fernandes de Aguiar, B Bourguignon, MS Khots, DL Massart, and R Phan-Than-Luu. D-optimal designs. Chemometrics and intelligent laboratory systems, 30(2):199–210, 1995.
  • Dvurechensky et al. [2020] Pavel Dvurechensky, Petr Ostroukhov, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Self-concordant analysis of Frank-Wolfe algorithms. In International Conference on Machine Learning, pages 2814–2824. PMLR, 2020.
  • Dvurechensky et al. [2022] Pavel Dvurechensky, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Generalized self-concordant analysis of Frank–Wolfe algorithms. Mathematical Programming, pages 1–69, 2022.
  • Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics, 3(1-2):95–110, 1956.
  • Glancy et al. [2012] Scott Glancy, Emanuel Knill, and Mark Girard. Gradient-based stopping rules for maximum-likelihood quantum-state tomography. New Journal of Physics, 14(9):095017, 2012.
  • Harmany et al. [2011] Zachary T Harmany, Roummel F Marcia, and Rebecca M Willett. This is SPIRAL-TAP: Sparse Poisson intensity reconstruction algorithms—theory and practice. IEEE Transactions on Image Processing, 21(3):1084–1096, 2011.
  • Hradil [1997] Zdeněk Hradil. Quantum-state estimation. Physical Review A, 55(3):R1561, 1997.
  • Hradil et al. [2004] Zdeněk Hradil, Jaroslav Řeháček, Jaromír Fiurášek, and Miroslav Ježek. Maximum-likelihood methods in quantum mechanics. In Quantum state estimation, pages 59–112. Springer, 2004.
  • Jaggi [2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th International Conference on Machine Learning, pages 427–435, 2013.
  • Kuczyński and Woźniakowski [1992] Jacek Kuczyński and Henryk Woźniakowski. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications, 13(4):1094–1122, 1992.
  • Liu et al. [2021] Deyi Liu, Volkan Cevher, and Quoc Tran-Dinh. A Newton Frank–Wolfe method for constrained self-concordant minimization. Journal of Global Optimization, pages 1–27, 2021.
  • Marteau-Ferey et al. [2019] Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. In Conference on Learning Theory, pages 2294–2340. PMLR, 2019.
  • Nemirovski and Todd [2008] Arkadi S Nemirovski and Michael J Todd. Interior-point methods for optimization. Acta Numerica, 17:191–234, 2008.
  • Nesterov [2003] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003.
  • Nesterov and Nemirovskii [1994] Yurii Nesterov and Arkadii Nemirovskii. Interior-point polynomial algorithms in convex programming. SIAM, 1994.
  • Nowak and Kolaczyk [1998] Robert Nowak and Eric D Kolaczyk. A multiscale MAP estimation method for Poisson inverse problems. In Conference Record of Thirty-Second Asilomar Conference on Signals, Systems and Computers (Cat. No. 98CH36284), volume 2, pages 1682–1686. IEEE, 1998.
  • Nowak and Kolaczyk [2000] Robert Nowak and Eric D Kolaczyk. A multiscale statistical framework for Poisson inverse problems. IEEE Trans. Info. Theory, 46:1811–1825, 2000.
  • Owen [2013] Art B Owen. Self-concordance for empirical likelihood. Canadian Journal of Statistics, 41(3):387–397, 2013.
  • Rue and Held [2005] Havard Rue and Leonhard Held. Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC, 2005.
  • Scholten and Blume-Kohout [2018] Travis L Scholten and Robin Blume-Kohout. Behavior of the maximum likelihood in quantum state tomography. New Journal of Physics, 20(2):023050, 2018.
  • Shinde et al. [2021] Nimita Shinde, Vishnu Narayanan, and James Saunderson. Memory-efficient structured convex optimization via extreme point sampling. SIAM Journal on Mathematics of Data Science, 3(3):787–814, 2021.
  • Tran-Dinh et al. [2013] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. A proximal Newton framework for composite minimization: Graph learning without Cholesky decompositions and matrix inversions. In International Conference on Machine Learning, pages 271–279. PMLR, 2013.
  • Tran-Dinh et al. [2014] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. An inexact proximal path-following algorithm for constrained convex minimization. SIAM Journal on Optimization, 24(4):1718–1745, 2014.
  • Tran-Dinh et al. [2015] Quoc Tran-Dinh, Anastasios Kyrillidis, and Volkan Cevher. Composite self-concordant minimization. Journal of Machine Learning Research, 16(1):371–416, 2015.
  • Tropp et al. [2017] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017.
  • Tropp et al. [2019] Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Streaming low-rank matrix approximation with an application to scientific simulation. SIAM Journal on Scientific Computing, 41(4):A2430–A2463, 2019.
  • Vardi and Lee [1993] Y Vardi and D Lee. From image deblurring to optimal investments: Maximum likelihood solutions for positive linear inverse problems. Journal of the Royal Statistical Society: Series B (Methodological), 55(3):569–598, 1993.
  • Yuan and Parrilo [2021] Chenyang Yuan and Pablo A Parrilo. Semidefinite relaxations of products of nonnegative forms on the sphere. arXiv preprint arXiv:2102.13220, 2021.
  • Yurtsever et al. [2017] Alp Yurtsever, Madeleine Udell, Joel A Tropp, and Volkan Cevher. Sketchy decisions: Convex low-rank matrix optimization with optimal storage. In Artificial Intelligence and Statistics, pages 1188–1196, 2017.
  • Yurtsever et al. [2021] Alp Yurtsever, Joel A Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalable semidefinite programming. SIAM Journal on Mathematics of Data Science, 3(1):171–200, 2021.
  • Zhao [2021] Renbo Zhao. Non-asymptotic convergence analysis of the multiplicative gradient algorithm for the log-optimal investment problems. arXiv preprint arXiv:2109.05601, 2021.
  • Zhao and Freund [2022] Renbo Zhao and Robert M Freund. Analysis of the frank–wolfe method for convex composite optimization involving a logarithmically-homogeneous barrier. Mathematical Programming, pages 1–41, 2022.