This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A conditional gradient homotopy method with applications to Semidefinite Programming

Pavel Dvurechensky Weierstrass Institute for Applied Analysis and Stochastics, Mohrenstr. 39, 10117 Berlin, Germany
([email protected])
Gabriele Iommazzo Zuse Institute Berlin, Berlin, Germany
([email protected])
Shimrit Shtern Faculty of Data and Decision Sciences, Technion - Israel Institute of Technology, Haifa, Israel
([email protected])
Mathias Staudigl Corresponding Author Department of Mathematics, 68159 Mannheim, B6, Germany
([email protected])
Abstract

We propose a new homotopy-based conditional gradient method for solving convex optimization problems with a large number of simple conic constraints. Instances of this template naturally appear in semidefinite programming problems arising as convex relaxations of combinatorial optimization problems. Our method is a double-loop algorithm in which the conic constraint is treated via a self-concordant barrier, and the inner loop employs a conditional gradient algorithm to approximate the analytic central path, while the outer loop updates the accuracy imposed on the temporal solution and the homotopy parameter. Our theoretical iteration complexity is competitive when confronted to state-of-the-art SDP solvers, with the decisive advantage of cheap projection-free subroutines. Preliminary numerical experiments are provided for illustrating the practical performance of the method.

1 Introduction

In this paper we investigate a new algorithm for solving a large class of convex optimization problems with conic constraints of the following form:

minxg(x)s.t. x𝖷,𝒫(x)𝖪𝖧.\min_{x}g(x)\quad\text{s.t. }x\in\mathsf{X},\mathcal{P}(x)\in\mathsf{K}\subseteq\mathsf{H}. (P)

The objective function g:𝖤(,]g:\mathsf{E}\to(-\infty,\infty] is assumed to be closed, convex, and proper, 𝖷𝖤\mathsf{X}\subseteq\mathsf{E} is a compact convex set, 𝒫:𝖤𝖧\mathcal{P}:\mathsf{E}\to\mathsf{H} is an affine mapping between two finite-dimensional Euclidean vector spaces 𝖤\mathsf{E} and 𝖧\mathsf{H}, and 𝖪\mathsf{K} is a closed convex pointed cone. Problem (P) is sufficiently generic to cover many optimization settings considered in the literature.

Example 1.1 (Packing SDP).

The model template (P) covers the large class of packing semidefinite programs (SDPs) [24, 14]. In this problem class we have 𝖤=𝕊n\mathsf{E}=\mathbb{S}^{n} – the space of symmetric n×nn\times n matrices, a collection of positive semidefinite input matrices 𝐀1,,𝐀m𝕊+n\mathbf{A}_{1},\ldots,\mathbf{A}_{m}\in\mathbb{S}^{n}_{+}, with constraints given by 𝒫(𝐗)=[1tr(𝐀1𝐗);;1tr(𝐀m𝐗)]\mathcal{P}(\mathbf{X})=[1-\operatorname{tr}(\mathbf{A}_{1}\mathbf{X});\ldots;1-\operatorname{tr}(\mathbf{A}_{m}\mathbf{X})]^{\top} and 𝖪=+m\mathsf{K}=\mathbb{R}^{m}_{+}, coupled with the objective function g(𝐗)=tr(C𝐗)g(\mathbf{X})=\operatorname{tr}(C\mathbf{X}) and the set 𝖷={𝐗𝕊+n|tr(𝐗)ρ}\mathsf{X}=\{\mathbf{X}\in\mathbb{S}^{n}_{+}|\operatorname{tr}(\mathbf{X})\leq\rho\}. Special instances of packing SDPs include the MaxCut SDP relaxation of [19], the Lovasz-θ\theta function SDP, among many others; see [23].

Example 1.2 (Covering SDP).

A dual formulation of the packing SDP is the covering problem [14]. In its normalized version, the problem reads as

minxi=1mxis.t.:i=1mxi𝐀i𝐈0,x+m,\displaystyle\min_{x}\sum_{i=1}^{m}x_{i}\quad\text{s.t.:}\sum_{i=1}^{m}x_{i}\mathbf{A}_{i}-\mathbf{I}\succeq 0,x\in\mathbb{R}^{m}_{+},

where 𝐀1,,𝐀m𝕊+n\mathbf{A}_{1},\ldots,\mathbf{A}_{m}\in\mathbb{S}^{n}_{+} are given input matrices. This can be seen as a special case of problem (P) with g(x)=i=1mxi,𝒫(x)=i=1mxi𝐀i𝐈,𝖪=𝕊+n,𝖧=𝕊n,𝖤=mg(x)=\sum_{i=1}^{m}x_{i},\mathcal{P}(x)=\sum_{i=1}^{m}x_{i}\mathbf{A}_{i}-\mathbf{I},\mathsf{K}=\mathbb{S}^{n}_{+},\mathsf{H}=\mathbb{S}^{n},\mathsf{E}=\mathbb{R}^{m}, and the constraint set 𝖷={x+m|i=1mxiρ}\mathsf{X}=\{x\in\mathbb{R}^{m}_{+}|\sum_{i=1}^{m}x_{i}\leq\rho\} for some large ρ\rho added to the problem for the compactification purposes.

Example 1.3 (Relative entropy programs).

Relative entropy programs are conic optimization problems in which a linear functional of a decision variable is minimized subject to linear constraints and conic constraints specified by a relative entropy cone. The relative entropy cone is defined for triples (u,v,w)n×n×n(u,v,w)\in\mathbb{R}^{n}\times\mathbb{R}^{n}\times\mathbb{R}^{n} via Cartesian products of the following elementary cone

𝖪RE(1)={(ν,λ,δ)+×+×|νlog(λν)δ}.\mathsf{K}^{(1)}_{\text{RE}}=\{(\nu,\lambda,\delta)\in\mathbb{R}_{+}\times\mathbb{R}_{+}\times\mathbb{R}|-\nu\log\left(\frac{\lambda}{\nu}\right)\leq\delta\}.

As the function ++2(ν,λ)νlog(λ/ν)\mathbb{R}^{2}_{++}\ni(\nu,\lambda)\mapsto-\nu\log(\lambda/\nu) is the perspective transform of the negative logarithm, we know that it is a convex function [4]. Hence, 𝖪RE(1)\mathsf{K}^{(1)}_{\text{RE}} is a convex cone. The relative entropy cone is accordingly defined as 𝖪=i=1n𝖪RE(1)\mathsf{K}=\prod_{i=1}^{n}\mathsf{K}^{(1)}_{\text{RE}}. This set appears in many important applications, containing estimation of dynamical systems and the optimization of quantum channels; see [5] for an overview.

Example 1.4 (Relative Entropy entanglement).

The relative entropy of entanglement (REE) is an important measure for entanglement for a given quantum state in quantum information theory [31]. Obtaining accurate bounds on the REE is a major problem in quantum information theory. In [44] the following problem is used to find a lower bound to the REE:

minxg(x)=tr[ρ(log(ρ)log(x))]\displaystyle\min_{x}g(x)=\operatorname{tr}[\rho(\log(\rho)-\log(x))]
s.t.: x+n,tr(x)=1,𝒫(x)+n,\displaystyle\text{s.t.: }x\in\mathbb{H}^{n}_{+},\operatorname{tr}(x)=1,\;\mathcal{P}(x)\in\mathbb{H}^{n}_{+},

where +n\mathbb{H}^{n}_{+} is the cone of n×nn\times n Hermitian positive semi-definite matrices, and 𝒫(x)\mathcal{P}(x) is the so-called partial transpose operator, which is a linear symmetric operator (see [22]). ρ+n\rho\in\mathbb{H}^{n}_{+} is a given quantum reference state. We obtain a problem of the form (P), upon the identification 𝖤=n\mathsf{E}=\mathbb{H}^{n} the space of n×nn\times n Hermitian matrices, and 𝖪=+n\mathsf{K}=\mathbb{H}^{n}_{+}, and 𝖷={x𝖤|x+n,tr(x)=1}\mathsf{X}=\{x\in\mathsf{E}|x\in\mathbb{H}^{n}_{+},\operatorname{tr}(x)=1\}.

The model template (P) can be solved with moderate accuracy using fast proximal gradient methods, like FISTA and related acceleration tricks [12]. In this formalism, the constraints imposed on the optimization problem have to be included in proximal operators which essentially means that one can calculate a projection onto the feasible set. Unfortunately, the computation of proximal operators can impose a significant burden and even lead to intractability of gradient-based methods. This is in particular the case in SDPs with a large number of constraints and decision variables, such as in convex relaxations of NP-hard combinatorial optimization problems [28]. Other prominent examples in machine learning are kk-means clustering [32], kk-nearest neighbor classification [36], sparse PCA [8], kernel learning [27], among many others. In many of these mentioned applications, it is not only the size of the problem that challenges the application of off-the-shelve solvers, but also the desire to obtain sparse solutions. As a result, the Conditional Gradient (CndG) method [17] has seen renewed interest in large-scale convex programming, since it only requires computing a Linear Mnimization Oracle (LMO) at each iteration. Compared to proximal methods, CndG features significantly reduced computational costs (e.g. when 𝖷\mathsf{X} is a spectrahedron), tractability and interpretability (e.g. it generates solutions as a combination of atoms of 𝖷)\mathsf{X}).

Our approach

In this paper we propose a CndG framework for solving (P) with rigorous iteration complexity guarantees. Our approach retains the simplicity of CndG, but allows to disentangle the different sources of complexity describing the problem. Specifically, we study the performance of a new homotopy/path-following method, which iteratively approaches a solution of problem (P). To develop the path-following argument we start by decomposing the feasible set into two components. We assume that 𝖷\mathsf{X} is a compact convex set which admits efficient applications of CndG. The challenging constraints are embodied by the membership condition 𝒫(x)𝖪\mathcal{P}(x)\in\mathsf{K}. These are conic restrictions on the solution, and in typical applications of our method we have to manage a lot of them. Our approach deals with these constraints via a logarithmically homogeneous barrier for the cone 𝖪\mathsf{K}. [30] introduced this class of functions in connection with polynomial-time interior point methods. It is a well-known fact that any closed convex cone admits a logarithmically homogeneous barrier (the universal barrier) [30, Thm. 2.5.1]. A central assumption of our approach is that we have a practical representation of a logarithmically homogenous barrier for the cone 𝖪\mathsf{K}. As many restrictions appearing in applications are linear or PSD inequalities, this assumption is rather mild. Exploiting the announced decomposition, we approximate the target problem (P) by a family of parametric penalty-based composite convex optimization problems formulated in terms of the potential function

minx𝖷{Vt(x):=1tF(x)+g(x)}.\min_{x\in\mathsf{X}}\{V_{t}(x):=\frac{1}{t}F(x)+g(x)\}. (1.1)

This potential function involves a path-following/homotopy parameter t>0t>0, and FF is a barrier function defined over the set 𝒫1(𝖪)\mathcal{P}^{-1}(\mathsf{K}). By minimizing the function VtV_{t} over the set 𝖷\mathsf{X} for a sequence of increasing values of tt, we can trace the analytic central path z(t)z^{\ast}(t) of (1.1) as it converges to a solution of (P). The practical implementation of this path-following scheme is achieved via a double-loop algorithm in which the inner loop performs a number of CndG updating steps on the potential function VtV_{t}, and the outer-loop readjusts the parameter tt as well as the desired tolerance imposed on the CndG-solver. Importantly, unlike existing primal-dual methods for (P) (e.g. [41]), our approach generates a sequence of feasible points for the original problem (P).

Comparison to the literature

Exact homotopy/path-following methods were developed in statistics and signal processing in the context of the 1\ell_{1}-regularized least squares problem (LASSO) [34] to compute the entire regularization path. Relatedly, approximate homotopy methods have been studied in this context as well [21, 37], and superior empirical performance has been reported for seeking sparse solutions. [38] is the first reference which investigates the complexity of a proximal gradient homotopy method for the LASSO problem. To our best knowledge, our paper provides the first complexity analysis of a homotopy CndG algorithm to compute an approximate solution close to the analytic central path for the optimization template (P). Our analysis is heavily influenced by recent papers [43] and [42]. They study a generalized CndG algorithm designed for solving a composite convex optimization problem, in which the smooth part is given by a logarithmically homogeneous barrier. Unlike [43], we analyze a more complicated dynamic problem in which the objective changes as we update the penalty parameter, and we propose important practical extensions including line-search and inexact-LMO versions of the CndG algorithm for reducing the potential function (1.1). Furthermore, we analyze the complexity of the whole path-following process in order to approximate a solution to the initial non-penalized problem (P).

Main Results

One of the challenging questions is how to design a penalty parameter updating schedule, and we develop a practical answer to this question. Specifically, our contributions can be summarized as follows:

  • We introduce a simple CndG framework for solving problem (P) and prove that it achieves a O~(ε2)\tilde{O}(\varepsilon^{-2}) iteration complexity, where O~()\tilde{O}(\cdot) hides polylogarithmic factors (Theorem 3.3).

  • We provide two extensions for the inner-loop CndG algorithm solving the potential minimization problem (1.1): a line-search and an inexact-LMO version. For both, we show that the complexity of the inner and the outer loop are the same as in the basic variant up to constant factors.

  • We present practically relevant applications of our framework, including instances of SDPs arising from convex relaxations of NP-hard combinatorial optimization problems, and provide promising results of the numerical experiments.

Direct application of existing CndG frameworks [40, 41] for solving (P) would require projection onto 𝖪\mathsf{K}, which may be complicated if, e.g., 𝖪\mathsf{K} is a PSD cone. Moreover, these algorithms do not produce feasible iterates but rather approximate solutions with primal optimality and dual feasibility guarantees. For specific instances of SDPs, the theoretical complexity achieved by our method matches or even improves the state-of-the-art iteration complexity results reported in the literature. For packing and covering types of SDPs, the state-of-the-art upper complexity bound reported in [14] is worse than ours since their algorithm has complexity O~(C1ε2.5+C2ε2)\tilde{O}(C_{1}\varepsilon^{-2.5}+C_{2}\varepsilon^{-2}), where C1,C2C_{1},C_{2} are constants depending on the problem data111The notation O~\tilde{O} hides polylogarithmic factors. For SDPs with linear equality constraints and spectrahedron constraints, the primal-dual method CGAL [41] produces approximately optimal and approximately feasible points in O(ε2)O(\varepsilon^{-2}) iterations. Our method is purely primal, generating feasible points anytime. At the same time, our method shares the same theoretical iteration complexity as CGAL and can deal with additional conic constraints embodied by the membership restriction 𝒫(x)𝖪\mathcal{P}(x)\in\mathsf{K} without use of projection. We expect that this added feature will allow us to apply our method to handle challenging scientific computing questions connected to the quantum information theory, such as bounding the relative entropy of entanglement of quantum states (cf. Example 1.4 and [44, 15, 16]).

Organization of the paper

This paper is organized as follows: Section 2 introduces the notation and terminology we use throughout this paper. Section 3 presents the basic algorithm under consideration in this paper. In section 4 we describe an inexact version of our basic homotopy method. Section 5 contains a detailed complexity analysis of our scheme when an exact LMO is available. The corresponding statements under our inexact LMO are performed in Appendix A. Section 6 reports the performance of our method in solving relevant examples of SDPs and gives some first comparisons with existing approaches. In particular, numerical results on the mixing time problem of a finite Markov Chain, and synthetic examples aimed at testing the robustness to scaling are reported there. Appendix B contains further results on the numerical experiments.

2 Notation and Preliminaries

Let 𝖤\mathsf{E} be a finite-dimensional Euclidean vector space, 𝖤\mathsf{E}^{\ast} its dual space, which is formed by all linear functions on 𝖤\mathsf{E}. The value of function s𝖤s\in\mathsf{E}^{\ast} at x𝖤x\in\mathsf{E} is denoted by s,x\langle s,x\rangle. Let B:𝖤𝖤B:\mathsf{E}\to\mathsf{E}^{\ast} be a positive definite self-adjoint operator. Define the primal and dual norms

hB=Bh,h1/2, and sB=s,B1s1/2\lVert h\rVert_{B}=\langle Bh,h\rangle^{1/2},\text{ and }\lVert s\rVert_{B}^{\ast}=\langle s,B^{-1}s\rangle^{1/2}

for h𝖤h\in\mathsf{E} and s𝖤s\in\mathsf{E}^{\ast}, respectively. If BB is the identity operator, we simplify notation to h\lVert h\rVert and s\lVert s\rVert_{\ast} respectively for the primal and the dual norm. For a linear operator A:𝖤𝖤A:\mathsf{E}\to\mathsf{E}^{\ast} we define the operator norm

A=maxh𝖤:h1Ah.\lVert A\rVert=\max_{h\in\mathsf{E}:\lVert h\rVert\leq 1}\lVert Ah\rVert_{\ast}.

For a differentiable function f(x)f(x) with dom(f)𝖤\operatorname{dom}(f)\subseteq\mathsf{E}, we denote by f(x)𝖤f^{\prime}(x)\in\mathsf{E}^{\ast} its gradient, by f′′(x):𝖤𝖤f^{\prime\prime}(x):\mathsf{E}\to\mathsf{E}^{\ast} its Hessian, and by f′′′(x)f^{\prime\prime\prime}(x) its third derivative. Accordingly, given directions h1,h2,h3𝖤h_{1},h_{2},h_{3}\in\mathsf{E} the first, second, and third directional derivatives of ff are denoted respectively as f(x)[h1]f^{\prime}(x)[h_{1}],f′′(x)[h1,h2]f^{\prime\prime}(x)[h_{1},h_{2}],f′′′(x)[h1,h2,h3]f^{\prime\prime\prime}(x)[h_{1},h_{2},h_{3}].

2.1 Self-Concordant Functions

Let 𝖰\mathsf{Q} be an open and convex subset of 𝖤\mathsf{E}. A function f:𝖤(,]f:\mathsf{E}\to(-\infty,\infty] is self-concordant (SC) on 𝖰\mathsf{Q} if f𝐂3(𝖰)f\in{\mathbf{C}}^{3}(\mathsf{Q}) and for all x𝖰,h𝖤x\in\mathsf{Q},h\in\mathsf{E}, we have |f′′′(x)[h,h,h]|2(f′′(x)[h,h])3/2.\lvert f^{\prime\prime\prime}(x)[h,h,h]\rvert\leq 2(f^{\prime\prime}(x)[h,h])^{3/2}. In case where cl(𝖰)\operatorname{cl}(\mathsf{Q}) is a closed, convex, and pointed (i.e., contains no line) cone, we have more structure. We call ff a ν\nu-canonical barrier for 𝖰\mathsf{Q}, denoted by fν(𝖰)f\in\mathcal{B}_{\nu}(\mathsf{Q}), if it is SC and

(x,t)𝖰×(0,):f(tx)=f(x)νlog(t).\forall(x,t)\in\mathsf{Q}\times(0,\infty):\quad f(tx)=f(x)-\nu\log(t).

From [30, Prop. 2.3.4], the following properties are satisfied by a ν\nu-canonical barrier for each x𝖰,h𝖤,t>0x\in\mathsf{Q},h\in\mathsf{E},t>0:

f(tx)=t1f(x),\displaystyle f^{\prime}(tx)=t^{-1}f^{\prime}(x), (2.1)
f(x)[h]=f′′(x)[x,h],\displaystyle f^{\prime}(x)[h]=-f^{\prime\prime}(x)[x,h], (2.2)
f(x)[x]=ν,\displaystyle f^{\prime}(x)[x]=-\nu, (2.3)
f′′(x)[x,x]=ν.\displaystyle f^{\prime\prime}(x)[x,x]=\nu. (2.4)

We define the local norm hf′′(x):=(f′′(x)[h,h])1/2\lVert h\rVert_{f^{\prime\prime}(x)}:=(f^{\prime\prime}(x)[h,h])^{1/2} for all xdom(f)x\in\operatorname{dom}(f) and h𝖤h\in\mathsf{E}. SC functions are in general not Lipschitz smooth. Still, we have access to a version of a descent lemma of the following form [29, Thm. 5.1.9]:

f(x+h)f(x)+f(x)[h]+ω(hf′′(x))f(x+h)\leq f(x)+f^{\prime}(x)[h]+\omega_{\ast}(\lVert h\rVert_{f^{\prime\prime}(x)}) (2.5)

for all xdom(f)x\in\operatorname{dom}(f) and h𝖤h\in\mathsf{E} such that hf′′(x)<1\lVert h\rVert_{f^{\prime\prime}(x)}<1, where ω(t)=tlog(1t)\omega_{\ast}(t)=-t-\log(1-t). It also holds true that [29, Thm. 5.1.8]

f(x+h)f(x)+f(x)[h]+ω(hf′′(x))f(x+h)\geq f(x)+f^{\prime}(x)[h]+\omega(\lVert h\rVert_{f^{\prime\prime}(x)}) (2.6)

for all xdom(f)x\in\operatorname{dom}(f) and h𝖤h\in\mathsf{E} such that x+hdom(f)x+h\in\operatorname{dom}(f), where ω(t)=tlog(1+t)\omega(t)=t-\log(1+t) for t0t\geq 0.

A classical and useful bound is [29, Lemma 5.1.5]:

t22tω(t)t22(1t)t[0,1).\frac{t^{2}}{2-t}\leq\omega_{\ast}(t)\leq\frac{t^{2}}{2(1-t)}\quad\forall t\in[0,1). (2.7)

2.2 The Optimization Problem

The following assumptions are made for the rest of this paper.

Assumption 1.

The function g:𝖤(,]g:\mathsf{E}\to(-\infty,\infty] is closed and convex and continuous on dom(g)\operatorname{dom}(g). The quantity Ωg:=maxx,ydom(g)𝖷|g(x)g(y)|\Omega_{g}:=\max_{x,y\in\operatorname{dom}(g)\cap\mathsf{X}}\lvert g(x)-g(y)\rvert is finite.

Assumption 2.

𝖪𝖧\mathsf{K}\subset\mathsf{H} is a closed convex pointed cone with int(𝖪)\operatorname{int}(\mathsf{K})\neq\varnothing, admitting a ν\nu-canonical barrier fν(𝖪)f\in\mathcal{B}_{\nu}(\mathsf{K}).

The next assumption transports the barrier setup from the codomain 𝖪\mathsf{K} to the domain 𝒫1(𝖪)\mathcal{P}^{-1}(\mathsf{K}). This is a common operation in the framework of the "barrier calculus" developed in [30, Section 5.1].

Assumption 3.

The map 𝒫:𝖤𝖧\mathcal{P}:\mathsf{E}\to\mathsf{H} is linear, and F(x):=f(𝒫(x))F(x):=f(\mathcal{P}(x)) is a ν\nu-canonical barrier on the cone 𝒫1(𝖪)\mathcal{P}^{-1}(\mathsf{K}), i.e. Fν(𝒫1(𝖪))F\in\mathcal{B}_{\nu}(\mathcal{P}^{-1}(\mathsf{K})).

Note that dom(F)=int𝒫1(𝖪)\operatorname{dom}(F)=\operatorname{int}\mathcal{P}^{-1}(\mathsf{K}).

Example 2.1.

Considering the normalised covering problem presented in Example 1.2. [14] solve this problem via a logarithmic potential function method [30], involving the logarithmically homogeneous barrier f(𝐗)=logdet(𝐗)f(\mathbf{X})=\log\det(\mathbf{X}) for 𝐗𝕊+n\mathbf{X}\in\mathbb{S}^{n}_{+}. This is a typical choice in Newton-type methods to impose the semi-definiteness constraint. However, in our projection-free environment, we use the power of the linear minimisation oracle to obtain search directions which leaves the cone of positive semidefinite matrices invariant. Instead, we employ barrier functions to incorporate the additional linear constraints in (P). Hence, we set F(x)=f(𝒫(x))=logdet(i=1mxi𝐀i𝐈)F(x)=f(\mathcal{P}(x))=\log\det(\sum_{i=1}^{m}x_{i}\mathbf{A}_{i}-\mathbf{I}) to absorb the constraint 𝒫(x)𝕊+n\mathcal{P}(x)\in\mathbb{S}^{n}_{+}. In particular, this frees us from matrix inversions, and related computationally intensive steps coming with Newton and interior-point methods.

Assumption 4.

𝖷\mathsf{X} is a nonempty compact convex set in 𝖤\mathsf{E}, and 𝖢:=𝖷dom(F)dom(g)\mathsf{C}:=\mathsf{X}\cap\operatorname{dom}(F)\cap\operatorname{dom}(g)\neq\varnothing.

Let Opt:=min{g(x)|x𝖷,𝒫(x)𝖪}\operatorname{Opt}:=\min\{g(x)|x\in\mathsf{X},\mathcal{P}(x)\in\mathsf{K}\}. Thanks to assumptions 2 and 4, Opt\operatorname{Opt} is attained. Our goal is to find an ε\varepsilon-solution of problem (P), defined as follows.

Definition 2.1.

Given a tolerance ε>0\varepsilon>0, we say that zεz^{\ast}_{\varepsilon} is an ε\varepsilon-solution for (P) if zεz^{\ast}_{\varepsilon} is a feasible solution of problem (P) with objective at most ε\varepsilon higher than Opt\operatorname{Opt}, that is,

zε𝒫1(𝖪)𝖷 and g(zε)Optε.z^{\ast}_{\varepsilon}\in\mathcal{P}^{-1}(\mathsf{K})\cap\mathsf{X}\text{ and }g(z^{\ast}_{\varepsilon})-\operatorname{Opt}\leq\varepsilon.

Given t>0t>0, define

Opt(t):=minx𝖷{Vt(x)=1tF(x)+g(x)}, and z(t){x𝖷|Vt(x)=Opt(t)}.\operatorname{Opt}(t):=\min_{x\in\mathsf{X}}\{V_{t}(x)=\frac{1}{t}F(x)+g(x)\},\text{ and }z^{\ast}(t)\in\{x\in\mathsf{X}|V_{t}(x)=\operatorname{Opt}(t)\}. (2.8)

The following Lemma shows that the path tz(t)t\mapsto z^{\ast}(t) traces a trajectory in the interior of the feasible set which can be used to approximate a solution of the original problem (P), provided the penalty parameter tt is chosen large enough.

Lemma 2.2.

For all t>0t>0, it holds that z(t)𝖢z^{\ast}(t)\in\mathsf{C}. In particular,

g(z(t))Optνt.g(z^{\ast}(t))-\operatorname{Opt}\leq\frac{\nu}{t}. (2.9)
Proof.

Since Fν(𝒫1(𝖪))F\in\mathcal{B}_{\nu}(\mathcal{P}^{-1}(\mathsf{K})), it follows immediately that z(t)dom(F)𝖷z^{\ast}(t)\in\operatorname{dom}(F)\cap\mathsf{X}. By Fermat’s principle we have

01tF(z(t))+g(z(t))+𝖭𝖢𝖷(z(t)).0\in\frac{1}{t}F^{\prime}(z^{\ast}(t))+\partial g(z^{\ast}(t))+\operatorname{\mathsf{NC}}_{\mathsf{X}}(z^{\ast}(t)).

Convexity and [29, Thm. 5.3.7] implies that for all zdom(F)𝖷z\in\operatorname{dom}(F)\cap\mathsf{X}, we have

g(z)g(z(t))1tF(z(t))[zz(t)]g(z(t))νt.g(z)\geq g(z^{\ast}(t))-\frac{1}{t}F^{\prime}(z^{\ast}(t))[z-z^{\ast}(t)]\geq g(z^{\ast}(t))-\frac{\nu}{t}.

Let zz^{\ast} be a feasible point point satisfying g(z)=Optg(z^{\ast})=\operatorname{Opt}. Choosing a sequence {zj}jdom(F)𝖷\{z^{j}\}_{j\in\mathbb{N}}\subset\operatorname{dom}(F)\cap\mathsf{X} with zjzz^{j}\to z^{\ast}, the continuity on dom(g)\operatorname{dom}(g) gives

Optg(z(t))νt.\operatorname{Opt}\geq g(z^{\ast}(t))-\frac{\nu}{t}.

\blacksquare

3 Algorithm

In this section we first describe a new CndG method for solving general conic constrained convex optimization problems of the form (2.8), for a fixed t>0t>0. This procedure serves as the inner loop of our homotopy method. The path-following strategy in the outer loop is then explained in Section 3.2.

3.1 The Proposed CndG Solver

The computational efficiency of our approach relies on the availability of a suitably defined minimisation oracle:

Assumption 5.

For any c𝖤c\in\mathsf{E}^{\ast}, the auxiliary problem

g(c):=argmins𝖷{c,s+g(s)}\mathcal{L}_{g}(c):=\operatorname*{argmin}_{s\in\mathsf{X}}\{\langle c,s\rangle+g(s)\} (3.1)

is easily solvable.

We can compute the gradient and Hessian of FF as

F(x)=f(𝒫(x))𝒫(x),F′′(x)[s,t]=f′′(𝒫(x))[𝒫(s),𝒫(t)]xint𝒫1(𝖪).F^{\prime}(x)=f^{\prime}(\mathcal{P}(x))\mathcal{P}^{\prime}(x),\quad F^{\prime\prime}(x)[s,t]=f^{\prime\prime}(\mathcal{P}(x))[\mathcal{P}(s),\mathcal{P}(t)]\quad\forall x\in\operatorname{int}\mathcal{P}^{-1}(\mathsf{K}).

This means that we obtain a local norm on 𝖧\mathsf{H} given by

wx:=F′′(x)[w,w]1/2(x,w)int𝒫1(𝖪)×𝖧.\lVert w\rVert_{x}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}:=}F^{\prime\prime}(x)[w,w]^{1/2}\qquad\forall(x,w)\in\operatorname{int}\mathcal{P}^{-1}(\mathsf{K})\times\mathsf{H}.

Note that in order to evaluate the local norm we do not need to compute the full Hessian F′′(x)F^{\prime\prime}(x). It only requires a directional derivative, which is potentially easy to do numerically.

A Linear Mnimization Oracle (LMO) is defined as an oracle producing the vector field

st(x)g(t1F(x)).s_{t}(x)\in\mathcal{L}_{g}(t^{-1}F^{\prime}(x)). (3.2)

Note that our analysis does not rely on a specific tie-breaking rule, so any member of the set g(t1F(x))\mathcal{L}_{g}(t^{-1}F^{\prime}(x)) will be acceptable.

To measure solution accuracy and overall algorithmic progress, we introduce two merit functions:

𝖦𝖺𝗉t(x)\displaystyle\operatorname{\mathsf{Gap}}_{t}(x) :=t1F(x)[xst(x)]+g(x)g(st(x)), and\displaystyle:=t^{-1}F^{\prime}(x)[x-s_{t}(x)]+g(x)-g(s_{t}(x)),\text{ and } (3.3)
Δt(x)\displaystyle\Delta_{t}(x) :=Vt(x)Vt(z(t)).\displaystyle:=V_{t}(x)-V_{t}(z^{\ast}(t)). (3.4)

Note that 𝖦𝖺𝗉t(x)0\operatorname{\mathsf{Gap}}_{t}(x)\geq 0 and Δt(x)0\Delta_{t}(x)\geq 0 for all xdom(F)x\in\operatorname{dom}(F). Moreover, convexity together with the definition of the point z(t)z^{\ast}(t), gives

0\displaystyle 0 Δt(x)=t1[F(x)F(z(t))]+g(x)g(z(t))\displaystyle\leq\Delta_{t}(x)=t^{-1}[F(x)-F(z^{\ast}(t))]+g(x)-g(z^{\ast}(t))
t1F(x)[xz(t)]+g(x)g(z(t))\displaystyle\leq t^{-1}F^{\prime}(x)[x-z^{\ast}(t)]+g(x)-g(z^{\ast}(t))
t1F(x)[xst(x)]+g(x)g(st(x))=𝖦𝖺𝗉t(x).\displaystyle\leq t^{-1}F^{\prime}(x)[x-s_{t}(x)]+g(x)-g(s_{t}(x))=\operatorname{\mathsf{Gap}}_{t}(x).

Hence,

𝖦𝖺𝗉t(x)Δt(x)xdom(F)dom(g).\operatorname{\mathsf{Gap}}_{t}(x)\geq\Delta_{t}(x)\qquad\forall x\in\operatorname{dom}(F)\cap\operatorname{dom}(g). (3.5)

Define

𝚎t(x):=st(x)xx.\mathtt{e}_{t}(x):=\lVert s_{t}(x)-x\rVert_{x}. (3.6)

Then, for α[0,min{1,1/𝚎t(x)})\alpha\in[0,\min\{1,1/\mathtt{e}_{t}(x)\}), we get from (2.5)

F(x+α(st(x)x))F(x)+αF(x)[st(x)x]+ω(α𝚎t(x)).F(x+\alpha(s_{t}(x)-x))\leq F(x)+\alpha F^{\prime}(x)[s_{t}(x)-x]+\omega_{\ast}(\alpha\mathtt{e}_{t}(x)).

Together with the convexity of gg, this implies

Vt(x+α(st(x)x))Vt(x)α𝖦𝖺𝗉t(x)+t1ω(α𝚎t(x))=Vt(x)ηt,x(α),V_{t}(x+\alpha(s_{t}(x)-x))\leq V_{t}(x)-\alpha\operatorname{\mathsf{Gap}}_{t}(x)+t^{-1}\omega_{\ast}(\alpha\mathtt{e}_{t}(x))={\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}V_{t}(x)-\eta_{t,x}(\alpha)}, (3.7)

where

ηt,x(α):=α𝖦𝖺𝗉t(x)t1ω(α𝚎t(x))=α𝖦𝖺𝗉t(x)+t1(α𝚎t(x)+log(1α𝚎t(x))).\eta_{t,x}(\alpha):=\alpha\operatorname{\mathsf{Gap}}_{t}(x)-t^{-1}\omega_{\ast}(\alpha\mathtt{e}_{t}(x))=\alpha\operatorname{\mathsf{Gap}}_{t}(x)+t^{-1}\left(\alpha\mathtt{e}_{t}(x)+\log(1-\alpha\mathtt{e}_{t}(x))\right).

Optimising this expression w.r.t α[0,1]\alpha\in[0,1], we obtain the analytic step-size policy

αt(x):=min{1,t𝖦𝖺𝗉t(x)𝚎t(x)(𝚎t(x)+t𝖦𝖺𝗉t(x))}[0,1/𝚎t(x)).\alpha_{t}(x):=\min\left\{1,\frac{t\operatorname{\mathsf{Gap}}_{t}(x)}{\mathtt{e}_{t}(x)(\mathtt{e}_{t}(x)+t\operatorname{\mathsf{Gap}}_{t}(x))}\right\}\in[0,1/\mathtt{e}_{t}(x)). (3.8)

Equipped with this step strategy, procedure CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t), described in Algorithm 1, constructs a sequence {xtk}k0\{x^{k}_{t}\}_{k\geq 0} which produces an approximately-optimal solution in terms of the merit function 𝖦𝖺𝗉t()\operatorname{\mathsf{Gap}}_{t}(\cdot) and the potential function gap Δt\Delta_{t}. Specifically, our main results on the performance of Algorithm 1 are the following iteration complexity estimates, which are proven in Section 5.

Algorithm 1 CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t)
Input: (x0,t)𝖢×(0,)(x^{0},t)\in\mathsf{C}\times(0,\infty) initial state; ε>0\varepsilon>0 accuracy level
for k=0,1,k=0,1,\ldots do
     if 𝖦𝖺𝗉t(xk)>ε\operatorname{\mathsf{Gap}}_{t}(x^{k})>\varepsilon then
         Obtain sk=st(xk)s^{k}=s_{t}(x^{k}) defined in (3.2).
         αk=αt(xk)\alpha_{k}=\alpha_{t}(x^{k}) defined in (3.8).
         Set xk+1=xk+αk(skxk)x^{k+1}=x^{k}+\alpha_{k}(s^{k}-x^{k}).
     else
         STOP. Return iteration index kk and last iterate xkx^{k}.
     end if
end for
Proposition 3.1.

Given η,t>0\eta,t>0, Algorithm CG(xt0,η,t)\operatorname{\texttt{CG}}(x^{0}_{t},\eta,t) requires at most

R(xt0,η,t):=5.3(ν+tΔt(xt0)+tΩg)log(10.6tΔt(xt0))+24tη(ν+tΩg)2,R(x^{0}_{t},\eta,t):=\lceil 5.3(\nu+t\Delta_{t}(x^{0}_{t})+t\Omega_{g})\log(10.6t\Delta_{t}(x^{0}_{t}))\rceil+\lceil\frac{24}{t\eta}(\nu+t\Omega_{g})^{2}\rceil, (3.9)

iterations in order to reach a point xtkx^{k}_{t} satisfying 𝖦𝖺𝗉t(xtk)η\operatorname{\mathsf{Gap}}_{t}(x^{k}_{t})\leq\eta for the first time.

Proposition 3.2.

Given η,t>0\eta,t>0. Algorithm CG(xt0,η,t)\operatorname{\texttt{CG}}(x^{0}_{t},\eta,t) requires at most

N(xt0,η,t):=5.3(ν+tΔt(xt0)+tΩg)log(10.6tΔt(xt0))+12(ν+tΩg)2(1tη1tΔt(xt0))+N(x^{0}_{t},\eta,t):=\lceil 5.3(\nu+t\Delta_{t}(x^{0}_{t})+t\Omega_{g})\log(10.6t\Delta_{t}(x^{0}_{t}))\rceil+\lceil 12(\nu+t\Omega_{g})^{2}\left(\frac{1}{t\eta}-\frac{1}{t\Delta_{t}(x^{0}_{t})}\right)_{+}\rceil (3.10)

iterations, in order to reach a point xtkx^{k}_{t} satisfying Δt(xtk)η\Delta_{t}(x^{k}_{t})\leq\eta for the first time.

3.2 Updating the Homotopy Parameter

Our analysis so far focuses on minimisation of the potential function Vt(x)V_{t}(x) for a fixed t>0t>0. However, in order to solve the initial problem (P), one must trace the sequence of approximate solutions {z(t)}t0\{z^{\ast}(t)\}_{t\geq 0} as tt\uparrow\infty. The construction of such an increasing sequence of homotopy parameters is the purpose of this section.

Algorithm 2 Method Homotopy(x0,ε,f)\texttt{Homotopy}(x^{0},\varepsilon,f)
Input: x0𝖢x^{0}\in\mathsf{C}, fν(𝖪)f\in\mathcal{B}_{\nu}(\mathsf{K}), t0,η0>0t_{0},\eta_{0}>0.
Parameters: ε,σ(0,1)\varepsilon,\sigma\in(0,1).
Initialize: xt00=x0x^{0}_{t_{0}}=x^{0}, I=log(2η0/ε)log(1/σ)I=\lceil\frac{\log(2\eta_{0}/\varepsilon)}{\log(1/\sigma)}\rceil.
for i=0,1,,I1i=0,1,\ldots,I-1 do
     Set x^i\hat{x}_{i} the output of CG(xti0,ηi,ti)\operatorname{\texttt{CG}}(x^{0}_{t_{i}},\eta_{i},t_{i}).
     Update ti+1=ti/σ,ηi+1=σηit_{i+1}=t_{i}/\sigma,\eta_{i+1}=\sigma\eta_{i}, xti+10=x^ix^{0}_{t_{i+1}}=\hat{x}_{i}.
end for
Set z^=x^I\hat{z}=\hat{x}_{I} the output of CG(xtI0,ηI,tI)\operatorname{\texttt{CG}}(x^{0}_{t_{I}},\eta_{I},t_{I}).

Let {ηi}i0,{ti}i0\{\eta_{i}\}_{i\geq 0},\{t_{i}\}_{i\geq 0} be a sequence of approximation errors and homotopy parameters. For each run ii, we activate procedure CG(xti0,ηi,ti)\operatorname{\texttt{CG}}(x^{0}_{t_{i}},\eta_{i},t_{i}) with the given configuration (ηi,ti)(\eta_{i},t_{i}). For i=0i=0, we assume to have an admissible initial point xt00=x0x^{0}_{t_{0}}=x^{0} available. For i1i\geq 1, we restart Algorithm 1 using a kind of warm-start strategy by choosing xti0=xti1Ri1x^{0}_{t_{i}}=x^{R_{i-1}}_{t_{i-1}}, where Ri1R(xti10,ηi1,ti1)R_{i-1}\leq R(x^{0}_{t_{i-1}},\eta_{i-1},t_{i-1}) denotes the first iterate kk of Algorithm CG(xti10,ηi1,ti1)\operatorname{\texttt{CG}}(x^{0}_{t_{i-1}},\eta_{i-1},t_{i-1}) satisfying 𝖦𝖺𝗉ti1(xti1k)ηi1\operatorname{\mathsf{Gap}}_{t_{i-1}}(x^{k}_{t_{i-1}})\leq\eta_{i-1}. Note that Ri1R_{i-1} is upper bounded by the constant R(xti10,ηi1,ti1)R(x^{0}_{t_{i-1}},\eta_{i-1},t_{i-1}) defined in Proposition 3.1. After the ii-th restart, let us call the obtained iterate x^i:=xtiRi\hat{x}_{i}:=x^{R_{i}}_{t_{i}}. In this way we obtain a sequence {x^i}i0\{\hat{x}_{i}\}_{i\geq 0}, consisting of candidates for approximately following the central path, as they are ηi\eta_{i}-close in terms of the gap 𝖦𝖺𝗉ti\operatorname{\mathsf{Gap}}_{t_{i}} (and, hence, in terms of the function gap) to the stage tit_{i}’s optimal point z(ti)z^{\ast}(t_{i}). We update the parameters (ηi,ti)(\eta_{i},t_{i}) as follows:

  • The sequence of homotopy parameters is determined as ti=t0σit_{i}=t_{0}\sigma^{-i} for σ(0,1)\sigma\in(0,1) until the last round of Homotopy(x0,ε,f)\texttt{Homotopy}(x^{0},\varepsilon,f) is reached.

  • The sequence of accuracies requested in the algorithm CG(xti0,ηi,ti)\operatorname{\texttt{CG}}(x^{0}_{t_{i}},\eta_{i},t_{i}) is updated by ηi=η0σi\eta_{i}=\eta_{0}\sigma^{i}.

  • The Algorithm stops after II(σ,η0,ε)=log(2η0/ε)log(1/σ)I\equiv I(\sigma,\eta_{0},\varepsilon)=\lceil\frac{\log(2\eta_{0}/\varepsilon)}{\log(1/\sigma)}\rceil updates of the accuracy and homotopy parameters and yields an ε\varepsilon-approximate solution of problem (P).

Our updating strategy for the parameters ensures ”detailed balance” tiηi=t0η0t_{i}\eta_{i}=t_{0}\eta_{0}. This equilibrating choice between the increasing homotopy parameter and the decreasing accuracy parameter is sensible, because the iteration complexity of the CndG solver is inversely proportional to tiηit_{i}\eta_{i} (cf. Proposition 3.1). Making the judicious choice η0t0=2ν\eta_{0}t_{0}=2\nu, yields a compact assessment of the total complexity of method Homotopy(x0,ε,f)\texttt{Homotopy}(x^{0},\varepsilon,f).

Theorem 3.3.

Choose t0=νΩgt_{0}=\frac{\nu}{\Omega_{g}} and η0=2Ωg\eta_{0}=2\Omega_{g}. The total iteration complexity of method Homotopy(x0,ε,f)\texttt{Homotopy}(x^{0},\varepsilon,f) to find an ε\varepsilon-solution of (P) is

Compl(x0,ε,f)=𝒪~(384Ωg2νε2(1σ2)+Ωglog(21.2ν(1+(2/ε)Ωg))21.2νε2σ1σ),\texttt{Compl}(x^{0},\varepsilon,f)=\tilde{\mathcal{O}}\left(\frac{384\Omega^{2}_{g}\nu}{\varepsilon^{2}(1-\sigma^{2})}+\Omega_{g}\log(21.2\nu(1+(2/\varepsilon)\Omega_{g}))\frac{21.2\nu}{\varepsilon}\frac{2-\sigma}{1-\sigma}\right), (3.11)

where 𝒪~\tilde{\mathcal{O}} hides polylogarithmic factors in ε1\varepsilon^{-1}.

Remark 3.1.

The theorem leaves open the choice of the parameter σ\sigma. We treat σ\sigma as a hyperparameter that should be calibrated by the user. The proof of Theorem 3.3 can be found in Section 5.

4 Modifications and Extensions

4.1 Line Search

The step size policy employed in Algorithm 1 is inversely proportional to the local norm 𝚎t(x)\mathtt{e}_{t}(x). In particular, its derivation is based on the a-priori restriction that we force our iterates to stay within a trust-region defined by the Dikin ellipsoid. This restriction may force the method to take very small steps, and as a result display bad performance in practice. A simple remedy would be to employ a line search. Given (x,t)dom(F)×(0,+)(x,t)\in\operatorname{dom}(F)\times(0,+\infty), let

γt(x)=argminγ[0,1] s.t. x+γ(st(x)x)dom(F)Vt(x+γ(st(x)x)).\gamma_{t}(x)=\operatorname*{argmin}_{\gamma\in[0,1]\text{ s.t. }x+\gamma(s_{t}(x)-x)\in\operatorname{dom}(F)}V_{t}(x+\gamma(s_{t}(x)-x)). (4.1)
Algorithm 3 CndG with line search: LCG(x0,ε,t)\operatorname{\texttt{LCG}}(x^{0},\varepsilon,t)
Input: (x0,t)𝖢×(0,)(x^{0},t)\in\mathsf{C}\times(0,\infty) initial state; ε>0\varepsilon>0 accuracy level
for k=0,1,k=0,1,\ldots do
     if 𝖦𝖺𝗉t(xk)>ε\operatorname{\mathsf{Gap}}_{t}(x^{k})>\varepsilon then
         Same as Algorithm 1, but with step size strategy (4.1).
     else
         STOP. Return iteration index kk and last iterate xkx^{k}.
     end if
end for

Thanks to the barrier structure of the potential function VtV_{t}, some useful consequences can be drawn from the definition of γt(x)\gamma_{t}(x). First, γt(x){γ0|x+γ(st(x)x)dom(F)}\gamma_{t}(x)\in\{\gamma\geq 0|x+\gamma(s_{t}(x)-x)\in\operatorname{dom}(F)\}. This implies x+γt(x)(st(x)x)𝖷dom(F)dom(g)x+\gamma_{t}(x)(s_{t}(x)-x)\in\mathsf{X}\cap\operatorname{dom}(F)\cap\operatorname{dom}(g). Second, since αt(x)\alpha_{t}(x) is also contained in the latter set, we have

Vt(x+γt(x)(st(x)x))Vt(x+αt(x)(st(x)x))(x,t)dom(F)×(0,+).V_{t}(x+\gamma_{t}(x)(s_{t}(x)-x))\leq V_{t}(x+\alpha_{t}(x)(s_{t}(x)-x))\qquad\forall(x,t)\in\operatorname{dom}(F)\times(0,+\infty).

Via a comparison principle, this allows us to deduce the analysis of the sequence produced by LCG(x0,ε,t)\operatorname{\texttt{LCG}}(x^{0},\varepsilon,t) from the analysis of the sequence induced by CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t). Indeed, if {ξtk}k0\{\xi^{k}_{t}\}_{k\geq 0} is the sequence constructed by the line search procedure LCG(x0,η,t)\operatorname{\texttt{LCG}}(x^{0},\eta,t), then we have Vt(ξtk+1)Vt(ξtk+αt(ξtk)(st(ξtk)xtk))V_{t}(\xi^{k+1}_{t})\leq V_{t}(\xi_{t}^{k}+\alpha_{t}(\xi^{k}_{t})(s_{t}(\xi^{k}_{t})-x^{k}_{t})) for all kk. Hence, we can perform the complexity analysis on the majorising function as in the analysis of procedure CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t). Consequently, all the complexity-related estimates for the method CG(x0,η,t)\operatorname{\texttt{CG}}(x^{0},\eta,t) apply verbatim to the sequence {ξtk}k0\{\xi^{k}_{t}\}_{k\geq 0} induced by the method LCG(x0,η,t)\operatorname{\texttt{LCG}}(x^{0},\eta,t).

4.2 Algorithm with inexact LMO

A key assumption in our approach so far is that we can obtain an exact answer from the LMO (3.2). In this section we consider the case when our search direction subproblem is solved only with some pre-defined accuracy, a setting that is of particular interest in optimization problems defined over matrix variables. As an illustration, let us consider instances of (P) where 𝖷\mathsf{X} is the spectrahedron of symmetric matrices, namely 𝖷={𝐗n×n|𝐗0,tr(𝐗)=1}\mathsf{X}=\{\mathbf{X}\in\mathbb{R}^{n\times n}|\mathbf{X}\succeq 0,\operatorname{tr}(\mathbf{X})=1\}. For these instances solving the subproblem (3.1) with g=tr(𝐂𝐗)g=\operatorname{tr}({\mathbf{C}}\mathbf{X}) corresponds to computing the leading eigenvector of a symmetric matrix, which when nn is very large is typically computed inexactly using iterative methods. We therefore relax our definition of the LMO by allowing for an Inexact Linear Minimization Oracle (ILMO), featuring additive and multiplicative errors. To define our ILMO, let

Γt(x,s):=1tF(x)[xs]+g(x)g(s)t>0,(x,s)𝖢×𝖢.\Gamma_{t}(x,s):=\frac{1}{t}F^{\prime}(x)[x-s]+g(x)-g(s)\quad t>0,(x,s)\in\mathsf{C}\times\mathsf{C}.

By definition of the point st(x)s_{t}(x) in (3.2) and of the gap function 𝖦𝖺𝗉t(x)\operatorname{\mathsf{Gap}}_{t}(x), we have Γt(x,s)𝖦𝖺𝗉t(x)\Gamma_{t}(x,s)\leq\operatorname{\mathsf{Gap}}_{t}(x) for all s𝖢s\in\mathsf{C}.

Definition 4.1.

Given t>0,δ(0,1]t>0,\delta\in(0,1], and θ>0\theta>0, we call an oracle producing a point s~t(x)𝖢\tilde{s}_{t}(x)\in\mathsf{C} a (δ,θ)(\delta,\theta)-ILMO, such that

𝖦𝖺𝗉~t(x)δ(𝖦𝖺𝗉t(x)θ),where 𝖦𝖺𝗉~t(x)Γt(x,s~t(x)).\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)\geq\delta(\operatorname{\mathsf{Gap}}_{t}(x)-\theta),\text{where }\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)\equiv\Gamma_{t}(x,\tilde{s}_{t}(x)). (4.2)

We write s~t(x)g(δ,θ)(t1F(x))\tilde{s}_{t}(x)\in\mathcal{L}^{(\delta,\theta)}_{g}(t^{-1}\nabla F(x)) for any direction s~t(x)\tilde{s}_{t}(x) satisfying (4.2).

We note that a (0,0)(0,0)-ILMO returns a search point satisfying inclusion (3.2). We therefore refer to the case where we can design a (0,0)(0,0)-ILMO as the exact oracle case. Moreover, a (1,θ)(1,\theta)-ILMO is an oracle that only features additive errors. This relaxation for inexact computations has been considered in previous work, such as [10, 25, 18]. The combination of multiplicative error δ\delta and additive error θ\theta displayed in (4.2) is similar to the oracle constructed in [26]. The above definition can also be interpreted as follows. The answer st(x)s_{t}(x) of the exact LMO solves the maximization problem maxs𝖷{Γt(x,s)=t1F(x)[xs]+g(x)g(s)}\max_{s\in\mathsf{X}}\{\Gamma_{t}(x,s)=t^{-1}F^{\prime}(x)[x-s]+g(x)-g(s)\} and the optimal value in this problem is nonnegative. Thus, (4.2) can be interpreted as the requirement that the point s~t(x)\tilde{s}_{t}(x) solves the same problem up to the additive error θ\theta and the multiplicative error δ\delta (i.e. because it is itself the output of a convex optimisation routine). We emphasize that our analysis does not depend on the specific choice of the point s~t(x)\tilde{s}_{t}(x).

Equipped with these concepts, we can derive an analytic step-size policy as in the exact regime. Let x𝖢x\in\mathsf{C} be given, s~t(x)g(δ,θ)(t1F(x))\tilde{s}_{t}(x)\in\mathcal{L}^{(\delta,\theta)}_{g}(t^{-1}\nabla F(x)), and α(0,1)\alpha\in(0,1) small enough so that αs~t(x)xxα𝚎~t(x)(0,1)\alpha\lVert\tilde{s}_{t}(x)-x\rVert_{x}\equiv\alpha\tilde{\mathtt{e}}_{t}(x)\in(0,1). Just like in (3.7), we can then obtain the upper bound

Vt(x+α(s~t(x)x))Vt(x)α𝖦𝖺𝗉~t(x)+1tω(α𝚎~t(x)).V_{t}(x+\alpha(\tilde{s}_{t}(x)-x))\leq V_{t}(x)-\alpha\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)+\frac{1}{t}\omega_{\ast}(\alpha\tilde{\mathtt{e}}_{t}(x)).

Optimizing the upper bound with respect to α\alpha gives us the analytic step-size criterion

α~t(x):=min{1,t𝖦𝖺𝗉~t(x)𝚎~t(x)(𝚎~t(x)+t𝖦𝖺𝗉~t(x))}.\tilde{\alpha}_{t}(x):=\min\left\{1,\frac{t\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)}{\tilde{\mathtt{e}}_{t}(x)(\tilde{\mathtt{e}}_{t}(x)+t\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x))}\right\}. (4.3)

The (δ,θ)(\delta,\theta)-ILMO, and the step size policy (4.3) are enough to define a CndG-update with inexact computations that are executed by a function P(x,t,δ,θ)P(x,t,\delta,\theta), defined in Algorithm 4.

Obtain s~t(x)g(δ,θ)(t1F(x))\tilde{s}_{t}(x)\in\mathcal{L}^{(\delta,\theta)}_{g}(t^{-1}\nabla F(x));
Obtain α~t(x)\tilde{\alpha}_{t}(x) defined in (4.3);
Update xx+α~t(x)(s~t(x)x)x\leftarrow x+\tilde{\alpha}_{t}(x)(\tilde{s}_{t}(x)-x).
Algorithm 4 Function P(x,t,δ,θ)P(x,t,\delta,\theta)

To use this method within a bona fide algorithm, we would need to introduce some stopping criterion. Ideally, such a stopping criterion should aim at making the merit functions 𝖦𝖺𝗉t\operatorname{\mathsf{Gap}}_{t} or Δt\Delta_{t} small. However, our inexact oracle does not allow us to make direct inferences on the values of these merit functions. Hence, we need to come up with a stopping rule that uses only quantities that are observable and that are coupled with these merit functions. The natural optimality measure for the inexact oracle case is thus 𝖦𝖺𝗉~t(x)\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x). Indeed, by the very definition of our inexact oracle, if, for accuracy η>0\eta>0, we have arrived at an iterate xx for which 𝖦𝖺𝗉~t(x)δη\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)\leq\delta\eta, we know from (4.2) that 𝖦𝖺𝗉t(x)η+θ\operatorname{\mathsf{Gap}}_{t}(x)\leq\eta+\theta. Therefore, by making θ\theta a function of η\eta, we would attain essentially the same accuracy as the exact implementation requires when running Algorithm 1. This is the main idea leading to the development of Algorithm 5.

Algorithm 5 ICG(x,t,η,δ,θ)\texttt{ICG}(x,t,\eta,\delta,\theta)
Input: x𝖢x\in\mathsf{C}, δ(0,1],θ>0,t>0,η>0\delta\in(0,1],\theta>0,t>0,\eta>0.
for k=0,1,k=0,1,\ldots do
     if 𝖦𝖺𝗉~t(xk)>ηδ\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k})>\eta\delta then
         xk+1=P(xk,t,δ,θ)x^{k+1}=P(x^{k},t,\delta,\theta)
     else
         STOP. Return iteration index kk and last iterate xkx^{k}.
     end if
end for

There are two types of iteration complexity guarantees that we can derive for ICG(x,t,η,δ,θ)\texttt{ICG}(x,t,\eta,\delta,\theta). One is concerned with upper bounding the number of iterations after which our stopping criterion will be satisfied. The other is an upper bound on the number of iterations needed to make the potential function gap Δt(xk)\Delta_{t}(x^{k}) smaller than η+θ\eta+\theta. The derivation of these estimates is rather technical and follows closely the complexity analysis of the exact regime. We therefore only present the statements here and invite the reader to consult the Appendix A for a detailed proof.

Proposition 4.2.

Assume that η,θ,t>0\eta,\theta,t>0, δ(0,1]\delta\in(0,1], and x0𝖢x^{0}\in\mathsf{C}. Algorithm ICG(x0,t,η,δ,θ)\operatorname{\texttt{ICG}}(x^{0},t,\eta,\delta,\theta) requires at most

R~t(x0,η,δ,θ)\displaystyle\tilde{R}_{t}(x^{0},\eta,\delta,\theta) :=5.3(tδ(Δt(x0)θ)+ν+tΩg)δlog(min{10.6tΔt(x0)(110.6tθ)+,Δt(x0)η})\displaystyle:=\left\lceil\frac{5.3(t\delta(\Delta_{t}(x^{0})-\theta)+\nu+t\Omega_{g})}{\delta}\log\left(\min\left\{\frac{10.6t\Delta_{t}(x^{0})}{(1-10.6t\theta)_{+}},\frac{\Delta_{t}(x^{0})}{\eta}\right\}\right)\right\rceil
+12(ν+tΩg)2tδ2η(2+θη)+1\displaystyle+\left\lceil\frac{12(\nu+t\Omega_{g})^{2}}{t\delta^{2}\eta}\left(2+\frac{\theta}{\eta}\right)\right\rceil+1 (4.4)

iterations in order to reach a point xkx^{k} satisfying 𝖦𝖺𝗉~t(xk)ηδ\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k})\leq\eta\delta for the first time.

Proposition 4.3.

Assume that η,θ,t>0\eta,\theta,t>0, δ(0,1]\delta\in(0,1], and x0𝖢x^{0}\in\mathsf{C}. Algorithm ICG(x0,t,η,δ,θ)\operatorname{\texttt{ICG}}(x^{0},t,\eta,\delta,\theta) requires at most

N~(x0,η,δ,θ)\displaystyle\tilde{N}(x^{0},\eta,\delta,\theta) :=5.3(tδ(Δt(x0)θ)+ν+tΩg)δlog(min{10.6tΔt(x0)(110.6tθ)+,Δt(x0)η})\displaystyle:=\left\lceil\frac{5.3(t\delta(\Delta_{t}(x^{0})-\theta)+\nu+t\Omega_{g})}{\delta}\log\left(\min\left\{\frac{10.6t\Delta_{t}(x^{0})}{(1-10.6t\theta)_{+}},\frac{\Delta_{t}(x^{0})}{\eta}\right\}\right)\right\rceil
+12(ν+tΩg)2tδ2(1η1Δt(x0)θ)++1\displaystyle\hskip 16.00003pt+\left\lceil\frac{12(\nu+t\Omega_{g})^{2}}{t\delta^{2}}\left(\frac{1}{\eta}-\frac{1}{\Delta_{t}(x^{0})-\theta}\right)_{+}\right\rceil+1 (4.5)

iterations, in order to reach a point xkx^{k} satisfying Δt(xk)η+θ\Delta_{t}(x^{k})\leq\eta+\theta for the first time.

Updating the Homotopy Parameter in the Inexact Case

We are now in a position to describe the outer loop of our algorithm with ILMO. The construction is very similar to Algorithm 2.

Algorithm 6 Inexact Homotopy(x0,ε,f,δ)\texttt{Inexact Homotopy}(x^{0},\varepsilon,f,\delta)
Input: x0𝖢x^{0}\in\mathsf{C}, fν(𝖪)f\in\mathcal{B}_{\nu}(\mathsf{K}), δ(0,1]\delta\in(0,1].
Parameters: ε,t0,θ0=η0>0\varepsilon,t_{0},\theta_{0}=\eta_{0}>0, σ(0,1)\sigma\in(0,1).
Initialisation: xt00=x0x^{0}_{t_{0}}=x^{0}, I=log(3η0/ε)log(1/σ)I=\lceil\frac{\log(3\eta_{0}/\varepsilon)}{\log(1/\sigma)}\rceil.
for i=0,1,,I1i=0,1,\ldots,I-1 do
     Set x^i\hat{x}_{i} the output of ICG(xti0,ti,ηi,δ,θi)\operatorname{\texttt{ICG}}(x^{0}_{t_{i}},t_{i},\eta_{i},\delta,\theta_{i}).
     Update ti+1=ti/σ,ηi+1=σηi,θi+1=σθi+1t_{i+1}=t_{i}/\sigma,\eta_{i+1}=\sigma\eta_{i},\theta_{i+1}=\sigma\theta_{i+1}, xti+10=x^ix^{0}_{t_{i+1}}=\hat{x}_{i}.
end for
z^=x^I\hat{z}=\hat{x}_{I} the output of ICG(xtI0,tI,ηI,δ,θI)\operatorname{\texttt{ICG}}(x^{0}_{t_{I}},t_{I},\eta_{I},\delta,\theta_{I}).

As before, our aim is to reach an ε\varepsilon-solution (Definition 2.1). Let {ti}i0,{ηi}i0,{θi}i0\{t_{i}\}_{i\geq 0},\{\eta_{i}\}_{i\geq 0},\{\theta_{i}\}_{i\geq 0} be sequences of homotopy parameters, accuracies in the inner loop, and ILMO inexactnesses. For each run ii, we activate procedure ICG(xti0,ti,ηi,δ,θi)\operatorname{\texttt{ICG}}(x^{0}_{t_{i}},t_{i},\eta_{i},\delta,\theta_{i}) with the given configuration (ti,ηi,θi)(t_{i},\eta_{i},\theta_{i}). For i=0i=0, we assume to have an admissible initial point xt00=x0x^{0}_{t_{0}}=x^{0} available. For i1i\geq 1, we restart Algorithm 5 using a kind of warm-start strategy by choosing xti0=xti1Ri1x^{0}_{t_{i}}=x^{R_{i-1}}_{t_{i-1}}, where R~i1R~ti1(xti10,ηi1,δ,θi1)\tilde{R}_{i-1}\leq\tilde{R}_{t_{i-1}}(x^{0}_{t_{i-1}},\eta_{i-1},\delta,\theta_{i-1}) is the first iterate kk of Algorithm ICG(xti10,ti1,ηi1,δ,θi1)\operatorname{\texttt{ICG}}(x^{0}_{t_{i-1}},t_{i-1},\eta_{i-1},\delta,\theta_{i-1}) satisfying 𝖦𝖺𝗉~ti1(xti1k)ηi1δ\widetilde{\operatorname{\mathsf{Gap}}}_{t_{i-1}}(x^{k}_{t_{i-1}})\leq\eta_{i-1}\delta. Note R~i1\tilde{R}_{i-1} is upper bounded by the mentioned number, by means of Proposition 4.2. After the ii-th restart, let us call the obtained iterate x^i:=xtiR~i\hat{x}_{i}:=x^{\tilde{R}_{i}}_{t_{i}}. In this way we generate a sequence {x^i}i0\{\hat{x}_{i}\}_{i\geq 0}, consisting of candidates for approximately following the central path, as they are (ηi+θi)(\eta_{i}+\theta_{i})-close in terms of 𝖦𝖺𝗉ti\operatorname{\mathsf{Gap}}_{t_{i}} (and, hence, in terms of the potential function gap Δti\Delta_{t_{i}}). We update the parameters (ti,ηi,θi)(t_{i},\eta_{i},\theta_{i}) as follows:

  • The sequence of homotopy parameters is determined as ti=t0σit_{i}=t_{0}\sigma^{-i} for σ(0,1)\sigma\in(0,1) until the last round of Inexact Homotopy(x0,ε,f,δ)\texttt{Inexact Homotopy}(x^{0},\varepsilon,f,\delta) is reached.

  • The sequences of accuracies in the inner loop and ILMO inexactnesses requested in procedure ICG(xti0,ti,ηi,δ,θi)\operatorname{\texttt{ICG}}(x^{0}_{t_{i}},t_{i},\eta_{i},\delta,\theta_{i}) is updated by ηi=η0σi\eta_{i}=\eta_{0}\sigma^{i}, θi=θ0σi\theta_{i}=\theta_{0}\sigma^{i}.

  • The Algorithm stops after II(σ,η0,ε)=log(3η0/ε)log(1/σ)I\equiv I(\sigma,\eta_{0},\varepsilon)=\lceil\frac{\log(3\eta_{0}/\varepsilon)}{\log(1/\sigma)}\rceil updates of the accuracy and homotopy parameters, and yields an ε\varepsilon-approximate solution of problem (P).

We underline that it is not necessary to stop the algorithm after a prescribed number of iterations and it is essentially an any-time convergent algorithm.

Our updating strategy for the parameters ensures that tiηi=tiθi=t0η0=t0θ0t_{i}\eta_{i}=t_{i}\theta_{i}=t_{0}\eta_{0}=t_{0}\theta_{0}. We will make the judicious choice η0t0=2ν\eta_{0}t_{0}=2\nu to obtain an overall assessment of the iteration complexity of method Inexact Homotopy(x0,ε,f,δ)\texttt{Inexact Homotopy}(x^{0},\varepsilon,f,\delta). The proof is given in Appendix A.

Theorem 4.4.

Choose t0=νΩgt_{0}=\frac{\nu}{\Omega_{g}} and θ0=η0=2Ωg\theta_{0}=\eta_{0}=2\Omega_{g}. The total iteration complexity, i.e. the number of CndG steps, of method Inexact Homotopy(x0,ε,f,δ)\texttt{Inexact Homotopy}(x^{0},\varepsilon,f,\delta) to find an ε\varepsilon-solution of (P) is Compl(x0,ε,f,δ)=O(νΩg2δ2ε2(1σ2)).\texttt{Compl}(x^{0},\varepsilon,f,\delta)=O\left(\frac{\nu\Omega_{g}^{2}}{\delta^{2}\varepsilon^{2}(1-\sigma^{2})}\right).

5 Complexity Analysis

In this section, we give a detailed presentation of the analysis of the iteration complexity of Algorithm 2.

5.1 Analysis of procedure CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t)

Recall Ωg:=maxx,ydom(g)𝖷|g(x)g(y)|\Omega_{g}:=\max_{x,y\in\operatorname{dom}(g)\cap\mathsf{X}}\lvert g(x)-g(y)\rvert and 𝖢=dom(F)𝖷dom(g)\mathsf{C}=\operatorname{dom}(F)\cap\mathsf{X}\cap\operatorname{dom}(g). The following estimate can be established as in [43, Prop. 3]. We therefore omit its simple proof.

Lemma 5.1.

For all (x,t)𝖢×(0,)(x,t)\in\mathsf{C}\times(0,\infty), we have

t1𝚎t(x)ν/t+𝖦𝖺𝗉t(x)+Ωg.t^{-1}\mathtt{e}_{t}(x)\leq\nu/t+\operatorname{\mathsf{Gap}}_{t}(x)+\Omega_{g}. (5.1)

Observe that

𝚎t(x)\displaystyle\mathtt{e}_{t}(x) =st(x)xx=F′′(x)[st(x)x,st(x)x]\displaystyle=\lVert s_{t}(x)-x\rVert_{x}=\sqrt{F^{\prime\prime}(x)[s_{t}(x)-x,s_{t}(x)-x]}
ν1/2|F(x)[st(x)x]|\displaystyle\geq\nu^{-1/2}\lvert F^{\prime}(x)[s_{t}(x)-x]\rvert
=tν|𝖦𝖺𝗉t(x)g(x)+g(st(x))|\displaystyle=\frac{t}{\sqrt{\nu}}\lvert\operatorname{\mathsf{Gap}}_{t}(x)-g(x)+g(s_{t}(x))\rvert
tν(𝖦𝖺𝗉t(x)Ωg).\displaystyle\geq\frac{t}{\sqrt{\nu}}(\operatorname{\mathsf{Gap}}_{t}(x)-\Omega_{g}). (5.2)

5.1.1 Proof of Proposition 3.2

The proof mainly follows the steps of the proof of Theorem 1 in [43]. The main technical challenge in our analysis is taking care of the penalty parameter tt. A main step in the complexity analysis is the identification of two convergence regimes, depending on the magnitude of the merit function 𝖦𝖺𝗉t\operatorname{\mathsf{Gap}}_{t}.

Lemma 5.2.

Define 𝚊t:=νt+Ωg\mathtt{a}_{t}:=\frac{\nu}{t}+\Omega_{g}, and the partition

𝕂I(t)={k|𝖦𝖺𝗉t(xtk)>𝚊t},𝕂II(t)=𝕂I(t).\mathbb{K}_{I}(t)=\{k\in\mathbb{N}|\operatorname{\mathsf{Gap}}_{t}(x_{t}^{k})>\mathtt{a}_{t}\},\quad\mathbb{K}_{II}(t)=\mathbb{N}\setminus\mathbb{K}_{I}(t).

If k𝕂I(t)k\in\mathbb{K}_{I}(t) we have

Δt(xtk)Δt(xtk+1)15.3Gtkν+tGtk+tΩgΔt(xtk)5.3(ν+tΩg+tΔt(xtk)),\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{1}{5.3}\frac{G^{k}_{t}}{\nu+tG^{k}_{t}+t\Omega_{g}}\geq\frac{\Delta_{t}(x^{k}_{t})}{5.3(\nu+t\Omega_{g}+t\Delta_{t}(x^{k}_{t}))}, (5.3)

and if k𝕂II(t)k\in\mathbb{K}_{II}(t) we have

1Δt(xtk+1)1Δt(xtk)+t12(ν+tΩg)2\frac{1}{\Delta_{t}(x^{k+1}_{t})}\geq\frac{1}{\Delta_{t}(x^{k}_{t})}+\frac{t}{12(\nu+t\Omega_{g})^{2}} (5.4)

In particular, {Δt(xtk)}k\{\Delta_{t}(x^{k}_{t})\}_{k\in\mathbb{N}} is monotonically decreasing.

Before proving this Lemma, let us explain how we use it to estimate the total number of iterations needed by method CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t) until the stopping criterion is satisfied. Motivated by the two different regimes identified in Lemma 5.2, we can bound the number of iterations produced by CG(xt0,η,t)\operatorname{\texttt{CG}}(x^{0}_{t},\eta,t) before the stopping criterion Δt(xtk)η\Delta_{t}(x^{k}_{t})\leq\eta applies in two steps: First, we bound the number of iterations which the algorithm spends in the set 𝕂I(t)\mathbb{K}_{I}(t); Second we bound the number of iterations the algorithm spends in 𝕂II(t)\mathbb{K}_{II}(t). Combining these two upper bounds, yields the total iteration bound reported in Proposition 3.2.

Proof of Lemma 5.2.

Assume that iteration kk is in phase 𝕂I(t)\mathbb{K}_{I}(t). To reduce notational clutter, we set 𝚎tk𝚎t(xtk)\mathtt{e}^{k}_{t}\equiv\mathtt{e}_{t}(x^{k}_{t}) and Gtk𝖦𝖺𝗉t(xtk)G^{k}_{t}\equiv\operatorname{\mathsf{Gap}}_{t}(x^{k}_{t}). Combining the condition Gtk>νt+ΩgG^{k}_{t}>\frac{\nu}{t}+\Omega_{g} of phase 𝕂I(t)\mathbb{K}_{I}(t) with (5.2), we deduce that 𝚎tkν1\mathtt{e}^{k}_{t}\geq\sqrt{\nu}\geq 1. This in turn implies tGtk𝚎tk(𝚎tk+tGtk)<tGtk𝚎tk+tGtk<1\frac{tG^{k}_{t}}{\mathtt{e}^{k}_{t}(\mathtt{e}^{k}_{t}+tG^{k}_{t})}<\frac{tG^{k}_{t}}{\mathtt{e}^{k}_{t}+tG^{k}_{t}}<1. Hence, at this iteration, algorithm CG(x0,ε,t)\operatorname{\texttt{CG}}(x^{0},\varepsilon,t) chooses the step sizes αk=tGtk𝚎tk(𝚎tk+tGtk)\alpha_{k}=\frac{tG^{k}_{t}}{\mathtt{e}^{k}_{t}(\mathtt{e}^{k}_{t}+tG^{k}_{t})}. The per-iteration reduction of the potential function can be estimated as

Vt(xtk+1)\displaystyle V_{t}(x^{k+1}_{t}) Vt(xtk)Gtk𝚎tktGtk𝚎tk+tGtk+1tω(tGtk𝚎tk+tGtk)\displaystyle\leq V_{t}(x^{k}_{t})-\frac{G^{k}_{t}}{\mathtt{e}^{k}_{t}}\frac{tG^{k}_{t}}{\mathtt{e}_{t}^{k}+tG^{k}_{t}}+\frac{1}{t}\omega_{\ast}\left(\frac{tG^{k}_{t}}{\mathtt{e}_{t}^{k}+tG^{k}_{t}}\right)
=Vt(xtk)Gtk𝚎tk+1tlog(1+tGtk𝚎tk)\displaystyle=V_{t}(x^{k}_{t})-\frac{G^{k}_{t}}{\mathtt{e}^{k}_{t}}+\frac{1}{t}\log\left(1+\frac{tG_{t}^{k}}{\mathtt{e}^{k}_{t}}\right)
=Vt(xtk)1tω(tGtk𝚎tk)\displaystyle=V_{t}(x^{k}_{t})-\frac{1}{t}\omega\left(\frac{tG^{k}_{t}}{\mathtt{e}^{k}_{t}}\right) (5.5)

where ω(τ)=τlog(1+τ)\omega(\tau)=\tau-\log(1+\tau) and ω(t)=τlog(1τ)\omega_{*}(t)=-\tau-\log(1-\tau). This readily yields Δt(xtk)Δt(xtk+1)1tω(tGtk𝚎tk)\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{1}{t}\omega\left(\frac{tG^{k}_{t}}{\mathtt{e}^{k}_{t}}\right) and implies that {Δt(xtk)}k𝕂I(t)\{\Delta_{t}(x^{k}_{t})\}_{k\in\mathbb{K}_{I}(t)} is monotonically decreasing. For k𝕂I(t)k\in\mathbb{K}_{I}(t), we also see tGtkν+tGtk+tΩg>12\frac{tG^{k}_{t}}{\nu+tG^{k}_{t}+t\Omega_{g}}>\frac{1}{2}. Together with (5.1), this implies

tGtk𝚎tkGtkν/t+Gtk+Ωg=tGtkν+tGtk+tΩg>12.\frac{tG_{t}^{k}}{\mathtt{e}^{k}_{t}}\geq\frac{G^{k}_{t}}{\nu/t+G_{t}^{k}+\Omega_{g}}=\frac{tG^{k}_{t}}{\nu+tG_{t}^{k}+t\Omega_{g}}>\frac{1}{2}. (5.6)

Using (3.5), the monotonicity of ω()\omega(\cdot) and that ω(τ)τ5.3\omega(\tau)\geq\frac{\tau}{5.3} for τ1/2\tau\geq 1/2 (see [43, Prop. 1]), the fact that the function xxa+txx\mapsto\frac{x}{a+tx} is strictly increasing for t,a>0t,a>0, we arrive at

Δt(xtk)Δt(xtk+1)15.3Gtkν+tGtk+tΩgΔt(xtk)5.3(ν+tΩg+tΔt(xtk)).\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{1}{5.3}\frac{G^{k}_{t}}{\nu+tG^{k}_{t}+t\Omega_{g}}\geq\frac{\Delta_{t}(x^{k}_{t})}{5.3(\nu+t\Omega_{g}+t\Delta_{t}(x^{k}_{t}))}. (5.7)

Next, consider an integer k𝕂II(t)k\in\mathbb{K}_{II}(t). We have to distinguish the iterates using large steps (αk=1(\alpha_{k}=1), from those using short steps (αk<1\alpha_{k}<1). Let us first consider the case where αk=1\alpha_{k}=1. Then, tGtk𝚎tk(tGtk+𝚎tk)tG^{k}_{t}\geq\mathtt{e}^{k}_{t}(tG^{k}_{t}+\mathtt{e}^{k}_{t}), which implies 𝚎tk(0,1)\mathtt{e}^{k}_{t}\in(0,1), and Gtk(𝚎tk)2t(1𝚎tk)G^{k}_{t}\geq\frac{(\mathtt{e}^{k}_{t})^{2}}{t(1-\mathtt{e}^{k}_{t})}. Moreover, using (2.7), we readily obtain

Vt(xtk+1)\displaystyle V_{t}(x^{k+1}_{t}) Vt(xtk)Gtk+1tω(𝚎tk)\displaystyle\leq V_{t}(x^{k}_{t})-G^{k}_{t}+\frac{1}{t}\omega_{\ast}(\mathtt{e}^{k}_{t})
Vt(xtk)Gtk+1t(𝚎tk)22(1𝚎tk)\displaystyle\leq V_{t}(x^{k}_{t})-G^{k}_{t}+\frac{1}{t}\frac{(\mathtt{e}^{k}_{t})^{2}}{2(1-\mathtt{e}^{k}_{t})}
Vt(xtk)Gtk2.\displaystyle\leq V_{t}(x^{k}_{t})-\frac{G_{t}^{k}}{2}.

Therefore,

Δt(xtk)Δt(xtk+1)Gtk212(Gtk)2ν/t+Ωg=12t(Gtk)2ν+tΩg112t(Gtk)2(ν+tΩg)2.\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{G_{t}^{k}}{2}\geq\frac{1}{2}\frac{(G_{t}^{k})^{2}}{\nu/t+\Omega_{g}}=\frac{1}{2}\frac{t(G^{k}_{t})^{2}}{\nu+t\Omega_{g}}\geq\frac{1}{12}\frac{t(G^{k}_{t})^{2}}{(\nu+t\Omega_{g})^{2}}.

In the regime where αk<1\alpha_{k}<1, we obtain the estimate Δt(xtk)Δt(xtk+1)112t(Gtk)2(ν+tΩg)2\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{1}{12}\frac{t(G^{k}_{t})^{2}}{(\nu+t\Omega_{g})^{2}}, by following the same reasoning as in [43]. Using the relation GtkΔt(xtk)Δt(xtk+1)G^{k}_{t}\geq\Delta_{t}(x^{k}_{t})\geq\Delta_{t}(x^{k+1}_{t}), we conclude

1Δt(xtk+1)1Δt(xtk)t12(ν+tΩg)2.\frac{1}{\Delta_{t}(x^{k+1}_{t})}-\frac{1}{\Delta_{t}(x^{k}_{t})}\geq\frac{t}{12(\nu+t\Omega_{g})^{2}}.

\blacksquare

We now exploit this monotonicity of the potential function gap in order to upper bound the number of iterations which Algorithm CG(xt0,η,t)\operatorname{\texttt{CG}}(x^{0}_{t},\eta,t) spends in the disjoint phases 𝕂I(t)\mathbb{K}_{I}(t) and 𝕂II(t)\mathbb{K}_{II}(t), respectively.

Proof of Proposition 3.2.

Given the pair (t,η)(0,)2(t,\eta)\in(0,\infty)^{2}, denote by NN(xt0,η,t)N\equiv N(x^{0}_{t},\eta,t) the upper bound defined in (3.10). We separate NN into two parts, corresponding with stages 𝕂I(t)\mathbb{K}_{I}(t) and 𝕂II(t)\mathbb{K}_{II}(t), as they are defined in the proof of Lemma 5.2. We will show that N=Kt(xt0)+MN=K_{t}(x^{0}_{t})+M where Kt(xt0)K_{t}(x^{0}_{t}) is a bound on the number of iterates within 𝕂I(t)\mathbb{K}_{I}(t) and MM is bound on the number of iterates within 𝕂II(t)\mathbb{K}_{II}(t) until Δt(xtk)η\Delta_{t}(x^{k}_{t})\leq\eta is reached.

Note that Lemma 5.2 ensures that for all k=𝕂I(t)𝕂II(t)k\in\mathbb{N}=\mathbb{K}_{I}(t)\cup\mathbb{K}_{II}(t), it holds that Δt(xtk+1)Δt(xtk)\Delta_{t}(x^{k+1}_{t})\leq\Delta_{t}(x^{k}_{t}), i.e. the sequence Δt(xtk)\Delta_{t}(x^{k}_{t}) is monotonically decreasing.

We begin by deriving an expression for Kt(xt0)K_{t}(x^{0}_{t}), the number of iterations which CG(xt0,η,t)\operatorname{\texttt{CG}}(x^{0}_{t},\eta,t) spends in 𝕂I(t)\mathbb{K}_{I}(t). To that end, let {ki}i\{k_{i}\}_{i\in\mathbb{N}} denote the increasing set of indices enumerating the elements of the set 𝕂I(t)\mathbb{K}_{I}(t). From (5.7) and Δt(xtki+1)Δt(xtki+1)Δt(xtki)Δt(xt0\Delta_{t}(x^{k_{i+1}}_{t})\leq\Delta_{t}(x^{k_{i}+1}_{t})\leq\Delta_{t}(x^{k_{i}}_{t})\leq\Delta_{t}(x^{0}_{t}), we conclude

Δt(xtki)[115.3(ν+tΩg+tΔt(xt0))]Δt(xtki+1)Δt(xtki+1).\Delta_{t}(x^{k_{i}}_{t})\left[1-\frac{1}{5.3(\nu+t\Omega_{g}+t\Delta_{t}(x^{0}_{t}))}\right]\geq\Delta_{t}(x^{k_{i}+1}_{t})\geq\Delta_{t}(x^{k_{i+1}}_{t}). (5.8)

Setting qt:=15.3(ν+tΩg+tΔt(xt0))(0,1)q_{t}:=\frac{1}{5.3(\nu+t\Omega_{g}+t\Delta_{t}(x^{0}_{t}))}\in(0,1) (since ν1\nu\geq 1), we arrive at the recursion

Δt(xtki)(1qt)iΔt(xt0)exp(qti)Δt(xt0).\Delta_{t}(x^{k_{i}}_{t})\leq(1-q_{t})^{i}\Delta_{t}(x^{0}_{t})\leq\exp(-q_{t}i)\Delta_{t}(x^{0}_{t}).

Furthermore, by definition, all ki𝕂I(t)k_{i}\in\mathbb{K}_{I}(t) satisfy Gtki𝚊t=νt+ΩgG_{t}^{k_{i}}\geq\mathtt{a}_{t}=\frac{\nu}{t}+\Omega_{g}, so that (5.6) yields

Δt(xtki)\displaystyle\Delta_{t}(x^{k_{i}}_{t}) Δt(xtki)Δt(xtki+1)Gtki5.3(ν+tGtki+tΩg)110.6t.\displaystyle\geq\Delta_{t}(x^{k_{i}}_{t})-\Delta_{t}(x^{k_{i}+1}_{t})\geq\frac{G_{t}^{k_{i}}}{5.3(\nu+tG^{k_{i}}_{t}+t\Omega_{g})}\geq\frac{1}{10.6t}.

Hence, 110.6tΔt(xt0)exp(qt|𝕂I(t)|)\frac{1}{10.6t\Delta_{t}(x^{0}_{t})}\leq\exp(-q_{t}\lvert\mathbb{K}_{I}(t)\rvert), and

Kt(xt0):=5.3(ν+tΔt(xt0)+tΩg)log(10.6tΔt(xt0))K_{t}(x^{0}_{t}):=\lceil 5.3(\nu+t\Delta_{t}(x^{0}_{t})+t\Omega_{g})\log(10.6t\Delta_{t}(x^{0}_{t}))\rceil (5.9)

yields the upper bound.

We proceed analogously with the analysis on the set 𝕂II(t)\mathbb{K}_{II}(t). Let {j}j0\{\ell_{j}\}_{j\in\mathbb{N}_{0}} the increasing set of indices enumerating the elements of the set 𝕂II(t)\mathbb{K}_{II}(t), and M=NKt(xt0)M=N-K_{t}(x^{0}_{t}) the upper bound on iterations in this set. Using inequality (5.4) for k:=jk:=\ell_{j} together with the monotonicity of Δt(xtk)\Delta_{t}(x^{k}_{t}) results in

1Δt(xtj+1)1Δt(xtj+1)1Δt(xtj)+t12(ν+tΩg)2.\frac{1}{\Delta_{t}(x^{\ell_{j+1}}_{t})}\geq\frac{1}{\Delta_{t}(x^{\ell_{j}+1}_{t})}\geq\frac{1}{\Delta_{t}(x^{\ell_{j}}_{t})}+\frac{t}{12(\nu+t\Omega_{g})^{2}}. (5.10)

Telescoping the inequality (5.10) together with Δt(xt1)Δt(xt0)\Delta_{t}(x^{\ell_{1}}_{t})\leq\Delta_{t}(x^{0}_{t}) we get

1Δt(xtM)1Δt(xt0)+Mt12(ν+tΩg)21Δt(xt0)+Mt12(ν+tΩg)2.\frac{1}{\Delta_{t}(x^{\ell_{M}}_{t})}\geq\frac{1}{\Delta_{t}(x^{\ell_{0}}_{t})}+\frac{Mt}{12(\nu+t\Omega_{g})^{2}}\geq\frac{1}{\Delta_{t}(x^{0}_{t})}+\frac{Mt}{12(\nu+t\Omega_{g})^{2}}.

Thus, Δt(xtM)η\Delta_{t}(x^{\ell_{M}}_{t})\leq\eta, is satisfied for M=12(ν+tΩg)2(1tη1tΔt(xt0))+M=\lceil 12(\nu+t\Omega_{g})^{2}(\frac{1}{t\eta}-\frac{1}{t\Delta_{t}(x^{0}_{t})})_{+}\rceil. This demonstrates that at most N=Kt(xt0)+MN=K_{t}(x^{0}_{t})+M iterations of CG(xt0,η,t)\operatorname{\texttt{CG}}(x^{0}_{t},\eta,t) are sufficient for achieving Δt(xtk)η.\Delta_{t}(x_{t}^{k})\leq\eta. \blacksquare

5.1.2 Proof of Proposition 3.1

Using the bound Kt(xt0)K_{t}(x^{0}_{t}) on the number of iterations in 𝕂I(t)\mathbb{K}_{I}(t) from the proof of Proposition 3.2, we are left to bound the number of iteration in 𝕂II(t)\mathbb{K}_{II}(t), until the condition GtkηG_{t}^{k}\leq\eta is satisfied.

Let {kj(t)}j\{k_{j}(t)\}_{j} denote the increasing set of indices enumerating the elements of the set 𝕂II(t)\mathbb{K}_{II}(t). From (5.4) we know that

Δt(xtkj+1(t))Δt(xtkj(t))112t(Gtkj(t))2(ν+tΩg)2, and Gtkj(t)Δt(xtkj(t))Δt(xtkj(t)).\Delta_{t}(x_{t}^{k_{j+1}(t)})-\Delta_{t}(x^{k_{j}(t)}_{t})\leq-\frac{1}{12}\frac{t(G_{t}^{k_{j}(t)})^{2}}{(\nu+t\Omega_{g})^{2}},\text{ and }G_{t}^{k_{j}(t)}\geq\Delta_{t}(x^{k_{j}(t)}_{t})\geq\Delta_{t}(x^{k_{j}(t)}_{t}).

Set djΔt(xtkj(t))d_{j}\equiv\Delta_{t}(x^{k_{j}(t)}_{t}) and 1Mt12(ν+tΩg)2,ΓjGtkj(t)\frac{1}{M}\equiv\frac{t}{12(\nu+t\Omega_{g})^{2}},\Gamma_{j}\equiv G_{t}^{k_{j}(t)}. We thus obtain the recursion

dj+1djΓj2M,Γjdj,d_{j+1}\leq d_{j}-\frac{\Gamma_{j}^{2}}{M},\quad\Gamma_{j}\geq d_{j},

and we can apply [43, Prop. 2.4] directly to the above recursion to obtain the estimates

djMj, and min{Γ0,,Γj}2Mj.d_{j}\leq\frac{M}{j},\quad\text{ and }\min\{\Gamma_{0},\ldots,\Gamma_{j}\}\leq\frac{2M}{j}.

Since the algorithm terminates when we obtain an iterate with Γj=𝖦𝖺𝗉t(xtkj(t))η\Gamma_{j}=\operatorname{\mathsf{Gap}}_{t}(x^{k_{j}(t)}_{t})\leq\eta, it is sufficient to stop at the label jj is reached satisfying 2Mjη\frac{2M}{j}\leq\eta. Solving this for jj yields

j24(ν+tΩg)2tη.j\leq\left\lceil\frac{24(\nu+t\Omega_{g})^{2}}{t\eta}\right\rceil. (5.11)

Combining the bound (5.11) with Kt(xt0)K_{t}(x^{0}_{t}), we obtain the total complexity estimate postulated in Proposition 3.1.

5.2 Analysis of the outer loop and proof of Theorem 3.3

Let II(η0,σ,ε)=log(2η0/ε)log(1/σ)I\equiv I(\eta_{0},\sigma,\varepsilon)=\lceil\frac{\log(2\eta_{0}/\varepsilon)}{\log(1/\sigma)}\rceil denote the a-priori fixed number of updates of the accuracy and homotopy parameter. We set x^ixtiRi\hat{x}_{i}\equiv x^{R_{i}}_{t_{i}}, the last iterate of procedure CG(xti0,ηi,ti)\operatorname{\texttt{CG}}(x^{0}_{t_{i}},\eta_{i},t_{i}) satisfying Δti(x^i)𝖦𝖺𝗉ti(x^i)ηi\Delta_{t_{i}}(\hat{x}_{i})\leq\operatorname{\mathsf{Gap}}_{t_{i}}(\hat{x}_{i})\leq\eta_{i}. From this, using the definition of the gap function, we deduce that

g(x^i)g(z(ti))𝖦𝖺𝗉ti(x^i)+1tiF(x^i)[z(ti)x^i]ηi+νti.g(\hat{x}_{i})-g(z^{\ast}(t_{i}))\leq\operatorname{\mathsf{Gap}}_{t_{i}}(\hat{x}_{i})+\frac{1}{t_{i}}F^{\prime}(\hat{x}_{i})[z^{\ast}(t_{i})-\hat{x}_{i}]\leq\eta_{i}+\frac{\nu}{t_{i}}. (5.12)

Hence, using Lemma 2.2, we observe

g(x^i)Opt=g(x^i)g(z(ti))+g(z(ti))Optηi+2νti.g(\hat{x}_{i})-\operatorname{Opt}=g(\hat{x}_{i})-g(z^{\ast}(t_{i}))+g(z^{\ast}(t_{i}))-\operatorname{Opt}\leq\eta_{i}+\frac{2\nu}{t_{i}}. (5.13)

Since ti=2νηit_{i}=\frac{2\nu}{\eta_{i}}, we obtain g(x^i)Opt2ηig(\hat{x}_{i})-\operatorname{Opt}\leq 2\eta_{i}. We next estimate the initial gap Δti+1(xti+10)=Δti+1(x^i)\Delta_{t_{i+1}}(x^{0}_{t_{i+1}})=\Delta_{t_{i+1}}(\hat{x}_{i}) incurred by our warm-start strategy. Observe that

Vti+1(xti+10)Vti+1(z(ti+1))\displaystyle V_{t_{i+1}}(x^{0}_{t_{i+1}})-V_{t_{i+1}}(z^{\ast}(t_{i+1})) =Vti(xti+10)Vti(z(ti+1))\displaystyle=V_{t_{i}}(x^{0}_{t_{i+1}})-V_{t_{i}}(z^{\ast}(t_{i+1}))
+(1ti+11ti)(F(xti+10)F(z(ti+1)))\displaystyle+\left(\frac{1}{t_{i+1}}-\frac{1}{t_{i}}\right)(F(x^{0}_{t_{i+1}})-F(z^{\ast}(t_{i+1})))
Vti(xtiRi)Vti(z(ti))\displaystyle\leq V_{t_{i}}(x^{R_{i}}_{t_{i}})-V_{t_{i}}(z^{\ast}(t_{i}))
+(1ti+1ti)(Vti+1(xti+10)Vti+1(z(ti+1)))\displaystyle+\left(1-\frac{t_{i+1}}{t_{i}}\right)\left(V_{t_{i+1}}(x^{0}_{t_{i+1}})-V_{t_{i+1}}(z^{\ast}(t_{i+1}))\right)
+(1ti+1ti)(g(z(ti+1))g(xti+10)),\displaystyle+(1-\frac{t_{i+1}}{t_{i}})(g(z^{\ast}(t_{i+1}))-g(x^{0}_{t_{i+1}})),

where the last transitions follows from Vti(z(ti+1))Vti(z(ti))V_{t_{i}}(z^{\ast}(t_{i+1}))\geq V_{t_{i}}(z^{\ast}(t_{i})). Moreover,

g(xti+10)g(z(ti+1))g(xti+10)Optηi+2νti=4νti.g(x^{0}_{t_{i+1}})-g(z^{\ast}(t_{i+1}))\leq g(x^{0}_{t_{i+1}})-\operatorname{Opt}\leq\eta_{i}+\frac{2\nu}{t_{i}}=\frac{4\nu}{t_{i}}.

Since ti+1>tit_{i+1}>t_{i}, the above two inequalities imply

ti+1ti[Vti+1(xti+10)Vti+1(z(ti+1))]Vti(xtiRi)Vti(z(ti))+ti+1titi24ν.\displaystyle\frac{t_{i+1}}{t_{i}}[V_{t_{i+1}}(x^{0}_{t_{i+1}})-V_{t_{i+1}}(z^{\ast}(t_{i+1}))]\leq V_{t_{i}}(x^{R_{i}}_{t_{i}})-V_{t_{i}}(z^{\ast}(t_{i}))+\frac{t_{i+1}-{t_{i}}}{t_{i}^{2}}4\nu.

Whence,

ti+1Δti+1(xti+10)tiΔti(xtiRi)+ti+1titi4νtiηi+ti+1titi4ν=(2ti+1ti1)2ν.t_{i+1}\Delta_{t_{i+1}}(x^{0}_{t_{i+1}})\leq t_{i}\Delta_{t_{i}}(x^{R_{i}}_{t_{i}})+\frac{t_{i+1}-{t_{i}}}{t_{i}}4\nu\leq t_{i}\eta_{i}+\frac{t_{i+1}-{t_{i}}}{t_{i}}4\nu=\left(2\frac{t_{i+1}}{t_{i}}-1\right)2\nu. (5.14)

We are now in the position to estimate the total iteration complexity Compl(x0,ε,f):=i=0IRii=0IR(xti0,ηi,ti)\texttt{Compl}(x^{0},\varepsilon,f):=\sum_{i=0}^{I}R_{i}\leq\sum_{i=0}^{I}R(x^{0}_{t_{i}},\eta_{i},t_{i}) by using Proposition 3.1, and thereby prove Theorem 3.3. We do so by using the updating regime of the sequences {ti}\{t_{i}\} and {ηi}\{\eta_{i}\} explained in Section 3. These updating mechanisms imply

i=1IRi\displaystyle\sum_{i=1}^{I}R_{i} i=1I5.3(ν+2ν(2titi11)+tiΩg)log(10.6(2ν(2titi11)))\displaystyle\leq\sum_{i=1}^{I}5.3\left(\nu+2\nu\left(\frac{2t_{i}}{t_{i-1}}-1\right)+t_{i}\Omega_{g}\right)\log\left(10.6\left(2\nu\left(\frac{2t_{i}}{t_{i-1}}-1\right)\right)\right)
+i=1I12ν(ν+tiΩg)2\displaystyle+\sum_{i=1}^{I}\frac{12}{\nu}(\nu+t_{i}\Omega_{g})^{2}
log(21.2(ν(2/σ1)))i=1I5.3(ν(4/σ1)+tiΩg))+i=1I24ν(ν2+ti2Ωg2)\displaystyle\leq\log(21.2(\nu(2/\sigma-1)))\sum_{i=1}^{I}5.3(\nu(4/\sigma-1)+t_{i}\Omega_{g}))+\sum_{i=1}^{I}\frac{24}{\nu}(\nu^{2}+t^{2}_{i}\Omega^{2}_{g})
24ν×I+21.2ν/σlog(42.4ν/σ)×I+5.3log(42.4ν/σ)Ωgi=1Iti+24Ωg2νi=1Iti2\displaystyle\leq{24}{\nu}\times I+21.2\nu/\sigma\log(42.4\nu/\sigma){\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\times I}+{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}5.3}\log(42.4\nu/\sigma)\Omega_{g}\sum_{i=1}^{I}t_{i}+\frac{24\Omega_{g}^{2}}{\nu}\sum_{i=1}^{I}t_{i}^{2}
24νlog(2η0/ε)log(1/σ)+log(42.4νσ)21.2νσlog(2η0/ε)log(1/σ)\displaystyle\leq{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{24\nu\log(2\eta_{0}/\varepsilon)}{\log(1/\sigma)}}+\log\left(\frac{42.4\nu}{\sigma}\right)\frac{21.2\nu}{\sigma}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{\log(2\eta_{0}/\varepsilon)}{\log(1/\sigma)}}
+log(42.4νσ)21.2νΩgε(1σ)+384Ωg2νε2(1σ2).\displaystyle+\log\left(\frac{42.4\nu}{\sigma}\right)\frac{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}21.2}\nu\Omega_{g}}{\varepsilon(1-\sigma)}+\frac{384\Omega_{g}^{2}\nu}{\varepsilon^{2}(1-\sigma^{2})}.

In deriving these bounds, we have used the definition of II and the following estimates:

i=1Iti=t0σ1i=0I1(1/σ)it0σI1σ4νε(1σ),\displaystyle\sum_{i=1}^{I}t_{i}=t_{0}\sigma^{-1}\sum_{i=0}^{I-1}(1/\sigma)^{i}\leq t_{0}\frac{\sigma^{-I}}{1-\sigma}\leq\frac{4\nu}{\varepsilon(1-\sigma)},

and

i=1Iti2=t02i=1I(1/σ)2i=t02σ2i=0I1(1/σ2)i16ν2ε2(1σ2).\displaystyle\sum_{i=1}^{I}t_{i}^{2}=t^{2}_{0}\sum_{i=1}^{I}(1/\sigma)^{2i}=\frac{t^{2}_{0}}{\sigma^{2}}\sum_{i=0}^{I-1}(1/\sigma^{2})^{i}\leq\frac{16\nu^{2}}{\varepsilon^{2}(1-\sigma^{2})}.

It remains to bound the complexity at i=0i=0. We have

R0\displaystyle R_{0} 5.3(ν+t0Δt0(xt00)+t0Ωg)log(10.6t0Δt0(x0))+12ν(ν+t0Ωg)2\displaystyle\leq 5.3(\nu+t_{0}\Delta_{t_{0}}(x^{0}_{t_{0}})+t_{0}\Omega_{g})\log(10.6t_{0}\Delta_{t_{0}}(x^{0}))+\frac{12}{\nu}(\nu+t_{0}\Omega_{g})^{2}
5.3(ν+F(x0)F(z(t0))+2t0Ωg)log(10.6(F(x0)F(z(t0))+t0Ωg))\displaystyle\leq 5.3(\nu+F(x^{0})-F(z^{\ast}(t_{0}))+2t_{0}\Omega_{g})\log\left(10.6(F(x^{0})-F(z^{\ast}(t_{0}))+t_{0}\Omega_{g})\right)
+24ν(ν2+t02Ωg2).\displaystyle+\frac{24}{\nu}(\nu^{2}+t^{2}_{0}\Omega^{2}_{g}).

Since t0=νΩgt_{0}=\frac{\nu}{\Omega_{g}} and η0=2Ωg\eta_{0}=2\Omega_{g}, we have

R05.3(3ν+F(x0)F(z(t0)))log(10.6(ν+F(x0)F(z(t0))))+48ν.R_{0}\leq 5.3(3\nu+F(x^{0})-F(z^{\ast}(t_{0})))\log\left(10.6(\nu+F(x^{0})-F(z^{\ast}(t_{0})))\right)+48\nu.

Adding the two gives the total complexity bound

Compl(x0,ε,f)\displaystyle\texttt{Compl}(x^{0},\varepsilon,f) 5.3(3ν+F(x0)F(z(t0)))log(10.6(ν+F(x0)F(z(t0))))+48ν\displaystyle\leq 5.3(3\nu+F(x^{0})-F(z^{\ast}(t_{0})))\log\left(10.6(\nu+F(x^{0})-F(z^{\ast}(t_{0})))\right)+48\nu
+νlog(4Ωg/ε)log(1/σ)[24+log(42.4νσ)21.2σ]\displaystyle+\frac{\nu\log({\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}4\Omega_{g}}/\varepsilon)}{\log(1/\sigma)}\left[24+\log\left(\frac{42.4\nu}{\sigma}\right)\frac{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}21.2}}{\sigma}\right]
+log(42.4νσ)21.2νΩgε(1σ)+384Ωg2νε2(1σ2).\displaystyle+\log\left(\frac{42.4\nu}{\sigma}\right)\frac{21.2\nu\Omega_{g}}{\varepsilon(1-\sigma)}+\frac{384\Omega_{g}^{2}\nu}{\varepsilon^{2}(1-\sigma^{2})}.
=O~(Ωg2νε2(1σ2)).\displaystyle={\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\tilde{O}}\left(\frac{\Omega_{g}^{2}\nu}{\varepsilon^{2}(1-\sigma^{2})}\right).

6 Experimental results

In this section, we give two examples covered by the problem template (P), and report the results of numerical experiments that illustrate the practical performance and scalability of our method. We implement Algorithm 2 using procedure CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} as solvers for the inner loops. We compare the results of our algorithm to two benchmarks: (1) an interior point solution of the problem by SDPT3 [35], and (2) the solution obtained by CGAL [41]. Our performance measure for these experiments is the relative optimality gap, which given the value of the objective of the solution computed for each method MM and iteration ii, giMg^{M}_{i}, is computed by |giMgSDPT33|/|gSDPT33|{|g^{M}_{i}-g^{\operatorname{\texttt{SDPT3}}3}|}/{|g^{\operatorname{\texttt{SDPT3}}3}|}, where gSDPT33g^{\operatorname{\texttt{SDPT3}}3} is the interior point solution. The full description of the setup for these experiments is available in Appendix B. A key takeaway from these numerical experiments is that CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} are not influenced by the problem’s scaling and are therefore well-suited to solving ill-posed problems with differently scaled constraints.

6.1 Estimating the mixing time of a Markov chain

Consider a continuous-time Markov chain on a weighted graph, with edge weights representing the transition rates of the process. Let 𝐏(t)=eLt{\mathbf{P}}(t)=e^{-Lt} be the stochastic semi-group of the process, where LL is the graph Laplacian. From L𝟏=0L\mathbf{1}=0 it follows that 1n𝟏\frac{1}{n}\mathbf{1} is a stationary distribution of the process. From [9] we have the following bound on the total variation distance of the time tt distribution of the Markov process and the stationary distribution supππ𝐏(t)1n𝟏TV12neλt.\sup_{\pi}\lVert\pi{\mathbf{P}}(t)-\frac{1}{n}\mathbf{1}\rVert_{\text{TV}}\leq\frac{1}{2}\sqrt{n}e^{-\lambda t}. The quantity Tmix=λT_{\text{mix}}=\lambda is called the mixing time. The constant λ\lambda is the second-largest eigenvalue of the graph Laplacian LL. To bound this eigenvalue, we follow the SDP-strategy laid out in [33], and in some detail described in Appendix B.
In the experiments, we generated random connected undirected graphs with 100800100-800 nodes and 10310410^{3}-10^{4} edges, and for each edge {i,j}\{i,j\} in the graph, we generated a random weight uniformly in [0,1][0,1]. Figure 1 illustrates the relative optimality gap for the generated results by each method (and the dependence of CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} on the starting accuracy η0\eta_{0}), CGAL\operatorname{CGAL} only appears in the right most graphs222More detailed results are available in Appendix B.. We conclude that, despite LCG\operatorname{\texttt{LCG}} being generally costly per iteration, it exhibits the best computational performance both in the number of iterations and in time. Moreover, for all of the datasets, the primal-dual method CGAL\operatorname{CGAL} generated solutions that were far from being feasible, and was not able to reduce the feasibility gap even after 1000 seconds or 10000 iterations, resulting in very poor relative optimality gap. Our conjecture is that the slow convergence rate of CGAL\operatorname{CGAL} with respect to feasibility is due to the different scaling for the right-hand-side of the constraints, which results in a relative feasibility gap which is very large for constraints with small right-hand-side. To test this conjecture , in the next section we present further experiments testing the performance of the methods as a function of the right-hand-side scaling.333We thank an anonymous referee for suggesting this experiment.

Refer to caption
Figure 1: Mixing time problem: relative gap from SDPT3 solution vs. iteration.

6.2 Randomly Scaled SDP

In order to further investigate the robustness of CG\operatorname{\texttt{CG}}, LCG\operatorname{\texttt{LCG}} and CGAL\operatorname{CGAL} to constraint scaling, we’ve created a synthetic data set for a general SDP with inequality constraints. The Synthetic Random Scaling (SRS) problem is formulated as follows:

max𝐂,𝐗 s.t.: 𝐀i,𝐗bi,i{1,2,,m},𝐗𝕊+n,tr(𝐗)1,\max\langle{\mathbf{C}},\mathbf{X}\rangle\text{ s.t.: }\langle\mathbf{A}_{i},\mathbf{X}\rangle\leq{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}b_{i}},i\in\{1,2,\ldots,m\},\mathbf{X}\in\mathbb{S}^{n}_{+},\operatorname{tr}(\mathbf{X})\leq 1, (SRS)

where both the objective function coefficient matrix 𝐂{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}{\mathbf{C}}} and, the constraint coefficient matrices 𝐀i{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{A}_{i}}, for i=1,,mi=1,\dots,m, are randomly generated PSD matrices with unit Frobenius norm. The coefficients bib_{i} are defined as bi=uiip/2𝐀i,𝐗0b_{i}=\frac{u_{i}}{i^{p/2}}\langle{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{A}_{i}},{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}_{0}}\rangle, where ui[12,1]u_{i}\in[\frac{1}{2},1] is a uniformly distributed random number, p{0,1,2}p\in\{0,1,2\}, and 𝐗0{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}_{0}} is a randomly generated PSD matrix with unit trace. Thus, parameter pp controls the scaling of the right-hand-side of the constraints. Specifically, when p=0p=0 we expect all bib_{i} to have the same order of magnitude, while for p=2p=2 constraint bm=O(1/m)b1b_{m}=O(1/m)b_{1}. As we observed for the Mixing problem in Section B.2, our expectation is for CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} to be impervious to this right-hand-side scaling, and for the performance of CGAL\operatorname{CGAL} to deteriorate as pp increases. To compare our method with CGAL, the CGAL output was adjusted to obtain a feasible solution. The details of this procedure are provided in Appendix B.3.

Refer to caption
Figure 2: Randomly scaled SDP problem: relative optimality gap vs. iterations for several parameter choices (averaged over 30 instances for each pp)
Refer to caption
Figure 3: Randomly scaled SDP problem: relative optimality gap vs. time for several parameter choices (averaged over 30 instances for each pp)

For the numerical experiments, we have chosen parameters n=100n=100 and m=100m=100. Figures 2 and 3 show the performance of our methods in terms of the relative optimality gap for problem (SRS), where the “optimal” solution is computed using the SDPT3 solver, as described in Appendix B.3. Since the dataset consists of homogeneous data points, the plots present aggregated results over 30 instances for each value of pp. Figure 2 illustrates the performance of each algorithm in terms of relative optimality gap over iteration count, while Figure 3 presents the performance in terms of runtime, within a time limit of 1000 seconds. The plots indicate that, while in the initial stages of running CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} with a smaller value of σ\sigma (σ=0.25\sigma=0.25) are superior, larger values (σ=0.9\sigma=0.9) achieve slightly better overall results. Furthermore, the line search-based method LCG\operatorname{\texttt{LCG}} consistently achieves better iteration and time complexity performances than CG\operatorname{\texttt{CG}}.

Importantly, the plots confirm the conjecture that the performance of CGAL\operatorname{CGAL} deteriorates as pp increases, our methods show consistent performance across different pp values. In fact, CGAL\operatorname{CGAL} has inferior performance for ill-scaled problems (p=2p=2) compared to CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}}, leading to an average optimality gap of up to two orders of magnitude larger. Although Figure 2 shows that CGAL\operatorname{CGAL} has iteration complexity comparable to or worse than CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} across all values of pp, its low-cost iterations can make it competitive in terms of time complexity, as shown in Figure 3. Specifically, CGAL\operatorname{CGAL} is able to perform more iterations leading to superior time complexity when p=0p=0. However, when p=1p=1 and especially p=2p=2, CGAL\operatorname{CGAL}’s iterations, while still inexpensive, fail to decrease the solution infeasibility, leading to limited improvement. A detailed analysis of the numerical experiments is provided in Appendix B.3.

7 Conclusions

Solving large scale conic-constrained convex programming problems is still a challenge in mathematical optimization and machine learning. In particular, SDPs with a massive amount of linear constraints appear naturally as convex relaxations of combinatorial optimization problems. For such problems, the development of scalable algorithms is a very active field of research [41, 14]. In this paper, we introduce a new path-following/homotopy strategy to solve such massive conic-constrained problems, building on recent advances in projection-free methods for self-concordant minimization [43, 11, 13]. Our theoretical analysis and numerical experiments scheme show that the method is competitive with state-of-the-art solvers in terms of its iteration complexity and gives promising results in practice. Future work will focus on improvements of our deployed CndG solver to accelerate the subroutines.

References

  • [1] Y. Ye, GSet random graphs, https://www.cise.u.edu/research/sparse/matrices/Gset/ (accessed May, 2022).
  • Arora et al. [2005] S. Arora, E. Hazan, and S. Kale. Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05), pages 339–348, 2005. doi: 10.1109/SFCS.2005.35.
  • Bandeira et al. [2016] Afonso S. Bandeira, Nicolas Boumal, and Vladislav Voroninski. On the low-rank approach for semidefinite programs arising in synchronization and community detection. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 361–382, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR. URL https://proceedings.mlr.press/v49/bandeira16.html.
  • Boyd and Vandenberghe [2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004. ISBN 1107394007.
  • Chandrasekaran and Shah [2017] Venkat Chandrasekaran and Parikshit Shah. Relative entropy optimization and its applications. Mathematical Programming, 161:1–32, 2017.
  • Charikar and Wirth [2004] M. Charikar and A. Wirth. Maximizing quadratic programs: extending grothendieck’s inequality. In 45th Annual IEEE Symposium on Foundations of Computer Science, pages 54–60, 2004. doi: 10.1109/FOCS.2004.39.
  • CVX Research [2012] Inc. CVX Research. CVX: Matlab software for disciplined convex programming, version 2.0. http://cvxr.com/cvx, August 2012.
  • d’Aspremont et al. [2008] Alexandre d’Aspremont, Francis Bach, and Laurent El Ghaoui. Optimal solutions for sparse principal component analysis. Journal of Machine Learning Research, 9(7), 2008.
  • Diaconis and Stroock [1991] Persi Diaconis and Daniel W. Stroock. Geometric bounds for eigenvalues of Markov chains. Annals of Applied Probability, 1:36–61, 1991.
  • Dunn and Harshbarger [1978] J. C Dunn and S Harshbarger. Conditional gradient algorithms with open loop step size rules. Journal of Mathematical Analysis and Applications, 62(2):432–444, 1978. doi: https://doi.org/10.1016/0022-247X(78)90137-3. URL https://www.sciencedirect.com/science/article/pii/0022247X78901373.
  • Dvurechensky et al. [2020] Pavel Dvurechensky, Petr Ostroukhov, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Self-concordant analysis of Frank-Wolfe algorithms. In Hal Daume III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2814–2824, Virtual, 13–18 Jul 2020. PMLR. URL http://proceedings.mlr.press/v119/dvurechensky20a.html. arXiv:2002.04320.
  • Dvurechensky et al. [2021] Pavel Dvurechensky, Shimrit Shtern, and Mathias Staudigl. First-order methods for convex optimization. EURO Journal on Computational Optimization, 9:100015, 2021. ISSN 2192-4406. doi: https://doi.org/10.1016/j.ejco.2021.100015. URL https://www.sciencedirect.com/science/article/pii/S2192440621001428. arXiv:2101.00935.
  • Dvurechensky et al. [2022] Pavel Dvurechensky, Kamil Safin, Shimrit Shtern, and Mathias Staudigl. Generalized self-concordant analysis of frank–wolfe algorithms. Mathematical Programming, 2022. doi: 10.1007/s10107-022-01771-1. URL https://doi.org/10.1007/s10107-022-01771-1.
  • Elbassioni et al. [2022] Khaled Elbassioni, Kazuhisa Makino, and Waleed Najy. Finding sparse solutions for packing and covering semidefinite programs. SIAM Journal on Optimization, 32(2):321–353, 2022. doi: 10.1137/20M137570X. URL https://doi.org/10.1137/20M137570X.
  • Fawzi et al. [2019] Hamza Fawzi, James Saunderson, and Pablo A Parrilo. Semidefinite approximations of the matrix logarithm. Foundations of Computational Mathematics, 19:259–296, 2019.
  • Faybusovich and Zhou [2020] Leonid Faybusovich and Cunlu Zhou. Self-concordance and matrix monotonicity with applications to quantum entanglement problems. Applied Mathematics and Computation, 375:125071, 2020. doi: https://doi.org/10.1016/j.amc.2020.125071. URL https://www.sciencedirect.com/science/article/pii/S0096300320300400.
  • Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 2019/09/05 1956. doi: 10.1002/nav.3800030109. URL https://doi.org/10.1002/nav.3800030109.
  • Freund and Grigas [2016] Robert M Freund and Paul Grigas. New analysis and results for the frank–wolfe method. Mathematical Programming, 155(1-2):199–230, 2016.
  • Goemans and Williamson [1995] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
  • Göring et al. [2008] Frank Göring, Christoph Helmberg, and Markus Wappler. Embedded in the shadow of the separator. SIAM Journal on Optimization, 19(1):472–501, 2008. doi: 10.1137/050639430. URL https://doi.org/10.1137/050639430.
  • Hale et al. [2008] Elaine T. Hale, Wotao Yin, and Yin Zhang. Fixed-point continuation for 1\ell_{1}-minimization: Methodology and convergence. SIAM Journal on Optimization, 19(3):1107–1130, 2008. doi: 10.1137/070698920. URL https://doi.org/10.1137/070698920.
  • Horodecki et al. [1996] Michał Horodecki, Paweł Horodecki, and Ryszard Horodecki. Separability of mixed states: necessary and sufficient conditions. Physics Letters A, 223(1):1–8, 1996. doi: https://doi.org/10.1016/S0375-9601(96)00706-2. URL https://www.sciencedirect.com/science/article/pii/S0375960196007062.
  • Iyengar et al. [2011] G. Iyengar, D. J. Phillips, and C. Stein. Approximating semidefinite packing programs. SIAM Journal on Optimization, 21(1):231–268, 2011. doi: 10.1137/090762671. URL https://doi.org/10.1137/090762671.
  • Iyengar et al. [2010] Garud Iyengar, David J Phillips, and Cliff Stein. Feasible and accurate algorithms for covering semidefinite programs. Scandinavian Workshop on Algorithm Theory, pages 150–162, 2010.
  • Jaggi [2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.
  • Lacoste-Julien et al. [2013] Simon Lacoste-Julien, Martin Jaggi, Mark Schmidt, and Patrick Pletscher. Block-coordinate Frank-Wolfe optimization for structural SVMs. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 53–61, Atlanta, Georgia, USA, 2013. PMLR. URL https://proceedings.mlr.press/v28/lacoste-julien13.html.
  • Lanckriet et al. [2004] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine learning research, 5(Jan):27–72, 2004.
  • Laurent and Rendl [2005] Monique Laurent and Franz Rendl. Semidefinite programming and integer programming. Handbooks in Operations Research and Management Science, 12:393–514, 2005.
  • Nesterov [2018] Yurii Nesterov. Lectures on Convex Optimization, volume 137 of Springer Optimization and Its Applications. Springer International Publishing, 2018.
  • Nesterov and Nemirovski [1994] Yurii Nesterov and Arkadi Nemirovski. Interior Point Polynomial methods in Convex programming. SIAM Publications, 1994.
  • Nielsen and Chuang [2010] M. Nielsen and I Chuang. Quantum Computation and Quantum Information. Cambridge: Cambridge University Press., 2010.
  • Peng and Wei [2007] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007. doi: 10.1137/050641983. URL https://doi.org/10.1137/050641983.
  • Sun et al. [2006] Jun Sun, Stephen Boyd, Lin Xiao, and Persi Diaconis. The fastest mixing markov process on a graph and a connection to a maximum variance unfolding problem. SIAM Review, 48(4):681–699, 2022/01/12/ 2006. URL http://www.jstor.org/stable/20453871.
  • Tibshirani and Taylor [2011] Ryan J. Tibshirani and Jonathan Taylor. The solution path of the generalized lasso. Ann. Statist., 39(3):1335–1371, 2011. doi: 10.1214/11-AOS878. URL https://projecteuclid.org:443/euclid.aos/1304514656.
  • Toh et al. [1999] K. C. Toh, M. J. Todd, and R. H. Tütüncü. Sdpt3 —a matlab software package for semidefinite programming, version 1.3. Optimization Methods and Software, 11(1-4):545–581, 01 1999. doi: 10.1080/10556789908805762. URL https://doi.org/10.1080/10556789908805762.
  • Weinberger and Saul [2009] Kilian Q Weinberger and Lawrence K Saul. Distance metric learning for large margin nearest neighbor classification. Journal of machine learning research, 10(2), 2009.
  • Wen et al. [2010] Zaiwen Wen, Wotao Yin, Donald Goldfarb, and Yin Zhang. A fast algorithm for sparse reconstruction based on shrinkage, subspace optimization, and continuation. SIAM Journal on Scientific Computing, 32(4):1832–1857, 2010. doi: 10.1137/090747695. URL https://doi.org/10.1137/090747695.
  • Xiao and Zhang [2013] Lin Xiao and Tong Zhang. A proximal-gradient homotopy method for the sparse least-squares problem. SIAM Journal on Optimization, 23(2):1062–1091, 2013. doi: 10.1137/120869997. URL https://doi.org/10.1137/120869997.
  • [39] Alp Yurtsever. SketchyCGAL MATLAB code. URL https://github.com/alpyurtsever/SketchyCGAL.
  • Yurtsever et al. [2018] Alp Yurtsever, Olivier Fercoq, Francesco Locatello, and Volkan Cevher. A conditional gradient framework for composite convex minimization with applications to semidefinite programming. pages 5727–5736. PMLR, 2018. ISBN 2640-3498.
  • Yurtsever et al. [2021] Alp Yurtsever, Joel A. Tropp, Olivier Fercoq, Madeleine Udell, and Volkan Cevher. Scalable semidefinite programming. SIAM Journal on Mathematics of Data Science, 3(1):171–200, 2021. doi: 10.1137/19M1305045. URL https://doi.org/10.1137/19M1305045.
  • Zhao [2023] Renbo Zhao. An away-step frank-wolfe method for minimizing logarithmically-homogeneous barriers. arXiv preprint arXiv:2305.17808, 2023.
  • Zhao and Freund [2022] Renbo Zhao and Robert M. Freund. Analysis of the Frank–Wolfe method for convex composite optimization involving a logarithmically-homogeneous barrier. Mathematical Programming, 2022. doi: 10.1007/s10107-022-01820-9. URL https://doi.org/10.1007/s10107-022-01820-9.
  • Zinchenko et al. [2010] Yuriy Zinchenko, Shmuel Friedland, and Gilad Gour. Numerical estimation of the relative entropy of entanglement. Physical Review A, 82(5):052336, 2010.

Appendix A Modifications due to inexact oracles

In this appendix we give the missing proofs of the theoretical statements in Sections 4 and 4.2. Consider the function 𝚎~t(x):=s~t(x)xx\tilde{\mathtt{e}}_{t}(x):=\lVert\tilde{s}_{t}(x)-x\rVert_{x}, where s~t(x)\tilde{s}_{t}(x) is any member of the set g(δ,θ)(1tF(x))\mathcal{L}^{(\delta,\theta)}_{g}(\frac{1}{t}\nabla F(x)) (cf. Definition 4.1). It is easy to see that 5.1 translates to the case with inexact oracle feedback as

t1𝚎~t(x)ν/t+𝖦𝖺𝗉~t(x)+Ωg.t^{-1}\tilde{\mathtt{e}}_{t}(x)\leq\nu/t+\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)+\Omega_{g}. (A.1)

Moreover, one can show that

𝚎~t(x)tν(𝖦𝖺𝗉~t(x)Ωg).\tilde{\mathtt{e}}_{t}(x)\geq\frac{t}{\sqrt{\nu}}\left(\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x)-\Omega_{g}\right). (A.2)

A.1 Proof of Proposition 4.3

The proof of this proposition mimics the proof of the exact regime (cf. Proposition 3.2), with some important twists. We split the analysis into two phases, bound the maximum number of iterations the algorithms can spend on these phases, and add these numbers up to obtain the worst-case upper bound.

First, for a round kk of ICG(x,t,η,δ,θ)\texttt{ICG}(x,t,\eta,\delta,\theta), we use the abbreviations

s~tk=s~t(xtk),𝚎~tks~tkxtkxtk,G~tk𝖦𝖺𝗉~t(xtk),α~tkα~t(xtk).\tilde{s}_{t}^{k}=\tilde{s}_{t}(x^{k}_{t}),\;\tilde{\mathtt{e}}_{t}^{k}\equiv\lVert\tilde{s}_{t}^{k}-x^{k}_{t}\rVert_{x^{k}_{t}},\;\tilde{G}^{k}_{t}\equiv\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k}_{t}),\;\tilde{\alpha}_{t}^{k}\equiv\tilde{\alpha}_{t}(x^{k}_{t}).

We further redefine the phases as 𝕂I(t):={k|𝖦𝖺𝗉~t(xtk)>𝚊t}\mathbb{K}_{I}(t):=\{k\in\mathbb{N}|\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k}_{t})>\mathtt{a}_{t}\} (phase I) and 𝕂II(t):={k|𝖦𝖺𝗉~t(xtk)𝚊t}\mathbb{K}_{II}(t):=\{k\in\mathbb{N}|\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k}_{t})\leq\mathtt{a}_{t}\} (phase II), where 𝚊t:=νt+Ωg\mathtt{a}_{t}:=\frac{\nu}{t}+\Omega_{g}.

Note that in the analysis we use the inequality Δt(xtk)>η+θ\Delta_{t}(x^{k}_{t})>\eta+\theta since our goal is to estimate the number of iterations to achieve Δt(xtk)η+θ\Delta_{t}(x^{k}_{t})\leq\eta+\theta. Moreover, if the stopping condition 𝖦𝖺𝗉~t(xtk)G~tkηδ\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k}_{t})\equiv\tilde{G}^{k}_{t}\leq\eta\delta holds, we obtain by the definition of the ILMO and inequality (3.5) that

ηδ𝖦𝖺𝗉~t(xtk)δ(𝖦𝖺𝗉t(xtk)θ)δ(Δt(xtk)θ),\eta\delta\geq\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k}_{t})\geq\delta(\operatorname{\mathsf{Gap}}_{t}(x^{k}_{t})-\theta)\geq\delta(\Delta_{t}(x^{k}_{t})-\theta), (A.3)

which implies that Δt(xtk)η+θ\Delta_{t}(x^{k}_{t})\leq\eta+\theta. Thus, in the analysis we also use the inequality G~tk>ηδ>0\tilde{G}^{k}_{t}>\eta\delta>0. We start by showing that the sequence {Δt(xtk)}k\{\Delta_{t}(x_{t}^{k})\}_{k\in\mathbb{N}} is monotonically decreasing.

Monotonicity of Δt(xkt)\Delta_{t}(x^{t}_{k})

Consider k𝕂I(t)k\in\mathbb{K}_{I}(t). Combining the condition G~tk>νt+Ωg\tilde{G}^{k}_{t}>\frac{\nu}{t}+\Omega_{g} of this phase with (A.2), we deduce that 𝚎~tkν1\tilde{\mathtt{e}}^{k}_{t}\geq\sqrt{\nu}\geq 1. This in turn implies tG~tk𝚎~tk(𝚎~tk+tG~tk)<tG~tk𝚎~k+tG~tk<1\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}^{k}_{t}(\tilde{\mathtt{e}}^{k}_{t}+t\tilde{G}^{k}_{t})}<\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}^{k}+t\tilde{G}^{k}_{t}}<1. Hence, at this iteration algorithm ICG(x,t,η,δ,θ)\texttt{ICG}(x,t,\eta,\delta,\theta) chooses the step size α~tk=tG~tk𝚎~tk(𝚎~~tk+tG~tk)\tilde{\alpha}_{t}^{k}=\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}^{k}_{t}(\tilde{\tilde{\mathtt{e}}}^{k}_{t}+t\tilde{G}^{k}_{t})}. The per-iteration reduction of the potential function can then be estimated just as in the derivation of (5.5):

Vt(xtk+1)Vt(xtk)1tω(tG~tk𝚎~tk).V_{t}(x^{k+1}_{t})\leq V_{t}(x^{k}_{t})-\frac{1}{t}\omega\left(\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}_{t}^{k}}\right). (A.4)

Note that the r.h.s. is well-defined since G~tk>ηδ>0\tilde{G}^{k}_{t}>\eta\delta>0 as the stopping condition is not yet achieved. The above inequality, using the definition of the potential function gap as Δt(xtk)=Vt(xtk)Vt(z(t))\Delta_{t}(x^{k}_{t})=V_{t}(x^{k}_{t})-V_{t}(z^{\ast}(t)), readily yields

Δt(xtk)Δt(xtk+1)1tω(tG~tk𝚎~tk)0.\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{1}{t}\omega\left(\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}^{k}_{t}}\right)\geq 0. (A.5)

Now, consider k𝕂II(t)k\in\mathbb{K}_{II}(t), which means that G~tkνt+Ωg\tilde{G}_{t}^{k}\leq\frac{\nu}{t}+\Omega_{g}. There are two cases we need to consider, depending on the step size α~tk\tilde{\alpha}_{t}^{k}. We first consider the case where α~tk=1\tilde{\alpha}_{t}^{k}=1. Then tG~tk𝚎~tk(𝚎~tk+tG~tk)t\tilde{G}_{t}^{k}\geq\tilde{\mathtt{e}}^{k}_{t}(\tilde{\mathtt{e}}_{t}^{k}+t\tilde{G}_{t}^{k}), which implies 𝚎~tk(0,1)\tilde{\mathtt{e}}_{t}^{k}\in(0,1). From there we conclude that G~tk(𝚎~tk)2t(1𝚎~tk)\tilde{G}_{t}^{k}\geq\frac{(\tilde{\mathtt{e}}_{t}^{k})^{2}}{t(1-\tilde{\mathtt{e}}_{t}^{k})}, and, by (3.7) and (2.7), that

Vt(xtk+1)\displaystyle V_{t}(x^{k+1}_{t}) Vt(xtk)G~tk+1tω(𝚎~tk)\displaystyle\leq V_{t}(x^{k}_{t})-\tilde{G}_{t}^{k}+\frac{1}{t}\omega_{\ast}(\tilde{\mathtt{e}}_{t}^{k})
Vt(xtk)G~tk+1t(𝚎~tk)22(1𝚎~tk)\displaystyle\leq V_{t}(x^{k}_{t})-\tilde{G}_{t}^{k}+\frac{1}{t}\frac{(\tilde{\mathtt{e}}^{k}_{t})^{2}}{2(1-\tilde{\mathtt{e}}_{t}^{k})}
Vt(xtk)G~tk2.\displaystyle\leq V_{t}(x^{k}_{t})-\frac{\tilde{G}^{k}_{t}}{2}.

Therefore, using again that G~tkνt+Ωg\tilde{G}_{t}^{k}\leq\frac{\nu}{t}+\Omega_{g}, we have

Δt(xtk)Δt(xtk+1)G~tk2(G~tk)22(ν/t+Ωg)t(G~tk)212(ν+tΩg)20,\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{\tilde{G}^{k}_{t}}{2}\geq\frac{(\tilde{G}^{k}_{t})^{2}}{2(\nu/t+\Omega_{g})}\geq\frac{t(\tilde{G}^{k}_{t})^{2}}{12(\nu+t\Omega_{g})^{2}}\geq 0, (A.6)

where the last inequality follows G~tk>ηδ>0\tilde{G}^{k}_{t}>\eta\delta>0.

Consider now the case where α~tk<1\tilde{\alpha}_{t}^{k}<1, i.e., α~k=tG~tk𝚎~tk(𝚎~tk+tG~tk)\tilde{\alpha}_{k}=\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}^{k}_{t}(\tilde{\mathtt{e}}^{k}_{t}+t\tilde{G}^{k}_{t})}. Then, using this stepsize, just as in the derivation of (5.5) and (A.4), we get

Vt(xtk+1)Vt(xtk)1tω(tG~tk𝚎~tk).V_{t}(x^{k+1}_{t})\leq V_{t}(x^{k}_{t})-\frac{1}{t}\omega\left(\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}_{t}^{k}}\right). (A.7)

Again, since G~tkνt+Ωg\tilde{G}_{t}^{k}\leq\frac{\nu}{t}+\Omega_{g} for k𝕂II(t)k\in\mathbb{K}_{II}(t), we have G~tkG~tk+ν/t+Ωg12\frac{\tilde{G}^{k}_{t}}{\tilde{G}^{k}_{t}+\nu/t+\Omega_{g}}\leq\frac{1}{2}. Using the monotonicity of ω(τ)\omega(\tau), inequality (A.1), and that ω(τ)τ23\omega(\tau)\geq\frac{\tau^{2}}{3} for τ1/2\tau\leq 1/2 [43, Prop. 1], we can develop the bound

1tω(tG~tk𝚎~tk)=1tω(G~tk𝚎~tk/t)1tω(G~tkG~tk+ν/t+Ωg)13t(G~tkG~tk+ν/t+Ωg)2.\frac{1}{t}\omega\left(\frac{t\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}_{t}^{k}}\right)=\frac{1}{t}\omega\left(\frac{\tilde{G}^{k}_{t}}{\tilde{\mathtt{e}}_{t}^{k}/t}\right)\geq\frac{1}{t}\omega\left(\frac{\tilde{G}^{k}_{t}}{\tilde{G}^{k}_{t}+\nu/t+\Omega_{g}}\right)\geq\frac{1}{3t}\left(\frac{\tilde{G}^{k}_{t}}{\tilde{G}^{k}_{t}+\nu/t+\Omega_{g}}\right)^{2}.

This and (A.7), again by G~tkνt+Ωg\tilde{G}_{t}^{k}\leq\frac{\nu}{t}+\Omega_{g}, implies that also in this regime we obtain the estimate

Δt(xtk)Δt(xtk+1)t(G~tk)212(ν+tΩg)2.\Delta_{t}(x^{k}_{t})-\Delta_{t}(x^{k+1}_{t})\geq\frac{t(\tilde{G}^{k}_{t})^{2}}{12(\nu+t\Omega_{g})^{2}}. (A.8)

As we see, the above inequality holds on the whole phase II, both when α~tk<1\tilde{\alpha}_{t}^{k}<1 and when α~tk=1\tilde{\alpha}_{t}^{k}=1. In particular, since G~tk>ηδ>0\tilde{G}^{k}_{t}>\eta\delta>0, together with (A.6) imply that {Δt(xtk)}k\{\Delta_{t}(x^{k}_{t})\}_{k\in\mathbb{N}} is monotonically decreasing.

We now move to the analysis of the iteration complexity.

Analysis for Phase I

Let let {ki}i0\{k_{i}\}_{i\in\mathbb{N}_{0}} denote the increasing set of indices enumerating the elements of the set 𝕂I(t)\mathbb{K}_{I}(t). Since ki𝕂I(t)k_{i}\in\mathbb{K}_{I}(t), it holds that G~tki>νt+Ωg\tilde{G}^{k_{i}}_{t}>\frac{\nu}{t}+\Omega_{g} and so tG~tkν+tG~tk+tΩg>12\frac{t\tilde{G}^{k}_{t}}{\nu+t\tilde{G}^{k}_{t}+t\Omega_{g}}>\frac{1}{2}. Together with (A.1), this implies

tG~tk𝚎~tkG~tkν/t+G~tk+Ωg=tG~tkν+tG~tk+tΩg>12.\frac{t\tilde{G}_{t}^{k}}{\tilde{\mathtt{e}}^{k}_{t}}\geq\frac{\tilde{G}^{k}_{t}}{\nu/t+\tilde{G}_{t}^{k}+\Omega_{g}}=\frac{t\tilde{G}^{k}_{t}}{\nu+t\tilde{G}_{t}^{k}+t\Omega_{g}}>\frac{1}{2}.

Using the monotonicity of ω()\omega(\cdot) and that ω(τ)τ5.3\omega(\tau)\geq\frac{\tau}{5.3} for τ1/2\tau\geq 1/2 [43, Prop. 1], we further obtain

Δt(xtki)Δt(xtki+1)1tω(tG~tki𝚎~tki)G~tki5.3(ν+tΩg+tG~tki).\Delta_{t}(x^{k_{i}}_{t})-\Delta_{t}(x^{k_{i}+1}_{t})\geq\frac{1}{t}\omega\left(\frac{t\tilde{G}^{k_{i}}_{t}}{\tilde{\mathtt{e}}^{k_{i}}_{t}}\right)\geq\frac{\tilde{G}^{k_{i}}_{t}}{5.3(\nu+t\Omega_{g}+t\tilde{G}_{t}^{k_{i}})}. (A.9)

By our definition of the inexact oracle, inequalities (3.5) and Δt(xtki)>η+θ\Delta_{t}(x^{k_{i}}_{t})>\eta+\theta, we see that

G~tki𝖦𝖺𝗉~t(xtki)δ(𝖦𝖺𝗉t(xtki)θ)δ(Δt(xtki)θ)>η>0.\tilde{G}^{k_{i}}_{t}\equiv\widetilde{\operatorname{\mathsf{Gap}}}_{t}(x^{k_{i}}_{t})\geq\delta(\operatorname{\mathsf{Gap}}_{t}(x^{k_{i}}_{t})-\theta)\geq\delta(\Delta_{t}(x^{k_{i}}_{t})-\theta)>\eta>0.

Combining this with (A.9) and the fact that the function xxa+txx\mapsto\frac{x}{a+tx} is strictly increasing for a,t>0a,t>0 and x>0x>0, we obtain

Δt(xtki)Δt(xtki+1)δ(Δt(xtki)θ)5.3(tδ(Δt(xtki)θ)+ν+tΩg)δ(Δt(xtki)θ)5.3(tδ(Δt(xt0)θ)+ν+tΩg),\Delta_{t}(x^{k_{i}}_{t})-\Delta_{t}(x^{k_{i+1}}_{t})\geq\frac{\delta(\Delta_{t}(x^{k_{i}}_{t})-\theta)}{5.3(t\delta(\Delta_{t}(x^{k_{i}}_{t})-\theta)+\nu+t\Omega_{g})}\geq\frac{\delta(\Delta_{t}(x^{k_{i}}_{t})-\theta)}{5.3(t\delta(\Delta_{t}(x^{0}_{t})-\theta)+\nu+t\Omega_{g})}, (A.10)

where in the first and last inequalities follow from {Δt(xtk)}k0\{\Delta_{t}(x^{k}_{t})\}_{k\in\mathbb{N}_{0}} being monotonically decreasing and Δt(xtki)>η+θθ\Delta_{t}(x^{k_{i}}_{t})>\eta+\theta\geq\theta. Rearranging the terms and defining q:=δ5.3(tδ(Δt(xt0)θ)+ν+tΩg)q:=\frac{\delta}{5.3(t\delta(\Delta_{t}(x^{0}_{t})-\theta)+\nu+t\Omega_{g})}, which we know to be in (0,1)(0,1) since ν1\nu\geq 1 and Δt(xt0)>η+θ\Delta_{t}(x^{0}_{t})>\eta+\theta, it follows from the previous display that

Δt(xtki+1)Δt(xtki)(1q)+qθΔt(xt0)(1q)i+θexp(iq)Δt(xt0)+θ.{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\Delta_{t}(x^{k_{i+1}}_{t})\leq\Delta_{t}(x^{k_{i}}_{t})(1-q)+q\theta\leq\Delta_{t}(x^{0}_{t})(1-q)^{i}+\theta\leq\exp(-iq)\Delta_{t}(x^{0}_{t})+\theta.} (A.11)

On the other hand, we have

Δt(xtki)Δt(xtki)Δt(xtki+1)G~tki5.3(ν+tΩg+tG~tki)>110.6t\Delta_{t}(x^{k_{i}}_{t})\geq\Delta_{t}(x^{k_{i}}_{t})-\Delta_{t}(x^{k_{i}+1}_{t})\geq\frac{\tilde{G}^{k_{i}}_{t}}{5.3(\nu+t\Omega_{g}+t\tilde{G}_{t}^{k_{i}})}>\frac{1}{10.6t}

since tG~tkiν+tG~tki+tΩg>12\frac{t\tilde{G}^{k_{i}}_{t}}{\nu+t\tilde{G}^{k_{i}}_{t}+t\Omega_{g}}>\frac{1}{2} for ki𝕂I(t)k_{i}\in\mathbb{K}_{I}(t). Moreover, we know that Δt(xtki)>η+θ\Delta_{t}(x^{k_{i}}_{t})>\eta+\theta, which gives us in combination with (A.11)

max{110.6t,η+θ}Δt(xtki)exp(iq)Δt(xt0)+θ.\max\left\{\frac{1}{10.6t},\eta+\theta\right\}\leq{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\Delta_{t}(x^{k_{i}}_{t})\leq\exp(-iq)\Delta_{t}(x^{0}_{t})}+\theta. (A.12)

Solving w.r.t. ii, we see that the number of iterations in 𝕂I(t)\mathbb{K}_{I}(t) is bounded as

Kt(xt0):=5.3(tδ(Δt(xt0)θ)+ν+tΩg)δlog(min{10.6tΔt(xt0)(110.6tθ)+,Δt(xt0)η}).K_{t}(x_{t}^{0}):=\left\lceil\frac{5.3(t\delta(\Delta_{t}(x^{0}_{t})-\theta)+\nu+t\Omega_{g})}{\delta}\log\left(\min\left\{\frac{10.6t\Delta_{t}(x^{0}_{t})}{(1-10.6t\theta)_{+}},\frac{\Delta_{t}(x^{0}_{t})}{\eta}\right\}\right)\right\rceil. (A.13)
Analysis for Phase II

Let let {j}i0\{\ell_{j}\}_{i\in\mathbb{N}_{0}} denote the increasing set of indices enumerating the elements of the set 𝕂II(t)\mathbb{K}_{II}(t). Let Nt(xt0)N_{t}(x_{t}^{0}) be a bound on the number of iterations until Δt(xtk)η+θ\Delta_{t}(x^{k}_{t})\leq\eta+\theta belonging to phase II. We will show that setting Nt(xt0)=12(ν+tΩg)2tδ2(1η1Δt(x0)θ)+N_{t}(x_{t}^{0})=\left\lceil\frac{12(\nu+t\Omega_{g})^{2}}{t\delta^{2}}\left(\frac{1}{\eta}-\frac{1}{\Delta_{t}(x^{0})-\theta}\right)_{+}\right\rceil is such a bound. Indeed, if Δt(xtNt(xt0)+1)θ\Delta_{t}(x^{N_{t}(x^{0}_{t})+1}_{t})\leq\theta then the proof is done, otherwise Δt(xtj)θ\Delta_{t}(x_{t}^{\ell_{j}})\geq\theta for all jNt(xt0)+1j\leq N_{t}(x_{t}^{0})+1 from monotonicity of the sequence {Δt(xtj)}j𝕂II(t)\{\Delta_{t}(x^{\ell_{j}}_{t})\}_{\ell_{j}\in\mathbb{K}_{II}(t)}. Calling ΔtjΔ(xtj(t))\Delta_{t}^{j}\equiv{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\Delta(x^{\ell_{j}(t)}_{t})} and using the monotonicity of this sequence, we see from (A.8) and (A.3) that, for j=1,,Nt1(xt0)j=1,\ldots,N_{t}-1(x_{t}^{0}),

ΔtjΔtj+1ΔtjΔt(xtj(t)+1)t(G~tj(t))212(ν+tΩg)2tδ2(Δtjθ)(Δtj+1θ)12(ν+tΩg)2.\Delta^{j}_{t}-\Delta^{j+1}_{t}\geq\Delta^{j}_{t}-\Delta_{t}(x^{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\ell_{j}}(t)+1}_{t})\geq\frac{t(\tilde{G}^{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\ell_{j}(t)}}_{t})^{2}}{12(\nu+t\Omega_{g})^{2}}\geq\frac{t\delta^{2}(\Delta^{j}_{t}-\theta)(\Delta^{j+1}_{t}-\theta)}{12(\nu+t\Omega_{g})^{2}}.

This implies that for j=1,,Nt(xt0)j=1,\ldots,{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}N_{t}(x_{t}^{0})} that

1Δtj+1θ1Δtjθtδ212(ν+tΩg)2,j=1,,Nt(xt0).\frac{1}{\Delta^{j+1}_{t}-\theta}-\frac{1}{\Delta^{j}_{t}-\theta}\geq\frac{t\delta^{2}}{12(\nu+t\Omega_{g})^{2}},\quad j=1,\ldots,N_{t}(x_{t}^{0}).

Telescoping this expression together with Δt1Δt(xt0)\Delta_{t}^{1}\leq\Delta_{t}(x^{0}_{t}), which stems from the entire sequence {Δt(xtk)}k0\{\Delta_{t}(x^{k}_{t})\}_{k\in\mathbb{N}_{0}} being monotonically decreasing, we see that

1ΔtNt(xt0)+1θ(Nt(xt0)1)tδ212(ν+tΩg)2+1Δt1θ(Nt(xt0)1)tδ212(ν+tΩg)2+1Δt(xt0)θ1η,\frac{1}{\Delta^{N_{t}(x_{t}^{0})+1}_{t}-\theta}\geq{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\frac{(N_{t}(x_{t}^{0})-1)t\delta^{2}}{12(\nu+t\Omega_{g})^{2}}+\frac{1}{\Delta_{t}^{1}-\theta}}\geq\frac{(N_{t}(x_{t}^{0})-1)t\delta^{2}}{12(\nu+t\Omega_{g})^{2}}+\frac{1}{\Delta_{t}(x_{t}^{0})-\theta}\geq\frac{1}{\eta},

is satisfied for Nt(xt0)N_{t}(x^{0}_{t}).

We can therefore conclude that the number of iterations required to obtain Δt(xtk)η+θ\Delta_{t}(x_{t}^{k})\leq\eta+\theta does not exceed Nt(xt0)+Kt(xt0)+1N_{t}(x^{0}_{t})+K_{t}(x^{0}_{t})+1 which is exactly the value of N~(x0,η,δ,θ)\tilde{N}(x^{0},\eta,\delta,\theta), thus concluding the proof.

A.2 Proof of Proposition 4.2

Our analysis concentrates on the behaviour of the algorithm in Phase II since the estimate for number of iterations in Phase I still holds. Here we split the iterates of the set 𝕂II(t)={k1(t),,kp(t)}{k1,,kp}\mathbb{K}_{II}(t)=\{k_{1}(t),\ldots,k_{p}(t)\}\equiv\{k_{1},\ldots,k_{p}\} into two regimes so that 𝕂II(t)=HtLt\mathbb{K}_{II}(t)=H_{t}\cup L_{t} with Lt:={k𝕂II(t)|Δt(xtk)<θ}L_{t}:=\{k\in\mathbb{K}_{II}(t)|\Delta_{t}(x^{k}_{t})<\theta\} and Ht:={k𝕂II(t)|Δt(xk)θ}H_{t}:=\{k\in\mathbb{K}_{II}(t)|\Delta_{t}(x^{k})\geq\theta\}. By monotonicity of the potential function gap, we know that the indices of of HtH_{t} (’high states’) precede those of LtL_{t} (’low states’). We bound the size of each of these subsets in order to make 𝖦𝖺𝗉~t\widetilde{\operatorname{\mathsf{Gap}}}_{t} smaller than ηδ\eta\delta. Since the potential function gap is monotonically decreasing, we can organise the iterates on phase II as Ht={k1,,kq}H_{t}=\{k_{1},\ldots,k_{q}\} and accordingly, Lt={kq+1,,kp}L_{t}=\{k_{q+1},\ldots,k_{p}\} for some q{1,,p}q\in\{1,\ldots,p\}.

Bound on HtH_{t}:

For kj(t)Htk_{j}(t)\in H_{t} we know that

Δt(xtkj(t))Δt(xtkj+1(t))Δt(xtkj(t))Δt(xtkj(t)+1)t(G~tkj(t))212(ν+tΩg)2.\Delta_{t}(x^{k_{j}(t)}_{t})-\Delta_{t}(x^{k_{j+1}(t)}_{t})\geq\Delta_{t}(x^{k_{j}(t)}_{t})-\Delta_{t}(x^{k_{j}(t)+1}_{t})\geq\frac{t(\tilde{G}^{k_{j}(t)}_{t})^{2}}{12(\nu+t\Omega_{g})^{2}}. (A.14)

Define dj:=δ(Δt(xtkj(t))θ)d_{j}:=\delta(\Delta_{t}(x^{k_{j}(t)}_{t})-\theta) for j=1,,qj=1,\ldots,q, so that dj0d_{j}\geq 0. Then, we obtain the recursion

dj+1djΓj2Md_{j+1}-d_{j}\leq-\frac{\Gamma^{2}_{j}}{M}

for j{1,,q1}j\in\{1,\ldots,q-1\} and ΓjG~tkj(t)dj\Gamma_{j}\equiv\tilde{G}^{k_{j}(t)}_{t}\geq d_{j} and 1Mtδ12(ν+tΩg)2\frac{1}{M}\equiv\frac{t\delta}{12(\nu+t\Omega_{g})^{2}}. By [43, Proposition 4], we can conclude that

dq<Mq1 and min{Γ1,,Γq}<2Mq1.d_{q}<\frac{M}{q-1}\text{ and }\min\{\Gamma_{1},\ldots,\Gamma_{q}\}<\frac{2M}{q-1}.

Since we know that min{Γ1,,Γq}δη\min\{\Gamma_{1},\ldots,\Gamma_{q}\}\geq\delta\eta, we obtain the bound

q2Mδη+1=24(ν+tΩg)2tδ2η+1.q\leq\frac{2M}{\delta\eta}+1=\frac{24(\nu+t\Omega_{g})^{2}}{t\delta^{2}\eta}+1. (A.15)
Bound on LtL_{t}:

On this subset we cannot use the same argument as for the iterates in HtH_{t}, since we cannot guarantee that the sequence djd_{j} involved in the previous part of the proof is positive. However, we still know that (A.14) applies for the iterates in LtL_{t}. If we telescope this expression over the indices q+1,,pq+1,\ldots,p, and using the fact that Δt(xkj(t))[0,θ]\Delta_{t}(x^{k_{j}}(t))\in[0,\theta], we see that

θΔt(xtkq+1(t))Δt(xtkp+1(t))\displaystyle\theta\geq\Delta_{t}(x_{t}^{k_{q+1}(t)})-\Delta_{t}(x_{t}^{k_{p+1}(t)}) t(pq)12(ν+tΩg)2minq+1jp(G~tkj(t))2\displaystyle\geq\frac{t(p-q)}{12(\nu+t\Omega_{g})^{2}}\min_{q+1\leq j\leq p}\left(\tilde{G}_{t}^{k_{j}(t)}\right)^{2}
t(pq)η2δ212(ν+tΩg)2\displaystyle\geq\frac{t(p-q)\eta^{2}\delta^{2}}{12(\nu+t\Omega_{g})^{2}}

It follows

(pq)12θ(ν+tΩg)2tη2δ2.(p-q)\leq\frac{12\theta(\nu+t\Omega_{g})^{2}}{t\eta^{2}\delta^{2}}. (A.16)
Combining the bounds:

Combining the expressions (A.15) and (A.16) together with the bound (A.13), we see that

R~t(xt0,η,δ,θ)\displaystyle\tilde{R}_{t}(x^{0}_{t},\eta,\delta,\theta) =5.3(tδ(Δt(xt0)θ)+ν+tΩg)δlog(min{10.6tΔt(xt0)(110.6tθ)+,Δt(xt0)η})\displaystyle=\left\lceil\frac{5.3(t\delta(\Delta_{t}(x^{0}_{t})-\theta)+\nu+t\Omega_{g})}{\delta}\log\left(\min\left\{\frac{10.6t\Delta_{t}(x^{0}_{t})}{(1-10.6t\theta)_{+}},\frac{\Delta_{t}(x^{0}_{t})}{\eta}\right\}\right)\right\rceil
+12(ν+tΩg)2tδ2η(2+θη)+1\displaystyle+\left\lceil\frac{12(\nu+t\Omega_{g})^{2}}{t\delta^{2}\eta}\left(2+\frac{\theta}{\eta}\right)\right\rceil+1

defines an upper bound for the length of phase I.

A.3 Analysis of the outer loop and proof of Theorem 4.4

We set x^ixtiR~i\hat{x}_{i}\equiv x^{\tilde{R}_{i}}_{t_{i}} the last iterate of procedure ICG(xti0,ti,ηi,δ,θi)\operatorname{\texttt{ICG}}(x^{0}_{t_{i}},t_{i},\eta_{i},\delta,\theta_{i}), where R~i\tilde{R}_{i} is at most RG~,ti(xti0,ηi,δ,θi)R_{\tilde{G},t_{i}}(x^{0}_{t_{i}},\eta_{i},\delta,\theta_{i}) by 4.2. Therefore, we know that 𝖦𝖺𝗉~ti(x^i)ηiδ\widetilde{\operatorname{\mathsf{Gap}}}_{t_{i}}(\hat{x}_{i})\leq\eta_{i}\delta, and a-fortiori

Δti(x^i)𝖦𝖺𝗉ti(x^i)ηi+θi.\Delta_{t_{i}}(\hat{x}_{i})\leq\operatorname{\mathsf{Gap}}_{t_{i}}(\hat{x}_{i})\leq\eta_{i}+\theta_{i}.

By eq. 5.12 and the definition of the (δ,θi)(\delta,\theta_{i})-ILMO, we obtain

g(x^i)g(z(ti))\displaystyle g(\hat{x}_{i})-g(z^{\ast}(t_{i})) 𝖦𝖺𝗉ti(x^i)+1tF(x^i)[z(ti)x^i]\displaystyle\leq\operatorname{\mathsf{Gap}}_{t_{i}}(\hat{x}_{i})+\frac{1}{t}F^{\prime}(\hat{x}_{i})[z^{\ast}(t_{i})-\hat{x}_{i}]
𝖦𝖺𝗉~ti(x^i)δ+θi+νti\displaystyle\leq\frac{\widetilde{\operatorname{\mathsf{Gap}}}_{t_{i}}(\hat{x}_{i})}{\delta}+\theta_{i}+\frac{\nu}{t_{i}}
ηi+θi+νti.\displaystyle\leq\eta_{i}+\theta_{i}+\frac{\nu}{t_{i}}.

Hence, using 2.2, we observe

g(x^i)Opt=g(x^i)g(z(ti))+g(z(ti))Optηi+θi+2νti.g(\hat{x}_{i})-\operatorname{Opt}=g(\hat{x}_{i})-g(z^{\ast}(t_{i}))+g(z^{\ast}(t_{i}))-\operatorname{Opt}\leq\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}}.

Since ti=2νηit_{i}=\frac{2\nu}{\eta_{i}} and θi=ηi\theta_{i}=\eta_{i}, we obtain g(x^i)Opt3ηig(\hat{x}_{i})-\operatorname{Opt}\leq 3\eta_{i}. The latter, by the choice of the number of restarts II implies that

g(z^)Opt=g(x^I)Optε.g(\hat{z})-\operatorname{Opt}=g(\hat{x}_{I})-\operatorname{Opt}\leq\varepsilon.

Our next goal is to estimate the total number of inner iterations based on the estimation for the first epoch i=0i=0 and summation over epochs i=1,,Ii=1,...,I. For each epoch we use the estimate from 4.2, which we repeat here for convenience

R~ti(xti0,ηi,δ,θi)\displaystyle\tilde{R}_{t_{i}}(x^{0}_{t_{i}},\eta_{i},\delta,\theta_{i}) =5.3(tiδ(Δti(xi0)θi)+ν+tiΩg)δlog(min{10.6tiΔti(xti0)(110.6tiθi)+,Δti(xti0)ηi})\displaystyle=\left\lceil\frac{5.3(t_{i}\delta(\Delta_{t_{i}}(x^{0}_{i})-\theta_{i})+\nu+t_{i}\Omega_{g})}{\delta}\log\left(\min\left\{\frac{10.6t_{i}\Delta_{t_{i}}(x^{0}_{t_{i}})}{(1-10.6t_{i}\theta_{i})_{+}},\frac{\Delta_{t_{i}}(x^{0}_{t_{i}})}{\eta_{i}}\right\}\right)\right\rceil
+12(ν+tiΩg)2tiδ2ηi(2+θiηi)+1.\displaystyle\hskip 16.00003pt+\left\lceil\frac{12(\nu+t_{i}\Omega_{g})^{2}}{t_{i}\delta^{2}\eta_{i}}\left(2+\frac{\theta_{i}}{\eta_{i}}\right)\right\rceil+1. (A.17)

We see that we need to estimate the quantities ti(Δti(xi0)θi)t_{i}(\Delta_{t_{i}}(x^{0}_{i})-\theta_{i}) and Δti(xti0)ηi\frac{\Delta_{t_{i}}(x^{0}_{t_{i}})}{\eta_{i}}, which is our next step. Consider i1i\geq 1 and observe that

Vti+1(xti+10)Vti+1(z(ti+1))\displaystyle V_{t_{i+1}}(x^{0}_{t_{i+1}})-V_{t_{i+1}}(z^{\ast}(t_{i+1})) =Vti(xti+10)Vti(z(ti+1))\displaystyle=V_{t_{i}}(x^{0}_{t_{i+1}})-V_{t_{i}}(z^{\ast}(t_{i+1}))
+(1ti+11ti)(F(xti+10)F(z(ti+1)))\displaystyle+\left(\frac{1}{t_{i+1}}-\frac{1}{t_{i}}\right)(F(x^{0}_{t_{i+1}})-F(z^{\ast}(t_{i+1})))
Vti(xtiRi)Vti(z(ti))\displaystyle\leq V_{t_{i}}(x^{R_{i}}_{t_{i}})-V_{t_{i}}(z^{\ast}(t_{i}))
+(1ti+1ti)(Vti+1(xti+10)Vti+1(z(ti+1)))\displaystyle+\left(1-\frac{t_{i+1}}{t_{i}}\right)\left(V_{t_{i+1}}(x^{0}_{t_{i+1}})-V_{t_{i+1}}(z^{\ast}(t_{i+1}))\right)
+(1ti+1ti)(g(z(ti+1))g(xti+10)).\displaystyle+(1-\frac{t_{i+1}}{t_{i}})(g(z^{\ast}(t_{i+1}))-g(x^{0}_{t_{i+1}})).

Moreover,

g(xti+10)g(z(ti+1))g(xti+10)Optηi+θi+2νti.g(x^{0}_{t_{i+1}})-g(z^{\ast}(t_{i+1}))\leq g(x^{0}_{t_{i+1}})-\operatorname{Opt}\leq\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}}.

Since ti+1>tit_{i+1}>t_{i}, the above two inequalities imply

ti+1ti[Vti+1(xti+10)Vti+1(z(ti+1))]Vti(xtiRi)Vti(z(ti))+ti+1titi(ηi+θi+2νti).\displaystyle\frac{t_{i+1}}{t_{i}}[V_{t_{i+1}}(x^{0}_{t_{i+1}})-V_{t_{i+1}}(z^{\ast}(t_{i+1}))]\leq V_{t_{i}}(x^{R_{i}}_{t_{i}})-V_{t_{i}}(z^{\ast}(t_{i}))+\frac{t_{i+1}-{t_{i}}}{t_{i}}(\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}}).

Whence,

ti+1(Δti+1(xti+10)θi+1)\displaystyle t_{i+1}(\Delta_{t_{i+1}}(x^{0}_{t_{i+1}})-\theta_{i+1}) tiΔti(xtiRi)+(ti+1ti)(ηi+θi+2νti)ti+1θi+1\displaystyle\leq t_{i}\Delta_{t_{i}}(x^{R_{i}}_{t_{i}})+(t_{i+1}-t_{i})(\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}})-t_{i+1}\theta_{i+1}
ti(ηi+θi)+(ti+1ti)(ηi+θi+2νti)ti+1θi+1\displaystyle\leq t_{i}(\eta_{i}+\theta_{i})+(t_{i+1}-t_{i})(\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}})-t_{i+1}\theta_{i+1}
=4ν+(ti+1ti)6νti2ν=2ν+(1/σ1)6ν=2ν(3/σ2),\displaystyle=4\nu+(t_{i+1}-t_{i})\cdot\frac{6\nu}{t_{i}}-2\nu=2\nu+(1/\sigma-1)6\nu=2\nu(3/\sigma-2),

where we used our assignments ti+1=ti/σ,ηiti=θiti=2νt_{i+1}=t_{i}/\sigma,\eta_{i}t_{i}=\theta_{i}t_{i}=2\nu. In the same way, we obtain

Δti+1(xti+10)ηi+1\displaystyle\frac{\Delta_{t_{i+1}}(x^{0}_{t_{i+1}})}{\eta_{i+1}} titi+1ηi+1Δti(xtiRi)+ti+1titi+1ηi+1(ηi+θi+2νti)\displaystyle\leq\frac{t_{i}}{t_{i+1}\eta_{i+1}}\Delta_{t_{i}}(x^{R_{i}}_{t_{i}})+\frac{t_{i+1}-t_{i}}{t_{i+1}\eta_{i+1}}(\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}})
titi+1ηi+1(ηi+θi)+ti+1titi+1ηi+1(ηi+θi+2νti)\displaystyle\leq\frac{t_{i}}{t_{i+1}\eta_{i+1}}(\eta_{i}+\theta_{i})+\frac{t_{i+1}-t_{i}}{t_{i+1}\eta_{i+1}}(\eta_{i}+\theta_{i}+\frac{2\nu}{t_{i}})
=4νti+1ηi+1+ti+1titi+1ηi+16νti=2+(1/σ1)3=(3/σ1),\displaystyle=\frac{4\nu}{t_{i+1}\eta_{i+1}}+\frac{t_{i+1}-t_{i}}{t_{i+1}\eta_{i+1}}\frac{6\nu}{t_{i}}=2+(1/\sigma-1)3=(3/\sigma-1),

where we again used our assignments ti+1=ti/σ,ηiti=θiti=2νt_{i+1}=t_{i}/\sigma,\eta_{i}t_{i}=\theta_{i}t_{i}=2\nu.

Using these bounds, our assignments ηi=θi=2ν/ti\eta_{i}=\theta_{i}=2\nu/t_{i}, we see that 110.6tiθi<01-10.6t_{i}\theta_{i}<0 and the bound (A.3) simplifies to

R~ti(xti0,ηi,δ,θi)\displaystyle\tilde{R}_{t_{i}}(x^{0}_{t_{i}},\eta_{i},\delta,\theta_{i}) 5.3(2δν(3/σ2)+ν+tiΩg)δlog(3/σ1)+18(ν+tiΩg)2νδ2+1\displaystyle\leq\left\lceil\frac{5.3(2\delta\nu(3/\sigma-2)+\nu+t_{i}\Omega_{g})}{\delta}\log(3/\sigma-1)\right\rceil+\left\lceil\frac{18(\nu+t_{i}\Omega_{g})^{2}}{\nu\delta^{2}}\right\rceil+1 (A.18)
3+5.3(2δν(3/σ2)+ν)δlog(3/σ1)+18νδ2\displaystyle\leq 3+\frac{5.3(2\delta\nu(3/\sigma-2)+\nu)}{\delta}\log(3/\sigma-1)+\frac{18\nu}{\delta^{2}} (A.19)
+ti(5.3Ωgδlog(3/σ1)+18Ωgδ2)+ti218Ωg2νδ2.\displaystyle\hskip 8.00002pt+t_{i}\cdot\left(\frac{5.3\Omega_{g}}{\delta}\log(3/\sigma-1)+\frac{18\Omega_{g}}{\delta^{2}}\right)+t_{i}^{2}\cdot\frac{18\Omega_{g}^{2}}{\nu\delta^{2}}. (A.20)

Using the same bounds for iti\sum_{i}t_{i}, iti2\sum_{i}t_{i}^{2} as in the proof of 3.3 (see Section 5.2), we obtain that

i=1IR~i\displaystyle\sum_{i=1}^{I}\tilde{R}_{i} (3+5.3(2δν(3/σ2)+ν)δlog(3/σ1)+18νδ2)log(2η0/ε)log(1/σ)\displaystyle\leq\left(3+\frac{5.3(2\delta\nu(3/\sigma-2)+\nu)}{\delta}\log(3/\sigma-1)+\frac{18\nu}{\delta^{2}}\right)\frac{\log(2\eta_{0}/\varepsilon)}{\log(1/\sigma)}
+(5.3Ωgδlog(3/σ1)+18Ωgδ2)4νε(1σ)\displaystyle\hskip 8.00002pt+\left(\frac{5.3\Omega_{g}}{\delta}\log(3/\sigma-1)+\frac{18\Omega_{g}}{\delta^{2}}\right)\frac{4\nu}{\varepsilon(1-\sigma)}
+288νΩg2δ2ε2(1σ2).\displaystyle\hskip 8.00002pt+\frac{288\nu\Omega_{g}^{2}}{\delta^{2}\varepsilon^{2}(1-\sigma^{2})}. (A.21)

Finally, we estimate R~0\tilde{R}_{0} as follows using (A.3) and that t0η0=2νt_{0}\eta_{0}=2\nu, t0=ν/Ωgt_{0}=\nu/\Omega_{g}, θ0=η0\theta_{0}=\eta_{0}, δ<1\delta<1:

R~0\displaystyle\tilde{R}_{0} 5.3(t0δΔt0(x0)+ν+t0Ωg)δlog(Δt0(x0)η0)+18(ν+t0Ωg)2νδ2+1\displaystyle\leq\left\lceil\frac{5.3(t_{0}\delta\Delta_{t_{0}}(x^{0})+\nu+t_{0}\Omega_{g})}{\delta}\log\left(\frac{\Delta_{t_{0}}(x^{0})}{\eta_{0}}\right)\right\rceil+\left\lceil\frac{18(\nu+t_{0}\Omega_{g})^{2}}{\nu\delta^{2}}\right\rceil+1
5.3(F(x0)F(z(t0))+3ν)δlog(F(x0)F(z(t0))+ν2ν)+72νδ2+3.\displaystyle\leq\frac{5.3(F(x^{0})-F(z^{*}(t_{0}))+3\nu)}{\delta}\log\left(\frac{F(x^{0})-F(z^{*}(t_{0}))+\nu}{2\nu}\right)+\frac{72\nu}{\delta^{2}}+3.

Combining the latter with (A.21), we obtain that the leading complexity term is O(νΩg2δ2ε2(1σ2)).O\left(\frac{\nu\Omega_{g}^{2}}{\delta^{2}\varepsilon^{2}(1-\sigma^{2})}\right).

We remark that assuming that the oracle accuracies θi\theta_{i} have geometric decay is crucial in order to obtain the same dependence on the accuracy in the complexity as in the exact case. Indeed, if the oracle accuracy is fixed to be θiθ\theta_{i}\equiv\theta, then the last term in the bound (A.3) contains a term 12ti2Ωg2θ2νδ2ηi=12ti3Ωg2θ4ν2δ2\frac{12t_{i}^{2}\Omega_{g}^{2}\theta}{2\nu\delta^{2}\eta_{i}}=\frac{12t_{i}^{3}\Omega_{g}^{2}\theta}{4\nu^{2}\delta^{2}}, which after summation will imply that the leading term in the complexity would be O(1/ε4)O(1/\varepsilon^{4}).

Appendix B Added Details to the Numerical Experiments

In this section we report outcomes obtained by extensive numerical experiments illustrating the practical performance of our method. We implement Algorithm 2 using procedure CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} as solvers for the inner loops. We compare the results of our algorithm to two benchmarks: (1) an interior point solution of the problem, and (2) CGAL\operatorname{CGAL}, implemented following the suggestions in [41]. In order to enable a fair comparison with CGAL\operatorname{CGAL}, both CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} also used the LMO implementation of the random Lanczos algorithm provided as part of the CGAL\operatorname{CGAL} code. The interior point solution was obtained via the Matlab-based modelling inteface CVX [7] (version 2.2) with the SDPT3 solver (version 4.0) [35]. The implementation of CGAL was taken from [39]. All experiments have been conducted on an Intel(R) Xeon(R) Gold 6254 CPU @3.10GHz server limited to 4 threads per run and 512G total RAM using Matlab R2019b.

B.1 MaxCut Problem

The following class of SDPs is studied in [2]. This SDP arises in many algorithms such as approximating MaxCut, approximating the CUTNORM of a matrix, and approximating solutions to the little Grothendieck problem [6, 3]. The optimization problem is defined as

max\displaystyle\max 𝐂,𝐗\displaystyle\langle{\mathbf{C}},\mathbf{X}\rangle (MAXQP)
s.t. Xii1i=1,,n\displaystyle X_{ii}\leq 1\quad i=1,\ldots,n
𝐗0\displaystyle{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}}\succeq 0

Let 𝖰:={yn|yi1,i=1,,n}\mathsf{Q}:=\{y\in\mathbb{R}^{n}|y_{i}\leq 1,\;i=1,\ldots,n\}, and consider the conic hull of 𝖰\mathsf{Q}, defined as 𝖪:={(y,t)n×|1ty𝖰,t>0}n+1.\mathsf{K}:=\{(y,t)\in\mathbb{R}^{n}\times\mathbb{R}|\frac{1}{t}y\in\mathsf{Q},t>0\}\subset\mathbb{R}^{n+1}. This set admits the nn-logarithmically homogenous barrier

f(y,t)=i=1nlog(tyi)=i=1nlog(11tyi)nlog(t).f(y,t)=-\sum_{i=1}^{n}\log(t-y_{i})=-\sum_{i=1}^{n}\log(1-\frac{1}{t}y_{i})-n\log(t).

We can then reformulate (MAXQP) as

min𝐗,t\displaystyle{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\min_{\mathbf{X},t}} {g(𝐗):=𝐂,𝐗}\displaystyle\{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}g(\mathbf{X}):=-\langle{\mathbf{C}},\mathbf{X}\rangle}\} (B.1)
s.t. (𝐗,t)𝖷:={(𝐗,t)𝒮n×|𝐗0,tr(𝐗)n,t=1},\displaystyle(\mathbf{X},t)\in{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathsf{X}:=\{(\mathbf{X}^{\prime},t^{\prime})\in\mathcal{S}^{n}\times\mathbb{R}|\mathbf{X}^{\prime}\succeq 0,\;\operatorname{tr}(\mathbf{X}^{\prime})\leq n,t^{\prime}=1\}},
𝒫(𝐗,t)𝖪\displaystyle\mathcal{P}(\mathbf{X},t)\in\mathsf{K}

where 𝒫(𝐗,t):=[X11;;Xnn;t]\mathcal{P}(\mathbf{X},t):=[X_{11};\ldots;X_{nn};t]^{\top} is a linear homogenous mapping from 𝕊n×\mathbb{S}^{n}\times\mathbb{R} to n+1\mathbb{R}^{n+1}. Set F(𝐗,t)=f(𝒫(𝐗,t))F(\mathbf{X},t)=f(\mathcal{P}(\mathbf{X},t)) for (𝐗,t)𝕊n×(0,)(\mathbf{X},t)\in\mathbb{S}^{n}\times(0,\infty).

We apply this formulation to the classical MaxCut problem: given an undirected graph 𝒢=(𝒱,)\mathcal{G}=(\mathcal{V},\mathcal{E}) with vertex set 𝒱={1,,n}\mathcal{V}=\{1,\ldots,n\} and edge set \mathcal{E}, we seek to find the maximum-weight cut of the graph. The MaxCut problem is formulated as max𝐱{±1}n𝐱𝐋𝐱\max_{\mathbf{x}\in\{\pm 1\}^{n}}\mathbf{x}^{\top}\mathbf{L}\mathbf{x}, where 𝐋\mathbf{L} is the combinatorial Laplace matrix of graph 𝒢\mathcal{G}. A convex relaxation of this NP-hard combinatorial optimization problem can be given by the SDP [19]

max14tr(𝐋𝐗) s.t.: Xii1,i{1,2,,n},𝐗𝕊+n,\displaystyle\max\frac{1}{4}\operatorname{tr}(\mathbf{L}\mathbf{X})\text{ s.t.: }X_{ii}\leq 1,i\in\{1,2,\ldots,n\},\mathbf{X}\in\mathbb{S}^{n}_{+},

which is the form of (MAXQP) with 𝐂=14𝐋{\mathbf{C}}=\frac{1}{4}\mathbf{L}.

To evaluate the performance of the tested algorithms on this problem, we consider the random graphs G1-G30, published online in [1]. The CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} methods were applied to all datasets with parameters η0=2Ωg\eta_{0}=2\Omega_{g} and σ{0.25,0.5,0.9}\sigma\in\{0.25,0.5,0.9\}. The solutions were compared to the solution obtained from running CGAL\operatorname{CGAL}, using the original CGAL\operatorname{CGAL} code applied to the scaled problem and assuming that all constraint are satisfied with equality. The computations were stopped after either 10,00010,000 seconds of running time. The benchmark solutions for the interior point method for each data set are displayed in Table 1. Table 1 also displays the size of each dataset, the value obtained by solving the MaxCut SDP relaxation using CVX with the SDPT3 solver, and λmin(𝐗SDPT33)\lambda_{\text{min}}(\mathbf{X}^{\operatorname{\texttt{SDPT3}}3}), that is the minimum eigenvalue of the solution. We observe that for larger graphs the interior-point solver SDPT3 returns infeasible solutions, displaying negative eigenvalues. If this occurs, the value obtained by a corrected solution is used as a reference value instead. This corrected solution 𝐗~SDPT33\tilde{\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3} is constructed as 𝐗~SDPT33=(𝐗SDPT33λmin(𝐗SDPT33)𝐈)/α\tilde{\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3}=(\mathbf{X}^{\operatorname{\texttt{SDPT3}}3}-\lambda_{\text{min}}(\mathbf{X}^{\operatorname{\texttt{SDPT3}}3})\mathbf{I})/\alpha, where 𝐈\mathbf{I} is the identity matrix, and α=maxi{1,,n}max{X~iiSDPT33,1}\alpha=\max_{i\in\{1,\ldots,n\}}\max\{\tilde{X}^{\operatorname{\texttt{SDPT3}}3}_{ii},1\} is the minimal number for which X~iiSDPT331\tilde{X}^{\operatorname{\texttt{SDPT3}}3}_{ii}\leq 1 for all i=1,,ni=1,\ldots,n. The objective value of this corrected solution, g(X~SDPT33)g(\tilde{X}^{\operatorname{\texttt{SDPT3}}3}) is also recorded in Table 1 as the feasible value (Feas. value). We therefore label the optimal solution gg^{*} as the minimum of g(X~SDPT33)g(\tilde{X}^{\operatorname{\texttt{SDPT3}}3}) and the minimal value obtained by any of the methods. The performance measure we employ to benchmark our experimental results is the relative optimality gap, defined as |g(𝐗)g||g|\frac{\lvert g(\mathbf{X})-g^{*}\rvert}{\lvert g^{*}\rvert}. Moreover, since CGAL does not necessarily generate feasible solutions, its reported solution 𝐗CGAL\mathbf{X}^{\operatorname{CGAL}} is corrected to the feasible one 𝐗~CGAL=𝐗CGAL/(γ+1)\tilde{\mathbf{X}}^{\operatorname{CGAL}}=\mathbf{X}^{\operatorname{CGAL}}/(\gamma+1), where γ=maxi{1,,n}max{XiiCGAL1,0}\gamma=\max_{i\in\{1,\ldots,n\}}\max\{X^{\operatorname{CGAL}}_{ii}-1,0\} is the relative feasibility gap of CGAL\operatorname{CGAL}. In particular, for every dataset DD, method MM, and parameter choice PP, after each iteration ii we record the objective value giM,P,Dg^{M,P,D}_{i} of the solution after this iteration (or, in the case of CGAL\operatorname{CGAL}, the corrected solution), the corresponding relative optimality gap riM,P,D=|giM,P,Dg|/|g|r^{M,P,D}_{i}={|g^{M,P,D}_{i}-g^{*}|}/{|g^{*}|}, the time from the start of the run tiM,P,Dt^{M,P,D}_{i} and, in the case of CGAL\operatorname{CGAL}, the feasibility gap γiCGAL,P,D\gamma^{\operatorname{CGAL},P,D}_{i} of the current solution.

Figures 4-7 illustrate the performance of our methods for various parameter values and datasets G1-G30 vs. both iteration and time. More specifically, Figures 4 and 6 display, for each iteration ii, the best relative optimality gap minji{rjM,P,D}\min_{j\leq i}\{r^{M,P,D}_{j}\} attained up to that iteration; Figures 5 and 7 display, for each time point tt, the best relative optimality gap minj{rjM,P,D|tjM,P,Dt}\min_{j}\{r^{M,P,D}_{j}|t^{M,P,D}_{j}\leq t\} obtained up to time tt. Additionally, Tables 2 and 3 report the algorithmic performances of the tested methods for each data base and method, at specific time points t{100,1000,10000}t\in\{100,1000,10000\} seconds. Specifically, for each of the time points we compute j=argminj{gjM,P,D|tjM,P,Dt}j^{*}=\operatorname*{argmin}_{j}\{g^{M,P,D}_{j}|t^{M,P,D}_{j}\leq t\}, and then display the relative optimality gap (Opt. Gap) rjM,P,Dr^{M,P,D}_{j^{*}}, the relative feasibility gap γjCGAL,D\gamma^{\operatorname{CGAL},D}_{j^{*}} for CGAL\operatorname{CGAL} (Feas. Gap), and the number of iterations max{j|tjM,P,Dt}\max\{j|t^{M,P,D}_{j}\leq t\} performed until that time point (Iter). The smallest relative optimality gap for each data set and time is marked in bold.

Figures 4 and 6, show that the performances of CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} are similar across iterations, with LCG\operatorname{\texttt{LCG}} having a slight advantage. Further, the performance of both methods remains nearly unchanged for different values of σ\sigma. Note that, for a low number of iterations (smaller than 1000), CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} generally perform better than CGAL\operatorname{CGAL}. However, beyond that point, CGAL\operatorname{CGAL} exhibits a sudden improvement in objective value, leading to better performance. When looking at the methods’ performances over time in Figures 5 and 7, the initial advantage of CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} seems to disappear, due to the lower per-iteration cost of CGAL\operatorname{CGAL}. However, a closer look at the results in Tables 2 and 3 reveals that CG\operatorname{\texttt{CG}} achieves the lowest optimality gap in 7 out of 30 datasets after 1000 seconds, and in 13 out of the 30 datasets after 10000 seconds. The former instances can be identified by the fact that CGAL\operatorname{CGAL} obtains an optimality gap higher than 0.5%0.5\% after 100 seconds, and has a very slow improvement rate after that point. In contrast, CG\operatorname{\texttt{CG}} obtains a higher optimality gap of approximately 2%2\% at that time but manages to achieve better performance due to its superior optimality gap reduction rate. Additionally, we observe that the costly iterations of LCG\operatorname{\texttt{LCG}} result in inferior performance over time compared to CG\operatorname{\texttt{CG}}, as it can only complete a tenth of the iterations within the same time frame. In summary, while CGAL\operatorname{CGAL} manages to obtain a less than 1%1\% optimality gap quite quickly, in about a third of the instances it gets ‘stuck’ and does not significantly improve the objective function, whereas LCG\operatorname{\texttt{LCG}} and CG\operatorname{\texttt{CG}} are more consistent across the instances starting from a higher optimality gap, but making progress in a constant rate throughout their execution, with CG\operatorname{\texttt{CG}} being more efficient in terms of time complexity.

Table 1: MaxCut datasets information
SDPT33\operatorname{\texttt{SDPT3}}3
Dataset #Nodes #Edges Obj. value λmin\lambda_{\text{min}} Feas. value
G1 800 19176 12083.2 0 -
G2 800 19176 12089.43 0 -
G3 800 19176 12084.33 0 -
G4 800 19176 12111.5 0 -
G5 800 19176 12100 0 -
G6 800 19176 2667 0 -
G7 800 19176 2494.25 0 -
G8 800 19176 2535.25 0 -
G9 800 19176 12111.5 0 -
G10 800 19176 12111.5 0 -
G11 800 1600 634.83 0 -
G12 800 1600 628.41 0 -
G13 800 1600 651.54 0 -
G14 800 1600 12111.5 0 -
G15 800 4661 3171.56 0 14135.9-
SDPT33\operatorname{\texttt{SDPT3}}3
Dataset #Nodes #Edges Value λmin\lambda_{\text{min}} Feas. value
G16 800 4672 3175.02 0 -
G17 800 4667 3171.25 0 -
G18 800 4694 1173.75 0 -
G19 800 4661 1091 0 -
G20 800 4672 1119 0 -
G21 800 4667 1112.06 0 -
G22 2000 19990 18276.89 -1 14135.95
G23 2000 19990 18289.22 -1 14142.11
G24 2000 19990 18286.75 -1 9997.75
G25 2000 19990 18293.5 -1 9996
G26 2000 19990 18270.24 -1 14132.87
G27 2000 19990 8304.5 -1 4141.75
G28 2000 19990 8253.5 -1 4100.75
G29 2000 19990 8377.75 -1 4209
G30 2000 19990 8381 -1 4215.5
Refer to caption
Figure 4: MaxCut datasets G1-G21 relative optimality gap vs. iteration.
Refer to caption
Figure 5: MaxCut datasets G1-G21 relative optimality gap vs. time.
Refer to caption
Figure 6: MaxCut datasets G22-G30 relative optimality gap vs. iteration.
Refer to caption
Figure 7: MaxCut datasets G22-G30 relative optimality gap vs. time.

Table 2: Performance for datasets G1-G18 at specific time points. For each data set, each time point, and each method the relative optimality gap (Opt. Gap), the iteration count (Iter) and for CGAL\operatorname{CGAL} the relative feasibility gap (Feas. Gap) of the solution with the best objective value obtained until that time. The best relative optimality gap for each row is marked in bold.
Data Time CGAL LCG σ=\sigma=0.25 LCG σ=\sigma=0.5 LCG σ=\sigma=0.9 CG σ=\sigma=0.25 CG σ=\sigma=0.5 CG σ=\sigma=0.9
Set (sec) Opt. Feas. Iter Opt. Iter Opt. Iter Opt. Iter Opt. Iter Opt. Iter Opt. Iter
Gap % Gap % Gap % Gap % Gap % Gap % Gap % Gap %
G1 100 0.0452 0.0427 21685 5.3256 1650 5.1013 1592 6.3882 1494 1.1139 19046 1.1743 18468 1.2960 17742
1000 0.0100 0.0092 144575 1.8670 17024 1.5269 16930 1.7090 16589 0.3782 126215 0.3774 124709 0.4113 127334
10000 0.0010 0.0010 888035 0.4146 134546 0.3628 141906 0.3890 154157 0.1312 786346 0.1322 728109 0.1565 730861
G2 100 0.0443 0.0416 20514 5.1990 1673 4.9872 1668 6.1180 1614 1.1186 20079 1.1583 17849 1.3328 17078
1000 0.0076 0.0074 135864 1.8878 17695 1.5429 17554 1.5702 17592 0.3857 136275 0.3675 120983 0.4254 121586
10000 0.0013 0.0012 816010 0.3419 150126 0.3133 153430 0.4046 148095 0.1317 763621 0.1463 687932 0.1571 715246
G3 100 0.0641 0.0622 21664 5.9692 1416 5.2227 1629 6.1860 1585 1.1050 20425 1.1555 17401 1.3028 17647
1000 0.0074 0.0071 141485 1.8837 17015 1.3351 17410 1.5635 17266 0.4115 124686 0.3536 120982 0.4347 116904
10000 0.0008 0.0008 892189 0.3171 151626 0.3283 156878 0.3777 156813 0.1326 763597 0.1405 649211 0.1665 669949
G4 100 0.0686 0.0666 19922 5.6793 1487 5.7132 1355 6.9583 1208 1.0994 19435 1.1341 18535 1.2752 17574
1000 0.0077 0.0073 133750 2.1315 13938 1.4965 14001 1.7269 14378 0.3918 124072 0.3522 125967 0.4050 123681
10000 0.0008 0.0008 892189 0.3171 151626 0.3283 156878 0.3777 156813 0.1326 763597 0.1405 649211 0.1665 669949
G5 100 0.0668 0.0644 20334 5.8090 1509 5.7594 1379 6.8983 1338 1.1164 20254 1.1668 16756 1.2883 17533
1000 0.0125 0.0118 135291 2.0338 14165 1.3668 14650 1.7768 14696 0.4197 124738 0.3487 116404 0.4222 118817
10000 0.0008 0.0007 888095 0.3882 137871 0.3553 143148 0.3893 146200 0.1322 772666 0.1335 714684 0.1643 694540
G6 100 0.4803 0.0682 19842 8.2189 1541 7.2658 1426 7.0960 1293 2.5433 16239 2.0011 14659 2.1106 12283
1000 0.4179 0.0100 133015 1.8849 16247 1.7304 14630 1.5365 14492 0.6772 109973 0.5733 94957 0.6847 86027
10000 0.4078 0.0009 937170 0.4791 172357 0.4163 177925 0.3755 163790 0.2108 720098 0.2232 581993 0.2106 566692
G7 100 0.2743 0.0735 19832 9.9632 1282 7.2801 1206 8.0242 1129 2.2855 16331 1.9467 13740 2.1709 12006
1000 0.2068 0.0110 133207 2.1909 13391 1.7102 12885 1.7527 12644 0.6705 113963 0.6946 91575 0.6064 89294
10000 0.1959 0.0009 813946 0.4959 142268 0.4302 129157 0.4549 124502 0.2127 657896 0.2206 525425 0.2223 510794
G8 100 0.4518 0.0627 20101 8.5605 1370 7.5897 1277 7.9211 1179 2.3751 15179 2.1037 13710 2.1617 12686
1000 0.3945 0.0102 134291 2.0238 13734 1.7576 13463 1.6682 13493 0.7010 104772 0.5873 92671 0.6569 84183
10000 0.3839 0.0009 820939 0.5342 137709 0.4611 133831 0.4212 129272 0.2134 566425 0.2330 566242 0.2457 480991
G9 100 0.3402 0.0741 20486 8.6602 1454 6.8707 1402 6.7527 1364 2.5177 17074 2.0543 14883 2.1253 12280
1000 0.2705 0.0105 137199 1.9809 15167 1.7249 14672 1.5419 14895 0.6765 113442 0.5708 101335 0.6687 78602
10000 0.2603 0.0011 839573 0.5128 148527 0.4597 141307 0.4304 137229 0.2126 578638 0.2375 561210 0.2158 492155
G10 100 0.4693 0.0931 20176 9.1327 1365 7.0820 1293 7.5663 1210 2.3524 16086 2.0193 13699 2.1163 12605
1000 0.3829 0.0118 134447 2.0525 13854 1.7098 13728 1.5771 13428 0.6489 95647 0.5940 95183 0.5856 89508
10000 0.3711 0.0008 843504 0.5452 140556 0.4503 139367 0.4136 136075 0.2046 572185 0.2221 525312 0.2224 502739
G11 100 0.9391 0.0293 52396 7.6742 2344 6.8577 2332 7.4084 2073 2.1040 42363 1.8304 36848 1.7089 39655
1000 0.9021 0.0045 343271 2.3392 25386 1.9903 24475 2.0117 24162 0.6135 361872 0.5343 341395 0.5732 309567
10000 0.8946 0.0007 1823694 0.7343 196738 0.7811 193144 0.6929 187647 0.2105 2176005 0.2228 1945371 0.2105 1813119
G12 100 0.7673 0.0295 52859 8.1471 2120 7.0435 2302 7.3282 2236 2.1726 38919 1.8728 37346 1.7495 37614
1000 0.7294 0.0033 342496 2.4091 24045 2.1357 24512 2.3042 23901 0.6337 325029 0.5363 314001 0.5735 276203
10000 0.7236 0.0007 1901444 0.7065 228102 0.5723 222533 0.6089 205372 0.2002 2311377 0.2094 2029389 0.2037 1692541
G13 100 0.7167 0.0260 50953 7.5696 2414 7.2154 1830 6.6205 2339 2.1816 42733 1.9334 37760 1.8089 36572
1000 0.6846 0.0050 332950 2.4416 24491 2.1377 23816 1.9294 25070 0.6357 343063 0.5274 312566 0.5570 279764
10000 0.6775 0.0005 1843586 0.7297 211955 0.7435 204609 0.6543 195695 0.1995 1887493 0.2062 1895654 0.1948 1547687
G14 100 0.0488 0.0401 36050 21.069 1686 16.422 1764 14.932 1636 2.9552 32769 2.6901 30857 2.3423 28826
1000 0.0062 0.0051 247958 4.4529 18672 3.5753 18026 3.3863 17441 0.8246 243773 0.7049 220892 0.6569 206482
10000 0.0010 0.0006 1653247 0.9661 175901 0.7821 173581 0.6847 186339 0.2270 1629850 0.2462 1237850 0.2253 1212944
G15 100 0.0495 0.0411 36866 18.219 2085 15.498 2061 14.534 2039 3.2756 35930 2.7522 29819 2.5041 27217
1000 0.0083 0.0064 251117 4.1913 22101 3.2621 21712 3.0489 20963 0.8459 255448 0.7324 225802 0.6674 204773
10000 0.0011 0.0007 1579597 0.9704 196785 0.7993 192710 0.7211 183253 0.2433 1550812 0.2374 1314984 0.2514 1151339
G16 100 0.0365 0.0287 41046 16.989 2178 14.670 2150 13.058 2034 2.7754 33679 2.5751 31471 2.3225 30479
1000 0.0061 0.0049 271597 3.9114 22674 3.1580 22386 2.8169 21605 0.7870 260408 0.7366 228063 0.7303 222851
10000 0.0010 0.0008 1679075 0.8573 224493 0.6819 227033 0.6209 224707 0.2386 1499401 0.2269 1423609 0.2134 1430100
G17 100 0.0518 0.0434 37335 20.096 1707 17.255 1757 14.974 1703 3.2235 33611 2.7927 29911 2.6083 26684
1000 0.0090 0.0075 253376 4.5791 18601 3.6161 18267 3.4578 17899 0.8749 243142 0.7551 216924 0.6898 205937
10000 0.0009 0.0006 1650749 0.9701 232319 0.7884 226754 0.6712 220683 0.2417 1537197 0.2402 1352397 0.2139 1371570
G18 100 0.7302 0.0560 38999 11.727 1795 10.971 1780 9.5330 1586 2.4057 24255 2.0556 22388 2.0462 20555
1000 0.6753 0.0084 260520 2.7593 18331 2.1365 18265 2.1493 17040 0.7666 185915 0.6925 159283 0.6533 141416
10000 0.6645 0.0006 1599182 0.6728 182254 0.4799 183154 0.5566 166432 0.2586 1061813 0.2166 962496 0.2154 879504
Table 3: Performance for datasets G19-G30 at specific time points. For each data set, each time point, and each method we record the best relative optimality gap (Opt. Gap), the iteration count (Iter) and for CGAL\operatorname{CGAL} the relative feasibility gap (Feas. Gap) of the solution with the best objective value obtained until that time. The best relative optimality gap for each row is marked in bold.
Data Time CGAL LCG σ=\sigma=0.25 LCG σ=\sigma=0.5 LCG σ=\sigma=0.9 CG σ=\sigma=0.25 CG σ=\sigma=0.5 CG σ=\sigma=0.9
Set (sec) Opt. Feas. Iter Opt. Iter Opt. Iter Opt. Iter Opt. Iter Opt. Iter Opt. Iter
Gap % Gap % Gap % Gap % Gap % Gap % Gap % Gap %
G19 100 0.8967 0.0492 37083 12.059 1630 10.580 1541 10.506 1527 2.3693 25356 2.2978 22770 2.0119 20054
1000 0.8444 0.0066 254368 2.6314 16530 2.1796 16473 2.1604 15966 0.6124 162830 0.6152 137255 0.6637 129524
10000 0.8351 0.0006 1723323 0.5384 205382 0.5075 194687 0.4851 186757 0.2624 1026619 0.2506 842195 0.2320 756654
G20 100 0.7363 0.0538 36281 12.956 1486 10.386 1593 10.453 1462 2.8064 23999 2.4927 21325 2.2067 19130
1000 0.6819 0.0062 251142 2.9914 15873 2.4176 15989 2.2335 16143 0.7705 162281 0.7034 144692 0.6373 132063
10000 0.6742 0.0006 1716273 0.6061 208638 0.5587 206001 0.4746 187323 0.2475 1068889 0.2064 924839 0.2231 790836
G21 100 0.7840 0.0726 41830 10.288 1935 7.7700 1998 8.3024 1955 2.4996 24785 2.0761 20738 2.0545 20794
1000 0.7121 0.0102 275681 2.1759 20341 1.8593 21211 1.8629 20506 0.7668 160845 0.5995 139988 0.6157 136838
10000 0.7005 0.0008 1610239 0.4848 188165 0.4806 187510 0.4938 178700 0.2522 1016576 0.2235 914508 0.2284 824945
G22 100 0.1174 0.1139 12416 18.1438 300 16.5330 288 18.2411 248 3.7563 7042 2.9849 6751 3.4068 5745
1000 0.0193 0.0187 90981 4.7658 3319 4.3706 3215 5.5525 2941 0.7828 64426 0.6636 59780 0.7353 58167
10000 0.0022 0.0021 632496 1.0548 34970 1.1614 31913 1.1801 31293 0.2095 497773 0.1931 447623 0.2247 449915
G23 100 0.1023 0.0980 12410 18.5561 292 17.3589 250 17.8255 256 3.6153 6927 2.9642 6491 3.3613 5583
1000 0.0149 0.0143 90915 4.7672 3275 4.3952 2940 5.3747 3069 0.7403 60120 0.6684 57688 0.7333 57294
10000 0.0021 0.0020 631480 1.0797 33094 1.1388 31910 1.1669 32278 0.2062 478154 0.1923 440393 0.2229 449699
G24 100 0.1257 0.1222 12342 20.1997 251 17.6862 259 18.7144 242 3.8042 6856 3.1557 6188 3.4891 5526
1000 0.0164 0.0157 90383 5.1886 2825 3.8498 2957 5.5847 2858 0.8075 61881 0.6991 55330 0.7356 57577
10000 0.0020 0.0018 624799 1.2467 30859 1.1084 31608 1.1785 30999 0.2195 475491 0.2035 412481 0.2309 438368
G25 100 0.1057 0.1012 12533 18.5114 300 16.3371 284 17.9667 256 3.5888 6971 2.9573 6687 3.4011 5701
1000 0.0135 0.0128 91948 4.7165 3368 4.4134 3115 5.4227 3022 0.7561 63518 0.6510 60912 0.7309 58963
10000 0.0018 0.0017 637281 1.0292 34064 1.1919 32793 1.1593 32361 0.2028 491183 0.1930 448653 0.2257 452544
G26 100 0.1375 0.1337 12736 20.6034 248 17.1955 269 18.7351 245 3.7270 6860 2.9696 6664 3.4397 5703
1000 0.0151 0.0145 92871 4.9788 3025 4.4023 3025 5.7863 2932 0.7733 62802 0.6581 60555 0.7417 59907
10000 0.0022 0.0021 635639 1.0644 32298 0.9407 32064 1.1557 32364 0.2074 490930 0.1957 442197 0.2317 461381
G27 100 0.3024 0.2283 11530 28.4263 195 25.0943 214 27.2575 174 4.7957 6361 4.0809 6127 4.4816 4712
1000 0.0877 0.0246 82946 8.2273 2240 6.4036 2407 5.7783 2351 0.8400 55681 0.8083 52249 0.7602 47704
10000 0.0646 0.0031 609686 1.5543 25336 0.9938 25525 0.9926 25790 0.0102 390126 0.0000 378324 0.0409 344793
G28 100 0.3037 0.1984 12418 23.2191 281 22.2085 280 23.0712 226 4.6768 6415 4.1060 6054 4.3763 4992
1000 0.1194 0.0240 91115 6.2536 3044 4.6599 3113 5.0701 3019 0.7727 54325 0.8275 54898 0.7498 50886
10000 0.0964 0.0026 634335 1.3236 33006 0.8704 33326 0.7890 32323 0.0216 393334 0.0000 397716 0.0190 377339
G29 100 0.1737 0.1651 12428 22.335 291 22.374 276 22.594 244 4.8811 6431 4.1196 6132 4.4864 4976
1000 0.0238 0.0216 91157 5.9446 3169 5.2525 3145 5.0132 3068 0.9131 57929 0.9055 50524 0.8096 51421
10000 0.0025 0.0022 637658 1.2430 33012 0.8851 33265 0.9036 32289 0.1865% 421074 0.0965 375627 0.1135 356157
G30 100 0.2294 0.2176 11582 26.6363 217 22.9201 208 26.8685 190 4.8414 6122 4.0918 6027 4.4921 4685
1000 0.0300 0.0254 86515 7.8141 2482 5.3362 2403 6.1525 2295 0.8334 53756 0.7499 51787 0.7489 47761
10000 0.0048 0.0017 616813 1.0134 27305 1.1942 24995 0.9684 24960 0.1127 388638 0.0027 364280 0.0314 353046

B.2 Markov chain mixing rate

We next consider the problem of finding the fastest mixing rate of a Markov chain on a graph [33]. In this problem, a symmetric Markov chain is defined on an undirected graph G=({1,,n},)G=(\{1,\ldots,n\},\mathcal{E}). Given GG and weights dijd_{ij} for {i,j}\{i,j\}\in\mathcal{E}, we are tasked with finding the transition rates wij0w_{ij}\geq 0 for each {i,j}\{i,j\}\in\mathcal{E} with weighted sum smaller than 1, that result in the fastest mixing rate. We assume that ijdij2=n2\sum_{ij}d^{2}_{ij}=n^{2}. The mixing rate is given by the second smallest eigenvalue of the graph’s Laplacian matrix, described as

L(𝐰)ij={j:{i,j}wijj=iwij{i,j}0otherwise..L(\mathbf{w})_{ij}=\begin{cases}\sum_{j:\{i,j\}\in\mathcal{E}}w_{ij}&j=i\\ -w_{ij}&\{i,j\}\in\mathcal{E}\\ 0&\text{otherwise}.\end{cases}.

From L𝟏=0L\mathbf{1}=0 it follows that 1n𝟏\frac{1}{n}\mathbf{1} is a stationary distribution of the process. From [9], we have the following bound on the total variation distance of the time tt distribution of the Markov process and the stationary distribution supππ𝐏(t)1n𝟏TV12neλt.\sup_{\pi}\lVert\pi{\mathbf{P}}(t)-\frac{1}{n}\mathbf{1}\rVert_{\text{TV}}\leq\frac{1}{2}\sqrt{n}e^{-\lambda t}. The quantity Tmix=λT_{\text{mix}}=\lambda is called the mixing time, and λλ2(L(𝐰))\lambda\equiv\lambda_{2}(L(\mathbf{w})) is the second-largest eigenvalue of the graph Laplacian L(𝐰)L(\mathbf{w}). To bound this eigenvalue, we follow the strategy laid out in [33]. Thus, the problem can be written as

max𝐰\displaystyle\max_{\mathbf{w}}\quad λ2(L(𝐰))\displaystyle\lambda_{2}(L(\mathbf{w}))
s.t. {i,j}dij2wij1\displaystyle\sum_{\{i,j\}\in\mathcal{E}}d_{ij}^{2}w_{ij}\leq 1
𝐰0.\displaystyle\mathbf{w}\geq 0.

This problem can also alternatively be formulated as

min𝐰\displaystyle\min_{\mathbf{w}}\quad {i,j}dij2wij\displaystyle\sum_{\{i,j\}\in\mathcal{E}}d_{ij}^{2}w_{ij}
s.t. λ2(L(𝐰))1\displaystyle\lambda_{2}(L(\mathbf{w}))\geq 1
𝐰0.\displaystyle\mathbf{w}\geq 0.

Due to the properties of the Laplacian, the first constraint can be reformulated as

min𝐰\displaystyle\min_{\mathbf{w}}\quad {i,j}dij2wij\displaystyle\sum_{\{i,j\}\in\mathcal{E}}d_{ij}^{2}w_{ij} (B.2)
s.t. L(𝐰)𝐈n×n1n𝟏𝟏\displaystyle L(\mathbf{w})\succeq\mathbf{I}_{n\times n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top} (B.3)
𝐰0.\displaystyle\mathbf{w}\geq 0. (B.4)

The dual problem is then given by

max𝐗𝕊+n𝐈n×n1n𝟏𝟏,𝐗s.t.Xii+Xjj2Xijdij2{i,j}\begin{split}\max_{\mathbf{X}\in\mathbb{S}^{n}_{+}}\quad&\langle\mathbf{I}_{n\times n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top},\mathbf{X}\rangle\\ \text{s.t.}\quad&X_{ii}+X_{jj}-2X_{ij}\leq d_{ij}^{2}\qquad\{i,j\}\in\mathcal{E}\end{split} (B.5)

To obtain an SDP within our convex programming model, we combine arguments from [33] and [2]. Let 𝐗\mathbf{X} be a feasible point for (B.5). Then, there exists a n×nn\times n matrix 𝐕{\mathbf{V}} such that 𝐗=𝐕𝐕\mathbf{X}={\mathbf{V}}{\mathbf{V}}^{\top}. It is easy to see that multiplying each row of the matrix 𝐕=[𝐯1,,𝐯n]{\mathbf{V}}^{\top}=[\mathbf{v}_{1},\ldots,\mathbf{v}_{n}] with an orthonormal matrix does not change the feasibility of the candidate solution. Moreover Xij=𝐯i𝐯jX_{ij}=\mathbf{v}_{i}^{\top}\mathbf{v}_{j} for all 1i,jn1\leq i,j\leq n, so that Xii+Xjj2Xij=𝐯i𝐯j22X_{ii}+X_{jj}-2X_{ij}=\lVert\mathbf{v}_{i}-\mathbf{v}_{j}\rVert^{2}_{2}. Since shifting each vector 𝐯i\mathbf{v}_{i} by a constant vector 𝐯\mathbf{v} would not affect the objective or constraints, we can, without loss of generality, set 𝐯1=𝟎\mathbf{v}_{1}=\mathbf{0}. This gives the equivalent optimization problem

max𝐯1,,𝐯ni=1n𝐯i2s.t.𝐯i𝐯j2dij2{i,j},𝐯in,𝐯1=𝟎.\begin{split}\max_{\mathbf{v}_{1},\ldots,\mathbf{v}_{n}}\quad&\sum_{i=1}^{n}\lVert\mathbf{v}_{i}\rVert^{2}\\ \text{s.t.}\quad&\lVert\mathbf{v}_{i}-\mathbf{v}_{j}\rVert^{2}\leq d^{2}_{ij}\qquad\{i,j\}\in\mathcal{E},\\ &\mathbf{v}_{i}\in\mathbb{R}^{n},\mathbf{v}_{1}=\mathbf{0}.\end{split} (B.6)

This is the geometric dual derived in [33], which is strongly connected to the geometric embedding of a graph in the plane [20]. Thus, setting again 𝐗=𝐕𝐕\mathbf{X}={\mathbf{V}}{\mathbf{V}}^{\top}, we therefore obtain that X1j=Xj1=0X_{1j}=X_{j1}=0 for all i=1,,ni=1,\ldots,n thus reducing the dimension of the problem, and (B.7) can be reformulated as follows:

max𝐗𝕊+n1𝐈n1×n11n𝟏𝟏,𝐗s.t.X(i1)(i1)+X(j1)(j1)2X(i1)(j1)dij2{i,j},i,j>1X(j1)(j1)d1j2{1,j}\begin{split}\max_{\mathbf{X}\in\mathbb{S}^{n-1}_{+}}\quad&\langle\mathbf{I}_{{n-1}\times{n-1}}-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top},\mathbf{X}\rangle\\ \text{s.t.}\quad&X_{(i-1)(i-1)}+X_{(j-1)(j-1)}-2X_{(i-1)(j-1)}\leq d_{ij}^{2}\qquad\{i,j\}\in\mathcal{E},i,j>1\\ &X_{(j-1)(j-1)}\leq d_{1j}^{2}\qquad\qquad\qquad\qquad\qquad\qquad\{1,j\}\in\mathcal{E}\end{split} (B.7)

Moreover, since 𝐗𝕊+n1\mathbf{X}\in\mathbb{S}^{n-1}_{+}, it follows that 2|Xij|XiiXjj2|X_{ij}|\geq-\sqrt{X_{ii}X_{jj}} and so for any {i,j}\{i,j\}\in\mathcal{E} such that i,j>1i,j>1,X(j1)(j1)X(i1)(i1)+dij\sqrt{X_{(j-1)(j-1)}}\leq\sqrt{X_{(i-1)(i-1)}}+\sqrt{d_{ij}}. Therefore, since the graph is connected, we can recursively bound each diagonal element of 𝐗\mathbf{X} and therefore the trace by a constant α\alpha, and add the trace constraint 𝖷={𝐗𝕊n|𝐗0,tr(𝐗)α}\mathsf{X}=\{\mathbf{X}\in\mathbb{S}^{n}|\mathbf{X}\succeq 0,\operatorname{tr}(\mathbf{X})\leq\alpha\} to the problem formulation without affecting the optimal values. Furthermore, defining 𝒟:𝕊n1||\mathcal{D}:\mathbb{S}^{n-1}\to\mathbb{R}^{\lvert\mathcal{E}\rvert} by

𝒟(𝐗){i,j}={X(i1)(i1)+X(j1)(j1)2X(i1)(j1),i,j>0,X(j1)(j1),i=1.\mathcal{D}(\mathbf{X})_{\{i,j\}}=\begin{cases}X_{(i-1)(i-1)}+X_{(j-1)(j-1)}-2X_{(i-1)(j-1)},&i,j>0,\\ X_{(j-1)(j-1)},&i=1.\end{cases}

Let 𝖰={𝐲|||yi,jdij2,{i,j}}\mathsf{Q}=\{\mathbf{y}\in\mathbb{R}^{\lvert\mathcal{E}\rvert}|y_{i,j}\leq d^{2}_{ij},\{i,j\}\in\mathcal{E}\} and 𝖪={(y,t)||×|1ty𝖰,t>0}\mathsf{K}=\{(y,t)\in\mathbb{R}^{\lvert\mathcal{E}\rvert}\times\mathbb{R}|\frac{1}{t}y\in\mathsf{Q},t>0\}. This is a closed convex cone with logarithmically homogeneous barrier

f(y,t)={i,j}log(tdij2yi,j)=elog(dij21tyi,j)||log(t).f(y,t)=-\sum_{\{i,j\}\in\mathcal{E}}\log(td^{2}_{ij}-y_{i,j})=-\sum_{e\in\mathcal{E}}\log(d^{2}_{ij}-\frac{1}{t}y_{i,j})-\lvert\mathcal{E}\rvert\log(t).

This gives a logarithmically homogeneous barrier F(𝐗,t)=f(𝒫(𝐗,t))F(\mathbf{X},t)=f(\mathcal{P}(\mathbf{X},t)), where 𝒫(𝐗,t)=[𝒟(𝐗);t]||×\mathcal{P}(\mathbf{X},t)=[\mathcal{D}(\mathbf{X});t]\in\mathbb{R}^{\lvert\mathcal{E}\rvert}\times\mathbb{R}. We thus arrive at a formulation of the form of problem (P), which reads explicitly as

min𝐗,t{g(𝐗):=𝐈n×n1n𝟏𝟏,𝐗}s.t.𝒫(𝐗,t)𝖪𝐗𝖷,t=1.\begin{split}\min_{\mathbf{X},t}\quad&\{g(\mathbf{X}):=\langle\mathbf{I}_{n\times n}-\frac{1}{n}\mathbf{1}\mathbf{1}^{\top},\mathbf{X}\rangle\}\\ \text{s.t.}\quad&\mathcal{P}(\mathbf{X},t)\in\mathsf{K}\\ &\mathbf{X}\in\mathsf{X},t=1.\end{split} (B.8)
Dataset # Nodes # Edges SDPT3 Value
1 100 1000 15.62
2 100 2000 7.93
3 200 1000 72.32
4 200 4000 17.84
5 400 1000 388.52
6 400 8000 38.37
7 800 4000 333.25
8 800 16000 80.03
Table 4: Mixing datasets characteristics.

We generated random connected undirected graphs of various sizes, and for each edge {i,j}\{i,j\} in the graph we generated a random dij2d_{ij}^{2} uniformly in [0,1][0,1]. Table 4 provides the size of each dataset, and the value obtained by solving the Mixing problem SDP using CVX with the SDPT3 solver. In order to compare to the CGAL\operatorname{CGAL} algorithm, we used the normalisation guidelines specified in [41], so that the norm of the coefficients vector of each constraint will be equal to each other, and the norm of the constraint matrix and the objective matrix equals 1.

All datasets were run for both CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} options with the following choice of parameters η0{0.5Ωg,1Ωg,2Ωg}\eta_{0}\in\{0.5\Omega_{g},1\Omega_{g},2\Omega_{g}\} and σ{0.9,0.5,0.25}\sigma\in\{0.9,0.5,0.25\}. All methods were stopped when either they reached 1000010000 iterations or 36003600 seconds running time, the earlier of the two conditions. Since CGAL\operatorname{CGAL} does not generate feasible solutions, we computed the deviation from feasibility of its outputted solution 𝐗CGAL\mathbf{X}^{\operatorname{CGAL}} by

Relative Feasibility Gap=γ(𝐗CGAL)max{i,j}dij2𝒟(𝐗CGAL)ijdij2.\text{Relative Feasibility Gap}=\gamma(\mathbf{X}^{\operatorname{CGAL}})\equiv\max_{\{i,j\}\in\mathcal{E}}\frac{d_{ij}^{2}-\mathcal{D}(\mathbf{X}^{\operatorname{CGAL}})_{ij}}{d_{ij}^{2}}.

We then corrected the solution to a feasible solution 𝐗~CGAL=𝐗CGAL/(γ(𝐗CGAL)+1)\tilde{\mathbf{X}}^{\operatorname{CGAL}}=\mathbf{X}^{\operatorname{CGAL}}/(\gamma(\mathbf{X}^{\operatorname{CGAL}})+1), however in our numerical experiments it turned out that γ\gamma was extremely large, which resulted in 𝐗~CGAL\tilde{\mathbf{X}}^{\operatorname{CGAL}} being very close to the zero matrix with very poor performance. Therefore, we used the following alternative iterative correction method: We set 𝐗^CGAL,1=𝐗CGAL\hat{\mathbf{X}}^{\operatorname{CGAL},1}=\mathbf{X}^{\operatorname{CGAL}}, and at each iteration kk we identify {i,j}\{i,j\}\in\mathcal{E} which leads to the maximum value for γ(𝐗^CGAL,k)\gamma(\hat{\mathbf{X}}^{\operatorname{CGAL},k}) set i=argmaxl{i,j}Xlli^{*}=\operatorname*{argmax}_{l\in\{i,j\}}X_{ll} and set

𝐗^ijCGAL,k+1={𝐗^ijCGAL,k+1,i,ji.𝐗^ijCGAL,k+1/(γ(𝐗^CGAL,k)+1),i=i or j=i..\hat{\mathbf{X}}^{\operatorname{CGAL},k+1}_{ij}=\begin{cases}\hat{\mathbf{X}}^{\operatorname{CGAL},k+1}_{ij},&i,j\neq i^{*}.\\ \hat{\mathbf{X}}^{\operatorname{CGAL},k+1}_{ij}/(\gamma(\hat{\mathbf{X}}^{\operatorname{CGAL},k})+1),&i=i^{*}\text{ or }j=i^{*}.\end{cases}.

We terminate the algorithm when γ(𝐗^CGAL,k)=0\gamma(\hat{\mathbf{X}}^{\operatorname{CGAL},k})=0. The value of the returned 𝐗^CGAL\hat{\mathbf{X}}^{\operatorname{CGAL}} is then computed. As in the previous section, one of our performance measures is the relative optimality gap |g(𝐗)g(𝐗~SDPT33)||g(𝐗~SDPT33)|\frac{\lvert g(\mathbf{X})-g(\tilde{\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3})\rvert}{\lvert g(\tilde{\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3})\rvert} between the (corrected) solution of each method and the SDPT33\operatorname{\texttt{SDPT3}}3 solution. Specifically, for each dataset DD, method MM and parameter combination PP, we record the objective value giM,P,Dg^{M,P,D}_{i} (or, for CGAL\operatorname{CGAL}, the corrected objective value) after iteration ii, the corresponding relative optimality gap riM,P,D=(giM,P,Dg(𝐗~SDPT33))/g(𝐗~SDPT33)r^{M,P,D}_{i}=(g^{M,P,D}_{i}-g(\tilde{\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3}))/g(\tilde{\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3}), the time tiM,P,Dt^{M,P,D}_{i} elapsed since the start of the run and, only for CGAL\operatorname{CGAL}, the relative feasibility gap γiCGAL,D\gamma^{\operatorname{CGAL},D}_{i}.

Figures 8-9 illustrate the results for the tested σ\sigma and η0\eta_{0} values. Figure 8 displays, for each iteration ii, the best relative optimality gap minji{rjM,P,D}\min_{j\leq i}\{r^{M,P,D}_{j}\} up to that iteration. Figure 9 displays for each time tt the best relative optimality gap minj{riM,P,D:tjM,P,Dt}\min_{j}\{r^{M,P,D}_{i}:t_{j}^{M,P,D}\leq t\} obtained up to time tt. Table 5 reports the performances of the tested methods at time points t{100,500,1000}t\in\{100,500,1000\} seconds, in order to examine their algorithmic behavior under different time constraints. Specifically, given a data set DD, a method MM, parameters PP and a time point tt, for CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} we compute (j,P)=argminj,P{rjM,P,D|tjM,P,Dt}(j^{*},P^{*})=\operatorname*{argmin}_{j,P}\{r^{M,P,D}_{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}j}|t_{j}^{M,P,D}\leq t\}, so the table contains: the parameter values P=(η0,σ)P^{*}=(\eta_{0},\sigma), the corresponding objective function value (Obj. Value) gjM,P,Dg_{j^{*}}^{M,P^{*},D}, the relative optimality gap (Opt. Gap) rjM,P,Dr_{j}^{M,P,D}, and the number of iterations (Iter) performed until that time, given by max{j|tjM,P,Dt}\max\{j|t_{j}^{M,P^{*},D}\leq t\}. For CGAL\operatorname{CGAL}, since there is no parameter choice, we retrieve j=argminj{riCGAL,D|tjCGAL,Dt}j^{*}=\operatorname*{argmin}_{j}\{r^{\operatorname{CGAL},D}_{i}|t_{j}^{\operatorname{CGAL},D}\leq t\}, and the table records the corresponding objective value (Obj. Value) gjCGAL,Dg^{\operatorname{CGAL},D}_{j^{*}}, the relative optimal gap (Opt. Gap) rjCGAL,Dr^{\operatorname{CGAL},D}_{j^{*}}, the relative feasibility gap (Feas Gap) γjCGAL,D\gamma^{\operatorname{CGAL},D}_{j^{*}}, as well as the number of iterations (Iter) given by max{j|tjCGAL,Dt}\max\{j|t_{j}^{\operatorname{CGAL},D}\leq t\}. The lowest relative optimality gap for each data set and time is marked in bold.

Refer to caption
Figure 8: Mixing datasets relative optimality gap vs. iteration for various parameter choices.
Refer to caption
Figure 9: Mixing datasets relative optimality gap vs. time for various parameter choices.
Table 5: Mixing datasets results. For each dataset, time, and algorithm provides the best parameter choice for η0\eta_{0} (in multipliers of Ωg\Omega_{g}) and σ\sigma, the function value obtained, the relative optimality gap (Opt. Gap), the iteration, and for CGAL\operatorname{CGAL} the relative feasibility gap (Feas. Gap) of the solution with the best objective value obtained until that time. The best objective value and relative optimality gap for each row are marked in bold.
Data Time CG\operatorname{\texttt{CG}} LCG\operatorname{\texttt{LCG}} CGAL\operatorname{CGAL}
Set (Sec) η0\eta_{0} σ\sigma Obj. Opt. Iter η0\eta_{0} σ\sigma Obj. Opt. Iter Obj. Opt. Feas. Iter
(xΩgx\cdot\Omega_{g}) Value Gap % (xΩgx\cdot\Omega_{g}) Value Gap % Value Gap % Gap %
MixData1 100 2 0.5 13.03 16.6 2385 2 0.9 14.39 7.9 809 3.68 76.4 21930.2 3090
500 2 0.9 13.96 10.7 4969 2 0.25 14.97 4.1 9001 3.74 76.1 12304.4 11460
1000 2 0.9 14.20 9.1 9286 0.5 0.25 15.12 3.2 13598 3.76 75.9 8873.5 19981
MixData2 100 2 0.9 6.71 15.3 957 2 0.9 7.26 8.4 745 1.88 76.2 19184.7 2948
500 2 0.9 7.17 9.5 4099 2 0.9 7.49 5.5 3873 1.90 76.0 8729.7 10864
1000 2 0.9 7.23 8.8 8024 2 0.9 7.59 4.3 7758 1.91 76.0 6066.0 18977
MixData3 100 1 0.25 32.86 54.6 1865 0.5 0.25 65.70 9.2 888 15.15 79.1 44622.3 1656
500 1 0.25 54.34 24.9 8832 1 0.25 68.88 4.8 3289 15.30 78.8 13031.6 6080
1000 1 0.25 54.87 24.1 14128 1 0.25 69.34 4.1 5851 15.32 78.8 7902.7 10657
MixData4 100 2 0.25 10.28 42.4 1676 2 0.9 15.42 13.6 166 3.33 81.3 116332.0 1440
500 0.5 0.25 14.69 17.7 4710 2 0.9 16.15 9.5 921 3.36 81.2 57781.9 5203
1000 0.5 0.25 14.72 17.5 6939 2 0.9 16.47 7.7 1880 3.36 81.2 41696.7 9081
MixData5 100 2 0.25 92.61 76.2 1289 1 0.25 340.02 12.5 415 59.39 84.7 122971.0 946
500 0.5 0.9 163.46 57.9 4788 0.5 0.5 360.70 7.2 1701 59.31 84.7 126861.0 3397
1000 1 0.25 277.79 28.5 5359 2 0.25 367.26 5.5 2937 59.90 84.6 64948.0 5939
MixData6 100 1 0.5 10.52 72.6 751 2 0.25 30.11 21.5 256 7.44 80.6 149166.0 811
500 0.5 0.25 28.98 24.5 2995 2 0.9 33.59 12.5 232 7.60 80.2 103077.0 3004
1000 0.5 0.25 32.30 15.8 4661 2 0.9 34.24 10.8 482 7.66 80.0 68409.9 5270
MixData7 100 0.5 0.9 35.69 89.3 496 1 0.25 276.90 16.9 82 59.19 82.2 702281.0 478
500 1 0.25 102.57 69.2 984 0.5 0.25 295.47 11.3 616 62.41 81.3 239810.0 1766
1000 1 0.25 152.56 54.2 2101 1 0.25 301.48 9.5 1447 62.78 81.2 185699.0 3124
MixData8 100 0.5 0.9 4.49 94.4 306 0.5 0.25 55.80 30.3 66 13.77 82.8 613444.0 431
500 0.5 0.9 12.03 85.0 1455 2 0.25 63.64 20.5 566 13.99 82.5 282965.0 1593
1000 0.5 0.9 15.84 80.2 2751 1 0.9 66.98 16.3 133 14.05 82.4 243686.0 2817

Figures 8 and 9 indicate that the line search version LCG\operatorname{\texttt{LCG}} outperforms CG\operatorname{\texttt{CG}}, although in most cases less iterations are performed. One explanation of this phenomenon is that the step size policy employed in CG\operatorname{\texttt{CG}} is based on global optimization ideas which do not take into account the local structure of the problem. LCG\operatorname{\texttt{LCG}} instead captures these local features of the problem more accurately, leading to better numerical performance. By comparison, CGAL\operatorname{CGAL} performs quite poorly, with optimality gaps ranging from 70%70\% to 90%90\% roughly; this is closely linked to large relative feasibility gaps, exceeding 10000%10000\%. One possible reason for these results is that CGAL\operatorname{CGAL}’s convergence guarantees rely on the magnitude of the dual variables, which in turn depend on the size of the Slater condition gap. Since the right-hand side dij2d_{ij}^{2} of the constraints may be very small, this gap will generally be small, causing the norm of the dual variables to become large and slowing convergence. This may also explain the contrast with CGAL\operatorname{CGAL}’s superior performance in the MaxCut example. Additionally, Table 5 does not indicate that there is a preferable choice of parameters for CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}}, since the best configuration is different for different data sets and times.

B.3 Randomly Scaled SDP

Finally, we consider a general SDP problems with inequality constraints, which we use to investigate the sensitivity of the compared methods to problem scaling. The synthetically scaled SDP (SRS) takes the form

max\displaystyle\max 𝐂,𝐗\displaystyle\langle{\mathbf{C}},\mathbf{X}\rangle (B.9)
s.t. 𝐀i,𝐗bii=1,,m\displaystyle\langle\mathbf{A}_{i},\mathbf{X}\rangle\leq b_{i}\quad i=1,\ldots,m
tr(𝐗)1\displaystyle\operatorname{tr}{(\mathbf{X})}\leq 1
𝐗0.\displaystyle{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}}\succeq 0.

Let 𝖰={𝐲m|yibi,i{1,2,,m}}\mathsf{Q}=\{\mathbf{y}\in\mathbb{R}^{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}m}}|y_{i}\leq b_{i},i\in\{1,2,\ldots,m\}\} and 𝖪={(𝐲,t)m×|1t𝐲𝖰,t>0}\mathsf{K}=\{({\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{y}},t)\in\mathbb{R}^{m}\times\mathbb{R}|\frac{1}{t}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{y}}\in\mathsf{Q},t>0\}. This is a closed convex cone with logarithmically homogeneous barrier

f(𝐲,t)=i=1mlog(tbiyi)=i=1mlog(bi1tyi)mlog(t),f({\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{y}},t)=-\sum_{i=1}^{m}\log(tb_{i}-y_{i})=-\sum_{i=1}^{m}\log(b_{i}-\frac{1}{t}y_{i})-m\log(t),

i.e., F(𝐗,t)=f(𝒫(𝐗,t))F(\mathbf{X},t)=f(\mathcal{P}(\mathbf{X},t)), where 𝒫(𝐗,t)=[𝐀1,𝐗;𝐀2,𝐗;;𝐀m,𝐗;t]m×\mathcal{P}(\mathbf{X},t)=[\langle\mathbf{A}_{1},\mathbf{X}\rangle;\langle\mathbf{A}_{2},\mathbf{X}\rangle;\ldots;\langle\mathbf{A}_{m},\mathbf{X}\rangle;t]\in\mathbb{R}^{m}\times\mathbb{R}. We thus arrive at a formulation of the form of problem (P), which reads explicitly as

min𝐗,tg(𝐗):=𝐂,𝐗s.t.𝒫(𝐗,t)𝖪(𝐗,t)𝖷:={(𝐗,t)𝒮n×|𝐗0,tr(𝐗)1,t=1}.\begin{split}\min_{\mathbf{X},t}\quad&g(\mathbf{X}):=\langle-{\mathbf{C}},\mathbf{X}\rangle\\ \text{s.t.}\quad&\mathcal{P}(\mathbf{X},t)\in\mathsf{K}\\ &(\mathbf{X},t)\in\mathsf{X}:=\{(\mathbf{X}^{\prime},t^{\prime})\in\mathcal{S}^{n}\times\mathbb{R}|\mathbf{X}^{\prime}\succeq 0,\;\operatorname{tr}(\mathbf{X}^{\prime})\leq 1,t^{\prime}=1\}.\end{split} (B.10)

We randomly generate the objective function and constraint coefficients to be PSD matrices with unit Frobenius norm where, given a matrix Mm1×m2M\in\mathbb{R}^{m_{1}\times m_{2}}, its Frobenius norm is MF:=i=1m1j=1m2Mij2\|M\|_{\text{F}}:=\sqrt{\sum_{i=1}^{m_{1}}\sum_{j=1}^{m_{2}}M_{ij}^{2}}. Specifically, the constraint matrices for every instance and i=1,,mi=1,\ldots,m are generated by first sampling 𝐮in\mathbf{u}_{i}\in\mathbb{R}^{n} with components from a standard Gaussian distribution, rounded to the nearest integer, then setting 𝐀i=𝐮i𝐮i\mathbf{A}_{i}=\mathbf{u}_{i}\mathbf{u}_{i}^{\top}, and finally normalizing the matrix to Frobenius norm of 1. Similarly, the objective matrix is generated by sampling 𝐔0n×n\mathbf{U}_{0}\in\mathbb{R}^{n\times n} with components from a standard Gaussian distribution, rounded to the nearest integer, and 𝐂=(𝐔0𝐔0)/(𝐔0𝐔0)F{\mathbf{C}}=(\mathbf{U}_{0}\mathbf{U}_{0}^{\top})/\lVert(\mathbf{U}_{0}\mathbf{U}_{0}^{\top})\rVert_{F}. Additionally, we generate a PSD reference solution 𝐗0\mathbf{X}_{0} by sampling 𝐕n×n{\mathbf{V}}\in\mathbb{R}^{n\times n} with components from a standard Gaussian distribution, then setting 𝐗0=𝐕𝐕\mathbf{X}_{0}={\mathbf{V}}{\mathbf{V}}^{\top} and, finally, normalizing the result to have a trace equal to 1. The right-hand side constraint coefficients are constructed as bi=uiip/2𝐀i,𝐗0b_{i}=\frac{u_{i}}{i^{p/2}}\langle\mathbf{A}_{i},\mathbf{X}_{0}\rangle, with uiu_{i} is a random variable uniformaly disributed over [12,1][\frac{1}{2},1], and p{0,1,2}p\in\{0,1,2\}, where pp is a parameter that measures the scaling of the constraints. We note that CGAL\operatorname{CGAL} ran on the normalized problem, using the normalization guidelines specified in [41].

Similarly to the previous examples, the performance metric to benchmark our experimental results is the relative optimality gap, defined |g(𝐗)g(𝐗SDPT33))||g(𝐗SDPT33)|\frac{|g({\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}})-g({\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3}))|}{|g({\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3})|}, where 𝐗SDPT33{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}\mathbf{X}}^{\operatorname{\texttt{SDPT3}}3} is the optimal solution computed by the interior-point solver SDPT33\operatorname{\texttt{SDPT3}}3. We compare our solutions with those generated by CGAL\operatorname{CGAL}. However, since CGAL\operatorname{CGAL} does not necessarily produce feasible solutions, its reported solution 𝐗CGAL\mathbf{X}^{\operatorname{CGAL}} is corrected to ensure feasibility. Specifically, given a relative feasibility gap γ=maxi{1,2,,m}max{{𝐀i,𝐗𝐛i},0}𝐛i\gamma=\max_{i\in\{1,2,\ldots,m\}}\frac{\max\{\{\langle\mathbf{A}_{i},\mathbf{X}\rangle-\mathbf{b}_{i}\},0\}}{\mathbf{b}_{i}}, the corrected solution was given by 𝐗~CGAL=𝐗CGAL/(1+γ)\tilde{\mathbf{X}}^{\operatorname{CGAL}}=\mathbf{X}^{\operatorname{CGAL}}/(1+\gamma), and the relative optimality gap is obtained by using 𝐗~CGAL\tilde{\mathbf{X}}^{\operatorname{CGAL}}.

For each value of pp, we ran CGAL\operatorname{CGAL}, LCG\operatorname{\texttt{LCG}} and CG\operatorname{\texttt{CG}} on 30 realizations of problems of size n=100n=100 and m=100m=100. All datasets were run for both CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} with parameters η0=2Ωg\eta_{0}=2\Omega_{g} and σ{0.25,0.9}\sigma\in\{0.25,0.9\}. and all methods were stopped when either after 10001000 seconds of running time. For every p{0,1,2}p\in\{0,1,2\}, realization R{1,2,,30}R\in\{1,2,\ldots,30\} of the data and algorithm MM, we record the objective value giM,p,Rg^{M,p,R}_{i} (corrected as explained above for CGAL\operatorname{CGAL}) obtained after the iteration, the corresponding relative optimality gap riM,p,Rr^{M,p,R}_{i}, the time since the start of the run tiM,p,Rt^{M,p,R}_{i} and, for CGAL\operatorname{CGAL}, the relative feasibility gap γiCGAL,p,R\gamma^{\operatorname{CGAL},p,R}_{i}.

Table 6: Computational results for randomly scaled SDP datasets. For each time point, value of the parameter pp, and algorithm (CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} with σ{0.25,0.9}\sigma\in\{0.25,0.9\}, and CGAL\operatorname{CGAL}), the table provides the objective function value (Obj. Value), the relative optimality gap (Opt. Gap), the iteration count and, only for CGAL\operatorname{CGAL}, the relative feasibility gap (Feas. Gap) of the solution with the best objective value obtained up to that time. For each value of pp, the metrics displayed in the table are aggregated across 30 datasets, and the best relative optimality gap achieved is marked in bold.
Data Time CG\operatorname{\texttt{CG}} σ=0.25\sigma=0.25 CG\operatorname{\texttt{CG}} σ=0.9\sigma=0.9 LCG\operatorname{\texttt{LCG}} σ=0.25\sigma=0.25 LCG\operatorname{\texttt{LCG}} σ=0.9\sigma=0.9 CGAL\operatorname{CGAL}
pp (sec) Obj. Opt. Iter Obj. Opt. Iter Obj. Opt. Iter Obj. Opt. Iter Obj. Opt. Feas. Iter
Value Gap % Value Gap % Value Gap % Value Gap % Value Gap % Gap %
0 100 -186.45 0.671 2067 -186.78 0.4978 1467 -186.79 0.4908 1543 -186.99 0.3875 1305 -187.67 0.0059 0.0183 13349
500 -187.19 0.2788 6665 -187.34 0.198 6120 -187.28 0.2272 6095 -187.4 0.1666 6201 -187.7 0.0009 0.0049 50070
1000 -187.33 0.2027 13071 -187.46 0.1308 11662 -187.4 0.1653 11918 -187.5 0.1139 12133 -187.7 0.0004 0.0041 88240
1 100 -143.26 1.3039 1131 -143.56 1.0994 1020 -143.81 0.9277 1099 -144.01 0.786 783 -141.99 1.7272 2.1609 13266
500 -144.38 0.5311 3976 -144.53 0.423 3841 -144.58 0.3932 3626 -144.66 0.3374 3427 -144.26 0.2718 0.6039 50306
1000 -144.62 0.3621 7458 -144.73 0.2838 7314 -144.72 0.2955 7014 -144.8 0.235 6790 -144.83 0.1209 0.213 87925
2 100 -107.75 1.5629 767 -108.18 1.1652 686 -108.31 1.0371 612 -108.64 0.7336 569 -32.44 70.2786 265.9191 14138
500 -108.82 0.5716 2868 -109.02 0.3879 2724 -108.98 0.4199 2727 -109.14 0.276 2603 -65.05 40.3478 74.8299 50504
1000 -108.98 0.4222 5519 -109.15 0.2695 5283 -108.64 0.3322 5374 -109.22 0.2037 5191 -82.82 24.0898 35.2035 90162

Table 6 displays the performance of CGAL\operatorname{CGAL}, CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} after time points t{100,500,1000}t\in\{100,500,1000\} seconds for different values of the parameter pp, in order to study algorithmic behaviour under different time constraints. Specifically, for each method MM, parameter value pp, realization RR, and time tt, we compute (j)M,p,R=argminj{rjM,p,R:tjM,p,Rt}{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}(j^{*})^{M,p,R}}=\operatorname*{argmin}_{j}\{r^{M,p,R}_{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}j}:t_{j}^{M,p,R}\leq t\} and kM,p,R=max{j:tjM,p,Rt}k^{M,p,R}=\max\{j:t_{j}^{M,p,R}\leq t\}. The table displays the average objective function value (Obj. Value) given by 130Rg(j)M,p,RM,p,R\frac{1}{30}\sum_{R}g^{M,p,R}_{(j^{*})^{M,p,R}}, the average relative optimality gap (Opt. Gap) 130Rr(j)M,p,RM,p,R\frac{1}{30}\sum_{R}r^{M,p,R}_{(j^{*})^{M,p,R}}, as well as the number of iterations (Iter) until that time given by 130RkM,p,R\frac{1}{30}\sum_{R}k^{M,p,R} and, for CGAL\operatorname{CGAL}, the average feasibility gap (Feas. Gap) 130RγM,p,R\frac{1}{30}\sum_{R}\gamma^{M,p,R}. Table 6 shows that CGAL\operatorname{CGAL} benefits from inexpensive iterations, allowing it to close the optimality gap more quickly than CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} in favorable settings (p=0p=0). However, as pp increases to 2, although CGAL\operatorname{CGAL} still performs more iterations than our algorithms, it fails to close the infeasibility gap even after 1000 seconds. This leads to a significantly larger optimality gap, as our calculations penalize infeasible solutions. In contrast, CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} demonstrate robust performance, with little variation across different values of pp. In particular LCG\operatorname{\texttt{LCG}} reaches an average optimality gap smaller than 1%1\% after 100 seconds for all values of pp, while the average optimality gap of CG\operatorname{\texttt{CG}} increases by a factor of 2-2.5 when moving from p=0p=0 to p=2p=2, but still maintaining less than 1%1\% average optimality gap after 500 seconds for all values of pp.

The performance of the average relative optimality gap with respect to iteration and time for each value of pp is visually depicted in Figures 10 and 11, respectively. In these figures, it can clearly be observed that the performance of CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} is robust to the value of pp, while CGAL\operatorname{CGAL}’s performance deteriorates significantly as deteriorates as pp gets larger. In order to create Figure 10 we first compute, for each iteration ii, algorithm MM, parameter value p{0,1,2}p\in\{0,1,2\} and realization R{1,,30}R\in\{1,\dots,30\}, the best relative optimality gap r^iM,p,R=minji{rjM,p,R}\hat{r}^{M,p,R}_{i}=\min_{j\leq i}\{r^{M,p,R}_{j}\} up to iteration ii. We then plot its average 130Rr^iM,p,R\frac{1}{30}\sum_{R}\hat{r}^{M,p,R}_{i} across all realizations, for each iteration ii. Instead, Figure 11 is obtained by first computing, for each time tt, algorithm MM, parameter value pp and realization RR, the best relative optimality gap r(j)M,p,RM,p,Rr^{M,p,R}_{(j^{*})^{M,p,R}} obtained up to time tt (as defined above), and then by averaging these values across all realizations. These figures elucidate that while CGAL\operatorname{CGAL} performance deteriorates as pp increases, the performance of CG\operatorname{\texttt{CG}} and LCG\operatorname{\texttt{LCG}} is robust to the scaling of the problem, .

Refer to caption
Figure 10: Randomly scaled SDP problem: relative optimality gap vs. iterations for several parameter choices (averaged over 30 instances for each pp)
Refer to caption
Figure 11: Randomly scaled SDP problem: relative optimality gap vs. time for several parameter choices (averaged over 30 instances for each pp)